# Introduction

This blog gives a simple explanation of the main ideas behind BERT and its variants (focusing mainly on the variants), and of how these models improve on BERT's training process.

Recent improvements fall into two groups. The first group uses BERT as (part of) the encoder of another neural network: the other network either takes the BERT output as its input, or concatenates the BERT output with its own output and feeds the result to later layers to make the final prediction. These approaches usually start from a pre-trained BERT to get some improvement.

The second group focuses on the foundation of BERT itself. These approaches add or change pre-training objectives so that BERT captures more information from the corpus.

This blog summarizes some BERT variants that I recently learned about. For each one, I explain the method of the original paper and mention the improvement when I think it is significant.

# BERT as Encoder

### 1. Semantic-aware BERT for Language Understanding

This paper is presented by Zhuosheng Zhang et al. They argue that although BERT learns some semantic knowledge during pre-training, this is still not enough for BERT to produce a sentence representation with full semantic information. For example, in a machine reading task, given the question "How many people does the Greater Los Angeles Area have?", the baseline model answers "17.5 million" while the true answer is "over 17.5 million". So the authors encode an extra piece of semantic information and combine it with the BERT output to achieve higher performance. The architecture of the model is:

The left side of the model is a pre-trained BERT encoder. The authors add a Semantic Role Labeler (SRL) next to BERT. This labeler reads the sentence and outputs $m$ groups of semantic role labels, one label per token in each group. Each group goes through a GRU encoder. The authors concatenate the $m$ outputs and pass them through a fully connected layer to obtain the semantic information. This information is concatenated with the BERT output and passed forward to the task-specific objective.
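
The data flow above can be written compactly. This is only a sketch in my own notation, not the paper's exact formulation: $\mathbf{r}_j$ is the label sequence of the $j$-th group, $\mathrm{Embed}$ is a label embedding lookup, $\mathbf{W}$ and $\mathbf{b}$ are the parameters of the fully connected layer, and $\mathbf{h}^{\mathrm{BERT}}$ is the BERT output.

$$
\mathbf{e}_j = \mathrm{GRU}\big(\mathrm{Embed}(\mathbf{r}_j)\big), \quad j = 1, \dots, m
$$

$$
\mathbf{e} = \mathbf{W}\,[\mathbf{e}_1; \dots; \mathbf{e}_m] + \mathbf{b}, \qquad \mathbf{h} = [\mathbf{h}^{\mathrm{BERT}}; \mathbf{e}]
$$

Here $[\cdot\,;\cdot]$ denotes concatenation, and $\mathbf{h}$ is what the task-specific layers receive.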

This model performs better on tasks that need semantic information. However, the model is very large and the improvement is limited.

### 2. Multi-Task Deep Neural Networks for Natural Language Understanding

This paper is presented by Xiaodong Liu et al., and as of May 2020 it is the state-of-the-art model on the GLUE leaderboard. This model uses BERT as a shared context embedding encoder among several task-specific layers. The architecture of the model is:

The authors fine-tune a single BERT across 4 different NLP tasks. The objective is to approximately optimize the sum of all task objectives, so that BERT acquires more general knowledge of the language.
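
One common way to "approximately optimize the sum" is to mix the mini-batches of all tasks and update on one task's loss per step. The sketch below illustrates only that scheduling idea in plain Python. The names `build_epoch_schedule`, `train_one_epoch`, and `step_fn` are hypothetical, and the actual model update is left to the caller.

```python
import random


def build_epoch_schedule(task_batches, seed=0):
    """Merge the mini-batches of all tasks and shuffle them.

    task_batches: dict mapping a task name to its list of mini-batches.
    Returns a list of (task_name, batch) pairs in random order.
    """
    schedule = [(task, batch)
                for task, batches in task_batches.items()
                for batch in batches]
    random.Random(seed).shuffle(schedule)
    return schedule


def train_one_epoch(task_batches, step_fn, seed=0):
    """Call step_fn(task, batch) once per mini-batch.

    step_fn is assumed to update the shared encoder plus the head of
    `task` and to return that step's loss. Returns per-task loss sums.
    """
    totals = {task: 0.0 for task in task_batches}
    for task, batch in build_epoch_schedule(task_batches, seed):
        # Each step uses the loss of ONE task; over the whole epoch the
        # shared encoder receives updates from every task objective.
        totals[task] += step_fn(task, batch)
    return totals
```

Because every step touches the shared encoder, the encoder is pushed toward representations that work for all tasks, while each task-specific head only sees its own batches.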

# Modifying Pre-training Objective

### 1. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding

This paper is presented by Wei Wang et al. They address the same issue as the authors of Semantic-aware BERT: they argue that BERT does not make the most of the underlying language structures. So the authors propose 2 new pre-training objectives and achieve new state-of-the-art performance on a variety of downstream tasks.

The first new objective is the Word Structural Objective. Besides masking words as in the original BERT pre-training process, the authors also randomly shuffle K unmasked words and let the model predict the right word order. Given the randomness of token shuffling, the word objective is equivalent to maximizing the likelihood of placing every shuffled token in its correct position. More formally, this objective can be formulated as:
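
As far as I can reconstruct it from the paper, with $t_1, \dots, t_K$ the shuffled tokens, $\mathrm{pos}_k$ the $k$-th position of the shuffled subsequence, and $\theta$ the model parameters, the objective is:

$$
\arg\max_{\theta} \sum \log P(\mathrm{pos}_1 = t_1, \mathrm{pos}_2 = t_2, \ldots, \mathrm{pos}_K = t_K \mid t_1, t_2, \ldots, t_K, \theta)
$$

The sum runs over the shuffled subsequences in the training data.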

The authors shuffle spans of 3 consecutive words (K = 3) and select 5% of the trigrams for random shuffling.
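
A minimal sketch of the trigram shuffling step, in plain Python. The function name `shuffle_trigrams` and the `maskable` flag list are my own assumptions for illustration; the sketch also picks windows by scanning left to right, which is a simplification rather than the paper's exact sampling procedure.

```python
import random


def shuffle_trigrams(tokens, maskable, shuffle_prob=0.05, k=3, seed=0):
    """Shuffle some windows of k consecutive tokens.

    tokens:   list of token strings
    maskable: list of bools, False for positions that must stay untouched
              (for example tokens already replaced by [MASK])
    Returns (shuffled_tokens, targets), where targets maps each shuffled
    position to the token that originally stood there.
    """
    rng = random.Random(seed)
    out = list(tokens)
    targets = {}
    i = 0
    while i + k <= len(tokens):
        window_free = all(maskable[i:i + k])
        if window_free and rng.random() < shuffle_prob:
            window = out[i:i + k]
            rng.shuffle(window)  # may occasionally leave the order unchanged
            out[i:i + k] = window
            for j in range(i, i + k):
                targets[j] = tokens[j]
            i += k  # shuffled windows never overlap
        else:
            i += 1
    return out, targets
```

The model would then be trained to predict `targets[j]` at every shuffled position `j`, which is the "put each token back in its correct position" objective described above.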

The second objective is the Sentence Structural Objective. The authors extend the next sentence prediction task to predict both the next sentence and the previous sentence, which makes the pre-trained language model aware of the sequential order of sentences in a bidirectional manner.
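
Here is a small sketch of how training pairs for this objective could be built. As I recall, the paper samples the second sentence as the next sentence, the previous sentence, or a random sentence from another document with equal probability; treat that three-way split, the fallback at document boundaries, and the name `make_sentence_pair` as assumptions of this sketch.

```python
import random


def make_sentence_pair(doc, index, other_docs, rng):
    """Build one (first, second, label) example around doc[index].

    doc:        list of sentences from one document
    other_docs: list of other documents (each a list of sentences)
    rng:        a random.Random instance
    """
    choice = rng.choice(["next", "previous", "random"])
    if choice == "next" and index + 1 < len(doc):
        return doc[index], doc[index + 1], "next"
    if choice == "previous" and index > 0:
        return doc[index], doc[index - 1], "previous"
    # Fallback (an assumption of this sketch): when the wanted neighbour
    # does not exist, pair with a random sentence from another document.
    other = rng.choice(other_docs)
    return doc[index], rng.choice(other), "random"
```

The classifier on top of the encoder then predicts the label from the sentence pair, so it has to learn sentence order in both directions.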

With these two new objectives, BERT captures more structural information about the language and achieves new state-of-the-art performance on many tasks.

### 2. SpanBERT

This paper is presented by Mandar Joshi et al. The authors argue that the original BERT is limited on tasks that require predicting a span of words, such as machine reading comprehension (MRC). So they propose a new pre-training method with a new objective: the span boundary objective. The authors first sample a span length from a geometric distribution, then randomly select the starting point of the span to be masked.

According to the paper, they mask 15% of the words in the documents as in BERT (replacing 80% of the masked tokens with [MASK], 10% with random tokens, and 10% with the original tokens). However, they perform this replacement at the span level and not for each token individually; that is, all the tokens in a span are replaced with [MASK] or with sampled tokens.
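
The span masking procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: the geometric parameter `p=0.2` and the clip at 10 tokens are the values I remember from the paper, and the names `sample_span_length` and `span_mask` are hypothetical. The sketch also ignores the paper's handling of whole words versus subword tokens.

```python
import random

MASK = "[MASK]"


def sample_span_length(rng, p=0.2, max_len=10):
    """Sample a span length from a geometric distribution, clipped at max_len."""
    length = 1
    while rng.random() > p and length < max_len:
        length += 1
    return length


def span_mask(tokens, vocab, mask_ratio=0.15, seed=0):
    """Mask about mask_ratio of the tokens in contiguous spans.

    Returns (corrupted_tokens, sorted list of selected positions).
    """
    rng = random.Random(seed)
    out = list(tokens)
    budget = max(1, round(len(tokens) * mask_ratio))
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 100 * len(tokens):
        attempts += 1
        # Never exceed the remaining masking budget.
        length = min(sample_span_length(rng), budget - len(masked))
        start = rng.randrange(0, len(tokens) - length + 1)
        span = range(start, start + length)
        if any(i in masked for i in span):
            continue  # overlaps an earlier span, try again
        # The 80/10/10 decision is made ONCE per span, not per token.
        r = rng.random()
        for i in span:
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # otherwise keep the original token
            masked.add(i)
    return out, sorted(masked)
```

Positions in the returned list are prediction targets even when the token was kept or replaced by a random one, exactly as in BERT's masking scheme.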

Then they propose a Span Boundary Objective. Let $\mathbf{x}$ be the output of BERT, $(s, e)$ the start and end positions of the span, and $\mathbf{p}$ the position embedding. The representation $\mathbf{y_i}$ of the i-th masked token is:
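
As I understand the paper, the representation only uses the encoder outputs at the two tokens just outside the span, $\mathbf{x}_{s-1}$ and $\mathbf{x}_{e+1}$, together with the embedding of the target's position relative to the span start:

$$
\mathbf{y_i} = f(\mathbf{x}_{s-1}, \mathbf{x}_{e+1}, \mathbf{p}_{i-s+1})
$$

Here $f$ is a small feed-forward network (in the paper, two layers with GeLU activations and layer normalization).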

They use $\mathbf{y_i}$ to predict the masked token and use cross-entropy as the loss. So the overall loss function is the sum of the Span Boundary Objective loss and the regular masked language model loss:
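
Written out for a single masked token $x_i$, where $\mathbf{x}_i$ is the encoder output at position $i$ (this follows the paper's formulation as far as I can reconstruct it):

$$
\mathcal{L}(x_i) = \mathcal{L}_{\mathrm{MLM}}(x_i) + \mathcal{L}_{\mathrm{SBO}}(x_i) = -\log P(x_i \mid \mathbf{x}_i) - \log P(x_i \mid \mathbf{y_i})
$$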

They also omit the Next Sentence Prediction objective used in the original BERT, since they found that it hurts performance; they conjecture that it impedes the model from learning from longer full-length contexts and adds noise to the Masked Language Model objective.

With this pre-training method, the model shows a clear improvement on extractive QA tasks, and other tasks also benefit a little.

### 3. RoBERTa

This paper is presented by Yinhan Liu et al. They replicate the whole BERT training process to find out which factors influence the final result of pre-training. They then propose a model called RoBERTa that combines all the positive changes they found.

Their main findings are:

• Dynamic masking, which generates a different masking pattern every time a sequence is fed to the model, achieves slightly better performance.
• Removing the Next Sentence Prediction objective and instead using longer sequences extracted from a single document (without crossing document boundaries) gives better performance.
• Training with a large batch size gives better performance.
• Although replacing BERT's character-level BPE vocabulary with a byte-level Byte-Pair Encoding does not improve performance, and even hurts it a little, they believe the advantages of a universal encoding scheme outweigh the minor degradation in performance.
• The amount of data and the number of training passes through the data are two important factors. The authors report that with more data and longer training, RoBERTa performs better than BERT even with the same objective.
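
To make the first finding concrete, here is a minimal sketch contrasting static and dynamic masking. It is a simplification under my own assumptions: it only replaces tokens with [MASK] (no 80/10/10 rule), and the names `mask_tokens`, `static_epochs`, and `dynamic_epochs` are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_ratio=0.15):
    """Replace about mask_ratio of the tokens with [MASK]."""
    if not tokens:
        return []
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def static_epochs(tokens, n_epochs, seed=0):
    """Mask once during preprocessing and reuse the same pattern every epoch."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [masked for _ in range(n_epochs)]


def dynamic_epochs(tokens, n_epochs, seed=0):
    """Draw a fresh masking pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(n_epochs)]
```

With static masking the model sees the same corrupted sequence in every epoch; with dynamic masking it sees a different one each time, which matters more when training runs for many passes over the data.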

Future work: