
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (paper summary)

2022-04-23 08:22:00 A grain of sand in the vast sea of people

Paper: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Code: Transformer-XL code

1. Brief introduction of the paper

Transformer-XL = Transformer Extra Long

2. What is Transformer-XL?

XLNet uses two techniques from Transformer-XL for optimization: the Segment Recurrence Mechanism (segment-level recurrence) and Relative Positional Encoding.

Segment Recurrence Mechanism: the segment recurrence mechanism caches the hidden states output for the previous segment and reuses them when computing the current segment, so the model has access to broader context information.
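A minimal sketch of the idea in PyTorch (not the paper's actual implementation; the function name is made up, and projections, multiple heads and the causal mask are all omitted): the cached states of the previous segment are concatenated in front of the current segment, keys and values are built from the concatenation, while queries come only from the current segment.

```python
import torch

def attend_with_memory(h_curr, mem):
    # h_curr: hidden states of the current segment        [cur_len, d]
    # mem:    cached hidden states of the previous segment [mem_len, d]
    # Keys/values come from [mem; h_curr]; queries only from h_curr,
    # so positions in the current segment can attend to the previous one.
    h_cat = torch.cat([mem, h_curr], dim=0)                  # [mem_len + cur_len, d]
    q, k, v = h_curr, h_cat, h_cat
    attn = torch.softmax(q @ k.t() / (q.size(-1) ** 0.5), dim=-1)
    return attn @ v                                          # [cur_len, d]
```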

After introducing information from the previous segment, two tokens may carry the same positional information; for example, the first word of the previous segment has the same absolute position as the first word of the current segment. Transformer-XL therefore adopts Relative Positional Encoding: instead of fixed absolute positions, it encodes the relative distances between words.
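A rough sketch of how such an encoding can be generated (the real model also mixes these embeddings into the attention score together with two learned bias vectors, usually written u and v; the function name here is made up): the embeddings form a sinusoidal table indexed by relative distance rather than by absolute position, so the table can be rebuilt for any context length.

```python
import torch

def relative_position_embeddings(klen, d_model):
    # Sinusoidal table indexed by relative distance (klen-1 ... 0), i.e. the
    # R_{i-j} term of Transformer-XL's relative attention. Because the index
    # is a distance, the table can be built for any klen at evaluation time.
    pos = torch.arange(klen - 1, -1, -1.0)                        # [klen]
    inv_freq = 1.0 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))
    sinusoid = pos[:, None] * inv_freq[None, :]                   # [klen, d_model/2]
    return torch.cat([sinusoid.sin(), sinusoid.cos()], dim=-1)    # [klen, d_model]
```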

3. Vanilla Transformer language models: brief introduction and disadvantages

3.1  Brief introduction

3.2 Shortcomings

3.2.1 Training with the Vanilla Model (the training phase of the vanilla model)

1. Tokens at the beginning of each segment do not have sufficient context for proper optimization.

2. Limited by a fixed-length context

3.2.2 Evaluation with the Vanilla Model

1. Longest context limited by segment length.

2. Very expensive due to recomputation: the evaluation window advances one position at a time and the whole segment is re-encoded for every prediction (see the sketch below).
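A minimal sketch of that evaluation procedure, assuming a hypothetical `model` that maps a token window to per-position predictions:

```python
def evaluate_vanilla(model, tokens, seg_len):
    # Vanilla evaluation: to give every prediction the longest possible
    # context, the window advances one token at a time and the full segment
    # is re-encoded from scratch, so each step pays for `seg_len` positions
    # even though only the last one produces a new prediction.
    preds = []
    for i in range(seg_len, len(tokens)):
        window = tokens[i - seg_len:i]
        preds.append(model(window)[-1])   # keep only the newest position
    return preds
```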

3.2.3 Temporal Incoherence

4. Transformer-XL: contributions and major improvements

4.1 Transformer-XL introduction

4.1.1 Training with Transformer-XL

4.1.2 Evaluation with Transformer-XL

4.1.3 Solution: Relative Positional Encodings

Benefits:

1. Allows recurrence mechanism

2. Better generalization

-> WordLM: train with memory length 150, evaluate with 640

-> CharLM: train with memory length 680, evaluate with 3800
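This train-short, evaluate-long behaviour is possible because positions are encoded by relative distance, so the positional table can simply be rebuilt for a longer attention span at evaluation time. A hypothetical usage of the `relative_position_embeddings` sketch from section 2 (the lengths below are just the memory lengths quoted above, and `d_model` is an assumed model size):

```python
# Same function, different lengths: the table only depends on relative
# distance, so a longer attention span at evaluation needs no new parameters.
d_model = 512                                               # assumed model size
train_table = relative_position_embeddings(150, d_model)    # WordLM training memory
eval_table = relative_position_embeddings(640, d_model)     # WordLM evaluation memory
```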

 

4.2 Segment-level Recurrence

Cache and reuse the hidden states computed for the previous segment.

Analogous to truncated BPTT for RNNs: the last hidden states are passed to the next segment as cached context, and no gradients flow back into previous segments (see the sketch below).
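A minimal sketch of the cache update, following the idea rather than the repository's exact code (`mems` and `hids` are assumed to hold one tensor per layer, and `mem_len` is the number of cached positions):

```python
import torch

def update_memory(mems, hids, mem_len):
    # Append the hidden states just computed for this segment, keep only the
    # most recent `mem_len` positions, and detach() so gradients never flow
    # back into previous segments, in the spirit of truncated BPTT.
    new_mems = []
    for mem, hid in zip(mems, hids):          # one entry per layer
        cat = torch.cat([mem, hid], dim=0)    # [mem_len + cur_len, d]
        new_mems.append(cat[-mem_len:].detach())
    return new_mems
```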

4.3 Keep temporal information coherent

5. Summary

Reference material

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context (bilibili video)
