
5-Minute NLP: Text-To-Text Transfer Transformer (T5), a unified text-to-text task model


This article will explain the following terms: T5, C4, and unified text-to-text tasks.

The effectiveness of transfer learning in NLP comes from pre-training models on abundant unlabeled text data with self-supervised tasks, such as language modeling or filling in missing words. After pre-training, the model can be fine-tuned on a smaller labeled dataset, which generally yields better performance than training on the labeled data alone. The effectiveness of transfer learning has been demonstrated by models such as GPT, BERT, XLNet, RoBERTa, ALBERT, and Reformer.

Text-To-Text Transfer Transformer (T5)

The paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (published in 2019) presents a large-scale empirical study of which transfer learning techniques work best, and applies these insights to create a new model called the Text-To-Text Transfer Transformer (T5).

An important ingredient of transfer learning is the unlabeled dataset used for pre-training, which should not only be of high quality and diverse, but also large. Previous pre-training datasets did not meet all three criteria, because:

  • Wikipedia text is high quality, but its style is uniform and it is relatively small for our purposes.
  • Text scraped from the Common Crawl web archive is huge and highly diverse, but its quality is relatively low.

The paper therefore develops a new dataset, the Colossal Clean Crawled Corpus (C4): a "cleaned" version of Common Crawl that is two orders of magnitude larger than Wikipedia.

T5 models pre-trained on C4 achieve state-of-the-art results on many NLP benchmarks, while remaining flexible enough to be fine-tuned on several downstream tasks.

A unified text-to-text format

With T5, all NLP tasks can be cast into a unified text-to-text format, where the input and output of every task are always text strings.

This framework provides a consistent training objective for both pre-training and fine-tuning: whatever the task, the model is trained with a maximum-likelihood objective. To specify which task the model should perform, a task-specific text prefix is added to the original input sequence before it is fed to the model.

The framework allows the same model, loss function, and hyperparameters to be used for any NLP task, for example machine translation, document summarization, question answering, and classification, as the sketch below illustrates.
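To make the unified format concrete, here is a minimal sketch using the Hugging Face transformers library (pointed to at the end of this article). The task prefixes and the t5-small checkpoint are the ones used by the publicly released T5 checkpoints; this is an illustrative example, not code from the paper.

```python
# pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is just "prefix + input text" -> "output text".
examples = [
    "translate English to German: The house is wonderful.",           # translation
    "summarize: state authorities dispatched emergency crews tuesday "
    "to survey the damage after an onslaught of severe weather ...",  # summarization
    "cola sentence: The course is jumping well.",                     # acceptability classification
]

for text in examples:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```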

Comparing different models and training strategies

The T5 paper compares a variety of model architectures, pre-training objectives, datasets, training strategies, and scales. The baseline for the comparison is a standard encoder-decoder Transformer.

  • Model architecture: although some work on transfer learning for NLP has considered architectural variants of the Transformer, the original encoder-decoder form worked best in the text-to-text experiments.
  • Pre-training objectives: most denoising objectives, which train the model to reconstruct randomly corrupted text, perform similarly in the T5 setting. The paper therefore suggests using objectives that make unsupervised pre-training more computationally efficient, such as the fill-in-the-blank (span-corruption) objective illustrated in the sketch after this list.
  • Unlabeled datasets: training on in-domain data can be beneficial, but pre-training on small datasets can lead to harmful overfitting, especially when the dataset is small enough to be repeated many times during pre-training. This motivates using a large and diverse dataset such as C4 for general-purpose language understanding.
  • Training strategies: fine-tuning after pre-training on a mixture of tasks yields performance comparable to unsupervised pre-training followed by fine-tuning.
  • Scaling: various strategies for using additional compute are compared, including training on more data, training larger models, and using ensembles of models. Each approach improves performance, although training a smaller model on more data is often outperformed by training a larger model for fewer steps.
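As a rough illustration of the fill-in-the-blank (span-corruption) objective mentioned above, the toy sketch below replaces chosen spans of a sentence with sentinel tokens and builds the corresponding target, in the spirit of the example figure in the T5 paper. The helper function and the chosen spans are purely illustrative, not the paper's actual preprocessing code.

```python
def span_corrupt(tokens, spans):
    """Toy version of T5's span-corruption pre-training objective:
    the given (start, length) spans are replaced in the input by sentinel
    tokens, and the target lists the dropped spans after their sentinels."""
    inputs, targets, sentinel, i = [], [], 0, 0
    spans = sorted(spans)
    while i < len(tokens):
        if spans and i == spans[0][0]:
            _, length = spans.pop(0)
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + length])
            sentinel += 1
            i += length
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)

# Example sentence with two dropped spans: "for inviting" and "last"
tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, spans=[(2, 2), (8, 1)])
print("input: ", inp)   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print("target:", tgt)   # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```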

The results show that the text-to-text approach can be successfully applied to generation tasks (for example, abstractive summarization), classification tasks (for example, natural language inference), and even regression tasks, with performance comparable to task-specific architectures and the state of the art.
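Regression is handled by generating the numeric score as a string and parsing it back into a number. A minimal sketch, assuming the stsb prefix format used by the public Hugging Face T5 checkpoints (similarity scores between 0.0 and 5.0, rounded to increments of 0.2 when the original checkpoints were trained):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Regression cast as text: the model generates the score as a string.
text = ("stsb sentence1: A man is playing a guitar. "
        "sentence2: A person plays an instrument.")
input_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=5)
prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
score = float(prediction)  # e.g. "3.2" -> 3.2
print(score)
```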

The final T5 models

Combining these experimental insights, the authors trained models of different sizes (up to 11 billion parameters) and achieved state-of-the-art results on many benchmarks. These models were pre-trained on the C4 dataset and then further pre-trained on a multi-task mixture before being fine-tuned on individual tasks.

The largest model achieved state-of-the-art results on the GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks.

Summary

This article introduced the Text-To-Text Transfer Transformer (T5) model and the Colossal Clean Crawled Corpus (C4) dataset. It also showed how different tasks are cast into the unified text-to-text format, and summarized the qualitative experimental findings on how different model architectures and training strategies affect performance.

If you are interested in this topic, you can try the following on your own:

  • Learn about subsequent improvements to the T5 model, such as T5v1.1 (an improved version of T5 with some architectural tweaks), mT5 (a multilingual T5 model), and ByT5 (a T5 model pre-trained on byte sequences rather than token sequences).
  • Look into the Hugging Face implementation of T5 and fine-tune it yourself; a minimal fine-tuning sketch follows below.
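As a starting point for the fine-tuning suggestion above, here is a minimal sketch of a single training step with the Hugging Face implementation. The example pair, optimizer, and learning rate are placeholders, not the settings from the paper.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder hyperparameters

# One text-to-text training pair; in practice, loop over a dataset of such pairs.
source = "translate English to German: The book is on the table."
target = "Das Buch liegt auf dem Tisch."

batch = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

outputs = model(input_ids=batch.input_ids,
                attention_mask=batch.attention_mask,
                labels=labels)          # cross-entropy loss on the target tokens
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```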

https://www.overfit.cn/post/a0e9aaeaabf04087a278aea6f06d14d6

Author: Fabio Chiusano
