NLPIR tutorial: pre-training for IR. Pre-train on a raw textual corpus, fine-tune on MS MARCO Document Ranking

Last update: Apr 07, 2022

pretrain4ir_tutorial

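This repository is an introductory tutorial on Pre-training for IR for members of the NLPIR lab.

The code consists of the following parts:

tasks/ : generates the pre-training data
pretrain/: pre-trains on the generated data (MLM + NSP)
finetune/: fine-tunes on MS MARCO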

Preinstallation

First, prepare a Python3 environment, and run the following commands:

  git clone git@github.com:zhengyima/pretrain4ir_tutorial.git pretrain4ir_tutorial
  cd pretrain4ir_tutorial
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

You also need to download a BERT checkpoint in Hugging Face transformers format and save it in a directory, referred to here as BERT_MODEL_PATH. In our paper, we use bert-base-uncased. You can download it from the official Hugging Face model zoo or from the Tsinghua mirror.

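Generating pre-training data

The repository provides the simplest possible pre-training task, rand. This task randomly picks 1 to 5 words from a document to serve as the query, as a demo of pre-training for IR.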
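
The idea behind rand can be sketched in a few lines of Python. This is only an illustration, not the code in tasks/rand: the function name sample_rand_query is hypothetical, whitespace tokenization is an assumption, and the real script may sample words differently (for example, contiguous spans or sampling with replacement).

```python
import random

def sample_rand_query(doc_tokens, min_len=1, max_len=5, rng=random):
    """Pick between min_len and max_len words at random from a document.

    The selected words act as a pseudo-query for that document.
    Hypothetical helper: sampling without replacement is an assumption.
    """
    if not doc_tokens:
        return []
    k = rng.randint(min_len, max_len)   # query length, inclusive bounds
    k = min(k, len(doc_tokens))         # short documents cap the length
    return rng.sample(doc_tokens, k)

# Toy document, split on whitespace (an assumption for this sketch).
doc = "pre-training for information retrieval starts from a raw textual corpus".split()
query = sample_rand_query(doc, rng=random.Random(0))
print(query)
```

Each (query, document) pair produced this way can then be written out as one pre-training example.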

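Command to generate data for the rand pre-training task: cd tasks/rand && bash gen.sh

You can write your own script, modeled on the rand task, to generate data for whatever pre-training task you consider reasonable.

Notes: Before running the shell script for the rand task, set the msmarco_docs_path parameter in gen.sh to the path of the MS MARCO document TSV file, and set the bert_model parameter to the directory of the downloaded BERT model.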

Model pre-training

The repository provides the code for model pre-training; see pretrain. This code pre-trains the model on two tasks, MLM and NSP.
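
For readers new to MLM, the masking step can be sketched as follows. This is a minimal illustration and not the code in pretrain: it assumes the masking scheme from the original BERT paper (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged), and the function name mask_tokens and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict, and None elsewhere. The 80/10/10 split is an assumption
    taken from the BERT paper, not read from this repository.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)               # model must recover this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)          # replace with the mask symbol
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # replace with a random token
            else:
                masked.append(tok)           # keep the original token
        else:
            labels.append(None)              # position is not predicted
            masked.append(tok)
    return masked, labels

tokens = "information retrieval ranks documents for a query".split()
vocab = sorted(set(tokens))                  # toy vocabulary for the sketch
masked, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(masked, labels)
```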

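Pre-training command: cd pretrain && bash train_bert.sh

Notes: Remember to update the corresponding parameters in train_bert.sh: set bert_model to the directory of the downloaded BERT model, and set train_file to the path of the pre-training data file generated in the previous step.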

Model fine-tuning

The repository provides the code for fine-tuning on the MS MARCO Document Ranking task; see finetune. This code implements point-wise fine-tuning on MS MARCO.
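
In point-wise training, each (query, document) pair is scored on its own and treated as a binary relevance classification. The sketch below shows one common form of that objective, binary cross-entropy on a single logit. It is an assumption that this matches the loss used in finetune (the repository may, for example, use a two-class softmax instead), and pointwise_loss is a hypothetical name.

```python
import math

def pointwise_loss(score, label):
    """Binary cross-entropy for one (query, document) pair.

    score: raw relevance logit produced by the model.
    label: 1 if the document is relevant to the query, else 0.
    """
    p = 1.0 / (1.0 + math.exp(-score))   # sigmoid turns the logit into a probability
    eps = 1e-12                          # guards against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# Toy batch of (logit, label) pairs; each pair contributes independently.
batch = [(2.0, 1), (-1.5, 0), (0.3, 0)]
mean_loss = sum(pointwise_loss(s, y) for s, y in batch) / len(batch)
print(round(mean_loss, 4))
```

Because each pair contributes independently to the loss, no comparison between documents of the same query is needed at training time; ranking happens only at inference, by sorting documents by score.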

Fine-tuning command: cd finetune && bash train_bert.sh

Leaderboard

Tasks	MRR@100 on dev set
PROP-MARCO	0.4201
PROP-WIKI	0.4188
BERT-Base	0.4184
rand	0.4123
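
For reference, here is a minimal sketch of how such a score could be computed, assuming the reported metric is MRR@100, the official measure for MS MARCO Document Ranking. The function name mrr_at_k and the dictionary-based input format are hypothetical; this is not the evaluation script used to produce the numbers above.

```python
def mrr_at_k(rankings, relevant, k=100):
    """Mean reciprocal rank with a cutoff of k.

    rankings: {query_id: [doc ids in ranked order]}
    relevant: {query_id: set of relevant doc ids}
    A query with no relevant document in its top k contributes 0.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank      # only the first relevant hit counts
                break
    return total / len(rankings) if rankings else 0.0

# Toy example: q1 finds its relevant document at rank 2, q2 finds none.
rankings = {"q1": ["d3", "d1"], "q2": ["d5", "d6"]}
relevant = {"q1": {"d1"}, "q2": {"d9"}}
score = mrr_at_k(rankings, relevant)
print(score)
```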

Homework

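Design a pre-training task that you consider reasonable, pre-train the BERT model with it, fine-tune on MS MARCO, and update the Leaderboard with your result on the dev set.

What you need to do:

Write your own pre-training data generation script and put it in the tasks/yourtask directory.
Use that script to generate your own pre-training data.
Run the pre-training and fine-tuning scripts provided in the repository, obtain the results, and update the Leaderboard.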

Owner

ZYMa
