Chinese Pre-Trained Language Models (CPM-LM) Version-I

Overview

CPM-Generate

为了促进中文自然语言处理研究的发展,本项目提供了 CPM-LM (2.6B) 模型的文本生成代码,可用于文本生成的本地测试,并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告]

若您想使用CPM-1进行推理,我们建议使用高效推理工具BMInf,支持1060以上显卡单卡推理。

安装

首先安装pytorch等基础依赖,再安装APEX以支持fp16:

pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

考虑apex的安装容易发生问题,我们构建了对应的Docker容器,可以进行快速环境搭建。安装方式如下:

docker pull dmye/cpm:v0

参考运行指令如下:

:/CPM --name=cpm cpm:v0 ">
sudo docker run --gpus '"device=0,1"' -it -v 
   
    :/CPM  --name=cpm  cpm:v0

   

其中 为代码所在目录,-v进行文件目录挂载

注:感谢qhduan同学提供了基于TensorFlow的使用代码,用作Pytorch之外的备选。

模型

模型下载后文件夹的目录结构需设置如下:

.
├── 80000
│   ├── mp_rank_00_model_states.pt
│   └── mp_rank_01_model_states.pt
└── latest_checkpointed_iteration.txt

为保证下载文件的正确性,文件的checksum如下:

SHA1
71d6b6ad4f47b46724eb82c05da8fb9175e62a7d  80000/mp_rank_00_model_states.pt
42aa247a262e2011fa5e276f1a8389fad6d80edc  80000/mp_rank_01_model_states.pt
MD5
f3f6d2f7d84c6a45290a31dabf79ddac  80000/mp_rank_00_model_states.pt
b0e960be4b5226e759ae6fc5246f9160  80000/mp_rank_01_model_states.pt

使用

提供了命令行交互式生成:

bash scripts/generate_text.sh /path/to/CPM

如不使用交互式输入,可增加第二个参数,告知输入文本的位置

bash scripts/generate_text.sh /path/to/CPM example.txt

运行该脚本需要两块GPU,每张卡的GPU内存占用约为7GB。该项目主要基于 Megatron-LM 进行修改。模型的主体架构与GPT-2一致。

默认的模型并行参数为2,如果需要修改,可以使用change_mp.py,并调整generate_text.sh中的MPSIZEchange_mp.py的使用示例如下:

python change_mp.py /path/to/CPM MPSIZE

这里的/path/to/CPM为模型路径,MPSIZE为一个整数,可以为1或者2的倍数,结果会生成一个新的模型,存储路径为/path/to/CPM_MPSIZE

Tokenization

Tokenization实现主要在data_util/tokenization_gpt2.py,先对于文本进行分词,再使用 SentencePiece 得到 BPE 的结果。由于 SentencePiece 不能有效编码空格和换行符,在 BPE 之前,我们将文本中的空格和换行符替换为\u2582\u2583。生成文本的时候也会对应的把生成的\u2582\u2583替换回空格和换行符。

对应问题已解决。

分类任务零次学习(Zero-shot Learning)

提供了三个任务的零次学习任务脚本以供参考,包括OCNLI、TNEWS和IFLYTEK,数据下载链接。脚本使用方法如下:

# OCNLI
bash scripts/zero-shot-ocnli.sh /path/to/CPM /path/to/dataset
# TNEWS
bash scripts/zero-shot-tnews.sh /path/to/CPM /path/to/dataset
# IFLYTEK
bash scripts/zero-shot-iflytek.sh /path/to/CPM /path/to/dataset

TODO

  • 实验环境的docker镜像
  • 提供各个任务具体的使用模板
  • 公开技术报告
  • 模型并行数可动态调整
  • Fine-tune代码
  • 开源实验中使用的小规模模型参数

引用

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu, and Zhou, Hao, and Ke, Pei, and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}
Owner
Tsinghua AI
Tsinghua AI
Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense.

PythonTextObfuscator Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense. Requi

2 Aug 29, 2022
aMLP Transformer Model for Japanese

aMLP-japanese Japanese aMLP Pretrained Model aMLPとは、Liu, Daiらが提案する、Transformerモデルです。 ざっくりというと、BERTの代わりに使えて、より性能の良いモデルです。 詳しい解説は、こちらの記事などを参考にしてください。 この

tanreinama 13 Aug 11, 2022
뉴스 도메인 질의응답 시스템 (21-1학기 졸업 프로젝트)

뉴스 도메인 질의응답 시스템 본 프로젝트는 뉴스기사에 대한 질의응답 서비스 를 제공하기 위해서 진행한 프로젝트입니다. 약 3개월간 ( 21. 03 ~ 21. 05 ) 진행하였으며 Transformer 아키텍쳐 기반의 Encoder를 사용하여 한국어 질의응답 데이터셋으로

TaegyeongEo 4 Jul 08, 2022
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
🌸 fastText + Bloom embeddings for compact, full-coverage vectors with spaCy

floret: fastText + Bloom embeddings for compact, full-coverage vectors with spaCy floret is an extended version of fastText that can produce word repr

Explosion 222 Dec 16, 2022
Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

NAVER AI 47 Dec 20, 2022
Kurumi ChatBot

KurumiChatBot Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @TokisakiChatB

Yoga Pranata 3 Jun 28, 2022
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
Problem: Given a nepali news find the category of the news

Classification of category of nepali news catorgory using different algorithms Problem: Multiclass Classification Approaches: TFIDF for vectorization

pudasainishushant 2 Jan 09, 2022
Ask for weather information like a human

weather-nlp About Ask for weather information like a human. Goals Understand typical questions like: Hourly temperatures in Potsdam on 2020-09-15. Rai

5 Oct 29, 2022
Sequence-to-Sequence learning using PyTorch

Seq2Seq in PyTorch This is a complete suite for training sequence-to-sequence models in PyTorch. It consists of several models and code to both train

Elad Hoffer 514 Nov 17, 2022
ZUNIT - Toward Zero-Shot Unsupervised Image-to-Image Translation

ZUNIT Dependencies you can install all the dependencies by pip install -r requirements.txt Datasets Download CUB dataset. Unzip the birds.zip at ./da

Chen Yuanqi 9 Jun 24, 2022
Community and sentiment analysis based on tweets

The project has set itself the goal of analyzing the thoughts and interaction of Italian users through the social posts expressed through the Twitter platform on the day of the entry into force of th

3 Nov 17, 2022
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization This repo is for our paper "Enhanced Seq2Seq Autoencode

Rachel Zheng 14 Nov 01, 2022
This is the code for the EMNLP 2021 paper AEDA: An Easier Data Augmentation Technique for Text Classification

The baseline code is for EDA: Easy Data Augmentation techniques for boosting performance on text classification tasks

Akbar Karimi 81 Dec 09, 2022
CoNLL-English NER Task (NER in English)

CoNLL-English NER Task en | ch Motivation Course Project review the pytorch framework and sequence-labeling task practice using the transformers of Hu

Kevin 2 Jan 14, 2022
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022
COVID-19 Related NLP Papers

COVID-19 outbreak has become a global pandemic. NLP researchers are fighting the epidemic in their own way.

xcfeng 28 Oct 30, 2022
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022