A Multi-modal Model Chinese Spell Checker Released on ACL2021.

Overview

ReaLiSe

ReaLiSe is a multi-modal Chinese spell checking model.

This the office code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.

The paper has been accepted in ACL Findings 2021.

Environment

  • Python: 3.6
  • Cuda: 10.0
  • Packages: pip install -r requirements.txt

Data

Raw Data

SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation

Data Processing

The code and cleaned data are in the data_process directory.

You can also directly download the processed data from this and put them in the data directory. The data directory would look like this:

data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv

Pretrain

  • BERT: chinese-roberta-wwm-ext

    Huggingface hfl/chinese-roberta-wwm-ext: https://huggingface.co/hfl/chinese-roberta-wwm-ext
    Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/

  • Phonetic Encoder: pretrain_pho.sh

  • Graphic Encoder: pretrain_res.sh

  • Merge: merge.py

You can also directly download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder from this, and put them in the pretrained directory:

pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json

Train

After preparing the data and pretrained model, you can train ReaLiSe by executing the train.sh script. Note that you should set up the PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in it.

sh train.sh

Test

Test ReaLiSe using the test.sh script. You should set up the DATE_DIR, CKPT_DIR, and OUTPUT_DIR in it. CKPT_DIR is the OUTPUT_DIR of the training process.

sh test.sh

Well-trained Model

You can also download well-trained model from this direct using. The performance scores of RealiSe and some baseline models on the SIGHAN13, SIGHAN14, SIGHAN15 test set are here:

Methods

Metrics

  • "D" means "Detection Level", "C" means "Correction Level".
  • "A", "P", "R", "F" means "Accuracy", "Precision", "Recall", and "F1" respectively.

SIGHAN15

Method D-A D-P D-R D-F C-A C-P C-R C-F
FASpell 74.2 67.6 60.0 63.5 73.7 66.6 59.1 62.6
Soft-Masked BERT 80.9 73.7 73.2 73.5 77.4 66.7 66.2 66.4
SpellGCN - 74.8 80.7 77.7 - 72.1 77.7 75.9
BERT 82.4 74.2 78.0 76.1 81.0 71.6 75.3 73.4
ReaLiSe 84.7 77.3 81.3 79.3 84.0 75.9 79.9 77.8

SIGHAN14

Method D-A D-P D-R D-F C-A C-P C-R C-F
Pointer Network - 63.2 82.5 71.6 - 79.3 68.9 73.7
SpellGCN - 65.1 69.5 67.2 - 63.1 67.2 65.3
BERT 75.7 64.5 68.6 66.5 74.6 62.4 66.3 64.3
ReaLiSe 78.4 67.8 71.5 69.6 77.7 66.3 70.0 68.1

SIGHAN13

Method D-A D-P D-R D-F C-A C-P C-R C-F
FASpell 63.1 76.2 63.2 69.1 60.5 73.1 60.5 66.2
SpellGCN 78.8 85.7 78.8 82.1 77.8 84.6 77.8 81.0
BERT 77.0 85.0 77.0 80.8 77.4 83.0 75.2 78.9
ReaLiSe 82.7 88.6 82.5 85.4 81.4 87.2 81.2 84.1

Citation

@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Owner
DaDa
A student majoring in Computer Science in BIT.
DaDa
Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 Billion Parameters) on a single 16 GB VRAM V100 Google Cloud instance with Huggingfa

289 Jan 06, 2023
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

PyTorch Large-Scale Language Model A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset Latest Results 39.98 Perp

Ryan Spring 114 Nov 04, 2022
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
The NewSHead dataset is a multi-doc headline dataset used in NHNet for training a headline summarization model.

This repository contains the raw dataset used in NHNet [1] for the task of News Story Headline Generation. The code of data processing and training is available under Tensorflow Models - NHNet.

Google Research Datasets 31 Jul 15, 2022
Text editor on python to convert english text to malayalam(Romanization/Transiteration).

Manglish Text Editor This is a simple transiteration (romanization ) program which is used to convert manglish to malayalam (converts njaan to ഞാൻ ).

Merin Rose Tom 1 May 11, 2022
🌸 fastText + Bloom embeddings for compact, full-coverage vectors with spaCy

floret: fastText + Bloom embeddings for compact, full-coverage vectors with spaCy floret is an extended version of fastText that can produce word repr

Explosion 222 Dec 16, 2022
Finally, some decent sample sentences

tts-dataset-prompts This repository aims to be a decent set of sentences for people looking to clone their own voices (e.g. using Tacotron 2). Each se

hecko 19 Dec 13, 2022
Repositório do trabalho de introdução a NLP

Trabalho da disciplina de BI NLP Repositório do trabalho da disciplina Introdução a Processamento de Linguagem Natural da pós BI-Master da PUC-RIO. Eq

Leonardo Lins 1 Jan 18, 2022
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 07, 2022
Package for controllable summarization

summarizers summarizers is package for controllable summarization based CTRLsum. currently, we only supports English. It doesn't work in other languag

Hyunwoong Ko 72 Dec 07, 2022
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Spanish Language Models 💃🏻 A repository part of the MarIA project. Corpora 📃 Corpora Number of documents Number of tokens Size (GB) BNE 201,080,084

Plan de Tecnologías del Lenguaje - Gobierno de España 203 Dec 20, 2022
LUKE -- Language Understanding with Knowledge-based Embeddings

LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on transf

Studio Ousia 587 Dec 30, 2022
Khandakar Muhtasim Ferdous Ruhan 1 Dec 30, 2021
A Streamlit web app that generates Rick and Morty stories using GPT2.

Rick and Morty Story Generator This project uses a pre-trained GPT2 model, which was fine-tuned on Rick and Morty transcripts, to generate new stories

₸ornike 33 Oct 13, 2022
TFPNER: Exploration on the Named Entity Recognition of Token Fused with Part-of-Speech

TFPNER TFPNER: Exploration on the Named Entity Recognition of Token Fused with Part-of-Speech Named entity recognition (NER), which aims at identifyin

1 Feb 07, 2022
Knowledge Oriented Programming Language

KoPL: 面向知识的推理问答编程语言 安装 | 快速开始 | 文档 KoPL全称 Knowledge oriented Programing Language, 是一个为复杂推理问答而设计的编程语言。我们可以将自然语言问题表示为由基本函数组合而成的KoPL程序,程序运行的结果就是问题的答案。目前,

THU-KEG 62 Dec 12, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
PyWorld3 is a Python implementation of the World3 model

The World3 model revisited in Python Install & Hello World3 How to tune your own simulation Licence How to cite PyWorld3 with Bibtex References & ackn

Charles Vanwynsberghe 248 Dec 14, 2022
Textpipe: clean and extract metadata from text

textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata

Textpipe 298 Nov 21, 2022