A pytorch implementation of the ACL2019 paper "Simple and Effective Text Matching with Richer Alignment Features".

Overview

RE2

This is a pytorch implementation of the ACL 2019 paper "Simple and Effective Text Matching with Richer Alignment Features". The original Tensorflow implementation: https://github.com/alibaba-edu/simple-effective-text-matching.

Quick Links

Simple and Effective Text Matching

RE2 is a fast and strong neural architecture for general purpose text matching applications. In a text matching task, a model takes two text sequences as input and predicts their relationship. This method aims to explore what is sufficient for strong performance in these tasks. It simplifies many slow components which are previously considered as core building blocks in text matching, while keeping three key features directly available for inter-sequence alignment: original point-wise features, previous aligned features, and contextual features.

RE2 achieves performance on par with the state of the art on four benchmark datasets: SNLI, SciTail, Quora and WikiQA, across tasks of natural language inference, paraphrase identification and answer selection with no or few task-specific adaptations. It has at least 6 times faster inference speed compared to similarly performed models.

The following table lists major experiment results. The paper reports the average and standard deviation of 10 runs. Inference time (in seconds) is measured by processing a batch of 8 pairs of length 20 on Intel i7 CPUs. The computation time of POS features used by CSRAN and DIIN is not included.

Model SNLI SciTail Quora WikiQA Inference Time
BiMPM 86.9 - 88.2 0.731 0.05
ESIM 88.0 70.6 - - -
DIIN 88.0 - 89.1 - 1.79
CSRAN 88.7 86.7 89.2 - 0.28
RE2 88.9±0.1 86.0±0.6 89.2±0.2 0.7618 ±0.0040 0.03~0.05

Refer to the paper for more details of the components and experiment results.

Setup

Data used in the paper are prepared as follows:

SNLI

  • Download and unzip SNLI (pre-processed by Tay et al.) to data/orig.
  • Unzip all zip files in the "data/orig/SNLI" folder. (cd data/orig/SNLI && gunzip *.gz)
  • cd data && python prepare_snli.py

SciTail

  • Download and unzip SciTail dataset to data/orig.
  • cd data && python prepare_scitail.py

Quora

  • Download and unzip Quora dataset (pre-processed by Wang et al.) to data/orig.
  • cd data && python prepare_quora.py

WikiQA

  • Download and unzip WikiQA to data/orig.
  • cd data && python prepare_wikiqa.py
  • Download and unzip evaluation scripts. Use the make -B command to compile the source files in qg-emnlp07-data/eval/trec_eval-8.0. Move the binary file "trec_eval" to resources/.

Usage

To train a new text matching model, run the following command:

python train.py $config_file.json5

Example configuration files are provided in configs/:

  • configs/main.json5: replicate the main experiment result in the paper.
  • configs/robustness.json5: robustness checks
  • configs/ablation.json5: ablation study

The instructions to write your own configuration files:

[
    {
        name: 'exp1', // name of your experiment, can be the same across different data
        __parents__: [
            'default', // always put the default on top
            'data/quora', // data specific configurations in `configs/data`
            // 'debug', // use "debug" to quick debug your code  
        ],
        __repeat__: 5,  // how may repetitions you want
        blocks: 3, // other configurations for this experiment 
    },
    // multiple configurations are executed sequentially
    {
        name: 'exp2', // results under the same name will be overwritten
        __parents__: [
            'default', 
            'data/quora',
        ],
        __repeat__: 5,  
        blocks: 4, 
    }
]

To check the configurations only, use

python train.py $config_file.json5 --dry

To evaluate an existed model, use python evaluate.py $model_path $data_file, here's an example:

python evaluate.py models/snli/benchmark/best.pt data/snli/train.txt 
python evaluate.py models/snli/benchmark/best.pt data/snli/test.txt 

Note that multi-GPU training is not yet supported in the pytorch implementation. A single 16G GPU is sufficient for training when blocks < 5 with hidden size 200 and batch size 512. All the results reported in the paper except the robustness checks can be reproduced with a single 16G GPU.

Citation

Please cite the ACL paper if you use RE2 in your work:

@inproceedings{yang2019simple,
  title={Simple and Effective Text Matching with Richer Alignment Features},
  author={Yang, Runqi and Zhang, Jianhai and Gao, Xing and Ji, Feng and Chen, Haiqing},
  booktitle={Association for Computational Linguistics (ACL)},
  year={2019}
}

License

This project is under Apache License 2.0.

Codes to pre-train Japanese T5 models

t5-japanese Codes to pre-train a T5 (Text-to-Text Transfer Transformer) model pre-trained on Japanese web texts. The model is available at https://hug

Megagon Labs 37 Dec 25, 2022
I label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive

I label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others

1 Jan 13, 2022
Chatbot for the Chatango messaging platform

BroiestBot The baddest bot in the game right now. Uses the ch.py framework for joining Chantango rooms and responding to user messages. Commands If a

Todd Birchard 3 Jan 17, 2022
A Python/Pytorch app for easily synthesising human voices

Voice Cloning App A Python/Pytorch app for easily synthesising human voices Documentation Discord Server Video guide Voice Sharing Hub FAQ's System Re

Ben Andrew 840 Jan 04, 2023
Experiments in converting wikidata to ftm

FollowTheMoney / Wikidata mappings This repo will contain tools for converting Wikidata entities into FtM schema. Prefixes: https://www.mediawiki.org/

Friedrich Lindenberg 2 Nov 12, 2021
Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense.

PythonTextObfuscator Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense. Requi

2 Aug 29, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
Edge-Augmented Graph Transformer

Edge-augmented Graph Transformer Introduction This is the official implementation of the Edge-augmented Graph Transformer (EGT) as described in https:

Md Shamim Hussain 21 Dec 14, 2022
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Kundan Krishna 6 Jun 04, 2021
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
This project converts your human voice input to its text transcript and to an automated voice too.

Human Voice to Automated Voice & Text Introduction: In this project, whenever you'll speak, it will turn your voice into a robot voice and furthermore

Hassan Shahzad 3 Oct 15, 2021
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Main features: Train new vocabularies and tok

Hugging Face 6.2k Dec 31, 2022
⚡ Automatically decrypt encryptions without knowing the key or cipher, decode encodings, and crack hashes ⚡

Translations 🇩🇪 DE 🇫🇷 FR 🇭🇺 HU 🇮🇩 ID 🇮🇹 IT 🇳🇱 NL 🇧🇷 PT-BR 🇷🇺 RU 🇨🇳 ZH ➡️ Documentation | Discord | Installation Guide ⬅️ Fully autom

11.2k Jan 05, 2023
Korean Sentence Embedding Repository

Korean-Sentence-Embedding 🍭 Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

80 Jan 02, 2023
FewCLUE: 为中文NLP定制的小样本学习测评基准

FewCLUE: 为中文NLP定制的小样本学习测评基准

CLUE benchmark 387 Jan 04, 2023
Common Voice Dataset explorer

Common Voice Dataset Explorer Common Voice Dataset is by Mozilla Made during huggingface finetuning week Usage pip install -r requirements.txt streaml

Ceyda Cinarel 22 Nov 16, 2022
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

Li-am K. 0 Oct 25, 2022