End-to-end speech secognition toolkit

Overview

End-to-end speech secognition toolkit

This is an E2E ASR toolkit modified from Espnet1 (version 0.9.9).
This is the official implementation of paper:
Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI
This is also the official implementation of paper:
Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model
We achieve state-of-the-art results on two of the most popular results in Aishell-1 and AIshell-2 Mandarin datasets.
Please feel free to change / modify the code as you like. :)

Update

  • 2021/12/29: Release the first version, which contains all MMI-related features, including MMI training criteria, MMI Prefix Score (for attention-based encoder-decoder, AED) and MMI Alignment Score (For neural transducer, NT).
  • 2022/1/6: Release the word-level N-gram LM scorer.

Environment:

The main dependencies of this code can be divided into three part: kaldi, espnet and k2.

  1. kaldi is mainly used for feature extraction. To install kaldi, please follow the instructions here.
  2. Espnet is a open-source end-to-end speech recognition toolkit. please follow the instructions here to install its environment.
    2.1. Pytorch, cudatoolkit, along with many other dependencies will be install automatically during this process. 2.2. If you are going to use NT models, you are recommend to install a RNN-T warpper. Please run ${ESPNET_ROOT}/tools/installer/install_warp-transducer.sh
    2.3. Once you have installed the espnet envrionment successfully, please run pip uninstall espnet to remove the espnet library. So our code will be used.
    2.4. Also link the kaldi in ${ESPNET_ROOT}: ln -s ${KALDI-ROOT} ${ESPNET_ROOT}
  3. k2 is a python-based FST library. Please follow the instructions here to install it. GPU version is required.
    3.1. To use word N-gram LM, please also install kaldilm
  4. There might be some dependency conflicts during building the environment. We report ours below as a reference:
    4.1 OS: CentOS 7; GCC 7.3.1; Python 3.8.10; CUDA 10.1; Pytorch 1.7.1; k2-fsa 1.2 (very old for now)
    4.2 Other python libraries are in requirement.txt (It is not recommend to use this file to build the environment directly).

Results

Currently we have released examples on Aishell-1 and Aishell-2 datasets.

With MMI training & decoding methods and the word-level N-gram LM. We achieve results on Aishell-1 and Aishell-2 as below. All results are in CER%

Test set Aishell-1-dev Aishell-1-test Aishell-2-ios Aishell-2-android Aishell-2-mic
AED 4.73 5.32 5.73 6.56 6.53
AED + MMI + Word Ngram 4.08 4.45 5.26 6.22 5.92
NT 4.41 4.81 5.70 6.75 6.58
NT + MMI + Word Ngram 3.86 4.18 5.06 6.08 5.98

(example on Librispeech is not fully prepared)

Get Start

Take Aishell-1 as an example. Working process for other examples are very similar.
Prepare data and LMs

cd ${ESPNET_ROOT}/egs/aishell1
source path.sh
bash prepare.sh # prepare the data

split the json file of training data for each GPU. (we use 8GPUs)

python3 espnet_utils/splitjson.py -p 
   
     dump/train_sp/deltafalse/data.json

   

Training and decoding for NT model:

bash nt.sh      # to train the nueal transducer model

Training and decoding for AED model:

bash aed.sh     # or to train the attention-based encoder-decoder model

Several Hint:

  1. Please change the paths in path.sh accordingly before you start
  2. Please change the data to config your data path in prepare.sh
  3. Our code runs in DDP style. Before you start, you need to set them manually. We assume Pytorch distributed API works well on your machine.
export HOST_GPU_NUM=x       # number of GPUs on each host
export HOST_NUM=x           # number of hosts
export NODE_NUM=x           # number of GPUs in total (on all hosts)
export INDEX=x              # index of this host
export CHIEF_IP=xx.xx.xx.xx # IP of the master host
  1. Multiple choices are available during decoding (we take aed.sh as an example, but the usage of nt.sh is the same).
    To use the MMI-related scorers, you need train the model with MMI auxiliary criterion;

To use MMI Prefix Score (in AED) or MMI Alignment score (in NT):

bash aed.sh --stage 2 --mmi-weight 0.2

To use any external LM, you need to train them in advance (as implemented in prepare.sh)

To use word-level N-gram LM:

bash aed.sh --stage 2 --word-ngram-weight 0.4

To use character-level N-gram LM:

bash aed.sh --stage 2 --ngram-weight 1.0

To use neural network LM:

bash aed.sh --stage 2 --lm-weight 1.0

Reference

kaldi: https://github.com/kaldi-asr/kaldi
Espent: https://github.com/espnet/espnet
k2-fsa: https://github.com/k2-fsa/k2

Citations

@article{tian2021consistent,  
  title={Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI},  
  author={Tian, Jinchuan and Yu, Jianwei and Weng, Chao and Zhang, Shi-Xiong and Su, Dan and Yu, Dong and Zou, Yuexian},  
  journal={arXiv preprint arXiv:2112.02498},  
  year={2021}  
}  

@misc{tian2022improving,
      title={Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model}, 
      author={Jinchuan Tian and Jianwei Yu and Chao Weng and Yuexian Zou and Dong Yu},
      year={2022},
      eprint={2201.01995},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Authorship

Jinchuan Tian; [email protected] or [email protected]
Jianwei Yu; [email protected] (supervisor)
Chao Weng; [email protected]
Yuexian Zou; [email protected]

Owner
Jinchuan Tian
Graduate student @ Peking University, Shenzhen; Research intern @ Tencent AI LAB;
ColossalAI-Benchmark - Performance benchmarking with ColossalAI

Benchmark for Tuning Accuracy and Efficiency Overview The benchmark includes our

HPC-AI Tech 31 Oct 07, 2022
AVD Quickstart Containerlab

AVD Quickstart Containerlab WARNING This repository is still under construction. It's fully functional, but has number of limitations. For example: RE

Carl Buchmann 3 Apr 10, 2022
PyTorch implementation of 'Gen-LaneNet: a generalized and scalable approach for 3D lane detection'

(pytorch) Gen-LaneNet: a generalized and scalable approach for 3D lane detection Introduction This is a pytorch implementation of Gen-LaneNet, which p

Yuliang Guo 233 Jan 06, 2023
Real-time Object Detection for Streaming Perception, CVPR 2022

StreamYOLO Real-time Object Detection for Streaming Perception Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, Sun Jian Real-time Object Detection

Jinrong Yang 237 Dec 27, 2022
VID-Fusion: Robust Visual-Inertial-Dynamics Odometry for Accurate External Force Estimation

VID-Fusion VID-Fusion: Robust Visual-Inertial-Dynamics Odometry for Accurate External Force Estimation Authors: Ziming Ding , Tiankai Yang, Kunyi Zhan

ZJU FAST Lab 86 Nov 18, 2022
Sequential model-based optimization with a `scipy.optimize` interface

Scikit-Optimize Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements

Scikit-Optimize 2.5k Jan 04, 2023
HGCAE Pytorch implementation. CVPR2021 accepted.

Hyperbolic Graph Convolutional Auto-Encoders Accepted to CVPR2021 🎉 Official PyTorch code of Unsupervised Hyperbolic Representation Learning via Mess

Junho Cho 37 Nov 13, 2022
Weakly Supervised 3D Object Detection from Point Cloud with Only Image Level Annotation

SCCKTIM Weakly Supervised 3D Object Detection from Point Cloud with Only Image-Level Annotation Our code will be available soon. The class knowledge t

1 Nov 12, 2021
EfficientNetV2 implementation using PyTorch

EfficientNetV2-S implementation using PyTorch Train Steps Configure imagenet path by changing data_dir in train.py python main.py --benchmark for mode

Jahongir Yunusov 86 Dec 29, 2022
Noise Conditional Score Networks (NeurIPS 2019, Oral)

Generative Modeling by Estimating Gradients of the Data Distribution This repo contains the official implementation for the NeurIPS 2019 paper Generat

451 Dec 26, 2022
An open software package to develop BCI based brain and cognitive computing technology for recognizing user's intention using deep learning

An open software package to develop BCI based brain and cognitive computing technology for recognizing user's intention using deep learning

deepbci 272 Jan 08, 2023
Code & Data for Enhancing Photorealism Enhancement

Enhancing Photorealism Enhancement Stephan R. Richter, Hassan Abu AlHaija, Vladlen Koltun Paper | Website (with side-by-side comparisons) | Video (Pap

Intelligent Systems Lab Org 1.1k Dec 31, 2022
Referring Video Object Segmentation

Awesome-Referring-Video-Object-Segmentation Welcome to starts ⭐ & comments 💹 & sharing 😀 !! - 2021.12.12: Recent papers (from 2021) - welcome to ad

Explorer 57 Dec 11, 2022
Koç University deep learning framework.

Knet Knet (pronounced "kay-net") is the Koç University deep learning framework implemented in Julia by Deniz Yuret and collaborators. It supports GPU

1.4k Dec 31, 2022
Code of Classification Saliency-Based Rule for Visible and Infrared Image Fusion

CSF Code of Classification Saliency-Based Rule for Visible and Infrared Image Fusion Tips: For testing: CUDA_VISIBLE_DEVICES=0 python main.py For trai

Han Xu 14 Oct 31, 2022
Code for WSDM 2022 paper, Contrastive Learning for Representation Degeneration Problem in Sequential Recommendation.

DuoRec Code for WSDM 2022 paper, Contrastive Learning for Representation Degeneration Problem in Sequential Recommendation. Usage Download datasets fr

Qrh 46 Dec 19, 2022
MiniSom is a minimalistic implementation of the Self Organizing Maps

MiniSom Self Organizing Maps MiniSom is a minimalistic and Numpy based implementation of the Self Organizing Maps (SOM). SOM is a type of Artificial N

Giuseppe Vettigli 1.2k Jan 03, 2023
Single-Stage 6D Object Pose Estimation, CVPR 2020

Overview This repository contains the code for the paper Single-Stage 6D Object Pose Estimation. Yinlin Hu, Pascal Fua, Wei Wang and Mathieu Salzmann.

CVLAB @ EPFL 89 Dec 26, 2022
A universal memory dumper using Frida

Fridump Fridump (v0.1) is an open source memory dumping tool, primarily aimed to penetration testers and developers. Fridump is using the Frida framew

551 Jan 07, 2023
Code for the TPAMI paper: "Syntax Customized Video Captioning by Imitating Exemplar Sentences"

Syntax-Customized-Video-Captioning Code for the TPAMI paper: "Syntax Customized Video Captioning by Imitating Exemplar Sentences". This is my second w

3 Dec 05, 2022