Transformers Wav2Vec2 + Parlance's CTCDecode

Last update: Jul 21, 2022

Overview

🤗 Transformers Wav2Vec2 + Parlance's CTCDecode

Introduction

This repo shows how 🤗 Transformers can be used in combination with Parlance's ctcdecode and a KenLM ngram as a simple way to try to improve (that is, lower) word error rate (WER).

Included is a file to create an ngram with KenLM as well as a simple evaluation script to compare the results of using Wav2Vec2 with ctcdecode + KenLM vs. without using any language model.

Note: The scripts are written to be used on GPU. If you want to use a CPU instead, simply remove all .to("cuda") occurrences in eval.py.

Installation

As a first step, one should install KenLM. For Ubuntu, it should be enough to follow the installation steps described here. The installed kenlm folder should be moved into this repo for ./create_ngram.py to function correctly. Alternatively, one can link the lmplz binary file to an lmplz bash command to run lmplz directly instead of ./kenlm/build/bin/lmplz.

Next, some Python dependencies should be installed. Assuming PyTorch is installed, it should be sufficient to run pip install -r requirements.txt.

Run evaluation

Create ngram

As a first step, one should create an ngram. E.g. for Polish the command would be:

./create_ngram.py --language polish --path_to_ngram polish.arpa
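For orientation, building an ngram with KenLM boils down to one lmplz call that reads a text corpus on stdin and writes an ARPA file on stdout. The sketch below is not the contents of create_ngram.py: the helper names are made up for illustration, and the order of 5 is an assumption chosen to match the five ngram counts in the ARPA header shown in this section.

```python
import subprocess
from pathlib import Path
from typing import List

# Path used in this README when the kenlm folder sits inside the repo.
LMPLZ = "./kenlm/build/bin/lmplz"


def build_lmplz_command(order: int = 5, lmplz: str = LMPLZ) -> List[str]:
    """Return the lmplz argv for an n-gram model of the given order."""
    if order < 1:
        raise ValueError("order must be >= 1")
    return [lmplz, "-o", str(order)]


def create_ngram(text_path: str, arpa_path: str, order: int = 5) -> None:
    """Stream a plain-text corpus (one sentence per line) through lmplz.

    Hypothetical helper: lmplz reads the corpus from stdin and writes
    the ARPA model to stdout, so both are redirected to files here.
    """
    with Path(text_path).open("rb") as src, Path(arpa_path).open("wb") as dst:
        subprocess.run(build_lmplz_command(order), stdin=src, stdout=dst, check=True)
```

With a corpus file at hand (polish.txt is a placeholder name), create_ngram("polish.txt", "polish.arpa") would produce the ARPA file discussed next.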

After the language model is created, one should open the file. It should have a structure which looks more or less as follows:

\data\
ngram 1=86586
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206      <unk>   0
0       <s>     -0.06677356
-3.4645514      drugi   -0.2088903
...

Now it is very important to also add a </s> token to the n-gram so that it can be correctly loaded. You can simply copy the line:

0       <s>     -0.06677356

and change <s> to </s>. When doing this you should also increase ngram 1 by 1. The new ngram should look as follows:

\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206      <unk>   0
0       <s>     -0.06677356
0       </s>    -0.06677356
-3.4645514      drugi   -0.2088903
...

Now the ngram can be correctly used with pyctcdecode.
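The manual edit above can also be scripted. The following is a minimal sketch, assuming the ARPA file has an <s> unigram and no </s> unigram yet; add_end_token is a hypothetical helper, not part of this repo. It copies the <s> line as </s> inside the \1-grams: section and raises the ngram 1 count by one.

```python
import re


def add_end_token(arpa_text: str) -> str:
    """Return ARPA text with a </s> unigram copied from the <s> unigram.

    Assumes the input contains an <s> unigram and no </s> unigram yet.
    """
    out = []
    in_unigrams = False
    added = False
    for line in arpa_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("\\"):
            # Section markers look like \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = stripped == "\\1-grams:"
        fields = stripped.split()
        out.append(line)
        if in_unigrams and not added and len(fields) >= 2 and fields[1] == "<s>":
            # Duplicate the <s> entry with the token renamed to </s>.
            out.append(line.replace("<s>", "</s>", 1))
            added = True
    if not added:
        raise ValueError("no <s> unigram found in the \\1-grams: section")
    # One unigram was inserted, so the declared 1-gram count grows by 1.
    for i, line in enumerate(out):
        match = re.fullmatch(r"ngram 1=(\d+)", line.strip())
        if match:
            out[i] = f"ngram 1={int(match.group(1)) + 1}"
            break
    return "\n".join(out) + "\n"


# Usage sketch (reads the whole file into memory, fine for a quick fix):
# from pathlib import Path
# path = Path("polish.arpa")
# path.write_text(add_end_token(path.read_text(encoding="utf-8")), encoding="utf-8")
```

Only the first <s> line in the unigram section is touched, so higher-order entries that start with <s> are left as they are.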

Run eval

Having created the ngram, one can run:

./eval.py --language polish --path_to_ngram polish.arpa

This compares Wav2Vec2 + LM vs. Wav2Vec2 + No LM on Polish.

Results

==================================================polish==================================================
polish - No LM - | WER: 0.3069742867206763 | CER: 0.06054530156286364 | Time: 32.37423086166382
polish - With LM - | WER: 0.39526828695550076 | CER: 0.17596985266474516 | Time: 62.017329692840576
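For readers unfamiliar with the two metrics reported above: WER and CER are edit distances between reference and prediction, counted over words and characters respectively and divided by the reference length. The sketch below is a plain-Python illustration for a single, non-empty reference. It is an assumption that eval.py computes the same quantity; the script may use a library metric and aggregate over the whole test set.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for one non-empty reference transcript."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate for one non-empty reference transcript."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower is better for both, so a WER of 0.39 with the LM against 0.31 without it means the LM made the transcripts worse.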

I didn't obtain any good results: with the LM, both WER and CER got worse, even when trying out a variety of different settings for alpha and beta. Sadly there aren't many examples, tutorials or docs on parlance/ctcdecode, so it's hard to find the reason for the problem.

I also tried it out for other languages like Portuguese and Spanish, but no luck there either.
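To make the role of alpha and beta concrete, here is a small sketch of the scoring rule commonly used for LM-fused CTC decoding: CTC log-probability plus alpha times the LM log-probability plus beta times the word count. That ctcdecode combines its alpha (LM weight) and beta (word-count weight) in exactly this form is an assumption, and the real decoder applies the weights inside beam search rather than by reranking finished hypotheses. All names and numbers below are made up for illustration.

```python
from itertools import product
from typing import Iterable, Tuple

Hypothesis = Tuple[str, float, float]  # (text, CTC log-prob, LM log-prob)


def fused_score(ctc_log_prob: float, lm_log_prob: float, word_count: int,
                alpha: float, beta: float) -> float:
    """Assumed fusion rule: acoustic score + weighted LM score + word bonus."""
    return ctc_log_prob + alpha * lm_log_prob + beta * word_count


def best_hypothesis(hyps: Iterable[Hypothesis], alpha: float, beta: float) -> str:
    """Pick the text with the highest fused score."""
    return max(
        hyps,
        key=lambda h: fused_score(h[1], h[2], len(h[0].split()), alpha, beta),
    )[0]


# Made-up scores: the acoustic model slightly prefers the misspelling,
# while the LM strongly prefers the correctly spelled phrase.
hyps = [("drugi raz", -4.0, -6.0), ("drugi ras", -3.5, -12.0)]

# A tiny grid over (alpha, beta), in the spirit of a parameter sweep.
grid = list(product([0.0, 0.5, 1.0], [0.0, 1.0]))
picks = {(a, b): best_hypothesis(hyps, a, b) for a, b in grid}
```

With alpha set to 0 the LM is ignored and the ranking is purely acoustic; any positive alpha in this toy grid flips the choice to the LM-preferred phrase. A sweep like this is one way to check whether the LM scores are influencing the output at all.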

Owner

Patrick von Platen
