A collection of scripts to preprocess ASR datasets and finetune language-specific Wav2Vec2 XLSR models

Last update: Oct 23, 2022

wav2vec-toolkit

This repository accompanies the 🤗 HuggingFace Community Paper on finetuning Wav2Vec2 XLSR for low-resource languages [link]

How to contribute

(Mostly identical to the huggingface/datasets contributing guide)

Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

Clone your fork to your local disk, and add the base repository as a remote:

git clone git@github.com:<your Github handle>/wav2vec-toolkit.git
cd wav2vec-toolkit
git remote add upstream https://github.com/anton-l/wav2vec-toolkit.git

Create a new branch to hold your development changes:
```
git checkout -b a-descriptive-name-for-my-changes
```
Do not work directly on the `main` branch.
Set up a development environment by running the following command in a virtual environment:
```
pip install -e ".[dev]"
```
(If wav2vec-toolkit was already installed in the virtual environment, remove it with `pip uninstall wav2vec_toolkit` before reinstalling it in editable mode with the `-e` flag.)
Develop the features on your branch.
Format your code. Run black and isort so that your newly added files are consistently formatted, using the following commands:
```
black --line-length 119 --target-version py36 src scripts
isort src scripts
```
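Before committing, it can help to confirm that nothing still needs reformatting. The sketch below wraps the two formatters in their check-only modes (`black --check` and `isort --check-only`), which report problems without rewriting files. The helper name `check_format` is hypothetical and not part of this repository, and the sketch assumes black and isort are already installed through the `dev` extras.

```shell
# Hypothetical helper (not part of wav2vec-toolkit): report formatting
# problems without modifying any files.
check_format() {
    # With no arguments, fall back to the directories used in the step above.
    if [ "$#" -eq 0 ]; then
        set -- src scripts
    fi
    rc=0
    # --check makes black exit non-zero if any file would be reformatted.
    black --check --line-length 119 --target-version py36 "$@" || rc=1
    # --check-only makes isort exit non-zero if any imports would be re-sorted.
    isort --check-only "$@" || rc=1
    return "$rc"
}
```

Calling `check_format` with no arguments checks `src` and `scripts`. A non-zero return status means at least one of the two tools would change a file.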
Once you're happy with your implementation, add your changes and make a commit to record your changes locally:
```
git add .
git commit
```
It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
```
git fetch upstream
git rebase upstream/main
```
Push the changes to your account using:
```
git push -u origin a-descriptive-name-for-my-changes
```
Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.
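The sync and push steps interact in a way that is easy to miss. Rebasing a branch that was already pushed rewrites its commits, so a plain `git push` is rejected as a non-fast-forward update. The script below is a minimal rehearsal of this in a throwaway sandbox, where two local bare repositories stand in for the upstream project and your fork. The file names, commit messages, and demo identity are illustrative assumptions. The use of `git push --force-with-lease` is general git practice rather than something this guide prescribes.

```shell
# Rehearse branch -> push -> sync -> push again in a temporary directory.
# Nothing here touches a real repository.
set -eu
sandbox=$(mktemp -d)
cd "$sandbox"

# Local bare repositories stand in for the upstream project and your fork.
git init -q --bare upstream.git
git init -q --bare fork.git
git --git-dir=upstream.git symbolic-ref HEAD refs/heads/main
git --git-dir=fork.git symbolic-ref HEAD refs/heads/main

# Seed "upstream" with one commit on main and mirror it to the "fork".
git clone -q upstream.git seed 2>/dev/null
cd seed
git config user.name "Demo"
git config user.email "demo@example.com"
git symbolic-ref HEAD refs/heads/main
echo "demo" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q origin main
git push -q ../fork.git main
cd ..

# The contributor's clone of the fork, with upstream added as a remote.
git clone -q fork.git work
cd work
git config user.name "Demo"
git config user.email "demo@example.com"
git remote add upstream "$sandbox/upstream.git"
git checkout -q -b a-descriptive-name-for-my-changes
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add my feature"
git push -q -u origin a-descriptive-name-for-my-changes

# Meanwhile, upstream gains a new commit.
cd ../seed
echo "upstream" > upstream.txt
git add upstream.txt
git commit -q -m "Upstream change"
git push -q origin main
cd ../work

# Sync: replay the feature commit on top of the updated upstream/main.
git fetch -q upstream
git rebase -q upstream/main

# The branch history was rewritten, so the earlier push must be replaced.
# --force-with-lease refuses to overwrite commits you have not fetched.
git push -q --force-with-lease origin a-descriptive-name-for-my-changes

# Show the resulting history, newest commit first.
git log --format=%s
```

If the branch has never been pushed before the rebase, the ordinary `git push -u origin a-descriptive-name-for-my-changes` is enough and no force option is needed.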

Owner

Anton Lozhkov
