A self-supervised learning framework for audio-visual speech

Overview

AV-HuBERT (Audio-Visual Hidden Unit BERT)

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Robust Self-Supervised Audio-Visual Speech Recognition

lip-reading

Introduction

AV-HuBERT is a self-supervised representation learning framework for audio-visual speech. It achieves state-of-the-art results in lip reading, ASR and audio-visual speech recognition on the LRS3 audio-visual speech benchmark.

If you find AV-HuBERT useful in your research, please use the following BibTeX entry for citation.

@inproceedings{shi2022avhubert,
    author  = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
    title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
    year = {2022}
}

@article{shi2022avsr,
    author  = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
    title = {Robust Self-Supervised Audio-Visual Speech Recognition},
    journal = {arXiv preprint arXiv:2201.01763}
    year = {2022}
}

License

AV-HuBERT LICENSE AGREEMENT

This License Agreement (as may be amended in accordance with this License Agreement, “License”), between you (“Licensee” or “you”) and Meta Platforms, Inc. (“Meta” or “we”) applies to your use of any computer program, algorithm, source code, object code, or software that is made available by Meta under this License (“Software”) and any specifications, manuals, documentation, and other written information provided by Meta related to the Software (“Documentation”).

By using the Software, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the Software or Documentation (collectively, the “Software Products”), and you must immediately cease using the Software Products.

Pre-trained and fine-tuned models

Please find the checkpoints here

Installation

First, create a conda virtual environment and activate it:

conda create -n avhubert python=3.8 -y
conda activate avhubert

Then, clone this directory:

git clone https://github.com/facebookresearch/av_hubert.git
cd avhubert
git submodule init
git submodule update

Lastly, install Fairseq and the other packages:

pip install -r requirements.txt
cd fairseq
pip install --editable ./

Load a pretrained model

$ cd avhubert
$ python
>>> import fairseq
>>> import hubert_pretraining, hubert
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]

Train a new model

Data preparation

Follow the steps in preparation to pre-process:

  • LRS3 and VoxCeleb2 datasets

Follow the steps in clustering (pre-train only) to create:

  • {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 100Hz for MFCC features and 25Hz for AV-HuBERT features by default.

Pre-train an AV-HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, the configuration file is saved at /path/to/conf/conf-name, and the label rate is 100Hz.

To train a model, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`

Finetune an AV-HuBERT model with Seq2Seq

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.wrd are saved at /path/to/labels, the configuration file is saved at /path/to/conf/conf-name.

To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`

Decode an AV-HuBERT model

Suppose the test.tsv and test.wrd are the video list and transcripts of the split to be decoded, saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint.

Seq2Seq decoding

task.normalize needs to be consistent with the value used during fine-tuning. Decoding results will be saved at /path/to/experiment/decode/s2s/test.

$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['video'] common.user_dir=`pwd`

The command above uses the default decoding hyperparameter, which can be found in conf/s2s_decode.yaml. override.modalities can be set to ['video'] (for lip reading), or ['audio'] (for ASR) or ['audio','video'] (for audio-visual speech recognition).These parameters can be configured from the command line. For example, to search with a beam size of 20, we can append the command above with generation.beam=20. Important parameters include:

  • generation.beam
  • generation.lenpen

If you want to test your model under noisy environment, append the following to the above command.

+override.noise_wav=/path/to/noise override.noise_prob=1 override.noise_snr={snr}

{snr} is the signal-to-noise ratio (SNR) and /path/to/noise is a folder containing noise manifest files (/path/to/noise/{valid,test}.tsv). See preparation for setting up this folder.

Owner
Meta Research
Meta Research
YOLOv7 - Framework Beyond Detection

🔥🔥🔥🔥 YOLO with Transformers and Instance Segmentation, with TensorRT acceleration! 🔥🔥🔥

JinTian 3k Jan 01, 2023
A powerful framework for decentralized federated learning with user-defined communication topology

Scatterbrained Decentralized Federated Learning Scatterbrained makes it easy to build federated learning systems. In addition to traditional federated

Johns Hopkins Applied Physics Laboratory 7 Sep 26, 2022
A baseline code for VSPW

A baseline code for VSPW Preparation Download VSPW dataset The VSPW dataset with extracted frames and masks is available here.

28 Aug 22, 2022
这是一个利用facenet和retinaface实现人脸识别的库,可以进行在线的人脸识别。

Facenet+Retinaface:人脸识别模型在Pytorch当中的实现 目录 注意事项 Attention 所需环境 Environment 文件下载 Download 预测步骤 How2predict 参考资料 Reference 注意事项 该库中包含了两个网络,分别是retinaface和

Bubbliiiing 102 Dec 30, 2022
NeRF Meta-Learning with PyTorch

NeRF Meta Learning With PyTorch nerf-meta is a PyTorch re-implementation of NeRF experiments from the paper "Learned Initializations for Optimizing Co

Sanowar Raihan 78 Dec 18, 2022
Dynamics-aware Adversarial Attack of 3D Sparse Convolution Network

Leaded Gradient Method (LGM) This repository contains the PyTorch implementation for paper Dynamics-aware Adversarial Attack of 3D Sparse Convolution

An Tao 2 Oct 18, 2022
For holding anime-related object classification and detection models

Animesion An end-to-end framework for anime-related object classification, detection, segmentation, and other models. Update: 01/22/2020. Due to time-

Edwin Arkel Rios 72 Nov 30, 2022
a delightful machine learning tool that allows you to train, test and use models without writing code

igel A delightful machine learning tool that allows you to train/fit, test and use models without writing code Note I'm also working on a GUI desktop

Nidhal Baccouri 3k Jan 05, 2023
Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph

Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph Model Description Open-CyKG is a framework that is constructed using an attenti

Injy Sarhan 34 Jan 05, 2023
City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Code

City-Scale Multi-Camera Vehicle Tracking Guided by Crossroad Zones Requirements Python 3.8 or later with all requirements.txt dependencies installed,

88 Dec 12, 2022
A collection of educational notebooks on multi-view geometry and computer vision.

Multiview notebooks This is a collection of educational notebooks on multi-view geometry and computer vision. Subjects covered in these notebooks incl

Max 65 Dec 09, 2022
A Robust Non-IoU Alternative to Non-Maxima Suppression in Object Detection

Confluence: A Robust Non-IoU Alternative to Non-Maxima Suppression in Object Detection 1. 介绍 用以替代 NMS,在所有 bbox 中挑选出最优的集合。 NMS 仅考虑了 bbox 的得分,然后根据 IOU 来

44 Sep 15, 2022
An Object Oriented Programming (OOP) interface for Ontology Web language (OWL) ontologies.

Enabling a developer to use Ontology Web Language (OWL) along with its reasoning capabilities in an Object Oriented Programming (OOP) paradigm, by pro

TheEngineRoom-UniGe 7 Sep 23, 2022
Implicit MLE: Backpropagating Through Discrete Exponential Family Distributions

torch-imle Concise and self-contained PyTorch library implementing the I-MLE gradient estimator proposed in our NeurIPS 2021 paper Implicit MLE: Backp

UCL Natural Language Processing 249 Jan 03, 2023
🙄 Difficult algorithm, Simple code.

🎉TensorFlow2.0-Examples🎉! "Talk is cheap, show me the code." ----- Linus Torvalds Created by YunYang1994 This tutorial was designed for easily divin

1.7k Dec 25, 2022
Finite-temperature variational Monte Carlo calculation of uniform electron gas using neural canonical transformation.

CoulombGas This code implements the neural canonical transformation approach to the thermodynamic properties of uniform electron gas. Building on JAX,

FermiFlow 9 Mar 03, 2022
PyTorch implementation of DreamerV2 model-based RL algorithm

PyDreamer Reimplementation of DreamerV2 model-based RL algorithm in PyTorch. The official DreamerV2 implementation can be found here. Features ... Run

118 Dec 15, 2022
MAterial del programa Misión TIC 2022

Mision TIC 2022 Esta iniciativa, aparece como respuesta frente a los retos de la Cuarta Revolución Industrial, y tiene como objetivo la formación de 1

6 May 25, 2022
Framework web SnakeServer.

SnakeServer - Framework Web 🐍 Documentação oficial do framework SnakeServer. Conteúdo Sobre Como contribuir Enviar relatórios de segurança Pull reque

Jaedson Silva 0 Jul 21, 2022
Data and codes for ACL 2021 paper: Towards Emotional Support Dialog Systems

Emotional-Support-Conversation Copyright © 2021 CoAI Group, Tsinghua University. All rights reserved. Data and codes are for academic research use onl

126 Dec 21, 2022