Official implementation of FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 @ ICASSP 2021

Last update: Sep 28, 2022

Overview

FCL-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech synthesis (ICASSP 2021) Paper | Demo

Block diagram of FCL-taco2, where the decoder generates mel-spectrograms in AR mode within each phoneme and is shared for all phonemes.

💬 Huawei Noah's Ark Lab is recruiting interns on speech processing fields, if you're interested, you're welcome to contact Dr. Deng: [email protected]

Training and inference scripts for FCL-taco2

Environment

python 3.6.10
torch 1.3.1
chainer 6.0.0
espnet 8.0.0
apex 0.1
numpy 1.19.1
kaldiio 2.15.1
librosa 0.8.0

Training and inference:

Step1. Data preparation & preprocessing

Download LJSpeech
Unpack downloaded LJSpeech-1.1.tar.bz2 to /xx/LJSpeech-1.1
Obtain the forced alignment information by using Montreal forced aligner tool. Or you can download our alignment results, then unpack it to /xx/TextGrid
Preprocess the dataset to extract mel-spectrograms, phoneme duration, pitch, energy and phoneme sequence by:
```
 python preprocessing.py --data-root /xx/LJSpeech-1.1 --textgrid-root /xx/TextGrid
```

Step2. Model training

Training teacher model FCL-taco2-T:
```
 ./teacher_model_training.sh
```
Training student model FCL-taco2-S:
```
 ./student_model_training.sh
```
Parallel-WaveGAN vocoder training: follow instructions at here. You can also download the pre-trained PWG vocoder, and put the PWG model under the directory "vocoder".

Step3. Model evaluation

FCL-taco2-T evaluation:
```
 ./inference_teacher.sh
```
FCL-taco2-S evaluation:
```
 ./inference_student.sh
```

Citation

If the code is used in your research, please star our repo and cite our paper:

@inproceedings{wang2021fcl,
  title={Fcl-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech Synthesis},
  author={Wang, Disong and Deng, Liqun and Zhang, Yang and Zheng, Nianzu and Yeung, Yu Ting and Chen, Xiao and Liu, Xunying and Meng, Helen},
  booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5714--5718},
  year={2021},
  organization={IEEE}
}

Official implementation of FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 @ ICASSP 2021

Related tags

Overview

FCL-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech synthesis (ICASSP 2021) Paper | Demo

💬 Huawei Noah's Ark Lab is recruiting interns on speech processing fields, if you're interested, you're welcome to contact Dr. Deng: [email protected]

Training and inference scripts for FCL-taco2

Environment

Training and inference:

Citation

Owner

Disong Wang

Official Code for VideoLT: Large-scale Long-tailed Video Recognition (ICCV 2021)

Blender Add-on that sets a Material's Base Color to one of Pantone's Colors of the Year

Accepted at ICCV-2021: Workshop on Computer Vision for Automated Medical Diagnosis (CVAMD)

9th place solution

CharacterGAN: Few-Shot Keypoint Character Animation and Reposing

[NeurIPS 2021] A weak-shot object detection approach by transferring semantic similarity and mask prior.

[ICCV 2021] Deep Hough Voting for Robust Global Registration

An implementation of the efficient attention module.

Multi-modal Content Creation Model Training Infrastructure including the FACT model (AI Choreographer) implementation.

TrackFormer: Multi-Object Tracking with Transformers

SemEval2022 Patronizing and Condescending Language (PCL) Detection

DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation

Lux AI environment interface for RLlib multi-agents

Manage the availability of workspaces within Frappe/ ERPNext (sidebar) based on user-roles

Training Structured Neural Networks Through Manifold Identification and Variance Reduction

Fast EMD for Python: a wrapper for Pele and Werman's C++ implementation of the Earth Mover's Distance metric

Dungeons and Dragons randomized content generator

Some simple programs built in Python: webcam with cv2 that detects eyes and face, with grayscale filter

Unsupervised Learning of Video Representations using LSTMs

Training DiffWave using variational method from Variational Diffusion Models.