PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop.

Last update: Dec 15, 2022

VoiceLoop

VoiceLoop is a neural text-to-speech (TTS) system that converts text to speech in voices sampled in the wild. Demo samples are listed under Quick Links.

Quick Links

Demo Samples
Quick Start
Setup
Training

Quick Start

Follow the instructions in Setup and then simply execute:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 13 --checkpoint models/vctk/bestmodel.pth

Results are placed in models/vctk/results. Two samples are written:

The generated sample, saved with a file name ending in gen_10.wav.
Its ground-truth (test) sample, saved with a file name ending in orig.wav.

You can also generate the same text but with a different speaker, specifically:

python generate.py  --npz data/vctk/numpy_features_valid/p318_212.npz --spkr 18 --checkpoint models/vctk/bestmodel.pth

This generates a sample of the same text in the second speaker's voice.

[Figure: attention plots for the two samples. X-axis: output time (acoustic samples). Y-axis: input (text/phonemes). Left: speaker 10. Right: speaker 14.]

Finally, free text is also supported:

python generate.py  --text "hello world" --spkr 1 --checkpoint models/vctk/bestmodel.pth
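To render the same sentence in several voices, the command above can be assembled programmatically. The sketch below only builds argument lists from the flags shown in this section (--text, --spkr, --checkpoint). The helper names build_generate_cmd and generate_for_speakers, and the speaker ids in the usage example, are illustrative assumptions and not part of the repository.

```python
import subprocess


def build_generate_cmd(text, spkr, checkpoint="models/vctk/bestmodel.pth"):
    """Return the argv list for one generate.py call (hypothetical helper)."""
    return [
        "python", "generate.py",
        "--text", text,
        "--spkr", str(spkr),
        "--checkpoint", checkpoint,
    ]


def generate_for_speakers(text, speakers, dry_run=True):
    """Build (and optionally run) one generate.py command per speaker id."""
    cmds = [build_generate_cmd(text, s) for s in speakers]
    if not dry_run:
        for cmd in cmds:
            # Run from the repository root so that generate.py resolves.
            subprocess.check_call(cmd)
    return cmds


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in generate_for_speakers("hello world", [1, 13, 18]):
        print(" ".join(cmd))
```

With dry_run left at True nothing is executed, which makes it easy to check the commands before starting a long batch.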

Setup

Requirements: Linux/OSX, Python 2.7 and PyTorch 0.1.12. Generation requires phonemizer; follow the setup instructions in that project. The current version of the code requires CUDA support for training. Generation can be done on the CPU.

git clone https://github.com/facebookresearch/loop.git
cd loop
pip install -r scripts/requirements.txt

Data

The data used to train the models in the paper can be downloaded via:

bash scripts/download_data.sh

The script downloads and preprocesses a subset of VCTK. This subset contains speakers with American accents.

The dataset was preprocessed using Merlin: from each audio clip, we extracted vocoder features using the WORLD vocoder. After downloading, the dataset will be located under the data subfolder as follows:

loop
└── data
    └── vctk
        ├── norm_info
        │   └── norm.dat
        ├── numpy_features
        │   ├── p294_001.npz
        │   ├── p294_002.npz
        │   └── ...
        └── numpy_features_valid
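
As a quick sanity check after downloading, the feature files can be counted per speaker. This is a minimal sketch that assumes file names follow the speaker_utterance.npz pattern seen in the listing (for example p294_001.npz); the function name utterances_per_speaker is hypothetical.

```python
import os
from collections import Counter


def utterances_per_speaker(folder):
    """Count .npz files per speaker prefix (the part before the first '_')."""
    counts = Counter()
    for name in os.listdir(folder):
        if name.endswith(".npz"):
            speaker = name.split("_", 1)[0]  # "p294_001.npz" -> "p294"
            counts[speaker] += 1
    return counts


if __name__ == "__main__":
    folder = "data/vctk/numpy_features_valid"
    if os.path.isdir(folder):
        for speaker, n in sorted(utterances_per_speaker(folder).items()):
            print("%s: %d files" % (speaker, n))
```

Files that do not end in .npz are ignored, so stray notes or logs in the folder do not affect the counts.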

The preprocessing pipeline can be run using the following script by Kyle Kastner: https://gist.github.com/kastnerkyle/cc0ac48d34860c5bb3f9112f4d9a0300.

Pretrained Models

Pretrained models can be downloaded via:

bash scripts/download_models.sh

After downloading, the models will be located under the models subfolder as follows:

loop
├── data
├── models
    ├── blizzard
    ├── vctk
    │   ├── args.pth
    │   └── bestmodel.pth
    └── vctk_alt

Update 10/25/2017: Single speaker model available in models/blizzard/

SPTK and WORLD

Finally, speech generation requires SPTK 3.9 and the WORLD vocoder, as in Merlin. To download the executables:

bash scripts/download_tools.sh

This results in the following subdirectories:

loop
├── data
├── models
├── tools
    ├── SPTK-3.9
    └── WORLD
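
Before generating or training, it can help to confirm that the three download scripts produced the layout shown in the listings. The sketch below checks a handful of paths taken from those listings; the list EXPECTED_PATHS and the function missing_paths are illustrative assumptions, and the list can be trimmed or extended as needed.

```python
import os

# Paths taken from the directory listings in this README, relative to the repo root.
EXPECTED_PATHS = [
    "data/vctk/norm_info/norm.dat",
    "data/vctk/numpy_features_valid",
    "models/vctk/args.pth",
    "models/vctk/bestmodel.pth",
    "tools/SPTK-3.9",
    "tools/WORLD",
]


def missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing (rerun the matching download script):")
        for p in missing:
            print("  " + p)
    else:
        print("All expected paths found.")
```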

Training

Single-Speaker

The single-speaker model is trained on Blizzard 2011. Data should be downloaded and prepared as described above. Once the data is ready, run:

python train.py --noise 1 --expName blizzard_init --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-5 --epochs 10

Then, continue training the model with:

python train.py --noise 1 --expName blizzard --seq-len 1600 --max-seq-len 1600 --data data/blizzard --nspk 1 --lr 1e-4 --checkpoint checkpoints/blizzard_init/bestmodel.pth --epochs 90

Multi-Speaker

To train a new model on VCTK, first train the model using a noise level of 4 and an input sequence length of 100:

python train.py --expName vctk --data data/vctk --noise 4 --seq-len 100 --epochs 90

Then, continue training the model using a noise level of 2, on full sequences:

python train.py --expName vctk_noise_2 --data data/vctk --checkpoint checkpoints/vctk/bestmodel.pth --noise 2 --seq-len 1000 --epochs 90
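
The two-stage schedule above can also be written down as data, which makes it easier to keep the stages in sync. This is a minimal sketch that only assembles the commands shown in this section; build_train_cmd and STAGES are hypothetical names, and the checkpoint path follows the checkpoints/<expName>/bestmodel.pth pattern visible in the commands above.

```python
def build_train_cmd(exp_name, data, noise, seq_len, epochs, resume_from=None):
    """Return the argv list for one train.py stage (hypothetical helper)."""
    cmd = [
        "python", "train.py",
        "--expName", exp_name,
        "--data", data,
        "--noise", str(noise),
        "--seq-len", str(seq_len),
        "--epochs", str(epochs),
    ]
    if resume_from is not None:
        # Resume from the best model of an earlier experiment.
        cmd += ["--checkpoint", "checkpoints/%s/bestmodel.pth" % resume_from]
    return cmd


# The two multi-speaker stages described in this section.
STAGES = [
    build_train_cmd("vctk", "data/vctk", noise=4, seq_len=100, epochs=90),
    build_train_cmd("vctk_noise_2", "data/vctk", noise=2, seq_len=1000,
                    epochs=90, resume_from="vctk"),
]

if __name__ == "__main__":
    # Print the commands; run them one after another from the repo root.
    for cmd in STAGES:
        print(" ".join(cmd))
```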

Citation

If you find this code useful in your research then please cite:

@article{taigman2017voice,
  title           = {VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop},
  author          = {Taigman, Yaniv and Wolf, Lior and Polyak, Adam and Nachmani, Eliya},
  journal         = {ArXiv e-prints},
  archivePrefix   = "arXiv",
  eprinttype      = {arxiv},
  eprint          = {1705.03122},
  primaryClass    = "cs.CL",
  year            = {2017},
  month           = October
}

License

Loop has a CC-BY-NC license.


Owner

Meta Archive
