Unofficial PyTorch Implementation of WaveGrad 2

Last update: Nov 29, 2022

Overview

WaveGrad 2 — Unofficial PyTorch Implementation

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
An unofficial PyTorch + Lightning implementation of WaveGrad 2 (Chen et al., JHU and Google Brain).
Audio Samples: https://mindslab-ai.github.io/wavegrad2/

TODO

More training for WaveGrad-Base setup
Checkpoint release
WaveGrad-Large Decoder
Inference by reduced sampling steps

Requirements

PyTorch
PyTorch-Lightning==1.2.10
The full requirements are listed in requirements.txt.
We also provide a Docker setup in Dockerfile.

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
AISHELL-3: a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
etc.

We take LJSpeech as an example hereafter.

Preprocessing

Adjust preprocess.yaml, especially the path section.

path:
  corpus_path: '/DATA1/LJSpeech-1.1' # LJSpeech corpus path
  lexicon_path: 'lexicon/librispeech-lexicon.txt'
  raw_path: './raw_data/LJSpeech'
  preprocessed_path: './preprocessed_data/LJSpeech'

Run prepare_align.py to prepare the corpus for alignment.

python prepare_align.py -c preprocess.yaml

Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech and AISHELL-3 datasets are provided here. Unzip the files into preprocessed_data/LJSpeech/TextGrid/.
After that, run preprocess.py.

python preprocess.py -c preprocess.yaml
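Before running preprocess.py, it can be worth confirming that every utterance has an alignment. The sketch below is not part of the repository. It assumes that raw_path holds one .wav file per utterance and that the unzipped alignments are .TextGrid files sharing the same base name somewhere under the TextGrid directory; both the layout and the helper name find_missing_alignments are illustrative assumptions.

```python
from pathlib import Path


def find_missing_alignments(raw_dir, textgrid_dir):
    """Return sorted base names of .wav files that have no matching .TextGrid.

    Assumption: alignment files are named <basename>.TextGrid and may sit in
    any subdirectory of textgrid_dir. This layout is a guess, not taken from
    the repository.
    """
    wav_names = {p.stem for p in Path(raw_dir).rglob("*.wav")}
    aligned_names = {p.stem for p in Path(textgrid_dir).rglob("*.TextGrid")}
    return sorted(wav_names - aligned_names)


if __name__ == "__main__":
    # Paths mirror the example values in preprocess.yaml.
    missing = find_missing_alignments(
        "./raw_data/LJSpeech",
        "./preprocessed_data/LJSpeech/TextGrid",
    )
    print(f"{len(missing)} utterances without a TextGrid")
    for name in missing[:10]:
        print("  ", name)
```

An empty result suggests the alignments and the raw audio line up; a long list usually means the archive was unzipped into a different directory than the one being checked.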

Alternatively, you can align the corpus yourself.
Download the official MFA package and run it to align the corpus, either with the pretrained English acoustic model:

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or by training an acoustic model on the corpus and aligning in one step:

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

Then run preprocess.py.

python preprocess.py -c preprocess.yaml

Training

Adjust hparameter.yaml, especially the train section.

train:
  batch_size: 12 # Dependent on GPU memory size
  adam:
    lr: 3e-4
    weight_decay: 1e-6
  decay:
    rate: 0.05
    start: 25000
    end: 100000
  num_workers: 16 # Dependent on CPU cores
  gpus: 2 # number of GPUs
  loss_rate:
    dur: 1.0

To train on another dataset, adjust the data section in hparameter.yaml.

data:
  lang: 'eng'
  text_cleaners: ['english_cleaners'] # korean_cleaners, english_cleaners, chinese_cleaners
  speakers: ['LJSpeech']
  train_dir: 'preprocessed_data/LJSpeech'
  train_meta: 'train.txt'  # relative path of metadata file from train_dir
  val_dir: 'preprocessed_data/LJSpeech'
  val_meta: 'val.txt'  # relative path of metadata file from val_dir
  lexicon_path: 'lexicon/librispeech-lexicon.txt'

Run trainer.py.

python trainer.py

To resume training from a checkpoint, see the command-line options defined by the parser.

import argparse

parser = argparse.ArgumentParser()
# Epoch number of the checkpoint to resume from
parser.add_argument('-r', '--resume_from', type=int, required=False,
                    help="Resume Checkpoint epoch number")
# Set this flag after a significant change
parser.add_argument('-s', '--restart', action="store_true", required=False,
                    help="Significant change occurred, use this")
# Start from the EMA checkpoint
parser.add_argument('-e', '--ema', action="store_true", required=False,
                    help="Start from ema checkpoint")
args = parser.parse_args()

During training, the TensorBoard logger records the loss, spectrograms, and audio.

tensorboard --logdir=./tensorboard --bind_all

Inference

Run inference.py.

python inference.py -c <checkpoint_path> --text <'text'>

Or you can run inference.ipynb.

Checkpoint file will be released!

Note

Since this repo is an unofficial implementation and the WaveGrad 2 paper omits several details, slight differences from the paper may exist.
The modifications and arbitrary choices are listed below.

Normal LSTM without ZoneOut is applied for encoder.
g2p_en is applied instead of Google's undisclosed G2P.
Trained with the LJSpeech dataset instead of Google's proprietary dataset.
- Due to dataset replacement, output audio's sampling rate becomes 22.05kHz instead of 24kHz.
MT + SpecAug are not implemented.
Hyperparameters:
- train.batch_size: 12 for 2 A100 (40GB) GPUs
- train.adam.lr: 3e-4 and train.adam.weight_decay: 1e-6
- train.decay: learning rate decay is applied during training
- train.loss_rate: 1 as total_loss = 1 * L1_loss + 1 * duration_loss
- ddpm.ddpm_noise_schedule: torch.linspace(1e-6, 0.01, hparams.ddpm.max_step)
- encoder.channel is reduced to 512 from 1024 or 2048
Current sample page only contains samples from WaveGrad-Base decoder.
See the TODO list above for the remaining work.
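The ddpm_noise_schedule entry in the list above is a linear ramp of beta values. As a rough illustration of what that implies, the sketch below rebuilds the same ramp in plain Python (so it runs without PyTorch) and derives the cumulative signal level from the DDPM formulation of Ho et al., alpha_bar_t = prod_{s<=t} (1 - beta_s). The value max_step = 1000 is an assumption for illustration only, since hparams.ddpm.max_step is not given here, and linear_betas and cumulative_alphas are hypothetical helper names. Whether the repository uses these quantities in exactly this form is not confirmed.

```python
def linear_betas(beta_start=1e-6, beta_end=0.01, max_step=1000):
    """Evenly spaced betas with both endpoints included, like torch.linspace."""
    if max_step == 1:
        return [beta_start]
    stride = (beta_end - beta_start) / (max_step - 1)
    return [beta_start + i * stride for i in range(max_step)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s), as in DDPM."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_betas()  # max_step=1000 is an assumed value
alpha_bars = cumulative_alphas(betas)
print("first beta:", betas[0], "last beta:", betas[-1])
print("signal level at the final step, sqrt(alpha_bar_T):", alpha_bars[-1] ** 0.5)
```

Because every beta is positive, alpha_bar decreases monotonically with the step index, so later steps correspond to noisier inputs.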

Tree

.
├── Dockerfile
├── README.md
├── dataloader.py
├── docs
│   ├── spec.png
│   ├── tb.png
│   └── tblogger.png
├── hparameter.yaml
├── inference.py
├── lexicon
│   ├── librispeech-lexicon.txt
│   └── pinyin-lexicon-r.txt
├── lightning_model.py
├── model
│   ├── base.py
│   ├── downsampling.py
│   ├── encoder.py
│   ├── gaussian_upsampling.py
│   ├── interpolation.py
│   ├── layers.py
│   ├── linear_modulation.py
│   ├── nn.py
│   ├── resampling.py
│   ├── upsampling.py
│   └── window.py
├── prepare_align.py
├── preprocess.py
├── preprocess.yaml
├── preprocessor
│   ├── ljspeech.py
│   └── preprocessor.py
├── text
│   ├── __init__.py
│   ├── cleaners.py
│   ├── cmudict.py
│   ├── numbers.py
│   └── symbols.py
├── trainer.py
├── utils
│   ├── mel.py
│   ├── stft.py
│   ├── tblogger.py
│   └── utils.py
└── wavegrad2_tester.ipynb

Author

This code is implemented by

Seungu Han at MINDs Lab
Junhyeok Lee at MINDs Lab

Special thanks to

Kang-wook Kim at MINDs Lab
Wonbin Jung at MINDs Lab
Sang Hoon Woo at MINDs Lab

References

Chen et al., WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis
Chen et al., WaveGrad: Estimating Gradients for Waveform Generation
Ho et al., Denoising Diffusion Probabilistic Models
Shen et al., Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

This implementation uses code from the following repositories:

The webpage for the audio samples uses a template from:

WaveGrad2 Official Github.io

The audio samples on our webpage (TBD) are partially derived from:

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
WaveGrad2 Official Github.io
