Implementation of Google Brain's WaveGrad high-fidelity vocoder

Overview

alt-text-1

WaveGrad

Implementation (PyTorch) of Google Brain's high-fidelity WaveGrad vocoder (paper). First implementation on GitHub with high-quality generation for 6-iterations.

Status

  • Documented API.
  • High-fidelity generation.
  • Multi-iteration inference support (stable for low iterations).
  • Stable and fast training with mixed-precision support.
  • Distributed training support.
  • Training also successfully runs on a single 12GB GPU with batch size 96.
  • CLI inference support.
  • Flexible architecture configuration for your own data.
  • Estimated RTF on popular GPU and CPU devices (see below).
  • 100- and lower-iteration inferences are faster than real-time on RTX 2080 Ti. 6-iteration inference is faster than one reported in the paper.
  • Parallel grid search for the best noise schedule.
  • Uploaded generated samples for different number of iterations (see generated_samples folder).
  • Pretrained checkpoint on 22KHz LJSpeech dataset with noise schedules.

Real-time factor (RTF)

Number of parameters: 15.810.401

Model Stable RTX 2080 Ti Tesla K80 Intel Xeon 2.3GHz*
1000 iterations + 9.59 - -
100 iterations + 0.94 5.85 -
50 iterations + 0.45 2.92 -
25 iterations + 0.22 1.45 -
12 iterations + 0.10 0.69 4.55
6 iterations + 0.04 0.33 2.09

*Note: Used an old version of Intel Xeon CPU.


About

WaveGrad is a conditional model for waveform generation through estimating gradients of the data density with WaveNet-similar sampling quality. This vocoder is neither GAN, nor Normalizing Flow, nor classical autoregressive model. The main concept of vocoder is based on Denoising Diffusion Probabilistic Models (DDPM), which utilize Langevin dynamics and score matching frameworks. Furthemore, comparing to classic DDPM, WaveGrad achieves super-fast convergence (6 iterations and probably lower) w.r.t. Langevin dynamics iterative sampling scheme.


Installation

  1. Clone this repo:
git clone https://github.com/ivanvovk/WaveGrad.git
cd WaveGrad
  1. Install requirements:
pip install -r requirements.txt

Training

1 Preparing data

  1. Make train and test filelists of your audio data like ones included into filelists folder.
  2. Make a configuration file* in configs folder.

*Note: if you are going to change hop_length for STFT, then make sure that the product of your upsampling factors in config is equal to your new hop_length.

2 Single and Distributed GPU training

  1. Open runs/train.sh script and specify visible GPU devices and path to your configuration file. If you specify more than one GPU the training will run in distributed mode.
  2. Run sh runs/train.sh

3 Tensorboard and logging

To track your training process run tensorboard by tensorboard --logdir=logs/YOUR_LOGDIR_FOLDER. All logging information and checkpoints will be stored in logs/YOUR_LOGDIR_FOLDER. logdir is specified in config file.

4 Noise schedule grid search

Once model is trained, grid search for the best schedule* for a needed number of iterations in notebooks/inference.ipynb. The code supports parallelism, so you can specify more than one number of jobs to accelerate the search.

*Note: grid search is necessary just for a small number of iterations (like 6 or 7). For larger number just try Fibonacci sequence benchmark.fibonacci(...) initialization: I used it for 25 iteration and it works well. From good 25-iteration schedule, for example, you can build a higher-order schedule by copying elements.

Noise schedules for pretrained model
  • 6-iteration schedule was obtained using grid search. After, based on obtained scheme, by hand, I found a slightly better approximation.
  • 7-iteration schedule was obtained in the same way.
  • 12-iteration schedule was obtained in the same way.
  • 25-iteration schedule was obtained using Fibonacci sequence benchmark.fibonacci(...).
  • 50-iteration schedule was obtained by repeating elements from 25-iteration scheme.
  • 100-iteration schedule was obtained in the same way.
  • 1000-iteration schedule was obtained in the same way.

Inference

CLI

Put your mel-spectrograms in some folder. Make a filelist. Then run this command with your own arguments:

sh runs/inference.sh -c <your-config> -ch <your-checkpoint> -ns <your-noise-schedule> -m <your-mel-filelist> -v "yes"

Jupyter Notebook

More inference details are provided in notebooks/inference.ipynb. There you can also find how to set a noise schedule for the model and make grid search for the best scheme.


Other

Generated audios

Examples of generated audios are provided in generated_samples folder. Quality degradation between 1000-iteration and 6-iteration inferences is not noticeable if found the best schedule for the latter.

Pretrained checkpoints

You can find a pretrained checkpoint file* on LJSpeech (22KHz) via this Google Drive link.

*Note: uploaded checkpoint is a dict with a single key 'model'.


Important details, issues and comments

  • During training WaveGrad uses a default noise schedule with 1000 iterations and linear scale betas from range (1e-6, 0.01). For inference you can set another schedule with less iterations. Tune betas carefully, the output quality really highly depends on it.
  • By default model runs in a mixed-precision way. Batch size is modified compared to the paper (256 -> 96) since authors trained their model on TPU.
  • After ~10k training iterations (1-2 hours) on a single GPU the model performs good generation for 50-iteration inference. Total training time is about 1-2 days (for absolute convergence).
  • At some point training might start to behave weird and crazy (loss explodes), so I have introduced learning rate (LR) scheduling and gradient clipping. If loss explodes for your data, then try to decrease LR scheduler gamma a bit. It should help.
  • By default hop length of your STFT is equal 300 (thus total upsampling factor). Other cases are not tested, but you can try. Remember, that total upsampling factor should be still equal to your new hop length.

History of updates

  • (NEW: 10/24/2020) Huge update. Distributed training and mixed-precision support. More correct positional encoding. CLI support for inference. Parallel grid search. Model size significantly decreased.
  • New RTF info for NVIDIA Tesla K80 GPU card (popular in Google Colab service) and CPU Intel Xeon 2.3GHz.
  • Huge update. New 6-iteration well generated sample example. New noise schedule setting API. Added the best schedule grid search code.
  • Improved training by introducing smarter learning rate scheduler. Obtained high-fidelity synthesis.
  • Stable training and multi-iteration inference. 6-iteration noise scheduling is supported.
  • Stable training and fixed-iteration inference with significant background static noise left. All positional encoding issues are solved.
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Found no linear scaling (C=5000 from paper) of positional encoding (bug).
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Fixed positional encoding downscaling. Parallel segment sampling is replaced by full-mel sampling.
  • (RELEASE, first on GitHub). Parallel segment sampling and broken positional encoding downscaling. Bad quality with clicks from concatenation from parallel-segment generation.

References

Owner
Ivan Vovk
• Mathematics • Machine Learning • Speech technologies
Ivan Vovk
A model that attempts to learn and benefit from data collected on card counting.

A model that attempts to learn and benefit from data collected on card counting. A decision tree like model is built to win more often than loose and increase the bet of the player appropriately to c

1 Dec 17, 2021
an implementation of softmax splatting for differentiable forward warping using PyTorch

softmax-splatting This is a reference implementation of the softmax splatting operator, which has been proposed in Softmax Splatting for Video Frame I

Simon Niklaus 338 Dec 28, 2022
You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks.

AllSet This is the repo for our paper: You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks. We prepared all codes and a subse

Jianhao 51 Dec 24, 2022
Re-implement CycleGAN in Tensorlayer

CycleGAN_Tensorlayer Re-implement CycleGAN in TensorLayer Original CycleGAN Improved CycleGAN with resize-convolution Prerequisites: TensorLayer Tenso

89 Aug 15, 2022
Clean and readable code for Decision Transformer: Reinforcement Learning via Sequence Modeling

Minimal implementation of Decision Transformer: Reinforcement Learning via Sequence Modeling in PyTorch for mujoco control tasks in OpenAI gym

Nikhil Barhate 104 Jan 06, 2023
Code for Motion Representations for Articulated Animation paper

Motion Representations for Articulated Animation This repository contains the source code for the CVPR'2021 paper Motion Representations for Articulat

Snap Research 851 Jan 09, 2023
Code for technical report "An Improved Baseline for Sentence-level Relation Extraction".

RE_improved_baseline Code for technical report "An Improved Baseline for Sentence-level Relation Extraction". Requirements torch = 1.8.1 transformers

Wenxuan Zhou 74 Nov 29, 2022
Differential Privacy for Heterogeneous Federated Learning : Utility & Privacy tradeoffs

Differential Privacy for Heterogeneous Federated Learning : Utility & Privacy tradeoffs In this work, we propose an algorithm DP-SCAFFOLD(-warm), whic

19 Nov 10, 2022
Official repository for the ISBI 2021 paper Transformer Assisted Convolutional Neural Network for Cell Instance Segmentation

SegPC-2021 This is the official repository for the ISBI 2021 paper Transformer Assisted Convolutional Neural Network for Cell Instance Segmentation by

Datascience IIT-ISM 13 Dec 14, 2022
Hide screen when boss is approaching.

BossSensor Hide your screen when your boss is approaching. Demo The boss stands up. He is approaching. When he is approaching, the program fetches fac

Hiroki Nakayama 6.2k Jan 07, 2023
Earthquake detection via fiber optic cables using deep learning

Earthquake detection via fiber optic cables using deep learning Author: Fantine Huot Getting started Update the submodules After cloning the repositor

Fantine 4 Nov 30, 2022
Source codes of CenterTrack++ in 2021 ICME Workshop on Big Surveillance Data Processing and Analysis

MOT Tracked object bounding box association (CenterTrack++) New association method based on CenterTrack. Two new branches (Tracked Size and IOU) are a

36 Oct 04, 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 803 Dec 28, 2022
Using Hotel Data to predict High Value And Potential VIP Guests

Description Using hotel data and AI to predict high value guests and potential VIP guests. Hotel can leverage on prediction resutls to run more effect

HCG 12 Feb 14, 2022
EMNLP'2021: Simple Entity-centric Questions Challenge Dense Retrievers

EntityQuestions This repository contains the EntityQuestions dataset as well as code to evaluate retrieval results from the the paper Simple Entity-ce

Princeton Natural Language Processing 119 Sep 28, 2022
Official Pytorch implementation of "DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network" (CVPR'21)

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network Pytorch implementation for our DivCo. We propose a simple ye

64 Nov 22, 2022
A PyTorch implementation of PointRend: Image Segmentation as Rendering

PointRend A PyTorch implementation of PointRend: Image Segmentation as Rendering [arxiv] [Official Implementation: Detectron2] This repo for Only Sema

AhnDW 336 Dec 26, 2022
This code is an unofficial implementation of HiFiSinger.

HiFiSinger This code is an unofficial implementation of HiFiSinger. The algorithm is based on the following papers: Chen, J., Tan, X., Luan, J., Qin,

Heejo You 87 Dec 23, 2022
Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

ResDAVEnet-VQ Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech What is in this repo? M

Wei-Ning Hsu 21 Aug 23, 2022
Our CIKM21 Paper "Incorporating Query Reformulating Behavior into Web Search Evaluation"

Reformulation-Aware-Metrics Introduction This codebase contains source-code of the Python-based implementation of our CIKM 2021 paper. Chen, Jia, et a

xuanyuan14 5 Mar 05, 2022