Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow


Table of Contents

  • General description
  • Getting Started
  • Data preprocessing
  • Training
  • Running Tensorboard
  • Inference
  • Parameters
  • Contributing

General description

This repository contains sample code for Tacotron 2 and WaveGlow with multi-speaker and emotion embeddings, together with a script for data preprocessing.
Checkpoints and code originate from the following sources:

Done:

  • took all the best code parts from the 5 sources above
  • cleaned up the code and fixed some mistakes
  • changed the code structure
  • added multi-speaker and emotion embeddings
  • added preprocessing
  • moved all configs from command-line args into experiment config files under the configs/experiments folder
  • added a restore / checkpointing mechanism
  • added TensorBoard support
  • made the decoder work with n > 1 frames per step
  • made training work in FP16

TODO:

  • make it work with pytorch-1.4.0
  • add training on multiple AWS spot instances

Getting Started

The following section lists the requirements in order to start training the Tacotron 2 and WaveGlow models.

Clone the repository:

git clone https://github.com/ide8/tacotron2  
cd tacotron2
PROJDIR=$(pwd)
export PYTHONPATH=$PROJDIR:$PYTHONPATH

Requirements

This repository contains a Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

Setup

Build an image from the Dockerfile:

docker build --tag taco .

Run docker container:

docker run --shm-size=8G --runtime=nvidia \
  -v /absolute/path/to/your/code:/app \
  -v /absolute/path/to/your/training_data:/mnt/train \
  -v /absolute/path/to/your/logs:/mnt/logs \
  -v /absolute/path/to/your/raw-data:/mnt/raw-data \
  -v /absolute/path/to/your/pretrained-checkpoint:/mnt/pretrained \
  --detach taco sleep inf

Check container id:

docker ps

Select the container ID of the image with tag taco and log into the container with:

docker exec -it container_id bash

Code structure description

Folders tacotron2 and waveglow contain scripts for the Tacotron 2 and WaveGlow models, respectively, and consist of:

  • /model.py - model architecture
  • /data_function.py - data loading functions
  • /loss_function.py - loss function

Folder common contains common layers for both models (common/layers.py), utils (common/utils.py) and audio processing (common/audio_processing.py and common/stft.py).

Folder router is used by the training script to select the appropriate model.

In the root directory:

  • train.py - script for model training
  • preprocess.py - performs audio processing and creates training and validation datasets
  • inference.ipynb - notebook for running inference

Folder configs contains __init__.py with all parameters needed for training and data processing. Folder configs/experiments holds the experiment configs; waveglow.py and tacotron2.py are provided as examples for WaveGlow and Tacotron 2. When training or data processing starts, parameters are copied from your experiment file (in our case, waveglow.py or tacotron2.py) to __init__.py, from which they are used by the system.

Data preprocessing

Preparing for data preprocessing

  1. For each speaker you need a folder named after the speaker, containing a wavs folder and a metadata.csv file with lines in the format file_name.wav|text.
  2. All necessary parameters for preprocessing should be set in configs/experiments/waveglow.py or in configs/experiments/tacotron2.py, in the class PreprocessingConfig.
  3. If you're running preprocessing for the first time, set the start_from_preprocessed flag to False. preprocess.py trims silence at the beginning and end of audio files using the PreprocessingConfig.top_db threshold, and applies an ffmpeg command to convert all wavs in the dataset to mono with the same sampling rate and bit rate.
  4. It saves a wavs folder with the processed audio files and a data.csv file to PreprocessingConfig.output_directory, with lines in the format path|text|speaker_name|speaker_id|emotion|text_len|duration.
  5. Trimming and the ffmpeg command are applied only to speakers whose process_audio flag is True. Speakers whose emotion_present flag is False are treated as having the emotion neutral-normal.
  6. You won't need start_from_preprocessed = False once the preprocessing script has finished; the only exception is when new raw data comes in.
  7. Once start_from_preprocessed is set to True, the script loads data.csv (created by the start_from_preprocessed = False run) and forms train.txt and val.txt from it.
  8. Main PreprocessingConfig parameters (see the sketch after this list):
    1. cpus - number of cores used by the batch generator
    2. sr - sampling rate for reading and writing audio
    3. emo_id_map - dictionary mapping emotion names to emotion_id values
    4. data['path'] - path to the folder named after the speaker, containing a wavs folder and metadata.csv with lines in the format file_name.wav|text|emotion (emotion is optional)
  9. Preprocessing script forms training and validation datasets in the following way:
    1. selects rows whose audio duration and text length are less than or equal to those of the speaker PreprocessingConfig.limit_by (this step is needed for a proper batch size)
    2. if such a speaker is not present, it selects rows within PreprocessingConfig.text_limit and PreprocessingConfig.dur_limit; the lower limit for audio duration is defined by PreprocessingConfig.minimum_viable_dur
    3. to use the same batch size as in the original NVIDIA implementation, set PreprocessingConfig.text_limit to linda_jonson
    4. splits dataset randomly by ratio train : val = 0.95 : 0.05
    5. if a speaker's training set is larger than PreprocessingConfig.n, samples n rows
    6. saves train.txt and val.txt to PreprocessingConfig.output_directory
    7. saves emotion_coefficients.json and speaker_coefficients.json with coefficients for loss balancing (used by train.py).
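
As a rough illustration of items 8 and 9 above, here is what a PreprocessingConfig in configs/experiments/tacotron2.py might look like. This is only a sketch: the field names follow the parameters described above, but all values, paths and speaker entries are hypothetical, and exactly where the per-speaker process_audio / emotion_present flags live is an assumption.

# Hypothetical sketch of a PreprocessingConfig in configs/experiments/tacotron2.py.
# Field names follow the parameters described above; values and paths are examples only.
class PreprocessingConfig:
    cpus = 8                           # number of cores for the batch generator
    sr = 22050                         # sampling rate for reading and writing audio
    top_db = 40                        # silence-trimming threshold, dB
    start_from_preprocessed = False    # set to True after the first run
    output_directory = '/mnt/train'    # where wavs/, data.csv, train.txt and val.txt are written
    limit_by = 'linda_jonson'          # speaker whose durations / text lengths set the limits
    text_limit = 190                   # fallback max text length (characters)
    dur_limit = 10.0                   # fallback max audio duration, seconds
    minimum_viable_dur = 0.5           # lower bound on audio duration, seconds
    n = 15000                          # max training rows sampled per speaker
    emo_id_map = {'neutral-normal': 0, 'happy': 1, 'angry': 2}  # emotion name -> emotion_id
    data = [
        {
            'path': '/mnt/raw-data/speaker_a',  # folder with a wavs/ dir and metadata.csv
            'process_audio': True,              # apply trimming and the ffmpeg command
            'emotion_present': False,           # treated as neutral-normal
        },
    ]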

Run preprocessing

Since both waveglow.py and tacotron2.py contain the class PreprocessingConfig, the training and validation datasets can be produced by running either of them:

python preprocess.py --exp tacotron2

or

python preprocess.py --exp waveglow

Training

Preparing for training

Tacotron 2

In configs/experiment/tacotron2.py, in the class Config, set the following (an illustrative sketch follows the list):

  1. training_files and validation_files - paths to train.txt, val.txt;
  2. tacotron_checkpoint - path to a pretrained Tacotron 2, if one exists (we were able to restore WaveGlow from NVIDIA, but the Tacotron 2 code was edited to add speakers and emotions, so Tacotron 2 needs to be trained from scratch);
  3. speaker_coefficients - path to speaker_coefficients.json;
  4. emotion_coefficients - path to emotion_coefficients.json;
  5. output_directory - path for writing logs and checkpoints;
  6. use_emotions - flag indicating emotions usage;
  7. use_loss_coefficients - flag indicating loss scaling to compensate for possible data imbalance across both speakers and emotions; for loss balancing, set the paths to the JSONs with coefficients in emotion_coefficients and speaker_coefficients;
  8. model_name - "Tacotron2".
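
An illustrative sketch of these fields as they might appear in the Config class of configs/experiments/tacotron2.py; all paths below are placeholders, not values from this repository.

# Hypothetical example of the fields listed above; paths are placeholders.
class Config:
    model_name = "Tacotron2"
    training_files = '/mnt/train/train.txt'
    validation_files = '/mnt/train/val.txt'
    tacotron_checkpoint = None     # no pretrained multi-speaker checkpoint; train from scratch
    output_directory = '/mnt/logs/tacotron2'
    use_emotions = True            # enable emotion embeddings
    use_loss_coefficients = True   # balance loss across speakers and emotions
    speaker_coefficients = '/mnt/train/speaker_coefficients.json'
    emotion_coefficients = '/mnt/train/emotion_coefficients.json'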
  • Launch training
    • Single gpu:
      python train.py --exp tacotron2
      
    • Multi-GPU training:
      python -m multiproc train.py --exp tacotron2
      

WaveGlow:

In configs/experiment/waveglow.py, in the class Config set:

  1. training_files and validation_files - paths to train.txt, val.txt;
  2. waveglow_checkpoint - path to the pretrained WaveGlow checkpoint restored from NVIDIA. Download the checkpoint.
  3. output_directory - path for writing logs and checkpoints;
  4. use_emotions - False;
  5. use_loss_coefficients - False;
  6. model_name - "WaveGlow".
  • Launch training
    • Single gpu:
      python train.py --exp waveglow
      
    • Multi-GPU training:
      python -m multiproc train.py --exp waveglow
      

Running Tensorboard

Once your model has started training, you may want to monitor its progress:

docker ps

Select the container ID of the image with tag taco and run:

docker exec -it container_id bash

Start Tensorboard:

 tensorboard --logdir=path_to_folder_with_logs --host=0.0.0.0

Loss is written to TensorBoard:

Tensorboard Scalars

Audio samples together with attention alignments are saved to TensorBoard every Config.epochs_per_checkpoint epochs. Transcripts for the audio samples are listed in Config.phrases.

Tensorboard Audio

Inference

Run inference with the inference.ipynb notebook.

Run Jupyter Notebook:

jupyter notebook --ip 0.0.0.0 --port 6006 --no-browser --allow-root

output:

root@04096a19c266:/app# jupyter notebook --ip 0.0.0.0 --port 6006 --no-browser --allow-root
[I 09:31:25.393 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 09:31:25.393 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 09:31:25.395 NotebookApp] Serving notebooks from local directory: /app
[I 09:31:25.395 NotebookApp] The Jupyter Notebook is running at:
[I 09:31:25.395 NotebookApp] http://(04096a19c266 or 127.0.0.1):6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce
[I 09:31:25.395 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 09:31:25.398 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-15398-open.html
    Or copy and paste one of these URLs:
        http://(04096a19c266 or 127.0.0.1):6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce

Select the address with 127.0.0.1 and open it in a browser. In this case: http://127.0.0.1:6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce

This script takes text as input and runs Tacotron 2 and then WaveGlow inference to produce an audio file. It requires pre-trained checkpoints from Tacotron 2 and WaveGlow models, input text, speaker_id and emotion_id.

Change the paths to the pretrained Tacotron 2 and WaveGlow checkpoints in cell [2] of inference.ipynb.
Write the text to be synthesized in cell [7] of inference.ipynb.
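
Roughly, the notebook wires the two models together as sketched below. The loading helpers, checkpoint paths and ids are placeholders (hypothetical, not functions from this repository), and the inference/infer calls assume the NVIDIA-style Tacotron 2 / WaveGlow interfaces this code derives from, extended with speaker and emotion ids.

import numpy as np
import torch
from scipy.io.wavfile import write

# Placeholders: how checkpoints are loaded and how text is converted to a symbol
# sequence is repo-specific; these helpers are hypothetical.
tacotron2 = load_tacotron2('/mnt/pretrained/tacotron2.pt')
waveglow = load_waveglow('/mnt/pretrained/waveglow.pt')

text = "Hello world."
speaker_id, emotion_id = 0, 0

sequence = torch.LongTensor(text_to_sequence(text))[None, :].cuda()

with torch.no_grad():
    # Tacotron 2 turns text (plus speaker / emotion ids) into a mel spectrogram...
    _, mel, _, _ = tacotron2.inference(sequence, speaker_id, emotion_id)
    # ...and WaveGlow turns the mel spectrogram into a waveform.
    audio = waveglow.infer(mel, sigma=0.666)

write('output.wav', 22050, audio[0].cpu().numpy().astype(np.float32))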

Parameters

In this section, we list the most important hyperparameters, together with their default values that are used to train Tacotron 2 and WaveGlow models.

Shared parameters

  • epochs - number of epochs (Tacotron 2: 1501, WaveGlow: 1001)
  • learning-rate - learning rate (Tacotron 2: 1e-3, WaveGlow: 1e-4)
  • batch-size - batch size (Tacotron 2: 64, WaveGlow: 11)
  • grad_clip_thresh - gradient clipping threshold (0.1); see the sketch after this list
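
As an illustration, grad_clip_thresh is the kind of value typically passed to PyTorch's clip_grad_norm_ inside the training step; a minimal sketch, not this repository's exact training loop:

import torch

def training_step(model, optimizer, loss, grad_clip_thresh=0.1):
    # Minimal sketch of one optimization step with gradient clipping
    # (illustrative only, not this repository's exact code).
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
    optimizer.step()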

Shared audio/STFT parameters

  • sampling-rate - sampling rate in Hz of input and output audio (22050)
  • filter-length - filter length for the STFT (1024)
  • hop-length - hop length for FFT, i.e., sample stride between consecutive FFTs (256)
  • win-length - window size for FFT (1024)
  • mel-fmin - lowest frequency in Hz (0.0)
  • mel-fmax - highest frequency in Hz (8000.0)

Tacotron parameters

  • anneal-steps - epochs at which to anneal the learning rate (500, 1000, 1500)
  • anneal-factor - factor by which to anneal the learning rate (0.1). These two parameters change the learning rate at the points defined in anneal-steps according to:
    learning_rate = learning_rate * (anneal_factor ** p),
    where p starts at 0 and is incremented by 1 each time an anneal step is reached (see the example after this list).
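
For example, with the defaults above (base learning rate 1e-3, anneal-factor 0.1, anneal-steps 500/1000/1500), the learning rate becomes 1e-4 at epoch 500, 1e-5 at epoch 1000 and 1e-6 at epoch 1500. A minimal sketch of the rule, not the repository's exact code:

def annealed_lr(base_lr, anneal_steps, anneal_factor, epoch):
    # p counts how many anneal steps have already been reached.
    p = sum(1 for step in anneal_steps if epoch >= step)
    return base_lr * (anneal_factor ** p)

for epoch in (0, 500, 1000, 1500):
    print(epoch, annealed_lr(1e-3, (500, 1000, 1500), 0.1, epoch))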

WaveGlow parameters

  • segment-length - segment length of the input audio processed by the neural network (8000). Before being passed to the network, audio is padded or cropped to segment-length (see the sketch after this list).
  • wn_config - dictionary with parameters of the affine coupling layers; contains n_layers, n_channels and kernel_size.
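
A minimal sketch of the pad-or-crop idea, illustrative only and not the repository's exact data loader:

import torch
import torch.nn.functional as F

def fit_to_segment(audio: torch.Tensor, segment_length: int = 8000) -> torch.Tensor:
    # Illustrative pad-or-crop of a 1-D waveform to a fixed segment length.
    if audio.size(0) >= segment_length:
        # Take a random window of segment_length samples.
        start = torch.randint(0, audio.size(0) - segment_length + 1, (1,)).item()
        return audio[start:start + segment_length]
    # Otherwise zero-pad at the end up to segment_length.
    return F.pad(audio, (0, segment_length - audio.size(0)))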

Contributing

If you've ever wanted to contribute to open source, and a great cause, now is your chance!

See the contributing docs for more information.
