Refactored version of FastSpeech2

Last update: May 26, 2022

Overview

FastSpeech2

This repository is a refactored version of ming024's FastSpeech2 implementation. I focused on restructuring the code to fit my use cases and on parallelizing the pre-processing code. I also wrote an installation guide for the latest version of MFA (Montreal Forced Aligner).

Installation

Tested on Python 3.8, Ubuntu 20.04
- Note: installing MFA requires Miniconda.
- If you run MFA on Ubuntu 16.04 or earlier, you will face a compile error.
In your system
- To install pyworld, run "sudo apt-get install python3.x-dev" (x is your Python minor version).
- To install sndfile, run "sudo apt-get install libsndfile-dev".
- To use MFA, run "sudo apt-get install libopenblas-base".
Install requirements

# install pytorch_sound
pip install git+https://github.com/appleholic/pytorch_sound
pip install -e .

Download datasets

VCTK
- Download the dataset from https://datashare.is.ed.ac.uk/handle/10283/2651
- Move it to "./data" and extract the compressed file.
  - If you want to store the dataset in another directory, you must change the paths in the configuration files.
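
If the dataset is stored outside "./data", the paths in the JSON files under configs/ have to be updated. The following is a minimal sketch of doing that in bulk, assuming the configs are plain JSON files that store paths as string values. The helper names `rewrite_paths` and `rewrite_config` are hypothetical and not part of this repository; check the actual key names in the configuration files before relying on it.

```python
import json


def rewrite_paths(node, old_prefix, new_prefix):
    """Return a copy of a parsed JSON value with old_prefix swapped for new_prefix.

    Only string values that start with old_prefix are changed; keys are left alone.
    """
    if isinstance(node, dict):
        return {k: rewrite_paths(v, old_prefix, new_prefix) for k, v in node.items()}
    if isinstance(node, list):
        return [rewrite_paths(v, old_prefix, new_prefix) for v in node]
    if isinstance(node, str) and node.startswith(old_prefix):
        return new_prefix + node[len(old_prefix):]
    return node


def rewrite_config(config_path, old_prefix, new_prefix):
    """Rewrite one JSON config file in place (hypothetical helper)."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config = rewrite_paths(config, old_prefix, new_prefix)
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```

For example, `rewrite_config("configs/vctk_prepare_align.json", "./data/", "/mnt/datasets/")` would repoint every value that begins with "./data/", where "/mnt/datasets/" is only a placeholder directory. Keeping the trailing slash in both prefixes avoids matching unrelated values such as "./database".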
LibriTTS
- To be updated

Install MFA
- Follow the installation guide on the MFA website.
- Additional installation steps
  - mfa thirdparty download
  - mfa download acoustic english
Pre-trained checkpoint
- VCTK, 400k steps : Google Drive Link

Preprocess (VCTK case)

Prepare MFA

python fastspeech2/scripts/prepare_align.py configs/vctk_prepare_align.json

Run MFA for making alignments

# Set the number of threads for MFA at the end of the command: "-j [number of threads]"
mfa align data/fastspeech2/vctk lexicons/librispeech-lexicon.txt english data/fastspeech2/vctk-pre -j 24
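
The thread count passed to "-j" depends on the machine. As a rough sketch, the same command can be assembled in Python with the thread count taken from the number of available CPUs. The function name `build_mfa_command` is hypothetical, and using every CPU is an assumption rather than a recommendation from MFA.

```python
import os
import shlex


def build_mfa_command(num_jobs=None):
    """Build the `mfa align` command from this guide as an argument list."""
    if num_jobs is None:
        # os.cpu_count() can return None, so fall back to a single job.
        num_jobs = os.cpu_count() or 1
    return [
        "mfa", "align",
        "data/fastspeech2/vctk",             # corpus directory
        "lexicons/librispeech-lexicon.txt",  # pronunciation lexicon
        "english",                           # acoustic model name
        "data/fastspeech2/vctk-pre",         # output directory
        "-j", str(num_jobs),
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it in a shell.
    print(shlex.join(build_mfa_command()))
```

The list form can also be passed directly to `subprocess.run(..., check=True)` once the printed command looks right.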

Feature preprocessing

python fastspeech2/scripts/preprocess.py configs/vctk_preprocess.json

Train

Multi-speaker FastSpeech2

python fastspeech2/scripts/train.py configs/fastspeech2_vctk_tts.json

To change the training parameters of FastSpeech2, check the training code and set the corresponding options in the configuration file.
- train code : fastspeech2/scripts/train.py
- config : configs/fastspeech2_vctk_tts.json
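
One way to try different options without editing the original file is to write a modified copy of the config and train from that copy. The sketch below assumes the config is plain JSON; the helpers `set_by_dotted_key` and `write_variant` are hypothetical, and any option name passed to them must be taken from the training code, since the available options are not listed here.

```python
import json


def set_by_dotted_key(config, dotted_key, value):
    """Set config["a"]["b"] = value for dotted_key "a.b", creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


def write_variant(src_path, dst_path, overrides):
    """Copy a JSON config to dst_path with the given overrides applied."""
    with open(src_path, encoding="utf-8") as f:
        config = json.load(f)
    for dotted_key, value in overrides.items():
        set_by_dotted_key(config, dotted_key, value)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```

After calling `write_variant("configs/fastspeech2_vctk_tts.json", "configs/my_run.json", {...})` with real option names, the new file (here the placeholder name "configs/my_run.json") can be passed to fastspeech2/scripts/train.py in place of the original config.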

FastSpeech2 with reference encoder (To be updated)

Synthesize

Multi-speaker model

In code

from fastspeech2.inference import Inferencer
from speech_interface.interfaces.hifi_gan import InterfaceHifiGAN

# Inferencer arguments (define chk_path, lexicon_path and device before this call)
# chk_path: str, lexicon_path: str, device: str = 'cuda'
inferencer = Inferencer(chk_path=chk_path, lexicon_path=lexicon_path, device=device)

# Initialize the HiFi-GAN vocoder
interface = InterfaceHifiGAN(model_name='hifi_gan_v1_universal', device='cuda')

# tts arguments
# text: str, speaker: int = 0, pitch_control: float = 1., energy_control: float = 1., duration_control: float = 1.
txt = 'Hello, I am a programmer.'
mel_spectrogram = inferencer.tts(txt, speaker=0)

# Reconstruct the waveform from the mel spectrogram with HiFi-GAN
pred_wav = interface.decode(mel_spectrogram.transpose(1, 2)).squeeze()

# If you test in a Jupyter notebook
from IPython.display import Audio
Audio(pred_wav.cpu().numpy(), rate=22050)
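
Outside a notebook, the waveform can be written to disk instead of played inline. The following sketch uses only the Python standard library and assumes `pred_wav` holds mono float samples in [-1, 1] at the 22050 Hz rate used in the example above. `save_wav` is a hypothetical helper, not part of this repository, and 16-bit PCM is an assumed output format.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=22050):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip values outside the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16 bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))
```

With the variables from the example above, this would be called as `save_wav(pred_wav.cpu().numpy(), 'output.wav')`, where 'output.wav' is a placeholder path. The per-sample loop is simple rather than fast, which is acceptable for short utterances.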

On the command line

python fastspeech2/scripts/synthesize.py [TEXT] [OUTPUT PATH] [CHECKPOINT PATH] [LEXICON PATH] [[DEVICE]] [[SPEAKER]]

Reference encoder (not updated)

Reference

ming024/FastSpeech2


Owner

ILJI CHOI
