Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.



SyntaSpeech builds on PortaSpeech (NeurIPS 2021) and adds three new features:

  1. We propose the Syntactic Graph Builder (Sec. 3.1) and the Syntactic Graph Encoder (Sec. 3.2), which together prove effective at extracting syntactic features that improve the prosody modeling and duration accuracy of the TTS model.
  2. We introduce Multi-Length Adversarial Training (Sec. 3.3), which replaces the flow-based post-net in PortaSpeech, speeding up inference and improving the naturalness of the generated audio.
  3. We support three datasets: LJSpeech (single-speaker English dataset), Biaobei (single-speaker Chinese dataset), and LibriTTS (multi-speaker English dataset).

Environments

conda create -n synta python=3.7
conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install dgl for graph neural networks; dgl-cu102 supports RTX 2080, dgl-cu113 supports RTX 3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install forced-alignment tools
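
After running the commands above, it can help to confirm that the key packages are importable before starting a long job. The following is a minimal sketch, not part of the repository: `missing_packages` is a hypothetical helper, and the package list simply mirrors the install commands (`torch`, `dgl`, `numpy`, `Cython`).

```python
import importlib.util

# Import names of the packages installed by the commands above (assumed list).
REQUIRED = ("torch", "dgl", "numpy", "Cython")


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it inside the activated `synta` environment; any package it reports as missing points to an install step that did not complete.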

Run SyntaSpeech!

Follow the steps below to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

For LibriTTS, download the raw dataset and process it with our data_gen modules. Detailed instructions can be found in docs/prepare_data.

Vocoder Preparation

We provide pre-trained vocoders for the three datasets: HiFi-GAN for LJSpeech and Biaobei, and ParallelWaveGAN for LibriTTS. Download them and unzip them into the checkpoints/ folder.

2. Training Example

You can then train SyntaSpeech on any of the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # train on LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # train on Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset # train on LibriTTS
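
The commands in this README differ only in the config path, the experiment name, and whether `--infer` is passed. As a rough sketch for scripting several runs, the helpers below assemble the same argument list and environment in Python. `build_command` and `build_env` are hypothetical names, not part of the repository; only the flags shown in this README are used.

```python
import os


def build_command(config, exp_name, infer=False, reset=True):
    """Assemble the argument list for tasks/run.py using the README's flags."""
    cmd = ["python", "tasks/run.py", "--config", config, "--exp_name", exp_name]
    if reset:
        cmd.append("--reset")
    if infer:
        cmd.append("--infer")
    return cmd


def build_env(gpu_ids="0"):
    """Copy the current environment and add the variables the README exports."""
    env = dict(os.environ)
    env["PYTHONPATH"] = "./"
    env["CUDA_VISIBLE_DEVICES"] = gpu_ids
    return env


if __name__ == "__main__":
    cmd = build_command("egs/tts/lj/synta.yaml", "lj_synta")
    print(" ".join(cmd))
    # To actually launch from the repository root (not executed here):
    # import subprocess
    # subprocess.run(cmd, env=build_env("0"), check=True)
```

Passing `infer=True` appends `--infer`, which matches the inference commands.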

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference on LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference on Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference on LibriTTS

Audio Demos

Audio samples from the paper can be found on our demo page.

We also provide a Hugging Face demo page for LJSpeech. Try your own sentences there!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our code is based on the following repos:

Comments
  • pinyin preprocess problem

    005804 你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4? ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

    txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

    ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

    What is 'a_?_n_ao3'?

    Entries such as ch_a1_d_ou1 and a_?_n_ao3 appear in the mfa_dict.

    opened by windowxiaoming 2
  • discriminator output['y_c'] never used

    The discriminator's output['y_c'] is never used, and it is never computed in the discriminator's forward function. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

    opened by mayfool 2
  • A question of KL divergence calculation

    In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121) as loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]. Can someone explain why? I think loss_kl should be computed as loss_kl = logqx.exp() * (logqx - logpx). I would be very grateful for a reply!

    opened by JiaYK 2
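
One possible reading of the KL question above, offered as an assumption rather than as the authors' answer: if the latent z is sampled from the posterior q, then averaging logqx - logpx over those samples is already a Monte Carlo estimate of KL(q || p), and multiplying by logqx.exp() would weight by q a second time. The sketch below checks this on two 1-D Gaussians, where the KL divergence has a closed form. All function names are illustrative and none of this code comes from the repository.

```python
import math
import random


def log_normal(z, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at z."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2


def analytic_kl(mu_q, std_q, mu_p, std_p):
    """Closed-form KL(q || p) for two 1-D Gaussians."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2)
            - 0.5)


def monte_carlo_kl(mu_q, std_q, mu_p, std_p, n_samples=100000, seed=0):
    """Estimate KL(q || p) as the mean of log q(z) - log p(z) with z drawn from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, std_q)  # sampling from q supplies the q(z) weighting
        total += log_normal(z, mu_q, std_q) - log_normal(z, mu_p, std_p)
    return total / n_samples


if __name__ == "__main__":
    print("analytic   :", analytic_kl(0.5, 0.8, 0.0, 1.0))
    print("monte carlo:", monte_carlo_kl(0.5, 0.8, 0.0, 1.0))
```

The two printed values should agree closely. Under the same assumption, the division by nonpadding_sqz.sum() and logqx.shape[1] in the quoted line would play the role of the sample average here.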
  • mfa for multi speaker.

    The code groups MFA inputs for better parallelism. For multi-speaker data, this may go wrong. For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1, the resulting TextGrid is:

    	item [1]:
    		class = "IntervalTier"
    		name = "words"
    		xmin = 0.0
    		xmax = 9.4444
    		intervals: size = 56
    			intervals [1]:
    				xmin = 0
    				xmax = 0.5700000000000001
    				text = ""
    			intervals [2]:
    				xmin = 0.5700000000000001
    				xmax = 0.61
    				text = "eng"
    			intervals [3]:
    				xmin = 0.61
    				xmax = 0.79
    				text = "s_an1"
    			intervals [4]:
    				xmin = 0.79
    				xmax = 0.89
    				text = "eng"
    			intervals [5]:
    				xmin = 0.89
    				xmax = 1.06
    				text = "i1"
    			intervals [6]:
    				xmin = 1.06
    				xmax = 1.24
    				text = "eng"
    			intervals [7]:
    				xmin = 1.24
    				xmax = 1.3
    				text = ""
    			intervals [8]:
    				xmin = 1.3
    				xmax = 1.36
    				text = "s_an1"
    			intervals [9]:
    				xmin = 1.36
    				xmax = 1.42
    				text = ""
    			intervals [10]:
    				xmin = 1.42
    				xmax = 1.49
    				text = "eng"
    			intervals [11]:
    				xmin = 1.49
    				xmax = 1.67
    				text = "s_i4"
    			intervals [12]:
    				xmin = 1.67
    				xmax = 1.78
    				text = "eng"
    			intervals [13]:
    				xmin = 1.78
    				xmax = 1.91
    				text = ""
    			intervals [14]:
    				xmin = 1.91
    				xmax = 1.96
    				text = "er4"
    			intervals [15]:
    				xmin = 1.96
    				xmax = 2.06
    				text = "eng"
    			intervals [16]:
    				xmin = 2.06
    				xmax = 2.19
    				text = ""
    			intervals [17]:
    				xmin = 2.19
    				xmax = 2.35
    				text = "i1"
    			intervals [18]:
    				xmin = 2.35
    				xmax = 2.53
    				text = "eng"
    			intervals [19]:
    				xmin = 2.53
    				xmax = 3.03
    				text = "i1"
    			intervals [20]:
    				xmin = 3.03
    				xmax = 3.42
    				text = "eng"
    			intervals [21]:
    				xmin = 3.42
    				xmax = 3.48
    				text = "i1"
    			intervals [22]:
    				xmin = 3.48
    				xmax = 3.6
    				text = ""
    			intervals [23]:
    				xmin = 3.6
    				xmax = 3.64
    				text = "eng"
    			intervals [24]:
    				xmin = 3.64
    				xmax = 3.86
    				text = "i1"
    			intervals [25]:
    				xmin = 3.86
    				xmax = 3.99
    				text = "eng"
    			intervals [26]:
    				xmin = 3.99
    				xmax = 4.59
    				text = ""
    			intervals [27]:
    				xmin = 4.59
    				xmax = 4.869999999999999
    				text = "er4"
    			intervals [28]:
    				xmin = 4.869999999999999
    				xmax = 4.9799999999999995
    				text = "eng"
    			intervals [29]:
    				xmin = 4.9799999999999995
    				xmax = 5.1899999999999995
    				text = "s_i4"
    			intervals [30]:
    				xmin = 5.1899999999999995
    				xmax = 5.34
    				text = ""
    			intervals [31]:
    				xmin = 5.34
    				xmax = 5.43
    				text = "eng"
    			intervals [32]:
    				xmin = 5.43
    				xmax = 5.6
    				text = ""
    			intervals [33]:
    				xmin = 5.6
    				xmax = 5.76
    				text = "i1"
    			intervals [34]:
    				xmin = 5.76
    				xmax = 6.279999999999999
    				text = "eng"
    			intervals [35]:
    				xmin = 6.279999999999999
    				xmax = 6.359999999999999
    				text = "s_an1"
    			intervals [36]:
    				xmin = 6.359999999999999
    				xmax = 6.47
    				text = ""
    			intervals [37]:
    				xmin = 6.47
    				xmax = 6.6
    				text = "eng"
    			intervals [38]:
    				xmin = 6.6
    				xmax = 6.9399999999999995
    				text = "i1"
    			intervals [39]:
    				xmin = 6.9399999999999995
    				xmax = 7.039999999999999
    				text = "eng"
    			intervals [40]:
    				xmin = 7.039999999999999
    				xmax = 7.289999999999999
    				text = "s_an1"
    			intervals [41]:
    				xmin = 7.289999999999999
    				xmax = 7.369999999999999
    				text = "eng"
    			intervals [42]:
    				xmin = 7.369999999999999
    				xmax = 7.6
    				text = "s_i4"
    			intervals [43]:
    				xmin = 7.6
    				xmax = 7.699999999999999
    				text = "eng"
    			intervals [44]:
    				xmin = 7.699999999999999
    				xmax = 7.869999999999999
    				text = ""
    			intervals [45]:
    				xmin = 7.869999999999999
    				xmax = 8.049999999999999
    				text = "er4"
    			intervals [46]:
    				xmin = 8.049999999999999
    				xmax = 8.26
    				text = ""
    			intervals [47]:
    				xmin = 8.26
    				xmax = 8.299999999999999
    				text = "eng"
    			intervals [48]:
    				xmin = 8.299999999999999
    				xmax = 8.36
    				text = "s_i4"
    			intervals [49]:
    				xmin = 8.36
    				xmax = 8.389999999999999
    				text = ""
    			intervals [50]:
    				xmin = 8.389999999999999
    				xmax = 8.42
    				text = "eng"
    			intervals [51]:
    				xmin = 8.42
    				xmax = 8.45
    				text = ""
    			intervals [52]:
    				xmin = 8.45
    				xmax = 8.59
    				text = "s_an1"
    			intervals [53]:
    				xmin = 8.59
    				xmax = 8.83
    				text = ""
    			intervals [54]:
    				xmin = 8.83
    				xmax = 9.1
    				text = "eng"
    			intervals [55]:
    				xmin = 9.1
    				xmax = 9.44
    				text = "i1"
    			intervals [56]:
    				xmin = 9.44
    				xmax = 9.4444
    				text = ""
    
    opened by leon2milan 2
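
A mismatch like the one in the TextGrid above can be spotted automatically by comparing the aligned word labels with the words that were sent to MFA. The following is a rough sketch using a regular expression over the long TextGrid text format shown in the issue. It is not the repository's TextGrid reader and not an MFA API; `read_intervals` and `unexpected_labels` are hypothetical helpers.

```python
import re

# Matches one interval: xmin, xmax, then the quoted text label.
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*xmax\s*=\s*([\d.]+)\s*text\s*=\s*"([^"]*)"'
)

SAMPLE = '''
item [1]:
    class = "IntervalTier"
    name = "words"
    xmin = 0.0
    xmax = 9.4444
    intervals: size = 3
    intervals [1]:
        xmin = 0
        xmax = 0.57
        text = ""
    intervals [2]:
        xmin = 0.57
        xmax = 0.61
        text = "eng"
    intervals [3]:
        xmin = 0.61
        xmax = 0.79
        text = "s_an1"
'''


def read_intervals(textgrid_text):
    """Return (xmin, xmax, label) for every interval found in the text."""
    return [(float(a), float(b), label)
            for a, b, label in INTERVAL_RE.findall(textgrid_text)]


def unexpected_labels(textgrid_text, expected_words):
    """Return non-empty labels that are absent from the expected word list."""
    expected = set(expected_words)
    return [label for _, _, label in read_intervals(textgrid_text)
            if label and label not in expected]


if __name__ == "__main__":
    words = "g_uang3 zh_ou1 n_v3 d_a4".split()
    print(unexpected_labels(SAMPLE, words))
```

A non-empty result for an utterance suggests that its alignment was produced against a different transcript or dictionary entry, which is the symptom described in the issue.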
  • Problem with DDP

    Hello, I have been experimenting with this excellent repo, but I found that DDP is not effective. Am I using it the wrong way?

    CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset

    opened by zhazl 0