Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Last update: Nov 24, 2022

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.

Our SyntaSpeech is built on the basis of PortaSpeech (NeurIPS 2021) with three new features:

We propose a Syntactic Graph Builder (Sec. 3.1) and a Syntactic Graph Encoder (Sec. 3.2), which prove effective at extracting syntactic features that improve the prosody modeling and duration accuracy of the TTS model.
We introduce Multi-Length Adversarial Training (Sec. 3.3), which replaces the flow-based post-net in PortaSpeech, speeding up inference and improving the naturalness of the generated audio.
We support three datasets: LJSpeech (single-speaker English), Biaobei (single-speaker Chinese), and LibriTTS (multi-speaker English).
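
As a rough illustration of the graph-building idea in the first feature, the sketch below turns a dependency parse into forward and reverse edge lists. The `heads` input format, the function name `build_syntactic_edges`, and the choice of one forward and one reverse edge per dependency are assumptions made for this example; the builder described in Sec. 3.1 of the paper may differ in its details.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def build_syntactic_edges(heads: List[int]) -> Tuple[List[Edge], List[Edge]]:
    """Convert dependency heads into forward and reverse edge lists.

    heads[i] is the index of the head word of word i, or -1 for the root.
    (Hypothetical input format, chosen for this example only.)
    """
    forward, reverse = [], []
    for dependent, head in enumerate(heads):
        if head < 0:
            continue  # the root word has no incoming dependency edge
        if head >= len(heads):
            raise ValueError(f"head index {head} out of range for word {dependent}")
        forward.append((head, dependent))  # head -> dependent
        reverse.append((dependent, head))  # dependent -> head
    return forward, reverse


# "She reads books": "reads" (index 1) is the root; the other words depend on it.
forward_edges, reverse_edges = build_syntactic_edges([1, -1, 1])
print(forward_edges)  # [(1, 0), (1, 2)]
print(reverse_edges)  # [(0, 1), (2, 1)]
```

Edge lists in this shape could then be handed to a graph library such as DGL, which the environment setup below installs.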

Environments

conda create -n synta python=3.7
conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install DGL for graph neural networks; dgl-cu102 supports RTX 2080, dgl-cu113 supports RTX 3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install forced alignment tools

Run SyntaSpeech!

Follow the steps below to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

As for LibriTTS, you can download the raw dataset and process it with our data_gen modules. Detailed instructions can be found in docs/prepare_data.

Vocoder Preparation

We provide pre-trained vocoders for the three datasets: HiFi-GAN for LJSpeech and Biaobei, and ParallelWaveGAN for LibriTTS. Download and unzip them into the checkpoints/ folder.

2. Training Example

Then you can train SyntaSpeech on the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training on LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # training on Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset # training on LibriTTS

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference on LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference on Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference on LibriTTS
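
The training and inference commands differ only in the config path, the experiment name, and the `--infer` flag, so they can be assembled programmatically when scripting several runs. The helper below is a hypothetical convenience for illustration, not part of this repo; it uses only the flags and the LJSpeech and Biaobei config paths shown above.

```python
from typing import List


def build_run_command(config: str, exp_name: str, infer: bool = False, reset: bool = True) -> List[str]:
    """Assemble the argument list for tasks/run.py using the flags shown above."""
    cmd = ["python", "tasks/run.py", "--config", config, "--exp_name", exp_name]
    if reset:
        cmd.append("--reset")
    if infer:
        cmd.append("--infer")
    return cmd


# Print the inference commands for the two single-speaker datasets.
for name in ("lj", "biaobei"):
    cmd = build_run_command(f"egs/tts/{name}/synta.yaml", f"{name}_synta", infer=True)
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` from the repository root, with `PYTHONPATH` and `CUDA_VISIBLE_DEVICES` set in the environment as in the shell commands.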

Audio Demos

Audio samples in the paper can be found in our demo page.

We also provide a HuggingFace demo page for LJSpeech. Try your own sentences there!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our code is based on the following repos:

Comments

pinyin preprocess problem

005804 你当#1我傻啊#3？脑子#1那么大#2怎么#1塞进去#4？ ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

What is 'a_?_n_ao3'?

In the mfa_dict, entries such as ch_a1_d_ou1 and a_?_n_ao3 appear.

opened by windowxiaoming 2

discriminator output['y_c'] never used

The discriminator's output['y_c'] is never used, and it is never computed in the discriminator's forward function. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

opened by mayfool 2

A question of KL divergence calculation

In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121). Can someone help explain why loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]? I think loss_kl should be computed as loss_kl = logqx.exp() * (logqx - logpx). I would be very grateful for a reply!

opened by JiaYK 2
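
One common reading of the loss_kl formula in the question above, offered as an assumption about the code rather than a confirmed answer from the authors: if the latent z is sampled from the posterior q, then the plain average of log q(z) - log p(z) over those samples is already a Monte Carlo estimate of KL(q || p) = E_{z~q}[log q(z) - log p(z)], so no extra weighting by q(z) is needed. The sketch below checks this numerically for two 1-D Gaussians; the distributions and function names are illustrative and not taken from fvae.py.

```python
import math
import random


def log_normal(z: float, mu: float, sigma: float) -> float:
    """Log-density of a 1-D Gaussian N(mu, sigma^2) at z."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((z - mu) / sigma) ** 2


def mc_kl(mu_q: float, sigma_q: float, n_samples: int = 200000, seed: int = 0) -> float:
    """Monte Carlo estimate of KL(q || p) with q = N(mu_q, sigma_q^2) and p = N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, sigma_q)  # z ~ q, as with a reparameterized VAE sample
        total += log_normal(z, mu_q, sigma_q) - log_normal(z, 0.0, 1.0)
    return total / n_samples  # plain mean: sampling from q supplies the q(z) weighting


def analytic_kl(mu_q: float, sigma_q: float) -> float:
    """Closed-form KL(N(mu_q, sigma_q^2) || N(0, 1))."""
    return -math.log(sigma_q) + 0.5 * (sigma_q ** 2 + mu_q ** 2) - 0.5


estimate = mc_kl(0.5, 0.8)
exact = analytic_kl(0.5, 0.8)
print(f"Monte Carlo: {estimate:.4f}, closed form: {exact:.4f}")
```

Under this reading, the divisions by nonpadding_sqz.sum() and logqx.shape[1] in the quoted line would simply average the estimate over non-padded positions and over the size of dimension 1; that interpretation is also an assumption.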

MFA for multi-speaker data

The code groups MFA inputs for better parallelism. For multi-speaker data, this may go wrong. For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1, the TextGrid is:

	item [1]:
		class = "IntervalTier"
		name = "words"
		xmin = 0.0
		xmax = 9.4444
		intervals: size = 56
			intervals [1]:
				xmin = 0
				xmax = 0.5700000000000001
				text = ""
			intervals [2]:
				xmin = 0.5700000000000001
				xmax = 0.61
				text = "eng"
			intervals [3]:
				xmin = 0.61
				xmax = 0.79
				text = "s_an1"
			intervals [4]:
				xmin = 0.79
				xmax = 0.89
				text = "eng"
			intervals [5]:
				xmin = 0.89
				xmax = 1.06
				text = "i1"
			intervals [6]:
				xmin = 1.06
				xmax = 1.24
				text = "eng"
			intervals [7]:
				xmin = 1.24
				xmax = 1.3
				text = ""
			intervals [8]:
				xmin = 1.3
				xmax = 1.36
				text = "s_an1"
			intervals [9]:
				xmin = 1.36
				xmax = 1.42
				text = ""
			intervals [10]:
				xmin = 1.42
				xmax = 1.49
				text = "eng"
			intervals [11]:
				xmin = 1.49
				xmax = 1.67
				text = "s_i4"
			intervals [12]:
				xmin = 1.67
				xmax = 1.78
				text = "eng"
			intervals [13]:
				xmin = 1.78
				xmax = 1.91
				text = ""
			intervals [14]:
				xmin = 1.91
				xmax = 1.96
				text = "er4"
			intervals [15]:
				xmin = 1.96
				xmax = 2.06
				text = "eng"
			intervals [16]:
				xmin = 2.06
				xmax = 2.19
				text = ""
			intervals [17]:
				xmin = 2.19
				xmax = 2.35
				text = "i1"
			intervals [18]:
				xmin = 2.35
				xmax = 2.53
				text = "eng"
			intervals [19]:
				xmin = 2.53
				xmax = 3.03
				text = "i1"
			intervals [20]:
				xmin = 3.03
				xmax = 3.42
				text = "eng"
			intervals [21]:
				xmin = 3.42
				xmax = 3.48
				text = "i1"
			intervals [22]:
				xmin = 3.48
				xmax = 3.6
				text = ""
			intervals [23]:
				xmin = 3.6
				xmax = 3.64
				text = "eng"
			intervals [24]:
				xmin = 3.64
				xmax = 3.86
				text = "i1"
			intervals [25]:
				xmin = 3.86
				xmax = 3.99
				text = "eng"
			intervals [26]:
				xmin = 3.99
				xmax = 4.59
				text = ""
			intervals [27]:
				xmin = 4.59
				xmax = 4.869999999999999
				text = "er4"
			intervals [28]:
				xmin = 4.869999999999999
				xmax = 4.9799999999999995
				text = "eng"
			intervals [29]:
				xmin = 4.9799999999999995
				xmax = 5.1899999999999995
				text = "s_i4"
			intervals [30]:
				xmin = 5.1899999999999995
				xmax = 5.34
				text = ""
			intervals [31]:
				xmin = 5.34
				xmax = 5.43
				text = "eng"
			intervals [32]:
				xmin = 5.43
				xmax = 5.6
				text = ""
			intervals [33]:
				xmin = 5.6
				xmax = 5.76
				text = "i1"
			intervals [34]:
				xmin = 5.76
				xmax = 6.279999999999999
				text = "eng"
			intervals [35]:
				xmin = 6.279999999999999
				xmax = 6.359999999999999
				text = "s_an1"
			intervals [36]:
				xmin = 6.359999999999999
				xmax = 6.47
				text = ""
			intervals [37]:
				xmin = 6.47
				xmax = 6.6
				text = "eng"
			intervals [38]:
				xmin = 6.6
				xmax = 6.9399999999999995
				text = "i1"
			intervals [39]:
				xmin = 6.9399999999999995
				xmax = 7.039999999999999
				text = "eng"
			intervals [40]:
				xmin = 7.039999999999999
				xmax = 7.289999999999999
				text = "s_an1"
			intervals [41]:
				xmin = 7.289999999999999
				xmax = 7.369999999999999
				text = "eng"
			intervals [42]:
				xmin = 7.369999999999999
				xmax = 7.6
				text = "s_i4"
			intervals [43]:
				xmin = 7.6
				xmax = 7.699999999999999
				text = "eng"
			intervals [44]:
				xmin = 7.699999999999999
				xmax = 7.869999999999999
				text = ""
			intervals [45]:
				xmin = 7.869999999999999
				xmax = 8.049999999999999
				text = "er4"
			intervals [46]:
				xmin = 8.049999999999999
				xmax = 8.26
				text = ""
			intervals [47]:
				xmin = 8.26
				xmax = 8.299999999999999
				text = "eng"
			intervals [48]:
				xmin = 8.299999999999999
				xmax = 8.36
				text = "s_i4"
			intervals [49]:
				xmin = 8.36
				xmax = 8.389999999999999
				text = ""
			intervals [50]:
				xmin = 8.389999999999999
				xmax = 8.42
				text = "eng"
			intervals [51]:
				xmin = 8.42
				xmax = 8.45
				text = ""
			intervals [52]:
				xmin = 8.45
				xmax = 8.59
				text = "s_an1"
			intervals [53]:
				xmin = 8.59
				xmax = 8.83
				text = ""
			intervals [54]:
				xmin = 8.83
				xmax = 9.1
				text = "eng"
			intervals [55]:
				xmin = 9.1
				xmax = 9.44
				text = "i1"
			intervals [56]:
				xmin = 9.44
				xmax = 9.4444
				text = ""

opened by leon2milan 2
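
To compare the aligned labels against the input phoneme groups, a small script can pull the non-empty `text` entries out of a TextGrid dump like the one above. This is a minimal sketch that applies a regular expression to the long text format; `extract_labels` is a hypothetical helper, not part of this repo or of MFA, and it does not distinguish between tiers.

```python
import re
from typing import List

# Matches lines such as:   text = "s_an1"
TEXT_LINE = re.compile(r'^\s*text\s*=\s*"(.*)"\s*$')


def extract_labels(textgrid: str, keep_empty: bool = False) -> List[str]:
    """Return the interval labels from a long-format TextGrid string, in order."""
    labels = []
    for line in textgrid.splitlines():
        match = TEXT_LINE.match(line)
        if match and (keep_empty or match.group(1)):
            labels.append(match.group(1))
    return labels


sample = '''
intervals [1]:
    xmin = 0
    xmax = 0.57
    text = ""
intervals [2]:
    xmin = 0.57
    xmax = 0.61
    text = "eng"
intervals [3]:
    xmin = 0.61
    xmax = 0.79
    text = "s_an1"
'''

aligned = extract_labels(sample)
expected = "g_uang3 zh_ou1 n_v3".split()
print(aligned)                       # ['eng', 's_an1']
print(set(aligned) & set(expected))  # set()
```

An empty intersection, as here, indicates that the alignment labels do not come from the expected input.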

Problem with DDP

Hello, I have been experimenting with this excellent repo, but I found that DDP does not seem to take effect. Is the way I launch it wrong?

CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset

opened by zhazl 0

Releases(v1.0.0)

v1.0.0(May 21, 2022)

We release the pretrained models of SyntaSpeech on LJSpeech, Biaobei, and LibriTTS. For the pretrained vocoders and datasets, please refer to the links provided in README.md.
Source code(tar.gz)
Source code(zip)
biaobei_synta.zip(295.58 MB)
libritts_synta.zip(310.03 MB)
lj_synta.zip(304.98 MB)

Owner

Zhenhui YE

I am currently a second-year computer science Ph.D student at Zhejiang University, working on deep learning and reinforcement learning.
