Parallel Tacotron2

Overview

PyTorch implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling.

Updates

  • 2021.05.15: Implementation done. Sanity checks on training and inference pass, but the model does not converge yet.

    Contributions are welcome! Please let me know if you find any mistakes in my implementation or have advice for training the model successfully. See the Implementation Issues section.

Training

Requirements

  • You can install the Python dependencies with

    pip3 install -r requirements.txt
  • In addition, install fairseq (official document, github) to use the LConvBlock.
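
    For example, a recent fairseq release can usually be installed directly from PyPI (pick the version that matches your PyTorch/CUDA setup):

    pip3 install fairseq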

Datasets

The supported datasets are:

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • (more to be added)

Preprocessing

After downloading the datasets, set the corpus_path in preprocess.yaml and run the preparation script:

python3 prepare_data.py config/LJSpeech/preprocess.yaml

Then, run the preprocessing script:

python3 preprocess.py config/LJSpeech/preprocess.yaml

Training

Train your model with

python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

The model does not converge yet. I'm debugging, but progress would be much faster with your awesome contributions!

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost.

Implementation Issues

Overall, normalization and activation layers not specified in the original paper are added where needed to prevent NaN values (gradients) in the forward and backward passes.

Text Encoder

  1. Use the FFTBlock of FastSpeech2 for the transformer block of the text encoder.
  2. Use dropout 0.2 for the ConvBlock of the text encoder.
  3. To substitute for the paper's proprietary text normalization engine:
    • Apply the same text normalization as in FastSpeech2.
    • Implement a grapheme_to_phoneme function (see ./text/__init__.py and the sketch below).
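
A minimal sketch of the grapheme-to-phoneme step, assuming the g2p_en package (the exact lexicon and normalization rules used in ./text may differ):

    from g2p_en import G2p

    # Convert raw text to an ARPAbet phoneme sequence; punctuation marks
    # are kept as separate tokens.
    g2p = G2p()
    phones = g2p("Parallel Tacotron two.")
    print(phones)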

Residual Encoder

  1. Use an 80-channel mel-spectrogram instead of 128-bin.
  2. A regular sinusoidal positional embedding is used at the frame level instead of the combination of three positional embeddings in Parallel Tacotron. Since the model depends entirely on unsupervised learning for position, this choice may be one reason the model fails to converge (a sketch of the embedding follows).
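
A minimal sketch of the frame-level embedding in item 2, assuming an even d_model (the repo's actual helper may differ):

    import math

    import torch

    def sinusoidal_positional_embedding(max_len: int, d_model: int) -> torch.Tensor:
        # Standard sinusoidal embedding from "Attention Is All You Need",
        # used here in place of Parallel Tacotron's three combined embeddings.
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe  # (max_len, d_model), added to the frame-level features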

Duration Predictor & Learned Upsampling (The most important but ambiguous part)

  1. Use log durations with the prior that there should be at least one frame in total per sequence.
  2. Use nn.SiLU() for the Swish activation.
  3. When obtaining W and C, the concatenation among S, E, and V is applied after broadcasting V to the frame domain (the T domain), as sketched below. Since the detailed process is not described in the original paper, this choice may be one reason the model fails to converge.
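
A minimal sketch of the broadcast-then-concatenate step in item 3; the shapes and names are hypothetical (B batch, T frames, K tokens), not the repo's exact code:

    import torch

    B, T, K, d = 2, 100, 20, 16          # hypothetical sizes
    S = torch.randn(B, T, K, 1)          # per-(frame, token) start-offset grid
    E = torch.randn(B, T, K, 1)          # per-(frame, token) end-offset grid
    V = torch.randn(B, K, d)             # token-level hidden states

    # Broadcast V over the frame (T) domain, then concatenate with S and E;
    # MLPs over the last dimension then produce W (and, analogously, C).
    V_frame = V.unsqueeze(1).expand(-1, T, -1, -1)   # (B, T, K, d)
    features = torch.cat([S, E, V_frame], dim=-1)    # (B, T, K, 2 + d)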

Decoder

  1. Use (multi-head) self-attention and LConvBlock.
  2. The iterative mel-spectrogram is projected by a linear layer.
  3. Apply nn.Tanh() to each LConvBlock output, following the activation pattern of the decoder in FastSpeech2 (see the sketch below).
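
A minimal sketch of the activation pattern in item 3, with a depthwise nn.Conv1d standing in for fairseq's lightweight convolution:

    import torch
    import torch.nn as nn

    class LConvStandIn(nn.Module):
        # Stand-in for an LConvBlock: depthwise conv plus residual, with
        # nn.Tanh() applied to the output as done in this implementation.
        def __init__(self, d_model: int, kernel_size: int = 17):
            super().__init__()
            self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                                  padding=kernel_size // 2, groups=d_model)
            self.act = nn.Tanh()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, T, d_model)
            y = self.conv(x.transpose(1, 2)).transpose(1, 2)
            return self.act(x + y)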

Loss

  1. Use the optimizer & scheduler of FastSpeech2 (which, as described in the original paper, follows Attention Is All You Need).
  2. The soft-DTW is based on pytorch-softdtw-cuda (post).
    1. A customized soft-DTW is implemented in model/soft_dtw_cuda.py, reflecting the recursion suggested in the original paper.
    2. The original soft-DTW does not assume a final loss, so only E is computed. When employed as a loss function, a Jacobian product is added to return the target derivative of R w.r.t. the input X.
    3. Currently, the maximum batch size is 6 on a 24 GiB GPU (TITAN RTX) due to the space complexity of the soft-DTW loss.
      • The original paper implements a custom differentiable diagonal band operation to tame the O(T^2) complexity, but this part has not been explored in the current implementation yet.
  3. For stability, mel-spectrograms are compressed by a sigmoid function before the soft-DTW, as sketched below. Without the sigmoid, the soft-DTW values become too large and produce NaNs in the backward pass.
  4. A guided attention loss is applied for fast convergence of the attention module in the residual encoder.
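
A minimal sketch of the compression in item 3; soft_dtw is a hypothetical handle to the loss in model/soft_dtw_cuda.py, not its exact signature:

    import torch

    def mel_loss(soft_dtw, mel_pred: torch.Tensor, mel_target: torch.Tensor) -> torch.Tensor:
        # Squash both spectrograms into (0, 1) before soft-DTW; without the
        # sigmoid the soft-DTW values grow large enough to produce NaNs in backward.
        return soft_dtw(torch.sigmoid(mel_pred), torch.sigmoid(mel_target)).mean()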

Citation

@misc{lee2021parallel_tacotron2,
  author = {Lee, Keon},
  title = {Parallel-Tacotron2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/Parallel-Tacotron2}}
}

References

  • Isaac Elias et al., "Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling", arXiv:2103.14574, 2021.
  • Yi Ren et al., "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech", arXiv:2006.04558, 2020.
  • Maghoumi's pytorch-softdtw-cuda: https://github.com/Maghoumi/pytorch-softdtw-cuda

Comments
  • LightWeightConv layer warnings during training

    If I just install the specified requirements plus Pillow and fairseq, the following warning appears at training start:

    No module named 'lightconv_cuda'

    If the lightconv-layer from fairseq is installed, the following warning is displayed:

    WARNING: Unsupported filter length passed - skipping forward pass

    PyTorch 1.7, CUDA 10.2, fairseq 1.0.0a0+19793a7

    opened by idcore 10
  • Suggestion for adding open German "Thorsten" dataset

    Hi.

    Following the "(more to be added)" note in the README, I would like to suggest adding my open German "Thorsten" dataset.

    Thorsten: a single-speaker German open dataset consisting of 22,668 short audio clips of a male speaker, approximately 23 hours in total (LJSpeech file/directory syntax).

    https://github.com/thorstenMueller/deep-learning-german-tts/

    opened by thorstenMueller 4
  • Soft DTW with Cython implementation

    Hi @keonlee9420, have you tried the Cython version of soft-DTW from this repo?

    https://github.com/mblondel/soft-dtw

    Can it be applied to Parallel Tacotron 2? I am trying that repo because the current batch size is too small when using the CUDA implementation of @Maghoumi.

    I wonder about this because @Maghoumi in https://github.com/Maghoumi/pytorch-softdtw-cuda claims to have experimented with larger batch sizes. But when applied to Parallel Tacotron, the batch size has to be much smaller; is there a gap?

    opened by v-nhandt21 2
  • Handle audios with long duration

    When I load audio whose mel-spectrogram has more frames than the maximum mel length (1000 frames):

    • there is a problem when concatenating pos + speaker + mels, so I tried setting max_seq_len larger (1500);
    • that then leads to a problem with soft-DTW, which reports a maximum of 1024.

    As a workaround, I tried trimming mels to fit 1024, but it seems complicated; for now I filter out all audio with more than 1024 frames.

    Any suggestions for handling long audio? I also wonder how it works at inference time.

    opened by v-nhandt21 2
  • cannot import name II from omegaconf

    Great work! But I encountered a problem when training this model :( The error message:

    ImportError: cannot import name II from omegaconf

    The version of fairseq is 0.10.2 (latest release version) and omegaconf is 1.4.1. How can I fix it?

    Thank you

    opened by cnlinxi 2
  • It seems cannot run

    I followed your commands to run the code, but I get the following error:

    File "train.py", line 87, in main
        output = model(*(batch[2:]))
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
        return self.gather(outputs, self.output_device)
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
        return gather(outputs, output_device, dim=self.dim)
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
        res = gather_map(outputs)
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
        return Gather.apply(target_device, dim, *outputs)
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
        return comm.gather(inputs, ctx.dim, ctx.target_device)
    File "/home/ydc/anaconda3/envs/CD/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 230, in gather
        return torch._C._gather(tensors, dim, destination)
    RuntimeError: Input tensor at index 1 has invalid shape [1, 474, 80], but expected [1, 302, 80]

    opened by yangdongchao 2
  • fix mask and soft-dtw loss

    1. Fix the mask problem when calculating W in the LearnUpsampling module and the attention matrix in the VaribleLengthAttention module.
    2. Add a new Jacobian matrix of the Manhattan distance.
    3. Deal with mel-spectrograms of different lengths.

    opened by zhang-wy15 1
  • why Lconv block doesn't have stride argument?

    Hi, thanks for the implementation.

    I think Parallel Tacotron 2 uses the same residual encoder as Parallel Tacotron 1, which uses five 17 × 1 LConv blocks interleaved with strided 3 × 1 convolutions.

    But in your implementation, the LConvBlock doesn't have a stride argument. How did you handle this part?

    Thanks.

    opened by yw0nam 0
  • Soft DTW

    Hello, has anybody been able to train with the soft-DTW loss? It doesn't converge at all. I think there is a problem with the implementation, but I couldn't spot it. When I train with the real alignments, it works well.

    opened by talipturkmen 0
  • weights required

    Can someone share a link to the weights file? I couldn't synthesize or run inference with it. If I am doing something wrong, please tell me the correct way to use it. Thanks.

    opened by mrqasimasif 0
  • Why no alignment at all?

    I cloned the code, prepared the data according to the README, and just updated:

    1. the LJSpeech data path in config/LJSpeech/train.yaml
    2. unzipped generator_LJSpeech.pth.tar.zip to get generator_LJSpeech.pth.tar

    The code runs, but no matter how many steps I train, the alignment images always look the same and the demo audio sounds like noise.
    opened by mikesun4096 2
  • training problem

      File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/model/parallel_tacotron2.py", line 68, in forward
        self.learned_upsampling(durations, V, src_lens, src_masks, max_src_len)
      File "/home/huangjiahong.dracu/miniconda2/envs/parallel_tc2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/model/modules.py", line 335, in forward
        mel_mask = get_mask_from_lengths(mel_len, max_mel_len)
      File "/data1/hjh/pycharm_projects/tts/parallel-tacotron2_try/utils/tools.py", line 87, in get_mask_from_lengths
        ids = torch.arange(0, max_len).unsqueeze(0).expand(batch_size, -1).to(device)
    RuntimeError: upper bound and larger bound inconsistent with step sign
    

    Thank you for your work. I got the above problem when training. I guess it's a duration prediction problem. How can I solve it?

    opened by aijianiula0601 0
  • Could you please share your audio samples, pretrained models and loss curves?

    Hi, thanks for your excellent work! Could you possibly share your audio samples, pretrained models, and loss curves with me? Thanks so much for your help!

    opened by CocoWang1010 0
  • fix in implementation of S-DTW backward @taras-sereda

    Hey, I've found that in your implementation of the S-DTW backward, the E matrices are not used; instead you use the G matrices, whose entries ignore the scaling factors a, b, c. What's the reason for this? My guess is that you do this to preserve and propagate gradients, because they vanish due to the small values of a, b, c. But I might be wrong, so I'd be glad to hear your motivation.

    Playing with your code, I also found that gradients vanish, especially when bandwidth=None. I'm solving this by normalizing the distance matrix by n_mel_channel. With this normalization and the exact implementation of the S-DTW backward, I converge on overfit experiments quicker than with the non-exact computation. I'm using these soft-DTW hyperparameters:

    gamma = 0.05
    warp = 256
    bandwidth = 50
    

    Here is a small test I'm using for checks:

    import numpy as np
    import torch
    from torch.optim import Adam

    target_spectro = np.load('')  # path elided in the original comment
    target_spectro = torch.from_numpy(target_spectro)
    target_spectro = target_spectro.unsqueeze(0).cuda()
    pred_spectro = torch.randn_like(target_spectro, requires_grad=True)

    optimizer = Adam([pred_spectro])

    # model fits in ~3k iterations
    n_iter = 4_000
    for i in range(n_iter):
        # numba_soft_dtw is the soft-DTW handle on the commenter's test class
        loss = self.numba_soft_dtw(pred_spectro, target_spectro)
        loss = loss / pred_spectro.size(1)
        loss.backward()

        if i % 1_000 == 0:
            print(f'iter: {i}, loss: {loss.item():.6f}')
            print(f'd_loss_pred {pred_spectro.grad.mean()}')

        optimizer.step()
        optimizer.zero_grad()

    Curious to hear how your training is going! Best, Taras

    opened by taras-sereda 1
Releases (v0.1.0)

Owner
Keon Lee
Conversational AI | Expressive Speech Synthesis | Open-domain Dialog | Empathic Computing | NLP | Disentangled Representation | Generative Models | HCI