PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech

Overview

Cross-Speaker-Emotion-Transfer - PyTorch Implementation

Quickstart

In the commands below, DATASET stands for the name of a dataset, such as RAVDESS.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Also, install fairseq (see its official documentation and GitHub repository) to use LConvBlock. Please check here to resolve any installation issues. Note that a Dockerfile is provided for Docker users, but fairseq still has to be installed manually.

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/.

To extract soft emotion tokens from a reference audio, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --ref_audio REF_AUDIO_PATH --restore_step RESTORE_STEP --mode single --dataset DATASET

Or, to use hard emotion tokens from an emotion id, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
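To see which values are valid for --speaker_id, the speaker dictionary can be read directly. The sketch below assumes that speakers.json is a flat JSON object keyed by speaker name (for example Actor_01); that layout, the helper name load_speakers, and its root parameter are assumptions made for illustration, not part of the repository.

```python
import json
from pathlib import Path


def load_speakers(dataset, root="preprocessed_data"):
    """Read preprocessed_data/DATASET/speakers.json and return its content.

    Assumption: the file holds a single JSON object keyed by speaker name.
    """
    path = Path(root) / dataset / "speakers.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def speaker_names(dataset, root="preprocessed_data"):
    """Return the speaker names in sorted order, ready to pass as --speaker_id."""
    return sorted(load_speakers(dataset, root=root))
```

Calling speaker_names("RAVDESS") from the repository root, after preprocessing, would list the names that synthesize.py can accept under that assumption.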

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt. Please note that only the hard emotion tokens from a given emotion id are supported in this mode.

Training

Datasets

The supported datasets are

  • RAVDESS: This portion of the RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
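The file count above can be checked with a little arithmetic, and the same structure shows up in the file names. The sketch below follows the naming convention published with the RAVDESS dataset (seven two-digit fields separated by hyphens, two repetitions per statement, odd-numbered actors male and even-numbered female); it is not taken from this repository's preprocessing code, and the function names are hypothetical.

```python
# Emotion and intensity codes as documented by the RAVDESS dataset itself.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITIES = {"01": "normal", "02": "strong"}


def trials_per_actor(statements=2, repetitions=2):
    """Count speech trials per actor.

    Seven emotions come in two intensities; neutral has one intensity only.
    """
    conditions = 7 * 2 + 1
    return conditions * statements * repetitions


def parse_ravdess_name(filename):
    """Split a name such as '03-01-06-01-02-01-12.wav' into labelled fields."""
    stem = filename.rsplit(".", 1)[0]
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"expected 7 hyphen-separated fields, got {filename!r}")
    _modality, _channel, emotion, intensity, statement, repetition, actor = fields
    actor_number = int(actor)
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITIES[intensity],
        "statement": int(statement),
        "repetition": int(repetition),
        "actor": actor_number,
        "gender": "female" if actor_number % 2 == 0 else "male",
    }
```

Here trials_per_actor() gives 60, and multiplying by 24 actors gives the 1440 files quoted above.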

Your own language and dataset can be adapted following here.

Preprocessing

  • For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.

  • Run

    python3 prepare_align.py --dataset DATASET
    

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DATASET
    

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

  • To use Automatic Mixed Precision, append --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.
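The two options above can be combined in a single launch. As a minimal sketch, the hypothetical helper below (build_train_command is not part of this repository) assembles the argument list and environment for train.py, using only the --dataset and --use_amp flags and the CUDA_VISIBLE_DEVICES variable mentioned above.

```python
import os


def build_train_command(dataset, gpu_ids=None, use_amp=False):
    """Return (argv, env) for launching train.py.

    gpu_ids: optional iterable of GPU indices to expose to the trainer.
    use_amp: append --use_amp to enable Automatic Mixed Precision.
    """
    argv = ["python3", "train.py", "--dataset", dataset]
    if use_amp:
        argv.append("--use_amp")

    # Copy the current environment so the caller's process is left untouched.
    env = dict(os.environ)
    if gpu_ids is not None:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return argv, env
```

Passing the result to subprocess.run(argv, env=env, check=True) from the repository root would start training on the selected GPUs only.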

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

Notes

  • The current implementation is not trained in a semi-supervised way because the dataset is small. Semi-supervised training can be enabled by specifying target speakers and passing no emotion ID, with the emotion classifier loss disabled.
  • In the decoder, a 15 x 1 LConv block is used instead of 17 x 1 due to memory issues.
  • There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config (between 'none' and 'DeepSpeaker').
  • DeepSpeaker on the RAVDESS dataset shows clear identification among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.

  • For vocoder, HiFi-GAN and MelGAN are supported.
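The first note describes dropping the emotion classifier loss for utterances that carry no emotion ID. One way this could look is sketched below in plain Python, independent of the repository's actual loss code: samples whose emotion ID is None are skipped, and the cross-entropy is averaged over the labelled samples only. The function name and the list-based inputs are assumptions made for illustration.

```python
import math


def masked_emotion_loss(logits_batch, emotion_ids):
    """Mean cross-entropy over the samples that carry an emotion ID.

    logits_batch: list of per-sample logit lists (one score per emotion).
    emotion_ids:  list of integer labels, or None for unlabelled samples.
    """
    total, count = 0.0, 0
    for logits, emotion_id in zip(logits_batch, emotion_ids):
        if emotion_id is None:
            continue  # unlabelled sample: no classifier loss
        peak = max(logits)
        # Numerically stable log-sum-exp of the logits.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[emotion_id]
        count += 1
    return total / count if count else 0.0
```

With this masking, a batch made up entirely of unlabelled target-speaker utterances contributes zero classifier loss, while mixed batches are averaged over their labelled part.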

Citation

Please cite this repository using the "Cite this repository" option in the About section (top right of the main page).

Comments
  • loading state dict ——size mismatch

    I have a problem when I use your pre-trained model for synthesis. The following error occurs:

    RuntimeError: Error(s) in loading state_dict for XSpkEmoTrans:
    	size mismatch for duratin_predictor.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([2, 3]) from checkpoint, the shape in current model is torch.Size([2, 1, 3]).
    	size mismatch for decoder.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
    	size mismatch for decoder.lconv_stack.1.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
    	size mismatch for decoder.lconv_stack.2.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
    	size mismatch for decoder.lconv_stack.3.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
    	size mismatch for decoder.lconv_stack.4.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
    	size mismatch for decoder.lconv_stack.5.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).

    opened by cythc 2
  • Closed Issue

    Hi, I synthesized some samples with the provided pretrained models and the speaker embedding from philipperemy's DeepSpeaker repo. However, the sampled results were bad in that all of the words were garbled and I could not hear any words.

    I am not sure if I am doing anything wrong since I just cloned your repository, downloaded the RAVDESS data and did everything listed in the README.md. Based on how I was able to generate samples, I do not think I am doing anything wrong, but was anyone able to synthesize good speech? And to the author of this repo @keonlee9420 do you mind uploading some samples generated from the pretrained models from the README.md?

    Thanks in advance.

    opened by jinny1208 0
  • The generated wav is not good

    Hi, thank you for open-sourcing this wonderful work! I followed your instructions: 1) install lightconv_cuda, 2) download the checkpoint, 3) download the speaker embedding npy. However, the generated result is not good.

    Below is my running command

    python3 synthesize.py \
      --text "Hello world" \
      --speaker_id Actor_22 \
      --emotion_id sad \
      --restore_step 450000 \
      --mode single \
      --dataset RAVDESS
    
    # sh run.sh 
    2022-11-30 13:45:22.626404: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
    Device of XSpkEmoTrans: cuda
    Removing weight norm...
    Raw Text Sequence: Hello world
    Phoneme Sequence: {HH AH0 L OW1 W ER1 L D}
    

    ENV

    python 3.6.8
    fairseq                 0.10.2
    torch                   1.7.0+cu110
    CUDA 11.0
    

    Hello world_Actor_22_sad.wav.zip

    opened by pangtouyuqqq 1
  • Synthesis with other person out of RAVDESS

    Hello author. First, thank you for sharing this repo; it is really nice. I have a question:

    1. I downloaded single-speaker CMU data with 100 audio clips, made a speaker embedding vector from it, and synthesized with it, but the performance is not good. I cannot make out any words.
    2. Do we need to fine-tune the deep-speaker model to generate speaker embeddings for my data?

    Thank you

    opened by hathubkhn 5
  • Error using the pretrained model

    I'm trying to run synthesize.py with the pretrained model, like so:

    python3 synthesize.py --text "This sentence is a test" --speaker_id Actor_01 --emotion_id neutral --restore_step 450000  --dataset RAVDESS --mode single
    

    but I get an error in layer size:

    Traceback (most recent call last):
      File "synthesize.py", line 206, in <module>
        model = get_model(args, configs, device, train=False,
      File "/home/jrings/diviai/installs/Cross-Speaker-Emotion-Transfer/utils/model.py", line 27, in get_model
        model.load_state_dict(model_dict, strict=False)
      File "<...>/torch/nn/modules/module.py", line 1604, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for XSpkEmoTrans:
    	size mismatch for emotion_emb.etl.embed: copying a param with shape torch.Size([8, 64]) from checkpoint, the shape in current model is torch.Size([9, 64]).
    	size mismatch for duratin_predictor.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([2, 1, 3]) from checkpoint, the shape in current model is torch.Size([2, 3]).
    	size mismatch for decoder.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.1.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.2.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.3.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.4.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.5.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    
    opened by jrings 1
  • speaker embedding npy file not found

    Hi,

    I am facing the following issue while synthesizing using pretrained model.

    Removing weight norm...
    Traceback (most recent call last):
      File "synthesize.py", line 234, in <module>
        )) if load_spker_embed else None
      File "/home/sagar/tts/Cross-Speaker-Emotion-Transfer/venv/lib/python3.7/site-packages/numpy/lib/npyio.py", line 417, in load
        fid = stack.enter_context(open(os_fspath(file), "rb"))
    FileNotFoundError: [Errno 2] No such file or directory: './preprocessed_data/RAVDESS/spker_embed/Actor_19-spker_embed.npy'

    Please suggest any way out. Thanks in advance -Sagar

    opened by raikarsagar 4
Releases(v0.2.0)
Owner
Keon Lee
Expressive Speech Synthesis | Conversational AI | Open-domain Dialog | NLP | Generative Models | Empathic Computing | HCI