Voice Conversion Using Speech-to-Speech Neuro-Style Transfer

Last update: Jan 05, 2023

This repo contains the official implementation of the VAE-GAN from the INTERSPEECH 2020 paper Voice Conversion Using Speech-to-Speech Neuro-Style Transfer.

Examples of generated audio using the Flickr8k Audio Corpus: https://ebadawy.github.io/post/speech_style_transfer. Note that these examples are a result of feeding audio reconstructions of this VAE-GAN to an implementation of WaveNet.

1. Data Preparation

Dataset file structure:

/path/to/database
├── spkr_1
│   ├── sample.wav
├── spkr_2
│   ├── sample.wav
│   ...
└── spkr_N
    ├── sample.wav
    ...
# The directory under each speaker cannot be nested.
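
Before preprocessing, it can help to confirm that a dataset follows this layout. The following is a minimal sketch, not part of the repo: the function name check_dataset_layout is hypothetical, and it assumes that every sub-directory of the root is a speaker and that only files ending in .wav count as samples.

```python
from pathlib import Path


def check_dataset_layout(root):
    """Return {speaker_name: number_of_wav_files} for a flat speaker layout.

    Raises ValueError if a speaker directory contains nested directories
    or no .wav files. (Hypothetical helper, not part of the repo.)
    """
    root = Path(root)
    counts = {}
    for spkr_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Speaker directories must be flat: no sub-directories allowed.
        nested = [p.name for p in spkr_dir.iterdir() if p.is_dir()]
        if nested:
            raise ValueError(f"{spkr_dir.name} contains nested directories: {nested}")
        wavs = [p for p in spkr_dir.iterdir() if p.suffix.lower() == ".wav"]
        if not wavs:
            raise ValueError(f"{spkr_dir.name} contains no .wav files")
        counts[spkr_dir.name] = len(wavs)
    return counts
```

Calling check_dataset_layout("/path/to/database") returns a per-speaker file count, which also makes it easy to spot speakers with too little data.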

Here is an example script for setting up data preparation from the Flickr8k Audio Corpus. The speakers of interest are the same as in the paper, but you may change them to other speakers if desired.

2. Data Preprocessing

The prepared dataset is organised into a train/eval/test split, the audio is preprocessed, and melspectrograms are computed.

python preprocess.py --dataset [path/to/dataset] --test-size [float] --eval-size [float]
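
To illustrate what the two float arguments mean, here is a rough sketch of a fractional train/eval/test split. It is an assumption about the general idea, not the logic of preprocess.py: the function split_files, its default values, and the seeded shuffle are all hypothetical.

```python
import random


def split_files(files, test_size=0.1, eval_size=0.1, seed=0):
    """Shuffle file names and partition them into train/eval/test lists.

    test_size and eval_size are fractions of the whole dataset; whatever
    remains goes to train. (Hypothetical helper and defaults.)
    """
    if not 0 <= test_size < 1 or not 0 <= eval_size < 1 or test_size + eval_size >= 1:
        raise ValueError("test_size and eval_size must be fractions summing to less than 1")
    files = sorted(files)                 # fixed starting order for reproducibility
    random.Random(seed).shuffle(files)    # seeded shuffle, independent of global state
    n_test = int(len(files) * test_size)
    n_eval = int(len(files) * eval_size)
    return {
        "test": files[:n_test],
        "eval": files[n_test:n_test + n_eval],
        "train": files[n_test + n_eval:],
    }
```

With 20 files, test_size=0.1 and eval_size=0.2, this yields 2 test files, 4 eval files, and 14 train files.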

3. Training

The VAE-GAN model uses the melspectrograms to learn style transfer between two speakers.

python train.py --model_name [name of the model] --dataset [path/to/dataset]

3.1. Visualization

By default, the code plots a batch of input and output melspectrograms every epoch. You may add --plot-interval -1 to the above command to disable it. Alternatively you may add --plot-interval 20 to plot every 20 epochs.

3.2. Saving Models

By default, models are saved every epoch. For datasets smaller than Flickr8k, it may be more appropriate to save less often, for example every 20 epochs by adding --checkpoint_interval 20.

3.3. Epochs

The max number of epochs may be set with --n_epochs. For smaller datasets, you may want to increase this to more than the default 100. To load a pretrained model you can use --epoch and set it to the epoch number of the saved model.

3.4. Pretrained Model

You can access pretrained model files here. After downloading and storing them in the directory src/saved_models/pretrained, you can use them for training or inference with:

--model_name pretrained --epoch 99

Note that inference does not require the discriminator files D1 and D2, whereas further training does. Here, G1 refers to the decoding generator for speaker 1 (female) and G2 to the decoding generator for speaker 2 (male).

4. Inference

The trained VAE-GAN is used for inference on a specified audio file. It works by sliding a window over the full melspectrogram, locally inferring melspectrogram subsamples, and averaging the overlapping regions. The script then uses Griffin-Lim to reconstruct audio from the generated melspectrogram.

python inference.py --model_name [name of the model] --epoch [epoch number] --trg_id [id of target generator] --wav [path/to/source_audio.wav]
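
The sliding-window-and-average step can be sketched as follows. This is an illustration under stated assumptions, not the code in inference.py: the function name, the window and hop sizes, and the convert callable (a stand-in for the trained model) are all hypothetical, and plain lists are used instead of arrays to keep the sketch dependency-free.

```python
def convert_with_sliding_window(mel, convert, window=4, hop=2):
    """Apply `convert` to overlapping windows of frames and average the overlaps.

    mel     : list of frames, each frame a list of mel-bin values
    convert : callable mapping `window` frames to `window` output frames
              (hypothetical stand-in for the trained generator)
    """
    n_frames = len(mel)
    if n_frames < window:
        raise ValueError("input is shorter than one window")
    n_bins = len(mel[0])
    total = [[0.0] * n_bins for _ in range(n_frames)]  # running sum per frame
    count = [0] * n_frames                             # windows covering each frame

    starts = list(range(0, n_frames - window + 1, hop))
    if starts[-1] != n_frames - window:
        starts.append(n_frames - window)  # extra window so the tail is covered

    for start in starts:
        out = convert(mel[start:start + window])
        for offset, frame in enumerate(out):
            t = start + offset
            count[t] += 1
            for b, value in enumerate(frame):
                total[t][b] += value

    # Each output frame is the mean of all window outputs that covered it.
    return [[v / count[t] for v in total[t]] for t in range(n_frames)]
```

Passing an identity function as convert returns the input unchanged, which is a quick sanity check that the overlap bookkeeping is correct.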

To achieve high-quality results like those in the paper, you can feed the reconstructed audio to a trained vocoder such as WaveNet. An example pipeline that uses this model with WaveNet can be found here.

4.1. Directory Input

Instead of a single .wav as input you may specify a whole directory of .wav files by using --wavdir instead of --wav.
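
As a rough picture of how the two options relate, the sketch below resolves either a single file or a directory into a list of inputs. The helper collect_wavs is hypothetical, and it assumes a non-recursive directory scan; the actual script may behave differently.

```python
from pathlib import Path


def collect_wavs(wav=None, wavdir=None):
    """Return the list of input files for exactly one of `wav` or `wavdir`.

    (Hypothetical helper; assumes sub-directories of `wavdir` are not scanned.)
    """
    if (wav is None) == (wavdir is None):
        raise ValueError("pass exactly one of wav or wavdir")
    if wav is not None:
        return [Path(wav)]
    # Sorted so that files are processed in a stable order.
    return sorted(p for p in Path(wavdir).iterdir()
                  if p.is_file() and p.suffix.lower() == ".wav")
```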

4.2. Visualization

By default, plotting input and output melspectrograms is enabled. This is useful for a visual comparison between trained models. To disable it, set --plot -1.

4.3. Reconstructive Evaluation

Alongside generation, reconstruction and cyclic reconstruction may be enabled by specifying the generator id of the source audio with --src_id [id of source generator].

When set, SSIM metrics for reconstructed melspectrograms and cyclically reconstructed melspectrograms are computed and printed at the end of inference.

This is an extra feature to help with comparing the reconstructive capabilities of different models. The higher the SSIM, the higher the quality of the reconstruction.
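
For intuition about the metric, here is a simplified sketch that evaluates the SSIM formula once over two whole melspectrograms. This is not how the repo computes it: standard SSIM implementations average the formula over local windows, and the function name global_ssim and the data_range default are assumptions made for this example.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized 2D inputs (lists of rows).

    Simplified illustration: the formula is applied once to the whole
    input instead of being averaged over local windows.
    """
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("inputs must be non-empty and of equal size")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    # Conventional stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

Identical inputs score 1 (up to rounding), and the score drops as the reconstruction diverges from the original in mean, variance, or structure.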

References

Citation

If you find this code useful please cite us in your work:

@inproceedings{AlBadawy2020,
  author={Ehab A. AlBadawy and Siwei Lyu},
  title={{Voice Conversion Using Speech-to-Speech Neuro-Style Transfer}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={4726--4730},
  doi={10.21437/Interspeech.2020-3056},
  url={http://dx.doi.org/10.21437/Interspeech.2020-3056}
}

TODO:

Rewrite preprocess.py to handle:
- multi-process feature extraction
- display error messages for failed cases
Create:
- Notebook for data visualisation
Want to add something else? Please feel free to submit a PR with your changes or open an issue for that.

Owner

Ehab AlBadawy
