StarGAN-ZSVC: Unofficial PyTorch Implementation

Last update: Aug 28, 2022

Overview

StarGAN-ZSVC: Unofficial PyTorch Implementation

This repository is an unofficial PyTorch implementation of StarGAN-ZSVC by Matthew Baas and Herman Kamper. This repository provides both model architectures and the code to inference or train them.

One of the StarGAN-ZSVC advantages is that it works on zero-shot settings and can be trained on unparallel audio data (different audio content by different speakers). Also, the model inference time is real-time or faster.

Disclaimer: I implement this repository for educational purpose only. All credits go to the original authors. Also, it may contains different details as described in the paper. If there is a room for improvement, please feel free to contact me.

Set up

git clone [email protected]:Top34051/stargan-zsvc.git
cd stargan-zsvc
conda env create -f environment.yml
conda activate stargan-zsvc

Usage

Voice conversion

Given two audio files, source.wav and target.wav, you can generate a new audio file with the same speaking content as in source.wav spoken by the speaker in target.wav as follow.

First, load my pretrained model weights (best.pt) and put it in checkpoints folder.

Next, we need to embed both speaker identity.

python embed.py --path path_to_source.wav --name src
python embed.py --path path_to_target.wav --name trg

This will generate src.npy and trg.npy, the source and target speaker embeddings.

To perform voice conversion,

python convert.py \
  --audio_path path_to_source.wav \
  --src_id src \
  --trg_id trg

That's it! 🎉 You can check out the result at results/output.wav.

Training

To train the model, you have to download and preprocess the dataset first. Since your data might be different from mine, I recommend you to read and fix the logic I used in preprocess.py (the dataset I used is here).

The fixed size utterances from each speaker will be extracted, resampled to 22,050 Hz, and converted to Mel-spectrogram with window and hop length of size 1024 and 256. This will preprocess the speaker embeddings as well, so that you don't have to embed them one-by-one.

The processed dataset will look like this

data/
    train/
        spk1.npy # contains N samples of (80, 128) mel-spectrogram
        spk2.npy
        ...
    test/
        spk1.npy
        spk2.npy
        ...
        
embeddings/
    spk1.npy # a (256, ) speaker embedding vector
    spk2.npy
    ...

You can customize some of the training hyperparameters or select resuming checkpoint in config.json. Finally, train the models by

python main.py \ 
  --config_file config.json 
  --num_epoch 3000

You will now see new checkpoint pops up in the checkpoints folder.

Please check out my code and modify them for improvement. Have fun training! ✌️

StarGAN-ZSVC: Unofficial PyTorch Implementation

Related tags

Overview

StarGAN-ZSVC: Unofficial PyTorch Implementation

Set up

Usage

Voice conversion

Training

Owner

Jirayu Burapacheep

A repository that finds a person who looks like you by using face recognition technology.

PyTorch implementation of "Contrast to Divide: self-supervised pre-training for learning with noisy labels"

[ICLR 2021] "Neural Architecture Search on ImageNet in Four GPU Hours: A Theoretically Inspired Perspective" by Wuyang Chen, Xinyu Gong, Zhangyang Wang

This is Official implementation for "Pose-guided Feature Disentangling for Occluded Person Re-Identification Based on Transformer" in AAAI2022

Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks

The implementation of PEMP in paper "Prior-Enhanced Few-Shot Segmentation with Meta-Prototypes"

Code for Mesh Convolution Using a Learned Kernel Basis

unet-family: Ultimate version

Using deep learning to predict gene structures of the coding genes in DNA sequences of Arabidopsis thaliana

Indices Matter: Learning to Index for Deep Image Matting

Code for EMNLP 2021 paper: "Learning Implicit Sentiment in Aspect-based Sentiment Analysis with Supervised Contrastive Pre-Training"

This is a beginner-friendly repo to make a collection of some unique and awesome projects. Everyone in the community can benefit & get inspired by the amazing projects present over here.

This is an easy python software which allows to sort images with faces by gender and after by age.

SEJE Pytorch implementation

Methods to get the probability of a changepoint in a time series.

Implementation of the paper Scalable Intervention Target Estimation in Linear Models (NeurIPS 2021), and the code to generate simulation results.

Back to Event Basics: SSL of Image Reconstruction for Event Cameras

This repository introduces a short project about Transfer Learning for Classification of MRI Images.

Neighbor2Seq: Deep Learning on Massive Graphs by Transforming Neighbors to Sequences

Hough Transform and Hough Line Transform Using OpenCV