Repository for the paper: VoiceMe: Personalized voice generation in TTS

Overview

🗣 VoiceMe: Personalized voice generation in TTS

arXiv

Abstract

Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that fit well to photos of faces, art portraits, and cartoons. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well-recovered in the voice, and (3) people are consistently moving towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide number of applications including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairment.

Demos

  • 📢 Demo website
  • 🔇 Unmute to listen to the videos on GitHub:
Examples-for-art-works.mp4
Example-chain.mp4

Preprocessing

Setup the repository

git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD

preprocessing_env="$main_dir/preprocessing-env"
conda create --prefix $preprocessing_env python=3.7
conda activate $preprocessing_env
pip install Cython
pip install git+https://github.com/NVIDIA/[email protected]#egg=nemo_toolkit[all]
pip install requests

Create face styles

We used the same sentence ("Kids are talking by the door", neutral recording) from all 24 speakers of the RAVDESS corpus. You can download all videos by running download_RAVDESS.sh; however, the stills used in the paper are also part of the repository (stills). Create the AI Gahaku styles by running python ai_gahaku.py and the toonified versions by running python toonify.py (you need to add your API key). The steps are summarized below.
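For reference, the face-style steps as shell commands (script names as they appear in this repository; how download_RAVDESS.sh is invoked is an assumption):

# Optional: download the RAVDESS videos (the stills are already included in the repo)
bash download_RAVDESS.sh

# Create the AI Gahaku portrait styles
python ai_gahaku.py

# Create the toonified versions (requires your API key, see toonify.py)
python toonify.py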

Obtain the PCA space

The model used in the paper was trained on SpeakerNet embeddings, so we need to extract these embeddings from a dataset. Here we use the Common Voice data. To download it, run: python preprocess_commonvoice.py --language en

To extract the principal components, run compute_pca.py.
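A minimal sketch of the PCA step, in case you want to see the idea in isolation (compute_pca.py in this repository is the actual implementation; the file name and number of components below are assumptions):

# Illustrative only -- see compute_pca.py for the real pipeline.
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical file holding one SpeakerNet embedding per speaker,
# shape (n_speakers, embedding_dim).
embeddings = np.load("speakernet_embeddings.npy")

pca = PCA(n_components=10)               # component count is an assumption
low_dim = pca.fit_transform(embeddings)  # project speakers into PCA space
print(pca.explained_variance_ratio_.cumsum())  # variance captured so far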

Synthesis

Setup

We'll assume you set up a remote instance for synthesis. Clone the repository and set up the virtual environment:

git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD

synthesis_env="$main_dir/synthesis-env"
conda create --prefix $synthesis_env python=3.7
conda activate $synthesis_env

##############
# Setup Wav2Lip
##############
git clone https://github.com/Rudrabha/Wav2Lip.git
cd Wav2Lip

# Install Requirements
pip install -r requirements.txt
pip install opencv-python-headless==4.1.2.30
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"  --no-check-certificate

# Install as package
mv ../setup_wav2lip.py setup.py
pip install -e .
cd ..


##############
# Setup VITS
##############
git clone https://github.com/jaywalnut310/vits
cd vits

# Install Requirements
pip install -r requirements.txt

# Install monotonic_align
mv monotonic_align ../monotonic_align

# Download the VCTK checkpoint
pip install gdown
gdown https://drive.google.com/uc?id=11aHOlhnxzjpdWDpsz1vFDCzbeEfoIxru

# Install as package
mv ../setup_vits.py setup.py
pip install -e .

cd ../monotonic_align
python setup.py build_ext --inplace
cd ..


pip install flask
pip install wget

You'll need to do the last step manually (let me know if you know an automatic way). Download the checkpoint wav2lip_gan.pth from here and put it in Wav2Lip/checkpoints. Make sure you have espeak installed and that it is on your PATH.
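For example (the download location is hypothetical; only the checkpoint name and target directory come from the step above):

# Place the manually downloaded checkpoint
mv ~/Downloads/wav2lip_gan.pth Wav2Lip/checkpoints/

# Verify that espeak is installed and on PATH
espeak --version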

Running

Start the remote service (I used port 31337):

python server.py --port 31337

You can send an example request locally by running (don't forget to change host and port accordingly):

python request_demo.py

We also made a small 'playground' so you can see how slider values influence the voice. Start the local Flask app client.py. A sketch of the underlying idea follows.
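A minimal sketch of how slider values can map back to a speaker embedding via inverse PCA (illustrative only; the file names and slider count are hypothetical, and the server-side code is authoritative):

# Reconstruct a speaker embedding from slider positions in PCA space.
import numpy as np

mean = np.load("pca_mean.npy")              # hypothetical: (embedding_dim,)
components = np.load("pca_components.npy")  # hypothetical: (n_components, embedding_dim)

sliders = np.array([0.5, -1.2, 0.0, 0.3])   # one value per principal component
embedding = mean + sliders @ components[: len(sliders)]  # back to embedding space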

Experiment

The GSP experiment cannot be shared at this moment, as PsyNet is still under development.

Owner
Pol van Rijn
PhD student at Max Planck Institute for Empirical Aesthetics