MakeItTalk: Speaker-Aware Talking-Head Animation

Last update: Jan 08, 2023

This is the code repository implementing the paper:

MakeItTalk: Speaker-Aware Talking-Head Animation

Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, Dingzeyu Li

SIGGRAPH Asia 2020

Abstract We present a method that generates expressive talking-head videos from a single facial image with audio as the only input. In contrast to previous attempts to learn direct mappings from audio to raw pixels for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting the speaker-aware dynamics. Based on this intermediate representation, our method works with many portrait images in a single unified framework, including artistic paintings, sketches, 2D cartoon characters, Japanese mangas, and stylized caricatures. In addition, our method generalizes well for faces and characters that were not observed during training. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking-heads of significantly higher quality compared to prior state-of-the-art methods.

[Project page] [Paper] [Video] [Arxiv] [Colab Demo] [Colab Demo TL;DR]

Figure. Given an audio speech signal and a single portrait image as input (left), our model generates speaker-aware talking-head animations (right). Neither the speech signal nor the input face image was observed during model training. Our method creates both non-photorealistic cartoon animations (top) and natural human face videos (bottom).

Updates

facewarp source code and compile instructions
Pre-trained models
Google Colab quick demo for natural faces [detail] [TL;DR]
Training code for each module
Customized puppet creating tool

Requirements

Python environment 3.6

conda create -n makeittalk_env python=3.6
conda activate makeittalk_env

ffmpeg (https://ffmpeg.org/download.html)

sudo apt-get install ffmpeg

python packages

pip install -r requirements.txt

winehq-stable for cartoon face warping on Ubuntu (https://wiki.winehq.org/Ubuntu). Tested on Ubuntu 16.04 with wine==5.0.3.

sudo dpkg --add-architecture i386
wget -nc https://dl.winehq.org/wine-builds/winehq.key
sudo apt-key add winehq.key
sudo apt-add-repository 'deb https://dl.winehq.org/wine-builds/ubuntu/ xenial main'
sudo apt update
sudo apt install --install-recommends winehq-stable
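Before running the pipeline, it can help to confirm that the external tools installed above are reachable. The following is a minimal sketch, assuming the executables are named `ffmpeg` and `wine` on your PATH; the helper name `missing_tools` is hypothetical and not part of this repository.

```python
import shutil


def missing_tools(tools=("ffmpeg", "wine")):
    """Return the names in `tools` that cannot be found on PATH."""
    # Assumption: the tools are invoked by these executable names.
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

wine is only needed for cartoon face warping, so a missing `wine` does not necessarily block the natural-face pipeline.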

Pre-trained Models

Download the following pre-trained models to the examples/ckpt folder to test your own animation.

Model	Link to the model
Voice Conversion	Link
Speech Content Module	Link
Speaker-aware Module	Link
Image2Image Translation Module	Link
Non-photorealistic Warping (.exe)	Link
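After downloading, a quick listing of the checkpoint folder can confirm that the files landed in the right place. This is a small sketch only: the helper `list_checkpoints` is hypothetical, and because the model file names are not given here, it simply reports whatever files it finds in examples/ckpt.

```python
from pathlib import Path


def list_checkpoints(ckpt_dir="examples/ckpt"):
    """Return sorted file names in the checkpoint folder (empty list if it is missing)."""
    folder = Path(ckpt_dir)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())


if __name__ == "__main__":
    files = list_checkpoints()
    print(f"{len(files)} file(s) in examples/ckpt:", files)
```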

Animate Your Portraits!

Download the pre-trained embedding [here] and save it to the examples/dump folder.

Natural Human Faces / Paintings

Crop your portrait image to 256x256 and put it under the examples folder in .jpg format. Make sure the head is roughly centered (check the existing examples for reference).
Put your test audio files under the examples folder as well, in .wav format.
Animate!

python main_end2end.py --jpg <portrait_file>

Use the additional arguments --amp_lip_x, --amp_lip_y, and --amp_pos to amplify lip motion (along the x/y axes) and head motion displacement. The default values are 2., 2., and .5 respectively.
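The cropping step above can be sketched in Python. This is an illustrative helper, not part of the repository: it assumes Pillow is installed, the function names are hypothetical, and a centered square crop is only a starting point, since the head may not sit in the middle of the original photo.

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def crop_portrait(src_path, dst_path, size=256):
    """Center-crop an image to a square and resize it to size x size."""
    from PIL import Image  # assumption: Pillow is available in the environment

    with Image.open(src_path) as img:
        box = center_square_box(*img.size)
        # Convert to RGB so the result can be saved in .jpg format.
        img.convert("RGB").crop(box).resize((size, size)).save(dst_path)
```

For example, `crop_portrait("photo.png", "examples/my_portrait.jpg")` would write a 256x256 JPEG; both file names here are placeholders.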

Cartoon Faces

Put your test audio files under the examples folder in .wav format.
Animate one of the existing puppets:

Puppet Name	wilk	roy	sketch	color	cartoonM	danbooru1

python main_end2end_cartoon.py --jpg <puppet_face_image> --jpg_bg <puppet_background_image>

--jpg_bg takes a same-size image used as the background of the animation, such as the puppet's body or a fixed overall background. If you want to use a background, make sure the puppet face image (i.e. the --jpg image) is in png format and transparent in the non-face area. If you don't need a background, still create a same-size image (e.g. a pure white image) to fill the argument.
Use the additional arguments --amp_lip_x, --amp_lip_y, and --amp_pos to amplify lip motion (along the x/y axes) and head motion displacement. The default values are 2., 2., and .5 respectively.
Create your own puppets (ToDo...)
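When trying several amplification settings, it may be convenient to assemble the cartoon command programmatically. The sketch below only builds and prints the argument list; the helper name and the example file names are hypothetical, and it assumes the script accepts space-separated values for each flag.

```python
import shlex


def build_cartoon_command(jpg, jpg_bg, amp_lip_x=2.0, amp_lip_y=2.0, amp_pos=0.5):
    """Assemble the argument list for main_end2end_cartoon.py."""
    # Defaults mirror the documented values (2., 2., .5).
    return [
        "python", "main_end2end_cartoon.py",
        "--jpg", jpg,
        "--jpg_bg", jpg_bg,
        "--amp_lip_x", str(amp_lip_x),
        "--amp_lip_y", str(amp_lip_y),
        "--amp_pos", str(amp_pos),
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own puppet files.
    cmd = build_cartoon_command("my_puppet.png", "my_puppet_bg.png", amp_pos=0.8)
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually launch the animation from the repository root, the same list could be passed to `subprocess.run(cmd, check=True)`.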

Train

Train Voice Conversion Module

Todo...

Train Content Branch

Create the dataset root directory.
Dataset: download the preprocessed dataset [here] and put it under /dump.

Training script: run the script below. Models will be saved in /ckpt/.

python main_train_content.py --train --write --root_dir <root_dir> --name <train_instance_name>

Train Speaker-Aware Branch

Todo...

Train Image-to-Image Translation

Todo...

License

Acknowledgement

We would like to thank Timothy Langlois for the narration, and Kaizhi Qian for the help with the voice conversion module. We thank Jakub Fiser for implementing the real-time GPU version of the triangle morphing algorithm. We thank Daichi Ito for sharing the caricature image and Dave Werner for Wilk, the gruff but ultimately lovable puppet.

This research is partially funded by NSF (EAGER-1942069) and a gift from Adobe. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the MassTech Collaborative.

Owner

Adobe Research
