Code for Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)

Last update: Dec 28, 2022

Overview

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation (CVPR 2021)

Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu.

Project | Paper | Demo

We propose the Pose-Controllable Audio-Visual System (PC-AVS), which achieves free pose control when driving arbitrary talking faces with audio. Instead of learning pose motions from audio, we leverage another pose source video to compensate only for head motions. The key is to devise an implicit low-dimensional pose code that is free of mouth shape or identity information. In this way, audio-visual representations are modularized into spaces of three key factors: speech content, head pose, and identity information.

Requirements

Python 3.6 and PyTorch 1.3.0 are used. Basic requirements are listed in requirements.txt.

pip install -r requirements.txt
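As a quick sanity check after installation, the following is a minimal sketch that compares the running interpreter and the installed PyTorch against the versions named above. The helper name environment_notes is hypothetical and not part of this repository; a version mismatch is only reported, since other versions may or may not work.

```python
import sys

# Versions stated in the Requirements section.
EXPECTED_PYTHON = (3, 6)
EXPECTED_TORCH = "1.3.0"


def environment_notes():
    """Return a list of human-readable notes about version mismatches."""
    notes = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        notes.append("Python %d.%d found; 3.6 was used by the authors"
                     % sys.version_info[:2])
    try:
        import torch
    except ImportError:
        notes.append("PyTorch is not installed")
    else:
        # Drop local build tags such as '+cu100' before comparing.
        found = torch.__version__.split("+")[0]
        if found != EXPECTED_TORCH:
            notes.append("PyTorch %s found; %s was used by the authors"
                         % (found, EXPECTED_TORCH))
    return notes


if __name__ == "__main__":
    for note in environment_notes() or ["Environment matches the README"]:
        print(note)
```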

Quick Start: Generate Demo Results

Download the pre-trained checkpoints.
Create the default folder ./checkpoints and unzip demo.zip at ./checkpoints/demo. There should be five .pth files in it.
Unzip all *.zip files within the misc folder.
Run the demo scripts:

bash experiments/demo_vox.sh
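If the demo script fails to load weights, it may help to confirm that the checkpoints were unzipped to the expected location. The sketch below counts checkpoint files in ./checkpoints/demo; it assumes the five files carry the .pth extension, and the helper name find_checkpoints is hypothetical rather than part of this repository.

```python
from pathlib import Path


def find_checkpoints(demo_dir="./checkpoints/demo", expected=5):
    """Return the sorted .pth files in demo_dir and whether the count matches."""
    folder = Path(demo_dir)
    # A missing folder is treated the same as an empty one.
    pth_files = sorted(folder.glob("*.pth")) if folder.is_dir() else []
    return pth_files, len(pth_files) == expected


if __name__ == "__main__":
    files, ok = find_checkpoints()
    for path in files:
        print(path)
    print("OK" if ok else "Expected 5 .pth files, found %d" % len(files))
```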

The --gen_video argument is on by default; on Linux systems, ffmpeg >= 4.2.0 is required to use this flag. All frames, along with an avconcat.mp4 video file, will be saved in the ./id_517600055_pose_517600078_audio_681600002/results folder.

From left to right, the result shows the reference input, the generated results, the pose source video, and the original video synced with the driving audio.
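Because --gen_video depends on ffmpeg >= 4.2.0, it can be useful to check the installed version before running the demo. The following is a minimal sketch that parses the first line of `ffmpeg -version`; the function names are hypothetical, and development builds whose version string is not numeric (for example git snapshots) are reported as unknown rather than guessed.

```python
import re
import shutil
import subprocess

MIN_FFMPEG = (4, 2, 0)  # minimum version stated in the README


def parse_ffmpeg_version(text):
    """Extract (major, minor, patch) from `ffmpeg -version` output, or None."""
    match = re.search(r"ffmpeg version n?(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def installed_ffmpeg_version():
    """Return the parsed version of the ffmpeg on PATH, or None if unavailable."""
    if shutil.which("ffmpeg") is None:
        return None
    result = subprocess.run(["ffmpeg", "-version"], stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    return parse_ffmpeg_version(result.stdout)


if __name__ == "__main__":
    version = installed_ffmpeg_version()
    if version is None:
        print("ffmpeg not found or version not recognized")
    elif version >= MIN_FFMPEG:
        print("ffmpeg %d.%d.%d is recent enough" % version)
    else:
        print("ffmpeg %d.%d.%d is older than 4.2.0" % version)
```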

Prepare Testing Meta Data

Automatic VoxCeleb2 Data Formulation

The inference script experiments/demo_vox.sh refers to ./misc/demo.csv for testing data paths. On Linux systems, any applicable csv file can be created automatically by running:

python scripts/prepare_testing_files.py

Then change meta_path_vox in experiments/demo_vox.sh to './misc/demo2.csv' and run:

bash experiments/demo_vox.sh

An additional result should then be saved.

Metadata Details

In detail, scripts/prepare_testing_files.py provides several flags that offer flexibility when formulating the metadata:

--src_pose_path denotes the driving pose source path. It can be an mp4 file or a folder containing frames in the form of %06d.jpg starting from 0.
--src_audio_path denotes the audio source's path. It can be an mp3 audio file or an mp4 video file. If a video is given, its frames will be automatically saved in ./misc/Mouth_Source/video_name, and the --src_mouth_frame_path flag is disabled.
--src_mouth_frame_path. When --src_audio_path is not a video path, this flag can provide the folder containing the video frames synced with the source audio.
--src_input_path is the path to the input reference image. When the path is a video file, we will convert it to frames.
--csv_path is the path to the to-be-saved metadata.
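The flags above can be assembled programmatically. The sketch below is only an illustration: it assumes each flag takes a single path value, the helper names and all paths in the usage example are hypothetical, and the command is built and printed but not executed. It also includes a check that a frame folder follows the %06d.jpg naming starting from 0, as required for --src_pose_path.

```python
import shlex
from pathlib import Path


def frames_are_contiguous(folder):
    """True if folder holds 000000.jpg, 000001.jpg, ... with no gaps or extras."""
    names = sorted(p.name for p in Path(folder).glob("*.jpg"))
    expected = ["%06d.jpg" % i for i in range(len(names))]
    return bool(names) and names == expected


def build_prepare_command(src_pose_path, src_audio_path, src_input_path,
                          csv_path, src_mouth_frame_path=None):
    """Build the argument list for scripts/prepare_testing_files.py."""
    cmd = ["python", "scripts/prepare_testing_files.py",
           "--src_pose_path", str(src_pose_path),
           "--src_audio_path", str(src_audio_path),
           "--src_input_path", str(src_input_path),
           "--csv_path", str(csv_path)]
    # The README says a video audio source disables --src_mouth_frame_path,
    # so the flag is only passed when the audio source is not an mp4 file.
    audio_is_video = str(src_audio_path).lower().endswith(".mp4")
    if src_mouth_frame_path is not None and not audio_is_video:
        cmd += ["--src_mouth_frame_path", str(src_mouth_frame_path)]
    return cmd


if __name__ == "__main__":
    # All paths below are placeholders for illustration only.
    command = build_prepare_command(
        src_pose_path="./misc/my_pose_frames",
        src_audio_path="./misc/my_audio.mp3",
        src_input_path="./misc/my_reference.jpg",
        csv_path="./misc/demo2.csv",
        src_mouth_frame_path="./misc/my_mouth_frames",
    )
    print(" ".join(shlex.quote(part) for part in command))
```

Passing the returned list to subprocess.run would invoke the script; printing it first makes it easy to inspect the arguments before running anything.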

You can manually modify the metadata csv file or add lines to it according to the rules defined in the scripts/prepare_testing_files.py file or the dataloader data/voxtest_dataset.py.

We provide a number of demo choices in the misc folder, including several used in our video. Feel free to rearrange them, even across folders. You are also welcome to record your own audio files.

Self-Prepared Data Processing

Our model handles only VoxCeleb2-like cropped data, so pre-processing is needed for self-prepared data.

Coming soon

Train Your Own Model

Coming soon

License and Citation

The usage of this software is under CC-BY-4.0.

@InProceedings{zhou2021pose,
author = {Zhou, Hang and Sun, Yasheng and Wu, Wayne and Loy, Chen Change and Wang, Xiaogang and Liu, Ziwei},
title = {Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Acknowledgement

The structure of this codebase is borrowed from SPADE.
The generator is borrowed from stylegan2-pytorch.
The audio encoder is borrowed from voxceleb_trainer.
