Official implementation for paper Render In-between: Motion Guided Video Synthesis for Action Interpolation

Last update: Oct 27, 2022

Overview

Render In-between: Motion Guided Video Synthesis for Action Interpolation

This is the official PyTorch implementation of our work. The proposed framework synthesizes challenging human videos in an action interpolation setting. This repository contains three subdirectories: code and scripts for preparing our collected HumanSlomo dataset, the implementation of the human motion modeling network trained on the large-scale AMASS dataset, and the pose-guided neural rendering model that synthesizes video frames from poses. Please check each subfolder for detailed information and instructions on how to run the code.

HumanSlomo Dataset

We collected a set of high-FPS Creative Commons videos of humans from YouTube. The videos are manually split into several continuous clips for training and testing. You can also build your own video dataset using the provided scripts.

Human Motion Modeling

Our human motion model is trained on AMASS, a large-scale motion capture dataset. We provide code to synthesize 2D human motion sequences for training from the SMPL parameters defined in AMASS. You can also simply use the pre-trained model to interpolate noisy, low-frame-rate human body joints into high-frame-rate motion sequences.
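To make the interpolation task concrete, here is a minimal sketch of a naive linear-interpolation baseline for upsampling a 2D pose sequence. This is not our transformer-based motion model; it only illustrates the input and output shapes of the problem. The function name `upsample_poses_linear`, the `(T, J, 2)` array layout, and the output length of `(T - 1) * rate + 1` frames are assumptions made for this example and may differ from what `inference.py` produces.

```python
import numpy as np


def upsample_poses_linear(poses, rate):
    """Linearly interpolate a (T, J, 2) pose sequence along time.

    Hypothetical baseline: returns (T - 1) * rate + 1 frames, keeping
    every original frame and inserting rate - 1 frames between neighbors.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    poses = np.asarray(poses, dtype=np.float64)
    num_frames = poses.shape[0]
    if num_frames < 2:
        return poses.copy()

    # Target timestamps expressed in units of the original frame index.
    dst_t = np.linspace(0, num_frames - 1, (num_frames - 1) * rate + 1)

    # Index of the left neighbor, clamped so that lo + 1 stays in range.
    lo = np.minimum(np.floor(dst_t).astype(int), num_frames - 2)

    # Interpolation weight of the right neighbor, broadcast over joints.
    w = (dst_t - lo)[:, None, None]
    return (1.0 - w) * poses[lo] + w * poses[lo + 1]
```

A baseline like this cannot correct noisy joints or recover non-linear motion between key frames, which is the motivation for using a learned motion model instead.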

Pose Guided Neural Rendering

The neural rendering model learns to map pose sequences back to the original video domain. The final result is composited autoregressively from the background warped by DAIN and the generated human body, according to the predicted blending mask. The model is trained in a conditional image generation setting and needs only low-frame-rate videos as training data. Therefore, you can train a custom neural rendering model by constructing your own video dataset.
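The mask-based composition of a single frame can be sketched as a standard alpha blend. This is an illustrative assumption rather than the exact code in the rendering model: the function name `composite_frame`, the convention that a mask value of 1 selects the generated human, and the `[0, 1]` float image range are all hypothetical. The autoregressive part (feeding earlier outputs back into the generator) is not shown.

```python
import numpy as np


def composite_frame(human, background, mask):
    """Blend a generated human image over a warped background.

    human, background: (H, W, 3) float arrays in [0, 1].
    mask: (H, W) or (H, W, 1) float array in [0, 1]; by assumption,
          1 means "take the human pixel", 0 means "take the background".
    """
    human = np.asarray(human, dtype=np.float32)
    background = np.asarray(background, dtype=np.float32)
    mask = np.asarray(mask, dtype=np.float32)

    if human.shape != background.shape:
        raise ValueError("human and background must have the same shape")
    if mask.ndim == 2:
        # Add a channel axis so the mask broadcasts over RGB.
        mask = mask[..., None]

    mask = np.clip(mask, 0.0, 1.0)
    return mask * human + (1.0 - mask) * background
```

A soft mask, with values strictly between 0 and 1 near the body boundary, avoids hard seams between the generated person and the warped background.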

Quick Start

⬇️ example.zip [MEGA] (25.4MB)

Download this example action clip, which includes the necessary input files for our pipeline.

The first step is to generate high-FPS motion from low-FPS poses with our motion modeling network.

cd Human_Motion_Modelling
python inference.py --pose-dir ../example/input_poses --save-dir ../example/ --upsample-rate 2

⬇️ checkpoints.zip [MEGA] (147.2MB)

Next, map the high-FPS poses back to video frames with our pose-guided neural rendering model. Download the checkpoint files to the corresponding folder before running the model. As in the first step, the `cd` below is relative to the repository root.

cd Pose_Guided_Neural_Rendering
python inference.py --input-dir ../example/ --save-dir ../example/
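If you prefer to run both steps from one script, the two commands above can be chained as in the sketch below. The helper names `build_steps` and `run_pipeline` are hypothetical and not part of this repository; the directories and flags are copied from the commands above, and the script assumes that the example clip has been extracted to `example/` in the repository root and that the checkpoints are already in place.

```python
import subprocess
import sys
from pathlib import Path


def build_steps(repo_root, example_dir="example", upsample_rate=2):
    """Return (working_directory, command) pairs for the two inference steps."""
    repo_root = Path(repo_root)
    rel = f"../{example_dir}"  # both scripts are run from a subdirectory

    motion = (
        repo_root / "Human_Motion_Modelling",
        [sys.executable, "inference.py",
         "--pose-dir", f"{rel}/input_poses",
         "--save-dir", f"{rel}/",
         "--upsample-rate", str(upsample_rate)],
    )
    render = (
        repo_root / "Pose_Guided_Neural_Rendering",
        [sys.executable, "inference.py",
         "--input-dir", f"{rel}/",
         "--save-dir", f"{rel}/"],
    )
    return [motion, render]


def run_pipeline(repo_root, **kwargs):
    """Run motion interpolation, then rendering; stop on the first failure."""
    for workdir, cmd in build_steps(repo_root, **kwargs):
        subprocess.run(cmd, cwd=workdir, check=True)
```

Calling `run_pipeline(".")` from the repository root runs the same two commands in order, and `check=True` raises an error if either step fails so that rendering never starts on missing poses.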

Citation

@inproceedings{ho2021render,
    author = {Hsuan-I Ho and Xu Chen and Jie Song and Otmar Hilliges},
    title = {Render In-between: Motion Guided Video Synthesis for Action Interpolation},
    booktitle = {BMVC},
    year = {2021}
}

Acknowledgement

We use the pre-processing code in AMASS to synthesize our motion dataset. AlphaPose is used for generating 2D human body poses. DAIN is used for warping background images. Our human motion modeling network is based on the transformer backbone in DETR. Our pose-guided neural rendering model is based on imaginaire. We sincerely thank these authors for their awesome work.
