Code for the Active Speakers in Context Paper (CVPR2020)

Overview

Active Speakers in Context

This repo contains the official code and models for the "Active Speakers in Context" CVPR 2020 paper.

Before Training

The code relies on multiple external libraries go to ./scripts/dev_env.sh.an recreate the suggested envirroment.

This code works over face crops and their corresponding audio track, before you start training you need to preprocess the videos in the AVA dataset. We have 3 utility files that contain the basic data to support this process, download them using ./scripts/dowloads.sh.

  1. Extract the audio tracks from every video in the dataset. Go to ./data/extract_audio_tracks.py in main adapt the ava_video_dir (directory with the original ava videos) and target_audios (empty directory where the audio tracks will be stored) to your local file system. The code relies on 16k .wav files and will fail with other formats and bit rates.
  2. Slice the audio tracks by timestamp. Go to ./data/slice_audio_tracks.py in main adapt the ava_audio_dir (the directory with the audio tracks you extracted on step 1), output_dir (empty directory where you will store the sliced audio files) and csv (the utility file you download previously, use the set accordingly) to your local file system.
  3. Extract the face crops by timestamp. Go to ./data/extract_face_crops_time.py in main adapt the ava_video_dir (directory with the original ava videos), csv_file (the utility file you download previously, use the train/val/test set accordingly) and output_dir (empty directory where you will store the face crops) to your local file system. This process will result in about 124GB extra data.

The full audio tracks obtained on step 1. will not be used anymore.

Training

Training the ASC is divided in two major stages: the optimization of the Short-Term Encoder (similar to google baseline) and the optimization of the Context Ensemble Network. The second step includes the pair-wise refinement and the temporal refinement, and relies on a full forward pass of the Short-Term Encoder on the training and validation sets.

Training the Short-Term Encoder

Got to ./core/config.py and modify the STE_inputs dictionary so that the keys audio_dir, video_dir and models_out point to the audio clips, face crops (those extracted on ‘Before Training’) and an empty directory where the STE models will be saved.

Execute the script STE_train.py clip_lenght cuda_device_number, we used clip_lenght=11 on the paper, but it can be set to any uneven value greater than 0 (performance will vary!).

Forward Short Term Encoder

The Active Speaker Context relies on the features extracted from the STE for its optimization, execute the script python STE_forward.py clip_lenght cuda_device_number, use the same clip_lenght as the training. Check lines 44 and 45 to switch between a list of training and val videos, you will need both subsets for the next step.

If you want to evaluate on the AVA Active Speaker Datasets, use ./STE_postprocessing.py, check lines 44 to 50 and adjust the files to your local file system.

Training the ASC Module

Once all the STE features have been calculated, go to ./core/config.py and change the dictionary ASC_inputs modify the value of keys, features_train_full, features_val_full, and models_out so that they point to the local directories where the features extracted with the STE in the train and val set have been stored, and an empty directory where the ASC models will 'be stored. Execute ./ASC_train.py clip_lenght skip_frames speakers cuda_device_number clip_lenght must be the same clip size used to train the STE, skip_frames determines the amount of frames in between sampled clips, we used 4 for the results presented in the paper, speakers is the number of candidates speakers in the contex.

Forward ASC

use ./ASC_forward.py clips time_stride speakers cuda_device_number to forward the models produced by the last step. Use the same clip and stride configurations. You will get one csv file for every video, for evaluation purposes use the script ASC_predcition_postprocessing.py to generate a single CSV file which is compatible with the evaluation tool, check lines 54 to 59 and adapt the paths to your local configuration.

If you want to evaluate on the AVA Active Speaker Datasets, use ./ASC_predcition_postprocessing.py, check lines 54 to 59 and adjust the files to your local file system.

Pre-Trained Models

Short Term Encoder

Active Speaker Context

Prediction Postprocessing and Evaluation

The prediction format follows the very same format of the AVA-Active speaker dataset, but contains an extra value for the active speaker class in the final column. The script ./STE_postprocessing.py handles this step. Check lines 44, 45 and 46 and set the directory where you saved the output of the forward pass (44), the directory with the original ava csv (45) and and empty temporary directory (46). Additionally set on lines 48 and 49 the outputs of the script, one of them is the final prediction formated to use the official evaluation tool and the other one is a utility file to use along the same tool. Notice you can do some temporal smoothing on the function 'softmax_feats', is a simple median filter and you can choose the window size on lines 35 and 36.

A boosting-based Multiple Instance Learning (MIL) package that includes MIL-Boost and MCIL-Boost

A boosting-based Multiple Instance Learning (MIL) package that includes MIL-Boost and MCIL-Boost

Jun-Yan Zhu 27 Aug 08, 2022
GUI for TOAD-GAN, a PCG-ML algorithm for Token-based Super Mario Bros. Levels.

If you are using this code in your own project, please cite our paper: @inproceedings{awiszus2020toadgan, title={TOAD-GAN: Coherent Style Level Gene

Maren A. 13 Dec 14, 2022
MultiTaskLearning - Multi Task Learning for 3D segmentation

Multi Task Learning for 3D segmentation Perception stack of an Autonomous Drivin

2 Sep 22, 2022
Pyramid Grafting Network for One-Stage High Resolution Saliency Detection. CVPR 2022

PGNet Pyramid Grafting Network for One-Stage High Resolution Saliency Detection. CVPR 2022, CVPR 2022 (arXiv 2204.05041) Abstract Recent salient objec

CVTEAM 109 Dec 05, 2022
All course materials for the Zero to Mastery Deep Learning with TensorFlow course.

All course materials for the Zero to Mastery Deep Learning with TensorFlow course.

Daniel Bourke 3.4k Jan 07, 2023
Curved Projection Reformation

Description Assuming that we already know the image of the centerline, we want the lumen to be displayed on a plane, which requires curved projection

夜听残荷 5 Sep 11, 2022
Official PyTorch implementation of "Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets" (ICLR 2021)

Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets This is the official PyTorch implementation for the paper Rapid Neural A

48 Dec 26, 2022
A collection of resources, problems, explanations and concepts that are/were important during my Data Science journey

Data Science Gurukul List of resources, interview questions, concepts I use for my Data Science work. Topics: Basics of Programming with Python + Unde

Smaranjit Ghose 10 Oct 25, 2022
Towards uncontrained hand-object reconstruction from RGB videos

Towards uncontrained hand-object reconstruction from RGB videos Yana Hasson, Gül Varol, Ivan Laptev and Cordelia Schmid Project page Paper Table of Co

Yana 69 Dec 27, 2022
Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks

Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks (SDPoint) This repository contains the cod

Jason Kuen 17 Jul 04, 2022
The implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer"

Shuffle Transformer The implementation of "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer" Introduction Very recently, window-

87 Nov 29, 2022
Compact Bilinear Pooling for PyTorch

Compact Bilinear Pooling for PyTorch. This repository has a pure Python implementation of Compact Bilinear Pooling and Count Sketch for PyTorch. This

Grégoire Payen de La Garanderie 234 Dec 07, 2022
Categorizing comments on YouTube into different categories.

Youtube Comments Categorization This repo is for categorizing comments on a youtube video into different categories. negative (grievances, complaints,

Rhitik 5 Nov 26, 2022
Revisiting Temporal Alignment for Video Restoration

Revisiting Temporal Alignment for Video Restoration [arXiv] Kun Zhou, Wenbo Li, Liying Lu, Xiaoguang Han, Jiangbo Lu We provide our results at Google

52 Dec 25, 2022
Code, final versions, and information on the Sparkfun Graphical Datasheets

Graphical Datasheets Code, final versions, and information on the SparkFun Graphical Datasheets. Generated Cells After Running Script Example Complete

SparkFun Electronics 102 Jan 05, 2023
验证码识别 深度学习 tensorflow 神经网络

captcha_tf2 验证码识别 深度学习 tensorflow 神经网络 使用卷积神经网络,对字符,数字类型验证码进行识别,tensorflow使用2.0以上 目前项目还在更新中,诸多bug,欢迎提出issue和PR, 希望和你一起共同完善项目。 实例demo 训练过程 优化器选择: Adam

5 Apr 28, 2022
Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving

SalsaNext: Fast, Uncertainty-aware Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving Abstract In this paper, we introduce SalsaNext f

308 Jan 04, 2023
Neural Turing Machine (NTM) & Differentiable Neural Computer (DNC) with pytorch & visdom

Neural Turing Machine (NTM) & Differentiable Neural Computer (DNC) with pytorch & visdom Sample on-line plotting while training(avg loss)/testing(writ

Jingwei Zhang 269 Nov 15, 2022
RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation (CIKM'17)

RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation This is the implementation of RATE: Overcoming Noise and Spar

Yu Zhang 5 Feb 10, 2022
Official repository of DeMFI (arXiv.)

DeMFI This is the official repository of DeMFI (Deep Joint Deblurring and Multi-Frame Interpolation). [ArXiv_ver.] Coming Soon. Reference Jihyong Oh a

Jihyong Oh 56 Dec 14, 2022