Localizing-Visual-Sounds-the-Hard-Way

Code and Dataset for "Localizing Visual Sounds the Hard Way".

The repo contains code and our pre-trained model.

Environment

Python 3.6.8
Pytorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, testing data and ground truth should be downloaded from learning to localize sound source.

Then run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"

(Note, some gt bounding boxes are updated recently, all results on VGG-SS cause a 2~3% difference on IoU.)

Both test data should be placed in the following structure.

data path
│
└───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}

Localizing Visual Sounds the Hard Way

Related tags

Overview

Localizing-Visual-Sounds-the-Hard-Way

Environment

Flickr-SoundNet

VGG-Sound Source

Citation

Owner

Honglie Chen

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

Simulated garment dataset for virtual try-on

Auto HMM: Automatic Discrete and Continous HMM including Model selection

Examples of using f2py to get high-speed Fortran integrated with Python easily

Text Summarization - WCN — Weighted Contextual N-gram method for evaluation of Text Summarization

A PyTorch implementation of deep-learning-based registration

PyTorch implementation of Histogram Layers from DeepHist: Differentiable Joint and Color Histogram Layers for Image-to-Image Translation

GAN-STEM-Conv2MultiSlice - Exploring Generative Adversarial Networks for Image-to-Image Translation in STEM Simulation

Pyramid Pooling Transformer for Scene Understanding

This is a pytorch implementation for the BST model from Alibaba https://arxiv.org/pdf/1905.06874.pdf

Official Repo of my work for SREC Nandyal Machine Learning Bootcamp

Testing the Facial Emotion Recognition (FER) algorithm on animations

💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Public implementation of "Learning from Suboptimal Demonstration via Self-Supervised Reward Regression" from CoRL'21

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks

Codes and models of NeurIPS2021 paper - DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense neural networks

To prepare an image processing model to classify the type of disaster based on the image dataset

RoFormer_pytorch