MUSIC-AVQA, CVPR2022 (ORAL)

Related tags

AudioMUSIC-AVQA
Overview

Audio-Visual Question Answering (AVQA)

PyTorch code accompanies our CVPR 2022 paper:

Learning to Answer Questions in Dynamic Audio-Visual Scenarios (Oral Presentation)

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen and Di Hu

Resources: [Paper], [Supplementary], [Poster], [Video]

Project Homepage: https://gewu-lab.github.io/MUSIC-AVQA/


What's Audio-Visual Question Answering Task?

We focus on audio-visual question answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.

MUSIC-AVQA Dataset

The large-scale MUSIC-AVQA dataset of musical performance, which contains 45,867 question-answer pairs, distributed in 9,288 videos for over 150 hours. All QA pairs types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, as an open-ended problem of our AVQA tasks, all 42 kinds of answers constitute a set for selection.

  • QA examples

Model Overview

To solve the AVQA problem, we propose a spatio-temporal grounding model to achieve scene understanding and reasoning over audio and visual modalities. An overview of the proposed framework is illustrated in below figure.

Requirements

python3.6 +
pytorch1.6.0
tensorboardX
ffmpeg
numpy

Usage

  1. Clone this repo

    https://github.com/GeWu-Lab/MUSIC-AVQA_CVPR2022.git
  2. Download data

    Annotations (QA pairs, etc.)

    • Available for download at here
    • The annotation files are stored in JSON format. Each annotation file contains seven different keyword. And more detail see in Project Homepage

    Features

    • We use VGGish, ResNet18, and ResNet (2+1)D to extract audio, 2D frame-level, and 3D snippet-level features, respectively.

    • The audio and visual features of videos in the MUSIC-AVQA dataset can be download from Baidu Drive (password: cvpr):

      • VGGish feature shape: [T, 128]  Download (112.7M)
      • ResNet18 feature shape: [T, 512]  Download (972.6M)
      • R(2+1)D feature shape: [T, 512]  Download (973.9M)
    • The features are in the ./data/feats folder.

    • 14x14 features, too large to share ... but we can extract from raw video frames.

    Download videos frames

    • Raw videos: Availabel at Baidu Drive (password: cvpr):.

      Note: Please move all downloaded videos to a folder, for example, create a new folder named MUSIC-AVQA-Videos, which contains 9,288 real videos and synthetic videos.

    • Raw video frames (1fps): Available at Baidu Drive (14.84GB) (password: cvpr).

    • Download raw videos in the MUSIC-AVQA dataset. The downloaded videos will be in the /data/video folder.

    • Pandas and ffmpeg libraries are required.

  3. Data pre-processing

    Extract audio waveforms from videos. The extracted audios will be in the ./data/audio folder. moviepy library is used to read videos and extract audios.

    python feat_script/extract_audio_cues/extract_audio.py	

    Extract video frames from videos. The extracted frames will be in the data/frames folder.

    python feat_script/extract_visual_frames/extract_frames_adaptive_script.py
  4. Feature extraction

    Audio feature. TensorFlow1.4 and VGGish pretrained on AudioSet is required. Feature file also can be found from here (password: cvpr).

    python feat_script/extract_audio_feat/audio_feature_extractor.py

    2D visual feature. Pretrained models library is required.

    python feat_script/eatract_visual_feat/extract_rgb_feat.py

    3D visual feature.

    python feat_script/eatract_visual_feat/extract_3d_feat.py

    14x14 visual feature.

    python feat_script/extract_visual_feat_14x14/extract_14x14_feat.py
  5. Baseline Model

    Training

    python net_grd_baseline/main_qa_grd_baseline.py --mode train

    Testing

    python net_grd_baseline/main_qa_grd_baseline.py --mode test
  6. Our Audio-Visual Spatial-Temporal Model

    We provide trained models and you can quickly test the results. Test results may vary slightly on different machines.

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Audio-Visual grounding generation

    python grounding_gen/main_grd_gen.py

    Training

    python net_grd_avst/main_avst.py --mode train \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

    Testing

    python net_grd_avst/main_avst.py --mode test \
    	--audio_dir = "path to your audio features"
    	--video_res14x14_dir = "path to your visual res14x14 features"

Results

  1. Audio-visual video question answering results of different methods on the test set of MUSIC-AVQA. The top-2 results are highlighted. Please see the citations in the [Paper] for comparison methods.

  2. Visualized spatio-temporal grounding results

    We provide several visualized spatial grounding results. The heatmap indicates the location of sounding source. Through the spatial grounding results, the sounding objects are visually captured, which can facilitate the spatial reasoning.

    Firstly, ./grounding_gen/models_grd_vis/ should be created.

    python grounding_gen/main_grd_gen_vis.py

Citation

If you find this work useful, please consider citing it.


@ARTICLE{Li2022Learning,
  title	= {Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
  author	= {Guangyao li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu},
  journal	= {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year	= {2022},
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.

License

This project is released under the GNU General Public License v3.0.

Sequencer: Deep LSTM for Image Classification

Sequencer: Deep LSTM for Image Classification Created by Yuki Tatsunami Masato Taki This repository contains implementation for Sequencer. Abstract In

Yuki Tatsunami 111 Dec 16, 2022
A Python library for audio data augmentation. Inspired by albumentations. Useful for machine learning.

Audiomentations A Python library for audio data augmentation. Inspired by albumentations. Useful for deep learning. Runs on CPU. Supports mono audio a

Iver Jordal 1.2k Jan 07, 2023
Speech Algorithms Collections

Speech Algorithms Collections

Ryuk 498 Jan 06, 2023
Real-Time Spherical Microphone Renderer for binaural reproduction in Python

ReTiSAR Implementation of the Real-Time Spherical Microphone Renderer for binaural reproduction in Python [1][2]. Contents: | Requirements | Setup | Q

Division of Applied Acoustics at Chalmers University of Technology 51 Dec 17, 2022
❤️ This Is The EzilaXMusicPlayer Advaced Repo 🎵

Telegram EzilaXMusicPlayer Bot 🎵 A bot that can play music on telegram group's voice Chat ❤️ Requirements 📝 FFmpeg NodeJS nodesource.com Python 3.7+

Sadew Jayasekara 11 Nov 12, 2022
A bot that can play music on Telegram Group and Channel Voice Chats

DaisyXmusic ❤ is the best and only Telegram VC player with playlists, Multi Playback, Channel play and more

TeamOfDaisyX 20 Jun 11, 2021
Read music meta data and length of MP3, OGG, OPUS, MP4, M4A, FLAC, WMA and Wave files with python 2 or 3

tinytag tinytag is a library for reading music meta data of MP3, OGG, OPUS, MP4, M4A, FLAC, WMA and Wave files with python Install pip install tinytag

Tom Wallroth 577 Dec 26, 2022
This bot can stream audio or video files and urls in telegram voice chats

Voice Chat Streamer This bot can stream audio or video files and urls in telegram voice chats :) 🎯 Follow me and star this repo for more telegram bot

WiskeyWorm 4 Oct 09, 2022
Synchronize a local directory of songs' (MP3, MP4) metadata (genre, ratings) and playlists with a Plex server.

PlexMusicSync Synchronize a local directory of songs' (MP3, MP4) metadata (genre, ratings) and playlists (m3u, m3u8) with a Plex server. The song file

Tom Goetz 9 Jul 07, 2022
Stream Music 🎵 𝘼 𝙗𝙤𝙩 𝙩𝙝𝙖𝙩 𝙘𝙖𝙣 𝙥𝙡𝙖𝙮 𝙢𝙪𝙨𝙞𝙘 𝙤𝙣 𝙏𝙚𝙡𝙚𝙜𝙧𝙖𝙢 𝙂𝙧𝙤𝙪𝙥 𝙖𝙣𝙙 𝘾𝙝𝙖𝙣𝙣𝙚𝙡 𝙑𝙤𝙞𝙘𝙚 𝘾𝙝𝙖𝙩𝙨 𝘼𝙫𝙖𝙞𝙡?

Stream Music 🎵 𝘼 𝙗𝙤𝙩 𝙩𝙝𝙖𝙩 𝙘𝙖𝙣 𝙥𝙡𝙖𝙮 𝙢𝙪𝙨𝙞𝙘 𝙤𝙣 𝙏𝙚𝙡𝙚𝙜𝙧𝙖𝙢 𝙂𝙧𝙤𝙪𝙥 𝙖𝙣𝙙 𝘾𝙝𝙖𝙣𝙣𝙚𝙡 𝙑𝙤𝙞𝙘𝙚 𝘾𝙝𝙖𝙩𝙨 𝘼𝙫𝙖𝙞𝙡?

Sadew Jayasekara 15 Nov 12, 2022
Bot duniya Music Player

Bot duniya Music Player Requirements 📝 FFmpeg (Latest) NodeJS nodesource.com (NodeJS 17+) Python (3.10+) PyTgCalls (Lastest) 2nd Telegram Account (ne

Aman Vishwakarma 16 Oct 21, 2022
Stevan KZ 1 Oct 27, 2021
Just-Music - Spotify API Driven Music Web app, that allows to listen and control and share songs

Just Music... Just Music Is A Web APP That Allows Users To Play Song Using Spoti

Ayush Mishra 3 May 01, 2022
GiantMIDI-Piano is a classical piano MIDI dataset contains 10,854 MIDI files of 2,786 composers

GiantMIDI-Piano is a classical piano MIDI dataset contains 10,854 MIDI files of 2,786 composers

Bytedance Inc. 1.3k Jan 04, 2023
An Amazon Music client for Linux (unpretentious)

Amusiz An Amazon Music client for Linux (unpretentious) ↗️ Install You can install Amusiz in multiple ways, choose your favorite. 🚀 AppImage Here you

Mirko Brombin 25 Nov 08, 2022
?️ Open Source Audio Matching and Mastering

Matching + Mastering = ❤️ Matchering 2.0 is a novel Containerized Web Application and Python Library for audio matching and mastering. It follows a si

Sergey Grishakov 781 Jan 05, 2023
Python audio and music signal processing library

madmom Madmom is an audio signal processing library written in Python with a strong focus on music information retrieval (MIR) tasks. The library is i

Institute of Computational Perception 1k Dec 26, 2022
music library manager and MusicBrainz tagger

beets Beets is the media library management system for obsessive music geeks. The purpose of beets is to get your music collection right once and for

beetbox 11.3k Dec 31, 2022
Algorithmic Multi-Instrumental MIDI Continuation Implementation

Matchmaker Algorithmic Multi-Instrumental MIDI Continuation Implementation Taming large-scale MIDI datasets with algorithms This is a WIP so please ch

Alex 2 Mar 11, 2022
voice assistant made with python that search for covid19 data(like total cases, deaths and etc) in a specific country

covid19-voice-assistant voice assistant made with python that search for covid19 data(like total cases, deaths and etc) in a specific country installi

Miguel 2 Dec 05, 2021