Easy to use Audio Tagging in PyTorch

Last update: Dec 22, 2022

Overview

Audio Classification, Tagging & Sound Event Detection in PyTorch

Model Zoo

AudioSet Pretrained Models

Model	Task	mAP (%)	Sample Rate (kHz)	Window Length	Num Mels	Fmax	Weights
CNN14	Tagging	43.1	32	1024	64	14k	download
CNN14_16k	Tagging	43.8	16	512	64	8k	download
CNN14_DecisionLevelMax	SED	38.5	32	1024	64	14k	download

Note: These models serve as the pretrained weights for the fine-tuning tasks below. See audioset-tagging-cnn if you want to train on the AudioSet dataset.

Fine-tuned Classification Models

Model	Dataset	Accuracy (%)	Sample Rate (kHz)	Weights
CNN14	ESC50 (Fold-5)	95.75	32	download
CNN14	FSDKaggle2018 (test)	93.56	32	download
CNN14	SpeechCommandsv1 (val/test)	96.60/96.77	32	download

Fine-tuned Tagging Models

Model	Dataset	mAP (%)	AUC	d-prime	Sample Rate (kHz)	Config	Weights
CNN14	FSDKaggle2019	-	-	-	32	-	-
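The d-prime column in the table above is usually derived from AUC. The following is a minimal sketch, assuming the standard definition d' = sqrt(2) * Phi^-1(AUC) used in AudioSet-style evaluations; the function name `d_prime` is hypothetical and this is not taken from the repository's own metric code.

```python
from math import sqrt
from statistics import NormalDist


def d_prime(auc: float) -> float:
    """Convert a ROC AUC value to d-prime (assumed standard definition)."""
    if not 0.0 < auc < 1.0:
        raise ValueError("AUC must lie strictly between 0 and 1")
    # Phi^-1 is the inverse CDF of the standard normal distribution.
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


if __name__ == "__main__":
    for auc in (0.5, 0.9, 0.97):
        print(f"AUC={auc:.2f} -> d-prime={d_prime(auc):.3f}")
```

An AUC of 0.5 (chance level) maps to a d-prime of 0, and d-prime grows without bound as AUC approaches 1.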

Fine-tuned SED Models

Model	Dataset	F1	Sample Rate (kHz)	Config	Weights
CNN14_DecisionLevelMax	DESED	-	32	-	-

Supported Datasets

Dataset	Task	Classes	Train	Val	Test	Audio Length	Audio Spec	Size
ESC-50	Classification	50	2,000	5 folds	-	5s	44.1kHz, mono	600MB
UrbanSound8k	Classification	10	8,732	10 folds	-	<=4s	Varies	5.6GB
FSDKaggle2018	Classification	41	9,473	-	1,600	300ms~30s	44.1kHz, mono	4.6GB
SpeechCommandsv1	Classification	30	51,088	6,798	6,835	<=1s	16kHz, mono	1.4GB
SpeechCommandsv2	Classification	35	84,843	9,981	11,005	<=1s	16kHz, mono	2.3GB
FSDKaggle2019*	Tagging	80	4,970+19,815	-	4,481	300ms~30s	44.1kHz, mono	24GB
MTT*	Tagging	50	19,000	-	-	-	-	3GB
DESED*	SED	10	-	-	-	10s	-	-

Notes: Datasets marked with * are not available yet. Classification datasets are treated as multi-class, single-label classification; tagging and SED datasets are treated as multi-label classification.
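The difference between the two treatments shows up in how model outputs become predictions. The sketch below assumes the usual convention (softmax plus argmax for single-label, per-class sigmoid plus a threshold for multi-label); the function names and the 0.5 threshold are illustrative and not taken from the repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_single_label(logits):
    """Multi-class, single-label: exactly one class index wins."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def predict_multi_label(logits, threshold=0.5):
    """Multi-label: every class whose probability clears the threshold."""
    return [i for i, x in enumerate(logits) if sigmoid(x) >= threshold]


if __name__ == "__main__":
    logits = [2.0, -1.0, 0.5]
    print(predict_single_label(logits))  # one class index
    print(predict_multi_label(logits))   # zero or more class indices
```

With softmax the class probabilities compete and sum to 1; with sigmoid each class is scored independently, so a clip can carry several tags or none.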

Dataset Structure

Download the dataset and prepare it into the following structure.

datasets
|__ ESC50
    |__ audio

|__ Urbansound8k
    |__ audio

|__ FSDKaggle2018
    |__ audio_train
    |__ audio_test
    |__ FSDKaggle2018.meta
        |__ train_post_competition.csv
        |__ test_post_competition_scoring_clips.csv

|__ SpeechCommandsv1/v2
    |__ bed
    |__ bird
    |__ ...
    |__ testing_list.txt
    |__ validation_list.txt
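Before training, it can help to confirm that the folders match the tree above. This is a minimal sketch: the `EXPECTED` table and the `missing_entries` helper are hypothetical and only encode the entries listed above. SpeechCommands is left out because its class folders are abbreviated in the tree.

```python
from pathlib import Path

# Entries copied from the directory tree above (relative to each dataset folder).
EXPECTED = {
    "ESC50": ["audio"],
    "Urbansound8k": ["audio"],
    "FSDKaggle2018": [
        "audio_train",
        "audio_test",
        "FSDKaggle2018.meta/train_post_competition.csv",
        "FSDKaggle2018.meta/test_post_competition_scoring_clips.csv",
    ],
}


def missing_entries(root, dataset):
    """Return the expected paths that do not exist under root/dataset."""
    base = Path(root) / dataset
    return [rel for rel in EXPECTED[dataset] if not (base / rel).exists()]


if __name__ == "__main__":
    for name in EXPECTED:
        missing = missing_entries("datasets", name)
        print(name, "OK" if not missing else f"missing: {missing}")
```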

Augmentations

Currently, the following augmentations are supported; more will be added in the future. You can test their effects with the augmentation testing notebook.

WaveForm Augmentations:

Spectrogram Augmentations:

Time Masking
Frequency Masking
Filter Augmentation
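Time masking and frequency masking both blank out one random band of the spectrogram, along the time axis or the frequency axis respectively. The sketch below illustrates the idea on a plain list-of-lists spectrogram laid out as [mel bin][frame]; `mask_band` and its parameters are hypothetical and do not reflect the repository's implementation, which presumably operates on tensors. Filter Augmentation is not covered here.

```python
import random


def mask_band(spec, axis, max_width, rng=random, fill=0.0):
    """Return a copy of spec with one random band set to `fill`.

    axis=0 masks a band of frequency rows (frequency masking);
    axis=1 masks a band of time columns (time masking).
    """
    n_freq, n_time = len(spec), len(spec[0])
    size = n_freq if axis == 0 else n_time
    width = rng.randint(0, min(max_width, size))  # band width, may be 0
    start = rng.randint(0, size - width)          # first masked index
    out = [row[:] for row in spec]                # leave the input untouched
    for f in range(n_freq):
        for t in range(n_time):
            idx = f if axis == 0 else t
            if start <= idx < start + width:
                out[f][t] = fill
    return out


if __name__ == "__main__":
    spec = [[1.0] * 8 for _ in range(4)]
    for row in mask_band(spec, axis=1, max_width=3):
        print(row)
```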

Usage

Requirements

python >= 3.6
pytorch >= 1.8.1
torchaudio >= 0.8.1

Other requirements can be installed with pip install -r requirements.txt.

Configuration

Create a configuration file in configs. A sample configuration for the ESC50 dataset is provided in the repository.
Copy its contents and edit the fields as needed.
This configuration file is required by the training, evaluation, and prediction scripts.
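This README refers to nested configuration fields with a `>>` notation, such as DATASET >> NAME and TEST >> FILE. The sketch below shows how such a path maps onto a nested mapping once the YAML file has been loaded; the `lookup` helper and the sample values are hypothetical, and the real configuration file contains fields not shown here.

```python
def lookup(config, path, sep=">>"):
    """Resolve a 'SECTION >> FIELD' style path in a nested dict."""
    node = config
    for key in (part.strip() for part in path.split(sep)):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"missing config field: {path}")
        node = node[key]
    return node


if __name__ == "__main__":
    # Hypothetical excerpt; values are placeholders, not repository defaults.
    config = {
        "DATASET": {"NAME": "ESC50"},
        "TEST": {"FILE": "path/to/clip.wav"},
    }
    print(lookup(config, "DATASET >> NAME"))
    print(lookup(config, "TEST >> FILE"))
```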

Training

To train with a single GPU:

$ python tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml

To train with multiple GPUs, set the DDP field in the config file to true and run as follows:

$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/CONFIG_FILE_NAME.yaml
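With `--use_env`, the launcher passes each process its rank through environment variables (`LOCAL_RANK`, `RANK`, `WORLD_SIZE`) rather than a `--local_rank` argument. The sketch below shows how a script might read them; `read_ddp_env` is a hypothetical helper, and how tools/train.py actually consumes these values is not shown in this README.

```python
import os


def read_ddp_env(environ=os.environ):
    """Read the rank variables set by the distributed launcher.

    Falls back to single-process defaults when the variables are absent.
    """
    return {
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # GPU index on this node
        "rank": int(environ.get("RANK", 0)),              # global process index
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of processes
    }


if __name__ == "__main__":
    print(read_ddp_env())
```

With `--nproc_per_node=2` on a single node, the two processes would see local ranks 0 and 1 and a world size of 2.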

Evaluation

Make sure to set MODEL_PATH of the configuration file to your trained model directory.

$ python tools/val.py --cfg configs/CONFIG_FILE.yaml

Audio Classification/Tagging Inference

Set MODEL_PATH of the configuration file to your model's trained weights.
Set DATASET >> NAME to the dataset your model was trained on.
Set the testing audio file path in TEST >> FILE.
Run the following command.

$ python tools/infer.py --cfg configs/CONFIG_FILE.yaml

## for example
$ python tools/infer.py --cfg configs/audioset.yaml

You will get an output similar to this:

Class                     Confidence
----------------------  ------------
Speech                     0.897762
Telephone bell ringing     0.752206
Telephone                  0.219329
Inside, small room         0.20761
Music                      0.0770325
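A table like the one above can be produced by ranking the per-class scores and keeping the top few. This is a minimal sketch under that assumption; `top_k` and `format_table` are hypothetical helpers, not the code in tools/infer.py.

```python
def top_k(scores, labels, k=5):
    """Return the k (label, score) pairs with the highest scores."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def format_table(rows):
    """Render (label, score) pairs as a two-column text table."""
    width = max([len("Class")] + [len(name) for name, _ in rows])
    lines = [f"{'Class':<{width}}  Confidence", f"{'-' * width}  {'-' * 10}"]
    lines += [f"{name:<{width}}  {score:.6f}" for name, score in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    labels = ["Music", "Speech", "Telephone"]
    scores = [0.0770325, 0.897762, 0.219329]
    print(format_table(top_k(scores, labels, k=2)))
```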

Sound Event Detection Inference

Set MODEL_PATH of the configuration file to your model's trained weights.
Set DATASET >> NAME to the dataset your model was trained on.
Set the testing audio file path in TEST >> FILE.
Run the following command.

$ python tools/sed_infer.py --cfg configs/CONFIG_FILE.yaml

## for example
$ python tools/sed_infer.py --cfg configs/audioset_sed.yaml

You will get an output similar to this:

Class                     Start    End
----------------------  -------  -----
Speech                      2.2    7
Telephone bell ringing      0      2.5
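Start and end times like these are typically obtained by thresholding a class's frame-level probabilities and merging consecutive active frames. The sketch below assumes that approach; `frames_to_events`, the 0.5 threshold, and the frame duration are illustrative and not taken from tools/sed_infer.py.

```python
def frames_to_events(probs, threshold=0.5, frame_seconds=0.1):
    """Turn per-frame probabilities for one class into (start, end) times."""
    events = []
    start = None  # index of the first frame of the currently open event
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            events.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # The event is still active at the end of the clip.
        events.append((start * frame_seconds, len(probs) * frame_seconds))
    return events


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.8, 0.2, 0.7]
    print(frames_to_events(probs, frame_seconds=0.5))
```

Running the function once per class and merging the results gives a table of class, start, and end, and events of different classes may overlap in time.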

If you set PLOT to true, a plot will also be shown.

Citations

@misc{kong2020panns,
      title={PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition}, 
      author={Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
      year={2020},
      eprint={1912.10211},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

@misc{gong2021ast,
      title={AST: Audio Spectrogram Transformer}, 
      author={Yuan Gong and Yu-An Chung and James Glass},
      year={2021},
      eprint={2104.01778},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

@misc{nam2021heavily,
      title={Heavily Augmented Sound Event Detection utilizing Weak Predictions}, 
      author={Hyeonuk Nam and Byeong-Yun Ko and Gyeong-Tae Lee and Seong-Hu Kim and Won-Ho Jung and Sang-Min Choi and Yong-Hwa Park},
      year={2021},
      eprint={2107.03649},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}


Releases

v0.2.0 (Aug 17, 2021)

This release includes the following:

Fine-tuned on ESC50, FSDKaggle2018, SpeechCommandsv1
Add waveform augmentations
Add spectrogram augmentations
Add augmentation testing notebook
Add tagging metrics

v0.1.0 (Aug 13, 2021)

Add the following datasets:

ESC50
UrbanSound8k
FSDKaggle2018
SpeechCommandsv1/v2

Release fine-tuned model on ESC50.
