The codebase for Data-driven general-purpose voice activity detection.

Last update: Nov 27, 2022

Overview

Data driven GPVAD

Repository for the TASLP 2021 paper "Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training".

Sample predictions against other methods

Noise robustness

Results

Our best model trained on the SRE (V3) dataset obtains the following results:

Dataset	Precision	Recall	F1	AUC	FER	Event-F1
aurora_clean	96.844	95.102	95.93	98.66	3.06	74.8
aurora_noisy	90.435	92.871	91.544	97.63	6.68	54.45
dcase18	89.202	88.362	88.717	95.2	10.82	57.85
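For readers unfamiliar with the frame-level columns, the sketch below computes precision, recall, F1 and frame error rate (FER) from two binary frame sequences, treating speech as the positive class. It is an illustration only: the function name `frame_metrics` is hypothetical, and the averaging behind the numbers in the table may differ, so refer to the paper for the exact evaluation protocol. AUC and Event-F1 are not covered here.

```python
def frame_metrics(reference, prediction):
    """Frame-level precision, recall, F1 and FER, all in percent.

    Both inputs are equal-length sequences of 0/1 flags (1 = speech).
    Hypothetical helper for illustration, not part of the repository.
    """
    if len(reference) != len(prediction):
        raise ValueError("reference and prediction must have the same length")
    if len(reference) == 0:
        raise ValueError("inputs must not be empty")

    pairs = list(zip(reference, prediction))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # speech found
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false alarm
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed speech

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # FER: fraction of frames whose label is wrong.
    fer = (fp + fn) / len(pairs)

    return {
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
        "fer": 100 * fer,
    }


reference = [0, 1, 1, 1, 0, 0, 1, 1]
prediction = [0, 1, 1, 0, 0, 1, 1, 1]
print(frame_metrics(reference, prediction))
```

In this toy example there are four correctly detected speech frames, one false alarm and one miss, which gives 80% precision, recall and F1 and a FER of 25%.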

Usage

We provide most of our pretrained models in this repository, including:

Both teachers (T_1, T_2)
Unbalanced AudioSet pretrained model
VoxCeleb 2 pretrained model
Our best submission (SRE V3 trained)

To download the repository and run a prediction on the example file:

git clone https://github.com/RicherMans/Datadriven-VAD
cd Datadriven-VAD
pip3 install -r requirements.txt
python3 forward.py -w example/example.wav

Running this will print:

|   index | event_label   |   onset |   offset | filename            |
|--------:|:--------------|--------:|---------:|:--------------------|
|       0 | Speech        |    0.28 |     0.94 | example/example.wav |
|       1 | Speech        |    1.04 |     2.22 | example/example.wav |

Predicting voice activity

The script supports both single-file and filelist-based batch prediction. To obtain VAD predictions for a single file:

python3 forward.py -w example/example.wav

To process many files in a batch, first prepare a filelist:

find . -type f -name '*.wav' > wavlist.txt

Then run:

python3 forward.py -l wavlist.txt
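The same kind of filelist can be produced from Python when `find` is unavailable (for example on Windows). This is a small sketch using only `pathlib`; `write_wavlist` is a hypothetical helper name, and it assumes that one path per line, as produced by `find`, is all the filelist needs.

```python
from pathlib import Path


def write_wavlist(root, output):
    """Write every *.wav file found below `root` to `output`, one per line.

    Hypothetical helper mirroring `find <root> -type f -name '*.wav'`.
    Returns the list of paths that were written.
    """
    paths = sorted(
        str(p) for p in Path(root).rglob("*.wav") if p.is_file()
    )
    Path(output).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

Calling `write_wavlist(".", "wavlist.txt")` from the directory that holds the audio files produces a list that can then be passed to `forward.py` with `-l`.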

Extra parameters

-model selects the pretrained model. Can be one of t1, t2, v2, a2, a2_v2, sre. Refer to the paper for a description of each model. The default is sre.
-soft outputs the raw probabilities instead of human-readable timestamps.
-hard outputs the post-processed 0/1 flags indicating speech instead of human-readable timestamps. Please note that this differs from the paper, which thresholded the soft probabilities without post-processing.
-th adjusts the threshold. If a single threshold is passed (e.g., -th 0.5), simple binarization is used. Otherwise, the default double threshold is used, which is equivalent to -th 0.5 0.1.
-o writes the results into a new folder.
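To illustrate the difference between the two `-th` modes, here is a minimal pure-Python sketch of single-threshold binarization and of double (hysteresis) thresholding on a list of frame probabilities. It is not the repository's implementation: the function names are hypothetical, and it assumes the common definition in which a segment is seeded by a frame above the high threshold and extended in both directions while neighbouring frames stay above the low threshold.

```python
def binarize(probs, threshold=0.5):
    """Single threshold: a frame is speech if its probability exceeds it."""
    return [1 if p > threshold else 0 for p in probs]


def double_threshold(probs, high=0.5, low=0.1):
    """Hysteresis thresholding (illustrative sketch, assumed semantics).

    Frames above `high` seed a segment; the segment then grows left and
    right over all adjacent frames that are still above `low`.
    """
    n = len(probs)
    flags = [0] * n
    i = 0
    while i < n:
        if probs[i] > high:
            start = i
            while start > 0 and probs[start - 1] > low:
                start -= 1  # grow towards the past
            end = i
            while end + 1 < n and probs[end + 1] > low:
                end += 1  # grow towards the future
            for j in range(start, end + 1):
                flags[j] = 1
            i = end + 1  # continue after the segment just marked
        else:
            i += 1
    return flags


probs = [0.05, 0.2, 0.6, 0.3, 0.05, 0.2, 0.3, 0.05]
print(binarize(probs, 0.5))            # [0, 0, 1, 0, 0, 0, 0, 0]
print(double_threshold(probs, 0.5, 0.1))  # [0, 1, 1, 1, 0, 0, 0, 0]
```

The second region (0.2, 0.3) never exceeds the high threshold, so it stays unmarked, while the first segment is widened to include its low-probability edges.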

Training from scratch

To rerun our work, first prepare some data and extract log-Mel spectrogram features. For example, suppose you have downloaded the balanced subset of AudioSet and stored all files in the folder data/balanced/. Then:

cd data;
mkdir hdf5 csv_labels;
find balanced -type f > wavs.txt;
python3 extract_features.py wavs.txt -o hdf5/balanced.h5
h5ls -r hdf5/balanced.h5 | awk -F[/' '] 'BEGIN{print "filename","hdf5path"}NR>1{print $2,"hdf5/balanced.h5"}'> csv_labels/balanced.csv

The input for our label prediction script is a csv file with exactly two columns, filename and hdf5path.

An example csv_labels/balanced.csv would be:

filename hdf5path
--PJHxphWEs_30.000.wav hdf5/balanced.h5
--ZhevVpy1s_50.000.wav hdf5/balanced.h5
--aE2O5G5WE_0.000.wav hdf5/balanced.h5
--aO5cdqSAg_30.000.wav hdf5/balanced.h5
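Because the label prediction script expects exactly the two columns filename and hdf5path, it can be worth validating the file before starting a long run. The following sketch uses only the standard library; `read_label_csv` is a hypothetical helper, and it assumes the file is whitespace-separated as in the example above, which also means filenames must not contain spaces.

```python
def read_label_csv(lines):
    """Validate and parse a two-column label file.

    `lines` is any iterable of strings, e.g. an open file object.
    Returns a list of (filename, hdf5path) tuples.
    Hypothetical helper for illustration, not part of the repository.
    """
    expected_header = ["filename", "hdf5path"]
    header_seen = False
    rows = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any whitespace, drops padding
        if not fields:
            continue  # ignore blank lines
        if not header_seen:
            if fields != expected_header:
                raise ValueError(
                    f"line {lineno}: expected header {expected_header}, "
                    f"got {fields}"
                )
            header_seen = True
            continue
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 columns, got {len(fields)}"
            )
        rows.append((fields[0], fields[1]))
    return rows


sample = [
    "filename hdf5path\n",
    "--PJHxphWEs_30.000.wav hdf5/balanced.h5   \n",
    "--aO5cdqSAg_30.000.wav hdf5/balanced.h5\n",
]
print(read_label_csv(sample))
```

With a real file, pass the open file handle directly, for example `with open("csv_labels/balanced.csv") as fh: rows = read_label_csv(fh)`.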

After feature extraction, proceed to predict labels:

mkdir -p softlabels/{hdf5,csv};
python3 prepare_labels.py --pre ../pretrained_models/teacher1/model.pth csv_labels/balanced.csv softlabels/hdf5/balanced.h5 softlabels/csv/balanced.csv

Lastly, just train:

cd ../; #Go to project root
# Change the config according to your input data
python3 run.py train configs/example.yaml

Citation

If you use this work, please cite it in your publications.

@article{Dinkel2021,
author = {Dinkel, Heinrich and Wang, Shuai and Xu, Xuenan and Wu, Mengyue and Yu, Kai},
doi = {10.1109/TASLP.2021.3073596},
issn = {2329-9290},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
pages = {1542--1555},
title = {{Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training}},
url = {https://ieeexplore.ieee.org/document/9405474/},
volume = {29},
year = {2021}
}

and

@inproceedings{Dinkel2020,
  author={Heinrich Dinkel and Yefei Chen and Mengyue Wu and Kai Yu},
  title={{Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3665--3669},
  doi={10.21437/Interspeech.2020-0995},
  url={http://dx.doi.org/10.21437/Interspeech.2020-0995}
}

Owner

Heinrich Dinkel
