Pytorch implementation of our paper LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION.

Related tags

Deep LearningLiMuSE
Overview

LiMuSE

Overview

Pytorch implementation of our paper LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION.

LiMuSE explores group communication on a multi-modal speaker extraction model and further compresses the model size with quantization strategy.

Model

Our proposed model is a multi-steam architecture that takes multichannel mixture, target speaker’s enrolled utterance and visual sequences of detected faces as inputs, and outputs the target speaker’s mask in time domain. The encoded audio representations of mixture are then multiplied by the generated mask to obtain the target speech. Please see the figure below for detailed model structure.

flowchart_limuse

Datasets

We evaluate our system on two-speaker speech separation and speaker extraction problems using GRID dataset. The pretrained face embedding extraction network is trained on LRW dataset and MS-Celeb-1M dataset. And we use SMS-WSJ toolkit to obtain simulated anechoic dual-channel audio mixture. We place 2 microphones at the center of the room. The distance between microphones is 7 cm.

Getting Started

Preparation

If you want to adjust configurations of the framework and the path of dataset, please modify the option/train/train.yml file.

Training

Specify the path to train.yml file and run the training command:

python train.py -opt ./option/train/train.yml

This project supports full-precision and quantization training at the same time. Note that you need to modify two values of QA_flag in train.yml file if you would like to switch between full-precision and quantization stage. QA_flag in training settings stands for weight quantization while the one in net_conf stands for activation quantization.

View tensorboardX

tensorboard --logdir ./tensorboard

Result

  • Hyperparameters of LiMuSE

    Symbol Description Value
    N Number of filters in auto-encoder 128
    L Length of the filters (in audio samples) 16
    T Temperature 5
    X Number of GC-equipped TCN blocks in each repeat 6
    Ra Number of repeats in audio block 2
    Rb Number of repeats in fusion block 1
    K Number of groups -
  • Performance of LiMuSE and TasNet under various configurations. Q stands for quantization, VIS stands for visual cue and VP stands for voiceprint cue. Model size and compression ratio are also reported.

Method K SI-SDR (dB) #Params Model Size Compression Ratio
LiMuSE 32 16.72 0.36M 0.16MB 223.75
16 18.08 0.96M 0.40MB 89.50
LiMuSE (w/o Q) 32 23.77 0.36M 1.44MB 24.86
16 24.90 0.96M 3.84MB 9.32
LiMuSE (w/o Q and VP) 32 18.60 0.19M 0.76MB 47.11
16 24.20 0.52M 2.08MB 17.21
LiMuSE (w/o Q and VIS) 32 15.68 0.22M 0.88MB 40.68
16 21.91 0.55M 2.20MB 16.27
LiMuSE (w/o Q and GC) - 23.67 8.95M 35.8MB 1
TasNet (dual-channel) - 19.94 2.48M 9.92MB -
TasNet (single-channel) - 13.15 2.48M 9.92MB -

Citations

If you find this repo helpful, please consider citing:

@inproceedings{liu2021limuse,
  title={LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION},
  author={Liu, Qinghua and Huang, Yating and Hao, Yunzhe and Xu, Jiaming and Xu, Bo},
  booktitle={arXiv:2111.04063},
  year={2021},
}
Owner
Auditory Model and Cognitive Computing Lab
Auditory Model and Cognitive Computing Laboratory @ Institute of Automation, Chinese Academy of Sciences
Auditory Model and Cognitive Computing Lab
Sound Event Detection with FilterAugment

Sound Event Detection with FilterAugment Official implementation of Heavily Augmented Sound Event Detection utilizing Weak Predictions (DCASE2021 Chal

43 Aug 28, 2022
Codes and Data Processing Files for our paper.

Code Scripts and Processing Files for EEG Sleep Staging Paper 1. Folder Tree ./src_preprocess (data preprocessing files for SHHS and Sleep EDF) sleepE

Chaoqi Yang 18 Dec 12, 2022
A python library for highly configurable transformers - easing model architecture search and experimentation.

A python library for highly configurable transformers - easing model architecture search and experimentation.

Anthony Fuller 51 Nov 20, 2022
The implementation of the CVPR2021 paper "Structure-Aware Face Clustering on a Large-Scale Graph with 10^7 Nodes"

STAR-FC This code is the implementation for the CVPR 2021 paper "Structure-Aware Face Clustering on a Large-Scale Graph with 10^7 Nodes" 🌟 🌟 . 🎓 Re

Shuai Shen 87 Dec 28, 2022
QueryDet: Cascaded Sparse Query for Accelerating High-Resolution SmallObject Detection

QueryDet-PyTorch This repository is the official implementation of our paper: QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small O

Chenhongyi Yang 276 Dec 31, 2022
Monk is a low code Deep Learning tool and a unified wrapper for Computer Vision.

Monk - A computer vision toolkit for everyone Why use Monk Issue: Want to begin learning computer vision Solution: Start with Monk's hands-on study ro

Tessellate Imaging 507 Dec 04, 2022
Animation of solving the traveling salesman problem to optimality using mixed-integer programming and iteratively eliminating sub tours

tsp-streamlit Animation of solving the traveling salesman problem to optimality using mixed-integer programming and iteratively eliminating sub tours.

4 Nov 05, 2022
Structural Constraints on Information Content in Human Brain States

Structural Constraints on Information Content in Human Brain States Code accompanying the paper "The information content of brain states is explained

Leon Weninger 3 Sep 07, 2022
A2LP for short, ECCV2020 spotlight, Investigating SSL principles for UDA problems

Label-Propagation-with-Augmented-Anchors (A2LP) Official codes of the ECCV2020 spotlight (label propagation with augmented anchors: a simple semi-supe

20 Oct 27, 2022
Learning recognition/segmentation models without end-to-end training. 40%-60% less GPU memory footprint. Same training time. Better performance.

InfoPro-Pytorch The Information Propagation algorithm for training deep networks with local supervision. (ICLR 2021) Revisiting Locally Supervised Lea

78 Dec 27, 2022
Research code for Arxiv paper "Camera Motion Agnostic 3D Human Pose Estimation"

GMR(Camera Motion Agnostic 3D Human Pose Estimation) This repo provides the source code of our arXiv paper: Seong Hyun Kim, Sunwon Jeong, Sungbum Park

Seong Hyun Kim 1 Feb 07, 2022
Twin-deep neural network for semi-supervised learning of materials properties

Deep Semi-Supervised Teacher-Student Material Synthesizability Prediction Citation: Semi-supervised teacher-student deep neural network for materials

MLEG 3 Dec 14, 2022
A Tensorflow implementation of CapsNet based on Geoffrey Hinton's paper Dynamic Routing Between Capsules

CapsNet-Tensorflow A Tensorflow implementation of CapsNet based on Geoffrey Hinton's paper Dynamic Routing Between Capsules Notes: The current version

Huadong Liao 3.8k Dec 29, 2022
Spatial color quantization in Rust

rscolorq Rust port of Derrick Coetzee's scolorq, based on the 1998 paper "On spatial quantization of color images" by Jan Puzicha, Markus Held, Jens K

Collyn O'Kane 37 Dec 22, 2022
DexterRedTool - Dexter's Red Team Tool that creates cronjob/task scheduler to consistently creates users

DexterRedTool Author: Dexter Delandro CSEC 473 - Spring 2022 This tool persisten

2 Feb 16, 2022
Session-based Recommendation, CoHHN, price preferences, interest preferences, Heterogeneous Hypergraph, Co-guided Learning, SIGIR2022

This is our implementation for the paper: Price DOES Matter! Modeling Price and Interest Preferences in Session-based Recommendation Xiaokun Zhang, Bo

Xiaokun Zhang 27 Dec 02, 2022
Python library containing BART query generation and BERT-based Siamese models for neural retrieval.

Neural Retrieval Embedding-based Zero-shot Retrieval through Query Generation leverages query synthesis over large corpuses of unlabeled text (such as

Amazon Web Services - Labs 35 Apr 14, 2022
MAGMA - a GPT-style multimodal model that can understand any combination of images and language

MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning Authors repo (alphabetical) Constantin (CoEich), Mayukh (Mayukh

Aleph Alpha GmbH 331 Jan 03, 2023
Improving 3D Object Detection with Channel-wise Transformer

"Improving 3D Object Detection with Channel-wise Transformer" Thanks for the OpenPCDet, this implementation of the CT3D is mainly based on the pcdet v

Hualian Sheng 107 Dec 20, 2022
Deep Residual Learning for Image Recognition

Deep Residual Learning for Image Recognition This is a Torch implementation of "Deep Residual Learning for Image Recognition",Kaiming He, Xiangyu Zhan

Kimmy 561 Dec 01, 2022