This is the repo of the manuscript "Dual-branch Attention-In-Attention Transformer for speech enhancement"

Last update: Dec 16, 2022

Overview

DB-AIAT: A Dual-branch attention-in-attention transformer for single-channel SE (https://arxiv.org/abs/2110.06467)

This is the repo of the manuscript "Dual-branch Attention-In-Attention Transformer for speech enhancement", which was accepted at ICASSP 2022.

Abstract: Curriculum learning begins to thrive in the speech enhancement area, which decouples the original spectrum estimation task into multiple easier sub-tasks to achieve better performance. Motivated by that, we propose a dual-branch attention-in-attention transformer-based module dubbed DB-AIAT to handle both coarse- and fine-grained regions of the spectrum in parallel. From a complementary perspective, a magnitude masking branch is proposed to estimate the overall spectral magnitude, while a complex refining branch is designed to compensate for the missing complex spectral details and implicitly derive phase information. Within each branch, we propose a novel attention-in-attention transformer-based module to replace the conventional RNNs and temporal convolutional networks for temporal sequence modeling. Specifically, the proposed attention-in-attention transformer consists of adaptive temporal-frequency attention transformer blocks and an adaptive hierarchical attention module, which can capture long-term time-frequency dependencies and further aggregate global hierarchical contextual information. The experimental results on the VoiceBank + Demand dataset show that DB-AIAT yields state-of-the-art performance (e.g., 3.31 PESQ, 95.6% STOI and 10.79 dB SSNR) over previous advanced systems with a relatively light model size (2.81M).
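To make the dual-branch idea concrete, here is a minimal sketch of how the two branch outputs could be combined for a single time-frequency bin. It assumes that the magnitude masking branch yields a real-valued mask that is applied to the noisy magnitude and coupled with the noisy phase, and that the complex refining branch yields a complex residual that is added on top. The function name `merge_branches` and all numbers are illustrative and are not taken from aia_trans.py.

```python
import cmath


def merge_branches(noisy_bin: complex, mag_mask: float, residual: complex) -> complex:
    """Combine a magnitude mask and a complex residual for one T-F bin."""
    noisy_mag, noisy_phase = cmath.polar(noisy_bin)
    # Coarse estimate: masked magnitude coupled with the (unchanged) noisy phase.
    coarse = cmath.rect(noisy_mag * mag_mask, noisy_phase)
    # Fine estimate: the complex residual adjusts both magnitude and phase.
    return coarse + residual


noisy = complex(3.0, 4.0)  # illustrative noisy bin with magnitude 5
enhanced = merge_branches(noisy, mag_mask=0.5, residual=complex(0.1, -0.2))
print(abs(enhanced), cmath.phase(enhanced))
```

Because the residual is complex, the phase of the merged estimate can differ from the noisy phase even though the mask alone never changes it, which is one way to read "implicitly derive phase information".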

Code:

You can use dual_aia_trans_merge_crm() in aia_trans.py for dual-branch SE, while aia_complex_trans_mag() and aia_complex_trans_ri() are single-branch approaches. The trained weights on the VB dataset are also provided. You can directly perform inference or fine-tune the model using vb_aia_merge_new.pth.tar.

Requirements:

CUDA 10.1
torch == 1.8.0
pesq == 0.0.1
librosa == 0.7.2
SoundFile == 0.10.3
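A quick way to confirm that an environment matches the pinned versions is to compare them against what is installed. The sketch below uses only the standard library (Python 3.8+). The helper names `installed_version` and `find_mismatches` are hypothetical and not part of this repo, and CUDA is left out because it is not a Python package.

```python
from importlib import metadata

# Versions pinned in the requirements list above (CUDA is not a Python package).
PINNED = {
    "torch": "1.8.0",
    "pesq": "0.0.1",
    "librosa": "0.7.2",
    "SoundFile": "0.10.3",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        # torch wheels may carry a local suffix such as "1.8.0+cu101".
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected version.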

How to train

Step1

Prepare your data. Run json_extract.py to generate the JSON files, which record the utterance file names for both the training and validation sets.

# Run json_extract.py
python json_extract.py
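For readers who want to see what such a file list might look like, the following is a minimal sketch of writing the utterance file names of a directory to a JSON file. The exact format produced by json_extract.py is not documented here, so the flat list of names, the helper names `collect_wav_names` and `write_file_list`, and the example file names are all assumptions.

```python
import json
import os
import tempfile


def collect_wav_names(wav_dir):
    """Return the sorted .wav file names found directly inside wav_dir."""
    return sorted(
        name for name in os.listdir(wav_dir) if name.lower().endswith(".wav")
    )


def write_file_list(wav_dir, json_path):
    """Write the utterance file names of wav_dir to json_path as a JSON list."""
    names = collect_wav_names(wav_dir)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(names, f, indent=2)
    return names


# Demonstration on a throwaway directory; replace with your own dataset paths.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("p226_002.wav", "p226_001.wav", "notes.txt"):
        open(os.path.join(tmp, name), "wb").close()
    listed = write_file_list(tmp, os.path.join(tmp, "train_files.json"))
    with open(os.path.join(tmp, "train_files.json"), encoding="utf-8") as f:
        reloaded = json.load(f)
    print(listed)
```

In practice you would call `write_file_list` once for the training set and once for the validation set, then check that the resulting files match what the training scripts expect.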

Step2

Change the parameter settings according to your directory layout (within config_merge.py).

Step3

Network training (you can also use the aia_complex_trans_mag() and aia_complex_trans_ri() networks in aia_trans.py for single-branch SE)

# Run main_merge.py to begin network training
# solver_merge.py and train_merge.py contain the detailed training process
python main_merge.py

Inference:

The trained weights on the VB dataset, vb_aia_merge_new.pth.tar, are also provided in BEST_MODEL.

# Run enhance.py to enhance the noisy speech samples.
python enhance.py
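If you want to load the provided weights in your own script rather than through enhance.py, a sketch along the following lines may help. The layout of vb_aia_merge_new.pth.tar is not described here, so the helper tolerates a few common layouts: a bare parameter dict, a dict wrapped under a `state_dict`-style key, and `module.` prefixes left by `nn.DataParallel`. Both function names are hypothetical and not part of this repo.

```python
def extract_state_dict(checkpoint):
    """Return a plain parameter dict from a loaded checkpoint object."""
    state = checkpoint
    # Some training scripts wrap the weights, e.g. {"model_state_dict": ...}.
    for key in ("model_state_dict", "state_dict"):
        if isinstance(state, dict) and key in state:
            state = state[key]
            break
    # nn.DataParallel prefixes every parameter name with "module.".
    return {
        (name[len("module."):] if name.startswith("module.") else name): value
        for name, value in state.items()
    }


def load_weights(model, checkpoint_path):
    """Load a checkpoint file into model (requires PyTorch)."""
    import torch  # imported lazily so the helper above stays dependency-free

    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(extract_state_dict(checkpoint))
    return model.eval()
```

Assuming dual_aia_trans_merge_crm() can be constructed without arguments and that BEST_MODEL is a directory in the repo root, usage would look like `load_weights(dual_aia_trans_merge_crm(), "BEST_MODEL/vb_aia_merge_new.pth.tar")`. Check both assumptions against aia_trans.py and enhance.py first.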

Comparison with SOTA:

Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@article{yu2021dual,
title={Dual-branch Attention-In-Attention Transformer for single-channel speech enhancement},
author={Yu, Guochen and Li, Andong and Wang, Yutian and Guo, Yinuo and Wang, Hui and Zheng, Chengshi},
journal={arXiv preprint arXiv:2110.06467},
year={2021}
}

Owner

Guochen Yu
