Application of deep learning methods in speech enhancement

Note: This article is an excerpt from "Sound capture and speech enhancement for speech-enabled devices", a report published in 2021 by the Audio and Acoustics Research Group at Microsoft Research, the organizers of international speech challenges such as DNS, AEC, and PLC. The report summarizes the group's recent work on speech enhancement. Its authors are Ivan Tashev and Sebastian Braun. All figures in this article come from the report and its citations.

1. Speech enhancement module (supervised learning in the time-frequency domain)

This module shows the processing chain of time-frequency domain speech enhancement: short-time Fourier transform (STFT), feature extraction, neural network, prediction target, enhancement/transformation step, inverse short-time Fourier transform (iSTFT), and loss function. The steps from the second row of the figure onward are carried out only during training. The figure is best read together with a figure from an earlier work by the same group (see the figure below).

The following points are worth discussing:

STFT: Because noise patterns are so diverse, speech enhancement naturally differs from speech separation. Using the Fourier basis functions to map the noise and speech components into a space where their patterns can be told apart may help both the training and the robustness of the network. In addition, time-domain algorithms are at a disadvantage under reverberation, and a time-frequency model can be cascaded with traditional beamforming in array-based enhancement. There are also habits inherited from traditional speech enhancement. Most importantly, there are the current results of the DNS Challenge. Although there are excellent time-domain speech enhancement algorithms such as demucs, algorithms based on the time-frequency domain may have more advantages (note: this is my personal opinion).
Feature extraction: Besides the raw complex spectrum and the magnitude spectrum, Microsoft specifically mentions the log power spectrum and the (power-law) compressed complex spectrum. Before describing these two features, note that the report does not list a log magnitude spectrum or a compressed complex spectrum as the prediction target. The target is a spectrum or a mask in the original STFT domain, which differs from some earlier literature where the network output lives in the same domain as its input. The point being made is that compressing the input features (whether by logarithmic or power-law compression) helps system performance.

Here is a brief description of the log power spectrum and the (power-law) compressed complex spectrum. The log power spectrum is used in several earlier papers and is defined as \(P = \log_{10}(|X(k, n)|^2)\). In PyTorch, with the real and imaginary parts stored in the last dimension of x_stft, this is P = torch.log10(torch.norm(x_stft, dim=-1) ** 2 + 1e-9).

The power-law compressed complex spectrum, described in an earlier paper, is defined as \(X_{cprs}=\frac{X(k,n)}{|X(k,n)|}|X(k,n)|^{c}\):
```
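# x_stft: real tensor whose last dimension holds (real, imag); c: compression exponent
x_mag = torch.norm(x_stft, dim=-1) + 1e-9  # magnitude |X|, offset to avoid division by zero
x_cprs_mag = x_mag ** c  # compressed magnitude |X|^c
# keep the phase X / |X| and rescale both parts to the compressed magnitude
x_cprs = torch.stack((x_stft[..., 0] / x_mag * x_cprs_mag, x_stft[..., 1] / x_mag * x_cprs_mag), dim=-1)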
```
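The point about features versus targets can be sketched as follows: only the network input is compressed, while the predicted mask is applied to the raw STFT. This is a minimal NumPy sketch, not the report's code. The function names, the complex-array layout, and the `identity_mask` stand-in for a trained network are all assumptions made for illustration.

```python
import numpy as np

def compress_complex(x, c=0.3, eps=1e-9):
    """Power-law compression: keep the phase, raise the magnitude to the power c."""
    mag = np.abs(x) + eps
    return x / mag * mag ** c

def log_power(x, eps=1e-9):
    """Log power spectrum log10(|X|^2), with a small offset for silent bins."""
    return np.log10(np.abs(x) ** 2 + eps)

def enhance(x_stft, predict_mask, c=0.3):
    """Compress only the network input; apply the predicted mask to the raw STFT."""
    features = compress_complex(x_stft, c)
    mask = predict_mask(features)   # stand-in for the trained network
    return mask * x_stft            # the mask acts on the uncompressed spectrum

# Random complex "STFT" of shape (frequency bins, frames), used as stand-in data.
rng = np.random.default_rng(0)
x_stft = rng.standard_normal((257, 50)) + 1j * rng.standard_normal((257, 50))

# Hypothetical placeholder model: an all-ones mask leaves the input unchanged.
identity_mask = lambda feats: np.ones(feats.shape)
s_hat = enhance(x_stft, identity_mask)
```

With a real model, `predict_mask` would return a real or complex mask of the same shape as the STFT.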
Loss function: The report recommends a loss on the compressed spectrum. Other losses include mask distance, energy-based losses (SDR/SI-SDR), spectral distance, perceptually weighted losses, and joint losses (combinations of the above with each other or with further terms).

The loss functions are defined and evaluated in two papers cited by the report. Some of them are shown in the figure below.

Presumably the recommended loss is the magnitude-regularized compressed spectral loss: \(\mathcal{L}=\frac{1}{\sigma_S^{c}}\left(\lambda\sum_{k,n}|S^c-\widehat{S}^c|^2+(1-\lambda)\sum_{k,n}\left||S|^c-|\widehat{S}|^c\right|^2\right)\), where \(\sigma_S\) is the energy of the clean speech over its active (speech-present) segments, the compressed spectra are computed as defined above, and Microsoft recommends 0.3 for both \(c\) and \(\lambda\).
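A minimal NumPy sketch of this loss, read as \(\lambda\) times the distance between compressed complex spectra plus \((1-\lambda)\) times the distance between compressed magnitudes, all divided by \(\sigma_S^c\). The function names and the complex-array layout are assumptions. The active-speech energy is passed in as an argument, because the excerpt does not say how it is measured. The example uses the mean power of the whole utterance as a stand-in.

```python
import numpy as np

def compress_complex(s, c, eps=1e-9):
    """Power-law compressed complex spectrum: phase kept, magnitude raised to c."""
    mag = np.abs(s) + eps
    return s / mag * mag ** c

def compressed_spectral_loss(s, s_hat, sigma_s, c=0.3, lam=0.3):
    """Magnitude-regularized compressed spectral loss for one utterance.

    s, s_hat : complex STFTs of clean and estimated speech, shape (freq, time)
    sigma_s  : active-speech energy of the clean utterance (computed elsewhere)
    """
    complex_term = np.sum(np.abs(compress_complex(s, c) - compress_complex(s_hat, c)) ** 2)
    magnitude_term = np.sum((np.abs(s) ** c - np.abs(s_hat) ** c) ** 2)
    return (lam * complex_term + (1.0 - lam) * magnitude_term) / sigma_s ** c

rng = np.random.default_rng(0)
s = rng.standard_normal((257, 50)) + 1j * rng.standard_normal((257, 50))
sigma_s = float(np.mean(np.abs(s) ** 2))  # stand-in: no voice activity detection here

loss_perfect = compressed_spectral_loss(s, s, sigma_s)    # identical spectra
loss_flipped = compressed_spectral_loss(s, -s, sigma_s)   # correct magnitude, wrong phase
```

The second call shows what the complex term adds. An estimate with perfect magnitudes but inverted phase leaves the magnitude term at zero, yet it is still penalized through the complex term.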

2. Training data generation and augmentation

The data generation method recommended by Microsoft is shown in the figure above. Setting reverberation aside for now: the energies of the clean speech and of the noise are computed separately, the noisy signal is obtained by mixing them at a chosen signal-to-noise ratio, the <noisy, clean> pair is then spectrally augmented with the same filter, and finally the signal level is adjusted. Note the following:

Clean the clean-speech corpus: select utterances with a high MOS and exclude "dirty" data.
Each utterance should be long enough (Microsoft recommends 10 s per utterance. According to its evaluation, with its model, features, and loss, an utterance should be at least 5 s long. The minimum length may change when these conditions change).
The report recommends drawing the signal-to-noise ratio at random from a Gaussian distribution with mean 5 dB and variance 10 dB.
The report recommends drawing the signal level (in dBFS) at random from a Gaussian distribution with mean -28 dB and variance 10 dB.
Spectral augmentation follows the filter used in RNNoise: \(H(z)=\frac{1+r_1z^{-1}+r_2z^{-2}}{1+r_3z^{-1}+r_4z^{-2}}\), where \(r_i\sim\mathcal{U}(-\frac{3}{8},\frac{3}{8})\). However, the literature cited in the report points out that, given the large amount of data, spectral augmentation is unnecessary.
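The reverberation-free part of this recipe can be sketched in NumPy as below. It is an illustration under several assumptions that the excerpt does not settle. The 10 dB figure is used as the standard deviation of the Gaussian. Audio is floating point with full scale at 1.0, and the level is the RMS of the noisy signal. The function names are made up for this example.

```python
import numpy as np

def random_biquad(rng):
    """Draw r1..r4 ~ U(-3/8, 3/8) for H(z) = (1 + r1 z^-1 + r2 z^-2) / (1 + r3 z^-1 + r4 z^-2)."""
    r = rng.uniform(-3 / 8, 3 / 8, size=4)
    return r[:2], r[2:]

def biquad_filter(x, num, den):
    """Second-order IIR filter written as an explicit difference equation."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0.0
        x2 = x[n - 2] if n >= 2 else 0.0
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = x[n] + num[0] * x1 + num[1] * x2 - den[0] * y1 - den[1] * y2
    return y

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech energy / noise energy matches snr_db, then add."""
    speech_energy = np.mean(speech ** 2)
    noise_energy = np.mean(noise ** 2)
    gain = np.sqrt(speech_energy / (noise_energy * 10 ** (snr_db / 10)))
    return speech + gain * noise

def set_level_dbfs(noisy, clean, level_db):
    """Apply one gain to both signals so the noisy RMS equals level_db (full scale = 1.0)."""
    rms = np.sqrt(np.mean(noisy ** 2))
    gain = 10 ** (level_db / 20) / rms
    return gain * noisy, gain * clean

def make_pair(speech, noise, rng):
    snr_db = rng.normal(5.0, 10.0)      # assumption: 10 dB used as standard deviation
    level_db = rng.normal(-28.0, 10.0)  # assumption: same for the level
    noisy = mix_at_snr(speech, noise, snr_db)
    num, den = random_biquad(rng)
    noisy = biquad_filter(noisy, num, den)   # the same filter is applied to both signals
    clean = biquad_filter(speech, num, den)
    return set_level_dbfs(noisy, clean, level_db)

rng = np.random.default_rng(0)
speech = rng.standard_normal(4000)  # stand-ins for real recordings
noise = rng.standard_normal(4000)
noisy, clean = make_pair(speech, noise, rng)
```

Applying the same filter and the same gain to both signals keeps the training target aligned with the noisy input. In practice a library routine such as `scipy.signal.lfilter` would replace the explicit loop.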

Finally, consider reverberation (the green area in the figure). As in other work, the clean speech is convolved with a room impulse response (RIR) to obtain reverberant speech. To keep the target speech sounding natural, the target still contains a small amount of reverberation. In practice the target is obtained by convolving the speech with the RIR multiplied by a weighting function, defined as \(w_{RIR}(t)=\exp\left(-(t-t_0)\frac{6\log(10)}{0.3}\right)\) if \(t \ge t_0\), and \(w_{RIR}(t)=1\) otherwise.
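The weighting can be sketched in NumPy as below. The excerpt does not define \(t_0\) or the time unit, so this sketch makes assumptions: \(t\) is in seconds, \(\log\) is the natural logarithm, and \(t_0\) is the position of the largest RIR sample (the direct path). The function names are hypothetical.

```python
import numpy as np

def rir_weighting(num_samples, t0_index, sample_rate):
    """w(t) = exp(-(t - t0) * 6 ln(10) / 0.3) for t >= t0, and 1 before t0."""
    t = np.arange(num_samples) / sample_rate
    t0 = t0_index / sample_rate
    decay = 6.0 * np.log(10.0) / 0.3
    w = np.ones(num_samples)
    late = t >= t0
    w[late] = np.exp(-(t[late] - t0) * decay)
    return w

def make_reverberant_pair(speech, rir, sample_rate):
    """Return (reverberant input, lightly reverberant target) for one utterance."""
    t0_index = int(np.argmax(np.abs(rir)))  # assumption: t0 is the direct-path peak
    target_rir = rir * rir_weighting(len(rir), t0_index, sample_rate)
    reverberant = np.convolve(speech, rir)[: len(speech)]
    target = np.convolve(speech, target_rir)[: len(speech)]
    return reverberant, target

sample_rate = 16000
w = rir_weighting(num_samples=16000, t0_index=100, sample_rate=sample_rate)

# Synthetic exponentially decaying RIR and random "speech", used as stand-in data.
rng = np.random.default_rng(0)
rir = rng.standard_normal(8000) * np.exp(-np.arange(8000) / 2000.0)
speech = rng.standard_normal(16000)
reverberant, target = make_reverberant_pair(speech, rir, sample_rate)
```

Under these assumptions the window equals 1 up to and including \(t_0\) and has fallen to \(10^{-6}\) at 0.3 s after \(t_0\). The target therefore keeps the direct sound and early reflections and suppresses the late tail.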

3. Effective network architecture

Microsoft presents two network architectures: NSNet2 (the DNS Challenge baseline) and CRUSE (Microsoft-2, the system Microsoft submitted to the DNS Challenge).

These two networks are, respectively, an RNNoise-style (by Valin) model in the magnitude-spectrum domain and a GCRN-style (by Tan) model in the complex-spectrum domain. RNNoise and GCRN will be described in detail in a later blog post. For an introduction to the two networks, see the corresponding blog posts. Finally, their results:
