[AI View · Daily digest of sound and acoustics papers, issue 2] Fri, 15 Apr 2022

2022-04-23 04:01:00 【hitrjj】

AI View · Overview of today's cs.Sound acoustics papers
Fri, 15 Apr 2022
6 papers in total

Daily Sound Papers

Learning and controlling the source-filter representation of speech with a variational autoencoder
Authors Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier
Understanding and controlling latent representations in deep generative models is a challenging but important problem for analyzing, transforming, and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, of which the fundamental frequency f0 and the formants are the most important. In this work, we show that the source-filter model of speech production arises naturally in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated by an artificial speech synthesizer, we show experimentally that f0 and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space, and we develop a weakly supervised method to accurately and independently control these factors of variation within the learned latent subspaces.

Streamable Neural Audio Synthesis With Non-Causal Convolutions
Authors Antoine Caillon, Philippe Esling
Deep learning models are mostly used for offline inference. However, this strongly limits their use in audio generation setups, because most creative workflows are based on real-time digital signal processing. Although approaches based on recurrent networks adapt naturally to this buffer-based computation, the use of convolutions still poses serious challenges. To address this, causal streaming convolutions have been proposed.
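As a rough illustration of the buffer-based computation mentioned in this abstract, the sketch below shows a generic causal streaming convolution in plain Python. It is not the authors' method (the abstract only says that causal streaming convolutions have been proposed); the names `causal_conv_offline` and `StreamingCausalConv`, the kernel values, and the buffer size of 3 are hypothetical and chosen only for the example. The idea is that a causal convolution pads on the left only, so when audio arrives in buffers the padding can be replaced by a cache of the most recent input samples.

```python
from typing import List, Sequence


def causal_conv_offline(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal 1-D convolution over a whole signal: y[n] = sum_k kernel[k] * x[n - k]."""
    pad = len(kernel) - 1
    # Left zero-padding only, so no output sample depends on future input.
    padded = [0.0] * pad + list(signal)
    return [
        sum(kernel[k] * padded[n + pad - k] for k in range(len(kernel)))
        for n in range(len(signal))
    ]


class StreamingCausalConv:
    """The same convolution, computed buffer by buffer (hypothetical helper)."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # The cache holds the last len(kernel) - 1 input samples and
        # plays the role of the left padding in the offline version.
        self.cache = [0.0] * (len(self.kernel) - 1)

    def process(self, buffer: Sequence[float]) -> List[float]:
        pad = len(self.kernel) - 1
        padded = self.cache + list(buffer)
        out = [
            sum(self.kernel[k] * padded[n + pad - k] for k in range(len(self.kernel)))
            for n in range(len(buffer))
        ]
        if pad:
            # Keep the most recent samples for the next buffer.
            self.cache = padded[-pad:]
        return out


# Illustrative values only: a 7-sample signal processed in buffers of 3 samples.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [0.5, 0.25, 0.25]

conv = StreamingCausalConv(kernel)
streamed: List[float] = []
for start in range(0, len(signal), 3):
    streamed.extend(conv.process(signal[start:start + 3]))

offline = causal_conv_offline(signal, kernel)
```

With the cache in place, the streamed output matches the offline output sample for sample, whatever the buffer size.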

From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks
Authors Mohammad Esmaeilpour, Patrick Cardinal, Alessandro Lameiras Koerich
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial robustness of a victim residual convolutional neural network, namely ResNet-18. Our main motivation for focusing on this front-end classifier rather than other, more complex architectures is to balance recognition accuracy against the total number of training parameters. We measure how the different settings required to generate more informative Mel-frequency cepstral coefficient (MFCC), short-time Fourier transform (STFT), and discrete wavelet transform (DWT) representations affect our front-end model. This measurement involves comparing classification performance against adversarial robustness. Balancing the average budget allocated by the adversary against the attack cost, we demonstrate an inverse relationship between recognition accuracy and model robustness for six benchmark attack algorithms. Moreover, our experimental results show that although the ResNet-18 model trained on DWT spectrograms achieves high recognition accuracy, attacking this model is relatively more costly for the adversary than attacking models trained on the other 2D representations.

Predicting score distribution to improve non-intrusive speech quality estimation
Authors Abu Zaher Md Faridee, Hannes Gamper
Deep noise suppressors (DNS) have become an attractive solution for removing background noise, reverberation, and distortion from speech, and they are widely used in telephony and voice applications. However, they are sometimes prone to introducing artifacts and lowering the perceived quality of the speech. Subjective listening tests that use multiple human judges to obtain a mean opinion score (MOS) are a popular way to measure the performance of these models. Non-intrusive MOS estimation models based on deep neural networks have recently emerged as a popular, cost-effective alternative to these tests. These models are trained with MOS labels only, and the secondary statistics of the opinion scores are usually discarded. In this paper, we investigate several ways to integrate the distribution of opinion scores, such as variance and histogram information, to improve MOS estimation performance. Our model is trained on a corpus of 419K denoised samples produced by 320 different DNS models and model variants, and evaluated on 18K test samples from DNSMOS.
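To make the "secondary statistics" concrete, here is a minimal sketch of how the mean, variance, and histogram of opinion scores could be computed for a single clip. It assumes integer ratings on a five-point scale (1 to 5), which the abstract does not state, and it says nothing about how the paper feeds these quantities into its model. The function name `opinion_score_targets` and the example ratings are hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List, Sequence, Union


def opinion_score_targets(
    ratings: Sequence[int],
    scale: Sequence[int] = (1, 2, 3, 4, 5),  # assumed five-point rating scale
) -> Dict[str, Union[float, List[float]]]:
    """Summarize the individual ratings of one clip (hypothetical helper)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in scale for r in ratings):
        raise ValueError("rating outside the assumed scale")

    total = len(ratings)
    # Normalized histogram: fraction of judges who chose each level.
    histogram = [sum(1 for r in ratings if r == level) / total for level in scale]

    return {
        "mos": float(mean(ratings)),            # the usual training label
        "variance": float(pvariance(ratings)),  # spread of the judges' opinions
        "histogram": histogram,
    }


# Illustrative ratings from five hypothetical judges for one denoised clip.
targets = opinion_score_targets([4, 4, 5, 3, 4])
```

Two clips can share the same MOS while having very different variances or histograms, and that is the information a MOS-only label throws away.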

RadioSES: mmWave-Based Audioradio Speech Enhancement and Separation System
Authors Muhammed Zahid Ozturk, Chenshu Wu, Beibei Wang, Min Wu, K. J. Ray Liu
Speech enhancement and separation have been a long-standing problem, especially given recent advances that use a single microphone. Although microphones perform well in constrained settings, their speech separation performance degrades in noisy conditions. In this work, we propose RadioSES, an audioradio speech enhancement and separation system that overcomes inherent problems of audio-only systems. By fusing a complementary radio modality, RadioSES can estimate the number of speakers, solve the source association problem, separate and enhance noisy mixed speech, and improve both intelligibility and perceptual quality. We perform millimeter-wave sensing to detect and localize speakers, and introduce an audioradio deep learning framework that fuses the separate radio features with the mixed audio features. Extensive experiments using commercial off-the-shelf devices show that RadioSES outperforms a variety of state-of-the-art baselines, with consistent performance gains across different environmental settings.

Lombard Effect for Bilingual Speakers in Cantonese and English: importance of spectro-temporal features
Authors Maximilian Karl Scharf, Sabine Hochmuth, Lena L.N. Wong, Birger Kollmeier, Anna Warzybok
Computational models of speech recognition have a long tradition in hearing research, where they are used to better understand the mechanisms of speech perception and the contribution of different signal features. Because speech has to be recognized in a wide range of situations, these models need to generalize across many acoustic conditions, speakers, and languages. This contribution examines the importance of different features for predicting the recognition of plain and Lombard speech in English compared with Cantonese, in stationary and modulated noise. Cantonese is a tonal language that encodes information in spectro-temporal features, whereas the Lombard effect is known to be associated with spectral changes in the speech signal. These contrasting properties of tonal languages and the Lombard effect form an interesting basis for evaluating speech recognition models. Here, a model based on automatic speech recognition (ASR) that uses either spectral or spectro-temporal features is evaluated against empirical data. The results indicate that spectro-temporal features are important for predicting the speaker-specific speech recognition threshold (SRT50) in both Cantonese and English and for explaining the improvement in speech recognition in modulated noise, while the effect of Lombard speech can be ...

Chinese abstracts are machine-translated

Papers from arxiv.org

pic from pexels.com

Copyright notice
This article was written by [hitrjj]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204220600582644.html