[AI View · Quick review of today's sound and acoustics papers, issue 2] Fri, 15 Apr 2022
2022-04-23 04:01:00 【hitrjj】
AI View · Today's cs.Sound: an overview of acoustics papers
Fri, 15 Apr 2022
6 papers in total
Previous issue at a glance · For more highlights, please visit the home page
Daily Sound Papers
Learning and controlling the source-filter representation of speech with a variational autoencoder
Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Xavier Alameda-Pineda, Renaud Séguier
Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency f0 and the formants are the most important. In this work, we show that the source-filter model of speech production naturally arises in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we show experimentally that f0 and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space, and we develop a weakly supervised method to accurately and independently control these factors of variation within the learned latent subspaces.
Streamable Neural Audio Synthesis With Non-Causal Convolutions
Authors: Antoine Caillon, Philippe Esling
Deep learning models are mostly used for offline inference. However, this strongly limits the use of these models in audio generation setups, because most creative workflows are based on real-time digital signal processing. Although approaches based on recurrent networks can be adapted naturally to this buffer-based computation, the use of convolutions still poses some serious challenges. To tackle this issue, the use of causal streaming convolutions has been proposed.
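To make the buffer-based setting concrete, here is a minimal sketch of a causal streaming convolution that caches the last input samples between buffers. It is an illustration of the general cached-padding idea only, not the authors' implementation; the names `causal_conv` and `StreamingCausalConv` are hypothetical, and plain Python lists stand in for audio buffers.

```python
def causal_conv(signal, kernel):
    """Offline causal convolution: output[n] = sum_j kernel[j] * signal[n - j]."""
    k = len(kernel)
    # Left zero-padding means no output sample looks at future input.
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[n + k - 1 - j] for j in range(k))
        for n in range(len(signal))
    ]


class StreamingCausalConv:
    """Hypothetical helper: the same convolution, computed one buffer at a time."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # The cache plays the role of the left padding; it starts as zeros.
        self.cache = [0.0] * (len(kernel) - 1)

    def process(self, buffer):
        k = len(self.kernel)
        # Prepend the samples remembered from the previous buffers.
        padded = self.cache + list(buffer)
        out = [
            sum(self.kernel[j] * padded[n + k - 1 - j] for j in range(k))
            for n in range(len(buffer))
        ]
        # Remember the last k - 1 input samples for the next call.
        self.cache = padded[len(buffer):]
        return out
```

Feeding a signal through `process` in consecutive buffers of any size should give the same samples as calling `causal_conv` once on the whole signal, which is the property a real-time audio callback needs.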
From Environmental Sound Representation to Robustness of 2D CNN Models Against Adversarial Attacks
Authors: Mohammad Esmaeilpour, Patrick Cardinal, Alessandro Lameiras Koerich
This paper investigates the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial robustness of a victim residual convolutional neural network, namely ResNet-18. Our main motivation for focusing on such a front-end classifier rather than other complex architectures is to balance recognition accuracy against the total number of training parameters. Here, we measure the impact of the different settings required to generate more informative Mel-frequency cepstral coefficient (MFCC), short-time Fourier transform (STFT), and discrete wavelet transform (DWT) representations on our front-end model. This measurement involves comparing classification performance against adversarial robustness. Balancing the average budget allocated by the adversary against the attack cost, we demonstrate an inverse relationship between recognition accuracy and model robustness for six benchmark attack algorithms. Moreover, our experimental results show that although the ResNet-18 model trained on DWT spectrograms achieves high recognition accuracy, attacking this model is relatively more costly for the adversary than attacking models trained on the other 2D representations.
Predicting score distribution to improve non-intrusive speech quality estimation
Authors: Abu Zaher Md Faridee, Hannes Gamper
Deep noise suppressors (DNS) have become an attractive solution for removing background noise, reverberation, and distortion from speech, and are widely used in telephony and voice applications. They are also occasionally prone to introducing artifacts and lowering the perceived quality of the speech. Subjective listening tests that use multiple human judges to derive a mean opinion score (MOS) are a popular way to measure the performance of these models. Non-intrusive MOS estimation models based on deep neural networks have recently become a popular cost-effective alternative to these tests. These models are trained with only the MOS labels, and the secondary statistics of the opinion scores are usually discarded. In this paper, we investigate several ways to integrate the distribution of opinion scores (for example, variance and histogram information) in order to improve MOS estimation performance. Our model is trained on a corpus of 419K denoised samples produced by 320 different DNS models and model variants, and evaluated on 18K test samples from DNSMOS.
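As a rough illustration of the statistics the abstract names, the sketch below computes a MOS, the variance of the opinion scores, and a normalized histogram from one clip's ratings. It assumes an integer 1 to 5 rating scale, which the abstract does not state, and the function name `opinion_score_targets` is hypothetical; it is not the authors' training pipeline.

```python
from collections import Counter
from statistics import mean, pvariance


def opinion_score_targets(ratings, scale=(1, 2, 3, 4, 5)):
    """Summarize the judges' ratings for a single clip.

    Assumption: ratings are integers on the given scale (1-5 by default).
    """
    counts = Counter(ratings)
    n = len(ratings)
    return {
        # Mean opinion score: the label MOS estimators are usually trained on.
        "mos": mean(ratings),
        # Population variance: how much the judges disagree.
        "variance": pvariance(ratings),
        # Fraction of judges who chose each point on the scale.
        "histogram": [counts.get(s, 0) / n for s in scale],
    }
```

Two clips can share the same MOS while having very different variance or histogram shape, which is the information that is lost when only the mean is kept as a label.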
RadioSES: mmWave-Based Audioradio Speech Enhancement and Separation System
Authors: Muhammed Zahid Ozturk, Chenshu Wu, Beibei Wang, Min Wu, K. J. Ray Liu
Speech enhancement and separation have been a long-standing problem, especially with the recent advances using a single microphone. Although microphones perform well in constrained settings, their speech separation performance degrades in noisy conditions. In this work, we propose RadioSES, an audioradio speech enhancement and separation system that overcomes problems inherent in audio-only systems. By fusing a complementary radio modality, RadioSES can estimate the number of speakers, solve the source association problem, separate and enhance noisy speech mixtures, and improve both intelligibility and perceptual quality. We perform millimeter-wave sensing to detect and localize speakers, and introduce an audioradio deep learning framework that fuses the separate radio features with the mixed audio features. Extensive experiments using commercial off-the-shelf devices show that RadioSES outperforms a variety of state-of-the-art baselines, with consistent performance gains in different environmental settings.
Lombard Effect for Bilingual Speakers in Cantonese and English: importance of spectro-temporal features
Authors: Maximilian Karl Scharf, Sabine Hochmuth, Lena L.N. Wong, Birger Kollmeier, Anna Warzybok
Computational models of speech recognition have a long tradition in hearing research as a way to better understand the mechanisms of speech perception and the contribution of different signal features. Because speech needs to be recognized in a wide range of situations, these models need to generalize across many acoustic conditions, speakers, and languages. This contribution examines the importance of different features for predicting the recognition of plain and Lombard speech in English compared with Cantonese, in stationary and modulated noise. Cantonese is a tonal language that encodes information in spectro-temporal features, whereas the Lombard effect is known to be associated with spectral changes in the speech signal. These contrasting properties of tonal languages and the Lombard effect form an interesting basis for evaluating speech recognition models. Here, a model based on automatic speech recognition (ASR) that uses spectral or spectro-temporal features is evaluated against empirical data. The results indicate that spectro-temporal features are important for predicting the speaker-specific speech recognition threshold SRT50 in both Cantonese and English, and for explaining the improvement of speech recognition in modulated noise, while the influence of Lombard speech can be ...
Chinese abstracts from machine translation
Copyright notice
This article was written by [hitrjj]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204220600582644.html