当前位置:网站首页>Two minutes recording can pass by second language!The volcano how to practice and become voice tone reproduction technology?
Two minutes recording can pass by second language!The volcano how to practice and become voice tone reproduction technology?
2022-08-09 13:02:00 【QbitAl】
Let's enjoy an audio and video first, maybe you will have a surprise discovery?
Yes, that's how the voice-over imitator of Anime Sponge looks.
The difference is that the protagonist of the American comedy animation, who is about to run for four years, is now changing the single language and fixed style of the past under the interpretation of the imitator, and he says it all at once.There are translation accents, TVB accents, Cantonese and even Shanghai dialects.
More importantly, all styles and languages are based on a two-minute pure Chinese audio training.
How much can a two-minute audio file contain?
According to the estimation of professionals in voice direction, it is basically equivalent to the content of 20 sentences spoken at a normal speech rate.
In this way, the "magic voice" can not only retain the sound of the deity, but also realize the seamless switching of multi-style and multi-language. It is also thanks to the "sound" developed by the volcano voice"Black Technology", that is, the sound reproduction technology.
For a long time, Volcano Voice has provided global high-quality voice AI technology capabilities and excellent full-stack voice product solutions for ByteDance's internal business lines and Volcano Engine ToB industries and innovative scenarios.Program.
The "timbre reproduction technology" launched this time can be simply understood as "timbre clone", which is a fully automatic, efficient and lightweight sound customization solution.
Less data, low cost, convenient and efficient
Different from the high threshold requirements for data in the model training process of traditional speech synthesis technology, Volcano Voice timbre reproduction technology requires only a large amount of data0.3% of the traditional method, and the requirements for timbre acquisition are also simpler——
There is no need for a professional announcer to record in a recording studio for a long time. Ordinary people can record in a relatively quiet open environment for more than 2 minutes to achieve the standard of sound space modeling and generate exclusive sound.The AI model is convenient and efficient.
Multi-style, multi-language, stable and high-quality
In addition, the self-developed Imitator model structure of Volcano Voice can also extract speaker-independent hidden layer speech representation (SI Context Feature) from audio, such as moreRhythm and accent information, etc., and use this as the intermediate feature of text and audio for auxiliary model training, so that timbre restoration is more accurate.
It is understood that in the pre-training stage, the team also used a multi-style, multi-language, multi-speaker voice database for average model training, which can be understood as a very small amount of recording data.With the support, transfer learning is used to adaptively create a speech synthesis model with a high degree of timbre restoration, so that the synthesized timbre can be outstanding in pronunciation rhythm and similarity.
There is no need for any audio or text annotations during the sound reproduction process, which not only saves labor costs, but also reduces the system complexity of the practical operation.
In addition, the technology of streaming synthesis can make the delay of the first packet of sound reproduction less than 500ms, which is suitable for most personalized voice scenarios.
On the whole, it not only realizes the decoupling of timbre, style and language, but also reaches the industry-leading level in pronunciation stability and sound quality.
Full-link automation, ready to use
This technical solution will provide external enterprise-level services through Volcano Engine, relying on Volcano Voice's high-quality sound reproduction SDK support, its convenient text reading and recording functions, andThe built-in environment detection and word accuracy detection can maximize the quality of audio input.
At the same time, the back-end has an automated model loading function, and without restarting the service, the corresponding sound can be hot loaded, realizing the whole chain of audio recording to sound experience.Closed-loop, that is to say, only one set of SDK can be used to complete the use of all resources. Currently, the online SDK already supports Chinese Mandarin and English.
The application of this technology strictly follows compliance requirements. The Volcano Voice team said:
We attach great importance to the protection of users' personal information rights and interests. We have obtained full authorization for sound collection and training to ensure the legality of the sound reproduction process and the compliance of sound use., and then applied to enterprise service scenarios.
It is worth mentioning that this technology has core patents.
In short, if you want to make personalized audio, you only need to record 2-10 minutes at a time and train for 10-20 minutes. After entering the text, select the desired style and language, you can quicklyIt is synthesized and applied in multiple enterprise-level service scenarios such as news broadcasts and intelligent customer service.
Now the speech recognition and speech synthesis technology capabilities of Volcano Voice have been successfully applied to many products such as Douyin, Jianying, Tomato Novels, etc., and are open to the outside world through Volcano Engineenterprise.
*This article is published with permission from qubits, and the opinions belong to the author.
— End—
QubitQbitAI
վ'ᴗ' ի Track new developments in AI technology and products
Three clicks of "Share", "Like" and "Watching"
Technology cutting-edge progress will meet every day ~
边栏推荐
猜你喜欢
F280049库函数API编程、直接寄存器控制编程和混合编程方法
索引index
又有大厂员工连续加班倒下/ 百度搜狗取消快照/ 马斯克生父不为他骄傲...今日更多新鲜事在此...
[Interview high-frequency questions] Linked list high-frequency questions that can be gradually optimized
900页数学论文证明旋转的黑洞不会爆炸,丘成桐:30多年来广义相对论首次重大突破...
【微服务~远程调用】整合RestTemplate、WebClient、Feign
基于STM32+铂电阻设计的测温仪
redis库没法引入
2022牛客多校(六)M. Z-Game on grid
鹅厂机器狗花式穿越10m梅花桩:前空翻、单桩跳、起身作揖...全程不打一个趔趄...
随机推荐
LeetCode #101. Symmetric Binary Tree
Scala 高阶(七):集合内容汇总(上篇)
Shell之常用小工具(sort、uniq、tr、cut)
学长告诉我,大厂MySQL都是通过SSH连接的
阿里高工带来的20022最新面试总结太香了
阿里云新增三大高性能计算解决方案,助力生命科学行业快速发展
Too much volume... Tencent was asked on the side that the memory was full, what would happen?
用皮肤“听”音乐,网友戴上这款装备听音乐会:仿佛住在钢琴里
Information system project managers must memorize the core test sites (63) The main process of project portfolio management & DIPP analysis
WeChat payment development process
【微服务~远程调用】整合RestTemplate、WebClient、Feign
听声辨物,这是AI视觉该干的???|ECCV 2022
WeChat Mini Program Payment and Refund Overall Process
报告:想学AI的学生数量已涨200%,老师都不够用了
Shell正则表达式,三剑客之grep命令
2022 Niu Ke Duo School (6) M. Z-Game on grid
The grep command Shell regular expressions, the three musketeers
win10右键文件,一直转圈
信息系统项目管理师必背核心考点(六十三)项目组合管理的主要过程&DIPP分析
ACM longest non-descent subsequence problem