A two-minute recording can make you an instant polyglot! How was Volcano Voice's timbre reproduction technology developed?
2022-08-09 13:02:00 【QbitAl】
Let's start with a video clip; you may be in for a surprise.
Yes, this is what a voice imitator of the cartoon SpongeBob sounds like.
The difference is that the protagonist of this American comedy animation, which is about to run for four years, has dropped the single language and fixed style of the past. In the imitator's rendition he delivers everything in one go: a translated-dub accent, a TVB accent, Cantonese, and even Shanghai dialect.
More importantly, all of these styles and languages come from training on just two minutes of Chinese-only audio.
How much can a two-minute audio file contain?
According to estimates from speech professionals, it is roughly equivalent to 20 sentences spoken at a normal pace.
This "magic voice" keeps the sound of the original speaker while switching seamlessly between multiple styles and languages. That is thanks to a piece of "black technology" developed by Volcano Voice: timbre reproduction technology.
For a long time, Volcano Voice has provided globally high-quality voice AI capabilities and full-stack voice product solutions for ByteDance's internal business lines and for Volcano Engine's ToB industries and innovative scenarios.
The "timbre reproduction technology" launched this time can be understood simply as "timbre cloning": a fully automatic, efficient, and lightweight voice customization solution.
Less data, low cost, convenient and efficient
Traditional speech synthesis places high demands on data during model training. Volcano Voice's timbre reproduction technology, by contrast, requires only 0.3% of the data volume of the traditional method, and the requirements for capturing the timbre are simpler as well:
There is no need for a professional announcer to spend long hours in a recording studio. An ordinary person can record for a little over 2 minutes in a relatively quiet open environment, which is enough to model the voice space and generate a dedicated voice AI model, conveniently and efficiently.
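As a rough back-of-the-envelope check, the two figures above can be combined to see what data volume they imply for the traditional method. The article does not state that volume; the calculation below assumes the 0.3% figure is measured against the 2-minute recording, which is an assumption rather than something the source confirms.

```python
# Back-of-the-envelope estimate only.
# Assumption (not stated in the article): the 0.3% figure is relative
# to the roughly 2-minute recording mentioned in the text.
recording_minutes = 2.0
fraction_of_traditional = 0.003  # 0.3%

traditional_minutes = recording_minutes / fraction_of_traditional
traditional_hours = traditional_minutes / 60

print(f"Implied traditional data: {traditional_minutes:.0f} min (~{traditional_hours:.1f} h)")
```

Under that assumption, the traditional method would need on the order of eleven hours of recordings, which is consistent with the article's point about long studio sessions.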
Multi-style, multi-language, stable and high-quality
In addition, Volcano Voice's self-developed Imitator model structure can extract a speaker-independent hidden-layer speech representation (SI Context Feature) from audio, capturing additional information such as rhythm and accent. This representation serves as an intermediate feature between text and audio to assist model training, making timbre restoration more accurate.
It is understood that in the pre-training stage the team also trained an average model on a multi-style, multi-language, multi-speaker voice database. With this as a base, transfer learning can adapt the model using only a very small amount of recording data, producing a speech synthesis model with a high degree of timbre restoration whose output stands out in both pronunciation rhythm and similarity.
The timbre reproduction process requires no audio or text annotation, which both saves labor costs and reduces system complexity in practice.
In addition, streaming synthesis keeps the first-packet latency of timbre reproduction below 500 ms, which suits most personalized voice scenarios.
On the whole, the technology not only decouples timbre, style, and language, but also reaches an industry-leading level in pronunciation stability and sound quality.
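To make the first-packet latency figure concrete, here is a minimal sketch of how such a latency could be measured against any streaming synthesizer. The `fake_streaming_tts` generator is a hypothetical stand-in that emits placeholder bytes, and `measure_first_packet` is an illustrative helper; neither is part of Volcano Voice's SDK.

```python
import time
from typing import Iterator, Tuple


def fake_streaming_tts(text: str, chunk_chars: int = 4,
                       delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer.

    Yields one chunk at a time instead of waiting for the whole utterance.
    """
    for start in range(0, len(text), chunk_chars):
        time.sleep(delay_s)  # simulated per-chunk synthesis time
        # Placeholder bytes, not real audio.
        yield text[start:start + chunk_chars].encode("utf-8")


def measure_first_packet(stream: Iterator[bytes]) -> Tuple[float, int]:
    """Return (first-packet latency in milliseconds, total chunk count)."""
    start = time.perf_counter()
    first_ms = None
    count = 0
    for _ in stream:
        if first_ms is None:
            first_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    if first_ms is None:
        raise ValueError("stream produced no chunks")
    return first_ms, count


latency_ms, n_chunks = measure_first_packet(
    fake_streaming_tts("hello, streaming synthesis")
)
print(f"first packet after {latency_ms:.1f} ms, {n_chunks} chunks total")
```

The point of streaming is that playback can begin once the first chunk arrives, so the latency a listener perceives is the time to the first packet rather than the time to synthesize the full utterance.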
Full-link automation, ready to use
This technical solution will be offered as an external enterprise-level service through Volcano Engine. It is backed by Volcano Voice's timbre reproduction SDK, whose convenient read-aloud recording functions and built-in environment detection and word accuracy detection help maximize the quality of the audio input.
At the same time, the back end loads models automatically: the corresponding voice can be hot loaded without restarting the service, closing the loop along the whole chain from audio recording to voice experience. In other words, a single SDK is enough to use all of the resources. The online SDK currently supports Mandarin Chinese and English.
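The idea of hot loading a voice without restarting the service can be illustrated with a small in-process registry. This is only a sketch under stated assumptions: `VoiceRegistry`, `hot_load`, and `synthesize` are hypothetical names, the "model" is a plain function returning placeholder bytes, and none of this reflects how Volcano Voice's back end is actually built.

```python
import threading
from typing import Callable, Dict

# A "model" here is just a function from text to bytes (placeholder audio).
Synthesizer = Callable[[str], bytes]


class VoiceRegistry:
    """Hypothetical registry: voices can be added while requests are served."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._voices: Dict[str, Synthesizer] = {}

    def hot_load(self, voice_id: str, model: Synthesizer) -> None:
        # Register or replace a voice at runtime; no process restart involved.
        with self._lock:
            self._voices[voice_id] = model

    def synthesize(self, voice_id: str, text: str) -> bytes:
        # Look up the voice under the lock, then run it outside the lock.
        with self._lock:
            model = self._voices.get(voice_id)
        if model is None:
            raise KeyError(f"voice not loaded: {voice_id}")
        return model(text)


registry = VoiceRegistry()
registry.hot_load("demo", lambda text: f"[demo] {text}".encode("utf-8"))
result = registry.synthesize("demo", "hello")
print(result)
```

Holding the lock only for the dictionary lookup keeps a slow synthesis call from blocking the registration of newly trained voices.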
The application of this technology strictly follows compliance requirements. The Volcano Voice team said:
We attach great importance to protecting users' personal information rights and interests. Full authorization is obtained for voice collection and training, ensuring that the timbre reproduction process is lawful and the use of the voice is compliant, before the technology is applied to enterprise service scenarios.
It is worth mentioning that this technology has core patents.
In short, to produce personalized audio you only need to record 2-10 minutes of speech once and train for 10-20 minutes. After entering the text and selecting the desired style and language, the audio can be synthesized quickly and applied in enterprise-level service scenarios such as news broadcasting and intelligent customer service.
The speech recognition and speech synthesis capabilities of Volcano Voice have already been applied in many products, including Douyin, Jianying, and Tomato Novels, and are open to external enterprises through Volcano Engine.
*This article is published with permission from QbitAI; the opinions expressed belong to the author.