
From Douyin to Volcano Engine: The Evolution and Opportunities of Streaming Media Technology

2022-08-11 11:43:00 LiveVideoStack


Editor's note: On the morning of August 5, the Shanghai stop of LiveVideoStackCon 2022, the audio and video technology conference, invited Song Shenyi, head of Volcano Engine RTC, to trace the evolution and opportunities of streaming media technology from Douyin to Volcano Engine along four directions: real-time performance, immersion, globalization, and developers. The talk shows the road Volcano Engine has traveled and its continued exploration across different scenarios.



LiveVideoStack started in 2017, and I happened to join ByteDance that same year. I grew along with ByteDance's rapid growth, from Douyin to Feishu. Today these mature capabilities are offered externally through Volcano Engine to serve more developers and enterprise customers.

First, I will look back at the scenarios we met in streaming media technology over the past few years and the problems we solved, and then at the longer-term directions and opportunities we need to keep focusing on in the coming years.

ByteDance started working on streaming media technology around 2016; before that we only had video on demand. We began with one-way live streaming and then moved to real-time interaction. After the pandemic broke out in early 2020, more complex interaction patterns began to emerge. We launched Volcano Engine in 2020, and the company's cloud infrastructure, video cloud, and big data began migrating onto it. From then on, ByteDance's technical capabilities gradually became services offered externally on Volcano Engine. Since then we have used Volcano Engine to support games, immersive audio and video, and other applications, and it has helped those applications grow.

When we built co-hosted live streaming, the first question was how to balance three dimensions: high definition, smoothness, and real-time performance. It is impossible to satisfy all three 100% of the time, but we can get to 90%, and as technology evolves that 90% can rise to 99% or higher. Much of our technical work continuously improves definition, smoothness, and real-time performance, and with them the user experience.

After the pandemic broke out, online education and video conferencing boomed and gave rise to many complex, challenging new scenarios, such as large online classes, webinars, and in-game team voice. These need new breakthroughs: we have to solve global, multi-region problems and also handle both people and devices. Today devices join the interaction as well as people, so we optimized the architecture and proposed a multi-center distributed signaling architecture to support interaction at very large scale.

In the last two years immersive audio and video has suddenly heated up. It is a new growth curve for audio and video technology and has produced new trends, including VR video, interactive live streaming, ultra-low-latency live streaming, and cloud rendering, all of which pose new challenges for technical solutions.

Today I will introduce the directions we focus on in the evolution of streaming media technology along four dimensions. The first is real-time performance. Real-time requirements are the biggest constraint on streaming media technology; without them, the complexity of audio and video technology could be cut by at least half. The second is immersion, a direction now in great demand; thanks to technical progress in recent years, immersive experiences are gradually entering homes and everyday life. The third is globalization. China's traffic dividend has peaked, so if we want to keep developing technology while gaining more users and more growth, we must face globalization. Finally, I will share our thoughts on the developer ecosystem. Developers here are meant in the broadest sense and include content creators. We will look at what tools and services technical progress can offer them.


On real-time interaction, I have noticed that my peers mostly discuss channel transmission. But combining the source with the channel and optimizing them together is a key technical point for real-time interaction.

The core idea is to grade both the source and the channel, and to use the limited channel to carry the most important information. To expand: the source, meaning the information to be transmitted, can be split along two dimensions, real-time requirements and reliability requirements. Some information is critical and demands both high real-time performance and high reliability, for example signaling messages. Some demands high real-time performance but less reliability, for example audio; for video the reliability requirement is lower still. Some demands high reliability but is less time-sensitive, such as file transfer and live streaming. And some needs neither, such as logs. Just as the source can be graded, so can the channel. The channel looks like a single pipe, but all these sources share it, so we divide it into sub-channels with different capabilities and reserve the important sub-channels, transmission methods, and transmission capacity for the sources that need them. This channel division and isolation is achieved by adjusting priority, FEC, and retransmission.
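The two-dimensional grading above can be sketched as a small lookup from a source's requirements to channel settings. This is only an illustration: the `ChannelPolicy` type, the `classify` function, and every numeric value are assumptions made for the example, not Volcano Engine's actual parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelPolicy:
    priority: int         # 0 is sent first (hypothetical scale)
    fec_overhead: float   # redundancy as a fraction of payload
    max_retransmits: int  # 0 disables retransmission


def classify(realtime: bool, reliable: bool) -> ChannelPolicy:
    """Map a source's two requirements to illustrative channel settings."""
    if realtime and reliable:   # e.g. signaling messages
        return ChannelPolicy(priority=0, fec_overhead=1.0, max_retransmits=5)
    if realtime:                # e.g. audio and video
        return ChannelPolicy(priority=1, fec_overhead=0.3, max_retransmits=1)
    if reliable:                # e.g. file transfer
        return ChannelPolicy(priority=2, fec_overhead=0.0, max_retransmits=10)
    return ChannelPolicy(priority=3, fec_overhead=0.0, max_retransmits=0)  # e.g. logs


# Example sources from the talk, graded along the two dimensions.
SOURCES = {
    "signaling": classify(realtime=True, reliable=True),
    "audio": classify(realtime=True, reliable=False),
    "file_transfer": classify(realtime=False, reliable=True),
    "log": classify(realtime=False, reliable=False),
}
```

The point of the sketch is that the three knobs named in the talk (priority, FEC, retransmission) are set per source class rather than once for the whole connection.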

That was the theory; now for how it is put into practice, starting with source grading. In real scenarios, audio and video sources are further split by importance. Typically the low-frequency or low-definition signal is more important, and the high-definition signal can be discarded at critical moments. Video source grading is relatively mature. Simulcast is a common technique, and HLS and CMAF, often used in live streaming, are essentially doing video source grading too. Different resolutions can also reference each other, as in SVC, whose temporal and spatial layering, together with long-term reference frames, lets us save some bandwidth. If the high-definition part does not get through, that is acceptable: there are many super-resolution reconstruction techniques, and for a short period they are good enough.

We focus more on audio, because when the channel really does change, audio problems have a greater impact on perception. A relatively cost-effective technique we use is MDC, multiple description coding. Put simply, an audio sequence is split into several sub-sequences. Have the samples count off "1, 2, 3, 1, 2, 3": all the 1s form one team, all the 2s another, and all the 3s a third. The advantage is that the three sub-sequences can be encoded and transmitted separately, and as long as any one of them reaches the far end, the low-frequency part of the audio has arrived. The more descriptions arrive, the higher the audio quality. If only one or two arrive, clarity suffers somewhat but basic communication is unaffected. This is a cost-effective approach and a natively loss-resilient codec. Combined with FEC and AI codecs, it basically ensures that as long as the network is up and even a little data gets through, the result is reasonably smooth. When nothing gets through, NetEQ and PLC can fill in.
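The "count off 1, 2, 3" idea can be sketched on raw samples. This is a toy illustration under stated assumptions: real MDC codecs encode each description, whereas here the descriptions are plain sample lists, and gaps are concealed by simply holding the nearest earlier sample. The function names `mdc_split` and `mdc_merge` are hypothetical.

```python
def mdc_split(samples, n=3):
    """Interleave samples into n descriptions: sample i goes to team i % n."""
    return [samples[i::n] for i in range(n)]


def mdc_merge(descriptions, total_len):
    """Rebuild the sequence from whichever descriptions arrived.

    `descriptions` has one entry per team; a lost description is None.
    Missing samples are concealed by repeating the last received sample
    (or the first received sample for a gap at the very start).
    """
    n = len(descriptions)
    out = [None] * total_len
    for i, desc in enumerate(descriptions):
        if desc is not None:
            out[i::n] = desc
    received = [v for v in out if v is not None]
    if not received:
        raise ValueError("all descriptions were lost")
    last = received[0]
    for idx, value in enumerate(out):
        if value is None:
            out[idx] = last
        else:
            last = value
    return out
```

With all three descriptions the original sequence comes back exactly; with one description the result is a coarser, effectively lower-sample-rate version of the same signal, which matches the claim that any single description still carries the low-frequency content.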

People often ask me: what is the best transport protocol?

In fact, transport protocols are not good or bad, only suitable or unsuitable.

What kind of transport protocol is suitable?

It depends mainly on the application scenario and whether it has real-time requirements. Without a real-time requirement, the most suitable protocol is whichever uses the channel and bandwidth efficiently; in that case TCP is a very good transport protocol. With a real-time requirement, the protocol design must achieve three things: first, reliability by tuning retransmission; second, once there is a degree of reliability, real-time performance by tuning FEC; third, priority, so that important data is guaranteed while unimportant data is discarded or delayed.

Priority is a very important design feature of a transport protocol. A good real-time transport protocol must distinguish priorities, so that important data is sent first and unimportant data is decisively dropped. For example, data with very high real-time and reliability requirements can be sent with very high redundancy, very fast retransmission, and high priority. This can waste a lot of bandwidth and cannot carry large volumes of high-bit-rate data, but it gives such data good reliability and real-time performance. Data that does not need high reliability can use low redundancy and limited retransmission, for example retransmitting only key frames and not non-key frames. This is how channel grading is achieved. These capabilities can be combined or used on their own: on some long fat networks the most sensible choice is probably to turn off retransmission entirely and let FEC do the work.
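The "send important data first, drop the rest" rule can be sketched as a priority queue drained against a per-interval byte budget. This is a minimal sketch, not the actual protocol: the `schedule` function, the packet tuple layout, and the budget model are assumptions for illustration.

```python
import heapq


def schedule(packets, budget_bytes):
    """Send packets in priority order until the byte budget runs out.

    `packets` is a list of (priority, size_bytes, name) tuples, where a
    lower priority number means more important. A packet that does not
    fit in the remaining budget is dropped rather than queued, and
    smaller packets behind it may still be sent.
    Returns (sent_names, dropped_names).
    """
    # The sequence number keeps arrival order among equal priorities.
    heap = [(prio, seq, size, name)
            for seq, (prio, size, name) in enumerate(packets)]
    heapq.heapify(heap)
    sent, dropped = [], []
    while heap:
        _, _, size, name = heapq.heappop(heap)
        if size <= budget_bytes:
            budget_bytes -= size
            sent.append(name)
        else:
            dropped.append(name)
    return sent, dropped
```

For example, with a budget of 800 bytes, a 100-byte signaling packet, a 200-byte audio packet, and a 500-byte P-frame are sent in that order, and a 400-byte log packet is dropped.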

An important task after channel grading is modeling the channel. Put simply, channel modeling means representing the network state accurately. Early channel models were simple and used a single variable, such as packet loss or bandwidth, to describe a channel. These turned out to have shortcomings, so combined models appeared, such as the bandwidth-delay product. Now, as computing power grows, channel models are becoming more complex, and some take many variables into account. But describing the network state is only one side of channel modeling. The more important side is how quickly the model senses network changes. Some models rely on statistical metrics plus filtering to detect change. That may be more accurate, but it has a big problem: it cannot react quickly. Sometimes the network has already changed, but the model only notices a few seconds later. When the real-time requirement is high, that is not good enough. So in practice, in addition to statistical metrics such as packet loss rate and bandwidth, which take several seconds to compute, we add instantaneous metrics such as reordering and delay, which can be detected quickly, to correct the estimate.
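One way to picture "slow statistics corrected by instantaneous signals" is an exponentially weighted average of measured bandwidth that is cut immediately when delay spikes or reordering appears. This is a sketch under assumptions: the class name, the smoothing factor, the 50 ms spike threshold, and the 0.7 backoff are all illustrative choices, not values from the talk.

```python
def bdp_bytes(bandwidth_kbps, rtt_ms):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_kbps * 1000 / 8 * rtt_ms / 1000


class ChannelEstimator:
    """Slow filtered bandwidth estimate with a fast instantaneous correction."""

    def __init__(self, alpha=0.1, delay_spike_ms=50.0, backoff=0.7):
        self.alpha = alpha                    # EWMA weight for new measurements
        self.delay_spike_ms = delay_spike_ms  # delay rise treated as congestion
        self.backoff = backoff                # multiplicative cut on a fast signal
        self.bandwidth_kbps = None
        self.base_delay_ms = None

    def update(self, measured_kbps, delay_ms, reordered=False):
        if self.bandwidth_kbps is None:
            self.bandwidth_kbps = measured_kbps
            self.base_delay_ms = delay_ms
            return self.bandwidth_kbps
        # Slow path: filtered statistic, accurate but lags real changes.
        self.bandwidth_kbps += self.alpha * (measured_kbps - self.bandwidth_kbps)
        self.base_delay_ms = min(self.base_delay_ms, delay_ms)
        # Fast path: react within one sample to a delay spike or reordering.
        if reordered or delay_ms - self.base_delay_ms > self.delay_spike_ms:
            self.bandwidth_kbps *= self.backoff
        return self.bandwidth_kbps
```

The filtered value alone would need many samples to fall after a sudden drop; the fast path lowers the estimate on the first sample that shows queuing delay, which is the behavior the paragraph argues for.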

Once we can describe the network state and sense its changes, what remains is how to adapt quickly to weak networks. People like the phrase "fighting weak networks", but I do not, because much of the time a weak network cannot be fought; we can only adapt to it. Sometimes the dynamic change is an inherent property of the network itself. Sometimes the network is congested because of what we are sending, and then what must change is the source. As discussed earlier, once the source has been graded, we can sense network changes quickly, adjust the source, and decisively discard unimportant data to achieve better real-time performance. So how do we finally judge whether a channel model is good? By whether, when the network changes rapidly, we can empty the buffer quickly and then deliver the most important information to the other side.


Some may ask: what is all this research good for?

Our needs keep getting richer and keep growing. Many immersive scenarios also have very high real-time requirements. Without the fine-grained channel grading and source grading described above, they can easily cause severe congestion. For example, our real-time streams often run at tens or even hundreds of megabits per second, and without good congestion control they can congest the network within seconds.

We have also explored immersive audio and video. One example is free viewpoint. It poses many challenges overall, such as handling the differences between viewing angles in real time. The harder problem we met on the streaming side is how to switch channels, that is, viewpoints, quickly. The easy approach is to preload a low-definition stream, but that does not solve fast viewpoint switching: how should we preload? Some have proposed an all-I-frame scheme and others a short-GOP scheme, but both badly damage clarity. A relatively reliable solution in our practice is Simulcast plus long-term reference frames. When the viewpoint switches, we only need to send one I-frame, a long-term reference frame, and a few P-frames to switch quickly. The GOP can then be made longer, giving a relatively good balance between real-time performance and clarity.

Another widely used form is panoramic video, which we have also studied. The core concept of panoramic video is again source grading. First, a low-definition 540p panorama is sent as a baseline, which I will not go into. For the high-definition part, how do we get an 8K video to the far end? Our approach is tiled encoding, so that a single encode can serve many viewers at low latency. We use a cube-map projection split into 640×640 tiles, which allows parallel encoding on high-end graphics cards. One card can encode 8K video at about 60 frames per second; to encode 120 frames per second, two cards can work in parallel. In the demo, if the head moves toward the upper right, we quickly bring in a few new tiles at the top and drop a few tiles on the left, which keeps head-motion latency low. There are also issues with merging tiles for encoding and decoding. H.264 and H.265 both have the relevant capabilities: H.264 has slices and H.265 has tiles. The coding standards are fairly complete, but many engineering problems remain in practice, and the demands on the stream format are high. In the end this solution brings the uplink down to about 100 Mbps and the downlink to within 10 Mbps, with 400 ms of end-to-end latency and 150 ms of head-motion latency. To go faster, edge acceleration and private cloud can also be used.
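The tile bookkeeping described above can be sketched in a few lines. This is an illustration only: the talk gives the 640×640 tile size and the 60 fps per card figure, but the cube face size used in the example (1920 pixels), the `(face, row, col)` tile identifiers, and all function names are assumptions, and the set of visible tiles is taken as an input rather than computed from head pose.

```python
import math


def tile_grid(face_px, tile_px=640):
    """All tile ids (face, row, col) of a cube map with square faces."""
    per_side = math.ceil(face_px / tile_px)
    return {(face, row, col)
            for face in range(6)
            for row in range(per_side)
            for col in range(per_side)}


def viewport_update(old_visible, new_visible):
    """Tiles to start fetching and tiles to drop when the head moves."""
    return new_visible - old_visible, old_visible - new_visible


def cards_needed(target_fps, fps_per_card=60):
    """Number of encoder cards working in parallel for a target frame rate."""
    return math.ceil(target_fps / fps_per_card)
```

Under the assumed 1920-pixel faces, `tile_grid(1920)` yields 54 tiles, and a head movement only changes the small difference between the old and new visible sets, which is why the downlink can stay far below the uplink.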

Next, point cloud transmission, that is, volumetric video. Compared with panoramic transmission, point cloud transmission offers a better and more ultimate experience. But the common practice today is still pseudo-3D: 3D objects are placed in front of a 2D background, which amounts to projecting in a 2D environment and then overlaying. The projections are laid out on a two-dimensional plane, the projection and depth data are re-encoded with 2D techniques, and the existing 2D video encoding and transmission pipeline is reused. I hope that in the next few years someone will build true 3D compression.

So far I have talked about video; now audio. A commonly used approach to immersive audio is two-channel spatial audio. It can largely reuse the existing audio encoding and transmission pipeline and only needs some simulation processing on the rendering side. But it is certainly not the best. A better approach, which we are working on, is multichannel capture: record with multiple microphones, transform the recordings into multiple audio sources, and encode and transmit those sources separately. High-precision sampling is also used here: we can raise samples from 16 bit to 24 bit and raise the sampling rate as well.

But this brings a problem: the bit rate swells. Audio bit rates can reach hundreds of kilobits or even several megabits per second. How do we achieve real-time panoramic sound in that case? One answer is the MDC multiple description coding just mentioned. Another is the set of techniques often used for music: split-band coding of high and low frequencies, SBR, and parametric stereo. These can be combined with current audio encoding and transmission to make sure the important data gets through and real-time performance is preserved.
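To see why the rate swells, it helps to work out the uncompressed PCM rate, which is channels × sampling rate × bit depth. The figures below are illustrative assumptions (stereo 48 kHz 16 bit versus eight channels at 96 kHz 24 bit); the encoded rates quoted in the talk are much lower than raw PCM, but they scale with the same three factors.

```python
def pcm_bitrate_bps(channels, sample_rate_hz, bit_depth):
    """Uncompressed PCM bit rate in bits per second."""
    return channels * sample_rate_hz * bit_depth


stereo = pcm_bitrate_bps(2, 48_000, 16)        # 1,536,000 bps
multichannel = pcm_bitrate_bps(8, 96_000, 24)  # 18,432,000 bps
growth = multichannel / stereo                 # 12x more raw data to encode
```

Going from 16 bit to 24 bit alone multiplies the raw rate by 1.5, and each doubling of channels or sampling rate doubles it again, which is why source grading matters for immersive audio too.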


Another major concern is globalization. The globalization problem has several parts. One of the more important is audio and video infrastructure capability, and another is global data consistency.

There are a great many nodes worldwide. With tens of thousands of nodes, node and link failures are inevitable, and dozens to hundreds of link degradations occur every day. Route probing on the public internet usually detects only two states, up and down. It does not detect quality degradation and cannot judge the overall capability of a path. Probing is also not timely, and detection and switchover are slow. A further drawback is that DSCP does not take effect across the wide area network. So our approach is to bring SDN technology to the WAN and build an overlay network. On the transport plane we use SDN transport capabilities throughout, and we rewrote the control layer on the control plane so that the WAN gains SDN capabilities.

This has several benefits. DSCP becomes usable: if we set DSCP, it still takes effect on this transport network, which gives us packet-level QoS control. We also get second-level detection and switchover, so we may finally be able to achieve carrier-grade quality over the public network. Audio and video is thus gradually becoming infrastructure: developers no longer need to worry much about experience, cost, and stability, and can focus on building their own products.

Next, data consistency in distributed scenarios. Because our users are all over the world, a central room controller and central signaling are not appropriate. With central signaling, a user on the other side of the world would see a very long delay when joining a room and a very long first-frame delay when publishing or subscribing, sometimes close to one second. So we cannot provide strong consistency here. At least for the relationship between rooms and users, strong consistency is not possible, and we can only guarantee eventual consistency.

In practice, when strong consistency cannot be guaranteed, how do we support large-scale interaction?

Our biggest challenge is the architectural challenge of very large rooms, such as large classes, webinars, and multiplayer game battles. Many people need to interact and many more need to watch: hundreds of thousands of users or more may be watching at the same time, with a large number of users interacting on stage. In this situation both the signaling and transport layers need to be optimized. First, signaling must be distributed and decentralized. It must not be a single signaling service, because that cannot carry so many users. Signaling should be pushed down as far as possible, spread across multiple data centers around the world and even down to the point closest to the user. We then only need to propagate the state of active users to the other signaling nodes step by step, rather than distributing the state of every user. In theory this scales without limit, and in the end it allows joining a room in one RTT and publishing or subscribing to a stream in one RTT.
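The rule "propagate only active users, keep everyone else local" can be sketched with in-process objects standing in for signaling nodes. This is a toy model under assumptions: the `SignalingNode` class and its methods are hypothetical, and the direct method call in `_announce` stands in for what would really be an asynchronous, eventually consistent update between data centers.

```python
class SignalingNode:
    """One regional signaling node holding its own users' state."""

    def __init__(self, name):
        self.name = name
        self.peers = []          # other SignalingNode instances
        self.local_users = {}    # user_id -> True if publishing
        self.remote_active = {}  # user_id -> name of the node that owns it

    def join(self, user_id, active=False):
        """Joining is handled locally; viewers are never announced to peers."""
        self.local_users[user_id] = active
        if active:
            self._announce(user_id)

    def go_on_stage(self, user_id):
        """A viewer starts publishing, so peers now need to know about it."""
        self.local_users[user_id] = True
        self._announce(user_id)

    def _announce(self, user_id):
        # Stand-in for asynchronous propagation to other nodes.
        for peer in self.peers:
            peer.remote_active[user_id] = self.name

    def visible_publishers(self):
        """Publishers a user on this node can subscribe to."""
        local = {u for u, active in self.local_users.items() if active}
        return local | set(self.remote_active)
```

If one node has a thousand viewers and one host, its peers store a single entry for the host. State shared between nodes grows with the number of people on stage rather than with the audience, which is what makes the design scale.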

The other part is distributing the overall transport topology. At global scale, the transport topology is computed and updated in parallel, because serial computation is slow and runs into many problems. The topology needs a roughly balanced distribution tree whose height is fairly uniform, which is how we achieve low latency. The parallel computation has to cope with large numbers of users entering and leaving rooms rapidly, so nodes must also be split and merged dynamically. Sometimes disaster recovery and emergency repair are needed, and the QPS for that is very high too. Loops can easily arise in these situations, so loop detection and removal is required. There are many constraints here.
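Two of the requirements above, a balanced tree of uniform height and loop detection, can be sketched on a simple parent map. This is an illustration under assumptions: the function names and the fixed fan-out layout are hypothetical, and the sketch is serial, whereas the talk describes computing and updating the topology in parallel.

```python
def build_balanced_tree(nodes, fanout):
    """Return {node: parent}, filling the tree level by level.

    The node at index i gets the node at index (i - 1) // fanout as its
    parent, so leaf depths differ by at most one.
    """
    return {node: (nodes[(i - 1) // fanout] if i else None)
            for i, node in enumerate(nodes)}


def depth(parent, node):
    """Number of hops from `node` up to the root (assumes no loops)."""
    hops = 0
    while parent[node] is not None:
        node = parent[node]
        hops += 1
    return hops


def has_loop(parent):
    """True if following parent links from any node revisits a node."""
    for start in parent:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = parent.get(node)
    return False
```

With seven relay nodes and a fan-out of two, every leaf is two hops from the root. A loop check like `has_loop` would run after each dynamic split, merge, or emergency re-parenting, since those are the operations that can create a cycle by mistake.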


Having covered globalization, I will finish with the developer ecosystem. As our business becomes more global, building a developer ecosystem that includes developers from all over the world is another problem we need to explore. We used to think the biggest cost in commercializing streaming technology was resources and bandwidth. I am now gradually finding that the biggest cost is the developers' development cost. Developers here means not only people who write code but also content creators. Creators have it very hard: making a great application or some cool content carries a very high learning cost. If we can give them truly handy tools and very convenient capabilities, that is of great value, and it is a direction Volcano Engine continues to focus on.

On developer tools, our focus is accessible creation tools. Today many creators' imagination is limited by professional, complex, and expensive authoring tools. Accessible authoring tools are a new opportunity. For a good technology to reach ordinary households, its authoring tools must be very simple. For example, we first retouched photos with Photoshop, but it was Meitu Xiuxiu that really brought photo retouching to ordinary households. We first made and edited video with Premiere and Final Cut, but it was Jianying that really brought video production to ordinary households. Douyin is so popular partly because of its recommendation system, but a steady supply of content is also one of the most important factors, and that supply depends on accessible, simple authoring tools. We now have many new technologies, such as special effects, streaming media, motion capture, digital technology, and real-time free viewpoint, but they are really hard to use. If we work together to build tools that people can use on a mobile phone and scenarios they can actually use, I believe it will bring more imagination to the whole industry.

Developer tools have also changed a great deal over the past few years. First we had SaaS. SaaS applications take a fairly fixed form, so people found them inflexible. A few years later PaaS appeared. PaaS is flexible, but the barrier for developers keeps rising. Then came aPaaS. I think it is hard to say which one is the future trend, but with so many possibilities, giving developers more choice is the better path.

Finally, something we have been thinking hard about recently is coordination between multiple modules. Every solution involves several modules working together. People tend to build one big all-in-one app that packs in every feature, so the installation package quickly exceeds 100 MB, with live streaming, video on demand, RTC, web pages, messaging, and mini programs all inside, and often all running at the same time. All these modules, along with the app itself, share the bandwidth, share the capture and playback devices, and share CPU, GPU, and memory. But many modules are designed without considering how other modules use resources and bandwidth. I hope that future module designs will consider how to share, and even how to yield resources rather than preempt them. Everyone should think about sharing and coordination at design time and leave more options to developers.

Volcano Engine has done some exploration here. First, all our modules support dynamic performance degradation and dynamic bandwidth reduction. They support not only our own built-in degradation but also custom degradation. For example, when using our SDK you can cap its maximum bandwidth at, say, 500K, and the SDK will not use more than that. You can also reserve part of the currently detected bandwidth, say 100K, for other modules: however much bandwidth is detected, the SDK uses only that amount minus 100K and leaves the rest to other modules, so they coexist better. Volcano Engine also supports setting priorities between SDKs, so that one SDK is treated as more important than another. This is useful in many scenarios. In interactive live streaming, for example, the live stream and audio and video are important, but messages are even more important. A special-effect gift sent by a user may be an on-demand file, yet sometimes it matters more than the live stream or RTC. This is even more common in education and conferencing, where, for example, audio and video sharing may be more important than the audio signal. So every module needs custom upgrade and downgrade and custom priority. Beyond bandwidth, the same cap and reserve operations apply to performance. The purpose is to leave more choices to developers.
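The cap, reserve, and priority rules can be sketched as two small functions. These are not the Volcano Engine SDK's API: `sdk_budget`, `share_by_priority`, and their parameters are hypothetical names chosen for the example, and the 500K and 100K figures are simply the ones mentioned in the talk, read as kilobits per second.

```python
def sdk_budget(estimated_kbps, cap_kbps=None, reserve_kbps=0):
    """Bandwidth one SDK may use, given the current estimate.

    `reserve_kbps` is left for other modules; `cap_kbps` is a hard upper
    limit that applies no matter how much bandwidth is detected.
    """
    usable = max(0, estimated_kbps - reserve_kbps)
    if cap_kbps is not None:
        usable = min(usable, cap_kbps)
    return usable


def share_by_priority(total_kbps, demands):
    """Grant bandwidth to modules in priority order.

    `demands` is a list of (name, priority, wanted_kbps); a lower
    priority number is served first. Returns {name: granted_kbps}.
    """
    grants = {}
    remaining = total_kbps
    for name, _, wanted in sorted(demands, key=lambda d: d[1]):
        grants[name] = min(wanted, remaining)
        remaining -= grants[name]
    return grants
```

For instance, with 600 kbps available and messaging ranked above RTC and RTC above an on-demand gift animation, messaging gets everything it asks for first and the gift download takes whatever is left, which mirrors the "messages are even more important" example.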

Based on this design philosophy, Volcano Engine launched veVOS, an integrated audio and video cloud solution. It has many advantages, which you can read about on the Volcano Engine website. Here I want to focus on coordination, because the other advantages are ones people already pay attention to, while coordination is something I hope all of us and our industry partners will attend to in the future. We need coordination between cloud and client, between modules, and even between our own modules and the host app, in order to prevent resource conflicts. Bundling modules into one package and shipping a single SDK to customers and developers is not enough. We need to offer more capabilities so that different modules, SDKs from different companies, and applications can avoid resource conflicts and work better together.

The streaming media industry has been developing for decades, and demand keeps rising in interaction, immersion, globalization, and the developer ecosystem, so I think there are still many opportunities ahead.

From Douyin to Volcano Engine, these are the opportunities in the streaming industry that we continue to focus on. The first is immersive real-time audio and video. I hope truly immersive real-time audio and video products will appear soon and become part of our lives; I believe the immersive real-time interactions we see in science-fiction films will soon be reality. The second is accessible creation tools, which can bring us rich content. I think accessible creation tools are the source of power that keeps this industry developing and producing new things. The last is audio and video infrastructure, which lets us take a lot of work off developers' hands so they can focus on building more and better products, and so move the industry forward faster. I hope everyone here will pay attention to these opportunities with us and, through progress in technology and tooling, bring richer possibilities to the streaming media industry.

Thank you all, and thank you, LiveVideoStack. Please keep following Volcano Engine!





This article is shared from the WeChat official account LiveVideoStack (livevideostack).

