Understanding the high-performance, predictable data center network in one article

2022-08-11 03:51:00 【Lingyun moment】

At the recently concluded first China Computing Conference, Alibaba Cloud's Panjiu infrastructure attracted a lot of attention. During the event, "how to keep a high-performance network running efficiently and stably" was the question customers asked most often. This article reveals the core technology behind the Panjiu Predictable Network.

Panjiu predictable network

In recent years, the artificial intelligence industry has grown rapidly, but the growth of GPU computing power has not kept pace with the needs of AI applications, so distributed machine learning has become the norm in the industry. Making a huge number of heterogeneous computing resources work together efficiently is not easy, and a high-performance network is the key enabling technology.

Panjiu Predictable Network is a high-performance, predictable data center network developed by Alibaba Cloud. It is application-centric and delivers a high-performance, predictable network system through "Alibaba Cloud's full-stack self-development + end-network integration technology".

The system is built on a technical foundation of Alibaba Cloud's self-developed switches, self-developed network cards, and self-developed high-performance network protocol stack. Innovative end-network integration technology lets these self-developed components collaborate efficiently. The result offers large scale, high bandwidth, low latency, high reliability, and predictable performance, providing a solid network base for Alibaba Cloud's ultra-large-scale computing and storage clusters.

Figure | Panjiu Predictable Network exhibition site

Showcase of three core technologies

High-performance network architecture

To get the best computing power and energy efficiency, Alibaba Cloud developed the High Performance Network (HPN) architecture. It adopts a two-tier, non-oversubscribed Clos topology with dual-plane forwarding and can support more than 10,000 A100 GPUs. It achieves the theoretical minimum static forwarding delay between any two points in a 10,000-GPU cluster. The larger number of forwarding links also keeps the probability of hash collisions, and the congestion they cause, as low as possible, so the cluster as a whole reaches optimal computing performance.
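As a rough illustration of the two claims above (cluster scale and hash collisions), the following minimal Python sketch sizes a generic non-oversubscribed two-tier Clos and estimates the chance that flows collide on the same link. It is not Alibaba Cloud's sizing method: the function names, the uniform-hashing assumption, and the switch radix used in the usage note are all hypothetical, since the article does not give port counts.

```python
from math import prod


def two_tier_clos_capacity(radix: int) -> dict:
    """Size a non-oversubscribed two-tier (leaf-spine) Clos.

    Assumption: every switch has the same port count (radix), and each
    leaf splits its ports evenly between endpoints and spines (1:1).
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be a positive even number")
    down = radix // 2   # leaf ports facing endpoints
    up = radix // 2     # leaf ports facing spines
    spines = up         # one uplink from each leaf to each spine
    leaves = radix      # each spine port connects to a different leaf
    return {"leaves": leaves, "spines": spines, "endpoints": leaves * down}


def hash_collision_probability(flows: int, links: int) -> float:
    """Chance that at least two flows share a link.

    Assumption: each flow is hashed uniformly and independently onto one
    of the parallel links (the classic birthday-problem estimate).
    """
    if links < 1:
        raise ValueError("links must be at least 1")
    if flows > links:
        return 1.0
    return 1.0 - prod((links - i) / links for i in range(flows))
```

For example, `two_tier_clos_capacity(128)` describes a fabric with 8,192 endpoint ports under these assumptions, and `hash_collision_probability(8, 64)` is smaller than `hash_collision_probability(8, 16)`, which matches the intuition that more forwarding links lower the chance of hash collisions.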

In addition, the dual-plane design ensures that the failure of a single device or a single plane does not affect the entire cluster network. Coupled with service access over dual uplinks in the stack, the whole network cluster can stably and reliably provide continuous network service, so users do not have to worry about the impact of data center network software and hardware failures.


Figure | High-performance predictable data center network architecture

Full-stack self-developed end-network integration

Self-developed switch

All network equipment and optical interconnection components in the high-performance network cluster are developed in-house. The AliNOS-based software system connects the monitoring and control capabilities of individual devices with those of the whole network, integrating monitoring and control while new functions are iterated rapidly. The self-developed hardware is modular and designed for Alibaba Cloud's scenarios, giving Alibaba Cloud autonomous control over cost, supply, and operation and maintenance capabilities.


Figure | Full-stack self-development of end-network integration

Self-developed high-performance protocol stack

The most widely used high-performance protocol stacks in the industry today are InfiniBand (IB) and RoCEv2, but both have shortcomings at large scale. IB equipment is expensive and cannot interoperate with Ethernet, so users often need to build an expensive dedicated IB network. RoCEv2 relies on PFC (Priority Flow Control), which brings significant stability risks and limits scale.

After several years of large-scale practice with RoCEv2, Alibaba Cloud began developing its own high-performance network protocol, Solar-RDMA, in 2019. Using Alibaba's self-developed, end-network integrated HPCC congestion control algorithm, Solar-RDMA significantly reduces switch queue jitter, achieves high bandwidth and low latency with a PFC-free deployment, and ensures that data moves between nodes in the shortest possible time, so that computing power is delivered at its maximum continuously.
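To give a feel for how an end-network integrated congestion control scheme can keep switch queues short, here is a heavily simplified Python sketch of a telemetry-driven window update in the general spirit of HPCC. It is not the Solar-RDMA implementation, and it omits much of the published algorithm (for example smoothing and staged increases). The `HopTelemetry` fields, the function names, and the parameter defaults (`target_util`, `additive_bytes`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HopTelemetry:
    """Per-hop values a switch could report to the sender (hypothetical fields)."""
    queue_bytes: float   # bytes waiting in the egress queue
    tx_rate_bps: float   # measured output rate of the egress port
    link_bps: float      # capacity of the egress port


def hop_utilization(hop: HopTelemetry, base_rtt_s: float) -> float:
    """Normalized load on one hop: queue relative to one bandwidth-delay
    product, plus output rate relative to link capacity."""
    bdp_bytes = hop.link_bps * base_rtt_s / 8
    return hop.queue_bytes / bdp_bytes + hop.tx_rate_bps / hop.link_bps


def next_window(window_bytes: float,
                hops: Sequence[HopTelemetry],
                base_rtt_s: float,
                max_window_bytes: float,
                target_util: float = 0.95,     # assumed target utilization
                additive_bytes: float = 1000.0  # assumed additive-increase step
                ) -> float:
    """Scale the sending window toward the target utilization of the most
    loaded hop, then add a small additive step so idle capacity is probed."""
    worst = max((hop_utilization(h, base_rtt_s) for h in hops), default=0.0)
    if worst <= 0.0:
        # No load reported: only apply the additive step.
        return min(window_bytes + additive_bytes, max_window_bytes)
    scaled = window_bytes * target_util / worst + additive_bytes
    return min(scaled, max_window_bytes)
```

In this sketch, a hop reporting a standing queue on a fully used link has utilization above the target, so the window shrinks and the queue drains. A lightly loaded path lets the window grow up to `max_window_bytes`. Because the sender reacts to measured queue depth rather than waiting for pause frames, a scheme of this kind is one way to operate without PFC.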

Self-developed high-performance network card

To truly achieve high performance, Alibaba Cloud started designing a hardware offload solution for the Solar-RDMA protocol in 2020 and in 2021 successfully developed FIC (Fusion Intelligence Card), a high-performance network card that carries the protocol. The FIC card has now been deployed at large scale.

Platform Services

Efficient and stable operation of the high-performance network is always the core requirement of customers.

To meet this requirement, Alibaba Cloud developed its own NUSA (Network Unified Service Architecture) service platform, which provides end-to-end network automation services covering R&D, testing, delivery, operation, and change.

Based on the innovative end-network integration technology system, NUSA provides automatic provisioning of the high-performance network, automatic network performance measurement and diagnosis, automatic network fault monitoring, alarm, and location, network-wide resource management, and high-performance network virtualization services.

By integrating key end-host and network technologies, Alibaba Cloud has opened a new era of predictable data center networks, providing the underlying network guarantee for the continuous and stable output of cluster computing power.

In the future, Alibaba Cloud will continue to evolve the network towards richer communication semantics, higher bandwidth, lower latency, and better usability.

Copyright notice
This article was written by [Lingyun moment]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/223/202208110347199031.html