Case Study | Tackling High AI Computing Costs and Multi-Tenant Efficiency with the 400G RoCE Network Solution
With the widespread adoption of the DeepSeek high-efficiency inference model, the structure of market demand for computing has changed significantly: the share of demand for inference computing power has risen substantially, while overall demand has also grown rapidly thanks to the model's efficiency and the lowered barrier to application. Under these circumstances, the computing-leasing business of intelligent computing centers has become very popular. Many enterprises with high computing demands, including those working on model training, film and television visual effects, and digital humans, as well as individual developers, now choose to lease computing power from intelligent computing centers.
However, the rapid growth in customer numbers also brings new challenges to the intelligent computing center: performance assurance, cost control, and multi-tenant isolation have become core pain points that urgently need resolution. Customers want to obtain computing resources at lower cost while also requiring a network architecture flexible and scalable enough to accommodate future business growth. Furthermore, data isolation and security in multi-tenant scenarios are important considerations for customers when choosing a service. To address these challenges, the intelligent computing center urgently needed a network solution that ensures high performance, enables low-cost deployment, and offers flexible expansion.
400G RoCE Network
Breaking the Stalemate of High-Performance Demand and Cost Challenges
One of the intelligent computing centers supported by H3C serves a broad customer base spanning AI model training and large-scale inference scenarios.
In training scenarios, hundreds to thousands of GPU servers must work collaboratively, frequently exchanging massive volumes of model parameters and data. This demands extremely high network bandwidth and low latency, as well as support for scaling up to thousand-card clusters.
In inference scenarios (such as real-time speech recognition and video analysis), high bandwidth, low-latency connectivity, and a flexible network architecture are crucial for supporting the expansion and management of edge nodes. However, scaling to thousand-card clusters and deploying edge nodes significantly increases costs, putting pressure on operations, so the infrastructure design must fully account for scalability.
The enterprise needed to build a high-performance network at the optimal cost from the initial stage to meet computing needs while effectively controlling operating expenses.
To meet these demanding scenarios and avoid the problems of traditional InfiniBand (IB) solutions, namely high procurement costs, long supply cycles, and high service costs, the intelligent computing center selected RoCE (RDMA over Converged Ethernet) network technology. RoCE runs RDMA over a standard Ethernet architecture, providing high-bandwidth, low-latency data transmission with low CPU overhead. Its performance is close to that of IB solutions and matches the cluster evolution path from thousand-card to ten-thousand-card scale. The intelligent computing center built a 400G training network supporting a maximum cluster size of 2,048 GPU cards, laying the technical foundation for future business development.
Furthermore, RoCE equipment costs less, has shorter supply cycles, and integrates seamlessly with existing network equipment, significantly reducing deployment costs (hardware costs reduced by 40-50%). Its smooth upgrade capability supports seamless expansion from 200G to 400G while maintaining system stability and efficiency.
Intelligent Computing Network Architecture
Supporting Efficient Expansion of Thousand-Card Clusters
In the network architecture design, the Spine layer deployed 16 H3C S9825-64D switches, each equipped with 64 x 400G downlink ports, connected to 32 Leaf devices through a fully interconnected topology, building a high-bandwidth, low-latency 400G backbone interconnection network. The Leaf layer deployed 32 switches of the same model, each configured with 32 x 400G uplink ports interconnected with the Spine layer, and providing 32 x 400G downlink ports to connect to GPU servers, ensuring efficient communication between compute nodes and the network.
This network architecture offers significant technical advantages:
High Performance: 400G RoCE technology significantly increases network bandwidth and reduces data transmission latency, accelerating parameter synchronization for AI training tasks.
Non-Blocking: The 1:1 non-oversubscribed architecture ensures non-blocking network transmission, avoiding performance bottlenecks.
Flexible Expansion: The modular design supports gradual scaling from the existing 1,024 GPU cards to 2,048 GPU cards as business needs grow, and the expansion process does not require replacing existing equipment, achieving truly smooth capacity expansion.
Future Potential: The architecture has immense scaling headroom and can be expanded beyond ten thousand cards in the future as business requires.
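The port budget of the Spine-Leaf fabric described above can be sanity-checked with a short calculation. The helper below is an illustrative sketch (not H3C tooling); the figures are taken directly from this case study.

```python
# Sketch: verify the port budget of the 400G Spine-Leaf fabric described above.
# fabric_summary is a hypothetical helper; the input figures come from the case study.

def fabric_summary(spines: int, spine_ports: int,
                   leaves: int, leaf_uplinks: int, leaf_downlinks: int) -> dict:
    """Return port totals and the downlink:uplink (oversubscription) ratio."""
    spine_capacity = spines * spine_ports      # ports available at the Spine layer
    uplinks = leaves * leaf_uplinks            # Leaf-to-Spine links
    access_ports = leaves * leaf_downlinks     # 400G ports facing GPU servers
    return {
        "spine_capacity": spine_capacity,
        "uplinks": uplinks,
        "access_ports": access_ports,
        # 1.0 means a 1:1 non-oversubscribed (non-blocking) design
        "oversubscription": access_ports / uplinks,
        # parallel links each leaf spreads across each spine
        "links_per_spine_per_leaf": leaf_uplinks // spines,
    }

# 16 spines with 64 x 400G ports, 32 leaves with 32 uplinks + 32 downlinks
summary = fabric_summary(spines=16, spine_ports=64,
                         leaves=32, leaf_uplinks=32, leaf_downlinks=32)
print(summary)
# access_ports = 1024, uplinks = 1024, oversubscription = 1.0
```

The result matches the article: 1,024 GPU-facing 400G ports (the current 1,024-card stage), 1,024 uplinks against 1,024 spine ports, hence the 1:1 non-blocking ratio, with each leaf spreading two links to every spine.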
By deploying the 400G RoCE network, this intelligent computing center has successfully built a future-proof, high-performance computing network platform. It meets the stringent requirements for high bandwidth and low latency in current large-scale AI training scenarios while leaving ample room for future business growth, ensuring the long-term sustainability of the network architecture in terms of performance and cost.
RoCE Network
Multi-Tenant Business Isolation and Efficient Operations
In the computing leasing market, intelligent computing centers face the dual challenge of multi-tenant isolation and cost-effective services. While traditional public cloud services are convenient, their standardized package configurations primarily offer generic services, making it difficult to meet customers' needs for personalized, high-performance computing power. To break this predicament, this intelligent computing center chose an innovative path: a multi-tenant solution based on RoCE network technology.
RoCE, with its support for automatic configuration deployment, fine-grained isolation domains, and efficient management of large-scale tenants, is the ideal choice for multi-tenant scenarios. H3C deployed a multi-tenant isolation mechanism in the parameter network. By combining the RoCE network with ACL and VXLAN technologies, they achieved lossless logical network isolation between tenants. This solution not only ensures the independence and reliability of each tenant's resource pool but also significantly improves network resource utilization.
Specifically, the RoCE network supports multi-tenant scenarios in the following ways:
Lossless Isolation: Based on ACL/VXLAN technology, it ensures complete data isolation between tenants, avoiding resource contention and data-leakage risks. With 2,048 GPU cards, for example, each card can be independently allocated to a tenant, achieving highly fine-grained resource partitioning.
Flexible Scaling: The equipment supports tenant scales of up to thousands of GPU cards. Even when expanding to the ten-thousand-card level, each card can still be assigned to an independent tenant, fully satisfying the needs of large-scale multi-tenant scenarios.
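The per-card tenant assignment described above can be pictured as a controller mapping each tenant to a unique VXLAN network identifier (VNI) and an exclusive set of GPU ports. The sketch below is hypothetical (the class name, VNI range, and allocation policy are assumptions, not H3C's controller); it only illustrates the isolation bookkeeping.

```python
# Sketch: map tenants to VXLAN VNIs and exclusive GPU ports for logical isolation.
# Hypothetical allocator; names and the base VNI are illustrative assumptions.

class TenantAllocator:
    """Assigns each tenant a unique VNI and a non-overlapping set of GPU ports."""

    def __init__(self, total_gpus: int, base_vni: int = 10000):
        self.free_gpus = list(range(total_gpus))   # unassigned GPU port indices
        self.next_vni = base_vni                   # next VXLAN network identifier
        self.tenants = {}                          # tenant name -> (vni, gpu list)

    def allocate(self, name: str, gpus: int):
        """Carve out `gpus` ports and a fresh VNI for tenant `name`."""
        if gpus > len(self.free_gpus):
            raise ValueError("not enough free GPU ports")
        assigned, self.free_gpus = self.free_gpus[:gpus], self.free_gpus[gpus:]
        vni, self.next_vni = self.next_vni, self.next_vni + 1
        self.tenants[name] = (vni, assigned)
        return vni, assigned

alloc = TenantAllocator(total_gpus=2048)
vni_a, gpus_a = alloc.allocate("tenant-a", 512)
vni_b, gpus_b = alloc.allocate("tenant-b", 1)    # fine-grained: a single card
assert not set(gpus_a) & set(gpus_b)             # tenants never share a port
```

Because every tenant gets its own VNI, traffic is segregated at the overlay level, while the exclusive GPU-port assignment prevents resource contention, which is the essence of the per-card isolation the article describes.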
RoCE provides a customized network architecture that meets high-performance and low-latency requirements, surpassing the standardized services of public clouds. By deploying the RoCE network, the intelligent computing center has achieved efficient operation in multi-tenant scenarios. For example, multiple tenants can simultaneously run large-scale AI training tasks without worrying about network congestion or data security issues. Furthermore, the low-cost deployment and flexible expansion capability of the RoCE network have helped the intelligent computing center significantly reduce operating costs. Through customized network configurations and flexible tenant management strategies, the intelligent computing center can provide more customers with efficient, flexible, and more cost-effective computing services.
The successful deployment of the 400G RoCE network has enabled the intelligent computing center to not only meet the high-bandwidth, low-latency requirements of current large-scale AI training, inference, and multi-tenant scenarios but also to reserve ample room for future business growth through a low-cost, highly scalable network architecture. In the future, the intelligent computing center will continue to rely on the technical advantages of the RoCE network to optimize service quality, providing efficient, flexible, and secure computing support for more enterprises and developers, and promoting continuous industry innovation and development.
