Addressing High AI Computing Costs and Inefficient Multi-Tenant Operations with 400G RoCE Network Solutions

2025-03-20 3 min read

    With the widespread adoption of large models such as ChatGPT and DeepSeek, the structure of market demand for computing power has changed significantly. Demand for inference computing power has risen sharply, and overall demand has grown rapidly as models become more efficient and the barrier to application falls. Against this background, the computing power leasing business of intelligent computing centers has taken off. Many customers with high computing demands, including companies and individual developers working in model training, film and television special effects, virtual digital humans, and other fields, have chosen to rent computing power from this intelligent computing center.

    However, the rapid growth in customer numbers has also brought new challenges to the Intelligent Computing Center: performance assurance, cost control, and multi-tenant isolation have become urgent pain points. Customers want computing resources at lower cost, and they require a network architecture flexible and scalable enough to accommodate future business growth. Data isolation and security in multi-tenant scenarios have also become important factors when customers choose a service. To meet these challenges, the Intelligent Computing Center needed a network solution that guarantees high performance while enabling low-cost deployment and flexible expansion.

    400G RoCE Network

    The intelligent computing center serves a broad customer base covering both AI model training and large-scale inference. In training scenarios, hundreds to thousands of GPU servers must work together, frequently exchanging massive volumes of model parameters and gradient data; this places extremely high demands on network bandwidth and latency, and the network must also support expansion to clusters of thousands of GPUs. In inference scenarios such as real-time speech recognition and video analysis, high-bandwidth, low-latency connections and a flexible network architecture are key to expanding and managing edge nodes. However, scaling to thousand-card clusters and deploying edge nodes drives costs up significantly and puts pressure on operations. Infrastructure design must therefore fully account for scalability: enterprises need to build a high-performance network at optimal cost from the start, controlling operating expenses while meeting computing power requirements.
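    To see why training traffic is so demanding, a back-of-envelope sketch helps. The figures below (model size, GPU count, ring all-reduce as the collective) are illustrative assumptions, not details from this deployment:

    ```python
    # Hypothetical sketch: per-GPU traffic for one ring all-reduce of
    # gradients, to show why training fabrics need high bandwidth.

    def ring_allreduce_bytes_per_gpu(grad_bytes: float, num_gpus: int) -> float:
        """In a ring all-reduce (reduce-scatter + all-gather), each GPU
        sends and receives 2 * (N - 1) / N * grad_bytes per iteration."""
        return 2 * (num_gpus - 1) / num_gpus * grad_bytes

    # Example: a 7B-parameter model with FP16 gradients (2 bytes each)
    # synchronized across 1024 GPUs.
    grad_bytes = 7e9 * 2
    per_gpu = ring_allreduce_bytes_per_gpu(grad_bytes, 1024)
    print(f"{per_gpu / 1e9:.1f} GB moved per GPU per all-reduce")
    ```

    At this scale every GPU moves roughly twice the gradient volume each synchronization step, which is why per-port bandwidth, not just aggregate bandwidth, gates training throughput.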

    To meet these demanding scenarios and avoid the high procurement costs, long supply cycles, and high service costs of traditional InfiniBand (IB) solutions, the Intelligent Computing Center chose RoCE network technology. RoCE runs RDMA over an Ethernet architecture, providing high-bandwidth, low-latency data transmission with low CPU overhead. Its performance approaches that of IB, and it matches the evolution of clusters from thousands to tens of thousands of cards. The Intelligent Computing Center has built a 400G training network supporting clusters of up to 2048 GPU cards, laying a technical foundation for future business development.

    In addition, RoCE equipment is inexpensive, has short delivery cycles, and connects seamlessly with existing network equipment, reducing hardware costs by 40-50%. Its smooth upgrade path supports seamless expansion from 200G to 400G while maintaining system stability and efficiency.

    Intelligent Computing Network Architecture

    In the network architecture, 16 H3C S9825-64D switches are deployed at the spine layer. Each spine provides 64 400G ports toward the leaf layer and connects to all 32 leaf devices in a full mesh, forming a high-bandwidth, low-latency 400G backbone. The leaf layer deploys 32 switches of the same model; each has 32 400G uplink ports toward the spines and 32 400G downlink ports toward the GPU servers, ensuring efficient communication between computing nodes and the network.
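    The port counts above can be sanity-checked arithmetically. This is only an illustrative sketch based on the figures in the text; it confirms that the leaf uplinks exactly consume the spine ports and match the server-facing downlinks, which is what makes the fabric non-blocking:

    ```python
    # Sketch: verify the 16-spine / 32-leaf fabric is 1:1 (non-oversubscribed).
    SPINES = 16
    SPINE_PORTS = 64       # 400G ports per S9825-64D spine, facing leaves
    LEAVES = 32
    LEAF_UPLINKS = 32      # 400G uplinks per leaf toward the spines
    LEAF_DOWNLINKS = 32    # 400G downlinks per leaf toward GPU servers

    spine_capacity = SPINES * SPINE_PORTS       # ports available at the spine layer
    uplinks_total = LEAVES * LEAF_UPLINKS       # leaf -> spine links
    downlinks_total = LEAVES * LEAF_DOWNLINKS   # leaf -> server links

    assert spine_capacity == uplinks_total      # every uplink lands on a spine port
    assert uplinks_total == downlinks_total     # 1:1, no oversubscription
    print(uplinks_total, "x 400G fabric links;", downlinks_total, "GPU-facing ports")
    ```

    With 1024 uplinks matching 1024 downlinks, any traffic pattern from the servers can, in principle, be carried without contention at the spine layer.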

    This architecture has three significant technical advantages:

    • 400G RoCE technology greatly increases network bandwidth and reduces transmission latency, delivering faster parameter synchronization for AI training tasks.
    • The 1:1 non-oversubscribed design guarantees non-blocking transmission and avoids performance bottlenecks.
    • The modular design supports flexible growth: the cluster can expand from the current 1024 GPU cards to 2048 as the business requires, without replacing existing equipment, achieving truly smooth expansion.

    Beyond that, the architecture has further headroom and can scale to more than 10,000 cards in the future as business needs grow.

    By deploying the 400G RoCE network, the Intelligent Computing Center has built a high-performance computing network platform that meets the stringent bandwidth and latency requirements of large-scale AI training, while leaving ample room for future business growth and ensuring the architecture remains sustainable in both performance and cost.

    RoCE Network Drives Multi-Tenant Isolation and Efficient Operation

    In the computing power leasing market, the Intelligent Computing Center faces the dual challenge of multi-tenant isolation and cost-effective service. Traditional public cloud services are convenient, but their packaged offerings are largely standardized and struggle to meet customers' needs for personalized, high-performance computing. To break through this dilemma, the Intelligent Computing Center chose an innovative path: a multi-tenant solution based on RoCE network technology.

    RoCE is an ideal choice for multi-tenant scenarios thanks to its automatic configuration delivery, fine-grained isolation domains, and efficient management of large numbers of tenants. H3C deployed a multi-tenant isolation mechanism in the parameter network, achieving lossless logical isolation between tenants by combining the RoCE network with ACLs (access control lists) and VXLAN technology. This ensures the independence and reliability of each tenant's resource pool while significantly improving network resource utilization. Specifically, RoCE networks support multi-tenant scenarios in the following ways:

    • Lossless isolation: ACL/VXLAN technology completely isolates traffic between tenants, avoiding resource contention and data leakage. Taking the 2048-GPU cluster as an example, each card can be allocated independently to a tenant, enabling highly granular division of resources.
    • Flexible expansion: the fabric supports tenants with up to thousands of GPU cards. Even at the 10,000-card level, each card can still be assigned to an independent tenant, fully meeting the needs of large-scale multi-tenant scenarios.
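    The isolation model above can be sketched in miniature: one VXLAN VNI per tenant, disjoint GPU allocations, and a default-deny rule between tenants. The class and allocator below are hypothetical illustrations, not H3C's actual management API:

    ```python
    # Minimal sketch of per-tenant logical isolation (one VNI per tenant,
    # default-deny between tenants). All names here are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class TenantFabric:
        base_vni: int = 10000
        tenants: dict = field(default_factory=dict)   # tenant -> (vni, gpu set)

        def allocate(self, tenant: str, gpu_ids: list) -> int:
            """Assign the tenant a dedicated VNI and a disjoint GPU set."""
            used = {g for _, gpus in self.tenants.values() for g in gpus}
            if used & set(gpu_ids):
                raise ValueError("GPU already allocated to another tenant")
            vni = self.base_vni + len(self.tenants)
            self.tenants[tenant] = (vni, set(gpu_ids))
            return vni

        def may_communicate(self, src: str, dst: str) -> bool:
            """Default-deny ACL: traffic is allowed only within one tenant/VNI."""
            return src == dst and src in self.tenants

    fabric = TenantFabric()
    fabric.allocate("tenant-a", list(range(0, 512)))      # cards 0-511
    fabric.allocate("tenant-b", list(range(512, 1024)))   # cards 512-1023
    print(fabric.may_communicate("tenant-a", "tenant-b"))  # False: isolated
    ```

    The key property is that tenancy is enforced in the network itself (VNI membership plus ACLs), so isolation holds regardless of how many cards a tenant rents, which is what makes per-card granularity at the 2048- or 10,000-card scale practical.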

    RoCE provides a customized network architecture that meets high-performance, low-latency requirements beyond the standardized services of public clouds. With it, the Intelligent Computing Center operates efficiently in multi-tenant scenarios: multiple tenants can run large-scale AI training jobs simultaneously without worrying about network congestion or data security. The low-cost deployment and flexible expansion of the RoCE network have also helped the center cut operating costs substantially. Through customized network configuration and flexible tenant management strategies, the Intelligent Computing Center can offer more customers efficient, flexible, and cost-effective computing services.

    The successful deployment of the 400G RoCE network lets the Intelligent Computing Center meet the high-bandwidth, low-latency demands of today's large-scale AI training, inference, and multi-tenant scenarios, while reserving ample headroom for future growth through a low-cost, highly scalable architecture. As the "East Data West Computing" strategy advances, the Intelligent Computing Center will continue to leverage the technical advantages of the RoCE network to optimize service quality, provide efficient, flexible, and secure computing support for more enterprises and developers, and drive continuous innovation in the industry.
