Session

Congestion-control in AI/ML networks at datacenter scale

Speakers

Vivek Kashyap
Ping Yu

Label

Nuts and Bolts

Session Type

Talk

Description

Datacenters are increasingly supporting ML workloads. This necessitates high-speed, high-utilization networks that provide low latency, low jitter, and quick response to changes while maintaining fairness. RDMA has been the technology of choice since it offers high-throughput, low-latency memory-to-memory data transfer across nodes. RoCEv2 is the primary protocol used on Ethernet networks. In recent times we have seen industry efforts to overcome the limitations of RoCEv2, such as Google’s Falcon and the Ultra Ethernet Consortium’s interconnect.

These enhanced protocols, and other research in the community, include a focus on datacenter congestion control. Years of research on TCP/IP congestion control have informed the RDMA congestion control proposals. The various industry proposals define the signals used to detect congestion, how to react and adapt to congestion, how to schedule packets, and the roles of the endpoints and switches. The proposals, though targeting the same set of issues, may differ in exactly these dimensions, and in many cases have developed their mechanisms by borrowing from one another. The modern proposals also leverage In-Network Telemetry (INT) to gather comprehensive link-load information that further refines the CC algorithms.
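To make the discussion concrete, the sketch below illustrates the kind of per-flow signals these proposals consume (ECN marks as in DCQCN, delay as in Swift/TIMELY, and INT-reported queue and link state as in HPCC) together with a toy reaction to them. The field and function names are illustrative assumptions, not drawn from any specific specification.

```python
# Hypothetical sketch of the per-flow congestion signals such proposals consume.
# Field names are illustrative, not taken from any specific specification.
from dataclasses import dataclass

@dataclass
class CongestionSignals:
    ecn_marked: bool        # ECN-CE mark observed on the path (DCQCN-style signal)
    rtt_us: float           # measured round-trip time in microseconds (Swift/TIMELY-style signal)
    int_queue_depth: int    # queue occupancy reported via In-Network Telemetry (HPCC-style signal)
    int_link_util: float    # link utilization reported via INT, in the range 0.0-1.0

def adjust_rate(current_rate_gbps: float, sig: CongestionSignals,
                target_util: float = 0.95) -> float:
    """Toy reaction: back off multiplicatively on congestion, probe additively otherwise."""
    if sig.ecn_marked or sig.int_link_util > target_util:
        return current_rate_gbps * 0.5
    return current_rate_gbps + 0.1
```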

In this paper we survey the prominent congestion control mechanisms used or proposed for the datacenter, compare the basic algorithms they use, examine the factors affecting the coexistence of the various congestion control schemes, and focus on the key features that endpoint devices must provide for efficient implementation. We investigate the minimal set of signals that switches and devices must provide to the congestion control algorithms. Looking to the future, we argue that the devices’ congestion control algorithms need to be programmable.
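As one illustration of what programmable device congestion control could look like, the following is a minimal, hypothetical interface sketch in the spirit of pluggable congestion control frameworks such as Linux's tcp_congestion_ops; the callback names and signatures are assumptions, not an existing device API.

```python
# Hypothetical sketch of a programmable congestion control interface a device might
# expose; names are illustrative, modelled loosely on pluggable CC frameworks.
from abc import ABC, abstractmethod

class CongestionControl(ABC):
    @abstractmethod
    def on_ack(self, acked_bytes: int, rtt_us: float) -> None:
        """Called when the device reports acknowledged data and a fresh RTT sample."""

    @abstractmethod
    def on_congestion_event(self, ecn_marked: bool, int_report: dict) -> None:
        """Called when the device surfaces ECN marks or an INT report from the fabric."""

    @abstractmethod
    def sending_rate_gbps(self) -> float:
        """Queried by the device scheduler to pace the next burst of packets."""
```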

We also note that traditional congestion control methods adjust the congestion window according to fixed, rule-based algorithms rather than quantifiable values. We discuss a reinforcement learning-based congestion control method that collects relevant congestion control parameters and defines a reward function to train the agent in real time.
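A minimal sketch of such a reward function is shown below, assuming the agent observes throughput, delay, and loss over each control interval; the weights and the exact functional form are illustrative rather than taken from a specific published design.

```python
# Minimal sketch of a reward function for an RL-based congestion control agent.
# The weights and the functional form are illustrative assumptions.
def reward(throughput_gbps: float, rtt_us: float, loss_rate: float,
           target_rtt_us: float = 50.0,
           w_tput: float = 1.0, w_delay: float = 0.5, w_loss: float = 10.0) -> float:
    """Reward high throughput; penalize delay above the target RTT and any packet loss."""
    delay_penalty = max(0.0, rtt_us - target_rtt_us) / target_rtt_us
    return w_tput * throughput_gbps - w_delay * delay_penalty - w_loss * loss_rate
```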