Session
AI Networks - RoCEv2 and the role of "netdev"
Chairs
David Ahern
Leon Romanovsky
Label
Nuts and Bolts
Session Type
Workshop
Description
AI training requires high-bandwidth, low-latency networks and networking stacks, as training times are highly dependent on tail latency. Collective communication libraries such as NCCL can use socket-based designs (e.g., TCP/IP) or RDMA operations to move training data between servers. The amount of data to be moved between nodes has exploded with larger LLM sizes. That volume of data, along with the increasing speeds of Ethernet (800G as state of the art), emphasizes the inefficiencies of socket-based networking. A data path in which packets traverse a full networking stack has too much overhead, in both throughput and latency, to compete with RDMA and implementations like RoCEv2. So what does that mean for “netdev”? While RDMA operations are used to efficiently move data, the RoCEv2 protocol itself is based on standard networking protocols (UDP/IP/Ethernet), and the core InfiniBand S/W stack leverages the Linux networking stack (aka “netdev”) where possible.
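To make the "RoCEv2 over UDP/IP/Ethernet" layering concrete, here is a minimal Python sketch that decodes the 12-byte InfiniBand Base Transport Header (BTH) carried at the start of a RoCEv2 UDP payload. The UDP destination port 4791 and the field layout come from the RoCEv2 and InfiniBand specifications, not from the session description; the names `BTH` and `parse_bth` are hypothetical and chosen only for illustration.

```python
import struct
from dataclasses import dataclass

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2
BTH_LEN = 12             # size of the Base Transport Header in bytes


@dataclass
class BTH:
    """Decoded Base Transport Header fields (illustrative subset)."""
    opcode: int      # transport opcode, e.g. 0x04 for RC SEND Only
    solicited: bool  # SE bit
    migreq: bool     # M bit
    pad_count: int   # number of pad bytes (0-3)
    tver: int        # transport header version
    pkey: int        # partition key
    dest_qp: int     # 24-bit destination queue pair number
    ack_req: bool    # A bit: responder should acknowledge
    psn: int         # 24-bit packet sequence number


def parse_bth(udp_payload: bytes) -> BTH:
    """Parse the BTH from the first 12 bytes of a RoCEv2 UDP payload."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload too short for a BTH")
    # Network byte order: opcode(8) | flags(8) | P_Key(16) | resv+QP(32) | A+resv+PSN(32)
    opcode, flags, pkey, qp_word, psn_word = struct.unpack(
        "!BBHII", udp_payload[:BTH_LEN]
    )
    return BTH(
        opcode=opcode,
        solicited=bool(flags & 0x80),
        migreq=bool(flags & 0x40),
        pad_count=(flags >> 4) & 0x3,
        tver=flags & 0xF,
        pkey=pkey,
        dest_qp=qp_word & 0xFFFFFF,   # upper 8 bits hold FECN/BECN and reserved bits
        ack_req=bool(psn_word & 0x80000000),
        psn=psn_word & 0xFFFFFF,
    )
```

Everything below the BTH in such a packet (UDP, IP, Ethernet) is ordinary networking, which is why addressing, routing and neighbor resolution can be shared with the regular Linux stack.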
In this workshop, we will discuss the RoCEv2 protocol for AI training networks and the role of “netdev”. This supporting role includes the netdev device model with operations for H/W offloads, port state (MTU and carrier), and network addresses, as well as routing and neighbor resolution. The socket-based stack is also used for out-of-band communications (e.g., exchanging metadata). We will also revisit the solution presented at Netdev 0x16 that shows how to connect Linux TCP with QPs to avoid the traditional overhead of sockets and full-stack traversal, improving performance while retaining the advantages of TCP and its congestion control protocols. Finally, we will review the recent contributions to use device memory with Linux TCP, the related io_uring work, and what that means for performance relative to RoCEv2.
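As a rough illustration of the port state that the netdev device model exposes, the following Python sketch reads the MTU, carrier and MAC address of an interface from the standard `/sys/class/net/<ifname>/` attributes. It only shows what userspace can observe, not how the RDMA core consumes this state inside the kernel. The function name `read_port_state` and its `sysfs_root` parameter are hypothetical, added here so the example can be pointed at a different directory.

```python
from pathlib import Path


def read_port_state(ifname, sysfs_root="/sys/class/net"):
    """Return MTU, carrier and MAC address for a netdev, as seen in sysfs.

    Attributes that cannot be read (for example, `carrier` on an interface
    that is administratively down) are reported as None.
    """
    dev = Path(sysfs_root) / ifname

    def read(attr):
        try:
            return (dev / attr).read_text().strip()
        except OSError:
            return None

    mtu = read("mtu")
    carrier = read("carrier")
    return {
        "mtu": int(mtu) if mtu is not None else None,
        "carrier": (carrier == "1") if carrier is not None else None,
        "mac": read("address"),
    }
```

For example, `read_port_state("eth0")` returns a dictionary with the keys `mtu`, `carrier` and `mac`; the interface name is a placeholder for whichever netdev backs the RoCEv2 port.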
Important Dates
Closing of CFS | Jan 17th, 2025 |
Notification by | Jan 26th, 2025 |
Conference dates | March 10th-13th |