Session

Merging the Networking Worlds

Speakers

David Ahern
Shrijeet Mukherjee

Label

Nuts and Bolts

Session Type

Talk


Description

Modern applications for high-performance networking have largely turned to RDMA and InfiniBand after declaring TCP/IP too slow and inefficient. As discussed at LPC 2022 [1], the core of the Linux IP/TCP/Ethernet networking stack is capable of running at very high speeds (line rates of existing and next-generation NICs), but changes are needed in how the stack is used. This work examines the effort needed to use the well-established TCP stack and its rich congestion control tools and algorithms with modern high-performance applications such as machine learning and storage disaggregation.

It has long been known that the current BSD socket APIs for sending and receiving data carry too much overhead (e.g., system calls, memcpy, page reference counting), which severely limits the achievable data rate, but the simplicity and universal applicability of these well-established interfaces have prevailed, and they have proven hard to replace. In pursuit of keeping the fundamental components while reconfiguring the interfaces for higher performance, we studied a few approaches and settled on one.
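To make the baseline concrete before walking through the alternatives, here is a minimal sketch of a conventional blocking receive loop: every iteration is a system call, and the kernel copies the payload from its socket buffers into the application's buffer.

    #include <sys/socket.h>
    #include <sys/types.h>

    /* Conventional BSD-socket receive loop: one system call per iteration,
     * plus a kernel-to-user memcpy of the payload into 'buf'. */
    ssize_t recv_exact(int sockfd, char *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(sockfd, buf + got, len - got, 0);
            if (n <= 0)
                return n;        /* error or peer closed the connection */
            got += n;
        }
        return (ssize_t)got;
    }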

  • io_uring provides software queues between the kernel and userspace with the goal of reducing the number of system calls needed to submit work and manage completions, and it recently gained support for the Tx zero-copy API. However, it solves only part of the problem: zero-copy on Tx, fewer syscalls, and reduced refcounting via registered buffers (see the sketch after this list). io_uring does not work with or have access to hardware queues, nor can it register buffers with hardware to amortize the DMA setup.

  • XDP sockets provide a solution for hardware queues and for registering userspace buffers with hardware, but XDP sockets involve a full kernel bypass, which affects how the software is designed and removes access to the kernel's socket API and networking stack (addresses, routing, TSO, GRO, retransmission, congestion control, etc.) leveraged by typical socket-based applications.

  • Verbs, RDMA and InfiniBand currently have the mindshare for low latency, high throughput networking, but RDMA is its own separate ecosystem, both hardware and software. This is a predominantly Ethernet world, with a push for converged data center infrastructure based on Ethernet, a point recognized in the RDMA world by RoCE, now in its second version. While RoCE allows RDMA to run over Ethernet in carefully engineered networks, it has been difficult to build shared-use networks with it; TCP's congestion control and its behavior in lossy environments are much better understood.
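As a reference point for the io_uring path above, the sketch below (using liburing, with error handling omitted and the two-step send-zc completion handling simplified) shows a zero-copy send from a buffer registered with the ring; exact flags and completion semantics should be checked against the liburing documentation.

    #include <liburing.h>
    #include <sys/socket.h>

    /* Minimal sketch: zero-copy send from a buffer registered with the ring.
     * A send-zc request normally produces a result CQE and a later
     * notification CQE indicating the kernel no longer references the buffer. */
    int send_zc_fixed_once(int sockfd, void *buf, size_t len)
    {
        struct io_uring ring;
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        io_uring_queue_init(8, &ring, 0);
        io_uring_register_buffers(&ring, &iov, 1);   /* pin pages once, avoid per-send refcounting */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_send_zc_fixed(sqe, sockfd, buf, len, 0, 0, 0);  /* registered buffer index 0 */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);      /* result of the send */
        int res = cqe->res;
        io_uring_cqe_seen(&ring, cqe);

        io_uring_wait_cqe(&ring, &cqe);      /* notification: buffer may be reused */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return res;
    }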

This talk discusses a proposal to merge some concepts of RDMA with traditional TCP/IP networking: specifically, the ability to create and manage memory regions shared between processes and hardware, along with userspace access to hardware queues and software queues to a kernel driver for submitting work and reaping completions. This allows applications to submit pre-allocated and shaped buffers to hardware for zero-copy Rx and Tx while still leveraging the kernel's TCP stack. The end result is that an application can use the traditional socket APIs and networking stack for communication with the efficiencies of RDMA, io_uring and XDP sockets: reduced syscalls, improved buffer management, and direct data placement into userspace buffers.
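The following is a purely illustrative sketch of what such an application could look like. None of these function names exist or are being proposed as-is; they only stand in for the concepts above: register a memory region once, bind queues to an ordinary TCP socket, post buffers, and reap completions that point into the application's own memory.

    #include <stddef.h>

    /* Hypothetical API, for illustration only; these functions do not exist.
     * They stand in for the concepts in the proposal. */
    struct mem_region *mr_register(void *base, size_t len);            /* pin + DMA-map once */
    struct queue      *queue_bind(int tcp_fd, struct mem_region *mr);  /* attach queues to a TCP socket */
    int  queue_post_rx(struct queue *q, size_t off, size_t len);       /* hand a buffer to the NIC */
    int  queue_poll(struct queue *q, size_t *off, size_t *len);        /* reap a completion, no syscall */
    void process(const char *data, size_t len);                        /* application handler (also illustrative) */

    void rx_loop(int tcp_fd, void *pool, size_t pool_sz, size_t buf_sz)
    {
        struct mem_region *mr = mr_register(pool, pool_sz);
        struct queue *q = queue_bind(tcp_fd, mr);

        for (size_t off = 0; off + buf_sz <= pool_sz; off += buf_sz)
            queue_post_rx(q, off, buf_sz);        /* pre-post receive buffers */

        size_t off, len;
        while (queue_poll(q, &off, &len) == 0)    /* TCP ordering/retransmission still in the kernel */
            process((char *)pool + off, len);     /* data landed in place: zero-copy Rx */
    }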

We walk through a comparison of an application using the current socket APIs versus an application using some of these RDMA concepts.

Finally, the proposal is more about making the stack, its interfaces and its control knobs modular, so that a developer can assemble the combination that gives the trade-off they seek. Further, this is achievable without a major departure from the existing stack.

[1] https://lpc.events/event/16/contributions/1345/