Session

Fast ZC Rx Data Plane using io_uring

Speakers

David Wei
Pavel Begunkov

Label

Nuts and Bolts

Session Type

Talk

Contents

Description

Memory bandwidth is a bottleneck in many distributed services running at scale, as I/O bandwidth has not kept pace with CPU or NIC speeds over time. One limitation of kernel socket-based networking is that data is first DMA'd into kernel memory and then copied again into user memory, which adds pressure on overall memory bandwidth and comes with a CPU cost. Historically there have been two ways to improve this: kernel bypass (e.g. DPDK, PF_RING and XDP) and RDMA (e.g. RoCE and InfiniBand), each with its own downsides in exchange for the gains. Kernel bypass often requires custom or patched drivers, kernel modules, or kernels, and forces userspace to re-implement a networking stack. With RDMA, the NIC becomes responsible for storing state and handling flows in hardware, making it harder to debug when things go wrong or to patch, leading to “ossification”; the solutions are also often proprietary.

There are benefits to using the kernel networking stack: it is free, open, highly debuggable and patchable. With newer hardware support for technologies such as header splitting and flow steering, it is now possible to build a zero copy network RX solution that sits between kernel bypass and RDMA. With header splitting, the kernel still handles the stateful networking over packet headers, while payload data is DMA'd directly into its intended destination without an intermediate copy. Flow steering directs the chosen flows onto specific RX queues configured for this. This paper proposes an implementation of zero copy network RX into host userspace memory, using io_uring as the user-facing API.
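To make the flow steering step concrete, below is a small sketch using the standard ethtool ntuple ioctl, one existing way to pin a single TCP flow to a chosen RX queue. The interface name, port, and queue index are arbitrary, error handling is trimmed, and RX_CLS_LOC_ANY assumes the driver can pick a free rule slot:

    /* Steer TCP/IPv4 traffic with dst port `dport` to RX queue `rxq`,
     * so that only the chosen flow lands on the queue backed by user
     * memory. Error handling trimmed for brevity. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <netinet/in.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <unistd.h>

    int steer_flow_to_queue(const char *ifname, uint16_t dport, uint32_t rxq)
    {
        struct ethtool_rxnfc nfc = {
            .cmd = ETHTOOL_SRXCLSRLINS,
            .fs = {
                .flow_type   = TCP_V4_FLOW,
                .ring_cookie = rxq,             /* target RX queue */
                .location    = RX_CLS_LOC_ANY,  /* driver picks a rule slot */
            },
        };
        struct ifreq ifr = {0};
        int fd, ret;

        nfc.fs.h_u.tcp_ip4_spec.pdst = htons(dport);
        nfc.fs.m_u.tcp_ip4_spec.pdst = 0xffff;  /* match dst port exactly */

        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&nfc;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        ret = ioctl(fd, SIOCETHTOOL, &ifr);
        close(fd);
        return ret;
    }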

In the first part of this paper, we discuss the overall approach to zero copy in the network RX path and the requirements a user might have: for example, alignment constraints, dynamic resizing of zero copy destinations since network traffic can be bursty, and a way to fall back to copying. This sets the stage for what changes need to be made in the kernel and how the user-facing API should ideally look.

In the second part of the paper, we survey existing implementations of zero copy network RX, both those already merged into mainline and those still in progress. We cover the parts they share in the networking layer: resource registration, page pool memory provider backends, page pool buffer/iov representations, and proper lifetime management.
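As background for the shared pieces above: a page pool memory provider lets a backend other than the page allocator supply buffers to a driver's page pool, which is what allows user memory to sit behind an RX queue. Roughly, a provider implements a small kernel-side ops table along the lines below; the names approximate the in-flight patch series and are not final:

    /* Rough shape of a page pool memory provider backend, as discussed
     * in the in-flight patch series; names are approximate, not final. */
    struct memory_provider_ops {
        int  (*init)(struct page_pool *pool);     /* bind provider to pool */
        void (*destroy)(struct page_pool *pool);  /* unbind, release state */
        /* hand the pool a provider-backed buffer instead of a plain page */
        netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
        /* the pool is done with a buffer; the provider reclaims it */
        bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
    };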

In the third part of the paper, we discuss our implementation in detail. We look at the various options for a user-facing API in the kernel and the pros and cons that led us to choose io_uring. After that, we share preliminary performance results comparing our implementation against non-ZC and other ZC implementations.
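To ground the discussion, here is a minimal sketch of how a receive loop over such an API might look from userspace. The liburing calls (io_uring_get_sqe, io_uring_submit, io_uring_wait_cqe) are real; the zero copy specifics, the op_recv_zc opcode and the zcrx_cqe_to_buf()/zcrx_refill() helpers, are hypothetical placeholders for whatever the final uapi provides to locate a completion's payload inside the registered area and return buffers for reuse:

    /* Sketch of a multishot zero copy receive loop over io_uring. The
     * liburing calls are real; op_recv_zc, zcrx_cqe_to_buf and
     * zcrx_refill are hypothetical stand-ins for the uapi under review. */
    #include <liburing.h>

    struct zc_buf { void *data; unsigned len; };
    extern struct zc_buf zcrx_cqe_to_buf(struct io_uring_cqe *cqe); /* hypothetical */
    extern void zcrx_refill(struct zc_buf buf);                     /* hypothetical */
    extern void consume(void *data, unsigned len);          /* application-defined */

    void recv_loop(struct io_uring *ring, int sockfd, int op_recv_zc)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* One multishot SQE arms the flow: each chunk of payload the NIC
         * writes into the registered area completes as its own CQE. */
        io_uring_prep_rw(op_recv_zc, sqe, sockfd, NULL, 0, 0);
        sqe->ioprio |= IORING_RECV_MULTISHOT;
        io_uring_submit(ring);

        for (;;) {
            struct io_uring_cqe *cqe;

            if (io_uring_wait_cqe(ring, &cqe) || cqe->res <= 0)
                break;

            struct zc_buf buf = zcrx_cqe_to_buf(cqe);
            consume(buf.data, buf.len);  /* read payload in place, no copy */
            zcrx_refill(buf);            /* return the buffer for reuse */
            io_uring_cqe_seen(ring, cqe);
        }
    }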

In the last part of the paper, we discuss the limitations of zero copy and the challenges of deploying it effectively in userspace. Unlike TX, where the data size is known ahead of time, RX is unpredictable and potentially bursty. To get the most out of zero copy receive, the sending end has to coordinate closely with the receiving end on the shape of the data in order to avoid further userspace copies. For example, if the data's final destination is a block device, O_DIRECT imposes alignment requirements that the received data must satisfy (see the sketch below). Finally, we discuss our plans for further work, e.g. extending support to GPU device memory.
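To illustrate the O_DIRECT point: direct I/O requires the buffer address, file offset, and length to be aligned, typically to the device's logical block size, so received data that lands at an arbitrary offset cannot simply be handed to write(2). A minimal sketch, assuming a 4096-byte block size and a length that is already a multiple of it:

    /* Minimal O_DIRECT write: buffer, offset and length must all be
     * aligned (here assumed 4096 bytes). If the received data is not
     * already laid out this way, the memcpy below is exactly the copy
     * that zero copy receive tries to avoid. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int write_direct(const char *path, const void *src, size_t len)
    {
        const size_t align = 4096;  /* assumed logical block size; len
                                       must be a multiple of it */
        void *buf = NULL;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0 || posix_memalign(&buf, align, len))
            return -1;

        memcpy(buf, src, len);
        ssize_t ret = write(fd, buf, len);

        free(buf);
        close(fd);
        return ret < 0 ? -1 : 0;
    }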