Session

Towards µs Tail Latency and Terabit Ethernet: Disaggregating the Host Network Stack

Speakers

Qizhe Cai
Midhul Vuppalapati
Jaehyun Hwang
Christos Kozyrakis
Rachit Agarwal

Label

Moonshot

Session Type

Talk

Contents

Description

In discussions about future host network stacks, there is widespread agreement that, despite its great success, today’s Linux network stack is seriously deficient along one or more dimensions. Frequently cited flaws include its inefficient packet processing pipeline, its inability to isolate latency-sensitive and throughput-bound applications, its rigid and complex implementation, and its inefficient transport protocols. These critiques have led to many interesting (and exciting!) debates on various design aspects of the Linux network stack: interface (e.g., streaming versus RPC), semantics (e.g., synchronous versus asynchronous I/O), and placement (e.g., in-kernel versus userspace versus hardware).

This talk will demonstrate that many deficiencies of the Linux network stack are rooted not in its interface, semantics, and/or placement, but rather in its core architecture. (One exception is per-core performance, which does depend on interface, semantics, and placement; this talk is not about per-core performance of network stacks, and our architecture is agnostic to the interface, semantics, and placement of network stacks.) In particular, since its very first incarnation, the Linux network stack has offered applications the same “pipe” abstraction, designed around essentially the same rigid architecture:

  • Dedicated pipes: each application and/or thread submits data to one end of a dedicated pipe (sender-side socket) and the network stack attempts to deliver the data to the other end of that dedicated pipe (receiver-side socket);

  • Tightly-integrated packet processing pipeline: each pipe is assigned its own socket, has its own independent transport layer operations (congestion control, flow control, etc.), and is operated upon by the network subsystem completely independently of other coexisting pipes;

  • Static pipes: the entire packet processing pipeline (buffers, protocol processing, host resource provisioning, etc.) is determined at the time of pipe creation and remains unchanged during the pipe’s lifetime, again independently of other pipes and of dynamic resource availability at the host.
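To make the dedicated-pipe model above concrete, the minimal sketch below uses ordinary POSIX sockets (the helper name and addresses are hypothetical). Every connection an application or thread opens becomes its own pipe: the kernel instantiates an independent socket with its own buffers and transport state, fixed at creation time and not rebalanced against other coexisting pipes.

/* Minimal sketch (userspace, hypothetical helper name): each thread opens
 * its own dedicated "pipe". The kernel instantiates an independent socket,
 * with its own buffers and transport state, for every connect(); these
 * resources are not shared or rebalanced across coexisting pipes. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int open_dedicated_pipe(const char *ip, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* one socket ...          */
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    /* ... bound to exactly one sender/receiver pair, with its own
     * congestion control, flow control, and buffer provisioning fixed
     * for the lifetime of the connection. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}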

Such dedicated, tightly-integrated, and static pipelines were well suited to the Internet and to early-generation datacenter networks: since performance bottlenecks lay primarily in the network core, careful allocation of host resources (compute, caches, NIC queues, etc.) among coexisting pipes was unnecessary. However, the rapid increase in link bandwidths, coupled with relatively stagnant technology trends for other host resources, has now pushed bottlenecks to the hosts. In this new regime, our measurements show that dedicated, tightly-integrated, and static pipelines prevent today’s network stacks from fully exploiting modern hardware that supports µs-scale latency and hundreds of gigabits of bandwidth. Experimenting with new ideas has also become more challenging: performance patches have made the tightly-integrated pipelines so firmly entrenched within the stack that it is frustratingly hard, if not impossible, to incorporate new protocols and mechanisms. Unsurprisingly, existing network stacks are already on the brink of a breakdown, and the emergence of Terabit Ethernet will inevitably require rearchitecting the network stack. Laying the intellectual foundation for such a rearchitecture is the goal of this work.

NetChannel disaggregates the tightly-integrated packet processing pipeline of today’s network stack into three loosely coupled layers. In hindsight, NetChannel is remarkably similar to the Linux storage stack architecture; this similarity is not coincidental: for storage workloads, bottlenecks have always been at the host, and the “right” architecture has evolved over the years both to perform fine-grained resource allocation across applications and to make it easy to incorporate new storage technologies.

Applications interact with a Virtual Network System (VNS) layer that offers standardized interfaces, e.g., system calls for streaming and RPC traffic. Internally, VNS performs data transfer between application buffers and kernel buffers while ensuring the correctness of the interface semantics (e.g., in-order delivery for the streaming interface). The core of NetChannel is a NetDriver layer that abstracts away the network and remote servers as a multi-queue device using a Channel abstraction. In particular, the NetDriver layer decouples packet processing from individual application buffers and cores: data read/written by an application on one core can be mapped to one or more Channels without breaking application semantics. Each Channel implements protocol-specific functionality (e.g., congestion and flow control) independently and can be dynamically mapped to one of the underlying hardware queues, and the number of Channels between any pair of servers can be scaled independently of the number of applications running on those servers and the number of cores used by individual applications. Between the VNS and NetDriver layers sits a NetScheduler layer that performs fine-grained multiplexing and demultiplexing (as well as scheduling) of data from individual cores/applications to individual Channels, using information about core utilization, application buffer occupancy, and Channel buffer occupancy.
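The following data-structure sketch illustrates how these layers might fit together; the struct and field names are hypothetical and not the actual NetChannel interfaces, but they capture the decoupling described above: per-Channel protocol state at the NetDriver layer, and per-application endpoints at the VNS layer whose data may be spread across several Channels.

/* Hypothetical sketch of the Channel abstraction; names and fields are
 * illustrative only, not the actual NetChannel interfaces. */
#include <stddef.h>
#include <sys/types.h>

/* NetDriver layer: a Channel presents the network + remote host as one queue
 * of a multi-queue device. Each Channel carries its own protocol state
 * (congestion/flow control) and can be remapped to a hardware queue at
 * runtime, independently of applications and cores. */
struct channel {
    int      hw_queue;          /* current hardware-queue mapping          */
    size_t   buf_occupancy;     /* bytes currently queued in this Channel  */
    void    *transport_state;   /* protocol-specific state (TCP, DCTCP...) */
    ssize_t (*xmit)(struct channel *c, const void *buf, size_t len);
};

/* VNS layer: a per-application endpoint. Data written on this endpoint can
 * be spread across several Channels without breaking interface semantics
 * (e.g., reassembly restores ordering for the streaming interface). */
struct vns_endpoint {
    int              is_stream;     /* streaming (in-order) vs. RPC        */
    struct channel **channels;      /* Channels this endpoint may use      */
    int              nchannels;
    size_t           app_buf_occupancy; /* bytes waiting in app buffers    */
};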

The primary benefit of NetChannel is to enable new operating points for existing network stacks without any modification to existing protocol implementations (TCP, DCTCP, BBR, etc.). These new operating points are a direct result of NetChannel’s disaggregated architecture: it allows not only independent scaling of each layer (that is, of the resources allocated to each layer), but also flexible multiplexing and demultiplexing of data across multiple Channels. We provide three examples. First, for short messages whose throughput is bottlenecked by network-layer processing overheads, NetChannel allows unmodified (even single-threaded) applications to scale throughput near-linearly with the number of cores by dynamically scaling the cores dedicated to network-layer processing. Second, in the extreme case of a single application thread, NetChannel can saturate multi-hundred-gigabit links by transparently scaling the number of cores used for packet processing on demand; in contrast, the Linux network stack forces application designers to write multi-threaded code to achieve throughput higher than tens of gigabits per second. As a third new operating point, we show that the fine-grained multiplexing and demultiplexing of packets between individual cores/applications and individual Channels, combined with a simple NetScheduler, isolates latency-sensitive applications from throughput-bound ones: NetChannel enables latency-sensitive applications to achieve µs-scale tail latency (as much as 17.5x better than the Linux network stack), while allowing bandwidth-intensive applications to use the remaining bandwidth near-perfectly.
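As a rough illustration of the kind of simple policy that could provide the third operating point, the sketch below shows a strict-priority pick between latency-sensitive and throughput-bound Channels. It is purely hypothetical (field and function names are invented here) and is not the NetScheduler used in the paper; it only conveys why scheduling at the Channel granularity makes such isolation straightforward.

/* Hypothetical strict-priority policy: serve Channels carrying
 * latency-sensitive traffic first; throughput-bound Channels consume
 * whatever capacity remains. Names are illustrative only. */
#include <stdbool.h>
#include <stddef.h>

struct chan_info {
    bool   latency_sensitive;   /* tagged by the application/VNS layer */
    size_t pending_bytes;       /* data queued on this Channel          */
};

/* Return the index of the next Channel to serve, or -1 if all are idle. */
static int pick_next_channel(const struct chan_info *c, int n)
{
    int fallback = -1;
    for (int i = 0; i < n; i++) {
        if (c[i].pending_bytes == 0)
            continue;
        if (c[i].latency_sensitive)
            return i;            /* latency-critical data goes first    */
        if (fallback < 0)
            fallback = i;        /* remember a throughput-bound Channel */
    }
    return fallback;
}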

NetChannel also has several secondary benefits related to the extensibility of network stacks. For instance, NetChannel alleviates the painful process of application developers manually tuning their code for networking performance (e.g., number of threads, connections, sockets, etc.) in the increasingly common case of multi-tenant deployments. NetChannel also simplifies experimentation with new designs (protocols and/or schedulers) without breaking legacy hosts: implementing a new transport protocol (e.g., dcPIM or pHost) in NetChannel is equivalent to writing a new “device driver” that realizes only transport-layer functionality, without worrying about functionality in other layers of the stack such as data copy, isolation between latency-sensitive and throughput-bound applications, CPU scheduling, and load balancing. Thus, much as storage stacks have simplified the evolution of new hardware and protocols via device drivers, and have simplified writing applications with different performance objectives via multiple coexisting block-layer schedulers, NetChannel will hopefully lead to a broader and ever-evolving ecosystem of network stack designs. To that end, we have open-sourced NetChannel for our community; the implementation is available at https://github.com/Terabit-Ethernet/NetChannel.
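To suggest what the “transport as a device driver” analogy could look like in code, the sketch below shows a hypothetical ops table that a new protocol would fill in and register. The names are assumptions made for illustration and do not reflect the actual NetChannel driver interface; the point is that a protocol author implements only transport-layer hooks, while data copy, isolation, CPU scheduling, and load balancing remain in the shared VNS and NetScheduler layers.

/* Hypothetical "transport as a device driver" sketch: a new protocol
 * (e.g., dcPIM) fills in only transport-layer hooks. Ops and function
 * names are illustrative, not the actual NetChannel driver interface. */
#include <stddef.h>

struct channel;   /* opaque Channel handle owned by the NetDriver layer */

struct transport_ops {
    const char *name;                                          /* e.g., "dcpim"      */
    int  (*chan_init)(struct channel *c);                      /* per-Channel state  */
    int  (*xmit)(struct channel *c, const void *buf, size_t n);/* send protocol data */
    void (*on_ack)(struct channel *c, size_t acked);           /* cc / flow control  */
    void (*chan_destroy)(struct channel *c);
};

/* A new protocol registers its ops once; everything above the NetDriver
 * layer (VNS, NetScheduler) is reused unchanged, much like a block-device
 * driver reusing the storage stack's upper layers. */
int netdriver_register_transport(const struct transport_ops *ops);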