Session

Congestion control architecture for host congestion

Speakers

Saksham Agarwal
Arvind Krishnamurthy
Rachit Agarwal

Label

Moonshot

Session Type

Talk

Description

Several recent studies from large-scale production clusters (e.g., from Google, Microsoft, and Alibaba) demonstrate the existence of congestion within the network that delivers data from the NIC to the CPU/memory (referred to as the host network). These studies show that the adoption of high-bandwidth access links, coupled with relatively stagnant technology trends for resources within the host (CPU speeds, cache sizes, memory access latency, memory bandwidth per core, NIC buffer sizes, etc.), has led to the emergence of congestion within the host network, that is, congestion within the processor, memory, and peripheral interconnects of the host. For instance, Google reports that host congestion in its production clusters leads to significant queueing and packet drops at hosts, resulting in application-level performance degradation in terms of latency and throughput. We reproduce the host congestion phenomenon using Linux DCTCP; we observe that host congestion can lead to packet drop rates as high as 1% at the host, 35-55% throughput degradation, and 120-5000x tail latency inflation.

The regime of host congestion forces us to revisit many fundamental assumptions entrenched in the practice of congestion control. For instance, classical congestion control mechanisms assume that packet drops happen at the congestion point; in contrast, host congestion results in queueing and drops away from the actual congestion point (since the host network is lossless). Thus, we must rethink congestion signals to capture the precise time, location, and reason for host congestion. As another example, an unspoken assumption in classical congestion control mechanisms is that all competing traffic adheres to the congestion control protocol; such is not the case in the host congestion regime, where traffic from "outside the network" (e.g., applications generating CPU-to-memory traffic) does not employ congestion control mechanisms, is much closer to the congestion point, and can thus change dramatically at sub-RTT granularity. This has powerful implications for congestion response: existing congestion control protocols that operate at RTT granularity may achieve performance far from optimal in the host congestion regime. Thus, host congestion requires us to revisit congestion control architectures and protocols.
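To see why RTT-granularity response falls short, recall DCTCP's control law (standard background from the DCTCP design, not specific to hostCC): the sender maintains an estimate \alpha of the fraction of ECN-marked packets and cuts its congestion window at most once per RTT,

\alpha \leftarrow (1 - g)\,\alpha + g\,F, \qquad \mathrm{cwnd} \leftarrow \mathrm{cwnd}\left(1 - \frac{\alpha}{2}\right),

where F is the fraction of packets marked in the most recent window of data and g is the EWMA gain. Since both updates occur once per window of data, host-local traffic that ramps up within an RTT can build queues and cause drops at the host before the sender reacts at all.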

In this talk, we present hostCC, a congestion control architecture that handles both host congestion and network fabric congestion by allocating both host and network resources among competing traffic. hostCC embodies three key technical ideas. First, in addition to congestion signals from within the network fabric, hostCC generates host-local congestion signals at the processor, memory, and peripheral interconnects at sub-microsecond timescales; these signals enable hostCC to precisely capture the time, location, and reason for host congestion. Second, hostCC performs a host-local congestion response at sub-RTT granularity: at both the sender and the receiver, hostCC uses the host-local congestion signals to allocate host resources between network traffic and host-local traffic. At the sender, this response ensures that network traffic is not starved, even at sub-RTT granularity; at the receiver, it minimizes queueing and packet drops at the host by modulating the host resources allocated to network traffic so that NIC queues are drained at the same rate at which network traffic arrives at the NIC. Third, hostCC uses both host and network congestion signals to perform efficient network resource allocation at RTT timescales.
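To make the first two ideas concrete, here is a minimal user-space sketch written for illustration; it is not the hostCC implementation (which lives inside the Linux kernel), and the counter address, threshold, resctrl group name, and bandwidth percentages are all placeholders. It busy-polls a host-local congestion signal and, when the signal crosses a threshold, shifts memory bandwidth away from host-local traffic using Linux's resctrl memory-bandwidth-allocation interface.

/* Illustrative sketch only, not the hostCC code: poll a host-local
 * congestion signal and throttle host-local (CPU-to-memory) traffic
 * when the signal exceeds a threshold. Placeholders: the counter
 * address, the threshold, and the resctrl group path. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIGNAL_MSR    0x0   /* placeholder: platform-specific uncore counter */
#define SIGNAL_THRESH 1000  /* placeholder: congestion threshold */

/* Read one MSR on CPU 0 via the Linux msr driver (/dev/cpu/N/msr). */
static uint64_t read_msr(int fd, uint32_t reg)
{
    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), reg) != (ssize_t)sizeof(val)) {
        perror("pread msr");
        exit(1);
    }
    return val;
}

/* Throttle host-local traffic by lowering its memory-bandwidth share via
 * the resctrl interface; "hostcc-apps" is a hypothetical group to which
 * host-local processes have been assigned. "MB:0=<pct>" sets the
 * memory-bandwidth-allocation percentage on socket 0. */
static void set_mba_percent(int pct)
{
    FILE *f = fopen("/sys/fs/resctrl/hostcc-apps/schemata", "w");
    if (!f)
        return;
    fprintf(f, "MB:0=%d\n", pct);
    fclose(f);
}

int main(void)
{
    int msr_fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (msr_fd < 0) {
        perror("open /dev/cpu/0/msr");
        return 1;
    }
    for (;;) {
        /* Idea 1: host-local congestion signal, sampled far below RTT
         * timescales (a busy-polling loop iterates in well under 1 us). */
        uint64_t signal = read_msr(msr_fd, SIGNAL_MSR);

        /* Idea 2: sub-RTT host-local response -- shift memory bandwidth
         * away from host-local traffic while the host is congested, so
         * NIC queues keep draining and network traffic is not starved. */
        set_mba_percent(signal > SIGNAL_THRESH ? 10 : 100);
    }
    return 0;
}

A production-quality version would avoid per-iteration file I/O and choose the counter and thresholds for the specific platform; the sketch only shows the structure of a sub-RTT signal-and-response loop.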

hostCC admits an efficient realization within existing host network stacks, without any modifications to applications, host hardware, or network hardware; moreover, hostCC can be integrated with existing congestion control protocols to efficiently handle both host and network fabric congestion. To demonstrate this, we implement hostCC end-to-end in the Linux kernel using ~800 lines of code, and evaluate it alongside unmodified Linux DCTCP. Our evaluation demonstrates that, in the presence of host congestion, hostCC avoids network underutilization (achieving utilization close to a user-specified guarantee that hostCC enables) while simultaneously reducing packet drops by multiple orders of magnitude. hostCC also significantly reduces tail latencies, with 75-3000x improvements under the evaluated settings. Further, we demonstrate that hostCC interoperates well with network congestion control, maintaining the above-mentioned benefits in the presence of both network congestion and host congestion. Finally, we show that hostCC's benefits are robust across a wide variety of workload and hardware settings (e.g., number of active connections, MTU sizes, and direct cache access enabled or disabled). The end-to-end implementation of hostCC, along with all the documentation needed to reproduce our results, is available at https://github.com/Terabit-Ethernet/hostCC.
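One plausible integration point, sketched here under the assumption that host congestion is echoed to senders via standard ECN marks (our illustration, not necessarily how the hostCC code does it): if the host marks incoming packets whenever the host-local signal indicates congestion, an unmodified ECN-based protocol such as DCTCP reduces its window in proportion to the fraction of marked packets, with no protocol changes needed.

/* Illustrative sketch: fold a host-local congestion signal into standard
 * ECN marking so that an unmodified ECN-based sender (e.g., DCTCP) also
 * backs off under host congestion. `host_signal` and `thresh` stand in
 * for the sub-microsecond signal described above; assumes the packet is
 * ECN-capable (ECT). */
#include <stdint.h>

#define ECN_CE 0x03  /* RFC 3168: CE codepoint in the two low bits of TOS */

static inline void mark_on_host_congestion(uint8_t *ip_tos,
                                           uint64_t host_signal,
                                           uint64_t thresh)
{
    if (host_signal > thresh)
        *ip_tos |= ECN_CE;  /* the sender's DCTCP loop counts this packet as marked */
}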