Session

Achieving linear CPU scaling in WireGuard with an efficient multi-tunnel architecture

Speakers

Mirco Barone
Federico Parola
Fulvio Risso
Davide Miola

Label

Moonshot

Session Type

Talk

Contents

Description

WireGuard is one of the most widely used tunneling technologies on Linux, thanks to its simplicity and excellent integration into the Linux kernel. Despite its widespread adoption, it proves unable to provide high-speed connectivity between two sites in a standard single-tunnel configuration. This is a significant limitation when a secure, high-speed interconnection is required, e.g., between two remote clusters in the cloud. In fact, WireGuard's ability to scale performance with the number of available CPU cores is somewhat limited, even in the presence of a software architecture that is intrinsically parallel.

In this talk we investigate the multi-core scalability properties of WireGuard, identifying its current limitations and proposing an improved architecture that scales effectively, reaching a nearly linear throughput increase with the number of CPU cores involved. We first analyze the architecture of a single-tunnel setup, underlining how, despite its ability to parallelize the encryption and decryption stages, the presence of serial per-tunnel stages still limits the use of additional resources.

Hence, we attempt to spread flows over multiple tunnels, in order to scale the per-tunnel stages over multiple cores as well. Our analysis reveals that simply leveraging multiple tunnels can fail to scale at all, due to a subtle "black hole" condition affecting NAPI poll functions under the standard softirq-based NAPI. We overcome this limitation by enabling threaded NAPI on the WireGuard interfaces; however, although this lets us leverage all the resources of our nodes, the approach still falls far short of linear performance improvement as the number of allocated cores increases.

To push things further, we propose a modified architecture which, for each flow, handles all WireGuard stages inline, in a single processing context on a single core, eliminating the costs of task and cache synchronization. This improved architecture, tailored for multi-tunnel support, shows an almost 2x performance improvement over a multi-tunnel deployment based on the vanilla WireGuard implementation, and sustains 18x the throughput of a single-tunnel setup on our machines.
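As a concrete illustration of the threaded-NAPI step, the snippet below enables threaded NAPI on a WireGuard interface through the generic per-device sysfs attribute. This is a minimal sketch, assuming a kernel that exposes /sys/class/net/<iface>/threaded (available since Linux 5.12) and an interface named wg0 (a hypothetical name); the speakers' exact procedure may differ.

```c
/* Minimal sketch: switch a network device's NAPI processing from
 * softirqs to dedicated kernel threads via the generic sysfs knob.
 * Assumes /sys/class/net/<iface>/threaded exists (Linux >= 5.12)
 * and that the interface name ("wg0" here) is already configured. */
#include <stdio.h>
#include <stdlib.h>

static int enable_threaded_napi(const char *ifname)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", ifname);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return -1;
    }
    /* Writing "1" asks the kernel to run this device's NAPI poll
     * functions in per-NAPI kernel threads instead of softirq context,
     * sidestepping the softirq "black hole" described above. */
    if (fputs("1", f) == EOF) {
        perror("fputs");
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

int main(void)
{
    return enable_threaded_napi("wg0") ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

Once NAPI runs in threads, each thread can also be pinned to a core with standard CPU-affinity tools, which is what makes the per-tunnel, per-core association discussed below possible.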

Our approach is not a one-size-fits-all solution: its main limitation is currently the inability to parallelize the en/decryption stages of a single flow, which may penalize elephant flows. However, it provides an interesting starting point for further discussion and represents a first step towards a more scalable WireGuard architecture. For example, a mechanism could be designed that uses traffic patterns to dynamically determine the number of distinct tunnels needed; this would use the fewest tunnels possible and adjust their allocation to associate each tunnel with a specific core, thereby also minimizing the number of cores used (a sketch of such a flow-to-tunnel mapping is given below). Additionally, a hybrid solution could be considered, switching between the existing version of WireGuard and our modified module to leverage the strengths of both implementations.
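The abstract does not specify how flows would be assigned to tunnels. As a purely illustrative sketch (the hash function, tunnel count, and field layout are our assumptions, not the speakers' implementation), any stable hash of the flow 5-tuple keeps a flow on one tunnel, preserving in-order delivery, while spreading distinct flows, and thus their serial per-tunnel stages, across tunnels and cores:

```c
/* Illustrative sketch: map each flow to one of NUM_TUNNELS WireGuard
 * tunnels by hashing its 5-tuple. All names and constants are
 * hypothetical; the talk's actual mechanism may differ. */
#include <stdint.h>
#include <stdio.h>

#define NUM_TUNNELS 8 /* e.g., one tunnel (and one core) per bucket */

struct flow {
    uint32_t saddr, daddr;   /* IPv4 addresses, host byte order */
    uint16_t sport, dport;   /* transport ports */
    uint8_t  proto;          /* IPPROTO_TCP, IPPROTO_UDP, ... */
};

/* FNV-1a over the 5-tuple fields: stable, so a flow never migrates
 * between tunnels (no packet reordering within a flow). */
static uint32_t flow_hash(const struct flow *f)
{
    const uint32_t parts[3] = {
        f->saddr, f->daddr,
        ((uint32_t)f->sport << 16) | f->dport
    };
    const uint8_t *p = (const uint8_t *)parts;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(parts); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    h ^= f->proto;
    h *= 16777619u;
    return h;
}

static unsigned int tunnel_for_flow(const struct flow *f)
{
    return flow_hash(f) % NUM_TUNNELS; /* index into wg0..wg7 */
}

int main(void)
{
    struct flow f = { .saddr = 0x0a000001, .daddr = 0x0a000002,
                      .sport = 40000, .dport = 443, .proto = 6 };
    printf("flow -> tunnel wg%u\n", tunnel_for_flow(&f));
    return 0;
}
```

A dynamic variant, as suggested above, could adjust the tunnel count at runtime from observed traffic patterns, remapping buckets so that each active tunnel stays bound to a single core while idle tunnels (and their cores) are released.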