Session
Performance Comparison of Transport Mechanisms for LLM Inference KVCache Transfers
Speakers
Jamal Hadi Salim
Pedro Tammela
Victor Nogueira
Label
Nuts and Bolts
Session Type
Talk
Description
The AI/ML infrastructure landscape is evolving at an unprecedented pace, driven by continuous innovation across AI accelerators, networking technologies, and software stacks. In networking specifically, the industry is rapidly advancing fabric architectures, switch capabilities, NIC technologies, and transport protocols.
Leading LLMs are roughly doubling in size every 6 to 10 months. The current state of the art is in the range of a trillion parameters. At 1 byte per parameter, that requires 1TB of GPU RAM for the weights alone. There is no way to fit that on 1 or 8 GPUs. You need networking to interconnect many GPUs!
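The arithmetic in this paragraph can be sketched in a few lines of Python. This is only an illustration of the figures quoted above: the function names are made up for this example, the steady exponential growth is an assumption, and the footprint covers weights only (no KV cache or activations).

```python
def weight_footprint_bytes(num_params: float, bytes_per_param: float = 1.0) -> float:
    """Memory needed to hold the model weights alone (no KV cache, no activations)."""
    return num_params * bytes_per_param


def projected_params(current_params: float, months_ahead: float, doubling_months: float) -> float:
    """Parameter count after `months_ahead`, assuming steady exponential doubling."""
    return current_params * 2.0 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    TB = 1e12  # decimal terabyte
    params_now = 1e12  # one trillion parameters, as quoted in the text
    print(f"weights today: {weight_footprint_bytes(params_now) / TB:.1f} TB")
    # Assumption: the 6 and 10 month doubling periods from the text hold for 24 months.
    for doubling in (6, 10):
        future = projected_params(params_now, months_ahead=24, doubling_months=doubling)
        print(f"doubling every {doubling} months: "
              f"{weight_footprint_bytes(future) / TB:.1f} TB of weights in 24 months")
```

With a 6 month doubling period, two years corresponds to four doublings, that is, a 16x larger model.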
To evaluate networking technologies one would need to keep up with:
- GPUs are updated roughly every 6 months to perform better, but available memory capacity and bandwidth are not growing as fast (state of the art is about 200GB per GPU at the higher end). Accommodating newer, larger models therefore means acquiring more GPUs and interconnecting them with a network.
- Changing deployment techniques and model optimizations may mean total replacement of hardware.
Networking plays a pivotal role in interconnecting the GPUs, and many proposals exist for how to transport the inter-cluster traffic, which is what interests us.
There is a challenge: if you want to keep up and evaluate how evolving networking technologies handle these fast-moving variables, or if you are trying to innovate a new network transport, you would need to constantly invest in GPUs and associated hardware, which is not cheap.
Our approach overcomes the challenge by:
1) Coming up with a technique that emulates GPU and model processing, using mathematical equations to estimate the processing capacity of GPUs and specific model behavior, in order to generate real-time traffic over standard network infrastructure. Real NICs and switch fabrics can be used without needing any GPUs (which are many times more expensive than network gear).
2) Allowing different transports to be plugged into the emulated GPUs. We will demonstrate that the approaches used in different LLMs exhibit different traffic patterns, so classical tools like iperf are insufficient to express them. This approach also allows creating and testing new transports in a more realistic environment.
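The abstract does not give the equations used, so the following is only a hypothetical sketch of what such an emulation could look like, not the authors' model. It assumes a roofline-style estimate (a step is bound either by compute, at roughly 2 FLOPs per parameter per token, or by reading the weights from memory), and every class, field, and function name here is invented for illustration. The `send` callable stands in for a pluggable transport.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class EmulatedGpu:
    # Hypothetical hardware figures; replace with those of the GPU being emulated.
    peak_flops: float      # sustained FLOP/s
    mem_bandwidth: float   # device memory bandwidth in bytes/s


@dataclass
class EmulatedModel:
    # Hypothetical model figures.
    num_params: float           # parameter count
    bytes_per_param: float      # weight precision in bytes
    bytes_out_per_token: float  # bytes sent to a peer per processed token


def step_time(gpu: EmulatedGpu, model: EmulatedModel, tokens: int) -> float:
    """Assumed roofline estimate: the slower of compute and weight reads."""
    compute_s = 2.0 * model.num_params * tokens / gpu.peak_flops
    memory_s = model.num_params * model.bytes_per_param / gpu.mem_bandwidth
    return max(compute_s, memory_s)


def traffic_schedule(gpu: EmulatedGpu, model: EmulatedModel,
                     batches: Iterable[int]) -> Iterator[Tuple[float, int]]:
    """Yield (send_time_seconds, payload_bytes) once each batch finishes."""
    clock = 0.0
    for tokens in batches:
        clock += step_time(gpu, model, tokens)
        yield clock, int(tokens * model.bytes_out_per_token)


def replay(schedule: Iterable[Tuple[float, int]],
           send: Callable[[int], None],
           sleep: Callable[[float], None] = time.sleep) -> None:
    """Wait out the emulated compute time, then hand the payload size to a transport."""
    previous = 0.0
    for when, nbytes in schedule:
        sleep(when - previous)
        previous = when
        send(nbytes)


if __name__ == "__main__":
    gpu = EmulatedGpu(peak_flops=1e15, mem_bandwidth=3e12)          # made-up values
    model = EmulatedModel(num_params=70e9, bytes_per_param=1.0,
                          bytes_out_per_token=320e3)                # made-up values
    # One large batch followed by two single-token steps.
    replay(traffic_schedule(gpu, model, [2048, 1, 1]),
           send=lambda n: print(f"send {n} bytes"))
```

Swapping the `send` callable is where a TCP socket writer, an RDMA verbs wrapper, or a new experimental transport would be plugged in, which is what separates this kind of generator from a constant-rate tool such as iperf.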
In this talk we use our approach to generate traffic and to evaluate and compare the performance characteristics (latency and throughput) of several existing transport mechanisms by emulating inference KV cache transfers across a set of existing LLMs and GPU configurations:
- RoCEv2
- Zero-Copy TCP Devmem
- io_uring ZC TCP
- Standard TCP
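To give a sense of the payload sizes involved in a KV cache transfer, here is a minimal sketch using the commonly used sizing formula for a transformer KV cache (one key and one value vector per layer, per KV head, per token). The model dimensions and link speed below are illustrative assumptions, not figures from the talk, and the transfer time is an idealized lower bound that ignores protocol overhead, congestion, and copies.

```python
def kv_cache_bytes(tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element * tokens


def ideal_transfer_seconds(nbytes: float, link_bits_per_second: float,
                           base_latency_s: float = 0.0) -> float:
    """Lower bound on transfer time: fixed latency plus serialization time."""
    return base_latency_s + nbytes * 8 / link_bits_per_second


if __name__ == "__main__":
    # Hypothetical model shape: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
    size = kv_cache_bytes(tokens=4096, num_layers=80, num_kv_heads=8, head_dim=128)
    print(f"KV cache for a 4096-token prompt: {size / 1e9:.2f} GB")
    # Hypothetical 400 Gb/s link with 5 microseconds of base latency.
    t = ideal_transfer_seconds(size, link_bits_per_second=400e9, base_latency_s=5e-6)
    print(f"ideal transfer time: {t * 1e3:.2f} ms")
```

Measured latency and throughput for each transport can then be compared against this idealized bound.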
Important Dates
| Closing of CFS | June 1st |
| Notification by | June 10th |
| Conference dates | July 13th-16th |