Day 1 / Track 1 / Talk 1
Workshop: TCP Analytics
Chair: Sowmini Varadhan
Report by: Kiran Patil

The goal of this session was to highlight the problems with TCP analytics, the ways TCP/IP-based networks are deployed, and “how to monitor TCP flows efficiently”.

The session covered experiences from companies that deal with TCP analytics on a day-to-day basis across a variety of deployments. Various techniques were described, shared, and proposed for monitoring flows and providing a degree of QoS by monitoring, metering, and improving the performance of TCP flows, ultimately delivering better service to the end user.

The following techniques were discussed and proposed:

- TCP_INFO / TCP_SO_TIMESTAMPING
- Extended TCP stack instrumentation (RFC 4898)
- Monitoring TCP
- Usage of TCP-eBPF (at a premature stage)
- Large scale TCP analytics collection
- TCP analytics at Microsoft
- TCP analytics for satellite broadband

The meeting opened with an introduction to and walkthrough of TCP_INFO and TCP_SO_TIMESTAMPING. This part covered the details of TCP_INFO and its fields (tcp_state, options, data_seg*, delivered, retransmission counters) and how this information is useful for determining the state of TCP flows and provides insight into them (an aid to debugging).
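
As a concrete illustration (a minimal sketch, not code shown in the talk), the snippet below samples TCP_INFO on an already-connected socket. The field names follow struct tcp_info from glibc's <netinet/tcp.h>; the newer fields mentioned above (delivered, data_seg*) require the kernel's <linux/tcp.h> on a recent kernel.

  /* Sample TCP_INFO on a connected TCP socket and print a few fields. */
  #include <stdio.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  static void dump_tcp_info(int fd)
  {
          struct tcp_info ti = { 0 };
          socklen_t len = sizeof(ti);

          if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
                  perror("getsockopt(TCP_INFO)");
                  return;
          }

          /* A few of the fields discussed in the session. */
          printf("state=%u options=0x%x rtt=%uus snd_cwnd=%u total_retrans=%u\n",
                 ti.tcpi_state, ti.tcpi_options, ti.tcpi_rtt,
                 ti.tcpi_snd_cwnd, ti.tcpi_total_retrans);
  }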

One of the issues described was that TCP_INFO is a large blob, obtaining it has measurable overhead due to the socket lock, and it still doesn't include everything (the congestion control algorithm in use, other SOL_TCP state, etc.).

This was followed by coverage of the usage of OPT_STATS with TCP_SO_TIMESTAMPING and its use for performance analysis. The key takeaway was that TCP_INFO and TCP_SO_TIMESTAMPING are both powerful instrumentation, but they must be used wisely.

A caveat to remember: if enabled on every TX packet, there is a 20% throughput regression.

One question that arose was what happens when TSO offload is used for TCP. The answer was that, unlike UDP, if a send results in multiple packets the timestamp is reported for the last byte.
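
Enabling these TX timestamps with OPT_STATS on a single, selected socket (rather than every flow, per the caveat above) might look like the following sketch; this is illustrative only, not code from the talk. Completion records are later read from the socket error queue (recvmsg with MSG_ERRQUEUE); parsing of the resulting SCM_TIMESTAMPING_OPT_STATS cmsg is omitted here.

  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/net_tstamp.h>

  #ifndef SO_TIMESTAMPING
  #define SO_TIMESTAMPING 37   /* fallback for old headers; asm-generic value */
  #endif

  static int enable_tx_ack_stats(int fd)
  {
          unsigned int flags =
                  SOF_TIMESTAMPING_SOFTWARE   |  /* report software timestamps   */
                  SOF_TIMESTAMPING_TX_ACK     |  /* stamp when data is ACKed     */
                  SOF_TIMESTAMPING_OPT_ID     |  /* tag records with byte offset */
                  SOF_TIMESTAMPING_OPT_TSONLY |  /* don't loop payload back      */
                  SOF_TIMESTAMPING_OPT_STATS;    /* append TCP stats per record  */

          if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                         &flags, sizeof(flags)) < 0) {
                  perror("setsockopt(SO_TIMESTAMPING)");
                  return -1;
          }
          return 0;
  }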

Next there was a talk about extended TCP stack instrumentation. It covered RFC 4898 (a framework for extended TCP stack instrumentation), which defines approximately 120 MIB variables. In the kernel, RFC 4898 stats are exposed via TCP_INFO. The talk covered three implementations: 1. Web100, 2. Web10G, 3. tcpstat.

For Web10G, TCP instrumentation metrics are stored in a hash of in-memory structs. The feature can be enabled through a kernel parameter (net.ipv4.tcp_estats). The stats are accessed via a netlink kernel module (TCP_ESTATS and TCP_INFO), and there is a userland API based on the libmnl library for user space to query them. Web10G provides real-world, detailed per-flow metrics and is used in research such as TEACUP, as well as in various papers exploring bufferbloat, cloud performance, wireless latency, and network-modeling reproducibility.

The talk also covered caveats and recommendations: for example, Web10G is useful for data transfer nodes but should not be deployed in a high-scale production environment (e.g. 100K connections).

The talk also explained “XSight”, a use case of Web10G in a production environment. It is used to identify failing and sub-optimal flows during their lifespan, and to provide a panoptic view of network performance.

Next was a talk about monitoring TCP, covering challenges such as which stats to collect (e.g. TCP_CHRONO) and how frequently to sample TCP_INFO state. It also covered interesting TCP state events and how TCP-BPF opens up new possibilities, including using TCP-BPF to provide per-connection optimization of TCP parameters. Tunable parameters for intra-DC traffic were covered, such as the use of small buffers, a small SYN RTO, and a cwnd clamp.

TCP-BPF is a new BPF program type that provides access to TCP_SOCK_FIELDS, which means visibility into the internal state of TCP flows. It also opens up a mechanism of new callbacks for analytics and better decision making (e.g. with respect to provisioning resources dynamically); examples of new callbacks are notifications when packets are sent or received. The feature has to be used with caution: a user shouldn't enable it on all flows, but only as needed (randomly on a small percentage of flows) or while debugging an atypical flow. Additionally, the talk covered external triggers (e.g. TCP_INFO) such as “ss”. Attaching TCP-BPF per connection is not there yet, but TCP-BPF per cgroup is.
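
A rough sketch of such a per-connection tuning program is shown below. It is modeled on the kernel's samples/bpf sock_ops examples rather than on code from the talk; the port used to pick out "intra-DC" flows, the buffer size, and the clamp value are placeholders, not settings recommended by the speaker.

  /* TCP-BPF (sock_ops) sketch: smaller buffers, smaller SYN RTO, cwnd clamp. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  /* UAPI socket-level constants, redefined to keep the example self-contained. */
  #ifndef SOL_SOCKET
  #define SOL_SOCKET 1
  #endif
  #ifndef SO_SNDBUF
  #define SO_SNDBUF 7
  #endif
  #ifndef SO_RCVBUF
  #define SO_RCVBUF 8
  #endif
  #ifndef SOL_TCP
  #define SOL_TCP 6
  #endif

  SEC("sockops")
  int intra_dc_tuner(struct bpf_sock_ops *skops)
  {
          int bufsize = 150000;   /* placeholder "small buffer" size */
          int clamp = 100;        /* placeholder cwnd clamp */
          int rv = 0;

          /* Placeholder check: only act on flows to/from a hypothetical port,
           * i.e. enable only on selected flows rather than everything. */
          if (bpf_ntohl(skops->remote_port) != 5001 &&
              skops->local_port != 5001) {
                  skops->reply = -1;
                  return 0;
          }

          switch (skops->op) {
          case BPF_SOCK_OPS_TIMEOUT_INIT:
                  rv = 10;        /* smaller SYN RTO value returned to the stack */
                  break;
          case BPF_SOCK_OPS_TCP_CONNECT_CB:
                  /* Shrink socket buffers on active connections. */
                  rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
                                      &bufsize, sizeof(bufsize));
                  rv += bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
                                       &bufsize, sizeof(bufsize));
                  break;
          case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
                  /* Clamp the congestion window once established. */
                  rv = bpf_setsockopt(skops, SOL_TCP, TCP_BPF_SNDCWND_CLAMP,
                                      &clamp, sizeof(clamp));
                  break;
          }
          skops->reply = rv;
          return 1;
  }

  char _license[] SEC("license") = "GPL";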

The next talk covered large scale TCP analytics collection. It described issues with inet_diag (referring to a telco use case) such as events getting dropped, polling taking a long time, and no events during connection setup and termination. It also covered the difficulty of getting information about connections/flows (such as which congestion control algorithm is used) out of the kernel, and how this information is propagated to user space.
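
For context, the "ss"-style polling path that the speaker found slow at scale looks roughly like the sketch below (an illustration of the existing inet_diag mechanism, not code from the talk): a NETLINK_SOCK_DIAG dump request that asks for struct tcp_info for every TCP socket.

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <linux/netlink.h>
  #include <linux/sock_diag.h>
  #include <linux/inet_diag.h>

  int main(void)
  {
          int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_SOCK_DIAG);
          if (fd < 0) { perror("socket"); return 1; }

          struct {
                  struct nlmsghdr nlh;
                  struct inet_diag_req_v2 req;
          } msg = {
                  .nlh = {
                          .nlmsg_len   = sizeof(msg),
                          .nlmsg_type  = SOCK_DIAG_BY_FAMILY,
                          .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
                  },
                  .req = {
                          .sdiag_family   = AF_INET,
                          .sdiag_protocol = IPPROTO_TCP,
                          .idiag_states   = ~0u,                       /* all states */
                          .idiag_ext      = 1 << (INET_DIAG_INFO - 1), /* want tcp_info */
                  },
          };

          if (send(fd, &msg, sizeof(msg), 0) < 0) { perror("send"); return 1; }

          char buf[32768];
          int len;

          while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
                  struct nlmsghdr *h = (struct nlmsghdr *)buf;

                  for (; NLMSG_OK(h, len); h = NLMSG_NEXT(h, len)) {
                          if (h->nlmsg_type == NLMSG_DONE)
                                  goto out;
                          if (h->nlmsg_type == NLMSG_ERROR) {
                                  fprintf(stderr, "netlink error\n");
                                  goto out;
                          }

                          struct inet_diag_msg *d = NLMSG_DATA(h);
                          char dst[INET_ADDRSTRLEN];

                          inet_ntop(AF_INET, d->id.idiag_dst, dst, sizeof(dst));
                          printf("dst=%s:%u state=%u\n",
                                 dst, ntohs(d->id.idiag_dport), d->idiag_state);
                          /* The requested struct tcp_info follows as the
                           * INET_DIAG_INFO attribute; parsing omitted here. */
                  }
          }
  out:
          close(fd);
          return 0;
  }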

The next talk discussed TCP analytics at Microsoft, covering real-life problems dealt with there. It covered several classes of problems, such as connectivity and performance. There are various reasons for connectivity problems such as “app failed to connect”: network/infrastructure issues, no listener, a full listen backlog, firewall rules, port exhaustion, routing misconfiguration, or NIC driver issues. Likewise it covered performance problems such as “why is TCP throughput so low” and their possible causes: application issues (not posting enough buffers, not draining fast enough), the TCP receive window, network congestion, and CPU usage.

The talk then described the typical analysis process for connectivity and performance problems: for connectivity issues, tracing and packet capture plus detailed tracing of connection setup; for performance issues, micro-benchmarking to rule out application issues, together with time-sequence plots, TCP stats, and network trace analysis.

The next talk discussed TCP stats with regard to mapping users to servers based on TCP stats (delivery metrics). It covered stats collection methods such as random sampling (callbacks in the TCP layer, additional per-socket stats) and the use of mmap and poll to retrieve tcpsockstat from /dev/tcpsockstat, etc. It also covered TCP_INFO and how this information could be used to derive delivery metrics, as well as a proposal for TCP stats collection using BPF/tracepoints, tracing per socket. There was a suggestion from Google to trace “sendmsg” using cmsg, TCP_INFO, and timestamping.

The final talk was about TCP analytics for satellite broadband. It covered the TCP performance challenges posed by a minimum RTT of around 500 ms and how none of the congestion control algorithms deal with that well. The recommendation was to use PEPs (Performance Enhancing Proxies) to avoid congestion; PEPs and AQM (Active Queue Management) avoid packet drops. The talk also covered the need to monitor TCP performance issues (needed to meet Service Level Objectives) and the monitoring challenges: 1. active measurement is intrusive and not scalable, 2. passive measurement (L2 stats monitoring) is used instead. It also discussed QoE assurance and the need to troubleshoot abnormalities by correlating PEP TCP flow stats with RF stats.

Site: https://www.netdevconf.info/0x13/session.html?tcp-analytics
