For normal UDP sockets, UDP_GRO and UDP_SEGMENT can be even faster than sendmmsg/recvmmsg.
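The receive side looks roughly like this (untested sketch, assuming a recent kernel; fd is an already-bound UDP socket):

    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #ifndef UDP_GRO
    #define UDP_GRO 104   /* value from linux/udp.h, for older libc headers */
    #endif

    /* enable once per socket:
       int on = 1; setsockopt(fd, IPPROTO_UDP, UDP_GRO, &on, sizeof(on)); */

    /* Returns bytes received; *seg_size > 0 means buf holds a batch of
       back-to-back datagrams of that size (the last one possibly shorter). */
    ssize_t recv_gro_batch(int fd, char *buf, size_t len, int *seg_size)
    {
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };
        ssize_t n = recvmsg(fd, &msg, 0);  /* one syscall, many datagrams */

        *seg_size = 0;
        for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
             cm = CMSG_NXTHDR(&msg, cm))
            if (cm->cmsg_level == IPPROTO_UDP && cm->cmsg_type == UDP_GRO)
                *seg_size = *(int *)CMSG_DATA(cm);
        return n;
    }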
In gVisor they decided that read/write on the tun device is slow, so they did PACKET_MMAP on a raw socket instead. AFAIU they just ignore the tap device and run a raw socket on it; dumping packets from a raw socket is a faster interface than the device itself.

https://github.com/google/gvisor/blob/master/pkg/tcpip/link/... https://github.com/google/gvisor/issues/210
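The PACKET_RX_RING setup being referenced looks roughly like this (a minimal TPACKET_V1 sketch with error handling omitted; not gVisor's actual code, which is Go):

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    /* real code also bind()s to the tap device via struct sockaddr_ll */

    struct tpacket_req req = {
        .tp_block_size = 1 << 22,                 /* 4 MiB blocks */
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,                    /* one frame per packet slot */
        .tp_frame_nr   = 64 * ((1 << 22) / 2048),
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* the ring is shared with the kernel: packets appear here without read() */
    char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* frame 0 shown; real code walks all frames in order */
    struct tpacket_hdr *hdr = (struct tpacket_hdr *)ring;
    if (hdr->tp_status & TP_STATUS_USER) {        /* kernel filled this frame */
        char *pkt = (char *)hdr + hdr->tp_mac;    /* raw frame, tp_snaplen bytes */
        /* ... consume pkt ... */
        hdr->tp_status = TP_STATUS_KERNEL;        /* hand the frame back */
    }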
It can not only be a lot faster, it definitely is.
I did a lot of work on QUIC protocol efficiency improvements in the last 3 years.
The use of sendmmsg/recvmmsg yields maybe a 10% efficiency improvement - because it only helps with reducing the system call overhead. Once the data is inside the kernel, these calls just behave like a loop of sendmsg/recvmsg calls.
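(For anyone unfamiliar, the batched variant looks like this; a sketch where fd, payloads and payload_len are assumed to exist:)

    #define _GNU_SOURCE          /* sendmmsg is a GNU extension */
    #include <string.h>
    #include <sys/socket.h>

    enum { BATCH = 32 };
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = payloads[i];           /* hypothetical payload array */
        iovs[i].iov_len  = payload_len;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* one syscall, but the kernel still runs the full per-packet path 32 times */
    int sent = sendmmsg(fd, msgs, BATCH, 0);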
The syscall overhead however isn't the bottleneck - all the other work in the network stack is: looking up routes for each packet, applying iptables rules, running BPF programs, etc.
Using segmentation offloads means the packets will also traverse the remaining path as a single unit. This can allow for efficiency improvements of somewhere between 200% and 500%, depending on the overall application. It's very much worth looking at GSO/GRO if you are doing anything that requires bulk UDP datagram transmission.
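Concretely, the send side looks like this (sketch; assumes a connected UDP socket fd and a buffer quic_packets holding 50 equally-sized packets back to back):

    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103      /* value from linux/udp.h, for older libc headers */
    #endif

    int gso_size = 1200;         /* size of each datagram on the wire */
    setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

    /* up to ~64 KiB in one call; route lookup, netfilter, etc. run once,
       and the buffer is only split into 1200-byte datagrams near the device */
    send(fd, quic_packets, 50 * 1200, 0);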
> Using segmentation offloads means the packets will traverse also the remaining path as a single unit.
Note that this is not necessarily the right thing to do for a VPN.
Encryption: WireGuard uses the destination IP of each packet to pick the outgoing tunnel.
Decryption: the destination addresses of the encapsulated packets are arbitrary.
Since the packets in such a batch are required to share things like routing lookups, a VPN generally can't just take a sequence of packets from one side and send the same sequence of packets, individually transformed, out the other side.
Batching in general is still a common trick for performance, but the VPN gateway acts as a multiplexer/demultiplexer. It's doing routing after all!
As for these particular improvements: good implementations of HTTP/1.1 (and HTTP/2) already get them, because HTTP versions before 3 use TCP.
With TCP, they are mostly done in the kernel and network hardware. TCP syscalls can transfer many packets' worth of data per syscall, and the kernel and hardware have done segmentation offloads for TCP automatically for many years, depending on the hardware and kernel version. Over time these optimisations have been improved in the kernel, and newer hardware has provided more capabilities for the kernel to use.
There's not much need for complex application or library support in userspace, though some syscall patterns help the TCP stack more than others. The main thing is to send and receive in larger sizes, letting the kernel handle buffering as well as reducing the syscall count. Don't do silly things like a write() syscall for each HTTP header and then another for the body; buffer it all up into a single write.
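E.g. with vectored I/O you can keep the headers and body in separate buffers and still issue one syscall (sketch; fd, hdrs, body and body_len are assumed):

    #include <string.h>
    #include <sys/uio.h>

    struct iovec iov[2] = {
        { .iov_base = hdrs, .iov_len = strlen(hdrs) },  /* all headers at once */
        { .iov_base = body, .iov_len = body_len },
    };
    /* one syscall instead of one write() per header plus one for the body */
    ssize_t n = writev(fd, iov, 2);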