For normal UDP sockets, UDP_GRO and UDP_SEGMENT can be even faster than sendmmsg/recvmmsg.
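The receive side looks roughly like this (untested sketch, assuming a recent kernel; fd is an already-bound UDP socket):

    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #ifndef UDP_GRO
    #define UDP_GRO 104   /* value from linux/udp.h, for older libc headers */
    #endif

    /* enable once per socket:
       int on = 1; setsockopt(fd, IPPROTO_UDP, UDP_GRO, &on, sizeof(on)); */

    /* Returns bytes received; *seg_size > 0 means buf holds a batch of
       back-to-back datagrams of that size (the last one possibly shorter). */
    ssize_t recv_gro_batch(int fd, char *buf, size_t len, int *seg_size)
    {
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };
        ssize_t n = recvmsg(fd, &msg, 0);  /* one syscall, many datagrams */

        *seg_size = 0;
        for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
             cm = CMSG_NXTHDR(&msg, cm))
            if (cm->cmsg_level == IPPROTO_UDP && cm->cmsg_type == UDP_GRO)
                *seg_size = *(int *)CMSG_DATA(cm);
        return n;
    }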
In gVisor they decided that read/write on the tun device is slow, so they did PACKET_MMAP on a raw socket instead. AFAIU they just ignore the tap device and run a raw socket on it; dumping packets from a raw socket is a faster interface than the device itself.

https://github.com/google/gvisor/blob/master/pkg/tcpip/link/... https://github.com/google/gvisor/issues/210
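The PACKET_RX_RING setup being referenced looks roughly like this (a minimal TPACKET_V1 sketch with error handling omitted; not gVisor's actual code, which is Go):

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    /* real code also bind()s to the tap device via struct sockaddr_ll */

    struct tpacket_req req = {
        .tp_block_size = 1 << 22,                 /* 4 MiB blocks */
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,                    /* one frame per packet slot */
        .tp_frame_nr   = 64 * ((1 << 22) / 2048),
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    /* the ring is shared with the kernel: packets appear here without read() */
    char *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* frame 0 shown; real code walks all frames in order */
    struct tpacket_hdr *hdr = (struct tpacket_hdr *)ring;
    if (hdr->tp_status & TP_STATUS_USER) {        /* kernel filled this frame */
        char *pkt = (char *)hdr + hdr->tp_mac;    /* raw frame, tp_snaplen bytes */
        /* ... consume pkt ... */
        hdr->tp_status = TP_STATUS_KERNEL;        /* hand the frame back */
    }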
It can not only be a lot faster, it definitely is.
I did a lot of work on QUIC protocol efficiency improvements in the last 3 years.
The use of sendmmsg/recvmmsg yields maybe a 10% efficiency improvement - because it only helps with reducing the system call overhead. Once the data is inside the kernel, these calls just behave like a loop of sendmsg/recvmsg calls.
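(For anyone unfamiliar, the batched variant looks like this; a sketch where fd, payloads and payload_len are assumed to exist:)

    #define _GNU_SOURCE          /* sendmmsg is a GNU extension */
    #include <string.h>
    #include <sys/socket.h>

    enum { BATCH = 32 };
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = payloads[i];           /* hypothetical payload array */
        iovs[i].iov_len  = payload_len;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* one syscall, but the kernel still runs the full per-packet path 32 times */
    int sent = sendmmsg(fd, msgs, BATCH, 0);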
The syscall overhead however isn't the bottleneck - all the other work in the network stack is: looking up routes for each packet, applying iptables rules, running BPF programs, etc.
Using segmentation offloads means the packets will also traverse the remaining path as a single unit. This can allow for efficiency improvements of somewhere between 200% and 500%, depending on the overall application. It's very much worth looking at GSO/GRO if you are doing anything that requires bulk UDP datagram transmission.
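Concretely, the send side looks like this (sketch; assumes a connected UDP socket fd and a buffer quic_packets holding 50 equally-sized packets back to back):

    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <sys/socket.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103      /* value from linux/udp.h, for older libc headers */
    #endif

    int gso_size = 1200;         /* size of each datagram on the wire */
    setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

    /* up to ~64 KiB in one call; route lookup, netfilter, etc. run once,
       and the buffer is only split into 1200-byte datagrams near the device */
    send(fd, quic_packets, 50 * 1200, 0);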
> Using segmentation offloads means the packets will traverse also the remaining path as a single unit.
Note that this is not necessarily the right thing to do for a VPN.
Encryption: WireGuard uses the destination IP of each packet to pick the outgoing tunnel.
Decryption: the destination addresses of the encapsulated packets are arbitrary.
Since the packets in such a batch are required to share things like routing lookups, a VPN generally can't just take a sequence of packets from one side and send the same sequence of packets, individually transformed, out the other side.
Batching in general is still a common trick for performance, but the VPN gateway acts as a multiplexer/demultiplexer. It's doing routing after all!
As for these particular improvements: good implementations of HTTP/1.1 (and HTTP/2) already get them, because HTTP versions before 3 use TCP.
With TCP, they are mostly done in the kernel and network hardware. TCP syscalls can transfer many packets' worth of data per syscall, and the kernel and hardware have done segmentation offloads for TCP automatically for many years, depending on the hardware and kernel version. Over time these optimisations have been improved in the kernel, and newer hardware has provided more capabilities for the kernel to use.
There's not much need for complex application or library support in userspace, though some syscall patterns help the TCP stack more than others. The main thing is to send and receive in larger sizes, letting the kernel handle buffering as well as reducing the syscall count. Don't do silly things like a write() syscall for each HTTP header and then another for the body; buffer it all up into a single write.
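E.g. with vectored I/O you can keep the headers and body in separate buffers and still issue one syscall (sketch; fd, hdrs, body and body_len are assumed):

    #include <string.h>
    #include <sys/uio.h>

    struct iovec iov[2] = {
        { .iov_base = hdrs, .iov_len = strlen(hdrs) },  /* all headers at once */
        { .iov_base = body, .iov_len = body_len },
    };
    /* one syscall instead of one write() per header plus one for the body */
    ssize_t n = writev(fd, iov, 2);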