For normal UDP sockets UDP_GRO and UDP_SEGMENT can be even faster than sendmmsg/recvmmsg.
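As a minimal sketch (Linux-only, using golang.org/x/sys/unix; the helper name, segment size, and error handling here are my own, not from the article), enabling both on a connected UDP socket looks roughly like this:

```go
package gsoexample

import (
	"net"

	"golang.org/x/sys/unix"
)

// enableUDPOffloads turns on UDP_SEGMENT (GSO) for sends and UDP_GRO for
// receives on a UDP socket. segSize is the size the kernel will chop each
// outgoing write into.
func enableUDPOffloads(conn *net.UDPConn, segSize int) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var serr error
	err = raw.Control(func(fd uintptr) {
		// One write of N*segSize bytes becomes N UDP datagrams on the wire.
		serr = unix.SetsockoptInt(int(fd), unix.IPPROTO_UDP, unix.UDP_SEGMENT, segSize)
		if serr != nil {
			return
		}
		// Coalesce bursts of incoming datagrams into one large read.
		serr = unix.SetsockoptInt(int(fd), unix.IPPROTO_UDP, unix.UDP_GRO, 1)
	})
	if err != nil {
		return err
	}
	return serr
}
```

After that, a single Write of N*segSize bytes goes out as N datagrams; on the receive side you still need to read the UDP_GRO control message to learn the coalesced segment size, which is omitted here.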
In gVisor they decided that read/write on the tun device is slow, so they use PACKET_MMAP on a raw socket instead. AFAIU they just ignore the tap device and run a raw socket on it; dumping packets from a raw socket is a faster interface than the device itself. See https://github.com/google/gvisor/blob/master/pkg/tcpip/link/... and https://github.com/google/gvisor/issues/210
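For illustration, a rough sketch (my own, not gVisor's code; the ring sizes are arbitrary and error handling is trimmed) of setting up a PACKET_MMAP receive ring on a raw AF_PACKET socket with golang.org/x/sys/unix:

```go
package pktring

import "golang.org/x/sys/unix"

// htons converts to network byte order for the AF_PACKET protocol argument.
func htons(v uint16) uint16 { return v<<8 | v>>8 }

// openRxRing opens a raw socket and maps a shared RX ring into userspace,
// so received frames can be consumed by walking the ring instead of
// issuing a read/recv syscall per packet.
func openRxRing() (fd int, ring []byte, err error) {
	fd, err = unix.Socket(unix.AF_PACKET, unix.SOCK_RAW, int(htons(unix.ETH_P_ALL)))
	if err != nil {
		return
	}
	req := unix.TpacketReq{
		Block_size: 1 << 22, // 4 MiB per block
		Block_nr:   8,       // 32 MiB ring total
		Frame_size: 1 << 11, // 2 KiB per frame
		Frame_nr:   8 * (1 << 22) / (1 << 11),
	}
	if err = unix.SetsockoptTpacketReq(fd, unix.SOL_PACKET, unix.PACKET_RX_RING, &req); err != nil {
		unix.Close(fd)
		return
	}
	ring, err = unix.Mmap(fd, 0, int(req.Block_size)*int(req.Block_nr),
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		unix.Close(fd)
	}
	return
}
```

Frames are then consumed by walking the ring and flipping each frame's status field back to the kernel; the gVisor code linked above does all of that with considerably more care.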
It can not only be a lot faster, it definitely is.
I've done a lot of work on QUIC protocol efficiency improvements over the last 3 years.
The use of sendmmsg/recvmmsg yields maybe a 10% efficiency improvement - because it only helps with reducing the system call overhead. Once the data is inside the kernel, these calls just behave like a loop of sendmsg/recvmsg calls.
The syscall overhead however isn't the bottleneck - all the other work in the network stack is: looking up routes for each packet, applying iptables rules, running attached BPF programs, etc.
Using segmentation offloads means the packets also traverse the remaining path as a single unit. This can allow efficiency improvements of somewhere between 200% and 500%, depending on the overall application. It's very much worth looking at GSO/GRO if you are doing anything that requires bulk UDP datagram transmission.
> Using segmentation offloads means the packets will traverse also the remaining path as a single unit.
Note that this is not necessarily the right thing to do, for a VPN.
Encryption: Wireguard uses the destination IP of each packet to pick the outgoing tunnel.
Decryption: The destination addresses of the encapsulated packets are arbitrary.
Where batching requires the sequence to share things like routing lookups, a VPN generally can't just take a sequence of packets from one side and send the same sequence, individually transformed, out the other side as a single unit.
Batching in general is still a common trick for performance, but the VPN gateway acts as a multiplexer/demultiplexer. It's doing routing after all!
Good implementations of HTTP/1.1 (and HTTP/2) already get these particular improvements, because HTTP versions before 3 use TCP.
With TCP, they are mostly done in the kernel and network hardware. TCP syscalls can transfer many packets' worth of data per syscall, and the kernel and hardware have done segmentation offloads for TCP automatically for many years, depending on the hardware and kernel version. Over time these optimisations have improved in the kernel, and newer hardware has provided more capabilities for the kernel to use.
There's not much need for complex application or library support in userspace, though there are some syscall patterns that help the TCP stack more than others. The main thing is to send and receive in larger sizes, letting the kernel handle buffering while also reducing syscall count. Don't do silly things like a write() syscall for each HTTP header and then the body, for example. Buffer it up into a single write.
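A trivial, made-up example of the difference in Go (the function and header values are just for illustration), building the whole response before a single write:

```go
package httpwrite

import (
	"bytes"
	"fmt"
	"net"
)

// writeResponse builds the whole HTTP/1.1 response in memory and hands it to
// the kernel in one write, instead of one write per header line plus one for
// the body.
func writeResponse(conn net.Conn, body []byte) error {
	var buf bytes.Buffer
	buf.WriteString("HTTP/1.1 200 OK\r\n")
	buf.WriteString("Content-Type: text/plain\r\n")
	fmt.Fprintf(&buf, "Content-Length: %d\r\n\r\n", len(body))
	buf.Write(body)
	_, err := conn.Write(buf.Bytes()) // one syscall; kernel/NIC handle segmentation
	return err
}
```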
For something completely different, you might want to look at Unikernel Linux: https://arxiv.org/abs/2206.00789 You could run all the code without switching between userspace and the kernel, and you can call into kernel functions directly (with the usual caveats about kernel ABI not being stable).
There is a v1 patch posted on LKML, and I think they're hoping to get a v2 patch posted by January. If you are interested in a chat with the team, email me rjones redhat.com.
Fun! We have support for running on gokrazy (https://gokrazy.org/) already, and that's probably where Unikernel Linux is more applicable for us, for when people just want a "Tailscale appliance" image.
One of the authors here: it could, all the kernel code is present. Right now the android selinux policy blocks some of the necessary ioctls (at least on the pixel builds I tested).
The title is clickbait-y. This has next to nothing to do with kernel interfaces and is all about network tuning and encapsulation. Not sure why the authors went with that title, as networking is interesting enough.
Also, the "slow" thing about kernel interfaces (if you aren't doing I/O, which is nearly always the slowest thing) usually isn't a given syscall; it's the transition from user space to kernel space and back. There's a lot going on there, such as flushing caches and buffers due to security mitigations these days.
We certainly didn't try to make it clickbait-y. The point of the title is that people assumed Tailscale was slower than kernel WireGuard because the kernel must be intrinsically faster somehow. The point of the blog post is to say, "no, code can run fast on either side... you just have to cross the boundary less." The blog post is all about how we then cross that boundary less, using a less obvious kernel interface.
I would have titled it something like "Userspace is slow if you do lots of context switches/userspace transitions" (if I understand the point of the post)... but that's sort of been known for a while now. I think the novelty of this comes mainly from the inversion of control in your design, and the post explicitly points out that it's likely every single performance improvement in userspace could be equalled by kernelspace, and likely, exceeded by kernelspace.
What's really crazy is that we're talking about userspace and 10Gbit, which shows that CPUs, busses, and the interface protocols have all been scaling well with interface speeds!
Personally I don't ever increase the MTU, even if there's a significant performance win, since I prefer not to put our oncall in a situation where they have to debug an outage due to an MTU incompatibility.
Just some feedback, that's not what I expected from the title and I would agree with the previous poster that the title is a little (quite minor though) clickbaity.
The purpose of the title is to summarize, enabling the reader to decide if the article is relevant and interesting to him. If the title presents a situation that seems more dire, urgent, or relevant than the article actually is, then it is written to entice click-throughs, even if it makes sense after having read the article.
Thanks for the clarifying reply. I thought most folks who cared knew it was about context switches and not speed on one side vs the other. Now I'm really interested to read the full article.
I disagree. Two main points of the article are "nothing is inherently slow about doing stuff in userland (as shown by the fact that we made a fast implementation)" and "kernel interfaces, i.e. particular methods of boundary crossing, can be (as shown by the fact that the way they made it faster was in large part by doing the boundary crossings better)".
The title gave me a reasonably decent idea of what to expect, and the article delivered.
One of the authors here: what I was going for with the title is that singular read/write switching (the before case) is very slow for packet-sized work, and batching (~64 KB or more) is much faster - it's about amortizing the cost of the transition, as you rightly point out. That's the point the title is making: some interfaces do not provide the ability to amortize that cost, others do!
I mostly just find the title fun. But I guess my main issue with clickbait is more the bait-and-switch when the result doesn’t live up to expectations. I felt like I had a reasonable idea that what was inside wouldn’t be a waste of time from the words in the title. There are plenty of articles posted on this site with dry titles that don’t match the body, which seems worse to me.
When Josh et al tried it, they hit some fun kernel bugs on certain kernel versions and that soured them on it for a bit, knowing it wouldn't be as widely usable as we'd hoped based on what kernels were in common use at the time. It's almost certainly better nowadays.
It will likely not help a lot, because syscall overhead is not the bottleneck - the implementation of the actual syscalls is - which is very visible in the flame graphs. Both sendmmsg/recvmmsg and io_uring only help with the syscall overhead, and are therefore not as helpful in improving efficiency as the offloads, which make the actual network stack more efficient.
Besides kernel supplied offloads, the thing which helps further is actually bypassing the kernel with AF_XDP or DPDK. But those techniques will have other challenges and limitations.
Yeah, but why? Between the lines, this also means you've been wasting my battery life for some time. And likely will be again when the kernel impl is updated. For what purpose? (Not totally unrelated, I can't help but wonder if Tailscale on a certain platform wouldn't be a bit more reliable if it used the builtin WG support instead of insisting on wireguard-go).
As someone who spends a lot of time working with DPDK, the idea that anyone would consider userspace "slow" and kernel of all things "fast" is amusing to me.
Most everyone nowadays seems to try to avoid kernel as much as possible, since it's vastly slower than a userspace forwarding plane (although eBPF/XDP will likely change this).
There are a few gotchas with GRO, although I'm not sure they're applicable to Wireguard - in particular, there used to be a line in the kernel vswitch code that dropped a packet if it had been processed by GRO. A while back I spent a long time debugging a network problem caused by that particular "feature"...
It'd be interesting to see the benchmark environment's raw ChaCha20Poly1305 throughput (the x/crypto implementation) in the analysis. My hunch is it's several times greater than the network, which would further support the argument.
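One way to get that number would be a micro-benchmark along these lines (a sketch with an assumed 1280-byte payload and my own setup, not the article's benchmark environment):

```go
package bench

import (
	"crypto/rand"
	"testing"

	"golang.org/x/crypto/chacha20poly1305"
)

// BenchmarkSeal1280 measures raw x/crypto ChaCha20-Poly1305 throughput on
// 1280-byte payloads (a common tunnel MTU). b.SetBytes makes `go test -bench`
// report MB/s.
func BenchmarkSeal1280(b *testing.B) {
	key := make([]byte, chacha20poly1305.KeySize)
	if _, err := rand.Read(key); err != nil {
		b.Fatal(err)
	}
	aead, err := chacha20poly1305.New(key)
	if err != nil {
		b.Fatal(err)
	}
	nonce := make([]byte, chacha20poly1305.NonceSize)
	msg := make([]byte, 1280)
	dst := make([]byte, 0, 1280+aead.Overhead())
	b.SetBytes(1280)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		dst = aead.Seal(dst[:0], nonce, msg, nil)
	}
}
```

As the sibling comments note, a single-core number like this still doesn't map one-to-one onto tunnel throughput.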
Yeah the flame graphs show ~9% of the time being spent in golang.org/x/crypto/chacha20poly1305 so you're probably right, but flame graphs and throughput aren't always a one-to-one mapping. Flame graphs just tell you where time was spent per unit packet, but depending on the workload, there are some things in the life of a packet that you can parallelise and some things you can't.
Just thought it'd be interesting to see the actual throughput along with the rest for the benchmarked environment.
One of the authors here: yeah, it's very interesting. The flame graphs here don't do a great job at highlighting an aspect of the challenge which is that crypto fans out across many CPUs. I think the hunch that 20-30gbps is attainable (on a fast system) is accurate - it'll take more work to get there.
What's interesting is that the cost of x/crypto on these system classes is prohibitive for serial decoding at 10 Gbps. Ballparking with a 1280-byte MTU: at 10 Gbps you have about 1000 ns to process each packet (1280 bytes is roughly 10 kbit, so ~1 µs per packet), and it takes about 2000 ns to encrypt one. Fan-out is critical at these levels, and will always introduce its own additional costs: synchronization, memory management, and so on.
Is the per-packet processing in Wireguard stateless? As in, no sequential packet numbering in the crypto, etc?
If yes, then you should be able to get the kernel to spread your incoming traffic across cores with minimal contention and coordination, using multiqueue tun/tap (IFF_MULTI_QUEUE).
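For reference, a rough sketch (the helper name and struct padding are mine; real code needs more validation) of opening one queue of a multiqueue tun device on Linux:

```go
package tunmq

import (
	"os"
	"unsafe"

	"golang.org/x/sys/unix"
)

// ifreq mirrors the parts of struct ifreq used by the TUNSETIFF ioctl
// (interface name plus flags), padded to the kernel's 40-byte struct size.
type ifreq struct {
	name  [unix.IFNAMSIZ]byte
	flags uint16
	_     [22]byte
}

// openQueue opens one queue of a multiqueue tun device. Calling it repeatedly
// with the same name attaches additional queues to the same device.
func openQueue(name string) (*os.File, error) {
	f, err := os.OpenFile("/dev/net/tun", os.O_RDWR, 0)
	if err != nil {
		return nil, err
	}
	var req ifreq
	copy(req.name[:], name)
	req.flags = unix.IFF_TUN | unix.IFF_NO_PI | unix.IFF_MULTI_QUEUE
	if _, _, errno := unix.Syscall(unix.SYS_IOCTL, f.Fd(),
		uintptr(unix.TUNSETIFF), uintptr(unsafe.Pointer(&req))); errno != 0 {
		f.Close()
		return nil, errno
	}
	return f, nil
}
```

Each queue gets its own file descriptor, and the kernel steers flows across them, so separate cores can service separate queues.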
Absolutely, multiqueue is on the list. What that would in theory allow us to solve well is rss/xss. As mentioned above, we would still have to fan out crypto, which means we need to solve for NUMA-aware queue processing, which isn't immediately solvable with the current Go runtime APIs. Lots of interesting things to work on!
> but flame graphs and throughput aren't always a one-to-one mapping
I wanted to give a concrete example of this. Crypto and similar code using SIMD can drop some Intel CPUs into lower frequencies, and this effect can apply to all cores. It takes a short moment for the CPU to "speed up" again. Then there's icache pressure, etc.
The usual rule is, microbenchmarks are microuseful. Systems mixing micro-workloads just can't reach the same throughput, and trying to guess the final performance is very tricky.
I have experience with neither WireGuard nor x/crypto, but I did some measurements on crypto algorithms for QUIC efficiency before. There I achieved about 350 MB/s throughput on a single Ryzen 3900X CPU core using ChaCha20, compared to 500 MB/s using hardware-accelerated AES-256. That might or might not be faster than your network.
Note that the numbers include networking overhead, if you would just measure crypto performance the numbers would obviously be higher - as would the difference between the algorithms be.
You may want to try virtio-user to avoid your userspace code calling recvmsg/sendmsg; instead, one (or several) vhost kthreads will do that for you. Unfortunately, there isn't a proper virtio-net driver in Go.
That's why I qualified it with "exceedingly slow". I would think in most cases a "bare metal" implementation would be quicker (and the kernel, for most intents, is that), but userspace is good enough for the vast majority of cases not involving direct bit-twiddling in hardware. I know that userspace is for simplicity and for safety (memory and security). I'm not a newb :) I'm a developer, I'm just not a kernel dev (although I have written some very simple Linux drivers for fun and to get a feel for it, and tweaked existing drivers as well).
Leaving aside the general question (which sibling comment covers), there's an unwritten qualification of "userspace is generally seen as slow for drivers (network, disk, filesystem)", and... we generally don't bother using it for those things, or at least we try to move the data path into the kernel when we care about performance.