
It'd be interesting to see the benchmark environment's raw ChaCha20Poly1305 throughput (the x/crypto implementation) in the analysis. My hunch is it's several times greater than the network, which would further support the argument.


I noticed that very little of the flame graph is crypto which implies that the system under test could do 20-30 Gbps of ChaCha20Poly1305.


Yeah, the flame graphs show ~9% of the time being spent in golang.org/x/crypto/chacha20poly1305, so you're probably right, but flame graphs and throughput aren't always a one-to-one mapping. A flame graph only tells you where CPU time went per packet; depending on the workload, some stages in the life of a packet can be parallelised and some can't.
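For what it's worth, the back-of-envelope behind the 20-30 Gbps hunch: in a CPU-bound run, the crypto-only ceiling is roughly the observed throughput divided by the fraction of CPU time spent in crypto. The 9% figure is from the flame graphs; the observed throughput below is a hypothetical placeholder, not a number from the article:

```go
package main

import "fmt"

// cryptoCeiling estimates the throughput the same hardware could reach
// if crypto were the only per-packet cost, given an observed CPU-bound
// throughput and the fraction of CPU time spent in crypto.
func cryptoCeiling(observedGbps, cryptoFraction float64) float64 {
	return observedGbps / cryptoFraction
}

func main() {
	// 0.09 is the ~9% from the flame graphs; 2.0 Gbps is a
	// hypothetical observed figure chosen only for illustration.
	fmt.Printf("crypto-only ceiling: ~%.0f Gbps\n", cryptoCeiling(2.0, 0.09))
}
```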

Just thought it'd be interesting to see the actual throughput along with the rest for the benchmarked environment.


One of the authors here: yeah, it's very interesting. The flame graphs here don't do a great job of highlighting one aspect of the challenge, which is that crypto fans out across many CPUs. I think the hunch that 20-30 Gbps is attainable (on a fast system) is accurate; it'll take more work to get there.

What's interesting is that the cost of x/crypto on these system classes is prohibitive for serial decoding at 10 Gbps. Ballparking with a 1280-byte MTU, you have about 1000 ns to process each packet, and a single encryption takes about 2000 ns. Fan-out is critical at these rates, and it always introduces its own additional costs: synchronization, memory management, and so on.
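The per-packet budget above is simple arithmetic, sketched here with the numbers from the comment (1280-byte MTU at 10 Gbps works out to just over 1000 ns per packet):

```go
package main

import "fmt"

// nsPerPacket returns the per-packet time budget in nanoseconds for a
// given MTU (bytes) at a given line rate (bits per second).
func nsPerPacket(mtuBytes int, lineRateBps float64) float64 {
	return float64(mtuBytes*8) / lineRateBps * 1e9
}

func main() {
	// Numbers from the comment above: 1280-byte MTU at 10 Gbps.
	budget := nsPerPacket(1280, 10e9)
	fmt.Printf("budget: %.0f ns/packet\n", budget) // 1024 ns
	// If one ChaCha20-Poly1305 op takes ~2000 ns on this hardware,
	// a single core can't keep up and the work must fan out:
	fmt.Printf("cores needed at 2000 ns/op: %.1f\n", 2000/budget)
}
```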


Is the per-packet processing in Wireguard stateless? As in, no sequential packet numbering in the crypto, etc?

If yes, then you should be able to get the kernel to spread your incoming traffic across cores with minimal contention and coordination, with multiqueue tuntap:

https://www.kernel.org/doc/html/latest/networking/tuntap.htm...

https://lwn.net/Articles/459270/


Absolutely, multiqueue is on the list. What that would in theory let us solve well is RSS/XSS. As mentioned above, we would still have to fan out crypto, which means we need to solve NUMA-aware queue processing, and that isn't immediately solvable with the current Go runtime APIs. Lots of interesting things to work on!
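As a sketch of what the crypto fan-out looks like in Go (hypothetical names; the real wireguard-go pipeline is more involved), packets are handed to a pool of workers, and the synchronization cost mentioned upthread shows up as exactly the channel sends/receives and WaitGroup here:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// packet stands in for an encrypted datagram; decrypt is a placeholder
// for the real ChaCha20-Poly1305 work.
type packet struct{ seq int }

func decrypt(p packet) packet { return p } // hypothetical stand-in

// fanOut spreads per-packet crypto across one worker per CPU. The
// channel operations and WaitGroup are the coordination overhead that
// serial processing doesn't pay.
func fanOut(in []packet) []packet {
	workers := runtime.NumCPU()
	jobs := make(chan packet, len(in))
	out := make(chan packet, len(in))

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				out <- decrypt(p)
			}
		}()
	}
	for _, p := range in {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	close(out)

	res := make([]packet, 0, len(in))
	for p := range out {
		res = append(res, p)
	}
	return res // note: no longer in wire order; reordering costs extra
}

func main() {
	pkts := make([]packet, 1000)
	for i := range pkts {
		pkts[i] = packet{seq: i}
	}
	fmt.Println("decrypted:", len(fanOut(pkts)))
}
```

The loss of ordering at the end is part of why fan-out isn't free: a real tunnel has to restore (or tolerate) packet order, which is more of the synchronization cost the comment describes.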


> but flame graphs and throughput aren't always a one-to-one mapping

I wanted to give a concrete example of this. Crypto and other SIMD-heavy code can drop some Intel CPUs into lower clock frequencies, and that effect can apply to all cores. It then takes a short moment for the CPU to "speed up" again. Then there's icache pressure, etc.

The usual rule is: microbenchmarks are micro-useful. A system mixing micro-workloads just can't reach the throughput each microbenchmark suggests in isolation, and guessing the final performance is very tricky.

https://stackoverflow.com/questions/19722950/do-sse-instruct...


I have experience with neither WireGuard nor x/crypto, but I did some measurements of crypto algorithms for QUIC efficiency a while back. There I achieved about 350 MB/s throughput on a single Ryzen 3900X core using ChaCha20, compared to 500 MB/s using hardware-accelerated AES-256. That might or might not be faster than your network.

Note that those numbers include networking overhead; if you measured just the crypto, the numbers would obviously be higher, and so would the gap between the algorithms.
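A minimal, stdlib-only sketch of the kind of raw-crypto measurement being described (no network in the path), using AES-GCM from crypto/cipher since x/crypto sits outside the standard library; the payload size and iteration count are arbitrary choices:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"time"
)

// sealThroughput measures raw AEAD seal throughput in MB/s for a given
// payload size, with no networking or framing overhead included.
func sealThroughput(payload, iters int) float64 {
	key := make([]byte, 32) // AES-256
	nonce := make([]byte, 12)
	rand.Read(key)
	rand.Read(nonce)

	block, _ := aes.NewCipher(key)
	aead, _ := cipher.NewGCM(block)

	buf := make([]byte, payload)
	dst := make([]byte, 0, payload+aead.Overhead())

	start := time.Now()
	for i := 0; i < iters; i++ {
		dst = aead.Seal(dst[:0], nonce, buf, nil)
	}
	secs := time.Since(start).Seconds()
	return float64(payload*iters) / secs / 1e6
}

func main() {
	// 1280 bytes matches the MTU discussed upthread; 100k iterations
	// is arbitrary. Results vary heavily by CPU.
	fmt.Printf("AES-256-GCM seal: %.0f MB/s\n", sealThroughput(1280, 100_000))
}
```

Swapping the AEAD for golang.org/x/crypto/chacha20poly1305 in the same harness would give the ChaCha20-Poly1305 side of the comparison.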



