I’ve been building SD-card-based Pi devices, and the limiting factor in I/O perf is the FUSE exFAT implementation. There’s a leaked Samsung internal implementation that is over 2x as fast in my benchmarks. I can’t say whether the difference comes from FUSE itself or from other performance optimizations, though.
Doing things in-kernel is almost always faster. Every FUSE request has to go from the kernel out to a userspace daemon and back, so you pay for extra context switches and for copying data back and forth. I can't provide a formal, data-backed citation, but it's pretty well understood.
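If anyone wants to check this on their own hardware, here is a rough sketch of a write-throughput test. It is not the benchmark used above; the function name, file name, and default sizes are all made up for illustration. It writes a file in fixed-size chunks with unbuffered writes (so each chunk is its own syscall), calls fsync at the end, and reports MB/s.

```python
import os
import time


def write_throughput_mb_s(directory, total_mb=16, chunk_kb=128):
    """Write about total_mb of zeros into `directory` in chunk_kb-sized
    writes and return the measured throughput in MB/s.

    Hypothetical helper for illustration only; sizes are arbitrary defaults.
    """
    chunk = b"\0" * (chunk_kb * 1024)
    n_chunks = (total_mb * 1024) // chunk_kb
    path = os.path.join(directory, "throughput_test.bin")
    written = 0
    try:
        start = time.perf_counter()
        # buffering=0 gives a raw file object, so each write() is one syscall
        # and the per-request overhead of the filesystem shows up directly.
        with open(path, "wb", buffering=0) as f:
            for _ in range(n_chunks):
                written += f.write(chunk)
            # Force the data out to the device before stopping the clock.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        # Clean up the scratch file even if a write failed.
        if os.path.exists(path):
            os.remove(path)
    return (written / (1024 * 1024)) / elapsed


if __name__ == "__main__":
    import sys

    # Pass the mount point to test, e.g. the SD card's exFAT mount.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{write_throughput_mb_s(target):.1f} MB/s")
```

Run it once with the card mounted through the FUSE driver and once through an in-kernel driver, pointing it at the same mount point each time. Smaller `chunk_kb` values mean more requests for the same amount of data, so any per-request overhead should become more visible. It won't tell you how much of the gap is FUSE versus other optimizations, only how big the gap is.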