Recording traces of allocation requests from actual applications for allocator-benchmarking purposes revolutionized allocator performance research in the late 90s, if I recall correctly, in addition to the earlier filesystem research you mentioned. Benchmarks based on those traces were the bedrock of Zorn's case that generational GCs were faster in practice than malloc().
But I think (though possibly this is motivated reasoning) that benchmarking some workload is good enough for many optimizations. A fixed-size-block allocator is already pretty restricted in what workloads you can run on it, after all, and its performance characteristics are a lot simpler than something like first-fit. Showing that an optimization makes some workload faster is infinitely better than not knowing if it makes any workload faster. Showing that it generalizes across many workloads is better, of course, but it's not nearly as big a step in utility.
Well, you can use it for everything. I've used it for in-memory hash tables, trees, etc. To be clear, for those who may not understand, the "interpretation" in these "millibenchmarks" for something as simple as an allocator is not like Python or Bash or something slow. It could literally be "mmap a binary file of integers and alloc/free according to those binary numbers". So, yeah, there might be some CPU pipeline overhead from unpredictable branches and you might still want to try to measure that, but beyond that there is basically no work, and any real workload might have one-ish hard-to-predict branches leading to allocator operations anyway. So, even unadjusted for overhead it's not so bad. And it keeps your measurements fast, so you can get statistically many of them even if there is an "inner min loop" to try to isolate the best case or something. (Well, at least if what you are measuring is fast.) The downside/tradeoff is just lack of fidelity to interaction effects in the full application, like resource competition/branch predictor disruption/etc.
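For concreteness, here is a minimal sketch of what such a trace-replay millibenchmark could look like in C. The trace encoding (a flat file of native int32 codes where a positive value means "allocate into that slot" and a negative value means "free that slot"), the 64-byte fixed object size, and using plain malloc/free as the allocator under test are all just assumptions for illustration, not anyone's actual format:

    /* Replay a binary alloc/free trace against malloc/free and report
     * best-of-N ns/op. Hypothetical encoding: int32 op >= 0 allocates
     * into slot op, op < 0 frees slot -op (so slot 0 is allocate-only). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    #define NSLOTS 65536      /* max live objects the trace may reference */
    #define OBJ_SIZE 64       /* fixed block size, as for a pool allocator */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s trace.bin\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size < (off_t)sizeof(int32_t)) return 1;
        size_t n = (size_t)st.st_size / sizeof(int32_t);
        int32_t *ops = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (ops == MAP_FAILED) { perror("mmap"); return 1; }

        static void *slot[NSLOTS];        /* zero-initialized pointer table */
        double best = 1e30;
        for (int rep = 0; rep < 5; rep++) {       /* "inner min loop": keep best of 5 */
            double t0 = now_sec();
            for (size_t i = 0; i < n; i++) {
                int32_t op = ops[i];
                if (op >= 0) {
                    size_t s = (size_t)op % NSLOTS;
                    free(slot[s]);                 /* tolerate re-used slots */
                    slot[s] = malloc(OBJ_SIZE);
                } else {
                    size_t s = (size_t)(-(int64_t)op) % NSLOTS;
                    free(slot[s]);
                    slot[s] = NULL;
                }
            }
            double dt = now_sec() - t0;
            if (dt < best) best = dt;
            for (size_t s = 0; s < NSLOTS; s++) { free(slot[s]); slot[s] = NULL; }
        }
        printf("best: %.1f ns/op over %zu ops\n", best * 1e9 / (double)n, n);
        munmap(ops, (size_t)st.st_size);
        close(fd);
        return 0;
    }

To test a custom pool allocator instead, you would just swap the malloc/free calls in the inner loop for its alloc/free entry points.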
You could probably set up a https://github.com/c-blake/nio file for it (or some other binary format) if you wanted to be systematic. A problem with the "text-newline" part of the "unix philosophy" is that it incurs a lot of unneeded binary->ascii->binary cycles, which is actually more programming work as well as more CPU work, and it's really fairly orthogonal to the rest of the Unix ideas.
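To illustrate the binary->ascii->binary point with the same hypothetical trace encoding as above (and not nio's actual on-disk format, which I won't guess at here): a recorder can just fwrite the raw int32 codes and the replayer can mmap them back, with no formatting pass on one side and no parsing pass on the other.

    /* Sketch: write a synthetic trace as raw native int32 codes rather
     * than text lines. Encoding matches the replay sketch above; slots
     * start at 1 because -0 == 0 makes slot 0 allocate-only there. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        FILE *out = fopen("trace.bin", "wb");
        if (!out) { perror("fopen"); return 1; }
        /* allocate into slots 1..1000, then free them in reverse order */
        for (int32_t i = 1; i <= 1000; i++)
            fwrite(&i, sizeof i, 1, out);
        for (int32_t i = 1000; i >= 1; i--) {
            int32_t op = -i;
            fwrite(&op, sizeof op, 1, out);
        }
        fclose(out);
        return 0;
    }

A real trace would of course come from interposing on the application's allocator rather than being synthesized like this, but the file format question is the same either way.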
EDIT: reply to parent's EDIT
> benchmarking some workload is good enough
Fair enough, except that "good enough" is sort of "famous last words" if anyone is reviewing your work. :-) :-) Some will complain you only did Intel, not AMD, or not ARM, or not Apple Silicon, or whatever other zillion variables. Something is always better than nothing, though. It should be thought of more like a Bayesian thing - "no surprise for untested situations" or maybe "confidence in proportion to evidence/experience".
FWIW, I thought your big advice to 8dcc was pretty decent.