I'm no expert on this, so probably someone else can give you better advice, but I'm happy to share what I do know. Here are the most important points.
Benchmarking is a full-fledged scientific experiment. You have a hypothesis that a particular change that you can make, such as using a different allocator, will have a particular effect, such as making your program run in less time. So you make just that one single change and measure the results, while carefully controlling all other conditions that could also affect your program's run time. You have to do the test more than once in order to find out whether you've succeeded in controlling them. If you do a good enough job controlling the conditions of the experiment, it becomes reproducible and therefore falsifiable. Programming usually isn't science, but benchmarking is. A great thing about it is that each run of the experiment can take a fraction of a second, so you can do the whole scientific method from beginning to end in a few minutes and get a reproducible, falsifiable result.
Benchmarking can get arbitrarily sophisticated, but the most important thing is to dodge the bullets that can turn benchmarks into nonsense, rather than produce very sophisticated graphs or get very high precision.
The simplest thing on Linux or macOS is to compile a command-line executable you can run in a terminal and run
time ./testprogram
at least three times to get an idea of how long the wallclock time is and how variable it is. (The user and system times shown are not very reliable anymore.) Presumably there is also some way to do the same thing on Microsoft Windows. If you're on a laptop, check to see if it makes a difference if you're plugged in. Also, check whether the result from
time ./testprogram; time ./testprogram
is consistent; sometimes lags in CPU speed scaling can produce unpredictable results.
Linux is not very consistent, so you should expect variations in the range of 1–10% from run to run when your program takes on the order of a second to run.
If you get consistently shorter wallclock times after recompiling the program with a different allocator and the same compiler options (probably make sure optimization isn't completely turned off), you can probably attribute that difference to the different allocator. It's a good idea to switch back to the original allocator and verify that you can reproduce your original results to make sure you didn't accidentally change something else at the same time (maybe your laptop went into low-power mode because the battery is getting low, or maybe you changed the number of iterations in a loop and forgot you changed it, or maybe Firefox loaded a YouTube video in the background and is making the machine swap, that kind of thing).
It's useful to ensure that you aren't swapping and that your CPU load other than the benchmark is minimal. `top` (or, better, `htop`) and `vmstat 5` (or, better, `dstat 5`) can be helpful here. Unless your program is heavily multithreaded, the CPU load is not as crucial as it used to be now that we all have multicore processors.
It's helpful to check the particular version of the program you're benchmarking into Git, including the Makefile or whatever specifies the compiler options. That way when you write down your results you can refer to the specific Git commit hash. Also, record your compiler version, your operating system version, and your CPU. Ideally someone who isn't you should be able to read your notes on the benchmark, run it again, and get the same results, if they have access to the same hardware.
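On Linux, for example, the relevant details can be captured with a few standard commands (this particular list is just a suggestion, and `lscpu` is Linux-specific):
    git rev-parse HEAD
    gcc --version
    uname -a
    lscpu
Pasting that output into your notes file next to the timings makes them much easier to reproduce later.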
The actual benchmark program itself can be as simple as allocating and deallocating in a loop, ideally inside another loop to run the whole benchmark multiple times; but on processors with cache (which includes every PC since the early 90s) the working set size can be very important, and you may get significantly different results for allocating 20 kilobytes 64 bytes at a time and 20 gigabytes 64 bytes at a time. If you pass in the number of iterations for one of the loops on the command line, instead of hardcoding it in the program code, it's easy to start with a small iteration count and work your way up by a factor of 10 at a time until the program takes a few seconds to run. In this case, also be sure to record the specific command line you were measuring the performance of in your notes; knowing that a program took 2.205–2.216 seconds to run is of no value if you don't know whether it was running a thousand iterations or a million.
It's a good idea to actually compute something that you output using the memory you're allocating and deallocating, just in case GCC's optimizer decides that your benchmark loop has no effect and therefore can be safely removed from your program. (In http://canonical.org/~kragen/sw/dev3/kmregion_example.c I computed the sum of the data value stored in each node of the 24th of all the 5000-node linked lists I allocated and deallocated. That's because, when I read the disassembly of my test code with `objdump -d kmregion_example`, I found out that GCC had removed most of the code inside the test loop because it could prove that nobody ever read the struct fields I was initializing. Reading disassembly is neither necessary nor sufficient for dodging all bullets that invalidate benchmarks.)
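To make that concrete, here is a minimal sketch of such a benchmark written against plain malloc()/free(); the 64-byte block size and the default counts are arbitrary placeholders, and the checksum exists only so the compiler can't prove the loop is dead:
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        long outer = argc > 1 ? atol(argv[1]) : 1000;  /* how many times to repeat */
        long live  = argc > 2 ? atol(argv[2]) : 1000;  /* live blocks; working set = live * 64 bytes */
        unsigned long checksum = 0;

        for (long i = 0; i < outer; i++) {
            char **blocks = malloc(live * sizeof *blocks);
            for (long j = 0; j < live; j++) {
                blocks[j] = malloc(64);
                blocks[j][0] = (char)j;
            }
            for (long j = 0; j < live; j++) {
                checksum += (unsigned char)blocks[j][0];
                free(blocks[j]);
            }
            free(blocks);
        }
        printf("checksum: %lu\n", checksum);  /* output depends on the allocations, so the loops can't be deleted */
        return 0;
    }
If you build this as ./testprogram, the `time` commands above apply directly; testing a different allocator means swapping out the malloc()/free() calls or linking against a different malloc.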
Varying the number of iterations in both the innermost and outermost loop should produce roughly proportional increases and decreases in wallclock time. (A helpful technique is to subtract the time for 1000 iterations from the time for 1001000 iterations to get the time for the last million iterations.) If it doesn't, you're not measuring what you think you're measuring. For example, especially with JIT compilers, you may be measuring startup overhead. That bit me this week with http://canonical.org/~kragen/sw/dev3/hadamardsin.js, which I erroneously reported was slower to run than the LuaJIT version. It turns out that it's slightly faster to run, just much slower to compile.
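With the sketch above compiled as ./testprogram, that subtraction technique is just:
    time ./testprogram 1000
    time ./testprogram 1001000
and the difference between the two wallclock times estimates the cost of the last million outer iterations, with the startup overhead cancelled out.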
Running the benchmark on more than one computer is helpful because sometimes changes that speed things up on one system slow them down on another. I remember a recursive backtracking search program for logic circuit optimization I wrote in C long ago which used longjmp() to terminate the search when it found a solution; it ran reasonably fast on Linux but very slowly on my friend's Macintosh because MacOS implemented setjmp() as sigsetjmp(), transforming it from a very cheap operation into a very expensive one.
That's all you need to know to reliably do optimizations. In fact, that's probably more than you need to know. Everything below this point is extra.
— ⁂ —
Allocators are especially tricky to benchmark because often, by changing where data is stored in memory, they profoundly affect the speed of the rest of your program, in ways that vary from program to program. More memory fragmentation, for example, can increase how many TLB misses your program runs into. I don't know what to do about that except try to benchmark some real programs as well as simple nested loops.
In cases where you're trying to measure an effect that's roughly the size of your measurement precision, statistical tests can be useful. https://github.com/c-blake/bu/blob/main/doc/edplot.md linked by cb321 above discusses using the 2-sample Kolmogorov-Smirnov test, which is not nearly as terrifying as it sounds. But most effects will be either much larger than your measurement error, in which case the result is obvious enough that you don't need the statistics, or much smaller, in which case all they'll tell you is that you can't be sure about the direction of the effect, much less its magnitude. In medicine, where your experiments take years and people die in them, using sophisticated statistics to make the most of your precious data is very important. In performance benchmarking, where your experiments take milliseconds, it's usually better to improve your measurement precision instead. Sometimes just running the same benchmark more times is enough, but there are more powerful techniques available.
A way to get higher precision than `time` can give you is to run the program under `valgrind` (on Linux). Even raw valgrind will tell you exactly how many instructions it ran, and in most cases that number is the same from run to run, so instead of a measurement with ±3% precision like `time` usually gives you when you've controlled everything properly, you get a measurement typically with ±0.00000001% precision. This is fantastic, far beyond most things you can do in a physics, electronics, or chemistry lab. It means that even with just two runs you can tell if your optimization affected that number by even 0.0000001%, or, more interestingly, 0.1%. And it isn't affected by things like which CPU you run it on or whether your laptop is unplugged, and it only slows the program down about 10×.
Unfortunately, it measures the wrong thing, because a program that runs more instructions can still be faster. `valgrind --tool=cachegrind` tries to simulate your CPU's cache hierarchy to give you a better idea of how long the program will actually take; the `kcachegrind` GUI can give you a "cycle estimation" number based on the numbers it spits out. But it can still be unrelated to how fast the program actually runs for a variety of reasons; system calls are the biggest one. `kcachegrind` is also a "profiler": it shows you how much time (simulated instructions, simulated cache misses, cycle estimation, whatever) is used by each subroutine in your program.
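A minimal cachegrind run, again using the hypothetical ./testprogram from earlier, looks something like this (the exact summary format varies between valgrind versions):
    valgrind --tool=cachegrind ./testprogram 10000
    kcachegrind cachegrind.out.*
The summary cachegrind prints at the end includes the simulated instruction count ("I refs") and cache miss counts, and it leaves behind a cachegrind.out.<pid> file that kcachegrind can open.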
You might wonder why an optimization that speeds your program up by 0.1% matters. The answer is that if you do 106 such optimizations you have sped your program up by 10% (0.999^106 ≈ 0.90, so the program runs in about 90% of the original time). SQLite has improved its performance by more than a factor of 3 over a 10-year period by using this approach: https://www.sqlite.org/cpu.html
There's a more basic profiler called gprof built into GCC; building your program with `-pg` adds profiling instrumentation to it, so that it writes profiling information into a file called `gmon.out`. After running a program built this way, you can run `gprof testprogram` to analyze `gmon.out`; it will tell you how much time each subroutine took (including and excluding its transitive callees) and how many times it was called.
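Assuming the benchmark sketch above is in testprogram.c (the filename is just for illustration), the whole cycle is:
    gcc -O2 -pg -o testprogram testprogram.c
    ./testprogram 100000
    gprof testprogram
gprof reads gmon.out from the current directory by default.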
Linux also has a built-in profiler called `perf_events` https://perfwiki.github.io/main/ which uses hardware performance counters to get numbers like cachegrind's, but using the real hardware instead of simulation, and including a lot more detail; it can also profile the time your program spends in system calls. I've only used it in very basic ways, so I don't know very much about it. Intel also has a proprietary thing called VTune which can do things perf can't and runs on OSes that aren't Linux, but doesn't support AMD's processors.
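The most basic uses look something like this (perf is often packaged separately from the kernel, and some counters need elevated permissions or a sysctl change):
    perf stat ./testprogram 100000
    perf record ./testprogram 100000
    perf report
`perf stat` prints counts of cycles, instructions, branch misses, and so on for the whole run; `perf record` followed by `perf report` gives you a profile broken down by function.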
— ⁂ —
I hope this is of some value to you! I'd be happy to answer any other questions. Hopefully correctly.
Wow. This is probably the best comment I have ever seen on HN. Thank you so much for the information, I will add benchmarking to my 'libpool' project and mention it in the article. I can't think of any questions right now, but I will let you know if anything comes up. Thank you again. :)
> The actual benchmark program itself can be as simple as allocating and deallocating in a loop [...]
What do you mean, exactly? Deallocating immediately after allocating? I feel like there would be a big difference between that and allocating multiple blocks before freeing them.
You are right - there will be a big difference for these two different workloads. As I am 100% sure @kragen already knows, this is arguably the first, smallest step in thinking for you to realize just how many possible workloads you might want to model. :-) And to realize why senior engineers will say pithy things like "the best benchmark is your actual application".
While that is true (and I've said it myself), one reason I like the design of a little "workload interpreter shell" as in https://github.com/c-blake/bst/blob/main/test/bst-interact.c (or maybe better https://github.com/c-blake/adix/blob/master/tests/btshell.ni... which has a preprocessor) is that you can use them to run "isolated tests" as a kind of "half step" between a pure hot loop dumb thing and a full-on application with all its attendant noise and self-interaction/self-disturbance.
So, for example, in this allocation problem setting you could write a similar 30-line "shell/interpreter" that interprets a binary stream of "commands" or allocation requests. The origin of said binary stream could be a recording/log from some actual app you have doing something actually useful. Then you could apply that one recording in the "shell" in different configurations as @kragen mentions to study the allocator performance (both space & speed) in isolation. At that point you could start to say pretty meaningful things from replaying logs many times about this or that tradeoff on this|that workload.
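One way such a replayer might look, with a made-up log format (a stream of 8-byte records, each a 32-bit slot index and a 32-bit size, where size zero means "free that slot") and plain malloc()/free() standing in for the allocator under test:
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NSLOTS (1 << 20)

    struct req { uint32_t slot, size; };

    int main(int argc, char **argv)
    {
        static void *slots[NSLOTS];   /* live allocations, indexed by slot */
        struct req r;
        FILE *log = fopen(argc > 1 ? argv[1] : "alloc.log", "rb");
        if (!log) { perror("fopen"); return 1; }
        while (fread(&r, sizeof r, 1, log) == 1) {
            if (r.slot >= NSLOTS) continue;
            free(slots[r.slot]);      /* no-op if the slot is empty */
            slots[r.slot] = r.size ? malloc(r.size) : 0;
        }
        fclose(log);
        return 0;
    }
The same log file can then be replayed against each allocator configuration you want to compare.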
EDIT: Of course, at that point you start to get into debates about "how representative" different workloads are, but at least you could potentially share your log with someone who had a different computer and have them run benchmarks, too, and at least the workload is not synthetic but relates to some actual possible program. Moving from synthetic benchmarks to log-based ones was a big thing in OS research on filesystems in the 1970s & 80s and it helped the "generalizability of results" (I think, anyway..I'm sure many still do only synthetic benchmarking due to its convenience).
EDIT2: I just thought of a cute name for the log replay half-step between synthetic microbenchmarks and a full application - "The millibenchmark". ;-) Maybe someone else has thought of this before, though.
Recording traces of allocation requests from actual applications for allocator-benchmarking purposes revolutionized allocator performance research in the late 90s, if I recall correctly, in addition to the earlier filesystem research you mentioned. Benchmarks based on those traces were the bedrock of Zorn's case that generational GCs were faster in practice than malloc().
But I think (though possibly this is motivated reasoning) that benchmarking some workload is good enough for many optimizations. A fixed-size-block allocator is already pretty restricted in what workloads you can run on it, after all, and its performance characteristics are a lot simpler than something like first-fit. Showing that an optimization makes some workload faster is infinitely better than not knowing if it makes any workload faster. Showing that it generalizes across many workloads is better, of course, but it's not nearly the same step in utility.
Well, you can use it for everything. I've used it for in-memory hash tables, trees, etc. To be clear, for those who may not understand, the "interpretation" in these "millibenchmarks" for something as simple as an allocator is not like Python or Bash or something slow. It could literally be "mmap a binary file of integers and either alloc/free the binary numbers". So, yeah, there might be some CPU pipeline overhead in unpredictable branches and you might still want to try to measure that, but beyond that there is basically no work and any real workload might have one-ish hard to predict branch leading to allocator operations. So, even unadjusted for overhead it's not so bad. And it keeps your measurements fast so you can get statistically many of them even if there is an "inner min loop" to try to isolate the best case or something. (Well, at least if what you are measuring is fast.) The downside/tradeoff is just lack of fidelity to interaction effects in the full application like resource competition/branch predictor disruption/etc.
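A sketch of that "inner min loop" idea: repeat the measurement several times and keep the minimum, which tends to estimate the best case with less noise. The workload here is a trivial stand-in; in practice the log replay (or whatever you're measuring) would go inside the timed region.
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double best = 1e30;
        for (int run = 0; run < 10; run++) {
            double t0 = now();
            for (int i = 0; i < 1000000; i++)  /* stand-in workload */
                free(malloc(64));
            double dt = now() - t0;
            if (dt < best) best = dt;
        }
        printf("best of 10 runs: %.6f s\n", best);
        return 0;
    }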
You could probably set up a https://github.com/c-blake/nio file for it (or other) if you wanted to be systematic. A problem with the "text-newline" part of the "unix philosophy" is that it incurs a lot of unneeded binary->ascii->binary cycles, which is actually more programming work as well as more CPU work, and it's really fairly orthogonal to the rest of the Unix ideas.
EDIT reply to parent EDIT
> benchmarking some workload is good enough
Fair enough, except that "good enough" is sort of "famous last words" if anyone is reviewing your work. :-) :-) Some will complain you only did Intel not AMD or not ARM or not xyz or Apple Silicon or whatever other zillion variables. Something is always better than nothing, though. It should be thought of more like a Bayesian thing - "no surprise for untested situations" or maybe "confidence in proportion to evidence/experience".
FWIW, I thought your big advice to 8dcc was pretty decent.