It should be kept in mind that these are microbenchmarks; I'd guess the much bigger AVX2 instruction sequence will have visible cache effects in macrobenchmarks (i.e. combined with a mix of other instructions) and lose its small lead --- on Haswell AVX2 is ~17% faster, on Skylake only 6%.
POPCNT is a single instruction and should definitely be inlined; I doubt that would be such a good idea for these longer sequences, but then putting it in a function means the call+return overhead could also become significant. It's hard to quantify without doing actual measurements, but with the figures given I'd be inclined to stick with POPCNT.
Hi, author here. You're perfectly right, microbenchmarks have flaws. Procedures should be run for different data sizes (as I did years ago) and the code should be compiled with different compilers. And of course the method described in the text is designed to deal with large data. Using it for counting bits in a single 64-bit value would be... not wise. :)
AVX2's speedup over POPCNT is not big, but it seems that's down to my indolence/stupidity. I've just merged a pull request by Simon Lindholm (https://github.com/WojciechMula/sse-popcount/pull/2) with manually unrolled loops, and it made the AVX2 code 40% faster than POPCNT.
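For readers unfamiliar with what "manually unrolled" means here, a minimal scalar sketch of the idea follows. This is not the code from the pull request (which unrolls the AVX2 loop); it only illustrates unrolling with independent accumulators. The function name `popcnt_unrolled4` is hypothetical, and `__builtin_popcountll` assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the set bits in n 64-bit words.
 * The main loop is unrolled 4x with separate accumulators so the
 * four popcounts in one iteration do not depend on each other. */
uint64_t popcnt_unrolled4(const uint64_t *data, size_t n)
{
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    size_t i = 0;

    /* Unrolled body: four words per iteration. */
    for (; i + 4 <= n; i += 4) {
        c0 += (uint64_t)__builtin_popcountll(data[i + 0]);
        c1 += (uint64_t)__builtin_popcountll(data[i + 1]);
        c2 += (uint64_t)__builtin_popcountll(data[i + 2]);
        c3 += (uint64_t)__builtin_popcountll(data[i + 3]);
    }

    /* Tail: the remaining 0..3 words. */
    for (; i < n; i++) {
        c0 += (uint64_t)__builtin_popcountll(data[i]);
    }

    return c0 + c1 + c2 + c3;
}
```

Whether this kind of unrolling helps depends on the compiler and the CPU, so it needs measuring rather than assuming.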
The article is really targeting popcount for large bitvectors; in this scenario, call overhead is a non-issue (constant cost of <5-10 cycles vs O(n) work to be performed). I$ effects will also be negligible (the AVX2 instruction sequence is only a few cachelines long, and you wouldn't inline it).
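To make the amortization argument concrete, here is a rough sketch of a bitvector popcount kept deliberately out of line. The name `popcnt_buffer` and the byte-oriented interface are assumptions for illustration, not the article's API; `__attribute__((noinline))` and `__builtin_popcountll` assume GCC or Clang. The call and return are paid once per buffer, while the loop does work proportional to `len`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical out-of-line popcount over a byte buffer.
 * The call/return cost is constant; the loop is O(len). */
__attribute__((noinline))
uint64_t popcnt_buffer(const uint8_t *data, size_t len)
{
    uint64_t total = 0;
    size_t i = 0;

    /* Process 8 bytes at a time; memcpy avoids unaligned-access issues. */
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, data + i, sizeof word);
        total += (uint64_t)__builtin_popcountll(word);
    }

    /* Remaining 0..7 bytes. */
    for (; i < len; i++) {
        total += (uint64_t)__builtin_popcountll(data[i]);
    }

    return total;
}
```

For a buffer of a few kilobytes the loop runs hundreds of iterations per call, so a handful of cycles of call overhead disappears in the noise; for a single 64-bit value the same overhead would dominate, which is the case where an inlined POPCNT wins.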