It should be kept in mind that these are microbenchmarks; I'd guess the *much* b...

wmu · on March 13, 2016

Hi, author here. You're perfectly right, microbenchmarks have flaws. Procedures should be run for different data sizes (as I did years ago) and code should be compiled with different compilers. And of course the method described in the text is designed to deal with large data. Using it for counting bits in a 64-bit value would be... not wise. :)

AVX2's speedup over POPCNT is not big, but it seems that's down to my indolence/stupidity. I've just merged a pull request by Simon Lindholm (https://github.com/WojciechMula/sse-popcount/pull/2) with manually unrolled loops, and it made the AVX2 code 40% faster than POPCNT.
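
For readers unfamiliar with manual unrolling, here is a minimal sketch of the general idea applied to a plain scalar popcount loop. It is not the code from the pull request: the function name `popcount_unrolled4` is made up for illustration, and `__builtin_popcountll` assumes GCC or Clang. Using four independent accumulators gives the CPU more independent work per iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from sse-popcount): counts the set bits in
   n 64-bit words. The loop body is unrolled four times by hand and uses
   four independent accumulators, so the additions do not form a single
   dependency chain. __builtin_popcountll is a GCC/Clang builtin. */
static uint64_t popcount_unrolled4(const uint64_t *data, size_t n)
{
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    size_t i = 0;

    /* Main loop: four words per iteration. */
    for (; i + 4 <= n; i += 4) {
        c0 += (uint64_t)__builtin_popcountll(data[i + 0]);
        c1 += (uint64_t)__builtin_popcountll(data[i + 1]);
        c2 += (uint64_t)__builtin_popcountll(data[i + 2]);
        c3 += (uint64_t)__builtin_popcountll(data[i + 3]);
    }

    /* Tail: the remaining 0 to 3 words. */
    for (; i < n; i++)
        c0 += (uint64_t)__builtin_popcountll(data[i]);

    return c0 + c1 + c2 + c3;
}
```

Whether unrolling helps, and by how much, depends on the compiler and the CPU, so it is worth measuring rather than assuming.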

stephencanon · on March 13, 2016

The article is really targeting popcount for large bitvectors; in this scenario, call overhead is a non-issue (constant cost of <5-10 cycles vs O(n) work to be performed). I$ effects will also be negligible (the AVX2 instruction sequence is only a few cachelines long, and you wouldn't inline it).

wmu · on March 15, 2016

I did some tests and it seems that the AVX2 code is faster than popcnt when the input size is 512 bytes or more.
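
A size-based dispatch along these lines could be sketched as follows. This is only an illustration: the names `popcount_bytes`, `popcount_small`, `popcount_large` and `LARGE_INPUT_THRESHOLD` are hypothetical, the 512-byte cutoff is taken from the measurement above and will differ between machines, and the large-input path is a stand-in that reuses the scalar loop where a real build would call an AVX2 kernel. `__builtin_popcountll` and `__builtin_popcount` assume GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cutoff in bytes, taken from the measurement quoted above;
   it is machine-dependent and should be re-measured. */
#define LARGE_INPUT_THRESHOLD 512

/* Scalar path: 8 bytes at a time, then the remaining bytes one by one. */
static uint64_t popcount_small(const uint8_t *data, size_t size)
{
    uint64_t total = 0;
    size_t i = 0;

    for (; i + 8 <= size; i += 8) {
        uint64_t word;
        memcpy(&word, data + i, sizeof word); /* safe for unaligned input */
        total += (uint64_t)__builtin_popcountll(word);
    }
    for (; i < size; i++)
        total += (uint64_t)__builtin_popcount(data[i]);

    return total;
}

/* Stand-in for a vectorized kernel. A real implementation would call an
   AVX2 routine here; this sketch just reuses the scalar path. */
static uint64_t popcount_large(const uint8_t *data, size_t size)
{
    return popcount_small(data, size);
}

/* Picks a path based on the input size. */
uint64_t popcount_bytes(const uint8_t *data, size_t size)
{
    if (size >= LARGE_INPUT_THRESHOLD)
        return popcount_large(data, size);
    return popcount_small(data, size);
}
```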