This article is a little confusing. I think this is a roundabout way to invent the blocked bloom filter with k=2 bits inserted per element.
It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented block bloom filters from there.
Overall, I think blocked bloom filters should be the default most people reach for. They completely solve the cache locality issue (a single cache miss per element lookup), and they cost only about a 10–15% space increase to do it. I had a simple implementation running at something like 20ns per query with maybe k=9. It would be about 9x that for naive Bloom filters.
There’s some discussion in the article about using a single hash to come up with various indexing locations, but it’s simpler to just think of block bloom filters as:
1. Hash-0 gets you the block index
2. Hash-1 through hash-k get you the bits inside the block
If your implementation slices up a single hash to divide it into multiple smaller hashes, that’s fine.
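That two-level view can be sketched directly. This is a toy Python version with my own naming (not from the article): each block is 512 bits, so in a C implementation it would be one 64-byte aligned cache line, and all k probes for a key land inside it. It derives hash-0 through hash-k by slicing a single wide hash, as described above.

```python
import hashlib

class BlockedBloomFilter:
    """Toy blocked Bloom filter: one 512-bit block per key, so a
    lookup touches (in a native implementation) a single cache line."""

    BLOCK_BITS = 512  # 64 bytes, one cache line

    def __init__(self, num_blocks=1024, k=9):
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks  # each block stored as a 512-bit int

    def _hashes(self, key):
        # Slice one wide hash into k+1 smaller hashes:
        # hash-0 picks the block, hash-1..k pick bits inside it.
        digest = hashlib.sha256(key.encode()).digest()
        block = int.from_bytes(digest[:4], "little") % self.num_blocks
        bits = [int.from_bytes(digest[4 + 2*i: 6 + 2*i], "little") % self.BLOCK_BITS
                for i in range(self.k)]
        return block, bits

    def add(self, key):
        block, bits = self._hashes(key)
        for b in bits:
            self.blocks[block] |= 1 << b

    def __contains__(self, key):
        block, bits = self._hashes(key)
        return all(self.blocks[block] & (1 << b) for b in bits)
```

The price of confining all k bits to one block is a somewhat higher FPR than an unblocked filter of the same total size, which is where the ~10–15% extra space comes from.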
Yeah, I kind of think the authors didn't conduct a thorough-enough literature review here. There are well-known relations between the number of hash functions you use and the FPR, cache-blocking and register-blocking are classic techniques ("Cache-, Hash-, and Space-Efficient Bloom Filters" by Putze et al.), and there are even ways of generating patterns from only a single hash function that work well (shamelessly shilling my own blogpost on the topic: https://save-buffer.github.io/bloom_filter.html)
I also find the use of atomics to build the filter confusing here. If you're doing a join, you're presumably doing a batch of hashes, so it'd be much more efficient to partition your Bloom filter, lock the partitions, and do a bulk insertion.
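A sketch of that batch pattern (single-threaded Python toy, names and sizes are mine): group the hashes by partition first, then insert one partition at a time, so each partition is touched by exactly one pass. In a real join build you would guard each partition with its own lock, or have threads own disjoint partitions, instead of per-bit atomics.

```python
from collections import defaultdict

NUM_PARTITIONS = 8
BITS_PER_PARTITION = 1 << 16
K = 3  # probe bits per key

def hash_key(key, seed):
    # Toy hash; a real implementation would use xxhash/Murmur etc.
    return hash((seed, key)) & 0xFFFFFFFF

def bulk_insert(partitions, keys):
    """Group keys by partition, then set bits partition by partition.
    With one lock (or one owning thread) per partition, no per-bit
    atomics are needed during the build."""
    grouped = defaultdict(list)
    for key in keys:
        grouped[hash_key(key, 0) % NUM_PARTITIONS].append(key)
    for part_idx, part_keys in grouped.items():
        bits = partitions[part_idx]  # acquire this partition's lock here
        for key in part_keys:
            for i in range(1, K + 1):
                bits.add(hash_key(key, i) % BITS_PER_PARTITION)

def contains(partitions, key):
    bits = partitions[hash_key(key, 0) % NUM_PARTITIONS]
    return all(hash_key(key, i) % BITS_PER_PARTITION in bits
               for i in range(1, K + 1))
```

Each partition here is a Python set standing in for a bitset, purely for readability.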
Your blogpost is great! Except for one detail: you used modulo n. If n is not known at compile time, multiply+shift is much faster [1]. Division and modulo (remainder) are slow, except on Apple silicon (I don't know what they did there). BTW, for blocked Bloom filters there are some SIMD variants that seem to be simpler than yours [2] (maybe I'm wrong, I didn't look at the details; it just seems yours uses more code). I implemented a register-blocked one in Java here [3].
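For reference, the multiply+shift reduction is just a couple of lines (assuming h is uniform over 32 bits):

```python
def fast_range(h: int, n: int) -> int:
    """Map a uniform 32-bit hash into [0, n) without division/modulo.
    h * n is a 64-bit product; its top 32 bits give the bucket index."""
    return (h * n) >> 32

# Sanity check: results stay in range and preserve the order of h,
# rather than wrapping around the way modulo does.
assert fast_range(0, 1000) == 0
assert fast_range(2**32 - 1, 1000) == 999
```

In C this compiles to one multiply and one shift, versus a much slower hardware divide for `h % n`.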
Bulk insertion: yes, if there are many keys, bulk insertion is faster. For xor filters, I used radix sort before insertion [4] (I should have documented the code better), but for fuse filters and blocked Bloom filters it might not be worth it, unless the filter is huge.
Very interesting blog post. I’d never seen that method for quickly computing the patterns. I thought I had done a lot of research on bloom filters, too!
Post author here. Yes, you are correct. I made this code change 3 years ago when I was a junior dev, and I was not familiar with blocked bloom filters at that time. Looking back, it’s cool to see that I accidentally reinvented a basic blocked bloom filter.
I was also limited by the constraints of legacy code. This project was not a complete rewrite, just an idea: "can we use more information from that 32-bit hash we receive without regressing any perf?" We didn't have time for deep research or a rewrite, so I just wanted to show the result of this small exercise in how we can make things run better without rewriting the world.
> Overall, I think block bloom filters should be the default most people reach for.
I think this depends on how big your filters are. Most people think of Bloom filters as having to have hundreds of thousands of elements, but I frequently find them useful all the way down to 32 bits (!). (E.g., there are papers showing chained hash tables where each bucket has a co-sited tiny Bloom filter to check whether it's worth probing the chain.) In the “no man's land” in between, with a couple tens of thousands of buckets, the blocking seems to be mostly negative; it only makes sense as long as you actually keep missing the cache.
I never implemented their hash table, but it opened my eyes to the technique of a tiny Bloom filter, which I've used now a couple of times to fairly good (if small) effect. :-)
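The tiny-Bloom-filter trick can fit in a single machine word. Here's a toy sketch of the chained-bucket idea (my own layout, not from the papers mentioned: a 32-bit filter with k=2 probe bits per key):

```python
import zlib

class Bucket:
    """Chained hash-table bucket with a co-sited 32-bit Bloom filter.
    One AND against the filter tells us whether a probe can possibly
    be in this bucket before we bother walking the chain."""

    def __init__(self):
        self.bloom = 0   # the tiny 32-bit filter
        self.chain = []  # list of (key, value) pairs

    @staticmethod
    def _mask(key):
        h = zlib.crc32(key.encode())
        # two probe bits, derived from different halves of the hash
        return (1 << (h & 31)) | (1 << ((h >> 16) & 31))

    def insert(self, key, value):
        self.bloom |= self._mask(key)
        self.chain.append((key, value))

    def lookup(self, key):
        m = self._mask(key)
        if self.bloom & m != m:
            return None  # definitely not here: chain walk skipped
        for k, v in self.chain:
            if k == key:
                return v
        return None  # rare false positive: walked the chain for nothing
```

For mostly-missing lookups, the one-word filter check avoids chasing the chain's pointers at all, which is where the "small but fair" win comes from.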
Yeah, I agree with this. I think there are open addressing hash tables like Swiss Table that do something similar. IIRC, they have buckets with a portion at the beginning with lossy “fingerprints” of items, which kind of serve a similar purpose as a bloom filter.
Problem is, a Bloom filter isn’t close to the theoretical space lower bound for what it implements, and if you add another 15% on top, it starts becoming attractive to switch to a structure that gets a tighter bound on space.
Cause it was written by AI. The entire mid-section is classic AI slop writing: repeating the same points and numbers over and over, repackaging the same idea with "key takeaway" and shit. The voice of the author is heavily AI-coded there.
I wonder if the “spawn” API is ever preferable over “fork”. Do we really want to remove context if we can help it? There will certainly be situations where we have to, but then what you want is good compaction for the subagent. “Clean-slate” compaction seems like it would always be suboptimal.
Is there any reason to explicitly have this binary decision, instead of a single primitive where the parent dynamically defines the child's context? That would naturally result in either spawn, fork, or anything in between.
I would broadly expect software made by most camera brands to be shit, while I would expect developers who create their own hardware projects (generally, not talking specifically about cameras) to range from idiots who have no idea what they're doing (like me, though to be fair I also wouldn't release it believing it to be good) to highly skilled coders who would get it right despite being on their own.
So I wouldn't automatically assume that a product like this would be better designed, but I would think there's a chance it might have been!
What are your thoughts on how Bevy's developing UI toolkit compares (in terms of goals and use cases) to some of the other Rust efforts in the space (egui, xilem, iced, etc.)? Do you expect it will be specialized/limited to scene development for games?
To kill a mosquito, you need "a few tens of millijoules, delivered within a few milliseconds" [0], so let's say 10 W. To destroy the Earth (so that it turns into scattered dust and never reforms) you need about 10^32 J [1]; if we assume this is applied over maybe 100 s, the laser would be 10^30 W.
So the log10 scale goes from 1–30, where mosquitos die at 1 and the Earth dies at 30. The 2 PW in the article is about a 15.3. The Vulcan 20-20 project (set to complete in 2029) will deliver about 20 PW, or a 16.3 on the mosquito–Death Star scale [2].
So on a log scale, we're over halfway to building the Death Star.
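The scale in a few lines (the endpoint figures are this comment's estimates, not precise physics):

```python
import math

MOSQUITO_W = 10       # ~tens of millijoules over a few milliseconds
DEATH_STAR_W = 1e30   # ~1e32 J delivered over ~100 s

def mds_scale(power_watts):
    """Position on the log10 mosquito-Death Star power scale."""
    return math.log10(power_watts)

print(mds_scale(MOSQUITO_W))    # ≈ 1   (mosquito)
print(mds_scale(DEATH_STAR_W))  # ≈ 30  (Death Star)
print(mds_scale(2e15))          # ≈ 15.3 (the 2 PW laser in the article)
print(mds_scale(2e16))          # ≈ 16.3 (Vulcan 20-20, 20 PW)
```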
For sure; using total energy delivered makes a lot more sense. But then I think it would be better to use whatever tool humanity has that delivers the max total energy; let’s say Tsar Bomba.
Let’s say the mosquito is 1 again, so Death Star is 34. Tsar Bomba would be about 17.3. Over halfway again!
It’s kind of surprising that our max power output and max energy output are about the same on these scales.
I think this is the crux of the assumption right here. It sounds like this is applied for well under a nanosecond.
I think we're closer to maybe killing a mosquito than "half way to building a Death Star on a log scale" (which, I guess is already much closer to a mosquito than a planet).
They used log10, so each step is 10x the previous; in a linear sense, the value only doubles going from about 29.7 to 30 (since log10(2) ≈ 0.3). But it seems that humans tend to improve tech at exponential rates, constantly making improvements here and there that keep stacking up, at least for things that are actually in a developmental stage anyway.
Say your "endstage" goal is GPU with 200 billion transistors. Using linear scale, the current biggest GPU is only halfway there, and it took all of human civilization to get this far, and it will take another civilization to get to 200b. In reality, we'll have that in a couple years with our current civilization.
A hypothetical "death star" project like this would require improvements in energy generation/storage capacity/etc., which haven't improved in nearly the way transistor production has (and are also much more limited by physical realities, such as the specific heat, enthalpy of combustion etc. of materials).
Yes, extremely high sustained-power lasers still have a hard time competing with hypersonic projectiles in energy delivered. The difference is being able to throw nuclei at the problem.
I’m still not sure what 15.3 on the MDS scale can destroy but I am sure the Emperor will be pleased to hear that we are half-way to building the Death Star.
Well, mass scales as the cube of radius, and we have 15 orders of magnitude to work with, so I guess it should be an object on the order of hundreds of meters in radius. But as noted, the duration of firing matters as well. Given https://news.ycombinator.com/item?id=44054239, the actual laser can only vaporize much smaller things.
https://www.hackerfactor.com/blog/index.php?%2Farchives%2F10...