Cuckoo filters and their analysis

jbapple · on May 29, 2016

Note that the comparative experiments in the original Bloom filter paper are very sensitive to the implementation details of the other Bloom filters they are comparing to. For instance, in the implementation of blocked Bloom filters, they get a false positive rate of 0.43% when using 13 bits per element. I get 0.27%, perhaps because in each small Bloom filter I am using a split Bloom filter.

I emailed one of the authors of the original paper, and, while their Cuckoo filter implementation is available, they have lost their experimental setup and comparison code.

lqdc13 · on May 29, 2016

There is also quotient filter that has theoretical analysis and implementation (1).

Also smaller than bloom filter and lets you delete.

(1) https://github.com/vedantk/quotient-filter

dgacmu · on May 29, 2016

Quotient filters are cool, but are both modestly less memory-efficuent and quite a bit slower than Cuckoo filters. Their advantage is primarily theoretical -- Cuckoo filters force a minimum false positive rate that grows with logN/4 (I.e., if you insert 2^32 items, you must use at least 8 bits per item inserted). In practice, that's usually not much of a problem, but its why we say the Cuckoo filter is best for applications where you want a 1% or lower false positive rate.

(disclosure: I'm one of the inventors of Cuckoo filters. Happy to answer questions about them, and very excited to see theoretical progress on them!)

prashnts · on May 29, 2016

Is there any way of being able to store a corresponding value to the elements in the set? I understand that hash tables do that -- however, what I wish to ask is that do we have a space efficient data structure for key-value pairs?

dgacmu · on May 29, 2016

Yes, many - but it depends how practical you want it to be and what tradeoffs you can accept. Broadly speaking, you're asking about the class of data structures for (almost) minimal perfect hashing. For example:

* ECT, or Entropy-Coded Tries, (paper: http://www.cs.cmu.edu/~hl/papers/ect-alenex2013.pdf ) are a static data structure (no insertions after you build it) that are very fast to construct, take about 7 microseconds of CPU time to lookup, and have 2.5 bits of overhead per item inserted. This is the best data structure I know of for external perfect hashing, where you want to store the key/value pair itself on something expensive to access, like an SSD. Implementation available in SILT: https://github.com/silt/silt

* CMPH (http://cmph.sourceforge.net/concepts.html) and friends are very fast MPH functions when you want to store the key value pairs in memory.

* There was a really cool paper at SEA a few years ago on Retrieval and Perfect Hashing using Fingerprinting (http://link.springer.com/chapter/10.1007%2F978-3-319-07959-2...) that came out of SAP. I don't know if they have code available. It's also internal, but uses the same fingerprint-centric idea as ECT, but in a faster way that doesn't require the expensive huffman coding that ECT uses.

There's a ton of very fun research in this area -- this is just the tip of the iceberg (and I haven't really looked at it for two years, so there's probably been more since).

prashnts · on May 29, 2016

Thanks for these! I haven't given a thorough read yet, will do so later today.

Probably off topic, however, in the CMPH link you quote above, what I gather about the MPH functions is that in any practical setting they probably would require a combination of hash functions to avoid collisions -- right?

dgacmu · on May 29, 2016

Because they're internal hash functions, they have access to the original key being used, so they can tolerate a few collisions -- in the implementation of CHD, there's a -t parameter for number of collisions in a bin to provide a bit of flexibility. So they don't require a completely collision-free hash function, just one that on the input dataset is not going to produce more than a few.

B1FF_PSUVM · on May 29, 2016

In comments:

"other things that do work well in practice but not in theory?"

"Levit's algorithm for shortest paths in a graph"

Hmm. Quick search, seems it's called Pape-Levit for independent discoveries, twenty year old benchmark paper at:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.54....