Probabilistic data structures are a useful tool in every programmer's tool belt. Imagine you have an expensive lookup operation, for example fetching a file over the network or querying a slow, huge database. You only want to spend time and resources on this call if there's actually a result to retrieve!

This is where probabilistic filters come in handy: they are a lightweight data structure that provides a lookup function that

  1. never returns false for a key that is present, i.e. if it returns false, the key is definitely missing
  2. returns true for a key that is present, but may also, with low likelihood, return true for a missing key (a false positive)

After evaluating the existing implementations on GitHub, I created a new, optimized one at panmari/cuckoofilter.

Background

We'll look at two different algorithms that are used for implementing probabilistic filters: bloom filters and cuckoo filters.

This page does an awesome job explaining the theory behind, and the differences between, the two. Here I'm focusing on comparing implementations.

How to evaluate implementations

As established above, these data structures are probabilistic: when a lookup returns true, there's a chance that the key is still not present in the underlying set. We consider this case a false positive. For all filters, the false positive rate is a function of the memory consumed.

Evaluating this trade-off is my primary goal. Secondarily, I'm also interested in the benchmark timings when interacting with the filter. Where possible, parameters were chosen to give comparable false positive rates (with a target of 0.001).

The evaluation

I found these popular implementations on GitHub and gave them all a spin.

The code used for the evaluation is in this repo.

Memory consumption

Only vedhavyas/cuckoo-filter sticks out here, using much more memory than the rest. Also note how panmari/cuckoofilter uses double the memory of seiflotfy/cuckoofilter (more on this later).

Memory consumption by implementation and number of entries

False positive rate

Again, there are outliers: seiflotfy/cuckoofilter and vedhavyas/cuckoo-filter. Neither offers a way to parametrize the false positive rate.

False positive rate by implementation and number of entries

Removing these two entries paints a clearer picture among the other contenders.

False positive rate by implementation and number of entries

Benchmarks

Note that the benchmarks are single threaded. This puts some libraries at a disadvantage, as they only offer thread-safe methods (using an internal lock). For contains calls, there are two separate benchmarks: one with the item contained, the other with the item missing.

Two implementations stick out with much worse performance. Comparing the three more efficient implementations, cuckoo filters have the edge across all benchmarks.

Conclusion

In terms of the false positive rate vs. memory trade-off, steakknife/bloomfilter and panmari/cuckoofilter stand out. Both give great value for the memory you invest!

When it comes to the runtime benchmarks of Contains calls, the cuckoo filters tend to be faster than the bloom filters. There's a good explanation for that: cuckoo filters only need to check two positions, i.e. they incur at most two cache misses. Bloom filters need to access one memory location per hash function, i.e. they can incur up to k cache misses.

Configurability is better for bloom filters: all implementations tested have construction-time parameters to tweak the memory vs. false positive rate trade-off. For cuckoo filters, this is harder, and the relevant parameters are baked into all cuckoo implementations tested. This is also how my own implementation at panmari/cuckoofilter achieves a better false positive rate while using double the memory of seiflotfy/cuckoofilter: it uses a fingerprint size of 16 bits, as opposed to 8 bits.

The authors of “Cuckoo Filter: Practically Better Than Bloom” derived the relation between fingerprint size f and false positive rate r:

f >= log2(2 * 4 / r) bits (where 4 is the number of entries per bucket)

With a 16-bit fingerprint, you can expect r ~= 0.0001. An 8-bit fingerprint corresponds to r ~= 0.03.

As so often in software engineering, it comes down to trade-offs.