Probabilistic data structures are a useful tool for every programmer's tool belt. Imagine you have an expensive lookup operation, for example fetching a file over the network or hitting a slow, huge database. You only want to spend time and resources on this call if there's actually a result to retrieve!
This is where probabilistic filters come in handy: they are lightweight data structures that provide a lookup function that
- returns `false` if and only if the key is missing
- returns `true` with high likelihood if the key is present
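To make that guard pattern concrete, here's a minimal Go sketch. The `Filter` interface, the map-backed stand-in, and `expensiveLookup` are all hypothetical illustrations for this post, not the API of any library discussed below (a real probabilistic filter would use far less memory than the map and occasionally return false positives).

```go
package main

import "fmt"

// Filter is a hypothetical stand-in for a probabilistic filter's lookup API.
type Filter interface {
	Lookup(key []byte) bool
}

// mapFilter is a toy, exact implementation used only for this sketch.
type mapFilter map[string]bool

func (m mapFilter) Lookup(key []byte) bool { return m[string(key)] }

// expensiveLookup simulates a slow network or database call.
func expensiveLookup(key []byte) (string, bool) {
	return "value for " + string(key), true
}

// guardedLookup only pays for the expensive call when the filter
// says the key might be present.
func guardedLookup(f Filter, key []byte) (string, bool) {
	if !f.Lookup(key) {
		// Guaranteed miss: skip the expensive call entirely.
		return "", false
	}
	return expensiveLookup(key)
}

func main() {
	f := mapFilter{"known-key": true}
	if v, ok := guardedLookup(f, []byte("known-key")); ok {
		fmt.Println(v) // value for known-key
	}
	_, ok := guardedLookup(f, []byte("missing-key"))
	fmt.Println(ok) // false
}
```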
After evaluating the existing implementations on GitHub, I created a new, optimized one at panmari/cuckoofilter.
Background
We'll look at two different algorithms that are used for implementing probabilistic filters: Bloom filters and cuckoo filters.
This page does an awesome job explaining the theory behind, and the differences between, the two. Here I'm focusing on comparing implementations.
How to evaluate implementations
As established above, these data structures are probabilistic: when they return `true`, there's a chance that the key is still not present in the underlying set. We consider this case a false positive. For all filters, the rate of false positives is a function of the consumed memory.
Evaluating this trade-off is my primary goal. Secondarily, I'm also interested in the benchmark timings when interacting with the filter.
Where possible, parameters were chosen to give comparable false positive rates (with a target of 0.001).
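To make "false positive rate as a function of consumed memory" concrete, here's a toy experiment in Go: a minimal Bloom filter (my own sketch, not one of the evaluated libraries) sized at roughly 15 bits per key with 10 hash functions, which should land near the 0.001 target. All names here are illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter sketch: m bits and k hash
// functions derived from one 64-bit hash via double hashing.
type bloom struct {
	bits []uint64
	m, k uint64
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *bloom) hashes(key string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	// Split the 64-bit hash into two halves for double hashing.
	return sum & 0xffffffff, sum >> 32
}

func (b *bloom) Insert(key string) {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

func (b *bloom) Lookup(key string) bool {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	const n = 10000
	// ~15 bits per key with k=10 targets a false positive rate near 0.001.
	b := newBloom(15*n, 10)
	for i := 0; i < n; i++ {
		b.Insert(fmt.Sprintf("present-%d", i))
	}
	// Query n keys that were never inserted and count false positives.
	falsePositives := 0
	for i := 0; i < n; i++ {
		if b.Lookup(fmt.Sprintf("absent-%d", i)) {
			falsePositives++
		}
	}
	fmt.Printf("empirical false positive rate: %.4f\n", float64(falsePositives)/n)
}
```

Shrink `m` and the measured rate climbs; grow it and the rate falls. That curve, per library, is what the evaluation below plots.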
The evaluation
I found these popular implementations on GitHub and gave them all a spin:
- AndreasBriese/bbloom
- panmari/cuckoofilter
- seiflotfy/cuckoofilter
- steakknife/bloomfilter
- vedhavyas/cuckoo-filter
The code used for the evaluation is in this repo.
Memory consumption
Only vedhavyas/cuckoo-filter sticks out here for using much more memory. Also note how panmari/cuckoofilter uses double the memory of seiflotfy/cuckoofilter (more on this later).
False positive rate
Again there are some outliers: seiflotfy/cuckoofilter and vedhavyas/cuckoo-filter. Neither offers a way to parametrize the false positive rate. Removing these two entries paints a clearer picture of the remaining contenders.
Benchmarks
Note that the benchmarks are single threaded. This puts some libraries at a disadvantage, as they only offer thread-safe methods (using an internal lock). For `Contains` calls, there are two separate benchmarks: one with the item contained, the other with the item missing.
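The shape of these hit/miss benchmarks can be sketched as follows. The map-backed `filter` is a hypothetical stand-in for whichever library is under test; the sketch uses `testing.Benchmark` from the standard library so it runs outside a test binary.

```go
package main

import (
	"fmt"
	"testing"
)

// filter is a toy exact stand-in; the real evaluation swaps in
// each library under test here.
type filter map[string]bool

func (f filter) Contains(key []byte) bool { return f[string(key)] }

func main() {
	f := filter{}
	for i := 0; i < 1000; i++ {
		f[fmt.Sprintf("key-%d", i)] = true
	}
	hit := []byte("key-0")
	miss := []byte("not-there")

	// Two separate benchmarks, mirroring the evaluation: one where
	// the item is contained, one where it is missing.
	resHit := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			f.Contains(hit)
		}
	})
	resMiss := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			f.Contains(miss)
		}
	})
	fmt.Println("contains (hit): ", resHit)
	fmt.Println("contains (miss):", resMiss)
}
```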
Two implementations stick out with much worse performance. Comparing the three more efficient implementations, the cuckoo filters have the edge across all benchmarks.
Conclusion
In terms of the false positive rate vs. memory trade-off, steakknife/bloomfilter and panmari/cuckoofilter stand out. Both give great value for the memory you invest!
When it comes to the runtime benchmarks of `Contains` calls, the cuckoo filters tend to be faster than the Bloom filters. There's a good explanation for that: a cuckoo filter only needs to check two positions, i.e. it has at most two cache misses. A Bloom filter needs to access one memory location per hash function, i.e. it can have up to k cache misses for k hash functions.
Configurability is better for Bloom filters: all implementations tested have construction-time parameters to tweak the memory/false-positive-rate trade-off. For cuckoo filters this is harder, and the relevant parameters are baked into all cuckoo implementations tested.
This is also how my own implementation at panmari/cuckoofilter achieves a better false positive rate while using double the memory of seiflotfy/cuckoofilter: it uses a fingerprint size of 16 bits, as opposed to 8 bits.
The authors of “Cuckoo Filter: Practically Better Than Bloom” derived the relation between fingerprint size f and false positive rate r (for buckets of size 4):

f >= log2(2 * 4/r) bits
With a 16 bit fingerprint, you can expect r ~= 0.0001. An 8 bit fingerprint corresponds to r ~= 0.03.
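Inverting that bound gives the false positive rate implied by a given fingerprint size: with bucket size b = 4, r <= 2b / 2^f = 8 / 2^f. A quick Go check reproduces both numbers (the function name is mine, for illustration):

```go
package main

import (
	"fmt"
	"math"
)

// maxFalsePositiveRate inverts the bound f >= log2(2b/r):
// with bucket size b = 4 and an f-bit fingerprint, r <= 8 / 2^f.
func maxFalsePositiveRate(fingerprintBits int) float64 {
	const bucketSize = 4
	return 2 * bucketSize / math.Pow(2, float64(fingerprintBits))
}

func main() {
	fmt.Printf("16-bit fingerprint: r <= %.5f\n", maxFalsePositiveRate(16)) // 0.00012
	fmt.Printf(" 8-bit fingerprint: r <= %.5f\n", maxFalsePositiveRate(8))  // 0.03125
}
```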
As is so often the case in software engineering, it all comes down to trade-offs.