How does one use Bloom filters in practice, given that items can't be removed? Does one just toss the filter regularly and start over?

How do you use a bloom filter? Because that's never been a limitation in my use cases.

The only times I've really needed one were when I had to know whether a db row had already been updated in the current import run. (A hash set of the keys doesn't fit in memory, and updates happen in batch/async, so the writes haven't been made yet.)

I.e. basically using it for dedupe
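
A minimal sketch of that dedupe pattern in Python, for illustration only. The names `BloomFilter` and `dedupe` are made up here, the sizes are arbitrary, and deriving the bit positions from one SHA-256 digest by double hashing is just one common choice, not necessarily what the commenter used.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: m bits, k positions per key."""

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so it is never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def dedupe(row_ids, bloom):
    """Yield each row id the filter has not seen yet.

    A false positive means a genuinely new id is occasionally skipped;
    an id that was already seen is never yielded twice.
    """
    for row_id in row_ids:
        if row_id in bloom:
            continue
        bloom.add(row_id)
        yield row_id
```

Since the filter only has to live for one import run, nothing ever needs to be removed from it: it is created at the start of the run and discarded at the end.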

mungoman2

The article mentions using it for indicating which things are in a cache. Not being able to delete entries seems like a severe limitation there: when the cache evicts an item, its bits stay set, so the false-positive rate keeps climbing over time.
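
To put rough numbers on that, the standard approximation for a Bloom filter with m bits, k hash functions, and n inserted keys is p ≈ (1 - e^(-kn/m))^k. The helper below is a sketch (the function name is invented here). For a cache, n counts every key ever inserted, not just those currently cached.

```python
import math


def false_positive_rate(m_bits: int, k_hashes: int, n_inserted: int) -> float:
    """Approximate false-positive probability after n_inserted distinct keys."""
    return (1.0 - math.exp(-k_hashes * n_inserted / m_bits)) ** k_hashes


# Filter sized at 10 bits per expected key with k = 7, then overfilled
# because evicted keys are never removed.
m, k = 10_000, 7
for n in (1_000, 2_000, 5_000):
    print(n, false_positive_rate(m, k, n))
```

At the designed load of 1,000 keys the estimate is a bit under 1%. Once the filter has absorbed several times that many keys, the estimate is far higher.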

thomasmg

If you need the ability to remove entries, I would consider the counting Bloom filter (which is simpler, but needs more space) or the cuckoo filter. You do need to be a bit careful, though: both can overflow if you are unlucky or if you add too many entries, in which case you have to rebuild them. So either be prepared for that case, or prevent overflow by using a cascade of filters.
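
A rough sketch of the counting variant in Python, to show both the removal and the overflow case. The class name and the cap of 15 (mimicking 4-bit counters) are assumptions for illustration. A real implementation would pack the counters into far less memory than a Python list.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting Bloom filter: small counters instead of bits."""

    def __init__(self, m_counters: int, k_hashes: int, max_count: int = 15):
        self.m = m_counters
        self.k = k_hashes
        self.max_count = max_count  # 15 mimics 4-bit counters (an assumption)
        self.counters = [0] * m_counters

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        # A set, so each counter is touched at most once per key.
        return {(h1 + i * h2) % self.m for i in range(self.k)}

    def add(self, key: str) -> None:
        positions = self._positions(key)
        # Check before mutating so a failed add leaves the filter unchanged.
        if any(self.counters[p] >= self.max_count for p in positions):
            raise OverflowError("counter saturated; rebuild with a larger filter")
        for p in positions:
            self.counters[p] += 1

    def remove(self, key: str) -> None:
        # Only valid for keys that were really added; removing anything else
        # would corrupt the counts of other keys.
        positions = self._positions(key)
        if any(self.counters[p] == 0 for p in positions):
            raise KeyError(key)
        for p in positions:
            self.counters[p] -= 1

    def __contains__(self, key: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(key))
```

The `OverflowError` path is the "be prepared" case: the caller has to rebuild the filter from the source of truth, typically with more counters.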

Some clever cache implementations internally use special kinds of Bloom filters, e.g. the Caffeine cache (Java) uses TinyLFU, which "builds upon Bloom filter theory" (from the abstract of the paper): https://github.com/ben-manes/caffeine and https://arxiv.org/pdf/1512.00727.pdf