Bloom filter

A Bloom filter is a data structure that allows computers to see if a given element occurs in a set.^[1]^[2] Bloom filters use hash functions to do this. For each element that is added, a hash value is calculated. When a new element is added, its hash value is compared to that of the other elements in the set. A Bloom filter is a probabilistic data structure. It is possible to get a false positive, but not to get a false negative.^[3] In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed.^[2] For each added element, the probability of getting a false positive grows.^[2]

Edward Bloom proposed the Bloom filter in 1970. In the article, Bloom supposes there is an algorithm to hyphenate words at the end of a line. According to the example, most words have simple hyphenation patterns. But about 10% of the words require time-consuming lookups to fetch the correct rule. His case was that of hyphenating about 500,000 words. He saw that using the "normal" error-free hashing techniques, storing the hyphenation patterns, would require a lot of memory. He found that using his technique, he could eliminate most lookups. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.

More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.^[4]

References[change | change source]

↑ Burton H. Bloom (July 1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors" (PDF), Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692, S2CID 7931252
↑ ^2.0 ^2.1 ^2.2 Intelligent Data Engineering and Automated Learning - IDEAL 2010: 11th international conference, Paisley, UK, September 1-3, 2010 : proceedings, eds. Colin Fyfe; et al (Berlin: SpringerLink, 2010), p. 162
↑ Rayan Chikhi; Guillaume Rizk, 'Space-efficient and exact de Bruijn graph representation based on a Bloom filter', Biomedical Research, Volume 36 (2014), p. 3 of 9
↑ Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An Improved Construction for Counting Bloom Filters", Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, vol. 4168, pp. 684–695, doi:10.1007/11841036_61, ISBN 978-3-540-38875-3

[Bloom1970-1] Burton H. Bloom (July 1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors" (PDF), Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692, S2CID 7931252

[Fyfe162-2] 2.0 ^2.1 ^2.2 Intelligent Data Engineering and Automated Learning - IDEAL 2010: 11th international conference, Paisley, UK, September 1-3, 2010 : proceedings, eds. Colin Fyfe; et al (Berlin: SpringerLink, 2010), p. 162

[3] Rayan Chikhi; Guillaume Rizk, 'Space-efficient and exact de Bruijn graph representation based on a Bloom filter', Biomedical Research, Volume 36 (2014), p. 3 of 9

[4] Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An Improved Construction for Counting Bloom Filters", Algorithms – ESA 2006, 14th Annual European Symposium (PDF), Lecture Notes in Computer Science, vol. 4168, pp. 684–695, doi:10.1007/11841036_61, ISBN 978-3-540-38875-3

[1]

[2]

[3]

[4]