Bloom filter: is there an element in the set?

4. 10. 2022

A Bloom filter is a probabilistic structure that allows us to determine, with a certain degree of probability, whether an element x is in the set M.

For example, in our system, we need to measure how many times users click on an advertising banner. After each click, we receive an HTTP GET request informing us that user XYZ clicked on ad 123. Occasionally, a user might accidentally double-click, resulting in two requests. We would like to filter out these duplicate requests and record them as a single click in the database.

Therefore, we need an application that reads the click message feed and filters out duplicates. How would we do this? Each message has an ID, so we don't need to compare the entire message; we just need to check if we have already processed a message with the same ID. Simple JavaScript code that removes duplicates from the stream (here, for simplicity, the array) of messages might look like this:

var messages = [{id: "123"}, {id:"456"}, {id:"123"}, {id: "789"}]
var processedIds = {};

messages.forEach(function(message) {
if (!processedIds[message.id]) {
processedIds[message.id] = true;
console.log(message);
}
});

// Vytiskne: { id: "123" } { id: "456" } { id: "789" }

This solution will suffice until we run out of memory – and we inevitably will, because the set of processedIds will only continue to grow over time. We can, of course, implement some data expiration programming, such as deleting messages older than one hour or similar measures.

What's worse is when even that is not enough. What if we need to have at least one day's worth of data in the processedIds set, and we're processing, say, ten thousand messages per second? In the worst case, we need to store up to 864,000,000 IDs. If we use uuid v4, each ID has a length of 36, which means we need to store at least 32 GB per day. That's a bit much. Couldn't the memory requirement be reduced?

Bloom filter

Instead of storing the entire ID, we can store just the hash of the ID. We can use the classic murmurhash, which returns a 32-bit integer for each string and has a good distribution:

var murmurhash = require("murmurhash");

var messages = [{id: "123"}, {id:"456"}, {id:"123"}, {id: "789"}];
var processedHashedIds = {};

messages.forEach(function(message) {
var hashedId = murmurhash.v2(message.id);
if (!processedHashedIds[hashedId]) {
processedHashedIds[hashedId] = true;
console.log(message);
}
});

// Vytiskne: { id: "123" } { id: "456" } { id: "789" }

Next, we use the bit vector trick. This is an array in which we store only zeros and ones. If we have an n-bit hash function, we need a vector of size ²ⁿ to address all possible outputs of the hash function. The idea is that the vector initially contains all zeros. Then, if the hash function returns the number 47 for the input "123", we simply put a one in the vector at index 47. This indicates that we have already seen a hash of 47.

var n = 8;
var numberOfValues = Math.pow(2, n);
var messages = [{id: "123"}, {id:"456"}, {id:"123"}, {id: "789"}];
var bitVector = new Array(numberOfValues);

messages.forEach(function(message) {
var hashedId = murmurhash.v2(message.id) % numberOfValues;
if (!bitVector[hashedId]) {
bitVector[hashedId] = true;
console.log(message);
}
});

// Vytiskne: { id: "123" } { id: "456" } { id: "789" }

With the %numberOfValues code, we simply trimmed the murmurhash to create an n-bit hash function. This is essentially a Bloom filter—albeit a very degenerate version. The primary issue with this method is the collisions that are bound to occur. Different input values can produce the same output hash. To reduce the likelihood of collisions, we use several hash functions, not just one. For example, we can create multiple hash functions by adding some salt to the input value. A simple function that creates different n-bit hash functions might look like this:

var murmurhash = require("murmurhash");

function createNBitHashFunction(n, salt) {
var upperBound = Math.pow(2, n);
return function(inputString) {
return murmurhash.v2(inputString + salt) % upperBound;
}
}

var firstHashFunction = createNBitHashFunction(8, "nejakaSul");
var secondHashFunction = createNBitHashFunction(8, "nejakaJinaSul");

console.log(firstHashFunction("123")); // 42
console.log(secondHashFunction("123")); // 174

We can write a class that represents a Bloom filter as follows:

function BloomFilter(numberOfHashFunctions, numberOfBits) {
this.reset(numberOfHashFunctions, numberOfBits);
}

BloomFilter.prototype.reset = function(numberOfHashFunctions, numberOfBits) {
this.bitVector = new Array(Math.pow(2, numberOfBits));
this.hashFunctions = [];

for (var i = 0; i < numberOfHashFunctions; i++) {
this.hashFunctions.push(createNBitHashFunction(numberOfBits, i.toString()));
}
}

BloomFilter.prototype.add = function(item) {
var checksum;
for (var i = 0; i < this.hashFunctions.length; i++) {
checksum = this.hashFunctions[i]<4>;
this.bitVector[checksum] = true;
}
}

BloomFilter.prototype.contains = function(item) {
var checksum;
for (var i = 0; i < this.hashFunctions.length; i++) {
checksum = this.hashFunctions[i]<5>;
if (!this.bitVector[checksum]) {
return false;
}
}
return true;
}

The add method adds an element to the Bloom filter by computing the hash of each function and storing true at the corresponding index.
The contains method determines whether the element is contained in the Bloom filter.

We can easily use the Bloom filter:

var bloomFilter = new BloomFilter(3, 8);
var messages = [{id: "123"}, {id:"456"}, {id:"123"}, {id: "789"}];

messages.forEach(function(message) {
if (!bloomFilter.contains(message.id)) {
bloomFilter.add(message.id);
console.log(message);
}
});

// Vytiskne: { id: "123" } { id: "456" } { id: "789" }

And that's all the magic of the basic Bloom Filter. It has these basic properties:

If the contains method indicates that the element is not in the Bloom Filter, it confirms that the element was never added to the Bloom Filter.
If the method indicates that the element is present in the Bloom Filter, it doesn't necessarily mean that the element was actually added to the Bloom Filter; there may have been a hash collision.
In general, to reduce the likelihood of an incorrect answer, you can increase the number of hash functions or enlarge their size (referring to the size of the output value).
The more hash functions you use, the slower the Bloom Filter will be.
The larger the hash functions used, the more space the Bloom Filter will occupy.
Once an element is inserted into a Bloom filter, it cannot be removed.

« Previous: HyperLogLog: How to estimate the number of unique values

Next: Cryptography and encryption »