Linear counting: How to estimate the number of unique values

Chapters: Linear counting: How to estimate the number of unique values, Min Value: How to Estimate the Number of Unique Values, LogLog: How to estimate the number of unique values, HyperLogLog: How to estimate the number of unique values

How could we count the number of unique values in a data set when there is really a lot of data? For example, we might have a service that measures the number of unique users on some big website, or maybe across the whole .cz/.com/.whatever domain, something like that.

Naive approach

Ok, there is nothing to solve here: we just go through the data and store each value exactly once in some suitable data structure. In Python we could write it like this (the code is unnecessarily verbose, that's on purpose):

nonunique_values = [1, 2, 3, 2, 4, 1, 1, 3, 5, 2]

def count_unique(values):
    found = set()
    for item in values:
        if item not in found:
            found.add(item)
    return len(found)

print count_unique(nonunique_values)  # 5

It works. Okay, but what if the data consists of 36-character UUID v4 identifiers, and what if there are something like ten million unique ids? Well... the program still calculates the correct value, but how much memory does it take? We can try to do the math. For simplicity, we'll assume that we really store each uuid as a 36-character string. So one uuid takes up 36 bytes, and ten million ids take up 360 MB.

Not exactly small, let's face it. If there were a hundred million unique ids, we'd be into the gigabyte range. And that's purely what the data itself takes up. Add in the overhead of, say, Python or JavaScript, and with ten million ids you can easily end up with several gigabytes.
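Just to illustrate the overhead point (this snippet is my addition, not part of the original text), sys.getsizeof gives a rough idea of what the objects cost on top of the raw characters. The exact numbers depend on the Python version and build, so take the output as a ballpark only.

import sys
from uuid import uuid4

# 1,000 ids instead of ten million, so the sketch runs instantly
sample = [str(uuid4()) for _ in range(1000)]
found = set(sample)

raw_payload = 36 * len(found)  # the bare 36-character strings
with_overhead = sys.getsizeof(found) + sum(sys.getsizeof(s) for s in found)

print("raw payload: %d B" % raw_payload)
print("set + string objects: %d B" % with_overhead)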

You don't want that. How to get out of it?

We could try to store the ids themselves more efficiently, or use C instead of Python, which would substantially reduce the memory requirements, but it still wouldn't be great, and such a program would still take up an awful lot of space. The question remains: do we really need to know, down to the last digit, that there are exactly 9,563,618 unique values? Would anything bad happen if we worked with a count of 9,547,666 instead? If your answer is that it would, and that you need the exact number, then go ahead and count exactly. Otherwise, let's look at procedures that estimate the number of unique values in a dataset as accurately and quickly as possible, with minimal memory requirements.

Hash function - our rescue

We will consider a hash function that takes a string as input and returns some n-bit number as output. If you're more used to hash functions that "return a string", just imagine that we further convert that string to a number; there's nothing more to it. The number of bits indicates how big that number can be. An 8-bit function returns a number that takes up eight bits, and eight bits can store 2^8 different values, typically the integers from 0 to 255. We'll continue working with a 24-bit hash function. Such a function returns 2^24 different values, which is 16,777,216.

Instead of storing the values themselves, we will store their 24-bit hashes. We need 30 MB to store ten million 24-bit hashes. This is an order of magnitude better than the previous 360 MB. The code might look like this:

def nhash(item, n):
    return hash(item) % (2 ** n)

nonunique_values = [1, 2, 3, 2, 4, 1, 1, 3, 5, 2]

def count_unique_using_nhash(values):
    found = set()
    for item in values:
        checksum = nhash(item, 24)
        if checksum not in found:
            found.add(checksum)
    return len(found)

print count_unique_using_nhash(nonunique_values)  # 5

The hash function is a built-in Python function and returns an integer; the nhash function shows a naive implementation of an n-bit hash function. It's not important to understand exactly how it works, because it doesn't really do the job properly anyway - the result falls in the right range, but the stored number doesn't actually occupy just 24 bits of memory. That doesn't matter now; I'm concerned with the principle.

Then comes the count_unique_using_nhash function, which counts the unique values. It goes through all the values, calculates the hash of each one, and stores that hash in the found set if it isn't already there.

The problem with this procedure is that the hash function may return the same checksum for different strings - a collision may occur. The more such collisions that occur, the more inaccurate the result. The count_unique_using_nhash function almost always returns fewer unique values than there actually are.
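To see a collision happen, here is a small sketch (my addition, not the article's code): it pushes 257 distinct strings through an 8-bit version of nhash, and since there are only 2^8 = 256 possible checksums, the pigeonhole principle guarantees that at least two of them collide.

def nhash(item, n):
    return hash(item) % (2 ** n)   # the same naive n-bit hash as above

seen = {}                          # checksum -> the first value that produced it
for i in range(257):
    value = "user-%d" % i
    checksum = nhash(value, 8)
    if checksum in seen:
        print("collision: %r and %r share checksum %d" % (seen[checksum], value, checksum))
        break
    seen[checksum] = value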

And which hashing function to choose? It doesn't really matter that much, but some variant of MurmurHash is generally recommended.
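For illustration, an n-bit hash built on MurmurHash might look like the sketch below. It assumes the third-party mmh3 package (pip install mmh3); mmh3.hash returns a signed 32-bit integer, so we mask off the lowest n bits. The function name nhash_murmur is just mine.

import mmh3

def nhash_murmur(item, n=24):
    # keep only the lowest n bits of the 32-bit MurmurHash3 value
    return mmh3.hash(str(item)) & ((1 << n) - 1)

print(nhash_murmur("hello"))  # some number between 0 and 2**24 - 1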

Bitmaps

Did you pay attention in Fundamentals of C Programming? If so, you know bitmaps. How could you make use of them?

Let's sketch the idea with an ordinary array first. Since we don't need to store the computed checksums themselves, we can create an array of length 2^24 and initialize all elements to zero. Then, whenever we hit a checksum - which is just a number! - we write a one at the index equal to that checksum, to mark that we have already seen it. The number of unique elements will then be approximately equal to the number of ones in this array. The code might look like this:

def count_unique_using_array(values):
    # a list of 2^24 zeros
    array = [0] * (2 ** 24)
    for item in values:
        array[nhash(item, 24)] = 1
    return sum(array)

print count_unique_using_array(nonunique_values)

However, using an ordinary array here is very memory intensive, because every cell stores a whole number just to hold a one or a zero. Instead, we can use a bit array. To see why, consider that an integer is typically stored in 32 bits. If we convert the number 354 to binary, we get 101100010. So the number 354 would be stored in the computer as a 32-bit integer (let's ignore the details, shall we?) like this:

00000000 00000000 00000001 01100010

That's exactly the kind of format we could use, right? We only want to store ones and zeros in our function, and we've just seen that an ordinary number is really nothing but a bunch of zeros and ones (like everything in a computer, I guess...). In short, we can use bitwise operators to store each one so that it really takes up just one bit.
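To make that concrete, here is a tiny sketch (my addition): bin() shows the binary form of 354, and the bitwise operators | and & store and test a single bit inside an ordinary integer.

print(bin(354))                    # 0b101100010

bits = 0                           # an "empty" bit array packed into one integer
index = 9                          # pretend we just computed checksum 9

bits |= 1 << index                 # set bit number 9 to one
print(bin(bits))                   # 0b1000000000

print(bool(bits & (1 << index)))   # True  -> checksum 9 has been seen
print(bool(bits & (1 << 5)))       # False -> checksum 5 has not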

Let's go back to our array function. If we rewrote it using a bit array, we would only need a bit array of 2^24 bits, which is 2 MiB, for a 24-bit hashing function. That's a decent upgrade!

With a 32-bit hashing function, which has 2^32 different output values, that is 4,294,967,296, we would need a bit array of half a gigabyte. That's really not bad, considering it lets us count uniques up to somewhere in the order of billions.
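The following sketch (mine, only an illustration of the idea above, not the article's code) rewrites the counting function on top of a bytearray used as a bit array: for a 24-bit hash it allocates 2^24 / 8 = 2,097,152 bytes, i.e. the promised 2 MiB.

def nhash(item, n):
    return hash(item) % (2 ** n)

def count_unique_using_bitarray(values, n=24):
    bits = bytearray(2 ** n // 8)          # 2 MiB of zero bits for n = 24
    count = 0
    for item in values:
        checksum = nhash(item, n)
        byte_index, bit_index = checksum // 8, checksum % 8
        if not bits[byte_index] & (1 << bit_index):
            bits[byte_index] |= 1 << bit_index
            count += 1
    return count

print(count_unique_using_bitarray([1, 2, 3, 2, 4, 1, 1, 3, 5, 2]))  # 5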

Linear Counting

I mentioned earlier that when we use the hashing function, we only get an estimate of the number of unique values. This estimate becomes less accurate the closer the number of unique values gets to the number of different output values of the hash function. It makes sense: if we have a dataset with five billion unique values, we simply can't count it with a 24-bit hash function.

However, the estimate we compute using hash functions can be further refined. A simple idea is used to do this - the more unique values there are in a dataset, the more likely it is that collisions will arise over time. (Collision = two different inputs have the same output from the hash function, the same checksum.) Now we're going to do a bit of math, so we'll introduce a few variables:

  • We'll work with an n-bit hash function h.
  • So the number of all possible outputs of the function h is equal to 2^n. We will denote this number by m, so: m = 2^n.
  • The bit array in which we will store which checksums we have found will be denoted by the capital letter A. The array A has length m.
  • If we run our algorithm through the entire data set, some values in the bit array will be zero and some will be one. We will mainly be interested in the ratio "number of zeros divided by the length of the array", which gives us the relative proportion of zeros in the bit array. Initially the proportion is 1, because the array is initialized to all zeros. If it were fifty-fifty, the proportion would be equal to one half, and so on. We denote this value by Z and calculate it as Z = number of zeros / m (see the short sketch right after this list).
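A tiny sketch of the Z calculation on a toy bit array (my illustration; the names follow the list above):

A = [0, 1, 0, 0, 1, 0, 1, 0]    # a toy bit array with m = 8 slots, 3 checksums seen
m = len(A)
Z = float(A.count(0)) / m       # proportion of zeros
print(Z)                        # 0.625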

At this point, the higher the value of Z, the more confident we are that our estimate of the number of uniques is accurate. For example, if we go through an empty dataset, there will be all zeros left in the bit array and the value of Z will be one - so we are absolutely certain that there are as many unique elements in our dataset as there are ones in the bit array - zero.

If the value of Z is equal to 0.99, it means that the bit array is one-hundredth full of ones. 99% of the array contains zeros. This is still good, it is likely that not too many collisions occurred during the calculation and no or minimal correction will be needed. However, if the Z value is 0.05, only 5% of the array is left with zeros and we probably ran into lots and lots of collisions during the calculation - a large correction will be needed.

But how do we calculate this correction? We'll use the natural logarithm.

We will be interested in its behaviour on the interval (0, 1]. The closer Z is to 1, the closer ln Z is to zero; conversely, the smaller Z is, the further below zero ln Z falls. We calculate the resulting estimate by multiplying this value by -m. Note that for a sufficiently low value of Z we get a number that is even less than -1, and if we multiply this value by -m we get an estimate that is greater than m: we have estimated that the number of uniques is greater than the number of all possible checksums that can come out of the hashing function h.

The full formula for calculating the number p of unique values would look like this:

$$\Large p=-m\cdot \ln Z$$
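To get a feel for the formula, here is a made-up worked example (not output of the code below): suppose we used a 16-bit hash function, so m = 65536, and after processing the data exactly half of the bit array was still zero, i.e. Z = 0.5. The array contains 32,768 ones, but because such a full array must have seen plenty of collisions, the corrected estimate comes out noticeably higher:

$$p = -65536 \cdot \ln 0.5 = 65536 \cdot \ln 2 \approx 45\,426$$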

An implementation (without bit arrays, just plain lists) might look like this:

from uuid import uuid4
from math import log

number_of_unique_values = 10000
nonunique_values = [uuid4() for _ in xrange(number_of_unique_values)]
nonunique_values += nonunique_values

def nhash(item, n):
    return hash(item) % (2 ** n)

def count_unique_using_array(values, n):
    array = [0] * (2 ** n)
    for item in values:
        array[nhash(item, n)] = 1
    return sum(array)

def linear_counting(values, n):
    estimate = count_unique_using_array(values, n)
    m = 2 ** n
    number_of_zeros = m - estimate
    Z = float(number_of_zeros) / m
    # log here is the natural logarithm with base e
    p = (-m) * log(Z)
    p = int(round(p))
    print "Debug info: estimate=%s, m=%s, number_of_zeros=%s, Z=%s, p=%s" % \
        (estimate, m, number_of_zeros, Z, p)
    return p

print "Cardinality: %s" % linear_counting(nonunique_values, 16)

In the linear_counting function, we first compute an initial estimate using the previous algorithm with an n-bit hash function. Then we calculate the proportion of zeros in the array and apply the correction using the formula. The logarithm in the code is the natural logarithm. The variable nonunique_values contains number_of_unique_values unique strings, so in this example there are ten thousand unique values in the nonunique_values list. The odd-looking line nonunique_values += nonunique_values makes sure that there are repeated values in the list: we simply append a copy of the list to itself, so each value appears in it twice. And then we count. Ideally, the function should report that the list contains ten thousand uniques. Output:

Debug info: estimate=9289, m=65536, number_of_zeros=56247, Z=0.858261108398, p=10016
Cardinality: 10016

Since a different list of nonunique_values is generated on every run, the results will vary from call to call. We used a 16-bit hash function. We can see that the first guess, from the count_unique_using_array function, was that the list contained 9,289 uniques. The correction refined this estimate to 10,016, which is already pretty good: our estimate is only 0.16% too large, whereas before it was 7.11% too small - quite a difference.
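Where those percentages come from:

$$\frac{10016 - 10000}{10000} = 0.0016 = 0.16\,\%, \qquad \frac{10000 - 9289}{10000} = 0.0711 = 7.11\,\%$$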

If we took a "larger" hashing function, for example a 24-bit hash function, we would get a more accurate result:

Debug info: estimate=9997, m=16777216, number_of_zeros=16767219, Z=0.999404132366, p=10000
Cardinality: 10000

The original estimate was already decent, and the correction brought it all the way to exactly ten thousand. As another example, we can try a million unique ids, i.e. number_of_unique_values = 1000000. With the 24-bit hash function we get these results:

Debug info: estimate=971033, m=16777216, number_of_zeros=15806183, Z=0.94212192297, p=1000267
Cardinality: 1000267

Not bad. And if we implemented this with a bit array, the array should only eat up the aforementioned two megabytes (plus whatever the language itself adds). Pretty good, if you ask me. But we can do better! Much better. That's for next time, though.

In the meantime, you can read the article A Linear-Time Probabilistic Counting Algorithm for Database Applications [PDF]. Or you can continue on to the next part of the series: the min-value algorithm.