HyperLogLog: How to estimate the number of unique values very accurately

Chapters: Linear counting: How to estimate the number of unique values, Min Value: How to estimate the number of unique values, LogLog: How to estimate the number of unique values, HyperLogLog: How to estimate the number of unique values

The series on algorithms for estimating the number of unique values in a multiset continues with an article on the state-of-the-art algorithm HyperLogLog. All of the ideas behind this algorithm build on the previously described LogLog, so before reading on, make sure you know LogLog well.

The LogLog algorithm had two main problems:

Outliers

The resulting LogLog estimate is very susceptible to outliers, i.e. values that lie very far from the other values. Typically, most of the maximum zero-suffix lengths across the groups are about the same, for example between 12 and 15, but then wham - one zero suffix has a length of 27. This big difference makes a mess of the resulting estimate of the number of uniques, because the classical arithmetic mean doesn't handle it well. It's the same principle as when a few top managers and directors inflate the national average wage with their high salaries.

The authors of the algorithm first tried to deal with this by getting rid of the outliers and not including the few highest maxima in the average, but that wasn't a very good solution. Later, they took a different approach - instead of the arithmetic mean, they used the harmonic mean, which behaves much better when some values lie far from the rest.
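To get a feel for the difference, here is a small toy example of mine (HyperLogLog actually averages the values 2^(-r), but the principle is the same): one large outlier drags the arithmetic mean up noticeably, while the harmonic mean barely moves.

values = [12, 13, 14, 12, 13, 27]   # one outlier: 27

arithmetic = sum(values) / float(len(values))      # ~15.17, dragged up
harmonic = len(values) / sum(1.0 / v for v in values)   # ~13.99, stays close

print "arithmetic: %s, harmonic: %s" % (arithmetic, harmonic)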

Very bad results for small cardinalities

LogLog has terrible results when it has to estimate the cardinality of small multisets. If a multiset contains only a few unique elements, LogLog returns a nonsensically large estimate. The authors solved this in a clever way - when the input multiset has a low cardinality, we don't use LogLog but a completely different algorithm, namely the old familiar Linear counting. Well, yeah, but how do we know that the input multiset has a low cardinality if we don't know the cardinality?

We let LogLog estimate the number of unique elements, and if this estimate turns out to be relatively low, we throw it away and recompute it with the Linear counting algorithm. Let's quickly recall how Linear counting works. We start by allocating a zeroed (bit) array of some predetermined size, typically a power of two. We traverse all elements of the multiset, compute their hash, and convert it to an integer i in the interval from 0 to the length of the array minus 1. We then store a one at index i in the array. Finally, we count the number of zeros and ones in the array and use a simple formula to estimate the cardinality.
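As a refresher, a minimal sketch of classic Linear counting might look like this (my illustrative version; the function name is made up for this example):

from math import log

def linear_counting_classic(values, m):
    bits = [0] * m                  # zeroed bit array of size m
    for value in values:
        i = hash(value) % m         # map the hash to an index 0 .. m-1
        bits[i] = 1
    number_of_zeros = m - sum(bits)
    # the estimation formula; assumes at least one zero remains in the array
    return -m * log(float(number_of_zeros) / m)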

However, the LogLog algorithm does not use any bit array to "store" the computed hashes. Instead, it uses an array in which it stores the maximum zero-suffix lengths. For example, the array might look like this:

[2, 5, 9, 0, 5, 4, 4, 0]

where each number indicates the maximum length of the zero suffix in that group of computed hashes. However, a non-zero value at the i-th position necessarily means that the multiset contains an element whose three-bit group identifier is equal to i. In other words, we can convert this array into a bit array for Linear counting by keeping the zeros and converting every positive number to a one. The array above would thus become this bit array:

[1, 1, 1, 0, 1, 1, 1, 0]

Now we can apply the Linear counting algorithm and estimate the cardinality using it.
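In code, the conversion is a one-liner (a small sketch of mine; number_of_ones is then exactly what the Linear counting step needs):

max_zeroes = [2, 5, 9, 0, 5, 4, 4, 0]
# keep the zeros, turn every positive value into a one
bits = [1 if x > 0 else 0 for x in max_zeroes]   # [1, 1, 1, 0, 1, 1, 1, 0]
number_of_ones = sum(bits)                       # 6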

However, there is one catch. We store the maximum zero-suffix length in the array, but what if all hashes in some group end in a one? For example, if we use the last three bits as the identifier of the group a hash belongs to, then the zero-suffix length of the hash "01001100" would be zero, because the remaining bit vector "01001" ends in a one. We would therefore store a zero in the maxima array at index 4 (100₂ = 4₁₀). In the Linear counting step, however, this zero would suggest that no value hashed into group 4, which is not true.

Therefore, we modify the algorithm so that it does not store the length of the longest zero suffix in the maxima array, but the index of the "first one from the right" (for clarity: we index from one), which is the same as adding one to the zero-suffix length. A zero at index i then really means that no value hashed into group i, and we can use such an array for Linear counting.
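For the hash from the previous paragraph, the whole computation looks like this (a worked example using the trailing_zeroes helper defined in the implementation below):

h = 0b01001100                              # the example hash
k = 3                                       # the last three bits identify the group
bucket_index = h & (2 ** k - 1)             # 0b100 = 4
bucket_hash = h >> k                        # 0b01001, ends in a one
stored = trailing_zeroes(bucket_hash) + 1   # 0 + 1 = 1, so index 4 is non-zero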

HyperLogLog implementation

One step at a time:

def alpha(num_buckets):
    # correction constant; in HyperLogLog it depends on the number of groups
    return 0.7213 / (1 + 1.079 / num_buckets)

The alpha function returns a correction for the computed estimate. We already know it from last time; in the HyperLogLog variant, however, alpha takes a different value depending on the number of groups.

def trailing_zeroes(number):
    if number == 0:
        return 0
    zeroes = 0
    # shift right until we hit the first one bit
    while (number & 1) == 0:
        zeroes += 1
        number = number >> 1
    return zeroes

A function that returns the number of zero bits at the end of a number.

def count_maxims(values, k):
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        # the lowest k bits pick the group, the rest is the hash itself
        bucket_index = h & (num_buckets - 1)
        bucket_hash = h >> k
        # index of the first one from the right (zero-suffix length + 1)
        num_zeros = trailing_zeroes(bucket_hash) + 1
        max_zeroes[bucket_index] = max(max_zeroes[bucket_index], num_zeros)
    return max_zeroes

A function that goes through all the values, splits them into num_buckets groups, and for each group computes the maximum index of the first one from the right (i.e. the maximum zero-suffix length plus one).

from math import log

def linear_counting(number_of_ones, num_buckets):
    number_of_zeros = num_buckets - number_of_ones
    # Z = fraction of groups that are still empty; must be non-zero
    Z = float(number_of_zeros) / num_buckets
    p = (-num_buckets) * log(Z)
    return p

A slightly modified version of the Linear counting estimate: instead of the bit array itself, it receives the number of ones and the number of groups directly.

def estimate_cardinality(values, k):
    num_buckets = 2 ** k
    max_zeroes = count_maxims(values, k)
    # harmonic-mean-based estimate instead of the arithmetic mean
    estimate = float(alpha(num_buckets) * (num_buckets ** 2)) / sum(2 ** (-r) for r in max_zeroes)
    number_of_ones = sum(1 if x > 0 else 0 for x in max_zeroes)
    # for small estimates, fall back to Linear counting
    if estimate <= 2.5 * num_buckets and number_of_ones < num_buckets:
        return linear_counting(number_of_ones, num_buckets)
    else:
        return estimate

The main function that puts it all together. It first computes an estimate using the modified LogLog, replacing the arithmetic mean with the harmonic mean in the line that computes estimate, and finally applies the correction in the form of Linear counting if the computed estimate is less than 2.5 times the number of groups (in that range we expect the standard LogLog estimate to be worse than Linear counting) and if there is at least one zero in the maxima array (we can't use an array containing all ones in Linear counting, see the previous article). And what are the results of HyperLogLog?

We run a test that estimates the cardinality of multisets with cardinalities 10, 100, ..., 1 000 000:

from uuid import uuid4

for x in range(1, 7):
    num_values = 10 ** x
    values = (str(uuid4()) for _ in xrange(num_values))
    cardinality = estimate_cardinality(values, 14)
    print "Cardinality: %s, relative error: %s" % (cardinality, num_values / cardinality)

Results:

Cardinality: 10.0030530001, relative error: 0.999694793165
Cardinality: 100.306423257, relative error: 0.996945128268
Cardinality: 1003.0892434, relative error: 0.99692027063
Cardinality: 10013.1348931, relative error: 0.998688233677
Cardinality: 100520.045562, relative error: 0.994826449206
Cardinality: 998088.420332, relative error: 1.0019152408

We can see that the results are always off by less than one percent from the correct result. And that's good!

Other improvements to HyperLogLog

HyperLogLog has seen a number of other interesting improvements. One of the remaining problems is the memory requirement. That sounds rather strange, since last time we praised its memory usage to the skies and calculated that we only need 768 bytes to store the array of maxima. The problem arises, however, if you want to compute the cardinality not of one huge multiset, but of several million small multisets in parallel, or if you want to store the computed arrays of maxima in a database for some later wizardry.

For example, if we were to compute the cardinality of one billion multisets at the same time, the current implementation would need a billion times 768 bytes, which is 768 GB. On the one hand, that's not bad ... but on the other hand, we can implement HyperLogLog better.

Sparse representation

Let's imagine that we set k = 12, i.e. we use 12 bits to identify the groups and thus get 2¹² = 4096 different groups of hashes. At the beginning of the algorithm, we need to allocate an array of length 4096 (see the allocation of max_zeroes on line 3 of the count_maxims function). Now let's assume that our multiset has three different elements. This means that we set a value at three different places in the array, and the remaining 4093 values in the array remain zero; we simply haven't used them in any way. That's not very efficient, is it?

To avoid having an array that is mostly zero and unused, but which still takes up a valuable chunk of memory, we can initially store the values as [index, value] pairs. An example for clarity: this array

[7, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

can be represented much more memory-efficiently as two pairs:

[0, 7], [4, 12]

Instead of having a bunch of unused space allocated in memory, we incrementally allocate only the space we really need. We call this a sparse representation. At some point, however, such a representation would become more memory-intensive than a plain array - at that moment we can convert the sparse representation to a classical array without losing any information.
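A minimal sketch of the idea (illustrative only; real implementations pack the pairs much more compactly):

# store only the non-zero maxima as index -> value pairs
sparse = {0: 7, 4: 12}

def to_dense(sparse, num_buckets):
    # convert the sparse representation back to a classical array
    dense = [0] * num_buckets
    for index, value in sparse.items():
        dense[index] = value
    return dense

print to_dense(sparse, 32)   # [7, 0, 0, 0, 12, 0, 0, ...]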

Offsets

Practice has shown that all the values in the maxima array tend to be roughly the same. A typical array of maxima looks something like this:

[12, 13, 11, 13, 15, 12, 14, 11, 12, ..., 13, 14, 11]

All values hover somewhere around the number 13. Another strategy to reduce the size of such an array is therefore to subtract some fixed value from every element and remember that value. For example, we can subtract 8:

[4, 5, 3, 5, 7, 4, 6, 3, 4, ..., 5, 6, 3]

At this point, each value in the array consumes less memory - in the first array we need to store numbers from 0 to 15, which takes 4 bits per value, while in the second array we only need 3 bits per value, because we are storing numbers from 0 to 7. Of course, this assumes an array implemented at the bit level. The moment we want to read the values back from the array, we just add the remembered offset to them.
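A sketch of the idea in code (illustrative; a real implementation would pack the whole array into raw bits):

offset = 8
maxima = [12, 13, 11, 13, 15, 12, 14, 11]

# pack each (value - offset) into 3 bits of a single integer
packed = 0
for i, m in enumerate(maxima):
    packed |= (m - offset) << (3 * i)

def read(packed, i, offset=8):
    # extract the 3 bits and add the offset back
    return ((packed >> (3 * i)) & 0b111) + offset

print read(packed, 4)   # 15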

And there are a bunch of other improvements. Research in this area is still ongoing, so we can certainly look forward to more interesting ideas. Next time we'll talk about unions and intersections of HyperLogLog vectors, and sometime in the near future about why storing billions of HyperLogLog vectors might be useful.

Links and resources