# LogLog: How to estimate the number of unique values

Chapters: Linear Counting: How to estimate the number of unique values, Min Value: How to estimate the number of unique values, LogLog: How to estimate the number of unique values, HyperLogLog: How to estimate the number of unique values

After the previous pieces on Linear Counting and Min Value, we finally get to almost the best probabilistic algorithm for estimating the number of unique values in a data set: LogLog. The only thing better is its little brother HyperLogLog, which builds on LogLog, but more on that next time.

Let's restate the problem. On input we have some dataset/multiset, for example an array or generator of ids. Some values appear multiple times, and we want to count the number of unique values in the dataset. We call this number the cardinality. We expect this cardinality to be really large, otherwise we could just use conventional algorithms.

## The idea behind the LogLog algorithm

Imagine we have a set of binary vectors. Don't be alarmed, for example, this 0001011001001100 is a binary vector of length 16. Suppose we have binary vectors of length 32 and that we have a total of one million of them, all **completely randomly generated**. How many vectors will end in zero?

We can't know exactly, of course, but if we really did generate the vectors randomly, about half of them should end in zero, i.e. 500,000. How many vectors will end in 00? A quarter. 000? An eighth. And so on. We can represent it roughly like this:

```
vector pattern    proportion among all vectors
...xxxxxx0        50%
...xxxxx00        25%
...xxxx000        12.5%
...xxx0000        6.25%
...               ...
```

If we ask about a suffix (a suffix is a string that ends another string - the opposite of a prefix) of length `d`, we expect to find `v` vectors that end with that suffix. We compute the value of `v` as:

$$\Large v=\frac{1}{2^d}\cdot N$$

where `N` is the number of all vectors in the set. A suffix of length 3 (e.g. ...010) should occur in `1/2^3 · N = 1/8 · N` vectors, i.e. 12.5% of all vectors.
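We can check this claim experimentally. A minimal sketch (the vector length of 32 and the count of 100,000 are arbitrary choices for the experiment):

```python
import random

random.seed(42)
N = 100_000

# N random 32-bit "vectors"
vectors = [random.getrandbits(32) for _ in range(N)]

# A vector ends in d zeros exactly when its d lowest bits are all zero.
for d in range(1, 5):
    count = sum(1 for v in vectors if v & ((1 << d) - 1) == 0)
    print(f"suffix of {d} zeros: {count / N:.4f}")   # ≈ 1/2^d
```

The printed proportions come out close to 50%, 25%, 12.5% and 6.25%, as predicted.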

Now let's do a little mindfuck. Let's have a multiset of 1024 unique vectors. How many vectors should end with the suffix 0000000000?

$$v=\frac{1}{2^{10}}\cdot1024=\frac{1}{1024}\cdot1024=1$$

Huh, one! Among 1024 randomly generated binary vectors, there should be exactly one vector that ends in ten zeros. And how do we use this information to estimate the number of unique elements? By going the other way. What if we go through all the binary vectors on the input and find that there is a single vector in the set that ends in ten zeros? Then we can estimate that there were 1024 vectors in the set!

## We generalize and construct an algorithm

The goal should be to find a suffix that is contained in only one vector, and that suffix is as short as possible. However, this is very hard to do because to actually find such a vector, we would have to store all the suffixes, which we don't want to do because it would take too much memory. Instead, we settle for looking for the longest sequence of zeros at the end of the vector.

So we go through all the vectors, count for each one how many zeros it has at the end, and keep the maximum of those values in one variable. When we've gone through all the vectors, we know the length of the longest suffix that consists entirely of zeros. We denote this length by `d`. We now rearrange the previous formula to isolate the variable `N`, i.e., to make it the output of the algorithm rather than the input. In doing so, we substitute one for `v`, because we expect to find exactly one such vector, i.e. we expect `v=1` to hold.

$$\begin{eqnarray} v&=&\frac{N}{2^d}\\ v\cdot2^d&=&N\\ 2^d&=&N \end{eqnarray}$$

Thus, if we find the length `d` of the longest zero suffix, we estimate the cardinality using the formula

$$\Large N=2^d$$

## How to obtain random binary vectors?

The algorithm as described works with random binary vectors, but how do we apply this to real data? As usual, we use some suitable hash function. If we have a set of data on the input, for example ids (strings), we hash them first. The output of an `n`-bit hash function is `n` bits, which from a practical point of view look like a randomly generated binary vector.
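Hashing a string id to such a vector might look like this. A sketch only: the function name, the choice of MD5, and the 32-bit width are my illustrative choices, not part of the algorithm; any well-mixing hash works.

```python
import hashlib

def hash_to_bits(value, n_bits=32):
    # MD5 is just an illustrative choice here; any well-mixing hash works.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << n_bits) - 1)

# The id becomes a 32-bit "random-looking" binary vector:
print(format(hash_to_bits("user-12345"), "032b"))
```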

At this point, we can write our algorithm:

```
def trailing_zeroes(number):
    if number == 0:
        return 0
    zeroes = 0
    while (number & 1) == 0:
        zeroes += 1
        number = number >> 1
    return zeroes

def count_unique(values):
    max_zero_suffix_length = 0
    for value in values:
        zeroes_length = trailing_zeroes(hash(value))
        max_zero_suffix_length = max(zeroes_length, max_zero_suffix_length)
    return 2 ** max_zero_suffix_length
```

The `trailing_zeroes` function returns the number of zeros at the end of a number: it keeps testing the lowest bit and shifting right until it hits a one. And what is the success rate of the algorithm? For a set that contained 100,000 unique elements, the function returned an estimate of 524,288.
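A few worked calls make the behavior of `trailing_zeroes` concrete (the function is repeated from the listing above so the snippet runs on its own):

```python
def trailing_zeroes(number):
    if number == 0:
        return 0
    zeroes = 0
    while (number & 1) == 0:
        zeroes += 1
        number = number >> 1
    return zeroes

print(trailing_zeroes(0b101000))  # 101000 ends in three zeros → 3
print(trailing_zeroes(0b111))     # 111 ends in no zeros → 0
print(trailing_zeroes(1024))      # 1024 = 2^10, ten trailing zeros → 10
```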

Well...

That's not exactly good, is it? But that's okay! Let's try to improve the algorithm in the same way as with the earlier Min Value algorithm. We split the input data into several disjoint groups, compute the maximum zero suffix for each group, and finally compute the average length of the zero suffix and use that to estimate the number of uniques.

For example, we can say that the last two bits of each vector will determine the number of the group to which the vector belongs. Take these six vectors:

```
1101110111111011
1110110010100101
1011001111110010
1111101100000011
1010101000100000
0110010110011000
```

Based on their last two bits, they can be divided into four groups:

```
00: 1010101000100000, 0110010110011000
01: 1110110010100101
10: 1011001111110010
11: 1101110111111011, 1111101100000011
```

Next, we compute the length of the longest zero suffix in each group, skipping the bits we used as the group number. The maximum per group comes out as follows:

```
00: 3
01: 0
10: 2
11: 6
```

From these per-group maxima we can calculate the arithmetic mean, which comes out to (3 + 0 + 2 + 6) / 4 = 2.75. We plug the value into the formula and it comes out

$$2^{2.75}\approx6.727$$

This is the average number of unique elements per group. Which of course doesn't match reality at all, but let's not worry about that: for small sets the LogLog algorithm gives terrible results.

To estimate the cardinality of the whole set, we still need to multiply this value by the number of groups: 6.727 · 4 ≈ 26.9. The algorithm thus calculates that there are approximately 27 unique values in the set. There were six, so it's completely off, but that doesn't really matter now.
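The arithmetic of this toy example can be checked in a few lines (the four maxima are the ones from the group listing above):

```python
# Longest zero suffixes found in the four groups of the example above
max_zeroes = [3, 0, 2, 6]                 # groups 00, 01, 10, 11
num_buckets = len(max_zeroes)

average = sum(max_zeroes) / num_buckets   # arithmetic mean of the maxima
per_group = 2 ** average                  # estimated uniques per group
estimate = per_group * num_buckets        # estimate for the whole set

print(average)    # → 2.75
print(per_group)  # ≈ 6.727
print(estimate)   # ≈ 26.9
```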

## The implementation of the algorithm

```
def trailing_zeroes(number):
    if number == 0:
        return 0
    zeroes = 0
    while (number & 1) == 0:
        zeroes += 1
        number = number >> 1
    return zeroes

def estimate_cardinality(values, k):
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        bucket_index = h & (num_buckets - 1)
        bucket_hash = h >> k
        num_zeros = trailing_zeroes(bucket_hash)
        max_zeroes[bucket_index] = max(max_zeroes[bucket_index], num_zeros)
    average_max_zeros = float(sum(max_zeroes)) / num_buckets
    return (2 ** average_max_zeros) * num_buckets
```

The `k` parameter determines how many bits are taken as the group index. This parameter determines the accuracy of the algorithm - the higher the `k`, the higher the accuracy. The `bucket_index` variable stores the group number. How does the magic on the right-hand side of that assignment work?

The `num_buckets` variable holds the number of groups, which is always a power of two. If, for example, `k = 3`, then `num_buckets = 8`. The number 8 is 1000 in binary. If we subtract one, we get the number 7, which is 0111 in binary. Subtracting one from `num_buckets` always gives a number that has `k` ones at the end and all zeros before them. Next, we take the bitwise AND (that's the `&` operator) of the hash and this mask. For the vector `1110110010100101`, this operation looks like this:

```
  1110110010100101
& 0000000000000111
------------------
  0000000000000101
```

The purpose of all this magic is to get a number that contains all zeros at the beginning and whose last `k` bits are the same as those of the hash `h`.

Next, we compute `bucket_hash`, which is the original hash with the last `k` bits removed by shifting the entire hash `k` bits to the right. The shift simply moves all the bits in the number `k` places to the right and pads the number with zeros from the left. (A bit shift to the right by one behaves the same as dividing the number by two and discarding the remainder.) From the vector `1110110010100101` we would get `0001110110010100` after shifting by three. This vector is stored in the `bucket_hash` variable.
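Both bit tricks can be tried out directly on the example hash from the text (a small sketch with `k = 3`):

```python
k = 3
num_buckets = 2 ** k           # 8, i.e. 0b1000
mask = num_buckets - 1         # 7, i.e. 0b0111

h = 0b1110110010100101         # the example hash from the text
bucket_index = h & mask        # keep only the last k bits → 0b101 = 5
bucket_hash = h >> k           # drop the last k bits

print(bucket_index)                 # → 5
print(format(bucket_hash, "016b"))  # → 0001110110010100
```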

Next, we count the number of zeros at the end of this vector, and if this value is greater than the current maximum stored in the `max_zeroes` array at index `bucket_index`, we overwrite it.

At the end, we compute the average number of zeros over all the groups and from it the average number of unique elements per group (`2 ** average_max_zeros`). Since we have `num_buckets` groups in total, we multiply this value by the number of groups to get the final estimate of the number of unique elements. Phew, and it didn't even hurt. Results for several datasets containing 250,000 uniques and `k=10`:

```
331774.56203
323349.39265
307343.444439
309430.914045
294512.379123
```

Noooo, still wrong :-(. But we're pretty close, don't worry. Note that all the results are greater than the correct answer. This is because various errors accumulate during the algorithm, so the estimate comes out noticeably larger than the true value. However, this bias is fairly constant and it is possible to estimate how large it is. When we multiply the estimate by the fudge factor 0.79402, we obtain substantially better results:

```
263435.63774306054
256745.88475195298
244036.84175345476
245694.33437001088
233848.71927124445
```

I tried another multiset that has a cardinality of a million, results:

```
Result           Error
1017297.17411    1.01729717411
1022820.99717    1.02282099717
984108.725487    0.984108725487
1018675.32683    1.01867532683
1007020.2853     1.0070202853
```

Nice and pretty! The error is under 2%, which is a decent performance. And that, ladies and gentlemen, is the **LogLog** algorithm. The final implementation with the fudge factor looks like this:

```
def trailing_zeroes(number):
    if number == 0:
        return 0
    zeroes = 0
    while (number & 1) == 0:
        zeroes += 1
        number = number >> 1
    return zeroes

def estimate_cardinality(values, k):
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        bucket_index = h & (num_buckets - 1)
        bucket_hash = h >> k
        num_zeros = trailing_zeroes(bucket_hash)
        max_zeroes[bucket_index] = max(max_zeroes[bucket_index], num_zeros)
    average_max_zeros = float(sum(max_zeroes)) / num_buckets
    return (2 ** average_max_zeros) * num_buckets * 0.79402
```
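A quick way to try the algorithm out (a self-contained sketch that repeats the functions above; it relies on CPython hashing small integers to themselves, so seeded random integers stand in for hashed ids):

```python
import random

def trailing_zeroes(number):
    if number == 0:
        return 0
    zeroes = 0
    while (number & 1) == 0:
        zeroes += 1
        number = number >> 1
    return zeroes

def estimate_cardinality(values, k):
    num_buckets = 2 ** k
    max_zeroes = [0] * num_buckets
    for value in values:
        h = hash(value)
        bucket_index = h & (num_buckets - 1)
        bucket_hash = h >> k
        num_zeros = trailing_zeroes(bucket_hash)
        max_zeroes[bucket_index] = max(max_zeroes[bucket_index], num_zeros)
    average_max_zeros = float(sum(max_zeroes)) / num_buckets
    return (2 ** average_max_zeros) * num_buckets * 0.79402

random.seed(0)
# 100,000 unique 60-bit integers; CPython hashes such ints to themselves,
# so their random bits play the role of the "random binary vectors".
values = random.sample(range(1 << 60), 100_000)
print(estimate_cardinality(values, 10))  # should come out near 100,000
```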

How to modify this algorithm to get that mythical and best HyperLogLog we will tell next time. For now, let's look at how memory intensive the algorithm is.

## Memory requirement

The algorithm is very memory efficient. The only thing we need to remember is the array maintaining the maxima. The length of this array is determined by the value of `k`, i.e., the number of bits of the vector that we take as the group index. For a given `k`, we have a total of `2^k` different groups, so the array of maxima has length `2^k`. We called this value `num_buckets` in the implementation.

How big must one cell of the array be? We store the lengths of the zero suffixes in it. I can tell you up front that a 64-bit hash is probably enough for every conceivable multiset in the world, so we need to be able to store numbers from 0 (a vector of all ones) to 64 (a vector of all zeros) in this array. Furthermore, if we assume that `k` will always be greater than zero, it is sufficient to store numbers from 0 to 63. For that, please, we only need 6 bits, because `2^6 = 64`.

Yes, we just need an "array" in which one cell is 6 bits in size. I can also tell you that `k=10` is enough to give us a pretty good estimate for cardinalities in the order of hundreds of millions. An algorithm set up this way would use an array of `2^10 = 1024` cells, each occupying 6 bits. In total, this array would occupy 6144 bits, which is 768 bytes. Not even a kilobyte.
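The arithmetic from the paragraph above, spelled out:

```python
k = 10
num_buckets = 2 ** k        # 1024 groups
bits_per_cell = 6           # enough for values 0..63 with a 64-bit hash

total_bits = num_buckets * bits_per_cell
print(total_bits, "bits =", total_bits // 8, "bytes")  # → 6144 bits = 768 bytes
```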

And that's it.

We don't need to store anything else, we can discard the calculated hashes right away, we just need to know the number of zeros at the end of the vector.

Note that if you used a larger hash function that returned, say, 128 bits, the memory footprint wouldn't change much: we would need to store numbers from 0 to 127, for which 7 bits suffice. Thus, the memory requirement is determined mainly by the parameter `k`, i.e. the number of groups.

LogLog is particularly useful where we need to estimate really large cardinalities. As the small example above showed, LogLog does not handle multisets of small cardinality well. This too can be remedied, for example by switching to a different estimator, such as the previously described Linear Counting algorithm, if we find that the multiset has low cardinality. This will also be discussed next time.

Resources:

- Loglog Counting of Large Cardinalities [PDF]
- Damn Cool Algorithms: Cardinality Estimation (the LogLog code here is more or less borrowed from this post)