Min Value: How to Estimate the Number of Unique Values

Chapters: Linear counting: How to estimate the number of unique values, Min Value: How to Estimate the Number of Unique Values, LogLog: How to estimate the number of unique values, HyperLogLog: How to estimate the number of unique values

In the last article on Linear Counting, we reviewed the basic algorithms for calculating and estimating the number of unique values in a data set. Today, we have another probabilistic algorithm that allows us to estimate the number of uniques with decent accuracy and small memory requirements.

Min Value

In this article, we will work only with real numbers from the unit interval (0; 1). Consider a set of numbers from this interval that is uniformly distributed, e.g. the set of numbers {0.25; 0.5; 0.75}. Neighboring numbers are always exactly 0.25 away from each other (and from zero and one). How does this help us count the number of unique elements?

If I tell you that we have such a uniformly distributed set of numbers from the unit interval, and that adjacent numbers are exactly 0.2 apart - can you tell how many elements the set has? Of course: it is the set {0.2; 0.4; 0.6; 0.8}, whose cardinality is four. If the input is a set of uniformly distributed numbers from the unit interval and we know the distance between adjacent numbers, we can easily calculate the number of elements.

The distance between adjacent numbers is equal to the minimum element of the set - the distance between adjacent elements of {0.25; 0.5; 0.75} is 0.25, and so on. So to count the elements of such a uniformly distributed set, we just need to find its minimum element somehow. Then we can calculate the number of elements p with a simple formula:

$$\Large p=\frac{1}{m}-1$$

where m is the minimum element.
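Just to check the formula against the two sets above, here is a throwaway snippet (the function name is mine, purely for illustration):

def estimate_from_minimum(numbers):
    # p = 1/m - 1, where m is the minimum element
    return 1 / min(numbers) - 1

print(estimate_from_minimum([0.25, 0.5, 0.75]))     # prints 3.0
print(estimate_from_minimum([0.2, 0.4, 0.6, 0.8]))  # prints 4.0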

And again we will hash...

Well, yeah, but how do we use this if the input contains strings or other values that don't exactly satisfy the condition of uniform distribution in the unit interval? Again, the good old hash function will help: for any input it can return a rational number from the unit interval, and such a hash function is easy to construct.

This would give us rational numbers from the data, but how do we make these numbers uniformly distributed? We can't, but we can help ourselves in another way. From our point of view, the hash function behaves rather "randomly". Even for very similar words like "ahoj", "ahok" and "ahol", it will most likely return very different checksums. In other words, if we hash a thousand different values into the unit interval, it is very unlikely that a significant majority of them will end up, say, below one half. On the contrary, it is very likely that the numbers will be spread more or less evenly. And the more data we hash, the more even the spread will be.
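We can convince ourselves with a quick experiment (just a sketch; the sample strings and the threshold of one half are arbitrary choices of mine):

# Hash a thousand distinct strings into the unit interval and count how
# many of them land below one half - we expect something close to 500.
n = 16
hashed = [(hash("item-%d" % i) % (2 ** n) + 1) / float(2 ** n) for i in range(1000)]
print(sum(1 for h in hashed if h < 0.5))  # typically somewhere around 500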

From this idea, we get a simple algorithm: we go through the input data, hash each element into a unit interval, and remember the minimum element. We estimate the number of unique elements using the previous formula. Python implementation:

from uuid import uuid4


def nhash(item, n):
    return hash(item) % (2 ** n)


def unit_interval_hash(item, n):
    integer_hash = nhash(item, n) + 1
    return integer_hash / float(2 ** n)


def count_unique_using_minimum(values, n):
    numbers_from_unit_interval = (unit_interval_hash(item, n) for item in values)
    minimum = min(numbers_from_unit_interval)
    return int(1 / minimum) - 1


for _ in xrange(10):
    print count_unique_using_minimum((uuid4() for _ in xrange(10000)), 16)

The unit_interval_hash function returns a number from the interval (0, 1]. The count_unique_using_minimum function first converts all input values to such numbers, finds the minimum, and returns 1 / minimum - 1. Note that it doesn't matter how many identical values the dataset contains - for two identical values, the hash function returns the same rational number, which doesn't affect the resulting minimum. That is why this algorithm counts the number of unique elements.
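To illustrate the point about duplicates, we can feed the function the same ids several times over (a usage sketch reusing count_unique_using_minimum and uuid4 from the listing above; the repetition factor of ten is arbitrary):

ids = [str(uuid4()) for _ in range(1000)]
# Duplicates hash to the same number and cannot change the minimum,
# so both calls should print the same estimate.
print(count_unique_using_minimum(ids, 16))
print(count_unique_using_minimum(ids * 10, 16))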

What are the results? (There is an estimated cardinality on each line, but the correct cardinality is 10,000.)

5957
16384
13107
9362
65536
8192
21845
13107
21845
8192

Well, that's not great, is it? The algorithm ran over ten different sets of ids, each with ten thousand unique values, and the calculated estimates are mostly way off. If we averaged them, we'd get 18,352. Still pretty far off - on the other hand, look at the memory requirements: they are constant! To find the minimum, we just keep a single variable holding the current minimum, and that's it. So yeah, the estimate sucks, but then again, it costs almost nothing.
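For completeness, the constant-memory claim in code (a sketch of mine, reusing unit_interval_hash from the listing above; it is not part of the original program):

def count_unique_streaming(values, n):
    # Walk through the stream and remember nothing but the current minimum.
    minimum = 1.0
    for item in values:
        minimum = min(minimum, unit_interval_hash(item, n))
    return int(1 / minimum) - 1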

Improving

We can see that the algorithm is quite unstable - one time it returns a pretty good estimate (9362), the next time an estimate that is completely off (65536). How do we get out of this? We can help ourselves by splitting the input data into several groups, tracking a minimum for each group, and finally averaging everything nicely. We can group by the first character of the id: one group for ids starting with the letter "a", another for those starting with the letter "b", and so on. Code:

from uuid import uuid4


def nhash(item, n):
    return hash(item) % (2 ** n)


def unit_interval_hash(item, n):
    integer_hash = nhash(item, n) + 1
    return integer_hash / float(2 ** n)


def count_unique_using_minimum(values, n):
    min_values = {}
    for item in values:
        first_char = str(item)[0]
        hashed_value = unit_interval_hash(item, n)
        min_values[first_char] = min(min_values.get(first_char, 1), hashed_value)
    average_minimum = sum(min_values.values()) / len(min_values)
    return (int(1 / average_minimum) - 1) * len(min_values)


for _ in xrange(10):
    print count_unique_using_minimum((uuid4() for _ in xrange(10000)), 16)

The changes are in the count_unique_using_minimum function. In the min_values dictionary we keep a minimum for each group. At the end, we add up all the minima and divide them by the number of groups, which gives us the average minimum. The expression 1 / average_minimum - 1 then estimates the average number of unique values per group, so we multiply it by the number of groups to get the estimated number of unique values in the whole dataset. And how does this modified procedure work?
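Before looking at the output, here is the averaging step on a few made-up minima (the dictionary below is purely illustrative, it did not come out of the code above):

# Suppose three groups ended up with these (made-up) minima:
min_values = {"a": 1.0 / 128, "b": 1.0 / 64, "c": 1.0 / 256}

average_minimum = sum(min_values.values()) / len(min_values)
per_group = int(1 / average_minimum) - 1   # estimated uniques in one group (108 here)
print(per_group * len(min_values))         # estimate for the whole dataset (324 here)

The actual run over the uuid data then prints: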

11312
7504
10560
8160
9344
12688
10976
11616
16240
6832

Well, it's still not great, but it's certainly better - the algorithm is more stable. We can modify the function so that the groups are formed not by the first character but, say, by the first two characters of the id, i.e. like this:

first_char = str(item)[0:2] # ToDo: choose better name for variable

The results would then be like this:

9728
9728
9984
10240
8960
10752
9728
9472
11264
9472

Well now, that's not too bad, is it? If we then try a hundred thousand unique values, group by the first two characters and set n = 24, we get these results:

103936
110080
102656
101120
101376
108288
97536
97024
107776
92416

Not bad. But there are still better algorithms, don't worry.

And what's the memory footprint? We keep one minimum per group, nothing more. If our ids are in hex format, that means at most 16^x minima, where x is the number of characters that make up the group key. One minimum is a single rational number, i.e. a double or a float.
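For example, with a two-character group key and assuming the usual 8 bytes per double, that is at most

$$16^2 \cdot 8\,\text{B} = 2048\,\text{B} \approx 2\,\text{kB}$$

of state, no matter how large the dataset is.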

A final note: we could afford to split the input into groups this way because the input ids behaved like random strings. If the input weren't random strings but data in some fixed format, we would have to do the splitting differently. For example, if every id began with the prefix userid-, grouping by the first character obviously wouldn't work very well, would it?
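One way to do that splitting more robustly (a sketch of my own, not part of the code above) is to derive the group key from the hash itself instead of from the raw id, so a shared prefix like userid- cannot push everything into one group. In the grouped version above, first_char = str(item)[0] would then be replaced by something like:

def group_key(item, buckets=16):
    # Take high-order bits of the hash for the bucket, leaving the low
    # bits (used by nhash above) to determine the per-group minimum.
    return (hash(item) >> 24) % buckets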