Counting unique items fast

Analytics systems count. Trivial as this may sound, implementing one is far from easy. Indeed, Nathan Marz, creator of Apache Storm, tweeted thus:
90% of analytics startups: 1. Find something new to count that no one else is counting 2. Raise $10M

In this blog post and the next, I will try to summarise a couple of ways I have learnt to do this efficiently at scale for the specific use case of counting unique items.

Consider a standard use case for several ad-tech companies: counting the number of unique devices they get to see, commonly referred to as audience count. Or a web analytics use case: counting the number of users who have seen a particular web page. One can also slice and dice these counts to provide more context, for example, how many unique users are viewing an article from a browser versus a mobile app. Typically, these counts are collected from events captured by an analytics system. Analytics products then expose them to their customers, along with context, trends and so on, to form a basis for decision making. A recent example that illustrates this well is the Parse.ly blog on their new Analytics features for publishers.

Technically, the difficulty in counting uniques as opposed to counting occurrences in events is that while the latter is a simple counter, the former is, at the core, a set cardinality operation. A user visiting a web page twice should be counted as only one user, as opposed to two views. To get the set cardinality, one needs to manage the set. At today’s BigData scale:

Sets are getting larger. Hundreds of millions of unique users and upwards does not raise eyebrows that much anymore.
Sets are updated very fast. Sites are processing several thousands and upwards of queries per second, each of which needs to update the set.
Users are demanding more. They expect to see the updates to these counts as quickly as possible.
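
The difference between counting occurrences and counting uniques can be made concrete with a minimal sketch. The event data and variable names here are made up for illustration. Note how the view count needs one integer per page, while the exact unique count has to keep every user ID it has seen:

```python
from collections import Counter

# Hypothetical events: (user_id, page) pairs captured by an analytics system.
events = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/home"),     # a repeat visit by the same user
    ("alice", "/pricing"),
]

views = Counter()   # occurrences: a simple counter per page
seen_users = {}     # uniques: a set of user IDs per page

for user_id, page in events:
    views[page] += 1
    seen_users.setdefault(page, set()).add(user_id)

# The unique count is the cardinality of each set.
unique_counts = {page: len(users) for page, users in seen_users.items()}

print(views["/home"], unique_counts["/home"])
```

For "/home" this gives three views but only two unique users. The sets are what grow with the audience, and they are the part that becomes expensive at scale.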

Technical Approaches

In technical terms, the unique counts exposed by these analytics can be treated as views of the events that drive them. Data warehousing and batch-mode BigData processing solutions, powered by frameworks like Hadoop, have traditionally separated the collection of event streams from the generation of these views. More contemporary approaches propose a change: the processing of the event streams creates these ‘materialised views’ directly.

At the core of these more recent approaches are streaming solutions and techniques that operate on incoming event streams and update state for views that expose this information in near real time. There are established stream processing engines that provide the framework and API to consume large event streams in a scalable fashion, such as Apache Storm, Apache Spark, Apache Samza and, more recently, Apache Flink. Generating the view, though, is still a problem the developers need to solve themselves. And their solution still needs to meet all the requirements that the stream processing frameworks themselves meet.

In order to solve our uniques problem, we need to maintain a set of n identities, where n is very large and the set is updated very fast. Also, counting these n identities needs to happen with very low (sub-second) latencies. It is easy to see that for large n, solving this problem conventionally is expensive: an exact set needs memory proportional to n, since it has to remember every identity it has seen.

HyperLogLog

One novel approach to address these constraints has been available for a good amount of time, although it is not very well known. This is the concept of maintaining sketches. Informally, a ‘sketch’ is a data structure that summarises large volumes of data into very small amounts of space so as to provide approximate answers to queries about the data with extremely low latency and well-defined error bounds. There are several different sketches, and one family of them deals with counting unique or distinct values.

The specific one I discuss here is a sketch called HyperLogLog (HLL). There are several great articles on the web that describe a HLL. The one I found most intuitive to follow from a layman's perspective was from Neustar, previously AggregateKnowledge.

Here is my attempt to summarise and paraphrase the black magic of HLL (although I strongly recommend reading the original article):

Say we have a good hash function that converts the item we want to add to a set into a string of bits.
By counting the number of consecutive zero bits at the start of the hash, we can *estimate* the size of the set. The intuition mentioned in the Neustar link above is that counting a run of zeros is somewhat like counting the number of consecutive heads we get when tossing a coin. The longer the longest run of heads we have seen, the more times we can guess the coin has been tossed.
We improve the estimate using a procedure called stochastic averaging, in which we maintain not one but many such estimates and take a harmonic mean of these. In order to maintain multiple estimates, we split the hash into two parts: a prefix that selects the bucket holding an estimate, and a suffix in which we count the consecutive zero bits.
There are also some corrections to make the estimates more accurate in cases where the buckets are too empty or too full.
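
The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Redis implementation. The class name is hypothetical, SHA-1 truncated to 64 bits stands in for "a good hash function", the bias constant comes from the published HyperLogLog algorithm, and only the "too empty" correction (linear counting) is included:

```python
import hashlib
import math


class HyperLogLog:
    """Toy HLL with 2**p buckets; each bucket remembers the longest zero run seen."""

    def __init__(self, p=12):
        self.p = p                    # prefix bits used as the bucket index
        self.m = 1 << p               # number of buckets (keep m >= 128)
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Assumption: any well-mixed 64-bit hash would do here.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        suffix_bits = 64 - self.p
        bucket = h >> suffix_bits                  # prefix selects the bucket
        suffix = h & ((1 << suffix_bits) - 1)      # suffix is where zeros are counted
        # Rank = number of leading zero bits in the suffix, plus one.
        rank = suffix_bits - suffix.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias correction, valid for m >= 128
        # Harmonic mean of the per-bucket estimates 2**rank, scaled by m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * m and empty:
            # Too many empty buckets: fall back to linear counting.
            return m * math.log(m / empty)
        return raw


hll = HyperLogLog()
for i in range(1000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")    # duplicates do not change the sketch
print(round(hll.count()))
```

With 4096 buckets the expected standard error is roughly 1.04 / sqrt(4096), or about 1.6%, so the printed estimate should land close to 1000 without being exact.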

HyperLogLog in Redis

There are several libraries and systems that implement the HLL algorithm (so we are spared from having to implement one ourselves). The one I have used is the implementation in the awesome in-memory data structure server, Redis. The blog post on the Redis implementation of HyperLogLog is a classic in itself and is also a highly recommended read. The level of thought and work that went into an efficient implementation of HLL in Redis makes for great learning. The blog post quotes a standard error of 0.81% for the Redis HLL. I have actually seen lower errors in my tests.

Redis exposes the following APIs for manipulating the HLL set:

PFADD <key> <item> – adds the item to the HLL represented by the key.
PFCOUNT <key> – gives the estimate of the number of items added to the key.

The PF prefix stands for Philippe Flajolet, who is credited with a lot of work on HLL.
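
As a minimal sketch of how these two commands might be used from application code, the helpers below wrap them for the page-view use case. The function names and the `uniques:<page>` key scheme are hypothetical. The `client` argument is assumed to be a redis-py style client exposing `pfadd` and `pfcount`:

```python
def hll_key(page):
    # Hypothetical key naming scheme: one HLL per page.
    return f"uniques:{page}"


def record_view(client, page, user_id):
    """PFADD the user into the page's HLL; returns 1 if the sketch changed, else 0."""
    return client.pfadd(hll_key(page), user_id)


def unique_viewers(client, page):
    """PFCOUNT: the approximate number of distinct users recorded for the page."""
    return client.pfcount(hll_key(page))


# Usage against a local server (assumes the redis-py package is installed):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   record_view(client, "/home", "alice")
#   record_view(client, "/home", "alice")   # repeat visit, estimate stays the same
#   unique_viewers(client, "/home")
```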

Concerning the internals of HLLs in Redis, the following points are interesting:

The length of the prefix of the binary hash in Redis is 14 bits. That means it uses 2^14 = 16384 buckets.
The hash is 64-bit, hence the remaining 50 bits are where we look for the consecutive zero bits. This means we can represent the value of the counter using at most 6 bits (2^6 = 64 > 50). Hence the total space required for a HLL is bounded by 16384 x 6 bits = 96 Kbits = 12 KB, which is amazingly small for storing a very large cardinality. (Note that it is possible to lay out a bit array and index into the specific offset in this array representing a bucket.)
Although 12 KB is the maximum amount of memory required for a HLL, for smaller sets, the sizes are much smaller. Specifically, like with many other things in Redis, the underlying data structure has a dual representation, a memory efficient one for smaller sets (called ‘sparse’ representation) and the 12 KB one for larger sets (‘dense’ representation). This is particularly useful if you need to store lots of HLLs with low cardinality.
- You can use the command PFDEBUG ENCODING <key> to see what representation Redis is currently using for the key.
- The switch from sparse to dense encoding is controlled via a configuration parameter – hll-sparse-max-bytes – default 3000 bytes. The Redis documentation has more details on how to tune this parameter.
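
The arithmetic behind the 12 KB bound, and the idea of indexing into a packed bit array, can be sketched as follows. This is an illustration under my own assumptions about bit order, not necessarily the exact layout Redis uses. The function names are hypothetical, and 1.04 / sqrt(m) is the standard error formula from the published HyperLogLog algorithm:

```python
import math

REGISTER_BITS = 6
NUM_REGISTERS = 1 << 14                              # 16384 buckets from a 14-bit prefix
DENSE_BYTES = NUM_REGISTERS * REGISTER_BITS // 8     # 98304 bits = 12288 bytes = 12 KB
STANDARD_ERROR = 1.04 / math.sqrt(NUM_REGISTERS)     # 0.008125, i.e. about 0.81%
MASK = (1 << REGISTER_BITS) - 1


def _load(buf, byte):
    # Read up to two bytes so a register straddling a byte boundary is covered.
    high = buf[byte + 1] << 8 if byte + 1 < len(buf) else 0
    return buf[byte] | high


def get_register(buf, index):
    byte, shift = divmod(index * REGISTER_BITS, 8)
    return (_load(buf, byte) >> shift) & MASK


def set_register(buf, index, value):
    byte, shift = divmod(index * REGISTER_BITS, 8)
    word = _load(buf, byte)
    # Clear the 6-bit slot, then write the new value into it.
    word = (word & ~(MASK << shift)) | ((value & MASK) << shift)
    buf[byte] = word & 0xFF
    if byte + 1 < len(buf):
        buf[byte + 1] = (word >> 8) & 0xFF


registers = bytearray(DENSE_BYTES)
set_register(registers, 3, 50)
print(len(registers), get_register(registers, 3))
```

Every bucket costs exactly 6 bits regardless of how many items have been added, which is why the dense representation has a fixed size.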

The Good Parts with HLLs in Redis

There are some obvious benefits we can see with HLLs in Redis:

The 12 KB bounded size for a practically unbounded set (read: billions of items) is extremely memory efficient.
The operation PFCOUNT is fast enough for real time queries. Reading directly from this for front end dashboards is totally possible.

There are some subtler benefits, too:

The operation PFADD is quite fast too, as can be expected from the low latency, high throughput performance of Redis in general. This means that updates to the set represented by the HLL can happen in a streaming fashion. I have built data pipelines using Storm that add IDs to Redis HLL keys and operate with sub-second latencies (while doing a lot of other work too).
Since Redis executes commands on a single thread, concurrent PFADD calls are applied one at a time, so adding the same element to a HLL from different client threads works correctly.
Adding an element to a HLL is idempotent. Hence, when your stream processing framework follows at-least-once semantics and replays cause duplicate execution of the PFADD commands, we do not need to worry about consistency.
The value of a Redis HLL is an encoded String. Hence, it is possible to retrieve or dump the value as a set of bytes and load it into a different Redis server to get identical results. One can imagine that this would be staggeringly fast compared to having to re-add a billion items to another server.
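
A minimal sketch of the copy described in the last point, assuming two redis-py style clients that expose `get` and `set`. The function name is hypothetical. The clients are assumed to return raw bytes (that is, not created with `decode_responses=True`), because the HLL value is binary data:

```python
def copy_hll(source, target, key):
    """Copy the raw HLL value for `key` from one Redis server to another."""
    raw = source.get(key)      # the HLL is stored as an ordinary Redis string
    if raw is None:
        return False           # nothing stored under this key
    target.set(key, raw)       # PFCOUNT on the target should now match the source
    return True


# Usage (assumes the redis-py package and two reachable servers):
#   import redis
#   copy_hll(redis.Redis(host="src-host"), redis.Redis(host="dst-host"), "uniques:/home")
```

The cost of the copy is bounded by the size of the sketch (at most about 12 KB per key), not by the number of items that were added to it.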

Set Operations in HLLs

As mentioned above, HLLs are sketches for sets. When modelled like this, one naturally wonders whether set operations are possible. From a use case perspective, this certainly makes sense.

Imagine we are maintaining daily unique users in a set. Can I combine these sets to get weekly or monthly unique users? (akin to a rollup operation)
Imagine I have a set of users who have visited a specific web page, and another set of users who are from a particular locality. Can I combine these two sets to see which users from that locality visited the web page? (akin to a slice operation)

In set theoretic terms, the first of these would be a union of existing HLLs, while the second is an intersection. It turns out that unions of HLLs are possible, but intersections need more work. I will explore these operations in a following blog post.
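
As a small preview of why unions are the easy case, here is a minimal sketch, assuming each HLL is represented as a plain list of per-bucket values and that all sketches use the same number of buckets. The function name is hypothetical. Since each bucket stores the longest zero run seen so far, the bucket for a union is simply the larger of the corresponding buckets:

```python
def merge_registers(*sketches):
    """Union of HLL sketches: keep the largest value seen for each bucket."""
    if len({len(s) for s in sketches}) > 1:
        raise ValueError("all sketches must have the same number of buckets")
    return [max(values) for values in zip(*sketches)]


# Toy 4-bucket sketches standing in for two daily HLLs.
monday = [1, 0, 3, 2]
tuesday = [2, 0, 1, 2]
weekly = merge_registers(monday, tuesday)
print(weekly)
```

The merged list is the same sketch we would have obtained by adding both days' users to a single HLL, so the usual estimate can be computed from it. In Redis, the PFMERGE command performs this kind of union on the server.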