Hashes power Probabilistic Data Structures

Hash functions are used all over computer science, but here I want to highlight their usefulness within probabilistic data structures and algorithms. We all know that data structures are the building blocks of most algorithms: a bad choice can lead to hard, inefficient solutions instead of elegant, efficient ones.

As Linus Torvalds once said, talking about Git:

…In fact, I’m a huge proponent of designing your code around the data, rather
than the other way around, and I think it’s one of the reasons git has been fairly successful (*).

Moreover:

(*) I will, in fact, claim that the difference between a bad programmer
and a good one is whether he considers his code or his data structures
more important. Bad programmers worry about the code. Good programmers
worry about data structures and their relationships.


If you’re a computer scientist, engineer, or programmer, you’re well aware of data structures at some level of detail. A ten-minute chat with any programmer will probably involve hearing about linked lists, trees, heaps, maps, etc.

I’ve had plenty of discussions with colleagues and friends about how important it is to master those data structures, depending on what your work is about. If you’re an academic, or you work at a company where data volume is a real issue, then you should master them.

Most of the time, if you’re programming in a high-level language, it’s a code smell to reinvent the wheel and write your own implementation of well-known algorithms. Most frameworks have those data structures built in, or there are ultra-efficient libraries that, with high probability, will do a great job.

From another angle, most mortals maintain software, or work on problems where resource constraints aren’t really an issue. Being obsessed with performance or resource consumption as the only criterion for choosing data structures or algorithms isn’t very wise, particularly when it isn’t that relevant. Again, it depends on what problem you’re solving and how much it needs to scale.

But sometimes resource constraints really are an issue, and in that case you should consider an important tool in your toolbox: probabilistic data structures. They trade a bit of precision for quite impressive resource requirements. If you’re solving a small problem where memory isn’t limited, computation isn’t a bottleneck, and exact precision is a must, then you may think twice before using them. If that’s not the case, you should know they exist. Plus, they’re pretty fun to learn.

There are many books, papers, and articles on this topic. Here I’ll talk about two of them that I consider pretty beautiful and that are heavily used in real-world problems. I’ll use some math, but not in great detail, because there are plenty of excellent resources online. Even if you don’t follow the math, you’ll get the basic idea behind them (which is the goal here).

Bloom filters

If my memory serves me correctly, Bloom filters were the first probabilistic data structure I heard of, some years ago, while reading about some petascale database. As with many other probabilistic data structures, you become quite surprised by their effectiveness considering how little space and computation they need.

Say you have a big set of integers S, and you want to check whether S contains an element i. You could approach this problem with a simple linked list, with O(n) space and O(n) lookup time. You could also try a balanced binary tree, with O(n) space and O(log n) lookups, or, obviously, a hash map, with O(n) space and O(1) lookups.

It seems that most options require memory linear in the cardinality of S, which sounds pretty reasonable: if we don’t store the value of every element of S, how can we check whether an element exists in S?

The idea of Bloom filters is to encode the whole set S into a binary string of fixed length, which we’ll call Q. This is done by encoding every element of S within Q. As you may imagine, mapping a variable-size domain into a fixed-size one will eventually result in collisions.

Let’s see a canonical example of a Bloom filter. Say we define Q as a binary string with a fixed length of 64 bits. We also choose k different hash functions, such as MD5 or SHA-1. Next, we do the following:

  • Take the first element in S
  • Hash the element with the k hash functions and take each result modulo 64 to generate k indexes into Q
  • Set the bits of Q at those k indexes to 1
  • Repeat the two previous steps for every remaining element in S

When we finish we have a 64-bit value Q where some of the bits are set to 1 and others to 0.

Now we want to check whether an element is in S. We do exactly the same procedure as above to see which bits should have been set in Q. If any of the corresponding k bits is not set, we can be sure that the element isn’t in S. If they’re all set, we can only say that the element is probably in S, since those bits could also have been set, partially, by many other elements. In fact, as you keep adding elements to S, more and more bits of Q are set to 1, and that chance keeps growing.
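
To make the procedure concrete, here is a minimal sketch in Go. It isn’t a production implementation: Q is a single 64-bit word, as in the example above, and the k indexes are derived by hashing the element with k different seed bytes using the standard library’s non-cryptographic FNV-1a hash (the type name and seed scheme are just illustrative choices).

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // bloom64 is a toy Bloom filter backed by a single 64-bit word (Q in the text).
    type bloom64 struct {
        q uint64 // the filter itself
        k int    // number of hash functions
    }

    // indexes derives k bit positions in [0, 64) by hashing the element
    // prefixed with a different seed byte for each of the k hashes.
    func (b *bloom64) indexes(elem []byte) []uint {
        idx := make([]uint, b.k)
        for i := 0; i < b.k; i++ {
            h := fnv.New64a()
            h.Write([]byte{byte(i)}) // the seed byte makes each hash behave differently
            h.Write(elem)
            idx[i] = uint(h.Sum64() % 64) // modulo the filter length
        }
        return idx
    }

    // Add sets the k bits corresponding to elem.
    func (b *bloom64) Add(elem []byte) {
        for _, i := range b.indexes(elem) {
            b.q |= 1 << i
        }
    }

    // Contains reports whether elem is *possibly* in the set: a false answer
    // is always correct, a true answer may be a false positive.
    func (b *bloom64) Contains(elem []byte) bool {
        for _, i := range b.indexes(elem) {
            if b.q&(1<<i) == 0 {
                return false
            }
        }
        return true
    }

    func main() {
        f := &bloom64{k: 3}
        f.Add([]byte("42"))
        f.Add([]byte("1337"))
        fmt.Println(f.Contains([]byte("42"))) // true
        fmt.Println(f.Contains([]byte("7")))  // false (or, rarely, a false positive)
    }

Real implementations use a much larger bit array and a faster, better-mixing hash, but the add/check mechanics are exactly these.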

Bloom filters can produce false positives, but not false negatives.

If we increase the size of Q, we reduce collisions and thus the probability of false positives. The value k also plays a role in the probability of collisions: there’s a tradeoff between size(Q), k, and the false positive rate. If you’re interested in the math quantifying the false positive rate, you can read about it here. You can also see here how to choose the optimal size(Q) and k given #S or the desired false positive rate.
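
For reference, the standard analysis (which assumes the hash outputs are independent and uniformly distributed) gives these well-known approximations, where n is the number of elements in S, m is size(Q) in bits, and k is the number of hash functions:

    p ≈ (1 - e^(-k·n/m))^k        false positive rate after inserting n elements
    k_opt = (m/n) · ln 2          number of hash functions that minimizes p
    m ≈ -n·ln(p) / (ln 2)^2       bits needed to hit a target false positive rate p

As a rough rule of thumb that follows from these formulas, about 10 bits per element with k ≈ 7 gives a false positive rate of around 1%.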

There’s one more important consideration: choosing the hash functions that generate the indexes into Q. Earlier I mentioned MD5 or SHA-1, but those aren’t really wise choices. Cryptographic hash functions try to generate outputs that are infeasible to reverse, which isn’t a concern here. We’re interested in random-looking outputs and in making the computation as fast as possible, so there are better options.

Most implementations use a single hash function to generate the k desired outputs. In particular, they use MurmurHash: a small, fixed set of base MurmurHash outputs is calculated, and the k outputs are then generated by combining those base hashes. You can see here a popular Bloom filter implementation in Go.
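
As a hedged illustration of that trick (often called double hashing, after Kirsch and Mitzenmacher’s observation that two base hashes are enough), the sketch below derives the i-th index as h1 + i·h2 modulo the filter size. The two FNV variants here are just stand-ins for the differently seeded MurmurHash outputs a real implementation would use.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // kIndexes derives k bit positions in a filter of size m from only two
    // base hash values, instead of computing k independent hashes:
    // g_i(x) = h1(x) + i*h2(x) (mod m), for i = 0..k-1.
    func kIndexes(elem []byte, k int, m uint64) []uint64 {
        h1 := fnv.New64a() // first base hash (stand-in for a seeded MurmurHash)
        h1.Write(elem)
        a := h1.Sum64()

        h2 := fnv.New64() // second base hash
        h2.Write(elem)
        b := h2.Sum64()

        idx := make([]uint64, k)
        for i := 0; i < k; i++ {
            idx[i] = (a + uint64(i)*b) % m
        }
        return idx
    }

    func main() {
        fmt.Println(kIndexes([]byte("42"), 3, 64)) // three indexes in [0, 64)
    }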

There’s another probabilistic data structure, known as the count-min sketch, which estimates the frequency of every item in a set. The idea is quite similar to how Bloom filters work, so you may be interested in taking a look.
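
For intuition, here is a hedged sketch of that idea (the names and parameters are illustrative): a count-min sketch keeps d rows of w counters, hashes each item once per row with a different seed, increments the matching counter in every row, and answers frequency queries with the minimum across rows, which can overestimate but never underestimate.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // cmSketch is a toy count-min sketch: d rows of w counters.
    type cmSketch struct {
        rows [][]uint64
        w    uint64
    }

    func newCMSketch(d int, w uint64) *cmSketch {
        rows := make([][]uint64, d)
        for i := range rows {
            rows[i] = make([]uint64, w)
        }
        return &cmSketch{rows: rows, w: w}
    }

    // col hashes the item with a per-row seed, modulo the row width.
    func (s *cmSketch) col(row int, item []byte) uint64 {
        h := fnv.New64a()
        h.Write([]byte{byte(row)})
        h.Write(item)
        return h.Sum64() % s.w
    }

    // Add increments one counter per row for the item.
    func (s *cmSketch) Add(item []byte) {
        for r := range s.rows {
            s.rows[r][s.col(r, item)]++
        }
    }

    // Count returns the minimum counter across rows: an upper bound on the
    // true frequency that is usually close to it.
    func (s *cmSketch) Count(item []byte) uint64 {
        min := s.rows[0][s.col(0, item)]
        for r := 1; r < len(s.rows); r++ {
            if c := s.rows[r][s.col(r, item)]; c < min {
                min = c
            }
        }
        return min
    }

    func main() {
        s := newCMSketch(4, 1024)
        for i := 0; i < 3; i++ {
            s.Add([]byte("apple"))
        }
        s.Add([]byte("pear"))
        fmt.Println(s.Count([]byte("apple")), s.Count([]byte("pear"))) // ~3 ~1
    }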

If you’re interested in Ethereum, Bloom filters are used to check whether a block contains logs related to certain topics. In Ethereum, topics are related to events and indexed parameters. A light client, which doesn’t store any data about world state, transactions, or receipts, can check very quickly whether a block contains logs related to any topic of interest. If the Bloom filter check matches, considering the small chance of false positives, we can be pretty sure that the block contains a log entry for the queried topic. The occasional cost of a false positive is far outweighed by the cost of analyzing all the receipts of all the transactions of all the blocks.

HyperLogLog

HyperLogLog is a refinement of previous ideas like LogLog and Linear Counting, which tackle the count-distinct problem.

Suppose you want to know the cardinality of a big set S, i.e., how many distinct elements are in S, and you want to do it using an essentially constant amount of memory and constant time per element. Quoting from the original paper where this idea was presented:

For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.

The formal math behind this data structure is more complex than in the case of Bloom filters, but the main idea behind it is rather simple.

Imagine you have a collection of 8000 randomly generated binary strings. How many of them would we expect to have at least 3 leading zeros? Well, the probability of having at least 3 leading zeros is 1/8, so we can estimate that approximately 1000 of them would. Of course, since this is a random process, anywhere from 0 to 8000 strings could satisfy this property, but the probabilities of those cases matter. More generally, if the maximum number of leading zeros we observe is n, then it seems reasonable to estimate the cardinality as 2^n. This is the same idea as flipping a coin 100 times and saying that we’ll roughly see 50 heads and 50 tails.
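
If you want to convince yourself of this intuition, here is a quick toy simulation in Go (not part of HyperLogLog itself): generate 8000 random 64-bit values, count how many have at least 3 leading zeros, and compare 2 raised to the maximum leading-zero count against the number of values generated.

    package main

    import (
        "fmt"
        "math/bits"
        "math/rand"
    )

    func main() {
        const n = 8000
        atLeast3 := 0
        maxZeros := 0
        for i := 0; i < n; i++ {
            v := rand.Uint64()
            z := bits.LeadingZeros64(v)
            if z >= 3 {
                atLeast3++
            }
            if z > maxZeros {
                maxZeros = z
            }
        }
        // Roughly n/8 values should have at least 3 leading zeros,
        // and 2^maxZeros should be in the ballpark of n.
        fmt.Println("with >= 3 leading zeros:", atLeast3) // ~1000
        fmt.Println("max leading zeros:", maxZeros)       // ~13 on average
        fmt.Println("2^maxZeros estimate:", 1<<maxZeros)
    }

Running it a few times shows the 2^maxZeros estimate jumping around by factors of two, which is exactly the variance problem discussed next.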

When you dive into the details, you’ll soon realize that variance is a problem, and a big one, since every unit of error has an exponential impact on the estimation. Put differently, if by chance we see K+1 leading zeros instead of K, our estimate doubles. The idea used to improve this is to divide the set into multiple subsets and use a mean of the maximum leading-zero counts found in each subset.

One evolution from LogLog to HyperLogLog is changing the type of mean used in the estimation, to control sensitivity to outliers: LogLog uses the arithmetic mean, whereas HyperLogLog uses the harmonic mean. Moreover, bias-correction coefficients are applied to reduce the remaining bias even further.

Repeating the experiment by dividing the set into multiple subsets is great, but it creates another problem: if the cardinality of the set is too small, we won’t have enough data in each subset to make the statistics work. This is tackled simply by detecting that case and switching to another technique that works better in that range (typically Linear Counting, mentioned above).

As with Bloom filters, every element is hashed to transform it into a fixed-length binary string, from which we can follow the logic above. Again, MurmurHash functions are widely used in implementations.
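
Putting the pieces together, here is a hedged, toy version of the idea in Go. It skips the small- and large-range corrections of the real algorithm, and the bucket count, register type, and hash choice are illustrative: each element is hashed, the first few bits of the hash pick a bucket, the remaining bits contribute their leading-zero count, and the buckets are combined with a harmonic mean.

    package main

    import (
        "fmt"
        "hash/fnv"
        "math"
        "math/bits"
    )

    const b = 10     // bits used to pick a bucket
    const m = 1 << b // number of buckets (subsets)

    // hll is a toy HyperLogLog-style estimator: one register per bucket holding
    // the maximum "leading zeros + 1" seen for hashes landing in that bucket.
    type hll struct {
        reg [m]uint8
    }

    func (h *hll) Add(elem []byte) {
        f := fnv.New64a()
        f.Write(elem)
        x := f.Sum64()

        j := x >> (64 - b)                       // first b bits: bucket index
        w := x << b                              // remaining bits
        rho := uint8(bits.LeadingZeros64(w)) + 1 // position of the leftmost 1-bit
        if rho > h.reg[j] {
            h.reg[j] = rho
        }
    }

    func (h *hll) Count() float64 {
        // Harmonic mean of 2^reg[j] across buckets, scaled by m and the
        // bias-correction constant from the HyperLogLog paper (valid for m >= 128).
        alpha := 0.7213 / (1 + 1.079/float64(m))
        sum := 0.0
        for _, r := range h.reg {
            sum += math.Pow(2, -float64(r))
        }
        return alpha * float64(m) * float64(m) / sum
    }

    func main() {
        var h hll
        for i := 0; i < 100000; i++ {
            h.Add([]byte(fmt.Sprintf("user-%d", i)))
        }
        fmt.Printf("estimated distinct elements: %.0f\n", h.Count()) // roughly 100000
    }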

Bottom line

We can appreciate that in all these cases hash functions play an essential role. Elegantly, they provide several properties that are pretty useful for probabilistic data structures and algorithms:

  • They transform non-uniformly distributed data into uniformly distributed data, which gives a starting point for probabilistic reasoning.
  • They give data a universal identity, which leads to automatic deduplication and helps in solving problems like count-distinct, as long as operations are designed to be idempotent.

Taking advantage of the fact that irreversibility isn’t a requirement, non-cryptographic hash functions become an option; they can be faster, which helps the overall speed of the algorithm.

