Demystifying HyperLogLog: How to Count Billions of Things in Kilobytes of Memory

Imagine you are running a rapidly growing website like backendmesh.com. One morning, a post goes viral. Traffic skyrockets. Millions of clicks are pouring in, and your boss asks a simple question:

“How many unique visitors did we get today?”

Introduction

In the early days of the web, this was an easy question to answer. You would just track everyone’s User ID or IP address, drop them into a database set, and count them. But when you are dealing with millions, billions, or trillions of data points, that simple approach breaks down completely.

HyperLogLog (HLL)—one of the most elegant, mind-bending data structures in computer science. It solves the massive scale problem by doing something radical: it stops trying to be perfect.

The Massive Scale Problem

Before we look at the solution, let’s look at why standard counting fails at scale.

If you want an exact count of unique items (called cardinality), you have to remember every single item you’ve already seen so you don’t count duplicates. Usually, you do this with a Set.

Traditional Set: ["user_123", "user_456", "user_123", "user_789"] 
-> Automatically de-duplicates to: ["user_123", "user_456", "user_789"]
-> Total Unique Count: 3

Traditional Set: ["user_123", "user_456", "user_123", "user_789"] 
-> Automatically de-duplicates to: ["user_123", "user_456", "user_789"]
-> Total Unique Count: 3

Python

This works perfectly when you have thousands of users. But what happens at big-data scale?

Suppose you have 100 million unique users.
If each User ID is a string that takes up roughly 32 bytes of memory, your Set will consume about 3.2 Gigabytes of RAM just to hold those IDs.
If you want to track unique users per hour, per day, or across different features, you quickly need terabytes of expensive memory just to count things.

The HyperLogLog Magic

HyperLogLog is a probabilistic data structure. It doesn’t store the actual items. Instead, it observes the data as it flies by, flips some metaphorical coins, and keeps a tiny, highly compressed notebook of what it saw.

Instead of 3.2 Gigabytes of RAM, HyperLogLog can estimate a 100-million-user count with an accuracy of around 99% using less than 12 Kilobytes of memory. That is a 250,000x reduction in memory usage!

The Core Intuition: The Coin Flip Trick

To understand the math behind HyperLogLog, forget about computers for a second. Imagine I hand you a coin and put you in a room. I tell you to flip the coin repeatedly until you get “Heads,” write down how many “Tails” you got in a row before that breakthrough, and repeat the experiment a few times.

Later, I come back into the room. You don’t tell me how many total times you flipped the coin. You only tell me your longest streak of consecutive tails.

Scenario A: You say, “My longest streak was 1 Tails.” (e.g., T, H)
- My guess: You probably only ran the experiment a couple of times.
Scenario B: You say, “My longest streak was 10 Tails in a row.” (e.g., TTTTTTTTTT, H)
- My guess: You must have been in this room flipping coins for hours! Getting 10 tails in a row is highly unusual.

Statistically, the probability of flipping k tails in a row is (1/2)^k. If your longest streak is p, a solid mathematical estimate is that you probably performed roughly 2^p unique experiments.

This is the exact foundation of HyperLogLog.

A Step-by-Step Practical Example

Let’s see how a website actually applies this coin-flip logic to count unique visitors. We will simulate a miniature HyperLogLog that uses just 4 memory slots (buckets).

Step 1: Turning Users into Coin Flips (Hashing)

When a user visits your site, you take their ID and run it through a Hash Function. A good hash function takes any input and scrambles it into a random-looking string of bits (1s and 0s). Because it’s deterministic, the same user ID will always generate the same binary string.

Let’s look at three users visiting our site:

User A (user_9421) hashes to: 0100101100…
User B (user_3310) hashes to: 1100010111…
User C (user_5541) hashes to: 0110000010…

In these binary strings, we can treat a 0 like “Tails” and a 1 like “Heads”.

Step 2: The Spreadsheet of Buckets

If we only kept track of one coin-flipping streak, a single “lucky” user could throw off our entire estimate. To fix this, HyperLogLog splits its memory into multiple buckets (or registers).

We will use the first 2 bits of our hash to decide which bucket the user belongs to (2^2 = 4 buckets). The remaining bits will be used to count our “tails” (leading zeroes).

Our 4 buckets start at 0:

[ Bucket 0: 0 ]  [ Bucket 1: 0 ]  [ Bucket 2: 0 ]  [ Bucket 3: 0 ]

[ Bucket 0: 0 ]  [ Bucket 1: 0 ]  [ Bucket 2: 0 ]  [ Bucket 3: 0 ]

Python

Step 3: Processing the Visitors

User A Arrives (01001011…)
- Find Bucket: Look at the first 2 bits: 01. Binary 01 is the number 1. User A goes to Bucket 1.
- Count Zeroes: Look at the remaining bits: 001011…. Count the consecutive zeroes before the first 1. There are 2 leading zeroes.
- Update Bucket: Bucket 1 checks if 2 is bigger than its current value (0). It is! Bucket 1 updates to 2.

[ Bucket 0: 0 ]  [ Bucket 1: 2 ]  [ Bucket 2: 0 ]  [ Bucket 3: 0 ]

[ Bucket 0: 0 ]  [ Bucket 1: 2 ]  [ Bucket 2: 0 ]  [ Bucket 3: 0 ]

Python

User B Arrives (11000101…)

Find Bucket: First 2 bits are 11 (Decimal 3). User B goes to Bucket 3.
Count Zeroes: Remaining bits are 000101…. There are 3 leading zeroes before the first 1.
Update Bucket: Bucket 3 updates to 3.

[ Bucket 0: 0 ]  [ Bucket 1: 2 ]  [ Bucket 2: 0 ]  [ Bucket 3: 3 ]

[ Bucket 0: 0 ]  [ Bucket 1: 2 ]  [ Bucket 2: 0 ]  [ Bucket 3: 3 ]

Python

User A Returns (Duplicate Visit!): User A refreshes their browser. The system hashes their ID and gets the same string: 01001011….

It routes them to Bucket 1 and calculates 2 leading zeroes.
Bucket 1 already holds a 2. Since 2 is not greater than 2, nothing changes.
Result: Duplicates are completely ignored automatically without storing a single history log!

User C Arrives (01100000…)

Find Bucket: First 2 bits are 01 (Bucket 1).
Count Zeroes: Remaining bits are 100000…. The very first bit is a 1, meaning there are 0 leading zeroes.
Update Bucket: Bucket 1 compares its current value (2) with the new value (0). Since 2 is larger, it keeps the 2.

The Grand Finale: Calculating the Estimate

At the end of the day, your server doesn’t know who visited. It just looks at its tiny notebook containing 4 numbers: [0, 2, 0, 3].

To calculate the estimate, HyperLogLog doesn’t just take a normal average. If it did, that 3 (which represents a 1-in-8 rare event) would heavily distort the reality of the other empty buckets. Instead, it uses the Harmonic Mean, which is a type of average that naturally ignores massive outliers and flukes.

The math scales up the values based on how many buckets are being used, applies a pre-calculated correction factor to eliminate statistical bias, and outputs the final answer. In a real-world scenario with thousands of buckets, the algorithm looks at the spread of numbers and says: “Based on these streaks, you have roughly 3 unique users.”

Why is it called “HyperLogLog”?

The name sounds intimidating, but it describes the two separate times logarithms are used to save memory:

Log #1: Instead of storing a massive user count number up to 4 billion, we only store the position of the bit where the first 1 appeared. Finding a bit position in a number requires a logarithm, log₂N.
Log #2: To store that bit position (which will never be higher than 64), we only need a tiny amount of physical RAM bits. Figuring out the size of that storage register requires a second logarithm, log₂(log₂N).

Because it uses the Log of a Log to compress data, it’s called a LogLog algorithm. The “Hyper” was added later when computer scientists upgraded the math from a standard geometric average to a harmonic average, making it radically more accurate.

When Should You Use It?

HyperLogLog is an incredible engineering trade-off, but it isn’t a silver bullet.

The Pros:

Fixed Memory Size: In production tools like Redis, a HyperLogLog structure uses a maximum of 12 KB, regardless of whether you throw 5,000 or 5,000,000,000 items at it.
Extremely Fast: Adding an item and checking the count both take O(1) constant time—fractions of a millisecond.
Perfect for Merging: You can instantly merge Monday’s HLL and Tuesday’s HLL using a simple bitwise comparison to get a unique count for the whole weekend.

The Cons:

It’s an Estimate: Standard implementations have a ~0.81% margin of error. If you are handling financial ledgers or ticketing systems where every single unit must be perfect, you cannot use HLL.
No Retrieval: You can ask HLL, “How many unique people came?”, but you can never ask it “, What were the names or IDs of the people who came?”

A Simple Rule of Thumb for When to Use It

Use a Traditional Set/Count if you are building features like a financial ledger, inventory tracking, or an app where an exact number is legally or operationally critical.
Use HyperLogLog if you are building an analytics dashboard, tracking trending hashtags, counting active game sessions, or running any feature where seeing 1,005,200 instead of exactly 1,000,000 is perfectly acceptable in exchange for massive hardware cost savings.

Real-life applications

When a major event happens (like a sports championship or a breaking news story), platforms like X (Twitter) or Instagram need to figure out what is trending right now.

The Challenge: Millions of people are tweeting or posting every minute. To find out if a hashtag like #WorldCup is genuinely trending, the platform can’t just count the total number of posts—a bot network or a few spam accounts could post it 50,000 times to game the system. The platform needs to know how many unique, individual users are using the hashtag.
The HLL Solution: The system assigns a tiny HyperLogLog register to every active hashtag. Every time a user posts, their User ID is added to that hashtag’s HLL. The platform can instantly query the HLLs to see which hashtags have the highest unique user counts and push them to the “Trending” sidebar in real-time without melting their servers.

Cyber Security: Detecting DDoS Attacks

A Distributed Denial of Service (DDoS) attack happens when a hacker uses a botnet (thousands of compromised devices) to flood a network or server with fake traffic, knocking it offline.

The Challenge: Network routers process billions of packets of data every second. To spot an attack, security systems look for an abnormal spike in the number of unique source IP addresses connecting to a single server. Storing every IP address passing through a router to find duplicates would instantly crash the router’s limited memory.
The HLL Solution: Modern hardware routers use HyperLogLog embedded directly into their chips. The router tracks the cardinality of incoming IPs over a rolling 10-second window. If the HLL count suddenly spikes from a normal 2,000 unique IPs to 500,000 unique IPs, the system flags a DDoS attack instantly and automatically triggers defence protocols.

Streaming giants like Spotify or Netflix rely heavily on daily metrics to run their recommendation engines and calculate internal performance.

The Challenge: If Spotify wants to know how many unique users listened to a specific podcast episode today to rank it on a chart, running a massive database search across hundreds of millions of stream logs every hour is incredibly slow and expensive.
The HLL Solution: Every song, podcast, and video has an associated HLL structure for the day. When you hit play, your User ID is tossed into that track’s HLL. This allows Spotify to calculate its global charts almost instantly. Furthermore, they can merge twenty-four-hour HLL keys using PFMERGE to see the total Daily Active Users (DAU) for the entire platform in milliseconds.

Ad Tech: Frequency Capping

Online advertising networks (like Google Ads or Meta Ads) charge advertisers based on impressions (views). However, advertisers don’t want to waste money showing the same ad to the same person 50 times in one day. This limit is called a frequency cap.

The Challenge: When an ad space is loading on a webpage, the ad server has a window of about 10 to 50 milliseconds to decide which ad to show you. It needs to check if you have seen a specific ad campaign before. It cannot query a massive relational database of trillions of historical views in 10 milliseconds.
The HLL Solution: Ad platforms maintain an HLL structure for each ad campaign, which maps out which users have seen it. By using the HLL, the ad engine can quickly estimate whether the current user is a “new” unique viewer or a duplicate, ensuring the ad budget is distributed effectively across a diverse audience at lightning-fast speeds.

How to use?

HyperLogLog is a conceptual data structure; you don’t typically code the math from scratch. Instead, you use existing, highly optimised tools that have HLL built right into them.

The most common, real-world way to use HyperLogLog is through Redis (an in-memory database) or modern SQL databases.

Here is a practical guide on how to implement it using Redis, which provides HLL out of the box with three simple commands:

PFADD: Adding Elements
PFCOUNT: Getting the Count
PFMERGE: Combining

Summary

HyperLogLog is a testament to clever engineering. By accepting a tiny, predictable margin of error, it leverages basic probability theory to bypass the physical memory limitations of modern servers—proving that sometimes, a brilliant estimate is vastly superior to an expensive exact answer.

Demystifying HyperLogLog: How to Count Billions of Things in Kilobytes of Memory

Table of Contents

Introduction

The Massive Scale Problem

The HyperLogLog Magic

The Core Intuition: The Coin Flip Trick

A Step-by-Step Practical Example

The Grand Finale: Calculating the Estimate

Why is it called “HyperLogLog”?

When Should You Use It?

A Simple Rule of Thumb for When to Use It

Real-life applications

Cyber Security: Detecting DDoS Attacks

Ad Tech: Frequency Capping

How to use?

Summary

Resources

About Puneet Verma

1 thought on “Demystifying HyperLogLog: How to Count Billions of Things in Kilobytes of Memory”

Leave a Comment Cancel reply

Table of Contents

Introduction

The Massive Scale Problem

The HyperLogLog Magic

The Core Intuition: The Coin Flip Trick

A Step-by-Step Practical Example

The Grand Finale: Calculating the Estimate

Why is it called “HyperLogLog”?

When Should You Use It?

A Simple Rule of Thumb for When to Use It

Real-life applications

Social Media: Trending Hashtags and Topics

Cyber Security: Detecting DDoS Attacks

Streaming Platforms: “Trending Now” & Daily Active Users (DAU)

Ad Tech: Frequency Capping

How to use?

Summary

Resources

About Puneet Verma

1 thought on “Demystifying HyperLogLog: How to Count Billions of Things in Kilobytes of Memory”

Leave a Comment Cancel reply