Why a Simple Counter Breaks at Scale

At a small scale, a counter is nothing. Increment a number, store it, done. But the moment your product goes viral — a tweet blowing up, a video trending, a flash sale going live — that innocent little number becomes a thundering bottleneck that can bring your entire system down. Sharded counters are the standard solution, and this guide explains not just how they work, but why every design decision is made the way it is.

The Problem

Why a Single Counter Breaks at Scale?

Before understanding sharded counters, you need to understand why a naive counter fails. Without this foundation, sharding just looks like unnecessary complexity.

Imagine you store a like count as a single row in a database:

tweet_likes WHERE tweet_id = 123  →  count = 9,847

tweet_likes WHERE tweet_id = 123  →  count = 9,847

Python

Every time someone likes that tweet, your system does:

Read the current value
Add 1
Write the new value back

This looks fine. But in a distributed system, step 1→3 is not instantaneous. Two servers can read 9847 simultaneously, both compute 9,848, and both write back 9,848 — You just lost a like. This is a race condition.

To prevent it, databases use row-level locking: only one writer can touch that row at a time. Everyone else queues up and waits. Now your counter is correct, but under high load, you have thousands of writers hammering a single row, forming an ever-growing queue. This is called write contention, and it is the killer.

Why not use a queue?

You could serialise all writes through a single queue and process them one at a time. This prevents race conditions, but it doesn’t eliminate the bottleneck — it just moves it. The queue processor is still a single sequential path, and your throughput ceiling remains the same. You’ve added infrastructure complexity without solving the core problem.

You might be thinking of using Redis

Redis solves the atomicity problem differently. Because Redis processes commands in a single thread, INCR is inherently atomic — no locks, no compare-and-swap, no transaction overhead:

INCR tweet:123:likes   # atomic, single-threaded, microseconds

INCR tweet:123:likes   # atomic, single-threaded, microseconds

Python

This is dramatically faster than a relational database. A Redis INCR takes 10–100 microseconds vs. 1–10ms for a transactional database write — roughly a 100x improvement. For moderate traffic, a single Redis key is perfectly fine.

But Redis is still a single key on a single node. Under extreme write load — tens of thousands of writes per second hitting the same key — even Redis becomes a bottleneck. The single-threaded model that makes INCR atomic also means every write is sequential. You can’t parallelise writes to the same key.

The root problem is always the same: any design that forces all writes to coordinate through a single point will hit a ceiling. The solution is to eliminate that single point entirely.

The Insight: Parallelism Beats Coordination

The sharded counter insight is elegant: instead of making writes faster, make writes independent.

If 100,000 writes per second are hitting one Redis key, what if you spread them across 100 keys? Now each key only sees 1,000 writes per second — comfortably within Redis’s throughput range. No write needs to know what any other write is doing. No coordination, no waiting, no queue.

This is the principle behind sharded counters: split one logical counter into N independent sub-counters (shards), each living at a different Redis key:

Logical Counter: "tweet:123:likes"  =  1,600 total

tweet:123:likes:shard:0  →  412
tweet:123:likes:shard:1  →  389
tweet:123:likes:shard:2  →  401
tweet:123:likes:shard:3  →  398

Logical Counter: "tweet:123:likes"  =  1,600 total

tweet:123:likes:shard:0  →  412
tweet:123:likes:shard:1  →  389
tweet:123:likes:shard:2  →  401
tweet:123:likes:shard:3  →  398

Python

On every write, pick one shard and increment it atomically—no coordination with other shards. On every read: Sum all shards. The total is the true count.

You’ve traded a single heavily-contended resource for many lightly-contended resources. The throughput scales linearly — double the shards, double the write capacity.

Why random, and not round-robin or sequential?

def increment(counter_id):
    shard_id = random(0, N)
    key = f"{counter_id}_shard_{shard_id}"
    db.atomic_increment(key, by=1)

def increment(counter_id):
    shard_id = random(0, N)
    key = f"{counter_id}_shard_{shard_id}"
    db.atomic_increment(key, by=1)

Python

Round-robin requires a shared atomic counter to track “which shard is next” — that counter itself becomes a contention point. You’d be solving the write contention problem by creating a new write contention problem.

Sequential (always shard 0, then shard 1…) has the same issue — you need coordination to decide the sequence.

Random requires no coordination at all. Each writer independently picks a shard with no knowledge of what other writers are doing. Over time, the law of large numbers ensures writes are distributed evenly across shards. It’s stateless, coordination-free, and scales to as many writers as you want.

Why is each shard write still atomic?

Even though shards are independent, each individual INCR operation must be atomic. Why? Because multiple writers can hit the same shard simultaneously. The atomicity is handled cheaply at the single-key level (a simple compare-and-swap or hardware-level atomic instruction), not through heavyweight row locking across a large table. The cost is dramatically lower than locking a hotspot row, because contention per shard is orders of magnitude lower.

Why must you read all shards?

def get_count(counter_id, N):
    keys = [f"{counter_id}:shard:{i}" for i in range(N)]
    values = redis.mget(*keys)           # single round trip via MGET
    return sum(int(v) for v in values if v)

def get_count(counter_id, N):
    keys = [f"{counter_id}:shard:{i}" for i in range(N)]
    values = redis.mget(*keys)           # single round trip via MGET
    return sum(int(v) for v in values if v)

Python

Because you don’t know which shards received writes. A write to shard 7 doesn’t update shards 0–6. The only way to know the true total is to ask every shard.

Redis’s MGET command fetches multiple keys in a single round trip. Without it, you’d need N separate GET commands — N network round-trips. With MGET, reading 100 shards costs the same network latency as reading 1 key. This makes the fan-out extremely cheap in practice, typically completing in under 2ms regardless of shard count.

Why is this a real cost?

Even with MGET, you’re asking Redis to deserialise and return N values. With 1,000 shards and 100,000 page loads per second, that’s 100 million key reads per second from Redis — a genuine load consideration. This is the core trade-off of sharded counters: write throughput scales up, but read complexity scales up with it. You are moving the load from the write path to the read path.

Why is this usually acceptable?

Writes are individual user events — they can’t be batched or cached away. Reads, on the other hand, can be cached aggressively. A single cached response can serve millions of users. This asymmetry is what makes the trade-off work in practice, which leads directly to the next section.

Choosing N: Why Shard Count Is the Critical Decision

Getting shard count right is not a detail — it is the most important parameter in the design.

Too few shards: You haven’t actually solved the problem. With 5 shards and 100,000 writes/sec, each shard still handles 20,000 writes/sec, which is still a bottleneck.

Too many shards: Read fan-out becomes expensive. Summing 10,000 shards per read — even with pipelining — adds latency and wastes I/O. The metadata to track which shards exist also grows.

The right intuition: Set N so that the per-shard write rate is comfortably within the database’s single-key throughput limit, with headroom for spikes.

Traffic Level	Expected Write Rate	Shard Count
Low (normal post)	< 100/sec	5–10
Medium (popular post)	1,000–10,000/sec	50–100
High (viral tweet)	10,000–100,000/sec	200–500
Extreme (YouTube video)	100,000+/sec	1,000+

Why do real systems use dynamic shard counts?

Static shard counts are a blunt instrument. A video that gets 50 views per day doesn’t need 500 shards — that’s 500 storage keys consuming space and requiring fan-out reads. But a video that suddenly goes viral needs shards added right now, not after the DBA adjusts a config.

YouTube, Twitter, and similar systems use dynamic sharding: a counter starts with a small number of shards. A monitoring system watches per-shard write rates. If any shard’s write rate exceeds a threshold, new shards are provisioned automatically. This balances storage efficiency with throughput headroom.

Why is shrinking shards harder than growing them?

Adding shards is safe — new writes go to new shards, old shards continue to hold historical counts. Removing shards requires migrating their counts elsewhere, which requires a careful multi-step process to avoid losing data during the transition. This asymmetry means systems tend to over-provision and rarely shrink, which is a known operational trade-off.

Shard Selection Deep Dive: Why Each Strategy Has a Fatal Flaw at Scale

Random Selection

Why it works: Zero coordination, perfect load distribution over time, completely stateless writers. Each writer independently calls random(0, N) and INCRs that key. Nothing shared, nothing to fail.

Why it has a subtle problem: If you need to prevent double-counting (a user should only be able to like a post once), random selection makes deduplication hard. To check if a user has already been incremented, you’d need to scan all shards — N reads before every write. This destroys the performance advantage. Random sharding, therefore, works best when deduplication is handled separately via a Redis Set or Bloom filter, outside of the increment path.

User ID Hashing

shard_id = hash(user_id) % N

shard_id = hash(user_id) % N

Python

Why it seems appealing: The same user always hits the same shard. Deduplication is trivial — check one shard, not N. You get per-user locality for free.

Why it breaks under celebrity traffic: If a small number of highly active users — bots, power users, celebrities — account for a large fraction of writes, their assigned shards become hot while others sit idle. You’ve re-created the exact skew problem you set out to solve. This is the hot shard problem — the distributed systems equivalent of a hash table with a bad hash function. No amount of increasing N fixes it if the underlying write distribution is skewed.

Time-Based Sharding

shard_id = (unix_timestamp_bucket + hash(user_id)) % N

shard_id = (unix_timestamp_bucket + hash(user_id)) % N

Python

Why it works: Within a time window, the same user hits a predictable shard (good for deduplication within that window). Across windows, writes rotate across shards, preventing long-term hot spots.

Why it complicates reads: Your fan-out must now span multiple time-window shards to get a full historical count. Instead of reading N current shards, you’re reading N × number_of_windows shards. Every time a window expires, your MGET call grows. Without a compaction strategy (covered below), reads become progressively more expensive over time.

Eventual Consistency: Why Wrong Counts are fine

A sharded counter read is not strongly consistent. There’s always a window between when a write lands on a shard and when a reader sums all shards. The reader may not see that write yet.

Why is this generally acceptable?

Human perception cannot distinguish between “9,847 likes” and “9,849 likes.” Social engagement metrics are inherently fuzzy — they’re signals of popularity, not precise measurements. If you miss counting 10 likes because your reads are 50ms behind your writes, nobody notices, and nobody cares. More importantly, displaying a slightly stale count is infinitely better than a system that’s slow or unavailable, a system that’s trying to maintain exact consistency under an impossibly high write load.

Why does YouTube famously show rounded counts?

YouTube displays “10.2M views” rather than “10,241,837 views”, not just for aesthetics — it’s an honest signal that the number is an approximation. Maintaining and displaying an exact real-time count at YouTube’s scale would require a consistency guarantee that is fundamentally incompatible with the architecture needed to handle the write volume. The rounding acknowledges this reality.

When does eventual consistency actually matter?

Inventory systems: If your counter tracks remaining stock in a flash sale, serving “3 items left” when the true count is 0 causes overselling. Here, you need stronger consistency guarantees or a reservation system.
Billing and metered usage: If users pay per API call, their counter needs to be accurate for invoicing. An eventually consistent counter is not appropriate; you need at least minimum read-your-write consistency.
Voting and elections: Any counted event with legal or contractual implications needs a consistency model that can be audited and proven correct.

For these cases, sharded counters can still be part of the solution — but they must be paired with additional mechanisms like periodic reconciliation jobs that produce a verified final count from the raw event log.

Caching the total: why you can’t afford to fan-out on every read

For any counter that’s read frequently, summing all shards on every request is prohibitively expensive. The solution is to cache the aggregated total with a short TTL.

def get_count(counter_id):
    cached = cache.get(f"{counter_id}_total")
    if cached:
        return cached                   # fast path: O(1), sub-millisecond

    # slow path: only taken on cache miss
    total = sum_all_shards(counter_id)
    cache.set(f"{counter_id}_total", total, ttl=5)   # cache for 5 seconds
    return total

def get_count(counter_id):
    cached = cache.get(f"{counter_id}_total")
    if cached:
        return cached                   # fast path: O(1), sub-millisecond

    # slow path: only taken on cache miss
    total = sum_all_shards(counter_id)
    cache.set(f"{counter_id}_total", total, ttl=5)   # cache for 5 seconds
    return total

Python

Why does this work so well?

The cache means the expensive MGET fan-out happens at most once every TTL seconds, no matter how many users are reading the counter simultaneously. A post with 1,000,000 page views per second only hits all N shards a few times per minute. The Redis memory and CPU cost drops from millions of multi-key reads to a handful.

Why 5 seconds, and not 1 second or 60 seconds?

TTL is a product decision, not a technical one. It encodes your tolerance for staleness:

Live auction bid count → 1 second TTL
Social media like count → 5–30 seconds TTL
Total signups ever → 5 minutes TTL

Match the TTL to the user’s expectation of freshness, not to some ideal of mathematical perfection.

Why doesn’t cache invalidation solve the eventual consistency problem?

If you invalidate on every write, and you have 50,000 writes per second, you get 50,000 cache invalidations per second — meaning 50,000 cache misses per second, each triggering a full MGET fan-out. The cache provides zero protection. Invalidation-based caching only works when writes are infrequent relative to reads. When writes are the bottleneck, TTL-based expiry is the right model.

The Compaction Pattern: Why shards should be periodically collapsed

Over time — especially during traffic spikes where you dynamically add shards — the number of active shards for a counter can grow large. A background job periodically collapses all shards back to one:

def compact(counter_id, N):
    # Step 1: Sum all shards
    keys = [f"{counter_id}:shard:{i}" for i in range(N)]
    values = redis.mget(*keys)
    total = sum(int(v) for v in values if v)

    # Step 2: Write total to shard 0
    redis.set(f"{counter_id}:shard:0", total)

    # Step 3: Delete remaining shards
    for i in range(1, N):
        redis.delete(f"{counter_id}:shard:{i}")

def compact(counter_id, N):
    # Step 1: Sum all shards
    keys = [f"{counter_id}:shard:{i}" for i in range(N)]
    values = redis.mget(*keys)
    total = sum(int(v) for v in values if v)

    # Step 2: Write total to shard 0
    redis.set(f"{counter_id}:shard:0", total)

    # Step 3: Delete remaining shards
    for i in range(1, N):
        redis.delete(f"{counter_id}:shard:{i}")

Python

Why is compaction necessary?

Without it, a counter that survived a viral traffic spike might permanently maintain 500 shards even when write volume returns to normal. Every read must fan out to 500 keys indefinitely. Storage, memory, and read I/O all grow without bound over time. Compaction reclaims these resources.

Why is compaction tricky to do safely?

The danger window is between Step 1 (read total) and Step 3 (delete shards). Writes arriving during this window land on old shards — if those shards are deleted before the next read, those increments are gone forever.
The safe approach: stop routing new writes to the shards being compacted (redirect everything to shard 0 only), wait briefly for in-flight INCRs to complete, then read, sum, write, and delete. Some systems accept a tiny count loss during compaction as an acceptable error, given the approximate nature of the counts anyway.

Why Stream-Based Aggregation Is the Next Level

Sharded counters are excellent for synchronous, real-time counting up to tens or hundreds of thousands of writes per second. But at truly planetary scale — billions of events per second, globally distributed — even sharded counters hit limits.

Why do sharded counters break at extreme scale?

Every INCR — even on a lightly-loaded shard — requires a network round trip to a Redis node. At 10 billion events/sec globally, that’s 10 billion network trips per second to your Redis cluster. The network and cluster become the bottleneck, not the per-key contention.

Geography compounds this. A user in Tokyo INCRing a shard hosted in Virginia adds 150ms of latency to a user interaction. Distributing Redis globally helps, but then you have cross-region replication lag and consistency headaches.

Why does stream-based aggregation solve this?

Instead of writing to a counter store on every event, you write to a log (Kafka, Kinesis, Pub/Sub). Logs are designed for extreme write throughput — they’re append-only, partitioned, and don’t require read-before-write. The write path is now just appending one message to a local partition — microseconds, not milliseconds.

A stream processor (Flink, Spark Streaming, Dataflow) continuously reads these logs, aggregates events in micro-batches every few seconds, and writes the aggregated deltas to Redis (or another store). Instead of 10 billion INCR calls per second hitting Redis, you get a few hundred aggregated writes per second. Redis load drops by orders of magnitude.

Why is the count now minutes behind?

The stream processor works in batches. Events accumulate in the log for batch_interval seconds before being counted and flushed. End-to-end latency from event to visible count update is typically 30 seconds to a few minutes, which is why YouTube’s view count visibly lags real-time activity.

This staleness is the price of planetary scale. The difference between “2.3 billion views” and “2.3 billion and 14 more” is meaningless to the user, and that trade-off is entirely worth it.

Full Architecture: Putting It All Together

Write Path:
  User action (like)
    → API server
    → Deduplication check: SADD tweet:123:likers user_id
      → Returns 0 (already liked) → reject
      → Returns 1 (new like) → proceed
    → INCR tweet:123:likes:shard:{random(0,N)}
    → Async: publish event to Kafka for analytics/ML

Read Path:
  Page load requests like count
    → GET tweet:123:likes:total  (cache key)
    → Cache hit  → return immediately (single key, sub-millisecond)
    → Cache miss → MGET tweet:123:likes:shard:0 ... shard:N
                   sum results
                   SETEX tweet:123:likes:total {ttl} {sum}
                   return total

Background Jobs:
  → Cache warmer: pre-populate totals for trending content before cache misses hit
  → Compaction job: nightly collapse of high-shard-count counters back to shard:0
  → Shard monitor: alert + auto-provision if any shard INCR rate exceeds threshold

Write Path:
  User action (like)
    → API server
    → Deduplication check: SADD tweet:123:likers user_id
      → Returns 0 (already liked) → reject
      → Returns 1 (new like) → proceed
    → INCR tweet:123:likes:shard:{random(0,N)}
    → Async: publish event to Kafka for analytics/ML

Read Path:
  Page load requests like count
    → GET tweet:123:likes:total  (cache key)
    → Cache hit  → return immediately (single key, sub-millisecond)
    → Cache miss → MGET tweet:123:likes:shard:0 ... shard:N
                   sum results
                   SETEX tweet:123:likes:total {ttl} {sum}
                   return total

Background Jobs:
  → Cache warmer: pre-populate totals for trending content before cache misses hit
  → Compaction job: nightly collapse of high-shard-count counters back to shard:0
  → Shard monitor: alert + auto-provision if any shard INCR rate exceeds threshold

Python

Why the deduplication check before the increment?

SADD returns 1 if the member was newly added, 0 if it already existed — atomic, O(1), and completely separate from the shard increment. Only when SADD returns 1 do you proceed to INCR the shard. This ensures a user can only contribute one increment per counter, regardless of retries or double-clicks, without needing to scan N shards.

Why async(Kafka) alongside Redis?

Redis knows the current total. It doesn’t know who liked what at what time in a queryable, durable form. Analytics pipelines, fraud detection, and ML recommendation features all need that granular event history. Writing to Kafka asynchronously (fire-and-forget from the API server’s perspective) gives you both: real-time counts in Redis, and a complete durable event log for everything else.

Read about Kafka

Summary: What, Why, and When

Concept	What	Why It Exists
Shard	Independent Redis key (`INCR`-able)	Eliminate write contention on a single key
Random shard selection	`random(0, N)` per write	Requires zero coordination between writers
`MGET` fan-out read	Fetch all N shards in one round trip	Each shard only knows its slice; total requires all
Shard count N	Tunable parallelism	Balances write throughput against read fan-out cost
Eventual consistency	Reads may lag writes briefly	Impossible to maintain real-time consistency at scale
Aggregated cache	SETEX total {ttl}	MGET fan-out on every read would cost more than the original write problem
Compaction	Collapse shards periodically	Prevents unbounded Redis key growth over time
Dynamic sharding	Monitor + add shards on demand	Static N over- or under-provisions for variable traffic
Stream aggregation	Kafka + Flink → Redis delta writes	When synchronous INCR at global scale is infeasible

The one-sentence principle behind all of it: Every problem in distributed counting comes down to the same root cause — shared mutable state under concurrent access — and every solution is a different way of eliminating the sharing, eliminating the mutation, or tolerating the inconsistency.

Sharded counters do all three in elegant proportion.

Conclusion

Sharded counters are a perfect example of a distributed systems trade-off done right. You give up a little read simplicity and accept a little staleness — in exchange for write throughput that scales linearly with your shard count, and a system that stays fast under the kind of traffic that would crush a naive implementation.

The progression is worth internalising: a single database row breaks under contention → Redis INCR on a single key helps enormously but still hits a ceiling → sharding across N Redis keys eliminates the ceiling → caching the aggregated total makes reads cheap → dynamic sharding handles unpredictable traffic spikes → stream-based aggregation takes over when even Redis at scale isn’t enough.

Each step in that progression is motivated by a real, specific failure mode. You don’t add complexity for its own sake — you add it when the system tells you it needs it.

Start with a single Redis key. Watch your per-key write rate. Add shards when you need them. That’s the whole playbook.

Why a Simple Counter Breaks at Scale

Table of Contents

The Problem

The Insight: Parallelism Beats Coordination

Why random, and not round-robin or sequential?

Why is each shard write still atomic?

Why must you read all shards?

Choosing N: Why Shard Count Is the Critical Decision

Shard Selection Deep Dive: Why Each Strategy Has a Fatal Flaw at Scale

Random Selection

User ID Hashing

Time-Based Sharding

Eventual Consistency: Why Wrong Counts are fine

Caching the total: why you can’t afford to fan-out on every read

The Compaction Pattern: Why shards should be periodically collapsed

Why Stream-Based Aggregation Is the Next Level

Full Architecture: Putting It All Together

Summary: What, Why, and When

Conclusion

References

About Puneet Verma

1 thought on “Why a Simple Counter Breaks at Scale”

Leave a Comment Cancel reply