How to split a database across machines — the trade-offs, the strategies, and the failure modes that trip up real systems at scale.
Table of Contents
What sharding is
A single database node has hard physical ceilings: disk capacity, RAM, CPU cycles, and write throughput are all finite. For most applications, this is fine for years. But at sufficient scale — billions of records, tens of thousands of writes per second — no amount of vertical scaling saves you.
Sharding (also called horizontal partitioning) breaks a dataset into subsets called shards, each owned and served by a separate database node. No single node holds the full data. Queries are routed to the correct shard by a routing layer that applies a partition function to the request’s key.
The routing layer computes: shard = f(partition_key). The result determines which node receives the read or write. The client never needs to know which shard holds which data — the router handles that transparently.
Core trade-off
Sharding gives you horizontal write scale and unlimited storage growth. In exchange, you lose the ability to query freely across the full dataset. Every sharding decision is downstream of this trade.
When you need it
Sharding is not a default. It adds significant operational and application complexity. Before adding it, exhaust these options in order:
Try first
- Vertical scaling (bigger machine)
- Read replicas (offload reads)
- Caching layer (Redis/Memcached)
- Query optimisation + indexes
- Archival (move cold data out)
- Functional decomposition (split by service)
Shard when
- Write throughput exceeds one node
- The dataset outgrows the largest available disk
- RAM can’t hold your working set
- Single-node failover latency is unacceptable
- Regulatory isolation requires data separation
The rough signal: if your database primary is CPU- or I/O-saturated despite caching and query optimisation, and read replicas can’t help because writes are the bottleneck — shard.
The three strategies
There are three fundamental approaches to deciding which shard owns a record. Each optimises for a different access pattern.
Strategy 1: Hash-based
shard = hash(key) % NPythonApply a hash function to the partition key, and take the modulo of the number of shards. Records scatter uniformly across all shards regardless of key value.
- ✓ Even distribution by design
- ✓ No hotspots from skewed data
- ✓ Simple to implement
- ✗ Range queries hit every shard
- ✗ Resharding remaps nearly all keys
- ✗ No data locality
Useful user data session stores key-value pairs
Strategy 2: Range-based
shard = range_map[key]
PythonAssign contiguous key ranges to shards. Keys A–G go to shard 0, H–P to shard 1, and so on. Naturally supports range scans within a shard.
- ✓ Range queries stay on one shard
- ✓ Natural temporal locality
- ✓ Easy to add new ranges at the edge
- ✗ Hotspot risk (time-series keys)
- ✗ Uneven distribution if the data isn’t uniform
- ✗ “Last shard” write bottleneck
Useful for logging time-series orders by date
Strategy 3: Directory-based
shard = lookup_table[key]PythonMaintain an explicit mapping of keys (or key ranges) to shards in a directory service. The routing layer consults the directory on every request.
- ✓ Total control over placement
- ✓ Move a single tenant without rehashing
- ✓ No algorithmic constraint on key space
- ✗ Directory is a single point of failure
- ✗ Extra network hop per request
- ✗ Directory must be kept consistent
Useful for multi-tenant SaaS tenant isolation
The choice maps cleanly to access patterns: hash for pure key-value lookup, range for anything time-ordered or sequentially scanned, directory for multi-tenant workloads where tenants need to be individually movable.
Consistent hashing
Plain modulo hashing (key % N) has a catastrophic resharding problem. Change N from 3 to 4, and roughly 75% of all keys remap to different shards, triggering a massive data migration. Consistent hashing solves this.
The idea: place both nodes and keys on a virtual ring (a circular key space, typically 0 to 232). Each key is owned by the first node clockwise from it on the ring. When a node joins or leaves, only the keys between it and its predecessor need to move — roughly 1/N of total data instead of nearly all of it.
The practical upshot: consistent hashing lets you add capacity incrementally without a full-cluster migration. This is the reason most modern distributed databases use it as their default partitioning scheme.
Virtual nodes (vnodes)
Each physical node owns multiple positions on the ring rather than one. This gives finer-grained load balancing and makes it easy to give beefier machines more of the ring. Cassandra, DynamoDB, and Riak all use vnodes. The typical configuration is 128–256 vnodes per physical node.
The shard key — the most important decision
Everything downstream of sharding — hotspots, query patterns, cross-shard join cost, resharding difficulty — depends entirely on the choice of shard key. Get this wrong, and no amount of infrastructure fixes it.
Properties of a good shard key
- High cardinality. The key must have enough distinct values to spread data across shards. A Boolean field is useless as a shard key.
- Even distribution. Values should not cluster. A user_id with a roughly uniform distribution is good; a field like account_tier where 95% of users are “free” is not.
- Access-pattern alignment. The shard key should match your most frequent query predicate. If most queries are WHERE user_id = ?, shard on user_id. This keeps the majority of queries on a single shard.
- Immutability. Changing a record’s shard key value means moving the record to a different node — a write to two shards plus a delete. Design shard keys that do not change over the lifetime of a record.
Classic bad shard key choices
| Key | Why it fails | Fix |
|---|---|---|
created_at | All new writes land on the “latest” time range shard. Every other shard is idle while one is saturated. | Combine with a high-cardinality field: (user_id, created_at) |
country | Low cardinality, extreme skew. US alone might be 10× any other country. | Use user_id instead; country as a secondary index. |
status (enum) | Nearly all records share a small set of values (“active”, “pending”). Entire shards sit empty. | Status is a filter, not a shard key. Shard on entity ID. |
| Auto-increment ID | Monotonically increasing keys go to the same shard with range-based sharding. Same as the created_at problem. | Use random or time-prefixed UUIDs, or hash the ID. |
Hotspots and how to handle them
Even with a good shard key, hotspots emerge when specific key values receive disproportionate traffic. A celebrity’s user_id, a viral post’s thread_id, or a popular product’s item_id can all saturate a single shard regardless of how well the overall distribution looks.
Salted keys
Append a random suffix to the partition key: user_id_0, user_id_1, …, user_id_9. Write a pick a suffix at random and scatter across 10 sub-shards. The hot entity is now 10× less hot per shard.
The cost: reads must fan out to all 10 sub-keys and aggregate. This is an explicit read-amplification trade for write distribution.
-- Write: pick a random bucket
INSERT INTO likes (key, count)
VALUES (CONCAT(post_id, '_', FLOOR(RAND() * 10)), 1)
ON CONFLICT (key) DO UPDATE SET count = count + 1;
-- Read: aggregate across all buckets
SELECT SUM(count) FROM likes
WHERE key LIKE CONCAT(post_id, '_%');PythonSharded counters
A natural extension of salted keys for counters specifically. Rather than a single likes_count row per entity, keep N counter rows per entity (one per physical shard, or just a fixed N). Each write increments a random row. Reads the sum across all N rows.
This is one of the highest-frequency patterns in HLD interviews — likes, view counts, follower counts, and upvotes all use it. The key insight: you’re trading read complexity for write scalability.
Read replicas for read hotspots
When the hotspot is read (not writes) — a celebrity profile page being fetched millions of times per minute — read replicas are the right tool. Route all reads to replicas, keep writes going to the primary. No fan-out needed. Replicas can even be geographically distributed to serve the hotspot locally.
Common confusion
Salted keys and sharded counters solve write hotspots. Read replicas solve read hotspots. They are complementary, not interchangeable. A viral post needs both: sharded counters for the like-write storm, read replicas for the read storm on the post content itself.
Cross-shard operations
The hardest problems in sharding all come down to operations that span multiple shards. This is where most of the complexity — and most of the interview depth — lives.
Cross-shard queries (scatter-gather)
A query that doesn’t include the shard key as a predicate must be sent to every shard. The routing layer fans out the query, each shard returns its partial result, and the application (or a middleware layer) merges and sorts them. This is called scatter-gather.
Scatter-gather is expensive. At 100 shards, a single query generates 100 sub-queries. The response time is bounded by the slowest shard. Design your access patterns to include the shard key in most queries — if you can’t, reconsider your shard key.
Cross-shard joins
SQL joins across shards are not natively possible. The options, roughly in order of preference:
Denormalise at write time. Embed the joined data in the same document/row at write time. No join needed at read time. Trades write amplification for read simplicity. Correct choice when the joined data rarely changes.
Co-locate related data. Choose a shard key that places related entities on the same shard. If you always join orders and order_items, shard both on user_id — a user’s orders and their items land on the same shard, enabling a local join.
Application-side join. Fetch from each shard independently, merge in application memory. Fast for small result sets; impractical for large ones.
Global secondary index. Maintain a separate index that maps a secondary attribute to the primary shard key. One extra lookup before routing to the correct shard. DynamoDB’s GSIs work this way — they’re maintained asynchronously and may be slightly stale.
Distributed transactions
Writing atomically across two shards requires coordination. Two approaches exist, both with serious trade-offs.
Two-phase commit (2PC). A coordinator node sends a “prepare” message to all participating shards, waits for acknowledgement, then sends “commit.” Guarantees atomicity but blocks if the coordinator fails mid-transaction. Rare in high-throughput systems due to latency and availability concerns.
Saga pattern. Break the distributed transaction into a sequence of local transactions. Each step publishes an event, triggering the next. If a step fails, compensating transactions roll back prior steps. Gives you eventual consistency without coordination, at the cost of complex rollback logic. The dominant pattern in microservice architectures.
Design advice
The best cross-shard transaction is one you never need to make. Spend the bulk of your design effort ensuring writes that must be atomic touch only one shard. This usually means careful shard key selection and accepting some denormalisation.
Resharding
Adding shards after launch — because you’ve grown beyond your initial shard count — is one of the hardest operational challenges in distributed systems. With naive modulo hashing, adding one shard to a cluster of three remaps roughly 75% of all keys, requiring an enormous migration.
Consistent hashing
As covered in section 4, consistent hashing limits key movement to roughly 1/N of the dataset when a node is added. This is the right default for most systems.
Pre-sharding
Start with many more logical shards than physical nodes. A cluster of 4 machines might own 256 logical shards, each machine owning 64. When you add a 5th machine, migrate 51 of the logical shards to it — a discrete, bounded operation that doesn’t require rehashing any keys, only reassigning ownership of whole shards.
Pre-sharding is how Redis Cluster works (16,384 hash slots across a small number of nodes) and was Discord’s approach to message storage scaling.
Dual-write migration
For live migrations: write to both old and new configurations simultaneously. Read from the new configuration once the data is verified consistent. Flip the read pointer. Drain writes to the old configuration. Clean up.
This approach keeps the system live throughout migration at the cost of temporarily doubled write load and complex consistency verification.
How real systems do it
| System | Sharding approach | Notable detail |
|---|---|---|
| DynamoDB | Hash on partition key | Internal, fully managed. Adaptive capacity handles hot partitions automatically. |
| Cassandra | Consistent hashing with Murmur3 | Vnodes (128 default). Token-aware drivers route directly to the correct node. |
| Redis Cluster | CRC16(key) % 16384 hash slots | 16,384 slots distributed across nodes. Clients cache the slot map locally. |
| MongoDB | Configurable — hash or range | Chunk-based (default 128MB). Balancer migrates chunks between shards automatically. |
| HBase / Bigtable | Range-based (row key ranges) | Tablets are split automatically when they exceed a size threshold. |
| MySQL (Vitess) | Keyspace-based with routing layer | VTGate handles routing. Supports resharding with minimal downtime. |
| user_id hash for user data | tweet_id uses Snowflake IDs (time-prefixed) to avoid sequential-key hotspots. |
Conclusion
Sharding is one of the most consequential architectural decisions you can make. Done well, it unlocks horizontal scale that no single machine can match. Done poorly, it creates a distributed system that is harder to operate than the monolith it replaced — with hotspots, scatter-gather queries, cross-shard transaction nightmares, and a resharding migration that takes months.