Database Sharding

Sharding partitions a database horizontally across multiple servers, each holding a subset of the data. It's the primary technique for scaling writes beyond what a single machine can handle.

What is Sharding?

When a single database can't handle your data volume or write throughput, you split the data across multiple database servers — each called a shard. Each shard contains a partition of the total dataset and handles queries only for its data.

Example: Users with IDs 1–1,000,000 go to Shard A; IDs 1,000,001–2,000,000 go to Shard B; and so on.

Sharding Strategies

Range-Based Sharding

Split data by a continuous range of the shard key.

Shard A: user_id          0 –  9,999,999
Shard B: user_id 10,000,000 – 19,999,999
Shard C: user_id 20,000,000 and above

Pros: Simple; supports range queries efficiently.
Cons: Hot shards: if most active users fall in one range (e.g., new users have the highest IDs), one shard absorbs most of the traffic.
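A minimal sketch of range-based routing. The boundaries and shard names below are illustrative assumptions matching the table above, not a real deployment:

```python
# Illustrative range table: (low, high, shard) boundaries are assumptions.
RANGES = [
    (0, 9_999_999, "shard_a"),
    (10_000_000, 19_999_999, "shard_b"),
    (20_000_000, float("inf"), "shard_c"),
]

def route_by_range(user_id: int) -> str:
    """Return the shard responsible for a given user_id."""
    for low, high, shard in RANGES:
        if low <= user_id <= high:
            return shard
    raise ValueError(f"no shard covers user_id {user_id}")
```

Because the table is ordered, range queries like "users 5M–15M" touch only the shards whose ranges overlap the query, which is the main advantage over hashing.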

Hash-Based Sharding

Apply a hash function to the shard key, then modulo by the number of shards.

shard = hash(user_id) % num_shards

Pros: Even distribution, which avoids hot shards.
Cons: Range queries become scatter-gather (must query all shards); resharding is painful (changing num_shards requires moving most of the data).
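The one-liner above can be made concrete. This sketch uses MD5 rather than Python's built-in `hash()`, which is randomized per process for strings and therefore unsafe as a shard key hash; the shard count of 4 is an assumption:

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size for illustration

def route_by_hash(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    # A stable hash is essential: every client must route the same
    # key to the same shard across processes and restarts.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Any stable, well-distributed hash works here; the modulo step is what makes later resharding painful.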

Directory-Based Sharding

Maintain a lookup table that maps shard keys to shards.

Pros: Most flexible; supports any mapping and makes it easy to move data between shards.
Cons: The lookup table is a single point of failure and a potential bottleneck (it needs caching).
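A sketch of the idea using a plain in-memory dict as the directory. This is a hypothetical stand-in: a production system would back the table with a replicated store and cache it heavily, since every query must consult it:

```python
# Hypothetical in-memory directory mapping shard keys to shard names.
directory: dict[int, str] = {}

def assign(key: int, shard: str) -> None:
    """Record which shard owns a key."""
    directory[key] = shard

def lookup(key: int) -> str:
    """Every query first resolves its shard here (hence the caching need)."""
    return directory[key]

def move(key: int, new_shard: str) -> None:
    # Rebalancing a single key is just a directory update
    # (plus copying the underlying rows) -- the flexibility win.
    directory[key] = new_shard
```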

Choosing a Shard Key

The shard key choice is the most critical sharding decision.

Good shard keys:

  • High cardinality (many unique values)
  • Evenly distributed access patterns
  • Usually queried with equality (not range)
  • Examples: user_id, organization_id

Bad shard keys:

  • Low cardinality (e.g., country_code — most traffic may come from US)
  • Monotonically increasing (e.g., created_at) — new data always hits one shard
  • Frequently changed values

The Resharding Problem

When you need to add more shards (because existing shards are full or overloaded), data needs to be redistributed. This is called resharding and is painful:

  • Must move large amounts of data while the system is live
  • Queries during migration may go to the wrong shard
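The scale of the problem with modulo-based hashing can be measured directly. This sketch counts how many keys change shards when a cluster grows from 4 to 5 shards (the key count and shard sizes are illustrative assumptions):

```python
import hashlib

def shard_for(key: int, num_shards: int) -> int:
    """Stable hash-then-modulo routing, as in the hash-based strategy."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

def fraction_moved(num_keys: int, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard assignment changes after resharding."""
    moved = sum(shard_for(k, old_n) != shard_for(k, new_n)
                for k in range(num_keys))
    return moved / num_keys
```

With uniform hashes, going from 4 to 5 shards moves roughly 80% of all keys, each requiring a live data migration.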

Consistent hashing mitigates this: when a node is added, only about 1/N of the keys move instead of nearly all of them.
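A minimal consistent-hash ring sketch, omitting virtual nodes for brevity (real implementations place many replicas of each node on the ring for smoother balance). Each key maps to the first node clockwise from its hash position, so adding a node only reassigns the keys that now land on it:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Bare-bones ring without virtual nodes; a sketch, not production code."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _pos(label: str) -> int:
        # Stable hash onto a large integer ring.
        return int(hashlib.md5(label.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        bisect.insort(self._ring, (self._pos(node), node))

    def get(self, key: str) -> str:
        # First node at or after the key's position, wrapping around.
        positions = [p for p, _ in self._ring]
        i = bisect.bisect(positions, self._pos(key)) % len(self._ring)
        return self._ring[i][1]
```

The defining property: after adding a node, every key that moved now maps to the new node; no key shuffles between the existing nodes.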

Cross-Shard Operations

Sharding breaks operations that span multiple shards:

  • JOINs: Can't JOIN across shards at the database level. Must do it in application code (scatter-gather)
  • Transactions: Cross-shard transactions require distributed transactions (2-phase commit) — complex and slow
  • Aggregations: Must aggregate results from all shards in the application layer
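A sketch of application-layer aggregation. Here `shard_results` and `shard_stats` stand in for per-shard query output; a real system would fan the query out to every shard (ideally in parallel) and gather the rows first:

```python
def global_sum(shard_results: list[list[float]]) -> float:
    """Gather step of scatter-gather: sum values returned by every shard."""
    return sum(sum(rows) for rows in shard_results)

def global_average(shard_stats: list[tuple[float, int]]) -> float:
    """Combine per-shard (sum, count) pairs into one average.
    Averaging per-shard averages would be wrong: it weights a
    10-row shard the same as a 10-million-row shard."""
    total = sum(s for s, _ in shard_stats)
    count = sum(c for _, c in shard_stats)
    return total / count
```

The average case illustrates a general rule: shards must return combinable partial results (sums and counts), not finished answers.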

Design your data model to minimize cross-shard operations.

Sharding vs. Replication

| Aspect | Sharding | Replication |
|---|---|---|
| Goal | Scale writes and data volume | Scale reads and availability |
| Data | Each shard has different data | Each replica has all data |
| Complexity | Very high | Medium |
| When to use | Data too large for one server | Read-heavy, HA requirements |

They're complementary — you can replicate each shard for high availability.

When to Shard

Sharding adds enormous operational complexity. Exhaust these options first:

  1. Optimize queries and add indexes
  2. Add read replicas
  3. Vertically scale (bigger machine)
  4. Shard as a last resort

Interview Tips

  • Sharding is often the answer when asked "how would you scale the database to handle X billion users?"
  • Always discuss shard key selection and the hot-shard problem
  • Mention that cross-shard transactions and JOINs are hard — factor this into your data model
  • Managed databases like DynamoDB partition data transparently, and middleware like Vitess adds sharding on top of MySQL — worth mentioning
  • Consistent hashing is a more elegant alternative to modulo hashing — know it well