LeetDesign

Eighteen small shards, because failure comes in shard-sized pieces

David Cho@david_cho

Same shape as the top answer. Different bet on the db tier: eighteen db.large shards instead of seven xlarges.

The utilization math favors mine: 54% combined instead of 69%, and the db tier's own ceiling moves out to 1.9x while the app tier still calls the system's headroom at 1.45x. Write ceiling is 28.8k against the same 9.5k stream. The bill disagrees: $39.48/hr, six dollars an hour over, and the grader docks me to 96 for it. Fair.
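A rough check of those numbers, as a sketch rather than the grader's actual formula. It assumes headroom is simply the reciprocal of utilization, that the six dollars is an hourly overage, and that a year is 8,760 hours; the function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def headroom(utilization: float) -> float:
    """Assumed model: how many times current load fits before hitting 100%."""
    return 1.0 / utilization


def annual_cost(hourly_delta: float) -> float:
    """Turn an hourly price difference into a yearly one."""
    return hourly_delta * HOURS_PER_YEAR


# 54% combined utilization on the eighteen-shard db tier.
db_headroom = headroom(0.54)

# 9.5k write stream against a 28.8k write ceiling.
write_utilization = 9_500 / 28_800

# Six dollars an hour over, expressed per year.
premium = annual_cost(6.00)

print(f"db headroom:       {db_headroom:.1f}x")
print(f"write utilization: {write_utilization:.0%}")
print(f"annual premium:    ${premium:,.0f}")
```

Under those assumptions the db tier lands at roughly 1.9x, writes sit at about a third of the ceiling, and the overage comes to a little over $52k a year.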

Why pay it anyway? A shard is the unit of a bad day. When one misbehaves (hot key, failed primary, slow disk) you lose 1/18th of the keyspace instead of 1/7th, rebuilds move 650 GB instead of 1.7 TB, and the blast radius of every operation shrinks by the same ratio. Wide and small is an insurance premium with a known price.
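The blast-radius comparison can be sketched in a few lines. This assumes keys are spread evenly across shards, so one shard failing takes out exactly 1/N of the keyspace; the helper names are hypothetical, and the per-shard sizes are the 650 GB and 1.7 TB figures quoted above.

```python
from fractions import Fraction


def blast_radius(shards: int) -> Fraction:
    """Share of the keyspace lost when one shard fails (assumes even key spread)."""
    return Fraction(1, shards)


def total_data_tb(shards: int, per_shard_gb: float) -> float:
    """Total data held across all shards, in TB (1 TB = 1000 GB)."""
    return shards * per_shard_gb / 1000


wide = blast_radius(18)    # eighteen small shards
narrow = blast_radius(7)   # seven xlarge shards

keyspace_ratio = float(narrow / wide)   # how much smaller the wide blast radius is
rebuild_ratio = 1700 / 650              # 1.7 TB rebuild vs 650 GB rebuild

print(f"18 shards: lose {float(wide):.1%} of the keyspace per failure")
print(f" 7 shards: lose {float(narrow):.1%} of the keyspace per failure")
print(f"keyspace ratio: {keyspace_ratio:.2f}x, rebuild ratio: {rebuild_ratio:.2f}x")
print(f"total data: {total_data_tb(18, 650):.1f} TB vs {total_data_tb(7, 1700):.1f} TB")
```

Both layouts hold roughly the same total data, so the two ratios come out close to each other: what changes is the size of the piece that breaks.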

One-sentence version: the seven-shard answer is what the load requires; the eighteen-shard answer is what the pager prefers.

3 Comments


  • CAP Theorem@cap_theorem

    capacity is what the load needs. shard count is what the failure needs

  • Marcus Lee@marcus_lee

    $52k a year of insurance premium. probably worth it at this scale, but say the annual number out loud before nodding

  • Hannah Berg@hannah_berg

    the pager agrees. a 1.7TB rebuild during evening peak is a shift nobody forgets, 650GB is merely a bad meeting