Sized for the second the gates open

I sized this from the on-call seat, working backwards from what pages at 10:00:01 when a headline act drops.

The cache TTL is 1 second because that is literally the whole budget. It's not a cache in the "save money" sense. It's a shock absorber: at 60k reads/s, even a 1s TTL means the hottest seat-map reads coalesce instead of stampeding one shard. It sits look-aside off the app tier, so when the cache dies (it will), reads degrade to the db path instead of erroring.
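A minimal sketch of that read path, to make the failure mode concrete. Everything named here (`LookAsideCache`, `read_seat_map`, `load_from_db`) is hypothetical, the cache is an in-process dict standing in for whatever the real cache is, and the sketch is single-threaded, so it shows TTL reuse and the degrade-to-db fallback but not coalescing of concurrent misses.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LookAsideCache:
    """In-process stand-in for the look-aside cache (hypothetical, for illustration)."""

    def __init__(self, ttl_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self.store: Dict[str, Tuple[float, object]] = {}
        self.healthy = True  # flip to False to simulate the cache dying

    def get(self, key: str) -> Optional[object]:
        if not self.healthy:
            raise ConnectionError("cache unavailable")
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry is older than the TTL: drop it and report a miss.
            del self.store[key]
            return None
        return value

    def set(self, key: str, value: object) -> None:
        if not self.healthy:
            raise ConnectionError("cache unavailable")
        self.store[key] = (self.clock() + self.ttl, value)


def read_seat_map(event_id: str, cache: LookAsideCache,
                  load_from_db: Callable[[str], object]) -> object:
    """Look-aside read: try the cache, fall back to the db on a miss or a cache failure."""
    try:
        cached = cache.get(event_id)
    except ConnectionError:
        # Cache is down: degrade to the db path instead of erroring.
        return load_from_db(event_id)
    if cached is not None:
        return cached
    value = load_from_db(event_id)
    try:
        cache.set(event_id, value)
    except ConnectionError:
        pass  # a failed cache write must never fail the read
    return value
```

The app server owns both the cache call and the db call, which is what makes the fallback possible: a dead cache costs extra db reads, not errors.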

The app tier is the expensive part (9x xlarge). A look-aside cache shields the database, not the app servers, so the app servers still front the full 60k reads/s. People undersize this and then wonder why the LB health checks start failing before the db even notices the sale started.

db.large x5 shards, RF 3, quorum. 120k reads/s of capacity, and 8k writes/s against the 3.6k/s hold stampede. Every shard can lose a replica and still take quorum writes, which matters because the one guarantee I refuse to page for is a confirmed booking disappearing.
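The quorum and headroom arithmetic behind that paragraph, as a quick sketch. The function names are mine, majority quorum (floor(RF/2) + 1) is assumed since the post only says "quorum", and the read comparison assumes the worst case where the cache is down and all 60k reads land on the db.

```python
def quorum_size(replication_factor: int) -> int:
    """Majority quorum: more than half of the replicas must acknowledge."""
    return replication_factor // 2 + 1


def tolerated_replica_losses(replication_factor: int) -> int:
    """How many replicas a shard can lose and still reach a majority quorum."""
    return replication_factor - quorum_size(replication_factor)


RF = 3
DB_READ_CAPACITY = 120_000   # reads/s across the 5 shards, from the sizing above
DB_WRITE_CAPACITY = 8_000    # writes/s across the 5 shards
PEAK_READS = 60_000          # worst case: cache dead, every read reaches the db
HOLD_WRITES = 3_600          # the hold stampede

print("quorum:", quorum_size(RF))
print("replicas a shard can lose:", tolerated_replica_losses(RF))
print("write headroom:", DB_WRITE_CAPACITY - HOLD_WRITES)
print("read headroom with no cache:", DB_READ_CAPACITY - PEAK_READS)
```

With RF 3 the quorum is 2, so each shard survives exactly one lost replica. A second loss on the same shard stops quorum writes on that shard.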

Who gets woken up: nobody, ideally. Losing an app node at peak leaves 64k of capacity against a 63.6k load, which is tighter than I'd like, but that's what the 12x flash multiple already priced in.
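The N-1 check, spelled out. The per-node figure of 8k req/s is not stated anywhere above; it is back-derived from 64k across the 8 surviving nodes, so treat it as an assumption. The helper name is hypothetical.

```python
def capacity_after_losses(nodes: int, per_node_capacity: int, lost: int = 1) -> int:
    """Total capacity left once `lost` nodes are gone."""
    return (nodes - lost) * per_node_capacity


APP_NODES = 9
PER_NODE_RPS = 8_000          # assumption: 64k / 8 surviving nodes
PEAK_LOAD = 60_000 + 3_600    # reads plus the hold stampede

remaining = capacity_after_losses(APP_NODES, PER_NODE_RPS)
margin = remaining - PEAK_LOAD
margin_pct = 100 * margin / PEAK_LOAD

print(f"capacity with one node down: {remaining}")
print(f"margin over peak load: {margin} req/s ({margin_pct:.2f}%)")
```

Under the same assumption, a second node loss leaves 56k, which is below the 63.6k load, so the design tolerates exactly one app node failure at peak.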
