I sized this from the on-call seat, working backwards from what pages at 10:00:01 when a headline act drops.
The cache TTL is 1 second because that is literally the whole budget. It's not a cache in the "save money" sense, it's a shock absorber: at 60k/s even a 1s TTL means the hottest seat-map reads coalesce instead of stampeding one shard. Look-aside off the app tier, so when it dies (it will), reads degrade to the db path instead of erroring.
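To make the shock-absorber idea concrete, here is a minimal in-process sketch of a look-aside read with a 1s TTL, per-key coalescing, and a fall-through to the db when the cache is down. The class name `LookAsideCache`, the `healthy` flag, and the `load_from_db` callback are all hypothetical stand-ins for illustration; a real deployment would sit in front of a networked cache, not a dict.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class LookAsideCache:
    """Toy stand-in for a look-aside cache with a short TTL (illustrative only)."""

    def __init__(self, ttl_s: float = 1.0, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.healthy = True  # flip to False to simulate the cache dying
        self._entries: Dict[str, Tuple[float, object]] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key: str):
        entry = self._entries.get(key)
        if entry is not None and self.clock() < entry[0]:
            return entry
        return None

    def get(self, key: str, load_from_db: Callable[[str], object]):
        if not self.healthy:
            # Cache is down: degrade to the db path instead of erroring.
            return load_from_db(key)
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        # Coalesce: one caller per key refills, the rest wait and re-check.
        with self._lock_for(key):
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = load_from_db(key)
            self._entries[key] = (self.clock() + self.ttl_s, value)
            return value
```

With this shape, any number of reads for the same seat map inside one TTL window cost a single db read, and a dead cache costs latency rather than errors.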
The app tier is the expensive part (9x xlarge). A look-aside cache shields the database, not the app servers, so the app servers still front the full 60k of reads. People undersize this and then wonder why the LB health checks start failing before the db even notices the sale started.
db.large x5 shards, RF 3, quorum. 120k read capacity, 8k writes against the 3.6k hold stampede. Every shard can lose a copy and still take quorum writes, which matters because the one guarantee I refuse to page for is a confirmed booking disappearing.
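The quorum claim and the write headroom can be checked with a few lines of arithmetic. This is a sketch that assumes "quorum" means a simple majority of replicas; the constant and function names are mine, and the numbers are the ones quoted above.

```python
REPLICATION_FACTOR = 3
WRITE_CAPACITY_RPS = 8_000   # quoted write capacity across the shards
HOLD_WRITE_RPS = 3_600       # quoted hold stampede


def quorum_size(rf: int) -> int:
    """Majority quorum (assumed): more than half of the replicas."""
    return rf // 2 + 1


def replicas_you_can_lose(rf: int) -> int:
    """Copies a shard can lose while still reaching a write quorum."""
    return rf - quorum_size(rf)


write_utilization = HOLD_WRITE_RPS / WRITE_CAPACITY_RPS

print(f"quorum={quorum_size(REPLICATION_FACTOR)}, "
      f"tolerated losses={replicas_you_can_lose(REPLICATION_FACTOR)}, "
      f"write utilization={write_utilization:.2f}")
```

At RF 3 the majority is 2, so each shard tolerates exactly one lost copy, and the hold stampede uses a bit under half of the quoted write capacity.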
Who gets woken up: nobody, ideally. Losing an app node at peak leaves 64k of capacity against a 63.6k load, which is tighter than I'd like, but that's what the 12x flash multiple already priced in.
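The node-loss arithmetic is easy to rerun with other node counts. The sketch below assumes 8k req/s per app node (inferred from 64k of capacity on the 8 survivors) and a peak load of 60k reads plus 3.6k holds; both are my reading of the figures above, not measured values.

```python
# Assumed figures: per-node rate inferred from "64k of capacity" on the 8 nodes
# left after one of 9 dies; load is 60k reads plus the 3.6k hold stampede.
APP_NODES = 9
PER_NODE_RPS = 8_000
PEAK_LOAD_RPS = 60_000 + 3_600


def utilization(nodes: int, load_rps: int = PEAK_LOAD_RPS) -> float:
    """Fraction of app-tier capacity in use (rho) with `nodes` healthy nodes."""
    return load_rps / (nodes * PER_NODE_RPS)


def headroom_rps(nodes: int, load_rps: int = PEAK_LOAD_RPS) -> int:
    """Spare requests per second before the tier saturates."""
    return nodes * PER_NODE_RPS - load_rps


for nodes in (APP_NODES, APP_NODES - 1):
    print(f"{nodes} nodes: rho={utilization(nodes):.2f}, "
          f"headroom={headroom_rps(nodes)} rps")
```

Under these assumptions the full tier runs at roughly 0.88 utilization, and one lost node leaves only 400 req/s of slack.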
the 'it's a shock absorber not a cache' framing is right. ship this, tune the app count after the first real on-sale
app tier at rho 0.88 at peak is where your p99 lives. it's fine against a 150ms budget but if this were a 40ms product you'd need 11 not 9. blast radius of one app node dying is the number to watch