Size the RAM first. Everything else on this problem is commentary.
The working set: 5% of 400M products at 3KB each is 60 GB, and that slice takes 85% of the traffic. To convert that into a 90% hit ratio you need to hold roughly 70 GB (the working set divided by its traffic share). Two cache.larges is 128 GB. Covered, with headroom for the long tail. This one paragraph is the capacity plan; get it wrong and no amount of database will save you, because the db tier here (two db.large shards at 38% peak) is sized for MISSES.
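The arithmetic in that paragraph is short enough to check in a few lines. A minimal sketch: the variable names are mine, units are decimal (1 GB = 1e6 KB), and the 64 GB per cache node is an assumption inferred from the 128 GB quoted for two nodes.

```python
# Capacity arithmetic for the cache tier, using the figures from the post.
CATALOG_SIZE = 400_000_000   # products in the catalog
HOT_FRACTION = 0.05          # share of the catalog that is hot
ITEM_KB = 3                  # size per product, in KB
HOT_TRAFFIC_SHARE = 0.85     # share of traffic the hot slice takes
CACHE_NODES = 2
CACHE_NODE_GB = 64           # assumption: 128 GB total split across two nodes

# Hot working set in GB (decimal units).
working_set_gb = CATALOG_SIZE * HOT_FRACTION * ITEM_KB / 1e6

# The post's rule of thumb: working set divided by its traffic share.
target_gb = working_set_gb / HOT_TRAFFIC_SHARE

capacity_gb = CACHE_NODES * CACHE_NODE_GB
headroom_gb = capacity_gb - target_gb

print(f"working set: {working_set_gb:.1f} GB")
print(f"target:      {target_gb:.1f} GB")
print(f"capacity:    {capacity_gb} GB, headroom {headroom_gb:.1f} GB")
```

The target comes out a little over 70 GB, which is where "roughly 70 GB" comes from. Drop CACHE_NODES to 1 and the headroom goes negative.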
Freshness: write-through. Seller updates land in the cache on their way to the db, so the 2-second staleness ceiling is satisfied by construction: no TTL to reason about, no invalidation protocol to get wrong. The write path pays a second hop for it, but at 400/s nobody notices.
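A rough sketch of what that write path can look like. Everything here is hypothetical: InMemoryStore stands in for real cache and db clients, and the ordering shown (db first, cache only after the db acknowledges) is one possible sequencing, not something the plan above specifies.

```python
class InMemoryStore:
    """Hypothetical stand-in for a cache or database client."""

    def __init__(self, fail_writes=False):
        self.data = {}
        self.fail_writes = fail_writes

    def put(self, key, value):
        if self.fail_writes:
            raise RuntimeError("write not acknowledged")
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


def write_through(db, cache, key, value):
    """Write to the db, then to the cache, on the same request path."""
    db.put(key, value)      # raises if the db does not acknowledge the write
    cache.put(key, value)   # reached only after the db write succeeded


def read(db, cache, key):
    """Look-aside read: try the cache, fall back to the db, fill on a miss."""
    value = cache.get(key)
    if value is None:
        value = db.get(key)
        if value is not None:
            cache.put(key, value)
    return value


db, cache = InMemoryStore(), InMemoryStore()
write_through(db, cache, "sku-1", {"price_cents": 1999})
print(read(db, cache, "sku-1"))
```

With this ordering a failed db write never reaches the cache. The opposite failure, a cache write that fails after the db has acknowledged, would leave the old value cached, and the sketch does not handle it.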
The unglamorous part is the app tier: thirteen xlarges, the biggest line item at $8.84/hr, because a look-aside cache shields the database and shields nothing else. All 78.4k/s of sale-day traffic passes through the app tier whether the cache is winning or not. Total $13.12/hr.
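For a sense of scale, the same numbers broken down per tier. This is back-of-envelope only and rests on assumptions the figures above do not confirm: that load is spread evenly across the app servers, that all 78.4k/s is cacheable read traffic, and that the 90% hit ratio holds at peak.

```python
# Load and cost breakdown derived from the figures in the post.
PEAK_RPS = 78_400           # sale-day traffic; assumed to be all cacheable reads
HIT_RATIO = 0.90            # assumed to hold at peak
WRITE_RPS = 400             # seller updates
APP_SERVERS = 13
APP_COST_PER_HR = 8.84      # whole app tier, USD
TOTAL_COST_PER_HR = 13.12   # all tiers, USD

per_app_rps = PEAK_RPS / APP_SERVERS              # assumes even load balancing
db_read_rps = PEAK_RPS * (1 - HIT_RATIO)          # cache misses that reach the db
app_unit_cost = APP_COST_PER_HR / APP_SERVERS     # per app server, USD/hr
other_tiers_cost = TOTAL_COST_PER_HR - APP_COST_PER_HR  # cache + db, USD/hr

print(f"per app server: {per_app_rps:,.0f} req/s")
print(f"db reads:       {db_read_rps:,.0f} req/s, plus {WRITE_RPS} writes/s")
print(f"app unit cost:  ${app_unit_cost:.2f}/hr")
print(f"cache + db:     ${other_tiers_cost:.2f}/hr")
```

Under those assumptions the db sees about a tenth of the read traffic and the app tier sees all of it, which is why the app tier dominates the bill.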
the line about the db being sized for misses is the one i'd put in an interview answer. what happens to it when the cache tier restarts cold though, have you thought about a warmup story?
@two_phase_tim fair, the honest version is cache-write-after-db-ack, which the strategy label rounds over. the 2s budget gives you room to sequence it correctly
write-through means every update commits in two places on the request path. is that atomic? if the cache write lands and the db quorum fails, a shopper can see a price the seller never durably saved. small window, worth naming.
the working set math is right and i can confirm what happens if you skimp on it, see my thread. the average product is ice cold, the active 5% is lava, 32GB was never going to work