The boring answer, which is usually the right one.
Reads (90% cacheable, 25k peak QPS): a cache-aside layer absorbs the Zipfian hot set. TTL is 5s, deliberately matched to the 5s staleness ceiling in the NFR — anything longer and a PUT can be invisible past the contract. Misses fall through to the replicas.
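To make the TTL point concrete, here is a minimal cache-aside sketch. It is an illustration, not the actual service code: the `CacheAside` class, the `load_from_replica` callback, and the injectable `clock` are hypothetical names, and an in-process dict stands in for the real cache nodes. The only figure taken from the post is the 5s TTL.

```python
import time

STALENESS_CEILING_S = 5.0  # the 5s staleness ceiling from the NFR


class CacheAside:
    """Read-through-on-miss cache with a fixed TTL (illustrative only)."""

    def __init__(self, load_from_replica, ttl_s=STALENESS_CEILING_S,
                 clock=time.monotonic):
        self._load = load_from_replica   # hypothetical replica read function
        self._ttl = ttl_s
        self._clock = clock              # injectable so expiry can be tested
        self._entries = {}               # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]              # hit: served without touching the DB
        value = self._load(key)          # miss or expired: fall through
        self._entries[key] = (value, now + self._ttl)
        return value
```

A cached value is served for at most `ttl_s` seconds after it was loaded, which is why the TTL can't exceed the ceiling. Any replica lag at load time adds to that, so the sketch assumes replicas are close to current.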
Writes (2.5k peak): hash-by-key sharding spreads the keyspace across 2 shards, each kept at RF 3 with quorum acks. That's what actually buys RPO 0: an acked write is on a majority (2 of 3) before the client hears "ok", so a single node loss can't lose it, and neither can an AZ loss as long as the three replicas sit in different AZs.
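A rough sketch of the write path's two moving parts, routing and the ack rule. The shard count and RF come from the post; the function names, the choice of SHA-256, and modulo placement are my assumptions for illustration (modulo placement reshuffles keys if the shard count ever changes, which a real system would likely avoid).

```python
import hashlib

NUM_SHARDS = 2          # from the post
REPLICATION_FACTOR = 3  # from the post


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash (assumed: SHA-256 + modulo)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def quorum(rf: int = REPLICATION_FACTOR) -> int:
    """Smallest majority of rf replicas."""
    return rf // 2 + 1


def can_ack(replica_acks: int, rf: int = REPLICATION_FACTOR) -> bool:
    """Only tell the client "ok" once a majority has persisted the write."""
    return replica_acks >= quorum(rf)
```

With RF 3 the quorum is 2, so `can_ack(1)` is false and `can_ack(2)` is true: losing any one replica still leaves at least one copy of every acked write.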
Sizing: app.xlarge × 4 = 32k QPS of capacity (8k per node) against the ~2.5k read misses that survive the cache, plus the 2.5k peak writes. LB ×2 so the front door isn't a SPOF.
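The sizing arithmetic, worked through as a back-of-envelope sketch. All inputs are the post's figures; treating the 90% cacheable share as the actual hit rate is a best-case assumption, and the variable and function names are mine.

```python
PEAK_READ_QPS = 25_000
CACHEABLE_PCT = 90              # assumed to equal the hit rate (best case)
PEAK_WRITE_QPS = 2_500
APP_NODES = 4
APP_TIER_CAPACITY_QPS = 32_000  # app.xlarge x 4, per the post


def read_misses(read_qps: int = PEAK_READ_QPS, hit_pct: int = CACHEABLE_PCT) -> int:
    """Reads that fall through the cache to the replicas."""
    return read_qps * (100 - hit_pct) // 100


def capacity_after_node_loss(total_qps: int = APP_TIER_CAPACITY_QPS,
                             nodes: int = APP_NODES) -> int:
    """App-tier capacity left if one node is lost."""
    per_node = total_qps // nodes
    return per_node * (nodes - 1)


db_facing_qps = read_misses() + PEAK_WRITE_QPS
print(f"read misses:          {read_misses()} QPS")
print(f"DB-facing load:       {db_facing_qps} QPS")
print(f"capacity at N-1 apps: {capacity_after_node_loss()} QPS")
```

The number to watch is the hit rate: if the cache goes cold and the hit rate drops to zero, `read_misses(hit_pct=0)` is the full 25k hitting the replicas.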
Cost is low because the cache does the heavy lifting. I'd rather explain this in an interview than a clever design.
As a baseline this is exactly what I'd draw first. Would love the failover story spelled out, but the bones are right.
Textbook. The TTL = staleness-ceiling detail is what separates this from a hand-wavy 'just add a cache' answer.
@priya_nair AZ spread mostly. One cache.large has the QPS but if it falls over you cold-start the whole read path into the DB. Two smaller nodes degrade instead of cliff.
Curious why cache.small ×2 and not one bigger node — is that for AZ spread or just headroom?
This is the reference answer and that's a compliment. The TTL=staleness-ceiling point is the thing most people miss — they set 60s and then can't explain the 5s contract.