Same shard math as ingrid's (3x xlarge, RF 2 async, the contract literally permits it) with one addition: a durable queue between the apps and the db.
Here's the retry story nobody models. A tracking pixel fires, the response is slow, the SDK retries. What happens when the client sends it twice? For this problem, honestly: one phantom view, nobody cares, the contract even says counts merge last-writer-wins. So why the queue? Because the DELIVERY side is where the pain lives. When a shard hiccups for 30 seconds, the queue holds 180k hits and drains them once the shard is back, instead of 30 seconds of 502s teaching every SDK in the world to retry-storm me at exactly the same moment.
Costs two extra milliseconds on the ack path and $0.20/hr. The db never meets a burst it didn't agree to. Cheap insurance on a system whose whole job is absorbing a firehose.
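For anyone who wants to poke at the numbers, here is a rough back-of-envelope sketch of the hiccup case. The 6,000 hits/s ingest rate is just 180k divided by 30 s; the 9,000 hits/s drain rate and the name `simulate_outage` are assumptions for illustration, not anything from the contract.

```python
def simulate_outage(ingest_per_s, outage_s, drain_per_s, horizon_s):
    """Second-by-second model of a queue in front of a shard.

    Returns (peak_backlog, cleared_at), where cleared_at is the first
    second after the outage at which the backlog is empty (None if it
    never clears within the horizon).
    """
    backlog = 0
    peak = 0
    cleared_at = None
    for t in range(horizon_s):
        backlog += ingest_per_s                   # apps keep acking and enqueueing
        if t >= outage_s:                         # shard is back: apply up to drain_per_s
            backlog -= min(backlog, drain_per_s)
        peak = max(peak, backlog)
        if t >= outage_s and backlog == 0 and cleared_at is None:
            cleared_at = t + 1
    return peak, cleared_at


# Assumed rates: 6,000/s in (180k over 30 s), 9,000/s out once the shard recovers.
peak, cleared_at = simulate_outage(
    ingest_per_s=6_000, outage_s=30, drain_per_s=9_000, horizon_s=200
)
print(peak, cleared_at)
```

Under those assumed rates the backlog tops out at 180k and is gone 60 seconds after the shard returns. The drain headroom is the knob that matters: if the shard can only apply at the ingest rate, the backlog never shrinks.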
note the queue is on both paths here, so dashboard reads also pay the hop. at 120/s it is irrelevant, but worth knowing why it appears in the read trace.
would still add jitter to the sdk backoff but yes, decoupling ack from apply is the move. the 2ms is nothing against a 120ms budget
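A minimal sketch of what jittered backoff could look like on the SDK side, using the common "full jitter" variant (sleep a uniformly random time up to an exponentially growing ceiling). The function name `backoff_delay` and the 100 ms base and 10 s cap are hypothetical; the thread does not specify the SDK's actual retry policy.

```python
import random


def backoff_delay(attempt, base_s=0.1, cap_s=10.0, rng=random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)].

    attempt is 0 for the first retry. The values are illustrative defaults,
    not taken from any real SDK.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Each client draws its own delay, so retries after a shared failure
# spread out instead of landing on the same instant.
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt), 3))
```

The randomness is what breaks the correlation: plain exponential backoff without jitter still has every client retrying on the same schedule.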
the retry-storm point is the real one. the queue isn't protecting the db from traffic, it's protecting it from correlated retries. those are different animals