Put a queue in front, because clients retry

Same shard math as ingrid's (3x xlarge, RF 2 async, the contract literally permits it) with one addition: a durable queue between the apps and the db.

Here's the retry story nobody models. A tracking pixel fires, the response is slow, the SDK retries. What happens when the client sends it twice? On this problem, honestly: one phantom view, nobody cares, the contract even says counts merge last-writer-wins. So why the queue? Because the DELIVERY side is where the pain lives. When a shard hiccups for 30 seconds, the queue holds the 180k hits that arrive in that window (about 6k/s) and drains them once the shard is back, instead of 30 seconds of 502s teaching every SDK in the world to retry-storm me at exactly the same moment.
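To put rough numbers on the hiccup case, here's a back-of-envelope sketch. The 6k hits/s ingest rate is just 180k divided by 30 seconds; the 9k hits/s drain rate is a number I made up for illustration, not something measured, and `simulate_outage` is a hypothetical helper, not part of any real system.

```python
def simulate_outage(ingest_per_s, drain_per_s, outage_s, horizon_s):
    """Step a queue in front of one shard, one second at a time.

    Returns (peak_depth, drained_at), where drained_at is the number of
    seconds after the outage began at which the backlog first hits zero,
    or None if it never empties within horizon_s.
    """
    depth = 0
    peak = 0
    drained_at = None
    for t in range(horizon_s):
        depth += ingest_per_s  # hits keep arriving; the queue acks them
        if t >= outage_s:
            # Shard is back: drain up to its (assumed) capacity.
            depth -= min(depth, drain_per_s)
        peak = max(peak, depth)
        if t >= outage_s and depth == 0 and drained_at is None:
            drained_at = t + 1
    return peak, drained_at


# Assumed: 6k hits/s in (180k / 30 s), 9k hits/s drain (illustrative only).
peak, drained_at = simulate_outage(6_000, 9_000, outage_s=30, horizon_s=120)
print(peak, drained_at)  # 180000 90
```

Under those assumptions the backlog peaks at 180k and is gone 60 seconds after the shard recovers. The catch the sketch makes obvious: if the drain rate isn't comfortably above the ingest rate, the backlog never clears (`drained_at` comes back `None`), so the queue only buys time if the db has headroom.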

Costs two extra milliseconds on the ack path and $0.20/hr. The db never meets a burst it didn't agree to. Cheap insurance on a system whose whole job is absorbing a firehose.
