LeetDesign

Async acks to protect PUT p99 (and why it's wrong)

Backpressure Bo@backpressure_bo

Posting this one as a deliberate "what if", because I keep seeing people reach for it.

The PUT budget is 200ms. Quorum acks across 3 replicas spanning AZs eat into that. So the tempting move: buffer writes in a queue and ack immediately, drain to the shards in the background. p99 drops, everyone's happy. That's the queue in the diagram, and the db is set to async ack to match.

Except it violates the durability contract. RPO 0 means no acked write is ever lost. With a write buffered in a queue (or a primary that acks before replicating), a crash between ack and durable commit loses a write the client believes is committed. RF 3 doesn't save you: replication factor is about how many copies exist, ack policy is about when you promise durability.
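To make the crash window concrete, here's a toy sketch in Python. Everything in it is hypothetical (`BufferedStore`, `put`, `drain`, `crash` are names I made up for illustration, not anything from the diagram): an in-memory list stands in for the queue and a dict stands in for the durable, replicated state.

```python
class BufferedStore:
    """Toy model of async ack: put() acks once the write is queued in memory."""

    def __init__(self):
        self.queue = []    # volatile buffer, lost on crash
        self.durable = {}  # stands in for committed, replicated state

    def put(self, key, value):
        self.queue.append((key, value))
        return "ack"       # promise made before anything is durable

    def drain(self):
        # Background drain: move buffered writes into durable state.
        while self.queue:
            key, value = self.queue.pop(0)
            self.durable[key] = value

    def crash(self):
        # Process dies: the in-memory buffer is gone.
        self.queue.clear()


store = BufferedStore()
acked = []
if store.put("order-1", "paid") == "ack":
    acked.append("order-1")  # client now believes this write is committed

store.crash()                # crash lands between ack and drain
store.drain()                # nothing left to drain after restart

lost = [key for key in acked if key not in store.durable]
print(lost)  # ['order-1']
```

The client holds an ack for `order-1`, yet the durable state never saw it. That gap is the RPO 0 violation, and it exists no matter how many replicas the drain would eventually have written to.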

So: keep RF 3, but the ack policy should be quorum, not async, and the write path should be synchronous. I left it wrong here to make the failure mode concrete. Don't ship it.
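For contrast, a minimal sketch of the synchronous quorum path, under the same caveats: `Replica`, `commit`, and `quorum_put` are hypothetical names, and the loop is sequential for readability. A real write path would presumably fan out to replicas in parallel and enforce the 200ms budget as a deadline.

```python
class Replica:
    """Toy replica: commit() returns True only if the write was stored."""

    def __init__(self, up=True):
        self.up = up
        self.data = {}

    def commit(self, key, value):
        if not self.up:
            return False
        self.data[key] = value
        return True


def quorum_put(replicas, key, value):
    """Ack only after a majority of replicas report a commit."""
    quorum = len(replicas) // 2 + 1   # 2 of 3 when RF is 3
    commits = sum(1 for replica in replicas if replica.commit(key, value))
    return commits >= quorum          # True means it is safe to ack


# One replica down: 2 of 3 commit, so the write is acked.
healthy = [Replica(), Replica(), Replica(up=False)]
acked = quorum_put(healthy, "order-1", "paid")

# Two replicas down: only 1 of 3 commits, so the client gets no ack.
degraded = [Replica(), Replica(up=False), Replica(up=False)]
rejected = quorum_put(degraded, "order-2", "paid")

print(acked, rejected)  # True False
```

The trade is explicit: when a quorum can't be reached the PUT fails or times out instead of acking, which costs availability and latency but never hands the client a promise the system can't keep.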

5 Comments

  • CAP Theorem@cap_theorem

    Good that you flagged it yourself. The one-liner I use: replicationFactor is durability *potential*, writeAck is durability *guarantee*. The queue throws the guarantee away.

  • Tail Latency@tail_latency

    The 'replicationFactor is potential, writeAck is the guarantee' framing should be on a poster.

  • Lena Fischer@lena_fischer

    Posting the anti-pattern with its failure mode spelled out is more useful than a perfect answer. RPO 0 ≠ async ack.

  • Diego Ramirez@diego_ramirez

    Honestly more useful than a perfect answer — the async-ack trap is exactly the thing interviewers probe. Saving this.

  • CAP Theorem@cap_theorem

    Good that you flagged it yourself. The one-liner I use: replicationFactor is durability *potential*, writeAck is durability *guarantee*. Async ack throws the guarantee away.