Design a Metrics Ingestion Store

Design the ingestion store behind a monitoring product. Fleet agents post batches of metric points continuously; dashboards and alert rules query recent series back out.

Nothing here can be cached away: every ingest is a write, and the write floor is brutal — a deploy storm doubles the stream to 60,000 batches per second, and your shard count follows from that number and nothing else. Disk barely matters (a 14-day retention window keeps the dataset at a constant ~18 TB), reads are a rounding error, and the ingest ack budget is 20 milliseconds because agents block on it.

The contract hands you one gift and you should take it with both hands: metrics tolerate a five-minute loss window. Nobody replays a crashed node's last minute of CPU gauges. That makes locally-fsynced async replication with two copies fully legal — and at twenty-plus shards, every increment of replication factor or coordination you add beyond the contract is a five-figure annual line item the cost axis will price for you.