LeetDesign
← All problems
Hard

Design a Metrics Ingestion Store

Sixty thousand metric batches per second, around the clock. The write floor sets your shard count, the relaxed loss window is worth real money, and correctness at this scale has a price tag.

Design the ingestion store behind a monitoring product. Fleet agents post batches of metric points continuously; dashboards and alert rules query recent series back out.

Nothing here can be cached away: every ingest is a write, and the write floor is brutal — a deploy storm doubles the stream to 60,000 batches per second, and your shard count follows from that number and nothing else. Disk barely matters (a 14-day retention window keeps the dataset at a constant ~18 TB), reads are a rounding error, and the ingest ack budget is 20 milliseconds because agents block on it.

The contract hands you one gift and you should take it with both hands: metrics tolerate a five-minute loss window. Nobody replays a crashed node's last minute of CPU gauges. That makes locally-fsynced async replication with two copies fully legal — and at twenty-plus shards, every increment of replication factor or coordination you add beyond the contract is a five-figure annual line item the cost axis will price for you.