Search Systems | System Design Library

Why Not Just Use SQL LIKE?

SELECT * FROM products WHERE description LIKE '%wireless headphones%';

Problems:

Full table scan: a pattern with a leading % can't use a B-Tree index, so the database checks every row
No relevance ranking: results aren't sorted by how good a match they are
No typo tolerance: "headphons" returns nothing
No stemming: searching "running" doesn't find "run" or "runs", and substring matching produces false positives ("%run%" also matches "prune")
No faceting: can't efficiently compute filter counts ("245 items in Electronics")
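
A small sketch of these limitations, using Python's built-in sqlite3 module and an in-memory table. The rows and the like_search helper are made up for illustration; the table name mirrors the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO products (description) VALUES (?)",
    [
        ("Wireless headphones with noise cancelling",),
        ("Headphones, wireless, over-ear",),
        ("Shoes for running on trails",),
    ],
)


def like_search(term):
    """Return ids of rows whose description contains `term` as a substring."""
    rows = conn.execute(
        "SELECT id FROM products WHERE description LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


exact = like_search("wireless headphones")  # matches only the exact phrase
typo = like_search("headphons")             # one-letter typo
inflected = like_search("runs")             # different form of "running"
print(exact, typo, inflected)
```

Row 2 describes wireless headphones too, but the words appear in a different order, so the substring match misses it. The typo and the inflected form match nothing at all.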

Full-Text Search Engines

Elasticsearch / OpenSearch

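The most widely deployed option: distributed, horizontally scalable, and feature-rich. OpenSearch is an open-source fork of Elasticsearch with a largely compatible API.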

Key features:

Inverted index for fast term lookups (the engine looks up a term's document list instead of scanning documents)
Relevance scoring (BM25 algorithm)
Fuzzy matching / typo tolerance
Aggregations and facets
Geospatial queries
Near real-time indexing (new documents become searchable after the next index refresh, about 1 second by default)
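
As a rough illustration, a search request body that combines fuzzy matching with a facet count might look like the sketch below. It is written as a plain Python dict and only printed, not sent to a cluster. The field names "description" and "category" and the aggregation name are assumptions about a hypothetical product index.

```python
import json

# Hypothetical index fields: "description" (full text) and "category" (keyword).
search_body = {
    "query": {
        "match": {
            "description": {
                "query": "wireless headphons",  # typo left in on purpose
                "fuzziness": "AUTO",            # allow a small edit distance
            }
        }
    },
    "aggs": {
        # Facet: number of matching documents per category value.
        "items_per_category": {"terms": {"field": "category"}}
    },
    "size": 10,
}

print(json.dumps(search_body, indent=2))
```

In a real setup this body would be passed to the search endpoint of the index, and the facet counts would come back alongside the ranked hits.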

Typesense / Meilisearch

Simpler, faster to set up, great typo tolerance. Better for smaller datasets and developer-friendly use cases.

Algolia

Managed search-as-a-service. Excellent developer experience, very fast. Higher cost than self-hosted.

How Search Engines Work

Inverted Index

The core data structure. Instead of storing documents and scanning them, the engine maintains a mapping:

"wireless" → [doc 3, doc 7, doc 15]
"headphones" → [doc 1, doc 7, doc 22]
"audio" → [doc 1, doc 3, doc 8]

To find documents matching "wireless headphones", intersect the two lists: {7}, the only document that appears in both. This is very fast because the engine never scans document text.
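
A minimal sketch of the idea in Python, assuming whitespace tokenization and in-memory sets as posting lists. Real engines store sorted, compressed postings on disk, and the function names here are illustrative. The sample documents are chosen to reproduce the mapping above.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


docs = {
    1: "headphones audio",
    3: "wireless audio",
    7: "wireless headphones",
    8: "audio",
    15: "wireless",
    22: "headphones",
}
index = build_inverted_index(docs)
result = search_all_terms(index, "wireless headphones")
print(result)  # [7]
```

The cost of a query depends on the length of the posting lists involved, not on the total amount of text indexed.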

Analysis Pipeline

Before indexing, text is processed:

Tokenization: "Wireless Headphones" → ["Wireless", "Headphones"]
Lowercasing: → ["wireless", "headphones"]
Stemming: "running" → "run", "headphones" → "headphone"
Stop word removal: "the", "a", "of" removed (they appear everywhere, useless for search)

The same pipeline runs on queries, so searches match the indexed tokens.
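
The steps above can be sketched in a few lines of Python. The stemmer here is a deliberately naive suffix stripper written for this example. Production engines use proper stemmers such as Porter or Snowball, and the stop word list is just the three words mentioned above.

```python
import re

STOP_WORDS = {"the", "a", "of"}


def naive_stem(token):
    """Very rough suffix stripping; not a real stemming algorithm."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Undo a doubled final consonant: "runn" -> "run"
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]
        return token
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)         # tokenization
    tokens = [t.lower() for t in tokens]                # lowercasing
    tokens = [naive_stem(t) for t in tokens]            # stemming
    return [t for t in tokens if t not in STOP_WORDS]   # stop word removal


print(analyze("The Wireless Headphones"))  # ['wireless', 'headphone']
print(analyze("Running of the bulls"))     # ['run', 'bull']
```

Calling analyze on both the document text and the query text is what lets a search for "running" hit a document that was indexed under "run".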

Relevance Scoring (BM25)

Not all matches are equal. Relevance scoring determines which documents appear first:

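Term Frequency (TF): How often does the search term appear in the document? BM25 saturates this, so the tenth occurrence adds far less than the first.
Inverse Document Frequency (IDF): How rare is this term across all documents? Rare terms are more meaningful.
Document length normalization: A match in a short document counts for more than the same match in a long one.
Field weighting: A match in the title is more important than in the description. This is a per-field boost configured on top of BM25, not part of the formula itself.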
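
A compact sketch of BM25 scoring over pre-tokenized documents, assuming the common textbook form with a non-negative IDF and the widely used parameter values k1 = 1.2 and b = 0.75. Field weighting is left out, and the function name and sample documents are illustrative.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """docs maps doc_id -> list of tokens; returns doc_id -> BM25 score."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher IDF.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalised via b.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


docs = {
    1: ["wireless", "headphone"],
    2: ["wireless", "mouse", "pad", "wireless"],
    3: ["audio", "cable"],
}
scores = bm25_scores(["wireless", "headphone"], docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # [1, 2, 3]
```

Document 1 ranks first because it matches both terms, including the rarer "headphone". Document 2 repeats "wireless" but gains little from the repetition and is longer than average.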

Keeping Search in Sync

Search engines are secondary indexes — your source of truth is still your database. You must keep them in sync.

Synchronous (Double-Write)

Write to database AND search engine in the same request.

Problem: What if the search engine write fails? The record is in the database but missing from search, and there is no shared transaction to roll both back.

Event-Driven (Recommended)

Publish a "data changed" event to a message queue → Consumer reads the event and updates the search index.

DB Write → Kafka Event → Search Indexer → Elasticsearch

Benefits: Decoupled and retryable, and the search index can be rebuilt by replaying event history as long as the queue retains it.
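
A simplified sketch of the consumer side, with the standard library's queue.Queue standing in for Kafka and a fake index client that fails on demand. Every name here (FlakyIndex, run_indexer, the event shape) is hypothetical. The point is only to show the retry behaviour that a double-write lacks.

```python
import queue


class FlakyIndex:
    """Stand-in for a search engine client; fails the first `failures` writes."""

    def __init__(self, failures=0):
        self.docs = {}
        self._failures = failures

    def upsert(self, doc_id, doc):
        if self._failures > 0:
            self._failures -= 1
            raise ConnectionError("search engine unavailable")
        self.docs[doc_id] = doc


def run_indexer(events, index, max_attempts=3):
    """Drain the queue; return events that still failed after all attempts."""
    dead_letters = []
    while not events.empty():
        event = events.get()
        for _ in range(max_attempts):
            try:
                index.upsert(event["id"], event["doc"])
                break
            except ConnectionError:
                continue  # try the same event again
        else:
            # All attempts failed: park the event instead of losing it.
            dead_letters.append(event)
    return dead_letters


events = queue.Queue()
events.put({"id": 1, "doc": {"name": "wireless headphones"}})
events.put({"id": 2, "doc": {"name": "audio cable"}})

index = FlakyIndex(failures=2)  # the first two writes fail
failed = run_indexer(events, index)
print(sorted(index.docs), failed)  # [1, 2] []
```

A production consumer would also add backoff between attempts and commit its queue offset only after a successful write.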

Change Data Capture (CDC)

Debezium or similar tools stream database changes (read from the write-ahead log) to Kafka → Index consumer updates Elasticsearch. The application's write path needs no code changes, and only committed writes produce events.
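
A sketch of how an index consumer might apply such change events, assuming a simplified version of the Debezium event shape: an "op" code ("c" create, "u" update, "d" delete, "r" snapshot read) plus "before" and "after" row images. The dict standing in for the search index and the "id" primary key column are assumptions for illustration.

```python
def apply_change(index, event):
    """Apply one simplified change event to an in-memory index (a dict)."""
    op = event["op"]
    if op in ("c", "u", "r"):
        # Create, update, and snapshot read all carry the new row in "after".
        row = event["after"]
        index[row["id"]] = row
    elif op == "d":
        # Deletes only carry the old row in "before".
        index.pop(event["before"]["id"], None)


index = {}
changes = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "headphones"}},
    {"op": "u", "before": {"id": 1, "name": "headphones"},
     "after": {"id": 1, "name": "wireless headphones"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "audio cable"}},
    {"op": "d", "before": {"id": 2, "name": "audio cable"}, "after": None},
]
for change in changes:
    apply_change(index, change)

print(index)  # {1: {'id': 1, 'name': 'wireless headphones'}}
```

Because every operation is an upsert or a delete keyed by primary key, replaying the same event twice leaves the index in the same state, which makes retries safe.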

Search Architecture for Scale

User → API Gateway → Search Service → Elasticsearch Cluster
                                         ↑
                         Indexing Pipeline (Kafka Consumer)
                                         ↑
                              Kafka (change events)
                                         ↑
                         Primary Database (PostgreSQL)

Interview Tips

Mention Elasticsearch specifically when designing systems that need search (e-commerce, job boards, document search)
Know the inverted index — it's the key insight that makes search fast
Discuss sync strategy: event-driven via Kafka/CDC is the robust production answer
For typeahead/autocomplete, discuss prefix search or suggest APIs — different from full-text search
Faceted search (filter by category, price range, rating) is a common requirement — Elasticsearch aggregations handle this natively
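
To illustrate why typeahead differs from full-text search, here is a minimal prefix lookup over a sorted term list using binary search. This is one simple approach sketched for this example, not how any particular engine implements its suggest API, and the terms are made up.

```python
import bisect


def prefix_matches(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`."""
    # Binary search for the first term that is >= prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches


terms = sorted(["headphones", "headset", "heater", "wireless", "wire"])
print(prefix_matches(terms, "head"))  # ['headphones', 'headset']
print(prefix_matches(terms, "wire"))  # ['wire', 'wireless']
```

All terms sharing a prefix sit next to each other in sorted order, so the lookup touches only the matching range. Real typeahead usually adds ranking by popularity on top.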
