Search Systems

Search is a first-class feature in most applications. Relational databases are generally a poor fit for full-text search at scale; dedicated search engines such as Elasticsearch are built for it and add relevance ranking, faceting, and typo tolerance.

Why Not Just Use SQL LIKE?

SELECT * FROM products WHERE description LIKE '%wireless headphones%';

Problems:

  • Full table scan: a pattern with a leading % can't use a B-Tree index, so every row is read and the query gets slower as the table grows
  • No relevance ranking — results aren't sorted by how good a match they are
  • No typo tolerance — headphons returns nothing
  • No stemming: LIKE matches raw substrings, so searching "running" doesn't find "run" or "runs", while '%run%' also matches unrelated words like "brunch"
  • No faceting — can't efficiently compute filter counts ("245 items in Electronics")
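
A few of these problems can be reproduced with a small sketch using Python's built-in sqlite3 module. The table contents, the index name idx_description, and the helpers like_search and query_plan are made up for illustration; other databases word their query plans differently.

```python
import sqlite3

# In-memory database with a hypothetical products table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.execute("CREATE INDEX idx_description ON products (description)")
conn.executemany(
    "INSERT INTO products (description) VALUES (?)",
    [
        ("Wireless headphones with noise cancelling",),
        ("Wired headphones",),
        ("Bluetooth speaker",),
    ],
)


def like_search(pattern):
    """Return the ids of products whose description matches a LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM products WHERE description LIKE ?", (pattern,)
    )
    return [row[0] for row in rows]


def query_plan():
    """Return SQLite's plan for the leading-wildcard query as one string."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM products WHERE description LIKE '%wireless headphones%'"
    )
    return " ".join(row[3] for row in rows)


print(like_search("%wireless headphones%"))  # exact phrase: one match
print(like_search("%wireless headphons%"))   # one typo: no matches
print(query_plan())                          # reports a SCAN, not an index SEARCH
```

Even with an index on description, the plan is a scan, and a single missing letter in the query is enough to lose the result.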

Full-Text Search Engines

Elasticsearch / OpenSearch

The industry standard. Distributed, scalable, feature-rich.

Key features:

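  • Inverted index for fast term lookups (query cost depends roughly on the number of matching documents, not on the size of the whole collection)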
  • Relevance scoring (BM25 algorithm)
  • Fuzzy matching / typo tolerance
  • Aggregations and facets
  • Geospatial queries
  • Near real-time indexing (data is searchable within ~1 second)
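
Fuzzy matching is usually defined in terms of edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one term into another. The sketch below uses plain Levenshtein distance and a linear pass over a tiny vocabulary; the function names edit_distance and fuzzy_lookup are illustrative, and production engines use far more efficient structures than a linear scan.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_lookup(term, vocabulary, max_edits=1):
    """Return vocabulary terms within max_edits of the (possibly misspelled) term."""
    return [word for word in vocabulary if edit_distance(term, word) <= max_edits]


vocabulary = ["headphones", "headset", "wireless"]
print(fuzzy_lookup("headphons", vocabulary))  # the misspelling is one edit from "headphones"
```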

Typesense / Meilisearch

Simpler and faster to set up than Elasticsearch, with strong typo tolerance out of the box. A good fit for smaller datasets and for teams that want minimal operational overhead.

Algolia

Managed search-as-a-service. Excellent developer experience, very fast. Higher cost than self-hosted.

How Search Engines Work

Inverted Index

The core data structure. Instead of storing documents and scanning them, the engine maintains a mapping:

"wireless" → [doc 3, doc 7, doc 15]
"headphones" → [doc 1, doc 7, doc 22]
"audio" → [doc 1, doc 3, doc 8]

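To find documents matching "wireless headphones", intersect the two lists: only doc 7 appears in both, so the result is {7}. This is fast because the engine reads two short lists instead of scanning every document.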
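
The lookup described above can be sketched in a few lines of Python. The document texts are invented so that the resulting posting lists match the mapping shown; build_index and search are illustrative names, and a real engine stores sorted, compressed posting lists rather than Python sets.

```python
from collections import defaultdict


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical documents chosen to reproduce the posting lists in the text.
docs = {
    1: "headphones audio",
    3: "wireless audio",
    7: "wireless headphones",
    8: "audio",
    15: "wireless",
    22: "headphones",
}

index = build_index(docs)
print(sorted(index["wireless"]))             # [3, 7, 15]
print(search(index, "wireless headphones"))  # {7}
```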

Analysis Pipeline

Before indexing, text is processed:

  1. Tokenization: "Wireless Headphones" → ["Wireless", "Headphones"]
  2. Lowercasing: → ["wireless", "headphones"]
  3. Stemming: "running" → "run", "headphones" → "headphone"
  4. Stop word removal: "the", "a", "of" removed (they appear everywhere, useless for search)

The same pipeline runs on queries, so searches match the indexed tokens.
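
A minimal sketch of such a pipeline, following the four steps in the order listed. The stemmer here is a deliberately naive suffix stripper written for this example (real engines typically use algorithms such as Porter or Snowball), and the stop word list is a tiny illustrative sample.

```python
import re

STOP_WORDS = {"the", "a", "of"}  # illustrative sample, not a complete list


def stem(token):
    """Naive stemmer for illustration: strips "ing" and plural "s" only."""
    if token.endswith("ing") and len(token) > 5:
        base = token[:-3]
        # Undo consonant doubling: "runn" -> "run"
        if len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def analyze(text):
    """Tokenize, lowercase, stem, then drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)     # 1. tokenization
    tokens = [t.lower() for t in tokens]           # 2. lowercasing
    tokens = [stem(t) for t in tokens]             # 3. stemming
    return [t for t in tokens if t not in STOP_WORDS]  # 4. stop word removal


print(analyze("The Wireless Headphones"))  # ['wireless', 'headphone']
print(analyze("running"))                  # ['run']
```

Because analyze is applied to both documents and queries, a query for "running" and a document containing "run" end up with the same token.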

Relevance Scoring (BM25)

Not all matches are equal. Relevance scoring determines which documents appear first:

  • Term Frequency (TF): How often does the search term appear in the document?
  • Inverse Document Frequency (IDF): How rare is this term across all documents? Rare terms are more meaningful.
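  • Document length normalization: a match in a short document counts for more than the same match in a very long one.
  • Field weighting: A match in the title is more important than in the description. This is a boost applied on top of BM25 rather than part of the formula itself.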
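
To make the TF and IDF factors concrete, here is a sketch of one commonly used BM25 formulation over pre-tokenized documents. The parameter values k1 = 1.2 and b = 0.75 are commonly cited defaults; the toy documents and the function names idf and bm25 are assumptions for this example, and field weighting is left out.

```python
import math
from collections import Counter

K1 = 1.2   # controls how quickly repeated terms stop adding score
B = 0.75   # controls how strongly long documents are penalized


def idf(term, docs):
    """Inverse document frequency: higher for terms found in fewer documents."""
    n = sum(1 for doc in docs if term in doc)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))


def bm25(query_terms, doc, docs):
    """BM25 score of one tokenized document for a list of query terms."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        freq = tf[term]
        length_norm = K1 * (1 - B + B * len(doc) / avg_len)
        score += idf(term, docs) * freq * (K1 + 1) / (freq + length_norm)
    return score


# Hypothetical, already-analyzed documents of equal length.
docs = [
    ["wireless", "headphones", "wireless", "audio"],
    ["wired", "headphones", "audio", "cable"],
    ["wireless", "speaker", "audio", "bass"],
]

for doc in docs:
    print(doc, round(bm25(["wireless"], doc, docs), 3))
```

The first document mentions "wireless" twice and scores highest, the second does not contain it and scores zero, and a term like "cable" that appears in one document has a higher IDF than "audio", which appears in all three.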

Keeping Search in Sync

Search engines are secondary indexes — your source of truth is still your database. You must keep them in sync.

Synchronous (Double-Write)

Write to database AND search engine in the same request.

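Problem: if the search engine write fails after the database write succeeds, the record exists in the database but is missing from search, and there is no shared transaction to roll both writes back.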

Event-Driven (Recommended)

Publish a "data changed" event to a message queue → Consumer reads the event and updates the search index.

DB Write → Kafka Event → Search Indexer → Elasticsearch

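Benefits: the write path is decoupled from indexing, failed index updates can be retried, and the search index can be rebuilt by replaying the event history (as long as the events are retained).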
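
The retry behaviour can be sketched with an in-memory queue standing in for Kafka. Everything here is hypothetical: FlakySearchIndex simulates a search engine that is briefly unavailable, and save_product and run_indexer are illustrative names rather than any library's API.

```python
from collections import deque


class FlakySearchIndex:
    """Stand-in for a search engine client; fails the first `failures` writes."""

    def __init__(self, failures=1):
        self.docs = {}
        self.failures = failures

    def upsert(self, doc_id, doc):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("search engine unavailable")
        self.docs[doc_id] = doc


def save_product(db, events, doc_id, doc):
    """Write to the source of truth, then publish a change event."""
    db[doc_id] = doc
    events.append({"id": doc_id, "doc": doc})


def run_indexer(events, index, max_attempts=3):
    """Consume events in order; an event is removed only after it is indexed."""
    while events:
        event = events[0]
        for _ in range(max_attempts):
            try:
                index.upsert(event["id"], event["doc"])
                break
            except ConnectionError:
                continue  # retry the same event
        else:
            return False  # still failing: leave the event queued for a later run
        events.popleft()
    return True


db, events = {}, deque()
index = FlakySearchIndex(failures=2)
save_product(db, events, 7, {"name": "Wireless headphones"})
print(run_indexer(events, index))  # True: third attempt succeeds
print(index.docs == db)            # True: index has caught up with the database
```

The request that called save_product never sees the search engine outage; the indexer simply catches up once writes succeed again.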

Change Data Capture (CDC)

Debezium or similar tools stream database changes (read from the write-ahead log) to Kafka, and an indexing consumer applies them to Elasticsearch. The application's write path needs no code changes.
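
On the consuming side, the indexer's job is to translate each change event into an index operation. The sketch below assumes a Debezium-style envelope with op, before, and after fields, already unwrapped from any outer payload wrapper; the dict standing in for the search index, the "id" key, and the function name apply_change are assumptions for illustration.

```python
def apply_change(index, event, key="id"):
    """Apply one change event to an in-memory stand-in for the search index."""
    op = event["op"]
    if op in ("c", "u", "r"):   # create, update, or snapshot read
        row = event["after"]
        index[row[key]] = row   # upsert the latest row state
    elif op == "d":             # delete
        index.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


index = {}
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Headphones"}},
    {"op": "u", "before": {"id": 1, "name": "Headphones"},
     "after": {"id": 1, "name": "Wireless Headphones"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Speaker"}},
    {"op": "d", "before": {"id": 2, "name": "Speaker"}, "after": None},
]
for event in events:
    apply_change(index, event)

print(index)  # {1: {'id': 1, 'name': 'Wireless Headphones'}}
```

Because each event carries the full row state, applying the same event twice leaves the index unchanged, which makes retries safe.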

Search Architecture for Scale

User → API Gateway → Search Service → Elasticsearch Cluster
                                         ↑
                         Indexing Pipeline (Kafka Consumer)
                                         ↑
                              Kafka (change events)
                                         ↑
                         Primary Database (PostgreSQL)

Interview Tips

  • Mention Elasticsearch specifically when designing systems that need search (e-commerce, job boards, document search)
  • Know the inverted index — it's the key insight that makes search fast
  • Discuss sync strategy: event-driven via Kafka/CDC is the robust production answer
  • For typeahead/autocomplete, discuss prefix search or suggest APIs — different from full-text search
  • Faceted search (filter by category, price range, rating) is a common requirement — Elasticsearch aggregations handle this natively
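
To illustrate why typeahead differs from full-text search, here is a minimal prefix-search sketch over a sorted list of terms using binary search. The term list and the function name suggest are illustrative; real systems often rank suggestions by popularity instead of returning them in alphabetical order.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms that start with `prefix`, in sorted order."""
    start = bisect.bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:start + limit]:
        if not term.startswith(prefix):
            break  # terms are sorted, so no later term can match
        matches.append(term)
    return matches


terms = sorted(["headset", "wireless", "head", "heater", "headphones"])
print(suggest(terms, "head"))   # ['head', 'headphones', 'headset']
print(suggest(terms, "heads"))  # ['headset']
```

The lookup works on raw prefixes of whole terms, with no stemming or relevance scoring, which is why suggest APIs are usually served from a separate structure.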