10. Web Crawler


Design a web crawler for use by a search engine.

Functional Requirements
  • Your system should generate an inverted index for use in a search engine
  • Your system should be able to extract URLs from the pages it crawls
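The two functional requirements can be illustrated with a minimal sketch that uses only the Python standard library. It parses one HTML page, collects the absolute URLs of its links, and adds the page's words to an in-memory inverted index (word -> set of URLs). The names LinkAndTextParser and index_page are hypothetical, and the in-memory dict is an assumption for illustration only; it is not a storage design for the real system.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkAndTextParser(HTMLParser):
    """Collects absolute link targets and visible words from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.words = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; keep lowercase alphabetic words.
        if self._skip_depth == 0:
            self.words.extend(re.findall(r"[a-z]+", data.lower()))


def index_page(index, url, html):
    """Add the page's words to the inverted index and return its outgoing links."""
    parser = LinkAndTextParser(url)
    parser.feed(html)
    for word in parser.words:
        index[word].add(url)
    return parser.links


# Illustrative usage with a made-up page.
inverted_index = defaultdict(set)
page = '<p>Cats chase mice.</p><a href="../b#top">next</a><script>var x = "ignored";</script>'
outgoing = index_page(inverted_index, "https://example.com/a/", page)
```

The links returned by index_page would feed the crawler's queue of pages still to visit, while the index itself maps each word to the pages that contain it.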
Nonfunctional Requirements
  • We are only interested in HTML; we don't care about images or other files
  • The index should be updated daily with the latest data
  • You are given an initial URL data set to kick off the crawler
  • Already-crawled websites change every day, and those updates should be picked up
  • We should only include root words in the inverted index
  • We have access to a third-party service that takes in text and outputs root words
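One possible way to handle daily updates and root words is sketched below. It assumes a content hash is an acceptable change signal (the prompt does not prescribe one), and it models the third-party root-word service as a plain callable passed in by the caller, since the prompt does not describe its API. The names content_fingerprint, needs_reindex, and root_terms are hypothetical.

```python
import hashlib


def content_fingerprint(html):
    """Return a SHA-256 hex digest of the page body."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def needs_reindex(url, html, fingerprints):
    """Return True if the page is new or changed, recording its new fingerprint."""
    digest = content_fingerprint(html)
    if fingerprints.get(url) == digest:
        return False  # unchanged since the last crawl, so skip re-indexing
    fingerprints[url] = digest
    return True


def root_terms(text, root_word_service):
    """Reduce text to a set of root words.

    root_word_service is a stand-in for the third-party service: any callable
    that takes text and returns an iterable of root words (an assumption).
    """
    return set(root_word_service(text))


# Illustrative usage with a fake root-word service (not the real one).
def fake_root_word_service(text):
    return [word.lower().rstrip("s") for word in text.split()]


seen = {}
first_visit = needs_reindex("https://example.com/", "<p>cats</p>", seen)
second_visit = needs_reindex("https://example.com/", "<p>cats</p>", seen)
after_change = needs_reindex("https://example.com/", "<p>dogs</p>", seen)
terms = root_terms("Cats dogs", fake_root_word_service)
```

Comparing fingerprints lets the daily crawl skip the root-word service and the index update for pages whose content has not changed, at the cost of storing one digest per URL.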
Estimated Usage
  • 1 billion pages to be searched
  • 50 starting pages
  • Each page is 1 MB
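A rough back-of-the-envelope calculation follows from these numbers. The sketch below assumes decimal units (1 MB = 1,000,000 bytes) and, as an upper bound, that every page is re-fetched once per day to satisfy the daily-update requirement; a real crawler may fetch fewer pages if it can skip unchanged ones.

```python
PAGES = 1_000_000_000          # 1 billion pages
PAGE_SIZE_BYTES = 1_000_000    # 1 MB per page (decimal units, an assumption)
SECONDS_PER_DAY = 24 * 60 * 60

# Raw HTML volume for one full crawl: 10**15 bytes, i.e. 1 PB.
total_bytes = PAGES * PAGE_SIZE_BYTES

# Sustained rates needed to re-fetch everything within one day.
pages_per_second = PAGES / SECONDS_PER_DAY          # about 11,574 pages/s
bytes_per_second = total_bytes / SECONDS_PER_DAY    # about 11.6 GB/s

print(f"total: {total_bytes / 1e15:.0f} PB")
print(f"fetch rate: {pages_per_second:,.0f} pages/s")
print(f"bandwidth: {bytes_per_second / 1e9:.1f} GB/s")
```

Rates of this size suggest that fetching, parsing, and indexing would need to be spread across many machines rather than run on a single crawler.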
