10. Web Crawler
Easy
Design a web crawler to be used by a search engine.
- Functional Requirements
  - The system should generate an inverted index for use in a search engine (a minimal sketch follows this list).
  - The system should extract URLs from fetched pages so that new pages can be discovered.
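As a rough illustration of these two requirements, here is a minimal sketch in Python (standard library only) of pulling links out of an HTML page and adding its words to an in-memory inverted index. The tokenization, tag stripping, and data structures are simplifying assumptions for illustration, not part of the problem statement.

```python
# Sketch of the two functional requirements: URL extraction and
# inverted-index construction. Storage and parsing are deliberately naive.
from collections import defaultdict
from html.parser import HTMLParser
import re


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(html: str) -> list[str]:
    """Returns every outgoing link found in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def index_page(url: str, html: str, index: dict[str, set[str]]) -> None:
    """Adds every word of the page to the inverted index: word -> set of URLs."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping for the sketch
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(url)


inverted_index: dict[str, set[str]] = defaultdict(set)
```

In a real design the index would be built by distributed workers and persisted to durable storage rather than kept in a single process; the sketch only shows the shape of the data.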
- Nonfunctional Requirements
  - Only HTML content is of interest; images and other files can be ignored.
  - The index should be updated daily with the latest data.
- Assumptions
  - An initial URL seed set is provided to kick off the crawler.
  - Already-crawled websites change every day, so updates must be taken into account.
  - Only root words should be stored in the inverted index.
  - A third-party service is available that takes in text and outputs root words (see the sketch after this list).
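Because only root words go into the index and a third-party rooting service is assumed to exist, the indexing step might wrap that service as shown below. `get_root_words` is a hypothetical stand-in for the service, with a trivial local fallback so the sketch runs on its own; it is not a real API.

```python
# Sketch of folding the assumed third-party root-word service into indexing.
from collections import defaultdict


def get_root_words(text: str) -> list[str]:
    """Placeholder for the third-party service: text in, root words out.
    A trivial lowercase/split stands in for real stemming here."""
    return [w.strip(".,!?").lower() for w in text.split()]


def index_with_roots(url: str, text: str, index: dict[str, set[str]]) -> None:
    # Only the root form of each word is stored, per the assumptions above.
    for root in get_root_words(text):
        if root:
            index[root].add(url)


index: dict[str, set[str]] = defaultdict(set)
index_with_roots("https://example.com", "Running runners run daily.", index)
```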
- Estimated Usage (worked through briefly below)
  - 1 billion pages to be crawled and indexed
  - 50 starting (seed) pages
  - Each page is roughly 1 MB
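A quick back-of-envelope pass over these numbers (my arithmetic, not part of the prompt): 1 billion pages at roughly 1 MB each is about 1 PB of raw HTML per full crawl, and refreshing everything daily implies on the order of 12,000 page fetches per second sustained.

```python
# Back-of-envelope numbers derived from the estimated usage above.
PAGES = 1_000_000_000        # 1 billion pages
PAGE_SIZE_MB = 1             # ~1 MB per page
SECONDS_PER_DAY = 86_400

total_storage_pb = PAGES * PAGE_SIZE_MB / 1_000_000_000   # MB -> PB
fetches_per_second = PAGES / SECONDS_PER_DAY               # for a daily refresh

print(f"~{total_storage_pb:.0f} PB of raw HTML per full crawl")
print(f"~{fetches_per_second:,.0f} page fetches per second for a daily refresh")
```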