Web Crawler Multithreaded - Problem
Given a URL startUrl and an interface HtmlParser, implement a multi-threaded web crawler to crawl all links that are under the same hostname as startUrl.
Your crawler should:
- Start from the page startUrl
- Call HtmlParser.getUrls(url) to get all URLs from a webpage
- Not crawl the same link twice
- Explore only the links that are under the same hostname as startUrl
The HtmlParser interface is defined as:
interface HtmlParser {
    public List<String> getUrls(String url);
}
Note: getUrls(url) is a blocking call that simulates performing an HTTP request. Single-threaded solutions will exceed the time limit, so you need a multi-threaded solution.
Input & Output
Example 1 — Basic Website Crawling
Input:
startUrl = "http://news.yahoo.com/news/topics/", htmlParser returns {"http://news.yahoo.com/news/topics/": ["http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]}
Output:
["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
💡 Note:
Starting from the given URL, the crawler finds two links with the same hostname, so all three URLs are returned
Example 2 — Single Page
Input:
startUrl = "http://example.com", htmlParser returns {"http://example.com": ["http://other.com/page"]}
Output:
["http://example.com"]
💡 Note:
Only the start URL has the matching hostname; the other link points to a different host, so it is ignored
Example 3 — Circular References
Input:
startUrl = "http://test.com", htmlParser returns {"http://test.com": ["http://test.com/page"], "http://test.com/page": ["http://test.com"]}
Output:
["http://test.com", "http://test.com/page"]
💡 Note:
Both pages link to each other, but the visited set prevents infinite crawling
Constraints
- 1 ≤ urls.length ≤ 1000
- 1 ≤ urls[i].length ≤ 300
- startUrl is one of the urls
- Hostname label must be from 1 to 63 characters long
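Since every URL in the examples uses the http scheme and contains a hostname, one convenient way to implement the same-hostname check is java.net.URI rather than hand-rolled string slicing. This is a sketch; the parsing strategy is up to you:

```java
import java.net.URI;

public class HostnameDemo {
    public static void main(String[] args) {
        // getHost() returns just the hostname component, e.g. "news.yahoo.com"
        String host = URI.create("http://news.yahoo.com/news/topics/").getHost();
        String other = URI.create("http://news.yahoo.com/news").getHost();
        System.out.println(host.equals(other)); // prints "true"
    }
}
```

Links like "http://other.com/page" from Example 2 fail this equality check and are skipped.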
Visualization
Understanding the Visualization
1. Input: Start URL and HtmlParser interface
2. Multi-threaded Crawl: Parallel processing of URLs with synchronization
3. Output: All URLs from same hostname
Key Takeaway
🎯 Key Insight: Multi-threading with proper synchronization transforms slow sequential I/O into fast parallel processing
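One way to realize this idea is a fixed thread pool plus a concurrent visited set, with a Phaser counting in-flight pages so the caller knows when the frontier is drained. The sketch below is one possible approach, not the judge's reference solution; the local HtmlParser interface and the thread-pool size of 8 are stand-ins and assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

public class Crawler {
    // Local stand-in for the judge's HtmlParser interface.
    interface HtmlParser {
        List<String> getUrls(String url);
    }

    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final Phaser phaser = new Phaser(1); // party 0 is the caller
    private final HtmlParser parser;
    private final String host;

    private Crawler(HtmlParser parser, String host) {
        this.parser = parser;
        this.host = host;
    }

    // "http://news.yahoo.com/news" -> "news.yahoo.com" (every URL uses http://).
    static String hostname(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    private void submit(String url) {
        phaser.register(); // one phaser party per in-flight page
        pool.execute(() -> {
            try {
                for (String next : parser.getUrls(url)) {
                    // Set.add is atomic here, so each URL is fetched at most once.
                    if (hostname(next).equals(host) && visited.add(next)) {
                        submit(next);
                    }
                }
            } finally {
                phaser.arriveAndDeregister();
            }
        });
    }

    static List<String> crawl(String startUrl, HtmlParser parser) {
        Crawler c = new Crawler(parser, hostname(startUrl));
        c.visited.add(startUrl);
        c.submit(startUrl);
        c.phaser.arriveAndAwaitAdvance(); // block until every task has finished
        c.pool.shutdown();
        return new ArrayList<>(c.visited);
    }
}
```

Because the blocking getUrls calls run concurrently on the pool, total wall-clock time is roughly the depth of the link graph rather than the number of pages, which is the point of the multi-threaded requirement.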