Web Crawler Multithreaded - Problem

Given a URL startUrl and an interface HtmlParser, implement a multi-threaded web crawler to crawl all links that are under the same hostname as startUrl.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all URLs from a webpage
  • Do not crawl the same link twice
  • Explore only the links that are under the same hostname as startUrl (a hostname-extraction sketch follows this list)
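
The same-hostname test reduces to extracting and comparing the host portion of each URL. Below is a minimal sketch, assuming every URL uses the http:// scheme as in the examples; getHostname and HostnameDemo are illustrative names, not part of the problem API:

public class HostnameDemo {
    // "http://host/path" splits on '/' into ["http:", "", "host", ...],
    // so the hostname is the token at index 2.
    static String getHostname(String url) {
        return url.split("/")[2];
    }

    public static void main(String[] args) {
        System.out.println(getHostname("http://news.yahoo.com/news/topics/")); // news.yahoo.com
        System.out.println(getHostname("http://example.com"));                 // example.com
    }
}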

The HtmlParser interface is defined as:

interface HtmlParser {
    public List<String> getUrls(String url);
}

Note: getUrls(url) is a blocking call that simulates performing an HTTP request. Single-threaded solutions will exceed the time limit, so you need a multi-threaded solution.
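
One common approach, sketched below: submit every fetch to a fixed thread pool so the blocking getUrls calls overlap, and let a single coordinating loop drain the resulting futures. This is a hedged sketch rather than an official solution; it assumes the HtmlParser interface above, and the pool size (8), the Solution.crawl naming, and the getHostname helper are illustrative choices.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        String hostname = getHostname(startUrl);
        // Thread-safe set of every URL we have decided to crawl.
        Set<String> visited = ConcurrentHashMap.newKeySet();
        visited.add(startUrl);

        ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is a tuning choice
        Deque<Future<List<String>>> pending = new ArrayDeque<>();
        pending.add(pool.submit(() -> htmlParser.getUrls(startUrl)));

        try {
            // Fetches run in parallel on the pool; this loop only blocks on
            // the oldest outstanding fetch and schedules newly found links.
            while (!pending.isEmpty()) {
                for (String url : pending.poll().get()) {
                    // visited.add is atomic, so each URL is scheduled at most once.
                    if (hostname.equals(getHostname(url)) && visited.add(url)) {
                        pending.add(pool.submit(() -> htmlParser.getUrls(url)));
                    }
                }
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(visited);
    }

    // "http://host/path" splits into ["http:", "", "host", "path", ...],
    // so the hostname is the token at index 2.
    private String getHostname(String url) {
        return url.split("/")[2];
    }
}

A design note: only the coordinating thread touches pending and visited in this sketch, so the bookkeeping stays race-free while the expensive getUrls calls overlap on the pool. A plain HashSet would also work here, but the concurrent set keeps the sketch safe if the filtering work is later moved into the workers.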

Input & Output

Example 1 — Basic Website Crawling
$ Input: startUrl = "http://news.yahoo.com/news/topics/", htmlParser returns {"http://news.yahoo.com/news/topics/": ["http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]}
Output: ["http://news.yahoo.com/news/topics/", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/business"]
💡 Note: Start with the given URL, crawl it to find 2 links with the same hostname, and return all 3 URLs found
Example 2 — Single Page
$ Input: startUrl = "http://example.com", htmlParser returns {"http://example.com": ["http://other.com/page"]}
Output: ["http://example.com"]
💡 Note: Only the start URL has the matching hostname; the other link points to a different host, so it is ignored
Example 3 — Circular References
$ Input: startUrl = "http://test.com", htmlParser returns {"http://test.com": ["http://test.com/page"], "http://test.com/page": ["http://test.com"]}
Output: ["http://test.com", "http://test.com/page"]
💡 Note: Both pages link to each other, but the visited set prevents infinite crawling
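
To experiment with these examples locally, the HtmlParser interface can be stubbed with a map from page to outgoing links. This is an illustrative test helper, not part of the judge; MapHtmlParser is a made-up name, and the map below mirrors Example 1:

import java.util.Collections;
import java.util.List;
import java.util.Map;

interface HtmlParser {                    // as defined above
    List<String> getUrls(String url);
}

class MapHtmlParser implements HtmlParser {
    private final Map<String, List<String>> links;

    MapHtmlParser(Map<String, List<String>> links) {
        this.links = links;
    }

    @Override
    public List<String> getUrls(String url) {
        // Return the outgoing links recorded for this page, or none.
        return links.getOrDefault(url, Collections.emptyList());
    }
}

class MapHtmlParserDemo {
    public static void main(String[] args) {
        HtmlParser parser = new MapHtmlParser(Map.of(
            "http://news.yahoo.com/news/topics/",
            List.of("http://news.yahoo.com/news",
                    "http://news.yahoo.com/news/topics/business")));
        System.out.println(parser.getUrls("http://news.yahoo.com/news/topics/"));
    }
}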

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • The hostname label must be from 1 to 63 characters long

Visualization

[Figure: single-threaded vs. multi-threaded crawling. Sequentially, each fetch waits for the previous one, so the total time is T1 + T2 + T3 + ...; with parallel fetches it is roughly max(T1, T2, T3, T4). The key challenge is thread synchronization: prevent duplicate crawling with a thread-safe visited set, and coordinate result collection without race conditions. Same-hostname filtering plus no duplicate visits yields correct results.]
Understanding the Visualization

  1. Input: the start URL and the HtmlParser interface.
  2. Multi-threaded crawl: URLs are processed in parallel, with synchronization on shared state.
  3. Output: all URLs under the same hostname as startUrl.

Key Takeaway
🎯 Multi-threading with proper synchronization transforms slow sequential I/O into fast parallel processing.
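
The thread-safe visited set highlighted above comes down to a single atomic call: Set.add on a concurrent set returns true for exactly one caller per element. A small illustrative demo follows; VisitedSetDemo and the worker names are made up:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class VisitedSetDemo {
    public static void main(String[] args) throws InterruptedException {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        String url = "http://test.com/page";

        // add() is atomic and returns true only for the first caller,
        // so at most one worker "wins" the right to crawl a given URL.
        Runnable worker = () -> {
            if (visited.add(url)) {
                System.out.println(Thread.currentThread().getName() + " crawls " + url);
            } else {
                System.out.println(Thread.currentThread().getName() + " skips a duplicate");
            }
        };

        Thread t1 = new Thread(worker, "worker-1");
        Thread t2 = new Thread(worker, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}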