Given a startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all urls from a webpage
  • Avoid crawling the same link twice
  • Explore only the links that are under the same hostname as startUrl

For example, if startUrl = "http://news.yahoo.com/news/topics/", then the hostname is news.yahoo.com. You should only crawl URLs like http://news.yahoo.com/...

Note: Consider the same URL with trailing slash "/" as different. For example, "http://news.yahoo.com" and "http://news.yahoo.com/" are different urls.
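
The hostname rule and the trailing-slash note can be illustrated with a minimal sketch. The helper name `hostname` is hypothetical (it is not part of the problem's interface), and the string splitting assumes every URL follows the http://hostname/path format without a port, as the constraints state.

```python
def hostname(url: str) -> str:
    """Return the hostname of a URL shaped like http://hostname/path.

    Assumption: the URL has a scheme and no port, per the constraints.
    """
    # Drop the scheme, then keep everything before the first "/".
    return url.split("://", 1)[1].split("/", 1)[0]


start = "http://news.yahoo.com/news/topics/"
print(hostname(start))  # news.yahoo.com

# These two share a hostname but are still treated as different URLs,
# because URLs are compared as exact strings.
a, b = "http://news.yahoo.com", "http://news.yahoo.com/"
print(hostname(a) == hostname(b), a == b)  # True False
```

For URLs outside these constraints (ports, credentials, other schemes), a real URL parser would be a safer choice than string splitting.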

Input & Output

Example 1 — Basic Web Crawling
$ Input: startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com"], edges = [[0,2],[2,1],[3,2],[3,1]]
Output: ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/"]
💡 Note: Starting from "http://news.yahoo.com", the crawler reaches every page with the hostname "news.yahoo.com". The Google URL is excluded because it has a different hostname.
Example 2 — Single Page
$ Input: startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com"], edges = []
Output: ["http://news.yahoo.com"]
💡 Note: The start URL has no outgoing links, so the result contains only the start page.
Example 3 — Different Hostname Filtering
$ Input: startUrl = "http://example.com/page", urls = ["http://example.com/page", "http://example.com/about", "http://other.com/page"], edges = [[0,1],[0,2]]
Output: ["http://example.com/page", "http://example.com/about"]
💡 Note: The start page links to both other pages, but only "http://example.com/about" is included, because other.com is a different hostname.
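
To experiment with these examples locally, the urls and edges inputs can be turned into a stand-in parser. This is a sketch only: the class name `MockHtmlParser` is hypothetical, and it assumes that an edge [i, j] means the page urls[i] contains a link to urls[j], which matches the note for Example 3 but is not spelled out in the statement.

```python
from typing import Dict, List


class MockHtmlParser:
    """Hypothetical stand-in for HtmlParser, built from urls and edges."""

    def __init__(self, urls: List[str], edges: List[List[int]]) -> None:
        self._links: Dict[str, List[str]] = {u: [] for u in urls}
        for src, dst in edges:
            # Assumption: [src, dst] means urls[src] links to urls[dst].
            self._links[urls[src]].append(urls[dst])

    def getUrls(self, url: str) -> List[str]:
        # Return a copy so callers cannot mutate the link table.
        return list(self._links.get(url, []))


# Example 3 from above.
urls = [
    "http://example.com/page",
    "http://example.com/about",
    "http://other.com/page",
]
parser = MockHtmlParser(urls, [[0, 1], [0, 2]])
print(parser.getUrls("http://example.com/page"))
# ['http://example.com/about', 'http://other.com/page']
print(parser.getUrls("http://example.com/about"))  # []
```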

Constraints

  • 1 ≤ urls.length ≤ 1000
  • 1 ≤ urls[i].length ≤ 300
  • startUrl is one of the urls
  • All URLs follow the format http://hostname/path without port

Visualization

[Figure: "Web Crawler: Explore URLs in Same Domain". Starting from yahoo.com, the crawl reaches yahoo.com/news and yahoo.com/sport (valid: same domain) and filters out google.com (different domain). Result: ["yahoo.com", "yahoo.com/news", "yahoo.com/sport"]]

Understanding the Visualization

  1. Input: startUrl and a graph of connected web pages
  2. Process: extract the hostname, traverse the graph, filter by hostname
  3. Output: list of all reachable URLs under the same hostname

Key Takeaway

🎯 Key Insight: Extract the hostname once from startUrl, then run DFS or BFS with a visited set over the web graph, following only URLs whose hostname matches.
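
The insight above can be sketched as a breadth-first crawl. This is one possible approach under stated assumptions, not a reference solution: `crawl`, `hostname`, and `StubParser` are hypothetical names, and only `getUrls(url)` comes from the problem's HtmlParser interface. The stub parser exists only so the sketch runs on its own.

```python
from collections import deque
from typing import Dict, List


class StubParser:
    """Hypothetical HtmlParser backed by a fixed link table."""

    def __init__(self, links: Dict[str, List[str]]) -> None:
        self._links = links

    def getUrls(self, url: str) -> List[str]:
        return self._links.get(url, [])


def hostname(url: str) -> str:
    # Assumes the http://hostname/path format from the constraints.
    return url.split("://", 1)[1].split("/", 1)[0]


def crawl(start_url: str, parser) -> List[str]:
    host = hostname(start_url)   # extracted once, reused for every link
    visited = {start_url}        # guards against crawling a URL twice
    queue = deque([start_url])
    order: List[str] = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in parser.getUrls(url):
            # Follow a link only if it is new and on the same hostname.
            if nxt not in visited and hostname(nxt) == host:
                visited.add(nxt)
                queue.append(nxt)
    return order


links = {
    "http://example.com/page": ["http://example.com/about", "http://other.com/page"],
    "http://example.com/about": ["http://example.com/page"],
}
print(crawl("http://example.com/page", StubParser(links)))
# ['http://example.com/page', 'http://example.com/about']
```

Marking a URL as visited when it is enqueued, rather than when it is dequeued, keeps the same URL from entering the queue more than once. With at most 1000 URLs, each page is parsed once, so the work is linear in the number of pages plus links.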