Web Crawler - Problem
Given a startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.
Your crawler should:
- Start from the page: startUrl
- Call HtmlParser.getUrls(url) to get all urls from a webpage
- Do not crawl the same link twice
- Explore only the links that are under the same hostname as startUrl
For example, if startUrl = "http://news.yahoo.com/news/topics/", then the hostname is news.yahoo.com. You should only crawl URLs like http://news.yahoo.com/...
Note: Treat a URL with a trailing slash "/" as different from the same URL without it. For example, "http://news.yahoo.com" and "http://news.yahoo.com/" are different URLs.
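The hostname check and the trailing-slash rule can be sketched as below. This is only an illustration: the helper names get_hostname and same_host are hypothetical and not part of the problem's interface, and the sketch assumes Python's standard urllib.parse module is an acceptable way to read the hostname.

```python
from urllib.parse import urlsplit


def get_hostname(url: str) -> str:
    """Return the hostname of a URL such as http://hostname/path."""
    # urlsplit exposes the host part of the URL as .hostname (None if absent).
    return urlsplit(url).hostname or ""


def same_host(url: str, start_url: str) -> bool:
    """True when both URLs share the same hostname."""
    return get_hostname(url) == get_hostname(start_url)


# URLs are compared as raw strings, so the trailing slash yields two entries.
visited = {"http://news.yahoo.com", "http://news.yahoo.com/"}
```

Both strings in visited share the hostname news.yahoo.com, yet the set keeps them as two separate entries, which matches the note above.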
Input & Output
Example 1 — Basic Web Crawling
Input:
startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com"], edges = [[0,2],[2,1],[3,2],[3,1],[1,4]]
Output:
["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/"]
💡 Note:
Starting from "http://news.yahoo.com", we can reach the pages that share the hostname "news.yahoo.com". The Google URL is excluded because it has a different hostname.
Example 2 — Single Page
Input:
startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com"], edges = []
Output:
["http://news.yahoo.com"]
💡 Note:
The start URL has no outgoing links, so the result contains only the start page.
Example 3 — Different Hostname Filtering
Input:
startUrl = "http://example.com/page", urls = ["http://example.com/page", "http://example.com/about", "http://other.com/page"], edges = [[0,1],[0,2]]
Output:
["http://example.com/page", "http://example.com/about"]
💡 Note:
From the start page, both links are found, but only example.com/about is included because other.com has a different hostname.
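To try the examples locally, the urls and edges inputs can be turned into a stand-in parser. The sketch below is hypothetical: MockHtmlParser is not part of the problem, and it assumes that an edge [i, j] means the page urls[i] contains a link to urls[j], which is how Example 3 reads.

```python
from typing import Dict, List


class MockHtmlParser:
    """Hypothetical stand-in for HtmlParser, built from the example inputs."""

    def __init__(self, urls: List[str], edges: List[List[int]]) -> None:
        # Assumption: edge [i, j] means urls[i] links to urls[j].
        self._links: Dict[str, List[str]] = {url: [] for url in urls}
        for src, dst in edges:
            self._links[urls[src]].append(urls[dst])

    def getUrls(self, url: str) -> List[str]:
        # Pages outside the example list are treated as having no links.
        return list(self._links.get(url, []))


# Input of Example 3.
urls = [
    "http://example.com/page",
    "http://example.com/about",
    "http://other.com/page",
]
parser = MockHtmlParser(urls, [[0, 1], [0, 2]])
links = parser.getUrls("http://example.com/page")
```

Here links holds both outgoing links of the start page, including the other.com one; filtering by hostname is left to the crawler, not the parser.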
Constraints
- 1 ≤ urls.length ≤ 1000
- 1 ≤ urls[i].length ≤ 300
- startUrl is one of the urls
- All URLs follow the format http://hostname/path without port
Visualization
Understanding the Visualization
1. Input: startUrl and a graph of linked web pages
2. Process: extract the hostname, traverse the graph, filter by hostname
3. Output: list of all reachable URLs under the same hostname
Key Takeaway
🎯 Key Insight: Extract the hostname once from startUrl, then run DFS or BFS with a visited set over the link graph, keeping only the URLs whose hostname matches.
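A minimal BFS sketch of this idea follows. It is one possible approach, not a reference solution: DictHtmlParser is a hypothetical stub that stands in for the real HtmlParser, the function name crawl is an assumption, and the hostname is read with urllib.parse.urlsplit.

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urlsplit


class DictHtmlParser:
    """Hypothetical HtmlParser stub backed by a dict of page -> links."""

    def __init__(self, links: Dict[str, List[str]]) -> None:
        self._links = links

    def getUrls(self, url: str) -> List[str]:
        return self._links.get(url, [])


def crawl(start_url: str, html_parser) -> List[str]:
    host = urlsplit(start_url).hostname  # extracted once, reused for every link
    visited = {start_url}                # raw strings, so "/" variants differ
    queue = deque([start_url])
    order: List[str] = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in html_parser.getUrls(url):
            # Skip links already seen and links under another hostname.
            if nxt not in visited and urlsplit(nxt).hostname == host:
                visited.add(nxt)
                queue.append(nxt)
    return order


# Link structure of Example 3.
parser = DictHtmlParser({
    "http://example.com/page": [
        "http://example.com/about",
        "http://other.com/page",
    ],
})
result = crawl("http://example.com/page", parser)
```

The result lists pages in BFS discovery order. Marking a URL as visited when it is enqueued, not when it is dequeued, keeps each page from being queued twice, even when pages link back to each other. Replacing the deque with a stack (pop from the end) would give a DFS variant with the same visited-set logic.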