Web Crawler - Practice Coding Problems

Web Crawler - Problem

Given a startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Your crawler should:

Start from the page: startUrl
Call HtmlParser.getUrls(url) to get all URLs from a webpage
Avoid crawling the same link twice
Explore only the links that are under the same hostname as startUrl

For example, if startUrl = "http://news.yahoo.com/news/topics/", then the hostname is news.yahoo.com. You should only crawl URLs like http://news.yahoo.com/...

Note: Treat a URL with a trailing slash "/" and the same URL without it as different. For example, "http://news.yahoo.com" and "http://news.yahoo.com/" are different URLs.
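The hostname rule and the trailing-slash rule can be sketched in C as follows. This is a minimal illustration, not part of the original problem: the helper names hostname_of and same_hostname are hypothetical, and the code assumes every URL starts with "http://" and has no port, as the constraints state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Length of the "http://" prefix that the constraints guarantee.
#define SCHEME_LEN 7

// Returns a pointer to the first hostname character and stores the
// hostname length in *len. Assumes url starts with "http://".
static const char *hostname_of(const char *url, size_t *len) {
    assert(strncmp(url, "http://", SCHEME_LEN) == 0);
    const char *host = url + SCHEME_LEN;
    *len = strcspn(host, "/");   // stops at the first '/' or at the end
    return host;
}

// Returns 1 when both URLs have exactly the same hostname, else 0.
static int same_hostname(const char *a, const char *b) {
    size_t la, lb;
    const char *ha = hostname_of(a, &la);
    const char *hb = hostname_of(b, &lb);
    return la == lb && memcmp(ha, hb, la) == 0;
}
```

With these helpers, "http://news.yahoo.com" and "http://news.yahoo.com/" share a hostname, yet a plain strcmp on the full strings still reports them as different URLs, which is the behavior the note asks for.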

Input & Output

Example 1 — Basic Web Crawling

$ Input: startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com"], edges = [[0,2],[2,1],[3,2],[3,1]]

› Output: ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/"]

💡 Note: Starting from "http://news.yahoo.com", the crawler reaches the pages that share the hostname "news.yahoo.com". The Google URL is excluded because it has a different hostname.

Example 2 — Single Page

$ Input: startUrl = "http://news.yahoo.com", urls = ["http://news.yahoo.com"], edges = []

› Output: ["http://news.yahoo.com"]

💡 Note: The start URL has no outgoing links, so the result contains only the start page.

Example 3 — Different Hostname Filtering

$ Input: startUrl = "http://example.com/page", urls = ["http://example.com/page", "http://example.com/about", "http://other.com/page"], edges = [[0,1],[0,2]]

› Output: ["http://example.com/page", "http://example.com/about"]

💡 Note: Both links are found from the start page, but only http://example.com/about is added to the result because other.com is a different hostname.

Constraints

1 ≤ urls.length ≤ 1000
1 ≤ urls[i].length ≤ 300
startUrl is one of the urls
All URLs follow the format http://hostname/path without port

Visualization

Understanding the Visualization

Input

startUrl and graph of connected web pages

Process

Extract the hostname, traverse the graph, filter by hostname

Output

List of all reachable URLs under the same hostname

Key Takeaway

🎯 Key Insight: Extract hostname once from startUrl, then use DFS/BFS with visited set to traverse the web graph while filtering URLs by matching hostname
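When URLs arrive as strings from HtmlParser.getUrls rather than as array indices, the visited set has to be keyed by the full URL string. The following is a minimal sketch of such a set in C. The names UrlSet, urlset_add, hash_str, and CAP are hypothetical, and the fixed capacity assumes fewer than CAP distinct URLs, which the 1000-URL constraint satisfies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Fixed-capacity open-addressing set of borrowed C strings.
// CAP is a power of two comfortably above the 1000-URL limit.
#define CAP 2048

typedef struct {
    const char *slot[CAP];   // NULL marks an empty slot
} UrlSet;

// 32-bit FNV-1a hash of a NUL-terminated string.
static uint32_t hash_str(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

// Inserts url if absent. Returns 1 if it was newly added, 0 if it was
// already present. The set stores the pointer, not a copy, and must
// never be filled completely or the probe loop would not terminate.
static int urlset_add(UrlSet *set, const char *url) {
    assert(url != NULL);
    uint32_t i = hash_str(url) & (CAP - 1);
    while (set->slot[i] != NULL) {
        if (strcmp(set->slot[i], url) == 0) return 0;
        i = (i + 1) & (CAP - 1);   // linear probing
    }
    set->slot[i] = url;
    return 1;
}
```

A traversal would call urlset_add on each candidate URL and skip it when the call returns 0. Because the comparison is a byte-for-byte strcmp, a URL with a trailing slash and the same URL without it count as two distinct entries.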

Asked in

Google (45), Facebook (35), Amazon (28), Microsoft (22)

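In short: extract the hostname once, then run a graph traversal (DFS or BFS) with a visited set over every reachable URL under the same hostname. The recommended approach is the optimized DFS with an adjacency list, which takes O(N + E) time, where N is the number of URLs and E is the number of links, and O(N) space for the visited set and recursion stack.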

Common Approaches

Approach	Time	Space	Notes
Naive DFS without Optimization	O(N × M)	O(N)	Simple depth-first search without efficient hostname extraction
Breadth-First Search	O(N + E)	O(N)	Use BFS with queue to explore URLs level by level
Optimized Depth-First Search	O(N + E)	O(N + H)	Efficient DFS with pre-computed hostname and optimized parsing
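As an illustration of the O(N + E) rows, here is a sketch of the breadth-first variant in C that also parses the start hostname only once. It is one possible implementation under stated assumptions, not the reference solution: the names crawl_bfs and on_host are hypothetical, pages are identified by their index in urls, every edge is a valid [from, to] index pair, and the edge list stands in for what HtmlParser.getUrls would return.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Returns 1 when url has the hostname start[7 .. 7 + host_len), else 0.
static int on_host(const char *url, const char *start, size_t host_len) {
    return strncmp(url + 7, start + 7, host_len) == 0 &&
           (url[7 + host_len] == '/' || url[7 + host_len] == '\0');
}

// BFS over page indices. Returns a malloc'd array of pointers into urls in
// visiting order (caller frees the array); *count receives its length.
static char **crawl_bfs(const char *start, char **urls, int n,
                        int **edges, int m, int *count) {
    size_t host_len = strcspn(start + 7, "/");   // parsed once, reused below

    // Adjacency lists in one array: neighbours of u are adj[off[u] .. off[u+1]-1].
    int *off = calloc((size_t)n + 1, sizeof *off);
    int *fill = malloc(sizeof *fill * (size_t)n);
    int *adj = malloc(sizeof *adj * ((size_t)m + 1));
    int *queue = malloc(sizeof *queue * (size_t)n);
    char *seen = calloc((size_t)n, 1);
    char **out = malloc(sizeof *out * (size_t)n);
    assert(off && fill && adj && queue && seen && out);

    for (int e = 0; e < m; e++) off[edges[e][0] + 1]++;   // out-degrees
    for (int u = 0; u < n; u++) off[u + 1] += off[u];     // prefix sums
    memcpy(fill, off, sizeof *fill * (size_t)n);
    for (int e = 0; e < m; e++) adj[fill[edges[e][0]]++] = edges[e][1];

    int qh = 0, qt = 0;
    *count = 0;
    for (int i = 0; i < n; i++) {
        if (strcmp(urls[i], start) == 0) {   // locate the start page
            seen[i] = 1;
            queue[qt++] = i;
            break;
        }
    }
    while (qh < qt) {
        int u = queue[qh++];                 // dequeue the oldest page
        out[(*count)++] = urls[u];
        for (int k = off[u]; k < off[u + 1]; k++) {
            int v = adj[k];
            if (seen[v]) continue;
            seen[v] = 1;                     // each page is tested once
            if (on_host(urls[v], start, host_len)) queue[qt++] = v;
        }
    }
    free(off); free(fill); free(adj); free(queue); free(seen);
    return out;
}
```

Building the adjacency array costs O(N + E) and each edge is then examined at most once, which is where the O(N + E) bound comes from. The extra check on the character after the hostname keeps a URL such as "http://news.yahoo.community" from matching "news.yahoo.com" by prefix alone.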

Naive DFS without Optimization — Algorithm Steps

Extract hostname from startUrl by string parsing
Use DFS with visited set
For each URL, parse hostname and compare with target
Add valid URLs to result
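The four steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions rather than the site's reference code: the names crawl_naive, dfs, and same_host are hypothetical, pages are identified by their index in urls, and scanning the edge list stands in for a call to HtmlParser.getUrls, which adds an O(E) scan per visited page on top of the hostname parsing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Naive check: re-parse both hostnames (the text between "http://"
// and the next '/') every time it is called.
static int same_host(const char *a, const char *b) {
    size_t la = strcspn(a + 7, "/");
    size_t lb = strcspn(b + 7, "/");
    return la == lb && strncmp(a + 7, b + 7, la) == 0;
}

// Visit urls[u], then recurse into each unvisited same-host neighbour.
// Scanning the whole edge list stands in for HtmlParser.getUrls(urls[u]).
static void dfs(int u, const char *start, char **urls, int **edges,
                int edges_size, int *visited, char **out, int *count) {
    visited[u] = 1;
    out[(*count)++] = urls[u];
    for (int e = 0; e < edges_size; e++) {
        int v = edges[e][1];
        if (edges[e][0] == u && !visited[v] && same_host(urls[v], start))
            dfs(v, start, urls, edges, edges_size, visited, out, count);
    }
}

// Returns a malloc'd array of pointers into urls (caller frees the
// array only); *count receives the number of crawled pages.
static char **crawl_naive(const char *start, char **urls, int urls_size,
                          int **edges, int edges_size, int *count) {
    char **out = malloc(sizeof *out * (size_t)urls_size);
    int *visited = calloc((size_t)urls_size, sizeof *visited);
    assert(out != NULL && visited != NULL);
    *count = 0;
    for (int i = 0; i < urls_size; i++) {
        if (strcmp(urls[i], start) == 0) {   // locate the start page
            dfs(i, start, urls, edges, edges_size, visited, out, count);
            break;
        }
    }
    free(visited);
    return out;
}
```

The visited array is what prevents infinite recursion when two pages link to each other, and same_host is deliberately called on every candidate link to mirror the "parse hostname and compare" step.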

Visualization

Step-by-Step Walkthrough

Start

Parse hostname from startUrl

Visit

For each URL, parse hostname and compare

Result

Collect all valid URLs

Code

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

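// Placeholder: returns only startUrl. It does not yet traverse edges,
// track visited pages, or compare hostnames.
char** solution(char* startUrl, char** urls, int urlsSize, int** edges, int edgesSize, int* returnSize) {
    (void)urls; (void)urlsSize; (void)edges; (void)edgesSize; // unused in this stub
    *returnSize = 1;
    char** result = malloc(sizeof(char*));
    result[0] = malloc(strlen(startUrl) + 1); // caller frees result[0] and result
    strcpy(result[0], startUrl);
    return result;
}

int main(void) {
    char startUrl[1000];
    if (fgets(startUrl, sizeof(startUrl), stdin) == NULL) return 1; // no input
    startUrl[strcspn(startUrl, "\n")] = 0; // strip the trailing newline
    printf("[]\n"); // Placeholder: urls and edges are not parsed, so nothing is crawled
    return 0;
}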

Time & Space Complexity

Time Complexity

O(N × M)

Up to N URLs are visited, and each hostname extraction costs O(M), where M is the average URL length

Space Complexity

O(N)

The visited set and the recursion stack each hold up to N URLs
