LeetCode 1236. Web Crawler

Post by ailswan Aug. 22, 2024

1236. Web Crawler

Problem Statement

link: LeetCode.cn LeetCode Given a url startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.

Return all urls obtained by your web crawler in any order.

Your crawler should:

Start from the page: startUrl Call HtmlParser.getUrls(url) to get all urls from a webpage of given url. Do not crawl the same link twice. Explore only the links that are under the same hostname as startUrl.

As shown in the example url above, the hostname is example.org. For simplicity sake, you may assume all urls use http protocol without any port specified. For example, the urls http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while urls http://example.org/test and http://example.com/abc are not under the same hostname.

The HtmlParser interface is defined as such:

interface HtmlParser { // Return a list of all urls from a webpage of given url. public List getUrls(String url); } Below are two examples explaining the functionality of the problem, for custom testing purposes you'll have three variables urls, edges and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.

Note: Consider the same URL with the trailing slash “/” as a different URL. For example, “http://news.yahoo.com”, and “http://news.yahoo.com/” are different urls.

Example:

Input: urls = ["http://news.yahoo.com", "http://news.yahoo.com/news","http://news.yahoo.com/news/topics/", "http://news.google.com","http://news.yahoo.com/us"]edges = [[2,0],[2,1],[3,2],[3,1],[0,4]] startUrl = "http://news.yahoo.com/news/topics/" Output: [ "http://news.yahoo.com", "http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.yahoo.com/us"]

Input: urls = ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.google.com"]edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]startUrl = "http://news.google.com" Output: ["http://news.google.com"]

Constraints: 1 <= urls.length <= 1000 1 <= urls[i].length <= 300 startUrl is one of the urls. Hostname label must be from 1 to 63 characters long, including the dots, may contain only the ASCII letters from ‘a’ to ‘z’, digits from ‘0’ to ‘9’ and the hyphen-minus character (‘-‘). The hostname may not start or end with the hyphen-minus character (‘-‘). See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames You may assume there’re no duplicates in url library.

Solution Approach

The solution approach for LeetCode problem #1236 ‘Web Crawler’ involves using BFS to traverse URLs, ensuring that only links under the same hostname as the starting URL are explored, while avoiding revisiting the same link twice.

Algorithm

  1. Initialization: Start by initializing a set and a list to keep track of crawled URLs and the URLs to be processed, respectively. Extract the hostname from the starting URL.
  2. BFS Traversal: Use BFS to explore each URL by retrieving its links via HtmlParser.getUrls(url), and only add those links to the queue if they belong to the same hostname and haven’t been visited yet.
  3. Output: Return the list of URLs collected during the traversal, ensuring all URLs under the same hostname as the start URL are included without duplication.

Implement

    class Solution:
  def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
      rec_set = {startUrl}
      rec_list = [startUrl]
      host_name = 'http://' + startUrl.split('/')[2]

      for url in rec_list:
          a = htmlParser.getUrls(url)
          for i in a:
              if i.startswith(host_name) and i not in rec_list:
                  rec_set.add(i)
                  rec_list.append(i)
      return rec_list