Web Crawler Strategy

Introduction

A web crawler repeatedly takes URLs from a queue of URLs to be crawled and downloads the corresponding pages through a downloader. The downloaded pages are stored in a webpage library to await indexing, and each downloaded URL is also recorded in a crawled-URL set so that it is not fetched again. Links extracted from downloaded pages are added to the queue, and the cycle continues.
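The loop above can be sketched as follows. This is a minimal illustration, not a production crawler: `fetch` and `extract_links` are hypothetical callables standing in for the downloader and the link extractor.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Minimal crawl loop: a frontier queue of URLs to be crawled,
    a set of already-crawled URLs to avoid repeats, and a page
    store ("webpage library") awaiting indexing."""
    frontier = deque(seed_urls)   # URLs waiting to be crawled
    crawled = set()               # URLs already crawled
    pages = {}                    # downloaded pages, keyed by URL
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in crawled:
            continue              # skip URLs we have already fetched
        page = fetch(url)
        pages[url] = page         # store for later indexing
        crawled.add(url)          # record to prevent repeated crawling
        for link in extract_links(page):
            if link not in crawled:
                frontier.append(link)
    return pages
```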

Crawling Strategy

There are several choices for strategies to crawl different web pages:

  • Breadth-First Traversal: append the links extracted from each downloaded page to the end of the queue of URLs to be crawled, so pages are visited level by level, closest to the seed first.
  • Depth-First Traversal: follow one link from the current page and keep going deeper along that branch; only when it is exhausted does the crawler backtrack and pick up the remaining links one by one.
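The two traversals differ only in how the frontier is consumed: a FIFO queue gives breadth-first order, while popping from the end (a stack) gives a depth-first order. A toy sketch over an in-memory link graph (an assumed dict mapping each URL to its outlinks):

```python
from collections import deque

def traverse(seed, links, breadth_first=True):
    """Return the visit order over a toy link graph.
    Breadth-first: consume the frontier as a FIFO queue.
    Depth-first: pop from the end, descending one branch first."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order
```

Note that the stack-based depth-first variant descends into the last discovered link first; a real crawler would add politeness delays and per-host limits on top of either order.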

More refined strategies go beyond simple traversal of the site graph: their core idea is to crawl important pages first.

PageRank Algorithm

The PageRank algorithm measures the importance of a webpage, mainly by the quantity and quality of its inbound links. A complete PageRank score cannot be computed during crawling, because the full link graph is not yet known. Instead, a partial PageRank is computed over the downloaded pages together with the URLs in the to-be-crawled list, and pages with higher estimated importance are crawled first.
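The idea behind the score can be illustrated with a plain power-iteration PageRank over a small link graph. This is a generic textbook sketch, not the partial in-crawl computation described above; the graph format (dict of page to outlinks) is an assumption for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank: each page splits its score evenly among
    the pages it links to; the damping factor models a random jump.
    `links` maps each page to its list of outbound pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling page: spread its score over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

With more inbound links, a page accumulates a higher score, which is exactly the property a crawler exploits when ordering its frontier.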

OPIC Strategy

OPIC (Online Page Importance Computation) can be seen as an online variant of PageRank. Its main feature is computing page importance in real time: all URLs are assigned an equal initial score, and when a page is downloaded, its score is distributed among the links it contains and then reset to zero. URLs in the crawling list are then prioritized by their accumulated score.
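A minimal sketch of that cash-distribution idea, under simplifying assumptions: the link graph is known up front as a dict, and cash is only passed to not-yet-crawled outlinks (real OPIC also lets already-fetched pages accumulate cash for re-crawling).

```python
def opic_crawl(seed_urls, links, total_cash=1.0, max_pages=10):
    """OPIC-style crawl order: every seed starts with an equal share
    of 'cash'; downloading a page distributes its cash to its
    outlinks and resets it to zero; the uncrawled URL with the most
    accumulated cash is fetched next."""
    cash = {u: total_cash / len(seed_urls) for u in seed_urls}
    crawled = []
    while cash and len(crawled) < max_pages:
        url = max(cash, key=cash.get)   # highest-cash URL first
        amount = cash.pop(url)
        crawled.append(url)
        outs = [u for u in links.get(url, []) if u not in crawled]
        for out in outs:
            cash[out] = cash.get(out, 0.0) + amount / len(outs)
    return crawled
```

Because scores are updated as pages are downloaded, no separate offline iteration over the whole graph is needed, which is the practical advantage over PageRank during a crawl.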

Large Site Priority Strategy

URLs in the crawling list are grouped by site, and sites with higher weight (for example, judged at the domain level, or those with many pages waiting to be downloaded) are crawled first. The exact criterion for identifying a large site may vary depending on the situation.
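One simple realization of this strategy is to pick the next URL from whichever host has the largest backlog in the frontier. Using backlog size as the site "weight" is an assumption for this sketch; real crawlers may combine it with other domain-level signals.

```python
from collections import Counter
from urllib.parse import urlsplit

def next_url(frontier):
    """Large-site-priority pick: among queued URLs, prefer one from
    the host with the most pages waiting to be downloaded."""
    backlog = Counter(urlsplit(u).netloc for u in frontier)
    return max(frontier, key=lambda u: backlog[urlsplit(u).netloc])
```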