How Do You Create a Web Crawler that can Scale?

njbox
Hello Friends

I have a few architecture scaling questions.

For example, let's say we are building a web-based SEO crawler.

1. You have a crawler that takes the seed/root URL and starts discovering all linked URLs.
2. Every new URL it finds is added to the list.
3. The crawler's job is simply to discover all the pages.
4. At the end of the crawl, it has discovered, say, 500 total internal URLs.
5. Let's say a single-threaded crawler took 2 hours to discover all 500 pages and do the necessary processing. (There is a limit to how fast a crawler can discover links, because it is waiting on the website to deliver each response, and that often takes a few seconds.)

The above steps are to crawl one root domain/seed URL.

Let us see how we can scale the operation.

Requirement: a web application with a crawler designed to crawl, say, 1000 websites, each having around 500 pages, with each website scanned end-to-end every week.

1. For the above requirement, a single-threaded crawler will simply not work: the math to scan 1000 websites works out to 2000 hours, and we simply cannot meet the SLA of "processing each website every week".

2. Let's say there are 160 hours available in a week (24 x 7 = 168, minus a few for maintenance, so roughly 160 hours).

3. This leads us to 2000 / 160 = 12.5, so roughly 13 crawlers.

4. To keep things simple, let's say each crawler runs on its own VM. We would need 13 VMs just for the crawlers.

5. We need a single big master database that maintains the list of all the websites that need to be crawled.
Can MySQL or PostgreSQL serve this purpose?

6. How do we make each crawler smart enough that it works only on its own subset of websites or URLs?
Even if two crawlers are working on the same website, they should each work on a different subset of URLs, to speed things up and not waste compute scanning the same URL twice.

6.1 Do we need to implement a queue mechanism in the database?

7. Scaling up: if we scale up the VM to accommodate more than one crawler, is that a better design?

8. Scaling horizontally: Azure and AWS provide auto-scaling of clusters. Does this kind of autoscaling still require the crawler to be smart?
 
You don't need all that VM stuff.

Use Redis, it's an in-memory (RAM) database that responds faster than MySQL for what you need.

You create a LIST with the URLs that need to be crawled; you add to the list by PUSHING a new URL onto the end of it.

Each crawler simply needs to read the next queued-up item on the list and remove it by POPping it.

LPUSH = pushing a URL to the left of the list.

RPOP = popping something off the right of the list.

So, in a logical sense, your list is moving from left to right.
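
In PHP that really is just two calls. Here's a rough sketch assuming the phpredis extension; the key name crawl_queue and the URL are just placeholders:

Code:
<?php
// Rough sketch of the queue using the phpredis extension.
// 'crawl_queue' is a placeholder key name.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Producer side: push a newly discovered URL onto the left of the list
$redis->lPush('crawl_queue', 'https://example.com/some-page');

// Consumer side: a crawler pops the next URL off the right of the list
$url = $redis->rPop('crawl_queue');
if ($url !== false) {
    // crawl $url here...
}
?>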

Now you can have multiple instances of the crawlers working in parallel to download the pages and parse them.

Redis: Lists

The reason I say to use Redis is that if you try this with MySQL, or any database stored on the hard drive, you'll be reading and writing to disk on every request.

In-memory is orders of magnitude faster, so this LPUSH and RPOP will be easy.

Now you can have the crawlers save to MySQL once they're done. But you want to scale, so note that a single server with 4 cores can most likely run 4-12 PHP scripts simultaneously without a problem.

And if you need to scale horizontally, you just duplicate the servers to as many as you need.

The caveat is that there is a SINGLE Redis server (it can also be hosted on the MySQL server), but there is only one Redis server, the master, and that's where your LIST lives.

You can achieve this with a $5 Linode nano server (shared CPU) to start, then scale it from there (Or Digital Ocean droplet).

The crawlers don't need to be smart, they need to be dumb and just read off a single list.

When your crawlers find new URLs, they check MySQL against the known URLs; if a URL is not already known, the crawler adds it to the list to be crawled by someone (LPUSH).
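
Something like this, roughly, assuming a MySQL table called known_urls with a unique url column (the names and connection details are just placeholders):

Code:
<?php
// Rough sketch: check MySQL for a known URL, queue it in Redis if it's new.
// Table, column, and connection details are placeholders.
$pdo   = new PDO('mysql:host=127.0.0.1;dbname=crawler', 'user', 'pass');
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function queueIfNew(PDO $pdo, Redis $redis, string $url): void
{
    $stmt = $pdo->prepare('SELECT 1 FROM known_urls WHERE url = ?');
    $stmt->execute([$url]);

    if ($stmt->fetchColumn() === false) {
        // Not known yet: remember it and hand it to the crawlers
        $pdo->prepare('INSERT INTO known_urls (url) VALUES (?)')->execute([$url]);
        $redis->lPush('crawl_queue', $url);
    }
}
?>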

Doing it the list way allows you to stop crawlers, put things into maintenance, and restart. Also, if a crawler goes down, it doesn't impact the work of the other crawlers or servers you have scaled horizontally with.
 
Hi Carter,
It certainly feels great to receive such a detailed answer from an honourable member of this group.

After reading your solution, it feels like I was overthinking my architecture. The Redis approach certainly seems much simpler.

I am assuming I still need MySQL to keep all the data and related metrics.

When a website is ready to crawl, it needs to be pushed to the Redis LIST. Once the website and all its URLs are processed, they are cleared via RPOP.

I need to read up on Redis more

Thank you very much
 
I am assuming I still need MySQL to keep all the data and related metrics.

Yes

When a website is ready to crawl, it needs to be pushed to the Redis LIST. Once the website and all its URLs are processed, they are cleared via RPOP.

No. The URLs are pushed to the list, not the domain. Something else, MASTER.PHP, will need to parse the initial domains and, after a week, redo it all over again. But the RPOP happens at the beginning of the crawling script, to get the next URL to process.

If you were to RPOP the whole domain, then that single script would be processing a ton of potential URLs and might crash or hang. That's why you need to break it down even further.

So for new domains, get the sitemap if there is one, by reading the robots.txt file. That will create your initial seed URLs. Send those URLs to the list.
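
A rough sketch of that seeding step, assuming a plain urlset sitemap (not a sitemap index) and skipping most error handling; the domain and key name are placeholders:

Code:
<?php
// Rough sketch: read robots.txt, pull the Sitemap: lines, queue the seed URLs.
$domain = 'https://example.com';
$robots = @file_get_contents($domain . '/robots.txt');

$seedUrls = [];
if ($robots !== false && preg_match_all('/^Sitemap:\s*(\S+)/mi', $robots, $m)) {
    foreach ($m[1] as $sitemapUrl) {
        $xml = @simplexml_load_file($sitemapUrl);
        if ($xml !== false) {
            foreach ($xml->url as $entry) {       // <url><loc>...</loc></url>
                $seedUrls[] = (string) $entry->loc;
            }
        }
    }
}

if (empty($seedUrls)) {
    $seedUrls[] = $domain . '/';   // no sitemap found, fall back to the homepage
}

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
foreach ($seedUrls as $url) {
    $redis->lPush('crawl_queue', $url);
}
?>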

Crawlers simply do one thing: crawl the URL and look for links, or whatever else you are trying to parse. If one finds a link, it checks against MySQL whether that URL is known; if it is, fine. If not, then that crawler sends the new URL to the LIST too.

Then the crawler saves whatever metrics you want for each page into MySQL.

Crawlers should only work on a single URL at a time. This way, if one fails, only 1 URL has failed instead of a whole domain. This also allows other crawlers to crawl the same domain simultaneously, instead of leaving all the work to a single crawler to do the whole domain.
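
So the skeleton of crawler.php is roughly this, one URL per run; the connection details, table, and columns are placeholders:

Code:
<?php
// Rough skeleton of crawler.php: one RPOP, one fetch, one save, then exit.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$pdo = new PDO('mysql:host=127.0.0.1;dbname=crawler', 'user', 'pass');

$url = $redis->rPop('crawl_queue');
if ($url === false) {
    exit;   // queue is empty, nothing to do this run
}

$html = @file_get_contents($url);
if ($html === false) {
    exit;   // bad connection; the stale last_crawled stamp will flag it later
}

// ... extract internal links with the XPath snippet below and LPUSH any
//     unknown ones back onto the queue (see the known-URL check above) ...

// Save whatever metrics you want and stamp the crawl time
$stmt = $pdo->prepare('UPDATE known_urls SET last_crawled = NOW() WHERE url = ?');
$stmt->execute([$url]);
?>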

So, going back to the failed URL: within MySQL you should have a column that tracks the last time the URL was crawled and gets updated on each crawl. If whatever script calls up the next set of URLs to be crawled sees a URL that hasn't been updated in a while, despite multiple attempts, it should raise an alert.

If it was just a bad connection or a parsing failure, the next run might succeed. But this is all the logical part you have to figure out for yourself.

If you can break your crawler and its needs down as much as possible, like one script is crawler.php, another is parser.php, or whatever, you'll be able to spot failures: it could be bad HTML, a bad connection, or whatever.

Also, you need to look into XPath to extract the internal links. Simple PHP from ChatGPT:

Code:
<?php
// Your HTML content
$html = 'Your HTML content here';

// Create a new DOMDocument and load the HTML
$doc = new DOMDocument();
@$doc->loadHTML($html); // Use @ to suppress warnings caused by invalid HTML

// Create a new XPath
$xpath = new DOMXPath($doc);

// XPath query to find all internal links for example.com
$query = "//a[starts-with(@href, '/') or starts-with(@href, 'http://example.com') or starts-with(@href, 'https://example.com')]/@href";

// Execute the XPath query
$internalLinks = $xpath->query($query);

// DOMXPath::query() returns false on an invalid expression, never null
if ($internalLinks !== false) {
    foreach ($internalLinks as $link) {
        // Print the href attribute of each link
        echo $link->nodeValue . PHP_EOL;
    }
}
?>

XPath alone will save you a ton of time and work.

--

For old domains and old URLs that you need to recheck once a week, you just duplicate MASTER.PHP and change the logic to only send to the list URLs that have not been updated in the last 7 days, or whatever.
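
Roughly, Rechecker.php is just a query plus a push (again, table and column names are placeholders):

Code:
<?php
// Rough sketch of Rechecker.php: re-queue anything not crawled in 7 days.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=crawler', 'user', 'pass');
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$stmt = $pdo->query(
    'SELECT url FROM known_urls
     WHERE last_crawled IS NULL OR last_crawled < NOW() - INTERVAL 7 DAY'
);

foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $url) {
    $redis->lPush('crawl_queue', $url);
}
?>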

Once you get the Master.php and Rechecker.php logic going, scaling with Crawler.php and Parser.php is as simple as setting up cronjobs on multiple servers.
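
The cron side can be as dumb as this; the paths and schedule are just an example:

Code:
# Run a crawler pass every minute and the weekly rechecker once a day at 03:00
* * * * * php /path/to/crawler.php
0 3 * * * php /path/to/rechecker.php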

A VM is completely unnecessary since you don't need a GUI or anything.

The one caveat is that you have to know how to get around sites that can easily detect your bot, because a simple thing like checking whether cookies are enabled can give you away. For more advanced stuff you'll need a headless browser like PhantomJS or Selenium or similar.

I'm assuming PHP, but all of this can be translated to your favorite programming language. Mine happens to be C and Perl; I'm old school.

Programming logic is my passion.
 
Crawlers should only work on a single URL at a time. This way, if one fails, only 1 URL has failed instead of a whole domain.
I think this is the key to scaling up. Currently I have an internal link-building plugin in the making, and all of the logic is contained within various PHP classes.

It being a WordPress plugin, there is only one website to deal with. If I were to scale this up and make it a web app (in the future) to handle Shopify or Webflow, then your proposed Redis architecture can easily let me scale up.

Currently my plugin_crawler maintains a list of all URLs in a MySQL table, and this will need to be handled with a queue mechanism in Redis for the webapp_crawler.

[Attachment: 02_crawlspider_link_builder.png]


All of the tabs in the plugin will need a corresponding UI in the web app. I will need at least one VM for the GUI, one for processing/crawlers, etc.

But as you said, the majority of the PHP scripts will be doing the background work via cronjobs. The UI is mainly for reviewing and taking certain actions.

Before, I was thinking a little too complicated, such as segmenting URLs into batches: URLs 1-1000 processed by crawler1, the next 1000 by crawler2, and so on.

But now there is just one single queue for all the crawlers. They feed from the same hose. If I add more crawlers to the cron, it will simply process more URLs and empty the queue faster. Great!
 