Multithreaded crawler in Python

Last Updated : 09 Jan, 2023

In this article, we will build a simple multithreaded web crawler in Python.

Modules Needed

bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. To install it, run the following command in a terminal.

pip install bs4

requests: This library lets you send HTTP/1.1 requests with very little code. To install it, run the following command in a terminal.

pip install requests

Stepwise implementation

Step 1: First, import all the libraries the crawler needs. If you're using Python 3, everything except BeautifulSoup and requests ships with the standard library, so install those two with the commands above if you haven't already. Note that scrape_info() in Step 6 uses the html5lib parser, which is a separate package (pip install html5lib).

Python3




import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests


Step 2: In the main program, create an object of the MultiThreadedCrawler class, passing the seed URL to its constructor, then call the run_web_crawler() method. Once crawling finishes, info() prints the seed URL and the set of scraped pages.

Python3




if __name__ == '__main__':
    cc = MultiThreadedCrawler("https://www.geeksforgeeks.org/")
    cc.run_web_crawler()
    cc.info()


Step 3: Create a class named MultiThreadedCrawler and initialize all instance variables in the constructor. Store the seed URL in seed_url, then build the root URL (root_url) from the scheme and network location of the seed URL, for example https://www.geeksforgeeks.org.

To process the crawl frontier concurrently, create a ThreadPoolExecutor with max_workers set to 5, so that at most 5 threads run at a time. To avoid visiting the same page twice, keep the history of visited pages in a set.

Create a queue to hold the URLs of the crawl frontier and put the seed URL in it as the first item.

Python3




class MultiThreadedCrawler:
 
    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.root_url = '{}://{}'.format(urlparse(self.seed_url).scheme,
                                         urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set()
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)
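
To see what root_url ends up holding, the following minimal sketch applies the same urlparse call to a sample URL. The helper name build_root_url and the sample URLs are illustrative assumptions, not part of the crawler.

```python
from urllib.parse import urlparse


def build_root_url(seed_url):
    """Return scheme://netloc for seed_url (hypothetical helper)."""
    parts = urlparse(seed_url)
    return '{}://{}'.format(parts.scheme, parts.netloc)


if __name__ == '__main__':
    # The path and query string are dropped; only scheme and host remain.
    print(build_root_url("https://www.geeksforgeeks.org/python/?ref=home"))
```

Parsing the URL once and reusing the result, as this sketch does, avoids the two separate urlparse calls in the constructor without changing the outcome.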


Step 4: Create a method named run_web_crawler(). It uses an infinite while loop to keep taking links from the frontier and submitting them for scraping. On each iteration it prints the name of the currently executing process (this is always MainProcess here, because the crawler uses threads, not processes).

Get the next URL from the crawl frontier with a timeout of 60 seconds; if the queue stays empty for that long, queue.Empty is raised and the loop ends. Check whether the URL has already been visited. If not, add it to the scraped_pages set to record it in the history, submit scrape_page with the target URL to the thread pool, and register post_scrape_callback to run when the job finishes.

Python3




def run_web_crawler(self):
    while True:
        try:
            print("\n Name of the current executing process: ",
                  multiprocessing.current_process().name, '\n')
            target_url = self.crawl_queue.get(timeout=60)
             
            if target_url not in self.scraped_pages:
               
                print("Scraping URL: {}".format(target_url))
                self.scraped_pages.add(target_url)
                job = self.pool.submit(self.scrape_page, target_url)
                job.add_done_callback(self.post_scrape_callback)
 
        except Empty:
            return
        except Exception as e:
            print(e)
            continue
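
The stop condition and the duplicate check can be tried in isolation, without any network access. The sketch below is a simplified stand-in for the loop above: drain_frontier is a hypothetical name, the timeout is shortened from 60 seconds so the example finishes quickly, and URLs are only recorded instead of being submitted to a thread pool.

```python
from queue import Queue, Empty


def drain_frontier(frontier, timeout=0.1):
    """Pop URLs until the queue stays empty for `timeout` seconds."""
    visited = set()
    order = []
    while True:
        try:
            url = frontier.get(timeout=timeout)
        except Empty:
            # Nothing arrived within the timeout: treat the crawl as done.
            return order
        if url not in visited:
            # First time this URL is seen: record it in the history.
            visited.add(url)
            order.append(url)


if __name__ == '__main__':
    q = Queue()
    for u in ["https://example.com/", "https://example.com/a",
              "https://example.com/"]:
        q.put(u)
    # The repeated URL is skipped by the visited-set check.
    print(drain_frontier(q))
```

In the real crawler the timeout has to be long enough for worker threads to finish a request and put new links in the queue, which is presumably why it is set as high as 60 seconds.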


Step 5: Send the request with requests.get(), setting the connect timeout to 3 seconds and the read timeout to 30 seconds. If the request succeeds, return the response; if it raises a requests.RequestException, return None.

Python3




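def scrape_page(self, url):
    try:
        # timeout=(connect, read): 3 s to connect, 30 s to wait for data
        res = requests.get(url, timeout=(3, 30))
        return res
    except requests.RequestException:
        # Network error or timeout: return None so the callback skips it
        return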


Step 6: Create a method named scrape_info() and pass the page's HTML to BeautifulSoup, which repairs malformed HTML and exposes it as an easily traversable tree. This method uses the html5lib parser.

Collect every paragraph (p) tag in the document, skip the paragraphs whose text contains 'https:', and concatenate the text of the rest.

Python3




def scrape_info(self, html):
    soup = BeautifulSoup(html, "html5lib")
    web_page_paragraph_contents = soup('p')
    text = ''
     
    for para in web_page_paragraph_contents:
        if not ('https:' in str(para.text)):
            text = text + str(para.text).strip()
    print('\n <-----Text Present in The WebPage is--->\n', text, '\n')
    return


Step 7: Create a method named parse_links() that uses BeautifulSoup to extract all the anchor tags in the HTML document. soup.find_all('a', href=True) returns a list of all anchor tags on the page that have an href attribute; store it in a list named Anchor_Tags. For each anchor tag in Anchor_Tags, retrieve the value of its href attribute using link['href']. For each retrieved URL, check whether it is a relative URL or an absolute URL.

  • Relative URL: a URL without the protocol name and root URL, such as /python/.
  • Absolute URL: a URL with the protocol name, root URL, and document name.

The method only follows links that start with / (relative) or with the root URL (absolute links to the same site); all other links are ignored. urljoin() converts a relative URL into an absolute one using the root URL and leaves an absolute URL unchanged. If the resulting URL has not been visited already, put it in the crawl queue.

Python3




def parse_links(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    Anchor_Tags = soup.find_all('a', href=True)
     
    for link in Anchor_Tags:
        url = link['href']
         
        if url.startswith('/') or url.startswith(self.root_url):
            url = urljoin(self.root_url, url)
             
            if url not in self.scraped_pages:
                self.crawl_queue.put(url)
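
The link filter can be checked on its own with plain strings. This is a minimal sketch: normalize_link and ROOT_URL are hypothetical names introduced for illustration, and the sample links are made up.

```python
from urllib.parse import urljoin

ROOT_URL = "https://www.geeksforgeeks.org"  # assumed root URL for illustration


def normalize_link(href, root_url=ROOT_URL):
    """Return an absolute URL to crawl, or None for links the crawler skips."""
    if href.startswith('/') or href.startswith(root_url):
        # Relative paths are resolved against the root URL;
        # absolute same-site URLs pass through unchanged.
        return urljoin(root_url, href)
    return None


if __name__ == '__main__':
    for href in ["/python/", "https://www.geeksforgeeks.org/about/",
                 "https://example.com/x", "#top", "//cdn.example.com/x"]:
        print(repr(href), '->', normalize_link(href))
```

One caveat this sketch makes visible: a protocol-relative link such as //cdn.example.com/x also starts with /, so it passes the check and urljoin resolves it to a different host. Comparing urlparse(url).netloc against the root URL's netloc is one way to keep the crawl on a single site.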


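Step 8: Create a method named post_scrape_callback(), which receives the completed job (a Future). If the job returned a response with status code 200, call parse_links() with the response text to extract the links, and scrape_info() with the same text to extract the content.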

Python3




def post_scrape_callback(self, res):
    result = res.result()
     
    if result and result.status_code == 200:
        self.parse_links(result.text)
        self.scrape_info(result.text)
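
The callback mechanism itself can be tried without any network access. In this minimal sketch, fake_scrape is a hypothetical stand-in for scrape_page() that returns a string instead of a response, and run_jobs is an illustrative helper, not part of the crawler.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_scrape(url):
    """Stand-in for scrape_page(): returns a string instead of a response."""
    return "contents of " + url


def run_jobs(urls):
    results = []

    def on_done(future):
        # The callback receives the finished Future, not the return value,
        # so the value has to be read with future.result().
        results.append(future.result())

    with ThreadPoolExecutor(max_workers=5) as pool:
        for url in urls:
            pool.submit(fake_scrape, url).add_done_callback(on_done)
    # Leaving the with-block waits for all submitted jobs to finish.
    return sorted(results)


if __name__ == '__main__':
    print(run_jobs(["a", "b", "c"]))
```

Because jobs may finish in any order, the sketch sorts the results before returning them. The crawler does not need this, since each callback only adds links to the queue and prints text.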


Below is the complete implementation:

Python3




import multiprocessing
from bs4 import BeautifulSoup
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse
import requests
 
 
class MultiThreadedCrawler:
 
    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.root_url = '{}://{}'.format(urlparse(self.seed_url).scheme,
                                         urlparse(self.seed_url).netloc)
        self.pool = ThreadPoolExecutor(max_workers=5)
        self.scraped_pages = set()
        self.crawl_queue = Queue()
        self.crawl_queue.put(self.seed_url)
 
    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        Anchor_Tags = soup.find_all('a', href=True)
        for link in Anchor_Tags:
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.crawl_queue.put(url)
 
    def scrape_info(self, html):
        soup = BeautifulSoup(html, "html5lib")
        web_page_paragraph_contents = soup('p')
        text = ''
        for para in web_page_paragraph_contents:
            if not ('https:' in str(para.text)):
                text = text + str(para.text).strip()
        print('\n <---Text Present in The WebPage is --->\n', text, '\n')
        return
 
    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)
 
    def scrape_page(self, url):
        try:
            res = requests.get(url, timeout=(3, 30))
            return res
        except requests.RequestException:
            return
 
    def run_web_crawler(self):
        while True:
            try:
                print("\n Name of the current executing process: ",
                      multiprocessing.current_process().name, '\n')
                target_url = self.crawl_queue.get(timeout=60)
                if target_url not in self.scraped_pages:
                    print("Scraping URL: {}".format(target_url))
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
 
            except Empty:
                return
            except Exception as e:
                print(e)
                continue
 
    def info(self):
        print('\n Seed URL is: ', self.seed_url, '\n')
        print('Scraped pages are: ', self.scraped_pages, '\n')
 
 
if __name__ == '__main__':
    cc = MultiThreadedCrawler("https://www.geeksforgeeks.org/")
    cc.run_web_crawler()
    cc.info()

