I recently wrote a web crawler in Python to pull data out of a database-style website, and it turned out to be a very interesting experience. I needed to crawl 6,850+ pages, extract data from each one, and keep only the entries that met certain conditions. Python made the whole task surprisingly easy.
First approach
First I wanted to make sure it worked properly on a single page (with 50 links). I needed to do the following:
- Retrieve the URLs listed on the first page
- Visit each URL and extract data from the page
- Check whether the required conditions are met
- Store the data in a CSV file if they are
I used requests to fetch the pages and BeautifulSoup to parse the HTML content. The following code retrieves the URLs from the first page.
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every link inside the listing table
anchor_tags = soup.find('table').find('tbody').find_all('a')
links = []
for tag in anchor_tags:
    links.append(tag['href'])
```
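One caveat: depending on the site, those href values may be relative paths rather than full URLs. If that's the case, a small normalization step (just a sketch, assuming the url and links variables from the block above) keeps the later requests.get calls happy:

```python
from urllib.parse import urljoin

# Convert any relative hrefs into absolute URLs against the listing page
links = [urljoin(url, href) for href in links]
```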
That took care of the links on the first page. Next, I looped over each link, fetched the page, and extracted the data I was after:
```python
# I wanted to see if some keywords are in the page
keywords = ['keyword1', 'keyword2', 'keyword3']

for link in links:
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    content = soup.find('table').find('tbody').find_all('td')
    for td in content:
        matched = [keyword for keyword in keywords if keyword in td.text]
        if any(matched):
            # Store the data in a csv file
            with open('data.csv', 'a') as file:
                file.write(link + ',' + td.text + '\n')
```
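One thing worth watching out for: if the cell text can contain commas or newlines, joining strings by hand produces a malformed CSV. A safer variant of the write step (just a sketch using the standard csv module; write_row is a hypothetical helper, not part of my original code) would be:

```python
import csv

# csv.writer handles quoting, so commas or newlines inside the cell text
# don't break the output file
def write_row(link, text):
    with open('data.csv', 'a', newline='') as file:
        csv.writer(file).writerow([link, text])
```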
This worked, but it was very time-consuming: I estimated about 3 hours to crawl all 6,850+ pages sequentially. So I decided to set things up for multithreading to speed up the process.
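That estimate is just back-of-the-envelope math, assuming roughly 1.5 seconds per sequential request (network round trip plus parsing):

```python
# Rough sequential estimate; ~1.5 s per page is an assumption, not a measurement
6850 * 1.5 / 3600   # ≈ 2.9 hours
```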
Refactoring and preparing for multithreading
First I cleaned up the code and split it into functions:
```python
import requests
from bs4 import BeautifulSoup
from queue import Queue

# This function will visit a listing page and extract the links in its table
def find_links(soup):
    anchor_tags = soup.find('table').find('tbody').find_all('a')
    links = []
    for tag in anchor_tags:
        links.append(tag['href'])
    return links

# This function will extract the next page link
def get_next_page_link(soup):
    paging = soup.find('ul', class_='paging')
    next_page = paging.find_all('li')[-2]
    # BeautifulSoup returns the class attribute as a list of class names
    if 'disabledpg' in (next_page.get('class') or []):
        return ''
    return next_page.find('a').get('href')

# This function will visit each link and extract data from the page
def process_detail_page(link, keywords, queue):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    content = soup.find('table').find('tbody').find_all('td')
    for td in content:
        matched = [keyword for keyword in keywords if keyword in td.text]
        if any(matched):
            queue.put(link + ',' + td.text)

# This function will crawl through the pages
def crawl(url, keywords, queue):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = find_links(soup)
    for link in links:
        process_detail_page(link, keywords, queue)
    next_page = get_next_page_link(soup)
    if next_page:
        crawl(next_page, keywords, queue)

def main():
    url = "https://example.com"
    keywords = ['keyword1', 'keyword2', 'keyword3']
    queue = Queue()
    crawl(url, keywords, queue)
    with open('data.csv', 'a') as file:
        while not queue.empty():
            file.write(queue.get() + '\n')

if __name__ == "__main__":
    main()
```
This refactor prepared the code for multithreading with Python's threading module.
Multithreading
I used the following code to set up the threads:
```python
import threading
from queue import Queue

def main():
    url = "https://example.com"
    keywords = ['keyword1', 'keyword2', 'keyword3']
    queue = Queue()
    crawl(url, keywords, queue)
    with open('data.csv', 'a') as file:
        while not queue.empty():
            file.write(queue.get() + '\n')

# This function will crawl through the pages
def crawl(url, keywords, queue):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = find_links(soup)
    # Create threads
    threads = []
    for link in links:
        # Spawn a new thread for each link. Since each listing page has at most
        # 50 links, I didn't worry about creating too many threads.
        # The queue is shared among threads, and queue.Queue is thread-safe.
        thread = threading.Thread(target=process_detail_page, args=(link, keywords, queue))
        threads.append(thread)
        thread.start()
    # Wait for all threads to finish before moving to the next page
    for thread in threads:
        thread.join()
    next_page = get_next_page_link(soup)
    if next_page:
        crawl(next_page, keywords, queue)

if __name__ == "__main__":
    main()
```
As simple as that, the whole site was crawled and the data I needed was extracted. Since queue.Queue is thread-safe, I didn't have to worry about synchronizing access to the shared queue. It was a very interesting experience.
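If the pages had many more links, spawning one thread per link would stop being a good idea. A variation worth considering in that case (just a sketch reusing the functions and imports from the refactor above, not what I actually ran) is concurrent.futures.ThreadPoolExecutor, which caps the number of worker threads:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(url, keywords, queue, max_workers=10):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = find_links(soup)
    # A bounded pool of worker threads processes the detail pages
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for link in links:
            executor.submit(process_detail_page, link, keywords, queue)
    # Leaving the with-block waits for all submitted tasks to finish
    next_page = get_next_page_link(soup)
    if next_page:
        crawl(next_page, keywords, queue, max_workers)
```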
This is a very simple use case for a multithreaded web crawler, and it's simple because there are no dependencies between the pages. If pages depend on each other, things get a bit more complex. But even so, it's very easy to write a web crawler in Python.