I recently wrote a web crawler in Python to pull data out of a database-style website, and it turned out to be a very interesting experience. I needed to crawl 6,850+ pages, extract data from each one, and keep only the entries that met certain conditions. Python made the whole task surprisingly easy.
First approach
First I wanted to make sure it worked properly on a single page (with 50 links). I needed to do the following:
- Retrieve the URLs listed on the first page
- Visit each URL and extract data from the page
- Check whether the required conditions are met
- Store the data in a CSV file if they are
I used requests to fetch the pages and BeautifulSoup to parse the HTML content. The following code retrieves the URLs from the first page.
```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every link inside the listing table
anchor_tags = soup.find('table').find('tbody').find_all('a')
links = []
for tag in anchor_tags:
    links.append(tag['href'])
```
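One caveat: depending on the site, those href values may be relative paths rather than full URLs. If that's the case, a small normalization step (just a sketch, assuming the url and links variables from the block above) keeps the later requests.get calls happy:

```python
from urllib.parse import urljoin

# Convert any relative hrefs into absolute URLs against the listing page
links = [urljoin(url, href) for href in links]
```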
That took care of the links on the first page. Next, I looped over each link, fetched the page, and extracted the data I was after:
```python
# I wanted to see if some keywords are in the page
keywords = ['keyword1', 'keyword2', 'keyword3']

for link in links:
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    content = soup.find('table').find('tbody').find_all('td')
    for td in content:
        matched = [keyword for keyword in keywords if keyword in td.text]
        if any(matched):
            # Store the data in a csv file
            with open('data.csv', 'a') as file:
                file.write(link + ',' + td.text + '\n')
```
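One thing worth watching out for: if the cell text can contain commas or newlines, joining strings by hand produces a malformed CSV. A safer variant of the write step (just a sketch using the standard csv module; write_row is a hypothetical helper, not part of my original code) would be:

```python
import csv

# csv.writer handles quoting, so commas or newlines inside the cell text
# don't break the output file
def write_row(link, text):
    with open('data.csv', 'a', newline='') as file:
        csv.writer(file).writerow([link, text])
```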
This worked, but it was very time-consuming: I estimated about 3 hours to crawl all 6,850+ pages sequentially. So I decided to set things up for multithreading to speed up the process.
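That estimate is just back-of-the-envelope math, assuming roughly 1.5 seconds per sequential request (network round trip plus parsing):

```python
# Rough sequential estimate; ~1.5 s per page is an assumption, not a measurement
6850 * 1.5 / 3600   # ≈ 2.9 hours
```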
Refactoring and preparing for multithreading
First I cleaned up the code and split it into functions:
```python
import requests
from bs4 import BeautifulSoup
from queue import Queue

# This function will visit a listing page and extract the links in its table
def find_links(soup):
    anchor_tags = soup.find('table').find('tbody').find_all('a')
    links = []
    for tag in anchor_tags:
        links.append(tag['href'])
    return links

# This function will extract the next page link
def get_next_page_link(soup):
    paging = soup.find('ul', class_='paging')
    next_page = paging.find_all('li')[-2]
    # BeautifulSoup returns the class attribute as a list of class names
    if 'disabledpg' in (next_page.get('class') or []):
        return ''
    return next_page.find('a').get('href')

# This function will visit each link and extract data from the page
def process_detail_page(link, keywords, queue):
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    content = soup.find('table').find('tbody').find_all('td')
    for td in content:
        matched = [keyword for keyword in keywords if keyword in td.text]
        if any(matched):
            queue.put(link + ',' + td.text)

# This function will crawl through the pages
def crawl(url, keywords, queue):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = find_links(soup)
    for link in links:
        process_detail_page(link, keywords, queue)
    next_page = get_next_page_link(soup)
    if next_page:
        crawl(next_page, keywords, queue)

def main():
    url = "https://example.com"
    keywords = ['keyword1', 'keyword2', 'keyword3']
    queue = Queue()
    crawl(url, keywords, queue)
    with open('data.csv', 'a') as file:
        while not queue.empty():
            file.write(queue.get() + '\n')

if __name__ == "__main__":
    main()
```
This refactor prepared the code for multithreading with Python's threading module.
Multithreading
I used the following code to set up the threads:
```python
import threading
from queue import Queue

def main():
    url = "https://example.com"
    keywords = ['keyword1', 'keyword2', 'keyword3']
    queue = Queue()
    crawl(url, keywords, queue)
    with open('data.csv', 'a') as file:
        while not queue.empty():
            file.write(queue.get() + '\n')

# This function will crawl through the pages
def crawl(url, keywords, queue):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = find_links(soup)
    # Create threads
    threads = []
    for link in links:
        # Spawn a new thread for each link. Since each listing page has at most
        # 50 links, I didn't worry about creating too many threads.
        # The queue is shared among threads, and queue.Queue is thread-safe.
        thread = threading.Thread(target=process_detail_page, args=(link, keywords, queue))
        threads.append(thread)
        thread.start()
    # Wait for all threads to finish before moving to the next page
    for thread in threads:
        thread.join()
    next_page = get_next_page_link(soup)
    if next_page:
        crawl(next_page, keywords, queue)

if __name__ == "__main__":
    main()
```
As simple as that, the whole site was crawled and the data I needed was extracted. Since queue.Queue is thread-safe, I didn't have to worry about synchronizing access to the shared queue. It was a very interesting experience.
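If the pages had many more links, spawning one thread per link would stop being a good idea. A variation worth considering in that case (just a sketch reusing the functions and imports from the refactor above, not what I actually ran) is concurrent.futures.ThreadPoolExecutor, which caps the number of worker threads:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(url, keywords, queue, max_workers=10):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = find_links(soup)
    # A bounded pool of worker threads processes the detail pages
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for link in links:
            executor.submit(process_detail_page, link, keywords, queue)
    # Leaving the with-block waits for all submitted tasks to finish
    next_page = get_next_page_link(soup)
    if next_page:
        crawl(next_page, keywords, queue, max_workers)
```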
This is a very simple use case for a multithreaded web crawler, and it's simple because there are no dependencies between the pages. If pages depend on each other, things get a bit more complex. But even so, it's very easy to write a web crawler in Python.