Part 1: Crawling a website using BeautifulSoup and Requests

Have you ever worked on a project where you needed to scrape a website with an unknown number of subpages? What are the options, and how fast can we write a basic script to get the job done? In this project that is exactly what I do, as I search for famous quotes on a developer practice website and then reformat this data into a useful web app with Streamlit.

For this build we are going to play with www.quotes.toscrape.com, a website specifically designed to let developers practice web scraping.

We are going to use Requests and BeautifulSoup to show how easily you can crawl multiple pages even with a relatively simple scraping library. For larger or more complex projects I would recommend Scrapy or Selenium.

As always, I like to minimize dependencies. Here I am going to rely on only three: Pandas, Requests, and BeautifulSoup.

import requests
from bs4 import BeautifulSoup
import pandas as pd

Here we import requests so we can send HTTP/1.1 GET requests with a single function, requests.get(). If you want to better understand what requests is doing under the hood, I recommend a quick review of the official documentation.

“Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.”
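As a quick, standalone illustration (not part of the crawler itself), a single GET request against the quotes site looks like this; status_code and the response body are the pieces we lean on below:

import requests

# One-off GET request against the index page of the quotes site.
response = requests.get('http://quotes.toscrape.com')

print(response.status_code)   # 200 on success
print(response.text[:100])    # the first 100 characters of the raw HTML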

Now we are going to create a few global variables to work with.

# Globals
url = 'http://quotes.toscrape.com'
url_list = [url,]
pages = []
soup_list = []
not_last_page = True

Here we create a few lists to populate (url_list, pages, soup_list) and set not_last_page to True. We will see why in a moment.

Next we take a three-step approach to parse all of our pages.

#1: Pull the requests
def pullUrl(func):
    def inner(*args, **kwargs):
        page = requests.get(url_list[-1])
        if page.status_code == 200:
            pages.append(page)
            func(*args, **kwargs)
        else:
            print(f'The url {url_list[-1]} returned a status of {page.status_code}')
    return inner

#2: Make some soup
def makeSoup(func):
    def inner(*args, **kwargs):
        soup = BeautifulSoup(pages[-1].content, 'html.parser')
        soup_list.append(soup)
        func(*args, **kwargs)
    return inner

#3: Parse the URLs
@pullUrl
@makeSoup
def getURLs():
    global not_last_page
    try:
        next_page = url + soup_list[-1].find('li', {'class': 'next'}).find('a')['href']
        print(next_page)
        url_list.append(next_page)
    except AttributeError:
        not_last_page = False

There is a little bit going on here, so let's walk through the code. To start, I reviewed the HTML of the site we are scraping using Google Chrome’s DevTools.

Notice that the ‘Next’ button sits inside a list item <li> with a class of “next”. Also note that each subsequent url is simply the index page concatenated with “/page/n/”. With that information and our three imports, we can create a few simple functions to find and scrape the entire website.
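For reference, the pager markup looks roughly like the simplified fragment below, and the find chain we use walks straight down it. This is just a standalone check of the selector logic, not part of the crawler:

from bs4 import BeautifulSoup

# Simplified copy of the pager markup seen in DevTools (illustrative only).
pager_html = '''
<ul class="pager">
  <li class="next">
    <a href="/page/2/">Next &rarr;</a>
  </li>
</ul>
'''

soup = BeautifulSoup(pager_html, 'html.parser')
print(soup.find('li', {'class': 'next'}).find('a')['href'])  # /page/2/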

The first function makes a simple GET request, and it will eventually iterate over our url_list by always requesting the last item in the list using negative indexing (‘[-1]’). To ensure that we received a good response, we use an if clause on the .status_code attribute to confirm a 200 status. Note that we are using Python decorators here to make the code more pythonic; for a good review of how to use decorators, check out GeeksForGeeks.org.
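If the decorator pattern is new to you, here is a minimal sketch of the same wrapper structure (the names logIt and greet are purely illustrative):

# A minimal decorator: logIt wraps any function in an inner() that
# prints a message before delegating to the wrapped function.
def logIt(func):
    def inner(*args, **kwargs):
        print(f'Calling {func.__name__}')
        func(*args, **kwargs)
    return inner

@logIt
def greet(name):
    print(f'Hello, {name}!')

greet('reader')
# Calling greet
# Hello, reader!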

In our second function, we parse each page returned by the request from our first function and save the resulting soup into soup_list. Again we use the inner-function syntax to create a Python wrapper.

In our third function, we walk the HTML to pull out the ‘next’ href, concatenate it with our index url, and append the result to our url_list. We prepare to run this through a loop by using try/except to catch the AttributeError that occurs when we parse the final page, which has no ‘next’ list item. When this happens we set not_last_page to False to exit our loop. The driving loop and the example output starting from page 1 are shown below.
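Driving the crawl is a single while loop; against this site the printed URLs should look roughly like this:

# Keep crawling until getURLs() fails to find a 'next' link.
while not_last_page:
    getURLs()

# Expected console output (roughly):
# http://quotes.toscrape.com/page/2/
# http://quotes.toscrape.com/page/3/
# ...
# http://quotes.toscrape.com/page/10/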

That’s it. With those three imports and our three functions we can parse all of the HTML from what ends up being 10 pages of quotes in this example.

All that’s left is to pull out the actual data and place it into a DataFrame.

# Pull each quote and author out of the soups and into a dictionary:
quotes_dict = {}
for soup in soup_list:
    for quote in soup.find_all('div', {'class': 'quote'}):
        text = quote.find('span', {'class': 'text'}).text
        author = quote.find('small', {'class': 'author'}).text
        quotes_dict[text] = author

# Place the dictionary into a DataFrame:
quotes_df = pd.DataFrame(list(quotes_dict.items()), columns=['Quote', 'Author'])[['Author', 'Quote']].sort_values('Author')

So what have we accomplished? We identified a site with an unknown number of pages of content that we wanted to scrape into a DataFrame for use in a future project. Relying on a minimum number of dependencies and a few small functions, we were able to not only scrape, but actually crawl, a website. Finally, we utilized Python decorators to make our code more pythonic. In part two of this article we will take the output of our web crawl and rapidly develop a web application that lets a user either receive a random quote or select a quote by the author’s name.

Full project code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Globals
url = 'http://quotes.toscrape.com'
url_list = [url,]
pages = []
soup_list = []
not_last_page = True

#1: Pull the requests
def pullUrl(func):
    def inner(*args, **kwargs):
        page = requests.get(url_list[-1])
        if page.status_code == 200:
            pages.append(page)
            func(*args, **kwargs)
        else:
            print(f'The url {url_list[-1]} returned a status of {page.status_code}')
    return inner

#2: Make some soup
def makeSoup(func):
    def inner(*args, **kwargs):
        soup = BeautifulSoup(pages[-1].content, 'html.parser')
        soup_list.append(soup)
        func(*args, **kwargs)
    return inner

#3: Parse the URLs
@pullUrl
@makeSoup
def getURLs():
    global not_last_page
    try:
        next_page = url + soup_list[-1].find('li', {'class': 'next'}).find('a')['href']
        print(next_page)
        url_list.append(next_page)
    except AttributeError:
        not_last_page = False

## Syntax and example output for page 1:
# next_page = url+soup.find('li', {'class': 'next'}).find('a')['href']
# print(next_page)

while not_last_page:
    getURLs()

# Pull each quote and author out of the soups and into a dictionary:
quotes_dict = {}
for soup in soup_list:
    for quote in soup.find_all('div', {'class': 'quote'}):
        text = quote.find('span', {'class': 'text'}).text
        author = quote.find('small', {'class': 'author'}).text
        quotes_dict[text] = author

# Place the dictionary into a DataFrame:
quotes_df = pd.DataFrame(list(quotes_dict.items()), columns=['Quote', 'Author'])[['Author', 'Quote']].sort_values('Author')
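Once the crawl finishes, a quick sanity check and save might look like the sketch below; the quotes.csv filename is just an example, and part two will pick this data up for the Streamlit app:

# Inspect the crawl results: roughly 100 quotes across the 10 pages.
print(quotes_df.shape)
print(quotes_df.head())

# Optionally persist the DataFrame for reuse in part two
# (quotes.csv is a hypothetical filename).
quotes_df.to_csv('quotes.csv', index=False)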
