Three Ways to Web Scrape a Page With Python…


It hurts my heart a little to realize that Jerry Maguire came out in 1996. Maybe if the movie were made today, 40-somethings all over the world would be shouting the mantra, “SHOW ME THE DATA!”

If you are a data scientist looking for data, there are only so many free resources to download before you realize that you need to learn how to scrape web pages.

Before we start, let me state that web scraping should be done responsibly, with as little impact on the host servers as possible, and that we should all be respectful of others’ creative work and copyrighted content. Be sure to read a site’s robots.txt file to understand what is and is not allowed, and stick to the rules. That said, let’s learn three different ways to scrape a page in Python with minimal dependencies.
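For example, Python’s standard library includes urllib.robotparser, which can tell you whether a given path is open to generic crawlers, no extra installs required:

from urllib import robotparser

# check whether the page we plan to scrape is allowed for generic crawlers
rp = robotparser.RobotFileParser('https://webscraper.io/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://webscraper.io/test-sites/tables'))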

1. Pandas

This is the no-brainer. When I find a table on a website begging to be scraped, my first attempt is always to throw pandas at it.

To practice, we are going to use the table page on webscraper.io, a site for testing your web-scraping scripts. In this example we will try to grab the second table on the page, the one whose rows are numbered 4–6.

To scrape a page with pandas, simply read the url into a pandas data frame:

import pandas as pd

url = 'https://webscraper.io/test-sites/tables'
df = pd.read_html(url)
print(df[1].head())

And just like that, we have scraped the data we wanted. Note that when you scrape a web page with pandas, the tables come back as data frames in a list, so we need to look through df to find the correct table. Looking at the web page, we can make a logical guess that the second table will be df[1].
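If the guess turns out wrong, a quick loop over the returned list (a small sketch using the same variables as above) shows each table’s position, shape, and columns so you can pick the right index:

# df is the list of data frames returned by pd.read_html above
for i, table in enumerate(df):
    print(i, table.shape, list(table.columns))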

Web scraping really doesn’t get easier than that. Unfortunately, not every table you find on the web can be scraped with pandas. On to the next level…

2. Requests and BeautifulSoup:

In my mind, this is where any Pythonista should start when learning to scrape websites. There are more sophisticated alternatives for when you are ready to crawl entire sites, but these two libraries are a great starting point. Install Requests and BS4 if you haven’t already:

pip install requests bs4

Simple enough. Now we use requests to fetch the page’s HTML and BeautifulSoup (from bs4) to parse it:

import requests
from bs4 import BeautifulSoup
url = 'https://webscraper.io/test-sites/tables'
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')

Now we have options. We can take a look at the HTML by simply calling soup, or we can open the page in the browser and inspect the tags of the elements of interest.
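If you would rather stay in the terminal, printing a slice of the prettified markup is usually enough to spot the tags you need; this is just a convenience sketch built on the soup object from above:

# first ~1,000 characters of nicely indented HTML
print(soup.prettify()[:1000])

# or list just the tables on the page and their classes
for t in soup.find_all('table'):
    print(t.get('class'))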

Looks like we need to pull <table>, then <tr>, and finally <td> to get at our data. Let’s grab <table class="table table-bordered"> first:

# get the tables first
tbls = soup.find_all('table', {'class': 'table table-bordered'})

Now if we look at the length of tbls we find that it is 4: we have obtained all four tables on the page in a few lines of code. Let’s get the data out of the second one next:

tr = tbls[1].find_all('tr')

This code grabs every table row (<tr>) tag in our second table; the first row holds the ‘table header’ data. Now we can use a list comprehension to pull the text out of every row, including the column names.

data = [d.text for d in tr]

Printing “data”, we see that we now have a list of strings.

Let’s drop the first and last “\n” from each string and then split on the remaining line feed characters (\n):

data = [d[1:-1].split('\n') for d in data]

Now make a data frame:

import pandas as pd  # already imported if you ran the first example

df = pd.DataFrame(data[1:], columns=data[0])
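Putting the pieces together, the whole requests + BeautifulSoup approach fits in one short script. This is simply the snippets above consolidated, so the only assumption is that the page layout has not changed:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://webscraper.io/test-sites/tables'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# second bordered table on the page
table = soup.find_all('table', {'class': 'table table-bordered'})[1]

# one list of cell text per row; the first row holds the column names
rows = [r.text[1:-1].split('\n') for r in table.find_all('tr')]

df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)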

3. Simplify and Remove Dependencies

Let’s make this as independent as possible now: no imports at all, just built-in Python string methods. We will scrape the page in the following steps:

  1. Source the page: note that this won’t work with dynamic, JS-rendered pages
  2. Copy out the data you want
  3. Use string manipulation to pull out your data
First, view the page source of the website you want to scrape (right-click and choose View Page Source), find the data you want, and copy that chunk of HTML. Store it in a string variable and then manipulate that string to pull out the data. This can be tedious, but it lets you get at data that can be very difficult to scrape otherwise. In this case it’s straightforward: set your string variable.
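As a stand-in for the screenshot of that step, here is roughly what the copied markup looks like once it is assigned to a variable; the tags and row values below are an illustrative assumption, not the exact contents of the live page:

# hypothetical stand-in for the HTML copied from the page source;
# the real markup and row values will differ
data = """<table class="table table-bordered">
<tr>
\t\t\t\t<th>#</th>
\t\t\t\t<th>First Name</th>
\t\t\t\t<th>Last Name</th>
\t\t\t\t<th>Username</th>
</tr>
<tr>
\t\t\t\t<td>4</td>
\t\t\t\t<td>Jane</td>
\t\t\t\t<td>Doe</td>
\t\t\t\t<td>@jdoe</td>
</tr>
</table>"""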

Let’s make the string easier to work with by identifying the tags that will be simplest to manipulate. I am going to use the table row <tr> tag to break out the data, so let’s collapse the closing </tr> tags into plain <tr> markers:

data = data.replace('</tr>', '<tr>')

If we look at the string now, we will see that there are a lot of tabs and line breaks. Let’s get rid of those as well. We could use regex here, but remember, no imports…

data = data.replace('\n\t\t\t\t', '')

Now let’s get at the data itself. Note that all of our data sits inside either <th> or <td> tags, so let’s make those easier to work with:

data = data.replace('<th>', '<td>').replace('</th>', '<td>')
data = data.replace('</td>', '<td>')
# now try splitting on '<td>'
data = data.split('<td>')

Looks like we want elements 1 through -1 of that list. Let’s slice those out and drop all the garbage:

data_clean = [d for d in data[1:-1] if d != '' and '\n' not in d]

Now we have all of our data in one flat list. Let’s finish this! We need the data in either an array or a list of lists. We could import numpy and reshape it like this…

# Using numpy - super easy, but requires another import
import numpy as np
data_array = np.array(data_clean).reshape(3, 4)  # 3 rows x 4 columns

But, again, we aren’t importing anything in this method, so let’s just use zip…

# zipping the list to itself in groups of 4 (one group per row)
data_grouped = list(zip(*(iter(data_clean),) * 4))
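The *(iter(...),) * 4 idiom works because it creates four references to the same iterator, so each tuple that zip produces pulls four consecutive items. A tiny standalone example with made-up values shows the effect:

items = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

# four handles on one iterator, so zip consumes items four at a time
print(list(zip(*(iter(items),) * 4)))
# [('a', 'b', 'c', 'd'), ('e', 'f', 'g', 'h')]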

And finally we export this to a flat file:

with open('scraped_page_with_strings_only.csv', 'w+') as f:
    for row in data_grouped:
        for x in row:
            f.write(str(x) + ',')
        f.write('\n')

Now I know you are doubting whether this actually worked, so let’s check.
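One quick way to verify is to read the file straight back in with pandas (already installed for the first method). Note that the trailing comma on each line produces one empty column, an artifact of how we wrote the file:

import pandas as pd

check = pd.read_csv('scraped_page_with_strings_only.csv', header=None)
print(check.dropna(axis=1, how='all'))  # drop the empty trailing column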

A little cleaning and the scraped table is ready to use.

In this article, we walked through three different ways to use Python to scrape a web page, with varying levels of dependencies and complexity. Once you are comfortable scraping a page with these methods, it’s time to explore additional libraries such as Scrapy and Selenium.

I hope you enjoyed this article and that it inspires you to play around with pulling some data from the web for your data science journey.
