Pulling Data From PDFs, Part 1:

MB · 3 min read · Oct 3, 2020

Using Python to Download Multiple PDFs Quickly…

Photo by: Danni Simmonds (freeimages.com)

Sometimes finding data feels a lot like taking on a mountain. Recently I came across some data related to international adoption on Travel.State.Gov. The data was laid out nicely in Plotly, but not in the way I wanted to look at it. Surely they provided a raw data download link? Nope, just a link to annual PDF reports that contain the data I need. Doh!

Obviously, I see this as an opportunity. Let's open up a notebook and get busy!

First I checked the site for a robots.txt file and for any restrictions on web scraping: nothing. Good to go.
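If you'd rather automate that check, Python's standard library can read a site's robots.txt for you. A minimal sketch, assuming the file lives at the standard location:

from urllib.robotparser import RobotFileParser

# point the parser at the site's robots.txt and fetch it
rp = RobotFileParser('https://travel.state.gov/robots.txt')
rp.read()

# can_fetch() reports whether a given user agent may request a URL
url = 'https://travel.state.gov/content/travel/en/Intercountry-Adoption/adopt_ref/adoption-publications.html'
print(rp.can_fetch('*', url))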

import requests
from bs4 import BeautifulSoup

# pull the data on the page, pass to requests.get(), and make soup!
url = 'https://travel.state.gov/content/travel/en/Intercountry-Adoption/adopt_ref/adoption-publications.html'
data = requests.get(url)
soup = BeautifulSoup(data.content, 'html.parser')

We made our soup; now let's get the information we need by inspecting the page:

Let's filter our soup for that ‘<div>’ element:

# pull the div tag and then the href data for all 'a' tags
ref = soup.find('div', {'class': 'tsg-rwd-text parbase section'})
links = [i.get('href') for i in ref.find_all('a')]

Here’s what we have:
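To see the same thing in your own notebook, print a few entries; they come back as site-relative paths (the slice size here is arbitrary):

# peek at the first few scraped links; these are site-relative paths
print(links[:3])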

Now, to crawl the pages, we need our links to be full URLs:

# manipulate the strings to get what we need
h = 'https://travel.state.gov'
full_links = [h + i for i in links]
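Plain concatenation works here because every href starts with a slash. If you want something more robust to mixed relative and absolute hrefs, the standard library's urljoin is a reasonable swap; a sketch, not what the code above does:

from urllib.parse import urljoin

# urljoin resolves each href against the base URL, handling
# relative paths and already-absolute URLs alike
full_links = [urljoin('https://travel.state.gov', i) for i in links]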

Let's get those PDFs:

# iterate through the full links
for i in full_links:
    # use requests to pull down the pdf for each link
    data = requests.get(i, stream=True)

    # I want to save the names based on the pdf names.
    # That requires some cleaning.
    file_name = i.split('/')[-1] \
        .replace('%', '_') \
        .replace('(', '_') \
        .replace(' ', '_') \
        .replace(')', '') \
        .replace('.._', '') \
        .replace('_206.8.17_20_2', '')

    # now we can just write the data to file_name
    with open(file_name, 'wb') as f:
        f.write(data.content)

If the string manipulation is confusing you, all I have done is split each URL on ‘/’ and taken the last item in the resulting list. This is the actual PDF name, but URL-encoded so a browser can interpret it. To correct it, I replaced the % signs with underscores, replaced spaces and parentheses, and stripped the extraneous fragments (.._ and _206.8.17_20_2) to get file names that make sense.
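Two optional refinements, neither of which the code above uses: urllib.parse.unquote decodes the %-escapes properly instead of replacing them, and since stream=True was requested anyway, iter_content() can write each file to disk in chunks rather than loading the whole PDF into memory. A sketch:

from urllib.parse import unquote

for i in full_links:
    data = requests.get(i, stream=True)

    # decode %-escapes (e.g. %20 -> space), then swap the
    # remaining awkward characters for underscores
    file_name = unquote(i.split('/')[-1]).replace(' ', '_')

    # stream the response to disk in chunks rather than
    # holding the whole PDF in memory at once
    with open(file_name, 'wb') as f:
        for chunk in data.iter_content(chunk_size=8192):
            f.write(chunk)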

Part one done. In this short article I showed you how to ethically pull down multiple PDF files. In part two I will walk through extracting the data contained in these files for use in analytics.


MB: Husband, Father, Pediatrician & Informaticist writing about whatever is on my mind for today.