Web Scraping

Data is everywhere around us and is constantly being collected, redistributed, and visualized. Luckily, if the data is visible somewhere on the internet, we can typically retrieve it.

We will discuss a few ways that we can retrieve data:

  • API: This is the best way! It means that the team collecting the data has anticipated that others might want to use it and has provided a documented route for retrieving it. Unfortunately, most data does not fall into this category…

  • HTML tables: Data is often retrievable directly from the HTML table in which it is displayed on the site.

  • Request intercepting: Data is often loaded from files on the company’s server and, if you’re willing to inspect what gets loaded with the site, you can often grab the data from those requests.

  • Parsing through HTML: Data is often stored in identifiable tags within a site’s HTML. We can often retrieve it by examining the page and parsing it with an HTML parser.

  • Executing JavaScript and POST requests: We won’t talk about this one today, but it’s the most painful. It requires a bit of expertise and a lot of patience.

API

The Federal Reserve Economic Data (FRED) service, published by the Federal Reserve Bank of St. Louis, provides an exceptionally well-documented API.

We will retrieve GDP data from their API.
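A minimal sketch of such a request, using only the standard library. It assumes the FRED `series/observations` endpoint with JSON output and an API key obtained from FRED; the helper names (`build_fred_url`, `parse_observations`, `fetch_series`) and the sample payload are made up for illustration, so the example can run without network access.

```python
import json
import urllib.parse
import urllib.request

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_fred_url(series_id, api_key):
    """Build the request URL for one FRED series, asking for JSON output."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    return FRED_URL + "?" + urllib.parse.urlencode(params)


def parse_observations(payload):
    """Turn a decoded FRED response into a list of (date, value) pairs.

    FRED reports values as strings and marks missing values with ".",
    so those entries are skipped here.
    """
    rows = []
    for obs in payload["observations"]:
        if obs["value"] == ".":
            continue
        rows.append((obs["date"], float(obs["value"])))
    return rows


def fetch_series(series_id, api_key):
    """Download one series (needs network access and a valid API key)."""
    with urllib.request.urlopen(build_fred_url(series_id, api_key)) as resp:
        return parse_observations(json.load(resp))


# Offline demonstration with a made-up payload shaped like a real response.
sample = {
    "observations": [
        {"date": "2020-01-01", "value": "100.5"},
        {"date": "2020-04-01", "value": "."},
        {"date": "2020-07-01", "value": "101.25"},
    ]
}
print(parse_observations(sample))
```

With a real key, `fetch_series("GDP", api_key)` would return the same kind of list for the GDP series.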

HTML table

pandas provides the pd.read_html function, which reads well-formatted HTML tables into a list of DataFrames.

We will retrieve data from ESPN on which teams have upcoming games in the Bundesliga.
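As a rough illustration of how pd.read_html behaves, the sketch below parses a small made-up fixtures table instead of the live ESPN page; the team names and dates are placeholders. It assumes an HTML parser backend that pandas supports (lxml, or BeautifulSoup with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# A made-up fixtures table standing in for the HTML a real page would serve.
html = """
<table>
  <thead>
    <tr><th>Home</th><th>Away</th><th>Date</th></tr>
  </thead>
  <tbody>
    <tr><td>Team A</td><td>Team B</td><td>2024-01-13</td></tr>
    <tr><td>Team C</td><td>Team D</td><td>2024-01-14</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
fixtures = tables[0]
print(fixtures)
```

For a live page, the same function accepts a URL in place of the StringIO object, and the table of interest is picked out of the returned list by position.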

Request intercept

When a web page loads, it often needs to dynamically fetch the most recent data. That data is often stored in files that live on the server hosting the page and is loaded into your browser when you open the page. We can see a history of the files the page loaded by opening our browser's developer tools (for example via "Inspect element") and looking at the network activity.

We will use this method to retrieve state-level COVID vaccination data from the CDC website.
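Once a data file has been spotted among the page's requests, fetching it directly might look like the sketch below. The URL is a hypothetical placeholder (not the CDC's actual endpoint), and the payload layout and field names (`records`, `state`, `doses`) are invented for illustration; the real ones have to be read off the intercepted response.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with the request URL copied from the
# network activity shown in the browser's developer tools.
DATA_URL = "https://example.com/path/to/data.json"


def fetch_json(url):
    """Request the same file the page loads and decode it as JSON."""
    # Some servers reject requests without a browser-like User-Agent header.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def rows_by_state(payload, records_key, state_key, value_key):
    """Map each state to one field of its record in the decoded payload."""
    return {rec[state_key]: rec[value_key] for rec in payload[records_key]}


# Offline demonstration with a made-up payload; real field names will differ.
sample = json.loads(
    '{"records": [{"state": "NY", "doses": 10}, {"state": "CA", "doses": 20}]}'
)
print(rows_by_state(sample, "records", "state", "doses"))
```

With a real request URL, `fetch_json(DATA_URL)` would take the place of the sample payload.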

Parse the HTML

Components of this section were originally written by one of our collaborators, Spencer Lyon.

HTML: Language of the web

  • Web pages are written in a markup language called HTML

  • HTML stands for “hyper text markup language”

  • All HTML documents are composed of a tree of nested elements

  • For example, this bullet point list would be written like this in HTML:

    <ul>
      <li>Web pages are..</li>
      <li>HTML stands for...</li>
      <li>All HTML documents...</li>
      <li>For example, this...</li>
    </ul>
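To make the "tree of nested elements" idea concrete, here is a small sketch that walks a shortened version of the list above with the standard library's html.parser and records each element indented by its depth. The `TreePrinter` class is a made-up helper, and it assumes well-formed HTML without void elements such as `<br>`, which have no closing tag.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each element and text node, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.lines.append("  " * self.depth + repr(text))


html = "<ul><li>Web pages are..</li><li>HTML stands for...</li></ul>"
printer = TreePrinter()
printer.feed(html)
printer.close()
print("\n".join(printer.lines))
```

The printed outline shows `ul` at the top level, each `li` one level down, and the text of each item nested inside its `li`.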
    

Components of HTML

  • Below is an image that annotates the core parts of an HTML document

[Figure: html_parts, the annotated parts of an HTML element]

  • Tag: name or type of an element

  • CSS classes: used to style and change appearance (we’ll use it to identify specific elements!)

  • id: Unique identifier for element on whole webpage

  • value: what a property is set to; class is one such property, and the syntax is property=value

  • text: the actual text contained in the element

Figure originally from Practical Web Design by Philippe Hong
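The parts listed above can also be pulled out of an element programmatically. The sketch below uses the standard library's html.parser on a made-up `<p>` element; the `ElementCollector` class and the id and class names are illustrative only.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect tag, id, classes, and text for every element in a snippet."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = []  # stack of elements whose end tag has not been seen

    def handle_starttag(self, tag, attrs):
        props = dict(attrs)  # attrs arrives as (property, value) pairs
        element = {
            "tag": tag,
            "id": props.get("id"),
            "classes": (props.get("class") or "").split(),
            "text": "",
        }
        self.elements.append(element)
        self._open.append(element)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Attach text to the innermost element that is still open.
        if self._open:
            self._open[-1]["text"] += data.strip()


# Made-up element for illustration.
snippet = '<p id="intro" class="lead highlight">Hello, world</p>'
collector = ElementCollector()
collector.feed(snippet)
collector.close()
print(collector.elements[0])
```

Note that a single element can carry several classes, separated by spaces, which is why the class property is split into a list here.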

Structure of Webpage

  • Most webpages follow a very common structure:

    <!DOCTYPE HTML>
    <html>
    <head>
      <meta>...</meta>
      <title>...</title>
      <link>...</link>
    </head>
    <body>
      <h1>Title</h1>
      .... MANY MORE ELEMENTS HERE ...
    </body>
    </html>
    

We are almost always interested in what is contained inside the <body> element

Parsing data

We are going to parse the surf report for Baltrum. This data is generated by a site called Magic Seaweed and we can see the page at https://magicseaweed.com/Baltrum-Surf-Report/1117/
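The general pattern is to download the page, find the tag and class that wrap the values of interest, and collect their text. The sketch below shows that last step on a made-up snippet using the standard library's html.parser. The class names (`forecast-row`, `wave-height`) and the values are hypothetical and are not taken from the actual site, and the `ClassTextExtractor` helper assumes the matching elements contain no void elements such as `<br>`.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.results = []
        self._depth = 0  # > 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # nested element inside a match
        elif self.target_class in classes:
            self._depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.results[-1] += data


# Made-up markup; inspect the real page to find the actual tags and classes.
page = """
<body>
  <h1>Surf report</h1>
  <div class="forecast-row"><span class="wave-height">2-3<small>ft</small></span></div>
  <div class="forecast-row"><span class="wave-height">1-2<small>ft</small></span></div>
</body>
"""
extractor = ClassTextExtractor("wave-height")
extractor.feed(page)
extractor.close()
heights = [text.strip() for text in extractor.results]
print(heights)
```

Third-party parsers offer more convenient selectors for the same job; the standard-library version is used here only to keep the example self-contained.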

Other data

Sometimes data from the web comes in a PDF. I’m currently a fan of either PyMuPDF or camelot for extracting data from PDF files!