Chapter 10 Case Studies

Goal

The main goal of this chapter is to apply lessons learnt in this book through a series of case studies. We will begin with one case study and subsequent case studies will be added from suggestions made.

Prerequisite

To appreciate this chapter, you must have gone through all the other chapters (1-9).

10.1 Case Study One:- Importing online data

10.1.1 Task

Using pure base R (no contributed packages), extract 800m and marathon data on recently held (summer) Olympics games and prepare it for data analysis (create data frames).

There are a number of excellent R packages suitable for this task; however, it is best to use a package once we have a basic idea of how it works.

10.1.2 What we shall cover

In addition to reviewing what we have covered in chapter one, we shall also learn how to:

  • Import web data using base R readLines()
  • Read and understand web data (basic HTML)
  • Extract web data (using CSS selectors)

10.1.3 Background and Data

Our first case study is all about extracting online data and making it available for analysis. The basic reason for making this our first case study is that the web is full of (underutilized) data, and you will often find it necessary to use some of it.

To make this interesting, we will look at the recent (2016) Summer Olympics games.

As we venture into this case study, some of the topics we will review are creating data objects (data frame) and converting data into a date-time object.

The Olympics has quite a number of events, from Archery all the way to Wrestling. For this exercise, our interest will be Athletics and, more specifically, the 800m and Marathon events.

We will scrape our data from Wikipedia: https://en.wikipedia.org/wiki/800_metres and https://en.wikipedia.org/wiki/Marathon_at_the_Olympics. From these two pages, we will look for and extract the tables giving the Olympic medal standings for each event as of 2016, disaggregated by gender.

10.1.4 Web scraping

The simplest part of web scraping (data extraction) is the actual data importation. This is because, in addition to base R’s readLines() function, there are contributed packages (from CRAN and other repositories) with functions capable of achieving this task.

To learn more about what is available in terms of contributed packages, read https://cran.r-project.org/web/views/WebTechnologies.html. However, as mentioned before, for this chapter we will try as much as possible to use only base R functions, as we want to get a good understanding of the concepts right from the beginning. This will not only help us understand how contributed packages like XML and rvest work, but will also ensure we can still extract and traverse web data when we do not have access to these packages.

With that said, our first task is to download/scrape/import our data.

url1 <- "https://en.wikipedia.org/wiki/800_metres"
url2 <- "https://en.wikipedia.org/wiki/Marathon_at_the_Olympics"
wiki800mDOM <- readLines(url1)
wikiMarathonDOM <- readLines(url2)
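
The two readLines() calls above stop with an error if a page cannot be reached. As a minimal sketch, assuming we would rather receive an empty vector and a message than an error, the import can be wrapped in tryCatch(). The function name read_page() is hypothetical (it is not part of base R), and the example reads a temporary local file so that it runs without an internet connection; a URL such as url1 could be passed in the same way.

```r
# Hypothetical helper: read a page (URL or file path) into a character vector,
# returning character(0) instead of stopping when the source cannot be opened.
read_page <- function(src) {
  tryCatch(
    suppressWarnings(readLines(src, warn = FALSE, encoding = "UTF-8")),
    error = function(e) {
      message("Could not read ", src, ": ", conditionMessage(e))
      character(0)
    }
  )
}

# Stand-in for a web page: a small local file with two lines of HTML
tmp <- tempfile(fileext = ".html")
writeLines(c("<!DOCTYPE html>", "<html></html>"), tmp)

page <- read_page(tmp)
missing_page <- read_page(file.path(tempdir(), "no_such_page.html"))

str(page)
length(missing_page)
```

Whether to fail loudly or return an empty vector is a design choice; for a script that loops over many pages, the empty vector is often easier to handle.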

As you have seen, downloading data is rather easy. Now let’s find out what data structure is created when we import web data.

str(wiki800mDOM)
##  chr [1:2545] "<!DOCTYPE html>" ...
str(wikiMarathonDOM)
##  chr [1:1185] "<!DOCTYPE html>" ...

We have character vectors. Now let’s take a close look at their contents.

head(wiki800mDOM)
## [1] "<!DOCTYPE html>"                                                                                                                                    
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"                                                                                               
## [3] "<head>"                                                                                                                                             
## [4] "<meta charset=\"UTF-8\"/>"                                                                                                                          
## [5] "<title>800 metres - Wikipedia</title>"                                                                                                              
## [6] "<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, \"$1client-js$2\" );</script>"
tail(wiki800mDOM)
## [1] "\t\t\t\t\t\t\t\t\t</ul>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "\t\t\t\t\t\t<div style=\"clear:both\"></div>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [3] "\t\t</div>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [4] "\t\t<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgPageParseReport\":{\"limitreport\":{\"cputime\":\"1.264\",\"walltime\":\"1.440\",\"ppvisitednodes\":{\"value\":33739,\"limit\":1000000},\"ppgeneratednodes\":{\"value\":0,\"limit\":1500000},\"postexpandincludesize\":{\"value\":353879,\"limit\":2097152},\"templateargumentsize\":{\"value\":69217,\"limit\":2097152},\"expansiondepth\":{\"value\":12,\"limit\":40},\"expensivefunctioncount\":{\"value\":0,\"limit\":500},\"entityaccesscount\":{\"value\":0,\"limit\":400},\"timingprofile\":[\"100.00%  948.704      1 -total\",\" 55.78%  529.208    132 Template:FlagIOCathlete\",\" 49.93%  473.653    264 Template:Country_alias\",\" 36.61%  347.308      1 Template:Olympic_medalists_in_men's_800_metres\",\" 21.22%  201.271      1 Template:Olympic_medalists_in_women's_800_metres\",\" 16.09%  152.657    281 Template:Flagathlete\",\"  5.99%   56.835      1 Template:Reflist\",\"  5.98%   56.735      1 Template:World_Championships_in_Athletics_medalists_in_men's_800_metres\",\"  4.96%   47.051      7 Template:Cite_web\",\"  4.92%   46.718    281 Template:Country_flagbio\"]},\"scribunto\":{\"limitreport-timeusage\":{\"value\":\"0.404\",\"limit\":\"10.000\"},\"limitreport-memusage\":{\"value\":3818602,\"limit\":52428800}},\"cachereport\":{\"origin\":\"mw1274\",\"timestamp\":\"20170201155806\",\"ttl\":2592000,\"transientcontent\":false}}});});</script><script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgBackendResponseTime\":73,\"wgHostname\":\"mw1210\"});});</script>"
## [5] "\t</body>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [6] "</html>"
head(wikiMarathonDOM)
## [1] "<!DOCTYPE html>"                                                                                                                                    
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"                                                                                               
## [3] "<head>"                                                                                                                                             
## [4] "<meta charset=\"UTF-8\"/>"                                                                                                                          
## [5] "<title>Marathons at the Olympics - Wikipedia</title>"                                                                                               
## [6] "<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, \"$1client-js$2\" );</script>"
tail(wikiMarathonDOM)
## [1] "\t\t\t\t\t\t\t\t\t</ul>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "\t\t\t\t\t\t<div style=\"clear:both\"></div>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [3] "\t\t</div>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [4] "\t\t<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgPageParseReport\":{\"limitreport\":{\"cputime\":\"0.960\",\"walltime\":\"1.051\",\"ppvisitednodes\":{\"value\":10804,\"limit\":1000000},\"ppgeneratednodes\":{\"value\":0,\"limit\":1500000},\"postexpandincludesize\":{\"value\":82702,\"limit\":2097152},\"templateargumentsize\":{\"value\":12153,\"limit\":2097152},\"expansiondepth\":{\"value\":11,\"limit\":40},\"expensivefunctioncount\":{\"value\":1,\"limit\":500},\"entityaccesscount\":{\"value\":0,\"limit\":400},\"timingprofile\":[\"100.00%  861.977      1 -total\",\" 67.56%  582.383    284 Template:Country_alias\",\" 52.65%  453.827     55 Template:FlagIOCteam\",\" 21.87%  188.490     32 Template:FlagIOCathlete\",\" 17.98%  155.002      1 Template:Olympic_medalists_in_the_women's_marathon\",\"  7.09%   61.146      1 Template:Reflist\",\"  5.89%   50.739      1 Template:Infobox_Olympic_athletics_event\",\"  5.64%   48.629      1 Template:Infobox\",\"  5.00%   43.110      1 Template:Fact\",\"  4.82%   41.560      3 Template:Cite_web\"]},\"scribunto\":{\"limitreport-timeusage\":{\"value\":\"0.521\",\"limit\":\"10.000\"},\"limitreport-memusage\":{\"value\":6072062,\"limit\":52428800}},\"cachereport\":{\"origin\":\"mw1194\",\"timestamp\":\"20170201185253\",\"ttl\":2592000,\"transientcontent\":false}}});});</script><script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgBackendResponseTime\":76,\"wgHostname\":\"mw1240\"});});</script>"
## [5] "\t</body>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [6] "</html>"

This is not exactly what we expected. What we wanted to see are headers, paragraphs and tables at the very least. So what happened? Did R corrupt our data?

To answer this question, let’s take a good look at our wiki pages and determine what content we want. From our 800m webpage, we are interested in the two tables (men and women) following the header “Olympic medalist”, and from our marathon page we want the men and women tables under the title “Medal Summary”.

Now, right click on either of the pages and, at the bottom of the pop-up menu, click on “View page source” (or press Ctrl+U on Windows). This will open another window with a URL starting with “view-source:”; this window can be referred to as a source page. The content of this page looks a lot like R’s scraped data, which means R did not corrupt our data; it only shipped in data in a format that we did not expect.

With that, our next logical question is: “Given this format, how can we locate and extract our data (tables)?” To answer this question we need to know what type of data R has scraped. Let’s go back to the “view-source:” web page and look at line 1. We can see the document type is declared as “html”, which stands for “hypertext markup language”. Markup is information and/or instruction added to web content, mostly for structure and display purposes. Basically, this is what distinguishes a header from a paragraph and usual content from content meant to be part of a table, and what determines the layout of different content on the webpage. This markup comes in the form of tags: the angle brackets and the text between them (for example <p>) that you see on the source page.

So somewhere within all we see on the source page is our data, and we must now figure out how to extract it in a form ready for analysis or reporting. In order to do this we need to know a little bit of HTML (just enough to traverse and extract what we need).

Based on this, in our next section we get to understand the basic concepts of HTML from a data extraction point of view (as opposed to a web developer’s point of view, which requires an in-depth understanding).
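
Before going into HTML itself, here is a first, hedged illustration of why markup matters for extraction: because tags label content, base R's grep() can already tell us which lines hold a header and which hold a paragraph. The vector sample_lines below is a small stand-in built from lines quoted elsewhere in this chapter; it is an assumption made so the example runs offline, and the patterns are deliberately simple.

```r
# A few lines of HTML standing in for an imported page
sample_lines <- c(
  "<title>800 metres - Wikipedia</title>",
  "<th scope=\"row\">World</th>",
  "<h2>Contents</h2>",
  "<p></p>"
)

# Lines that open a header (h1 to h6); "[ >]" avoids matching tags such as <th>
header_lines <- grep("<h[1-6][ >]", sample_lines)

# Lines that open a paragraph
paragraph_lines <- grep("<p[ >]", sample_lines)

header_lines
paragraph_lines
```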

10.1.4.1 Understanding HTML

There are three widely used web development languages: XML (Extensible Markup Language), HTML (Hypertext Markup Language) and XHTML (Extensible Hypertext Markup Language). Out of these three, HTML is the most widely used web language.

As you can tell from their names, all three are markup languages, which means they use things called tags to give information about web content (think of markup as a language for a web browser, that is, markup tells a web browser the type of content you have and how to handle it). The core distinguishing feature of these three markup languages is the names given to their tags: for HTML they are clearly defined, but for XML they are not defined. XHTML has both defined and undefined tag names; it is more like a cross breed of XML and HTML (all this will become clear in our next section).

Since we are more likely to come across HTML than the other two languages, and since our Wikipedia web pages are written in HTML, we shall discuss and explore the core concepts of HTML, with the understanding that these concepts can easily be extended to XML and XHTML.

10.1.4.1.1 Basic HTML

There are two things to grasp as far as HTML for data extraction/scraping is concerned. These are elements and their relationships. If you understand these two concepts, then you can easily traverse through imported data.

Let’s take a look at each of these concepts in turn, but take note: HTML code can be what I call general HTML (omissions allowed by the HTML specifications) or strict HTML (parsed HTML). Where necessary, we shall highlight the difference between the two.

10.1.4.1.1.1 HTML Elements

HTML elements are single components that build a web page. These single components (elements) are built with tags, attributes and text.

10.1.4.1.1.2 Tags

Tags are markup which enclose content and carry information about that content, such as its type. There are two types of tags, opening and closing. Both begin with a left angle bracket (<) and end with a right angle bracket (>). For opening tags, in between the angle brackets are a tag name and optional additional information known as attributes. For closing tags, in between the angle brackets are a forward slash (/) and a tag name (the presence of a / indicates a closing tag).

In HTML, tag names are predefined. For example, header tags begin with the letter h and end with a number between 1 and 6 which indicates the header level. Paragraphs have a p tag name, tables have a table tag name, table rows have a tr tag name, and images have an img tag name. Check w3schools to see all the tag names defined by HTML.
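
The anatomy of a tag described above can be checked mechanically with regular expressions. The following is a minimal sketch, not a full HTML tokenizer: it assumes each string holds exactly one tag, and the vector tags is made up for illustration.

```r
# One tag per string (illustrative input)
tags <- c("<h1>", "</h1>", "<p>", "</table>", "<a href=\"https://myWebPage.org\">")

# A closing tag has a forward slash right after the left angle bracket
is_closing <- grepl("^</", tags)

# The tag name is the run of letters/digits after "<" or "</"
tag_name <- sub("^</?([A-Za-z][A-Za-z0-9]*).*$", "\\1", tags)

data.frame(tag = tags, name = tag_name, closing = is_closing,
           stringsAsFactors = FALSE)
```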

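Here are examples of opening and closing tags. Take note, the /*...*/ annotations in these listings are explanatory notes for the reader, not HTML syntax; real HTML comments are written as <!-- Comment -->.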

Opening tags
=============
<h1>      /*This is a level one header*/
<p>       /*This indicates content is a paragraph*/
<table>   /*This begins a table*/
<a>       /*This is used to add links both internal (from the document) and external urls*/

Closing tags
============
</h1>     /*This closes a level one header*/
</p>      /*This marks end of a paragraph*/
</table>  /*This ends a table section*/
</a>      /*This ends a given link*/

10.1.4.1.1.3 Attributes

Attributes are additional information added to opening tags. This additional information could be an identifier (mostly used for styling or display purposes), or description of content type (for example, lang which specifies language used).

Attributes usually come as a name=value pair, for example lang="en"; here “lang” is the attribute name (short for “language”) and “en” is its value, meaning English.

Attributes added to opening tags
================================
<h1 class="center">               /*Attribute "class" is used by multiple html elements to provide uniform styling or identification, in this case all elements with class center can be center aligned*/
<table id="firstTable">           /*Attribute "id" is a unique html object identifier, here it maps location of first table*/
<a href="https://myWebPage.org">  /*Attribute "href" creates a hyperlink to an external url*/

Both tags and attributes are the markup component of an element. They are not visible on a web page, although they are used to change its appearance or make it dynamic/interactive.
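
Since attributes follow the name="value" form, a value can be pulled out of an opening tag with a regular expression. This is a sketch under the assumption that values are wrapped in double quotes, as in the listing above; get_attr() is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: return the value of attribute `name` in each opening
# tag, or NA when the tag does not carry that attribute.
get_attr <- function(tag, name) {
  pattern <- paste0("^.*\\b", name, "\\s*=\\s*\"([^\"]*)\".*$")
  found <- grepl(pattern, tag, perl = TRUE)
  value <- sub(pattern, "\\1", tag, perl = TRUE)
  value[!found] <- NA_character_
  value
}

opening_tags <- c(
  "<h1 class=\"center\">",
  "<table id=\"firstTable\">",
  "<a href=\"https://myWebPage.org\">"
)

ids   <- get_attr(opening_tags, "id")
hrefs <- get_attr(opening_tags, "href")

ids
hrefs
```

Attributes written with single quotes or without quotes are not handled here; a real page may need a more forgiving pattern.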

10.1.4.1.1.4 Text

Text is the visible part of a web page; this is what you see on a web page.

Before we look at examples of HTML elements, let’s discuss types of elements as each type will have different markup and text.
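
To make the split between markup and text concrete, the sketch below removes everything between angle brackets and keeps what is left. The input lines are taken from the Wikipedia output shown in this chapter; the pattern is a simplification that assumes no ">" appears inside an attribute value or a script.

```r
elements <- c(
  "<title>800 metres - Wikipedia</title>",
  "<td><b>1:56.58</b></td>",
  "<p></p>"
)

# Drop every tag (anything from "<" to the next ">"), leaving only the text
text_only <- gsub("<[^>]+>", "", elements)

text_only
```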

10.1.4.1.1.5 Types of HTML element

There are five types of elements (https://www.w3.org/TR/html5/syntax.html#elements-0), two of which are relevant for a web scraper, these are normal elements and void elements.

Normal elements are composed of markup and text. Markup is not visible on a web page; it is composed of tags and attributes which provide details of an element.

Normal Elements
===============
<h1>Level 1 header</h1>
<p>A paragraph</p>
<div>A division/section</div>

Examples of normal elements from 800m wikipedia web page

wiki800mDOM[c(5, 58, 84)] 
## [1] "<title>800 metres - Wikipedia</title>"
## [2] "<th scope=\"row\">World</th>"         
## [3] "<h2>Contents</h2>"

Void elements (also called self-closing elements) are elements which have no text or closing tag; they are basically opening tags. To distinguish these tags from normal elements, strict HTML includes a forward slash right before the closing angle bracket.

Void/Self-Closing
=================
<br/>     /*Used to add line breaks*/
<meta/>   /*Used to add meta data*/
<link/>   /*Used to link external resources such as style sheets*/
<img/>    /*Used to add images*/

Examples of void/self-closing elements from 800m wikipedia web page

wiki800mDOM[c(4, 27)]
## [1] "<meta charset=\"UTF-8\"/>"                                  
## [2] "<link rel=\"dns-prefetch\" href=\"//meta.wikimedia.org\" />"

Normal elements that contain no text are called empty elements. They are worth knowing about because, unlike void elements, they still have an opening and a closing tag even though there is nothing between them.

wiki800mDOM[81]
## [1] "<p></p>"

Ideally, all elements that are not defined by the W3C Recommendation as void/self-closing elements should have a closing tag (even empty elements). However, the same recommendation allows some elements to make certain omissions, one being closing tags. From a data extraction point of view, omitting closing tags causes a problem, as it becomes difficult to map out the beginning and end of an element.

For void/self-closing elements, the recommendation allows omission of the forward slash / right before the end angle bracket (>). These omissions are not good when extracting data, as the forward slash informs us that we are dealing with void/self-closing elements.
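
Because the forward slash may be missing, one hedged way to recognise void elements is by tag name rather than by the slash. In the sketch below, void_names is only a partial list chosen for illustration (the HTML specification defines the full set), and is_void() is a hypothetical helper.

```r
# Partial, illustrative list of void element names
void_names <- c("br", "hr", "img", "input", "link", "meta")

# Hypothetical helper: TRUE when the tag name is in the void list,
# whether or not the tag ends in "/>"
is_void <- function(tag) {
  name <- tolower(sub("^<\\s*([A-Za-z][A-Za-z0-9]*).*$", "\\1", tag))
  name %in% void_names
}

void_check <- is_void(c("<br>", "<meta charset=\"UTF-8\"/>", "<p>"))
void_check
```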

10.1.4.1.1.6 Nesting Tags

Tags can embed other tags as their content; this is referred to as nesting.

Nesting Elements
================
<p><dfn id="myWebPage">My web page is a blog on matters R</dfn></p>
<table>
   <thead><tr><th>Column Head 1</th><th>Column Head 2</th></tr></thead>
   <tbody><tr><td>Cell Input 1</td><td>Cell Input 2</td></tr></tbody>
</table>

An example of nesting from the 800m Wikipedia web page

wiki800mDOM[194]
## [1] "<td><b>1:56.58</b></td>"

10.1.4.1.1.7 Extracted HTML Document

Given the allowances made by the HTML syntax specifications, it is quite possible for an HTML script to be parsed and loaded on a web page while it contains numerous omissions. It is also possible to parse and load error-ridden HTML even though the specification clearly prohibits certain code. For example, one can write text without any markup, such as a script containing only “Hello welcome to my webpage”, and it will still be loaded on a web page.

Basically, HTML is a very relaxed and user-friendly language. There is no such thing as an error in HTML; this is because the W3C HTML specification provides corrections for all errors. These corrections are implemented by HTML parsers, which are configured to construct syntactically correct HTML code before parsing. All browsers have an HTML parser and follow these specifications when parsing HTML scripts, which is why a web page opened in one browser looks the same in all browsers. For example, our wiki pages should look the same on Chrome, Mozilla Firefox, Opera Mini or any other browser. There are a few subtle differences, but they are unlikely to concern us as data extractors.

When an HTML script is parsed, it becomes an HTML document. HTML documents are well formed, as they use the strict HTML syntax which requires all elements that are not void elements to be closed and void elements to include a forward slash before their ending angle bracket. From these HTML documents, browsers use an Application Programming Interface (API) called the Document Object Model (DOM) to create document objects, which are basically used to structure HTML elements. HTML document objects show how each element is related to other elements. It is from these relationships that we will be able to map out the exact location of elements of interest.

Unfortunately, when we read in web data using readLines(), what we receive is the HTML script used to load the page rather than the parsed HTML document. Hence, it is quite possible to extract data that does not follow strict HTML syntax. Creating our own HTML parser is out of the scope of this section, although there are contributed packages which include an HTML parser. Fortunately for us, Wikipedia pages are well formed as far as closing and self-closing tags are concerned, so we can safely proceed.
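
When a page is well formed in this sense, opening and closing table tags can be paired up by position, and the cells in between can be turned into a data frame. The following is only a sketch of that idea, not the extraction used later for the Wikipedia pages: it assumes tables are not nested, that every row sits on its own line, and that cells contain plain text with no further tags. The vector page reuses the placeholder table from the nesting example.

```r
# Stand-in for an imported, well-formed page
page <- c(
  "<body>",
  "<h2>Welcome to my webpage</h2>",
  "<table>",
  "<tr><th>Column Head 1</th><th>Column Head 2</th></tr>",
  "<tr><td>Cell Input 1</td><td>Cell Input 2</td></tr>",
  "</table>",
  "</body>"
)

# Pair each opening <table ...> line with the matching </table> line
start <- grep("<table", page, fixed = TRUE)
end   <- grep("</table>", page, fixed = TRUE)
tables <- Map(function(s, e) page[s:e], start, end)

# Split every row of the first table into its cells and strip the tags
rows  <- grep("<tr", tables[[1]], value = TRUE)
cells <- lapply(rows, function(r) {
  m <- regmatches(r, gregexpr("<t[hd][^>]*>[^<]*</t[hd]>", r))[[1]]
  gsub("<[^>]+>", "", m)
})

# First row holds the column heads, remaining rows hold the data
medal_df <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(medal_df) <- cells[[1]]

medal_df
```

Cells that wrap their text in further tags, such as the <td><b>1:56.58</b></td> line seen earlier, would not be matched by this cell pattern and need a looser one.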

Having discussed HTML elements, we can now discuss how elements relate to one another.

10.1.4.1.1.8 Relationships among Elements

From our preceding discussion, we know HTML documents are parsed HTML scripts which are used to generate an HTML document object. We also mentioned that HTML document objects show relationships between elements. The question now is: how can we describe these relationships?

Relationships among elements are described by how elements are nested. Beginning from the root element, which is <html>, all other elements can be described as children, siblings, descendants or ancestors of one another (html being the top-most parent). The same terms can be used relative to any other element that nests other elements, as it becomes a parent element in turn. This type of relationship is often referred to as a familial relationship.

For example, take the following document object:

<html>
   <head>
      <title>Sample web page</title>
   </head>
   <body>
      <h2>Welcome to my webpage</h2>
   </body>
</html>

Here our root element is html, and it nests four other elements: head, title, body and h2. In terms of familial relationships, we call html the parent element while head and body are its children (and siblings of each other); title and h2 are html's descendants (but children of head and body respectively).
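To make these relationships concrete, the parent of each element can be worked out in base R by walking through the tags in order and keeping track of which elements are still open. This is only an illustrative sketch: it assumes a well-formed document with no void elements, and the object names (doc, open_stack, relations) are made up for this example.

```r
# The sample document from the text, collapsed into one string
doc <- paste0(
  "<html><head><title>Sample web page</title></head>",
  "<body><h2>Welcome to my webpage</h2></body></html>"
)

# Pull out every tag in document order, e.g. "<html>", "</title>"
tags <- regmatches(doc, gregexpr("</?[a-zA-Z][a-zA-Z0-9]*[^>]*>", doc))[[1]]

open_stack <- character(0)  # elements opened but not yet closed
element <- character(0)
parent <- character(0)

for (tag in tags) {
  name <- sub("^</?([a-zA-Z][a-zA-Z0-9]*).*$", "\\1", tag)
  if (startsWith(tag, "</")) {
    # Closing tag: drop the most recently opened element
    open_stack <- open_stack[-length(open_stack)]
  } else {
    # Opening tag: its parent is the element currently on top of the stack
    top <- if (length(open_stack)) open_stack[length(open_stack)] else NA_character_
    element <- c(element, name)
    parent <- c(parent, top)
    open_stack <- c(open_stack, name)
  }
}

relations <- data.frame(element, parent, stringsAsFactors = FALSE)
relations
```

Each row of relations pairs an element with its parent; the root html has no parent, so its entry is NA. Elements sharing the same parent, such as head and body, are siblings.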

These relationships can be displayed in a tree-like structure generated by a Document Object Model (DOM). As web scrapers, our interest in the DOM is these relationships; we can leave aside the other technical details of DOM implementation.

Now that we know what HTML elements are and how we can refer to them using their relationships, let’s look at a basic HTML document just to get the hang of what they look like.

10.1.5 HTML Documents

A basic HTML document is composed of a declaration, elements and comments. The declaration is the first line of code in an HTML document; it is used to indicate the HTML version used (there have been several versions, with HTML5 being the latest). This declaration is referred to as the “Document Type Declaration” (DOCTYPE). Declarations are encased in <! >, for example <!DOCTYPE html> for HTML5. HTML comments, like R comments, provide guiding information; in HTML, they begin with <!-- and end in -->, for instance “<!-- This is a comment -->”.
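When working with lines returned by readLines(), declarations and comments are usually noise that we want to detect or remove before extracting data. The following is a minimal base R sketch; the page vector is a made-up stand-in for real downloaded lines, and the variable names are only illustrative.

```r
# Hypothetical page source, standing in for the output of readLines(url)
page <- c(
  "<!DOCTYPE html>",
  "<html>",
  "<!-- Medal table starts here -->",
  "<body><p>Results</p><!-- inline note --></body>",
  "</html>"
)

# Does the first line carry a document type declaration?
has_doctype <- grepl("^\\s*<!DOCTYPE", page[1], ignore.case = TRUE)

# Join the lines so that a comment spanning several lines is still matched,
# then remove every comment; "(?s)" lets "." match newlines and
# ".*?" stops at the first closing "-->"
source_text <- paste(page, collapse = "\n")
no_comments <- gsub("(?s)<!--.*?-->", "", source_text, perl = TRUE)

has_doctype
cat(no_comments)
```

Collapsing the lines first is a deliberate choice: a comment can start on one line and end on another, and a line-by-line pattern would miss it.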

As in other programming scripts, indentation is used to show structure; in HTML, it shows which elements are nested. For example, in the following HTML snippet, indentation is used to show that the li elements are nested inside ol.

<ol>
  <li>Item one</li>
  <li>Item two</li>
  <li>Item three</li>
</ol>

Using familial relationships, elements at the same indentation level under the same opening tag are siblings, and the nearest enclosing opening tag is their parent.
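As a small illustration of how this structure helps with extraction, the list items above can be pulled out with base R string functions. This is a sketch that assumes each li element sits on its own line and carries no attributes, which holds for this snippet but not for web pages in general.

```r
html <- c(
  "<ol>",
  "  <li>Item one</li>",
  "  <li>Item two</li>",
  "  <li>Item three</li>",
  "</ol>"
)

# Keep only the lines holding an li element
li_lines <- grep("<li>", html, fixed = TRUE, value = TRUE)

# Strip the tags and surrounding whitespace to leave the text
items <- trimws(gsub("</?li>", "", li_lines))
items

# Number of leading spaces on each line, i.e. its indentation level
depth <- nchar(html) - nchar(trimws(html, which = "left"))
depth
```

The three li lines share the same indentation and sit between the opening and closing ol tags, which is what marks them as siblings with ol as their parent.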

Strict HTML requires all HTML documents to have a <head> and a <body> tag as children of the root element html. The head element is meant to give information about the web document; this element, alongside its nested tags, is not visible on a web page. The visible section of a web page begins from the body element.

Before HTML5 came along, it was quite common to find web documents separating different sections with the <div> tag (div means division); this tag would embed other div tags. These divisions were mostly used for layout purposes. Newer web pages using HTML5 use self-describing tags for different sections of a web page, such as the <section> tag for a section.

References

Online web sites

  1. Introduction to HTML: http://www.w3schools.com/html/html_intro.asp
  2. HTML specifications: https://www.w3.org/TR/html5/
  3. Introduction to CSS: http://www.w3schools.com/css/css_intro.asp
  4. Rvest, a web scraping package: https://cran.r-project.org/web/packages/rvest/index.html

More on Nth Pseudo-class expressions{#moreOnNth}

Gauge Yourself

  1. Practice scraping other web pages
  2. Think of other case studies we can use to practice lessons learnt in this book