Chapter 10 Case Studies
Goal
The main goal of this chapter is to apply the lessons learnt in this book through a series of case studies. We begin with one case study; subsequent case studies will be added from readers' suggestions.
Prerequisite
To appreciate this chapter, you must have gone through all the other chapters (1-9).
10.1 Case Study One: Importing online data
10.1.1 Task
Using pure base R (no contributed packages), extract 800m and marathon data from the recently held (Summer) Olympic Games and prepare it for data analysis (create data frames).
There are a number of excellent R packages suitable for this task; however, it is best to use packages once we have some basic idea of how they work.
10.1.2 What we shall cover
In addition to reviewing what we have covered in chapter one, we shall also learn how to:
- Import web data using base R's readLines()
- Read and understand web data (basic HTML)
- Extract web data (using CSS selectors)
10.1.3 Background and Data
Our first case study is all about extracting online data and making it available for analysis. The basic reason for having this as our first case study is that the web is full of (underutilized) data, and you will often find it necessary to use some of it.
To make this interesting, we will look at the recent (2016) Summer Olympic Games.
As we venture into this case study, some of the topics we will review are creating data objects (data frames) and converting data into date-time objects.
The Olympics has quite a number of events, from Archery all the way to Wrestling. For this exercise, our interest is Athletics, and more specifically the 800m and Marathon events.
We will scrape our data from Wikipedia: https://en.wikipedia.org/wiki/800_metres and https://en.wikipedia.org/wiki/Marathon_at_the_Olympics. From these two pages, we will look for and extract the tables giving Olympic medal standings for each event as of 2016, disaggregated by gender.
10.1.4 Web scraping
The simplest part of web scraping (data extraction) is the actual data importation. This is because, in addition to base R's readLines() function, there are contributed packages (from CRAN and other repositories) with functions capable of achieving this task.
To learn more about what is available in terms of contributed packages, read the CRAN task view at https://cran.r-project.org/web/views/WebTechnologies.html. However, as mentioned before, for this chapter we will try as much as possible to use only base R functions, as we want to get a good understanding of the concepts right from the beginning. This will not only help us understand how contributed packages like XML and rvest work, but will also ensure we can still extract and traverse web data when we do not have access to these packages.
With that said, our first task is to download/scrape/import our data.
url1 <- "https://en.wikipedia.org/wiki/800_metres"
url2 <- "https://en.wikipedia.org/wiki/Marathon_at_the_Olympics"
wiki800mDOM <- readLines(url1)
wikiMarathonDOM <- readLines(url2)
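One small aside: readLines() will sometimes warn that a page is missing a final end-of-line character. The warning is harmless, but if you prefer silence, the function's warn argument can switch it off (a minor option, not required for our two pages):
# readLines() warns when the last line lacks an end-of-line character;
# warn = FALSE suppresses that (harmless) warning
wiki800mDOM     <- readLines(url1, warn = FALSE)
wikiMarathonDOM <- readLines(url2, warn = FALSE)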
As you have seen, downloading data is rather easy. Now let’s find out what data structure is created when we import web data.
str(wiki800mDOM)
## chr [1:2545] "<!DOCTYPE html>" ...
str(wikiMarathonDOM)
## chr [1:1185] "<!DOCTYPE html>" ...
We have character vectors; now let's take a closer look at these characters.
head(wiki800mDOM)
## [1] "<!DOCTYPE html>"
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"
## [3] "<head>"
## [4] "<meta charset=\"UTF-8\"/>"
## [5] "<title>800 metres - Wikipedia</title>"
## [6] "<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, \"$1client-js$2\" );</script>"
tail(wiki800mDOM)
## [1] "\t\t\t\t\t\t\t\t\t</ul>"
## [2] "\t\t\t\t\t\t<div style=\"clear:both\"></div>"
## [3] "\t\t</div>"
## [4] "\t\t<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgPageParseReport\":{\"limitreport\":{\"cputime\":\"1.264\",\"walltime\":\"1.440\",\"ppvisitednodes\":{\"value\":33739,\"limit\":1000000},\"ppgeneratednodes\":{\"value\":0,\"limit\":1500000},\"postexpandincludesize\":{\"value\":353879,\"limit\":2097152},\"templateargumentsize\":{\"value\":69217,\"limit\":2097152},\"expansiondepth\":{\"value\":12,\"limit\":40},\"expensivefunctioncount\":{\"value\":0,\"limit\":500},\"entityaccesscount\":{\"value\":0,\"limit\":400},\"timingprofile\":[\"100.00% 948.704 1 -total\",\" 55.78% 529.208 132 Template:FlagIOCathlete\",\" 49.93% 473.653 264 Template:Country_alias\",\" 36.61% 347.308 1 Template:Olympic_medalists_in_men's_800_metres\",\" 21.22% 201.271 1 Template:Olympic_medalists_in_women's_800_metres\",\" 16.09% 152.657 281 Template:Flagathlete\",\" 5.99% 56.835 1 Template:Reflist\",\" 5.98% 56.735 1 Template:World_Championships_in_Athletics_medalists_in_men's_800_metres\",\" 4.96% 47.051 7 Template:Cite_web\",\" 4.92% 46.718 281 Template:Country_flagbio\"]},\"scribunto\":{\"limitreport-timeusage\":{\"value\":\"0.404\",\"limit\":\"10.000\"},\"limitreport-memusage\":{\"value\":3818602,\"limit\":52428800}},\"cachereport\":{\"origin\":\"mw1274\",\"timestamp\":\"20170201155806\",\"ttl\":2592000,\"transientcontent\":false}}});});</script><script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgBackendResponseTime\":73,\"wgHostname\":\"mw1210\"});});</script>"
## [5] "\t</body>"
## [6] "</html>"
head(wikiMarathonDOM)
## [1] "<!DOCTYPE html>"
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"
## [3] "<head>"
## [4] "<meta charset=\"UTF-8\"/>"
## [5] "<title>Marathons at the Olympics - Wikipedia</title>"
## [6] "<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, \"$1client-js$2\" );</script>"
tail(wikiMarathonDOM)
## [1] "\t\t\t\t\t\t\t\t\t</ul>"
## [2] "\t\t\t\t\t\t<div style=\"clear:both\"></div>"
## [3] "\t\t</div>"
## [4] "\t\t<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgPageParseReport\":{\"limitreport\":{\"cputime\":\"0.960\",\"walltime\":\"1.051\",\"ppvisitednodes\":{\"value\":10804,\"limit\":1000000},\"ppgeneratednodes\":{\"value\":0,\"limit\":1500000},\"postexpandincludesize\":{\"value\":82702,\"limit\":2097152},\"templateargumentsize\":{\"value\":12153,\"limit\":2097152},\"expansiondepth\":{\"value\":11,\"limit\":40},\"expensivefunctioncount\":{\"value\":1,\"limit\":500},\"entityaccesscount\":{\"value\":0,\"limit\":400},\"timingprofile\":[\"100.00% 861.977 1 -total\",\" 67.56% 582.383 284 Template:Country_alias\",\" 52.65% 453.827 55 Template:FlagIOCteam\",\" 21.87% 188.490 32 Template:FlagIOCathlete\",\" 17.98% 155.002 1 Template:Olympic_medalists_in_the_women's_marathon\",\" 7.09% 61.146 1 Template:Reflist\",\" 5.89% 50.739 1 Template:Infobox_Olympic_athletics_event\",\" 5.64% 48.629 1 Template:Infobox\",\" 5.00% 43.110 1 Template:Fact\",\" 4.82% 41.560 3 Template:Cite_web\"]},\"scribunto\":{\"limitreport-timeusage\":{\"value\":\"0.521\",\"limit\":\"10.000\"},\"limitreport-memusage\":{\"value\":6072062,\"limit\":52428800}},\"cachereport\":{\"origin\":\"mw1194\",\"timestamp\":\"20170201185253\",\"ttl\":2592000,\"transientcontent\":false}}});});</script><script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgBackendResponseTime\":76,\"wgHostname\":\"mw1240\"});});</script>"
## [5] "\t</body>"
## [6] "</html>"
This is not exactly what we expected. What we wanted to see were headers, paragraphs and tables, at the very least. So what happened: did R corrupt our data?
To answer this question, let's take a good look at our wiki pages and determine what content we want. From our 800m web page, we are interested in the two tables (men and women) following the header “Olympic medalists”, and from our marathon page we want the men's and women's tables under the title “Medal Summary”.
Now, right click on either of the pages, scroll to the bottom of the pop-up menu and click on “View page source” (or press Ctrl+U on Windows). This will open another window with a URL starting with “view-source:”; this window can be referred to as a source page. The content of this page looks a lot like R's scraped data, which means R did not corrupt our data; it only shipped the data in a format that we did not expect.
With that, our next logical question is: “Given this format, how can we locate and extract our data (tables)?” To answer this question, we need to know what type of data R has scraped. Let's go back to the “view-source:” web page and look at line 1. We can see the document type is declared as “html”, which stands for “hypertext markup language”. Markup is information and/or instructions added to web content, mostly for structure and display purposes. Basically, this is what distinguishes a header from a paragraph, and ordinary content from content meant to be part of a table, as well as the layout of different content on the web page. This markup comes in the form of tags: angle brackets enclosing a tag name, for example <p> for a paragraph.
So somewhere within all we see on the source page is our data, and we must now figure out how to extract it in a form ready for analysis or reporting. To do this, we need to know a little bit of HTML (just enough to traverse and extract what we need).
Based on this, in the next section we will get to understand HTML from a data extraction point of view, that is, the basic concepts (as opposed to a web developer's point of view, which requires an in-depth understanding).
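Before we dive into HTML, here is a small taste of where we are heading. Since our scraped pages are just character vectors, base R's grep() can already locate promising lines, such as those mentioning the headers above our target tables (a sketch; the line numbers returned will change whenever Wikipedia edits the pages):
# Which lines mention our target headers?
grep("Olympic medalist", wiki800mDOM, ignore.case = TRUE)
grep("medal summary", wikiMarathonDOM, ignore.case = TRUE)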
10.1.4.1 Understanding HTML
There are three widely used web development languages: XML (Extensible Markup Language), HTML (Hypertext Markup Language) and XHTML (Extensible Hypertext Markup Language). Out of these three, HTML is the most widely used web language.
As you can tell from their names, all three are markup languages, which means they use things called tags to give information about web content (think of markup as a language to a web browser, that is, markup tells a web browser the type of content you have and how to handle it). The core distinguishing feature of these three markup languages is the names given to their tags: for HTML, they are clearly defined, but for XML, they are not defined. XHTML has both defined and undefined tag names; it is more like a cross-breed of XML and HTML (all this will become clear in our next section).
Since we are more likely to come across HTML than the other two languages, and since our Wikipedia web pages are written in HTML, we shall discuss and explore the core concepts of HTML, with the understanding that these concepts can easily be extended to XML and XHTML.
10.1.4.1.1 Basic HTML
There are two things to grasp as far as HTML for data extraction/scraping is concerned: elements and their relationships. If you understand these two concepts, then you can easily traverse through imported data.
Let's take a look at each of these concepts in turn, but take note: HTML code can be what I call general HTML (omissions allowed by HTML specifications) or strict HTML (parsed HTML). Where necessary, we shall highlight differences between the two.
10.1.4.1.1.1 HTML Elements
HTML elements are the single components that build a web page. These single components (elements) are built with tags, attributes and text.
10.1.4.1.1.3 Attributes
Attributes are additional information added to opening tags. This additional information could be an identifier (mostly used for styling or display purposes) or a description of the content type (for example, lang, which specifies the language used).
Attributes usually come in name=value pairs, for example lang="en"; here “lang” is the attribute name (short for “language”) and “en” is its value, meaning English.
Attributes added to opening tags
================================
<h1 class="center"> /*Attribute "class" is used by multiple html elements to provide uniform styling or identification, in this case all elements with class center can be center aligned*/
<table id="firstTable"> /*Attribute "id" is a unique html object identifier, here it maps location of first table*/
<a href="https://myWebPage.org"> /*Attribute "href" creates a hypelink to an external url*/
Both tags and attributes are the markup component of an element; they are not visible on a web page, although they are used to change its appearance or make it dynamic/interactive.
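Because attributes follow a predictable name="value" pattern, base R regular expressions can pull them out of scraped lines. Here is a minimal sketch using a made-up example element (only regexpr(), regmatches() and sub() from base R):
# A made-up element for illustration
line <- '<a href="https://myWebPage.org" class="external">My web page</a>'
# Extract the whole name="value" pair ...
href <- regmatches(line, regexpr('href="[^"]*"', line))
href
# ... then keep just the value
sub('href="([^"]*)"', "\\1", href)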
10.1.4.1.1.4 Text
Text is the visible part of a web page; this is what you see on a web page.
Before we look at examples of HTML elements, let’s discuss types of elements as each type will have different markup and text.
10.1.4.1.1.5 Types of HTML element
There are five types of elements (https://www.w3.org/TR/html5/syntax.html#elements-0), two of which are relevant for a web scraper: normal elements and void elements.
Normal elements are composed of markup and text. Markup is not visible on a web page; it is composed of tags and attributes, which provide details of an element.
Normal Elements
===============
<h1>Level 1 header</h1>
<p>A paragraph</p>
<div>A division/section</div>
Examples of normal elements from the 800m Wikipedia web page:
wiki800mDOM[c(5, 58, 84)]
## [1] "<title>800 metres - Wikipedia</title>"
## [2] "<th scope=\"row\">World</th>"
## [3] "<h2>Contents</h2>"
Void elements (also called self-closing elements) are elements which have no text or closing tag; they are basically opening tags. To distinguish these tags from normal elements, strict HTML includes a forward slash right before the closing angle bracket.
Void/Self-Closing
================
<br/> /*Used to add line breaks*/
<meta/> /*Used to add meta data*/
<link/> /*Used to link external resources such as stylesheets*/
<img/> /*Used to add images*/
Examples of void/self-closing elements from the 800m Wikipedia web page:
wiki800mDOM[c(4, 27)]
## [1] "<meta charset=\"UTF-8\"/>"
## [2] "<link rel=\"dns-prefetch\" href=\"//meta.wikimedia.org\" />"
There are also normal elements, called empty elements, which are important to know about because they contain markup but no text:
wiki800mDOM[81]
## [1] "<p></p>"
Ideally, all elements that are not defined by the W3C Recommendation as void/self-closing elements should have a closing tag (even empty elements). However, the same recommendations allow some elements to make certain omissions, one being closing tags. From a data extraction point of view, omitted closing tags cause a problem, as it becomes difficult to map out the beginning and end of an element.
For void/self-closing elements, the recommendations allow omission of the forward slash (/) right before the end angle bracket (>). These omissions are not good when extracting data, as the forward slash informs us we are dealing with void/self-closing elements.
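On our pages each element sits on at most one line, so a rough way to spot strict-HTML void elements is to look for lines ending in "/>" (a sketch; it would miss void elements written without the slash):
# Lines that end in "/>" look like strict-HTML void/self-closing elements
head(grep("/>$", wiki800mDOM, value = TRUE), 3)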
10.1.4.1.1.6 Nesting Tags
Tags can embed other tags as their content; this is referred to as nesting.
Nesting Elements
================
<p><dfn id ="myWebPage>My web page is a blog on matters R</dfn></p>
<table>
  <thead><tr><th>Column Head 1</th><th>Column Head 2</th></tr></thead>
  <tbody><tr><td>Cell Input 1</td><td>Cell Input 2</td></tr></tbody>
</table>
wiki800mDOM[194]
## [1] "<td><b>1:56.58</b></td>"
10.1.4.1.1.7 Extracted HTML Document
Given the allowances made by HTML syntax specifications, it is quite possible for an HTML script to be parsed and loaded as a web page while it contains numerous omissions. It is also possible to parse and load error-prone HTML even though the specifications clearly prohibit certain code. For example, one can easily write text without any markup and it will still be loaded on a web page, say a script containing only “Hello welcome to my webpage”.
Basically, HTML is a very relaxed and user-friendly language. There is nothing like an error in HTML, because the W3C HTML specification provides corrections for all errors. These corrections are implemented by HTML parsers, as they are configured to construct syntactically correct HTML code before parsing. All browsers have an HTML parser and follow these specifications when parsing HTML scripts; that is why a web page opened in one browser looks the same in all browsers. For example, our wiki pages should look the same in Chrome, Mozilla Firefox, Opera Mini or any other browser. However, there are a few subtle differences, which need not concern us as data extractors.
When an HTML script is parsed, it becomes an HTML document. HTML documents are well formed as they use the strict HTML syntax, which requires all elements that are not void elements to be closed, and void elements to include a forward slash before their ending angle bracket. From these HTML documents, browsers use an Application Programming Interface (API) called the Document Object Model (DOM) to create document objects, which are basically used to structure HTML elements. HTML document objects show how each element is related to other elements. It is from these relationships that we will be able to map out the exact location of elements of interest.
Unfortunately, when we read in web data using readLines(), what we receive is the HTML script used to load the page rather than the parsed HTML document. Hence, it is quite possible to extract data that does not follow strict HTML syntax. Creating our own HTML parser is out of the scope of this section (there are also contributed packages which provide HTML parsers). Fortunately for us, Wikipedia pages are well formed as far as closing and self-closing tags are concerned, so we can safely proceed.
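One quick, admittedly rough, way to reassure ourselves is to check that opening and closing table tags balance. Note that grepl() flags lines containing a tag rather than counting the tags themselves, hence only a rough check:
# Opening and closing <table> tags should balance on a well-formed page
sum(grepl("<table", wiki800mDOM))
sum(grepl("</table>", wiki800mDOM))
sum(grepl("<table", wikiMarathonDOM))
sum(grepl("</table>", wikiMarathonDOM))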
Having discussed HTML elements, we can now discuss how elements relate to one another.
10.1.4.1.1.8 Relationships among Elements
From our preceding discourse, we know HTML documents are parsed HTML scripts which are used to generate an HTML document object. We also mentioned that HTML document objects show the relationships between elements; the question now is: how can we describe these relationships?
Relationships among elements are described by how elements are nested. Beginning from the root element, which is <html>, all other elements can be described as either a child, sibling, descendant or ancestor (html being the parent). The same description can be applied to any other element which becomes a parent element. This type of relationship is often referred to as a familial relationship.
For example, take the following document object:
<html>
  <head>
    <title>Sample web page</title>
  </head>
  <body>
    <h2>Welcome to my webpage</h2>
  </body>
</html>
Here we have our root element, which is html; this element nests four other elements, that is, head, title, body, and h2. In terms of familial relationships, we call html the parent element while head and body are its children; title and h2 are html's descendants (but children of head and body respectively).
These relationships can be displayed in a tree-like structure generated by a Document Object Model (DOM). As web scrapers, our interest in the DOM is the relationships; we can leave all the other technical bits of DOM implementation alone.
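To make the idea concrete in R terms, a nested list mirrors such a tree quite naturally. This is only a mental model (not how browsers actually implement the DOM):
# A sketch: the sample document above as a nested R list
page <- list(html = list(
  head = list(title = "Sample web page"),
  body = list(h2 = "Welcome to my webpage")
))
str(page)   # str() prints the parent-child structure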
Now that we know what HTML elements are and how we can refer to them using their relationships, let's look at a basic HTML document just to get the hang of what they look like.
10.1.5 HTML Documents
A basic HTML document is composed of a declaration, elements and comments. The declaration is the first line of code in an HTML document; it is used to indicate the HTML version used (there are currently six versions, with HTML5 being the latest). This declaration is referred to as the “Document Type Declaration” (DTD). Declarations are encased in <! >, for example <!DOCTYPE html> for HTML5. HTML comments, like R comments, provide guiding information; in HTML, they begin with <!-- and end with -->, for instance <!-- This is a comment -->.
Like all other programming scripts, indentation is used to show structure; in HTML, indentation is used to show nested elements. For example, in the following HTML snippet, indentation is used to show that the li elements are nested elements of ol.
<ol>
  <li>Item one</li>
  <li>Item two</li>
  <li>Item three</li>
</ol>
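Nesting like this is exactly what we will exploit when extracting data. For instance, the items of this list can be recovered with a couple of base R calls (a sketch):
# The snippet above as a character vector, one line per element
snippet <- c("<ol>",
             "  <li>Item one</li>",
             "  <li>Item two</li>",
             "  <li>Item three</li>",
             "</ol>")
items <- grep("<li>", snippet, value = TRUE)   # keep only the li elements
trimws(gsub("<[^>]+>", "", items))             # strip tags and indentation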
Using familial relationships, elements at the same level of indentation are siblings, and the immediately enclosing opening tag is their parent.
Strict HTML requires all HTML documents to have a <head> and a <body> tag as children of the root element html. The head element is meant to give information about the web document; this element, alongside its nested tags, is not visible on a web page. The visible section of a web page begins from the body element.
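We can see this division in our own scraped data: everything before the <body> element is metadata about the page, and the visible content follows it (the line number returned is, as always, subject to Wikipedia's edits):
# Where does the visible content of our scraped 800m page begin?
grep("<body", wiki800mDOM)[1]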
References
Online web sites
- Introduction to HTML: http://www.w3schools.com/html/html_intro.asp
- HTML specifications: https://www.w3.org/TR/html5/
- Introduction to CSS: http://www.w3schools.com/css/css_intro.asp
- rvest, a web scraping package: https://cran.r-project.org/web/packages/rvest/index.html
More on Nth Pseudo-class expressions
- How nth-child works by CSS tricks: https://css-tricks.com/how-nth-child-works/
- Understanding nth-child pseudo-class expressions: https://www.sitepoint.com/web-foundations/understanding-nth-child-pseudo-class-expressions/
Gauge Yourself
- Practice scraping other web pages
- Think of other case studies we can use to practice lessons learnt in this book