Introduction to Data Analysis and Graphics in R

Hellen Gakuruh
2017-04-03

Slide 5: Graphics in R

Outline

What we will cover:

  • Introduction
  • High level plotting functions
  • Low level plotting functions
  • Interacting with graphics
  • Modifying a graph

  • Plotting dichotomous and categorical variables
  • Plotting ordinal variables
  • Plotting continuous variables

Introduction

  • R is renowned for its plotting facilities; not only does it offer all the well-known graphs, it also makes it possible to build entirely new types of graph
  • There are three well-known graphics systems in R: “base graphics”, “grid graphics” (often used through package lattice) and “ggplot2”
  • When a plot is first requested in an interactive session, R opens a graphics device: X11() on Unix, windows() on Windows and quartz() on macOS
  • Plotting functions fall under three types of commands: high-level, low-level and interactive
  • Plots can be customized with “graphical parameters”

High level plotting functions

  • They are designed to generate a complete plot with axes, labels and titles unless they are suppressed (with graphical parameters)
  • They start a new plot
  • The core plotting function in R is plot()
  • plot() is a generic function: it produces different plots depending on the type/class of its first argument (hence, its output depends on class(object))

Expected output of "plot()"

  • If only “x” is given:
    • if it is a time series object (class = ts), a line plot is produced; otherwise, if it is numeric, a scatter plot of x against its index is produced
    • if class(x) == "factor", a bar plot is produced
    • if class(x) == "character", an error is raised, as plot() needs finite values to set up the plotting window
  • If two variables are given and both are numeric, the output is a scatter plot

Expected output of "plot()" cont.

  • If a factor and then a numeric vector are given, box plots are produced (one per factor level)
  • If both vectors are factors, a spine plot is produced (a stacked bar plot whose bar widths are proportional to counts)
  • If the object passed is not a vector but a matrix, data frame or list, plot() chooses a display based on the type of its elements
  • We produce a few of these as examples using plain plot(obj), without changing or giving other arguments
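
The dispatch rules above can be tried on small made-up vectors. This is a minimal sketch: the seed, the data and the names x_num, x_fac, x_chr and res are illustrative assumptions, and tryCatch() is used only so that the character case reports its error instead of stopping the script.

```r
set.seed(1)                                  # assumed seed, for reproducibility only
x_num <- rnorm(20)                           # numeric vector
x_fac <- factor(sample(c("Y", "N"), 20, replace = TRUE))

plot(x_num)          # numeric: scatter plot against index
plot(x_fac)          # factor: bar plot of level counts
plot(x_fac, x_num)   # factor, then numeric: one box plot per level

# Character input: catch the error instead of stopping the script
x_chr <- c("a", "b", "c")
res <- suppressWarnings(
  tryCatch(plot(x_chr), error = function(e) conditionMessage(e))
)
res                  # the error message, as a character string
```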

Time series object

ts <- ts(rnorm(12, 50), start = 1, end = 12, frequency = 1)
class(ts)
[1] "ts"

plot(ts)

plot of chunk timeseries2-05

Numeric vector

num <- rnorm(12, 50)
class(num)
[1] "numeric"

plot(num)

plot of chunk numeric2-05

Factor vector

fac <- factor(sample(c("Y", "N"), 100, replace = TRUE, prob = c(0.7, 0.3)))
class(fac)
[1] "factor"

plot(fac)

plot of chunk factor2-05

Two numeric vectors

num2 <- rnorm(12, 88)
class(num2)
[1] "numeric"

plot(num, num2)

plot of chunk twonum2-05

Factor and numeric vector

set.seed(5)
num3 <- rnorm(100, 88)
class(num3)
[1] "numeric"

plot(fac, num3)

plot of chunk facnum2-05

Two factor vectors

fac2 <- factor(sample(c("F", "M"), 100, replace = TRUE, prob = c(0.8, 0.2)))
class(fac2)
[1] "factor"

plot(fac, fac2)

plot of chunk twofac2-05

Summary

  • In all these plots, axes and labels (but no title) are supplied and some also use color; this makes them communicative
  • However, they might not meet aesthetic requirements; this can be changed by passing other arguments, including ones that suppress axes

Other arguments to "plot"

  • The type of plot produced by plot() depends on its first argument (and “y”), but how it is drawn depends on values passed to other arguments
  • Plot type can also be changed with argument “type” (for example "p" for points, "l" for lines, "b" for both), though only when it makes sense for the data
  • “xlim” and “ylim” define the x and y limits (minimum and maximum axis values); change them when, for example, a bit more padding is needed

Other arguments to "plot" cont.

  • For customized axes, such as log scales, default axes can be suppressed with axes = FALSE and drawn with axis()
  • To annotate a plot with additional graphical parameters, pass them as arguments to high- and low-level plotting functions or make a call to par(); more on this later (read ?par)
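
As an illustration of these arguments, the sketch below uses “type”, “xlim”/“ylim” and suppressed axes on hypothetical data. The seed, the data and the tick positions are assumptions chosen for the example, not values from the slides.

```r
set.seed(2)                        # assumed seed
x <- 1:20
y <- exp(x / 4) + runif(20)        # hypothetical, strictly positive data

# Lines and points together, with padded axis limits
plot(x, y, type = "b", xlim = c(0, 21), ylim = c(0, max(y) * 1.1))

# Log-scaled y axis drawn by hand: suppress default axes, then add our own
plot(x, y, log = "y", axes = FALSE)
axis(1)                                      # default x axis
ticks <- c(1, 10, 100)
axis(2, at = ticks, labels = ticks, las = 1) # custom tick positions
box()                                        # frame around the plot region
```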

Other High-level plots

  • hist() for histograms (univariate continuous distributions)
  • boxplot() for box-and-whiskers plot (for univariate numerical variables alone or categorised by a categorical variable)
  • barplot() for bar plots (for categorical distribution)
  • pie() for pie chart (for categorical distribution)
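
A minimal sketch of these four functions on made-up data (the vectors grp and val are hypothetical). Each call starts a new plot, as high-level functions do.

```r
grp <- factor(c("a", "b", "a", "c", "b", "a"))   # hypothetical categories
val <- c(2.1, 3.4, 2.8, 5.0, 3.9, 2.5)           # hypothetical measurements

h <- hist(val)            # histogram; also returns the bin counts invisibly
boxplot(val ~ grp)        # one box per category
barplot(table(grp))       # bar heights are category counts
pie(table(grp))           # slice angles are category shares
```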

Low level plotting functions

  • These functions add more information to an existing plot
  • Used to customize plots
  • Some of the most frequently used functions are: points(), lines(), text(), title(), abline(), polygon(), legend() and axis()
  • We use some of these when plotting some of the example distributions
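
The sketch below builds one plot in layers, starting from an empty high-level frame and adding to it with low-level functions. Data, seed and label text are assumptions made up for the illustration.

```r
set.seed(3)                                  # assumed seed
x <- 1:10
y <- 2 * x + rnorm(10)                       # hypothetical data

plot(x, y, type = "n")                       # high-level call: empty frame with axes
points(x, y, pch = 19)                       # add the observations
fit <- lm(y ~ x)
abline(fit, lty = 2)                         # add fitted straight line
text(x[10], y[10], "last point", pos = 2)    # label one observation
legend("topleft", legend = c("data", "fit"), pch = c(19, NA), lty = c(NA, 2))
title("Low-level functions on top of plot()")
```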

Interacting with graphics

  • Interaction means extracting information from, or adding it to, a plot using a mouse (rather than typing in data to plot)
  • Two functions for interaction in R are locator() and identify()
  • locator(n, type): select up to “n” points with the left mouse button; it returns a list with two components, x and y, and if “type” is specified it also plots over the selected points
  • locator() is particularly handy for positioning legends and labels, e.g. text(locator(1), "Outlier", adj = 0)

Interacting with graphics cont.

  • identify(x, y, labels) labels the point defined by x and y that is nearest to each click of the left mouse button
  • It can be used to identify particular points, such as outliers, and label them
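
A minimal sketch of both functions, assuming a made-up data set with one extreme value. The interactive calls are wrapped in interactive() so the script still runs in batch mode; how a selection is ended depends on the graphics device in use.

```r
x <- c(1, 2, 3, 4, 20)                  # hypothetical data with one extreme value
y <- c(2, 4, 6, 8, 5)
labs <- paste0("obs", seq_along(x))     # hypothetical point labels

plot(x, y)
if (interactive()) {
  # Click near points to label them, then end the selection
  picked <- identify(x, y, labels = labs)
  # Click once where the annotation should start
  text(locator(1), "Outlier", adj = 0)
}
```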

Demonstration on interacting with graphics

Graphical parameters "par()"

  • Almost every aspect of a plot can be customized with graphical parameters
  • Graphical parameters come as “name = value” pairs, each with a default value
  • To see current values, call par() for the complete list
  • For a specific parameter, name it in the call, par("parameter"), e.g. par("mfrow")
  • Parameters can be changed globally with par() (not recommended) or individually, as arguments in a plotting call
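
The difference between a global and an individual change can be sketched as follows. par() returns the previous values of whatever it sets, so saving its result makes a global change easy to undo; the parameter values used here are arbitrary choices for illustration.

```r
par("mfrow")                           # query one parameter

# Global change: par() returns the old values so they can be restored
op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(1:10)
plot(10:1)
par(op)                                # restore previous settings

# Individual change: applies to this call only
plot(1:10, col = "blue", pch = 19, las = 1)
```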

Plotting dichotomous and categorical variables

  • Plotting of any distribution depends on whether it's univariate (one variable), bi-variate (two variables) or multi-variate
  • Plots for univariate categorical variables (dichotomous included) are:
    • Pie charts (for few values e.g. 2)
    • Bar plots, and
    • Cleveland's dot plots

Plotting dichotomous and categorical variables cont.

  • Bi-variate plots
    • Stacked/side-by-side bar plots
    • Four-fold display
  • Multi-variate plots
    • Mosaic
    • Four-fold plots

Pie chart

  • Suitable when there are few categories
  • Useful for showing percentages
  • Highly discouraged because angles are hard to judge by eye; in addition, it uses a lot of ink

plot of chunk pie1-05

Pie chart example

set.seed(5)
response <- sample(c("Yes", "No"), 300, replace = TRUE, prob = c(0.68, 0.32))
tab_response <-  table(response)
pie(tab_response, col = c("#99CCFF", "#6699CC"))
labs <- paste0("(", round(as.vector(prop.table(tab_response)*100)), "%)")
text(x = c(0.78, -0.50), y = c(0.80, -1), labels = c(labs[1], labs[2]))

Bar plot

  • Consists of a sequence of rectangular bars whose heights are the values supplied
  • Ideally, bars should be ordered by frequency rather than by bar label
  • Sometimes discouraged because of its low data:ink ratio (an alternative is Cleveland's dot plot)

plot of chunk barplot1-05

Bar plot cont.

barplot(sort(tab_response, decreasing = TRUE), las = 1, col = c("#6699CC", "#99CCFF"))
title("Bar chart", xlab = "Response", ylab = "Frequency")

Cleveland's dot plot

  • An alternative to bar chart (higher data:ink ratio, as it uses less ink)
  • As an example, generate a “Cleveland's dot plot” of the following data set; it should:
    • be titled “Total students trained by quarters (2016)”
    • have an x axis titled “Total students trained”
    • have a sub-title “Data Mania Inc.” (grey-blue and italic)
    • have a y axis titled “Quarters”, labelled with the months in order (Mar, Jun, Sep and Dec), and
    • have blue colored points

Cleveland's dot plot

  • Example data: hypothetical number of students trained in each quarter of 2016
set.seed(5)
months <- sample(month.abb[c(3, 6, 9, 12)], size = 300, replace = TRUE)
tab_months <- table(months)[c("Mar", "Jun", "Sep", "Dec")] 
tab_months
months
Mar Jun Sep Dec 
 81  78  60  81 

Cleveland's dot plot

plot of chunk cleveland1-05

dotchart(as.numeric(tab_months), xlab = "Total students trained", ylab = "Quarters", bg = 4)
title("Total students trained by quarters (2016)", sub = "Data Mania Inc.", font.sub = 3, col.sub = "#6699CC", cex.sub = 0.9)
axis(2, at = 1:4, labels = names(tab_months), las = 2)

Bi-variate stacked/side-by-side bar plots and dot plot

  • Following the earlier example, generate a stacked/side-by-side bar plot and a bi-variate Cleveland's dot plot
  • Second variable added: gender composition of students trained

Bi-variate stacked/side-by-side bar plots and dot plot cont.

set.seed(5)
gender <- sample(c("Female", "Male"), 300, TRUE, c(0.7, 0.3))
monthgen_tab <- table(gender, months)[, c("Dec", "Sep", "Jun", "Mar")]
monthgen_tab
        months
gender   Dec Sep Jun Mar
  Female   0  49  78  81
  Male    81  11   0   0

Bi-variate stacked/side-by-side bar plots and dot plot cont.

plot of chunk bivabarplot1-05

barplot(monthgen_tab, col = c("#6699CC", "#99CCFF"), beside = TRUE)
legend("topright", legend = c("Female", "Male"), pch = 22 , pt.bg = c("#6699CC", "#99CCFF"), xpd = TRUE, cex = 0.75)
title("Students trained by gender and month (2016)", xlab = "Month", ylab = "Number trained", sub = "Data Mania Inc.", cex.sub = 0.9, col.sub = "#6699CC", font.sub = 3)

Bivariate Cleveland's dot plot

plot of chunk bivardotplot1-05

dotchart(as.matrix(monthgen_tab)[, c("Mar", "Jun", "Sep", "Dec")], bg = 4, xlab = "Total number of students trained")
title("Total students trained by gender and month", sub = "Data Mania Inc.", font.sub = 3, cex.sub = 0.9, col.sub = "#6699CC")
title(ylab = "Gender and month", line = 2.5)

Four-fold plots

  • Used to display association (or lack of it)
  • Designed for two binary variables (2 x 2 tables); these can be stratified by a third categorical variable with k levels (2 x 2 x k tables)
  • An association is indicated when diagonally opposite cells in one direction differ in size from those in the other direction
  • Color is used to show this direction

Four-fold plots cont.

  • Rings around each quarter circle are confidence rings; if the rings of adjacent quadrants overlap, the data are consistent with \( H_0 \): no association
  • Example data: R's “Titanic” data (passengers only, crew excluded)
# Convert Titanic data
titanic_passengers <- colSums(Titanic[-4,,,])
titanic_passengers
, , Survived = No

        Age
Sex      Child Adult
  Male      35   659
  Female    17   106

, , Survived = Yes

        Age
Sex      Child Adult
  Male      29   146
  Female    28   296

Four-fold for Titanic Passengers

plot of chunk fourfold1-05

# Plotting four fold plot
fourfoldplot(titanic_passengers, std = "margins")
  • Plot shows an association (rings do not overlap and diagonally opposite cells differ in size) between Titanic passengers' age (Child/Adult) and sex (Male/Female) within each survival status (No/Yes)
  • A four-fold plot differs from a pie chart: it varies the radius while holding the angle constant, whereas a pie chart varies the angle while holding the radius constant
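
The association shown by each four-fold panel can also be summarised numerically with an odds ratio, where a value of 1 corresponds to no association. The sketch below is an addition to the slides: odds_ratio() is a hypothetical helper written for this illustration, applied to the passenger table built in the example above.

```r
# Passenger-only table, built as in the example above
titanic_passengers <- colSums(Titanic[-4, , , ])

# Odds ratio of a 2 x 2 table: (a * d) / (b * c)
odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

# One odds ratio per survival stratum (third dimension)
or_by_survival <- apply(titanic_passengers, 3, odds_ratio)
round(or_by_survival, 2)   # No: about 0.33, Yes: about 2.10
```

Both values are away from 1, and they lie on opposite sides of it, which matches the opposite diagonal patterns in the two panels.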

Mosaic plots

  • Originally proposed by Hartigan and Kleiner (1981, 1984)
  • Similar to a divided bar plot: counts of a contingency table are displayed directly as tiles whose areas are proportional to the observed cell frequencies
  • Later extended by Friendly (1992, 1994b)
  • The extended version has greater visual impact, using color and shading to reflect the size of residuals from independence (no association)
  • Used for exploratory data analysis (to establish associations) and model building (to display residuals of a log-linear model)

plot of chunk mosaic1-05

mosaicplot(titanic_passengers, color = TRUE)
  • Width of each column of tiles in the above figure is proportional to the observed frequency of each sex, and tile heights are determined by the conditional proportions of age within each sex.
# Height of tiles
prop.table(apply(titanic_passengers, 1:2, sum), 1)
        Age
Sex           Child     Adult
  Male   0.07364787 0.9263521
  Female 0.10067114 0.8993289
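
The extended (shaded) version described earlier is available through argument shade of mosaicplot(). The sketch below is an illustration only: it draws the shaded plot and, as a separate assumption-free check on the two-way sex by age margin, prints Pearson residuals from chisq.test().

```r
titanic_passengers <- colSums(Titanic[-4, , , ])   # as in the example above

# shade = TRUE colors tiles by residuals from the independence model
mosaicplot(titanic_passengers, shade = TRUE, main = "Titanic passengers")

# The same idea, numerically, on the Sex x Age margin
sex_age <- apply(titanic_passengers, 1:2, sum)
test <- chisq.test(sex_age)
round(test$residuals, 2)                           # Pearson residuals per cell
```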

Plotting continuous variables

  • Display will depend on whether the data are univariate, bi-variate or multivariate
  • Some often used displays for univariate:
    • Histograms
    • Density plots
    • Box-and-whisker plots
    • Dot plot
    • Stem-and-leaf plot

Plotting continuous variables

  • Some bi-variate displays
    • Scatter plot (both variables are continuous)
    • Box-and-whisker plot (one variable is continuous and the other categorical)

Histogram

  • Displays the distribution of observations in intervals called “bins”
  • Each bin is represented by a rectangle whose width is the interval
  • Intervals can be equal throughout (equidistant, R's default) or not
  • Height of each rectangle corresponds to the number of observations falling within its interval (bin)
  • Generated with function hist(); plot(x, type = "h") only draws histogram-like vertical lines, one per observation, without binning
  • hist() constructs bins from argument “breaks”

Histogram cont.

  • Breaks are the boundary points of each interval or bin
  • Calling hist() without this argument is fine (R will compute the breaks), but it is usually worth changing them to show the distribution at its best
  • Argument “nclass” (kept for compatibility with S) can also be used to suggest the number of bins
  • Histograms are excellent for data with numerous observations
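
To see what hist() does with “breaks”, the binning can be inspected without drawing by passing plot = FALSE. This is a minimal sketch on the iris sepal lengths used in the next example; the custom break points are an arbitrary choice for illustration.

```r
sepal <- iris$Sepal.Length

# plot = FALSE returns the binning without drawing
h_default <- hist(sepal, plot = FALSE)                 # breaks = "Sturges"
h_custom  <- hist(sepal, breaks = seq(4, 8, by = 0.25), plot = FALSE)

h_default$breaks          # break points R chose
h_custom$counts           # observations per custom bin

nclass.Sturges(sepal)     # number of classes suggested by Sturges' rule
```

A single number given to breaks (or nclass) is treated as a suggestion, since R rounds break points to "pretty" values.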

Histogram cont.

# Example data: Edgar Anderson's Iris Data
sepal <- iris$Sepal.Length
sepal
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
 [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
 [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
 [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
 [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
 [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
[103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
[120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
[137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

plot of chunk hist1-05

Code used to plot

op <- par("mfrow")
par(mfrow = c(1, 2))

hist(sepal, col = "#99CCFF", ann = FALSE)
title("Default breaks", xlab = "Sepal Length", ylab = "Frequency")
hist(sepal, nclass = 15, col = "#6699cc", ann = FALSE)
title("nclass = 15", xlab = "Sepal Length", ylab = "Frequency")

par(mfrow = op)

Density Plots

  • Fit a “smooth” curve by computing kernel density estimates
  • The curve estimates a probability density, so the area under it is 1

plot of chunk density1-05

dens_sepal <- density(sepal)
plot(dens_sepal, type = "n")
polygon(dens_sepal, col = "#99CCFF")
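
As a rough check that the estimated curve behaves like a probability density, the area under it can be approximated with the trapezoid rule. This is a sketch added for illustration; the value adjust = 0.5 in the last line is an arbitrary choice to show the effect of a smaller bandwidth.

```r
sepal <- iris$Sepal.Length
dens_sepal <- density(sepal)      # Gaussian kernel, default bandwidth

dens_sepal$bw                     # bandwidth actually used

# Trapezoid rule over the evaluation grid: area under the curve
dx <- diff(dens_sepal$x)
mid_heights <- (head(dens_sepal$y, -1) + tail(dens_sepal$y, -1)) / 2
area <- sum(dx * mid_heights)
area                              # close to 1

# A smaller bandwidth gives a rougher curve
plot(density(sepal, adjust = 0.5), main = "adjust = 0.5")
```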

Box-and-whisker plot (univariate)

  • Used to visualize a distribution in terms of its quartiles
  • Shows outliers
  • Good for comparison, as multiple variables or groups can be plotted side by side
states <- as.data.frame(state.x77[, c("Illiteracy", "Life Exp", "Murder", "HS Grad")])

plot of chunk univa-boxplots1-05

# Layout (1 row by 2 columns)
op <- par("mfrow")
par(mfrow = c(1, 2))

# Visualise distributions
boxplot(states$Illiteracy, col = "#99CCFF")
boxplot(states$`Life Exp`, col = "#6699CC")

# Reset original layout
par(mfrow = op)
  • Neither distribution has outliers (points beyond whiskers)
  • First distribution has most of its values at the lower end, suggesting positive skewness (right tail)
  • Second distribution looks almost symmetrical, as the lower and upper quarters look alike, though its median sits slightly towards the lower side
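
The readings above can be checked against the numbers a box plot is drawn from, which boxplot.stats() returns. A minimal sketch for the first variable; the final comparison is a simple skewness heuristic added for illustration, not a formal test.

```r
states <- as.data.frame(state.x77[, c("Illiteracy", "Life Exp", "Murder", "HS Grad")])

# Numbers a box plot is drawn from
bs <- boxplot.stats(states$Illiteracy)
bs$stats      # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out        # points beyond the whiskers (empty if there are no outliers)

# Median closer to the lower hinge than the upper one suggests right skew
(bs$stats[3] - bs$stats[2]) < (bs$stats[4] - bs$stats[3])
```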

Dot plots (Uni-variate)

  • An alternative to box plot when n (sample size) is small
  • They are one dimensional scatter plots
  • Called stripchart in R
  • Example data: 49.3, 48.1, 51.4, 48.1, 49, 49.3, 49.5, 49.8, 49.9, 50.4, 50.1 and 50.3
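# Assumption: "col" is not defined in the code shown; these are the two colors used in earlier examples
col <- c("#99CCFF", "#6699CC")
stripchart(round(num, 1), pch = 22, bg = col[1])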
title("Dot plot for small sample size", xlab = "Observations")

plot of chunk univar-stripchart-05

Stem-and-leaf plot

  • Used to show the distribution of observations
  • Uses actual values rather than points
  • Stem is the whole number and is printed on the left, while the leaves (here the decimal digits) are printed on the right, separated by a vertical bar
# Example data (sorted)
sort(round(num, 1))
 [1] 48.1 48.1 49.0 49.3 49.3 49.5 49.8 49.9 50.1 50.3 50.4 51.4
# Stem-and-leaf plot
stem(round(num, 1))

  The decimal point is at the |

  48 | 11
  49 | 033589
  50 | 134
  51 | 4

Scatter plot

  • Used to show the relationship between two continuous variables
  • A relationship is said to exist if points show a visible pattern (positive or negative)
  • No relationship is suggested if no pattern is visible; points are scattered
plot(states[, 1:2], pch = 21, bg = col[1])
title("Association between Illiteracy and Life Expectancy")

plot of chunk scatterplot1-05

  • Scatter plot shows a negative pattern, suggesting a negative association between “Life Expectancy” and “Illiteracy” (cor = -0.5884779)
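
The correlation quoted above can be reproduced with cor(), and a least-squares line makes the direction of the pattern easier to see. This is a sketch added for illustration; the fitted line is not part of the original plot.

```r
states <- as.data.frame(state.x77[, c("Illiteracy", "Life Exp")])

# Pearson correlation: sign gives direction, magnitude gives strength
r <- cor(states$Illiteracy, states$`Life Exp`)
r

# Least-squares line added to the scatter plot
plot(states, pch = 21, bg = "#99CCFF")
fit <- lm(`Life Exp` ~ Illiteracy, data = states)
abline(fit, lty = 2)
```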

Box-and-whisker plot (bi-variate)

  • Useful for displaying a numerical variable by strata or groups of a categorical variable
  • Can also be used to compare two numerical distributions

plot of chunk bi-multi-boxplot1-05

# Box plot with slanted axis labels
op <- par("mar")
par(mar = c(7, 4, 4, 2) + 0.1)

# Plot without axis
boxplot(states$`Life Exp`~state.division, col = col[1], xaxt = "n", xlab = "")

# Add axis without labels
axis(1, labels = FALSE)

# Labels as levels of categorical variable
labs <- levels(state.division)

# Add labels 
text(1:length(labs), par("usr")[3] - 0.25, srt = 45, adj = 1, labels = labs, xpd = TRUE)

# Add xlab
mtext("Divisions", side = 1, line = 6, font = 2)

# Annotate plot
title("Life expectancy for each US division", ylab = "Life expectancy")

# Reset parameter
par(mar = op)
  • Using box plots to compare similar distributions
  • Example data: Edgar Anderson's Iris Data

plot of chunk bi-multi-boxplot3-05

plot of chunk bi-multi-boxplot4-05

# Comparing lengths (Sepal and Petal)
boxplot(iris[, c("Sepal.Length", "Petal.Length")], col = col)
title("Comparing length of Irises of Gaspe Peninsula")
# Comparing width (Sepal and Petal)
boxplot(iris[, c("Sepal.Width", "Petal.Width")], col = col)
title("Comparing width of Irises of Gaspe Peninsula")
  • Sepals appear to be larger than petals in both length and width
  • Will this pattern hold within each species?

plot of chunk bi-multi-boxplot7-05

  • Pattern still holds: sepal length is higher than petal length across all species

plot of chunk bi-multi-boxplot8-05

  • Pattern still holds, as sepal width is higher than petal width across all species; interestingly, “setosa” has the widest sepals of the three species.
# High level functions
boxplot(iris$Sepal.Length~iris$Species, col = col[1], ylim = c(min(iris$Petal.Length) - 0.1, max(iris$Sepal.Length) + 0.1))
boxplot(iris$Petal.Length~iris$Species, col = 4, add = TRUE)

# Low level functions
legend("bottomright", c("Sepal", "Petal"), pch = 22, pt.bg = c(col[1], 4), title = "Iris Type", cex = 0.75)
title("Comparison of Iris Length by species", xlab = "Species", ylab = "Length")
# High level functions
boxplot(iris$Sepal.Width~iris$Species, col = col[1], ylim = c(min(iris$Petal.Width) - 0.1, max(iris$Sepal.Width) + 0.1))
boxplot(iris$Petal.Width~iris$Species, col = 4, add = TRUE)

# Low level functions
legend("bottomright", c("Sepal", "Petal"), pch = 22, pt.bg = c(col[1], 4), title = "Iris Type", cex = 0.75)
title("Comparison of Iris Width by species", xlab = "Species", ylab = "Width")