Introduction to Data Analysis and Graphics in R

Hellen Gakuruh
2017-03-31

Session three

Data Entry, Management and Manipulation in R

Outline

  • Creating a dataset
  • Understanding datasets
  • Data input
  • Useful functions for working with datasets
  • Creating new variables
  • Recoding and renaming variables

  • Missing and date values
  • Type conversions
  • Sorting data
  • Merging datasets
  • Subsetting datasets
  • Using SQL statements to subset dataframes

Creating a dataset

  • Data sets can be created for any of R's data structures, i.e. dimensionless vector, 1 dim vector, matrix, array, data frame or list
  • There are two ways to create a data set:
    • Using a spreadsheet-like data editor
    • By coding them in

Invoking spreadsheet-like data editor in R

  • R has four functions that invoke a spreadsheet-like data editor:
    • edit()
    • fix()
    • data.entry(), and
    • dataentry()

Note on using spreadsheet-like data editors in R

  • Using these functions goes against R's core functionality: programming/coding
  • Not a recommended way, as it loses documentation and reproducibility

Coding in data

  • To code in data, function scan() can be quite handy, in addition to calling the functions for each of the data structures: c() for vectors, matrix(), array(), data.frame(), and list()
  • Used interactively, scan() is also not a good data entry process: it loses reproducibility because data is typed in at the console

Understanding datasets

  • A dataset can be a single variable or multiple variables
  • In R, a single variable can be a dimensionless vector (created with “c()”) or a 1 dim array
  • For multiple variables of the same type (especially numeric), a matrix is the better data structure; otherwise, for variables of different types but the same length, a data frame is ideal

Understanding datasets

  • If variables are of different lengths and types, generic lists are appropriate.
  • Lists can also be used to store the different data sets for a particular project, as well as accompanying source code/functions

Data input

We will look at:

  • Spreadsheet data entry using "data.entry()"
  • Using "scan()"
  • Coding in data using data structure functions c(), matrix(), array(), data.frame(), and list()

Spreadsheet data entry

  • First, create the variables (or a list of variables) that will hold the data
  • Then call data.entry()
  • In the pop-up data editor, click on an individual cell and enter data
  • Variable names can be changed from within the data editor

Spreadsheet data entry demonstration

Data entry using "scan" function

  • Can be used to input 1 dim atomic vectors
  • Values are entered interactively (on the console) if a file is not given
  • For each entry, type the value and press Enter; after the last value, press Enter on an empty line to exit entry mode
  • It is important to assign the result to a variable name and to specify the type if it is not “double”, e.g. dataset2 <- scan(what = "character")
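
A reproducible variant of the points above is to pass the values through scan()'s "text" argument instead of typing them at the console. This is only a minimal sketch; the values and the object names (weights, feeds) are made up for illustration.

```r
# Values supplied as text, so the script can be re-run without typing
weights <- scan(text = "69 72 75 67 77", quiet = TRUE)

# Character data needs its type stated through argument "what"
feeds <- scan(text = "casein linseed soybean", what = "character", quiet = TRUE)

weights
feeds
```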

Demonstration on data input using function “scan”

Data entry using data structure functions

  • Recommended way to generate data in R (ideally small data)
  • Data structure functions include:
    • c() for atomic vectors
    • matrix() for matrices
    • array() for 1 or more dimension arrays
    • data.frame() for data frames
    • list() for lists

Data entry using c()

  • Used to create individual variables of any type, as long as all elements are of the same type, e.g. all logical or all character
# A numeric (double) vector
num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # same values as 1:10, which is stored as integer
# A logical vector
logi <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

Data entry using function c()

# A character vector
R_authours <- c("Douglas Bates", "John Chambers", "Peter Dalgaard", "Seth Falcon", "Robert Gentleman", "Kurt Hornik", "Ross Ihaka", "Michael Lawrence", "Friedrich Leisch", "Uwe Ligges", "Thomas Lumley", "Martin Morgan", "Duncan Murdoch", "Paul Murrell", "Martyn Plummer", "Brian Ripley", "Deepayan Sarkar", "Duncan Temple Lang", "Luke Tierney", "Simon Urbanek")

Data entry using "matrix()"

  • 2 dimensional vectors (store data as rows and columns)
  • Primarily created with function matrix() but rbind(), cbind() and as.matrix() can be used to convert other vectors to a matrix
  • Function matrix() can be called without any input thus creating an empty matrix
  • Argument “dimnames” can only be NULL (nothing) or a list
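
As a small sketch of the alternatives listed above, rbind() and cbind() bind existing vectors into a matrix by row or by column. The vectors r1 and r2 are hypothetical.

```r
# Two vectors of equal length
r1 <- 1:3
r2 <- 4:6

by_row <- rbind(r1, r2)  # each vector becomes a row (2 x 3)
by_col <- cbind(r1, r2)  # each vector becomes a column (3 x 2)

dim(by_row)
dim(by_col)

# Calling matrix() with no input gives a 1 x 1 matrix holding NA
dim(matrix())
```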

Data entry using "matrix()"

mat1 <- matrix(data = 1:9, nrow = 3, dimnames = list(NULL, c("a", "b", "c")))
mat1
     a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

Data entry using "array()"

  • Multi-dimensional structures (1 or more dims), but often used for 3 dim structures
  • Can only hold one data type
  • Matrices are a special form of this data structure (they have 2 dims)
  • Primarily created with function array()

"array()" (cont)

dims <- list(1:3, c("a", "b", "c"), c("Yes", "No")) 
arry <- array(data = seq(1, 9*2), dim = c(3, 3, 2), dimnames = dims)

arry
, , Yes

  a b c
1 1 4 7
2 2 5 8
3 3 6 9

, , No

   a  b  c
1 10 13 16
2 11 14 17
3 12 15 18

Data entry using `data.frame()`

  • Similar to matrices, except they can contain different types of data as long as all columns have the same length (number of elements)
  • Though they resemble matrices, they are actually lists of vectors
  • Columns contain measurements on one variable and rows contain cases
  • Primarily created by data.frame()

data.frame()

# Example of weight loss data set
dataset3 <- data.frame(ID = 1:5, Exercise = c(TRUE, TRUE, FALSE, TRUE, FALSE), Height = c(5.2, 4.9, 5.1, 5.2, 5.4), Weight = c(69, 72, 75, 67, 77))
dataset3
  ID Exercise Height Weight
1  1     TRUE    5.2     69
2  2     TRUE    4.9     72
3  3    FALSE    5.1     75
4  4     TRUE    5.2     67
5  5    FALSE    5.4     77

Data entry using "list()"

  • A bit unique, as not many statistical programs have a similar data structure
  • A sort of “carry-all” data structure
  • Can also contain sub-lists, hence referred to as recursive
  • Primarily created by list()

"lists()"

lst <- list(vect = 5:9, Matrix = mat1, Array = arry, Dataframe = dataset3, List = list("a", 2:3))
str(lst)
List of 5
 $ vect     : int [1:5] 5 6 7 8 9
 $ Matrix   : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "a" "b" "c"
 $ Array    : int [1:3, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:3] "a" "b" "c"
  .. ..$ : chr [1:2] "Yes" "No"
 $ Dataframe:'data.frame':  5 obs. of  4 variables:
  ..$ ID      : int [1:5] 1 2 3 4 5
  ..$ Exercise: logi [1:5] TRUE TRUE FALSE TRUE FALSE
  ..$ Height  : num [1:5] 5.2 4.9 5.1 5.2 5.4
  ..$ Weight  : num [1:5] 69 72 75 67 77
 $ List     :List of 2
  ..$ : chr "a"
  ..$ : int [1:2] 2 3

R's objects and properties

  • Everything in R is referred to as an object, from data structures to functions, and all objects have two intrinsic attributes:
    • Mode and
    • Length
  • Mode is the basic type of an object's core constituent
  • Length is the extent or number of elements in an object
  • Functions mode() and length() are used to establish the mode and length of an object

Establishing basic composition of objects

mode(num)
[1] "numeric"
mode(mat1)
[1] "numeric"
mode(arry)
[1] "numeric"

mode(dataset3)
[1] "list"
mode(lst)
[1] "list"

Establishing length of an object

# Atomic vector
length(num)
[1] 10
# Matrix
length(mat1) 
[1] 9

Establishing length of an object

  • Length is not the best attribute for assessing a matrix or an array; “dim” is more appropriate
mat1; dim(mat1)
     a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
[1] 3 3

Establishing length of R objects (cont)

# Data frames
length(dataset3) # This shows number of variables not cases/rows
[1] 4
# Lists
length(lst)
[1] 5
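
Since length() counts variables for a data frame, the number of cases has to come from elsewhere. A minimal sketch using nrow(), ncol() and dim() on a small made-up data frame (df is a hypothetical name):

```r
df <- data.frame(ID = 1:5, Weight = c(69, 72, 75, 67, 77))

length(df)  # number of variables (columns)
nrow(df)    # number of cases (rows)
ncol(df)    # same as length(df)
dim(df)     # rows, then columns
```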

Difference between typeof(), mode() and storage.mode()

  • There are 3 functions for checking the basic constituents of an object:
    • mode(), which is an S-compatible function for checking type
    • storage.mode(), which is used for compatibility when calling functions written in other languages (ensures data is of the expected type)
    • typeof(), which is basically R's own implementation of S's mode()

Selecting between typeof(), mode(), and storage.mode()

  • Which function should be used? Depends on why,
    • If it's just a general query, then typeof() is adequate
    • If working with other S objects, use mode()
    • If calling functions written in other languages, use storage.mode()
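
To see how the three functions differ in practice, the sketch below applies all of them to the same objects. The helper describe() is hypothetical and written only for this illustration.

```r
# Hypothetical helper: report all three "type" queries side by side
describe <- function(x) {
  c(typeof = typeof(x), mode = mode(x), storage.mode = storage.mode(x))
}

describe(1:3)        # integer values
describe(c(1.5, 2))  # double values
describe(sum)        # a function
```

For the integer vector, typeof() and storage.mode() report "integer" while mode() reports the broader "numeric".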

Other Attributes

  • Attributes are basically meta data about an object in R
  • All objects (except NULL) can have one or more attributes
  • Attributes are stored as a pairlist i.e. name=value
  • List of all attributes for an object are given by attributes()
  • Individual attributes are given by attr()
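
A short sketch of attributes() and attr() on a matrix follows. The "units" attribute is a made-up example of user-defined meta data, not something R sets itself.

```r
m <- matrix(1:6, nrow = 2)

attributes(m)   # a list holding the "dim" attribute
attr(m, "dim")  # a single attribute

# Attaching a custom (hypothetical) attribute
attr(m, "units") <- "cm"
names(attributes(m))
```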

Other attributes (cont)

  • Other than mode and length other often used attributes are:
    • Names
    • Dimensions (dim)
    • Dimnames
    • Classes, and
    • Time series

Names Attribute

  • Used to name individual elements of a data object
  • They are not mandatory, but quite handy when indexing elements
  • Accessed with names() and set with names(object) <-
  • colnames() is used for matrix-like objects

Querying and setting element's names

# Creating an unnamed vector
vect1 <- c(12, 54, 98)
names(vect1)
NULL
# Naming vector elements
names(vect1) <- c("a", "b", "c")
names(vect1)
[1] "a" "b" "c"

Naming elements (cont)

# An unnamed matrix
mat3 <- matrix(1:9, 3)
mat3
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# Naming a matrix
colnames(mat3) <- c("a", "b", "c")
mat3
     a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

Dimensions Attribute

  • At least two data structures have dimensions: arrays (including matrices) and data frames
  • Function dim() is used to query an object's dimensions and dim <- is used to set them
  • There is a difference between an atomic vector and a 1 dim array; the latter has a dim attribute while the former does not
  • Giving a vector dimensions changes its data structure from a vector to an array

Dim attribute (cont)

# An atomic vector (dimensionless)
vect2 <- 1
vect2
[1] 1
dim(vect2)
NULL

# Converting to 1 dimension array
dim(vect2) <- 1
vect2
[1] 1
dim(vect2)
[1] 1
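
The same idea extends to more than one dimension. As a minimal sketch (v is a hypothetical vector), assigning two dimensions turns a plain vector into a matrix:

```r
v <- 1:6
is.matrix(v)       # a plain vector is not a matrix

dim(v) <- c(2, 3)  # 2 rows, 3 columns, filled column by column
is.matrix(v)
v[2, 3]            # last element, now reachable by row and column
```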

Dimnames Attribute

  • Gives names to dimensions
  • Like the “dim” attribute, the “dimnames” attribute is given to objects with dimensions, like matrices, arrays and data frames
  • Dimnames are given as a list of names (same length as “dim(x)”)

Querying and setting dimnames

# Matrix with no dimnames
vect3 <- matrix(1:9, 3)
vect3
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# Adding dimnames 
dimnames(vect3) <- list(1:3, c("a", "b", "c"))
vect3
  a b c
1 1 4 7
2 2 5 8
3 3 6 9

Classes Attribute

  • Class attribute is a special type of information used by functions called “methods”
  • Used to determine how an object should be handled/acted upon
  • All objects have an intrinsic class attribute, which is basically their data type, but other classes can be added to an object

Class attribute (cont)

  • Classes are character vectors, accessed and added with function class() and class <- respectively, or with attr(obj, "class")
  • When a class is added to an object, that object is called an S3 object. This makes it part of R's Object Oriented Programming (OOP)

Querying and adding class attribute

# Intrinsic class attribute
vect <- 1:5; class(vect) 
[1] "integer"
# (Assigned) Class attribute
attr(vect, "class")
NULL

# Add class with either
attr(vect, which = "class") <- "myclass"
# OR class(vect) <- "myclass"

# Query class attribute
attr(vect, "class")
[1] "myclass"

Useful note on adding class attribute

  • Adding classes has its implications as far as “method dispatch” (selection of a suitable function) is concerned
  • For example, changing from intrinsic class “numeric” to “myclass” means functions/methods for “myclass”, if found, will be applied first
  • Basically, when a generic function such as “plot()” or “mean()” is called, it looks for a method suitable for the first listed class in the class vector; it is not until all classes are checked that a method for the object's intrinsic class is dispatched
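
Method dispatch can be sketched with a method written for the example class “myclass”. The method summary.myclass() below is hypothetical and exists only for this illustration.

```r
vect <- 1:5
class(vect) <- "myclass"

# A method for generic summary(), written for class "myclass"
summary.myclass <- function(object, ...) {
  paste("myclass object with", length(object), "elements")
}

summary(vect)  # dispatches to summary.myclass()
mean(vect)     # no mean.myclass() exists, so mean.default() is used
```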

Time series Attribute

  • Used for data with time dimensionality like hourly, daily, weekly, monthly, quarterly or annual data
  • Created by adding a “tsp” attribute
  • It ensures time series parameters such as “start”, “end”, and “frequency” are kept
  • It also exists for compatibility with S version 2

Tsp Attribute

# Random annual data 
set.seed(28)
tms <- round(rnorm(12, 56))
tms
 [1] 54 56 55 54 56 57 56 56 56 58 55 58
attributes(tms)
NULL

# Adding attribute `tsp`
tsp(tms) <- c(start = 1, end = 12, frequency = 1)
attributes(tms)
$tsp
[1]  1 12  1

R's data sets

  • R has a number of built-in data sets
  • Full documentation: help(package = datasets)
  • Currently there are 104 data sets
  • Out of these:
Data Structure                  Number
Array                                1
Character (1 dim vector)             2
Data frame                          46
Dist (Distance Matrix)               2
Factor (1 dim integer vector)        2
List                                 4
Matrix                               8
Numeric (1 dim vector)               6
Table (Atomic vectors)              51
ts                                  28

Conditional Statements

  • Used to check whether certain conditions are met by some data, like observations above a certain value
  • They are also called control structures and include:
    • if-else
    • ifelse
    • for

Conditional Statements (cont)

  • Others are
    • while
    • repeat
    • break
    • next, and
    • switch
  • We discuss the frequently used control structures, that is if-else, ifelse and for

if-else()

  • Used to check if a condition evaluates to TRUE, and if so an action is performed
  • It can be extended with alternative action(s) using “else if” or “else”
  • When an “else” statement is given, it must start on the same line as the closing brace of the if statement
  • Example: check whether a vector has intrinsic type “character”; if it does, convert it to a factor, else leave it as it is
  • Note: if-else can only be performed if the condition evaluates to a single logical value, either TRUE or FALSE

if-else() example

x <- c("a", "b", "c")
class(x)
[1] "character"
if (class(x) == "character") {
   x <- as.factor(x)
} else {
   x
}
class(x)
[1] "factor"

ifelse() control structure

  • Used when the condition evaluates to a logical vector of length > 1
  • Excellent for recoding variables, hence an example is done under “recoding variables”
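
As a quick sketch ahead of the recoding example, ifelse() evaluates the condition element by element and picks the “yes” or “no” value for each. The scores below are made up.

```r
scores <- c(35, 62, 48, 90)

# One result per element of the condition
result <- ifelse(scores >= 50, "pass", "fail")
result
```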

for() control structure

  • for() is a looping structure used to perform repetitive tasks
  • Though this is a frequently used construct in most programming languages, in R there are often more efficient functions, like the apply group of functions
  • for iterates through a sequence, from its first value to its last, performing the action defined in its body (the body of any function, including a for loop, is what is in between {})
  • As a simple example, let's say Hello five times

for() loop example

for(i in 1:5) {     # variable "i" is a counter (counting from 1 to 5)
    cat("Hello \n") # function "cat" is used to print to console 
}
Hello 
Hello 
Hello 
Hello 
Hello 
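
For comparison with the apply group of functions mentioned earlier, the sketch below computes the same squares once with a for loop and once with sapply(). Object names are hypothetical.

```r
# With a for loop: pre-allocate, then fill element by element
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

# With sapply(): the function is applied to each value of 1:5
squares_apply <- sapply(1:5, function(i) i^2)

identical(squares_loop, squares_apply)
```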

Recoding variables

  • Recoding a variable means changing its values
  • It is often recommended to create a new variable instead of overwriting original variable
  • Example:
    • Create a dichotomous recoded variable of “feed” variable from “chickwts” data set
    • This variable will have values “casein” and “others” (this is something often done during analysis)

Recoding variables (cont)

str(chickwts)
'data.frame':   71 obs. of  2 variables:
 $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...

# Current categories of variable of interest (feed)
levels(chickwts$feed)
[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"

Recoding variables (cont)

# Recoding with function "ifelse"
chickwts$feed2 <- ifelse(chickwts$feed == "casein", yes = "casein", no = "other")

# Converting to a factor vector
chickwts$feed2 <- factor(chickwts$feed2)

# New levels 
levels(chickwts$feed2)
[1] "casein" "other" 

Renaming variables

  • Using base R, variables in a data frame are renamed through the names() <- function; the simplest approach supplies all variable names
  • For example, to rename “feed2” from previous slide:
# Current name
names(chickwts)
[1] "weight" "feed"   "feed2" 

Renaming variables (cont)

# Renaming variables (here all names are provided)
names(chickwts) <- c("weight", "feed", "casein")
names(chickwts)
[1] "weight" "feed"   "casein"
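
When only one variable needs a new name, names() <- can also be combined with indexing, so the other names need not be retyped. A minimal sketch on a small made-up data frame (df is hypothetical, not the chickwts data itself):

```r
df <- data.frame(weight = c(179, 160),
                 feed   = c("horsebean", "casein"),
                 feed2  = c("other", "casein"))

# Replace only the name that matches "feed2"
names(df)[names(df) == "feed2"] <- "casein"
names(df)
```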

Missing values

  • In R, denoted with logical value “NA”
  • Many operations cannot be performed when there are missing values
  • is.na() is used to check for missing values
  • If negated with “!” in front, it flags the current (non-missing) values
  • For matrices and data frames, complete.cases() might be more appropriate

Missing values (cont)

# Vector with a missing value
vect1 <- c(letters[1:5], NA); vect1
[1] "a" "b" "c" "d" "e" NA 
# A logical vector checking for missing values 
is.na(vect1)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE

Missing values: complete case for matrices

vect2 <- letters[1:6]
mat3 <- rbind(vect1, vect2)
mat3
      [,1] [,2] [,3] [,4] [,5] [,6]
vect1 "a"  "b"  "c"  "d"  "e"  NA  
vect2 "a"  "b"  "c"  "d"  "e"  "f" 

complete.cases(mat3)
[1] FALSE  TRUE
  • Output indicates the first row/case is not complete but the second is complete
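
The sketch below shows, on made-up data, how a missing value affects a computation, how argument na.rm works around it, and how complete.cases() can be used to keep only complete rows of a data frame.

```r
x <- c(4, NA, 10)
mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # missing value dropped before computing

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)         # one logical value per row
df[complete.cases(df), ]   # keep complete rows only
```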

Date values

  • Initially imported or created as numeric or character vectors
  • Conversion to a date or date/time class (Date, or POSIXlt/POSIXct for date/time objects) depends on whether they are character or numeric
  • One way to convert a character vector to a date object is with function as.Date(), specifying argument format as detailed by ?strftime
  • as.Date() can also convert a numeric vector to a date object by specifying argument origin; the origin in R is “1970-01-01”

Date values (cont)

# Converting a character vector
date1char <- c("3/6/2017", "3/7/2017", "4/7/2017")
class(date1char)
[1] "character"
date1 <- as.Date(date1char, format = "%m/%e/%Y")

Date Values (cont)

date1
[1] "2017-03-06" "2017-03-07" "2017-04-07"
class(date1)
[1] "Date"

Date values (cont)

# Converting a numeric vector
date1num <- c(17231, 17232, 17263)
class(date1num)
[1] "numeric"
date2 <- as.Date(date1num, origin = "1970-01-01")

Date values (cont)

date2
[1] "2017-03-06" "2017-03-07" "2017-04-07"
class(date2)
[1] "Date"
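
Once values are stored as dates they can be used in arithmetic and re-formatted. The sketch below uses two arbitrary dates; the time zone "UTC" in the date/time example is an assumption made to keep the result independent of local settings.

```r
d1 <- as.Date("2017-03-06")
d2 <- as.Date("2017-03-31")

d2 - d1                 # difference in days
as.numeric(d1)          # days since the origin "1970-01-01"
format(d1, "%d/%m/%Y")  # back to character, in another layout

# A date/time object (class POSIXct)
dt <- as.POSIXct("2017-03-06 14:30:00", tz = "UTC")
class(dt)
```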

Conversion between data types

  • To convert from one data type to another, use the as.<data type>() functions, like as.logical(), as.integer(), as.double(), as.character(), as.raw(), and as.complex()
  • But values must be convertible, e.g.
    • Can convert from character to logical, but strings that are not recognised forms such as “TRUE/FALSE” or “true/false” result in NA
    • Character strings that do not represent numbers (e.g. "abc") cannot be converted to integer or double; they become NA with a warning
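
A few conversions, as a minimal sketch with made-up values; suppressWarnings() is used only to hide the coercion warning raised by the non-numeric string.

```r
as.integer("12")                      # a numeric string converts
as.double("3.5")
as.character(TRUE)                    # logical to character
as.logical(c("TRUE", "true", "yes"))  # "yes" is not recognised

# A non-numeric string becomes NA (with a warning)
suppressWarnings(as.integer("abc"))
```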

Sorting data

  • Sorting an atomic vector is done with sort()
  • Sorting a data frame is done with order()
  • Matrices are actually atomic vectors with dimensions, hence their rows or columns are sorted with looping function apply()
  • By default sorting is done in increasing order; this can be reversed by setting argument “decreasing” to TRUE
  • Logical values are ordered according to their integer form, i.e. TRUE = 1, FALSE = 0 (TRUE > FALSE)

Sorting vectors

# Unsorted random numbers
set.seed(58)
tosort <- round(rnorm(10, 87, 10))
tosort
 [1]  83  91  97  80  81  68  84  92 106  96
# Sorted vector (increasing) 
sort(tosort)        
 [1]  68  80  81  83  84  91  92  96  97 106
# Sorted vector (decreasing)
sort(tosort, TRUE)  
 [1] 106  97  96  92  91  84  83  81  80  68

Sorting Matrices

mat2sort <- matrix(tosort[-1], 3, dimnames = list(1:3, c("a", "b", "c")))
mat2sort
   a  b   c
1 91 81  92
2 97 68 106
3 80 84  96

# Sort by columns of a matrix
apply(mat2sort, 2, sort)
      a  b   c
[1,] 80 68  92
[2,] 91 81  96
[3,] 97 84 106

Sorting Data frames

set.seed(3)
v1 <- round(rnorm(9, 50, 10))
set.seed(3)
v2 <- round(rnorm(9, 90))
set.seed(3)
logi <- sample(c(TRUE, FALSE), 9, TRUE, c(0.7, 0.3))
df1 <- data.frame(Logi = logi, V1 = v1, V2 = v2)

Sorting Data frames (cont)

# Sorted by first variable "logi"
df1[order(df1$Logi, decreasing = TRUE),]
   Logi V1 V2
1  TRUE 40 89
3  TRUE 53 90
4  TRUE 38 89
5  TRUE 52 90
6  TRUE 50 90
7  TRUE 51 90
8  TRUE 61 91
9  TRUE 38 89
2 FALSE 47 90

Sorting data frames by more than one variable

  • Sorting by more than one variable is first done on first listed variable then the second and so on.
  • Example:
    • Sort variable Logi in a decreasing manner (TRUE first)
    • Then sort variable “V1” in a decreasing manner

Sorting data frames example

df1[order(df1$Logi, df1$V1, decreasing = TRUE),]
   Logi V1 V2
8  TRUE 61 91
3  TRUE 53 90
5  TRUE 52 90
7  TRUE 51 90
6  TRUE 50 90
1  TRUE 40 89
4  TRUE 38 89
9  TRUE 38 89
2 FALSE 47 90

Sorting by both decreasing and ascending order

# Negative sign used to indicate decreasing
df1[order(-df1$Logi, df1$V1), ]
   Logi V1 V2
4  TRUE 38 89
9  TRUE 38 89
1  TRUE 40 89
6  TRUE 50 90
7  TRUE 51 90
5  TRUE 52 90
3  TRUE 53 90
8  TRUE 61 91
2 FALSE 47 90
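
The negative sign works for logical and numeric variables but not for character ones. Two possible workarounds are sketched below on a made-up data frame: giving order() one “decreasing” value per variable (which requires method = "radix"), or negating the ranks returned by xtfrm().

```r
df <- data.frame(g = c("b", "a", "b", "a"), v = c(2, 1, 1, 2),
                 stringsAsFactors = FALSE)

# g decreasing, v increasing: one "decreasing" value per variable
ord_radix <- order(df$g, df$v, decreasing = c(TRUE, FALSE), method = "radix")

# Same ordering by negating the numeric ranks of the character variable
ord_xtfrm <- order(-xtfrm(df$g), df$v)

df[ord_radix, ]
identical(ord_radix, ord_xtfrm)
```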

Merging data sets

  • Done by similar (intersecting) columns
  • Can use database semantics
  • Core considerations for merging
    • Default merging done by intersect(names(x), names(y))
    • Otherwise specific columns in each can be given especially if they do not have same name or capitalization

Merging data frames

# Additional data set
dataset4 <- data.frame(ID = 6:10, Exercise = c(TRUE, FALSE, TRUE, TRUE, FALSE), Height = c(5.4, 5.4, 5.2, 5.6, 5.4), Weight = c(77, 74, 75, 79, 82))

# Similar columns to be used for merging
intersect(names(dataset3), names(dataset4))
[1] "ID"       "Exercise" "Height"   "Weight"  

Merging data frames

# Merging (adding cases)
merge(dataset3, dataset4, all = TRUE)
   ID Exercise Height Weight
1   1     TRUE    5.2     69
2   2     TRUE    4.9     72
3   3    FALSE    5.1     75
4   4     TRUE    5.2     67
5   5    FALSE    5.4     77
6   6     TRUE    5.4     77
7   7    FALSE    5.4     74
8   8     TRUE    5.2     75
9   9     TRUE    5.6     79
10 10    FALSE    5.4     82
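
When the key columns do not share the same name or capitalization, they can be named explicitly with arguments by.x and by.y. A minimal sketch with two made-up data frames (people and visits are hypothetical):

```r
people <- data.frame(ID = c(1, 2, 3), Weight = c(69, 72, 75))
visits <- data.frame(id = c(1, 2, 6), Visits = c(3, 1, 4))

# Only IDs present in both data frames (like an inner join)
inner <- merge(people, visits, by.x = "ID", by.y = "id")

# Keep every row of the first data frame (like a left join)
left <- merge(people, visits, by.x = "ID", by.y = "id", all.x = TRUE)

inner
left
```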

Subsetting data sets

Look at:

  • Indexing
  • Subsetting/extracting operators
  • Subsetting different data objects

Indexing

  • Indexing vectors are used to access elements from different data objects, they include:
    • Logical vector
    • Positive integers
    • Negative integers and
    • Character vectors
  • Note: It's also possible to have an empty index (nothing between the brackets)

Indexing (cont)

  • Logical vectors select elements which evaluate to TRUE
  • Positive integers select elements at given positions
  • Negative integers exclude values at given positions
  • Character indices are only appropriate for named elements
  • An empty index selects all values; it is used to replace all entries of an object while keeping its attributes
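
Each type of index can be tried on a small named vector. In this sketch (x is a made-up vector) the first four calls all select the first and third elements.

```r
x <- c(a = 10, b = 20, c = 30)

x[c(TRUE, FALSE, TRUE)]  # logical index
x[c(1, 3)]               # positive integers
x[-2]                    # negative integer: drop the second element
x[c("a", "c")]           # character index (needs names)
x[]                      # empty index: everything
```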

Subsetting/Extracting operators

  • There are three extracting operators and one extracting function
    • [
    • [[
    • $, and
    • getElement()

Subsetting/Extracting operators

  • "[" can select more than one element and keeps their names if present while "[[" and "$" can only select one element without their names
  • "$" is only applicable for recursive objects (generic/list data structures), basically data frames and lists
  • "getElement()" function is similar to extracting with "[["
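
The differences can be sketched on a small made-up list (lst2 is a hypothetical name, chosen to avoid clashing with objects created earlier):

```r
lst2 <- list(alpha = 1:3, beta = "x")

lst2["alpha"]               # "[" keeps the list structure and the name
class(lst2["alpha"])
lst2[["alpha"]]             # "[[" returns the element itself
lst2$alpha                  # "$" does the same, by name
getElement(lst2, "alpha")   # function equivalent of "[["
```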

Subsetting Atomic Vectors

  • Subsetting operator is [, although [[ can also be used to select a single element without its names attribute
  • The index vector is placed inside the subsetting operator's brackets.

vect1
[1] "a" "b" "c" "d" "e" NA 
# Index vector: Elements that are not NA
!is.na(vect1)
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Subsetting vectors (cont)

# Subset non-na values
vect1[!is.na(vect1)]
[1] "a" "b" "c" "d" "e"
# Subsetting with an empty index
tms[]
 [1] 54 56 55 54 56 57 56 56 56 58 55 58

# Empty index useful for replacement while keeping attributes
set.seed(3)
tms[] <- sample(1:100, 12)
tms
 [1] 17 80 38 32 58 96 12 28 54 95 47 45
attr(,"tsp")
[1]  1 12  1

Subsetting atomic elements (cont)

  • Subsetting with “[[” returns without a names attribute
# Some of my favourite fruits
fruits <- c(Mangoes = 50, Apples = 35, Pineapples = 20)

fruits["Mangoes"]
Mangoes 
     50 
fruits[["Mangoes"]]
[1] 50

Subsetting Matrices and Arrays

  • Essentially atomic vectors with dimensions hence can be subset with [ and [[
  • With a single index, output is the value occurring at the given position when all values are concatenated (column by column)
  • However, the best way to index these structures is by their dimensions, e.g. [r, c] for 2 dim matrices and [r, c, l], meaning row, column, and layer, for 3 dim arrays
  • Example data set: R's USPersonalExpenditure

Example data set

# One of R's data set
USPersonalExpenditure
                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
# Subsetting with an empty index
USPersonalExpenditure[]
                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
# Subsetting with one index
USPersonalExpenditure[5]
[1] 0.341
# Subsetting with dimensions
USPersonalExpenditure[1, ]       # Subset 1st row, all columns
1940 1945 1950 1955 1960 
22.2 44.5 59.6 73.2 86.8 
USPersonalExpenditure[1, 1]      # Subset 1st row, first column
[1] 22.2
USPersonalExpenditure[3, "1950"] # Subset 3rd row, column 3 "1950"
[1] 9.71
USPersonalExpenditure[, "1960"]  # Subset an entire column, drops dimension
   Food and Tobacco Household Operation  Medical and Health 
              86.80               46.20               21.10 
      Personal Care   Private Education 
               5.40                3.64 
# Maintaining dimension
USPersonalExpenditure[, "1960", drop = FALSE]
                     1960
Food and Tobacco    86.80
Household Operation 46.20
Medical and Health  21.10
Personal Care        5.40
Private Education    3.64
dim(USPersonalExpenditure[, "1960", drop = FALSE])
[1] 5 1
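
The matrix examples above cover the [r, c] form; the [r, c, l] form for 3 dim arrays can be sketched the same way on a made-up array (arr is hypothetical):

```r
arr <- array(1:18, dim = c(3, 3, 2))

arr[2, 3, 1]     # row 2, column 3, layer 1
arr[2, 3, 2]     # same cell in layer 2
arr[8]           # single index: 8th value when all are concatenated
dim(arr[, , 2])  # an entire layer is a 3 x 3 matrix
```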

Subsetting Data frames

  • All subsetting operators ([, [[ and $) can be used
  • As before, [ can select more than one element
  • Both [[ and $ select one item; the difference is that $ cannot be used with computed values like “i + 1” (index + 1)
  • x$name is equivalent to x[["name", exact = FALSE]]
  • Other than these operators, a more convenient way to subset data frames is with function subset()

Example data set: USArrests

# Viewing first 6 rows
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
# Computing the median (used here as the average) of murder, assault and rape using "$"
avg_murder  <- median(USArrests$Murder)
avg_assault <- median(USArrests$Assault)
avg_rape    <- median(USArrests$Rape)

# Using "[" subset states with above-median murder, assault and rape
high_crime <- USArrests[USArrests$Murder > avg_murder & USArrests$Assault > avg_assault & USArrests$Rape > avg_rape, ]
# Sort (by decreasing order for Murder) and output names of states
high_crime <- high_crime[order(high_crime$Murder, decreasing = TRUE),]
row.names(high_crime)
 [1] "Georgia"        "Florida"        "Louisiana"      "South Carolina"
 [5] "Alabama"        "Tennessee"      "Texas"          "Nevada"        
 [9] "Michigan"       "New Mexico"     "Maryland"       "New York"      
[13] "Illinois"       "Alaska"         "California"     "Missouri"      
[17] "Arizona"        "Colorado"      
# Subset a column without name attribute
high_crime[[1]]
 [1] 17.4 15.4 15.4 14.4 13.2 13.2 12.7 12.2 12.1 11.4 11.3 11.1 10.4 10.0
[15]  9.0  9.0  8.1  7.9
# Or
USArrests[["Assault"]]
 [1] 236 263 294 190 276 204 110 238 335 211  46 120 249 113  56 115 109
[18] 249  83 300 149 255  72 259 178 109 102 252  57 159 285 254 337  45
[35] 120 151 159 106 174 279  86 188 201 120  48 156 145  81  53 161

Subsetting with function "subset()"

  • Function subset() can be used to subset any vector, but is most suitable for data frames
  • Here we will use it to subset high_crime states as we did before
  • We use function with() to access variables without making reference to data frame name
high_crime2 <- with(USArrests, subset(USArrests, Murder > avg_murder & Assault > avg_assault & Rape > avg_rape, Murder:Rape))
high_crime2 <- high_crime2[order(high_crime2$Murder, decreasing = TRUE), ]

# Check both data sets are identical
identical(high_crime, high_crime2)
[1] TRUE

Subsetting Lists

  • Lists can be subset with all three subsetting operators
  • Rule of thumb: subsetting with [ returns a list, while subsetting with [[ or $ outputs the same type as the element being subset, i.e. if a list holds a data frame, subsetting it with [[ or $ will output a data frame
  • Example data set: first 10 values of R's “state.center” data set

Subsetting lists (cont)

# Example data
state.center; class(state.center)
$x
 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736

$y
 [1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777
 [9] 27.8744 32.3329
[1] "list"

Subsetting lists (cont)

# Using `[` outputs a list
state.center[1]
$x
 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736
class(state.center[1])
[1] "list"

Subsetting lists (cont)

# Using `[[` outputs elements type
state.center[[1]]
 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736
class(state.center[[1]])
[1] "numeric"

Subsetting lists

# Using "$" outputs elements type
state.center$x
 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736
class(state.center$x)
[1] "numeric"

Using SQL statements to subset data frames

  • Database semantics can sometimes be quite handy in subsetting, e.g. when a subset has to meet certain conditions
  • Core database statements are:
    • SELECT
    • FROM
    • WHERE
    • ORDER BY

Using SQL statements to subset data frames

  • If interested, read a small introduction to SQL statements in R's Data Import/Export manual (section 4.2) or go online and learn from “www.sqlcourse.com”
  • Discussing this here might take us out of scope, but it's good to know it's possible in R using contributed packages like “sqldf” and “dplyr”.
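
To give a feel for the four core statements without installing anything, the sketch below mimics one SQL query with base R functions on the USArrests data. The query and the threshold of 15 are made up for illustration; with the contributed package sqldf installed, the same statement could instead be passed as a string to its sqldf() function.

```r
# SQL being mimicked:
#   SELECT Murder, Assault FROM USArrests
#   WHERE Murder > 15 ORDER BY Murder DESC

# FROM, WHERE and SELECT map onto the arguments of subset()
res <- subset(USArrests, Murder > 15, select = c(Murder, Assault))

# ORDER BY ... DESC maps onto order() with decreasing = TRUE
res <- res[order(res$Murder, decreasing = TRUE), ]
res
```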

Other functions useful for data sets

Function   Description
str        Compactly displays the internal structure of a data frame
head       Prints the first part; default is the first 6 rows
tail       Prints the last part; default is the last 6 rows
attach     Puts a data frame on R's search path, so its variables are accessible without reference to the data frame name
detach     Removes a data frame from R's search path; recommended after completion of a task

Other useful functions (cont)

Function   Description
with       Recommended alternative to attach; makes it possible to run expressions/functions on a data frame's elements
which      Locates indices of the logical value TRUE; used for indexing data frame elements
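
A brief sketch of which(), with() and head() on the USArrests data; the threshold of 15 is arbitrary and the object names are hypothetical.

```r
which(c(FALSE, TRUE, TRUE))  # positions of the TRUE values

# with(): refer to variables without the data frame prefix
avg_with   <- with(USArrests, mean(Murder))
avg_dollar <- mean(USArrests$Murder)

# which() used to locate rows meeting a condition
high <- which(USArrests$Murder > 15)
row.names(USArrests)[high]

nrow(head(USArrests))        # head() returns the first 6 rows by default
```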