Introduction to Data Analysis and Graphics in R

Hellen Gakuruh
2017-03-31

Session three

Data Entry, Management and Manipulation in R

Outline

Creating a dataset
Understanding datasets
Data input
Useful functions for working with datasets
Creating new variables
Recoding and renaming variables

Missing and date values
Type conversions
Sorting data
Merging datasets
Subsetting datasets
Using SQL statements to subset dataframes

Creating a dataset

Data sets can be created for any of R's data structures, i.e. dimensionless vector, 1 dim array, matrix, array, data frame or list
There are two ways to create a data set:
- Using a spreadsheet-like data editor
- By coding them in

Invoking spreadsheet-like data editor in R

R has four functions to invoke a spreadsheet-like data editor, these are:
- edit()
- fix()
- data.entry(), and
- dataentry()

Note on using spreadsheet-like data editors in R

Using these functions goes against R's core strength: programming/coding
Not a recommended way, as it loses documentation/reproducibility

Coding in data

To code in data, function scan() can be quite handy, in addition to the functions for each data structure: c() for vectors, matrix(), array(), data.frame(), and list()
Interactive use of scan() is also not a good data entry process: values typed at the console are not recorded, so reproducibility is lost

Understanding datasets

It can be a single variable or multiple variables
In R, a single variable can be a dimensionless vector (created with “c()”) or a 1 dim array
For multiple variables, if they are all of the same type (especially if numeric), then a matrix is the better data structure; otherwise, for multiple types with the same length, a data frame is ideal

Understanding datasets (cont)

If the data are of different lengths and types, generic lists are appropriate.
Lists can also be used to store different data sets for a particular project, as well as accompanying source code/functions

Data input

We will look at:

Spreadsheet data entry using "data.entry()"
Using "scan()"
Coding in data using data structure functions c(), matrix(), array(), data.frame(), and list()

Spreadsheet data entry

First, create the variables or list of variables for data entry
Then call data.entry()
From the pop-up data editor, click on an individual cell and enter data
Variable names can be changed from within the data editor

Spreadsheet data entry demonstration

Data entry using "scan" function

Can be used to input 1 dim atomic vectors
Values are entered interactively (on the console) if a file is not given
For each entry, type the value and press Enter; after the last value, press Enter on an empty line and entry mode will be exited
Important to assign to a variable name and specify the type if it is not “double”; dataset2 <- scan(what = "character")
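
For a reproducible variant, scan() can also read from a string instead of the console. The following is a minimal sketch, assuming the values are known in advance; the object names and values are illustrative only.

```r
# Read numbers from a string via the 'text' argument (no interactive typing)
dataset1 <- scan(text = "12 54 98", quiet = TRUE)
dataset1

# Read character values by setting 'what'
dataset2 <- scan(text = "a b c", what = "character", quiet = TRUE)
dataset2
```

Because the values sit in the script itself, re-running the script recreates the same data.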

Demonstration on data input using function “scan”

Data entry using data structure functions

Recommended way to generate data in R (ideally small data)
Data structure functions include:
- c() for atomic vectors
- matrix() for matrices
- array() for arrays of 1 or more dimensions
- data.frame() for data frames
- list() for lists

Data entry using c()

Used to create individual variables of any type, as long as all elements are of the same type, e.g. all logical or all character

# A numeric (double) vector
num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) # same values as 1:10, but 1:10 is stored as integer
# A logical vector
logi <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

Data entry using function c()

# A character vector
R_authours <- c("Douglas Bates", "John Chambers", "Peter Dalgaard", "Seth Falcon", "Robert Gentleman", "Kurt Hornik", "Ross Ihaka", "Michael Lawrence", "Friedrich Leisch", "Uwe Ligges", "Thomas Lumley", "Martin Morgan", "Duncan Murdoch", "Paul Murrell", "Martyn Plummer", "Brian Ripley", "Deepayan Sarkar", "Duncan Temple Lang", "Luke Tierney", "Simon Urbanek")

Data entry using "matrix()"

2 dimensional vectors (store data as rows and columns)
Primarily created with function matrix(), but rbind(), cbind() and as.matrix() can be used to convert other objects to a matrix
Function matrix() can be called without any input, which creates a 1 x 1 matrix holding NA
Argument “dimnames” can only be NULL (nothing) or a list
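
The alternative constructors mentioned above can be sketched as follows; the vectors x and y are illustrative and not part of the session's data.

```r
# Two vectors of equal length (illustrative names)
x <- 1:3
y <- 4:6

rbind(x, y)   # vectors become rows: a 2 x 3 matrix
cbind(x, y)   # vectors become columns: a 3 x 2 matrix

# as.matrix() converts a data frame; mixed types would be coerced to one type
as.matrix(data.frame(x, y))
```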

Data entry using "matrix()"

mat1 <- matrix(data = 1:9, nrow = 3, dimnames = list(NULL, c("a", "b", "c")))
mat1

     a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

Data entry using "array()"

Multi-dimensional structures (1 or more dims), but often used for 3 dim structures
Can only hold one data type
Matrices are a special form of these data structures (they have 2 dims)
Primarily created with function array()

"array()" (cont)

dims <- list(1:3, c("a", "b", "c"), c("Yes", "No")) 
arry <- array(data = seq(1, 9*2), dim = c(3, 3, 2), dimnames = dims)

arry

Data entry using `data.frame()`

Similar to matrices, except that columns can hold different types of data as long as they have the same length (number of elements)
Though they resemble matrices, they are actually lists of vectors
Columns contain measurements on one variable and rows contain cases
Primarily created by data.frame()

data.frame()

# Example of weight loss data set
dataset3 <- data.frame(ID = 1:5, Exercise = c(TRUE, TRUE, FALSE, TRUE, FALSE), Height = c(5.2, 4.9, 5.1, 5.2, 5.4), Weight = c(69, 72, 75, 67, 77))
dataset3

  ID Exercise Height Weight
1  1     TRUE    5.2     69
2  2     TRUE    4.9     72
3  3    FALSE    5.1     75
4  4     TRUE    5.2     67
5  5    FALSE    5.4     77

Data entry using "list()"

A bit unique, as not many statistical programs have a similar data structure
A sort of “carry-all” data structure
Can also contain sub-lists, hence referred to as recursive
Primarily created by list()

"list()"

lst <- list(vect = 5:9, Matrix = mat1, Array = arry, Dataframe = dataset3, List = list("a", 2:3))

str(lst)

List of 5
 $ vect     : int [1:5] 5 6 7 8 9
 $ Matrix   : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "a" "b" "c"
 $ Array    : int [1:3, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : chr [1:3] "a" "b" "c"
  .. ..$ : chr [1:2] "Yes" "No"
 $ Dataframe:'data.frame':  5 obs. of  4 variables:
  ..$ ID      : int [1:5] 1 2 3 4 5
  ..$ Exercise: logi [1:5] TRUE TRUE FALSE TRUE FALSE
  ..$ Height  : num [1:5] 5.2 4.9 5.1 5.2 5.4
  ..$ Weight  : num [1:5] 69 72 75 67 77
 $ List     :List of 2
  ..$ : chr "a"
  ..$ : int [1:2] 2 3

R's objects and properties

Everything in R is referred to as an object, from data structures to functions, and all objects have two intrinsic attributes:
- Mode and
- Length
Mode is the basic type of an object's core constituents
Length is the extent or number of elements in an object
Functions mode() and length() are used to establish the mode and length of an object

Establishing basic composition of objects

mode(num)

[1] "numeric"

mode(mat1)

[1] "numeric"

mode(arry)

[1] "numeric"

mode(dataset3)

[1] "list"

mode(lst)

[1] "list"

Establishing length of an object

# Atomic vector
length(num)

[1] 10

# Matrix
length(mat1)

[1] 9

Establishing length of an object

Length is not the best attribute for assessing a matrix or an array, “dim” is more appropriate

mat1; dim(mat1)

     a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

[1] 3 3

Establishing length of R objects (cont)

# Data frames
length(dataset3) # This shows number of variables not cases/rows

[1] 4

# Lists
length(lst)

[1] 5
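
Since length() of a data frame counts variables rather than cases, the dimension functions are usually more informative. A minimal sketch, using a small made-up data frame:

```r
# A small illustrative data frame: 3 cases, 2 variables
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

length(df)  # number of variables (columns): 2
nrow(df)    # number of cases (rows): 3
ncol(df)    # number of variables again: 2
dim(df)     # rows then columns: 3 2
```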

Difference between typeof(), mode() and storage.mode()

There are 3 functions for checking the basic constituents of an object, these are:
- mode() which is an S compatible function for checking type
- storage.mode() which is used for compatibility when calling functions written in other languages (ensures data is of the expected type)
- typeof() which reports R's internal type and is basically R's own counterpart to S's mode()

Selecting between typeof(), mode(), and storage.mode()

Which function should be used? It depends on the purpose:
- If it's just a general query, then typeof() is adequate
- If working with other S objects, use mode()
- If calling functions written in other languages, use storage.mode()
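
The differences between the three functions are easiest to see side by side. A short sketch on illustrative objects:

```r
# Compare the three functions on a few objects
x_int <- 1:3         # integer vector
x_dbl <- c(1.5, 2)   # double vector

typeof(x_int); mode(x_int); storage.mode(x_int)   # "integer", "numeric", "integer"
typeof(x_dbl); mode(x_dbl); storage.mode(x_dbl)   # "double",  "numeric", "double"

# For functions, typeof() distinguishes internal kinds; mode() does not
typeof(mean); mode(mean)   # "closure", "function"
typeof(sum);  mode(sum)    # "builtin", "function"
```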

Other Attributes

Attributes are basically metadata about an object in R
All objects (except NULL) can have one or more attributes
Attributes are stored as a pairlist, i.e. name=value
The list of all attributes of an object is given by attributes()
Individual attributes are given by attr()
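
A minimal sketch of querying and setting attributes; the attribute name "note" is made up for illustration.

```r
x <- 1:6
attributes(x)                 # NULL: a plain vector has no attributes

attr(x, "dim") <- c(2, 3)     # adding a dim attribute makes it a 2 x 3 matrix
attr(x, "note") <- "example"  # attributes can also be user-defined ("note" is made up here)

attributes(x)                 # a list with elements dim and note
attr(x, "note")               # "example"
```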

Other attributes (cont)

Other than mode and length other often used attributes are:
- Names
- Dimensions (dim)
- Dimnames
- Classes, and
- Time series

Names Attribute

Used to name individual elements of a data object
They are not mandatory, but quite handy when indexing elements
Accessed with names() and set with names(object) <-
colnames() is used for matrix-like objects

Querying and setting element's names

# Creating an unnamed vector
vect1 <- c(12, 54, 98)
names(vect1)

NULL

# Naming vector elements
names(vect1) <- c("a", "b", "c")
names(vect1)

[1] "a" "b" "c"

Naming elements (cont)

# An unnamed matrix
mat3 <- matrix(1:9, 3)
mat3

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# Naming a matrix
colnames(mat3) <- c("a", "b", "c")
mat3

     a b c
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

Dimensions Attribute

At least two kinds of data structure have dimensions: arrays (including matrices), which store them in a “dim” attribute, and data frames, for which dim() computes them from the rows and columns
Function dim() is used to query an object's dimensions and dim(x) <- is used to set the dimensions of an object
There is a difference between an atomic vector and a 1 dim array; the latter has a dim attribute while the former does not
Giving a vector dimensions changes its data structure from a vector to an array

Dim attribute (cont)

# An atomic vector (dimensionless)
vect2 <- 1
vect2

[1] 1

dim(vect2)

NULL

# Converting to 1 dimension array
dim(vect2) <- 1
vect2

[1] 1

dim(vect2)

[1] 1

Dimnames Attribute

Gives names to dimensions
Like the “dim” attribute, “dimnames” apply to objects with dimensions, like matrices, arrays and data frames
Dimnames are given as a list of names (same length as “dim(x)”)

Querying and setting dimnames

# Matrix with no dimnames
vect3 <- matrix(1:9, 3)
vect3

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# Adding dimnames 
dimnames(vect3) <- list(1:3, c("a", "b", "c"))
vect3

Classes Attribute

The class attribute is a special type of information used by functions called “methods”
Used to determine how an object should be handled/acted upon
All objects have an intrinsic (implicit) class, which is basically its data type, but other classes can be added to an object

Class attribute (cont)

Classes are character vectors accessed and added with function class() and class(x) <- respectively, or with attr(obj, "class")
When a class is added to an object, that object is called an S3 object. This makes it part of R's Object Oriented Programming (OOP)

Querying and adding class attribute

# Intrinsic class attribute
vect <- 1:5; class(vect)

[1] "integer"

# (Assigned) Class attribute
attr(vect, "class")

NULL

# Add class with either
attr(vect, which = "class") <- "myclass"
# OR class(vect) <- "myclass"

# Query class attribute
attr(vect, "class")

[1] "myclass"

Useful note on adding class attribute

Adding classes has implications as far as “method dispatch” (selection of a suitable function) is concerned
For example, changing from intrinsic class “integer” to “myclass” means functions/methods for “myclass”, if found, will be applied first
Basically, when a generic function such as “plot()” or “mean()” is called, it looks for a method suitable for the first listed class in the class vector; only after all listed classes have been checked is a method for the intrinsic class (or the default method) dispatched
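
Method dispatch can be sketched with a small generic. The generic describe() and its methods are hypothetical and exist only for this illustration; they are not part of base R.

```r
# A hypothetical generic with two methods
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"
describe.myclass <- function(x, ...) "method for myclass"

vect <- 1:5
describe(vect)            # no class attribute: falls through to the default method

class(vect) <- "myclass"
describe(vect)            # the myclass method is found first

describe(unclass(vect))   # removing the class attribute restores default dispatch
```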

Time series Attribute

Used for data with a time dimension, like hourly, daily, weekly, monthly, quarterly or annual data
Created by adding a “tsp” attribute
The attribute keeps time series parameters such as “start”, “end”, and “frequency” with the data
It exists mainly for compatibility with S version 2

Tsp Attribute

# Random annual data 
set.seed(28)
tms <- round(rnorm(12, 56))
tms

 [1] 54 56 55 54 56 57 56 56 56 58 55 58

attributes(tms)

NULL

# Adding attribute `tsp`
tsp(tms) <- c(start = 1, end = 12, frequency = 1)
attributes(tms)

$tsp
[1]  1 12  1
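
In practice, time series are more often created with function ts(), which sets the tsp attribute and a "ts" class in one step. A minimal sketch, assuming monthly data for 2017; the object name and values are illustrative.

```r
# Monthly series for 2017 (values are illustrative)
sales <- ts(c(54, 56, 55, 54, 56, 57, 56, 56, 56, 58, 55, 58),
            start = c(2017, 1), frequency = 12)

class(sales)       # "ts"
start(sales)       # 2017 1
end(sales)         # 2017 12
frequency(sales)   # 12
tsp(sales)         # start, end and frequency stored in the tsp attribute
```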

R's data sets

R has a number of built-in data sets
Full documentation: help(package = datasets)
Currently there are 104 data sets
Out of these:

Data Structure	Number
Array	1
Character (1 dim vector)	2
Data frame	46
Dist (Distance Matrix)	2
Factor (1 dim integer vector)	2
List	4
Matrix	8
Numeric (1 dim vector)	6
Table (Atomic vectors)	51
ts	28

Conditional Statements

Used to check whether certain conditions are met by some data, like observations above a certain value
They are also called control structures and include:
- if-else
- ifelse
- for

Conditional Statements (cont)

Others are
- while
- repeat
- break
- next, and
- switch
We discuss the frequently used control structures, that is, if-else, ifelse and for

if-else()

Used to check if a condition evaluates to TRUE, and if so an action is performed
It can be extended with alternative action(s) using “else if” or “else”
When an “else” statement is given, it must be on the same line as the closing brace of the if statement
Example: check whether a vector has intrinsic type “character”; if it does, convert it to a factor, else leave it as it is
Note: if-else can only be used if the condition evaluates to a single logical value, either TRUE or FALSE

if-else() example

x <- c("a", "b", "c")
class(x)

[1] "character"

if (class(x) == "character") {
   x <- as.factor(x)
} else {
   x
}
class(x)

[1] "factor"

ifelse() control structure

Used when the condition evaluates to a logical vector of length > 1
Excellent for recoding variables, hence an example is done under “recoding variables”
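
The contrast between the vectorised ifelse() and the single-value if can be sketched as follows; the marks vector and the pass mark of 50 are made up for illustration.

```r
marks <- c(35, 72, 50, 48)   # illustrative values

# ifelse() tests every element and returns a vector of the same length
ifelse(marks >= 50, "pass", "fail")

# if() needs a single TRUE/FALSE, so summarise the vector first
if (all(marks >= 50)) {
  result <- "everyone passed"
} else {
  result <- "at least one fail"
}
result
```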

for() control structure

for() is a looping structure used to perform repetitive tasks
Though this is a frequently used construct in most programming languages, in R there are often more concise alternatives, like vectorised operations and the apply group of functions
for iterates through a sequence, one value at a time, performing the action defined in its body (the body of a function or loop is what is in between {})
As a simple example, let's say Hello five times

for() loop example

for(i in 1:5) {     # variable "i" is a counter (counting from 1 to 5)
    cat("Hello \n") # function "cat" is used to print to console 
}

Hello 
Hello 
Hello 
Hello 
Hello
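
To compare a loop with the alternatives R offers, here is a small sketch that computes the same result three ways; the object names are illustrative.

```r
# Squares of 1 to 5 with a for loop (result vector created up front)
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}

# The same with sapply(), which applies a function to each element
squares_apply <- sapply(1:5, function(i) i^2)

# And with plain vectorised arithmetic, usually the simplest option
squares_vec <- (1:5)^2

squares_loop; squares_apply; squares_vec
```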

Recoding variables

Recoding a variable means changing its values
It is often recommended to create a new variable instead of overwriting the original variable
Example:
- Create a dichotomous recoded variable of the “feed” variable from the “chickwts” data set
- This variable will have values “casein” and “other” (this is something often done during analysis)

Recoding variables (cont)

str(chickwts)

'data.frame':   71 obs. of  2 variables:
 $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
 $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...

# Current categories of variable of interest (feed)
levels(chickwts$feed)

[1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"

Recoding variables (cont)

# Recoding with function "ifelse"
chickwts$feed2 <- ifelse(chickwts$feed == "casein", yes = "casein", no = "other")

# Converting to a factor vector
chickwts$feed2 <- factor(chickwts$feed2)

# New levels 
levels(chickwts$feed2)

[1] "casein" "other"

Renaming variables

Using base R, variables in a data frame are renamed with the names() <- replacement function; the simplest approach re-issues the full vector of names
For example, to rename “feed2” from the previous slide:

# Current name
names(chickwts)

[1] "weight" "feed"   "feed2"

Renaming variables (cont)

# Renaming variables (here all names are provided)
names(chickwts) <- c("weight", "feed", "casein")
names(chickwts)

[1] "weight" "feed"   "casein"
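
A single name can also be replaced by indexing the names vector, which avoids retyping every name. A minimal sketch on a copy of the built-in data; the names chicks and "feed_type" are illustrative.

```r
# Work on a copy so the built-in data set is left untouched
chicks <- datasets::chickwts

# Rename only "feed" by locating it in the names vector
names(chicks)[names(chicks) == "feed"] <- "feed_type"   # "feed_type" is an illustrative name
names(chicks)
```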

Missing values

In R, denoted with the logical constant “NA”
Many operations cannot be performed when there are missing values
is.na() is used to check for missing values
If negated with “!” in front, it flags the present (non-missing) values
For matrices and data frames, complete.cases() might be more appropriate

Missing values (cont)

# Vector with a missing value
vect1 <- c(letters[1:5], NA); vect1

[1] "a" "b" "c" "d" "e" NA

# A logical vector checking for missing values 
is.na(vect1)

[1] FALSE FALSE FALSE FALSE FALSE  TRUE

Missing values: complete case for matrices

vect2 <- letters[1:6]
mat3 <- rbind(vect1, vect2)
mat3

      [,1] [,2] [,3] [,4] [,5] [,6]
vect1 "a"  "b"  "c"  "d"  "e"  NA  
vect2 "a"  "b"  "c"  "d"  "e"  "f"

complete.cases(mat3)

[1] FALSE  TRUE

Output indicates the first row/case is not complete but the second is complete
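
How missing values affect computations, and two common ways of handling them, can be sketched as follows; the vector and data frame are made up for illustration.

```r
x <- c(4, NA, 6)

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 5: computed on the non-missing values only

# A small illustrative data frame with one incomplete row
df <- data.frame(id = 1:3, score = x)
complete.cases(df)      # TRUE FALSE TRUE
na.omit(df)             # keeps rows 1 and 3
```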

Date values

Initially imported or created as numeric or character vectors
Conversion to a date or date/time class (Date, POSIXlt/POSIXct) depends on whether they are character or numeric
One way to convert a character vector to a date object is by using function as.Date(), specifying argument format as detailed by ?strftime
as.Date() can also be used to convert a numeric vector (a number of days) to a date object by specifying argument origin; R counts days from “1970-01-01”

Date values (cont)

# Converting a character vector
date1char <- c("3/6/2017", "3/7/2017", "4/7/2017")
class(date1char)

[1] "character"

date1 <- as.Date(date1char, format = "%m/%d/%Y") # %d reads day of month with or without a leading zero

Date Values (cont)

date1

[1] "2017-03-06" "2017-03-07" "2017-04-07"

class(date1)

[1] "Date"

Date values (cont)

# Converting a numeric vector
date1num <- c(17231, 17232, 17263)
class(date1num)

[1] "numeric"

date2 <- as.Date(date1num, origin = "1970-01-01")

Date values (cont)

date2

[1] "2017-03-06" "2017-03-07" "2017-04-07"

class(date2)

[1] "Date"
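
Once values are of class Date, they behave like day counts. A short sketch of a few operations, using one of the dates from the example above:

```r
d <- as.Date("2017-03-06")

as.numeric(d)                 # 17231 days since 1970-01-01
d + 30                        # "2017-04-05": adding a number adds days
as.Date("2017-04-07") - d     # Time difference of 32 days
format(d, "%d/%m/%Y")         # "06/03/2017": back to a character string
```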

Conversion between data types

To convert from one data type to another, use an as.<type>() function like as.logical(), as.integer(), as.double(), as.character(), as.raw(), and as.complex()
But the values must be convertible, e.g.
- Can convert from character to logical, but strings other than “TRUE”/“FALSE”, “true”/“false”, “True”/“False” or “T”/“F” result in NA
- Character strings that do not represent numbers (e.g. “a”) cannot be converted to integer or double; they result in NA with a warning
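
A few conversions, successful and unsuccessful, as a quick sketch:

```r
as.integer("12")                    # 12: the string represents a number
suppressWarnings(as.integer("a"))   # NA: "a" is not a number (a warning is normally given)

as.logical(c("TRUE", "true", "T"))  # TRUE TRUE TRUE
as.logical("yes")                   # NA: not a recognised logical string

as.numeric(c(TRUE, FALSE))          # 1 0
as.character(3.5)                   # "3.5"
```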

Sorting data

Sorting an atomic vector is done with sort()
Sorting a data frame is done by indexing its rows with order()
Matrices are actually atomic vectors with dimensions, hence their rows or columns are sorted with looping function apply()
By default sorting is done in increasing order; this can be reversed by setting argument “decreasing” to TRUE
Logical values are ordered according to their integer form, i.e. TRUE = 1, FALSE = 0 (TRUE > FALSE)
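
order() does not return sorted values; it returns the positions that would sort them, which is why it can be used to rearrange the rows of a data frame. A minimal sketch with made-up data:

```r
x <- c(30, 10, 20)

order(x)      # 2 3 1: positions of the smallest, next smallest, ... value
x[order(x)]   # 10 20 30: same result as sort(x)

# The same idea sorts the rows of a data frame (illustrative data)
df <- data.frame(name = c("c", "a", "b"), value = x)
df[order(df$value), ]
```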

Sorting vectors

# Unsorted random numbers
set.seed(58)
tosort <- round(rnorm(10, 87, 10))
tosort

 [1]  83  91  97  80  81  68  84  92 106  96

# Sorted vector (increasing) 
sort(tosort)

 [1]  68  80  81  83  84  91  92  96  97 106

# Sorted vector (decreasing)
sort(tosort, TRUE)

 [1] 106  97  96  92  91  84  83  81  80  68

Sorting Matrices

mat2sort <- matrix(tosort[-1], 3, dimnames = list(1:3, c("a", "b", "c")))
mat2sort

   a  b   c
1 91 81  92
2 97 68 106
3 80 84  96

# Sort by columns of a matrix
apply(mat2sort, 2, sort)

      a  b   c
[1,] 80 68  92
[2,] 91 81  96
[3,] 97 84 106

Sorting Data frames

set.seed(3)
v1 <- round(rnorm(9, 50, 10))
set.seed(3)
v2 <- round(rnorm(9, 90))
set.seed(3)
logi <- sample(c(TRUE, FALSE), 9, TRUE, c(0.7, 0.3))
df1 <- data.frame(Logi = logi, V1 = v1, V2 = v2)

Sorting Data frames (cont)

# Sorted by first variable "Logi"
df1[order(df1$Logi, decreasing = TRUE),]

   Logi V1 V2
1  TRUE 40 89
3  TRUE 53 90
4  TRUE 38 89
5  TRUE 52 90
6  TRUE 50 90
7  TRUE 51 90
8  TRUE 61 91
9  TRUE 38 89
2 FALSE 47 90

Sorting data frames by more than one variable

Sorting by more than one variable is first done on first listed variable then the second and so on.
Example:
- Sort variable Logi in a decreasing manner (TRUE first)
- Then sort variable “V1” in a decreasing manner

Sorting data frames example

df1[order(df1$Logi, df1$V1, decreasing = TRUE),]

   Logi V1 V2
8  TRUE 61 91
3  TRUE 53 90
5  TRUE 52 90
7  TRUE 51 90
6  TRUE 50 90
1  TRUE 40 89
4  TRUE 38 89
9  TRUE 38 89
2 FALSE 47 90

Sorting by both decreasing and ascending order

# Negative sign used to indicate decreasing
df1[order(-df1$Logi, df1$V1), ]

   Logi V1 V2
4  TRUE 38 89
9  TRUE 38 89
1  TRUE 40 89
6  TRUE 50 90
7  TRUE 51 90
5  TRUE 52 90
3  TRUE 53 90
8  TRUE 61 91
2 FALSE 47 90

Merging data sets

Done by similar (intersecting) columns
Can use database semantics
Core considerations for merging
- Default merging done by intersect(names(x), names(y))
- Otherwise specific columns in each can be given especially if they do not have same name or capitalization
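
When the key columns are named differently in the two data frames, arguments by.x and by.y of merge() name them explicitly. A minimal sketch; the data frames heights and weights are made up for illustration.

```r
# Two illustrative data frames whose key columns are named differently
heights <- data.frame(ID = 1:3, Height = c(5.2, 4.9, 5.1))
weights <- data.frame(id = c(1L, 3L, 4L), Weight = c(69, 75, 67))

# Default merge keeps only IDs present in both data frames (1 and 3)
both <- merge(heights, weights, by.x = "ID", by.y = "id")
both

# all = TRUE keeps every ID (1 to 4) and fills gaps with NA
everyone <- merge(heights, weights, by.x = "ID", by.y = "id", all = TRUE)
everyone
```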

Merging data frames

# Additional data set
dataset4 <- data.frame(ID = 6:10, Exercise = c(TRUE, FALSE, TRUE, TRUE, FALSE), Height = c(5.4, 5.4, 5.2, 5.6, 5.4), Weight = c(77, 74, 75, 79, 82))

# Similar columns to be used for merging
intersect(names(dataset3), names(dataset4))

[1] "ID"       "Exercise" "Height"   "Weight"

Merging data frames

# Merging (adding cases)
merge(dataset3, dataset4, all = TRUE)

   ID Exercise Height Weight
1   1     TRUE    5.2     69
2   2     TRUE    4.9     72
3   3    FALSE    5.1     75
4   4     TRUE    5.2     67
5   5    FALSE    5.4     77
6   6     TRUE    5.4     77
7   7    FALSE    5.4     74
8   8     TRUE    5.2     75
9   9     TRUE    5.6     79
10 10    FALSE    5.4     82

Subsetting data sets

Look at:

Indexing
Subsetting/extracting operators
Subsetting different data objects

Indexing

Indexing vectors are used to access elements from different data objects, they include:
- Logical vector
- Positive integers
- Negative integers and
- Character vectors
Note: It's also possible to index with nothing at all (empty index, x[]); this differs from a 0 index, x[0], which returns a zero-length vector

Indexing (cont)

Logical vectors select elements which evaluate to TRUE
Positive integers select elements at given positions
Negative integers exclude values at given positions
Character indices are only appropriate for named elements
An empty index selects all values; it is used to replace all entries while keeping the object's attributes
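
Each kind of index vector can be tried on a small named vector. A minimal sketch; the vector x is illustrative.

```r
x <- c(a = 10, b = 20, c = 30)   # a small named vector (illustrative)

x[c(TRUE, FALSE, TRUE)]   # logical: keeps a and c
x[c(1, 3)]                # positive integers: positions 1 and 3
x[-2]                     # negative integer: everything except position 2
x[c("a", "c")]            # character: elements named a and c
x[]                       # empty index: all elements
x[0]                      # zero index: a zero-length vector
```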

Subsetting/Extracting operators

There are three extracting operators and one extracting function
- [
- [[
- $, and
- getElement()

Subsetting/Extracting operators

"[" can select more than one element and keeps names if present, while "[[" and "$" select a single element and drop its name
"$" is only applicable for recursive objects (generic/list data structures), basically data frames and lists
"getElement()" function is similar to extracting with "[["
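
The behaviour of the extracting operators can be compared on a small list. A minimal sketch; the list lst_demo and its element names are illustrative.

```r
lst_demo <- list(alpha = 1:3, beta = "b")   # illustrative list

lst_demo["alpha"]               # "[" returns a list of length 1, name kept
lst_demo[["alpha"]]             # "[[" returns the element itself: 1 2 3
lst_demo$alpha                  # "$" does the same for a literal name
getElement(lst_demo, "alpha")   # function form of the same extraction

# "[[" accepts computed values; "$" does not
i <- 1
lst_demo[[i + 1]]               # "b"

# "$" matches names partially; "[[" is exact unless told otherwise
lst_demo$al                     # 1 2 3
lst_demo[["al"]]                # NULL
lst_demo[["al", exact = FALSE]] # 1 2 3
```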

Subsetting Atomic Vectors

Subsetting operator is [, although [[ can also be used to select a single element without its names attribute
The index vector is put inside the subsetting operator, e.g. x[index]

vect1

[1] "a" "b" "c" "d" "e" NA

# Index vector: Elements that are not NA
!is.na(vect1)

[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

Subsetting vectors (cont)

# Subset non-na values
vect1[!is.na(vect1)]

[1] "a" "b" "c" "d" "e"

# Subsetting with an empty index
tms[]

 [1] 54 56 55 54 56 57 56 56 56 58 55 58

# Empty index useful for replacement while keeping attributes
set.seed(3)
tms[] <- sample(1:100, 12)
tms

 [1] 17 80 38 32 58 96 12 28 54 95 47 45
attr(,"tsp")
[1]  1 12  1

Subsetting atomic elements (cont)

Subsetting with “[[” returns without a names attribute

# Some of my favourite fruits
fruits <- c(Mangoes = 50, Apples = 35, Pineapples = 20)

fruits["Mangoes"]

Mangoes 
     50

fruits[["Mangoes"]]

[1] 50

Subsetting Matrices and Arrays

Essentially atomic vectors with dimensions, hence can be subset with [ and [[
With a single index, output is the value at that position when all values are concatenated column by column
However, the best way to index these structures is by their dimensions, e.g. [r, c] for 2 dim matrices and [r, c, l] meaning row, column, and layer for 3 dim arrays
Example data set: R's USPersonalExpenditure

Example data set

# One of R's data set
USPersonalExpenditure

                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64

# Subsetting with an empty index
USPersonalExpenditure[]

                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64

# Subsetting with one index
USPersonalExpenditure[5]

[1] 0.341

# Subsetting with dimensions
USPersonalExpenditure[1, ]       # Subset 1st row, all columns

1940 1945 1950 1955 1960 
22.2 44.5 59.6 73.2 86.8

USPersonalExpenditure[1, 1]      # Subset 1st row, first column

[1] 22.2

USPersonalExpenditure[3, "1950"] # Subset 3rd row, column 3 "1950"

[1] 9.71

USPersonalExpenditure[, "1960"]  # Subset an entire column, drops dimension

   Food and Tobacco Household Operation  Medical and Health 
              86.80               46.20               21.10 
      Personal Care   Private Education 
               5.40                3.64

# Maintaining dimension
USPersonalExpenditure[, "1960", drop = FALSE]

                     1960
Food and Tobacco    86.80
Household Operation 46.20
Medical and Health  21.10
Personal Care        5.40
Private Education    3.64

dim(USPersonalExpenditure[, "1960", drop = FALSE])

[1] 5 1

Subsetting Data frames

All subsetting operators ([, [[ and $) can be used
As before, [ can select more than one element
Both [[ and $ select one item; the difference is that $ cannot be used with computed values like “i + 1” (index + 1)
x$name is equivalent to x[["name", exact = FALSE]]
Other than these operators, a more convenient way to subset data frames is with function subset()

Example data set: USArrests

# Viewing first 6 rows
head(USArrests)

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

# Computing averages (medians) of assault, murder and rape using "$"
avg_murder  <- median(USArrests$Murder)
avg_assault <- median(USArrests$Assault)
avg_rape    <- median(USArrests$Rape)

# Using "[" subset states with above average assault, murder and rape
high_crime <- USArrests[USArrests$Murder > avg_murder & USArrests$Assault > avg_assault & USArrests$Rape > avg_rape, ]

# Sort (by decreasing order for Murder) and output names of states
high_crime <- high_crime[order(high_crime$Murder, decreasing = TRUE),]
row.names(high_crime)

 [1] "Georgia"        "Florida"        "Louisiana"      "South Carolina"
 [5] "Alabama"        "Tennessee"      "Texas"          "Nevada"        
 [9] "Michigan"       "New Mexico"     "Maryland"       "New York"      
[13] "Illinois"       "Alaska"         "California"     "Missouri"      
[17] "Arizona"        "Colorado"

# Subset a column without name attribute
high_crime[[1]]

 [1] 17.4 15.4 15.4 14.4 13.2 13.2 12.7 12.2 12.1 11.4 11.3 11.1 10.4 10.0
[15]  9.0  9.0  8.1  7.9

# Or
USArrests[["Assault"]]

 [1] 236 263 294 190 276 204 110 238 335 211  46 120 249 113  56 115 109
[18] 249  83 300 149 255  72 259 178 109 102 252  57 159 285 254 337  45
[35] 120 151 159 106 174 279  86 188 201 120  48 156 145  81  53 161

Subsetting with function "subset()"

Function subset() can be used to subset any vector, but it is most suitable for data frames
Here we will use it to subset high_crime states as we did before
We use function with() to access variables without referring to the data frame by name (subset() itself also evaluates its conditions within the data frame)

high_crime2 <- with(USArrests, subset(USArrests, Murder > avg_murder & Assault > avg_assault & Rape > avg_rape, Murder:Rape))
high_crime2 <- high_crime2[order(high_crime2$Murder, decreasing = TRUE), ]

# Check both data sets are identical
identical(high_crime, high_crime2)

[1] TRUE

Subsetting Lists

Lists can be subset with all three subsetting operators
Rule of thumb: subsetting with [ returns a list; subsetting with [[ or $ returns the element itself, with its own type, e.g. if the list holds a data frame, subsetting with [[ or $ will output a data frame
Example data set: first 10 values of R's “state.center” data set

Subsetting lists (cont)

# Example data
state.center; class(state.center)

$x
 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736

$y
 [1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777
 [9] 27.8744 32.3329

[1] "list"

Subsetting lists (cont)

# Using `[` outputs a list
state.center[1]

$x
 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736

class(state.center[1])

[1] "list"

Subsetting lists (cont)

# Using `[[` outputs elements type
state.center[[1]]

 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736

class(state.center[[1]])

[1] "numeric"

Subsetting lists

# Using "$" outputs elements type
state.center$x

 [1]  -86.7509 -127.2500 -111.6250  -92.2992 -119.7730 -105.5130  -72.3573
 [8]  -74.9841  -81.6850  -83.3736

class(state.center$x)

[1] "numeric"

Using SQL statements to subset data frames

Database semantics can sometimes be quite handy in subsetting, e.g. when a subset has to meet certain conditions
Core database statements are:
- SELECT
- FROM
- WHERE
- ORDER BY

Using SQL statements to subset data frames

If interested, read a small introduction to SQL statements in R's Data Import/Export manual (4.2) or go online and learn from “www.sqlcourse.com”
Discussing this here might take us out of scope, but it's good to know it's possible in R using contributed packages like “sqldf” and “dplyr”.
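
Without installing any package, the four core statements can be mirrored in base R. The following is only a sketch of the correspondence, not how “sqldf” or “dplyr” work internally; the query and the threshold of 10 are made up for illustration.

```r
# SQL (for comparison only, not run here):
#   SELECT Murder, Assault FROM USArrests WHERE Murder > 10 ORDER BY Murder DESC

# WHERE and SELECT via subset(); FROM is the data frame itself
res <- subset(USArrests, Murder > 10, select = c(Murder, Assault))

# ORDER BY ... DESC via order()
res <- res[order(res$Murder, decreasing = TRUE), ]
head(res)
```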

Other functions useful for data sets

Function	Description
str	Compactly displays the internal structure of an object such as a data frame
head	Prints first part, default is first 6 rows
tail	Prints last part, default is last 6 rows
attach	Puts data frame on R's search path, hence variables are accessible without reference to data frame name
detach	Removes data frame from R's search path. Recommended after completion of task

Other useful functions (cont)

Function	Description
with	Recommended alternative to `attach`, evaluates an expression or function call using a data frame's variables
which	Locates indices of logical value TRUE. Used for indexing data frame elements
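
A short sketch of with() and which() on the built-in USArrests data; the cut-off of 15 is arbitrary and chosen only for illustration.

```r
# with(): refer to columns directly inside one expression
with(USArrests, mean(Murder))

# which(): positions where a condition is TRUE
idx <- which(USArrests$Murder > 15)
idx
row.names(USArrests)[idx]   # states at those positions
```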