Introduction to Data Analysis and Graphics2

Hellen Gakuruh
2017-03-07

Session Two

Vector and Assignment, Data Objects and Data Importation

Outline

By the end of this session we will know about:

  • Vectors and Assignment
  • Data types
  • Data structures and
  • Importing data into R

Vector and Assignment

  • The simplest data structure in R is a vector. From a data point of view, a vector is a collection of elements. These elements can be numeric values, characters, logical values, or date and time values.
  • Vectors are created with the function “c”, which stands for “concatenate” (or “combine”), e.g. a numeric vector c(1, 5, 6, 8)
  • These vectors can be named using the assignment operator “<-” or the function “assign()”, e.g. to assign vector c(1, 5, 6, 8) to the name “num”: num <- c(1, 5, 6, 8) or assign("num", c(1, 5, 6, 8)). We usually use “<-” for assignment; “assign()” is mostly used when developing functions
  • A vector can have any length from 0 (an empty vector) up to 2^31 - 1 elements, about 2.1474836 × 10^9; 64-bit builds of R also allow longer “long vectors”
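
The points above can be tried out in a few lines. This is a minimal sketch using base R only; the names num2 and longer are illustrative and do not come from the slides.

```r
# Create a numeric vector with c() and bind it to a name with <-
num <- c(1, 5, 6, 8)

# assign() does the same thing, taking the name as a string
assign("num2", c(1, 5, 6, 8))

# Both names now refer to equal vectors of length 4
identical(num, num2)
length(num)

# c() also concatenates existing vectors with new values
longer <- c(num, 10, 12)
length(longer)

# Largest value an R integer can hold (2^31 - 1)
.Machine$integer.max
```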

Data types

R recognises seven data types. These are:

  • Logical
  • Integer
  • Real/Double
  • String/Character
  • Factor

cont…

  • Complex
  • Raw
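  • The R manuals specify six basic types: logical, integer, double, character, complex and raw. A factor is not a seventh basic type: it is stored as an integer vector with a “levels” attribute and class “factor”, but it behaves differently enough to be treated separately here.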
  • In this sub-section we introduce these data types

Data types: Logical

  • These are vectors with only TRUE and FALSE values like c(TRUE, TRUE, FALSE, TRUE, FALSE)
  • Can be considered as binary vectors in analysis
  • Besides categorical variables with these values, logical vectors are often created by comparison operators (“<”, “>”, “<=”, “>=”, “==”, “!=”) and combined with logical operators (“!”, “|”, “||”, “&” and “&&”)
  • During analysis, these vectors can be coerced to numeric values, in which case TRUE becomes 1 and FALSE becomes 0
  • These vectors can also include the value “NA”, which in R means “Not Available”, a placeholder for missing values.
  • Most operations on a vector containing NA return NA, since the missing value is unknown
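
A short sketch of how logical vectors arise and how NA propagates. The values in x are made up for illustration.

```r
x <- c(3, 8, NA, 1, 10)

# A comparison returns a logical vector; the NA element stays NA
big <- x > 5
big

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(big)                 # NA, because one element is unknown
sum(big, na.rm = TRUE)   # drops the NA before counting

# Explicit coercion to numeric
as.numeric(c(TRUE, FALSE, TRUE))

# is.na() locates the missing values
is.na(x)
```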

Data types: Integer

  • These are basically positive and negative numbers without fractions {…, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, …}
  • In R, integers are written with the suffix L, e.g. c(-3L, 0L, 2L, 5L, 6L). You can confirm a vector is an integer vector with the function is.integer(), e.g. is.integer(c(-3L, 0L, 2L, 5L, 6L))
  • Example of a variable which can be considered to naturally have integers is “number of people” (you can't have a fraction of a person)
  • Mathematically denoted by \( \mathbb{Z} \)

Real/Double

  • A real number is any number along an infinite number line
  • Real numbers include fractions
  • Denoted mathematically with \( \mathbb{R} \)
  • Any numeric vector whose values are not followed by the letter “L” is stored as double, e.g. c(-3, 0, 2, 5, 6). You can confirm that a vector is a double vector with the function “is.double”, e.g. is.double(c(-3, 0, 2, 5, 6))

String/Character

  • Composed of letters, words or any other text
  • Denoted by single or double quotation marks
  • R has a built-in character vector of the lowercase letters of the alphabet, called letters
  • Examples: c("a", "b", "c"), letters, c('cats', 'and', 'dogs')
  • You can check whether a vector is a character vector with the function is.character(), e.g. is.character(letters)
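
A brief sketch of working with character vectors, using only base R functions. The object names pets and sentence are illustrative.

```r
# Single and double quotes are equivalent
pets <- c('cats', 'and', "dogs")
is.character(pets)

# Numbers in quotes are characters, not numbers
is.character("10")
is.numeric("10")

# First five entries of the built-in letters vector
letters[1:5]

# nchar() counts the characters in each element
nchar(pets)

# paste() with collapse joins the elements into a single string
sentence <- paste(pets, collapse = " ")
sentence
```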

Data type: Factors

  • In R a factor vector is a categorical variable with discrete classification (grouping)
  • Example
cat <- factor(c(rep("Y", 28), rep("N", 10)))
is.factor(cat)
[1] TRUE
levels(cat)
[1] "N" "Y"
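
To see what a factor is made of, the sketch below rebuilds the same yes/no data under the illustrative name answer (chosen so that it does not share a name with the base function cat()) and inspects it.

```r
# Same values as the example above, stored under a different name
answer <- factor(c(rep("Y", 28), rep("N", 10)))

# Internally a factor is an integer vector of codes plus a "levels" attribute
typeof(answer)
class(answer)
head(as.integer(answer))   # codes that index into levels(answer)

# table() counts the observations in each level
counts <- table(answer)
counts
```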

Data type: Complex

  • These are vectors whose values have a real and an imaginary part. The imaginary part is denoted by the letter “i”
  • Mathematically, complex numbers make it possible to take the square root of negative values
# Example, complex vector
3+2i
[1] 3+2i
# Confirm it's complex
is.complex(3+2i)
[1] TRUE

Data type: Raw

  • These are vectors of raw bytes, the basic units of computer data storage
  • Closer to computer language (0's and 1's) than to human-readable language
  • Integers and doubles are jointly referred to as numeric
  • The most commonly used data types are logical, numeric and character. Complex and raw data types are rarely used
int <- c(-3L, -2L, -1L, 0L, 1L, 2L, 3L)
is.integer(int)
[1] TRUE
is.numeric(int)
[1] TRUE
doub <- c(-3, -2, -1, 0, 1, 2, 3)
is.double(doub)
[1] TRUE
is.numeric(doub)
[1] TRUE

Data structures

  • There are two broad types of data structures in R
    • Atomic vectors
    • Generic (list) vectors
  • These structures have three properties
    • Type
    • Length and
    • Attributes

  • The function "typeof" is used to establish a vector's type, the function "length" to determine its length, and the function "attributes" to get additional information about a vector
  • Atomic vectors and lists differ in their type as atomic vectors can only contain one data type while lists can contain any number of data types.
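
The three properties can be inspected directly. This is a small sketch with made-up values; the names v, l and mixed are illustrative.

```r
# A named double vector
v <- c(a = 1.5, b = 2, c = 3.25)
typeof(v)
length(v)
attributes(v)   # a list holding the names

# A list can hold elements of different types
l <- list(1L, "two", TRUE)
typeof(l)
sapply(l, typeof)

# Mixing types in c() coerces everything to one common type
mixed <- c(1L, "two", TRUE)
typeof(mixed)
mixed
```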

Atomic Vectors

  • Contain only one data type. They include one-dimensional atomic vectors, two-dimensional atomic vectors called “matrices” and multi-dimensional atomic vectors called “arrays”.
  • Dimensionality can be considered as number of indices required to address any element in a vector e.g. vector “cat” requires one index to address any value, for example index “4” means fourth value which is Y
  • Single variables are all atomic vectors of one dimension
  • To check whether a vector is atomic or a list, use is.atomic() or is.list(). Note that is.vector() is not a general test for vectors: it returns TRUE only if the object is a vector with no attributes other than names
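
The difference between these checks shows up once a vector has attributes. A minimal sketch with illustrative objects:

```r
x   <- c(1, 2, 3)
lst <- list(1, "a")

is.atomic(x)     # TRUE
is.list(x)       # FALSE
is.atomic(lst)   # FALSE
is.list(lst)     # TRUE

# is.vector() accepts plain and named vectors, including lists ...
named <- c(a = 1, b = 2)
is.vector(x)
is.vector(named)
is.vector(lst)

# ... but not an atomic vector that carries a dim attribute
m <- 1:6
dim(m) <- c(2, 3)
is.atomic(m)     # still atomic
is.vector(m)     # FALSE, because of the dim attribute
```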

Atomic vectors: Matrices

  • Two dimensional atomic vectors, they contain data of the same type
  • Any atomic vector can be converted to a matrix by adding a dim attribute
cat <- c(rep("Y", 28), rep("N", 10))
typeof(cat)
[1] "character"
dim(cat)
NULL
is.matrix(cat)
[1] FALSE
dim(cat) <- c(19, 2)
typeof(cat)
[1] "character"
dim(cat)
[1] 19  2
is.matrix(cat)
[1] TRUE
  • Besides setting "dim()" to turn a one-dimensional atomic vector into a multi-dimensional one, matrices can be created with "matrix()" or by coercing another data object with "as.matrix()"
typeof(airmiles)
[1] "double"
airmiles2 <- matrix(airmiles, nrow = 8, ncol = 3)
is.matrix(airmiles2)
[1] TRUE
# as.matrix() takes no nrow/ncol arguments; a vector becomes a one-column matrix
airmiles3 <- as.matrix(airmiles)
is.matrix(airmiles3)
[1] TRUE
rm(airmiles2, airmiles3)

Special 1 & 2 dimension atomic vectors

Time series objects

  • These are vectors used to store observations collected at given time points (intervals) over a period of time, e.g. observations collected every three months for five years.
  • The distinguishing feature of these data is time. The interval is usually constant, like three months (regular), but in other cases it might not be (irregular)
  • In R, time series data are numeric vectors whose class attribute is “ts”, meaning time series
  • Time series vectors can be either a one-dimensional atomic vector, like the “AirPassengers” data set in R, or a two-dimensional matrix, like "EuStockMarkets"
typeof(AirPassengers)
[1] "double"
attr(AirPassengers, "class")
[1] "ts"
typeof(EuStockMarkets)
[1] "double"
attr(EuStockMarkets, "class")
[1] "mts"    "ts"     "matrix"

Atomic vectors: Arrays

  • Arrays are multi-dimensional atomic vectors.
  • A matrix is a two-dimensional array.
  • Arrays with more than two dimensions are rarely used, but it's good to know they exist
  • Created like matrices: by setting "dim()", e.g. dim(a) <- c(6, 2, 2), or with array() or as.array()
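
A sketch of both ways of building a three-dimensional array. The values 1:24 are chosen only so that the dimensions c(6, 2, 2) fit; a, b and m are illustrative names.

```r
# Route 1: add a dim attribute to an existing vector
a <- 1:24
dim(a) <- c(6, 2, 2)
is.array(a)
is.matrix(a)    # FALSE: a matrix has exactly two dimensions

# Three indices are needed to address one element (row, column, layer)
a[4, 2, 1]

# Route 2: array() builds the same structure in one call
b <- array(1:24, dim = c(6, 2, 2))
all(a == b)

# A matrix is itself an array with two dimensions
m <- matrix(1:6, nrow = 2)
is.array(m)
length(dim(m))
```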

Data structures: Generic vectors

  • Lists are data structures which can contain more than one data type.
  • There are two types of lists: two-dimensional lists called "data frames" and ordinary "lists"

Data frames

  • The most recognizable data structure
  • A core data structure in R
  • Presents data in rows and columns like a matrix, but here different columns can have different data types
# Example
head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
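
To show columns of different types side by side, here is a minimal sketch of a data frame built by hand. The column names and values are invented for illustration.

```r
# Hypothetical records; every column has its own type
people <- data.frame(
  name   = c("Ann", "Ben", "Cara"),
  age    = c(34L, 28L, 41L),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# One type per column
sapply(people, typeof)

# Rows and columns, as for a matrix
dim(people)

# Underneath, a data frame is a list of equal-length columns
is.data.frame(people)
is.list(people)
typeof(faithful)
```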

Generic vectors: Lists

  • These are a unique data structure
  • Can contain any number and type of object, not just data. Can contain sub-lists hence also called recursive
  • Created with function “list()”. Can also coerce other structures to a list with function “as.list()”
  • We will create this structure in our next session

Importing and Exporting Data in R

  • Data importation is also referred to as “reading in” data
  • How data are read depends on the type and location of the file
  • This sub-session covers reading in local R, text, Excel, database and other statistical program files
  • We also discuss web scraping

Reading in .RData

  • Data created in R can be stored in an .RData file
  • This could be any data structure or a collection of objects saved from the active workspace (global environment)
  • Function “save.image()” stores the whole workspace (by default in a file named “.RData” in the working directory); function “load()” reads any “.RData” file back in. The command history in “.Rhistory” is read with “loadhistory()” instead
# See current objects
ls()
[1] "cat"  "doub" "int" 
# Store in an external .RData file
save.image()
# Remove all objects from workspace/global environment
rm(list = ls())
ls()
character(0)
# Read in .RData
load(".RData")
# Check we have them back
ls()
[1] "cat"  "doub" "int" 
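
When only some objects need to be kept, save() can be used in place of save.image(). The sketch below writes to a temporary file (an assumption made so that no existing .RData file is overwritten) and uses shortened versions of the int and doub vectors.

```r
int  <- c(-3L, 0L, 3L)
doub <- c(-3, 0, 3)

# Save two named objects to a temporary .RData file
rdata_file <- tempfile(fileext = ".RData")
save(int, doub, file = rdata_file)

# Remove them, then read them back
rm(int, doub)
restored <- load(rdata_file)

# load() returns the names of the objects it recreated
restored
int
doub
```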

R's core importing function "read.table()"

  • read.table is R's core importing function
  • Many other import functions, including some in contributed packages, build on this function
  • Reads a file and creates a data frame from it
  • It has a number of wrapper functions (functions which provide a convenience interface to another function, for example by supplying pre-defined default values, which makes function calls shorter)
  • Wrapper functions include read.csv(), read.csv2(), read.delim() and read.delim2()
  • CSV files are comma-separated files
  • Delim files are delimited text files; “delimited” refers to how values are separated, for example with tabs
  • Both csv and delim files are relatively easy to read into R as long as the separator (delimiter) is known
  • If the separator or delimiter is not known and the file cannot be opened, it is best to read in a few lines with the readLines() function

Live demo (reading in CSV file)
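
For readers without the demo file, the following sketch writes a tiny made-up CSV file to a temporary location and reads it back. The file contents and the names csv_file, scores and scores2 are assumptions for illustration only.

```r
# Write a small, made-up CSV file to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,group,score",
             "1,A,4.5",
             "2,B,3.0",
             "3,A,5.2"), csv_file)

# Peek at the raw text when the separator is unknown
readLines(csv_file, n = 2)

# read.csv() presets header = TRUE and sep = "," for read.table()
scores <- read.csv(csv_file)

# The same import spelled out with read.table()
scores2 <- read.table(csv_file, header = TRUE, sep = ",")

str(scores)
all.equal(scores, scores2)
```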

Reading in Excel files

  • Base R does not have a function to read in Excel files
  • But many contributed packages have functions to read them in
  • The core reference for importing this type of file is one of the R Project manuals, R Data Import/Export, specifically chapter 9.
  • The recommendation made there is to try to convert the Excel file into a “.csv” (comma-separated) or “delim” (tab-separated) file.

Live demo (reading Excel file)

Reading in Database data

  • A word of caution: database data tend to be large and R holds its data in memory, so R is not too good with very large data. Hence read in only part of the data, or look for ways to increase the memory available to R processes, like using the cloud.
  • Most Relational Database Management Systems (RDBMS) store data in tables similar to R's data frames, where columns are called “fields” and rows are called “records”.
  • Extracting part of a relational database requires the database query language (SQL), the core of which is the SELECT statement.
  • In general, a SELECT query uses:
    • FROM to select the table
    • WHERE to specify a condition for inclusion and
    • ORDER BY to sort results (this is important as an RDBMS does not guarantee the order of its rows the way R's data frames do)
  • There are a number of contributed packages on CRAN for reading RDBMS data; these include DBI, RMySQL, ROracle, RPostgreSQL and RSQLite.
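
The SELECT pattern above can be sketched against a throwaway in-memory SQLite database. This assumes the DBI and RSQLite packages are installed (the code does nothing if they are not), and the table name "faithful" is simply the built-in data set copied into the database for illustration.

```r
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # Temporary database that lives only in memory
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Copy a built-in data frame into a database table
  DBI::dbWriteTable(con, "faithful", faithful)

  # Read back only part of the table: FROM, WHERE and ORDER BY
  long_waits <- DBI::dbGetQuery(
    con,
    "SELECT eruptions, waiting
       FROM faithful
      WHERE waiting > 90
      ORDER BY waiting DESC"
  )

  DBI::dbDisconnect(con)
  print(head(long_waits))
}
```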

Live demo (reading in RDBMS and web data)

From other statistical software

  • Other statistical software whose data files are often read into R includes SPSS, SAS, Stata and EpiInfo
  • Like Excel and database data, reading in these files requires a package
  • The recommended package is "foreign"; other packages include "readstata13" and "haven".
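
As a rough sketch of the workflow, the code below uses package "foreign" to write the built-in faithful data to a temporary Stata file and read it back, since no real Stata file is available here. The temporary file is an assumption for illustration; read.spss() in the same package plays the equivalent role for SPSS files.

```r
if (requireNamespace("foreign", quietly = TRUE)) {

  # Create a small Stata file to have something to import
  dta_file <- tempfile(fileext = ".dta")
  foreign::write.dta(faithful, dta_file)

  # Read the Stata file into a data frame
  stata_data <- foreign::read.dta(dta_file)

  print(head(stata_data))
  print(dim(stata_data))
}
```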

Live demo (reading SPSS and Stata data files)