Introduction to Data Analysis and Graphics in R

author: Hellen Gakuruh
date: 2017-03-10
autosize: true

Slide 4: Summarizing Data

Outline

What we shall cover

Numerical summaries for discrete variables
Numerical summaries for continuous variables
Tables for dichotomous variables
Tables for categorical variables
Tables for ordinal variables

Introduction

type: section

A variable is a quantity whose value can change from one observation to the next
Discrete variables take countable values (obtained by counting)
Continuous variables can take any value within a range (obtained by measuring)
Dichotomous variables have two values, like “Yes” and “No” or TRUE and FALSE
Categorical variables are qualitative variables whose values are non-numerical (text) with no ordering, like gender: “Female” and “Male”

Introduction cont.

type: section

Ordinal variables are qualitative variables whose values are textual (non-numeric) with a natural ordering, like Likert scales or level of education
There are two ways to describe a variable: numerically and graphically
Numerical summaries comprise measures of central tendency, measures of spread/variability, and shape of distribution (the latter is often not reported, but is used to guide additional analysis)
All these variables (discrete, continuous, dichotomous, categorical, and ordinal) can be described by these measures, but each has its own computation and presentation

Measures of central tendency

type: section

There are three commonly used/reported measures of central tendency
- Mean (arithmetic)
- Median
- Mode
Mean is the average of all values, i.e. the sum of observations divided by the number of observations

=================================================================== type: sub-section

Median is the central value when values are ordered
Mode is the most frequently occurring value
There are at least three measures of dispersion
- Range
- Inter-quartile range (IQR)
- Variance and standard deviation

================================================================== type: sub-section

Range is the difference between the maximum and minimum values, often reported together with the two values themselves
IQR is the range within which the middle 50% of ordered values lie (an order statistic)
Standard deviation is the typical distance of values from the mean. It is the square root of the variance, which is the average squared distance from the mean.
A distinction is made between a sample and a population
Measures for a population are called population parameters and they are often unknown
Measures for a sample are called sample statistics
Population mean is denoted as \(\mu\) (pronounced as “mu”)
Sample mean is denoted as \(\bar{x}\) (pronounced as “x bar”)

Computing mean

type: sub-section

Since mean is the sum of all values divided by the number of values, population and sample mean can be expressed respectively as:
\[\mu = \frac{\sum{X}}{N}\]
where \(X\) are the values and \(N\) is the number of values in the population, and
\[\bar{x} = \frac{\sum{x}}{n}\]
where \(x\) are the values and \(n\) is the number of values in the sample
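
As a small illustration of the sample formula, the sketch below computes a mean by hand and compares it with R's built-in mean(). The vector x holds made-up values chosen only for this example.

```r
# Hypothetical values, chosen only for illustration
x <- c(2, 4, 6, 8, 10)

# Sample mean from the definition: sum of values divided by their count
x_bar <- sum(x) / length(x)
x_bar

# The built-in function should agree with the hand computation
all.equal(x_bar, mean(x))
```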

Locating median

type: sub-section

Median depends on whether the number of observations is odd or even
For an odd number of values, median is the middle value, like 3 in data set {1,2,3,4,5}
For an even number of values, median is the average of the two middle values, like the average of 3 and 4 for data set {1,2,3,4,5,6}, which is 3.5
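
The two cases can be sketched in R using the same two data sets. The helper median_by_hand is a hypothetical name introduced for illustration; in practice the built-in median() does the same job.

```r
# Data sets taken from the two examples above
five_values <- c(1, 2, 3, 4, 5)
six_values  <- c(1, 2, 3, 4, 5, 6)

# Hypothetical helper: sort, then pick the middle value(s)
median_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                 # odd count: the single middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])   # even count: average of the two middle values
  }
}

median_by_hand(five_values)   # 3
median_by_hand(six_values)    # 3.5
```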

Determining mode

type: sub-section

Mode is the most frequently occurring value (observation)
To get the mode, count the number of occurrences of each unique value (observation) and select the one with the most occurrences
Number of occurrences is called frequency
Mode for data set {1, 2, 1, 1, 3, 3} is 1, which occurs three times
A data set can have 0, 1, 2, or more than 2 modes (no mode, uni-modal, bi-modal, or multi-modal), unlike mean and median, which are always single values
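
Base R has no function that returns the statistical mode, so counting with table() is one simple way to find it. The helper name modes below is hypothetical, and the sketch returns every value tied for the highest frequency.

```r
# Hypothetical helper: all values that share the highest frequency
modes <- function(x) {
  counts <- table(x)   # frequency of each unique value
  names(counts)[as.vector(counts) == max(counts)]
}

x <- c(1, 2, 1, 1, 3, 3)
table(x)                   # 1 occurs 3 times, 2 once, 3 twice
modes(x)                   # "1" (returned as a character label)
modes(c(1, 1, 2, 2, 3))    # "1" "2": a bi-modal data set
```

When every value occurs equally often, this sketch returns all of them; whether to report that as "no mode" is a matter of convention.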

Standard deviation (SD)

type: sub-section

Used to determine how spread out values are from their average (mean)
A small SD means values are clustered around their mean and a big SD means values are spread out
Computed by first subtracting the mean from each value to get deviations. The deviations are then squared, because summing them directly would always give 0. The squared deviations are summed and divided by the number of values, giving the variance. Finally, since variance is in squared units, its square root is taken to get SD.
For samples, where population parameters are unknown, dividing by the number of observations underestimates variance, hence the sum is divided by “n-1”, i.e. \(s = \sqrt{\sum(x-\bar{x})^2/(n-1)}\)
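
The steps can be followed one at a time in R. This is a minimal sketch on made-up values (the vector x is an assumption for illustration), ending with a comparison against the built-in sd(), which also divides by n - 1.

```r
# Hypothetical values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

deviations <- x - mean(x)    # distance of each value from the mean
sum(deviations)              # deviations cancel out, so the sum is 0

# Sample variance: squared deviations summed, then divided by n - 1
s2 <- sum(deviations^2) / (length(x) - 1)

# Sample standard deviation: square root of the variance
s <- sqrt(s2)

all.equal(s, sd(x))          # agrees with the built-in sd()
```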

Skewness

type: sub-section

Skewness measures symmetry of values around their mean
If values are symmetrical, i.e. the left and right sides of the average are mirror images, then the distribution is said to have “no skewness”
If the bulk of values is to the left with a trail of values to the right, then it’s positively skewed
If the bulk of values is to the right with a trail of values to the left, then it’s negatively skewed
Measurement involves balancing values on both sides of the mean; if the difference is zero, then it’s symmetrical, else +ve or -ve
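
One common way to put a number on this balance is the standardized third central moment. The sketch below assumes that definition; skew_moment is a hypothetical helper and may differ from the author's own function used later in these slides. The three small data sets are made up to show each sign.

```r
# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 3/2
skew_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

symmetric  <- c(1, 2, 3, 4, 5)          # mirror image around 3
right_tail <- c(1, 1, 2, 2, 3, 10)      # bulk on the left, trail to the right
left_tail  <- -right_tail               # mirrored: trail to the left

skew_moment(symmetric)    # 0
skew_moment(right_tail)   # positive
skew_moment(left_tail)    # negative
```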

Kurtosis

type: sub-section

A measure of tailedness: fat/thin or long/short tails
Not a measure of “peakedness”, as often discussed in older texts
Reason: the measure gives more weight to values far away from the average, thus reflecting how many values lie far from the average and how far away they are
Kurtosis is described as being “mesokurtic”, “leptokurtic” or “platykurtic”

================================================================ type: sub-section

Mesokurtic means tails are like those of a normal distribution. Leptokurtic means it is “slender” with fatter tails, and has a greater kurtosis than a mesokurtic (normal) distribution
Platykurtic means it is broad with thinner tails, and has a lesser kurtosis than a normal distribution
The normal distribution, which has a kurtosis of 3, is the usual reference for kurtosis
Kurtosis measured relative to this reference, i.e. kurtosis minus 3, is referred to as excess kurtosis
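
A moment-based version of this measure is the fourth standardized moment minus 3. The sketch below assumes that definition; excess_kurtosis_moment is a hypothetical helper, not necessarily the function the author uses later, and both data sets are made up to show the two signs.

```r
# Hypothetical helper: fourth standardized moment minus the reference value 3
excess_kurtosis_moment <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Evenly spread values: thin tails, so excess kurtosis is negative (platykurtic)
flat <- 1:100

# Mostly identical values with two far-away observations:
# heavy tails, so excess kurtosis is positive (leptokurtic)
heavy <- c(rep(0, 98), -10, 10)

excess_kurtosis_moment(flat)    # negative
excess_kurtosis_moment(heavy)   # positive
```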

Numerical summaries for discrete variables

type: section

Can be described by mean or median as its average
If data is skewed, median is appropriate, otherwise compute mean
If average is mean, then dispersion is reported as standard deviation. If average is median, then dispersion should be IQR
Shape of distribution, as measured by skewness and kurtosis, can inform which average (mean or median) to use. It also guides inferential statistics
Example: hypothetical random sample of students’ scores

============================================================== type: sub-section

# Data
set.seed(4)
scores <- as.integer(round(rnorm(50, 78, 1)))

# Source the author's own functions for printing frequency tables
source("~/R/Scripts/desc-statistics.R")

# Frequency table
freq(scores)

  Values Freq Perc
1     76    2    4
2     77    8   16
3     78   19   38
4     79   17   34
5     80    4    8
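
The freq() helper comes from the author's script, which is not reproduced in these slides. As a rough idea of how such a table could be built with base R, here is a minimal sketch. The name freq_sketch is hypothetical, and the scores are rebuilt from the counts printed above rather than drawn at random.

```r
# Hypothetical stand-in for the author's freq(); the real function
# lives in a script that is not shown here
freq_sketch <- function(x) {
  counts <- table(x)
  data.frame(
    Values = names(counts),
    Freq   = as.vector(counts),
    Perc   = round(100 * as.vector(counts) / length(x))
  )
}

# Scores rebuilt from the frequency table above
scores_from_table <- rep(76:80, times = c(2, 8, 19, 17, 4))

tab <- freq_sketch(scores_from_table)
tab
```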

=================================================================== type: sub-section

# Mean
mean(scores)

[1] 78.26

# Median
median(scores)

[1] 78

# Range
rng <- range(scores)
cat("Range for this distribution is", diff(rng), paste0("(", paste(rng, collapse = ", "), ")"))

Range for this distribution is 4 (76, 80)

==================================================================== type: sub-section

# Where 50% of values lie
q <- round(quantile(scores, c(0.25, 0.75)))
cat("50% of values lie between score of about", q[1], "and", paste0(q[2], ":"), "an IQR of about", round(IQR(scores)))

50% of values lie between score of about 78 and 79: an IQR of about 1

# Standard deviation (spread of values around mean)
sd(scores)

[1] 0.964894
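
As a check on the arithmetic, the mean and standard deviation can be recovered from the frequency table alone. This sketch assumes the values and counts printed in the frequency table for scores; the variable names are illustrative.

```r
# Values and counts copied from the frequency table for scores
values <- 76:80
freqs  <- c(2, 8, 19, 17, 4)
n      <- sum(freqs)

# Mean: each value weighted by how often it occurs
mean_from_table <- sum(values * freqs) / n
mean_from_table     # 78.26

# Sample SD: weighted squared deviations divided by n - 1, then square root
var_from_table <- sum(freqs * (values - mean_from_table)^2) / (n - 1)
sd_from_table  <- sqrt(var_from_table)
round(sd_from_table, 4)   # 0.9649
```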

====================================================================== type: sub-section

# Functions developed to measure and interpret skewness and kurtosis
source("~/R/Scripts/skewness-kurtosis-fun.R")

# Skewness
m3_std(scores)

[1] -0.2551918

skewness_interpreter(m3_std(scores))

[1] "approximately symmetric"

# Kurtosis
excess_kurt(scores)

[1] -0.365273

excess_interpreter(excess_kurt(scores))

[1] "approximately mesokurtic"

Conclusion (discrete numerical measures)

type: sub-section

From skewness and kurtosis we can tell this data set is approximately symmetric around its mean, hence mean is an appropriate representative value (a value to describe data)
Since mean is our representative value, standard deviation is the appropriate measure of dispersion
SD of 0.964894 (about one score point) indicates values are tightly clustered around the mean
Display-wise, we expect to see an almost symmetric distribution

Numerical summaries for continuous variables

type: section

Continuous variables have the same numerical summaries as discrete variables
The exception is how to locate the mode, since a continuous variable can take an infinite number of values within a range
Locating the mode then involves grouping values into useful intervals, with boundaries sometimes called breaks. This process is called “discretization”
The number of intervals typically ranges from 2 to 10, with about five being most common (the data determines the choice)
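
In base R, cut() is one way to do this grouping. The sketch below uses made-up values and break points every 0.5 units, both chosen only for illustration, and then reads off the modal class from the counts.

```r
# Hypothetical measurements, for illustration only
x <- c(4.6, 4.9, 5.1, 5.2, 5.3, 5.4, 5.8, 6.1, 6.4)

# Break points every 0.5 units; intervals are right-closed, e.g. (5,5.5]
breaks  <- seq(4.5, 6.5, by = 0.5)
classes <- cut(x, breaks = breaks)

# Count the values falling in each interval
counts <- table(classes)
counts

# Modal class: the interval with the highest count
modal_class <- names(counts)[which.max(counts)]
modal_class   # "(5,5.5]"
```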

============================================================ type: sub-section

Example data: Random hypothetical sample of human height in feet

# Example data
set.seed(4)
height <- round(rnorm(50, 5.4), 2)
sort(height)

 [1] 3.60 3.71 3.92 4.12 4.47 4.54 4.58 4.65 4.76 4.86 4.93 5.00 5.02 5.12
[15] 5.12 5.17 5.19 5.30 5.35 5.36 5.42 5.43 5.50 5.55 5.57 5.57 5.58 5.62
[29] 5.78 5.97 5.99 6.00 6.09 6.12 6.26 6.29 6.31 6.33 6.45 6.57 6.64 6.66
[43] 6.69 6.69 6.71 6.74 6.94 7.04 7.18 7.30

=============================================================== type: sub-section

# Average
mean(height)

[1] 5.6352

median(height)

[1] 5.57

# Dispersion
sd(height)

[1] 0.9184931

diff(range(height)); range(height)

[1] 3.7

[1] 3.6 7.3

================================================================ type: sub-section

IQR(height)

[1] 1.28

# Modal Class (interval)
tab <- freq_continuous(height)
as.vector(tab[which.max(tab$Perc), 1])

[1] "(5,5.5]"

# Frequency table of height intervals
freq_continuous(height)

   Values Freq Perc
1 (3.5,4]    3    6
2 (4,4.5]    2    4
3 (4.5,5]    7   14
4 (5,5.5]   11   22
5 (5.5,6]    9   18
6 (6,6.5]    7   14
7 (6.5,7]    8   16
8 (7,7.5]    3    6

============================================================ type: sub-section

# Skewness
m3_std(height)

[1] -0.2186212

skewness_interpreter(m3_std(height))

[1] "approximately symmetric"

# Kurtosis
excess_kurt(height)

[1] -0.7024805

excess_interpreter(excess_kurt(height))

[1] "moderately platykurtic"

Tables for dichotomous variables

type: section

Have two values e.g. “Yes” & “No”
Best presented in frequency tables

set.seed(4)
dichot <- sample(c("Yes", "No"), 100, replace = TRUE)
freq(dichot)

  Values Freq Perc
1     No   57   57
2    Yes   43   43

Tables for categorical variables

type: section

Just like dichotomous variables (which are categorical), these can be displayed in a frequency table if univariate, and in a contingency table for bivariate relationships

========================================================== type: sub-section

groups <- rep(c("a", "b", "c"), 200)
set.seed(4)
outcome <- sample(c("improved", "same", "decreased"), length(groups), replace = TRUE, prob = c(0.7, 0.2, 0.1))
freq(groups)

  Values Freq Perc
1      a  200   33
2      b  200   33
3      c  200   33

freq(outcome)

     Values Freq Perc
1 decreased   66   11
2  improved  418   70
3      same  116   19

Contingency table

type: sub-section

source("~/R/Scripts/desc-statistics.R")
contigency_tab(groups, outcome)

      outcome
groups decreased perc improved perc same perc
     a        22   33      136   33   42   36
     b        23   35      140   33   37   32
     c        21   32      142   34   37   32
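
The perc columns above appear to be column percentages, i.e. each group's share of a given outcome. That reading is an inference from the numbers, not something the author's script confirms. Under that assumption, the same figures can be reproduced with base R by copying the counts into a matrix and calling prop.table() with margin = 2.

```r
# Counts copied from the contingency table above
counts <- matrix(
  c(22, 136, 42,
    23, 140, 37,
    21, 142, 37),
  nrow = 3, byrow = TRUE,
  dimnames = list(groups  = c("a", "b", "c"),
                  outcome = c("decreased", "improved", "same"))
)

# Column percentages: within each outcome, the share from each group
col_perc <- round(100 * prop.table(counts, margin = 2))
col_perc

# Row and column totals for reference
addmargins(counts)
```

Starting from the raw vectors, table(groups, outcome) would give the counts directly, and the same prop.table() call would apply.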