Chapter 1 Introduction
The world today presents us with vast amount of data, most of it generated from everyday activity and others emanating from surveys. However, most of this data is underutilized and we often tend to organize surveys before turning available data into useful information. One of the major reasons as to why most data is underutilized is lack of data analysis skills. Having data analysis skills not only helps to identify sources of data but also convert readily available data into useful information. It’s upon this bases that this book is written; this book is geared towards imparting data analysis concepts building them from a foundation level.
Data analysis is often interchanged with statistics, however, a clear distinction exists between the two terms. Data analysis is a broad area which includes statistics; statistics on the other hand is derived from mathematics where it looks at a sample to draw conclusions about a population. In a sense, data analysis can be public health analysis, population analysis, business analysis, social media analysis and so on, in all these types of data analysis, statistics is core, but it is not the only aspect.
In this book we take this broad approach by looking at data analysis from different fields or specialization. Our core focus will be statistics but specifically applied statistics (using statistical concepts to understand readily available data).
Below we discuss a few statistical concepts we will be using during this book and later give a snapshot of what to expect in each of the chapters.
1.1 Some statistical terms/concepts
In this section we go over some of the terms and concepts we will be using.
1.1.1 Variables
A variable is an element or a factor that changes (varies) depending on some condition. Variables can be independent or dependent, discrete or continuous, quantitative or qualitative.
When conducting a study, a dependent variable is that variable being studied while independent variables are all variables being manipulated by study implementers (for experimental studies) or out of control by study implementers but affect study/dependent variable. For example, in an experiment to measure efficacy of a new drug, independent variables could be other drugs addressing the same condition as the new drug while dependent variable would be efficacy level. It is quite common to see independent variables referred to as predictors, or explanatory variables. Dependent variables are also referred to as response or outcome variables.
Variables are classified as discrete if they are numerical variables (based on numbers as opposed to qualitative) and they are count data or their numbers can take on certain values. Example of a discrete variable is number of R packages on CRAN, there is no possibility of having 10.5 packages, it can only be 10 or 11 packages. Continuous variables on the other hand are numerical variables but their values can take on any value withing a range for example shoe size, which can be 6, 6.5, 7 and so on, however, they can not be less than 0 and more than largest recorded foot size (16 or there about).
Classification of variables as either quantitative or qualitative has a great influence on type of data analysis methods used. A quantitative variable is one whose values are numerical, for example anthropometric measures like weight and height, number of something like students in a class and so on. Both discrete and continuous variables are quantitative variables. Qualitative variables are variables whose values are not numerical but text-based, these variables measure some “type” or “level” or “quantity”. For example gender type (female or male), area code (as many as area of concern), educational level (high, medium, low), and residency (urban or rural). Qualitative variables are also known as categorical variables.
1.1.2 Measurement Levels
A vital concept in data analysis is measurement levels of a dependent variable. Measurement levels are classifications used to indicated type of values a variable can have and therefore inform on type of measurement to be used. There are four measurement levels, these are “nominal”, “ordinal”, “interval” and “ratio”.
Nominal level is a qualitative type and variables with this measurement level have categories with no defined/natural order. A good example is gender, in analysis being male or female has no difference and therefore treated same, however, educational level categorized as high medium and low has a natural order and therefore cannot be referred to as nominal.
Ordinal level are categorical variables with natural order and each level is treated differently during analysis. Our earlier example on education level is a good example of an ordinal variable as it has ordered levels. Levels have an increasing or decreasing aspect to it and therefore analytical methods chosen have to appreciate this ordering. Note, in these levels, it is easy to say there is a difference between each level, but degree of this difference is not quantifiable, for example, we know medium education level is higher than “low” education level, but we do not know by how much. if we added some numerical value to these levels like 1 for low education, 2 for medium and 3 for high education level, we still cannot tell if the difference between 1 and 2 is the same as difference between 2 and 3.
Interval and ratio measurement levels are similar in that they are both numerical scales where each level has the same measured distance from the other. The best way to understand these two measurement levels is to picture them on a number line with each level representing a position on the continuous number line. By this latter fact, we can say both interval and ratio variables are continuous variables. The difference between these two scales is value of zero; for interval scales, zero is arbitrary meaning it does not indicate “lack off/absence” of the variable whereas in ratio scale zero means lack off or absence of that variable. A classic example given for interval scales are temperature measured by degrees Fahrenheit/Celsius, a zero degree does not mean temperature does not exist but if this temperature is measured by Kelvins (unit of measure for temperature based on absolute scale) 1, then zero would mean no temperature and therefore become a ratio scale.
1.1.3 Distributions
A very common term in statistics and by extent data analysis is “distribution”. Distribution is a description of relative values of observations an event/trial/experiment. For example, if we were to count the number of laptop types in a classroom, we would end up with relative values such as 10 HP’s, 9 Macs, 5 Dell, 2 Toshiba. Note, these values do not have any relationship with number of these machines in general population, they just represent “laptop distribution in a classroom”. We can also say “distribution” is a listing (or in some case a function as we shall later see), of all possible values and how often they occur.
Distributions are often shown using a table or a graph. When we want to table distribution of one variable like number of laptops in a class by make, we use a “frequency table”. Frequency means number of occurrence of a certain observation like number of HP’s.
Make | Total |
---|---|
HP | 10 |
Macs | 9 |
Dell | 5 |
Toshiba | 2 |
When describing distribution of more than one variable, we use contingency tables (also known as cross tabulation or cross tabs). For example, if we were to table number of laptops by make and by gender (how many Female/Males had a certain type of laptop), then we would have something like this:
## gender
## laptops Female Male
## Dell 3 2
## HP 8 2
## Mac 5 4
## Toshiba 0 2
Some of the graphs used to display distributions include dot plot, bar charts, histograms, box plots, tally charts, pie charts and the like. Let’s discuss this graphs as we discuss graphical representation of descriptive summaries.
1.2 Chapter organization
In this section we go through what to expect in each of the chapters in this book. Our aim here is to interlink the concepts and see how some concepts rely on knowledge of another.
Following this introduction, we shall start off by discussing ways to describe univariate data. Here we shall distinguish between numerical and categorical data. We will conclude this chapter by demonstrating how to display descriptive summaries using R.
A chapter on exploratory data analysis (EDA) follows descriptive statistics. This chapter introduces EDA as a technique to quickly look at data. It should be noted, in actual practice, EDA is usually performed as an initial activity and thus this chapter should have come before descriptive statistics. However, since this book also targets complete beginners in data analysis, then discussing EDA (which involves a lot of descriptive statistics) before descriptive statistics would not have been good for novice learners.
Before discussing core of statistics (making inference), we will discuss probability as a prerequisite. This chapter will (re)introduce probability concepts like events, sample space, permutations and combinations (counting rules). We shall use classical examples like dice throws, playing cards and marbles an urn to just ensure we have the concepts right.
With probability covered, we shall begin our discussion on inferential statistics. In this chapter, we shall discuss a critical concept in inferential statistics which is modeling assumptions (fully, non and semi parametric). We shall also discuss other concepts like central limit theorem and statistical propositions like point and interval estimates (CI). In our discussion we shall distinguish statistical propositions for means and proportions as well as difference between sample statistic and parameters. In addition we shall discuss hypothesis testing highlighting type one and type two errors as well as sensitivity analysis. We shall conclude with techniques for studying relationships between two or more variables(regression analysis).
As we wind up this book, we shall have chapters discussing time and survival analysis. These chapters are included as data on time and survival issues are quite common especially in development and humanitarian sector (one of the target learners of this book).
1.3 Basic Preliquiste
This book is intended as an introduction to data analysis and graphics, hence discussion is kept simple and concepts introduced gradually.
It is however important to have basic knowledge of R (or another statistical program) for you to participate in examples and exercises.
Mathematics is an inevitable prerequisite for statistics, and since a large part of data analysis relies on statistics, it is therefore important to have some basic mathematical concepts. It is advisable to go through appendix A on refresher mathematics and see what you need to refresh on. As with the rest of the book, mathematical concepts are re-introduced in a very basic and easy way. There is a list of handy and certainly (beginner friendly) resources added at the end of this chapter in case you need just a little bit more mathematics.