Short Introduction to R
Outline of the workshop:
- Gain familiarity with the R-Studio interface
- Learn where to find example datasets
- Learn where to find help
- Learn how to get data into R (and get data out of R)
- Understand the basics of data structures and objects
- Explore data
- Learn how to make some basic graphs
- Basic data management
- Learn how to do some basic statistical analyses
- If we have time: more about tidyverse
The R script for this workshop can be downloaded here.
R is a programming environment, and it can be used with or without another program to interface with it. As you probably already know, R is free. R is available for Windows, Mac and Linux. You can download R from https://cran.r-project.org/ . You can also find lots of information about R from the links on the left side of the page.
R itself is made up of Base R and user-written packages. There are thousands of user-written packages that you can download and use. You can have many packages loaded at once, and you can have multiple datasets open at once as well. R is case sensitive, so if a command or option has a capital letter in it, you must also capitalize that letter. Comments start with #; these are lines in the R script that are not executed.
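The two points about case sensitivity and comments can be seen in a minimal sketch. The object names score and Score are made up for illustration and do not come from the workshop script.

```r
# This whole line is a comment, so R does not execute it.
score <- 5    # text after # on a line is ignored as well
Score <- 10   # a different object: R treats score and Score as distinct names

score == Score   # FALSE, because the two objects hold different values
```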
Before going any further, a very brief history of R is presented because it is useful when reading some of the help files. S was created by John Chambers in 1976 when he was working at Bell Labs. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is currently developed by the R Development Core Team, and John Chambers is part of that team. R started in 1992, the initial release was in 1995, and a stable beta version was available in 2000. For perspective, BMDP was developed in 1965 here at UCLA. Jim Goodnight and others started writing SAS in 1966, and SAS Institute was incorporated in 1976. SPSS was first released in 1968, and SPSS, Inc. was started in 1975. The first version of Stata was released in 1985.
For this workshop, we will use R-Studio to interface with R because it is user-friendly and makes tasks easier. There are several programs that you could use as an external interface with R, and you are welcome to try them to see if you like them.
When you open R-Studio, you will see four windows. In the upper left, you will see the editor; this is where you type your R commands. Below the editor is the Console window, and this is where you will see your output, warning messages or error messages. In the upper right corner, you will see the workspace. You can click on the tabs and icons at the top. Below that is the working directory window. Again, there are tabs that you can click on to change the content of this window. When you open R-Studio, you will see some notes in the console window. You should read them the first time you use R-Studio, as they contain useful information.
Notice that you have at least three things that you need to keep up-to-date: R itself, the packages in R, and R-Studio.
It is common to download and install all of the packages that will be needed at the beginning of the R script. Below are the packages that will be used in this presentation.
install.packages("installr")
library(installr)
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
install.packages("haven")
require(haven)
install.packages("readr")
require(readr)
install.packages("yaml")
library(yaml)
install.packages("ggplot2")
library(ggplot2)
install.packages("dplyr")
library(dplyr)
There are three commands that are useful for updating packages:
# download and install newest version of the package
update.packages()
# shows which packages have updates
old.packages()
# looks for new packages that are not already installed
new.packages()
Alternatively, you can click on the Packages tab of the lower right-hand window.
To update R-Studio, you can click on Help and then Check for Updates.
Getting example datasets
Having example datasets is very useful, especially when learning a new statistical software package. Example datasets are always “clean”, so you can quickly start to practice the commands that you want without needing to do the data cleaning and managing that often needs to be done with real data. Some example datasets come with R, and you use
data()
data(women)
women
to see them.
Another thing that you need to know when learning a new statistical software package is how to find help. If you need help with R-Studio, you can click on the Help button on the top right of the window. There is also a searchable help in the middle of the right-hand window. To find help on general topics in R, type
help(help)
In the example above, we asked for help for the help command, and we see this in the lower right-hand window.
To get help with a specific command, such as regress, type
??regress
The command above requested the help files for all of the packages that have regress in them. They are shown in the lower right-hand window. Let’s scroll down to stats::lm. Here we can see how to fit a linear model (AKA OLS regression).
Oftentimes, you will find both help pages and documentation. The documentation may have user-guides and vignettes and is often a good place to start if you are just learning the package.
help(vignette)
vignette()
vignette("dplyr")
Notice that help(vignette) showed the help file in the lower right-hand window, while vignette() opened a new tab in the upper left-hand window.
Here are the results of a search for tidyverse: https://cran.r-project.org/web/packages/tidyverse/index.html. Notice that there are vignettes and other useful information here.
Of course, there are many other sources of help for both R and R-Studio. Many questions can be answered with a Google search. While there are many published books on a variety of topics using R, it may be best to stick to online books. They are free, and they are often more up-to-date than printed books. There are also some message forums to which you can post questions.
Getting data into R
Perhaps the easiest way to get data into R is to click on the Import Dataset button in the upper right-hand window. Of course, R code can also be written to do this, and there are functions in both Base R and packages that will do this. Please note that the package haven is part of a much larger collection of packages known as tidyverse. In the examples below, we read data over the internet, but you can type a file path where the URL is to read data saved on your computer.
# read_stata() comes from haven; read_csv() and read_delim() come from readr
require(haven)
require(readr)
hsbdemo_stata <- read_stata("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
hsbdemo_csv <- read_csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
hsbdemo_delim <- read_delim("https://stats.idre.ucla.edu/stat/data/hsbsemi.txt", delim = ";")
Let’s look at the hsbdemo dataset from Stata. If we click on the blue circle with the white arrow in the upper right-hand window, we can see various attributes about the variables in the dataset. If we click on the name of the dataset itself, we can see the data in the editor window. This is useful for ensuring that the data were read in properly.
You can use the command
getwd()
to find out where the current working directory is set. You can change the working directory with setwd(), passing the path to the desired folder as a quoted string.
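As a minimal sketch of checking and changing the working directory, the example below switches to R's temporary folder and back again. Using tempdir() as the destination is only an assumption made so that the code runs on any computer; in practice you would supply your own folder path.

```r
old_dir <- getwd()   # remember the current working directory
setwd(tempdir())     # switch to the temporary folder of this R session
getwd()              # now reports the temporary folder
setwd(old_dir)       # switch back to where we started
```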
Exporting data from R
Below are a few examples of code that can be used to write R datasets to different formats.
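# the first argument is the data frame to export; the second is the destination file
# write_csv() is from readr (argument file); write_dta() is from haven (argument path)
write_csv(hsbdemo_csv, file = "d:/myfolder/data_file_name.csv")
write_dta(hsbdemo_stata, path = "d:/myfolder/data_file_name.dta")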
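Base R can also export data. The sketch below is an illustration rather than part of the workshop script: it writes the built-in women dataset to a temporary CSV file with write.csv() and reads it back with read.csv() to confirm that the round trip preserved the data. The temporary file is an assumption so that the code runs anywhere; you would normally give a path of your own.

```r
out_file <- tempfile(fileext = ".csv")   # a throwaway file path for this session

# row.names = FALSE keeps R from writing the row numbers as an extra column
write.csv(women, file = out_file, row.names = FALSE)

women_back <- read.csv(out_file)         # read the file back in
head(women_back)
all(women_back$height == women$height)   # check that the values survived the round trip
```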
The basics of data structures and objects
Unlike most other statistical software packages, R has several different data structures. You need to understand a little bit about them because they affect how you work with your variables. One type of structure is called a vector. Vectors have only one dimension (such as a single column) and contain only one type of data, such as numeric or string (character). The types you will meet most often are logical (TRUE or FALSE, which R treats as 1 and 0 in arithmetic), integer (a number followed by L, e.g., the 5 in 5L is the integer), double (AKA real numbers or numeric) and character (strings).
Lists are also one-dimensional, but they can contain a mixture of data types. A list can contain a vector, other lists, matrices and data frames.
Matrices have two dimensions and contain only one type of data. For example,
1 2 3 4
5 6 7 8
is a 2 by 4 matrix (2 rows and 4 columns).
Data frames are the structure in which most data to be used for analysis are stored. Data frames have some of the features of matrices (variables as columns and observations as rows) and lists (the columns can be of different types, e.g., some columns contain numeric data and others contain character data). The columns should be of equal length, meaning that all observations contribute values to all variables. Of course, missing data are allowed; the topic of missing data will be discussed briefly a little later on. A data frame in R is approximately the same as a dataset in SAS, Stata or SPSS.
A tibble is a data frame with a few enhancements and is part of tidyverse.
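The structures described above can be created in Base R as follows. This is a minimal sketch with made-up values; all object names (v_lgl, my_list, m, df and so on) are hypothetical.

```r
# Atomic vectors: one dimension, one type of data
v_lgl <- c(TRUE, FALSE, TRUE)          # logical
v_int <- c(1L, 5L, 9L)                 # integer (note the L suffix)
v_dbl <- c(1.5, 2, 3.25)               # double
v_chr <- c("low", "middle", "high")    # character (string)
typeof(v_int)                          # "integer"

# List: one dimension, but the elements can be of different types
my_list <- list(numbers = v_dbl, labels = v_chr, nested = list(1, "a"))
str(my_list)

# Matrix: two dimensions, one type; this reproduces the 2 by 4 matrix shown above
m <- matrix(1:8, nrow = 2, byrow = TRUE)
dim(m)                                 # 2 4

# Data frame: columns of equal length that may differ in type
df <- data.frame(id = v_int, ses = v_chr, score = v_dbl)
str(df)
```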
R is an object-oriented software language, as opposed to SAS, Stata and SPSS, which have procedural languages. This means that R stores information, such as output from a procedure, in an object, and then you use that object in a function. Every object belongs to one or more classes. Many functions only accept objects of a certain class, so it is important to know what type of class or classes an object belongs to.
Generic functions match object classes to the appropriate function, so that you don’t have to remember which class or classes a function supports. To find out if a function is a generic function, check the help file. Generic functions accept objects from multiple classes and then pass the object to a specific function (called a method) designed for the object’s class. For example, the function summary() is a generic function. When a data frame is passed to summary(), the data frame is then passed to a specific function (AKA method) called summary.data.frame(), which provides a numeric summary of all variables in the data frame. You could also pass a regression model object (of class lm) to summary(), which would call summary.lm() and the result would be a regression table.
Methods are class-specific functions. The methods() function lists the methods that are available in the current R session for a given generic function or class, e.g., methods(summary) or methods(class = "lm").
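To see this dispatch in action, the sketch below passes two objects of different classes to summary(). It uses the built-in women dataset rather than the workshop data, and the object name fit is made up for illustration.

```r
fit <- lm(weight ~ height, data = women)   # a simple regression model object

class(women)          # "data.frame"
class(fit)            # "lm"

summary(women)        # dispatched to summary.data.frame(): numeric summaries per column
summary(fit)          # dispatched to summary.lm(): a regression table

class(summary(fit))   # "summary.lm"
head(methods(summary))   # a few of the methods available for the generic summary()
```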
Let’s read in the hsbdemo dataset from Stata again, this time with a shorter name.
hsb <- read_stata("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
Once you have your data in R (or any other statistical software package), it is a good idea to explore the data before beginning the statistical analyses. While data can be either string or numeric, there are very few statistical analyses that can be done with string variables, so we will focus on numeric variables. Numeric variables can be either continuous or categorical. We will start with continuous variables.
mean(hsb$read)
var(hsb$read)
median(hsb$read)
Numeric summaries of variables are nice and necessary, but sometimes, a graph is what you want. For the graphs in this presentation, we will use the package ggplot2, which contains the function ggplot. Base R does have graphing capabilities, but for aesthetic reasons, we will use ggplot2. Let’s start with a boxplot and a histogram.
# graphing continuous variables
# making a boxplot
ggplot(hsb, aes(x = 1, y = math)) + geom_boxplot()
# making a histogram
ggplot(hsb, aes(x = write)) + geom_histogram(bins = 10)
ggplot(hsb, aes(x = read)) + geom_histogram(bins = 10)
Two continuous variables can be graphed together to make a scatterplot.
# graphing with two continuous variables
ggplot(data = hsb, aes(x = write, y = read)) + geom_point()
Variables that are treated as categorical can be either ordinal (meaning that the values have a natural ordering, such as small, medium and large), or they can be nominal (such as chocolate, vanilla and strawberry). Binary variables (which have values of 0, 1 or missing) can be treated as either continuous or categorical. Frequency tables are often used to describe categorical variables.
The relationship between two categorical variables can be described with a crosstabulation.
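A frequency table and a crosstabulation can both be produced with the Base R function table(). The sketch below uses the built-in mtcars dataset as a stand-in, treating cyl and am as categorical variables; this choice is an assumption made so the example runs without downloading anything. With the workshop data, the same calls would presumably be applied to variables such as hsb$ses and hsb$female.

```r
# one-way frequency table: number of cars for each cylinder count
table(mtcars$cyl)

# the same table expressed as proportions
prop.table(table(mtcars$cyl))

# crosstabulation of two categorical variables
xt <- table(cyl = mtcars$cyl, am = mtcars$am)
xt

# add row and column totals to the crosstabulation
addmargins(xt)
```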
A bar chart is one visual representation of a categorical variable.
# factor() makes ggplot treat the numerically coded variables as categories
ggplot(hsb, aes(x = factor(ses), fill = factor(female))) + geom_bar(position = "dodge")
While on the topic of categorical variables, we need to mention factors. Factors are numeric variables that have value labels associated with them. String variables can be converted to numeric variables, and the string value is retained as the value label. Let’s read in a dataset that has some string variables and show how to make one a factor variable.
hsbraw <- read.csv("http://stats.idre.ucla.edu/stat/data/hsbraw.csv")
str(hsbraw$ses)
hsbraw$ses <- factor(hsbraw$ses, levels = c("low", "middle", "high"))
str(hsbraw$ses)
levels(hsbraw$ses)
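The same conversion can be tried on a small made-up vector, which makes it easier to see how the string values map to the underlying numeric codes. The names ses_chr and ses_fct are hypothetical.

```r
ses_chr <- c("low", "high", "middle", "low", "high")   # a character vector

# levels = ... sets the order of the categories; without it they are sorted alphabetically
ses_fct <- factor(ses_chr, levels = c("low", "middle", "high"))

levels(ses_fct)       # "low" "middle" "high"
as.integer(ses_fct)   # 1 3 2 1 3: the numeric codes behind the labels
table(ses_fct)        # frequencies are reported in level order
```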
If we wanted descriptive statistics for only those cases that met specific criteria, we could use the filter function from dplyr.
summary(filter(hsb, read >= 60))
We can get descriptive statistics broken out by levels of a categorical variable.
tapply(hsb$socst, hsb$prog, summary)
Now let’s use both continuous and categorical variables in a graph. We will start with a grouped bar chart, and then create some box plots and end with some density plots.
# graphing continuous by categorical
# factor() makes ggplot treat the numerically coded variables as categories
# grouped bar chart
ggplot(hsb, aes(x = factor(ses), fill = factor(prog))) + geom_bar(position = "dodge") + facet_wrap(~female)
# box plots
ggplot(hsb, aes(x = factor(female), y = math)) + geom_boxplot()
# density plots
ggplot(hsb, aes(x = math, color = factor(prog))) + geom_density()
Basic data management
Once you have explored your data, you may realize that you need to create a new variable or recode an existing variable. We will start with something simple, such as creating a new variable that is a constant.
hsb$newvar <- 1
names(hsb)
table(hsb$newvar)
Creating a constant may be the first step in the process, but it is rarely the last. Let’s use some if-then logic to modify our new variable.
# if-then coding (but no "then")
hsb$newvar <- ifelse(hsb$math > 65, 0, 1)
table(hsb$newvar)
There are many useful functions in R. In this example, we take the log of the variable read.
hsb$logread <- log(hsb$read)
head(hsb$logread, n = 5)
We can transform more than one variable at a time.
hsb <- mutate(hsb,
              logmath = log(math),
              rankmath = min_rank(math),
              math_grade = cut(math,
                               breaks = c(0, 35, 45, 55, 65, 80),
                               labels = c("F", "D", "C", "B", "A")),
              zmath = scale(math))
# look at dataset
View(hsb)
Data can be sorted with arrange. In the example below, we will sort the dataset based on the values of the variable science. Notice that we changed datasets.
# sorting
# the arrange function is part of dplyr
temp1 <- arrange(hsbraw, science)
head(temp1, n = 10)
You can also sort on more than one variable. In the example below, we sort on both ses and science.
temp2 <- arrange(hsbraw, ses, science)
head(temp2, n = 20)
View(temp2)
If you look at the values of science, you will see some values that are -99. These are missing values. However, R sees them as real values, so we need to convert them to NA, which is the value that R understands as missing. For more information on missing values in R, please see the page How Does R Handle Missing Values?
# missing values
hsbraw$science[hsbraw$science == -99] <- NA
head(hsbraw$science, 10)
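The effect of the recoding is easiest to see on a small made-up vector. In the sketch below, the values in science are invented for illustration; the point is how -99 distorts a mean and how NA behaves once the recoding is done.

```r
science <- c(47, -99, 63, 58, -99)   # made-up scores with -99 as the missing-value code

mean(science)                        # misleading: the -99 values are treated as real scores

science[science == -99] <- NA        # recode -99 to R's missing value
is.na(science)                       # TRUE where the value is missing
sum(is.na(science))                  # 2 missing values

mean(science)                        # NA: most functions return NA when values are missing
mean(science, na.rm = TRUE)          # 56: the mean of the three observed values
```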
Another common task is to remove either observations or variables from a dataset. This is called subsetting. For example, if we wanted to create a new dataset with only the observations for females, we could type
# subsetting cases
hsbf <- hsb[hsb$female == 1, ]
hsbf
If we wanted to keep only the variables id, female, ses, schtyp and prog, we could type
# subsetting variables
hsbd <- hsb[, c("id", "female", "ses", "schtyp", "prog")]
hsbd
You can also combine these to do both at once. For example, you could select the observations for females and keep only the variables id, female, ses, schtyp and prog.
# subsetting both cases and variables
hsbb <- hsb[hsb$female == 1, c("id", "female", "ses", "schtyp", "prog")]
hsbb
A few basic statistical analyses
First of all, there are lots of packages that do an incredible variety of statistical tests. Let’s start by reloading the dataset, just for a fresh start.
hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
Many statistical functions use “formula notation”. For example,
y ~ a
where y represents the outcome (or dependent variable), and a represents a predictor (or independent variable).
If you want to include an interaction term, you have two choices. You can use : between two variables to include their interaction in the model, or you can use * between two variables to include the interaction and the lower order terms.
y ~ a + b + a:b
y ~ a*b
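One way to check that the two ways of writing an interaction are equivalent is to ask R which terms each formula expands to. This is a small sketch; the names y, a and b are placeholders and do not need to exist as variables for terms() to work.

```r
f_long  <- y ~ a + b + a:b   # main effects plus the interaction, written out
f_short <- y ~ a * b         # shorthand for the same model

attr(terms(f_long), "term.labels")    # "a" "b" "a:b"
attr(terms(f_short), "term.labels")   # "a" "b" "a:b"

# both formulas expand to the same set of terms
identical(attr(terms(f_long), "term.labels"),
          attr(terms(f_short), "term.labels"))
```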
Independent samples t-test
t.test(write ~ female, data = hsb)
# paired samples t-test
t.test(hsb$read, hsb$write, paired = TRUE)
# correlation
cor(hsb$read, hsb$write, method = "pearson")
# OLS regression
m1 <- lm(write ~ read + female, data = hsb)
Objects that are created by running a model are often complicated lists that contain a variety of information related to the model that was fitted. These model objects usually belong to special classes that have associated methods. The lm() function returns objects of class lm, which has related methods summary.lm(), plot.lm, etc.
class(m1)
methods(class = "lm")
summary(m1)
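Because a model object is a list, its pieces can be inspected directly or, preferably, through extractor functions. The sketch below fits a small model to the built-in women dataset as a stand-in for m1; the object name fit is hypothetical.

```r
fit <- lm(weight ~ height, data = women)

is.list(fit)               # TRUE: the model object is a list underneath
names(fit)                 # the components stored in the object

coef(fit)                  # extractor function for the estimated coefficients
head(residuals(fit), 3)    # the first three residuals
summary(fit)$r.squared     # the summary object is also a list with named parts
```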
Before we can run the logistic regression, we need to convert the variable honors to a binary variable which we will call hon.
hsb$hon <- (hsb$honors == "enrolled")
m2 <- glm(hon ~ read + female, data = hsb, family = binomial)
To get odds ratios, you can exponentiate the coefficients and their confidence intervals.
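A minimal sketch of this step is shown below. It fits a logistic regression to the built-in mtcars dataset as a stand-in for the workshop model, and it uses confint.default(), which gives Wald confidence intervals; both choices are assumptions made for illustration. Calling confint() instead would give profile-likelihood intervals, and for the workshop model you would presumably put m2 in place of fit.

```r
# logistic regression: is the car manual (am = 1) as a function of its weight?
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# exponentiate the coefficients and the limits of their Wald confidence intervals
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
or_table
```

The column OR holds the odds ratios, and the other two columns hold the lower and upper limits of the 95% confidence intervals on the odds-ratio scale.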
More about tidyverse
You can learn more about tidyverse, which was written by Hadley Wickham and is very popular now, at its website here. To quote from this page: “The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” More specifically, tidyverse is a collection of packages designed to assist with common data management tasks such as importing data, cleaning and managing data. Among the more popular packages are:
- dplyr is used for subsetting, sorting, transforming and grouping variables;
- tidyr is used for reshaping data (i.e., restructuring data from long to wide format and vice versa);
- magrittr is used to pipe a chain of commands;
- lubridate is used with date and date-time variables;
- stringr is also used for manipulating string variables;
- purrr enhances R’s functional programming toolkit for working with functions and vectors;
- forcats provides tools for handling common problems with factor variables; and
- ggplot2 is used to make descriptive graphs.
The grammar used in ggplot2 is described in the book The Grammar of Graphics by Leland Wilkinson. (The grammar for the new graph procedures in SPSS is also from this book; Wilkinson was working for SPSS when he wrote the book.) Tidyverse also saves data as a tibble, which is a type of data frame. You can learn more about the differences between a data frame and tibble here. The tidyverse website has links to online books that you can read, among other resources. Some of these resources are free, including some online courses with videos and exercises. Hadley Wickham works for RStudio and is an adjunct professor at the University of Auckland (he is from New Zealand), Stanford University and Rice University, and he has written several books. We have a workshop on Data Management in R that uses tidyverse extensively.