Purpose of this seminar
This seminar introduces how to use the R ggplot2 package, emphasizing data analysis accompanying a regression model.

First the underlying grammar (system) of graphics is introduced with demonstrations.

Then, usage of ggplot2 for exploratory graphs, model diagnostics, and presentation of model results is illustrated through 3 examples.
Background
The ggplot2
package

produces layered statistical graphics.

uses an underlying “grammar” to build graphs componentbycomponent rather than providing premade graphs.

is easy enough to use without any exposure to the underlying grammar.

allows the user to build a graph from concepts rather than recall of commands and options.
We can conceive of new kinds of graphs to display more data variation more clearly.
Installing ggplot2
and other packages
Install ggplot2
with install.packages
.
We install and load the following additional packages:
Hmisc
: for documentation for statistical summary functions used withggplot2
dplyr
: data managementreshape
: restructuring datalme4
: mixed effects modelingnlme
: needed for dataset (installed with lme4)
#run these if you do not have the packages installed
install.packages("ggplot2")
install.packages("Hmisc")
install.packages("dplyr")
install.packages("reshape")
install.packages("lme4")
#load libraries into R session
library(ggplot2)
library(Hmisc)
library(dplyr)
library(reshape)
library(lme4)
library(nlme)
theme_set(theme_grey(base_size = 12))
ggplot2 documentation
http://docs.ggplot2.org/current/
Help file pages usually contain numerous code and graphics examples.
The ggplot2 book
Read ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, creator of the ggplot2 package:
 http://ggplot2.org/book
 concise at 200 pages, but rich with information
 filled with pictures
This section of the seminar describing the grammar summarizes bits and pieces of chapter 3.
The grammar of graphics
Outline of grammar
Grammar of a language defines the rules of structuring words and phrases into meaningful expressions.
Grammar of graphics defines the rules of structuring mathematic and aesthetic elements into a meaningful graph.
Leland Wilkinson (2005) designed the grammar upon which ggplot2
is based.
Elements of grammar of graphics
 Data: variables mapped to aesthetic features of the graph.
 Geoms: objects/shapes on the graph.
 Stats: stastical transformations that summarize data,(e.g mean, confidence intervals)
 Scales: define which aesthetic values are mapped to data values. Legends and axes display these mappings.
 Coordiante systems: define the plane on which data are mapped on the graphic.
 Faceting: splits the data into subsets to create multiple variations of the same graph (paneling).
Data and the ggplot
function
Data for plotting must be stored in a data.frame
.
Multiple data frames may be used in one plot.
The default dataset is specified in the ggplot
function (not “ggplot2”).
Datasets and the Milk
dataset
All datasets used in this seminar are loaded with the R packages used for this seminar.
To discuss the grammar of graphics, we use the Milk
dataset, loaded with the nlme
package.
The Milk
dataset contains 1337 rows of the following 4 columns:
 protein: numeric, protein content of milk
 Time: numeric, time since calving
 Cow: ordered factor, cow id
 Diet: factor, diet of cow, 3 levels = barley, barley+lupins, lupins
A quick look at the Milk
dataset:
#first few rows of Milk
head(Milk)
## Grouped Data: protein ~ Time  Cow
## protein Time Cow Diet
## 1 3.63 1 B01 barley
## 2 3.57 2 B01 barley
## 3 3.47 3 B01 barley
## 4 3.65 4 B01 barley
## 5 3.89 5 B01 barley
## 6 3.73 6 B01 barley
The ggplot
function
In the ggplot
function we specify the default dataset and map variables to aesthetics (aspects) of the graph.
We then add layers (of geoms, stats, etc.), producing the actual graphical elements of the plot.
First ggplot
specifiction
In the ggplot
below:
 data set is
Milk
 Time mapped to xaxis
 protein mapped to the yaxis
#declare data and x and y aesthetics,
#but no shapes yet
ggplot(data = Milk, aes(x=Time, y=protein))
Aesthetic mappings and aes
Aesthetics are the visually perceivable components of the graph.
Map variables to aesthetics using the aes
function, such as:
 which variables appear on the xaxis and yaxis.
 a classification variable to colors
 a numeric variable to the size of graphical objects
Use aes
in ggplot
to specify default aesthetics
Embed the aes
function in other functions for convenience.
Mappings specified in aes
in the ggplot
function serve as defaults and are inherited by subsequent layers and do not need to respecified.
Additional aes
specifications in further layers will override the default aesthetics for that layer.
Default aesthetics are inherited
#geom_point inherits x=Time, y=protein
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point()
We can respecify aes
for specific layers
#color=Diet only applies to geom_boxplot
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point() +
geom_boxplot(aes(color=Diet))
Example aesthetics
Which aesthetics are required and which are allowed depend on the geom.
Some example aesthetics:
x
: positioning along xaxisy
: positioning along yaxiscolor
: color of objects; for 2d objects, the color of the object’s outline (compare to fill below)fill
: fill color of objectsalpha
: transparency of objects (value between 0, transparent, and 1, opaque – inverse of how many stacked objects it will take to be opaque)linetype
: how lines should be drawn (solid, dashed, dotted, etc.)shape
: shape of markers in scatter plotssize
: how large objects appear
Mapping vs setting
Map aesthetics to variables inside the aes
function. The aesthetic will vary as the variable varies.
#color varies as Diet varies
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point(aes(color=Diet))
Set aesthetics to a constant outside the aes
function.
#map color to constant "green" outside of aes
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point(color="green")
Setting an aesthetic to a constant within aes can lead to unexpected results, as the aesthetic is set to a default value, rather than the specified value.
#Color set to default redorange color rather than green
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point(aes(color="green"))
Layers
Add more layers with the character +
.
Layers may include any of the elements of the grammar, such as geoms, stats, faceting, and scales.
Remember that layers inherit the aesthetics from the ggplot
function, but they can be respecified.
Store plot specifications in objects
We can store ggplot specifications in R objects and then reuse the object to keep the code compact. Issuing the object name produces the graph
#store specification in object p and then produce graph
p < ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point()
p
Adding layers to a plot object renders a graphic
When we add layers to a plot stored in an object, the graphic is rendered, with the layers inheriting aesthetics from the original ggplot
.
#add a loess plot layer
p + geom_smooth()
#or panel by Diet
p + facet_wrap(~Diet)
Geoms
Geom functions differ in the geometric shapes produced for the plot:
geom_bar
: bars with bases on the xaxisgeom_boxplot
: boxesandwhiskersgeom_errorbar
: Tshaped error barsgeom_histogram
: histogramgeom_line
: linesgeom_point
: points (scatterplot)geom_ribbon
: bands spanning yvalues across a range of xvaluesgeom_smooth
: smoothed conditional means (e.g. loess smooth)
Geoms and aesthetics
Each geom is defined by aesthetics required for it to be rendered. For example, geom_point
requires both x
and y
, the minimal specification for a scatterplot.
Geoms differ in which aesthetics they understand as arguments. For example, geom_point
understands the shape
, which defines the shapes of points on the graph. In contrast, geom_bar
does not understand shape
.
Check the geom function help files for required and understood aesthetics.
geom functions require only the x
aesthetic.
These geoms are often used to plot the distribution of a single, continuous variable.
#define data and x for univariate plots
pro < ggplot(Milk, aes(x=protein))
#histogram
pro + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#frequency polygon
pro + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#and density plot
pro + geom_density()
geom_bar
requires only x
and summarizes a discrete variable
geom_bar
by default produces a bar plot with counts of each xvalue.
#bar plot of counts of Diet
ggplot(Milk, aes(x=Diet)) +
geom_bar()
Geoms that require more than x
Plots of more than one variable are typically more interesting. Below are geoms that map variables to both the x and y aesthetics.
#define data, x and y
p2 < ggplot(Milk, aes(x=Time, y=protein))
#scatter plot
p2 + geom_point()
#loess plot
p2 + geom_smooth()
## `geom_smooth()` using method = 'gam'
Adding a grouping variable
For the next two plots below, we additionally specify the group aesthetic to get lines (of protein across Time) by Cow and boxplots by Time point.
#lines of protein over Time by group=Cow
p2 + geom_line(aes(group=Cow))
#boxplots of protein by Time point; stored in p3
p3 < p2 + geom_boxplot(aes(group=Time))
p3
Show more data variation by mapping more variables to aesthetics
Map more variables to aesthetics to increase data variation in the graph. Let’s see how our proteinbyTime boxplots vary by Diet.
#add color to show how groups vary
p3 + facet_wrap(~Diet)
Stats
The stat functions statistically transform data, usually as some form of summary. For example:
 frequencies of values of a variable (histogram, bar graphs)
 a mean
 a confidence limit
Each stat function is associated with a default geom, so no geom is required for shapes to be rendered. The geom can often be respecfied in the stat for different shapes.
stat_bin
produces a histogram
stat_bin
transforms a continuous variable mapped to x
into bins (intervals) of counts. Its default geom is geom_bar
, producing a histogram.
#stat_bin produces a bar graph by default
ggplot(Milk, aes(x=protein)) +
stat_bin()
Respecify the geom to change how the transformation is depicted.
#points instead of bars
ggplot(Milk, aes(x=protein)) +
stat_bin(geom="point")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
stat_summary
summarizes y
at each x
stat_summary
applies a summary function (e.g. mean
) to y
at each value (or interval) of x
.
The default summary function is mean_se
, which returns the mean and mean ± standard error. Thus, stat_summary
plots the mean and standard errors for y
at each x
value.
The default geom is “pointrange”, which places a dot at the mean y
and extends lines to mean y
± s.e.
Means and standard errors of protein, y
, at each Time, x
, plotted as “pointrange”.
#plot mean and se of protein at each Time
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary()
Using other summary functions and geoms with stat_summary
Generally, functions that accept continuous numeric variables (e.g. mean
, var
, userwritten) can be specified in stat_summary
, either with argument fun.data
or fun.y
(see below).
The ggplot2
package conveniently has additional summary functions adapted from the Hmisc
package, for use with stat_summary
, including:
mean_cl_boot
: mean and bootstapped confidence interval (default 95%)mean_cl_normal
: mean and Guassian (tdistribution based) confidence interval (default 95%)mean_dsl
: mean plus or minus standard deviation times some constant (default constant=2)median_hilow
: median and outer quantiles (default outer quantiles = 0.025 and 0.975)
It is not necessary to load the Hmisc
package into the R session to use these functions, but it is necessary to view their help files.
If the new summary function returns 1 value (e.g. mean
, var
), use fun.y
and a geom that requires only a single y
input (e.g. geom_point
, geom_line
).
#median returns 1 value, geom_pointrange expects 2 (high, low)
#change geom to point
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.y="median", geom="point")
If the function resturns 3 values, such as the mean and 2 limits (e.g. mean_se
m mean_cl_normal
) use fun.data
#mean and boostrapped confidence limits
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.data="mean_cl_boot")
We can even specify a function we ourselves have defined.
#we define our own function mean of log
#and change geom to line
meanlog < function(y) {mean(log(y))}
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.y="meanlog", geom="line")
Scales
Scales define the actual mapping of an aesthetic’s values to data values. For example, a color scale maps each of the 3 values of Diet to one of 3 colors that vary in hue. A shape scale could instead map Diet to scatter plot shapes.
The ggplot2
package usually provides multiple types of scales for each aesthetic, each customizable by the user.
To demonstrate, we use scales for fill colors to adjust which 3 colors are mapped to the values of Diet in the Milk:
scale_fill_hue
: evenly spaced hues, default used for factor variablesscale_fill_brewer
: sequential, diverging or qualitative colour schemes, originally intended to display factor levels on a map.
The plots are densities of protein for each Diet group.
Here are 4 fill color scales using the above 2 functions.
#default is scale_fill_hue, would be the same if omitted
dDiet < ggplot(Milk, aes(x=protein, fill=Diet)) +
geom_density(alpha=1/3)
dDiet + scale_fill_hue()
#a different starting hue [0,360] changes all 3 colors
#because of the even huespacing
dDiet + scale_fill_hue(h.start=150)
#a qualitative scale for Diet in scale_fill_brewer
dDiet + scale_fill_brewer(type="qual")
#a divergent scale in scale_fill_brewer
dDiet + scale_fill_brewer(type="div")
Setting axis limits and labeling scales
Axes visualize the scales for the aesthetics x
and y
. We commonly need to adjust them so ggplot2
provides several convenient functions to label axes and other aesthetics:
lims
,xlim
,ylim
: set axis limitsexpand_limits
: extend limits of scales for various aetheticsxlab
,ylab
,ggtitle
,labs
: give labels (titles) to xaxis, yaxis, or graph;labs
can set labels for all aesthetics and title
Guides (legends) visualize scales
Guides specify the axes or legends that match the aesthetic values of the scale to values of the variable mapped to the aesthetic.
The guides
function sets and removes guides for each scale.
We remove the fillcolorscale guide from the previous plot with guides
.
dDiet + scale_fill_brewer(type="div") +
guides(fill="none")
Coordinate systems
Coordinate systems define the planes on which objects are positioned in space on the plot. Most plots use Cartesian coordinate systems, as do all the plots in the seminar. Nevertheless, ggplot2
provides multiple coordinate systems, including polar, flipped Carteisan and map projections.
Faceting (paneling)
Split plots into small multiples (panels) with the faceting functions, facet_wrap
and facet_grid
. The resulting graph shows how each plot varies along the faceting variable(s).
facet_wrap
wraps a ribbon of plots into a multirow panel of plots. The number of rows and columns can be specified. Multiple faceting variables can be separated by +
.
#densities of protein, colored by Diet, faceted by Time
ggplot(Milk, aes(x=protein, color=Diet)) +
geom_density() +
facet_wrap(~Time)
facet_grid
allows direct specification of which variables are used to split the data/plots along the rows and columns. Put the rowsplitting variable before ~
, and the columnsplitting variable after. A .
specifies no faceting along that dimension.
#split rows by Diet, no splitting along columns
ggplot(Milk, aes(x=Time, y=protein)) +
geom_point() +
facet_grid(Diet~.)
Themes
Themes control elements of the graph not related to the data. For example:
 background color
 size of fonts
 gridlines
 color of labels
These dataindependent graph elements are known as theme elements.
A list of theme elements can be found at http://docs.ggplot2.org/current/theme.html
Each theme element can be conceived of as either a line (e.g. xaxis), a rectangle (e.g. graph background), or text (e.g. axis title). We specify arguments pf the basic elements functions to adjust aspects of these theme elements:
element_line
– can specifycolor
,size
,linetype
, etc.element_rect
– can specifyfill
,color
,size
, etc.element_text
– can specifyfamily
,size
,color
, etc.element_blank
– removes theme elements from graph
Adjust theme elements with theme
and element functions
Within the theme
function, adjust the properties of theme elements by specifying the arguments of the element functions.
For example, the theme element corresponding to the background of the graph is panel.background
, and its properties are set in element_rect
. We can adjust the background’s fill
color, border color
, and size
in element_rect
. All of this will be specified within theme
.
Below we change the fill color of the backround to “lightblue”.
#light blue background
pt < ggplot(Milk, aes(x=Time, y=protein)) +
geom_point()
pt + theme(panel.background=element_rect(fill="lightblue"))
Now we specify the xaxis title, a text theme element, to be size=20pt and red.
#make xaxis title big and red
pt + theme(axis.title.x=element_text(size=20, color="red"))
Theme elements can be removed by setting them to element_blank. Let’s remove the background and xaxis title.
#remove plot background and xaxis title
pt + theme(panel.background=element_blank(),
axis.title.x=element_blank())
Saving plots to files
ggsave
makes saving plots easy. The last plot displayed is saved by default, but we can also save a plot stored to an R object.
ggsave
attempts to guess the device to use to save the image from the file extension, so use a meaningful extension. Available devices include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf. The size of the image can be specified as well.
#save last displayed plot as pdf
ggsave("plot.pdf")
#save plot stored in "p" as png and resize
ggsave("myplot.png", plot=p, width=7, height=5. units="in")
Upcoming examples
The next 3 sections of the seminar consist of examples designed to show create simple ggplot2
graphs at each stage of data analysis. Each section begins with exploratory graphs, then chooses a regression model informed by the exploratory graphs, checks the model assumptions with diagnostic graphs, and ends with the modifying a graph for presentation of model results.
Example 1: Linear regression
Description of ToothGrowth
dataset
The ToothGrowth
dataset loads with the datasets package
.
The data are from an experiment examining how vitamin C dosage delivered in 2 different methods predicts tooth growth in guinea pigs. The data consist of 60 observations, representing 60 guinea pigs, and 3 variables:
 len: numeric, tooth (odontoblast, actually) length
 supp: factor, supplement type, 2 levels, “VC” is ascorbic acid, and “OJ” is orange juice
 dose: numeric, dose (mg/day)
A quick look at ToothGrowth
reveals that many guinea pigs were given the same dose of vitamin C. Three doses, 0.5, 1.0, and 2.0, were used.
#look at ToothGrowth data
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
unique(ToothGrowth$dose)
## [1] 0.5 1.0 2.0
Example 1: Exploratory analysis
Graphical exploration of the dataset provides the researcher with descriptive depictions of summaries of variables or relationship among variables.
We being our exploration of Toothgrowth
by examining the distributions of supp
and dose
. Since both variables are discretely distributed, we want the frequencies (counts) of each value.
Frequencies are often plotted as bars, so we select geom_bar
. We specify dose mapped to x
and supp mapped to the fill
color of the bars, which will give us counts of each dose divided by supp.
Exploring distributions
#bar plot of counts of dose by supp
#data are balanced, so not so interesting
ggplot(ToothGrowth, aes(x=dose, fill=supp)) +
geom_bar()
Statistical models often make assumptions about the distribution of the outcome (or its residuals), so an inspection might be wise. First let’s check the overall distribution of “len” with a density plot:
#density of len
ggplot(ToothGrowth, aes(x=len)) +
geom_density()
We can get densities of len by supp by mapping supp to color
:
#density of len by supp
ggplot(ToothGrowth, aes(x=len, color=supp)) +
geom_density()
The outcome distributions appear a bit skewed, but the samples are small.
Quick scatterplot of outcome and predictor
We plot the dosetooth length (len) relationship.
#not the best scatterplot
tp < ggplot(ToothGrowth, aes(x=dose, y=len))
tp + geom_point()
Summarizing the outcome across a predictor
Because dose takes on only 3 values, many point are crowded in 3 columns, obscuring the shape of relationship between dose and len. We replace the column of points at each dose value with its mean and confidence limit using stat_summary
instead of geom_point
.
#mean and cl of len at each dose
tp.1 < tp + stat_summary(fun.data="mean_cl_normal")
tp.1
An additional call to stat_summary
with fun.y=mean
(fun.y
because mean
returns one value) and changing the geom
to “line” adds a line between means.
#add a line plot of means to see doselen relationship
#maybe not linear
tp.2 < tp.1 + stat_summary(fun.y="mean", geom="line")
tp.2
Does a third variable moderate the xy relationship?
Does the doselen relationship depend on supp? We can specify new global aesthetics in aes
.
#all plots in tp.2 will now be colored by supp
tp.2 + aes(color=supp)
Interpreting the previous graph
This graph suggests:
 The slope of the doseresponse curve decreases as dose increases for both supp types, suggesting a quadratic function.
 The slope of the OJ curve flattens more dramatically, perhaps suggesting the quadratic term is different between supplement groups
 The initial slopes look rather similar, perhaps suggesting that the linear term may not be different between groups
 The 2 supp group means differ at the two lower doses, but not at the highest dose
ggplot2
makes graphs summarizing the outcome easy
We just plotted means and confidence limits of len, with lines connecting the means, separated by supp, all without any manipulation to the original data!
The stat_summary
function facilitates looking at patterns of means, just as regression models do.
Next we fit our linear regression model and check model assumptions with diagnostic graphs.
Model preliminaries
We want to model how tooth length (len) is predicted by dose, allowing for moderation of this relationship through an interaction with supp.
We assume that dose and tooth length have a smooth, continuous relationship in the range of doses tested, so we will treat dose as a continuous (numeric) predictor. We also create a dosesquared variable, for use in the quadratic model and prediction later.
#create dosesquared variable
ToothGrowth$dosesq < ToothGrowth$dose^2
Fitting the model
We noticed in the previous graph that the doselen relationship appears quadratic, that the quadratic effect may differ between supp groups, but that the initial linear slope may not be so different. We fit a model to reflect our expectations:
lm2 < lm(len ~ dose + dosesq*supp, data=ToothGrowth)
summary(lm2)$coef
## Estimate Std. Error t value Pr(>t)
## (Intercept) 0.7491667 2.7983895 0.2677135 7.899213e01
## dose 30.1550000 5.2474684 5.7465806 4.114588e07
## dosesq 8.7238095 2.0402571 4.2758383 7.640686e05
## suppVC 6.4783333 1.3762287 4.7073088 1.739152e05
## dosesq:suppVC 1.5876190 0.5770719 2.7511635 8.024694e03
The model appears to conform to our expectations.
Example 1: Model diagnostics
Statistical inference depends on the assumptions of the regression model, which we check with diagnostic graphs. Common diagnostics for linear regression
 inspect the normality of the residuals
 verify that the residuals show no trends (assumption of linearity) and are homoscedastic,
 check for overly influential outliers.
fortify
makes linear regression diagnostics easy
Conveniently, the fortify
function takes a lm model object (one among several classes) and returns a dataset with several estimated diagnostic variables including:
.hat
: leverages(influence).sigma
: residual standard deviation when observation dropped from model.cooksd
: Cook’s distance.fitted
: fitted (predicted) values.resid
: residuals.stdresid
: standardized residuals
We fortify
our lm2 object with these diagnostic variables and take a quick look a the new variables.
#create dataset with original data and diagnostic variables
lm2diag < fortify(lm2)
#quick look
head(lm2diag)
## len dose dosesq supp .hat .sigma .cooksd .fitted
## 1 4.2 0.5 0.25 VC 0.08095238 3.623134 0.0165458459 7.564286
## 2 11.5 0.5 0.25 VC 0.08095238 3.611516 0.0226438547 7.564286
## 3 7.3 0.5 0.25 VC 0.08095238 3.654279 0.0001021058 7.564286
## 4 5.8 0.5 0.25 VC 0.08095238 3.645881 0.0045503109 7.564286
## 5 6.4 0.5 0.25 VC 0.08095238 3.650733 0.0019816291 7.564286
## 6 10.0 0.5 0.25 VC 0.08095238 3.638080 0.0086727319 7.564286
## .resid .stdresid
## 1 3.3642857 0.96913367
## 2 3.9357143 1.13374237
## 3 0.2642857 0.07613152
## 4 1.7642857 0.50822934
## 5 1.1642857 0.33539021
## 6 2.4357143 0.70164455
Normality of residuals: qq plot and stat_qq
A qq plot can assess the assumption that the residuals are normally distributed by plotting the standardized residuals (observed zscore) against theoretical quantiles of the normal distribution (expected zscore if normally distributed).
stat_qq
creates a qqplot. The only required aesthetic is sample
, which we map to the standardized residual variable created by fortify
, .stdresid
.
A diagonal reference line (intercept=0, slope=1) is added to the plot with geom_abline
, representing perfect fit to a normal distribution. The normal distribution will appear to be a reasonable fit below.
#qq plot of residuals and diagonal reference line
#geom_abline default aesthetics are yintercept=0 and slope=1
ggplot(lm2diag, aes(sample=.stdresid)) +
stat_qq() +
geom_abline()
Linearity and Homoscedasticity: residuals vs fitted
We next assess the assumptions of homoscedescasticity and linear relationships between the outcome and predictors. A residuals vs fitted (predicted value) plot assesses both of these assmuptions.
An even spread of residuals around 0 suggests homoscedasticity, and a zero, flat slope for residuals against fitted suggests linearity of predictor effects.
We build our residuals vs fitted plot like so:
 Start with a scatter plot of
x=.fitted
andy=.stdresid
.  Add a plot the means and standard errors of the residuals across fitted values using
stat_summary
. The standard error bars somewhat address homoskedasticity.  Plot a line representing the mean trend of the residuals with another call to
stat_summary
(changing function tomean
and geom toline
). Normally, we would usegeom_smooth
to plot a loess curve to visualize the trend among residuals, but loess smooths do not work well when the variable mapped tox
is discrete.
The error bars do not appear to vary too much and the line seems sufficiently flat to feel that neither assumption has been violated.
#residual vs fitted, means and s.e.
ggplot(lm2diag, aes(x=.fitted, y=.stdresid)) +
geom_point() +
stat_summary() +
stat_summary(fun.y="mean", geom="line")
## No summary function supplied, defaulting to `mean_se()
Identifying influential observations
Strongly influential observations can distort regression coefficients. The most influential datapoints will typically have more extreme predictor values (leverage), measured by .hat
, and large residuals. The overall influence of an observation on the model is measured by Cook’s D, variable .cooksd
.
The Toothgrowth
dataset is fairly balanced across doses and supps (20 at each of 3 doses, 30 at each of 2 supp). Thus, no values are “extreme”, so influential observations for this model will be those with large residuals.
In the following plot:
 map
.hat
tox
,.stdresid
toy
 map
.cooksd
tosize
, making more influential points larger.  label the points in
geom_text
with their row numbers for identification
The dependence of Cook’s D on both leverage and residual is apparent in the plot, with Cook’s D rising as we move away from center (larger residual) and to the right (larger leverage). No point looks too influential for concern.
# in geom_text we SET size to 4 so that size of text is constant,
# and not tied to .cooksd. We also nudge the text .001 (xaxis units)
# to the left, and separate overlapping labels
ggplot(lm2diag, aes(x=.hat, y=.stdresid, size=.cooksd)) +
geom_point() +
geom_text(aes(label=row.names(lm2diag)),
size=4, nudge_x=0.003, check_overlap=T)
Example 1: Customizing a plot for presentation
We have decided to present our model doseresponse curves with confidence intervals for an audience.
Our model (lm2) implies smooth, quadratic dose functions so we predict fitted values at more finely spaced doses than in the original data (0.5,1,2). First we create the new dataset of doses between 0.5 and 2, separated by 0.01 and a corresponding dosesquared variable. Then, we use predict
to add fitted values and confidence bands estimates to the new dataset.
#create new data set of values from 0.5 to 2 for each supp
newdata < data.frame(dose=rep(seq(0.5, 2, .01),each=2),
supp=factor(c("OJ", "VC")))
#dose squared variable
newdata$dosesq < newdata$dose^2
#add predicted values and confidence limits
newdata < data.frame(newdata, predict(lm2, newdata,
interval="confidence"))
Let’s peak at the new variables.
#check the new data for plotting
head(newdata)
## dose supp dosesq fit lwr upr
## 1 0.50 OJ 0.2500 13.645714 11.580988 15.710440
## 2 0.50 VC 0.2500 7.564286 5.499560 9.629012
## 3 0.51 OJ 0.2601 13.859154 11.829995 15.888313
## 4 0.51 VC 0.2601 7.793760 5.764601 9.822919
## 5 0.52 OJ 0.2704 14.070849 12.075506 16.066191
## 6 0.52 VC 0.2704 8.021807 6.026465 10.017150
Initial plot: fitted vs dose by supp
First we map dose to x
, fit is mapped to y
and supp is mapped to color
. We begin with a line plot of the fitted curves.
#fit curves by supp
p1.0 < ggplot(newdata, aes(x=dose, y=fit, color=supp)) +
geom_line()
p1.0
Add confidence bands with geom_ribbon
geom_ribbon
requires the aesthetics ymax
and ymin
, which we map to upr and lwr, respectively. We specify that fill
colors of the ribbons are mapped to supp as well (otherwise they will be filled black).
#add confidence bands
p1.1 < p1.0 + geom_ribbon(aes(ymax=upr, ymin=lwr, fill=supp))
p1.1
Oops! THe fill color has obscured everything. We set (outside of aes
) the alpha level of geom_ribbon to a constant of 1/5.
#make confidence bands transparent
p1.2 < p1.0 + geom_ribbon(aes(ymax=upr, ymin=lwr, fill=supp),
alpha=1/5)
p1.2
Adding original data (in a different dataset) to the graph
As a final data addition to the graph, we decide to overlay a scatter plot of the original data. In geom_point
we respecify the data to be ToothGrowth
(instead of lm2), and respecify the y
aesthetic to len, the observed outcome.
#overlay observed data
p1.3 < p1.2 + geom_point(data=ToothGrowth, aes(y=len))
p1.3
Finishing the graph: adjusting theme and theme elements
Finally, we make a few nondata related changes to the graph.
First we change the overall look of a graph with a builtin theme, theme_classic
, which replaces the gray background and white gridlines with a white background, no grid lines, no border on the top or right. This theme mimics the look of base R
graphics.
The ggplot2
package provides a few builtin themes (see http://docs.ggplot2.org/current/ggtheme.html), which make several changes to the overall background look of the graphic.
theme_classic
mimics base R
graphics
#overlay observed data
p1.4 < p1.3 + theme_classic()
p1.4
Next we change the color of the “OJ” group to a more obvious orange.
scale_fill_manual
and scale_color_manual
allow color specification through hexadecimal notation (often called color hex codes, # follwed by 6 alphanumeric characters).
On the Color Brewer website, are many readytouse color scales with corresponding color hex codes. We find a qualitative scale we like(“Set2”) with an orange, hex code “#fd8d62”, and a complementary green, hex code “#66c2a5”.
We apply these orangegreen colors scales to both the color
aesthetic (color of points, fit line and the borders of ribbon), and to the fill
aesthetic (fill color of the confidence bands)
#specify hex codes for orange, #fc8d62, and green #66c2a5
p1.5 < p1.4 +
scale_color_manual(values=c("#fc8d62", "#66c2a5")) +
scale_fill_manual(values=c("#fc8d62", "#66c2a5"))
p1.5
Finally, we use labs
to make the axes and guide titles more descriptive. The “\n” is a newline. We also move the guide (legend) to the bottom with theme.
#note the default text justifications for the xaxis vs guide
p1.6 < p1.5 + labs(x="dosage vitamin C\n(mg/day)",
y="tooth length", fill="supplement\ntype",
color ="supplement\ntype") +
theme(legend.position="bottom")
p1.6
#FULL CODE FOR THE ABOVE GRAPH
ggplot(newdata, aes(x=dose, y=fit, color=supp)) +
geom_line() +
geom_ribbon(aes(ymax=upr, ymin=lwr, fill=supp), alpha=1/5) +
geom_point(data=ToothGrowth, aes(x=dose, y=len)) +
theme_minimal() +
scale_color_manual(values=c("#fc8d62", "#66c2a5")) +
scale_fill_manual(values=c("#fc8d62", "#66c2a5")) +
labs(x="dosage vitamin C\n(mg/day)", y="tooth length",
fill="supplement\ntype",
color ="supplement\ntype") +
theme(legend.position="bottom")
Recap of Example 1
 Use exploratory graphs to characterize the sample and to formulate the functional relationships between outcomes and predictors.
 The stat_summary function greatly simplifies plotting means and other summaries.
 For linear regression models, fortify datasets with diagnostic variables.
 Assess model assumptions with diagnostic plots.
 You can plot from multiple datasets on the same graph!
Example 2: Logistic regression
The esoph
dataset
The esoph
dataset:
 loaded with the
datasets
package  comes from a casecontrol study of the risk of alcohol and tobacco use in predicting esophageal cancer.
 consists of 88 “grouped” observations, which record the number of cases (with cancer) and controls ( without cancer) who fall in the same groupings of age, alcohol use, and tobacco use.
esoph
variables
Variables in the dataset:
 agegp: ordered factor, age group, 6 levels (2534, 3544, etc.)
 alcgp: ordered factor, alcohol consumption gm/day, 4 levels (039, 4079, etc.)
 tobgp: ordered factor, tobacco consumption gm/day, 4 levels (09, 1019, etc.)
 ncases: numeric, number of cases
 ncontrols: numeric, number of controls
#structure of esoph
str(esoph)
## 'data.frame': 88 obs. of 5 variables:
## $ agegp : Ord.factor w/ 6 levels "2534"<"3544"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp : Ord.factor w/ 4 levels "039g/day"<"4079"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp : Ord.factor w/ 4 levels "09g/day"<"1019"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
#head of esoph
head(esoph)
## agegp alcgp tobgp ncases ncontrols
## 1 2534 039g/day 09g/day 0 40
## 2 2534 039g/day 1019 0 10
## 3 2534 039g/day 2029 0 6
## 4 2534 039g/day 30+ 0 5
## 5 2534 4079 09g/day 0 27
## 6 2534 4079 1019 0 7
In the first row we see that there were 0 cases and 40 controls in the grouping age 2534, alcohol consumption 039g/day, and tobacco consumption 09g/day.
The grouped data structure here is common for casecontrol designs, uncommon for other designs.
Example 2: Exploratory analysis
Knowing that the outcome (case vs control) is binary, we expect to conduct a logistic regression, which models how the probability of the outcome (logittransformed) is predicted by the various risk factors.
We would thus like to plot the frequencies of both cases and controls for each grouping to visualize these probabilities. Unfortunately, those frequencies are in different variables in the esoph dataset, ncases and ncontrols, and we can map only one variable to y
.
#we want to plot both number of cases and controls for each group
#but they are in different variables "ncases" and "ncontrols"
head(esoph)
## agegp alcgp tobgp ncases ncontrols
## 1 2534 039g/day 09g/day 0 40
## 2 2534 039g/day 1019 0 10
## 3 2534 039g/day 2029 0 6
## 4 2534 039g/day 30+ 0 5
## 5 2534 4079 09g/day 0 27
## 6 2534 4079 1019 0 7
The graph below will plot the total ncases for each alcohol group.
In geom_bar
, the default is stat=bin
, which counts the number of observations that fall into each value of the variable mapped to x
(alcgp).
By respecifying stat="identity"
, geom_bar
will plot the values stored in the variable mapped to y
(ncases). geom_bar
is smart enough to sum ncases across groups with the same value of alcgp.
In other words, the data are already binned.
How do we add ncontrols to the bar plot of ncases below?
#"stat=identity" plots the value of the variable mapped to "y" (ncases) as is
ggplot(esoph, aes(x=alcgp, y=ncases)) +
geom_bar(stat="identity")
Melting ncases and ncontrols into one column
We want ncases and ncontrols in a single column while knowing whether the value came from ncases or ncontrols.
The melt
function from the reshape
package takes a data.frame
and stacks all nonfactor variable columns into one variable (called “value”, by default), while creating another variable that indexes the originating column (called “variable”, by default). The factor variables define the groups and are unaltered.
In esoph
we have 3 factor variables, agegp, alcgp and tobgp. After melt
, the remaining variables ncases and ncontrols will be stacked into one column, creating 2 rows of observations for each crossing of agegp, alcgp and tobgp.
Below we create a melted version of the esoph
dataset for exploratory graphing. In the column “value” we see values that came from the columns in esoph
, ncases and ncolumns. The originating column is indexed by the column “variable”.
#melt esoph
esoph_melt < melt(esoph)
## Using agegp, alcgp, tobgp as id variables
#sort data before viewing (arrange from package dplyr)
esoph_melt < arrange(esoph_melt, agegp, alcgp, tobgp)
#we can see the stacking of ncases and ncontrols
head(esoph_melt)
## agegp alcgp tobgp variable value
## 1 2534 039g/day 09g/day ncases 0
## 2 2534 039g/day 09g/day ncontrols 40
## 3 2534 039g/day 1019 ncases 0
## 4 2534 039g/day 1019 ncontrols 10
## 5 2534 039g/day 2029 ncases 0
## 6 2534 039g/day 2029 ncontrols 6
Let’s rename “variable” to “status”, and rename “value” to “n”, since that is what the new columns represent.
#rename variables for more meaning.
colnames(esoph_melt)[4:5]< c("status", "n")
head(esoph_melt)
## agegp alcgp tobgp status n
## 1 2534 039g/day 09g/day ncases 0
## 2 2534 039g/day 09g/day ncontrols 40
## 3 2534 039g/day 1019 ncases 0
## 4 2534 039g/day 1019 ncontrols 10
## 5 2534 039g/day 2029 ncases 0
## 6 2534 039g/day 2029 ncontrols 6
Now we have one column, n, which we can map to y
, and we can use status as a fill
color for the plot.
Putting ncases and ncontrols on the same graph
We now plot counts of cases and controls across alcoholg groups on the same graph.
We again use geom_bar
and stat=identity
to plot the values stored in y=n
, now with status mapped to fill
color of the bars.
Notice that the resulting bars are stacked.
#put n for cases and controls on the same bar graph
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity")
We have more light drinkers than heavy drinkers in the dataset (overall height of bars). Probability of being a case seems to increase with alcohol consumption (red takes up proportionally more as we move right).
A brief aside about position adjustments
When 2 graphical objects are to be placed in the same or overlapping positions on the graph, we have several options, called “position adjustments”, for placement of the objects:
dodge
: move elements sidebyside,fill
: stack elements vertically, standardize heights to 1identity
: no adjustment, objects will overlapjitter
: add a little random noise to positionstack
: stack elements vertically
These options are specified in a “position” argument (in a geom or stat). For geom_bar
, by default “position=stack” for stacked bars. For geom_point
, by default “position=identity” for overlapping points.
The next few graphs demonstrate the utility of position adjustments for bar graphs.
position=stack
Stacking let’s us visualize both proportions and counts.
#by default, geom_bar uses position="stack"
#looks same as above
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="stack")
position=dodge
Dodging allows us to compare counts better.
#position = dodge move bars to the side
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="dodge")
position=fill
Filling allows us to compare proportions (probabilities) better.
#position = fill stacks and standardizes height to 1
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="fill")
position=identity
Using identity will obscure one of the groups.
#position = identity overlays, obscruing "ncases"
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="identity")
position=jitter
for scatter plots
Jittering is typically used to separate overlaid continuous values by adding a tiny bit of noise to each value. We revisit the ToothGrowth
example from Example 1 to show jittering.
#jittering moves points a little away from x=0.5, x=1 and x=2
ggplot(ToothGrowth, aes(x=dose, y=len, fill=supp)) +
geom_point(position="jitter")
End of position adjustment aside
Graphing observed probabilities
Logistic regression models probabilities, so we will graph proportions of cases vs controls for each grouping of age, alcohol and tobacco.
Using position="fill"
in geom_bar stacks overlapping groups and standardizes the height of all bars to 1, creating a visual representation of a proportion or probability.
Let’s look at bar charts of each our grouping variables alone against “filled” bars of numbers of cases and controls.
#alcohol group
an < ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="fill")
an
#age group instead of alcohol consumption
an + aes(x=agegp)
#tobacco consumption instead of agegp
an + aes(x=tobgp)
The proportion of ncases seems to increase with all 3 predictors, most clearly with alcgp. There agegp relationship appears possibly nonlinear.
Visualizing moderation by adding more grouping variables to the graph
Is the effect of one of these risk factors moderated by another risk factor (interaction)? Separating or grouping the graph by the moderator can help us visualize possible interaction.
We add another grouping variable to the graph through faceting – we will make multiple version of these bar graphs split by a faceting variable.
We will split the bar graph of aclgp along the rows by tobgp using facet_grid
.
#tobgp to split rows, no splitting along columns
an + facet_grid(tobgp~.)
The panels look rather similar – no clear indication of moderation of the alcohol effect by tobacco.
Faceting by 2 variables
Faceting on two variables can visualize a threeway interaction.
We split the previous graph along the columns by agegp by adding agegp after the ~
. The code labeller="label_both"
adds both the factor name and the value for each facet label.
#3way interaction visualization
an + facet_grid(tobgp~agegp, labeller="label_both")
More alcohol, old age, and more tobacco usage all seems to predict being a case – we see more red on the right bars in each plot, as we move down rows and we move right across columns. No clear indication of interactions (but see next slide).
In the highest smoking groups, 30+/day (bottom row), the 2 youngest age groups, 2534 and 3544 (2 left columns) had no cases at all, regardless of drinking. Does young age offer some protection against (moderate) the effects of alcohol and tobacco?
Before we break out the cigarettes and whiskey, let’s check our sample sizes in each of these groups.
We change to position=stack
to get stacked counts.
#stacked counts
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="stack") +
facet_grid(tobgp~agegp, labeller="label_both")
Very few heavy smokers or heavy drinkers were studied in the youngest 2 age groups.
Example 2: Logistic regression models
Our graphs seem to point to mostly linear effects of age, alcohol and tobacco on probability of esophogeal cancer (case status).
By default, an ordered factored will be entered as polynomial contrasts (linear, quadratic, cubic, etc.) in a regression model. We thus expect mostly linear effects.
All of the linear (.L) effects are significant. Age may have a quadratic effect, but the sample sizes in the highest age group is small, so we are doubtful of this effect.
#ordered factors with k levels have k1 polynomial coefficients
logit1 < glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
data=esoph, family=binomial)
summary(logit1)$coef
## Estimate Std. Error z value Pr(>z)
## (Intercept) 1.77996650 0.1979579 8.9916409 2.435665e19
## agegp.L 3.00533623 0.6521503 4.6083490 4.058789e06
## agegp.Q 1.33787489 0.5911085 2.2633323 2.361521e02
## agegp.C 0.15306634 0.4485356 0.3412580 7.329094e01
## agegp^4 0.06410003 0.3088056 0.2075741 8.355616e01
## agegp^5 0.19363171 0.1953703 0.9911011 3.216362e01
## tobgp.L 0.59448332 0.1942245 3.0608053 2.207426e03
## tobgp.Q 0.06536502 0.1881135 0.3474764 7.282334e01
## tobgp.C 0.15679378 0.1865788 0.8403623 4.007053e01
## alcgp.L 1.49185165 0.1993506 7.4835584 7.233674e14
## alcgp.Q 0.22663253 0.1795222 1.2624206 2.067975e01
## alcgp.C 0.25463311 0.1590636 1.6008253 1.094156e01
Our planned model only included linear effects, so we use as.numeric
to treat the ordered factors as continuous, which will model just linear effects.
#change to numeric to model just linear effects of each
logit2 < glm(cbind(ncases, ncontrols) ~ as.numeric(agegp) +
as.numeric(alcgp) + as.numeric(tobgp),
data=esoph, family=binomial)
summary(logit2)$coef
## Estimate Std. Error z value Pr(>z)
## (Intercept) 5.5959444 0.41539708 13.471314 2.307236e41
## as.numeric(agegp) 0.5286674 0.07187623 7.355246 1.905751e13
## as.numeric(alcgp) 0.6938248 0.08341803 8.317444 8.988022e17
## as.numeric(tobgp) 0.2744565 0.08073921 3.399296 6.755944e04
We are happy with this model and proceed to check its assumptions.
Example 2: Diagnostic plots
The assumptions of logistic regression are not the same as those in linear regression (no assumptions of normality of residuals or homoscedasticity).
Logistic regression is a special case of the generalized linear model. The generalized linear model assumes linear relationships between the independent variables and the linear predictor (log odds of the outcome for logistic regression).
The plot analagous to the fitted vs residual plot for the generalized linear model is the linear predictor (fitted) vs deviance residual plot. We can get the estimate of the linear predictor (log odds of cancer) from predict
. The deviance residuals come from residuals
with type="deviance"
.
#predict for glm returns linear predictor by default
esoph$.xb < predict(logit2, esoph)
#devaince residuals
esoph$.devr < residuals(logit2, type="deviance")
head(esoph)
## agegp alcgp tobgp ncases ncontrols .xb .devr
## 1 2534 039g/day 09g/day 0 40 4.098996 1.1472830
## 2 2534 039g/day 1019 0 10 3.824539 0.6571706
## 3 2534 039g/day 2029 0 6 3.550083 0.5829324
## 4 2534 039g/day 30+ 0 5 3.275626 0.6090692
## 5 2534 4079 09g/day 0 27 3.405171 1.3280595
## 6 2534 4079 1019 0 7 3.130714 0.7737113
linear predictor vs deviance residual plot
With these variables we create a scatter plot and loess smooth. We would want the smooth to be flatter – its shape suggests we missed a quadratic effect.
#linear predictor vs deviance residual
ggplot(esoph, aes(x=.xb, y=.devr)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
Our first logistic regression model suggested a quadratic effect of agegp. Coloring the previous graph by agegp (and sizing by alcgp) shows a quadratic relationship between the residual and age (notice the progression of colors along the invertedU)
#sized by alcohol and colored by age
ggplot(esoph, aes(x=.xb, y=.devr,
size=alcgp, color=agegp)) +
geom_point()
Plots to assess influence
Cook’s D vs deviance residual to assess influence. Maybe check that one outlier.
#cook's d vs deviance residual
esoph$.cooksd < cooks.distance(logit2)
ggplot(esoph, aes(x=.devr, y=.cooksd, color=agegp, size=alcgp)) +
geom_point() +
geom_text(aes(label=row.names(esoph)),
size=4, nudge_x=0.15, color="black", check_overlap=T)
No one alcohol or age group seems to have systematically large influence on the model.
Example 2: Customizing a plot for presentation
Our planned graph for presentation will show how the model predicted probabilities of esophogeal cancer are with all 3 risk factors, age, alcohol and tobacco. Imagine we are constrained to colorless images, so we cannot use color
or fill
.
Predicted probabilities will be mapped to y
, and we map tobgp to x
. Faceting by the other 2 risk factors would produce too many plots to fit in a journal article figure, so we will facet only by one variable, agegp. We map alcgp to alpha
, to create a grayscale in place of a color scale.
First we need predicted probabilities, which we obtain with predict
and type="response"
(remember default is linear predictor, log odds).
esoph$.pred < predict(logit2, esoph, type="response")
Starting graph
#starting graph
p2.0 < ggplot(esoph, aes(x=tobgp, y=.pred,
alpha=alcgp, group=alcgp)) +
geom_point(size=3) +
facet_grid(~agegp)
p2.0
Removing background color
Let’s remove the background gray (panel.background
) in the plotting area to make the lighter points easier to see.
However, the predicted probabilities need some guide, so we add back some panel grid lines on the yaxis (panel.grid.major.y
).
We also remove the gray in the facet headings (strip.background
) to clean up the graph.
All of these changes are made within theme
. Remember to use element_blank
to remove a theme_element.
#in theme
# set panel.background to blank to remove it completely
p2.1 < p2.0 +
theme(panel.background=element_blank(),
strip.background=element_blank(),
panel.grid.major.y=element_line(color="gray90"))
p2.1
Labeling guides and changing fonts
Next we fix the guide (legend) titles with labs
. Notice how we give a title to the guide by specifying the aesthetic (x
, alpha
, etc.).
We also standardize all the font sizes and families in theme
by setting various theme elements to element_text
. The xaxis tick labels (axis.text.x
) are rotated for clarity.
#Set titles with labs
#set fonts and sizes with theme
p2.2 < p2.1 +
labs(title="age group",
x="tobacco use(g/day)",
y = "predicted probability of esophogeal cancer",
alpha="alcohol\nuse(g/day)") +
theme(plot.title=element_text(family="serif", size=14),
axis.title=element_text(family="serif", size=14),
legend.title=element_text(family="serif", size=14),
strip.text=element_text(family="serif", size=10),
legend.text=element_text(family="serif", size=10),
axis.text.x=element_text(family="serif", size=8, angle=90))
p2.2
Finally, we remove the unnecessary tick marks, ,fix the tick label “09g/day” to “09” and guide label for “039g/day” to “039”, and move the guide to the bottom. Notice that changing the labels for guides is done in a scale function, and that we chose “discrete” versions of the functions.
#Set titles with labs
#set fonts and sizes with theme
p2.3 < p2.2 +
theme(legend.position="bottom",
axis.ticks=element_blank()) +
scale_x_discrete(labels=c("09", "1019", "2029", "30+")) +
scale_alpha_discrete(labels=c("039", "4079", "80119", "120+"))
p2.3
#FULL CODE FOR ABOVE PLOT
ggplot(esoph, aes(x=tobgp, y=.pred, alpha=alcgp, group=alcgp)) +
geom_point(size=3) +
facet_grid(~agegp) +
labs(title="age group",
x="tobacco use(g/day)",
y = "predicted probability of esophogeal cancer",
alpha="alcohol\nuse(g/day)") +
theme(panel.background = element_blank(),
strip.background=element_blank(),
panel.grid.major.y = element_line(color="gray90"),
plot.title=element_text(family="serif", size=14),
axis.title=element_text(family="serif", size=14),
legend.title=element_text(family="serif", size=14),
strip.text=element_text(family="serif", size=10),
legend.text=element_text(family="serif", size=10),
axis.text.x=element_text(family="serif", size=8, angle=90),
legend.position="bottom",
axis.ticks=element_blank()) +
scale_x_discrete(labels=c("09", "1019", "2029", "30+")) +
scale_alpha_discrete(labels=c("039", "4079", "80119", "120+"))
Example 3
The sleepstudy dataset
The sleepstudy dataset:
 loaded with the
lme4
package  comes from a study looking at the effects of successive nights of sleep deprivation on reaction times
 constists of 180 observations of 18 Subjects
Three variables in sleepstudy
:
Reaction
: numeric, reaction time (ms)Days
: numeric, number of days of sleep deprivation, 0 to 9Subject
: factor, Subject id
#quick look at data structure
str(sleepstudy)
## 'data.frame': 180 obs. of 3 variables:
## $ Reaction: num 250 259 251 321 357 ...
## $ Days : num 0 1 2 3 4 5 6 7 8 9 ...
## $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
Example 3: Analysis assuming independence of observations (simple linear regression)
We first graph the data without any regard to repeated measurements of subjects and analyze the data in a simple linear regression model. This analysis is contrasted later with graphs by subject and mixed effects (multilevel) models.
We hope this will highlight the dangers of ignoring clustering in the data as well as outline the usefulness of graphs by cluster and mixed models.
First, we plot the expected mean functional relationship between Days and Reaction with a scatter plot and a loess smooth. We map Days to x
and Reaction to y
and choose geom_point
and geom_smooth
, which by default produces a loess smooth.
The relationship appears linear.
#mean relationship
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
We next regress Reaction on Days and fortify
dataset of diagnostic variables.
#linear model
lm1 < lm(Reaction ~ Days, sleepstudy)
summary(lm1)$coef
## Estimate Std. Error t value Pr(>t)
## (Intercept) 251.40510 6.610154 38.033169 2.156888e87
## Days 10.46729 1.238195 8.453663 9.894096e15
#fortify dataset with diagnostics
lm1_diag < fortify(lm1)
head(lm1_diag)
## Reaction Days .hat .sigma .cooksd .fitted .resid
## 1 249.5600 0 0.019191919 47.84911 1.491618e05 251.4051 1.845105
## 2 258.7047 1 0.013804714 47.84872 3.127897e05 261.8724 3.167691
## 3 250.8006 2 0.009764310 47.82165 1.014574e03 272.3397 21.539077
## 4 321.4398 3 0.007070707 47.76050 2.350740e03 282.8070 38.632837
## 5 356.8519 4 0.005723906 47.60871 5.139877e03 293.2742 63.577651
## 6 414.6901 5 0.005723906 47.11275 1.565262e02 303.7415 110.948565
## .stdresid
## 1 0.03904601
## 2 0.06685116
## 3 0.45363381
## 4 0.81254059
## 5 1.33628349
## 6 2.33193164
A qqplot to assess normality of residuals. Remember to use sample
instead of x
. Maybe some issues in the tails
#qqplot of residuals vs normal
ggplot(lm1_diag, aes(sample=.stdresid)) +
stat_qq()
Next we plot fitted values vs standardized residuals to assess homodescasticity and linearity, and a loess curve to check for trend. No obvious trend appears.
#fitted vs standardized residuals with loess
ggplot(lm1_diag, aes(x=.fitted, y=.stdresid)) +
geom_point() +
geom_smooth()
Independence of observations – residuals by subject
Linear regression models assume independence of the observations.
Repeated measurements by subject almost certainly violates this assumption.
A plot of residuals by Subject, with means and standard errors, will assess this assumption. In the plot below we see evidence of clustering by Subject.
Subjects appear to have different mean levels of Reaction (i.e. different intercepts).
#SET color and size to make dots stand out more
lm1_diag$Subject < sleepstudy$Subject
ggplot(lm1_diag, aes(x=Subject, y=.stdresid)) +
geom_point() +
stat_summary(color="red", size=1)
Slopes by subject
Faced with evidence that Subjects have different baseline levels of Reaction, we further suspect that not all subjects react to sleep deprivation in the same way. In other words, we believe there is heterogeneity in the Days effect on Reaction among subjects (i.e. we want random slopes for Days).
If the mean slope described all Subjects well, then we would see no systematic trend in the residuals across Subjects. A plot of loess smooths of fitted value vs residuals by Subject confirms our suspicion.
The plot below shows considerable trend variation between Subjects.
#fitresidual slopes by Subject  not all flat!
ggplot(lm1_diag, aes(x=.fitted, y=.stdresid,
color=Subject)) +
geom_point() +
geom_smooth(span=1.5, se=F)
Knowing we have considerable heterogeneity in effects by Subject, we create graphs by Subject and apply mixed effects regression models to our data in the next section.
Example 3: Exploratory graphs by subject
We repeat the initial scatter plot of Days vs Reaction with loess smooth, but now facet by Subject. We can assess the linearity of the relationship across Subjects.
#DaysReaction faceted by Subject
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
geom_point() +
geom_smooth() +
facet_wrap(~Subject)
Curves are too jagged so we increase span
to 1 (default .75) in geom_smooth
.
#DaysReaction faceted by Subject
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
geom_point() +
geom_smooth(span=1.5) +
facet_wrap(~Subject)
Some smooths appear mostly linear, some more curved. A couple are flat, though the majority appear to show increasing Reaction with increasing Days of sleep deprivation.
Placing all the smooths colored by Subject on one graph allows us to see the variation in the intercepts better. We can also assess correlation between the intercept and the slope, with a widening fan shape lefttoright indicating a positive correlation, and a narrowing wedge shape indicating negative correlation.
#all smooths one graph, se=F suppresses confidence bands
#no obvious correlation between intercept and slope
ggplot(sleepstudy, aes(x=Days, y=Reaction,
color=Subject)) +
geom_point() +
geom_smooth(span=1.5, se=F)
Example 3: Mixed effects models
Our graphs suggest a linear effect of Days, with Subjectlevel heterogeneity of intercepts and effect of Days.
We estimate a model with a fixed effect of Days, and random intercepts and coefficients for Days by Subject. Our graph suggested that the random effects were uncorrelated so we use the “” notation to specify this.
#random intercepts and coefficients for Days, uncorrelated
lmer2 < lmer(Reaction ~ Days + (1+DaysSubject),
data=sleepstudy)
summary(lmer2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + ((1  Subject) + (0 + Days  Subject))
## Data: sleepstudy
##
## REML criterion at convergence: 1743.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## 3.9626 0.4625 0.0204 0.4653 5.1860
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 627.57 25.051
## Subject.1 Days 35.86 5.988
## Residual 653.58 25.565
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.885 36.51
## Days 10.467 1.560 6.71
##
## Correlation of Fixed Effects:
## (Intr)
## Days 0.184
Example 3: Model diagnostics
Mixed models assume (at least):
 normality of residuals and random effects
 linearity of predictor effects
 homoscedasticity
 independence of observations across the clustering variable (Subject)
 independence of the random effects and fixed effects
 correct specification of the model.
We should also check for overly influential observations.
Conveniently, the ggplot2 package inlcudes fortify
functions for lmer
model (.merMod
) objects. Returned in fortify.merMod
datasets are:
 .fitted: fitted values
 .resid: raw residuals
 .scresid: standardized (scaled Pearson) residuals.
Many of the plots below can be made (more easily) with calls to plot
and qqmath
on the lmer2 .merMod
object (e.g plot(lmer2)
).
First we fortify the dasaset with diagnostic variables.
#fortify dataset, works with lmer model objects
lmer2_diag < fortify(lmer2)
head(lmer2_diag)
## Reaction Days Subject .fitted .resid .scresid
## 1 249.5600 0 308 252.9178 3.357801 0.1313422
## 2 258.7047 1 308 272.7086 14.003876 0.5477692
## 3 250.8006 2 308 292.4994 41.698751 1.6310693
## 4 321.4398 3 308 312.2901 9.149673 0.3578944
## 5 356.8519 4 308 332.0809 24.770998 0.9689310
## 6 414.6901 5 308 351.8717 62.818423 2.4571767
Normality of residuals
We first inspect the normality of the standardized residuals from the model. The residuals look better than in the simple linear regression model.
ggplot(lmer2_diag, aes(sample=.scresid)) +
stat_qq()
Normality of random effects
The ranef
function extracts the random effects for each Subject from the model. Within the list output of ranef
is a data.frame
, named Subject
, containing the random effects. We rename the column Intercept for easier use.
#extract random effects
re < ranef(lmer2)
#output is a list object, with data.frame of random effects
# called Subject
str(re)
## List of 1
## $ Subject:'data.frame': 18 obs. of 2 variables:
## ..$ (Intercept): num [1:18] 1.51 40.37 39.18 24.52 22.91 ...
## ..$ Days : num [1:18] 9.32 8.6 5.39 4.97 3.19 ...
##  attr(*, "class")= chr "ranef.mer"
#rename column 1 of random effects data.frame
names(re$Subject)[1] < "Intercept"
#check normality of random effects
#same as lme4 method  qqmath(ranef(lmer2))
Distribution of random intercepts
Not terrible looking for sample size 18
ggplot(re$Subject, aes(sample=Intercept)) +
stat_qq()
Distribution of random slopes
Not bad looking
ggplot(re$Subject, aes(sample=Days)) +
stat_qq()
Fitted vs scaled residuals
We can also plot the fitted vs scaled residuals to assess homoscedasticity and linearity. Neither assumption appears grossly vioalted.
#fitted vs residuals; same as plot(lmer2)
ggplot(lmer2_diag, aes(x=.fitted, y=.scresid)) +
geom_point() +
geom_smooth()
Residuals by subject again
We revisit this diagnostic plot to assess if we controlled for variation in the intercept by Subject. Subject mean residuals look very similar.
#std residuals by Subject  most means near 0
ggplot(lmer2_diag, aes(x=Subject, y=.scresid)) +
geom_point() +
stat_summary(color="red")
Let’s compare the residuals of the simple linear regression model to those from the mixed model. We plot raw residual means and standard errors rather than standardized residuals to emphasize differences in the variances of the residuals.
#fitted vs residuals
#means and standard errors of residuals by Subject from
# simple linear (blue) and mixed models (red)
# alpha set to 1/2 and size to 2 for better comparison
ggplot(lmer2_diag, aes(x=Subject, y=.resid)) +
stat_summary(color="red",
alpha=1/2, size=2) +
stat_summary(data=lm1_diag, color="blue",
alpha=1/2, size=2)
Here we see that the mixed model (red) has addressed Subject heterogeneity much better than the simple linear regression model.
## No summary function supplied, defaulting to `mean_se()
## No summary function supplied, defaulting to `mean_se()
Influence
Finally, we replot fit vs residual, with points replaced by geom_text
, label
mapped to Subject and color
mapped to Days. Subject 332 has 2 large residuals – perhaps we should inspect her data.
#inspect points with large residuals by Subject and Days
ggplot(lmer2_diag, aes(x=.fitted, y=.scresid, color=Days)) +
geom_text(aes(label=Subject))
Example 3: Customizing a plot for presentation
For our central figure we would like to display the following:
 mean effect of Days on Reaction
 variability in the effect of Days on Reaction across Subjects
 fit of the model to observed data for each Subject
We have 18 subjects, a small enough number that we can fit all of our data in a graph faceted by Subject. We will facet into 3 rows of graphs.
We will use our “lmer2_diag” diagnostics dataset because we need the fitted values.
First, a scatterplot of observed x=Days
vs y=Reaction
, faceted by Subject, we use geom_point
. For the fitted line for each subject as well, we use geom_line
, but remap y
to .fitted.
#observed and fitted DaysReaction by Subject
#increase size of line for visibility
p3.0 < ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~Subject, nrow=3)
p3.0
We now want to add the mean intercept and mean DaysReaction slope for reference to each facet. Those are represented by the fixed intercept and Day coefficient, respectively. We extract the coefficients from the model using fixef.
fixef(lmer2)
## (Intercept) Days
## 251.40510 10.46729
An easy solution to add the reference mean slope is to use geom_abline
, which understands the aesthetics intercept
and slope
. We want the reference slope underneath the points and fit line, so we add it first to the plot.
#give them easy names
int < fixef(lmer2)[1]
slope < fixef(lmer2)[2]
#reference line goes underneath, so specify geom_abline first
p3.1 < ggplot(lmer2_diag, aes(x=Days,
y=Reaction)) +
geom_abline(intercept=int, slope=slope,
color="red", size=.75) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~Subject, nrow=3)
p3.1
While not a bad solution, the line extends beyond the limits of Days in the study, 0 and 9, which we don’t like. We cannot specify limits in geom_abline
, but we can in geom_segment
, which understands the aesthetics x
, xend
, y
, and yend
.
To extend the line from Days 0 to 9, set (outside aes) x
to 0, and xend
to 9. The first yvalue at Days=0 is the intercept, so we set y
to “int”, and the last yvalue we use the formula for a line (y = intercept + slope*x), and set yend
to (int + slope*9).
#use geom_segment to specify limits of line
#x goes from 0 to 9
#y goes from intercept to (intercept + slope*9)
p3.2 < ggplot(lmer2_diag, aes(x=Days,
y=Reaction)) +
geom_segment(x=0, xend=9,
y=int, yend=(int+slope*9),
color="red", size=.75) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~Subject, nrow=3)
p3.2
That looks better.
Sorting the facets by the random slopes
At the top of each facet, we see that the facets have been ordered by Subject id number. That numbering has no intrinsic meaning. Sorting the facets instead by the magnitude of the estimated slope for each Subject makes grasping the variation among slopes easier.
We first estimate each Subject’s slope, which is the sum of the fixed effect coefficient for Days and the estimate of the random effect of Days (BLUP) for that subject. We already have the fixed effect of Days, so we use ranef
to obtain a data.frame
of the random effects by Subject.
We extract random effects as “re”, rename the slope column to “slope”, create a Subject column, and then merge it with our diagnostic dataset.
#look at random effects
head(ranef(lmer2))
## $Subject
## (Intercept) Days
## 308 1.5126960 9.3234894
## 309 40.3738983 8.5991692
## 310 39.1810427 5.3877905
## 330 24.5189056 4.9686458
## 331 22.9144343 3.1939349
## 332 9.2219741 0.3084936
## 333 17.1561219 0.2872074
## 334 7.4517338 1.1159901
## 335 0.5787256 10.9059664
## 337 34.7679291 8.6276161
## 349 25.7543247 1.2806878
## 350 13.8650359 6.7564006
## 351 4.9159804 3.0751330
## 352 20.9290432 3.5122096
## 369 3.2586475 0.8730507
## 370 26.4758277 4.9837865
## 371 0.9056476 1.0052929
## 372 12.4217579 1.2584028
–
#save data.frame of random effects
re < ranef(lmer2)$Subject
#rename column for use with ggplot
colnames(re)[2] < "rand_slope"
#create Subject column out of rownames, make it a factor
re$Subject < factor(rownames(re))
#merge our random effects into our diagnostic dataset by Subject
lmer2_diag < merge(lmer2_diag, re, by="Subject")
#check that merge worked
str(lmer2_diag)
## 'data.frame': 180 obs. of 8 variables:
## $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Reaction : num 250 259 251 321 357 ...
## $ Days : num 0 1 2 3 4 5 6 7 8 9 ...
## $ .fitted : num 253 273 292 312 332 ...
## $ .resid : num 3.36 14 41.7 9.15 24.77 ...
## $ .scresid : num 0.131 0.548 1.631 0.358 0.969 ...
## $ (Intercept): num 1.51 1.51 1.51 1.51 1.51 ...
## $ rand_slope : num 9.32 9.32 9.32 9.32 9.32 ...
We now add the fixed effect of Days (stored in “slope”) to each “rand_slope” to get the total slope for each subject. Then we sort the dataset by slope, and then by Day (using arrange
) to ensure everything is ordered properly.
#total slope is rand_slope + slope, the fixed effect of Days from above
slope < fixef(lmer2)[2]
lmer2_diag$rand_slope < lmer2_diag$rand_slope + slope
#sort by slope and then Days
lmer2_diag < arrange(lmer2_diag, rand_slope, Days)
unique(lmer2_diag$rand_slope)
## [1] 0.4386804 1.8681168 5.0794955 5.4986401 7.2733511 7.3921530
## [7] 9.4619930 10.1587924 10.1800786 11.3403366 11.5832760 11.7256888
## [13] 11.7479738 13.9794956 15.4510725 17.2236866 19.0949021 19.7907753
In place of subject number we will place the random slope estimate in the facet label. However, in the unique
output above, we notice quite a few decimal places, too many for facet titles. The format
function trims the number of digits in rand_slope.
#format rand_slope to have 2 digits after decimal
lmer2_diag$rand_slope < format(lmer2_diag$rand_slope, digits=2)
unique(lmer2_diag$rand_slope)
## [1] "0.44" " 1.87" " 5.08" " 5.50" " 7.27" " 7.39" " 9.46" "10.16"
## [9] "10.18" "11.34" "11.58" "11.73" "11.75" "13.98" "15.45" "17.22"
## [17] "19.09" "19.79"
We can now facet by rand_slope.
#facet by rand_slope
p3.3 < ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_segment(x=0, xend=9, y=int, yend=(int+slope*9),
color="red", size=.75) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~rand_slope, nrow=3)
p3.3
Adjusting theme elements
The variation in the slopes is more apparent with the sorting. We are happy with the datarelated aspects of the plot, so will now be adding to “p3.3”.
We address the nondata aspects of the graph now.
Relabeling ticks
Let’s relabel the ticks on the xaxis to (0,3,6,9), as data were collected on whole days. Use scale_x_continuous
since days numeric (continuous). The breaks
argument controls ticks.
#reset xaxis ticks with scale_x_continuous
p3.4 < p3.3 +
scale_x_continuous(breaks=c(0,3,6,9))
p3.4
Facet titles and fonts
Next, we use labs
to add a title at the top and axis titles. We use theme
to set the font family and font size of all text on the graph, including all titles, and tick labels.
#labs creates the title
#theme sets size and font family of title, axis title,
#facet titles, and tick labels
p3.5 < p3.4 +
labs(title="increase in reaction time (ms)
per day of deprivation",
x="days of deprivation", y="reaction time (ms)") +
theme(plot.title=element_text(family="serif", size=10),
axis.title=element_text(family="serif", size=10),
strip.text.x=element_text(family="serif", size=8),
axis.text=element_text(family="serif", size=8))
p3.5
Overall look: Look 1
Finally, we present two changes to the overall look of the graph. These will not build upon “p3.5” because we must respecify aesthetics within some of the geoms.
For the first look, we use a builtin theme theme_dark
to give the graph a nighttime feel (it’s a sleep study!). We change the colors of the reference line to black, and the observed datapoints and estimated slopes to white. NOTE: In the R code, we use base_family
to change all fonts used in theme_dark
to “serif”.
#use theme_dark for a nighttime theme
#place theme font specifications after theme_dark to ensure
# that they are obeyed
ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_segment(x=0, xend=9,
y=int, yend=(int+slope*9),
color="black", size=.75) +
geom_point(color="white") +
geom_line(aes(x=Days, y=.fitted),
size=.75, color="white") +
facet_wrap(~rand_slope, nrow=3) +
scale_x_continuous(breaks=c(0,3,6,9)) +
labs(title="increase in reaction time (ms) per day of deprivation",
x="days of deprivation", y="reaction time (ms)") +
theme_dark(base_family="serif") +
theme(plot.title=element_text(size=14),
axis.title=element_text(size=14),
strip.text.x=element_text(size=10),
axis.text=element_text(size=10))
Overall look: Look 2
We also present a graph using ink/color more minimally.
Below, we use theme
to change the backgrounds color of both the graphing area (panel.background
) and the strip heading (strip.background
).
The value NA sets a property to blank.
We also specify the color of the grid lines (panel.grid.major
and panel.grid.minor
), which are normally set to “white”, to make the axes appear notched. The various gray colors used become lighter as the number (from 0 to 100) following increases.
#specify background and grid elements using theme
#grays become lighter as number increases
ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_segment(x=0, xend=9, y=int, yend=(int+slope*9),
linetype=2, size=.75, color="gray60") +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~rand_slope, nrow=3) +
scale_x_continuous(breaks=c(0,3,6,9)) +
labs(title="increase in reaction time (ms) per day of deprivation",
x="days of deprivation", y="reaction time (ms)") +
theme(plot.title=element_text(family="serif", size=14),
axis.title=element_text(family="serif", size=14),
strip.text.x=element_text(family="serif", size=10),
axis.text=element_text(family="serif", size=10),
panel.background=element_rect(fill=NA, color="gray60"),
strip.background=element_rect(fill="gray90", color="gray60"),
panel.grid.major=element_line(color=NA),
panel.grid.minor=element_line(color=NA))
Conclusion
 We hope this seminar has shown you how to use ggplot2 at each stage of data analysis.
 We provided code for both simple and more complex graphs to demonstrate that
ggplot2
is appropriate for use by both users new to R and statistical graphing and by experienced users wishing to make beautiful, illustrative graphs for presentation.  Most importantly, we hope we have conveyed the underlying grammar clearly enough to inspire you to create new graphs to display data variation in novel ways.
References
 Books
 Wickham, H. (2009). ggplot2 Elegant Graphics for Data Analysis. Springer: New York.
 Wilkinson, L. (2005). The Grammar of Graphics. Springer: New York.
 Tufte, E. R. (2001). The Visual Display of Quantitative Information. Graphics Press: Cheshire. (For guidance on making clear and informative statistical graphs)
 Websites
 docs.ggplot2.org/current/ : all help files for ggplot2 functions