### Purpose of this seminar

This seminar introduces how to use the R ggplot2 package, emphasizing data analysis accompanying a regression model.

- First the underlying grammar (system) of graphics is introduced with demonstrations.
- Then, usage of ggplot2 for exploratory graphs, model diagnostics, and presentation of model results is illustrated through 3 examples.

## Background

### The `ggplot2`

package

- produces layered statistical graphics.
- uses an underlying “grammar” to build graphs component-by-component rather than providing premade graphs.
- is easy enough to use without any exposure to the underlying grammar.
- allows the user to build a graph from concepts rather than recall of commands and options.

We can conceive of new kinds of graphs to display more data variation more clearly.

### Installing `ggplot2`

and other packages

Install `ggplot2`

with `install.packages`

.

We install and load the following additional packages:

`Hmisc`

: for documentation for statistical summary functions used with`ggplot2`

`dplyr`

: data management`reshape`

: restructuring data`lme4`

: mixed effects modeling`nlme`

: needed for dataset (installed with lme4)

```
#run these if you do not have the packages installed
install.packages("ggplot2")
install.packages("Hmisc")
install.packages("dplyr")
install.packages("reshape")
install.packages("lme4")
```

```
#load libraries into R session
library(ggplot2)
library(Hmisc)
library(dplyr)
library(reshape)
library(lme4)
library(nlme)
```

`theme_set(theme_grey(base_size = 12))`

### ggplot2 documentation

http://docs.ggplot2.org/current/

Help file pages usually contain numerous code and graphics examples.

### The ggplot2 book

Read *ggplot2: Elegant Graphics for Data Analysis* by Hadley Wickham, creator of the ggplot2 package:

- http://ggplot2.org/book
- concise at 200 pages, but rich with information
- filled with pictures

This section of the seminar describing the grammar summarizes bits and pieces of chapter 3.

## The grammar of graphics

### Outline of grammar

Grammar of a language defines the rules of structuring words and phrases into meaningful expressions.

Grammar of graphics defines the rules of structuring mathematic and aesthetic elements into a meaningful graph.

Leland Wilkinson (2005) designed the grammar upon which `ggplot2`

is based.

### Elements of grammar of graphics

**Data:**variables**mapped**to aesthetic features of the graph.**Geoms:**objects/shapes on the graph.**Stats:**stastical transformations that summarize data,(e.g mean, confidence intervals)**Scales:**define which aesthetic values are mapped to data values. Legends and axes display these mappings.**Coordiante systems:**define the plane on which data are mapped on the graphic.**Faceting:**splits the data into subsets to create multiple variations of the same graph (paneling).

### Data and the `ggplot`

function

Data for plotting must be stored in a `data.frame`

.

Multiple data frames may be used in one plot.

The default dataset is specified in the `ggplot`

function (not “ggplot2”).

### Datasets and the `Milk`

dataset

All datasets used in this seminar are loaded with the R packages used for this seminar.

To discuss the grammar of graphics, we use the `Milk`

dataset, loaded with the `nlme`

package.

The `Milk`

dataset contains 1337 rows of the following 4 columns:

**protein:**numeric, protein content of milk**Time:**numeric, time since calving**Cow:**ordered factor, cow id**Diet:**factor, diet of cow, 3 levels = barley, barley+lupins, lupins

### A quick look at the `Milk`

dataset:

```
#first few rows of Milk
head(Milk)
## Grouped Data: protein ~ Time | Cow
## protein Time Cow Diet
## 1 3.63 1 B01 barley
## 2 3.57 2 B01 barley
## 3 3.47 3 B01 barley
## 4 3.65 4 B01 barley
## 5 3.89 5 B01 barley
## 6 3.73 6 B01 barley
```

### The `ggplot`

function

In the `ggplot`

function we specify the default dataset and map variables to *aesthetics* (aspects) of the graph.

We then add layers (of geoms, stats, etc.), producing the actual graphical elements of the plot.

### First `ggplot`

specifiction

In the `ggplot`

below:

- data set is
`Milk`

- Time mapped to x-axis
- protein mapped to the y-axis

```
#declare data and x and y aesthetics,
#but no shapes yet
ggplot(data = Milk, aes(x=Time, y=protein))
```

### Aesthetic mappings and `aes`

Aesthetics are the visually perceivable components of the graph.

Map variables to aesthetics using the `aes`

function, such as:

- which variables appear on the x-axis and y-axis.
- a classification variable to colors
- a numeric variable to the size of graphical objects

### Use `aes`

in `ggplot`

to specify default aesthetics

Embed the `aes`

function in other functions for convenience.

Mappings specified in `aes`

in the `ggplot`

function serve as defaults and are *inherited* by subsequent layers and do not need to respecified.

Additional `aes`

specifications in further layers will override the default aesthetics *for that layer*.

### Default aesthetics are inherited

```
#geom_point inherits x=Time, y=protein
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point()
```

### We can respecify `aes`

for specific layers

```
#color=Diet only applies to geom_boxplot
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point() +
geom_boxplot(aes(color=Diet))
```

### Example aesthetics

Which aesthetics are required and which are allowed depend on the geom.

Some example aesthetics:

`x`

: positioning along x-axis`y`

: positioning along y-axis`color`

: color of objects; for 2-d objects, the color of the object’s outline (compare to fill below)`fill`

: fill color of objects`alpha`

: transparency of objects (value between 0, transparent, and 1, opaque – inverse of how many stacked objects it will take to be opaque)`linetype`

: how lines should be drawn (solid, dashed, dotted, etc.)`shape`

: shape of markers in scatter plots`size`

: how large objects appear

### Mapping vs setting

**Map** aesthetics to variables *inside* the `aes`

function. The aesthetic will vary as the variable varies.

```
#color varies as Diet varies
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point(aes(color=Diet))
```

**Set** aesthetics to a constant *outside* the `aes`

function.

```
#map color to constant "green" outside of aes
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point(color="green")
```

Setting an aesthetic to a constant within aes can lead to unexpected results, as the aesthetic is set to a default value, rather than the specified value.

```
#Color set to default red-orange color rather than green
ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point(aes(color="green"))
```

### Layers

Add more layers with the character `+`

.

Layers may include any of the elements of the grammar, such as geoms, stats, faceting, and scales.

Remember that layers inherit the aesthetics from the `ggplot`

function, but they can be respecified.

### Store plot specifications in objects

We can store ggplot specifications in R objects and then reuse the object to keep the code compact. Issuing the object name produces the graph

```
#store specification in object p and then produce graph
p <- ggplot(data = Milk, aes(x=Time, y=protein)) +
geom_point()
p
```

### Adding layers to a plot object renders a graphic

When we add layers to a plot stored in an object, the graphic is rendered, with the layers inheriting aesthetics from the original `ggplot`

.

```
#add a loess plot layer
p + geom_smooth()
```

```
#or panel by Diet
p + facet_wrap(~Diet)
```

### Geoms

Geom functions differ in the geometric shapes produced for the plot:

`geom_bar`

: bars with bases on the x-axis`geom_boxplot`

: boxes-and-whiskers`geom_errorbar`

: T-shaped error bars`geom_histogram`

: histogram`geom_line`

: lines`geom_point`

: points (scatterplot)`geom_ribbon`

: bands spanning y-values across a range of x-values`geom_smooth`

: smoothed conditional means (e.g. loess smooth)

### Geoms and aesthetics

Each geom is defined by aesthetics required for it to be rendered. For example, `geom_point`

requires both `x`

and `y`

, the minimal specification for a scatterplot.

Geoms differ in which aesthetics they understand as arguments. For example, `geom_point`

understands the `shape`

, which defines the shapes of points on the graph. In contrast, `geom_bar`

does not understand `shape`

.

Check the geom function help files for required and understood aesthetics.

### geom functions require only the `x`

aesthetic.

These geoms are often used to plot the distribution of a single, continuous variable.

```
#define data and x for univariate plots
pro <- ggplot(Milk, aes(x=protein))
```

```
#histogram
pro + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

```
#frequency polygon
pro + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

```
#and density plot
pro + geom_density()
```

`geom_bar`

requires only `x`

and summarizes a discrete variable

`geom_bar`

by default produces a bar plot with counts of each x-value.

```
#bar plot of counts of Diet
ggplot(Milk, aes(x=Diet)) +
geom_bar()
```

### Geoms that require more than `x`

Plots of more than one variable are typically more interesting. Below are geoms that map variables to both the x and y aesthetics.

```
#define data, x and y
p2 <- ggplot(Milk, aes(x=Time, y=protein))
```

```
#scatter plot
p2 + geom_point()
```

```
#loess plot
p2 + geom_smooth()
## `geom_smooth()` using method = 'gam'
```

### Adding a grouping variable

For the next two plots below, we additionally specify the group aesthetic to get lines (of protein across Time) by Cow and boxplots by Time point.

```
#lines of protein over Time by group=Cow
p2 + geom_line(aes(group=Cow))
```

```
#boxplots of protein by Time point; stored in p3
p3 <- p2 + geom_boxplot(aes(group=Time))
p3
```

### Show more data variation by mapping more variables to aesthetics

Map more variables to aesthetics to increase data variation in the graph. Let’s see how our protein-by-Time boxplots vary by Diet.

```
#add color to show how groups vary
p3 + facet_wrap(~Diet)
```

### Stats

The stat functions statistically transform data, usually as some form of summary. For example:

- frequencies of values of a variable (histogram, bar graphs)
- a mean
- a confidence limit

Each stat function is associated with a default geom, so no geom is required for shapes to be rendered. The geom can often be respecfied in the stat for different shapes.

`stat_bin`

produces a histogram

`stat_bin`

transforms a continuous variable mapped to `x`

into bins (intervals) of counts. Its default geom is `geom_bar`

, producing a histogram.

```
#stat_bin produces a bar graph by default
ggplot(Milk, aes(x=protein)) +
stat_bin()
```

Respecify the geom to change how the transformation is depicted.

```
#points instead of bars
ggplot(Milk, aes(x=protein)) +
stat_bin(geom="point")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

`stat_summary`

summarizes `y`

at each `x`

`stat_summary`

applies a summary function (e.g. `mean`

) to `y`

at each value (or interval) of `x`

.

The default summary function is `mean_se`

, which returns the mean and mean ± standard error. Thus, `stat_summary`

plots the mean and standard errors for `y`

at each `x`

value.

The default geom is “pointrange”, which places a dot at the mean `y`

and extends lines to mean `y`

± s.e.

Means and standard errors of protein, `y`

, at each Time, `x`

, plotted as “pointrange”.

```
#plot mean and se of protein at each Time
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary()
```

### Using other summary functions and geoms with `stat_summary`

Generally, functions that accept continuous numeric variables (e.g. `mean`

, `var`

, user-written) can be specified in `stat_summary`

, either with argument `fun.data`

or `fun.y`

(see below).

The `ggplot2`

package conveniently has additional summary functions adapted from the `Hmisc`

package, for use with `stat_summary`

, including:

`mean_cl_boot`

: mean and bootstapped confidence interval (default 95%)`mean_cl_normal`

: mean and Guassian (t-distribution based) confidence interval (default 95%)`mean_sdl`

: mean plus or minus standard deviation times some constant (default constant=2)`median_hilow`

: median and outer quantiles (default outer quantiles = 0.025 and 0.975)

It is not necessary to load the `Hmisc`

package into the R session to use these functions, but it is necessary to view their help files.

If the new summary function returns 1 value (e.g. `mean`

, `var`

), use `fun.y`

and a geom that requires only a single `y`

input (e.g. `geom_point`

, `geom_line`

).

```
#median returns 1 value, geom_pointrange expects 2 (high, low)
#change geom to point
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.y="median", geom="point")
```

If the function returns 3 values, such as the mean and 2 limits (e.g. `mean_se`

`mean_cl_normal`

) use `fun.data`

```
#mean and boostrapped confidence limits
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.data="mean_cl_boot")
```

We can even specify a function we ourselves have defined.

```
#we define our own function mean of log
#and change geom to line
meanlog <- function(y) {mean(log(y))}
ggplot(Milk, aes(x=Time, y=protein)) +
stat_summary(fun.y="meanlog", geom="line")
```

### Scales

Scales define the actual mapping of an aesthetic’s values to data values. For example, a color scale maps each of the 3 values of Diet to one of 3 colors that vary in hue. A shape scale could instead map Diet to scatter plot shapes.

The `ggplot2`

package usually provides multiple types of scales for each aesthetic, each customizable by the user.

To demonstrate, we use scales for fill colors to adjust which 3 colors are mapped to the values of Diet in the Milk:

`scale_fill_hue`

: evenly spaced hues, default used for factor variables`scale_fill_brewer`

: sequential, diverging or qualitative colour schemes, originally intended to display factor levels on a map.

The plots are densities of protein for each Diet group.

Here are 4 fill color scales using the above 2 functions.

```
#default is scale_fill_hue, would be the same if omitted
dDiet <- ggplot(Milk, aes(x=protein, fill=Diet)) +
geom_density(alpha=1/3)
dDiet + scale_fill_hue()
```

```
#a different starting hue [0,360] changes all 3 colors
#because of the even hue-spacing
dDiet + scale_fill_hue(h.start=150)
```

```
#a qualitative scale for Diet in scale_fill_brewer
dDiet + scale_fill_brewer(type="qual")
```

```
#a divergent scale in scale_fill_brewer
dDiet + scale_fill_brewer(type="div")
```

### Setting axis limits and labeling scales

Axes visualize the scales for the aesthetics `x`

and `y`

. We commonly need to adjust them so `ggplot2`

provides several convenient functions to label axes and other aesthetics:

`lims`

,`xlim`

,`ylim`

: set axis limits`expand_limits`

: extend limits of scales for various aethetics`xlab`

,`ylab`

,`ggtitle`

,`labs`

: give labels (titles) to x-axis, y-axis, or graph;`labs`

can set labels for all aesthetics and title

### Guides (legends) visualize scales

Guides specify the axes or legends that match the aesthetic values of the scale to values of the variable mapped to the aesthetic.

The `guides`

function sets and removes guides for each scale.

We remove the fill-color-scale guide from the previous plot with `guides`

.

```
dDiet + scale_fill_brewer(type="div") +
guides(fill="none")
```

### Coordinate systems

Coordinate systems define the planes on which objects are positioned in space on the plot. Most plots use Cartesian coordinate systems, as do all the plots in the seminar. Nevertheless, `ggplot2`

provides multiple coordinate systems, including polar, flipped Carteisan and map projections.

### Faceting (paneling)

Split plots into small multiples (panels) with the faceting functions, `facet_wrap`

and `facet_grid`

. The resulting graph shows how each plot varies along the faceting variable(s).

`facet_wrap`

wraps a ribbon of plots into a multirow panel of plots. The number of rows and columns can be specified. Multiple faceting variables can be separated by `+`

.

```
#densities of protein, colored by Diet, faceted by Time
ggplot(Milk, aes(x=protein, color=Diet)) +
geom_density() +
facet_wrap(~Time)
```

`facet_grid`

allows direct specification of which variables are used to split the data/plots along the rows and columns. Put the row-splitting variable before `~`

, and the column-splitting variable after. A `.`

specifies no faceting along that dimension.

```
#split rows by Diet, no splitting along columns
ggplot(Milk, aes(x=Time, y=protein)) +
geom_point() +
facet_grid(Diet~.)
```

### Themes

**Themes** control elements of the graph not related to the data. For example:

- background color
- size of fonts
- gridlines
- color of labels

These data-independent graph elements are known as **theme elements**.

A list of theme elements can be found at http://docs.ggplot2.org/current/theme.html

Each theme element can be conceived of as either a line (e.g. x-axis), a rectangle (e.g. graph background), or text (e.g. axis title). We specify arguments of the basic elements functions to adjust aspects of these theme elements:

`element_line`

– can specify`color`

,`size`

,`linetype`

, etc.`element_rect`

– can specify`fill`

,`color`

,`size`

, etc.`element_text`

– can specify`family`

,`size`

,`color`

, etc.`element_blank`

– removes theme elements from graph

### Adjust theme elements with `theme`

and element functions

Within the `theme`

function, adjust the properties of theme elements by specifying the arguments of the element functions.

For example, the theme element corresponding to the background of the graph is `panel.background`

, and its properties are set in `element_rect`

. We can adjust the background’s `fill`

color, border `color`

, and `size`

in `element_rect`

. All of this will be specified within `theme`

.

Below we change the fill color of the backround to “lightblue”.

```
#light blue background
pt <- ggplot(Milk, aes(x=Time, y=protein)) +
geom_point()
pt + theme(panel.background=element_rect(fill="lightblue"))
```

Now we specify the x-axis title, a text theme element, to be size=20pt and red.

```
#make x-axis title big and red
pt + theme(axis.title.x=element_text(size=20, color="red"))
```

Theme elements can be removed by setting them to element_blank. Let’s remove the background and x-axis title.

```
#remove plot background and x-axis title
pt + theme(panel.background=element_blank(),
axis.title.x=element_blank())
```

### Saving plots to files

`ggsave`

makes saving plots easy. The last plot displayed is saved by default, but we can also save a plot stored to an R object.

`ggsave`

attempts to guess the device to use to save the image from the file extension, so use a meaningful extension. Available devices include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf. The size of the image can be specified as well.

```
#save last displayed plot as pdf
ggsave("plot.pdf")
#save plot stored in "p" as png and resize
ggsave("myplot.png", plot=p, width=7, height=5. units="in")
```

### Upcoming examples

The next 3 sections of the seminar consist of examples designed to show create simple `ggplot2`

graphs at each stage of data analysis. Each section begins with exploratory graphs, then chooses a regression model informed by the exploratory graphs, checks the model assumptions with diagnostic graphs, and ends with the modifying a graph for presentation of model results.

## Example 1: Linear regression

### Description of `ToothGrowth`

dataset

The `ToothGrowth`

dataset loads with the `datasets package`

.

The data are from an experiment examining how vitamin C dosage delivered in 2 different methods predicts tooth growth in guinea pigs. The data consist of 60 observations, representing 60 guinea pigs, and 3 variables:

- len: numeric, tooth (odontoblast, actually) length
- supp: factor, supplement type, 2 levels, “VC” is ascorbic acid, and “OJ” is orange juice
- dose: numeric, dose (mg/day)

A quick look at `ToothGrowth`

reveals that many guinea pigs were given the same dose of vitamin C. Three doses, 0.5, 1.0, and 2.0, were used.

```
#look at ToothGrowth data
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
unique(ToothGrowth$dose)
## [1] 0.5 1.0 2.0
```

### Example 1: Exploratory analysis

Graphical exploration of the dataset provides the researcher with descriptive depictions of summaries of variables or relationship among variables.

We being our exploration of `Toothgrowth`

by examining the distributions of `supp`

and `dose`

. Since both variables are discretely distributed, we want the frequencies (counts) of each value.

Frequencies are often plotted as bars, so we select `geom_bar`

. We specify dose mapped to `x`

and supp mapped to the `fill`

color of the bars, which will give us counts of each dose divided by supp.

### Exploring distributions

```
#bar plot of counts of dose by supp
#data are balanced, so not so interesting
ggplot(ToothGrowth, aes(x=dose, fill=supp)) +
geom_bar()
```

Statistical models often make assumptions about the distribution of the outcome (or its residuals), so an inspection might be wise. First let’s check the overall distribution of “len” with a density plot:

```
#density of len
ggplot(ToothGrowth, aes(x=len)) +
geom_density()
```

We can get densities of len by supp by mapping supp to `color`

:

```
#density of len by supp
ggplot(ToothGrowth, aes(x=len, color=supp)) +
geom_density()
```

The outcome distributions appear a bit skewed, but the samples are small.

### Quick scatterplot of outcome and predictor

We plot the dose-tooth length (len) relationship.

```
#not the best scatterplot
tp <- ggplot(ToothGrowth, aes(x=dose, y=len))
tp + geom_point()
```

### Summarizing the outcome across a predictor

Because dose takes on only 3 values, many point are crowded in 3 columns, obscuring the shape of relationship between dose and len. We replace the column of points at each dose value with its mean and confidence limit using `stat_summary`

instead of `geom_point`

.

```
#mean and cl of len at each dose
tp.1 <- tp + stat_summary(fun.data="mean_cl_normal")
tp.1
```

An additional call to `stat_summary`

with `fun.y=mean`

(`fun.y`

because `mean`

returns one value) and changing the `geom`

to “line” adds a line between means.

```
#add a line plot of means to see dose-len relationship
#maybe not linear
tp.2 <- tp.1 + stat_summary(fun.y="mean", geom="line")
tp.2
```

### Does a third variable moderate the x-y relationship?

Does the dose-len relationship depend on supp? We can specify new global aesthetics in `aes`

.

```
#all plots in tp.2 will now be colored by supp
tp.2 + aes(color=supp)
```

### Interpreting the previous graph

This graph suggests:

- The slope of the dose-response curve decreases as dose increases for both supp types, suggesting a quadratic function.
- The slope of the OJ curve flattens more dramatically, perhaps suggesting the quadratic term is different between supplement groups
- The initial slopes look rather similar, perhaps suggesting that the linear term may not be different between groups
- The 2 supp group means differ at the two lower doses, but not at the highest dose

`ggplot2`

makes graphs summarizing the outcome easy

We just plotted means and confidence limits of len, with lines connecting the means, separated by supp, all without any manipulation to the original data!

The `stat_summary`

function facilitates looking at patterns of means, just as regression models do.

Next we fit our linear regression model and check model assumptions with diagnostic graphs.

### Model preliminaries

We want to model how tooth length (len) is predicted by dose, allowing for moderation of this relationship through an interaction with supp.

We assume that dose and tooth length have a smooth, continuous relationship in the range of doses tested, so we will treat dose as a continuous (numeric) predictor. We also create a dose-squared variable, for use in the quadratic model and prediction later.

```
#create dose-squared variable
ToothGrowth$dosesq <- ToothGrowth$dose^2
```

### Fitting the model

We noticed in the previous graph that the dose-len relationship appears quadratic, that the quadratic effect may differ between supp groups, but that the initial linear slope may not be so different. We fit a model to reflect our expectations:

```
lm2 <- lm(len ~ dose + dosesq*supp, data=ToothGrowth)
summary(lm2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7491667 2.7983895 0.2677135 7.899213e-01
## dose 30.1550000 5.2474684 5.7465806 4.114588e-07
## dosesq -8.7238095 2.0402571 -4.2758383 7.640686e-05
## suppVC -6.4783333 1.3762287 -4.7073088 1.739152e-05
## dosesq:suppVC 1.5876190 0.5770719 2.7511635 8.024694e-03
```

The model appears to conform to our expectations.

### Example 1: Model diagnostics

Statistical inference depends on the assumptions of the regression model, which we check with diagnostic graphs. Common diagnostics for linear regression

- inspect the normality of the residuals
- verify that the residuals show no trends (assumption of linearity) and are homoscedastic,
- check for overly influential outliers.

`fortify`

makes linear regression diagnostics easy

Conveniently, the `fortify`

function takes a lm model object (one among several classes) and returns a dataset with several estimated diagnostic variables including:

`.hat`

: leverages(influence)`.sigma`

: residual standard deviation when observation dropped from model`.cooksd`

: Cook’s distance`.fitted`

: fitted (predicted) values`.resid`

: residuals`.stdresid`

: standardized residuals

We `fortify`

our lm2 object with these diagnostic variables and take a quick look a the new variables.

```
#create dataset with original data and diagnostic variables
lm2diag <- fortify(lm2)
#quick look
head(lm2diag)
## len dose dosesq supp .hat .sigma .cooksd .fitted
## 1 4.2 0.5 0.25 VC 0.08095238 3.623134 0.0165458459 7.564286
## 2 11.5 0.5 0.25 VC 0.08095238 3.611516 0.0226438547 7.564286
## 3 7.3 0.5 0.25 VC 0.08095238 3.654279 0.0001021058 7.564286
## 4 5.8 0.5 0.25 VC 0.08095238 3.645881 0.0045503109 7.564286
## 5 6.4 0.5 0.25 VC 0.08095238 3.650733 0.0019816291 7.564286
## 6 10.0 0.5 0.25 VC 0.08095238 3.638080 0.0086727319 7.564286
## .resid .stdresid
## 1 -3.3642857 -0.96913367
## 2 3.9357143 1.13374237
## 3 -0.2642857 -0.07613152
## 4 -1.7642857 -0.50822934
## 5 -1.1642857 -0.33539021
## 6 2.4357143 0.70164455
```

### Normality of residuals: q-q plot and `stat_qq`

A q-q plot can assess the assumption that the residuals are normally distributed by plotting the standardized residuals (observed z-score) against theoretical quantiles of the normal distribution (expected z-score if normally distributed).

`stat_qq`

creates a qq-plot. The only required aesthetic is `sample`

, which we map to the standardized residual variable created by `fortify`

, `.stdresid`

.

A diagonal reference line (intercept=0, slope=1) is added to the plot with `geom_abline`

, representing perfect fit to a normal distribution. The normal distribution will appear to be a reasonable fit below.

```
#q-q plot of residuals and diagonal reference line
#geom_abline default aesthetics are yintercept=0 and slope=1
ggplot(lm2diag, aes(sample=.stdresid)) +
stat_qq() +
geom_abline()
```

### Linearity and Homoscedasticity: residuals vs fitted

We next assess the assumptions of homoscedescasticity and linear relationships between the outcome and predictors. A residuals vs fitted (predicted value) plot assesses both of these assmuptions.

An even spread of residuals around 0 suggests homoscedasticity, and a zero, flat slope for residuals against fitted suggests linearity of predictor effects.

We build our residuals vs fitted plot like so:

- Start with a scatter plot of
`x=.fitted`

and`y=.stdresid`

. - Add a plot the means and standard errors of the residuals across fitted values using
`stat_summary`

. The standard error bars somewhat address homoskedasticity. - Plot a line representing the mean trend of the residuals with another call to
`stat_summary`

(changing function to`mean`

and geom to`line`

). Normally, we would use`geom_smooth`

to plot a loess curve to visualize the trend among residuals, but loess smooths do not work well when the variable mapped to`x`

is discrete.

The error bars do not appear to vary too much and the line seems sufficiently flat to feel that neither assumption has been violated.

```
#residual vs fitted, means and s.e.
ggplot(lm2diag, aes(x=.fitted, y=.stdresid)) +
geom_point() +
stat_summary() +
stat_summary(fun.y="mean", geom="line")
## No summary function supplied, defaulting to `mean_se()
```

### Identifying influential observations

Strongly influential observations can distort regression coefficients. The most influential datapoints will typically have more extreme predictor values (leverage), measured by `.hat`

, and large residuals. The overall influence of an observation on the model is measured by Cook’s D, variable `.cooksd`

.

The `Toothgrowth`

dataset is fairly balanced across doses and supps (20 at each of 3 doses, 30 at each of 2 supp). Thus, no values are “extreme”, so influential observations for this model will be those with large residuals.

In the following plot:

- map
`.hat`

to`x`

,`.stdresid`

to`y`

- map
`.cooksd`

to`size`

, making more influential points larger. - label the points in
`geom_text`

with their row numbers for identification

The dependence of Cook’s D on both leverage and residual is apparent in the plot, with Cook’s D rising as we move away from center (larger residual) and to the right (larger leverage). No point looks too influential for concern.

```
# in geom_text we SET size to 4 so that size of text is constant,
# and not tied to .cooksd. We also nudge the text .001 (x-axis units)
# to the left, and separate overlapping labels
ggplot(lm2diag, aes(x=.hat, y=.stdresid, size=.cooksd)) +
geom_point() +
geom_text(aes(label=row.names(lm2diag)),
size=4, nudge_x=-0.003, check_overlap=T)
```

### Example 1: Customizing a plot for presentation

We have decided to present our model dose-response curves with confidence intervals for an audience.

Our model (lm2) implies smooth, quadratic dose functions so we predict fitted values at more finely spaced doses than in the original data (0.5,1,2). First we create the new dataset of doses between 0.5 and 2, separated by 0.01 and a corresponding dose-squared variable. Then, we use `predict`

to add fitted values and confidence bands estimates to the new dataset.

```
#create new data set of values from 0.5 to 2 for each supp
newdata <- data.frame(dose=rep(seq(0.5, 2, .01),each=2),
supp=factor(c("OJ", "VC")))
#dose squared variable
newdata$dosesq <- newdata$dose^2
#add predicted values and confidence limits
newdata <- data.frame(newdata, predict(lm2, newdata,
interval="confidence"))
```

Let’s peak at the new variables.

```
#check the new data for plotting
head(newdata)
## dose supp dosesq fit lwr upr
## 1 0.50 OJ 0.2500 13.645714 11.580988 15.710440
## 2 0.50 VC 0.2500 7.564286 5.499560 9.629012
## 3 0.51 OJ 0.2601 13.859154 11.829995 15.888313
## 4 0.51 VC 0.2601 7.793760 5.764601 9.822919
## 5 0.52 OJ 0.2704 14.070849 12.075506 16.066191
## 6 0.52 VC 0.2704 8.021807 6.026465 10.017150
```

### Initial plot: fitted vs dose by supp

First we map dose to `x`

, fit is mapped to `y`

and supp is mapped to `color`

. We begin with a line plot of the fitted curves.

```
#fit curves by supp
p1.0 <- ggplot(newdata, aes(x=dose, y=fit, color=supp)) +
geom_line()
p1.0
```

### Add confidence bands with `geom_ribbon`

`geom_ribbon`

requires the aesthetics `ymax`

and `ymin`

, which we map to upr and lwr, respectively. We specify that `fill`

colors of the ribbons are mapped to supp as well (otherwise they will be filled black).

```
#add confidence bands
p1.1 <- p1.0 + geom_ribbon(aes(ymax=upr, ymin=lwr, fill=supp))
p1.1
```

Oops! THe fill color has obscured everything. We **set** (outside of `aes`

) the alpha level of geom_ribbon to a constant of 1/5.

```
#make confidence bands transparent
p1.2 <- p1.0 + geom_ribbon(aes(ymax=upr, ymin=lwr, fill=supp),
alpha=1/5)
p1.2
```

### Adding original data (in a different dataset) to the graph

As a final data addition to the graph, we decide to overlay a scatter plot of the original data. In `geom_point`

we respecify the data to be `ToothGrowth`

(instead of lm2), and respecify the `y`

aesthetic to len, the observed outcome.

```
#overlay observed data
p1.3 <- p1.2 + geom_point(data=ToothGrowth, aes(y=len))
p1.3
```

### Finishing the graph: adjusting theme and theme elements

Finally, we make a few non-data related changes to the graph.

First we change the overall look of a graph with a built-in theme, `theme_classic`

, which replaces the gray background and white gridlines with a white background, no grid lines, no border on the top or right. This theme mimics the look of base `R`

graphics.

The `ggplot2`

package provides a few built-in themes (see http://docs.ggplot2.org/current/ggtheme.html), which make several changes to the overall background look of the graphic.

`theme_classic`

mimics base `R`

graphics

```
#overlay observed data
p1.4 <- p1.3 + theme_classic()
p1.4
```

Next we change the color of the “OJ” group to a more obvious orange.

`scale_fill_manual`

and `scale_color_manual`

allow color specification through hexadecimal notation (often called color hex codes, # follwed by 6 alphanumeric characters).

On the Color Brewer website, are many ready-to-use color scales with corresponding color hex codes. We find a qualitative scale we like(“Set2”) with an orange, hex code “#fd8d62”, and a complementary green, hex code “#66c2a5”.

We apply these orange-green colors scales to both the `color`

aesthetic (color of points, fit line and the borders of ribbon), and to the `fill`

aesthetic (fill color of the confidence bands)

```
#specify hex codes for orange, #fc8d62, and green #66c2a5
p1.5 <- p1.4 +
scale_color_manual(values=c("#fc8d62", "#66c2a5")) +
scale_fill_manual(values=c("#fc8d62", "#66c2a5"))
p1.5
```

Finally, we use `labs`

to make the axes and guide titles more descriptive. The “\n” is a newline. We also move the guide (legend) to the bottom with theme.

```
#note the default text justifications for the x-axis vs guide
p1.6 <- p1.5 + labs(x="dosage vitamin C\n(mg/day)",
y="tooth length", fill="supplement\ntype",
color ="supplement\ntype") +
theme(legend.position="bottom")
p1.6
```

```
#FULL CODE FOR THE ABOVE GRAPH
ggplot(newdata, aes(x=dose, y=fit, color=supp)) +
geom_line() +
geom_ribbon(aes(ymax=upr, ymin=lwr, fill=supp), alpha=1/5) +
geom_point(data=ToothGrowth, aes(x=dose, y=len)) +
theme_minimal() +
scale_color_manual(values=c("#fc8d62", "#66c2a5")) +
scale_fill_manual(values=c("#fc8d62", "#66c2a5")) +
labs(x="dosage vitamin C\n(mg/day)", y="tooth length",
fill="supplement\ntype",
color ="supplement\ntype") +
theme(legend.position="bottom")
```

### Recap of Example 1

- Use exploratory graphs to characterize the sample and to formulate the functional relationships between outcomes and predictors.
- The stat_summary function greatly simplifies plotting means and other summaries.
- For linear regression models, fortify datasets with diagnostic variables.
- Assess model assumptions with diagnostic plots.
- You can plot from multiple datasets on the same graph!

## Example 2: Logistic regression

### The `esoph`

dataset

The `esoph`

dataset:

- loaded with the
`datasets`

package - comes from a case-control study of the risk of alcohol and tobacco use in predicting esophageal cancer.
- consists of 88 “grouped” observations, which record the number of cases (with cancer) and controls ( without cancer) who fall in the same groupings of age, alcohol use, and tobacco use.

`esoph`

variables

Variables in the dataset:

- agegp: ordered factor, age group, 6 levels (25-34, 35-44, etc.)
- alcgp: ordered factor, alcohol consumption gm/day, 4 levels (0-39, 40-79, etc.)
- tobgp: ordered factor, tobacco consumption gm/day, 4 levels (0-9, 10-19, etc.)
- ncases: numeric, number of cases
- ncontrols: numeric, number of controls

```
#structure of esoph
str(esoph)
## 'data.frame': 88 obs. of 5 variables:
## $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
## $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
## $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
```

```
#head of esoph
head(esoph)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
```

In the first row we see that there were 0 cases and 40 controls in the grouping age 25-34, alcohol consumption 0-39g/day, and tobacco consumption 0-9g/day.

The grouped data structure here is common for case-control designs, uncommon for other designs.

### Example 2: Exploratory analysis

Knowing that the outcome (case vs control) is binary, we expect to conduct a logistic regression, which models how the probability of the outcome (logit-transformed) is predicted by the various risk factors.

We would thus like to plot the frequencies of both cases and controls for each grouping to visualize these probabilities. Unfortunately, those frequencies are in different variables in the esoph dataset, ncases and ncontrols, and we can map only one variable to `y`

.

```
#we want to plot both number of cases and controls for each group
#but they are in different variables "ncases" and "ncontrols"
head(esoph)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
```

The graph below will plot the total ncases for each alcohol group.

In `geom_bar`

, the default is `stat=bin`

, which counts the number of observations that fall into each value of the variable mapped to `x`

(alcgp).

By respecifying `stat="identity"`

, `geom_bar`

will plot the values stored in the variable mapped to `y`

(ncases). `geom_bar`

is smart enough to sum ncases across groups with the same value of alcgp.

In other words, the data are already binned.

How do we add ncontrols to the bar plot of ncases below?

```
#"stat=identity" plots the value of the variable mapped to "y" (ncases) as is
ggplot(esoph, aes(x=alcgp, y=ncases)) +
geom_bar(stat="identity")
```

### Melting ncases and ncontrols into one column

We want ncases and ncontrols in a single column while knowing whether the value came from ncases or ncontrols.

The `melt`

function from the `reshape`

package takes a `data.frame`

and stacks all non-factor variable columns into one variable (called “value”, by default), while creating another variable that indexes the originating column (called “variable”, by default). The factor variables define the groups and are unaltered.

In `esoph`

we have 3 factor variables, agegp, alcgp and tobgp. After `melt`

, the remaining variables ncases and ncontrols will be stacked into one column, creating 2 rows of observations for each crossing of agegp, alcgp and tobgp.

Below we create a melted version of the `esoph`

dataset for exploratory graphing. In the column “value” we see values that came from the columns in `esoph`

, ncases and ncolumns. The originating column is indexed by the column “variable”.

```
#melt esoph
esoph_melt <- melt(esoph)
## Using agegp, alcgp, tobgp as id variables
#sort data before viewing (arrange from package dplyr)
esoph_melt <- arrange(esoph_melt, agegp, alcgp, tobgp)
#we can see the stacking of ncases and ncontrols
head(esoph_melt)
## agegp alcgp tobgp variable value
## 1 25-34 0-39g/day 0-9g/day ncases 0
## 2 25-34 0-39g/day 0-9g/day ncontrols 40
## 3 25-34 0-39g/day 10-19 ncases 0
## 4 25-34 0-39g/day 10-19 ncontrols 10
## 5 25-34 0-39g/day 20-29 ncases 0
## 6 25-34 0-39g/day 20-29 ncontrols 6
```

Let’s rename “variable” to “status”, and rename “value” to “n”, since that is what the new columns represent.

```
#rename variables for more meaning.
colnames(esoph_melt)[4:5]<- c("status", "n")
head(esoph_melt)
## agegp alcgp tobgp status n
## 1 25-34 0-39g/day 0-9g/day ncases 0
## 2 25-34 0-39g/day 0-9g/day ncontrols 40
## 3 25-34 0-39g/day 10-19 ncases 0
## 4 25-34 0-39g/day 10-19 ncontrols 10
## 5 25-34 0-39g/day 20-29 ncases 0
## 6 25-34 0-39g/day 20-29 ncontrols 6
```

Now we have one column, n, which we can map to `y`

, and we can use status as a `fill`

color for the plot.

### Putting ncases and ncontrols on the same graph

We now plot counts of cases and controls across alcoholg groups on the same graph.

We again use `geom_bar`

and `stat=identity`

to plot the values stored in `y=n`

, now with status mapped to `fill`

color of the bars.

Notice that the resulting bars are stacked.

```
#put n for cases and controls on the same bar graph
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity")
```

We have more light drinkers than heavy drinkers in the dataset (overall height of bars). Probability of being a case seems to increase with alcohol consumption (red takes up proportionally more as we move right).

### A brief aside about position adjustments

When 2 graphical objects are to be placed in the same or overlapping positions on the graph, we have several options, called “position adjustments”, for placement of the objects:

`dodge`

: move elements side-by-side,`fill`

: stack elements vertically, standardize heights to 1`identity`

: no adjustment, objects will overlap`jitter`

: add a little random noise to position`stack`

: stack elements vertically

These options are specified in a “position” argument (in a geom or stat). For `geom_bar`

, by default “position=stack” for stacked bars. For `geom_point`

, by default “position=identity” for overlapping points.

The next few graphs demonstrate the utility of position adjustments for bar graphs.

`position=stack`

Stacking let’s us visualize both proportions and counts.

```
#by default, geom_bar uses position="stack"
#looks same as above
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="stack")
```

`position=dodge`

Dodging allows us to compare counts better.

```
#position = dodge move bars to the side
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="dodge")
```

`position=fill`

Filling allows us to compare proportions (probabilities) better.

```
#position = fill stacks and standardizes height to 1
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="fill")
```

`position=identity`

Using identity will obscure one of the groups.

```
#position = identity overlays, obscruing "ncases"
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="identity")
```

`position=jitter`

for scatter plots

Jittering is typically used to separate overlaid continuous values by adding a tiny bit of noise to each value. We revisit the `ToothGrowth`

example from Example 1 to show jittering.

```
#jittering moves points a little away from x=0.5, x=1 and x=2
ggplot(ToothGrowth, aes(x=dose, y=len, fill=supp)) +
geom_point(position="jitter")
```

### End of position adjustment aside

### Graphing observed probabilities

Logistic regression models probabilities, so we will graph proportions of cases vs controls for each grouping of age, alcohol and tobacco.

Using `position="fill"`

in geom_bar stacks overlapping groups and standardizes the height of all bars to 1, creating a visual representation of a proportion or probability.

Let’s look at bar charts of each our grouping variables alone against “filled” bars of numbers of cases and controls.

```
#alcohol group
an <- ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="fill")
an
```

```
#age group instead of alcohol consumption
an + aes(x=agegp)
```

```
#tobacco consumption instead of agegp
an + aes(x=tobgp)
```

The proportion of ncases seems to increase with all 3 predictors, most clearly with alcgp. There agegp relationship appears possibly non-linear.

### Visualizing moderation by adding more grouping variables to the graph

Is the effect of one of these risk factors moderated by another risk factor (interaction)? Separating or grouping the graph by the moderator can help us visualize possible interaction.

We add another grouping variable to the graph through faceting – we will make multiple version of these bar graphs split by a faceting variable.

We will split the bar graph of aclgp along the rows by tobgp using `facet_grid`

.

```
#tobgp to split rows, no splitting along columns
an + facet_grid(tobgp~.)
```

The panels look rather similar – no clear indication of moderation of the alcohol effect by tobacco.

### Faceting by 2 variables

Faceting on two variables can visualize a three-way interaction.

We split the previous graph along the columns by agegp by adding agegp after the `~`

. The code `labeller="label_both"`

adds both the factor name and the value for each facet label.

```
#3-way interaction visualization
an + facet_grid(tobgp~agegp, labeller="label_both")
```

More alcohol, old age, and more tobacco usage all seems to predict being a case – we see more red on the right bars in each plot, as we move down rows and we move right across columns. No clear indication of interactions (but see next slide).

In the highest smoking groups, 30+/day (bottom row), the 2 youngest age groups, 25-34 and 35-44 (2 left columns) had no cases at all, regardless of drinking. Does young age offer some protection against (moderate) the effects of alcohol and tobacco?

Before we break out the cigarettes and whiskey, let’s check our sample sizes in each of these groups.

We change to `position=stack`

to get stacked counts.

```
#stacked counts
ggplot(esoph_melt, aes(x=alcgp, y=n, fill=status)) +
geom_bar(stat="identity", position="stack") +
facet_grid(tobgp~agegp, labeller="label_both")
```

Very few heavy smokers or heavy drinkers were studied in the youngest 2 age groups.

### Example 2: Logistic regression models

Our graphs seem to point to mostly linear effects of age, alcohol and tobacco on probability of esophogeal cancer (case status).

By default, an ordered factored will be entered as polynomial contrasts (linear, quadratic, cubic, etc.) in a regression model. We thus expect mostly linear effects.

All of the linear (.L) effects are significant. Age may have a quadratic effect, but the sample sizes in the highest age group is small, so we are doubtful of this effect.

```
#ordered factors with k levels have k-1 polynomial coefficients
logit1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
data=esoph, family=binomial)
summary(logit1)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.77996650 0.1979579 -8.9916409 2.435665e-19
## agegp.L 3.00533623 0.6521503 4.6083490 4.058789e-06
## agegp.Q -1.33787489 0.5911085 -2.2633323 2.361521e-02
## agegp.C 0.15306634 0.4485356 0.3412580 7.329094e-01
## agegp^4 0.06410003 0.3088056 0.2075741 8.355616e-01
## agegp^5 -0.19363171 0.1953703 -0.9911011 3.216362e-01
## tobgp.L 0.59448332 0.1942245 3.0608053 2.207426e-03
## tobgp.Q 0.06536502 0.1881135 0.3474764 7.282334e-01
## tobgp.C 0.15679378 0.1865788 0.8403623 4.007053e-01
## alcgp.L 1.49185165 0.1993506 7.4835584 7.233674e-14
## alcgp.Q -0.22663253 0.1795222 -1.2624206 2.067975e-01
## alcgp.C 0.25463311 0.1590636 1.6008253 1.094156e-01
```

Our planned model only included linear effects, so we use `as.numeric`

to treat the ordered factors as continuous, which will model just linear effects.

```
#change to numeric to model just linear effects of each
logit2 <- glm(cbind(ncases, ncontrols) ~ as.numeric(agegp) +
as.numeric(alcgp) + as.numeric(tobgp),
data=esoph, family=binomial)
summary(logit2)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.5959444 0.41539708 -13.471314 2.307236e-41
## as.numeric(agegp) 0.5286674 0.07187623 7.355246 1.905751e-13
## as.numeric(alcgp) 0.6938248 0.08341803 8.317444 8.988022e-17
## as.numeric(tobgp) 0.2744565 0.08073921 3.399296 6.755944e-04
```

We are happy with this model and proceed to check its assumptions.

### Example 2: Diagnostic plots

The assumptions of logistic regression are not the same as those in linear regression (no assumptions of normality of residuals or homoscedasticity).

Logistic regression is a special case of the generalized linear model. The generalized linear model assumes linear relationships between the independent variables and the linear predictor (log odds of the outcome for logistic regression).

The plot analagous to the fitted vs residual plot for the generalized linear model is the linear predictor (fitted) vs deviance residual plot. We can get the estimate of the linear predictor (log odds of cancer) from `predict`

. The deviance residuals come from `residuals`

with `type="deviance"`

.

```
#predict for glm returns linear predictor by default
esoph$.xb <- predict(logit2, esoph)
#devaince residuals
esoph$.devr <- residuals(logit2, type="deviance")
head(esoph)
## agegp alcgp tobgp ncases ncontrols .xb .devr
## 1 25-34 0-39g/day 0-9g/day 0 40 -4.098996 -1.1472830
## 2 25-34 0-39g/day 10-19 0 10 -3.824539 -0.6571706
## 3 25-34 0-39g/day 20-29 0 6 -3.550083 -0.5829324
## 4 25-34 0-39g/day 30+ 0 5 -3.275626 -0.6090692
## 5 25-34 40-79 0-9g/day 0 27 -3.405171 -1.3280595
## 6 25-34 40-79 10-19 0 7 -3.130714 -0.7737113
```

### linear predictor vs deviance residual plot

With these variables we create a scatter plot and loess smooth. We would want the smooth to be flatter – its shape suggests we missed a quadratic effect.

```
#linear predictor vs deviance residual
ggplot(esoph, aes(x=.xb, y=.devr)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
```

Our first logistic regression model suggested a quadratic effect of agegp. Coloring the previous graph by agegp (and sizing by alcgp) shows a quadratic relationship between the residual and age (notice the progression of colors along the inverted-U)

```
#sized by alcohol and colored by age
ggplot(esoph, aes(x=.xb, y=.devr,
size=alcgp, color=agegp)) +
geom_point()
```

### Plots to assess influence

Cook’s D vs deviance residual to assess influence. Maybe check that one outlier.

```
#cook's d vs deviance residual
esoph$.cooksd <- cooks.distance(logit2)
ggplot(esoph, aes(x=.devr, y=.cooksd, color=agegp, size=alcgp)) +
geom_point() +
geom_text(aes(label=row.names(esoph)),
size=4, nudge_x=-0.15, color="black", check_overlap=T)
```

No one alcohol or age group seems to have systematically large influence on the model.

### Example 2: Customizing a plot for presentation

Our planned graph for presentation will show how the model predicted probabilities of esophogeal cancer are with all 3 risk factors, age, alcohol and tobacco. Imagine we are constrained to colorless images, so we cannot use `color`

or `fill`

.

Predicted probabilities will be mapped to `y`

, and we map tobgp to `x`

. Faceting by the other 2 risk factors would produce too many plots to fit in a journal article figure, so we will facet only by one variable, agegp. We map alcgp to `alpha`

, to create a grayscale in place of a color scale.

First we need predicted probabilities, which we obtain with `predict`

and `type="response"`

(remember default is linear predictor, log odds).

`esoph$.pred <- predict(logit2, esoph, type="response")`

### Starting graph

```
#starting graph
p2.0 <- ggplot(esoph, aes(x=tobgp, y=.pred,
alpha=alcgp, group=alcgp)) +
geom_point(size=3) +
facet_grid(~agegp)
p2.0
```

### Removing background color

Let’s remove the background gray (`panel.background`

) in the plotting area to make the lighter points easier to see.

However, the predicted probabilities need some guide, so we add back some panel grid lines on the y-axis (`panel.grid.major.y`

).

We also remove the gray in the facet headings (`strip.background`

) to clean up the graph.

All of these changes are made within `theme`

. Remember to use `element_blank`

to remove a theme_element.

```
#in theme
# set panel.background to blank to remove it completely
p2.1 <- p2.0 +
theme(panel.background=element_blank(),
strip.background=element_blank(),
panel.grid.major.y=element_line(color="gray90"))
p2.1
```

### Labeling guides and changing fonts

Next we fix the guide (legend) titles with `labs`

. Notice how we give a title to the guide by specifying the aesthetic (`x`

, `alpha`

, etc.).

We also standardize all the font sizes and families in `theme`

by setting various theme elements to `element_text`

. The x-axis tick labels (`axis.text.x`

) are rotated for clarity.

```
#Set titles with labs
#set fonts and sizes with theme
p2.2 <- p2.1 +
labs(title="age group",
x="tobacco use(g/day)",
y = "predicted probability of esophogeal cancer",
alpha="alcohol\nuse(g/day)") +
theme(plot.title=element_text(family="serif", size=14),
axis.title=element_text(family="serif", size=14),
legend.title=element_text(family="serif", size=14),
strip.text=element_text(family="serif", size=10),
legend.text=element_text(family="serif", size=10),
axis.text.x=element_text(family="serif", size=8, angle=90))
p2.2
```

Finally, we remove the unnecessary tick marks, ,fix the tick label “0-9g/day” to “0-9” and guide label for “0-39g/day” to “0-39”, and move the guide to the bottom. Notice that changing the labels for guides is done in a scale function, and that we chose “discrete” versions of the functions.

```
#Set titles with labs
#set fonts and sizes with theme
p2.3 <- p2.2 +
theme(legend.position="bottom",
axis.ticks=element_blank()) +
scale_x_discrete(labels=c("0-9", "10-19", "20-29", "30+")) +
scale_alpha_discrete(labels=c("0-39", "40-79", "80-119", "120+"))
p2.3
```

```
#FULL CODE FOR ABOVE PLOT
ggplot(esoph, aes(x=tobgp, y=.pred, alpha=alcgp, group=alcgp)) +
geom_point(size=3) +
facet_grid(~agegp) +
labs(title="age group",
x="tobacco use(g/day)",
y = "predicted probability of esophogeal cancer",
alpha="alcohol\nuse(g/day)") +
theme(panel.background = element_blank(),
strip.background=element_blank(),
panel.grid.major.y = element_line(color="gray90"),
plot.title=element_text(family="serif", size=14),
axis.title=element_text(family="serif", size=14),
legend.title=element_text(family="serif", size=14),
strip.text=element_text(family="serif", size=10),
legend.text=element_text(family="serif", size=10),
axis.text.x=element_text(family="serif", size=8, angle=90),
legend.position="bottom",
axis.ticks=element_blank()) +
scale_x_discrete(labels=c("0-9", "10-19", "20-29", "30+")) +
scale_alpha_discrete(labels=c("0-39", "40-79", "80-119", "120+"))
```

## Example 3

### The sleepstudy dataset

The sleepstudy dataset:

- loaded with the
`lme4`

package - comes from a study looking at the effects of successive nights of sleep deprivation on reaction times
- constists of 180 observations of 18 Subjects

Three variables in `sleepstudy`

:

`Reaction`

: numeric, reaction time (ms)`Days`

: numeric, number of days of sleep deprivation, 0 to 9`Subject`

: factor, Subject id

```
#quick look at data structure
str(sleepstudy)
## 'data.frame': 180 obs. of 3 variables:
## $ Reaction: num 250 259 251 321 357 ...
## $ Days : num 0 1 2 3 4 5 6 7 8 9 ...
## $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
```

### Example 3: Analysis assuming independence of observations (simple linear regression)

We first graph the data without any regard to repeated measurements of subjects and analyze the data in a simple linear regression model. This analysis is contrasted later with graphs by subject and mixed effects (multilevel) models.

We hope this will highlight the dangers of ignoring clustering in the data as well as outline the usefulness of graphs by cluster and mixed models.

First, we plot the expected mean functional relationship between Days and Reaction with a scatter plot and a loess smooth. We map Days to `x`

and Reaction to `y`

and choose `geom_point`

and `geom_smooth`

, which by default produces a loess smooth.

The relationship appears linear.

```
#mean relationship
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
```

We next regress Reaction on Days and `fortify`

dataset of diagnostic variables.

```
#linear model
lm1 <- lm(Reaction ~ Days, sleepstudy)
summary(lm1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 251.40510 6.610154 38.033169 2.156888e-87
## Days 10.46729 1.238195 8.453663 9.894096e-15
#fortify dataset with diagnostics
lm1_diag <- fortify(lm1)
head(lm1_diag)
## Reaction Days .hat .sigma .cooksd .fitted .resid
## 1 249.5600 0 0.019191919 47.84911 1.491618e-05 251.4051 -1.845105
## 2 258.7047 1 0.013804714 47.84872 3.127897e-05 261.8724 -3.167691
## 3 250.8006 2 0.009764310 47.82165 1.014574e-03 272.3397 -21.539077
## 4 321.4398 3 0.007070707 47.76050 2.350740e-03 282.8070 38.632837
## 5 356.8519 4 0.005723906 47.60871 5.139877e-03 293.2742 63.577651
## 6 414.6901 5 0.005723906 47.11275 1.565262e-02 303.7415 110.948565
## .stdresid
## 1 -0.03904601
## 2 -0.06685116
## 3 -0.45363381
## 4 0.81254059
## 5 1.33628349
## 6 2.33193164
```

A qq-plot to assess normality of residuals. Remember to use `sample`

instead of `x`

. Maybe some issues in the tails

```
#qqplot of residuals vs normal
ggplot(lm1_diag, aes(sample=.stdresid)) +
stat_qq()
```

Next we plot fitted values vs standardized residuals to assess homodescasticity and linearity, and a loess curve to check for trend. No obvious trend appears.

```
#fitted vs standardized residuals with loess
ggplot(lm1_diag, aes(x=.fitted, y=.stdresid)) +
geom_point() +
geom_smooth()
```

### Independence of observations – residuals by subject

Linear regression models assume independence of the observations.

Repeated measurements by subject almost certainly violates this assumption.

A plot of residuals by Subject, with means and standard errors, will assess this assumption. In the plot below we see evidence of clustering by Subject.

Subjects appear to have different mean levels of Reaction (i.e. different intercepts).

```
#SET color and size to make dots stand out more
lm1_diag$Subject <- sleepstudy$Subject
ggplot(lm1_diag, aes(x=Subject, y=.stdresid)) +
geom_point() +
stat_summary(color="red", size=1)
```

### Slopes by subject

Faced with evidence that Subjects have different baseline levels of Reaction, we further suspect that not all subjects react to sleep deprivation in the same way. In other words, we believe there is heterogeneity in the Days effect on Reaction among subjects (i.e. we want random slopes for Days).

If the mean slope described all Subjects well, then we would see no systematic trend in the residuals across Subjects. A plot of loess smooths of fitted value vs residuals by Subject confirms our suspicion.

The plot below shows considerable trend variation between Subjects.

```
#fit-residual slopes by Subject -- not all flat!
ggplot(lm1_diag, aes(x=.fitted, y=.stdresid,
color=Subject)) +
geom_point() +
geom_smooth(span=1.5, se=F)
```

Knowing we have considerable heterogeneity in effects by Subject, we create graphs by Subject and apply mixed effects regression models to our data in the next section.

### Example 3: Exploratory graphs by subject

We repeat the initial scatter plot of Days vs Reaction with loess smooth, but now facet by Subject. We can assess the linearity of the relationship across Subjects.

```
#Days-Reaction faceted by Subject
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
geom_point() +
geom_smooth() +
facet_wrap(~Subject)
```

Curves are too jagged so we increase `span`

to 1 (default .75) in `geom_smooth`

.

```
#Days-Reaction faceted by Subject
ggplot(sleepstudy, aes(x=Days, y=Reaction)) +
geom_point() +
geom_smooth(span=1.5) +
facet_wrap(~Subject)
```

Some smooths appear mostly linear, some more curved. A couple are flat, though the majority appear to show increasing Reaction with increasing Days of sleep deprivation.

Placing all the smooths colored by Subject on one graph allows us to see the variation in the intercepts better. We can also assess correlation between the intercept and the slope, with a widening fan shape left-to-right indicating a positive correlation, and a narrowing wedge shape indicating negative correlation.

```
#all smooths one graph, se=F suppresses confidence bands
#no obvious correlation between intercept and slope
ggplot(sleepstudy, aes(x=Days, y=Reaction,
color=Subject)) +
geom_point() +
geom_smooth(span=1.5, se=F)
```

### Example 3: Mixed effects models

Our graphs suggest a linear effect of Days, with Subject-level heterogeneity of intercepts and effect of Days.

We estimate a model with a fixed effect of Days, and random intercepts and coefficients for Days by Subject. Our graph suggested that the random effects were uncorrelated so we use the “||” notation to specify this.

```
#random intercepts and coefficients for Days, uncorrelated
lmer2 <- lmer(Reaction ~ Days + (1+Days||Subject),
data=sleepstudy)
summary(lmer2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Reaction ~ Days + ((1 | Subject) + (0 + Days | Subject))
## Data: sleepstudy
##
## REML criterion at convergence: 1743.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9626 -0.4625 0.0204 0.4653 5.1860
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 627.57 25.051
## Subject.1 Days 35.86 5.988
## Residual 653.58 25.565
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.885 36.51
## Days 10.467 1.560 6.71
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.184
```

### Example 3: Model diagnostics

Mixed models assume (at least):

- normality of residuals and random effects
- linearity of predictor effects
- homoscedasticity
- independence of observations across the clustering variable (Subject)
- independence of the random effects and fixed effects
- correct specification of the model.

We should also check for overly influential observations.

Conveniently, the ggplot2 package inlcudes `fortify`

functions for `lmer`

model (`.merMod`

) objects. Returned in `fortify.merMod`

datasets are:

- .fitted: fitted values
- .resid: raw residuals
- .scresid: standardized (scaled Pearson) residuals.

Many of the plots below can be made (more easily) with calls to `plot`

and `qqmath`

on the lmer2 `.merMod`

object (e.g `plot(lmer2)`

).

First we fortify the dasaset with diagnostic variables.

```
#fortify dataset, works with lmer model objects
lmer2_diag <- fortify(lmer2)
head(lmer2_diag)
## Reaction Days Subject .fitted .resid .scresid
## 1 249.5600 0 308 252.9178 -3.357801 -0.1313422
## 2 258.7047 1 308 272.7086 -14.003876 -0.5477692
## 3 250.8006 2 308 292.4994 -41.698751 -1.6310693
## 4 321.4398 3 308 312.2901 9.149673 0.3578944
## 5 356.8519 4 308 332.0809 24.770998 0.9689310
## 6 414.6901 5 308 351.8717 62.818423 2.4571767
```

### Normality of residuals

We first inspect the normality of the standardized residuals from the model. The residuals look better than in the simple linear regression model.

```
ggplot(lmer2_diag, aes(sample=.scresid)) +
stat_qq()
```

### Normality of random effects

The `ranef`

function extracts the random effects for each Subject from the model. Within the list output of `ranef`

is a `data.frame`

, named `Subject`

, containing the random effects. We rename the column Intercept for easier use.

```
#extract random effects
re <- ranef(lmer2)
#output is a list object, with data.frame of random effects
# called Subject
str(re)
## List of 1
## $ Subject:'data.frame': 18 obs. of 2 variables:
## ..$ (Intercept): num [1:18] 1.51 -40.37 -39.18 24.52 22.91 ...
## ..$ Days : num [1:18] 9.32 -8.6 -5.39 -4.97 -3.19 ...
## - attr(*, "class")= chr "ranef.mer"
#rename column 1 of random effects data.frame
names(re$Subject)[1] <- "Intercept"
#check normality of random effects
#same as lme4 method -- qqmath(ranef(lmer2))
```

### Distribution of random intercepts

Not terrible looking for sample size 18

```
ggplot(re$Subject, aes(sample=Intercept)) +
stat_qq()
```

### Distribution of random slopes

Not bad looking

```
ggplot(re$Subject, aes(sample=Days)) +
stat_qq()
```

### Fitted vs scaled residuals

We can also plot the fitted vs scaled residuals to assess homoscedasticity and linearity. Neither assumption appears grossly vioalted.

```
#fitted vs residuals; same as plot(lmer2)
ggplot(lmer2_diag, aes(x=.fitted, y=.scresid)) +
geom_point() +
geom_smooth()
```

### Residuals by subject again

We revisit this diagnostic plot to assess if we controlled for variation in the intercept by Subject. Subject mean residuals look very similar.

```
#std residuals by Subject - most means near 0
ggplot(lmer2_diag, aes(x=Subject, y=.scresid)) +
geom_point() +
stat_summary(color="red")
```

Let’s compare the residuals of the simple linear regression model to those from the mixed model. We plot raw residual means and standard errors rather than standardized residuals to emphasize differences in the variances of the residuals.

```
#fitted vs residuals
#means and standard errors of residuals by Subject from
# simple linear (blue) and mixed models (red)
# alpha set to 1/2 and size to 2 for better comparison
ggplot(lmer2_diag, aes(x=Subject, y=.resid)) +
stat_summary(color="red",
alpha=1/2, size=2) +
stat_summary(data=lm1_diag, color="blue",
alpha=1/2, size=2)
```

Here we see that the mixed model (red) has addressed Subject heterogeneity much better than the simple linear regression model.

```
## No summary function supplied, defaulting to `mean_se()
## No summary function supplied, defaulting to `mean_se()
```

### Influence

Finally, we replot fit vs residual, with points replaced by `geom_text`

, `label`

mapped to Subject and `color`

mapped to Days. Subject 332 has 2 large residuals – perhaps we should inspect her data.

```
#inspect points with large residuals by Subject and Days
ggplot(lmer2_diag, aes(x=.fitted, y=.scresid, color=Days)) +
geom_text(aes(label=Subject))
```

### Example 3: Customizing a plot for presentation

For our central figure we would like to display the following:

- mean effect of Days on Reaction
- variability in the effect of Days on Reaction across Subjects
- fit of the model to observed data for each Subject

We have 18 subjects, a small enough number that we can fit all of our data in a graph faceted by Subject. We will facet into 3 rows of graphs.

We will use our “lmer2_diag” diagnostics dataset because we need the fitted values.

First, a scatterplot of observed `x=Days`

vs `y=Reaction`

, faceted by Subject, we use `geom_point`

. For the fitted line for each subject as well, we use `geom_line`

, but remap `y`

to `.fitted.`

```
#observed and fitted Days-Reaction by Subject
#increase size of line for visibility
p3.0 <- ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~Subject, nrow=3)
p3.0
```

We now want to add the mean intercept and mean Days-Reaction slope for reference to each facet. Those are represented by the fixed intercept and Day coefficient, respectively. We extract the coefficients from the model using fixef.

```
fixef(lmer2)
## (Intercept) Days
## 251.40510 10.46729
```

An easy solution to add the reference mean slope is to use `geom_abline`

, which understands the aesthetics `intercept`

and `slope`

. We want the reference slope underneath the points and fit line, so we add it first to the plot.

```
#give them easy names
int <- fixef(lmer2)[1]
slope <- fixef(lmer2)[2]
#reference line goes underneath, so specify geom_abline first
p3.1 <- ggplot(lmer2_diag, aes(x=Days,
y=Reaction)) +
geom_abline(intercept=int, slope=slope,
color="red", size=.75) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~Subject, nrow=3)
p3.1
```

While not a bad solution, the line extends beyond the limits of Days in the study, 0 and 9, which we don’t like. We cannot specify limits in `geom_abline`

, but we can in `geom_segment`

, which understands the aesthetics `x`

, `xend`

, `y`

, and `yend`

.

To extend the line from Days 0 to 9, **set** (outside aes) `x`

to 0, and `xend`

to 9. The first y-value at Days=0 is the intercept, so we set `y`

to “int”, and the last y-value we use the formula for a line (y = intercept + slope*x), and set `yend`

to (int + slope*9).

```
#use geom_segment to specify limits of line
#x goes from 0 to 9
#y goes from intercept to (intercept + slope*9)
p3.2 <- ggplot(lmer2_diag, aes(x=Days,
y=Reaction)) +
geom_segment(x=0, xend=9,
y=int, yend=(int+slope*9),
color="red", size=.75) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~Subject, nrow=3)
p3.2
```

That looks better.

### Sorting the facets by the random slopes

At the top of each facet, we see that the facets have been ordered by Subject id number. That numbering has no intrinsic meaning. Sorting the facets instead by the magnitude of the estimated slope for each Subject makes grasping the variation among slopes easier.

We first estimate each Subject’s slope, which is the sum of the fixed effect coefficient for Days and the estimate of the random effect of Days (BLUP) for that subject. We already have the fixed effect of Days, so we use `ranef`

to obtain a `data.frame`

of the random effects by Subject.

We extract random effects as “re”, rename the slope column to “slope”, create a Subject column, and then merge it with our diagnostic dataset.

```
#look at random effects
head(ranef(lmer2))
## $Subject
## (Intercept) Days
## 308 1.5126960 9.3234894
## 309 -40.3738983 -8.5991692
## 310 -39.1810427 -5.3877905
## 330 24.5189056 -4.9686458
## 331 22.9144343 -3.1939349
## 332 9.2219741 -0.3084936
## 333 17.1561219 -0.2872074
## 334 -7.4517338 1.1159901
## 335 0.5787256 -10.9059664
## 337 34.7679291 8.6276161
## 349 -25.7543247 1.2806878
## 350 -13.8650359 6.7564006
## 351 4.9159804 -3.0751330
## 352 20.9290432 3.5122096
## 369 3.2586475 0.8730507
## 370 -26.4758277 4.9837865
## 371 0.9056476 -1.0052929
## 372 12.4217579 1.2584028
```

–

```
#save data.frame of random effects
re <- ranef(lmer2)$Subject
#rename column for use with ggplot
colnames(re)[2] <- "rand_slope"
#create Subject column out of rownames, make it a factor
re$Subject <- factor(rownames(re))
#merge our random effects into our diagnostic dataset by Subject
lmer2_diag <- merge(lmer2_diag, re, by="Subject")
#check that merge worked
str(lmer2_diag)
## 'data.frame': 180 obs. of 8 variables:
## $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Reaction : num 250 259 251 321 357 ...
## $ Days : num 0 1 2 3 4 5 6 7 8 9 ...
## $ .fitted : num 253 273 292 312 332 ...
## $ .resid : num -3.36 -14 -41.7 9.15 24.77 ...
## $ .scresid : num -0.131 -0.548 -1.631 0.358 0.969 ...
## $ (Intercept): num 1.51 1.51 1.51 1.51 1.51 ...
## $ rand_slope : num 9.32 9.32 9.32 9.32 9.32 ...
```

We now add the fixed effect of Days (stored in “slope”) to each “rand_slope” to get the total slope for each subject. Then we sort the dataset by slope, and then by Day (using `arrange`

) to ensure everything is ordered properly.

```
#total slope is rand_slope + slope, the fixed effect of Days from above
slope <- fixef(lmer2)[2]
lmer2_diag$rand_slope <- lmer2_diag$rand_slope + slope
#sort by slope and then Days
lmer2_diag <- arrange(lmer2_diag, rand_slope, Days)
unique(lmer2_diag$rand_slope)
## [1] -0.4386804 1.8681168 5.0794955 5.4986401 7.2733511 7.3921530
## [7] 9.4619930 10.1587924 10.1800786 11.3403366 11.5832760 11.7256888
## [13] 11.7479738 13.9794956 15.4510725 17.2236866 19.0949021 19.7907753
```

In place of subject number we will place the random slope estimate in the facet label. However, in the `unique`

output above, we notice quite a few decimal places, too many for facet titles. The `format`

function trims the number of digits in rand_slope.

```
#format rand_slope to have 2 digits after decimal
lmer2_diag$rand_slope <- format(lmer2_diag$rand_slope, digits=2)
unique(lmer2_diag$rand_slope)
## [1] "-0.44" " 1.87" " 5.08" " 5.50" " 7.27" " 7.39" " 9.46" "10.16"
## [9] "10.18" "11.34" "11.58" "11.73" "11.75" "13.98" "15.45" "17.22"
## [17] "19.09" "19.79"
```

We can now facet by rand_slope.

```
#facet by rand_slope
p3.3 <- ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_segment(x=0, xend=9, y=int, yend=(int+slope*9),
color="red", size=.75) +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~rand_slope, nrow=3)
p3.3
```

### Adjusting theme elements

The variation in the slopes is more apparent with the sorting. We are happy with the data-related aspects of the plot, so will now be adding to “p3.3”.

We address the non-data aspects of the graph now.

### Relabeling ticks

Let’s relabel the ticks on the x-axis to (0,3,6,9), as data were collected on whole days. Use `scale_x_continuous`

since days numeric (continuous). The `breaks`

argument controls ticks.

```
#reset x-axis ticks with scale_x_continuous
p3.4 <- p3.3 +
scale_x_continuous(breaks=c(0,3,6,9))
p3.4
```

### Facet titles and fonts

Next, we use `labs`

to add a title at the top and axis titles. We use `theme`

to set the font family and font size of all text on the graph, including all titles, and tick labels.

```
#labs creates the title
#theme sets size and font family of title, axis title,
#facet titles, and tick labels
p3.5 <- p3.4 +
labs(title="increase in reaction time (ms)
per day of deprivation",
x="days of deprivation", y="reaction time (ms)") +
theme(plot.title=element_text(family="serif", size=10),
axis.title=element_text(family="serif", size=10),
strip.text.x=element_text(family="serif", size=8),
axis.text=element_text(family="serif", size=8))
p3.5
```

### Overall look: Look 1

Finally, we present two changes to the overall look of the graph. These will not build upon “p3.5” because we must respecify aesthetics within some of the geoms.

For the first look, we use a built-in theme `theme_dark`

to give the graph a nighttime feel (it’s a sleep study!). We change the colors of the reference line to black, and the observed datapoints and estimated slopes to white. NOTE: In the R code, we use `base_family`

to change all fonts used in `theme_dark`

to “serif”.

```
#use theme_dark for a nighttime theme
#place theme font specifications after theme_dark to ensure
# that they are obeyed
ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_segment(x=0, xend=9,
y=int, yend=(int+slope*9),
color="black", size=.75) +
geom_point(color="white") +
geom_line(aes(x=Days, y=.fitted),
size=.75, color="white") +
facet_wrap(~rand_slope, nrow=3) +
scale_x_continuous(breaks=c(0,3,6,9)) +
labs(title="increase in reaction time (ms) per day of deprivation",
x="days of deprivation", y="reaction time (ms)") +
theme_dark(base_family="serif") +
theme(plot.title=element_text(size=14),
axis.title=element_text(size=14),
strip.text.x=element_text(size=10),
axis.text=element_text(size=10))
```

### Overall look: Look 2

We also present a graph using ink/color more minimally.

Below, we use `theme`

to change the backgrounds color of both the graphing area (`panel.background`

) and the strip heading (`strip.background`

).

The value NA sets a property to blank.

We also specify the color of the grid lines (`panel.grid.major`

and `panel.grid.minor`

), which are normally set to “white”, to make the axes appear notched. The various gray colors used become lighter as the number (from 0 to 100) following increases.

```
#specify background and grid elements using theme
#grays become lighter as number increases
ggplot(lmer2_diag, aes(x=Days, y=Reaction)) +
geom_segment(x=0, xend=9, y=int, yend=(int+slope*9),
linetype=2, size=.75, color="gray60") +
geom_point() +
geom_line(aes(x=Days, y=.fitted), size=.75) +
facet_wrap(~rand_slope, nrow=3) +
scale_x_continuous(breaks=c(0,3,6,9)) +
labs(title="increase in reaction time (ms) per day of deprivation",
x="days of deprivation", y="reaction time (ms)") +
theme(plot.title=element_text(family="serif", size=14),
axis.title=element_text(family="serif", size=14),
strip.text.x=element_text(family="serif", size=10),
axis.text=element_text(family="serif", size=10),
panel.background=element_rect(fill=NA, color="gray60"),
strip.background=element_rect(fill="gray90", color="gray60"),
panel.grid.major=element_line(color=NA),
panel.grid.minor=element_line(color=NA))
```

### Conclusion

- We hope this seminar has shown you how to use ggplot2 at each stage of data analysis.
- We provided code for both simple and more complex graphs to demonstrate that
`ggplot2`

is appropriate for use by both users new to R and statistical graphing and by experienced users wishing to make beautiful, illustrative graphs for presentation. - Most importantly, we hope we have conveyed the underlying grammar clearly enough to inspire you to create new graphs to display data variation in novel ways.

### References

- Books
- Wickham, H. (2009). ggplot2 Elegant Graphics for Data Analysis. Springer: New York.
- Wilkinson, L. (2005). The Grammar of Graphics. Springer: New York.
- Tufte, E. R. (2001). The Visual Display of Quantitative Information. Graphics Press: Cheshire. (For guidance on making clear and informative statistical graphs)

- Websites
- docs.ggplot2.org/current/ : all help files for ggplot2 functions

// add bootstrap table styles to pandoc tables function bootstrapStylePandocTables() { $('tr.header').parent('thead').parent('table').addClass('table table-condensed'); } $(document).ready(function () { bootstrapStylePandocTables(); });