R Graphics: Introduction to ggplot2

Background

Purpose of this seminar

This seminar introduces how to use the R ggplot2 package, particularly for producing statistical graphics for data analysis.

Text in this font signifies R code or variables in a data set

Text that appears like this represents an instruction to practice ggplot2 coding

Seminar packages

Next we load the packages into the current R session with library(). In addition to ggplot2, we load package MASS (installed with R) for data sets.

#load libraries into R session
library(ggplot2)
library(MASS)

Please use library() to load packages ggplot2 and MASS.

The ggplot2 package

ggplot2 documentation

https://ggplot2.tidyverse.org/reference/

The official reference webpage for ggplot2 has help files for its many functions and operators. Many examples are provided in each help file.

The grammar of graphics

What is a grammar of graphics?

A grammar of a language defines the rules of structuring words and phrases into meaningful expressions.

A grammar of graphics defines the rules of structuring mathematical and aesthetic elements into a meaningful graph.

Leland Wilkinson (2005) designed the grammar upon which ggplot2 is based.

Elements of grammar of graphics

  1. Data: variables mapped to aesthetic features of the graph.
  2. Geoms: objects/shapes on the graph.
  3. Stats: statistical transformations that summarize data (e.g., mean, confidence intervals).
  4. Scales: mappings of aesthetic values to data values. Legends and axes visualize scales.
  5. Coordinate systems: the plane on which data are mapped on the graphic.
  6. Faceting: splitting the data into subsets to create multiple variations of the same graph (paneling).

The Sitka dataset

To practice using the grammar of graphics, we will use the Sitka dataset (from the MASS package).

Note: Data sets that are loaded into R with a package are immediately available for use. To see the object appear in RStudio’s Environment pane (so you can click to view it), run data() on the data set, and then another function like str() on the data set.

Use data() and then str() on Sitka to make it appear in the Environment pane.
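As a minimal sketch of the note above, the following loads Sitka into the workspace and inspects it. It assumes only that the MASS package is installed; the call to dim() is an extra check added here for illustration.

```r
library(MASS)

# copy Sitka from the MASS package into the global environment
data(Sitka)

# display the structure: column names, types, and the first few values
str(Sitka)

# number of rows and columns
dim(Sitka)
```

After running str(), Sitka should be listed in the Environment pane as a data frame rather than as a promise.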

The Sitka dataset describes the growth of trees over time, some of which were grown in ozone-enriched chambers. The data frame contains 395 rows of the following 4 columns:

Here are the first few rows of Sitka:

size  Time  tree  treat
4.51  152   1     ozone
4.98  174   1     ozone
5.41  201   1     ozone
5.90  227   1     ozone
6.15  258   1     ozone
4.24  152   2     ozone

The ggplot() function and aesthetics

All graphics begin with a call to the ggplot() function (note: not ggplot2, which is the name of the package).

In the ggplot() function we specify the data set that holds the variables we will be mapping to aesthetics, the visual properties of the graph. The data set must be a data.frame object.

Example syntax for ggplot() specification (data, xvar, and yvar are placeholders to be filled in by you):


ggplot(data, aes(x=xvar, y=yvar))


Notice that the aesthetics are specified inside aes(), which is itself nested inside of ggplot().

The aesthetics specified inside of ggplot() are inherited by subsequent layers:

# scatter plot of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point() 

geom_point() inherits x and y aesthetics

Initiate a graph of Time vs size by mapping Time to x and size to y from the data set Sitka.

Without any additional layers, no data will be plotted.

Layers and overriding aesthetics

Specifying x and y aesthetics alone will produce a plot with just the 2 axes.

ggplot(data = txhousing, aes(x=volume, y=sales))

without a geom or stat, just axes

We add layers to the graph with the + character. Each layer adds a graphical component.

Layers consist of geoms, stats, scales, and themes, which we will discuss in detail.

Remember that each subsequent layer inherits its aesthetics from ggplot(). However, specifying new aesthetics in a layer will override the aesthetics specified in ggplot().

# scatter plot of volume vs sales
#  with rug plot colored by median sale price
ggplot(txhousing, aes(x=volume, y=sales)) +     # x=volume and y=sales inherited by all layers  
  geom_point() +
  geom_rug(aes(color=median))   # color will only apply to the rug plot because not specified in ggplot()

both geoms inherit aesthetics from ggplot(), but geom_rug() also adds color aesthetic

Add a geom_point() layer to the Sitka graph we just initiated.

Add an additional geom_smooth() layer to the graph.

Both geom layers inherit x and y aesthetics from ggplot().

Specify aes(color=treat) inside of geom_point().

Notice that the coloring only applies to geom_point().

Aesthetics

Aesthetics are the visual properties of objects on the graph.

Which aesthetics are required and which are allowed vary by geom.

Commonly used aesthetics:

Change the aesthetic color mapped to treat in our previous graph to shape.

Mapping vs setting

Map aesthetics to variables inside the aes() function. By mapping, we mean the aesthetic will vary as the variable varies. For example, mapping x=time causes the position of the plotted data to vary with values of variable “time”. Similarly, mapping color=group causes the color of objects to vary with values of variable “group”.

# mapping color to median inside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(aes(color=median))

color of points varies with median price

Set aesthetics to a constant outside the aes() function.

Compare the following graphs:

# setting color to green outside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(color="green")

color of points set to constant green

Create a new graph for data set Sitka, a scatter plot of Time (x-axis) vs size (y-axis), where all the points are colored “green”.

Placing a constant inside aes() can lead to unexpected results. ggplot2 treats the constant as a data value (a variable with the single value “green”), so the color scale assigns it a default color and adds a legend, rather than using the specified color.

# color="green" inside of aes()
# "green" is treated as a one-level data value, not as a color, so
#   the default color scale picks the color instead
ggplot(txhousing, aes(x=volume, y=sales)) +
  geom_point(aes(color="green"))

aesthetic set to constant within aes() leads to unexpected results

Geoms

geoms: bar, boxplot, density, histogram, line, point

Geom functions differ in the geometric shapes produced for the plot.

Some example geoms:

Geoms and aesthetics

Each geom is defined by aesthetics required for it to be rendered. For example, geom_point() requires both x and y, the minimal specification for a scatterplot.

Geoms differ in which aesthetics they accept as arguments. For example, geom_point() accepts the aesthetic shape, which defines the shapes of points on the graph, while geom_bar() does not accept shape.

Check the geom function help files for required and understood aesthetics. In the Aesthetics section of the geom’s help file, required aesthetics are bolded.
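The same information can also be inspected from the R console. The sketch below is one possible way to do so, assuming the exported ggproto objects GeomPoint and GeomBar and their required_aes field and aesthetics() method; these are ggplot2 internals, so the help files remain the authoritative source. The variable names are chosen here for illustration.

```r
library(ggplot2)

# open the help file, whose Aesthetics section lists what the geom understands
# ?geom_point

# aesthetics geom_point() cannot be drawn without
point_required <- GeomPoint$required_aes

# all aesthetics understood by geom_point() and by geom_bar()
point_aes <- GeomPoint$aesthetics()
bar_aes <- GeomBar$aesthetics()

point_required
"shape" %in% point_aes   # does geom_point() understand shape?
"shape" %in% bar_aes     # does geom_bar() understand shape?
```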

We will tour some commonly used geoms.

Histograms

ggplot(txhousing, aes(x=median)) + 
  geom_histogram() 

histograms visualize distribution of variable mapped to x

Histograms are popular choices to depict the distribution of a continuous variable.

geom_histogram() cuts the continuous variable mapped to x into bins, and counts the number of values within each bin.

Create a histogram of size from data set Sitka.

ggplot2 issues a message urging you to pick a number of bins for the histogram (it defaults to 30), using the bins argument.

Specify bins=20 inside of geom_histogram(). Note: bins is not an aesthetic, so should not be specified within aes().

Density plots

ggplot(txhousing, aes(x=median)) + 
  geom_density() 

density plots visualize smoothed distribution of variable mapped to x

Density plots are basically smoothed histograms.

Density plots are easier to overlay by group than histograms: map a grouping variable to color to draw a separate curve for each group.

ggplot(txhousing, aes(x=median, color=factor(month))) + 
  geom_density() 

densities of median price by month

Boxplots

ggplot(txhousing, aes(x=factor(year), y=median)) + 
  geom_boxplot() 

boxplots are useful to compare distribution of y variable across levels of x variable

Boxplots compactly visualize particular statistics of a distribution:

Boxplots are particularly useful for comparing whole distributions of a continuous variable between groups.

geom_boxplot() will create boxplots of the variable mapped to y for each group defined by the values of the x variable.

Create a new graph where we compare distributions of size across levels of treat from dataset Sitka.

Bar plots

ggplot(diamonds, aes(x=cut)) + 
  geom_bar() 

geom_bar displays frequencies of levels of x variable

Bar plots are often used to display frequencies of factor (categorical) variables.

geom_bar() by default produces a bar plot where the height of the bar represents counts of each x-value.

Start a new graph where the frequencies of treat from data set Sitka are displayed as a bar graph. Remember to map x to treat.

The color that fills the bars is not controlled by aesthetic color, but instead by fill. For bar plots, fill should be mapped to a factor (categorical) variable. We can visualize a crosstabulation of variables by mapping one of them to fill in geom_bar():

ggplot(diamonds, aes(x=cut, fill=clarity)) + 
  geom_bar() 

frequencies of cut by clarity

Add the aesthetic mapping fill=factor(Time) to aes() inside of ggplot() of the previous graph.

Scatter plots

# scatter of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) + 
  geom_point() 

scatter plot of volume vs sales

Scatter plots depict the covariation between pairs of variables (typically both continuous).

geom_point() depicts covariation between variables mapped to x and y.

Scatter plots are among the most flexible graphs, as variables can be mapped to many aesthetics such as color, shape, size, and alpha.

ggplot(txhousing, aes(x=volume, y=sales, 
                      color=median, alpha=listings, size=inventory)) + 
  geom_point() 

scatter plot of volume vs sales, colored by median price, transparent by number of listings, and sized by inventory

Line graphs

ggplot(txhousing, aes(x=date, y=sales, group=city)) + 
  geom_line() 

line graph of sales over time, separate lines by city

Line graphs depict covariation between variables mapped to x and y with lines instead of points.

geom_line() will treat all data as belonging to one line unless a variable is mapped to one of the following aesthetics to group the data into separate lines: group, color, or linetype.

Let’s first examine a line graph with no grouping:

ggplot(txhousing, aes(x=date, y=sales)) + 
  geom_line() 

line graph of sales over time, no grouping results in garbled graph

As you can see, unless the data represent a single series, line graphs usually call for some grouping.

Using color or linetype in geom_line() will implicitly group the lines.

ggplot(txhousing, aes(x=date, y=sales, color=city)) + 
  geom_line() 

line graph of sales over time, colored and grouped by city

Let’s try graphing separate lines (growth curves) for each tree.

Create a new line graph for data set Sitka with Time on the x-axis and size on the y axis, but also map group to tree.

We can specify color and linetype in addition to group. The lines will still be separately drawn by group, but can be colored or patterned by additional variables.

In our data, we might want to compare trajectories of growth between treatments.

Now add a specification mapping color to treat (in addition to group=tree).

Finally, map treat to linetype instead of color.

Stats

The stat functions statistically transform data, usually as some form of summary, such as the mean, the standard deviation, or a confidence interval.

Each stat function is associated with a default geom, so no geom is required for shapes to be rendered.

stat_summary(), perhaps the most useful of all stat functions, applies a summary function to the variable mapped to y for each value of the x variable. The default summary function is mean_se(), with associated geom geom_pointrange(), which will produce a plot of the mean (dot) and standard error (lines) of the variable mapped to y for each value of the x variable.

# summarize sales (y) for each year (x)
ggplot(txhousing, aes(x=year, y=sales)) + 
  stat_summary() 

mean and standard errors of sales by year

Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.

What makes stat_summary() so powerful is that you can use any function that accepts a vector as the summary function (e.g. mean(), var(), max(), etc.) and the geom can also be changed to adjust the shapes plotted.
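A minimal sketch of swapping in other summary functions and geoms follows. It assumes ggplot2 3.3.0 or later, where the arguments are named fun, fun.min, and fun.max (older versions used fun.y, fun.ymin, and fun.ymax). The object names p_median and p_range are chosen here for illustration.

```r
library(ggplot2)

# median of sales (y) for each year (x), drawn as a line
# instead of the default pointrange
p_median <- ggplot(txhousing, aes(x=year, y=sales)) +
  stat_summary(fun=median, geom="line")

# mean of sales as the point, with the minimum and maximum
# as the ends of the range
p_range <- ggplot(txhousing, aes(x=year, y=sales)) +
  stat_summary(fun=mean, fun.min=min, fun.max=max, geom="pointrange")

# rows with missing sales are removed with a warning when plotted
p_median
p_range
```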

Scales

Scales define which aesthetic values are mapped to the data values.

Here is an example of a color scale that defines which colors are mapped to values of treat:

color treat
red ozone
blue control


Imagine that we might want to change the colors to “green” and “orange”.

The scale_ functions allow the user to control the scales for each aesthetic. These scale functions have names with structure scale_aesthetic_suffix, where aesthetic is the name of an aesthetic like color or shape or x, and suffix is some descriptive word that defines the functionality of the scale.

Then, to specify the aesthetic values to be used by the scale, supply a vector of values to the values argument (usually) of the scale function.

Some example scales functions:

See the ggplot2 documentation page section on scales to see a full list of scale functions.
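As a sketch of the naming pattern, scale_color_manual() combines the aesthetic color with the suffix manual. The example below applies the “green” and “orange” colors imagined above to treat in Sitka. Naming the elements of the values vector after the factor levels (control and ozone) is an optional choice made here so that the pairing does not depend on level order.

```r
library(ggplot2)
library(MASS)

# aesthetic = color, suffix = manual: we choose the colors ourselves
# each name in the values vector matches a level of treat
p_treat <- ggplot(Sitka, aes(x=Time, y=size, color=treat)) +
  geom_point() +
  scale_color_manual(values=c(control="green", ozone="orange"))

p_treat
```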

Here is a color scale that ggplot2 chooses for us:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point()

default color scale

We can use scale_color_manual() to specify which colors we want to use:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) 

using scale_color_manual to respecify colors

Create a new graph for data set Sitka, a scatter plot of Time on x and size on y, with the color of the points mapped to treat.
Now use scale_color_manual(), and inside specify the values argument to be “orange” and “purple”.

Scale functions for the axes

Remember that x and y are aesthetics, and the two axes visualize the scale for these aesthetics.

Thus, we use scale functions to control the scaling of these axes.

When y is mapped to a continuous variable, we will typically use scale_y_continuous() to control its scaling (use scale_y_discrete() if y is mapped to a factor). Similar functions exist for the x aesthetic.

A description of some of the important arguments to scale_y_continuous():

Our current graph of carat vs price has y-axis tick marks at 0, 5000, 10000, and 15000:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) 

default y-axis tick marks

Let’s put tick marks at all grid lines along the y-axis using the breaks argument of scale_y_continuous:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500))

changing y-axis tick marks

Now let’s relabel the tick marks to reflect units of thousands (of dollars) using labels:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
                     labels=c(0,2.5,5,7.5,10,12.5,15,17.5))

relabeling y-axis tick marks

And finally, we’ll retitle the y-axis using the name argument to reflect the units:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
                     labels=c(0,2.5,5,7.5,10,12.5,15,17.5),
                     name="price(thousands of dollars)")

new y-axis title

Use scale_x_continuous() to convert the x-axis of the previous graph from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis “time(months)”.

Modifying axis limits and titles

Although we can use scale functions like scale_x_continuous() to control the limits and titles of the x-axis, we can also use the following shortcut functions:

To set axis limits, supply a vector of 2 numbers (inside c(), for example) to one of the limits functions:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  xlim(c(1,3)) # carat ranges from about 0.2 to 5 in the data

restricting axis limits will zoom in
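One caveat is worth a short sketch: limits set with xlim() treat values outside the range as missing, so those rows are dropped with a warning, whereas coord_cartesian() only changes the visible window. The comparison below is an illustration added here; coord_cartesian() is not otherwise used in this seminar, and the object names are hypothetical.

```r
library(ggplot2)

# how many diamonds fall outside the 1 to 3 carat window?
n_outside <- sum(diamonds$carat < 1 | diamonds$carat > 3)
n_outside

# xlim(): rows outside the limits are treated as missing and not drawn
p_drop <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point() +
  xlim(c(1,3))

# coord_cartesian(): all rows are kept, only the viewing window changes
p_zoom <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point() +
  coord_cartesian(xlim=c(1,3))

p_drop
p_zoom
```

For a scatter plot the two look alike, but the difference matters for layers that summarize the data, such as geom_smooth() or stat_summary(), which only see the rows that survive the limits.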

We can use labs() to specify titles for the overall graph, the axes, and the legends (guides).

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  labs(x="CARAT", y="PRICE", color="CUT", title="CARAT vs PRICE by CUT")

respecifying all titles with labs

Guides visualize scales

Guides (axes and legends) visualize a scale, displaying data values and their matching aesthetic values. The x-axis, a guide, visualizes the mapping of data values to position along the x-axis. A color scale guide (legend) displays which colors map to which data values.

Most guides are displayed by default. The guides() function sets and removes guides for each scale.

Here we use guides() to remove the color scale legend:

# notice no legend on the right anymore
ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  guides(color="none")

color legend removed

Coordinate systems

Coordinate systems define the planes on which objects are positioned in space on the plot. Most plots use Cartesian coordinate systems, as do all the plots in this seminar. Nevertheless, ggplot2 provides multiple coordinate systems, including polar, flipped Cartesian, and map projections.
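As a brief, hedged sketch of two of these alternatives, coord_flip() swaps the x and y axes and coord_polar() draws the plot on polar coordinates. The plots and object names below are illustrative choices, not part of the seminar exercises.

```r
library(ggplot2)
library(MASS)

# boxplots of size by treatment on a flipped Cartesian plane,
# so the boxes run horizontally
p_flip <- ggplot(Sitka, aes(x=treat, y=size)) +
  geom_boxplot() +
  coord_flip()

# bar plot of cut frequencies drawn on polar coordinates
p_polar <- ggplot(diamonds, aes(x=cut)) +
  geom_bar() +
  coord_polar()

p_flip
p_polar
```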

Faceting (paneling)

Split plots into small multiples (panels) with the faceting functions, facet_wrap() and facet_grid(). The resulting graph shows how each plot varies along the faceting variable(s).

facet_wrap() wraps a ribbon of plots into a multirow panel of plots. Inside facet_wrap(), specify ~, then a list of splitting variables, separated by +. The number of rows and columns can be specified with arguments nrow and ncol.

ggplot(diamonds, aes(x=carat, y=price)) + 
  geom_point() + 
  facet_wrap(~cut) # create a ribbon of plots using cut

carat vs price, paneled by cut with facet_wrap()

facet_grid() allows direct specification of which variables are used to split the data/plots along the rows and columns. Put the row-splitting variable before ~, and the column-splitting variable after. The character . specifies no faceting along that dimension.

ggplot(diamonds, aes(x=carat, y=price)) + 
  geom_point() + 
  facet_grid(clarity~cut) # split by clarity along rows and by cut along columns

carat vs price, paneled by clarity and cut with facet_grid()
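The . placeholder and the ncol argument described above can be sketched as follows, using the same diamonds plot. The object names are chosen here for illustration.

```r
library(ggplot2)

base_plot <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point()

# no faceting along rows (.), one column of panels per level of cut
p_cols <- base_plot + facet_grid(. ~ cut)

# one row of panels per level of cut, no faceting along columns (.)
p_rows <- base_plot + facet_grid(cut ~ .)

# facet_wrap() with the ribbon of panels wrapped into 2 columns
p_wrap <- base_plot + facet_wrap(~cut, ncol=2)

p_cols
p_rows
p_wrap
```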

Create a panel of scatter plots of Time vs size (same as above), split by treat along the rows using facet_grid().

Themes

Themes control elements of the graph not related to the data. For example:

To modify these, we use the theme() function, which has a large number of arguments called theme elements, which control various non-data elements of the graph.

Some example theme() arguments and what aspect of the graph they control:

A full description of theme elements can be found on the ggplot2 documentation page.

Specifying theme() arguments

Most non-data elements of the graph can be categorized as either a line (e.g. axes, tick marks), a rectangle (e.g. the background), or text (e.g. axes titles, tick labels). Each of these categories has an associated element_ function to specify the parameters controlling its appearance:


Inside theme() we control the properties of a theme element using the proper element_ function.

For example, the x- and y-axes are lines and are both controlled by theme() argument axis.line, so their visual properties, such as color and size (thickness), are specified as arguments to element_line():

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2)) # size in mm; ggplot2 3.4.0 and later prefer the linewidth argument

using theme argument axis.line to modify x-axis and y-axis lines

On the other hand, the background of the graph, controlled by theme() argument panel.background, is a rectangle, so parameters like fill color and border color can be specified with element_rect().

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2),
        panel.background=element_rect(fill="white", color="gray")) # color is the border color

using theme argument panel.background to modify the background of the graph

With element_text() we can control properties such as the font family or face ("bold", "italic", "bold.italic") of text elements like title, which controls all titles on the graph (the plot, axis, and legend titles).

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2),
        panel.background=element_rect(fill="white", color="gray"),
        title=element_text(family="serif", face="bold")) 

using theme argument title to adjust fonts of all titles

Note: "sans", "serif", and "mono" are the only font family choices available for ggplot2 without downloading additional R packages. See this RPubs webpage for more information.

Finally, some theme() arguments do not use element_ functions to control their properties, like legend.position, which simply accepts values "none", "left", "right", "bottom", and "top".

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", size=2),
        panel.background=element_rect(fill="white", color="gray"),
        title=element_text(family="serif", face="bold"),
        legend.position="bottom") 

using theme argument legend.position to position legend

We could then use legend.text=element_text() in theme() to rotate the legend labels (not shown).
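A minimal sketch of that rotation uses the angle argument of element_text(). The angle of 45 degrees and the hjust value are arbitrary choices made here for illustration.

```r
library(ggplot2)

# legend at the bottom, with its labels rotated 45 degrees
p_legend <- ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
  geom_point() +
  theme(legend.position="bottom",
        legend.text=element_text(angle=45, hjust=1))

p_legend
```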

Remember to use the ggplot2 theme documentation page when using theme().

Create a scatter plot of Time vs size from the Sitka data set. Then use theme() argument axis.ticks to “erase” the tick marks by coloring them white with element_line().

Changing the overall look with complete themes

The ggplot2 package provides a few complete themes which make several changes to the overall background look of the graphic (see here for a full description).

Some examples:

The themes usually adjust the color of the background and most of the lines that make up the non-data portion of the graph.

theme_classic() mimics the look of base R graphics:

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme_classic()

theme_classic()

theme_dark() makes a dramatic change to the look:

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme_dark()

theme_dark()

Saving plots to files

ggsave() makes saving plots easy. The last plot displayed is saved by default, but we can also save a plot stored to an R object.

ggsave attempts to guess the device to use to save the image from the file extension, so use a meaningful extension. Available devices include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf.

Other important arguments to ggsave():

#save last displayed plot as pdf
ggsave("plot.pdf")

#if you're working with lots of graphs, you can store them in R objects
p <- ggplot(Sitka, aes(x=Time, y=size)) + 
  geom_point()
#You can then use the plot argument of ggsave() to specify which plot to save instead of the last
ggsave("myplot.png", plot=p)
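The size of the saved image can be controlled with the width, height, and units arguments of ggsave(). The sketch below writes to a temporary file so that nothing in the working directory is touched; the file name and the dimensions are hypothetical choices for illustration.

```r
library(ggplot2)
library(MASS)

p <- ggplot(Sitka, aes(x=Time, y=size)) +
  geom_point()

# a file in the session's temporary directory
out_file <- file.path(tempdir(), "sitka_plot.pdf")

# width and height are interpreted in the units given
# (for raster formats such as png, the dpi argument sets the resolution)
ggsave(out_file, plot=p, width=6, height=4, units="in")

# check that the file was written
file.exists(out_file)
```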

Save your last plot as mygraph.png. View the file on your computer.

Practice using the grammar of graphics

From idea to final graphic: graphing the Rabbit data set {MASS}

To practice using the elements of the grammar of graphics, we will begin with the idea of what we want to display, and step-by-step, we will add to and adjust the graphic until we feel it is ready to share with an audience.

For our next graph, we will be visualizing the data in the Rabbit data set, also loaded with the MASS package.

Run data(Rabbit) and then str(Rabbit) to look at the structure of Rabbit and to bring it into the RStudio Environment.

The Rabbit data set describes an experiment where:

The purpose was to test whether blood pressure changes were dependent on activation of serotonin receptors.


The data set contains 60 rows (5 rabbits measured 12 times) of the following 5 variables:

The idea of the graph

As is typical in drug studies, we want to create a dose-response curve. For this study, we want to see how blood pressure changes are related to the dose of phenylbiguanide, the blood-pressure raising drug, in the presence of both saline (control) and the serotonin antagonist MDL 72222 (treatment).

Issues to think about:

So, we want a graph that represents individual dose-response curves for each rabbit, and ideally separate curves for treatment and control conditions per rabbit.

Let’s build this graph step-by-step.

Rabbit graph 1

What geom will be required to create the dose-response curve?

What are its required aesthetics?

Initiate a new graph plotting the relationship between Dose and BPchange using geom_line() for data set Rabbit. For now, we will not group the data for instructive purposes.

Rabbit graph 2

ggplot(Rabbit, aes(x=Dose, y=BPchange)) +
  geom_line()

That obviously isn’t what we want – this graph treats the data as if it all belongs on one line. We want separate lines by rabbit (variable Animal). How can we specify that?

Let’s imagine that we are constrained to produce a colorless figure, so that we cannot use color. We will instead use linetype to separate by Animal.

Modify the previous plot by mapping linetype to Animal.

Rabbit graph 3

ggplot(Rabbit, aes(x=Dose, y=BPchange, linetype=Animal)) +
  geom_line()

That still doesn’t look right. Why?

We thus need to separate the treatment curve from the control curve for each animal. How can we accomplish that?

ggplot(Rabbit, aes(x=Dose, y=BPchange, color=Treatment, linetype=Animal)) +
  geom_line() 

However, we are constraining ourselves to colorless graphs. Can you think of another way to separate the curves?

Use facet_wrap() to split the previous graph by Treatment. Remember the ~.

Rabbit graph 4

ggplot(Rabbit, aes(x=Dose, y=BPchange, linetype=Animal)) +
  geom_line() +
  facet_wrap(~Treatment)

Now we’re finally starting to see the graph that we want!

It can be a bit difficult to distinguish between the lines just based on linetype. Let’s add some points to the graph, but vary the shapes of the points by Animal.

How do we add points to the graph that vary by shape?

Add a scatter plot with the shape of points varying with Animal to the previous graph.

Rabbit graph 5

ggplot(Rabbit, aes(x=Dose, y=BPchange, linetype=Animal, shape=Animal)) +
  geom_line() +
  facet_wrap(~Treatment) +
  geom_point()

Making good progress!

Now, imagine we want to select the shapes that are plotted rather than use the defaults. What function (or what kind of function) will we need to change the shapes?

Use scale_shape_manual() to change the shapes to those corresponding to the codes (0, 3, 8, 16, 17) in the previous graph.

Rabbit graph 6

ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
  geom_line() +
  facet_wrap(~Treatment) + 
  geom_point() + 
  scale_shape_manual(values=c(0, 3, 8, 16, 17)) 

Ok! Now we are done adding data to the graph. Now we move on to some fine tuning.

First, let’s change the titles of the x-axis and the y-axis. What function can change both of these?

For the previous graph, change the title of the x-axis to “Dose(mcg)” and the title of the y-axis to “Change in blood pressure”.

Rabbit graph 7

ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
  geom_point() + 
  geom_line() +
  facet_wrap(~Treatment) + 
  scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
  labs(x="Dose(mcg)", y="Change in blood pressure")

Almost there!

Imagine we don’t like the gray background with white gridlines, and instead want to use a white background with gray gridlines.

What function will we use to adjust each of these elements?

The theme() argument that controls the graph background is panel.background. Which element_ function should we use to specify the parameters for panel.background?

theme() argument panel.grid controls the grid lines. Which element_ function should we use to specify parameters for gridlines?

element_line()

Use theme() with arguments panel.background and panel.grid to change the background color to "white" and the gridline color to "gray90".

Rabbit graph 8

ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Treatment) +
  scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
  labs(x="Dose(mcg)", y="Change in blood pressure") +
  theme(panel.background = element_rect(fill="white"),
        panel.grid=element_line(color="gray90"))

One final step!

Now we want to use theme() to adjust the axis, legend, and panel titles (“Control” and “MDL”) to use bold fonts.

The theme arguments we will use are title and strip.text. Which element_ function should we use to specify parameters for these?

We will use the face argument in element_text to set the titles to “bold”.

Use theme() to bold the axes titles, the legend title, and the panel titles in the previous graph.

Rabbit graph finished

ggplot(Rabbit, aes(x=Dose, y=BPchange, shape=Animal, linetype=Animal)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Treatment) +
  scale_shape_manual(values=c(0, 3, 8, 16, 17)) +
  labs(x="Dose(mcg)", y="Change in blood pressure") +
  theme(panel.background = element_rect(fill="white"),
        panel.grid=element_line(color="gray90"),
        title=element_text(face="bold"),
        strip.text=element_text(face="bold"))

Advice for working with ggplot2

New dataset birthwt {MASS}

The birthwt data set contains data regarding risk factors associated with low infant birth weight.

The data consist of 189 observations of 10 variables, all numeric:

Let’s take a look at the structure of the birthwt data set first, to get an idea of how the variables are measured.

Run data(birthwt). Then use str() to examine the structure of the birthwt dataset.
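
The practice step above might look like the following sketch. The sapply() call is our addition; it simply confirms that every column is stored as a number at this point:

```r
library(MASS)

data(birthwt)   # makes birthwt visible in the workspace
str(birthwt)    # prints the type and first values of each variable

# is.numeric() applied to each column: TRUE means the column holds numbers
all_numeric <- sapply(birthwt, is.numeric)
all_numeric
```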

Aesthetics, numeric and factor (categorical) variables

Variables in a dataset can generally be divided into numeric variables, where the number value is a meaningful representation of a quantity, and factor (categorical) variables, where number values are usually codes representing membership in a category rather than quantities.

In R, we can encode variables as “factors” with the factor() function.
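
As a small illustration with a made-up vector of codes (not taken from birthwt), factor() attaches a label to each code through its levels and labels arguments:

```r
# hypothetical numeric codes: 1 = low, 2 = medium, 3 = high
codes <- c(1, 3, 2, 1, 3)

# levels lists the codes, labels gives the category name for each code
dose <- factor(codes, levels = 1:3, labels = c("low", "medium", "high"))

levels(dose)         # the category labels
as.character(dose)   # each observation shown by its label
table(dose)          # counts per category
```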

Some aesthetics can be mapped to either numeric or categorical variables, and will be scaled differently. These include x, y, color, and fill.

Other aesthetics, such as shape and linetype, can only be mapped to categorical variables.

And finally, some aesthetics, such as size and alpha, should only be mapped to numeric variables (a warning is issued if they are mapped to a categorical variable).

Aesthetic scales are formed differently for numeric and factor variables

Let’s examine how aesthetics behave when mapped to different types of variables using the birthwt dataset, in which all of the variables are numeric initially.

When color is mapped to a numeric variable, a color gradient scale is used:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point()

color gradient is appropriate for a continuous variable

Note: even though we just used race as a numeric variable to demonstrate how ggplot handles it, we do not recommend treating categorical variables as numeric.

When color is instead mapped to a factor variable, a color scale of evenly spaced hues is used. We can convert a numeric variable to a factor inside of aes() with factor():

ggplot(birthwt, aes(x=age, y=bwt, color=factor(race))) +
  geom_point()

evenly spaced hues emphasize contrasts between groups of a factor

An error results if we try to map shape to a numeric version of race, because shape only accepts factor variables.
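
To see this without halting a script, we can build the plot inside tryCatch(). This is only a sketch: it works on a fresh copy of the data in which race is still numeric, the object names are ours, and the exact wording of the error message depends on the installed ggplot2 version:

```r
library(ggplot2)

bw <- MASS::birthwt   # fresh copy, race is numeric (1, 2, 3)

p_bad <- ggplot(bw, aes(x = age, y = bwt, shape = race)) +
  geom_point()

# the error is raised when the plot is built, not when it is defined
result <- tryCatch(ggplot_build(p_bad), error = function(e) e)

# report the message only if an error was actually caught
if (inherits(result, "error")) conditionMessage(result)
```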

Shape accepts the factor representation of race:

ggplot(birthwt, aes(x=age, y=bwt, shape=factor(race))) +
  geom_point()

shapes distinguish the groups of a factor

Finally, some aesthetics like alpha and size should really only be used with truly numeric variables, and a warning will be issued if the variable is a factor:

ggplot(birthwt, aes(x=age, y=bwt, size=factor(race))) +
  geom_point()
## Warning: Using size for a discrete variable is not advised.

size and alpha should only be mapped to numeric variables

Convert categorical variables to factors before graphing

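We recommend converting all categorical variables to factors prior to graphing for several reasons, as the graphs above suggest: ggplot2 then chooses discrete scales automatically, aesthetics such as shape work without wrapping the variable in factor(), and descriptive labels appear in legends and on axes.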

Use the following code to convert 4 of the variables in birthwt to factors:

birthwt$low <- factor(birthwt$low, levels=0:1, 
                      labels=c("now low", "low"))
birthwt$race <- factor(birthwt$race, levels=1:3, 
                       labels=c("white", "black", "other"))
birthwt$ht <- factor(birthwt$ht, levels=0:1, 
                     labels=c("non-hypert", "hypert"))
birthwt$smoke <- factor(birthwt$smoke, levels=0:1, 
                      labels=c("did not smoke", "smoked"))

Issue str(birthwt) again to examine the class of each variable.

Then create a scatter plot of x=age and y=bwt for data set birthwt. Try mapping an appropriate variable (besides race) to shape and another to alpha.

Overlapping data points in scatter plots

When two data points have the same values on both plotted variables, they will occupy the same position on the graph, causing one to obscure the other.

Here is an example where we map race, a factor variable, to x and map age (in years) to y in a scatter plot:

ggplot(birthwt, aes(x=race, y=age)) +
  geom_point()

too many discrete values leads to overlapping points

There are 189 data points in the data set, but far fewer than 189 points visible in the graph, because many are completely overlapping.

To address this problem, we have a choice of “position adjustments” which can be specified to the position argument in a geom function.

For geom_point(), we usually use either:

By adding position="jitter" to the previous scatter plot, we can better see how many points there are at each age:

ggplot(birthwt, aes(x=race, y=age)) +
  geom_point(position="jitter")

jittering adds random variation to the position of the points
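
The amount of jitter can also be controlled. The following is a minimal sketch using position_jitter(), assuming we only want to spread the points horizontally so that the ages on the y-axis stay exact; the width and seed values are arbitrary choices, and the sketch recodes race on a fresh copy of the data so that it runs on its own:

```r
library(ggplot2)

bw <- MASS::birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

p_jit <- ggplot(bw, aes(x = race, y = age)) +
  # width: horizontal spread; height = 0 leaves y untouched;
  # seed makes the random jitter reproducible
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 123))
p_jit
```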

Overlapping bars in bar graphs

Remember that geom_bar() will plot the frequencies of the variable mapped to x as bars. If we map a second variable to fill, the bars will be colored by the second variable.

We can use the position argument in geom_bar() to control the placement of the bars with the same x value.

The following adjustments are generally used for geom_bar(): position="stack" (the default), position="dodge", and position="fill".

Each position adjustment emphasizes different quantities.

By default, geom_bar() uses position="stack", a compromise where we can see both the counts and proportions reasonably well:

ggplot(birthwt, aes(x=low, fill=race)) +
  geom_bar()

geom_bar() will stack bars with the same x-position

If we instead want to emphasize counts, we use position="dodge", which places the bars side-by-side:

ggplot(birthwt, aes(x=low, fill=race)) +
  geom_bar(position="dodge")

dodging emphasizes counts

Proportions are emphasized with position="fill", where the bars are stacked and their heights are standardized:

ggplot(birthwt, aes(x=low, fill=race)) +
  geom_bar(position="fill")

filling emphasizes proportions
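
The quantities that each position adjustment displays can also be checked numerically. Here is a sketch with base R table() and prop.table(), using a fresh copy of the data so that it runs on its own (the object names are ours, and low is left as its 0/1 codes):

```r
bw <- MASS::birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# counts: the bar heights shown by position="stack" and position="dodge"
counts <- table(low = bw$low, race = bw$race)
counts

# proportions within each value of low: the segment heights
# shown by position="fill"
props <- prop.table(counts, margin = 1)
round(props, 2)
```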

Use geom_bar() with data set birthwt variables low and smoke together with position adjustments to answer two questions:
A. Are there more low birth weight or non-low birth weight babies with mothers who smoked in the dataset?
B. Are babies from mothers who smoked proportionally more likely to be low birth weight or non-low birth weight?

*Error bars and confidence bands*

Error bars and confidence bands are both used to express ranges of statistics. To draw these, we’ll use geom_errorbar() and geom_ribbon(), respectively.

To use either geom, the aesthetics ymin and ymax, the lower and upper limits of the range, are required in addition to a position on the x-axis.

For example, the following code estimates the mean birth weight and 95% confidence interval for the mean for the three races in data set birthwt. The means and confidence limits are stored in a new data.frame called bwt_by_race.

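# mean_cl_normal() (from ggplot2; requires the Hmisc package) returns the mean
# and 95% confidence limits; tapply() applies it to bwt within each race
bwt_by_race <- do.call(rbind, 
                         tapply(birthwt$bwt, birthwt$race, mean_cl_normal))
# store the row names (races) as a column and give the columns clearer names
bwt_by_race$race <- row.names(bwt_by_race)
names(bwt_by_race) <- c("mean", "lower", "upper", "race")
bwt_by_race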
##           mean    lower    upper  race
## white 3102.719 2955.235 3250.202 white
## black 2719.692 2461.722 2977.662 black
## other 2805.284 2629.127 2981.441 other
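
For readers curious about what these limits represent, the same quantities can be computed by hand with the usual t-based interval, mean ± t(0.975, n-1) × sd/√n. This sketch assumes that this is the interval mean_cl_normal() returns by default; the helper name ci_by_hand is ours, and the code uses a fresh copy of the data so that it runs on its own:

```r
bw <- MASS::birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# hypothetical helper: mean and t-based confidence limits for a vector
ci_by_hand <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# apply the helper to bwt within each race; one row per race
by_hand <- t(sapply(split(bw$bwt, bw$race), ci_by_hand))
by_hand
```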

Now we can plot the means by race with geom_point() and the confidence limits with geom_errorbar():

ggplot(bwt_by_race, aes(x=race, y=mean)) +
  geom_point() +
  geom_errorbar(aes(ymin=lower, ymax=upper))

mean birthweight by race

Use width= to adjust the width of the error bars:

ggplot(bwt_by_race, aes(x=race, y=mean)) +
  geom_point() +
  geom_errorbar(aes(ymin=lower, ymax=upper), width=.1)

mean birthweight by race, narrower error bars

Confidence bands work similarly. We’ll need values for the maximum and minimum again for geom_ribbon().

This time, we’ll create a plot of predicted values with confidence bands from a regression of birthweight on age. First, we’ll run the model and add the predicted values and the confidence limits to the original data set for plotting:

# linear regression of birth weight on age
m <- lm(bwt ~ age, data=birthwt)
# get predicted values (fit) and confidence limits (lwr and upr)
preddata <- predict(m, interval="confidence")
# add predicted values to original data
birthwt <- cbind(birthwt, preddata)
head(birthwt)
##        low age lwt  race         smoke ptl         ht ui ftv  bwt      fit
## 85 now low  19 182 black did not smoke   0 non-hypert  1   0 2523 2891.909
## 86 now low  33 155 other did not smoke   0 non-hypert  0   3 2551 3065.925
## 87 now low  20 105 white        smoked   0 non-hypert  0   1 2557 2904.339
## 88 now low  21 108 white        smoked   0 non-hypert  1   2 2594 2916.768
## 89 now low  18 107 white        smoked   0 non-hypert  1   0 2600 2879.479
## 91 now low  21 124 other did not smoke   0 non-hypert  0   0 2622 2916.768
##         lwr      upr
## 85 2757.969 3025.849
## 86 2846.442 3285.408
## 87 2781.794 3026.883
## 88 2803.295 3030.242
## 89 2732.358 3026.600
## 91 2803.295 3030.242

Now we’ll use geom_line() to show the best fit line, and geom_ribbon() to show the confidence bands:

ggplot(birthwt, aes(x=age, y=fit)) + 
  geom_line() +
  geom_ribbon(aes(ymin=lwr, ymax=upr))

best fit line with confidence bands

Yikes! That confidence band is too dark. Use alpha to lighten the bands by making them more transparent. Remember, because we are setting the entire band to be a constant transparency, we will specify alpha outside of aes().

ggplot(birthwt, aes(x=age, y=fit)) + 
  geom_line() +
  geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=.5)

best fit line with confidence bands
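
When the model is a simple regression like this one, geom_smooth() can fit the line and draw the band in one step. The following is a sketch, assuming the default 95% confidence level is what we want; the raw data points are added only for context, and the object names are ours:

```r
library(ggplot2)

bw <- MASS::birthwt

p_fit <- ggplot(bw, aes(x = age, y = bwt)) +
  geom_point(alpha = 0.4) +
  # method="lm" fits bwt ~ age; se=TRUE draws the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95)
p_fit
```

Computing the predictions ourselves, as in the preceding example, remains the more flexible route when the model has several predictors.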

Use the following code to run a logistic regression of low birth weight (low) on various terms in the birthwt data set and then convert the coefficients and their confidence intervals to odds ratios.

m2 <- glm(low ~ smoke + age + ptl + ui, family="binomial", data=birthwt)
ci <- confint(m2)
odds_ratios <- data.frame(coef = exp(coef(m2)[-1]),
                          lower = exp(ci[-1,1]),
                          upper= exp(ci[-1,2]))
odds_ratios$term <- row.names(odds_ratios)

Graph the odds ratios and their confidence intervals using geom_point() and geom_errorbar().

Annotating a graph

At times we need to add notes or annotations directly to the graph that are not represented by any variables in the graph data set. For example, we may want to add a text label to a single point on a scatter plot, or perhaps highlight a portion of the graph with a colored box.

For this, we can use the annotate() function. To use annotate(), the first argument is the name of a geom (for example "text" or "rect"). Subsequent arguments are positioning aesthetics such as x= and y= and any additional aesthetics needed for that particular geom.

Let’s imagine that we want to annotate the data point in the far upper right corner of this graph we have seen before:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point()

we want to annotate point in upper right

Suppose we want to label the outlier as a possible data error. To add annotation text, we will specify the "text" geom in annotate(). We will need to specify x= and y= positions for the text, and the contents of the text in label=.

We see that the outlier lies at x=45, y=5000. To place the text a little to the left of the point, we will use x=42 and y=5000. Proper positioning will take some experimentation. We specify the text to be displayed with label="Data error?".

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point() + 
  annotate("text", x=42, y=5000, label="Data error?")  # notice first argument is "text", not geom_text

Annotating outlier as possible data entry error

As another example, let’s highlight the portion of the graph that features birthweights within 1 standard deviation of the mean weight. We will create a rectangle using the "rect" geom that spans the full width of the x-axis from xmin=13 to xmax=46, and the y-axis from ymin=2215 to ymax=3673 (mean-sd, mean+sd). We will set alpha=.2 to make the box transparent.

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point() + 
  annotate("rect", xmin=13, xmax=46, ymin=2215, ymax=3673, alpha=.2)  # notice first argument is "rect", not geom_rect

Birthweights within one standard deviation of mean
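
Rather than typing the limits by hand, they can be computed from the data. In this sketch, -Inf and Inf stretch the rectangle across the whole x-axis; the object names are ours, and race is recoded on a fresh copy of the data so that the code runs on its own:

```r
library(ggplot2)

bw <- MASS::birthwt
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

m <- mean(bw$bwt)   # mean birth weight
s <- sd(bw$bwt)     # standard deviation of birth weight

p_band <- ggplot(bw, aes(x = age, y = bwt, color = race)) +
  geom_point() +
  # the rectangle covers mean - sd to mean + sd on the y-axis
  annotate("rect", xmin = -Inf, xmax = Inf,
           ymin = m - s, ymax = m + s, alpha = .2)
p_band
```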

*Working with colors*

Specifying colors in R

We can specify a color in R in several ways: by name (a string such as "white"), by hex code, or by RGB values with rgb().

We have already used string names like “white” and “green” to specify colors.

You can issue colors() in R to see a full list of available color names. Here we show the first 30 names out of 657:

head(colors(), n=30)
##  [1] "white"          "aliceblue"      "antiquewhite"   "antiquewhite1" 
##  [5] "antiquewhite2"  "antiquewhite3"  "antiquewhite4"  "aquamarine"    
##  [9] "aquamarine1"    "aquamarine2"    "aquamarine3"    "aquamarine4"   
## [13] "azure"          "azure1"         "azure2"         "azure3"        
## [17] "azure4"         "beige"          "bisque"         "bisque1"       
## [21] "bisque2"        "bisque3"        "bisque4"        "black"         
## [25] "blanchedalmond" "blue"           "blue1"          "blue2"         
## [29] "blue3"          "blue4"

We can also use hex color codes. These hex codes usually consist of # followed by 6 numbers/letters (each a hexadecimal digit ranging from 0 to F), where the first two digits represent redness, the second two greenness, and the last two blueness.

For example, the hex code #009900 would represent a green shade, while hex code #FF00EE would represent a magenta (pinkish-purple) shade. Online color pickers can help you identify the hex code for a particular color.

ggplot(birthwt, aes(x=age, y=bwt)) +
  geom_point(color="#E36D11")

Using a hex code to specify a shade of orange

Finally, we can use RGB (red, green, blue) values to specify a color. Supply three numbers between 0 and 1 to the rgb() function, and it will return the hex code for that color. Let’s try a purple:

# rgb() returns a hex code
rgb(.75, 0, 1)
## [1] "#BF00FF"

ggplot(birthwt, aes(x=age, y=bwt)) +
  geom_point(color=rgb(.75, 0, 1))

Using rgb() to specify a shade of purple
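
The correspondence between hex codes and RGB values can be checked with base R's col2rgb(), which reports each channel on a 0 to 255 scale. A short sketch:

```r
# decode a hex code into its red, green, and blue channels (0 to 255)
green_channels <- col2rgb("#009900")
green_channels

# rgb() also accepts 0-255 values when maxColorValue is set
purple_hex <- rgb(191, 0, 255, maxColorValue = 255)
purple_hex

# color names can be decoded the same way
white_channels <- col2rgb("white")
white_channels
```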

Color scales by variable type

Part of the challenge of making effective and attractive color graphs is choosing a color palette that serves both purposes of representing variation and catching the eye.

When you map a variable to color or fill, the ggplot2 package will use the variable’s type (i.e. numeric, factor, ordinal) to choose a color scale.

If you map a numeric variable to color, you will usually get a color gradient based on a single hue:

ggplot(birthwt, aes(x=lwt, y=bwt, color=as.numeric(race))) + 
  geom_point() 

color gradient scale for numeric variables

A color gradient is a natural analog to a numeric variable. In the above graph, as the color becomes lighter blue, the race value becomes higher (assuming the value has numeric meaning).

On the other hand, if we map a factor variable to color, we get a set of distinct hues evenly spaced around the color wheel:

ggplot(birthwt, aes(x=lwt, y=bwt, color=factor(race))) + 
  geom_point() 

evenly spaced distinct hues for factor variables

The categories of a factor variable are considered unordered, so using completely different hues to represent them makes sense.

Those were just the default color scales that ggplot2 chooses for you, by guessing the appropriate scale from the variable’s type. There are many ways to form color scales using ggplot2 so you have lots of options when choosing a palette.

Color scale functions

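Here are some color scale functions used to form color scales (there is an analogous scale function for the fill aesthetic for each of the below):

  1. scale_color_gradient(): a color gradient for numeric variables
  2. scale_color_hue(): evenly spaced hues for factor variables
  3. scale_color_manual(): colors specified by hand
  4. scale_color_brewer(): ColorBrewer palettes

With scale_color_gradient(), we can define the colors that define the ends of the gradient with arguments low and high. The default gradient runs from a blueish-black at the low end to a light-blue at the high end. We can redefine the scale to go from a very light green (honeydew) to a dark green: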

ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
  geom_point() +
  scale_color_gradient(low="honeydew", high="darkgreen")

Defining our own color gradient

Because we are only changing a single hue with color gradients, perhaps it is easier to use rgb(), where we can specify the intensity of red, green, and blue directly:

ggplot(birthwt, aes(x=age, y=bwt, color=lwt)) +
  geom_point() +
  scale_color_gradient(low=rgb(.1, .2, .1), high=rgb(.1, 1, .1))

Defining our own color gradient with rgb()

With scale_color_hue(), we define a color scale by specifying a range of colors to use, and then evenly spaced hues will be selected from this range. Here are the relevant arguments to scale_color_hue():

Varying any of the 3 above will alter the color palette.

First, we change the range of colors with h to be much smaller:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point() +
  scale_color_hue(h=c(0,90))

restricting range of colors with scale_color_hue()

We can also use the original range, but change the starting hue with h.start to get a completely different set:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point() +
  scale_color_hue(h.start=20)

changing starting hue with scale_color_hue()
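
To preview which colors a given hue range produces without drawing a graph, the scales package (installed alongside ggplot2) offers hue_pal(), which takes the same h and h.start arguments. This is a sketch with object names of our own; newer versions of scales also provide the function under the name pal_hue():

```r
library(scales)

# three evenly spaced hues from the default range
default_cols <- hue_pal()(3)

# three hues from the restricted range used above
narrow_cols <- hue_pal(h = c(0, 90))(3)

# three hues with a shifted starting point
shifted_cols <- hue_pal(h.start = 20)(3)

default_cols
narrow_cols
shifted_cols
```

Each call returns a character vector of hex codes, which could also be passed to scale_color_manual() through its values argument.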

We have already used scale_color_manual() to alter color scales, but note that you can use hex codes and rgb() to specify the colors:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point() +
  scale_color_manual(values=c(rgb(.5,.5,.2), "#FF1035", "blue"))

manually changing colors with hex codes and rgb()

ColorBrewer

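ColorBrewer is a webpage resource designed by Cynthia Brewer that lists many color schemes designed for different purposes: sequential schemes for ordered data that progress from low to high, diverging schemes that emphasize departures from a midpoint, and qualitative schemes for unordered categories.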

The ColorBrewer palettes are not only designed to be highly functional, they are also very attractive, with colors that complement each other well.

The ColorBrewer palettes have been integrated into R, and are available in ggplot through scale_color_brewer() and scale_fill_brewer().

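Arguments to scale_color_brewer() and scale_fill_brewer(): type, one of "seq" (sequential), "div" (diverging), or "qual" (qualitative); and palette, the name of the palette or its number within the chosen type.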

We’ll use a sequential palette first, although it should not be used with race since race does not progress from low to high values:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point() +
  scale_color_brewer(type="seq", palette="RdPu")

Sequential palette not a great choice for race

Instead, we should use a qualitative palette with race:

ggplot(birthwt, aes(x=age, y=bwt, color=race)) +
  geom_point() +
  scale_color_brewer(type="qual", palette=8) # requests the 8th qualitative palette

Qualitative palette better for race
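
Palettes can also be requested by name rather than by number. The RColorBrewer package (installed as a dependency of the scales package that ggplot2 uses) lists the palette names in brewer.pal.info; this sketch pulls out the qualitative ones, with object names of our own:

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette; category is "seq", "div" or "qual"
qual_names <- rownames(brewer.pal.info[brewer.pal.info$category == "qual", ])
qual_names

# the hex codes for three colors of one qualitative palette
set2_cols <- brewer.pal(3, "Set2")
set2_cols
```

Any of these names can be passed as the palette argument, for example scale_color_brewer(palette="Set2").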

Create a new bar graph of data set birthwt, where we have counts of x=low colored by fill=race. Use scale_fill_hue() and scale_fill_brewer() to adjust the color scales to your liking.

The ggplot2 book

For more in-depth information, read ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham, creator of the ggplot2 package.

This section of the seminar describing the grammar summarizes bits and pieces of chapter 3.

Additional exercises

New data set hsb

For the final set of exercises, we will be using a data set stored on the UCLA IDRE website, which we load with the following code:

hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")

This data set contains demographic and academic data for 200 high school students. We will be using the following variables:

Use the following code to load the hsb data set:
hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")

Exercise 1

Create a graph of boxplots of the variable math across levels of the variable honors. Color the inside of the boxes by female. Change the inside colors to “blue” and “gold”.

ggplot(hsb, aes(x=honors, y=math, fill=female)) +
  geom_boxplot() +
  scale_fill_manual(values=c("blue", "gold"))

Exercise 2

Create a bar graph that displays the counts of students that fall into groups made up of the following 4 variables: female, prog, schtyp, ses. For example, from such a graph we can know how many female students in the academic program who go to public school and are of high socioeconomic status are in the data set.

Hint?

# don't forget to use position="dodge" for counts
ggplot(hsb, aes(x=female, fill=prog)) +
  geom_bar(position="dodge", width=.5) +
  facet_grid(schtyp ~ ses)

Exercise 3

Try to recreate this graph:

Note that the background has been entirely removed and that the axis and legend titles are red and in “mono” font.

This is just one solution:

ggplot(hsb, aes(x=read, y=write, color=math)) +
  geom_point() +
  geom_smooth(color="red") +
  labs(x="Reading Score", y="Writing Score", color="Math Score") +
  theme(title=element_text(family="mono", color="red"),
        panel.background=element_blank())

END THANK YOU!