Previous chapter
ggplot2ggplot2
Next chapter

Welcome

Data visualization is one of the most important tools for data science. It is also a great way to start learning R; when you visualize data, you get an immediate payoff that will keep you motivated as you push through the initial frustrations of learning a language.

This tutorial will teach you how to begin visualizing data with ggplot2, which is one of the most popular packages in R for visualizing data.

ggplot2 package

ggplot2{target=“_blank“} is an R package by Hadley Wickham and implements the Grammar of Graphics{target=”_blank“} inspired by Wilkinson. Instead of procedural base-R plot() we can specify a graph declaratively. This approach is especially useful if we want to produce good looking plots quickly (instead of controlling each detail). ggplot2 can also be beneficial if many plots sharing the same coordinate system need to be drawn on one device (see also the lattice{target=”_blank“} package).

ggplot2 Terminology

The tutorial focuses on three essential skills:

  1. How to create graphs with a reusable ggplot2 template
  2. How to add variables to a graph with aesthetics
  3. How to select the “type” of your graph with geoms

A graph can be built from a given data set by specifying aesthetics and geoms. Attributes in the data set are mapped to visual properties (aesthetics) of geoms. aesthetics can be properties like size, color, x and y. geoms specify shape of data in plots. Optional elements include facets, statistics, coordinates and themes.

Acknowledgments

These examples are excerpted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. This tutorial uses the core tidyverse packages (http://tidyverse.org/), including ggplot2, which have been pre-loaded for your convenience.

The Grammar of Graphics

The best way to understand the Grammar of Graphics is to see it explained in action:

A code template

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Query

Let’s begin the tutorial with a question to explore.

mpg

You can test your answer with two things: the mpg dataset that comes in ggplot2 and a plot. mpg contains observations collected on 38 models of cars by the US Environmental Protection Agency. Among the variables in mpg are:

  1. manufacturer, manufacturer name
  2. model, model name
  3. displ, a car’s engine size, in litres
  4. year, year of manufacture
  5. cyl, number of cylinders
  6. trans, type of transmission
  7. drv, f = front-wheel drive, r = rear wheel drive, 4 = 4wd
  8. cty, city miles per gallon
  9. hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg).
  10. fl, fuel type
  11. class, “type” of car

A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

To see the mpg data frame, type mpg in the code block below and click “Submit Answer”.

A plot

The code below uses functions from the ggplot2 package to plot the relationship between displ and hwy.

Let’s look at the plot and then talk about the code itself. To see the plot, click “Submit Answer” to run the code below.

And the answer is…

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). Points that have a large value of displ have a small value of hwy and vice versa.

In other words, cars with big engines use more fuel. If that was your hypothesis, you were right!

Now let’s look at how we made the plot.

ggplot()

Here’s the code that we used to make the plot. Notice that it contains three functions: ggplot(), geom_point(), and aes().

The first function, ggplot(), creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph.

By itself, ggplot(data = mpg) creates an empty graph, but it is not very interesting so I’m not going to show it here.

geom_point()

geom_point() adds a layer of points to the empty plot created by ggplot(). As a result, we get a scatterplot.

mapping = aes()

geom_point() takes a mapping argument, which defines which variables in your dataset are mapped to which axes in your graph. The mapping argument is always paired with the function aes(), which you use to gather together all of the mappings that you want to create.

Here, we want to map the displ variable to the x axis and the hwy variable to the y axis, so we add x = displ and y = hwy inside of aes() (and we separate them with a comma). Where will ggplot2 look for these mapped variables? In the data frame that we passed to the data argument, in this case, mpg.

A graphing workflow

Our code follows the common workflow for making graphs with ggplot2. To make a graph, you:

  1. Start the graph with ggplot()
  2. Add elements to the graph with a geom_ function
  3. Select variables with the mapping = aes() argument

A graphing template

In fact, you can turn our code into a reusable template for making graphs. To make a graph, replace the bracketed sections in the code below with a data set, a geom_ function, or a collection of mappings.

Give it a try! Replace the bracketed sections <DATA>, <GEOM_FUNCTION> and <MAPPINGS> with mpg, geom_boxplot, and x = class, y = hwy to make a slightly different graph.

Common problems

As you start to run R code, you’re likely to run into problems. Don’t worry — it happens to everyone. We have been writing R code for years, and every day we still write code that doesn’t work!

Start by carefully comparing the code that you’re running to the code in the examples. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Also pay attention to capitalization; R is case sensitive.

+ location

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of a line, not the start. In other words, make sure you haven’t accidentally written code like this:

ggplot(data = mpg) 
+ geom_point(mapping = aes(x = displ, y = hwy))

help

If you’re still stuck, try the help. You can get help about any R function by running ?function_name in a code chunk, e.g. ?geom_point. Don’t worry if the help doesn’t seem that helpful — instead skip down to the bottom of the help page and look for a code example that matches what you’re trying to do.

If that doesn’t help, carefully read the error message that appears when you run your (non-working) code. Sometimes the answer will be buried there! But when you’re new to R, you might not yet know how to understand the error message. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.

Exercise 1

Run ggplot(data = mpg) what do you see?

Exercise 2

Make a scatterplot of cty vs hwy.

Exercise 3

What happens if you make a scatterplot of class vs drv. Try it. Why is the plot not useful?

Aesthetic mappings

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

A closer look

In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend between engine size and gas mileage. These cars have a higher mileage than you might expect. How can you explain these cars?

A hypothesis

Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular). To check this, we need to add the class variable to the plot.

Aesthetics

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to a new aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.

You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:

A strategy

We can add the class variable to the plot by mapping the levels of an aesthetic (like color) to the values of class. For example, we can color a point green if it belongs to the compact class, blue if it belongs to the midsize class, and so on.

Let’s give this a try. Fill in the blank piece of code below with color = class. What happens?

Hint: Be sure to remove all of the underlines and hashtags from the code.

And the answer is…

The colors reveal that many of the unusual points in mpg are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

This isn’t the only insight we’ve gleaned; you’ve also learned how to add new aesthetics to your graph. Let’s review the process.

Aesthetic mappings

To map an aesthetic to a variable, set the name of the aesthetic equal to the name of the variable, and do this inside mapping = aes(). ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable. ggplot2 will also add a legend that explains which levels correspond to which values.

This insight gives us a new way to think about the mapping argument. Mappings tell ggplot2 more than which variables to put on which axes, they tell ggplot2 which variables to map to which visual properties. The x and y locations of each point are just two of the many visual properties displayed by a point.

Other aesthetics

In the above example, we mapped color to class, but we could have mapped size to class in the same way.

Change the code below to map size to class. What happens?

Hint: If color controls the aesthetic, what word do you suppose controls the size aesthetic?

alpha

You can also map class to the alpha aesthetic, which controls the transparency of the points. Try it below.

Hint: If color controls the aesthetic, what word do you suppose controls the alpha aesthetic?

Shape

Let’s try one more aesthetic. This time map the class of the points to shape, then look for the SUVs. What happened?

Hint: If color controls the aesthetic, what word do you suppose controls the shape aesthetic?

Exercise 1

In the code below, map cty, which is a continuous variable, to color, size, and shape. How do these aesthetics behave differently for continuous variables, like cty, vs. categorical variables, like class?

Exercise 2

Map class to color, size, and shape all in the same plot. Does it work?

Hint: Be sure to set each mapping separately, e.g. color = class, size = class, etc.

Exercise 3

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Try it.

Setting aesthetics

What if you just want to make all of the points in your plot blue, like in the plot below?

You can do this by setting the color aesthetic outside of the aes() function, like this

Setting vs. Mapping

Setting works for every aesthetic in ggplot2. If you want to manually set the aesthetic to a value in the visual space, set the aesthetic outside of aes().

If you want to map the aesthetic to a variable in the data space, map the aesthetic inside aes().

Exercise 4

What do you think went wrong in the code below? Fix the code so it does something sensible.

Recap

For each aesthetic, you associate the name of the aesthetic with a variable to display, and you do this within aes().

Once you map a variable to an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.

You’ve experimented with the most common aesthetics for points: x, y, color, size, alpha and shape. Each geom uses its own set of aesthetics (you wouldn’t expect a line to have a shape, for example). To find out which aesthetics a geom uses, open its help page, e.g. ?geom_line.

This raises a new question that we’ve only brushed over: what is a geom?

Geometric objects

Geoms

How are these two plots similar?

Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.

A geom is the geometrical object that a plot uses to represent observations. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.

As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.

Geom functions

To change the geom in your plot, change the geom function that you add to ggplot(). For example, take this code which makes the plot on the left (above), and change geom_point() to geom_smooth(). What do you get?

More about geoms

ggplot2 provides over 30 geom functions that you can use to make plots, and extension packages provide even more (see https://www.ggplot2-exts.org for a sampling).

Until then, the best way to get a comprehensive overview of the available geoms is with the ggplot2 cheatsheet. To learn more about any single geom, look at its help page, e.g. ?geom_smooth.

Available ggplot2 geoms

geomDescription
geom_point()Create Scatterplot
geom_line()Create Lines
geom_bar()Bar Chart
geom_histogram()Histogram
geom_boxplot()Box Whisker Plot

geom_point()

geom_line()

geom_bar()

geom_histogram()

geom_boxplot()

Exercise 1

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

Exercise 2

Facets

Split plot into facets, each displaying one data subset

  • facet_wrap(~<var>) for single variable subsets
  • facet_grid(<var1>~<var2>) for bivariate subsets
  • Subset plots use one common scale

Facet Wrap Example

Facet Grid Example

Putting it all together

The ideas that you’ve learned here: geoms, aesthetics, and the implied existence of a data space and a visual space combine to form a system known as the Grammar of Graphics.

The Grammar of Graphics provides a systematic way to build any graph, and it underlies the ggplot2 package. In fact, the first two letters of ggplot2 stand for “Grammar of Graphics”.

The ggplot2 package

Throughout this tutorial, we’ve referred to ggplot2 as a package. What does that mean?

The R language is subdivided into packages, small collections of data sets and functions that all focus on a single task. The functions that we used in this tutorial come from one of those packages, the ggplot2 package, which focuses on visualizing data. See the packages section in the next Programming Basics section for a more detailed description of packages.