Data Wrangling with dplyr: Deriving Information with dplyr

Welcome

In this case study, you will identify the most popular American names from 1880 to 2015. While doing this, you will master three more dplyr functions:

  • mutate(), group_by(), and summarize(), which help you use your data to compute new variables and summary statistics

These are some of the most useful R functions for data science, and this tutorial provides everything you need to learn them.

This tutorial uses the core tidyverse packages, including ggplot2, tibble, and dplyr, as well as the babynames package. All of these packages have been pre-installed and pre-loaded for your convenience.


summarise()

summarise() takes a data frame and uses it to calculate a new data frame of summary statistics.

Syntax

To use summarise(), pass it a data frame and then one or more named arguments. Each named argument should be set to an R expression that generates a single value. summarise() will turn each named argument into a column in the new data frame. The name of each argument will become the column name, and the value returned by the argument will become the column contents.

Example

I used summarise() above to calculate the total number of boys named “Mario”, but let’s expand that code to also calculate

  • max - the maximum number of boys named “Mario” in a single year
  • mean - the mean number of boys named “Mario” per year

babynames %>% 
  filter(name == "Mario", sex == "M") %>% 
  summarise(total = sum(n), max = max(n), mean = mean(n))
## # A tibble: 1 x 3
##    total   max  mean
##    <int> <int> <dbl>
## 1 144018  2935 1210.

Don’t let the code above fool you. The first argument of summarise() is always a data frame, but when you use summarise() in a pipe, the first argument is provided by the pipe operator, %>%. Here the first argument will be the data frame that is returned by babynames %>% filter(name == "Mario", sex == "M").
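
To make the role of the pipe concrete, here is a minimal sketch on a small made-up data frame. The toy table and its values are assumptions for illustration, not rows from babynames. It compares passing the data frame explicitly with letting %>% supply it.

```r
library(dplyr)

# Hypothetical toy data standing in for a filtered babynames table
toy <- tibble(
  year = c(2000, 2001, 2002),
  n    = c(10L, 20L, 30L)
)

# Data frame passed explicitly as the first argument
explicit <- summarise(toy, total = sum(n), max = max(n), mean = mean(n))

# Data frame supplied by the pipe operator
piped <- toy %>%
  summarise(total = sum(n), max = max(n), mean = mean(n))

# Compare the two results
identical(explicit, piped)
```

Both forms call the same function with the same arguments; the pipe only changes how the code reads.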

Exercise - summarise()

Use the code chunk below to compute three statistics:

  1. the total number of children who ever had your name
  2. the maximum number of children given your name in a single year
  3. the mean number of children given your name per year

If you cannot think of an R function that would compute each statistic, click the Hint/Solution button.

Summary functions

So far our summarise() examples have relied on sum(), max(), and mean(). But you can use any function in summarise() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions and they are common in the field of descriptive statistics. Some of the most useful summary functions include:

  1. Measures of location - mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
  2. Measures of spread - sd(x), var(x), IQR(x), and mad(x)
  3. Measures of position - first(x), nth(x, 2), and last(x)
  4. Counts - n_distinct(x) and n(), which takes no arguments, and returns the size of the current group or data frame.
  5. Counts and proportions of logical values - sum(!is.na(x)), which counts the number of TRUEs returned by a logical test; mean(y == 0), which returns the proportion of TRUEs returned by a logical test.
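
As a rough illustration of the categories above, the sketch below applies one function from each to a made-up table. The data frame toy and the column count are hypothetical names chosen for this example.

```r
library(dplyr)

# Hypothetical toy data; `count` contains one missing value
toy <- tibble(
  name  = c("Ann", "Bob", "Ann", "Cy", "Bob"),
  count = c(5L, 0L, 7L, NA, 3L)
)

stats <- toy %>%
  summarise(
    median_count = median(count, na.rm = TRUE),    # measure of location
    sd_count     = sd(count, na.rm = TRUE),        # measure of spread
    first_name   = first(name),                    # measure of position
    rows         = n(),                            # number of rows
    names        = n_distinct(name),               # number of distinct names
    non_missing  = sum(!is.na(count)),             # count of TRUEs
    prop_zero    = mean(count == 0, na.rm = TRUE)  # proportion of TRUEs
  )

stats
```

Each expression returns a single value, so the result is a one-row data frame with seven columns.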

Let’s apply some of these summary functions.

Khaleesi challenge

“Khaleesi” is a very modern name that appears to be based on the Game of Thrones TV series, which premiered on April 17, 2011. In the chunk below, filter babynames to just the rows where name == "Khaleesi". Then use summarise() and a summary function to return the first value of year in the data set.

Distinct name challenge

In the chunk below, use summarise() and a summary function to return a data frame with two columns:

  • A column named n that displays the total number of rows in babynames
  • A column named distinct that displays the number of distinct names in babynames

Will these numbers be different? Why or why not?

summarise by groups?

How can we apply summarise() to find the most popular names in babynames? You’ve seen how to calculate the total number of children that have your name, which provides one of our measures of popularity, i.e. the total number of children that have a name:

babynames %>% 
  filter(name == "Mario", sex == "M") %>% 
  summarise(total = sum(n))

However, we had to isolate your name from the rest of the data to calculate this number. You could imagine writing a program that goes through each name one at a time and:

  1. filters the data down to just the rows with that name
  2. applies summarise() to those rows

Eventually, the program could combine all of the results back into a single data set. However, you don’t need to write such a program; this is the job of dplyr’s group_by() function.
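
The sketch below contrasts the hand-written program described above with the grouped approach, using a small made-up table (toy and its values are assumptions for illustration). It is one possible way to write the manual version, not the only one.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name = c("Ann", "Bob", "Ann", "Bob", "Cy"),
  n    = c(5L, 2L, 7L, 3L, 1L)
)

# Manual approach: handle one name at a time, then stack the results
manual <- bind_rows(lapply(unique(toy$name), function(nm) {
  toy %>%
    filter(name == nm) %>%
    summarise(name = nm, total = sum(n))
}))

# Grouped approach: group_by() handles the splitting and combining
grouped <- toy %>%
  group_by(name) %>%
  summarise(total = sum(n))

manual
grouped
```

Both versions produce one row per name with its total, but the grouped version is shorter and leaves the bookkeeping to dplyr.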

group_by()

group_by() takes a data frame and then the names of one or more columns in the data frame. It returns a copy of the data frame that has been “grouped” into sets of rows that share identical combinations of values in the specified columns.

group_by() in action

For example, the result below is grouped into rows that have the same combination of year and sex values: boys in 1880 are treated as one group, girls in 1880 as another group and so on.

babynames %>%
  group_by(year, sex)
## # A tibble: 1,924,665 x 5
## # Groups:   year, sex [276]
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

Using group_by()

By itself, group_by() doesn’t do much. It assigns grouping criteria that are stored as metadata alongside the original data set. If your data set is a tibble, as above, R will tell you that the data is grouped at the top of the tibble display. In all other respects, the data looks the same.

However, when you apply a dplyr function like summarise() to grouped data, dplyr will execute the function in a groupwise manner. Instead of computing a single summary for the entire data set, dplyr will compute individual summaries for each group and return them as a single data frame. The data frame will contain the summary columns as well as the columns in the grouping criteria, which makes the result decipherable:
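
As a minimal sketch of a groupwise summary, the code below uses a small made-up table in place of babynames (toy and its values are assumptions for illustration). It computes the same total once per year and sex group and once for the whole table.

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881),
  sex  = c("F", "F", "M", "F", "M"),
  n    = c(10L, 20L, 5L, 7L, 8L)
)

# One summary row per combination of year and sex
by_group <- toy %>%
  group_by(year, sex) %>%
  summarise(total = sum(n))

# Without group_by(): a single summary row for the whole table
overall <- toy %>%
  summarise(total = sum(n))

by_group
overall
```

The grouped result keeps the year and sex columns, so each total can be traced back to its group.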

To understand exactly what group_by() is doing, remove the line group_by(year, sex) %>% from the code above and rerun it. How do the results change?

Ungrouping 1

If you apply summarise() to grouped data, summarise() will return data that is grouped in a similar, but not identical fashion. summarise() will remove the last variable in the grouping criteria, which creates a data frame that is grouped at a higher level. For example, this summarise() statement receives a data frame that is grouped by year and sex, but it returns a data frame that is grouped only by year.

babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n))
## # A tibble: 276 x 3
## # Groups:   year [138]
##     year sex    total
##    <dbl> <chr>  <int>
##  1  1880 F      90993
##  2  1880 M     110491
##  3  1881 F      91953
##  4  1881 M     100743
##  5  1882 F     107847
##  6  1882 M     113686
##  7  1883 F     112319
##  8  1883 M     104627
##  9  1884 F     129020
## 10  1884 M     114442
## # … with 266 more rows

Ungrouping 2

If only one grouping variable is left in the grouping criteria, summarise() will return an ungrouped data set. This feature lets you progressively “unwrap” a grouped data set.

If we add another summarise() to our pipe:

  1. Our data set will first be grouped by year and sex.
  2. Then it will be summarised into a data set grouped by year (i.e. the result above).
  3. Then it will be summarised into a final data set that is not grouped.

babynames %>%
  group_by(year, sex) %>% 
  summarise(total = sum(n)) %>% 
  summarise(total = sum(total))
## # A tibble: 138 x 2
##     year  total
##    <dbl>  <int>
##  1  1880 201484
##  2  1881 192696
##  3  1882 221533
##  4  1883 216946
##  5  1884 243462
##  6  1885 240854
##  7  1886 255317
##  8  1887 247394
##  9  1888 299473
## 10  1889 288946
## # … with 128 more rows
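
One way to check which grouping level you are at is dplyr's group_vars(), which returns the names of the current grouping columns. The sketch below runs it on a made-up table (toy is a hypothetical stand-in for babynames). The .groups = "drop_last" argument spells out the behavior described above, assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  year = c(1880, 1880, 1881, 1881),
  sex  = c("F", "M", "F", "M"),
  n    = c(10L, 5L, 7L, 8L)
)

grouped <- toy %>% group_by(year, sex)
group_vars(grouped)   # grouped by year and sex

# First summarise(): the last grouping variable (sex) is dropped
once <- grouped %>%
  summarise(total = sum(n), .groups = "drop_last")
group_vars(once)      # grouped by year only

# Second summarise(): the remaining grouping variable is dropped
twice <- once %>%
  summarise(total = sum(total))
group_vars(twice)     # no grouping variables left
```

Printing group_vars() after each step makes the progressive unwrapping visible without reading the tibble header.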

Ungrouping 3

If you wish to manually remove the grouping criteria from a data set, you can do so with ungroup().

babynames %>%
  group_by(year, sex) %>% 
  ungroup()
## # A tibble: 1,924,665 x 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

Ungrouping 4

You can also override the current grouping information with a new call to group_by().

babynames %>%
  group_by(year, sex) %>% 
  group_by(name)
## # A tibble: 1,924,665 x 5
## # Groups:   name [97,310]
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

That’s it. Between group_by(), summarise(), and ungroup(), you have a toolkit for taking groupwise summaries of your data at various levels of grouping.

mutate()

The total number of children by year

Why might there be a difference between the proportion of children who receive a name over time, and the number of children who receive the name?

An obvious culprit could be the total number of children born per year. If more children are born each year, the number of children who receive a name could grow even if the proportion of children given that name declines.

Test this theory in the chunk below. Use babynames and groupwise summaries to compute the total number of children born each year and then to plot that number vs. year in a line graph.

Popularity based on rank

The graph above suggests that our first definition of popularity is confounded with population growth: the most popular names in 2015 likely represent far more children than the most popular names in 1880. The total number of children given a name may still be the best definition of popularity to use, but it will overweight names that have been popular in recent years.

There is also evidence that our definition is confounded with a gender effect: only one of the top ten names was a girl’s name.

If you are concerned about these things, you might prefer to use our second definition of popularity, which would give equal representation to each year and gender:

  1. Ranks - A name is popular if it consistently ranks among the top names from year to year.

To use this definition, we could:

  1. Compute the rank of each name within each year and gender. The most popular name would receive the rank 1 and so on.
  2. Find the median rank for each name, accounting for gender. The names with the lowest median would be the names that “consistently rank among the top names from year to year.”

To do this, we will need to learn one last dplyr function.

mutate()

mutate() uses a data frame to compute new variables. It then returns a copy of the data frame that includes the new variables. For example, we can use mutate() to compute a percent variable for babynames. Here percent is just the prop multiplied by 100 and rounded to two decimal places.

babynames %>%
  mutate(percent = round(prop * 100, 2))
## # A tibble: 1,924,665 x 6
##     year sex   name          n   prop percent
##    <dbl> <chr> <chr>     <int>  <dbl>   <dbl>
##  1  1880 F     Mary       7065 0.0724    7.24
##  2  1880 F     Anna       2604 0.0267    2.67
##  3  1880 F     Emma       2003 0.0205    2.05
##  4  1880 F     Elizabeth  1939 0.0199    1.99
##  5  1880 F     Minnie     1746 0.0179    1.79
##  6  1880 F     Margaret   1578 0.0162    1.62
##  7  1880 F     Ida        1472 0.0151    1.51
##  8  1880 F     Alice      1414 0.0145    1.45
##  9  1880 F     Bertha     1320 0.0135    1.35
## 10  1880 F     Sarah      1288 0.0132    1.32
## # … with 1,924,655 more rows

Exercise - mutate()

The syntax of mutate() is similar to that of summarise(). mutate() first takes a data frame, and then one or more named arguments that are set equal to R expressions. mutate() turns each named argument into a column. The name of the argument becomes the column name and the result of the R expression becomes the column contents.

Use mutate() in the chunk below to create a births column, the result of dividing n by prop. You can think of births as a sanity check; it uses each row to double check the number of boys or girls that were born each year. If all is well, the numbers will agree across rows (allowing for rounding errors).

Vectorized functions

Like summarise(), mutate() works in combination with a specific type of function. summarise() expects summary functions, which take a vector of input and return a single value. mutate() expects vectorized functions, which take a vector of input and return a vector of values.

In other words, summary functions like min() and max() won’t work well with mutate(). You can see why if you take a moment to think about what mutate() does: mutate() adds a new column to the original data set. In R, every column in a dataset must be the same length, so mutate() must supply as many values for the new column as there are in the existing columns.

If you give mutate() an expression that returns a single value, it will follow R’s recycling rules and repeat that value as many times as needed to fill the column. This can make sense in some cases, but the reverse is never true: you cannot give summarise() a vectorized function; summarise() needs its input to return a single value.
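
The following sketch shows the recycling behavior on a made-up table (toy and its column values are assumptions for illustration). It mixes a vectorized expression, a summary function, and a combination of the two inside one mutate() call.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name = c("Ann", "Bob", "Cy"),
  n    = c(5L, 2L, 7L)
)

result <- toy %>%
  mutate(
    doubled = n * 2L,      # vectorized: one output value per row
    biggest = max(n),      # summary: a single value, repeated on every row
    share   = n / sum(n)   # sum(n) is a single value; the division is vectorized
  )

result
```

The share column is a case where recycling a summary value is useful: each row is compared against a total computed from the whole column.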

What are some of R’s vectorized functions?

The most useful vectorized functions

Some of the most useful vectorized functions in R to use with mutate() include:

  1. Arithmetic operators - +, -, *, /, ^. These are all vectorized, using R’s so-called “recycling rules”. If one vector of input is shorter than the other, it will automatically be repeated multiple times to create a vector of the same length.
  2. Modular arithmetic - %/% (integer division) and %% (remainder)
  3. Logical comparisons - <, <=, >, >=, ==, !=
  4. Logs - log(x), log2(x), log10(x)
  5. Offsets - lead(x), lag(x)
  6. Cumulative aggregates - cumsum(x), cumprod(x), cummin(x), cummax(x), cummean(x)
  7. Ranking - min_rank(x), row_number(x), dense_rank(x), percent_rank(x), cume_dist(x), ntile(x, n)

For ranking, I recommend that you use min_rank(), which gives the smallest values the top ranks. To rank in descending order, use the familiar desc() function, e.g.

min_rank(c(50, 100, 1000))
## [1] 1 2 3
min_rank(desc(c(50, 100, 1000)))
## [1] 3 2 1
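
The ranking functions differ mainly in how they treat ties. The short sketch below applies three of them to a made-up vector containing a tie (the values are chosen only for illustration).

```r
library(dplyr)

# Hypothetical values with a tie between the second and third elements
x <- c(10, 20, 20, 30)

r_min   <- min_rank(x)        # ties share the smallest rank; the next rank is skipped
r_dense <- dense_rank(x)      # ties share a rank; no ranks are skipped
r_row   <- row_number(x)      # ties are broken by position, so every rank is unique
r_desc  <- min_rank(desc(x))  # descending: the largest value gets rank 1

r_min
## [1] 1 2 2 4
r_dense
## [1] 1 2 2 3
r_row
## [1] 1 2 3 4
r_desc
## [1] 4 2 2 1
```

With min_rank(), tied names share a rank and the following rank is skipped, which matches the way sports standings are usually reported.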

Exercise - Ranks

Let’s practice by ranking the entire dataset based on prop. In the chunk below, use mutate() and min_rank() to rank each row based on its prop value, with the highest values receiving the top ranks.

Rankings by group

In the previous exercise, we assigned rankings across the entire data set. For example, with the exception of ties, there was only one 1 in the entire data set, only one 2, and so on. To calculate a popularity score across years, you will need to do something different: you will need to assign rankings within groups of year and sex.

To rank within groups, combine mutate() with group_by(). Like dplyr’s other functions, mutate() will treat grouped data in a group-wise fashion.
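
Here is a minimal sketch of ranking within groups and then summarising the ranks, following the two-step plan for the rank-based definition of popularity. The table toy and its prop values are made up for illustration, and .groups = "drop" assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: three names over two years
toy <- tibble(
  year = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex  = "F",
  name = c("Mary", "Anna", "Emma", "Mary", "Anna", "Emma"),
  prop = c(0.07, 0.03, 0.02, 0.06, 0.01, 0.02)
)

# Step 1: rank names within each year and sex (rank 1 = highest prop)
ranked <- toy %>%
  group_by(year, sex) %>%
  mutate(rank = min_rank(desc(prop))) %>%
  ungroup()

# Step 2: median rank per name and sex, lowest median first
median_ranks <- ranked %>%
  group_by(name, sex) %>%
  summarise(median_rank = median(rank), .groups = "drop") %>%
  arrange(median_rank)

ranked
median_ranks
```

Because the ranking restarts in every year and sex group, each group has its own rank 1, which is what gives every year and gender equal weight.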

Add group_by() to our code from above to calculate rankings within year and sex combinations. Do you notice the numbers change?

Recap

In this primer, you learned three functions for isolating data within a table:

  • select()
  • filter()
  • arrange()

You also learned three functions for deriving new data from a table:

  • summarise()
  • group_by()
  • mutate()

Together these six functions create a grammar of data manipulation, a system of verbs that you can use to manipulate data in a sophisticated, step-by-step way. These verbs target the everyday tasks of data analysis. No matter which types of data you work with, you will discover that:

  1. Data sets often contain more information than you need
  2. Data sets imply more information than they display

The six dplyr functions help you work with these realities by isolating and revealing the information contained in your data. In fact, dplyr provides more than six functions for this grammar: dplyr comes with several functions that are variations on the themes of select(), filter(), summarise(), and mutate(). Each follows the same pipeable syntax that is used throughout dplyr. If you are interested, you can learn more about these peripheral functions in the dplyr cheatsheet.

Challenges

Apply your knowledge of dplyr to do the following two challenges.

Number Ones Challenge - boys

How many distinct boys’ names achieved a rank of Number 1 in any year?

Number Ones Challenge - girls

How many distinct girls’ names achieved a rank of Number 1 in any year?

Number Ones Challenge - Plot

number_ones is a vector of every boys’ name to achieve a rank of one.

Use number_ones with babynames to recreate the plot below, which shows the popularity over time for every name in number_ones.

Name Diversity Challenge - number of unique names

Which gender uses more names?

In the chunk below, calculate and then plot the number of distinct names used each year for boys and girls. Place year on the x axis and the number of distinct names on the y axis, and color the lines by sex.

Name Diversity Challenge - number of boys and girls

Let’s make sure that we’re not confounding our search with the total number of boys and girls born each year. With the chunk below, calculate and then plot over time the total number of boys and girls by year. Is the relative number of boys and girls constant?

Name Diversity Challenge - children per name

Hmm. Sometimes there are more girls and sometimes more boys. In addition, the entire population has grown over time. Let’s account for this with a new metric: the average number of children per name.

If girls have a smaller number of children per name, that would imply that they use more names overall (and vice versa).

In the chunk below, calculate and plot the average number of children per name by year and sex over time. How do you interpret the results?

Where to from here

Congratulations! You can use dplyr’s grammar of data manipulation to access any data associated with a table—even if that data is not currently displayed by the table.

In other words, you now know how to look at data in R, as well as how to access specific values, calculate summary statistics, and compute new variables. When you combine this with your visualization skills, you have everything that you need to begin exploring data in R.

The next tutorial will teach you the last of three basic skills for working with R:

  1. How to visualize data
  2. How to work with data
  3. How to program with R code