Using the %>% operator

The pipe operator %>% is a core part of the tidyverse. It chains multiple functions so that they run one after the other, with the output of each function becoming the input of the next. Imagine we have the pres_results data frame and want to create a smaller, clearer data frame to answer the question: In which states was the Democratic Party the most popular choice in the 2016 US presidential election? To accomplish this task we need to take the following steps:

  1. filter() the data frame for the rows where the year variable equals 2016.
  2. select() the two variables state and dem, since we are not interested in the rest of the columns.
  3. arrange() the filtered and selected data frame by the dem column in descending order.

These steps must run one after the other, with each function taking the output of the previous step as its input. Using what you have learned so far, you could accomplish this task as follows:

Input
result <- filter(pres_results, year==2016)
result <- select(result, state, dem)
result <- arrange(result, desc(dem))
result
Output
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows

The first function takes the pres_results data frame, filters it according to the task description, and assigns the output to the variable result. Each subsequent function then takes result as input and overwrites it with its own output.
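
To make the overwriting visible, the following minimal sketch inspects result after every step. It assumes dplyr is installed and uses a small made-up data frame, toy_results, as a hypothetical stand-in for pres_results; the state codes and vote shares are invented for illustration only.

```r
library(dplyr)

# Hypothetical toy data standing in for pres_results (all values are made up)
toy_results <- data.frame(
  year  = c(2012, 2016, 2016, 2016),
  state = c("AA", "AA", "BB", "CC"),
  dem   = c(0.50, 0.40, 0.60, 0.55),
  rep   = c(0.45, 0.55, 0.35, 0.40)
)

# Step 1: keep only the 2016 rows
result <- filter(toy_results, year == 2016)
nrow(result)

# Step 2: keep only the state and dem columns, overwriting result
result <- select(result, state, dem)
names(result)

# Step 3: sort by dem in descending order, overwriting result again
result <- arrange(result, desc(dem))
result$state
```

After each assignment the previous content of result is gone, so only the final, sorted data frame remains available.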

The %>% operator provides a practical way to combine the steps above into what looks like a single step. The pipeline starts with a data frame as the initial input, then applies a sequence of functions, passing the output of each function on as the input of the next. The same task as above can be accomplished using the pipe operator %>% like this:

Input
pres_results %>%
  filter(year==2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))
Output
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows

We can interpret the code in the following way:

  1. We define the original data set as a starting point.
  2. Placing the %>% operator right after the data frame tells R that a function call follows, which takes the data frame on the left as its input.
  3. We use each function as usual, but skip the first argument. The data frame input is provided automatically by the output of the previous step.
  4. As long as we add the %>% operator after a step, R expects an additional step.
  5. In our example the pipeline ends with an arrange() call. It gets the filtered and selected version of the pres_results data frame as input and sorts it by the dem column in descending order. Finally, it returns the output.
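
The rule behind these points is that the pipe passes whatever stands on its left as the first argument of the call on its right. The sketch below checks this on a small made-up data frame; toy_results and its values are hypothetical stand-ins for pres_results, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data standing in for pres_results (all values are made up)
toy_results <- data.frame(
  year  = c(2012, 2016, 2016, 2016),
  state = c("AA", "AA", "BB", "CC"),
  dem   = c(0.50, 0.40, 0.60, 0.55)
)

# The pipe supplies its left-hand side as the first argument of filter()
piped  <- toy_results %>% filter(year == 2016)
direct <- filter(toy_results, year == 2016)

# A longer pipeline corresponds to function calls nested inside each other
piped_sorted  <- toy_results %>% filter(year == 2016) %>% arrange(desc(dem))
nested_sorted <- arrange(filter(toy_results, year == 2016), desc(dem))

# Both comparisons check that the piped and non-piped forms agree
identical(piped, direct)
identical(piped_sorted, nested_sorted)
```

The piped version reads from left to right in the order the steps are executed, while the nested version has to be read from the inside out.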

One difference between the two approaches is that the %>% operator does not save the intermediate or final results. To save the resulting data frame, we need to assign the output to a variable:

Input
result <- pres_results %>%
  filter(year==2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

result
Output
# A tibble: 51 x 2
  state   dem
  <chr> <dbl>
1 DC    0.905
2 CA    0.617
3 HI    0.610
# … with 48 more rows
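
As a quick check that a pipeline leaves its input untouched, the sketch below runs one pipeline without assignment and one with assignment. The toy_results data frame and the name toy_2016 are hypothetical and stand in for pres_results and result; dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data standing in for pres_results (all values are made up)
toy_results <- data.frame(
  year  = c(2012, 2016, 2016, 2016),
  state = c("AA", "AA", "BB", "CC"),
  dem   = c(0.50, 0.40, 0.60, 0.55)
)

# Without an assignment the result is only printed, not stored
toy_results %>%
  filter(year == 2016) %>%
  select(state, dem)

# The original data frame still has all of its rows and columns
dim(toy_results)

# Assigning the pipeline to a name keeps the result for later use
toy_2016 <- toy_results %>%
  filter(year == 2016) %>%
  select(state, dem) %>%
  arrange(desc(dem))

dim(toy_2016)
```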
Create a data transformation pipeline