Selecting by name

select(my_data_frame, column_one, column_two, ...)
select(my_data_frame, new_column_name = current_column, ...)
select(my_data_frame, column_start:column_end)
select(my_data_frame, index_one, index_two, ...)
select(my_data_frame, index_start:index_end)

In this chapter we will look at the pres_results dataset from the politicaldata package. It contains data about US presidential elections since 1976, converted to a tibble for nicer printing.

Output
# A tibble: 561 x 6
   year state total_votes   dem   rep   other
  <dbl> <chr>       <dbl> <dbl> <dbl>   <dbl>
1  1976 AK         123574 0.357 0.579 0.0549 
2  1976 AL        1182850 0.557 0.426 0.0163 
3  1976 AR         767535 0.650 0.349 0.00134
# … with 558 more rows

For this example, we will look at the total number of votes cast in different states at different elections. Since we are only interested in the number of people who voted, we would like to create a custom version of the pres_results data frame that contains only the columns year, state and total_votes. To pick out columns like this, we can use the select() function from the dplyr package.

The select() function takes a data frame as an input parameter and lets us decide which of the columns we want to keep from it. The output of the function is a data frame with all rows, but containing only the columns we explicitly select.
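
The behaviour described above can be checked on a small hand-made data frame. The following is only a minimal sketch: the data frame `votes` and all of its values are invented for illustration and are not part of pres_results.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration only
votes <- data.frame(
  year        = c(2000, 2004, 2008),
  state       = c("AA", "BB", "CC"),
  total_votes = c(100, 200, 300),
  other       = c(0.01, 0.02, 0.03)
)

# Keep only two of the four columns
selected <- select(votes, year, total_votes)

nrow(votes) == nrow(selected)  # every row is kept
names(selected)                # only the selected columns remain
is.data.frame(selected)        # the result is still a data frame
```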

We can reduce our dataset to only year, state and total_votes in the following way:

Input
select(pres_results, year, state, total_votes)
Output
# A tibble: 561 x 3
   year state total_votes
  <dbl> <chr>       <dbl>
1  1976 AK         123574
2  1976 AL        1182850
3  1976 AR         767535
# … with 558 more rows

As the first parameter we passed the pres_results data frame; as the remaining parameters we passed the names of the columns we want to keep.

Apart from keeping the columns we want, the select() function also returns them in the order in which we listed them in the function call.

If we change the order of the parameters when we call the function, the columns of the output change accordingly:

Input
select(pres_results, total_votes, year, state)
Output
# A tibble: 561 x 3
  total_votes  year state
        <dbl> <dbl> <chr>
1      123574  1976 AK   
2     1182850  1976 AL   
3      767535  1976 AR   
# … with 558 more rows
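
The ordering rule can be verified in the same way. In this sketch the data frame `votes` is again made up for illustration; only its column names mirror those of pres_results.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration only
votes <- data.frame(
  year        = c(2000, 2004),
  state       = c("AA", "BB"),
  total_votes = c(100, 200)
)

# List the columns in a different order than they appear in the data
reordered <- select(votes, total_votes, year, state)

names(votes)      # original order: year, state, total_votes
names(reordered)  # order follows the arguments: total_votes, year, state
```

Note that select() returns a new data frame, so `votes` itself keeps its original column order.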