Previous chapter
Data Wrangling with dplyrSeparate Columns
Next chapter

Welcome

Data is easiest to analyze in R when it is stored in a tidy format. In the last tutorial, you learned how to tidy data that has an untidy layout, but there is another way that data sets can be untidy: a data set can combine multiple values in a single cell or spread a single value across multiple cells. This makes it difficult to extract and use values in your analysis.

This tutorial will teach you two tools that you can use to tidy this type of data:

  • separate() - which separates a column of cells into multiple columns
  • unite() - which combines multiple columns of cells into a single column

It ends with a case study that requires you to use all of the tidy tools to wrangle a messy real world data set.

This tutorial uses the core tidyverse packages, including tidyr. All of these packages have been pre-installed and pre-loaded for your convenience.

Click the Next Topic button to begin.

separate()

hurricanes

The hurricanes data set contains historical information about five hurricanes. At first glance it appears to contain four variables: name, wind_speed, pressure, and date. However, there are three more variables hidden in plain sight. Can you spot them?

## # A tibble: 6 x 4
##   name    wind_speed pressure date      
##   <chr>        <dbl>    <dbl> <chr>     
## 1 Alberto        110     1007 2000-08-03
## 2 Alex            45     1009 1998-07-27
## 3 Allison         65     1005 1995-06-03
## 4 Ana             40     1013 1997-06-30
## 5 Arlene          50     1010 1999-06-11
## 6 Arthur          45     1010 1996-06-17

Dates

Did you realize that dates are a combination of multiple variables? They are.

You’ll almost always display these variables together to make a date, because a date is itself a variable—one that conveys more than the sum of its parts.

However, there are times where it is convenient to treat each element of a date separately. For example, what if you wanted to filter hurricanes to just the storms that occurred in June (i.e. month == 6)? Then it would be convenient to reorganize the data to look like this.

## # A tibble: 6 x 6
##   name    wind_speed pressure  year month   day
##   <chr>        <dbl>    <dbl> <int> <int> <int>
## 1 Alberto        110     1007  2000     8     3
## 2 Alex            45     1009  1998     7    27
## 3 Allison         65     1005  1995     6     3
## 4 Ana             40     1013  1997     6    30
## 5 Arlene          50     1010  1999     6    11
## 6 Arthur          45     1010  1996     6    17

But how could you do it?

separate()

You can separate the elements of date with the separate() function. separate() divides a column of values into multiple columns that each contain a portion of the original values.

Run the code below to see separate() in action. Then click continue to learn about the syntax.

Syntax

Let’s rewrite our above command without the pipe, to make the syntax of separate easier to see.

separate(hurricanes, col = date, into = c("year","month","day"), sep = "-")

separate() takes a data frame and then the name of a column in the data frame to separate. Here our code will separate the date column of the hurricane data set.

The sep = "-" argument tells separate() to split each value in date wherever a - appears. You can choose to split on any character or character string.

Separating on - will split each date into three dates: a year, month, and day. As a result, separate() will need to add three new columns to the result. The into argument gives separate() a character vector of names to use for the new columns. Since the result will have three new columns, this vector will need to have three new names. separate() will provide an error message if it ends up creating fewer or more columns than column names.

Defaults

By default separate() will separate values at the location of any non-alphanumeric character, like -, ,, /, etc. So for example, we could run our code without the sep = "-" argument and—in this case—get the same result.

Or will we? Do a quick mental check and then run the code to see if you are right.

Separating by position

If you set sep equal to an integer, separate() will split the values at the location indicated by the integers. For example,

  • sep = 1 will split the values after the first character
  • sep = -2 will split the values after the second to last character, no matter how many characters appear in the value. In other words, it will split off the last character of each value.
  • sep = c(2, 4, 6) will split the values after the second, fourth, and sixth characters, creating four sub-values

Think you have it? Create this version of hurricanes by adding a second call to separate() that uses an integer separator to the code below:

## # A tibble: 6 x 7
##   name    wind_speed pressure century year  month day  
##   <chr>        <dbl>    <dbl> <chr>   <chr> <chr> <chr>
## 1 Alberto        110     1007 20      00    08    03   
## 2 Alex            45     1009 19      98    07    27   
## 3 Allison         65     1005 19      95    06    03   
## 4 Ana             40     1013 19      97    06    30   
## 5 Arlene          50     1010 19      99    06    11   
## 6 Arthur          45     1010 19      96    06    17
## [1] "Good job! Notice that when you separate on a character, separate() removes that character from the results. When you separate on a location, separate() retains all of the characters in the result."

Quiz - What if

Would these two commands return the same result? Why or why not? Once you have an answer, run the code below to see if you were right.

Convert

You may have noticed that separate() returns its results as columns of character strings. However, in some cases, like ours, the columns will contain integers, doubles, or other types of non-character data.

You can ask separate() to convert the new columns to an appropriate data type by adding convert = TRUE to your separate() call. This is identical to the convert = TRUE argument of gather().

Identify the data types of year, month, and day (they appear under the column names) in the output below. Then add convert = TRUE and re-run the code. What changes?

Remove

Let’s take a look at one last argument for separate(). If you add remove = FALSE to your separate() call, R will retain the original column in the results.

hurricanes %>% 
  separate(col = date, into = c("year","month","day"), convert = TRUE, remove = FALSE)
## # A tibble: 6 x 7
##   name    wind_speed pressure date        year month   day
##   <chr>        <dbl>    <dbl> <chr>      <int> <int> <int>
## 1 Alberto        110     1007 2000-08-03  2000     8     3
## 2 Alex            45     1009 1998-07-27  1998     7    27
## 3 Allison         65     1005 1995-06-03  1995     6     3
## 4 Ana             40     1013 1997-06-30  1997     6    30
## 5 Arlene          50     1010 1999-06-11  1999     6    11
## 6 Arthur          45     1010 1996-06-17  1996     6    17

unite()

unite()

You can do the inverse of separate() with unite(). unite() uses multiple input columns to create a single output column. It builds this column by pasting together the cells of the input column with a separator.

Syntax

hurricanes %>%
  separate(date, c("year", "month", "day"), sep = "-") %>%
  unite(col = "date", month, day, year, sep = ":")

Notice that the syntax of unite() is the inverse of separate():

  • The first argument is a character string: the name of the new column that unite() will make
  • The arguments that follow are the columns to be combine into the new column. You can list as many columns as you like, their names do not need to be in quotes, and each name is listed as its own argument.

Exercise - Separate and Unite

Use separate() and unite() to rewrite the dates in hurricanes in the format below:

Exercise - Separate and Unite 2

Use the chunk below to:

  1. Use separate to isolate the first two digits of each date as “century”
  2. Filter the data to just rows where century == 19. These will be storms that occurred in the 1900’s.
  3. Use unite() to return the results to the original date format. Hint: you can set sep = "" to avoid including a separator character when uniting.

Tidy data

So far we’ve separated and united date, a variable that contains legitimate sub-variables. This is because it makes little sense to combine unrelated values within the same cells. However, many data sets follow this senseless practice. If you inherit one, you can use separate() and unite() to reorganize the values in a tidy fashion.

In the case study that follows, you will do just that. You will also practice using all of the tidyr functions as you do.

Case study

who

The who data set contains a subset of data from the World Health Organization Global Tuberculosis Report, available here.

In its original format, the data is very untidy

who
## # A tibble: 7,240 x 60
##    country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
##    <chr>   <chr> <chr> <int>       <int>        <int>        <int>
##  1 Afghan… AF    AFG    1980          NA           NA           NA
##  2 Afghan… AF    AFG    1981          NA           NA           NA
##  3 Afghan… AF    AFG    1982          NA           NA           NA
##  4 Afghan… AF    AFG    1983          NA           NA           NA
##  5 Afghan… AF    AFG    1984          NA           NA           NA
##  6 Afghan… AF    AFG    1985          NA           NA           NA
##  7 Afghan… AF    AFG    1986          NA           NA           NA
##  8 Afghan… AF    AFG    1987          NA           NA           NA
##  9 Afghan… AF    AFG    1988          NA           NA           NA
## 10 Afghan… AF    AFG    1989          NA           NA           NA
## # … with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## #   new_ep_f5564 <int>, new_ep_f65 <int>, new_rel_m014 <int>,
## #   new_rel_m1524 <int>, new_rel_m2534 <int>, new_rel_m3544 <int>,
## #   new_rel_m4554 <int>, new_rel_m5564 <int>, new_rel_m65 <int>,
## #   new_rel_f014 <int>, new_rel_f1524 <int>, new_rel_f2534 <int>,
## #   new_rel_f3544 <int>, new_rel_f4554 <int>, new_rel_f5564 <int>,
## #   new_rel_f65 <int>

who variables

The first four columns of who each contain a single variable:

  • country - the name of a country
  • iso2 - a two letter country code
  • iso3 - a three letter country code
  • year - year

The remaining columns are named after codes that contain multiple variables.

who codes

Each column name after the fourth, contains a code comprised of three values from three variables: type of TB, gender, and age.

A goal

To make who easier to use in R, we should tidy it into the format below. This data set contains six non-redundant variables: country, year, type, sex, age (group), and n (the number of cases of TB reported for each group).

## # A tibble: 76,046 x 6
##    country      year type  sex   age       n
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # … with 76,036 more rows

Warm up

It will take a number of operations to tidy the who data set. When you look at the end result, the task may seem overwhelming, but each individual operation will be fairly simple (and familiar). We will use the pipe operator, %>%, to string these simple operations together.

Let’s begin the pipe by removing the redundant variables iso2 and iso3 from who. In otherwords, let’s use a dplyr function to select every column except iso2 and iso3. Recall that there is a way to do this without a large amount of typing.

A strategy

Next, we need to move the type, sex, and age variables out of the column names and into a column of their own. It is true that we want to separate these values into their own cells, but that will be easier to do once they are in their own column.

In short, we want to do something like this:

Exercise - Reshape

Add to the pipe below. Use a tidyr reshaping function to gather the column names into their own column, named “codes”. Place the column cells into a column named “n”. Hint: it may be helpful to know that there are now 58 columns in the data set.

You can think of each column name as a key that combines the values of several variables. We want to move those keys into their own key column.

Exercise - Separate

Add to the pipe below. Separate the codes in codes into three columns named “new”, “type”, and “sexage”. What type of separator should you separate on?

Exercise - Separate again

Our last separate, isolated two components of the who codes: new and type. However, it did not separate the sex and age variables.

If you look closely at the structure of the sexage column, you will see that each cell begins with a single letter that represents a gender, m or f, and is then followed by three or more numbers, which represent an age group. Use this insight to perform a second separate that isolates the “sex” and “age” variables:

Exercise - Select

Add to the pipe to remove the new variable, which doesn’t provide any useful information. (Every row in the data set shows new cases of TB and has the same value of new).

n

Notice that the n column of who contains the most insightful information. You do not need to take any measurments to list out the country, year, type, sex, and age combinations in the data set. In a sense, you know these combinations in advance. However, n shows how many cases of TB were reported for each combination. You do not know this information in advance, and you can only acquire it through field work—yours or someone else’s. As a result, it is concerning that our data contains so many NAs for n.

NA

NA is R’s symbol for missing information, and it is common to have multiple NAs when you reshape your data from a wide format to a long format. The two by two table structure imposed by wide data requires a place holder for every combination of variable values—even if no data was collected for that combination.

In contrast, the long data format does not require a place holder for each combination of variable values. Since each combination is saved as its own row, you can simply not include rows that contain an NA.

drop_na()

The tidyr package provides a convenient function for dropping rows that contain an NA in a specific column. The function is drop_na(). To use it, give drop_na() a data set (perhaps via a pipe), then list one or more columns in that data set, e.g.

data %>% drop_na(column1, column2)

drop_na() will drop every row that contains an NA in one or more of the listed columns.

Add drop_na() to the pipe below to drop every row that has an NA in the n column.

Recap

Good job! You’ve wrangled who into a tidy, polished data set that is ready to be explored, modelled, and analyzed.

The difference between the initial and final versions of who is drastic, but each step in our pipe imposed a small, logical change. This is by design.

The tidyverse contains a vocabulary of functions that each do one simple thing, but can be combined to do more sophisticated tasks. In this way, the tidyverse is like a written language, it is made up of words (functions) that can be combined into sentences that have a sophisticated meaning (pipes).

This approach also makes it easier to solve problems with code. You can approach any problem by decomposing it into a series of small, simple steps.