Data is easiest to analyze in R when it is stored in a tidy format. In the last tutorial, you learned how to tidy data that has an untidy layout, but there is another way that data sets can be untidy: a data set can combine multiple values in a single cell or spread a single value across multiple cells. This makes it difficult to extract and use values in your analysis.
This tutorial will teach you two tools that you can use to tidy this type of data:
separate() - which separates a column of cells into multiple columns
unite() - which combines multiple columns of cells into a single column
It ends with a case study that requires you to use all of the tidy tools to wrangle a messy real world data set.
This tutorial uses the core tidyverse packages, including tidyr. All of these packages have been pre-installed and pre-loaded for your convenience.
Click the Next Topic button to begin.
hurricanes data set contains historical information about five hurricanes. At first glance it appears to contain four variables: name, wind_speed, pressure, and date. However, there are three more variables hidden in plain sight. Can you spot them?
## # A tibble: 6 x 4
## name wind_speed pressure date
## <chr> <dbl> <dbl> <chr>
## 1 Alberto 110 1007 2000-08-03
## 2 Alex 45 1009 1998-07-27
## 3 Allison 65 1005 1995-06-03
## 4 Ana 40 1013 1997-06-30
## 5 Arlene 50 1010 1999-06-11
## 6 Arthur 45 1010 1996-06-17
Did you realize that dates are a combination of multiple variables? They are.
You’ll almost always display these variables together to make a date, because a date is itself a variable—one that conveys more than the sum of its parts.
However, there are times where it is convenient to treat each element of a date separately. For example, what if you wanted to filter
hurricanes to just the storms that occurred in June (i.e.
month == 6)? Then it would be convenient to reorganize the data to look like this.
## # A tibble: 6 x 6
## name wind_speed pressure year month day
## <chr> <dbl> <dbl> <int> <int> <int>
## 1 Alberto 110 1007 2000 8 3
## 2 Alex 45 1009 1998 7 27
## 3 Allison 65 1005 1995 6 3
## 4 Ana 40 1013 1997 6 30
## 5 Arlene 50 1010 1999 6 11
## 6 Arthur 45 1010 1996 6 17
But how could you do it?
You can separate the elements of
date with the
separate() divides a column of values into multiple columns that each contain a portion of the original values.
Run the code below to see
separate() in action. Then click continue to learn about the syntax.
Let’s rewrite our above command without the pipe, to make the syntax of separate easier to see.
separate(hurricanes, col = date, into = c("year","month","day"), sep = "-")
separate() takes a data frame and then the name of a column in the data frame to separate. Here our code will separate the
date column of the
hurricane data set.
sep = "-" argument tells
separate() to split each value in
date wherever a
- appears. You can choose to split on any character or character string.
- will split each date into three dates: a year, month, and day. As a result,
separate() will need to add three new columns to the result. The
into argument gives
separate() a character vector of names to use for the new columns. Since the result will have three new columns, this vector will need to have three new names.
separate() will provide an error message if it ends up creating fewer or more columns than column names.
separate() will separate values at the location of any non-alphanumeric character, like
/, etc. So for example, we could run our code without the
sep = "-" argument and—in this case—get the same result.
Or will we? Do a quick mental check and then run the code to see if you are right.
Separating by position
If you set
sep equal to an integer,
separate() will split the values at the location indicated by the integers. For example,
sep = 1 will split the values after the first character
sep = -2 will split the values after the second to last character, no matter how many characters appear in the value. In other words, it will split off the last character of each value.
sep = c(2, 4, 6) will split the values after the second, fourth, and sixth characters, creating four sub-values
Think you have it? Create this version of
hurricanes by adding a second call to
separate() that uses an integer separator to the code below:
## # A tibble: 6 x 7
## name wind_speed pressure century year month day
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 Alberto 110 1007 20 00 08 03
## 2 Alex 45 1009 19 98 07 27
## 3 Allison 65 1005 19 95 06 03
## 4 Ana 40 1013 19 97 06 30
## 5 Arlene 50 1010 19 99 06 11
## 6 Arthur 45 1010 19 96 06 17
##  "Good job! Notice that when you separate on a character, separate() removes that character from the results. When you separate on a location, separate() retains all of the characters in the result."
Quiz - What if
Would these two commands return the same result? Why or why not? Once you have an answer, run the code below to see if you were right.
You may have noticed that
separate() returns its results as columns of character strings. However, in some cases, like ours, the columns will contain integers, doubles, or other types of non-character data.
You can ask
separate() to convert the new columns to an appropriate data type by adding
convert = TRUE to your
separate() call. This is identical to the
convert = TRUE argument of
Identify the data types of
day (they appear under the column names) in the output below. Then add
convert = TRUE and re-run the code. What changes?
Let’s take a look at one last argument for
separate(). If you add
remove = FALSE to your
separate() call, R will retain the original column in the results.
separate(col = date, into = c("year","month","day"), convert = TRUE, remove = FALSE)
## # A tibble: 6 x 7
## name wind_speed pressure date year month day
## <chr> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Alberto 110 1007 2000-08-03 2000 8 3
## 2 Alex 45 1009 1998-07-27 1998 7 27
## 3 Allison 65 1005 1995-06-03 1995 6 3
## 4 Ana 40 1013 1997-06-30 1997 6 30
## 5 Arlene 50 1010 1999-06-11 1999 6 11
## 6 Arthur 45 1010 1996-06-17 1996 6 17
who data set contains a subset of data from the World Health Organization Global Tuberculosis Report, available here.
In its original format, the data is very untidy
## # A tibble: 7,240 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Afghan… AF AFG 1980 NA NA NA
## 2 Afghan… AF AFG 1981 NA NA NA
## 3 Afghan… AF AFG 1982 NA NA NA
## 4 Afghan… AF AFG 1983 NA NA NA
## 5 Afghan… AF AFG 1984 NA NA NA
## 6 Afghan… AF AFG 1985 NA NA NA
## 7 Afghan… AF AFG 1986 NA NA NA
## 8 Afghan… AF AFG 1987 NA NA NA
## 9 Afghan… AF AFG 1988 NA NA NA
## 10 Afghan… AF AFG 1989 NA NA NA
## # … with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
## # new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## # new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## # new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## # new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## # new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## # new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## # new_ep_f5564 <int>, new_ep_f65 <int>, new_rel_m014 <int>,
## # new_rel_m1524 <int>, new_rel_m2534 <int>, new_rel_m3544 <int>,
## # new_rel_m4554 <int>, new_rel_m5564 <int>, new_rel_m65 <int>,
## # new_rel_f014 <int>, new_rel_f1524 <int>, new_rel_f2534 <int>,
## # new_rel_f3544 <int>, new_rel_f4554 <int>, new_rel_f5564 <int>,
## # new_rel_f65 <int>
The first four columns of
who each contain a single variable:
- country - the name of a country
- iso2 - a two letter country code
- iso3 - a three letter country code
- year - year
The remaining columns are named after codes that contain multiple variables.
Each column name after the fourth, contains a code comprised of three values from three variables: type of TB, gender, and age.
who easier to use in R, we should tidy it into the format below. This data set contains six non-redundant variables: country, year, type, sex, age (group), and n (the number of cases of TB reported for each group).
## # A tibble: 76,046 x 6
## country year type sex age n
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1998 sp m 014 30
## 3 Afghanistan 1999 sp m 014 8
## 4 Afghanistan 2000 sp m 014 52
## 5 Afghanistan 2001 sp m 014 129
## 6 Afghanistan 2002 sp m 014 90
## 7 Afghanistan 2003 sp m 014 127
## 8 Afghanistan 2004 sp m 014 139
## 9 Afghanistan 2005 sp m 014 151
## 10 Afghanistan 2006 sp m 014 193
## # … with 76,036 more rows
It will take a number of operations to tidy the
who data set. When you look at the end result, the task may seem overwhelming, but each individual operation will be fairly simple (and familiar). We will use the pipe operator,
%>%, to string these simple operations together.
Let’s begin the pipe by removing the redundant variables
who. In otherwords, let’s use a dplyr function to select every column except
iso3. Recall that there is a way to do this without a large amount of typing.
Next, we need to move the type, sex, and age variables out of the column names and into a column of their own. It is true that we want to separate these values into their own cells, but that will be easier to do once they are in their own column.
In short, we want to do something like this:
Exercise - Reshape
Add to the pipe below. Use a tidyr reshaping function to gather the column names into their own column, named “codes”. Place the column cells into a column named “n”. Hint: it may be helpful to know that there are now 58 columns in the data set.
You can think of each column name as a key that combines the values of several variables. We want to move those keys into their own key column.
Exercise - Separate
Add to the pipe below. Separate the codes in
codes into three columns named “new”, “type”, and “sexage”. What type of separator should you separate on?
Exercise - Separate again
Our last separate, isolated two components of the who codes: new and type. However, it did not separate the sex and age variables.
If you look closely at the structure of the
sexage column, you will see that each cell begins with a single letter that represents a gender,
f, and is then followed by three or more numbers, which represent an age group. Use this insight to perform a second separate that isolates the “sex” and “age” variables:
Exercise - Select
Add to the pipe to remove the
new variable, which doesn’t provide any useful information. (Every row in the data set shows new cases of TB and has the same value of
Notice that the
n column of
who contains the most insightful information. You do not need to take any measurments to list out the country, year, type, sex, and age combinations in the data set. In a sense, you know these combinations in advance. However,
n shows how many cases of TB were reported for each combination. You do not know this information in advance, and you can only acquire it through field work—yours or someone else’s. As a result, it is concerning that our data contains so many
NA is R’s symbol for missing information, and it is common to have multiple
NAs when you reshape your data from a wide format to a long format. The two by two table structure imposed by wide data requires a place holder for every combination of variable values—even if no data was collected for that combination.
In contrast, the long data format does not require a place holder for each combination of variable values. Since each combination is saved as its own row, you can simply not include rows that contain an NA.
The tidyr package provides a convenient function for dropping rows that contain an NA in a specific column. The function is
drop_na(). To use it, give
drop_na() a data set (perhaps via a pipe), then list one or more columns in that data set, e.g.
data %>% drop_na(column1, column2)
drop_na() will drop every row that contains an NA in one or more of the listed columns.
drop_na() to the pipe below to drop every row that has an NA in the
Good job! You’ve wrangled
who into a tidy, polished data set that is ready to be explored, modelled, and analyzed.
The difference between the initial and final versions of
who is drastic, but each step in our pipe imposed a small, logical change. This is by design.
The tidyverse contains a vocabulary of functions that each do one simple thing, but can be combined to do more sophisticated tasks. In this way, the tidyverse is like a written language, it is made up of words (functions) that can be combined into sentences that have a sophisticated meaning (pipes).
This approach also makes it easier to solve problems with code. You can approach any problem by decomposing it into a series of small, simple steps.