Previous chapter
Data Wrangling with data.tableData Wrangling with data.table
Next chapter

Introduction

The data.table package enhances R with fast data wrangling for small and large data sets. Since data.table inherits from data.frame it can be used as a drop-in replacement for data.frame. The design of data.table is influenced by the desire to have a efficient tool to do SQL like operations from within R. Here, efficient not only refers to speed but also to reduce the code to be written by the user.

The main motivation of using data.table is typically its speed. It is so fast that some of the authors of dplyr created the dtplyr created a data.table back-end for dplyr to be able to harness the amazing speed data.table.

Benchmarks

The page https://h2oai.github.io/db-benchmark/ gives several benchmarks which compare the performance of data.table to similar software.

Matt Dowle’s view on data.table

Matt Dowle is the creator and still active maintainer of data.table. In the video below (recorded on the UseR conference 2014) he shares his motivation for creating data.table.

Syntax

The basic data.table syntax can be summarized by
DT[ where , select , by ]

or in the more general form

DT[ where | order , select | update , by ]
library(data.table)
cdt <- as.data.table(mtcars)

where / order

The where argument/dimension can be used to select rows

cdt[mpg <= 11,]
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1: 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## 2: 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4

and order_by by selected columns

head(cdt[order(mpg, disp),], 2)
##     mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1: 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## 2: 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4

by default the data is ordered ascending, to order descending add a - in front of the column name.

head(cdt[order(-mpg, disp),], 2)
##     mpg cyl disp hp drat    wt  qsec vs am gear carb
## 1: 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
## 2: 32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1

select

The select argument/dimension can be used to get a single column, with dropping dimension

cdt[1:2, mpg]
## [1] 21 21

without dropping dimension (Note, inside data.table .( is an alias to list.)

cdt[1:2, .(mpg)]
##    mpg
## 1:  21
## 2:  21

or to select multiple columns (note to use a variable to select columns the prefix .. has to be used)

cdt[1, .(mpg, cyl, gear)]
##    mpg cyl gear
## 1:  21   6    4
keys <- c("mpg", "cyl", "gear")
cdt[3, ..keys]
##     mpg cyl gear
## 1: 22.8   4    4
j <- 1:3
cdt[3, ..j]
##     mpg cyl disp
## 1: 22.8   4  108

or a range of columns.

cdt[1, mpg:drat]
##    mpg cyl disp  hp drat
## 1:  21   6  160 110  3.9

To deselect columns - or ! can be used.

colnames(cdt[, -(mpg:gear)])
## [1] "carb"
colnames(cdt[, !(cyl:gear)])
## [1] "mpg"  "carb"
cdt[, mean(cyl)]
## [1] 6.1875

update

The select argument/dimension can also be used to add / alter a variable,

# 1 miles per gallon = 235.215 litre per 100 km
cdt[, lpk := 235.215 / mpg]
head(cdt, 2)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb      lpk
## 1:  21   6  160 110  3.9 2.620 16.46  0  1    4    4 11.20071
## 2:  21   6  160 110  3.9 2.875 17.02  0  1    4    4 11.20071

add / alter multiple variables,

# alter to litre per 1 km
cdt[, `:=` (lp100k = 235.215 / mpg, lpk = 2.35215 / mpg)]
head(cdt, 2)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb       lpk   lp100k
## 1:  21   6  160 110  3.9 2.620 16.46  0  1    4    4 0.1120071 11.20071
## 2:  21   6  160 110  3.9 2.875 17.02  0  1    4    4 0.1120071 11.20071

delete a variable

cdt[,carb := NULL]
setdiff(colnames(mtcars), colnames(cdt))
## [1] "carb"

or delete multiple variables

colnames(cdt)
##  [1] "mpg"    "cyl"    "disp"   "hp"     "drat"   "wt"     "qsec"  
##  [8] "vs"     "am"     "gear"   "lpk"    "lp100k"
cdt[, `:=` (lpk = NULL, lp100k = NULL, vs = NULL, gear = NULL)]
setdiff(colnames(mtcars), colnames(cdt))
## [1] "vs"   "gear" "carb"

by

The by statement has to be used together with some selection or aggregation, to simplify the syntax data.table specifies the special symbols .SD, .BY, .N, .I and .GRP.

As a first simple example we calculate the mean fuel consumption for each group of cylinders.

cdt[, mean_mpg := mean(mpg), by = cyl]
cdt[,]
##      mpg cyl  disp  hp drat    wt  qsec am mean_mpg
##  1: 21.0   6 160.0 110 3.90 2.620 16.46  1 19.74286
##  2: 21.0   6 160.0 110 3.90 2.875 17.02  1 19.74286
##  3: 22.8   4 108.0  93 3.85 2.320 18.61  1 26.66364
##  4: 21.4   6 258.0 110 3.08 3.215 19.44  0 19.74286
##  5: 18.7   8 360.0 175 3.15 3.440 17.02  0 15.10000
##  6: 18.1   6 225.0 105 2.76 3.460 20.22  0 19.74286
##  7: 14.3   8 360.0 245 3.21 3.570 15.84  0 15.10000
##  8: 24.4   4 146.7  62 3.69 3.190 20.00  0 26.66364
##  9: 22.8   4 140.8  95 3.92 3.150 22.90  0 26.66364
## 10: 19.2   6 167.6 123 3.92 3.440 18.30  0 19.74286
## 11: 17.8   6 167.6 123 3.92 3.440 18.90  0 19.74286
## 12: 16.4   8 275.8 180 3.07 4.070 17.40  0 15.10000
## 13: 17.3   8 275.8 180 3.07 3.730 17.60  0 15.10000
## 14: 15.2   8 275.8 180 3.07 3.780 18.00  0 15.10000
## 15: 10.4   8 472.0 205 2.93 5.250 17.98  0 15.10000
## 16: 10.4   8 460.0 215 3.00 5.424 17.82  0 15.10000
## 17: 14.7   8 440.0 230 3.23 5.345 17.42  0 15.10000
## 18: 32.4   4  78.7  66 4.08 2.200 19.47  1 26.66364
## 19: 30.4   4  75.7  52 4.93 1.615 18.52  1 26.66364
## 20: 33.9   4  71.1  65 4.22 1.835 19.90  1 26.66364
## 21: 21.5   4 120.1  97 3.70 2.465 20.01  0 26.66364
## 22: 15.5   8 318.0 150 2.76 3.520 16.87  0 15.10000
## 23: 15.2   8 304.0 150 3.15 3.435 17.30  0 15.10000
## 24: 13.3   8 350.0 245 3.73 3.840 15.41  0 15.10000
## 25: 19.2   8 400.0 175 3.08 3.845 17.05  0 15.10000
## 26: 27.3   4  79.0  66 4.08 1.935 18.90  1 26.66364
## 27: 26.0   4 120.3  91 4.43 2.140 16.70  1 26.66364
## 28: 30.4   4  95.1 113 3.77 1.513 16.90  1 26.66364
## 29: 15.8   8 351.0 264 4.22 3.170 14.50  1 15.10000
## 30: 19.7   6 145.0 175 3.62 2.770 15.50  1 19.74286
## 31: 15.0   8 301.0 335 3.54 3.570 14.60  1 15.10000
## 32: 21.4   4 121.0 109 4.11 2.780 18.60  1 26.66364
##      mpg cyl  disp  hp drat    wt  qsec am mean_mpg

Assignment / Copy

In some situations it is desirable to explicit copy your data.table to avoid surprises. Since data.table tries to avoid full object copies. The following examples will highlight in which situations data.table copies and in which situations just passes a references.

A <- data.table(k = c("a", "b", "a"), v= 1:3)
B <- copy(A)
C <- B

Since B is copied from A we can do manipulations on the columns of B without changing A however C will be also changed.

B[, v := 2 * v]
cbind(A = A, B = B, C = C)
##    A.k A.v B.k B.v C.k C.v
## 1:   a   1   a   2   a   2
## 2:   b   2   b   4   b   4
## 3:   a   3   a   6   a   6

Also if C is changes B will be changed.

C[, v := 2 * v]
cbind(A = A, B = B, C = C)
##    A.k A.v B.k B.v C.k C.v
## 1:   a   1   a   4   a   4
## 2:   b   2   b   8   b   8
## 3:   a   3   a  12   a  12

Note due to the usage of references the typical behavior of R objects within functions is also altered. This can be very nice since we can avoid unnecessary copying but should be used carefully.

fun <- function(x) {
  x[, v := 2*v]
  3 
}
fun(B)
## [1] 3
cbind(A = A, B = B, C = C)
##    A.k A.v B.k B.v C.k C.v
## 1:   a   1   a   8   a   8
## 2:   b   2   b  16   b  16
## 3:   a   3   a  24   a  24

As a general rule of thumb,

  • row sub-setting will create a new data.table,
  • altering adding columns will not create a new data.table.

More Sub-setting

Here we look at more examples of sub-setting.

library(babynames)
bnam <- as.data.table(babynames)

Helper-Functions

between

The between function can be used to select values between a lower and upper bound. So if we want to know how many different baby names were used in the years 2014, 2015 and 2016 we can use.

bnam[year %between% c(2014, 2016), .N, by = year]
##    year     N
## 1: 2014 33228
## 2: 2015 33098
## 3: 2016 32979

chmatch, %chin%

The functions chmatch and %chin% are faster versions of match and %in% specialized on character vectors.

(nam <- sample(bnam[["name"]], 3))
## [1] "Batya"   "Allexia" "Tempest"
bnam[name %chin% nam, .(count = sum(n)), by = name]
##       name count
## 1: Tempest  1185
## 2:   Batya   549
## 3: Allexia    40

%like%, %ilike%, %flike%

*like can be seen as a wrapper around grep, the basic syntax is col_name %like% pattern.

bnam[name %like% "^Uniq", sum(n), by = name]
##        name   V1
## 1:   Unique 7508
## 2:   Uniqua  488
## 3:  Uniquea   19
## 4:  Uniquia   25
## 5: Uniqueka   17
## 6: Uniquewa    5

The function %ilike% is a specialized version of like where ignore.case is set to TRUE.

bnam[name %ilike% "Uniq", sum(n), by = name]
##          name   V1
##  1:    Unique 7508
##  2:   Eunique  600
##  3:    Uniqua  488
##  4:  Imunique  268
##  5:   Uniquea   19
##  6:  Amunique  114
##  7: Shauniqua   19
##  8:   Uniquia   25
##  9: Shaunique   10
## 10:  Uniqueka   17
## 11:  Omunique  145
## 12:  Uniquewa    5
## 13:   Euniqua   22
## 14: Domunique    5
## 15:   Yunique  102
## 16:  Younique   81
## 17:  Myunique    5

The function %flike% is a specialized version of like where fixed is set to TRUE.

Join

library(babynames)
set.seed(7302)
A <- data.table(
  cid = c(1, 2, 3, 4), ## customer id
  name = head(babynames$name, 4),
  city = sample(c("Vienna", "Linz"), 4L, TRUE), key = "cid" )
B <- data.table(
  cid = c(3, 5, 1, 3, 2),
  oid = 1:5, ## order id
  value = round(runif(5, 10, 1000), 2), key = c("cid", "oid"))

left (outer) join

Figure source / license: [Wikipedia](https://en.wikipedia.org/wiki/File:SQL_Join_-_01_A_Left_Join_B.svg) / [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en)

Figure source / license: Wikipedia / CC BY-SA 4.0

Example - left join

B[A, on="cid"]
##    cid oid  value      name   city
## 1:   1   3 886.06      Mary   Linz
## 2:   2   5 387.54      Anna   Linz
## 3:   3   1 879.38      Emma Vienna
## 4:   3   4 602.21      Emma Vienna
## 5:   4  NA     NA Elizabeth Vienna

right (outer) join

Figure source / license: [Wikipedia]() / [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en)

Figure source / license: Wikipedia / CC BY-SA 4.0

Example - right join

A[B, on="cid"]
##    cid name   city oid  value
## 1:   1 Mary   Linz   3 886.06
## 2:   2 Anna   Linz   5 387.54
## 3:   3 Emma Vienna   1 879.38
## 4:   3 Emma Vienna   4 602.21
## 5:   5 <NA>   <NA>   2 509.29

natural (inner) join

Figure source / license: [Wikipedia](https://en.wikipedia.org/wiki/File:SQL_Join_-_07_A_Inner_Join_B.svg) / [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en)

Figure source / license: Wikipedia / CC BY-SA 4.0

Example - natural (inner) join

A[B, on = "cid", nomatch = 0] ## or nomatch = NULL
##    cid name   city oid  value
## 1:   1 Mary   Linz   3 886.06
## 2:   2 Anna   Linz   5 387.54
## 3:   3 Emma Vienna   1 879.38
## 4:   3 Emma Vienna   4 602.21

full (outer) join

Figure source / license: [Wikipedia]() / [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/deed.en)

Figure source / license: Wikipedia / CC BY-SA 4.0

The data.table package contains a S3 method of merge so full outer join can be done with merge.

Example - full (outer) join

merge(A, B, by = "cid", all = TRUE)
##    cid      name   city oid  value
## 1:   1      Mary   Linz   3 886.06
## 2:   2      Anna   Linz   5 387.54
## 3:   3      Emma Vienna   1 879.38
## 4:   3      Emma Vienna   4 602.21
## 5:   4 Elizabeth Vienna  NA     NA
## 6:   5      <NA>   <NA>   2 509.29

Group / Summarize

To show the basic usage we define the simple data.table

sdt <- data.table(grp = c("a", "c", "a", "b", "c"), 
                  v1 = round(runif(5, 0, 100)),
                  v2 = round(runif(5, 0, 100)),
                  v3 = round(runif(5, 0, 100)),
                  id = sample(1:5))

Count .(.N)

.N gives an integer with the number of rows in each group. Therefore it can be used to obtain the number of rows

nrow(sdt) == sdt[,.N]
## [1] TRUE

the count in each group

sdt[,.N, by = grp]
##    grp N
## 1:   a 2
## 2:   c 2
## 3:   b 1

if the result should be sorted, used keyby instead of by (keyby is also faster)

sdt[,.N, keyby = grp]
##    grp N
## 1:   a 2
## 2:   b 1
## 3:   c 2

and more.

Exercise

Use .N to obtain the relative size of each group.

#hint> sdt[, ?, by = ?]
sdt[, .(rfeq = .N / nrow(sdt)), by = grp]

Group

The .SD variable contains all the columns except those used for grouping.

The following code prints the table grouped by grp.

sdt[, {print(.BY); print(.SD)}, by = grp]
## $grp
## [1] "a"
## 
##    v1 v2 v3 id
## 1: 13 92 37  1
## 2: 93  7 48  3
## $grp
## [1] "c"
## 
##    v1 v2 v3 id
## 1: 45 75 22  2
## 2: 48 68 34  5
## $grp
## [1] "b"
## 
##    v1 v2 v3 id
## 1: 89  5 80  4
## Empty data.table (0 rows and 1 cols): grp
sdt[, {print(.BY); print(.SD[["id"]])}, by = grp]
## $grp
## [1] "a"
## 
## [1] 1 3
## $grp
## [1] "c"
## 
## [1] 2 5
## $grp
## [1] "b"
## 
## [1] 4
## Empty data.table (0 rows and 1 cols): grp

It is also possible to combine lapply with .SD to calculate statistics on all the values v*.

sdt[, lapply(.SD[,-"id"], mean), by = grp]
##    grp   v1   v2   v3
## 1:   a 53.0 49.5 42.5
## 2:   c 46.5 71.5 28.0
## 3:   b 89.0  5.0 80.0

Exercise

library(insuranceData)
data(AutoCollision)
ac <- as.data.table(AutoCollision)

Use .SD in combination with which.max to find Vehicle_Use with the highest Severity for each Age group.

#hint> ac[, .SD[which.max(?)], by = ?]
ac[, .SD[which.max(Severity)], by = Age]

Use .SD in combination with which.max and which.min to find the Age group with the highest and lowest Claim_Count for each Vehicle_Use.

# see hint above
ac[, .SD[c(which.min(Claim_Count), which.max(Claim_Count))], 
   by = Vehicle_Use, .SDcols = c('Age', 'Claim_Count')]

Use := and lapply to convert all factor columns to character.

sapply(ac, class)
#hint> ac[]
factor_cols <- names(sapply(ac, is.factor))
ac[, (factor_cols) := lapply(.SD, as.character), .SDcols = factor_cols]

.I

The variable .I gives the index of the variables within each group. In it’s basic form .I gives the simplified version of

aggregate(seq_len(nrow(sdt)), by = list(sdt$grp), FUN = c)
sdt[, .I, by = grp]

.GRP

The variable .GRP gives the number of the group.

sdt[, .(.BY, .GRP), by = grp]
##    grp BY GRP
## 1:   a  a   1
## 2:   c  c   2
## 3:   b  b   3

Speed up

  • Avoid copying (if possible)
  • Set keys if you group or join on the same variable more than once
  • Use keyby which faster than by (or the same if the by variable is the key).
  • Use DT[["key"]] which is faster than DT[, key].