Fast (Grouped, Weighted) Sum for Matrix-Like Objects
fsum
is a generic function that computes the (column-wise) sum of all values in x
, (optionally) grouped by g
and/or weighted by w
(e.g. to calculate survey totals). The TRA
argument can further be used to transform x
using its (grouped, weighted) sum.
fsum(x, ...) ## Default S3 method: fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE, ...) ## S3 method for class 'matrix' fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE, drop = TRUE, ...) ## S3 method for class 'data.frame' fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = TRUE, drop = TRUE, ...) ## S3 method for class 'grouped_df' fsum(x, w = NULL, TRA = NULL, na.rm = TRUE, use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, ...)
x |
a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df'). |
g |
a factor, |
w |
a numeric vector of (non-negative) weights, may contain missing values. |
TRA |
an integer or quoted operator indicating the transformation to perform:
1 - "replace_fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See |
na.rm |
logical. Skip missing values in |
use.g.names |
logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's. |
drop |
matrix and data.frame method: Logical. |
keep.group_vars |
grouped_df method: Logical. |
keep.w |
grouped_df method: Logical. Retain summed weighting variable after computation (if contained in |
... |
arguments to be passed to or from other methods. |
Missing-value removal as controlled by the na.rm
argument is done very efficiently by simply skipping them in the computation (thus setting na.rm = FALSE
on data with no missing values doesn't give extra speed). Large performance gains can nevertheless be achieved in the presence of missing values if na.rm = FALSE
, since then the corresponding computation is terminated once a NA
is encountered and NA
is returned (unlike sum
which just runs through without any checks).
The weighted sum (i.e. survey total) is computed as sum(x * w)
. If na.rm = TRUE
, missing values will be removed from both x
and w
i.e. utilizing only x[complete.cases(x,w)]
and w[complete.cases(x,w)]
.
This all seamlessly generalizes to grouped computations, which are performed in a single pass (without splitting the data) and therefore extremely fast. See Benchmark and Examples below.
When applied to data frames with groups or drop = FALSE
, fsum
preserves all column attributes (such as variable labels) but does not distinguish between classed and unclassed objects. The attributes of the data frame itself are also preserved.
The (w
weighted) sum of x
, grouped by g
, or (if TRA
is used) x
transformed by its sum, grouped by g
.
## default vector method mpg <- mtcars$mpg fsum(mpg) # Simple sum fsum(mpg, w = mtcars$hp) # Weighted sum (total): Weighted by hp fsum(mpg, TRA = "%") # Simple transformation: obtain percentages of mpg fsum(mpg, mtcars$cyl) # Grouped sum fsum(mpg, mtcars$cyl, mtcars$hp) # Weighted grouped sum (total) fsum(mpg, mtcars[c(2,8:9)]) # More groups.. g <- GRP(mtcars, ~ cyl + vs + am) # Precomputing groups gives more speed ! fsum(mpg, g) fmean(mpg, g) == fsum(mpg, g) / fNobs(mpg, g) fsum(mpg, g, TRA = "%") # Percentages by group ## data.frame method fsum(mtcars) fsum(mtcars, TRA = "%") fsum(mtcars, g) fsum(mtcars, g, TRA = "%") ## matrix method m <- qM(mtcars) fsum(m) fsum(m, TRA = "%") fsum(m, g) fsum(m, g, TRA = "%") \donttest{ ## method for grouped data frames - created with dplyr::group_by or fgroup_by library(dplyr) mtcars %>% group_by(cyl,vs,am) %>% fsum(hp) # Weighted grouped sum (total) mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(hp) # Equivalent and faster !! mtcars %>% fgroup_by(cyl,vs,am) %>% fsum(TRA = "%") mtcars %>% fgroup_by(cyl,vs,am) %>% fselect(mpg) %>% fsum }
## This compares fsum with data.table (2 threads) and base::rowsum # Starting with small data mtcDT <- qDT(mtcars) f <- qF(mtcars$cyl) library(microbenchmark) microbenchmark(mtcDT[, lapply(.SD, sum), by = f], rowsum(mtcDT, f, reorder = FALSE), fsum(mtcDT, f, na.rm = FALSE), unit = "relative") expr min lq mean median uq max neval cld mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726 100 c rowsum(mtcDT, f, reorder = FALSE) 2.833333 2.798203 2.489064 2.937889 2.425724 2.181173 100 b fsum(mtcDT, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a # Now larger data tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100.000 obs f <- qF(sample.int(1e4, 1e5, TRUE)) # A factor with 10.000 groups microbenchmark(tdata[, lapply(.SD, sum), by = f], rowsum(tdata, f, reorder = FALSE), fsum(tdata, f, na.rm = FALSE), unit = "relative") expr min lq mean median uq max neval cld tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475 100 c rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937 100 b fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.