Advanced Data Aggregation
collap is a fast and easy-to-use multi-purpose data aggregation command.
It performs simple aggregations, multi-type data aggregations applying different functions to numeric and categorical data, weighted aggregations (including weighted multi-type aggregations), multi-function aggregations applying multiple functions to each column, and fully customized aggregations where the user passes a list mapping functions to columns.
collap works with collapse's Fast Statistical Functions, providing extremely fast conventional and weighted aggregation. It also works with other functions, but these are not fast on large data and do not support weighted aggregation.
# Main function: allows formula and data input to `by` and `w` arguments
collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
       custom = NULL, keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
       sort = TRUE, decreasing = FALSE, na.last = TRUE, parallel = FALSE, mc.cores = 2L,
       return = c("wide","list","long","long_dupl"), give.names = "auto", sort.row, ...)

# Programmer function: allows column names and indices input to `by` and `w` arguments
collapv(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
        custom = NULL, keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
        sort = TRUE, decreasing = FALSE, na.last = TRUE, parallel = FALSE, mc.cores = 2L,
        return = c("wide","list","long","long_dupl"), give.names = "auto", sort.row, ...)

# Auxiliary function: for grouped data ('grouped_df') input + non-standard evaluation
collapg(X, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
        custom = NULL, keep.group_vars = TRUE, keep.w = TRUE, keep.col.order = TRUE,
        parallel = FALSE, mc.cores = 2L,
        return = c("wide","list","long","long_dupl"), give.names = "auto", sort.row, ...)
X |
a data frame, or an object coercible to data frame using |
by |
for |
FUN |
a function, list of functions (i.e. |
catFUN |
same as |
cols |
select columns to aggregate using a function, column names, indices or logical vector. Note: |
w |
weights. Can be passed as numeric vector or alternatively as formula i.e. |
wFUN |
same as |
custom |
a named list specifying a fully customized aggregation task. The names of the list are function names and the content columns to aggregate using this function (same input as |
keep.by, keep.group_vars |
logical. |
keep.w |
logical. |
keep.col.order |
logical. Retain original column order post-aggregation. |
sort, decreasing, na.last |
logical. Arguments passed to |
parallel |
logical. Use |
mc.cores |
integer. Argument to |
return |
character. Control the output format when aggregating with multiple functions or performing custom aggregation. "wide" (default) returns a wider data frame with added columns for each additional function. "list" returns a list of data frames - one for each function. "long" adds a column "Function" and row-binds the results from different functions using |
give.names |
logical. Create unique names of aggregated columns by adding a prefix 'FUN.var'. |
sort.row |
deprecated, renamed to |
... |
additional arguments passed to all functions supplied to |
collap automatically checks whether each function passed to it is a Fast Statistical Function (i.e. whether the function name is contained in .FAST_STAT_FUN). If the function is a Fast Statistical Function, collap only does the grouping and then calls the function to carry out the grouped computations. If the function is not one of .FAST_STAT_FUN, BY is called internally to perform the computation. The results from each function are put into a list and recombined to produce the desired output format, as controlled by the return argument.
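The two routes described above can be reproduced by hand. The following is only a rough sketch of the idea, not collap's internal code: it computes the grouping once with GRP, then obtains group means from a Fast Statistical Function (fmean) and through BY with base::mean. The object names g, fast and slow are chosen for this illustration.

```r
library(collapse)

# Compute the grouping once; both routes below reuse it
g <- GRP(iris, ~ Species)

# Route 1: a Fast Statistical Function carries out the grouped computation itself
fast <- fmean(iris[1:4], g)

# Route 2: any other function is applied group by group through BY
slow <- BY(iris[1:4], g, mean)

# Both routes should give the same group means, up to floating-point error
all.equal(as.numeric(fast$Sepal.Length), as.numeric(slow$Sepal.Length))
```

On large data the first route is the one that scales, which is why collap prefers it whenever the function name is found in .FAST_STAT_FUN.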
When setting parallel = TRUE on a non-Windows computer, aggregations are parallelized at the column level using mclapply with mc.cores cores.
X aggregated. If X is not a data frame, it is coerced to one using qDF and then aggregated.
(1) Since BY does not check and split additional arguments passed to it, it is presently not possible to create a weighted function in R and apply it to data by groups with collap. Weighted aggregations only work with Fast Statistical Functions supporting weights. User-written weighted functions can be applied using the data.table package.
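As a minimal sketch of the data.table route mentioned in note (1): the function my_wmean, the toy table dt and its columns g, x and w are all made up for this illustration. data.table evaluates the j expression within each group, so the values and the weights are split together.

```r
library(data.table)

# Hypothetical user-written weighted statistic: weighted mean ignoring NAs
my_wmean <- function(x, w) {
  ok <- !is.na(x) & !is.na(w)
  sum(x[ok] * w[ok]) / sum(w[ok])
}

# Toy data (illustrative only)
dt <- data.table(g = c("a", "a", "b", "b"),
                 x = c(1, 3, 10, 20),
                 w = c(1, 3, 1, 1))

# x and w are subset to each group before my_wmean is called
res <- dt[, .(wmean = my_wmean(x, w)), by = g]
res
```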
(2) When the w argument is used, the weights are passed to all Fast Statistical Functions. This may be undesirable in settings like collapse::collap(data, ~ id, custom = list(fsum = ..., fmean = ...), w = ~ weights) where some columns are to be aggregated using the weighted mean, and others using a simple sum or another unweighted statistic. Since many Fast Statistical Functions, including fsum, support weights, the above computes a weighted mean and a weighted sum. A couple of workarounds exist, but collapse 1.5.0 incorporates an easy solution into collap: it is now possible to append _uw to the name of a Fast Statistical Function to yield an unweighted computation. So for the above example we can write collapse::collap(data, ~ id, custom = list(fsum_uw = ..., fmean = ...), w = ~ weights) to get the weighted mean and the simple sum. Note that the _uw functions are not available for use outside collap. Thus they also need to be quoted when passed to the FUN or catFUN arguments, e.g. use collap(data, ~ id, fmean, "fmode_uw", w = ~ weights), since collap(data, ~ id, fmean, fmode_uw, w = ~ weights) gives an error stating that fmode_uw was not found. Note also that functions passed to wFUN never need the _uw suffix, as the weights are never used to aggregate themselves.
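A small sketch of the _uw suffix from note (2), assuming collapse 1.5.0 or later. The weight column wt is hypothetical and filled with random numbers purely for illustration. The two Sepal columns get a weighted mean, while the two Petal columns get a plain sum even though w is supplied.

```r
library(collapse)

d <- iris
set.seed(1)
d$wt <- abs(rnorm(nrow(d)))   # hypothetical weight column

res <- collap(d, ~ Species,
              custom = list(fmean   = c("Sepal.Length", "Sepal.Width"),   # weighted mean
                            fsum_uw = c("Petal.Length", "Petal.Width")),  # unweighted sum
              w = ~ wt)
res

# Reference values computed directly, for comparison with the columns of res
fmean(d$Sepal.Length, g = d$Species, w = d$wt)   # weighted group means
fsum(d$Petal.Length, g = d$Species)              # plain group sums
```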
(3) The dispatch between using optimized Fast Statistical Functions, which perform grouped computations internally, and calling BY to perform split-apply-combine computing is done by matching the function name against .FAST_STAT_FUN. Thus code like collapse::collap(data, ~ id, collapse::fmedian) does not yield an optimized computation, as "collapse::fmedian" %!in% .FAST_STAT_FUN. It is sufficient to write collapse::collap(data, ~ id, "fmedian") to get the desired result when the collapse namespace is not attached.
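The name matching in note (3) can be checked directly. This sketch deliberately does not attach collapse and uses iris as stand-in data. The first two lines only inspect .FAST_STAT_FUN, and the last call uses the string form recommended above.

```r
# collapse is intentionally not attached with library() here

# The bare name is listed; the namespace-qualified name is not
"fmedian" %in% collapse::.FAST_STAT_FUN
"collapse::fmedian" %in% collapse::.FAST_STAT_FUN

# Passing the name as a string lets collap match it against .FAST_STAT_FUN
res <- collapse::collap(iris, ~ Species, "fmedian")
res
```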
## A Simple Introduction --------------------------------------
head(iris)
collap(iris, ~ Species)                                         # Default: FUN = fmean for numeric
collapv(iris, 5)                                                # Same using collapv
collap(iris, ~ Species, fmedian)                                # Using the median
collap(iris, ~ Species, fmedian, keep.col.order = FALSE)        # Groups in-front
collap(iris, Sepal.Width + Petal.Width ~ Species, fmedian)      # Only '.Width' columns
collapv(iris, 5, cols = c(2, 4))                                # Same using collapv
collap(iris, ~ Species, list(fmean, fmedian))                   # Two functions
collap(iris, ~ Species, list(fmean, fmedian), return = "long")  # Long format
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4))     # Custom aggregation
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4),     # Raw output, no column reordering
        return = "list")
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4),     # A strange choice..
        return = "long")
collap(iris, ~ Species, w = ~ Sepal.Length)                     # Using Sepal.Length as weights, ..
weights <- abs(rnorm(fnrow(iris)))
collap(iris, ~ Species, w = weights)                            # Some random weights..
collap(iris, iris$Species, w = weights)                         # Note this behavior..
collap(iris, iris$Species, w = weights,
       keep.by = FALSE, keep.w = FALSE)
library(dplyr)                                                  # Needed for "%>%"
iris %>% fgroup_by(Species) %>% collapg                         # dplyr style, but faster

## Multi-Type Aggregation --------------------------------------
head(wlddev)                                                    # World Development Panel Data
head(collap(wlddev, ~ country + decade))                        # Aggregate by country and decade
head(collap(wlddev, ~ country + decade, fmedian, ffirst))       # Different functions
head(collap(wlddev, ~ country + decade, cols = is.numeric))     # Aggregate only numeric columns
head(collap(wlddev, ~ country + decade, cols = 9:12))           # Only the 4 series
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade))         # Only GDP and life-expectancy
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, fsum))   # Using the sum instead
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, sum,     # Same using base::sum -> slower!
            na.rm = TRUE))
head(collap(wlddev, wlddev[c("country","decade")], fsum,        # Same, exploring different inputs
            cols = 9:10))
head(collap(wlddev[9:10], wlddev[c("country","decade")], fsum))
head(collapv(wlddev, c("country","decade"), fsum))              # ..names/indices with collapv
head(collapv(wlddev, c(1,5), fsum))

g <- GRP(wlddev, ~ country + decade)                            # Precomputing the grouping
head(collap(wlddev, g, keep.by = FALSE))                        # This is slightly faster now

# Aggregate categorical data using not the mode but the last element
head(collap(wlddev, ~ country + decade, fmean, flast))
head(collap(wlddev, ~ country + decade, catFUN = flast,         # Aggregate only categorical data
            cols = is.categorical))

## Weighted Aggregation ----------------------------------------
weights <- abs(rnorm(fnrow(wlddev)))                            # Random weight vector
head(collap(wlddev, ~ country + decade, w = weights))           # Takes weighted mean for numeric..
# ..and weighted mode for categorical data. The weight vector is aggregated using fsum

wlddev$weights <- weights                                       # Adding to data
head(collap(wlddev, ~ country + decade, w = ~ weights))         # Keeps column order
head(collap(wlddev, ~ country + decade, w = ~ weights,          # Aggregating weights using sum
            wFUN = list(fsum, fmax)))                           # and max (corresponding to mode)
wlddev$weights <- NULL

## Multi-Function Aggregation ----------------------------------
head(collap(wlddev, ~ country + decade, list(fmean, fNobs),     # Saving mean and Nobs
            cols = 9:12))

head(collap(wlddev, ~ country + decade,                         # Same using base R -> slower
            list(mean = mean,
                 Nobs = function(x, ...) sum(!is.na(x))),
            cols = 9:12, na.rm = TRUE))

lapply(collap(wlddev, ~ country + decade,                       # List output format
              list(fmean, fNobs), cols = 9:12, return = "list"), head)

head(collap(wlddev, ~ country + decade,                         # Long output format
            list(fmean, fNobs), cols = 9:12, return = "long"))

head(collap(wlddev, ~ country + decade,                         # Also aggregating categorical data,
            list(fmean, fNobs), return = "long_dupl"))          # and duplicating it 2 times

head(collap(wlddev, ~ country + decade,                         # Now also using 2 functions on
            list(fmean, fNobs), list(fmode, flast),             # categorical data
            keep.col.order = FALSE))

head(collap(wlddev, ~ country + decade,                         # More functions, string input,
            c("fmean","fsum","fNobs","fsd","fvar"),             # parallelized execution
            c("fmode","ffirst","flast","fNdistinct"),           # (choose more than 1 core,
            parallel = TRUE, mc.cores = 1L,                     # depending on your machine)
            keep.col.order = FALSE))

## Custom Aggregation ------------------------------------------
head(collap(wlddev, ~ country + decade,                         # Custom aggregation
            custom = list(fmean = 9:12, fsd = 9:10, fmode = 7:8)))

head(collap(wlddev, ~ country + decade,                         # Using column names
            custom = list(fmean = "PCGDP", fsd = c("LIFEEX","GINI"),
                          flast = "date")))

head(collap(wlddev, ~ country + decade,                         # Weighted parallelized custom
            custom = list(fmean = 9:12, fsd = 9:10,             # aggregation
                          fmode = 7:8), w = weights,
            wFUN = list(fsum, fmax),
            parallel = TRUE, mc.cores = 1L))

head(collap(wlddev, ~ country + decade,                         # No column reordering
            custom = list(fmean = 9:12, fsd = 9:10,
                          fmode = 7:8), w = weights,
            wFUN = list(fsum, fmax),
            parallel = TRUE, mc.cores = 1L, keep.col.order = FALSE))

## Piped Use --------------------------------------------------
wlddev %>% fgroup_by(country, decade) %>% collapg %>% head
wlddev %>% fgroup_by(country, decade) %>% collapg(w = ODA) %>% head
wlddev %>% fgroup_by(country, decade) %>% collapg(fmedian, flast) %>% head
wlddev %>% fgroup_by(country, decade) %>%
  collapg(custom = list(fmean = 9:12, fmode = 5:7, flast = 3)) %>% head