Fast (Grouped) Distinct Value Count for Matrix-Like Objects
fNdistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.
fNdistinct(x, ...)
## Default S3 method:
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
           use.g.names = TRUE, ...)
## S3 method for class 'matrix'
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
           use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'data.frame'
fNdistinct(x, g = NULL, TRA = NULL, na.rm = TRUE,
           use.g.names = TRUE, drop = TRUE, ...)
## S3 method for class 'grouped_df'
fNdistinct(x, TRA = NULL, na.rm = TRUE,
           use.g.names = FALSE, keep.group_vars = TRUE, ...)
| x | a vector, matrix, data frame or grouped data frame (class 'grouped_df'). | 
| g | a factor,  | 
| TRA | an integer or quoted operator indicating the transformation to perform:
1 - "replace_fill"     |     2 - "replace"     |     3 - "-"     |     4 - "-+"     |     5 - "/"     |     6 - "%"     |     7 - "+"     |     8 - "*"     |     9 - "%%"     |     10 - "-%%". See  | 
| na.rm | logical.  | 
| use.g.names | logical. Make group-names and add them to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.tables. | 
| drop | matrix and data.frame method: Logical.  | 
| keep.group_vars | grouped_df method: Logical.  | 
| ... | arguments to be passed to or from other methods. | 
fNdistinct implements a fast algorithm to find the number of distinct values, utilizing index-hashing implemented in the Rcpp::sugar::IndexHash class.
If na.rm = TRUE (the default), missing values are skipped, which yields substantial performance gains in data with many missing values. If na.rm = FALSE, missing values are treated like any other value and read into the hash-map. Thus with the former, a numeric vector c(1.25,NaN,3.56,NA) has a distinct value count of 2, whereas the latter returns a distinct value count of 4.
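The two ways of handling missing values can be illustrated with plain base R. This is only a sketch of the counting semantics for the example vector above, not the hash-based implementation; the names n_skip and n_keep are made up for illustration.

```r
# Base R illustration of the two missing-value modes (not the collapse code)
x <- c(1.25, NaN, 3.56, NA)

# Analogue of skipping missing values: is.na() is TRUE for both NA and NaN,
# so both are dropped before counting
n_skip <- length(unique(x[!is.na(x)]))

# Analogue of keeping missing values: unique() treats NA and NaN as two
# separate values, so each adds one to the count
n_keep <- length(unique(x))

print(c(skip = n_skip, keep = n_keep))
```

The first count covers only 1.25 and 3.56; the second also counts NaN and NA once each.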
Grouped computations are performed by mapping the data to a sparse-array and then hash-mapping each group. This is often not much slower than using a larger hash-map for the entire data when g = NULL.
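For reference, the result of a grouped distinct value count can be reproduced (much more slowly) in base R. The sketch below assumes nothing about the sparse-array or hash-map internals described above; grouped_ndistinct is a hypothetical helper written for this illustration and is not part of any package.

```r
# Hypothetical base R helper: distinct value count per group.
# It mirrors the result only, not the hashing strategy.
grouped_ndistinct <- function(x, g, na.rm = TRUE) {
  vapply(split(x, g), function(v) {
    if (na.rm) v <- v[!is.na(v)]  # drop NA and NaN before counting
    length(unique(v))
  }, integer(1))
}

# One count per month, named by the group labels
res <- grouped_ndistinct(airquality$Solar.R, airquality$Month)
print(res)

# Small hand-checkable case: group "a" holds 1, 1, 2 and group "b" holds NA, 3, 3
toy_skip <- grouped_ndistinct(c(1, 1, 2, NA, 3, 3), rep(c("a", "b"), each = 3))
toy_keep <- grouped_ndistinct(c(1, 1, 2, NA, 3, 3), rep(c("a", "b"), each = 3),
                              na.rm = FALSE)
```

Because split() creates one sub-vector per group, this approach allocates heavily on data with many groups, which is the cost a grouped hashing routine is designed to avoid.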
fNdistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (e.g. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.
Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.
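What "x transformed by its distinct value count" means can be sketched in base R. The readings used here are assumptions based on the operator names in the TRA argument: "replace_fill" is taken to overwrite every observation with its group's count, and "/" to divide every observation by it. All variable names are illustrative.

```r
x <- c(1, 1, 2, 5, 5, 5)
g <- c("a", "a", "a", "b", "b", "b")

# Distinct value count per group: "a" has values 1 and 2, "b" has only 5
n_by_group <- tapply(x, g, function(v) length(unique(v[!is.na(v)])))

# Expand the group-level counts back to one value per observation
n_expanded <- as.vector(n_by_group[g])

# Assumed meaning of "replace_fill": each value becomes its group's count
replaced <- n_expanded

# Assumed meaning of "/": each value is divided by its group's count
divided <- x / n_expanded

print(replaced)
print(divided)
```

The transformed result has the same length as x, whereas the plain grouped count has one element per group.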
## default vector method
fNdistinct(airquality$Solar.R)                   # Simple distinct value count
fNdistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count
## data.frame method
fNdistinct(airquality)
fNdistinct(airquality, airquality$Month)
fNdistinct(wlddev)                               # Works with data of all types!
head(fNdistinct(wlddev, wlddev$iso3c))
## matrix method
aqm <- qM(airquality)
fNdistinct(aqm)                                  # Also works for character or logical matrices
fNdistinct(aqm, airquality$Month)
 
## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
airquality %>% group_by(Month) %>% fNdistinct
wlddev %>% group_by(country) %>%
             select(PCGDP,LIFEEX,GINI,ODA) %>% fNdistinct