discretization

Discretization of a Possibly Continuous Data Frame of Random Variables based on their distribution


Description

This function discretizes a data frame of possibly continuous random variables using discretization rules. The discretization algorithms are unsupervised and univariate. See Details for the complete list of rules (the number of states of each random variable can also be provided).

Usage

discretization(data.df = NULL,
               data.dists = NULL, 
               discretization.method = "sturges", 
               nb.states = FALSE)

Arguments

data.df

a data frame containing the data to discretize. Binary variables must be declared as factors and all other variables as numeric vectors. The data frame must be named.

data.dists

a named list giving the distribution for each node in the network.

discretization.method

a character vector giving the discretization method to use; see details. If a number is provided, the variable will be discretized by equal binning.

nb.states

logical variable to select the output. If set to TRUE, a list with the discretized data frame and the number of states of each variable is returned. If set to FALSE, only the discretized data frame is returned.
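
As an illustration of the equal binning mentioned for discretization.method, here is a minimal base R sketch. It assumes that "equal binning" means equal-width bins over the observed range, and the helper equal_bins() is hypothetical and not part of abn.

```r
# Hypothetical helper (not part of abn): split x into k equal-width bins
# spanning the observed range of the data.
equal_bins <- function(x, k) {
  breaks <- seq(from = min(x), to = max(x), length.out = k + 1)
  # include.lowest = TRUE keeps the minimum value inside the first bin
  cut(x, breaks = breaks, include.lowest = TRUE)
}

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
binned <- equal_bins(x, k = 4)

# Table of counts per bin
table(binned)
```

The result is a factor with k levels, so the table of counts has one entry per bin.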

Details

discretization() supports multiple rules for discretization. Below is the list of supported rules. IQR() stands for interquartile range.

fd stands for the Freedman-Diaconis rule. The number of bins is given by:

range(x) * n^{1/3} / (2 * IQR(x))

The Freedman-Diaconis rule is known to be less sensitive to outliers than Scott's rule.
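
The Freedman-Diaconis expression can be evaluated directly in base R. The sketch below assumes the formula is read as the range divided by the bin width 2 * IQR(x) * n^{-1/3}; the helper fd_bins() is hypothetical, and rounding up with ceiling() is an assumption rather than documented behavior of discretization().

```r
# Hypothetical helper (not part of abn): Freedman-Diaconis number of bins,
# i.e. the range of x divided by the bin width 2 * IQR(x) * n^(-1/3).
fd_bins <- function(x) {
  n <- length(x)
  diff(range(x)) * n^(1/3) / (2 * IQR(x))
}

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
k_fd <- fd_bins(x)

# Round up to obtain a whole number of bins (assumed rounding convention)
ceiling(k_fd)
```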

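doane stands for Doane's rule. The number of bins is given by:

1 + \log_{2}(n) + \log_{2}\left(1 + \frac{|g|}{\sigma_{g}}\right)

where g is the estimated skewness of the data and \sigma_{g} = \sqrt{6(n-2) / ((n+1)(n+3))}. This is a modification of Sturges' formula, which attempts to improve its performance with non-normal data.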
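
A minimal base R sketch of Doane's formula follows. It assumes g is the moment-based sample skewness and that \sigma_{g} = \sqrt{6(n-2) / ((n+1)(n+3))}, as in the usual statement of Doane's rule; the helper doane_bins() is hypothetical and may differ from what discretization() computes internally.

```r
# Hypothetical helper (not part of abn): Doane's number of bins.
doane_bins <- function(x) {
  n <- length(x)
  m <- mean(x)
  # Moment-based sample skewness (assumed estimator for g)
  g <- mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
  # Standard deviation of the skewness under normality
  sigma_g <- sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
  1 + log2(n) + log2(1 + abs(g) / sigma_g)
}

set.seed(1)
x_skewed <- rexp(100, rate = 1)
k_doane <- doane_bins(x_skewed)
k_sturges <- 1 + log2(length(x_skewed))

# Doane's value is never below Sturges' value; the gap grows with skewness
c(doane = k_doane, sturges = k_sturges)
```

For perfectly symmetric data the skewness term vanishes and the expression reduces to Sturges' formula.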

sqrt stands for the square-root rule. The number of bins is given by:

√(n)

cencov stands for Cencov's rule. The number of bins is given by:

n^{1/3}

rice stands for Rice's rule. The number of bins is given by:

2 n^{1/3}

terrell-scott stands for Terrell-Scott's rule. The number of bins is given by:

(2 n)^{1/3}

The Cencov, Rice, and Terrell-Scott rules are known to over-estimate the number of bins k compared to other rules, due to their simplicity.
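
The sqrt, cencov, rice, and terrell-scott rules depend only on the sample size n, so they can be compared side by side. The sketch below evaluates the four expressions for an arbitrarily chosen n = 100; the values are left unrounded because the rounding convention used by discretization() is not stated here.

```r
# Illustrative sample size (arbitrary choice)
n <- 100

# Number of bins suggested by each rule that depends only on n
simple_rules <- c(
  "sqrt"          = sqrt(n),
  "cencov"        = n^(1 / 3),
  "rice"          = 2 * n^(1 / 3),
  "terrell-scott" = (2 * n)^(1 / 3)
)

round(simple_rules, 2)
```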

sturges stands for Sturges's rule. The number of bins is given by:

1 + \log_2(n)

scott stands for Scott's rule. The number of bins is given by:

range(x) / (σ(x) n^{-1/3})
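
Sturges's and Scott's expressions can be sketched in base R as follows. The Scott expression is read here as the range divided by the product σ(x) n^{-1/3}, which is an assumption about the intended grouping; the variable names are illustrative only.

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
n <- length(x)

# Sturges's rule: depends only on the sample size
k_sturges <- 1 + log2(n)

# Scott's rule as written in this page: range over (sd * n^(-1/3))
k_scott <- diff(range(x)) / (sd(x) * n^(-1 / 3))

c(sturges = k_sturges, scott = k_scott)
```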

Value

The discretized data frame, or a list containing the table of counts for each bin of the discretized data frame.

Author(s)

Gilles Kratzer

References

Garcia, S., et al. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25.4, 734-750.

Cebeci, Z. and Yildiz, F. (2017). Unsupervised Discretization of Continuous Variables in a Chicken Egg Quality Traits Dataset. Turkish Journal of Agriculture-Food Science and Technology, 5.4, 315-320.

Examples

library(abn)  # provides discretization() and entropyData()

## Generate random variable
rv <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list("gaussian")
names(dist) <- c("rv")

## Compute the entropy through discretization
entropyData(freqs.table = discretization(data.df = rv, data.dists = dist,
            discretization.method = "sturges", nb.states = FALSE))

abn

Modelling Multivariate Data with Additive Bayesian Networks

v2.5-0
GPL (>= 2)
Authors
Gilles Kratzer [aut, cre] (<https://orcid.org/0000-0002-5929-8935>), Fraser Iain Lewis [aut] (<https://orcid.org/0000-0003-4580-2712>), Reinhard Furrer [ctb] (<https://orcid.org/0000-0002-6319-2332>), Marta Pittavino [ctb] (<https://orcid.org/0000-0002-1232-1034>)
Initial release
2021-04-21
