mixOmics: tune.spca – R documentation


tune.spca

Tune number of selected variables for spca

Description

This function performs sparse PCA and optimises the number of variables to keep on each component using repeated cross-validation.

Usage

tune.spca(
  X,
  ncomp = 2,
  nrepeat = 1,
  folds,
  test.keepX,
  center = TRUE,
  scale = TRUE,
  BPPARAM = SerialParam()
)

Arguments

`X`	a numeric matrix (or data frame) which provides the data for the sparse principal components analysis. It should not contain missing values.
`ncomp`	Integer. If the data are complete, `ncomp` sets the number of components and associated eigenvalues to display from the `pcasvd` algorithm; if the data have missing values, `ncomp` gives the number of components kept to reconstitute the data using the NIPALS algorithm. If `NULL`, the function sets `ncomp = min(nrow(X), ncol(X))`.
`nrepeat`	Number of times the Cross-Validation process is repeated.
`folds`	Number of folds in 'Mfold' cross-validation. See details.
`test.keepX`	Numeric vector of the different numbers of variables to test from the X data set.
`center`	(Default=TRUE) Logical, whether the variables should be shifted to be zero centered. Only set to FALSE if the data have already been centered. Alternatively, a vector of length equal to the number of columns of `X` can be supplied. The value is passed to `scale`. If the data contain missing values, columns should be centered for reliable results.
`scale`	(Default=TRUE) Logical indicating whether the variables should be scaled to have unit variance before the analysis takes place.
`BPPARAM`	A BiocParallelParam object indicating the type of parallelisation. See examples.

Details

For the first component, the function takes a grid of candidate numbers of variables to select (keepX) and, for each repeat and fold, splits the data into training and test sets. The components extracted from each split are compared against those from an spca model fitted on all the data to determine the optimal keepX. To keep at least 3 samples in each test set, which is needed for reliable scaling of the test data, folds must be <= floor(nrow(X)/3).
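The comparison described above can be sketched in base R. This is only an illustration under stated assumptions, not the mixOmics implementation: `sparse_loading()` and `cv_cor()` are hypothetical helpers, sparsity is imitated by hard-thresholding the first right singular vector, the data are simulated, and scaling is done once on the full data for brevity.

```r
# Hypothetical stand-in for sparse PCA: keep the keepX largest-magnitude
# loadings of the first right singular vector (NOT the mixOmics algorithm).
sparse_loading <- function(X, keepX) {
  v <- svd(X, nu = 0, nv = 1)$v[, 1]
  v[rank(-abs(v), ties.method = "first") > keepX] <- 0
  v / sqrt(sum(v^2))
}

# Mean absolute correlation between test-set scores obtained from the
# training loadings and the scores of the model fitted on all samples.
cv_cor <- function(X, keepX, folds) {
  # Constraint from the Details section: at least 3 samples per test set.
  stopifnot(folds <= floor(nrow(X) / 3))
  X <- scale(X, center = TRUE, scale = TRUE)  # simplification: scaled once
  full_scores <- drop(X %*% sparse_loading(X, keepX))
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(X)))
  cors <- vapply(seq_len(folds), function(k) {
    test <- fold_id == k
    v_train <- sparse_loading(X[!test, , drop = FALSE], keepX)
    test_scores <- drop(X[test, , drop = FALSE] %*% v_train)
    # The sign of a component is arbitrary, so compare absolute correlation.
    abs(cor(test_scores, full_scores[test]))
  }, numeric(1))
  mean(cors)
}

set.seed(1)
X <- matrix(rnorm(30 * 10), nrow = 30)  # simulated: 30 samples, 10 variables
res <- sapply(c(2, 5, 8), function(k) cv_cor(X, keepX = k, folds = 5))
res
```

Each element of `res` corresponds to one candidate keepX. In this toy setting, a higher value means the components from the splits agree more closely with the component from the full data.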

The number of selected variables for the subsequent components is then optimised sequentially. If the number of observations is small (e.g. < 30), Leave-One-Out Cross-Validation is recommended; it can be achieved by setting folds = nrow(X).

Value

A tune.spca object containing:

call: The function call
choice.keepX: The selected number of variables on each component
cor.comp: The correlations between the components from the cross-validated studies and those from the study which used all of the data in training.
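One possible way to use the returned object is to pass `choice.keepX` to a final `spca()` fit. The sketch below assumes mixOmics is installed and reuses the `nutrimouse` data from the Examples section; the object names `tune.res`, `keepX` and `final.spca` are illustrative only.

```r
library(mixOmics)

data("nutrimouse")
set.seed(42)

# Tune the number of variables to keep on each of two components.
tune.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = 5,
    folds = 3,
    test.keepX = seq(5, 15, 5)
)

# Selected number of variables per component.
keepX <- tune.res$choice.keepX
keepX

# Fit a final sparse PCA with the tuned keepX values.
final.spca <- spca(X = nutrimouse$lipid, ncomp = 2, keepX = keepX)

# Names of the variables selected on the first component.
selectVar(final.spca, comp = 1)$name
```

The `cor.comp` element can be inspected with `str(tune.res$cor.comp)` to see the correlations behind the choice.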

Examples

library(mixOmics)
data("nutrimouse")
set.seed(42)
nrepeat <- 5
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5)
)
tune.spca.res
plot(tune.spca.res)
## Not run: 
## parallel processing of the repeats using BiocParallel with more workers (CPUs)
## You can use BiocParallel::MulticoreParam() on non-Windows machines
## for faster computation
BPPARAM <- BiocParallel::SnowParam(workers = max(parallel::detectCores()-1, 2))
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5),
    BPPARAM = BPPARAM
)
plot(tune.spca.res)

## End(Not run)

mixOmics

Omics Data Integration Project

v6.14.1

GPL (>= 2)

Authors

Kim-Anh Le Cao [aut, cre], Florian Rohart [aut], Ignacio Gonzalez [aut], Sebastien Dejean [aut], Al Abadi [ctb], Benoit Gautier [ctb], Francois Bartolo [ctb], Pierre Monget [ctb], Jeff Coquery [ctb], FangZou Yao [ctb], Benoit Liquet [ctb]

Initial release

2021-04-11