Tune number of selected variables for spca
This function performs sparse PCA and optimises the number of variables to keep on each component using repeated cross-validation.
tune.spca( X, ncomp = 2, nrepeat = 1, folds, test.keepX, center = TRUE, scale = TRUE, BPPARAM = SerialParam() )
X |
a numeric matrix (or data frame) which provides the data for the sparse principal components analysis. It should not contain missing values. |
ncomp |
Integer, if data is complete |
nrepeat |
Number of times the Cross-Validation process is repeated. |
folds |
Number of folds in 'Mfold' cross-validation. See details. |
test.keepX |
numeric vector for the different number of variables to test from the X data set |
center |
(Default=TRUE) Logical, whether the variables should be shifted
to be zero centered. Only set to FALSE if data have already been centered.
Alternatively, a vector of length equal to the number of columns of |
scale |
(Default=TRUE) Logical indicating whether the variables should be scaled to have unit variance before the analysis takes place. |
BPPARAM |
A BiocParallelParam object indicating the type of parallelisation. See examples. |
Essentially, for the first component, and for a grid of the number of variables to select (keepX), a number of repeats and folds, data are split into training and test sets, and the extracted components are compared against those from a spca model fitted on all the data to ascertain the optimal keepX. In order to keep at least 3 samples in each test set for reliable scaling of the test data for comparison, folds must be <= floor(nrow(X)/3). The number of selected variables for the following components will then be sequentially optimised. If the number of observations is small (e.g. < 30), it is recommended to use Leave-One-Out Cross-Validation, which can be achieved by setting folds = nrow(X).
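The constraint on folds can be checked before calling the function. The following is a minimal base R sketch, not the function's internal fold assignment: the matrix X is simulated and the names max_folds and fold_id are hypothetical, introduced only for illustration.

```r
# Hypothetical data: 40 samples, 10 variables (simulated, not from this page)
set.seed(1)
X <- matrix(rnorm(40 * 10), nrow = 40, ncol = 10)

# Largest number of folds that leaves at least 3 samples in each test set
max_folds <- floor(nrow(X) / 3)

# Pick a number of folds and confirm it respects the bound
folds <- 5
stopifnot(folds <= max_folds)

# Assign every sample to one fold, as evenly as possible, in random order
fold_id <- sample(rep(seq_len(folds), length.out = nrow(X)))

# Number of samples that end up in each test set
table(fold_id)
```

With 40 samples the bound is 13 folds, so folds = 5 is acceptable and each test set holds 8 samples.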
A tune.spca object containing:
The function call
The selected number of variables on each component
The correlations between the components from the cross-validated studies and those from the study which used all of the data in training.
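The idea behind the correlation criterion can be illustrated with ordinary PCA. The sketch below uses base R prcomp as a stand-in for spca and a single train/test split instead of repeated cross-validation, so it is an assumption-laden illustration rather than the function's implementation. The data and the names latent, test_idx and cor_comp are hypothetical.

```r
# Hypothetical data: 30 samples, 6 variables sharing one latent direction
set.seed(1)
n <- 30
p <- 6
latent <- rnorm(n)
X <- sapply(seq_len(p), function(j) latent + rnorm(n, sd = 0.3))

# Reference model: ordinary PCA (stand-in for spca) on all samples
full <- prcomp(X, center = TRUE, scale. = TRUE)

# Hold out the first 10 samples and refit on the remaining 20
test_idx <- 1:10
train <- prcomp(X[-test_idx, ], center = TRUE, scale. = TRUE)

# Scale the held-out samples with the training centre and scale,
# then project them onto the first training loading vector
test_scaled <- scale(X[test_idx, ], center = train$center, scale = train$scale)
test_scores <- as.numeric(test_scaled %*% train$rotation[, 1])

# Correlate with the first component of the full model for the same samples;
# the sign of a component is arbitrary, so the absolute value is taken
cor_comp <- abs(cor(test_scores, full$x[test_idx, 1]))
cor_comp
```

A value close to 1 means the component recovered from the training subset agrees with the one obtained from all the data.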
data("nutrimouse")
set.seed(42)
nrepeat <- 5
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5)
)
tune.spca.res
plot(tune.spca.res)
## Not run:
## parallel processing using BiocParallel on repeats with more workers (cpus)
## You can use BiocParallel::MulticoreParam() on non_Windows machines
## for faster computation
BPPARAM <- BiocParallel::SnowParam(workers = max(parallel::detectCores()-1, 2))
tune.spca.res <- tune.spca(
    X = nutrimouse$lipid,
    ncomp = 2,
    nrepeat = nrepeat,
    folds = 3,
    test.keepX = seq(5, 15, 5),
    BPPARAM = BPPARAM
)
plot(tune.spca.res)
## End(Not run)