pcalg: pcSelect – R documentation

pcSelect

PC-Select: Estimate subgraph around a response variable

Description

The goal is feature selection: given a response variable y and a data matrix dm, we want to know which variables are “strongly influential” on y. The notion of influence is the same as in the PC algorithm, i.e., y and x (a column of dm) are considered associated if they remain correlated when conditioning on any subset of the remaining columns of dm. Therefore, only very strong relations will be found, and the selected variables are typically a subset of those chosen by other feature selection techniques. Robust correlation estimators are also available (see `corMethod`); using one of them makes the method robust.
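
To illustrate the kind of association meant here, the following base R sketch (not part of pcalg; the simulated data and all variable names are assumptions made for illustration) builds two variables that are correlated only through a common driver `z`. Their marginal correlation is clearly non-zero, but the partial correlation given `z` is close to zero, so a PC-type procedure would not regard them as associated.

```r
set.seed(1)
n <- 2000
z <- rnorm(n)
x <- z + rnorm(n)   # x depends on z only
y <- z + rnorm(n)   # y depends on z only

# Marginal correlation: non-zero, because z drives both x and y.
r_marginal <- cor(x, y)

# Partial correlation of x and y given z: correlate the residuals
# left after regressing each variable on z.
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

print(c(marginal = r_marginal, partial_given_z = r_partial))
```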

Usage

pcSelect(y, dm, alpha, corMethod = "standard",
         verbose = FALSE, directed = FALSE)

Arguments

`y`	response vector.
`dm`	data matrix (rows: samples/observations, columns: variables); `nrow(dm) == length(y)`.
`alpha`	significance level of individual partial correlation tests.
`corMethod`	a string determining the method for correlation estimation via `mcor()`; any method accepted by `mcor(*, method = "..")` can be used, e.g., `"Qn"` for one kind of robust correlation estimate.
`verbose`	`logical` or in {0,1,2}; `FALSE`, 0: no output; `TRUE`, 1: little output; 2: detailed output. Note that such diagnostic output may make the function considerably slower.
`directed`	logical; should the output graph be directed?

Details

This function essentially applies `pc` to the data matrix obtained by joining y and dm. Since the output is not concerned with the edges among the columns of dm, the algorithm is adapted accordingly: only edges between y and the columns of dm are tested. As a result, the runtime is typically reduced substantially and much larger datasets can be handled.
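
The adapted procedure can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the pcalg implementation: `partial_cor()` and `pc_select_sketch()` are hypothetical helpers, the correlation matrix is the plain sample correlation, and each test is assumed to be a two-sided Fisher z-test at level `alpha`. A column stays a candidate only as long as its partial correlation with y is significant for every conditioning set drawn from the other remaining candidates, with the set size growing by one in each round.

```r
# Partial correlation of variables i and j given the set S, computed from a
# correlation matrix C by inverting the relevant sub-block.
partial_cor <- function(C, i, j, S) {
  idx <- c(i, j, S)
  P <- solve(C[idx, idx, drop = FALSE])
  -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
}

# Hypothetical helper for illustration; NOT the pcalg code.
pc_select_sketch <- function(y, dm, alpha = 0.05) {
  n <- nrow(dm)
  p <- ncol(dm)
  C <- cor(cbind(y, dm))          # index 1 is y, indices 2..(p + 1) are dm
  cutoff <- qnorm(1 - alpha / 2)  # assumed two-sided test at level alpha
  active <- rep(TRUE, p)          # columns still considered associated with y
  z_min <- rep(Inf, p)            # smallest |z| seen so far for each column
  ord <- 0                        # current size of the conditioning sets

  # Continue while conditioning sets of size `ord` can still be formed.
  while (sum(active) - 1 >= ord) {
    for (x in which(active)) {
      others <- setdiff(which(active), x)
      if (length(others) < ord) next
      subsets <- if (ord == 0) {
        list(integer(0))
      } else {
        combn(length(others), ord, FUN = function(ix) others[ix],
              simplify = FALSE)
      }
      for (S in subsets) {
        r <- partial_cor(C, 1, x + 1, S + 1)
        z <- abs(sqrt(n - length(S) - 3) * atanh(r))  # Fisher z statistic
        z_min[x] <- min(z_min[x], z)
        if (z <= cutoff) {        # not significant: drop x and stop testing it
          active[x] <- FALSE
          break
        }
      }
    }
    ord <- ord + 1
  }
  names(active) <- colnames(dm)
  list(G = active, zMin = z_min)
}

# Simulated data (an assumption for illustration): y depends directly on
# x1 and x3; x2 is related to y only through x1; x4 is unrelated.
set.seed(42)
n <- 2000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
y <- 2 * x1 - 1.5 * x3 + rnorm(n)
res <- pc_select_sketch(y, cbind(x1, x2, x3, x4), alpha = 0.05)
print(res)
```

In this simulation x1 and x3 have strong direct effects and should be retained, while x2 and x4 are usually dropped; since every test is run at level `alpha`, an occasional false positive among the latter is possible.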

Value

`G`	A `logical` vector indicating which columns of `dm` are associated with `y`.
`zMin`	The minimal z-values when testing partial correlations between `y` and each column of `dm`. The larger the number, the more consistent the edge is with the data.
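
As a small sketch of how these two components might be used together, the snippet below works on a hypothetical list with the same shape as the return value; the names `V1` to `V4` and all numbers are invented for illustration and do not come from a real `pcSelect()` run. It lists the selected columns, ranks them by `zMin`, and compares the values with a standard normal quantile, assuming a two-sided test at level `alpha`.

```r
# Hypothetical result with the documented shape (values invented).
res <- list(
  G    = c(V1 = FALSE, V2 = TRUE, V3 = FALSE, V4 = TRUE),
  zMin = c(0.4, 6.2, 1.1, 12.5)
)

alpha <- 0.05
cutoff <- qnorm(1 - alpha / 2)   # two-sided standard normal quantile

# Names of the columns flagged as associated with the response.
selected <- names(res$G)[res$G]

# Selected columns ordered from strongest to weakest evidence.
ranked <- selected[order(res$zMin[res$G], decreasing = TRUE)]

print(ranked)
print(res$zMin[res$G] > cutoff)
```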

Author(s)

Markus Kalisch (kalisch@stat.math.ethz.ch) and Martin Maechler.

References

Buehlmann, P., Kalisch, M. and Maathuis, M.H. (2010). Variable selection for high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika 97, 261–278.

Examples

library(pcalg)

p <- 10
## generate and draw a random DAG:
suppressWarnings(RNGversion("3.5.0"))
set.seed(101)
myDAG <- randomDAG(p, prob = 0.2)
if (require(Rgraphviz)) {
  plot(myDAG, main = "randomDAG(10, prob = 0.2)")
}
## generate 1000 samples from the DAG using a standard normal error distribution
n <- 1000
d.mat <- rmvDAG(n, myDAG, errDist = "normal")

## Let's pretend that the 10th column is the response and the first 9
## columns are explanatory variables. Which of the first 9 variables
## "cause" the tenth variable?
y <- d.mat[, 10]
dm <- d.mat[, -10]
(pcS <- pcSelect(y, dm, alpha = 0.05))
## You see that variables 4, 5 and 6 are considered important.
## By inspecting zMin,
with(pcS, zMin[G])
## you can also see that the influence of variable 6
## is most evident from the data (its zMin is 18.64, so quite large; as
## a rule of thumb for judging what is large, you could use quantiles
## of the standard normal distribution)

pcalg

Methods for Graphical Models and Causal Inference

v2.7-2

GPL (>= 2)

Authors

Markus Kalisch [aut, cre], Alain Hauser [aut], Martin Maechler [aut], Diego Colombo [ctb], Doris Entner [ctb], Patrik Hoyer [ctb], Antti Hyttinen [ctb], Jonas Peters [ctb], Nicoletta Andri [ctb], Emilija Perkovic [ctb], Preetam Nandy [ctb], Philipp Ruetimann [ctb], Daniel Stekhoven [ctb], Manuel Schuerch [ctb], Marco Eigenmann [ctb], Leonard Henckel [ctb], Joris Mooij [ctb]

Initial release

2021-4-20