PC-Select: Estimate subgraph around a response variable
The goal is feature selection: given a response variable y and a data matrix dm, we want to know which variables are “strongly influential” on y. The type of influence is the same as in the PC-Algorithm, i.e., y and x (a column of dm) are associated if they remain correlated when conditioning on any subset of the remaining columns in dm. Therefore, only very strong relations will be found, and the selected variables are typically a subset of those chosen by other feature selection techniques. Note that robust correlation estimators are also available (see corMethod), which make the procedure robust.
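To make the notion of association concrete, here is a minimal base-R sketch that contrasts a marginal correlation with a partial correlation. It is an illustration only, not the package's code: the simulated chain x1 -> x2 -> y and the helpers `pcor_resid` and `fisher_z` are hypothetical names introduced here, and the two-sided Fisher z-test is assumed as the test for a (partial) correlation of zero.

```r
## Simulated chain x1 -> x2 -> y: x1 is correlated with y, but only through x2.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
y  <- x2 + rnorm(n)

## Hypothetical helper: partial correlation of a and b given the columns of S,
## computed from the residuals of two linear regressions on S.
pcor_resid <- function(a, b, S) {
  cor(resid(lm(a ~ S)), resid(lm(b ~ S)))
}

## Hypothetical helper: Fisher z-statistic for a (partial) correlation r
## estimated from n samples with k conditioning variables.
fisher_z <- function(r, n, k) sqrt(n - k - 3) * atanh(r)

alpha  <- 0.05
cutoff <- qnorm(1 - alpha / 2)   # assumed two-sided test at level alpha

z_marginal <- fisher_z(cor(y, x1), n, 0)                    # y vs x1, no conditioning
z_given_x2 <- fisher_z(pcor_resid(y, x1, cbind(x2)), n, 1)  # y vs x1 given x2

abs(z_marginal) > cutoff   # is x1 marginally associated with y?
abs(z_given_x2) > cutoff   # does the association survive conditioning on x2?
```

Under the criterion described above, x1 would count as influential only if the association persisted for every conditioning subset, which is why a variable that acts purely through another one tends to be dropped.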
pcSelect(y, dm, alpha, corMethod = "standard", verbose = FALSE, directed = FALSE)
y | response vector.
dm | data matrix (rows: samples/observations, columns: variables).
alpha | significance level of individual partial correlation tests.
corMethod | a string determining the method for correlation estimation via
verbose | Note that such diagnostic output may make the function considerably slower.
directed | logical; should the output graph be directed?
This function basically applies pc to the data matrix obtained by joining y and dm. Since the output is not concerned with the edges found within the columns of dm, the algorithm is adapted accordingly. As a result, the runtime is typically reduced substantially and larger datasets can be handled.
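The restriction to edges that involve y can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package implementation: `select_sketch` and its internal `z_stat` are hypothetical names, plain Pearson correlation and a two-sided Fisher z-test are assumed, and the return names G and zMin merely mirror the value fields of pcSelect().

```r
## Hypothetical sketch: search only the edges between y and the columns of dm.
select_sketch <- function(y, dm, alpha) {
  n <- nrow(dm); p <- ncol(dm)
  C <- cor(cbind(y, dm))             # index 1 is y, indices 2..(p+1) are dm columns
  cutoff <- qnorm(1 - alpha / 2)     # assumed two-sided test at level alpha

  ## z-statistic for the partial correlation of variables i and j given set S
  z_stat <- function(i, j, S) {
    P <- solve(C[c(i, j, S), c(i, j, S)])     # precision matrix of the sub-block
    r <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])   # partial correlation of i and j
    sqrt(n - length(S) - 3) * atanh(r)
  }

  keep  <- rep(TRUE, p)              # columns of dm still linked to y
  z_min <- rep(Inf, p)               # smallest |z| seen for each column
  ord   <- 0                         # current size of the conditioning sets
  while (sum(keep) > ord) {
    for (x in which(keep)) {
      others <- setdiff(which(keep), x)
      if (length(others) < ord) next
      ## all conditioning sets of size `ord` among the other remaining columns
      sets <- if (ord == 0) list(integer(0)) else
        combn(length(others), ord, function(idx) others[idx], simplify = FALSE)
      for (S in sets) {
        z <- abs(z_stat(1, x + 1, S + 1))
        z_min[x] <- min(z_min[x], z)
        if (z <= cutoff) { keep[x] <- FALSE; break }  # drop x, stop testing it
      }
    }
    ord <- ord + 1
  }
  list(G = keep, zMin = z_min)
}

## Usage on simulated data: column 1 affects y only through column 2.
set.seed(42)
n <- 1000
x <- matrix(rnorm(n * 4), n, 4)
x[, 2] <- x[, 1] + rnorm(n)
y <- 2 * x[, 2] + rnorm(n)
res <- select_sketch(y, x, alpha = 0.05)
res$G
```

Only pairs of the form (y, column) are ever tested, so the number of tests grows with the number of columns still linked to y rather than with all pairs of columns, which is the source of the savings described above.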
G | A
zMin | The minimal z-values when testing partial correlations between
Markus Kalisch (kalisch@stat.math.ethz.ch) and Martin Maechler.
Buehlmann, P., Kalisch, M. and Maathuis, M.H. (2010). Variable selection for high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika 97, 261–278.
pc, which is the more general version of this function; pcSelect.presel, which applies pcSelect() twice.
library(pcalg)

p <- 10
## generate and draw random DAG :
suppressWarnings(RNGversion("3.5.0"))
set.seed(101)
myDAG <- randomDAG(p, prob = 0.2)
if (require(Rgraphviz)) {
  plot(myDAG, main = "randomDAG(10, prob = 0.2)")
}

## generate 1000 samples of DAG using standard normal error distribution
n <- 1000
d.mat <- rmvDAG(n, myDAG, errDist = "normal")

## let's pretend that the 10th column is the response and the first 9
## columns are explanatory variables. Which of the first 9 variables
## "cause" the tenth variable?
y <- d.mat[,10]
dm <- d.mat[,-10]
(pcS <- pcSelect(y, dm, alpha=0.05))
## You see that variables 4, 5, 6 are considered as important.

## By inspecting zMin,
with(pcS, zMin[G])
## you can also see that the influence of variable 6
## is most evident from the data (its zMin is 18.64, so quite large - as
## a rule of thumb for judging what is large, you could use quantiles
## of the Standard Normal Distribution)
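The rule of thumb in the last comment can be made concrete with base R. The sketch below assumes that each test is two-sided at level alpha, so that the relevant quantile is qnorm(1 - alpha/2); the value 18.64 is the one quoted in the example comments, not recomputed here.

```r
alpha  <- 0.05
cutoff <- qnorm(1 - alpha / 2)   # two-sided critical value, roughly 1.96

z_min <- 18.64                   # zMin quoted above for variable 6
z_min > cutoff                   # far beyond the critical value
2 * pnorm(-z_min)                # corresponding two-sided tail probability
```

A zMin value only slightly above the cutoff indicates an edge that barely survived its weakest test, whereas a value many times larger points to an association that is consistent with the data under every conditioning set that was tried.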