Heller-Heller-Gorfine Tests of Independence and Equality of Distributions
These functions perform Heller-Heller-Gorfine (HHG) tests. Implemented are tests of independence between two random vectors (x and y) and tests of equality of 2 or more multivariate distributions.
hhg.test(Dx, Dy, ties = T, w.sum = 0, w.max = 2, nr.perm = 10000, is.sequential = F, seq.total.nr.tests = 1, seq.alpha.hyp = NULL, seq.alpha0 = NULL, seq.beta0 = NULL, seq.eps = NULL, nr.threads = 0, tables.wanted = F, perm.stats.wanted = F) hhg.test.k.sample(Dx, y, w.sum = 0, w.max = 2, nr.perm = 10000, is.sequential = F, seq.total.nr.tests = 1, seq.alpha.hyp = NULL, seq.alpha0 = NULL, seq.beta0 = NULL, seq.eps = NULL, nr.threads = 0, tables.wanted = F, perm.stats.wanted = F) hhg.test.2.sample(Dx, y, w.sum = 0, w.max = 2, nr.perm = 10000, is.sequential = F, seq.total.nr.tests = 1, seq.alpha.hyp = NULL, seq.alpha0 = NULL, seq.beta0 = NULL, seq.eps = NULL, nr.threads = 0, tables.wanted = F, perm.stats.wanted = F)
Dx |
a symmetric matrix of doubles, where element [i, j] is a norm-based distance between the |
Dy |
same as |
y |
a numeric or factor vector, whose values or levels are in (0, 1, ..., |
ties |
a boolean specifying whether ties in |
w.sum |
minimum expected frequency taken into account when computing the |
w.max |
minimum expected frequency taken into account when computing the |
nr.perm |
number of permutations from which a p-value is to be estimated (must be non-negative). Can be specified as zero if only the observed statistics are wanted, without p-values. The actual number of permutations used may be slightly larger when using multiple processing cores. A Wald sequential probability ratio test is optionally implemented, which may push the p-value to 1 and stop permuting if it becomes clear that it is going to be high. See Details below. |
is.sequential |
boolean flag specifying whether Wald's sequential test is desired (see Details), otherwise a simple Monte-Carlo computation of |
seq.total.nr.tests |
the total number of hypotheses in the family of hypotheses simultaneously tested. When this optional argument is supplied, it is used to derive default values for the parameters of the Wald sequential test. The default derivation is done assuming a nominal
|
seq.alpha.hyp |
the nominal test size for this single test within the multiple testing procedure. |
seq.alpha0 |
the nominal test size for testing the side null hypothesis of p-value > |
seq.beta0 |
one minus the power for testing the side null hypothesis of p-value > |
seq.eps |
approximation margin around |
nr.threads |
number of processing cores to use for p-value permutation. If left as zero, will try to use all available cores. |
tables.wanted |
boolean flag determining whether to output detailed local 2x2 contingency tables. |
perm.stats.wanted |
boolean flag determining whether to output statistics values computed for all permutations (representing null distributions). |
The HHG test (Heller et al., 2013) is a powerful nonparametric test for association (or, alternatively, independence) between two random vectors (say, x and y) of arbitrary dimensions. It is consistent against virtually any form of dependence, and has been shown to offer significantly more power than alternative approaches in the face of simple, and, more importantly, complex high-dimensional dependence. The test relies on norm-based distance metrics in x and (separately) in y. The choice of metric for either variable is up to the user (e.g. Euclidean, Mahalanobis, Manhattan, or whatever is appropriate for the data). The general implementation in hhg.test takes the distance matrices computed on an observed sample, Dx and Dy, and starts form there.
hhg.test.k.sample and hhg.test.2.sample are optimized special cases of the general test, where y
is a partition of samples in x
to K
or 2 groups, respectively.
When enabled by is.sequential
, Wald's sequential test is implemented as suggested by Fay et al. (2007) in order to reduce the O(nr.perm * n^2 * log(n))
computational compelxity of the permutation test to a more managable size. Especially when faced with high multiplicity, say M
simultaneous tests, the necessary number of iterations may be quite large. For example, if it is desired to control the family-wise error rate (FWER) at level alpha
using the Bonferroni correction, one needs a p-value of alpha / M
to establish significance. This seems to suggest that the minimum number of permutations required is nr.perm = M / alpha
. However, if it becomes clear after a smaller number of permutations that the null cannot be rejected, no additional permutations are needed, and the p-value can be conservatively estimated as 1. Often, only a handful of hypotheses in a family are expected to be non-null. In this case the number of permutations for testing all hypotheses using Wald's procedure is expected to be much lower than the full M^2 / alpha
.
The target significance level of the sequential test is specified in the argument seq.alpha.hyp
. It depends on the number of hypotheses M
, and the type of multiplicity correction wanted. For the Bonferroni correction, the threshold is alpha / M
. For the less conservative procedure of Benjamini & Hochberg (1995), it is M1 * q / M
, where q
is the desired false discovery rate (FDR), and M1
is the (unknwon) number of true non-null hypotheses. Although M1
is unknown, the investigator can sometimes estimate it conservatively (e.g., if at most 0.02 of the hypotheses are expected to be non-null, set M1 = 0.02 * M
).
Four statistics described in the original HHG paper are returned:
sum.chisq
- sum of Pearson chi-squared statistics from the 2x2 contingency tables considered.
sum.lr
- sum of liklihood ratio ("G statistic") values from the 2x2 tables.
max.chisq
- maximum Pearson chi-squared statistic from any of the 2x2 tables.
max.lr
- maximum G statistic from any of the 2x2 tables.
If nr.perm > 0
, then estimated permutation p-values are returned as:
perm.pval.hhg.sc
, perm.pval.hhg.sl
, perm.pval.hhg.mc
, perm.pval.hhg.ml
In order to give information that may help localize where in the support of the distributions of x
and y
there is departure from independence, if tables.wanted
is true, the 2x2 tables themselves are provided in:
extras.hhg.tbls
This is a n^2
by 4 matrix, whose columns are A11
, A12
, A21
, A22
as denoted in the original HHG paper. Row r of the matrix corresponds to S_ij
in the same paper, where i = 1 + floor((r - 1) / n)
, and j = 1 + ((r - 1) %% n)
. Since S_ij
is never computed for i == j
, rows (0:(n - 1)) * n + (1:n)
contain NA
s on purpose. The only other case where NAs will occur are for the 2 and K-sample tests, where only one table is given for any x-tied samples (the other tables at indices with the same x value are redundant).
Finally, as a means of estimating the null distributions of computed statistics, if perm.stats.wanted
is true, the statistics computed for every permutation of the data performed during testing is outputted as:
extras.perm.stats
A data.frame with one variable per statistic and one sample per permutation.
n
- The sample size
y
- Group sizes for hhg.test.2.sample
and hhg.test.k.sample
tests.
stat.type
- String holding the type of test used: 'hhg.test'
, 'hhg.test.2.sample'
or 'hhg.test.k.sample'
The computational complexity of the test is n^2*log(n)
, where n
is the number of samples. Thus, when the sample size is large, computing the test for many permutations may take a long time.
P-values are reproducible when setting set.seed(seed_value) before peforming the permutation test (also when computation is done in multithread). This feature is currently implemented only in hhg.test
and not in hhg.test.k.sample
and hhg.test.2.sample
.
Shachar Kaufman, based in part on an earlier version by Ruth Heller and Yair Heller.
Heller, R., Heller, Y., & Gorfine, M. (2013). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2), 503-510.
Fay, M., Kim., H., & Hachey, M. (2007). On using truncated sequential probability ratio test boundaries for Monte Carlo implementation of hypothesis tests. Journal of Computational and Graphical Statistics, 16(4), 946-967.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.
## Not run: ## 1. The test of independence ## 1.1. A non-null univariate example ## Generate some data from the Circle example n = 50 X = hhg.example.datagen(n, 'Circle') plot(X[1,], X[2,]) ## Compute distance matrices, on which the HHG test will be based Dx = as.matrix(dist((X[1,]), diag = TRUE, upper = TRUE)) Dy = as.matrix(dist((X[2,]), diag = TRUE, upper = TRUE)) ## Compute HHG statistics, and p-values using 1000 random permutations set.seed(1) #set the seed for the random permutations hhg = hhg.test(Dx, Dy, nr.perm = 1000) ## Print the statistics and their permutation p-value hhg ## 1.2. A null univariate example n = 50 X = hhg.example.datagen(n, '4indclouds') Dx = as.matrix(dist((X[1,]), diag = TRUE, upper = TRUE)) Dy = as.matrix(dist((X[2,]), diag = TRUE, upper = TRUE)) set.seed(1) #set the seed for the random permutations hhg = hhg.test(Dx, Dy, nr.perm = 1000) hhg ## 1.3. A multivariate example library(MASS) n = 50 p = 5 x = t(mvrnorm(n, rep(0, p), diag(1, p))) y = log(x ^ 2) Dx = as.matrix(dist((t(x)), diag = TRUE, upper = TRUE)) Dy = as.matrix(dist((t(y)), diag = TRUE, upper = TRUE)) set.seed(1) #set the seed for the random permutations hhg = hhg.test(Dx, Dy, nr.perm = 1000) hhg ## 2. The k-sample test n = 50 D = hhg.example.datagen(n, 'FourClassUniv') Dx = as.matrix(dist(D$x, diag = TRUE, upper = TRUE)) hhg = hhg.test.k.sample(Dx, D$y, nr.perm = 1000) hhg ## End(Not run)
Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.