HHG: hhg.univariate.nulltable.from.mstats – R documentation

hhg.univariate.nulltable.from.mstats

Constructor of Distribution Free Null Table Using Existing Statistics

Description

This function converts null test statistics for different partition sizes into the null table object needed to compute p-values efficiently.

Usage

hhg.univariate.nulltable.from.mstats(m.stats,minm,maxm,type,variant,
size,score.type,aggregation.type, w.sum = 0, w.max = 2,
keep.simulation.data=F,nr.atoms = nr_bins_equipartition(sum(size)),
compress=F,compress.p0=0.001,compress.p=0.99,compress.p1=0.000001)

Arguments

`m.stats`	A matrix with B rows and `maxm` - `minm` + 1 columns, where each row contains the test statistics for partition sizes m from `minm` to `maxm`, all computed on the same permutation of the input sample.
`minm`	The minimum partition size of the ranked observations, default value is 2.
`maxm`	The maximum partition size of the ranked observations.
`type`	A character string specifying the test type, must be one of `"KSample"`, `"Independence"`
`variant`	A character string specifying the partition type. If `type="Independence"`, must be one of `"ADP"`, `"DDP"`, `"ADP-ML"`, `"ADP-EQP"`, `"ADP-EQP-ML"`. If `type="KSample"`, must be `"KSample-Variant"` or `"KSample-Equipartition"`.
`size`	The sample size if `type="Independence"`, and a vector of group sizes if `type="KSample"`.
`score.type`	A character string specifying the score type, must be one of `"LikelihoodRatio"` or `"Pearson"`.
`aggregation.type`	A character string specifying the aggregation type, must be one of `"sum"` or `"max"`.
`w.sum`	The minimum number of observations in a partition, only relevant for `type="Independence"`, `aggregation.type="sum"` and `score.type="Pearson"`, default value 0.
`w.max`	The minimum number of observations in a partition, only relevant for `type="Independence"`, `aggregation.type="max"` and `score.type="Pearson"`, default value 2.
`keep.simulation.data`	TRUE or FALSE. If TRUE, the input `m.stats` is kept in the returned object.
`nr.atoms`	For `"ADP-EQP"`, `"ADP-EQP-ML"` and `"KSample-Equipartition"` type tests, sets the number of possible split points in the data.
`compress`	TRUE or FALSE. If enabled, null tables are compressed: The lower `compress.p` part of the null statistics is kept at a `compress.p0` resolution, while the upper part is kept at a `compress.p1` resolution (which is finer).

`compress.p0`	Parameter for compression. This is the resolution for the lower `compress.p` part of the null distribution.
`compress.p`	Parameter for compression. Part of the null distribution to compress.
`compress.p1`	Parameter for compression. This is the resolution for the upper part of the null distribution, above the `compress.p` quantile.
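The shape required of `m.stats` can be checked before calling the constructor. The following is a minimal sketch in base R only; the matrix is filled with made-up numbers (an assumption for illustration, not real HHG statistics), and only the dimension check reflects the argument description above.

```r
# Hypothetical stand-in for real null statistics: B permutations,
# one column per partition size m = minm, ..., maxm.
set.seed(1)
B <- 100
minm <- 2
maxm <- 5
m.stats <- matrix(rexp(B * (maxm - minm + 1)), nrow = B)

# The shape described for the m.stats argument:
# B rows and maxm - minm + 1 columns.
stopifnot(is.matrix(m.stats),
          nrow(m.stats) == B,
          ncol(m.stats) == maxm - minm + 1)
dim(m.stats)
```

In real use the rows would come from a statistic function such as hhg.univariate.ind.stat, as in the Examples section.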

Details

For finding multiple quantiles, the null table object is more efficient than a raw matrix with B rows and maxm - minm + 1 columns, where each row contains the test statistics for partition sizes m from minm to maxm for the same permutation of the input sample.
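To illustrate why a preprocessed table helps, here is a rough sketch in base R. It is not the package's internal implementation: the helper lookup.pvalue, the simulated statistics and the +1 correction in the p-value are all assumptions made for illustration. The idea is that each column is sorted once, after which every lookup is a binary search rather than a full pass over the B rows.

```r
set.seed(1)
B <- 1000
minm <- 2
maxm <- 5
# Hypothetical null statistics, one column per partition size.
m.stats <- matrix(rexp(B * (maxm - minm + 1)), nrow = B)

# Sort each column once; this is the one-time cost of building the table.
sorted.null <- apply(m.stats, 2, sort)

# Hypothetical helper (not part of HHG): right-tail p-value of an observed
# statistic against the null statistics stored in a given column.
lookup.pvalue <- function(observed, column) {
  null.sorted <- sorted.null[, column]
  # With left.open = TRUE, findInterval counts null values strictly
  # below 'observed', using binary search on the sorted column.
  nr.below <- findInterval(observed, null.sorted, left.open = TRUE)
  # Null values >= observed, with a +1 correction (an assumed convention).
  (B - nr.below + 1) / (B + 1)
}

lookup.pvalue(2.5, column = 1)
```

The result agrees with the direct computation (sum(m.stats[, 1] >= 2.5) + 1) / (B + 1), but repeated lookups avoid rescanning the matrix.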

Null tables may be compressed, using the compress argument. For each of the partition sizes (i.e. m or mXm), the null distribution is held at a compress.p0 resolution up to the compress.p quantile. Beyond that value, the distribution is held at a finer resolution defined by compress.p1. Since higher values are attained when a relation exists in the data, the finer resolution in the upper tail is required for computing the p-value accurately.
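The two-resolution idea can be sketched in base R as follows. This is only an approximation of the scheme described above, under the assumption that "resolution" means the spacing of the stored quantile probabilities; the simulated statistics and variable names are hypothetical, and the package's actual storage format may differ.

```r
set.seed(1)
# Hypothetical null statistics for a single partition size.
null.stats <- rexp(1e6)

compress.p0 <- 0.001     # coarse spacing below the compress.p quantile
compress.p  <- 0.99      # boundary between the coarse and fine parts
compress.p1 <- 0.000001  # fine spacing above the compress.p quantile

# Coarse grid of probabilities up to compress.p, fine grid beyond it.
probs.lower <- seq(0, compress.p, by = compress.p0)
probs.upper <- seq(compress.p, 1, by = compress.p1)
probs <- unique(c(probs.lower, probs.upper))

# Keep only the quantiles on the combined grid.
compressed <- quantile(null.stats, probs = probs, names = FALSE)

# Far fewer stored values than raw statistics, with the upper tail
# still represented at the fine spacing.
c(raw = length(null.stats), compressed = length(compressed))
```

With these settings most of the stored values describe the top 1% of the null distribution, which is the region where small p-values are determined.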

See vignette('HHG') for a section on how to use this function for computing null tables using multiple cores.

Value

`m.stats`	The input `m.stats`, if `keep.simulation.data=TRUE`.

`univariate.object`	A useful format of the null tables for computing p-values efficiently.

Author(s)

Barak Brill and Shachar Kaufman.

References

Heller, R., Heller, Y., Kaufman, S., Brill, B., & Gorfine, M. (2016). Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables. JMLR 17(29):1-54.

Brill, B. (2016). Scalable Non-Parametric Tests of Independence (master's thesis). http://primage.tau.ac.il/libraries/theses/exeng/free/2899741.pdf

Examples

## Not run: 

# 1. Downloading a lookup table from site
# download from site http://www.math.tau.ac.il/~ruheller/Software.html
####################################################################
#using an already ready null table as object (for use in test functions)
#for example, ADP likelihood ratio statistics, for the independence problem,
#for sample size n=300
load('Object-ADP-n_300.Rdata') #=>null.table

#or using a matrix of statistics generated for the null distribution,
#to create your own table.
load('ADP-nullsim-n_300.Rdata') #=>mat
null.table = hhg.univariate.nulltable.from.mstats(m.stats = mat,minm = 2,
             maxm = 5,type = 'Independence', variant = 'ADP',size = 300,
             score.type = 'LikelihoodRatio',aggregation.type = 'sum')
             
# 2. generating an independence null table using multiple cores,
#and then compiling to object.
####################################################################
library(parallel)
library(doParallel)
library(foreach)
library(doRNG)

#generate an independence null table
nr.cores = 4 #this is computer dependent
n = 30 #size of independence problem
nr.reps.per.core = 25
mmax =5
score.type = 'LikelihoodRatio'
aggregation.type = 'sum'
variant = 'ADP'

#generating null table of size 4*25

#single core worker function
generate.null.distribution.statistic =function(){
  library(HHG)
  null.table = matrix(NA,nrow=nr.reps.per.core,ncol = mmax-1)
  for(i in 1:nr.reps.per.core){
    #note that the statistic is distribution free (based on ranks),
    #so creating a null table (for the null distribution)
    #is essentially permuting over the ranks
    statistic = hhg.univariate.ind.stat(1:n,sample(1:n),
                                        variant = variant,
                                        aggregation.type = aggregation.type,
                                        score.type = score.type,
                                        mmax = mmax)$statistic
    null.table[i,]=statistic
  }
  rownames(null.table)=NULL
  return(null.table)
}

#parallelize over cores
cl = makeCluster(nr.cores)
registerDoParallel(cl)
res = foreach(core = 1:nr.cores, .combine = rbind, .packages = 'HHG',
              .export=c('variant','aggregation.type','score.type',
              'mmax','nr.reps.per.core','n'), .options.RNG=1234) %dorng% 
              { generate.null.distribution.statistic() }
stopCluster(cl)

#the null table:
head(res)

#as object to be used:
null.table = hhg.univariate.nulltable.from.mstats(res,minm=2,
  maxm = mmax,type = 'Independence',
  variant = variant,size = n,score.type = score.type,
  aggregation.type = aggregation.type)

#using the null table, checking for dependence in a linear relation
x=rnorm(n)
y=x+rnorm(n)
ADP.test = hhg.univariate.ind.combined.test(x,y,null.table)
ADP.test$MinP.pvalue #pvalue


# 3. generating a k-sample null table using multiple cores
# and then compiling to object.
####################################################################

library(parallel)
library(doParallel)
library(foreach)
library(doRNG)

#generate a k sample null table
nr.cores = 4 #this is computer dependent
n1 = 25 #size of first group
n2 = 25 #size of second group
nr.reps.per.core = 25
mmax =5
score.type = 'LikelihoodRatio'
aggregation.type = 'sum'

#generating null table of size 4*25

#single core worker function
generate.null.distribution.statistic =function(){
  library(HHG)
  null.table = matrix(NA,nrow=nr.reps.per.core,ncol = mmax-1)
  for(i in 1:nr.reps.per.core){
    #note that the statistic is distribution free (based on ranks),
    #so creating a null table (for the null distribution)
    #is essentially permuting over the ranks
    statistic = hhg.univariate.ks.stat(1:(n1+n2),sample(c(rep(0,n1),rep(1,n2))),
                                        aggregation.type = aggregation.type,
                                        score.type = score.type,
                                        mmax = mmax)$statistic
    null.table[i,]=statistic
  }
  rownames(null.table)=NULL
  return(null.table)
}

#parallelize over cores
cl = makeCluster(nr.cores)
registerDoParallel(cl)
res = foreach(core = 1:nr.cores, .combine = rbind, .packages = 'HHG',
              .export=c('n1','n2','aggregation.type','score.type','mmax',
              'nr.reps.per.core'), .options.RNG=1234) %dorng% 
              {generate.null.distribution.statistic()}
stopCluster(cl)

#the null table:
head(res)

#as object to be used:
null.table = hhg.univariate.nulltable.from.mstats(res,minm=2,
  maxm = mmax,type = 'KSample',
  variant = 'KSample-Variant',size = c(n1,n2),score.type = score.type,
  aggregation.type = aggregation.type)

#using the null table, checking for dependence in a case of two distinct samples
x=1:(n1+n2)
y=c(rep(0,n1),rep(1,n2))
Sm.test = hhg.univariate.ks.combined.test(x,y,null.table)
Sm.test$MinP.pvalue #pvalue

## End(Not run)

HHG

Heller-Heller-Gorfine Tests of Independence and Equality of Distributions

v2.3.2

GPL-3

Authors

Barak Brill & Shachar Kaufman, based in part on an earlier implementation by Ruth Heller and Yair Heller.

Initial release

2019-03-11