Missing Value Imputation with kNN for High-Dimensional Data
Imputes missing values in a high-dimensional matrix composed of categorical variables using k Nearest Neighbors.
knncatimputeLarge(data, mat.na = NULL, fac = NULL, fac.na = NULL, nn = 3, distance = c("smc", "cohen", "snp1norm", "pcc"), n.num = 100, use.weights = TRUE, verbose = FALSE)
data |
a numeric matrix consisting of integers between 1 and n.cat,
where n.cat is the maximum number of levels the categorical variables
can take. If Each row of |
mat.na |
a numeric matrix containing missing values. Must have the same number of
columns as |
fac |
a numeric or character vector of length |
fac.na |
a numeric or character vector of length |
nn |
an integer specifying k, i.e. the number of nearest neighbors, used to impute the missing values. |
distance |
character string naming the distance measure used in k Nearest Neighbors.
Must be either |
n.num |
an integer giving the number of rows of |
use.weights |
should weighted k nearest neighbors be used to impute the missing values?
If |
verbose |
should more information about the progress of the imputation be printed? |
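As a rough illustration of what a distance such as "smc" measures: the simple matching coefficient of two categorical vectors is the share of positions at which they agree, and one minus that share can serve as a distance. The helper `smc_dist` below is a hypothetical base-R sketch, not the package's own `smc` function, whose treatment of missing values and whose return value may differ.

```r
# Hypothetical helper (not part of the package): simple matching distance
# between two categorical vectors, skipping positions where either is NA.
smc_dist <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)
  if (!any(ok)) return(NA_real_)
  mean(x[ok] != y[ok])
}

a <- c(1, 2, 3, 1, NA)
b <- c(1, 3, 3, 2, 2)
smc_dist(a, b)  # 2 mismatches among the 4 comparable positions
```

Rows with a small distance to the row being imputed are the candidates for its nearest neighbors.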
If mat.na = NULL, then a matrix of the same size as data in which the missing values have been replaced. If mat.na has been specified, then a matrix of the same size as mat.na in which the missing values have been replaced.
While in knncatimpute all variables/rows are considered when replacing missing values, knncatimputeLarge only considers the rows with no missing values when searching for the k nearest neighbors.
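The restriction to complete rows can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the function name `impute_row_from_complete` is hypothetical, distances are plain mismatch proportions, and the weights are inverse distances, which may differ from what use.weights = TRUE does internally.

```r
# Hypothetical sketch: impute the NAs of one row using only complete rows.
impute_row_from_complete <- function(row, complete, k = 3, weighted = TRUE) {
  obs <- !is.na(row)
  # Mismatch proportion between `row` and every complete row, computed on
  # the columns that are observed in `row`.
  d <- apply(complete[, obs, drop = FALSE], 1,
             function(r) mean(r != row[obs]))
  nn <- order(d)[seq_len(min(k, nrow(complete)))]
  # Inverse-distance weights (an assumption); the constant avoids 1/0.
  w <- if (weighted) 1 / (d[nn] + 1e-8) else rep(1, length(nn))
  for (j in which(!obs)) {
    # (Weighted) vote among the neighbors' values in column j.
    votes <- tapply(w, complete[nn, j], sum)
    row[j] <- as.numeric(names(votes)[which.max(votes)])
  }
  row
}

set.seed(1)
mat <- matrix(sample(3, 200, TRUE), 20)   # 20 rows (variables), 10 columns
mat[1, c(2, 5)] <- NA
complete <- mat[rowSums(is.na(mat)) == 0, , drop = FALSE]
filled <- impute_row_from_complete(mat[1, ], complete)
```

Because only `complete` is searched, the distance computation never has to deal with missing values in the candidate neighbors, which is what keeps this approach manageable for very large matrices.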
Holger Schwender, holger.schwender@udo.edu
Schwender, H. and Ickstadt, K. (2008). Imputing Missing Genotypes with k Nearest Neighbors. Technical Report, SFB 475, Department of Statistics, University of Dortmund. Appears soon.
knncatimpute, gknn, smc, pcc
## Not run:
# Generate a data set consisting of 100 columns and 2000 rows (actually,
# knncatimputeLarge is made for much larger data sets), where the values
# are randomly drawn from the integers 1, 2, and 3.
# Afterwards, remove 20 of the observations randomly.
mat <- matrix(sample(3, 200000, TRUE), 2000)
mat[sample(200000, 20)] <- NA

# Apply knncatimputeLarge to mat to remove the missing values.
mat2 <- knncatimputeLarge(mat)
sum(is.na(mat))
sum(is.na(mat2))

# Now assume that the first 100 rows belong to SNPs from chromosome 1,
# the second 100 rows to SNPs from chromosome 2, and so on.
chromosome <- rep(1:20, each = 100)

# Apply knncatimputeLarge to mat chromosomewise, i.e. only consider
# the SNPs that belong to the same chromosome when replacing missing
# genotypes.
mat4 <- knncatimputeLarge(mat, fac = chromosome)
## End(Not run)