bigsnpr: snp_fastImpute – R documentation

snp_fastImpute

Fast imputation

Description

Fast imputation algorithm based on local XGBoost models.

Usage

snp_fastImpute(
  Gna,
  infos.chr,
  alpha = 1e-04,
  size = 200,
  p.train = 0.8,
  n.cor = nrow(Gna),
  seed = NA,
  ncores = 1
)

Arguments

`Gna`	An FBM.code256 (typically `<bigSNP>$genotypes`). These data may contain missing values.
`infos.chr`	Vector of integers specifying each SNP's chromosome. Typically `<bigSNP>$map$chromosome`.
`alpha`	Type-I error for testing correlations. Default is `1e-4`.
`size`	Number of neighbor SNPs to be possibly included in the model imputing this particular SNP. Default is `200`.
`p.train`	Proportion of non-missing genotypes used to train the imputation model; the rest are used to assess the accuracy of this model. Default is `0.8`.
`n.cor`	Number of rows that are used to estimate correlations. Default uses them all.
`seed`	An integer, for reproducibility. Default doesn't use seeds.
`ncores`	Number of cores used. Default doesn't use parallelism. You may use `nb_cores()`.
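
To make the role of `p.train` more concrete, here is a minimal base-R sketch of the hold-out idea described above. It uses neither bigsnpr nor XGBoost: the simulated genotype vector, the `predict_mode()` stand-in model, and all variable names are hypothetical, and the actual internals of `snp_fastImpute()` may differ.

```r
set.seed(1)

# Hypothetical genotype calls (0/1/2) for one SNP, with 10 missing values
geno <- sample(c(0, 1, 2), size = 100, replace = TRUE, prob = c(0.6, 0.3, 0.1))
geno[sample(100, 10)] <- NA

p.train <- 0.8
ind_obs <- which(!is.na(geno))                # individuals with an observed genotype
n_train <- round(p.train * length(ind_obs))
ind_train <- sort(sample(ind_obs, n_train))   # used to fit the model
ind_val <- setdiff(ind_obs, ind_train)        # held out to assess accuracy

# Stand-in "model": always predict the most frequent training genotype
# (an assumption for illustration only; the real method fits local XGBoost models)
predict_mode <- function(train_values) {
  as.numeric(names(which.max(table(train_values))))
}
pred <- predict_mode(geno[ind_train])

# Estimated error rate: share of held-out genotypes predicted incorrectly
p_error <- mean(geno[ind_val] != pred)
# Proportion of missing values for this SNP
p_na <- mean(is.na(geno))
```

With `p.train = 0.8`, 72 of the 90 observed genotypes are used for fitting and the remaining 18 give the error estimate; a smaller `p.train` gives a less noisy error estimate at the cost of less training data.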

Value

An FBM with two rows:

the proportion of missing values by SNP (first row),
the estimated proportion of imputation errors by SNP (second row).
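
As a rough illustration of how these two rows can be combined, the sketch below uses a plain base-R matrix as a stand-in for the returned FBM; the values and the threshold are made up. The product of the two rows approximates, per SNP, the proportion of genotypes that are both missing and wrongly imputed, which is the quantity labelled "p(NA & Error)" in the post-checking plots of the Examples.

```r
# Hypothetical stand-in for the result: 2 rows, one column per SNP
infos <- rbind(
  pNA    = c(0.01, 0.02, 0.10, 0.05),  # first row: proportion of missing values
  pError = c(0.02, 0.01, 0.08, 0.30)   # second row: estimated imputation error
)

# Per-SNP estimate of the proportion of genotypes missing AND mis-imputed
p_na_and_error <- infos[1, ] * infos[2, ]

# Flag SNPs above an illustrative 0.5% threshold (the threshold is an assumption)
threshold <- 0.005
flagged <- which(p_na_and_error > threshold)
```

Here SNPs 3 and 4 are flagged; whether to drop such SNPs or keep them depends on the downstream analysis.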

Examples

## Not run: 

fake <- snp_attachExtdata("example-missing.bed")
G <- fake$genotypes
CHR <- fake$map$chromosome
infos <- snp_fastImpute(G, CHR)
infos[, 1:5]

# G still reports missing values under its current code
big_counts(G, ind.col = 1:10)
# To read the imputed values, you need to change the code of G
# To make this permanent, you need to save (modify) the file on disk
fake$genotypes$code256 <- CODE_IMPUTE_PRED
fake <- snp_save(fake)
big_counts(fake$genotypes, ind.col = 1:10)

# Plot for post-checking
## Here there is no SNP with more than 1% error (estimated)
pvals <- c(0.01, 0.005, 0.002, 0.001)
colvals <- 2:5
df <- data.frame(pNA = infos[1, ], pError = infos[2, ])

# base R
plot(subset(df, pNA > 0.001), pch = 20)
idc <- lapply(seq_along(pvals), function(i) {
  curve(pvals[i] / x, from = 0, lwd = 2,
        col = colvals[i], add = TRUE)
})
legend("topright", legend = pvals, title = "p(NA & Error)",
       col = colvals, lty = 1, lwd = 2)

# ggplot2
library(ggplot2)
Reduce(function(p, i) {
  p + stat_function(fun = function(x) pvals[i] / x, color = colvals[i])
}, x = seq_along(pvals), init = ggplot(df, aes(pNA, pError))) +
  geom_point() +
  coord_cartesian(ylim = range(df$pError, na.rm = TRUE)) +
  theme_bigstatsr()

## End(Not run)

bigsnpr

Analysis of Massive SNP Arrays

v1.10.8

GPL-3

Authors

Florian Privé [aut, cre], Michael Blum [ths], Hugues Aschard [ths], Bjarni Jóhann Vilhjálmsson [ths]

Initial release

2022-07-05
