visdat: vis_miss – R documentation


vis_miss

Visualise a data.frame to display missingness.

Description

vis_miss provides an at-a-glance ggplot of the missingness inside a data.frame, colouring each cell according to missingness: black indicates a missing cell and grey indicates a present cell. Because it returns a ggplot object, it is easy to customise, for example by changing labels.
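Since the return value is a ggplot object, ordinary ggplot2 layers can be added to it. The following is a minimal sketch, assuming the visdat and ggplot2 packages are installed; the title text is an arbitrary illustration, not something the package provides.

```r
library(visdat)
library(ggplot2)

# vis_miss() returns a ggplot object, so ggplot2 functions can be added with `+`
p <- vis_miss(airquality) +
  labs(title = "Missingness in airquality")  # illustrative title only

# Display the customised plot
print(p)
```

Other ggplot2 components, such as `theme()`, can be added in the same way.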

Usage

vis_miss(x, cluster = FALSE, sort_miss = FALSE, show_perc = TRUE,
  show_perc_col = TRUE, large_data_size = 9e+05,
  warn_large_data = TRUE)

Arguments

`x`	a data.frame
`cluster`	logical. TRUE specifies that you want to use hierarchical clustering (mcquitty method) to arrange rows according to missingness. FALSE specifies that you want to leave it as is. Default value is FALSE.
`sort_miss`	logical. TRUE arranges the columns in order of missingness. Default value is FALSE.
`show_perc`	logical. TRUE adds the % of missing/complete data in the whole dataset to the legend. Default value is TRUE.
`show_perc_col`	logical. TRUE adds the % of missing data in each column to the x axis labels. Can be disabled with FALSE. Default value is TRUE.
`large_data_size`	integer. Size above which the data is considered large. Default value is 900000; this can be changed. See Note for more details.
`warn_large_data`	logical. TRUE warns if the data is large. Default value is TRUE. See Note for more details.
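To make the `cluster`, `sort_miss`, `show_perc` and `show_perc_col` arguments more concrete, the quantities they describe can be approximated in base R. This is only a sketch of the ideas, under the assumption that missingness is treated as a 0/1 matrix; it is not taken from the package source, and visdat's internal computation may differ. All variable names here are illustrative.

```r
# Logical matrix: TRUE where a value is missing
miss <- is.na(airquality)

# Percentage of missing cells in each column (the idea behind show_perc_col)
pct_col <- colMeans(miss) * 100

# Percentage of missing cells in the whole dataset (the idea behind show_perc)
pct_all <- mean(miss) * 100

# Column names ordered from most to least missing (the idea behind sort_miss)
col_order <- names(sort(colSums(miss), decreasing = TRUE))

# Row order from hierarchical clustering with the mcquitty method,
# applied to the 0/1 missingness matrix (the idea behind cluster)
row_order <- hclust(dist(miss * 1), method = "mcquitty")$order

round(pct_col, 1)
col_order
head(row_order)
```

Rows that share the same pattern of missing values end up next to each other in `row_order`, which is why clustering tends to group the black cells into blocks.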

Value

ggplot2 object displaying the position of missing values in the dataframe, and the percentage of values missing and present.

Note

Some datasets might be too large to plot, sometimes creating a blank plot. If this happens, I would recommend downsampling the data, either by looking at the first 1,000 rows or by taking a random sample. This means that you won't get the same "look" at the data, but it is better than a blank plot! See example code for suggestions on doing this.
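As one possible way to downsample without extra packages, the sketch below takes a random sample of rows so that the data fits under a size limit. `downsample_rows` and `max_cells` are hypothetical names and not part of visdat. The sketch also assumes that size is measured as rows times columns, which this page does not state.

```r
# Hypothetical helper, not part of visdat: keep a random sample of rows so
# that nrow(x) * ncol(x) does not exceed max_cells.
# Assumption: "size" means the number of cells (rows times columns).
downsample_rows <- function(x, max_cells = 9e5) {
  n_keep <- floor(max_cells / ncol(x))
  if (nrow(x) <= n_keep) {
    return(x)  # already small enough, return unchanged
  }
  # Sort the sampled indices so the original row order is preserved
  keep <- sort(sample.int(nrow(x), n_keep))
  x[keep, , drop = FALSE]
}

# Illustrative data: 200,000 rows by 5 columns, with some missing values
set.seed(1)
big <- as.data.frame(matrix(rnorm(1e6), ncol = 5))
big[sample.int(nrow(big), 1000), 1] <- NA

small <- downsample_rows(big)
dim(small)
```

The reduced data frame can then be passed to vis_miss in place of the original. Keeping the sampled rows in their original order preserves any pattern of missingness that depends on row position.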

Examples

vis_miss(airquality)

## Not run: 
vis_miss(airquality, cluster = TRUE)

vis_miss(airquality, sort_miss = TRUE)

# if you have a large dataset, you might want to try downsampling:
library(nycflights13)
library(dplyr)
flights %>%
  sample_n(1000) %>%
  vis_miss()

flights %>%
  slice(1:1000) %>%
  vis_miss()


## End(Not run)

visdat

Preliminary Visualisation of Data

v0.5.3

MIT + file LICENSE

Authors

Nicholas Tierney [aut, cre] (<https://orcid.org/0000-0003-1460-8722>), Sean Hughes [rev] (<https://orcid.org/0000-0002-9409-9405>, Sean Hughes reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Mara Averick [rev] (Mara Averick reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Stuart Lee [ctb], Earo Wang [ctb], Nic Crane [ctb]

Initial release