First draft of a function to diagnose problems in merges and key variables
This is a first effort. It works with two data frames and one key variable in each. It does not yet work if the by parameter names more than one column (this may be supported in the future). The return value is a list that includes full copies of the rows from the data frames in which trouble is observed.
mergeCheck(x, y, by, by.x = by, by.y = by, incomparables = c(NULL, NA, NaN, Inf, "\\s+", ""))
x | data frame
y | data frame
by | Commonly called the "key" variable. A column name to be used for merging (common to both x and y).
by.x | Column name in x. Defaults to by.
by.y | Column name in y. Defaults to by.
incomparables | Values in the key (by) variable that are ignored for matching. The default treats these values as incomparable: c(NULL, NA, NaN, Inf, "\s+", ""). Note that this is a larger set of incomparables than the one assumed by R's merge (which assumes only NULL).
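To make the default incomparables set more concrete, the following is a minimal sketch of how key values of that kind could be flagged in base R. The helper name isIncomparable is hypothetical and is not part of the package; treating the strings "NaN" and "Inf" as incomparable (which is what those values become when a key vector is coerced to character) is an assumption of this sketch, not documented behavior of mergeCheck.

```r
## Hypothetical helper (not part of the package): flag key values that
## fall in a set like the default incomparables: NA, NaN, Inf, "",
## and whitespace-only strings.
isIncomparable <- function(key) {
    keychar <- as.character(key)
    ## is.na() catches NA and numeric NaN; the %in% test catches Inf and
    ## the character forms "NaN" and "Inf" (an assumption of this sketch);
    ## grepl() catches empty and whitespace-only strings.
    is.na(key) | keychar %in% c("NaN", "Inf") |
        grepl("^[[:space:]]*$", keychar)
}

## Mixing numbers and strings in c() coerces everything to character,
## so NaN becomes the string "NaN" here.
id <- c(1:3, NA, NaN, "", " ")
flagged <- isIncomparable(id)
id[!flagged]   # the keys that remain usable for matching
```

Checking keys this way before a merge shows how many rows would be set aside, without depending on how mergeCheck itself applies the incomparables argument.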
A list of data structures that are displayed for keys and data sets. The return is list(keysBad, keysDuped, unmatched). unmatched is a list with 2 elements, the unmatched cases from x and y.
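As a rough illustration of a return value with this shape, here is a base R sketch that assembles a comparable list for a single key column. The function name sketchMergeCheck is hypothetical; the definition of a "bad" key used here (NA, NaN, or a blank string) and the x/y layout inside keysBad and keysDuped are assumptions of the sketch and may differ from what mergeCheck actually returns.

```r
## Hypothetical sketch, not the package implementation.
sketchMergeCheck <- function(x, y, by.x, by.y = by.x) {
    ## Assumption: a "bad" key is NA/NaN or an empty/whitespace-only string
    isBad <- function(k) {
        is.na(k) | grepl("^[[:space:]]*$", as.character(k))
    }
    ## Every row whose key value occurs more than once
    isDuped <- function(k) {
        duplicated(k) | duplicated(k, fromLast = TRUE)
    }
    kx <- x[[by.x]]
    ky <- y[[by.y]]
    badx <- isBad(kx)
    bady <- isBad(ky)
    dupx <- !badx & isDuped(kx)
    dupy <- !bady & isDuped(ky)
    ## Usable keys with no partner in the other data frame
    unx <- !badx & !(kx %in% ky[!bady])
    uny <- !bady & !(ky %in% kx[!badx])
    list(keysBad = list(x = x[badx, , drop = FALSE],
                        y = y[bady, , drop = FALSE]),
         keysDuped = list(x = x[dupx, , drop = FALSE],
                          y = y[dupy, , drop = FALSE]),
         unmatched = list(x = x[unx, , drop = FALSE],
                          y = y[uny, , drop = FALSE]))
}

df1 <- data.frame(id = 1:7, x = rnorm(7))
df2 <- data.frame(id = c(2:6, 9:10), x = rnorm(7))
res <- sketchMergeCheck(df1, df2, by.x = "id")
res$unmatched$x   # rows of df1 whose id has no match in df2
res$unmatched$y   # rows of df2 whose id has no match in df1
```

Returning full rows rather than bare key values mirrors the description above, so the offending records can be inspected directly.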
Paul Johnson
## Clean integer keys; some ids appear in only one data frame
df1 <- data.frame(id = 1:7, x = rnorm(7))
df2 <- data.frame(id = c(2:6, 9:10), x = rnorm(7))
mc1 <- mergeCheck(df1, df2, by = "id")
## Use mc1 objects mc1$keysBad, mc1$keysDuped, mc1$unmatched

## df1 has missing and blank keys; df2 repeats ids 5 and 6
df1 <- data.frame(id = c(1:3, NA, NaN, "", " "), x = rnorm(7))
df2 <- data.frame(id = c(2:6, 5:6), x = rnorm(7))
mergeCheck(df1, df2, by = "id")

## Key columns with different names in each data frame
df1 <- data.frame(idx = c(1:5, NA, NaN), x = rnorm(7))
df2 <- data.frame(idy = c(2:6, 9:10), x = rnorm(7))
mergeCheck(df1, df2, by.x = "idx", by.y = "idy")