Propensity scores and other distance measures
Several matching methods require or can involve the distance between treated and control units. Options include the Mahalanobis distance, propensity score distance, or distance between user-supplied values. Propensity scores are also used for common support via the discard options and for defining calipers. This page documents the options that can be supplied to the distance argument to matchit.
There are three ways to specify the distance argument: 1) as the string "mahalanobis", 2) as a string containing the name of a method for estimating propensity scores, or 3) as a vector of values whose pairwise differences define the distance between units.
When distance is specified as one of the allowed strings (described below) other than "mahalanobis", a propensity score is estimated using the variables in formula and the method corresponding to the given argument. This propensity score can be used to compute the distance between units as the absolute difference between the propensity scores of pairs of units. In this respect, the propensity score is more like a "position" measure than a distance measure, since it is the pairwise differences that form the distance rather than the propensity scores themselves. Still, this naming convention is used to reflect their primary purpose without committing to the status of the estimated values as propensity scores, since transformations of the scores are allowed and user-supplied values that are not propensity scores can also be supplied (detailed below). Propensity scores can also be used to create calipers and common support restrictions, whether or not they are used in the actual distance measure used in the matching, if any.
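The "position" idea above can be illustrated outside of matchit with a small base R sketch. The propensity score values and unit names below are made up for illustration; the code only shows how pairwise absolute differences turn one score per unit into a treated-by-control distance matrix.

```r
# Toy propensity scores (hypothetical values, for illustration only)
ps <- c(t1 = 0.80, t2 = 0.55, c1 = 0.75, c2 = 0.50, c3 = 0.20)
treat <- c(1, 1, 0, 0, 0)

# Pairwise absolute differences: rows are treated units, columns are controls
ps_dist <- abs(outer(ps[treat == 1], ps[treat == 0], "-"))
ps_dist

# Name of the closest control unit for each treated unit
nearest <- colnames(ps_dist)[apply(ps_dist, 1, which.min)]
nearest
```

Each unit contributes a single value, and the distance only appears once two units are compared.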
In addition to the distance argument, two other arguments can be specified that relate to the estimation and manipulation of the propensity scores. The link argument allows for different links to be used in models that require them such as generalized linear models, for which the logit and probit links are allowed, among others. In addition to specifying the link, the link argument can be used to specify whether the propensity score or the linearized version of the propensity score should be used; by specifying link = "linear.{link}" (e.g., link = "linear.probit"), the linearized version will be used.
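To make the difference between the propensity score and its linearized version concrete, here is a minimal base R sketch using simulated data (the variables x and treat are hypothetical and not part of MatchIt). With a logit link, the linearized score is the linear predictor of the model, which is the logit of the predicted probability.

```r
set.seed(1)

# Simulated covariate and binary treatment (illustrative only)
n <- 200
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.5 * x))

fit <- glm(treat ~ x, family = binomial(link = "logit"))

ps     <- predict(fit, type = "response")  # predicted probabilities
lin_ps <- predict(fit, type = "link")      # linear predictor

# The linear predictor should equal the logit of the probabilities,
# up to numerical tolerance
all.equal(unname(lin_ps), unname(qlogis(ps)))
```

The linearized score is unbounded, whereas the probabilities are compressed near 0 and 1, so distances computed on the two scales can rank pairs of units differently.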
The distance.options argument can also be specified, which should be a list of values passed to the propensity score-estimating function, for example, to choose specific options or tuning parameters for the estimation method. If formula, data, or verbose are not supplied to distance.options, the corresponding arguments from matchit will be automatically supplied. See the Examples for demonstrations of the uses of link and distance.options. When s.weights is supplied in the call to matchit, it will automatically be passed to the propensity score-estimating function as the weights argument unless otherwise described below.
Below are the allowed options for distance:
"glm"
The propensity scores are estimated using a generalized linear model (e.g., logistic regression). The formula supplied to matchit is passed directly to glm, and predict.glm is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial, e.g., "logit", which is the default. When link is prefixed with "linear.", the linear predictor is used instead of the predicted probabilities. distance = "glm" with link = "logit" (logistic regression) is the default in matchit.
"gam"
The propensity scores are estimated using a generalized additive model. The formula supplied to matchit is passed directly to mgcv::gam, and mgcv::predict.gam is used to compute the propensity scores. The link argument can be specified as a link function supplied to binomial, e.g., "logit", which is the default. When link is prefixed with "linear.", the linear predictor is used instead of the predicted probabilities. Note that unless the smoothing functions s, te, ti, or t2 are used in formula, a generalized additive model is identical to a generalized linear model and will estimate the same propensity scores as glm. See the documentation for mgcv::gam, mgcv::formula.gam, and mgcv::gam.models for more information on how to specify these models. Also note that the formula returned in the matchit output object will be a simplified version of the supplied formula with smoothing terms removed (but all named variables present).
"rpart"
The propensity scores are estimated using a classification tree. The formula supplied to matchit is passed directly to rpart::rpart, and rpart::predict.rpart is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure.
"randomforest"
The propensity scores are estimated using a random forest. The formula supplied to matchit is passed directly to randomForest::randomForest, and randomForest::predict.randomForest is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure. When s.weights is supplied to matchit, it will not be passed to randomForest because randomForest does not accept weights.
"nnet"
The propensity scores are estimated using a single-hidden-layer neural network. The formula supplied to matchit is passed directly to nnet::nnet, and fitted is used to compute the propensity scores. The link argument is ignored, and predicted probabilities are always returned as the distance measure. An argument to size must be supplied to distance.options when using distance = "nnet".
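As a hedged illustration of supplying size through distance.options, the sketch below assumes that MatchIt (version 4.0.0 or later) and nnet are installed and uses the lalonde data shipped with MatchIt. The choice of 3 hidden units and the seed are arbitrary and are not recommendations.

```r
library(MatchIt)  # assumes MatchIt (>= 4.0.0) and nnet are installed
data("lalonde")

set.seed(123)  # nnet starts from random weights, so fix the seed

# size = 3 hidden units is an arbitrary illustrative choice
m.nnet <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75,
                  data = lalonde,
                  distance = "nnet",
                  distance.options = list(size = 3))

# The predicted probabilities are stored as the distance measure
summary(m.nnet$distance)
```

Other arguments accepted by nnet::nnet, such as decay or maxit, could be added to the same list in the same way.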
"cbps"
The propensity scores are estimated using the covariate balancing propensity score (CBPS) algorithm, which is a form of logistic regression where balance constraints are incorporated into a generalized method of moments estimation of the model coefficients. The formula supplied to matchit is passed directly to CBPS::CBPS, and fitted is used to compute the propensity scores. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. No other links are allowed. The estimand argument supplied to matchit will be used to select the appropriate estimand for use in defining the balance constraints, so no argument needs to be supplied to ATT in CBPS.
"bart"
The propensity scores are estimated using Bayesian additive regression trees (BART). The formula supplied to matchit is passed directly to dbarts::bart2, and dbarts::fitted is used to compute the propensity scores. The link argument can be specified as "linear" to use the linear predictor instead of the predicted probabilities. When s.weights is supplied to matchit, it will not be passed to bart2 because the weights argument in bart2 does not correspond to sampling weights.
"mahalanobis"
No propensity scores are estimated. Rather than using the propensity score difference as the distance between units, the Mahalanobis distance is used instead. See mahalanobis for details on how it is computed. The Mahalanobis distance is always computed using all the variables in formula. With this specification, calipers and common support restrictions cannot be used and the distance component of the output object will be empty because no propensity scores are estimated. The link and distance.options arguments are ignored. See individual methods pages for whether the Mahalanobis distance is allowed and how it is used. Sometimes this setting is just a placeholder to indicate that no propensity score is to be estimated (e.g., with method = "genetic"). To perform Mahalanobis distance matching and estimate propensity scores to be used for a purpose other than matching, the mahvars argument should be used along with a different specification to distance. See the individual matching method pages for details on how to use mahvars.
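For intuition about what a Mahalanobis distance between treated and control units looks like, here is a minimal base R sketch built on stats::mahalanobis. The data are simulated, and the use of the full-sample covariance matrix is an assumption made for illustration; it is not necessarily the covariance estimate that matchit uses internally.

```r
set.seed(2)

# Simulated covariates for 2 treated and 4 control units (illustrative only)
X <- cbind(age = rnorm(6, 30, 5), educ = rnorm(6, 12, 2))
treat <- c(1, 1, 0, 0, 0, 0)

# Assumption: covariance estimated from the full sample
S <- cov(X)

Xt <- X[treat == 1, , drop = FALSE]
Xc <- X[treat == 0, , drop = FALSE]

# stats::mahalanobis() returns squared distances, so take the square root.
# The result has one row per treated unit and one column per control unit.
maha_dist <- t(apply(Xt, 1, function(xt) {
  sqrt(mahalanobis(Xc, center = xt, cov = S))
}))
maha_dist
```

Unlike the propensity score difference, this distance uses every covariate directly and scales each by its variance and its covariance with the others.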
distance can also be supplied as a numeric vector whose values will be taken to function like propensity scores; their pairwise differences will define the distance between units. This can be useful for supplying propensity scores computed outside matchit or for resupplying matchit with previously estimated propensity scores without having to recompute them. When distance is supplied as a numeric vector, link and distance.options are ignored.
When specifying an argument to distance that estimates a propensity score, the output of the function called to estimate the propensity score (e.g., the glm object when distance = "glm") will be included in the matchit output object in the model component. When distance is anything other than "mahalanobis", the estimated or supplied distance measures will be included in the matchit output object in the distance component.
In versions of MatchIt prior to 4.0.0, distance was specified in a slightly different way. When specifying arguments using the old syntax, they will automatically be converted to the corresponding method in the new syntax but a warning will be thrown. distance = "logit", the old default, will still work in the new syntax, though distance = "glm", link = "logit" is preferred (note that these are the default settings and don't need to be made explicit).
Examples
library(MatchIt)
data("lalonde")

# Linearized probit regression PS:
m.out1 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "glm", link = "linear.probit")

# GAM logistic PS with smoothing splines (s()):
m.out2 <- matchit(treat ~ s(age) + s(educ) + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "gam")
summary(m.out2$model)

# CBPS for ATC matching w/replacement, using the just-
# identified version of CBPS (setting method = "exact"):
m.out3 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "cbps", estimand = "ATC",
                  distance.options = list(method = "exact"),
                  replace = TRUE)

# Mahalanobis distance matching - no PS estimated
m.out4 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "mahalanobis")
m.out4$distance # NULL

# Mahalanobis distance matching with PS estimated
# for use in a caliper; matching done on mahvars
m.out5 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = "glm", caliper = .1,
                  mahvars = ~ age + educ + race + married +
                    nodegree + re74 + re75)
summary(m.out5)

# User-supplied propensity scores
p.score <- fitted(glm(treat ~ age + educ + race + married +
                        nodegree + re74 + re75, data = lalonde,
                      family = binomial))
m.out6 <- matchit(treat ~ age + educ + race + married +
                    nodegree + re74 + re75, data = lalonde,
                  distance = p.score)