R squared
Calculate the coefficient of determination using correlation. For the
traditional measure of R squared, see rsq_trad().
rsq(data, ...)

## S3 method for class 'data.frame'
rsq(data, truth, estimate, na_rm = TRUE, ...)

rsq_vec(truth, estimate, na_rm = TRUE, ...)
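data: A data.frame containing the columns specified by the truth and estimate arguments.
...: Not currently used.
truth: The column identifier for the true results (that is numeric).
estimate: The column identifier for the predicted results (that is also numeric).
na_rm: A logical value indicating whether NA values should be stripped before the computation proceeds.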
The two estimates for the coefficient of determination, rsq() and
rsq_trad(), differ by their formula. The former guarantees a value on
[0, 1], while the latter can generate inaccurate values (for example,
negative ones) when the model is non-informative (see the examples).
Both are measures of consistency/correlation and not of accuracy.
rsq() is simply the squared correlation between truth and estimate.
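The difference between the two measures can be sketched in base R. The helpers rsq_cor() and rsq_traditional() below are hypothetical names used only for illustration, not the yardstick implementations, and the sketch assumes the traditional measure is the usual textbook form, 1 - SS_res / SS_tot.

```r
# Hypothetical base R helpers for illustration only; these are not the
# yardstick implementations.

# Squared correlation between truth and estimate.
rsq_cor <- function(truth, estimate) {
  cor(truth, estimate)^2
}

# Assumed textbook form of the traditional measure: 1 - SS_res / SS_tot.
rsq_traditional <- function(truth, estimate) {
  ss_res <- sum((truth - estimate)^2)
  ss_tot <- sum((truth - mean(truth))^2)
  1 - ss_res / ss_tot
}

truth <- c(1, 2, 3, 4, 5)
# Predictions that run in the opposite direction to the truth
estimate <- c(5, 4, 3, 2, 1)

rsq_cor(truth, estimate)          # 1: perfectly (negatively) correlated
rsq_traditional(truth, estimate)  # -3: worse than predicting the mean
```

Under these assumptions the squared correlation rewards predictions that are consistently wrong, which is why it reads as a measure of correlation rather than accuracy.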
Because rsq() internally computes a correlation, a constant truth or
estimate can result in a divide by zero error. In these cases, a warning
is thrown and NA is returned. This can occur when a model predicts a
single value for all samples. For example, a regularized model that
eliminates all predictors except for the intercept would do this.
Another example would be a CART model that contains no splits.
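The constant-prediction case can be reproduced with base R alone. This sketch uses stats::cor() as a stand-in for the correlation step; whether yardstick calls it in exactly this way is an assumption.

```r
truth <- c(1, 2, 3, 4)
# A model that predicts the same value for every sample
estimate <- rep(2.5, 4)

# The standard deviation of a constant vector is zero, and it appears in
# the denominator of the correlation.
sd(estimate)

# cor() signals a warning and returns NA; capture the warning so that it
# can be inspected instead of printed.
warned <- FALSE
r <- withCallingHandlers(
  cor(truth, estimate),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

warned     # TRUE
is.na(r^2) # TRUE
```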
A tibble with columns .metric, .estimator, and .estimate and 1 row of values.
For grouped data frames, the number of rows returned will be the same as the number of groups.
For rsq_vec(), a single numeric value (or NA).
Max Kuhn
Kvalseth. Cautionary note about R^2. American Statistician (1985) vol. 39 (4) pp. 279-285.
library(yardstick)

# Supply truth and predictions as bare column names
rsq(solubility_test, solubility, prediction)

library(dplyr)

set.seed(1234)
size <- 100
times <- 10

# create 10 resamples
solubility_resampled <- bind_rows(
  replicate(
    n = times,
    expr = sample_n(solubility_test, size, replace = TRUE),
    simplify = FALSE
  ),
  .id = "resample"
)

# Compute the metric by group
metric_results <- solubility_resampled %>%
  group_by(resample) %>%
  rsq(solubility, prediction)

metric_results

# Resampled mean estimate
metric_results %>%
  summarise(avg_estimate = mean(.estimate))

# With uninformative data, the traditional version of R^2 can return
# negative values.
set.seed(2291)
solubility_test$randomized <- sample(solubility_test$prediction)
rsq(solubility_test, solubility, randomized)
rsq_trad(solubility_test, solubility, randomized)

# A constant `truth` or `estimate` vector results in a warning from
# a divide by zero error in the correlation calculation.
# `NA` will be returned in these cases.
truth <- c(1, 2)
estimate <- c(1, 1)
rsq_vec(truth, estimate)