Resampling Methods

Overfitting

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

John von Neumann

From the previous chapters we have seen that all supervised learning models require the training outputs to be available and typically minimize some form of error measure (or maximize a likelihood function) on the training set, with linear regression being a special case. The reason we fit models, however, is not to explain the training data perfectly but to predict unseen data. By fitting arbitrarily complex models with many parameters to our training data, we risk that the fitted model essentially memorizes each training data point. This phenomenon leads to poor predictions on out-of-sample data and is called overfitting.

Example: Regression Spline

The example below shows a regression line fitted using smooth.spline(). The spar argument (the smoothing parameter) controls the complexity of the fit: for spar close to 1 the line is very smooth and runs straight through the point cloud, while for spar close to 0 it becomes very wriggly and passes through almost every point. We can easily see that, by setting spar=0, the regression line no longer captures the actual relationship (or concept) between a car's weight wt and its fuel consumption in miles per gallon mpg but instead memorizes each data point. Such a model will clearly not generalize well to unseen data.
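A static sketch of this example (the spar values are chosen purely for illustration):

fit_wiggly <- smooth.spline(mtcars$wt, mtcars$mpg, spar = 0)  # almost no smoothing
fit_smooth <- smooth.spline(mtcars$wt, mtcars$mpg, spar = 1)  # heavy smoothing

xx <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 200)
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
lines(predict(fit_wiggly, xx), col = "red")   # wriggly line that chases individual points
lines(predict(fit_smooth, xx), col = "blue")  # smooth line capturing the overall trend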

Classification with KNN

The example below shows a k-nearest neighbor classifier fitted to the iris dataset. Although this model is often considered non-parametric, we can still tune at least one hyperparameter, namely the number of neighbors k used to determine the class of each data point. We can see that, by reducing the number of neighbors more and more, the decision boundary becomes very complex (wriggly) and will probably not generalize well to unseen data.
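A rough sketch of such a fit with the parsnip interface (assuming the kknn engine is installed; the chosen features and values of k are purely for illustration):

library(tidymodels)

knn_spec <- function(k) {
  nearest_neighbor(neighbors = k) %>%
    set_engine("kknn") %>%
    set_mode("classification")
}

# a very flexible model (k = 1) versus a much smoother one (k = 25)
knn_1  <- knn_spec(1)  %>% fit(Species ~ Sepal.Length + Sepal.Width, data = iris)
knn_25 <- knn_spec(25) %>% fit(Species ~ Sepal.Length + Sepal.Width, data = iris)

predict(knn_25, new_data = head(iris))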

Resampling techniques


To overcome the problem of overfitting we compute the model error on an unseen holdout dataset, which gives us an (approximately unbiased) estimate of the error rate on new data points. Additionally, the resulting error estimate helps us decide which model to use and how to choose its parameters.

The rsample package provides a number of tools to resample the dataset, most importantly:

  • initial_split(): Create a binary split into a training and a test set.
  • vfold_cv(): Create \(v\) non-overlapping folds for V-fold cross-validation.
  • bootstraps(): Create bootstrap resamples by sampling with replacement.
  • Combinations of these techniques, e.g. initial_split() followed by vfold_cv() on the training set.

Initial Split

The initial split function (also known as the validation set approach) splits your dataset into a training and a test set. This is the most basic way to reserve a holdout set and can be done using

library(rsample)
isplit <- initial_split(mtcars)
isplit

which by default assigns 3/4 of the observations to the training set and 1/4 to the test set.
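If a different split ratio is needed, it can be set via the prop argument, for example:

initial_split(mtcars, prop = 0.8)  # 80% training, 20% test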

The training and test sets can be extracted using training() and testing() on the resulting split object:

train_set <- training(isplit)
test_set <- testing(isplit)

dim(train_set)
dim(test_set)

We can now evaluate the performance of our previous linear model based on the mtcars dataset using

linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = train_set) %>% 
  predict(new_data=test_set) %>% 
  bind_cols(test_set) %>% 
  yardstick::rmse(mpg, .pred)

This gives us the Root-Mean-Squared-Error (RMSE) based on the test set.

However, this estimate can depend heavily on the particular split and the resulting training and test sets. This is especially true for small sample sizes and for complex models with high variance:

isplit <- initial_split(mtcars)  # a new random split
train_set <- training(isplit)
test_set <- testing(isplit)
linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = train_set) %>% 
  predict(new_data=test_set) %>% 
  bind_cols(test_set) %>% 
  yardstick::rmse(mpg, .pred)
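One way to make this split dependence visible is to repeat the split-fit-evaluate cycle for several random seeds; the split_rmse() helper below is purely illustrative:

split_rmse <- function(seed) {
  set.seed(seed)
  isplit <- initial_split(mtcars)
  linear_reg() %>%
    set_engine("lm") %>%
    fit(mpg ~ ., data = training(isplit)) %>%
    predict(new_data = testing(isplit)) %>%
    bind_cols(testing(isplit)) %>%
    yardstick::rmse(mpg, .pred) %>%
    dplyr::pull(.estimate)
}

# test-set RMSE for five different random splits
purrr::map_dbl(1:5, split_rmse)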

V-Fold Cross-Validation

To overcome this dependence on a single validation set, we can instead partition the dataset into \(v\) non-overlapping parts (folds), use each fold once as the validation set, and train the model on the remaining data points. The procedure works as follows:

  1. Create \(v\) non-overlapping folds.
  2. For each fold, train the model on the other \(v-1\) folds and calculate the error rate on the held-out fold.
  3. Average the errors over all folds to obtain the cross-validation error rate (CV error).

In practice, we typically choose ten cross-validation folds. The more data we have, or the less variance there is in the dataset, the fewer folds we need.

To perform a 10-fold Cross-Validation on the mtcars dataset we use vfold_cv():

vsplit <- vfold_cv(mtcars)
vsplit
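The number of folds can be changed via the v argument; vfold_cv() also supports repeated cross-validation and stratified sampling via the repeats and strata arguments, for example:

vfold_cv(mtcars, v = 5, repeats = 2)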

Next, we compute the RMSE on each fold:

linear_mod <- linear_reg() %>%
  set_engine("lm")

rmse_cv <- vsplit %>% 
  mutate(rmse = map(splits, function(x) {
    linear_mod %>% 
       fit(mpg ~ ., data = analysis(x)) %>%      # fit on the v-1 analysis folds
       predict(new_data = assessment(x)) %>%     # predict on the held-out fold
       bind_cols(assessment(x)) %>% 
       rmse(mpg, .pred) 
  })) %>% 
  unnest(rmse)
rmse_cv

We see quite some variation in the CV errors across the individual folds. Thanks to averaging, the overall CV error is much more stable and is given by

rmse_cv %>% 
  summarise(CV_error = mean(.estimate), 
            CV_sd = sd(.estimate),
            CV_min = min(.estimate),
            CV_max = max(.estimate))

Bootstrap

Using the bootstrap, we repeatedly sample from the dataset with replacement until each resample has the size of the original dataset. Observations that are not drawn into a resample (the out-of-bag observations) form its assessment set.

bsplit <- bootstraps(mtcars, times = 10)
bsplit
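We can verify this for a single resample using rsample's analysis() and assessment() accessors:

first_resample <- bsplit$splits[[1]]
nrow(analysis(first_resample))    # same number of rows as mtcars
nrow(assessment(first_resample))  # out-of-bag rows only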

The resulting bootstrap error for each resample can then be calculated analogously to the cross-validation errors in the previous section:

linear_mod <- linear_reg() %>%
  set_engine("lm")

rmse_bt <- bsplit %>% 
  mutate(rmse = map(splits, function(x) {
    linear_mod %>% 
       fit(mpg ~ ., data = analysis(x)) %>%      # fit on the bootstrap resample
       predict(new_data = assessment(x)) %>%     # predict on the out-of-bag rows
       bind_cols(assessment(x)) %>% 
       rmse(mpg, .pred) 
  })) %>% 
  unnest(rmse)
rmse_bt

The bootstrap errors are then averaged analogously:

rmse_bt %>% 
  summarise(bt_error = mean(.estimate), 
            bt_sd = sd(.estimate),
            bt_min = min(.estimate),
            bt_max = max(.estimate))

Final Notes

It is generally a good idea to split your dataset into a training and a test set at the very beginning, in addition to applying a resampling technique to the resulting training set. This gives you an unseen test set that you can use to validate or select your final model. Just make sure that you perform this final validation/selection only once: otherwise, information from the previously unseen test set leaks into your analysis and may again lead to overfitting.
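A compact sketch of this workflow (the seed and the final model are chosen purely for illustration):

set.seed(42)
isplit    <- initial_split(mtcars)        # hold out a test set right away
train_set <- training(isplit)
folds     <- vfold_cv(train_set, v = 10)  # resample the training set only

# ... use `folds` to compare candidate models and tune their parameters ...

# final, one-time evaluation of the chosen model on the untouched test set
linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = train_set) %>%
  predict(new_data = testing(isplit)) %>%
  bind_cols(testing(isplit)) %>%
  yardstick::rmse(mpg, .pred)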

Exercises

Below you can find the exercises for the resampling chapter: