## Overfitting

> With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
>
> John von Neumann

From the previous chapters we have seen that all supervised learning models require the outputs (labels) to be available for training and typically minimize some form of error measure on the training set (or, equivalently, maximize a likelihood function, with linear regression being a special case). The reason why we fit our models is **not** to perfectly explain our training data but to predict **unseen** data. By fitting arbitrarily complex models with many parameters to our training data we risk that the fitted model is too complex and has effectively memorized each training data point. This phenomenon leads to bad predictions on out-of-sample datasets and is called *overfitting*.
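As a minimal illustration (a base-R sketch with simulated data, not from this chapter's examples), a polynomial with as many parameters as data points can drive the training error to practically zero, while a simple line captures the underlying relationship:

```
set.seed(1)
# Simulate a noisy linear relationship
x <- seq(0, 1, length.out = 10)
y <- 2 * x + rnorm(10, sd = 0.3)

# A simple model vs. a maximally flexible polynomial
fit_simple  <- lm(y ~ x)
fit_complex <- lm(y ~ poly(x, 9))  # 10 parameters for 10 points

# The complex model "memorizes" the training data:
sum(residuals(fit_simple)^2)   # clearly positive
sum(residuals(fit_complex)^2)  # practically zero
```

The complex fit looks perfect on the training data, but its predictions between and beyond the training points oscillate wildly.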

**Example: Regression Spline**

The example below shows a regression line fitted using `smooth.spline()`. By modifying the `spar` (*smoothing parameter*) argument, which controls model complexity, we see that the line either runs smoothly through the point cloud (`spar=1`) or gets very wiggly and passes through each point (`spar=0`). We can easily see that, by setting `spar=0`, the regression line does not fit the actual relationship (or concept) between a car's weight `wt` and how many miles per gallon `mpg` it goes, but memorizes each data point. This model will clearly not generalize well to unseen data.
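A base-R sketch of the effect described above (the exact plotting setup of the original figure is not shown here, so this is an assumed reconstruction on the `mtcars` data):

```
# Smooth fit vs. interpolating fit of mpg on wt
fit_smooth <- smooth.spline(mtcars$wt, mtcars$mpg, spar = 1)  # simple
fit_wiggly <- smooth.spline(mtcars$wt, mtcars$mpg, spar = 0)  # overfitted

plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
lines(predict(fit_smooth), col = "blue")  # smooth trend line
lines(predict(fit_wiggly), col = "red")   # chases every point
```

The `spar=0` fit has a much smaller training error, which is precisely the symptom of memorization rather than a sign of a better model.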

**Classification with KNN**

The example below shows a k-nearest neighbor algorithm fitted to the `iris` dataset. Although this model is often considered *non-parametric*, we can still tune at least one hyperparameter, namely the number of neighbors used to determine the class of each data point. We can see that by reducing the number of neighbors more and more, the decision boundary becomes very complex (wiggly) and will probably not generalize well to unseen data.
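The original KNN figure is not reproduced here, but the memorization effect can be sketched with the `class` package (an assumption of this sketch; it ships with standard R installations). With `k = 1`, each training point is its own nearest neighbor, so the training data is classified perfectly:

```
library(class)  # knn()

X <- iris[, 1:4]    # the four numeric features
y <- iris$Species   # class labels

set.seed(1)
# Training accuracy for decreasing flexibility (increasing k)
for (k in c(1, 5, 25)) {
  pred <- knn(train = X, test = X, cl = y, k = k)
  cat("k =", k, "training accuracy:", mean(pred == y), "\n")
}
```

Perfect training accuracy at `k = 1` says nothing about performance on unseen data; this is exactly why we need the holdout estimates of the next section.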

## Resampling techniques


To overcome the problem of *overfitting* we need to compute the model error on an unseen holdout data set to give us the best (unbiased) estimate of the error rate on new data points. Additionally, the resulting error rate helps us to decide which models to use and select the parameters for the models accordingly.

The **rsample** package provides a number of tools to resample the dataset, most importantly:

- `initial_split()`: creates a single binary split into a training and a test dataset.
- `vfold_cv()`: V-fold cross-validation.
- `bootstraps()`: bootstrap resampling (sampling with replacement).
- Combinations of these techniques, e.g. `initial_split()` followed by `vfold_cv()`.

## Initial Split

The initial split function (also known as validation set approach) splits your dataset into a training and a test set. This is the most basic way to reserve a holdout set and can be done using

```
library(rsample)
isplit <- initial_split(mtcars)
isplit
```

which by default splits the data in the proportion of 3/4 training and 1/4 test set.

The training and test set can be extracted using `training()` and `testing()` on the resulting split object:

```
train_set <- training(isplit)
test_set <- testing(isplit)
dim(train_set)
dim(test_set)
```

We can now evaluate the performance of our previous linear model for the `mtcars` dataset using

```
library(parsnip)  # linear_reg(), set_engine(), fit()
linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = train_set) %>%
  predict(new_data = test_set) %>%
  bind_cols(test_set) %>%
  yardstick::rmse(mpg, .pred)
```

This gives us the Root-Mean-Squared-Error (RMSE) based on the test set.

However, we can easily see that this result can be highly dependent on the selected split and the resulting training and test set. This is especially true for small sample sizes and complex models with high variance:

```
isplit <- initial_split(mtcars)  # a new random split
train_set <- training(isplit)
test_set <- testing(isplit)
linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = train_set) %>%
  predict(new_data = test_set) %>%
  bind_cols(test_set) %>%
  yardstick::rmse(mpg, .pred)
```

## V-Fold Cross-Validation

To overcome the problem of validation set dependence we can instead split the dataset into \(v\) non-overlapping parts (folds) and rotate the validation set \(v\) times, training the model on the remaining data points each time. The procedure works as follows:

- Create \(v\) non-overlapping folds.
- For each fold, train the model on the other \(v-1\) folds and calculate the error rate on the remaining one.
- Average the errors over all folds to obtain the cross-validation error rate (CV error).

In practical examples, we typically choose ten cross-validation folds. The more data or the less variance in the dataset, the fewer folds we need.
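The procedure above can be sketched in base R (without **rsample**), using the `mtcars` data and a plain linear model:

```
set.seed(42)
v <- 10
# Randomly assign each row of mtcars to one of v folds
folds <- sample(rep_len(1:v, nrow(mtcars)))

fold_rmse <- sapply(1:v, function(i) {
  train <- mtcars[folds != i, ]          # v-1 folds for training
  held  <- mtcars[folds == i, ]          # remaining fold for validation
  fit   <- lm(mpg ~ ., data = train)
  sqrt(mean((held$mpg - predict(fit, newdata = held))^2))
})

mean(fold_rmse)  # the CV error
```

This makes explicit what `vfold_cv()` and the mapping code below do for us: every observation is used for validation exactly once.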

To perform a 10-fold cross-validation on the `mtcars` dataset we use `vfold_cv()`:

```
vsplit <- vfold_cv(mtcars)
vsplit
```

Next, we compute the RMSE on each fold:

```
library(tidyverse)  # mutate(), map(), unnest()
linear_mod <- linear_reg() %>%
  set_engine("lm")
rmse_cv <- vsplit %>%
  mutate(rmse = map(splits, function(x) {
    linear_mod %>%
      fit(mpg ~ ., data = training(x)) %>%
      predict(new_data = testing(x)) %>%
      bind_cols(testing(x)) %>%
      yardstick::rmse(mpg, .pred)
  })) %>%
  unnest(rmse)
rmse_cv
```

We see quite some deviation in the CV errors across the individual folds. Thanks to averaging, the overall CV error is much more stable and given by

```
rmse_cv %>%
  summarise(CV_error = mean(.estimate),
            CV_sd    = sd(.estimate),
            CV_min   = min(.estimate),
            CV_max   = max(.estimate))
```

## Bootstrap

Using the bootstrap we sample from the dataset *with replacement* until we reach the original dataset size; observations that are never drawn into a resample (the *out-of-bag* samples) serve as the assessment set.

```
bsplit <- bootstraps(mtcars, times = 10)
bsplit
```
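Internally, each bootstrap resample corresponds to drawing row indices with replacement; a base-R sketch of a single resample:

```
set.seed(123)
n   <- nrow(mtcars)
idx <- sample(n, replace = TRUE)  # draw n row indices with replacement

analysis_set <- mtcars[idx, ]              # same size as the original data
oob_set      <- mtcars[setdiff(1:n, idx), ]  # rows never drawn ("out-of-bag")

nrow(analysis_set)  # always equal to nrow(mtcars)
nrow(oob_set)       # the out-of-bag rows available for assessment
</imports-placeholder>
```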

The resulting bootstrap error for each resample can then be calculated analogously to the cross-validation errors in the previous section:

```
linear_mod <- linear_reg() %>%
  set_engine("lm")
rmse_bt <- bsplit %>%
  mutate(rmse = map(splits, function(x) {
    linear_mod %>%
      fit(mpg ~ ., data = training(x)) %>%
      predict(new_data = testing(x)) %>%
      bind_cols(testing(x)) %>%
      yardstick::rmse(mpg, .pred)
  })) %>%
  unnest(rmse)
rmse_bt
```

The bootstrap errors are then averaged in the same way:

```
rmse_bt %>%
  summarise(BT_error = mean(.estimate),
            BT_sd    = sd(.estimate),
            BT_min   = min(.estimate),
            BT_max   = max(.estimate))
```

### Final Notes

It is generally a good idea to split your initial dataset into a training and a test set **at the very beginning**, in addition to applying some resampling technique on the resulting training set. This gives you an **unseen** test set which you can use to validate or select your **final** model. You only have to make sure that you perform this **final validation/selection only once**. Otherwise, the information from your previously unseen test dataset will leak into all further analysis and may lead to overfitting.
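The recommended workflow can be sketched in base R (the formula `mpg ~ wt + hp` is just an illustrative choice, not the chapter's final model):

```
set.seed(1)
n <- nrow(mtcars)

# 1. Reserve a final test set once, at the very beginning
test_idx   <- sample(n, size = round(n / 4))
final_test <- mtcars[test_idx, ]
train_set  <- mtcars[-test_idx, ]

# 2. Resample only the training data for model selection (5-fold CV)
folds <- sample(rep_len(1:5, nrow(train_set)))
cv_error <- mean(sapply(1:5, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = train_set[folds != i, ])
  held <- train_set[folds == i, ]
  sqrt(mean((held$mpg - predict(fit, held))^2))
}))

# 3. Evaluate the chosen model ONCE on the untouched test set
final_fit  <- lm(mpg ~ wt + hp, data = train_set)
final_rmse <- sqrt(mean((final_test$mpg - predict(final_fit, final_test))^2))
```

The test rows never influence fold assignment, model fitting, or model selection; they are touched exactly once, in the last step.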

## Exercises

Below you can find the exercises for the resampling chapter: