Linear Model Selection and Regularization
Although linear regression models have no tuning parameter per se, we can still vary the predictors in the model to see whether the expected out-of-sample error goes down. One option is a step-wise regression, which performs a greedy search over predictor combinations and returns the model with the lowest AIC. The Akaike Information Criterion (AIC) measures the relative goodness of fit of a model, corrected for the number of parameters: given models that perform similarly but differ in their number of predictors, it prefers the one with fewer predictors.
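To see the criterion in action, we can compare a full and a smaller model directly in base R (the reduced formula mpg ~ wt + cyl is just an illustrative choice, not the result of a search):

```r
# AIC trades goodness of fit against model size: lower is better.
mod_full  <- lm(mpg ~ ., data = mtcars)        # all 10 predictors
mod_small <- lm(mpg ~ wt + cyl, data = mtcars) # only 2 predictors

AIC(mod_full)
AIC(mod_small)  # lower: the smaller model wins despite the slightly worse fit
```

The full model explains slightly more variance, but the AIC penalty for its extra parameters outweighs that gain.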
Subset selection proceeds either
forward: starting with an empty model and adding predictors,
backward: starting with the full model and removing predictors, or
exhaustive: trying all possible subset combinations to find the optimal one.
Forward and backward selection estimate candidate models by adding (removing) one predictor at a time to (from) the model and greedily choose the step that improves model quality the most; if model quality cannot be improved further, the algorithm stops. Exhaustive selection instead evaluates every possible subset. Once a subset is found, a model is fit using least squares on the reduced set of variables. The selected subset can also be reused with other (non-linear) model families.
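Exhaustive search grows quickly: with \(p\) predictors there are \(2^p\) possible subsets. For the 10 predictors in mtcars this can be checked directly:

```r
p <- ncol(mtcars) - 1  # 10 predictors besides mpg
2^p                    # 1024 candidate subsets
sum(choose(p, 0:p))    # same count, summed over all subset sizes
```

This is why the greedy forward/backward variants, which only evaluate on the order of \(p^2\) models, are the common default.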
We can perform the selection the vanilla R way and compare the resulting models:
mod_full <- lm(mpg ~ ., data = mtcars)
mod_step <- stats::step(mod_full)
Ridge and Lasso Regression
Shrinkage methods take a different approach and regularize the model coefficients through a shrinkage parameter \(λ\). Ridge and Lasso regression are closely related; the only difference is that Ridge penalizes the \(L_2\) norm of the coefficients (sum of squares) whereas Lasso penalizes the \(L_1\) norm (sum of absolute values). This leads to different parameter estimates: Ridge coefficients shrink smoothly towards zero, whereas Lasso coefficients tend to be shrunk to exactly zero. This makes the Lasso a viable choice for feature selection as well.
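Written out in the usual least-squares notation (a sketch; by convention the intercept \(\beta_0\) is left unpenalized), the two objectives differ only in the penalty term:

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
```

The \(L_1\) penalty has a non-differentiable corner at zero, which is what allows Lasso estimates to land exactly on zero.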
We first fit a Ridge regression model using parsnip. Note that a glmnet model is a mixture of a Ridge and a Lasso regression; the corresponding model can be set through the mixture parameter, which specifies the proportion of Lasso regularization in the final mixture. We fit a pure Ridge model using
mod_ridge <- linear_reg(mixture = 0) %>%
  set_engine("glmnet") %>%
  fit(mpg ~ ., data = mtcars)
The resulting Ridge model contains the fits for a whole set of candidate values of \(λ\). We can therefore use the multi_predict function, which returns the predictions for each \(λ\):
multi_predict(mod_ridge, new_data = mtcars) %>%
  bind_cols(mtcars["mpg"]) %>%
  unnest(.pred) %>%
  group_by(penalty) %>%
  rmse(mpg, .pred) %>%
  ggplot(aes(penalty, .estimate)) + geom_line()
It is easy to see that the error is lowest when the penalty equals zero, which corresponds to a vanilla linear regression model.
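The equivalence at \(λ = 0\) can be verified with the closed-form ridge solution in base R (a sketch on two hand-picked, standardized predictors; the variable names are my own):

```r
X <- scale(as.matrix(mtcars[, c("wt", "hp")]))  # standardized predictors
y <- mtcars$mpg - mean(mtcars$mpg)              # centered response

ridge_coef <- function(lambda) {
  # closed form: (X'X + lambda * I)^-1 X'y
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

ridge_coef(0)    # identical to the least-squares coefficients
ridge_coef(100)  # shrunk towards (but not exactly to) zero
```

As \(λ\) grows the \(L_2\) norm of the coefficient vector shrinks monotonically, but no coefficient is set exactly to zero; that behavior is specific to the Lasso.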
So far we have only optimized a single model without any resampling, and thus ended up with penalty = 0, i.e. a plain linear regression model, as the best fit. What is still missing is a structured approach to specifying the tuning grid and the parameter search itself; this is where the tune package, still under development at the time of writing, comes in. Since mtcars has very few data points, we omit the initial split and directly perform cross-validation on the whole data set.
First, we specify the linear regression model; instead of setting the parameters penalty and mixture directly, we use tune() as a placeholder:
mtcars_model <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")
Next, we combine the model and the formula in a workflow(), implemented in the workflows package, to be used by tune:
mtcars_wflow <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(mtcars_model)
mtcars_wflow
## ══ Workflow ═══════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: linear_reg()
## ── Preprocessor ───────────────────────────────────────────────────────────
## mpg ~ .
## ── Model ──────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## Main Arguments:
## penalty = tune()
## mixture = tune()
## Computational engine: glmnet
Before we can get started we specify a tuning grid over both parameters:
grid_df <- mtcars_wflow %>%
  extract_parameter_set_dials() %>%
  grid_regular(levels = c(10, 3))
grid_df
## # A tibble: 30 x 2
## penalty mixture
## <dbl> <dbl>
## 1 0.0000000001 0.05
## 2 0.00000000129 0.05
## 3 0.0000000167 0.05
## 4 0.000000215 0.05
## 5 0.00000278 0.05
## 6 0.0000359 0.05
## 7 0.000464 0.05
## 8 0.00599 0.05
## 9 0.0774 0.05
## 10 1 0.05
## # … with 20 more rows
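The same grid can be built by hand with base R's expand.grid, which makes the regular structure explicit (the level values below are read off the output above):

```r
grid_manual <- expand.grid(
  penalty = 10^seq(-10, 0, length.out = 10),  # 10 log-spaced penalties
  mixture = c(0.05, 0.525, 1)                 # 3 mixture levels
)
nrow(grid_manual)  # 30 candidate combinations
```

Spacing the penalty on a log scale is the important design choice here: useful \(λ\) values typically span several orders of magnitude.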
and a cross-validation resampling:
cv_splits <- vfold_cv(mtcars, v = 3)
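Under the hood, 3-fold cross-validation partitions the rows into three folds and evaluates each candidate model on the held-out fold. A minimal base-R sketch of the idea (using an arbitrary fixed formula instead of the tuned glmnet model):

```r
set.seed(42)
folds <- sample(rep(1:3, length.out = nrow(mtcars)))  # random fold assignment

rmse_per_fold <- sapply(1:3, function(k) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != k, ])   # train on 2 folds
  pred <- predict(fit, newdata = mtcars[folds == k, ])     # predict held-out fold
  sqrt(mean((mtcars$mpg[folds == k] - pred)^2))            # fold RMSE
})

mean(rmse_per_fold)  # out-of-sample error estimate
```

tune_grid below does exactly this for every row of the tuning grid, which is why a non-zero penalty can now come out ahead.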
Finally, we can optimize our glmnet model and get a nice progress output through
mtcars_glmnet <- tune_grid(mtcars_wflow,
resamples = cv_splits,
grid = grid_df,
control = control_grid(verbose = TRUE))
mtcars_glmnet %>% autoplot()
Alternatively, we can aggregate the per-resample metrics ourselves:
mtcars_glmnet %>%
  collect_metrics(summarize = FALSE) %>%
  group_by(penalty, mixture) %>%
  summarize(est = mean(.estimate))
See ?autoplot.tune_results for more info.