# Linear Regression

## Regression: Residuals

We start by fitting a regression model that predicts a car's fuel efficiency (mpg, miles per gallon) from its weight (wt). The scatterplot, including the regression line and residuals, is shown below.

```r
fit <- lm(mpg ~ wt, data = mtcars)
```
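The plot described above can be reproduced in base R; the grey segments drawn from each observation to the fitted line are the residuals:

```r
# Scatterplot of mpg vs. wt with the regression line and residuals
fit <- lm(mpg ~ wt, data = mtcars)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "red")                       # fitted regression line
segments(mtcars$wt, mtcars$mpg,                # from each observed point...
         mtcars$wt, fitted(fit), col = "grey") # ...down to its fitted value
```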

## Regression: Residual Sum of Squares

The residual sum of squares (RSS) is the sum of squared differences between the estimated values $$\hat{y}_i$$ and the actual values $$y_i$$.

$$RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

To make the RSS independent of the number of observations, we can simply divide by $$n$$ to obtain the Mean Squared Error (MSE):

$$MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

Taking the square root gives the Root Mean Squared Error (RMSE), which measures the typical distance between model predictions and actual values and has the same units as the original variable:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$$
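All three quantities can be computed directly from a fitted model; a sketch using the `mpg ~ wt` fit on `mtcars` from above:

```r
# RSS, MSE and RMSE from the residuals of a fitted lm model
fit  <- lm(mpg ~ wt, data = mtcars)
res  <- residuals(fit)   # y_i - y_hat_i
rss  <- sum(res^2)       # residual sum of squares
mse  <- mean(res^2)      # divide by n
rmse <- sqrt(mse)        # back to the units of mpg
c(RSS = rss, MSE = mse, RMSE = rmse)
```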

## Regression: Fitting Wage

• Fit a linear model on the Wage dataset predicting wage using the variables year, age, maritl, race and education as predictors.
• Save the result to model.
• Make predictions using model on the full original dataset and save the result to p.
• Compute errors using the formula error=predicted−actual. Save the result to error.
• Compute RMSE using the formula sqrt(mean(error^2)) and print it to the console.
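A possible solution to the steps above, assuming the `Wage` data from the `ISLR` package (which contains the listed predictors):

```r
# Exercise sketch: fit, predict, and compute RMSE on the Wage data
library(ISLR)

model <- lm(wage ~ year + age + maritl + race + education, data = Wage)
p     <- predict(model, Wage)   # predictions on the full original dataset
error <- p - Wage$wage          # error = predicted - actual
sqrt(mean(error^2))             # RMSE, printed to the console
```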

## Linear Model Selection and Regularization

OLS Regression

Let’s fit and examine a linear regression model on the Hitters dataset:
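A minimal sketch, assuming the `Hitters` data from the `ISLR` package (rows with missing `Salary` are dropped first):

```r
# OLS regression of Salary on all other variables in Hitters
library(ISLR)

Hitters <- na.omit(Hitters)          # remove rows with missing Salary
ols <- lm(Salary ~ ., data = Hitters)
summary(ols)                         # coefficients, R^2, etc.
```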

In case of highly correlated predictors we could also consider:

• Principal Component Regression (PCR) through e.g. the pcr() function in the pls package.
• Partial Least Squares (PLS) through e.g. the plsr() function in the pls package.
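Both methods follow the same interface in the `pls` package; a sketch on the `Hitters` data (assumed from `ISLR`), with cross-validation to pick the number of components:

```r
# PCR and PLS with 10-fold cross-validation via the pls package
library(pls)
library(ISLR)

Hitters <- na.omit(Hitters)
pcr_fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")
pls_fit <- plsr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")

validationplot(pcr_fit, val.type = "MSEP")  # CV error vs. number of components
```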

## Stepwise Regression

Search for the best variable combination in model space using the leaps package.

See what variables are selected for methods:

• "forward"
• "backward"
• "exhaustive"
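A sketch using `leaps::regsubsets()`, assuming the `Hitters` data from `ISLR`; the `method` argument switches between the three search strategies:

```r
# Compare variable selection across forward, backward and exhaustive search
library(leaps)
library(ISLR)

Hitters <- na.omit(Hitters)
for (m in c("forward", "backward", "exhaustive")) {
  fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = m)
  print(summary(fit)$which)  # TRUE/FALSE matrix: variables in each model size
}
```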

## Compare BIC and AIC with the Number of Variables

• What is the best subset/number of variables to choose from?
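A sketch of this comparison, again assuming `Hitters`; `summary()` of a `regsubsets` fit reports BIC and Mallows' Cp (used here as a stand-in for AIC):

```r
# Plot BIC and Cp against model size to pick the best subset
library(leaps)
library(ISLR)

Hitters <- na.omit(Hitters)
fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "exhaustive")
s   <- summary(fit)

par(mfrow = c(1, 2))
plot(s$bic, type = "b", xlab = "Number of variables", ylab = "BIC")
plot(s$cp,  type = "b", xlab = "Number of variables", ylab = "Cp")

which.min(s$bic)  # model size with the lowest BIC
```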

## Fit best model

Based on the results from exhaustive selection, fit a linear regression model: