Linear Regression

Regression: Residuals

We start by fitting a regression model that predicts a car's fuel efficiency (mpg, miles per gallon) from its weight (wt). The scatterplot with the regression line and residuals is shown below.

fit <- lm(mpg ~ wt, data = mtcars)
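A minimal sketch of how such a plot could be produced with base R graphics (the exact plotting code is an assumption, not part of the original):

# Scatterplot of mpg vs. weight with the fitted line and residual segments
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "blue")                       # fitted regression line
segments(mtcars$wt, mtcars$mpg,                 # vertical residuals:
         mtcars$wt, fitted(fit), col = "red")   # actual vs. fitted values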

Regression: Residual Sum of Squares

The residual sum of squares (RSS) is the sum of squared differences between the estimated values \(\hat{y}_i\) and the actual values \(y_i\).

\[RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2\]

To make the \(RSS\) independent of the number of observations we can simply divide by \(n\), obtaining the Mean Squared Error (MSE).

\[MSE = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2\]

The Root Mean Squared Error (RMSE) measures the typical distance between the model's predictions and the actual values, and is expressed in the same units as the original variable:

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2}\]
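Using the fit from above, these three quantities can be computed directly in R:

res  <- residuals(fit)       # y_i - y_hat_i
rss  <- sum(res^2)           # residual sum of squares
mse  <- rss / length(res)    # mean squared error
rmse <- sqrt(mse)            # root mean squared error, in units of mpg
c(RSS = rss, MSE = mse, RMSE = rmse)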

Regression: Fitting Wage

  • Fit a linear model on the Wage dataset predicting wage using the variables year, age, maritl, race and education as predictors (a sketch of the full exercise follows this list).
  • Save the result to model.
  • Make predictions using model on the full original dataset and save the result to p.
  • Compute errors using the formula error=predicted−actual. Save the result to error.
  • Compute RMSE using the formula sqrt(mean(error^2)) and print it to the console.
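A minimal sketch of these steps, assuming the Wage data comes from the ISLR package:

library(ISLR)   # assumption: the Wage dataset ships with ISLR

# Fit the linear model with the listed predictors
model <- lm(wage ~ year + age + maritl + race + education, data = Wage)

# Predict on the full original dataset
p <- predict(model, Wage)

# Compute errors and RMSE
error <- p - Wage$wage
sqrt(mean(error^2))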

Linear Model Selection and Regularization

OLS Regression

Let’s fit and examine a linear regression model on the Hitters dataset:
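A minimal sketch, assuming the Hitters data from the ISLR package with missing salaries removed:

library(ISLR)                    # assumption: Hitters comes from ISLR

Hitters <- na.omit(Hitters)      # drop players with a missing Salary
ols_fit <- lm(Salary ~ ., data = Hitters)
summary(ols_fit)                 # coefficients, standard errors, R-squared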

In the case of highly correlated predictors we could also consider the following (sketched after the list):

  • Principal Component Regression (PCR), e.g. via the pcr() function in the pls package.
  • Partial Least Squares (PLS), e.g. via the plsr() function in the pls package.
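A minimal sketch of both approaches, assuming the pls package and the Hitters data from above:

library(pls)

# Principal Component Regression with 10-fold cross-validation
pcr_fit <- pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")

# Partial Least Squares with the same settings
pls_fit <- plsr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")

validationplot(pcr_fit, val.type = "MSEP")   # CV error vs. number of components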

Stepwise Regression

Search for the best variable combination in the model space using the regsubsets() function from the leaps package.

See which variables are selected for each of the following methods (a sketch follows the list):

  • "forward"
  • "backward"
  • "exhaustive"

See how BIC and AIC vary with the number of variables (a sketch follows below):

  • What is the best subset/number of variables to choose from?
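One way to compare the criteria, using the exhaustive fit; note that leaps reports Mallows' Cp rather than AIC, the two being equivalent for Gaussian linear models:

reg_fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "exhaustive")
reg_sum <- summary(reg_fit)

plot(reg_sum$bic, type = "b", xlab = "Number of variables", ylab = "BIC")
which.min(reg_sum$bic)   # model size with the lowest BIC
which.min(reg_sum$cp)    # model size with the lowest Cp (AIC-like)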

Fit best model

Based on the results from the exhaustive selection, fit a linear regression model:
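A minimal sketch, assuming the exhaustive fit from above and using BIC to pick the model size; selecting columns of the design matrix avoids translating dummy-coded factor names back to their factors:

best_size <- which.min(reg_sum$bic)                     # size chosen by BIC
best_vars <- names(coef(reg_fit, id = best_size))[-1]   # selected coefficients, intercept dropped

x <- model.matrix(Salary ~ ., data = Hitters)[, -1]     # design matrix without intercept
best_fit <- lm(Hitters$Salary ~ x[, best_vars])         # refit OLS on the selected columns
summary(best_fit)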