## Linear Regression Models

Linear Regression is probably the most popular model taught in statistics at schools and universities. It has been extensively researched, is conceptually easy to understand and has been used in academia and industry for decades. We will use these models for now to review the generic fitting and validation procedures in R. These procedures are later extended for more complex machine learning models.

Linear regression models assume a linear relationship between the predictors \(X_1, X_2, \dots, X_p\) and a `numeric` target variable \(Y\). The prediction errors \(\epsilon\) (or residuals) are assumed to be independent and normally distributed with mean zero and constant variance. The model finds linear coefficients \(\beta\) (and an intercept \(\alpha\)) that minimize the residual sum of squares (RSS).

We can thus write the regression equation as

\[Y = \alpha + X_1 \beta_1 + X_2 \beta_2 + \dots + X_p \beta_p + \epsilon\]

## First example

Let’s start by fitting a regression model to predict a car’s fuel efficiency (in miles per gallon, `mpg`) from its weight `wt`. Since heavier cars typically consume more fuel, we expect the regression line to have a negative slope, i.e. a negative correlation between miles-per-gallon `mpg` and weight `wt`.

The code below shows a typical R example of fitting a model. The fitting function `lm` (short for linear model) takes as parameters a *model formula*, containing the output `mpg` and input variable(s) `wt` separated by a tilde, and the corresponding data set `mtcars`:

`fit <- lm(mpg ~ wt, data = mtcars)`

The resulting scatter plot, including the regression line and residuals (red), is shown below. The model fits the line such that the residuals are minimized, specifically the residual sum of squares (RSS).

The regression line is specified by the fitted *coefficients* for each input variable, which can be retrieved with

`coef(fit)`

```
## (Intercept)          wt
##   37.285126   -5.344472
```
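For a single predictor these coefficients can be reproduced from the closed-form least-squares solution: the slope is the covariance of input and output divided by the variance of the input, and the intercept follows from the sample means. A quick sketch to verify:

```r
# Closed-form OLS for one predictor: slope = cov(x, y) / var(x)
x <- mtcars$wt
y <- mtcars$mpg
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
c(intercept = intercept, slope = slope)
## intercept       slope
## 37.285126   -5.344472
```

The values match `coef(fit)` exactly, since `lm` solves the same least-squares problem.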

## Predicting new values

The fitted model parameters \(\beta\) (and \(\alpha\)) can be used to predict unseen output values by simply multiplying them with the observed inputs,

\[\hat{Y} = \alpha + X_1 \beta_1 + X_2 \beta_2 + \dots + X_p \beta_p,\]

since the error term averages to zero. The obtained value \(\hat{Y}\) is called the *prediction*, which can generally be obtained for any R model through `predict()`:

`predict(fit)`

Note that without specifying the parameter `newdata`, `predict()` simply does an in-sample prediction, i.e. it multiplies the coefficients with the data points that have already been used for model fitting/training:

```
# Reproduce the in-sample predictions by hand: intercept + slope * weight
out <- coef(fit)["(Intercept)"] + coef(fit)["wt"] * mtcars$wt
names(out) <- rownames(mtcars)  # label predictions with car names
out
```

```
## Mazda RX4 Mazda RX4 Wag Datsun 710
## 23.282611 21.919770 24.885952
## Hornet 4 Drive Hornet Sportabout Valiant
## 20.102650 18.900144 18.793255
## Duster 360 Merc 240D Merc 230
## 18.205363 20.236262 20.450041
## Merc 280 Merc 280C Merc 450SE
## 18.900144 18.900144 15.533127
## Merc 450SL Merc 450SLC Cadillac Fleetwood
## 17.350247 17.083024 9.226650
## Lincoln Continental Chrysler Imperial Fiat 128
## 8.296712 8.718926 25.527289
## Honda Civic Toyota Corolla Toyota Corona
## 28.653805 27.478021 24.111004
## Dodge Challenger AMC Javelin Camaro Z28
## 18.472586 18.926866 16.762355
## Pontiac Firebird Fiat X1-9 Porsche 914-2
## 16.735633 26.943574 25.847957
## Lotus Europa Ford Pantera L Ferrari Dino
## 29.198941 20.343151 22.480940
## Maserati Bora Volvo 142E
## 18.205363 22.427495
```
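This manual computation agrees with `predict(fit)` up to floating-point precision, which we can check directly:

```r
fit <- lm(mpg ~ wt, data = mtcars)
manual <- coef(fit)["(Intercept)"] + coef(fit)["wt"] * mtcars$wt
names(manual) <- rownames(mtcars)
# all.equal compares values and names with a small numeric tolerance
stopifnot(isTRUE(all.equal(manual, predict(fit))))
```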

To get a true out-of-sample prediction we need to pass a `data.frame`/`list` to the `newdata` parameter. To find out how many miles per gallon a car achieves at a weight of one to six thousand pounds (in `mtcars`, `wt` is measured in units of 1000 lbs) we can specify `newdata` as

`predict(fit, newdata = data.frame(wt = 1:6))`

```
## 1 2 3 4 5 6
## 31.940655 26.596183 21.251711 15.907240 10.562768 5.218297
```
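`predict()` for linear models can also quantify the uncertainty of a prediction: passing `interval = "prediction"` returns lower and upper bounds alongside the point estimate. A small sketch for a car of weight 3 (i.e. 3000 lbs):

```r
fit <- lm(mpg ~ wt, data = mtcars)
# Point estimate plus a 95% prediction interval for a single new car
p <- predict(fit, newdata = data.frame(wt = 3), interval = "prediction")
p
# The "fit" column is ~21.25 (matching the table above);
# "lwr" and "upr" give the interval bounds.
```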

Try modifying `wt` yourself and see how the predicted output changes. At which weight does the predicted miles-per-gallon reach zero?

## Summary

Models in R typically implement a `summary()` method which outputs information about the fitted model such as coefficients, standard errors, etc. For the linear model we also get statistics of how significant each coefficient is, indicated by dots and stars: more stars indicate higher significance (a lower likelihood of the coefficient being zero):

```
# Regress mpg on all other variables in mtcars ("." expands to all remaining columns)
fit3 <- lm(mpg ~ ., data = mtcars)
summary(fit3)
```
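The quantities reported by `summary()` can also be accessed programmatically and recomputed by hand. For instance, the residual standard error is \(\sqrt{RSS/(n-p-1)}\), which R stores in the summary object as `sigma` (a quick sketch):

```r
fit3 <- lm(mpg ~ ., data = mtcars)
s <- summary(fit3)
# RSE by hand: sqrt(RSS / residual degrees of freedom)
rse_manual <- sqrt(deviance(fit3) / df.residual(fit3))
c(rse = s$sigma, rse_manual = rse_manual, adj_r2 = s$adj.r.squared)
```

Both RSE values agree, and `s$adj.r.squared` is the adjusted R-squared printed at the bottom of the summary.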

*Exercise*: Using the `summary()` output, try to determine:

- Which variables seem important based on the p-value `Pr(>|t|)` of the t-statistic?
- What are the residual standard error (RSE) and the adjusted R-squared? Try to calculate these numbers on your own and compare.
- What happens to the RSE and adjusted R-squared if you remove important/unimportant predictors? Explain.
- Would you prefer a model with a higher adjusted R-squared or a higher RSE? Explain.