# Basic Model with R

## Linear Regression Models

Linear Regression is probably the most popular model taught in statistics at schools and universities. It has been extensively researched, is conceptually easy to understand and has been used in academia and industry for decades. We will use these models for now to review the generic fitting and validation procedures in R. These procedures are later extended for more complex machine learning models.

Linear regression models assume a linear relationship between the predictors $$X_1, X_2,...,X_p$$ and a numeric target variable $$Y$$. The prediction errors $$\epsilon$$ (or residuals) are assumed to be independent and normally distributed with mean zero and constant variance. The model tries to find linear coefficients $$\beta$$ (and the intercept $$\alpha$$) that minimize the residual sum of squares RSS.
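Writing the quantity being minimized out explicitly, the residual sum of squares over $$n$$ observations is

$RSS = \sum_{i=1}^{n} \left( y_i - \alpha - x_{i1} \beta_1 - x_{i2} \beta_2 - ... - x_{ip} \beta_p \right)^2$

where $$y_i$$ and $$x_{ij}$$ denote the observed target and predictor values of observation $$i$$.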

We can thus write the regression equation as

$Y = \alpha + X_1 \beta_1 + X_2 \beta_2 + ... + X_p \beta_p + \epsilon$

## First example

Let’s start by fitting a regression model to predict the fuel efficiency of a car (in miles per gallon, mpg) from its weight wt. Since heavier cars typically consume more fuel, we expect the regression line to have a negative slope, i.e. a negative correlation between miles-per-gallon mpg and weight wt.
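This expectation can be checked even before fitting any model, for example by computing the sample correlation with the built-in cor() function:

```r
# Correlation between fuel efficiency (mpg) and weight (wt) in mtcars;
# a negative value confirms the expected negative relationship
cor(mtcars$mpg, mtcars$wt)
```

The returned value is clearly negative, in line with our intuition about heavier cars.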

The code below shows a typical R example of fitting a model. The fitting function lm (short for linear model) takes as parameters a model formula, containing the output mpg and input variable(s) wt separated by a tilde, and the corresponding data set mtcars.

fit <- lm(mpg ~ wt, data = mtcars)

The resulting scatter plot including the regression line and residuals (red) is shown below. The model has fit the regression line such that the residuals, and in particular the residual sum of squares RSS, are minimized.
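A plot of this kind can be reproduced with base R graphics, for example (a sketch; the styling of the original figure may differ):

```r
fit <- lm(mpg ~ wt, data = mtcars)
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "blue")                  # fitted regression line
segments(mtcars$wt, mtcars$mpg,            # residuals drawn in red
         mtcars$wt, fitted(fit), col = "red")
```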

The regression line is specified by the fitted coefficients for each input variable, which can be retrieved with coef():

coef(fit)
## (Intercept)          wt
##   37.285126   -5.344472

## The model formula

We can use the following model formula to predict mpg using wt for a univariate regression model:

fit <- lm(mpg ~ wt, data = mtcars)

The model formula could also be written to contain multiple predictors separated by + as

fit2 <- lm(mpg ~ wt + hp + cyl, data = mtcars)

fit2 now predicts mpg using the three predictors weight wt, horsepower hp and number of cylinders cyl.

If all available predictors should be included in the model, you can also use the shortcut . and write

fit3 <- lm(mpg ~ ., data = mtcars)

fit3 predicts mpg using all variables in mtcars except for the output mpg. To include all available variables of mtcars except the variable gear, we can use a - sign:

fit4 <- lm(mpg ~ . -gear, data = mtcars)
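We can verify that the - sign indeed removed gear from the model by checking the names of the fitted coefficients:

```r
fit4 <- lm(mpg ~ . - gear, data = mtcars)
# gear should no longer appear among the fitted coefficients
"gear" %in% names(coef(fit4))
## [1] FALSE
```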

**Your turn:** Play around with the model formula to reproduce the examples above:

lm(mpg ~ wt, data = mtcars)

## Predicting new values

The fitted model parameters $$\beta$$ (and $$\alpha$$) can be used to predict unseen output values by simply multiplying them with the observed inputs:

$\hat{Y} = \alpha + X_1 \beta_1 + X_2 \beta_2 + ... + X_p \beta_p$

since the error term averages to zero. The obtained value $$\hat{Y}$$ is called the prediction, which can generally be obtained for any R model through predict():

predict(fit)

Note that without specifying the parameter newdata, predict() simply performs an in-sample prediction, i.e. it multiplies the coefficients with data points which have already been used for model fitting/training:

out <- coef(fit)["(Intercept)"] + coef(fit)["wt"] * mtcars$wt
names(out) <- rownames(mtcars)
out
##           Mazda RX4       Mazda RX4 Wag          Datsun 710
##           23.282611           21.919770           24.885952
##      Hornet 4 Drive   Hornet Sportabout             Valiant
##           20.102650           18.900144           18.793255
##          Duster 360           Merc 240D            Merc 230
##           18.205363           20.236262           20.450041
##            Merc 280           Merc 280C          Merc 450SE
##           18.900144           18.900144           15.533127
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood
##           17.350247           17.083024            9.226650
## Lincoln Continental   Chrysler Imperial            Fiat 128
##            8.296712            8.718926           25.527289
##         Honda Civic      Toyota Corolla       Toyota Corona
##           28.653805           27.478021           24.111004
##    Dodge Challenger         AMC Javelin          Camaro Z28
##           18.472586           18.926866           16.762355
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2
##           16.735633           26.943574           25.847957
##        Lotus Europa      Ford Pantera L        Ferrari Dino
##           29.198941           20.343151           22.480940
##       Maserati Bora          Volvo 142E
##           18.205363           22.427495

To get a true out-of-sample prediction we need to specify a data.frame/list for the newdata parameter. To find out how many miles per gallon a car can go at a weight of one to six thousand pounds (wt is measured in units of 1000 lbs), we could specify newdata as

predict(fit, newdata = data.frame(wt = 1:6))
##         1         2         3         4         5         6
## 31.940655 26.596183 21.251711 15.907240 10.562768  5.218297

Try to modify wt yourself and see how the predicted output changes. At which weight does the predicted miles-per-gallon reach zero?

## Summary

Models in R typically have a summary() method implemented which outputs information about the fitted model such as standard errors, coefficients, etc. In the case of the linear model we also get statistics on how significant each coefficient is, as indicated by dots and stars: more stars indicate a higher significance (a lower likelihood of the coefficient being zero):

fit3 <- lm(mpg ~ ., data = mtcars)
summary(fit3)

**Exercise:** Through the summary() output try to determine

- Which variables seem to be important based on the p-value $$Pr(>|t|)$$ of the t-statistic?
- What is the Residual standard error (RSE) and Adjusted R-squared? Try to calculate the numbers on your own and compare.
- What happens to the RSE and adjusted R-squared if you remove important/unimportant predictors? Explain.
- Would you prefer a model having a higher adjusted R-squared or a higher RSE? Explain.
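As a starting point for the second exercise item, both quantities can be computed by hand from the residuals (a sketch; compare the numbers against the summary(fit3) output):

```r
fit3 <- lm(mpg ~ ., data = mtcars)
n   <- nrow(mtcars)
p   <- length(coef(fit3)) - 1                  # number of predictors
rss <- sum(residuals(fit3)^2)                  # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
rse    <- sqrt(rss / (n - p - 1))              # residual standard error
adj_r2 <- 1 - (rss / (n - p - 1)) / (tss / (n - 1))
c(RSE = rse, Adj.R.squared = adj_r2)
```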