
Linear Regression Models

Linear regression is probably the most popular model taught in statistics courses at schools and universities. It has been extensively researched, is conceptually easy to understand and has been used in academia and industry for decades. We will use these models for now to review the generic fitting and validation procedures in R. These procedures are later extended to more complex machine learning models.

Linear regression models assume a linear relationship between the predictors \(X_1, X_2,...,X_p\) and a numeric target variable \(Y\). The prediction errors \(\epsilon\) (or residuals) are assumed to be independent and normally distributed with mean zero and constant variance. The model finds the linear coefficients \(\beta\) (and the intercept \(\alpha\)) that minimize the residual sum of squares RSS.

We can thus write the regression equation as

\[Y = \alpha + X_1 \beta_1 + X_2 \beta_2 + ... + X_p \beta_p + \epsilon\]
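
Here the residual sum of squares over the \(n\) training observations, which the fit minimizes, is

\[RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

where \(\hat{y}_i\) denotes the model prediction for observation \(i\).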

First example

Let’s start by fitting a regression model to predict the fuel efficiency of a car (in miles per gallon, mpg) from its weight wt. Since heavier cars typically consume more fuel we expect this line to have a negative slope, i.e. a negative correlation between miles-per-gallon mpg and weight wt.

The code below shows a typical R example of fitting a model. The fitting function lm (short for linear model) takes as parameters a model formula, containing the output mpg and the input variable(s) wt separated by a tilde, and the corresponding data set mtcars.

fit <- lm(mpg ~ wt, data = mtcars)

The resulting scatter plot including the regression line and the residuals (red) is shown below. The regression model has fitted the line such that the residuals - more precisely, the residual sum of squares RSS - are minimized.
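
To reproduce such a plot yourself, a minimal sketch in base R could look as follows (the colors are assumptions):

plot(mpg ~ wt, data = mtcars)                # scatter plot of the raw data
abline(fit, col = "blue")                    # fitted regression line
segments(mtcars$wt, fitted(fit),             # vertical residual segments
         mtcars$wt, mtcars$mpg, col = "red")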

The regression line is determined by the fitted intercept and the coefficient of each input variable, which can be extracted with coef():

coef(fit)
## (Intercept)          wt 
##   37.285126   -5.344472
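
Reading off these coefficients, the fitted regression line is

\[\widehat{mpg} = 37.29 - 5.34 \cdot wt\]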

The model formula

We can use the following model formula to predict mpg using wt for a univariate regression model:

fit <- lm(mpg ~ wt, data = mtcars)

The model formula can also be written to contain multiple predictors separated by +:

fit2 <- lm(mpg ~ wt + hp + cyl, data = mtcars)

fit2 now predicts mpg using the three predictors weight wt, horsepower hp and number of cylinders cyl.

If all available predictors should be included in the model, you can also use the shortcut . by writing

fit3 <- lm(mpg ~ ., data = mtcars)

fit3 predicts mpg using all variables in mtcars except for the output mpg itself. To include all available variables of mtcars except gear, we can extend this shortcut with a - sign:

fit4 <- lm(mpg ~ . -gear, data = mtcars)

Your turn: Play around with the model formula to reproduce the examples above:

lm(mpg ~ wt, data = mtcars)

Predicting new values

The fitted model parameters \(\beta\) (and \(\alpha\)) can be used to predict unseen output values by simply multiplying them with the observed inputs as

\[\hat{Y} = \alpha + X_1 \beta_1 + X_2 \beta_2 + ... + X_p \beta_p\]

since the error term averages to zero. The obtained value \(\hat{Y}\) is called the prediction and can be obtained for any R model through predict():

predict(fit)

Note that without specifying the parameter newdata, predict() simply does an in-sample prediction, i.e. it multiplies the coefficients with the data points that have already been used for model fitting/training:

out <- coef(fit)["(Intercept)"] + coef(fit)["wt"] * mtcars$wt
names(out) <- rownames(mtcars)
out
##           Mazda RX4       Mazda RX4 Wag          Datsun 710 
##           23.282611           21.919770           24.885952 
##      Hornet 4 Drive   Hornet Sportabout             Valiant 
##           20.102650           18.900144           18.793255 
##          Duster 360           Merc 240D            Merc 230 
##           18.205363           20.236262           20.450041 
##            Merc 280           Merc 280C          Merc 450SE 
##           18.900144           18.900144           15.533127 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
##           17.350247           17.083024            9.226650 
## Lincoln Continental   Chrysler Imperial            Fiat 128 
##            8.296712            8.718926           25.527289 
##         Honda Civic      Toyota Corolla       Toyota Corona 
##           28.653805           27.478021           24.111004 
##    Dodge Challenger         AMC Javelin          Camaro Z28 
##           18.472586           18.926866           16.762355 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
##           16.735633           26.943574           25.847957 
##        Lotus Europa      Ford Pantera L        Ferrari Dino 
##           29.198941           20.343151           22.480940 
##       Maserati Bora          Volvo 142E 
##           18.205363           22.427495
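
As a quick sanity check, the manually computed values should match the in-sample predictions from predict():

all.equal(out, predict(fit))  # should return TRUE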

To get a true out-of-sample prediction we need to pass a data.frame/list to the newdata parameter. To find out how many miles per gallon a car achieves at weights from 1,000 to 6,000 lbs (wt is measured in units of 1,000 lbs) we could specify newdata as

predict(fit, newdata = data.frame(wt = 1:6))
##         1         2         3         4         5         6 
## 31.940655 26.596183 21.251711 15.907240 10.562768  5.218297

Try to modify wt yourself and see how the predicted output changes. At which weight does the predicted miles-per-gallon reach zero?
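
Hint: the zero crossing can also be computed directly from the fitted coefficients by solving \(\alpha + \beta \cdot wt = 0\) for wt:

# solve alpha + beta * wt = 0 for wt
-coef(fit)["(Intercept)"] / coef(fit)["wt"]
# roughly 6.98, i.e. a weight of about 7,000 lbs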

Summary

Models in R typically have a summary() method implemented which outputs information about the fitted model such as coefficients, standard errors, etc. In the case of the linear model we also get an indication of how significant each coefficient is, marked by dots and stars: more stars indicate higher significance, i.e. a smaller p-value and thus stronger evidence that the true coefficient is not zero:

fit3 <- lm(mpg ~ ., data = mtcars)
summary(fit3)
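
The coefficient table printed by summary() - the estimates, standard errors, t-statistics and p-values - can also be extracted programmatically as a matrix:

coef(summary(fit3))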

Exercise: Using the summary() output, try to determine

  • Which variables seem to be important based on the p-value \(Pr(>|t|)\) of the t-test?
  • What is the Residual standard error (RSE) and the Adjusted R-squared? Try to calculate the numbers on your own and compare (a sketch is given after this list).
  • What happens to the RSE and adjusted R-squared if you remove important/unimportant predictors? Explain.
  • Would you prefer a model having a higher adjusted R-squared or a higher RSE? Explain.
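
If you get stuck with the second bullet point, here is a minimal sketch recomputing the RSE and the adjusted R-squared by hand (assuming fit3 from above):

n <- nrow(mtcars)                 # number of observations
p <- length(coef(fit3)) - 1       # number of predictors
rss <- sum(residuals(fit3)^2)     # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
sqrt(rss / (n - p - 1))           # RSE, compare with summary(fit3)
1 - (rss / (n - p - 1)) / (tss / (n - 1))      # adjusted R-squared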