---
title: "QBit to analyze used cars data"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r message=FALSE, warning=FALSE}
library(rsample)
library(recipes)
library(parsnip)
library(tune)
library(yardstick)
library(dplyr)
library(rpart.plot)
library(ggplot2)
library(readr)
library(tibble)
library(tidyr)
library(ranger)
library(vip)
library(forcats)
```
```{r message=FALSE}
cars <- read_csv("audi.csv") %>%
  mutate(model = as.factor(model),
         transmission = as.factor(transmission),
         fuelType = as.factor(fuelType))
```
In this project we will work with a data set containing information about used cars listed on an online selling platform. The data set was extracted automatically using web scraping techniques and is openly available on the data science platform Kaggle.

We will restrict our analysis to the Audi car brand and train a model that predicts the price of used cars. Such models could be used to estimate the value of used cars and thus find good deals on undervalued cars. Similar models could also be applied as part of filtering systems that search for suspicious listings, which often diverge significantly from the expected price.

To get an overview of the data set, we take a look at the first few observations and interpret the variables:
```{r}
head(cars)
```
- `model`: A factor describing the model type of the car.
- `year`: An integer variable describing the year of production.
- `transmission`: A factor describing the transmission type of the car.
- `mileage`: An integer value with the number on the odometer.
- `fuelType`: A factor with the fuel type of the car.
- `tax`: A flat-rate tax depending on things like the vehicle's CO2 emissions.
- `mpg`: The miles-per-gallon efficiency of the vehicle.
- `engineSize`: The engine size in liters.
- `price`: The listed price on the used car selling platform.

Before splitting the data set into separate training and testing sets, there is one issue we must handle first. Out of the 10,668 cars in the data set, 28 are hybrid vehicles. However, these cars differ in many ways from conventional ones. Most notably, their `mpg` efficiency is measured based on different principles, resulting in significantly higher values. For the sake of simplicity, we will exclude hybrid cars from our analysis:
```{r}
cars <- filter(cars, fuelType != "Hybrid")
```
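As a quick sanity check (not part of the modeling pipeline), we can confirm that only the conventional fuel types remain:

```{r}
count(cars, fuelType)
```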
Next, we split the data set into separate training and testing sets and create cross-validation folds from the training data, which will serve as validation sets. This way, we reserve a part of the data set exclusively for the evaluation of the final model, which is crucial for obtaining a reliable estimate of the model performance:
```{r}
set.seed(42)
cars_split <- initial_split(cars)
cars_train <- training(cars_split)
cars_test <- testing(cars_split)
cars_cv <- vfold_cv(cars_train)
```
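By default, `initial_split()` reserves one quarter of the data for testing and `vfold_cv()` creates 10 folds. If different proportions were needed, the defaults could be made explicit (a sketch, not run here):

```{r eval=FALSE}
cars_split <- initial_split(cars, prop = 0.75)  # 75% training, 25% testing (the default)
cars_cv <- vfold_cv(cars_train, v = 10)         # 10-fold cross-validation (the default)
```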
## Linear Regression

We start off by training a linear regression model. As a first step, we take a look at the level frequencies of the `model` factor variable:
```{r}
ggplot(cars_train) +
  geom_bar(aes(model)) +
  coord_flip()
```
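For readability, the bars could also be ordered by frequency with `fct_infreq()` from the already loaded forcats package (an optional variation):

```{r}
ggplot(cars_train) +
  geom_bar(aes(fct_infreq(model))) +
  coord_flip()
```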
Some `model` types clearly dominate the data set. This might decrease the prediction performance of the model on unseen data, as we won't have enough samples to generalize well about specific car models.

More importantly, some car models might become unique to the evaluation folds during cross-validation, which would cause errors. As a solution, we collapse factor levels so that no car model has a frequency of 1. The only A2 model in the data set will be handled as an A1, while the only RS7 model will be combined with the RS6 models:
```{r}
rec <- recipe(price ~ ., data = cars_train) %>%
  step_mutate(model = fct_collapse(model,
                                   A1_2 = c("A1", "A2"),
                                   RS6_7 = c("RS6", "RS7")))
```
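To see which `model` levels are actually rare in the training data (a quick inspection; the exact counts depend on the random split):

```{r}
count(cars_train, model) %>%
  arrange(n)
```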
Next, we take a look at the distribution of the target variable `price`:
```{r}
ggplot(cars_train) +
  geom_histogram(aes(price))
```
As the distribution seems to be quite skewed, we apply a log transformation to the target variable, which creates a more favorable distribution for the regression model. For this step we use the `step_log()` preprocessing function:
```{r}
rec <- recipe(price ~ ., data = cars_train) %>%
  step_mutate(model = fct_collapse(model,
                                   A1_2 = c("A1", "A2"),
                                   RS6_7 = c("RS6", "RS7"))) %>%
  step_log(price)
```
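To verify the transformation, we can prep the recipe on the training data and plot the processed `price` column (a quick check; assuming a recipes version where `bake(new_data = NULL)` returns the retained training data, older versions used `juice()`):

```{r message=FALSE}
rec %>%
  prep(training = cars_train) %>%
  bake(new_data = NULL) %>%
  ggplot() +
  geom_histogram(aes(price))
```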
Finally, there are two outlier issues which we must handle before training a model. There are 45 observations with an `engineSize` of 0. These all describe cars with an internal combustion engine, which makes a value of 0 impossible:
```{r}
filter(cars_train, engineSize == 0)
```
In addition, there are some observations with an `mpg` efficiency above 100. These are suspicious listings with values that are not possible in reality:
```{r}
filter(cars_train, mpg > 100)
```
In order to eliminate the influence of these false/missing values on the model, we exclude these observations using the `step_filter()` function:
```{r}
rec <- recipe(price ~ ., data = cars_train) %>%
  step_mutate(model = fct_collapse(model,
                                   A1_2 = c("A1", "A2"),
                                   RS6_7 = c("RS6", "RS7"))) %>%
  step_log(price) %>%
  step_filter(engineSize > 0, mpg < 100)
```
Note that the filtering step will only be applied during the training of the model. The testing data remains unchanged, which ensures that we get a reliable estimate of the model performance.
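This behavior comes from the `skip = TRUE` default of `step_filter()`: row filtering is applied when the recipe is prepped on the training data, but skipped when baking new data. A minimal check (using the same `bake()` convention as above):

```{r}
prepped <- prep(rec, training = cars_train)
nrow(bake(prepped, new_data = NULL))        # training rows, outliers removed
nrow(cars_train)                            # original number of training rows
nrow(bake(prepped, new_data = cars_test))   # all test rows are kept
```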
Now that we have determined the required preprocessing steps, we can train and evaluate the model in a cross-validation setting:
```{r}
model <- linear_reg() %>%
  set_engine("lm")

res <- fit_resamples(model,
                     preprocessor = rec,
                     resamples = cars_cv,
                     control = control_resamples(save_pred = TRUE))

collect_metrics(res)
```
The R-squared value provides a promising estimate of the prediction performance. However, the predictions in this case represent the log of the actual price. Thus, to get the actual currency value, we need to calculate the exponential of the predictions:
```{r}
predictions <- collect_predictions(res) %>%
  mutate(.pred = exp(.pred)) %>%
  arrange(.row)

predictions
```
As the predictions are now arranged by the `.row` variable, they perfectly align with the observations in the `cars_train` data set. Thus we can simply bind them together and calculate the performance metrics once again. This time we use the exponential of the predictions and the original, non-transformed true prices contained in `cars_train`:
```{r}
comparison <- bind_cols(select(predictions, .pred),
                        select(cars_train, price))
metrics(comparison, price, .pred)
```
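To visualize how well the predictions match the true prices, we could plot them against each other (an illustrative sketch):

```{r}
ggplot(comparison, aes(price, .pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(color = "red")  # perfect predictions would lie on this line
```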
This performance measure is based on the actual car prices, which makes it directly comparable with any other model performance in the upcoming sections.
## Decision Tree
As an alternative to linear regression, in this section we train a decision tree. This algorithm is not constrained by the assumption of a linear relationship between the predictors and the target variable.
First we define the model and the tuning hyperparameters:
```{r}
model_dt <- decision_tree(mode = "regression",
                          tree_depth = tune(),
                          min_n = tune(),
                          cost_complexity = tune()) %>%
  set_engine("rpart")
```
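The `tune()` placeholders mark hyperparameters to be optimized during tuning. Their ranges can be inspected with the dials tooling (a sketch, assuming a recent tidymodels version):

```{r eval=FALSE}
extract_parameter_set_dials(model_dt)
```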
Next, we create a recipe object for the model:
- We will not log-transform the target variable for the decision tree.
- New, unseen car `model` types might still cause issues during testing. Therefore, we keep the factor collapsing step.
- The (false) 0 values in `engineSize` might mislead the model when searching for an optimal splitting point. The same is true for the outliers in the `mpg` column. Therefore, these observations will be excluded during the training:
```{r}
rec <- recipe(price ~ ., data = cars_train) %>%
  step_mutate(model = fct_collapse(model,
                                   A1_2 = c("A1", "A2"),
                                   RS6_7 = c("RS6", "RS7"))) %>%
  step_filter(engineSize > 0, mpg < 100)
```
Finally, we tune the model and extract the best performing model settings:
```{r eval=FALSE}
tune_res_dt <- tune_grid(object = model_dt,
                         preprocessor = rec,
                         resamples = cars_cv)
show_best(tune_res_dt, metric = "rsq", n = 3)
```
```{r echo=FALSE}
tune_res_dt <- readRDS("dt_tuned.rds")
show_best(tune_res_dt, metric = "rsq", n = 3)
```
The results improved somewhat compared to the linear regression.
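To better understand what the tuned tree looks like, we could refit the best configuration on the training data and draw it with the already loaded rpart.plot package (a sketch; the refit and plotting calls are illustrative additions, not part of the original tuning pipeline):

```{r eval=FALSE}
best_dt <- finalize_model(model_dt, select_best(tune_res_dt, metric = "rsq"))
fitted_dt <- fit(best_dt, price ~ ., data = cars_train)
rpart.plot(fitted_dt$fit, roundint = FALSE)  # fitted_dt$fit is the underlying rpart object
```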
## Random Forest
We have already seen that the decision tree provides promising results. As a next step, we implement a random forest model, which often provides improved performance compared to a single decision tree.
We perform a hyperparameter tuning on the following random forest model and extract the top model settings:
```{r eval=FALSE}
model_rf <- rand_forest(mode = "regression",
                        mtry = tune(),
                        trees = tune(),
                        min_n = tune()) %>%
  set_engine(engine = "ranger",
             importance = "impurity")

tune_res_rf <- tune_grid(object = model_rf,
                         preprocessor = rec,
                         resamples = cars_cv,
                         control = control_resamples(verbose = TRUE))
show_best(tune_res_rf, metric = "rsq", n = 3)
```
```{r echo=FALSE}
model_rf <- rand_forest(mode = "regression",
                        mtry = tune(),
                        trees = tune(),
                        min_n = tune()) %>%
  set_engine(engine = "ranger",
             importance = "impurity")
tune_res_rf <- readRDS("rf_tuned.rds")
show_best(tune_res_rf, metric = "rsq", n = 3)
```
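Before committing to a final model, `autoplot()` from the tune package can give a quick visual overview of how each hyperparameter relates to the validation performance (an optional inspection step):

```{r}
autoplot(tune_res_rf, metric = "rsq")
```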
## Model Selection

Based on the cross-validation results from the three models, we can conclude that the random forest model performed best. Thus, we select this model for the final evaluation on the test set, which should provide a reliable estimate of the model's performance in a real-world setting. To do so, we take the following three steps:
1. Extract the best random forest hyperparameters.
1. Finalize the model object with these settings.
1. Train a model on the entire training set and evaluate it on the test set using the `last_fit()` function.
```{r}
best_parameters <- select_best(tune_res_rf, metric = "rsq")
final_model <- finalize_model(model_rf, best_parameters)
test_performance <- last_fit(final_model, preprocessor = rec, split = cars_split)
collect_metrics(test_performance)
```
The results are similar to the ones we have seen during the model development.
## Model Application
To illustrate an actual use case of the model, we collect the predictions and extract the cars which seem to be much cheaper than predicted:
```{r}
collect_predictions(test_performance) %>%
  mutate(diff = price - .pred) %>%
  arrange(diff) %>%
  slice_head(n = 10)
```
The first few suggestions with the largest differences turn out to be somewhat flawed predictions. If we take a closer look at the data, we see that the largest difference is estimated for the following car:
```{r}
cars %>%
  slice(9893)
```
However, there was only a single observation in the training set that was comparable to this car. This single observation was most likely not enough to extract useful patterns and estimate accurate prices:
```{r}
filter(cars_train, model == "S4")
```
Nonetheless, we can easily find good deals among the top suggestions. One example could be the following car:
```{r}
cars %>%
  slice(1726)
```
In this case we have 55 similar models in the training set. If we take a look at the `price` distribution of these 55 cars, we can easily see that the listing for $27,991 is indeed a favorable deal:
```{r}
filter(cars_train, model == "Q5", year == 2018, mileage < 30000) %>%
  summarise(min(price), mean(price), median(price), max(price))
```
## Variable Importance
Finally, to get a better understanding of the model, we take a look at the variable importance, as estimated by the `vip()` function:
```{r}
model_inspect <- final_model %>%
  fit(price ~ ., data = cars_train)

vip(model_inspect)
```
Based on this analysis we can conclude that the miles-per-gallon (`mpg`) value has the largest influence on the price of the used cars. The exact relationship becomes visible when we plot the `price` against the `mpg` values. Note that cars with a low `mpg` value tend to be sports cars, which are usually more expensive:
```{r}
filter(cars_train, mpg < 100) %>%
  ggplot() +
  geom_point(aes(price, mpg))
```
Another decisive variable is the production year of the cars. We can see the importance of this variable by making a summary of the average used car prices for each production year separately:
```{r message=FALSE}
cars_train %>%
  group_by(year) %>%
  summarise(NumberOfCars = n(),
            MeanPrice = mean(price)) %>%
  arrange(desc(year))
```
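The same trend can also be shown graphically (an illustrative sketch):

```{r message=FALSE}
cars_train %>%
  group_by(year) %>%
  summarise(MeanPrice = mean(price)) %>%
  ggplot() +
  geom_col(aes(year, MeanPrice))
```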