Use call center data from a bank marketing use case to predict whether a client will subscribe to a term deposit.
knitr::opts_chunk$set(echo = TRUE)
library(rsample)
library(recipes)
library(parsnip)
library(tune)
library(dplyr)
library(yardstick)
library(readr)
library(ranger)
library(themis)
library(vip)
library(ggplot2)
bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = TRUE)
In this project we will work with a telemarketing-related data set, available on the UCI Machine Learning Repository. The data set describes the success of marketing campaigns conducted by a bank. The majority of the variables describe the clients who were contacted over the phone. Additional variables describe different aspects of the marketing campaign and the economic situation at the time of the call. The target variable y describes the outcome of the call, i.e. whether the client opened a term deposit or not:
head(bank)
Before splitting the data set into separate training and testing sets, we remove the duration variable. This variable is tightly connected to the outcome itself - e.g. a marketing call that lasts only a few seconds is generally not successful. However, the call duration cannot be known prior to contacting the clients, which makes it a flawed predictor in real-world settings.
bank$duration <- NULL
Next, we split the data set into training and testing sets. Additionally, we create cross-validation folds from the training set for later model tuning:
set.seed(42)
bank_split <- initial_split(bank)
bank_train <- training(bank_split)
bank_test <- testing(bank_split)
bank_cv <- vfold_cv(bank_train)
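As a quick sanity check, we can confirm the default 3:1 training/testing ratio produced by initial_split() and print the cross-validation object:
# Roughly 75% of the rows should end up in the training set (initial_split() default)
nrow(bank_train) / nrow(bank)
# vfold_cv() creates 10 folds by default
bank_cv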
As a first step, we take a look at the class frequencies of the target variable:
ggplot(bank_train) +
geom_bar(aes(y))
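The imbalance can also be quantified directly by counting the class frequencies in the training set:
# Share of each outcome class in the training data
bank_train %>%
  count(y) %>%
  mutate(prop = n / sum(n))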
Since the target is heavily imbalanced, we create a recipe that includes a downsampling step:
rec <- recipe(y ~ ., data = bank_train) %>%
step_downsample(y)
The downsampling step removes rows from the majority class, such that the factor levels yes and no become equally frequent:
rec %>%
prep() %>%
bake(new_data = NULL) %>%
ggplot() +
geom_bar(aes(y))
For this classification task, we will use a random forest model:
model_rf <- rand_forest(mode = "classification",
trees = tune(),
min_n = tune(),
mtry = tune()) %>%
set_engine(engine = "ranger",
importance = "impurity")
To find an optimal combination of hyperparameters, we conduct a tuning procedure:
tune_res <- tune_grid(object = model_rf,
preprocessor = rec,
resamples = bank_cv,
control = control_grid(save_pred = TRUE, event_level = "second"))
Running the grid search is time-consuming, so the tuning results were saved to disk beforehand. Here we load them and inspect the three best hyperparameter combinations according to the ROC AUC metric:
tune_res <- readRDS("tune_res_rf.RDS")
best_models <- show_best(tune_res, metric = "roc_auc", n = 3)
best_models
To get a more detailed view of the differences between the top three models, we plot their respective ROC curves. First, we extract the predictions belonging to the top three hyperparameter settings:
best_model_predictions <- collect_predictions(tune_res, parameters = best_models)
best_model_predictions
Afterwards, we calculate and plot the ROC curve for each model separately (models are identified by the .config variable):
best_model_predictions %>%
group_by(.config) %>%
roc_curve(truth = y, estimate = .pred_yes, event_level = "second") %>%
autoplot()
Based on this comparison, we cannot identify any meaningful difference in performance between the three model settings. Therefore, we simply select the one with the highest AUC value and extract its hyperparameter settings for the final model:
best_hyperparams <- select_best(tune_res, metric = "roc_auc")
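Printing the selected combination shows the concrete mtry, trees and min_n values that will be used for the final model:
# Hyperparameter values of the best-performing configuration
best_hyperparams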
Once we have selected the optimal hyperparameter settings, we apply the last_fit() function to get an estimate of the expected real-world performance of the model. First, we finalize the model with the best hyperparameters, then train it on the entire training data set and evaluate it on the test set:
model_rf <- finalize_model(model_rf, parameters = best_hyperparams)
lf <- last_fit(model_rf, rec, bank_split)
collect_metrics(lf)
Based on the performance metrics, we can conclude that the model performs similarly on the test set to what we saw during cross-validation.
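To make this comparison explicit, we can extract the cross-validation estimates of the selected configuration and compare them with the test-set metrics shown above:
# Cross-validation metrics of the selected hyperparameter configuration
collect_metrics(tune_res) %>%
  filter(.config == best_hyperparams$.config)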
To illustrate how the prediction model could improve future marketing campaigns, we take a look at a gain chart. This type of plot compares the percentage of tested observations (i.e. calls made) with the percentage of positive cases found (i.e. new term deposits acquired):
lf_preds <- collect_predictions(lf)
gc <- gain_curve(data = lf_preds, truth = y, estimate = .pred_yes, event_level = "second")
autoplot(gc)
Focusing on the gray area, we can see two extreme cases. In this data set, around 12% of the marketing calls are successful. A perfect model would therefore find 100% of the positive cases after testing the first 12% of predictions with the highest predicted probability. A random model, on the other hand, would find positive cases only in proportion to the number of calls tested (i.e. testing half of the predictions would find half of the positive cases).
Our general goal is to decrease the number of calls while still finding the majority of clients who will open a term deposit. The gain curve indicates that almost 50% of the relevant clients could be reached by making roughly 13% of the calls suggested by the model, and 80% of the positive cases could be found with 50% of the calls. The increase after this point is rather marginal. A more detailed analysis could determine up to which point the additional cost of the calls is justified by the acquired deposits.
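These percentages can be read off the data returned by gain_curve(). For example, the following sketch looks up the first points on the curve where at least 50% and 80% of the positive cases have been found (the exact values depend on the fitted model):
# First point where at least 50% of the positive cases are found
gc %>% filter(.percent_found >= 50) %>% slice_head(n = 1)
# First point where at least 80% of the positive cases are found
gc %>% filter(.percent_found >= 80) %>% slice_head(n = 1)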
Apart from the probability predictions themselves, the model can also provide useful insights into the patterns it has learned. Therefore, as a final step, we take a look at the most important variables in the prediction model. First, we fit a model on the entire training set:
model_inspect <- model_rf %>%
fit(y ~ ., data = bank_train)
Next, we inspect the variable importance scores with the vip() function:
vip(model_inspect)
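Besides the plot, the vi() function from the same vip package should return the importance scores as a table, which makes the exact values easier to read off:
# Impurity-based importance scores as a tibble
vi(model_inspect)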
Apparently, the month of the year has an interesting relationship with the outcome of the marketing calls. Indeed, if we take a look at the ratio of successful marketing calls for each month separately, we see markedly different success rates:
bank_train %>%
group_by(month) %>%
summarise(ratio = sum(y == "yes") / n()) %>%
arrange(desc(ratio))
Another clear pattern is that clients who were already successfully convinced by previous campaigns (i.e. they invested in or bought the product) tend to be good targets for new marketing campaigns as well:
bank_train %>%
group_by(poutcome) %>%
summarise(ratio = sum(y == "yes") / n()) %>%
arrange(desc(ratio))
Furthermore, the balance of the clients is also an important predictor, as clients with more money tend to open new term deposits more frequently:
bank_train %>%
group_by(CampaignOutcome = y) %>%
summarise(MeanBalance = mean(balance),
MedianBalance = median(balance))
All these insights could be used to improve future marketing campaigns. By focusing on the most relevant clients, campaigns could be made more efficient and less expensive.