Case Studies: Bank Marketing

Problem Description

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to (‘yes’) or not (‘no’). The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Questions

  1. Read the CSV file stored in data/bank.csv. This time, use read.csv2(), so that character columns are automatically converted to factors. This is required for later modelling purposes.
  2. Exploratory Data Analysis: Create histograms of age and account balance using geom_histogram(). What do you see? If necessary, transform variables accordingly. Use e.g. log() and/or scale() to change the shape of the distribution. How could balance be transformed? Check the target variable y: how is it distributed? Hint: use summary() to get a simple count of the classes.
  3. Before we start modelling, we split our data into a training and a test set (70/30 split). Introduce a new variable id which is simply the current row number. Use sample_frac() to create the training set. Use anti_join() to anti-join the training set with the original dataset on the variable id.
  4. Let’s train a decision tree using rpart on our training dataset. What do you see? Hint: You can use rpart.plot() from the package rpart.plot to create a nicer looking plot.
  5. Let’s create a confusion matrix for our model and determine its accuracy. First, call predict(newdata = test) on our resulting tree with type = "class" to obtain class predictions. Next, use table() to cross-tabulate the predictions against the actual classes. The accuracy is simply sum(diag(confmat)) / sum(confmat).
  6. Create a logistic model using glm(..., family = "binomial"). Convert the output probabilities from predict() back to a factor and compare with the actual classes. According to the accuracies of the tree and the logistic regression models, which one is better?

Bonus: Use tidy() from the broom package to extract the model estimates. You can then plot the p-values using geom_bar(). Convert term to a factor via mutate(), using .$term as the levels after ordering with arrange(desc(p.value)).

Data Description

Attribute Information:

Bank client data

  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
  3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
  5. default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)

Other attributes:

  1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  3. previous: number of contacts performed before this campaign and for this client (numeric)
  4. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

Social and economic context attributes

  1. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  2. cons.price.idx: consumer price index - monthly indicator (numeric)
  3. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  4. euribor3m: euribor 3 month rate - daily indicator (numeric)
  5. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

  1. y - has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)

Source

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

1. Reading Data

Let’s read our bank dataset:

bankdata <- read.csv2("data/bank.csv")

It has 4521 observations and 16 attributes to consider.

2. Exploratory Data Analysis

Attribute: Age

The age of our customers is distributed as follows:
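A histogram of age can be produced as follows (a minimal sketch, assuming ggplot2 is installed and bankdata has been read as in step 1; the bin width of 5 years is an arbitrary choice):

```r
library(ggplot2)

# Histogram of customer age in 5-year bins
ggplot(bankdata, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "age", y = "count")
```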

Attribute: Account Balance
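Account balance is heavily right-skewed and contains negative values, so a plain log() is undefined for part of the data. One possible transformation (an assumption, not the only sensible choice) is to shift the variable so its minimum becomes 1 before taking the log:

```r
library(ggplot2)
library(dplyr)

# Shift balance so the smallest value maps to 1, then take the log.
# This keeps all values finite while compressing the long right tail.
bankdata %>%
  mutate(balance_log = log(balance - min(balance) + 1)) %>%
  ggplot(aes(x = balance_log)) +
  geom_histogram(bins = 30)
```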

Loan term deposits (y)

summary(bankdata$y)
##   no  yes 
## 4000  521

3. Split Train/Testset

To reduce the bias in our model, create a training data set with an equal class distribution. You can achieve this using sample_frac() followed by anti_join() on a common id variable. Change the code to group_by() the target variable y. What is now the difference in the resulting distribution of the target variable?

bankdataId <- bankdata %>% mutate(id = row_number()) 
train <- bankdataId %>% sample_frac(0.7)
test <- anti_join(bankdataId, train, by = "id")
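The grouped variant asked for above could look like this (a sketch assuming bankdataId from the previous step; set.seed() and the names trainStrat/testStrat are my additions):

```r
library(dplyr)

set.seed(42)  # make the sampling reproducible

# Grouping by y before sample_frac() draws 70% within each class,
# so the class proportions in the training set match the full dataset.
trainStrat <- bankdataId %>%
  group_by(y) %>%
  sample_frac(0.7) %>%
  ungroup()
testStrat <- anti_join(bankdataId, trainStrat, by = "id")

# Compare the class proportions with those of the ungrouped split
prop.table(table(trainStrat$y))
```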

4. Model Selection

Let’s train a decision tree using rpart on our training dataset. What do you see? Hint: You can use rpart.plot() from the package rpart.plot to create a nicer looking plot.
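A possible fit (a sketch assuming the train set from step 3; note that the helper column id is excluded from the predictors, since it carries no information about y):

```r
library(rpart)
library(rpart.plot)

# Classification tree on all predictors except the helper id column
mytree <- rpart(y ~ . - id, data = train, method = "class")

# Nicer-looking tree diagram than plot()/text()
rpart.plot(mytree)
```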

5. Confusion Matrix

Let’s create a confusion matrix for our model and determine its accuracy. First, call predict(newdata = test) on our resulting tree with type = "class" to obtain class predictions. Next, use table() to cross-tabulate the predictions against the actual classes. The accuracy is simply sum(diag(confmat)) / sum(confmat).
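Put together, this could look as follows (assuming mytree is the tree fitted in the previous step):

```r
# Predicted classes on the held-out test set
treepred <- predict(mytree, newdata = test, type = "class")

# Rows: predictions, columns: actual classes
confmat <- table(predicted = treepred, actual = test$y)
confmat

# Fraction of observations on the diagonal, i.e. classified correctly
accuracy <- sum(diag(confmat)) / sum(confmat)
accuracy
```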

6. Create Logistic Regression Model and Compare

Create a logistic model using glm(..., family = "binomial"). Convert the output probabilities from predict() back to a factor and compare with the actual classes. According to the accuracies of the tree and the logistic regression models, which one is better?

Bonus: Use tidy() from the broom package to extract the model estimates. You can then plot the p-values using geom_bar(). Convert term to a factor via mutate(), using .$term as the levels after ordering with arrange(desc(p.value)).

# Fit a logistic regression; exclude the helper id column from the predictors
myglm <- glm(y ~ . - id, data = train, family = "binomial")

# Predicted probabilities of the positive class ("yes") on the test set
mypred <- predict(myglm, newdata = test, type = "response")

# Convert probabilities back to a factor using a 0.5 threshold
myclass <- rep("no", nrow(test))
myclass[mypred > 0.5] <- "yes"
myclass <- factor(myclass, levels = levels(test$y))

# Accuracy: fraction of correctly classified test observations
accuracy <- mean(myclass == test$y)
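The bonus task could be sketched like this (assuming myglm from above; whether geom_bar(stat = "identity") or geom_col() is used is a matter of taste):

```r
library(broom)
library(dplyr)
library(ggplot2)

# Extract coefficient estimates and p-values as a tidy data frame,
# then fix the factor order of term by descending p-value so the
# bars appear in that order rather than alphabetically.
estimates <- tidy(myglm) %>%
  arrange(desc(p.value)) %>%
  mutate(term = factor(term, levels = .$term))

ggplot(estimates, aes(x = term, y = p.value)) +
  geom_bar(stat = "identity") +
  coord_flip()
```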