# Case Studies: Bank Marketing

## Problem Description

The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’). The classification goal is to predict whether the client will subscribe to a term deposit (variable `y`).

### Questions

1. Read the CSV file stored in `data/bank.csv`. This time, use `read.csv2()` so that character columns are automatically converted to factors. This is required for the modelling steps below.
2. Exploratory Data Analysis: Create histograms of age and account balance using `geom_histogram`. What do you see? If necessary, transform variables accordingly. Use e.g. `log()` and/or `scale()` to change shape of the distribution. How could balance be transformed? Check the target variable `y`: How is it distributed? Hint: use `summary()` to get a simple count of the classes.
3. Before we start modelling, we split our data into a training and a test set (70/30 split). Introduce a new variable `id` which is simply the current row number. Use `sample_frac()` to create the training set, then use `anti_join()` to anti-join the training set with the original dataset on the variable `id`.
4. Let’s train a decision tree using `rpart` on our training dataset. What do you see? Hint: You can use `rpart.plot()` from the package rpart.plot to create a nicer-looking plot.
5. Let’s create a confusion matrix for our model and determine its accuracy. First, use `predict(newdata = test)` on our resulting tree with `type = "class"` to obtain class predictions. Next, use the function `table()` to cross-tabulate our predictions against the actual classes. The accuracy is simply `sum(diag(confmat)) / sum(confmat)`.
6. Create a logistic model using `glm(..., family = "binomial")`. Convert the output probabilities from `predict` back to a factor and compare with the actual classes. According to the accuracy of the tree and the logistic regression model, which one is better?

Bonus: Use `tidy()` from the broom package to extract the estimates of the model. You can now plot the p-values using `geom_bar()`. Convert `term` to a factor through `mutate()`, using `.$term` (after ordering by `arrange(desc(p.value))`) as the factor levels.

## Data Description

### Attribute Information:

#### Bank client data

1. age (numeric)
2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
4. education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
5. default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
6. housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
7. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)

#### Other attributes:

1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
3. previous: number of contacts performed before this campaign and for this client (numeric)
4. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

#### Social and economic context attributes

1. emp.var.rate: employment variation rate - quarterly indicator (numeric)
2. cons.price.idx: consumer price index - monthly indicator (numeric)
3. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
4. euribor3m: euribor 3 month rate - daily indicator (numeric)
5. nr.employed: number of employees - quarterly indicator (numeric)

#### Output variable (desired target):

1. y - has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)

### Source

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

```r
bankdata <- read.csv2("data/bank.csv")
```

The dataset has 4521 observations; the number of attributes to consider is 16.
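These figures can be checked directly on the loaded data frame (a quick sketch, assuming `bankdata` from the chunk above):

```r
# Number of rows and columns: 4521 observations and
# 17 columns (16 attributes plus the target variable y)
dim(bankdata)

# Column names and types at a glance
str(bankdata)
```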

## 2. Exploratory Data Analysis

### Attribute: Age

The age of our customers is distributed as follows.
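The original figure is not reproduced here; a sketch of the plotting code (assuming `bankdata` from above; the bin width of 5 years is an arbitrary choice) could look like this. Since `balance` is heavily right-skewed and contains non-positive values, a shifted log transform is one option for the transformation asked about in question 2:

```r
library(ggplot2)

# Histogram of customer age
ggplot(bankdata, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  labs(x = "Age", y = "Count", title = "Distribution of customer age")

# balance is right-skewed; shift it so all values are positive,
# then apply log() to make the distribution more symmetric
ggplot(bankdata, aes(x = log(balance - min(balance) + 1))) +
  geom_histogram(bins = 30) +
  labs(x = "log(shifted balance)", y = "Count")
```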

### Target variable: term deposit (y)

```r
summary(bankdata$y)
```

```
##   no  yes
## 4000  521
```

## 3. Split Train/Testset

To reduce bias in our model, create a training set whose class distribution matches that of the full dataset (a stratified split). You can achieve this using `sample_frac()` followed by an `anti_join()` on a common `id` variable. Change the code and `group_by()` the target variable `y`. What is now the difference in the resulting distribution of the target variable?

```r
library(dplyr)

bankdataId <- bankdata %>% mutate(id = row_number())
train <- bankdataId %>% sample_frac(0.7)
test <- anti_join(bankdataId, train, by = "id")
```
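The stratified variant asked for above could be sketched as follows: grouping by `y` before `sample_frac()` samples 70% *within each class*, so training and test set both keep the class proportions of the full data (variable names here are illustrative):

```r
library(dplyr)

set.seed(42)  # make the random split reproducible
train_strat <- bankdataId %>%
  group_by(y) %>%      # sample 70% within each class of y
  sample_frac(0.7) %>%
  ungroup()
test_strat <- anti_join(bankdataId, train_strat, by = "id")

# Class proportions are now (almost) identical in both sets
prop.table(table(train_strat$y))
prop.table(table(test_strat$y))
```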

## 4. Model Selection

Let’s train a decision tree using `rpart` on our training dataset. What do you see? Hint: You can use `rpart.plot()` from the package rpart.plot to create a nicer-looking plot.
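A minimal sketch of fitting and plotting the tree (assuming the `train` set from step 3; the helper variable `id` is excluded from the formula so it is not used as a predictor):

```r
library(rpart)
library(rpart.plot)

# Fit a classification tree predicting y from all other
# variables except the row identifier
mytree <- rpart(y ~ . - id, data = train, method = "class")

# Nicer plot than the base plot()/text() combination
rpart.plot(mytree)
```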

## 5. Confusion Matrix

Let’s create a confusion matrix for our model and determine its accuracy. First, use `predict(newdata = test)` on our resulting tree with `type = "class"` to obtain class predictions. Next, use the function `table()` to cross-tabulate our predictions against the actual classes. The accuracy is simply `sum(diag(confmat)) / sum(confmat)`.
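Put together, the steps above read as follows (a sketch, assuming `mytree` from step 4 and the `test` set from step 3):

```r
# Predicted classes on the held-out test set
treepred <- predict(mytree, newdata = test, type = "class")

# Cross-tabulate predictions against the actual classes
confmat <- table(predicted = treepred, actual = test$y)
confmat

# Accuracy: share of correct predictions (the diagonal)
sum(diag(confmat)) / sum(confmat)
```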

## 6. Create Logistic Regression Model and Compare

Create a logistic model using `glm(..., family = "binomial")`. Convert the output probabilities from `predict` back to a factor and compare with the actual classes. According to the accuracy of the tree and the logistic regression model, which one is better?

Bonus: Use `tidy()` from the broom package to extract the estimates of the model. You can now plot the p-values using `geom_bar()`. Convert `term` to a factor through `mutate()`, using `.$term` (after ordering by `arrange(desc(p.value))`) as the factor levels.

```r
# Logistic regression on all predictors except the row id
myglm <- glm(y ~ . - id, data = train, family = "binomial")

# Predicted probabilities of the positive class ("yes")
mypred <- predict(myglm, newdata = test, type = "response")

# Threshold at 0.5 to recover class labels
myclass <- factor(ifelse(mypred > 0.5, "yes", "no"),
                  levels = c("no", "yes"))
accuracy <- mean(myclass == test$y)
```
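The bonus task could be sketched like this (assuming `myglm` from the chunk above; `est` is an illustrative name). `arrange(desc(p.value))` orders the terms, and `.$term` fixes that ordering as the factor levels so the bars appear sorted:

```r
library(broom)
library(dplyr)
library(ggplot2)

# Extract coefficient estimates and p-values as a tidy data frame
est <- tidy(myglm) %>%
  arrange(desc(p.value)) %>%
  mutate(term = factor(term, levels = .$term))

# Bar chart of p-values, from least to most significant
ggplot(est, aes(x = term, y = p.value)) +
  geom_bar(stat = "identity") +
  coord_flip()
```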