General Introduction

In this general introduction we reiterate the program and the reasoning behind the chosen modules. Additionally, we ask learners about the current state of their learning journey.

This section should also answer the following questions:

  • What is Data Science?
  • Why Data Science?
  • Is Coding Required for Data Science?
  • R vs Python?
  • Why Use Version Control, Git?
  • Why Create Packages?

This section will be presented with slides and is mainly based on interactive classroom feedback.

Exercise Introduction

Turn a script into functions (and later a package)

A colleague of yours gives you an R script which analyzes a dataset of banking customers. The results have been presented to the board and should now be put into production. To ensure proper functionality, checking, documentation and dependency management, you choose to refactor the code and put it into a new package. Since your colleague is not highly proficient in R, you have also been asked to check that the script works correctly.

Requirements: This exercise requires a GitHub account. If you do not have one yet, please create one as described in the Git chapter.

Script

The script in question is shown below. You can check out both the script and the dataset from the following Git repository: https://github.com/quantargo/bmarketing

library(tidyverse)

#################Loading data into the environment#################
bmarketing <- read.csv2("data/bmarketing.csv")

#Lets look at dataset and generate initial understanding about the column types
str(bmarketing)
summary(bmarketing)

# A quick check:
# If newdata has same number of observation that implies no NA value present
# is.na(bmarketing)
newdata <- na.omit(bmarketing)
nrow(newdata)==nrow(bmarketing)

#A deep check for a particular column let say age
if(length(which(is.na(bmarketing$y)==TRUE)>0)){
  print("Missing Value found in the specified column")
} else{
  print("All okay: No Missing Value found in the specified column")
}

# Let's find the range of individual variables
summary(bmarketing)

## ------------------------------------------------------------------------
bmarketing %>% 
  ggplot() + geom_histogram(aes(age), bins = 30) + 
  geom_vline(aes(xintercept= median(age)), color = "red")

# TODO: do boxplots for each data
# boxplot(duration~y,data=bmarketing_sub,col="red")

#################Decision Tree#################
library(rpart)
library(rpart.plot)

dt_model<- rpart(y ~ ., data = bmarketing)
rpart.plot(dt_model)
summary(dt_model)

#################Testing Decision Tree #################
predictions <- predict(dt_model, bmarketing, type = "class")

## Compute the accuracy
mean(bmarketing$y == predictions)

# Lets look at the confusion matrix
table(predictions, bmarketing$y)

Data Description

Attribute Information:

Bank client data

  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
  3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
  5. default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)

Other attributes:

  1. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  2. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  3. previous: number of contacts performed before this campaign and for this client (numeric)
  4. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

Social and economic context attributes

  1. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  2. cons.price.idx: consumer price index - monthly indicator (numeric)
  3. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  4. euribor3m: euribor 3 month rate - daily indicator (numeric)
  5. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

  1. y - has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)

Source

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Forming Teams and First Steps

To start off, let’s do the following steps:

  1. Form teams and determine a team head (5 min)
  2. The team head forks the repository from GitHub at https://github.com/Quantargo/bmarketing (10 min)
  3. The team works solely on the team head's forked repository.
  4. The team checks the contents of the forked repo to see if the script works as expected and fixes any bugs as needed. (45 min)

Package Implementation

The script can now be refactored into the following parts/functions:

  1. Clean: Data checking and cleaning
  2. Transform: Data transformation
  3. Model: Create a new model based on data.
  4. Model-Plot: Plot model, e.g. decision tree.
  5. Model-Predict: Create model prediction
  6. Model-Performance: Calculate the performance of the model.

Determine which functions need to be written and distribute work among team members. Try to define functions so that a consistent workflow is ensured for package users.

Clean

As required by our data quality team we need to make sure that we

  • return an error if the target variable contains any missing values (NAs).
  • give clear warnings for all other variables which contain NAs.
  • remove any columns which contain more than 50% NAs (and report the removal as a warning).
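The three requirements above could be sketched as follows. This is only one possible design, not the reference solution: the function name clean_data() and its arguments target and max_na_share are assumptions made for illustration.

```r
# Hypothetical helper: check and clean a data frame before modelling.
# `target` is the name of the outcome column, `max_na_share` the share
# of NAs above which a column is dropped (both names are assumptions).
clean_data <- function(data, target = "y", max_na_share = 0.5) {
  stopifnot(is.data.frame(data), target %in% names(data))

  # Requirement 1: missing values in the target variable are an error.
  if (anyNA(data[[target]])) {
    stop("Target variable '", target, "' contains missing values.")
  }

  # Share of NAs per column.
  na_share <- colMeans(is.na(data))

  # Requirement 3: drop (and report) columns with too many NAs.
  drop_cols <- names(na_share)[na_share > max_na_share]
  if (length(drop_cols) > 0) {
    warning("Removing columns with more than ", 100 * max_na_share,
            "% NAs: ", paste(drop_cols, collapse = ", "))
    data <- data[, !(names(data) %in% drop_cols), drop = FALSE]
  }

  # Requirement 2: warn about remaining columns that still contain NAs.
  na_cols <- setdiff(names(na_share)[na_share > 0], drop_cols)
  if (length(na_cols) > 0) {
    warning("Columns containing NAs: ", paste(na_cols, collapse = ", "))
  }

  data
}

# Usage on a small made-up data frame (not the real dataset).
toy <- data.frame(
  age       = c(30, NA, 45, 52),
  mostly_na = c(NA, NA, NA, 1),
  y         = c("yes", "no", "no", "yes")
)
cleaned <- suppressWarnings(clean_data(toy))
names(cleaned)
```

Returning the cleaned data frame lets the function be chained with the later transformation and modelling steps.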

Transform

During the data transformation step we might need to

  • transform numeric variables using the log as required.
  • transform factors into numeric variables (and vice versa) as necessary.
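A minimal sketch of such a transformation function is shown below. The name transform_data() and its arguments are hypothetical, and using log1p() (that is, log(1 + x)) instead of a plain log() is an assumption made here so that zero counts do not produce -Inf.

```r
# Hypothetical helper: apply simple column-wise transformations.
# `log_cols`     : numeric columns to replace by log(1 + x)
# `factor_cols`  : columns to convert to factors
# `numeric_cols` : factor/character columns to convert to integer codes
transform_data <- function(data,
                           log_cols = character(),
                           factor_cols = character(),
                           numeric_cols = character()) {
  stopifnot(is.data.frame(data))

  # Fail early if a requested column does not exist.
  unknown <- setdiff(c(log_cols, factor_cols, numeric_cols), names(data))
  if (length(unknown) > 0) {
    stop("Unknown columns: ", paste(unknown, collapse = ", "))
  }

  for (col in log_cols) {
    if (!is.numeric(data[[col]]) || any(data[[col]] < 0, na.rm = TRUE)) {
      stop("Column '", col, "' must be numeric and non-negative.")
    }
    data[[col]] <- log1p(data[[col]])
  }

  for (col in factor_cols) {
    data[[col]] <- as.factor(data[[col]])
  }

  # Integer codes follow the (alphabetical) order of the factor levels.
  for (col in numeric_cols) {
    data[[col]] <- as.integer(as.factor(data[[col]]))
  }

  data
}

# Usage on a small made-up data frame (not the real dataset).
toy <- data.frame(campaign = c(0, 1), y = c("no", "yes"))
transformed <- transform_data(toy, log_cols = "campaign", numeric_cols = "y")
```

Note that converting a factor to integer codes imposes an ordering on the categories, which is only meaningful for some variables.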

Model

Model: Create a decision tree model to predict whether a customer signs a term deposit.

Model-Plot: We shall implement a function to present a nice representation of the model, e.g. for a decision tree we should plot the tree and respective nodes.

Model-Predict: The model prediction function shall return the actual predicted classes from the model.

Model-Performance: The model performance function should calculate the accuracy (or precision) of the model.
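The four model functions might be sketched as thin wrappers around rpart, the package the script already uses. All function names (fit_model(), plot_model(), predict_model(), model_performance()) and the positive argument are assumptions for illustration. The toy data is made up, and evaluating on the training data, as the original script does, is done here only to keep the example short.

```r
library(rpart)

# Model: fit a classification tree for the column named in `target`.
fit_model <- function(data, target = "y") {
  stopifnot(is.data.frame(data), target %in% names(data))
  data[[target]] <- as.factor(data[[target]])
  form <- reformulate(".", response = target)  # builds e.g. y ~ .
  rpart(form, data = data, method = "class")
}

# Model-Plot: requires the rpart.plot package to be installed.
plot_model <- function(model, ...) {
  rpart.plot::rpart.plot(model, ...)
}

# Model-Predict: return the predicted classes as a factor.
predict_model <- function(model, newdata) {
  predict(model, newdata = newdata, type = "class")
}

# Model-Performance: accuracy and precision for the `positive` class.
model_performance <- function(actual, predicted, positive = "yes") {
  stopifnot(length(actual) == length(predicted))
  actual <- as.character(actual)
  predicted <- as.character(predicted)
  predicted_pos <- predicted == positive
  precision <- if (any(predicted_pos)) {
    mean(actual[predicted_pos] == positive)
  } else {
    NA_real_  # undefined if nothing was predicted as positive
  }
  c(accuracy = mean(actual == predicted), precision = precision)
}

# Usage on made-up data (not the real dataset).
set.seed(1)
toy <- data.frame(
  age = c(rnorm(50, mean = 30, sd = 5), rnorm(50, mean = 60, sd = 5)),
  y   = rep(c("no", "yes"), each = 50)
)
model <- fit_model(toy)
pred  <- predict_model(model, toy)
model_performance(toy$y, pred)
```

Keeping the target column name as an argument is one way to make the functions usable for other datasets as well.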

Next Steps

  1. Create a package project in the forked repo. Hint: It is a good idea to also integrate the dataset into the package for easier testing. For example, you can store the raw CSV file under inst/extdata/ and locate it with system.file(), or save it as an R data object in the data/ folder so that it can be loaded with data().
  2. Distribute work among team members and work on the package collaboratively. Try to set 2 milestones for your implementation (e.g. until the coffee break and until the end of the day).
  3. Put the package together (merge, rebase) by the end of the day.
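Regarding the hint on bundling the dataset, a small loader could look like the sketch below. It assumes that the package is named bmarketing and that the CSV file is stored under inst/extdata/ in the package source (which is installed as extdata/); both the package name and the function names are hypothetical.

```r
# Hypothetical helper: locate the bundled CSV file of an installed package.
# system.file() returns "" if the package or the file cannot be found.
bmarketing_path <- function(package = "bmarketing") {
  path <- system.file("extdata", "bmarketing.csv", package = package)
  if (!nzchar(path)) {
    stop("bmarketing.csv not found in package '", package, "'.")
  }
  path
}

# Hypothetical helper: read the bundled dataset the same way the script does.
load_bmarketing <- function(package = "bmarketing") {
  read.csv2(bmarketing_path(package))
}
```

Once the package is installed, load_bmarketing() would return the dataset without relying on the current working directory.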

Expected Results

Make sure the final package:

  1. Works as expected.
  2. Has a consistent style and workflow.
  3. Is properly documented so that an outsider can use it. This also means that general package documentation (see http://r-pkgs.had.co.nz/man.html#man-packages) is available and that a README.md describes how the package can be used.
  4. Runs through all checks.
  5. Is general enough so it could also be used for different datasets.

Bonus: Implement some test cases to make sure that all of your functions are working as expected.
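As a starting point, test cases could be written with the testthat package, which is an assumption here since the exercise does not prescribe a testing framework. The accuracy() function below is a made-up stand-in for one of your own package functions.

```r
library(testthat)

# Stand-in for a package function under test (hypothetical).
accuracy <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(actual == predicted)
}

test_that("accuracy is the share of matching labels", {
  expect_equal(accuracy(c("yes", "no"), c("yes", "yes")), 0.5)
})

test_that("accuracy rejects inputs of different length", {
  expect_error(accuracy(c("yes", "no"), "yes"))
})
```

In a package, such tests would typically live in files under tests/testthat/ so that they are run as part of the package checks.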