Preprocessing with Recipes

Introduction


Most real-world data sets require us to pre-process predictors to get the most performance out of our models. Pre-processing ranges from technical steps such as missing value imputation, outlier detection and scaling to more advanced feature engineering techniques. The tidymodels framework offers the recipes package for exactly these tasks. Conceptually, recipes produces a design matrix similar to R's model.matrix(), which is then consumed directly as the input to the model.

A recipe is created using a formula interface analogous to the model fitting formula:

library(tidymodels)

rec <- recipe(mpg ~ ., data = mtcars)
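
The recipe itself is only a specification of what should happen; summary() on the recipe object lists which role (outcome or predictor) was assigned to each variable:

summary(rec)   # mpg gets the role "outcome", all remaining columns the role "predictor"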

Data Pre-Processing

Based on the recipe additional step functions can be added to specify the pre-processing steps.

The data pre-processing step typically involves the following tasks (a sketch mapping them to recipes step functions follows the list):

  • Missing value handling: Impute or remove missing values present in the data set.
  • Scaling: Center and scale numerical attributes to have zero mean and unit standard deviation.
  • Transformations: (Power-)transformations like Box-Cox to obtain an approximately normal distribution.
  • Feature filtering: Remove predictors with little information, e.g. near-zero variance.
  • Datatype Conversions: Convert predictors to the proper data type (e.g. character to factor).
  • Merge and Split: Merge multiple predictors into one, or split a single predictor into multiple.
  • One-Hot-Encoding: Encode nominal variables (factors) as binary indicators for each factor level: not required in R, since most modeling functions handle factors directly!
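
As a rough orientation (step names vary slightly between recipes versions), each of these tasks has one or more corresponding step functions. A sketch on mtcars, chaining a few of them together:

recipe(mpg ~ ., data = mtcars) %>% 
  step_medianimpute(all_numeric(), -all_outcomes()) %>%   # missing value handling
  step_BoxCox(disp, hp, wt) %>%                           # (power-)transformation of skewed predictors
  step_center(all_numeric(), -all_outcomes()) %>%         # scaling: center ...
  step_scale(all_numeric(), -all_outcomes()) %>%          # ... and scale
  step_nzv(all_predictors())                              # filter near-zero variance predictors

One-hot encoding and data type conversion would be handled by step_dummy() and step_string2factor(), respectively; missing value handling is covered in detail below.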

Missing value handling

R handles missing data out of the box and represents missing values as NA for every data type. Some models, such as decision trees, can handle missing data directly, while others (e.g. linear regression models) cannot. It is important to understand why a specific predictor contains missing values and how to deal with them.

There exist various methods for missing value imputation:

  • Mean/Median: Replace missing numerical values by the sample mean/median.
  • Majority Class: Replace missing categorical values by the majority class.
  • K-nearest Neighbor: Replace missing numerical and categorical values based on similar (nearby) observations, taking all attributes into account.

Before using any of these techniques, the relationships between the predictors should be examined: if predictors are strongly correlated, this structure can be used for a more targeted imputation. There is no general rule for dealing with missing values, since the decision depends on

  • Domain: Specific meaning of imputed values.
  • Data set size: The more data is available, the smaller the impact of dropping incomplete observations.
  • Variable importance: Models are particularly sensitive to changes in important variables.
  • Data type: Some techniques only apply to numeric attributes (e.g. median imputation), others only to categorical ones (majority class).

ATTENTION: It is crucial that the imputation is performed within the resampling process (e.g. cross-validation). Otherwise information from the test data leaks into the training process, which can lead to an over-estimation of model performance.
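
A minimal sketch of how this is done in tidymodels: the recipe becomes part of a workflow and is re-estimated within each resample (mtcars and the linear model are arbitrary placeholders for illustration):

rec <- recipe(mpg ~ ., data = mtcars) %>% 
  step_medianimpute(all_predictors())

wf <- workflow() %>% 
  add_recipe(rec) %>% 
  add_model(linear_reg() %>% set_engine("lm"))

folds <- vfold_cv(mtcars, v = 5)

# prep()/bake() happen inside each fold on the analysis set only, so no
# information from the assessment sets leaks into the imputation
fit_resamples(wf, resamples = folds)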

As a first step, the missing data can be visualized - e.g. using the vis_miss() function from the visdat package. For this purpose we slightly scramble our mtcars dataset and add some missing values randomly:

set.seed(42)
# randomly set roughly 25% of all cells to NA
idx <- matrix(sample(c(TRUE, FALSE, FALSE, FALSE),
                     nrow(mtcars) * ncol(mtcars), replace = TRUE),
              nrow = nrow(mtcars))
mtcars_na <- mtcars
mtcars_na[idx] <- NA
summary(mtcars_na)
##       mpg             cyl             disp             hp       
##  Min.   :14.30   Min.   :4.000   Min.   : 75.7   Min.   : 52.0  
##  1st Qu.:16.18   1st Qu.:4.000   1st Qu.:141.8   1st Qu.: 92.0  
##  Median :18.95   Median :6.000   Median :167.6   Median :110.0  
##  Mean   :21.16   Mean   :6.296   Mean   :229.2   Mean   :132.1  
##  3rd Qu.:25.12   3rd Qu.:8.000   3rd Qu.:314.5   3rd Qu.:177.5  
##  Max.   :33.90   Max.   :8.000   Max.   :460.0   Max.   :264.0  
##  NA's   :12      NA's   :5       NA's   :10      NA's   :9      
##       drat             wt             qsec             vs     
##  Min.   :2.760   Min.   :1.513   Min.   :15.84   Min.   :0.0  
##  1st Qu.:3.150   1st Qu.:2.429   1st Qu.:17.18   1st Qu.:0.0  
##  Median :3.690   Median :3.203   Median :17.98   Median :0.5  
##  Mean   :3.626   Mean   :3.128   Mean   :18.27   Mean   :0.5  
##  3rd Qu.:4.000   3rd Qu.:3.570   3rd Qu.:19.17   3rd Qu.:1.0  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0  
##  NA's   :13      NA's   :4       NA's   :9       NA's   :8    
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :3.000   Median :2.500  
##  Mean   :0.4091   Mean   :3.667   Mean   :2.792  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :6.000  
##  NA's   :10       NA's   :11      NA's   :8

We can easily see from the summary output that each variable contains NAs. Additionally, we can create a plot to inspect the pattern of missingness:

library(visdat)
vis_miss(mtcars_na)

See also https://bradleyboehmke.github.io/HOML/engineering.html for further information.

Mean/Median Imputation

The simplest method of imputing missing numerical attributes is to replace them with the sample mean or median. Each preprocessing task in recipes has a corresponding step function, in this case step_medianimpute() (renamed to step_impute_median() in newer versions of recipes). For the modified mtcars_na dataset the recipe looks like this:

rec <- recipe(mpg ~ ., data = mtcars_na) %>% 
  step_medianimpute(all_predictors())

rec %>% 
  prep() %>% 
  bake(new_data = mtcars_na)

Note that the function prep() estimates the required parameters, in this case the medians of the predictors. The function bake() then uses the estimated parameters and applies the defined preprocessing steps to the data passed via new_data.
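
To see which values prep() actually estimated, we can call tidy() on the prepared recipe and pick the step of interest (number = 1 refers to the median imputation step, the only step in this recipe):

rec %>% 
  prep() %>% 
  tidy(number = 1)   # one row per predictor with its estimated median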

Majority Class Imputation

The next imputation strategy, for nominal predictors, is to simply replace NA values with the majority class. For this purpose we use the credit scoring data from the modeldata package:

library(modeldata)
data(credit_data)

credit_data %>% 
  select(Home, Marital, Job) %>% 
  summary()
##       Home           Marital            Job      
##  ignore :  20   divorced :  38   fixed    :2805  
##  other  : 319   married  :3241   freelance:1024  
##  owner  :2107   separated: 130   others   : 171  
##  parents: 783   single   : 977   partime  : 452  
##  priv   : 246   widow    :  67   NA's     :   2  
##  rent   : 973   NA's     :   1                   
##  NA's   :   6

We see that the nominal variables Home, Marital and Job contain NA values. We replace them with the most common level using step_modeimpute() (called step_impute_mode() in newer versions of recipes):

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_modeimpute(Home, Marital, Job)

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  select(Home, Marital, Job) %>% 
  summary()

KNN Imputation

An imputation strategy that takes all attributes of the dataset into account is imputation via K-Nearest Neighbors (KNN). The algorithm measures the distance between each pair of observations in the dataset and uses the nearest neighbors to estimate the missing values. It works for numeric as well as nominal data types and is therefore a good baseline imputation method. Based on the existing credit_data dataset

credit_data %>% 
  summary()
##   Status       Seniority           Home           Time      
##  bad :1254   Min.   : 0.000   ignore :  20   Min.   : 6.00  
##  good:3200   1st Qu.: 2.000   other  : 319   1st Qu.:36.00  
##              Median : 5.000   owner  :2107   Median :48.00  
##              Mean   : 7.987   parents: 783   Mean   :46.44  
##              3rd Qu.:12.000   priv   : 246   3rd Qu.:60.00  
##              Max.   :48.000   rent   : 973   Max.   :72.00  
##                               NA's   :   6                  
##       Age             Marital     Records           Job      
##  Min.   :18.00   divorced :  38   no :3681   fixed    :2805  
##  1st Qu.:28.00   married  :3241   yes: 773   freelance:1024  
##  Median :36.00   separated: 130              others   : 171  
##  Mean   :37.08   single   : 977              partime  : 452  
##  3rd Qu.:45.00   widow    :  67              NA's     :   2  
##  Max.   :68.00   NA's     :   1                              
##                                                              
##     Expenses          Income          Assets            Debt      
##  Min.   : 35.00   Min.   :  6.0   Min.   :     0   Min.   :    0  
##  1st Qu.: 35.00   1st Qu.: 90.0   1st Qu.:     0   1st Qu.:    0  
##  Median : 51.00   Median :125.0   Median :  3000   Median :    0  
##  Mean   : 55.57   Mean   :141.7   Mean   :  5404   Mean   :  343  
##  3rd Qu.: 72.00   3rd Qu.:170.0   3rd Qu.:  6000   3rd Qu.:    0  
##  Max.   :180.00   Max.   :959.0   Max.   :300000   Max.   :30000  
##                   NA's   :381     NA's   :47       NA's   :18     
##      Amount         Price      
##  Min.   : 100   Min.   :  105  
##  1st Qu.: 700   1st Qu.: 1117  
##  Median :1000   Median : 1400  
##  Mean   :1039   Mean   : 1463  
##  3rd Qu.:1300   3rd Qu.: 1692  
##  Max.   :5000   Max.   :11140  
## 

we use the function step_knnimpute() (step_impute_knn() in newer versions of recipes) to perform the KNN imputation:

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_knnimpute(all_predictors(), neighbors = 3)

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  summary()
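
As a quick sanity check we can count the remaining missing values after baking (a small sketch using dplyr's across(), which is available since dplyr is loaded with tidymodels):

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  summarise(across(everything(), ~ sum(is.na(.x))))   # all counts should now be 0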

Normalization

Some models require, or at least strongly benefit from, predictors being normalized to zero mean and a standard deviation of one. These models include linear/logistic regression, neural networks and support vector machines. To center and scale a predictor \(p\), we subtract its mean (center) and divide by its standard deviation (scale):

\[\frac{p - \text{mean}(p)}{\text{sd}(p)}\]

Let's look at the Age variable in the credit_data dataset:

credit_data %>% 
  ggplot() + 
  geom_histogram(aes(Age))

We now transform our Age variable in the credit_data dataset to have zero mean and unit standard deviation using the following recipe:

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_center(Age) %>% 
  step_scale(Age)

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  ggplot() + 
  geom_histogram(aes(Age))

Instead of step_center() and step_scale() you can also use step_normalize():

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_normalize(Age)
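
A quick check confirms that the transformed Age indeed has (approximately) zero mean and unit standard deviation (Age contains no missing values in credit_data, otherwise na.rm = TRUE would be needed):

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  summarise(mean_age = mean(Age), sd_age = sd(Age))   # expect roughly 0 and 1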

Log

Many models run into numerical issues if predictors are highly skewed. As an example, let’s look at the Assets distribution from the credit_data dataset:

credit_data %>% 
  ggplot() + 
  geom_histogram(aes(Assets))

To reduce the heavy right tail of the distribution, and since there are no negative values, we can simply take the log using step_log():

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_log(Assets)

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  ggplot() + 
  geom_histogram(aes(Assets))

However, the Assets predictor contains lots of zeros, which leads to non-finite values since \(\log(0) = -\infty\); these observations are then dropped from the histogram. A common remedy is to add an offset \(\epsilon\) before taking the log. We use \(\epsilon = 1\) here, since \(\log(1) = 0\):

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_log(Assets, offset = 1)

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  ggplot() + 
  geom_histogram(aes(Assets))

Due to the large number of zeros the distribution still does not have a nice bell shape. We could either encode the variable differently or use some kind of mixture model. Alternatively, we could use a hyperbolic transform.

See also https://robjhyndman.com/hyndsight/transformations

Square Root

We could also use the square root, which is another special case of the Box-Cox transform (corresponding to \(\lambda = 1/2\)):

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_sqrt(Assets)

rec %>% 
  prep() %>% 
  bake(new_data = credit_data) %>% 
  ggplot() + 
  geom_histogram(aes(Assets))
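
To see why the square root handles zeros but compresses large values less strongly than the log, compare both transforms on a few illustrative values in plain R:

x <- c(0, 10, 1000, 300000)
sqrt(x)        # defined at zero, grows with the square root of x
log(x + 1)     # needs an offset at zero, compresses large values much more strongly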

Hyperbolic Transform

To address the remaining skewness after the log or square root transform, we can also use an inverse hyperbolic sine transform:

rec <- recipe(Status ~ ., data = credit_data) %>% 
  step_hyperbolic(Assets, func = "sin", inverse = TRUE)

rec_trans <- rec %>% 
  prep() %>% 
  bake(new_data = credit_data)
  
ggplot(rec_trans) + 
  geom_histogram(aes(Assets))
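
The inverse hyperbolic sine is defined as \(\operatorname{asinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)\), so it is defined at zero and behaves like a (shifted) log for large values, which makes it attractive for zero-inflated, right-skewed predictors. A quick check in plain R:

asinh(0)            # 0, so zeros need no offset
asinh(300000)       # approximately log(2 * 300000)
log(2 * 300000)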

See also https://robjhyndman.com/hyndsight/transformations

Chaining Transforms Together

Finally, all of the step transformations shown above can be chained together; they are executed in the specified order. For the credit_data dataset this looks as follows:

dat_prec <- recipe(Status ~ ., data = credit_data) %>% 
  step_knnimpute(all_predictors(), neighbors = 3) %>% 
  step_hyperbolic(Assets, func = "sin", inverse = TRUE) %>% 
  step_normalize(all_numeric()) %>% 
  prep() %>% 
  bake(new_data = credit_data)
  
ggplot(dat_prec) + 
  geom_histogram(aes(Assets))

dat_prec %>% summary()
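
In practice the recipe is usually not prepped and baked manually but added to a workflow together with a model, so that all preprocessing steps are estimated on the training data and applied automatically during fitting and prediction. A minimal sketch (the logistic regression spec is an arbitrary choice for illustration):

wf <- workflow() %>% 
  add_recipe(
    recipe(Status ~ ., data = credit_data) %>% 
      step_knnimpute(all_predictors(), neighbors = 3) %>% 
      step_hyperbolic(Assets, func = "sin", inverse = TRUE) %>% 
      step_normalize(all_numeric())
  ) %>% 
  add_model(logistic_reg() %>% set_engine("glm"))

wf_fit <- fit(wf, data = credit_data)
predict(wf_fit, new_data = credit_data)   # preprocessing is applied automatically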