## Introduction


Most real-world data sets require us to pre-process predictors to get the most performance out of our models. The term pre-processing ranges from more technical *data pre-processing* such as missing value imputation, outlier detection and scaling to more advanced *feature engineering* techniques. The **tidymodels** framework offers the package **recipes** to perform exactly these tasks. Generally, **recipes** creates a model matrix similar to R's `model.matrix()`, which is directly consumed as an input by the model.

A `recipe` can be created using the formula interface, analogous to the fitting formula:

`rec <- recipe(mpg ~ ., data = mtcars)`
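
Once defined, a recipe is estimated on the training data with `prep()` and applied to (new) data with `bake()`. A minimal sketch of this workflow, here using `mtcars` as both the training and the new data:

```
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars)
rec_prepped <- prep(rec, training = mtcars)   # estimate parameters from training data
baked <- bake(rec_prepped, new_data = mtcars) # apply the (trivial) recipe
head(baked)
```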

### Data Pre-Processing

Based on the recipe, additional `step_*()` functions can be added to specify the pre-processing steps.

The data pre-processing step typically involves:

- **Missing value handling**: Impute or remove missing values present in the data set.
- **Scaling**: Center and scale numerical attributes to have zero mean and unit standard deviation.
- **Transformations**: (Power-)Transformations like Box-Cox to obtain a normal distribution.
- **Feature filtering**: Remove predictors with little information, e.g. near-zero variance.
- **Datatype conversions**: Convert predictors to the proper data type (e.g. character to factor).
- **Merge and split**: Merge multiple predictors into a single one, split single predictors into multiple.
- **One-hot encoding**: Encode nominal variables (`factor`) as bits indicating each factor level: **not required in R!** (see the sketch below)
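
Although dummy variables are usually not required in R, they can be created explicitly if a model needs them. A minimal sketch using `step_dummy()` on a small toy data set (the `one_hot = TRUE` argument requests one indicator column per factor level instead of the default k-1 contrasts):

```
library(recipes)
library(dplyr)

# toy data with one nominal predictor
df <- data.frame(y = rnorm(6),
                 color = factor(c("red", "green", "blue", "red", "green", "blue")))

rec <- recipe(y ~ ., data = df) %>%
  step_dummy(color, one_hot = TRUE)  # one indicator column per level
rec %>% prep() %>% bake(new_data = df)
```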

## Missing value handling

R handles missing data out of the box and provides `NA` values for each data type. Some models, such as decision trees, can handle missing data directly, while others (e.g. linear regression models) cannot. It is important to understand why a specific predictor contains missing values and how to deal with them.

There exist various **methods** for missing value imputation:

- **Mean/Median**: Replace missing numerical values by the sample mean/median.
- **Majority Class**: Replace missing categorical values by the majority class (mode).
- **K-Nearest Neighbors**: Replace missing numerical and categorical values by close observations, taking all attributes into account.
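
For the first two strategies, **recipes** offers dedicated steps. A minimal sketch on a toy data set, assuming the current step names `step_impute_mean()` and `step_impute_mode()` (older releases called these `step_meanimpute()` and `step_modeimpute()`):

```
library(recipes)
library(dplyr)

df <- data.frame(x = c(1, 2, NA, 4),
                 grp = factor(c("a", NA, "a", "b")))

rec <- recipe(~ ., data = df) %>%
  step_impute_mean(x) %>%   # numeric: replace NA by the sample mean
  step_impute_mode(grp)     # nominal: replace NA by the majority class
rec %>% prep() %>% bake(new_data = df)
```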

Before using any of these techniques, the relationships between the predictors should be checked for correlations that could be exploited for a *focused* imputation. There is no general rule for how to deal with missing values, since these decisions depend on:

- **Domain**: Specific meaning of imputed values.
- **Data set size**: The more data, the less impact dropping observations has.
- **Variable importance**: Models are sensitive to changes in important variables.
- **Data type**: Some techniques can only be applied to numeric attributes (e.g. median imputation) and some only to categorical ones (majority class).

**ATTENTION**: It is crucial that the imputation is performed **within** the resampling process (e.g. cross-validation). Otherwise we use information from the *test* data set, which could lead to an over-estimation of our model performance. The sketch below shows one way to set this up.
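
One way to ensure this is to bundle the recipe with the model in a `workflow`, so the imputation is re-estimated on the analysis set of every fold. A minimal sketch, assuming the **tidymodels** meta-package (which includes **workflows**, **rsample** and **tune**):

```
library(tidymodels)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_impute_mean(all_numeric_predictors())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg())

# the recipe (incl. imputation) is re-fit within each resampling fold
folds <- vfold_cv(mtcars, v = 5)
res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)
```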

As a first step, the missing data can be visualized, e.g. using the `vis_miss()` function from the **visdat** package. For this purpose we slightly scramble our `mtcars` dataset and add some missing values randomly:

```
set.seed(42)
# randomly mark about 25% of all cells as missing
idx <- matrix(sample(c(TRUE, FALSE, FALSE, FALSE),
                     nrow(mtcars) * ncol(mtcars), replace = TRUE),
              nrow = nrow(mtcars))
mtcars_na <- mtcars
mtcars_na[idx] <- NA
summary(mtcars_na)
```

```
## mpg cyl disp hp
## Min. :14.30 Min. :4.000 Min. : 75.7 Min. : 52.0
## 1st Qu.:16.18 1st Qu.:4.000 1st Qu.:141.8 1st Qu.: 92.0
## Median :18.95 Median :6.000 Median :167.6 Median :110.0
## Mean :21.16 Mean :6.296 Mean :229.2 Mean :132.1
## 3rd Qu.:25.12 3rd Qu.:8.000 3rd Qu.:314.5 3rd Qu.:177.5
## Max. :33.90 Max. :8.000 Max. :460.0 Max. :264.0
## NA's :12 NA's :5 NA's :10 NA's :9
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :15.84 Min. :0.0
## 1st Qu.:3.150 1st Qu.:2.429 1st Qu.:17.18 1st Qu.:0.0
## Median :3.690 Median :3.203 Median :17.98 Median :0.5
## Mean :3.626 Mean :3.128 Mean :18.27 Mean :0.5
## 3rd Qu.:4.000 3rd Qu.:3.570 3rd Qu.:19.17 3rd Qu.:1.0
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0
## NA's :13 NA's :4 NA's :9 NA's :8
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :3.000 Median :2.500
## Mean :0.4091 Mean :3.667 Mean :2.792
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :6.000
## NA's :10 NA's :11 NA's :8
```

We can easily see from the summary output that each variable contains `NA`s. Additionally, we can create a plot to inspect the structure of the data missingness:

```
library(visdat)
vis_miss(mtcars_na)
```

See also https://bradleyboehmke.github.io/HOML/engineering.html for further information.

## Majority Class Imputation

The next imputation strategy for nominal predictors is to simply replace `NA` values with the majority class. For this purpose we use a credit scoring data set, which is loaded as follows:

```
library(modeldata)
library(dplyr)
data(credit_data)
credit_data %>%
  select(Home, Marital, Job) %>%
  summary()
```

```
## Home Marital Job
## ignore : 20 divorced : 38 fixed :2805
## other : 319 married :3241 freelance:1024
## owner :2107 separated: 130 others : 171
## parents: 783 single : 977 partime : 452
## priv : 246 widow : 67 NA's : 2
## rent : 973 NA's : 1
## NA's : 6
```

We see that the nominal variables `Home`, `Marital` and `Job` contain `NA` values, which we would like to replace with the most common one.

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_impute_mode(Home, Marital, Job)  # formerly step_modeimpute()
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  select(Home, Marital, Job) %>%
  summary()
```

## KNN Imputation

An imputation strategy which takes all attributes of the dataset into account is imputation via K-Nearest Neighbors (KNN). The KNN algorithm measures the distance of each observation in the dataset against all others and uses the nearest neighbors to estimate the missing values. It works for numeric as well as for nominal data types and can therefore be used as a good baseline imputation algorithm. Based on the existing `credit_data` dataset

```
credit_data %>%
  summary()
```

```
## Status Seniority Home Time
## bad :1254 Min. : 0.000 ignore : 20 Min. : 6.00
## good:3200 1st Qu.: 2.000 other : 319 1st Qu.:36.00
## Median : 5.000 owner :2107 Median :48.00
## Mean : 7.987 parents: 783 Mean :46.44
## 3rd Qu.:12.000 priv : 246 3rd Qu.:60.00
## Max. :48.000 rent : 973 Max. :72.00
## NA's : 6
## Age Marital Records Job
## Min. :18.00 divorced : 38 no :3681 fixed :2805
## 1st Qu.:28.00 married :3241 yes: 773 freelance:1024
## Median :36.00 separated: 130 others : 171
## Mean :37.08 single : 977 partime : 452
## 3rd Qu.:45.00 widow : 67 NA's : 2
## Max. :68.00 NA's : 1
##
## Expenses Income Assets Debt
## Min. : 35.00 Min. : 6.0 Min. : 0 Min. : 0
## 1st Qu.: 35.00 1st Qu.: 90.0 1st Qu.: 0 1st Qu.: 0
## Median : 51.00 Median :125.0 Median : 3000 Median : 0
## Mean : 55.57 Mean :141.7 Mean : 5404 Mean : 343
## 3rd Qu.: 72.00 3rd Qu.:170.0 3rd Qu.: 6000 3rd Qu.: 0
## Max. :180.00 Max. :959.0 Max. :300000 Max. :30000
## NA's :381 NA's :47 NA's :18
## Amount Price
## Min. : 100 Min. : 105
## 1st Qu.: 700 1st Qu.: 1117
## Median :1000 Median : 1400
## Mean :1039 Mean : 1463
## 3rd Qu.:1300 3rd Qu.: 1692
## Max. :5000 Max. :11140
##
```

we use the function `step_impute_knn()` (formerly `step_knnimpute()`) to perform the KNN imputation:

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_impute_knn(all_predictors(), neighbors = 3)  # formerly step_knnimpute()
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  summary()
```
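
A quick sanity check counts the remaining `NA` values per column after baking; the imputed data should contain none:

```
library(dplyr)

rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```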

## Normalization

Some models require predictors to be normalized, i.e. to have zero mean and a standard deviation of one. These models include linear/logistic regression, neural networks and support vector machines. To center and scale a predictor \(p\), we subtract its mean (center) and divide by its standard deviation (scale):

\[\frac{p - \text{mean}(p)}{\text{sd}(p)}\]

Let's look at the `Age` variable in the `credit_data` dataset:

```
library(ggplot2)
credit_data %>%
  ggplot() +
  geom_histogram(aes(Age))
```

We now transform the `Age` variable in the `credit_data` dataset to have zero mean and unit standard deviation using the following recipe:

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_center(Age) %>%  # subtract the mean
  step_scale(Age)       # divide by the standard deviation
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  ggplot() +
  geom_histogram(aes(Age))
```

Instead of `step_center()` and `step_scale()` you can also use `step_normalize()`, which combines both steps:

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_normalize(Age)
```
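
A quick check confirms that the baked `Age` column now has (approximately) zero mean and unit standard deviation:

```
baked <- rec %>%
  prep() %>%
  bake(new_data = credit_data)
mean(baked$Age)  # ~ 0
sd(baked$Age)    # ~ 1
```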

## Log

Many models run into numerical issues if predictors are highly skewed. As an example, let's look at the `Assets` distribution in the `credit_data` dataset:

```
credit_data %>%
  ggplot() +
  geom_histogram(aes(Assets))
```

To remove the fat right tail of the distribution, and since we have no negative values, we can simply take the `log` of the variable using `step_log()`:

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_log(Assets)
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  ggplot() +
  geom_histogram(aes(Assets))
```

However, we can see that the `Assets` predictor contains lots of zeros, which leads to non-finite values since \(\log(0) = -\infty\). An alternative is to add an offset \(\epsilon\) before taking the log. We use \(\epsilon = 1\) here, since \(\log(1) = 0\):

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_log(Assets, offset = 1)
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  ggplot() +
  geom_histogram(aes(Assets))
```

Due to the large number of zeros, the distribution still does not have a nice bell shape. We could either encode the variable differently or use some kind of mixture model. Alternatively, we could use a hyperbolic transformation, sketched below.
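
One such hyperbolic option is the inverse hyperbolic sine, which is defined at zero and behaves like the log for large values. A minimal sketch, assuming `step_mutate()` with base R's `asinh()`:

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_mutate(Assets = asinh(Assets))  # defined at 0, log-like for large values
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  ggplot() +
  geom_histogram(aes(Assets))
```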

See also https://robjhyndman.com/hyndsight/transformations

## Square Root

We could also use the square root, which corresponds to another special case of the Box-Cox transform (\(\lambda = 1/2\), with the log corresponding to \(\lambda = 0\)):

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_sqrt(Assets)
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  ggplot() +
  geom_histogram(aes(Assets))
```
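
Instead of picking the transformation manually, the Box-Cox parameter \(\lambda\) can also be estimated from the data during `prep()`. A minimal sketch using `step_YeoJohnson()`, a Box-Cox variant that, unlike `step_BoxCox()`, also handles the zeros in `Assets`:

```
rec <- recipe(Status ~ ., data = credit_data) %>%
  step_YeoJohnson(Assets)  # lambda is estimated during prep()
rec %>%
  prep() %>%
  bake(new_data = credit_data) %>%
  ggplot() +
  geom_histogram(aes(Assets))
```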