## Quick Install

The keras R package is a wrapper for the Python Keras library. You therefore need either admin rights on your machine or a running Python installation that includes the `tensorflow` and `keras` libraries.

For Windows, the easiest way is to use the Anaconda distribution with Python 3.x, available at https://www.anaconda.com/download/#windows.

Once all prerequisites are in place, you can install keras on your machine by running the following commands in the R console:

```
install.packages("keras")
library(keras)
install_keras()
```

If you want to take advantage of NVIDIA GPUs, see `?install_keras`.

See also https://keras.rstudio.com/articles/getting_started.html

## Getting Started: MNIST

Let’s get started with our first deep learning model using Keras and the R programming language. In this section we will use the MNIST dataset to classify handwritten digits in gray-scale images (28x28 pixels) into 10 categories (0 through 9). The dataset consists of 60,000 training images and 10,000 test images collected by the National Institute of Standards and Technology in the 1980s.

## Pre-processing

Below we load the MNIST dataset from the `keras` package using `dataset_mnist()`:

```
library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
str(x_train)
```

`## int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...`

`range(x_train)`

`## [1] 0 255`

The x data is a 3-dimensional array (images, width, height) of gray-scale images holding values from 0 to 255. For training we reduce the rank of the array (tensor) by flattening each 28x28 image matrix into a single vector of length 784. For easier subsequent computations we also convert the array from integer to floating point and rescale the values to lie between 0 and 1 by dividing by the maximum value 255:

```
# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale between 0 and 1
x_train <- x_train / 255
x_test <- x_test / 255
str(x_train)
```

`## num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...`

`range(x_train)`

`## [1] 0 1`

The y data holds the handwritten digit labels 0 through 9. For classification these need to be one-hot encoded into binary class matrices using `to_categorical()`:

```
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
```
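To illustrate what `to_categorical()` produces: the label 3 becomes a length-10 binary vector with a 1 in the fourth position. A minimal sketch in base R (the `one_hot()` helper is hypothetical and only mimics what `to_categorical()` does for a single label):

```
# one-hot encode a single digit label into a length-10 binary vector,
# mimicking what to_categorical() does for each element of y_train
one_hot <- function(label, num_classes = 10) {
  v <- rep(0, num_classes)
  v[label + 1] <- 1  # labels are 0-based, R indices are 1-based
  v
}
one_hot(3)
# [1] 0 0 0 1 0 0 0 0 0 0
```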

## Model Specification

Finding the right network structure is the most crucial part of deep learning. Since no theoretical results prescribe the right number of hidden units and layers, these need to be determined through trial and error.

In the `keras` package we define a model to specify its layers and units. The sequential model is the simplest model type and connects the input layer with the output layer through a linear stack of layers; see the code below. Note that we can use the `%>%` operator to build the network structure.

```
model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')
```

The `input_shape` argument to the first layer specifies the shape of the input data (a length-784 numeric vector representing a gray-scale image). The final layer outputs a length-10 numeric vector (probabilities for each digit) using a softmax activation function.

Use the `summary()` function to print the details of the model:

`summary(model)`

```
## Model: "sequential"
## ________________________________________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================================================
## dense (Dense) (None, 256) 200960
## ________________________________________________________________________________________________________________
## dropout (Dropout) (None, 256) 0
## ________________________________________________________________________________________________________________
## dense_1 (Dense) (None, 128) 32896
## ________________________________________________________________________________________________________________
## dropout_1 (Dropout) (None, 128) 0
## ________________________________________________________________________________________________________________
## dense_2 (Dense) (None, 10) 1290
## ================================================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________________________________________
```

Note that the number of parameters in a dense layer is simply the number of input units multiplied by the number of output units, plus the bias units (one for each output). For example, the first layer has \(784 * 256 + 256 = 200960\) parameters or weights.
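This arithmetic can be checked directly against the `summary()` output above (a quick sketch; `dense_params()` is just a helper defined here, not part of keras):

```
# parameters of a dense layer: one weight per input-output pair, plus one bias per output
dense_params <- function(inputs, units) inputs * units + units
dense_params(784, 256)  # 200960
dense_params(256, 128)  # 32896
dense_params(128, 10)   # 1290
dense_params(784, 256) + dense_params(256, 128) + dense_params(128, 10)  # 235146
```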

After model definition we need to compile it with an appropriate loss function (`categorical_crossentropy`), optimizer (`optimizer_rmsprop()`), and metric (`accuracy`):

```
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)
```

## Model Fitting

The `fit()` function is used for training, i.e., learning the free parameters so that the model predicts new handwritten digits fairly well without *overfitting*. We set the number of epochs to 30, which means that each training observation is used 30 times in total during model fitting. The batch size of 128 means that after errors have been calculated for 128 samples, they are back-propagated and the parameters are updated. Low batch sizes mean a high number of iterations and higher computational cost, whereas high batch sizes increase the memory footprint. With a batch size of 128 and 60,000 training examples, 469 iterations are needed for one epoch; with 30 epochs in total this results in 14,070 iterations (with `validation_split = 0.2`, only 48,000 samples are actually used for fitting, giving 375 iterations per epoch).

```
history <- model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 128,
  validation_split = 0.2
)
```
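The iteration arithmetic above can be verified with a quick calculation (note that a partial final batch is rounded up to a full step):

```
# training iterations: samples divided by batch size, partial batches rounded up
n_train <- 60000; batch_size <- 128; epochs <- 30
ceiling(n_train / batch_size)            # 469 steps per epoch on the full set
ceiling(n_train / batch_size) * epochs   # 14070 steps over 30 epochs
# with validation_split = 0.2, only 80% of the samples are used for fitting:
ceiling(0.8 * n_train / batch_size)      # 375 steps per epoch
```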

The returned `history` object records the loss and accuracy metrics for the training and validation data at each epoch.

Evaluation of the model on unseen data can be done using `evaluate()`:

`model %>% evaluate(x_test, y_test)`

```
## $loss
## [1] 0.1111055
##
## $acc
## [1] 0.9812
```

Prediction can be done using the pipe operator and `predict_classes()` for direct class output instead of probabilities:

`model %>% predict_classes(x_test) %>% head(100)`

```
## [1] 7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4 9 6 6 5 4 0 7 4 0 1 3 1 3 4 7 2 7 1 2 1 1 7 4 2 3 5 1 2 4 4 6 3 5
## [54] 5 6 0 4 1 9 5 7 8 9 3 7 4 6 4 3 0 7 0 2 9 1 7 3 2 9 7 7 6 2 7 8 4 7 3 6 1 3 6 9 3 1 4 1 7 6 9
```
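Note that `predict_classes()` was removed from newer versions of the keras R package (TensorFlow 2.6 and later). The same result can be obtained from the predicted probabilities with a row-wise argmax; a sketch on a small dummy matrix (with a fitted model, the probabilities would come from `model %>% predict(x_test)`):

```
# recover 0-based class labels as the column index of the row maximum, minus one
probs <- rbind(c(0.1, 0.7, 0.2),   # dummy probabilities, standing in for predict() output
               c(0.8, 0.1, 0.1))
classes <- apply(probs, 1, which.max) - 1
classes
# [1] 1 0
```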

In this example we already see the power of deep neural networks in action. After reshaping the input data and specifying the network structure we can feed the input directly into the network, without much preprocessing and without time-consuming feature engineering. This direct learning from raw input data is one of the main reasons why DNNs have become so popular and often more powerful than traditional techniques.