
Quick Install

The keras R package is a wrapper for the Python keras library. You therefore need either admin rights on your machine or a running Python installation including the tensorflow and keras libraries.

For Windows the easiest way is to use the Anaconda distribution with Python 3.x, which is available at https://www.anaconda.com/download/#windows.

If all prerequisites are in place you can install keras on your machine by running the following commands in the R console:

install.packages("keras")
library(keras)
install_keras()

If you want to take advantage of NVIDIA GPUs, see ?install_keras().
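
For example, a GPU-enabled installation can be requested roughly as follows (a sketch; it assumes that compatible NVIDIA drivers and CUDA libraries are already present, see ?install_keras() for the supported options):

install_keras(tensorflow = "gpu")  # request the GPU build of TensorFlow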

See also https://keras.rstudio.com/articles/getting_started.html

Getting Started: MNIST

Let’s get started with our first deep-learning model using Keras and the R programming language. In this section we will use the MNIST dataset to classify handwritten digits in gray-scale images (28x28 pixels) into 10 categories (0 through 9). The dataset consists of 60,000 training images and 10,000 test images, collected by the National Institute of Standards and Technology (NIST) in the 1980s.

Video: 3Blue1Brown series, S3 - E1

Pre-processing

Below we load the MNIST dataset from the keras package using dataset_mnist():

library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

str(x_train)
##  int [1:60000, 1:28, 1:28] 0 0 0 0 0 0 0 0 0 0 ...
range(x_train)
## [1]   0 255
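
To get a feel for the data, a single digit can be displayed with base R graphics before reshaping (a quick sketch; the apply()/t() calls only rotate the matrix so that image() shows the digit upright):

digit <- x_train[1, , ]                      # first training image as a 28x28 matrix
image(t(apply(digit, 2, rev)),               # rotate so the digit is shown upright
      col = gray.colors(255), axes = FALSE)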

The x data is a three-dimensional array (images, width, height) of gray-scale images holding integer values from 0 to 255. For training we reduce the rank of the array (tensor) by converting each 28x28 image matrix into a single vector of length 784. For easier subsequent computations we also convert the array from integer to floating point and rescale the values to the range 0 to 1 by dividing by the maximum value 255:

# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale between 0 and 1
x_train <- x_train / 255
x_test <- x_test / 255

str(x_train)
##  num [1:60000, 1:784] 0 0 0 0 0 0 0 0 0 0 ...
range(x_train)
## [1] 0 1

The y data holds the handwritten digit labels 0 to 9 as integers. For training with a categorical cross-entropy loss (see the compile step below) we one-hot encode the labels into binary class matrices using to_categorical():

y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
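
As a small illustration, the digits 0, 1 and 2 are turned into a 3x10 binary matrix in which each row contains a single 1 in the column of the corresponding class:

to_categorical(c(0, 1, 2), 10)   # 3 x 10 matrix with a single 1 per row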

Model Specification

Finding the right network structure is one of the most crucial parts of deep learning. Since no theory prescribes the right number of hidden units and layers, they need to be determined through trial and error.

In the keras package we define a model to specify the layers and units. The sequential model is the simplest model type and connects the input layer with the output layer through a linear stack of layers, see also the figure and code below. Note that we can use the %>% operator to build the network structure.

Deep Neural Network (DNN)

model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')

The input_shape argument to the first layer specifies the shape of the input data (a length 784 numeric vector representing a gray-scale image). The final layer outputs a length 10 numeric vector (probabilities for each digit) using a softmax activation function.
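
The softmax activation turns the raw scores of the last layer into probabilities that are non-negative and sum to 1. A minimal sketch of the idea in plain R (the softmax() helper below is only for illustration, Keras applies the activation inside the layer):

softmax <- function(x) exp(x) / sum(exp(x))   # normalise scores into probabilities
softmax(c(2, 1, 0.1))                         # the largest score gets the largest probability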

Use the summary() function to print the details of the model:

summary(model)
## Model: "sequential"
## ________________________________________________________________________________________________________________
## Layer (type)                                      Output Shape                                 Param #          
## ================================================================================================================
## dense (Dense)                                     (None, 256)                                  200960           
## ________________________________________________________________________________________________________________
## dropout (Dropout)                                 (None, 256)                                  0                
## ________________________________________________________________________________________________________________
## dense_1 (Dense)                                   (None, 128)                                  32896            
## ________________________________________________________________________________________________________________
## dropout_1 (Dropout)                               (None, 128)                                  0                
## ________________________________________________________________________________________________________________
## dense_2 (Dense)                                   (None, 10)                                   1290             
## ================================================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________________________________________

Note that the number of parameters in a dense layer is simply the number of input units multiplied by the number of output units plus the bias units (one for each output). For example, the first layer has \(784 \times 256 + 256 = 200960\) parameters or weights.
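
This can be verified directly in R for all three dense layers (a quick arithmetic check against the summary above):

784 * 256 + 256   # dense:   200960 parameters
256 * 128 + 128   # dense_1:  32896 parameters
128 * 10  + 10    # dense_2:   1290 parameters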

After defining the model we need to compile it with an appropriate loss function (categorical_crossentropy), optimizer (optimizer_rmsprop()), and metrics (accuracy):

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

Model Fitting

The fit() function is used for training and learns the free parameters so that the model does not overfit and predicts new handwritten digits fairly well. We set the number of epochs to 30, which means that each training observation is used 30 times in total for model fitting. The batch size of 128 means that errors are calculated for 128 samples before they are back-propagated and the parameters are updated. Low batch sizes mean a high number of iterations and higher computational costs, whereas high batch sizes increase the memory footprint. A batch size of 128 and 60,000 training examples means that 469 iterations are needed for one epoch; with 30 epochs in total this results in 14,070 iterations. Note that the validation_split of 0.2 below sets aside 12,000 images for validation, so only 48,000 images, i.e. 375 iterations per epoch, are actually used for training.
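
A quick check of this arithmetic in base R:

ceiling(60000 / 128)        # iterations per epoch using all 60,000 images
30 * ceiling(60000 / 128)   # total iterations over 30 epochs
ceiling(0.8 * 60000 / 128)  # iterations per epoch with the 20% validation split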

history <- model %>% fit(
  x_train, y_train, 
  epochs = 30, batch_size = 128, 
  validation_split = 0.2
)

The returned history object contains the loss and the accuracy for each epoch, on both the training and the validation data.
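
These metrics can be visualised directly with the plot() method that the keras package provides for history objects:

plot(history)   # loss and accuracy per epoch for training and validation data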

Evaluation of the model on unseen data can be done using evaluate():

model %>% evaluate(x_test, y_test)
## $loss
## [1] 0.1111055
## 
## $acc
## [1] 0.9812

Prediction can be done using the pipe operator and predict_classes() for direct class output instead of probabilities:

model %>% predict_classes(x_test) %>% head(100)
##   [1] 7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4 9 6 6 5 4 0 7 4 0 1 3 1 3 4 7 2 7 1 2 1 1 7 4 2 3 5 1 2 4 4 6 3 5
##  [54] 5 6 0 4 1 9 5 7 8 9 3 7 4 6 4 3 0 7 0 2 9 1 7 3 2 9 7 7 6 2 7 8 4 7 3 6 1 3 6 9 3 1 4 1 7 6 9
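
For a closer look at where the network still makes mistakes, the predicted classes can be cross-tabulated with the true test labels (a short sketch; mnist$test$y still holds the original integer labels, since y_test was one-hot encoded above):

pred <- model %>% predict_classes(x_test)
table(predicted = pred, actual = mnist$test$y)   # confusion matrix on the test set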

In this example we already see the power of deep neural networks in action. After reshaping the input data and specifying the network structure we can feed the data directly into the network without much preprocessing and without the need for time-consuming feature engineering. This direct learning from raw input data is one of the main reasons why DNNs have become so popular and, in many applications, more powerful than traditional techniques.