
IMDB

The IMDB dataset consists of 50,000 polarized reviews from the Internet Movie Database. The dataset is split into 25,000 training and 25,000 test samples. Each of these sets is split again into 50% positive and 50% negative reviews.

The IMDB dataset comes packaged with Keras and has already been preprocessed: The reviews (sequences of words/characters) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

See also Tutorial: Text Classification.

Example also available in Deep Learning with R, page 59.

Loading the Dataset

We load the IMDB dataset from Keras and keep only the 10,000 most frequent words occurring in the training data. Next we use the multi-assignment operator %<-% to unpack the dataset into training and test data/labels as follows:

library(keras)
imdb <- dataset_imdb(num_words = 10000)
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% imdb

The predictors train_data and test_data are lists of reviews; each review encodes a list of words as indices. The target variables train_labels and test_labels are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.

Below we see the first 20 words for the first review in the training data train_data:

head(train_data[[1]], 20)
##  [1]    1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
## [15]    4  173   36  256    5   25
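
We can also take a quick look at the target variables, e.g. the label of the first training review and the overall class balance:

# label of the first training review (0 = negative, 1 = positive)
train_labels[[1]]

# class balance: 12,500 negative and 12,500 positive reviews
table(train_labels)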

To decode the indices back into actual words we can use dataset_imdb_word_index() and map the indices accordingly. Note that the indices are offset by 3, because 0, 1 and 2 are reserved indices for “padding”, “start of sequence” and “unknown”.

word_index <- dataset_imdb_word_index()
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index
decoded_review <- sapply(train_data[[1]], function(index) {
  word <- if (index >= 3) reverse_word_index[[as.character(index-3)]]
  if (!is.null(word)) word else "?"
})
head(decoded_review, 20)
##  [1] "?"          "this"       "film"       "was"        "just"      
##  [6] "brilliant"  "casting"    "location"   "scenery"    "story"     
## [11] "direction"  "everyone's" "really"     "suited"     "the"       
## [16] "part"       "they"       "played"     "and"        "you"

Data Preparation

The IMDB data currently consists of lists of integers, which cannot be fed directly into the network. There are two possible ways to convert the input data accordingly:

  1. Pad the lists so that they all have the same length, put them into an integer tensor of shape (samples, word_indices) and use as the first layer of the network a layer capable of handling such integer tensors (an embedding layer; see the sketch after this list).
  2. One-hot encode your lists and turn them into vectors of 0s and 1s. A sequence such as [2, 3, 7] would be turned into a 10,000-dimensional vector that is all zeros except at the indices 2, 3 and 7, which are ones.
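
A minimal sketch of approach 1 could look like this (we do not pursue it further in this case study; maxlen = 256 and output_dim = 16 are arbitrary choices made for illustration):

maxlen <- 256

# pad/truncate all reviews to the same length ...
x_train_padded <- pad_sequences(train_data, maxlen = maxlen)

# ... and let an embedding layer handle the resulting integer tensor
model_embedding <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 16, input_length = maxlen) %>%
  layer_flatten() %>%
  layer_dense(units = 1, activation = "sigmoid")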

We’ll go with approach 2 and vectorize the data:

vectorize_sequences <- function(sequences, dimension = 10000) {
  # all-zero matrix of shape (number of reviews, dimension)
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) {
    # set the entries at the word indices of review i to 1
    results[i, sequences[[i]]] <- 1
  }
  results
}

x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)

We finally convert the labels to numeric:

y_train <- as.numeric(train_labels)
y_test <- as.numeric(test_labels)
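
Before building the network, a quick check of the resulting shapes can be useful:

dim(x_train)      # 25,000 reviews, each a 10,000-dimensional 0/1 vector
length(y_train)   # 25,000 labels (0 = negative, 1 = positive)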

Building the Network

The input data consists of (10,000-dimensional) vectors and the labels are scalars standing for negative (0) or positive (1) reviews. A network that typically performs well on this kind of setup is a stack of fully connected (dense) layers with relu activations:

layer_dense(units=16, activation="relu")

The parameter units specifies the number of hidden units in the layer, which performs the operation

output = relu(dot(W, input) + b)
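
To make this concrete, here is a toy version of that operation in plain R, with made-up values for W, b and the input (2 hidden units, 3 input features):

relu <- function(x) pmax(x, 0)

# made-up weights and bias for a layer with 2 hidden units and 3 inputs
W <- matrix(c( 0.5, -1.0,  0.3,
              -0.2,  0.8, -0.6), nrow = 2, byrow = TRUE)
b <- c(0.1, -0.1)
input <- c(3, 1, 2)

relu(W %*% input + b)   # the first unit is active (1.2), the second is clipped to 0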

The key parameters we need to decide on for the network are

  • The number of hidden layers to use
  • How many hidden units to choose for each layer

Later we will introduce some concepts that make this decision more informed. For the time being, we’ll choose 2 hidden layers with 16 units each; the resulting architecture (two dense hidden layers followed by a single sigmoid output unit) is defined in the next section.

Model Definition

Next, we set up our model of choice (from input to output, setting the units and the input shape).

Note that without (non-linear) activation functions we would end up with a linear classifier. relu is the most popular choice, but we could also use prelu, elu, etc. For the output layer we choose the sigmoid function, since this is a (binary) classification problem.

The network should output a probability (between zero and one) that a review is positive, so for this classification problem we choose binary_crossentropy as the loss function; it is usually the best-suited choice when a model outputs probabilities, although not the only viable one. As optimizer we use rmsprop. Let’s define and compile our model:
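
The corresponding definition and compilation (repeated together with the fit step in the Training section below) look as follows:

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = 10000) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)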

Training

If we have enough observations available it is best practice to set some data aside for validation. We create our validation set x_val from the first 10,000 observations and use the remaining observations as our new (partial) training set partial_x_train:

val_indices <- 1:10000
x_val <- x_train[val_indices,,drop=FALSE]
partial_x_train <- x_train[-val_indices,,drop=FALSE]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]

Next, we compile and fit our model:

set.seed(42)
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = 10000) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size=512,
  validation_data = list(x_val, y_val)
)

During model fitting, progress is shown in the RStudio Viewer pane.

The fitting history can finally be plotted as follows:

plot(history)

Exercise: Choose Model

Choose the optimal number of epochs from the previous section and retrain the model from scratch:

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = 10000) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = ___,
  batch_size=512,
  validation_data = list(x_val, y_val)
)

Evaluate the model using

results <- model %>% evaluate(x_test, y_test)
results

What is the accuracy of the resulting model?

Predict

You can use the resulting model for prediction using the predict function as follows:

model %>% predict(x_test[1:20, ]) %>% round(digits=2)
##       [,1]
##  [1,] 0.27
##  [2,] 1.00
##  [3,] 0.97
##  [4,] 0.68
##  [5,] 0.93
##  [6,] 0.84
##  [7,] 0.99
##  [8,] 0.03
##  [9,] 0.95
## [10,] 0.97
## [11,] 0.90
## [12,] 0.02
## [13,] 0.00
## [14,] 0.02
## [15,] 0.99
## [16,] 0.00
## [17,] 0.88
## [18,] 0.65
## [19,] 0.02
## [20,] 0.09

The output is the predicted probability that a review is positive. Given a decision threshold (e.g. 0.5), all predictions below 0.5 are classified as negative and all predictions at or above 0.5 as positive.
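
For example, applying a 0.5 threshold to the predicted probabilities could look like this (the threshold itself is a modelling choice, not fixed by the network):

probs <- model %>% predict(x_test[1:20, ])

# classify a review as positive when its predicted probability is at least 0.5
pred_class <- ifelse(probs >= 0.5, "positive", "negative")
pred_class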