## IMDB

The IMDB dataset consists of 50,000 polarized reviews from the Internet Movie Database. The dataset is split into 25,000 training and 25,000 test samples. Each of these sets is split again into 50% positive and 50% negative reviews.

The IMDB dataset comes packaged with Keras and has already been preprocessed: The reviews (sequences of words/characters) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.

See also Tutorial: Text Classification.

Example also available in Deep Learning with R, page 59.

## Loading the Dataset

We load the IMDB dataset from Keras and keep the 10,000 most frequently occurring words in the training data. Next we use the multi-assignment operator `%<-%` to unpack the dataset into training and test data/labels as follows:

```
library(keras)
imdb <- dataset_imdb(num_words = 10000)
c(c(train_data, train_labels), c(test_data, test_labels)) %<-% imdb
```

The predictors `train_data` and `test_data` are lists of reviews; each review encodes a list of words as indices. The target variables `train_labels` and `test_labels` are lists of 0s and 1s, where 0 stands for *negative* and 1 stands for *positive*.

Below we see the first 20 words for the first review in the training data `train_data`:

`head(train_data[[1]], 20)`

```
## [1] 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941
## [15] 4 173 36 256 5 25
```

To decode the indices back into actual words we can use `dataset_imdb_word_index()` and map the indices accordingly. Note that the indices are offset by 3 because 0, 1 and 2 are reserved indices for “padding”, “start of sequence” and “unknown”.

```
word_index <- dataset_imdb_word_index()
# invert the mapping: a vector of words, named by their integer indices
reverse_word_index <- names(word_index)
names(reverse_word_index) <- word_index
decoded_review <- sapply(train_data[[1]], function(index) {
  # indices 0, 1 and 2 are reserved, so the word indices are offset by 3
  word <- if (index >= 3) reverse_word_index[[as.character(index - 3)]]
  if (!is.null(word)) word else "?"
})
head(decoded_review, 20)
```

```
## [1] "?" "this" "film" "was" "just"
## [6] "brilliant" "casting" "location" "scenery" "story"
## [11] "direction" "everyone's" "really" "suited" "the"
## [16] "part" "they" "played" "and" "you"
```

## Data preparation

The current data structure of the IMDB dataset consists of integer `list` objects, which cannot be fed directly into the network. There are two possible ways to convert the input data accordingly:

- Pad the lists so they all have the same length, put them into an integer tensor of shape `(samples, word_indices)`, and use a first layer capable of handling such integer tensors (an *embedding layer*).
- One-hot encode the lists and turn them into vectors of 0s and 1s. A sequence of `[2, 3, 7]` would be turned into a 10,000-dimensional vector containing only zeros except at the indices 2, 3 and 7.

We’ll go with the second approach and vectorize the data:

```
vectorize_sequences <- function(sequences, dimension = 10000) {
  # one row per review, one column per word index
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) {
    results[i, sequences[[i]]] <- 1
  }
  results
}
x_train <- vectorize_sequences(train_data)
x_test <- vectorize_sequences(test_data)
```
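To see what this vectorization does, here is a small sketch with made-up inputs and a reduced `dimension` (the function itself is the same as above):

```r
vectorize_sequences <- function(sequences, dimension = 10000) {
  # one row per review, one column per word index
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) {
    results[i, sequences[[i]]] <- 1
  }
  results
}

# two toy "reviews" with word indices, vectorized into 6 columns
toy <- list(c(2, 3), c(1, 5, 6))
vectorize_sequences(toy, dimension = 6)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    1    1    0    0    0
## [2,]    1    0    0    0    1    1
```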

We finally convert the labels to numeric:

```
y_train <- as.numeric(train_labels)
y_test <- as.numeric(test_labels)
```

## Building the Network

The input data is (10,000-dimensional) vectors and the labels are scalars standing for *negative* (0) or *positive* (1) reviews. A network that typically performs well on this kind of setup is a stack of fully connected (dense) layers with `relu` activations:

`layer_dense(units = 16, activation = "relu")`

The parameter `units` specifies the number of hidden units in the layer, which performs the operation

`output = relu(dot(W, input) + b)`
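To make this operation concrete, here is a plain-R sketch of a single dense layer with toy sizes and made-up weights (for illustration only, not the actual Keras implementation):

```r
relu <- function(x) pmax(x, 0)

# one dense layer: output = relu(W %*% input + b)
# toy sizes: 2 hidden units, 3 inputs; weights are made up
W <- matrix(c(1, -1,
              0, 2,
              -1, 1), nrow = 2)   # 2 x 3 weight matrix (filled column-wise)
b <- c(0.5, -0.5)
input <- c(1, 2, 3)
relu(W %*% input + b)
##      [,1]
## [1,]  0.0
## [2,]  5.5
```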

The key parameters we need to decide on now for the network are

- The number of hidden layers to use
- How many hidden units to choose for each layer

Later we will introduce some more concepts that make this decision more informed. For the time being, we’ll choose 2 hidden layers with 16 units each. You can see a sketch of the resulting architecture below.


## Model Definition

Next, we set up our model of choice (from input to output, setting the `units` and the input shape).

Note that without (non-linear) activation functions we would end up with a linear classifier. `relu` is the most popular choice, but we could also choose `prelu`, `elu`, etc. For the output layer (this being a binary classification problem) we choose the sigmoid function.

Since the model ends in a sigmoid and outputs a *probability* (between zero and one) that a review is *positive*, `binary_crossentropy` is a natural choice of loss function for this classification problem. As an optimizer we use `rmsprop`. Let’s compile our model:
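A definition and compile step matching the architecture above (the same code is part of the training run in the next section) looks as follows:

```r
library(keras)

# two hidden layers with 16 units each, sigmoid output
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = 10000) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

# binary crossentropy loss with the rmsprop optimizer
model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
```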

## Training

If we have enough observations available, it is best practice to set some data aside for validation. We create our validation set `x_val` and our new (partial) training set `partial_x_train` from the first 10,000 observations as follows:

```
val_indices <- 1:10000
x_val <- x_train[val_indices, , drop = FALSE]
partial_x_train <- x_train[-val_indices, , drop = FALSE]
y_val <- y_train[val_indices]
partial_y_train <- y_train[-val_indices]
```

Next, we compile and fit our model:

```
set.seed(42)
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = 10000) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 20,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
```

During model fitting, progress is shown in the **Viewer** pane.

The fitting history can finally be plotted as

`plot(history)`

## Exercise: Choose Model

Choose the optimal number of epochs from the previous section and retrain the model from scratch:

```
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = 10000) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
  optimizer = "rmsprop",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = ___,
  batch_size = 512,
  validation_data = list(x_val, y_val)
)
```

Evaluate the model using

```
results <- model %>% evaluate(x_test, y_test)
results
```

What is the accuracy of the resulting model?

## Predict

You can use the resulting model for prediction using the `predict` function as follows:

`model %>% predict(x_test[1:20, ]) %>% round(digits = 2)`

```
## [,1]
## [1,] 0.27
## [2,] 1.00
## [3,] 0.97
## [4,] 0.68
## [5,] 0.93
## [6,] 0.84
## [7,] 0.99
## [8,] 0.03
## [9,] 0.95
## [10,] 0.97
## [11,] 0.90
## [12,] 0.02
## [13,] 0.00
## [14,] 0.02
## [15,] 0.99
## [16,] 0.00
## [17,] 0.88
## [18,] 0.65
## [19,] 0.02
## [20,] 0.09
```

The output is the predicted *probability* that a review is positive. With a decision boundary of e.g. 0.5, all predictions below 0.5 are classified as *negative* and all predictions above 0.5 as *positive*.
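Turning such probabilities into class labels is a simple threshold operation; the probabilities below are made up for illustration:

```r
# hypothetical predicted probabilities, for illustration only
probs <- c(0.27, 1.00, 0.03, 0.68)

# classify with a 0.5 decision boundary
labels <- ifelse(probs > 0.5, "positive", "negative")
labels
## [1] "negative" "positive" "negative" "positive"
```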