
How Do DNNs Learn?



Gradient Descent

Deep Neural Network (DNN)


Each layer in a deep neural network (DNN) transforms its input by combining it with the layer's weights and passing the result through an activation function:

\[f(x) = relu((W \cdot x) + b)\]

or, as code:

output = relu(dot(W, x) + b)

where \(W\) and \(b\) are the weight and bias tensors of the layer, which are learned during training. The weight tensors are initialized with random values. From this starting point, the weights are gradually adjusted based on a feedback signal; this adjustment is the actual training and happens in a training loop:

  1. Draw a batch of training samples x and targets y.
  2. Run the network on x to obtain predictions y_pred in the output layer, also called forward pass.
  3. Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
  4. Update all weights of the network so that the loss is slightly reduced on this batch.

Step 4 is computed using gradient descent.
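
The four steps above can be sketched as an R loop. All function names here are illustrative placeholders, not real Keras calls; Keras hides this loop behind fit():

    # Hypothetical sketch of the training loop; draw_batch(), forward_pass(),
    # compute_loss() and update_weights() are placeholder functions for illustration.
    for (step in 1:n_steps) {
      batch   <- draw_batch(train_x, train_y, batch_size = 128)  # step 1
      y_pred  <- forward_pass(network, batch$x)                  # step 2: forward pass
      loss    <- compute_loss(y_pred, batch$y)                   # step 3: mismatch of y_pred and y
      network <- update_weights(network, loss)                   # step 4: gradient descent
    }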

See also Deep Learning with R, page 41,42.

Stochastic Gradient Descent

Stochastic gradient descent on a 1D loss curve with one learnable parameter.

In addition to computing the loss in step 3 of the previous section, we can take advantage of the fact that all functions used in the network are differentiable. It is therefore possible to compute the gradient, i.e., the direction of change of the loss with respect to each weight.

If the current value of W is W0, the tensor gradient(f)(W0) is the gradient of the function f(W) = loss_value at W0. The weights can thus be moved a little in the opposite direction of the gradient:

W1 = W0 - (step * gradient)

which reduces the loss on the batch a bit.
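
The update rule can be demonstrated on a toy 1D problem in plain R. This is an illustrative sketch, not what Keras does internally:

    # Minimal 1D gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
    # The minimum is at w = 3.
    w    <- 0        # starting point (would be random in a real network)
    step <- 0.1      # learning rate
    for (i in 1:100) {
      gradient <- 2 * (w - 3)    # derivative of the loss at the current w
      w <- w - step * gradient   # W1 = W0 - (step * gradient)
    }
    w                            # close to 3, the minimizer

Choosing step is delicate: too small and the loop needs many iterations; too large and the updates overshoot the minimum.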

See also Deep Learning with R, page 44.


In practice, (deep) neural networks consist of many tensor functions chained together, each having a known derivative. For example, consider a network f composed of three tensor operations a, b, and c with weight matrices W1, W2, and W3:

f(W1, W2, W3) = a(W1, b(W2, c(W3)))

Such a chain of functions can be differentiated using the chain rule:

\[\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)\]
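
As a quick sanity check, the chain rule can be verified numerically in base R (a toy example, not part of the network code):

    # Chain rule for f(g(x)) with f(u) = u^2 and g(x) = sin(x):
    # compare the analytic derivative f'(g(x)) * g'(x) with a
    # central finite-difference estimate of d/dx f(g(x)).
    f  <- function(u) u^2
    fp <- function(u) 2 * u       # f'
    g  <- function(x) sin(x)
    gp <- function(x) cos(x)      # g'

    x <- 0.7
    analytic <- fp(g(x)) * gp(x)
    h <- 1e-6
    numeric  <- (f(g(x + h)) - f(g(x - h))) / (2 * h)
    abs(analytic - numeric) < 1e-6   # TRUE: both agree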

Applying the chain rule to the gradient computation of the network yields the backpropagation algorithm.

Modern frameworks such as TensorFlow support symbolic differentiation, allowing them to derive a gradient function for the chain that maps network weights to gradient values. The backward pass is thus reduced to a call to this gradient function.
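
With the tensorflow R package, such a gradient computation can be sketched as follows (assumes TensorFlow is installed; a minimal illustration, not the full backward pass of a network):

    # Record operations on a tape, then ask TensorFlow for the gradient.
    library(tensorflow)
    w <- tf$Variable(3)
    with(tf$GradientTape() %as% tape, {
      loss <- w * w            # loss_value = w^2
    })
    tape$gradient(loss, w)     # gradient of w^2 at w = 3, i.e. 2 * w = 6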






See also Deep Learning with R, page 46.

Example: Revisited

Let’s revisit our first example to put all of the new concepts together:

Input Data

library(keras)

mnist <- dataset_mnist()

train_images <- mnist$train$x
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255

train_labels <- to_categorical(mnist$train$y)

Network Definition

network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
  layer_dense(units = 10, activation = "softmax")

Network Compilation

network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

Training Loop

network %>% fit(train_images, train_labels, epochs = 5, batch_size = 128)

See also Deep Learning with R, page 47.