How do DNNs Learn?
3BLUE1BROWN SERIES, S3 - E2
Deep Neural Network (DNN)
Each layer in a deep neural network (DNN) transforms its input by combining it with the layer's weights and passing the result through an activation function:
\[f(x) = relu((W \cdot x) + b)\] or as code
output = relu(dot(W, x) + b)
where \(W\) and \(b\) are the weight and bias tensors of the layer, which need to be trained through learning. The weight tensors are initialized with random values. From this starting point the weights are gradually adjusted based on a feedback signal; this adjustment is the actual training. It happens in a training loop:
1. Draw a batch of training samples `x` and corresponding targets `y`.
2. Run the network on `x` to obtain the predictions `y_pred` in the output layer, also called the forward pass.
3. Compute the loss of the network on the batch, a measure of the mismatch between `y_pred` and `y`.
4. Update all weights of the network so that the loss is slightly reduced on this batch.

Step 4 is computed using gradient descent.
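The layer transformation and a single forward pass can be sketched in base R (a minimal illustration; the sizes and random values are made up, and `dense_forward` is a hypothetical helper matching the formula above):

```r
# ReLU activation: element-wise max(0, z)
relu <- function(z) pmax(z, 0)

# A dense layer: output = relu(W . x + b)
dense_forward <- function(W, b, x) relu(W %*% x + b)

set.seed(1)
W <- matrix(rnorm(3 * 4), nrow = 3, ncol = 4)  # randomly initialized weights
b <- rep(0, 3)                                 # bias vector
x <- runif(4)                                  # one input sample

y_pred <- dense_forward(W, b, x)               # forward pass
print(y_pred)
```

In a real network this transformation is applied layer after layer, and the training loop above adjusts `W` and `b` of every layer.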
See also Deep Learning with R, page 41,42.
Stochastic Gradient Descent
Stochastic gradient descent on a 1D loss curve with one learnable parameter.
In addition to the loss computation in Step 3 of the previous section, we can take advantage of the fact that all functions used in the network are differentiable. It is therefore possible to compute the gradient, i.e. the direction of change of the loss, with respect to the weights.
If `W0` is the current value of the weights, the tensor `gradient(f)(W0)` is the gradient of the function `f(W) = loss_value` at `W0`. The weight parameters can thus be moved a little in the opposite direction of the gradient,

W1 = W0 - (step * gradient)

which reduces the loss on the batch a bit.
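This update rule can be illustrated on a toy 1D loss curve with a numerical gradient (a sketch; the quadratic loss, starting point, and step size are made up for illustration):

```r
# Toy 1D loss with its minimum at W = 3
loss <- function(W) (W - 3)^2

# Numerical gradient via central differences
gradient <- function(f, W, eps = 1e-6) (f(W + eps) - f(W - eps)) / (2 * eps)

W0   <- 0                                # starting point
step <- 0.1                              # learning rate
W1   <- W0 - step * gradient(loss, W0)   # move against the gradient

loss(W0)  # 9
loss(W1)  # smaller: the step reduced the loss
```

Repeating this step many times, once per batch, moves the weights toward a (local) minimum of the loss.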
See also Deep Learning with R, page 44.
In practice, (deep) neural networks consist of many tensor functions chained together, each having a known derivative. For example, the network `f` composed of the three tensor operations `a`, `b`, and `c` with weight matrices `W1`, `W2`, `W3`:

f(W1, W2, W3) = a(W1, b(W2, c(W3)))
Such a chain of functions can be differentiated using the chain rule:

(f(g(x)))' = f'(g(x)) * g'(x)
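A quick numerical check of the chain rule in R (the functions `f` and `g` here are made up for illustration):

```r
f <- function(u) u^2          # f'(u) = 2u
g <- function(x) sin(x)       # g'(x) = cos(x)

# Derivative of f(g(x)) by the chain rule: f'(g(x)) * g'(x)
chain_grad <- function(x) 2 * sin(x) * cos(x)

# Compare against a central finite difference of the composed function
num_grad <- function(x, eps = 1e-6) (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

x <- 0.7
chain_grad(x)
num_grad(x)   # agrees to several decimal places
```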
Applying the chain rule to the gradient computation of the network results in the Back-propagation algorithm.
Modern frameworks such as TensorFlow support symbolic differentiation, allowing them to compute a gradient function for the chain that maps network weights to gradient values. The backward pass is then reduced to a call to this gradient function.
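To see what such a backward pass computes, here is a hand-written back-propagation sketch for a tiny one-layer relu network with a squared-error loss (all sizes, values, and the step size are made up; real frameworks generate these gradient computations automatically):

```r
relu <- function(z) pmax(z, 0)

set.seed(42)
W <- matrix(rnorm(2 * 3, sd = 0.1), 2, 3)  # small random weights
b <- rep(0, 2)
x <- c(1, 2, 3)                            # one input sample
y <- c(1, 0)                               # its target

# Forward pass, keeping intermediates for the backward pass
z      <- as.vector(W %*% x + b)           # pre-activation
y_pred <- relu(z)
loss   <- sum((y_pred - y)^2)

# Backward pass: the chain rule applied step by step
d_y_pred <- 2 * (y_pred - y)               # d loss / d y_pred
d_z      <- d_y_pred * (z > 0)             # through relu: derivative is 1 where z > 0
d_W      <- d_z %*% t(x)                   # d loss / d W (outer product)
d_b      <- d_z                            # d loss / d b

# One gradient-descent step with these gradients reduces the loss
step  <- 0.05
W2    <- W - step * d_W
b2    <- b - step * d_b
loss2 <- sum((relu(as.vector(W2 %*% x + b2)) - y)^2)
loss2 < loss                               # TRUE
```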
3BLUE1BROWN SERIES, S3 - E3
See also below for an in-depth view of backpropagation:
3BLUE1BROWN SERIES, S3 - E4
See also Deep Learning with R, page 46.
Let’s revisit our first example to put all of the new concepts together:
library(keras)

mnist <- dataset_mnist()
train_images <- mnist$train$x
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255
train_labels <- to_categorical(mnist$train$y)

network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")

network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

network %>% fit(train_images, train_labels, epochs = 5, batch_size = 128)
See also Deep Learning with R, page 47.