## Gradient Descent

Each layer in a deep neural network (DNN) transforms its input by combining it with the layer's weights and passing the result through an activation function:

\[f(x) = relu((W \cdot x) + b)\] or as code

`output = relu(dot(W, x) + b)`

where \(W\) and \(b\) are the weight and bias tensors of the layer, which need to be *trained* through *learning*. The weight tensors are initialized with random values. From this starting point the weights are gradually adjusted based on a feedback signal; this adjustment is the actual *training*. It happens in a training loop:

1. Draw a batch of training samples `x` and targets `y`.
2. Run the network on `x` to obtain predictions `y_pred` in the output layer, also called the *forward pass*.
3. Compute the *loss* of the network on the batch, a measure of the mismatch between `y_pred` and `y`.
4. Update all weights of the network so that the loss is slightly reduced on this batch.

Step 4 is computed using **gradient descent**.
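The loop above can be sketched in plain R for a toy one-parameter model (the model, data, and learning rate below are illustrative and not part of the Keras example):

```
# Toy model: y_pred = w * x, with mean squared error as the loss.
set.seed(42)
x <- runif(100)   # training samples
y <- 3 * x        # targets (the true weight is 3)

w  <- rnorm(1)    # weight initialized with a random value
lr <- 0.1         # step size of each update

for (i in 1:500) {
  y_pred <- w * x                     # step 2: forward pass
  loss   <- mean((y_pred - y)^2)      # step 3: loss on the batch
  grad   <- mean(2 * (y_pred - y) * x)  # gradient of the loss w.r.t. w
  w      <- w - lr * grad             # step 4: slightly reduce the loss
}
w  # close to the true value 3
```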

See also Deep Learning with R, page 41,42.

## Stochastic Gradient Descent

In addition to the loss computation in Step 3 from the previous section, we can take advantage of the fact that all functions used in the network are *differentiable*. It is therefore possible to compute the *gradient* of the loss, i.e. the direction of change, with respect to the weights.

If the current value of `W` is `W0`, the tensor `gradient(f)(W0)` is the gradient of the function `f(W) = loss_value` in `W0`. The weight parameters `W0` can thus be moved a little in the opposite direction of the gradient,

`W1 = W0 - (step * gradient)`

reducing the loss on the batch a bit.
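A minimal numeric sketch of this update step (the function `f` and the helper `gradient_at` are made up for illustration; a framework would compute exact gradients instead of the finite-difference approximation used here):

```
# f plays the role of the loss as a function of a single weight W.
f <- function(W) (W - 2)^2 + 1

# Finite-difference approximation of gradient(f) at W0.
gradient_at <- function(f, W0, eps = 1e-6) {
  (f(W0 + eps) - f(W0 - eps)) / (2 * eps)
}

W0   <- 5
step <- 0.1
g    <- gradient_at(f, W0)   # slope of f at W0 (here about 6)
W1   <- W0 - step * g        # move a little against the gradient
f(W1) < f(W0)                # TRUE: the loss decreased
```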

See also Deep Learning with R, page 44.

## Back-propagation

In practice, (deep) neural networks consist of many tensor functions chained together, each having a known derivative. For example, the network `f` composed of three tensor operations `a`, `b`, and `c` with weight matrices `W1`, `W2`, and `W3`:

`f(W1, W2, W3) = a(W1, b(W2, c(W3)))`

Such a chain of functions can be differentiated using the *chain rule*:

`(f(g(x)))' = f'(g(x)) * g'(x)`

Applying the chain rule to the gradient computation of the network results in the *Back-propagation* algorithm.
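The chain rule can be checked numerically for a small two-function chain (the functions `f` and `g` below are arbitrary examples with known derivatives):

```
# Chain f(g(x)) with f(u) = u^2 and g(x) = sin(x).
f  <- function(u) u^2
df <- function(u) 2 * u      # known derivative of f
g  <- function(x) sin(x)
dg <- function(x) cos(x)     # known derivative of g

x <- 0.5
# Backward pass: multiply the local derivatives along the chain.
analytic <- df(g(x)) * dg(x)

# Compare against a finite-difference approximation of the chain.
eps <- 1e-6
numeric <- (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
abs(analytic - numeric) < 1e-6  # TRUE
```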

Modern frameworks such as TensorFlow support *symbolic differentiation*, allowing them to calculate a *gradient function* for the chain that maps network weights to gradient values. The backward pass is thus reduced to a call to the gradient function.


See also Deep Learning with R, page 46.

## Example: Revisited

Let’s revisit our first example to put all of the new concepts together:

**Input Data**

```
library(keras)
mnist <- dataset_mnist()
train_images <- mnist$train$x
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255
# one-hot encode the targets, as required by categorical_crossentropy below
train_labels <- to_categorical(mnist$train$y)
```

**Network Definition**

```
network <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%
  layer_dense(units = 10, activation = "softmax")
```

**Network Compilation**

```
network %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)
```

**Training Loop**

`network %>% fit(train_images, train_labels, epochs = 5, batch_size = 128)`

See also Deep Learning with R, page 47.