SGD optimizer
Implements stochastic gradient descent (optionally with momentum). Nesterov momentum is based on the formula from "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., 2013).
optim_sgd(
  params,
  lr = optim_required(),
  momentum = 0,
  dampening = 0,
  weight_decay = 0,
  nesterov = FALSE
)
params: (iterable) parameters to optimize, or lists defining parameter groups
lr: (float) learning rate
momentum: (float, optional) momentum factor (default: 0)
dampening: (float, optional) dampening for momentum (default: 0)
weight_decay: (float, optional) weight decay (L2 penalty) (default: 0)
nesterov: (bool, optional) enables Nesterov momentum (default: FALSE)
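As a minimal illustration of the params argument, the optimizer can also be constructed from a plain list of tensors that require gradients; w1 and w2 below are made-up tensors, not part of the documented interface.

library(torch)

# Two standalone tensors to be optimized directly.
w1 <- torch_tensor(c(1, 2), requires_grad = TRUE)
w2 <- torch_tensor(0.5, requires_grad = TRUE)

# lr is required; the remaining arguments keep their defaults.
opt <- optim_sgd(params = list(w1, w2), lr = 0.1)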
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\begin{array}{ll}
v_{t+1} & = \mu * v_{t} + g_{t+1}, \\
p_{t+1} & = p_{t} - \mbox{lr} * v_{t+1},
\end{array}
where p, g, v, and μ denote the parameters, gradient, velocity, and momentum, respectively.
This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form
\begin{array}{ll}
v_{t+1} & = \mu * v_{t} + \mbox{lr} * g_{t+1}, \\
p_{t+1} & = p_{t} - v_{t+1}.
\end{array}
The Nesterov version is analogously modified.
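To make the difference concrete, here is a plain-R sketch (no torch calls; the toy loss 0.5 * p^2 with gradient g = p is made up for illustration) that runs both formulations side by side. With a constant learning rate the parameter trajectories coincide, since the Sutskever-style velocity is exactly lr times the velocity used here; they can diverge once lr is changed during training.

# Toy setting: loss = 0.5 * p^2, so the gradient is simply g = p.
lr <- 0.1
mu <- 0.9

# Formulation used by this optimizer: v <- mu * v + g; p <- p - lr * v
p1 <- 1; v1 <- 0
# Sutskever-style formulation:        v <- mu * v + lr * g; p <- p - v
p2 <- 1; v2 <- 0

for (t in 1:5) {
  g1 <- p1
  v1 <- mu * v1 + g1
  p1 <- p1 - lr * v1

  g2 <- p2
  v2 <- mu * v2 + lr * g2
  p2 <- p2 - v2
}

c(p1, p2)  # identical here because lr stays fixed throughout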
if (torch_is_installed()) {
## Not run:
optimizer <- optim_sgd(model$parameters, lr = 0.1, momentum = 0.9)
optimizer$zero_grad()
loss_fn(model(input), target)$backward()
optimizer$step()
## End(Not run)
}
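The example above assumes that model, input, target, and loss_fn are already defined. A self-contained sketch might look as follows; the linear model, the random data, and the use of nnf_mse_loss() are illustrative choices, not part of optim_sgd() itself.

if (torch_is_installed()) {
  library(torch)

  # Tiny made-up regression problem: 16 observations, 3 features.
  input  <- torch_randn(16, 3)
  target <- torch_randn(16, 1)
  model  <- nn_linear(3, 1)

  optimizer <- optim_sgd(model$parameters, lr = 0.1, momentum = 0.9)

  # One optimization step: reset gradients, compute the loss, backpropagate, update.
  optimizer$zero_grad()
  loss <- nnf_mse_loss(model(input), target)
  loss$backward()
  optimizer$step()
}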