For complex models such as deep neural networks, learning has the following characteristics:

- the objective function is non-convex
- the parameter space is high-dimensional
- the datasets are large

Mainstream optimization techniques come in many variants and refinements, but they all revolve around the following parameter update rule: $$ \theta = \theta - \eta \cdot \nabla_\theta J( \theta) $$

where $\theta$ denotes the model parameters; $\eta$ is the step size of each iteration, i.e. the learning rate; and $\nabla_\theta J( \theta)$ is the gradient vector, which records the first-order partial derivatives $\nabla_\theta J( \theta)_i=\frac {\partial J(\theta)} {\partial \theta_i}$.

Pros:

- works out of the box in most frameworks
- a hot research area

. . .

Cons:

- hyperparameters must be tuned as a black box
- little guidance on which scenarios each variant suits

For the root-finding problem $f\left( x \right) = 0$, take the first-order Taylor expansion of $f(x)$ around $x_0$: $$ y = f\left( {{x_0}} \right) + f'\left( {{x_0}} \right)\left( {x - {x_0}} \right) $$ Setting $f\left( {{x_0}} \right) + f'\left( {{x_0}} \right)\left( {x - {x_0}} \right) = 0$ yields the iteration formula: $$ {x_{n + 1}} = {x_n} - \frac{{f\left( {{x_n}} \right)}}{{f'\left( {{x_n}} \right)}} $$

The iteration approaches the desired root step by step:

Reference: http://tutorial.math.lamar.edu/Classes/CalcI/NewtonsMethod.aspx
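As a quick illustration (our own toy example, not part of the original notes), the 1-D iteration $x_{n+1} = x_n - f(x_n)/f'(x_n)$ fits in a few lines of Python; the function names and the choice $f(x) = x^2 - 2$ are ours:

```python
# Newton's method for root finding: x_{n+1} = x_n - f(x_n) / f'(x_n).
def newton(f, fprime, x0, n_steps=20):
    x = x0
    for _ in range(n_steps):
        x = x - f(x) / fprime(x)  # one Newton step
    return x

# Find the positive root of f(x) = x^2 - 2, i.e. sqrt(2).
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Convergence is quadratic near a simple root, so 20 steps reach machine precision here.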

Extending to a multivariate function $f(\theta)$, the goal becomes finding an extremum (equivalently, finding the roots of $\nabla f(\theta) = 0$), which yields the iteration: $$ \theta = \theta - H^{-1} \nabla f(\theta) $$ where $\nabla f(\theta)$ denotes the gradient vector and $H$ denotes the Hessian matrix, which records the second-order partial derivatives $H_{i,j}=\frac {\partial^2 f(\theta)} {\partial \theta_i \partial \theta_j}$.
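A minimal numerical sketch of one such Newton step (our own toy example, not from the notes): on the quadratic $f(\theta) = \frac{1}{2}\theta^\top A \theta - b^\top \theta$ the gradient is $A\theta - b$, the Hessian is the constant matrix $A$, and a single step lands exactly on the minimizer:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # Hessian of the toy quadratic (symmetric positive definite)
b = np.array([1.0, 1.0])

theta = np.zeros(2)
grad = A @ theta - b         # gradient of f at theta
H = A                        # Hessian of f (constant for a quadratic)
theta = theta - np.linalg.solve(H, grad)  # Newton step: theta - H^{-1} grad
```

Solving the linear system instead of forming $H^{-1}$ explicitly is cheaper and numerically safer.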

Pros: the step size $\eta$ of gradient descent is replaced by $H^{-1}$, giving higher per-iteration efficiency with no hyperparameter to tune

. . .

Cons: for high-dimensional parameter spaces, the time and space complexity of computing the Hessian matrix is prohibitive

A large variety of quasi-Newton methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is L-BFGS, which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed).

Reference: https://cs231n.github.io/neural-networks-3/#second
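As a hedged illustration (assuming SciPy is available; this example is ours, not from the notes), L-BFGS is exposed in `scipy.optimize.minimize` as the `L-BFGS-B` method; here it minimizes the Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    # Classic non-convex test function with its minimum at (1, 1).
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2

# L-BFGS approximates the inverse Hessian implicitly from recent gradients only.
res = minimize(rosenbrock, x0=np.zeros(2), method="L-BFGS-B")
```

`res.x` holds the minimizer found, `res.success` whether the solver converged.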

The goal of an optimization strategy is to find $\theta$ that minimizes the objective function $J(\theta)$: $$ \arg \min_{\theta} J(\theta; D) $$

where $\theta$ denotes the model parameters and $D$ is the dataset.

Reference: CS231n: Convolutional Neural Networks for Visual Recognition - Optimization

Strategy #1: random search

```
import numpy as np

bestloss = float("inf")  # start from the worst possible loss
for num in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # generate random parameters
    loss = L(X_train, Y_train, W)  # L is the loss over the entire training set
    if loss < bestloss:  # keep track of the best solution
        bestloss = loss
        bestW = W
    print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))
```

. . .

Characteristics: covers the whole parameter space; every step searches from scratch.

Insight: finding the best parameters by random search in a high-dimensional space is extremely hard, but *improving* a given setting of the parameters is a much easier problem. Starting from an initial value and iteratively improving the current solution (*iterative refinement*) is therefore an effective approach.

Strategy #2: local random search

```
import numpy as np

W = np.random.randn(10, 3073) * 0.001  # generate a random starting W
bestloss = float("inf")
for i in range(1000):
    step_size = 0.0001
    Wtry = W + np.random.randn(10, 3073) * step_size  # random local perturbation
    loss = L(Xtr_cols, Ytr, Wtry)
    if loss < bestloss:
        W = Wtry
        bestloss = loss
    print('iter %d loss is %f' % (i, bestloss))
```

. . .

Characteristics: local search; a step is taken only when it improves the best loss so far; still a blind, trial-and-error search.

Strategy #3: gradient descent

To improve the current solution, local random search is actually unnecessary: simply move along the negative gradient direction, along which the objective function is guaranteed to decrease fastest at the current point.

. . .

Characteristics: requires the gradient and a way to compute it; a suitable step size (learning rate) is the precondition for descending along the gradient.

Reference: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

Mathematical form:

$$ \theta = \theta - \eta \cdot \nabla_\theta J( \theta) $$

We seek model parameters $\theta \in \mathbb{R}^d$ that minimize the objective function $J(\theta)$; each parameter update is jointly controlled by the gradient $\nabla_\theta J(\theta)$ and the learning rate $\eta$.
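The update rule can be sketched in a few lines (a toy example of ours, not from the notes): minimize $J(\theta) = \lVert\theta\rVert^2$, whose gradient is $2\theta$.

```python
import numpy as np

def grad_J(theta):
    return 2.0 * theta              # gradient of J(theta) = ||theta||^2

theta = np.array([3.0, -4.0])       # arbitrary starting point
eta = 0.1                           # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)  # theta <- theta - eta * grad
```

After 100 updates $\theta$ has contracted to (numerically) the minimizer at the origin.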

Trading update efficiency against update accuracy, gradient descent comes in three variants according to how much data is used for each parameter update.

Method #1: batch gradient descent

Batch gradient descent uses the entire dataset $D$ to compute the gradient for each parameter update: $$ \theta = \theta - \eta \cdot \nabla_\theta J( \theta;D) $$ Pros: the gradient matches the true objective function; convergence is guaranteed.

Cons: each update is computationally expensive; impractical for large datasets and streaming data.

Method #2: stochastic gradient descent (SGD)

Stochastic gradient descent uses a single training example $(x^{(i)}, y^{(i)})$ to compute the gradient for each parameter update: $$ \theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i)};y^{(i)}) $$ Pros: each update is fast; redundancy in the dataset no longer hurts.

Cons: the updates deviate from the true objective; they fluctuate heavily; the iterate may converge to a better or a worse position.

SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily (From Wikipedia)

Method #3: mini-batch gradient descent

This is a compromise between the two methods above and the standard choice in practice; in fact, when people say SGD they usually mean this variant rather than method #2. It uses $n$ training examples to compute the gradient for each update: $$ \theta = \theta - \eta \cdot \nabla_\theta J( \theta; x^{(i:i+n)}; y^{(i:i+n)}) $$ It reduces the variance of the updates, ensuring more stable convergence, while keeping each iteration fast; it can also exploit the fast matrix computation of machine-learning frameworks.
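A minimal sketch of the mini-batch loop (our own toy least-squares example; all names are illustrative): each update uses $n = 32$ examples drawn from a reshuffled pass over the data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                 # toy design matrix
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                            # noiseless targets

theta = np.zeros(3)
eta, batch = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))             # reshuffle every epoch
    for s in range(0, len(X), batch):
        Xb, yb = X[idx[s:s + batch]], y[idx[s:s + batch]]
        grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(Xb)  # mini-batch gradient
        theta -= eta * grad
```

Because the targets are noiseless, the true parameters are a fixed point of every mini-batch update, and the iterates converge to them.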

Is SGD already good enough?

. . .

- How should the learning rate be chosen? What goes wrong when it is too small? And when it is too large?
- How should the learning rate be adjusted during training? With a predefined schedule? By decaying it when learning stalls?
- Is it a problem that all parameters share the same learning rate? In other words, should different parameters be updated at different scales?
- When the iterate reaches a local minimum or a saddle point, the gradient vanishes; how can these suboptimal points be escaped?

Pain points are the motivation for optimization, and all the improvements introduced here revolve around the SGD parameter update rule, refining it step by step. $$ \theta = \theta - \eta \cdot \nabla_\theta J( \theta) $$

Oscillating back and forth between the two sides of a ravine while descending slowly:

Reference: https://ruder.io/optimizing-gradient-descent/

Momentum introduces inertia to accelerate SGD and damp the oscillations: $$ \begin{split} v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta) \\ \theta &= \theta - v_t \end{split} $$ where $v_t$ accumulates the history of gradients and $\gamma$ is the momentum decay factor, typically set to $0.9$.
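A minimal sketch of the two-line update (our toy objective $J(\theta)=\lVert\theta\rVert^2$, not from the notes):

```python
import numpy as np

theta = np.array([3.0, -4.0])
v = np.zeros_like(theta)            # accumulated momentum
eta, gamma = 0.05, 0.9
for _ in range(500):
    grad = 2.0 * theta              # gradient of ||theta||^2
    v = gamma * v + eta * grad      # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v               # theta = theta - v_t
```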

Informally, the SGD update is augmented with an accumulated history of gradients: $$ \theta = \theta - (\eta \nabla_\theta J( \theta) + \nabla_\theta J( \theta)_\text{acc}) $$ Picture a small ball rolling down a hillside to see the difference between SGD and Momentum.

. . .

The gradient $\nabla_\theta J( \theta)$ acts like **acceleration**; Momentum introduces $v_t$ as the **velocity**; the parameter value $\theta$ acts like the **position**.

Carried by inertia, the ball keeps accelerating under the force as it rolls down the slope, going faster and faster, and may even rush up the opposite slope after reaching the bottom of the valley.

How does NAG (Nesterov accelerated gradient) solve this problem?

. . .

It decides how to change the current momentum by predicting the gradient at the next position:

- if the gradient at the next position opposes the current momentum, the momentum is reduced, e.g. the ball is about to rush up the opposite slope again;
- if the gradient at the next position agrees with the current momentum, the momentum is increased, e.g. the ball is still rushing down the slope.

Use the accumulated momentum $\gamma v_{t-1}$ to predict the next position $\theta_\text{next}$, and compute the gradient $\nabla_\theta J( \theta_\text{next})$ at that position:
$$
\theta_\text{next} = \theta - \gamma v_{t-1}
$$
. . .

The parameter update rule:
$$
\begin{split}
v_t &= \gamma v_{t-1} + \eta \nabla_\theta J( \theta_\text{next}) \\
\theta &= \theta - v_t
\end{split}
$$

Comparing the update rules of NAG and Momentum, the only difference is whether the gradient added to the accumulated momentum is $\nabla_\theta J( \theta)$ or $\nabla_\theta J( \theta_\text{next})$, i.e. the position at which the gradient is computed.
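A minimal sketch of NAG (same toy objective $J(\theta)=\lVert\theta\rVert^2$ as before, our example): the only change from Momentum is where the gradient is evaluated.

```python
import numpy as np

theta = np.array([3.0, -4.0])
v = np.zeros_like(theta)
eta, gamma = 0.05, 0.9
for _ in range(500):
    theta_next = theta - gamma * v       # look-ahead position
    grad = 2.0 * theta_next              # gradient evaluated at the look-ahead
    v = gamma * v + eta * grad
    theta = theta - v
```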

. . .

Reference: https://cs231n.github.io/neural-networks-3/

Now that we are able to adapt our updates to the slope of our error function and speed up SGD in turn, we would also like to adapt our updates to each individual parameter to perform larger or smaller updates depending on their importance (we might want to perform a larger update for parameters related to the rarely occurring features).

Adagrad is an algorithm for gradient-based optimization that does just this: It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, *it is well-suited for dealing with sparse data*.

Dean et al. have found that Adagrad greatly improved the robustness of SGD. Moreover, Pennington et al. used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.

Previously, we performed an update for all parameters $\theta$ at once, as every parameter $\theta_i$ used the same learning rate $\eta$. As Adagrad uses a different learning rate for every parameter $\theta_i$ at every time step $t$, we first show Adagrad's per-parameter update. For brevity, we use $g^{(t)}$ to denote the gradient at time step $t$; $g^{(t)}_i = \frac{\partial J(\theta^{(t)})}{\partial \theta_i}$ is then the partial derivative of the objective function w.r.t. the parameter $\theta_i$ at time step $t$. Adagrad modifies the general learning rate $\eta$ of SGD at each time step $t$ for every parameter $\theta_i$, based on the past gradients that have been computed for $\theta_i$:
$$
\eta_i = \frac{\eta}{\sqrt{G^{(t)}_{i,i} + \epsilon}}
$$
where $G^{(t)} \in \mathbb{R}^{d \times d}$ is a diagonal matrix whose diagonal elements $G^{(t)}_{i,i}=\sum_{t'=1}^{t} (g^{(t')}_i)^2$ are the sums of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$. Obviously, the larger $G^{(t)}_{i,i}$ is, the slower $\theta_i$ changes; $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of $10^{-8}$). Interestingly, without the square root operation the algorithm performs much worse. One of Adagrad's main benefits is that it eliminates the need to manually tune the learning rate $\eta$; most implementations use a default value of $\eta = 0.01$ and leave it at that.

The parameter update rule is now as follows: $$ \theta^{(t+1)}_i = \theta^{(t)}_i - \eta_i \cdot g^{(t)}_i $$ As $G^{(t)}$ contains the sum of the squares of the past gradients w.r.t. all parameters $\theta$ along its diagonal, we can now vectorize the implementation: $$ \theta^{(t+1)} = \theta^{(t)} - \dfrac{\eta}{\sqrt{G^{(t)} + \epsilon}} g^{(t)} $$ Adagrad's main weakness is its accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The following algorithms aim to resolve this flaw.
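A minimal sketch of the vectorized update (our toy objective $J(\theta)=\lVert\theta\rVert^2$, not from the notes); only the diagonal of $G^{(t)}$ is stored:

```python
import numpy as np

theta = np.array([3.0, -4.0])
G = np.zeros_like(theta)            # running sum of squared gradients (diagonal of G)
eta, eps = 0.5, 1e-8
for _ in range(2000):
    grad = 2.0 * theta
    G += grad ** 2
    theta = theta - eta / np.sqrt(G + eps) * grad  # per-parameter step sizes
```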

Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size $w$.

How do we record the $w$ past gradients? Instead of inefficiently storing $w$ previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E[g^2]^{(t+1)}$ at time step $t+1$ then depends only on the previous average and the current squared gradient $(g^{(t+1)})^2$: $$ E[g^2]^{(t+1)} = \gamma E[g^2]^{(t)} + (1 - \gamma)(g^{(t+1)})^2 $$ where $1 - \gamma$ controls how much of the current gradient is accumulated into the running average; $\gamma$ is typically set around $0.9$, and $E[g^2]^{(t)}$ is a vector of the same dimension as the gradient. This exponentially decaying average of squared gradients stands in stark contrast to Adagrad: we keep adding new information from the squared gradients $(g^{(t)})^2$, but we also make sure to decrease the effect of old gradients.

We can now simply replace the diagonal matrix $G^{(t)}$ in Adagrad with the decaying average over past squared gradients $E[g^2]^{(t)}$. The parameter update rule is then:
$$
\begin{align}
\begin{split}
\theta^{(t+1)} &= \theta^{(t)} + \Delta \theta^{(t)} \\
\Delta \theta^{(t)} &= - \dfrac{\eta}{\sqrt{E[g^2]^{(t)} + \epsilon}} \odot g^{(t)} = - \dfrac{\eta}{\text{RMS}[g]^{(t)}} \odot g^{(t)}
\end{split}
\end{align}
$$
where $\odot$ denotes element-wise vector multiplication and $\text{RMS}[\cdot]$ denotes the root-mean-square metric.

The authors note that the units in this update (as well as in SGD, Momentum, and Adagrad) do not match, i.e. the update should have the same hypothetical units as the parameter. To realize this, they define another exponentially decaying average, this time not of squared gradients but of squared parameter updates, and use its RMS $\text{RMS}[\Delta \theta]^{(t)}$ in place of the learning rate $\eta$:
$$
E[\Delta \theta^2]^{(t)} = \gamma E[\Delta \theta^2]^{(t-1)} + (1 - \gamma) (\Delta \theta^{(t)})^2
$$
Since $\text{RMS}[\Delta \theta]^{(t)}=\sqrt{E[\Delta \theta^2]^{(t)}}$ is unknown at time step $t$, we approximate it with the RMS of parameter updates up to the previous time step, $\text{RMS}[\Delta \theta]^{(t-1)}$. Adadelta then becomes:
$$
\begin{align}
\begin{split}
\theta^{(t+1)} &= \theta^{(t)} + \Delta \theta^{(t)} \\
\Delta \theta^{(t)} &= - \dfrac{\text{RMS}[\Delta \theta]^{(t-1)}}{\text{RMS}[g]^{(t)}} \odot g^{(t)}
\end{split}
\end{align}
$$
With Adadelta, we do not even need to set a default learning rate $\eta$, as it has been eliminated from the update rule.
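A minimal sketch of the two running averages (our toy objective $J(\theta)=\lVert\theta\rVert^2$, not from the notes); note that no learning rate appears:

```python
import numpy as np

theta = np.array([3.0, -4.0])
Eg2 = np.zeros_like(theta)    # E[g^2], decaying average of squared gradients
Edx2 = np.zeros_like(theta)   # E[dtheta^2], decaying average of squared updates
gamma, eps = 0.9, 1e-6
for _ in range(5000):
    grad = 2.0 * theta
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad  # RMS ratio sets the step
    Edx2 = gamma * Edx2 + (1 - gamma) * delta ** 2
    theta = theta + delta
```

The first steps are tiny (on the order of $\sqrt{\epsilon}$) and the effective step size grows as $E[\Delta\theta^2]$ accumulates.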

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in Lecture 6e of his Coursera Class.

RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad’s radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta that we derived above:
$$
\begin{align}
\begin{split}
\theta^{(t+1)} &= \theta^{(t)} + \Delta \theta^{(t)} \\
\Delta \theta^{(t)} &= - \dfrac{\eta}{\sqrt{E[g^2]^{(t)} + \epsilon}} \odot g^{(t)} \\
E[g^2]^{(t)} &= \gamma E[g^2]^{(t-1)} + (1-\gamma) (g^{(t)})^2
\end{split}
\end{align}
$$
RMSprop as well divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests $\gamma$ to be set to 0.9, while a good default value for the learning rate $\eta$ is 0.001.
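A minimal sketch of RMSprop (our toy objective $J(\theta)=\lVert\theta\rVert^2$, not from the notes):

```python
import numpy as np

theta = np.array([3.0, -4.0])
Eg2 = np.zeros_like(theta)
eta, gamma, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    grad = 2.0 * theta
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2     # decaying average of g^2
    theta = theta - eta / np.sqrt(Eg2 + eps) * grad  # normalized step
```

Because the gradient is divided by its own RMS, each step has roughly the magnitude of $\eta$ regardless of the gradient scale.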

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients ($v^{(t)}$ in the following formulation) like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients ($m^{(t)}$ in the following), similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

We compute the decaying averages of past and past squared gradients $m^{(t)}$ and $v^{(t)}$ respectively as follows:
$$
\begin{split}
m^{(t)} &= \beta_1 m^{(t-1)} + (1 - \beta_1)g^{(t)} \\
v^{(t)} &= \beta_2 v^{(t-1)} + (1 - \beta_2)(g^{(t)})^2
\end{split}
$$
$m^{(t)}$ and $v^{(t)}$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively, hence the name of the method.

The parameter update rule is then as follows: $$ \theta^{(t+1)} = \theta^{(t)} - \frac {\eta} {\sqrt{{v}^{(t)}} + \epsilon} \odot {m}^{(t)} $$ But as $m^{(t)}$ and $v^{(t)}$ are initialized as vectors of zeros, the authors of Adam observe that they are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. $\beta_1$ and $\beta_2$ are close to 1). They counteract these biases by computing bias-corrected first and second moment estimates: $$ \begin{align} \begin{split} \hat{m}^{(t)} &= \dfrac{{m}^{(t)}}{1 - \beta^t_1} \\ \hat{v}^{(t)} &= \dfrac{{v}^{(t)}}{1 - \beta^t_2} \end{split} \end{align} $$ They then use these to update the parameters just as we have seen in Adadelta and RMSprop, which yields the Adam update rule: $$ \theta^{(t+1)} = \theta^{(t)} - \frac {\eta} {\sqrt{\hat{v}^{(t)}} + \epsilon} \odot \hat{m}^{(t)} $$ The authors propose default values of $0.9$ for $\beta_1$, $0.999$ for $\beta_2$, and $10^{-8}$ for $\epsilon$. They show empirically that Adam works well in practice and compares favorably to other adaptive learning-rate algorithms.
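A minimal sketch of Adam with bias correction (our toy objective $J(\theta)=\lVert\theta\rVert^2$, not from the notes):

```python
import numpy as np

theta = np.array([3.0, -4.0])
m = np.zeros_like(theta)            # first moment (mean of gradients)
v = np.zeros_like(theta)            # second moment (uncentered variance)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 5001):            # t starts at 1 for the bias correction
    grad = 2.0 * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)    # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)    # bias-corrected second moment
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat
```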

The second moment $v^{(t)}$ in the Adam update rule actually uses the $\ell_2$ norm of the gradient: $$ v^{(t)} = \beta_2 v^{(t-1)} + (1 - \beta_2)(g^{(t)})^2 $$ We can generalize this update to the $\ell_p$ norm and also scale $\beta_2$ accordingly: $$ v^{(t)} = \beta_2^p v^{(t-1)} + (1 - \beta_2^p)|g^{(t)}|^p $$ Norms for large $p$ values generally become numerically unstable, which is why the $\ell_1$ and $\ell_2$ norms are most common in practice. However, the $\ell_\infty$ norm also generally exhibits stable behavior. For this reason, the authors propose AdaMax (Kingma and Ba, 2015) and show that $v^{(t)}$ with the $\ell_\infty$ norm converges to the following more stable value: $$ v^{(t)} = \beta_2^\infty v^{(t-1)} + (1 - \beta_2^\infty)|g^{(t)}|^\infty = \max(\beta_2 v^{(t-1)}, |g^{(t)}|) $$ where $\max$ is an element-wise operator.

We can now plug this into the Adam update equation by replacing its denominator with the above $v^{(t)}$ to obtain the AdaMax update rule: $$ \theta^{(t+1)} = \theta^{(t)} - \frac {\eta} {v^{(t)}} \odot \hat{m}^{(t)} $$ Note that since $v^{(t)}$ relies on the $\max$ operation, it is not as susceptible to bias towards zero as $m^{(t)}$ and $v^{(t)}$ in Adam, which is why we do not need to compute a bias correction for it.

Good default values are 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $\eta = 0.002$.
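A minimal sketch of AdaMax (our toy objective $J(\theta)=\lVert\theta\rVert^2$, not from the notes); the running max `u` replaces Adam's second moment and needs no bias correction:

```python
import numpy as np

theta = np.array([3.0, -4.0])
m = np.zeros_like(theta)
u = np.zeros_like(theta)            # ell_infinity-based second moment
eta, beta1, beta2 = 0.05, 0.9, 0.999
for t in range(1, 3001):
    grad = 2.0 * theta
    m = beta1 * m + (1 - beta1) * grad
    u = np.maximum(beta2 * u, np.abs(grad))  # element-wise decayed max
    m_hat = m / (1 - beta1 ** t)             # bias-correct the first moment only
    theta = theta - eta / u * m_hat
```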

Adam can be viewed as a combination of RMSprop and Momentum: RMSprop contributes the exponentially decaying average of past squared gradients $v^{(t)}$, while momentum accounts for the exponentially decaying average of past gradients $m^{(t)}$. And we have also seen that Nesterov accelerated gradient (NAG) is superior to vanilla Momentum. Nadam (Nesterov-accelerated Adaptive Moment Estimation) thus combines Adam and NAG. In order to incorporate NAG into Adam, we need to modify its momentum $m^{(t)}$.

First, let us recall the Momentum update rule:
$$
\begin{split}
g^{(t)} &= \nabla_\theta J( \theta) \\
m^{(t)} &= \gamma m^{(t-1)} + \eta g^{(t)} \\
\theta^{(t+1)} &= \theta^{(t)} - m^{(t)}
\end{split}
$$
where $J$ is our objective function, $\gamma$ is the momentum decay term, and $\eta$ is our step size. Note that momentum involves taking a step in the direction of the previous momentum vector and a step in the direction of the current gradient.

NAG then allows us to perform a more accurate step in the gradient direction by updating the parameters with the momentum step *before* computing the gradient. We thus only need to modify the gradient $g^{(t)}$ to arrive at NAG:
$$
\begin{split}
\theta_\text{next} &= \theta^{(t)} - \gamma m^{(t-1)} \\
g^{(t)} &= \nabla_\theta J(\theta_\text{next}) \\
m^{(t)} &= \gamma m^{(t-1)} + \eta g^{(t)} \\
\theta^{(t+1)} &= \theta^{(t)} - m^{(t)}
\end{split}
$$
where $\theta_\text{next}$ is an approximation of the next position of the parameters. We can rearrange these formulas as follows:
$$
\begin{split}
g^{(t)} &= \nabla_\theta J(\theta^{(t)} - \gamma m^{(t-1)}) \\
\theta^{(t+1)} &= \theta^{(t)} - (\gamma m^{(t-1)} + \eta g^{(t)})
\end{split}
$$
We can see that the momentum step $m^{(t-1)}$ is applied twice: once when computing the gradient $g^{(t)}$ and a second time when updating the parameters $\theta^{(t+1)}$. A proposed modification of NAG applies the look-ahead momentum vector $m^{(t)}$, instead of the previous momentum vector $m^{(t-1)}$, directly when updating the current parameters:
$$
\begin{split}
g^{(t)} &= \nabla_\theta J(\theta) \\
m^{(t)} &= \gamma m^{(t-1)} + \eta g^{(t)} \\
\theta^{(t+1)} &= \theta^{(t)} - (\gamma m^{(t)} + \eta g^{(t)})
\end{split}
$$

Similarly, we can add Nesterov momentum into Adam by replacing the previous momentum vector with the current momentum vector. First, the Adam update rule is the following:
$$
\begin{split}
g^{(t)} &= \nabla_\theta J(\theta) \\
m^{(t)} &= \beta_1 m^{(t-1)} + (1 - \beta_1) g^{(t)} \\
v^{(t)} &= \beta_2 v^{(t-1)} + (1 - \beta_2) (g^{(t)})^2 \\
\hat{m}^{(t)} &= \frac {m^{(t)}} {1 - \beta^t_1} \\
\hat{v}^{(t)} &= \frac {v^{(t)}} {1 - \beta^t_2} \\
\theta^{(t+1)} &= \theta^{(t)} - \frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon} \odot \hat{m}^{(t)} \\
&= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot {\frac {m^{(t)}} {1 - \beta^t_1}} \\
&= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot {\frac {\beta_1 m^{(t-1)} + (1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \\
&= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot \left( {\frac {\beta_1 m^{(t-1)}} {1 - \beta^t_1}} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right) \\
&= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot \left( {\frac { m^{(t-1)}} {1 - \beta^{t-1}_1}} \cdot {\frac {\beta_1 ({1 - \beta^{t-1}_1})} {{1 - \beta^t_1}}} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right) \\
&= \theta^{(t)} - {\frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon}} \odot \left( { \hat{m}^{(t-1)}} \cdot {\frac {\beta_1 ({1 - \beta^{t-1}_1})} {{1 - \beta^t_1}}} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right)
\end{split}
$$
We can now add Nesterov momentum just as we did previously by simply replacing this bias-corrected estimate of the momentum vector of the previous time step $\hat{m}^{(t-1)}$ with the bias-corrected estimate of the current momentum vector $\hat{m}^{(t)}$, which gives us the Nadam update rule:
$$
\theta^{(t+1)} = \theta^{(t)} - \frac {\eta} {\sqrt {\hat{v}^{(t)}} + \epsilon} \odot \left( \beta_1 \hat{m}^{(t)} + {\frac {(1 - \beta_1) g^{(t)}} {1 - \beta^t_1}} \right)
$$
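A minimal sketch of this update (our toy objective $J(\theta)=\lVert\theta\rVert^2$, not from the notes; the variable `m_bar` names the Nesterov combination):

```python
import numpy as np

theta = np.array([3.0, -4.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 5001):
    grad = 2.0 * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov combination: look-ahead momentum replaces plain m_hat.
    m_bar = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_bar
```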

Reddi et al. (2018) identify the exponential moving average of past squared gradients as a reason for the poor generalization behaviour of adaptive learning rate methods such as Adadelta, RMSprop, and Adam. Recall that the introduction of the exponential average was well motivated: it should prevent the learning rates from becoming infinitesimally small as training progresses, the key flaw of the Adagrad algorithm. However, this short-term memory of the gradients becomes an obstacle in other scenarios.

In settings where Adam converges to a suboptimal solution, it has been observed that some minibatches provide large and informative gradients, but as these minibatches only occur rarely, exponential averaging diminishes their influence, which leads to poor convergence. To fix this behaviour, the authors propose a new algorithm, AMSGrad that uses the maximum of past squared gradients $v^{(t)}$ rather than the exponential average to update the parameters.

The following two animations (image credit: Alec Radford) provide some intuition about the optimization behaviour of most of the presented methods; Karpathy also gives a description of the same images, alongside another concise overview of the algorithms discussed.

In the above image, we see their behaviour on the contours of a loss surface (the Beale function) over time. Note that Adagrad, Adadelta, and RMSprop almost immediately head off in the right direction and converge similarly fast, while Momentum and NAG are led off-track, evoking the image of a ball rolling down the hill. NAG, however, is quickly able to correct its course due to its increased responsiveness by looking ahead and heads to the minimum.

The above image shows the behaviour of the algorithms at a saddle point, i.e. a point where one dimension has a positive slope while the other dimension has a negative slope, which poses a difficulty for SGD as mentioned before. Notice here that SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope.

As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam are most suitable and provide the best convergence for these scenarios.

Note: If you are interested in visualizing these or other optimization algorithms, refer to this useful tutorial.

If your input data is sparse, then you will likely achieve the best results using one of the adaptive learning-rate methods. An additional benefit is that you will not need to tune the learning rate, but will likely achieve the best results with its default value.

In summary, RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. It is identical to Adadelta, except that Adadelta additionally uses the RMS of parameter updates in the numerator of its update rule. Adam, finally, adds bias-correction and momentum to RMSprop. RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances. Adam's bias-correction helps it slightly outperform RMSprop towards the end of optimization as gradients become sparser.

Interestingly, many recent papers use vanilla SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than the adaptive optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima. Consequently, if you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning rate methods.