## Momentum

Momentum technique is an approach which provides an update rule that is motivated from the physical perspective of optimization. Imagine a ball in a hilly terrain is trying to reach the deepest valley. When the slope of the hill is very high, the ball gains a lot of momentum and is able to pass through slight hills in its way. As the slope decreases the momentum and speed of the ball decreases, eventually coming to rest in the deepest position of valley.

This technique modifies the standard SGD by introducing velocity $v$ , which is the parameter we are trying to optimize, and friction $\mu$, which tries to control the velocity and prevents overshooting the valley while allowing faster descent. The gradient only has direct influence on the velocity, which in turn has an effect on the position. Mathematically,

\begin{align} v &= \mu v - \alpha \Delta L \\ w &= w + v \end{align}

Which translates to code as

def momentum_update(velocity,params,grads,learning_rate=0.01,mu=0.9):
v[i] = mu*v[i] + learning_rate * grad[i]
param[i] -= v[i]


The advantage of momentum is that it makes very small change to SGD but provides a big boost to speed of learning. We need to store the velocity for all the parameters, and use this velocity for making the updates. Here is the modified function for SGD which uses the above momentum update rule.

def sgd_momentum(nnet,X_train,y_train,minibatch_size,epoch,learning_rate,mu = 0.9,\
verbose=True,X_test=None,y_test=None):
minibatches = get_minibatches(X_train,y_train,minibatch_size)
for i in range(epoch):
loss = 0
velocity = []
for param_layer in nnet.params:
p = [np.zeros_like(param) for param in list(param_layer)]
velocity.append(p)

if verbose:
print("Epoch {0}".format(i+1))
for X_mini, y_mini in minibatches:
if verbose:
train_acc = accuracy(y_train,nnet.predict(X_train))
test_acc = accuracy(y_test,nnet.predict(X_test))
print("Loss = {0} | Training Accuracy = {1} | Test Accuracy = {2}".\
format(loss,train_acc,test_acc))
return nnet


Nesterov's Accelerated Gradient is a clever variation of momentum that works slightly better than standard momentum. The idea behind Nesterov's momentum is that instead of calculating the gradient at the current position, we calculate the gradient at a position that we know our momentum is about to take us, called as "look ahead" position. From physical perspective, it makes sense to make judgements about our final position based on the position that we know we are going to be in a short while.

The implementation makes a slight modification to standard SGD Momentum by nudging our parameters slightly in the direction of the velocity and calculating the gradients there. Here is the code:

def sgd_momentum(nnet,X_train,y_train,minibatch_size,epoch,learning_rate,mu = 0.9,\
verbose=True,X_test=None,y_test=None,nesterov = False):

minibatches = get_minibatches(X_train,y_train,minibatch_size)

for i in range(epoch):
loss = 0
velocity = []
for param_layer in nnet.params:
p = [np.zeros_like(param) for param in list(param_layer)]
velocity.append(p)

if verbose:
print("Epoch {0}".format(i+1))

for X_mini, y_mini in minibatches:
# if nesterov is enabled, nudge the params forward by momentum
if nesterov:
for param,ve in zip(nnet.params,velocity):
for i in range(len(param)):
param[i] += mu*ve[i]

if verbose:
train_acc = accuracy(y_train,nnet.predict(X_train))
test_acc = accuracy(y_test,nnet.predict(X_test))
print("Loss = {0} | Training Accuracy = {1} | Test Accuracy = {2}".format(loss,train_acc,test_acc))
return nnet


Until now we have used a global and equal learning rate for all our parameters. So all of our parameters are being updated with constant factor. But what if we could speed up or slow down this factor, even for each parameter, as the training progresses? We could adaptively tune the learning throughout the training phases and know which direction to accelerate and which to decelerate. Several methods that use such adaptive learning rates have been proposed, most notably AdaGrad, RMSprop and Adam.

AdaGrad (original paper) keeps track of per parameter sum of squared gradient and normalizes parameter update step. The idea is that parameters which receive big updates will have their effective learning rate reduced, while parameters which receive small updates will have their effective learning rate increased. This way we can accelerate the convergence by accelerating per parameter learning.

def adagrad_update(cache,params,grads,learning_rate=0.01):
param[i] += - learning_rate * grad[i] / (np.sqrt(cache[i])+1e-8) # for preventing divide by 0



RMSprop

A disadvantage of AdaGrad is that cache[i] += grad[i]**2 part of the update is monotonically increasing. This can pose problems because the learning rate can steadily decrease to the point where it stops the learning altogether. RMSprop (unpublished, citation here) combats this problem by decaying the past squared gradient by a factor decay_rate to control the aggressive learning rates. Here decay_rate is a hyperparameter with typical values like 0.9,0.99 and so on.

def rmsprop_update(cache,params,grads,learning_rate=0.01,decay_rate=0.9):
cache[i] = decay_rate * cache[i] + (1-decay_rate) * grad[i]**2
param[i] += - learning_rate * grad[i] / (np.sqrt(cache[i])+1e-4)


Adam (original paper) is the most widely used first order optimization algorithm. It is an improvement upon RMSprop by adding momentum to the update rule, combining best of the both momentum and adaptive learning worlds. We introduce two more parameters beta1 and beta2 with recommended values 0.9 and 0.999 respectively.

Another thing to note is that Adam includes bias correction mechanism, which compensates for first few iterations when both cache and velocity are biased at zero as they are initialized to zero.

Here is the full implementation of Adam:


X_test=None,y_test=None,beta1=0.9,beta2=0.999):

minibatches = get_minibatches(X_train,y_train,minibatch_size)
for i in range(epoch):
loss = 0
velocity,cache = [],[]
for param_layer in nnet.params:
p = [np.zeros_like(param) for param in list(param_layer)]
velocity.append(p)
cache.append(p)
if verbose:
print("Epoch {0}".format(i+1))
t = 1
for X_mini, y_mini in minibatches:
c[i] = beta1 * c[i] + (1-beta1) * grad[i]
mt = c[i] / (1 - beta1**t)
v[i] = beta2 * v[i] + (1-beta2) * (grad[i]**2)
vt = v[i] / (1 - beta2**t)
print(vt)
param[i] += - learning_rate * mt / (np.sqrt(vt) + 1e-4)
t+=1

if verbose:
train_acc = accuracy(y_train,nnet.predict(X_train))
test_acc = accuracy(y_test,nnet.predict(X_test))
print("Loss = {0} | Training Accuracy = {1} | Test Accuracy = {2}".\
format(loss,train_acc,test_acc))
return nnet