We can set the learning rate adaptively using Adagrad. With Adagrad, we assign a high learning rate to parameters whose past gradients are small and a low learning rate to parameters whose past gradients are large. This makes the learning rate change adaptively based on the past gradient updates.
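The Adagrad update can be sketched as follows; the function name `adagrad_update` and the default hyperparameter values are illustrative choices, not part of the original text:

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad step; `accum` holds the running sum of squared gradients."""
    # Accumulate the squared gradient for each parameter.
    accum = accum + grad ** 2
    # Parameters with large accumulated gradients get a smaller effective
    # learning rate; parameters with small ones get a larger rate.
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.array([1.0])
accum = np.zeros_like(theta)
theta, accum = adagrad_update(theta, np.array([0.5]), accum)
```

Note that `accum` only grows over time, so the effective learning rate of every parameter decays monotonically, which is one known weakness of Adagrad.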
In Adam, we compute the running average of squared gradients, just as we do in RMSProp. But in addition to the running average of squared gradients, we also compute the running average of the gradients themselves. That is, Adam uses both the first and second moments of the gradients.
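A minimal sketch of one Adam step, assuming the standard default hyperparameters; the function name `adam_update` is an illustrative choice:

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at iteration t (t starts at 1)."""
    # First moment: running average of the gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: running average of squared gradients (as in RMSProp).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction, since m and v are initialized to zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
theta, m, v = adam_update(theta, np.array([0.5]), m, v, t=1)
```

The bias correction matters most in early iterations, when the zero-initialized moments would otherwise make the steps too small.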
Due to the exponential moving average of gradients, Adam can fail to converge and may reach a suboptimal solution instead of the globally optimal one. This happens because the exponential moving average forgets information about gradients that occur infrequently. To combat this issue, we use AMSGrad.
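AMSGrad's fix can be sketched as below: it keeps the running maximum of the second moment, so the effective step size can never grow back after a rare, large gradient fades from the average. The function name `amsgrad_update` is an illustrative choice, and following the original AMSGrad formulation this sketch applies bias correction only to the first moment:

```python
import numpy as np

def amsgrad_update(theta, grad, m, v, v_max, t, lr=0.001,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # The key difference from Adam: use the elementwise maximum of all
    # second moments seen so far, so the denominator never shrinks.
    v_max = np.maximum(v_max, v)
    m_hat = m / (1 - beta1 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max

theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
v_max = np.zeros_like(theta)
theta, m, v, v_max = amsgrad_update(theta, np.array([0.5]), m, v, v_max, t=1)
```

Because `v_max` is non-decreasing, an infrequent but informative large gradient permanently caps the step size, which is exactly the information plain Adam loses.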