6.a) Explain AdaGrad and write an algorithm for AdaGrad.
Answer:
AdaGrad
- The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values (a minimal sketch of this update follows the list below).
- The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.
- The net effect is greater progress in the more gently sloped directions of parameter space.
- In the context of convex optimization, the AdaGrad algorithm enjoys some desirable theoretical properties.
- However, it has been found empirically that, when training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate.
- AdaGrad performs well for some but not all deep learning models.
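
Below is a minimal sketch of the AdaGrad update in NumPy. It accumulates squared gradients per parameter and divides each step by the square root of that accumulator. The function name adagrad_update, the learning rate of 0.01, and the constant eps = 1e-8 (used only to avoid division by zero) are illustrative choices, not values prescribed above.

```python
import numpy as np

def adagrad_update(params, grads, accum, lr=0.01, eps=1e-8):
    """Apply one AdaGrad step in place.

    params, grads, accum are dicts mapping parameter names to arrays;
    accum holds the running sum of squared gradients (initialized to zero).
    """
    for name in params:
        accum[name] += grads[name] ** 2                     # r <- r + g * g
        # Scale each component's step inversely to the square root of its history.
        params[name] -= lr * grads[name] / (np.sqrt(accum[name]) + eps)
    return params, accum

# Illustrative usage: minimize f(theta) = theta_0^2 + 10 * theta_1^2.
params = {"theta": np.array([5.0, 5.0])}
accum = {"theta": np.zeros(2)}
for _ in range(500):
    grads = {"theta": np.array([2.0, 20.0]) * params["theta"]}  # gradient of f
    adagrad_update(params, grads, accum, lr=0.5)
print(params["theta"])  # both coordinates approach 0
```

In this toy run the coordinate with the larger partial derivative (theta_1) accumulates squared gradients faster, so its effective learning rate shrinks more quickly, matching the behaviour described in the list above.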