Disentangling Adaptive Gradient Methods from Learning Rates

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

We investigate several confounding factors in the evaluation of optimization algorithms for deep learning. Primarily, we take a deeper look at how adaptive gradient methods interact with the learning rate schedule, a notoriously difficult-to-tune hyperparameter which has dramatic effects on the convergence and generalization of neural network training. We introduce a "grafting" experiment which decouples an update's magnitude from its direction, finding that many existing beliefs in the literature may have arisen from insufficient isolation of the implicit schedule of step sizes. Alongside this contribution, we present some empirical and theoretical retrospectives on the generalization of adaptive gradient methods, aimed at bringing more clarity to this space.

Related collections

Author and article information

Journal

Publication date Created: 26 February 2020

Article

ArXiV ID: 2002.11803

SO-VID: c6594cab-aa8e-4cc6-ac12-ded929b16580

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Categories cs.LG stat.ML

ScienceOpen disciplines: Machine learning,Artificial intelligence

Data availability:

ScienceOpen disciplines: Machine learning, Artificial intelligence

Disentangling Adaptive Gradient Methods from Learning Rates

Read this article at

Abstract

Related collections

Annual Reviews AI, Machine Learning, and Society

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 168