Note: This post answers a question I asked on Cross Validated: "What are the benefits of layer-specific learning rates?"
Layer-specific learning rates means using a different learning rate for each layer of a neural network instead of one global learning rate shared by all layers.
Layer-specific learning rates help overcome slow learning (and thus slow training) in deep neural networks. As stated in the paper Layer-Specific Adaptive Learning Rates for Deep Networks [1]:
When the gradient descent methods are used to train deep networks, additional problems are introduced. As the number of layers in a network increases, the gradients that are propagated back to the initial layers get very small (vanishing gradient problem). This dramatically slows down the rate of learning in the initial layers and slows down the convergence of the whole network.
Giving each layer its own learning rate allows larger learning rates in the shallow layers (those near the input layer), which compensates for the small size of their gradients.
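To make the compensation idea concrete, here is a minimal sketch in plain Python. It assumes a toy network that is just a chain of scalar sigmoid units, and it uses a hypothetical scaling rule (match every layer's update size to the last layer's). This rule is for illustration only and is not the adaptive scheme proposed in the paper; `layer_gradients`, `base_lr`, and the weight values are all made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_gradients(weights, x):
    """Backpropagate through a chain of scalar sigmoid units.

    Returns d(output)/d(w_l) for every layer l, ordered input -> output.
    """
    # Forward pass: keep every activation for the backward pass.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    grads = [0.0] * len(weights)
    upstream = 1.0  # d(output)/d(activation of the last layer)
    for l in reversed(range(len(weights))):
        a_out = activations[l + 1]
        local = a_out * (1.0 - a_out)      # derivative of the sigmoid
        grads[l] = upstream * local * activations[l]
        upstream *= local * weights[l]     # shrinks at every step toward the input
    return grads

weights = [0.5] * 6                        # toy 6-layer chain (assumed values)
grads = layer_gradients(weights, x=1.0)

base_lr = 0.1
# Hypothetical rule: scale each layer's rate so its update is as large
# as the update of the last layer.
layer_lrs = [base_lr * abs(grads[-1]) / abs(g) for g in grads]

for l, (g, lr) in enumerate(zip(grads, layer_lrs), start=1):
    print(f"layer {l}: |grad| = {abs(g):.2e}, lr = {lr:.2e}")
```

In this toy setup the gradient shrinks at every layer toward the input, so the computed learning rate grows correspondingly for the shallow layers.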
Layer-specific learning rates also help in transfer learning: see differential learning rates [2], discriminative fine-tuning by Jeremy Howard [3], and the post on CaffeNet [4].
The intuition is that the layers closer to the input layer are more likely to have learned general features, such as lines and edges, which we do not want to change much, so we set their learning rate low. The later layers of the model learn the detailed features, so we increase their learning rate to let the new layers learn fast.
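As a rough sketch of the fine-tuning case, the snippet below assigns the highest learning rate to the last layer and divides it by a constant factor for each step toward the input. The function `discriminative_lrs`, the layer names, and the default values of `top_lr` and `decay` are illustrative assumptions, not taken from any of the referenced posts.

```python
def discriminative_lrs(layer_names, top_lr=1e-3, decay=2.6):
    """Map each layer name to a learning rate.

    `layer_names` is ordered input -> output. The last layer gets `top_lr`;
    each earlier layer gets the next layer's rate divided by `decay`.
    """
    n = len(layer_names)
    return {
        name: top_lr / decay ** (n - 1 - i)
        for i, name in enumerate(layer_names)
    }

# Hypothetical layer names for a small pretrained model with a new head.
layers = ["conv1", "conv2", "conv3", "classifier"]
lrs = discriminative_lrs(layers)

for name in layers:
    print(f"{name}: lr = {lrs[name]:.2e}")
```

In a framework such as PyTorch, a mapping like this could be turned into per-parameter option groups (a list of dicts with `params` and `lr` keys) and passed to an optimizer, so that each group is updated with its own rate.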
1. Layer-Specific Adaptive Learning Rates for Deep Networks
2. Differential Learning Rates
3. Discriminative fine-tuning by Jeremy Howard
4. Post on CaffeNet