Setting the learning rate of your neural network.
In previous posts, I've discussed how we can train neural networks using backpropagation with gradient descent. One of the key hyperparameters to set in order to train a neural network is the learning rate for gradient descent. As a reminder, this parameter scales the magnitude of our weight updates in order to minimize the network's loss function.
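As a quick refresher, the vanilla gradient descent update for parameters $\theta$ with loss function $J$ takes the form

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta J\left(\theta_t\right)$$

where $\eta$ is the learning rate.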
If your learning rate is set too low, training will progress very slowly as you are making very tiny updates to the weights in your network. However, if your learning rate is set too high, it can cause undesirable divergent behavior in your loss function. I'll visualize these cases below - if you find these visuals hard to interpret, I'd recommend reading (at least) the first section in my post on gradient descent.
So how do we find the optimal learning rate?
3e-4 is the best learning rate for Adam, hands down.
— Andrej Karpathy (@karpathy) November 24, 2016
Perfect! I guess my job here is done.
Well... not quite.
(i just wanted to make sure that people understand that this is a joke...)
— Andrej Karpathy (@karpathy) November 24, 2016
(Humor yourself by reading through that thread after finishing this post.)
The loss landscape of a neural network (visualized below) is a function of the network's parameter values, quantifying the "error" associated with using a specific configuration of parameter values when performing inference (prediction) on a given dataset. This loss landscape can look quite different, even for very similar network architectures. The images below are from the paper Visualizing the Loss Landscape of Neural Nets, which shows how residual connections in a network can yield a smoother loss topology.
The optimal learning rate will be dependent on the topology of your loss landscape, which is in turn dependent on both your model architecture and your dataset. While using a default learning rate (i.e., the defaults set by your deep learning library) may provide decent results, you can often improve performance or speed up training by searching for an optimal learning rate. I hope you'll see in the next section that this is quite an easy task.
A systematic approach towards finding the optimal learning rate
Ultimately, we'd like a learning rate which results in a steep decrease in the network's loss. We can observe this by performing a simple experiment where we gradually increase the learning rate after each mini-batch, recording the loss at each increment. This gradual increase can be on either a linear or exponential scale.
For learning rates which are too low, the loss may decrease, but at a very shallow rate. When entering the optimal learning rate zone, you'll observe a quick drop in the loss function. Increasing the learning rate further will cause an increase in the loss as the parameter updates cause the loss to "bounce around" and even diverge from the minimum. Remember, the best learning rate is associated with the steepest drop in loss, so we're mainly interested in analyzing the slope of the plot.
You should set the range of your learning rate bounds for this experiment such that you observe all three phases, making the optimal range trivial to identify.
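As a minimal sketch of what this sweep might look like (here on an exponential scale; the bounds below are example values, and the Keras callback in the implementation section uses a linear scale instead):

import numpy as np

def lr_at_iteration(t, total_iterations, min_lr=1e-6, max_lr=1e0):
    '''Exponentially interpolate between min_lr and max_lr.

    At t=0 this returns min_lr; at t=total_iterations it returns max_lr,
    increasing by a constant multiplicative factor each iteration.
    '''
    return min_lr * (max_lr / min_lr) ** (t / total_iterations)

After each mini-batch, you would set the optimizer's learning rate to this value and record the (learning rate, loss) pair; plotting loss against learning rate on a log scale reveals the three phases described above.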
This technique was proposed by Leslie Smith in Cyclical Learning Rates for Training Neural Networks and evangelized by Jeremy Howard in fast.ai's course.
Setting a schedule to adjust your learning rate during training
Another commonly employed technique, known as learning rate annealing, recommends starting with a relatively high learning rate and then gradually lowering it during training. The intuition behind this approach is that we'd like to traverse quickly from the initial parameters to a range of "good" parameter values, but then we'd like a learning rate small enough that we can explore the "deeper, but narrower parts of the loss function" (from Karpathy's CS231n notes). If you're having a hard time picturing this, recall that too high of a learning rate can cause the parameter update to "jump over" the ideal minimum; subsequent updates will then either result in a continued noisy convergence in the general region of the minimum, or in more extreme cases may result in divergence.
The most popular form of learning rate annealing is a step decay where the learning rate is reduced by some percentage after a set number of training epochs.
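Concretely, if $\eta_0$ is the initial learning rate, $d$ the decay factor, and $s$ the step size (in epochs), this schedule can be written as

$$\eta_{\text{epoch}} = \eta_0 \cdot d^{\left\lfloor \text{epoch} / s \right\rfloor}$$

For example, with $\eta_0 = 0.1$, $d = 0.5$, and $s = 10$, the learning rate would be 0.1 for the first ten epochs, 0.05 for the next ten, and so on.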
More generally, it is useful to define a learning rate schedule in which the learning rate is updated during training according to some specified rule.
Cyclical learning rates
In the previously mentioned paper, Cyclical Learning Rates for Training Neural Networks, Leslie Smith proposes a cyclical learning rate schedule which varies between two bound values. The main learning rate schedule (visualized below) is a triangular update rule, but he also mentions the use of a triangular update in conjunction with a fixed cyclic decay or an exponential cyclic decay.
Note: At the end of this post, I'll provide the code to implement this learning rate schedule. Thus, if you don't care to understand the mathematical formulation, you can skim past this section.
We can write the general schedule as

$$\eta_t = \eta_{\min} + \left(\eta_{\max} - \eta_{\min}\right) \cdot \max\left(0, 1 - x\right)$$

where

$$x = \left\lvert \frac{\text{iterations}}{\text{stepsize}} - 2\left(\text{cycle}\right) + 1 \right\rvert$$

and

$$\text{cycle} = \left\lfloor 1 + \frac{\text{iterations}}{2\left(\text{stepsize}\right)} \right\rfloor$$

where $\eta_{\min}$ and $\eta_{\max}$ define the bounds of the learning rate and the stepsize is half the period of a cycle, measured in iterations.
In order to grok how this equation works, let's progressively build it with visualizations. For the visuals below, the triangular update is shown for 3 full cycles with a step size of 100 iterations. Remember, one iteration corresponds to one mini-batch of training.
First, we can establish our "progress" during training in terms of the half-cycles that we've completed. We measure our progress in terms of half-cycles rather than full cycles so that we can achieve symmetry within a cycle (this should become more clear as you continue reading).
Next, we compare our half-cycle progress to the number of half-cycles which will be completed at the end of the current cycle. At the beginning of a cycle, we have two half-cycles yet to be completed. At the end of a cycle, this value reaches zero.
Next, we'll add 1 to this value in order to shift the function to be centered on the y-axis. Now we're showing our progress within a cycle with reference to the half-cycle point.
At this point, we can take the absolute value to achieve a triangular shape within each cycle. This is the value that we assign to $x$.
However, we'd like our learning rate schedule to start at the minimum value, increase to the maximum value at the middle of a cycle, and then decrease back to the minimum value. We can accomplish this by simply calculating $1 - x$.
We now have a value which we can use to modulate the learning rate by adding some fraction of the learning rate range to the minimum learning rate (also referred to as the base learning rate).
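Putting the pieces together, here's a minimal NumPy sketch of the triangular schedule following the equations above (the step size and learning rate bounds are example values):

import numpy as np

def triangular_lr(iteration, stepsize=100, min_lr=1e-4, max_lr=1e-2):
    '''Triangular cyclical learning rate from Smith's CLR paper.'''
    cycle = np.floor(1 + iteration / (2 * stepsize))   # current cycle number
    x = np.abs(iteration / stepsize - 2 * cycle + 1)   # triangular wave in [0, 1]
    return min_lr + (max_lr - min_lr) * np.maximum(0, 1 - x)

# Three full cycles with a step size of 100 iterations, as in the visuals above.
lrs = [triangular_lr(i) for i in range(600)]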
Smith writes that the main assumption behind the rationale for a cyclical learning rate (as opposed to one which only decreases) is "that increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect." Indeed, his paper includes several examples of a loss function evolution which temporarily deviates to higher losses while ultimately converging to a lower loss when compared with a benchmark fixed learning rate.
To gain intuition on why this short-term effect would yield a long-term positive effect, it's important to understand the desirable characteristics of our converged minimum. Ultimately, we'd like our network to learn from the data in a manner which generalizes to unseen data. Further, a network with good generalization properties should be robust in the sense that small changes to the network's parameters don't cause drastic changes to performance. With this in mind, it makes sense that sharp minima lead to poor generalization, as small changes to the parameter values may result in a drastically higher loss. By allowing for our learning rate to increase at times, we can "jump out" of sharp minima, which would temporarily increase our loss but may ultimately lead to convergence on a more desirable minimum.
Note: Although the "flat minima lead to good generalization" argument is widely accepted, you can read a good counter-argument here.
Additionally, increasing the learning rate can also allow for "more rapid traversal of saddle point plateaus." As you can see in the image below, the gradients can be very small at a saddle point. Because the parameter updates are a function of the gradient, this results in our optimization taking very small steps; it can be useful to increase the learning rate here to avoid getting stuck for too long at a saddle point.
Image credit (with modification)
Note: A saddle point, by definition, is a critical point in which some dimensions observe a local minimum while other dimensions observe a local maximum. Because neural networks can have thousands or even millions of parameters, it's unlikely that we'll observe a true local minimum across all of these dimensions; saddle points are much more likely to occur. When I referred to "sharp minima", realistically we should picture a saddle point where the minimum dimensions are very steep while the maximum dimensions are very wide (as shown below).
Stochastic Gradient Descent with Warm Restarts (SGDR)
A similar cyclic approach is known as stochastic gradient descent with warm restarts, where an aggressive annealing schedule is combined with periodic "restarts" to the original starting learning rate.
We can write this schedule as

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{\text{current}}}{T_i}\pi\right)\right)$$

where $\eta_{\min}$ and $\eta_{\max}$ define the bounds of the learning rate, $T_{\text{current}}$ represents the number of epochs since the last restart, and $T_i$ represents the number of epochs in a cycle.
This annealing schedule relies on the cosine function, which varies between -1 and 1.
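As a minimal sketch of the formula (the bounds and cycle length are example values; this mirrors the `clr` method of the `SGDRScheduler` callback in the implementation section below):

import numpy as np

def cosine_annealing_lr(epoch_in_cycle, epochs_per_cycle=10, min_lr=1e-5, max_lr=1e-2):
    '''Cosine annealing within a single cycle; a warm restart corresponds
    to resetting epoch_in_cycle back to zero.'''
    fraction = epoch_in_cycle / epochs_per_cycle        # 0 at a restart, 1 at cycle end
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(fraction * np.pi))

At a restart the learning rate sits at $\eta_{\max}$, falls to the midpoint halfway through the cycle, and reaches $\eta_{\min}$ at the cycle's end.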
The authors note that this learning rate schedule can further be adapted to:
- Lengthen the cycle as training progresses.
- Decay $\eta_{\max}$ and $\eta_{\min}$ after each cycle.
By drastically increasing the learning rate at each restart, we can essentially escape a local minimum and continue exploring the loss landscape.
Neat idea: By snapshotting the weights at the end of each cycle, researchers were able to build an ensemble of models at the cost of training a single model. This is because the network "settles" on various local optima from cycle to cycle, as shown in the figure below.
Implementation
Both finding the optimal range of learning rates and assigning a learning rate schedule can be implemented quite trivially using Keras Callbacks.
Finding the optimal learning rate range
We can write a Keras Callback which tracks the loss associated with a learning rate varied linearly over a defined range.
import matplotlib.pyplot as plt
import numpy as np
import keras.backend as K
from keras.callbacks import Callback


class LRFinder(Callback):
    '''
    A simple callback for finding the optimal learning rate range for your model + dataset.

    # Usage
        ```python
            lr_finder = LRFinder(min_lr=1e-5,
                                 max_lr=1e-2,
                                 steps_per_epoch=np.ceil(epoch_size/batch_size),
                                 epochs=3)
            model.fit(X_train, Y_train, callbacks=[lr_finder])

            lr_finder.plot_loss()
        ```

    # Arguments
        min_lr: The lower bound of the learning rate range for the experiment.
        max_lr: The upper bound of the learning rate range for the experiment.
        steps_per_epoch: Number of mini-batches in the dataset. Calculated as `np.ceil(epoch_size/batch_size)`.
        epochs: Number of epochs to run experiment. Usually between 2 and 4 epochs is sufficient.

    # References
        Blog post: jeremyjordan.me/nn-learning-rate
        Original paper: https://arxiv.org/abs/1506.01186
    '''

    def __init__(self, min_lr=1e-5, max_lr=1e-2, steps_per_epoch=None, epochs=None):
        super().__init__()

        self.min_lr = min_lr
        self.max_lr = max_lr
        self.total_iterations = steps_per_epoch * epochs
        self.iteration = 0
        self.history = {}

    def clr(self):
        '''Calculate the learning rate.'''
        x = self.iteration / self.total_iterations
        return self.min_lr + (self.max_lr - self.min_lr) * x

    def on_train_begin(self, logs=None):
        '''Initialize the learning rate to the minimum value at the start of training.'''
        logs = logs or {}
        K.set_value(self.model.optimizer.lr, self.min_lr)

    def on_batch_end(self, batch, logs=None):
        '''Record previous batch statistics and update the learning rate.'''
        logs = logs or {}
        self.iteration += 1

        self.history.setdefault('lr', []).append(K.get_value(self.model.optimizer.lr))
        self.history.setdefault('iterations', []).append(self.iteration)

        for k, v in logs.items():
            self.history.setdefault(k, []).append(v)

        K.set_value(self.model.optimizer.lr, self.clr())

    def plot_lr(self):
        '''Helper function to quickly inspect the learning rate schedule.'''
        plt.plot(self.history['iterations'], self.history['lr'])
        plt.yscale('log')
        plt.xlabel('Iteration')
        plt.ylabel('Learning rate')
        plt.show()

    def plot_loss(self):
        '''Helper function to quickly observe the learning rate experiment results.'''
        plt.plot(self.history['lr'], self.history['loss'])
        plt.xscale('log')
        plt.xlabel('Learning rate')
        plt.ylabel('Loss')
        plt.show()
Setting a learning rate schedule
Step Decay
For a simple step decay, we can use the LearningRateScheduler callback.
import numpy as np
from keras.callbacks import LearningRateScheduler


def step_decay_schedule(initial_lr=1e-3, decay_factor=0.75, step_size=10):
    '''
    Wrapper function to create a LearningRateScheduler with step decay schedule.
    '''
    def schedule(epoch):
        return initial_lr * (decay_factor ** np.floor(epoch / step_size))

    return LearningRateScheduler(schedule)


lr_sched = step_decay_schedule(initial_lr=1e-4, decay_factor=0.75, step_size=2)

model.fit(X_train, Y_train, callbacks=[lr_sched])
Cyclical Learning Rate
To apply the cyclical learning rate technique, we can reference this repo, which implements the technique described in the paper. In fact, this repo is cited in the paper's appendix.
Stochastic Gradient Descent with Restarts
import numpy as np
import keras.backend as K
from keras.callbacks import Callback


class SGDRScheduler(Callback):
    '''Cosine annealing learning rate scheduler with periodic restarts.

    # Usage
        ```python
            schedule = SGDRScheduler(min_lr=1e-5,
                                     max_lr=1e-2,
                                     steps_per_epoch=np.ceil(epoch_size/batch_size),
                                     lr_decay=0.9,
                                     cycle_length=5,
                                     mult_factor=1.5)
            model.fit(X_train, Y_train, epochs=100, callbacks=[schedule])
        ```

    # Arguments
        min_lr: The lower bound of the learning rate range for the experiment.
        max_lr: The upper bound of the learning rate range for the experiment.
        steps_per_epoch: Number of mini-batches in the dataset. Calculated as `np.ceil(epoch_size/batch_size)`.
        lr_decay: Reduce the max_lr after the completion of each cycle.
                  Ex. To reduce the max_lr by 20% after each cycle, set this value to 0.8.
        cycle_length: Initial number of epochs in a cycle.
        mult_factor: Scale epochs_to_restart after each full cycle completion.

    # References
        Blog post: jeremyjordan.me/nn-learning-rate
        Original paper: http://arxiv.org/abs/1608.03983
    '''

    def __init__(self,
                 min_lr,
                 max_lr,
                 steps_per_epoch,
                 lr_decay=1,
                 cycle_length=10,
                 mult_factor=2):
        super().__init__()

        self.min_lr = min_lr
        self.max_lr = max_lr
        self.lr_decay = lr_decay

        self.batch_since_restart = 0
        self.next_restart = cycle_length

        self.steps_per_epoch = steps_per_epoch

        self.cycle_length = cycle_length
        self.mult_factor = mult_factor

        self.best_weights = None
        self.history = {}

    def clr(self):
        '''Calculate the learning rate.'''
        fraction_to_restart = self.batch_since_restart / (self.steps_per_epoch * self.cycle_length)
        lr = self.min_lr + 0.5 * (self.max_lr - self.min_lr) * (1 + np.cos(fraction_to_restart * np.pi))
        return lr

    def on_train_begin(self, logs=None):
        '''Initialize the learning rate to the maximum value at the start of training.'''
        logs = logs or {}
        K.set_value(self.model.optimizer.lr, self.max_lr)

    def on_batch_end(self, batch, logs=None):
        '''Record previous batch statistics and update the learning rate.'''
        logs = logs or {}
        self.history.setdefault('lr', []).append(K.get_value(self.model.optimizer.lr))
        for k, v in logs.items():
            self.history.setdefault(k, []).append(v)

        self.batch_since_restart += 1
        K.set_value(self.model.optimizer.lr, self.clr())

    def on_epoch_end(self, epoch, logs=None):
        '''Check for the end of the current cycle, applying a restart when necessary.'''
        if epoch + 1 == self.next_restart:
            self.batch_since_restart = 0
            self.cycle_length = np.ceil(self.cycle_length * self.mult_factor)
            self.next_restart += self.cycle_length
            self.max_lr *= self.lr_decay
            self.best_weights = self.model.get_weights()

    def on_train_end(self, logs=None):
        '''Set the weights to those from the end of the most recent cycle for best performance.'''
        if self.best_weights is not None:
            self.model.set_weights(self.best_weights)
Snapshot Ensembles
To apply the "Train 1, get M for free" technique, you can reference this repo.
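If you'd prefer to roll your own, a minimal sketch of the idea, assuming you've saved the network's weights at the end of each annealing cycle to hypothetical files `snapshot_1.h5` through `snapshot_3.h5`, is to average the predictions of the individual snapshots:

import numpy as np

# Hypothetical paths to weights saved at the end of each annealing cycle.
snapshot_paths = ['snapshot_1.h5', 'snapshot_2.h5', 'snapshot_3.h5']

def snapshot_ensemble_predict(model, X):
    '''Average predictions across the saved weight snapshots.'''
    predictions = []
    for path in snapshot_paths:
        model.load_weights(path)
        predictions.append(model.predict(X))
    return np.mean(predictions, axis=0)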
Further reading
- Stanford CS231n: Annealing the learning rate
- Cyclical Learning Rates for Training Neural Networks
- Exploring loss function topology with cyclical learning rates
- SGDR: Stochastic Gradient Descent with Warm Restarts
- Snapshot Ensembles: Train 1, get M for free
- (extension of concept) Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
- Visualizing the Loss Landscape of Neural Nets
- Optimization for Deep Learning Highlights in 2017: Sebastian Ruder
- On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
- Sharp Minima Can Generalize For Deep Nets
- The Two Phases of Gradient Descent in Deep Learning
- Recent Advances for a Better Understanding of Deep Learning − Part I