Neural networks: training with backpropagation.
In my first post on neural networks, I discussed a model representation for neural networks and how we can feed in inputs and calculate an output. We calculated this output, layer by layer, by combining the inputs from the previous layer with weights for each neuron-neuron connection. I mentioned that we'd talk about how to find the proper weights to connect neurons together in a future post - this is that post!
Overview
In the previous post I had just assumed that we had magic prior knowledge of the proper weights for each neural network. In this post, we'll actually figure out how to get our neural network to "learn" the proper weights. However, for the sake of having somewhere to start, let's just initialize each of the weights with random values as an initial guess. We'll come back and revisit this random initialization step later on in the post.
Given our randomly initialized weights connecting each of the neurons, we can now feed in our matrix of observations and calculate the outputs of our neural network. This is called forward propagation. Given that we chose our weights at random, our output is probably not going to be very good with respect to our expected output for the dataset.
So where do we go from here?
Well, for starters, let's define what a "good" output looks like. Namely, we'll develop a cost function which penalizes outputs far from the expected value.
Next, we need to figure out a way to change the weights so that the cost function improves. Any given path from an input neuron to an output neuron is essentially just a composition of functions; as such, we can use partial derivatives and the chain rule to define the relationship between any given weight and the cost function. We can use this knowledge to then leverage gradient descent in updating each of the weights.
Pre-requisites
When I was first learning backpropagation, many people tried to abstract away the underlying math (derivative chains) and I was never really able to grasp what the heck was going on until watching Professor Winston's lecture at MIT. I hope that this post gives a better understanding of backpropagation than simply "this is the step where we send the error backwards to update the weights".
In order to fully grasp the concepts discussed in this post, you should be familiar with the following:
Partial derivatives
This post is going to be a bit dense with a lot of partial derivatives. However, it is my hope that even a reader without prior knowledge of multivariate calculus can still follow the logic behind backpropagation.
If you're not familiar with calculus, or would simply like to brush up on multivariate calculus, check out Khan Academy's lessons on the subject.
Gradient descent
To prevent this post from getting too long, I've separated the topic of gradient descent into another post. If you're not familiar with the method, be sure to read about it here and understand it before continuing this post.
Matrix multiplication
Here is a quick refresher from Khan Academy.
Starting simple
To figure out how to use gradient descent in training a neural network, let's start with the simplest neural network: one input neuron, one hidden layer neuron, and one output neuron.
To show a more complete picture of what's going on, I've expanded each neuron to show 1) the linear combination of inputs and weights and 2) the nonlinear activation of this combination. It's easy to see that the forward propagation step is simply a series of functions where the output of one feeds as the input to the next.
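To make that concrete, here's a minimal sketch of this forward pass in NumPy. The sigmoid activation and the specific values of `x`, `theta1`, and `theta2` are my own illustrative assumptions; the post doesn't commit to a particular activation function.

```python
import numpy as np

def f(z):
    """Sigmoid activation (an assumed choice; any differentiable nonlinearity works)."""
    return 1.0 / (1.0 + np.exp(-z))

# One input neuron, one hidden neuron, one output neuron.
x = 0.5          # input
theta1 = 0.8     # weight: input -> hidden (randomly initialized in practice)
theta2 = -0.4    # weight: hidden -> output (randomly initialized in practice)

# Forward propagation: the output of each function feeds the next.
z2 = theta1 * x   # linear combination entering the hidden neuron
a2 = f(z2)        # nonlinear activation of the hidden neuron
z3 = theta2 * a2  # linear combination entering the output neuron
a3 = f(z3)        # network output
print(a3)
```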
Defining "good" performance in a neural network
Let's define our cost function to simply be the squared error, where $y$ is the target output and $a^{(3)}$ is the network's output:

$$J\left( \theta \right) = \frac{1}{2}\left( y - a^{(3)} \right)^2$$
There's a myriad of cost functions we could use, but for this neural network squared error will work just fine.
Remember, we want to evaluate our model's output with respect to the target output in an attempt to minimize the difference between the two.
Relating the weights to the cost function
In order to minimize the difference between our neural network's output and the target output, we need to know how the model performance changes with respect to each parameter in our model. In other words, we need to define the relationship (read: partial derivative) between our cost function and each weight. We can then update these weights in an iterative process using gradient descent.
Let's look at $\frac{\partial J\left( \theta \right)}{\partial \theta^{(2)}}$ first, where $\theta^{(2)}$ is the weight connecting the hidden neuron to the output neuron.

Let's take a moment to examine how we could express the relationship between $\theta^{(2)}$ and the cost function. Recall from forward propagation that $z^{(3)} = \theta^{(2)} a^{(2)}$ is the weighted input to the output neuron and $a^{(3)} = f\left( z^{(3)} \right)$ is its activation: the cost depends on $a^{(3)}$, which depends on $z^{(3)}$, which in turn depends on $\theta^{(2)}$. In other words, the cost is a composition of functions of $\theta^{(2)}$.

As a reminder, the chain rule states:

$$\frac{\partial }{\partial x}f\left( g\left( x \right) \right) = \frac{\partial f}{\partial g}\frac{\partial g}{\partial x}$$

Let's apply the chain rule to solve for $\frac{\partial J\left( \theta \right)}{\partial \theta^{(2)}}$:

$$\frac{\partial J\left( \theta \right)}{\partial \theta^{(2)}} = \frac{\partial J\left( \theta \right)}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial \theta^{(2)}}$$

By similar logic, we can find $\frac{\partial J\left( \theta \right)}{\partial \theta^{(1)}}$; this chain simply extends further back through the network:

$$\frac{\partial J\left( \theta \right)}{\partial \theta^{(1)}} = \frac{\partial J\left( \theta \right)}{\partial a^{(3)}}\frac{\partial a^{(3)}}{\partial z^{(3)}}\frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial z^{(2)}}\frac{\partial z^{(2)}}{\partial \theta^{(1)}}$$
For the sake of clarity, I've updated our neural network diagram to visualize these chains. Make sure you are comfortable with this process before proceeding.
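If you'd like to convince yourself that these chains are correct, here's a sketch that evaluates both of them analytically and checks each against a finite-difference approximation. It assumes the sigmoid activation and squared error cost from the snippets above.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    """Derivative of the sigmoid."""
    return f(z) * (1.0 - f(z))

def forward(x, theta1, theta2):
    z2 = theta1 * x
    a2 = f(z2)
    z3 = theta2 * a2
    return z2, a2, z3, f(z3)

def cost(x, y, theta1, theta2):
    return 0.5 * (forward(x, theta1, theta2)[3] - y) ** 2

x, y, theta1, theta2 = 0.5, 0.7, 0.8, -0.4
z2, a2, z3, a3 = forward(x, theta1, theta2)

# Chain rule: dJ/dtheta2 = dJ/da3 * da3/dz3 * dz3/dtheta2
dtheta2 = (a3 - y) * f_prime(z3) * a2
# Chain rule: dJ/dtheta1 = dJ/da3 * da3/dz3 * dz3/da2 * da2/dz2 * dz2/dtheta1
dtheta1 = (a3 - y) * f_prime(z3) * theta2 * f_prime(z2) * x

# Finite-difference checks: both pairs should match closely.
eps = 1e-6
num2 = (cost(x, y, theta1, theta2 + eps) - cost(x, y, theta1, theta2 - eps)) / (2 * eps)
num1 = (cost(x, y, theta1 + eps, theta2) - cost(x, y, theta1 - eps, theta2)) / (2 * eps)
print(dtheta2, num2)
print(dtheta1, num1)
```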
Adding complexity
Now let's try this same approach on a slightly more complicated example: a neural network with two neurons in our input layer, two neurons in one hidden layer, and two neurons in our output layer. For now, we'll disregard the bias neurons that are missing from the input and hidden layers.
Let's take a second to go over the notation I'll be using so you can follow along with these diagrams. The superscript $(l)$ denotes what layer the object is in and the subscript denotes which neuron we're referring to in a given layer. For example, $a_1^{(2)}$ is the activation of the first neuron in the second layer. For the parameter values $\theta_{ij}^{(l)}$, I like to read them like a mailing label - the first value denotes what neuron the input is being sent to in the next layer, and the second value denotes from what neuron the information is being sent. For example, $\theta_{21}^{(2)}$ is used to send input to the 2nd neuron, from the 1st neuron in layer 2. The superscript denoting the layer corresponds with where the input is coming from. This notation is consistent with the matrix representation we discussed in my post on neural networks representation.
Let's expand this network to expose all of the math that's going on.
Yikes! Things got a little more complicated. I'll walk through the process for finding one of the partial derivatives of the cost function with respect to one of the parameter values; I'll leave the rest of the calculations as an exercise for the reader (and post the end results below).
Foremost, we'll need to revisit our cost function now that we're dealing with a neural network with more than one output. Let's now use the mean squared error as our cost function, summing the squared error over each of the output neurons:

$$J\left( \theta \right) = \frac{1}{2}\sum\limits_i \left( y_i - a_i^{(3)} \right)^2$$
Note: If you're training on multiple observations (which you always will in practice), we'll also need to perform a summation of the cost function over all training examples. For this cost function, it's common to normalize by the number of training examples, $m$, so the cost becomes an average over the dataset.
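As a concrete sketch of that normalization, here's the cost computed over a batch of observations in NumPy; the example values are arbitrary.

```python
import numpy as np

def cost(y_hat: np.ndarray, y: np.ndarray) -> float:
    """Squared error summed over outputs, averaged over m observations.

    y_hat, y: arrays of shape (m, n_outputs). The 1/2 factor matches the
    cost definition above; the 1/m factor averages over training examples.
    """
    m = y.shape[0]
    return float(np.sum((y - y_hat) ** 2) / (2 * m))

# Example: two observations, two outputs each.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
y_hat = np.array([[0.8, 0.1], [0.3, 0.6]])
print(cost(y_hat, y))
```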
Now that we've corrected our cost function, we can look at how changing a parameter affects the cost function. Specifically, I'll calculate $\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}}$: the partial derivative of the cost with respect to the parameter connecting the first input to the first neuron of the hidden layer. Because this parameter influences both outputs, there are two paths (highlighted blue and orange in the diagram) from $\theta_{11}^{(1)}$ to the cost function.
The derivative chain for the blue path is:

$$\frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}}$$

The derivative chain for the orange path is:

$$\frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}}$$

Combining these, we get the total expression for $\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}}$:

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} + \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}}$$
I've provided the remainder of the partial derivatives below. Remember, we need these partial derivatives because they describe how changing each parameter affects the cost function. Thus, we can use this knowledge to change all of the parameter values in a way that continues to decrease the cost function until we converge on some minimum value.
Layer 2 Parameters

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(2)}} = \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial \theta_{11}^{(2)}}$$

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{12}^{(2)}} = \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial \theta_{12}^{(2)}}$$

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{21}^{(2)}} = \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial \theta_{21}^{(2)}}$$

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{22}^{(2)}} = \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial \theta_{22}^{(2)}}$$
Layer 1 Parameters

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}} + \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}}$$

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{12}^{(1)}} = \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{12}^{(1)}} + \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial a_1^{(2)}}\frac{\partial a_1^{(2)}}{\partial z_1^{(2)}}\frac{\partial z_1^{(2)}}{\partial \theta_{12}^{(1)}}$$

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{21}^{(1)}} = \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial a_2^{(2)}}\frac{\partial a_2^{(2)}}{\partial z_2^{(2)}}\frac{\partial z_2^{(2)}}{\partial \theta_{21}^{(1)}} + \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial a_2^{(2)}}\frac{\partial a_2^{(2)}}{\partial z_2^{(2)}}\frac{\partial z_2^{(2)}}{\partial \theta_{21}^{(1)}}$$

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{22}^{(1)}} = \frac{\partial J\left( \theta \right)}{\partial a_1^{(3)}}\frac{\partial a_1^{(3)}}{\partial z_1^{(3)}}\frac{\partial z_1^{(3)}}{\partial a_2^{(2)}}\frac{\partial a_2^{(2)}}{\partial z_2^{(2)}}\frac{\partial z_2^{(2)}}{\partial \theta_{22}^{(1)}} + \frac{\partial J\left( \theta \right)}{\partial a_2^{(3)}}\frac{\partial a_2^{(3)}}{\partial z_2^{(3)}}\frac{\partial z_2^{(3)}}{\partial a_2^{(2)}}\frac{\partial a_2^{(2)}}{\partial z_2^{(2)}}\frac{\partial z_2^{(2)}}{\partial \theta_{22}^{(1)}}$$
Woah there. We just went from a neural network with 2 parameters that needed 8 partial derivative terms in the previous example to a neural network with 8 parameters that needed 52 partial derivative terms. This is going to quickly get out of hand, especially considering many neural networks that are used in practice are much larger than these examples.
Fortunately, upon closer inspection many of these partial derivatives are repeated. If we're smart about how we approach this problem, we can dramatically reduce the computational cost of training. Further, it'd really be a pain in the ass if we had to manually calculate the derivative chains for every parameter. Let's look at what we've done so far and see if we can generalize a method to this madness.
Generalizing a method
Let's examine the partial derivatives above and make a few observations. We'll start with looking at the partial derivatives with respect to the parameters for layer 2. Remember, the parameters for layer 2 are combined with the activations in layer 2 to feed as inputs into layer 3.
Layer 2 Parameters
Let's analyze the four expressions we found above for the layer 2 parameters; I encourage you to solve the partial derivatives as we go along to convince yourself of my logic. (Reading the four chains as rows, I'll refer to the successive factors of each chain as "columns".)
Foremost, it seems as if the columns contain very similar values. For example, the first column contains partial derivatives of the cost function with respect to the neural network outputs, $\frac{\partial J\left( \theta \right)}{\partial a_i^{(3)}}$. In practice, this is the difference between the expected output and the actual output (scaled by whatever constant factor appears in the cost function).
The second column represents the derivative of the activation function used in the output layer. Note that for each layer, neurons will use the same activation function. A homogeneous activation function within a layer is necessary in order to be able to leverage matrix operations in calculating the neural network output. Thus, the value for the second column will be the same for all four terms.
The first and second columns can be combined into a single term, the "error" of each output neuron:

$$\delta_i^{(3)} = \frac{\partial J\left( \theta \right)}{\partial a_i^{(3)}}\frac{\partial a_i^{(3)}}{\partial z_i^{(3)}}$$
The third column represents how the weighted input for the current layer changes with respect to the parameter of interest; when you calculate this derivative, it corresponds with the activation from the previous layer.
I'd also like to note that the first two partial derivative terms seem concerned with the first output neuron (neuron 1 in layer 3) while the last two partial derivative terms seem concerned with the second output neuron (neuron 2 in layer 3). This is evident in the subscripts of the $a_i^{(3)}$ and $z_i^{(3)}$ factors of each chain.
Next, let's go ahead and calculate the last partial derivative term. As we noted, this partial derivative ends up representing activations from the previous layer:

$$\frac{\partial z_i^{(3)}}{\partial \theta_{ij}^{(2)}} = a_j^{(2)}$$
Note: I find it helpful to use the expanded neural network graphic in the previous section when calculating the partial derivatives.
It appears that we're combining the "error" terms with activations from the previous layer to calculate each partial derivative. It's also interesting to note that the indices $i$ and $j$ of $\delta_i^{(3)}$ and $a_j^{(2)}$ match the indices of the parameter we're differentiating with respect to:

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{ij}^{(2)}} = \delta_i^{(3)} a_j^{(2)}$$
Let's see if we can figure out a matrix operation to compute all of the partial derivatives in one expression. Collecting the "error" terms and the previous layer's activations into (row) vectors, $\delta^{(3)} = \left[ \delta_1^{(3)}\ \delta_2^{(3)} \right]$ and $a^{(2)} = \left[ a_1^{(2)}\ a_2^{(2)} \right]$, and multiplying these vectors together as an outer product, we can calculate all of the partial derivative terms in one expression:

$$\frac{\partial J\left( \theta \right)}{\partial \theta^{(2)}} = \left( \delta^{(3)} \right)^T a^{(2)} = \begin{bmatrix} \delta_1^{(3)} a_1^{(2)} & \delta_1^{(3)} a_2^{(2)} \\ \delta_2^{(3)} a_1^{(2)} & \delta_2^{(3)} a_2^{(2)} \end{bmatrix}$$
To summarize, see the graphic below.
Note: Technically, the first column should also be scaled by whatever constant factor appears in the cost function (such as the $\frac{1}{m}$ normalization over training examples); I've left this factor out of the notation to keep things uncluttered.
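Here's how that single matrix expression might look in NumPy. The sigmoid derivative and all of the numeric values are illustrative assumptions; only the outer-product structure comes from the derivation above.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    return f(z) * (1.0 - f(z))

# Quantities cached from forward propagation (example values).
a2 = np.array([0.55, 0.72])   # activations of layer 2
z3 = np.array([0.30, -0.10])  # weighted inputs to layer 3
a3 = f(z3)                    # network outputs
y = np.array([1.0, 0.0])      # target outputs

# "Error" terms for the output layer: the first two columns combined.
delta3 = (a3 - y) * f_prime(z3)

# The outer product (delta3)^T a2 gives all four layer 2 partial derivatives
# at once; entry (i, j) is dJ/dtheta_ij^(2) = delta_i^(3) * a_j^(2).
grad_theta2 = np.outer(delta3, a2)
print(grad_theta2)  # shape (2, 2)
```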
Layer 1 Parameters
Now let's take a look and see what's going on in layer 1.
Whereas the weights in layer 2 only directly affected one output, the weights in layer 1 affect all of the outputs. Recall the following graphic.
This results in a partial derivative of the cost function with respect to a parameter now becoming a summation of different chains. Specifically, we'll have a derivative chain for every output neuron the parameter can reach; each layer 1 parameter affects both outputs, so each partial derivative is a sum of two chains.
The first two columns (of each summed term) correspond with a $\delta^{(3)}$ "error" term that we already calculated when looking at layer 2.
The third column corresponds with some parameter that connects layer 2 to layer 3. If we considered $\frac{\partial z_i^{(3)}}{\partial a_j^{(2)}}$, calculating this derivative would yield exactly that parameter, $\theta_{ij}^{(2)}$.
We'll also redefine $\delta$ for the hidden layer, combining the backpropagated "error" terms with the derivative of the current layer's activation:

$$\delta_j^{(2)} = \left( \sum\limits_i \delta_i^{(3)} \theta_{ij}^{(2)} \right) f'\left( z_j^{(2)} \right)$$
We could write this more succinctly using matrix expressions:

$$\delta^{(2)} = \left( \delta^{(3)} \theta^{(2)} \right) \circ f'\left( z^{(2)} \right)$$
Note: It's standard notation to denote vectors with lowercase letters and matrices as uppercase letters. Thus, $\delta$, $a$, and $z$ here denote vectors, while the full collection of parameters between two layers forms a matrix (which I'll keep as a lowercase $\theta$ for consistency with the earlier posts).
The fourth column represents the derivative of the activation function used in the current layer; this is the $f'\left( z_j^{(2)} \right)$ factor in our new definition of $\delta$. Remember that for each layer, neurons will use the same activation function.
Lastly, the fifth column represents various inputs from the previous layer. In this case, it is the actual inputs to the neural network.
Let's take a closer look at one of the terms, $\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}}$.
Foremost, we established that the first two columns of each derivative chain were previously calculated as the $\delta^{(3)}$ terms.
Further, we established that the third column of each derivative chain was a parameter that acted to weight each of the respective $\delta^{(3)}$ terms.
Factoring out the common fourth-column term, $f'\left( z_1^{(2)} \right)$, we're left with our new definition of $\delta_1^{(2)}$:

$$\delta_1^{(2)} = \left( \delta_1^{(3)} \theta_{11}^{(2)} + \delta_2^{(3)} \theta_{21}^{(2)} \right) f'\left( z_1^{(2)} \right)$$
Lastly, we established that the fifth column ($\frac{\partial z_1^{(2)}}{\partial \theta_{11}^{(1)}}$) corresponds with an input from the previous layer; here, that's simply the network input $x_1$. Putting these pieces together:

$$\frac{\partial J\left( \theta \right)}{\partial \theta_{11}^{(1)}} = \delta_1^{(2)} x_1$$
Again, let's figure out a matrix operation to compute all of the partial derivatives in one expression. Multiplying the vector of layer 2 "error" terms with the vector of inputs, we can calculate all of the partial derivative terms in one expression:

$$\frac{\partial J\left( \theta \right)}{\partial \theta^{(1)}} = \left( \delta^{(2)} \right)^T x = \begin{bmatrix} \delta_1^{(2)} x_1 & \delta_1^{(2)} x_2 \\ \delta_2^{(2)} x_1 & \delta_2^{(2)} x_2 \end{bmatrix}$$
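Continuing the NumPy sketch from the layer 2 section, the layer 1 gradients fall out of the same two operations: backpropagate the error, then take an outer product. The parameter matrix and cached values below are again illustrative assumptions.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    return f(z) * (1.0 - f(z))

# Example values carried over from forward propagation.
x = np.array([0.1, 0.9])          # network inputs (layer 1 activations)
z2 = np.array([0.40, 0.25])       # weighted inputs to layer 2
theta2 = np.array([[0.3, -0.2],   # theta_ij^(2): row i = "to" neuron, column j = "from" neuron
                   [0.5,  0.7]])
delta3 = np.array([0.05, -0.12])  # output-layer "error" terms from before

# Backpropagate the error: delta^(2) = (delta^(3) theta^(2)) ∘ f'(z^(2)).
delta2 = (delta3 @ theta2) * f_prime(z2)

# All four layer 1 partial derivatives in one outer product.
grad_theta1 = np.outer(delta2, x)
print(grad_theta1)  # entry (i, j) is dJ/dtheta_ij^(1)
```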
If you've made it this far, congrats! We just calculated all of the partial derivatives necessary to use gradient descent and optimize our parameter values. In the next section, I'll introduce a way to visualize the process we've just developed in addition to presenting an end-to-end method for implementing backpropagation. If you understand everything up to this point, it should be smooth sailing from here on out.
Backpropagation
In the last section, we developed a way to calculate all of the partial derivatives necessary for gradient descent (partial derivatives of the cost function with respect to all model parameters) using matrix expressions. In calculating the partial derivatives, we started at the end of the network and, layer by layer, worked our way back to the beginning. We also developed a new term, $\delta$, the "error" attributed to each neuron, which allowed us to reuse calculations from later layers when computing the derivative chains for earlier layers.
Note: Backpropagation is simply a method for calculating the partial derivative of the cost function with respect to all of the parameters. The actual optimization of parameters (training) is done by gradient descent or another more advanced optimization technique.
Generally, we established that you can calculate the partial derivatives for the parameters feeding out of layer $l$ by combining the $\delta$ "error" terms of layer $l+1$ with the activations of layer $l$:

$$\frac{\partial J\left( \theta \right)}{\partial \theta^{(l)}} = \left( \delta^{(l+1)} \right)^T a^{(l)}$$
Backpropagation visualized
Before defining the formal method for backpropagation, I'd like to provide a visualization of the process.
First, we have to compute the output of a neural network via forward propagation.
Next, we compute the $\delta$ "error" terms for the final layer of the network.
Next, we "send back" the "error" terms in the exact same manner that we "send forward" the inputs to a neural network. The only difference is that time, we're starting from the back and we're feeding an error term, layer by layer, backwards through the network. Hence the name: backpropagation. The act of "sending back our error" is accomplished via the expression
For every layer except for the last, the "error" term is a linear combination of parameters connecting to the next layer (moving forward through the network) and the "error" terms of that next layer. This is true for all of the hidden layers, since we don't compute an "error" term for the inputs.
The last layer is a special case because we calculate the $\delta$ values directly by comparing the network's output with the target output:

$$\delta^{(L)} = \left( a^{(L)} - y \right) \circ f'\left( z^{(L)} \right)$$
A formalized method for implementing backpropagation
Here, I'll present a practical method for implementing backpropagation through a network of layers $l = 1, 2, \dots, L$.

1. Perform forward propagation.

2. Compute the $\delta$ term for the output layer:
$$\delta^{(L)} = \left( a^{(L)} - y \right) \circ f'\left( z^{(L)} \right)$$

3. Compute the partial derivatives of the cost function with respect to all of the parameters that feed into the output layer, $\theta^{(L-1)}$:
$$\frac{\partial J\left( \theta \right)}{\partial \theta^{(L-1)}} = \left( \delta^{(L)} \right)^T a^{(L-1)}$$

4. Go back one layer.

5. Compute the $\delta$ term for the current hidden layer:
$$\delta^{(l)} = \left( \delta^{(l+1)} \theta^{(l)} \right) \circ f'\left( z^{(l)} \right)$$

6. Compute the partial derivatives of the cost function with respect to all of the parameters that feed into the current layer:
$$\frac{\partial J\left( \theta \right)}{\partial \theta^{(l-1)}} = \left( \delta^{(l)} \right)^T a^{(l-1)}$$

7. Repeat steps 4 through 6 until you reach the input layer.
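To tie the seven steps together, here's a compact end-to-end sketch for a network with an arbitrary list of layer sizes. It assumes a sigmoid activation and the squared error cost used throughout this post, treats $\delta$ and $a$ as 1-D arrays rather than row vectors, and (like our examples) disregards bias terms; a real implementation would handle those as well.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    return f(z) * (1.0 - f(z))

def backprop(x, y, thetas):
    """One forward + backward pass for a single observation.

    x, y: 1-D input and target vectors.
    thetas: list of weight matrices; thetas[l][i, j] connects neuron j of one
    layer to neuron i of the next (the "mailing label" convention).
    Returns the gradient of the cost with respect to each weight matrix.
    """
    # Step 1: forward propagation, caching activations and weighted inputs.
    activations = [x]  # a^(1) is the input
    zs = []            # z^(2), ..., z^(L)
    for theta in thetas:
        z = theta @ activations[-1]
        zs.append(z)
        activations.append(f(z))

    # Step 2: delta term for the output layer.
    delta = (activations[-1] - y) * f_prime(zs[-1])

    grads = [None] * len(thetas)
    # Step 3: partial derivatives for the parameters feeding the output layer.
    grads[-1] = np.outer(delta, activations[-2])

    # Steps 4-7: walk backwards through the hidden layers.
    for l in range(len(thetas) - 2, -1, -1):
        delta = (thetas[l + 1].T @ delta) * f_prime(zs[l])
        grads[l] = np.outer(delta, activations[l])
    return grads

# Usage: a 2-2-2 network like the example above.
rng = np.random.default_rng(0)
thetas = [rng.uniform(-0.12, 0.12, size=(2, 2)) for _ in range(2)]
grads = backprop(np.array([0.1, 0.9]), np.array([1.0, 0.0]), thetas)
```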
Revisiting the weights initialization
When we started, I proposed that we just randomly initialize our weights in order to have a place to start. This allowed us to perform forward propagation, compare the outputs to the expected values, and compute the cost of our model.
It's actually very important that we initialize our weights to random values so that we can break the symmetry in our model. If we had initialized all of our weights to the same value, every neuron in the next layer forward would be equal to the same linear combination of values and would thus have the same activation.
By this same logic, the $\delta$ "error" terms would be identical for every neuron within a given layer.
Further, because we calculate the partial derivatives in any given layer by combining $\delta$ terms and activations, all of the partial derivatives within a layer would be identical; every weight in a layer would receive the same update, the weights would remain equal to one another forever, and our neurons could never learn to detect different features.
Randomly initializing the network's weights allows us to break this symmetry and update each weight individually according to its relationship with the cost function. Typically we'll assign each parameter to a random value in a small interval around zero, such as $\left[ -\epsilon, \epsilon \right]$.
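A one-function sketch of that initialization follows; the value $\epsilon = 0.12$ is a common heuristic, not a prescription from this post.

```python
import numpy as np

def init_layer(n_to: int, n_from: int, epsilon: float = 0.12) -> np.ndarray:
    """Draw each parameter uniformly from [-epsilon, epsilon].

    Identical initial values would make every neuron in a layer compute the
    same activation and receive the same gradient, so they'd never diverge.
    """
    return np.random.uniform(-epsilon, epsilon, size=(n_to, n_from))
```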
Putting it all together
After we've calculated all of the partial derivatives for the neural network parameters, we can use gradient descent to update the weights.
In general, we defined gradient descent as

$$\theta := \theta - \alpha \frac{\partial J\left( \theta \right)}{\partial \theta}$$

where $\alpha$ is the learning rate, a parameter which controls the size of each update step.
We'll use this formula to update each of the weights, recompute forward propagation with the new weights, backpropagate the error, and calculate the next weight update. This process continues until we've converged on an optimal value for our parameters.
During each iteration we perform forward propagation to compute the outputs and backward propagation to compute the errors; one complete iteration is known as an epoch. It is common to report evaluation metrics after each epoch so that we can watch the evolution of our neural network as it trains.
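Putting the whole procedure together, one training run might look like the following sketch, which assumes the `f` and `backprop` functions from the formal-method example above are in scope. The learning rate, epoch count, and reporting interval are arbitrary illustrative choices.

```python
import numpy as np

def forward(x, thetas):
    """Forward propagation only, for evaluating the trained network."""
    a = x
    for theta in thetas:
        a = f(theta @ a)
    return a

def train(X, Y, thetas, alpha=0.5, epochs=1000):
    """Batch gradient descent: forward propagate, backpropagate, update, repeat."""
    m = len(X)
    for epoch in range(epochs):
        # Accumulate gradients over all m observations.
        grads = [np.zeros_like(theta) for theta in thetas]
        for x, y in zip(X, Y):
            for g, g_i in zip(grads, backprop(x, y, thetas)):
                g += g_i
        # Update each weight matrix: theta := theta - alpha * dJ/dtheta.
        for theta, g in zip(thetas, grads):
            theta -= alpha * g / m
        # Report an evaluation metric as the epochs go by.
        if epoch % 100 == 0:
            cost = sum(0.5 * np.sum((forward(x, thetas) - y) ** 2)
                       for x, y in zip(X, Y)) / m
            print(f"epoch {epoch}: cost {cost:.4f}")
    return thetas
```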