The case we are handling: a 2 layers network
The above diagram shows the network to be used. From the last blog we get the loss function:
In order to use the gradient descent algorithm to train and , we need to compute the derivative of to and , which are:
If we have by hand, then can be trained using gradient descent:
How to compute ? We use the chain rule to “propagate back” the gradient, for :
As the last blog described, of the sample can be expressed as:
Now we will compute to train . The gradient will be propagated back to like this:
the same as before, we know that , the above equation results in:
is already known, depends on the activation function of the hidden layer.
Using this method, we can easily propagate the gradient back to the input layer through the whole network and update all the layers, no matter how many layers are there in between.