It is tempting to simply set all the weights to zero, but this fails badly: if all the weights start out identical, every neuron computes the same output and receives the same gradient during the backpropagation update, so the neurons stay identical forever. We don't need so many identical neurons. In fact, this symmetry problem exists whenever the weights are initialized to the same value, zero or not.
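The symmetry is easy to see in a tiny example. The following sketch (my own toy setup: a one-hidden-layer tanh network with every weight set to the same constant) runs one backpropagation step and checks that all hidden units still have identical incoming weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 features
t = rng.normal(size=(4, 1))          # regression targets

W1 = np.full((3, 5), 0.5)            # every weight identical
W2 = np.full((5, 1), 0.5)

# forward pass
h = np.tanh(x @ W1)                  # all 5 hidden units compute the same value
y = h @ W2

# backward pass for squared error loss
dy = (y - t) / len(x)
dW2 = h.T @ dy
dh = dy @ W2.T
dW1 = x.T @ (dh * (1 - h ** 2))      # tanh derivative

# one gradient step
W1 -= 0.1 * dW1
W2 -= 0.1 * dW2

# every column of W1 (one column per hidden unit) is still identical
print(np.allclose(W1, W1[:, :1]))    # True
```

The same check passes after any number of steps: identical units receive identical gradients, so no update can ever tell them apart.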
Small random values
One natural fix for the all-zero initialization is to use small random values, e.g. drawing each weight from a zero-mean Gaussian with a small standard deviation such as $0.01$. This breaks the symmetry but is still problematic: very small weights produce very small activations, and the update values become smaller and smaller as they propagate backward through the layers. In a deep network this problem is serious, and you may find that the layers closest to the input never update.
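The shrinkage is easy to observe. This sketch (the widths, depth, and seed are arbitrary choices of mine) pushes a unit-variance batch through ten tanh layers initialized with small random values and records the standard deviation of the activations at each layer:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 500))     # unit-variance input batch

stds = [h.std()]
for _ in range(10):
    W = 0.01 * rng.normal(size=(500, 500))   # "small random values"
    h = np.tanh(h @ W)
    stds.append(h.std())

# the std of the activations collapses toward 0 with depth
print(stds[0], stds[-1])
```

Since the gradient flowing back through each layer is multiplied by these ever-smaller activations, the early layers see gradients that are vanishingly small.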
So, in order to keep the signal stable as it flows through the network, we want the variance of each neuron's output to be the same as the variance of its input. The expectations (means) of both the weights $w_i$ and the inputs $x_i$ are all initialized to be 0, so we only need to care about the variance. Let $y$ be the output and $x_1, \dots, x_n$ the inputs; our target will be:

$$\mathrm{Var}(y) = \mathrm{Var}(x_i)$$
What we have is:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
According to the variance property (the variance of a sum of independent variables is the sum of their variances), we get:

$$\mathrm{Var}(y) = \mathrm{Var}(w_1 x_1) + \mathrm{Var}(w_2 x_2) + \cdots + \mathrm{Var}(w_n x_n)$$
As we know that $w_i$ and $x_i$ are independent and both zero-mean, $\mathrm{Var}(w_i x_i) = E[w_i^2]\,E[x_i^2] - \left(E[w_i]\,E[x_i]\right)^2$, so we get:

$$\mathrm{Var}(w_i x_i) = \mathrm{Var}(w_i)\,\mathrm{Var}(x_i)$$
Considering all the $w_i$ are identically distributed, we use $\mathrm{Var}(w)$ to represent the variance of each $w_i$; the same for the $x_i$, we use $\mathrm{Var}(x)$ to represent the variance of each $x_i$. Then we get:

$$\mathrm{Var}(y) = n\,\mathrm{Var}(w)\,\mathrm{Var}(x)$$
It is clear now that in order to make the variance of the output $y$ the same as the variance of the input $x$, we have to make:

$$n\,\mathrm{Var}(w) = 1, \quad \text{i.e.} \quad \mathrm{Var}(w) = \frac{1}{n}$$
Considering the variance's property, for a random variable $X$ and scalar $a$:

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$$
So if we initialize the weights by drawing from a standard Gaussian and scaling by $\frac{1}{\sqrt{n}}$:

$$w_i = \frac{g_i}{\sqrt{n}}, \quad g_i \sim \mathcal{N}(0, 1)$$
We can guarantee $\mathrm{Var}(w) = \frac{1}{n}$ and therefore $\mathrm{Var}(y) = \mathrm{Var}(x)$.
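A quick numerical check of this result (the layer width and batch size here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(5000, n))              # unit-variance inputs
W = rng.normal(size=(n, n)) / np.sqrt(n)    # standard Gaussian scaled by 1/sqrt(n), so Var(w) = 1/n
y = x @ W

# the variance is preserved through the layer: both values are close to 1
print(x.var(), y.var())
```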
According to more recent research (He et al.), when using ReLU the weight variance should be $\frac{2}{n}$, so we should initialize with $w_i = g_i \sqrt{\frac{2}{n}}$ instead: the ReLU zeroes out half of the activations, which halves the variance at each layer, and the extra factor of 2 compensates for it.
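The difference the extra factor of 2 makes can be sketched numerically. Pushing a batch through ten ReLU layers (widths, depth, and seed are arbitrary choices of mine), $\mathrm{Var}(w) = \frac{1}{n}$ halves the mean squared activation at every layer, while $\mathrm{Var}(w) = \frac{2}{n}$ keeps it roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
h_small = h_he = rng.normal(size=(2000, n))   # same unit-variance input batch
for _ in range(10):
    W_small = rng.normal(size=(n, n)) * np.sqrt(1.0 / n)  # Var(w) = 1/n
    W_he = rng.normal(size=(n, n)) * np.sqrt(2.0 / n)     # Var(w) = 2/n (He init)
    h_small = np.maximum(0, h_small @ W_small)            # ReLU
    h_he = np.maximum(0, h_he @ W_he)

# mean squared activation after 10 layers:
# roughly 2^-10 for the 1/n scaling, roughly 1 for the 2/n scaling
print((h_small ** 2).mean(), (h_he ** 2).mean())
```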
To be studied later: Batch Normalization
Batch normalization has been shown to be very helpful; it still needs to be studied . . .