The above diagram shows a neuron in a NN; it simulates a real neuron:
it has inputs: $x$, the input vector
it has a weight for each input: $w$, the weight vector
it has a bias $b$
it has a threshold for the “activation function”
The neuron processes the inputs with the equation inside the circle (the weighted sum of the inputs plus the bias, $wx + b$). If the result is smaller than the threshold, the neuron is not activated and outputs 0; if the result is bigger than the threshold, the neuron is activated and outputs 1.
We can consider a neuron as a binary classifier.
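The thresholded neuron described above can be sketched in a few lines of Python. This is only an illustration, not code from the original post: the function name `neuron_output`, the example weights and bias, and the threshold of 0 are all assumptions chosen for the example.

```python
def neuron_output(x, w, b, threshold=0.0):
    """Binary neuron: 1 if the weighted sum plus bias exceeds the threshold, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > threshold else 0

# Hypothetical 3-input neuron (weights and bias are made up for illustration).
w = [0.5, -1.0, 2.0]
b = -0.5

print(neuron_output([1.0, 0.0, 1.0], w, b))  # weighted sum is 2.0, above the threshold
print(neuron_output([0.0, 1.0, 0.0], w, b))  # weighted sum is -1.5, below the threshold
```

The two possible outputs, 1 and 0, are what lets the neuron act as a binary classifier.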
Sometimes we choose the activation function as :
Let’s make it a practical case:
$x$ is a 3072x1 feature vector, then for this neuron,
$w$ is a 1x3072 weight vector,
$b$ is a real-valued bias.
If we have 100 neurons on the 1st layer, they all get the same input but have different weight vectors. We can combine these 100 neurons into one big weight matrix $W_1$ of size 100x3072; each row of $W_1$ is the weight vector of one neuron. The output of the 1st layer is then a 100x1 vector, since each neuron outputs one value. This means that information is extracted and its dimension is reduced from 3072 to 100. With $f$ as the activation function and $b_1$ as the 100x1 bias vector, the equation for the 1st layer will be:
$$f(W_1 x + b_1)$$
Now we add the 2nd layer and set it to 10 neurons. It receives the 100 outputs of the 1st layer as its inputs and produces 10 outputs. Combining the 1st and the 2nd layer:
$$f(W_2 f(W_1 x + b_1) + b_2)$$
$W_2$ is a 10x100 weight matrix that contains all the computation of the 2nd layer; each row of $W_2$ is the weight vector of one neuron in the 2nd layer. $b_2$ is the 10x1 bias vector of the 2nd layer.
Of course we can add a 3rd layer, a 4th layer and so on; the whole NN can then be represented as a chain of activation functions: $f(W_n \cdots f(W_2 f(W_1 x + b_1) + b_2) \cdots + b_n)$.
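To make the shapes concrete, here is a rough sketch of the two-layer forward pass in plain Python. The helper names (`layer_forward`, `random_matrix`), the random initialization, and the choice of sigmoid as the activation are assumptions made for the example; only the sizes 3072, 100 and 10 come from the text.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x, f):
    """One fully connected layer: f(row . x + bias) for every row (neuron) of W."""
    return [f(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def random_matrix(rows, cols, rng, scale=0.01):
    """Hypothetical initialization: small Gaussian weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
x = [rng.random() for _ in range(3072)]   # 3072x1 feature vector

W1 = random_matrix(100, 3072, rng)        # 1st layer: 100 neurons, one row each
b1 = [0.0] * 100
W2 = random_matrix(10, 100, rng)          # 2nd layer: 10 neurons
b2 = [0.0] * 10

h1 = layer_forward(W1, b1, x, sigmoid)    # output of the 1st layer
out = layer_forward(W2, b2, h1, sigmoid)  # output of the 2nd layer

print(len(h1), len(out))  # 100 10
```

Each extra layer is just one more call to `layer_forward` on the previous output, which is the chain of activation functions written out as code.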
More Possibilities for the Activation Function
Previously, we usually used the sigmoid function as the activation function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
One reason to use this function is that it is differentiable: we use gradient descent to train the NN, so we need the activation function to be differentiable.
Another reason is that this function always outputs a value between 0 and 1.
And it can be explained physically: its output value can be understood as a probability:
Using regularization, we may notice that $w$ gets smaller as the training process goes on. This can be explained as forgetting: a smaller $w$ means the information in each neuron is weaker, which means the NN is forgetting.
The 3rd reason is historical ….
But it is rarely used now, because it has the following drawbacks:
1, When $x \gg 0$, the sigmoid’s gradient is almost 0; the same thing happens when $x \ll 0$. During back propagation, if $w$ is very big (it may be initialized this way), the gradient will be very small, so the update in every training iteration will be tiny: the weights barely update at all, and the learning efficiency will be very low.
2, The sigmoid function always outputs a positive value. During back propagation, the gradient of $w$ is then either all positive or all negative, which brings zig-zag dynamics to the weight updates; this is also inefficient.
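Both drawbacks can be checked numerically. The sketch below uses the standard identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ for the sigmoid's derivative; the function names and the sample inputs -10, 0 and 10 are chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  gradient={sigmoid_grad(x):.6f}")
```

The gradient peaks at 0.25 for $x = 0$ and is below 0.0001 at $x = \pm 10$, which is the saturation described in the 1st drawback; the sigmoid value itself stays positive for every input, which is the source of the 2nd.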
To improve on the sigmoid, Tanh was invented. It uses the sigmoid’s nonlinearity; actually it is a scaled sigmoid:
$$\tanh(x) = 2\sigma(2x) - 1$$
Its output lies between -1 and 1 and is centered at zero, so it overcomes the sigmoid’s 2nd drawback.
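The "scaled sigmoid" relationship, $\tanh(x) = 2\sigma(2x) - 1$, is easy to verify numerically. The helper name `tanh_from_sigmoid` and the sample points are assumptions made for this small check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """Tanh written as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # Compare against the standard library implementation.
    assert math.isclose(tanh_from_sigmoid(x), math.tanh(x), abs_tol=1e-12)
    print(f"x={x:5.1f}  tanh={tanh_from_sigmoid(x):.6f}")
```

Unlike the sigmoid, the printed values are negative for negative inputs and positive for positive inputs.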
The Rectified Linear Unit (ReLU) is another alternative. This is the one we used in “Practical Thinking”, and it’s simple:
$$f(x) = \max(0, x)$$
It has two advantages:
1, The computing cost is very low: simple thresholding, compared with the sigmoid’s exponentials.
2, Convergence with stochastic gradient descent is very fast.
Its drawback is that it may die after some updates: the neuron may end up always deactivated and always output 0. Why? Take a look at the following diagram:
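In order to overcome the drawback of ReLU, the Leaky ReLU introduces a small non-zero slope for negative inputs:
$f(x) = \alpha x$ when $x < 0$ ($\alpha$ is a small constant value), and $f(x) = x$ when $x \ge 0$.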
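The difference between ReLU and Leaky ReLU shows up most clearly in their gradients. The following is a minimal sketch; the function names and the default `alpha=0.01` are assumptions for illustration, and the gradient at exactly $x = 0$ is set by convention here.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Zero gradient for negative inputs: no learning signal reaches the weights.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # A small but non-zero gradient for negative inputs keeps the neuron trainable.
    return 1.0 if x > 0 else alpha

for x in (-3.0, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

If a ReLU neuron's input is negative for every training example, `relu_grad` is always 0 and its weights stop changing, which is one way to read the "dying" behavior; the leaky variant still passes back a gradient of `alpha`.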
Maxout’s idea is not to use a fixed activation function but to learn one.
A Maxout neuron contains more than one $(w, b)$ pair.
For a traditional neuron, we get $z = wx + b$.
For a Maxout neuron, we get
$z_1 = w_1 x + b_1$, $z_2 = w_2 x + b_2$, …, and $z_k = w_k x + b_k$.
The final result will be the max of all $z_i$:
$$\max(z_1, z_2, \dots, z_k)$$
$k$ means the Maxout neuron contains $k$ linear models.
The largest $z_i$ wins and is used as the output; that’s why it is called “Maxout”.
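A Maxout neuron can be sketched as a max over several linear models. The function name `maxout` and the example pairs below are hypothetical; the second part relies on the known fact that a Maxout neuron with two pairs, one of them all zeros, behaves like a ReLU.

```python
def maxout(x, pairs):
    """Maxout neuron: the largest of w.x + b over all (w, b) pairs."""
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in pairs)

# Hypothetical neuron with k = 3 linear models on a 2-D input.
pairs = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.5), ([-1.0, -1.0], 0.0)]
print(maxout([2.0, 1.0], pairs))   # candidates are 2.0, 1.5 and -3.0

# With (w1, b1) = (1, 0) and (w2, b2) = (0, 0), Maxout reduces to ReLU on a 1-D input.
relu_pairs = [([1.0], 0.0), ([0.0], 0.0)]
print(maxout([3.0], relu_pairs))   # 3.0
print(maxout([-3.0], relu_pairs))  # 0.0
```

Because every $w_i$ and $b_i$ is trained, the shape of the resulting piecewise-linear activation is learned rather than fixed.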
Which Type of Neuron should we use?
We should follow this priority:
Maxout/Leaky ReLU > ReLU > Tanh
Never use Sigmoid