SVM:multi-class SVM regularization
For the ith sample in the training set, we have the following loss function:
is the score classifying to class j，and is the score classifying correctly(classify to class )， is the th row of .
Considering the geometrical meaning of the weight vector , it is easy to find out that is not unique, can change in a small area and result in the same .
It the values in is scaled, the loss computed will also be scaled by the same ratio. Considering a loss of 15, if we scale all the weights in by 2, the loss will be scaled to 30. But this kind of scaling is meaningless, it doesn’t really represent the loss.
The method to solve the above two problems is adding a penalty term into the loss function to penalize the weight matrix:
The penalized loss function turns to be:
We can understand the first term as “data loss” term, its physical meaning is the summation of all loss values when applying the weight matrix to all data samples.
The second the term is the so called “regularization loss”, it represent the loss from the weight matrix ’s own structure, which means this term is only related to the weight matrix and not related to the sample data.
The equation after the full expansion is:
Let’s review the meaning of the symbols:
means there are N samples in the training set.
means the th sample in all these samples.
means the class of the th sample
means all the other classes except class in all K classes
means the th row of the weight matrix , the same for
is the confidential interval.
is the regularization ratio.
Usually we use as the regularization term, because when updating the , we need to compute the gradient respect to , by adding a in front the gradient of the regularization term will be instead of .
This regularization on is the so called L2 regularization, adding this to the loss function means for every update step, the weights is decayed linearly:
Why regularization can solve the problems?
For problem 1, as we added the loss of the weight matrix itself, we can finally get an unique solution.
For problem 2, using the weight matrix’s form as the penalty term, means we want the weights in the weight matrix to be small, which prevents the weights in from meaningless scaling.
The smaller is the weight matrix’s form, we know that the smaller will be the weights in , what’s more, it also contains another information: the weights in will distributed more equally:
The summation of vector and the vector are the same, but their norms are quite different, norm is 1 but norm is 0.25. If both of them can all do the classification job correctly, we apparently will choose , which has the smaller penalty, because its distribution is more equal, and the more equally distributed weight vector means all the pixels in a sample image are considered. ’s classification result may also be good, but it only consider the 1st pixel, which gets a weight of 1, the left 3 pixels get the weights of 0, which means they are not considered for classification.
But why we can still do the classification correctly without considering the other 3 pixels? It means the training samples in our training set is very special: the 1st pixels in all the samples occasionally contains enough information for us to do the classification, but in the real case, all the 4 pixels are equally important. Now if we want to classify a new sample image whose 1st pixel doesn’t contain enough information to classify the image, the can not classify this image anymore, this is the so-called “weak generalization ability” . If we choose , it will still work, as it takes all the 4 pixels into consideration.
L1 regularization means adding , it is also possible to combine L1 and L2 like:
which is called elastic net regularization. As we said that L2 regularization makes the decays linearly, for L1 regularization the weights will be reduced by a constant in every update step. Finally all the none-important weights will be disappeared and only the important weights will be left. Which means by using L1 regularization the system will be invariant to the “noisy” inputs. In comparison, using L2 will lead to diffuse, small weights.
Max norm constraints
Just as its name described, we can also enforce a upper bound for of each neuron. In practice, after the normal update, we multiple all weights in by a constant to make sure that , is the upper bound we set, which is usually of the order 3 or 4.