SVD, PCA and the Least Squares Problem
The idea behind PCA
In the field of machine learning, PCA is used to reduce the dimension of features. Usually we collect a lot of features to feed the machine learning model, believing that more features provide more information and will lead to better results.
But some of the features don't really bring new information, because they are correlated with other features. PCA is introduced to remove this correlation by approximating the original data in a subspace: some features of the original data may be correlated with each other in the original space, but this correlation doesn't exist in the subspace approximation.
Visually, let’s assume that our original data points have 2 features and they can be visualized in a 2D space:
In the above diagram, the red crosses represent our original data points. We want to find a line $l$ which approximates the original points by projecting them onto $l$.
Which kind of $l$ best approximates the points? Naturally, we want the information of the original data points to be kept in the approximated data points, which means making sure that the projected points stay far away from each other.
Mathematically, we want to find the $l$ which maximizes the variance of the projected data points.
$X = \{x_1, x_2, \dots, x_n\}$: the data points
$x_i$: the $i$th data point, a $2 \times 1$ feature vector containing 2 features
$l$: the subspace approximating $X$. Since the $x_i$ are located in a 2D space, $l$ will be a 1D line, parameterized by a $2 \times 1$ vector $u$: $u$ is a unit directional vector, representing the direction of the line.
Before everything, we first preprocess the data points by z-score normalization:
Compute the mean of the $x_i$: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Compute the standard deviation of the $x_i$: $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}$
z-score normalize every $x_i$: $x_i \leftarrow \frac{x_i - \mu}{\sigma}$
In order to simplify the symbols, we keep writing $x_i$ for the normalized data points.
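The preprocessing steps above can be sketched in NumPy as follows (the variable names `X`, `mu`, `sigma` are illustrative, not from the original text; the data here is randomly generated just for demonstration):

```python
import numpy as np

# A small illustrative sketch: 6 data points with 2 features each.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(6, 2))  # each row is one data point x_i

mu = X.mean(axis=0)                  # per-feature mean
sigma = X.std(axis=0)                # per-feature standard deviation
X_norm = (X - mu) / sigma            # z-score normalize every x_i

# After normalization each feature has zero mean and unit variance.
print(np.allclose(X_norm.mean(axis=0), 0))  # True
print(np.allclose(X_norm.std(axis=0), 1))   # True
```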
Obviously, for a certain $x_i$, the distance from its projection onto $l$ to the origin is its dot product with $u$: $u^T x_i$
As the $x_i$ are already centered, the mean of the projections is zero and their variance is: $\frac{1}{n}\sum_{i=1}^{n} (u^T x_i)^2$
The constant factor $\frac{1}{n}$ doesn't affect the maximizer, so our goal will be: $\max_{u} \sum_{i=1}^{n} (u^T x_i)^2 \quad \text{s.t.} \quad \|u\| = 1$
We want to get rid of the $\sum$ in the objective and treat everything as matrices. In order to do this, we stack all the $x_i^T$ together as rows to form a big feature matrix $A$, as the following diagram describes:
Then $\sum_{i=1}^{n} (u^T x_i)^2$ turns into $\|Au\|^2 = u^T A^T A u$. Our maximization problem turns into: $\max_{u} u^T A^T A u \quad \text{s.t.} \quad u^T u = 1$
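A quick numeric sketch of this identity, with the rows of an illustrative matrix `A` playing the role of the $x_i^T$ (the data and names here are assumptions for demonstration only):

```python
import numpy as np

# Check that sum_i (u^T x_i)^2 equals u^T A^T A u when the rows of A are x_i^T.
rng = np.random.default_rng(1)
A = rng.normal(size=(10, 2))
A -= A.mean(axis=0)                      # center the data, as in the text

u = np.array([1.0, 1.0]) / np.sqrt(2)    # an arbitrary unit direction

sum_form = np.sum((A @ u) ** 2)          # sum_i (u^T x_i)^2
matrix_form = u @ A.T @ A @ u            # u^T A^T A u
print(np.isclose(sum_form, matrix_form)) # True
```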
Solve it using Lagrange multipliers and SVD
Since we know that $u^T u = 1$, this problem can be solved using a Lagrange multiplier: $L(u, \lambda) = u^T A^T A u - \lambda (u^T u - 1)$
According to the Karush–Kuhn–Tucker conditions, setting the derivative with respect to $u$ to zero, we get: $\frac{\partial L}{\partial u} = 2 A^T A u - 2 \lambda u = 0 \;\Rightarrow\; A^T A u = \lambda u$
Obviously, $u$ and $\lambda$ are an eigenvector and eigenvalue of $A^T A$.
But $A^T A$ has more than one eigenvector; which one is the $u$ we are looking for? In order to find it, we replace $A^T A u$ in the original objective by $\lambda u$: $u^T A^T A u = u^T \lambda u = \lambda u^T u = \lambda$
Which means we can maximize $u^T A^T A u$ by simply choosing the biggest eigenvalue and its corresponding eigenvector.
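This selection rule can be sketched with NumPy's eigendecomposition (an illustrative example on random centered data; `eigh` is appropriate here because $A^T A$ is symmetric):

```python
import numpy as np

# Sketch: find the direction u maximizing u^T A^T A u by taking the
# eigenvector of A^T A with the largest eigenvalue.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 2))
A -= A.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # eigh returns ascending eigenvalues
u = eigvecs[:, -1]                           # eigenvector of the largest one

# The objective at u equals the largest eigenvalue, and no other unit
# vector can exceed it.
print(np.isclose(u @ A.T @ A @ u, eigvals[-1]))   # True
v = rng.normal(size=2)
v /= np.linalg.norm(v)                            # some other unit direction
print(v @ A.T @ A @ v <= eigvals[-1] + 1e-9)      # True
```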
Then we can use SVD to decompose $A$: $A = U \Sigma V^T$
For $A^T A$, we get: $A^T A = V \Sigma^T U^T U \Sigma V^T = V (\Sigma^T \Sigma) V^T$
Now, as $\Sigma^T \Sigma$ is a diagonal matrix, we can obviously see that the columns of $V$ are the eigenvectors of $A^T A$ and the diagonal values of $\Sigma^T \Sigma$ (the squared singular values) are its eigenvalues.
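This relationship between the SVD of $A$ and the eigendecomposition of $A^T A$ can be verified numerically (a minimal sketch on random centered data; all names are illustrative):

```python
import numpy as np

# Sketch: the right singular vectors of A are eigenvectors of A^T A,
# with the squared singular values as eigenvalues: (A^T A) v_j = s_j^2 v_j.
rng = np.random.default_rng(3)
A = rng.normal(size=(50, 3))
A -= A.mean(axis=0)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # rows of Vt are the v_j^T

for j in range(3):
    v = Vt[j]
    print(np.allclose(A.T @ A @ v, s[j] ** 2 * v))  # True for every j
```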
Its relationship to least squares
As described above, we want to maximize the variance of the projected data points: $\max_{u} \sum_{i=1}^{n} (u^T x_i)^2$
Actually, as the $x_i$ in the above equation are data points which will never change, it does no harm to add the constant term $-\sum_{i=1}^{n} x_i^T x_i$ into the objective: $\max_{u} \sum_{i=1}^{n} (u^T x_i)^2 - \sum_{i=1}^{n} x_i^T x_i$
Solving this maximization problem is equivalent to solving the following minimization problem: $\min_{u} \sum_{i=1}^{n} \left( x_i^T x_i - (u^T x_i)^2 \right)$
Let’s take a look at the diagram again:
Obviously, by the Pythagorean theorem, $x_i^T x_i - (u^T x_i)^2$ is equal to the squared distance $\|x_i - (u^T x_i) u\|^2$ from $x_i$ to the line, so our minimization problem turns into: $\min_{u} \sum_{i=1}^{n} \|x_i - (u^T x_i) u\|^2$
Which means we are looking for a line $l$ which minimizes the sum of the squared projection distances of the data points onto this line; this is a least squares problem. Now we notice that maximizing the variance is equivalent to minimizing the squared error.
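The equivalence can be checked numerically: for any unit vector $u$, the sum of squared projections plus the sum of squared residual distances is the constant $\sum_i x_i^T x_i$, so maximizing one term minimizes the other (a sketch on random centered data; all names are illustrative):

```python
import numpy as np

# For every unit u: sum_i (u^T x_i)^2 + sum_i ||x_i - (u^T x_i) u||^2
# equals sum_i x_i^T x_i, which is fixed by the data.
rng = np.random.default_rng(4)
A = rng.normal(size=(30, 2))        # rows are the data points x_i^T
A -= A.mean(axis=0)

total = np.sum(A ** 2)              # sum_i x_i^T x_i, a constant
for _ in range(5):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                         # random unit direction
    proj = np.sum((A @ u) ** 2)                    # sum of squared projections
    resid = np.sum((A - np.outer(A @ u, u)) ** 2)  # sum of squared distances to the line
    print(np.isclose(proj + resid, total))         # True for every u
```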