From the last article, we obtained the following negative log-likelihood function as our optimization target:

The optimization problem then becomes:

This article will explain how this optimization problem can be solved using the Gauss-Newton method.
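As a preview of the method, the Gauss-Newton iteration for a generic nonlinear least-squares problem can be sketched as follows. The exponential model, its Jacobian, and all parameter values here are illustrative assumptions for the sketch, not taken from the article:

```python
import numpy as np

# Hypothetical toy problem: fit y = exp(a * x); the parameter vector is p = [a].
def residual(p, x, y):
    return np.exp(p[0] * x) - y

def jacobian(p, x):
    # d r_i / d a = x_i * exp(a * x_i), stacked as an (N, 1) Jacobian
    return (x * np.exp(p[0] * x)).reshape(-1, 1)

def gauss_newton(p, x, y, iters=20):
    for _ in range(iters):
        r = residual(p, x, y)
        J = jacobian(p, x)
        # Solve the normal equations  J^T J dp = -J^T r  for the update step
        dp = np.linalg.solve(J.T @ J, -J.T @ r)
        p = p + dp
    return p

x = np.linspace(0.0, 1.0, 50)
y = np.exp(0.7 * x)                      # noise-free data, true a = 0.7
p = gauss_newton(np.array([0.0]), x, y)  # converges to a ≈ 0.7
```

Each iteration linearizes the residual around the current estimate and solves the resulting linear least-squares problem; with a zero-residual problem like this one, convergence near the optimum is quadratic.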

Usually, graph optimization papers and tutorials start by posing the optimization problem of minimizing the following function:

This article will explain where this optimization problem comes from and why it is related to Gaussian noise.

The probabilistic modeling of graph optimization is based on Maximum Likelihood Estimation (MLE).
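To sketch the connection between MLE and the least-squares objective: assume each measurement is the true value corrupted by Gaussian noise, i.e. $z_i = h_i(x) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \Sigma_i)$. The symbols $h_i$ and $\Sigma_i$ are generic placeholders here, not notation taken from the article. Then

$$p(z_i \mid x) \propto \exp\!\left(-\tfrac{1}{2}\, e_i^\top \Sigma_i^{-1} e_i\right), \qquad e_i = z_i - h_i(x),$$

and maximizing the likelihood of all measurements is equivalent to minimizing the negative log-likelihood:

$$\hat{x} = \arg\max_x \prod_i p(z_i \mid x) = \arg\min_x \tfrac{1}{2} \sum_i e_i^\top \Sigma_i^{-1} e_i.$$

This is exactly the weighted sum-of-squared-errors form that graph optimization minimizes.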

The example used in this article is the same one we used in the last article:

In the field of machine learning, PCA is used to reduce the dimensionality of features. Usually we collect many features to feed the machine learning model, believing that more features provide more information and lead to better results.

But some of the features don't really bring new information because they are correlated with other features. PCA removes this correlation by approximating the original data in a subspace: features that are correlated with each other in the original space are no longer correlated in the subspace approximation.
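This decorrelation can be sketched with NumPy. The synthetic two-feature data below is an assumption for illustration; projecting onto the principal axes (obtained via SVD) removes the correlation between the features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated features: the second is mostly a noisy copy of the first
f1 = rng.normal(size=500)
X = np.column_stack([f1, 0.9 * f1 + 0.1 * rng.normal(size=500)])

Xc = X - X.mean(axis=0)                 # center the data
# Principal directions are the right singular vectors of the centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                           # project onto the principal axes

corr_before = np.corrcoef(Xc.T)[0, 1]   # strongly correlated features
corr_after = np.corrcoef(Z.T)[0, 1]     # essentially zero: decorrelated
```

By construction, the covariance matrix of the projected data `Z` is diagonal, so its off-diagonal correlations vanish up to floating-point precision.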

Visually, let’s assume that our original data points have 2 features and they can be visualized in a 2D space:

First, the network architecture is described as:

Dropout is a commonly used regularization method. It can be described by the diagram below: only part of the neurons in the whole network are updated. Mathematically, we apply some probability (we use 0.5) to each neuron to keep it active or drop it:
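A minimal sketch of this idea, using the common "inverted dropout" formulation (scaling at training time is an implementation choice assumed here, not stated in the article):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # keep probability, as in the article

def dropout_forward(a, p, train=True):
    if not train:
        return a                           # no dropout at test time
    # Each unit survives with probability p; dividing by p keeps the
    # expected activation unchanged (inverted dropout)
    mask = (rng.random(a.shape) < p) / p
    return a * mask

a = np.ones((4, 8))                        # hypothetical layer activations
out = dropout_forward(a, p)                # entries are either 0 or 1/p
```

Roughly half of the activations are zeroed out on each forward pass, and the surviving ones are scaled by `1/p` so that no rescaling is needed at test time.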

**All-Zero Initialization**

It is tempting to set all the weights to zero, but that is terribly wrong: with all-zero initialization, every neuron computes the same output and receives the same backpropagation update, so the neurons all stay identical. We don't need so many identical neurons. In fact, this problem exists whenever the weights are initialized to the same value.
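The symmetry problem can be verified numerically. In this toy sketch (the tiny two-layer network, tanh activation, and squared-error loss are all assumptions for illustration), two hidden neurons initialized with identical weights receive identical gradients and therefore remain clones forever:

```python
import numpy as np

x = np.array([1.0, 2.0])          # one input example
W1 = np.full((2, 2), 0.5)         # two hidden neurons with identical rows
w2 = np.array([0.5, 0.5])         # identical output weights
t = 1.0                           # target value

h = np.tanh(W1 @ x)               # h[0] == h[1]: identical activations
y = w2 @ h
# Backprop for the loss L = 0.5 * (y - t)^2
dy = y - t
dh = w2 * dy * (1 - h**2)         # identical gradients for both neurons
dW1 = np.outer(dh, x)             # both rows of dW1 are identical
```

Since both rows of `dW1` are equal, a gradient step keeps the two rows of `W1` equal, and the symmetry is never broken.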

**Small random values**

A natural fix for the all-zero initialization problem is to set the weights to small random values. This is also problematic, because **very small weights cause very small updates**, and the update values become smaller and smaller as they backpropagate through the layers. In a deep network this problem is serious: you may find that the early layers barely update at all.
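The shrinking signal can be observed in a forward pass alone. In this sketch (the layer sizes, depth, tanh activation, and 0.01 scale are illustrative assumptions), the standard deviation of the activations collapses layer by layer, and the backpropagated gradients shrink correspondingly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 100))          # a batch of hypothetical inputs
a = x
stds = []
for _ in range(10):
    W = 0.01 * rng.normal(size=(100, 100))  # small random initialization
    a = np.tanh(a @ W)
    stds.append(a.std())                    # activation scale at each layer
```

Each layer multiplies the activation scale by roughly `0.01 * sqrt(100) = 0.1`, so after ten layers the signal (and hence the gradient flowing back through it) is many orders of magnitude smaller than the input.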