In the last article, we derived the following negative log-likelihood function as our optimization target:

The optimization problem then becomes:

This article explains how this optimization problem can be solved using the Gauss-Newton method.
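To give a feel for the iteration, here is a minimal Gauss-Newton sketch on a toy problem. It is not the graph optimization problem from the article: the model `y = exp(b * t)`, the data, and the function name `gauss_newton_1d` are assumptions made up for illustration. There is only one unknown, so the normal equation `(J^T J) dx = -J^T r` reduces to a scalar division.

```python
import math


def gauss_newton_1d(residual, jacobian, x0, iterations=20, tol=1e-12):
    """Minimize sum_i residual_i(x)^2 over a scalar x with Gauss-Newton."""
    x = x0
    for _ in range(iterations):
        r = residual(x)   # residual vector at the current estimate
        J = jacobian(x)   # derivative of each residual with respect to x
        # Normal equation (J^T J) dx = -J^T r; both sides are scalars here.
        JtJ = sum(j * j for j in J)
        Jtr = sum(j * ri for j, ri in zip(J, r))
        dx = -Jtr / JtJ
        x += dx
        if abs(dx) < tol:
            break
    return x


# Toy, noise-free data generated from y = exp(b_true * t).
ts = [0.0, 0.5, 1.0, 1.5, 2.0]
b_true = 0.5
ys = [math.exp(b_true * t) for t in ts]


def residual(b):
    return [math.exp(b * t) - y for t, y in zip(ts, ys)]


def jacobian(b):
    return [t * math.exp(b * t) for t in ts]


b_hat = gauss_newton_1d(residual, jacobian, x0=0.0)
print(b_hat)
```

With several unknowns, `JtJ` becomes a matrix and the division becomes a linear solve, but each iteration keeps the same structure: linearize the residuals, solve the normal equation, update the estimate.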

Graph optimization papers and tutorials usually start by stating the optimization problem directly, as the minimization of the following function:

This article explains where this optimization problem comes from and why it is related to Gaussian noise.

Maximum Likelihood Estimation with Gaussian noise

The probabilistic modeling of graph optimization is based on Maximum Likelihood Estimation (MLE).
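As a sketch of why Gaussian noise leads to a least-squares problem, assume that each measurement z_i is predicted by a function h_i(x) of the unknown state x and corrupted by independent zero-mean Gaussian noise with covariance Σ_i. These symbols are chosen here for illustration and may differ from the notation used in the article itself.

```latex
\begin{aligned}
z_i &= h_i(x) + n_i, \qquad n_i \sim \mathcal{N}(0, \Sigma_i) \\
p(z_i \mid x) &\propto \exp\!\left(-\tfrac{1}{2}\,(z_i - h_i(x))^\top \Sigma_i^{-1} (z_i - h_i(x))\right) \\
\hat{x}_{\mathrm{MLE}} &= \arg\max_x \prod_i p(z_i \mid x)
 = \arg\min_x \sum_i \tfrac{1}{2}\,(z_i - h_i(x))^\top \Sigma_i^{-1} (z_i - h_i(x))
\end{aligned}
```

Taking the negative logarithm turns the product of Gaussians into a sum of squared, covariance-weighted errors, which is the form that graph optimization minimizes.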

This article uses the same case as the last article:


The idea behind PCA

In the field of machine learning, PCA is used to reduce the dimension of the features. Usually we collect a lot of features to feed the machine learning model, believing that more features provide more information and will lead to a better result.

But some features do not really bring new information, because they are correlated with other features. PCA removes this correlation by approximating the original data in a subspace: features that are correlated with each other in the original space are no longer correlated in the subspace approximation.

Visually, let’s assume that our original data points have 2 features and they can be visualized in a 2D space:
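The 2D case can also be sketched numerically. The example below is a minimal illustration using only the standard library: the synthetic data (`ys` is roughly `2 * xs` plus noise) and all variable names are assumptions, and the principal axes are found with the closed-form rotation angle that diagonalizes a symmetric 2x2 covariance matrix, not with a general eigen-solver.

```python
import math
import random

random.seed(0)

# Two correlated features: ys is roughly 2 * xs plus a little noise.
xs = [random.gauss(0.0, 1.0) for _ in range(500)]
ys = [2.0 * x + random.gauss(0.0, 0.3) for x in xs]


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


sxx, syy, sxy = cov(xs, xs), cov(ys, ys), cov(xs, ys)

# Rotation angle of the first principal axis of the 2x2 covariance matrix.
theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)

# Center the data and express it in the rotated (principal) coordinates.
mx, my = mean(xs), mean(ys)
pc1 = [c * (x - mx) + s * (y - my) for x, y in zip(xs, ys)]
pc2 = [-s * (x - mx) + c * (y - my) for x, y in zip(xs, ys)]

# Fraction of the total variance captured by the first component alone.
explained = cov(pc1, pc1) / (cov(pc1, pc1) + cov(pc2, pc2))

print("covariance before:", sxy)
print("covariance after: ", cov(pc1, pc2))
print("variance explained by pc1:", explained)
```

The original features are strongly correlated, while the two principal components are not. Keeping only `pc1` is the one-dimensional subspace approximation of the data.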

The case we are handling: a 2-layer network


Background: the network and symbols

First, the network architecture is as follows:


Dropout regularization

Dropout is a commonly used regularization method. It can be described by the diagram below: only part of the neurons in the whole network are updated. Mathematically, each neuron is kept active with some probability (we use 0.5) and is otherwise put to sleep:
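A minimal sketch of this masking step is shown below. The function name `dropout` and the toy activations are assumptions for illustration. Dividing the kept activations by `keep_prob` is the common "inverted dropout" variant, which the text above does not mention: it keeps the expected activation unchanged so that no rescaling is needed at test time.

```python
import random


def dropout(activations, keep_prob=0.5, rng=None):
    """Apply a random keep/drop mask to a list of activations."""
    rng = rng or random.Random()
    # Each neuron is kept with probability keep_prob, otherwise set to zero.
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activations]
    # Assumption: inverted dropout, so kept values are scaled by 1 / keep_prob.
    dropped = [a * m / keep_prob for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)
hidden = [1.0] * 10000  # toy activations of one hidden layer
out, mask = dropout(hidden, keep_prob=0.5, rng=rng)
kept_fraction = sum(mask) / len(mask)

print("fraction of neurons kept:", kept_fraction)
print("mean activation after dropout:", sum(out) / len(out))
```

A new mask is drawn for every training step, so a different subset of neurons is active and updated each time.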


Weight Initialization

All-Zero Initialization

It is tempting to set all the weights to zero, but this is a serious mistake: with all-zero initialization, all the neurons stay the same during the backpropagation updates. We do not need so many identical neurons. In fact, this problem exists whenever the weights are all initialized to the same value.
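The symmetry problem can be checked on a tiny network. The sketch below makes several assumptions for illustration: the architecture (2 inputs, 3 sigmoid hidden units, 1 linear output), the squared-error loss, the toy data, and the function name `train`. It trains with plain gradient descent from a constant initialization and then compares the hidden units.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(init_value, steps=50, lr=0.1, n_hidden=3):
    """Train a 2 -> n_hidden -> 1 network whose weights all start at init_value."""
    W1 = [[init_value, init_value] for _ in range(n_hidden)]  # input -> hidden
    W2 = [init_value] * n_hidden                               # hidden -> output
    data = [((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0),
            ((1.0, 1.0), 0.0), ((0.0, 0.0), 0.0)]
    for _ in range(steps):
        for x, target in data:
            # Forward pass.
            h = [sigmoid(w[0] * x[0] + w[1] * x[1]) for w in W1]
            y = sum(w2 * hj for w2, hj in zip(W2, h))
            err = y - target  # derivative of 0.5 * (y - target)^2 w.r.t. y
            # Backward pass: gradients use the weights from before the update.
            grad_W2 = [err * hj for hj in h]
            grad_h = [err * w2 for w2 in W2]
            for j in range(n_hidden):
                dz = grad_h[j] * h[j] * (1.0 - h[j])
                W1[j][0] -= lr * dz * x[0]
                W1[j][1] -= lr * dz * x[1]
                W2[j] -= lr * grad_W2[j]
    return W1, W2


W1_zero, W2_zero = train(init_value=0.0)
W1_const, W2_const = train(init_value=0.5)

print("zero init, hidden weights:    ", W1_zero)
print("constant init, hidden weights:", W1_const)
```

In both runs the weights move away from their initial value, but every hidden unit receives exactly the same gradient at every step, so all the rows of `W1` stay identical and the network behaves as if it had a single hidden neuron.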

Small random values

One attempt to solve the problem of all-zero initialization is to set the weights to small random values. This is also problematic, because very small weights cause very small updates, and the update values become smaller and smaller during backpropagation. In a deep network this problem is very serious: you may find that the upper layers never update.
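The shrinking effect is easy to observe in a forward pass. The sketch below uses assumed values for illustration: a standard deviation of 0.01 for the weights, 6 tanh layers of width 50, and the function name `activation_scales`. It records the root-mean-square activation after each layer. Since the gradient of a weight is proportional to the activation feeding into it, vanishing activations imply vanishing updates.

```python
import math
import random


def activation_scales(weight_std, depth=6, width=50, seed=0):
    """Return the RMS activation after each tanh layer of a random network."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]  # random input vector
    scales = []
    for _ in range(depth):
        # Weights drawn from a zero-mean Gaussian with the given (small) std.
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        a = [math.tanh(sum(w * x for w, x in zip(row, a))) for row in W]
        scales.append(math.sqrt(sum(v * v for v in a) / width))
    return scales


scales = activation_scales(weight_std=0.01)
for layer, scale in enumerate(scales, start=1):
    print("layer", layer, "RMS activation:", scale)
```

Each layer multiplies the signal by a factor well below one, so after a few layers the activations, and with them the weight updates, are close to zero.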