From the last article, we get the following negative log likelihood function as our optimization target:

The optimization problem becomes:

This article will explain how this optimization problem can be solved using the Gauss-Newton method.

Usually, graph optimization papers and tutorials start by posing the optimization problem: minimizing the following function:

This article will explain where this optimization comes from and why it is related to Gaussian noise.

The probabilistic modeling of graph optimization is based on Maximum Likelihood Estimation (MLE).

The case we use for this article will also be the one we used in the last article:

In the field of machine learning, PCA is used to reduce the dimension of features. Usually we collect a lot of features to feed the machine learning model, believing that more features provide more information and will lead to better results.

But some of the features don’t really bring new information because they are correlated with other features. PCA is introduced to remove this correlation by approximating the original data in a subspace: some features of the original data may be correlated with each other in the original space, but this correlation doesn’t exist in the subspace approximation.
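This decorrelation can be sketched with a few lines of NumPy; the two correlated features below are made up for illustration, and the SVD of the centered data gives the principal directions:

```python
import numpy as np

rng = np.random.default_rng(0)
# two correlated features: the second is mostly a copy of the first
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)             # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                       # coordinates in the principal subspace

# the off-diagonal covariance vanishes after the projection
C = np.cov(Z, rowvar=False)
```

In the new coordinates `Z`, the features are uncorrelated, which is exactly the point made above.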

Visually, let’s assume that our original data points have 2 features and they can be visualized in a 2D space:

First, the network architecture is described as:

Dropout is a commonly used regularization method; it can be described by the diagram below: only part of the neurons in the whole network are updated. Mathematically, we keep each neuron active with some probability (we use 0.5); otherwise it stays asleep:
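A minimal sketch of dropout, assuming the keep probability of 0.5 used above. This is the "inverted" variant, which rescales the kept activations by 1/p so the expected activation is unchanged at test time:

```python
import numpy as np

def dropout(a, p=0.5, train=True):
    """Inverted dropout: keep each unit with probability p."""
    if not train:
        return a                                  # at test time, use all units
    mask = (np.random.rand(*a.shape) < p) / p     # 0 (asleep) or 1/p (active)
    return a * mask

a = np.ones((4, 5))
out = dropout(a, p=0.5)
# dropped units are exactly 0; kept units are scaled by 1/p = 2
```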

**All-Zero Initialization**

It is tempting to set all the weights to zero, but this is terribly wrong, because all-zero initialization makes all the neurons identical during the backpropagation update. We don’t need so many identical neurons. In fact, this problem exists whenever the weights are initialized to the same value.

**Small random values**

A natural fix for the all-zero problem is to set the weights to small random values. This is also problematic, because **very small weights cause very small updates**, and the update values become smaller and smaller during backpropagation. In a deep network this problem is serious: you may find that the upper layers never update.
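A toy NumPy experiment (the layer sizes and the 0.01 scale are made up) illustrating why small random initialization lets the signal shrink layer by layer:

```python
import numpy as np

rng = np.random.default_rng(1)

# All-zero initialization: every neuron computes the same output and
# receives the same gradient, so the neurons stay identical forever.
W_zero = np.zeros((100, 100))

# Small random values: the activations (and hence the gradients) shrink
# layer by layer as the signal passes through a deep stack.
x = rng.normal(size=100)
h = x
for _ in range(10):
    W = 0.01 * rng.normal(size=(100, 100))   # "small random values"
    h = np.tanh(W @ h)

print(np.std(x), np.std(h))   # the std collapses toward 0 in deep layers
```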

The LSVM (SVM with latent variables) is mostly used for human figure detection. It is very effective because it takes the human figure’s structure into consideration: a human figure has hands, a head and legs. The LSVM models the human figure structure with 6 parts, and the positions of these 6 parts are the latent variables.

**The basic logic is sliding a window over the image: at every position we get a small image patch, and by scoring this patch we can predict whether it contains a human figure or not.**

Anyway, the first thing to do is to define a score function:
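The sliding-window logic can be sketched as below; `score_fn` here is a made-up stand-in for the actual LSVM score function, and the window size and stride are arbitrary:

```python
import numpy as np

def sliding_window_scores(image, win_h, win_w, score_fn, stride=8):
    """Score every window position; score_fn is any patch -> float scorer."""
    H, W = image.shape[:2]
    scores = {}
    for y in range(0, H - win_h + 1, stride):
        for x in range(0, W - win_w + 1, stride):
            patch = image[y:y + win_h, x:x + win_w]
            scores[(y, x)] = score_fn(patch)
    return scores

# toy linear scorer standing in for the real model (hypothetical)
w = np.random.rand(64, 32)
score_fn = lambda patch: float(np.sum(w * patch))

image = np.random.rand(128, 96)
scores = sliding_window_scores(image, 64, 32, score_fn)
best = max(scores, key=scores.get)   # the most human-like window position
```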

Structural SVM is a variation of SVM, hereafter referred to as SSVM.

Firstly let’s recall the normal SVM’s prediction function:

ω is the weight vector, x is the input, b is the bias, sgn is the sign function, and the output is the prediction result.
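A minimal sketch of this prediction function in NumPy; the weights and inputs are made-up numbers:

```python
import numpy as np

def svm_predict(w, b, x):
    """Linear SVM prediction: sign(w . x + b), returning +1 or -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([1.0, -2.0])   # example weight vector (made up)
b = 0.5
print(svm_predict(w, b, np.array([3.0, 1.0])))   # 3 - 2 + 0.5 = 1.5  -> +1
print(svm_predict(w, b, np.array([0.0, 1.0])))   # 0 - 2 + 0.5 = -1.5 -> -1
```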

One of SSVM’s specialties is its prediction function:

y is a possible prediction result, Υ is y’s search space, and Φ is some function of x and y. Φ(x, y) is a joint feature vector that describes the relationship between x and y.

Then, for a given ω, different predictions will be made according to different x.
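The argmax-style prediction can be sketched as below. The block-wise joint feature Φ is one common textbook choice for the multiclass special case, used here only for illustration:

```python
import numpy as np

def ssvm_predict(w, x, labels, joint_feature):
    """SSVM prediction: argmax over y of w . Phi(x, y)."""
    return max(labels, key=lambda y: np.dot(w, joint_feature(x, y)))

# toy joint feature: Phi(x, y) places x into the block belonging to class y
def joint_feature(x, y, n_classes=3):
    phi = np.zeros(n_classes * len(x))
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi

w = np.array([1.0, 0.0,   0.0, 1.0,   -1.0, -1.0])  # 3 class blocks
x = np.array([0.2, 0.9])
y_hat = ssvm_predict(w, x, labels=range(3), joint_feature=joint_feature)
# class scores: 0.2, 0.9, -1.1 -> class 1 wins
```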

Sepp Hochreiter graduated from Technische Universität München; LSTM was invented while he was at TUM, and he is now the head of the Institute of Bioinformatics at Johannes Kepler University Linz.

Today he came to Munich and gave a lecture at the Fakultät Informatik.

At first, Hochreiter praised how hot Deep Learning (hereafter DL) is around the world these days, especially LSTM, which is used in the new version of Google Translate published a few days ago. The improvements DL has made in the fields of vision and NLP are very impressive.

Then he started to explain the magic of DL, taking face recognition as an example, with the so-called CNN (Convolutional Neural Network):

The diagram above shows a neuron in a neural network; it simulates a real neuron:

it has **inputs**:

it has a weight for each input: the **weight vector**

it has a **bias**

it has a **threshold** for the “activation function”
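Putting these parts together, a single neuron can be sketched as follows (tanh is an assumed activation choice, and all the numbers are made up):

```python
import numpy as np

def neuron(x, w, b, activation=np.tanh):
    """A single neuron: weighted sum of the inputs plus bias, then activation."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, 0.2])    # one weight per input: the weight vector
b = 0.05                         # bias
out = neuron(x, w, b)            # tanh(0.05 - 0.4 + 0.4 + 0.05) = tanh(0.1)
```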

For the ith sample in the training set, we have the following loss function:

s_j is the score for class j, s_{y_i} is the score of the correct class (classifying to class y_i), and w_j is the jth row of W.
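Assuming the standard multiclass hinge (SVM) loss with a margin of 1, the per-sample loss can be computed as below; the scores are made-up numbers:

```python
import numpy as np

def svm_loss_i(scores, y_i, delta=1.0):
    """Multiclass hinge loss for one sample (margin delta assumed to be 1)."""
    margins = np.maximum(0, scores - scores[y_i] + delta)
    margins[y_i] = 0            # the correct class contributes no loss
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])   # s_j for each class (made up)
loss = svm_loss_i(scores, y_i=0)      # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1)
```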

Problem 1:

Considering the geometrical meaning of the weight vector W, it is easy to see that W is not unique: W can change within a small region and still produce the same loss.

Problem 2:

If the values in W are scaled, the computed loss will also be scaled by the same ratio. Considering a loss of 15: if we scale all the weights in W by 2, the loss will be scaled to 30. But this kind of scaling is meaningless; it doesn’t really represent the **loss**.

PCA (**Principal Component Analysis**), just as its name suggests, computes the data set’s internal structure, its “principal components”.

Considering a set of 2-dimensional data: each data point has 2 dimensions, x and y. Now we have **n** such data points. What is the relationship between the first dimension x and the second dimension y? We compute the so-called **covariance**:

The covariance shows how strong the relationship between x and y is. Its logic is the same as **Chebyshev’s sum inequality**:
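A minimal implementation of this covariance (normalized by n here; the data is made up):

```python
import numpy as np

def covariance(x, y):
    """Sample covariance of two equally long 1-D arrays (normalized by n)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean((x - x.mean()) * (y - y.mean()))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x: strong positive relationship
print(covariance(x, y))              # positive
print(covariance(x, -y))             # negative: y decreases as x increases
```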

HoG (**Histograms of Oriented Gradients**) is a kind of feature used for human figure detection. In the age before deep learning, it was the best feature for this task.

Just as its name suggests, the HoG feature computes the gradients of all pixels of an image patch/block. **It computes both the gradient’s magnitude and orientation**, which is why it’s called “oriented”; it then computes the histogram of the oriented gradients by separating them into 9 ranges.

**One image block (upper left corner of the image) is composed of 4 cells**, and **one cell owns a 9-bin histogram**, so for one image block we get 4 histograms; all 4 histograms are flattened into one feature vector with a length of 4x9. Computing the feature vectors for all blocks in the image, we get a feature vector map.

Taking one pixel (marked red) from the yellow cell as an example: compute this pixel’s horizontal and vertical gradients, then we get its magnitude and orientation (represented by an angle). When calculating the histogram, we **vote its magnitude into its 2 neighboring bins using bilinear interpolation of the angles.**

Finally, when we have the 4 histograms of the 4 cells, we **normalize them by the sum of all 4x9 values.**
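The per-cell voting step can be sketched as below; the 20°-wide bins with centers at 10°, 30°, … over an unsigned 0–180° range are a common convention and an assumption here:

```python
import numpy as np

def hog_cell_histogram(mag, ang, n_bins=9, max_angle=180.0):
    """9-bin orientation histogram; each magnitude is split between the
    two neighboring bins by linear interpolation of the angle."""
    bin_w = max_angle / n_bins                 # 20 degrees per bin
    hist = np.zeros(n_bins)
    for m, a in zip(np.ravel(mag), np.ravel(ang) % max_angle):
        b = a / bin_w - 0.5                    # fractional index over bin centers
        lo = int(np.floor(b)) % n_bins
        hi = (lo + 1) % n_bins
        t = b - np.floor(b)
        hist[lo] += m * (1 - t)                # vote split between 2 bins
        hist[hi] += m * t
    return hist

# one gradient of magnitude 2 at 25 degrees sits between the bin
# centers 10 and 30, so its vote is split 0.25 / 0.75
h = hog_cell_histogram(np.array([2.0]), np.array([25.0]))
```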

The details are described in the following chart:

Now we want to solve an image classification problem, for example classifying an image as cow or cat. The machine learning algorithm scores an unclassified image for each class, and decides which class the image belongs to based on the scores. One of the keys of the classification algorithm is designing this loss function.

**Map/compute image pixels to the confidence score of each class**

Assume a training set:

x_i is the image and y_i is the corresponding class

i ∈ 1…N means the training set contains N images

y_i ∈ 1…K means there are K image categories

**So a score function maps x to y:**

In the above function, each image x_i is flattened to a 1-dimensional vector.

If one image’s size is 32x32 pixels with 3 channels,

x_i will be a 1-dimensional vector with the length of **D=32x32x3=3072**.
Parameter matrix **W** has the size of **[KxD]**; it is often called the weights.
**b** of size **[Kx1]** is often called the bias vector.
In this way, **W** evaluates x_i’s confidence scores for all **K** categories at the same time.
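The whole linear score function can be sketched in NumPy; the random W and x below stand in for trained weights and a real flattened image:

```python
import numpy as np

D, K = 3072, 10                      # 32x32x3 image, K made-up categories
x = np.random.rand(D)                # one flattened image
W = 0.001 * np.random.randn(K, D)    # [K x D] weight matrix
b = np.zeros(K)                      # [K x 1] bias vector

scores = W @ x + b                   # one confidence score per class
predicted_class = int(np.argmax(scores))
```

A single matrix-vector product scores all K categories at once, which is the point of packing the per-class weight vectors into the rows of W.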