# NN Softmax loss function

### Background: the network and symbols

First, the network architecture will be described.

We are now dealing with a multi-class classification problem. Let's say there are 3 classes; then the output layer has 3 neurons, each outputting a binary value: the neuron that outputs 1 marks the target class, while the other neurons output 0, since our target is one single class. This so-called "ideal" case is the ground truth from the training set. The $i$th sample from the training set is represented as $(x^{(i)}, y^{(i)})$, where $x^{(i)}$ is a feature vector and $y^{(i)}$ is its class. The sample in the above diagram belongs to the 2nd class, so the value of $y^{(i)}$ will be 1 (indexing from 0) and the 3 neurons of the output layer will ideally output 0, 1, 0 respectively.

Obviously, for a given input we can hardly get an output of exactly 1 or 0. Instead, after a complicated data flow from the input layer to the output layer, what we get directly from the 3 output neurons are scores $s_0, s_1, s_2$. Their range can be anything, depending on the weights $W$ and biases $b$ throughout the network; for the output layer they are computed by:

$$s_j = w_j^T a + b_j$$

where $a$ is the activation vector feeding the output layer.
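As a minimal sketch of that last step (not the author's actual network — the hidden layers are unspecified, so this illustrates only a single linear output layer with made-up weights), the 3 raw scores from a length-5 feature vector can be computed like this:

```python
# A minimal sketch: one linear output layer mapping a length-5 feature
# vector x to 3 raw, unbounded class scores s_j = w_j . x + b_j.
# The weight and bias values below are invented for illustration.

def output_scores(x, W, b):
    """Compute s_j = w_j . x + b_j for each output neuron j."""
    return [sum(w_jk * x_k for w_jk, x_k in zip(w_j, x)) + b_j
            for w_j, b_j in zip(W, b)]

x = [0.5, -1.2, 3.0, 0.7, 2.1]          # feature vector, length 5
W = [[0.1, -0.2, 0.05, 0.3, -0.1],      # 3 rows (classes) x 5 columns
     [0.2, 0.1, -0.3, 0.0, 0.25],
     [-0.15, 0.05, 0.2, -0.1, 0.1]]
b = [0.1, -0.2, 0.05]

s = output_scores(x, W, b)
print(s)   # three raw scores; any real values, not yet probabilities
```

Note that the scores are unbounded — turning them into probabilities is exactly the job of the softmax function below.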

### Softmax function

The softmax function will be:

$$p_j = \frac{e^{s_j}}{\sum_{k=0}^{C-1} e^{s_k}}$$

Symbols:

- $x$: input feature vector, with a length of 5 in this case
- $y$: output value, the ground truth from the training set; can be 0, 1 or 2
- $N$: number of samples in the training set
- $i$: index of the $i$th sample
- $s_j$: output score; can be $s_0$, $s_1$ or $s_2$
- $j$: index of the $j$th class
- $C$: number of classes, in this case 3
- $p_j$: probability of the output belonging to the $j$th class

We can see that the value of $p_j$ will always lie in $(0, 1)$, so it can be interpreted as the probability that the input belongs to the $j$th class (is classified correctly when $j = y^{(i)}$). Its ideal case is 1: the input belongs to that class with 100% certainty.

**The softmax function is actually a bridge connecting the raw output scores and the final result: the class with the largest probability is the predicted class.**
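The formula above can be sketched directly in code. This version also subtracts the maximum score before exponentiating — a standard numerical-stability trick (it leaves the result unchanged mathematically, but prevents overflow for large scores); the trick is my addition, not something stated in the text:

```python
import math

# Softmax: turn raw scores into probabilities in (0, 1) that sum to 1.
def softmax(scores):
    m = max(scores)                        # shift so the largest score is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p)   # probabilities summing to 1; largest for the largest score
```

The ordering of the scores is preserved: the biggest score always gets the biggest probability, which is why taking the arg-max of $p$ gives the predicted class.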

### Define the loss function

Now we have the training set containing some $(x^{(i)}, y^{(i)})$ pairs.

We are looking for the network parameter matrix $W$ that is most probable given the data, which can be described as:

$$W^* = \arg\max_W P(W \mid X, Y)$$

Using Bayes' theorem,

$$P(W \mid X, Y) = \frac{P(X, Y \mid W)\, P(W)}{P(X, Y)}$$

As $(X, Y)$ is given by the training set, $P(X, Y)$ is fixed, and $P(W)$ is considered to be some constant. The initial problem therefore turns to:

$$W^* = \arg\max_W P(X, Y \mid W)$$

Given $W$, we get the output scores from $X$, so the problem turns further on how $Y$ depends on $X$ and $W$. Using the joint probability:

$$P(X, Y \mid W) = P(Y \mid X, W)\, P(X \mid W)$$

What's more, the inputs do not depend on the network parameters:

$$P(X \mid W) = P(X)$$

Both $P(X)$ and $P(W)$ are constant with respect to $W$, so finally our problem turns to:

$$W^* = \arg\max_W P(Y \mid X, W)$$

which can be described as follows: given the network and the inputs $x^{(i)}$ from the training set, we are looking for a $W$ that maximizes the probability of each sample $x^{(i)}$'s correct classification:

$$P(y^{(i)} \mid x^{(i)}, W)$$

As discussed above, this probability can be expressed with the softmax function: $P(y^{(i)} \mid x^{(i)}, W) = p_{y^{(i)}}$. Putting all training samples together, we get:

$$W^* = \arg\max_W \prod_{i=1}^{N} p_{y^{(i)}}$$

Taking the logarithm, we get:

$$W^* = \arg\max_W \sum_{i=1}^{N} \log p_{y^{(i)}}$$

Normally an optimization problem aims to minimize something, so we add a minus sign and get the final loss function:

$$L = -\sum_{i=1}^{N} \log p_{y^{(i)}} = -\sum_{i=1}^{N} \log \frac{e^{s_{y^{(i)}}}}{\sum_{k=0}^{C-1} e^{s_k}}$$
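A minimal sketch of this loss (summed over samples; dividing by $N$ to average is also common): the negative log-probability that the softmax assigns to each sample's correct class. The scores and labels below are invented for illustration:

```python
import math

# Softmax repeated here so the snippet is self-contained.
def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(score_rows, labels):
    """L = -sum_i log p_{y_i}: penalize low probability on the true class."""
    return -sum(math.log(softmax(s)[y])
                for s, y in zip(score_rows, labels))

scores = [[2.0, 1.0, 0.1],   # sample 0, true class 0
          [0.5, 2.5, 0.3]]   # sample 1, true class 1
labels = [0, 1]
loss = softmax_loss(scores, labels)
print(loss)   # positive; approaches 0 only in the ideal case p_y -> 1
```

Note how the loss behaves: the more confidently the network scores the correct class, the closer its softmax probability gets to 1 and the closer its loss term gets to 0.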

### First step of optimization: calculate the derivative

The backpropagation algorithm will be discussed in the next blog post; here I'd like to introduce the first step: calculating $\partial L / \partial s_j$, the derivative of the loss with respect to the output scores.

Starting from the loss above, with the per-sample loss $L_i = -\log p_{y^{(i)}}$, through a series of calculations the derivative of the loss with respect to the **correct** output score ($j = y^{(i)}$) will be:

$$\frac{\partial L_i}{\partial s_{y^{(i)}}} = p_{y^{(i)}} - 1$$

for the **wrong** output scores ($j \neq y^{(i)}$) it will be:

$$\frac{\partial L_i}{\partial s_j} = p_j$$
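Both cases can be written compactly as $p_j - \mathbf{1}[j = y]$. As a sanity check (my addition, with invented example scores), this sketch compares the closed-form gradient against a numerical finite-difference gradient for one sample:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(scores, y):
    """Per-sample loss L_i = -log p_y."""
    return -math.log(softmax(scores)[y])

s, y = [2.0, 1.0, 0.1], 1        # example scores; true class is 1
p = softmax(s)

# Closed-form gradient: p_j - 1 for the correct class, p_j otherwise.
analytic = [p_j - (1.0 if j == y else 0.0) for j, p_j in enumerate(p)]

# Numerical gradient: bump each score slightly and measure the loss change.
eps = 1e-6
numeric = []
for j in range(len(s)):
    bumped = list(s)
    bumped[j] += eps
    numeric.append((loss(bumped, y) - loss(s, y)) / eps)

print(analytic)
print(numeric)   # the two gradients should agree closely
```

The correct-class entry is negative (increasing that score lowers the loss) and the wrong-class entries are positive, which matches the signs of the two formulas above.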