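The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0, 1) that add up to 1. For an input vector $a$, the $i$-th output is
$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}$$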
As the name suggests, the softmax function is a "soft" version of the max function. Instead of selecting a single maximum value, it splits the whole (1) among all the elements: the maximal element gets the largest portion of the distribution, but the smaller elements get some of it as well.
This property of the softmax function, that it outputs a probability distribution, makes it suitable for probabilistic interpretation in classification tasks.
In Python, we can write the code for the softmax function as follows:
    import numpy as np

    def softmax(X):
        # Exponentiate every element, then normalize so the outputs sum to 1.
        exps = np.exp(X)
        return exps / np.sum(exps)
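As a quick illustration of the "soft max" behaviour, the sketch below applies the function to a small score vector. The input values are arbitrary and chosen only for the example; the function body mirrors the definition above.

```python
import numpy as np

def softmax(x):
    """Softmax of a 1-D score vector."""
    exps = np.exp(x)
    return exps / np.sum(exps)

scores = np.array([1.0, 2.0, 3.0])  # arbitrary example scores (an assumption)
probs = softmax(scores)

print(probs)           # every entry lies strictly between 0 and 1
print(probs.sum())     # the entries add up to 1 (up to rounding)
print(probs.argmax())  # index of the largest score, which keeps the largest share
```

For these scores the shares come out at roughly 0.09, 0.24 and 0.67: the largest score dominates, but the others are not zeroed out as a hard max would do.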
Note that the numerical range of floating-point numbers in NumPy is limited. For float64 the upper bound is approximately $1.8 \times 10^{308}$. With the exponential, it is not difficult to overshoot that limit, in which case np.exp returns inf and the division in softmax then yields nan.
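To make our softmax function numerically stable, we simply normalize the values in the vector by multiplying the numerator and denominator by a constant $C$:
$$p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}} = \frac{C e^{a_i}}{C \sum_{k=1}^{N} e^{a_k}} = \frac{e^{a_i + \log C}}{\sum_{k=1}^{N} e^{a_k + \log C}}$$
We can choose an arbitrary value for the $\log C$ term, but generally $\log C = -\max(a)$ is chosen, as it shifts all of the elements in the vector to be negative or zero. The exponentials of large negative values saturate to zero rather than to infinity, which avoids the overflow that results in nan.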
The code for our stable softmax is as follows:
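    def stable_softmax(X):
        # Subtracting the maximum leaves the result unchanged but keeps
        # every exponent at or below zero, so np.exp cannot overflow.
        exps = np.exp(X - np.max(X))
        return exps / np.sum(exps)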
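To see the difference in practice, the following sketch runs both versions on the same scores shifted by a large constant. The name `naive_softmax` and the example values are assumptions made for this illustration.

```python
import numpy as np

def naive_softmax(x):
    # Direct definition; overflows once the inputs get large.
    exps = np.exp(x)
    return exps / np.sum(exps)

def stable_softmax(x):
    # Shift by the maximum so every exponent is <= 0.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

small = np.array([1.0, 2.0, 3.0])  # arbitrary example scores
large = small + 1000.0             # same scores shifted by a constant

# NumPy emits overflow/invalid warnings here; silence them for the demo.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = naive_softmax(large)

print(naive_large)            # nan entries: exp overflowed to inf, and inf / inf is nan
print(stable_softmax(large))  # finite probabilities
# Shifting all inputs by a constant does not change the softmax output.
print(np.allclose(stable_softmax(large), naive_softmax(small)))
```

Both functions agree wherever the naive version does not overflow, so the shift only changes the numerical behaviour and not the result.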
Because the softmax function outputs a probability distribution, we use it as the final layer in neural networks. For this we need to calculate its derivative (gradient) and pass it back to the previous layer during backpropagation.
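From the quotient rule we know that for $f(x) = \frac{g(x)}{h(x)}$, we have $f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{h(x)^2}$.
In our case, for the output $p_i = \frac{e^{a_i}}{\sum_{k=1}^{N} e^{a_k}}$, we have $g(x) = e^{a_i}$ and $h(x) = \sum_{k=1}^{N} e^{a_k}$. In $h(x)$, $\frac{\partial}{\partial a_j}$ will always be $e^{a_j}$, as the sum always contains the term $e^{a_j}$. But note that in $g(x)$, $\frac{\partial}{\partial a_j}$ will be $e^{a_j}$ only if $i = j$; otherwise it is 0.
If $i = j$,
$$\frac{\partial p_i}{\partial a_j} = \frac{e^{a_i} \sum_{k=1}^{N} e^{a_k} - e^{a_j} e^{a_i}}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2} = p_i(1 - p_j)$$
If $i \neq j$,
$$\frac{\partial p_i}{\partial a_j} = \frac{0 - e^{a_j} e^{a_i}}{\left( \sum_{k=1}^{N} e^{a_k} \right)^2} = -p_j p_i$$
So the derivative of the softmax function is given as,
$$\frac{\partial p_i}{\partial a_j} = \begin{cases} p_i(1 - p_j) & \text{if } i = j \\ -p_j p_i & \text{if } i \neq j \end{cases}$$
Or, using the Kronecker delta $\delta_{ij}$ (1 if $i = j$, 0 otherwise),
$$\frac{\partial p_i}{\partial a_j} = p_i(\delta_{ij} - p_j)$$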
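The Kronecker delta form of the softmax derivative, $\frac{\partial p_i}{\partial a_j} = p_i(\delta_{ij} - p_j)$, translates into a full Jacobian matrix in a couple of lines. The sketch below is one possible way to write it, with a finite-difference comparison as a sanity check. The function names `softmax_jacobian` and `numerical_jacobian` and the test vector are assumptions made for illustration.

```python
import numpy as np

def softmax(a):
    exps = np.exp(a - np.max(a))
    return exps / np.sum(exps)

def softmax_jacobian(a):
    """J[i, j] = dp_i/da_j = p_i * (delta_ij - p_j)."""
    p = softmax(a)
    # diag(p) supplies the p_i * delta_ij term, outer(p, p) the p_i * p_j term.
    return np.diag(p) - np.outer(p, p)

def numerical_jacobian(a, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    n = a.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (softmax(a + step) - softmax(a - step)) / (2 * eps)
    return J

a = np.array([0.5, -1.0, 2.0])  # arbitrary example input
J = softmax_jacobian(a)

print(np.allclose(J, numerical_jacobian(a), atol=1e-8))
print(J.sum(axis=0))  # each column sums to ~0, since the p_i always add up to 1
```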
Cross entropy indicates the distance between what the model believes the output distribution should be and what the original distribution really is. For a true distribution $y$ and a predicted distribution $p$, it is defined as,
$$H(y, p) = -\sum_i y_i \log(p_i)$$
The cross entropy measure is a widely used alternative to squared error. It is used when node activations can be understood as representing the probability that each hypothesis might be true, i.e. when the output is a probability distribution. Thus it is used as a loss function in neural networks that have softmax activations in the output layer.
    def cross_entropy(X, y):
        """
        X is the output from the fully connected layer (num_examples x num_classes).
        y is the integer class labels, shape (num_examples,).
        Note that y is not a one-hot encoded vector. It can be computed as
        y.argmax(axis=1) from one-hot encoded vectors of labels if required.
        """
        m = y.shape[0]
        # softmax normalizes a single vector, so apply it to each row (example) of X.
        p = np.apply_along_axis(softmax, 1, X)
        # We use multidimensional array indexing to extract the
        # softmax probability of the correct label for each sample.
        # Refer to https://docs.scipy.org/doc/numpy/user/basics.indexing.html#indexing-multi-dimensional-arrays
        # for understanding multidimensional array indexing.
        log_likelihood = -np.log(p[range(m), y])
        loss = np.sum(log_likelihood) / m
        return loss
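Two extreme cases help build intuition for the loss. The sketch below is self-contained and uses a row-wise helper named `softmax_rows`, which is an assumption of this example rather than part of the code above. The logits and labels are made up.

```python
import numpy as np

def softmax_rows(X):
    # Numerically stable softmax applied independently to each row.
    shifted = X - np.max(X, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

def cross_entropy(X, y):
    # Mean negative log-probability assigned to the correct class.
    m = y.shape[0]
    p = softmax_rows(X)
    return np.sum(-np.log(p[np.arange(m), y])) / m

y = np.array([0, 2, 1])  # made-up labels for 3 examples and 4 classes

# Case 1: identical logits, i.e. the model guesses uniformly.
uniform_logits = np.zeros((3, 4))

# Case 2: a large logit on the correct class of every example.
confident_logits = np.full((3, 4), -10.0)
confident_logits[np.arange(3), y] = 10.0

print(cross_entropy(uniform_logits, y))    # equals log(4), the loss of a uniform guess
print(cross_entropy(confident_logits, y))  # very close to 0
```

A uniform guess over $K$ classes always costs $\log K$, which makes a handy baseline when checking that a freshly initialized network produces a sensible loss.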
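Cross entropy loss together with the softmax function is used extensively as the output layer. Now we use the derivative of softmax that we derived earlier to derive the derivative of the cross entropy loss function. With $a$ denoting the input to softmax and $p$ its output,
$$L = -\sum_k y_k \log(p_k)$$
$$\frac{\partial L}{\partial a_i} = -\sum_k y_k \frac{\partial \log(p_k)}{\partial a_i} = -\sum_k y_k \frac{1}{p_k} \frac{\partial p_k}{\partial a_i}$$
From the derivative of softmax we derived earlier,
$$\frac{\partial L}{\partial a_i} = -y_i(1 - p_i) - \sum_{k \neq i} y_k \frac{1}{p_k}(-p_k p_i) = -y_i + y_i p_i + \sum_{k \neq i} y_k p_i = p_i \left( y_i + \sum_{k \neq i} y_k \right) - y_i$$
$y$ is a one-hot encoded vector for the labels, so $\sum_k y_k = 1$ and $y_i + \sum_{k \neq i} y_k = 1$. So we have,
$$\frac{\partial L}{\partial a_i} = p_i - y_i$$
which is a very simple and elegant expression. Translating it into code: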
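    def delta_cross_entropy(X, y):
        """
        X is the output from the fully connected layer (num_examples x num_classes).
        y is the integer class labels, shape (num_examples,).
        Note that y is not a one-hot encoded vector. It can be computed as
        y.argmax(axis=1) from one-hot encoded vectors of labels if required.
        """
        m = y.shape[0]
        # softmax normalizes a single vector, so apply it to each row (example) of X.
        grad = np.apply_along_axis(softmax, 1, X)
        # p - y: subtract 1 at the position of the correct class.
        grad[range(m), y] -= 1
        # Average over the batch, matching the mean taken in the loss.
        grad = grad / m
        return grad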
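A common way to gain confidence in a hand-derived gradient such as $p_i - y_i$ is to compare it against finite differences of the loss. The sketch below does that on random data. The helper names `softmax_rows` and `numerical_grad`, the random seed, and the array sizes are all assumptions made for this check.

```python
import numpy as np

def softmax_rows(X):
    # Numerically stable softmax applied independently to each row.
    shifted = X - np.max(X, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

def cross_entropy(X, y):
    m = y.shape[0]
    p = softmax_rows(X)
    return np.sum(-np.log(p[np.arange(m), y])) / m

def delta_cross_entropy(X, y):
    # Analytic gradient of the mean loss: (p - y_one_hot) / m.
    m = y.shape[0]
    grad = softmax_rows(X)
    grad[np.arange(m), y] -= 1
    return grad / m

def numerical_grad(X, y, eps=1e-6):
    # Central finite differences over every entry of X (slow, check only).
    grad = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (cross_entropy(X + step, y) - cross_entropy(X - step, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 examples, 3 classes, random logits
y = np.array([0, 2, 1, 1])   # made-up integer labels

analytic = delta_cross_entropy(X, y)
numeric = numerical_grad(X, y)
print(np.allclose(analytic, numeric, atol=1e-7))
print(analytic.sum(axis=1))  # each row sums to ~0 because both p and y sum to 1
```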