Loss Functions

Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names.

People like to use cool names which are often confusing. When I started playing with CNNs beyond single-label classification, I got confused by the different names and formulations people write in their papers, and even by the loss layer names of deep learning frameworks such as Caffe, Pytorch or TensorFlow. In this post I group the different names and variations people use for the Cross-Entropy Loss. I explain their main points, use cases and implementations in different deep learning frameworks.

First, let’s introduce some concepts:

Tasks

Multi-Class Classification

One-of-many classification. Each sample can belong to ONE of C classes. The CNN will have C output neurons that can be gathered in a vector s (scores). The target (ground truth) vector t will be a one-hot vector with one positive class and C − 1 negative classes. This task is treated as a single classification problem of each sample into one of the C classes.

Multi-Label Classification

Each sample can belong to more than one class. The CNN will also have C output neurons. The target vector t can have more than one positive class, so it will be a vector of 0s and 1s with C dimensionality. This task is treated as C different binary (C' = 2, t' = 0 or t' = 1) and independent classification problems, where each output neuron decides if a sample belongs to a class or not.

Output Activation Functions

These functions are transformations we apply to vectors coming out from CNNs ( s ) before the loss computation.

Sigmoid

It squashes a vector in the range (0, 1). It is applied independently to each element of s, s_i. It's also called the logistic function.

f(s_i) = \frac{1}{1+e^{-s_i}}
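
For illustration, a minimal NumPy sketch of the Sigmoid applied element-wise to a vector of scores (the function name and the example scores are just illustrative):

import numpy as np

def sigmoid(s):
    # Applied element-wise: each score s_i is squashed to (0, 1) independently
    return 1.0 / (1.0 + np.exp(-s))

scores = np.array([0.5, -1.2, 3.0])   # example CNN scores, one per class
print(sigmoid(scores))                # each value lies in (0, 1); they need not sum to 1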

Softmax

Softmax is a function, not a loss. It squashes a vector in the range (0, 1), and all the resulting elements add up to 1. It is applied to the output scores s. As the elements represent a class, they can be interpreted as class probabilities.

The Softmax function cannot be applied independently to each s_i, since it depends on all the elements of s. For a given class s_i, the Softmax function can be computed as:

f(s_i) = \frac{e^{s_i}}{\sum^{C}_{j} e^{s_j}}

Where s_j are the scores inferred by the net for each class in C. Note that the Softmax activation for a class s_i depends on all the scores in s.
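
A minimal NumPy sketch of the Softmax, subtracting the maximum score before exponentiating for numerical stability (the same trick used later in the Caffe layer); names and values are illustrative:

import numpy as np

def softmax(s):
    # Subtracting the max score avoids overflow in exp() and does not change the result
    e = np.exp(s - np.max(s))
    return e / np.sum(e)              # each f(s_i) depends on all the scores in s

scores = np.array([0.5, -1.2, 3.0])
probs = softmax(scores)
print(probs, probs.sum())             # values in (0, 1) that add up to 1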

Losses

Cross-Entropy loss

The Cross-Entropy Loss is actually the only loss we are discussing here. The other loss names in the title are simply other names or variations of it. The CE Loss is defined as:

CE = -\sum^{C}_{i} t_i \log(s_i)

Where t_i and s_i are the ground truth and the CNN score for each class i in C. As an activation function (Sigmoid / Softmax) is usually applied to the scores before the CE Loss computation, we write f(s_i) to refer to the activations.
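
As a quick illustration, a minimal NumPy sketch of this definition, where f_s stands for the already activated scores (Sigmoid or Softmax outputs); the target and values are illustrative:

import numpy as np

def cross_entropy(t, f_s):
    # CE = -sum_i t_i * log(f(s_i)) over the C classes
    return -np.sum(t * np.log(f_s))

t   = np.array([0.0, 1.0, 0.0])       # one-hot ground truth for class 2
f_s = np.array([0.1, 0.7, 0.2])       # activated scores (class probabilities)
print(cross_entropy(t, f_s))          # equals -log(0.7)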

In a binary classification problem, where C' = 2, the Cross-Entropy Loss can be defined as:

CE = -\sum^{C'=2}_{i=1} t_i \log(s_i) = -t_1 \log(s_1) - (1 - t_1)\log(1 - s_1)

Where it's assumed that there are two classes: C_1 and C_2. t_1 \in [0, 1] and s_1 are the ground truth and the score for C_1, and t_2 = 1 - t_1 and s_2 = 1 - s_1 are the ground truth and the score for C_2. That is the case when we split a Multi-Label classification problem into C binary classification problems. See the Binary Cross-Entropy Loss section below for more details.
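
A minimal NumPy sketch of this two-class formulation, keeping only t_1 and s_1 since the second class is implied by t_2 = 1 - t_1 and s_2 = 1 - s_1 (values are illustrative):

import numpy as np

def binary_cross_entropy(t1, s1):
    # CE = -t_1 * log(s_1) - (1 - t_1) * log(1 - s_1)
    return -t1 * np.log(s1) - (1.0 - t1) * np.log(1.0 - s1)

print(binary_cross_entropy(1.0, 0.8))   # ground truth is C_1 and the score for C_1 is 0.8
print(binary_cross_entropy(0.0, 0.8))   # ground truth is C_2 with the same score: larger loss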

Logistic Loss and Multinomial Logistic Loss are other names for Cross-Entropy loss. [Discussion]

Caffe, Pytorch and TensorFlow all provide layers that use a Cross-Entropy loss without an embedded activation function.
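
One such layer is PyTorch's nn.NLLLoss, which expects log-probabilities, so the activation has to be applied explicitly before the loss. A minimal sketch, with illustrative batch size and number of classes:

import torch
import torch.nn.functional as F

scores = torch.randn(4, 10)                    # batch of 4 samples, C = 10 class scores
targets = torch.tensor([1, 0, 3, 9])           # ground truth class index per sample

log_probs = F.log_softmax(scores, dim=1)       # activation applied explicitly
loss = torch.nn.NLLLoss()(log_probs, targets)  # Cross-Entropy loss without embedded activation
print(loss.item())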

Categorical Cross-Entropy loss

Also called Softmax Loss. It is a Softmax activation plus a Cross-Entropy loss. If we use this loss, we will train a CNN to output a probability over the C classes for each image. It is used for multi-class classification.

In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class C_p keeps its term in the loss. There is only one element of the target vector t which is not zero, t_i = t_p. So, discarding the elements of the summation which are zero due to the target labels, we can write:

CE = -\log\Big(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\Big)

Where s_p is the CNN score for the positive class.
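
A minimal NumPy sketch checking that, with a one-hot target, the full summation reduces to this single positive-class term (scores and the positive class index are illustrative):

import numpy as np

scores = np.array([0.5, -1.2, 3.0])         # CNN scores s, one per class
t = np.array([0.0, 0.0, 1.0])               # one-hot target: positive class p = 2

probs = np.exp(scores - np.max(scores))
probs /= probs.sum()                         # Softmax activations f(s)

full_sum = -np.sum(t * np.log(probs))        # -sum_i t_i * log(f(s_i))
positive_only = -np.log(probs[2])            # -log(e^{s_p} / sum_j e^{s_j})
print(np.isclose(full_sum, positive_only))   # True: only the positive class contributes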

Having defined the loss, we now need to compute its gradient with respect to the output neurons of the CNN in order to backpropagate it through the net and optimize the defined loss function by tuning the net parameters. So we need to compute the gradient of the CE Loss with respect to each CNN class score in s. The loss terms coming from the negative classes are zero. However, the loss gradient with respect to those negative classes is not cancelled, since the Softmax of the positive class also depends on the negative class scores.

The gradient expression will be the same for all classes C except for the ground truth class C_p, because the score of C_p (s_p) is in the numerator. After some calculus, the derivative with respect to the positive class is:

\frac{\partial}{\partial s_p}\Big(-\log\Big(\frac{e^{s_p}}{\sum^C_j e^{s_j}}\Big)\Big) = \Big(\frac{e^{s_p}}{\sum^C_j e^{s_j}} - 1\Big)

And the derivative with respect to the other (negative) classes is:

\frac{\partial}{\partial s_n}\Big(-\log\Big(\frac{e^{s_p}}{\sum^C_j e^{s_j}}\Big)\Big) = \Big(\frac{e^{s_n}}{\sum^C_j e^{s_j}}\Big)

Where s_n is the score of any negative class in C different from C_p.
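
Both expressions can be sanity-checked numerically; a small NumPy sketch comparing them against finite differences (scores, positive class and epsilon are illustrative):

import numpy as np

def softmax_ce(scores, p):
    # Single-label CE loss: -log(softmax(scores)[p]) for positive class p
    e = np.exp(scores - np.max(scores))
    return -np.log(e[p] / e.sum())

scores, p, eps = np.array([0.5, -1.2, 3.0]), 2, 1e-6
probs = np.exp(scores - np.max(scores))
probs /= probs.sum()

for k in range(len(scores)):
    d = np.zeros_like(scores)
    d[k] = eps
    numeric = (softmax_ce(scores + d, p) - softmax_ce(scores - d, p)) / (2 * eps)
    analytic = probs[k] - 1 if k == p else probs[k]   # f(s_p) - 1 for the positive class, f(s_n) otherwise
    print(k, np.isclose(numeric, analytic))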

The Categorical Cross-Entropy loss can also be used with multi-label targets, where each sample may belong to more than one positive class. Following the Facebook paper, the loss is then computed over the set M of positive classes of each sample:

CE = -\frac{1}{M}\sum_{p}^{M}\log\Big(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\Big)

Where each s_p in M is the CNN score for each positive class. As in the Facebook paper, I introduce a scaling factor 1 / M to make the loss invariant to the number of positive classes, which may be different per sample.
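
A small NumPy sketch of this scaled loss, showing that averaging over the positive classes keeps the loss magnitude comparable when M changes (scores and positive sets are illustrative):

import numpy as np

def multilabel_softmax_ce(scores, positives):
    # CE = -(1/M) * sum over the M positive classes of log(softmax(s)_p)
    e = np.exp(scores - np.max(scores))
    probs = e / e.sum()
    return -np.mean(np.log(probs[positives]))

scores = np.array([2.0, 0.3, -1.0, 1.5])
print(multilabel_softmax_ce(scores, [0]))       # one positive class (M = 1)
print(multilabel_softmax_ce(scores, [0, 3]))    # two positive classes (M = 2), averaged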

The gradient has different expressions for positive and negative classes. For positive classes:

\frac{\partial CE}{\partial s_p^i} = \frac{1}{M}\Big(\frac{e^{s_p^i}}{\sum_{j}^{C} e^{s_j}} - 1\Big) + \frac{M - 1}{M}\Big(\frac{e^{s_p^i}}{\sum_{j}^{C} e^{s_j}}\Big)

Where s_p^i is the score of any positive class.

For negative classes:

\frac{\partial CE}{\partial s_n} = \frac{e^{s_n}}{\sum_{j}^{C} e^{s_j}}

These expressions are easily inferable from the single-label gradient expressions.

As neither the Caffe Softmax with Loss layer nor the Multinomial Logistic Loss Layer accept multi-label targets, I implemented my own PyCaffe Softmax loss layer, following the specifications of the Facebook paper. Caffe Python layers let us easily customize the operations done in the forward and backward passes of the layer:

Forward pass: Loss computation

def forward(self, bottom, top):
   labels = bottom[1].data
   scores = bottom[0].data
   # Normalizing to avoid instability
   scores -= np.max(scores, axis=1, keepdims=True)  
   # Compute Softmax activations
   exp_scores = np.exp(scores)
   probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) 
   logprobs = np.zeros([bottom[0].num,1])
   # Compute cross-entropy loss
   for r in range(bottom[0].num): # For each element in the batch
       scale_factor = 1 / float(np.count_nonzero(labels[r, :]))
       for c in range(len(labels[r,:])): # For each class 
           if labels[r,c] != 0:  # Positive classes
               logprobs[r] += -np.log(probs[r,c]) * labels[r,c] * scale_factor # We sum the loss per class for each element of the batch

   data_loss = np.sum(logprobs) / bottom[0].num

   self.diff[...] = probs  # Store softmax activations
   top[0].data[...] = data_loss # Store loss

We first compute the Softmax activations for each class and store them in probs. Then we compute the loss for each image in the batch, considering there might be more than one positive label. We use a scale_factor of 1 / M and we also multiply the losses by the labels, which can be binary or real numbers, so they can be used for instance to introduce class balancing. The batch loss will be the mean loss of the elements in the batch. We then save the data_loss to display it and the probs to use them in the backward pass.

Backward pass: Gradients computation

def backward(self, top, propagate_down, bottom):
   delta = self.diff   # If the class label is 0, the gradient is equal to probs
   labels = bottom[1].data
   for r in range(bottom[0].num):  # For each element in the batch
       scale_factor = 1 / float(np.count_nonzero(labels[r, :]))
       for c in range(len(labels[r,:])):  # For each class
           if labels[r, c] != 0:  # If positive class
               delta[r, c] = scale_factor * (delta[r, c] - 1) + (1 - scale_factor) * delta[r, c]
   bottom[0].diff[...] = delta / bottom[0].num

In the backward pass we need to compute the gradients of each element of the batch with respect to each one of the class scores s. As the gradient for all the classes C except the positive classes in M is equal to probs, we assign the probs values to delta. For the positive classes in M we subtract 1 from the corresponding probs value and use scale_factor to match the gradient expression. We average the gradients over the batch to run the backpropagation.
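
The backward expression can be sanity-checked outside Caffe with finite differences. A standalone NumPy sketch of the same forward / backward math for a single sample with binary labels (all values illustrative):

import numpy as np

def forward_loss(scores, labels):
    # Same math as the layer: Softmax plus positive-class terms scaled by 1/M
    e = np.exp(scores - np.max(scores))
    probs = e / e.sum()
    m = np.count_nonzero(labels)
    return -np.sum(labels * np.log(probs)) / m, probs, m

scores = np.array([0.5, -1.2, 3.0, 0.1])
labels = np.array([1.0, 0.0, 1.0, 0.0])          # two positive classes (M = 2)

loss, probs, m = forward_loss(scores, labels)
delta = probs.copy()
delta[labels != 0] -= 1.0 / m                    # equivalent to the layer's update for positive classes

eps = 1e-6
for k in range(len(scores)):
    d = np.zeros_like(scores)
    d[k] = eps
    numeric = (forward_loss(scores + d, labels)[0] - forward_loss(scores - d, labels)[0]) / (2 * eps)
    print(k, np.isclose(numeric, delta[k]))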

Binary Cross-Entropy Loss

Also called Sigmoid Cross-Entropy loss. It is a Sigmoid activation plus a Cross-Entropy loss. Unlike Softmax loss it is independent for each vector component (class), meaning that the loss computed for every CNN output vector component is not affected by other component values. That's why it is used for multi-label classification, where the insight of an element belonging to a certain class should not influence the decision for another class. It's called Binary Cross-Entropy Loss because it sets up a binary classification problem between C' = 2 classes for every class in C, as explained above. So when using this loss, the formulation of Cross-Entropy Loss for binary problems is often used:

CE = -\sum^{C'=2}_{i=1} t_i \log(f(s_i)) = -t_1 \log(f(s_1)) - (1 - t_1)\log(1 - f(s_1))
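
A minimal NumPy sketch of Sigmoid plus Binary Cross-Entropy applied independently to each class, as in a multi-label setup (scores and targets are illustrative):

import numpy as np

def sigmoid_bce(scores, targets):
    # One independent binary problem per class: Sigmoid activation f(s_i), then binary CE
    f = 1.0 / (1.0 + np.exp(-scores))
    return -targets * np.log(f) - (1.0 - targets) * np.log(1.0 - f)

scores  = np.array([2.0, -1.0, 0.3])       # one score per class
targets = np.array([1.0,  0.0, 1.0])       # multi-label ground truth: several classes can be 1
print(sigmoid_bce(scores, targets))        # per-class losses; each depends only on its own score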
