Neural Networks 101: Part 8 - Binary Cross Entropy
Binary Cross Entropy
Binary Cross Entropy (BCE) is another loss function, suited to multi-label problems where a single input can carry several labels at once.
Given multiple labels, for example 20, BCE measures the loss of the prediction for each label against that label’s target for a single input. This makes it possible to predict several labels for one input, or no labels at all.
BCE - Simplified
First I’ll start with a basic, intuitive example. After that, we’ll dive into the BCE formula and how it works.
Let’s say we have multiple dependent variables (labels). Any number of these labels can appear in one independent variable (input).
labels = [car, cat, dog]
Let’s assume our independent variables are images, and an image can contain one, some or none of [car, cat, dog].
Here is an example training set:
input = <an image of a car and a dog>
target_labels = [car, dog]
The input is a picture of a car and a dog, but doesn’t contain a cat.
We can represent the values as numeric predictions and targets. Remember, multi-label classification usually uses a “one-hot”-style encoding (often called multi-hot, since more than one entry can be 1): when a label is present in the input, it’s set to 1, and if it’s not present, it’s set to 0.
targets = [1, 0, 1]
predictions = [0.8, 0.3, 0.9]
We can see that predictions closer to 1 indicate the label is present, and predictions closer to 0 indicate the label is not detected.
Using the first index, which is the prediction for car, the prediction is 0.8 and the target is 1, so we can say the model is 80% confident a car is present.
Using the second index, which is the prediction for cat, the prediction is 0.3 and the target is 0, so we can say the model is 70% confident the cat is not present.
Intuitively, we can see that the predictions at this point are fairly accurate. We’ve created a prediction for each possible label for an input. BCE is a loss function, so it uses these predictions to calculate the loss, which in turn determines the adjustment required to the parameters.
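The encoding described above can be sketched in a few lines of Python. This is a minimal illustration rather than a fixed recipe: the helper names `encode_targets` and `decode_predictions` are made up for this example, and the 0.5 cut-off for deciding that a label is "detected" is an assumption, not something the loss itself requires.

```python
labels = ["car", "cat", "dog"]

def encode_targets(present, all_labels):
    """Return 1.0 for each label that is present and 0.0 for each that is absent."""
    return [1.0 if label in present else 0.0 for label in all_labels]

def decode_predictions(predictions, all_labels, threshold=0.5):
    """Keep the labels whose prediction is at or above the (assumed) threshold."""
    return [label for label, p in zip(all_labels, predictions) if p >= threshold]

# The image of a car and a dog from the example above.
targets = encode_targets({"car", "dog"}, labels)    # [1.0, 0.0, 1.0]
predictions = [0.8, 0.3, 0.9]
detected = decode_predictions(predictions, labels)  # ["car", "dog"]
```

An image that contains none of the labels would simply encode to a target of all zeros.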
BCE - Formula
This section will go into the details of the formula.
The formula is applied to each prediction and each label.
The predictions are derived from applying a Sigmoid function to the raw logits. Each raw logit is an unbounded score for one label, and the Sigmoid maps it to a value between 0 and 1 that represents how likely the label is to appear.
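As a rough sketch of that step, the Sigmoid can be written with the standard library alone. The logit values below are hypothetical, picked only so that the outputs land close to the predictions [0.8, 0.3, 0.9] used in this article.

```python
import math

def sigmoid(logit):
    """Squash a raw logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical raw logits for [car, cat, dog].
logits = [1.386, -0.847, 2.197]

# Each logit is converted independently of the others.
predictions = [sigmoid(z) for z in logits]
rounded = [round(p, 2) for p in predictions]  # [0.8, 0.3, 0.9]
```

Because each logit is converted on its own, the predictions do not need to sum to 1, which is what allows several labels to be present at once.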
$ BCE = -\left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $
Each prediction:
$ \hat{y}_i $
Each target (the true label):
$ y_i $
Let’s use the previous example of an image that could contain [car, cat, dog].
Target of a car in the input:
$ y = 1 $
The prediction of the car in the input:
$ \hat{y} = 0.8 $
Let’s substitute in the values:
$ BCE = -\left[ 1 \log(0.8) + (1 - 1) \log(1 - 0.8) \right] $
We’ll simplify and solve:
$ BCE = -\left[ 1\log(0.8) + 0 \log(1 - 0.8) \right] $
$ BCE = -\left[ 1\log(0.8) \right] $
$ BCE = -\left[ \log(0.8) \right] $
$ BCE = -\log(0.8) $
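$ \log(0.8) = -0.096910013 $
$ BCE = -(-0.096910013) $
$ BCE = 0.096910013 $
We can see the resulting loss of the first prediction for the label car is low. It’s close to 0, so when the gradients are calculated using back propagation, the adjustments would be small.
Note that the values here use the base-10 logarithm. Most deep learning libraries use the natural logarithm instead, which gives $ -\ln(0.8) \approx 0.2231 $. The choice of base only rescales the loss by a constant factor.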
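The same calculation can be checked in Python. This is a minimal sketch: `bce_single` is a name invented for this example, and the `log` parameter is only there to show that the hand-calculated 0.0969 corresponds to a base-10 logarithm, while the natural logarithm gives a larger number for the same prediction.

```python
import math

def bce_single(y, y_hat, log=math.log):
    """BCE for one label: -[y * log(y_hat) + (1 - y) * log(1 - y_hat)]."""
    return -(y * log(y_hat) + (1 - y) * log(1 - y_hat))

# Car: target 1, prediction 0.8.
car_base10 = bce_single(1, 0.8, log=math.log10)  # ~0.0969, the hand calculation
car_natural = bce_single(1, 0.8)                 # ~0.2231, using the natural log
```

A prediction of exactly 0 or 1 would make one of the logarithms undefined, so real implementations usually clamp the values slightly. That detail is left out here.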
Let’s go through the calculation for the absent label (cat):
Target of a cat in the input:
$ y = 0 $
The prediction of the cat in the input:
$ \hat{y} = 0.3 $
Let’s substitute in the values:
$ BCE = -\left[ 0 \log(0.3) + (1 - 0) \log(1 - 0.3) \right] $
We’ll simplify and solve:
$ BCE = -\left[ 0\log(0.3) + 1 \log(1 - 0.3) \right] $
$ BCE = -\left[1 \log(0.7) \right] $
$ BCE = -\log(0.7) $
$ \log(0.7) = -0.15490195998 $
$ BCE = -(-0.15490195998) $
$ BCE = 0.15490195998 $
The loss is low, which tells us the prediction agrees with the target: the model is fairly confident that a cat is not in the image.
Also notice that the formula is designed so that one side cancels out, depending on whether the binary target is 1 or 0.
If the target is 1 (the label is present), the right side of the equation is canceled out because $ (1 - y_i) = 0 $.
If the target is 0 (the label is not present), the left side is canceled out because $ y_i = 0 $.
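To make the cancellation concrete, the sketch below computes the two terms inside the brackets separately. The helper name `bce_terms` is made up for this example, and base-10 logarithms are used so the numbers match the hand calculations in this article.

```python
import math

def bce_terms(y, y_hat):
    """Return the (left, right) terms inside the brackets of the BCE formula."""
    left = y * math.log10(y_hat)
    right = (1 - y) * math.log10(1 - y_hat)
    return left, right

# Car: target 1, so the right term is multiplied by (1 - 1) = 0.
car_left, car_right = bce_terms(1, 0.8)

# Cat: target 0, so the left term is multiplied by 0.
cat_left, cat_right = bce_terms(0, 0.3)

car_loss = -(car_left + car_right)  # ~0.0969
cat_loss = -(cat_left + cat_right)  # ~0.1549
```

In both cases only one term contributes to the loss, so the formula always scores the prediction against whichever outcome actually occurred.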
Finally, to arrive at a single loss value, the losses are averaged according to the number of labels:
$ L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] $
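The averaged loss for the running example can be sketched as follows. The function name `binary_cross_entropy` and its `log` parameter are assumptions made for this illustration. The third label (dog, target 1, prediction 0.9) was not worked through by hand above, but it follows the same steps as the car.

```python
import math

def binary_cross_entropy(targets, predictions, log=math.log):
    """Mean BCE over all N labels of a single input."""
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        total += y * log(y_hat) + (1 - y) * log(1 - y_hat)
    return -total / len(targets)

targets = [1, 0, 1]              # [car, cat, dog]
predictions = [0.8, 0.3, 0.9]

loss_base10 = binary_cross_entropy(targets, predictions, log=math.log10)  # ~0.0992
loss_natural = binary_cross_entropy(targets, predictions)                 # ~0.2284
```

The base-10 result is the average of the hand-calculated per-label losses. The natural-log result is the same quantity on the scale that deep learning libraries typically report.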
Summary
Binary Cross Entropy is a loss function that calculates the loss for multiple labels in a single input.
In simple terms, it allows the classification of multiple labels in a single input (e.g. identifying multiple objects in a single image).
In previous articles we looked at the NLL (Negative Log Likelihood). It wouldn’t be applicable to this problem, where the target is a vector of 0s and 1s with any number of entries set to 1. This is because NLL assumes there is exactly one true class for each input, whereas here any number of labels (all, some or none) can be true simultaneously.
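One way to see the difference is to compare how the probabilities behave. The sketch below assumes the single-class NLL setup is paired with a softmax over the logits, which is a common arrangement but is an assumption here. The logit values are hypothetical.

```python
import math

def softmax(logits):
    """Turn logits into probabilities that compete with each other and sum to 1."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    """Turn one logit into an independent probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical raw logits for [car, cat, dog].
logits = [1.386, -0.847, 2.197]

softmax_probs = softmax(logits)               # car and dog must share a total of 1
sigmoid_probs = [sigmoid(z) for z in logits]  # car and dog can both be close to 1
```

With softmax, a high probability for dog forces a lower probability for car, so the model cannot say that both are present. The per-label Sigmoid used with BCE has no such constraint.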