
Focal Loss explained for Object Detection

Abstract
This article explains Focal Loss in a simple way to show how it works. The original paper is Focal Loss for Dense Object Detection (Lin et al., 2017).

Object Detection

Current state-of-the-art object detectors are mostly two-stage detectors, in which a first stage such as the Region Proposal Network (RPN) filters out most candidate anchors before classification. One-stage detectors, in contrast, densely enumerate all possible boxes and predict their offsets to the ground-truth boxes, so the number of easy negative examples is enormous while positive examples are comparatively rare. With the standard cross-entropy loss, one-stage detectors cannot handle this extreme class imbalance. This is the problem Focal Loss was designed to solve.

Focal loss

The formula is:

$$ FL(p_t) = -\alpha_t(1-p_t)^\gamma\log{p_t} $$

In the formula above, $p_t$ is the predicted probability of the ground-truth class for the $t^{th}$ example, and $\alpha_t$ is a weighting factor with $\alpha \in [0,1]$. I will explain $\alpha$ and $\gamma$ later. Note that the Cross Entropy loss is the special case of Focal Loss with $\alpha_t=1$ and $\gamma=0$.

Therefore, the overall loss is a reduction of the per-example losses $FL(p_t)$, for example $$FL(p)=\sum_{t=1}^{k} FL(p_t) \quad \text{if reduction = sum},$$ where $k$ is the number of examples.
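To make the formula concrete, here is a minimal PyTorch sketch of a binary Focal Loss. The function name `focal_loss`, the sigmoid/logit convention, and the default values $\alpha=0.25$, $\gamma=2$ are my own assumptions for illustration, not the paper's reference code.

```python
# Minimal sketch of a binary Focal Loss (illustrative, not the paper's reference code).
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="sum"):
    """Compute FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) per example."""
    p = torch.sigmoid(logits)                       # predicted probability of the positive class
    p_t = torch.where(targets == 1, p, 1 - p)       # probability assigned to the ground-truth class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    loss = -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))
    if reduction == "sum":
        return loss.sum()
    if reduction == "mean":
        return loss.mean()
    return loss                                     # reduction="none": per-example losses

# Example: two anchors, one positive and one negative.
logits = torch.tensor([2.0, -1.0])
targets = torch.tensor([1.0, 0.0])
print(focal_loss(logits, targets))
```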

Easy/hard examples

For a network, $loss=0$ means the network is perfect and absolutely right, which means all inputs are easy for this network. Therefore, if an example's loss is low, it is an easy example; conversely, if an example's loss is high, it is a hard example.

In Focal Loss, the factor $\gamma$ controls this. If $p_t$ is high, i.e. the example is easy, $-\log{p_t}$ is already small, and the modulating factor $(1-p_t)^\gamma$ makes it even smaller, so the loss is lower than the Cross Entropy loss, resulting in a smaller gradient and a smaller contribution to the weight update.
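A quick numeric check makes the effect of the modulating factor visible (a toy calculation assuming $\gamma=2$ and ignoring $\alpha$):

```python
# How much does (1 - p_t)^gamma shrink the loss for easy vs. hard examples?
import math

def ce(p_t):
    return -math.log(p_t)

def fl(p_t, gamma=2.0):
    return (1 - p_t) ** gamma * ce(p_t)

for p_t in (0.9, 0.5, 0.1):   # easy -> hard
    print(f"p_t={p_t:.1f}  CE={ce(p_t):.4f}  FL={fl(p_t):.4f}  ratio={fl(p_t)/ce(p_t):.4f}")
# p_t=0.9 (easy): loss is cut by a factor of 100 (ratio 0.01)
# p_t=0.1 (hard): loss keeps ~81% of its CE value (ratio 0.81)
```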

Positive/negative examples

It's clear that negative examples are the anchors that don't match any ground-truth box, and positive examples are the anchors that do match a ground-truth box.

In Focal Loss, $\alpha$ is a tunable weighting factor, so you can set different values for positive and negative examples. For example, you can give negative examples a lower $\alpha_t$ to lower their loss, resulting in a smaller gradient and a smaller contribution to the weight update.
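As a sketch of that idea, assuming the common convention $\alpha_t=\alpha$ for positives and $1-\alpha$ for negatives (the helper `alpha_weights` and the value $\alpha=0.75$ are hypothetical, chosen so that negatives get the lower weight as described above):

```python
# Per-anchor alpha weighting: alpha for positives, 1 - alpha for negatives (illustrative).
import torch

def alpha_weights(targets, alpha=0.75):
    """Return alpha_t per anchor: alpha for positives, 1 - alpha for negatives."""
    return torch.where(targets == 1,
                       torch.full_like(targets, alpha),
                       torch.full_like(targets, 1 - alpha))

# One positive and three negative anchors: the (numerous) negatives get the lower weight 0.25.
targets = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(alpha_weights(targets))   # tensor([0.7500, 0.2500, 0.2500, 0.2500])
```

Interestingly, the paper reports that $\alpha=0.25$ with $\gamma=2$ works best in practice, i.e. the positives end up with the smaller $\alpha_t$, because the $(1-p_t)^\gamma$ term already down-weights the easy negatives heavily.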

Experiment

In the paper, the authors ran the following experiment:

https://images.376673.xyz/wp-contents/uploads/Screenshot%2B2023-01-20%2B231640.jpg?x-oss-process=style/webp
experiment

This figure shows the cumulative distribution functions of the loss for positive and negative examples.

We can see that for positive (foreground) examples the curves stay similar as $\gamma$ increases, but for negative (background) examples the curves change dramatically: at $\gamma=2$, the vast majority of the loss comes from a small fraction of the hardest samples. In other words, Focal Loss effectively discounts the easy negative examples and focuses attention on the hard negatives.
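For reference, a cumulative curve of this kind can be built by sorting the per-anchor losses and plotting the normalized running sum against the fraction of examples. The snippet below uses entirely synthetic losses, just to show the construction:

```python
# Cumulative distribution of loss over examples (synthetic data, for illustration only).
import numpy as np

def loss_cdf(losses):
    """Return (fraction of examples, fraction of total loss), losses sorted ascending."""
    losses = np.sort(np.asarray(losses))
    cum = np.cumsum(losses) / losses.sum()
    frac = np.arange(1, len(losses) + 1) / len(losses)
    return frac, cum

# With gamma = 2, most negative anchors are easy, so their losses are tiny and the
# cumulative curve stays near zero until the last (hardest) few percent of examples.
losses = np.random.exponential(scale=0.01, size=10000)   # mostly easy negatives (toy data)
losses[:100] = np.random.uniform(1.0, 3.0, size=100)     # a few hard negatives
frac, cum = loss_cdf(losses)
print(f"hardest 1% of examples contribute {1 - cum[int(0.99 * len(cum)) - 1]:.0%} of the loss")
```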

Conclusion

Focal Loss is designed to address the class imbalance by down-weighting the easy negative examples so that their contribution to the total loss, and hence to the gradient, is smaller. It can be used in a one-stage detector to achieve better performance than training with the standard cross-entropy loss.

I hope that this article will be of some help to you!
