The paper Focal Loss for Dense Object Detection introduces a new self-balancing loss function that addresses the extreme imbalance between foreground and background examples encountered by one-stage object detection networks.

y : binary class label, in {+1, -1}

p : the model's estimated probability that the input belongs to class y = +1

Given the **Cross Entropy (CE)** loss for binary classification:

CE(p, y) =

-log(p), if y = +1

-log(1 - p), if y = -1

The paper introduces the **Focal Loss (FL)** term as follows:

FL(p, y) =

-(1-p)^gamma * log(p), if y = +1

-(p)^gamma * log(1-p), if y = -1

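Here gamma >= 0 is the focusing parameter. With gamma = 0 the focal loss is disabled and FL reduces to the default CE; the paper reports gamma = 2 as working best in its experiments.

Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss.

Easy examples are those the model already classifies confidently and correctly: p close to 1 when y = +1, or p close to 0 when y = -1.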
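The two definitions above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names and the helper variable `p_t` (the probability assigned to the true class) are chosen here for readability, and the code assumes 0 < p < 1 so that the logarithm is defined.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Binary cross entropy; p is the predicted probability of class +1."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Cross entropy scaled by the modulating factor (1 - p_t) ^ gamma."""
    # p_t is the probability the model assigns to the true class
    # (illustrative helper name, assumes 0 < p < 1).
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Setting `gamma=0` makes the modulating factor equal to 1, so `focal_loss` returns the same value as `cross_entropy`.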

Example 1

gamma = 2.0

p = 0.9

y = +1

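FL(0.9, +1) = - ( 1 - 0.9 ) ^ 2.0 * log(0.9) = 0.000458

CE(0.9, +1) = - log(0.9) = 0.0458

(The numbers in these examples use log base 10. The ratio FL / CE = ( 1 - p ) ^ gamma does not depend on the base.)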

Example 2

gamma = 2.0

p = 0.99

y = +1

FL(0.99, +1) = - ( 1 - 0.99 ) ^ 2.0 * log(0.99) = 0.000000436

CE(0.99, +1) = - log(0.99) = 0.00436

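That means a near certainty (a very easy example) will have a very small FL compared to its cross entropy loss: with gamma = 2 the loss is scaled down by a factor of 100 at p = 0.9 and by 10,000 at p = 0.99. An ambiguous result (p close to 0.5) is only scaled down by a factor of 4, so it has a much larger relative effect on the total loss.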
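A quick way to see this effect is to compare the two losses for a few confidence levels. The sketch below is illustrative only: `p_t` (a name chosen here) stands for the probability assigned to the true class, and the natural log is used, so the absolute values depend on the logarithm base while the CE / FL ratio does not.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Cross entropy given the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Focal loss given the probability assigned to the true class."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)


# Ambiguous, easy, and very easy examples.
for p_t in (0.5, 0.9, 0.99):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t)
    print(f"p_t={p_t:<5} CE={ce:.6f} FL={fl:.8f} CE/FL={ce / fl:,.0f}")
```

With gamma = 2 the down-weighting factor is roughly 4 for the ambiguous example, 100 for p_t = 0.9, and 10,000 for p_t = 0.99.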

In practice the authors use an **a-balanced** variant of FL:

FL(p, y) =

-a(y) * ( 1 - p ) ^ gamma * log(p), if y = +1

-a(y) * ( p ) ^ gamma * log(1 - p), if y = -1

Here a(y) is a class-dependent weight (a for y = +1 and 1 - a for y = -1) that counteracts the class imbalance. The authors report that this form yields slightly better accuracy than the non-a-balanced form.
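As a rough sketch, the a-balanced form only adds one more multiplier to the focal loss. The code below assumes the paper's convention that the weight is a for the positive class and 1 - a for the negative class; the function name and the default values (gamma = 2, a = 0.25, the setting the paper reports as working best) are illustrative choices, not a reference implementation.

```python
import math


def alpha_balanced_focal_loss(
    p: float, y: int, gamma: float = 2.0, alpha: float = 0.25
) -> float:
    """Focal loss with a class-dependent weight a(y); assumes 0 < p < 1."""
    # Probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # Assumed convention: alpha for y = +1, (1 - alpha) for y = -1.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `alpha=0.5` both classes are weighted equally, so the result is simply half of the unweighted focal loss.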

The authors then go and build a network to show off the capabilities of their loss function. The network is called **RetinaNet** and it's a standard **Feature Pyramid Network** (**FPN**) backbone with two subnets (one for object classification, one for box regression) attached at each feature map. It's a very common design for a one-stage detector, similar to SSD and YOLO (edit: essentially the same design as SSD). The slight differences are that a prior is used when initializing the bias of the object classification subnet, and that the total loss is summed over all anchors but normalized only by the number of anchors assigned to a ground-truth box.
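The bias prior and the loss normalization can be illustrated with a small sketch. It assumes the scheme described in the paper: the final classification bias is set to b = -log((1 - pi) / pi) so that every anchor starts out predicting foreground with probability pi (the paper uses pi = 0.01), and the summed loss is divided by the number of anchors matched to a ground-truth box. The function names and the guard against zero matched anchors are additions made here for illustration.

```python
import math


def prior_bias(pi: float = 0.01) -> float:
    """Bias value whose sigmoid equals the foreground prior pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def normalized_total_loss(anchor_losses: list[float], num_matched: int) -> float:
    """Sum per-anchor losses, normalized by the number of matched anchors."""
    # max(1, ...) is an assumed guard for images with no matched anchors.
    return sum(anchor_losses) / max(1, num_matched)


b = prior_bias(0.01)
print(f"bias = {b:.4f}, initial foreground probability = {sigmoid(b):.4f}")
```

Starting every anchor at a low foreground probability keeps the very large number of background anchors from dominating the loss in the first training iterations.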