Sunday, April 15, 2018

Review : Focal Loss for Dense Object Detection

The paper Focal Loss for Dense Object Detection introduces a new self-balancing loss function that addresses the extreme imbalance between foreground and background examples encountered when training one-stage object detection networks.

y : the binary class label, y in {+1, -1}
p : the model's estimated probability for the class y = +1

Given Cross Entropy (CE) loss for binary classification:
CE(p, y) =
-log(p),     if y = +1
-log(1 - p), if y = -1
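
As a minimal sketch (function name is mine, natural logarithm), the piecewise CE above maps directly to Python:

import math

def binary_ce(p, y):
    # p: predicted probability of the positive class, y: +1 or -1
    return -math.log(p) if y == +1 else -math.log(1.0 - p)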

The paper introduces the Focal Loss (FL) term as follows
FL(p,y) =
-(1-p)^gamma * log(p), if y = +1
-(p)^gamma * log(1-p), if y = -1

With gamma >= 0 acting as a tunable focusing parameter: gamma = 0 disables the focal term and recovers standard CE, and the authors find gamma = 2 to work best in their experiments.
Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss.
Easy examples are those classified with high confidence: p close to 1 for positives (y = +1) and p close to 0 for negatives (y = -1).
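
A single-example sketch of FL (function name and default gamma are illustrative; gamma = 0 falls back to plain CE):

import math

def binary_focal_loss(p, y, gamma=2.0):
    # p: predicted probability of the positive class, y: +1 or -1
    # gamma: focusing parameter; gamma = 0 recovers standard cross entropy
    if y == +1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)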

Example 1 (logarithms in these examples are base 10)
gamma = 2.0
p = 0.9
y = +1
FL(0.9, +1) = - ( 1 - 0.9 ) ^ 2.0 * log(0.9) = 0.00045 
CE(0.9, +1) = - log(0.9) = 0.0457

Example 2
gamma = 2.0
p = 0.99
y = +1
FL(0.99, +1) = - ( 1 - 0.99 ) ^ 2.0 * log(0.99) = 0.000000436
CE(0.99, +1) = - log(0.99) = 0.00436

That means a near-certain prediction (a very easy example) incurs a far smaller focal loss than cross-entropy loss, while an ambiguous prediction (p close to 0.5) keeps a relatively much larger share of the loss.
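
To make the down-weighting concrete, a short loop (gamma = 2, y = +1, base-10 log to match the numbers above) shows how quickly the FL/CE ratio shrinks as p approaches 1:

import math

for p in (0.5, 0.9, 0.99):
    ce = -math.log10(p)                 # CE for y = +1
    fl = ((1.0 - p) ** 2.0) * ce        # focal loss with gamma = 2
    print(f"p={p:.2f}  CE={ce:.6f}  FL={fl:.9f}  FL/CE={fl / ce:.4f}")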

In practice the authors use an alpha-balanced variant of FL:

FL(p, y) =
-alpha(y) * ( 1 - p ) ^ gamma * log(p),     if y = +1
-alpha(y) * ( p ) ^ gamma * log(1 - p),     if y = -1

Where alpha(y) is a class-dependent weighting factor that counteracts the class imbalance (alpha for the positive class, 1 - alpha for the negative class). This form yields slightly better accuracy than the non-alpha-balanced form.
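
The alpha-balanced variant just adds the class weight in front of the modulating factor; a sketch with alpha = 0.25 and gamma = 2, the values the paper reports as working best (function name is mine):

import math

def alpha_balanced_focal_loss(p, y, gamma=2.0, alpha=0.25):
    # alpha weighs the (rare) positive class, 1 - alpha the negatives
    if y == +1:
        return -alpha * ((1.0 - p) ** gamma) * math.log(p)
    return -(1.0 - alpha) * (p ** gamma) * math.log(1.0 - p)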

The authors then build a network, RetinaNet, to demonstrate the capabilities of their loss function. It consists of a standard Feature Pyramid Network (FPN) backbone with two subnetworks (one for object classification, one for box regression) attached at each feature map level. This is a very common layout for a one-stage detector, similar to SSD (in fact essentially the same as SSD) and YOLO. The notable differences are the prior probability used to initialize the bias of the classification subnet's final layer, and how the total cost is accumulated: the focal loss is summed over all anchors but normalized by the number of anchors assigned to ground-truth boxes.
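
For the prior-based initialization, the paper sets the final classification bias to b = -log((1 - pi) / pi) with pi = 0.01, so every anchor starts out predicting foreground with probability ~0.01 and the huge number of background anchors cannot dominate the loss early in training (variable names below are mine):

import math

pi = 0.01                                   # prior probability of foreground
bias_init = -math.log((1.0 - pi) / pi)      # sigmoid(bias_init) == pi
print(bias_init)                            # roughly -4.6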



For a high level understanding of deep learning click here