The paper Focal Loss for Dense Object Detection introduces a new, self-balancing loss function that aims to address the extreme imbalance between foreground and background examples found in one-stage object detection networks.
y : ground-truth binary class label, y ∈ {+1, -1}
p : the model's estimated probability for the class y = +1
Given Cross Entropy (CE) loss for binary classification:
CE(p, y) =
-log(p), if y = +1
-log(1 - p), if y = -1
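For concreteness, here is a minimal Python/NumPy sketch of this binary cross entropy (the function name is my own choice, and it uses the natural log; the worked examples later in this post use base-10 logs, which only rescales the values):

import numpy as np

def binary_cross_entropy(p, y):
    # p: predicted probability of the positive class, y: label in {+1, -1}
    return -np.log(p) if y == 1 else -np.log(1.0 - p)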
The paper introduces the Focal Loss (FL) term as follows:
FL(p,y) =
-(1-p)^gamma * log(p), if y = +1
-(p)^gamma * log(1-p), if y = -1
The focusing parameter gamma is tunable: gamma = 0 disables the modulating factor and recovers plain CE, and the authors find gamma = 2 works best in their experiments.
Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range of probabilities in which an example receives low loss. Easy examples are those the model already classifies well: p close to 1 when y = +1, or p close to 0 when y = -1.
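A minimal sketch of the focal loss along the same lines (again, my own naming; setting gamma = 0 reduces it to the cross entropy above):

import numpy as np

def binary_focal_loss(p, y, gamma=2.0):
    # p_t is the probability the model assigns to the true class
    p_t = p if y == 1 else 1.0 - p
    # (1 - p_t)^gamma shrinks the loss of well-classified (high p_t) examples
    return -((1.0 - p_t) ** gamma) * np.log(p_t)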
Example 1
gamma = 2.0
p = 0.9
y = +1
FL(0.9, +1) = - ( 1 - 0.9 ) ^ 2.0 * log(0.9) = 0.00045
CE(0.9, +1) = - log(0.9) = 0.0457
Example 2
gamma = 2.0
p = 0.99
y = +1
FL(0.99, +1) = - ( 1 - 0.99 ) ^ 2.0 * log(0.99) = 0.000000436
CE(0.99, +1) = - log(0.99) = 0.00436
That means a near certainty (a very easy example) incurs a much smaller focal loss than cross-entropy loss, while an ambiguous prediction (p ~ 0.5) keeps far more of its weight. (The numbers above use base-10 logarithms; the FL/CE ratio is the same for any log base.)
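Running the two examples through the sketches above makes the down-weighting explicit. The sketches use the natural log rather than the base-10 values quoted above, but the FL/CE ratio of (1 - p)^gamma is the same in any base:

for p in (0.9, 0.99):
    ce = binary_cross_entropy(p, +1)
    fl = binary_focal_loss(p, +1, gamma=2.0)
    print(f"p={p}: CE={ce:.5f}  FL={fl:.7f}  FL/CE={fl / ce:.6f}")
# p=0.9  -> FL/CE = 0.01   (loss reduced 100x)
# p=0.99 -> FL/CE = 0.0001 (loss reduced 10,000x)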
In practice the authors use an alpha-balanced variant of FL:
FL(p,y) =
-alpha(y) * ( 1 - p ) ^ gamma * log(p), if y = +1
-alpha(y) * ( p ) ^ gamma * log(1 - p), if y = -1
Where alpha(y) is a weighting factor that counteracts the class imbalance: alpha for the positive (foreground) class and 1 - alpha for the negative (background) class. This form yields slightly better accuracy than the non-alpha-balanced form; the paper finds alpha = 0.25 works best together with gamma = 2.
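Extending the earlier sketch with the alpha term (the defaults below follow the settings the paper reports as best, alpha = 0.25 with gamma = 2; the function name is mine):

import numpy as np

def alpha_balanced_focal_loss(p, y, gamma=2.0, alpha=0.25):
    # alpha weights the rare foreground class; the background class gets 1 - alpha
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * ((1.0 - p_t) ** gamma) * np.log(p_t)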
The authors then build a network to showcase their loss function. The network, called RetinaNet, is a standard Feature Pyramid Network (FPN) backbone with two subnets (one for object classification, one for box regression) attached at each pyramid level. It's a very common one-stage detector design, similar to SSD and YOLO. The notable implementation differences are the prior used when initializing the bias of the final object-classification layer, and the way the total loss is computed: the focal loss is summed over all anchors and normalized by the number of anchors assigned to a ground-truth box.
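As a rough sketch of that prior-based initialization: the paper sets the bias of the final classification layer so that, at the start of training, every anchor predicts foreground with probability about pi = 0.01, which keeps the enormous number of background anchors from dominating the loss early on. The PyTorch layer below is an illustrative stand-in, not the paper's actual code:

import math
import torch.nn as nn

num_anchors, num_classes = 9, 80  # typical RetinaNet settings
cls_head = nn.Conv2d(256, num_anchors * num_classes, kernel_size=3, padding=1)

# Pick the bias b so that sigmoid(b) = pi, i.e. b = -log((1 - pi) / pi).
pi = 0.01
nn.init.constant_(cls_head.bias, -math.log((1.0 - pi) / pi))
nn.init.normal_(cls_head.weight, std=0.01)  # small Gaussian weight init, as in the subnets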
For a high-level understanding of deep learning, click here.