This paper was kind of overlooked last year (even though it appeared at NIPS), but it caught my attention.
The paper suggests replacing the usual softmax output layer for classification (the paper makes this suggestion for NLP problems, but the theory holds for any domain with a large number of classes) because its very design produces an information bottleneck.
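The bottleneck is easy to see numerically: over any set of contexts, the log-softmax outputs form a matrix whose rank is capped by the hidden size, no matter how rich the true distribution is. A minimal numpy sketch of that rank argument (all sizes made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, d = 500, 2000, 64             # contexts, classes, hidden size (made up)

H = rng.standard_normal((N, d))     # context embeddings
W = rng.standard_normal((V, d))     # output-class embeddings

logits = H @ W.T                    # N x V logit matrix, rank capped at d
log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

print(np.linalg.matrix_rank(logits))     # at most d = 64
print(np.linalg.matrix_rank(log_probs))  # at most d + 1 (row shifts add one)
```

No matter how many classes V you have, the log-probability matrix softmax can express never exceeds rank d + 1, which is the bottleneck the paper is talking about.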
I have yet to see this applied to computer vision problems, but I will implement it in TensorFlow or Caffe and report back.
Its essence boils down to the following excerpts:
The paper then goes on to suggest the properties of a replacement activation layer:
and the final outcome is this:
which satisfies the properties defined earlier.
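Since the excerpts themselves don't reproduce well here, take this with a grain of salt: my best guess is that the replacement is the sigsoftmax form, exp(z_i)·σ(z_i) normalized over the classes. A minimal numpy sketch under that assumption (the function names are mine):

```python
import numpy as np

def log_sigsoftmax(z, axis=-1):
    # log(exp(z) * sigmoid(z)) = z + log(sigmoid(z)) = z - softplus(-z),
    # with softplus(-z) = log(1 + exp(-z)) computed stably via logaddexp.
    log_unnorm = z - np.logaddexp(0.0, -z)
    # normalize with a logsumexp over the class axis
    return log_unnorm - np.logaddexp.reduce(log_unnorm, axis=axis, keepdims=True)

def sigsoftmax(z, axis=-1):
    return np.exp(log_sigsoftmax(z, axis=axis))

z = np.array([[2.0, -1.0, 0.5]])
p = sigsoftmax(z)
print(p, p.sum())   # non-negative entries that sum to 1
```

The point is that the log of this output, z - softplus(-z) - log Z, is nonlinear in the logits, so the rank cap demonstrated above no longer applies; the output also stays non-negative, normalized, and monotonically increasing in each logit, which matches the kind of properties the excerpts describe.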