Saturday, April 21, 2018

The "No Free Lunch" Theorem

The "No Free Lunch" theorem was first published by David Wolpert and William Macready in their 1997 paper "No Free Lunch Theorems for Optimization".

In computational complexity and optimization the no free lunch theorem is a result that states that for certain types of mathematical problems, the computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method. No solution therefore offers a 'short cut'. 

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. However, to decide what data to discard and what to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.

David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the "No Free Lunch Theorem" (NFL).

NFL states that no model is a priori guaranteed to work better. The only way to know for sure which model is the best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models.
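
The claim can be checked by brute force on a toy problem. The sketch below is only an illustration under assumptions of my own (3-bit inputs, 5 training points, two made-up learners called always_zero and majority_vote): it averages the accuracy on the unseen points over every possible target function.

```python
from itertools import product

# All 8 possible inputs of a 3-bit binary classification problem.
inputs = list(product([0, 1], repeat=3))
train_x = inputs[:5]   # points whose labels the learner gets to see
test_x = inputs[5:]    # unseen points used to score the learner


def always_zero(train, x):
    """Hypothetical learner that ignores the data and always predicts 0."""
    return 0


def majority_vote(train, x):
    """Hypothetical learner that predicts the most common training label."""
    ones = sum(label for _, label in train)
    return 1 if 2 * ones > len(train) else 0


def mean_test_accuracy(learner):
    """Average accuracy on the unseen points over every possible target function."""
    hits = 0
    targets = list(product([0, 1], repeat=len(inputs)))  # all 2^8 labelings
    for labels in targets:
        target = dict(zip(inputs, labels))
        train = [(x, target[x]) for x in train_x]
        hits += sum(learner(train, x) == target[x] for x in test_x)
    return hits / (len(targets) * len(test_x))


print(mean_test_accuracy(always_zero))    # 0.5
print(mean_test_accuracy(majority_vote))  # 0.5
```

Both learners land on exactly 0.5, because with no assumption about the target the unseen labels are equally likely to be 0 or 1 whatever the training data says. A learner only pulls ahead once the set of plausible targets is restricted.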




Sunday, April 15, 2018

Review : Focal Loss for Dense Object Detection

The paper Focal Loss for Dense Object Detection introduces a new self-balancing loss function that aims to address the huge imbalance between foreground and background examples found in one-stage object detection networks.

y : ground-truth binary class {+1, -1}
p : the model's estimated probability that the input belongs to class y = +1

Given Cross Entropy (CE) loss for binary classification:
CE(p, y) =
-log(p) ,  if y = 1
-log(1 - p), if y = -1

The paper introduces the Focal Loss (FL) term as follows
FL(p,y) =
-(1-p)^gamma * log(p), if y = +1
-(p)^gamma * log(1-p), if y = -1

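Here gamma >= 0 is the focusing parameter: gamma = 0 disables the focal term and recovers the default CE, and the authors report that gamma = 2 worked best in their experiments.
Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss.
Easy examples are those the model already classifies confidently and correctly: p close to 1 when y = +1, or p close to 0 when y = -1.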

Example 1 (logarithms in these examples are base 10; the FL/CE ratio is the same in any base)
gamma = 2.0
p = 0.9
y = +1
FL(0.9, +1) = - ( 1 - 0.9 ) ^ 2.0 * log(0.9) = 0.000458
CE(0.9, +1) = - log(0.9) = 0.0458

Example 2
gamma = 2.0
p = 0.99
y = +1
FL(0.99, +1) = - ( 1 - 0.99 ) ^ 2.0 * log(0.99) = 0.000000436
CE(0.99, +1) = - log(0.99) = 0.00436

That means a near certainty (a very easy example) will have a very small FL compared to its cross entropy loss, while an ambiguous result (p close to 0.5) will have a much higher relative effect.
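
The two definitions are small enough to try out directly. This is a minimal sketch, not the authors' code; the function names are mine, and it uses the natural logarithm, so the absolute values differ from the worked examples by a constant factor while the ratios stay the same.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy; p is the estimated probability of class +1."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def focal_loss(p, y, gamma=2.0):
    """Cross entropy scaled by the modulating factor (1 - p_t)^gamma."""
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


for p in (0.5, 0.9, 0.99):
    ce, fl = cross_entropy(p, +1), focal_loss(p, +1)
    print(f"p={p}: CE={ce:.6f}  FL={fl:.6f}  CE/FL={ce / fl:.0f}")
```

The ratio CE/FL equals 1 / (1 - p_t)^gamma, so with gamma = 2 an example at p = 0.5 is down-weighted 4 times, one at p = 0.9 is down-weighted 100 times and one at p = 0.99 is down-weighted 10000 times.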

In practice the authors use an a-balanced (alpha-balanced) variant of FL:

FL(p,y) = 
-a(y) * ( 1 - p ) ^ gamma * log(p), if y = +1
-a(y) * ( p )  ^ gamma * log(1 - p), if y = -1

Where a(y) is a multiplier term fixing the class imbalance. This form yields slightly improved accuracy over the non-a-balanced form.
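
A sketch of the a-balanced form follows. It assumes, as I recall from the paper, that a(y) = alpha for y = +1 and 1 - alpha for y = -1, with alpha = 0.25 as the reported default alongside gamma = 2; treat both numbers as assumptions to verify against the paper. The function name is my own.

```python
import math


def alpha_balanced_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """a-balanced focal loss; p is the estimated probability of class +1."""
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    a_t = alpha if y == 1 else 1.0 - alpha    # a(y), the class weight (assumed form)
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


# One hard foreground example against 1000 easy background examples.
foreground = alpha_balanced_focal_loss(0.3, +1)
background = sum(alpha_balanced_focal_loss(0.01, -1) for _ in range(1000))
print(foreground, background)
```

In this toy setting the single hard positive contributes more loss than the thousand easy negatives combined, which is the imbalance fix the loss is aiming for.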

The authors then build a network to show off the capabilities of their loss function. The network is called RetinaNet and it is a standard Feature Pyramid Network (FPN) backbone with two subnets (one for object classification, one for box regression) attached at each feature map. It's a very common design for a one-stage detector, similar to SSD (edit: exactly the same as SSD) and YOLO. A slight difference is the prior used when initializing the bias of the object classification subnet, and the sparse calculation when summing the total cost.
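
The bias prior can be sketched in a few lines. As far as I recall, the paper sets the final classification bias to b = -log((1 - pi) / pi) with pi = 0.01, so that every anchor starts out predicting foreground with probability pi; the value of pi and the function names here are assumptions for illustration.

```python
import math


def prior_bias(pi=0.01):
    """Bias that makes a sigmoid output equal to pi when the weights contribute 0."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


b = prior_bias(0.01)
print(b)            # a negative bias, roughly -4.6
print(sigmoid(b))   # recovers the prior probability pi
```

Starting from a low foreground probability keeps the huge number of background anchors from dominating the loss in the first training iterations.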



For a high level understanding of deep learning click here

Thursday, February 8, 2018

Critique on "Deep Learning: A Critical Appraisal"

Deep Learning: A Critical Appraisal 

Gary Marcus argues that deep learning is:
1. Shallow : Meaning it has limited capacity for transfer 
2. Data Hungry: Requires millions of examples to generalize sufficiently
3. Not transparent enough: It is treated as a black box

I'm not an academic but I've been reading research papers and I've seen a huge effort on all 3 fronts (kudos to https://blog.acolyer.org/).

New architectures and layers that require far less data and can be used for several unrelated tasks.
A lot of approaches to opening the black box, based on anything from MDL to information theory and statistics, for interpreting the weights, layers and results.

It's not all doom and gloom, but the huge milestone jumps like the ones we had in the last 5 years in most AI/ML tasks are probably in the past. What we will see is a culling of a lot of bad tech and hype and the quiet rise of Differentiable Neural Computing.




Monday, January 29, 2018

Peter Thiel's 7 questions on startups

From Peter Thiel's "Zero to One", notes on startups

All excellent questions to ask before you start any venture:
  1. Engineering : Can you create breakthrough technology instead of incremental improvements ?
  2. Timing : Is now the right time to start your particular business ?
  3. Monopoly : Are you starting with a big share of a small market ?
  4. People : Do you have the right team ?
  5. Distribution : Do you have a way to not just create but deliver your product ?
  6. Durability : Will your market position be defensible 10 and 20 years into the future ? 
  7. Secret : Have you identified a unique opportunity that others don’t see ? 


Wednesday, January 10, 2018

Compiling Tensorflow under Debian Linux with GPU support and CPU extensions


Tensorflow is a wonderful tool for Differentiable Neural Computing (DNC) and has enjoyed great success and market share in the Deep Learning arena. We usually use it with Python in a prebuilt fashion, installed from the Anaconda or pip repositories. What we miss that way is the chance to enable optimizations that make better use of our processing capabilities, as well as the chance to do some lower level computing using C/C++.

The purpose of this post is to be a guide for compiling Tensorflow r1.4 on Linux with CUDA GPU support and the high performance AVX and SSE CPU extensions.

This guide is largely based on the official Tensorflow Guide and this snippet with some bug fixes from my side.

1. Install Python dependencies:


sudo apt-get install python-numpy python-dev python-pip python-wheel python-setuptools

2. Install GPU prerequisites:
  • CUDA developer and drivers
  • CUDNN developer and runtime
  • CUBLAS
Make sure the cuDNN libs are copied inside the cuda/lib64 directory, usually found under /usr/local/cuda.


sudo apt-get install libcupti-dev

3. Install Bazel, Google's custom build tool:


sudo apt-get install openjdk-8-jdk

echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list

curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -

sudo apt-get update && sudo apt-get install bazel

sudo apt-get upgrade bazel


4. Configure Tensorflow:


git clone https://github.com/tensorflow/tensorflow

cd tensorflow

git checkout r1.4

## don't use clang for nvcc backend [https://github.com/tensorflow/tensorflow/issues/11807] 
## when asked for the path to the gcc compiler, make sure it points to a version <= 5 
./configure


5. Compile with the SSE and AVX flags and install using pip:


# set locale to en_US [https://github.com/tensorflow/tensorflow/issues/36]

export LC_ALL=en_US.UTF-8

export LANG=en_US.UTF-8

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --incompatible_load_argument_is_label=false --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

sudo pip install /tmp/tensorflow_pkg/tensorflow-1.4.1*
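
The --copt flags in the bazel command above only help if the CPU actually supports those extensions, and a binary built with an unsupported one will crash with an illegal instruction. A small sketch for checking this on Linux is below; the helper name supported_copts is my own, and it assumes the usual "flags" line format of /proc/cpuinfo.

```python
# Map /proc/cpuinfo flag names to the matching gcc options used in the build.
GCC_FLAGS = {
    "avx": "-mavx",
    "avx2": "-mavx2",
    "fma": "-mfma",
    "sse4_2": "-msse4.2",
}


def supported_copts(cpuinfo_text):
    """Return the --copt options whose CPU flag appears in the cpuinfo text."""
    cpu_flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            cpu_flags.update(line.split(":", 1)[1].split())
    return ["--copt=" + opt for flag, opt in GCC_FLAGS.items() if flag in cpu_flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(" ".join(supported_copts(f.read())))
    except FileNotFoundError:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Any flag missing from the printed list should be dropped from the bazel command line.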


6. Test that everything works:


cd ~/

python

>>> import tensorflow as tf

>>> session = tf.InteractiveSession()
>>> init = tf.global_variables_initializer()
## At this point, if you get a malloc.c assertion failure, it is due to a wrong CUDA configuration (i.e. not using the runtime version)

At this point there should not be any CPU warning and the GPU should be initialized.

If you get a nasm broken link error:
edit tensorflow/tensorflow/workspace.bzl and add an extra link

urls = [
          "https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",  
          "http://www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
          "http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.12.02.tar.bz2/d15843c3fb7db39af80571ee27ec6fad/nasm-2.12.02.tar.bz2",
      ]


Thursday, August 31, 2017

soviet iphone

I bought an iPhone for a contract app. When my Android phone died after 4 years of use I thought I would use it. How different can it be?

It sure is shiny, fast, works well under stress and the battery lasts considerably longer BUT what you give up for convenience is flexibility. Try doing anything out of the ordinary, like changing your freaking ringtone to a custom tune.

I checked, it takes 12 steps. Part of that is installing the malware mess called iTunes.

So you DON'T OWN your phone. iTunes owns your phone and it will decide to do whatever the hell it wants with it. The iPhone is your soviet apartment and iTunes is your assigned commissar. You want something changed? Have fun complaining to him, you may end up with no apartment. I won't even go into the privacy issues because ... you know ... Google.

I'll still keep it because it cost me 600 euros, but only as a dumb media player.


Friday, August 11, 2017

Confucius on settings

For every project there's a configuration.

In pet projects it may be a handful of constants and in bigger projects it may be a server configuration.

The simple truth is that configuration usually keeps building up, so the more successful the project, the bigger the configuration.

It is wiser to build it from the start than collect everything mid-way.




Saturday, July 29, 2017

Yodiwo joins Endeavor network


Yodiwo has been officially accepted in the Endeavor network.
Entrepreneur: Alex Maniatopoulos
Company: Yodiwo
Description: What isn’t connected to the Internet nowadays? Though more systems are coming online, implementing Internet of Things (IoT) projects is incredibly complex, time-intensive, and costly. Engineers write thousands of lines of code in order to connect expensive smart devices, manage system workflows, and measure results. Yodiwo offers an affordable, code-free IoT application enablement platform plus customized solutions to expedite the process of constructing IoT applications and interconnected networks. Yodiwo’s proprietary three-tiered platform–Wisper, Cyan, and Alcyone–saves clients 90% on application development time, 30% on system operating costs, and 40-50% on capital expenditure.
Our CEO Alex Maniatopoulos, after a rigorous multiday interview, convinced the panellists that Yodiwo has all it takes to join their successful network.

This great milestone opens a brand new world for Yodiwo and many business opportunities.


Thursday, July 6, 2017

Deep Learning : Why you should use gradient clipping

One new tip that I got reading Deep Learning is clipping gradients. It's been common knowledge amongst practitioners for years but somehow I missed it.

Problem

The problem with strongly nonlinear objective functions, such as those computed in recurrent or deep networks, is that their derivatives tend to be either very large or very small in magnitude.

The steep regions resemble cliffs and they result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether, undoing much of the work that had been done to reach the current solution.

The gradient tells us the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside this tiny region, the cost function may begin to curve back upward. The update must be chosen to be small enough to avoid traversing too much upward curvature. 

Solution

One solution would be to use a very small learning rate. This is problematic as it will slow training and may settle in a sub-optimal region. A much better solution is clipping the gradient (or norm-clipping). There are many instantiations of this idea, but the main concept is to cap the gradient norm at a maximum value: if the norm exceeds that value, the gradient is rescaled so that its norm equals the cap. This retains the direction but limits the step size.
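
Norm clipping fits in a few lines. The sketch below is framework-free and only meant to show the idea; the names clip_by_norm and sgd_step are my own, and real frameworks apply the same rescaling to the concatenation of all parameter gradients.

```python
import math


def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad (a list of floats) so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def sgd_step(params, grad, lr=0.1, max_norm=1.0):
    """One plain SGD update that uses the clipped gradient."""
    clipped = clip_by_norm(grad, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]


print(clip_by_norm([3.0, 4.0]))   # norm 5 is rescaled to norm 1, direction kept
print(clip_by_norm([0.3, 0.4]))   # norm 0.5 is below the cap and left untouched
```

Because the whole vector is multiplied by a single scale factor, the update still points along the steepest descent direction; only its length is bounded by lr * max_norm.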

Thoughts

If your job involves training a lot of deep learning models automatically, then you should eliminate any unpredictable steps that require manual labor. We are engineers after all, so whatever can be automated should be automated, and no more. For me the problem was the unpredictability of the training: for a percentage of initializations the gradient would explode during training. The reactive solution was to lower the learning rate, but that costs time and money. In addition, I wanted something that always works and thus can be automated. Gradient clipping worked nicely in this regard and it allowed me to raise the learning rate so that the training converges much faster.

Conclusion

Use gradient clipping everywhere; my default option is to limit the norm to 1. In Caffe it is a single line in the solver, and if your framework doesn't support it, it is easy to implement yourself. You will save yourself enormous headaches and time.

* From the book "Deep Learning"


Monday, July 3, 2017

Udacity AI nanodegree

I enrolled in Udacity's AI nanodegree 2 months ago and I just learned I was accepted.
I thought it would be a good refresher and maybe fill in some knowledge gaps I have.
The reviews on the net are pretty good so I'm pretty sure it will be a great experience especially since there will be AI legends like Peter Norvig doing the teaching.

The curriculum consists of five parts:

  1.  Foundations of AI : In this Term, you'll learn the foundations of AI with Sebastian Thrun, Peter Norvig, and Thad Starner. We'll cover Game-Playing, Search, Optimization, Probabilistic AIs, and Hidden Markov Models. 
  2.  Deep Learning and Applications : In this term, you'll learn the cutting edge advancements of AI and Deep Learning. You'll get the chance to apply Deep Learning on a variety of different topics including Computer Vision, Speech, and Natural Language Processing. We'll cover Convolutional Neural Networks, Recurrent Neural Networks, and other advanced models. 
  3. Computer Vision : In this module, you will learn how to build intelligent systems that can see and understand the world using Computer Vision. You'll learn fundamental techniques for tasks like Object Recognition, Face Detection, Video Analysis, etc., and integrate classic methods with more modern Convolutional Neural Networks.
  4. Natural Language Processing : In this module, you will build end-to-end Natural Language Processing pipelines, starting from text processing, to feature extraction and modeling for different tasks such as Sentiment Analysis, Spam Detection and Machine Translation. You'll also learn how to design Recurrent Neural Networks for challenging NLP applications.
  5. Voice User Interfaces : This module will help you get started in the exciting and fast-growing area of designing Voice User Interfaces! You'll learn how to build Conversational Agents for products and services more natural to interact with. You will also dive deeper into the core challenge of Speech Recognition, applying Recurrent Neural Networks to solve it.
In my projects so far I've mostly tackled Computer Vision and Predictive Analytics problems, so it would be a nice change to dive into NLP and Voice processing.
I hope I can fit it in my busy schedule and I'll try to write some posts describing the experience for any future students.

Wednesday, June 14, 2017

Classic Machine Learning Literature

I'm often asked by software engineers what to read to get into the Machine Learning world.
For that purpose I've compiled a list of Machine Learning and Applied Mathematics books that I've used to gain a deeper understanding.

Machine Learning

We start off with the classic but very dated "Machine Learning" by Tom M. Mitchell.
This was the first one I read on the subject. Low on math, high on intuition, it is a decent introductory book. You can easily implement most of the algorithms described and get a fair understanding of what's going on. For the first couple of years in the business you may use it as a basic reference, but after that you will need the math heavy books.

Pattern Classification

We continue with my personal favorite, "Pattern Classification" 2nd edition by Richard O. Duda, Peter E. Hart and David G. Stork. This impressive book is heavy on applied math, low on proofs and very readable. It suits beginners as well as experienced machine learning engineers. It builds the reader a very good intuition and understanding. The graphs and figures help a lot. I still use it as a reference on some issues.






Pattern Recognition and Machine Learning

A natural extension of "Pattern Classification" is the excellent "Pattern Recognition and Machine Learning" by Bishop. Somewhat heavy on the math, it provides a clear path of understanding but it is not for noobies. You should come into this book with some experience. This excellent book is still very relevant with great introduction on matrix calculus and probability theory.


Probabilistic Graphical Models

Going deeper, I refer to "Probabilistic Graphical Models". This is a subdomain of Machine Learning and it is not for the faint of heart. This massive book is hard, and I mean eyes glazing, concentrate and get a headache hard. If you manage to get through it you will have a greater understanding than most mortals. If however you are like me you are just gonna sample some of the parts and leave the rest for the PhD's.

Deep Learning

A new book that has gained classic status very fast is "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville. I found it very approachable and it left me with a better understanding of deep learning. Very light on math, it concentrates on intuition and best practices rather than proofs. Highly recommended for all DL practitioners.




Back to basics books

Numerical Recipes

Most books rely heavily on linear algebra, probability theory and algorithm "primitives". If you really want to know what's under the hood you should check this out.




Statistical Digital Signal Processing and Modeling

Before the Machine Learning and AI hype there was simply DSP.


Artificial Intelligence: A Modern Approach

A general purpose AI book. Lots of good content, ideas, algorithms and thought process, if a bit dated. I used the second edition; apparently the latest one is a bit better.

Matrix Computations

If you really really want to reinvent the wheel and by wheel I mean super fast BLAS primitives usually found in LAPACK and its variants, look no further than here.

Tuesday, May 16, 2017

If you don't define it then how can you understand it ?


A model's representational capacity is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set (high training error). Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

Overfitting is the situation where a learning algorithm achieves low training error but high test error. Overfitting is a sign of poor generalization.

Simpler models (smaller hypothesis space and smaller capacity) are more likely to generalize (small gap between training and test error); however, complex models are more likely to achieve low training error.

In practice the learning algorithm may not be able to find the best model in the model's hypothesis space. Additional limitations such as the imperfection of the optimization algorithm mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.

Statistical learning theory provides a way to quantify a model's capacity. The Vapnik-Chervonenkis dimension or VC dimension measures the capacity of a binary classifier. It is defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.
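
The definition becomes concrete with the simplest possible classifier. The sketch below is my own toy example, not from the book: it brute-forces which labelings a one-dimensional threshold classifier h(x) = s if x > t else -s can produce, and checks whether a given set of points can be labeled arbitrarily (is "shattered").

```python
from itertools import product


def threshold_labelings(points):
    """All labelings achievable by h(x) = s if x > t else -s, with s in {+1, -1}."""
    xs = sorted(points)
    # One threshold below all points, one between each neighbouring pair, one above.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    achievable = set()
    for t in thresholds:
        for s in (+1, -1):
            achievable.add(tuple(s if x > t else -s for x in points))
    return achievable


def is_shattered(points):
    """True if every one of the 2^m labelings of the points is achievable."""
    wanted = set(product((+1, -1), repeat=len(points)))
    return wanted <= threshold_labelings(points)


print(is_shattered([0.0, 1.0]))        # True
print(is_shattered([0.0, 1.0, 2.0]))   # False: (+1, -1, +1) needs two sign changes
```

Two points can be labeled in all four ways, but for any three distinct points the alternating labeling is out of reach, so the VC dimension of this classifier is 2.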

The most important results of statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Without regularization, any search over the hyperparameters of a model that is judged on the training set would settle on those that maximize the model's capacity, resulting in overfitting.

Bias and variance measure two different sources of error in an estimator. 

Bias measures the expected deviation from the true value of the function or parameter. The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

Variance provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause. The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. When generalization error is measured by Mean Square Error (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias.

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
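
For a point estimator \hat{\theta} of a true parameter \theta, the way bias and variance add up to the Mean Square Error can be written out explicitly. The notation is the standard one and is introduced here only for illustration, since these notes do not define symbols of their own:

```latex
\begin{align}
\mathrm{Bias}(\hat{\theta}) &= \mathbb{E}[\hat{\theta}] - \theta \\
\mathrm{Var}(\hat{\theta})  &= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] \\
\mathrm{MSE} = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] &= \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta})
\end{align}
```

The cross term vanishes because \mathbb{E}[\hat{\theta} - \mathbb{E}[\hat{\theta}]] = 0, which is why the two sources of error can be traded against each other without interacting.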

* from the book Deep Learning


Tuesday, February 21, 2017

The Black Magic of Deep Learning - Tips and Tricks for the practitioner


Spirits guide us to find the correct hyperparameters
I first heard of Deep Learning in 2012 when it gained traction against traditional methods. I followed its evolution but thought it was mostly hype.
Then in January 2015 I was involved in a green field project and I was in charge of deciding the core Machine Learning algorithms to be used in a computer vision platform.

Nothing worked well enough, and if it did it wouldn't generalize, required fiddling all the time, and when introduced to similar datasets it wouldn't perform as well. I was lost. I needed what Deep Learning promised but I was skeptical, so I read the papers, the books and the notes. I then went and put to work everything I learned.

Surprisingly, it was no hype: Deep Learning works and it works well. However, it is such a new concept (even though the foundations were laid in the 70's) that a lot of anecdotal tricks and tips started coming out on how to make the most of it (Alex Krizhevsky covered a lot of them and in some ways pre-discovered batch normalization).

Anyway, to sum up, these are my tricks (that I learned the hard way) to make DNNs tick.
  • Always shuffle. Never allow your network to go through exactly the same minibatch. If your framework allows it shuffle at every epoch. 
  • Expand your dataset. DNNs need a lot of data and the models can easily overfit a small dataset. I strongly suggest expanding your original dataset. If it is a vision task, add noise, whitening, drop pixels, rotate and color shift, blur and everything in between. There is a catch though: if the expansion is too big you will be training mostly with the same data. I solved this by creating a layer that applies random transformations so no sample is ever the same. If you are going through voice data, shift it and distort it.
  • This tip is from Karpathy, before training on the whole dataset try to overfit on a very small subset of it, that way you know your network can converge.
  • Always use dropout to minimize the chance of overfitting. Use it after large (> 256) fully connected or convolutional layers. There is an excellent paper about that (Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning).
  • Avoid LRN pooling, prefer the much faster MAX pooling.
  • Avoid Sigmoid and TanH gates; they are expensive, they saturate and they may stop back propagation. In fact the deeper your network the less attractive Sigmoids and TanHs are. Use the much cheaper and more effective ReLUs and PReLUs instead. As mentioned in Deep Sparse Rectifier Neural Networks they promote sparsity and their back propagation is much more robust.
  • Don't use ReLU or PReLU gates before max pooling; instead apply them after it to save computation.
  • Don't use ReLUs, they are so 2012. Yes, they are a very useful non-linearity that solved a lot of problems. However, try fine-tuning a new model and watch nothing happen because of bad initialization with ReLUs blocking backpropagation. Instead use PReLUs with a very small multiplier, usually 0.1. Using PReLUs converges faster and will not get stuck like ReLUs during the initial stages (see Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification). ELUs are still good but expensive.
  • Use Batch Normalization (check paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift) ALWAYS. It works and it is great. It allows faster convergence ( much faster) and smaller datasets. You will save time and resources.
  • I don't like removing the mean as many do, I prefer squeezing the input data to [-1, +1]. This is more of a training and deployment trick rather than a performance trick.
  • Always go for the smaller models. If you are working on and deploying deep learning models like me, you quickly understand the pain of pushing gigabytes of models to your users or to a server on the other side of the world. Go for the smaller models even if you lose some accuracy.
  • If you use the smaller models try ensembles. You can usually boost your accuracy by ~3% with an ensemble of 5 networks.
  • Use Xavier initialization, but only on large fully connected layers; avoid it on the CNN layers (see "An explanation of Xavier initialization").
  • If your input data has a spatial parameter try to go for CNNs end to end. Read and understand SqueezeNet; it is a new approach and works wonders. Try applying the tips above to it.
  • Modify your models to use 1x1 CNN's layers where it is possible, the locality is great for performance. 
  • Don't even try to train anything without a high end GPU.
  • If you are making templates out of models or your own layers, parameterize everything, otherwise you will be rebuilding your binaries all the time. You know you will.
  • And last but not least, understand what you are doing. Deep Learning is the Neutron Bomb of Machine Learning. It is not to be used everywhere and always. Understand the architecture you are using and what you are trying to achieve; don't mindlessly copy models.
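
A few of the tips above are easy to show in code. The sketch below is framework-free and purely illustrative, with helper names of my own: it squeezes 8-bit pixel values into [-1, +1], applies a PReLU-style activation with the 0.1 multiplier mentioned above (fixed here, learned in a real network), and reshuffles the minibatches at every epoch.

```python
import random


def scale_to_unit_range(values, lo=0.0, hi=255.0):
    """Linearly map values from [lo, hi] to [-1, +1]."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def prelu(x, slope=0.1):
    """PReLU-style activation with a fixed negative slope."""
    return x if x > 0 else slope * x


def minibatches(samples, batch_size, rng):
    """Yield minibatches in a fresh random order; call once per epoch."""
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


rng = random.Random(0)
pixels = scale_to_unit_range([0, 127.5, 255])
print(pixels)                        # [-1.0, 0.0, 1.0]
print([prelu(v) for v in pixels])    # negative inputs are scaled, not zeroed
for epoch in range(2):
    print(list(minibatches(list(range(6)), batch_size=2, rng=rng)))
```

Unlike a plain ReLU, the PReLU keeps a small gradient for negative inputs, which is what keeps a badly initialized layer from going completely silent.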
To get the math behind DL read Deep Learning (Adaptive Computation and Machine Learning series).
It is an excellent book and really clears things up. There is a free PDF on the net, but buy it to support the authors for their great work.
For a history lesson and a great introduction read Deep Learning: Methods and Applications (Foundations and Trends in Signal Processing).
If you really want to start implementing from scratch, check out Deep Belief Nets in C++ and CUDA C, Vol. 1: Restricted Boltzmann Machines and Supervised Feedforward Networks.
  
Suggested reading