Tuesday, February 21, 2017

The Black Magic of Deep Learning - Tips and Tricks for the practitioner

Spirits guide us to find the correct hyperparameters
I first heard of Deep Learning in 2012, when deep neural networks gained traction against traditional methods. I followed their evolution but thought it was mostly hype.
Then in January 2015 I joined a greenfield project where I was in charge of choosing the core Machine Learning algorithms for a computer vision platform.

Nothing worked well enough, and when something did, it wouldn't generalize: it required constant fiddling and performed worse when introduced to similar datasets. I was lost. I needed what Deep Learning promised but I was skeptical, so I read the papers, the books and the notes. I then put everything I had learned to work.

Surprisingly, it was no hype: Deep Learning works and it works well. However, it is such a new concept (even though the foundations were laid in the 70s) that a lot of anecdotal tips and tricks started coming out on how to make the most of it (Alex Krizhevsky covered a lot of them and in some ways pre-discovered batch normalization).

Anyway, to sum up, these are my tricks (learned the hard way) to make DNNs tick.
  • Always shuffle. Never allow your network to go through exactly the same minibatch twice. If your framework allows it, shuffle at every epoch.
  • Expand your dataset. DNNs need a lot of data and the models can easily overfit a small dataset. I strongly suggest expanding your original dataset. If it is a vision task, add noise, whiten, drop pixels, rotate, color shift, blur and everything in between. There is a catch though: if the expansion is too big you will be training mostly on the same data. I solved this by creating a layer that applies random transformations so no sample is ever the same. If you are working with voice data, shift it and distort it.
  • This tip is from Karpathy: before training on the whole dataset, try to overfit on a very small subset of it. That way you know your network can converge.
  • Always use dropout to minimize the chance of overfitting. Use it after fully connected or convolutional layers with a large number of outputs (for example, 256 or more). There is an excellent paper about that (Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning).
  • Avoid LRN pooling; prefer the much faster max pooling.
  • Avoid sigmoid and tanh gates: they are expensive, they saturate, and they may stop back propagation. In fact, the deeper your network, the less attractive sigmoids and tanhs are. Use the much cheaper and more effective ReLUs and PReLUs instead. As mentioned in Deep Sparse Rectifier Neural Networks, they promote sparsity and their back propagation is much more robust.
  • Don't use ReLU or PReLU gates before max pooling; apply them after it to save computation.
  • Don't use ReLUs, they are so 2012. Yes, they are a very useful non-linearity that solved a lot of problems. However, try fine-tuning a new model and watch nothing happen because of bad initialization, with ReLUs blocking back propagation. Instead use PReLUs with a very small multiplier, usually 0.1. PReLUs converge faster and will not get stuck like ReLUs during the initial stages. See Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ELUs are still good but expensive.
  • ALWAYS use Batch Normalization (see the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift). It works and it is great. It allows much faster convergence and training on smaller datasets. You will save time and resources.
  • I don't like removing the mean as many do; I prefer squeezing the input data to [-1, +1]. This is more of a training and deployment trick than a performance trick.
  • Always go for the smaller models. If you are building and deploying deep learning models like me, you quickly understand the pain of pushing gigabytes of models to your users or to a server on the other side of the world. Go for the smaller models even if you lose some accuracy.
  • If you use the smaller models, try ensembles. You can usually boost your accuracy by ~3% with an ensemble of 5 networks.
  • Use Xavier initialization where it helps. In my experience that means large fully connected layers only; I avoid it on convolutional layers. See An Explanation of Xavier Initialization.
  • If your input data has a spatial dimension, try to go for CNNs end to end. Read and understand SqueezeNet; it is a new approach and works wonders. Try applying the tips above.
  • Modify your models to use 1x1 convolutional layers where possible; the locality is great for performance.
  • Don't even try to train anything without a high-end GPU.
  • If you are making templates out of models or your own layers, parameterize everything, otherwise you will be rebuilding your binaries all the time. You know you will.
  • And last but not least, understand what you are doing. Deep Learning is the Neutron Bomb of Machine Learning. It is not to be used everywhere and always. Understand the architecture you are using and what you are trying to achieve; don't mindlessly copy models.
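
Three of the tips above (shuffle at every epoch, apply random transformations so no sample is ever exactly the same, and squeeze the input to [-1, +1]) can be sketched in a few lines. This is only a minimal illustration, assuming NumPy and image batches shaped (N, H, W, C) stored as uint8. The function names, the noise level and the drop probability are made up for the example; this is not the custom layer described in the post.

```python
import numpy as np


def scale_to_unit_range(images):
    """Map uint8 pixel values in [0, 255] to float32 values in [-1, +1]."""
    return images.astype(np.float32) / 127.5 - 1.0


def random_augment(batch, rng, noise_std=0.05, drop_prob=0.02):
    """Apply fresh random distortions to a batch shaped (N, H, W, C) in [-1, +1].

    noise_std and drop_prob are illustrative values, not tuned settings.
    """
    out = batch.copy()
    # Horizontally flip each sample with probability 0.5.
    flip = rng.random(out.shape[0]) < 0.5
    out[flip] = out[flip, :, ::-1, :]
    # Add Gaussian noise.
    out += rng.normal(0.0, noise_std, size=out.shape).astype(out.dtype)
    # Drop random pixels (all channels at once) to the mid value 0.
    keep = rng.random(out.shape[:3]) >= drop_prob
    out *= keep[..., None]
    return np.clip(out, -1.0, 1.0)


def minibatches(images, labels, batch_size, rng):
    """Yield augmented minibatches for one epoch in a fresh random order."""
    order = rng.permutation(len(images))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield random_augment(images[idx], rng), labels[idx]
```

Calling minibatches once per epoch with the same rng gives a new sample order and new distortions each time, so the network should never see exactly the same minibatch twice.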
To get the math behind DL, read Deep Learning (Adaptive Computation and Machine Learning series).
It is an excellent book and really clears things up. There is a free PDF on the net, but buy it to support the authors for their great work.
For a history lesson and a great introduction read Deep Learning: Methods and Applications (Foundations and Trends in Signal Processing) 
If you really want to start implementing from scratch, check out Deep Belief Nets in C++ and CUDA C, Vol. 1: Restricted Boltzmann Machines and Supervised Feedforward Networks.
If you need help with your Deep Learning / Machine Learning pipeline, contact us at Electi Consulting.


  1. Awesome tips. Thanks! Linking to this from Udacity Deep Learning course. http://bit.ly/dlfndprep


  2. Thanks, fascinating article!
    You wrote:
    Always use dropout to minimize the chance of overfitting. Use it after large > 256 (fully connected layers or convolutional layers).

    256 layers is a crazy number of layers with the exception of some versions of resnet. I have come across dropout in deep NN architectures with a normal number of layers. Is there a typo? Did you mean a number of hidden units?

    1. You are correct, I should have phrased it better. I didn't mean after 256 fully connected layers; I meant it is better to apply it after fully connected layers that have a large number of outputs, for example >= 256.

  3. Very helpful list of points which everyone should keep in mind while working with DL! Thank you

  4. You mentioned using Xavier initialization only for the FC layers, not the CONV layers. Is there a paper that suggests this? Or is this something you found that works on your own? If the latter, can you explain more about it?

    1. When I read the Xavier initialization paper I started using it everywhere I could. Soon I noticed that my previously converging models (filled with a Gaussian distribution) were not converging. (More often than not my training is not from scratch but fine-tuning previously trained models, so only 1-2 layers are trained each time.) I tried many different variations and I just couldn't get the results promised in the paper. My convergence was hit and miss, whereas previously it was always a hit, and the misses usually came from filling my CONV layers with Xavier.
      I haven't had the time to find the cause of the problem and explore it properly. I am aware that the journals, as well as much anecdotal evidence, show faster convergence with Xavier. If I had to guess, I'd say that Xavier needs lots of units to get a better estimate of the variance, or the initialization parameters are more fragile than expected.

      I would appreciate if people using smaller models and xavier initialization would confirm this.

    2. So what is the good type of initialization for conv layers? Gaussian/Normal, Uniform, Variance-Scale?

    3. From my experience, Gaussian initialization gives me the best results (meaning it always converges on my system).
      I am not giving up on Xavier; I will look into it further and do a follow-up.

    4. Thanks for the reply Nikolas, I appreciate it. In my own work I haven't noticed many issues when swapping out Gaussian initialization for Xavier or He et al. initialization in CONV layers. The reason I bring this up is because training VGG16 or VGG19 from scratch required "pre-training" of smaller VGGNets. Using He et al. and even Xavier initialization in CONV layers reduced this issue and allowed VGG16 and VGG19 to converge without pre-training, something that's not really feasible on super deep networks using Gaussian initialization.

    5. Hello Adrian, yes, I'm aware of the benefits of Xavier initialization.
      Does your framework include any parameterization of the Xavier layers?

  5. If you don't use Xavier initialization for your CNN layers how are you randomly initializing them? Also any tips or tricks when working with RNNs? For example I've heard that many folks are using GRUs instead of LSTMs as they are more computationally efficient during training and can converge faster.

  6. Very nice write-up! Could we translate it into Chinese and post it on WeChat? I'm Wenfei from AI Era, a Chinese AI media outlet. We could send the link back to you if you like. Many thanks.

    1. Hi Nikolas,

      Here is the link to our translation: https://mp.weixin.qq.com/s?__biz=MzI3MTA0MTk1MA==&mid=2651994342&idx=5&sn=dbae830cebb360f78f43191cf5d2c7ab&chksm=f1214c17c656c501d3fa7e45fdc9ccc9ada604051663d79153d16d60496e3d15785aeff12d0e#rd

      It's popular among our readers. I believe Google Translate is good enough for you to read it:)

  7. Hi Nikolas

    Just want to ask one thing about Machine Learning and Deep Learning: you said in your last point:

    Deep Learning is the Neutron Bomb of Machine Learning. It is not to be used everywhere and always.

    If deep learning is a subset of machine learning, why can't we use DL to replace most ML tasks?

    For example, if we want to do a classification job, can we just use a general NN instead of an SVM?

    Thank you very much for your explanation.

    1. DL is certainly exciting and provides a very good paradigm to build on. However, I can think of several reasons why it is sometimes not the answer to everything:
      1. Lack of data. Vision and audio datasets are plentiful, but most ML problems come with little data; you make up for that with domain expertise built into the model.
      2. Interpretability is still a big issue. As long as you deliver, nobody asks about the results, BUT when you miss something (and you will) they will want to know why. Most DL models are a black box, but this may change as people are pouring lots of money into this.
      3. Sometimes you have other factors, like performance, memory and portability.
      4. And Occam's razor.

  8. Very useful, clear and concise tips Nicolas with good references. I would work with you on any computer vision project anytime !

  9. Brilliant and concise. Thanks for sharing! It would help if you put this in a github page to which others can contribute.

  10. If the input data is in the range (-3, 3), do I still need to do data normalization?

    1. Yes, normalization, independent of range, helps convergence by bringing out in-class/between-class variance. Plain normalization is a bit crude; in econometrics they scale by dropping the lowest and highest 5% and then normalizing, to keep outliers from skewing the results. In deep learning that would be a total performance killer though.

  11. For deep learning with a large dataset, do we still need to do k-fold cross validation? If yes, it will be such an expensive process. Do we have a substitute method for cross validation? Thanks

    1. Hello Xu, and sorry for the long delay. I personally don't do k-fold cross validation on DL projects; that would take forever. What I do is build a custom layer that adds several distortions to the input, as well as the usual dropouts during the middle and end stages.
