Monday, January 18, 2021

Keras/Tensorflow threshold with gradient flow

Tensorflow / Keras threshold operations break the gradient flow.

There is a way to fix this by using a combination of operations.

import keras
from keras import backend as K


def threshold_min_max_value(input_layer, min_value, max_value):
    """
    Thresholds all the values of a tensor that exceed max_value to
    max_value, and all that are lower than min_value to min_value;
    this layer retains gradient flow

    :param input_layer: the input layer
    :param min_value: minimum value to threshold to
    :param max_value: maximum value to threshold to
    :return: threshold-ed input layer
    """

    def _threshold(_x):
        # 1.0 where the value is at or above max_value, 0.0 elsewhere
        ge_max_value = K.greater_equal(_x, max_value)
        ge_max_value = K.cast(ge_max_value, K.floatx())
        lt_max_value = 1.0 - ge_max_value

        # 1.0 where the value is at or below min_value, 0.0 elsewhere
        le_min_value = K.less_equal(_x, min_value)
        le_min_value = K.cast(le_min_value, K.floatx())
        gt_min_value = 1.0 - le_min_value

        # in-range values pass through, out-of-range values are zeroed
        tmp0 = keras.layers.Multiply()([
            lt_max_value, gt_min_value, _x
        ])
        # add the bounds back where the values were zeroed
        return tmp0 + (min_value * le_min_value) + (max_value * ge_max_value)

    return keras.layers.Lambda(_threshold)(input_layer)
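To see what the mask arithmetic computes, here is a minimal pure-Python sketch of the same formula applied to plain floats instead of tensors. The name `threshold_scalar` is made up for this illustration and is not part of Keras; it only mirrors the arithmetic, not the gradient behaviour.

```python
def threshold_scalar(x, min_value, max_value):
    """Scalar version of the mask arithmetic (illustrative only)."""
    # 1.0 where the value is at or above the maximum, else 0.0
    ge_max = 1.0 if x >= max_value else 0.0
    lt_max = 1.0 - ge_max
    # 1.0 where the value is at or below the minimum, else 0.0
    le_min = 1.0 if x <= min_value else 0.0
    gt_min = 1.0 - le_min
    # in-range values are multiplied by 1.0, the rest are replaced by a bound
    return lt_max * gt_min * x + min_value * le_min + max_value * ge_max


values = [-2.0, -0.5, 0.0, 0.7, 3.0]
clipped = [threshold_scalar(v, -1.0, 1.0) for v in values]
print(clipped)
```

For in-range inputs the result is the input itself, and for out-of-range inputs it is the nearest bound, so the values match an ordinary clamp.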


Thursday, December 24, 2020

Fix Tensorflow Object Detection Framework taking too much disk space while training

The file contains all the training and evaluation loops.

In the function:

def eager_train_step(detection_model,

We can see that on every training iteration it saves a few images from the training dataset.


There are three problems with this:

1. It takes A LOT of space

2. It actually slows down training

3. Images look saturated.

We can fix this easily by replacing the above snippet with this:

if global_step % 100 == 0:
    # --- get images and normalize them
    images_normalized = \
        (features[fields.InputDataFields.image] + 128.0) / 255.0


Monday, October 12, 2020

Gaussian Filter in Keras (code snippet)

Very often we need to perform basic vision operations on a computational graph, like building a Laplacian pyramid or filtering a tensor with a specific precalculated filter.

Below I present a code snippet for building a fixed, non-trainable Gaussian filter in Keras.

import keras
import numpy as np
import scipy.stats as st

def gaussian_filter_block(input_layer,
                          strides=(1, 1),
                          dilation_rate=(1, 1),
                          ...):  # rest of the signature is truncated
    """
    Build a gaussian filter block
    """

    def _gaussian_kernel(kernlen=(21, 21), nsig=(3, 3)):
        """
        Returns a 2D Gaussian kernel array
        """
        assert len(nsig) == 2
        assert len(kernlen) == 2
        kern1d = []
        for i in range(2):
            interval = (2 * nsig[i] + 1.) / (kernlen[i])
            x = np.linspace(-nsig[i] - interval / 2., nsig[i] + interval / 2.,
                            kernlen[i] + 1)
            # assumed step: probability mass of each bin under a standard
            # normal, inferred from the scipy.stats import and kern1d usage
            kern1d.append(np.diff(st.norm.cdf(x)))

        kernel_raw = np.sqrt(np.outer(kern1d[0], kern1d[1]))
        # divide by sum so they all add up to 1
        kernel = kernel_raw / kernel_raw.sum()
        return kernel

    # Initialise to set kernel to required value
    def kernel_init(shape, dtype):
        kernel = np.zeros(shape)
        kernel[:, :, 0, 0] = _gaussian_kernel([shape[0], shape[1]])
        return kernel

    return keras.layers.DepthwiseConv2D(
        ...  # arguments are truncated

This snippet comes from my open source project.
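The kernel construction can also be tried without Keras or NumPy. The following is a minimal standard-library sketch of the same idea: bin the standard normal distribution into equal intervals, take the mass of each bin as a 1D weight, combine two 1D kernels and normalize. The helper names (`normal_cdf`, `gaussian_kernel_1d`, `gaussian_kernel_2d`) and the 5x5 size are assumptions made for this illustration.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_kernel_1d(kernlen=21, nsig=3):
    """1D weights: probability mass of each of `kernlen` equal-width bins."""
    interval = (2 * nsig + 1.0) / kernlen
    start = -nsig - interval / 2.0
    stop = nsig + interval / 2.0
    step = (stop - start) / kernlen
    # kernlen + 1 evenly spaced bin edges between start and stop
    edges = [start + i * step for i in range(kernlen + 1)]
    cdf = [normal_cdf(e) for e in edges]
    return [cdf[i + 1] - cdf[i] for i in range(kernlen)]


def gaussian_kernel_2d(kernlen=(5, 5), nsig=(3, 3)):
    """2D kernel: square root of the outer product, normalized to sum to 1."""
    rows = gaussian_kernel_1d(kernlen[0], nsig[0])
    cols = gaussian_kernel_1d(kernlen[1], nsig[1])
    raw = [[math.sqrt(r * c) for c in cols] for r in rows]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]


kernel = gaussian_kernel_2d()
for row in kernel:
    print(" ".join(f"{v:.4f}" for v in row))
```

The printed matrix is symmetric, peaks at the center and sums to one, which is what a fixed blur kernel should look like before it is copied into a convolution layer.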


Monday, August 24, 2020

Tensorflow to Onnx (tf2onnx) testing with different opsets

To test tf2onnx changes with different operator sets (opsets), set the corresponding environment variable prior to calling pytest.

So for example in a windows setup this will run the tests



and in a standard linux setup




Thursday, June 11, 2020

Opencv 4.3.0 options

Building OpenCV 4.3.0 from source



Tuesday, April 7, 2020

Netron an awesome neural networks visualization tool

I've recently been working on optimizing models from Tensorflow to ONNX and finally to TensorRT, and got introduced to Netron, an excellent tool for all ML engineers / data scientists working with different flavors and formats of neural networks.

It is a huge improvement over the Keras and Tensorboard visualizers.

Check it out here

Monday, March 23, 2020

Compiling Tensorflow 1.15 from source

Following my previous post, Compiling Tensorflow under Debian Linux with GPU support and CPU extensions, I was trying to build version r1.15 when I stumbled upon another peculiar bug. When reaching the end of the compilation you get a lovely error in the form of:

from keras.preprocessing import image as image_utils
ImportError: No module named keras.preprocessing
To fix this you just need to install the following packages prior to compiling:

pip install keras_applications==1.0.4 --no-deps
pip install keras_preprocessing==1.0.2 --no-deps
pip install h5py==2.8.0
And that's it, you can build again.

bazel build -c opt --config=v1 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Wednesday, February 5, 2020

Software Commodification

Something that people outside the software industry don't fully grasp is the increasing pace of commodification.

You had a PhD in CS 20 years ago? Congrats, all your knowledge is now a library.

You were an excellent physicist 10 years ago? Congrats, we can import and run everything you knew without even understanding half of it.

You studied Artificial Intelligence in depth 15 years ago? A 17-year-old with a weekend in Keras can provide better solutions than you. Everything you know is becoming obsolete at an increasing rate.

As software people we are used to packaging up our knowledge and learning new skills (tools, theories, frameworks) every couple of years, because the old ones have been completely commoditized.

It is an exciting but tiresome journey that shouldn't be required of everyone. However, as everything is becoming software, most people may not have a choice.

Personally, the increasing pace of commodification allows me to cut through the noise and add depth to my knowledge. Knowing that the low-hanging fruit has been picked will clear up the space of Artificial Intelligence and Machine Learning.


Monday, January 20, 2020

What if you could push your AI models to be 10 times faster?

Deceptively Easy

Modern Machine Learning (ML) and especially Deep Learning (DL) have become deceptively easy. It is almost trivial to have something up and running with presentable results, as the tools hide most of the complexity and hard decisions. Whereas that might be good enough for a Proof of Concept (POC) or a Minimum Viable Product (MVP), ensuring stability, high performance and scalability is a whole different ball game.

High Sunk Cost

Many CXOs and senior managers find themselves trapped in subpar solutions that cannot be used effectively because they lack the technical know-how to productionize them.

At Electi we have just released a new brochure listing the services we can offer to companies already implementing Deep Learning. Check it out here.

Monday, December 2, 2019

Performance Measures for Machine Learning


In this blog post I'm presenting a few performance measures for machine learning tasks. These performance measures come up a lot in the marketing domain.

Take-home lessons:
  • the measure you optimize for makes a difference
  • the measure you report makes a difference
  • use a measure appropriate for the problem/community
  • accuracy often is not sufficient/appropriate
  • only accuracy generalizes to >2 classes
  • this is not an exhaustive list of performance measures


Confusion Matrix

First we construct a confusion matrix for a binary classification problem. Given a classification function f(x)->R and a threshold T that can split the outcomes into {0, 1} we can create a confusion matrix that counts the occurrences of the predicted class given the true label.



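Accuracy is then measured as the percentage of correct responses (True Positives + True Negatives) over the total number of responses. In the cell notation used in the rest of this post (a = true positives, b = false negatives, c = false positives, d = true negatives), accuracy = (a + d) / (a + b + c + d).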


Problems with Accuracy

This measure is commonly used but it can be misleading. The problems arise from the domain we are modelling. If, for example, one of the classes is poorly represented, the metric is meaningless, as we could always predict the majority class and still get a good result.

• Assumes equal cost for both kinds of errors 
  • cost(b-type-error) = cost (c-type-error)
• is 99% accuracy good?
  • can be excellent, good, mediocre, poor, terrible
  • depends on problem
• Base Rate = accuracy of predicting predominant class
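A small sketch makes the base-rate problem concrete. It uses the cell names implied by the formulas in this post (a = true positives, b = false negatives, c = false positives, d = true negatives); the counts are invented for illustration and describe a classifier that always predicts class 0 on a dataset with 1% positives.

```python
def accuracy(a, b, c, d):
    """Fraction of correct predictions: (TP + TN) / total."""
    return (a + d) / (a + b + c + d)


def base_rate(a, b, c, d):
    """Accuracy obtained by always predicting the predominant class."""
    positives = a + b
    negatives = c + d
    return max(positives, negatives) / (positives + negatives)


# Hypothetical dataset: 10 positives, 990 negatives, model always says 0.
a, b, c, d = 0, 10, 0, 990
acc = accuracy(a, b, c, d)
rate = base_rate(a, b, c, d)
print(f"accuracy={acc:.2f} base_rate={rate:.2f}")
```

The model never finds a single positive example, yet its accuracy equals the base rate of 99%, which is why accuracy alone says little on skewed data.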


Weighted (Cost sensitive) Accuracy

A modified version of accuracy is "Weighted Accuracy", where we account for the cost of each kind of misclassification.

In this scenario we are aiming for a model and a threshold that minimize the total cost.
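As a rough sketch of threshold selection under unequal costs, the snippet below counts false negatives and false positives for a handful of candidate thresholds and picks the cheapest one. The scores, labels, candidate thresholds and cost values are all invented for illustration.

```python
def error_counts(scores, labels, threshold):
    """Return (false_negatives, false_positives) for f(x) > threshold."""
    false_neg = false_pos = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if label == 1 and predicted == 0:
            false_neg += 1
        elif label == 0 and predicted == 1:
            false_pos += 1
    return false_neg, false_pos


def total_cost(scores, labels, threshold, cost_fn, cost_fp):
    """Total misclassification cost at a given threshold."""
    false_neg, false_pos = error_counts(scores, labels, threshold)
    return false_neg * cost_fn + false_pos * cost_fp


scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]
candidates = [0.0, 0.3, 0.5, 0.7, 1.0]

# Equal costs versus a false negative being five times as expensive.
best_equal = min(candidates,
                 key=lambda t: total_cost(scores, labels, t, 1.0, 1.0))
best_unequal = min(candidates,
                   key=lambda t: total_cost(scores, labels, t, 5.0, 1.0))
print(best_equal, best_unequal)
```

With equal costs the cheapest threshold is the higher one; once missing a positive becomes expensive, the cheapest threshold moves down so that fewer positives are missed.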



Lift

If we are not interested in the accuracy on the entire dataset but want accurate predictions for 5%, 10% or 20% of the dataset, then we can use the lift measure.
Lift measures how much better than random the prediction is on the fraction of the dataset predicted true (f(x) > threshold).
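The post does not spell out the formula, so the sketch below assumes a common definition: the fraction of all positives captured above the threshold, a/(a+b), divided by the fraction of the dataset predicted true, (a+c)/(a+b+c+d). The counts are invented for illustration.

```python
def lift(a, b, c, d):
    """Lift = (a / (a + b)) / ((a + c) / total), written as one division."""
    total = a + b + c + d
    return a * total / ((a + b) * (a + c))


# Hypothetical campaign: 1000 customers, 100 real responders.
# The model flags 100 customers (10% of the data) and 40 of them respond.
a, b, c, d = 40, 60, 60, 840
campaign_lift = lift(a, b, c, d)
print(campaign_lift)
```

Here the flagged 10% of the dataset contains 40% of the responders, so the model is four times better than picking customers at random.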


Precision / Recall

  • The Precision measure counts how many of the examples predicted as the class of interest are correct: a/(a+c).
  • The Recall measure counts how many of the examples of the class of interest the model returns: a/(a+b).
Here the class of interest is class 1.

We can sweep over the threshold, calculate Precision/Recall at each value and plot what is called the Precision/Recall curve.

At each different threshold we can see a different tradeoff between the two metrics.
  • When the threshold is too high, everything is predicted as class 0: a (true positives) and c (false positives) become zero, so the recall becomes zero and the precision is undefined (0/0, usually reported as zero).
  • When the threshold is too low, everything is predicted as class 1: b (false negatives) becomes zero, so the recall becomes 1 while the precision drops to the fraction of class 1 in the dataset.
Both of these metrics are flawed in isolation, and it is up to the modeller to judge which one better represents the problem.


F-Measure

The F-Measure is an attempt to merge the two measures to construct a more meaningful performance measure.
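A short sketch of the three measures follows. The post does not give the F-Measure formula, so the widely used F1 form (harmonic mean of precision and recall) is assumed here; returning 0.0 when a denominator is zero is also just a convention chosen for this example. The counts are invented.

```python
def precision(a, c):
    """a / (a + c): correct share of the examples predicted as class 1."""
    return a / (a + c) if (a + c) > 0 else 0.0


def recall(a, b):
    """a / (a + b): share of the true class-1 examples that were returned."""
    return a / (a + b) if (a + b) > 0 else 0.0


def f1_score(a, b, c):
    """Harmonic mean of precision and recall (assumed F-Measure variant)."""
    p = precision(a, c)
    r = recall(a, b)
    return 2.0 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical counts: 40 true positives, 60 false negatives, 10 false positives.
a, b, c = 40, 60, 10
p, r, f1 = precision(a, c), recall(a, b), f1_score(a, b, c)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

The harmonic mean stays close to the weaker of the two numbers, so a model cannot score well by maximizing only precision or only recall.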

Receiver Operating Characteristic (ROC)

• Developed in WWII to statistically model false positive and false negative detections of radar operators
• Better statistical foundations than most other measures
• Standard measure in medicine and biology
• Becoming more popular in ML 

Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice.

One of the earliest adopters of ROC graphs in machine learning was Spackman (1989), who demonstrated the value of ROC curves in evaluating and comparing algorithms.

ROC graphs are conceptually simple, but there are some non-obvious complexities that arise when they are used in research. 

ROC Plot


• Sweep threshold and plot
  • TPR vs. FPR
  • Sensitivity vs. 1-Specificity
  • P(true|true) vs. P(true|false)
• Sensitivity = a/(a+b) = Recall = LIFT numerator
• 1 - Specificity = 1 - d/(c+d)

A ROC graph depicts relative trade-offs between benefits (true positives) and costs (false positives).

  • The lower left point (0,0) represents the strategy of never issuing a positive classification.
  • The opposite strategy is represented by the upper right point (1,1).
  • The point (0,1) represents perfect classification.
  • The diagonal line y = x represents the strategy of randomly guessing a class. 
  • A random classifier will produce an ROC point that "slides" back and forth on the diagonal based on the frequency with which it guesses the positive class. In order to get away from this diagonal into the upper triangular region, the classifier must exploit some information in the data. 
  • Any classifier that appears in the lower right triangle performs worse than random guessing. This triangle is therefore usually empty in ROC graphs.
  • ROC curves have an attractive property: they are insensitive to changes in class distribution.
  • Any performance metric that uses values from both columns of the confusion matrix will be inherently sensitive to class skews. Metrics such as accuracy, precision, lift and F scores use values from both columns of the confusion matrix.
  • ROC graphs are based upon TP rate and FP rate, in which each dimension is a strict columnar ratio, so do not depend on class distributions.
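The threshold sweep can be sketched in a few lines of plain Python. The scores and labels are invented, and the helper names `roc_points` and `area_under_curve` are made up for this illustration. The second half repeats every negative example ten times to check that the ROC points do not move when the class distribution changes.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


scores = [0.9, 0.8, 0.65, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 0, 1, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)

# Skew the class distribution: every negative example now appears 10 times.
skewed_scores, skewed_labels = list(scores), list(labels)
for s, y in zip(scores, labels):
    if y == 0:
        skewed_scores.extend([s] * 9)
        skewed_labels.extend([0] * 9)
skewed_points = roc_points(skewed_scores, skewed_labels)

print(points)
print(auc)
```

The curve starts at (0,0), ends at (1,1), and the skewed dataset yields the same points because each axis is a ratio within a single class.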

Thursday, October 24, 2019

Model Distillation

Model Distillation is the process of taking a big model or ensemble of models and producing a smaller model that captures most of the performance of the original bigger model. It could also be better described as a blind model replication method.

The reasons for doing so are:
  1. improved run-time performance (fewer FLOPs)
  2. (maybe) better generalization because of the model simplicity
  3. you don't have access to the training of the original model
  4. you have access to a remotely deployed model and you want to replicate it (it happens more than you can imagine)
  5. the original model may be too complicated
  6. insights that may arise from the process itself

How it works

Assume a MNIST classifier \(F_{MNIST}\) composed of an ensemble of \(N\) convolutional deep neural networks. For an input image \(x\) it produces logits \(z_i\), which are then converted to a probability for each of the possible labels \(C_{0}-C_{9}\).

The distillation process will give us an  \(F_{MNIST_{distilled}}\) composed of a single deep neural network that will approximate the classification results of the bigger ensemble of models.

In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is - the output of a softmax function on the teacher model's logits.

Logits \(z_j\) are converted to probabilities \(P(C_i|x)\) using the softmax layer:

\[ p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} \]

However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn't provide much information beyond the ground truth labels already provided in the dataset.

To tackle this issue, Hinton et al., 2015 introduced the concept of "softmax temperature". The probability \(q_i\) is computed from the logit \(z_i\) and the scalar softmax temperature \(T\):

\[ q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \]

where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes. Softer probability distribution means that the values are somewhat diffused and a 0.999 probability may become 0.9 and the rest spread to the other classes.
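The effect of the temperature is easy to see numerically. Below is a minimal sketch of a temperature-scaled softmax in plain Python; the logits and the temperature of 5 are arbitrary values chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T."""
    scaled = [z / temperature for z in logits]
    # subtract the maximum for numerical stability; the result is unchanged
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [8.0, 2.0, 1.0]
hard = softmax(logits, temperature=1.0)
soft = softmax(logits, temperature=5.0)
print([round(p, 4) for p in hard])
print([round(p, 4) for p in soft])
```

At T = 1 almost all of the probability mass sits on the first class, while at the higher temperature the other classes receive a visible share and the ranking of the classes stays the same.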

As Hinton et al. describe it: in the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1. When the correct labels are known for all or some of the transfer set, this method can be significantly improved by also training the distilled model to produce the correct labels. One way to do this is to use the correct labels to modify the soft targets, but we found that a better way is to simply use a weighted average of two different objective functions.
  1. The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model.
  2. The second objective function is the cross entropy with the correct labels. This is computed using exactly the same logits in softmax of the distilled model but at a temperature of 1.
This very simple operation can have a multitude of knobs and parameters to adjust, but the core essence is very simple and works quite well.
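The two objectives can be combined as in the following sketch, written for a single example in plain Python. The weight `alpha`, the temperature of 4 and the logits are assumptions made for illustration. Hinton et al. additionally suggest scaling the soft term by \(T^2\) when mixing both objectives; that factor is left out here to stay close to the description above.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0.0)


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted average of the soft-target and hard-label cross entropies."""
    # 1. soft targets: teacher and student both use the high temperature
    soft_targets = softmax(teacher_logits, temperature)
    soft_loss = cross_entropy(soft_targets, softmax(student_logits, temperature))
    # 2. hard labels: same student logits, temperature of 1
    one_hot = [1.0 if i == true_label else 0.0
               for i in range(len(student_logits))]
    hard_loss = cross_entropy(one_hot, softmax(student_logits, 1.0))
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


teacher = [8.0, 2.0, 1.0]
loss_good = distillation_loss([8.0, 2.0, 1.0], teacher, true_label=0)
loss_bad = distillation_loss([1.0, 2.0, 8.0], teacher, true_label=0)
print(loss_good, loss_bad)
```

A student that reproduces the teacher's logits gets a much lower loss than one that ranks the classes the wrong way round, which is the signal the distilled model is trained on.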

Friday, October 18, 2019

ResNets inner workings and notes

A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer’s output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remove the residual layers and train a shallower network instead.

Up until 2015 we had 3 mainstream ways of training deep networks:
  • Greedy per layer optimization and freezing
  • Various flavors of Dropout
  • and Batch Normalization (came out early 2015, wasn't mainstream until 2016, now it is patented by google so who knows if people can use it)
The main contribution of the paper "Deep Residual Learning for Image Recognition, 2015" is a novel and smart building block for training very deep neural networks: the "Residual Learning" or "Identity Learning" block.


A special note here:
We need at least 2 weight layers here (nonlinearities are not essential but welcome) to get the benefits of the universal function approximator that has been proven since the 90's. You can add more but you then hit the limits of information propagation and you will need residual/skip connections inside the residual block.

A stack of \(n\) residual blocks is described as follows:

\(x_0\) : input
\(x_1 = x_0 + F_1(x_0)\)
\(x_2 = x_1 + F_2(x_1)\)
\(x_3 = x_2 + F_3(x_2)\)
\(\dots\)
\(x_n = x_{n-1} + F_n(x_{n-1})\)

This can be re-written as:

\(x_n = x_{n-1} + F_n(x_{n-2} + F_{n-1}(x_{n-2}))\)

Which can be expanded as:

\(x_n = x_{n-1} + F_n(x_{n-2} + F_{n-1}(x_{n-3} + F_{n-2}(x_{n-4} + F_{n-3}(\dots + F_1(x_0)))))\)

Now if we assume that \(F_i\) is a linear function (it is not for \(x \lt 0\), because of the nonlinearity, but we can ignore that here):

\[
x_n = x_{n-1} + F_n(x_{n-2}) + F_n F_{n-1}(x_{n-3}) + F_n F_{n-1} F_{n-2}(x_{n-4}) + \dots + F_n \cdots F_2(x_0) + F_n \cdots F_1(x_0)
\]

As we can see from the equations and the equivalent graph, there is a clear information flow from the raw data to the output. This means that the major pain point of vanishing gradient is avoided.

This design when stacked as above allows the gradient to flow through the whole network bypassing any tricky points and in essence training the deeper network properly. Information bottlenecks can be introduced by design but the gradient can flow from the loss layer up to the base layer.
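The linear case can be checked numerically with scalars. In the sketch below every block is assumed to be \(F_k(x) = w_k x\), so \(x_k = x_{k-1} + w_k x_{k-1}\); the weights and depth are arbitrary. The first part verifies that the unrolled sum of terms equals the stacked output, the second compares how the gradient with respect to the input behaves with and without the skip connection.

```python
def residual_forward(x0, weights):
    """Activations x_0 .. x_n of a stack with x_k = x_{k-1} + w_k * x_{k-1}."""
    xs = [x0]
    for w in weights:
        xs.append(xs[-1] + w * xs[-1])
    return xs


def expanded_output(xs, weights):
    """Rebuild x_n from the unrolled sum of products of linear blocks."""
    n = len(weights)
    total = xs[n - 1]
    coeff = 1.0
    # each term applies one more block to an activation one step earlier
    for k in range(n, 1, -1):
        coeff *= weights[k - 1]
        total += coeff * xs[k - 2]
    # last term: all blocks applied to the raw input
    coeff *= weights[0]
    total += coeff * xs[0]
    return total


def input_gradient(weights, residual):
    """d x_n / d x_0 for a linear stack, with or without skip connections."""
    grad = 1.0
    for w in weights:
        grad *= (1.0 + w) if residual else w
    return grad


weights = [0.5, -0.2, 0.3]
xs = residual_forward(2.0, weights)
unrolled = expanded_output(xs, weights)

deep = [0.1] * 50
plain_grad = input_gradient(deep, residual=False)
residual_grad = input_gradient(deep, residual=True)
print(xs[-1], unrolled)
print(plain_grad, residual_grad)
```

With 50 small weights the plain stack multiplies the gradient down to practically zero, while the residual stack keeps an identity path and the gradient stays well away from zero.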

Residual Networks (ResNets) have been used to train networks of up to 1000 layers. It was shown empirically that going deeper generalizes better than shallow wide networks. Again empirically, several variations of the Residual Block have been tried:

In practice the Residual Function is not just 2 weight layers (convolutional or fully connected); they are accompanied by Batch Normalization layers. This helps stabilize the gradient and reduce the covariate shift (there are various opposing hypotheses on why batch norm works).

In the paper "Identity Mappings in Deep Residual Networks, 2016" a different layout was proposed that improved the performance. We can see an inversion of the order of the layers. The architecture, the analysis and the system design remain the same though.
This new arrangement produces better results on very deep networks for the CIFAR dataset, shown below. Also the convergence is much faster. Since they are practically the same from an engineering perspective we don't have a reason to reject the one with the better performance.