While the field of software engineering has developed a mature set of best practices for building reliable systems, the standards and practices for developing ML models rigorously are still taking shape. It can be all too tempting to judge performance by a single-number summary metric, which may mask subtle areas of unreliability. Careful testing is needed to surface such lurking issues.
1. Test that every model specification undergoes a code review and is checked in to a repository. It can be tempting to skip, but disciplined code review remains an excellent method for avoiding silly errors and for enabling more efficient incident response and debugging.
2. Test the relationship between offline proxy metrics and the actual impact metrics. For example, how does a one-percent improvement in accuracy or AUC translate into effects on metrics of user satisfaction, such as click-through rate? This can be measured in a small-scale A/B experiment using an intentionally degraded model.
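As a rough sketch of this check, the snippet below relates a hypothetical offline AUC gap between a production model and an intentionally degraded one to the click-through-rate gap observed in a small A/B test. Every number, arm name, and sample size here is an illustrative placeholder, not a figure from the paper.

```python
"""Sketch: relate an offline proxy-metric gap (AUC) to a live impact metric
(click-through rate) from an A/B test with an intentionally degraded model."""
import math
from scipy.stats import norm

# Offline evaluation of the two models (hypothetical values).
auc_control, auc_degraded = 0.781, 0.771          # roughly a one-point AUC gap

# Live A/B results per arm (hypothetical click and impression counts).
clicks = {"control": 5_320, "degraded": 5_010}
impressions = {"control": 100_000, "degraded": 100_000}
ctr = {arm: clicks[arm] / impressions[arm] for arm in clicks}

# Two-proportion z-test on the click-through-rate difference.
pooled = sum(clicks.values()) / sum(impressions.values())
se = math.sqrt(pooled * (1 - pooled) * sum(1 / n for n in impressions.values()))
z = (ctr["control"] - ctr["degraded"]) / se
p_value = 2 * norm.sf(abs(z))

print(f"offline AUC gap: {auc_control - auc_degraded:.3f}")
print(f"live CTR gap:    {ctr['control'] - ctr['degraded']:.4f}  (p = {p_value:.3g})")
# Repeating this over several degradation levels gives a curve that maps
# offline metric movements onto expected live-metric movements.
```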
3. Test the impact of each tunable hyperparameter. Methods such as a grid search or a more
sophisticated hyperparameter search strategy not only improve predictive performance, but also
can uncover hidden reliability issues. For example, it can be surprising to observe the impact of
massive increases in data parallelism on model accuracy.
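A minimal sketch of such a sweep, using scikit-learn's GridSearchCV on synthetic data; the model, grid, and scoring choice are illustrative assumptions rather than the paper's setup.

```python
"""Sketch of a grid search over tunable hyperparameters on synthetic data."""
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l2"]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)

# Reporting every configuration, not just the winner, helps surface settings
# whose scores are suspiciously unstable across folds.
for params, mean, std in zip(
    grid.cv_results_["params"],
    grid.cv_results_["mean_test_score"],
    grid.cv_results_["std_test_score"],
):
    print(params, f"AUC = {mean:.3f} +/- {std:.3f}")
print("best:", grid.best_params_)
```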
4. Test the effect of model staleness. If predictions are based on a model trained yesterday versus
last week versus last year, what is the impact on the live metrics of interest? All models need to be
updated eventually to account for changes in the external world; a careful assessment is important to
guide such decisions.
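One way to approximate this offline is to retrain snapshots of the same model on data up to different cutoffs and score them all on the most recent window. The sketch below does this on synthetic, slowly drifting data that stands in for real logs; the drift mechanism and "ages" are assumptions for illustration.

```python
"""Sketch of a staleness test: train snapshots at different ages, score on today."""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def sample_window(day: int, n: int = 5_000):
    """Generate data whose decision boundary rotates slowly over time."""
    X = rng.normal(size=(n, 2))
    w = np.array([np.cos(day / 60.0), np.sin(day / 60.0)])  # slow drift
    y = (X @ w + 0.3 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_today, y_today = sample_window(day=365)

for age_days in (1, 7, 30, 180):
    X_train, y_train = sample_window(day=365 - age_days)
    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    auc = roc_auc_score(y_today, model.predict_proba(X_today)[:, 1])
    print(f"model trained {age_days:>3d} days ago: AUC today = {auc:.3f}")
```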
5. Test against a simpler model as a baseline. Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost-benefit trade-offs of more sophisticated techniques.
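A sketch of such a comparison, in which a logistic regression restricted to a handful of features runs through the same evaluation as a larger candidate (here a gradient-boosted model); the dataset and both models are illustrative stand-ins.

```python
"""Sketch of a simple-baseline comparison inside a shared evaluation harness."""
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=30, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: deliberately tiny model using only the first three features.
baseline = LogisticRegression(max_iter=1_000).fit(X_tr[:, :3], y_tr)
candidate = GradientBoostingClassifier().fit(X_tr, y_tr)

auc_baseline = roc_auc_score(y_te, baseline.predict_proba(X_te[:, :3])[:, 1])
auc_candidate = roc_auc_score(y_te, candidate.predict_proba(X_te)[:, 1])

print(f"baseline  AUC: {auc_baseline:.3f}")
print(f"candidate AUC: {auc_candidate:.3f}")
# A small gap suggests the extra serving and maintenance cost of the larger
# model may not be justified; a broken baseline suggests the pipeline itself
# is suspect.
```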
6. Test model quality on important data slices. Slicing a data set along certain dimensions of
interest provides fine-grained understanding of model performance. For example, important slices
might be users by country or movies by genre. Examining sliced data avoids having fine-grained
performance issues masked by a global summary metric.
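A minimal sketch of per-slice evaluation, assuming predictions are available in a table with a country column; the column names, slices, and the injected weakness in one slice are hypothetical.

```python
"""Sketch of slicing model quality by a categorical attribute."""
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "country": rng.choice(["US", "BR", "IN", "DE"], size=n),
    "label": rng.integers(0, 2, size=n),
})
# Simulate a model that is noticeably noisier (worse) on one slice.
noise = np.where(df["country"] == "DE", 0.45, 0.15)
df["score"] = np.clip(df["label"] + rng.normal(0, noise), 0, 1)

overall = roc_auc_score(df["label"], df["score"])
per_slice = df.groupby("country").apply(
    lambda g: roc_auc_score(g["label"], g["score"])
)
print(f"overall AUC: {overall:.3f}")
print(per_slice.round(3))  # the weak slice stands out even if the overall number looks fine
```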
7. Test the model for implicit bias. This may be viewed as an extension of examining important data
slices, and may reveal issues that can be root-caused and addressed. For example, implicit bias might
be induced by a lack of sufficient diversity in the training data.
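As a simple illustration, the sketch below compares per-group true-positive and false-positive rates for a binary classifier on synthetic data in which one group is under-represented and served with a higher error rate; the group attribute, proportions, and error rates are hypothetical.

```python
"""Sketch of a basic bias check: per-group error-rate comparison."""
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n, p=[0.8, 0.2]),
    "label": rng.integers(0, 2, size=n),
})
# Simulate a model that errs more often on the under-represented group.
flip_rate = np.where(df["group"] == "B", 0.35, 0.10)
df["pred"] = np.where(rng.random(n) < flip_rate, 1 - df["label"], df["label"])

def rates(g: pd.DataFrame) -> pd.Series:
    tpr = ((g["pred"] == 1) & (g["label"] == 1)).sum() / max((g["label"] == 1).sum(), 1)
    fpr = ((g["pred"] == 1) & (g["label"] == 0)).sum() / max((g["label"] == 0).sum(), 1)
    return pd.Series({"TPR": tpr, "FPR": fpr, "n": len(g)})

print(df.groupby("group").apply(rates).round(3))
# Large TPR/FPR gaps between groups are a signal to dig into training-data
# coverage and labeling for the affected group.
```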
* From "What’s your ML Test Score? A rubric for ML production systems," NIPS 2016.
Engineering checklist:
- Test that every model specification undergoes a code review and is checked in to a repository
- Test the relationship between offline proxy metrics and the actual impact metrics
- Test the impact of each tunable hyperparameter
- Test the effect of model staleness; concept drift is real for non-stationary processes
- Test against a simpler model as a baseline
- Test model quality on important data slices
- Test the model for implicit bias