EnVision: software development

Tuesday, August 27, 2019

How to Build Good Software

Software has characteristics that make it hard to build with traditional management techniques; effective development requires a different, more exploratory and iterative approach.

The root cause of bad software has less to do with specific engineering choices, and more to do with how development projects are managed.

The right coding language, system architecture, or interface design will vary wildly from project to project. But there are characteristics particular to software that consistently cause traditional management practices to fail, while allowing small startups to succeed with a shoestring budget:

• Reusing good software is easy; it is what allows you to build good things quickly;
• Software is limited not by the amount of resources put into building it, but by how complex it can get before it breaks down; and
• The main value in software is not the code produced, but the knowledge accumulated by the people who produced it.

Understanding these characteristics may not guarantee good outcomes, but it does help clarify why so many projects produce bad outcomes. Furthermore, these lead to some core operating principles that can dramatically improve the chances of success:

1. Start as simple as possible;
2. Seek out problems and iterate; and
3. Hire the best engineers you can.

While there are many subtler factors to consider, these principles form a foundation that lets you get started building good software.

Software should be treated not as a static product, but as a living manifestation of the development team’s collective understanding.

1. Start as Simple as Possible

Projects that set out to be a “one-stop shop” for a particular domain are often doomed. The reasoning seems sensible enough: What better way to ensure your app solves people’s problems than by having it address as many as possible? After all, this works for physical stores such as supermarkets. The difference is that while it is relatively easy to add a new item for sale once a physical store is set up, an app with twice as many features is more than twice as hard to build and much harder to use.

Building good software requires focus: starting with the simplest solution that could solve the problem. A well-made but simplistic app never has problems adding necessary features. But a big IT system that does a lot of things poorly is usually impossible to simplify and fix. Even successful “do it all” apps like WeChat, Grab, and Facebook started out with very specific functionality and only expanded after they had secured their place. Software projects rarely fail because they are too small; they fail because they get too big.

Unfortunately, keeping a project focused is very hard in practice: just gathering the requirements from all stakeholders already creates a huge list of features.

One way to manage this bloat is by using a priority list. Requirements are all still gathered, but each is tagged according to whether it is an absolutely critical feature, a high-value addition, or a nice-to-have. This creates a much lower-tension planning process because features no longer need to be explicitly excluded. Stakeholders can then more sanely discuss which features are the most important, without worrying about something being left out of the project. This approach also makes explicit the trade-offs of having more features. Stakeholders who want to increase the priority for a feature have to also consider what features they are willing to deprioritise. Teams can start on the most critical objectives, working their way down the list as time and resources allow.

2. Seek Out Problems and Iterate

In truth, modern software is so complicated and changes so rapidly that no amount of planning will eliminate all shortcomings. Like writing a good paper, awkward early drafts are necessary to get a feel of what the final paper should be. To build good software, you need to first build bad software, then actively seek out problems to improve on your solution.

This starts with something as simple as talking to the actual people you are trying to help. The goal is to understand the root problem you want to solve and avoid jumping to a solution based just on preconceived biases. When we first started on Parking.sg, our hypothesis was that enforcement officers found it frustrating to have to keep doing the mental calculations regarding paper coupons. However, after spending just one afternoon with an experienced officer, we discovered that doing these calculations was actually quite simple for someone doing it professionally. That single conversation saved us months of potentially wasted effort and let us refocus our project on helping drivers instead.

Beware of bureaucratic goals masquerading as problem statements. “Drivers feel frustrated when dealing with parking coupons” is a problem. “We need to build an app for drivers as part of our Ministry Family Digitisation Plans” is not. “Users are annoyed at how hard it is to find information on government websites” is a problem. “As part of the Digital Government Blueprint, we need to rebuild our websites to conform to the new design service standards” is not. If our end goal is to make citizens’ lives better, we need to explicitly acknowledge the things that are making their lives worse.

Having a clear problem statement lets you experimentally test the viability of different solutions that are too hard to determine theoretically. Talking to a chatbot may not be any easier than navigating a website, and users may not want to install yet another app on their phones no matter how secure it makes the country. With software, apparently obvious solutions often have fatal flaws that do not show up until they are put to use. The aim is not yet to build the final product, but to first identify these problems as quickly and as cheaply as possible. Non-functional mock-ups can test interface designs, semi-functional mock-ups can try out different features, and hastily written prototype code can help garner feedback more quickly. Anything created at this stage should be treated as disposable. The desired output of this process is not the code written, but a clearer understanding of what the right thing to build is.

3. Hire the Best Engineers You Can

The key to having good engineering is having good engineers. Google, Facebook, Amazon, Netflix, and Microsoft all run a dizzying number of the largest technology systems in the world, yet they famously have some of the most selective interview processes while still competing fiercely to recruit the strongest candidates. There is a reason that the salaries for even fresh graduates have gone up so much as these companies have grown, and it is not because they enjoy giving away money.

Both Steve Jobs and Mark Zuckerberg have said that the best engineers are at least 10 times more productive than an average engineer. This is not because good engineers write code 10 times faster. It is because they make better decisions that save 10 times the work.

A good engineer has a better grasp of existing software they can reuse, thus minimising the parts of the system they have to build from scratch. They have a better grasp of engineering tools, automating away most of the routine aspects of their own job. Automation also means freeing up humans to work on solving unexpected errors, which the best engineers are disproportionately better at. Good engineers themselves design systems that are more robust and easier to understand by others. This has a multiplier effect, letting their colleagues build upon their work much more quickly and reliably. Overall, good engineers are so much more effective not because they produce a lot more code, but because the decisions they make save you from work you did not know could be avoided.

This also means that small teams of the best engineers can often build things faster than even very large teams of average engineers. They make good use of available open source code and sophisticated cloud services, and offload mundane tasks onto automated testing and other tools, so they can focus on the creative problem-solving aspects of the job. They rapidly test different ideas with users by prioritising key features and cutting out unimportant work. This is the central thesis of the classic book “The Mythical Man-Month”: in general, adding more software engineers does not make a project go faster, it only makes it grow bigger.

Smaller teams of good engineers will also create fewer bugs and security problems than larger teams of average engineers. Similar to writing an essay, the more authors there are, the more coding styles, assumptions, and quirks there are to reconcile in the final composite product, exposing a greater surface area for potential issues to arise. In contrast, a system built by a smaller team of good engineers will be more concise, coherent, and better understood by its creators. You cannot have security without simplicity, and simplicity is rarely the result of large-scale collaborations.

The more collaborative an engineering effort, the better the engineers need to be. Problems in an engineer’s code affect not just his work but that of his colleagues as well. In large projects, bad engineers end up creating more work for one another, as errors and poor design choices snowball to create massive issues. Big projects need to be built on solid reliable code modules in an efficient design with very clear assumptions laid out. The better your engineers, the bigger your system can get before it collapses under its own weight. This is why the most successful tech companies insist on the best talent despite their massive size. The hard limit to system complexity is not the quantity of engineering effort, but its quality.

From How to Build Good Software

Tuesday, August 20, 2019

Machine Learning Engineering: Tests for Infrastructure

An ML system often relies on a complex pipeline rather than a single running binary.

Engineering checklist:

Test the reproducibility of training
Unit test model specification code
Integration test the full ML pipeline
Test model quality before attempting to serve it
Test that a single example or training batch can be sent to the model
Test models via a canary process before they enter production serving environments
Test how quickly and safely a model can be rolled back to a previous serving version

1. Test the reproducibility of training. Train two models on the same data, and observe any differences in aggregate metrics, sliced metrics, or example-by-example predictions. Large differences due to non-determinism can complicate debugging and troubleshooting.
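
A reproducibility check of this kind might look like the sketch below. It is a minimal illustration rather than a real training job: the toy data, the `train_linear_model` function, and the seed are all assumptions made for the example. The idea is to route every source of randomness through one seeded generator, train twice, and compare predictions example by example.

```python
import random

# Hypothetical toy data drawn from y = 2x + 1.
DATA = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]


def train_linear_model(data, seed, epochs=50, lr=0.01):
    """Fit y ~ w*x + b with SGD; all randomness flows through one seeded RNG."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)  # shuffling order is the non-deterministic part
        for x, y in examples:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return w, b


def max_prediction_gap(model_a, model_b, inputs):
    """Largest example-by-example difference between two models' predictions."""
    return max(
        abs((model_a[0] * x + model_a[1]) - (model_b[0] * x + model_b[1]))
        for x in inputs
    )


run_1 = train_linear_model(DATA, seed=42)
run_2 = train_linear_model(DATA, seed=42)
gap = max_prediction_gap(run_1, run_2, [x for x, _ in DATA])
print("max prediction gap between runs:", gap)
```

In a real pipeline the same comparison would also be made on aggregate and sliced metrics, and a small non-zero tolerance may be needed where hardware or parallelism introduces unavoidable noise.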

2. Unit test model specification code. Although model specifications may seem like “configuration”, such files can have bugs and need to be tested. Useful assertions include testing that training results in decreased loss and that a model can restore from a checkpoint after a mid-training job crash.
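
The two assertions mentioned above can be sketched as ordinary unit tests. Everything here is a stand-in: the model is a one-feature linear regression, and `train_steps`, `save_checkpoint`, and `load_checkpoint` are hypothetical helpers written for the example, not part of any particular framework.

```python
import json
import os
import tempfile

# Hypothetical toy data drawn from y = 2x + 1.
DATA = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]


def mse(params, data):
    """Mean squared error of the line y = w*x + b."""
    return sum((params["w"] * x + params["b"] - y) ** 2 for x, y in data) / len(data)


def train_steps(params, data, steps, lr=0.05):
    """Run `steps` full-batch gradient descent updates and return new params."""
    w, b = params["w"], params["b"]
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return {"w": w, "b": b}


def save_checkpoint(params, path):
    with open(path, "w") as f:
        json.dump(params, f)


def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)


def test_training_decreases_loss():
    start = {"w": 0.0, "b": 0.0}
    assert mse(train_steps(start, DATA, steps=20), DATA) < mse(start, DATA)


def test_restore_from_checkpoint_matches_uninterrupted_run():
    start = {"w": 0.0, "b": 0.0}
    uninterrupted = train_steps(start, DATA, steps=20)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.json")
        # Simulate a crash after step 10, then resume from the saved file.
        save_checkpoint(train_steps(start, DATA, steps=10), path)
        resumed = train_steps(load_checkpoint(path), DATA, steps=10)
    assert resumed == uninterrupted


test_training_decreases_loss()
test_restore_from_checkpoint_matches_uninterrupted_run()
```

A real checkpoint would also carry optimizer state and the step counter; the sketch only restores parameters because plain gradient descent has no other state.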

3. Integration test the full ML pipeline. A good integration test runs all the way from original data sources, through feature creation, to training, and to serving. An integration test should run both continuously and with new releases of models or servers, in order to catch problems well before they reach production.

4. Test model quality before attempting to serve it. Useful tests include testing against data with known correct outputs and validating the aggregate quality, as well as comparing predictions to a previous version of the model.
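
One way to turn this into a gate is sketched below. The golden set, the thresholds, and the toy "models" (plain threshold functions) are all assumptions for illustration; the point is only that a candidate must both score well on examples with known correct outputs and not diverge wildly from the previous version.

```python
def accuracy(model, examples):
    """Share of labelled examples the model gets right."""
    return sum(model(x) == y for x, y in examples) / len(examples)


def agreement(model_a, model_b, inputs):
    """Share of inputs on which two models make the same prediction."""
    return sum(model_a(x) == model_b(x) for x in inputs) / len(inputs)


def ready_to_serve(candidate, previous, golden_set,
                   min_accuracy=0.9, min_agreement=0.8):
    """Gate a candidate on known-answer accuracy and on drift from the old model."""
    inputs = [x for x, _ in golden_set]
    return (accuracy(candidate, golden_set) >= min_accuracy
            and agreement(candidate, previous, inputs) >= min_agreement)


# Hypothetical golden set: inputs with known correct outputs.
GOLDEN_SET = [(0.1, False), (0.3, False), (0.48, False), (0.6, True), (0.9, True)]


def previous_model(x):
    return x > 0.5


def good_candidate(x):
    return x > 0.55


def broken_candidate(x):
    return True  # predicts the positive class for everything


print(ready_to_serve(good_candidate, previous_model, GOLDEN_SET))
print(ready_to_serve(broken_candidate, previous_model, GOLDEN_SET))
```

The thresholds are judgment calls: a large, intended improvement can legitimately lower agreement with the previous version, so a failed agreement check is better treated as a prompt for review than as an automatic rejection.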

5. Test that a single example or training batch can be sent to the model, and changes to internal state can be observed from training through to prediction. Observing internal state on small amounts of data is a useful debugging strategy for issues like numerical instability.

6. Test models via a canary process before they enter production serving environments. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. This includes testing that a model can be loaded into the production serving binaries and perform inference on production input data at all. It also includes a canary process, in which a new version is tested on a small trickle of live data.
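
The "small trickle of live data" part can be sketched as follows. This is an assumption-laden illustration, not a description of any real serving system: `routes_to_canary` and `canary_is_healthy` are hypothetical names, and hashing the request ID is just one common way to pick a stable subset of traffic.

```python
import hashlib


def routes_to_canary(request_id: str, canary_fraction: float) -> bool:
    """Deterministically send a fixed fraction of requests to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < canary_fraction


def canary_is_healthy(canary_errors, canary_total, baseline_error_rate,
                      tolerance=0.01):
    """Pass only if the canary's error rate stays near the baseline's."""
    if canary_total == 0:
        return False  # no traffic observed yet, so nothing has been proven
    return canary_errors / canary_total <= baseline_error_rate + tolerance


request_ids = [f"request-{i}" for i in range(10000)]
canary_share = sum(routes_to_canary(r, 0.05) for r in request_ids) / len(request_ids)
print("share of traffic on canary:", canary_share)
```

Hashing keeps a given request ID on the same side of the split across retries. The sketch does not cover the other half of the test described above, which is checking that the model loads into the production serving binary at all.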

7. Test how quickly and safely a model can be rolled back to a previous serving version. A model “roll back” procedure is useful in cases where upstream issues might result in unexpected changes to model quality. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system.
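
A rollback drill is easier to test when the serving version is tracked explicitly. The sketch below uses a hypothetical in-memory `ModelRegistry` class; a real system would persist this history and reload model files, but the behaviour under test is the same.

```python
class ModelRegistry:
    """Hypothetical registry that remembers which versions were promoted."""

    def __init__(self):
        self._history = []  # version IDs in the order they were promoted

    def promote(self, version):
        self._history.append(version)

    @property
    def serving(self):
        return self._history[-1] if self._history else None

    def rollback(self):
        """Revert to the previously promoted version and return its ID."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]


registry = ModelRegistry()
registry.promote("model-v1")
registry.promote("model-v2")
restored = registry.rollback()
print("now serving:", registry.serving)
```

To test "how quickly", the same drill can be timed end to end in a staging environment, including the time to load the older model into the serving process.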

* From "What’s your ML Test Score? A rubric for ML production systems" NIPS, 2016

Wednesday, August 14, 2019

Machine Learning Engineering: Tests for Model Development

While the field of software engineering has developed a full range of best practices for developing reliable software systems, the set of standards and practices for developing ML models in a rigorous fashion is still developing. It can be all too tempting to rely on a single-number summary metric to judge performance, perhaps masking subtle areas of unreliability. Careful testing is needed to search for potential lurking issues.

Engineering checklist:

Test that every model specification undergoes a code review and is checked in to a repository
Test the relationship between offline proxy metrics and the actual impact metrics
Test the impact of each tunable hyperparameter
Test the effect of model staleness. Concept drift is real for non-stationary processes
Test against a simpler model as a baseline
Test model quality on important data slices
Test the model for implicit bias

1. Test that every model specification undergoes a code review and is checked in to a repository. It can be tempting to avoid, but disciplined code review remains an excellent method for avoiding silly errors and for enabling more efficient incident response and debugging.

2. Test the relationship between offline proxy metrics and the actual impact metrics. For example, how does a one-percent improvement in accuracy or AUC translate into effects on metrics of user satisfaction, such as click-through rates? This can be measured in a small-scale A/B experiment using an intentionally degraded model.

3. Test the impact of each tunable hyperparameter. Methods such as a grid search or a more sophisticated hyperparameter search strategy not only improve predictive performance, but also can uncover hidden reliability issues. For example, it can be surprising to observe the impact of massive increases in data parallelism on model accuracy.
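
A small grid search shows how a sweep can surface reliability problems as well as a best setting. The grid, the toy regression task, and the "final loss above initial loss" divergence rule are all assumptions chosen to keep the sketch short.

```python
import itertools

# Hypothetical toy data drawn from y = 2x + 1.
DATA = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]


def loss(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train_and_evaluate(lr, steps, data):
    """Full-batch gradient descent from w = b = 0; returns the final loss."""
    w = b = 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return loss(w, b, data)


GRID = {"lr": [0.01, 0.1, 1.0], "steps": [10, 100]}
initial_loss = loss(0.0, 0.0, DATA)

results = {
    (lr, steps): train_and_evaluate(lr, steps, DATA)
    for lr, steps in itertools.product(GRID["lr"], GRID["steps"])
}
best = min(results, key=results.get)
# Settings that end up worse than no training at all point to instability.
diverged = [combo for combo, final in results.items() if final > initial_loss]

print("best (lr, steps):", best)
print("diverged settings:", diverged)
```

Here the largest learning rate makes training diverge, which is exactly the kind of cliff edge a sweep is meant to expose before someone tunes a production model close to it.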

4. Test the effect of model staleness. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? All models need to be updated eventually to account for changes in the external world; a careful assessment is important to guide such decisions.

5. Test against a simpler model as a baseline. Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.
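
A baseline comparison can be written as a single check. The sketch below uses an even simpler baseline than the linear model suggested above, a constant predictor that always returns the mean label, and the data, models, and `min_relative_gain` threshold are illustrative assumptions.

```python
from statistics import mean

# Hypothetical evaluation set drawn from y = 3x.
EXAMPLES = [(x, 3 * x) for x in range(1, 6)]
MEAN_LABEL = mean(y for _, y in EXAMPLES)


def mean_absolute_error(predict, examples):
    return mean(abs(predict(x) - y) for x, y in examples)


def beats_baseline(model, baseline, examples, min_relative_gain=0.1):
    """True if the model's error is at least 10% lower than the baseline's."""
    model_error = mean_absolute_error(model, examples)
    baseline_error = mean_absolute_error(baseline, examples)
    return model_error <= baseline_error * (1 - min_relative_gain)


def constant_baseline(x):
    return MEAN_LABEL  # ignores the input entirely


def working_model(x):
    return 3 * x


def broken_model(x):
    return 0  # e.g. a pipeline bug that zeroes out every prediction


print(beats_baseline(working_model, constant_baseline, EXAMPLES))
print(beats_baseline(broken_model, constant_baseline, EXAMPLES))
```

A sophisticated model that cannot clear this bar usually signals a pipeline bug rather than a hard problem, which is why the check is worth running regularly.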

6. Test model quality on important data slices. Slicing a data set along certain dimensions of interest provides fine-grained understanding of model performance. For example, important slices might be users by country or movies by genre. Examining sliced data avoids having fine-grained performance issues masked by a global summary metric.
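
The masking effect is easy to reproduce. In the sketch below the records, field names, and country codes are made up for illustration; the overall accuracy looks healthy while one slice is no better than a coin flip.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Accuracy computed separately for each value of `slice_key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = record[slice_key]
        totals[key] += 1
        correct[key] += int(record["prediction"] == record["label"])
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical evaluation records: 8 users in one country, 2 in another.
RECORDS = (
    [{"country": "SG", "prediction": 1, "label": 1} for _ in range(8)]
    + [{"country": "ID", "prediction": 1, "label": 1},
       {"country": "ID", "prediction": 1, "label": 0}]
)

overall = sum(r["prediction"] == r["label"] for r in RECORDS) / len(RECORDS)
per_country = accuracy_by_slice(RECORDS, "country")

print("overall accuracy:", overall)
print("accuracy by country:", per_country)
```

A test can then assert a minimum accuracy per slice rather than only on the global number, with a floor on slice size so that tiny slices do not trigger noise-driven failures.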

7. Test the model for implicit bias. This may be viewed as an extension of examining important data slices, and may reveal issues that can be root-caused and addressed. For example, implicit bias might be induced by a lack of sufficient diversity in the training data.

* From "What’s your ML Test Score? A rubric for ML production systems" NIPS, 2016

Machine Learning Engineering: Tests for Features and Data

Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. Therefore, while traditional software can rely on unit tests and integration tests of the code, here we attempt to add a sufficient set of tests of the data.

Engineering checklist:

Test that the distributions of each feature match your expectations
Test the relationship between each feature and the target
Test the cost of each feature
Test that a model does not contain any features that are unsuitable for use
Test that your system maintains privacy controls across its entire data pipeline
Test the calendar time needed to develop and add a new feature to the production model
Test all code that creates input features

1. Test that the distributions of each feature match your expectations. One example might be to test that Feature A takes on values 1 to 5, or that the two most common values of Feature B are "Harry" and "Potter" and they account for 10% of all values. This test can fail due to real external changes, which may require changes in your model.
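
The two example expectations above translate directly into checks. The sample values below are fabricated purely to exercise the functions, and `check_range` and `top_values_share` are hypothetical helper names.

```python
from collections import Counter


def check_range(values, low, high):
    """True if every value lies within [low, high]."""
    return all(low <= v <= high for v in values)


def top_values_share(values, k):
    """Return the k most common values and the share of rows they cover."""
    top = Counter(values).most_common(k)
    return [value for value, _ in top], sum(count for _, count in top) / len(values)


# Hypothetical samples of the two features described in the text.
FEATURE_A = [1, 2, 3, 4, 5, 3, 2]
FEATURE_B = ["Harry"] * 6 + ["Potter"] * 4 + [f"name_{i}" for i in range(90)]

a_in_range = check_range(FEATURE_A, 1, 5)
top_two, share = top_values_share(FEATURE_B, 2)

print("Feature A within 1..5:", a_in_range)
print("Feature B top two values:", top_two, "share:", share)
```

In practice the expected share would be asserted with a tolerance band, since an exact match is too brittle for live data.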

2. Test the relationship between each feature and the target, and the pairwise correlations between individual signals. It is important to have a thorough understanding of the individual features used in a given model. This is a minimal set of tests; more exploration may be needed to develop a full understanding. These tests may be run by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
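
The first of those options, computing correlation coefficients, can be sketched in a few lines. The feature names and values are invented for the example, and the Pearson coefficient is written out by hand so the block has no dependencies.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical features and target.
FEATURES = {
    "sessions": [1, 2, 3, 4, 5],
    "purchases": [2, 1, 4, 3, 5],
    "noise": [3, 5, 1, 4, 2],
}
TARGET = [2, 4, 6, 8, 10]

feature_vs_target = {name: pearson(values, TARGET) for name, values in FEATURES.items()}
pairwise = {
    (a, b): pearson(FEATURES[a], FEATURES[b])
    for a, b in itertools.combinations(FEATURES, 2)
}

print("feature vs target:", feature_vs_target)
print("pairwise:", pairwise)
```

Correlation only captures linear relationships, and the function above would divide by zero on a constant feature, so a production version needs a guard for that case and should be complemented by the model-based checks mentioned in the text.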

3. Test the cost of each feature. The costs of a feature may include added inference latency and RAM usage, more upstream data dependencies, and additional expected instability incurred by relying on that feature. Consider whether this cost is worth paying when traded off against the provided improvement in model quality.

4. Test that a model does not contain any features that have been manually determined as unsuitable for use. A feature might be unsuitable when it’s been discovered to be unreliable, overly expensive, etc. Tests are needed to ensure that such features are not accidentally included (e.g. via copy-paste) into new models.
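
A guard against accidental inclusion can be as simple as a set intersection run in continuous integration. The blocklist entries and feature names below are hypothetical.

```python
# Hypothetical list of features that reviewers have ruled out.
UNSUITABLE_FEATURES = {"legacy_user_score", "raw_ip_address"}


def forbidden_features(model_features):
    """Return any blocklisted features that a model specification uses."""
    return sorted(set(model_features) & UNSUITABLE_FEATURES)


clean_spec = ["age_bucket", "is_weekend"]
copy_pasted_spec = ["age_bucket", "raw_ip_address", "legacy_user_score"]

print(forbidden_features(clean_spec))
print(forbidden_features(copy_pasted_spec))
```

Keeping the blocklist in one shared, reviewed file means a new model picks up the restriction automatically instead of relying on each author to remember it.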

5. Test that your system maintains privacy controls across its entire data pipeline. While strict access control is typically maintained on raw data, ML systems often export and transform that data during training. Test to ensure that access control is appropriately restricted across the entire pipeline.

6. Test the calendar time needed to develop and add a new feature to the production model. The faster a team can go from a feature idea to it running in production, the faster it can both improve the system and respond to external changes.

7. Test all code that creates input features, both in training and serving. It can be tempting to believe feature creation code is simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital.
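
Unit tests for feature creation code tend to focus on boundary values and on training and serving producing identical output. The sketch below assumes a hypothetical `make_features` function shared by both paths; the bucket edges and field names are made up.

```python
import bisect

AGE_BOUNDARIES = [18, 35, 65]  # hypothetical bucket edges


def bucketize(value, boundaries):
    """Return the bucket index: how many boundaries are <= value."""
    return bisect.bisect_right(boundaries, value)


def make_features(raw):
    """Feature creation shared by the training and serving paths."""
    return {
        "age_bucket": bucketize(raw["age"], AGE_BOUNDARIES),
        "is_weekend": int(raw["day_of_week"] in ("Sat", "Sun")),
    }


def training_features(rows):
    """Batch path used when building training data."""
    return [make_features(row) for row in rows]


def serving_features(request):
    """Single-request path used at prediction time."""
    return make_features(request)


def test_bucket_edges():
    values = (17, 18, 34, 35, 70)
    assert [bucketize(v, AGE_BOUNDARIES) for v in values] == [0, 1, 1, 2, 3]


def test_training_and_serving_agree():
    rows = [{"age": 20, "day_of_week": "Sat"}, {"age": 70, "day_of_week": "Mon"}]
    assert training_features(rows) == [serving_features(row) for row in rows]


test_bucket_edges()
test_training_and_serving_agree()
```

When training and serving are implemented in different languages or systems, the parity test matters most: the same raw inputs are fed to both and the outputs are compared.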

* From "What’s your ML Test Score? A rubric for ML production systems" NIPS, 2016