Tuesday, August 27, 2019

How to Build Good Software

Software has characteristics that make it hard to build with traditional management techniques; effective development requires a different, more exploratory and iterative approach.

The root cause of bad software has less to do with specific engineering choices, and more to do with how development projects are managed.

The right coding language, system architecture, or interface design will vary wildly from project to project. But there are characteristics particular to software that consistently cause traditional management practices to fail, while allowing small startups to succeed with a shoestring budget:

• Reusing good software is easy; it is what allows you to build good things quickly;
• Software is limited not by the amount of resources put into building it, but by how complex it can get before it breaks down; and
• The main value in software is not the code produced, but the knowledge accumulated by the people who produced it.

Understanding these characteristics may not guarantee good outcomes, but it does help clarify why so many projects produce bad outcomes. Furthermore, these lead to some core operating principles that can dramatically improve the chances of success:

1. Start as simple as possible;
2. Seek out problems and iterate; and
3. Hire the best engineers you can.

While there are many subtler factors to consider, these principles form a foundation that lets you get started building good software.

Software should be treated not as a static product, but as a living manifestation of the development team’s collective understanding.


1. Start as Simple as Possible

Projects that set out to be a “one-stop shop” for a particular domain are often doomed. The reasoning seems sensible enough: What better way to ensure your app solves people’s problems than by having it address as many as possible? After all, this works for physical stores such as supermarkets. The difference is that while it is relatively easy to add a new item for sale once a physical store is set up, an app with twice as many features is more than twice as hard to build and much harder to use.

Building good software requires focus: starting with the simplest solution that could solve the problem. A well-made but simplistic app never has problems adding necessary features. But a big IT system that does a lot of things poorly is usually impossible to simplify and fix. Even successful “do it all” apps like WeChat, Grab, and Facebook started out with very specific functionality and only expanded after they had secured their place. Software projects rarely fail because they are too small; they fail because they get too big.

Unfortunately, keeping a project focused is very hard in practice: just gathering the requirements from all stakeholders already creates a huge list of features.

One way to manage this bloat is by using a priority list. Requirements are all still gathered, but each are tagged according to whether they are absolutely critical features, high-value additions, or nice-to-haves. This creates a much lower-tension planning process because features no longer need to be explicitly excluded. Stakeholders can then more sanely discuss which features are the most important, without worrying about something being left out of the project. This approach also makes explicit the trade-offs of having more features. Stakeholders who want to increase the priority for a feature have to also consider what features they are willing to deprioritise. Teams can start on the most critical objectives, working their way down the list as time and resources allow.

2. Seek Out Problems and Iterate

In truth, modern software is so complicated and changes so rapidly that no amount of planning will eliminate all shortcomings. As with writing a good paper, awkward early drafts are necessary to get a feel for what the final version should be. To build good software, you need to first build bad software, then actively seek out problems to improve on your solution.

This starts with something as simple as talking to the actual people you are trying to help. The goal is to understand the root problem you want to solve and avoid jumping to a solution based just on preconceived biases. When we first started on Parking.sg, our hypothesis was that enforcement officers found it frustrating to have to keep doing the mental calculations regarding paper coupons. However, after spending just one afternoon with an experienced officer, we discovered that doing these calculations was actually quite simple for someone doing it professionally. That single conversation saved us months of potentially wasted effort and let us refocus our project on helping drivers instead.

Beware of bureaucratic goals masquerading as problem statements. “Drivers feel frustrated when dealing with parking coupons” is a problem. “We need to build an app for drivers as part of our Ministry Family Digitisation Plans” is not. “Users are annoyed at how hard it is to find information on government websites” is a problem. “As part of the Digital Government Blueprint, we need to rebuild our websites to conform to the new design service standards” is not. If our end goal is to make citizens’ lives better, we need to explicitly acknowledge the things that are making their lives worse.

Having a clear problem statement lets you experimentally test the viability of different solutions that are too hard to determine theoretically. Talking to a chatbot may not be any easier than navigating a website, and users may not want to install yet another app on their phones no matter how secure it makes the country. With software, apparently obvious solutions often have fatal flaws that do not show up until they are put to use. The aim is not yet to build the final product, but to first identify these problems as quickly and as cheaply as possible: non-functional mock-ups to test interface designs, semi-functional mock-ups to try out different features, and hastily written prototype code to gather feedback quickly. Anything created at this stage should be treated as disposable. The desired output of this process is not the code written, but a clearer understanding of what the right thing to build is.

3. Hire the Best Engineers You Can

The key to having good engineering is having good engineers. Google, Facebook, Amazon, Netflix, and Microsoft all run a dizzying number of the largest technology systems in the world, yet they famously have some of the most selective interview processes while still competing fiercely to recruit the strongest candidates. There is a reason that the salaries for even fresh graduates have gone up so much as these companies have grown, and it is not because they enjoy giving away money.

Both Steve Jobs and Mark Zuckerberg have said that the best engineers are at least 10 times more productive than an average engineer. This is not because good engineers write code 10 times faster. It is because they make better decisions that save 10 times the work.

A good engineer has a better grasp of existing software they can reuse, thus minimising the parts of the system they have to build from scratch. They have a better grasp of engineering tools, automating away most of the routine aspects of their own job. Automation also means freeing up humans to work on solving unexpected errors, which the best engineers are disproportionately better at. Good engineers themselves design systems that are more robust and easier to understand by others. This has a multiplier effect, letting their colleagues build upon their work much more quickly and reliably. Overall, good engineers are so much more effective not because they produce a lot more code, but because the decisions they make save you from work you did not know could be avoided.

This also means that small teams of the best engineers can often build things faster than even very large teams of average engineers. They make good use of available open source code and sophisticated cloud services, and offload mundane tasks onto automated testing and other tools, so they can focus on the creative problem-solving aspects of the job. They rapidly test different ideas with users by prioritising key features and cutting out unimportant work. This is the central thesis of the classic book “The Mythical Man-Month”: in general, adding more software engineers does not make a project go faster, it only makes it grow bigger.

Smaller teams of good engineers will also create fewer bugs and security problems than larger teams of average engineers. Similar to writing an essay, the more authors there are, the more coding styles, assumptions, and quirks there are to reconcile in the final composite product, exposing a greater surface area for potential issues to arise. In contrast, a system built by a smaller team of good engineers will be more concise, coherent, and better understood by its creators. You cannot have security without simplicity, and simplicity is rarely the result of large-scale collaborations.

The more collaborative an engineering effort, the better the engineers need to be. Problems in an engineer’s code affect not just his work but that of his colleagues as well. In large projects, bad engineers end up creating more work for one another, as errors and poor design choices snowball to create massive issues. Big projects need to be built on solid reliable code modules in an efficient design with very clear assumptions laid out. The better your engineers, the bigger your system can get before it collapses under its own weight. This is why the most successful tech companies insist on the best talent despite their massive size. The hard limit to system complexity is not the quantity of engineering effort, but its quality.

From How to Build Good Software

Thursday, August 22, 2019

Preindustrial workers worked fewer hours than today's

Work brings purpose to life, but it can also be a great hack for keeping yourself occupied and away from serious business. Work is not equivalent to production, and it is not bliss. It is necessary to keep one alive and can bring happiness and purpose, but it needs to be purposeful in itself and produce actual tangible results.

There is an overworking and underproduction crisis. People in bullshit jobs shuffle paper around, working themselves to death while producing nothing at all.

To unburden ourselves from normalcy bias, we have to look at different times, when things were simpler and there wasn't much room for non-producing workers.

From the paper "Preindustrial workers worked fewer hours than today's"

"The contrast between capitalist and precapitalist work patterns is most striking in respect to the working year. The medieval calendar was filled with holidays. Official -- that is, church -- holidays included not only long "vacations" at Christmas, Easter, and midsummer but also numerous saints days. These were spent both in sober churchgoing and in feasting, drinking and merrymaking. In addition to official celebrations, there were often week's worth of ales -- to mark important life events (bride ales or wake ales) as well as less momentous occasions (scot ale, lamb ale, and hock ale). All told, holiday leisure time in medieval England took up probably about one-third of the year. And the English were apparently working harder than their neighbors. The ancient regime in France is reported to have guaranteed fifty-two Sundays, ninety rest days, and thirty-eight holidays. In Spain, travelers noted that holidays totaled five months per year."

From the article "On the Phenomenon of Bullshit Jobs: A Work Rant" by David Graeber

In the year 1930, John Maynard Keynes predicted that, by century's end, technology would have advanced  sufficiently that countries like Great Britain or the United States would have achieved a 15-hour work week. There's every reason to believe he was right. In technological terms, we are quite capable of this. And yet it didn't happen. Instead, technology has been marshaled, if anything, to figure out ways to make us all work more. In order to achieve this, jobs have had to be created that are, effectively, pointless. Huge swathes of people, in Europe and North America in particular, spend their entire working lives performing tasks they secretly believe do not really need to be performed. The moral and spiritual damage that comes from this situation is profound. It is a scar across our collective soul. Yet virtually no one talks about it.

Why did Keynes' promised utopia—still being eagerly awaited in the '60s—never materialise? The standard line today is that he didn't figure in the massive increase in consumerism. Given the choice between less hours and more toys and pleasures, we've collectively chosen the latter. This presents a nice morality tale, but even a moment's reflection shows it can't really be true. Yes, we have witnessed the creation of an endless variety of new jobs and industries since the '20s, but very few have anything to do with the production and distribution of sushi, iPhones, or fancy sneakers.

In Bullshit Jobs, American anthropologist David Graeber posits that the productivity benefits of automation have not led to a 15-hour workweek, but instead to "bullshit jobs": "a form of paid employment that is so completely pointless, unnecessary, or pernicious that even the employee cannot justify its existence even though, as part of the conditions of employment, the employee feels obliged to pretend that this is not the case."

The author contends that more than half of societal work is pointless, both large parts of some jobs and, as he describes, five types of entirely pointless jobs:
  1. flunkies, who serve to make their superiors feel important, e.g., receptionists, administrative assistants, door attendants
  2. goons, who act aggressively on behalf of their employers, e.g., lobbyists, corporate lawyers, telemarketers, public relations specialists
  3. duct tapers, who ameliorate preventable problems, e.g., programmers repairing shoddy code, airline desk staff who calm passengers whose bags don't arrive
  4. box tickers, who use paperwork or gestures as a proxy for action, e.g., performance managers, in-house magazine journalists, leisure coordinators
  5. taskmasters, who manage—or create extra work for—those who don't need it, e.g., middle management, leadership professionals
From the book "Bullshit Jobs"

Wednesday, August 21, 2019

Identity Mappings in Deep Residual Networks, 2016

A very nice improvement over the original ResNet.

In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation.
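As a concrete illustration, a pre-activation residual block keeps the skip path as a pure identity and applies batch normalization and ReLU before each convolution, with nothing applied after the addition. A minimal PyTorch sketch of the idea, not the authors' reference code:

    # Minimal pre-activation residual block: BN -> ReLU -> conv (twice),
    # with a pure identity skip connection and no activation after the addition.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PreActBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

        def forward(self, x):
            out = self.conv1(F.relu(self.bn1(x)))
            out = self.conv2(F.relu(self.bn2(out)))
            return x + out  # identity skip: the signal passes through unchanged

    block = PreActBlock(64)
    y = block(torch.randn(1, 64, 32, 32))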

Tuesday, August 20, 2019

Companies with overpaid CEOs have markedly underperformed the S&P 500

The companies with overpaid CEOs we identified in our first report have markedly underperformed the S&P 500.

Two years ago, we analyzed how these firms’ stock price performed since we originally identified their CEOs as overpaid. We found then that the 10 companies we identified as having the most overpaid CEOs, in aggregate, underperformed the S&P 500 index by an incredible 10.5 percentage points and actually destroyed shareholder value, with a negative 5.7 percent financial return. The trend continues to hold true as we measure performance to year-end 2018. Last year, these 10 firms again, in aggregate, dramatically underperformed the S&P 500 index, this time by an embarrassing 15.6 percentage points.

In analyzing almost 4 years of returns for these 10 companies we find that they lag the S&P 500 by 14.3 percentage points, posting an overall loss in value of over 11 percent.

Consistent with our 2018 report, this year we used a two-ranking methodology to identify overpaid CEOs.
1. The first is the same HIP Investor regression we’ve used every year that computes excess CEO pay assuming such pay is related to total shareholder return (TSR). 
2. The second ranking identified the companies where the most shares were voted against the CEO pay package. 

These two rankings were weighted 2:1, with the regression analysis being the majority. We then excluded those CEOs whose total disclosed compensation (TDC) was in the lowest third of all the S&P 500 CEO pay packages. The full list of the 100 most overpaid CEOs using this methodology is found in Appendix A. The regression analysis of predicted and excess pay performed by HIP Investor is found in Appendix C, and its methodology is more fully explained there.
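As a rough sketch of how a 2:1 weighted combination of two rankings might work (the company names, numbers, and field names below are made up for illustration, not taken from the report):

    # Hypothetical sketch of combining two rankings with a 2:1 weighting.
    # Lower rank = more overpaid; the data is invented.
    companies = [
        {"name": "A", "regression_rank": 1, "vote_rank": 5, "tdc": 30e6},
        {"name": "B", "regression_rank": 4, "vote_rank": 1, "tdc": 25e6},
        {"name": "C", "regression_rank": 2, "vote_rank": 3, "tdc": 4e6},
    ]

    # Exclude CEOs whose total disclosed compensation is in the lowest third.
    tdc_cutoff = sorted(c["tdc"] for c in companies)[len(companies) // 3]
    eligible = [c for c in companies if c["tdc"] >= tdc_cutoff]

    # Weight the regression ranking twice as heavily as the say-on-pay vote ranking.
    for c in eligible:
        c["combined"] = (2 * c["regression_rank"] + 1 * c["vote_rank"]) / 3

    most_overpaid = sorted(eligible, key=lambda c: c["combined"])
    print([c["name"] for c in most_overpaid])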

From the article
https://www.asyousow.org/report/the-100-most-overpaid-ceos-2019

 

Machine Learning Engineering: Tests for Infrastructure

An ML system often relies on a complex pipeline rather than a single running binary.

Engineering checklist:

  1. Test the reproducibility of training
  2. Unit test model specification code
  3. Integration test the full ML pipeline
  4. Test model quality before attempting to serve it
  5. Test that a single example or training batch can be sent to the model
  6. Test models via a canary process before they enter production serving environments
  7. Test how quickly and safely a model can be rolled back to a previous serving version

1. Test the reproducibility of training. Train two models on the same data, and observe any differences in aggregate metrics, sliced metrics, or example-by-example predictions. Large differences due to non-determinism make debugging and troubleshooting much harder.
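A sketch of such a check, using scikit-learn purely as an illustration; the dataset, model, and what counts as an acceptable gap are placeholders:

    # Hypothetical reproducibility check: train the same model twice on the same
    # data and compare aggregate metrics and per-example predictions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=2000, random_state=0)

    def train():
        # Deliberately no fixed seed: we want to measure the effect of whatever
        # non-determinism the training process has.
        return RandomForestClassifier(n_estimators=50).fit(X, y)

    m1, m2 = train(), train()
    p1, p2 = m1.predict(X), m2.predict(X)

    print("accuracy gap:", abs(accuracy_score(y, p1) - accuracy_score(y, p2)))
    print("fraction of differing predictions:", np.mean(p1 != p2))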

2. Unit test model specification code. Although model specifications may seem like “configuration”, such files can have bugs and need to be tested. Useful assertions include testing that training results in decreased loss and that a model can restore from a checkpoint after a mid-training job crash.
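A minimal sketch of those two assertions, written as PyTorch unit tests on a toy model (the model and data are placeholders):

    # Hypothetical unit tests for a model specification: loss decreases after a
    # few optimisation steps, and the model restores cleanly from a checkpoint.
    import torch
    import torch.nn as nn

    def make_model():
        return nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))

    def test_loss_decreases():
        torch.manual_seed(0)
        model, loss_fn = make_model(), nn.MSELoss()
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(64, 10), torch.randn(64, 1)
        initial = loss_fn(model(x), y).item()
        for _ in range(20):
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
        assert loss.item() < initial

    def test_restores_from_checkpoint(tmp_path):
        model = make_model()
        path = tmp_path / "ckpt.pt"
        torch.save(model.state_dict(), path)
        restored = make_model()
        restored.load_state_dict(torch.load(path))
        x = torch.randn(4, 10)
        assert torch.allclose(model(x), restored(x))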

3. Integration test the full ML pipeline. A good integration test runs all the way from original data sources, through feature creation, to training, and to serving. An integration test should run both continuously and with new releases of models or servers, in order to catch problems well before they reach production.

4. Test model quality before attempting to serve it. Useful tests include testing against data with known correct outputs and validating the aggregate quality, as well as comparing predictions to a previous version of the model.
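One way such a gate could look, sketched as a plain function; the thresholds, the golden set, and the objects passed in are all hypothetical:

    # Hypothetical pre-serving quality gate.
    def validate_before_serving(candidate, previous, golden_inputs, golden_outputs,
                                eval_fn, regression_tolerance=0.005):
        # 1. Known-correct outputs: the model must reproduce a small golden set.
        preds = candidate.predict(golden_inputs)
        if list(preds) != list(golden_outputs):
            raise ValueError("candidate disagrees with golden examples")

        # 2. Aggregate quality: compare against the previously served model.
        new_score, old_score = eval_fn(candidate), eval_fn(previous)
        if new_score < old_score - regression_tolerance:
            raise ValueError(f"candidate regresses: {new_score:.4f} vs {old_score:.4f}")
        return True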

5. Test that a single example or training batch can be sent to the model, and changes to internal state can be observed from training through to prediction. Observing internal state on small amounts of data is a useful debugging strategy for issues like numerical instability.
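A sketch of the idea in PyTorch: push a single batch through a toy model and assert that every intermediate activation stays finite. The model and data are placeholders:

    # Hypothetical single-batch sanity check: run one example through the model
    # and inspect internal state (here, that every activation is finite).
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    activations = {}

    def record(name):
        def hook(module, inputs, output):
            activations[name] = output.detach()
        return hook

    for name, module in model.named_modules():
        if name:  # skip the top-level container
            module.register_forward_hook(record(name))

    batch = torch.randn(1, 10)  # a single example
    output = model(batch)

    for name, tensor in activations.items():
        assert torch.isfinite(tensor).all(), f"non-finite values in layer {name}"
    print("all intermediate activations are finite")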

6. Test models via a canary process before they enter production serving environments. Modeling code can change more frequently than serving code, so there is a danger that an older serving system will not be able to serve a model trained from newer code. This includes testing that a model can be loaded into the production serving binaries and perform inference on production input data at all. It also includes a canary process, in which a new version is tested on a small trickle of live data.
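A sketch of the loading half of such a test, assuming a joblib-serialized model artifact and a production-shaped sample batch (both hypothetical); a canary would then route a small trickle of live traffic to the new version before full rollout:

    # Hypothetical check that a freshly trained model artifact can be loaded by
    # the serving code path and used for inference on production-shaped inputs.
    import joblib

    def test_model_loads_in_serving_path(model_path, sample_production_inputs):
        model = joblib.load(model_path)  # assume the server uses the same loader
        predictions = model.predict(sample_production_inputs)
        assert len(predictions) == len(sample_production_inputs)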


7. Test how quickly and safely a model can be rolled back to a previous serving version. A model “roll back” procedure is useful in cases where upstream issues might result in unexpected changes to model quality. Being able to quickly revert to a previous known-good state is as crucial with ML models as with any other aspect of a serving system.

From "What’s your ML Test Score? A rubric for ML production systems", NIPS 2016


Wednesday, August 14, 2019

Machine Learning Engineering: Tests for Model Development

While the field of software engineering has developed a full range of best practices for building reliable software systems, the set of standards and practices for developing ML models in a rigorous fashion is still emerging. It can be all too tempting to rely on a single-number summary metric to judge performance, perhaps masking subtle areas of unreliability. Careful testing is needed to search for potential lurking issues.

Engineering checklist:

  1. Test that every model specification undergoes a code review and is checked in to a repository
  2. Test the relationship between offline proxy metrics and the actual impact metrics
  3. Test the impact of each tunable hyperparameter
  4. Test the effect of model staleness. Concept drift is real for non-stationary processes
  5. Test against a simpler model as a baseline
  6. Test model quality on important data slices
  7. Test the model for implicit bias

1. Test that every model specification undergoes a code review and is checked in to a repository. It can be tempting to avoid, but disciplined code review remains an excellent method for avoiding silly errors and for enabling more efficient incident response and debugging.

2. Test the relationship between offline proxy metrics and the actual impact metrics. For example, how does a one-percent improvement in accuracy or AUC translate into effects on metrics of user satisfaction, such as click-through rates? This can be measured in a small scale A/B experiment using an intentionally degraded model.

3. Test the impact of each tunable hyperparameter. Methods such as a grid search or a more sophisticated hyperparameter search strategy not only improve predictive performance, but also can uncover hidden reliability issues. For example, it can be surprising to observe the impact of massive increases in data parallelism on model accuracy.
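For instance, a plain grid search with scikit-learn; inspecting the full table of results, not just the best setting, is what surfaces the surprises (model, grid, and data are illustrative):

    # Illustrative hyperparameter sweep: the full results table can reveal
    # settings where quality degrades unexpectedly.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)
    grid = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_)
    print(grid.cv_results_["mean_test_score"])  # inspect every setting, not just the best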

4. Test the effect of model staleness. If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? All models need to be updated eventually to account for changes in the external world; a careful assessment is important to guide such decisions.

5. Test against a simpler model as a baseline. Regularly testing against a very simple baseline model, such as a linear model with very few features, is an effective strategy both for confirming the functionality of the larger pipeline and for helping to assess the cost to benefit tradeoffs of more sophisticated techniques.
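A sketch of the comparison, with both models and the dataset standing in as placeholders:

    # Compare a deliberately simple baseline against the more sophisticated model;
    # a small gap suggests the extra complexity may not be paying for itself.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    baseline = LogisticRegression(max_iter=1000)
    complex_model = GradientBoostingClassifier()

    print("baseline:", cross_val_score(baseline, X, y, cv=5).mean())
    print("complex :", cross_val_score(complex_model, X, y, cv=5).mean())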

6. Test model quality on important data slices. Slicing a data set along certain dimensions of interest provides fine-grained understanding of model performance. For example, important slices might be users by country or movies by genre. Examining sliced data avoids having fine-grained performance issues masked by a global summary metric.
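A sketch with pandas, using made-up column names for the slice, labels, and predictions:

    # Hypothetical per-slice evaluation: a global metric can hide a slice that
    # is doing badly, so compute the metric per group as well.
    import pandas as pd

    df = pd.DataFrame({
        "country": ["SG", "SG", "US", "US", "US", "FR"],
        "label":   [1, 0, 1, 1, 0, 1],
        "pred":    [1, 0, 1, 0, 0, 0],
    })

    correct = df["label"] == df["pred"]
    print("overall accuracy:", correct.mean())
    print(correct.groupby(df["country"]).mean())  # accuracy by country; FR stands out here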

7. Test the model for implicit bias. This may be viewed as an extension of examining important data slices, and may reveal issues that can be root-caused and addressed. For example, implicit bias might be induced by a lack of sufficient diversity in the training data.

From "What’s your ML Test Score? A rubric for ML production systems", NIPS 2016




Machine Learning Engineering: Tests for Features and Data

Machine learning systems differ from traditional software-based systems in that the behavior of ML systems is not specified directly in code but is learned from data. Therefore, while traditional software can rely on unit tests and integration tests of the code, here we attempt to add a sufficient set of tests of the data.


Engineering checklist:

  1. Test that the distributions of each feature match your expectations
  2. Test the relationship between each feature and the target 
  3. Test the cost of each feature
  4. Test that a model does not contain any features that are unsuitable for use
  5. Test that your system maintains privacy controls across its entire data pipeline
  6. Test the calendar time needed to develop and add a new feature to the production model
  7. Test all code that creates input features

1. Test that the distributions of each feature match your expectations. One example might be to test that Feature A takes on values 1 to 5, or that the two most common values of Feature B are "Harry" and "Potter" and they account for 10% of all values. This test can fail due to real external changes, which may require changes in your model.
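A sketch of such assertions with pandas, reusing the examples from the text as if they were a real schema (they are not):

    # Hypothetical distribution checks mirroring the examples above.
    import pandas as pd

    df = pd.DataFrame({
        "feature_a": [1, 2, 3, 4, 5, 3, 2],
        "feature_b": ["Harry", "Potter", "Harry", "Ron", "Potter", "Hermione", "Harry"],
    })

    # Feature A should only take on values 1 to 5.
    assert df["feature_a"].between(1, 5).all()

    # The two most common values of Feature B should be "Harry" and "Potter",
    # and together they should account for at least 10% of all values.
    top_two = df["feature_b"].value_counts(normalize=True).head(2)
    assert set(top_two.index) == {"Harry", "Potter"}
    assert top_two.sum() >= 0.10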

2. Test the relationship between each feature and the target, and the pairwise correlations between individual signals. It is important to have a thorough understanding of the individual features used in a given model; this is a minimal set of tests, more exploration may be needed to develop a full understanding. These tests may be run by computing correlation coefficients, by training models with one or two features, or by training a set of models that each have one of k features individually removed.
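A minimal sketch of the first two approaches, correlation coefficients and single-feature models, on synthetic data:

    # Hypothetical exploration of feature/target relationships: correlation
    # coefficients plus the predictive power of each feature on its own.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"f1": rng.normal(size=500), "f2": rng.normal(size=500)})
    df["target"] = (df["f1"] + 0.1 * rng.normal(size=500) > 0).astype(int)

    # Correlation of each feature with the target, and between features.
    print(df.corr())

    # Single-feature models.
    for feature in ["f1", "f2"]:
        score = cross_val_score(LogisticRegression(), df[[feature]], df["target"], cv=5).mean()
        print(feature, "single-feature accuracy:", round(score, 3))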

3. Test the cost of each feature. The costs of a feature may include added inference latency and RAM usage, more upstream data dependencies, and additional expected instability incurred by relying on that feature. Consider whether this cost is worth paying when traded off against the provided improvement in model quality.

4. Test that a model does not contain any features that have been manually determined as unsuitable for use. A feature might be unsuitable when it’s been discovered to be unreliable, overly expensive, etc. Tests are needed to ensure that such features are not accidentally included (e.g. via copy-paste) into new models.
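A sketch of such a guard; the banned feature names and the model's feature list are made up:

    # Hypothetical test that no manually banned feature sneaks into a model config.
    BANNED_FEATURES = {"user_ssn", "deprecated_click_score"}  # illustrative names

    def test_no_banned_features(model_feature_list):
        used = set(model_feature_list) & BANNED_FEATURES
        assert not used, f"model uses features marked unsuitable: {sorted(used)}"

    # Example: this would fail if the model spec listed "deprecated_click_score".
    test_no_banned_features(["age_bucket", "country", "recent_purchases"])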

5. Test that your system maintains privacy controls across its entire data pipeline. While strict access control is typically maintained on raw data, ML systems often export and transform that data during training. Test to ensure that access control is appropriately restricted across the entire pipeline.

6. Test the calendar time needed to develop and add a new feature to the production model. The faster a team can go from a feature idea to it running in production, the faster it can both improve the system and respond to external changes.

7. Test all code that creates input features, both in training and serving. It can be tempting to believe feature creation code is simple enough to not need unit tests, but this code is crucial for correct behavior and so its continued quality is vital.
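A small example of the kind of unit test meant here, for a hypothetical feature function:

    # Hypothetical feature-creation function and its unit test. The function is
    # illustrative; the point is that even "trivial" feature code gets tested.
    def bucketize_age(age):
        """Map a raw age into the coarse buckets used as a model feature."""
        if age < 0:
            raise ValueError("age cannot be negative")
        if age < 18:
            return "minor"
        if age < 65:
            return "adult"
        return "senior"

    def test_bucketize_age():
        assert bucketize_age(10) == "minor"
        assert bucketize_age(30) == "adult"
        assert bucketize_age(70) == "senior"
        # Boundary values are exactly where feature bugs tend to hide.
        assert bucketize_age(18) == "adult"
        assert bucketize_age(65) == "senior"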

From "What’s your ML Test Score? A rubric for ML production systems", NIPS 2016



Tuesday, August 6, 2019

What is deep learning ?

Long Story Short

I like this definition: clean, concise, and to the point, with zero marketing fluff.
“A class of parametrized non-linear representations encoding appropriate domain knowledge (invariance and stationarity) that can be (massively) optimized efficiently using stochastic gradient descent”
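Read literally, that is all the following toy NumPy sketch does: a parametrized non-linear representation (one hidden layer) whose parameters are fitted by stochastic gradient descent on mini-batches. It is an illustration of the definition, not any particular library's implementation:

    # Toy illustration: a parametrized non-linear representation optimized with
    # stochastic gradient descent on a simple non-linear target.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)        # a non-linear target

    W1, b1 = rng.normal(size=(2, 16)) * 0.5, np.zeros(16)
    W2, b2 = rng.normal(size=(16, 1)) * 0.5, np.zeros(1)
    lr = 0.1

    for step in range(2000):
        idx = rng.integers(0, len(X), size=32)        # "stochastic": a random mini-batch
        xb, yb = X[idx], y[idx, None]
        h = np.tanh(xb @ W1 + b1)                     # non-linear representation
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))          # prediction
        # Gradients of the logistic loss, by backpropagation.
        dlogits = (p - yb) / len(xb)
        dW2, db2 = h.T @ dlogits, dlogits.sum(0)
        dh = dlogits @ W2.T * (1 - h ** 2)
        dW1, db1 = xb.T @ dh, dh.sum(0)
        for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            param -= lr * grad                        # gradient descent step

    h_all = np.tanh(X @ W1 + b1)
    preds = (1 / (1 + np.exp(-(h_all @ W2 + b2)))).ravel() > 0.5
    print("training accuracy:", (preds == y).mean())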

What do neural networks actually do?