A Data-Centric View of Technical Debt in AI
Several years ago, we introduced the concept of technical debt in machine learning. The idea of technical debt originally comes from software engineering, where it has often been found that pushing to develop software very quickly can create long-term maintenance costs that must be paid back later and that, if left unaddressed, compound over time. Taking on debt is sometimes a useful strategy, but we should make careful choices about when and where to do so – and have a solid plan for how to pay it off later. Technical debt is typically created by things like breaking abstraction boundaries, poor documentation, and a lack of unit tests and system tests. It is typically paid off by things like refactoring, improving documentation, and creating unit tests and integration tests.
One thing that surprised us at the time we published the paper was that machine learning and AI systems have a special ability to create large amounts of technical debt. The reason is that ML systems have all of the qualities of traditional software systems, as they are of course built on code. But they additionally have a second layer of issues, because the behavior of these systems cannot be specified exactly in advance. Instead, their behavior is implicitly defined by the specific choices of model algorithms, hyperparameter settings, and of course the training data used. So unlike writing a unit test for, say, a sorting algorithm whose precondition and postcondition are well defined in advance for any possible input, it can be maddeningly difficult to specify what the “correct” behavior for an ML system should be in the face of the wide range of input data it might be presented with in practice. This makes things like refactoring, documentation, and rigorous testing challenging.
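To make the sorting contrast concrete, here is a minimal sketch of what a fully specified postcondition looks like. The helper names (`check_sort_postcondition`, `run_random_checks`) are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def check_sort_postcondition(sort_fn, xs):
    """Return True if sort_fn(xs) is ordered and is a permutation of xs."""
    out = sort_fn(list(xs))
    # Postcondition 1: every element is <= its successor.
    is_ordered = all(a <= b for a, b in zip(out, out[1:]))
    # Postcondition 2: same elements with the same multiplicities as the input.
    is_permutation = Counter(out) == Counter(xs)
    return is_ordered and is_permutation

def run_random_checks(sort_fn, trials=200, seed=0):
    """Check the postcondition on many random integer lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not check_sort_postcondition(sort_fn, xs):
            return False
    return True

print(run_random_checks(sorted))
```

The two conditions hold for any possible input list, so the test needs no knowledge of where the inputs came from. There is no comparably short predicate that decides whether, say, an image classifier's output is "correct" for an arbitrary input, which is the gap the paragraph above describes.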
From the standpoint of data-centric AI, we can even make the case that the data itself is the largest potential source of technical debt in an ML system. There are two ways to see that this is true. The first is to look at the overall components of a typical production-level ML system.
It is useful to note that the ML code – the bit that we tend to think of as the cool part – is actually a small component of the overall system, maybe five percent or less in terms of overall code. Things like data collection, data verification, and feature extraction all form much larger parts of the overall system, and are all obviously at the heart of a data-centric approach. But even typical Serving Infrastructure that helps to deploy the model to make predictions within the context of a live system will require an extensive data pipeline to ensure that all relevant information is provided to the model at prediction time. And if we consider Monitoring, any ML Ops engineer worth their salt will make sure that monitoring data distributions is a top priority. Overall, this means that something like 70% of our overall system complexity is tied to data – processing, handling, and monitoring – and that these tasks can bridge multiple systems or subsystems. No wonder it is a major source of technical debt.
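As one illustration of the distribution monitoring mentioned above, the sketch below compares a feature's serving-time distribution against its training-time distribution using the Population Stability Index (PSI). PSI is a swapped-in example metric, not one named in this article, and the function names, bin count, and synthetic data are all assumptions made for the example.

```python
import math
import random

def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin; out-of-range values fall into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    # Floor each fraction at eps so the logarithm below is always defined.
    return [max(c / len(values), eps) for c in counts]

def population_stability_index(reference, current, n_bins=10):
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    edges = [lo + i * width for i in range(n_bins + 1)]
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Synthetic data standing in for one numeric feature (assumption for the demo).
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
serving_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
serving_shifted = [rng.gauss(2.0, 1.0) for _ in range(5000)]

print("same distribution:", round(population_stability_index(training, serving_same), 4))
print("shifted mean:     ", round(population_stability_index(training, serving_shifted), 4))
```

A commonly quoted rule of thumb treats a PSI below about 0.1 as stable and above about 0.25 as a significant shift, but suitable alert thresholds depend on the feature and the application.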
There is a second, perhaps even more important way to see that data can be a source of unexpectedly large technical debt. This is due to the fact that the data defines the behavior of our models, and in this way takes on the role of code. If we want a vision model to do a good job at identifying insects, or an audio model to recognize a verbal command, or a movie recommendation model to help a user pick entertainment for a Friday evening, the abilities and behaviors of our model will be defined primarily by what data we collect and use for training.
Over time, practitioners have found that it is beneficial to collect and train on ever larger datasets. In recent years, dataset sizes for many model types have been growing exponentially. It is worth considering why this is useful, because in a lot of settings it is not actually statistically useful to get more and more data. For example, if we want to estimate the probability that a given coin will come up heads, classical statistics is happy to tell us that the value of additional random trials diminishes rapidly. Even for high-stakes information gathering like opinion polls, it is rarely useful from a statistical point of view to collect more than a few thousand responses. So why do ML folks spend so much time, effort, and computation to collect datasets that can contain many billions of training examples?
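The diminishing return is easy to quantify. For a proportion p estimated from n independent trials, the standard error is sqrt(p(1 - p) / n), so the uncertainty shrinks only with the square root of the sample size. The short sketch below assumes a fair coin (p = 0.5) and the usual normal-approximation 95% interval (z = 1.96); the function names are illustrative.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval."""
    return z * standard_error(p, n)

for n in (100, 1_000, 10_000, 1_000_000):
    moe = margin_of_error(0.5, n)
    print(f"n={n:>9,}  95% margin of error = +/-{moe:.4f}")
```

At about 1,000 trials the margin is already close to three percentage points, and every further tenfold tightening costs a hundredfold increase in data. If estimating a single quantity were the goal, billions of examples would be hard to justify.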
The answer is, of course, that we want our models to behave well in a wide variety of circumstances, and that larger datasets are useful because they have a larger amount of diversity. This allows our models to learn more about a long tail of rare or unusual circumstances, and how they should behave within them. Indeed, when we talk about large datasets improving model accuracy overall, this is usually a convenient shorthand to say that we are improving a model’s ability to predict well on difficult, rare, or unusual events. The easy cases are almost always well handled already.
This means that when we increase the size of our datasets – which, again, is currently happening at an exponential rate in many cases – we are doing so because we want to increase the range of behaviors exhibited by our system. And this means that we now have a wider range of system behaviors to test, monitor, and verify, which, if not attended to, becomes unpaid and compounding technical debt.
This all might sound a little bleak, but there are things we can do to pay off data-centric technical debt for our ML and AI systems. It can be a long journey, but here are three places to start:
- Audit and monitor data quality. Manual inspection of data remains a critical step in many cases, so we can learn about problems and issues in the data that would not otherwise be apparent. Automating monitoring of data quality, and even of basic distributional statistics, can also go a long way to detecting any problems as quickly as possible.
- Create datasheets for datasets. Pioneered by Timnit Gebru and colleagues, the practice of creating standardized documentation for training data has the same remarkable ability to reduce ML-based technical debt that documenting code has for traditional code-based systems.
- Create and apply stress tests using data. ML systems that are deployed in the real world will eventually encounter data distributions and inputs that differ significantly from historical training data. This means that we cannot reliably estimate the performance of our systems through the traditional practice of evaluating a model purely on a random holdout of the training data. Instead, we should create a variety of specific “stress test” datasets to help probe the limits of our model and its robustness. One good approach is to use data that specifically breaks common spurious correlations in the training data, including synthetic counterfactuals or naturally occurring but unusual data. It is also useful to create dedicated evaluation sets focused on specific use cases highlighted by domain experts.
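The stress-testing idea in the last item can be sketched as a per-slice evaluation: instead of a single accuracy number on a random holdout, each named stress set is scored separately and compared against a minimum bar. Everything below is a toy illustration under stated assumptions: the function names, the slice names, and the `toy_predict` stand-in (a "model" that has latched onto image background, a spurious cue) are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_slice(predict, examples):
    """examples: iterable of (features, label, slice_name) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for features, label, slice_name in examples:
        total[slice_name] += 1
        correct[slice_name] += int(predict(features) == label)
    return {name: correct[name] / total[name] for name in total}

def failing_slices(report, min_accuracy):
    """Names of slices whose accuracy falls below the required minimum."""
    return sorted(name for name, acc in report.items() if acc < min_accuracy)

def toy_predict(features):
    # Toy stand-in for a model that relies on background instead of the animal.
    return "cow" if features["background"] == "grass" else "camel"

examples = [
    # Random holdout: background and label are correlated, as in training.
    ({"animal": "cow", "background": "grass"}, "cow", "random_holdout"),
    ({"animal": "camel", "background": "sand"}, "camel", "random_holdout"),
    # Stress set: counterfactual examples that break the correlation.
    ({"animal": "cow", "background": "sand"}, "cow", "counterfactual_background"),
    ({"animal": "camel", "background": "grass"}, "camel", "counterfactual_background"),
]

report = accuracy_by_slice(toy_predict, examples)
print(report)
print("failing:", failing_slices(report, min_accuracy=0.9))
```

The toy model looks perfect on the random holdout and fails completely on the counterfactual slice, which is exactly the kind of gap a holdout-only evaluation hides. In practice the slices would come from curated stress datasets and domain-expert use cases rather than hand-written examples.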
It is fair to say that these steps are a lot of work, and from an organizational standpoint they can be expensive. But from a data-centric point of view, it is clear that these are the areas that will have the greatest impact on the long-term health and reliability of our AI and ML systems.