Data in Deployment

Introduction by

D. Sculley

D. is a director in Google Brain, leading research teams working on robust, responsible, reliable and efficient ML and AI. In his time at Google, he has worked on nearly every aspect of machine learning, and has led both product and research teams, including those working on some of the most challenging business problems.

A Data-Centric View of Technical Debt in AI

Several years ago, we introduced the concept of technical debt in machine learning. The concept of technical debt originally comes from the world of software engineering, where it has often been found that pushing to develop software very quickly can create long-term maintenance costs that must be paid back later, and that can compound over time if left unaddressed. Taking on debt is sometimes a useful strategy, but we should choose carefully when and where to do so, and have a solid plan for paying it off later. Technical debt is created by things like broken abstraction boundaries, poor documentation, and missing unit tests and system tests. It is paid off by things like refactoring, improving documentation, and creating unit tests and integration tests.

One finding that was surprising at the time we published the paper was that machine learning and AI systems have a special ability to create large amounts of technical debt. The reason is that ML systems have all of the qualities of traditional software systems, as they are of course built on code. But they additionally have a second layer of issues, because the behavior of these systems cannot be specified exactly in advance. Instead, their behavior is implicitly defined by the specific choices of model algorithms, hyperparameter settings, and of course the training data used. So unlike writing a unit test for, say, a sorting algorithm whose precondition and postcondition are well defined in advance for any possible inputs, it can be maddeningly difficult to specify what the “correct” behavior for an ML system should be in the face of the wide range of input data it might be presented with in practice. This makes things like refactoring, documentation, and rigorous testing challenging.

From the standpoint of data-centric AI, we can even make the case that it is the data itself that is the largest potential source of technical debt in an ML system. There are two ways to see that this is true. The first is to look at the overall components of a typical production-level ML system.

It is useful to note that the ML code – the bit that we tend to think of as the cool part – is actually a small component of the overall system, maybe five percent or less in terms of overall code. Things like data collection, data verification, and feature extraction all form much larger parts of the overall system, and are all obviously at the heart of a data-centric approach. But even typical Serving Infrastructure that helps to deploy the model to make predictions within the context of a live system will require an extensive data pipeline to ensure that all relevant information is provided to the model at prediction time. And if we consider Monitoring, any ML Ops engineer worth their salt will make sure that monitoring data distributions is a top priority. Overall, this means that something like 70% of our overall system complexity is tied to data – processing, handling, and monitoring – and that these tasks can bridge multiple systems or subsystems. No wonder it is a major source of technical debt.

There is a second, perhaps even more important way to see that data can be a source of unexpectedly large technical debt. This is due to the fact that the data defines the behavior of our models, and in this way takes on the role of code. If we want a vision model to do a good job at identifying insects, or an audio model to recognize a verbal command, or a movie recommendation model to help a user pick entertainment for a Friday evening, the abilities and behaviors of our model will be defined primarily by what data we collect and use for training.

Over time, practitioners have found that it is beneficial to collect and train on ever larger datasets. In recent years, dataset sizes for many model types have been growing at an exponential rate. It is worth considering why this is useful, because in a lot of settings it is not actually statistically useful to get more and more data. For example, if we want to estimate the probability that a given coin will come up heads, classical statistics is happy to tell us that the value of additional random trials diminishes rapidly. Even for high-stakes information gathering like opinion polls, it is rarely useful from a statistical point of view to collect more than a few thousand responses. So why do ML folks spend so much time, effort, and computation to collect datasets that can contain many billions of training examples?
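To make the coin example concrete, here is a minimal sketch of the usual normal-approximation margin of error for an estimated proportion. The function name, the 95% z-value of 1.96, and the worst-case p = 0.5 are illustrative assumptions, not something taken from the original paper.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from n
    independent trials, using the normal approximation.
    p=0.5 is the worst case; z=1.96 corresponds to roughly 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

# The error shrinks with the square root of n, so each extra
# order of magnitude of data buys less and less precision.
for n in (100, 1_000, 10_000, 1_000_000):
    print(f"n={n:>9,}: +/- {margin_of_error(n):.4f}")
```

Under these assumptions the margin is roughly 10 percentage points at 100 trials, 3 points at 1,000, and 1 point at 10,000. Quadrupling the sample only halves the error, which is why a few thousand responses is usually enough for a poll.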

The answer is, of course, that we want our models to behave well in a wide variety of circumstances, and that larger datasets are useful because they have a larger amount of diversity. This allows our models to learn more about a long tail of rare or unusual circumstances, and how they should behave within them. Indeed, when we talk about large datasets improving model accuracy overall, this is usually a convenient shorthand to say that we are improving a model’s ability to predict well on difficult, rare, or unusual events. The easy cases are almost always well handled already.
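A small worked example shows how an overall accuracy number can hide poor behavior on rare cases. This is only a sketch: the slice names, the counts, and the `accuracy_by_slice` helper are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """records: iterable of (slice_name, is_correct) pairs.
    Returns (overall_accuracy, {slice_name: accuracy})."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_slice = {name: correct[name] / totals[name] for name in totals}
    return overall, per_slice

# Hypothetical evaluation set: 990 common examples and 10 rare ones.
records = (
    [("common", True)] * 985 + [("common", False)] * 5
    + [("rare", True)] * 4 + [("rare", False)] * 6
)

overall, per_slice = accuracy_by_slice(records)
print(f"overall: {overall:.3f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.3f}")
```

With these made-up counts the overall accuracy is 98.9%, while accuracy on the rare slice is only 40%. Improving the rare slice barely moves the headline number, yet that is where additional, more diverse data matters.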

This means that when we increase the size of our datasets – which again, is happening currently at an exponential rate in many cases – we are doing so because we want to increase the range of behaviors exhibited by our system. And this means that we now have a wider range of system behaviors to test, monitor, and verify, which if not attended to becomes unpaid and compounding technical debt.

This all might sound a little bleak, but there are things we can do to pay off data-centric technical debt for our ML and AI systems. It can be a long journey, but here are three places to start:

Audit and monitor data quality. Manual inspection of data remains a critical step in many cases, so we can learn about problems and issues in the data that would not otherwise be apparent. Automating monitoring of data quality, and even of basic distributional statistics, can also go a long way to detecting any problems as quickly as possible.
Create data sheets for data sets. Pioneered by Timnit Gebru and colleagues, the practice of creating standardized documentation for training data has the same remarkable ability to reduce ML-based technical debt that documenting code has for traditional code-based systems.
Create and apply stress tests using data. ML systems that are deployed in the real world will eventually encounter data distributions and inputs that differ significantly from historical training data. This means that we cannot reliably estimate the performance of our systems just by the traditional practice of evaluating a model based purely on a random holdout of training data. Instead, we should create a variety of specific “stress test” datasets to help probe the limits of our model and its robustness. One good approach is to use data that specifically breaks common spurious correlations in the training data, including synthetic counterfactuals or naturally occurring but unusual data. It is also useful to create specific evaluation sets focused on specific use cases highlighted by domain experts.
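As one possible starting point for the first item above, automated monitoring of basic distributional statistics can be sketched in a few lines. Everything here is an assumption made for illustration: the `summarize` and `drift_alerts` helpers, the thresholds, and the sample values. A production system would track many features and use more careful statistical tests.

```python
import statistics

def summarize(values):
    """Basic statistics for one numeric feature; None marks a missing value.
    Assumes at least one value is present."""
    present = [v for v in values if v is not None]
    return {
        "missing_rate": 1 - len(present) / len(values),
        "mean": statistics.fmean(present),
        "stdev": statistics.pstdev(present),
    }

def drift_alerts(reference, live, max_mean_shift=0.5, max_missing_increase=0.05):
    """Compare live data against a reference sample and return alert messages.
    Thresholds are illustrative, not recommended defaults."""
    ref, cur = summarize(reference), summarize(live)
    alerts = []
    scale = ref["stdev"] or 1.0  # avoid dividing by zero for constant features
    shift = abs(cur["mean"] - ref["mean"]) / scale
    if shift > max_mean_shift:
        alerts.append(f"mean shifted by {shift:.2f} reference standard deviations")
    increase = cur["missing_rate"] - ref["missing_rate"]
    if increase > max_missing_increase:
        alerts.append(f"missing rate increased by {increase:.0%}")
    return alerts

# Hypothetical feature values at training time and in production.
reference = [10, 12, 11, 9, 10, 11, 12, 9]
live = [15, 16, None, 14, None, 15, 16, 14]

for alert in drift_alerts(reference, live):
    print("ALERT:", alert)
```

In this made-up example both checks fire: the live mean has moved several reference standard deviations, and a quarter of the live values are missing. Comparing against a fixed reference sample keeps the check cheap enough to run on every batch.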

It is fair to say that these steps are a lot of work, and from an organizational standpoint can be expensive. But from a data-centric point of view, it is clear that these are the areas that will have the greatest impact on the long-term health and reliability of our AI and ML systems.

Resources

The Hugging Face 🤗 Data Measurements Tool

We created the Hugging Face 🤗 Data Measurements Tool, a no-code interface that helps empower members of the AI community to build, measure, and compare datasets.

Sasha Luccioni

Published 31 Mar 2022

How Observability Uncovers the Effects of ML Technical Debt

One of the most alarming aspects of machine learning is that many teams don’t yet have tools and processes to measure the negative effects of technical debt in their production systems. Many teams test their machine learning models offline but conduct little to no online evaluation after initial deployment. These teams are flying blind, running production…

Bernease Herman, Danny D. Leybzon, Alessya Visnjic

Published 16 Feb 2022

Combining Data-driven Supervision with Human-in-the-loop Feedback for Entity Resolution

The distribution gap between training datasets and data encountered in production is well acknowledged. Training datasets are often constructed over a fixed period of time and by carefully curating the data to be labeled. Thus, training…

Wenpeng Yin, Shelby Heinecke, Jia Li, Nitish Shirish Keskar, Michael Jones, Shouzhong Shi, Stanislav Georgiev, Kurt Milich, Joseph Esposito, Caiming Xiong

Published 14 Feb 2022

AutoDC: Automated data-centric processing

AutoML (automated machine learning) has been extensively developed in the past few years for the model-centric approach. As for the data-centric approach, the processes to improve the dataset, such as fixing incorrect labels, adding examples that represent edge cases, and applying data augmentation,…

Zac Yung-Chun Liu, Shoumik Roychowdhury, Scott Tarlow, Akash Nair, Shweta Badhe, Tejas Shah

Published 14 Feb 2022

Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation

As the adoption of deep learning techniques in industrial applications grows with increasing speed and scale, successful deployment of deep learning models often hinges on the availability, volume, and quality of annotated data. In this paper, we tackle the problems of efficient data labeling and annotation verification…

Haoping Bai, Meng Cao, Ping Huang, Jiulong Shan

Published 14 Feb 2022

Contrasting the Profiles of Easy and Hard Observations in a Dataset

For supporting data-centric analyses, it is important to identify and characterize which observations from a dataset are hard or easy to classify. This paper employs meta-learning strategies to describe the main differences between observations which are easy and hard to classify in…

Camila Castro Moreno, Pedro Yuri Arbs Paiva, Gustavo H. Nunes, Ana Carolina Lorena

Published 14 Feb 2022

What can Data-Centric AI Learn from Data and ML Engineering?

Data-centric AI is a new and exciting research topic in the AI community, but many organizations already build and maintain various “data-centric” applications whose goal is to produce high quality data. These range from traditional business data processing applications (e.g., “how much should we charge each of our customers this month?”) to…

Alkis Polyzotis, Matei Zaharia

Published 14 Feb 2022

Challenges and Solutions to build a Data Pipeline to Identify Anomalies in Enterprise System Performance

We discuss how VMware is solving the following challenges to harness data to operate our ML-based anomaly detection system to detect performance issues in our Software Defined Data Center (SDDC) enterprise deployments: (i) label scarcity and label bias due to…

Xiaobo Huang, Amitabha Banerjee, Chien-Chia Chen, Chengzhi Huang, Tzu Yi Chuang, Abhishek Srivastava, Razvan Cheveresan

Published 14 Feb 2022

Finding Label Errors in Autonomous Vehicle Data With Learned Observation Assertions

ML is being deployed in complex, real-world scenarios where errors have impactful consequences. As such, thorough testing of the ML pipelines is critical. A key component in ML deployment pipelines is the curation of labeled training data, which is assumed to be ground truth.…

Daniel Kang, Nikos Arechiga, Sudeep Pillai, Peter D Bailis, Matei Zaharia

Published 14 Feb 2022

How Observability Uncovers the Effects of ML Technical Debt

Many teams test their machine learning models offline but conduct little to no online evaluation after initial deployment. These teams are flying blind—running production systems with no insight into their ongoing performance. Observability is the first step in measuring how this technical debt has come to bear on your business.

Bernease Herman, Danny D. Leybzon, Alessya Visnjic

Published 13 Jan 2022
