
NeurIPS Data-Centric AI Workshop

14 December 2021

Data-Centric AI (DCAI) represents the recent shift in focus from modeling to the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.

The main objective of this workshop is to cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems, including data collection/generation, data labeling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. Many of these areas are nascent, and we hope to further their development by knitting them together into a coherent whole. Together we will define the DCAI movement that will shape the future of AI and ML. Please see our call for papers below to take an active role in shaping that future! If you have any questions, please reach out to the organizers ([email protected]).

Learn more about Data-Centric AI (DCAI) here. This workshop builds on a series of prior workshops focusing on the role of data in AI.

Important Dates

Submission Deadline

September 30, 2021

Notification of Acceptance

October 22, 2021

Workshop

December 14, 2021


For questions, please check the FAQ.

Call for Papers

The ML community has a strong track record of building and using datasets for AI systems. But this endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The core challenge, then, is to accelerate dataset creation and iteration while increasing the efficiency of use and reuse by democratizing data engineering and evaluation.

If 80 percent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team, and therefore a vital research area. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The growing focus on the scale, speed, and cost of building and improving datasets has come at the expense of quality, a notion that is nebulous and often circularly defined, since the annotators are the source of both the data and its ground truth [Riezler, 2014]. The development of tools for making repeatable and systematic adjustments to datasets has also lagged. While dataset quality remains everyone's top concern, the ways in which quality is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019], and unrealistic performance metrics [Bernstein 2021].

We need a framework for excellence in data engineering that does not yet exist. In the first-to-market rush with data, the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Building an active research community focused on Data-Centric AI is an important part of defining the core problems and creating ways to measure progress in machine learning through data quality tasks.

Submission Instructions

We welcome short papers (1-2 pages) and long papers (4 pages) addressing one or more of the topics of interest below. All papers must be formatted according to the NeurIPS 2021 Formatting Instructions. Papers will be peer-reviewed by the program committee, and accepted papers will be presented as lightning talks during the workshop. If you have any questions about submission, please first check the FAQ below. If your question is not answered there, or if you experience any problems with the submission site, email us at ([email protected]).

Topics of Interest

The Data-Centric AI workshop invites position papers from researchers and practitioners on topics that include, but are not limited to, the following:

New datasets in areas such as:

  • Speech, vision, manufacturing, medical, recommendation/personalization
  • Science: https://www.mgi.gov/

Tools & methodologies for accelerating open-source dataset iteration:

  • Tools that quantify and accelerate time to source and prepare high quality data
  • Tools that ensure that the data is labeled consistently, such as label consensus
  • Tools that make improving data quality more systematic
  • Tools that automate the creation of high quality supervised learning training data from low quality resources, such as forced alignment in speech recognition
  • Tools that produce consistent and low noise data samples, or remove labeling noise or inconsistencies from existing data
  • Tools for controlling what goes into the dataset and for making high level edits efficiently to very large datasets, e.g. adding new words, languages, or accents to speech datasets with thousands of hours
  • Search methods for finding suitably licensed datasets based on public resources
  • Tools for creating training datasets for small data problems, or for rare classes in the long tail of big data problems
  • Tools for timely incorporation of feedback from production systems into datasets
  • Tools for understanding dataset coverage of important classes, and editing them to cover newly identified important cases
  • Dataset importers that allow easy combination and composition of existing datasets
  • Dataset exporters that make the data consumable for models and interface with model training and inference systems such as webdataset.
  • System architectures and interfaces that enable composition of dataset tools, such as MLCube, Docker, and Airflow
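As one illustration of the label-consensus idea in the list above, here is a minimal majority-vote sketch in Python. The function name, threshold parameter, and flagging behavior are assumptions for illustration, not a reference implementation; real consensus tools (e.g. Dawid-Skene-style models) weight annotators by estimated reliability rather than counting votes.

```python
from collections import Counter

def consensus_label(annotations, min_agreement=0.5):
    """Return the majority label when annotators agree strongly enough.

    annotations: labels assigned by independent annotators to one item.
    min_agreement: fraction of annotators that must exceed agreement on
    the winning label; otherwise return None so the item can be flagged
    for re-labeling or adjudication.
    """
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    # Strict inequality: an even split (e.g. 1 of 2) is not a consensus.
    return label if count / len(annotations) > min_agreement else None

print(consensus_label(["cat", "cat", "dog"]))  # consensus: cat
print(consensus_label(["cat", "dog"]))         # no consensus: None
```

Items that fall below the agreement threshold are exactly the ones a data-quality pipeline would route back to annotators, which is how tools of this kind make label consistency systematic rather than ad hoc.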

Algorithms for working with limited labeled data and improving label efficiency:

  • Data selection techniques such as active learning and core-set selection for identifying the most valuable examples to label.
  • Semi-supervised learning, few-shot learning, and weak supervision methods for maximizing the power of limited labeled data.
  • Transfer learning and self-supervised learning approaches for developing powerful representations that can be used for many downstream tasks with limited labeled data.
  • Novelty and drift detection to identify when more data needs to be labeled.
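To make the active-learning bullet above concrete, here is a hedged sketch of entropy-based uncertainty sampling, one of the simplest data selection strategies. The function names and inputs are hypothetical; in practice the class probabilities would come from a model trained on the currently labeled pool.

```python
import math

def entropy(dist):
    """Shannon entropy of a predicted class distribution; higher
    entropy means the model is less certain about the example."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_to_label(example_ids, predicted_probs, budget):
    """Pick the `budget` most uncertain unlabeled examples to
    send to annotators next."""
    scored = sorted(zip(example_ids, predicted_probs),
                    key=lambda pair: entropy(pair[1]), reverse=True)
    return [example_id for example_id, _ in scored[:budget]]

# The 50/50 prediction is the most uncertain, so "b" is chosen first.
print(select_to_label(["a", "b", "c"],
                      [[0.95, 0.05], [0.5, 0.5], [0.8, 0.2]],
                      budget=1))
```

Labeling only the examples the model is least sure about is how these techniques stretch a fixed annotation budget; core-set selection pursues the same goal by optimizing for coverage of the feature space instead of per-example uncertainty.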

Responsible AI development:

  • Fairness, bias, and diversity evaluation and analysis for datasets and modeling/algorithms
  • Tools for green AI hardware-software system design and evaluation
  • Scalable, reliable training methods and systems
  • Tools, methodologies, and techniques for private, secure machine learning training
  • Efforts toward reproducible AI, such as data cards, model cards

Organizing Committee

Andrew Ng, Landing AI, DeepLearning.AI
Lora Aroyo, Google Research
Cody Coleman, Stanford University
Greg Diamos, Landing AI
Vijay Janapa Reddi, Harvard University
Joaquin Vanschoren, Eindhoven University of Technology
Carole-Jean Wu, Facebook
Sharon Zhou, Stanford University


Morning Session Schedule

All times are listed as PST | EST | UTC.

8:30 AM | 11:30 AM | 4:30 PM | Opening Remarks with Andrew Ng
8:45 AM | 11:45 AM | 4:45 PM | Workshop Information with Lora Aroyo
9:00 AM | 12:00 PM | 5:00 PM | Keynote: HCI and Crowdsourcing for DCAI with Michael Bernstein
9:15 AM | 12:15 PM | 5:15 PM | Invited Talk: Past/Future of Data-Centric AI with Olga Russakovsky
9:25 AM | 12:25 PM | 5:25 PM | Lightning Talks: Benchmarking
10:30 AM | 1:30 PM | 6:30 PM | Invited Talk: DataPerf - Benchmarking Data-Centric AI with Peter Mattson
10:40 AM | 1:40 PM | 6:40 PM | Lightning Talks: Theory and Challenge Problems in Data-Centric AI
11:20 AM | 2:20 PM | 7:20 PM | Invited Talk: Dynabench with Douwe Kiela
11:30 AM | 2:30 PM | 7:30 PM | Lightning Talks: Responsibility and Ethics
12:10 PM | 3:10 PM | 8:10 PM | Panel with Morning Speakers
12:50 PM | 3:50 PM | 8:50 PM | Break and Poster Session

Afternoon Session Schedule

1:20 PM | 4:20 PM | 9:20 PM | Keynote: Chris Re
1:35 PM | 4:35 PM | 9:35 PM | Invited Talk: Data Debt with D. Sculley
1:45 PM | 4:45 PM | 9:45 PM | Lightning Talks: Datasets and Data Synthesis
2:45 PM | 5:45 PM | 10:45 PM | Invited Talk: Curtis Northcutt
2:55 PM | 5:55 PM | 10:55 PM | Lightning Talks: Data Quality and Iteration
3:35 PM | 6:35 PM | 11:35 PM | Invited Talk: Anima Anandkumar
3:45 PM | 6:45 PM | 11:45 PM | Lightning Talks: Data Labeling
4:25 PM | 7:25 PM | 12:25 AM | Panel with Afternoon Speakers
5:05 PM | 8:05 PM | 1:05 AM | Break and Poster Session

Invited Talks