Data-Centric AI (DCAI) represents the recent transition from focusing on modeling to focusing on the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal -- painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.
The main objective of this workshop is to cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems. We consider some of those problems to be: data collection/generation, data labeling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. Many of these areas are nascent, and we hope to further their development by knitting them together into a coherent whole. Together we will define the DCAI movement that will shape the future of AI and ML. Please see our call for papers below to take an active role in shaping that future! If you have any questions, please reach out to the organizers ([email protected]).
Learn more about Data Centric AI (DCAI) here. This workshop builds on a series of workshops focusing on the role of data in AI:
September 30, 2021
October 22, 2021
December 14, 2021
For questions, please check the FAQ.
The ML community has a strong track record of building and using datasets for AI systems.
But this endeavor is often artisanal: painstaking and expensive.
The community lacks high-productivity, efficient open data engineering tools to make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable.
The core challenge, then, is to accelerate dataset creation and iteration while increasing the efficiency of dataset use and reuse, by democratizing data engineering and evaluation.
If 80 percent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team and therefore a vital research area. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The growing focus on the scale, speed, and cost of building and improving datasets has had an impact on quality, which is nebulous and often circularly defined, since the annotators are the source of both the data and the ground truth [Riezler, 2014]. The development of tools to make repeatable and systematic adjustments to datasets has also lagged. While dataset quality remains everyone's top concern, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019], and unrealistic performance metrics [Bernstein, 2021].
We need a framework for excellence in data engineering that does not yet exist. In the first-to-market rush with data, the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Building an active research community focused on Data Centric AI is an important part of defining the core problems and creating ways to measure progress in machine learning through data quality tasks.
We welcome short papers (1-2 pages) and long papers (4 pages) addressing one or more of the topics of interest below. All papers need to be formatted according to the NeurIPS 2021 Formatting Instructions. Papers will be peer-reviewed by the program committee, and accepted papers will be presented as lightning talks during the workshop. If you have any questions about submission, please first check the FAQ link below. Email us at ([email protected]) only if your question is not answered in the FAQ or if you experience problems with the submission site.
The Data Centric AI workshop invites position papers from researchers and practitioners on topics that include, but are not limited to, the following:
New Datasets in areas:
Tools & methodologies for accelerating open-source dataset iteration:
Algorithms for working with limited labeled data and improving label efficiency:
Responsible AI development:
|8:30 AM||11:30 AM||4:30 PM||Opening Remarks With Andrew Ng|
|8:45 AM||11:45 AM||4:45 PM||Workshop Information with Lora Aroyo|
|9:00 AM||12:00 PM||5:00 PM||Keynote: HCI and Crowdsourcing for DCAI with Michael Bernstein|
|9:15 AM||12:15 PM||5:15 PM||Invited Talk: Past/Future of data centric AI with Olga Russakovsky|
|9:25 AM||12:25 PM||5:25 PM||Lightning Talks: Benchmarking|
|10:30 AM||1:30 PM||6:30 PM||Invited Talk: DataPerf - Benchmarking Data Centric AI with Peter Mattson|
|10:40 AM||1:40 PM||6:40 PM||Lightning Talks: Theory and Challenge Problems in Data Centric AI|
|11:20 AM||2:20 PM||7:20 PM||Invited Talk: Dynabench with Douwe Kiela|
|11:30 AM||2:30 PM||7:30 PM||Lightning Talks: Responsibility and Ethics|
|12:10 PM||3:10 PM||8:10 PM||Panel with Morning Speakers|
|12:50 PM||3:50 PM||8:50 PM||Break and Poster Session|
|1:20 PM||4:20 PM||9:20 PM||Keynote: Chris Re|
|1:35 PM||4:35 PM||9:35 PM||Invited Talk: D Sculley - Data Debt|
|1:45 PM||4:45 PM||9:45 PM||Lightning Talks: Datasets and Data Synthesis|
|2:45 PM||5:45 PM||10:45 PM||Invited Talk: Curtis Northcutt|
|2:55 PM||5:55 PM||10:55 PM||Lightning Talks: Data Quality and Iteration|
|3:35 PM||6:35 PM||11:35 PM||Invited Talk: Anima Anandkumar|
|3:45 PM||6:45 PM||11:45 PM||Lightning Talks: Data Labeling|
|4:25 PM||7:25 PM||12:25 AM||Panel session with afternoon speakers|
|5:05 PM||8:05 PM||1:05 AM||Break and poster session|