NeurIPS Data-Centric AI Workshop

14 December 2021

Data-Centric AI (DCAI) represents the recent shift in focus from modeling to the underlying data used to train and evaluate models. Increasingly, common model architectures have begun to dominate a wide range of tasks, and predictable scaling rules have emerged. While building and using datasets has been critical to these successes, the endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The DCAI movement aims to address this lack of tooling, best practices, and infrastructure for managing data in modern ML systems.

The main objective of this workshop is to cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems. We consider some of those problems to be: data collection/generation, data labeling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. Many of these areas are nascent, and we hope to further their development by knitting them together into a coherent whole. Together we will define the DCAI movement that will shape the future of AI and ML. Please see our call for papers below to take an active role in shaping that future! If you have any questions, please reach out to the organizers (neurips-data-centric-ai@googlegroups.com).

Learn more about Data Centric AI (DCAI) here. This workshop builds on a series of workshops focusing on the role of data in AI:

Important Dates

Submission deadline: September 30, 2021

Notification of acceptance: October 22, 2021

Workshop: December 14, 2021


For questions, please check the FAQ.

Call for Papers

The ML community has a strong track record of building and using datasets for AI systems, but this endeavor is often artisanal: painstaking and expensive. The community lacks high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets easier, cheaper, and more repeatable. The core challenge, then, is to accelerate dataset creation and iteration while making use and reuse more efficient, by democratizing data engineering and evaluation.

If 80 percent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team and therefore a vital research area. Human-labeled data has increasingly become the fuel and compass of AI-based software systems, yet innovative efforts have mostly focused on models and code. The growing focus on the scale, speed, and cost of building and improving datasets has affected quality, which is nebulous and often circularly defined, since the annotators are the source of both data and ground truth [Riezler, 2014]. The development of tools to make repeatable and systematic adjustments to datasets has also lagged. While dataset quality is still everyone's top concern, the ways in which it is measured in practice are poorly understood and sometimes simply wrong. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets [Goel and Faltings, 2019], quality issues in datasets [Crawford and Paglen, 2019], limitations of benchmarks [Kovaleva et al., 2019, Welty et al., 2019], reproducibility concerns in machine learning research [Pineau et al., 2018, Gunderson and Kjensmo, 2018], lack of documentation and replication of data [Katsuno et al., 2019], and unrealistic performance metrics [Bernstein, 2021].

We need a framework for excellence in data engineering, and one does not yet exist. In the first-to-market rush with data, the maintainability, reproducibility, reliability, validity, and fidelity of datasets are often overlooked. We want to turn this way of thinking on its head and highlight examples, case studies, and methodologies for excellence in data collection. Building an active research community focused on Data Centric AI is an important part of defining the core problems and creating ways to measure progress in machine learning through data quality tasks.

Submission Instructions

We welcome short papers (1-2 pages) and long papers (4 pages) addressing one or more of the topics of interest below. All papers need to be formatted according to the NeurIPS 2021 Formatting Instructions. Papers will be peer-reviewed by the program committee, and accepted papers will be presented as lightning talks during the workshop. If you have any questions about submission, please first check the FAQ link below. Email us at neurips-data-centric-ai@googlegroups.com only if your question is not answered in the FAQ or if you experience problems with the submission site.

Topics of Interest

The Data-Centric AI workshop invites position papers from researchers and practitioners on topics that include, but are not limited to, the following:

New datasets in areas such as:

  • Speech, vision, manufacturing, medical, recommendation/personalization
  • Science: https://www.mgi.gov/

Tools & methodologies for accelerating open-source dataset iteration:

  • Tools that quantify and accelerate time to source and prepare high quality data
  • Tools that ensure that the data is labeled consistently, such as label consensus
  • Tools that make improving data quality more systematic
  • Tools that automate the creation of high quality supervised learning training data from low quality resources, such as forced alignment in speech recognition
  • Tools that produce consistent and low noise data samples, or remove labeling noise or inconsistencies from existing data
  • Tools for controlling what goes into the dataset and for making high level edits efficiently to very large datasets, e.g. adding new words, languages, or accents to speech datasets with thousands of hours
  • Search methods for finding suitably licensed datasets based on public resources
  • Tools for creating training datasets for small data problems, or for rare classes in the long tail of big data problems
  • Tools for timely incorporation of feedback from production systems into datasets
  • Tools for understanding dataset coverage of important classes, and editing them to cover newly identified important cases
  • Dataset importers that allow easy combination and composition of existing datasets
  • Dataset exporters that make the data consumable for models and interface with model training and inference systems such as webdataset.
  • System architectures and interfaces that enable composition of dataset tools, such as MLCube, Docker, and Airflow

Algorithms for working with limited labeled data and improving label efficiency:

  • Data selection techniques such as active learning and core-set selection for identifying the most valuable examples to label.
  • Semi-supervised learning, few-shot learning, and weak supervision methods for maximizing the power of limited labeled data.
  • Transfer learning and self-supervised learning approaches for developing powerful representations that can be used for many downstream tasks with limited labeled data.
  • Novelty and drift detection to identify when more data needs to be labeled.

Responsible AI development:

  • Fairness, bias, diversity evaluation and analysis for data sets and modeling/algorithms
  • Tools for green AI hardware-software system design and evaluation
  • Scalable, reliable training methods and systems
  • Tools, methodologies, and techniques for private, secure machine learning training
  • Efforts toward reproducible AI, such as data cards, model cards

Organizing Committee

Andrew Ng, Landing AI, DeepLearning.AI
Lora Aroyo, Google Research
Cody Coleman, Stanford University
Greg Diamos, Landing AI
Vijay Janapa Reddi, Harvard University
Joaquin Vanschoren, Eindhoven University of Technology
Carole-Jean Wu, Facebook
Sharon Zhou, Stanford University


Morning Session Schedule

8:30 AM 11:30 AM 4:30 PM Andrew Ng - Opening Remarks
8:45 AM 11:45 AM 4:45 PM Lora Aroyo - Workshop Overview
9:00 AM 12:00 PM 5:00 PM Keynote: Michael Bernstein - HCI and Crowdsourcing for DCAI
9:15 AM 12:15 PM 5:15 PM Invited Talk: Past/Future of data centric AI with Olga Russakovsky
9:25 AM 12:25 PM 5:25 PM Lightning Talks: Benchmarking
10:25 AM 1:25 PM 6:25 PM Invited Talk: Peter Mattson - DataPerf - Benchmarking Data Centric AI
10:40 AM 1:40 PM 6:40 PM Lightning Talks: Theory and Challenge Problems in Data Centric AI
11:20 AM 2:20 PM 7:20 PM Invited Talk: Douwe Kiela - FAIR Dynabench
11:30 AM 2:30 PM 7:30 PM Lightning Talks: Responsibility and Ethics
12:10 PM 3:10 PM 8:10 PM Q&A Panel with Morning Speakers
12:50 PM 3:50 PM 8:50 PM Break to watch video recordings

Afternoon Session Schedule

1:20 PM 4:20 PM 9:20 PM Keynote: Alex Ratner & Chris Ré - The Future of Data Centric AI
1:35 PM 4:35 PM 9:35 PM Invited Talk: D Sculley - Data Debt
1:45 PM 4:45 PM 9:45 PM Lightning Talks: Datasets and Data Synthesis
2:45 PM 5:45 PM 10:45 PM Invited Talk: Curtis Northcutt
2:55 PM 5:55 PM 10:55 PM Lightning Talks: Data Quality and Iteration
3:40 PM 6:40 PM 11:40 PM Invited Talk: Anima Anandkumar
3:50 PM 6:50 PM 11:50 PM Lightning Talks: Data Labeling
4:30 PM 7:30 PM 12:30 AM Q&A Panel session with afternoon speakers
5:10 PM 8:10 PM 1:10 AM Break to watch video recordings

Detailed Schedule and Accepted Papers

Invited Talks

Accepted Papers

Title Authors (* corresponding) Link
A Hybrid Bayesian Model to Analyse Healthcare Data Pourshahrokhi, Narges*; Kouchaki, Samaneh; Kober, Kord; Miaskowski, Christine; Barnaghi, Payam Link
How should human translation coexist with NMT? Efficient tool for building high quality parallel corpus Park, Chanjun*; Lee, Seolhwa; Moon, Hyeonseok; Eo, Sugyeong; Seo, Jaehyung; Lim, Heuiseok Link
A New Tool for Efficiently Generating Quality Estimation Datasets Eo, Sugyeong; Park, Chanjun*; Seo, Jaehyung; Moon, Hyeonseok; Lim, Heuiseok Link
Automatic Knowledge Augmentation for Generative Commonsense Reasoning Seo, Jaehyung*; Park, Chanjun; Eo, Sugyeong; Moon, Hyeonseok; Lim, Heuiseok Link
Tabular Engineering with Automunge Teague, Nicholas* Link
A Probabilistic Framework for Knowledge Graph Data Augmentation Chauhan, Jatin*; Gupta, Priyanshu; Minervini, Pasquale Link
FedHist: A Federated-First Dataset for Learning in Healthcare Khan, Usmann*
A First Look Towards One-Shot Object Detection with SPOT for Data-Efficient Learning Chakraborty, Ria*; Popli, Madhur; Lamba, Rachit; Verma, Rishi Link
YMIR: A Rapid Data-centric Development Platform for Vision Applications Huang, Phoenix X.; Hu, Wenze*; Brendel, William; Chandraker, Manmohan; Li, Li-Jia; Wang, Xiaoyu Link
Towards better data discovery and collection with flow-based programming Paleyes, Andrei*; Cabrera, Christian; Lawrence, Neil D Link
CircleNLU: A Tool for building Data-Driven Natural Language Understanding System Hoang, Vu* Link
Using Synthetic Images To Uncover Population Biases In Facial Landmarks Detection Shadmi, Ran*; Laserson, Jonathan; Elbaz, Gil
Challenges of Working with Materials R&D Data Kubie, Lenore*; Kroenlein, Kenneth
PyHard: a novel tool for generating hardness embeddings to support data-centric analysis Paiva, Pedro Yuri Arbs*; Smith-Miles, Kate; Valeriano, Maria; Lorena, Ana Link
AirSAS: Controlled Dataset Generation for Physics-Informed Machine Learning Cowen, Benjamin*; Park, J. Daniel; Blanford, Thomas E.; Goehle, Geoff; Brown, Daniel C. Link
Open-Sourcing Generative Models for Data-driven Robot Simulations Bamani, Eran*; Sintov, Avishai; Azulay, Osher; Gurevich, Anton
Few-Shot Image Classification Challenge On-Board OPS-SAT Derksen, Dawa*; Meoni, Gabriele; Lecuyer, Gurvan; Mergy, Anne; Maertens, Marcus; Izzo, Dario Link
Dialectal Voice : An Open-Source Voice Dataset and Automatic Speech Recognition model for Moroccan Arabic dialectal Allak, Anass*; Naira, Abdou Mohamed; Imade, Benelallam; Kamel, Gaanoun Link
DAG Card is the new Model Card Tagliabue, Jacopo*; Tuulos, Ville; Greco, Ciro; Dave, Valay Link
SCIMAT: Science and Mathematics Dataset Kollepara, Neeraj; Chatakonda, Snehith K; Kumar, Pawan* Link
Towards Systematic Evaluation in Machine Learning through Automated Stress Test Creation Madras, David*; Zemel, Richard
Annotation Quality Framework - Accuracy, Credibility, and Consistency Lavitas, Liliya*; Lee, Allen; Redfield, Olivia; Fletcher, Daniel; Eck, Matthias; Janardhanan, Sunil Link
Ontolabeling: Re-Thinking Data Labeling For Computer Vision Croce, Nicola*; Nieto, Marcos Link
Natural Adversarial Objects Lau, Felix*; Harrison, Sasha; Subramani, Nishant; Kim, Aerin; Branson, Elliot R; Liu, Rosanne
No News is Good News: A Critique of the One Billion Word Benchmark Ngo, Helen*; Frosst, Nicholas; Madeira Araújo, João G; Hui, Jeff
A Data-Centric Approach for Training Deep Neural Networks with Less Data Motamedi, Mohammad*; Sakharnykh, Nikolay; Kaldewey, Tim Link
Finding Label Errors in Autonomous Vehicle Data With Learned Observation Assertions Kang, Daniel*; Arechiga, Nikos; Pillai, Sudeep; Bailis, Peter D; Zaharia, Matei Link
Single-Click 3D Object Annotation on LiDAR Point Clouds Nguyen, Trung Duc*; Hua, Binh-Son; Nguyen, Thanh; Phung, Dinh Link
Decreasing Annotation Burden of Pairwise Comparisons with Human-in-the-Loop Sorting: Application in Medical Image Artifact Rating Jang, Ikbeom*; Danley, Garrison; Chang, Ken; Kalpathy-Cramer, Jayashree Link
Effect of Radiology Report Labeler Quality on Deep Learning Models for Chest X-Ray Interpretation Jain, Saahil*; Smit, Akshay; Ng, Andrew; Rajpurkar, Pranav Link
A Data-Centric Image Classification Benchmark Schmarje, Lars*; Liao, Yuan-Hong; Koch, Reinhard Link
Diagnosing severity levels of Autism Spectrum Disorder with Machine Learning Cinque, Marcello; Moscato, Vincenzo; Postiglione, Marco*; Riccio, Maria Pia Link
Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data Kjærsgaard, Rune D.*; Grønberg, Manja; Clemmensen, Line Link
Automatic Data Quality Evaluation for Text Classification Li, Jiazheng* Link
Building Legal Datasets Soh, Jerrold* Link
Comparing Data Augmentation and Annotation Standardization to Improve End-to-end Spoken Language Understanding Models Nicolich-Henkin, Leah*; Nakatani, Taichi; Trozenski, Zach; Whiteman, Joel; Susanj, Nathan Link
DiagnosisQA: A semi-automated pipeline for developing clinician validated diagnosis specific QA datasets. Mishra, Shreya; Awasthi, Raghav; Papay, Frankie; Maheshwari, Kamal; Cywinski, Jacek; Khanna, Ashish; Mathur, Piyush* Link
Influence of human-expert labels on a neonatal seizure detector based on a convolutional neural network Borovac, Ana*; Runarsson, Thomas P; Guðmundsson, Steinn; Thorvardsson, Gardar Link
Feminist Curation of Text for Data-centric AI Bartl, Marion*; Leavy, Susan Link
Challenges and Solutions to build a Data Pipeline to Identify Anomalies in Enterprise System Performance Huang, Xiaobo*; Banerjee, Amitabha; Chen, Chien-Chia; Huang, Chengzhi; Chuang, Tzu Yi; Srivastava, Abhishek; Cheveresan, Razvan
Human-inspired Data Centric Computer Vision Tsutsui, Satoshi*; Crandall, David; Yu, Chen Link
Utilizing Driving Context to Increase the Annotation Efficiency of Imbalanced Gaze Image Data Rehm, Johannes*; Gundersen, Odd Erik; Bach, Kerstin; Reshodko, Irina Link
Unleashing the Power of Industrial Big Data through Scalable Manual Labeling Paes Leao, Bruno*; Fradkin, Dmitriy; Lan, Tu; Wang, Jianhui Link
nferX: a case study on data-centric NLP in biomedicine Chang, David*; Mathew, Vineet; Kogler, Lorenzo; Jin, Roger; Rao, Krishna; Raghunathan, Bharathwaj; Ip, Wui; Doctor, Zainab; Pawlowski, Colin; Rajesekharan, Ajit Link
On Data-centric Myths Marcu, Antonia*; Prugel-Bennett, Adam Link
All in one Data Cleansing Tool Sairaman, Sri Aravind*; Vailoppilly, Arun Prasad; Sakthivel, Ramkumar; Kumar, Resham Sundar; BDSV, Vignesh; G, Aravind Link
Contrasting the Profiles of Easy and Hard Observations in a Dataset Moreno, Camila C*; Paiva, Pedro; Nunes, Gustavo; Lorena, Ana Link
A concept for fitness-for-use evaluation in Machine Learning pipelines Jonietz, David* Link
Vietnamese Speech-based Question Answering over Car Manuals Vo, Tin Duy*; Luong, Manh; Minh Le, Duong; Tran, Hieu Minh; Do, Nhan; Nguyen, Duy; Nguyen, Thien; Bui, Hung; Nguyen, Dat Quoc; Phung, Dinh
Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation Bai, Haoping*; Cao, Meng; Huang, Ping; Shan, Jiulong Link
Towards a Taxonomy of Graph Learning Datasets Liu, Renming; Cantürk, Semih; Wenkel, Frederik; Sandfelder, Dylan; Kreuzer, Devin; Little, Anna; McGuire, Sarah; Perlmutter, Michael; O'Bray, Leslie; Rieck, Bastian; Hirn, Matthew; Wolf, Guy; Rampášek, Ladislav* Link
Addressing Content Selection Bias in Creating Datasets for Hate Speech Detection Rahman, Md Mustafizur; Balakrishnan, Dinesh; Murthy, Dhiraj; Kutlu, Mucahid; Lease, Matthew* Link
Lhotse: a speech data representation library for the modern deep learning ecosystem Żelasko, Piotr*; Povey, Daniel; Trmal, Jan; Khudanpur, Sanjeev Link
Bridging the gap between AI and the life sciences: towards a standardized multi-omics data type Herbsthofer, Laurin; Oberhuber, Monika; Prietl, Barbara; López García, Pablo* Link
Increasing Data Diversity with Iterative Sampling to Improve Performance Çavuşoğlu, Devrim*; Eryüksel, Oğulcan; Altınuç, Sinan O Link
Data preparation for training CNNs: Application to vibration-based condition monitoring Yaghoubi, Vahid*; Cheng, Liangliang; Van Paepegem, Wim; Kersemans, Mathias Link
Bridging the gap to real-world for network intrusion detection systems with data-centric approach de Carvalho Bertoli, Gustavo*; Alves Pereira Jr, Lourenço; Verri, Filipe; Santos, Aldri; Saotome, Osamu Link
Highly Efficient Representation and Active Learning Framework and Its Application to Imbalanced Medical Image Classification Hao, Heng*; Moon, Hankyu; Didari, Sima; Woo, Jae Oh; Bangert, Patrick Link
Evaluating Machine Learning Models for Internet Network Security with Data Slices Toman, Pamela*; Yadgaran, Elisha; Papadimitriou, Christina; Isaksen, Aaron; Kraning, Matt Link
AutoDQ: Automatic Data Quality for Financial Data Villarreal-Vasquez, Miguel*; Buford, John; Dhingra, Prashant; Yin, Fenglin
Data Cards: Purposeful and Transparent Documentation for Responsible AI Pushkarna, Mahima*; Zaldivar, Andrew Link
3D ImageNet: A data collection and labeling tool for Depth and RGB Images Singh, Gurjeet*; Patrick, Chiang; Zhou, Sifan; Qian, James Link
Combining Data-driven Supervision with Human-in-the-loop Feedback for Entity Resolution Yin, Wenpeng*; Heinecke, Shelby; Li, Vena; Keskar, Nitish Shirish; Jones, Michael; Shi, Shouzhong; Georgiev, Stanislav; Milich, Kurt; Esposito, Joseph; Xiong, Caiming Link
IMDB-WIKI-SbS: An Evaluation Dataset for Crowdsourced Pairwise Comparisons Pavlichenko, Nikita; Ustalov, Dmitry* Link
Exploiting Proximity Search and Easy Examples to Select Rare Events Kang, Daniel*; Derhacobian, Alex; Tsuji, Kaoru; Hebert, Trevor; Bailis, Peter D; Fukami, Tadashi; Hashimoto, Tatsunori; Sun, Yi; Zaharia, Matei Link
Fantastic Data and How to Query Them Tran, Trung-Kien*; Le-Tuan, Anh; Nguyen Duc, Manh; Yuan, Jicheng; Le Phuoc, Danh Link
Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation Denton, Emily*; Diaz, Mark; Kivlichan, Ian D; Prabhakaran, Vinodkumar; Rosen, Rachel Link
Two Approaches to Building Dialogue Systems for People on the Spectrum Firsanova, Victoria* Link
What can Data-Centric AI Learn from Data and ML Engineering? Polyzotis, Alkis*; Zaharia, Matei
Ground-Truth, Whose Truth? - Examining the Challenges with Annotating Toxic Text Datasets Arhin, Kofi*; Baldini, Ioana; Wei, Dennis; Natesan Ramamurthy, Karthikeyan; Singh, Moninder Link
Towards a Shared Rubric for Dataset Annotation Greene, Andrew M* Link
LSH methods for data deduplication in a Wikipedia artificial dataset Ciro, Juan Manuel; Galvez, Daniel; Schlippe, Tim; Kanter, David Link
Annotation Inconsistency and Entity Bias in MultiWOZ Qian, Kun*; Beirami, Ahmad; Lin, Zhouhan; De, Ankita; Geramifard, Alborz; Yu, Zhou; Sankar, Chinnadhurai
Seg-Diff: Checkpoints Are All You Need Brewster, Grant*; Yuan, Bodi; Hooker, Sara; Cao, Chen; Yuan, Zhiqiang
AutoDC: Automated data-centric processing Liu, Zac Yung-Chun*; Roychowdhury, Shoumik; Tarlow, Scott; Nair, Akash; Badhe, Shweta; Shah, Tejas Link
Engineering AI Tools for Systematic and Scalable Quality Assessment in Magnetic Resonance Imaging Zou, Yukai; Jang, Ikbeom* Link
FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance Liu, Xiao-Yang*; Rui, Jingyang; Gao, Jiechao; Yang, Liuqing; Yang, Hongyang; Wang, Zhaoran; Wang, Christina Dan; Guo, Jian Link
Data Augmentation for Intent Classification Chen, Derek*; Yin, Claire Link
InfiniteForm: A synthetic, minimal bias dataset for fitness applications Weitz, Andrew*; Bent, Brinnae; Colucci, Lina; Primas, Sidney Link
Who Decides if AI is Fair? The Labels Problem in Algorithmic Auditing Mishra, Abhilash*; Gorana, Yash Link
Data-Centric AI Requires Rethinking Data Notion Hajij, Mustafa*; Zamzmi, Ghada; Natesan Ramamurthy, Karthikeyan; Guzman Saenz, Aldo Link
Exploiting Domain Knowledge for Efficient Data-centric Session-based Recommendation model Mishra, Mayank*; Singhal, Rekha Link
Topological Deep Learning Hajij, Mustafa*; Natesan Ramamurthy, Karthikeyan; Guzman Saenz, Aldo; Istvan, Kyle Link
Fix your Model by Fixing your Datasets Sanyal, Atindriyo*; Vyas, Nidhi Kaushik; Chatterji, Vikram; Epstein, Ben; Demir, Nikita; Corletti, Anthony
Data Expressiveness and Its Use in Data-centric AI Sharma, Parichit*; Kurban, Hasan; Dalkilic, Mehmet Link
Debiasing Pre-Trained Sentence Encoders With WordDropouts on Fine-Tuning Data Panda, Swetasudha*; Wick, Michael; Kobren, Ariel
Towards a Framework for Data Excellence in Data-Centric AI: Lessons from the Semantic Web Seneviratne, Oshani*; Hassanzadeh, Oktie; Gruen, Daniel; McCusker, Jamie P; McGuinness, Deborah
Sim2Real Docs: Domain Randomization for Documents in Natural Scenes using Ray-traced Rendering Huang, Austin V.* Link
Homogenization of Existing Inertial-Based Datasets to Support Human Activity Recognition Amrani, Hamza; Micucci, Daniela; Mobilio, Marco*; Napoletano, Paolo Link
Can machines learn to see without visual databases? Betti, Alessandro; Gori, Marco; Melacci, Stefano*; Pelillo, Marcello; Roli, Fabio Link
Augment & Valuate : A Data Enhancement Pipeline for data-centric AI Lee, Youngjune*; Kwon, Oh Joon; Lee, Haeju; Kim, Joonyoung; Lee, Kangwook; Kim, Kee-Eung Link
Simultaneous Improvement of ML Model Fairness and Performance by Identifying Bias in Data Chaudhari, Bhushan; Agarwal, Aakash*; Bhowmik, Tanmoy Link
Data Agnostic Image Annotation Mohamed Nishar, Abbaas Alif*; T V, Sethuraman; Rahman, Md Rashed; Gruteser, Marco; Mandayam, Narayan; Dana, Kristin; Jain, Shubham; Ashok, Ashwin
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs Schuhmann, Christoph; Vencu, Richard; Beaumont, Romain; Kaczmarczyk, Robert; Mullis, Clayton; Jitsev, Jenia; Komatsuzaki, Aran* Link
Small Data in NLU: Proposals towards a Data-Centric Approach Zarcone, Alessandra*; Lehmann, Jens; Habets, Emanuel Link
On Biased Systems and Data Vieira, Daniel*
Data vast and low in variance: Augment machine learning pipelines with dataset profiles to improve data quality without sacrificing scale Herman, Bernease R*; Leybzon, Danny; Broomall, Jamie
CogALex 2.0: Impact of Data Quality on Lexical-Semantic Relation Prediction Lang, Christian; Wachowiak, Lennart; Heinisch, Barbara; Gromann, Dagmar* Link
A Data-Centric Behavioral Machine Learning Platform to Reduce Health Inequalities Tang, Dexian; Frances, Guillem; Perianez, Africa* Link