Data Augmentation

Introduction by

Anima Anandkumar

Anima Anandkumar holds dual positions in academia and industry. She is a Bren Professor in Caltech's CMS department and a director of machine learning research at NVIDIA. She is passionate about designing principled AI algorithms and applying them to interdisciplinary domains.

Data augmentation to address imperfect real-world data

Many of the challenges of applying AI in the real world are due to imperfections in the data. Real-world data has several kinds of limitations:

Domain gaps: The data you train your model with is quite different from the data you have to predict on in the real world.
Data bias: When the data you collect has imbalances due to societal bias, how can you design methods to overcome them?
Data noise: Noise can come from a variety of sources, including cases where the data or its labels are ambiguous, cluttered, or otherwise corrupted.

In light of the uncertainty these imperfections bring to your AI system, how can your model still learn robust representations? Because human labeling of a high-volume dataset is expensive, the challenge becomes whether you can overcome these imperfections without any labels at all.

The trinity of uncertainty in real-world data

What is Data Augmentation?

There are a variety of techniques you can employ to overcome these uncertainties. Data augmentation is perhaps one of the simplest. It involves adding training data through:

Self-Supervision: When you have limited labeled data, you can try combining it with unlabeled data. You can create augmentations of your data, and if you know that the label is invariant to all those transformations, you have created supervision based on those invariances. For instance, the classification of an image is invariant to transformations such as rotation or cropping. This type of learning has even outperformed supervised learning on many classification tasks.
Synthetic Data: Another approach is to combine your limited labeled data with synthetic data. While synthetic data is still in its infancy, there have been ongoing advances in generative models, and it will become hugely important in the future for testing systems such as autonomous driving or robot learning.
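
As a minimal sketch of the invariance idea above, the following pure-Python example generates several transformed views of one image and copies the label across all of them. The helper names (hflip, rot90, random_crop, augment) and the toy 4x4 "image" are illustrative assumptions, not part of any particular library, and the sketch assumes the label really is unchanged by flips, rotations, and crops.

```python
import random

def hflip(img):
    """Mirror a 2D image (a list of rows) left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def random_crop(img, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randrange(len(img) - size + 1)
    left = rng.randrange(len(img[0]) - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def augment(img, label, n_views, crop_size, seed=0):
    """Return n_views (view, label) pairs.

    The label is copied unchanged because we ASSUME it is invariant
    to the transformations applied here.
    """
    rng = random.Random(seed)
    views = []
    for _ in range(n_views):
        view = random_crop(img, crop_size, rng)
        if rng.random() < 0.5:
            view = hflip(view)
        for _ in range(rng.randrange(4)):  # rotate by 0, 90, 180 or 270 degrees
            view = rot90(view)
        views.append((view, label))
    return views

# Hypothetical 4x4 grayscale "image" of a cat.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
views = augment(image, "cat", n_views=5, crop_size=3)
```

Each element of views is a new 3x3 training example that inherits the label "cat" without any additional human labeling. In practice an image library would supply the transformations; the point here is only that the label travels with every view.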

Data-centric principles in data augmentation

The core design of data augmentation is in the balance of positive and negative examples. Consider image recognition. Starting from an image of a cat, you have a positive example if you make one set of changes, such as rotating the image or changing the contrast, and the prediction stays the same. A negative example would be if you make other changes, such as picking a different image, and the prediction changes. Ultimately, you want to contrast these positive and negative examples, especially ones that come close to each other, in order to learn where the classifier boundary should be. These contrastive learning principles serve as a foundation for popular methods such as MoCo and SimCLR.
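
To make the contrast between positive and negative examples concrete, here is a simplified sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss used by SimCLR. It is a pure-Python illustration, not the authors' implementation; the function names, the pairing convention (embeddings 2k and 2k+1 are two views of the same image), and the temperature value are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(embeddings, temperature=0.5):
    """Mean NT-Xent loss over a batch with an even number of embeddings.

    ASSUMPTION: embeddings[2k] and embeddings[2k + 1] are two augmented
    views of the same image (a positive pair). Every other embedding in
    the batch acts as a negative for that anchor.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        partner = i ^ 1  # pairs indices 0<->1, 2<->3, ...
        logits = {k: cosine(embeddings[i], embeddings[k]) / temperature
                  for k in range(n) if k != i}
        # Numerically stable log-sum-exp over the positive and all negatives.
        peak = max(logits.values())
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += log_sum - logits[partner]  # -log softmax of the positive
    return total / n

# Positive pairs point in similar directions: the loss should be low.
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
# Same vectors, but paired with the "wrong" partner: the loss should be higher.
mismatched = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
loss_aligned = nt_xent(aligned)
loss_mismatched = nt_xent(mismatched)
```

Minimizing this loss pulls the two views of each image together and pushes them away from the other images in the batch, which is how the boundary between nearby positives and negatives is learned.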

Creating positive and negative cases to learn classifier boundaries with SimCLR

Even better, you can combine self-supervision with weak supervision, that is, labels for a related task that are usually much less demanding for humans to produce. We recently developed such a technique, called DiscoBox, in which we used only bounding box labels in the training data to train a model to do instance segmentation and correspondence learning. Bounding box labels are usually a much cheaper and more readily available form of labeling. As a real-world case study, we are working to scale these techniques up to reduce labeling costs in autonomous driving. The model was tested on random YouTube videos and on autonomous driving videos, which were in completely different domains from the videos we trained on, and it demonstrated high-quality instance segmentation results.

Instance segmentation on an “Uptown Funk” flash mob, a completely different domain from the one the model was trained on. Only bounding boxes were used as labels during training.

The power of methods such as self-supervision is that you can overcome label scarcity in many domains without creating any more labels. There has been a great deal of progress in this area recently, but there is much more to be made. Synthetic data is one example: although still in its infancy, it will become hugely important for testing systems such as autonomous driving or robot learning. It is another form of data augmentation, in which you start by training your model in a domain that differs from the one where it will ultimately be deployed. By learning progressively from easy to difficult cases, using positive and negative examples in the synthetic domain, you can transition to the real world by adjusting the classification boundary learned from the synthetic domain.

However, adding synthetic data to your limited data is not without complications. Although synthetic data can be realistic enough that even humans cannot tell simulated and real data apart, there can still be huge shifts in the distribution of scenes, and we need techniques to generate data in a highly controllable way. AI systems are data hungry, and we need more innovation on this and many other fronts to help overcome imperfections in real-world data.
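
As a toy illustration of the synthetic-to-real transition described above, the sketch below trains a one-dimensional logistic classifier on plentiful "synthetic" points and then fine-tunes it on a handful of "real" points whose distribution is shifted. The data values, learning rate, step count, and function names are all hypothetical choices for this example; real systems involve far richer models and domain shifts.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, w=0.0, b=0.0, lr=0.1, steps=5000):
    """Gradient descent on the mean logistic loss of a 1-D classifier.

    Passing in w and b from an earlier run fine-tunes that model
    instead of starting from scratch.
    """
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def boundary(w, b):
    """The input value at which the predicted probability is 0.5."""
    return -b / w

# Hypothetical synthetic domain: the classes separate around x = 0.
syn_x, syn_y = [-3, -2, -1, 1, 2, 3], [0, 0, 0, 1, 1, 1]
# Hypothetical real domain: the same classes, shifted to separate around x = 2.
real_x, real_y = [-1, 0, 1, 3, 4, 5], [0, 0, 0, 1, 1, 1]

w_syn, b_syn = train(syn_x, syn_y)                        # learn in simulation
w_real, b_real = train(real_x, real_y, w=w_syn, b=b_syn)  # adjust on real data

print("synthetic boundary:", boundary(w_syn, b_syn))
print("adjusted boundary: ", boundary(w_real, b_real))
```

The boundary learned in the synthetic domain sits near 0, and fine-tuning moves it into the gap between the two real classes. The model keeps what it learned in simulation and only shifts where it draws the line.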