Introduction to Dataset Augmentation and Expansion
This article was originally published at Algorithimia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up-to-date or refer to products and offerings no longer in existence. Find out more about DataRobot MLOps here.
If your neural nets are getting larger and larger but your training sets aren’t, you’re going to hit an accuracy wall. If you want to train better models with less data, I’ve got good news for you.
Dataset augmentation – the process of applying simple and complex transformations like flipping or style transfer to your data – can help overcome the increasingly large requirements of Deep Learning models. This post will walk through why dataset augmentation is important, how it works, and how Deep Learning fits in to the equation.
It’s hard to build the right dataset from scratch
“What’s wrong with my dataset?!?”
Don’t worry, we didn’t mean to insult you. It’s not your fault: it’s Deep Learning’s fault. Algorithms are getting infinitely more complex, and neural nets are getting deeper and deeper. More layers in neural nets means more parameters that your model is learning from your data. In some of the recent more state of the art models we’ve seen, there can be more than 100 million parameters learned during training:
When your model is trying to understand a relationship this deeply, it needs a lot of examples to learn from. That’s why popular datasets for models like these might have something like 10,000 images for training. That size of data is not at all easy to come by.
Even if you’re using simpler or smaller types of models, it’s challenging to organize a dataset large enough to train effectively. Especially as Machine Learning gets applied to newer and newer verticals, it’s becoming harder and harder to find reliable training data. If you wanted to create a classifier to distinguish iPhones from Google Pixels, how would you get thousands of different photos?
Finally, even with the right size training set, things can still go awry. Remember that algorithms don’t think like humans: while you classify images based on a natural understanding of what’s in the image, algorithms are learning that on the fly. If you’re creating a cat / dog classifier and most of your training images for dogs have a snowy background, your algorithm might end up learning the wrong rules. Having images from varied perspectives and with different contexts is crucial.
Dataset augmentation can multiply your data’s effectiveness
For all of the reasons outlined above, it’s important to be able to augment your dataset: to make it more effective without acquiring loads of more training data. Dataset augmentation applies transformations to your training examples: they can be as simple as flipping an image, or as complicated as applying neural style transfer. The idea is that by changing the makeup of your data, you can improve your performance and increase your training set size.
For an idea of just how much this process can help, check out this benchmark that NanoNets ran in their explainer post. Their results showed an almost 20 percentage point increase in test accuracy with dataset augmentation applied.
It’s safer for us to assume the cause of this accuracy boost was a bit more complicated than just dataset augmentation, but the message is clear: it can really help.
Before we dive into what you might practically do to augment your data, it’s worth noting that there are two broad approaches to when to augment it. In offline dataset augmentation, transforms are applied en masse to your dataset before training. You might, for example, flip each of your images horizontally and vertically, resulting in a training set with twice as many examples. In online dataset augmentation, transforms are applied in real time as batches are passed into training. This won’t help with size, but is much quicker for larger training sets.
How basic dataset augmentation works
Basic augmentation is super simple, at least when it comes to images: just try to imagine all the things you could do in photoshop with a picture! A few of the simple and popular ones include:
- Flipping (both vertically and horizontally)
- Zooming and scaling
- Translating (moving along the x or y axis)
- Adding Gaussian noise (distortion of high frequency features)
Most of these transformations have fairly simple implementations in packages like Tensorflow. And though they might seem simple, combining them in creative ways across your dataset can yield impressive improvements in model accuracy.
One issue that often comes up is input size requirements, which are one of the most frustrating parts of neural nets for practitioners. If you shift or rotate an image, you’re going to end up with something that’s a different size, and that needs to be fixed before training. Different approaches advocate filling in empty space with constant values, zooming in until you’ve reached the right size, or reflecting pixel values into your empty space. As with any preprocessing, testing and validating is the best way to find a definitive answer.
Deep Learning for dataset augmentation
Moving from the simple to the complex, there are some more interesting things than just flips and rotations that you can do to your dataset to make it more robust.
Neural Style Transfer
Neural networks have proven effective in transferring stylistic elements from one image to another, like “Starry Stanford” here:
You can utilize pre-trained nets that transfer exterior styles onto your training images as part of a dataset augmentation pipeline.
Generative Adversarial Networks
A new type of algorithm called GANs have been stealing headlines lately for their ability to generate content (of all types) that’s actually pretty good. Using these types of algorithms, researchers were able to apply image-to-image translation and get some interesting results. Here are a few of the images they worked on:
Although it’s not entirely computationally feasible right now, it’s clear that this kind of technology can open doors for much more sophisticated dataset augmentation.
Google recently released a paper outlining a framework for AutoAugment, or using Machine Learning to augment your dataset. This is Machine Learning to improve Machine Learning: Machine Learning-ception. The idea is that the right augmentations depend on your dataset, and can be learned through a model: even though the actual augmentations themselves are pretty simple.