• Blog
  • Active learning machine learning: What it is and how it works

Active learning machine learning: What it is and how it works

September 27, 2020
· 4 min read

This article was originally published at Algorithimia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up-to-date or refer to products and offerings no longer in existence. Find out more about DataRobot MLOps here.

Active learning is the subset of machine learning in which a learning algorithm can query a user interactively to label data with the desired outputs.

A growing problem in machine learning is the large amount of unlabeled data, since data is continuously getting cheaper to collect and store. This leaves data scientists with more data than they are capable of analyzing. That’s where active learning comes in.

What is active learning?

Active learning is the subset of machine learning in which a learning algorithm can query a user interactively to label data with the desired outputs. In active learning, the algorithm proactively selects the subset of examples to be labeled next from the pool of unlabeled data. The fundamental belief behind the active learner algorithm concept is that an ML algorithm could potentially reach a higher level of accuracy while using a smaller number of training labels if it were allowed to choose the data it wants to learn from.

Therefore, active learners are allowed to interactively pose queries during the training stage. These queries are usually in the form of unlabeled data instances and the request is to a human annotator to label the instance. This makes active learning part of the human-in-the-loop paradigm, where it is one of the most powerful examples of success.

How does active learning work?

Active learning works in a few different situations. Basically, the decision of whether or not to query each specific label depends on whether the gain from querying the label is greater than the cost of obtaining that information. This decision making, in practice, can take a few different forms based on the data scientist’s budget limit and other factors.

The three categories of active learning are:

Three categories of active learning

Stream-based selective sampling

In this scenario, the algorithm determines if it would be beneficial enough to query for the label of a specific unlabeled entry in the dataset. While the model is being trained, it is presented with a data instance and immediately decides if it wants to query the label. This approach has a natural disadvantage that comes from the lack of guarantee that the data scientist will stay within budget.  

Pool-based sampling 

This is the most well known scenario for active learning. In this sampling method, the algorithm attempts to evaluate the entire dataset before it selects the best query or set of queries. The active learner algorithm is often initially trained on a fully labeled part of the data which is then used to determine which instances would be most beneficial to insert into the training set for the next active learning loop. The downside of this method is the amount of memory it can require.

Membership query synthesis

This scenario is not applicable to all cases, because it involves the generation of synthetic data. The active learner in this method is allowed to create its own examples for labeling. This method is compatible with problems where it is easy to generate a data instance.

How is active learning different from reinforcement learning?

Reinforcement learning and active learning can both reduce the number of labels required for models, but they are different concepts.

Reinforcement learning

Reinforcement learning

Reinforcement learning is a goal-oriented approach, inspired by behavioral psychology, that allows you to take inputs from the environment. This implies that the agent will get better and learn while it’s in use. This is similar to how us humans learn from our mistakes. We are basically functioning with a reinforcement learning approach. There is no training phase, because the agent learns through trial-and-error instead, using a predetermined reward system that provides inputs about how optimal a specific action was. This type of learning does not need to be fed data, because it generates its own as it goes.

Active learning

Active learning is closer to traditional supervised learning. It is a type of semi-supervised learning, meaning models are trained using both labeled and unlabeled data. The idea behind semi-supervised learning is that labeling just a small sample of data might result in the same accuracy or better than fully labeled training data. The only challenge is determining what that sample is. Active learning machine learning is all about labeling data dynamically and incrementally during the training phase so that the algorithm can identify what label would be the most beneficial for it to learn from.

See DataRobot MLOps in Action
Request a demo
About the author

The Next Generation of AI

DataRobot AI Platform is the next generation of AI. The unified platform is built for all data types, all users, and all environments to deliver critical business insights for every organization. DataRobot is trusted by global customers across industries and verticals, including a third of the Fortune 50.

Meet DataRobot
  • Listen to the blog
  • Share this post
    Subscribe to DataRobot Blog
    Thank you

    We will contact you shortly

    Thank You!

    We’re almost there! These are the next steps:

    • Look out for an email from DataRobot with a subject line: Your Subscription Confirmation.
    • Click the confirmation link to approve your consent.
    • Done! You have now opted to receive communications about DataRobot’s products and services.

    Didn’t receive the email? Please make sure to check your spam or junk folders.


    Related Posts

    Newsletter Subscription
    Subscribe to our Blog