Bayesian machine learning
This article was originally published at Algorithmia's website. The company was acquired by DataRobot in 2021. This article may not be entirely up-to-date or refer to products and offerings no longer in existence.
Think about a standard machine learning problem. You have a set of training data, inputs and outputs, and you want to determine some mapping between them. So, you splice together a model, and soon you have a deterministic way of generating predictions for a target variable y given an unseen input x.
There’s just one problem – you don’t have any way to explain what’s going on within your model! All you know is that it’s been trained to minimize some loss function on your training data, but that’s not much to go on. Ideally, you’d like to have an objective summary of your model’s parameters, complete with confidence intervals and other statistical nuggets, and you’d like to be able to reason about them using the language of probability.
That’s where Bayesian Machine Learning comes in.
What is Bayesian machine learning?
Bayesian ML is a paradigm for constructing statistical models based on Bayes’ Theorem
Generally speaking, the goal of Bayesian ML is to estimate the posterior distribution, p(θ|x), given the likelihood, p(x|θ), and the prior distribution, p(θ). The likelihood is something that can be estimated from the training data.
In fact, that's exactly what we're doing when training a regular machine learning model. We're performing Maximum Likelihood Estimation (MLE), an iterative process that updates the model's parameters to maximize the probability of seeing the training data x given the model parameters θ.
So how does the Bayesian paradigm differ? Here, things get turned on their head: we instead seek to maximize the posterior distribution, which takes the training data as fixed and determines the probability of any parameter setting θ given that data. We call this process Maximum a Posteriori (MAP) estimation. It's easier, however, to think about it in terms of the likelihood function. By Bayes' Theorem we can write the posterior as

p(θ|x) ∝ p(x|θ) p(θ)
Here we leave out the denominator, p(x), because the maximization is taken with respect to θ, on which p(x) does not depend, so we can ignore it in the maximization procedure. The key piece of the puzzle that leads Bayesian models to differ from their classical counterparts trained by MLE is the inclusion of the term p(θ). We call this the prior distribution over θ.
Its purpose is to encode our beliefs about the model's parameters before we've seen any data. That is, we can often make reasonable assumptions about the "suitability" of different parameter configurations based simply on what we know about the problem domain and the laws of statistics. For example, it's pretty common to use a Gaussian prior over the model's parameters, meaning we assume they're drawn from a normal distribution with some mean and variance. This distribution's classic bell-curve shape concentrates most of its mass close to the mean, while values towards its tails are rare.
By using such a prior, we’re effectively stating a belief that most of the model’s weights will fall in some narrow range about a mean value with the exception of a few outliers, and this is pretty reasonable given what we know about most real world phenomena.
It turns out that using these prior distributions and performing MAP is equivalent to performing MLE in the classical sense along with the addition of regularization. There’s a pretty easy mathematical proof of this fact that we won’t go into here, but the gist is that by constraining the acceptable model weights via the prior we’re effectively imposing a regularizer.
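To make this equivalence concrete, here is a minimal sketch (the data, noise variance, and prior variance are all made up for illustration) that finds the MAP weights of a linear model numerically under a Gaussian prior, then compares them to the closed-form ridge-regression solution with regularization strength λ = σ²/τ²:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                           # toy design matrix
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

sigma2, tau2 = 0.1 ** 2, 1.0                           # assumed noise / prior variances

def neg_log_posterior(w):
    # -log p(y|X,w) - log p(w), constants dropped:
    # a Gaussian likelihood term plus a zero-mean Gaussian prior term.
    return ((y - X @ w) ** 2).sum() / (2 * sigma2) + (w ** 2).sum() / (2 * tau2)

# MAP: maximize the posterior, i.e. minimize the negative log-posterior.
w_map = minimize(neg_log_posterior, np.zeros(3)).x

# Ridge regression with lambda = sigma2 / tau2 yields the same weights.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

Maximizing the log-posterior under a zero-mean Gaussian prior and minimizing squared loss with an L2 penalty produce identical weights; the regularization strength is simply the ratio of the noise variance to the prior variance.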
Methods of Bayesian ML
While MAP is the first step towards fully Bayesian machine learning, it still only computes what statisticians call a point estimate: a single value for a parameter, calculated from data. The downside of point estimates is that they don't tell you much about a parameter beyond its optimal setting. In reality, we often want other information, like how certain we are that a parameter's value falls within a given range.
To that end, the true power of Bayesian ML lies in computing the entire posterior distribution. This is a tricky business, though. Posteriors are rarely nicely packaged mathematical objects that can be manipulated at will: the normalizing constant p(x) is an integral over the whole continuous parameter space, and it is usually infeasible to compute analytically. Therefore, a number of fascinating Bayesian methods have been devised that can be used to sample (i.e., draw sample values) from the posterior distribution.
Probably the most famous of these is Markov Chain Monte Carlo (MCMC), an umbrella that contains a number of subsidiary methods such as Gibbs and slice sampling. The math behind MCMC is difficult but intriguing. In essence, these methods work by constructing a Markov chain whose stationary distribution is the posterior, then drawing samples by simulating that chain. A number of successor algorithms, such as Hamiltonian Monte Carlo, improve on the MCMC methodology by using gradient information to let the sampler navigate the parameter space more efficiently.
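As an illustration, here is a bare-bones Metropolis sampler (the simplest MCMC method; the target here is a hypothetical standard-normal log-posterior, chosen so the result is easy to check). Notice that it only ever evaluates the unnormalized posterior, which is exactly why the intractable p(x) never needs to be computed:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_posterior(theta):
    # Hypothetical unnormalized log-posterior: a standard normal target.
    return -0.5 * theta ** 2

def metropolis(n_samples, step=1.0):
    samples = np.empty(n_samples)
    theta = 0.0
    for i in range(n_samples):
        proposal = theta + rng.normal(scale=step)      # symmetric random-walk proposal
        # Accept with probability min(1, p(proposal)/p(theta));
        # the normalizing constant cancels in this ratio.
        if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples[i] = theta
    return samples

draws = metropolis(20_000)
# After a burn-in period, the chain settles into the target distribution,
# so the empirical mean and standard deviation approach 0 and 1.
```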
MCMC and its relatives are often used as a computational cog in a broader Bayesian model. Their downside is that they can be very computationally expensive, although this drawback has been mitigated considerably in recent years. That said, it's often preferable to use the simplest tool possible for any given job.
To that end, there exist many simpler methods that can often get the job done. For example, there are Bayesian equivalents of linear and logistic regression in which something called the Laplace Approximation is used. This algorithm provides an analytical approximation to the posterior distribution by taking a second-order Taylor expansion of the log-posterior centered at the MAP estimate.
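A sketch of the idea in one dimension, using a made-up non-Gaussian log-posterior: find the mode (the MAP estimate), then estimate the curvature there (by finite differences in this toy version) to obtain the variance of the approximating Gaussian:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_post(theta):
    # Hypothetical unnormalized, non-Gaussian log-posterior for a 1-D parameter.
    return -0.25 * (theta - 2.0) ** 4 - 0.5 * theta ** 2

# Step 1: find the MAP estimate (the mode of the posterior).
theta_map = minimize_scalar(lambda t: -log_post(t)).x

# Step 2: second-order Taylor expansion of the log-posterior at the mode:
#   log p(theta) ~ log p(theta_map) - 0.5 * h * (theta - theta_map)^2,
# where h is the negative second derivative at the mode,
# estimated here with a central finite difference.
eps = 1e-4
h = -(log_post(theta_map + eps) - 2 * log_post(theta_map)
      + log_post(theta_map - eps)) / eps ** 2

# The Laplace approximation is the Gaussian N(theta_map, 1/h).
approx_var = 1.0 / h
```

For this particular toy posterior the mode sits at θ = 1 with curvature 4, so the approximating Gaussian has variance 0.25; a real implementation would use the exact Hessian rather than finite differences.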
One popular Bayesian method capable of performing both classification and regression is the Gaussian process. A GP is a stochastic process with strict Gaussian conditions imposed upon its constituent random variables. GPs have a rather profound theoretical underpinning, and much effort has been devoted to their study. In effect, these processes provide the ability to perform regression in function space.
That is, instead of choosing a single line to best fit your data, you can determine a probability distribution over the space of all possible lines and then select the line that is most likely given the data as the actual predictor. This is Bayesian estimation in the truest sense in that the full posterior distribution is analytically computed. The ability to actually work out the method in this instance is due to the suitability of conjugate functions.
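A minimal sketch of GP regression with a squared-exponential kernel on made-up 1-D data; because everything is Gaussian and conjugate, the posterior mean and covariance at new inputs come out in closed form:

```python
import numpy as np

def rbf(a, b, length=1.0):
    # Squared-exponential (RBF) kernel between two sets of 1-D points.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# Hypothetical training data and query points.
X_train = np.array([-2.0, -1.0, 0.5, 2.0])
y_train = np.sin(X_train)
X_test = np.linspace(-3, 3, 50)

noise = 1e-2                                     # assumed observation noise variance
K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
K_s = rbf(X_train, X_test)

# Conjugacy makes the posterior over function values at the test
# points Gaussian, with closed-form mean and covariance.
alpha = np.linalg.solve(K, y_train)
post_mean = K_s.T @ alpha                        # posterior mean function
post_cov = rbf(X_test, X_test) - K_s.T @ np.linalg.solve(K, K_s)
```

The posterior mean interpolates the training data (up to the assumed noise), while the diagonal of the posterior covariance gives a per-point uncertainty that grows away from the observations.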
In the case of classification using GPs, the posterior is no longer conjugate to the likelihood, and the ability to do analytic computations breaks down. It then becomes necessary to resort to approximate methods like the Laplace Approximation in order to train the model to a desired level of accuracy.
The presence of Bayesian models in ML
While Bayesian models are not nearly as widespread in industry as their classical counterparts, they are experiencing a resurgence due to the recent development of computationally tractable sampling algorithms, greater access to CPU/GPU processing power, and their dissemination outside academia.
They are especially useful in low-data domains where deep learning methods often fail and in regimes in which the ability to reason about a model is essential. In particular, they see wide use in Bioinformatics and Healthcare, as in these fields taking a point estimate at face value can often have catastrophic effects and more insight into the underlying model is necessary.
For example, one would not want to naively trust the outputs of an MRI cancer prediction model without at least some knowledge of how that model operates. Similarly, within bioinformatics, variant callers such as Mutect2 and Strelka rely heavily on Bayesian methods under the hood.
These software programs take in DNA reads from a person’s genome and label “variant” alleles which differ from those in the reference. In this domain, accuracy and statistical soundness are paramount, so Bayesian methods make a lot of sense despite the complexity that they add in implementation.
Overall, Bayesian ML is a fast-growing subfield of machine learning and looks set to develop even more rapidly in the coming years as advancements in computer hardware and statistical methodologies continue to make their way into the established canon.