Statistics and machine learning: what’s the difference?
This article was originally published on Algorithmia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up to date and may refer to products and offerings that no longer exist. Find out more about DataRobot MLOps here.
Some say that machine learning is just glorified statistics, rebranded for the age of big data and faster computing. Others say that they’re completely unrelated, so much so that you don’t even need to understand statistics to perform machine learning tasks.
So who’s right? As you can imagine, the reality falls somewhere between these two extremes.
The goal of this blog is to:
- Define statistics and machine learning.
- Explain how each discipline relates to the other.
What is statistics?
The Department of Statistics at the University of California, Irvine defines the discipline as “the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data.”
Statistics has been studied and used for more than a thousand years, with the first writings on the subject dating back to the 8th century. There are two main statistical methods: descriptive statistics and inferential statistics.
Descriptive statistics is the process of summarizing information about a sample using metrics like mean, median, mode, standard deviation, etc. This is sometimes used for exploratory data analysis leading to a larger project, or descriptive statistics may be the extent of the investigation.
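As a minimal sketch of descriptive statistics, the snippet below summarizes a small made-up sample with Python's standard `statistics` module. The `describe` helper and the data values are illustrative assumptions, not taken from any real dataset.

```python
import statistics

def describe(sample):
    """Summarize a sample with common descriptive statistics."""
    return {
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation (divides by n - 1)
    }

# Hypothetical sample, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
print(summary)
```

Nothing here generalizes beyond the eight values in `sample`; the numbers only describe the data in hand.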
Inferential statistics is the process of inferring properties of a population from the properties of a sample drawn from that population.
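One common form of inference is estimating a population mean from a sample, along with a confidence interval. The sketch below assumes a normal approximation, which is reasonable for larger samples; for small samples a t-distribution would usually be more appropriate. The function name `mean_confidence_interval` and the data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumes the sample mean is roughly normally distributed.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # z-score that leaves (1 - confidence) / 2 in each tail
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical sample of measurements drawn from a larger population
sample = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2, 5.0, 4.7, 5.4, 5.1]
low, high = mean_confidence_interval(sample)
print(f"95% interval for the population mean: ({low:.2f}, {high:.2f})")
```

The interval is a statement about the unseen population, not just the ten values sampled, which is what separates inference from description.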
What is machine learning?
Machine learning is a subset of artificial intelligence. It is the process of computers using large amounts of data to find patterns and make decisions without human intervention. Machine learning is used for tasks like text mining and sentiment analysis.
There are three methods of machine learning: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning there is a target outcome variable. In unsupervised learning, there is no target and the machine is simply finding patterns and relationships within the data. The process of reinforcement learning involves an algorithm using trial and error to reach an objective.
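To make the difference between the first two methods concrete, here is a rough sketch in plain Python. The supervised function is a one-nearest-neighbor classifier that needs labels; the unsupervised function is a two-cluster k-means on one-dimensional data that uses no labels at all. Both are simplified stand-ins chosen for illustration, and reinforcement learning is left out for brevity.

```python
def nearest_neighbor_predict(train_points, train_labels, query):
    """Supervised: use labeled examples to predict the label of a new point."""
    distances = [abs(p - query) for p in train_points]
    return train_labels[distances.index(min(distances))]

def two_means(points, iterations=10):
    """Unsupervised: group unlabeled 1D points into two clusters (k-means, k=2)."""
    centers = [min(points), max(points)]  # simple starting guess
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Supervised: a target label exists for every training point
label = nearest_neighbor_predict([1.0, 1.2, 5.0, 5.3], ["low", "low", "high", "high"], 4.7)

# Unsupervised: same kind of data, but no labels are provided
centers, groups = two_means([1.0, 1.2, 0.8, 5.0, 5.3, 4.9])
print(label, centers)
```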
While not centuries old, machine learning is not new and has been researched extensively since the 1950s. It has come into prominence in the past two decades due to the exponential growth in data collection and increased computing power.
How are statistics and machine learning related?
Many machine learning techniques are drawn from statistics (e.g., linear regression and logistic regression), in addition to other disciplines like calculus, linear algebra, and computer science. But it is this association with underlying statistical techniques that causes many people to conflate the disciplines.
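Linear regression is a good example of this shared heritage. The sketch below fits a line with the ordinary least squares formulas from introductory statistics (slope equals the covariance of x and y divided by the variance of x), then uses the fitted line to predict a new value the way a machine learning model would. The function name and data are hypothetical.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit for y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.1, 7.0, 9.2, 10.8]
slope, intercept = fit_line(xs, ys)

# Statistical view: interpret the slope. ML view: predict an unseen x.
prediction = intercept + slope * 6
print(slope, intercept, prediction)
```

The arithmetic is identical in both disciplines; what differs is whether you care mainly about the fitted coefficients or about the prediction.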
Interestingly, newer machine learning engineers and data scientists who use machine learning packages like scikit-learn in Python may be unaware of the underlying relationship between machine learning and statistics.
This abstraction of machine learning from statistics through libraries is often why some people argue that knowledge of statistics is not necessary to do machine learning. While this may be true for more basic tasks, experienced data scientists and machine learning engineers draw on their knowledge of probability and statistics to develop models.
What are the key differences between statistics and machine learning?
Statistics and machine learning often get lumped together because they use similar means to reach a goal. However, the goals that they are trying to achieve are very different. The purpose of statistics is to make an inference about a population based on a sample. Machine learning is used to make repeatable predictions by finding patterns within data.
Machine learning requires large amounts of data in order to make accurate predictions. Models are built using training data, fine-tuned using a validation dataset, and evaluated with a test dataset. All of these steps help the machine “learn.”
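The three-way split described above can be sketched with the standard library alone. The 70/15/15 proportions, the `split_dataset` name, and the fixed seed are assumptions chosen for illustration; in practice, libraries such as scikit-learn provide their own splitting utilities.

```python
import random

def split_dataset(rows, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test

# Hypothetical dataset of 100 row identifiers
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Keeping the test set untouched until the end is what makes the final accuracy estimate trustworthy.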
Statistics does not involve multiple subsets of data because you are not trying to make a prediction. The point of modeling in this situation is to characterize the relationship between the predictor variables and the outcome variable. In addition, statistics relies on significance tests to determine the direction and magnitude of a relationship, discounting noise and confounding variables.
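As one hedged illustration of a significance test, the sketch below runs a permutation test on the difference in means between two groups. This is only one of many possible tests and is chosen here because it needs nothing beyond the standard library; the function name, group values, and number of permutations are all assumptions.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group membership at random
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations

# Hypothetical measurements for two groups
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5]
group_b = [4.1, 3.9, 4.4, 4.0, 4.6, 4.2]
difference, p_value = permutation_test(group_a, group_b)
print(difference, p_value)
```

The sign of `difference` gives the direction of the relationship, and a small `p_value` suggests the gap is unlikely to be noise alone.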
Because of the large number of variables in machine learning datasets, the models developed from them can be simultaneously extremely accurate and almost impossible to understand. Statistical models, on the other hand, are typically easier to understand because they are based on fewer variables, and the accuracy of relationships is supported by tests of statistical significance.
Is one more valuable than the other?
No. You really can’t place a value judgment on two disciplines that serve different purposes. The choice to develop a machine learning model or a statistical model will depend on your goal or industry.
One important factor to consider is interpretability, as mentioned earlier. A data scientist may not need to justify the decisions of a complicated model that gives recommendations about how many widgets to manufacture. However, a model meant to make more sensitive decisions (e.g., granting or denying a loan) may not be acceptable if a data scientist can’t prove the strength of relationships between the predictors and the outcome variable.
Where do data science and AI fit into the picture?
Artificial intelligence is a branch of computer science focused on building machines capable of performing tasks typically done by humans. Machine learning is a subset of this field.
Data science is a multidisciplinary field that includes aspects of computer science, math, statistics, and machine learning to derive insights from large data sets. Data scientists work to solve problems or uncover opportunities using the vast amounts of data that companies and governments generate. They often draw upon their knowledge of statistics to aid them in developing models most appropriate for the task.
Where can I learn more?
Not surprisingly, a web search for “statistics vs machine learning” leads to more debates than unbiased analysis. However, we recommend that you check out the article “Statistics versus machine learning” in the journal Nature Methods. The authors provide a clear contrast between the two disciplines by using both methods to identify the relationship between certain genes and observable physical characteristics. It’s a useful read if you are looking for a less theoretical comparison between machine learning and statistics.