What is applied machine learning?
This article was originally published on Algorithmia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up to date and may refer to products and offerings that no longer exist.
Applied machine learning is the application of machine learning to a specific data-related problem. It can involve supervised models, in which an algorithm improves on the basis of labeled training data, or unsupervised models, in which inferences and analyses are drawn from unlabeled data. In general, applied machine learning is characterized by the use of statistical algorithms and techniques to make sense of, categorize, and manipulate data.
Machine learning can be applied in any case in which there are nondeterministic elements to a problem, and especially where the manipulation and analysis of a large amount of statistically generated data are required.
What machine learning isn’t
An example of a deterministic problem is the classic “fizzbuzz” question often posed in interviews for software engineering jobs. In it, a programmer is asked, for each positive integer from 1 to a large number like 1,000, to print the string “fizz” if the integer is divisible by 3, “buzz” if it is divisible by 5, and “fizzbuzz” if it is divisible by both (that is, by 15).
Correct programs for solving the fizzbuzz problem are deterministic as there is exactly one correct thing for the code to do for each input integer. Statistical algorithms and techniques are not required to solve the problem, and so the problem is not a good target for machine learning.
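To illustrate how little statistics is involved, here is a minimal Python sketch of a fizzbuzz solution. The function names are illustrative, and the sketch assumes the usual convention of printing the number itself when none of the divisibility rules apply, which the description above does not spell out.

```python
def fizzbuzz(n):
    """Return the fizzbuzz output for a single positive integer."""
    # Check 15 first: a number divisible by 15 is also divisible by 3 and 5.
    if n % 15 == 0:
        return "fizzbuzz"
    if n % 3 == 0:
        return "fizz"
    if n % 5 == 0:
        return "buzz"
    # Assumption: fall back to the number itself, as in the common version.
    return str(n)


def fizzbuzz_lines(limit):
    """Return the outputs for every integer from 1 to `limit`, inclusive."""
    return [fizzbuzz(i) for i in range(1, limit + 1)]


first_five = fizzbuzz_lines(5)
```

Every input maps to exactly one output through fixed rules, so there is nothing for a model to learn from data.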
Where applied ML works
A common business problem that is not completely deterministic, and that therefore calls for statistical methods, is displaying results for a given query on a search engine. Such a problem is a good target for machine learning because there is no deterministic way to define the “best” search results for a given query.
Therefore, companies resort to using statistical models in order to display results that best relate to the users’ search queries. Because this constitutes machine learning applied to a specific problem, it is an instance of applied machine learning.
Bloom filters in applied ML
Even applied machine learning problems, however, can involve significant deterministic and non-statistical components. In the example of returning search results for a given query, a Bloom filter is a mostly deterministic structure that could be used to help determine whether a searched keyword is present on a given page.
Bloom filters tell you in a rapid and memory-efficient manner whether a given element is a member of a given set. The cost of this efficiency is that Bloom filters are slightly probabilistic structures: a Bloom filter can report that an element is in the set when it is not (a false positive), although it never reports that an element that was added is absent.
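To make the idea concrete, here is a minimal Python sketch of such a filter. The class name `BloomFilter`, its method names, and the default sizes (1,024 bits and 3 hash functions) are illustrative assumptions, not taken from any particular search engine. Salted SHA-256 is used only because it ships with Python; a production system would more likely use faster non-cryptographic hashes.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over strings (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a compact bit array

    def _positions(self, item):
        # Derive `num_hashes` bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            data = f"{seed}:{item}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "probably present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Hypothetical usage: record which words occur on one page.
page_words = BloomFilter()
for word in "the quick brown porcupine".split():
    page_words.add(word)
```

Here `page_words.might_contain("porcupine")` is guaranteed to return True, because every bit set by `add` stays set. A word that was never added will almost always return False, but can occasionally return True if its bit positions happen to collide with those of other words.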
Search engine algorithms
In practice, however, Bloom filters are constructed to give the right answer nearly all the time, and are thus mostly deterministic structures. In the search engine problem, a Bloom filter would allow engineers to check searched keywords against the entire set of pages stored in a database, select the pages with a large number of hits for particular searched keywords, and then apply other methods like the PageRank algorithm to the returned results to ultimately generate the search results that will be served to the user.
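As a rough illustration of the ranking step, the following Python sketch implements a simplified, textbook-style version of PageRank using power iteration. It is not a description of how any production search engine works. The function name, the tiny link graph, the damping factor of 0.85 (a commonly cited default), and the fixed iteration count are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores by power iteration.

    `links` maps each page to the list of pages it links to. This sketch
    assumes every linked page also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical candidate pages returned by the keyword-matching step.
candidate_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(candidate_links)
ranked_pages = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, page "c" ends up ranked highest because both other pages link to it, and page "b" ranks lowest because it receives only half of the rank passed on by "a".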
Weights and frequency
Deciding which keywords in a search string deserve the most attention, and how to weight the number of hits for a given keyword on a given page, are statistical problems solved by statistical methods. This is ultimately what makes the search engine problem a machine learning problem.
For example, one of the statistical factors that the algorithm takes into consideration is the frequency with which each searched word appears in the vocabulary. Words that appear relatively frequently, like “the,” are weighted as less important than words that appear relatively rarely, like “porcupine.”
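One common way to express this kind of weighting is inverse document frequency (IDF), sketched below in Python. IDF is named here as an illustrative technique, not as the method any particular search engine uses. The function names and the three-sentence corpus are made up for the example, and the sketch assumes that query words never seen in the corpus get a weight of zero.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each word to log(N / number of documents containing it)."""
    n_docs = len(documents)
    doc_counts = Counter()
    for doc in documents:
        # Count each word at most once per document.
        doc_counts.update(set(doc.lower().split()))
    return {word: math.log(n_docs / count) for word, count in doc_counts.items()}


def rank_query_terms(query, idf):
    """Order the distinct query words from most to least informative."""
    words = list(dict.fromkeys(query.lower().split()))
    # Assumption: words missing from the corpus are given weight 0.0.
    return sorted(words, key=lambda w: idf.get(w, 0.0), reverse=True)


documents = [
    "the porcupine sleeps",
    "the cat sleeps",
    "the dog barks",
]
idf = inverse_document_frequency(documents)
query_terms = rank_query_terms("the porcupine", idf)
```

Because “the” appears in every document, its weight is log(3/3) = 0, while “porcupine” appears in only one document and receives log(3), so it is ranked as the more important query word.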
Ads and machine learning
Each search engine employs different statistical techniques to gain advantage over its competitors. The differences in ad revenue for even slightly improved search results can be so significant to a business that statistical techniques for problems like search are often fiercely sought after, and the ones that are found to give high performance are treasured.
Writing ML algorithms
Python is the most commonly used programming language for machine learning. Part of the reason is simply the momentum of history: because Python was commonly used for machine learning in the past, machine learning libraries and tutorials were written in Python, which encouraged later libraries, tutorials, and other work to be written in the same language, until Python became the lingua franca of machine learning.
TensorFlow and scikit-learn are examples of commonly used machine learning libraries for Python that do not have true equivalents in other programming languages.
Python dominates ML
Another reason for Python’s success in machine learning is that it offers a clean and intuitive syntax with dynamic typing. Statically typed, compiled languages like C and Go are generally better suited to low-level problems where one is working close to the hardware and the efficiency of the code may matter significantly.
Dynamically typed languages like Python are well suited to the math-heavy work common in machine learning, where the ability of humans to read and understand implementations of complex algorithms like neural networks and support vector machines matters more than the ease with which the computer can compile and run the code.
A final reason for Python’s success as a machine learning language is that it can approach the efficiency of lower-level languages like C. Libraries like NumPy implement their core numerical routines in compiled C code, and tools like Cython compile Python-like code to C and make it possible to hook existing C code into Python.
The most common low-level language for machine learning is C++, a statically typed extension of C that adds object-oriented features. While C++ makes it easy to write fast, natively compiled code, implementations of algorithms like neural networks and support vector machines can often be overwhelmingly complex to human eyes, which is a significant downside.
Learning ML languages
Other common high-level languages for machine learning include R, a programming language designed specifically for statistical manipulations, and Java, historically one of the most popular statically typed languages. One of the downsides of both Java and C++ is the lack of libraries like NumPy, SciPy, scikit-learn, and TensorFlow, which exist in Python and simplify the process of constructing machine learning algorithms.
One of the best ways to learn about applied machine learning is to take an applied AI course, either at a university or through online learning platforms like Coursera and Udemy. The time needed to complete such a course varies from about four months at a university to a few weeks for some of the shorter online courses. While it may have been difficult in the past for many people to access university courses, online platforms like Coursera have significantly broadened access to such courses for the general public, especially now that remote learning is the new normal.