Bringing MMM to the 21st Century with Machine Learning and Automation
What is MMM?
MMM stands for Marketing Mix Model and it is one of the oldest and most well-established techniques to measure the sales impact of marketing activity statistically. MMM allows you to isolate incremental sales contributions coming from each of the marketing channels. When this contribution is put against the marketing spend in the particular channel, it produces a reading on the highly coveted return on investment (ROI). ROI gives a standard way to judge whether a marketing activity was profitable and to compare efficiency across media channels or campaigns.
Historically, this analysis was applied to traditional offline media channels: TV, radio, print (magazines, newspaper), out-of-home (billboards and posters), etc. However, even though digital and social media receive an ever-growing proportion of the media budget, offline channels remain the dominant media type for most of the largest companies today. This has led to innovations in MMMs that measure both offline and online media effects in a single statistical model, while the historical core principles remain intact.
As with any type of statistical model, data is key and the GIGO (“Garbage In, Garbage Out”) principle definitely applies. Historical data needs to be accurate, consistent and, most importantly, available. More often than I would like to admit, I have heard the following phrase from a client: “We do not have the data for the five media campaigns we ran last year, but we have data for the other four. When can you give us the ROI?” It is pretty obvious to a data scientist that the only way one can measure the effect of an event is by knowing which events are happening at what time. But it is not always obvious that the business expectations need to be set before the project starts:
- What can be measured with the current state of the data?
- What cannot be measured?
- What other data is needed to answer any additional questions?
So, what data is necessary to build a good MMM? The three main ingredients are:
- Sales data (usually weekly): product quantity, value, selling distribution, promotional activity (discounts, multi-buys, etc.)
- Media data (usually weekly): media costs, media ratings generated (TVRs, magazine copies, digital impressions, likes, shares, etc.), targeted audience (if available)
- External factors: macroeconomic indicators, weather, competitor media campaigns and promotional activity
The standard practice is that the data should be aggregated into a weekly format and span at least the last two to three years (ideally around five years). A history of this length is needed to separate the fluctuation in sales that is due to seasonal demand patterns from the sales generated by pricing or media.
Classical Modeling Considerations
The MMM landscape is still dominated by linear and log-linear regression. Although more recent players in the market are shifting to Bayesian techniques, the fundamental model structure has remained relatively unchanged for over 20 years.
The model target – the thing the model is trying to predict – is usually product quantity sold. In some cases, the chosen target is sales value. However, this approach is not recommended, because sales value is influenced by both quantity and average price and so does not give an accurate representation of demand for the product, whereas quantity does not have this issue.
All the other explanatory variables are used in the model as features, the information that you want to use to predict the target.
Before the data is put into the model comes a process called feature engineering – transforming the original data columns to impose certain business assumptions or simply increase model accuracy. Some of the most common data transformations for MMMs are:
- Applying adstock to media: The premise is that after people have seen the ad, not all of them purchase immediately. Some of them buy the next week, and others the next month, which is why adstock is sometimes referred to as “the memory effect”, as it represents how long a person still remembers the ad after seeing it.
- Applying diminishing returns to media: The premise is that if you spend more and more on your media, you saturate the market. Everybody who could have seen the ad has seen it, and repeated ad exposure does not increase the audience's propensity to buy that much. The way to encode this relationship into the model is to apply a saturating function to media, such as log(), atan(), tanh() or others.
- Instead of using average price, consider Base Price and Discount Percentage: The premise here is that people react differently to permanent price changes than they do to discounts. For example, if you double the price of a candy bar from $1 to $2 and then run a 50% off promotion, the sales volume will be higher than it was at the original $1 shelf price, even though the price paid is still $1. The permanent shelf price is usually called the Base Price, and demand is typically less elastic to it than to the promotional price (i.e., you will lose fewer customers due to a price hike than you will gain due to a temporary price drop).
These are just a few of the many examples that can be applied to your data during the feature engineering stage.
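To make the first two transformations concrete, here is a minimal sketch in plain Python. It assumes a geometric (exponentially decaying) adstock and a tanh saturation curve. The function names, the `decay` and `scale` parameters, and the weekly numbers are illustrative assumptions of mine, not a prescribed implementation.

```python
import math

def geometric_adstock(media, decay):
    """Carry a share `decay` of last week's adstocked value over into this week."""
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    carried = 0.0
    result = []
    for value in media:
        carried = value + decay * carried
        result.append(carried)
    return result

def saturate(media, scale):
    """Diminishing returns via tanh: near-linear for small values, flat for large ones."""
    return [math.tanh(value / scale) for value in media]

# Made-up weekly TV ratings: one burst in week 1 and a smaller one in week 4.
weekly_tvr = [100.0, 0.0, 0.0, 50.0, 0.0]

adstocked = geometric_adstock(weekly_tvr, decay=0.5)  # the ad is still "remembered" in off-air weeks
feature = saturate(adstocked, scale=100.0)            # the column that would go into the model
```

In practice both `decay` and `scale` would be tuned per channel, for example by trying a grid of values and keeping the combination that gives the best model fit.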
Classical Attribution Methods
Once the model is built, we can measure the incremental benefit of media activity. To achieve this goal, we have to use the model structure to attribute the sales to various factors. In the case of MMMs, sales attribution is also referred to as decomposition, meaning that historical sales are decomposed into the sources – base, promotions, media, etc.
The process of attribution is the following:
- Variables used in the model are grouped into different classes:
  - Base Variables: seasonality, selling distribution, exogenous variables such as temperature, GDP, etc.
  - Pricing/Promotional Variables: Base price, discount, promotion type, promotional distribution, etc.
  - Media Variables: TV ratings, print volumes, impressions, likes, etc.
- The combined contribution of the base variables answers the question: what would the natural sales level (Base Sales) be with no promotions or media?
- The remainder of the sales is attributed to the promotion/media activities that happened to be running at the time.
In the case of a simple linear model, this is a pretty trivial exercise. One simply has to take the model coefficient associated with the feature that represents the activity and multiply it by the feature's value, week by week. This procedure is very easy to implement, even in Excel. Thus, many companies stick with linear models, as this allows them to build simple “what-if” scenario simulators. Also, due to the simple model structure, it is very intuitive to business users: If I spend $1,000 on print, I will get $5,000 extra sales. If I spend $10,000, I will get $50,000 extra sales, regardless of when I do it.
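The coefficient-times-value procedure can be sketched in a few lines. The intercept, coefficients and weekly values below are made up for illustration, and treating the intercept alone as Base Sales is a simplification (a real model would also put seasonality, distribution and other base variables into the base).

```python
# Hypothetical fitted linear model: sales = intercept + sum(coefficient * feature value)
intercept = 1000.0
coefficients = {"tv": 2.0, "print": 5.0, "discount_pct": 30.0}

# Made-up feature values for three weeks.
weeks = [
    {"tv": 100.0, "print": 0.0, "discount_pct": 0.0},
    {"tv": 50.0, "print": 20.0, "discount_pct": 10.0},
    {"tv": 0.0, "print": 20.0, "discount_pct": 0.0},
]

def decompose_linear(intercept, coefficients, week):
    """Split one week's predicted sales into additive contributions per driver."""
    contributions = {name: coefficients[name] * week[name] for name in coefficients}
    contributions["base"] = intercept
    return contributions

decomposition = [decompose_linear(intercept, coefficients, week) for week in weeks]

# The contributions add back up to the model prediction for each week.
predicted = [sum(parts.values()) for parts in decomposition]
```

Because every contribution is a simple product, the same table doubles as a what-if simulator: change a value in `weeks` and the contribution changes proportionally.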
However, there is always a tradeoff between model complexity and model accuracy. And as one would expect, a linear model is rarely as accurate as its more complex counterparts (e.g., a log-linear model or more sophisticated ML models). The model that is frequently chosen in the industry is the log-linear model, as it has slightly better accuracy for modeling sales and is still relatively easy to decompose. A log-linear model conceptually differs from a linear model because it provides a multiplicative view of the world, not an additive one. Hence, a what-if scenario would look something like: If I spend $1,000 on print, then I will get +1% extra sales. If I spend $10,000, I will get +10% extra sales, and then we have to look at my Base Sales level to see how much that 1% or 10% actually is. (A good explanation of how to interpret coefficients can be found here).
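A small sketch shows why the same spend is worth a different number of units depending on the base level. It assumes a model of the form log(sales) = intercept + sum(coefficient * feature). The coefficient of 0.01 per $1,000 and both base levels are hypothetical numbers chosen to mirror the example above.

```python
import math

def loglinear_predict(intercept, coefficients, week):
    """Sales under a log-linear model: log(sales) = intercept + sum(coef * feature)."""
    return math.exp(intercept + sum(coefficients[name] * week[name] for name in coefficients))

def incremental_sales(intercept, coefficients, week, channel):
    """Sales lost if `channel` were switched off while everything else stays the same."""
    without_channel = {**week, channel: 0.0}
    return (loglinear_predict(intercept, coefficients, week)
            - loglinear_predict(intercept, coefficients, without_channel))

coefficients = {"print_spend_k": 0.01}  # hypothetical: roughly +1% sales per $1,000 of print
week = {"print_spend_k": 1.0}           # $1,000 spent this week

# Same spend, two different base levels (1,000 vs 2,000 units without print).
small_base = incremental_sales(math.log(1000.0), coefficients, week, "print_spend_k")
large_base = incremental_sales(math.log(2000.0), coefficients, week, "print_spend_k")
```

Here the uplift is about 1% in both cases, but it translates into roughly twice as many extra units on the larger base. Note that the percentages are approximate: a coefficient of 0.01 applied to $10,000 gives exp(0.1) - 1, which is about 10.5% rather than exactly 10%.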
Bringing Marketing Mix Modeling into the 21st century with ML and Automation
Historically, marketing departments have been the pioneers in the adoption of new approaches to data analysis in most organizations, and that is true to this day when it comes to digital marketing. But the principles and techniques for MMMs have remained practically unchanged over the last 30 years. With the advent of ML and growing compute power, barely anybody doubts that it is possible to beat the predictive accuracy of linear/log-linear models, and forecasting accuracy is effectively a measure of trust in the models’ recommendations. The argument has always been that ML models are too opaque (black-box) to produce the level of insight and transparency needed for setting marketing strategy. This was a valid cause for concern in the early days of ML, but over the last 5-7 years significant research effort has gone into model transparency and insights. Techniques that a decade ago would have been computationally prohibitive can now be handled by a mid-range laptop.
Breakthrough #1: Partial Dependence Plots
Partial dependence (PD) plots are not a new idea, but they have greatly spiked in popularity and adoption with the dawn of Kaggle. In principle, PD plots can be thought of as a generalization of model coefficients that can be applied to ANY ML model. Combined with an ML algorithm that can discover non-linear relationships out of the box (e.g., any tree-based algorithm such as XGBoost), they address one of the most painstakingly tedious parts of MMM building: determining diminishing returns parameters. This task is essentially automated by the algorithm itself. And in case one wishes for more control over the relationship, it is possible to impose monotonic constraints on media features (positive monotonicity) or price features (elasticity, i.e., negative monotonicity).
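The idea behind a PD curve is simple enough to sketch by hand: force one feature to a fixed value for every historical row, average the predictions, and repeat over a grid of values. In the sketch below, `predict` is a hand-written stand-in for whatever fitted model you have, and the history and grid are made-up numbers. In a real project you would more likely use a library implementation than this loop.

```python
import math

def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        forced_rows = [{**row, feature: value} for row in rows]
        curve.append(sum(predict(row) for row in forced_rows) / len(forced_rows))
    return curve

# Hypothetical stand-in for a fitted model's predict function.
def predict(row):
    return 1000.0 + 300.0 * math.tanh(row["tv"] / 100.0) - 50.0 * row["price"]

# Made-up history: weekly TV ratings and price.
history = [
    {"tv": 0.0, "price": 2.0},
    {"tv": 120.0, "price": 2.0},
    {"tv": 60.0, "price": 1.5},
]

grid = [0.0, 50.0, 100.0, 150.0, 200.0]
curve = partial_dependence(predict, history, "tv", grid)

# Extra sales gained by each additional 50 ratings; shrinking steps mean diminishing returns.
steps = [later - earlier for earlier, later in zip(curve, curve[1:])]
```

Plotting `curve` against `grid` gives the response curve for TV without anyone having to choose a saturation function up front.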
Breakthrough #2: Shapley Value
The Shapley value originates in game theory and is named after Lloyd Shapley, who shared the 2012 Nobel Memorial Prize in Economic Sciences. Applied to ML, it has revolutionized model explainability in a manner that is especially relevant for marketing use cases, such as when we ask a question like, “Did my $X spend on marketing drive any effect on sales/customer behavior?” Initially, the benefit of SHAP came in digital attribution, allowing complex ML models to attribute the impact of every touchpoint on the conversion probability. However, with a clever trick, it can also be used to make any model used for MMM produce a sales decomposition that is equivalent to the ones coming from linear models. The trick addresses the fact that SHAP values measure the incremental benefit against the average prediction over the background data (i.e., typical spend), whereas in marketing we want to compare against $0 spend. There are two ways this can be done:
- Running a Shapley value based simulation against scenarios where media features are “turned off” (set to 0) in various combinations to compute the impact of each channel.
- If your model has built-in SHAP support (e.g., XGBoost), you can compare the out-of-the-box SHAP scores against the SHAP scores of a $0 scenario to get around the issue.
This is truly revolutionary for MMM, as the inability to produce a sales decomposition has been the key objection to using advanced ML algorithms from the beginning.
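The first option can be sketched as an exact Shapley computation over media channels, where a channel that is "out of the coalition" has its spend set to 0. Everything named here (`shapley_vs_zero`, the stand-in `predict` with a TV and print interaction, the numbers) is a hypothetical illustration. The exact enumeration is only practical for a handful of channels, since the number of scenarios doubles with each one.

```python
from itertools import combinations
from math import factorial

def shapley_vs_zero(predict, row, channels):
    """Exact Shapley contribution of each channel relative to a $0 spend scenario."""
    def value(active):
        # Channels outside the coalition are "turned off" (set to 0).
        scenario = {**row, **{c: 0.0 for c in channels if c not in active}}
        return predict(scenario)

    n = len(channels)
    contributions = {}
    for channel in channels:
        others = [c for c in channels if c != channel]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {channel}) - value(coalition))
        contributions[channel] = total
    return contributions

# Hypothetical stand-in for a fitted model, with a TV x print interaction.
def predict(row):
    return 1000.0 + 2.0 * row["tv"] + 5.0 * row["print"] + 0.01 * row["tv"] * row["print"]

week = {"tv": 100.0, "print": 20.0}
media = shapley_vs_zero(predict, week, ["tv", "print"])
base = predict({**week, "tv": 0.0, "print": 0.0})  # sales with all media switched off
```

For this week the contributions plus `base` add up to the model prediction, which is exactly the property a sales decomposition needs; the interaction effect is split evenly between the two channels.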
Breakthrough #3: ML algorithms’ complexity
This one is a little less obvious but, if used correctly, it can help address the second bane of MMM: finding the adstock effect. The classical approach is to assume the adstock function (typically linear) and test out various values of its parameter λ to find the one that maximizes model fit. There is a lot of utility in its simplicity, but this approach has one major drawback: we ASSUME that the adstock function is a linear filter. Another way to address/validate our assumptions is to relax the restriction and let the ML model find the relationship. By creating multiple lag or rolling sum features of the media activity and training a model from a family of algorithms that implicitly build feature interactions (e.g., tree-based models, neural networks, etc.), the model is able to explore a very wide set of “shapes” for adstock and find the one suggested by the data.
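Building the lag and rolling sum features is the mechanical part, and a minimal sketch looks like this. The helper names, the choice of two lags and a three-week window, and padding the weeks before the start of the data with 0 are all assumptions made for illustration.

```python
def lag_features(media, max_lag):
    """One row per week holding the current value and the previous `max_lag` weeks."""
    rows = []
    for t in range(len(media)):
        # Weeks before the start of the data are padded with 0 (an assumption).
        rows.append({f"lag_{k}": (media[t - k] if t - k >= 0 else 0.0)
                     for k in range(max_lag + 1)})
    return rows

def rolling_sum(media, window):
    """Sum of the current week and the preceding `window - 1` weeks."""
    return [sum(media[max(0, t - window + 1): t + 1]) for t in range(len(media))]

# Made-up weekly TV ratings.
weekly_tvr = [100.0, 0.0, 0.0, 50.0, 0.0]

lags = lag_features(weekly_tvr, max_lag=2)
rolling = rolling_sum(weekly_tvr, window=3)
```

Each of these columns is then handed to the model as a separate feature, and the model decides how much weight last week's or last month's activity deserves instead of us fixing a decay shape in advance.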
Regardless of your personal views on methodology, MMMs are here to stay. With offline media in the United States alone attracting spending of 196 billion U.S. dollars in 2021, it is vital for businesses to know how much value they are getting from those investments as well as how best to allocate budgets. Today, most marketing departments are using MMMs, either building them themselves or using third parties to do it for them. The adoption of ML for this task is inevitable given the time and cost needed to hand-build models in the classical way. And with the latest developments in ML interpretability paired with the huge productivity boost coming from the field of AutoML, the old objections are no longer valid.
So, my call to action for every data scientist supporting the marketing teams is this: Let’s make sure marketing organizations stay on the cutting edge of data science innovation and apply 21st century techniques to 21st century marketing campaign planning with 21st century MMM!