Challenge of Predicting Time
This post was originally part of the DataRobot Community.
There are many use cases where the piece of information you are interested in predicting is related to time. When will it start (or stop) raining? When are my clients going to pay? When am I going to run out of battery? Or even more existentially: how many years do I still have to live?
All these questions try to predict the time remaining until an event happens, and you learn the actual answer only once the event has happened. The usual human approach so far has been a “health status” that describes the current condition based on different criteria or sensor readings. This approach relies on domain expertise and the ability to interpret complex relationships, and in many of the areas mentioned above the results are not entirely satisfactory.
So what historical examples can you give your machine to learn from? To learn the failing behavior you need data over time, from any starting point until the event of interest happens. Think of your failing battery: you can give it all the examples from fully charged to dying. Or, consider all the examples possible from any cloudy day until the rain breaks. This type of data is often referred to as lifetime data or run-to-failure history and is inherently time series data (i.e., ordered data over time).
Machine learning modeling on time series data is challenging and requires you to think about the data timeline very thoroughly to avoid pitfalls. In an introduction to time series modeling you learn that the order of rows matters, that future data is different from past data, and that a model cannot be evaluated on a single row of data. So you look at your data and think, “Yes, I can do time series modeling!”
In the following figure, you can see the behavior of ten different engines over time. The X-axis depicts the remaining useful life (RUL). We see that not all engines start at the same point in time (on the right) but all end up failing (at RUL = 0). The Y-axis stands for any indicator or sensor that you might use as a feature to capture the failing behavior.
Time series modeling is very prone to target leakage (i.e., when your training data includes information that would not be available at the time of prediction). We need to make sure that our model does not contain such features. For example, don’t include actual weather information when only weather forecasts would be available; you will end up predicting an event already happening, like predicting it will rain when it’s raining:
(Source: Artificial Intelligence Wiki, Target Leakage (AI Simplified: What is Target Leakage in Data Science?))
The real challenge in predicting time comes from your target: the RUL of your engine. When predicting tomorrow’s sales, you will know how much you sold today and yesterday; however, when predicting the RUL, you don’t have access to past RUL data. The remaining useful life at any point in time can only be computed after the event has occurred. So, if you simply begin time series modeling with RUL as the target, your model would rely on past RUL values that are not known at prediction time, which is target leakage.
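To make the hindsight nature of the target concrete, here is a minimal sketch of labeling run-to-failure history with RUL. The field names (`engine_id`, `cycle`, `sensor`) and the helper `label_rul` are hypothetical, and the sketch assumes every engine's history ends at its failure. Note that the label for any row needs the engine's last cycle, which is only known once the engine has failed.

```python
def label_rul(rows):
    """Return copies of rows with an added 'rul' field (cycles left until failure).

    rows: list of dicts with the hypothetical keys 'engine_id' and 'cycle'.
    Assumes each engine's last recorded cycle is the cycle at which it failed.
    """
    # First pass: find the last observed cycle per engine (known only after failure).
    last_cycle = {}
    for row in rows:
        engine = row["engine_id"]
        last_cycle[engine] = max(last_cycle.get(engine, row["cycle"]), row["cycle"])

    # Second pass: RUL is the distance from each row to that final cycle.
    return [{**row, "rul": last_cycle[row["engine_id"]] - row["cycle"]} for row in rows]


# Toy run-to-failure history for two engines (made-up values for illustration).
history = [
    {"engine_id": 1, "cycle": 1, "sensor": 0.10},
    {"engine_id": 1, "cycle": 2, "sensor": 0.15},
    {"engine_id": 1, "cycle": 3, "sensor": 0.40},
    {"engine_id": 2, "cycle": 1, "sensor": 0.20},
    {"engine_id": 2, "cycle": 2, "sensor": 0.55},
]

labeled = label_rul(history)
rul_values = [row["rul"] for row in labeled]
```

For an engine that is still running, `last_cycle` does not exist yet, so there are no past RUL values to feed a model at prediction time.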
In this case, having time series data does not mean that we can use time series modeling approaches to make predictions. Is it at all possible to predict RUL by modeling on run-to-failure data? Can we predict how long it will take until it rains, until our phone dies, or until a machine breaks? A valid approach is to learn from each time point separately, in a non-time-aware model. In order to not lose information from past data, it can be incorporated on a row-by-row basis in the form of lags. To understand in detail how this can be done, see this article which explains how to approach the problem for a predictive maintenance dataset from NASA.
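The idea of carrying past information into each row as lags could be sketched as follows. This is an illustration under assumptions, not the method from the referenced article: the field names and the helper `add_lags` are hypothetical, and rows are assumed to be sorted by cycle within each engine. Only sensor features are lagged; the RUL target is never lagged, since past RUL is not available at prediction time.

```python
def add_lags(rows, feature="sensor", n_lags=2):
    """Add lagged copies of `feature` to each row, computed within each engine.

    rows: list of dicts with the hypothetical keys 'engine_id' and `feature`,
    sorted by cycle within each engine. Lags that would reach before an
    engine's first observation are set to None.
    """
    past_values = {}  # engine_id -> feature values seen so far, oldest first
    out = []
    for row in rows:
        past = past_values.setdefault(row["engine_id"], [])
        new_row = dict(row)
        for k in range(1, n_lags + 1):
            new_row[f"{feature}_lag{k}"] = past[-k] if len(past) >= k else None
        past.append(row[feature])  # becomes available to later rows only
        out.append(new_row)
    return out


# Toy sensor history for two engines (made-up values for illustration).
history = [
    {"engine_id": 1, "cycle": 1, "sensor": 0.10},
    {"engine_id": 1, "cycle": 2, "sensor": 0.15},
    {"engine_id": 1, "cycle": 3, "sensor": 0.40},
    {"engine_id": 2, "cycle": 1, "sensor": 0.20},
    {"engine_id": 2, "cycle": 2, "sensor": 0.55},
]

lagged = add_lags(history)
```

Each resulting row is self-contained, so it can be passed to an ordinary regression model with RUL as the target. Lags are computed per engine so that one engine's readings never leak into another's rows.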
Predicting time is a great way to reduce uncertainty about events affecting our lives. These use cases can rely on time series data but, as shown here, time series modeling with RUL as the target introduces target leakage, because past RUL values are not known until the event has occurred. Row-by-row regression modeling with time lags is a valid approach to predicting remaining useful life, but caution and domain expertise are needed to ensure quality.
- ODSC (2019) presentation on target leakage by Yuriy Guts (Machine Learning Engineer, DataRobot)
- Chapter 24. Seven Types of Target Leakage in Machine Learning and an Exercise
- If you’re a licensed DataRobot customer, search in-app Platform documentation for Data quality assessment, then locate the section “Target leakage.”