Predictive Maintenance (of NASA Turbofans) Using DataRobot

September 1, 2021

This post was originally part of the DataRobot Community.

There are many things in our daily lives that we expect to function properly. You might rely on the elevator to take you down from your apartment, or a car, bicycle, or train to get you to work. Such machinery, as well as components in manufacturing plants, are just some examples of areas where maintenance is of critical importance. The traditional approach is to schedule maintenance by rule of thumb: the older the component, the more regularly the maintenance needs to be carried out. Predictive maintenance with machine learning aims to better predict the likelihood of failure so that maintenance is carried out only when required.

Predictive maintenance can be effective in two ways. On the one hand, by better understanding when failure might occur, we can prevent the costly reactive maintenance incurred when a machine is repaired only after failure. For example, it can be extremely disruptive (and costly) if the train breaks down during everyone’s morning commute. On the other hand, we can minimize the amount of preventative maintenance involving unnecessary routine check-ups and repairs. For example, if a manufacturing component is taken out to be inspected periodically, not only is it out of use during that time, but it may also be damaged during the inspection process.

In this blog, we discuss how machine learning can be used for predictive maintenance. The raw data is captured from a set of 100 NASA turbofan engines,¹ each with a differing amount of initial wear and tear. For each engine we are given three operational settings and 21 sensor readings recorded during each cycle of use. The measurements are given for each cycle until the engine eventually fails.

Feature                             Data Type
Cycle of measurement                Time
Operational settings (3 columns)    Numeric
Sensor readings (21 columns)        Numeric
Total: 20,630 rows

We tried two different approaches:

1. First we attempted to predict the remaining useful life (RUL): this is a regression problem.
2. Alternatively, we could predict the probability of failure within n cycles: this becomes a binary classification problem.

Remaining Useful Life

The first step is to compute the target which we want to predict: the RUL. Because each engine is recorded until it fails, the RUL of a given row is the engine's final (maximum) cycle number minus the current cycle number. Using group partitioning as the cross-validation strategy, we uploaded our dataset into DataRobot and ran Autopilot to predict the RUL.
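As a rough illustration of the target computation, here is a minimal sketch in plain Python. The row layout and the key names `unit` (engine ID) and `cycle` are assumptions made for this example, not the column names of the original dataset or anything DataRobot requires.

```python
from collections import defaultdict


def add_rul(rows):
    """Return copies of `rows` with an added "rul" key.

    Each row is a dict with the hypothetical keys "unit" (engine ID) and
    "cycle" (cycle number). The last recorded cycle of an engine is taken
    to be its failure point, so RUL = last cycle - current cycle.
    """
    last_cycle = defaultdict(int)
    for row in rows:
        last_cycle[row["unit"]] = max(last_cycle[row["unit"]], row["cycle"])
    return [{**row, "rul": last_cycle[row["unit"]] - row["cycle"]} for row in rows]


# Toy data: engine 1 fails after 3 cycles, engine 2 after 2 cycles.
rows = [
    {"unit": 1, "cycle": 1}, {"unit": 1, "cycle": 2}, {"unit": 1, "cycle": 3},
    {"unit": 2, "cycle": 1}, {"unit": 2, "cycle": 2},
]
labelled = add_rul(rows)
print([row["rul"] for row in labelled])
```

With this convention the final row of every engine has an RUL of 0.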

A note about the partitioning strategy

The default in DataRobot is to apply random partitioning into 5 cross-validation folds, having already assigned 20% of the data for holdout. In this case, such a strategy would introduce target leakage: rows from the same engine would appear in both the training and validation data, so a model trained on data for a particular engine close to the end of its RUL could learn the failure mechanism for that engine, resulting in artificially good validation accuracy. Instead we apply group partitioning; the data for a particular engine must be contained within a single validation fold (which can contain any number and combination of engines). This ensures that we can assess the extent to which our model is truly generalizable. To change the partitioning strategy in DataRobot, go to the Advanced Options on the Data page prior to pressing Start.
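To make the idea of group partitioning concrete, here is a minimal sketch in plain Python. It is not DataRobot's implementation; the function name `group_folds`, the round-robin assignment, and the toy data are all assumptions for illustration, and the separate holdout partition is left out.

```python
import random


def group_folds(units, n_folds=5, seed=0):
    """Assign each row a fold so that all rows of an engine share one fold.

    `units` holds the engine ID of every row. Engines (not rows) are
    shuffled and then dealt round-robin into `n_folds` folds.
    """
    unique_units = sorted(set(units))
    rng = random.Random(seed)
    rng.shuffle(unique_units)
    fold_of_unit = {unit: i % n_folds for i, unit in enumerate(unique_units)}
    return [fold_of_unit[unit] for unit in units]


# Toy data: 10 engines with 3 rows each.
units = [unit for unit in range(10) for _ in range(3)]
folds = group_folds(units, n_folds=5)

for k in range(5):
    train_units = {u for u, f in zip(units, folds) if f != k}
    valid_units = {u for u, f in zip(units, folds) if f == k}
    # No engine contributes rows to both sides of the split.
    assert train_units.isdisjoint(valid_units)
```

Shuffling engines rather than rows is the whole point: a random row-level split would almost certainly place rows from one engine on both sides.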

The resulting models were then ranked based on their performance in terms of root mean squared error (RMSE). The best performing model, an Average Blender, is an ensemble model. It gives RMSE scores of 37.4168, 41.4712, and 46.7555 for validation, cross-validation, and holdout (respectively) for the model trained on ~64% of the data. In practice, this means that the predicted remaining useful life is typically off by roughly ±40 cycles. If we evaluate our model residuals (in the Evaluate tab), we can see that the prediction accuracy improves for shorter RULs (RMSE ~ 9 for RUL < 50); this indicates that our model predicts RUL better as the engine gets closer to failure.
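For readers who want to reproduce this kind of comparison outside the platform, the sketch below computes RMSE over all rows and over only the rows with a short actual RUL. The numbers are made-up toy values, not results from this project, and the helper name `rmse` is just for this example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values only: actual and predicted RUL for four rows.
actual = [120, 80, 40, 10]
predicted = [90, 110, 45, 12]

overall_rmse = rmse(actual, predicted)

# Restrict to rows whose actual RUL is below 50 cycles.
near_failure = [(a, p) for a, p in zip(actual, predicted) if a < 50]
near_failure_rmse = rmse([a for a, _ in near_failure], [p for _, p in near_failure])

print(round(overall_rmse, 2), round(near_failure_rmse, 2))
```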


Feature engineering

Of course, since we have access to the change in sensor readings over time, it is likely that the measurements will be autocorrelated, i.e., the value observed today is affected by yesterday’s reading.² Below are plotted the changes in sensor readings (sensors 1, 2, and 7) for a sample of four engines over time (or more specifically, RUL).

[Figure: Sensor SM1 plot]

As shown in the SM1 plot (above), the sensor reading remains constant over time for all four engines (shown in yellow, blue, brown, and navy blue). DataRobot will automatically detect that this is an uninformative feature and exclude it from the modeling process.

The plots for sensors SM2 and SM7 (below) show a clear time dependence. By introducing rolling statistics (e.g., the standard deviation over the previous 40 cycles), we can capture this time dependence to improve the accuracy of our model. The plots show the sensor readings against RUL for four different engines, shown in yellow, blue, brown, and navy blue.

[Figure: Sensor SM2 plot]

[Figure: Sensor SM7 plot]

In the first case (SM1), the sensor reading never changes; perhaps it is broken? DataRobot will automatically detect that there is only one unique value, will flag it as uninformative, and will not use this sensor reading in modeling. The other two (SM2 and SM7) are simply representative of how incorporating rolling statistics might improve the accuracy of our model.

We can incorporate an understanding of this time dependence by introducing rolling statistics as additional features in our model. In this case we computed the 10-, 20-, and 40-cycle rolling standard deviation and the 10-, 20-, and 40-cycle rolling mean for each sensor measurement and operational setting. The engineered features are listed below; note that requiring a full 40-cycle window drops the first 39 rows of each of the 100 engines, which reduces our dataset from 20,630 to 16,730 total observations.

Features                               Data type
Remaining useful life                  Target
Operational settings (3 columns)       Numeric
Sensor readings (21 columns)           Numeric
Rolling mean 10 cycles (24 columns)    Numeric
Rolling mean 20 cycles (24 columns)    Numeric
Rolling mean 40 cycles (24 columns)    Numeric
Rolling std 10 cycles (24 columns)     Numeric
Rolling std 20 cycles (24 columns)     Numeric
Rolling std 40 cycles (24 columns)     Numeric
Total: 16,730 rows x 171 columns
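The rolling features can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code used for the post: the window is assumed to include the current reading, the standard deviation is the sample standard deviation, rows without a full window are dropped, and the function name and toy readings are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def rolling_features(readings, window):
    """Trailing-window mean and sample standard deviation.

    Returns (index, mean, std) only for positions that have a full window
    of `window` readings (current reading included), so the first
    `window - 1` positions are dropped. Requires window >= 2.
    """
    buffer = deque(maxlen=window)  # keeps only the latest `window` readings
    features = []
    for i, value in enumerate(readings):
        buffer.append(value)
        if len(buffer) == window:
            features.append((i, mean(buffer), stdev(buffer)))
    return features


# Toy readings for two engines; a window never spans two different engines.
sensor_by_engine = {
    1: [1.0, 2.0, 3.0, 4.0, 5.0],
    2: [10.0, 10.0, 10.0, 13.0],
}
for engine, readings in sensor_by_engine.items():
    print(engine, rolling_features(readings, window=3))

# Row count implied by a 40-cycle window, assuming every one of the
# 100 engines runs for at least 40 cycles and loses its first 39 rows.
assert 20_630 - 100 * (40 - 1) == 16_730
```

Computing the statistics one engine at a time matters: a window that ran across the boundary between two engines would mix a failing engine's last readings with a fresh engine's first ones.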

The best performing (non-blender) model is a Ridge Regressor, which now has RMSE scores of 27.6178, 32.2755, and 24.8812 for validation, cross-validation, and holdout (respectively) for the model trained on ~64% of the data.³ This model utilizes a reduced feature list (33 features) that has been automatically generated by DataRobot.⁴ All 33 of these most impactful features are engineered features.

Failure within n cycles

We can also approach predictive maintenance as a binary classification problem. For example, if the lead time for ordering a new component for repairs is 10 days, we might want our system to flag whether an engine is likely to break within the next 20 days.
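Deriving the classification target from the RUL is a one-line transformation. The sketch below assumes that "within n cycles" means an RUL less than or equal to n; whether the boundary is inclusive is a choice made for this example, and the function name is hypothetical.

```python
def fails_within(rul_values, n_cycles):
    """Binary target: 1 if the engine fails within `n_cycles`, else 0.

    The boundary is treated as inclusive (RUL <= n_cycles), which is an
    assumption of this sketch.
    """
    return [1 if rul <= n_cycles else 0 for rul in rul_values]


# Toy RUL values for four rows, labelled with a 20-cycle horizon.
labels = fails_within([50, 21, 20, 0], n_cycles=20)
print(labels)
```

Because every engine spends most of its life far from failure, a target built this way is usually imbalanced, which is one reason a metric such as F1 is a sensible choice over plain accuracy.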

The best performing classification model is an ensemble ‘GLM Blender model’ which achieves an F1 score of 0.9304. In predictive maintenance, however, the sensor data may be collected on millisecond timescales and a response with extremely low latency may be desired, in which case a non-ensemble model can be preferable. The most accurate non-ensemble model here is ‘Light Gradient Boosting on ElasticNet Predictions.’ For this model, DataRobot has reduced the number of features required to 41 (of the 171 available once the rolling statistics are included). It is often the case, as we can see here, that framing a problem as a binary classification problem results in higher accuracy. More importantly, this often makes more business sense: if the aim of building a predictive model is to intervene when a fault is likely to occur, then knowledge of the precise RUL is perhaps unnecessary.
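As a reminder of what the F1 score measures, here is a minimal sketch that computes it from binary labels. The labels are toy values rather than predictions from the models above, and the helper name `f1_score` is local to this example.

```python
def f1_score(actual, predicted):
    """F1 score (harmonic mean of precision and recall) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    if true_pos == 0:
        return 0.0  # no correct positive predictions
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = fails within the horizon, 0 = does not.
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
score = f1_score(actual, predicted)
print(round(score, 4))
```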



We have shown here how machine learning can be used for predictive maintenance using both regression and binary classification approaches. As the number of IoT devices continues to grow, the application of such strategies could lead to increased efficiencies for businesses (and individuals). The next important question for the end user: What is the best next step?


This work was carried out in collaboration with Vyas Adhikari, Customer Facing Data Scientist at DataRobot, and Florian Rossiaud-Fischer, Applied Data Scientist Associate, DataRobot.


1 This project uses raw data from the Turbofan Engine Degradation Simulation Data Set (part of PCoE datasets,

2 Although you might instinctively think that modeling this as a Time Series problem using DataRobot’s Time Series capability would be beneficial, the relationship between time and RUL will always result in target leakage with out-of-time validation (OTV). And while it could be reframed as a Time Series classification problem, the OTV partitioning would involve always training on ‘healthy’ engines and testing on failing engine sensor readings.

3 The random seed used for partitioning this model was 451.

4 Of course, by engineering additional features we now have more features with which to fit our data and are therefore (potentially) more at risk of overfitting. In this case, DataRobot has automatically detected that, of the 171 features introduced, 121 are informative. DataRobot then takes the best performing non-blender model and creates a reduced feature list based on the features that account for 95% of the accumulated impact.

About the author
Linda Haviland

Community Manager
