Starter: Predicting Flight Delays

This accelerator assists DataRobot trial users by providing a guided walkthrough of the trial experience. DataRobot suggests that you complete the Flight Delays sample use case in the graphical user interface first, and then return to this accelerator.

In this notebook, you will:

  • Create a use case folder.
  • Import data from an S3 bucket (this differs from the in-software walkthrough tutorial).
  • Perform a data wrangling operation to create the target feature with code (this also differs from the in-software walkthrough tutorial).
  • Register the wrangled data set.
  • Explore the new data set.
  • Create an experiment and allow DataRobot automation to populate it with many modeling pipelines.
  • Explore model insights for the best performing model.
  • View the modeling pipeline for the best performing model.
  • Register a model in the Model Registry.
  • Configure a deployment.
  • Create a deployment.
  • Make predictions using the deployment.
  • Review deployment metrics.

Data source

Information on flight delays is made available by the Bureau of Transportation Statistics. DataRobot downloaded data from this website in April 2023 and made minor modifications to prepare it for use in this accelerator.

To narrow down the amount of data involved, the dataset assembled for this use case is limited to January 2023 flights originating in Portland (PDX) and Seattle (SEA) operated by the carriers Alaska Airlines (AS) and United Airlines (UA).

Data contents

There are 7671 rows in the training data set with these 8 columns:

Field | Data Type | Description
--- | --- | ---
Date | str (MM/DD/YYYY) | The date of the flight.
Carrier Code | categorical | The carrier code of the airline (one of AS, UA).
Origin Airport | categorical | The three-letter airport code of the origin airport (one of PDX, SEA).
Flight Number | numeric | The flight number for the flight (needs to be converted to categorical).
Tail Number | categorical | The tail number of the aircraft.
Destination Airport | categorical | The three-letter airport code of the destination airport (many variants).
Scheduled Departure Time | time (HH:MM) | The 24-hour scheduled departure time of the flight, in the origin airport’s timezone.
Take-Off Delay Minutes | numeric | The number of minutes past the scheduled departure time that the flight took off (also known as wheels-up time).
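
As a rough illustration of how these fields might be handled in code, the sketch below combines the Date and Scheduled Departure Time fields into a single datetime and treats the flight number as a category rather than a quantity. The helper names are hypothetical, and the formats are an assumption: the rows displayed later in this notebook use a short M/D/YY date such as 1/1/23, so that is the format parsed here.

```python
from datetime import datetime


def parse_scheduled_departure(date_str: str, time_str: str) -> datetime:
    """Combine the Date and Scheduled Departure Time fields into one datetime.

    Assumes a M/D/YY date and a 24-hour H:MM time (hypothetical helper).
    """
    return datetime.strptime(f"{date_str} {time_str}", "%m/%d/%y %H:%M")


def flight_number_as_category(flight_number: int) -> str:
    """Represent a flight number as a label, since its magnitude has no meaning."""
    return str(flight_number)


departure = parse_scheduled_departure("1/1/23", "0:29")
flight_label = flight_number_as_category(2090)

print(departure.isoformat(), flight_label)
```

The times stay in the origin airport's local timezone; no timezone conversion is attempted in this sketch.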

Setup

Import libraries

The first cell of the notebook imports necessary packages, and sets up the connection to the DataRobot platform. This accelerator uses Use Cases to organize related AI assets into a folder. Use Cases was added to the DataRobot Python package in version 3.2.0.

In [1]:

import datarobot as dr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
In [2]:

print(dr.__version__)

3.2.1

In [3]:

# Upgrade datarobot package to at least 3.2.0
# pip install --upgrade datarobot
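
If you prefer to check the version requirement programmatically, one possible approach is sketched below. The function names are hypothetical, and the comparison looks only at the leading numeric parts of the version string, so a pre-release suffix such as "b1" is ignored.

```python
import re


def version_tuple(version: str) -> tuple:
    """Return the leading numeric components of a version string as integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "3.2.0") -> bool:
    """Check whether the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


print(meets_minimum("3.2.1"))
```

In the notebook, this could be called as meets_minimum(dr.__version__) before creating the Use Case.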

Connect to DataRobot

Read more about different options for connecting to DataRobot from the Python client.

In [4]:

# Connect to DataRobot when using DataRobot Notebooks
dr.Client()
Out [4]:

<datarobot.rest.RESTClientObject at 0x10e4afeb0>
In [5]:

# Connect to DataRobot when working outside DataRobot Notebooks
# api_key = "YOUR_API_TOKEN"  # Get this from the Developer Tools page in the DataRobot UI
# endpoint = "YOUR_DATAROBOT_BASE_URL"  # e.g. "https://app.datarobot.com/" This should be the URL you use to access the DataRobot UI, including the trailing slash

# dr.Client(endpoint="%sapi/v2" % (endpoint), token=api_key)
In [6]:

# or use a config file
# dr.Client(config_path = "/Users/andrea.kropp/datarobot_ee_account_config.yaml")
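
The commented-out connection code above builds the API endpoint by appending "api/v2" to the base URL, which only works if the base URL ends with a slash. A minimal sketch of a more forgiving helper follows. The function name and the environment variable names are chosen for this sketch and are not taken from the accelerator.

```python
import os


def build_api_endpoint(base_url: str) -> str:
    """Append "api/v2" to a base URL, whether or not it ends with a slash."""
    return base_url.rstrip("/") + "/api/v2"


# Environment variable names below are chosen for this sketch (hypothetical)
base_url = os.environ.get("DATAROBOT_BASE_URL", "https://app.datarobot.com/")
api_key = os.environ.get("DATAROBOT_API_TOKEN")  # None if the variable is not set

endpoint = build_api_endpoint(base_url)

# With the datarobot package imported as dr, the connection would then be:
# dr.Client(endpoint=endpoint, token=api_key)
```

Reading the token from the environment keeps it out of the notebook itself.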

Create a new Use Case

Create a new use case folder to contain the assets that this accelerator will produce. This notebook will create three datasets, one experiment and one deployment inside the use case folder.

In [7]:

# Create a use case folder

flight_delays_use_case = dr.UseCase.create(
    name="Final Flight Delays Walkthrough from Notebook",
    description="Tutorial for understanding end-to-end workflows in DataRobot.",
)

flight_delays_use_case.id
Out [7]:

'6543e9d18115b2f028fe6bd9'

Retrieve the data

The data for this use case is provided in DataRobot’s Amazon S3 public datasets bucket. There is a 300 KB CSV file used to train the model and a 31 KB CSV file used for performing a batch scoring job.

You can download the data to inspect it, or add it to DataRobot’s AI Catalog directly from the Amazon S3 bucket as shown in this notebook.

Download the training data (optional)

Download the scoring data (optional)

Upload data to DataRobot

The datasets can be added to AI Catalog directly from the Amazon S3 bucket. After they are added, associate them with the current Use Case.

In [8]:

training_data = dr.Dataset.upload(
    "https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/flight_delays/flight_delays_training.csv"
)
training_data.id
Out [8]:

'6543e9d18115b2f028fe6bdb'
In [9]:

scoring_data = dr.Dataset.upload(
    "https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/flight_delays/flight_delays_scoring.csv"
)
scoring_data.id
Out [9]:

'6543ea10134c85fe7808bba3'
In [10]:

# Add the two datasets to the use case

flight_delays_use_case.add(entity=training_data)
flight_delays_use_case.add(entity=scoring_data)
flight_delays_use_case.list_datasets()
Out [10]:

[Dataset(name='flight_delays_scoring.csv', id='6543ea10134c85fe7808bba3'),
 Dataset(name='flight_delays_training.csv', id='6543e9d18115b2f028fe6bdb')]

Explore and wrangle the data

Access the dataset that has been uploaded to AI Catalog and convert it to a pandas dataframe. Then view the top few rows. Perform a calculation to create a new column. The new column will be used as the target for the binary classifier, so this step is required.

View dataset

In [11]:

# Retrieve the data set locally and convert it to a pandas dataframe
df = dr.Dataset.get(training_data.id).get_as_dataframe()
df.head(10)
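  | Date | Carrier_Code | Origin_Airport | Flight_Number | Tail_Number | Destination_Airport | Scheduled_Departure_Time | Take_Off_Delay_Minutes
--- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 1/1/23 | UA | SEA | 2090 | N79521 | ORD | 0:29 | 3
1 | 1/1/23 | AS | SEA | 185 | N408AS | ANC | 0:30 | 75
2 | 1/1/23 | UA | SEA | 2672 | N67501 | DEN | 5:03 | 3
3 | 1/1/23 | UA | PDX | 2493 | N39423 | DEN | 5:05 | 8
4 | 1/1/23 | AS | PDX | 697 | N237AK | SJC | 6:00 | 12
5 | 1/1/23 | AS | PDX | 371 | N533AS | LAX | 6:00 | -1
6 | 1/1/23 | AS | SEA | 257 | N926VA | LAX | 6:00 | 9
7 | 1/1/23 | AS | SEA | 1014 | N260AK | ORD | 6:00 | 15
8 | 1/1/23 | AS | SEA | 74 | N618AS | SNA | 6:00 | -1
9 | 1/1/23 | UA | SEA | 1771 | N35271 | ORD | 6:24 | 4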
In [12]:

# Explore the data provided with some plots

plt.rcParams["axes.labelsize"] = 10
plt.rcParams["axes.titlesize"] = 14
plt.rcParams["xtick.labelsize"] = 10
plt.rcParams["ytick.labelsize"] = 10
plt.rcParams["figure.titlesize"] = 14

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10), tight_layout=True)

plt.subplot(2, 2, 1)
plt.hist(df["Take_Off_Delay_Minutes"], bins=100)
plt.title(
    "Histogram of Flight Delays",
)
plt.ylabel("Flight Count")
plt.xlabel("Minutes Delayed")

plt.subplot(2, 2, 2)
df["Origin_Airport"].value_counts().sort_values().plot(kind="barh")
plt.title("Flight Count by Origin")
plt.xlabel("Flight Count")

plt.subplot(2, 2, 3)
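# Take the 15 most frequent destinations first, then sort ascending so the largest bar is on top
df["Destination_Airport"].value_counts().head(15).sort_values().plot(kind="barh")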
plt.title("Flight Count by Destination (Top 15)")
plt.xlabel("Flight Count")

plt.subplot(2, 2, 4)
df["Carrier_Code"].value_counts().sort_values().plot(kind="barh")
plt.title("Flight Count by Carrier")
plt.xlabel("Flight Count")
Out [12]:
Text(0.5, 0, 'Flight Count')
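[Figure: 2 x 2 grid of plots showing a histogram of flight delays, flight count by origin, flight count by destination (top 15), and flight count by carrier.]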

Perform one data wrangling operation

The training data contains the flight delay in minutes in the feature Take_Off_Delay_Minutes. Compute a new binary feature named TAKE_OFF_DELAY_GREATER_THAN_30 that indicates whether the delay is greater than 30 minutes.

In [13]:

# Calculate whether the delay exceeds 30 minutes
df["TAKE_OFF_DELAY_GREATER_THAN_30"] = np.where(df["Take_Off_Delay_Minutes"] > 30, 1, 0)

# Calculate the percentage of all flights delayed by more than 30 minutes
np.mean(df["TAKE_OFF_DELAY_GREATER_THAN_30"])
Out [13]:

0.24172099087353324
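
To make the threshold rule concrete, here is a small standalone sketch that applies the same strict "greater than 30" comparison to the ten delay values shown in the df.head(10) output above. The function name is hypothetical. Note that a delay of exactly 30 minutes is labeled 0.

```python
def is_delayed(minutes: int, threshold: int = 30) -> int:
    """Return 1 if the take-off delay strictly exceeds the threshold, else 0."""
    return 1 if minutes > threshold else 0


# Take_Off_Delay_Minutes for the ten rows displayed by df.head(10)
delays = [3, 75, 3, 8, 12, -1, 9, 15, -1, 4]

flags = [is_delayed(m) for m in delays]
rate = sum(flags) / len(flags)

print(flags)  # only the 75-minute delay is flagged
print(rate)
```

Negative values are early departures and always fall in the 0 class.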
In [14]:

# Add the wrangled data to the Use Case; obtain the unique identifier for the dataset

df_final = dr.Dataset.create_from_in_memory_data(
    data_frame=df, use_cases=flight_delays_use_case, fname="flight delays wrangled"
)
df_final.id
Out [14]:

'6543ea4e36f70a91d0700201'

Create an Experiment

To create an experiment project from a dataset, use create_from_dataset().

Start a new project and obtain its ID number for future reference. If you wish, you can open the DataRobot graphical user interface and you’ll see a new Experiment project listed after you execute the next cell.

In [15]:

# Create a new experiment project from a previously registered dataset
# Place the project in the same use case folder

project = dr.Project.create_from_dataset(
    df_final.id,
    project_name="Flight Delays Walkthrough Experiments",
    use_case=flight_delays_use_case,
)
project.id
Out [15]:

'6543ea8335cacd9d5d4703d4'

Start modeling

The project is ready to start modeling. There are numerous configurable settings which are out of scope of this accelerator. Visit the DataRobot documentation to learn more about advanced options, modeling modes, worker counts and selecting an optimization metric.

You can optionally open the DataRobot GUI to see a new experiment listed after you execute the next cell.

In [16]:

# Start modeling. Once you execute this, the initial settings can no longer be changed.
project.analyze_and_model(target="TAKE_OFF_DELAY_GREATER_THAN_30")

# You can optionally open the DataRobot GUI to see the modeling in progress.
project.wait_for_autopilot(timeout=3600)
In progress: 8, queued: 1 (waited: 0s)
In progress: 8, queued: 1 (waited: 1s)
In progress: 8, queued: 1 (waited: 2s)
In progress: 8, queued: 1 (waited: 3s)
In progress: 8, queued: 1 (waited: 4s)
In progress: 8, queued: 1 (waited: 6s)
In progress: 8, queued: 1 (waited: 10s)
In progress: 8, queued: 1 (waited: 18s)
In progress: 8, queued: 1 (waited: 31s)
In progress: 4, queued: 0 (waited: 52s)
In progress: 2, queued: 0 (waited: 73s)
In progress: 2, queued: 0 (waited: 94s)
In progress: 1, queued: 0 (waited: 115s)
In progress: 7, queued: 9 (waited: 136s)
In progress: 8, queued: 8 (waited: 157s)
In progress: 7, queued: 2 (waited: 178s)
In progress: 7, queued: 0 (waited: 199s)
In progress: 2, queued: 0 (waited: 220s)
In progress: 0, queued: 0 (waited: 241s)
In progress: 0, queued: 0 (waited: 262s)
In progress: 1, queued: 0 (waited: 282s)
In progress: 1, queued: 0 (waited: 303s)
In progress: 1, queued: 0 (waited: 324s)
In progress: 1, queued: 0 (waited: 345s)
In progress: 0, queued: 0 (waited: 365s)
In progress: 0, queued: 0 (waited: 386s)
In progress: 0, queued: 0 (waited: 407s)

Examine Experiment contents and explore the top model

List all the models that have been created within the project. Then you can learn more about the top performing model.

In [17]: 

# Get all the models for the project. The top model for the optimization metric on the validation set is the first one.
models = project.get_models()
models
Out [17]: 

[Model('RuleFit Classifier'),
 Model('RuleFit Classifier'),
 Model('Light Gradient Boosting on ElasticNet Predictions '),
 Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping'),
 Model('RandomForest Classifier (Gini)'),
 Model('Light Gradient Boosted Trees Classifier with Early Stopping'),
 Model('Elastic-Net Classifier (mixing alpha=0.5 / Binomial Deviance)'),
 Model('Elastic-Net Classifier (L2 / Binomial Deviance)'),
 Model('Generalized Additive2 Model'),
 Model('Keras Slim Residual Neural Network Classifier using Training Schedule (1 Layer: 64 Units)')]
In [18]:

# Retrieve the model object for the top performing model. Select one of the models to examine more closely.
# Note that model comparison is better conducted in the graphical user interface.

top_model = project.get_top_model(metric="LogLoss")
top_model.id
Out [18]:

'6543eacc46dd8c49b326ae86'
In [19]:

# Print the LogLoss and AUC for the top model
f"The top model is a {top_model.model_type} with a Log Loss of {top_model.metrics['LogLoss']['crossValidation']} and an AUC of {top_model.metrics['AUC']['crossValidation']}."
Out [19]:
'The top model is a RuleFit Classifier with a Log Loss of 0.521166 and an AUC of 0.664088.'
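
One rough way to put the Log Loss value in context is to compare it with a naive baseline that predicts the overall positive rate for every flight. The sketch below computes that baseline from the delay rate calculated earlier in this notebook. This is an approximation only: the rate was computed on the full training set, while the reported score is a cross-validation score, and the function name is hypothetical.

```python
import math


def baseline_log_loss(positive_rate: float) -> float:
    """Log Loss of a model that always predicts the positive rate."""
    p = positive_rate
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


positive_rate = 0.24172099087353324  # share of flights delayed more than 30 minutes
model_log_loss = 0.521166  # cross-validation Log Loss reported for the top model

baseline = baseline_log_loss(positive_rate)
improvement = baseline - model_log_loss

print(round(baseline, 4), round(improvement, 4))
```

A lower Log Loss is better, so a positive improvement means the model does better than always guessing the base rate.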

Feature Impact

Learn about the features used in the top performing model and visualize which features are driving model decisions.

In [20]:

# Get Feature Impact for the top model
feature_impact = top_model.get_or_request_feature_impact()

# Save feature impact in a pandas dataframe
fi_df = pd.DataFrame(feature_impact)

fi_df
  | featureName | impactNormalized | impactUnnormalized | redundantWith
--- | --- | --- | --- | ---
0 | Date | 1.000000 | 0.023229 | None
1 | Scheduled_Departure_Time | 0.949682 | 0.022060 | None
2 | Flight_Number | 0.591023 | 0.013729 | None
3 | Date (Day of Week) | 0.548815 | 0.012748 | None
4 | Origin_Airport | 0.506295 | 0.011761 | None
5 | Destination_Airport | 0.397192 | 0.009226 | None
6 | Scheduled_Departure_Time (Hour of Day) | 0.260896 | 0.006060 | None
7 | Carrier_Code | 0.036059 | 0.000838 | None
8 | Tail_Number | 0.000000 | 0.000000 | None
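
The two impact columns appear to be related by a simple rescaling. The sketch below assumes that the normalized impact is the unnormalized impact divided by the largest unnormalized value, which reproduces the displayed numbers to within rounding. It uses the values from the table above and does not call the DataRobot API.

```python
# Unnormalized impact values copied from the Feature Impact output above
unnormalized = {
    "Date": 0.023229,
    "Scheduled_Departure_Time": 0.022060,
    "Flight_Number": 0.013729,
    "Date (Day of Week)": 0.012748,
    "Origin_Airport": 0.011761,
    "Destination_Airport": 0.009226,
    "Scheduled_Departure_Time (Hour of Day)": 0.006060,
    "Carrier_Code": 0.000838,
    "Tail_Number": 0.000000,
}

# Assumption: normalization divides by the largest impact so the top feature scores 1.0
largest = max(unnormalized.values())
normalized = {name: value / largest for name, value in unnormalized.items()}

for name, value in normalized.items():
    print(f"{name}: {value:.4f}")
```

Small differences from the table in the last digits come from the unnormalized values being rounded to six decimal places in the display.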

Blueprint

See the full machine learning pipeline for the top model, including preprocessing steps.

In [21]:

# Retrieve the blueprint of the top model
blueprint = dr.Blueprint.get(project.id, top_model.blueprint_id)

print(blueprint.processes)
['One-Hot Encoding', 'Missing Values Imputed', 'Matrix of word-grams occurrences', 'RuleFit Classifier']

Register the model

Register the top model in the Model Registry. This will be possible with future versions of the DataRobot Python package.

Deploy the model

Deploy the model to a DataRobot prediction server.

In [22]:

# The default prediction server is used when making predictions against the deployment, and is a requirement for creating a deployment on DataRobot cloud.
prediction_server = dr.PredictionServer.list()[0]
prediction_server
Out [22]:

PredictionServer(https://mlops.dynamic.orm.datarobot.com)
In [23]:

# Create a new deployment

deployment = dr.Deployment.create_from_learning_model(
    top_model.id,
    label="Flight Delays Deployment",
    description="Deployment from notebook of top model",
    default_prediction_server_id=prediction_server.id,
)

deployment.id
Out [23]:

'6543ec6e6fcffd7022f1cf12'
In [24]:

# Update deployment settings to enable drift tracking

deployment.update_drift_tracking_settings(
    target_drift_enabled=True, feature_drift_enabled=True
)

Make batch predictions

Make predictions using the scoring dataset. There are many options for how to score new data. Read about all your choices for making predictions.

In [35]:

# Option 1: Score an in-memory Pandas DataFrame.

scoring_df = dr.Dataset.get(scoring_data.id).get_as_dataframe()
scoring_df.head(5)

Out [35]:

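  | Date | Carrier_Code | Origin_Airport | Flight_Number | Tail_Number | Destination_Airport | Scheduled_Departure_Time
--- | --- | --- | --- | --- | --- | --- | ---
0 | 1/29/23 | UA | SEA | 529 | N27724 | IAH | 5:00
1 | 1/29/23 | UA | PDX | 2075 | N466UA | IAH | 5:04
2 | 1/29/23 | UA | SEA | 2672 | N808UA | DEN | 5:15
3 | 1/29/23 | UA | PDX | 2493 | N463UA | DEN | 5:30
4 | 1/29/23 | UA | SEA | 1780 | N37536 | SFO | 5:30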
In [36]:

# The method returns a copy of the job status and the updated DataFrame with the predictions added.
# So your DataFrame will now contain the following extra columns:

job, scoring_df = dr.BatchPredictionJob.score_pandas(deployment.id, scoring_df)

scoring_df.head(5)
Streaming DataFrame as CSV data to DataRobot
Created Batch Prediction job ID 6543edb01cd1acb572c6059c
Waiting for DataRobot to start processing
Job has started processing at DataRobot. Streaming results.

Out [36]:

  | 0 | 1 | 2 | 3 | 4
--- | --- | --- | --- | --- | ---
Date | 1/29/23 | 1/29/23 | 1/29/23 | 1/29/23 | 1/29/23
Carrier_Code | UA | UA | UA | UA | UA
Origin_Airport | SEA | PDX | SEA | PDX | SEA
Flight_Number | 529 | 2075 | 2672 | 2493 | 1780
Tail_Number | N27724 | N466UA | N808UA | N463UA | N37536
Destination_Airport | IAH | IAH | DEN | DEN | SFO
Scheduled_Departure_Time | 5:00 | 5:04 | 5:15 | 5:30 | 5:30
TAKE_OFF_DELAY_GREATER_THAN_30_1_PREDICTION | 0.145428 | 0.063791 | 0.093707 | 0.060908 | 0.144848
TAKE_OFF_DELAY_GREATER_THAN_30_0_PREDICTION | 0.854572 | 0.936209 | 0.906293 | 0.939092 | 0.855152
TAKE_OFF_DELAY_GREATER_THAN_30_PREDICTION | 0 | 0 | 0 | 0 | 0
THRESHOLD | 0.5 | 0.5 | 0.5 | 0.5 | 0.5
POSITIVE_CLASS | 1 | 1 | 1 | 1 | 1
DEPLOYMENT_APPROVAL_STATUS | APPROVED | APPROVED | APPROVED | APPROVED | APPROVED
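
The prediction columns above can be read together: the two class probabilities sum to 1, and the predicted label depends on how the positive-class probability compares with the threshold. The sketch below reproduces that logic for the five rows shown. It assumes a probability at or above the threshold is labeled 1; the function name is hypothetical, and the 0.1 threshold is only an illustration of how a lower cutoff would change the labels.

```python
# Probabilities copied from the five scored rows displayed above
positive_probs = [0.145428, 0.063791, 0.093707, 0.060908, 0.144848]
negative_probs = [0.854572, 0.936209, 0.906293, 0.939092, 0.855152]


def apply_threshold(prob: float, threshold: float = 0.5) -> int:
    """Assumed rule: label 1 when the positive-class probability reaches the threshold."""
    return 1 if prob >= threshold else 0


labels_default = [apply_threshold(p) for p in positive_probs]
labels_lower = [apply_threshold(p, threshold=0.1) for p in positive_probs]

print(labels_default)  # matches the TAKE_OFF_DELAY_GREATER_THAN_30_PREDICTION row
print(labels_lower)
```

With a base delay rate near 24 percent, few flights reach a 0.5 probability, so a lower threshold may be worth considering if missing a delay is costly.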

Review deployment metrics

Having scored some data, you can now view the monitoring metrics associated with the deployment. Learn about all monitoring metrics associated with the Deployment.

In [ ]:
# Wait a few seconds to be sure that monitoring metrics will be available when requested
import time

time.sleep(15)  # Sleep for 15 seconds
In [38]:

# Returns the total number of predictions made

from datarobot.enums import SERVICE_STAT_METRIC

service_stats = deployment.get_service_stats()

service_stats[SERVICE_STAT_METRIC.TOTAL_PREDICTIONS]
Out [38]:

1648
In [39]:

# Returns the drift in the target feature

from datarobot.enums import DATA_DRIFT_METRIC

target_drift = deployment.get_target_drift()
target_drift.drift_score
Out [39]:

0.41781727459845186
In [40]:

# Retrieves drift metrics for features other than the target

feature_drift_data = deployment.get_feature_drift()

feature_drift_data
Out [40]:

[FeatureDrift(6543eacc46dd8c49b326ae86 | Flight_Number | 2023-10-26 19:00:00+00:00 - 2023-11-02 19:00:00+00:00),
 FeatureDrift(6543eacc46dd8c49b326ae86 | Date (Day of Week) | 2023-10-26 19:00:00+00:00 - 2023-11-02 19:00:00+00:00),
 FeatureDrift(6543eacc46dd8c49b326ae86 | Origin_Airport | 2023-10-26 19:00:00+00:00 - 2023-11-02 19:00:00+00:00),
 FeatureDrift(6543eacc46dd8c49b326ae86 | Destination_Airport | 2023-10-26 19:00:00+00:00 - 2023-11-02 19:00:00+00:00),
 FeatureDrift(6543eacc46dd8c49b326ae86 | Scheduled_Departure_Time (Hour of Day) | 2023-10-26 19:00:00+00:00 - 2023-11-02 19:00:00+00:00),
 FeatureDrift(6543eacc46dd8c49b326ae86 | Carrier_Code | 2023-10-26 19:00:00+00:00 - 2023-11-02 19:00:00+00:00),
 FeatureDrift(6543eacc46dd8c49b326ae86 | Tail_Number | 2023-10-26 19:00:00+00:00 - 2023-11-02 19:00:00+00:00)]
In [41]:

# View the name and drift score of one feature (index 1, the second in the list)

feature_drift = feature_drift_data[1]

f"The feature {feature_drift.name} has drift by {feature_drift.drift_score}"
Out [41]:

'The feature Date (Day of Week) has drift by 3.176396672912841'
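
For intuition about what a drift score measures, the sketch below computes a Population Stability Index (PSI) between two categorical distributions. It assumes the drift score is a PSI-style measure; check the deployment's drift settings to confirm which metric is in use. The function name and the day-of-week shares are hypothetical, and the smoothing constant is a choice made for this sketch, so the result is not expected to match the score reported above.

```python
import math


def population_stability_index(expected, actual, epsilon=1e-6):
    """PSI between two lists of category shares that each sum to 1."""
    total = 0.0
    for e, a in zip(expected, actual):
        # Replace empty categories with a small value to avoid log(0)
        e = max(e, epsilon)
        a = max(a, epsilon)
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical shares: training flights spread evenly over seven weekdays,
# scoring flights all falling on a single weekday
training_share = [1 / 7] * 7
scoring_share = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

no_drift = population_stability_index(training_share, training_share)
strong_drift = population_stability_index(training_share, scoring_share)

print(no_drift, strong_drift)
```

The scoring rows displayed earlier are all dated 1/29/23, so a large drift score for Date (Day of Week) is what one would expect when a full month of training data is compared with a single day of scoring data.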