Financial Planning and Analysis Workflow

Forecasting Horizontal Time Series

This notebook illustrates an end-to-end FP&A workflow in DataRobot. Time series forecasting in DataRobot has a huge suite of tools and approaches to handle highly complex multiseries problems.

Request a Demo

DataRobot will be used for the model training, selection, deployment, and creating forecasts. While this example will leverage a snapshot file as a datasource this workflow applies to any data source, e.g. Redshift, S3, Big Query, Synapse, etc.

This notebook will demonstrate how to use the Python API client to:

Connect to DataRobot
Import and preparation of data for time series modeling
Create a time series forecasting project and run Autopilot
Retrieve and evaluate model performance and insights
Making forward looking forecasts
Evaluating forecasts vs. historical trends
Deploy a model

Setup – Import libraries

In [40]:

from datetime import datetime as dt
from platform import python_version

import datarobot as dr
from datarobot.models.data_engine_query_generator import (
    QueryGeneratorDataset,
    QueryGeneratorSettings,
)
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

print("Python version:", python_version())
print("Client version:", dr.__version__)

Python version: 3.8.8

Client version: 3.3.1

Connect to DataRobot

To connect to DataRobot, you need to provide your API token and the endpoint. For more information, please refer to the following documentation:

Your API token can be found in the DataRobot UI in the Developer tools section, accessed from the profile menu in the top right corner. Copy the API token and paste in the cell below.

In [4]:

# Instantiate the DataRobot connection

# Get the token from the Developer Tools page in the DataRobot UI
DATAROBOT_API_TOKEN = ""

# Endpoint - This notebook uses the default endpoint for DataRobot Managed AI Cloud (US)
DATAROBOT_ENDPOINT = "https://app.datarobot.com/api/v2"  # This should be the URL you use to access the DataRobot UI

dr.Client(token=DATAROBOT_API_TOKEN, endpoint=DATAROBOT_ENDPOINT)

Out [4]:

<datarobot.rest.RESTClientObject at 0x7fe359b58bb0>

Import Data

We will be importing 4 years of raw transactional sales data and then wrangling that data, for each product segment, into a time series format for modeling

In [5]:

# Read in csv file to dataframe
df = pd.read_csv("storage/sales.csv")

# Convert 'Order Date' columns to datetime format
df["Order Date"] = pd.to_datetime(df["Order Date"])

# Display first few rows of data
df.head()

Out [5]:

	Order ID	Order Date	Segment	Sales	Quantity	Discount	Region
0	DE-163398	2021-11-09	Segment A	235.764	2	0	South
1	DE-152520	2021-11-09	Segment A	658.746	3	0	South
2	DE-151015	2021-06-13	Segment B	21.93	2	0	West
3	DE-140543	2020-10-11	Segment A	861.81975	5	0.45	South
4	DE-166730	2020-10-11	Segment A	20.1312	2	0.2	South

Data Preparation

We now need to transform our transactional data into a time series dataset with evenly spaced intervals. We will leverage DataRobot’s Data Prep for Time Series to transform the dataset and aggregate each product segment to the monthly level.

Generally speaking, DataRobot’s Data Prep for Time Series will aggregates the dataset to the selected time step, and, if missing rows are detected will impute the target value. Data Prep for Time Series allows you to choose aggregation methods for numeric, categorical, and text values. You can also use it to explore modeling at different time scales. The resulting dataset is then published to DataRobot’s AI Catalog.

In [6]:

# Upload the dataset to the AI Catalog
dataset = dr.Dataset.upload(df)

# Rename the entry in AI Catalog
dataset.modify(name="Transactional_Sales_Data", categories=dataset.categories)

# Create a time series data prep query generator from the dataset we just uploaded to AI Catalog
query_generator_dataset = QueryGeneratorDataset(
    alias="Transactional_Sales_Data",
    dataset_id=dataset.id,
    dataset_version_id=dataset.version_id,
)

# Set the parameters for our time series Data Prep
query_generator_settings = QueryGeneratorSettings(
    datetime_partition_column="Order Date",  # Date/time feature used as the basis for partitioning
    time_unit="MONTH",  # Time unit (seconds, days, months, etc.) that comprise the time step
    time_step=1,  # Number of (time) units that comprise the time step.
    default_numeric_aggregation_method="sum",  # Aggregate the target using either mean & most recent or sum & zero
    default_categorical_aggregation_method="last",  # Aggregate categorical features using the most frequent value or the last value within the aggregation time step.
    target="Sales",  # Numeric column in the dataset to predict.
    multiseries_id_columns=[
        "Segment"
    ],  # Column containing the series identifier, which allows DataRobot to process the dataset as a separate time series.
    default_text_aggregation_method="meanLength",  # Choose ignore to skip handling of text features or aggregate by: 'concat', 'last', 'meanLength', 'mostFrequent', 'totalLength'
    start_from_series_min_datetime=True,  # Basis for the series start date, either the earliest date for each series (per series) or the earliest date found for any series (global).
    end_to_series_max_datetime=True,  # Basis for the series end date, either the last entry date for each series (per series) or the latest date found for any series (global).
)
query_generator = dr.DataEngineQueryGenerator.create(
    generator_type="TimeSeries",
    datasets=[query_generator_dataset],
    generator_settings=query_generator_settings,
)

# Prep the training dataset
training_dataset = query_generator.create_dataset()

# Rename the entry in AI Catalog
training_dataset.modify(
    name="ts_monthly_training", categories=training_dataset.categories
)

Exploratory Data Analysis

Let’s visualize the monthly sales for each segment in order to quickly identify anomalies, evaluate patterns and seasonality, and identify anything that may warrant further expoloration.

In [42]:

# Load the dataset into a pandas dataframe
training_df = training_dataset.get_as_dataframe()

# Convert 'Order Date' to datetime format and sort
training_df["Order Date"] = pd.to_datetime(training_df["Order Date"])

# Adding Total Sales as an additional segment
total_sales = training_df.groupby("Order Date").agg({"Sales": "sum"}).reset_index()
total_sales["Segment"] = "Total"
training_df = pd.concat([training_df, total_sales], ignore_index=True)

# Visualize our data:
fig = go.Figure()

# Line chart for monthly sales by segment
fig.add_trace(
    go.Scatter(
        x=training_df[training_df["Segment"] == "Segment A"]["Order Date"],
        y=training_df[training_df["Segment"] == "Segment A"]["Sales"],
        mode="lines",
        name="Segment A",
    )
)
fig.add_trace(
    go.Scatter(
        x=training_df[training_df["Segment"] == "Segment B"]["Order Date"],
        y=training_df[training_df["Segment"] == "Segment B"]["Sales"],
        mode="lines",
        name="Segment B",
    )
)
fig.add_trace(
    go.Scatter(
        x=training_df[training_df["Segment"] == "Segment C"]["Order Date"],
        y=training_df[training_df["Segment"] == "Segment C"]["Sales"],
        mode="lines",
        name="Segment C",
    )
)
fig.add_trace(
    go.Scatter(
        x=training_df[training_df["Segment"] == "Total"]["Order Date"],
        y=training_df[training_df["Segment"] == "Total"]["Sales"],
        mode="lines",
        name="Total",
    )
)

fig.update_layout(
    title="Monthly Sales by Segment and Total",
    xaxis_title="Date",
    yaxis_title="Sales",
    template="plotly_dark",
)
fig.show()

Create a time series forecasting project and run Autopilot

We can create a project using our dataset in the AI Catalog:

In [8]:

# Create a new DataRobot project
project = dr.Project.create_from_dataset(
    project_name="Monthly_Sales_Forecast", dataset_id=training_dataset.id
)

In [9]:

# Quick link to the DataRobot project you just created
# Note: the get_uri for projects goes to the Model tab. This won't be populated yet since we haven't run Autopilot.
# Switch to the Data tab in the UI after following the url to get to the project setup section.
print("DataRobot Project URL: " + project.get_uri())
print("Project ID: " + project.id)

DataRobot Project URL: https://app.datarobot.com/projects/659c86c4f40076f93f064d79/models Project ID: 659c86c4f40076f93f064d79

Configure time-series modeling settings Time-series projects have a number of parameters we can adjust. This includes:

Multi-series (i.e. Series ID column)
Backtest partitioning
Feature Derivation Window
Forecast Window
Known-in-advance (KA) Variables
Do not derive (DND) Variables
Calendars

In [10]:
# Set Time Series Parameters
# Feature Derivation Window
# What rolling window should DataRobot use to derive features?
FDW = [(-6, 0)]

# Forecast Window
# Which future values do you want to forecast? (i.e. Forecast Distances)
FW = [(1, 12)]

# Known In Advance features
# Features that will be known at prediction time - all other features will go through an iterative feature engineering and selection process to create time-series features.
FEATURE_SETTINGS = []
KA_VARS = []
for column in KA_VARS:
    FEATURE_SETTINGS.append(
        dr.FeatureSettings(column, known_in_advance=True, do_not_derive=False)
    )

# Calendar
# Create a calendar file from a dataset to see how specific events by date contribute to better model performance
CALENDAR = dr.CalendarFile.create_calendar_from_country_code(
    country_code="US",
    start_date=min(training_df["Order Date"]),  # Earliest date in calendar
    end_date=max(training_df["Order Date"]),
)  # Last date in calendar

We pass all our settings to a DatetimePartitioningSpecification object which will then be passed to our Autopilot process.

In [11]:

# Create DatetimePartitioningSpecification
# The DatetimePartitioningSpecification object is how we pass our settings to the project
time_partition = dr.DatetimePartitioningSpecification(
    # General TS settings
    use_time_series=True,
    datetime_partition_column="Order Date",  # Date column
    multiseries_id_columns=["Segment"],  # Multi-series ID column
    # FDW and FD
    forecast_window_start=FW[0][0],
    forecast_window_end=FW[0][1],
    feature_derivation_window_start=FDW[0][0],
    feature_derivation_window_end=FDW[0][1],
    # Advanced settings
    feature_settings=FEATURE_SETTINGS,
    calendar_id=CALENDAR.id,
)

Start modeling with autopilot

To start the Autopilot process, call the analyze_and_model function. Provide the prediction target and our DatetimePartitioningSpecification as part of the function call. We have several modes to spin up Autopilot – in this demo, we will use the default “Quick” mode.

In [12]:

# Start Autopilot
project.analyze_and_model(
    # General parameters
    target="Sales",  # Target to predict
    worker_count=-1,  # Use all available modeling workers for faster processing
    # TS options
    partitioning_method=time_partition,  # Feature settings
)

Out [12]:

Project(Monthly_Sales_Forecast)

In [13]:

# If you want to wait for Autopilot to finish, run this code
# You can set verbosity to 1 if you want to print progress updates as Autopilot runs
project.wait_for_autopilot(verbosity=0)

Retrieve and evaluate model performances and insights

After Autopilot completes, you can easily evaluate your model results. Evaluation can include compiling the Leaderboard as a dataframe, measuring performances across different backtest partitions with different metrics, visualizing the accuracy across series, analyzing Feature Impact and Feature Effects to understand each models’ behaviors, and more. This can be done for every single model created by DataRobot.

As a simple example in this notebook, we identify the best model created by Autopilot and evaluate:

RMSE performance
MASE performance
Accuracy for Time for various Forecast Distance and Series combinations
Feature Impact of Top 10 features
Compare Accuracy across Series

In [14]:

# Unlock the holdout set within the project
project.unlock_holdout()

Out [14]:

Project(Monthly_Sales_Forecast)

In [15]:

# Identify the best model by the optimization metric
metric_of_interest = project.metric

# Get all models
all_models = project.get_datetime_models()

# Extract models that have a "All Backtests" performance evaluation for our metric
best_models = sorted(
    [model for model in all_models if model.metrics[project.metric]["backtesting"]],
    key=lambda m: m.metrics[project.metric]["backtesting"],
)

# Iterate through the models and extract model metadata and performance
scores = pd.DataFrame()

df_list = []  # This will store each individual DataFrame to concatenate at the end

for m in best_models:
    model_performances = pd.DataFrame(
        [
            {
                "Project_Name": project.project_name,
                "Project_ID": project.id,
                "Model_ID": m.id,
                "Model_Type": m.model_type,
                "Featurelist": m.featurelist_name,
                "Optimization_Metric": project.metric,
                "Partition": "All backtests",
                "Value": m.metrics[project.metric]["backtesting"],
            }
        ]
    )
    df_list.append(model_performances)  # Append the DataFrame to the list

# Concatenate all DataFrames in the list
scores = pd.concat(df_list, ignore_index=True)


# Sort by performance value
scores = scores.sort_values(
    by="Value", ascending=True
)  # Sort ascending so best model (lowest RMSE) is first
scores
Project_Name	Proj

Out [15]:

	0	1	2	3
Project_Name	Monthly_Sales_Forecast	Monthly_Sales_Forecast	Monthly_Sales_Forecast	Monthly_Sales_Forecast
Project_ID	659c86c4f40076f93f064d79	659c86c4f40076f93f064d79	659c86c4f40076f93f064d79	659c86c4f40076f93f064d79
Model_ID	659c879d224b220fd75be42e	659c879d224b220fd75be426	659c879d224b220fd75be427	659c879d224b220fd75be431
Model_Type	eXtreme Gradient Boosted Trees Regressor with …	eXtreme Gradient Boosting on ElasticNet Predic…	Light Gradient Boosting on ElasticNet Predicti…	eXtreme Gradient Boosting on ElasticNet Predic…
Featurelist	With Differencing (average baseline)	No Differencing	No Differencing	With Differencing (latest)
Optimization_Metric	Gamma Deviance	Gamma Deviance	Gamma Deviance	Gamma Deviance
Partition	All backtests	All backtests	All backtests	All backtests
Value	0.10532	0.13968	0.16402	6694.53784

In [16]:

# Select the top model in our project for further evaluation
top_model = dr.Model.get(project=project.id, model_id=scores["Model_ID"][0])

# Quick link to the recommended model built by Autopilot
print("Top Model URL: " + top_model.get_uri())
print("Top Model Type: " + top_model.model_type)

Top Model URL: https://app.datarobot.com/projects/659c86c4f40076f93f064d79/models/659c879d224b220fd75be42e

Top Model Type: eXtreme Gradient Boosted Trees Regressor with Early Stopping (learning rate =0.3)

In [17]:

print(
    "Top Model RMSE performance (All Backtests): "
    + str(top_model.metrics["RMSE"]["backtesting"])
)
print(
    "Top Model MASE performance (All Backtests): "
    + str(top_model.metrics["MASE"]["backtesting"])
)

Top Model RMSE performance (All Backtests): 6781.43916

Top Model MASE performance (All Backtests): 0.47072

Get Accuracy Over Time

DataRobot provides two helpful views of our forecasts out-of-the-box:

Accuracy Over Time fixes the forecast distance and visualizes the corresponding forecast for each forecasted date.
Forecast vs Actual sets a specific forecast point and visualizes the corresponding forecasts for the entire forecast window. We can pull the results out for either analysis. As a demonstration, we will generate the Accuracy Over Time plots for forecast distances of 1 day and 7 day.

In [18]:

# Get Accuracy over Time for FD=1, Averaged for all series
acc_plot_FD1_Avg = top_model.get_accuracy_over_time_plot(
    backtest=1, forecast_distance=1, series_id=None
)

# Convert to dataframe
df = pd.DataFrame.from_dict(acc_plot_FD1_Avg.bins)

# Plotly Graph
fig = go.Figure()

# Adding traces for "predicted" and "actual"
fig.add_trace(
    go.Scatter(x=df["start_date"], y=df["predicted"], mode="lines", name="Predicted")
)
fig.add_trace(
    go.Scatter(x=df["start_date"], y=df["actual"], mode="lines", name="Actual")
)

# Update layout for better visualization
fig.update_layout(
    title="Accuracy Over Time for FD=1",
    xaxis_title="Start Date",
    yaxis_title="Value",
    legend_title="Legend",
)

# Display the plot

# Update layout
fig.update_layout(template="plotly_dark")
fig.show()

In [47]:

# Get Accuracy over Time for FD=6, For just Segment A
acc_plot_FD6_Consumer = top_model.get_accuracy_over_time_plot(
    backtest=0, forecast_distance=6, series_id="Segment A"
)

# Convert to dataframe
df = pd.DataFrame.from_dict(acc_plot_FD6_Consumer.bins)

# Plotly Graph
fig = go.Figure()

# Adding traces for "predicted" and "actual"
fig.add_trace(
    go.Scatter(x=df["start_date"], y=df["predicted"], mode="lines", name="Predicted")
)
fig.add_trace(
    go.Scatter(x=df["start_date"], y=df["actual"], mode="lines", name="Actual")
)

# Update layout for better visualization
fig.update_layout(
    title="Accuracy Over Time for FD=6 (Segment A)",
    xaxis_title="Start Date",
    yaxis_title="Value",
    legend_title="Legend",
    template="plotly_dark",
)

# Display the plot
fig.show()

Retrieve Feature Impact

As an example of model explainability, calculate the Feature Impact values of the model using the get_or_request_feature_impact function.

In [20]:

# get model
top_model = dr.Model.get(project=project.id, model_id=scores["Model_ID"][2])

# Request and retrieve feature impact
feature_impacts = (
    top_model.get_or_request_feature_impact()
)  # Will trigger Feature Impact calculations if not done
FI_df = pd.DataFrame(feature_impacts)  # Convert to dataframe

# Sort features by Normalized Feature Impact
FI_df = FI_df.sort_values(by="impactNormalized", ascending=False)

# Take top 10
FI_df = FI_df[0:5]

# Plotly Graph
fig = go.Figure()

# Add bar trace
fig.add_trace(
    go.Bar(y=FI_df["featureName"], x=FI_df["impactNormalized"], orientation="h")
)

# Update layout for better visualization
fig.update_layout(
    title="Feature Impact",
    xaxis=dict(title="Normalized Feature Impact", range=[0, 1.1]),
    yaxis=dict(title="Feature", autorange="reversed"),
    margin=dict(
        l=200
    ),  # this is to ensure the y-labels (feature names) are not cut off
    template="plotly_dark",
)

# Display the plot
fig.show()

Analyze Accuracy for Each Series

The Series Insight tool provides the ability to compute the accuracy for each indivudal series. This is especially powerful to help us identify which series the model is doing particularly better or worse in forecasting.

In this demonstration, we see that the model has particularly high RMSE for the Savannah and Louisville store forecasts. We may consider refining our model by splitting those two series into a separate model as a future modeling experiment.

In [21]:

# Trigger the Series Insight computation
series_insight_job = top_model.compute_series_accuracy()
series_insight_job.wait_for_completion()  # Complete job before progressing

In [22]:

# Retrieve Series Accuracy
model_series_insight = top_model.get_series_accuracy_as_dataframe(
    metric="RMSE", order_by="backtestingScore"
)

# Unlist 'multiseriesValues' to 'Series' column
model_series_insight["multiseriesValues"] = model_series_insight[
    "multiseriesValues"
].apply(lambda x: x[0])
model_series_insight.rename(columns={"multiseriesValues": "Series"}, inplace=True)

# View
model_series_insight

Out [22]:

	0	1	2
multiseriesId	659c873bbfbeaeb16bd40181	659c873bbfbeaeb16bd4017f	659c873bbfbeaeb16bd40180
Series	Segment C	Segment A	Segment B
rowCount	48	48	48
duration	P3Y11M0D	P3Y11M0D	P3Y11M0D
startDate	2019-01-01T00:00:00.000000Z	2019-01-01T00:00:00.000000Z	2019-01-01T00:00:00.000000Z
endDate	2022-12-01T00:00:00.000000Z	2022-12-01T00:00:00.000000Z	2022-12-01T00:00:00.000000Z
validationScore	13365.29888	12213.31966	10130.19989
backtestingScore	4402.5372	7665.218465	8543.374845
holdoutScore	5592.83073	15377.09417	24672.35535
targetAverage	10741.32871	21776.27522	22067.07396

In [23]:

# Create a scatter plot with Plotly
fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=model_series_insight["Series"],
        y=model_series_insight["backtestingScore"],
        mode="markers",
    )
)

# Update the layout
fig.update_layout(
    title="Performance by Segment",
    xaxis=dict(title="Series", tickangle=45),
    yaxis=dict(title="RMSE"),
)

# Display the plot
fig.update_layout(template="plotly_dark")
fig.show()

Make new predictions with a test dataset

We can make new predictions directly on the leaderboard by uploading new test datasets to the project. We can then score the test dataset with any model on the leaderboard and retrieve the results.

In this notebook, we will load the data into the notebook and upload to the project. As with the training dataset, you can also use the JDBC connector to ingest the test dataset into the AI catalog and then directly use the dataset ID in the request_predictions call. The JDBC path provides more efficient upload of large datasets to DataRobot.

For time-series predictions, DataRobot expects new prediction datasets to have rows for each new date we want to forecast. These rows should have values for the known-in-advance features and NA everywhere else. For example, since the model is forecasting 1-7 days out from each forecast point, we will have 7 new rows (corresponding from June 15 to June 21, 2014).

In [24]:

# Upload data to modeling project
df = training_dataset.get_as_dataframe()
test_dataset = project.upload_dataset(df)

# Get frozen model
frozen_model = all_models[0]

# Request Predictions
pred_job = frozen_model.request_predictions(
    dataset_id=test_dataset.id,
    include_prediction_intervals=True,
    prediction_intervals_size=85,
)

preds = pred_job.get_result_when_complete()

preds.head(5)

Out [24]:

	0	1	2	3	4
row_id	144	145	146	147	148
prediction	34059.37165	34059.37165	34059.37165	34059.37165	34059.37165
forecast_distance	1	2	3	4	5
forecast_point	2022-12-01T00:00:00.000000Z	2022-12-01T00:00:00.000000Z	2022-12-01T00:00:00.000000Z	2022-12-01T00:00:00.000000Z	2022-12-01T00:00:00.000000Z
timestamp	2023-01-01T00:00:00.000000Z	2023-02-01T00:00:00.000000Z	2023-03-01T00:00:00.000000Z	2023-04-01T00:00:00.000000Z	2023-05-01T00:00:00.000000Z
series_id	Segment A	Segment A	Segment A	Segment A	Segment A
prediction_interval_lower_bound	26322.56316	25439.79651	26978.56112	24518.98477	26594.81305
prediction_interval_upper_bound	46154.14547	46434.68766	43444.78734	46113.28592	45538.58923

In [25]:

# Step 1: Rename columns in preds to match training_df
renamed_preds = preds.rename(
    columns={"prediction": "Sales", "timestamp": "Order Date", "series_id": "Segment"}
)

# Step 2: Drop columns from preds that are not in training_df
columns_to_keep = [
    "Segment",
    "Order Date",
    "Sales",
]  # Columns from preds that correspond to training_df
modified_preds = renamed_preds[columns_to_keep]

Join Forecasts to Actuals

In [26]:

# Function to evaluate date time values


def convert_or_localize_to_utc(series):
    if series.dt.tz is not None:  # If it's already timezone-aware
        return series.dt.tz_convert("UTC")
    return series.dt.tz_localize("UTC")  # If it's naive, then localize it


# Convert or localize 'Order Date' in training_df to UTC
training_df["Order Date"] = pd.to_datetime(training_df["Order Date"])
training_df["Order Date"] = convert_or_localize_to_utc(training_df["Order Date"])

# Rename columns in prediction data
renamed_preds = preds.rename(
    columns={"prediction": "Sales", "timestamp": "Order Date", "series_id": "Segment"}
)

# Drop columns from prediction data that is not in training data
columns_to_keep = [
    "Segment",
    "Order Date",
    "Sales",
]  # Columns from preds that correspond to training_df
modified_preds = renamed_preds[columns_to_keep]

# Convert or localize 'Order Date' in preds_subset to UTC
preds_subset = modified_preds.copy()  # To avoid SettingWithCopyWarning
preds_subset["Order Date"] = pd.to_datetime(preds_subset["Order Date"])
preds_subset["Order Date"] = convert_or_localize_to_utc(preds_subset["Order Date"])

# Concatenate the DataFrames
combined_df = pd.concat([training_df, preds_subset], ignore_index=True)

# Sort by Segment and Order Date
combined_df.sort_values(by=["Segment", "Order Date"], inplace=True)

Visualize Forecasts With a Line Chart

In [27]:

# Initialize the figure
fig = go.Figure()

# Define a color mapping for segments (you can extend or modify this as needed)
color_map = {"Segment A": "blue", "Segment B": "green", "Segment C": "orange"}

# For each segment, plot actual sales and forecasted sales
for segment, color in color_map.items():
    segment_df = combined_df[combined_df["Segment"] == segment]
    actuals = segment_df[:-12]  # Adjust based on your actual data
    forecast = segment_df[-12:]

    fig.add_trace(
        go.Scatter(
            x=actuals["Order Date"],
            y=actuals["Sales"],
            mode="lines",
            name=f"{segment} Actuals",
            line=dict(color=color),  # Use the color from the color_map
        )
    )

    fig.add_trace(
        go.Scatter(
            x=forecast["Order Date"],
            y=forecast["Sales"],
            mode="lines",
            name=f"{segment} Forecast",
            line=dict(
                dash="dot", color=color
            ),  # Use the color from the color_map and make the line dotted
        )
    )

fig.update_layout(
    title="Actual + Forecasted Sales by Segment",
    xaxis_title="Order Date",
    yaxis_title="Sales",
    legend_title="Segments",
    template="plotly_dark",
)

fig.show()

Evaluate Forecasted Growth Metrics

FP&A teams typically use a variety of growth metrics to gain a comprehensive understanding of sales performance. Different metrics can provide unique insights into how a financial metrics is trending and what underlying factors might be driving those trends.

Month Over Month Growth (MoM Growth): evaluates the growth of sales (or any other metric) from one month to the next. MoM Growth can provide a short-term perspective on trends to quickly identify any sudden changes or spikes which might be due to seasonality, promotions, or other short-term factors. Useful for identifying short-term operational challenges or successes.
Monthly Year Over Year Growth (YoY Growth): evaluates a specific month to the sales in the same month the previous year. This view accounts for seasonality, as it compares the same month across years and can helps identify longer-term trends and determine if a particular month’s sales are genuinely growing or declining over the years.
Year to Date Year over Year Growth (YTD YoY Growth): evaluates the cumulative sum from the beginning of the year up to a specific month as compared to the same time period in the previous year. This provides an aggregated view of performance over a year, making it easier to see if the company is on track to meet annual targets. This can also mitigate any month-to-month volatility, providing a more consistent perspective on how the year is progressing.
Rolling 12 Month Year Over Year Growth (R12M YoY Growth): evaluates the most recent 12-month period to sales in the 12 months before that. This offers a continuously updated yearly perspective, irrespective of where you are in the calendar year and smoothens out short-term fluctuations and seasonality, as it always encompasses a full year of data.

Calculate the forecasted year over year growth

In [28]:

def calculate_growth(segment_data):
    forecast_sum = segment_data.tail(12)["Sales"].sum()
    prior_year_sum = segment_data.iloc[-24:-12]["Sales"].sum()
    two_years_ago_sum = segment_data.iloc[-36:-24]["Sales"].sum()

    forecasted_growth = (forecast_sum - prior_year_sum) / prior_year_sum * 100
    prior_year_growth = (prior_year_sum - two_years_ago_sum) / two_years_ago_sum * 100

    return forecast_sum, prior_year_sum, forecasted_growth, prior_year_growth


segments = combined_df["Segment"].unique()
growth_data = [
    (*calculate_growth(combined_df[combined_df["Segment"] == segment]), segment)
    for segment in segments
]

growth_df = pd.DataFrame(
    growth_data,
    columns=[
        "Forecasted Total",
        "Prior Year Total",
        "Forecasted Growth",
        "Prior Year Growth",
        "Segment",
    ],
)

# Calculate aggregate level
aggregate_prior_year = growth_df["Prior Year Total"].sum()
aggregate_forecast = growth_df["Forecasted Total"].sum()
aggregate_growth = (
    (aggregate_forecast - aggregate_prior_year) / aggregate_prior_year * 100
)

adjusted_prior_year_growth = (
    (
        aggregate_prior_year
        - growth_df["Prior Year Total"].sum()
        + growth_df.iloc[-1]["Prior Year Total"]
    )
    / (growth_df["Prior Year Total"].sum() - growth_df.iloc[-1]["Prior Year Total"])
    * 100
)

aggregate_row = pd.DataFrame(
    [
        {
            "Segment": "Total",
            "Prior Year Total": aggregate_prior_year,
            "Forecasted Total": aggregate_forecast,
            "Forecasted Growth": aggregate_growth,
            "Prior Year Growth": adjusted_prior_year_growth,
        }
    ]
)

growth_df = pd.concat([growth_df, aggregate_row], ignore_index=True)

Visualize the year over year growth in a bar chart

In [29]:

def plot_sales_data_with_table(growth_df):
    # Filter out the 'Total' row for the bar chart
    chart_df = growth_df[growth_df["Segment"] != "Total"]

    # Create subplots: one row for bar chart, one row for table
    fig = make_subplots(
        rows=2,
        cols=1,
        shared_xaxes=True,
        vertical_spacing=0.15,
        subplot_titles=(
            "Prior Year vs Forecasted Sales Totals by Segment",
            "Data Table",
        ),
        row_heights=[0.7, 0.3],
        specs=[[{"type": "xy"}], [{"type": "table"}]],
    )

    # Add bar chart to the first row of subplot
    fig.add_trace(
        go.Bar(
            name="Prior Year Total",
            x=chart_df["Segment"],
            y=chart_df["Prior Year Total"],
        ),
        row=1,
        col=1,
    )
    forecasted_total_bar = go.Bar(
        name="Forecasted Total", x=chart_df["Segment"], y=chart_df["Forecasted Total"]
    )
    fig.add_trace(forecasted_total_bar, row=1, col=1)
 
    # Overlay forecast growth on top of the Forecasted Total bars
    for i, segment in enumerate(chart_df["Segment"]):
        fig.add_annotation(
            x=segment,
            y=chart_df.loc[chart_df["Segment"] == segment, "Forecasted Total"].values[
                0
            ],
            text=f"{chart_df.loc[chart_df['Segment'] == segment, 'Forecasted Growth'].values[0]:.2f}%",
            showarrow=False,
            font=dict(color="white"),
            row=1,
            col=1,
        )

    # Change the bar mode to 'group'
    fig.update_layout(barmode="group")

    # Specify the order of columns for the table
    ordered_columns = [
        "Segment",
        "Prior Year Total",
        "Forecasted Total",
        "Forecasted Growth",
        "Prior Year Growth",
    ]

    # Format the numeric values with commas and two decimal points
    formatted_data = []
    for col in ordered_columns:
        if growth_df[col].dtype in [float, "float64"]:
            # Format float columns
            formatted_data.append(growth_df[col].map("{:,.2f}".format).tolist())
        elif growth_df[col].dtype in [int, "int64"]:
            # Format integer columns
            formatted_data.append(growth_df[col].map("{:,}".format).tolist())
        else:
            # Keep non-numeric columns as is
            formatted_data.append(growth_df[col].tolist())

    table_header = ordered_columns

    fig.add_trace(
        go.Table(header=dict(values=table_header), cells=dict(values=formatted_data)),
        row=2,
        col=1,
    )

    # Update layout
    fig.update_layout(template="plotly_dark")

    fig.show()


# Example usage with your growth_df dataframe
plot_sales_data_with_table(growth_df)

Calculate additional metrics to evaluate forecasted growth

In [30]:

# Create a copy of training_df with only the necessary columns
df_copy = combined_df[["Segment", "Order Date", "Sales"]].copy()

# Convert "Order Date" column of df_copy to datetime and set as index
df_copy["Order Date"] = pd.to_datetime(df_copy["Order Date"])
df_copy.set_index("Order Date", inplace=True)


def calculate_growth_metrics(segment_data):
    # Using .loc to set values to avoid SettingWithCopyWarning
    segment_data = segment_data.copy()
    segment_data.loc[:, "Year"] = segment_data.index.year
    segment_data.loc[:, "Month"] = segment_data.index.month

    # Group by year and month to get monthly sales and specify numeric_only=True
    monthly_sales = segment_data.groupby(["Year", "Month"]).sum(numeric_only=True)

    # Calculate growth metrics
    monthly_sales["MoM Growth"] = monthly_sales["Sales"].pct_change() * 100
    monthly_sales["Monthly YoY Growth"] = (
        monthly_sales["Sales"].pct_change(periods=12) * 100
    )
    monthly_sales["Rolling 12 Mth YoY Growth"] = (
        monthly_sales["Sales"].rolling(window=12).sum().pct_change(periods=12) * 100
    )

    # New YTD Growth logic
    monthly_sales["YTD"] = monthly_sales["Sales"].groupby(level=0).cumsum()
    previous_ytd = monthly_sales["YTD"].shift(12)
    monthly_sales["YTD YoY Growth"] = (
        (monthly_sales["YTD"] - previous_ytd) / previous_ytd
    ) * 100

    # Convert 'Year' and 'Month' back to actual date (the first day of each month)
    monthly_sales.reset_index(inplace=True)
    monthly_sales["Order Date"] = pd.to_datetime(
        monthly_sales[["Year", "Month"]].assign(DAY=1)
    )
    monthly_sales.drop(["Year", "Month"], axis=1, inplace=True)

    return monthly_sales


# Filter data based on segments and calculate metrics
segmenta_df = calculate_growth_metrics(df_copy[df_copy["Segment"] == "Segment A"])
segmentb_df = calculate_growth_metrics(df_copy[df_copy["Segment"] == "Segment B"])
segmentc_df = calculate_growth_metrics(df_copy[df_copy["Segment"] == "Segment C"])
total_df = calculate_growth_metrics(df_copy[df_copy["Segment"] == "Total"])

Visualize the additional metrics with line charts

# Function to plot a specific growth metric for all segments and show table below


def plot_metric_across_segments_with_table(metric_name, datasets, yaxis_range=None):
    # Create subplots: one row for line chart, one row for table
    fig = make_subplots(
        rows=2,
        cols=1,
        shared_xaxes=True,
        vertical_spacing=0.1,
        subplot_titles=(f"Comparison of {metric_name} Across Segments", "Data Table"),
        row_heights=[0.7, 0.3],
        specs=[[{"type": "xy"}], [{"type": "table"}]],
    )

    # Lists to store filtered data for the table
    table_dates = None
    table_metric_values = []

    # Add line plots to the first row of subplot
    for segment_name, segment_df in datasets.items():
        # Filter DataFrame to remove null values in the metric column and keep data from January 2014 onwards
        segment_df = segment_df[
            pd.notnull(segment_df[metric_name])
            & (segment_df["Order Date"] >= "2014-01-01")
        ]

        # Store filtered data for table
        if table_dates is None:
            table_dates = segment_df["Order Date"].tolist()
 # Format metric values to two decimal points
        formatted_metric_values = [
            round(val, 2) for val in segment_df[metric_name].tolist()
        ]
        table_metric_values.append(formatted_metric_values)

        fig.add_trace(
            go.Scatter(
                x=segment_df["Order Date"],
                y=segment_df[metric_name],
                mode="lines",
                name=segment_name,
            ),
            row=1,
            col=1,
        )

    # Create a table with the filtered data
    table_data = [table_dates] + table_metric_values
    table_header = ["Order Date"] + list(datasets.keys())

    fig.add_trace(
        go.Table(header=dict(values=table_header), cells=dict(values=table_data)),
        row=2,
        col=1,
    )

    # Update layout
    fig.update_layout(template="plotly_dark", yaxis_range=yaxis_range)

    fig.show()

# Datasets for all segments
datasets = {
    "Segment A": segmenta_df,
    "Segment B": segmentb_df,
    "Segment C": segmentc_df,
    "Total": total_df,
}

# Create the plots for each specific metric:
plot_metric_across_segments_with_table("MoM Growth", datasets, yaxis_range=[-200, 500])
plot_metric_across_segments_with_table(
    "Monthly YoY Growth", datasets, yaxis_range=[-200, 600]
)
plot_metric_across_segments_with_table(
    "YTD YoY Growth", datasets, yaxis_range=[-200, 300]
)
plot_metric_across_segments_with_table("Rolling 12 Mth YoY Growth", datasets)

Deploy Model to DataRobot ML Production for Monitoring and Governance

In [32]:

# Set the prediction server to deploy to
prediction_server_id = dr.PredictionServer.list()[
    0
].id  # EDIT THIS BASED ON THE PREDICTION SERVERS AVAILABLE TO YOU

# Set deployment details
deployment = dr.Deployment.create_from_learning_model(
    model_id=frozen_model.id,
    label="FP&A - Segment Forecasting Model",
    description="FP&A - Segment Forecasting Model",
    default_prediction_server_id=prediction_server_id,
)

Request Predictions from Deployment

In [33]:

# Score the dataset using the given deployment ID
job, predictions = dr.BatchPredictionJob.score_pandas(
    deployment.id, df
)  # Deployment ID and Scoring dataset

# Print a message to indicate that scoring has started
print("Started scoring...", job)

# Wait for the job to complete
job.wait_for_completion()

Streaming DataFrame as CSV data to DataRobot
Created Batch Prediction job ID 659c8b967ac471a283a743a2
Waiting for DataRobot to start processing
Job has started processing at DataRobot. Streaming results.
Started scoring... BatchPredictionJob(batchPredictions, '659c8b967ac471a283a743a2', status=INITIALIZING)

Clean Up

In [34]:

# # CLEAN UP - Uncomment and run this cell to remove everything you added during this session
deployment.delete()
# project.delete()

Next Steps and Additional Resources

Improve Accuracy

There are a number of approaches that we could apply to improve model accuracy. We can try:

Running DataRobot’s entire blueprint repository
Evaluating different feature derivation windows across projects
Evaluating different training lengths
Evaluating different blenders / ensemble models
Adding in additional data or other data sources (e.g. macroeconomic data)

Other Use Cases

Scenario analysis – evaluate what happens to our sales forecasts under certain conditions by leveraging known in advance variables in DataRobot
Long term strategic planning where we can forecast over longer time horizons than annual planning requires.
Risk management – model potential risks that have financial impacts, such as: churn, write-downs, etc.

Additional Resources

The DataRobot AI Accelerator Library has similar accelerators for other Ecosystem Integrations to use DataRobot with other tools (e.g. AWS, GCP, Azure, etc) as well as accelerators for more advanced time-series applications.
The DataRobot API user guide provides code examples covering topics such as model factories, classification problems, feature impact rank ensembling, and more.

To learn more about advanced workflows for handling complex and large scale time series problems:

Get Started with Free Trial

Experience new features and capabilities previously only available in our full AI Platform product.

Get Started Now