Machine Learning Project Checklist

July 21, 2022
Download the Machine Learning Project Checklist

Planning Machine Learning Projects

Machine learning and AI empower organizations to analyze troves of data, discover insights, and drive decision making. More organizations are investing in machine learning than ever before. With 87% of projects never making it to production, success hinges on diligent planning.

Data scientists need to understand the business problem and the project scope to assess feasibility, set expectations, define metrics, and design project blueprints. Close collaboration and alignment across business and technical teams will help ensure success.

Not every project needs machine learning. If there is no forward-looking predictive component to the use case, it can probably be addressed with analytics and visualizations applied to historical data.

Define the business problem

Understand the pain points and end goals for the use cases. Investigate whether the business problem can be solved with machine learning and has sufficient business impact to warrant such an approach. Inquire whether there is sufficient data to support machine learning.

Define project scope

Align on project vision and end results. Outline clear metrics to measure success. Document assumptions and risks to develop a risk management strategy.

Identify project stakeholders

Stakeholders from business, legal, and IT should be involved. Machine learning models created in silos are rarely implemented.

Assess the infrastructure

Evaluate the computing resources and development environment that the data science team will need. Small projects can potentially be completed on employee laptops, but work done there is hard to share and to keep under version control. Large projects or those involving text, images, or streaming data may need specialized infrastructure. Enterprise platforms, such as DataRobot, offer out-of-the-box integrations, support for multimodal data, a unified environment for collaboration, and enterprise governance to help teams accelerate AI delivery.

Invest in solutions compatible with your cloud and on-premises requirements

The infrastructure team may want models deployed on a major cloud platform (such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure), in your on-premises data center, or both. Models could also be deployed into multiple environments at once. Ensure your solutions offer flexibility and avoid being locked into any single technical infrastructure or cloud platform.

Identify a consumption strategy

Discuss how the stakeholders want to interact with the machine learning model after it is built. Model deployment can vary in complexity depending on business requirements. Predictions can be made in batches or in real time. Predictions can be saved to a database or used immediately in another process.

Plan for ongoing maintenance and enhancement

Discuss with stakeholders how accuracy and data drift will be monitored. Agree on acceptable levels of model performance degradation before redevelopment is needed. Decide who owns monitoring and who owns model redevelopment.
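
One way to make "acceptable degradation" concrete is to agree on a numeric drift score and a threshold. The sketch below uses the population stability index (PSI), which is one common choice and not something this checklist prescribes. The bin counts and the threshold value are hypothetical placeholders.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)  # eps avoids log(0) on empty bins
        a_share = max(a / a_total, eps)
        score += (a_share - e_share) * math.log(a_share / e_share)
    return score

# Hypothetical counts per bin for one feature.
training_bins = [100, 300, 400, 200]
recent_bins = [250, 350, 300, 100]

# Placeholder value; agree on the real threshold with stakeholders.
DRIFT_THRESHOLD = 0.2

drift_score = psi(training_bins, recent_bins)
needs_review = drift_score > DRIFT_THRESHOLD
```

Running a check like this on a schedule, with a named owner for the result, turns the agreement above into something that can be audited.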

Exploring and Transforming Data

Clean and transform your dataset as needed. Good data curation and data preparation lead to more practical, accurate model outcomes.

Create the target variable

Define the exact calculation for the target variable or create a couple of options to test. There will be several reasonable choices for most use cases. For a customer churn use case, churn could be defined as “no purchases in the next 30 days” or “no purchases in the next 180 days” or “subscription is canceled.” For a credit risk model, the target could be defined as “fully repays loan” or “payments in first 2 years are current” or “collateral is repossessed.”
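
The choice of definition changes the labels, as a minimal sketch of the two churn windows shows. The function name, the purchase history, and the reference date are all hypothetical.

```python
from datetime import date, timedelta

def churned(purchase_dates, as_of, window_days):
    """Return 1 if the customer made no purchase in the window after as_of."""
    window_end = as_of + timedelta(days=window_days)
    bought_again = any(as_of < d <= window_end for d in purchase_dates)
    return 0 if bought_again else 1

# Hypothetical purchase history for one customer.
purchases = [date(2022, 1, 5), date(2022, 3, 20)]
as_of = date(2022, 1, 31)

# The same customer gets a different label under each definition.
churn_30 = churned(purchases, as_of, window_days=30)    # window ends 2 March
churn_180 = churned(purchases, as_of, window_days=180)  # 20 March falls inside
```

Here the customer counts as churned under the 30-day definition but not under the 180-day one, which is why it can be worth testing more than one target.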

Reshape and aggregate data as necessary

Data may need to be reshaped from long to wide format. Rows irrelevant to the analysis (e.g., discontinued products) may need to be removed. Data aggregation such as from hourly to daily or from daily to weekly time steps may also be required.
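
As a rough illustration, the snippet below aggregates hourly rows to daily totals and then reshapes them from long to wide using only the Python standard library. The rows and column meanings are made up, and a dataframe library would typically do the same work in fewer lines.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical long-format rows: (timestamp, product, units sold).
long_rows = [
    ("2022-07-01 09:00", "A", 3),
    ("2022-07-01 15:00", "A", 2),
    ("2022-07-01 10:00", "B", 4),
    ("2022-07-02 11:00", "A", 1),
]

# Aggregate hourly rows to daily totals per product.
daily = defaultdict(int)
for stamp, product, units in long_rows:
    day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date().isoformat()
    daily[(day, product)] += units

# Reshape long to wide: one row per day, one column per product.
products = sorted({product for _, product in daily})
wide = {}
for (day, product), units in daily.items():
    row = wide.setdefault(day, {p: 0 for p in products})
    row[product] = units
```

Filling absent day and product combinations with 0 is itself an assumption; for some use cases a missing value is the more honest choice.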

Perform data quality checks and develop procedures for handling issues

Typical data quality checks and corrections include:

  • Missing data or incomplete records
  • Inconsistent data formatting (e.g., dashes and parentheses in telephone numbers)
  • Inconsistent units of measure (e.g., mixture of dollars and euros in a currency field)
  • Inconsistent coding of categorical data (e.g., mixture of abbreviations such as “TX” and full names such as “Texas”)
  • Outliers and anomalies (e.g., ages below 0 and over 150 years)
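
Several of the checks above can be expressed as small, testable functions. This is a sketch under stated assumptions: the records, the field names, and the two-entry state lookup are hypothetical, and a real project would need a complete lookup and a documented rule for each correction.

```python
import re

# Hypothetical lookup; a real project would cover every state.
STATE_CODES = {"texas": "TX", "california": "CA"}

def clean_phone(raw):
    """Keep digits only so '(512) 555-0100' and '512-555-0100' match."""
    return re.sub(r"\D", "", raw)

def clean_state(raw):
    """Map full names to two-letter codes; pass codes through in upper case."""
    value = raw.strip()
    return STATE_CODES.get(value.lower(), value.upper())

def age_is_plausible(age):
    """Flag missing ages and ages outside 0 to 150 for review."""
    return age is not None and 0 <= age <= 150

records = [
    {"phone": "(512) 555-0100", "state": "Texas", "age": 34},
    {"phone": "512-555-0100", "state": "tx", "age": -2},
    {"phone": "5125550100", "state": "TX", "age": None},
]

cleaned = [
    {
        "phone": clean_phone(r["phone"]),
        "state": clean_state(r["state"]),
        "age_ok": age_is_plausible(r["age"]),
    }
    for r in records
]
```

Flagging implausible ages instead of silently dropping the rows keeps the decision about how to handle them visible to the team.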

Engineer predictive features

Construct additional features to improve performance and accuracy in your machine learning models. Feature engineering may include binning and aggregating numeric features (e.g., average purchase in last 12 months), creating new categorical variables (e.g., summer/winter), and applying calculations (e.g., debt to income ratio). Sound knowledge of the business problem and your available data sources will enable the most effective feature engineering.
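
The three kinds of features mentioned above could look like this in code. The customer record, the field names, and the month split used for summer and winter are assumptions made for illustration.

```python
from statistics import mean

def season(month):
    """Coarse summer/winter flag; the month split is an assumption."""
    return "summer" if 4 <= month <= 9 else "winter"

def debt_to_income(monthly_debt, monthly_income):
    """Ratio feature; returns None when income is missing or zero."""
    if not monthly_income:
        return None
    return monthly_debt / monthly_income

# Hypothetical customer record.
customer = {
    "purchases_last_12m": [120.0, 80.0, 100.0],
    "signup_month": 7,
    "monthly_debt": 1500.0,
    "monthly_income": 5000.0,
}

features = {
    "avg_purchase_12m": mean(customer["purchases_last_12m"]),
    "signup_season": season(customer["signup_month"]),
    "debt_to_income": debt_to_income(
        customer["monthly_debt"], customer["monthly_income"]
    ),
}
```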

Standardize features

Many modeling techniques require numeric features standardized to have a mean of zero and a standard deviation of one. Log transformations may be appropriate before standardizing the data if its distribution is highly skewed.
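
A minimal sketch of a log transformation followed by standardization is shown below. The income values are hypothetical, and `log1p` is used here as one reasonable choice because it also handles zeros.

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical right-skewed incomes.
incomes = [20_000, 25_000, 30_000, 40_000, 1_000_000]

# log1p compresses the long right tail before standardizing.
logged = [math.log1p(v) for v in incomes]
z_scores = standardize(logged)
```

In practice the mean and standard deviation should be computed on the training data only and then reused when scoring new data, so that the model sees features on the same scale it was trained on.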

Building Machine Learning Models

Machine learning can be applied to numerous business scenarios. Outcomes depend on hundreds of factors — factors that are difficult or impossible for a human to monitor. Models produced from these factors require guardrails to ensure that you receive results you can trust before deploying to production.

Verify the languages supported by your production system

Write machine learning models in a language that your production system can understand. Your production environment needs to be able to read your models. Otherwise, re-coding can extend project timelines by weeks or months.

Select, train, and automate multiple machine learning models

Develop and compare multiple models that most accurately solve your business problem. Some considerations when comparing models include accuracy, retraining difficulty, and production performance.
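
Comparing candidates on held-out data can be sketched in a few lines. The example below pits a mean-only baseline against a one-feature least-squares line using a simple k-fold split. The data, the model choices, and the use of mean absolute error are illustrative assumptions, and a real comparison would also weigh retraining difficulty and production performance.

```python
from statistics import mean

def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    avg = mean(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Least-squares line through one numeric feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def cv_mae(fit, xs, ys, folds=3):
    """Mean absolute error averaged over k held-out folds."""
    errors = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mean([abs(model(xs[i]) - ys[i]) for i in test]))
    return mean(errors)

# Hypothetical data: ad spend (x) and sales (y).
spend = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sales = [12, 14, 17, 18, 21, 23, 24, 27, 29]

scores = {
    "mean_baseline": cv_mae(fit_mean, spend, sales),
    "linear": cv_mae(fit_line, spend, sales),
}
best_model = min(scores, key=scores.get)
```

Keeping a trivial baseline in the comparison shows how much of the accuracy actually comes from the model.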

Incorporate methodologies to address model drift and data drift

Shifting business needs may cause decreased model relevance, requiring models to be retrained. Retraining may also be warranted when a mismatch exists between the initial training dataset and the scored dataset, such as differences in seasonality, consumer preferences, and regulations. Adding retraining methodologies up front to address these concerns will save time.

Ensure predictions are explainable

Avoid the “black box” syndrome by incorporating feature explanations that describe model results. This helps you identify high-impact factors to focus business strategies, explain results to stakeholders, and steer model development to comply with regulations.
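
One model-agnostic way to produce feature explanations is permutation importance: shuffle one feature and measure how much the error grows. The checklist does not name a specific technique, so treat this as one possible sketch. The stand-in model, the feature names, and the data are hypothetical.

```python
import random

def model(row):
    """Stand-in for a trained model; depends on income, ignores shoe_size."""
    return 0.5 * row["income"]

def mean_abs_error(rows, targets):
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, feature, seed=0):
    """Increase in error after shuffling one feature's column."""
    baseline = mean_abs_error(rows, targets)
    shuffled = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return mean_abs_error(permuted, targets) - baseline

# Hypothetical data in which only income drives the target.
rows = [{"income": float(i), "shoe_size": float(i % 5)} for i in range(1, 21)]
targets = [0.5 * r["income"] for r in rows]

importance = {
    f: permutation_importance(rows, targets, f)
    for f in ("income", "shoe_size")
}
```

A feature the model ignores scores zero, while a feature it relies on scores above zero, which gives stakeholders a ranked view of what drives predictions.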

Test for bias to ensure fairness

Machine learning models may contain unintended bias that causes practical concerns, in addition to hindering performance. Testing, monitoring, and mitigating bias helps ensure models align with company ethics and culture.
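
A simple starting point for a bias test is to compare the rate of favorable predictions across groups. The sketch below computes per-group selection rates and their ratio. The predictions and group labels are hypothetical, and this is only one of several fairness measures a team might choose.

```python
from collections import defaultdict

def selection_rates(predictions, groups):
    """Share of positive predictions within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}

# Hypothetical approvals (1 = approved) and a protected attribute.
preds = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
group = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]

rates = selection_rates(preds, group)

# Ratio of the lowest to the highest selection rate; 1.0 means parity.
parity_ratio = min(rates.values()) / max(rates.values())
```

A ratio well below 1 is a signal to investigate further. What counts as acceptable is a policy decision to make with legal and business stakeholders.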

Deploying Machine Learning Models

Machine learning models can quickly turn from assets into liabilities in a volatile world. Successful model deployment and lifecycle management involve creating compliance documentation for highly regulated industries, well-defined MLOps processes, and strategies that keep your models in peak performance. These strategies enable you to scale AI adoption.

Create model compliance documentation for regulated industries

Highly regulated industries, such as banking, financial markets, and insurance, must comply with government regulations for model validation before a model can be put into production. This includes creating robust model development documentation based on centralized monitoring, management, and governance for deployed models.

Ensure well-defined MLOps processes

Scaling your models’ usage and value requires robust and repeatable production processes, including clear roles, procedures, and logging to support established controls. Model governance practices must be established to ensure consistent management and minimal risk when deploying and modifying models.

Deploy machine learning models

Models need to be deployed into production environments for practical decision-making. Deployments require coordination between data scientists, IT teams, software developers, and business professionals to ensure the models work reliably in production.

Monitor and observe results

Dashboards that display the agreed-upon success metrics are a key communication tool with business stakeholders. However, since users rarely review dashboards with consistency, alerts play a crucial role in notifying stakeholders of significant activities. Alerts can be used to highlight success and to detect anomalies.
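
An alert can be as simple as comparing each monitored metric with its agreed range. In this minimal sketch the metric names, ranges, and values are hypothetical, and delivery of the messages (email, chat, paging) is left out.

```python
def check_alerts(metrics, thresholds):
    """Return a message for every metric outside its agreed range."""
    alerts = []
    for name, value in metrics.items():
        low, high = thresholds[name]
        if not low <= value <= high:
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts

# Hypothetical metric names and agreed ranges.
thresholds = {"accuracy": (0.85, 1.0), "daily_predictions": (1000, 50000)}
todays_metrics = {"accuracy": 0.81, "daily_predictions": 12000}

alerts = check_alerts(todays_metrics, thresholds)
```

The same function could report positive milestones by adding ranges that describe success instead of failure.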

Accelerating Machine Learning Projects with DataRobot

Learn how your organization can accelerate machine learning projects with DataRobot. Collaborate in a unified environment built for continuous optimization across the entire machine learning lifecycle — from data to value.

About the author
Wei Shiang Kao

Former Growth Marketing Manager at DataRobot

Wei Shiang Kao worked closely with data science and marketing teams to drive adoption in the DataRobot AI platform. Wei has 10+ years of data analytics experience within the spaces of network automation, security, and content collaboration, tackling attribution challenges and steering budget. In his previous role, he transformed marketing analytics to build trust across the organization through transparency and clarity.

Wei holds a B.S. in Applied Mathematics from San Jose State University, and an MBA from Purdue University.
