# How to Formulate a Machine Learning Question

January 11, 2018

Machine learning uses data to create a model that addresses a business question you want answered. You first need to understand the problem you want to solve. The format of your question influences what algorithm is used to solve the problem.

For example, say you are an e-commerce marketing manager and you want to run an email campaign to sell more products to past customers. You can ask different questions to determine your email campaign strategy. The form of the answer to each question indicates the type of machine learning problem. I will give hypothetical example questions for classification, regression, time series, natural language processing, and anomaly detection problems.

# Classification

The answer to your question about the email campaign may be categorical:

• “Based on past customer email data, should I email this customer?” The answer is either “yes” or “no.” Use the answer to determine email recipients.
• “Based on past purchasing patterns, which buyer group should this customer be segmented into?” Answers might fall into categories such as “high spender” and “low spender.”

These questions have categorical answers, which makes them classification problems.
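
As a rough illustration of the first question, here is a minimal sketch of a classifier that answers “yes” or “no” by majority vote among the most similar past customers (a k-nearest-neighbors approach). The data, the two features, and the `should_email` helper are all hypothetical and chosen only to show that the target is a category.

```python
from collections import Counter
from math import dist

# Hypothetical history: (emails opened, purchases made) -> was emailing worthwhile?
history = [
    ((9, 4), "yes"),
    ((7, 3), "yes"),
    ((8, 5), "yes"),
    ((1, 0), "no"),
    ((0, 1), "no"),
    ((2, 0), "no"),
]

def should_email(customer, k=3):
    """Return the majority label among the k most similar past customers."""
    nearest = sorted(history, key=lambda row: dist(row[0], customer))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(should_email((6, 2)))  # prints "yes"
print(should_email((1, 1)))  # prints "no"
```

The output is always one of a fixed set of labels, which is what makes this a classification problem rather than a regression problem.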

# Regression

The answer to your question may be numeric:

• “Based on past items per shopping cart, how many items will this customer put in a shopping cart?” Use the predicted items per cart to target customers for the email campaign.
• “Based on past transaction amounts in dollars, how much will this customer spend per transaction?” Use the predicted transaction amount to target customers.

These questions have numeric answers and can be considered regression problems.
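
To make the numeric target concrete, here is a minimal sketch that fits a straight line by ordinary least squares and predicts a transaction amount. The data is invented, and using the number of past orders as the only feature is an assumption made purely for illustration.

```python
from statistics import mean

# Hypothetical history: number of past orders and the next transaction amount in dollars
past_orders = [1, 2, 3, 4, 5]
next_amount = [20.0, 30.0, 40.0, 50.0, 60.0]

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(past_orders, next_amount)
predicted = slope * 6 + intercept  # customer with 6 past orders
print(f"{predicted:.2f}")  # prints 70.00
```

Unlike the classification case, the prediction can be any number on a continuous scale.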

# Time series

The question you’re asking may have an answer that changes over time:

• “When is the best date and time to send the email?” You would predict email open rates over time by date and hour of the day, then send at the time with the highest predicted open rate.
• “If I don’t send the email campaign, what will website traffic be?” You would predict website traffic without the campaign and compare it with traffic when the campaign runs to measure the campaign’s impact and decide whether it is worth it.

When there is a relationship between your target and time, you typically have a time series problem. Learn how to distinguish time series from other regression problems.
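
For the send-time question, a very simple baseline is to forecast each hour’s open rate as the average of the same hour on previous days. The sketch below assumes that approach and uses made-up observations; it is a seasonal-average baseline, not a full time series model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (day index, hour of day, open rate)
observations = [
    (0, 8, 0.10), (0, 12, 0.20), (0, 18, 0.30),
    (1, 8, 0.12), (1, 12, 0.22), (1, 18, 0.26),
    (2, 8, 0.14), (2, 12, 0.18), (2, 18, 0.34),
]

def forecast_by_hour(rows):
    """Forecast each hour's open rate as the mean of that hour on past days."""
    by_hour = defaultdict(list)
    for _day, hour, rate in rows:
        by_hour[hour].append(rate)
    return {hour: mean(rates) for hour, rates in by_hour.items()}

forecast = forecast_by_hour(observations)
best_hour = max(forecast, key=forecast.get)
print(best_hour)  # prints 18
```

The key difference from plain regression is that the observations are ordered in time and the prediction depends on when it is made.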

# Natural language processing

The answer to your question could have a language component:

• “What keywords and content should I include in the email?” You could use natural language processing to analyze customer reviews to determine whether the sentiment is positive or negative and get ideas for email content.
• “What do customers like about product x?” You could use natural language processing to analyze specific product reviews to decide what attributes to market in the email campaign.
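
As a rough sketch of the second question, one of the simplest text-analysis steps is counting which words appear most often in a product’s reviews. The reviews and the tiny stopword list below are hypothetical; real natural language processing pipelines usually go much further than word counts.

```python
import re
from collections import Counter

# Hypothetical reviews for one product
reviews = [
    "Love the battery life and the screen",
    "Battery lasts all day, great screen",
    "The battery is great but the case feels cheap",
]

# Small illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "and", "is", "but", "all", "a"}

def term_counts(texts):
    """Count lowercase words across all texts, ignoring stopwords."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts

counts = term_counts(reviews)
print(counts.most_common(1))  # prints [('battery', 3)]
```

Frequently mentioned attributes such as “battery” are candidates for what to highlight in the email.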

# Anomaly detection

The answer to your question may require you to distinguish between “normal” and “anomalous” observations:

• “Is the customer review from a bot account?” You could answer this question with anomaly detection by flagging accounts whose behavior differs sharply from typical reviewers.
• “Is this email address fake?” You could also answer this question with anomaly detection.
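
To illustrate the bot-account question, here is a minimal sketch that flags accounts whose daily review count is far from the median, using a modified z-score based on the median absolute deviation. The account names, the counts, and the 3.5 threshold are assumptions for illustration; a single feature like this would rarely be enough in practice.

```python
from statistics import median

# Hypothetical number of reviews posted per day by each account
reviews_per_day = {
    "acct_1": 1, "acct_2": 2, "acct_3": 1, "acct_4": 3,
    "acct_5": 2, "acct_6": 1, "acct_7": 2, "acct_8": 40,
}

def flag_anomalies(counts, threshold=3.5):
    """Return accounts whose modified z-score exceeds the threshold."""
    values = list(counts.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread in the data, so nothing can be scored
    return [
        account
        for account, value in counts.items()
        if 0.6745 * abs(value - center) / mad > threshold
    ]

print(flag_anomalies(reviews_per_day))  # prints ['acct_8']
```

The median-based score is used here because a single extreme value inflates the ordinary standard deviation and can hide itself in a small sample.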

# Conclusion

Not surprisingly, a “lack of clear question to answer” appeared as a major barrier for data scientists in Kaggle’s 2017 State of Data Science survey. Involving different parts of the business can help you evaluate machine learning opportunities more thoroughly from all angles. Once you have formulated the question you want answered, ensure your data is relevant to the problem and ready for machine learning algorithms.