Data Prep for Machine Learning

January 3, 2019
· 4 min read

You want to transform your business with AI. You know the benefits: increased revenue, reduced costs, lower risk. You see the potential of AI in all parts of your business.

You know that if you want to succeed, you must rethink the AI process. Reduce cycle time from data to application. You can’t reach your goals if every project takes months or years to deliver. The old way of doing things is too slow.

To accelerate a process, look for bottlenecks. Talk to your data scientists, and ask them how they spend their time. The answer: messing with the data. Data collection and preparation. What many data scientists call “data wrangling” or “data prep.”

They’ll tell you that it’s the worst part of the data science job. You value your data scientists and want to keep them. Fixing the data prep bottleneck should be your top priority.

The Data Prep Bottleneck

Why does data prep consume so much time and cause so much pain?

Data prep for machine learning includes four distinct tasks:

Data access: Data scientists copy data from files and source systems. They manipulate data to create a single table. The tools they use depend on the data source. For data stored in relational databases, they use SQL. For data stored in Hadoop, they use Hive, Pig, or Spark. For files and other formats, data scientists use tools like Python, R, or SAS.
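As a minimal sketch of that first step, the snippet below joins two source tables into a single modeling table with SQL and pandas. The table and column names (`customers`, `orders`, `total_spend`) are illustrative only, not from any particular system.

```python
import sqlite3

import pandas as pd

# Hypothetical source tables, stood up in-memory for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# One SQL query flattens the sources into the single table
# that most modeling tools expect.
query = """
    SELECT c.customer_id, c.region, SUM(o.amount) AS total_spend
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
"""
df = pd.read_sql_query(query, conn)
print(df)
```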

Initial data analysis: Data scientists perform an analysis to check data quality. They identify fields with no information value, such as constants, blanks, and duplicates. This analysis informs their feature engineering decisions.
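The zero-information checks above can be sketched in a few lines of pandas. The column names here are made up for illustration; the idea is simply to flag columns with at most one distinct value (constants and all-blank fields) and columns that duplicate another column.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "target": [0, 1, 0, 1],
    "constant": [7, 7, 7, 7],                # one value: no information
    "blank": [None, None, None, None],        # entirely missing
    "feature": [1.2, 3.4, 5.6, 7.8],
    "feature_copy": [1.2, 3.4, 5.6, 7.8],     # exact duplicate column
})

# Constants and blanks: at most one distinct non-null value.
constants = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Duplicate columns: identical values under another name.
dupes = df.columns[df.T.duplicated()].tolist()

flagged = sorted(set(constants + dupes))
print(flagged)
```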

Sampling and partitioning: Data scientists use sampling in two ways. First, they take a sample of records from the universe, which reduces the training data set to a convenient size. Second, they use sampling to partition the data into training, test, and validation data sets. Improper sampling can produce a biased model, so this task requires care and attention.
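A common way to do the partitioning step, sketched here with scikit-learn on synthetic data: split once into training and holdout, then split the holdout into validation and test. Stratifying on the target keeps class proportions consistent across partitions, and a fixed `random_state` makes the split reproducible. The 60/20/20 proportions are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real modeling table.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 60% train, 40% held out...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)

# ...then split the holdout in half: 20% validation, 20% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```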

Feature engineering (or featurization): This is the last step in the data prep process. Data scientists prepare the data for best results with each algorithm. They use many different techniques for this task. For information on feature engineering, read this article.

Feature engineering is the most demanding of the four tasks. It takes the most time, too.

  • Most data scientists can access and retrieve data
  • Initial data analysis is relatively quick
  • Data scientists know how to create samples and partitions

Successful feature engineering is different. It requires an in-depth knowledge of machine learning techniques. Every technique requires a different treatment. The data scientist must know the best way to prepare the data for each technique.

Feature engineering is a lot like conducting a symphony orchestra. Successful maestros understand every instrument. If you want to tell the tuba player how to play a passage, you need to know a lot about the tuba. It’s the same with feature engineering. If you want the best results from different machine learning techniques, you need to know a lot about how each technique works.

DataRobot Automates Data Prep

You want to deliver AI faster. Data prep is a key bottleneck. DataRobot automation is the solution.

DataRobot evaluates the quality of your data. Unlike some tools, it doesn’t just highlight problems. It fixes them. DataRobot finds fields in your data that have no information value. That simplifies the data collection process. Users don’t have to find and remove these fields first.

DataRobot samples and partitions your data. You don’t have to tell it to do this. Users can’t “forget” to create a sample to validate models. They can’t bollix a line of code and introduce bias. You can trust that DataRobot does it right. You don’t have to check the work of novice users. That’s another problem solved.

DataRobot Automates Feature Engineering

Best of all, DataRobot automates feature engineering. DataRobot tests many different techniques to build models. It knows all of them. It knows how to prepare your data for the best results with each technique. Depending on the characteristics of each algorithm, it will:

  • Perform one-hot encoding
  • Impute missing values
  • Standardize variables

DataRobot also applies sophisticated techniques such as:

  • Univariate credibility estimates
  • Category count
  • Text mining
  • Search for differences
  • Search for ratios

You would need a team of experts to provide DataRobot’s built-in knowledge. For more information about how DataRobot automates feature engineering, read this blog post.

DataRobot doesn’t tell you what to do. It doesn’t “guide” you through the data prep task. It just does it. Like a silent butler.


