Data Preparation

What is Data Preparation for Machine Learning?

Data preparation is the process of transforming raw data so that data scientists and analysts can run it through machine learning algorithms to uncover insights or make predictions.

The data preparation process can be complicated by issues such as:

    1. Missing or incomplete records. It is difficult to get every data point for every record in a dataset. Missing data sometimes appears as empty cells or as a placeholder character, such as a question mark. For example:
      [Image: Data Preparation Dataset for Machine Learning]
    2. Improperly formatted data. Data sometimes needs to be extracted into a different format or location. A good way to address this is to consult domain experts or join data from other sources.
    3. The need for techniques such as feature engineering. Even if all of the relevant data is available, the data preparation process may require techniques such as feature engineering to derive additional features that result in more accurate, relevant models.
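
The three issues above can be illustrated with a small, self-contained sketch. The sample rows, the column names (age, signup_date, income), the use of "?" as a missing-value marker, and median imputation are all assumptions made for illustration; only the Python standard library is used, and other strategies may suit a real dataset better.

```python
import csv
import io
from datetime import datetime
from statistics import median

# Hypothetical raw data: "?" and empty cells both stand for missing values,
# and the dates arrive in day/month/year format.
RAW_CSV = """name,age,signup_date,income
Ana,34,03/01/2021,52000
Ben,?,15/02/2021,
Cy,41,28/02/2021,61000
"""

MISSING_MARKERS = {"", "?"}


def parse_number(cell):
    """Return the cell as a float, or None if it is a missing-value marker."""
    cell = cell.strip()
    return None if cell in MISSING_MARKERS else float(cell)


def prepare(raw_text):
    rows = list(csv.DictReader(io.StringIO(raw_text)))

    # 1. Missing records: turn markers into None so they are easy to detect.
    for row in rows:
        row["age"] = parse_number(row["age"])
        row["income"] = parse_number(row["income"])

    # Fill each gap with the column median (one of many possible strategies).
    for column in ("age", "income"):
        known = [row[column] for row in rows if row[column] is not None]
        fill = median(known)
        for row in rows:
            if row[column] is None:
                row[column] = fill

    for row in rows:
        # 2. Improper formatting: convert day/month/year strings to ISO dates.
        parsed = datetime.strptime(row["signup_date"], "%d/%m/%Y")
        row["signup_date"] = parsed.date().isoformat()

        # 3. Feature engineering: derive the signup month as a new feature.
        row["signup_month"] = parsed.month

    return rows


prepared = prepare(RAW_CSV)
for row in prepared:
    print(row)
```

In this sketch the row for Ben, which had no age or income, receives the median of the other two rows, and every row gains a signup_month column that was not present in the raw data.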

Why is Data Preparation Important?

Most machine learning algorithms require data to be formatted in a very specific way, so datasets generally require some amount of preparation before they can yield useful insights. Some datasets have values that are missing, invalid, or otherwise difficult for an algorithm to process. If data is missing, the algorithm can’t use it. If data is invalid, it causes the algorithm to produce less accurate or even misleading outcomes. Good data preparation produces clean and well-curated data that leads to more practical, accurate model outcomes.

Data Preparation + DataRobot

The DataRobot automated machine learning platform interfaces with tools like Trifacta to help with the data preparation process.

Once a user has properly prepared a dataset for machine learning, importing it into DataRobot is as easy as dragging and dropping a .csv file or importing it from Hadoop or most popular SQL databases. Users can also directly upload common in-memory formats such as pandas DataFrames (Python) or R data frames via DataRobot’s language-specific APIs. No matter your programming skill level, DataRobot has a corresponding import mechanism.
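
As a rough illustration of the drag-and-drop path, the sketch below serializes an already-prepared dataset to .csv text using only the Python standard library. The records, column names, and the helper to_csv_text are hypothetical, and nothing here calls DataRobot itself; the output is simply the kind of file a user might upload.

```python
import csv
import io

# Hypothetical prepared records (no missing values, consistent types).
records = [
    {"age": 34, "income": 52000, "churned": 0},
    {"age": 41, "income": 61000, "churned": 1},
]


def to_csv_text(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(records)
print(csv_text)

# To produce a file for upload, write the text to disk, for example:
# with open("prepared.csv", "w", newline="") as f:
#     f.write(csv_text)
```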

Dataset Preparation

Once a user uploads data to the platform, DataRobot automatically performs exploratory data analysis, identifying each variable type and generating descriptive statistics for numerical variables (mean, median, standard deviation, etc.). This allows analysts and data scientists to explore data easily without having to slice and dice it manually in Excel or similar software. DataRobot also automatically imputes missing values when necessary and provides graphical tools for manually deriving additional features.
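
To make the idea concrete, the following is a minimal sketch of this kind of exploratory analysis done by hand with the Python standard library. It is not DataRobot's implementation; the type-inference rule (a column is treated as numeric if every non-missing value parses as a number), the sample columns, and mean imputation are all assumptions chosen for illustration.

```python
from statistics import mean, median, stdev

# Hypothetical dataset stored as columns of raw strings; "" marks a missing value.
columns = {
    "age": ["34", "", "41", "29"],
    "city": ["Boston", "Austin", "", "Boston"],
}


def to_float(text):
    """Parse text as a float, returning None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def summarize(values):
    """Infer a simple variable type and compute descriptive statistics."""
    present = [v for v in values if v != ""]
    numbers = [to_float(v) for v in present]
    if present and all(n is not None for n in numbers):
        return {
            "type": "numeric",
            "missing": len(values) - len(present),
            "mean": mean(numbers),
            "median": median(numbers),
            "stdev": stdev(numbers) if len(numbers) > 1 else 0.0,
        }
    return {
        "type": "categorical",
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }


def impute_mean(values):
    """Replace missing entries in a numeric column with the column mean."""
    numbers = [float(v) for v in values if v != ""]
    fill = mean(numbers)
    return [float(v) if v != "" else fill for v in values]


summary = {name: summarize(vals) for name, vals in columns.items()}
for name, stats in summary.items():
    print(name, stats)

print(impute_mean(columns["age"]))
```

Here the age column is reported as numeric with one missing value, the city column as categorical with two distinct values, and the gap in age is filled with the mean of the three known ages.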