Organizations looking to add modern data preparation to their analytics technology arsenal have multiple choices, ranging from line-of-business, self-service solutions to modules from legacy, IT-centric data management platforms.
Diverse use cases, varied skill levels, and unique business requirements make selecting a data preparation tool complex and confusing. Clear evaluation and selection criteria go a long way toward helping organizations clarify their goals and work through the decision-making process.
In our experience, these three criteria are most commonly considered:
- User Interface: Some data preparation tools offer visual drag-and-drop or spreadsheet-like user interfaces. Others rely on scripting or coding to convey data preparation instructions. If non-technical users will be using the tool, a spreadsheet-like user interface is highly advantageous, because many business analysts already know and use Excel and find a familiar, Excel-like interface natural and intuitive. For this group of users, working directly with data and logic instead of abstractions and workflows increases their confidence and accelerates iterative data discovery and preparation cycles.
- Governance: Data preparation tools vary widely in their approach to data governance, but because workflow is a fundamental part of data preparation, all tools offer some form of data lineage tracking. Self-documenting data preparation tools offer especially strong data lineage capabilities because they record each data preparation step as it occurs. As the data changes, each operation that transforms, cleans, or blends data is documented automatically. For example, if a user removes whitespace from a column, that action is documented, which makes the work repeatable and lets users govern data as they discover it.
- Sampling Limitations for Profiling Data: When working with data that is highly standardized and predictable, it is acceptable to build data preparation processes on data samples and then apply those processes to the entire data collection. However, when data is less well known and its structure is highly complex, the probability of unexpected outcomes increases. In this case, samples may not include all of the outliers and anomalies that exist in the full data collection. When working with uncertain data, it is critical that the data preparation tool can work with the entire data set, not just a sample. This helps avoid the unpleasant surprises that can arise from sampling alone.
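To make the governance point concrete, here is a minimal Python sketch of what "self-documenting" can look like. It is an illustration only, not how any particular product implements lineage: the `RecordedDataset` class and the `strip_whitespace` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, str]


@dataclass
class RecordedDataset:
    """Hypothetical wrapper that logs every preparation step it applies."""
    rows: list[Row]
    lineage: list[str] = field(default_factory=list)

    def apply(self, description: str, step: Callable[[Row], Row]) -> "RecordedDataset":
        # Work on copies so the original rows stay untouched and the step
        # can be replayed later against the same input.
        new_rows = [step(dict(row)) for row in self.rows]
        return RecordedDataset(new_rows, self.lineage + [description])


def strip_whitespace(column: str) -> Callable[[Row], Row]:
    """Build a step that trims leading and trailing whitespace in one column."""
    def step(row: Row) -> Row:
        row[column] = row[column].strip()
        return row
    return step


customers = RecordedDataset([{"name": "  Ada "}, {"name": "Grace  "}])
cleaned = customers.apply("strip whitespace from 'name'", strip_whitespace("name"))

print(cleaned.rows)     # the transformed data
print(cleaned.lineage)  # the automatically recorded history of steps
```

Because the description is recorded in the same call that performs the transformation, the history cannot drift out of sync with the data, which is the property that makes the work repeatable.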
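The sampling risk can also be shown with a small, made-up example. The Python sketch below assumes a synthetic column of order amounts with two anomalies placed late in the data, and a tool that profiles only the first 1,000 rows. The data, the `profile` function, and the sample size are all assumptions chosen for illustration.

```python
def profile(values):
    """Return a tiny profile: row count, min, max, and number of missing values."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "min": min(present),
        "max": max(present),
        "missing": len(values) - len(present),
    }


# Synthetic data: 10,000 well-behaved amounts between 20 and 69...
order_amounts = [float(20 + i % 50) for i in range(10_000)]
# ...plus two anomalies that appear only late in the collection.
order_amounts[7_500] = -999.0  # a sentinel value standing in for "unknown"
order_amounts[9_000] = None    # a missing value

sample = order_amounts[:1_000]  # profile only the first 1,000 rows

print("sample:", profile(sample))
print("full:  ", profile(order_amounts))
```

In this contrived case the sample profile looks perfectly clean, while the full profile exposes both the negative sentinel and the missing value. A preparation process designed from the sample alone would have no step to handle either.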
Having a good set of criteria is essential for choosing a data preparation tool that meets your needs today and grows with your organization in the future.
About the author
DataRobot