Feature Selection

What is Feature Selection in Machine Learning?

Adding features to your dataset can improve the accuracy of your machine learning model, especially when the model is too simple to fit the existing data properly. However, it is important to focus on features that are relevant to the problem you’re trying to solve and to avoid including features that contribute nothing. For example, if you’re trying to predict flight delays, today’s temperature may be important, but the temperature three months ago is not.

Good feature selection eliminates irrelevant or redundant columns from your dataset without sacrificing accuracy. Unlike dimensionality reduction, feature selection doesn’t create new features or transform existing ones; it simply removes the columns that don’t add value to your analysis.
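As a concrete (if simplified) illustration, the sketch below assumes a numeric pandas DataFrame named df; the function name drop_uninformative and the 0.95 correlation threshold are hypothetical choices, not a standard recipe. It removes constant columns and one column from each highly correlated pair, leaving the surviving columns untransformed:

    import numpy as np
    import pandas as pd

    def drop_uninformative(df: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
        """Remove constant columns and one column from each highly correlated pair."""
        # Drop columns with no variance: a constant column cannot help any model.
        df = df.loc[:, df.nunique() > 1]

        # Pairwise absolute correlations; keep only the upper triangle so
        # each pair of columns is inspected exactly once.
        corr = df.corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

        # Flag the second column of every pair whose correlation exceeds the threshold.
        redundant = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
        return df.drop(columns=redundant)

    # Usage with a hypothetical feature table:
    # reduced = drop_uninformative(df)

Note that nothing here is recomputed or projected, as it would be with dimensionality reduction; columns are only kept or dropped.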

Why is Feature Selection Important?

The benefits of feature selection for machine learning include:

  1. Reducing the chance of overfitting (see the sketch after this list).
  2. Improving run speed by lowering the CPU, I/O, and RAM load the production system needs to build and use the model: fewer features mean fewer operations to read, preprocess, and score data.
  3. Increasing the model’s interpretability by revealing the most informative factors driving the model’s outcomes.
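The first benefit is easy to check empirically. The sketch below is a toy setup; the synthetic dataset, the logistic regression model, and k=10 are illustrative assumptions, not DataRobot behavior. It trains the same scikit-learn model on all features and on a univariately selected subset, so the two held-out scores can be compared:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Synthetic data: 10 informative features buried among 90 noise features.
    X, y = make_classification(n_samples=1000, n_features=100,
                               n_informative=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Baseline: all 100 features.
    full = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Selected: keep the 10 features with the strongest univariate F-score.
    selected = make_pipeline(SelectKBest(f_classif, k=10),
                             LogisticRegression(max_iter=1000)).fit(X_train, y_train)

    print("all features:     ", full.score(X_test, y_test))
    print("selected features:", selected.score(X_test, y_test))

Because 90 of the 100 columns are pure noise, the comparison shows how much (or how little) accuracy the extra columns actually buy.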

Feature Selection + DataRobot

The DataRobot AI platform combines multiple approaches for feature selection in its modeling workflow:

  1. Model-agnostic feature importance. Before running any algorithms, DataRobot determines the univariate importance of each feature with respect to the target variable.
  2. Model-specific feature impact analysis. DataRobot produces a quantitative ranking of how impactful each feature is for each model it builds (this kind of ranking is sketched generically after this list).
  3. Automated feature selection. DataRobot employs expert model blueprints that automatically select relevant features. The platform also supports manual tuning.
  4. Support for multiple feature lists. By running DataRobot on different subsets of features, users can see how different feature lists compare.
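DataRobot’s implementations are proprietary, but the second approach above, ranking features by their impact on a trained model, can be sketched generically with scikit-learn’s permutation_importance. The random forest model and synthetic data below are illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=8,
                               n_informative=3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle each feature in turn and measure how much held-out accuracy drops;
    # a large drop means the model relies heavily on that feature.
    result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                    random_state=0)
    for i in result.importances_mean.argsort()[::-1]:
        print(f"feature {i}: {result.importances_mean[i]:.3f}")

Permutation importance is model-agnostic in mechanism but model-specific in result: the ranking describes what this fitted model relies on, not the data in isolation.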

Not all data is equally useful. Sift through the noise.