Text Mining

What is Text Mining?

Text algorithms allow analysts to extract useful insights from raw text, which is useful when a dataset has information in the form of notes or descriptions from doctor visits or loan applications.

When data scientists build traditional machine learning models, they use numeric and categorical data as features, such as the requested loan amount (in dollars) or the borrower’s employment type (in a word or two). Text mining algorithms give analysts the ability to leverage information about the purpose of the loan, greatly improving the accuracy of the model. With text mining, analysts can identify which words or phrases in raw text are associated with certain outcomes, giving them greater insight into the factors that relate to their target variable, or object of analysis.

Some common text mining algorithms include:

  1. Sentiment Analysis. Determines how a writer feels and reacts to a particular topic or event. Often used in marketing to evaluate consumers’ responses to new products.
  2. Named Entity Recognition. Locates and classifies specific references to people, organizations, places, and dates.
    For example, in the sentence “DataRobot acquired Nutonian, another Boston-based company, in 2017,” the algorithm will recognize DataRobot and Nutonian as organizations, Boston as a location, and 2017 as a date.
  3. Topic Modeling. Discovers the hidden semantic structure of a collection of raw text documents. Used to measure a topic’s prevalence and describe which terms are the most representative in each document.
  4. Summarization and Keyphrase Extraction. Distills a large document down to a set of sentences or terms that summarize it without sacrificing important information.

Why is Text Mining important?

The vast majority of data is unstructured in the form of images, audio, or video. Text data is ubiquitous in every industry. Claims investigator reports, medical examination notes, social network comments, software logs, and more contain vital information for predicting particular future events but are rarely formally structured. Text mining allows analysts to make the most of this data, leading to more practical models with higher accuracy.

Text Mining + DataRobot

The majority of the DataRobot automated machine learning platform’s models support text data right out of the box. If a particular combination of words or characters in the text is highly related to the target variable, DataRobot automatically captures the pattern and displays it along with other insights. DataRobot is also multilingual, using automatic language identification for text data and supporting different text mining algorithms depending on the language it detects.

The process of feature engineering free text data in the traditional way is notoriously complex and difficult, and data scientists often avoid doing it manually. DataRobot automatically finds, tunes, and interprets the best text mining algorithms for a dataset, saving both time and resources.