Handling Text Data with State-of-the-Art Natural Language Processing Tools

February 4, 2021


This post was originally part of the DataRobot Community. Visit now to browse discussions and ask questions about the DataRobot AI Platform, data science, and more.

This article summarizes how DataRobot handles text features using state-of-the-art Natural Language Processing (NLP) tools such as the Matrix of Word N-grams, Auto-Tuned Word N-gram text modelers, Word2Vec, Fasttext, cosine similarity, and Vowpal Wabbit. It also covers NLP visualization techniques such as the Frequency Values Table and Word Clouds.

If your dataset contains one or more text variables, as shown in Figure 1, you may wonder whether DataRobot can incorporate this information into the modeling process. This article will show you just how it can.

Figure 1. Input dataset with one or more text variables

DataRobot lets you explore word frequencies in two views (Figure 2): a Frequency Values Table, which is a histogram of the most frequent terms in your data, and a General Table, which presents the same information in tabular format.

Figure 2a. Frequency Values Table for word frequency visualization

Figure 2b. General Table for word frequency visualization
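
The counting behind a frequency table like this can be sketched in a few lines of Python. This is only an illustration, not DataRobot's implementation: the tokenizer regex, the function name term_frequencies, and the sample reviews are all assumptions made up for the example.

```python
import re
from collections import Counter

def term_frequencies(texts, top_k=10):
    """Count how often each lowercase word token appears across all texts."""
    counts = Counter()
    for text in texts:
        # Hypothetical tokenizer: runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts.most_common(top_k)

# Made-up sample data for illustration only.
reviews = [
    "Great battery life and a great screen",
    "Battery died after a week",
    "The screen is great",
]

for term, count in term_frequencies(reviews, top_k=3):
    print(f"{term:<10}{count}")
```

The returned (term, count) pairs hold the same information that a histogram or a tabular view would display.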

Let’s move on to modeling. DataRobot commonly incorporates the Matrix of Word N-grams in blueprints (Figure 3). This matrix is built from TF-IDF values, a widely used term-weighting technique, and it combines multiple text columns into a single matrix.

Figure 3. An example blueprint that uses a Matrix of Word N-grams as a preprocessing step
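
To make the idea concrete, here is a minimal pure-Python sketch of a TF-IDF matrix over word n-grams drawn from several text columns. It is not DataRobot's implementation. The assumptions are: unigrams and bigrams only, a smoothed IDF variant, and prefixing each term with its column name as one possible way to combine columns. The function names and sample rows are hypothetical.

```python
import math
import re
from collections import Counter

def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf_matrix(rows, n_max=2):
    """Build a TF-IDF matrix from rows given as dicts of column -> text.

    Assumption: terms are prefixed with their column name so that several
    text columns can share one matrix without colliding.
    """
    docs = [
        Counter(f"{col}:{gram}" for col, text in row.items()
                for gram in word_ngrams(text, n_max))
        for row in rows
    ]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of rows in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed IDF, one common variant: ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    matrix = [[doc[t] * idf[t] for t in vocab] for doc in docs]
    return vocab, matrix

# Made-up sample data for illustration only.
rows = [
    {"title": "great phone", "review": "great battery life"},
    {"title": "bad phone", "review": "battery died"},
]
vocab, matrix = tfidf_matrix(rows)
print(len(matrix), "rows x", len(vocab), "n-gram columns")
```

Each row of the matrix corresponds to one record and each column to one n-gram, so terms that appear in every record (such as "phone" in the title) receive a lower weight than terms that distinguish records.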

For dense data, DataRobot offers the Auto-Tuned Word N-gram text modelers (Figure 4), which look at one text column at a time. This approach fits a separate n-gram model to each text feature in the input dataset and then uses the predictions from those models as inputs to other models.

Figure 4. An example blueprint that uses Auto-Tuned Word N-gram text modelers as a preprocessing step
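
The pattern described above, one model per text column whose predictions feed other models, can be sketched as follows. This is a simplified stand-in rather than DataRobot's modeler: the per-column model here is a smoothed word log-odds scorer over unigrams chosen for brevity, the target is assumed to be binary (1 or 0), and all function names and sample data are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_column_model(texts, labels):
    """Fit a tiny per-column text model: smoothed log-odds per word."""
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label == 1 else neg).update(tokens(text))
    vocab = set(pos) | set(neg)
    # Add-one smoothing so unseen class/word pairs never give log(0).
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }

def predict_column(model, text):
    """Score one text: average log-odds of its known words (0.0 if none)."""
    scores = [model[w] for w in tokens(text) if w in model]
    return sum(scores) / len(scores) if scores else 0.0

def stack_text_features(rows, labels, text_columns):
    """Train one model per text column; return its predictions as features."""
    models = {
        col: fit_column_model([row[col] for row in rows], labels)
        for col in text_columns
    }
    features = [
        [predict_column(models[col], row[col]) for col in text_columns]
        for row in rows
    ]
    return models, features

# Made-up sample data for illustration only.
rows = [
    {"title": "love it", "review": "works great"},
    {"title": "hate it", "review": "broke quickly"},
]
labels = [1, 0]
models, features = stack_text_features(rows, labels, ["title", "review"])
print(features)
```

Each text column is reduced to a single numeric feature that a downstream model can consume alongside the other inputs. In practice, stacked predictions are usually generated out-of-fold to avoid leaking the target; the sketch skips that step for brevity.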

Auto-Tuned models for a given sample size are visualized as Word Clouds (Figure 5). These can be found in the Insights > Word Cloud tab. The top 200 terms with the highest coefficients are shown, along with the frequency with which each term appears in the text.

Figure 5. Text visualization using Word Cloud

In Figure 5, terms are displayed in a color spectrum from blue to red with blue indicating a negative effect and red indicating a positive effect relative to the target values. Terms that appear more frequently are displayed in a larger font size, and those that appear less frequently are displayed in a smaller font size.

There are a number of things you can do with this display:

View the coefficient value specific to a term by mousing over the term
View the word cloud of another model by clicking the dropdown arrow above the word cloud
View class-specific word clouds (for multiclass classification projects)
Show or hide common stop words (the, for, was, etc.)
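
The visual encoding of the word cloud (color from the sign and size of the coefficient, font size from frequency) can be sketched as a simple mapping. The exact scaling DataRobot uses is not stated here, so the linear font scaling, the fade-to-white color ramp, the function name word_cloud_style, and the sample terms are all assumptions for illustration.

```python
def word_cloud_style(terms, min_font=10, max_font=50):
    """Map (term, coefficient, frequency) triples to a color and font size.

    Assumed encoding: red for positive coefficients, blue for negative,
    fading toward white as the coefficient approaches zero; font size
    grows linearly with frequency.
    """
    max_abs = max(abs(c) for _, c, _ in terms) or 1.0
    max_freq = max(f for _, _, f in terms)
    styled = {}
    for term, coef, freq in terms:
        strength = abs(coef) / max_abs          # 0 (weak) .. 1 (strong)
        fade = round(255 * (1 - strength))      # 0 means fully saturated
        rgb = (255, fade, fade) if coef > 0 else (fade, fade, 255)
        size = round(min_font + (max_font - min_font) * freq / max_freq)
        styled[term] = {"color": "#%02x%02x%02x" % rgb, "font_size": size}
    return styled

# Made-up coefficients and frequencies for illustration only.
terms = [("excellent", 0.8, 120), ("refund", -0.8, 60), ("okay", 0.1, 30)]
for term, style in word_cloud_style(terms).items():
    print(term, style)
```

With this mapping, a strongly positive and frequent term is drawn large and saturated red, while a weak and rare term is drawn small and pale.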

The coefficients for the Auto-Tuned Word N-gram text modelers are available in the Insights > Text Mining tab (see Figure 6). The tab shows the most relevant terms in the text variable and the strength of each coefficient. You can download all the coefficients as a spreadsheet by clicking Export.

Figure 6. Text Mining tab


Finally, DataRobot offers additional NLP approaches in the Repository, such as Fasttext (Figure 7a). To find them, type ‘Fasttext’ in the search box; DataRobot retrieves all blueprints that contain that preprocessing step.

Figure 7a. Example blueprints with Fasttext as part of their preprocessing steps

When a dataset has multiple text features, DataRobot also offers techniques such as cosine similarity (Figure 7b) and Pairwise Cosine Similarity (Figure 8a).

Figure 7b. Example blueprints with cosine similarity as part of their preprocessing steps

Figure 8a. Blueprint with Pairwise Cosine Similarity as a preprocessing step
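
As a rough illustration of how cosine similarity can turn multiple text features into numeric inputs, the sketch below computes one similarity value per pair of text columns in a row. It assumes raw word counts as the vector representation; whether DataRobot uses counts, TF-IDF weights, or another representation is not stated here. The function names and the sample row are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations

def bag_of_words(text):
    """Sparse word-count vector for a text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pairwise_cosine_features(row, text_columns):
    """One similarity feature per pair of text columns in a row."""
    bags = {col: bag_of_words(row[col]) for col in text_columns}
    return {
        f"cos({a},{b})": cosine_similarity(bags[a], bags[b])
        for a, b in combinations(text_columns, 2)
    }

# Made-up sample row for illustration only.
row = {
    "question": "how do I reset my password",
    "answer": "reset your password from the settings page",
    "tag": "billing",
}
feats = pairwise_cosine_features(row, ["question", "answer", "tag"])
for name, value in feats.items():
    print(f"{name}: {value:.3f}")
```

Columns that share vocabulary score closer to 1, and columns with no words in common score 0, which gives a downstream model a compact signal about how related two text fields are.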

DataRobot also offers Vowpal Wabbit-based classifiers, which use N-grams (Figure 8b).

Figure 8b. Example blueprints with Vowpal Wabbit-based classifiers

Interested in learning more about DataRobot and its capabilities when it comes to text, as well as other data types? Reach out now for a personalised demo.

About the author

Linda Haviland

Community Manager
