Handling Text Data with State-of-the-Art Natural Language Processing Tools
This post was originally part of the DataRobot Community. Visit now to browse discussions and ask questions about DataRobot, AI Cloud, data science, and more.
This article summarizes how DataRobot handles text features using state-of-the-art Natural Language Processing (NLP) tools such as Matrix of Word N-gram, Auto-Tuned Word N-gram Text Modelers, Word2Vec, Fasttext, cosine similarity, and Vowpal Wabbit. It also covers NLP visualization techniques such as frequency value tables and word clouds.
If your dataset contains one or more text variables, as shown in Figure 1, you may wonder whether DataRobot can incorporate this information into the modeling process. This article will show you just how it can.
DataRobot lets you explore word frequencies through a frequency value table: a histogram of the most frequent terms in your data, alongside a table that presents the same information in tabular form (Figure 2).
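To make the idea of a term frequency table concrete, here is a minimal sketch in plain Python. It is not DataRobot's implementation; the tokenization rule, the `term_frequencies` helper, and the sample `reviews` are assumptions made for illustration.

```python
import re
from collections import Counter

def term_frequencies(texts, top_n=10):
    """Return the top_n most frequent lowercase word tokens with their counts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts.most_common(top_n)

# Hypothetical text column.
reviews = [
    "Great product, great price",
    "The price was too high",
    "Great service and fast delivery",
]

freq_table = term_frequencies(reviews, top_n=3)
for term, count in freq_table:
    print(f"{term:<10}{count}")
```

The same list of (term, count) pairs could feed either a histogram or a plain table, which are the two views described above.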
Let’s move on to modeling. DataRobot commonly incorporates a matrix of word n-grams in its blueprints (Figure 3). The matrix is built from TF-IDF values, a widely used term-weighting technique, and combines multiple text columns.
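As a rough illustration of what a TF-IDF matrix of word n-grams looks like, the sketch below builds one in plain Python. Several details are assumptions rather than DataRobot's actual behavior: the plain `tf * log(N / df)` weighting, the use of unigrams and bigrams, and the way columns are combined by prefixing each term with its column name. The `rows` data is hypothetical.

```python
import math
import re
from collections import Counter

def word_ngrams(text, n_max=2):
    """Yield word n-grams of length 1..n_max from a text string."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def tfidf_matrix(rows, text_columns, n_max=2):
    """Build a sparse TF-IDF matrix (one dict per row) over several text columns."""
    # Term counts per row; each term is prefixed with the column it came from.
    row_counts = [
        Counter(
            f"{col}:{gram}"
            for col in text_columns
            for gram in word_ngrams(row[col], n_max)
        )
        for row in rows
    ]
    # Document frequency: the number of rows that contain each term.
    doc_freq = Counter(term for counts in row_counts for term in counts)
    n_rows = len(rows)
    return [
        {term: tf * math.log(n_rows / doc_freq[term]) for term, tf in counts.items()}
        for counts in row_counts
    ]

# Hypothetical dataset with two text columns.
rows = [
    {"title": "great phone", "review": "battery lasts long"},
    {"title": "bad phone", "review": "battery died fast"},
]
matrix = tfidf_matrix(rows, ["title", "review"])
```

With this weighting, a term that appears in every row (such as `title:phone` here) gets a weight of zero, while terms that distinguish one row from another get larger weights.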
For dense data, DataRobot offers the Auto-Tuned Word N-gram text modelers (Figure 4), which look at one text column at a time. This approach fits a single n-gram model to each text feature in the input dataset and then uses the predictions from these models as inputs to other models.
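The pattern of fitting one text model per column and passing its predictions downstream can be sketched as follows. This is a toy stand-in, not the Auto-Tuned Word N-gram modeler itself: the per-column model is a simple smoothed log-odds score over single words, the data is hypothetical, and predictions are made in-sample for brevity (a real stacking setup would normally use out-of-fold predictions).

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_column_model(texts, labels):
    """Fit a toy per-column text model: one smoothed log-odds weight per word."""
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label == 1 else neg).update(set(tokenize(text)))
    vocab = set(pos) | set(neg)
    return {w: math.log((pos[w] + 1) / (neg[w] + 1)) for w in vocab}

def predict_score(model, text):
    """Sum the weights of the known words that occur in the text."""
    return sum(model.get(w, 0.0) for w in set(tokenize(text)))

# Hypothetical dataset: two text columns and a binary target.
rows = [
    {"title": "great phone", "review": "battery lasts long", "target": 1},
    {"title": "bad phone", "review": "battery died fast", "target": 0},
    {"title": "great value", "review": "works well", "target": 1},
    {"title": "bad screen", "review": "screen died", "target": 0},
]
labels = [r["target"] for r in rows]
text_columns = ["title", "review"]

# One model per text column, fit on that column alone.
models = {
    col: fit_column_model([r[col] for r in rows], labels) for col in text_columns
}

# Each model's prediction becomes one numeric feature for a downstream model.
stacked_features = [
    {f"{col}_score": predict_score(models[col], r[col]) for col in text_columns}
    for r in rows
]
```

The point of the design is that each text column is reduced to a single numeric signal, which other models can then combine with the rest of the features.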
Auto-Tuned models for a given sample size are visualized as Word Clouds (Figure 5). These can be found in the Insights > Word Cloud tab. The top 200 terms with the highest coefficients are shown, along with the frequency with which each term appears in the text.
In Figure 5, terms are displayed in a color spectrum from blue to red with blue indicating a negative effect and red indicating a positive effect relative to the target values. Terms that appear more frequently are displayed in a larger font size, and those that appear less frequently are displayed in a smaller font size.
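The encoding described above (color for the sign and size of the coefficient, font size for frequency) can be sketched as a small mapping function. The linear scaling, the font size range, and the example terms with their coefficients and frequencies are all assumptions for illustration; DataRobot's exact rendering rules are not described here.

```python
def term_style(coefficient, frequency, max_abs_coef, max_frequency,
               min_font=10, max_font=48):
    """Map a coefficient to an RGB color and a frequency to a font size."""
    # Scale the coefficient into [-1, 1]: -1 is fully blue, +1 is fully red.
    t = max(-1.0, min(1.0, coefficient / max_abs_coef))
    red = round(255 * (t + 1) / 2)
    blue = 255 - red
    # More frequent terms get a proportionally larger font.
    size = min_font + (max_font - min_font) * frequency / max_frequency
    return (red, 0, blue), round(size)

# Hypothetical terms: (coefficient, frequency).
terms = {"excellent": (0.8, 40), "refund": (-0.8, 20)}
max_abs_coef = max(abs(c) for c, _ in terms.values())
max_frequency = max(f for _, f in terms.values())

styles = {
    term: term_style(coef, freq, max_abs_coef, max_frequency)
    for term, (coef, freq) in terms.items()
}
```

In this sketch, `excellent` comes out fully red at the largest font size, while `refund` comes out fully blue at a mid-range size.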
There are a number of things you can do with this display:
- View the coefficient value specific to a term by mousing over the term
- View the word cloud of another model by clicking the dropdown arrow above the word cloud
- View class-specific word clouds (for multiclass classification projects)
- Show or hide common stop words (the, for, was, etc.)
The coefficients for the Auto-Tuned Word N-gram text modelers are available in the Insights > Text Mining tab (see Figure 6). The tab shows the most relevant terms in the text variable and the strength of each coefficient. You can download all the coefficients as a spreadsheet by clicking Export.
Finally, DataRobot also offers more NLP approaches in the Repository, such as Fasttext (Figure 7a). You can find those algorithms by typing ‘Fasttext’ in the search box; DataRobot will retrieve all blueprints that contain that preprocessing step.
DataRobot also offers other techniques, such as cosine similarity (Figure 7b), for datasets with multiple text features.
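For readers unfamiliar with cosine similarity between two text features, here is a minimal sketch. It represents each text as a vector of raw word counts, which is an assumption; the article does not say how DataRobot vectorizes the text before comparing columns. The example strings are hypothetical.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    # Texts with no tokens have no direction; treat their similarity as 0.
    return dot / norm if norm else 0.0

# Hypothetical pair of text features from the same row.
similarity = cosine_similarity("red running shoes", "red shoes")
```

The result is a single number per row, close to 1 when the two text features use the same words and 0 when they share none, so it can be used as an extra numeric feature.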
DataRobot also provides Vowpal Wabbit-based classifiers, which use n-grams (Figure 8b).
Interested in learning more about DataRobot and its capabilities when it comes to text, as well as other data types? Reach out now for a personalised demo.