Use LDA to Classify Text Documents

September 8, 2016

This article was originally published on Algorithmia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up to date, and it may refer to products and offerings that no longer exist.

The LDA microservice is a quick and useful implementation of MALLET (MAchine Learning for LanguagE Toolkit), a Java-based package for natural language processing. This topic modeling package automatically finds the relevant topics in unstructured text data.

The Algorithmia implementation makes LDA available as a REST API, and removes the need to install multiple packages, manage servers, or deal with dependencies. This microservice accepts strings, files, and URLs, and lets you pass a stop word list as an argument.

Want to understand the topic frequency of a document? Try LDA Mapper, a microservice that provides the topic distribution in a document. Combined, LDA and LDA Mapper help you find hidden topics and trends in unstructured text.

What is LDA?

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative, bag-of-words topic model that automatically discovers topics in text documents. The model treats each document (a set of observed words) as a mixture of topics, and assumes that each word in the document is drawn from one of the document’s topics. It was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.
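The generative story described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the MALLET implementation: the two toy topics, their word probabilities, and the `alpha` values are made-up assumptions, and real LDA runs this process in reverse to infer the topics from the documents.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate a bag of words: pick a topic for each word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        topic_index = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[topic_index].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics (word -> probability), loosely "Business" and "Entertainment"
topics = [
    {"economic": 0.4, "downturn": 0.3, "forecasted": 0.3},
    {"movie": 0.5, "rated": 0.2, "enjoyed": 0.3},
]

rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", theta)
print("words:", words)
```

A document whose mixture leans toward the first topic will mostly contain business words, which is the intuition behind reading a topic off a story.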

For example, when classifying newspaper articles, Story A may contain a topic with the words “economic,” “downturn,” “Wall Street,” and “forecasted.” It would be reasonable to assume that Story A is about Business. Story B, by contrast, may return a topic with the words “movie,” “rated,” “enjoyed,” and “recommend,” which suggests that Story B is about Entertainment.

LDA works by calculating the probability that a word belongs to a topic. For instance, in Story B, the word “movie” would have a higher probability than the word “rated.” This makes intuitive sense, because “movie” is more closely related to the topic Entertainment than the word “rated.”
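To make the “movie” versus “rated” comparison concrete, here is a minimal sketch that turns per-topic word counts into probabilities. The counts and the smoothing value `beta` are invented for illustration; an actual LDA run estimates these quantities iteratively rather than from a single table of counts.

```python
def word_probabilities(topic_counts, beta=0.01):
    """Convert one topic's word counts into smoothed probabilities that sum to 1."""
    vocab_size = len(topic_counts)
    total = sum(topic_counts.values()) + beta * vocab_size
    return {word: (count + beta) / total for word, count in topic_counts.items()}


# Hypothetical counts of how often each word was assigned to the Entertainment topic
entertainment = {"movie": 12, "rated": 3, "enjoyed": 6, "recommend": 4}

probs = word_probabilities(entertainment)
for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {p:.3f}")
```

With these counts, “movie” ends up with the highest probability in the topic and “rated” with the lowest.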

For more information about LDA, see its Wikipedia page or Edwin Chen’s blog.

Why Use LDA?

LDA is useful when you have a set of documents and want to discover patterns within them without knowing anything about the documents in advance. LDA can be used to generate topics to understand a document’s general theme, and is often used in recommendation systems, document classification, data exploration, and document summarization. Additionally, the topics and their occurrences can serve as input features when training predictive models such as linear regression.

How To Use LDA

The algorithm takes an object with an array of strings. As part of the API call, you can specify a mode to balance speed against quality. See the LDA algorithm page for more information about input options.

Sample Input

{
  "docsList": [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
  ]
}

Sample LDA API Call

import Algorithmia

# lda_input is the Sample Input above as a Python dict: {"docsList": [...]}
client = Algorithmia.client('your_algorithmia_api_key')
algo = client.algo('nlp/LDA/1.0.0')
result = algo.pipe(lda_input).result

Sample Output

[
  {
    "intelligence": 1,
    "text": 1,
    "create": 1,
    "super-resolution": 1,
    "forecasting": 1,
    "taxis": 1,
    "python": 1,
    "art": 1
  },
  {
    "learning": 3,
    "source": 1,
    "telegram": 1,
    "baidu's": 1,
    "chatbot": 1
  },
  {
    "deep": 3,
    "machines": 1,
    "summarize": 1,
    "self-driving": 1,
    "framework": 1,
    "overview": 1,
    "descent": 1,
    "minds": 1
  },
  {
    "age": 1,
    "world's": 1,
    "optimization": 1,
    "artificial": 1,
    "algorithms": 1,
    "image": 1,
    "singapore": 1,
    "debut": 1
  }
]

A few patterns emerge even from these seven titles: for example, “learning” and “deep” each dominate one topic. With more documents, the topics would be more clearly defined.
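One simple way to read the output is to rank the words in each topic by count. The helper below is a small sketch of that idea; `top_words` is a name introduced here, not part of the Algorithmia API, and the `topics` list is copied from the sample output above.

```python
def top_words(topic, n=3):
    """Return the n highest-count words in a topic, breaking ties alphabetically."""
    ranked = sorted(topic.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# The sample LDA output from above: one dict of word counts per topic
topics = [
    {"intelligence": 1, "text": 1, "create": 1, "super-resolution": 1,
     "forecasting": 1, "taxis": 1, "python": 1, "art": 1},
    {"learning": 3, "source": 1, "telegram": 1, "baidu's": 1, "chatbot": 1},
    {"deep": 3, "machines": 1, "summarize": 1, "self-driving": 1,
     "framework": 1, "overview": 1, "descent": 1, "minds": 1},
    {"age": 1, "world's": 1, "optimization": 1, "artificial": 1,
     "algorithms": 1, "image": 1, "singapore": 1, "debut": 1},
]

for index, topic in enumerate(topics):
    print(index, top_words(topic))
```

On a corpus this small most counts are 1, so the ranking is mostly alphabetical. On a larger corpus the top words are usually what you would use to give each topic a human-readable label.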

Combine LDA With LDA Mapper

The LDA Mapper algorithm can map the group of topics generated by LDA to the list of documents. This tells you the distribution of topics across the documents (e.g. Story A is made up of 35% Business, 25% Entertainment, and 40% Science topics).

Sample Input

{
  "topics": [
  {
    "intelligence": 1,
    "text": 1,
    "create": 1,
    "super-resolution": 1,
    "forecasting": 1,
    "taxis": 1,
    "python": 1,
    "art": 1
  },
  {
    "learning": 3,
    "source": 1,
    "telegram": 1,
    "baidu's": 1,
    "chatbot": 1
  },
  {
    "deep": 3,
    "machines": 1,
    "summarize": 1,
    "self-driving": 1,
    "framework": 1,
    "overview": 1,
    "descent": 1,
    "minds": 1
  },
  {
    "age": 1,
    "world's": 1,
    "optimization": 1,
    "artificial": 1,
    "algorithms": 1,
    "image": 1,
    "singapore": 1,
    "debut": 1
  }
],
  "docsList": [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
  ]
}

Sample LDA Mapper API Call

import Algorithmia

# mapper_input is the Sample Input above as a Python dict with "topics" and "docsList" keys
client = Algorithmia.client('your_algorithmia_api_key')
algo = client.algo('nlp/LDAMapper/0.1.0')
result = algo.pipe(mapper_input).result

Sample Output

The LDA Mapper algorithm outputs each document along with the influence of each topic on that document.

{
  "topic_distribution": [
    {
      "doc": "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
      "freq": {"0": 0.5, "1": 0, "2": 0, "3": 0.5}
    },
    {
      "doc": "Paddle: Baidu's open source deep learning framework",
      "freq": {"0": 0, "1": 0.5, "2": 0.5, "3": 0}
    },
    {
      "doc": "An overview of gradient descent optimization algorithms",
      "freq": {"0": 0, "1": 0, "2": 0.5, "3": 0.5}
    },
    {
      "doc": "Create a Chatbot for Telegram in Python to Summarize Text",
      "freq": {"0": null, "1": null, "2": null, "3": null}
    },
    {
      "doc": "Image super-resolution through deep learning",
      "freq": {"0": 0.25, "1": 0.25, "2": 0.25, "3": 0.25}
    },
    {
      "doc": "World's first self-driving taxis debut in Singapore",
      "freq": {
        "0": 0.3333333333333333,
        "1": 0,
        "2": 0.3333333333333333,
        "3": 0.3333333333333333
      }
    },
    {
      "doc": "Minds and machines: The art of forecasting in the age of artificial intelligence",
      "freq": {
        "0": 0.5714285714285716,
        "1": 0,
        "2": 0.1428571428571429,
        "3": 0.2857142857142858
      }
    }
  ]
}
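A common next step is to pick the strongest topic for each document. The sketch below assumes the result has the shape shown in the sample output above, and it treats `null` weights (which arrive in Python as `None`) as missing. `dominant_topic` is a helper name introduced here for illustration.

```python
def dominant_topic(freq):
    """Return (topic_id, weight) for the largest weight, or None if no weight is usable.

    Ties go to the topic that appears first in the dict.
    """
    usable = {topic: weight for topic, weight in freq.items() if weight is not None}
    if not usable:
        return None
    return max(usable.items(), key=lambda kv: kv[1])


# Two entries copied from the sample output above
mapper_result = {
    "topic_distribution": [
        {"doc": "Create a Chatbot for Telegram in Python to Summarize Text",
         "freq": {"0": None, "1": None, "2": None, "3": None}},
        {"doc": "Minds and machines: The art of forecasting in the age of artificial intelligence",
         "freq": {"0": 0.5714285714285716, "1": 0,
                  "2": 0.1428571428571429, "3": 0.2857142857142858}},
    ]
}

for entry in mapper_result["topic_distribution"]:
    print(entry["doc"][:30], "->", dominant_topic(entry["freq"]))
```

Several documents in the sample split their weight evenly across topics, so in practice you may want to keep the full distribution rather than a single label.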

Here’s a sample recipe for calling LDA, and passing the results to the LDA Mapper algorithm:

import Algorithmia

client = Algorithmia.client('your_algorithmia_api_key')

docslist = [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
]

# LDA takes an object whose "docsList" key holds the list of documents
lda_input = {
    "docsList": docslist
}

# LDA algorithm: https://algorithmia.com/algorithms/nlp/LDA
lda = client.algo('nlp/LDA/1.0.0')
# Returns a list of topics, each a dictionary mapping words to counts
result = lda.pipe(lda_input).result

# LDA Mapping algorithm: https://algorithmia.com/algorithms/nlp/LDAMapper
lda_mapper = client.algo('nlp/LDAMapper/0.1.1')

# LDA Mapper input using the LDA algorithm's result as 'topics' value
lda_mapper_input = {
    "topics": result,
    "docsList": docslist
}

# Prints out the result from calling the LDA Mapper algo
print(lda_mapper.pipe(lda_mapper_input).result)

By chaining these two algorithms, we get both the topics and their distribution across the documents from one short script, without installing or hosting any topic modeling software ourselves. If we wanted to, we could add a pre-processing algorithm to this pipeline, such as Porter Stemmer, and then use sentiment analysis to discover the sentiment of the documents as it pertains to each topic.
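If you plan to reuse the recipe, it can be wrapped in a function with an optional pre-processing hook. This is a sketch under a few assumptions: `topics_and_distribution` and the `preprocess` argument are names introduced here, the hook is an ordinary local Python function rather than a call to a stemming algorithm, and `FakeClient` is a stand-in that mimics the `client.algo(...).pipe(...).result` chain so the example runs without an API key.

```python
def topics_and_distribution(client, docs, preprocess=None):
    """Run LDA, then LDA Mapper, on docs; optionally clean each document first."""
    if preprocess is not None:
        docs = [preprocess(doc) for doc in docs]
    topics = client.algo('nlp/LDA/1.0.0').pipe({"docsList": docs}).result
    mapper_input = {"topics": topics, "docsList": docs}
    return client.algo('nlp/LDAMapper/0.1.1').pipe(mapper_input).result


class _FakeResponse:
    def __init__(self, result):
        self.result = result


class _FakeAlgo:
    def __init__(self, name, calls):
        self.name = name
        self.calls = calls

    def pipe(self, payload):
        # Record the call and return a canned value shaped like the real output
        self.calls.append((self.name, payload))
        if "LDAMapper" in self.name:
            return _FakeResponse({"topic_distribution": []})
        return _FakeResponse([{"learning": 1}])


class FakeClient:
    """Offline stand-in for Algorithmia.client(...), for dry runs only."""

    def __init__(self):
        self.calls = []

    def algo(self, name):
        return _FakeAlgo(name, self.calls)


client = FakeClient()
output = topics_and_distribution(
    client,
    ["Deep Learning ", "Self-Driving Taxis"],
    preprocess=lambda doc: doc.strip().lower(),
)
print(output)
```

To run it against the real service, pass `Algorithmia.client('your_algorithmia_api_key')` in place of `FakeClient()`.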

About the author
DataRobot
