Use LDA to Classify Text Documents

September 8, 2016

This article was originally published at Algorithmia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up-to-date or refer to products and offerings no longer in existence.

The LDA microservice is a quick and useful implementation of MALLET, a Java-based machine learning toolkit for natural language. This topic modeling package automatically finds the relevant topics in unstructured text data.

The Algorithmia implementation makes LDA available as a REST API and removes the need to install multiple packages, manage servers, or deal with dependencies. The microservice accepts strings, files, and URLs, and also lets you pass a stop word list as an argument.

Want to understand the topic frequency of a document? Try LDA Mapper, a microservice that provides the topic distribution in a document. Combined, LDA and LDA Mapper help you find hidden topics and trends in unstructured text.

What is LDA?

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative, bag-of-words topic model that automatically discovers topics in text documents. It treats each document (an observed collection of words) as a mixture of topics, and assumes that each word in the document is attributable to one of the document’s topics. The algorithm was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan in 2003.

For example, when classifying newspaper articles, Story A may contain a topic with the words “economic,” “downturn,” “Wall Street,” and “forecasted.” It’d be reasonable to assume that Story A is about Business. Story B, on the other hand, may return a topic with the words “movie,” “rated,” “enjoyed,” and “recommend.” Story B is clearly about Entertainment.

LDA works by calculating the probability that a word belongs to a topic. For instance, in Story B, the word “movie” would have a higher probability than the word “rated.” This makes intuitive sense, because “movie” is more closely related to the topic Entertainment than the word “rated.”
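
To make this concrete (independently of the Algorithmia microservice), here is a minimal sketch using scikit-learn’s LatentDirichletAllocation on a made-up four-document corpus. The corpus, the choice of two topics, and all variable names are illustrative assumptions, not part of the LDA microservice:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two business-flavored and two entertainment-flavored snippets (invented for illustration)
docs = [
    "economic downturn hits wall street as forecasted",
    "wall street reacts to the economic forecast",
    "the movie was rated highly and we recommend it",
    "we enjoyed the movie and recommend the sequel",
]

# LDA operates on word counts (bag of words)
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Ask the model to discover two latent topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic mixtures

# Print the highest-weight words for each discovered topic
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [words[i] for i in top])

The per-topic word weights printed at the end play the same role as the word probabilities described above; the Algorithmia microservice does this work server-side, so you only send documents and read back topics.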

For more information about LDA, see its Wikipedia page or Edwin Chen’s blog.

Why Use LDA?

LDA is useful when you have a set of documents and want to discover patterns within them without knowing anything about the documents in advance. It can be used to generate topics that capture a document’s general theme, and is often applied in recommendation systems, document classification, data exploration, and document summarization. The topics and their word occurrences can also serve as features for training predictive models such as linear regression.
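
For example, suppose you already have per-document topic proportions (like the LDA Mapper output shown later) and a numeric target such as page views. A hedged sketch of the regression idea, with entirely invented numbers, might look like this:

from sklearn.linear_model import LinearRegression

# Hypothetical per-document topic proportions (each row sums to 1)
topic_features = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.2, 0.5, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Hypothetical target values, e.g. page views for each document
page_views = [1200, 800, 950, 400]

model = LinearRegression().fit(topic_features, page_views)
print(model.coef_)  # how strongly each topic is associated with the target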

How To Use LDA

The algorithm takes an object with an array of strings. As part of the API call, you can specify a mode to balance speed versus quality. See the LDA algorithm page for more information about input options.

Sample Input

{
  "docsList": [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
  ]
}

Sample LDA API Call

import Algorithmia

client = Algorithmia.client('your_algorithmia_api_key')
algo = client.algo('nlp/LDA/1.0.0')
# 'lda_input' is the Sample Input shown above, as a Python dict
result = algo.pipe(lda_input).result

Sample Output

[
  {
    "intelligence": 1,
    "text": 1,
    "create": 1,
    "super-resolution": 1,
    "forecasting": 1,
    "taxis": 1,
    "python": 1,
    "art": 1
  },
  {
    "learning": 3,
    "source": 1,
    "telegram": 1,
    "baidu's": 1,
    "chatbot": 1
  },
  {
    "deep": 3,
    "machines": 1,
    "summarize": 1,
    "self-driving": 1,
    "framework": 1,
    "overview": 1,
    "descent": 1,
    "minds": 1
  },
  {
    "age": 1,
    "world's": 1,
    "optimization": 1,
    "artificial": 1,
    "algorithms": 1,
    "image": 1,
    "singapore": 1,
    "debut": 1
  }
]

There are a few patterns that emerge from the documents. With more documents, the topics would be more clearly defined.
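
If you load that result into Python (assume it is bound to a variable named result, as in the API call above), a quick way to eyeball the dominant words per topic is:

# 'result' is the list of topic dictionaries returned by the LDA call above
for i, topic in enumerate(result):
    top_words = sorted(topic, key=topic.get, reverse=True)[:5]
    print(f"topic {i}: {top_words}")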

Combine LDA With LDA Mapper

The LDA Mapper algorithm can map the group of topics generated by LDA to the list of documents. This tells you the distribution of topics across the documents (e.g. Story A is made up of 35% Business, 25% Entertainment, and 40% Science topics).

Sample Input

{
  "topics": [
  {
    "intelligence": 1,
    "text": 1,
    "create": 1,
    "super-resolution": 1,
    "forecasting": 1,
    "taxis": 1,
    "python": 1,
    "art": 1
  },
  {
    "learning": 3,
    "source": 1,
    "telegram": 1,
    "baidu's": 1,
    "chatbot": 1
  },
  {
    "deep": 3,
    "machines": 1,
    "summarize": 1,
    "self-driving": 1,
    "framework": 1,
    "overview": 1,
    "descent": 1,
    "minds": 1
  },
  {
    "age": 1,
    "world's": 1,
    "optimization": 1,
    "artificial": 1,
    "algorithms": 1,
    "image": 1,
    "singapore": 1,
    "debut": 1
  }
],
  "docsList": [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
  ]
}

Sample LDA Mapper API Call

import Algorithmia

client = Algorithmia.client('your_algorithmia_api_key')
algo = client.algo('nlp/LDAMapper/0.1.0')
# 'mapper_input' is the Sample Input shown above (topics plus docsList)
result = algo.pipe(mapper_input).result

Sample Output

The LDAMapper algorithm outputs each document along with the influence of each topic on that document.

{
  "topic_distribution": [
    {
      "doc": "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
      "freq": {"0": 0.5, "1": 0, "2": 0, "3": 0.5}
    },
    {
      "doc": "Paddle: Baidu's open source deep learning framework",
      "freq": {"0": 0, "1": 0.5, "2": 0.5, "3": 0}
    },
    {
      "doc": "An overview of gradient descent optimization algorithms",
      "freq": {"0": 0, "1": 0, "2": 0.5, "3": 0.5}
    },
    {
      "doc": "Create a Chatbot for Telegram in Python to Summarize Text",
      "freq": {"0": null, "1": null, "2": null, "3": null}
    },
    {
      "doc": "Image super-resolution through deep learning",
      "freq": {"0": 0.25, "1": 0.25, "2": 0.25, "3": 0.25}
    },
    {
      "doc": "World's first self-driving taxis debut in Singapore",
      "freq": {
        "0": 0.3333333333333333,
        "1": 0,
        "2": 0.3333333333333333,
        "3": 0.3333333333333333
      }
    },
    {
      "doc": "Minds and machines: The art of forecasting in the age of artificial intelligence",
      "freq": {
        "0": 0.5714285714285716,
        "1": 0,
        "2": 0.1428571428571429,
        "3": 0.2857142857142858
      }
    }
  ]
}
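
Once the mapper output is parsed into a Python dict (call it mapper_result here, purely for illustration), you can, for example, pull out the dominant topic for each document. Note that topics with no clear assignment can come back as null (None in Python), as in the Telegram example above:

# 'mapper_result' holds the parsed Sample Output above
for entry in mapper_result["topic_distribution"]:
    freq = entry["freq"]
    # Treat null/None weights as 0 so max() still works
    best = max(freq, key=lambda k: freq[k] or 0)
    print(f"{entry['doc'][:50]}... -> topic {best} ({freq[best]})")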

Here’s a sample recipe for calling LDA, and passing the results to the LDA Mapper algorithm:

import Algorithmia

client = Algorithmia.client('your_algorithmia_api_key')

docslist = [
    "Machine Learning is Fun Part 5: Language Translation with Deep Learning and the Magic of Sequences",
    "Paddle: Baidu's open source deep learning framework",
    "An overview of gradient descent optimization algorithms",
    "Create a Chatbot for Telegram in Python to Summarize Text",
    "Image super-resolution through deep learning",
    "World's first self-driving taxis debut in Singapore",
    "Minds and machines: The art of forecasting in the age of artificial intelligence"
]

# The LDA input: a dict containing the list of documents
lda_input = {
    "docsList": docslist
}

# LDA algorithm: https://algorithmia.com/algorithms/nlp/LDA
lda = client.algo('nlp/LDA/1.0.0')
# Returns a list of dictionaries, one per topic, mapping words to counts
result = lda.pipe(lda_input).result

# LDA Mapping algorithm: https://algorithmia.com/algorithms/nlp/LDAMapper
lda_mapper = client.algo('nlp/LDAMapper/0.1.1')

# LDA Mapper input using the LDA algorithm's result as 'topics' value
lda_mapper_input = {
    "topics": result,
    "docsList": docslist
}

# Prints out the result from calling the LDA Mapper algo
print(lda_mapper.pipe(lda_mapper_input).result)

By chaining these two algorithms together, we can go from raw documents to per-document topic distributions with very little code. If we wanted to, we could add a pre-processing algorithm to this pipeline, like Porter Stemmer, and then run sentiment analysis to discover the sentiment of the documents as it pertains to each topic.
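
As a rough sketch of that pre-processing idea, here is one way to stem the documents before building the LDA input, using NLTK’s PorterStemmer rather than a specific Algorithmia algorithm (an assumption about how you might wire it in, not the exact pipeline above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_document(doc):
    # Reduce each word to its stem, e.g. "learning" -> "learn"
    return " ".join(stemmer.stem(word) for word in doc.split())

# Stem every document before handing the list to LDA
stemmed_docs = [stem_document(doc) for doc in docslist]
lda_input = {"docsList": stemmed_docs}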
