Introduction to Latent Dirichlet Allocation (LDA) Model

Motivation

Uncovering insights from user-generated content (UGC) is important for researchers, as UGC provides rich information about consumers' experiences with quality.

Tirunillai and Tellis (2014) propose a unified framework:

  1. Extract the latent dimensions of quality from UGC

  2. Ascertain the valence, labels, validity, importance, dynamics, and heterogeneity of those dimensions

  3. Use those dimensions for strategy analysis (e.g., brand positioning)


LDA Model

LDA is a generative probabilistic model used primarily for topic modeling in text. It helps us discover the hidden thematic structure in a large collection of documents.

Key Idea

  • Every document is a mixture of topics.

  • Every topic is a mixture of words.

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document.

LDA R Code Example

There are many excellent R packages that implement LDA. For example, this blog post tutorial by Julia Silge walks through how to build a structural topic model and then how to understand and interpret it.

library(stm)

# lyrics_data should be a document-term matrix, e.g., a quanteda dfm
# or a sparse matrix built with tidytext::cast_sparse()
topic_model <- stm(lyrics_data, K = 4)
  • The most important parameter when training a topic model is K, the number of topics.

  • This is like k in k-means clustering in that it is a hyperparameter of the model, and we must choose its value ahead of time.

  • We can find the best value for K using data-driven methods; see the sketch below.
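
For example, stm provides searchK(), which fits one model per candidate K and reports diagnostics such as held-out likelihood and semantic coherence. A minimal sketch, assuming docs is the output of stm::prepDocuments() on your corpus:

library(stm)
# Fit a model for each candidate K and collect diagnostics
k_search <- searchK(docs$documents, docs$vocab, K = c(3, 4, 5, 6))
plot(k_search)  # compare diagnostics across the candidate values of K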

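Once the model is fitted, both mixtures from the Key Idea can be inspected directly; a sketch using tidytext's tidy() method for stm objects, with topic_model from above:

library(tidytext)
# "beta": each topic as a probability distribution over words
word_topics <- tidy(topic_model, matrix = "beta")
# "gamma": each document as a probability distribution over topics
doc_topics <- tidy(topic_model, matrix = "gamma")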

Dirichlet and Categorical distributions

If you are familiar with the Beta and Bernoulli distributions, it is not hard to understand the Dirichlet and Categorical distributions, which are their natural extensions.

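For reference, here are the standard forms (textbook definitions, stated for completeness):

  • Bernoulli: $P(x \mid p) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$; its conjugate prior is the Beta distribution.

  • Categorical: $P(x = k \mid \theta) = \theta_k$ for $k = 1, \dots, K$, with $\sum_{k=1}^{K} \theta_k = 1$; it generalizes the Bernoulli from two outcomes to $K$.

  • Beta: $f(x; a, b) = \frac{x^{a-1}(1 - x)^{b-1}}{B(a, b)}$ on $[0, 1]$.

  • Dirichlet: $f(\theta; \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ on the $(K-1)$-simplex; it generalizes the Beta from two dimensions to $K$ and is the conjugate prior of the Categorical.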

Mathematical Components of LDA

0. Model Setup:

  • Topics: K topics.
  • Documents: D documents.
  • Words: V words in the vocabulary.

1. Topic Distribution for Each Document:

  • Each document d has a topic distribution $\theta_d$.
  • $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
  • Here, $\alpha$ is a K-dimensional vector, where K is the number of topics.

2. Word Distribution for Each Topic:

  • Each topic k has a word distribution $\phi_k$.
  • $\phi_k \sim \mathrm{Dirichlet}(\beta)$.
  • $\beta$ is a V-dimensional vector, where V is the vocabulary size.

3. Generative Process for Each Document:

  • For each document d:

    • Draw a topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.

    • For each word n in document d:

      • Draw a topic $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$.
      • Draw a word $w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})$.
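
The generative process above is short enough to simulate directly. A toy sketch in R, assuming gtools for rdirichlet() (all sizes are made-up illustration values):

library(gtools)  # for rdirichlet()
set.seed(1)
K <- 2; V <- 6; D <- 3; N <- 8             # topics, vocabulary, documents, words per doc
alpha <- rep(0.5, K); beta <- rep(0.5, V)  # symmetric Dirichlet hyperparameters
phi <- rdirichlet(K, beta)                 # K x V: one word distribution per topic
docs <- lapply(1:D, function(d) {
  theta_d <- as.vector(rdirichlet(1, alpha))             # topic mixture for document d
  z <- sample(1:K, N, replace = TRUE, prob = theta_d)    # topic for each word slot
  sapply(z, function(k) sample(1:V, 1, prob = phi[k, ])) # word drawn from its topic
})
docs  # each element is one simulated document, as a vector of word ids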

4. Inference Problem:

  • Compute posterior distribution:

    $$P(\theta, z, \phi \mid w, \alpha, \beta) = \frac{P(\theta, z, \phi, w \mid \alpha, \beta)}{P(w \mid \alpha, \beta)} = \frac{P(w \mid z, \phi)\, P(z \mid \theta)\, P(\theta \mid \alpha)\, P(\phi \mid \beta)}{P(w \mid \alpha, \beta)}$$

  • Exact computation is intractable because the denominator $P(w \mid \alpha, \beta)$ requires summing over all possible topic assignments, so approximate methods such as Gibbs sampling or variational inference are used; see the sketch below.
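
Both families of approximation are available off the shelf. A minimal sketch using the topicmodels R package and its bundled AssociatedPress document-term matrix (the subset size and k are arbitrary illustration values):

library(topicmodels)
data("AssociatedPress", package = "topicmodels")
# Variational EM (the default) and collapsed Gibbs sampling
lda_vem <- LDA(AssociatedPress[1:50, ], k = 4, method = "VEM")
lda_gibbs <- LDA(AssociatedPress[1:50, ], k = 4, method = "Gibbs",
                 control = list(seed = 1, iter = 500))
terms(lda_gibbs, 5)  # top 5 words per topic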

5. Hyperparameters α and β:

  • α and β are hyperparameters of the Dirichlet distributions.

  • α: Affects the mixture of topics in each document; a smaller α makes each document concentrate on fewer topics.

  • β: Affects the distribution of words in each topic; a smaller β makes each topic concentrate on fewer words. The sketch below illustrates this effect for α.
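
To see the effect of α, it helps to draw a few samples. A quick sketch using gtools::rdirichlet() (the gtools package is an assumption here; MCMCpack provides an equivalent function):

library(gtools)  # for rdirichlet()
set.seed(42)
# Small alpha: sparse mixtures, each document dominated by one or two topics
round(rdirichlet(3, alpha = rep(0.1, 4)), 2)
# Large alpha: smooth mixtures, all topics represented more evenly
round(rdirichlet(3, alpha = rep(10, 4)), 2)

The same intuition applies to β, read as topics over words instead of documents over topics.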

6. Why Dirichlet Distribution?

  • In LDA, topic assignments and words are assumed to be multinomially (categorically) distributed.

  • Dirichlet is the conjugate prior to the multinomial, which simplifies computation, especially for Bayesian inference.

  • $X \sim \mathrm{Dirichlet}(\alpha)$ with pdf $f(X; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, where $B(\alpha)$ is the multivariate Beta function.

  • The Dirichlet distribution is a generalization of the Beta distribution.

    • Binomial Likelihood 🤝 Beta Prior
    • Multinomial Likelihood 🤝 Dirichlet Prior
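
Concretely, conjugacy means the posterior stays in the Dirichlet family: if $\theta \sim \mathrm{Dirichlet}(\alpha)$ and we observe multinomial counts $n = (n_1, \dots, n_K)$, then

    $$\theta \mid n \sim \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K)$$

so updating the prior amounts to adding the observed counts to $\alpha$. This is what keeps the count-based updates in collapsed Gibbs sampling for LDA so cheap.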

Reference

Tirunillai, S., & Tellis, G. J. (2014). Mining marketing meaning from online chatter: Strategic brand analysis of big data using latent Dirichlet allocation. Journal of Marketing Research, 51(4), 463-479.