Introduction to Latent Dirichlet Allocation (LDA) Model

Motivation

Uncovering insights from user-generated content (UGC) is important for researchers, as UGC provides rich information about consumers' experiences with quality.

Tirunillai and Tellis (2014) propose a unified framework:

  1. Extract the latent dimensions of quality from UGC

  2. Ascertain the valence, labels, validity, importance, dynamics, and heterogeneity of those dimensions

  3. Use those dimensions for strategy analysis (e.g., brand positioning)


LDA Model

LDA is a generative probabilistic model used primarily for topic modeling in text. It helps us discover the hidden thematic structure in a large collection of documents.

Key Idea

  • Every document is a mixture of topics.

  • Every topic is a mixture of words.

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document.

LDA R Code Example

There are many excellent R packages that implement LDA. For example, this blog post tutorial by Julia Silge walks through how to build a structural topic model and then how to understand and interpret it.

library(stm)

# lyrics_data should be a document-term matrix, e.g., a quanteda dfm
# or a sparse matrix built with tidytext::cast_sparse()
topic_model <- stm(lyrics_data, K = 4)
  • The most important parameter when training a topic model is K, the number of topics.

  • This is like k in k-means clustering in that it is a hyperparameter of the model, and we must choose its value ahead of time.

  • We can find the best value for K using data-driven methods; see the sketch below.
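
For example, stm provides searchK(), which fits one model per candidate K and reports diagnostics such as held-out likelihood and semantic coherence. A minimal sketch, assuming docs is the output of stm::prepDocuments() on your corpus:

library(stm)
# Fit a model for each candidate K and collect diagnostics
k_search <- searchK(docs$documents, docs$vocab, K = c(3, 4, 5, 6))
plot(k_search)  # compare diagnostics across the candidate values of K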

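Once the model is fitted, both mixtures from the Key Idea can be inspected directly; a sketch using tidytext's tidy() method for stm objects, with topic_model from above:

library(tidytext)
# "beta": each topic as a probability distribution over words
word_topics <- tidy(topic_model, matrix = "beta")
# "gamma": each document as a probability distribution over topics
doc_topics <- tidy(topic_model, matrix = "gamma")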

Dirichlet and Categorical distributions

If you are familiar with the Beta and Bernoulli distributions, it is not hard to understand the Dirichlet and Categorical distributions, which are their natural extensions.

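For reference, here are the standard forms (textbook definitions, stated for completeness):

  • Bernoulli: $P(x \mid p) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$; its conjugate prior is the Beta distribution.

  • Categorical: $P(x = k \mid \theta) = \theta_k$ for $k = 1, \dots, K$, with $\sum_{k=1}^{K} \theta_k = 1$; it generalizes the Bernoulli from two outcomes to $K$.

  • Beta: $f(x; a, b) = \frac{x^{a-1}(1 - x)^{b-1}}{B(a, b)}$ on $[0, 1]$.

  • Dirichlet: $f(\theta; \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ on the $(K-1)$-simplex; it generalizes the Beta from two dimensions to $K$ and is the conjugate prior of the Categorical.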

Mathematical Components of LDA

0. Model Setup:

  • Topics: K topics.
  • Documents: D documents.
  • Words: V words in the vocabulary.

1. Topic Distribution for Each Document:

  • Each document d has a topic distribution $\theta_d$.
  • $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.
  • Here, $\alpha$ is a K-dimensional vector, where K is the number of topics.

2. Word Distribution for Each Topic:

  • Each topic k has a word distribution $\phi_k$.
  • $\phi_k \sim \mathrm{Dirichlet}(\beta)$.
  • $\beta$ is a V-dimensional vector, where V is the vocabulary size.

3. Generative Process for Each Document:

  • For each document d:

    • Draw a topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$.

    • For each word n in document d:

      • Draw a topic $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$.
      • Draw a word $w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})$.
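
The generative process above is short enough to simulate directly. A toy sketch in R, assuming gtools for rdirichlet() (all sizes are made-up illustration values):

library(gtools)  # for rdirichlet()
set.seed(1)
K <- 2; V <- 6; D <- 3; N <- 8             # topics, vocabulary, documents, words per doc
alpha <- rep(0.5, K); beta <- rep(0.5, V)  # symmetric Dirichlet hyperparameters
phi <- rdirichlet(K, beta)                 # K x V: one word distribution per topic
docs <- lapply(1:D, function(d) {
  theta_d <- as.vector(rdirichlet(1, alpha))             # topic mixture for document d
  z <- sample(1:K, N, replace = TRUE, prob = theta_d)    # topic for each word slot
  sapply(z, function(k) sample(1:V, 1, prob = phi[k, ])) # word drawn from its topic
})
docs  # each element is one simulated document, as a vector of word ids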

4. Inference Problem:

  • Compute posterior distribution:

    $$P(\theta, z, \phi \mid w, \alpha, \beta) = \frac{P(\theta, z, \phi, w \mid \alpha, \beta)}{P(w \mid \alpha, \beta)} = \frac{P(w \mid z, \phi)\, P(z \mid \theta)\, P(\theta \mid \alpha)\, P(\phi \mid \beta)}{P(w \mid \alpha, \beta)}$$

  • Exact computation is intractable because the denominator $P(w \mid \alpha, \beta)$ requires summing over all possible topic assignments, so approximate methods such as Gibbs sampling or variational inference are used; see the sketch below.
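
Both families of approximation are available off the shelf. A minimal sketch using the topicmodels R package and its bundled AssociatedPress document-term matrix (the subset size and k are arbitrary illustration values):

library(topicmodels)
data("AssociatedPress", package = "topicmodels")
# Variational EM (the default) and collapsed Gibbs sampling
lda_vem <- LDA(AssociatedPress[1:50, ], k = 4, method = "VEM")
lda_gibbs <- LDA(AssociatedPress[1:50, ], k = 4, method = "Gibbs",
                 control = list(seed = 1, iter = 500))
terms(lda_gibbs, 5)  # top 5 words per topic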

5. Hyperparameters α and β:

  • α and β are hyperparameters of the Dirichlet distributions.

  • α: Affects the mixture of topics in each document; a smaller α makes each document concentrate on fewer topics.

  • β: Affects the distribution of words in each topic; a smaller β makes each topic concentrate on fewer words. The sketch below illustrates this effect for α.
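
To see the effect of α, it helps to draw a few samples. A quick sketch using gtools::rdirichlet() (the gtools package is an assumption here; MCMCpack provides an equivalent function):

library(gtools)  # for rdirichlet()
set.seed(42)
# Small alpha: sparse mixtures, each document dominated by one or two topics
round(rdirichlet(3, alpha = rep(0.1, 4)), 2)
# Large alpha: smooth mixtures, all topics represented more evenly
round(rdirichlet(3, alpha = rep(10, 4)), 2)

The same intuition applies to β, read as topics over words instead of documents over topics.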

6. Why Dirichlet Distribution?

  • In LDA, topic assignments and words are assumed to be multinomially (categorically) distributed.

  • Dirichlet is the conjugate prior to the multinomial, which simplifies computation, especially for Bayesian inference.

  • $X \sim \mathrm{Dirichlet}(\alpha)$ with pdf $f(X; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$, where $B(\alpha)$ is the multivariate Beta function.

  • The Dirichlet distribution is a generalization of the Beta distribution.

    • Binomial Likelihood 🤝 Beta Prior
    • Multinomial Likelihood 🤝 Dirichlet Prior
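
Concretely, conjugacy means the posterior stays in the Dirichlet family: if $\theta \sim \mathrm{Dirichlet}(\alpha)$ and we observe multinomial counts $n = (n_1, \dots, n_K)$, then

    $$\theta \mid n \sim \mathrm{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K)$$

so updating the prior amounts to adding the observed counts to $\alpha$. This is what keeps the count-based updates in collapsed Gibbs sampling for LDA so cheap.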

Reference

Tirunillai, S., & Tellis, G. J. (2014). Mining marketing meaning from online chatter: Strategic brand analysis of big data using latent Dirichlet allocation. Journal of Marketing Research, 51(4), 463-479.