Notes for Variational Inference

Introduction

In modern Bayesian statistics, we often face posterior distributions that are difficult to compute. Let $p(z)$ be the prior density and $p(x \mid z)$ the likelihood. The standard approach to computing the posterior $p(z \mid x)$ is MCMC (e.g., Metropolis-Hastings, Gibbs sampling, and HMC). But MCMC has downsides:

  • Slow for big datasets or complex models.
  • Difficult to scale in the era of massive data.

Variational inference (VI) offers a faster alternative. The key difference between MCMC and VI is:

  • MCMC draws samples from a Markov chain that converges to the posterior.
  • VI solves an optimization problem.

The main idea behind VI is to use optimization. Specifically,

  • Step 1: posit a family of approximating densities $\mathcal{Q}$.

  • Step 2: find the member of $\mathcal{Q}$ that minimizes the KL divergence to the posterior, $q^*(z) = \operatorname*{argmin}_{q(z) \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$.

Key Idea: Rather than sampling, VI optimizes: it finds the best approximating distribution within a chosen family by minimizing a divergence.
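
As a concrete sketch of these two steps (not from the original notes), consider a toy conjugate model where the exact posterior is known, so the KL divergence can be evaluated and minimized directly over a Gaussian family. The model, the observation `x`, and the grid ranges below are assumptions made purely for illustration.

```python
import numpy as np

# Toy conjugate model (assumed for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
# With a single observation x, the exact posterior is N(x / 2, 1 / 2), so
# KL(q || posterior) is available in closed form for any Gaussian q.
x = 1.5
post_mu, post_var = x / 2.0, 0.5

def kl_gauss(m, s2, mu, var):
    """KL( N(m, s2) || N(mu, var) ) between two univariate Gaussians."""
    return 0.5 * (np.log(var / s2) + (s2 + (m - mu) ** 2) / var - 1.0)

# Step 1: posit a family Q = { N(m, s^2) } indexed by (m, s).
# Step 2: find the member of Q minimizing KL(q || p(z | x)) -- here by grid search.
ms = np.linspace(-2.0, 2.0, 201)
ss = np.linspace(0.1, 2.0, 191)
best = min(((kl_gauss(m, s ** 2, post_mu, post_var), m, s)
            for m in ms for s in ss), key=lambda t: t[0])
print("best q: mean = %.3f, sd = %.3f, KL = %.4f" % (best[1], best[2], best[0]))
# Because Q contains the true posterior here, the minimizer essentially recovers it.
```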

VI vs MCMC: When to Use Which?

[Figure: when to use VI vs. MCMC]

Pros and Cons of VI

  • Pros:

    • Much faster than MCMC.
    • Easy to scale with stochastic optimization and distributed computation.
  • Cons:

    • VI underestimates posterior variance (it tends to be “overconfident”).
    • It does not guarantee exact samples from the true posterior.

Variational Inference

Recall that in the Bayesian framework,

$$p(z \mid x) = \frac{p(z, x)}{p(x)} \propto p(x \mid z)\, p(z).$$

We try to avoid calculating the denominator $p(x)$, the marginal likelihood (also called the evidence), as it requires computing high-dimensional integrals.

Variational inference turns Bayesian inference into an optimization problem by minimizing KL divergence within a simpler family of distributions, typically using coordinate ascent to maximize the evidence lower bound (ELBO).

First, the optimization goal is

$$q^*(z) = \operatorname*{argmin}_{q(z) \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big),$$

where $q^*$ is the best approximation within the family $\mathcal{Q}$.

$$
\begin{aligned}
\mathrm{KL}\big(q(\boldsymbol{z}) \,\|\, p(\boldsymbol{z} \mid \boldsymbol{x})\big)
&= \int_{\boldsymbol{z}} q(\boldsymbol{z}) \log\!\left[\frac{q(\boldsymbol{z})}{p(\boldsymbol{z} \mid \boldsymbol{x})}\right] d\boldsymbol{z} \\
&= \int_{\boldsymbol{z}} q(\boldsymbol{z}) \log q(\boldsymbol{z})\, d\boldsymbol{z} - \int_{\boldsymbol{z}} q(\boldsymbol{z}) \log p(\boldsymbol{z} \mid \boldsymbol{x})\, d\boldsymbol{z} \\
&= \mathbb{E}_q[\log q(\boldsymbol{z})] - \mathbb{E}_q[\log p(\boldsymbol{z} \mid \boldsymbol{x})] \\
&= \mathbb{E}_q[\log q(\boldsymbol{z})] - \mathbb{E}_q\!\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{x})}\right] \\
&= \mathbb{E}_q[\log q(\boldsymbol{z})] - \mathbb{E}_q[\log p(\boldsymbol{x}, \boldsymbol{z})] + \mathbb{E}_q[\log p(\boldsymbol{x})] \\
&= \mathbb{E}_q[\log q(\boldsymbol{z})] - \mathbb{E}_q[\log p(\boldsymbol{x}, \boldsymbol{z})] + \log p(\boldsymbol{x})
\end{aligned}
$$

Note that $\log p(\boldsymbol{x})$ does not depend on $q(\cdot)$, so we can ignore it in the optimization. We define the evidence lower bound (ELBO) as

$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log p(\boldsymbol{x}, \boldsymbol{z})] - \mathbb{E}_q[\log q(\boldsymbol{z})].$$

This quantity is called the evidence lower bound because it is a lower bound on the log evidence $\log p(\boldsymbol{x})$:

$$\log p(\boldsymbol{x}) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(\boldsymbol{z}) \,\|\, p(\boldsymbol{z} \mid \boldsymbol{x})\big) \ge \mathrm{ELBO}(q).$$

The inequality holds because $\mathrm{KL}(\cdot \,\|\, \cdot) \ge 0$ by Jensen's inequality.

Therefore, maximizing ELBO is equivalent to minimizing KL divergence.
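
As a sanity check of the identity $\log p(\boldsymbol{x}) = \mathrm{ELBO}(q) + \mathrm{KL}(q \,\|\, p(\boldsymbol{z} \mid \boldsymbol{x}))$, here is a small numerical sketch for an assumed toy conjugate Gaussian model in which the evidence, the posterior, and the KL term are all available in closed form; the ELBO is estimated by Monte Carlo. All values below are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1), with one observation x.
# Then p(x) = N(x; 0, 2) and p(z | x) = N(x / 2, 1 / 2) exactly.
x = 1.5
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

# An arbitrary Gaussian variational distribution q(z) = N(m, s^2).
m, s = 0.3, 0.9

# Monte Carlo estimate of ELBO(q) = E_q[log p(x, z)] - E_q[log q(z)].
z = rng.normal(m, s, size=200_000)
log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)  # log p(z) + log p(x | z)
elbo = np.mean(log_joint - norm.logpdf(z, m, s))

# Closed-form KL(q || p(z | x)) between two univariate Gaussians.
mu, var = x / 2.0, 0.5
kl = 0.5 * (np.log(var / s ** 2) + (s ** 2 + (m - mu) ** 2) / var - 1.0)

print(f"log p(x)  = {log_evidence:.4f}")
print(f"ELBO + KL = {elbo + kl:.4f}")  # should agree up to Monte Carlo error
```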

$$
\begin{aligned}
q^*(\boldsymbol{z}) &= \operatorname*{argmin}_{q(\boldsymbol{z}) \in \mathcal{Q}} \mathrm{KL}\big(q(\boldsymbol{z}) \,\|\, p(\boldsymbol{z} \mid \boldsymbol{x})\big) \\
&= \operatorname*{argmax}_{q(\boldsymbol{z}) \in \mathcal{Q}} \mathrm{ELBO}(q) \\
&= \operatorname*{argmax}_{q(\boldsymbol{z}) \in \mathcal{Q}} \big\{ \mathbb{E}_q[\log p(\boldsymbol{x}, \boldsymbol{z})] - \mathbb{E}_q[\log q(\boldsymbol{z})] \big\}
\end{aligned}
$$

What is the intuition for ELBO(q)?

$$
\begin{aligned}
\mathrm{ELBO}(q) &= \mathbb{E}_q[\log p(\boldsymbol{x}, \boldsymbol{z})] - \mathbb{E}_q[\log q(\boldsymbol{z})] \\
&= \mathbb{E}_q[\log p(\boldsymbol{z})] + \mathbb{E}_q[\log p(\boldsymbol{x} \mid \boldsymbol{z})] - \mathbb{E}_q[\log q(\boldsymbol{z})] \\
&= \mathbb{E}_q[\log p(\boldsymbol{x} \mid \boldsymbol{z})] - \mathrm{KL}\big(q(\boldsymbol{z}) \,\|\, p(\boldsymbol{z})\big).
\end{aligned}
$$

  • The first term tries to maximize the (expected log-)likelihood.

  • The second term encourages the density $q(\cdot)$ to stay close to the prior.

  • The ELBO therefore balances fit to the data against closeness to the prior (see the numerical check below).
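
The two forms of the ELBO can also be compared numerically. The sketch below (same assumed toy model as before, illustrative values only) estimates $\mathbb{E}_q[\log p(\boldsymbol{x}, \boldsymbol{z})] - \mathbb{E}_q[\log q(\boldsymbol{z})]$ by Monte Carlo and checks it against $\mathbb{E}_q[\log p(\boldsymbol{x} \mid \boldsymbol{z})] - \mathrm{KL}(q \,\|\, p(\boldsymbol{z}))$ with the KL term in closed form.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Same assumed toy model: z ~ N(0, 1) (prior), x | z ~ N(z, 1), one observation.
x = 1.5
m, s = 0.3, 0.9                          # variational q(z) = N(m, s^2)
z = rng.normal(m, s, size=200_000)

# Form 1: ELBO(q) = E_q[log p(x, z)] - E_q[log q(z)]
elbo_joint = np.mean(norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)
                     - norm.logpdf(z, m, s))

# Form 2: ELBO(q) = E_q[log p(x | z)] - KL(q(z) || p(z)), with the KL in closed form.
kl_to_prior = 0.5 * (np.log(1.0 / s ** 2) + s ** 2 + m ** 2 - 1.0)
elbo_split = np.mean(norm.logpdf(x, z, 1.0)) - kl_to_prior

print(elbo_joint, elbo_split)            # agree up to Monte Carlo error
# Moving q toward the data raises the likelihood term, while drifting away
# from the prior increases the KL penalty; the ELBO trades these off.
```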

Mean-Field Variational Family

In mean field variational inference, we assume that the variational family factorizes,

$$q(z_1, \ldots, z_m) = \prod_{j=1}^{m} q_j(z_j),$$

that is, each latent variable is independent under $q$.
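
In code, a mean-field family is simply a product of independent factors; the joint log-density of $q$ is the sum of the factor log-densities. The Gaussian factors and parameter values below are illustrative assumptions, not a prescribed choice.

```python
import numpy as np
from scipy.stats import norm

# Assumed mean-field family: q(z_1, ..., z_m) = prod_j N(z_j; mu_j, sigma_j^2).
mu = np.array([0.5, -1.0, 2.0])      # one variational mean per factor (illustrative)
sigma = np.array([1.0, 0.3, 0.8])    # one variational sd per factor (illustrative)

def sample_q(rng, n):
    """Sample from q by sampling each factor independently."""
    return rng.normal(mu, sigma, size=(n, mu.size))

def log_q(z):
    """log q(z) = sum_j log q_j(z_j), since the factors are independent."""
    return norm.logpdf(z, mu, sigma).sum(axis=-1)

rng = np.random.default_rng(0)
z = sample_q(rng, 5)
print(log_q(z))                      # one joint log-density per sample
```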

Coordinate ascent algorithm

We will use coordinate ascent variational inference (CAVI), iteratively optimizing each variational factor while holding the others fixed.

Coordinate ascent converges to a local maximum of the ELBO. The resulting $q^*$ is then used as a proxy for the true posterior.

There is a strong relationship between this algorithm and Gibbs sampling.

  • In Gibbs sampling, we sample from the complete conditional $p(z_k \mid \boldsymbol{z}_{-k}, \boldsymbol{x})$.

  • In coordinate ascent variational inference, we iteratively set each factor to

$$q^*_k(z_k) \propto \exp\big\{\mathbb{E}_{-k}\big[\log p(z_k \mid \boldsymbol{z}_{-k}, \boldsymbol{x})\big]\big\},$$

where $\mathbb{E}_{-k}$ denotes the expectation over all other factors $q_j$, $j \neq k$ (see the worked example below).
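
To make the update concrete, here is a sketch of CAVI for the standard textbook example of a univariate Gaussian with unknown mean $\mu$ and precision $\tau$ under a conjugate Normal-Gamma prior, where both coordinate updates have closed forms. The data, prior settings, and variable names are assumptions chosen for the illustration, not part of these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # observed data (illustrative)
N, xbar = x.size, x.mean()

# Assumed Normal-Gamma prior: mu | tau ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, rate=b0).
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# Mean-field family: q(mu, tau) = q(mu) q(tau), with
# q(mu) = N(mu_n, 1 / lam_n) and q(tau) = Gamma(a_n, rate=b_n).
mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)    # available immediately; never changes
a_n = a0 + (N + 1) / 2.0                       # available immediately; never changes
lam_n, b_n = 1.0, 1.0                          # initial guesses for the coupled updates

for _ in range(20):                            # CAVI iterations
    # Update q(mu) using E_q[tau] = a_n / b_n.
    lam_n = (lam0 + N) * a_n / b_n
    # Update q(tau) using E_q[mu] = mu_n and Var_q[mu] = 1 / lam_n.
    e_data_dev = np.sum((x - mu_n) ** 2) + N / lam_n     # E_q[ sum_i (x_i - mu)^2 ]
    e_prior_dev = (mu_n - mu0) ** 2 + 1.0 / lam_n        # E_q[ (mu - mu0)^2 ]
    b_n = b0 + 0.5 * (e_data_dev + lam0 * e_prior_dev)

print("E_q[mu]  =", mu_n)
print("E_q[tau] =", a_n / b_n, "(true precision:", 1.0 / 1.5 ** 2, ")")
```

Only $\lambda_N$ and $b_N$ depend on each other, so the loop typically converges in a handful of iterations; each update is an instance of $q^*_k(z_k) \propto \exp\{\mathbb{E}_{-k}[\log p(z_k \mid \boldsymbol{z}_{-k}, \boldsymbol{x})]\}$ evaluated in closed form.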
