A Story Behind Maximum Likelihood

TL;DR

What are we actually doing in MLE? What is the motivation behind Maximum Likelihood Estimation? In this post, we will go over the story behind maximum likelihood.

We’ll first talk about the total variation distance, a very intuitive measure that tells you how “close” our estimator is to the true parameter.

Next, we’ll move to the KL divergence, which we use to replace the total variation distance.

Finally, we’ll minimize the KL divergence to get a “good” estimator; this is where the maximum likelihood principle comes in.

Total Variation Distance

How do we define a “good” estimator?

A “good” estimator should be very “close” to the true parameter, shouldn’t it?

[Figure: Motivation]

[Figure: Formal setting]
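For concreteness, here is a minimal version of the usual setting (the sample-space symbol $E$ is notation introduced here): we observe

$$X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \mathbb{P}_{\theta^*},$$

where $\{\mathbb{P}_{\theta} : \theta \in \Theta\}$ is a family of candidate distributions on a sample space $E$ and $\theta^* \in \Theta$ is the unknown true parameter. We want an estimator $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$ such that $\mathbb{P}_{\hat{\theta}}$ is “close” to $\mathbb{P}_{\theta^*}$.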

Definition of Total Variation Distance

[Figure: Definition of the total variation (TV) distance]
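In symbols, the standard definition (writing $\mathcal{A}$ for the collection of measurable events) is

$$TV(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) = \sup_{A \in \mathcal{A}} \left| \mathbb{P}_{\theta}(A) - \mathbb{P}_{\theta'}(A) \right|,$$

i.e. the largest possible disagreement between the two distributions on any single event.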

This definition is very intuitive, but it is pretty strong and hard to calculate, right? You have to maximize over all possible sets. The good news is that we also have the following formulation:

[Figure: Formal statement of the TV identity]
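The statement, assuming $\mathbb{P}_{\theta}$ and $\mathbb{P}_{\theta'}$ have densities $f_{\theta}$ and $f_{\theta'}$, is

$$TV(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) = \frac{1}{2} \int \left| f_{\theta}(x) - f_{\theta'}(x) \right| \, dx,$$

with the integral replaced by a sum in the discrete case.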

Before the proof, let’s use a graph to get some feeling for this equation. Where does the 1/2 come from?

[Figure: Area between the density curves]
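The message of such a picture: because $\int f = \int g = 1$, the region where $f$ lies above $g$ and the region where $g$ lies above $f$ must have equal areas,

$$\int_{\{f > g\}} (f - g) = \int_{\{g > f\}} (g - f),$$

so the total area between the curves, $\int |f - g|$, splits into two equal halves. The factor $\frac{1}{2}$ picks out one of them, and the proof below shows that this one-sided area is exactly $TV$.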

Proof:

[Figure: Proof of the TV identity]
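A short sketch of the standard argument, writing $f = f_{\theta}$ and $g = f_{\theta'}$ for the two densities: for any event $A$,

$$\mathbb{P}_{\theta}(A) - \mathbb{P}_{\theta'}(A) = \int_{A} (f - g) \le \int_{A \cap \{f > g\}} (f - g) \le \int_{\{f > g\}} (f - g),$$

with equality when $A = A^* = \{x : f(x) > g(x)\}$; by symmetry, the same bound (with $f$ and $g$ swapped) controls the other sign. Hence the supremum defining $TV$ is attained at $A^*$, and since the two one-sided areas are equal, $TV(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) = \int_{\{f > g\}} (f - g) = \frac{1}{2} \int |f - g|$.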

Note that the key step in the proof is the observation $$\int f = \int g = 1 \implies \int (f - g) = 0,$$ from which $$\int_{\{x:\, f - g > 0\}} (f - g) = \int_{\{x:\, f - g < 0\}} (g - f).$$

Properties of TV

Total variation is symmetric, non-negative, and definite, and it satisfies the triangle inequality.

[Figure: Properties of total variation]
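Spelled out, for probability measures $\mathbb{P}$, $\mathbb{Q}$, $\mathbb{R}$:

  • Symmetry: $TV(\mathbb{P}, \mathbb{Q}) = TV(\mathbb{Q}, \mathbb{P})$.

  • Non-negativity: $0 \le TV(\mathbb{P}, \mathbb{Q}) \le 1$.

  • Definiteness: $TV(\mathbb{P}, \mathbb{Q}) = 0$ if and only if $\mathbb{P} = \mathbb{Q}$.

  • Triangle inequality: $TV(\mathbb{P}, \mathbb{R}) \le TV(\mathbb{P}, \mathbb{Q}) + TV(\mathbb{Q}, \mathbb{R})$.

In other words, $TV$ is a genuine metric on the space of probability measures.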

Unclear how to estimate TV!

Our goal is to find the “good” estimator.


If we use the total variation distance to describe this “closeness”, we need to:

  1. Build an estimator $\widehat{TV}(\mathbb{P}_{\theta}, \mathbb{P}_{\theta^*})$.

  2. Find the $\hat{\theta}$ that minimizes this estimated distance.

However, it is unclear how to build $\widehat{TV}(\mathbb{P}_{\theta}, \mathbb{P}_{\theta^*})$!

  • We don’t know $f_{\theta^*}$ (the true parameter $\theta^*$ is unknown), and it is very hard to manipulate the integral of the density difference.

  • The common strategy is to replace an expectation ($E(\cdot)$) with an empirical average ($\frac{1}{n}\sum_{i=1}^{n}(\cdot)$), but there is no clear expectation in $TV(\cdot)$; the note right after this list spells this out.
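To make the difficulty concrete (this rewriting assumes $f_{\theta}$ vanishes wherever $f_{\theta^*}$ does, so that the ratio is well defined), we can force an expectation to appear:

$$TV(\mathbb{P}_{\theta}, \mathbb{P}_{\theta^*}) = \frac{1}{2}\int \left| f_{\theta} - f_{\theta^*} \right| = \frac{1}{2}\, E_{\theta^*}\!\left[ \left| \frac{f_{\theta}(X)}{f_{\theta^*}(X)} - 1 \right| \right],$$

but the unknown $f_{\theta^*}$ still sits inside the expectation, so replacing $E_{\theta^*}$ with $\frac{1}{n}\sum_{i=1}^{n}$ does not get rid of it. The KL divergence below avoids exactly this problem: the part involving the unknown $\theta^*$ separates off as an additive constant.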

Due to the above difficulties, we need a more convenient notion of distance between probability measures to replace total variation. This is the motivation for the KL divergence.

REMARK:

The total variation distance, which describes the worst-case disagreement over events, is very intuitive and has a clear interpretation. However, it is hard to build an estimator for it. That’s why we move to the KL divergence.

KL divergence

Let’s check the definition first:

[Figure: Definition of the KL divergence]
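A standard way to write the definition, assuming $\mathbb{P}$ and $\mathbb{Q}$ have densities $f$ and $g$:

$$KL(\mathbb{P}, \mathbb{Q}) = \int f(x) \log \frac{f(x)}{g(x)} \, dx = E_{\mathbb{P}}\!\left[ \log \frac{f(X)}{g(X)} \right],$$

with the convention $KL(\mathbb{P}, \mathbb{Q}) = +\infty$ when $\mathbb{P}$ puts mass where $\mathbb{Q}$ does not (and with a sum in place of the integral in the discrete case). Note that, unlike $TV$, this is written directly as an expectation under $\mathbb{P}$.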

[Figure: Properties of the KL divergence]
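In short, the standard properties are:

  • $KL(\mathbb{P}, \mathbb{Q}) \ne KL(\mathbb{Q}, \mathbb{P})$ in general, and the triangle inequality can fail, so KL is a “divergence” rather than a distance.

  • $KL(\mathbb{P}, \mathbb{Q}) \ge 0$, with equality if and only if $\mathbb{P} = \mathbb{Q}$.

  • $KL(\mathbb{P}, \mathbb{Q})$ is an expectation under $\mathbb{P}$, which is exactly the feature we were missing in $TV$.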

The second property (non-negativity) can be shown using Jensen’s inequality, as in the short computation below.
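Here is that computation, using Jensen’s inequality with the convex function $-\log$ (writing $f$ and $g$ for the densities of $\mathbb{P}$ and $\mathbb{Q}$):

$$KL(\mathbb{P}, \mathbb{Q}) = E_{\mathbb{P}}\!\left[ -\log \frac{g(X)}{f(X)} \right] \;\ge\; -\log E_{\mathbb{P}}\!\left[ \frac{g(X)}{f(X)} \right] \;=\; -\log \int_{\{f > 0\}} g \;\ge\; -\log 1 = 0.$$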

KL 🤝 MLE

Now we are ready to introduce the maximum likelihood principle using the KL divergence.

[Figure: Connection between maximum likelihood and the KL divergence]
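In symbols, the chain of ideas is the following (a sketch of the standard argument, with $X_1, \dots, X_n$ i.i.d. from $\mathbb{P}_{\theta^*}$):

$$KL(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta}) = E_{\theta^*}\!\left[\log f_{\theta^*}(X)\right] - E_{\theta^*}\!\left[\log f_{\theta}(X)\right] = \text{constant in } \theta \; - \; E_{\theta^*}\!\left[\log f_{\theta}(X)\right].$$

So minimizing $KL(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta})$ over $\theta$ is the same as maximizing $E_{\theta^*}[\log f_{\theta}(X)]$. We cannot compute this expectation (it involves the unknown $\theta^*$), but it is an expectation, so we can replace it with the empirical average:

$$\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log f_{\theta}(X_i) = \arg\max_{\theta \in \Theta} \prod_{i=1}^{n} f_{\theta}(X_i),$$

which is exactly the maximum likelihood estimator.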

Summary

What are we actually doing in MLE?

The short answer is that we are minimizing the KL divergence.

Remember: Maximum Likelihood Estimation is just the empirical version of trying to minimize the KL-divergence!
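As a small numerical illustration (a sketch only: the Bernoulli model, the parameter grid, and all names below are chosen here for the example), the script compares the population objective $KL(\mathrm{Ber}(p^*), \mathrm{Ber}(p))$ with the empirical average negative log-likelihood over a grid of candidate $p$. Both are minimized at (essentially) the sample mean, which is the Bernoulli MLE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli model: true parameter p_star, n i.i.d. observations.
p_star = 0.3
n = 5000
x = rng.binomial(1, p_star, size=n)

# Candidate parameters (0 and 1 excluded so the logs stay finite).
grid = np.linspace(0.01, 0.99, 99)

# Population objective: KL(Ber(p_star), Ber(p)) for each candidate p.
kl = p_star * np.log(p_star / grid) + (1 - p_star) * np.log((1 - p_star) / (1 - grid))

# Empirical objective: average negative log-likelihood of the sample under Ber(p).
nll = -(x.mean() * np.log(grid) + (1 - x.mean()) * np.log(1 - grid))

print("argmin of KL       :", grid[np.argmin(kl)])
print("argmin of avg NLL  :", grid[np.argmin(nll)])
print("sample mean (= MLE):", x.mean())
# Up to sampling noise, nll(p) differs from kl(p) by a constant in p
# (the entropy of Ber(p_star)), so both are minimized at nearly the same p.
```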
