A Story Behind Maximum Likelihood
TL;DR
What are we actually doing when we use MLE? What is the motivation behind Maximum Likelihood Estimation? In this post, we will go over the story behind maximum likelihood.
We’ll first talk about the total variation distance, a very intuitive measure of how “close” our estimator is to the true parameter.
Next, we’ll move to the KL divergence, which replaces the total variation distance.
In the end, we’ll minimize the KL divergence to get a “good” estimator, and this is where the maximum likelihood principle comes in.
Total Variation Distance
How do we define a “good” estimator?
A “good” estimator should be very “close” to the true parameter, shouldn’t it?
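Here is the usual setup we will assume throughout (a standard sketch, for notation only): we have a statistical model $(E, (\mathbb{P}_{\theta})_{\theta \in \Theta})$ and observe i.i.d. samples $X_1, \dots, X_n \sim \mathbb{P}_{\theta^*}$, where the true parameter $\theta^* \in \Theta$ is unknown. An estimator $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$ is “good” if the distribution $\mathbb{P}_{\hat{\theta}}$ it induces is close to $\mathbb{P}_{\theta^*}$, so we need a way to measure how close two probability measures are.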
Definition of Total Variation Distance
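For two probability measures $\mathbb{P}_{\theta}$ and $\mathbb{P}_{\theta'}$ on $E$, the total variation distance is defined as
$$TV(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) = \max_{A \subseteq E} \left| \mathbb{P}_{\theta}(A) - \mathbb{P}_{\theta'}(A) \right|,$$
i.e. the largest possible disagreement between the two measures on any event $A$.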
This definition is very intuitive, but it is pretty strong and hard to calculate, right? You have to find the maximum over all possible sets. The good news is that we also have the following equivalent formulation:
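If $\mathbb{P}_{\theta}$ and $\mathbb{P}_{\theta'}$ have densities $f_{\theta}$ and $f_{\theta'}$ (the discrete case is analogous, with sums instead of integrals), then
$$TV(\mathbb{P}_{\theta}, \mathbb{P}_{\theta'}) = \frac{1}{2} \int_E \left| f_{\theta}(x) - f_{\theta'}(x) \right| \, dx.$$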
Before the proof, let’s get some feeling for the equation by picturing the two densities plotted on the same axes. Where does the 1/2 come from? The region where $f_{\theta} > f_{\theta'}$ and the region where $f_{\theta} < f_{\theta'}$ enclose equal areas between the two curves (both densities integrate to 1), so $\int |f_{\theta} - f_{\theta'}|$ counts the discrepancy between the curves twice, and the 1/2 corrects for this double counting.
Proof:
Note that the key part of the proof is the observation (writing $f = f_{\theta}$ and $g = f_{\theta'}$ for short): $$\int f = \int g = 1 \implies \int (f - g) = 0,$$ and therefore $$\int_{\{x:\, f - g > 0\}} (f - g) = \int_{\{x:\, f - g < 0\}} (g - f).$$
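From this observation the rest of the argument is short. For any event $A$,
$$\mathbb{P}_{\theta}(A) - \mathbb{P}_{\theta'}(A) = \int_A (f - g) \le \int_{\{x:\, f - g > 0\}} (f - g) = \frac{1}{2} \int |f - g|,$$
where the inequality holds because dropping the part of $A$ on which $f - g < 0$ and enlarging to the whole positive region can only increase the integral, and the last equality holds because, by the observation, the positive and negative parts contribute equally to $\int |f - g|$. Equality is attained at $A = \{x : f - g > 0\}$, and the same bound holds with the roles of $f$ and $g$ swapped, so taking the maximum over $A$ gives exactly $\frac{1}{2} \int |f - g|$.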
Properties of TV
Total variation is symmetric, non-negative, definite and satisfies the triangle inequality.
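Written out, for probability measures $\mathbb{P}$, $\mathbb{Q}$, $\mathbb{R}$:
$$TV(\mathbb{P}, \mathbb{Q}) = TV(\mathbb{Q}, \mathbb{P}), \qquad 0 \le TV(\mathbb{P}, \mathbb{Q}) \le 1,$$
$$TV(\mathbb{P}, \mathbb{Q}) = 0 \iff \mathbb{P} = \mathbb{Q}, \qquad TV(\mathbb{P}, \mathbb{R}) \le TV(\mathbb{P}, \mathbb{Q}) + TV(\mathbb{Q}, \mathbb{R}).$$
In other words, $TV$ is a genuine distance between probability measures.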
Unclear how to estimate TV!
Our goal is to find a “good” estimator.
If we use the total variation distance to describe “closeness”, we need to:
- Build an estimator $\widehat{TV}(\mathbb{P}_{\theta}, \mathbb{P}_{\theta^*})$ of $TV(\mathbb{P}_{\theta}, \mathbb{P}_{\theta^*})$, viewed as a function of $\theta$.
- Find the $\hat{\theta}$ that minimizes this function.
However, it is unclear how to build $\widehat{TV}(\mathbb{P}_{\theta}, \mathbb{P}_{\theta^*})$!
- We don’t know $f_{\theta^*}$ (the true parameter $\theta^*$ is unknown), and it is very hard to manipulate the integral of the density difference.
- The common strategy is to replace an expectation ($E(\cdot)$) with an average ($\frac{1}{n}\sum_{i=1}^{n}(\cdot)$), but there is no clear expectation inside $TV(\cdot)$.
Due to the difficulties above, we need a more convenient notion of distance between probability measures to replace the total variation. This is the motivation for the KL divergence.
REMARK:
The total variation distance, which describes the “worst-case” disagreement between two distributions, is very intuitive and has a clear interpretation. However, it is hard to build an estimator for it. That’s why we move to the KL divergence.
KL divergence
Let’s check the definition first.
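In the continuous case, with densities $f_{\theta^*}$ and $f_{\theta}$ (sums replace integrals in the discrete case), the KL divergence is
$$KL(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta}) = \int_E f_{\theta^*}(x) \log \frac{f_{\theta^*}(x)}{f_{\theta}(x)} \, dx = E_{\theta^*}\!\left[ \log \frac{f_{\theta^*}(X)}{f_{\theta}(X)} \right].$$
Unlike the total variation, it is not symmetric in general, so it is a “divergence” rather than a distance; but it is non-negative and equals zero only when the two distributions coincide, which is all we need in order to measure closeness. Crucially, it is written as an expectation under $\mathbb{P}_{\theta^*}$, which is exactly what $TV$ was missing.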
In particular, the KL divergence is non-negative, which we can show using Jensen’s inequality.
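Concretely, since $\log$ is concave, Jensen’s inequality gives
$$-KL(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta}) = E_{\theta^*}\!\left[ \log \frac{f_{\theta}(X)}{f_{\theta^*}(X)} \right] \le \log E_{\theta^*}\!\left[ \frac{f_{\theta}(X)}{f_{\theta^*}(X)} \right] = \log \int_E f_{\theta}(x) \, dx = \log 1 = 0,$$
so $KL(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta}) \ge 0$.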
KL 🤝 MLE
Now we are ready to introduce the maximum likelihood principle using the KL divergence.
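Here is the chain of reasoning, spelled out in the same i.i.d. setting as before. Expanding the logarithm,
$$KL(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta}) = \underbrace{E_{\theta^*}\!\left[ \log f_{\theta^*}(X) \right]}_{\text{does not depend on } \theta} - E_{\theta^*}\!\left[ \log f_{\theta}(X) \right].$$
The first term is a constant in $\theta$, and the second term is an expectation under $\mathbb{P}_{\theta^*}$, so we can finally apply the common strategy of replacing the expectation with the empirical average:
$$\widehat{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta}) = \text{constant} - \frac{1}{n} \sum_{i=1}^{n} \log f_{\theta}(X_i).$$
Minimizing this estimator over $\theta$ then gives
$$\hat{\theta} = \arg\min_{\theta} \widehat{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_{\theta}) = \arg\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log f_{\theta}(X_i) = \arg\max_{\theta} \prod_{i=1}^{n} f_{\theta}(X_i),$$
which is exactly the maximum likelihood estimator: maximizing the (log-)likelihood of the data is the same as minimizing our estimate of the KL divergence to the true distribution.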
Summary
What are we actually doing when we use MLE?
The short answer is that we are minimizing (an estimate of) the KL divergence between the true distribution and our model.
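To make this concrete, here is a minimal numerical sketch (not from the original post; the Gaussian model, parameter values, and function names are just illustrative assumptions). For a Gaussian model with known variance, minimizing the estimated KL divergence, i.e. the negative average log-likelihood, recovers the MLE, which in this case is the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Illustrative setup: samples from N(theta* = 2, 1); theta* is "unknown" to the estimator.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)

# Estimated KL divergence up to a constant: the negative average log-likelihood.
def estimated_kl_up_to_constant(theta):
    return -np.mean(norm.logpdf(x, loc=theta, scale=1.0))

# Minimizing it over theta is exactly maximum likelihood estimation.
theta_hat = minimize_scalar(estimated_kl_up_to_constant,
                            bounds=(-10.0, 10.0), method="bounded").x

# For a Gaussian with known variance, the MLE is the sample mean,
# so the two values should agree closely.
print(theta_hat, x.mean())
```

Running this, `theta_hat` and `x.mean()` should agree to several decimal places, which is the whole story in one line: minimizing the estimated KL divergence is maximum likelihood.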