KL Divergence
How can we measure the difference between two probability distributions? One common answer is the Kullback-Leibler Divergence, also known as KL Divergence, which measures how one probability distribution diverges from a second, reference distribution. It is widely used in machine learning, statistics, and information theory.
What is KL Divergence?
KL Divergence, also known as relative entropy, is a measure of how one probability distribution differs from another. It’s often used in machine learning, data compression, and statistical inference to compare probability distributions.
Definition
The KL Divergence of distribution \(P\) from distribution \(Q\), written \(D_{KL}(P \Vert Q)\), is defined as:
\[D_{KL}(P \Vert Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right)\]

For continuous distributions, we use an integral instead of a sum:

\[D_{KL}(P \Vert Q) = \int_{-\infty}^{\infty} P(x) \log \left(\frac{P(x)}{Q(x)}\right) dx\]

Where:
- \(P(x)\) is the true probability distribution
- \(Q(x)\) is the approximating probability distribution
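To make the discrete definition concrete, here is a minimal sketch in Python (using NumPy and SciPy; the distributions `p` and `q` are made-up examples) that evaluates the sum directly and checks it against `scipy.stats.entropy`, which computes the same quantity:

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) in nats.

    p, q: probability vectors over the same support, each summing to 1.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Only terms with P(x) > 0 contribute; requires Q(x) > 0 wherever P(x) > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Example distributions (made up for illustration).
p = np.array([0.36, 0.48, 0.16])
q = np.array([1/3, 1/3, 1/3])

print(kl_divergence(p, q))  # direct evaluation of the sum
print(entropy(p, q))        # scipy.stats.entropy(p, q) gives the same value
```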
Meaning of KL Divergence
KL Divergence measures the average number of extra bits needed to encode samples from P when using an optimal code for Q, instead of using an optimal code for P (taking the logarithm in base 2; with the natural logarithm the unit is nats). In simpler terms, it quantifies the information lost when Q is used to approximate P.
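As a quick worked example (with made-up numbers and base-2 logarithms so the result is in bits): if \(P = (0.5, 0.5)\) and \(Q = (0.9, 0.1)\), then

\[D_{KL}(P \Vert Q) = 0.5 \log_2\frac{0.5}{0.9} + 0.5 \log_2\frac{0.5}{0.1} \approx 0.737 \text{ bits},\]

so encoding samples from \(P\) with a code optimized for \(Q\) costs about 0.74 extra bits per symbol on average.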
Important properties of KL Divergence:
- It’s always non-negative: \(D_{KL}(P \Vert Q) \geq 0\)
- It equals zero if and only if \(P\) and \(Q\) are identical: \(D_{KL}(P \Vert Q) = 0 \iff P = Q\)
- It’s not symmetric: in general \(D_{KL}(P \Vert Q) \neq D_{KL}(Q \Vert P)\), as the sketch after this list illustrates
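A short numerical sketch (again with made-up distributions, using `scipy.stats.entropy`, which computes the KL divergence of its first argument from its second) illustrates these properties:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(P || Q) in nats

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

print(entropy(p, q))  # D_KL(P || Q) ~ 1.34, non-negative
print(entropy(q, p))  # D_KL(Q || P) ~ 1.40, a different value: not symmetric
print(entropy(p, p))  # 0.0, since the two distributions are identical
```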
Practical Applications
KL Divergence has numerous applications in machine learning and data science:
- Variational Inference: In Bayesian inference, KL Divergence is used to approximate complex posterior distributions (see the sketch after this list).
- Anomaly Detection: By comparing the distribution of normal data to new observations, we can detect anomalies.
- Model Selection: KL Divergence can help choose between different probabilistic models by measuring how well each model approximates the true data distribution.
- Information Bottleneck Method: In deep learning, KL Divergence is used to balance between compressing input data and preserving relevant information for a task.
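To tie this back to the variational-inference bullet above: in many models (variational autoencoders are the standard example) both distributions are diagonal Gaussians, so the KL term has a closed form. The sketch below is a minimal illustration, assuming a per-dimension approximate posterior \(\mathcal{N}(\mu, \sigma^2)\) and a standard normal prior; the function name `gaussian_kl_to_standard_normal` and the input values are made up for this example.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions.

    mu, sigma: per-dimension means and standard deviations of a diagonal
    Gaussian (e.g. the approximate posterior in a VAE).
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Per-dimension KL: 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2))
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Made-up 3-dimensional approximate posterior.
mu = np.array([0.5, -0.2, 0.0])
sigma = np.array([1.2, 0.8, 1.0])

print(gaussian_kl_to_standard_normal(mu, sigma))  # equals 0 only if mu=0, sigma=1
```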