# Explanation of On Variational Bounds of Mutual Information

Published:

Explanation of the paper On Variational Bounds of Mutual Information, ICML 2019.

# 1. Introduction

This paper introduces how to derive class of variational bounds of the mutual information, including MINE-f and InfoNCE. This paper also propose a new upper and a new lower bounds trading off between the bias and variance of the estimation of the MI.

Let’s begin with this paper, On Variational Bounds of Mutual Information, in ICML 2019.

# 2. Review of Variational Bounds of MI

MI:

$\label{mi-kl} I(X ; Y)=\mathbb{E}_{p(x, y)}\left[\log \frac{p(x \mid y)}{p(x)}\right]=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{p(y)}\right]$

Notes: In the following, there is a term “(un)normalized”, which indicates whether a variational distribution is normalized.

## 2.1. Normalized Upper and Lower Bounds

In Eq. (\ref{mi-kl}), we can easily obtain different types of bounds with KL-divergence when we are substituting different variational distributions to different marginal or conditional distributions.

For example, if $p(y\mid x)$ is like an encoder in VAE, by inserting $q(y)$, which approximates $p(y)$, we can have a tractable variational upper bound:

\begin{aligned} I(X ; Y) & \equiv \mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{p(y)}\right] \\ &=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x) q(y)}{q(y) p(y)}\right] \\ &=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{q(y)}\right]-K L(p(y) \| q(y)) \\ & \leq \mathbb{E}_{p(x)}[K L(p(y \mid x) \| q(y))], \end{aligned}

where the inequality comes from the non-negative KL term. This bound is tight when $q(y)=p(y)$. VAE just uses the same thing int he optimization.

### 2.1.1. $I_{\mathrm{BA}}$

We can do the opposite direction by using another variational distribution $q(x\mid y)$. The lower bound is derived by Barber & Agakov in 2003:

\begin{aligned} I(X ; Y)&=\mathbb{E}_{p(x, y)}\left[\log \frac{q(x \mid y)}{p(x)}\right] \\ &\qquad+\mathbb{E}_{p(y)}[K L(p(x \mid y) \| q(x \mid y))] & \\ &\geq \mathbb{E}_{p(x, y)}[\log q(x \mid y)]+h(X) \triangleq I_{\mathrm{BA}}, \end{aligned}

where $q(x\mid y)$ can be thought of as a decoder and $h(X)=-\mathbb{E}_{p(x,y)}\log p(x)$ is the differential entropy of $X$. This bound is tight when $q(x\mid y)=p(x\mid y)$. The differential entropy term actually makes this bound intractable because $p(x)$ is always unknown. A further drawback of this bound is that the decoder term, is always challenging when the data is high-dimensional.

## 2.2. Unnormalized Lower Bounds

In this section, the lower bound is mainly studied with an unnormalized variational distribution $q(x\mid y)$.

The variational family is an energy-based style with a critic function $f(x,y)$ and scaled by the data density $p(x)$:

$q(x \mid y)=\frac{p(x)}{Z(y)} e^{f(x, y)}, \text { where } Z(y)=\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right],$

which is exactly what we did in MINE.

### 2.2.1. $I_{\mathrm{UBA}}$

Substituting $q(x\mid y)$ in $I_{\mathrm{BA}}$, we can get an unnormalized lower bound $I_{\mathrm{UBA}}$:

\begin{aligned} I_{\mathrm{UBA}}&\triangleq\mathbb{E}_{p(x, y)}[\log q(x \mid y)]+h(X) \\ &=\mathbb{E}_{p(x, y)}[\log p(x)-\log Z(y)+f(x,y)]+h(X)\\ &=\mathbb{E}_{p(x, y)}[f(x,y)]-\mathbb{E}_{p(y)}[\log Z(y)]. \end{aligned}

This bound is tight when:

\begin{aligned} q(x\mid y)&=\frac{p(x)}{Z(y)} e^{f(x, y)}=p(x\mid y)=\frac{p(x,y)}{p(y)}\\ f(x,y)&=\log \frac{p(x,y)}{p(x)} + \log \frac{Z(y)}{p(y)}\\ &=\log p(y\mid x)+c(y), \end{aligned}

where $c(y)=\log \frac{Z(y)}{p(y)}$ is a function solely related $y$. This bound removes the intractable term $h(X)$ in $I_{\mathrm{BA}}$! Are we done? Not yet because the log partition function $\log Z(y)=\log\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]$ is still intractable. Can’t we just use Monte Carlo to estimate it? “Yes, we can!” But such an estimation is biased. In a batch of data samples, we are actually calculating the mean of $\log e^{f(x, y)}$ because with each discrete data point, we can only calculate $\log\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]$ without the expectation. And until we calculate without the expectation term over the batch, we can finally calculate the mean of this term, which becomes $\mathbb{E}_{p(x)}[\log e^{f(x, y)}]$. Therefore, it is biased and intractable with Monte Carlo estimation.

### 2.2.2. $I_{\mathrm{DV}}$ and MINE

Applying Jensen’s inequality to $I_{\mathrm{UBA}}$ can recover the bound of Donsker & Varadhan in 1983:

$I_{\mathrm{UBA}} \geq \mathbb{E}_{p(x, y)}[f(x, y)]-\log \mathbb{E}_{p(y)}[Z(y)] \triangleq I_{\mathrm{DV}}.$

It is still intractable because of the same biased reason.

If we apply Jensen’s inequality in the other direction, we can have a tractable objective, but not a bound:

$I \geq I_{\mathrm{UBA}} \leq \mathbb{E}_{p(x, y)}[f(x, y)]-\mathbb{E}_{p(x)}[f(x,y)],$

which is the same case when estimating $I_{\mathrm{DV}}$ using Monte Carlo. MINE is doing the same thing, which is not a bound at all!

### 2.2.3. $I_{\mathrm{TUBA}}$

So, how can we obtain a tractable unnormalized lower bound? We need to escape from expectation inside $\log$. Note that $\log (x) \leq \frac{x}{a}+\log (a)-1,\forall x,a>0$. Applying it to $\log Z(y)\leq \frac{Z(y)}{a(y)}+\log (a(y))-1$, which is tight when $a(y)=Z(y)$. This results in a Tractable Unnormalized version of $I_{\mathrm{BA}}$:

\begin{aligned} I \geq I_{\mathrm{UBA}} \geq & \mathbb{E}_{p(x, y)}[f(x, y)] \\ &-\mathbb{E}_{p(y)}\left[\frac{\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]}{a(y)}+\log (a(y))-1\right] \\ & \triangleq I_{\mathrm{TUBA}} \end{aligned}.

### 2.2.4. $I_{\mathrm{NWJ}}$

Letting $a(y)=e$ recovers the bound of Nguyen, Wainwright, and Jordan in 2010, also known as $f$-GAN and MINE-$f$:

$\mathbb{E}_{p(x, y)}[f(x, y)]-e^{-1} \mathbb{E}_{p(y)}[Z(y)] \triangleq I_{\mathrm{NWJ}},$

which no longer needs to learn $a(y)$. But $f(x,y)$ needs to learn to self-normalize. And the optimal critic comes from:

$\frac{p(x)}{e}e^{f(x,y)}=p(x\mid y).$

And the solution is:

$f^{*}(x, y)=1+\log \frac{p(x \mid y)}{p(x)}.$

In conclusion, these unnormalized bounds are unbiased while exhibit a high variance because the log partition function is of high variance.

## 2.3. Multi-sample unnormalized Lower Bounds

Multi-sample unnormalized bounds have low-variance but high-bias. We will mainly talk about the InfoNCE.

Assume $x_1$ and $y$ are from a sample pair $p\left(x_{1}\right) p\left(y \mid x_{1}\right)$ while other $K-1$ samples are from an independent distribution $x_{2: K} \sim r^{K-1}\left(x_{2: K}\right)$. We can have:

$I\left(X_{1} ; Y\right)=\mathbb{E}_{r^{K-1}\left(x_{2: K}\right)}\left[I\left(X_{1} ; Y\right)\right]=I\left(X_{1}, X_{2: K} ; Y\right).$

Such a multi-sample mutual information can be used in all previous bounds with the same optimal critic $f^{*}\left(x_{1: K}, y\right)=1+\log \frac{p\left(y \mid x_{1: K}\right)}{p(y)}=1+\log \frac{p\left(y \mid x_{1}\right)}{p(y)}$.

Particularly, if we set the critic in $I_{\mathrm{NWJ}}$ to:

$f\left(x_{1: K}, y\right)=1+\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)},$

then $I_{\mathrm{NWJ}}$ becomes:

\begin{aligned} I\left(X_{1} ; Y\right) &\geq I_{\mathrm{NWJ}} \triangleq\mathbb{E}_{p(x, y)}[f(x, y)]-e^{-1} \mathbb{E}_{p(y)}[Z(y)] \\ &=\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[1+\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right]-e^{-1} \mathbb{E}_{p(y)}\left[\mathbb{E}_{p(x_{1:K})}\left[e^{f\left(x_{1: K}, y\right)}\right]\right]\\ &=1+\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right]-\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right], \end{aligned}

where the inequality comes from the critic not guaranteed to be optimal. The additional samples from $p(x)$ can be used to estimate the partition function $Z(y)$:

$Z(y)=a\left(y ; x_{1: K}\right)=m\left(y ; x_{1: K}\right)=\frac{1}{K} \sum_{i=1}^{K} e^{f\left(x_{i}, y\right)}.$

The last term in the bound can be simplified as:

$\begin{array}{l} \mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{m\left(y ; x_{1: K}\right)}\right]=\frac{1}{K} \sum_{i=1}^{K} \mathbb{E}\left[\frac{e^{f\left(x_{i}, y\right)}}{m\left(y ; x_{1: K}\right)}\right] \\ =\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{\frac{1}{K} \sum_{i=1}^{K} e^{f\left(x_{i}, y\right)}}{m\left(y ; x_{1: K}\right)}\right]=1, \end{array}$

and thus the bound recovers the InfoNCE loss:

$I(X ; Y) \geq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{e^{f\left(x_{i}, y_{i}\right)}}{\frac{1}{K} \sum_{j=1}^{K} e^{f\left(x_{i}, y_{j}\right)}}\right] \triangleq I_{\mathrm{NCE}}.$

The optimal critic for $I_{\mathrm{NCE}}$ is $f(x,y)=\log p(y\mid x)+c(y)$ like $I_{\mathrm{UBA}}$. Note that $I_{\mathrm{NCE}}$ itself is upper bounded by $\log K$, which means that if $I(X;Y)>\log K$, the $I_{\mathrm{NCE}}$ bound is loose. The larger the batch, the better the estimation.

## 2.4. Nonlinearly Interpolated Lower Bounds

We can interpolate between $I_{\mathrm{NCE}}$ (high-bias, low-variance) and $I_{\mathrm{NWJ}}$ (low-bias, high-variance) to trade off the bias and the variance. Setting the critic to $1+\log \frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}$ with $\alpha\in\left[0,1\right]$ to get a continuum of lower bounds:

$\begin{array}{l} 1+\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[\log \frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\right] \\ -\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\right] \triangleq I_{\alpha} \end{array}.$

Setting $\alpha=0$, we can recover $I_{\mathrm{NWJ}}$ and $\alpha=1$, we can recover $I_{\mathrm{NCE}}$. This interpolated bound is upper bounded by $\log \frac{K}{\alpha}$.

## 2.5. Structured Bounds with Tractable Encoders

In current representation learning area, $p(y\mid x)$ is usually easily accessible when the $y$ is a learned stochastic representation.

### 2.5.1. InfoNCE with a Tractable Conditional

An optimal critic of InfoNCE is given by $f(x, y)=\log p(y \mid x)$. (WARNING: this is strange for me because $c(y)$ is omitted here) We can plug in the optimal $f$ and have a bound:

$\label{nce-vae} I(X ; Y) \geq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{p\left(y_{i} \mid x_{i}\right)}{\frac{1}{K} \sum_{j=1}^{K} p\left(y_{i} \mid x_{j}\right)}\right].$

### 2.5.2. Leave One Out Upper Bound

In Section 2.1, the upper bound is obtained by a variational distribution $q(y)$.

$I(X ; Y) \leq \mathbb{E}_{p(x)}[K L(p(y \mid x) \| q(y))],$

Given a batch of $K$ samples, we can approximate $p(y)$ in this way:

$p(y) \approx\frac{1}{K} \sum_{i} p\left(y \mid x_{i}\right).$

Leaving out the corresponding $y_i$ for $x_i$, we can have $q_{i}(y)=\frac{1}{K-1} \sum_{j \neq i} p\left(y \mid x_{j}\right)$. Therefore, the upper bound is:

$\label{upper-vae} I(X ; Y) \leq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K}\left[\log \frac{p\left(y_{i} \mid x_{i}\right)}{\frac{1}{K-1} \sum_{j \neq i} p\left(y_{i} \mid x_{j}\right)}\right]\right].$

With Eq. (\ref{nce-vae}) and (\ref{upper-vae}), we successfully sandwich the MI without introducing a variational distribution. The only difference between these two bounds is whether $p\left(y_{i} \mid x_{i}\right)$ is included in the denominator.

Tags: