Explanation of On Variational Bounds of Mutual Information

8 minute read


Explanation of the paper On Variational Bounds of Mutual Information, ICML 2019.

1. Introduction

This paper introduces how to derive class of variational bounds of the mutual information, including MINE-f and InfoNCE. This paper also propose a new upper and a new lower bounds trading off between the bias and variance of the estimation of the MI.

Let’s begin with this paper, On Variational Bounds of Mutual Information, in ICML 2019.

2. Review of Variational Bounds of MI


\[\label{mi-kl} I(X ; Y)=\mathbb{E}_{p(x, y)}\left[\log \frac{p(x \mid y)}{p(x)}\right]=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{p(y)}\right]\]

Notes: In the following, there is a term “(un)normalized”, which indicates whether a variational distribution is normalized.

2.1. Normalized Upper and Lower Bounds

In Eq. (\ref{mi-kl}), we can easily obtain different types of bounds with KL-divergence when we are substituting different variational distributions to different marginal or conditional distributions.

For example, if \(p(y\mid x)\) is like an encoder in VAE, by inserting \(q(y)\), which approximates \(p(y)\), we can have a tractable variational upper bound:

\[\begin{aligned} I(X ; Y) & \equiv \mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{p(y)}\right] \\ &=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x) q(y)}{q(y) p(y)}\right] \\ &=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{q(y)}\right]-K L(p(y) \| q(y)) \\ & \leq \mathbb{E}_{p(x)}[K L(p(y \mid x) \| q(y))], \end{aligned}\]

where the inequality comes from the non-negative KL term. This bound is tight when \(q(y)=p(y)\). VAE just uses the same thing int he optimization.

2.1.1. \(I_{\mathrm{BA}}\)

We can do the opposite direction by using another variational distribution \(q(x\mid y)\). The lower bound is derived by Barber & Agakov in 2003:

\[\begin{aligned} I(X ; Y)&=\mathbb{E}_{p(x, y)}\left[\log \frac{q(x \mid y)}{p(x)}\right] \\ &\qquad+\mathbb{E}_{p(y)}[K L(p(x \mid y) \| q(x \mid y))] & \\ &\geq \mathbb{E}_{p(x, y)}[\log q(x \mid y)]+h(X) \triangleq I_{\mathrm{BA}}, \end{aligned}\]

where \(q(x\mid y)\) can be thought of as a decoder and \(h(X)=-\mathbb{E}_{p(x,y)}\log p(x)\) is the differential entropy of \(X\). This bound is tight when \(q(x\mid y)=p(x\mid y)\). The differential entropy term actually makes this bound intractable because \(p(x)\) is always unknown. A further drawback of this bound is that the decoder term, is always challenging when the data is high-dimensional.

2.2. Unnormalized Lower Bounds

In this section, the lower bound is mainly studied with an unnormalized variational distribution \(q(x\mid y)\).

The variational family is an energy-based style with a critic function \(f(x,y)\) and scaled by the data density \(p(x)\):

\[q(x \mid y)=\frac{p(x)}{Z(y)} e^{f(x, y)}, \text { where } Z(y)=\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right],\]

which is exactly what we did in MINE.

2.2.1. \(I_{\mathrm{UBA}}\)

Substituting \(q(x\mid y)\) in \(I_{\mathrm{BA}}\), we can get an unnormalized lower bound \(I_{\mathrm{UBA}}\):

\[\begin{aligned} I_{\mathrm{UBA}}&\triangleq\mathbb{E}_{p(x, y)}[\log q(x \mid y)]+h(X) \\ &=\mathbb{E}_{p(x, y)}[\log p(x)-\log Z(y)+f(x,y)]+h(X)\\ &=\mathbb{E}_{p(x, y)}[f(x,y)]-\mathbb{E}_{p(y)}[\log Z(y)]. \end{aligned}\]

This bound is tight when:

\[\begin{aligned} q(x\mid y)&=\frac{p(x)}{Z(y)} e^{f(x, y)}=p(x\mid y)=\frac{p(x,y)}{p(y)}\\ f(x,y)&=\log \frac{p(x,y)}{p(x)} + \log \frac{Z(y)}{p(y)}\\ &=\log p(y\mid x)+c(y), \end{aligned}\]

where \(c(y)=\log \frac{Z(y)}{p(y)}\) is a function solely related \(y\). This bound removes the intractable term \(h(X)\) in \(I_{\mathrm{BA}}\)! Are we done? Not yet because the log partition function \(\log Z(y)=\log\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]\) is still intractable. Can’t we just use Monte Carlo to estimate it? “Yes, we can!” But such an estimation is biased. In a batch of data samples, we are actually calculating the mean of \(\log e^{f(x, y)}\) because with each discrete data point, we can only calculate \(\log\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]\) without the expectation. And until we calculate without the expectation term over the batch, we can finally calculate the mean of this term, which becomes \(\mathbb{E}_{p(x)}[\log e^{f(x, y)}]\). Therefore, it is biased and intractable with Monte Carlo estimation.

2.2.2. \(I_{\mathrm{DV}}\) and MINE

Applying Jensen’s inequality to \(I_{\mathrm{UBA}}\) can recover the bound of Donsker & Varadhan in 1983:

\[I_{\mathrm{UBA}} \geq \mathbb{E}_{p(x, y)}[f(x, y)]-\log \mathbb{E}_{p(y)}[Z(y)] \triangleq I_{\mathrm{DV}}.\]

It is still intractable because of the same biased reason.

If we apply Jensen’s inequality in the other direction, we can have a tractable objective, but not a bound:

\[I \geq I_{\mathrm{UBA}} \leq \mathbb{E}_{p(x, y)}[f(x, y)]-\mathbb{E}_{p(x)}[f(x,y)],\]

which is the same case when estimating \(I_{\mathrm{DV}}\) using Monte Carlo. MINE is doing the same thing, which is not a bound at all!

2.2.3. \(I_{\mathrm{TUBA}}\)

So, how can we obtain a tractable unnormalized lower bound? We need to escape from expectation inside \(\log\). Note that \(\log (x) \leq \frac{x}{a}+\log (a)-1,\forall x,a>0\). Applying it to \(\log Z(y)\leq \frac{Z(y)}{a(y)}+\log (a(y))-1\), which is tight when \(a(y)=Z(y)\). This results in a Tractable Unnormalized version of \(I_{\mathrm{BA}}\):

\[\begin{aligned} I \geq I_{\mathrm{UBA}} \geq & \mathbb{E}_{p(x, y)}[f(x, y)] \\ &-\mathbb{E}_{p(y)}\left[\frac{\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]}{a(y)}+\log (a(y))-1\right] \\ & \triangleq I_{\mathrm{TUBA}} \end{aligned}.\]

2.2.4. \(I_{\mathrm{NWJ}}\)

Letting \(a(y)=e\) recovers the bound of Nguyen, Wainwright, and Jordan in 2010, also known as \(f\)-GAN and MINE-\(f\):

\[\mathbb{E}_{p(x, y)}[f(x, y)]-e^{-1} \mathbb{E}_{p(y)}[Z(y)] \triangleq I_{\mathrm{NWJ}},\]

which no longer needs to learn \(a(y)\). But \(f(x,y)\) needs to learn to self-normalize. And the optimal critic comes from:

\[\frac{p(x)}{e}e^{f(x,y)}=p(x\mid y).\]

And the solution is:

\[f^{*}(x, y)=1+\log \frac{p(x \mid y)}{p(x)}.\]

In conclusion, these unnormalized bounds are unbiased while exhibit a high variance because the log partition function is of high variance.

2.3. Multi-sample unnormalized Lower Bounds

Multi-sample unnormalized bounds have low-variance but high-bias. We will mainly talk about the InfoNCE.

Assume \(x_1\) and \(y\) are from a sample pair \(p\left(x_{1}\right) p\left(y \mid x_{1}\right)\) while other \(K-1\) samples are from an independent distribution \(x_{2: K} \sim r^{K-1}\left(x_{2: K}\right)\). We can have:

\[I\left(X_{1} ; Y\right)=\mathbb{E}_{r^{K-1}\left(x_{2: K}\right)}\left[I\left(X_{1} ; Y\right)\right]=I\left(X_{1}, X_{2: K} ; Y\right).\]

Such a multi-sample mutual information can be used in all previous bounds with the same optimal critic \(f^{*}\left(x_{1: K}, y\right)=1+\log \frac{p\left(y \mid x_{1: K}\right)}{p(y)}=1+\log \frac{p\left(y \mid x_{1}\right)}{p(y)}\).

Particularly, if we set the critic in \(I_{\mathrm{NWJ}}\) to:

\[f\left(x_{1: K}, y\right)=1+\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)},\]

then \(I_{\mathrm{NWJ}}\) becomes:

\[\begin{aligned} I\left(X_{1} ; Y\right) &\geq I_{\mathrm{NWJ}} \triangleq\mathbb{E}_{p(x, y)}[f(x, y)]-e^{-1} \mathbb{E}_{p(y)}[Z(y)] \\ &=\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[1+\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right]-e^{-1} \mathbb{E}_{p(y)}\left[\mathbb{E}_{p(x_{1:K})}\left[e^{f\left(x_{1: K}, y\right)}\right]\right]\\ &=1+\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right]-\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right], \end{aligned}\]

where the inequality comes from the critic not guaranteed to be optimal. The additional samples from \(p(x)\) can be used to estimate the partition function \(Z(y)\):

\[Z(y)=a\left(y ; x_{1: K}\right)=m\left(y ; x_{1: K}\right)=\frac{1}{K} \sum_{i=1}^{K} e^{f\left(x_{i}, y\right)}.\]

The last term in the bound can be simplified as:

\[\begin{array}{l} \mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{m\left(y ; x_{1: K}\right)}\right]=\frac{1}{K} \sum_{i=1}^{K} \mathbb{E}\left[\frac{e^{f\left(x_{i}, y\right)}}{m\left(y ; x_{1: K}\right)}\right] \\ =\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{\frac{1}{K} \sum_{i=1}^{K} e^{f\left(x_{i}, y\right)}}{m\left(y ; x_{1: K}\right)}\right]=1, \end{array}\]

and thus the bound recovers the InfoNCE loss:

\[I(X ; Y) \geq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{e^{f\left(x_{i}, y_{i}\right)}}{\frac{1}{K} \sum_{j=1}^{K} e^{f\left(x_{i}, y_{j}\right)}}\right] \triangleq I_{\mathrm{NCE}}.\]

The optimal critic for \(I_{\mathrm{NCE}}\) is \(f(x,y)=\log p(y\mid x)+c(y)\) like \(I_{\mathrm{UBA}}\). Note that \(I_{\mathrm{NCE}}\) itself is upper bounded by \(\log K\), which means that if \(I(X;Y)>\log K\), the \(I_{\mathrm{NCE}}\) bound is loose. The larger the batch, the better the estimation.

2.4. Nonlinearly Interpolated Lower Bounds

We can interpolate between \(I_{\mathrm{NCE}}\) (high-bias, low-variance) and \(I_{\mathrm{NWJ}}\) (low-bias, high-variance) to trade off the bias and the variance. Setting the critic to \(1+\log \frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\) with \(\alpha\in\left[0,1\right]\) to get a continuum of lower bounds:

\[\begin{array}{l} 1+\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[\log \frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\right] \\ -\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\right] \triangleq I_{\alpha} \end{array}.\]

Setting \(\alpha=0\), we can recover \(I_{\mathrm{NWJ}}\) and \(\alpha=1\), we can recover \(I_{\mathrm{NCE}}\). This interpolated bound is upper bounded by \(\log \frac{K}{\alpha}\).

2.5. Structured Bounds with Tractable Encoders

In current representation learning area, \(p(y\mid x)\) is usually easily accessible when the \(y\) is a learned stochastic representation.

2.5.1. InfoNCE with a Tractable Conditional

An optimal critic of InfoNCE is given by \(f(x, y)=\log p(y \mid x)\). (WARNING: this is strange for me because \(c(y)\) is omitted here) We can plug in the optimal \(f\) and have a bound:

\[\label{nce-vae} I(X ; Y) \geq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{p\left(y_{i} \mid x_{i}\right)}{\frac{1}{K} \sum_{j=1}^{K} p\left(y_{i} \mid x_{j}\right)}\right].\]

2.5.2. Leave One Out Upper Bound

In Section 2.1, the upper bound is obtained by a variational distribution \(q(y)\).

\[I(X ; Y) \leq \mathbb{E}_{p(x)}[K L(p(y \mid x) \| q(y))],\]

Given a batch of \(K\) samples, we can approximate \(p(y)\) in this way:

\[p(y) \approx\frac{1}{K} \sum_{i} p\left(y \mid x_{i}\right).\]

Leaving out the corresponding \(y_i\) for \(x_i\), we can have \(q_{i}(y)=\frac{1}{K-1} \sum_{j \neq i} p\left(y \mid x_{j}\right)\). Therefore, the upper bound is:

\[\label{upper-vae} I(X ; Y) \leq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K}\left[\log \frac{p\left(y_{i} \mid x_{i}\right)}{\frac{1}{K-1} \sum_{j \neq i} p\left(y_{i} \mid x_{j}\right)}\right]\right].\]

With Eq. (\ref{nce-vae}) and (\ref{upper-vae}), we successfully sandwich the MI without introducing a variational distribution. The only difference between these two bounds is whether \(p\left(y_{i} \mid x_{i}\right)\) is included in the denominator.