This paper introduces how a deep neural network with infinite width behaves like a Gaussian process. Note that the neural network in this paper is NOT trained, which means that all the distributions are the distributions at initialization. NO SGD and NO backprop. Two older papers discussed how a single-layer FC NN acts like a Gaussian process: Bayesian Learning for Neural Networks (Neal, 1996) and Computing with Infinite Networks (Williams, 1996). We will start from the simpler setting of these two old but pioneering papers.
A Gaussian process is usually defined in the continuous-time setting, which is not actually the case we are interested in, because we do not have a time series for the neural network. Traditionally, a process \(\left\{X_{t} ; t \in T\right\}\) is Gaussian if and only if every finite collection of its variables is jointly multivariate Gaussian. Such a process is then described by a mean \(\mu\) and a covariance \(\Sigma\). Generally, \(\Sigma\) is defined through a kernel function \(K(x,x')\).
The input of the network is an \(I\) dimensional vector with each unit denoted as \(x_i\). The hidden dimension is \(H\). \(u,v\) are weights and \(a,b\) are bias terms. The output is \(f_{k}(x)\), where \(k\) is the index of the dimension of \(f(x)\).
To set the priors of these parameters, Gaussian priors are employed. Specifically, the hidden-output parameters \(v\) and \(b\) are given independent zero-mean Gaussian priors with standard deviations \(\sigma_v\) and \(\sigma_b\). The weights \(u\) and biases \(a\) in the input-hidden layer have Gaussian priors as well, but without restrictions on the mean and the standard deviation.
Now let us see how the prior of the output \(f_k(x)\) behaves. For example, we look into the \(k\)-th unit of the output vector given the input vector \(x^{(1)}\), i.e., \(f_k(x^{(1)})\), under its prior distribution. The mean, or expected value, of \(f_k(x^{(1)})\) is:
\[\mathbb{E}\left[f_k(x^{(1)})\right]=\mathbb{E}\left[b_{k}+ \sum_{j=1}^{H}v_{jk}h_j(x^{(1)})\right]=0.\]This is because \(\mathbb{E}\left[b_{k}\right]=0\) (zero mean) and for each hidden unit, \(\mathbb{E}\left[v_{jk}h_j(x^{(1)})\right]=\mathbb{E}\left[v_{jk}\right]\mathbb{E}\left[h_j(x^{(1)})\right]=0\) (independence and zero mean).
For the variance, again, we look at the variance of each individual term: for the bias term, the variance is \(\sigma_b^2\). For each summation term:
\(\mathbb{E}\left[\left(v_{jk}h_j(x^{(1)})\right)^2\right]=\mathbb{E}\left[v_{jk}^2\right]\mathbb{E}\left[h_j^2(x^{(1)})\right]=\sigma_v^2\mathbb{E}\left[h_j^2(x^{(1)})\right]=\sigma_v^2V\left(x^{(1)}\right),\) where \(V\left(x^{(1)}\right)\) is used to denote the expectation of the square term.
By the Central Limit Theorem, for a large number of hidden units \(H\), the summation term behaves like a Gaussian with variance \(H\sigma_v^2V\left(x^{(1)}\right)\). The variance of \(f_k(x^{(1)})\) is then \(\sigma_b^2+H\sigma_v^2V\left(x^{(1)}\right)\).
If the standard deviation of the weights is set to \(\sigma_v=\omega_vH^{-1/2}\), then the prior of the output converges to a Gaussian with zero mean and variance \(\sigma_b^2+\omega_v^2V\left(x^{(1)}\right)\) as \(H\) goes to infinity.
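The scaling argument above can be checked empirically. Below is a quick sketch (my own, not code from the papers): sample many independent one-hidden-layer networks at initialization and verify that the prior variance of \(f_k(x)\) matches \(\sigma_b^2+\omega_v^2V(x)\). The width, sample counts, and the tanh hidden unit are illustrative assumptions.

```python
import numpy as np

# Empirical check: prior variance of the output of a random
# one-hidden-layer network vs. sigma_b^2 + omega_v^2 * V(x).
rng = np.random.default_rng(0)
n_nets, H, I = 4000, 1024, 2
sigma_b, omega_v, sigma_u, sigma_a = 0.5, 1.0, 1.0, 1.0
x = np.array([0.3, -0.7])

u = rng.normal(0, sigma_u, (n_nets, I, H))           # input-hidden weights
a = rng.normal(0, sigma_a, (n_nets, H))              # input-hidden biases
v = rng.normal(0, omega_v * H ** -0.5, (n_nets, H))  # sigma_v = omega_v / sqrt(H)
b = rng.normal(0, sigma_b, n_nets)                   # output biases

h = np.tanh(np.einsum('i,nih->nh', x, u) + a)        # hidden activations h_j(x)
f = b + np.einsum('nh,nh->n', v, h)                  # outputs f_k(x), one per net
V = np.mean(h ** 2)                                  # V(x) = E[h_j(x)^2]
print(np.var(f), sigma_b ** 2 + omega_v ** 2 * V)    # should be close
```

Note the variance identity holds exactly for any \(H\); only the Gaussianity of the sum needs \(H\) to be large.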
Until now, the output prior is still a single-variable distribution. Where does the Gaussian process come from? We do not have anything “time” related. In the prior of \(f_k(x)\), the input \(x\) can be seen as the index set: this gives a stochastic process with \(x\) as the varying subscript, and we can investigate the joint distribution of \(f_k(x^{(1)}),f_k(x^{(2)}),\ldots,f_k(x^{(n)})\). This is a stochastic process, which we have not yet shown to be a Gaussian process. So it is now clear that the Gaussian process is defined to capture the relationship of network outputs between different inputs.
For this stochastic process, the mean is \(0\) because every random variable in it has zero mean. As for the covariance, its definition is:
\[\text{cov}(X,Y)=\mathbb{E}[XY]-\mu_X\mu_Y.\]In our case, the mean of every output random variable is zero. Thus, the covariance of outputs is:
\(\begin{aligned} \text{cov}(f_k(x^{(p)}),f_k(x^{(q)}))&=\mathbb{E}[f_k(x^{(p)})\cdot f_k(x^{(q)})]\\ &=\mathbb{E}\left[b_k^2+b_k\sum A+b_k\sum B+\sum A\sum B\right],\text{ with }A,B\text{ denoting the corresponding summation terms}\\ &=\sigma_b^2+\mathbb{E}\left[\sum A\sum B\right],\text{ by independence and zero means}\\ &=\sigma_b^2+\mathbb{E}\left[\sum_{m=1}^H\sum_{n=1}^Hv_{mk}v_{nk}h_m(x^{(p)})h_n(x^{(q)})\right]\\ &=\sigma_b^2+\sum_{m=1}^H\mathbb{E}\left[v_{mk}^2\right]\mathbb{E}\left[h_m(x^{(p)})h_m(x^{(q)})\right],\text{ cross terms }(m\neq n)\text{ vanish: }v_{mk},v_{nk}\text{ independent with zero mean}\\ &=\sigma_b^2+\sigma_v^2H\mathbb{E}\left[h_j(x^{(p)})h_j(x^{(q)})\right]\\ &=\sigma_b^2+\omega_v^2\mathbb{E}\left[h_j(x^{(p)})h_j(x^{(q)})\right]\\ &=\sigma_b^2+\omega_v^2C\left(x^{(p)},x^{(q)}\right), \end{aligned}\) where \(C\left(x^{(p)},x^{(q)}\right)=\mathbb{E}\left[h_j(x^{(p)})h_j(x^{(q)})\right]\), which is the same for every \(j\); it is actually a second moment. So every finite collection of outputs is multivariate Gaussian, which makes the output a Gaussian process.
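A numeric check in the same spirit (my own sketch): across random networks, the covariance of the outputs at two inputs \(x^{(p)}, x^{(q)}\) should match \(\sigma_b^2+\omega_v^2C(x^{(p)},x^{(q)})\), with \(C\) estimated from the shared hidden units. Widths and constants are illustrative.

```python
import numpy as np

# Covariance of outputs at two inputs across random networks
# vs. the kernel sigma_b^2 + omega_v^2 * C(x_p, x_q).
rng = np.random.default_rng(1)
n_nets, H = 4000, 1024
sigma_b, omega_v = 0.5, 1.0
x_p, x_q = np.array([0.3, -0.7]), np.array([-0.2, 0.5])

u = rng.normal(0, 1.0, (n_nets, 2, H))
a = rng.normal(0, 1.0, (n_nets, H))
v = rng.normal(0, omega_v * H ** -0.5, (n_nets, H))
b = rng.normal(0, sigma_b, n_nets)

h_p = np.tanh(np.einsum('i,nih->nh', x_p, u) + a)  # shared u, a: same hidden units
h_q = np.tanh(np.einsum('i,nih->nh', x_q, u) + a)
f_p = b + np.einsum('nh,nh->n', v, h_p)
f_q = b + np.einsum('nh,nh->n', v, h_q)

C = np.mean(h_p * h_q)                  # same for every j, so pool all units
emp_cov = np.cov(f_p, f_q)[0, 1]
print(emp_cov, sigma_b ** 2 + omega_v ** 2 * C)    # should be close
```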
According to a great note from Mihai Nica, there is a fact:
If \(X\) is Gaussian, random variable \(X\sim\mathcal{N}(\mu_X, \Sigma_X)\) and if \(Y=a+BX\), then \(Y\) is Gaussian too with \(Y \sim \mathcal{N}\left(a+B \mu_{X}, B \Sigma_{X} B^{T}\right)\).
Till now, everything is shallow, with only one hidden layer. What if we’re goin’ deep?
Let’s first describe the forward pass of a deep neural network: \(z_i^l(x)=b_i^l+\sum_{j=1}^{N_l}W_{ij}^lx_j^l(x),\quad x_j^l(x)=\phi\left(z_{j}^{l-1}(x)\right).\)
It looks just like the definition of the one-hidden-layer network. And since the calculation of the kernel and mean does not depend on the prior distribution of the input of the current layer, it is natural to extend the result to deeper networks.
By definition, \(z_i^l(x)\) is a sum of i.i.d. random terms (bias and weights). As the width \(N_l\) goes to infinity, any finite collection \(\{z_i^l(x^{\alpha=1}),z_i^l(x^{\alpha=2}),\ldots,z_i^l(x^{\alpha=k})\}\) will have a joint Gaussian distribution and \(z_{i}^{l} \sim \mathcal{G P}\left(0, K^{l}\right)\). The covariance (kernel) is: \(K^{l}\left(x, x^{\prime}\right) \equiv \mathbb{E}\left[z_{i}^{l}(x) z_{i}^{l}\left(x^{\prime}\right)\right]=\sigma_{b}^{2}+\sigma_{w}^{2} \mathbb{E}_{z_{i}^{l-1} \sim \mathcal{G} \mathcal{P}\left(0, K^{l-1}\right)}\left[\phi\left(z_{i}^{l-1}(x)\right) \phi\left(z_{i}^{l-1}\left(x^{\prime}\right)\right)\right],\) which is the second moment of the previous layer's activations (up to the bias and weight scales). The base kernel (first layer) is the same as the one derived in the previous section: \(K^{0}\left(x, x^{\prime}\right)=\mathbb{E}\left[z_{j}^{0}(x) z_{j}^{0}\left(x^{\prime}\right)\right]=\sigma_{b}^{2}+\sigma_{w}^{2}\left(\frac{x \cdot x^{\prime}}{d_{\text {in }}}\right).\)
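This recursion can be sketched in code (my own construction, not the paper's code): the expectation over \(z^{l-1}\sim\mathcal{GP}(0,K^{l-1})\) only involves the bivariate Gaussian of \((z(x), z(x'))\), so it can be taken by Monte Carlo. The depth, \(\phi=\tanh\), and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# Kernel recursion K^{l-1} -> K^l by Monte Carlo over the bivariate
# Gaussian (z(x), z(x')) with covariance given by K^{l-1}.
rng = np.random.default_rng(2)
sigma_b2, sigma_w2, phi = 0.1, 1.6, np.tanh

def next_diag(Kxx, n=200000):
    z = rng.normal(0.0, np.sqrt(Kxx), n)
    return sigma_b2 + sigma_w2 * np.mean(phi(z) ** 2)

def next_cross(Kxx, Kxy, Kyy, n=200000):
    cov = np.array([[Kxx, Kxy], [Kxy, Kyy]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return sigma_b2 + sigma_w2 * np.mean(phi(z[:, 0]) * phi(z[:, 1]))

x, y, d_in = np.array([0.3, -0.7]), np.array([-0.2, 0.5]), 2
# base case: K^0(x, x') = sigma_b^2 + sigma_w^2 * (x . x') / d_in
Kxx = sigma_b2 + sigma_w2 * (x @ x) / d_in
Kyy = sigma_b2 + sigma_w2 * (y @ y) / d_in
Kxy = sigma_b2 + sigma_w2 * (x @ y) / d_in
for _ in range(3):                      # three hidden layers
    Kxx, Kxy, Kyy = (next_diag(Kxx),
                     next_cross(Kxx, Kxy, Kyy),
                     next_diag(Kyy))
print(Kxx, Kxy, Kyy)
```

The tuple assignment updates all three entries from the previous layer's values at once, mirroring the layer-by-layer recursion.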
This paper introduces a contrastive log-ratio upper bound of the mutual information. It provides a more stable estimate than the previously proposed L1Out upper bound (previous post).
Let’s begin with this paper, CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information, in ICML 2020.
In our previous post about variational bounds, we discussed two types of upper bounds.
Alemi et al. (2016) introduce a variational marginal approximation \(r(y)\) to build a variational upper bound (VUB):
\[\begin{aligned} \mathrm{I}(\boldsymbol{x} ; \boldsymbol{y}) &=\mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}\left[\log \frac{p(\boldsymbol{y} \mid \boldsymbol{x})}{p(\boldsymbol{y})}\right] \\ &=\mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}\left[\log \frac{p(\boldsymbol{y} \mid \boldsymbol{x})}{r(\boldsymbol{y})}\right]-\operatorname{KL}(p(\boldsymbol{y}) \| r(\boldsymbol{y})) \\ & \leq \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}\left[\log \frac{p(\boldsymbol{y} \mid \boldsymbol{x})}{r(\boldsymbol{y})}\right]=\mathbb{E}_{p(\boldsymbol{x})}\left[\mathrm{KL}(p(\boldsymbol{y} \mid \boldsymbol{x}) \| r(\boldsymbol{y}))\right] = \mathrm{I}_{\mathrm{VUB}}. \end{aligned}\]In experiments, the variational distribution \(r(y)\) is usually set to a standard normal distribution, which makes the MI estimate highly biased (upward).
Poole et al. (2019) (previous post) replace \(r(y)\) with a Monte Carlo approximation \(r_{i}(\boldsymbol{y})=\frac{1}{N-1} \sum_{j \neq i} p\left(\boldsymbol{y} \mid \boldsymbol{x}_{j}\right) \approx p(\boldsymbol{y})\) and derive a leave-one-out upper bound (L1Out):
\[\mathrm{I}_{\mathrm{L} 1 \mathrm{Out}}:=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N}\left[\log \frac{p\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{i}\right)}{\frac{1}{N-1} \sum_{j \neq i} p\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{j}\right)}\right]\right].\]This bound suffers from numerical instability in practice compared with the proposed method, which will be discussed later.
When the conditional distribution is unknown (unlike in a VAE), we introduce a neural network \(q_\theta(y\mid x)\) to approximate \(p(y\mid x)\) and develop variational versions of VUB and L1Out as:
\[\mathrm{I}_{\mathrm{vVUB}}=\mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}\left[\log \frac{q_{\theta}(\boldsymbol{y} \mid \boldsymbol{x})}{r(\boldsymbol{y})}\right],\] \[\mathrm{I}_{\mathrm{VL} 1 \mathrm{Out}}=\mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N}\left[\log \frac{q_{\theta}\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{i}\right)}{\frac{1}{N-1} \sum_{j \neq i} q_{\theta}\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{j}\right)}\right]\right].\]The full name of CLUB is Contrastive Log-ratio Upper Bound.
The bound is defined as:
\[\mathrm{I}_{\mathrm{CLUB}}(\boldsymbol{x} ; \boldsymbol{y}):=\mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})]-\mathbb{E}_{p(\boldsymbol{x})} \mathbb{E}_{p(\boldsymbol{y})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})]\]Proof.
Let \(\Delta\) be the gap between the CLUB bound and the MI itself:
\[\begin{aligned} \Delta:=& \mathrm{I}_{\mathrm{CLUB}}(\boldsymbol{x} ; \boldsymbol{y})-\mathrm{I}(\boldsymbol{x} ; \boldsymbol{y}) \\ =& \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})]-\mathbb{E}_{p(\boldsymbol{x})} \mathbb{E}_{p(\boldsymbol{y})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})]-\mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})-\log p(\boldsymbol{y})]\text{, by def.}\\ =& \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}[\log p(\boldsymbol{y})]-\mathbb{E}_{p(\boldsymbol{x})} \mathbb{E}_{p(\boldsymbol{y})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})],\quad\text{eliminate } \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})]\\ =& \mathbb{E}_{p(\boldsymbol{y})}\left[\log p(\boldsymbol{y})-\mathbb{E}_{p(\boldsymbol{x})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})]\right],\quad \boldsymbol{x}\text{ doesn't affect }p(\boldsymbol{y}). \end{aligned}\]By definition,
\[p(\boldsymbol{y})=\int p(\boldsymbol{y} \mid \boldsymbol{x}) p(\boldsymbol{x}) \mathrm{d} \boldsymbol{x}=\mathbb{E}_{p(\boldsymbol{x})}[p(\boldsymbol{y} \mid \boldsymbol{x})].\]Adding \(\log\) to both sides,
\[\log p(\boldsymbol{y})=\log \mathbb{E}_{p(\boldsymbol{x})}[p(\boldsymbol{y} \mid \boldsymbol{x})].\]Remember we talked about the tractability of some bounds in the previous post about variational bounds, which includes the Jensen’s Inequality of the expectation and \(\log\). We apply the inequality here again,
\[\log \left(\mathbb{E}_{p(\boldsymbol{x})}[p(\boldsymbol{y} \mid \boldsymbol{x})]\right) \geq \mathbb{E}_{p(\boldsymbol{x})}[\log p(\boldsymbol{y} \mid \boldsymbol{x})].\]Therefore, the gap is non-negative and CLUB is an upper bound. (Kinda) obviously, when \(\boldsymbol{x}\) and \(\boldsymbol{y}\) are independent, CLUB is tight.
With multiple sample pairs \(\left\{\left(\boldsymbol{x}_{i}, \boldsymbol{y}_{i}\right)\right\}_{i=1}^{N}\), \(\mathrm{I}_{\mathrm{CLUB}}(\boldsymbol{x} ; \boldsymbol{y})\) has an unbiased estimation as:
\[\begin{array}{l} \hat{\mathrm{I}}_{\mathrm{CLUB}}=\frac{1}{N} \sum_{i=1}^{N} \log p\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{i}\right)-\frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \log p\left(\boldsymbol{y}_{j} \mid \boldsymbol{x}_{i}\right) \\ =\frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N}\left[\log p\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{i}\right)-\log p\left(\boldsymbol{y}_{j} \mid \boldsymbol{x}_{i}\right)\right] . \end{array}\]Looking at the form of the bound, it is exactly a log-ratio between the positive sample pair and the negative sample pairs.
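A hedged numeric sketch of this estimator (my own, not the authors' code): a toy Gaussian pair with known conditional \(p(y\mid x)=\mathcal{N}(\rho x, 1-\rho^2)\), for which the true MI is \(-\frac{1}{2}\log(1-\rho^2)\). CLUB should come out above the true value (here the gap happens to be large); \(\rho\) and \(N\) are illustrative.

```python
import numpy as np

# Sample-based CLUB estimator on a correlated Gaussian pair.
rng = np.random.default_rng(3)
rho, N = 0.8, 2000
x = rng.normal(size=N)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=N)

def log_p(y, x):                      # log p(y | x) for the toy pair
    var = 1 - rho ** 2
    return -0.5 * np.log(2 * np.pi * var) - (y - rho * x) ** 2 / (2 * var)

logp = log_p(y[:, None], x[None, :])  # logp[i, j] = log p(y_i | x_j)
club = np.mean(np.diag(logp)) - np.mean(logp)
true_mi = -0.5 * np.log(1 - rho ** 2)
print(club, true_mi)                  # club >= true_mi up to sampling noise
```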
For some tasks, e.g., when the representation is not stochastic, we have no access to the conditional distribution \(p\left(\boldsymbol{y} \mid \boldsymbol{x}\right)\). It is then natural to use a variational distribution \(q_{\theta}(\boldsymbol{y} \mid \boldsymbol{x})\) with parameters \(\theta\). The variational CLUB is therefore:
\[\mathrm{I}_{\mathrm{vCLUB}}(\boldsymbol{x} ; \boldsymbol{y}):= \mathbb{E}_{p(\boldsymbol{x}, \boldsymbol{y})}\left[\log q_{\theta}(\boldsymbol{y} \mid \boldsymbol{x})\right] -\mathbb{E}_{p(\boldsymbol{x})} \mathbb{E}_{p(\boldsymbol{y})}\left[\log q_{\theta}(\boldsymbol{y} \mid \boldsymbol{x})\right].\]Similarly, when we are using multiple samples, the bound becomes:
\[\begin{aligned} &\hat{\mathrm{I}}_{\mathrm{vCLUB}}=\frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N}\left[\log q_{\theta}\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{i}\right)-\log q_{\theta}\left(\boldsymbol{y}_{j} \mid \boldsymbol{x}_{i}\right)\right]\\ &=\frac{1}{N} \sum_{i=1}^{N}\left[\log q_{\theta}\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{i}\right)-\frac{1}{N} \sum_{j=1}^{N} \log q_{\theta}\left(\boldsymbol{y}_{j} \mid \boldsymbol{x}_{i}\right)\right] \end{aligned}\]WARNING: there is no longer a guarantee that this is an upper bound. The argument that vCLUB still works rests on the expressiveness of the neural network, which is not fully guaranteed either.
Computationally, this sample-based estimator costs \(\mathcal{O}\left(N^{2}\right)\) because of the second term.
The authors propose another, simpler bound, vCLUB-S (sampled vCLUB):
\[\hat{\mathrm{I}}_{\mathrm{vCLUB}-\mathrm{S}}=\frac{1}{N} \sum_{i=1}^{N}\left[\log q_{\theta}\left(\boldsymbol{y}_{i} \mid \boldsymbol{x}_{i}\right)-\log q_{\theta}\left(\boldsymbol{y}_{k_{i}^{\prime}} \mid \boldsymbol{x}_{i}\right)\right],\]which is unbiased, with the computational complexity reduced to \(\mathcal{O}(N)\); here \(k_{i}^{\prime}\) is an index sampled uniformly from \(\{1,\ldots,N\}\). The negative-pair term originally needed all \(N^2\) pairs but now needs only ONE pair per sample. This trick does not work for the L1Out bound because its denominator contains a mean inside a \(\log\); estimating that mean with a single sample is biased due to Jensen's inequality (again).
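A sketch of the sampled \(\mathcal{O}(N)\) estimator on the same style of toy Gaussian pair (my illustration; \(\rho\), \(N\), and the exact conditional standing in for \(q_\theta\) are assumptions): one uniformly drawn negative index per sample replaces the full \(N\times N\) negative term.

```python
import numpy as np

# vCLUB-S: one random negative index k_i' per sample.
rng = np.random.default_rng(4)
rho, N = 0.8, 100000
var = 1 - rho ** 2
x = rng.normal(size=N)
y = rho * x + np.sqrt(var) * rng.normal(size=N)

def log_q(y, x):                       # stands in for q_theta(y | x)
    return -0.5 * np.log(2 * np.pi * var) - (y - rho * x) ** 2 / (2 * var)

neg = rng.integers(0, N, size=N)       # k_i' drawn uniformly from {0, ..., N-1}
club_s = np.mean(log_q(y, x) - log_q(y[neg], x))
true_mi = -0.5 * np.log(var)
print(club_s, true_mi)                 # matches the full vCLUB value on average
```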
Domain adaptation tasks can be solved if the representation of an image can be disentangled into domain information and semantic information.
Applying two encoders, a content encoder and a domain encoder, we can use the CLUB bound to minimize the mutual information between the content representation and the domain representation, so that the two representations become disentangled.
This paper introduces how to derive a class of variational bounds on mutual information, including MINE-f and InfoNCE. It also proposes a new upper bound and a new lower bound that trade off between the bias and the variance of the MI estimate.
Let’s begin with this paper, On Variational Bounds of Mutual Information, in ICML 2019.
MI:
\[\label{mi-kl} I(X ; Y)=\mathbb{E}_{p(x, y)}\left[\log \frac{p(x \mid y)}{p(x)}\right]=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{p(y)}\right]\]Notes: In the following, there is a term “(un)normalized”, which indicates whether a variational distribution is normalized.
From Eq. (\ref{mi-kl}), we can obtain different types of bounds via the KL divergence by substituting variational distributions for the marginal or conditional distributions.
For example, if \(p(y\mid x)\) is like an encoder in VAE, by inserting \(q(y)\), which approximates \(p(y)\), we can have a tractable variational upper bound:
\[\begin{aligned} I(X ; Y) & \equiv \mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{p(y)}\right] \\ &=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x) q(y)}{q(y) p(y)}\right] \\ &=\mathbb{E}_{p(x, y)}\left[\log \frac{p(y \mid x)}{q(y)}\right]-K L(p(y) \| q(y)) \\ & \leq \mathbb{E}_{p(x)}[K L(p(y \mid x) \| q(y))], \end{aligned}\]where the inequality comes from the non-negative KL term. This bound is tight when \(q(y)=p(y)\). VAE uses exactly this in its optimization.
We can go in the opposite direction by using another variational distribution \(q(x\mid y)\). The following lower bound was derived by Barber & Agakov in 2003:
\[\begin{aligned} I(X ; Y)&=\mathbb{E}_{p(x, y)}\left[\log \frac{q(x \mid y)}{p(x)}\right] \\ &\qquad+\mathbb{E}_{p(y)}[K L(p(x \mid y) \| q(x \mid y))] & \\ &\geq \mathbb{E}_{p(x, y)}[\log q(x \mid y)]+h(X) \triangleq I_{\mathrm{BA}}, \end{aligned}\]where \(q(x\mid y)\) can be thought of as a decoder and \(h(X)=-\mathbb{E}_{p(x)}\log p(x)\) is the differential entropy of \(X\). This bound is tight when \(q(x\mid y)=p(x\mid y)\). The differential entropy term actually makes this bound intractable because \(p(x)\) is usually unknown. A further drawback is that learning the decoder term is challenging when the data is high-dimensional.
In this section, the lower bound is mainly studied with an unnormalized variational distribution \(q(x\mid y)\).
The variational family is an energy-based style with a critic function \(f(x,y)\) and scaled by the data density \(p(x)\):
\[q(x \mid y)=\frac{p(x)}{Z(y)} e^{f(x, y)}, \text { where } Z(y)=\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right],\]which is exactly what we did in MINE.
Substituting \(q(x\mid y)\) in \(I_{\mathrm{BA}}\), we can get an unnormalized lower bound \(I_{\mathrm{UBA}}\):
\[\begin{aligned} I_{\mathrm{UBA}}&\triangleq\mathbb{E}_{p(x, y)}[\log q(x \mid y)]+h(X) \\ &=\mathbb{E}_{p(x, y)}[\log p(x)-\log Z(y)+f(x,y)]+h(X)\\ &=\mathbb{E}_{p(x, y)}[f(x,y)]-\mathbb{E}_{p(y)}[\log Z(y)]. \end{aligned}\]This bound is tight when:
\[\begin{aligned} q(x\mid y)&=\frac{p(x)}{Z(y)} e^{f(x, y)}=p(x\mid y)=\frac{p(x,y)}{p(y)}\\ f(x,y)&=\log \frac{p(x,y)}{p(x)} + \log \frac{Z(y)}{p(y)}\\ &=\log p(y\mid x)+c(y), \end{aligned}\]where \(c(y)=\log \frac{Z(y)}{p(y)}\) is a function solely of \(y\). This bound removes the intractable term \(h(X)\) in \(I_{\mathrm{BA}}\)! Are we done? Not yet, because the log partition function \(\log Z(y)=\log\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]\) is still intractable. Can't we just use Monte Carlo to estimate it? Unfortunately, such an estimate is biased: with a batch estimate \(\hat{Z}(y)=\frac{1}{M}\sum_{m}e^{f(x_m, y)}\), Jensen's inequality gives \(\mathbb{E}[\log \hat{Z}(y)]\leq\log\mathbb{E}[\hat{Z}(y)]=\log Z(y)\), so the plug-in estimate of \(\log Z(y)\) is biased downward and the resulting objective is no longer guaranteed to be a bound.
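A tiny illustration of this bias (my own, not from the paper): take \(f(x,y)=x\) with \(x\sim\mathcal{N}(0,1)\), for which \(\log Z=\log\mathbb{E}[e^{x}]=1/2\) exactly, and compare it with the average of the batch-wise plug-in estimates.

```python
import numpy as np

# Jensen bias of the Monte Carlo estimate of a log partition function:
# E[log Z_hat] <= log E[Z_hat] = log Z, so log Z_hat is biased downward.
rng = np.random.default_rng(5)
batches = rng.normal(size=(100000, 8))     # many small batches from p(x)
z_hat = np.mean(np.exp(batches), axis=1)   # per-batch estimate of Z = E[e^x]
mean_log_z_hat = np.mean(np.log(z_hat))
print(mean_log_z_hat, 0.5)                 # first number sits below 1/2
```

The bias shrinks as the batch grows, but for any finite batch the plug-in estimate of \(\log Z\) stays systematically low.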
Applying Jensen’s inequality to \(I_{\mathrm{UBA}}\) can recover the bound of Donsker & Varadhan in 1983:
\[I_{\mathrm{UBA}} \geq \mathbb{E}_{p(x, y)}[f(x, y)]-\log \mathbb{E}_{p(y)}[Z(y)] \triangleq I_{\mathrm{DV}}.\]It is still intractable in practice for the same reason: the Monte Carlo estimate of the log of an expectation is biased.
If we apply Jensen’s inequality in the other direction, we can have a tractable objective, but not a bound:
\[I \geq I_{\mathrm{UBA}} \leq \mathbb{E}_{p(x, y)}[f(x, y)]-\mathbb{E}_{p(x)p(y)}[f(x,y)],\]which is the same objective we get when estimating \(I_{\mathrm{DV}}\) with Monte Carlo. MINE is doing the same thing, which is not a bound at all!
So, how can we obtain a tractable unnormalized lower bound? We need to escape the expectation inside the \(\log\). Note that \(\log (x) \leq \frac{x}{a}+\log (a)-1,\forall x,a>0\). Applying this to \(\log Z(y)\) gives \(\log Z(y)\leq \frac{Z(y)}{a(y)}+\log (a(y))-1\), which is tight when \(a(y)=Z(y)\). This yields a Tractable Unnormalized version of \(I_{\mathrm{BA}}\):
\[\begin{aligned} I \geq I_{\mathrm{UBA}} \geq & \mathbb{E}_{p(x, y)}[f(x, y)] \\ &-\mathbb{E}_{p(y)}\left[\frac{\mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]}{a(y)}+\log (a(y))-1\right] \\ & \triangleq I_{\mathrm{TUBA}} \end{aligned}.\]Letting \(a(y)=e\) recovers the bound of Nguyen, Wainwright, and Jordan in 2010, also known as \(f\)-GAN and MINE-\(f\):
\[\mathbb{E}_{p(x, y)}[f(x, y)]-e^{-1} \mathbb{E}_{p(y)}[Z(y)] \triangleq I_{\mathrm{NWJ}},\]which no longer requires learning \(a(y)\). But \(f(x,y)\) must learn to self-normalize. The optimal critic satisfies:
\[\frac{p(x)}{e}e^{f(x,y)}=p(x\mid y).\]And the solution is:
\[f^{*}(x, y)=1+\log \frac{p(x \mid y)}{p(x)}.\]In conclusion, these unnormalized bounds are unbiased but exhibit high variance, because the (estimated) log partition function has high variance.
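A sketch evaluating \(I_{\mathrm{NWJ}}\) on a toy Gaussian pair with the optimal critic \(f^{*}(x,y)=1+\log\frac{p(x\mid y)}{p(x)}\) (my construction; \(\rho\) and the sample size are illustrative). With this critic the bound equals the true MI in expectation; the marginal-term estimate is the noisy part.

```python
import numpy as np

# I_NWJ with the optimal critic on a correlated Gaussian pair.
rng = np.random.default_rng(6)
rho, N = 0.5, 200000
var = 1 - rho ** 2
x = rng.normal(size=N)
y = rho * x + np.sqrt(var) * rng.normal(size=N)

def log_ratio(x, y):    # log p(x|y) - log p(x); here p(x|y) = N(rho y, var)
    return -0.5 * np.log(var) - (x - rho * y) ** 2 / (2 * var) + x ** 2 / 2

f_joint = 1 + log_ratio(x, y)                    # samples from p(x, y)
f_marg = 1 + log_ratio(rng.permutation(x), y)    # shuffled: from p(x) p(y)
i_nwj = np.mean(f_joint) - np.exp(-1) * np.mean(np.exp(f_marg))
true_mi = -0.5 * np.log(var)
print(i_nwj, true_mi)                            # close, second term is noisy
```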
Multi-sample unnormalized bounds have low variance but high bias. We will mainly talk about InfoNCE.
Assume \(x_1\) and \(y\) are from a sample pair \(p\left(x_{1}\right) p\left(y \mid x_{1}\right)\) while other \(K-1\) samples are from an independent distribution \(x_{2: K} \sim r^{K-1}\left(x_{2: K}\right)\). We can have:
\[I\left(X_{1} ; Y\right)=\mathbb{E}_{r^{K-1}\left(x_{2: K}\right)}\left[I\left(X_{1} ; Y\right)\right]=I\left(X_{1}, X_{2: K} ; Y\right).\]Such a multi-sample mutual information can be used in all previous bounds with the same optimal critic \(f^{*}\left(x_{1: K}, y\right)=1+\log \frac{p\left(y \mid x_{1: K}\right)}{p(y)}=1+\log \frac{p\left(y \mid x_{1}\right)}{p(y)}\).
Particularly, if we set the critic in \(I_{\mathrm{NWJ}}\) to:
\[f\left(x_{1: K}, y\right)=1+\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)},\]then \(I_{\mathrm{NWJ}}\) becomes:
\[\begin{aligned} I\left(X_{1} ; Y\right) &\geq I_{\mathrm{NWJ}} \triangleq\mathbb{E}_{p(x, y)}[f(x, y)]-e^{-1} \mathbb{E}_{p(y)}[Z(y)] \\ &=\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[1+\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right]-e^{-1} \mathbb{E}_{p(y)}\left[\mathbb{E}_{p(x_{1:K})}\left[e^{f\left(x_{1: K}, y\right)}\right]\right]\\ &=1+\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[\log \frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right]-\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{a\left(y ; x_{1: K}\right)}\right], \end{aligned}\]where the inequality comes from the critic not being guaranteed to be optimal. The additional samples from \(p(x)\) can be used to estimate the partition function \(Z(y)\):
\[Z(y)=a\left(y ; x_{1: K}\right)=m\left(y ; x_{1: K}\right)=\frac{1}{K} \sum_{i=1}^{K} e^{f\left(x_{i}, y\right)}.\]The last term in the bound can be simplified as:
\[\begin{array}{l} \mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{m\left(y ; x_{1: K}\right)}\right]=\frac{1}{K} \sum_{i=1}^{K} \mathbb{E}\left[\frac{e^{f\left(x_{i}, y\right)}}{m\left(y ; x_{1: K}\right)}\right] \\ =\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{\frac{1}{K} \sum_{i=1}^{K} e^{f\left(x_{i}, y\right)}}{m\left(y ; x_{1: K}\right)}\right]=1, \end{array}\]and thus the bound recovers the InfoNCE loss:
\[I(X ; Y) \geq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{e^{f\left(x_{i}, y_{i}\right)}}{\frac{1}{K} \sum_{j=1}^{K} e^{f\left(x_{i}, y_{j}\right)}}\right] \triangleq I_{\mathrm{NCE}}.\]The optimal critic for \(I_{\mathrm{NCE}}\) is \(f(x,y)=\log p(y\mid x)+c(y)\) like \(I_{\mathrm{UBA}}\). Note that \(I_{\mathrm{NCE}}\) itself is upper bounded by \(\log K\), which means that if \(I(X;Y)>\log K\), the \(I_{\mathrm{NCE}}\) bound is loose. The larger the batch, the better the estimation.
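The \(\log K\) ceiling and the near-MI behavior can be sketched numerically (my own illustration; \(K\), \(\rho\), and the batch count are assumptions) using a toy Gaussian pair and the critic \(e^{f}=p(y\mid x)\):

```python
import numpy as np

# InfoNCE estimate on a correlated Gaussian pair; capped at log K.
rng = np.random.default_rng(7)
rho, K, n_batches = 0.9, 64, 200
var = 1 - rho ** 2
true_mi = -0.5 * np.log(var)

def log_p(y, x):                       # log p(y | x) = log N(y; rho x, var)
    return -0.5 * np.log(2 * np.pi * var) - (y - rho * x) ** 2 / (2 * var)

ests = []
for _ in range(n_batches):
    x = rng.normal(size=K)
    y = rho * x + np.sqrt(var) * rng.normal(size=K)
    scores = log_p(y[None, :], x[:, None])         # scores[i, j] = f(x_i, y_j)
    lse = np.log(np.mean(np.exp(scores), axis=1))  # log (1/K) sum_j e^{f(x_i, y_j)}
    ests.append(np.mean(np.diag(scores) - lse))
i_nce = np.mean(ests)
print(i_nce, true_mi, np.log(K))       # i_nce <= log K by construction
```

Here the true MI is below \(\log K\), so the estimate sits slightly below the true value; with \(I(X;Y)>\log K\) the estimate would saturate at \(\log K\).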
We can interpolate between \(I_{\mathrm{NCE}}\) (high-bias, low-variance) and \(I_{\mathrm{NWJ}}\) (low-bias, high-variance) to trade off the bias and the variance. Setting the critic to \(1+\log \frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\) with \(\alpha\in\left[0,1\right]\) to get a continuum of lower bounds:
\[\begin{array}{l} 1+\mathbb{E}_{p\left(x_{1: K}\right) p\left(y \mid x_{1}\right)}\left[\log \frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\right] \\ -\mathbb{E}_{p\left(x_{1: K}\right) p(y)}\left[\frac{e^{f\left(x_{1}, y\right)}}{\alpha m\left(y ; x_{1: K}\right)+(1-\alpha) q(y)}\right] \triangleq I_{\alpha} \end{array}.\]Setting \(\alpha=0\), we can recover \(I_{\mathrm{NWJ}}\) and \(\alpha=1\), we can recover \(I_{\mathrm{NCE}}\). This interpolated bound is upper bounded by \(\log \frac{K}{\alpha}\).
In the current representation learning literature, \(p(y\mid x)\) is usually easily accessible when \(y\) is a learned stochastic representation.
An optimal critic of InfoNCE is given by \(f(x, y)=\log p(y \mid x)\). (WARNING: this is strange to me because \(c(y)\) is omitted here.) We can plug in this optimal \(f\) and obtain a bound:
\[\label{nce-vae} I(X ; Y) \geq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K} \log \frac{p\left(y_{i} \mid x_{i}\right)}{\frac{1}{K} \sum_{j=1}^{K} p\left(y_{i} \mid x_{j}\right)}\right].\]In Section 2.1, the upper bound is obtained by a variational distribution \(q(y)\).
\[I(X ; Y) \leq \mathbb{E}_{p(x)}[K L(p(y \mid x) \| q(y))],\]Given a batch of \(K\) samples, we can approximate \(p(y)\) in this way:
\[p(y) \approx\frac{1}{K} \sum_{i} p\left(y \mid x_{i}\right).\]Leaving out the corresponding \(y_i\) for \(x_i\), we can have \(q_{i}(y)=\frac{1}{K-1} \sum_{j \neq i} p\left(y \mid x_{j}\right)\). Therefore, the upper bound is:
\[\label{upper-vae} I(X ; Y) \leq \mathbb{E}\left[\frac{1}{K} \sum_{i=1}^{K}\left[\log \frac{p\left(y_{i} \mid x_{i}\right)}{\frac{1}{K-1} \sum_{j \neq i} p\left(y_{i} \mid x_{j}\right)}\right]\right].\]With Eq. (\ref{nce-vae}) and (\ref{upper-vae}), we successfully sandwich the MI without introducing a variational distribution. The only difference between these two bounds is whether \(p\left(y_{i} \mid x_{i}\right)\) is included in the denominator.
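The sandwich can be sketched numerically (my own illustration; \(\rho\), \(K\), and the batch count are assumptions): on a toy Gaussian pair with known \(p(y\mid x)\), the two displayed estimators bracket the true MI using only batch samples.

```python
import numpy as np

# Lower (InfoNCE-style) and upper (leave-one-out) bounds around the true MI.
rng = np.random.default_rng(8)
rho, K, n_batches = 0.9, 32, 200
var = 1 - rho ** 2
true_mi = -0.5 * np.log(var)

lowers, uppers = [], []
for _ in range(n_batches):
    x = rng.normal(size=K)
    y = rho * x + np.sqrt(var) * rng.normal(size=K)
    # p[i, j] = p(y_i | x_j)
    p = np.exp(-(y[:, None] - rho * x[None, :]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    pos = np.diag(p)
    lowers.append(np.mean(np.log(pos / p.mean(axis=1))))                  # lower bound
    uppers.append(np.mean(np.log(pos * (K - 1) / (p.sum(axis=1) - pos)))) # upper bound
lower, upper = np.mean(lowers), np.mean(uppers)
print(lower, true_mi, upper)           # lower <= MI <= upper in expectation
```

The only implementation difference between the two estimators is whether the diagonal term \(p(y_i\mid x_i)\) enters the denominator, exactly as noted above.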
This paper is kind of parallel to the previously discussed MINE method. It also provides an objective to optimize the mutual information between random variables. This InfoNCE objective was later analyzed and claimed to be tighter than MINE.
Let’s begin with this paper, Representation Learning with Contrastive Predictive Coding, on arXiv.
We will work from the mutual information itself to an objective function. If there is any concern about the notation or the model structure, please refer to Figure 1 of the original paper.
In the MINE paper, the mutual information is represented as a KL term, which is then transformed into the DV representation. In this InfoNCE method, in contrast, the MI between the original signal \(x\) and the encoding \(c\) is first defined as:
\[\label{eq:mi} I(x ; c)=\sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}.\]The original task is sequence prediction. Given the observed input \(x_t\), an encoder maps the input into a latent representation \(z_t=g_\text{enc}(x_t)\), and an autoregressive model summarizes all \(z_{\leq t}\) into a context latent representation \(c_{t}=g_{\text{ar}}\left(z_{\leq t}\right)\).
To predict the future signal \(x_{t+k}\), we do not apply a generative model \(p_k(x_{t+k}\mid c)\). Instead, we want to maximize the MI between \(x_{t+k}\) and \(x_{\leq t}\) (or \(c_t\)). According to Eq. (\ref{eq:mi}), since the ground-truth joint distribution \(p(x,c)\) is not something we can control, we instead model the density ratio:
\[\label{eq:ratio} f_{k}\left(x_{t+k}, c_{t}\right) \propto \frac{p\left(x_{t+k} \mid c_{t}\right)}{p\left(x_{t+k}\right)},\]where the “proportional to” trick is fine because the ratio is an unnormalized, unbounded quantity. In the paper, this density ratio is approximated by the following function:
\[\label{eq:nn} f_{k}\left(x_{t+k}, c_{t}\right)=\exp \left(z_{t+k}^{T} W_{k} c_{t}\right),\]where \(W_k\) is specific to the \(k\)-th prediction step. This function is similar to a dot product, indicating the similarity between the predicted latent representation \(z_{t+k}\) and the step-wise context latent representation \(W_{k} c_{t}\).
After obtaining a surrogate representation function to the MI, we want to plug it into the loss function. Here, the NCE loss is applied:
\[\label{eq:loss} \mathcal{L}_{\mathrm{N}}=-\underset{X}{\mathbb{E}}\left[\log \frac{f_{k}\left(x_{t+k}, c_{t}\right)}{\sum_{x_{j} \in X} f_{k}\left(x_{j}, c_{t}\right)}\right],\]where \(X=\left\{x_{1}, \ldots, x_{N}\right\}\) is a set of samples. Within this set, \(x_{t+k}\) is generated from \(c_t\), so the two form a positive sample pair, while all other samples \(x_j\in X\), paired with the same \(c_t\), serve as negative samples. To minimize this loss, intuitively, we make the numerator as large as possible and the denominator as small as possible. This is exactly what we want: large MI for positive pairs and small MI for negative pairs.
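A minimal numpy sketch of this loss (my own illustration, not the authors' code): a bilinear critic \(\exp(z^{T}W_k c)\) scored against in-batch negatives, i.e., a \(B\)-way cross-entropy on the score matrix with positives on the diagonal. All shapes and the random stand-in "encoder" outputs are assumptions.

```python
import numpy as np

# InfoNCE loss with a bilinear critic on one batch of latent pairs.
rng = np.random.default_rng(9)
B, dz, dc = 16, 8, 8                     # batch size and latent dimensions
z_future = rng.normal(size=(B, dz))      # z_{t+k}, one per sequence in the batch
c = rng.normal(size=(B, dc))             # c_t, one per sequence
W_k = 0.1 * rng.normal(size=(dz, dc))    # step-k bilinear weight

scores = z_future @ W_k @ c.T            # scores[i, j] = z_i^T W_k c_j
log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))      # positives are the matching (i, i) pairs
print(loss)                              # roughly log B for an untrained critic
```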
Taking a look at Eq. (\ref{eq:loss}), it is actually a multi-category cross-entropy loss (note that cross-entropy does not require a softmax, which is often misunderstood in the literature).
So what is the optimal case of Eq. (\ref{eq:loss})? Assume for now that \(x_i\) is the ground-truth prediction for \(c_t\). Then the probability that \(x_i\) is the positive sample while all other \(x_l\) are negatives is:
\[\begin{aligned} p\left(d=i \mid X, c_{t}\right) &=\frac{p\left(x_{i} \mid c_{t}\right) \prod_{l \neq i} p\left(x_{l}\right)}{\sum_{j=1}^{N} p\left(x_{j} \mid c_{t}\right) \prod_{l \neq j} p\left(x_{l}\right)} \\ &=\frac{\frac{p\left(x_{i} \mid c_{t}\right)}{p\left(x_{i}\right)}}{\sum_{j=1}^{N} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}}, \end{aligned}\]where \(p\left(x_{i} \mid c_{t}\right) \prod_{l \neq i} p\left(x_{l}\right)\) means that, independently, \(x_i\) comes from \(c_t\) while every other \(x_l\) comes from the prior, i.e., the marginal distribution.
Therefore, we can see that \(f(x_{t+k},c_t)\) in Eq. (\ref{eq:ratio}) is, indeed, proportional to \(\frac{p\left(x_{i} \mid c_{t}\right)}{p\left(x_{i}\right)}\).
Although we are already able to train the network with this loss, we still want to see how well InfoNCE approximates the MI.
Inserting the optimal \(f(x_{t+k},c_t)\) into the loss function Eq. (\ref{eq:loss}),
\[\begin{aligned} \mathcal{L}_{\mathrm{N}}^{\text {opt }} &=-\underset{X}{\mathbb{E}} \log \left[\frac{\frac{p\left(x_{t+k} \mid c_{t}\right)}{p\left(x_{t+k}\right)}}{\frac{p\left(x_{t+k} \mid c_{t}\right)}{p\left(x_{t+k}\right)}+\sum_{x_{j} \in X_{\text {neg }}} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}}\right] \text{, split into positive and negative} \\ &=\underset{X}{\mathbb{E}} \log \left[1+\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)} \sum_{x_{j} \in X_{\text {neg }}} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}\right] \text{, move "-" inside} \\ & \approx \underset{X}{\mathbb{E}} \log \left[1+\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)}(N-1) \underset{x_{j}}{\mathbb{E}} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}\right] \text{, sum to expectation}\\ &=\underset{X}{\mathbb{E}} \log \left[1+\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)}(N-1)\right],x_j \text{ and } c_t \text{ are independent}\\ & \geq \underset{X}{\mathbb{E}} \log \left[\frac{p\left(x_{t+k}\right)}{p\left(x_{t+k} \mid c_{t}\right)} N\right],p\left(x_{t+k} \mid c_{t}\right)>p\left(x_{t+k}\right) \\ &=-I\left(x_{t+k}, c_{t}\right)+\log (N). \end{aligned}\]We now derive the relation between InfoNCE and MINE.
For simplicity and without loss of generality, we let \(f(x,c)=e^{F(x,c)}\); then:
\[\begin{aligned} &\mathbb{E}_{X}\left[\log \frac{f(x, c)}{\sum_{x_{j} \in X} f\left(x_{j}, c\right)}\right] \\ &=\underset{(x, c)}{\mathbb{E}}[F(x, c)]-\underset{(x, c)}{\mathbb{E}}\left[\log \sum_{x_{j} \in X} e^{F\left(x_{j}, c\right)}\right] \\ &=\underset{(x, c)}{\mathbb{E}}[F(x, c)]-\underset{(x, c)}{\mathbb{E}}\left[\log \left(e^{F(x, c)}+\sum_{x_{j} \in X_{\text {neg }}} e^{F\left(x_{j}, c\right)}\right)\right] \\ &\leq \underset{(x, c)}{\mathbb{E}}[F(x, c)]-\underset{c}{\mathbb{E}}\left[\log \sum_{x_{j} \in X_{\mathrm{neg}}} e^{F\left(x_{j}, c\right)}\right] \\ &=\underset{(x, c)}{\mathbb{E}}[F(x, c)]-\underset{c}{\mathbb{E}}\left[\log \frac{1}{N-1} \sum_{x_{j} \in X_{\text {neg }}} e^{F\left(x_{j}, c\right)}+\log (N-1)\right]. \end{aligned}\]MINE by definition:
\[I \widehat{(X ; Z)}_{n}=\sup _{\theta \in \Theta} \mathbb{E}_{\mathbb{P}_{X Z}^{(n)}}\left[T_{\theta}\right]-\log \left(\mathbb{E}_{\mathbb{P}_{X}^{(n)} \otimes \hat{\mathbb{P}}_{Z}^{(n)}}\left[e^{T_{\theta}}\right]\right).\]From both equations, we can see that they are almost the same, except that InfoNCE has an extra constant term \(\underset{c}{\mathbb{E}}[\log (N-1)]\). The authors claim that for difficult tasks both losses perform almost equally well, while for easy tasks MINE is unstable.
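The InfoNCE bound \(I \geq \log N - \mathcal{L}_N\) is easy to sanity-check numerically. Below is a toy Monte Carlo sketch with a made-up binary context/sample pair; to keep it self-contained, the true density ratio is plugged in as the score instead of a learned one:

```python
import math
import random

random.seed(0)

# Made-up toy joint: c ~ Uniform{0,1}; x equals c with prob 0.8.
# Density ratio r(x, c) = p(x|c) / p(x), with p(x) = 0.5.
def r(x, c):
    return (0.8 if x == c else 0.2) / 0.5

# True mutual information I(x; c) in nats.
true_mi = math.log(2) + 0.8 * math.log(0.8) + 0.2 * math.log(0.2)

N = 4  # one positive plus N-1 negatives per sample
trials = 200_000
loss = 0.0
for _ in range(trials):
    c = random.randint(0, 1)
    x_pos = c if random.random() < 0.8 else 1 - c
    # negatives are drawn from the marginal p(x), independent of c
    scores = [r(x_pos, c)] + [r(random.randint(0, 1), c) for _ in range(N - 1)]
    loss += -math.log(scores[0] / sum(scores))
loss /= trials

# InfoNCE bound: I(x; c) >= log(N) - L_N
print(math.log(N) - loss, "<=", true_mi)
```

With the optimal scores, \(\log N - \mathcal{L}_N\) lands close to (and below) the true MI, as the bound predicts.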
Theorem 1 (The Gaussian Tail Inequality). Let \(X\sim N(0,1)\). Then
\[\mathbb{P}(\mid X\mid >\epsilon) \leq \frac{2 e^{-\epsilon^{2} / 2}}{\epsilon}.\]If \(X_{1}, \ldots, X_{n} \sim N(0,1)\) then
\[\mathbb{P}\left(\mid \bar{X}_{n}\mid >\epsilon\right) \leq \frac{2}{\sqrt{n} \epsilon} e^{-n \epsilon^{2} / 2} \stackrel{\text { large n}}{\leq} e^{-n \epsilon^{2} / 2}.\]Proof. Density of \(X\) is \(\phi(x)=(2\pi)^{-1/2}e^{-x^2/2}\). For one side:
\[\begin{aligned} \mathbb{P}(X>\epsilon)&=\int_{\epsilon}^{\infty} \phi(s) d s=\int_{\epsilon}^{\infty} \frac{s}{s} \phi(s) d s \stackrel{s>\epsilon}{\leq} \frac{1}{\epsilon} \int_{\epsilon}^{\infty} s \phi(s) d s\\ &\stackrel{\phi^{\prime}(x)=-x\phi(x)}{=}\frac{1}{\epsilon}(-\phi^{\prime}(s))\arrowvert^{\infty}_{\epsilon}=\frac{\phi(\epsilon)}{\epsilon} \leq \frac{e^{-\epsilon^{2} / 2}}{\epsilon}. \end{aligned}\]By symmetry,
\[\mathbb{P}(\mid X\mid >\epsilon) \leq \frac{2 e^{-\epsilon^{2} / 2}}{\epsilon}.\]Let \(X_{1}, \ldots, X_{n} \sim N(0,1)\). Then \(\bar{X}_{n}=n^{-1} \sum_{i=1}^{n} X_{i} \sim N(0,1 / n)\). Thus, \(\bar{X}_{n} \stackrel{d}{=} n^{-1 / 2} Z\), where \(Z \sim N(0,1)\) and based on the inequality above:
\[\mathbb{P}\left(\mid \bar{X}_{n}\mid >\epsilon\right)=\mathbb{P}\left(n^{-1 / 2}\mid Z\mid >\epsilon\right)=\mathbb{P}(\mid Z\mid >\sqrt{n} \epsilon) \leq \frac{2}{\sqrt{n} \epsilon} e^{-n \epsilon^{2} / 2}.\]Theorem 2 (Markov Inequality). Let \(X\) be a non-negative random variable and suppose that \(\mathbb{E}(X)\) exists. For any \(t>0\),
\[\label{eq:mk} \mathbb{P}(X>t) \leq \frac{\mathbb{E}(X)}{t}.\]Proof. Since \(X\geq 0\),
\[\begin{aligned} \mathbb{E}(X)&=\int_{0}^{\infty} x p(x) d x=\int_{0}^{t} x p(x) d x+\int_{t}^{\infty} x p(x) d x\\ &\geq\int_{t}^{\infty} x p(x) d x\stackrel{x\geq t}{\geq}t\int_{t}^{\infty}p(x) d x=t\mathbb{P}(X>t). \end{aligned}\]Theorem 3 (Chebyshev’s inequality). Let \(\mu=\mathbb{E}(X)\) and \(\sigma^{2}=\operatorname{Var}(X)\). Then
\[\mathbb{P}(\mid X-\mu\mid \geq t) \leq \frac{\sigma^{2}}{t^{2}}\text{ and }\mathbb{P}(\mid Z\mid \geq k) \leq \frac{1}{k^{2}},\]where \(Z=(X-\mu) / \sigma\). In particular, \(\mathbb{P}(\mid Z\mid>2) \leq 1 / 4 \text { and } \mathbb{P}(\mid Z\mid>3) \leq 1 / 9\).
Proof.
\[\mathbb{P}(\mid X-\mu\mid \geq t)=\mathbb{P}\left(\mid X-\mu\mid ^{2} \geq t^{2}\right) \stackrel{Markov}{\leq} \frac{\mathbb{E}(X-\mu)^{2}}{t^{2}}=\frac{\sigma^{2}}{t^{2}}.\]Let \(t=k\sigma\),
\[\mathbb{P}(\mid Z\mid \geq k)=\mathbb{P}(\mid X-\mu\mid \geq k\sigma)\leq\frac{1}{k^2}.\]Lemma 1. Let \(X\) be a random variable. Then
\[\mathbb{P}(X>\epsilon) \leq \inf _{t \geq 0} e^{-t \epsilon} \mathbb{E}\left(e^{t X}\right).\]Proof. For any \(t>0\),
\[\mathbb{P}(X>\epsilon)=\mathbb{P}(e^X>e^\epsilon)=\mathbb{P}(e^{tX}>e^{t\epsilon})\stackrel{Markov}{\leq}\frac{\mathbb{E}(e^{tX})}{e^{t\epsilon}}.\]Lemma 2 (Chernoff’s method). Suppose that \(a\leq X\leq b\). Then
\[\mathbb{E}\left(e^{t X}\right) \leq e^{t \mu} e^{\frac{t^{2}(b-a)^{2}}{8}},\]where \(\mu=\mathbb{E}[X]\).
Proof. Assume \(\mu=0\) for simplicity. Since \(a\leq X\leq b\), \(X\) can be written as a convex combination of \(a\) and \(b\): \(X=\alpha b+(1-\alpha)a\), where \(\alpha=(X-a)/(b-a)\). Because \(e^x\) is a convex function,
\[e^{tX}\stackrel{convex}{\leq}\alpha e^{tb}+(1-\alpha)e^{ta}=\frac{X-a}{b-a} e^{t b}+\frac{b-X}{b-a} e^{t a}.\]Take the expectation of both sides and remember \(\mathbb{E}(X)=\mu=0\),
\[\mathbb{E}(e^{tX})\leq -\frac{a}{b-a} e^{t b}+\frac{b}{b-a} e^{t a}=e^{g(u)},\]where \(u=t(b-a)\), \(g(u)=-\gamma u+\log(1-\gamma+\gamma e^u)\) and \(\gamma=-a/(b-a)\).
Why do we need such a \(g\) function? Because it admits a simple upper bound. Basic properties of \(g(u)\) are \(g(0)=0\); the first order derivative is
\[g'(u)=-\gamma+\frac{\gamma e^u}{1-\gamma+\gamma e^u}=-\gamma+1-\frac{1-\gamma}{1-\gamma+\gamma e^u},\]and \(g'(0)=0\); and the second derivative is
\[g''(u)=\frac{\gamma(1-\gamma)e^u}{(1-\gamma+\gamma e^u)^2},\]which is bounded by \(g''(u)\leq1/4\) for all \(u>0\). By Taylor’s theorem to the second order, there is a \(\xi \in(0, u)\) such that
\[g(u)=g(0)+ug'(0)+\frac{u^2}{2}g''(\xi)\leq\frac{u^2}{8}=\frac{t^2(b-a)^2}{8}.\]Therefore,
\[\mathbb{E}(e^{tX})\leq e^{g(u)}\leq e^{\frac{t^2(b-a)^2}{8}}.\]Theorem 4 (Hoeffding’s Inequality). Let \(Y_1,\dots,Y_n\) be iid observations such that \(\mathbb{E}(Y_i)=\mu\) and \(a\leq Y_i\leq b\). Then, for any \(\epsilon>0\),
\[\mathbb{P}(\mid \bar{Y}_n-\mu\mid\geq\epsilon)\leq2e^{-2n\epsilon^2/(b-a)^2}.\]Proof. Assume \(\mu=0\) without loss of generality (otherwise replace \(Y_i\) by \(Y_i-\mu\)). For one side \(\mathbb{P}(\bar{Y}_n\geq\epsilon)\), for any \(t>0\), we have
\[\begin{aligned} \mathbb{P}(\bar{Y}_n\geq\epsilon)&=\mathbb{P}(\sum_{i=1}^nY_i\geq n\epsilon)=\mathbb{P}(e^{\sum_{i=1}^nY_i}\geq e^{n\epsilon})\\ &=\mathbb{P}(e^{t\sum_{i=1}^nY_i}\geq e^{tn\epsilon})\stackrel{Markov}{\leq}\frac{\mathbb{E}(e^{t\sum_{i=1}^nY_i})}{e^{tn\epsilon}}\\ &=e^{-tn\epsilon}\prod_{i=1}^n\mathbb{E}(e^{tY_i})=e^{-tn\epsilon}\left(\mathbb{E}(e^{tY_i})\right)^n. \end{aligned}\]From Lemma 2, \(\mathbb{E}(e^{tY_i})\leq e^{t^2(b-a)^2/8}\). So
\[\mathbb{P}(\bar{Y}_n\geq\epsilon)\leq e^{-tn\epsilon+nt^2(b-a)^2/8},\]which can achieve a minimum by setting \(t=4\epsilon/(b-a)^2\):
\[\mathbb{P}(\bar{Y}_n\geq\epsilon)\leq e^{-2n\epsilon^2/(b-a)^2}.\]By symmetry, this yields Hoeffding’s inequality.
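Hoeffding’s inequality is easy to check by simulation. The toy Monte Carlo below uses Bernoulli variables (bounded in \([0,1]\), so \(b-a=1\)) and compares the empirical deviation frequency against the bound; the parameters are made-up example values:

```python
import math
import random

random.seed(0)

# Bernoulli(p) variables are bounded in [0, 1], so b - a = 1.
n, p, eps = 100, 0.3, 0.1
trials = 20_000
hits = sum(
    abs(sum(random.random() < p for _ in range(n)) / n - p) >= eps
    for _ in range(trials))
freq = hits / trials

# Hoeffding bound: 2 * exp(-2 n eps^2) for (b - a)^2 = 1
bound = 2 * math.exp(-2 * n * eps ** 2)
print(freq, "<=", bound)
```

The empirical frequency sits well below the bound, as expected for a worst-case inequality.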
Example. Let \(X_1,\dots,X_n\sim\)Bernoulli\((p)\). From Hoeffding’s inequality,
\[\mathbb{P}(\mid\bar{X}_n-p\mid\geq\epsilon)\leq2e^{-2n\epsilon^2}.\]Theorem 5 (Cauchy-Schwarz inequality). If \(X\) and \(Y\) have finite variances, then
\[\mid\mathbb{E}(XY)\mid\leq\mathbb{E}\mid XY\mid\leq\sqrt{\mathbb{E}X^2\mathbb{E}Y^2}.\]It can be written as:
\[\text{Cov}^2(X,Y)\leq\sigma^2_X\sigma^2_Y.\]Theorem 6 (Jensen’s inequality). If \(g\) is convex, then
\[\mathbb{E}(g(X))\geq g(\mathbb{E}(X));\]If concave, then
\[\mathbb{E}(g(X))\leq g(\mathbb{E}(X)).\]Example. Let \(g(x)=x^2\), and it is convex. Then
\[\mathbb{E}(X^2)\geq(\mathbb{E}(X))^2.\]Example. KL divergence is defined as
\[D_{KL}(p, q)=\int p(x) \log \left(\frac{p(x)}{q(x)}\right) d x.\]If \(p=q\), \(D_{KL}(p,p)=0\). Else,
\[\begin{aligned} -D_{KL}(p, q)&=\mathbb{E} \log \left(\frac{q(X)}{p(X)}\right) \stackrel{Jensen}{\leq} \log \mathbb{E}\left(\frac{q(X)}{p(X)}\right)\\ &=\log\int p(x)\frac{q(x)}{p(x)}dx=0. \end{aligned}\]Mutual information has recently been applied successfully in deep learning. The original difficulty of using mutual information is that it is hard to compute exactly. Recent methods focus on deriving a tractable bound that can be optimized.
Let’s begin with the first big-hit paper, Mutual Information Neural Estimation (MINE), published at ICML 2018.
We will discuss from the mutual information itself to how to get a tractable bound for optimization.
Mutual information \(I\) quantifies the statistical dependence of two random variables \(X\) and \(Z\) (can be thought of as input variables and the corresponding latent variables). \(I(X;Z)\) is generally defined as:
\[\label{eq:mi} I(X;Z)=H(X)-H(X\mid Z)=H(Z)-H(Z\mid X)=I(Z;X),\]where \(H\) is Shannon entropy of a random variable. \(H(X)=-\mathbb{E}_{p(x)}\log p(x)\) and \(H(X\mid Z)=-\mathbb{E}_{p(x,z)}\log p(x\mid z)\).
Note that \(H(X\mid Z)\) is derived as:
\[\label{eq:con-ent} \begin{aligned} H(X\mid Z)&=\int_{z}p(z)H(X\mid Z=z)dz \\ &=-\int_{z}p(z)\int_{x}p(x\mid z)\log p(x\mid z)dxdz \\ &=-\int_{x,z}p(x,z)\log p(x\mid z)dxdz. \end{aligned}\]There is a more useful way, in terms of computation, to describe \(I(X;Z)\):
\[\begin{aligned} \label{eq:kl-mi} I(X; Z) &=H(X)-H(X \mid Z) \\ &=-\int_{x} p(x) \log p(x) d x+\int_{x, z} p(x, z) \log p(x \mid z) d x d z \\ &=\int_{x, z}(-p(x, z) \log p(x)+p(x, z) \log p(x \mid z)) d x d z \\ &=\int_{x, z}\left(p(x, z) \log \frac{p(x, z)}{p(x) p(z)}\right) d x d z \\ &=D_{K L}(P(X, Z) \| P(X) \otimes P(Z)) \end{aligned}\]So the mutual information can be thought of as the KL divergence between the joint distribution and the product of both marginal distributions. And the mutual information is obviously symmetric.
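Both forms of the mutual information can be verified on a small discrete joint distribution (the probability table below is made up for illustration):

```python
import math

# A made-up discrete joint distribution p(x, z).
joint = {(0, 0): 0.30, (0, 1): 0.10,
         (1, 0): 0.15, (1, 1): 0.45}

# Marginals p(x) and p(z).
px = {x: sum(v for (a, _), v in joint.items() if a == x) for x in (0, 1)}
pz = {z: sum(v for (_, b), v in joint.items() if b == z) for z in (0, 1)}

# KL form: I = sum p(x,z) log( p(x,z) / (p(x) p(z)) )
mi_kl = sum(v * math.log(v / (px[x] * pz[z])) for (x, z), v in joint.items())

# Entropy form: I = H(X) - H(X|Z), with p(x|z) = p(x,z) / p(z)
h_x = -sum(v * math.log(v) for v in px.values())
h_x_given_z = -sum(v * math.log(v / pz[z]) for (x, z), v in joint.items())
mi_ent = h_x - h_x_given_z

print(mi_kl, mi_ent)
```

Both expressions give the same value, confirming the derivation above.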
According to the KL divergence form of the mutual information, if \(X\) and \(Z\) are independent, \(p(x, z) = p(x) \times p(z)\) and \(I(X;Z)=0\). If \(X\) and \(Z\) become more dependent, the mutual information will intuitively increase. Therefore, mutual information can be considered as a good metric for determining the dependence between random variables.
Although we have a metric for the dependence, this KL form is still intractable. Consider a typical encoder model: we only have \(P(Z\mid X)\), while \(P(X,Z)\), \(P(X)\) and \(P(Z)\) are all intractable. Therefore, we need to find a tractable representation to optimize. Here comes the Donsker-Varadhan (DV) representation:
\[\label{eq:dv-dk} D_{K L}(\mathbb{P} \| \mathbb{Q})=\sup _{T: \Omega \rightarrow \mathbb{R}} \mathbb{E}_{\mathbb{P}}[T]-\log (\mathbb{E}_{\mathbb{Q}}[e^{T}]),\]where \(\mathbb{P}\) and \(\mathbb{Q}\) are two arbitrary distributions and \(T\) is an arbitrary function mapping from the sample space \(\Omega\) to the real numbers \(\mathbb{R}\).
Proof. To prove Eq. (\ref{eq:dv-dk}), we actually want to prove the following statement:
\[\label{eq:dv-le} D_{K L}(\mathbb{P} \| \mathbb{Q}) \geq \mathbb{E}_{\mathbb{P}}[T]-\log (\mathbb{E}_{\mathbb{Q}}[e^{T}]),\]and under some \(T\), the equal sign in Eq. (\ref{eq:dv-le}) is achievable.
First, we define an auxiliary Gibbs distribution \(d \mathbb{G}=\frac{1}{Z} e^{T} d \mathbb{Q}\), where \(Z=\mathbb{E}_{\mathbb{Q}}\left[e^{T}\right]\) and \(d \mathbb{Q}\) can be safely thought of as the density function. In an informal way, the Gibbs distribution can be re-written as \(g(x)=\frac{1}{Z} e^{T(x)} q(x)\). And to ensure \(g(x)\) is a probability density function (integral result equals 1), we let \(Z=\int_{x}e^{T(x)}q(x)dx\), which is exactly the expectation of \(e^{T(x)}\) under distribution \(\mathbb{Q}\).
With \(\mathbb{G}\), the right hand side of Eq. (\ref{eq:dv-le}) can be reformed as:
\[\label{eq:rhs} \begin{aligned} \mathbb{E}_{\mathbb{P}}[T]-\log (\mathbb{E}_{\mathbb{Q}}[e^{T}])&=\mathbb{E}_{\mathbb{P}}\left[\log (e^{T})-\log (\mathbb{E}_{\mathbb{Q}}[e^{T}])\right] \\ &=\mathbb{E}_{\mathbb{P}}\left[\log \frac{e^{T}}{\mathbb{E}_{\mathbb{Q}}[e^{T}]}\right], \text{ (group log terms together)} \\ &=\mathbb{E}_{\mathbb{P}}\left[\log \frac{e^{T}d\mathbb{Q}}{\mathbb{E}_{\mathbb{Q}}[e^{T}]}\frac{1}{d\mathbb{Q}}\right], \text{ (multiply } d\mathbb{Q}\text{)} \\ &=\mathbb{E}_{\mathbb{P}}\left[\log \frac{d\mathbb{G}}{d\mathbb{Q}}\right]. \end{aligned}\]Let \(\Delta\) be the gap of terms in Eq. (\ref{eq:dv-le}):
\[\Delta:=D_{K L}(\mathbb{P} \| \mathbb{Q})-\left(\mathbb{E}_{\mathbb{P}}[T]-\log \left(\mathbb{E}_{Q}\left[e^{T}\right]\right)\right).\]And \(\Delta\) can be represented as a KL term:
\[\Delta=\mathbb{E}_{\mathbb{P}}\left[\log \frac{d \mathbb{P}}{d \mathbb{Q}}-\log \frac{d \mathbb{G}}{d \mathbb{Q}}\right]=\mathbb{E}_{\mathbb{P}} \log \frac{d \mathbb{P}}{d \mathbb{G}}=D_{K L}(\mathbb{P} \| \mathbb{G}) \geq 0.\]Therefore, Eq. (\ref{eq:dv-le}) holds in any situation. Further, for \(\mathbb{G}=\mathbb{P}\), this bound is tight because the KL in between is 0.
But what does \(\mathbb{G}=\mathbb{P}\) indicate? From the definition \(d \mathbb{G}=\frac{1}{Z} e^{T} d \mathbb{Q}\), it means that \(d \mathbb{P}=\frac{1}{Z} e^{T} d \mathbb{Q}\) as well. Overall, it indicates that for some optimal \(T^{*}=\log \frac{d \mathbb{P}}{d \mathbb{Q}}+\text{const}\), the gap can equal 0. Note that it does not mean that \(\mathbb{Q}=\mathbb{P}\), because we are not minimizing the KL; instead, we are calculating a bound of the KL and finding the tightest point of this bound.
So formally, the bound is like:
\[\label{eq:bound-dv} D_{K L}(\mathbb{P} \| \mathbb{Q}) \geq \sup _{T \in \mathcal{F}} \mathbb{E}_{\mathbb{P}}[T]-\log (\mathbb{E}_{\mathbb{Q}}[e^{T}]),\]where \(\mathcal{F}\) is a class of functions (and neural networks are functions).
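For discrete distributions, the DV bound can be evaluated exactly. The sketch below checks that the optimal critic \(T^{*}=\log \frac{d\mathbb{P}}{d\mathbb{Q}}\) attains the KL, while a suboptimal critic stays below it (both distributions and the suboptimal \(T\) are made-up examples):

```python
import math

# Two made-up distributions over a 3-point space.
P = {0: 0.7, 1: 0.2, 2: 0.1}
Q = {0: 0.3, 1: 0.3, 2: 0.4}

kl = sum(p * math.log(p / Q[k]) for k, p in P.items())

def dv_value(T):
    # E_P[T] - log E_Q[e^T]
    e_p = sum(p * T[k] for k, p in P.items())
    log_e_q = math.log(sum(q * math.exp(T[k]) for k, q in Q.items()))
    return e_p - log_e_q

# Optimal critic T* = log(dP/dQ) attains the KL exactly.
T_star = {k: math.log(P[k] / Q[k]) for k in P}
# An arbitrary suboptimal critic gives a smaller value.
T_bad = {0: 1.0, 1: 0.0, 2: -1.0}

print(kl, dv_value(T_star), dv_value(T_bad))
```

Adding any constant to \(T^{*}\) leaves `dv_value` unchanged, matching the \(+\text{const}\) freedom noted above.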
There is a weaker bound of the KL, derived via the \(f\)-divergence representation:
\[D_{K L}(\mathbb{P} \| \mathbb{Q}) \geq \sup _{T \in \mathcal{F}} \mathbb{E}_{\mathbb{P}}[T]-\mathbb{E}_{\mathbb{Q}}[e^{T-1}].\]It is weaker because \(\mathbb{E}_{\mathbb{Q}}[e^{T-1}]\geq\log (\mathbb{E}_{\mathbb{Q}}[e^{T}])\), which follows from the inequality \(\frac{x}{e}\geq\log x\) (with equality at \(x=e\)) applied to \(x=\mathbb{E}_{\mathbb{Q}}[e^{T}]\).
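A quick numerical check of that claim, with a made-up \(\mathbb{Q}\) and random critic values:

```python
import math
import random

random.seed(0)

# Made-up Q over a 3-point space; T sampled at random.
Q = [0.3, 0.3, 0.4]
worst_gap = float("inf")
for _ in range(1000):
    T = [random.uniform(-3, 3) for _ in Q]
    lhs = sum(q * math.exp(t - 1) for q, t in zip(Q, T))   # E_Q[e^{T-1}]
    rhs = math.log(sum(q * math.exp(t) for q, t in zip(Q, T)))  # log E_Q[e^T]
    worst_gap = min(worst_gap, lhs - rhs)
print("smallest gap over 1000 random critics:", worst_gap)
```

The gap never goes negative, so the \(f\)-divergence penalty term is never below the DV one.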
In Eq. (\ref{eq:kl-mi}), the MI is represented as a KL term \(D_{K L}(P(X, Z) \| P(X) \otimes P(Z))\). With the Bound (\ref{eq:bound-dv}), we have:
\[\label{eq:bound-mi} I(X ; Z) \geq I_{\Theta}(X, Z)=\sup _{\theta \in \Theta} \mathbb{E}_{\mathbb{P}_{X Z}}[T_{\theta}]-\log (\mathbb{E}_{\mathbb{P}_{X} \otimes \mathbb{P}_{Z}}[e^{T_{\theta}}]),\]where \(T_{\theta}: \mathcal{X} \times \mathcal{Z} \rightarrow \mathbb{R}\) is a neural network that takes the two random variables as input and produces a scalar output (the statistics network in the original paper). Note that we can use samples of \(X\) and \(Z\) to estimate the expectations in Bound (\ref{eq:bound-mi}), so those intractable probabilities are no longer needed.
Yea, it is MINE.
Using a neural network \(T\) and \(n\) samples of \(X\), we can have the following estimator of MI:
\[I \widehat{(X ; Z)}_{n}=\sup _{\theta \in \Theta} \mathbb{E}_{\mathbb{P}_{X Z}^{(n)}}[T_{\theta}]-\log (\mathbb{E}_{\mathbb{P}_{X}^{(n)} \otimes \hat{\mathbb{P}}_{Z}^{(n)}}[e^{T_{\theta}}]),\]where all distributions are empirical. \(X\) and \(Z\) from \(\mathbb{P}_{X Z}^{(n)}\) are empirically related, e.g., input and label, input and latent code, or noise and the generated result. As for the marginal distributions, if the data generation procedure is known, like experiment 4.1 in the paper, samples can still be drawn directly. If the procedure is unknown, like in the GAN experiment, shuffling the batch dimension of the data from the joint distribution yields samples from the marginal distributions.
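The shuffling trick can be sketched in a few lines. To keep the example self-contained, the learned statistics network is replaced by the analytically optimal critic for a made-up binary joint distribution; in practice \(T_\theta\) would be a trained network:

```python
import math
import random

random.seed(0)

# Made-up joint samples: z = x with prob 0.9 (x ~ Uniform{0,1}).
def sample_pair():
    x = random.randint(0, 1)
    z = x if random.random() < 0.9 else 1 - x
    return x, z

# Stand-in for the trained statistics network: the optimal critic
# T*(x,z) = log p(x,z)/(p(x)p(z)), known here because the toy joint is known.
def T(x, z):
    return math.log((0.9 if x == z else 0.1) / 0.5)

batch = [sample_pair() for _ in range(100_000)]
zs = [z for _, z in batch]
random.shuffle(zs)  # shuffling z breaks the pairing -> marginal samples

joint_term = sum(T(x, z) for x, z in batch) / len(batch)
marg_term = math.log(
    sum(math.exp(T(x, z_s)) for (x, _), z_s in zip(batch, zs)) / len(batch))
mine_hat = joint_term - marg_term

true_mi = math.log(2) + 0.9 * math.log(0.9) + 0.1 * math.log(0.1)
print(mine_hat, true_mi)
```

With the optimal critic, the estimate lands close to the true MI; the whole point of MINE is that training \(T_\theta\) recovers this critic without knowing the densities.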
Detailed optimization steps of MINE can be found in Algorithm 1 in the paper. It is very straightforward.
The estimated gradient (by data) of \(\theta\) in the network \(T\) is:
\[\label{eq:grad} \widehat{G}_{B}=\mathbb{E}_{B}\left[\nabla_{\theta} T_{\theta}\right]-\frac{\mathbb{E}_{B}\left[\nabla_{\theta} T_{\theta} e^{T_{\theta}}\right]}{\mathbb{E}_{B}\left[e^{T_{\theta}}\right]},\]where \(B\) is a batch of data.
Are we done? Yes and no.
We do find a proper way to update the parameters, but this way is biased (in the technical sense): the gradient in Eq. (\ref{eq:grad}) is a biased estimate of the full-batch gradient.
Consider an extreme case where a batch includes only one sample. Then the second term in the gradient reduces to \(\nabla_{\theta} T_{\theta}\), because the factor \(e^{T_{\theta}}\) cancels between the numerator and the denominator. However, what we really want to calculate is the ratio of the expectations of the numerator and the denominator:
\[\frac{\mathbb{E}_{B}\left[\nabla_{\theta} T_{\theta} e^{T_{\theta}}\right]}{\mathbb{E}_{B}\left[e^{T_{\theta}}\right]}\neq \frac{\nabla_{\theta} T_{\theta} e^{T_{\theta}}}{e^{T_{\theta}}}\neq \nabla_{\theta} T_{\theta}.\]We cannot directly use batch-based optimization algorithms (e.g., SGD) to optimize this objective.
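The bias is easy to see numerically: by Jensen's inequality, the log of a small-batch mean underestimates the log of the true expectation. A toy simulation, with made-up lognormal values standing in for \(e^{T_{\theta}}\):

```python
import math
import random

random.seed(0)

# Made-up stand-in for e^T: lognormal values with log E[e^T] = 1/2.
def sample_eT():
    return math.exp(random.gauss(0, 1))

true_log_mean = 0.5
B, n_batches = 4, 200_000  # tiny batches amplify the bias

# Average over many batches of log(batch mean) -- the plug-in estimate.
avg_batch_log = sum(
    math.log(sum(sample_eT() for _ in range(B)) / B)
    for _ in range(n_batches)) / n_batches

# Jensen: E[log(batch mean)] < log(E[e^T]); the plug-in term is biased low.
print(avg_batch_log, "<", true_log_mean)
```

Averaging the per-batch logs does not recover \(\log \mathbb{E}[e^{T}]\), which is exactly why naive minibatch SGD on this objective is biased.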
A common question is why \(\widehat{G}_{B}\) is not 0 in that case. That is because the first term in Eq. (\ref{eq:grad}) uses samples from the joint distribution while the second term uses samples from the marginal distributions.
Assume \(T\) does not change drastically between updates. Then we can use a moving average across batches to estimate \(\mathbb{E}_{B}\left[e^{T_{\theta}}\right]\). The original paper uses an exponential moving average with a small learning rate.
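A minimal sketch of the idea (the EMA rate and the batch source are made-up values, not the paper's exact implementation):

```python
import math
import random

random.seed(0)

# Replace the noisy per-batch denominator E_B[e^T] with an exponential
# moving average (EMA) across batches.  alpha is a made-up small rate.
ema, alpha = None, 0.01

def corrected_denominator(batch_eT):
    global ema
    batch_mean = sum(batch_eT) / len(batch_eT)
    ema = batch_mean if ema is None else (1 - alpha) * ema + alpha * batch_mean
    return ema  # use this instead of batch_mean in the gradient

# Feed noisy made-up batches (lognormal stand-ins for e^T); the EMA
# settles near the true mean e^{1/2} despite the small, noisy batches.
for _ in range(5000):
    denom = corrected_denominator(
        [math.exp(random.gauss(0, 1)) for _ in range(16)])
print(denom, math.exp(0.5))
```

The EMA smooths the denominator over many batches, so the ratio in the gradient no longer fluctuates with each tiny batch.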
Because MINE is an estimator, we need to see how good it is.
First, the derivation of MINE is grounded in the DV representation of KL, which is expressed through expectations, and we proved that the bound is tight.
Besides, we want the estimator to converge to the true value, which is a stronger requirement.
The strong consistency says: for all \(\epsilon > 0\), there is a positive integer \(N\) and a choice of the statistics network (\(T_\theta\)) such that:
\[\forall n \geq N, \quad\left|I(X, Z)-I \widehat{(X ; Z)}_{n}\right| \leq \epsilon, \text { a.e. }.\]For the term \(I \widehat{(X ; Z)}_{n}\), there are two things to verify: (1) does \(T_\theta\) converge to the real transformation function? (2) does the sampling estimate converge to the theoretical true value?
Lemma 1 (approximation). Let \(\epsilon > 0\). There exists a neural network parameterizing functions \(T_\theta\) with parameters \(\theta\) in some compact domain \(\Theta \subset \mathbb{R}^k\), such that
\[\left|I(X, Z)-I_{\Theta}(X, Z)\right| \leq \epsilon, \text { a.e. }.\]Proof. The proof is not difficult but a little bit long. Please refer to the appendix in the original paper.
Lemma 2 (estimation). Let \(\epsilon > 0\). Given a family of neural network functions \(T_\theta\) with parameters \(\theta\) in some bounded domain \(\Theta \subset \mathbb{R}^k\), there exists an \(N\in\mathbb{N}\), such that
\[\forall n \geq N, \quad\left|I \widehat{(X ; Z)}_{n}-I_{\Theta}(X, Z)\right| \leq \epsilon, \text { a.e.}.\]Theorem 1. MINE is strongly consistent.
Proof. With Lemma 1 and Lemma 2,
\[\begin{aligned} \left|I(X, Z)-I \widehat{(X ; Z)}_{n}\right| &=\left|I(X, Z)-I_{\Theta}(X, Z)+I_{\Theta}(X, Z)-I \widehat{(X ; Z)}_{n}\right|\\ &\leq \left|I(X, Z)-I_{\Theta}(X, Z)\right|+\left|I \widehat{(X ; Z)}_{n}-I_{\Theta}(X, Z)\right|\\ &\leq 2\epsilon. \end{aligned}\]Next comes the sample complexity for Lemma 2: we need to see how many samples are required to estimate the optimal \(T^*\).
Theorem 2. Given any values \(\epsilon,\delta\) of the desired accuracy and confidence parameters, we have,
\[\operatorname{Pr}\left(\left|I \widehat{(X ; Z)}_{n}-I_{\Theta}(X, Z)\right| \leq \epsilon\right) \geq 1-\delta,\]whenever the number n of samples satisfies
\[n \geq \frac{2 M^{2}(d \log (16 K L \sqrt{d} / \epsilon)+2 d M+\log (2 / \delta))}{\epsilon^{2}}.\]