Explanation of NNGP: Neural Network Gaussian Process
Explanation of the paper Deep Neural Networks as Gaussian Processes, ICLR 2018.
1. Introduction
This paper shows how a deep neural network with infinite width can work like a Gaussian process. Note that the neural network in this paper is NOT trained, which means that all the distributions are the initialization distributions: NO SGD and NO backprop. Two earlier papers discussed how a single-hidden-layer fully-connected network acts like a Gaussian process: Bayesian Learning for Neural Networks (Neal, 1996) and Computing with Infinite Networks (Williams, 1996). We will start from the simpler setting of these two old but pioneering papers.
2. Gaussian Process
A Gaussian process is usually defined over a continuous time index, which is not quite the setting we care about here, since a neural network has no time series. Formally, a process \(\left\{X_{t} ; t \in T\right\}\) is Gaussian if and only if every finite subset of it is a multivariate Gaussian random variable. The process is then described by a mean \(\mu\) and a covariance, where the covariance is generally defined by a kernel function \(K(x,x')\).
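To make the "every finite subset is multivariate Gaussian" definition concrete, here is a minimal sketch (my own illustration, not from the paper) that treats a finite grid of inputs as the index set, builds a covariance matrix from a squared-exponential kernel, and draws a few sample functions:

```python
import numpy as np

# A zero-mean GP restricted to a finite set of inputs is just a multivariate
# Gaussian whose covariance matrix is built from a kernel function K(x, x').
def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential kernel, one common choice of K(x, x')."""
    return np.exp(-0.5 * (x1 - x2) ** 2 / length_scale ** 2)

xs = np.linspace(-3.0, 3.0, 50)                          # a finite set of inputs
cov = np.array([[rbf_kernel(p, q) for q in xs] for p in xs])
cov += 1e-8 * np.eye(len(xs))                            # jitter for numerical stability

# Each draw is one "function" of the process evaluated on xs.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), cov, size=3)
print(samples.shape)                                     # (3, 50)
```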
3. Single Hidden Layer Neural Network
\[\begin{aligned} h_{j}(x)&=\tanh\left(a_j+\sum_{i=1}^{I}u_{ij}x_i\right)\\ f_{k}(x)&=b_{k}+ \sum_{j=1}^{H}v_{jk}h_j(x). \end{aligned}\]The input of the network is an \(I\)-dimensional vector with components \(x_i\). The hidden dimension is \(H\). \(u\) and \(v\) are weights, and \(a\) and \(b\) are bias terms. The output is \(f_{k}(x)\), where \(k\) indexes the dimensions of \(f(x)\).
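For reference, here is a minimal NumPy sketch of this single-hidden-layer network (my own code in the post's notation \(u, a, v, b\); the sizes are illustrative assumptions):

```python
import numpy as np

def forward(x, u, a, v, b):
    """Single-hidden-layer network in the post's notation.
    x: (I,), u: (I, H), a: (H,), v: (H, K), b: (K,) -> f(x): (K,)"""
    h = np.tanh(a + x @ u)      # h_j(x) = tanh(a_j + sum_i u_ij x_i)
    return b + h @ v            # f_k(x) = b_k + sum_j v_jk h_j(x)

I, H, K = 4, 100, 2             # illustrative sizes
rng = np.random.default_rng(0)
x = rng.normal(size=I)
f = forward(x,
            u=rng.normal(size=(I, H)), a=rng.normal(size=H),
            v=rng.normal(size=(H, K)), b=rng.normal(size=K))
print(f.shape)                  # (2,)
```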
Gaussian priors are placed on these parameters. Specifically, the hidden-to-output parameters \(v\) and \(b\) are independent zero-mean Gaussians with standard deviations \(\sigma_v\) and \(\sigma_b\), while the input-to-hidden parameters \(u\) and \(a\) also have Gaussian priors, with no restriction on their means and standard deviations.
3.1. Gaussian Prior
Then we can see how the prior of the output \(f_k(x)\) behaves. As an example, consider the \(k\)-th unit of the output for an input vector \(x^{(1)}\), i.e. \(f_k(x^{(1)})\), under these priors. Its mean, or expected value, is:
\[\mathbb{E}\left[f_k(x^{(1)})\right]=\mathbb{E}\left[b_{k}+ \sum_{j=1}^{H}v_{jk}h_j(x^{(1)})\right]=0.\]This is because \(\mathbb{E}\left[b_{k}\right]=0\) (zero mean) and for each hidden unit, \(\mathbb{E}\left[v_{jk}h_j(x^{(1)})\right]=\mathbb{E}\left[v_{jk}\right]\mathbb{E}\left[h_j(x^{(1)})\right]=0\) (independence and zero mean).
For the variance, again, we look at the variance of each individual term: the bias term contributes \(\sigma_b^2\), while each summation term contributes:
\(\mathbb{E}\left[\left(v_{jk}h_j(x^{(1)})\right)^2\right]=\mathbb{E}\left[v_{jk}^2\right]\mathbb{E}\left[h_j^2(x^{(1)})\right]=\sigma_v^2\mathbb{E}\left[h_j^2(x^{(1)})\right]=\sigma_v^2V\left(x^{(1)}\right),\) where \(V\left(x^{(1)}\right)\) is used to denote the expectation of the square term.
By the Central Limit Theorem, for a large number of hidden units \(H\), the summation term behaves like a Gaussian with variance \(H\sigma_v^2V\left(x^{(1)}\right)\). The variance of \(f_k(x^{(1)})\) is then \(\sigma_b^2+H\sigma_v^2V\left(x^{(1)}\right)\).
If the standard deviation of the weights is scaled as \(\sigma_v=\omega_vH^{-1/2}\), then the prior of the output converges to a Gaussian with zero mean and variance \(\sigma_b^2+\omega_v^2V\left(x^{(1)}\right)\) as \(H\) goes to infinity.
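A quick Monte Carlo check of this claim (my own sketch; \(\omega_v\), \(\sigma_b\) and the prior scales for \(u\) and \(a\) are illustrative choices, not values from the paper) samples many random networks with \(\sigma_v=\omega_vH^{-1/2}\) and compares the empirical variance of \(f_k(x^{(1)})\) with \(\sigma_b^2+\omega_v^2V(x^{(1)})\):

```python
import numpy as np

# Monte Carlo sketch: with sigma_v = omega_v / sqrt(H), Var[f_k(x)] should be
# close to sigma_b^2 + omega_v^2 * V(x), where V(x) = E[h_j(x)^2].
rng = np.random.default_rng(0)
I, H, n_nets = 3, 500, 5000
omega_v, sigma_b = 1.0, 0.5
x = rng.normal(size=I)                                   # a fixed input x^(1)

u = rng.normal(size=(n_nets, I, H))                      # input-to-hidden weights
a = rng.normal(size=(n_nets, H))                         # input-to-hidden biases
v = rng.normal(scale=omega_v / np.sqrt(H), size=(n_nets, H))
b = rng.normal(scale=sigma_b, size=n_nets)

h = np.tanh(a + np.einsum('i,nih->nh', x, u))            # hidden units, one row per sampled net
f = b + np.einsum('nh,nh->n', v, h)                      # scalar output f_k(x) per net

V = np.mean(h ** 2)                                      # Monte Carlo estimate of V(x)
print(np.var(f), sigma_b ** 2 + omega_v ** 2 * V)        # the two numbers should be close
```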
3.2. Gaussian Process
Until now, the output prior is still a distribution over a single variable. Where does the Gaussian process come from? We do not have anything resembling “time”. In the prior of \(f_k(x)\), the input \(x\) can be seen as the index set: varying \(x\) gives a stochastic process, and we can investigate the joint distribution of \(f_k(x^{(1)}),f_k(x^{(2)}),\ldots,f_k(x^{(n)})\). This is a stochastic process, but we have not yet shown it is a Gaussian process. So, to be clear, the Gaussian process is defined to capture the relationship between the network outputs at different inputs.
For this stochastic process, the mean is \(0\) because every random variable in it has zero mean. As for the covariance, its definition is:
\[\text{cov}(X,Y)=\mathbb{E}[XY]-\mu_X\mu_Y.\]In our case, the mean of every output random variable is zero. Thus, the covariance of two outputs is:
\(\begin{aligned} \text{cov}\left(f_k(x^{(p)}),f_k(x^{(q)})\right)&=\mathbb{E}\left[f_k(x^{(p)})\, f_k(x^{(q)})\right]\\ &=\mathbb{E}\left[\left(b_{k}+\sum_{m=1}^{H}v_{mk}h_m(x^{(p)})\right)\left(b_{k}+\sum_{n=1}^{H}v_{nk}h_n(x^{(q)})\right)\right]\\ &=\sigma_b^2+\mathbb{E}\left[\sum_{m=1}^{H}\sum_{n=1}^{H}v_{mk}v_{nk}\,h_m(x^{(p)})\,h_n(x^{(q)})\right],\ \text{the cross terms with }b_k\text{ vanish by independence and zero mean}\\ &=\sigma_b^2+\sum_{j=1}^{H}\mathbb{E}\left[v_{jk}^2\right]\mathbb{E}\left[h_j(x^{(p)})\,h_j(x^{(q)})\right],\ \text{only }m=n\text{ survives since }\mathbb{E}[v_{mk}v_{nk}]=0\text{ for }m\neq n\\ &=\sigma_b^2+H\sigma_v^2\,\mathbb{E}\left[h_j(x^{(p)})\,h_j(x^{(q)})\right]\\ &=\sigma_b^2+\omega_v^2\,\mathbb{E}\left[h_j(x^{(p)})\,h_j(x^{(q)})\right]\\ &=\sigma_b^2+\omega_v^2\,C\left(x^{(p)},x^{(q)}\right), \end{aligned}\) where \(C\left(x^{(p)},x^{(q)}\right)=\mathbb{E}\left[h_j(x^{(p)})h_j(x^{(q)})\right]\), which is the same for every \(j\); it is the second (cross) moment of the hidden units. So every finite collection of outputs is multivariate Gaussian, which makes the output a Gaussian process.
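The same kind of Monte Carlo check can be applied to the covariance (again my own sketch with illustrative hyperparameters): sample many random networks, evaluate the output at two inputs, and compare the empirical covariance with \(\sigma_b^2+\omega_v^2C(x^{(p)},x^{(q)})\):

```python
import numpy as np

# Monte Carlo sketch: cov(f_k(x_p), f_k(x_q)) should be close to
# sigma_b^2 + omega_v^2 * C(x_p, x_q) with C(x_p, x_q) = E[h_j(x_p) h_j(x_q)].
rng = np.random.default_rng(1)
I, H, n_nets = 3, 500, 5000
omega_v, sigma_b = 1.0, 0.5
x_p, x_q = rng.normal(size=I), rng.normal(size=I)        # two fixed inputs

u = rng.normal(size=(n_nets, I, H))
a = rng.normal(size=(n_nets, H))
v = rng.normal(scale=omega_v / np.sqrt(H), size=(n_nets, H))
b = rng.normal(scale=sigma_b, size=n_nets)

h_p = np.tanh(a + np.einsum('i,nih->nh', x_p, u))
h_q = np.tanh(a + np.einsum('i,nih->nh', x_q, u))
f_p = b + np.einsum('nh,nh->n', v, h_p)
f_q = b + np.einsum('nh,nh->n', v, h_q)

C = np.mean(h_p * h_q)                                   # estimate of C(x_p, x_q)
print(np.cov(f_p, f_q)[0, 1], sigma_b ** 2 + omega_v ** 2 * C)  # should be close
```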
According to a great note from Mihai Nica, there is a fact:
If \(X\) is a Gaussian random variable with \(X\sim\mathcal{N}(\mu_X, \Sigma_X)\) and \(Y=a+BX\), then \(Y\) is Gaussian too, with \(Y \sim \mathcal{N}\left(a+B \mu_{X}, B \Sigma_{X} B^{T}\right)\).
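A quick numerical sanity check of this fact (my own sketch; \(\mu_X\), \(\Sigma_X\), \(a\), and \(B\) are arbitrary choices):

```python
import numpy as np

# Empirical check of the affine-transform fact with arbitrary mu_X, Sigma_X, a, B.
rng = np.random.default_rng(0)
mu_X = np.array([1.0, -2.0])
Sigma_X = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
a = np.array([0.3, -0.7, 1.1])
B = rng.normal(size=(3, 2))

X = rng.multivariate_normal(mu_X, Sigma_X, size=100_000)   # samples of X
Y = a + X @ B.T                                            # Y = a + B X, row-wise

print(Y.mean(axis=0), a + B @ mu_X)                        # empirical mean vs. a + B mu_X
print(np.cov(Y.T) - B @ Sigma_X @ B.T)                     # should be close to zero
```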
4. Deep Neural Network
Till now, everything is shallow, with only one hidden layer. What if we’re goin’ deep?
Let’s first describe the forward pass of a deep neural network: \(z_i^l(x)=b_i^l+\sum_{j=1}^{N_l}W_{ij}^l\,x_j^l(x),\quad x_j^l(x)=\phi\left(z_{j}^{l-1}(x)\right).\)
It looks just like the definition of the one-hidden-layer network. And since the calculation of the mean and kernel at each layer depends on the previous layer only through its kernel, it is natural to extend the result to deeper networks recursively.
By definition, \(z_i^l(x)\) is a sum of i.i.d. random terms (bias and weighted activations), so as the width \(N_l\) goes to infinity, any finite collection \(\{z_i^l(x^{\alpha=1}),z_i^l(x^{\alpha=2}),\ldots,z_i^l(x^{\alpha=k})\}\) has a joint Gaussian distribution and \(z_{i}^{l} \sim \mathcal{GP}\left(0, K^{l}\right)\). The covariance (kernel) is: \(K^{l}\left(x, x^{\prime}\right) \equiv \mathbb{E}\left[z_{i}^{l}(x) z_{i}^{l}\left(x^{\prime}\right)\right]=\sigma_{b}^{2}+\sigma_{w}^{2} \mathbb{E}_{z_{i}^{l-1} \sim \mathcal{GP}\left(0, K^{l-1}\right)}\left[\phi\left(z_{i}^{l-1}(x)\right) \phi\left(z_{i}^{l-1}\left(x^{\prime}\right)\right)\right],\) which is the second moment of the activations of the previous layer. The base case (the first hidden layer) is the same as the one derived in the previous section: \(K^{0}\left(x, x^{\prime}\right)=\mathbb{E}\left[z_{j}^{0}(x) z_{j}^{0}\left(x^{\prime}\right)\right]=\sigma_{b}^{2}+\sigma_{w}^{2}\left(\frac{x \cdot x^{\prime}}{d_{\text {in}}}\right).\)
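The paper evaluates this recursion with a numerical integration scheme; the sketch below (my own, with illustrative \(\sigma_b\), \(\sigma_w\), depth, and \(\phi=\tanh\)) instead estimates each expectation by Monte Carlo, which is enough to see how \(K^l\) is built up layer by layer from \(K^0\):

```python
import numpy as np

def kernel_recursion(x, x_prime, depth, sigma_b=0.5, sigma_w=1.4,
                     phi=np.tanh, n_samples=200_000, seed=0):
    """Estimate K^depth(x, x') by iterating the kernel recursion, with the
    expectation over (z^{l-1}(x), z^{l-1}(x')) estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    d_in = len(x)
    # Base case: K^0(x, x') = sigma_b^2 + sigma_w^2 * (x . x') / d_in
    K_xx = sigma_b ** 2 + sigma_w ** 2 * np.dot(x, x) / d_in
    K_pp = sigma_b ** 2 + sigma_w ** 2 * np.dot(x_prime, x_prime) / d_in
    K_xp = sigma_b ** 2 + sigma_w ** 2 * np.dot(x, x_prime) / d_in
    for _ in range(depth):
        # (z(x), z(x')) is bivariate Gaussian with the previous layer's kernel
        cov = np.array([[K_xx, K_xp], [K_xp, K_pp]])
        z = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
        K_xx = sigma_b ** 2 + sigma_w ** 2 * np.mean(phi(z[:, 0]) ** 2)
        K_pp = sigma_b ** 2 + sigma_w ** 2 * np.mean(phi(z[:, 1]) ** 2)
        K_xp = sigma_b ** 2 + sigma_w ** 2 * np.mean(phi(z[:, 0]) * phi(z[:, 1]))
    return K_xp

x, x_prime = np.array([1.0, 0.0, -1.0]), np.array([0.5, 0.5, 0.5])
print(kernel_recursion(x, x_prime, depth=3))
```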