A review of the most important papers in the theory of signal propagation
This post is dedicated to a review of the foundational results in the theory of deep neural networks, viewed through the lens of infinitely wide networks—a regime often called the “mean field.” The mean-field approach, which has its roots in statistical physics, shifts our focus from the chaotic behavior of individual neurons to the stable, predictable macro-statistics of the network as a whole. This perspective allows us to tame the immense complexity of deep models and uncover the principles governing how they learn.
Our theoretical journey begins with the earliest challenges of training deep networks, where mean-field ideas first appeared in the context of LSTMs and later in principled initialization schemes. These initial works focused on a crucial problem: ensuring that information (or “signal”) could propagate reliably forward and backward through a deep network at initialization. From a slightly different perspective, the Neural Network Gaussian Process (NNGP) correspondence showed that an infinitely wide network is mathematically equivalent to a Gaussian Process, and can therefore be treated as a simple, non-parametric Bayesian model.
Up to this point, virtually all theoretical studies—with a few notable exceptions concerning deep linear networks—were limited to initialization. The field experienced a dramatic leap forward with the introduction of the Neural Tangent Kernel (NTK). The NTK extended the mean-field analysis from a static snapshot at initialization to the entire training dynamics, showing that infinitely wide networks learn in a surprisingly simple way, governed by a fixed kernel.
However, the story did not end there. This “lazy” learning regime described by the NTK was later understood to be just one of several possibilities, and perhaps not the one that captures the powerful “feature learning” that makes deep learning so effective in practice. Later discoveries revealed other mean-field regimes that offer a richer, more complete picture.
I have covered some of these foundational papers already and will continue to update this post as I explore the literature further. Before we delve into the technical details of this fascinating story, let’s take a step back and ask a fundamental question.
Let’s start with the most basic question: Why are we interested in a mean-field view of neural networks?
The historical motivation was borrowed directly from statistical physics. When a system contains millions of complex, interacting parts—like molecules in a gas or neurons in a network—tracking each one individually is hopeless. A mean-field approach offers a way out by studying the system’s macro-statistical properties, revealing surprisingly simple behavior that emerges from the underlying randomness. The initial goal, therefore, was simply to make sense of the chaos.
However, this pursuit led to a profound discovery that gave rise to a second, more ambitious goal. Researchers found that as networks become infinitely wide, the law of large numbers washes away the randomness not just at initialization, but throughout training. The learning dynamics themselves become deterministic.
This discovery shifted the objective from merely describing these networks to establishing a fully deterministic, idealized model of neural computation. This brings us to the modern motivation behind the study of the NTK and feature learning regimes: to find the “Turing machine” of deep learning.
Why is this so important? Before Alan Turing, “computation” was a collection of clever but disconnected tricks. The idealized Turing machine—with its infinite tape and perfect operations—provided a stable theoretical backbone. It defined the absolute limits of what is computable and, just as importantly, gave engineers a fixed reference point. Every real-world computer, with its finite memory and potential for errors, could then be understood as a principled approximation of this perfect ideal. This framework is what allows for rigorous analysis of trade-offs between cost, speed, and accuracy.
Similarly, an idealized infinite-width network gives us a benchmark for deep learning. It allows us to analyze any practical, finite-sized network as a principled approximation of this powerful limit, turning the art of network design into a more rigorous science of trade-offs.
One of the challenges I had while reading these papers was the constant change of symbols and notation, so I thought that introducing and using a consistent notation would go a long way toward making them more readable.
Definition (Network Architecture). A fully-connected network has \(L\) hidden layers; layers are indexed from \(l=0\) (the input layer) to \(l=L+1\) (the output layer), so the hidden layers are \(l=1,\dots,L\).
Definition (Parameters and Initialization). The network parameters consist of weight matrices \(W^l \in \mathbb{R}^{n_{l+1} \times n_l}\) and bias vectors \(b^l \in \mathbb{R}^{n_{l+1}}\). We assume a standard random initialization where parameters are drawn i.i.d from a Gaussian distribution: \(\begin{align} W_{ij}^l &\sim \mathcal{N}\left(0, \frac{\sigma_w^2}{n_l}\right) \quad \text{and} \quad b_i^l \sim \mathcal{N}(0, \sigma_b^2) \end{align}\) where \(\sigma_w^2\) and \(\sigma_b^2\) are hyperparameters controlling the variance of weights and biases respectively.
The signal flow is defined by the recurrence: \(\begin{align} h^0(x) &= x \cr z^{l+1}(x) &= W^l h^l(x) + b^l \cr h^{l+1}(x) &= \phi(z^{l+1}(x)) \end{align}\)
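To fix the notation, here is a minimal NumPy sketch of this forward recurrence at random initialization. The widths, the \(\tanh\) activation, the linear readout at the output, and the hyperparameter values are illustrative choices rather than anything prescribed by the papers.

```python
import numpy as np

def init_params(widths, sigma_w2=1.5, sigma_b2=0.05, seed=0):
    """Draw W^l_ij ~ N(0, sigma_w^2 / n_l) and b^l_i ~ N(0, sigma_b^2), i.i.d."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / n_in), size=(n_out, n_in))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=n_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """h^0 = x;  z^{l+1} = W^l h^l + b^l;  h^{l+1} = phi(z^{l+1})."""
    h = x
    for W, b in params[:-1]:
        h = phi(W @ h + b)
    W, b = params[-1]                 # linear readout at the output layer
    return W @ h + b

widths = [10, 512, 512, 512, 1]       # input, three hidden layers, scalar output
x = np.random.default_rng(1).normal(size=widths[0])
print(forward(init_params(widths), x))
```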
A core principle underlying these theories is the behavior of sums of random variables in the limit of many terms.
Theorem (Law of Large Numbers). For i.i.d. random variables \(X_j\) with mean \(\mathbb{E}[X]\), their sample average converges in probability to the true mean: \(\frac{1}{N}\sum_{j=1}^N X_j \xrightarrow{P} \mathbb{E}[X] \quad \text{as } N \to \infty\)
In our context, the pre-activation of a single neuron is \(z_i^{l+1} = \sum_{j=1}^{n_l} W_{ij}^l h_j^l + b_i^l\). This is a sum over the \(n_l\) neurons in the previous layer. Since the weights \(W_{ij}^l\) are drawn i.i.d., and for a fixed input \(x\) the activations \(h_j^l(x)\) are also random variables, many quantities involving sums over the layer width converge to expectations in the limit \(n_l \to \infty\).
For example, consider the squared norm of the pre-activations, averaged over the layer: \(\begin{align} \frac{1}{n_{l+1}}\sum_{i=1}^{n_{l+1}} (z_i^{l+1})^2 &= \frac{1}{n_{l+1}}\sum_{i=1}^{n_{l+1}} \left(\sum_{j=1}^{n_l} W_{ij}^l h_j^l + b_i^l\right)^2 \cr \end{align}\) In the limit \(n_{l+1} \to \infty\), this converges to its expectation over the random choice of parameters at layer \(l\): \(\begin{align} \lim_{n_{l+1} \to \infty}& \frac{1}{n_{l+1}}\sum_{i=1}^{n_{l+1}} (z_i^{l+1})^2 \cr &= \mathbb{E}_{W^l, b^l}[(z_i^{l+1})^2] \cr &= \mathbb{E}\left[\left(\sum_j W_{ij}^l h_j^l + b_i^l\right)^2\right] \cr &= \sum_j \mathbb{E}[(W_{ij}^l)^2]\mathbb{E}[(h_j^l)^2] + \mathbb{E}[(b_i^l)^2] \quad \text{(iid and zero mean)} \cr &= \sum_{j=1}^{n_l} \frac{\sigma_w^2}{n_l} \mathbb{E}[(h_j^l)^2] + \sigma_b^2 \cr &= \sigma_w^2 \left( \frac{1}{n_l}\sum_{j=1}^{n_l}\mathbb{E}[(h_j^l)^2] \right) + \sigma_b^2 \end{align}\) As we also take \(n_l \to \infty\), the term in the parenthesis becomes an expectation over the distribution of neurons in layer \(l\). This formal replacement of sums with expectations is a key step in all mean-field derivations that follow.
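A quick numerical check of this replacement of sums with expectations, using a single random layer; all widths and hyperparameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_l, n_lp1 = 4096, 4096                 # widths of layers l and l+1
sigma_w2, sigma_b2 = 2.0, 0.05

h = np.tanh(rng.normal(size=n_l))       # stand-in for the activations h^l
W = rng.normal(0.0, np.sqrt(sigma_w2 / n_l), size=(n_lp1, n_l))
b = rng.normal(0.0, np.sqrt(sigma_b2), size=n_lp1)
z = W @ h + b                           # pre-activations z^{l+1}

empirical = np.mean(z**2)               # (1/n_{l+1}) sum_i (z_i^{l+1})^2
predicted = sigma_w2 * np.mean(h**2) + sigma_b2   # mean-field expectation
print(empirical, predicted)             # close for large widths
```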
The earliest challenge in training deep networks was the instability of the backpropagation algorithm.
In recurrent neural networks (RNNs), which can be seen as very deep networks unfolded in time, the problem is particularly acute. The backpropagated error signal at a given time step is a product of many Jacobian matrices from future time steps. As analyzed by (Hochreiter, 1991), this product causes the error to either vanish or blow up exponentially.
Remark (Constant Error Carousel). The Long Short-Term Memory (LSTM) network was introduced by (Hochreiter & Schmidhuber, 1997) to solve this exact problem. Its core innovation is the Constant Error Carousel (CEC) – a linear unit with a fixed self-connection of weight 1.0. The internal state updates additively: \(s_c(t) = s_c(t-1) + \text{input}\), ensuring that the derivative of the state with respect to its previous state is one: \(\frac{\partial s_c(t)}{\partial s_c(t-1)} = 1\) This guarantees that error can flow backward through many time steps without its magnitude being scaled up or down, thus preventing the vanishing and exploding gradient problem. Gating mechanisms are then introduced to control the flow of information into and out of this stable memory cell.
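As a toy illustration of the CEC argument (not from the original paper): for a plain linear recurrence the derivative of the final state with respect to the initial state is a product of \(T\) identical factors, while for the additive CEC update it is exactly one regardless of \(T\).

```python
T = 100
# plain linear recurrence s_t = w * s_{t-1} + input_t: ds_T/ds_0 = w**T
print(0.9**T)   # ~2.7e-05 -> vanishing error signal
print(1.1**T)   # ~1.4e+04 -> exploding error signal
# CEC recurrence s_t = s_{t-1} + input_t: ds_T/ds_0 = 1 exactly, for any T
print(1.0**T)
```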
For feedforward networks, a similar but distinct problem exists. (Glorot & Bengio, 2010) analyzed how the variance of activations and gradients changes as they propagate through the layers. The key idea is to maintain stable training by ensuring that the variance of outputs of each layer equals the variance of its inputs.
Consider a network in the linear regime of its activation functions (\(\phi'(z) \approx 1\)).
Forward propagation of variance: The variance of the activations at layer \(l\) is given by: \(\text{Var}[h^l] = n_{l-1} \text{Var}[W^{l-1}] \text{Var}[h^{l-1}]\) To maintain constant variance (\(\text{Var}[h^l] = \text{Var}[h^{l-1}]\)), we need: \(n_{l-1} \text{Var}[W^{l-1}] = 1\)
Backward propagation of variance: The variance of the backpropagated gradients \(\delta^l\) is: \(\text{Var}[\delta^l] = n_l \text{Var}[W^l] \text{Var}[\delta^{l+1}]\) To maintain constant gradient variance, we need: \(n_l \text{Var}[W^l] = 1\)
These conditions conflict unless all layers have equal width. As a compromise:
Definition (Glorot (Xavier) Initialization). The variance of weights at layer \(l\) is set to: \(\text{Var}[W^l] = \frac{2}{n_{l-1} + n_l}\) For a uniform distribution \(U[-a, a]\), this corresponds to \(a = \sqrt{\frac{6}{n_{l-1} + n_l}}\).
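A minimal sketch of this initializer in NumPy; the layer sizes are arbitrary.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    """Glorot/Xavier initialization: Var[W] = 2/(n_in + n_out), drawn from U[-a, a]."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_out, n_in))

W = glorot_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))   # empirical variance vs. the target value
```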
(Schoenholz et al., 2017) moved beyond variance-based heuristics and developed a precise mean-field theory of signal propagation in wide, randomly initialized networks.
The key insight is that in a wide network, the pre-activations \(z^l\) for any given input are a sum of many i.i.d. random variables. By the Central Limit Theorem, these pre-activations converge to a Gaussian distribution. We can track the parameters of this Gaussian (mean and covariance) from layer to layer.
Consider two inputs, \(x_a\) and \(x_b\). The central object is the covariance matrix of their pre-activations at layer \(l\): \(K^l = \begin{pmatrix} q_{aa}^l & q_{ab}^l \\ q_{ba}^l & q_{bb}^l \end{pmatrix}\) where \(q_{ab}^l = \mathbb{E}[z_i^l(x_a) z_i^l(x_b)]\). The expectation is over the random initialization of the network parameters.
Theorem (Forward Covariance Recurrence). The evolution of the covariance from layer \(l-1\) to \(l\) is governed by a deterministic map: \(\begin{align} q_{ab}^{l} &= \sigma_w^2 \mathbb{E}_{z^{l-1} \sim \mathcal{N}(0, K^{l-1})}[\phi(z^{l-1}(x_a))\phi(z^{l-1}(x_b))] + \sigma_b^2 \end{align}\)
Proof. We start with the definition of pre-activations: \(z_i^l(x_a) = \sum_{j=1}^{n_{l-1}} W_{ij}^{l-1} h_j^{l-1}(x_a) + b_i^{l-1}\). \(\begin{align} q_{ab}^l &= \mathbb{E}\left[ z_i^l(x_a) z_i^l(x_b) \right] \cr &= \mathbb{E}\left[ \left(\sum_j W_{ij}^{l-1} h_j^{l-1}(x_a) + b_i^{l-1}\right) \left(\sum_k W_{ik}^{l-1} h_k^{l-1}(x_b) + b_i^{l-1}\right) \right] \cr \end{align}\) We expand the product. The expectation is over the i.i.d. parameters \(W^{l-1}\) and \(b^{l-1}\). Since they are zero-mean and independent of each other and of the activations \(h^{l-1}\) from the previous layer, all cross-terms vanish. \(\begin{align} &= \mathbb{E}\left[ \sum_{j,k} W_{ij}^{l-1}W_{ik}^{l-1} h_j^{l-1}(x_a)h_k^{l-1}(x_b) \right] + \mathbb{E}[(b_i^{l-1})^2] \cr \end{align}\) Since weights \(W_{ij}\) and \(W_{ik}\) are independent for \(j \neq k\), \(\mathbb{E}[W_{ij}W_{ik}]=0\). The sum over \(j,k\) collapses to terms where \(j=k\). \(\begin{align} &= \sum_{j=1}^{n_{l-1}} \mathbb{E}[(W_{ij}^{l-1})^2] \mathbb{E}[h_j^{l-1}(x_a)h_j^{l-1}(x_b)] + \sigma_b^2 \cr \end{align}\) Substitute the weight variance \(\mathbb{E}[(W_{ij}^{l-1})^2] = \sigma_w^2/n_{l-1}\) and \(h_j^{l-1} = \phi(z_j^{l-1})\). \(\begin{align} &= \sum_{j=1}^{n_{l-1}} \frac{\sigma_w^2}{n_{l-1}} \mathbb{E}[\phi(z_j^{l-1}(x_a))\phi(z_j^{l-1}(x_b))] + \sigma_b^2 \cr &= \sigma_w^2 \left(\frac{1}{n_{l-1}}\sum_{j=1}^{n_{l-1}} \mathbb{E}[\phi(z_j^{l-1}(x_a))\phi(z_j^{l-1}(x_b))]\right) + \sigma_b^2 \end{align}\) In the infinite-width limit (\(n_{l-1} \to \infty\)), the average over neurons becomes an expectation over the distribution of pre-activations \(z^{l-1}\). By the inductive hypothesis that \(z^{l-1}\) is a Gaussian process with covariance \(K^{l-1}\), we arrive at the stated recurrence relation. Note that the hyperparameters \(\sigma_w^2\) and \(\sigma_b^2\) simply apply an affine transformation (scaling and shifting) to the core integral, which captures the effect of the nonlinearity.
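A minimal sketch of iterating this covariance recurrence for \(\phi = \tanh\), with the Gaussian expectation estimated by Monte Carlo; the hyperparameters, the initial input correlation, and the sample size are illustrative.

```python
import numpy as np

def cov_map(K, sigma_w2=1.5, sigma_b2=0.05, phi=np.tanh, n_samples=200_000, seed=0):
    """One step of the mean-field recurrence K^{l-1} -> K^l for a 2x2 covariance."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=K, size=n_samples)
    E = (phi(z).T @ phi(z)) / n_samples        # E[phi(z_a) phi(z_b)] for all pairs
    return sigma_w2 * E + sigma_b2

# start from two unit-variance inputs with correlation 0.3 (input-layer kernel)
K = np.array([[1.0, 0.3],
              [0.3, 1.0]])
for l in range(10):
    K = cov_map(K)
    c = K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])
    print(f"layer {l+1}: q_aa = {K[0, 0]:.4f}, c = {c:.4f}")
```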
The recurrence relation for the correlation \(c^l = q_{ab}^l / \sqrt{q_{aa}^l q_{bb}^l}\) defines a 1D map \(c^{l-1} \mapsto c^l = F(c^{l-1})\). The fixed points of this map, and their stability, dictate how far information about the similarity of two inputs can propagate through depth. In particular, \(c^\star = 1\) (identical inputs) is always a fixed point.
The stability of the \(c^\star=1\) fixed point is controlled by the derivative of the map at \(c=1\). This derivative is denoted \(\chi_1\).
Lemma (The Criticality Parameter \(\chi_1\)). The stability of the ordered phase is determined by: \(\chi_1 = \frac{\partial c^l}{\partial c^{l-1}}\bigg|_{c^{l-1}=1} = \sigma_w^2 \mathbb{E}_{z \sim \mathcal{N}(0, q^\star)}[(\phi'(z))^2]\) where \(q^\star\) is the fixed-point variance, i.e., \(q^\star = \lim_{l \to \infty} q_{aa}^l\).
Proof. We linearize the map \(c^l = F(c^{l-1})\) around \(c^{l-1}=1\). Let \(c^{l-1} = 1 - \epsilon\). The covariance is \(q_{ab}^{l-1} = (1-\epsilon)\sqrt{q_{aa}^{l-1}q_{bb}^{l-1}} \approx (1-\epsilon)q^\star\). The joint distribution of \((z(x_a), z(x_b))\) is a 2D Gaussian with covariance \(\begin{pmatrix} q^\star & (1-\epsilon)q^\star \cr (1-\epsilon)q^\star & q^\star \end{pmatrix}\). We can write \(z(x_b) = (1-\epsilon)z(x_a) + \sqrt{1-(1-\epsilon)^2} \eta \approx (1-\epsilon)z(x_a) + \sqrt{2\epsilon} \eta\), where \(\eta \sim \mathcal{N}(0, q^\star)\) is independent of \(z(x_a)\). We need to compute \(q_{ab}^l = \sigma_w^2 \mathbb{E}[\phi(z(x_a))\phi(z(x_b))] + \sigma_b^2\). We expand \(\phi(z(x_b))\) around \(z(x_a)\): \(\begin{align} \phi(z(x_b)) &\approx \phi(z(x_a) - \epsilon z(x_a) + \sqrt{2\epsilon}\eta) \cr &\approx \phi(z(x_a)) + (-\epsilon z(x_a) + \sqrt{2\epsilon}\eta)\phi'(z(x_a)) + \frac{1}{2}(2\epsilon \eta^2)\phi''(z(x_a)) + \ldots \end{align}\) Plugging this into the expectation for \(q_{ab}^l\) and keeping terms up to first order in \(\epsilon\): \(\begin{align} \mathbb{E}[\phi(z_a)\phi(z_b)] &\approx \mathbb{E}[\phi(z_a)^2] - \epsilon\mathbb{E}[z_a \phi(z_a)\phi'(z_a)] + \epsilon\mathbb{E}[\eta^2 \phi(z_a)\phi''(z_a)] \cr \end{align}\) Using Stein’s Lemma, \(\mathbb{E}[z_a g(z_a)] = q^\star \mathbb{E}[g'(z_a)]\), on the second term (with \(g = \phi\phi'\)), and \(\mathbb{E}[\eta^2]=q^\star\): \(\begin{align} \mathbb{E}[z_a \phi(z_a)\phi'(z_a)] &= q^\star \mathbb{E}[\phi'(z_a)^2 + \phi(z_a)\phi''(z_a)] \cr \end{align}\) Combining the terms, the \(\phi\phi''\) contributions cancel and \(\mathbb{E}[\phi(z_a)\phi(z_b)] \approx \mathbb{E}[\phi(z_a)^2] - \epsilon\, q^\star \mathbb{E}[\phi'(z_a)^2]\), which, after dividing by \(q^\star\) and recalling \(c^{l-1}=1-\epsilon\), already gives the slope \(\chi_1 = \sigma_w^2\mathbb{E}[\phi'(z_a)^2]\). Equivalently, and more directly, one can differentiate the integral expression for \(q_{ab}^l\) with respect to \(q_{ab}^{l-1}\) and evaluate at \(q_{ab}^{l-1}=q^\star\): \(\begin{align} \frac{\partial q_{ab}^l}{\partial q_{ab}^{l-1}} &= \sigma_w^2 \frac{\partial}{\partial q_{ab}^{l-1}} \mathbb{E}[\phi(z_a)\phi(z_b)] = \sigma_w^2 \mathbb{E}[\phi'(z_a)\phi'(z_b)] \cr \end{align}\) At the fixed point \(q_{ab}^{l-1}=q^\star\), we have \(z_a=z_b\), so: \(\begin{align} \frac{\partial q_{ab}^l}{\partial q_{ab}^{l-1}}\bigg|_{q_{ab}^{l-1}=q^\star} &= \sigma_w^2 \mathbb{E}[\phi'(z_a)^2] = \chi_1 \end{align}\) Since \(c^l \approx q_{ab}^l/q^\star\), we have \(\frac{\partial c^l}{\partial c^{l-1}} \approx \frac{\partial q_{ab}^l}{\partial q_{ab}^{l-1}}\), giving the result.
Definition (Phase Classification). \(\begin{align} \chi_1 < 1 &\Rightarrow \text{\textbf{Ordered Phase}} \text{ (Stable at } c=1\text{)} \cr \chi_1 > 1 &\Rightarrow \text{\textbf{Chaotic Phase}} \text{ (Unstable at } c=1\text{)} \cr \chi_1 = 1 &\Rightarrow \text{\textbf{Edge of Chaos}} \text{ (Criticality)} \end{align}\)
As the network approaches the critical point, the characteristic depth scale \(\xi_c\) over which correlations decay diverges as \(\xi_c \propto 1/|1-\chi_1|\).
Remark (Trainability Condition). Empirically, networks are trainable when their depth \(L\) satisfies \(L \lesssim 6\xi_c\). This provides the first principled, quantitative method for selecting hyperparameters to enable training of arbitrarily deep networks.
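A sketch of how one might locate the edge of chaos numerically for \(\tanh\): first iterate the single-input variance map to its fixed point \(q^\star\), then evaluate \(\chi_1 = \sigma_w^2\,\mathbb{E}[(\phi'(z))^2]\) and the depth scale \(\xi_c \propto 1/|1-\chi_1|\). The grid of \(\sigma_w^2\) values and the choice \(\sigma_b^2 = 0.05\) are illustrative.

```python
import numpy as np

def fixed_point_q(sigma_w2, sigma_b2, iters=50, n=200_000, seed=0):
    """Iterate q -> sigma_w2 * E[tanh(z)^2] + sigma_b2 with z ~ N(0, q)."""
    u = np.random.default_rng(seed).normal(size=n)   # reused standard normal samples
    q = 1.0
    for _ in range(iters):
        q = sigma_w2 * np.mean(np.tanh(np.sqrt(q) * u) ** 2) + sigma_b2
    return q

def chi1(sigma_w2, sigma_b2, n=200_000, seed=0):
    q_star = fixed_point_q(sigma_w2, sigma_b2)
    u = np.random.default_rng(seed).normal(size=n)
    dphi = 1.0 - np.tanh(np.sqrt(q_star) * u) ** 2   # tanh'(z) = 1 - tanh(z)^2
    return sigma_w2 * np.mean(dphi ** 2)

sigma_b2 = 0.05
for sigma_w2 in [1.0, 1.3, 1.76, 2.2]:
    x = chi1(sigma_w2, sigma_b2)
    xi = np.inf if np.isclose(x, 1.0) else 1.0 / abs(1.0 - x)
    print(f"sigma_w^2 = {sigma_w2:.2f}: chi_1 = {x:.3f}, xi_c ~ {xi:.1f}")
```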
A similar analysis applies to the backward pass. With the conventions above, backpropagation relates the error at layer \(l-1\), \(\delta_j^{l-1} = \partial\mathcal{L}/\partial z_j^{l-1}\), to the error at layer \(l\): \(\delta_j^{l-1} = \phi'(z_j^{l-1}) \sum_i W_{ij}^{l-1} \delta_i^l\). We want to find the recurrence for the gradient variance \(\tilde{q}_{aa}^l = \mathbb{E}[(\delta_i^l)^2]\).
Theorem (Backward Recurrence). The gradient variance evolves according to: \(\tilde{q}_{aa}^{l-1} = \chi_1 \cdot \tilde{q}_{aa}^l\)
Proof. Assume for simplicity that the hidden layers have equal width \(n\), and treat the weights used in the backward pass as independent of the forward-pass weights (a standard mean-field approximation). \(\begin{align} \tilde{q}_{aa}^{l-1} &= \mathbb{E}[(\delta_j^{l-1})^2] \cr &= \mathbb{E}\left[\left(\phi'(z_j^{l-1}) \sum_i W_{ij}^{l-1} \delta_i^l\right)^2\right] \cr &= \mathbb{E}[(\phi'(z_j^{l-1}))^2] \cdot \mathbb{E}\left[\left(\sum_i W_{ij}^{l-1} \delta_i^l\right)^2\right] \quad \text{(independence)} \cr &= \mathbb{E}[(\phi')^2] \sum_i \mathbb{E}[(W_{ij}^{l-1})^2] \mathbb{E}[(\delta_i^l)^2] \quad \text{(i.i.d., zero-mean weights)} \cr &= \mathbb{E}[(\phi')^2] \sum_{i=1}^{n} \frac{\sigma_w^2}{n} \tilde{q}_{aa}^l \cr &= \left(\sigma_w^2 \mathbb{E}[(\phi')^2]\right) \tilde{q}_{aa}^l \cr &= \chi_1 \cdot \tilde{q}_{aa}^l \end{align}\) This reveals a beautiful duality: the same quantity \(\chi_1\) that governs the forward propagation of correlations also governs the backward propagation of gradient variance. At criticality (\(\chi_1 = 1\)), signals and gradients are preserved over arbitrary depth.
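As a rough sanity check of this duality (with illustrative width, depth, and hyperparameters), one can push a random error vector backward through a wide random \(\tanh\) network and compare the layer-to-layer ratio of gradient variances with \(\chi_1\) estimated from the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 2048, 8
sigma_w2, sigma_b2 = 1.5, 0.05

# forward pass through a random tanh network, storing weights and pre-activations
Ws, zs = [], []
h = rng.normal(size=n)                        # an arbitrary input with ||x||^2 ~ n
for _ in range(L):
    W = rng.normal(0.0, np.sqrt(sigma_w2 / n), size=(n, n))
    b = rng.normal(0.0, np.sqrt(sigma_b2), size=n)
    z = W @ h + b
    Ws.append(W)
    zs.append(z)
    h = np.tanh(z)

# chi_1 estimated from the deepest layer (pre-activations near the fixed point q*)
chi1_est = sigma_w2 * np.mean((1.0 - np.tanh(zs[-1]) ** 2) ** 2)
print("chi_1 estimate:", chi1_est)

# backward pass: multiply by W^T and the local derivative phi'(z) at each layer
delta = rng.normal(size=n)                    # an arbitrary error vector at the top
for l in range(L - 1, 0, -1):
    new_delta = (1.0 - np.tanh(zs[l - 1]) ** 2) * (Ws[l].T @ delta)
    print(f"layer {l}: gradient variance ratio = {new_delta.var() / delta.var():.3f}")
    delta = new_delta
```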
As first noted by (Neal, 1994) and rigorously extended by (Lee et al., 2018), the mean-field theory has a profound consequence: in the infinite-width limit, the entire network function becomes a draw from a Gaussian Process.
Definition (Gaussian Process (GP)). A Gaussian Process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is completely specified by its mean function \(\mu(x) = \mathbb{E}[f(x)]\) and its covariance (or kernel) function \(K(x, x') = \mathbb{E}[(f(x)-\mu(x))(f(x')-\mu(x'))]\).
Theorem (NNGP Correspondence). In the infinite-width limit, the network function \(f_\theta\) is a draw from a zero-mean Gaussian Process: \(f_\theta \sim \mathcal{GP}(0, K^{L+1})\) where the kernel \(K^{L+1}\) is the same covariance function derived from the mean-field recurrence.
Proof. Sketch: the argument proceeds by induction on the layer depth \(l\). Conditioned on the activations of layer \(l-1\), each pre-activation \(z_i^l\) is a sum of \(n_{l-1}\) i.i.d. zero-mean terms, so by the Central Limit Theorem the pre-activations become jointly Gaussian as \(n_{l-1} \to \infty\), with covariance given by the forward recurrence. Applying this argument layer by layer up to the output yields a Gaussian Process with kernel \(K^{L+1}\).
The covariance function of this process is precisely the object \(q_{ab}^l\) we tracked in the mean-field theory, which is why it is called a GP kernel. A function \(K(x,x')\) is a valid covariance kernel if the Gram matrix \([K(x_i, x_j)]_{i,j}\) is positive semi-definite for any set of points \(\{x_i\}\). This holds for the NNGP kernel because it is constructed from expectations of outer products, \(\mathbb{E}[f(x)f(x')^T]\), which are inherently positive semi-definite.
This equivalence allows for fully Bayesian inference without ever training a network. Given a training set \(\mathcal{D} = \{(x_i, t_i)\}^n_{i=1}\) and assuming Gaussian observation noise \(\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)\), we can predict the output \(z^\star\) for a new test point \(x^\star\). The joint distribution of training targets and the test prediction is Gaussian. The predictive distribution \(P(z^\star | \mathcal{D}, x^\star)\) is also Gaussian, with a posterior mean and variance given by: \(\begin{align} \overline{\mu} &= K_{x^\star,\mathcal{D}}(K_{\mathcal{D},\mathcal{D}}+\sigma_{\epsilon}^{2}\mathbb{I}_{n})^{-1}t \cr \overline{K} &= K_{x^{*},x^{*}} - K_{x^{*},\mathcal{D}}(K_{\mathcal{D},\mathcal{D}}+\sigma_{\epsilon}^{2}\mathbb{I}_{n})^{-1}K_{x^{*},\mathcal{D}}^{T} \end{align}\) Intuition: The predicted mean \(\overline{\mu}\) is a weighted average of the training targets \(t\). The weights are computed by solving a linear system that asks: “How much does the output at each training point \(x_j\) influence the output at the test point \(x^\star\)?” This influence is measured by the kernel \(K(x^\star, x_j)\), while correcting for redundancies between the training points themselves (the \((K_{\mathcal{D},\mathcal{D}} + \sigma_\epsilon^2 I)^{-1}\) term). The posterior variance \(\overline{K}\) measures our uncertainty; it starts at the prior variance \(K_{x^\star,x^\star}\) and is reduced by an amount corresponding to the information gained from the training data.
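The posterior formulas above are a few lines of linear algebra given any kernel function. Here is a minimal sketch with a placeholder RBF kernel standing in for the NNGP kernel \(K^{L+1}\); the data, noise level, and length scale are illustrative.

```python
import numpy as np

def kernel(X1, X2, length_scale=1.0):
    """Placeholder RBF kernel; substitute the NNGP kernel K^{L+1}(x, x') here."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X_train, t_train, X_test, noise_var=0.1):
    """Posterior mean and covariance of the GP at the test points."""
    K_dd = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_sd = kernel(X_test, X_train)
    K_ss = kernel(X_test, X_test)
    mean = K_sd @ np.linalg.solve(K_dd, t_train)
    cov = K_ss - K_sd @ np.linalg.solve(K_dd, K_sd.T)
    return mean, cov

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(20, 1))
t_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=20)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, cov = gp_predict(X_train, t_train, X_test)
print(mean)
print(np.sqrt(np.diag(cov)))    # predictive uncertainty at the test points
```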
The GP correspondence describes networks at initialization. (Jacot et al., 2018) answered the crucial question: what happens during training?
Consider training a model \(f_\theta(x)\) with gradient descent on a loss \(\mathcal{L}\). The parameters evolve as \(\frac{d\theta}{dt} = -\nabla_\theta \mathcal{L}\) (for continuous-time gradient flow). The function itself then evolves according to the chain rule: \(\frac{df_\theta(x)}{dt} = (\nabla_\theta f_\theta(x))^T \frac{d\theta}{dt} = -(\nabla_\theta f_\theta(x))^T \nabla_\theta \mathcal{L}\) For a standard mean-squared error loss \(\mathcal{L} = \frac{1}{2N}\sum_i (f_\theta(x_i) - y_i)^2\), the loss gradient is \(\nabla_\theta \mathcal{L} = \frac{1}{N}\sum_i \nabla_\theta f_\theta(x_i) (f_\theta(x_i) - y_i)\). Plugging this in: \(\frac{df_\theta(x)}{dt} = - \frac{1}{N} \sum_i \left( (\nabla_\theta f_\theta(x))^T \nabla_\theta f_\theta(x_i) \right) (f_\theta(x_i) - y_i)\) If we define the Neural Tangent Kernel (NTK) as the inner product of the parameter gradients, this equation becomes the equation for kernel gradient flow:
Definition (Neural Tangent Kernel (NTK)). \(\Theta(\theta)(x, x') = (\nabla_\theta f_\theta(x))^T \nabla_\theta f_\theta(x')\)
\(\frac{df_\theta(x)}{dt} = - \frac{1}{N} \sum_i \Theta(\theta)(x, x_i) (f_\theta(x_i) - y_i)\) This shows that training any neural network with gradient descent is equivalent to performing kernel gradient descent in function space. The crucial difference from standard kernel methods is that for a finite-width network, the kernel \(\Theta(\theta)\) changes as the parameters \(\theta\) are updated.
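To make the definition concrete, here is a sketch of the empirical NTK for a one-hidden-layer network \(f(x) = \frac{1}{\sqrt{m}}\, a^\top \phi(Wx)\) without biases, for which the parameter gradients have a simple closed form; the architecture, parameterization, and sizes are illustrative assumptions rather than the general construction.

```python
import numpy as np

def empirical_ntk(X, W, a, phi=np.tanh, dphi=lambda z: 1 - np.tanh(z) ** 2):
    """Empirical NTK of f(x) = (1/sqrt(m)) a . phi(W x), via explicit gradients."""
    m = W.shape[0]
    Z = X @ W.T                                   # pre-activations, shape (N, m)
    # gradient w.r.t. a_i is phi(w_i.x)/sqrt(m); w.r.t. w_i it is a_i phi'(w_i.x) x / sqrt(m)
    G_a = phi(Z) / np.sqrt(m)                     # (N, m)
    ntk = G_a @ G_a.T                             # contribution of the readout weights a
    ntk += ((dphi(Z) * a) @ (dphi(Z) * a).T / m) * (X @ X.T)   # contribution of W
    return ntk

rng = np.random.default_rng(0)
d, m, N = 5, 4096, 4
X = rng.normal(size=(N, d)) / np.sqrt(d)
W = rng.normal(size=(m, d))
a = rng.normal(size=m)
print(empirical_ntk(X, W, a))                     # N x N kernel matrix at initialization
```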
Theorem (NTK Constancy). In the infinite-width limit (\(n_l \to \infty\) for all hidden layers), the NTK at initialization converges to a deterministic limiting kernel \(\Theta_\infty\), and it remains constant throughout training: \(\Theta(\theta(t)) = \Theta_\infty\) for all \(t\).
Intuition: In the infinite-width limit, each individual parameter moves only a vanishingly small distance away from its random initialization. While the collective effect of these tiny changes is enough to change the function output \(f_\theta\) by a finite amount, the change in the gradient of the function with respect to the parameters, \(\nabla_\theta f_\theta\), is negligible. Since the NTK is an inner product of these gradients, it remains fixed at its initial value. This is often called the “lazy training” regime.
This means infinitely wide networks behave like linear models in the (infinitely high-dimensional) feature space defined by the gradients at initialization.
The limiting NTK, \(\Theta_\infty^{(L)}\), can be computed via a recurrence relation.
Theorem (NTK Recurrence). \(\Theta_\infty^{(L+1)}(x, x') = \Theta_\infty^{(L)}(x, x') \dot{K}^{(L+1)}(x, x') + K^{(L+1)}(x, x')\) where \(K^{(L+1)}\) is the NNGP kernel and \(\dot{K}^{(L+1)}(x,x') = \mathbb{E}_{z \sim \mathcal{GP}(0, K^{(L)})}[\phi'(z(x))\phi'(z(x'))]\).
Proof. The proof is by induction on the depth \(L\).
Decomposition: The NTK is a sum over all parameters. We can split the parameters \(\theta\) into those of the first \(L\) layers (\(\tilde{\theta}\)) and those of the last layer (\(\theta_L = \{W^L, b^L\}\)). \(\Theta^{(L+1)}(\theta) = \sum_{p \in \tilde{\theta}} (\nabla_p f)^T (\nabla_p f) + \sum_{p \in \theta_L} (\nabla_p f)^T (\nabla_p f)\)
Last Layer Contribution: Let’s analyze the term for the last layer’s parameters. The gradient of the network output \(f_k(x) = \frac{1}{\sqrt{n_L}}\sum_i W^L_{ki}h^L_i(x) + b^L_k\) with respect to its own parameters is: \(\begin{align} \nabla_{W^L_{ki}} f_k(x) &= \frac{1}{\sqrt{n_L}} h^L_i(x) \cr \nabla_{b^L_k} f_k(x) &= 1 \end{align}\) The contribution to the NTK from the last layer is the sum of dot products over all these parameters. For the \((k, k')\) entry of the kernel matrix: \(\begin{align} [\nabla_{\theta_L} f(x) \cdot \nabla_{\theta_L} f(x')]_{kk'} &= \sum_{i=1}^{n_L} (\nabla_{W^L_{ki}} f_k(x))(\nabla_{W^L_{k'i}} f_{k'}(x')) + (\nabla_{b^L_k} f_k(x))(\nabla_{b^L_{k'}} f_{k'}(x')) \cr &= \delta_{kk'} \left( \sum_{i=1}^{n_L} \frac{1}{n_L} h^L_i(x)h^L_i(x') + 1 \right) \end{align}\) In the limit \(n_L \to \infty\), the sum becomes an expectation by the Law of Large Numbers. Assuming \(\sigma^2_b=1\) for this part of the kernel (or generally, \(\beta^2\) as in the original paper), this term converges to \(\delta_{kk'} K^{(L+1)}(x,x')\). This is the NNGP kernel.
Lower Layers Contribution: Now consider the gradient with respect to a parameter \(\tilde\theta_p\) in the lower layers. By the chain rule: \(\nabla_{\tilde{\theta}_p} f_k(x) = \frac{\partial f_k}{\partial h^L} \frac{\partial h^L}{\partial z^L} \frac{\partial z^L}{\partial \tilde{\theta}_p} = \sum_{i=1}^{n_L} \frac{1}{\sqrt{n_L}} W_{ki}^L \phi'(z_i^L(x)) \nabla_{\tilde{\theta}_p} z_i^L(x)\) The dot product for the \((k,k')\) entry of the NTK is: \(\begin{align} [\nabla_{\tilde{\theta}} &f(x) \cdot \nabla_{\tilde{\theta}} f(x')]_{kk'}\cr &= \sum_p \left(\sum_i \frac{W_{ki}^L}{\sqrt{n_L}}\phi'(z_i^L(x))\nabla_p z_i^L(x)\right) \left(\sum_j \frac{W_{k'j}^L}{\sqrt{n_L}}\phi'(z_j^L(x'))\nabla_p z_j^L(x')\right) \cr &= \frac{1}{n_L} \sum_{i,j} W_{ki}^L W_{k'j}^L \phi'(z_i^L(x))\phi'(z_j^L(x')) \left( \sum_p \nabla_p z_i^L(x) \cdot \nabla_p z_j^L(x') \right) \end{align}\) In the infinite-width limit, the weights \(W^L\) are independent of the lower-layer gradients. Also, \(\nabla_p z^L_i\) and \(\nabla_p z^L_j\) are independent for \(i \neq j\). So the sum collapses to \(i=j\), with \(\mathbb{E}[W^L_{ki} W^L_{k'j}] = \delta_{kk'} \delta_{ij} \sigma^2_w\) under this parameterization (the \(1/\sqrt{n_L}\) scaling is written out explicitly, so the weights themselves have \(O(1)\) variance). The term \(\sum_p \nabla_p z_i^L(x) \cdot \nabla_p z_i^L(x')\) is the NTK of the \(L\)-layer subnetwork, \(\Theta^{(L)}\). Taking the expectation over the random variables at layer \(L\), we get: \(\delta_{kk'} \left(\frac{\sigma_w^2}{n_L} \sum_i \Theta^{(L)}(x,x') \phi'(z_i^L(x))\phi'(z_i^L(x'))\right) \xrightarrow{n_L \to \infty} \delta_{kk'} \sigma_w^2 \Theta_\infty^{(L)}(x,x') \dot{K}^{(L+1)}(x,x')\) (Assuming \(\sigma_w^2=1\) for the recursive part gives the formula in the theorem).
Combining Terms: Adding the contributions from the last layer and the lower layers gives the final recurrence relation.
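A sketch of evaluating the recurrence numerically for a single pair of inputs, with \(\phi=\tanh\), the Gaussian expectations estimated by Monte Carlo, the base case \(\Theta^{(1)} = K^{(1)}\), and the simplified constant-factor convention used in the theorem statement; the hyperparameters and inputs are illustrative.

```python
import numpy as np

def nngp_and_ntk(x, xp, depth, sigma_w2=1.0, sigma_b2=0.1, n=200_000, seed=0):
    """Iterate the NNGP and NTK recurrences for one input pair, with phi = tanh."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n, 2))                      # reused standard normal samples
    # first-layer kernel: K^(1)(x, x') = sigma_w^2 (x . x') / n_0 + sigma_b^2
    qaa = sigma_w2 * (x @ x) / len(x) + sigma_b2
    qbb = sigma_w2 * (xp @ xp) / len(xp) + sigma_b2
    qab = sigma_w2 * (x @ xp) / len(x) + sigma_b2
    theta = qab                                      # base case: Theta^(1) = K^(1)
    for _ in range(depth):
        # sample (z_a, z_b) ~ N(0, [[qaa, qab], [qab, qbb]])
        za = np.sqrt(qaa) * u[:, 0]
        zb = (qab / np.sqrt(qaa)) * u[:, 0] + np.sqrt(max(qbb - qab**2 / qaa, 0.0)) * u[:, 1]
        K_new = sigma_w2 * np.mean(np.tanh(za) * np.tanh(zb)) + sigma_b2   # NNGP recurrence
        Kdot = np.mean((1 - np.tanh(za)**2) * (1 - np.tanh(zb)**2))        # E[phi'(z_a) phi'(z_b)]
        qaa = sigma_w2 * np.mean(np.tanh(za)**2) + sigma_b2
        qbb = sigma_w2 * np.mean(np.tanh(zb)**2) + sigma_b2
        theta = theta * Kdot + K_new                 # NTK recurrence
        qab = K_new
    return qab, theta

x = np.array([1.0, 1.0, 1.0, 1.0]) / 2.0
xp = np.array([1.0, 1.0, -1.0, -1.0]) / 2.0
print(nngp_and_ntk(x, xp, depth=3))                  # (NNGP kernel, NTK) at depth 3
```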
Remark (Training Dynamics). Training dynamics for infinitely wide networks on MSE loss are described by a linear ODE. The network converges fastest along directions of largest NTK eigenvalues (kernel principal components), providing a theoretical motivation for early stopping. The final function is equivalent to performing kernel ridge regression with the NTK.
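Concretely, on the training set the gradient-flow equation above (with a fixed NTK) becomes the linear ODE \(\frac{d}{dt}(f - y) = -\frac{1}{N}\Theta_\infty (f - y)\), whose solution decays independently along each NTK eigenvector at rate \(\lambda_k/N\). A minimal sketch, with a hand-made PSD matrix standing in for \(\Theta_\infty\):

```python
import numpy as np

def ntk_gradient_flow(Theta, f0, y, t):
    """f(t) - y = exp(-Theta t / N) (f(0) - y), evaluated in the NTK eigenbasis."""
    N = len(y)
    lam, U = np.linalg.eigh(Theta)                   # Theta = U diag(lam) U^T
    residual0 = U.T @ (f0 - y)                       # residual in the eigenbasis
    residual_t = np.exp(-lam * t / N) * residual0    # mode k decays at rate lam_k / N
    return y + U @ residual_t

Theta = np.array([[2.0, 0.5, 0.2],                   # an illustrative 3x3 PSD kernel matrix
                  [0.5, 1.5, 0.3],
                  [0.2, 0.3, 1.0]])
y = np.array([1.0, -1.0, 0.5])
f0 = np.zeros(3)
for t in [0.0, 1.0, 10.0, 100.0]:
    print(t, ntk_gradient_flow(Theta, f0, y, t))     # f(t) on the training set
```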
This theoretical journey provides a coherent narrative for understanding deep learning:
The hyperparameters (\(\sigma_w^2, \sigma_b^2\)) that place a network at the “edge of chaos” are precisely those that define a well-behaved, non-degenerate kernel, allowing effective learning and generalization through structured exploration of function space. This progression from heuristic fixes to a complete kernel description represents one of the most significant theoretical advances in modern deep learning, bridging the gap between the practical art of training neural networks and their rigorous mathematical theory.