NTK: A First Principles Derivation

A first principles derivation of the classic NTK result (no magic), with an analysis that already suggests the scaling for feature learning

The Neural Tangent Kernel (NTK) of (Jacot et al., 2018) is arguably one of the first truly deep theoretical insights about neural networks in the mean-field (infinitely wide) regime. While its importance has since been overtaken by the feature-learning (muP) regime, it remains one of the classical papers worth fully understanding for any researcher in the deep learning community with theoretical inclinations. However, virtually everyone I have talked to has told me that the paper is very hard, if not impossible, to read, which is of course a big pity. To be honest, my own first attempt at reading it was also met with plenty of hiccups. So at some point I decided to “reinvent the wheel” by proving essentially all the key results of the paper without consulting it. This note is the result of my efforts. While there may be mistakes here and there (which I would be thankful if you reported to me!), I believe the method shown here replicates all the technical results of the NTK paper. The key aspect of these derivations is that they are completely from scratch: I assume nothing beyond a few basic mathematical facts, namely the law of large numbers (LLN), the central limit theorem (CLT) for fluctuation scales, and a few identities for expectations of functions of Gaussians (Stein’s lemma).

Towards the end of the document there is an exposition of my effort to verify (or correct) an assumption I had made in my first attempt: the independence of the backward pass from the forward pass. In the algebra below, this dependence shows up as “off-diagonal” terms in the backward kernel (interactions between backward signals belonging to different units). These are exactly the terms we ignore if we assume the forward and backward passes are independent, which lets us use the expectation-of-product rule and obtain a very simple solution. My original worry was that the number of such interaction terms could grow at the same rate at which each individual term decays, so that collectively they might contribute at leading order to the backward kernel. The calculations (longer than I had hoped) show the opposite: these corrections are a factor of \(1/\text{width}\) smaller than the leading terms in the infinite-width limit. I keep those notes for anyone interested.


Setup: The Infinite-Width (Mean-Field) Regime

We consider a fully-connected neural network with an input \(x\in\mathbb{R}^{d_0}\), \(L-1\) hidden layers with widths \(d_1,\ldots, d_{L-1}\), and a scalar output layer \(d_L=1\), for a total of \(L\) layers of pre-activations \(z^1,\ldots,z^L\). We write \(h^\ell:=\phi(z^\ell)\) for the post-activations, with the convention \(h^0:=x\), so that \(z^{\ell+1}=W^\ell h^\ell + b^\ell\).

Weights and biases are initialized i.i.d. zero-mean Gaussian with fan-in variance scaling: \(\begin{align} W^{\ell}_{ij} \sim \mathcal{N}\!\left(0, \frac{\sigma_w^2}{d_\ell}\right), \qquad b^{\ell}_i \sim \mathcal{N}(0, \sigma_b^2). \end{align}\)

Loss for analysis. For simplicity, we take the “loss” to be the scalar network output at input \(x'\): \(\mathcal{L} = z^L(x')\). This isolates the structure of gradients. (The conclusion that only the last layer has a nonzero first moment below is specific to this choice.)
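As a concrete reference for the conventions above, here is a minimal NumPy sketch of such a network at initialization. The widths, the choice \(\phi=\tanh\), and the values of \(\sigma_w,\sigma_b\) are illustrative only; pre-activations are indexed \(z^1,\ldots,z^L\) with \(h^0=x\).

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(widths, sigma_w=1.5, sigma_b=0.1):
    """Sample weights/biases with the fan-in-scaled Gaussian initialization above.

    widths = [d_0, d_1, ..., d_L]; W[l] maps layer l to layer l + 1.
    """
    Ws, bs = [], []
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        Ws.append(rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(d_out, d_in)))
        bs.append(rng.normal(0.0, sigma_b, size=d_out))
    return Ws, bs

def forward(Ws, bs, x, phi=np.tanh):
    """Return the pre-activations [z^1, ..., z^L] for input x (with h^0 = x)."""
    zs, h = [], x
    for W, b in zip(Ws, bs):
        z = W @ h + b
        zs.append(z)
        h = phi(z)
    return zs

widths = [10, 512, 512, 512, 1]        # d_0, three hidden widths, scalar output
Ws, bs = init_net(widths)
x = rng.normal(size=widths[0])
print([z.shape for z in forward(Ws, bs, x)])   # [(512,), (512,), (512,), (1,)]
```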


Kernel in the Mean-Field Regime

The (pre-activation) kernel \(\Sigma^\ell\) between \(x\) and \(x'\) is the average inner product of the pre-activations produced by the block \((W^\ell,b^\ell)\), i.e. of \(z^{\ell+1}\): \(\begin{align} \Sigma^\ell(x,x') := \frac{1}{d_{\ell+1}}\,\langle z^{\ell+1}(x), z^{\ell+1}(x') \rangle. \end{align}\) With this indexing the layer-\(t\) pre-activations are governed by \(\Sigma^{t-1}\): in the infinite-width limit each pair \(\big(z^{t}_i(x), z^{t}_i(x')\big)\) is zero-mean Gaussian with covariance matrix \(\begin{pmatrix}\Sigma^{t-1}(x,x) & \Sigma^{t-1}(x,x')\cr \Sigma^{t-1}(x,x') & \Sigma^{t-1}(x',x')\end{pmatrix}\), and writing \((z,z')\sim \mathcal N(0,\Sigma(x,x'))\) below is shorthand for this bivariate Gaussian.

The forward recursion (LLN) is \(\begin{align} \Sigma^{0}(x,x') := \sigma_w^2\,\frac{x^\top x'}{d_0} + \sigma_b^2, \qquad \Sigma^{\ell+1}(x,x') := \sigma_w^2\,\E\big[\phi(z)\phi(z')\big] + \sigma_b^2, \end{align}\) where \((z,z')\sim \mathcal{N}(0,\Sigma^{\ell}(x,x'))\).

We also define the derivative kernel: \(\begin{align} \dot \Sigma^\ell(x,x') := \sigma_w^2\,\E\!\left[\phi'(z)\phi'(z')\right], \quad (z,z')\sim \mathcal{N}\!\left(0,\Sigma^{\ell-1}(x,x')\right). \end{align}\) Thus \(\dot\Sigma^{\ell+1}\) is computed under \(\Sigma^\ell\).
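To make the recursions concrete, here is a hedged Monte Carlo sketch: the Gaussian expectations defining \(\Sigma^{\ell}\) and \(\dot\Sigma^{\ell}\) are estimated by sampling bivariate Gaussians, and \(\Sigma^\ell(x,x')\) is compared with the empirical inner products of the \(z^{\ell+1}\) pre-activations of one wide network, per the indexing above. The activation \(\tanh\), all widths, and the \(\sigma_w^2,\sigma_b^2\) values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2
sigma_w2, sigma_b2 = 1.5, 0.05          # sigma_w^2, sigma_b^2 (illustrative values)
n_mc = 400_000                          # samples for the bivariate Gaussian expectations

def next_Sigma(K):
    """One forward-recursion step, entrywise on the 2x2 kernel matrix of (x, x')."""
    zz = rng.multivariate_normal(np.zeros(2), K, size=n_mc)
    f = phi(zz)
    return sigma_w2 * (f.T @ f) / n_mc + sigma_b2

def dot_Sigma(K):
    """Derivative kernel at (x, x'): sigma_w^2 E[phi'(z) phi'(z')] under the 2x2 kernel K."""
    zz = rng.multivariate_normal(np.zeros(2), K, size=n_mc)
    g = dphi(zz)
    return sigma_w2 * (g[:, 0] * g[:, 1]).mean()

# --- theory: run the recursion for a few layers ---
d0, n_layers = 10, 4
x, xp = rng.normal(size=d0), rng.normal(size=d0)
X = np.stack([x, xp])
Sig = [sigma_w2 * (X @ X.T) / d0 + sigma_b2]     # Sigma^0: kernel of z^1 = W^0 x + b^0
dSig = []
for _ in range(n_layers - 1):
    dSig.append(dot_Sigma(Sig[-1]))              # dot-Sigma^{l+1}, computed under Sigma^l
    Sig.append(next_Sigma(Sig[-1]))
print("dot-Sigma^(1,2,...):", np.round(dSig, 4))

# --- empirical: pre-activation inner products of one wide network ---
width = 8192
h = X.T                                          # h^0 = inputs, columns for x and x'
for l in range(n_layers):
    d_in = h.shape[0]
    W = rng.normal(0.0, np.sqrt(sigma_w2 / d_in), size=(width, d_in))
    b = rng.normal(0.0, np.sqrt(sigma_b2), size=(width, 1))
    z = W @ h + b                                # z^{l+1} for both inputs
    print(f"Sigma^{l}(x,x'): theory {Sig[l][0, 1]:.4f}   empirical {(z[:, 0] @ z[:, 1]) / width:.4f}")
    h = phi(z)
```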

We will study how representations change when we move parameters in the negative gradient direction of \(\mathcal L = z^L(x')\). In particular, we look at the change in \(z^\ell(x)\) due to updating weights in block \(W^{k-1}\), which we denote \(\dot\Theta_{k,\ell}\) (a single-step influence term, not the NTK itself):

\[\begin{align} \dot{\Theta}_{k,\ell} := \frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\left\langle \frac{\partial z^\ell_i(x)}{\partial W^{k-1}}, \frac{\partial z^L(x')}{\partial W^{k-1}} \right\rangle_{W^{k-1}}. \end{align}\]

Since \(\frac{\partial z^\ell_i(x)}{\partial W^{k-1}_{mn}} = \frac{\partial z^\ell_i(x)}{\partial z^{k}_m}\,h^{k-1}_n(x)\) (and similarly for \(z^L(x')\)), the inner product over \(W^{k-1}\) factorizes: \(\begin{align} \dot{\Theta}_{k,\ell} = \langle h^{k-1}(x), h^{k-1}(x') \rangle \cdot \frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\left\langle \frac{\partial z^\ell_i(x)}{\partial z^{k}}, \frac{\partial z^L(x')}{\partial z^{k}} \right\rangle_{z^{k}}. \end{align}\)

The forward inner product is \(O(d_{k-1})\). The key work is in the second factor involving the backward Jacobians. We formalize its first and second moments:

Definition (kernel shift moments). With the shorthand \(\begin{align} J_{j,i}^{(\ell \leftarrow k)}(x) := \frac{\partial z^\ell_i(x)}{\partial z^k_j}, \qquad J_{j}^{(L \leftarrow k)}(x') := \frac{\partial z^L(x')}{\partial z^k_j}, \end{align}\) define \(\begin{align} \begin{aligned} M_{k,\ell} &:= \frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\E\!\left[\sum_{j=1}^{d_k} J_{j,i}^{(\ell \leftarrow k)}(x)\,J_{j}^{(L \leftarrow k)}(x')\right],\cr Q_{k,\ell} &:= \frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\E\!\left[\Big(\sum_{j=1}^{d_k} J_{j,i}^{(\ell \leftarrow k)}(x)\,J_{j}^{(L \leftarrow k)}(x')\Big)^2\right],\cr V_{k,\ell} &:= Q_{k,\ell}-M_{k,\ell}^2. \end{aligned} \end{align}\)

By symmetry in the feature index \(i\), expectations are independent of \(i\); we keep the average for cleaner recursions.


Recursion for the Mean Influence

The Jacobian between successive layers is \(\begin{align} z^{\ell+1}_i = \sum_{j=1}^{d_\ell} W^\ell_{ij}\,\phi(z^\ell_j) + b^\ell_i \;\;\Rightarrow\;\; \frac{\partial z^{\ell+1}_i}{\partial z^\ell_j} = W^\ell_{ij}\,\phi'(z^\ell_j). \end{align}\)
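A quick finite-difference sanity check of this single-layer Jacobian; the sizes and \(\phi=\tanh\) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2

d_in, d_out, eps = 5, 4, 1e-6
W = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_out, d_in))
b = rng.normal(0.0, 0.1, size=d_out)
z = rng.normal(size=d_in)                      # z^l

layer = lambda v: W @ phi(v) + b               # z^{l+1} as a function of z^l

J_analytic = W * dphi(z)                       # broadcasting: entry (i, j) is W_ij * phi'(z_j)
J_numeric = np.stack([(layer(z + eps * e) - layer(z - eps * e)) / (2 * eps)
                      for e in np.eye(d_in)], axis=1)
print(np.max(np.abs(J_analytic - J_numeric)))  # finite-difference error, ~1e-9 or smaller
```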

Substituting in the definition of \(M_{k,\ell}\) and unrolling one layer: \(\begin{align} \begin{aligned} M_{k,\ell} &=\frac{1}{d_\ell}\sum_{i,j}\E\left[ \left(\sum_{s}\frac{\partial z^\ell_i(x)}{\partial z^{k+1}_s}W^k_{sj}\phi'(z^k_j(x))\right) \left(\sum_{r}\frac{\partial z^L(x')}{\partial z^{k+1}_r}W^k_{rj}\phi'(z^k_j(x'))\right) \right]\cr &=\frac{1}{d_\ell}\sum_{i,j,s,r}\E\left[ \frac{\partial z^\ell_i(x)}{\partial z^{k+1}_s}\, \frac{\partial z^L(x')}{\partial z^{k+1}_r}\, \phi'(z^k_j(x))\phi'(z^k_j(x'))\,W^k_{sj}W^k_{rj} \right]. \end{aligned} \end{align}\)

Independence across layers. Parameters from different layers are independent at initialization. The Jacobians \(\frac{\partial z^\ell}{\partial z^{k+1}}\) and \(\frac{\partial z^L}{\partial z^{k+1}}\) are built from the weights \(W^{k+1},\ldots\), and depend on \(W^k\) only through the evaluation point \(z^{k+1}\); the factor \(\phi'(z^k)\) depends only on \(W^0,\ldots,W^{k-1}\). We neglect this weak dependence of the Jacobians on \(W^k\) for now; it is exactly what the refined calculation at the end of the note quantifies, and it only contributes at \(O(1/d_k)\). Taking expectation over \(W^k\), only the paired weights survive: \(\begin{align} \E[W^k_{sj}W^k_{rj}] = \delta_{sr}\,\frac{\sigma_w^2}{d_k}. \end{align}\)

Hence \(\begin{align} M_{k,\ell} =\frac{1}{d_\ell}\sum_{i,s}\E\!\left[ \frac{\partial z^\ell_i(x)}{\partial z^{k+1}_s}\, \frac{\partial z^L(x')}{\partial z^{k+1}_s}\, \left(\frac{\sigma_w^2}{d_k}\sum_{j=1}^{d_k}\phi'(z^k_j(x))\phi'(z^k_j(x'))\right) \right]. \end{align}\)

By the LLN in width \(d_k\), \(\begin{align} \frac{1}{d_k}\sum_{j=1}^{d_k}\phi'(z^k_j(x))\phi'(z^k_j(x')) \;\xrightarrow[d_k\to\infty]{}\; \E\big[\phi'(z)\phi'(z')\big], \end{align}\) with \((z,z')\sim \mathcal N(0,\Sigma^{k-1}(x,x'))\). Therefore \(\begin{align} M_{k,\ell}=\dot\Sigma^{k}(x,x')\; \frac{1}{d_\ell}\sum_{i,s}\E\!\left[ \frac{\partial z^\ell_i(x)}{\partial z^{k+1}_s}\, \frac{\partial z^L(x')}{\partial z^{k+1}_s}\right] =\dot\Sigma^{k}(x,x')\,M_{k+1,\ell}. \end{align}\)

Unrolling from \(k\) to \(\ell\) (for \(k\le \ell\)): \(\begin{align} M_{k,\ell}=M_{\ell,\ell}\prod_{i=k}^{\ell-1}\dot\Sigma^{i}(x,x'). \end{align}\)


Induction Basis for Mean Influence

Recall \(\begin{align} M_{k,\ell}=\frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\E\left[\sum_{j=1}^{d_k}\frac{\partial z^\ell_i(x)}{\partial z^k_j}\frac{\partial z^L(x')}{\partial z^k_j}\right]. \end{align}\) Set \(k=\ell\). Since \(\frac{\partial z^\ell_i}{\partial z^\ell_j}=\delta_{ij}\), only \(j=i\) survives: \(\begin{align} M_{\ell,\ell}=\frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\E\left[\frac{\partial z^L(x')}{\partial z^\ell_i}\right]. \end{align}\) For \(\ell=L\) (scalar output) the summand equals \(1\). For \(\ell<L\), every term of \(\frac{\partial z^L(x')}{\partial z^\ell_i}\) contains the output weight \(W^{L-1}\) exactly once, independently of its remaining factors, so the expectation vanishes.

Thus \(\begin{align} M_{k,\ell}= \begin{cases} 0, & \ell<L, \cr \prod_{i=k}^{L-1}\dot\Sigma^{i}(x,x'), & \ell=L. \end{cases} \end{align}\)
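The claim \(M_{k,L}=\prod_{i=k}^{L-1}\dot\Sigma^{i}(x,x')\) can be checked numerically: average \(\langle \partial z^L(x)/\partial z^k,\ \partial z^L(x')/\partial z^k\rangle\) over random initializations of a finite but wide network and compare with the product of \(\dot\Sigma\) factors from the kernel recursion. The sketch below does this with manual backprop; the widths, depth, and \(\phi=\tanh\) are again illustrative, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(3)
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2
sigma_w2, sigma_b2 = 1.5, 0.05
n_mc = 400_000

# ---- theory: Sigma / dot-Sigma recursion, Monte Carlo Gaussian expectations ----
d0, L, width, k, n_seeds = 10, 4, 1024, 2, 100   # L layers of pre-activations, d_L = 1
x, xp = rng.normal(size=d0), rng.normal(size=d0)
X = np.stack([x, xp])
Sig = sigma_w2 * (X @ X.T) / d0 + sigma_b2       # Sigma^0
dSig = []                                        # dSig[i] = dot-Sigma^{i+1}(x, x')
for _ in range(L - 1):
    zz = rng.multivariate_normal(np.zeros(2), Sig, size=n_mc)
    dSig.append(sigma_w2 * (dphi(zz[:, 0]) * dphi(zz[:, 1])).mean())
    f = phi(zz)
    Sig = sigma_w2 * (f.T @ f) / n_mc + sigma_b2

# ---- empirical: E < dz^L/dz^k (x), dz^L/dz^k (x') > over random initializations ----
widths = [d0] + [width] * (L - 1) + [1]
acc = 0.0
for _ in range(n_seeds):
    Ws = [rng.normal(0.0, np.sqrt(sigma_w2 / m), size=(n, m))
          for m, n in zip(widths[:-1], widths[1:])]
    bs = [rng.normal(0.0, np.sqrt(sigma_b2), size=n) for n in widths[1:]]
    zs, H = [], X.T                              # columns of H: h^0(x), h^0(x')
    for W, b in zip(Ws, bs):
        Z = W @ H + b[:, None]
        zs.append(Z)                             # zs[t-1] holds z^t for both inputs
        H = phi(Z)
    G = np.ones((1, 2))                          # g^L = dz^L/dz^L
    for t in range(L - 1, k - 1, -1):            # backprop: g^t = phi'(z^t) * (W^t)^T g^{t+1}
        G = dphi(zs[t - 1]) * (Ws[t].T @ G)
    acc += (G[:, 0] @ G[:, 1]) / n_seeds

print("M_{k,L}  empirical:", round(acc, 4), "  theory:", round(float(np.prod(dSig[k - 1:])), 4))
```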

Remark (signal propagation / edge of chaos). For \(x=x'\), \(\dot\Sigma^\ell(x,x)=\sigma_w^2\,\E[\phi'(z)^2]\) with \(z\sim\mathcal N(0,q_{\ell-1})\), where the pre-activation variances satisfy \(\begin{align} q_{\ell}=\Sigma^{\ell}(x,x) \quad\text{and}\quad q_{\ell+1}=\sigma_w^2\,\E\big[\phi(\sqrt{q_\ell}Z)^2\big]+\sigma_b^2. \end{align}\) If \(q_\ell\to q^\star\) with depth and \(\chi_1:=\sigma_w^2\,\E[\phi'(\sqrt{q^\star}Z)^2]\), then once the variances have equilibrated, \(\prod_{i=k}^{L-1}\dot\Sigma^{i}(x,x)\approx \chi_1^{\,L-k}\): backward signals explode or vanish exponentially in depth unless \(\chi_1\approx 1\), the edge of chaos.
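A small sketch of this remark: iterate the variance map to its fixed point \(q^\star\) and evaluate \(\chi_1\) for a few values of \(\sigma_w^2\), with \(\phi=\tanh\) and a small, arbitrary \(\sigma_b^2\); \(\chi_1\) crosses \(1\) at the edge of chaos.

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=400_000)                     # standard normal samples for the 1-d expectations
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2

def q_star_and_chi1(sigma_w2, sigma_b2, n_iter=100):
    """Iterate q <- sigma_w^2 E[phi(sqrt(q) Z)^2] + sigma_b^2 to a fixed point; return (q*, chi_1)."""
    q = 1.0
    for _ in range(n_iter):
        q = sigma_w2 * np.mean(phi(np.sqrt(q) * Z) ** 2) + sigma_b2
    chi1 = sigma_w2 * np.mean(dphi(np.sqrt(q) * Z) ** 2)
    return q, chi1

for sw2 in (0.5, 1.0, 1.5, 2.0, 3.0):
    q_star, chi1 = q_star_and_chi1(sw2, 0.05)
    print(f"sigma_w^2 = {sw2:.1f}   q* = {q_star:.3f}   chi_1 = {chi1:.3f}")
```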

Mapping to one gradient step (optional). If we take a gradient step on the block \((W^{k-1},b^{k-1})\) with the layer-wise learning rates induced by the NTK parametrization, \(\eta_0\,\sigma_w^2/d_{k-1}\) for the weights and \(\eta_0\,\sigma_b^2\) for the biases, then the forward factor converges, \(\begin{align} \frac{\sigma_w^2}{d_{k-1}}\,\langle h^{k-1}(x),h^{k-1}(x')\rangle+\sigma_b^2\xrightarrow{} \Sigma^{k-1}(x,x'), \end{align}\) and the expected influence of block \(k\) on the output \(z^L\) after one step is \(\begin{align} \E[\dot\Theta_{k,L}]=\eta_0\,\Sigma^{k-1}(x,x')\prod_{i=k}^{L-1}\dot\Sigma^{i}(x,x'). \end{align}\) This is about training dynamics; the NTK itself (defined next) does not include \(\eta_0\).


The Variance of Kernel Shifts

We now analyze fluctuations around the mean. The second moment is \(\begin{align} Q_{k,\ell} &=\frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\E\!\left[ \Big(\sum_{j=1}^{d_k} J_{j,i}^{(\ell \leftarrow k)}(x)\,J_{j}^{(L \leftarrow k)}(x')\Big)^2\right]\cr &=\frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\sum_{j,j'}\sum_{r,s,r',s'}\E\!\left[ \frac{\partial z^\ell_i}{\partial z^{k+1}_r}\,W^k_{rj}\,\phi'(z^k_j(x))\; \frac{\partial z^L}{\partial z^{k+1}_s}\,W^k_{sj}\,\phi'(z^k_j(x'))\; \frac{\partial z^\ell_i}{\partial z^{k+1}_{r'}}\,W^k_{r'j'}\,\phi'(z^k_{j'}(x))\; \frac{\partial z^L}{\partial z^{k+1}_{s'}}\,W^k_{s'j'}\,\phi'(z^k_{j'}(x'))\right]. \end{align}\)

Taking the expectation over the four factors of \(W^k\) with Wick’s theorem, only pairings of \(W^k\) contribute. The leading pairing matches \(W^k_{rj}\) with \(W^k_{sj}\) (forcing \(r=s\)) and \(W^k_{r'j'}\) with \(W^k_{s'j'}\) (forcing \(r'=s'\)), leaving free sums over \(j\) and \(j'\) and yielding (after the LLN) \(\begin{align} \left(\frac{\sigma_w^2}{d_k}\sum_{j=1}^{d_k}\phi'(z^k_j(x))\phi'(z^k_j(x'))\right)^2 \xrightarrow{} \big(\dot\Sigma^{k}(x,x')\big)^2. \end{align}\) The remaining pairings force \(j=j'\) and are smaller by \(O(1/d_k)\). Therefore, \(\begin{align} Q_{k,\ell}=\big(\dot\Sigma^{k}(x,x')\big)^2\,Q_{k+1,\ell}\quad\text{(up to }O(1/d_k)\text{)}. \end{align}\) Unrolling to \(Q_{\ell,\ell}\) gives \(\begin{align} Q_{k,\ell}=Q_{\ell,\ell}\prod_{i=k}^{\ell-1}\big(\dot\Sigma^{i}(x,x')\big)^2 \quad\text{(leading order).} \end{align}\)

For the basis \(Q_{\ell,\ell}\), expanding one layer upward and taking expectations as above (only the diagonal Wick pairing of the two \(W^\ell\) factors survives at leading order): \(\begin{align} \begin{aligned} Q_{\ell,\ell} &=\frac{1}{d_\ell}\sum_{i=1}^{d_\ell}\E\left[\left(\sum_{j}\frac{\partial z^L}{\partial z^{\ell+1}_j}\,W^\ell_{ji}\,\phi'(z^\ell_i)\right)^2\right]\cr &=\frac{1}{d_\ell}\sum_{i,j}\E\Big[\Big(\frac{\partial z^L}{\partial z^{\ell+1}_j}\Big)^2\,\big(\phi'(z^\ell_i)\big)^2\Big]\; \E\big[(W^\ell_{ji})^2\big]\cr &=\frac{\sigma_w^2}{d_\ell^2}\sum_{j}\E\Big[\Big(\frac{\partial z^L}{\partial z^{\ell+1}_j}\Big)^2\Big]\; \sum_{i}\E\big[(\phi'(z^\ell_i))^2\big]. \end{aligned} \end{align}\) By LLN, \(\begin{align} \frac{1}{d_\ell}\sum_{i}(\phi'(z^\ell_i))^2 \xrightarrow{} \E[\phi'(z)^2], \quad z\sim\mathcal N(0,\Sigma^{\ell-1}(x',x')), \end{align}\) so \(\begin{align} Q_{\ell,\ell} =\frac{d_{\ell+1}}{d_\ell}\,Q_{\ell+1,\ell+1}\;\underbrace{\sigma_w^2\,\E[\phi'(z)^2]}_{=\ \dot\Sigma^{\ell}(x',x')}. \end{align}\) Thus the basis recursion is \(\begin{align} \boxed{\,Q_{\ell,\ell} =Q_{\ell+1,\ell+1}\,\dot\Sigma^{\ell}(x',x')\,\frac{d_{\ell+1}}{d_\ell}\,}. \end{align}\) Unrolling to \(L\) (with \(Q_{L,L}=1\) for the scalar output) gives \(\begin{align} Q_{\ell,\ell} = \frac{1}{d_\ell}\prod_{i=\ell}^{L-1}\dot\Sigma^{i}(x',x'). \end{align}\) Combining with the first recursion, \(\begin{align} \boxed{\,Q_{k,\ell} = \frac{1}{d_\ell}\left(\prod_{i=k}^{\ell-1}\big(\dot\Sigma^{i}(x,x')\big)^2\right) \left(\prod_{i=\ell}^{L-1}\dot\Sigma^{i}(x',x')\right)\,} \quad\text{(leading order, errors }O(\sum 1/d_i)\text{)}. \end{align}\)
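The basis formula can be checked directly, under the indexing used here: \(\frac{1}{d_\ell}\E\big[\|\partial z^L(x')/\partial z^\ell\|^2\big]\) over random initializations should approach \(\frac{1}{d_\ell}\prod_{i=\ell}^{L-1}\dot\Sigma^{i}(x',x')\). A hedged sketch, with widths, depth, \(\phi=\tanh\), and \(\sigma\) values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2
sigma_w2, sigma_b2 = 1.5, 0.05
Z = rng.normal(size=400_000)                    # standard normal samples for 1-d expectations

d0, L, width, ell, n_seeds = 10, 4, 1024, 2, 100
xp = rng.normal(size=d0)

# ---- theory: diagonal kernels q_l = Sigma^l(x',x') and dot-Sigma^{l+1}(x',x') ----
q = sigma_w2 * (xp @ xp) / d0 + sigma_b2        # Sigma^0(x', x'), the variance of z^1
dS = []                                         # dS[i] = dot-Sigma^{i+1}(x', x')
for _ in range(L - 1):
    dS.append(sigma_w2 * np.mean(dphi(np.sqrt(q) * Z) ** 2))
    q = sigma_w2 * np.mean(phi(np.sqrt(q) * Z) ** 2) + sigma_b2
theory = np.prod(dS[ell - 1:]) / width          # (1/d_ell) prod_{i=ell}^{L-1} dot-Sigma^i(x',x')

# ---- empirical: (1/d_ell) E || dz^L / dz^ell ||^2 over random initializations ----
widths = [d0] + [width] * (L - 1) + [1]
acc = 0.0
for _ in range(n_seeds):
    Ws = [rng.normal(0.0, np.sqrt(sigma_w2 / m), size=(n, m))
          for m, n in zip(widths[:-1], widths[1:])]
    bs = [rng.normal(0.0, np.sqrt(sigma_b2), size=n) for n in widths[1:]]
    zs, h = [], xp
    for W, b in zip(Ws, bs):
        z = W @ h + b
        zs.append(z)
        h = phi(z)
    g = np.ones(1)                              # dz^L/dz^L
    for t in range(L - 1, ell - 1, -1):
        g = dphi(zs[t - 1]) * (Ws[t].T @ g)
    acc += (g @ g) / width / n_seeds

print(f"Q_(ell,ell)  empirical: {acc:.3e}   theory: {theory:.3e}")
```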

If we include the learning-rate scaling for a single gradient step on block \(k\) (as above), the second moment of the influence becomes \(\begin{align} \E[(\dot\Theta_{k,\ell})^2] = \frac{1}{d_\ell}\Big(\eta_0\,\Sigma^{k-1}(x,x')\Big)^2 \left(\prod_{i=k}^{\ell-1}\big(\dot\Sigma^{i}(x,x')\big)^2\right) \left(\prod_{i=\ell}^{L-1}\dot\Sigma^{i}(x',x')\right), \end{align}\) again accurate up to \(O(\sum 1/d_i)\).

Since for \(\ell=L\) the leading term of the second moment equals the square of the first moment, the variance is entirely from subleading terms: \(\begin{align} \text{Var}[\dot\Theta_{k,\ell}]= \begin{cases} \displaystyle \frac{\eta_0^{2}}{d_\ell}\big(\Sigma^{k-1}(x,x')\big)^2 \left(\prod_{i=k}^{\ell-1}\big(\dot\Sigma^{i}(x,x')\big)^2\right) \left(\prod_{i=\ell}^{L-1}\dot\Sigma^{i}(x',x')\right), & \ell<L,\\[10pt] \displaystyle \sum_{m=1}^{L}O\!\left(\frac{1}{d_m}\right)\,\eta_0^{2} \big(\Sigma^{k-1}(x,x')\big)^2 \left(\prod_{i=k}^{L-1}\big(\dot\Sigma^{i}(x,x')\big)^2\right), & \ell=L. \end{cases} \end{align}\) Letting all hidden widths \(d_1,\ldots,d_{L-1}\to\infty\), these variances vanish.


Putting It All Together

Cross-covariances between blocks, \(\text{Cov}[\dot\Theta_{k,\ell},\dot\Theta_{k',\ell}]\) with \(k\ne k'\), are also suppressed by the width: weights from different layers cannot be paired with each other under the expectation, so the leading Wick pairings factorize the cross-expectation into the product of the means. Hence the mean kernel shift at layer \(L\) after one step (under the learning-rate scaling above) is \(\begin{align} \sum_{k=1}^{L}\E[\dot\Theta_{k,L}] = \sum_{k=1}^{L}\eta_0\,\Sigma^{k-1}(x,x') \prod_{i=k}^{L-1}\dot\Sigma^{i}(x,x'), \end{align}\) and its variance vanishes at infinite width. (Recall this “kernel shift” is about training dynamics for one step and therefore carries \(\eta_0\).)

The NTK itself. The Neural Tangent Kernel \(\Theta^\ell(x,x')\) is the (width-limit) inner product of parameter gradients of \(z^\ell\) at initialization. It does not include any learning-rate factor. With the conventions above, the standard NTK recursion for a fully-connected network is \(\begin{align} \boxed{\,\Theta^{L}(x,x')=\Theta^{L-1}(x,x')\,\dot\Sigma^{L-1}(x,x')\;+\;\Sigma^{L-1}(x,x')\,} \end{align}\) with the natural base case \(\Theta^{1}(x,x')=\Sigma^{0}(x,x')\). Unrolled, \(\Theta^{L}(x,x')=\sum_{k=1}^{L}\Sigma^{k-1}(x,x')\prod_{i=k}^{L-1}\dot\Sigma^{i}(x,x')\), which is exactly the kernel-shift sum above with the factor \(\eta_0\) stripped.

This is the classical result of (Jacot et al., 2018).
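The recursion can also be compared with the empirical tangent kernel of a wide finite network. The sketch below accumulates, per block, the backward inner product times the forward factor \(\frac{\sigma_w^2}{d_t}\langle h^t(x),h^t(x')\rangle+\sigma_b^2\) (weight plus bias gradients in the NTK-parametrization accounting used in the one-step mapping above). All sizes, the depth, and \(\phi=\tanh\) are illustrative choices, not anything fixed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(6)
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2
sigma_w2, sigma_b2 = 1.5, 0.05
n_mc = 400_000

d0, L, width, n_seeds = 10, 4, 2048, 20
x, xp = rng.normal(size=d0), rng.normal(size=d0)
X = np.stack([x, xp])

# ---- theory: Sigma^l, dot-Sigma^l, then the boxed recursion for Theta^L ----
Sig = [sigma_w2 * (X @ X.T) / d0 + sigma_b2]    # Sigma^0
dSig = [None]                                   # dSig[l] = dot-Sigma^l(x, x'); index 0 unused
for _ in range(L - 1):
    zz = rng.multivariate_normal(np.zeros(2), Sig[-1], size=n_mc)
    dSig.append(sigma_w2 * (dphi(zz[:, 0]) * dphi(zz[:, 1])).mean())
    f = phi(zz)
    Sig.append(sigma_w2 * (f.T @ f) / n_mc + sigma_b2)
theta = Sig[0][0, 1]                            # Theta^1 = Sigma^0
for t in range(1, L):
    theta = theta * dSig[t] + Sig[t][0, 1]      # Theta^{t+1} = Theta^t dot-Sigma^t + Sigma^t

# ---- empirical tangent kernel of wide finite networks (weight + bias blocks) ----
widths = [d0] + [width] * (L - 1) + [1]
acc = 0.0
for _ in range(n_seeds):
    Ws = [rng.normal(0.0, np.sqrt(sigma_w2 / m), size=(n, m))
          for m, n in zip(widths[:-1], widths[1:])]
    bs = [rng.normal(0.0, np.sqrt(sigma_b2), size=n) for n in widths[1:]]
    Hs, zs, H = [X.T], [], X.T                  # Hs[t] = h^t (columns: x and x')
    for W, b in zip(Ws, bs):
        Z = W @ H + b[:, None]
        zs.append(Z)
        H = phi(Z)
        Hs.append(H)
    Gs = [np.ones((1, 2))]                      # g^L
    for t in range(L - 1, 0, -1):               # g^t = phi'(z^t) * (W^t)^T g^{t+1}
        Gs.append(dphi(zs[t - 1]) * (Ws[t].T @ Gs[-1]))
    Gs = Gs[::-1]                               # now Gs[t] = g^{t+1}, t = 0..L-1
    ntk = sum((Gs[t][:, 0] @ Gs[t][:, 1]) *
              (sigma_w2 / widths[t] * (Hs[t][:, 0] @ Hs[t][:, 1]) + sigma_b2)
              for t in range(L))
    acc += ntk / n_seeds

print("Theta^L  recursion:", round(float(theta), 4), "  empirical:", round(float(acc), 4))
```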


A More Nuanced First Moment Calculation

The simple derivation implicitly neglected off-diagonal (row–row) interactions in the backward pass when the same \(W^k\) appears both in \(z^{k+1}\) and in the Jacobians. Here we make this precise using Gaussian integration by parts.

Define for layer \(t\): \(\begin{align} \begin{aligned} C^t_{f,g} &:= \E[f(z)\,g(z')] \quad \text{with } (z,z')\sim\mathcal N(0,\Sigma^{t-1}),\cr \mu^{t}_{f} &:= \E[f(z)] \quad \text{with } z\sim\mathcal N(0,\Sigma^{t-1}_{11}). \end{aligned} \end{align}\) Note that \(\dot\Sigma^{t}=\sigma_w^2\,C^{t}_{\phi',\phi'}\).

From the chain rule written with post-activations, and pulling out the layer-\(k\) factor \(\phi'(z^k_j(x))\,\phi'(z^k_j(x'))\) (which decouples from the remaining factors at leading order and is replaced by its mean \(C^{k}_{\phi',\phi'}\)), \(\begin{align} \begin{aligned} M_{k,\ell} &= \frac{C^{k}_{\phi',\phi'}}{d_\ell}\sum_{i,j,s,r} \E\!\left[ \frac{\partial z^\ell_i(x)}{\partial h^{k+1}_s}\, \phi'(z^{k+1}_s(x))\,W^k_{sj}\, \frac{\partial z^L(x')}{\partial h^{k+1}_r}\, \phi'(z^{k+1}_r(x'))\,W^k_{rj}\right]\cr &=\frac{C^{k}_{\phi',\phi'}}{d_\ell}\sum_{i,s,r} \E\!\left[ \frac{\partial z^\ell_i(x)}{\partial h^{k+1}_s}\, \frac{\partial z^L(x')}{\partial h^{k+1}_r}\right]\cdot T_{rs}, \end{aligned} \end{align}\) where \(\begin{align} T_{rs}:=\sum_{j=1}^{d_k}\E\!\left[\phi'(z^{k+1}_s(x))\phi'(z^{k+1}_r(x'))\,W^k_{sj}W^k_{rj}\right], \quad z^{k+1}=W^k h^{k}+b^k. \end{align}\)

We use the following extended Stein identities for jointly Gaussian \((X,Y,Z)\) with the relevant covariances \(\sigma_{\bullet\bullet}\) and twice-differentiable \(f\):

\[\begin{align} \begin{aligned} \E[f(X)Y] &= \sigma_{XY}\,\E[f'(X)],\cr \E[f(X)f(Y)Z] &= \sigma_{XZ}\,\E[f'(X)f(Y)] + \sigma_{YZ}\,\E[f(X)f'(Y)],\cr \E[f(X)f(Y)Z^2] &= \sigma_Z^2\,\E[f(X)f(Y)] + \sigma_{XZ}^2\,\E[f''(X)f(Y)] \cr &\qquad + 2\sigma_{XZ}\sigma_{YZ}\,\E[f'(X)f'(Y)] + \sigma_{YZ}^2\,\E[f(X)f''(Y)]. \end{aligned} \end{align}\]
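These identities are easy to spot-check by Monte Carlo for a concrete \(f\) (here \(f=\tanh\)) and a random covariance; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

A = rng.normal(size=(3, 3))
C = A @ A.T                                   # a random valid covariance for (X, Y, Z)
X, Y, Z = rng.multivariate_normal(np.zeros(3), C, size=n).T
sXY, sXZ, sYZ, sZ2 = C[0, 1], C[0, 2], C[1, 2], C[2, 2]

f = np.tanh
f1 = lambda u: 1.0 - np.tanh(u) ** 2                          # f'
f2 = lambda u: -2.0 * np.tanh(u) * (1.0 - np.tanh(u) ** 2)    # f''

print(np.mean(f(X) * Y), "~", sXY * np.mean(f1(X)))
print(np.mean(f(X) * f(Y) * Z), "~",
      sXZ * np.mean(f1(X) * f(Y)) + sYZ * np.mean(f(X) * f1(Y)))
print(np.mean(f(X) * f(Y) * Z ** 2), "~",
      sZ2 * np.mean(f(X) * f(Y)) + sXZ ** 2 * np.mean(f2(X) * f(Y))
      + 2 * sXZ * sYZ * np.mean(f1(X) * f1(Y)) + sYZ ** 2 * np.mean(f(X) * f2(Y)))
```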

Two cases:

(i) \(r=s\) (diagonal). \(\begin{align} \begin{aligned} \E[T_{rr}] &=\sum_{j}\E\big[(W^k_{rj})^2\big]\;\E\big[\phi'(z)\phi'(z')\big]\;+\;O(1/d_k)\cr &=\frac{\sigma_w^2}{d_k}\cdot d_k\cdot C^{k+1}_{\phi',\phi'}\;+\;O(1/d_k) =\boxed{\,\sigma_w^2\,C^{k+1}_{\phi',\phi'}\,}\;+\;O(1/d_k) =\boxed{\,\dot\Sigma^{k+1}\,}\;+\;O(1/d_k). \end{aligned} \end{align}\)

(ii) \(r\neq s\) (off-diagonal). Rows \(r\) and \(s\) of \(W^k\) are independent, so \(\begin{align} \E[T_{rs}] = \sum_{j}\E\big[\phi'(z^{k+1}_s)W^k_{sj}\big]\;\E\big[\phi'(z^{k+1}_r)W^k_{rj}\big]. \end{align}\) By Stein, \(\begin{align} \E\big[\phi'(z^{k+1}_s)W^k_{sj}\big] =\text{Cov}(z^{k+1}_s,W^k_{sj})\;\mu^{k+1}_{\phi''} =\frac{\sigma_w^2}{d_k}\,\E[h^k_j(x)]\;\mu^{k+1}_{\phi''} =\frac{\sigma_w^2}{d_k}\,\mu^{k}_{\phi}\;\mu^{k+1}_{\phi''}. \end{align}\) Therefore \(\begin{align} \boxed{\,\E[T_{rs}] = \frac{\sigma_w^4}{d_k}\,(\mu^{k+1}_{\phi''})^2\,(\mu^{k}_{\phi})^2\,}. \end{align}\) Equivalently, if one conditions on \(h^k\) first and averages later, the same scaling can be written as \(\begin{align} \E[T_{rs}] = \frac{\sigma_w^4}{d_k}\,(\mu^{k+1}_{\phi''})^2\,C^{k}_{\phi,\phi}, \end{align}\) using \(C^{k}_{\phi,\phi}=\E[\phi(z)\phi(z')]\) and the identity \(C^{k}_{\phi,\phi}=(\Sigma^{k}-\sigma_b^2)/\sigma_w^2\). Both viewpoints lead to the same \(O(\sigma_w^4/d_k)\) scale; the first makes it explicit that the contribution vanishes when \(\mu_\phi^{k}=0\).

Putting these into \(M_{k,\ell}\), define \(\begin{align} D_{t,\ell}:=\sum_{i,r}\E\!\left[\frac{\partial z^\ell_i(x)}{\partial h^{t}_r}\,\frac{\partial z^L(x')}{\partial h^{t}_r}\right], \qquad S_{t,\ell}:=\sum_{i}\sum_{\substack{r,s\cr r\neq s}}\E\!\left[\frac{\partial z^\ell_i(x)}{\partial h^{t}_s}\,\frac{\partial z^L(x')}{\partial h^{t}_r}\right]. \end{align}\) Then \(\begin{align} M_{k,\ell}=\frac{C^{k}_{\phi',\phi'}}{d_\ell}\left[\;\underbrace{\dot\Sigma^{k+1}}_{=\sigma_w^2 C^{k+1}_{\phi',\phi'}}D_{k+1,\ell} +\underbrace{\frac{\sigma_w^4}{d_k}\,(\mu^{k+1}_{\phi''})^2\,C^{k}_{\phi,\phi}}_{\text{off-diagonal}}\;S_{k+1,\ell}\right] =\frac{C^{k}_{\phi',\phi'}}{d_\ell}\,D_{k,\ell}. \end{align}\)

This yields the coupled recursions

\[\boxed{ \begin{align} D_{k,\ell} &= \dot\Sigma^{k+1}\,D_{k+1,\ell} \;+\;\frac{\sigma_w^4}{d_k}\,(\mu^{k+1}_{\phi''})^2\,C^{k}_{\phi,\phi}\;S_{k+1,\ell},\\[4pt] S_{k,\ell} &= 2\,\sigma_w^4\,C^{k+1}_{\phi'',\phi''}\,(\mu^{k+1}_{\phi})^2\,D_{k+1,\ell} \;+\;(\mu^{k+1}_{\phi})^2(\mu^{k+1}_{\phi''})^2\,S_{k+1,\ell}, \end{align}}\]

with boundary conditions \(D_{L-1,L}=\sum_{r}\E\big[(W^{L-1}_{1r})^2\big]=\sigma_w^2\) (which is \(1\) for \(\sigma_w=1\)) and \(S_{L-1,L}=0\).

Equivalently, as a \(2\times 2\) transfer matrix \(v_k=T_k\,v_{k+1}\) for \(v_k=(D_{k,L},S_{k,L})\), \(\begin{align} \boxed{ T_k= \begin{pmatrix} \dot\Sigma^{k+1} & \displaystyle \frac{\sigma_w^4}{d_k}\,C^{k}_{\phi,\phi}\,(\mu^{k+1}_{\phi''})^2\\[8pt] \displaystyle 2\,\sigma_w^4\,C^{k+1}_{\phi'',\phi''}\,(\mu^{k+1}_{\phi})^2 & \displaystyle (\mu^{k+1}_{\phi})^2(\mu^{k+1}_{\phi''})^2 \end{pmatrix}.} \end{align}\)

The top-right entry is \(O(1/d_k)\), and the bottom row carries \((\mu_\phi^{k+1})^2\) factors. In the infinite-width limit the top-right entry vanishes, so \(S\) no longer feeds back into \(D\); the \(D\)-recursion reduces to \(D_{k,\ell}=\dot\Sigma^{k+1}D_{k+1,\ell}\) and we recover the pure NTK result, with the off-diagonal contribution entering only at \(O(1/d_k)\). Moreover, if \(\mu^{t}_{\phi}=0\) at initialization for all layers (e.g., odd activations such as \(\tanh\), for which \(\E[\phi(z)]=0\) under the symmetric Gaussian), then \(S_{k,\ell}\equiv 0\) exactly, by the boundary \(S_{L-1,L}=0\) and the fact that both terms driving \(S\) carry a \((\mu_\phi^{t})^2\) factor.

Implication. The refined calculation shows that off-diagonal corrections are (i) lower order in width and (ii) further suppressed when activations are centered. They can, however, matter in feature-learning regimes where \(1/d\) effects are effectively amplified.


Concluding Remarks