Random Fourier Features and Rotary Position Embeddings
Self-attention is permutation-equivariant, hence Transformers require positional information to represent ordered or structured inputs (Vaswani et al. 2017). Rotary Position Embedding (RoPE) (Su et al. 2021) injects relative position information by applying position-dependent \(2\times 2\) rotations to query/key feature pairs. In its common deterministic form, RoPE uses a fixed geometric progression of frequencies (angles) (Su et al. 2021), which couples the induced positional interactions to a particular spectral profile and limits the ability to encode domain-specific inductive biases. Extensions and adaptations of RoPE to higher-dimensional structure (e.g., 2D vision) further motivate a principled design view (Li et al. 2021; Heo et al. 2024).
We introduce Random Feature RoPE (RF-RoPE): rather than fixing frequencies, we sample them from a designed distribution \(p(\boldsymbol{\omega})\). Our central result is a kernel-design theorem showing that RF-RoPE implements any desired shift-invariant kernel in expectation at initialization. This turns positional-encoding design into a mathematically grounded inverse problem: \[\Phi(\Delta \mathbf{p}) \;\longrightarrow\; p(\boldsymbol{\omega}) \;\longrightarrow\; \{\boldsymbol{\omega}_i\}_{i=1}^{D},\] with the inverse map justified by Bochner’s theorem (Bochner 1955; Rasmussen and Williams 2006) and operationalized via random features (Rahimi and Recht 2007).
Random Fourier Features (RFF) approximate shift-invariant kernels via Monte Carlo samples from their spectral density (Rahimi and Recht 2007). Bochner’s theorem characterizes such kernels as Fourier transforms of nonnegative measures (Bochner 1955; Rasmussen and Williams 2006).
Random features have been used to approximate attention kernels in linear-time Transformers (e.g., Performer) (Choromanski et al. 2021). Separately, Stochastic Positional Encoding (SPE) generates relative positional behavior compatible with linear attention (Liutkus et al. 2021), and F-StrIPE extends random-feature positional encodings with structure-informed priors in symbolic music (Agarwal et al. 2025). In parallel, RoPE variants have explored learned or content-dependent rotations (e.g., Selective RoPE) (Movahedi et al. 2025).
Prior random-feature PE work primarily builds additive feature maps for (often linear) attention. Our contribution is to identify the RoPE rotation operator itself as an RFF-style kernel estimator when frequencies are sampled symmetrically. This yields a direct, efficient pathway to import kernel-design theory into the RoPE mechanism.
Let \(\mathbf{q}_m,\mathbf{k}_n\in\mathbb{R}^d\) be query/key vectors at positions \(\mathbf{p}_m,\mathbf{p}_n\in\mathbb{R}^k\). Assume \(d\) is even and split features into \(D=d/2\) two-dimensional blocks: \[\mathbf{q}_m = (\mathbf{q}_{m,1},\ldots,\mathbf{q}_{m,D}),\qquad \mathbf{k}_n = (\mathbf{k}_{n,1},\ldots,\mathbf{k}_{n,D}),\] where each \(\mathbf{q}_{m,i},\mathbf{k}_{n,i}\in\mathbb{R}^2\).
For each block \(i\), draw a frequency vector \(\boldsymbol{\omega}_i\in\mathbb{R}^k\), with \(\boldsymbol{\omega}_1,\ldots,\boldsymbol{\omega}_D\) i.i.d. from \(p(\boldsymbol{\omega})\). Define the \(2\times 2\) rotation \[R(\alpha) \;=\; \begin{pmatrix} \cos\alpha & -\sin\alpha\\ \sin\alpha & \cos\alpha \end{pmatrix}.\] RF-RoPE applies the phase \(\alpha=\mathbf{p}\cdot\boldsymbol{\omega}_i\): \[\mathbf{q}'_{m,i} = R(\mathbf{p}_m\cdot\boldsymbol{\omega}_i)\,\mathbf{q}_{m,i}, \qquad \mathbf{k}'_{n,i} = R(\mathbf{p}_n\cdot\boldsymbol{\omega}_i)\,\mathbf{k}_{n,i},\] and concatenates \(\mathbf{q}'_m=(\mathbf{q}'_{m,1},\dots,\mathbf{q}'_{m,D})\), \(\mathbf{k}'_n=(\mathbf{k}'_{n,1},\dots,\mathbf{k}'_{n,D})\).
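For concreteness, the following NumPy sketch applies the blockwise RF-RoPE rotation; the helper names (`sample_frequencies`, `rf_rope_rotate`) and the Gaussian choice of \(p(\boldsymbol{\omega})\) are illustrative assumptions, not prescribed by the construction above.

```python
import numpy as np

def sample_frequencies(D, k, rng, sigma=1.0):
    # One frequency vector omega_i in R^k per 2D block, drawn i.i.d. from a
    # centrally symmetric p(omega); a Gaussian N(0, sigma^-2 I_k) is used purely as an example.
    return rng.normal(0.0, 1.0 / sigma, size=(D, k))

def rf_rope_rotate(x, pos, omegas):
    # x: (d,) query or key vector with d = 2*D; pos: (k,) position; omegas: (D, k) frequencies.
    blocks = x.reshape(-1, 2)                  # split into D two-dimensional blocks
    angles = omegas @ pos                      # alpha_i = p . omega_i, one angle per block
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(blocks)
    out[:, 0] = c * blocks[:, 0] - s * blocks[:, 1]   # apply R(alpha_i) blockwise
    out[:, 1] = s * blocks[:, 0] + c * blocks[:, 1]
    return out.reshape(-1)

# Usage: the RF-RoPE score is the dot product of the rotated query and key.
rng = np.random.default_rng(0)
d, k = 64, 2
omegas = sample_frequencies(d // 2, k, rng)
q, key = rng.normal(size=d), rng.normal(size=d)
p_m, p_n = np.array([1.0, 2.0]), np.array([3.0, 5.0])
score = rf_rope_rotate(q, p_m, omegas) @ rf_rope_rotate(key, p_n, omegas)
```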
Let \(\Delta\mathbf{p}=\mathbf{p}_n-\mathbf{p}_m\) and \(u_i=\Delta\mathbf{p}\cdot\boldsymbol{\omega}_i\). Using the rotation identity \(R(a)^\top R(b)=R(b-a)\), the contribution of block \(i\) to the RoPE-modulated dot product is \[S_i \;:=\; (\mathbf{q}'_{m,i})^\top \mathbf{k}'_{n,i} = \mathbf{q}_{m,i}^\top R(u_i)\,\mathbf{k}_{n,i}.\] Introduce the \(+90^\circ\) rotation matrix \(J= \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}\), so that \[R(u) = \cos u \,I_2 + \sin u \,J.\] Substituting yields the exact decomposition \begin{equation}\label{eq:Si-decomp} S_i = (\mathbf{q}_{m,i}^\top \mathbf{k}_{n,i})\cos u_i + (\mathbf{q}_{m,i}^\top J\mathbf{k}_{n,i})\sin u_i.\end{equation} The full RF-RoPE attention score (pre-softmax) is \[S(\mathbf{p}_m,\mathbf{p}_n) := (\mathbf{q}'_m)^\top \mathbf{k}'_n = \sum_{i=1}^{D} S_i.\]
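A quick numerical check of the decomposition \(\eqref{eq:Si-decomp}\) (illustrative values; any angle and any pair of 2-D vectors will do):

```python
import numpy as np

rng = np.random.default_rng(0)
qb, kb = rng.normal(size=2), rng.normal(size=2)      # one 2-D block of q and k
u = 0.7                                              # u_i = Delta_p . omega_i
J = np.array([[0.0, -1.0], [1.0, 0.0]])
R = np.array([[np.cos(u), -np.sin(u)], [np.sin(u), np.cos(u)]])
# q^T R(u) k  versus  (q^T k) cos u + (q^T J k) sin u
assert np.isclose(qb @ R @ kb, (qb @ kb) * np.cos(u) + (qb @ J @ kb) * np.sin(u))
```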
Lemma 1 (Odd-term cancellation under central symmetry). Fix \(\Delta\mathbf{p}\in\mathbb{R}^k\) and let \(u=\Delta\mathbf{p}\cdot\boldsymbol{\omega}\) with \(\boldsymbol{\omega}\sim p(\boldsymbol{\omega})\). If \(p\) is centrally symmetric, i.e. \(p(\boldsymbol{\omega})=p(-\boldsymbol{\omega})\), then \[\mathbb{E}[\sin u]=0, \qquad \mathbb{E}[\sin u \cos u]=0.\]
Proof. By symmetry and the change of variables \(\boldsymbol{\omega}\mapsto -\boldsymbol{\omega}\), \[\mathbb{E}[\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})] = \int_{\mathbb{R}^k}\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\,p(\boldsymbol{\omega})\,d\boldsymbol{\omega} = \int_{\mathbb{R}^k}\sin(\Delta\mathbf{p}\cdot(-\boldsymbol{\omega}))\,p(-\boldsymbol{\omega})\,d\boldsymbol{\omega} = -\mathbb{E}[\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})],\] hence it must be \(0\). The second identity follows since \(\sin u\cos u=\tfrac{1}{2}\sin(2u)\) is also odd in \(u\). ◻
Define the RF-RoPE positional kernel induced by \(p\) as the cosine transform \begin{equation}\label{eq:Phi-def} \Phi(\Delta\mathbf{p}) := \mathbb{E}_{\boldsymbol{\omega}\sim p}\!\left[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\right] = \int_{\mathbb{R}^k}\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\,p(\boldsymbol{\omega})\,d\boldsymbol{\omega}.\end{equation}
Theorem 2 (Master Theorem (RF-RoPE kernelization)). Assume \(\boldsymbol{\omega}_1,\ldots,\boldsymbol{\omega}_D\) are i.i.d. from a centrally symmetric distribution \(p(\boldsymbol{\omega})\) and RF-RoPE is applied as above. Then for any fixed content vectors \(\mathbf{q}_m,\mathbf{k}_n\) and positions \(\mathbf{p}_m,\mathbf{p}_n\), \[\mathbb{E}_{\boldsymbol{\omega}_{1:D}}\!\left[S(\mathbf{p}_m,\mathbf{p}_n)\right] = (\mathbf{q}_m^\top \mathbf{k}_n)\,\Phi(\mathbf{p}_n-\mathbf{p}_m),\] where \(\Phi\) is given by \(\eqref{eq:Phi-def}\).
Proof. From \(\eqref{eq:Si-decomp}\) and \(u_i=\Delta\mathbf{p}\cdot\boldsymbol{\omega}_i\), \[S_i = A_i\cos u_i + B_i\sin u_i, \qquad A_i:=\mathbf{q}_{m,i}^\top \mathbf{k}_{n,i},\;\; B_i:=\mathbf{q}_{m,i}^\top J\mathbf{k}_{n,i}.\] Taking expectation over \(\boldsymbol{\omega}_i\) and applying Lemma \(1\), \[\mathbb{E}[S_i] = A_i\,\mathbb{E}[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})] + B_i\,\mathbb{E}[\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})] = A_i\,\Phi(\Delta\mathbf{p}).\] Summing over blocks and using linearity of expectation, \[\mathbb{E}\!\left[S(\mathbf{p}_m,\mathbf{p}_n)\right] = \sum_{i=1}^{D}\mathbb{E}[S_i] = \Phi(\Delta\mathbf{p})\sum_{i=1}^{D}A_i = \Phi(\Delta\mathbf{p})\,\mathbf{q}_m^\top \mathbf{k}_n,\] since \(\mathbf{q}_m^\top \mathbf{k}_n=\sum_{i=1}^D \mathbf{q}_{m,i}^\top \mathbf{k}_{n,i}=\sum_{i=1}^D A_i\). ◻
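A minimal Monte Carlo check of Theorem 2, sketched under the assumption of a Gaussian \(p(\boldsymbol{\omega})\) (so the induced kernel is the Gaussian kernel of the design examples below); the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma = 64, 1, 2.0                      # feature dim, position dim, spectral scale (assumed)
D = d // 2
q, key = rng.normal(size=d), rng.normal(size=d)
p_m, p_n = np.array([3.0]), np.array([4.0])
delta = p_n - p_m

def rope_score(q, key, omegas):
    # Rotate q by p_m and key by p_n blockwise, then take the dot product.
    def rotate(x, pos):
        b = x.reshape(-1, 2)
        a = omegas @ pos
        c, s = np.cos(a), np.sin(a)
        return np.stack([c * b[:, 0] - s * b[:, 1],
                         s * b[:, 0] + c * b[:, 1]], axis=1).reshape(-1)
    return rotate(q, p_m) @ rotate(key, p_n)

# Average the random score over many independent frequency draws omega_i ~ N(0, sigma^-2).
scores = [rope_score(q, key, rng.normal(0, 1 / sigma, size=(D, k))) for _ in range(20_000)]
phi = np.exp(-0.5 * (np.linalg.norm(delta) / sigma) ** 2)   # Gaussian kernel Phi(delta)
print(np.mean(scores), (q @ key) * phi)                     # the two values should nearly agree
```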
Corollary 3 (Inverse kernel design via Bochner). Let \(\Phi:\mathbb{R}^k\to\mathbb{R}\) be continuous, shift-invariant, and positive definite with \(\Phi(\mathbf{0})=1\). Then there exists a centrally symmetric probability measure \(p(\boldsymbol{\omega})\) such that \(\eqref{eq:Phi-def}\) holds. Consequently, RF-RoPE can realize \(\Phi\) in expectation at initialization by sampling \(\boldsymbol{\omega}_i\sim p\) and applying Theorem \(2\).
Proof. By Bochner’s theorem (Bochner 1955; Rasmussen and Williams 2006), \(\Phi\) admits a representation \[\Phi(\Delta\mathbf{p})=\int_{\mathbb{R}^k} e^{i\,\Delta\mathbf{p}^\top \boldsymbol{\omega}}\,d\mu(\boldsymbol{\omega})\] for a finite nonnegative measure \(\mu\). The normalization \(\Phi(\mathbf{0})=1\) implies \(\mu(\mathbb{R}^k)=1\), so \(\mu\) is a probability measure. Since \(\Phi\) is real-valued, \(\mu\) can be chosen centrally symmetric, and taking real parts yields \[\Phi(\Delta\mathbf{p}) = \int_{\mathbb{R}^k}\cos(\Delta\mathbf{p}^\top \boldsymbol{\omega})\,d\mu(\boldsymbol{\omega}),\] which is \(\eqref{eq:Phi-def}\) with \(p=\mu\). ◻
The Master Theorem describes the mean behavior. With finite \(D\), the score \(S\) is a Monte Carlo estimate of this mean, so we next quantify its fluctuations.
Proposition 4 (Second moment and variance of a block). Assume \(p(\boldsymbol{\omega})\) is centrally symmetric and let \(u=\Delta\mathbf{p}\cdot\boldsymbol{\omega}\). For a fixed block \(i\), define \(A_i=\mathbf{q}_{m,i}^\top \mathbf{k}_{n,i}\) and \(B_i=\mathbf{q}_{m,i}^\top J\mathbf{k}_{n,i}\) as in the proof of Theorem \(2\). Let \(\Phi(\Delta\mathbf{p})=\mathbb{E}[\cos u]\) and \(\Phi_2(\Delta\mathbf{p})=\mathbb{E}[\cos(2u)]\). Then \[\mathbb{E}[S_i] = A_i\,\Phi(\Delta\mathbf{p}),\] and the variance satisfies \[\mathrm{Var}(S_i) = \frac{A_i^2+B_i^2}{2} + \frac{A_i^2-B_i^2}{2}\,\Phi_2(\Delta\mathbf{p}) - A_i^2\,\Phi(\Delta\mathbf{p})^2.\]
Proof. We already derived \(\mathbb{E}[S_i]=A_i\Phi(\Delta\mathbf{p})\). For the second moment, write \(S_i=A_i\cos u + B_i\sin u\). Expanding, \[S_i^2 = A_i^2\cos^2 u + B_i^2\sin^2 u + 2A_iB_i\sin u\cos u.\] Taking expectations and applying Lemma \(1\) gives \(\mathbb{E}[\sin u\cos u]=0\). Using the identities \[\cos^2 u=\frac{1+\cos(2u)}{2},\qquad \sin^2 u=\frac{1-\cos(2u)}{2},\] we obtain \[\mathbb{E}[S_i^2] = A_i^2\frac{1+\mathbb{E}[\cos(2u)]}{2} + B_i^2\frac{1-\mathbb{E}[\cos(2u)]}{2} = \frac{A_i^2+B_i^2}{2}+\frac{A_i^2-B_i^2}{2}\,\Phi_2(\Delta\mathbf{p}).\] Finally, \(\mathrm{Var}(S_i)=\mathbb{E}[S_i^2]-\mathbb{E}[S_i]^2\) yields the stated expression. ◻
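As a sanity check on Proposition 4, the sketch below compares the closed-form variance with a Monte Carlo estimate, assuming a Gaussian spectrum so that \(\Phi\) and \(\Phi_2\) have closed forms (parameter values are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, delta = 1.5, 2.0                          # 1-D positions for simplicity (assumed values)
qb, kb = rng.normal(size=2), rng.normal(size=2)  # one 2-D block
J = np.array([[0.0, -1.0], [1.0, 0.0]])
A, B = qb @ kb, qb @ J @ kb

# Gaussian spectrum: Phi(t) = exp(-t^2 / (2 sigma^2)) and Phi2(t) = Phi(2t).
Phi = np.exp(-0.5 * (delta / sigma) ** 2)
Phi2 = np.exp(-0.5 * (2 * delta / sigma) ** 2)
var_pred = (A**2 + B**2) / 2 + (A**2 - B**2) / 2 * Phi2 - A**2 * Phi**2

# Monte Carlo: S_i = A cos u + B sin u with u = delta * omega, omega ~ N(0, sigma^-2).
u = delta * rng.normal(0, 1 / sigma, size=500_000)
S_i = A * np.cos(u) + B * np.sin(u)
print(var_pred, S_i.var())                       # should agree to a few decimal places
```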
Because \(\boldsymbol{\omega}_i\) are i.i.d., the random variables \(S_i\) are independent conditioned on fixed content vectors. Hence, \[\mathrm{Var}(S)=\sum_{i=1}^{D}\mathrm{Var}(S_i).\] In regimes where block magnitudes are comparable (e.g., after layernorm), the standard deviation grows like \(\sqrt{D}\) while the mean magnitude grows like \(D\), yielding a typical relative Monte Carlo error that scales as \(O(D^{-1/2})=O(d^{-1/2})\).
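The \(O(D^{-1/2})\) relative-error scaling can be seen empirically; the sketch below uses aligned content (\(\mathbf{q}=\mathbf{k}\), so all \(A_i\ge 0\)) to realize the comparable-block-magnitude regime, with a Gaussian spectrum as an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, delta = 1.0, 1.0                              # assumed spectral scale and 1-D offset

def score(q, key, omegas):
    # RF-RoPE score written in the decomposed form A_i cos u_i + B_i sin u_i.
    u = delta * omegas
    qb, kb = q.reshape(-1, 2), key.reshape(-1, 2)
    A = np.sum(qb * kb, axis=1)
    B = qb[:, 1] * kb[:, 0] - qb[:, 0] * kb[:, 1]    # q^T J k per block
    return np.sum(A * np.cos(u) + B * np.sin(u))

for D in (16, 64, 256, 1024):
    q = rng.normal(size=2 * D)
    key = q.copy()                                    # aligned content: mean score grows like D
    draws = np.array([score(q, key, rng.normal(0, 1 / sigma, size=D)) for _ in range(2_000)])
    print(D, draws.std() / abs(draws.mean()))         # shrinks roughly like D ** -0.5
```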
By Corollary \(3\), specifying a desired kernel \(\Phi\) determines a spectral distribution \(p(\boldsymbol{\omega})\) (or density when it exists), from which RF-RoPE samples frequencies. The following standard examples follow directly from classical kernel/spectral pairs (Rasmussen and Williams 2006; Rahimi and Recht 2007).
Gaussian (RBF) kernel: \[\Phi(\Delta\mathbf{p})=\exp\!\left(-\frac{1}{2\sigma^2}\|\Delta\mathbf{p}\|^2\right), \qquad \boldsymbol{\omega}\sim \mathcal{N}(\mathbf{0},\sigma^{-2}I_k).\]
Cauchy (Lorentzian) kernel, for scalar positions: \[\Phi(\Delta m)=\left(1+(\Delta m/b)^2\right)^{-1}, \qquad \omega \sim \mathrm{Laplace}(0,b^{-1})\ \ (\text{equivalently }p(\omega)\propto e^{-b|\omega|}).\]
Band-limited (sinc) kernel: for separable bandwidths \(W_j\) (with \(\mathrm{sinc}(x)=\sin(x)/x\)), \[\Phi(\Delta\mathbf{p})=\prod_{j=1}^{k}\mathrm{sinc}(W_j\Delta p_j), \qquad \omega_j \sim \mathrm{Unif}(-W_j,W_j)\ \text{ independently.}\]
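Each pair can be checked numerically by comparing the empirical cosine average against the closed-form kernel; the sketch below does so for the three examples (parameter values \(\sigma\), \(b\), \(W\) are arbitrary, and `np.sinc` is the normalized sinc, hence the \(\pi\) rescaling).

```python
import numpy as np

rng = np.random.default_rng(3)
N, taus = 1_000_000, np.array([0.5, 1.0, 2.0, 4.0])
sigma, b, W = 1.0, 1.5, 2.0

pairs = {
    # Gaussian kernel  <->  Gaussian spectrum N(0, sigma^-2)
    "gaussian": (rng.normal(0, 1 / sigma, N),  lambda t: np.exp(-0.5 * (t / sigma) ** 2)),
    # Cauchy kernel    <->  Laplace spectrum with scale 1/b
    "cauchy":   (rng.laplace(0, 1 / b, N),     lambda t: 1 / (1 + (t / b) ** 2)),
    # sinc kernel      <->  uniform (band-limited) spectrum on (-W, W)
    "sinc":     (rng.uniform(-W, W, N),        lambda t: np.sinc(W * t / np.pi)),
}
for name, (omegas, phi) in pairs.items():
    empirical = [np.cos(t * omegas).mean() for t in taus]   # Monte Carlo Phi(tau)
    exact = [float(phi(t)) for t in taus]
    print(name, np.round(empirical, 3), np.round(exact, 3))
```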
Matérn kernels interpolate between rough and smooth locality (Matérn 1960; Rasmussen and Williams 2006). Their spectral densities exhibit Student-\(t\)-like heavy tails, motivating heavy-tailed sampling schemes for \(\boldsymbol{\omega}\).
Standard deterministic RoPE for \(k=1\) uses a fixed geometric frequency list (Su et al. 2021). Writing these frequencies as \(\theta_i=B^{-2i/d}\), the RoPE-modulated dot product can be expressed in the form \[S(m,n)=\sum_{i=1}^{D}\Big[A_i\cos(\Delta m\,\theta_i)+B_i\sin(\Delta m\,\theta_i)\Big],\] for coefficients \(A_i,B_i\) determined by the content vectors within each 2D block (cf. \(\eqref{eq:Si-decomp}\)).
RF-RoPE relies on symmetric sampling to guarantee \(\mathbb{E}[\sin(\Delta m\,\omega)]=0\) and hence a clean multiplicative modulation (Theorem \(2\)). In standard RoPE, the frequency list is strictly positive, so there is no mechanism enforcing cancellation of the sine term. As a result, standard RoPE does not, in general, admit a representation of the form \[S(m,n)\stackrel{?}{=}(\mathbf{q}_m^\top\mathbf{k}_n)\,\Phi(\Delta m)\] for a single real-valued scalar kernel \(\Phi\) without additional assumptions on the content-dependent coefficients.
A geometric grid \(\theta_i=B^{-2i/d}\) corresponds to an approximately log-uniform spacing. Treat \(i\) as continuous and solve for \(i(\theta)\): \[\theta = B^{-2i/d} \quad\Longrightarrow\quad i(\theta)= -\frac{d}{2}\log_B\theta.\] Thus, \[\left|\frac{di}{d\theta}\right| = \frac{d}{2}\cdot \frac{1}{|\ln B|}\cdot \frac{1}{\theta} \;\propto\; \frac{1}{\theta}.\] Interpreting \(\left|di/d\theta\right|\) as a continuum density suggests an implied spectrum \(p(\theta)\propto 1/\theta\) over the covered band. This profile places substantial mass on low frequencies, which tends to induce slowly decaying correlations in position space, consistent with RoPE’s empirically weak inherent locality prior (Su et al. 2021).
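The implied density can be read off the standard grid directly; a small sketch (base \(B=10000\) and \(d=128\) are typical but arbitrary choices here):

```python
import numpy as np

d, B = 128, 10000.0
i = np.arange(d // 2)
theta = B ** (-2 * i / d)                    # standard RoPE geometric frequency grid

# Constant spacing in log-frequency  <=>  continuum density p(theta) proportional to 1/theta.
log_gaps = np.diff(np.log(theta))
print(np.allclose(log_gaps, log_gaps[0]))    # True: the grid is exactly log-uniform

# Most of the grid sits below its arithmetic mean: mass is concentrated at low frequencies.
print(np.mean(theta < theta.mean()))         # about 0.77 for these settings
```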
Remark 5. The above \(p(\theta)\propto 1/\theta\) argument is a density-of-states approximation: it describes how densely the deterministic grid samples log-frequency. RF-RoPE replaces this implicit bias with an explicit, designer-chosen \(p(\omega)\) tied to a target kernel via Corollary \(3\).
RF-RoPE controls the initialization kernel. Yet Transformers learn attention heads that focus on fixed, nonzero offsets (e.g., induction heads) (Olsson et al. 2022). We now formalize why RF-RoPE kernels are necessarily centered and how learning can induce shifts.
Theorem 6 (Peak at the origin). Let \(\Phi(\Delta\mathbf{p})=\mathbb{E}_{\boldsymbol{\omega}\sim p}[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})]\) for any probability distribution \(p\). Then \(\Phi\) attains its global maximum at \(\Delta\mathbf{p}=\mathbf{0}\), and \(\Phi(\Delta\mathbf{p})\le 1\) for all \(\Delta\mathbf{p}\).
Proof. \(\Phi(\mathbf{0})=\mathbb{E}[\cos 0]=1\). For any \(\Delta\mathbf{p}\), \(\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\le 1\) pointwise, hence \(\Phi(\Delta\mathbf{p})=\mathbb{E}[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})]\le 1\). ◻
Thus, selecting a different \(p(\boldsymbol{\omega})\) can change the shape (width, smoothness, tails) of locality, but cannot move the peak away from \(\Delta\mathbf{p}=\mathbf{0}\) at initialization.
Consider a single block \(i\) and suppose training makes the query/key projections implement additional content-dependent rotations within that \(2\)D subspace. Concretely, assume there exist angles \(\phi_{q,i},\phi_{k,i}\) (possibly head- and block-specific) such that the pre-RoPE projected vectors satisfy \[\tilde{\mathbf{q}}_{m,i}\approx R(\phi_{q,i})\,\mathbf{v}_{m,i}, \qquad \tilde{\mathbf{k}}_{n,i}\approx R(\phi_{k,i})\,\mathbf{v}_{n,i},\] for some base content vectors \(\mathbf{v}_{m,i},\mathbf{v}_{n,i}\). Applying RF-RoPE yields \[\mathbf{q}'_{m,i}=R(\mathbf{p}_m\cdot\boldsymbol{\omega}_i)\tilde{\mathbf{q}}_{m,i}, \qquad \mathbf{k}'_{n,i}=R(\mathbf{p}_n\cdot\boldsymbol{\omega}_i)\tilde{\mathbf{k}}_{n,i}.\] Since rotations commute in 2D, \[\mathbf{q}'_{m,i}\approx R(\mathbf{p}_m\cdot\boldsymbol{\omega}_i+\phi_{q,i})\mathbf{v}_{m,i}, \qquad \mathbf{k}'_{n,i}\approx R(\mathbf{p}_n\cdot\boldsymbol{\omega}_i+\phi_{k,i})\mathbf{v}_{n,i}.\] Therefore, the relative phase entering the dot product is shifted by \[(\mathbf{p}_n-\mathbf{p}_m)\cdot\boldsymbol{\omega}_i + (\phi_{k,i}-\phi_{q,i}).\] Let \(\psi_i:=\phi_{k,i}-\phi_{q,i}\). Then block \(i\) contributes approximately \[S_i(\Delta\mathbf{p})\approx C_i\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega}_i+\psi_i), \qquad C_i:=\mathbf{v}_{m,i}^\top\mathbf{v}_{n,i}.\]
Suppose the training objective encourages attention to peak near \(\Delta\mathbf{p}=-\mathbf{L}\). Maximizing \(S(-\mathbf{L})\) over phases pushes \[-\mathbf{L}\cdot\boldsymbol{\omega}_i+\psi_i \approx 0 \quad\Longrightarrow\quad \psi_i \approx \mathbf{L}\cdot\boldsymbol{\omega}_i.\] Substituting gives the shifted form \[S(\Delta\mathbf{p}) \approx \sum_{i=1}^{D} C_i\cos\big((\Delta\mathbf{p}+\mathbf{L})\cdot\boldsymbol{\omega}_i\big),\] i.e. learning can recenter the effective kernel away from the origin even though initialization kernels (Theorem \(6\)) must be centered. This provides a simple mechanistic explanation for how RoPE-based models can develop fixed-offset heads such as induction heads (Olsson et al. 2022).
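The recentering is easy to see numerically; the sketch below sets the phases \(\psi_i=\mathbf{L}\cdot\boldsymbol{\omega}_i\) by hand (standing in for what training would learn) and locates the resulting peak, with 1-D positions, a Gaussian spectrum, and positive \(C_i\) as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
D, sigma, L = 256, 1.0, 5.0
omegas = rng.normal(0, 1 / sigma, D)      # 1-D frequencies from a Gaussian spectrum (assumed)
C = np.abs(rng.normal(size=D))            # positive content coefficients C_i (assumed aligned)
psi = L * omegas                          # hand-set phases psi_i = L * omega_i

deltas = np.linspace(-10, 10, 2001)
S = np.array([(C * np.cos(d * omegas + psi)).sum() for d in deltas])
print(deltas[S.argmax()])                 # the effective kernel now peaks near Delta = -L
```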
RF-RoPE reframes RoPE frequency selection as kernel engineering. By sampling frequencies from a designed spectral density, RF-RoPE realizes any desired shift-invariant positive-definite kernel in expectation at initialization (Theorem \(2\)), with principled inverse design via Bochner’s theorem (Corollary \(3\)). We also characterized finite-sample variance (Proposition \(4\)) and clarified why initialization kernels are necessarily centered (Theorem \(6\)), while learned projections can induce shifted-local attention via phase shifts. Together, these results provide a rigorous foundation for designing and interpreting RoPE-style positional mechanisms.