RF-RoPE: Adjust your RoPE using Random Fourier Features

Random Fourier Features and Rotary Position Embeddings

Introduction

Self-attention is permutation-equivariant, hence Transformers require positional information to represent ordered or structured inputs (Vaswani et al. 2017). Rotary Position Embedding (RoPE) (Su et al. 2021) injects relative position information by applying position-dependent \(2\times 2\) rotations to query/key feature pairs. In its common deterministic form, RoPE uses a fixed geometric progression of frequencies (angles) (Su et al. 2021), which couples the induced positional interactions to a particular spectral profile and limits the ability to encode domain-specific inductive biases. Extensions and adaptations of RoPE to higher-dimensional structure (e.g., 2D vision) further motivate a principled design view (Li et al. 2021; Heo et al. 2024).

We introduce Random Feature RoPE (RF-RoPE): rather than fixing frequencies, we sample them from a designed distribution \(p(\boldsymbol{\omega})\). Our central result is a kernel-design theorem showing that RF-RoPE implements any desired shift-invariant kernel in expectation at initialization. This turns positional-encoding design into a mathematically grounded inverse problem: \[\Phi(\Delta \mathbf{p}) \;\longrightarrow\; p(\boldsymbol{\omega}) \;\longrightarrow\; \{\boldsymbol{\omega}_i\}_{i=1}^{D},\] with the inverse map justified by Bochner’s theorem (Bochner 1955; Rasmussen and Williams 2006) and operationalized via random features (Rahimi and Recht 2007).

Background and Related Work

Random Fourier Features and Bochner's theorem.

Random Fourier Features (RFF) approximate shift-invariant kernels via Monte Carlo samples from their spectral density (Rahimi and Recht 2007). Bochner’s theorem characterizes such kernels as Fourier transforms of nonnegative measures (Bochner 1955; Rasmussen and Williams 2006).

Random features in efficient attention and positional encoding.

Random features have been used to approximate attention kernels in linear-time Transformers (e.g., Performer) (Choromanski et al. 2021). Separately, Stochastic Positional Encoding (SPE) generates relative positional behavior compatible with linear attention (Liutkus et al. 2021), and F-StrIPE extends random-feature positional encodings with structure-informed priors in symbolic music (Agarwal et al. 2025). In parallel, RoPE variants have explored learned or content-dependent rotations (e.g., Selective RoPE) (Movahedi et al. 2025).

Novelty gap.

Prior random-feature PE work primarily builds additive feature maps for (often linear) attention. Our contribution is to identify the RoPE rotation operator itself as an RFF-style kernel estimator when frequencies are sampled symmetrically. This yields a direct, efficient pathway to import kernel-design theory into the RoPE mechanism.

The RF-RoPE Framework

Generalized rotational encoding for \(\mathbf{p}\in\mathbb{R}^k\)

Let \(\mathbf{q}_m,\mathbf{k}_n\in\mathbb{R}^d\) be query/key vectors at positions \(\mathbf{p}_m,\mathbf{p}_n\in\mathbb{R}^k\). Assume \(d\) is even and split features into \(D=d/2\) two-dimensional blocks: \[\mathbf{q}_m = (\mathbf{q}_{m,1},\ldots,\mathbf{q}_{m,D}),\qquad \mathbf{k}_n = (\mathbf{k}_{n,1},\ldots,\mathbf{k}_{n,D}),\] where each \(\mathbf{q}_{m,i},\mathbf{k}_{n,i}\in\mathbb{R}^2\).

For each block \(i\), sample a frequency vector \(\boldsymbol{\omega}_i\in\mathbb{R}^k\) i.i.d. from \(p(\boldsymbol{\omega})\). Define the \(2\times 2\) rotation \[R(\alpha) \;=\; \begin{pmatrix} \cos\alpha & -\sin\alpha\\ \sin\alpha & \cos\alpha \end{pmatrix}.\] RF-RoPE applies the phase \(\alpha=\mathbf{p}\cdot\boldsymbol{\omega}_i\): \[\mathbf{q}'_{m,i} = R(\mathbf{p}_m\cdot\boldsymbol{\omega}_i)\,\mathbf{q}_{m,i}, \qquad \mathbf{k}'_{n,i} = R(\mathbf{p}_n\cdot\boldsymbol{\omega}_i)\,\mathbf{k}_{n,i},\] and concatenates \(\mathbf{q}'_m=(\mathbf{q}'_{m,1},\dots,\mathbf{q}'_{m,D})\), \(\mathbf{k}'_n=(\mathbf{k}'_{n,1},\dots,\mathbf{k}'_{n,D})\).
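The following is a minimal NumPy sketch of this construction; the function name, array shapes, and the Gaussian sampler in the usage example are illustrative assumptions rather than a reference implementation.

```python
# A minimal NumPy sketch of the RF-RoPE rotation defined above; names, shapes,
# and the Gaussian sampler in the usage example are illustrative assumptions.
import numpy as np

def rf_rope_rotate(x, positions, omegas):
    """Apply the per-block rotation R(p . omega_i) to content vectors x.

    x         : (seq, d) queries or keys, with d = 2*D
    positions : (seq, k) positions p
    omegas    : (D, k) sampled frequencies omega_i
    """
    seq, d = x.shape
    D = d // 2
    xb = x.reshape(seq, D, 2)                    # split into D two-dimensional blocks
    alpha = positions @ omegas.T                 # (seq, D) phases p . omega_i
    cos, sin = np.cos(alpha), np.sin(alpha)
    x0, x1 = xb[..., 0], xb[..., 1]
    out = np.stack([cos * x0 - sin * x1,         # R(alpha) applied block-wise
                    sin * x0 + cos * x1], axis=-1)
    return out.reshape(seq, d)

# Usage: sample frequencies once (here from an isotropic Gaussian), rotate q and k,
# then take ordinary dot-product attention scores on the rotated vectors.
rng = np.random.default_rng(0)
seq, d, k = 16, 64, 2
omegas = rng.normal(scale=0.25, size=(d // 2, k))
q, kv = rng.normal(size=(seq, d)), rng.normal(size=(seq, d))
pos = rng.uniform(0, 10, size=(seq, k))
scores = rf_rope_rotate(q, pos, omegas) @ rf_rope_rotate(kv, pos, omegas).T
```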

Per-block inner product decomposition

Let \(\Delta\mathbf{p}=\mathbf{p}_n-\mathbf{p}_m\) and \(u_i=\Delta\mathbf{p}\cdot\boldsymbol{\omega}_i\). Using the rotation identity \(R(a)^\top R(b)=R(b-a)\), the contribution of block \(i\) to the RoPE-modulated dot product is \[\begin{aligned} S_i \;:=\; (\mathbf{q}'_{m,i})^\top \mathbf{k}'_{n,i} &= \mathbf{q}_{m,i}^\top R(u_i)\,\mathbf{k}_{n,i}. \end{aligned}\] Introduce the \(+90^\circ\) rotation matrix \(J= \begin{pmatrix} 0 & -1\\ 1 & 0 \end{pmatrix}\) so that \(J\mathbf{x}\) rotates \(\mathbf{x}\) by \(+90^\circ\). Then \[R(u) = \cos u \,I_2 + \sin u \,J.\] Substituting yields the exact decomposition \begin{equation}\label{eq:Si-decomp} S_i = (\mathbf{q}_{m,i}^\top \mathbf{k}_{n,i})\cos u_i + (\mathbf{q}_{m,i}^\top J\mathbf{k}_{n,i})\sin u_i.\end{equation} The full RF-RoPE attention score (pre-softmax) is \[S(\mathbf{p}_m,\mathbf{p}_n) := (\mathbf{q}'_m)^\top \mathbf{k}'_n = \sum_{i=1}^{D} S_i.\]
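The decomposition \(\eqref{eq:Si-decomp}\) is easy to verify numerically; a minimal check with illustrative values:

```python
# Numerical check of S_i = A_i cos(u_i) + B_i sin(u_i) for one block (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
u = 0.7
R = np.array([[np.cos(u), -np.sin(u)], [np.sin(u), np.cos(u)]])
J = np.array([[0.0, -1.0], [1.0, 0.0]])
assert np.isclose(q @ R @ k, (q @ k) * np.cos(u) + (q @ J @ k) * np.sin(u))
```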

Master Theorem: kernelization in expectation

Lemma 1 (Odd-term cancellation under central symmetry). Fix \(\Delta\mathbf{p}\in\mathbb{R}^k\) and let \(u=\Delta\mathbf{p}\cdot\boldsymbol{\omega}\) with \(\boldsymbol{\omega}\sim p(\boldsymbol{\omega})\). If \(p\) is centrally symmetric, i.e. \(p(\boldsymbol{\omega})=p(-\boldsymbol{\omega})\), then \[\mathbb{E}[\sin u]=0, \qquad \mathbb{E}[\sin u \cos u]=0.\]

Proof. By symmetry and the change of variables \(\boldsymbol{\omega}\mapsto -\boldsymbol{\omega}\), \[\mathbb{E}[\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})] = \int_{\mathbb{R}^k}\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\,p(\boldsymbol{\omega})\,d\boldsymbol{\omega} = \int_{\mathbb{R}^k}\sin(\Delta\mathbf{p}\cdot(-\boldsymbol{\omega}))\,p(-\boldsymbol{\omega})\,d\boldsymbol{\omega} = -\mathbb{E}[\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})],\] hence it must be \(0\). The second identity follows since \(\sin u\cos u=\tfrac{1}{2}\sin(2u)\) is also odd in \(u\). ◻

Define the RF-RoPE positional kernel induced by \(p\) as the cosine transform \begin{equation}\label{eq:Phi-def} \Phi(\Delta\mathbf{p}) := \mathbb{E}_{\boldsymbol{\omega}\sim p}\!\left[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\right] = \int_{\mathbb{R}^k}\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\,p(\boldsymbol{\omega})\,d\boldsymbol{\omega}.\end{equation}

Theorem 2 (Master Theorem (RF-RoPE kernelization)). Assume \(\boldsymbol{\omega}_1,\ldots,\boldsymbol{\omega}_D\) are i.i.d. from a centrally symmetric distribution \(p(\boldsymbol{\omega})\) and RF-RoPE is applied as above. Then for any fixed content vectors \(\mathbf{q}_m,\mathbf{k}_n\) and positions \(\mathbf{p}_m,\mathbf{p}_n\), \[\mathbb{E}_{\boldsymbol{\omega}_{1:D}}\!\left[S(\mathbf{p}_m,\mathbf{p}_n)\right] = (\mathbf{q}_m^\top \mathbf{k}_n)\,\Phi(\mathbf{p}_n-\mathbf{p}_m),\] where \(\Phi\) is given by \(\eqref{eq:Phi-def}\).

Proof. From \(\eqref{eq:Si-decomp}\) and \(u_i=\Delta\mathbf{p}\cdot\boldsymbol{\omega}_i\), \[S_i = A_i\cos u_i + B_i\sin u_i, \qquad A_i:=\mathbf{q}_{m,i}^\top \mathbf{k}_{n,i},\;\; B_i:=\mathbf{q}_{m,i}^\top J\mathbf{k}_{n,i}.\] Taking expectation over \(\boldsymbol{\omega}_i\) and applying Lemma \(1\), \[\mathbb{E}[S_i] = A_i\,\mathbb{E}[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})] + B_i\,\mathbb{E}[\sin(\Delta\mathbf{p}\cdot\boldsymbol{\omega})] = A_i\,\Phi(\Delta\mathbf{p}).\] Summing over blocks and using linearity of expectation, \[\mathbb{E}\!\left[S(\mathbf{p}_m,\mathbf{p}_n)\right] = \sum_{i=1}^{D}\mathbb{E}[S_i] = \Phi(\Delta\mathbf{p})\sum_{i=1}^{D}A_i = \Phi(\Delta\mathbf{p})\,\mathbf{q}_m^\top \mathbf{k}_n,\] since \(\mathbf{q}_m^\top \mathbf{k}_n=\sum_{i=1}^D \mathbf{q}_{m,i}^\top \mathbf{k}_{n,i}=\sum_{i=1}^D A_i\). ◻
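As a sanity check of the theorem, the sketch below Monte Carlo estimates \(\mathbb{E}[S]\) for a Gaussian spectral density, where \(\Phi(\Delta\mathbf{p})=\exp(-\|\Delta\mathbf{p}\|^2/(2\sigma^2))\) is available in closed form (see the Gaussian example later); all dimensions and values are illustrative.

```python
# Monte Carlo sanity check of Theorem 2 for a Gaussian spectral density, where
# Phi(dp) = exp(-||dp||^2 / (2 sigma^2)). All values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
d, k, sigma = 64, 2, 4.0
D = d // 2
q, kv = rng.normal(size=d), rng.normal(size=d)
dp = np.array([3.0, -1.0])                         # dp = p_n - p_m

qb, kb = q.reshape(D, 2), kv.reshape(D, 2)
A = np.einsum('ij,ij->i', qb, kb)                  # A_i = q_i . k_i
B = qb[:, 1] * kb[:, 0] - qb[:, 0] * kb[:, 1]      # B_i = q_i^T J k_i

def score(omegas):
    u = omegas @ dp                                # u_i = dp . omega_i
    return np.sum(A * np.cos(u) + B * np.sin(u))   # per-block decomposition, summed

trials = [score(rng.normal(scale=1.0 / sigma, size=(D, k))) for _ in range(20000)]
expected = (q @ kv) * np.exp(-np.dot(dp, dp) / (2 * sigma**2))
print(np.mean(trials), expected)                   # agree up to Monte Carlo error
```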

Corollary 3 (Inverse kernel design via Bochner). Let \(\Phi:\mathbb{R}^k\to\mathbb{R}\) be continuous, shift-invariant, and positive definite with \(\Phi(\mathbf{0})=1\). Then there exists a centrally symmetric probability measure \(p(\boldsymbol{\omega})\) such that \(\eqref{eq:Phi-def}\) holds. Consequently, RF-RoPE can realize \(\Phi\) in expectation at initialization by sampling \(\boldsymbol{\omega}_i\sim p\) and applying Theorem \(2\).

Proof. By Bochner’s theorem (Bochner 1955; Rasmussen and Williams 2006), \(\Phi\) admits a representation \[\Phi(\Delta\mathbf{p})=\int_{\mathbb{R}^k} e^{i\,\Delta\mathbf{p}^\top \boldsymbol{\omega}}\,d\mu(\boldsymbol{\omega})\] for a finite nonnegative measure \(\mu\). The normalization \(\Phi(\mathbf{0})=1\) implies \(\mu(\mathbb{R}^k)=1\), so \(\mu\) is a probability measure. Since \(\Phi\) is real-valued, \(\mu\) can be chosen centrally symmetric, and taking real parts yields \[\Phi(\Delta\mathbf{p}) = \int_{\mathbb{R}^k}\cos(\Delta\mathbf{p}^\top \boldsymbol{\omega})\,d\mu(\boldsymbol{\omega}),\] which is \(\eqref{eq:Phi-def}\) with \(p=\mu\). ◻

Variance analysis and convergence

The Master Theorem describes the mean behavior. With finite \(D\), the score \(S\) is a Monte Carlo estimate of this mean, and we now quantify its fluctuations.

Proposition 4 (Second moment and variance of a block). Assume \(p(\boldsymbol{\omega})\) is centrally symmetric and let \(u=\Delta\mathbf{p}\cdot\boldsymbol{\omega}\). For a fixed block \(i\), define \(A_i=\mathbf{q}_{m,i}^\top \mathbf{k}_{n,i}\) and \(B_i=\mathbf{q}_{m,i}^\top J\mathbf{k}_{n,i}\) as in the proof of Theorem \(2\). Let \(\Phi(\Delta\mathbf{p})=\mathbb{E}[\cos u]\) and \(\Phi_2(\Delta\mathbf{p})=\mathbb{E}[\cos(2u)]\). Then \[\mathbb{E}[S_i] = A_i\,\Phi(\Delta\mathbf{p}),\] and the variance satisfies \[\mathrm{Var}(S_i) = \frac{A_i^2+B_i^2}{2} + \frac{A_i^2-B_i^2}{2}\,\Phi_2(\Delta\mathbf{p}) - A_i^2\,\Phi(\Delta\mathbf{p})^2.\]

Proof. We already derived \(\mathbb{E}[S_i]=A_i\Phi(\Delta\mathbf{p})\). For the second moment, write \(S_i=A_i\cos u + B_i\sin u\). Expanding, \[S_i^2 = A_i^2\cos^2 u + B_i^2\sin^2 u + 2A_iB_i\sin u\cos u.\] Taking expectations and applying Lemma \(1\) gives \(\mathbb{E}[\sin u\cos u]=0\). Using the identities \[\cos^2 u=\frac{1+\cos(2u)}{2},\qquad \sin^2 u=\frac{1-\cos(2u)}{2},\] we obtain \[\mathbb{E}[S_i^2] = A_i^2\frac{1+\mathbb{E}[\cos(2u)]}{2} + B_i^2\frac{1-\mathbb{E}[\cos(2u)]}{2} = \frac{A_i^2+B_i^2}{2}+\frac{A_i^2-B_i^2}{2}\,\Phi_2(\Delta\mathbf{p}).\] Finally, \(\mathrm{Var}(S_i)=\mathbb{E}[S_i^2]-\mathbb{E}[S_i]^2\) yields the stated expression. ◻
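A quick numerical check of the variance formula, using a 1D Gaussian spectral density so that \(\Phi\) and \(\Phi_2\) have closed forms; the coefficient values are arbitrary.

```python
# Numerical check of Proposition 4 for a single block, using a 1D Gaussian
# spectral density so Phi and Phi_2 have closed forms. Values are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
A, B, dp, sigma = 0.8, -1.3, 2.0, 3.0
u = dp * rng.normal(scale=1.0 / sigma, size=200_000)
S = A * np.cos(u) + B * np.sin(u)

Phi  = np.exp(-dp**2 / (2 * sigma**2))             # E[cos u]
Phi2 = np.exp(-(2 * dp)**2 / (2 * sigma**2))       # E[cos 2u]
var_formula = (A**2 + B**2) / 2 + (A**2 - B**2) / 2 * Phi2 - A**2 * Phi**2
print(S.var(), var_formula)                        # match up to Monte Carlo error
```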

Because \(\boldsymbol{\omega}_i\) are i.i.d., the random variables \(S_i\) are independent conditioned on fixed content vectors. Hence, \[\mathrm{Var}(S)=\sum_{i=1}^{D}\mathrm{Var}(S_i).\] In regimes where block magnitudes are comparable (e.g., after layernorm), the standard deviation grows like \(\sqrt{D}\) while the mean magnitude grows like \(D\), yielding a typical relative Monte Carlo error that scales as \(O(D^{-1/2})=O(d^{-1/2})\).

Application to RF-RoPE Design

By Corollary \(3\), specifying a desired kernel \(\Phi\) determines a spectral distribution \(p(\boldsymbol{\omega})\) (or density when it exists), from which RF-RoPE samples frequencies. The following standard examples follow directly from classical kernel/spectral pairs (Rasmussen and Williams 2006; Rahimi and Recht 2007).

Gaussian (RBF) kernel.

\[\Phi(\Delta\mathbf{p})=\exp\!\left(-\frac{1}{2\sigma^2}\|\Delta\mathbf{p}\|^2\right), \qquad \boldsymbol{\omega}\sim \mathcal{N}(\mathbf{0},\sigma^{-2}I_k).\]

Cauchy kernel (1D).

\[\Phi(\Delta m)=\left(1+(\Delta m/b)^2\right)^{-1}, \qquad \omega \sim \mathrm{Laplace}(0,b^{-1})\ \ (\text{equivalently }p(\omega)\propto e^{-b|\omega|}).\]

Sinc / band-limited kernel.

For separable bandwidths \(W_j\), with \(\mathrm{sinc}(x)=\sin(x)/x\), \[\Phi(\Delta\mathbf{p})=\prod_{j=1}^{k}\mathrm{sinc}(W_j\Delta p_j), \qquad \omega_j \sim \mathrm{Unif}(-W_j,W_j)\ \text{ independently.}\]

Matérn family.

Matérn kernels interpolate between rough and smooth locality (Matérn 1960; Rasmussen and Williams 2006). Their spectral densities exhibit Student-\(t\)-like heavy tails, motivating heavy-tailed sampling schemes for \(\boldsymbol{\omega}\).
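A small NumPy sketch of samplers for the closed-form pairs above follows; function names are illustrative, and the Matérn case is omitted because its heavy-tailed sampler depends on the chosen parameterization.

```python
# Sketch of spectral samplers for the closed-form pairs above; the Matérn case is
# omitted since its heavy-tailed sampler depends on the chosen parameterization.
# Function names are illustrative. Each sampler returns a (D, k) frequency matrix.
import numpy as np

def gaussian_spectrum(rng, D, k, sigma):
    # RBF kernel: Phi(dp) = exp(-||dp||^2 / (2 sigma^2))
    return rng.normal(scale=1.0 / sigma, size=(D, k))

def cauchy_spectrum(rng, D, b):
    # 1D Cauchy kernel: Phi(dm) = 1 / (1 + (dm/b)^2); Laplace spectral density, scale 1/b
    return rng.laplace(scale=1.0 / b, size=(D, 1))

def sinc_spectrum(rng, D, W):
    # Band-limited kernel: Phi(dp) = prod_j sinc(W_j dp_j) with sinc(x) = sin(x)/x
    W = np.asarray(W, dtype=float)
    return rng.uniform(-W, W, size=(D, W.size))

# Empirical check: the sample mean of cos(dp . omega) approximates Phi(dp).
rng = np.random.default_rng(3)
dp, sigma = np.array([1.5, -0.5]), 2.0
omegas = gaussian_spectrum(rng, 100_000, 2, sigma)
print(np.cos(omegas @ dp).mean(), np.exp(-np.dot(dp, dp) / (2 * sigma**2)))
```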

Analytical Re-Assessment of Standard RoPE

Standard deterministic RoPE for \(k=1\) uses a fixed geometric frequency list (Su et al. 2021). Writing these frequencies as \(\theta_i=B^{-2i/d}\), the RoPE-modulated dot product can be expressed in the form \[S(m,n)=\sum_{i=1}^{D}\Big[A_i\cos(\Delta m\,\theta_i)+B_i\sin(\Delta m\,\theta_i)\Big],\] for coefficients \(A_i,B_i\) determined by the content vectors within each 2D block (cf. \(\eqref{eq:Si-decomp}\)).
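To make the grid concrete, here is a short sketch computing \(\theta_i=B^{-2i/d}\) with the usual base \(B=10000\); the zero-based indexing is an assumption, as conventions differ by one.

```python
# A concrete sketch of the geometric grid theta_i = B^{-2i/d} with the usual base
# B = 10000; zero-based indexing here is an assumption (conventions differ by one).
import numpy as np

def rope_frequencies(d, B=10000.0):
    i = np.arange(d // 2)
    return B ** (-2.0 * i / d)        # geometric progression from 1 down to ~1/B

print(rope_frequencies(8))            # [1.0, 0.1, 0.01, 0.001]
```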

Asymmetry and loss of multiplicative kernel form

RF-RoPE relies on symmetric sampling to guarantee \(\mathbb{E}[\sin(\Delta m\,\omega)]=0\) and hence a clean multiplicative modulation (Theorem \(2\)). In standard RoPE, the frequency list is strictly positive, so there is no mechanism enforcing cancellation of the sine term. As a result, standard RoPE does not, in general, admit a representation of the form \[S(m,n)\stackrel{?}{=}(\mathbf{q}_m^\top\mathbf{k}_n)\,\Phi(\Delta m)\] for a single real-valued scalar kernel \(\Phi\) without additional assumptions on the content-dependent coefficients.

Implied spectral density and slow decay (heuristic continuum limit)

A geometric grid \(\theta_i=B^{-2i/d}\) corresponds to an approximately log-uniform spacing. Treat \(i\) as continuous and solve for \(i(\theta)\): \[\theta = B^{-2i/d} \quad\Longrightarrow\quad i(\theta)= -\frac{d}{2}\log_B\theta.\] Thus, \[\left|\frac{di}{d\theta}\right| = \frac{d}{2}\cdot \frac{1}{|\ln B|}\cdot \frac{1}{\theta} \;\propto\; \frac{1}{\theta}.\] Interpreting \(\left|di/d\theta\right|\) as a continuum density suggests an implied heavy-tailed spectrum \(p(\theta)\propto 1/\theta\) over the covered band. Heavy tails place substantial mass on low frequencies, which tends to induce slowly decaying correlations in position space, consistent with RoPE’s empirically weak inherent locality prior (Su et al. 2021).
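A quick numerical illustration of the density-of-states claim: the grid is exactly uniform in \(\log\theta\), which is precisely what the \(1/\theta\) continuum density expresses.

```python
# Illustrative check that the geometric grid theta_i = B^{-2i/d} is uniformly
# spaced in log-frequency, the premise of the 1/theta density-of-states argument.
import numpy as np

d, B = 4096, 10000.0
theta = B ** (-2.0 * np.arange(d // 2) / d)
spacing = np.diff(np.log(theta))
print(np.allclose(spacing, spacing[0]))   # True: constant spacing in log(theta)
```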

Remark 5. The above \(p(\theta)\propto 1/\theta\) argument is a density-of-states approximation: it describes how densely the deterministic grid samples log-frequency. RF-RoPE replaces this implicit bias with an explicit, designer-chosen \(p(\omega)\) tied to a target kernel via Corollary \(3\).

Beyond Initialization: Learned Adaptations and Shifted Attention

RF-RoPE controls the initialization kernel. Yet Transformers learn attention heads that focus on fixed, nonzero offsets (e.g., induction heads) (Olsson et al. 2022). We now formalize why RF-RoPE kernels are necessarily centered and how learning can induce shifts.

A locality constraint at initialization

Theorem 6 (Peak at the origin). Let \(\Phi(\Delta\mathbf{p})=\mathbb{E}_{\boldsymbol{\omega}\sim p}[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})]\) for any probability distribution \(p\). Then \(\Phi\) attains its global maximum at \(\Delta\mathbf{p}=\mathbf{0}\), and \(\Phi(\Delta\mathbf{p})\le 1\) for all \(\Delta\mathbf{p}\).

Proof. \(\Phi(\mathbf{0})=\mathbb{E}[\cos 0]=1\). For any \(\Delta\mathbf{p}\), \(\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})\le 1\) pointwise, hence \(\Phi(\Delta\mathbf{p})=\mathbb{E}[\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega})]\le 1\). ◻

Thus, selecting a different \(p(\boldsymbol{\omega})\) can change the shape (width, smoothness, tails) of locality, but cannot move the peak away from \(\Delta\mathbf{p}=\mathbf{0}\) at initialization.

Learned phase shifts from \(W_Q\) and \(W_K\)

Consider a single block \(i\) and suppose training makes the query/key projections implement additional content-dependent rotations within that \(2\)D subspace. Concretely, assume there exist angles \(\phi_{q,i},\phi_{k,i}\) (possibly head- and block-specific) such that the pre-RoPE projected vectors satisfy \[\tilde{\mathbf{q}}_{m,i}\approx R(\phi_{q,i})\,\mathbf{v}_{m,i}, \qquad \tilde{\mathbf{k}}_{n,i}\approx R(\phi_{k,i})\,\mathbf{v}_{n,i},\] for some base content vectors \(\mathbf{v}_{m,i},\mathbf{v}_{n,i}\). Applying RF-RoPE yields \[\mathbf{q}'_{m,i}=R(\mathbf{p}_m\cdot\boldsymbol{\omega}_i)\tilde{\mathbf{q}}_{m,i}, \qquad \mathbf{k}'_{n,i}=R(\mathbf{p}_n\cdot\boldsymbol{\omega}_i)\tilde{\mathbf{k}}_{n,i}.\] Since rotations commute in 2D, \[\mathbf{q}'_{m,i}\approx R(\mathbf{p}_m\cdot\boldsymbol{\omega}_i+\phi_{q,i})\mathbf{v}_{m,i}, \qquad \mathbf{k}'_{n,i}\approx R(\mathbf{p}_n\cdot\boldsymbol{\omega}_i+\phi_{k,i})\mathbf{v}_{n,i}.\] Therefore, the relative phase entering the dot product is shifted by \[(\mathbf{p}_n-\mathbf{p}_m)\cdot\boldsymbol{\omega}_i + (\phi_{k,i}-\phi_{q,i}).\] Let \(\psi_i:=\phi_{k,i}-\phi_{q,i}\) (we write \(\psi_i\) to avoid overloading the kernel symbol \(\Phi\)). Then block \(i\) contributes approximately \[S_i(\Delta\mathbf{p})\approx C_i\cos(\Delta\mathbf{p}\cdot\boldsymbol{\omega}_i+\psi_i), \qquad C_i:=\mathbf{v}_{m,i}^\top\mathbf{v}_{n,i}.\]

Shifted-local attention as learned recentering

Suppose the training objective encourages attention to peak near \(\Delta\mathbf{p}=-\mathbf{L}\). Maximizing \(S(-\mathbf{L})\) over phases pushes \[-\mathbf{L}\cdot\boldsymbol{\omega}_i+\psi_i \approx 0 \quad\Longrightarrow\quad \psi_i \approx \mathbf{L}\cdot\boldsymbol{\omega}_i.\] Substituting gives the shifted form \[S(\Delta\mathbf{p}) \approx \sum_{i=1}^{D} C_i\cos\big((\Delta\mathbf{p}+\mathbf{L})\cdot\boldsymbol{\omega}_i\big),\] i.e., learning can recenter the effective kernel away from the origin even though initialization kernels (Theorem \(6\)) must be centered. This provides a simple mechanistic explanation for how RoPE-based models can develop fixed-offset heads such as induction heads (Olsson et al. 2022).
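A toy 1D illustration of this recentering, assuming unit content coefficients \(C_i=1\) and phases set exactly to \(\psi_i=L\,\omega_i\); this is not a trained model, just the cosine-sum geometry.

```python
# Toy illustration: with psi_i = L * omega_i, the cosine sum peaks near dp = -L.
import numpy as np

rng = np.random.default_rng(4)
D, L = 256, 8.0
omega = rng.normal(scale=0.5, size=D)     # any centrally symmetric frequency sample
psi = L * omega                           # "learned" phase shifts
dps = np.linspace(-20.0, 20.0, 401)
S = np.cos(dps[:, None] * omega[None, :] + psi[None, :]).sum(axis=1)
print(dps[np.argmax(S)])                  # approximately -L = -8.0
```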

Conclusion

RF-RoPE reframes RoPE frequency selection as kernel engineering. By sampling frequencies from a designed spectral density, RF-RoPE realizes any desired shift-invariant positive-definite kernel in expectation at initialization (Theorem \(2\)), with principled inverse design via Bochner’s theorem (Corollary \(3\)). We also characterized finite-sample variance (Proposition \(4\)) and clarified why initialization kernels are necessarily centered (Theorem \(6\)), while learned projections can induce shifted-local attention via phase shifts. Together, these results provide a rigorous foundation for designing and interpreting RoPE-style positional mechanisms.

References

Agarwal, Manvi, Changhong Wang, and Gaël Richard. 2025. “F-StrIPE: Fast Structure-Informed Positional Encoding for Symbolic Music Generation.” arXiv Preprint arXiv:2502.10491. https://arxiv.org/abs/2502.10491.
Bochner, Salomon. 1955. Harmonic Analysis and the Theory of Probability. University of California Press.
Choromanski, Krzysztof M., Valerii Likhosherstov, David Dohan, et al. 2021. “Rethinking Attention with Performers.” International Conference on Learning Representations. https://arxiv.org/abs/2009.14794.
Heo, Byeongho, Song Park, Dongyoon Han, and Sangdoo Yun. 2024. “Rotary Position Embedding for Vision Transformer.” European Conference on Computer Vision. https://arxiv.org/abs/2403.13298.
Li, Yang, Si Si, Gang Li, Cho-Jui Hsieh, and Samy Bengio. 2021. “Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding.” Advances in Neural Information Processing Systems. https://openreview.net/forum?id=R0h3NUMao_U.
Liutkus, Antoine, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang, and Gaël Richard. 2021. “Relative Positional Encoding for Transformers with Linear Complexity.” Proceedings of the 38th International Conference on Machine Learning, Proceedings of machine learning research, vol. 139: 7067–79. https://proceedings.mlr.press/v139/liutkus21a.html.
Matérn, Bertil. 1960. Spatial Variation: Stochastic Models and Their Application to Some Problems in Forest Surveys and Other Sampling Investigations. Statens Skogsforskningsinstitut.
Movahedi, Sajad, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, and Volkan Cevher. 2025. “Selective Rotary Position Embedding.” arXiv Preprint arXiv:2511.17388. https://arxiv.org/abs/2511.17388.
Olsson, Catherine, Nelson Elhage, Neel Nanda, et al. 2022. “In-Context Learning and Induction Heads.” arXiv Preprint arXiv:2209.11895. https://arxiv.org/abs/2209.11895.
Rahimi, Ali, and Benjamin Recht. 2007. “Random Features for Large-Scale Kernel Machines.” Advances in Neural Information Processing Systems 20. https://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press. https://gaussianprocess.org/gpml/chapters/RW.pdf.
Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv Preprint arXiv:2104.09864. https://arxiv.org/abs/2104.09864.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” arXiv Preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30: 5998–6008. https://arxiv.org/abs/1706.03762.