GRAPE

Group Representational Position Encoding
A unified framework that subsumes RoPE and ALiBi via norm-preserving rotations and unipotent lifts, guaranteeing exact relativity and streaming cacheability.

Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan,
Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Princeton University  •  IIIS, Tsinghua University  •  UCLA
Lie Group Theory • $SO(d)$ Rotations • $GL(d)$ Unipotent • Unified RoPE & ALiBi • Exact Relative Law

Abstract

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE brings together two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $SO(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $GL$.

In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ acts as $G(n) = \exp(n\omega L)$ with a rank-2 skew generator $L$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the planes are canonical coordinate pairs.

In Additive GRAPE, additive logits arise as rank-1 unipotent actions in a lifted homogeneous space, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law. Altogether, GRAPE supplies a principled design space for positional geometry in long-context models.

The Unified Framework

GRAPE unifies positional encodings via the General Relative Law of one-parameter subgroups: $$G(t-s) = G(s)^{-1}G(t).$$ Whether the action is a rotation in $SO(d)$ or a translation in a lifted $GL(d+k)$, this algebraic property ensures that attention scores depend solely on the relative offset $t-s$.
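
As a quick numerical illustration (ours, using a plain 2-D rotation as the one-parameter subgroup):

import numpy as np

def G(t, omega=0.1):
    # One-parameter rotation subgroup G(t) = exp(t * omega * J), with J = [[0, -1], [1, 0]]
    c, s = np.cos(omega * t), np.sin(omega * t)
    return np.array([[c, -s], [s, c]])

s, t = 3.0, 11.0
# Attention only ever sees G(s)^{-1} G(t), which depends on the offset t - s alone
assert np.allclose(np.linalg.inv(G(s)) @ G(t), G(t - s))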

Figure 1: Visual overview of the GRAPE framework, contrasting rotational and unipotent actions.

Multiplicative GRAPE

Operation: Rotation ($SO(d)$)
Generator: $L$ (rank-2 skew-symmetric)

$$G(n) = \exp(n \cdot \omega \cdot L), \qquad L = \mathbf{ab}^\top - \mathbf{ba}^\top$$

Recovers RoPE exactly; extends to learned bases and non-commuting mixtures.

Additive GRAPE

Operation: Translation ($GL(d+k)$)
Generator: $A$ (low-rank nilpotent)

$$G_{\text{add}}(n) = I + n \cdot \omega \cdot A, \qquad A^2 = 0 \implies \text{unipotent}$$

Recovers ALiBi and FoX exactly; extends to path-integral biases.

Multiplicative GRAPE: $SO(d)$

We model positions as norm-preserving rotations generated by a rank-2 skew-symmetric matrix $L$. Rather than computing a dense matrix exponential ($\mathcal{O}(d^3)$), we derive a closed-form Rodrigues-type formula whose action on a query or key vector costs only $\mathcal{O}(d)$ time.

Rank-2 Generator & Closed Form

// Generator L defined by vectors a, b
L = a b^T - b a^T;    s = sqrt( ||a||^2 ||b||^2 - (a^T b)^2 )

// Closed-form Matrix Exponential (Rodrigues)
exp(L) = I + (sin(s)/s) * L + ((1 - cos(s))/s^2) * L^2
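
For concreteness, here is a minimal NumPy sketch (ours, not a reference implementation) that builds the rank-2 generator, evaluates the closed form at position $n$, and checks it against a dense matrix exponential:

import numpy as np
from scipy.linalg import expm

d, n, omega = 64, 7, 0.01
rng = np.random.default_rng(0)
a, b = rng.standard_normal(d), rng.standard_normal(d)

L = np.outer(a, b) - np.outer(b, a)                   # rank-2 skew generator
s = np.sqrt((a @ a) * (b @ b) - (a @ b) ** 2)         # invariant rotation "speed"

t = n * omega                                         # position enters only through t = n * omega
G = np.eye(d) + (np.sin(t * s) / s) * L + ((1 - np.cos(t * s)) / s**2) * (L @ L)

assert np.allclose(G, expm(t * L))                    # matches the dense matrix exponential
assert np.allclose(G.T @ G, np.eye(d))                # norm-preserving: G is orthogonal

Note that applying $G$ to a vector never needs the full matrix: $Lx = \mathbf{a}(\mathbf{b}^\top x) - \mathbf{b}(\mathbf{a}^\top x)$, so each application costs $\mathcal{O}(d)$.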

Recovery of RoPE

RoPE is recovered exactly when $d/2$ commuting rank-2 generators act on disjoint canonical coordinate planes with a log-uniform spectrum.

$$L_{\text{RoPE}} = \sum_{i=1}^{d/2} \theta_i L(e_{2i-1}, e_{2i})$$
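
As a sanity check, the following sketch (ours; the log-uniform spectrum below is the standard RoPE choice) confirms that exponentiating this sum of commuting plane generators reproduces independent $2 \times 2$ rotations on each coordinate plane:

import numpy as np
from scipy.linalg import expm

d, n = 8, 5
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)     # log-uniform RoPE spectrum

# Commuting rank-2 generators L(e_{2i-1}, e_{2i}) on disjoint coordinate planes
L = np.zeros((d, d))
for i, th in enumerate(theta):
    p, q = 2 * i, 2 * i + 1
    L[p, q], L[q, p] = th, -th

G = expm(n * L)                                       # G(n) = exp(n * L_RoPE)
for i, th in enumerate(theta):
    c, s = np.cos(n * th), np.sin(n * th)
    # Each plane rotates independently by angle n * theta_i (RoPE, up to rotation direction)
    assert np.allclose(G[2*i:2*i+2, 2*i:2*i+2], [[c, s], [-s, c]])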

GRAPE extends this geometry to learned commuting subspaces and compact non-commuting mixtures that capture cross-subspace feature coupling at negligible cost.

Additive GRAPE: Unipotent Lift in $GL$

To produce additive biases (like ALiBi) within a group-theoretic framework, we employ a homogeneous lift. We augment the space to $\mathbb{R}^{d+k}$ and use a nilpotent generator $A$ (where $A^2=0$).

The Unipotent Action

// Generator A is strictly nilpotent (rank-1 or low-rank)
A^2 = 0  ==>  exp(n w A) = I + n w A

// Application via Homogeneous Coordinates
G_add(n) = [ I_d   n w u ]
           [ 0^T   1     ]
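
A minimal sketch (ours, taking $k = 1$ so the lift is to $\mathbb{R}^{d+1}$, matching the block matrix above) shows that this unipotent family composes and inverts exactly, which is what the relative law and streaming cacheability rely on:

import numpy as np

d, omega = 4, 0.5
u = np.random.default_rng(1).standard_normal(d)       # rank-1 direction of the lift

def G_add(n):
    # Homogeneous lift: identity on R^d plus the shear column n * omega * u
    G = np.eye(d + 1)
    G[:d, d] = n * omega * u
    return G

# One-parameter unipotent subgroup: G_add(s)^{-1} G_add(t) = G_add(t - s) exactly
assert np.allclose(G_add(3) @ G_add(4), G_add(7))
assert np.allclose(np.linalg.inv(G_add(5)), G_add(-5))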

Exact Relative Law via Inverse Transpose

In the general linear group, the transpose is not the inverse. To preserve relativity, keys are transformed via the inverse transpose:

$$\tilde{q}_i = G_{\text{add}}(i)\hat{q}_i, \quad \tilde{k}_j = G_{\text{add}}(j)^{-\top}\hat{k}_j$$ $$\implies \tilde{q}_i^\top \tilde{k}_j = q_i^\top k_j + \text{bias}(j-i)$$

This formulation recovers ALiBi exactly when using a rank-1 nilpotent in $\mathbb{R}^{d+2}$. It also proves that the Forgetting Transformer (FoX) is an exact instance of Additive GRAPE with content-gated decay.
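
The mechanism can be checked end to end with a rank-1 nilpotent in $\mathbb{R}^{d+2}$. The sketch below is ours; the slope value and the placement of the extra coordinates are illustrative choices, not the paper's exact parameterization:

import numpy as np

d, omega, slope = 4, 1.0, 0.0625                      # illustrative ALiBi-style slope
rng = np.random.default_rng(2)
q, k = rng.standard_normal(d), rng.standard_normal(d)

# Lift to R^{d+2}: the two extra coordinates carry the bias channel
q_hat = np.concatenate([q, [1.0, 0.0]])
k_hat = np.concatenate([k, [0.0, 1.0]])

# Rank-1 nilpotent generator (A^2 = 0) coupling the query channel to the key channel
A = np.zeros((d + 2, d + 2))
A[d + 1, d] = -slope
G_add = lambda n: np.eye(d + 2) + n * omega * A

i, j = 9, 2
q_t = G_add(i) @ q_hat                                # query side: G_add(i)
k_t = np.linalg.inv(G_add(j)).T @ k_hat               # key side: inverse transpose
assert np.isclose(q_t @ k_t, q @ k - slope * (i - j)) # content score + linear distance bias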

Path Integrals & Contextual Forms

GRAPE extends naturally to Path-Integral Additive GRAPE (GRAPE-AP), where the scalar offset is replaced by a path sum of edge potentials $\psi_h(t, l)$. This allows for strictly causal, content-adaptive biases that retain the exact relative law structure.

$$b_h(t, j) := \sum_{l=j+1}^t \psi_h(t, l)$$
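
As a toy illustration (ours; the function name and index convention are not from the paper), the path-sum bias for a single query row is a reversed cumulative sum of edge potentials, and constant potentials collapse to an ALiBi-style linear ramp:

import numpy as np

def path_integral_bias(psi_row):
    # psi_row[l-1] holds the edge potential psi(t, l) for l = 1, ..., t;
    # returns b(t, j) = sum_{l = j+1}^{t} psi(t, l) for j = 0, ..., t-1
    return np.cumsum(psi_row[::-1])[::-1]

t = 6
psi_row = np.full(t, 0.0625)                          # constant edge potentials
b = path_integral_bias(psi_row)
assert np.allclose(b, 0.0625 * (t - np.arange(t)))    # ALiBi-style ramp: slope * (t - j)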

Combining Multiplicative and Additive GRAPE yields a powerful, unified encoding that supports both high-fidelity rotation (preservation of norm) and flexible additive biases (length extrapolation and forgetting) within a single efficient streaming implementation.

Citation

If you find this work useful, please cite:

@article{zhang2025grape,
  title   = {Group Representational Position Encoding},
  author  = {Zhang, Yifan and Chen, Zixiang and Liu, Yifeng and Qin, Zhen and
             Yuan, Huizhuo and Xu, Kangping and Yuan, Yang and Gu, Quanquan and
             Yao, Andrew Chi-Chih},
  journal = {arXiv preprint arXiv:2512.07805},
  year    = {2025}
}