Models that can stand the test of space and time.
MatterWave AI is a research lab building efficient AI architectures. We design models that carry more intelligence through less — less energy, less space, less time. Our first contribution, the Geometric Rotation Transformer, proves that structure can substitute for scale.
Every parameter has to earn its place. Most labs build bigger models by adding more numbers to unconstrained grids. We'd rather find the right shape.
The Geometric Rotation Transformer (GRT) replaces standard dense weight matrices with learned rotations: orthogonal cascades that can turn a signal but never amplify or collapse it.
The core idea: force each projection into a cascade of Givens rotations wired in an FFT butterfly pattern (O(D log D) parameters and compute). A standard weight matrix can be anything, but most of that freedom is wasted relearning structure from scratch. GRT trades that freedom for structure.
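A minimal sketch of that idea, assuming a power-of-two dimension D and the textbook FFT butterfly pairing (coordinate i paired with i + stride, stride doubling each stage). The names butterfly_pairs and rotation_cascade are hypothetical, and the real GRT routing, stage count, and parameter layout may differ.

```python
import math

def butterfly_pairs(d, stride):
    """Index pairs (i, i + stride) of a textbook FFT butterfly stage."""
    pairs = []
    for block in range(0, d, 2 * stride):
        for i in range(block, block + stride):
            pairs.append((i, i + stride))
    return pairs

def rotation_cascade(x, angles):
    """Apply one Givens rotation per pair per stage.

    angles[s][p] is the angle for pair p at stage s (illustrative layout,
    not GRT's checkpoint format).
    """
    x = list(x)
    d = len(x)
    stride = 1
    for stage_angles in angles:
        for (i, j), theta in zip(butterfly_pairs(d, stride), stage_angles):
            c, s = math.cos(theta), math.sin(theta)
            x[i], x[j] = c * x[i] - s * x[j], s * x[i] + c * x[j]
        # Double the stride each stage, wrapping back to 1 when it reaches d.
        stride = stride * 2 if stride * 2 < d else 1
    return x

d = 8
n_stages = int(math.log2(d))
angles = [[0.1 * (s + 1) * (p + 1) for p in range(d // 2)] for s in range(n_stages)]
x = [1.0, -2.0, 0.5, 3.0, -1.5, 2.5, 0.0, 1.0]
y = rotation_cascade(x, angles)

norm_in = math.sqrt(sum(v * v for v in x))
norm_out = math.sqrt(sum(v * v for v in y))
n_params = sum(len(a) for a in angles)  # (D / 2) * log2(D) angles
print(norm_in, norm_out, n_params)
```

In this toy version the input and output norms agree, because every stage is a product of disjoint 2-D rotations, and the angle count grows as (D/2)·log₂D rather than D².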
To make this structured representation dynamic, we introduce a Clifford-feature coupling. This coupling reads the length, direction, and signed area of each coordinate pair directly from the input signal to modulate the rotation angles on the fly.
A 2-D rotation is a phase shift; the cascade routes them like an FFT.
That single idea buys structural guarantees — unit-magnitude weights that hold before a single step of training.
During training, gradient descent adjusts the rotation angle for each pair at each stage, and the scale parameter for each coordinate. The butterfly routing is fixed wiring with zero learned parameters. The Clifford coupling weights are learned. They determine how each pair's geometry affects its neighbors' rotations. The model's knowledge lives in angles and scales, not in dense grids of numbers.
A dense projection is a D×D grid of numbers. As a rotation cascade it is a handful of angles — O(D log D) instead of O(D²). The compactness is in the math, not trimmed off afterward.
Every pair carries its own geometry: its energy, its direction, its twist. Those features feed back to steer the rotations, so the same coordinates are treated differently depending on what they actually carry.
A rotation can only turn a signal — it can never amplify it into noise or shrink it to nothing. Magnitude lives in a separate, explicit knob, not hidden inside a matrix. Training stays stable from the first step.
Every stage does two things to each pair: rotate by a learned angle (green arc), then scale by a learned factor (teal arrow). The rotation preserves radius. The scale changes it. This figure shows the fundamental unit of computation.
This is one stage of one projection. In block 0's Q projection, learned angles range from -0.81 to +1.03 rad and scales from 0.38 to 1.64. Here is pair 0 at stage 0, using the actual learned angle of -0.075 rad (-4.3 degrees) and scales (0.781, 0.926) from a trained GRT model.
Input pair (3, 4) became (2.481, 3.576). The rotation turned it (radius stayed 5). The scale shrank it (radius 4.35). The angle was -0.045 rad: -0.075 from what the model learned, +0.03 from what the input carried. These are real checkpoint values from a trained GRT model.
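The arithmetic of this example can be replayed in a few lines. This is a sketch under two assumptions: the rotation uses the standard counterclockwise matrix (cosθ, −sinθ; sinθ, cosθ), and the two scales apply one per coordinate after the rotation. The function name rotate_then_scale is illustrative, not the model's API.

```python
import math

def rotate_then_scale(pair, theta, scales):
    """Rotate a 2-D pair by theta, then scale each coordinate separately."""
    x, y = pair
    c, s = math.cos(theta), math.sin(theta)
    rx, ry = c * x - s * y, s * x + c * y
    return (rx, ry), (scales[0] * rx, scales[1] * ry)

learned_angle = -0.075   # rad, quoted in the text
input_nudge = 0.03       # rad, contribution from the input, quoted in the text
theta = learned_angle + input_nudge

rotated, scaled = rotate_then_scale((3.0, 4.0), theta, (0.781, 0.926))
radius_after_rotation = math.hypot(*rotated)
radius_after_scale = math.hypot(*scaled)
print(rotated, scaled, radius_after_rotation, radius_after_scale)
```

With the rounded angle and scales shown in the text, the result agrees with the quoted output to within about 0.001 per coordinate; small differences are expected because the displayed parameters are themselves rounded.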
A standard transformer block has three learned projections: Q (queries, what to look for), K (keys, what to match against), and V (values, what to retrieve). These are matrix multiplies: each takes the input vector and produces a new vector. Attention then compares queries to keys, weights the values, and an output projection O projects the result back. After attention comes a feedforward network (FFN) that expands and contracts the signal.
Each projection runs across multiple heads. The model splits the d=256 vector into 4 heads of d=64 each. Every head has its own independent Q, K, V, and O parameters. Head 0 might learn one pattern while head 1 learns something completely different. They run in parallel and concatenate back. The mixer then crosses information between heads.
But strip away the names and every one of those projections is the same operation: a dense matrix multiply, y = Wx. W is a grid of D×D numbers, and each entry is a single scalar weight — a magnitude knob that answers exactly one question: how much? A weight can make a feature larger or smaller, but it has no native sense of direction. It cannot turn a signal, only resize it. The model's entire vocabulary is magnitude: D² independent multipliers, each free to become anything.
That freedom is also the liability. Generically, those matrices live in GL(N), the group of all invertible N×N matrices, with no structural constraint on what they can do to a signal's size. Nothing stops one layer from amplifying it a hundredfold, or shrinking it toward zero. Stack twelve such layers and you meet the two failures every transformer trainer knows: exploding gradients, where magnitude runs away and the signal blows up, and vanishing gradients, where repeated shrinking flattens the signal until nothing can learn. Residual connections, layer normalization, and careful initialization all exist to fight this, but they are patches on a primitive that is, by design, free to explode or collapse.
GRT keeps the architecture — same Q, K, V, O, attention, FFN, residuals, layer stacking — and swaps only the primitive. Instead of D² independent magnitude knobs, it treats the vector as real-number pairs, and on each pair it learns two things at once: a rotation (an angle, a direction) and a scale (a magnitude). The rotation preserves length by construction, so it can turn a signal without distorting it; the scale is an explicit, separate, positive number. Direction and magnitude, stored together. The math is different. The architecture is identical — and the note below pins down the one question that naturally follows: which numbers, exactly, is that pair?
Each learned per-pair transform (rotate by an angle θ, then scale by a positive r) is a single element of the group SO(2) × ℝ₊, provided both coordinates of the pair share the same scale r. That group is isomorphic to ℂ*, the multiplicative group of nonzero complex numbers, under the polar map (r, θ) ↔ r·e^(iθ). So the slice of complex numbers GRT touches is ℂ* = ℂ ∖ {0}, and only its multiplication, never its addition. GRT never forms the sum of two complex pairs; "addition-like" behavior comes from the real residual stream and normalization, not from the complex field.
And the imaginary unit i is not actually used anywhere. There is no complex datatype, no complex autograd, and no conjugation. The isomorphism is exact, so multiplying by r·e^(iθ), expanded term-by-term, collapses to the same four real entries — (cosθ, −sinθ; sinθ, cosθ) times r — that the Givens rotation already is. i is scaffolding: it packages two real equations as one, but produces no number that is not real in the end. Complex numbers describe GRT faithfully; they are never an ingredient of it.
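The equivalence can be checked numerically. The sketch below uses Python's complex type only as a verification device (the text says GRT itself never uses one) and compares it with the purely real rotate-and-scale form. Both function names are illustrative, and a single shared scale r per pair is assumed.

```python
import cmath
import math

def complex_form(x, y, r, theta):
    """Multiply x + iy by r * e^(i*theta)."""
    z = (r * cmath.exp(1j * theta)) * complex(x, y)
    return z.real, z.imag

def real_form(x, y, r, theta):
    """Apply r * (cos, -sin; sin, cos) to (x, y) using real arithmetic only."""
    c, s = math.cos(theta), math.sin(theta)
    return r * (c * x - s * y), r * (s * x + c * y)

a = complex_form(3.0, 4.0, 0.8, -0.045)
b = real_form(3.0, 4.0, 0.8, -0.045)
print(a, b)
```

The two results agree to floating-point precision, which is the sense in which the complex description adds no new numbers.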
That was one stage. The full Q projection chains 11 of them together. The only thing that changes between stages is the stride: how far apart the paired coordinates sit.
Before the first projection, PearlNorm normalizes all 128 pair radii to the same value. This is why every pair starts at radius 0.79. The signal enters the block as a uniform circle, every pair equal. The cascade's job is to break that uniformity and turn it into information.
The block walks through four attention projections, a one-stage mixer, then two feedforward cascades:
Here is the rhythm: PearlNorm resets all pair radii to equal. The rotation cascade breaks that equality: each pair learns its own scale, so by the final stage the radii are spread out, not uniform. Some pairs grew, some shrank. That spread IS the information. Then PearlNorm runs again before the next sublayer and resets everything back to equal. Reset, diverge, reset, diverge. The rotation sets direction without changing radius; the scale encodes magnitude; PearlNorm prevents magnitude from running away across layers.
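The reset step might look like the sketch below. The text only says that PearlNorm equalizes pair radii, so everything else here is an assumption: adjacent coordinates form a pair, each pair is rescaled independently, and the target radius is a plain parameter (0.79 is the value quoted above). The name pearl_norm_sketch is hypothetical.

```python
import math

def pearl_norm_sketch(x, target_radius=0.79, eps=1e-12):
    """Rescale every adjacent coordinate pair to the same radius.

    Assumed behavior, not the released implementation: angles are kept,
    radii are all set to target_radius.
    """
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        radius = math.hypot(a, b)
        k = target_radius / max(radius, eps)  # eps guards an all-zero pair
        out.extend((k * a, k * b))
    return out

x = [3.0, 4.0, -0.2, 0.1, 10.0, 0.0]
y = pearl_norm_sketch(x)
radii = [math.hypot(y[i], y[i + 1]) for i in range(0, len(y), 2)]
angles_in = [math.atan2(x[i + 1], x[i]) for i in range(0, len(x), 2)]
angles_out = [math.atan2(y[i + 1], y[i]) for i in range(0, len(y), 2)]
print(radii, angles_in, angles_out)
```

Under these assumptions every pair leaves with the same radius and its original angle, which is the "uniform circle" the cascade then pulls apart.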
Q projection runs 11 stages.
Each stage shuffles pairs, reads Clifford features, nudges angles, rotates, and scales. The signal starts as a uniform circle (PearlNorm radius 0.79) and the learned parameters sculpt it into structure.
The output is a set of query vectors, one per head. Queries encode what each token is looking for. You just saw one pair traced through all 11 stages in the concrete example above. The plot shows all 128 pairs doing the same thing simultaneously.
K runs 11 stages.
Same cascade structure, different learned angles and scales. K produces keys, which determine what each token attends to. K learns different parameters than Q because keys and queries serve different roles in attention.
The key cascade is structurally identical to Q: 11 stages, butterfly routing, Clifford coupling, per-pair scales. But the learned angles are completely different. The model discovered through training that keys need to point in different directions than queries for attention to work. Two cascades with the same wiring, solving different problems.
V runs 11 stages.
Same cascade again, different parameters. V produces values, which carry the content that attention will retrieve. V learns yet another set of angles and scales because values carry meaning, not questions or labels.
After V, the model has three representations of the input: queries (what to look for), keys (what to match against), and values (what to pull out). Attention combines them.
Softmax attention weights V.
This is standard transformer math. QKᵀ/√d determines how much each token attends to every other token. The values get weighted-summed by these scores. No rotation stages here, no butterfly, no Clifford coupling. Just dot products and softmax.
If Q asks the question, K holds the label that answers it. The dot product between a query and a key tells the model how relevant two tokens are to each other. Softmax turns those scores into weights that sum to one. Those weights multiply V, the values, to produce the attention output. V carries the content that gets pulled out.
Nothing about attention changes in GRT. Three rotation cascades produced Q, K, and V. The matching and retrieval is the same as any transformer.
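For reference, the unchanged attention step can be written in plain Python. This is the textbook single-head form with toy inputs; masking, batching, and multiple heads are left out, and the text does not say whether the released model uses a causal mask.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(Q[0])
    out, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Toy data: two query tokens, three key/value tokens, d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
print(weights)
print(out)
```

In GRT the only difference would be where Q, K, and V come from: rotation cascades instead of dense matrix multiplies.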
O projection runs 11 stages.
The attention output gets projected back to d=256 through another rotation cascade. This is the fourth and final cascade in the attention sublayer.
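After O, the attention result is added to the input via a residual connection: x + attention(x). Each rotate-and-scale stage has a positive determinant (a rotation has determinant 1 and the scales are positive), so every projection has det > 0. That guarantee can break at the Jacobian level here: the residual Jacobian is I + J_f, and det(I + J_f) can go negative.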
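A two-dimensional toy case shows how the sign can flip. The Jacobian below is hand-picked for illustration (a rotation by π followed by positive per-coordinate scales) and stands in for a single linear rotate-and-scale map, not for the full attention sublayer.

```python
import math

def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

theta = math.pi
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]   # rotation, det = 1
S = [[2.0, 0.0],
     [0.0, 0.5]]                            # positive scales, det = 1
J = matmul2(S, R)                           # roughly diag(-2, -0.5)
I_plus_J = [[1.0 + J[0][0], J[0][1]],
            [J[1][0], 1.0 + J[1][1]]]       # roughly diag(-1, 0.5)

print(det2(J), det2(I_plus_J))
```

Here det(J) is positive while det(I + J) is about −0.5: adding the identity moves one eigenvalue across zero even though the map itself is a rotation with positive scales.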
Residual +, then PearlNorm.
The attention output is added to the input. PearlNorm runs again before the FFN. All pair radii get equalized once more. This is the last normalization before the feedforward network.
The signal is now a mix of the original input and what attention found. PearlNorm resets it to a clean starting point for the FFN.
Mixer runs 1 stage across all heads.
Q, K, V, and O each operated within a single head (d_head=64, 32 pairs). The four heads ran in parallel but never talked to each other. The mixer fixes that. It runs one rotation stage at d_model=256 (128 pairs), shuffling information across head boundaries.
After the mixer, what head 0 learned is visible to heads 1, 2, and 3. The FFN gets the full picture, not four isolated slices.
FFN expand runs 21 stages.
The feedforward network is where the model does most of its thinking. Expand grows the signal from d=256 to d=1024. That means 128 pairs become 512 pairs. But PearlNorm ran at d=256, not d=1024. So the 256-dimensional signal gets padded with 768 zeros to reach 1024. Only 128 pairs carry signal at the start. The other 384 pairs are empty.
Through 21 rotation stages, the butterfly shuffles coordinates so the 128 active pairs mix into the 384 empty ones. The model fills the new space with learned transformations of the input. In a standard transformer this step alone is a dense matrix of 256 x 1024 = 262,144 parameters. Here it is 21 rotation stages.
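The bookkeeping for the expand step can be sketched as follows. Two things are assumptions: the zeros are appended at the end of the vector, and each stage stores one angle per pair plus one scale per coordinate (the text describes angles per pair and scales per coordinate but gives no exact parameter count). The resulting cascade figure is therefore an estimate, not a published number.

```python
d_model, d_ff, n_stages = 256, 1024, 21

def zero_pad(x, d_out):
    """Append zeros until the vector has d_out coordinates (assumed layout)."""
    return list(x) + [0.0] * (d_out - len(x))

padded = zero_pad([1.0] * d_model, d_ff)

active_pairs = d_model // 2                # pairs that carry signal at the start
total_pairs = d_ff // 2
empty_pairs = total_pairs - active_pairs

dense_params = d_model * d_ff              # dense expand matrix
# ASSUMPTION: per stage, one angle per pair and one scale per coordinate.
cascade_params = n_stages * (total_pairs + d_ff)

print(len(padded), active_pairs, empty_pairs, dense_params, cascade_params)
```

Under that assumed layout the cascade would hold roughly an eighth of the dense matrix's parameters.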
RadialGELU gates the signal.
Between expand and contract, RadialGELU applies GELU to each pair's radius while preserving its angle. This decides which pairs carry signal forward and which get suppressed. It is the only nonlinearity in the FFN.
Relative to one another, some pairs come through stronger and others are damped, but no pair's direction is distorted. It is what lets the model decide which features matter and which to silence.
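One way to read "GELU on the radius, angle preserved" is sketched below. It assumes the exact erf-based GELU, adjacent coordinates as pairs, and no learned bias or pre-scale on the radius; the released RadialGELU may differ on any of these. The name radial_gelu_sketch is hypothetical.

```python
import math

def gelu(r):
    """Exact GELU: r * Phi(r), with Phi the standard normal CDF."""
    return 0.5 * r * (1.0 + math.erf(r / math.sqrt(2.0)))

def radial_gelu_sketch(x, eps=1e-12):
    """Replace each pair's radius by gelu(radius) and keep its angle."""
    out = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        radius = math.hypot(a, b)
        k = gelu(radius) / max(radius, eps)  # eps guards an all-zero pair
        out.extend((k * a, k * b))
    return out

x = [3.0, 4.0, 0.1, -0.1]
y = radial_gelu_sketch(x)
radii_in = [math.hypot(x[i], x[i + 1]) for i in range(0, len(x), 2)]
radii_out = [math.hypot(y[i], y[i + 1]) for i in range(0, len(y), 2)]
angles_in = [math.atan2(x[i + 1], x[i]) for i in range(0, len(x), 2)]
angles_out = [math.atan2(y[i + 1], y[i]) for i in range(0, len(y), 2)]
print(radii_in, radii_out)
```

In this bias-free reading the gate multiplies each radius by a factor between one half and one: large-radius pairs pass almost unchanged and small-radius pairs are roughly halved.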
FFN contract runs 21 stages.
The 512-pair signal shrinks back to 128 pairs (d=1024 to d=256). The model compresses what it learned back into the residual stream's dimension. Another 21 rotation stages, different learned angles.
What the model discovered during expand gets distilled into the output. The same butterfly routing carries information from all 512 pairs back down to 128.
Residual +. That is ONE block.
The FFN output is added back to the residual stream. The block is done. Q (11) + K (11) + V (11) + O (11) + Mixer (1) + FFN expand (21) + FFN contract (21) = 87 rotation stages per block. Its output feeds into the next block, which repeats everything with its own learned parameters. There are 12 blocks. That is 1,044 rotation stages total. Zero dense matrices.
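The stage arithmetic above is easy to check. The dictionary keys are just labels for the sublayers named in the text.

```python
stages_per_block = {
    "Q": 11, "K": 11, "V": 11, "O": 11,
    "Mixer": 1,
    "FFN expand": 21, "FFN contract": 21,
}
n_blocks = 12

per_block = sum(stages_per_block.values())
total_stages = per_block * n_blocks
print(per_block, total_stages)
```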
Every rotation stage from block 0 stacked on one plot. Z=0 is the embedding projected onto a circle by PearlNorm. Then Q (11), K (11), V (11), O (11), Mixer (1), FFN expand (21), FFN contract (21). Attention and mixer stages run over 128 pairs (4 heads × 32 at d_head=64); the FFN stages run over 512 pairs (d_ff=1024). The signal flows through the entire block in one continuous view.
Twelve blocks, each one full pass.
One block does three things — attention mixes information across tokens, the mixer crosses it between heads, and the FFN transforms it per token. Stack twelve and the model runs that cycle twelve times. Each block reads the stream, transforms it, writes back; the next block works from what the previous one left.
The released model: d_model=256, four heads, twelve blocks, eleven rotation stages per attention projection, and twenty-one in the FFN. The stage counts come from ablation — about two-to-three times log₂ of the pair count.
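The "two-to-three times log₂" rule can be checked against the stated stage counts. The sketch assumes the relevant pair count is the one each cascade actually operates on: 32 pairs per head for an attention projection and 512 pairs for the FFN, both taken from the walkthrough above.

```python
import math

attn_pairs, attn_stages = 32, 11    # per-head pairs at d_head = 64
ffn_pairs, ffn_stages = 512, 21     # pairs at d_ff = 1024

ratio_attn = attn_stages / math.log2(attn_pairs)
ratio_ffn = ffn_stages / math.log2(ffn_pairs)
print(ratio_attn, ratio_ffn)
```

Both ratios land between two and three under that reading; using all 128 pairs for attention instead would give a ratio below two.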
What stays identical across all twelve: the wiring. The butterfly schedule, the PearlNorm rhythm, the order of the sublayers — set by the architecture, not learned. What's different in every block: the angles, the scales, the coupling weights, and the residual gates. Each block learns its own parameters from scratch. The shape is shared; the contents are not.
A block is the structure we just walked through, configured. The same shape repeats twelve times; what changes is the learned parameters inside it. Every number below comes from the config — a choice, not a constant.
Normalize, process, gate, add.
Every sublayer — attention, mixer, FFN — follows the same four-step cycle. PearlNorm normalizes a copy of the stream, so the sublayer sees a clean input no matter how deep we are. The sublayer does its work: rotates pairs, reads coupling features, scales. A learned gate decides how strongly to write the result back. Then the residual add accumulates it into the stream.
This cycle is why a twelve-deep stack is trainable at all. The rotations alone are well-conditioned — they can't amplify or collapse a signal — but that isn't enough across twelve blocks of accumulated writes. Normalize-a-copy gives each sublayer a stable input. The gate stops any one sublayer from dominating. The residual add keeps the history. Orthogonality is the floor; normalize-process-gate-add is the structure built on it.
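The four-step cycle can be sketched as a single function. Everything here is illustrative: pearl_norm_sketch assumes adjacent coordinates form pairs, toy_sublayer is a stand-in rotate-and-scale rather than a real attention or FFN sublayer, and the gate is treated as one scalar per sublayer, which the text does not confirm.

```python
import math

def pearl_norm_sketch(x, target_radius=1.0, eps=1e-12):
    """Rescale each adjacent pair to the same radius (assumed form of PearlNorm)."""
    out = []
    for i in range(0, len(x), 2):
        k = target_radius / max(math.hypot(x[i], x[i + 1]), eps)
        out.extend((k * x[i], k * x[i + 1]))
    return out

def toy_sublayer(x, theta=0.3, scale=1.2):
    """Stand-in sublayer: rotate every pair by theta, then scale it."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for i in range(0, len(x), 2):
        out.extend((scale * (c * x[i] - s * x[i + 1]),
                    scale * (s * x[i] + c * x[i + 1])))
    return out

def gated_residual_step(stream, sublayer, gate):
    view = pearl_norm_sketch(stream)                        # normalize a copy
    update = sublayer(view)                                 # process
    return [v + gate * u for v, u in zip(stream, update)]   # gate, then add

original = [3.0, 4.0, -1.0, 0.5]
stream = list(original)
# Approximate block-zero gates quoted in the text: attention, mixer, FFN.
for gate in (0.99, 0.97, 0.79):
    stream = gated_residual_step(stream, toy_sublayer, gate)
print(stream)
```

The raw stream is never normalized in place: only the copy handed to the sublayer is, which is what lets the residual path keep its history.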
The cycle runs three times per block — attention, mixer, FFN — and the block repeats twelve times. That's thirty-six rounds of reset, transform, gate, accumulate from input to output.
The 256-dimensional signal splits into four groups of 64 coordinates. Each group is a head. Every head runs its own Q, K, V, and O rotation cascades with its own learned angles. The four heads share wiring but nothing else — four parallel processors looking at four different slices of the pair stream.
Attention between two tokens, per pair, comes down to the cosine of the angle between that token's Q pair and the other's K pair, scaled by the two radii. So a head's "strategy" is which pairs it aligns for which tokens: the schedule of Q and K angles it learned. Align the angles on a pair and two tokens attend strongly there. Push them ninety degrees apart and that pair contributes nothing to the score.
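The per-pair picture follows from the 2-D dot product: for one pair, q·k equals |q|·|k|·cos(Δ), where Δ is the angle between the two pairs. The sketch below checks the aligned, perpendicular, and opposite cases; the helper names are illustrative.

```python
import math

def from_polar(radius, angle):
    """Build a 2-D pair from a radius and an angle in radians."""
    return (radius * math.cos(angle), radius * math.sin(angle))

def pair_score(q_pair, k_pair):
    """Contribution of one coordinate pair to the Q·K dot product."""
    return q_pair[0] * k_pair[0] + q_pair[1] * k_pair[1]

q = from_polar(1.0, 0.4)
aligned = pair_score(q, from_polar(1.0, 0.4))
orthogonal = pair_score(q, from_polar(1.0, 0.4 + math.pi / 2))
opposite = pair_score(q, from_polar(1.0, 0.4 + math.pi))

# With non-unit radii the cosine is weighted by both magnitudes.
weighted = pair_score(from_polar(2.0, 0.4), from_polar(3.0, 0.9))
print(aligned, orthogonal, opposite, weighted)
```

The full attention logit for a head is the sum of these contributions over its pairs, divided by √d before the softmax.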
In a standard transformer, a head's Q projection is a fixed matrix — the same map for every input. In GRT each rotation angle is nudged by what the pair carries (its energy, direction, and twist), read through the Clifford coupling. One head can run different strategies on different inputs. That's "content-adaptive" in practice. (The FFN has no coupling — it's the same rotate-and-scale for every input. Only attention adapts.)
The heatmap below shows real attention weights from the trained model on "the earth is blue". Head 2 attends almost exclusively to self. Head 3 spreads broadly. The model didn't design these patterns; it discovered them.
Every pair's Q and K end up at different angles because Q and K use independent learned rotations. Below: the angle gap for all 128 pairs across all 12 blocks. Blue means aligned (high attention). Red means opposite (suppressed).
Between blocks, information moves through a 256-dimensional residual stream — 128 pairs of angle and radius. Every block reads from it and writes back. Because each write is a rotation plus a scale, a single sublayer can only turn pairs and resize them. It cannot, in one step, collapse the signal to nothing or blow it up to noise. That's a stronger guarantee than a dense matrix gives.
PearlNorm runs on a copy. The sublayer sees a normalized input — every pair at the same radius — so depth doesn't drift the input distribution. But the raw, unnormalized stream passes straight through the residual connection. Local normalization for stability; global accumulation for memory. The stream keeps everything; each sublayer just borrows a normalized view.
Each sublayer's write is scaled by a learned gate before it's added. In block zero, attention writes at about 0.99 — nearly all of itself, because attention refines what's already there. The mixer at about 0.97. The FFN at about 0.79 — it's the largest transform in the block, so the model learns to gate it down to keep the stack stable. The gates drift across the twelve blocks as the model rebalances who contributes.
This is the part only depth produces. Block zero's stream is almost the embedding — phases and magnitudes. Block eleven's stream is the model's final representation, shaped by thirty-six rounds of normalize-process-gate-add. Across the stack, pair radii differentiate: some pairs grow, some shrink, as the model decides which features matter. Pair angles settle into stable configurations. The stream doesn't get overwritten; it gets sculpted. By the end, the same 256 dimensions carry something the embedding never could.
A single sublayer can only rotate and rescale — it cannot catastrophically distort what's already there in one step. Over twelve blocks the stream is reshaped, not erased.
GRT is a transformer with one thing changed: every dense weight matrix is replaced by a cascade of rotations. The routing, the attention, the residuals, the depth — all standard. The primitive is different. A rotation is one angle; it can turn a signal but never break it. That single swap is what makes the model space-efficient by math, content-adaptive where it needs to be, and stable by construction.
Benchmarks coming soon.