
The KV Cache Gets Squeezed from Both Sides

Published 2026-04-07 by Toone AI Digest
ai-papers · research · deep-learning · kv-cache · inference · llm-optimization
Toone AI Digest
AI Digest Editor

If you've tried running OpenClaw on a 24GB GPU with a 32K context window, you know the drill: model loads fine, reasoning starts, and then—OOM. Your GPU drowns in KV cache because every single token your model has ever thought about gets stored in memory. For a single AIME math problem, that's 30,000+ tokens of intermediate reasoning, ballooning to gigabytes of cached key-value pairs.

Today brought two papers that attack this problem from completely opposite angles. One uses trigonometry to compress the cache 10.7x. The other eliminates the need for most cache entries entirely by injecting knowledge without using any tokens at all. Together, they paint a picture of where LLM inference is headed—and it's not "buy more GPUs."

TriAttention: Trigonometric KV Cache Compression

TL;DR: TriAttention discovers that before RoPE rotates your queries and keys, they cluster around fixed centers in a predictable, stable way. It exploits this with a trigonometric scoring function that predicts which cached tokens actually matter—without relying on the unstable post-RoPE attention scores that existing methods use. Result: 10.7x KV memory reduction on Qwen3-8B while matching full attention accuracy on AIME25.

The Problem with Current Approaches

Every existing KV cache compression method—SnapKV, R-KV, H2O—asks the same question: "which tokens did recent queries pay attention to?" But here's the catch: they ask this in post-RoPE space, where queries are already rotated by their position. Position 500 and position 5000 produce fundamentally different query vectors even for identical content. So when you use recent queries to judge which keys matter, you're making decisions based on position-dependent snapshots that shift during long reasoning chains. Your importance estimates go sideways.
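To see why post-RoPE scoring is position-dependent, here's a minimal NumPy sketch (our own simplification of RoPE to per-pair rotations; all names are ours): the same key content, rotated to two different positions, produces different scores against the same query.

```python
import numpy as np

def rope_rotate(vec, pos, theta=10000.0):
    # Standard RoPE: rotate consecutive dimension pairs by position-dependent angles.
    d = vec.shape[-1]
    out = vec.copy()
    for i in range(0, d, 2):
        ang = pos / theta ** (i / d)
        c, s = np.cos(ang), np.sin(ang)
        x, y = vec[i], vec[i + 1]
        out[i], out[i + 1] = x * c - y * s, x * s + y * c
    return out

rng = np.random.default_rng(0)
content = rng.standard_normal(64)                 # identical key content...
query = rope_rotate(rng.standard_normal(64), pos=5001)

k_near = rope_rotate(content, pos=5000)           # ...cached at two positions
k_far = rope_rotate(content, pos=500)

# Post-RoPE scores for the same content differ with position, so importance
# estimates based on recent queries drift as generation advances.
print(query @ k_near, query @ k_far)
```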

R-KV, the current state-of-the-art (NeurIPS 2025), adds redundancy detection on top. It helps—but it's still building on shaky ground.

The Core Insight: Pre-RoPE Q/K Concentration

The TriAttention team—MIT, NVIDIA, and Zhejiang University, led by Song Han (the StreamingLLM/QServe guy) and including Tianfu Fu (a core GPT-5 contributor at OpenAI)—looked at what happens before RoPE does its thing.

What they found: in the pre-RoPE space, Q and K vectors cluster tightly around fixed, non-zero centers. This isn't noise—it's a model-intrinsic property that holds across architectures (GQA, MLA), domains (math, coding, general), and even calibration data. Google's homepage HTML works as calibration data just as well as ShareGPT conversations. Yes, really. (46.2% vs 46.7% on AIME24.)

They measure this clustering using the Mean Resultant Length (R) from directional statistics—the same math used to analyze wind directions and compass bearings. When R is close to 1, the vectors are tightly clustered. For most attention heads, they are.
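The mean resultant length is just the norm of the average unit vector. A quick NumPy illustration with synthetic data (clustered vs. isotropic vectors; the thresholds here are ours, for illustration only):

```python
import numpy as np

def mean_resultant_length(vectors):
    # R = norm of the mean unit vector: 1 = perfectly aligned, 0 = fully spread out.
    units = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.linalg.norm(units.mean(axis=0))

rng = np.random.default_rng(0)
center = rng.standard_normal(64)
clustered = center + 0.1 * rng.standard_normal((1000, 64))  # tight around a fixed center
diffuse = rng.standard_normal((1000, 64))                   # no preferred direction

print(mean_resultant_length(clustered))  # close to 1
print(mean_resultant_length(diffuse))    # close to 0
```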

From Concentration to Trigonometry

Here's where it gets clever. When Q and K are concentrated around fixed centers, the attention logit between a query at position m and a key at position n simplifies to a trigonometric series over their distance Δ = m − n:

logit(Δ) ≈ Σ_f [a_f · cos(ω_f · Δ) + b_f · sin(ω_f · Δ)]

Translation: the model has built-in distance preferences. Some attention heads strongly prefer nearby tokens (recency bias), others have periodic patterns. These preferences are baked into the Q/K centers and are stable across positions—unlike the jittery attention scores existing methods rely on.
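A small sketch of such a distance-preference curve, with made-up coefficients (in the paper, the coefficients follow from the pre-RoPE Q/K centers and each head's RoPE frequencies; the values below are purely illustrative):

```python
import numpy as np

def trig_logit(delta, amps_cos, amps_sin, freqs):
    # Distance-preference score as a trigonometric series over delta = m - n.
    return sum(a * np.cos(w * delta) + b * np.sin(w * delta)
               for a, b, w in zip(amps_cos, amps_sin, freqs))

# Hypothetical per-frequency coefficients for one attention head.
freqs = [0.001, 0.01, 0.1]
a, b = [1.0, 0.5, 0.25], [0.2, 0.1, 0.05]

deltas = np.arange(0, 4096)
scores = trig_logit(deltas, a, b, freqs)
# A head like this favors nearby tokens: the score at delta = 0 exceeds
# the score at large distances, i.e. a recency bias baked into the centers.
print(scores[0], scores[1000])
```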

TriAttention combines two scoring components: S_trig (the trigonometric distance-preference score) and S_norm (a norm-based salience score, weighted by 1−R so it matters more in heads where concentration is weaker). Every 128 tokens, it prunes back to the KV budget. For GQA models, it z-score normalizes per head and aggregates via max.
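Put together, the selection step might look like this sketch (function and variable names are ours, not the repo's API; the real implementation is in the TriAttention codebase):

```python
import numpy as np

def prune_kv(s_trig, s_norm, R, budget):
    """Sketch of one pruning pass (invoked every 128 generated tokens).

    s_trig, s_norm: [heads, tokens] per-head scores; R: [heads] concentration.
    """
    # Norm-based salience is down-weighted by 1 - R: it matters more
    # in heads whose pre-RoPE concentration is weaker.
    score = s_trig + (1.0 - R)[:, None] * s_norm
    # Z-score per head, then aggregate heads in a GQA group via max.
    z = (score - score.mean(1, keepdims=True)) / (score.std(1, keepdims=True) + 1e-6)
    group = z.max(axis=0)
    keep = np.argsort(group)[-budget:]   # retain only the top-`budget` tokens
    return np.sort(keep)

rng = np.random.default_rng(0)
heads, tokens = 8, 512
keep = prune_kv(rng.standard_normal((heads, tokens)),
                rng.standard_normal((heads, tokens)),
                rng.uniform(0.7, 1.0, heads), budget=128)
print(len(keep))  # 128 token indices survive this pass
```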

The Numbers

On AIME25 with Qwen3-8B at a KV budget of 2048 tokens:

  • TriAttention: 32.9%
  • R-KV: 17.5%
  • SnapKV: ~15%

That's a 15.4 percentage-point gap over R-KV at the same budget. At budget 4096, TriAttention hits 43.3%—actually surpassing full attention (40.8%), because pruning redundant tokens can improve reasoning quality.

At comparable accuracy to full attention: 2.5x throughput or 10.7x KV memory reduction. On MATH 500, throughput jumps to 6.3x (1,405 vs 223 tokens/second) with comparable accuracy.

The memory retention benchmark is equally telling. In a recursive DFS simulation (testing whether the model can backtrack through nested call stacks), R-KV catastrophically fails at depth 16—dropping from 61% to 31%. TriAttention matches full attention up to depth 16.

Tianfu Fu tweeted about running OpenClaw 32B on a single RTX 4090 with TriAttention. Without it: OOM. With it: smooth long-context reasoning on 24GB of VRAM.

Code: github.com/WeianMao/triattention (181 stars, Apache 2.0, vLLM integration included).

Knowledge Packs: Zero-Token Knowledge Delivery

TL;DR: Instead of stuffing knowledge into the prompt (RAG-style, burning tokens), Knowledge Packs precomputes the KV cache for knowledge documents and injects them directly at query time—zero tokens consumed. Achieves identical accuracy to RAG with up to 95% token savings. Also discovers that getting the chat template wrong causes 6-7pp accuracy drops, which likely explains why prior work on KV injection got inconsistent results.

The Causal Mask Equivalence

The insight is almost embarrassingly simple. In a causal transformer, the KV cache from a forward pass on text F is mathematically identical to what a joint forward pass on F+q would produce. This follows directly from the causal attention mask—future tokens can't see past tokens, so appending query q doesn't change F's cached representations.

This means you can:
1. Run a forward pass on your knowledge document F
2. Save the resulting KV cache to disk
3. At query time, load the saved cache and run only the query
4. Get the exact same answer as if you'd put F + q in the prompt

No fine-tuning. No approximation. Mathematically exact.
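The equivalence can be checked end to end with a toy multi-layer causal attention stack (our own minimal construction, not the paper's kvpack code): because K/V rows for the document never depend on the query, a cache precomputed from the document alone reproduces the joint pass exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 2
Wq = rng.standard_normal((L, d, d)) * 0.1
Wk = rng.standard_normal((L, d, d)) * 0.1
Wv = rng.standard_normal((L, d, d)) * 0.1

def forward(x, caches=None):
    """Toy L-layer causal attention stack.

    caches: optional per-layer (K, V) from a previous pass over the document;
    x is then only the new (query) tokens. Returns output and updated caches.
    """
    new_caches = []
    for l in range(L):
        q, k, v = x @ Wq[l], x @ Wk[l], x @ Wv[l]
        if caches is not None:
            k = np.vstack([caches[l][0], k])
            v = np.vstack([caches[l][1], v])
        off = k.shape[0] - x.shape[0]              # tokens already in the cache
        logits = q @ k.T / np.sqrt(d)
        for i in range(x.shape[0]):                # causal mask: no future tokens
            logits[i, off + i + 1:] = -np.inf
        w = np.exp(logits - logits.max(1, keepdims=True))
        w /= w.sum(1, keepdims=True)
        x = x + w @ v                              # residual connection
        new_caches.append((k, v))
    return x, new_caches

doc = rng.standard_normal((10, d))    # knowledge document F
qry = rng.standard_normal((3, d))     # query q

joint, _ = forward(np.vstack([doc, qry]))   # F + q in one prompt
_, cache = forward(doc)                     # precompute F's KV cache alone
cached_out, _ = forward(qry, caches=cache)  # inject the cache, run only q

print(np.allclose(joint[-3:], cached_out))  # True: outputs are identical
```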

Andrey Pustovit (independent researcher, no institutional affiliation) validates this with zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B using HotpotQA. Both models produce byte-identical outputs whether knowledge arrives via prompt tokens or injected KV cache.

The Chat Template Trap

Here's the finding that matters most for anyone trying to implement this. If you split the chat template incorrectly when constructing the KV cache, accuracy drops 6.0 percentage points on Qwen and 5.5pp on Llama. Duplicate <|im_start|> tokens from naive splitting cost an additional 1.5pp on Llama. Bridge questions (requiring cross-referencing multiple facts) suffer worst: 7.4pp drops.

This is likely why prior attempts at KV cache injection—including TurboRAG and some Cache-Augmented Generation implementations—produced inconsistent results. The method is exact, but only if you format it right. TurboRAG's solution was to fine-tune the model to tolerate bad formatting (requiring 30 A100 GPUs). Knowledge Packs' solution: just get the formatting right.
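A minimal illustration of the trap, using hypothetical ChatML-style markers (the real template depends on the model's tokenizer): the cached prefix must end mid-turn so the query completes that same turn, while a naive split wraps the query in a second user turn and duplicates the header.

```python
# Hypothetical ChatML-style template fragments (illustrative only).
SYS = "<|im_start|>system\nYou are helpful.<|im_end|>\n"
DOC = "<|im_start|>user\nContext: [document text]\n"
Q = "What does the context say?<|im_end|>\n<|im_start|>assistant\n"

# Correct: the cached prefix (SYS + DOC) ends mid-turn; Q completes it.
correct = SYS + DOC + Q

# Naive: formatting document and query as separate user turns duplicates
# the <|im_start|>user header and inserts a stray <|im_end|>.
naive = SYS + DOC + "<|im_end|>\n" + "<|im_start|>user\n" + Q

print(correct.count("<|im_start|>user"), naive.count("<|im_start|>user"))
```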

Token Savings at Scale

The economics are straightforward:

| Retrieval Steps | RAG Tokens | KV Tokens | Savings |
|----------------|-----------|-----------|--------|
| 1 | 176 | 35 | 80% |
| 2 | 299 | 35 | 88% |
| 3 | 438 | 35 | 92% |
| 5 | 739 | 35 | 95% |

RAG adds ~140-150 tokens per retrieval step. KV stays at 31-35 tokens regardless. At enterprise scale (millions of queries per day), that 95% token savings translates directly to inference cost reduction.
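The savings column follows directly from the flat KV cost; a quick check against the table's numbers:

```python
# RAG token cost grows with each retrieval step; KV injection stays flat.
rag_tokens = {1: 176, 2: 299, 3: 438, 5: 739}
kv_tokens = 35

savings = {steps: round(100 * (1 - kv_tokens / tokens))
           for steps, tokens in rag_tokens.items()}
print(savings)  # {1: 80, 2: 88, 3: 92, 5: 95} — matches the table
```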

Value-Space Behavioral Steering

The paper's second contribution is more speculative but intriguing. RoPE rotates Q and K vectors to encode position—but leaves V (value) vectors untouched. This asymmetry creates an opportunity: contrastive modifications to value vectors steer behavior without destroying positional coherence, while the same arithmetic on key vectors causes collapse.

The authors demonstrate this with a "defensive coding" style experiment. Mid-layer values (33-66% of model depth) carry the steering effect. Two independent steering directions are near-orthogonal (cosine similarity ≈ 0.0003 on Qwen), meaning you can compose multiple behavioral modifications. At α ≤ 0.7, you can deliver knowledge AND steer behavior simultaneously without accuracy loss.
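A sketch of what value-space steering looks like mechanically, with an entirely hypothetical steering direction (the paper derives its directions contrastively from paired completions; the layer-range check follows the mid-depth finding above):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 64, 12

# Hypothetical contrastive steering direction (our construction): in the paper
# it is the value-space difference between styled and plain completions.
d_style = rng.standard_normal(d_model)
d_style /= np.linalg.norm(d_style)

def steer_values(values, layer, alpha=0.7):
    # Shift value vectors only in mid-depth layers (33-66% of the stack),
    # where the steering effect is reported to be carried. RoPE never
    # touches V, so positional coherence is preserved.
    if n_layers // 3 <= layer < 2 * n_layers // 3:
        return values + alpha * d_style
    return values

v = rng.standard_normal((10, d_model))
print(np.allclose(steer_values(v, layer=0), v))   # early layer: untouched
print(np.allclose(steer_values(v, layer=5), v))   # mid layer: shifted
```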

It's roughly half as effective as putting the instruction in the prompt (~1.93 vs ~3.67 on a behavioral scale), but it comes at zero token cost and composes with knowledge delivery.

Limitations

The paper is refreshingly honest:
- KV caches are model-specific (no transfer between architectures)
- Steering tested on only 15 coding tasks—generalization unknown
- Routing evaluation uses synthetic fictional facts, not realistic knowledge bases
- The kvpack library is brand new with minimal community validation

So What? The KV Cache Convergence

These papers aren't random coincidences from the same day. They're two sides of the same coin, and the connection runs deeper than "both involve KV cache."

TriAttention says: "You've got 30,000 tokens of reasoning chain cached in memory. Let me keep only the ones that matter, using the model's own distance preferences."

Knowledge Packs says: "Why are you putting 10,000 tokens of knowledge into the prompt at all? Just inject the KV cache directly—zero tokens consumed."

The deeper connection is RoPE itself. TriAttention exploits the stability of pre-RoPE Q/K centers for compression scoring. Knowledge Packs exploits the fact that RoPE doesn't touch values for behavioral steering. Both papers are essentially reading the same mathematical structure and finding different applications.

Combined, the vision is radical: precompute your knowledge into KV packs (zero tokens consumed), let the model reason freely, and compress the reasoning chain's KV cache with TriAttention on the fly. For an agent like OpenClaw running a 32B model on an RTX 4090, this could mean the difference between OOM crashes and smooth multi-step reasoning over a knowledge base.

For practitioners right now:

  • Running reasoning models locally and hitting OOM? TriAttention has a vLLM integration you can try today. One environment variable sets the KV budget.
  • Building RAG pipelines and tired of burning tokens on context? The kvpack library offers precomputed KV injection—but test your chat templates carefully, because the formatting sensitivity is real.
  • Doing both (agentic AI + knowledge bases + long reasoning)? These approaches compose naturally, and that combination is where the real gains live.

The KV cache has been LLM inference's dirty secret—the thing nobody wanted to deal with because "just add more VRAM" worked until it didn't. With reasoning models generating 30K+ tokens per question and agent frameworks hitting 128K context windows, that era is over. These papers are part of a wave—alongside Google's TurboQuant, NVIDIA's KVTC, and Expected Attention—that's finally making KV cache management a solved-ish problem.

The question isn't whether you'll use KV cache optimization. It's which flavor.

