Google Research · published March 2026

TurboQuant
Redefining AI efficiency

A new online vector quantization algorithm that compresses the KV cache to 3 bits with no quality loss, cuts memory by 6x, and speeds up attention by up to 8x.


6x+

memory compression

8x

attention speedup (H100)

3-bit

lossless compression

Latest news

Latest developments around TurboQuant

After the paper was published, discussion quickly shifted to implementation, deployment, and the economics of long-context inference.

Latest updates

March 2026

Someone built an MLX implementation of TurboQuant in 25 minutes with GPT-5.4

A developer reported completing an MLX implementation of TurboQuant in 25 minutes using GPT-5.4.

March 2026

Google Research officially introduced TurboQuant

The publication presented TurboQuant as an online quantization method close to the theoretical optimum for KV-cache compression and vector search.

March 2026

Community integration discussions started immediately

Open-source discussion quickly focused on how to integrate TurboQuant into inference stacks such as llama.cpp and related runtimes.

March 2026

Attention shifted from theory to deployment economics

The central question became whether lossless 3-bit KV compression could change the memory and latency budget of long-context services.

Impact

Market impact

$MU and $SNDK were hit hard at the open

After $GOOGL published TurboQuant, $MU and $SNDK came under heavy pressure from the opening bell.

A practical read on what TurboQuant changes

One expert view on what is likely already deployed, what still remains hard, and why the paper matters even if most easy gains are gone.

Independent Industry Expert

LLM systems and inference engineering

TurboQuant matters less because it saves a bit more memory, and more because it marks where KV-cache compression starts to hit a real boundary.

KV cache has long been the largest source of memory consumption in large-model inference. What this paper does, in essence, is compress that data in a way that approaches the information-theoretic optimum. It is not just lowering precision. It is reallocating information density: ordinary regions are represented with extremely low bits, while outliers retain higher precision. At the same time, the method stops treating values independently and instead encodes them at the vector level, which fits the inner-product structure of attention itself.

The critical point is that its error is already close to the information-theoretic lower bound, the Shannon limit. That means compression efficiency is already near the theoretical ceiling. The paper reports roughly 4x to 4.5x compression with little visible performance loss. The result is strong, but it also suggests there is not much room left for further compression without harming model quality.

Given how large-tech internal R&D usually works, the optimization effects implied by the paper were likely absorbed in stages before publication. Low-bit quantization has already been widely deployed, from int8 to int4 and beyond, across mainstream inference stacks. Separate handling for outliers is also not new: methods such as SmoothQuant and AWQ are already doing closely related things. KV-cache compression itself, sliding windows, and hierarchical cache designs are already standard practice in large-model systems.

What likely has not fully landed yet is the most extreme part of the paper: vector quantization and coding schemes that move closer to the information-theoretic limit. The barrier is not theory, but implementation. These methods are less GPU-friendly, harder to keep low-latency, and more difficult to stabilize and generalize in production, so they may take much longer to ship.

If I had to estimate roughly how much of the paper's benefit is already reflected in deployed systems, it would look something like this: the earliest KV cache starts at 1x cost; basic quantization gets to around 2x to 3x compression; adding outlier-aware handling can reach about 3x to 4x; the paper pushes that further to around 4x to 4.5x. In other words, most of the easy gains have already been captured. What remains is smaller in upside and increasingly expensive to realize.

The reason is straightforward. Early compression removes redundancy. Later compression starts to hit effective information, so every additional step has a much higher chance of hurting model capability. Error no longer degrades smoothly; beyond a certain point, it can worsen quickly. Engineering difficulty also does not grow linearly. It rises sharply.

You can infer from current model behavior that mainstream systems are already using many of these ideas. Better long-context behavior, lower inference cost, and stable performance all suggest that KV-cache efficiency has already been significantly improved. A team at Google's level has very likely already deployed low-bit quantization, outlier handling, and at least part of KV-cache compression.

That means if this Google paper has an impact on storage, much of that impact has probably already shown up. The parts that have not shown up yet will likely be harder to implement than the gains that came before.

More importantly, the significance of the paper is not just how much more memory it saves. It gives us a boundary. KV-cache compression is approaching its limit, and the remaining room is narrow. The next major change is unlikely to come from compression alone. It will require finding a different path.

Core innovation

Why TurboQuant looks like a breakthrough result

TurboQuant is not just another compression tool. It is an online quantization framework that operates close to the information-theoretic limit while remaining data-oblivious and accelerator-friendly.

Classical methods (e.g. PQ)

Require dataset-specific training
Store many full-precision normalization constants
Long indexing time
Visible accuracy loss

TurboQuant

Random rotation plus polar transform (PolarQuant)
1-bit residual correction (QJL) removes normalization overhead
Near-zero indexing time
Matches the 32-bit baseline on reported benchmarks

AISTATS 2026

PolarQuant

Polar-transform core that eliminates normalization overhead

arXiv: 2502.02617 →

AAAI 2025

QJL

1-bit unbiased inner-product estimator

ACM DL →

ICLR 2026

TurboQuant

Two-stage design with near-optimal distortion

arXiv: 2504.19874 →

Technical background

Why TurboQuant matters

A quick refresher on the limits of vector quantization and the pressure from the KV cache

1. The classical vector quantization problem

Vector quantization compresses high-dimensional vectors while minimizing distortion. The theoretical bounds are well understood, but classical methods remain far from them.

Distortion formulas

MSE: D_MSE = E[||x - x̂||²]

Inner product: D_prod = E[|⟨y,x⟩ - ⟨y,x̂⟩|²]

Theory

MSE lower bound: D_MSE ≥ 1/4^b

Inner-product lower bound: D_prod ≥ (||y||² / d) · 1/4^b

Classical approaches such as PQ remain well above these bounds.
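To get a feel for the scale of these bounds, the sketch below evaluates them for a few bit widths. It assumes unit-norm x, which is the setting in which bounds of this form are usually stated. The helper names are illustrative and do not come from the paper.

```python
def mse_lower_bound(bits: int) -> float:
    """Lower bound on D_MSE at b bits per coordinate: 1 / 4^b."""
    return 4.0 ** (-bits)


def inner_product_lower_bound(y_norm_sq: float, dim: int, bits: int) -> float:
    """Lower bound on D_prod: (||y||^2 / d) * 1 / 4^b."""
    return (y_norm_sq / dim) * 4.0 ** (-bits)


# Each extra bit lowers both bounds by a factor of 4.
for b in (1, 2, 3, 4):
    print(b, mse_lower_bound(b), inner_product_lower_bound(1.0, 128, b))
```

The dimension 128 and the unit-norm query are example values chosen only to make the output concrete.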

2. The KV-cache bottleneck in LLMs

In decoder-only transformers, every token adds a key/value pair. With long context windows, this memory cost becomes dominant.

Memory estimate

memory ≈ 2 × L × d × 2 bytes (FP16)

128K context + 7B model: tens of GB

KV-cache share of total memory: 80%+
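The estimate can be turned into a small calculator. This sketch reads L as the context length and treats the formula as a per-layer cost, so it multiplies by the layer count. The 32 layers and hidden size 4096 are assumed values typical of a 7B model with full multi-head attention, not figures from the page.

```python
def kv_cache_bytes(seq_len, hidden_dim, n_layers, bits_per_value=16.0):
    """Bytes for keys + values across all layers, for a single sequence."""
    # Factor 2: one key vector and one value vector per token per layer.
    values = 2 * seq_len * hidden_dim * n_layers
    return values * bits_per_value / 8


GIB = 2**30
fp16 = kv_cache_bytes(128 * 1024, 4096, 32)
low_bit = kv_cache_bytes(128 * 1024, 4096, 32, bits_per_value=3.5)
print(fp16 / GIB, low_bit / GIB)  # 64.0 14.0
```

Under these assumptions the FP16 cache for one 128K-token sequence is 64 GiB, consistent with the "tens of GB" figure. Models that use grouped-query attention store fewer key/value heads and would come out smaller.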

What TurboQuant changes

✓ No training or fine-tuning
✓ 3.5 bits per channel with no measurable quality change
✓ LongBench on par with FP32
✓ Makes long-context inference more realistic at the edge

3. Applications in vector search

In ANN systems such as FAISS, TurboQuant improves recall while keeping indexing cost close to zero.

Better recall

Outperforms PQ and RaBitQ on GloVe

Indexing time ≈ 0

Suited to large vector stores

Core Principle

TurboQuant as a two-stage algorithm

TurboQuant = PolarQuant for main compression + QJL for residual correction

PolarQuant: polar-coordinate transform

The key idea is to remove per-block normalization overhead. PolarQuant rotates the vector randomly so coordinates follow a concentrated distribution that is easy to quantize.

Coordinate distribution

f_X(x) = Γ(d/2) / (√π · Γ((d-1)/2)) × (1 - x²)^((d-3)/2)

where x ∈ [-1, 1]
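As a sanity check on this density, the sketch below integrates it numerically and confirms that it has total mass 1 and variance 1/d, the variance of one coordinate of a uniformly random unit vector. The function name and the choice d = 64 are illustrative.

```python
import math
import numpy as np


def coordinate_density(x, d):
    """Density of one coordinate of a uniformly random unit vector in R^d (d >= 3)."""
    # Work in log space so the Gamma ratio stays finite for large d.
    log_const = math.lgamma(d / 2) - 0.5 * math.log(math.pi) - math.lgamma((d - 1) / 2)
    return np.exp(log_const + 0.5 * (d - 3) * np.log1p(-np.square(x)))


d = 64
n = 200_000
x = (np.arange(n) + 0.5) / n * 2.0 - 1.0  # cell midpoints, strictly inside (-1, 1)
dx = 2.0 / n
f = coordinate_density(x, d)
total_mass = float(np.sum(f) * dx)        # midpoint-rule integral of f
variance = float(np.sum(x**2 * f) * dx)   # midpoint-rule integral of x^2 f
print(total_mass, variance, 1.0 / d)
```

The variance shrinking as 1/d is what "concentrated distribution" means here: in high dimension almost all the mass sits within a few multiples of 1/√d of zero.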

Group the d-dimensional vector into pairs to obtain radii and angles

Apply recursive polar transforms on the radii

Quantize only the angles, whose distribution is highly concentrated
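The three steps above can be sketched as a recursive pairing. This is an illustrative reading of the description, not the authors' implementation, and it assumes the dimension is a power of two. It produces one final radius and d - 1 angles, and the inverse shows that nothing is lost before quantization.

```python
import numpy as np


def polar_transform(x):
    """Recursively convert a length-2^k vector into one radius plus d - 1 angles."""
    radii = np.asarray(x, dtype=np.float64)
    angles = []
    while radii.size > 1:
        a, b = radii[0::2], radii[1::2]
        angles.append(np.arctan2(b, a))  # one angle per pair
        radii = np.hypot(a, b)           # pair radii feed the next level
    return float(radii[0]), angles


def inverse_polar_transform(radius, angles):
    """Rebuild the vector by expanding radii level by level, coarsest first."""
    radii = np.array([radius])
    for theta in reversed(angles):
        out = np.empty(2 * radii.size)
        out[0::2] = radii * np.cos(theta)
        out[1::2] = radii * np.sin(theta)
        radii = out
    return radii


x = np.random.default_rng(0).standard_normal(8)
radius, angles = polar_transform(x)
x_back = inverse_polar_transform(radius, angles)
print(np.allclose(x, x_back), sum(a.size for a in angles))
```

After the first level every radius is non-negative, so the deeper angles fall in [0, π/2]. A quantizer would replace each angle with a code. That step is left out here.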

Why it works

No per-block full-precision constants
Overhead drops to zero.
Near-lossless beyond 4.2x compression
Stronger than conventional baselines.
Gaussian-like coordinates in high dimension
Supports optimal scalar quantizers such as Lloyd-Max.

Results

The numbers make the case

Benchmarks on Gemma, Mistral, and Llama-3.1-8B

KV-cache compression benchmarks

50.06

LongBench score

3.5 bits = full cache

100

Needle In A Haystack

perfect from 4K to 104K

6x+

memory reduction

sharp drop in cost

8x

attention speed

H100 in 4-bit mode

Benchmark	TurboQuant 3.5 bits	TurboQuant 2.5 bits	Full cache
LongBench	50.06	49.44	50.06
Needle In A Haystack	100	99.8	100
ZeroSCROLLS	best	near best	baseline
RULER	best	near best	baseline
L-Eval	best	near best	baseline

Vector search benchmark (GloVe d=200)

Recall 1@k

TurboQuant: best

PQ: lower

RaBitQ: middle

Indexing time

TurboQuant: ≈ 0

PQ (codebook training): long

RaBitQ: middle

Comparison with alternatives

Method	Training	Unbiased	Compression	Speedup
TurboQuant	No	Yes	6x+	8x
KIVI	Calibration	No	4x	4x
SnapKV	Fine-tuning	No	2-4x	2-4x
DuQuant	Calibration	Partial	4x	4x

100K context memory projections

Predicted total VRAM before and after TurboQuant, plus the RTX 4090 count needed to hold each setup.

Assumes RTX 4090 nominal VRAM of 24GB, with practical allocation rounded up after framework overhead.

Model	Weights	Pure model VRAM	Total VRAM before	Total VRAM after	4090s before	4090s after	Change
ChatGLM-4 (9B)	BF16	18 GB	19.8 GB	18.3 GB	1	1	Extra headroom on a single 4090.
ChatGLM-4 (9B)	INT8	9 GB	10.8 GB	9.3 GB	1	1	Still single-card, with more buffer.
ChatGLM-4 (9B)	INT4	5 GB	6.8 GB	5.3 GB	1	1	Very comfortable single-card fit.
Qwen-2.5 (32B)	BF16	64 GB	69 GB	64.8 GB	3	3	Savings help, but not enough to drop a GPU.
Qwen-2.5 (32B)	INT8	32 GB	37 GB	32.8 GB	2	2	More margin on a 2x4090 node.
Qwen-2.5 (32B)	INT4	18 GB	23 GB	18.8 GB	2	1(-1)	Pulled back under the single-4090 limit.
Llama-3.1 (70B)	BF16	140 GB	150 GB	141.7 GB	7	6(-1)	Drops one RTX 4090 at 100K context.
Llama-3.1 (70B)	INT8	70 GB	80 GB	71.7 GB	4	3(-1)	Material hardware cost reduction.
Llama-3.1 (70B)	INT4	38 GB	48 GB	39.7 GB	3	2(-1)	Brings 70B into a practical dual-4090 envelope.
Mixtral 8x22B (141B MoE)	BF16	282 GB	288 GB	283 GB	13	13	MoE keeps KV share relatively small.
Mixtral 8x22B (141B MoE)	INT8	141 GB	147 GB	142 GB	7	7	Lower pressure, but same card class.
Mixtral 8x22B (141B MoE)	INT4	75 GB	81 GB	76 GB	4	4	Useful slack without a node count change.
DeepSeek-R1 (671B MoE)	FP8	700 GB	712 GB	702 GB	31	30(-1)	Saves one 4090 even at hyperscale.
DeepSeek-R1 (671B MoE)	INT4	350 GB	362 GB	352 GB	16	15(-1)	Still too large for small nodes, but one card disappears.

Usage

From paper to production

How to integrate TurboQuant into a real stack

Current status

The paper provides the theory and pseudocode, but there is no official open-source implementation yet. Community integration work has already begun.

• llama.cpp Discussion #20969 tracks integration approaches
• MLX experiments report about 5x compression with 99.5% of quality retained
• An open-source release is widely expected around Q2 2026

Implementation sketch

Precompute Lloyd-Max centroids

Do it once offline and reuse them.

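# Python-like pseudocode: lloyd_max_quantizer is a placeholder, not a library call
b = 3  # bits per coordinate (example value)
centroids = lloyd_max_quantizer(
    distribution="beta",  # coordinate distribution after the random rotation
    bits=b,               # yields 2**b centroids
)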
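One way to make the placeholder above concrete is a grid-based Lloyd-Max iteration on the coordinate density of a random unit vector. This is a minimal sketch under that assumption, not the paper's procedure. The function name, grid size, and starting points are illustrative choices, and it is meant for moderately large d.

```python
import math
import numpy as np


def lloyd_max_centroids(d, bits, grid_size=20_000, max_iters=500):
    """Lloyd-Max centroids for the rotated-coordinate density, on a discretized grid."""
    x = (np.arange(grid_size) + 0.5) / grid_size * 2.0 - 1.0  # midpoints in (-1, 1)
    log_const = math.lgamma(d / 2) - 0.5 * math.log(math.pi) - math.lgamma((d - 1) / 2)
    w = np.exp(log_const + 0.5 * (d - 3) * np.log1p(-x**2))
    w /= w.sum()  # probability mass per grid cell

    k = 2**bits
    scale = min(0.9, 3.0 / math.sqrt(d))        # start within ~3 standard deviations
    centroids = np.linspace(-scale, scale, k)
    for _ in range(max_iters):
        edges = 0.5 * (centroids[:-1] + centroids[1:])   # nearest-centroid boundaries
        cell = np.searchsorted(edges, x)
        mass = np.bincount(cell, weights=w, minlength=k)
        first_moment = np.bincount(cell, weights=w * x, minlength=k)
        # New centroid = conditional mean of its cell; empty cells keep their value.
        updated = np.where(mass > 0, first_moment / np.maximum(mass, 1e-300), centroids)
        if np.allclose(updated, centroids, rtol=0.0, atol=1e-12):
            break
        centroids = updated
    return centroids


centroids = lloyd_max_centroids(d=128, bits=3)
print(centroids)
```

Because the table depends only on d and the bit width, it can be computed once offline and shipped as a constant, which matches the "do it once and reuse" advice above.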

Generate a random rotation matrix

Use QR decomposition to build an orthogonal matrix.

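# Random rotation: the Q factor of a Gaussian matrix is orthogonal
import numpy as np

d = 128  # vector dimension (example value)
G = np.random.randn(d, d)
Pi, _ = np.linalg.qr(G)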
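A slightly more careful variant, sketched below, seeds the generator so the same matrix can be rebuilt on every worker, and fixes the column signs so the matrix is uniformly (Haar) distributed over orthogonal matrices. The sign correction is a standard numerical recipe rather than something the page describes, and the function name is hypothetical.

```python
import numpy as np


def random_rotation(d, seed=0):
    """Sample a d x d orthogonal matrix from a seeded generator."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, d))
    Q, R = np.linalg.qr(G)
    # Multiply each column of Q by the sign of the matching diagonal entry of R,
    # which removes the dependence on the QR routine's sign convention.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs


Pi = random_rotation(128, seed=42)
x = np.random.default_rng(1).standard_normal(128)
print(np.allclose(Pi.T @ Pi, np.eye(128)))
print(np.isclose(np.linalg.norm(Pi @ x), np.linalg.norm(x)))
```

Storing only the seed instead of the matrix keeps quantization and dequantization in agreement without shipping d × d floats.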

Build quant / dequant primitives

This is the core path for storage and recovery.

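import numpy as np

def quant(x, Pi, centroids):
    # Rotate, then map each coordinate to the index of its nearest centroid
    y = Pi @ x
    idx = np.abs(y[:, None] - centroids[None, :]).argmin(axis=1)
    return idx

def dequant(idx, Pi, centroids):
    # Look up the centroids, then undo the rotation (Pi is orthogonal, so Pi.T inverts it)
    y = centroids[idx]
    x = Pi.T @ y
    return x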
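The indices returned by quant only save memory once they are packed at b bits each. The sketch below shows one way to do that with NumPy bit packing. The helper names and the least-significant-bit-first layout are assumptions made for illustration, and a production kernel would likely use a layout tuned to the hardware.

```python
import numpy as np


def pack_indices(idx, bits):
    """Pack integer codes in [0, 2**bits) into a uint8 array, `bits` bits per code."""
    idx = np.asarray(idx, dtype=np.int64)
    bit_matrix = (idx[:, None] >> np.arange(bits)) & 1  # shape (n, bits), LSB first
    return np.packbits(bit_matrix.astype(np.uint8).ravel())


def unpack_indices(packed, bits, n):
    """Inverse of pack_indices for n codes."""
    flat = np.unpackbits(packed)[: n * bits].reshape(n, bits).astype(np.int64)
    return (flat << np.arange(bits)).sum(axis=1)


codes = np.random.default_rng(0).integers(0, 8, size=128)
packed = pack_indices(codes, bits=3)
restored = unpack_indices(packed, bits=3, n=128)
print(packed.nbytes, np.array_equal(restored, codes))  # 48 True
```

A 128-dimensional vector at 3 bits per coordinate takes 48 bytes this way, against 256 bytes in FP16.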

Integrate inside attention

Store K/V in TurboQuant form and estimate inner products with QJL.

# Transformer attention (pseudocode: turboquant_quant is a placeholder name)
k_quant = turboquant_quant(k)
v_quant = turboquant_quant(v)
# At attention time, estimate q·k from k_quant using the QJL correction
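To illustrate what "estimate inner products with QJL" can look like, the sketch below implements a 1-bit sign estimator: the stored vector keeps only the signs of a Gaussian projection plus its norm, while the query stays in full precision. It follows the general form of a quantized Johnson-Lindenstrauss estimator. The function names, the projection size m, and applying it to a single vector r are assumptions for the demo, not the paper's exact pipeline.

```python
import numpy as np


def qjl_encode(r, S):
    """Keep only the sign pattern of the projected vector, plus its norm."""
    return np.sign(S @ r).astype(np.int8), float(np.linalg.norm(r))


def qjl_inner_product(q, signs, r_norm, S):
    """Unbiased estimate of <q, r> from 1-bit signs of S @ r (S has i.i.d. N(0, 1) entries)."""
    m = S.shape[0]
    return float(np.sqrt(np.pi / 2.0) / m * r_norm * ((S @ q) @ signs))


rng = np.random.default_rng(0)
d, m = 64, 20_000  # m is deliberately large here so the estimate is tight
S = rng.standard_normal((m, d))
q = rng.standard_normal(d)
q /= np.linalg.norm(q)
r = rng.standard_normal(d)
r /= np.linalg.norm(r)

signs, r_norm = qjl_encode(r, S)
estimate = qjl_inner_product(q, signs, r_norm, S)
print(estimate, float(q @ r))
```

The √(π/2) factor comes from the identity E[sign(s·r)(s·q)] = √(2/π)·⟨q, r⟩/‖r‖ for Gaussian s, which is what makes the estimator unbiased. In a real cache m would be far smaller and the variance correspondingly higher.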

Deployment notes

Hardware

H100 and A100 are ideal. 4-bit mode is where the paper reports 8x speedups.

Mixed precision

Use TurboQuant for KV cache and INT4 for weights to maximize total compression.

Edge devices

3-bit KV cache can make 32K+ context feasible on phones with software-only implementations.

Practical risks and mitigations

Random rotation overhead

Pre-generate and reuse the matrices instead of rebuilding them online.

Residual norm storage

One FP16 scalar is small enough to keep the overhead negligible.
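The size of that overhead is easy to quantify. The sketch below assumes one 16-bit scalar per vector and an FP16 baseline. The dimension 128 is an example value and the function names are illustrative.

```python
def effective_bits(bits_per_coord, dim, scalar_bits=16):
    """Bits per coordinate once one scalar per vector is stored alongside the codes."""
    return bits_per_coord + scalar_bits / dim


def compression_ratio(bits_per_coord, dim, baseline_bits=16):
    """Compression relative to storing every coordinate at baseline_bits."""
    return baseline_bits / effective_bits(bits_per_coord, dim)


print(effective_bits(3, 128), compression_ratio(3, 128))
```

With these numbers the scalar adds 0.125 bits per coordinate, so a nominal 3-bit code costs 3.125 bits and compresses by about 5.1x relative to FP16.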

Recommended open-source path

Fork llama.cpp → add a turboquant_quant kernel

Outlook

How TurboQuant could shift the AI stack

LLM inference

Million-token contexts become materially cheaper, with a path to native support in future model stacks.

Vector databases

Real-time indexing and sub-millisecond search become easier to deliver.

Edge AI

Long-context inference on mobile and embedded devices becomes more realistic.

Multimodal embeddings

The same ideas can extend to image and video embedding compression.

Theory extensions

Combining with outlier-handling methods could push the field toward practical 2-bit systems.

Community impact

Expect rapid follow-through from ecosystems such as vLLM and Hugging Face.

Expected timeline

2026 Q2

Open-source code and framework integrations

2026 Q4

Commercial products, likely cloud-first

2027

Potential normalization as an LLM quantization standard

Risk note: poor handling of the random seed can introduce a slight bias, but the paper considers the effect negligible in high dimension.

FAQ

Frequently asked questions

The first questions engineers and readers ask

Resources

References and links

Official blog

Google Research announcement

TurboQuant paper

Main ICLR 2026 PDF

PolarQuant paper

AISTATS 2026 paper on the polar transform

QJL paper

Quantized JL transform, AAAI 2025

Community discussion

Reddit r/LocalLLaMA

Community discussion around local LLM deployment

X (Twitter) discussion

Ongoing reactions and commentary

llama.cpp #20969

Main integration discussion thread

Frequently asked questions

01. Does TurboQuant require training?

02. How does it compare with KIVI or standard 4-bit quantization?

03. Which models does it support?

04. Where is the code?

05. Why does it matter for vector search?

06. Is the zero-loss claim real?

07. What hardware is required?

08. What is the relation to PolarQuant?
