Google Research · Published March 2026

TurboQuant
Redefining AI efficiency

A new online vector quantization algorithm that delivers 3-bit KV cache compression with no reported accuracy loss, cutting memory use to one sixth and speeding up attention by up to 8x.

6x+

Memory compression

Up to 8x

Attention speedup (H100)

3-bit

Lossless compression

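Latest developments around TurboQuant

After the paper was published, discussion quickly spread to implementation, deployment, and the cost structure of long-context inference.

Impact

Market impact

$MU and $SNDK sold off sharply at the open

Following $GOOGL's release of TurboQuant, $MU and $SNDK came under heavy pressure at the open.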

A practical read on what TurboQuant changes

One expert view on what is likely already deployed, what still remains hard, and why the paper matters even if most easy gains are gone.

Independent Industry Expert

LLM systems and inference engineering

TurboQuant matters less because it saves a bit more memory, and more because it marks where KV-cache compression starts to hit a real boundary.

KV cache has long been the largest source of memory consumption in large-model inference. What this paper does, in essence, is compress that data in a way that approaches the information-theoretic optimum. It is not just lowering precision. It is reallocating information density: ordinary regions are represented with extremely low bits, while outliers retain higher precision. At the same time, the method stops treating values independently and instead encodes them at the vector level, which fits the inner-product structure of attention itself.

The critical point is that its error is already close to the information-theoretic lower bound, the Shannon limit. That means compression efficiency is already near the theoretical ceiling. The paper reports roughly 4x to 4.5x compression with little visible performance loss. The result is strong, but it also suggests there is not much room left for further compression without harming model quality.

Given how internal R&D at large tech companies usually works, the optimization effects implied by the paper were likely absorbed in stages before publication. Low-bit quantization has already been widely deployed, from int8 to int4 and beyond, across mainstream inference stacks. Separate handling for outliers is also not new: methods such as SmoothQuant and AWQ already do closely related things. KV-cache compression itself, sliding windows, and hierarchical cache designs are already standard practice in large-model systems.

What likely has not fully landed yet is the most extreme part of the paper: vector quantization and coding schemes that move closer to the information-theoretic limit. The barrier is not theory, but implementation. These methods are less GPU-friendly, harder to keep low-latency, and more difficult to stabilize and generalize in production, so they may take much longer to ship.

If I had to estimate roughly how much of the paper's benefit is already reflected in deployed systems, it would look something like this: the earliest KV cache starts at 1x cost; basic quantization gets to around 2x to 3x compression; adding outlier-aware handling can reach about 3x to 4x; the paper pushes that further to around 4x to 4.5x. In other words, most of the easy gains have already been captured. What remains is smaller in upside and increasingly expensive to realize.

The reason is straightforward. Early compression removes redundancy. Later compression starts to hit effective information, so every additional step has a much higher chance of hurting model capability. Error no longer degrades smoothly; beyond a certain point, it can worsen quickly. Engineering difficulty also does not grow linearly. It rises sharply.

You can infer from current model behavior that mainstream systems are already using many of these ideas. Better long-context behavior, lower inference cost, and stable performance all suggest that KV-cache efficiency has already been significantly improved. A team at Google's level has very likely already deployed low-bit quantization, outlier handling, and at least part of KV-cache compression.

That means if this Google paper has an impact on storage, much of that impact has probably already shown up. The parts that have not shown up yet will likely be harder to implement than the gains that came before.

More importantly, the significance of the paper is not just how much more memory it saves. It gives us a boundary. KV-cache compression is approaching its limit, and the remaining room is narrow. The next major change is unlikely to come from compression alone. It will require finding a different path.

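Core innovation

Why TurboQuant looks like a game changer

TurboQuant is more than a compression trick. It is an online quantization framework that is data-oblivious and designed for accelerators while approaching the information-theoretic limit.

Conventional methods (e.g., PQ)

Require training per dataset
Must store many normalization constants
Slow index building
Noticeable accuracy loss

TurboQuant

Random rotation + polar coordinate transform (PolarQuant)
1-bit residual correction (QJL) removes normalization overhead
Indexing time is close to zero
Matches the 32-bit baseline on reported benchmarks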

AISTATS 2026

PolarQuant

The polar coordinate transform at the core of eliminating normalization overhead

arXiv: 2502.02617 →

AAAI 2025

QJL

1-bit unbiased inner-product estimator

ACM DL →

ICLR 2026

TurboQuant

Two-stage design that achieves near-optimal distortion

arXiv: 2504.19874 →

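Technical background

Why TurboQuant is needed

A brief look at the limits of vector quantization and the pressure on the KV cache

1. The classic problem of vector quantization

Vector quantization maps high-dimensional vectors to compact codes while minimizing distortion. The theoretical lower bounds are well defined, but conventional methods still sit far from them.

Distortion formulas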

MSE: D_MSE = E[||x - x̂||²]

Inner product: D_prod = E[|⟨y,x⟩ - ⟨y,x̂⟩|²]

Theory

MSE lower bound: D_MSE ≥ 1/4^b

Inner-product lower bound: D_prod ≥ (||y||² / d) · 1/4^b

Conventional methods such as PQ still sit well above these lower bounds.
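
To make the two distortion measures and their bounds concrete, here is a minimal sketch. It assumes unit-norm inputs, b bits per coordinate, and NumPy; the function names are illustrative and do not come from the paper.

```python
import numpy as np

def mse_lower_bound(bits: int) -> float:
    """Lower bound on D_MSE as stated above: 1 / 4^b."""
    return 1.0 / 4.0 ** bits

def prod_lower_bound(y: np.ndarray, bits: int) -> float:
    """Lower bound on D_prod as stated above: (||y||^2 / d) * 1 / 4^b."""
    d = y.shape[0]
    return float(y @ y) / d / 4.0 ** bits

def empirical_distortions(x, x_hat, y):
    """Estimate D_MSE and D_prod from rows of x and their reconstructions x_hat."""
    d_mse = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
    d_prod = float(np.mean((x @ y - x_hat @ y) ** 2))
    return d_mse, d_prod

for b in (1, 2, 3, 4):
    print(b, mse_lower_bound(b))
```

Comparing `empirical_distortions` for any quantizer against these two functions shows how far that quantizer sits above the bound at a given bit width.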

2. The KV cache bottleneck in LLMs

Decoder-only Transformers store a Key and a Value for every token. As the context grows, this cost comes to dominate the whole system.

Memory estimate

memory ≈ 2 × L × d × 2 bytes (FP16)

128K context + 7B model: tens of GB

KV cache share of total memory: 80%+
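
The estimate can be sanity-checked with a few lines of arithmetic. This sketch assumes the formula is per layer, with L the number of cached tokens and d the hidden width, and multiplies by the layer count. The configuration (32 layers, d = 4096, full multi-head attention) is an assumption chosen as typical of a 7B model, not a figure from the paper; models with grouped-query attention cache fewer values.

```python
def kv_cache_bytes(tokens, hidden_dim, layers, bytes_per_value=2):
    # 2 tensors (K and V) per layer, one hidden_dim-sized vector per token each
    return 2 * layers * tokens * hidden_dim * bytes_per_value

# Assumed 7B-class configuration at a 128K-token context
full = kv_cache_bytes(tokens=128 * 1024, hidden_dim=4096, layers=32)
print(f"FP16 KV cache: {full / 2**30:.1f} GiB")

# Same cache at 3 bits per value, ignoring any stored norms or metadata
print(f"3-bit KV cache: {full * 3 / 16 / 2**30:.1f} GiB")
```

Under these assumptions the FP16 cache lands in the "tens of GB" range quoted above.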

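What TurboQuant changes

✓ No training or fine-tuning required
✓ Quality neutrality at 3.5 bits per channel
✓ LongBench matches FP32
✓ Makes long-context inference on edge devices practical

3. Application to vector search

In ANN systems such as FAISS, TurboQuant improves recall while keeping index-building cost close to zero.

High recall

Outperforms PQ and RaBitQ on GloVe

Indexing time ≈ 0

Well suited to large-scale vector stores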

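Core principles

TurboQuant's two-stage algorithm

TurboQuant = PolarQuant for the main compression + QJL for residual correction

PolarQuant: polar coordinate transform

The key is eliminating per-block normalization overhead. PolarQuant uses a random rotation to map coordinates to a concentrated distribution that is easy to quantize.

Coordinate distribution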

f_X(x) = Γ(d/2) / (√π · Γ((d-1)/2)) × (1 - x²)^((d-3)/2), where x ∈ [-1, 1]
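
A quick numerical check of this density: the sketch below evaluates it on a grid, confirms that it integrates to about 1 with variance about 1/d, and compares that with coordinates of sampled random unit vectors. The dimension d = 64, the grid size, and the sample count are arbitrary choices for illustration.

```python
import math
import numpy as np

def coord_density(x, d):
    """Density of one coordinate of a uniformly random unit vector in R^d."""
    log_c = math.lgamma(d / 2) - 0.5 * math.log(math.pi) - math.lgamma((d - 1) / 2)
    return np.exp(log_c + ((d - 3) / 2) * np.log1p(-x ** 2))

d = 64
xs = np.linspace(-1 + 1e-6, 1 - 1e-6, 200_001)
dx = xs[1] - xs[0]
pdf = coord_density(xs, d)
total_mass = float(np.sum(pdf) * dx)           # Riemann sum, should be close to 1
variance = float(np.sum(xs ** 2 * pdf) * dx)   # should be close to 1/d

rng = np.random.default_rng(0)
u = rng.standard_normal((20_000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform points on the unit sphere
sample_var = float(u[:, 0].var())
print(total_mass, variance, sample_var, 1 / d)
```

The variance of 1/d is what makes the distribution concentrate as d grows, which is why a fixed scalar codebook can be reused without per-block scaling.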

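Pair up the coordinates of the d-dimensional vector and decompose each pair into a radius and an angle

Apply the polar transform recursively to the radii

Quantize only the concentrated angle components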
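
The first two steps can be sketched as a recursive transform and its inverse. This is an illustrative reading of the description above, assuming the vector length is a power of two; it omits the random rotation and the angle quantizer, and the function names are hypothetical.

```python
import numpy as np

def polar_forward(x):
    """Return (norm, angles_per_level) for a vector whose length is a power of two."""
    r = np.asarray(x, dtype=np.float64)
    angles = []
    while r.size > 1:
        a, b = r[0::2], r[1::2]
        # level 1 angles span (-pi, pi]; later levels pair nonnegative radii,
        # so their angles lie in [0, pi/2]
        angles.append(np.arctan2(b, a))
        r = np.hypot(a, b)
    return float(r[0]), angles

def polar_inverse(norm, angles):
    """Rebuild the vector by splitting each radius with its stored angle."""
    r = np.array([norm])
    for theta in reversed(angles):
        out = np.empty(2 * r.size)
        out[0::2] = r * np.cos(theta)
        out[1::2] = r * np.sin(theta)
        r = out
    return r

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
norm, angles = polar_forward(x)
x_back = polar_inverse(norm, angles)
print(np.allclose(x, x_back), sum(a.size for a in angles))
```

A d-dimensional vector becomes one norm plus d - 1 angles, so a quantizer applied only to the angles needs no per-block scale constants.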

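Key advantages

No full-precision constants per block
The overhead drops to zero.
Near-lossless beyond 4.2x
Stronger behavior than conventional baselines.
Gaussian-like coordinate distribution in high dimensions
Optimal scalar quantizers such as Lloyd-Max can be used directly.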

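Experimental results

The numbers make the case

Benchmarks across Gemma, Mistral, and Llama-3.1-8B

KV cache compression benchmark

50.06

LongBench score

3.5 bits = full cache

100

Needle In A Haystack

Perfect from 4K to 104K

6x+

Memory reduction

Substantially lower cost

Up to 8x

Attention speed

4-bit mode on H100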

Benchmark	TurboQuant 3.5-bit	TurboQuant 2.5-bit	Full cache
LongBench	50.06	49.44	50.06
Needle In A Haystack	100	99.8	100
ZeroSCROLLS	Best	Near best	Baseline
RULER	Best	Near best	Baseline
L-Eval	Best	Near best	Baseline

Vector search benchmark (GloVe d=200)

1@k recall

TurboQuant: highest

PQ: low

RaBitQ: moderate

Indexing time

TurboQuant: ≈ 0

PQ (codebook training): long

RaBitQ: moderate

Comparison with alternative methods

Method	Training required	Unbiased	Compression ratio	Speedup
TurboQuant	None	Yes	6x+	8x
KIVI	Calibration	No	4x	4x
SnapKV	Fine-tuning	No	2-4x	2-4x
DuQuant	Calibration	Partial	4x	4x

100K context memory projections

Predicted total VRAM before and after TurboQuant, plus the RTX 4090 count needed to hold each setup.

Assumes RTX 4090 nominal VRAM of 24GB, with practical allocation rounded up after framework overhead.

Model	Weights	Pure model VRAM	Total VRAM before	Total VRAM after	4090s before	4090s after	Change
ChatGLM-4 (9B)	BF16	18 GB	19.8 GB	18.3 GB	1	1	Extra headroom on a single 4090.
ChatGLM-4 (9B)	INT8	9 GB	10.8 GB	9.3 GB	1	1	Still single-card, with more buffer.
ChatGLM-4 (9B)	INT4	5 GB	6.8 GB	5.3 GB	1	1	Very comfortable single-card fit.
Qwen-2.5 (32B)	BF16	64 GB	69 GB	64.8 GB	3	3	Savings help, but not enough to drop a GPU.
Qwen-2.5 (32B)	INT8	32 GB	37 GB	32.8 GB	2	2	More margin on a 2x4090 node.
Qwen-2.5 (32B)	INT4	18 GB	23 GB	18.8 GB	2	1(-1)	Pulled back under the single-4090 limit.
Llama-3.1 (70B)	BF16	140 GB	150 GB	141.7 GB	7	6(-1)	Drops one RTX 4090 at 100K context.
Llama-3.1 (70B)	INT8	70 GB	80 GB	71.7 GB	4	3(-1)	Material hardware cost reduction.
Llama-3.1 (70B)	INT4	38 GB	48 GB	39.7 GB	3	2(-1)	Brings 70B into a practical dual-4090 envelope.
Mixtral 8x22B (141B MoE)	BF16	282 GB	288 GB	283 GB	13	13	MoE keeps KV share relatively small.
Mixtral 8x22B (141B MoE)	INT8	141 GB	147 GB	142 GB	7	7	Lower pressure, but same card class.
Mixtral 8x22B (141B MoE)	INT4	75 GB	81 GB	76 GB	4	4	Useful slack without a node count change.
DeepSeek-R1 (671B MoE)	FP8	700 GB	712 GB	702 GB	31	30(-1)	Saves one 4090 even at hyperscale.
DeepSeek-R1 (671B MoE)	INT4	350 GB	362 GB	352 GB	16	15(-1)	Still too large for small nodes, but one card disappears.
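
The arithmetic behind a row can be reproduced with a short sketch. It treats everything beyond the weights as KV cache and computes only a raw-capacity lower bound on card count. The table's own counts add an unstated allowance for framework overhead, so some rows list more cards than this bound. The function names are illustrative.

```python
import math

def kv_share(pure_model_gb, total_gb):
    """Memory beyond the weights; this sketch treats all of it as KV cache."""
    return total_gb - pure_model_gb

def min_cards(total_gb, card_gb=24):
    """Lower bound on card count: raw capacity only, no framework overhead."""
    return math.ceil(total_gb / card_gb)

# Llama-3.1 (70B), BF16 row from the table above
before, after = kv_share(140, 150), kv_share(140, 141.7)
ratio = before / after  # KV compression ratio implied by the row
print(round(ratio, 1), min_cards(150), min_cards(141.7))
```

For this row the implied KV compression is just under 6x, and the bound reproduces the drop from 7 cards to 6.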

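Deployment guide

From paper to production

How to think about integrating TurboQuant into real systems

Current status

The paper provides the theory and pseudocode, but no official implementation has been released yet. Community integration work has already begun.

•llama.cpp Discussion #20969 tracks an integration proposal
•On MLX, roughly 5x compression with 99.5% quality retention has been reported
•An open-source release around Q2 2026 is widely expected

Implementation sketch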

Precompute Lloyd-Max centroids

Do it once offline and reuse them.

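# Python-like pseudocode
# lloyd_max_quantizer is a placeholder, not a library function: it should
# return the 2**b centroids that minimize MSE for the coordinate distribution.
centroids = lloyd_max_quantizer(
    distribution="beta",  # per-coordinate density after random rotation
    bits=b,               # bits per coordinate
)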
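
One way to fill in the placeholder is a sample-based Lloyd-Max iteration. The sketch below is an assumption about how the step could be done, not the paper's procedure: it fits centroids to coordinates of sampled random unit vectors instead of integrating the analytic density, and the values of d, bits, and the sample size are arbitrary.

```python
import numpy as np

def lloyd_max(samples, bits, iters=60):
    """Sample-based Lloyd-Max: alternate midpoint boundaries and cell means."""
    levels = 2 ** bits
    # start from evenly spaced quantiles so every cell is populated
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2   # nearest-centroid boundaries
        cell = np.searchsorted(edges, samples)          # cell index per sample
        for j in range(levels):
            members = samples[cell == j]
            if members.size:
                centroids[j] = members.mean()           # centroid = conditional mean
    return centroids

d, bits = 128, 2
rng = np.random.default_rng(0)
u = rng.standard_normal((2048, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)           # random unit vectors
centroids = lloyd_max(u.ravel(), bits)
print(centroids * np.sqrt(d))                            # rescaled to unit variance
```

Because the distribution depends only on d and the bit width, the resulting table can be computed once and shipped with the kernel.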

Generate a random rotation matrix

Use QR decomposition to build an orthogonal matrix.

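import numpy as np

# random rotation: QR of a d x d Gaussian matrix yields an orthogonal Pi
rng = np.random.default_rng(0)  # fixed seed so quant and dequant share Pi
G = rng.standard_normal((d, d))
Pi, _ = np.linalg.qr(G)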
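
A slightly fuller sketch of this step wraps the QR construction in a seeded helper and checks the two properties the later steps rely on: orthogonality and norm preservation. The helper name and the column sign correction are additions for illustration. The sign correction is a common way to make the result independent of the QR sign convention and uniformly distributed over orthogonal matrices.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # flip column signs so the result does not depend on the QR sign convention
    return q * np.sign(np.diag(r))

Pi = random_rotation(64)
x = np.random.default_rng(1).standard_normal(64)
is_orthogonal = np.allclose(Pi.T @ Pi, np.eye(64))
keeps_norm = np.isclose(np.linalg.norm(Pi @ x), np.linalg.norm(x))
print(is_orthogonal, keeps_norm)
```

Seeding the generator means the matrix never has to be stored alongside the cache; it can be regenerated from the seed.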

Build quant / dequant primitives

This is the core path for storage and recovery.

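import numpy as np

def quant(x, Pi, centroids):
    # rotate, then snap each coordinate to its nearest centroid
    y = Pi @ x
    idx = np.abs(y[:, None] - centroids[None, :]).argmin(axis=1)
    return idx

def dequant(idx, Pi, centroids):
    # look up centroids, then undo the rotation (Pi is orthogonal)
    y = centroids[idx]
    x = Pi.T @ y
    return x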
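
Putting the pieces together, the following self-contained sketch runs a 2-bit round trip on random unit vectors and compares the measured MSE with the 1/4^b lower bound. The centroids are the standard 2-bit Lloyd-Max levels for a unit Gaussian, rescaled to variance 1/d. That is an approximation to the exact coordinate distribution, used here only to keep the example short.

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))
# 2-bit Lloyd-Max levels of a standard Gaussian, rescaled to variance 1/d (approximation)
centroids = np.array([-1.510, -0.4528, 0.4528, 1.510]) / np.sqrt(d)

def quant(x, Pi, centroids):
    y = Pi @ x
    return np.abs(y[:, None] - centroids[None, :]).argmin(axis=1).astype(np.uint8)

def dequant(idx, Pi, centroids):
    return Pi.T @ centroids[idx]

x = rng.standard_normal((1000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm inputs
x_hat = np.stack([dequant(quant(v, Pi, centroids), Pi, centroids) for v in x])
mse = float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))
print(mse, 1 / 4 ** 2)                            # measured MSE vs. lower bound at b = 2
```

For non-unit vectors, a real implementation would store each vector's norm separately and quantize the normalized direction.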

Integrate inside attention

Store K/V in TurboQuant form and estimate inner products with QJL.

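# Transformer attention (pseudocode; turboquant_quant is not an existing API)
k_quant = turboquant_quant(k)  # store compressed keys in the cache
v_quant = turboquant_quant(v)  # store compressed values in the cache
# during attention, estimate q·k from k_quant with the QJL estimator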
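
The 1-bit inner-product estimate can be sketched on its own. In this illustrative version the sign-bit projection is applied directly to a key instead of to a quantization residual, and the number of projections m is far larger than a real system would use so that convergence to the true inner product is visible. Function names are hypothetical.

```python
import numpy as np

def qjl_encode(k, S):
    """Keep one sign bit per projection plus the key's norm."""
    return np.sign(S @ k).astype(np.int8), float(np.linalg.norm(k))

def qjl_inner_product(q, signs, k_norm, S):
    """Unbiased estimate of <q, k> from the sign bits; the query stays full precision."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * k_norm / m * float((S @ q) @ signs)

d, m = 64, 50_000
rng = np.random.default_rng(0)
S = rng.standard_normal((m, d))      # Gaussian projection shared by queries and keys
q, k = rng.standard_normal(d), rng.standard_normal(d)
signs, k_norm = qjl_encode(k, S)
estimate = qjl_inner_product(q, signs, k_norm, S)
print(estimate, float(q @ k))
```

The sqrt(pi/2) factor comes from E[(s·q) sign(s·k)] = sqrt(2/pi) · ⟨q, k⟩ / ||k|| for a Gaussian row s, which is what makes the estimator unbiased.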

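Deployment essentials

Hardware

H100 and A100 are ideal. The paper's 8x speedup is reported in 4-bit mode.

Mixed precision

Using TurboQuant for the KV cache and INT4 for the weights maximizes overall compression.

Edge devices

A 3-bit KV cache brings 32K+ context on mobile within reach.

Practical risks and mitigations

Random rotation overhead

Generating the matrix in advance and reusing it keeps the online cost down.

Storing the residual norm

A single FP16 scalar is enough, so the extra cost is small.

Recommended open-source path

Fork llama.cpp → add a turboquant_quant kernel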

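Outlook

How TurboQuant could change the AI stack

LLM inference

The cost of million-token contexts falls sharply, and this could become a standard feature of next-generation stacks.

Vector databases

Real-time indexing and sub-millisecond search become easier to achieve.

Edge AI

Long-context inference on mobile and embedded hardware moves closer to reality.

Multimodal embeddings

The same idea can be extended to compressing image and video embeddings.

Theoretical extensions

Combined with outlier handling, it can move closer to a practical 2-bit regime.

Community impact

Ecosystems such as vLLM and Hugging Face may follow quickly.

Expected timeline

2026 Q2

Open-source implementations and framework integration

2026 Q4

Commercial products, possibly cloud-first

2027

May become established as an LLM quantization standard

Note: poor random-seed handling can introduce a slight bias, but the paper says this is negligible in high dimensions.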

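FAQ

Frequently asked questions

Questions that tend to come up first

Resources

Reference links and materials

Official blog

Google Research announcement post

TurboQuant paper

ICLR 2026 main paper PDF

PolarQuant paper

AISTATS 2026 polar coordinate transform paper

QJL paper

AAAI 2025 quantized JL transform

Community discussion

Reddit r/LocalLLaMA

A community focused on local LLM deployment

X (Twitter) #TurboQuant

Ongoing reactions and discussion

llama.cpp #20969

The main integration discussion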

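Frequently asked questions

01 Does TurboQuant require training?

02 How does it compare with KIVI or ordinary 4-bit quantization?

03 Which models does it support?

04 Where is the code?

05 Why does it matter for vector search?

06 Is it really lossless?

07 What hardware is required?

08 How does it relate to PolarQuant?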
