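Google Research · Published March 2026

TurboQuant
Rewriting the baseline for AI efficiency

A new online vector quantization algorithm that delivers 3-bit KV cache compression with no accuracy loss, cutting memory by 6x and speeding up attention by up to 8x.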

6x+ memory compression

Up to 8x attention speedup (H100)

3-bit lossless compression

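Recent developments

The latest around TurboQuant

Right after the paper was released, the discussion quickly expanded to implementation, deployment, and the cost structure of long-context inference.

Impact

Market impact

$MU and $SNDK came under selling pressure right after the open

Following $GOOGL's TurboQuant announcement, $MU and $SNDK swung sharply at the open.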

A practical read on what TurboQuant changes

One expert view on what is likely already deployed, what still remains hard, and why the paper matters even if most easy gains are gone.

Independent Industry Expert

LLM systems and inference engineering

TurboQuant matters less because it saves a bit more memory, and more because it marks where KV-cache compression starts to hit a real boundary.

KV cache has long been the largest source of memory consumption in large-model inference. What this paper does, in essence, is compress that data in a way that approaches the information-theoretic optimum. It is not just lowering precision. It is reallocating information density: ordinary regions are represented with extremely low bits, while outliers retain higher precision. At the same time, the method stops treating values independently and instead encodes them at the vector level, which fits the inner-product structure of attention itself.

The critical point is that its error is already close to the information-theoretic lower bound, the Shannon limit. That means compression efficiency is already near the theoretical ceiling. The paper reports roughly 4x to 4.5x compression with little visible performance loss. The result is strong, but it also suggests there is not much room left for further compression without harming model quality.

Given how large-tech internal R&D usually works, the optimization effects implied by the paper were likely absorbed in stages before publication. Low-bit quantization has already been widely deployed, from int8 to int4 and beyond, across mainstream inference stacks. Separate handling for outliers is also not new: methods such as SmoothQuant and AWQ are already doing closely related things. KV-cache compression itself, sliding windows, and hierarchical cache designs are already standard practice in large-model systems.

What likely has not fully landed yet is the most extreme part of the paper: vector quantization and coding schemes that move closer to the information-theoretic limit. The barrier is not theory, but implementation. These methods are less GPU-friendly, harder to keep low-latency, and more difficult to stabilize and generalize in production, so they may take much longer to ship.

If I had to estimate roughly how much of the paper's benefit is already reflected in deployed systems, it would look something like this: the earliest KV cache starts at 1x cost; basic quantization gets to around 2x to 3x compression; adding outlier-aware handling can reach about 3x to 4x; the paper pushes that further to around 4x to 4.5x. In other words, most of the easy gains have already been captured. What remains is smaller in upside and increasingly expensive to realize.

The reason is straightforward. Early compression removes redundancy. Later compression starts to hit effective information, so every additional step has a much higher chance of hurting model capability. Error no longer degrades smoothly; beyond a certain point, it can worsen quickly. Engineering difficulty also does not grow linearly. It rises sharply.

You can infer from current model behavior that mainstream systems are already using many of these ideas. Better long-context behavior, lower inference cost, and stable performance all suggest that KV-cache efficiency has already been significantly improved. A team at Google's level has very likely already deployed low-bit quantization, outlier handling, and at least part of KV-cache compression.

That means if this Google paper has an impact on storage, much of that impact has probably already shown up. The parts that have not shown up yet will likely be harder to implement than the gains that came before.

More importantly, the significance of the paper is not just how much more memory it saves. It gives us a boundary. KV-cache compression is approaching its limit, and the remaining room is narrow. The next major change is unlikely to come from compression alone. It will require finding a different path.

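Key innovation

Why TurboQuant looks like a game-changing result

TurboQuant is not just a compression technique. It is an online quantization framework that comes close to the information-theoretic limit while remaining data-oblivious and accelerator-friendly.

Existing methods (e.g., PQ)

Require per-dataset training
Must store many normalization constants
Long indexing time
Noticeable accuracy loss

TurboQuant

Random rotation + polar transform (PolarQuant)
1-bit residual correction (QJL) removes normalization overhead
Near-zero indexing time
Matches the 32-bit baseline on the reported benchmarks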

AISTATS 2026

PolarQuant

The core polar transform that eliminates normalization overhead

arXiv: 2502.02617

AAAI 2025

QJL

1-bit unbiased inner-product estimator

ACM DL

ICLR 2026

TurboQuant

Two-stage design that achieves near-optimal distortion

arXiv: 2504.19874

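Technical background

Why TurboQuant is needed

A quick summary of the limits of vector quantization and the KV cache bottleneck.

1. The classical vector quantization problem

Vector quantization is the problem of mapping high-dimensional vectors to short codes while minimizing distortion. The theoretical lower bound is well defined, but existing methods do not reach it in practice.

Distortion formulas

MSE: D_MSE = E[||x - x̂||²]

Inner product: D_prod = E[|⟨y,x⟩ - ⟨y,x̂⟩|²]

Theory

MSE lower bound: D_MSE ≥ 1/4^b

Inner-product lower bound: D_prod ≥ (||y||² / d) · 1/4^b

Here b is the bit width per coordinate and d is the vector dimension. Traditional methods such as PQ still sit well above these bounds.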
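
To make the two lower bounds above concrete, the short sketch below evaluates them for a few bit widths. The function names and the example values (d = 128, ||y||² = 1) are illustrative assumptions and do not come from the paper.

```python
def mse_lower_bound(bits: int) -> float:
    """Lower bound D_MSE >= 1 / 4^b for a b-bit quantizer (as stated above)."""
    return 4.0 ** (-bits)


def inner_product_lower_bound(y_norm_sq: float, dim: int, bits: int) -> float:
    """Lower bound D_prod >= (||y||^2 / d) * 1 / 4^b (as stated above)."""
    return (y_norm_sq / dim) * 4.0 ** (-bits)


if __name__ == "__main__":
    # Example values only: unit-norm query (||y||^2 = 1) and d = 128.
    for b in (1, 2, 3, 4):
        print(b, mse_lower_bound(b), inner_product_lower_bound(1.0, 128, b))
```

Each additional bit lowers both bounds by a factor of 4, which is why the gap between 3-bit and 4-bit settings matters so much in practice.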

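2. The KV cache bottleneck in LLMs

In decoder-based transformers, keys and values must be stored for every token, so memory cost quickly becomes dominant as the context grows.

Memory estimate

memory ≈ 2 × L × d × 2 bytes (FP16)

128K context + 7B model: tens of GB

KV cache share of total memory: 80%+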
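
As a rough worked example of the memory estimate above, the sketch below reads L as the context length in tokens and d as the hidden size, and multiplies by the number of layers. That reading, the layer multiplier, and the 7B-class configuration (32 layers, hidden size 4096, no grouped-query attention) are assumptions made for illustration, not values taken from the paper.

```python
def kv_cache_bytes(seq_len: int, hidden_dim: int, n_layers: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: 2 (K and V) x tokens x hidden dim x bytes, times layers."""
    return 2 * seq_len * hidden_dim * bytes_per_value * n_layers


# Hypothetical 7B-class configuration: 32 layers, hidden size 4096.
fp16_bytes = kv_cache_bytes(seq_len=128 * 1024, hidden_dim=4096, n_layers=32)
print(f"FP16 KV cache: {fp16_bytes / 2**30:.1f} GiB")        # 64.0 GiB

# Same cache at 3.5 bits per value instead of 16 (side information ignored).
compressed_bytes = fp16_bytes * 3.5 / 16
print(f"3.5-bit KV cache: {compressed_bytes / 2**30:.1f} GiB")  # 14.0 GiB
```

Under these assumptions the FP16 cache alone is larger than the weights of the model, which is consistent with the "tens of GB" figure above.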

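What TurboQuant changes

✓ No training or fine-tuning required
✓ Quality-neutral at 3.5 bits per channel
✓ LongBench score equal to FP32
✓ Makes long-context inference more feasible on edge devices

3. Application to vector search

In ANN systems such as FAISS, TurboQuant delivers higher recall while keeping indexing overhead near zero.

Higher recall

Outperforms PQ and RaBitQ on GloVe

Indexing time ≈ 0

Well suited to large-scale vector stores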

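Core principles

TurboQuant's two-stage algorithm

TurboQuant = PolarQuant for the main compression + QJL for residual correction

PolarQuant: polar transform

The key is removing per-block normalization overhead. PolarQuant uses a random rotation to turn the coordinates into a concentrated distribution that is easier to quantize.

Coordinate distribution

f_X(x) = Γ(d/2) / (√π · Γ((d-1)/2)) × (1 - x²)^((d-3)/2), where x ∈ [-1, 1]

Pair up the coordinates of the d-dimensional vector to obtain radii and angles

Apply the polar transform recursively to the radii

Quantize only the angles, whose distribution is concentrated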
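
The three steps above can be sketched in NumPy as follows. This is a minimal illustration of the pairing idea only, not PolarQuant's actual implementation: the random rotation and the angle quantizer are omitted, the function names are hypothetical, and the vector length is assumed to be a power of two.

```python
import numpy as np


def to_polar_pairs(x: np.ndarray):
    """Group consecutive coordinates into pairs and return (radii, angles)."""
    pairs = x.reshape(-1, 2)
    radii = np.hypot(pairs[:, 0], pairs[:, 1])
    angles = np.arctan2(pairs[:, 1], pairs[:, 0])
    return radii, angles


def from_polar_pairs(radii: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Invert to_polar_pairs."""
    pairs = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return pairs.reshape(-1)


def recursive_polar(x: np.ndarray):
    """Apply the pairing repeatedly to the radii until one radius remains.

    Returns the final radius (the norm of x) and one angle array per level.
    Assumes len(x) is a power of two.
    """
    angle_levels = []
    radii = x
    while radii.size > 1:
        radii, angles = to_polar_pairs(radii)
        angle_levels.append(angles)
    return radii[0], angle_levels


rng = np.random.default_rng(0)
x = rng.standard_normal(8)
norm, levels = recursive_polar(x)
print("final radius equals ||x||:", np.isclose(norm, np.linalg.norm(x)))
print("angles per level:", [a.size for a in levels])
```

For an 8-dimensional input this yields 4 + 2 + 1 = 7 angles plus a single radius. From the second level on, the inputs are non-negative radii, so those angles fall in [0, π/2].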

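Key advantages

Removes per-block full-precision constants
The overhead drops to zero.
Near-lossless even beyond 4.2x compression
Stronger than existing baselines.
Gaussian-like coordinates in high dimensions
Optimal scalar quantizers such as Lloyd-Max can be applied directly.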
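
A small experiment can illustrate why a random rotation helps. The sketch below rotates a deliberately "spiky" unit vector and shows that its energy is spread across all coordinates, with a standard deviation near 1/√d. The dimension, seed, and variable names are assumptions chosen for the example.

```python
import numpy as np

d = 256  # example dimension (assumption)
rng = np.random.default_rng(0)

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

# A worst-case "spiky" unit vector: all energy in one coordinate.
x = np.zeros(d)
x[0] = 1.0

y = Pi @ x  # rotated vector, still unit norm

print("max |coordinate| before:", np.abs(x).max())
print("max |coordinate| after: ", np.abs(y).max())
print("coordinate std after:   ", y.std(), "vs 1/sqrt(d) =", d ** -0.5)
```

Because the rotated coordinates follow a known, concentrated distribution regardless of the input, a single precomputed scalar codebook can be reused for every vector.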

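Experimental results

The numbers make the case

Benchmarks across Gemma, Mistral, and Llama-3.1-8B

KV cache compression benchmarks

LongBench score: 50.06 (3.5-bit matches the full cache)

Needle In A Haystack: 100 (perfect from 4K to 104K)

Memory reduction: 6x+ (substantially lower cost)

Attention speed: up to 8x (H100, 4-bit mode)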

Benchmark	TurboQuant 3.5-bit	TurboQuant 2.5-bit	Full cache
LongBench	50.06	49.44	50.06
Needle In A Haystack	100	99.8	100
ZeroSCROLLS	Best	Near best	Baseline
RULER	Best	Near best	Baseline
L-Eval	Best	Near best	Baseline

Vector search benchmark (GloVe, d=200)

1@k recall

TurboQuant: best

PQ: low

RaBitQ: medium

Indexing time

TurboQuant: ≈ 0

PQ (codebook training): long

RaBitQ: medium

Comparison with alternatives

Method	Training required	Unbiased	Compression	Speedup
TurboQuant	No	Yes	6x+	8x
KIVI	Calibration	No	4x	4x
SnapKV	Fine-tuning	No	2-4x	2-4x
DuQuant	Calibration	Partial	4x	4x

100K context memory projections

Predicted total VRAM before and after TurboQuant, plus the RTX 4090 count needed to hold each setup.

Assumes RTX 4090 nominal VRAM of 24GB, with practical allocation rounded up after framework overhead.

Model	Weights	Pure model VRAM	Total VRAM before	Total VRAM after	4090s before	4090s after	Change
ChatGLM-4 (9B)	BF16	18 GB	19.8 GB	18.3 GB	1	1	Extra headroom on a single 4090.
ChatGLM-4 (9B)	INT8	9 GB	10.8 GB	9.3 GB	1	1	Still single-card, with more buffer.
ChatGLM-4 (9B)	INT4	5 GB	6.8 GB	5.3 GB	1	1	Very comfortable single-card fit.
Qwen-2.5 (32B)	BF16	64 GB	69 GB	64.8 GB	3	3	Savings help, but not enough to drop a GPU.
Qwen-2.5 (32B)	INT8	32 GB	37 GB	32.8 GB	2	2	More margin on a 2x4090 node.
Qwen-2.5 (32B)	INT4	18 GB	23 GB	18.8 GB	2	1(-1)	Pulled back under the single-4090 limit.
Llama-3.1 (70B)	BF16	140 GB	150 GB	141.7 GB	7	6(-1)	Drops one RTX 4090 at 100K context.
Llama-3.1 (70B)	INT8	70 GB	80 GB	71.7 GB	4	3(-1)	Material hardware cost reduction.
Llama-3.1 (70B)	INT4	38 GB	48 GB	39.7 GB	3	2(-1)	Brings 70B into a practical dual-4090 envelope.
Mixtral 8x22B (141B MoE)	BF16	282 GB	288 GB	283 GB	13	13	MoE keeps KV share relatively small.
Mixtral 8x22B (141B MoE)	INT8	141 GB	147 GB	142 GB	7	7	Lower pressure, but same card class.
Mixtral 8x22B (141B MoE)	INT4	75 GB	81 GB	76 GB	4	4	Useful slack without a node count change.
DeepSeek-R1 (671B MoE)	FP8	700 GB	712 GB	702 GB	31	30(-1)	Saves one 4090 even at hyperscale.
DeepSeek-R1 (671B MoE)	INT4	350 GB	362 GB	352 GB	16	15(-1)	Still too large for small nodes, but one card disappears.

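Usage guide

From paper to product

How to put TurboQuant into a real stack

Current status

The paper provides theory and pseudocode, but there is no official open-source implementation yet. Community integration work has already started, however.

• llama.cpp Discussion #20969 is tracking integration ideas
• MLX experiments report about 5x compression with 99.5% quality retention
• An open-source release is widely expected around Q2 2026

Implementation sketch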

Precompute Lloyd-Max centroids

Do it once offline and reuse them.

# Python-like pseudocode
centroids = lloyd_max_quantizer(
    distribution="beta",
    bits=b
)
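
The pseudocode above leaves `lloyd_max_quantizer` undefined. One possible way to fill it in is a sample-based Lloyd iteration (1-D k-means) on draws from the coordinate distribution, using the fact that x = 2u - 1 with u ~ Beta((d-1)/2, (d-1)/2) has the density given in the PolarQuant section. The function name `lloyd_max_centroids`, the sample count, and the values d = 128 and bits = 3 are assumptions for illustration, not the paper's procedure.

```python
import numpy as np


def lloyd_max_centroids(samples: np.ndarray, bits: int, iters: int = 100) -> np.ndarray:
    """Fit 2**bits scalar centroids to samples with Lloyd's algorithm."""
    k = 2 ** bits
    # Initialize at evenly spaced quantiles so every cell starts non-empty.
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Cell boundaries are midpoints between neighbouring centroids.
        edges = (centroids[:-1] + centroids[1:]) / 2
        cells = np.searchsorted(edges, samples)
        # Move each centroid to the mean of its cell (keep it if the cell is empty).
        new = np.array([
            samples[cells == j].mean() if np.any(cells == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids, atol=1e-9):
            break
        centroids = new
    return centroids


d, bits = 128, 3  # example values (assumptions)
rng = np.random.default_rng(0)
# Coordinates of a random unit vector: x = 2u - 1, u ~ Beta((d-1)/2, (d-1)/2).
samples = 2 * rng.beta((d - 1) / 2, (d - 1) / 2, size=200_000) - 1
centroids = lloyd_max_centroids(samples, bits)

edges = (centroids[:-1] + centroids[1:]) / 2
mse = np.mean((samples - centroids[np.searchsorted(edges, samples)]) ** 2)
print("centroids:", np.round(centroids, 4))
print("per-coordinate MSE x d:", mse * d)
```

Since the codebook depends only on d and the bit width, it can be computed once offline and shipped as a small lookup table.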

Generate a random rotation matrix

Use QR decomposition to build an orthogonal matrix.

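# random rotation via QR decomposition of a Gaussian matrix
import numpy as np

d = 128  # vector dimension (example value)
G = np.random.randn(d, d)
Pi, _ = np.linalg.qr(G)  # Pi is orthogonal: Pi.T @ Pi ≈ I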

Build quant / dequant primitives

This is the core path for storage and recovery.

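def quant(x, Pi, centroids):
    # rotate, then map each coordinate to the index of its nearest centroid
    y = Pi @ x
    idx = np.abs(y[:, None] - centroids[None, :]).argmin(axis=1)
    return idx

def dequant(idx, Pi, centroids):
    # look up the centroids, then undo the rotation (Pi is orthogonal)
    y = centroids[idx]
    x = Pi.T @ y
    return x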
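
A short round trip shows how these two primitives might be used together. The evenly spaced codebook below is only a placeholder standing in for precomputed Lloyd-Max centroids, and d = 128, bits = 3, and the seed are assumed example values.

```python
import numpy as np


def quant(x, Pi, centroids):
    # Rotate, then pick the nearest centroid index for each coordinate.
    y = Pi @ x
    return np.abs(y[:, None] - centroids[None, :]).argmin(axis=1)


def dequant(idx, Pi, centroids):
    # Look up centroids and rotate back (Pi is orthogonal, so Pi.T inverts it).
    return Pi.T @ centroids[idx]


d, bits = 128, 3  # example values (assumptions)
rng = np.random.default_rng(0)
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Placeholder codebook: 2**bits evenly spaced levels covering about +/-2.5
# standard deviations of a rotated unit-vector coordinate (std = 1/sqrt(d)).
centroids = np.linspace(-2.5, 2.5, 2 ** bits) / np.sqrt(d)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)  # unit-norm input

idx = quant(x, Pi, centroids)
x_hat = dequant(idx, Pi, centroids)

print("index range:", idx.min(), "to", idx.max())
print("squared reconstruction error:", np.sum((x - x_hat) ** 2))
```

Only `idx` needs to be stored per vector, at `bits` bits per coordinate. `Pi` and `centroids` are shared across all vectors.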

Integrate inside attention

Store K/V in TurboQuant form and estimate inner products with QJL.

# Transformer attention
k_quant = turboquant_quant(k)
v_quant = turboquant_quant(v)
# use QJL during attention
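
The "use QJL during attention" step can be illustrated with a 1-bit sign-projection estimator of the inner product. The sketch below applies it directly to a key vector for simplicity, whereas the page describes QJL as a correction on the residual of the first stage. The function names and the sizes d = 64 and m = 20,000 are assumptions, and this is not the paper's kernel.

```python
import numpy as np


def qjl_encode(k: np.ndarray, S: np.ndarray):
    """Keep only the sign bits of the projection S @ k, plus the norm of k."""
    return np.sign(S @ k), np.linalg.norm(k)


def qjl_inner_product(q: np.ndarray, sign_bits: np.ndarray,
                      k_norm: float, S: np.ndarray) -> float:
    """Estimate <q, k> from the 1-bit code of k; q stays in full precision."""
    m = S.shape[0]
    # For Gaussian rows s: E[(s.q) * sign(s.k)] = sqrt(2/pi) * <q, k> / ||k||,
    # so rescaling by sqrt(pi/2) * ||k|| gives an unbiased estimate.
    return float(np.sqrt(np.pi / 2) / m * k_norm * ((S @ q) @ sign_bits))


d, m = 64, 20_000  # example sizes (assumptions); large m keeps variance low
rng = np.random.default_rng(0)
S = rng.standard_normal((m, d))  # shared Gaussian projection matrix

q = rng.standard_normal(d)
q /= np.linalg.norm(q)
k = rng.standard_normal(d)
k /= np.linalg.norm(k)

sign_bits, k_norm = qjl_encode(k, S)
estimate = qjl_inner_product(q, sign_bits, k_norm, S)
print("exact:", float(q @ k), "estimate:", estimate)
```

The only full-precision value stored per key is its norm, which matches the note further down that a single FP16 scalar is enough for the residual norm.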

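Deployment notes

Hardware

H100 and A100 are ideal. The paper's 8x speedup is measured in 4-bit mode.

Mixed precision

Use TurboQuant for the KV cache and INT4 for the weights to maximize overall compression.

Edge devices

A 3-bit KV cache could make 32K+ contexts practical even on phones.

Practical risks and mitigations

Random rotation overhead

Generating the matrix ahead of time and reusing it reduces the online cost.

Storing the residual norm

A single FP16 scalar is enough, so the overhead is very small.

Recommended open-source path

Fork llama.cpp → add a turboquant_quant kernel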

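Outlook

How TurboQuant could change the AI stack

LLM inference

The cost of million-token contexts drops substantially, and TurboQuant could become a default feature of next-generation model stacks.

Vector databases

Real-time indexing and sub-millisecond search become easier.

Edge AI

Long-context inference in mobile and embedded environments comes closer to reality.

Multimodal embeddings

The same idea can be extended to compressing image and video embeddings.

Theoretical extensions

Combined with outlier handling, it could move closer to practical 2-bit systems.

Community impact

Ecosystems such as vLLM and Hugging Face are likely to follow quickly.

Expected timeline

2026 Q2

Open-source code and framework integration

2026 Q4

Commercial products, probably cloud first

2027

May become established as the standard for LLM quantization

Risk note: mishandling the random seed can introduce a small bias, but the paper considers the effect negligible in high dimensions.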

FAQ

Frequently asked questions

The key questions engineers ask first

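Resources

Reference links and materials

Official blog

Google Research announcement post

TurboQuant paper

ICLR 2026 main paper PDF

PolarQuant paper

AISTATS 2026 polar transform paper

QJL paper

AAAI 2025 quantized JL transform

Community discussion

Reddit r/LocalLLaMA

Community for local LLM deployment

X (Twitter) #TurboQuant

Real-time reactions and discussion

llama.cpp #20969

Main integration discussion thread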

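Frequently asked questions

01 Does TurboQuant require training?

02 How does it compare with KIVI or standard 4-bit quantization?

03 Which models does it support?

04 Where is the code?

05 Why does it matter for vector search?

06 Is it really lossless?

07 What hardware does it need?

08 How does it relate to PolarQuant?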
