Community · numaya.ai

We got a real million-token context window out of an open model on two RTX 3090s — not a theoretical limit, but a prompt that actually fills it — and kept it coherent enough to drive agentic tool-calling. We benchmarked seven configurations of Qwen3.6-35B-A3B on llama.cpp TurboQuant across three context depths, ran a full 1M-token needle-in-haystack test, and put the winner through an eight-test agentic eval battery. Everything below is the result. Config D wins.

Why this is hard

True million-token context is normally the preserve of datacentre GPUs. The open question: can a quantised open model hold a real 1M-token window on hardware that fits under a desk — 48 GB of VRAM total — while staying stable enough to call tools reliably? The short answer is yes, but only with the right quantisation choice. Here is the full data.

The winner — Config D

Model	Qwen3.6-35B-A3B Q6_K (26.6 GB, near-lossless quality)
Engine	llama-cpp-turboquant v0.1.1
KV cache	K = `q8_0` / V = `turbo3`
Context	1,048,576 tokens (1M)
VRAM	~44 GB / 48 GB
Agentic eval	7/8 (87.5%) — the one failure is a llama.cpp engine limit, not the model

1M context test — PASS

A needle-in-haystack run pushed a prompt to 1,038,653 tokens — 99% of the 1M limit — and recovered the needle as an exact match, with no crash and no out-of-memory.

Metric	Value
Prompt tokens	1,038,653 (99% of the 1M limit)
Needle found	Yes — exact match
Total time	~32 min (prefill + generation)
Crash / OOM	None

Config D is confirmed stable at true 1M context.

Agentic eval — 7/8

A custom tool-calling and instruction-following battery, run against the OpenAI-compatible API:

Test	What it checks	Result
T1 — Single tool call	Correct tool, valid JSON	PASS
T2 — Multi-tool selection	Picks correctly from 3 options, correct args	PASS
T3 — Tool result synthesis	Correct file categorisation from results	PASS
T4 — Nested schema	All required fields, valid enum + arrays	PASS
T5 — Strict JSON-only output	Honours format (needs `/no_think` prefix)	PASS
T6 — Parallel tool calls	—	FAIL*
T7 — Hallucination resistance	Refuses to invent real-time data	PASS
T8 — Ordered multi-step plan	Correct read→write sequencing	PASS

*T6 is a server limitation, not model quality: vLLM passes T6 but caps context at 262K on this hardware, so it can't hold the 1M window. You get parallel calls or 1M context, not both — and 1M context is the harder prize.

Full benchmark matrix — seven configs

Peak text-generation throughput (best single run), measured at three prompt depths: empty (d0), 32K, and 131K tokens.

Config	Model	KV K/V	Ctx	d0 t/s	d32K t/s	d131K t/s	Stable
A	Q4_K_M	q4_0/q4_0	1M	120	83	39	Crash*
B	Q4_K_M	q8_0/turbo3	1M	115.5	71	31	Crash*
C	Q4_K_M	t3/t3	1M	73.5	45	21	OK
D	Q6_K	q8_0/turbo3	1M	110	68.5	31.5	OK
E	Q6_K	t3/t3	1M	74	44.5	21	OK
F	Q4_K_M	q8_0/turbo3	256K	118.5	72	31.5	OK
G	Q6_K	q8_0/turbo3	256K	112.5	67.8	31.5	OK

*Crash = server crash at the d131K second run. Configs A and B are the fastest but unstable at 1M; Config D trades a few tokens/sec for the stability and near-lossless Q6_K quality that hold up at the full window.

Key finding — "turbo" means smaller, not faster

TurboQuant KV is ~38–46% slower than plain q4_0 KV. The WHT rotation and Lloyd-Max dequant add compute that outweighs the bandwidth savings on an RTX 3090, so "turbo" buys you compression, not speed. The practical rule:

Want maximum tokens/sec? Use q4_0/q4_0 (Config A).
Want maximum quality and stability at 1M? Use q8_0/turbo3 (Config D).

Why it fits — only 10 KV layers

Qwen3.6-35B-A3B is a hybrid architecture: of its 40 layers, only 10 are full-attention. The other 30 are Gated DeltaNet (SSM-like) with a fixed recurrent state and no KV cache. So KV memory is roughly 4× lower than a naive 40-layer estimate — which is the whole reason 1M context fits in 48 GB at all. The real KV VRAM at 1M context (2 KV heads, head_dim 256):

KV quant	Bytes/elem	VRAM at 1M
`q4_0`	0.500	5.12 GB
`q8_0`	1.000	10.24 GB
`turbo3`	0.406	4.15 GB

Hardware

2× RTX 3090 (24 GB each = 48 GB total)
Windows 11, CUDA 12.4+
~80 GB free disk (model 26.6 GB + engine + working space)

That's the complete result set. The repository has the launch scripts, the model downloader, the full benchmark harness, the eight-test agentic eval, the 1M needle test, and a step-by-step setup guide — grab them below.

Get the scripts, benchmark harness & setup guide on GitHub ↗Download only — every result is on this page.

← All research