We got a real million-token context window out of an open model on two RTX 3090s — not a theoretical limit, but a prompt that actually fills it — and kept it coherent enough to drive agentic tool-calling. We benchmarked seven configurations of Qwen3.6-35B-A3B on llama.cpp TurboQuant across three context depths, ran a full 1M-token needle-in-haystack test, and put the winner through an eight-test agentic eval battery. Everything below is the result. Config D wins.
Why this is hard
True million-token context is normally the preserve of datacentre GPUs. The open question: can a quantised open model hold a real 1M-token window on hardware that fits under a desk — 48 GB of VRAM total — while staying stable enough to call tools reliably? The short answer is yes, but only with the right quantisation choice. Here is the full data.
The winner — Config D
| Model | Qwen3.6-35B-A3B Q6_K (26.6 GB, near-lossless quality) |
|---|---|
| Engine | llama-cpp-turboquant v0.1.1 |
| KV cache | K = q8_0 / V = turbo3 |
| Context | 1,048,576 tokens (1M) |
| VRAM | ~44 GB / 48 GB |
| Agentic eval | 7/8 (87.5%) — the one failure is a llama.cpp engine limit, not the model |
1M context test — PASS
A needle-in-haystack run pushed a prompt to 1,038,653 tokens — 99% of the 1M limit — and recovered the needle as an exact match, with no crash and no out-of-memory.
| Metric | Value |
|---|---|
| Prompt tokens | 1,038,653 (99% of the 1M limit) |
| Needle found | Yes — exact match |
| Total time | ~32 min (prefill + generation) |
| Crash / OOM | None |
Config D is confirmed stable at true 1M context.
Agentic eval — 7/8
A custom tool-calling and instruction-following battery, run against the OpenAI-compatible API:
| Test | What it checks | Result |
|---|---|---|
| T1 — Single tool call | Correct tool, valid JSON | PASS |
| T2 — Multi-tool selection | Picks correctly from 3 options, correct args | PASS |
| T3 — Tool result synthesis | Correct file categorisation from results | PASS |
| T4 — Nested schema | All required fields, valid enum + arrays | PASS |
| T5 — Strict JSON-only output | Honours format (needs /no_think prefix) | PASS |
| T6 — Parallel tool calls | — | FAIL* |
| T7 — Hallucination resistance | Refuses to invent real-time data | PASS |
| T8 — Ordered multi-step plan | Correct read→write sequencing | PASS |
*T6 is a server limitation, not model quality: vLLM passes T6 but caps context at 262K on this hardware, so it can't hold the 1M window. You get parallel calls or 1M context, not both — and 1M context is the harder prize.
Full benchmark matrix — seven configs
Peak text-generation throughput (best single run), measured at three prompt depths: empty (d0), 32K, and 131K tokens.
| Config | Model | KV K/V | Ctx | d0 t/s | d32K t/s | d131K t/s | Stable |
|---|---|---|---|---|---|---|---|
| A | Q4_K_M | q4_0/q4_0 | 1M | 120 | 83 | 39 | Crash* |
| B | Q4_K_M | q8_0/turbo3 | 1M | 115.5 | 71 | 31 | Crash* |
| C | Q4_K_M | t3/t3 | 1M | 73.5 | 45 | 21 | OK |
| D | Q6_K | q8_0/turbo3 | 1M | 110 | 68.5 | 31.5 | OK |
| E | Q6_K | t3/t3 | 1M | 74 | 44.5 | 21 | OK |
| F | Q4_K_M | q8_0/turbo3 | 256K | 118.5 | 72 | 31.5 | OK |
| G | Q6_K | q8_0/turbo3 | 256K | 112.5 | 67.8 | 31.5 | OK |
*Crash = server crash at the d131K second run. Configs A and B are the fastest but unstable at 1M; Config D trades a few tokens/sec for the stability and near-lossless Q6_K quality that hold up at the full window.
Key finding — "turbo" means smaller, not faster
TurboQuant KV is ~38–46% slower than plain q4_0 KV. The WHT rotation and Lloyd-Max
dequant add compute that outweighs the bandwidth savings on an RTX 3090, so "turbo" buys you
compression, not speed. The practical rule:
- Want maximum tokens/sec? Use
q4_0/q4_0(Config A). - Want maximum quality and stability at 1M? Use
q8_0/turbo3(Config D).
Why it fits — only 10 KV layers
Qwen3.6-35B-A3B is a hybrid architecture: of its 40 layers, only 10 are full-attention. The other 30 are Gated DeltaNet (SSM-like) with a fixed recurrent state and no KV cache. So KV memory is roughly 4× lower than a naive 40-layer estimate — which is the whole reason 1M context fits in 48 GB at all. The real KV VRAM at 1M context (2 KV heads, head_dim 256):
| KV quant | Bytes/elem | VRAM at 1M |
|---|---|---|
q4_0 | 0.500 | 5.12 GB |
q8_0 | 1.000 | 10.24 GB |
turbo3 | 0.406 | 4.15 GB |
Hardware
- 2× RTX 3090 (24 GB each = 48 GB total)
- Windows 11, CUDA 12.4+
- ~80 GB free disk (model 26.6 GB + engine + working space)
That's the complete result set. The repository has the launch scripts, the model downloader, the full benchmark harness, the eight-test agentic eval, the 1M needle test, and a step-by-step setup guide — grab them below.
Get the scripts, benchmark harness & setup guide on GitHub ↗Download only — every result is on this page.