Qwen3.6-35B-A3B at 1M context on two RTX 3090s

Research

We got a real million-token context window out of an open model on two RTX 3090s — not a theoretical limit, but a prompt that actually fills it — and kept it coherent enough to drive agentic tool-calling. We benchmarked seven configurations of Qwen3.6-35B-A3B on llama.cpp TurboQuant across three context depths, ran a full 1M-token needle-in-haystack test, and put the winner through an eight-test agentic eval battery. Everything below is the result. Config D wins.

Why this is hard

True million-token context is normally the preserve of datacentre GPUs. The open question: can a quantised open model hold a real 1M-token window on hardware that fits under a desk — 48 GB of VRAM total — while staying stable enough to call tools reliably? The short answer is yes, but only with the right quantisation choice. Here is the full data.

The winner — Config D

ModelQwen3.6-35B-A3B Q6_K (26.6 GB, near-lossless quality)
Enginellama-cpp-turboquant v0.1.1
KV cacheK = q8_0 / V = turbo3
Context1,048,576 tokens (1M)
VRAM~44 GB / 48 GB
Agentic eval7/8 (87.5%) — the one failure is a llama.cpp engine limit, not the model

1M context test — PASS

A needle-in-haystack run pushed a prompt to 1,038,653 tokens — 99% of the 1M limit — and recovered the needle as an exact match, with no crash and no out-of-memory.

MetricValue
Prompt tokens1,038,653 (99% of the 1M limit)
Needle foundYes — exact match
Total time~32 min (prefill + generation)
Crash / OOMNone

Config D is confirmed stable at true 1M context.

Agentic eval — 7/8

A custom tool-calling and instruction-following battery, run against the OpenAI-compatible API:

TestWhat it checksResult
T1 — Single tool callCorrect tool, valid JSONPASS
T2 — Multi-tool selectionPicks correctly from 3 options, correct argsPASS
T3 — Tool result synthesisCorrect file categorisation from resultsPASS
T4 — Nested schemaAll required fields, valid enum + arraysPASS
T5 — Strict JSON-only outputHonours format (needs /no_think prefix)PASS
T6 — Parallel tool callsFAIL*
T7 — Hallucination resistanceRefuses to invent real-time dataPASS
T8 — Ordered multi-step planCorrect read→write sequencingPASS

*T6 is a server limitation, not model quality: vLLM passes T6 but caps context at 262K on this hardware, so it can't hold the 1M window. You get parallel calls or 1M context, not both — and 1M context is the harder prize.

Full benchmark matrix — seven configs

Peak text-generation throughput (best single run), measured at three prompt depths: empty (d0), 32K, and 131K tokens.

ConfigModelKV K/VCtxd0 t/sd32K t/sd131K t/sStable
AQ4_K_Mq4_0/q4_01M1208339Crash*
BQ4_K_Mq8_0/turbo31M115.57131Crash*
CQ4_K_Mt3/t31M73.54521OK
DQ6_Kq8_0/turbo31M11068.531.5OK
EQ6_Kt3/t31M7444.521OK
FQ4_K_Mq8_0/turbo3256K118.57231.5OK
GQ6_Kq8_0/turbo3256K112.567.831.5OK

*Crash = server crash at the d131K second run. Configs A and B are the fastest but unstable at 1M; Config D trades a few tokens/sec for the stability and near-lossless Q6_K quality that hold up at the full window.

Key finding — "turbo" means smaller, not faster

TurboQuant KV is ~38–46% slower than plain q4_0 KV. The WHT rotation and Lloyd-Max dequant add compute that outweighs the bandwidth savings on an RTX 3090, so "turbo" buys you compression, not speed. The practical rule:

  • Want maximum tokens/sec? Use q4_0/q4_0 (Config A).
  • Want maximum quality and stability at 1M? Use q8_0/turbo3 (Config D).

Why it fits — only 10 KV layers

Qwen3.6-35B-A3B is a hybrid architecture: of its 40 layers, only 10 are full-attention. The other 30 are Gated DeltaNet (SSM-like) with a fixed recurrent state and no KV cache. So KV memory is roughly 4× lower than a naive 40-layer estimate — which is the whole reason 1M context fits in 48 GB at all. The real KV VRAM at 1M context (2 KV heads, head_dim 256):

KV quantBytes/elemVRAM at 1M
q4_00.5005.12 GB
q8_01.00010.24 GB
turbo30.4064.15 GB

Hardware

  • 2× RTX 3090 (24 GB each = 48 GB total)
  • Windows 11, CUDA 12.4+
  • ~80 GB free disk (model 26.6 GB + engine + working space)

That's the complete result set. The repository has the launch scripts, the model downloader, the full benchmark harness, the eight-test agentic eval, the 1M needle test, and a step-by-step setup guide — grab them below.

Get the scripts, benchmark harness & setup guide on GitHub ↗Download only — every result is on this page.

← All research