Blog · numaya.ai

We got a real million-token context window, and 7/8 on an agentic eval, out of an open model on two RTX 3090s. This is not a theoretical limit: we ran a prompt of 1,038,653 tokens, the needle was recovered exactly, and there was no OOM.

The trick is the model's hybrid architecture (only 10 of 40 layers carry a KV cache) plus a careful KV-quantisation choice. We wrote up the full seven-config benchmark, the 1M test, and the surprising finding that "TurboQuant" KV is slower, not faster.
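To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV-cache size scales with the number of layers that keep a cache and with the bytes stored per element. Only the token count and the 10-of-40 layer split come from this post. The head count and head dimension are hypothetical placeholders, not the model's real configuration, and the 1-byte case is an illustrative 8-bit format that ignores per-block scale overhead; it is not necessarily the quantisation we chose.

```python
def kv_cache_bytes(tokens, kv_layers, kv_heads, head_dim, bytes_per_elem):
    """Approximate KV-cache size: one key and one value vector per token,
    per KV head, for every layer that actually keeps a cache."""
    return 2 * kv_layers * kv_heads * head_dim * tokens * bytes_per_elem


def gib(n_bytes):
    """Convert bytes to GiB for readable output."""
    return n_bytes / 2**30


TOKENS = 1_038_653   # prompt length reported in the post
TOTAL_LAYERS = 40    # from the post
KV_LAYERS = 10       # from the post: only these layers carry a KV cache
KV_HEADS = 8         # ASSUMPTION: hypothetical placeholder value
HEAD_DIM = 128       # ASSUMPTION: hypothetical placeholder value

# Hypothetical baseline: every layer keeps a 16-bit cache.
dense_fp16 = kv_cache_bytes(TOKENS, TOTAL_LAYERS, KV_HEADS, HEAD_DIM, 2)
# Hybrid layout: only 10 of 40 layers keep a cache, still 16-bit.
hybrid_fp16 = kv_cache_bytes(TOKENS, KV_LAYERS, KV_HEADS, HEAD_DIM, 2)
# Hybrid layout with an illustrative 1-byte-per-element cache.
hybrid_8bit = kv_cache_bytes(TOKENS, KV_LAYERS, KV_HEADS, HEAD_DIM, 1)

print(f"dense 16-bit  : {gib(dense_fp16):.1f} GiB")
print(f"hybrid 16-bit : {gib(hybrid_fp16):.1f} GiB")
print(f"hybrid 8-bit  : {gib(hybrid_8bit):.1f} GiB")
```

The absolute numbers depend entirely on the placeholder head settings, so only the ratios are meaningful: the hybrid layout needs a quarter of the cache of an all-layers design, and halving the bytes per element halves it again.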

Read the full research write-up and grab the open-source benchmark harness.

