Research

Original thinking from building AI that runs in production.

Grounding agents without hallucination drift

A practical retrieval architecture that keeps agents factual over long sessions.

The real cost of LLM eval at scale

What we learned running 2M evaluations across production systems.

Human-in-the-loop that actually scales

Designing review workflows that don't become the bottleneck.