We built Treni, a C/CUDA runtime where routing, tokenization, tool models, and state run in one GPU process.

Most agent stacks see only serialized tool output as strings. Treni exposes execution signals in-process (entropy/logprobs, retrieval distance, route confidence), so the agent can branch before committing bad output.

Canonical A10G (G5), token-parity vs vLLM (max_tokens=48):

- TTFT: 5.130 ms (Treni) vs 84.837 ms (vLLM) -> 16.537x
- Full request: 316.403 ms vs 1232.660 ms -> 3.896x
- Cold total first response: 1320.240 ms vs 28937.430 ms -> 21.918x

Steady state:

- Warm mean: 80.602 ms
- Warm p99: 90.350 ms

Additional checks:

- Frontend A/B repeatability (warm_fixed + mixed_churn, repeats=3): custom path wins all tracked metrics
- Numerical parity vs PyTorch (strict mode): 0 failures

Separate OpenAI routing-overhead test (different question, not engine-vs-engine):

- Same model endpoint on both sides (gpt-5.2)
- Internal path: client -> OpenAI
- External path: client -> controller/tool hop -> same OpenAI endpoint
- Fairness-hardened local controls (runs=8):
  - model-only: near parity (int = 0.971x)
  - tool-only: external slower (int = 1.038x)

Docs + raw artifacts:

- https://treni-docs.pages.dev/docs/
- https://treni-docs.pages.dev/docs/objectives-and-thesis
- https://treni-docs.pages.dev/docs/leaderboard
- https://treni-docs.pages.dev/docs/trackb-claim-safe-table
- https://treni-docs.pages.dev/docs/raw-artifacts

Treni is a runtime project, not an agent framework. The problem we’re targeting is that most agents today operate on serialized tool outputs (HTTP -> JSON -> string parsing). That means the policy often cannot observe execution quality directly.

Our thesis is “seeing is believing”: keep models + routing + tokenization + tool execution + state in one GPU runtime so the agent can branch on execution signals (entropy/logprobs, retrieval distance/quality, route confidence) before committing the next step.

What we set out to prove (in order):

A) Speed: is a unified runtime materially faster than the Python stack?
B) Routing: does in-process routing beat external-hop routing under matched tasks?
C) Awareness: do uncertainty-aware loops improve multi-step outcomes?

Current status from the docs:

- Speed: warm path on G5 is 80.602 ms mean / 90.350 ms p99; warm baseline ratio is 29x vs the Python pipeline.
- Runtime vs vLLM (external-cold, token parity, max_tokens=48): 5.130 vs 84.837 ms TTFT (16.537x), 316.403 vs 1232.660 ms full (3.896x), 1320.240 vs 28937.430 ms cold-total (21.918x).
- Routing matrix (G5): overall ext/int 1.208x (internal faster); internal error 0.0000, external error 0.0347.
- Awareness harness (C2): runtime-native uncertainty deltas are positive in both baseline and stress (e.g., +0.1539 internal baseline; +0.1154 external stress).

Paper connection: the Entropy-Guided Loop paper (arXiv:2509.00079) is the research base for the uncertainty-trigger logic. It reports ~95% of a larger reasoning model's performance at ~1/3 of the compute, with selective refinement activating on ~31% of responses and a +16pp accuracy lift.

What is still open:

- lower startup overhead in the fused miss-mitigation path
- higher-N, region-pinned commercial reruns to tighten CIs

If you have ideas, I’d really value specific feedback on:

1) failure modes we should add to the harness
2) uncertainty policies (thresholds, rolling debt, escalation rules)
3) task families that would be most convincing for decision-quality (not just latency)