Research · pre-beta

How a cascade reaches the frontier, for a fraction of the cost.

A full write-up of the methods and measurements behind Tīrtha is in progress. Below is what it will cover, published as we finish each section, with every number shown honestly as the directional, pre-beta result it is.

Published · measured 2026-06-23

Speed: retrieval vs solving fresh

We timed it on the live gateway: send a coding task, then send the very same task again. The first time, the team solves it end-to-end; the second, the answer is already remembered. We measured both, wall-clock, across a dozen tasks.

~0.16sto return remembered work, steady every time

24–185×faster than solving fresh · median 71×

4–28sto solve a brand-new problem from scratch

Answer length	Solved fresh	On repeat	Faster
Short	4.0s	0.17s	24×
Medium	12.0s	0.17s	71×
Long	27.8s	0.15s	185×

Retrieval is rock-steady at ~0.16s; the gap is set by how much there is to generate the first time, so a short answer is ~24× faster on repeat, a long one ~185×. Method: live api.tirtha.ai, the identical request sent twice, end-to-end wall-clock, n=8 clean fresh-solve cases. Directional · pre-beta · to be re-run on our own hardware.

Published · measured 2026-06-19

Accuracy: frontier parity on HumanEval+

The whole thesis is that you don't have to send every problem to a frontier model to get frontier-quality answers. We tested it head-to-head on the full HumanEval+ benchmark: 164 coding problems, scored by actually running each solution against its tests (pass@1, a single attempt, the same harness for every system).

95%Tīrtha: statistical parity with the top frontier models

57%escalated to a frontier model; the other 43% served free by the lightweight tier

82%the lightweight model alone (a 7B, the tier used in this run)

System	Solved correctly · pass@1 · 164 problems
Tīrtha	95%
Claude	96%
Codex	95%
One model, alone	82%

The number that matters isn't just the 95%. It's that Tīrtha reached it while sending only 57% of problems to a frontier model. The lightweight tier drafts an answer; an automated check decides whether it's trustworthy or needs escalating. Frontier-level accuracy, with frontier prices paid only where they're earned. Method: full HumanEval+ (164 problems), pass@1, each solution scored by execution against its tests, identical harness per system, run 2026-06-19. The lightweight tier in this run was a 7B model; today's lightweight tier is substantially stronger, so we expect to hold this accuracy while handling an even larger share without a frontier model. These are pre-beta numbers and we're still testing, and we'll re-run the full suite on our own hardware as we move out of pre-beta, and we expect them to get better, not worse. Gemini is excluded (quota-blocked mid-run → invalid). Directional · pre-beta.

Model · projected, not production-validated

The cost curve: where the savings come from

The savings aren't a discount. They're structural. Send everyday work to a lightweight tier and only the genuinely hard problems to a frontier model, and the bill is lower from the very first request. Then, as the system recognizes work it has seen before, it drops further.

~8×lower on day one, from routing alone, before it has learned anything

~60×lower on routine work as it learns your workload (~36× on the hard tail)

exact repeatsonly. A deliberately conservative model; near-duplicates compound it faster

This is a cost model, not a measured production bill, projected from per-request routing economics and a conservative reuse assumption (it counts only exact repeats). It has not been tested on the full production system yet, and it is not final. The day-one routing saving is the firmest part; the compounding tail depends on how repetitive a given workload is. Directional · pre-beta.

The honest ledger

Honesty & limits: what's measured, what's projected, what isn't tested yet

We're pre-beta, and we'd rather you trust the numbers than be dazzled by them. So here's the honest accounting of every claim on this site.

Claim	Status
Speed on repeat work: ~0.16s, 24–185×	Measured (live gateway, n=8)
Accuracy: 95% on HumanEval+, frontier parity	Measured (full 164, pass@1; 2026-06-19, on a 7B tier; current tier is stronger)
Cost: ~8× day one, compounding further	Projected (a model, not a production bill; not final)
Lightweight-tier coding (aider-polyglot)	Early (directional; Python subset, single run; not the full leaderboard metric)
Retrieval-quality improvements	Measured internally. The results are real; the method stays private

What we haven't done yet: re-run the full suite on our own hardware (we're on borrowed compute), measure across every benchmark language, and validate the cost model against a real production bill at scale. We'll publish each as we finish it, and we expect the numbers to get better, not worse. If a claim here isn't marked "measured," treat it as a direction of travel, not a guarantee.

What's coming

The cascade approach. Meeting each request at the level it needs, instead of one model for everything.

soon

Coding at scale. An aider-polyglot run on an open-weight lightweight tier (directional, Python-only so far).

soon

Retrieval quality. Results measured internally; method kept private.

internal

Want the results as they land?

We're publishing section by section. Email [email protected] and we'll send the write-up as each part goes live.