Frontier-Quality Coding at Cheap-Tier Cost
You can get frontier-quality coding scores at a fraction of the cost.
We built a system that uses a cheap local model for most tasks. It only sends hard problems to a frontier model. This method works because of the structure, not just the model size.
How the architecture works:
- Two channels: A capability channel (the cheap local model) and a structure channel (verification gates).
- Verification: Guards decide if an answer is trustworthy.
- Escalation: If guards fail, the system moves the request to a frontier model.
- Cache: A cache layer prevents re-solving exact repeats.
The results from our HumanEval+ tests:
- Full cascade score: 94.5% plus correctness.
- Local model solo score: 84.8% plus correctness.
- The structure channel adds roughly 10 points of accuracy.
We tested the importance of the structure through an ablation study:
- Full system: 100% correct.
- Removed verification: 75% correct.
- Removed guards: 50% correct.
Correctness drops by half when you remove the guards. This proves the structure carries the reliability.
The cost benefits:
- Blended cost: $0.00201 per request.
- Frontier cost: $0.017 per request.
- Our system is about 8x cheaper than using a frontier model for every request.
- 91% of requests are served by the local model.
A note on long context:
Our compaction layer uses 165 tokens compared to 28,000 tokens for raw context. This is a massive increase in efficiency. We hit an infrastructure limit at 208k tokens, but this is a setting, not a model failure.
What we have not proven yet:
We do not have official long-horizon benchmark numbers. We have built the runners for RULER and SWE-bench, but we have not run them in a clean sandbox. We are not claiming official results for long-horizon performance yet.
Summary of our claim:
Our system matches frontier coding scores while using cheap local models. This reduces costs by 8x. The reliability comes from our structure channel.
Optional learning community: https://t.me/GyaanSetuAi
