Probably Raises $9M to Combat LLM Hallucinations with Precision Engineering
As Large Language Models (LLMs) become increasingly integrated into professional workflows, the industry faces a persistent hurdle: the tendency for even the most advanced models to hallucinate. Startup Probably is tackling this challenge head-on, securing $9 million in seed funding led by Andreessen Horowitz to build a more rigorous, deterministic approach to AI reliability.
Moving Toward 99.99% Accuracy
The core mission of Probably, led by founder Peter Elias, is to bridge the gap between the probabilistic nature of LLMs and the 99.99% accuracy standard expected of deterministic systems. In high-stakes environments, a single factual error can render an AI tool useless. To solve this, Probably is moving away from the idea that accuracy is purely a function of model size and instead focusing on "harness engineering."
The company’s flagship product is a data science tool designed to extract insights from complex datasets. Unlike standard chatbots that return conversational responses, Probably’s tool attaches a specific citation and a transparent audit trail to every answer, allowing users to verify the logic behind each output.
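To make the idea of a cited, auditable answer concrete, here is a minimal sketch of what such a record could look like. It is an illustration only and not Probably's actual data model: the names `Citation`, `AuditedAnswer`, and `mean_with_audit`, and the file name `sales.csv`, are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    """Points at the data an answer was derived from (hypothetical shape)."""
    source: str   # name of the dataset or table
    rows: tuple   # indices of the rows that were used


@dataclass
class AuditedAnswer:
    """An answer bundled with its citation and the steps that produced it."""
    value: float
    citation: Citation
    audit_trail: list = field(default_factory=list)


def mean_with_audit(source: str, values: list) -> AuditedAnswer:
    """Compute a mean while recording every intermediate step."""
    trail = [f"loaded {len(values)} rows from {source}"]
    total = sum(values)
    trail.append(f"sum = {total}")
    mean = total / len(values)
    trail.append(f"mean = sum / {len(values)} = {mean}")
    citation = Citation(source=source, rows=tuple(range(len(values))))
    return AuditedAnswer(value=mean, citation=citation, audit_trail=trail)


answer = mean_with_audit("sales.csv", [10, 20, 30])
for step in answer.audit_trail:
    print(step)
```

Because each step is recorded alongside the rows it used, a reviewer can recompute the figure independently instead of trusting the output.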
The "Data Science Mech Suit" Architecture
Rather than relying solely on the reasoning capabilities of a massive model, Probably utilizes what Elias calls a "data science mech suit." This architecture functions as an elaborate harness system where the LLM’s initial output is immediately scrutinized by a deterministic validator.
If the LLM produces a result that does not align perfectly with the underlying dataset, the validator rejects it. Crucially, the LLM is trained specifically against this validator, creating a closed-loop system optimized for speed and factual integrity. This approach operates on a fundamental principle: by refining the context and reducing ambiguity through engineering, you can force the model to "do the right thing" without requiring massive computational brute force.
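The generate-then-validate loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Probably's implementation: the dataset, the `Claim` type, the retry limit, and the `flaky_model` stand-in for an LLM are all hypothetical, and the sketch omits the step of training the model against the validator.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical dataset: region -> individual sales figures.
DATASET = {"north": [120, 80, 100], "south": [90, 60]}


@dataclass
class Claim:
    """A structured answer the model proposes."""
    region: str
    total: int


def validate(claim: Claim) -> bool:
    """Deterministic check: recompute the total directly from the dataset."""
    rows = DATASET.get(claim.region)
    return rows is not None and sum(rows) == claim.total


def answer_with_harness(
    generate: Callable[[str, int], Claim],
    question: str,
    max_attempts: int = 3,
) -> Optional[Claim]:
    """Return the first claim that passes validation, or None if none does."""
    for attempt in range(max_attempts):
        claim = generate(question, attempt)
        if validate(claim):
            return claim
    return None  # refuse to answer rather than return an unverified claim


def flaky_model(question: str, attempt: int) -> Claim:
    """Stand-in for an LLM: wrong on the first attempt, right afterwards."""
    guesses = [Claim("north", 310), Claim("north", 300)]
    return guesses[min(attempt, len(guesses) - 1)]


result = answer_with_harness(flaky_model, "Total sales in the north region?")
print(result)
```

The design choice worth noting is that the validator never trusts the model: it recomputes the figure from the data, so a rejected claim costs only a retry, and an answer that never validates is withheld.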
Efficiency Through Smaller, Local Models
One of the most significant technical implications of Probably’s approach is the ability to use smaller, more efficient models. Because the "mech suit" handles the heavy lifting of validation and context refinement, the system can operate on models that are "four classes weaker than frontier models."
This shift has massive economic and operational benefits:
- Reduced Token Costs: Smaller models significantly lower the per-query cost, a vital factor as enterprises look to optimize AI budgets.
- Local Execution: These lighter models can run on local hardware, such as desktop computers, rather than requiring expensive, high-latency data center connections.
- Extensibility: The engine is designed to extend beyond data science into precision-sensitive sectors like accounting and medical services.
Challenging the Big AI Lab Incentive Model
Elias points out a structural misalignment in the current AI landscape: major AI labs are incentivized to build massive, general-purpose models that require frequent user corrections. Since these labs often charge based on token usage, more errors and more follow-up queries can actually increase revenue. By focusing on precision and "reducing ambiguity" through engineering rather than scale, Probably is carving out a niche for mission-critical AI applications where reliability is the only metric that matters.
Key Takeaways
- Deterministic Validation: Probably uses a "mech suit" architecture to check LLM outputs against a deterministic validator, aiming for 99.99% accuracy.
- Cost-Effective Engineering: By reducing ambiguity through better context engineering, the system can run on much smaller, cheaper models that can operate on local hardware.
- Precision-First Focus: The technology is designed to move AI into high-stakes, precision-sensitive sectors like accounting and medical services, where hallucinations are unacceptable.