Probably Raises $9M to Combat LLM Hallucinations with Precision Engineering

📅2 hours ago⏱3 min read

As Large Language Models (LLMs) become increasingly integrated into professional workflows, the industry faces a persistent hurdle: the tendency for even the most advanced models to hallucinate. Startup Probably is tackling this challenge head-on, securing $9 million in seed funding led by Andreessen Horowitz to build a more rigorous, deterministic approach to AI reliability.

Moving Toward 99.99% Accuracy

The core mission of Probably, led by founder Peter Elias, is to bridge the gap between the probabilistic nature of LLMs and the 99.99% accuracy standard expected of deterministic systems. In high-stakes environments, a single factual error can render an AI tool useless. To solve this, Probably is moving away from the idea that accuracy is purely a function of model size and instead focusing on "harness engineering."

The company’s flagship product is a data science tool designed to extract insights from complex datasets. Unlike standard chatbots that return conversational responses, Probably’s tool attaches a specific citation and a transparent audit trail to every answer, allowing users to verify the logic behind each output.
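To make the idea of an answer that carries its own citation and audit trail concrete, here is a minimal Python sketch. It is not Probably's implementation, which the article does not describe; the names `CitedAnswer`, `AuditStep`, and `mean_of_column`, and the `source#column` citation format, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AuditStep:
    """One recorded step in how an answer was computed (hypothetical shape)."""
    operation: str
    detail: str


@dataclass
class CitedAnswer:
    """An answer bundled with where it came from and how it was derived."""
    value: float
    citation: str
    audit_trail: List[AuditStep] = field(default_factory=list)


def mean_of_column(rows: List[Dict[str, float]], column: str, source: str) -> CitedAnswer:
    """Compute a column mean and record each step so a reader can re-check it."""
    values = [row[column] for row in rows]
    trail = [
        AuditStep("select", f"column '{column}' from {len(rows)} rows"),
        AuditStep("aggregate", f"mean of {values}"),
    ]
    # The citation format "source#column" is an assumption for this sketch.
    return CitedAnswer(sum(values) / len(values), f"{source}#{column}", trail)


if __name__ == "__main__":
    answer = mean_of_column([{"price": 10}, {"price": 20}], "price", "orders.csv")
    print(answer.value, answer.citation)
    for step in answer.audit_trail:
        print(step.operation, "->", step.detail)
```

The point of the structure is that the value never travels alone: anyone consuming the answer can see which source it cites and replay the recorded steps.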

The "Data Science Mech Suit" Architecture

Rather than relying solely on the reasoning capabilities of a massive model, Probably uses what Elias calls a "data science mech suit." This architecture functions as an elaborate harness in which the LLM’s initial output is immediately checked by a deterministic validator.

If the LLM produces a result that does not align perfectly with the underlying dataset, the validator rejects it. Crucially, the LLM is trained specifically against this validator, creating a closed-loop system optimized for speed and factual integrity. This approach operates on a fundamental principle: by refining the context and reducing ambiguity through engineering, you can force the model to "do the right thing" without requiring massive computational brute force.
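The reject-and-retry behavior described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the company's actual harness: the `SALES` data, the `validate_total` check, the `flaky_model` stand-in for an LLM call, and the retry limit are all hypothetical, and the sketch does not cover training the model against the validator.

```python
from typing import Callable, Optional

# Toy data standing in for "the underlying dataset" (hypothetical values).
SALES = [
    {"region": "north", "revenue": 120},
    {"region": "south", "revenue": 80},
    {"region": "north", "revenue": 50},
]


def validate_total(region: str, claimed: int) -> bool:
    """Deterministic check: recompute the figure from the data and compare exactly."""
    expected = sum(row["revenue"] for row in SALES if row["region"] == region)
    return claimed == expected


def answer_with_harness(
    generate: Callable[[str, int], int],
    region: str,
    max_attempts: int = 3,
) -> Optional[int]:
    """Ask the generator for an answer and accept it only if the validator agrees."""
    for attempt in range(1, max_attempts + 1):
        claimed = generate(region, attempt)
        if validate_total(region, claimed):
            return claimed
    # No attempt passed validation: return nothing rather than an unverified figure.
    return None


def flaky_model(region: str, attempt: int) -> int:
    """Stand-in for an LLM call: wrong on the first attempt, correct afterwards."""
    return 999 if attempt == 1 else 170


if __name__ == "__main__":
    print(answer_with_harness(flaky_model, "north"))
    print(answer_with_harness(flaky_model, "north", max_attempts=1))
```

The design choice worth noting is the failure path: when every attempt is rejected, the harness returns `None` instead of passing along the model's last guess.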

Efficiency Through Smaller, Local Models

One of the most significant technical implications of Probably’s approach is the ability to use smaller, more efficient models. Because the "mech suit" handles the heavy lifting of validation and context refinement, the system can operate on models that are "four classes weaker than frontier models."

This shift has both economic and operational benefits:

Reduced Token Costs: Smaller models significantly lower the per-query cost, a vital factor as enterprises look to optimize AI budgets.
Local Execution: These lighter models can run on local hardware, such as desktop computers, rather than requiring expensive, high-latency data center connections.
Scalability: The engine is designed to be extensible beyond data science into precision-sensitive sectors like accounting and medical services.
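The token-cost point comes down to simple arithmetic, sketched below in Python. The function `cost_per_answer` and every number in the example are hypothetical, picked only to show the shape of the calculation; the article gives no actual prices, token counts, or retry rates.

```python
def cost_per_answer(
    tokens_per_query: int,
    price_per_million_tokens: float,
    attempts: int = 1,
) -> float:
    """Cost of one accepted answer: tokens used across all attempts times the token price."""
    return tokens_per_query * attempts * price_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical figures, not taken from the article.
    large_model = cost_per_answer(2000, price_per_million_tokens=10.0)
    small_model = cost_per_answer(2000, price_per_million_tokens=0.5, attempts=3)
    print(f"large model, one attempt:     ${large_model:.4f}")
    print(f"small model, three attempts:  ${small_model:.4f}")
```

Under these assumed numbers, a cheaper model can still cost less per accepted answer even when the validator forces several retries, because the per-token price difference outweighs the extra attempts.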

Challenging the Big AI Lab Incentive Model

Elias points out a structural misalignment in the current AI landscape: major AI labs are incentivized to build massive, general-purpose models that require frequent user corrections. Since these labs often charge based on token usage, more errors and more follow-up queries can actually increase revenue. By focusing on precision and "reducing ambiguity" through engineering rather than scale, Probably is carving out a niche for mission-critical AI applications where reliability is the only metric that matters.

Key Takeaways

Deterministic Validation: Probably uses a "mech suit" architecture to check LLM outputs against a deterministic validator, aiming for 99.99% accuracy.
Cost-Effective Engineering: By reducing ambiguity through better context engineering, the system can run on much smaller, cheaper models that can operate on local hardware.
Precision-First Focus: The technology is designed to move AI into high-stakes, precision-sensitive sectors like accounting and medical services, where hallucinations are unacceptable.
