Probably Raises $9M to Combat LLM Hallucinations with Precision Engineering

As Large Language Models (LLMs) become increasingly integrated into professional workflows, the industry faces a persistent hurdle: the tendency for even the most advanced models to hallucinate. Startup Probably is tackling this challenge head-on, securing $9 million in seed funding led by Andreessen Horowitz to build a more rigorous, deterministic approach to AI reliability.

Moving Toward 99.99% Accuracy

The core mission of Probably, led by founder Peter Elias, is to bridge the gap between the probabilistic nature of LLMs and the 99.99% accuracy standard expected of deterministic systems. In high-stakes environments, a single factual error can render an AI tool useless. To solve this, Probably is moving away from the idea that accuracy is purely a function of model size and instead focusing on "harness engineering."

The company’s flagship product is a data science tool designed to extract insights from complex datasets. Unlike standard chatbots that return conversational responses, Probably’s tool attaches a specific citation and a transparent audit trail to every answer, allowing users to verify the logic behind each output.

The "Data Science Mech Suit" Architecture

Rather than relying solely on the reasoning capabilities of a massive model, Probably utilizes what Elias calls a "data science mech suit." This architecture functions as an elaborate harness system where the LLM’s initial output is immediately scrutinized by a deterministic validator.

If the LLM produces a result that does not align perfectly with the underlying dataset, the validator rejects it. Crucially, the LLM is trained specifically against this validator, creating a closed-loop system optimized for speed and factual integrity. This approach operates on a fundamental principle: by refining the context and reducing ambiguity through engineering, you can force the model to "do the right thing" without requiring massive computational brute force.
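To make the reject-and-retry idea concrete, here is a minimal Python sketch of a model proposal being checked by a deterministic validator. It is an illustration under stated assumptions, not Probably's implementation: the toy dataset, the `Claim` structure, and the names `validate`, `answer_with_harness`, and `fake_model` are all hypothetical, and the stand-in "model" is a hard-coded function rather than a real LLM call. The sketch covers only the inference-time check; it does not show training a model against the validator.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Toy dataset; every value is illustrative and not taken from the article.
DATASET = [
    {"region": "north", "revenue": 120},
    {"region": "south", "revenue": 80},
    {"region": "north", "revenue": 50},
]


@dataclass
class Claim:
    """A candidate answer plus the citation needed to audit it (hypothetical)."""
    region: str
    total: int
    cited_rows: list[int]  # indices into the dataset that support the total


def validate(claim: Claim, dataset: list[dict]) -> bool:
    """Recompute the claim directly from the data; no model is involved."""
    expected_rows = [i for i, row in enumerate(dataset)
                     if row["region"] == claim.region]
    if sorted(claim.cited_rows) != expected_rows:
        return False  # the citation does not match the supporting rows
    return claim.total == sum(dataset[i]["revenue"] for i in expected_rows)


def answer_with_harness(propose: Callable[[int], Claim],
                        dataset: list[dict],
                        max_attempts: int = 3):
    """Ask for proposals until one passes validation; log every attempt."""
    audit_trail = []
    for attempt in range(1, max_attempts + 1):
        claim = propose(attempt)
        accepted = validate(claim, dataset)
        audit_trail.append((attempt, claim, accepted))
        if accepted:
            return claim, audit_trail
    return None, audit_trail  # refuse to answer rather than return a guess


def fake_model(attempt: int) -> Claim:
    """Stand-in for an LLM call: wrong on the first try, right on the second."""
    if attempt == 1:
        return Claim("north", 200, [0])
    return Claim("north", 170, [0, 2])


if __name__ == "__main__":
    claim, trail = answer_with_harness(fake_model, DATASET)
    print(claim)
    for entry in trail:
        print(entry)
```

In this sketch the harness returns an answer only when the validator can reproduce it from the cited rows, and the audit trail records rejected attempts alongside the accepted one. If no proposal passes within `max_attempts`, it returns no answer at all.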

Efficiency Through Smaller, Local Models

One of the most significant technical implications of Probably’s approach is the ability to use smaller, more efficient models. Because the "mech suit" handles the heavy lifting of validation and context refinement, the system can operate on models that are "four classes weaker than frontier models."

This shift offers substantial economic and operational advantages.

Challenging the Incentive Model of Major AI Labs

Elias points to a structural mismatch in the current AI landscape: major AI labs are incentivized to build massive, general-purpose models that require frequent correction by the user. Because these labs often charge based on token usage, more errors and follow-up queries can actually increase their revenue. By focusing on accuracy and "removing ambiguity" through engineering rather than scaling, Probably is carving out a niche for mission-critical AI applications, where reliability is the only metric that matters.

Key Takeaways