Your CI Passed. Your Agent Isn't Operator-Ready

We shipped a document agent to an enterprise client last quarter.

Our test suite showed a 94% pass rate.

Three weeks into the pilot, the agent started issuing refunds for invoices it could not read. It did this silently. There were no errors and no log entries. The agent just gave wrong answers that looked correct.

Our CI stayed green the entire time.

The problem was not the model or the prompt. The problem was what the 94% never covered: data we did not test. That data arrived as the operator's first real documents.

That is not an edge case. Handling it is the definition of being operator-ready.

Production-ready is about infrastructure. It means your service stays up and handles the load.

Operator-ready is different. It means your agent works for someone who did not build it. It works on data you did not design. It makes decisions with real consequences.

Most test pipelines measure pass rates on a set you created. They do not measure what happens when real data differs from your test set.

A model with 97% validation success sounds good. But look at the 3% that fail.

If your agent retries those failures and fills missing fields with default values, you have built a silent error machine. The schema passes, but the data is wrong.

To fix this, separate schema validity from content confidence.

We added a confidence score to every response. Low confidence now triggers a human review instead of a retry. This change caught 14 of our first 18 incidents.
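Here is a minimal sketch of that split, not our production code. The `ExtractedField` shape, the required field names, and the 0.8 threshold are all assumptions made for illustration. The point it shows: the schema check and the confidence check are separate questions, and low confidence routes to a person instead of a retry.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ACCEPT = "accept"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: object
    confidence: float  # 0.0 to 1.0, as reported by the extraction step


# Hypothetical schema and threshold, chosen only for this example.
REQUIRED_FIELDS = {"invoice_id", "amount", "currency"}
CONFIDENCE_THRESHOLD = 0.8


def schema_valid(fields):
    """Schema check only: every required field is present and non-empty."""
    present = {f.name for f in fields if f.value not in (None, "")}
    return REQUIRED_FIELDS <= present


def route(fields):
    """Decide what happens to a response. No retries, no default values."""
    if not schema_valid(fields):
        return Route.REJECT
    # Content check: the weakest required field decides the outcome.
    lowest = min(f.confidence for f in fields if f.name in REQUIRED_FIELDS)
    if lowest < CONFIDENCE_THRESHOLD:
        return Route.HUMAN_REVIEW
    return Route.ACCEPT
```

A response with every field present but an `amount` confidence of 0.41 passes `schema_valid` and still lands in `Route.HUMAN_REVIEW`. That is the case a schema-only pipeline lets through.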

Your test set covers what you thought of. An operator's data covers what you missed.

In our case, we tested single-page invoices. The operator sent multi-page invoices as scanned PDFs. The agent failed on the new format.

Do not just fix the parser. Test against the actual operator's data before you go live.

Before any handoff, we now require 50 documents from the operator's own data. We do not use synthetic data. We use theirs.
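One way to turn that rule into a gate is sketched below. The 50-document minimum comes from the rule above. The 2% error threshold, the `DocResult` shape, and the document kinds are hypothetical, and `correct` assumes someone has checked each output against a human-verified answer.

```python
from collections import Counter
from dataclasses import dataclass

MIN_OPERATOR_DOCS = 50  # the minimum stated above
MAX_ERROR_RATE = 0.02   # hypothetical threshold; agree on one with the operator


@dataclass(frozen=True)
class DocResult:
    doc_id: str
    doc_kind: str  # e.g. "single_page" or "multi_page_scan" (illustrative labels)
    correct: bool  # agent output matched a human-verified answer


def handoff_report(results):
    """Summarize a run over the operator's own documents."""
    total = len(results)
    failures = [r for r in results if not r.correct]
    # With no documents at all, treat the run as fully failed.
    error_rate = len(failures) / total if total else 1.0
    return {
        "total": total,
        "error_rate": error_rate,
        # Grouping failures by kind surfaces a format you never tested.
        "failures_by_kind": dict(Counter(r.doc_kind for r in failures)),
        "ready": total >= MIN_OPERATOR_DOCS and error_rate <= MAX_ERROR_RATE,
    }
```

The `failures_by_kind` breakdown matters more than the headline rate. A cluster of failures under one document kind is a missing format, not noise.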

You also need a complete audit trail. Do not just log what the model returned. Log what the model decided not to do.

A minimum audit trail needs:

  • Output with field-level confidence scores
  • A fallback indicator showing whether the agent retried or filled in defaults
  • An input hash to identify and replay the exact document
  • The specific model and prompt version used
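The list above could map to a record like the one sketched here. The field names and the `build_audit_record` helper are hypothetical, not a standard format. `skipped_actions` is one possible way to log what the agent decided not to do.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    input_sha256: str       # hash of the raw document bytes
    model_version: str
    prompt_version: str
    output: dict            # field name -> extracted value
    field_confidence: dict  # field name -> confidence score
    retried: bool
    used_fallback: bool     # True if any value came from a default
    skipped_actions: list   # actions the agent declined, with reasons
    logged_at: str          # UTC timestamp, ISO 8601


def build_audit_record(document_bytes, output, field_confidence, *,
                       model_version, prompt_version,
                       retried=False, used_fallback=False, skipped_actions=()):
    """Assemble one audit record for one processed document."""
    return AuditRecord(
        input_sha256=hashlib.sha256(document_bytes).hexdigest(),
        model_version=model_version,
        prompt_version=prompt_version,
        output=dict(output),
        field_confidence=dict(field_confidence),
        retried=retried,
        used_fallback=used_fallback,
        skipped_actions=list(skipped_actions),
        logged_at=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record):
    """Serialize to a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

The hash only identifies the input. To replay a document, you also need to store the original bytes somewhere you can look them up by that hash.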

Before you hand an agent to an operator, check these five things:

  • Run 50+ samples from the operator's actual data.
  • Search logs for outputs that passed schema checks but caused downstream errors.
  • Feed malformed inputs to ensure the agent fails safely.
  • Ensure you can explain what happened to a specific document in under 5 minutes.
  • Check that the agent has the lowest possible permissions.
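The malformed-input check from the list above can be sketched as a wrapper plus a small set of bad inputs. `run_safely`, `AgentResult`, and `toy_extract` are hypothetical names. `toy_extract` stands in for a real agent and only checks for a PDF header. "Fails safely" here means the agent returns no fields and asks for review. It never substitutes defaults.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AgentResult:
    status: str              # "ok" or "needs_review"
    fields: Optional[dict]   # None whenever status is "needs_review"
    reason: Optional[str] = None


def run_safely(extract: Callable[[bytes], dict], document: bytes) -> AgentResult:
    """Run an extractor so that any failure becomes an explicit review request."""
    if not document:
        return AgentResult("needs_review", None, "empty input")
    try:
        fields = extract(document)
    except Exception as exc:  # broad on purpose: no failure may pass silently
        return AgentResult("needs_review", None, f"{type(exc).__name__}: {exc}")
    if not isinstance(fields, dict) or not fields:
        return AgentResult("needs_review", None, "extractor returned nothing usable")
    return AgentResult("ok", fields)


def toy_extract(document: bytes) -> dict:
    """Hypothetical stand-in for the real agent: accepts only PDF-looking input."""
    if not document.startswith(b"%PDF-"):
        raise ValueError("not a PDF")
    return {"invoice_id": "INV-1"}


# A few malformed inputs to feed through before handoff (illustrative only).
MALFORMED_INPUTS = [b"", b"\x00\x01\x02", b"<html>not an invoice</html>"]
```

Every entry in `MALFORMED_INPUTS` should come back as `needs_review` with `fields` set to `None`. If any of them comes back as `ok`, the agent is guessing.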

Our test pass rate was 94%. Our error rate in month one was 8%.

After we added confidence scores, real-world testing, and better logs, the error rate dropped to 1.4%.

The test score was not the problem. The test scope was.

Source: https://dev.to/ethanwritesai/our-ci-passed-your-agent-isnt-operator-ready-2mfn
