Trust The Harness, Not The Model

A local 27B coding model running on my home hardware is a coin flip. Sometimes it fixes code in twenty minutes. Other times it edits the wrong file or writes a test that passes no matter what.

The goal of LLMKube's Foreman is not to find a perfect model. The goal is to build a harness you trust more than the model.

This weekend, the harness tested itself by building its own guardrails. Here is what happened:

• My local coder built three new gates for itself. • One gate arrived with the exact flaw it was meant to catch. The review process caught it. • Three new human contributors sent four clean pull requests while the machines worked. • The same model ran on an AMD box and an Apple Silicon Mac. The Mac won a round nobody expected. • No data touched a cloud API.

A coding agent on a local model produces variable quality. No amount of prompt tuning makes a small model as reliable as a frontier model.

Foreman does not ask the model to be reliable. It wraps the model in a pipeline:

  • The coder works in a cloned workspace.
  • A fast gate runs gofmt, vet, build, lint, and unit tests.
  • A reviewer reads the diff against the issue.
  • A clean-room Kubernetes Job re-runs the full suite before any code is approved.
  • Deterministic rails like scope checks and repo-map context sit around the whole process.

The model is a random part of a system. The system's job is to make the verdict trustworthy even when the model is not.

The real question is not "is the model good." The question is "does the harness catch the model when it is bad."

This weekend, the harness caught two major bugs. Both were self-confirming tests. The tests passed because the tests themselves were written to pass. They did not actually test the code.

I handed these failures back to Foreman. I let the harness build the gates to fix them:

  • A scope guard: It rejects any edit that touches the wrong subsystem.
  • A reviewer rubric: It ensures tests use real production values instead of placeholders.
  • A bite check: A new test must fail against the old code. If it passes both, it is not a real test.

The bite check failed its own first test. The gate it built was flawed. But the review process caught the error before it merged. That is the design working.

You do not need a massive frontier model to do engineering work locally. You need a harness you trust. Build that, and a small model becomes a useful coworker. It becomes a system that catches mistakes instead of forcing you to read every line of code at midnight.

Source: https://dev.to/defilan/trust-the-harness-not-the-model-a-weekend-of-local-agents-building-their-own-guardrails-52nl

Optional learning community: https://t.me/GyaanSetuAi