Audit AI-Generated Tests: Half of Green CI Proves Nothing

A passing test feels like evidence. It is usually less than you think.

A test asserts that code does what the test expects. If the same author writes both the code and the expectation, the expectation is shaped by the code. The test passes because it was written to pass.

A green checkmark proves your tests agree with the code. It does not prove the code is correct.

This problem grows when AI agents ship entire pull requests. They write the implementation and the tests in one go. The checkmark does not get more trustworthy when the author changes.

I built mirror_audit.py to find this gap. It reads the test source using AST. It never runs the code. It looks for three common patterns:

  • The Recompute: The test uses the same formula as the code. It is f(x) == f(x) in a costume.
  • The Golden Literal: The test uses a number copied from a previous run. It pins the test to whatever the code did on day one, including bugs.
  • The Smoke Test: The test checks if a result is not None but lacks a real assertion.

I ran this against two suites.

The first suite was designed to mirror the implementation. It scored a 50.0% mirror-ratio. The CI failed. Half the tests carried no independent signal.

The second suite was honest. It used negative cases and independent expectations. It scored 0.0%. The CI passed.

The mirror-ratio does not measure your bug rate. It measures missing independent signal. It tells you how much of your green CI is just the suite nodding along with the code.

If a test computes the same broken result as the implementation, the test stays green. If a test asserts a real contract, the test turns red and catches the bug.

Stop looking only at coverage. Ask if your tests could actually fail for a real reason.

Source: https://dev.to/alex_spinov/audit-ai-generated-tests-half-of-green-ci-proves-nothing-4bmb

Optional learning community: https://t.me/GyaanSetuAi