Scoring AI Agents: Deterministic Metrics + an LLM Judge

You run many small AI agents. You have agents for backend, frontend, mobile, and devops. Each agent has one job.

When you have many agents, you face a problem. You do not know if they are good. You do not know if a prompt edit makes them better or worse. Saying "it looks fine" does not work at scale.

I built a framework to solve this. It uses numbers to measure performance and improves prompts automatically.

The Strategy

Measure what you can measure with math first. Use an LLM judge only when you must. Deterministic metrics are fast and free. An LLM judge is slow and costs money.
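One way to read this rule is as a cascade: try the cheap check first and call the judge only when there is nothing to compare against. The sketch below is illustrative, not the framework's actual code. The `judge` callable and the `(score, source)` return shape are assumptions.

```python
from typing import Callable, Optional, Tuple


def score_output(
    output: str,
    expected: Optional[str],
    judge: Callable[[str], float],  # hypothetical: returns a score in [0, 1]
) -> Tuple[float, str]:
    """Return (score, source), preferring a deterministic check over the judge."""
    if expected is not None:
        # Deterministic path: fast, free, and repeatable.
        matched = output.strip() == expected.strip()
        return (1.0 if matched else 0.0, "deterministic")
    # No expected value to compare against, so fall back to the slow, paid judge.
    return (judge(output), "judge")
```

The `source` tag makes it easy to report how many tasks actually needed the judge.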

How the system works:

  • The harness runs each agent as a separate process.
  • It feeds a task to the agent.
  • It captures the output.
  • It scores the result against expected data.

The agent only needs to read from stdin and write to stdout. It can be Python or a shell script. The harness does not care.
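The stdin/stdout contract can be sketched with Python's standard `subprocess` module. This is a minimal illustration, not the harness itself: `run_agent`, `RunResult`, and `ECHO_AGENT` are names invented here, and the 30-second default timeout is an assumption.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunResult:
    output: str
    timed_out: bool
    exit_code: Optional[int]


def run_agent(cmd: List[str], task: str, timeout_s: float = 30.0) -> RunResult:
    """Run one agent as a child process: task in on stdin, answer out on stdout."""
    try:
        proc = subprocess.run(
            cmd,
            input=task,           # written to the agent's stdin
            capture_output=True,  # collect stdout and stderr
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed on timeout; record it so it counts toward timeout rate.
        return RunResult(output="", timed_out=True, exit_code=None)
    return RunResult(output=proc.stdout, timed_out=False, exit_code=proc.returncode)


# Stand-in "agent" for demonstration: upper-cases whatever it reads from stdin.
ECHO_AGENT = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
```

Because the command is just a list of arguments, the same function can launch a Python script, a shell script, or any other executable.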

Five core metrics to track:

  • Accuracy: Does the output match the goal?
  • Fuzzy score: How similar is the text to the target?
  • Timeout rate: How often does the agent fail to finish?
  • Safety violations: Does the output match unsafe patterns?
  • Reproducibility variance: Does the agent give the same answer every time?

If an agent is correct but inconsistent, that inconsistency is a bug.
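The five metrics can be computed from repeated runs of the same task. The sketch below makes several assumptions that are not in the original: a timed-out run is represented as `None`, the fuzzy score uses `difflib.SequenceMatcher`, reproducibility variance is taken as the variance of the fuzzy scores, and the two unsafe patterns are placeholders.

```python
import re
from difflib import SequenceMatcher
from statistics import pvariance
from typing import Dict, List, Optional

# Placeholder patterns; a real deny-list would be specific to each agent.
UNSAFE_PATTERNS = [re.compile(p) for p in (r"rm\s+-rf\s+/", r"(?i)drop\s+table")]


def fuzzy_score(output: str, expected: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, output, expected).ratio()


def score_runs(runs: List[Optional[str]], expected: str) -> Dict[str, float]:
    """Score repeated runs of one task. A run that timed out is None."""
    if not runs:
        raise ValueError("need at least one run")
    finished = [r for r in runs if r is not None]
    fuzzy = [fuzzy_score(r, expected) for r in finished]
    return {
        "accuracy": sum(r.strip() == expected.strip() for r in finished) / len(runs),
        "fuzzy_score": sum(fuzzy) / len(fuzzy) if fuzzy else 0.0,
        "timeout_rate": (len(runs) - len(finished)) / len(runs),
        "safety_violations": sum(
            any(p.search(r) for p in UNSAFE_PATTERNS) for r in finished
        ),
        # Zero when every finished run scores the same; grows as runs diverge.
        "reproducibility_variance": pvariance(fuzzy) if fuzzy else 0.0,
    }
```

A nonzero `reproducibility_variance` alongside a high `accuracy` is the "correct but inconsistent" case worth flagging.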

The LLM Judge

Some things are hard to measure with math. You need to know if an agent stayed in its role or followed constraints.

For these cases, an LLM judge reviews the work. It receives a rubric and the agent output. It returns a structured verdict. I validate this verdict against a JSON schema so it does not break the report.

The judge does more than just grade. It must suggest fixes. A critique like "this is weak" is useless. A critique like "add a JSON block to the prompt" is actionable.
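Here is a rough sketch of validating a verdict before it reaches the report. The field names (`score`, `passed`, `suggestions`) are assumptions, and the hand-written checks stand in for a real JSON Schema validator so the example needs only the standard library.

```python
import json

# Assumed verdict shape; the original does not specify the actual schema.
REQUIRED_FIELDS = {"score": (int, float), "passed": bool, "suggestions": list}


def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply; raise ValueError if it has the wrong shape."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned invalid JSON: {exc}") from exc
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    for field, types in REQUIRED_FIELDS.items():
        if not isinstance(verdict.get(field), types):
            raise ValueError(f"missing or mistyped field: {field}")
    if isinstance(verdict["score"], bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise ValueError("score must be a number")
    if not all(isinstance(s, str) and s.strip() for s in verdict["suggestions"]):
        raise ValueError("suggestions must be non-empty strings")
    if not verdict["passed"] and not verdict["suggestions"]:
        # Enforce the "must suggest fixes" rule for failing verdicts.
        raise ValueError("a failing verdict must include at least one suggested fix")
    return verdict
```

Rejecting a failing verdict with no suggestions turns "critiques must be actionable" into a check rather than a hope.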

The Improvement Loop

Failures go into a file. This file feeds an automated loop. The system looks at the weakest part of a prompt and tries to fix it. It keeps a pool of good candidates. It writes the best versions back to the code.
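The loop can be sketched roughly as follows. Everything here is hypothetical: the prompt is modeled as named sections, each failure record is assumed to carry a `section` field, and `rewrite` and `evaluate` are callables supplied by the caller. Reading the failure file and writing the winner back to the code are left out.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Prompt = Dict[str, str]  # section name -> section text (assumed representation)


def weakest_section(failures: List[dict]) -> str:
    """Pick the prompt section blamed by the most failure records."""
    return Counter(f["section"] for f in failures).most_common(1)[0][0]


def improve(
    prompt: Prompt,
    failures: List[dict],
    rewrite: Callable[[str], str],       # hypothetical: proposes new section text
    evaluate: Callable[[Prompt], float],  # hypothetical: higher is better
    rounds: int = 3,
    pool_size: int = 2,
) -> List[Tuple[float, Prompt]]:
    """Return the candidate pool, best first, after a few rewrite rounds."""
    target = weakest_section(failures)
    pool = [(evaluate(prompt), prompt)]
    for _ in range(rounds):
        _, parent = pool[0]              # always mutate the current best
        child = dict(parent)             # copy so the parent stays intact
        child[target] = rewrite(parent[target])
        pool.append((evaluate(child), child))
        pool.sort(key=lambda pair: pair[0], reverse=True)
        del pool[pool_size:]             # keep only the strongest candidates
    return pool
```

Keeping a small pool instead of a single best candidate means one unlucky rewrite does not erase earlier progress.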

A single score is a snapshot. Use history to track trends. This tells you if you are getting better over time.
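One simple way to turn history into a trend is to compare the latest few scores with the few before them. This is a sketch under assumptions: the window size of 3 and the function name `trend` are illustrative, and the history is assumed to be a chronological list of scores.

```python
from statistics import mean
from typing import List


def trend(history: List[float], window: int = 3) -> float:
    """Mean of the latest `window` scores minus the mean of the window before it."""
    if len(history) < 2 * window:
        raise ValueError("need at least 2 * window scores")
    recent = history[-window:]
    previous = history[-2 * window:-window]
    # Positive means the agent is improving; negative means it is regressing.
    return mean(recent) - mean(previous)
```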

Build your foundation on deterministic metrics. Use the judge as a scalpel, not a hammer.

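Scoring AI Agents: Deterministic Metrics vs. an LLM Judge

Evaluating an LLM is one thing; evaluating an AI agent is a very different challenge. An LLM is mostly about text generation, while an agent involves tool use, reasoning, and multi-step workflows. That complexity leaves traditional evaluation metrics falling short.

There are two main approaches to evaluating an agent: deterministic metrics and an LLM judge (LLM-as-a-Judge).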

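Deterministic Metrics

Deterministic metrics are rule-based. They are fast, cheap, and reproducible.

Common types:

  • Exact match: checks whether the output is identical to the ground truth. This suits simple question answering or classification tasks.
  • Regular expressions (regex): verify that the output follows a specific format pattern, such as a date, an email address, or a particular ID format.
  • JSON/Schema validation: if the agent's job is to produce structured data, validate the output against a predefined JSON Schema.
  • Code execution: if the agent produces code, run it directly and check whether it passes a preset suite of unit tests.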
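The four check types above can each be sketched in a few lines. The helper names, the ISO-date pattern, and the required-keys check (a stand-in for full JSON Schema validation) are all illustrative assumptions, not a fixed API.

```python
import json
import re
import subprocess
import sys
from typing import Set

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # example format; swap in your own


def exact_match(output: str, truth: str) -> bool:
    """Exact match against the ground truth, ignoring surrounding whitespace."""
    return output.strip() == truth.strip()


def matches_format(output: str, pattern: "re.Pattern[str]" = ISO_DATE) -> bool:
    """True when the whole output follows the expected format."""
    return pattern.fullmatch(output.strip()) is not None


def has_required_keys(output: str, required: Set[str]) -> bool:
    """Simplified structural check: valid JSON object containing every key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required.issubset(data.keys())


def passes_tests(generated_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Run generated code plus assertions in a fresh interpreter; exit code 0 passes.

    Generated code is untrusted, so a real harness would run this in a sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", generated_code + "\n" + test_code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

Each helper returns a plain boolean, so results can be averaged across a task set without any further parsing.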

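Pros:

  • Extremely fast.
  • Almost zero cost.
  • Results are fully predictable and unbiased.

Cons:

  • Inflexible.
  • Cannot assess semantic correctness (for example, exact match fails whenever the wording differs, even if the meaning is the same).
  • Hard to use for judging the quality of open-ended answers.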

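LLM Judge (LLM-as-a-Judge)

When the output is open-ended, or when scoring requires understanding complex semantics, deterministic metrics break down. That is where the LLM judge pattern comes in: a more capable model (for example GPT-4o) acts as the referee and scores the agent's performance against a detailed rubric.

How to implement it:

  • Provide context: give the judge model the task description, the user input, and the agent's output.
  • Define the rubric: tell the model explicitly what counts as "good" and what counts as "bad". For example, score along three dimensions: accuracy, conciseness, and tone.
  • Few-shot prompting: include some graded examples to help the judge model understand the standard.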
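The three ingredients above (context, rubric, few-shot examples) can be assembled into a single judge prompt. This sketch only builds the text; the prompt wording, the `Example` type, and the function name are assumptions, and the call to the judge model is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    agent_output: str
    verdict: str  # the grade a human assigned, written as JSON text


def build_judge_prompt(
    task: str,
    user_input: str,
    agent_output: str,
    rubric: List[str],
    examples: List[Example],
) -> str:
    """Assemble rubric, graded examples, and the case to grade into one prompt."""
    lines = ["You are grading the output of an AI agent.", "", "Rubric:"]
    lines += [f"- {criterion}" for criterion in rubric]
    # Few-shot examples come before the real case so the judge sees the standard first.
    for i, ex in enumerate(examples, 1):
        lines += ["", f"Example {i} output:", ex.agent_output,
                  f"Example {i} verdict:", ex.verdict]
    lines += ["", "Task description:", task,
              "User input:", user_input,
              "Agent output to grade:", agent_output,
              "", "Return your verdict as JSON."]
    return "\n".join(lines)
```

Keeping the prompt builder a pure function makes it easy to snapshot-test, so a rubric change shows up as a diff rather than as a silent shift in scores.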

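Pros:

  • Understands semantics and nuance.
  • Handles highly complex, unstructured tasks.
  • Can approximate human judgment.

Cons:

  • High cost: calling an advanced model for large-scale evaluation is expensive.
  • Slow: compared with rule-based checks, LLM inference takes time.
  • Bias: an LLM judge can show position bias (favoring the first answer it sees) or length bias (favoring longer answers).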

Comparison Summary

| Feature |