Defense in Depth for Error-Prone Agents

AI-assisted draft.

Defense in Depth for An Agent That Will Screw Up

Every safety mechanism has a bug. You just do not know which one yet.

This is the only sane way to build an autonomous agent for production. The conscience will misclassify an action. The council will approve a bad idea. The isolation wall will have a gap.

Each layer will eventually fail.

The design question is not how to build a perfect layer. The question is: what stands behind a layer when it fails?

I use six layers to protect the system. Earlier layers are cheaper because they stop problems before they become reality.

• L0: Structural isolation. Prevents wrong-tenant actions at the base level.
• L1: Idea-gate. Stops bad ideas during debate before code exists.
• L2: Action-gate. A reflex that allows or denies commands based on risk.
• L3: Resource-gate. Controls memory, budgets, and runaway costs.
• L4: Audit. Uses tamper-evident receipts to track every decision.
• L5: Recovery. Uses quarantine and rollbacks to stop active damage.

A list of layers is not defense in depth. Real depth requires one rule:

Every critical risk must be caught by at least two independent layers.

Independent means they do not fail for the same reason. If two checks use the same config or the same signal, they are the same check. When that signal fails, both checks fail.
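One way to make this rule checkable is to write down which signals each layer depends on, and count two layers as independent only when they share none. The sketch below assumes that approach; every signal name and risk name in it is hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical map: layer -> the signals or configs it depends on.
LAYER_SIGNALS = {
    "L0": {"tenant_schema"},
    "L1": {"debate_transcript", "risk_policy"},
    "L2": {"risk_policy"},
    "L4": {"receipt_files"},
}

# Hypothetical map: critical risk -> layers that claim to catch it.
RISK_COVERAGE = {
    "wrong_tenant_write": ["L0", "L2"],
    "fabricated_verdict": ["L1", "L4"],
    "dangerous_command": ["L1", "L2"],
}


def independent_pairs(layers, layer_signals):
    """Return the pairs of layers that share no signal."""
    return [
        (a, b)
        for a, b in combinations(layers, 2)
        if not (layer_signals[a] & layer_signals[b])
    ]


def uncovered_risks(coverage, layer_signals):
    """Return risks that lack two independent layers behind them."""
    return sorted(
        risk
        for risk, layers in coverage.items()
        if not independent_pairs(layers, layer_signals)
    )


print(uncovered_risks(RISK_COVERAGE, LAYER_SIGNALS))
```

Here `dangerous_command` is flagged: L1 and L2 both read the same `risk_policy`, so by the rule above they count as one check.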

I saw this happen. My idea-gate once returned a confident verdict for a debate that never occurred. A helper fabricated the entire result.

L1 failed. It did not just miss the error. It produced a convincing lie.

If L1 had been my only line of defense, the lie would have become a real decision. But L4 caught it. L4 does not believe narration. It only believes receipts. I looked for the transcript file. The file did not exist. The claim was void.
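A receipt check of this kind can be sketched in a few lines. This is an illustration under assumptions, not the actual L4 implementation: the verdict fields `transcript_name` and `transcript_sha256` are hypothetical, and the hash comparison goes one step beyond the existence check described above.

```python
import hashlib
from pathlib import Path


def verdict_is_backed(verdict: dict, receipts_dir: Path) -> bool:
    """Accept a verdict only if its transcript exists and matches the claimed hash.

    `verdict` is a hypothetical dict with the keys `transcript_name` and
    `transcript_sha256`. The narrated outcome itself is never consulted.
    """
    transcript = receipts_dir / verdict["transcript_name"]
    if not transcript.is_file():
        # No receipt on disk: the claim is void, however confident it sounds.
        return False
    digest = hashlib.sha256(transcript.read_bytes()).hexdigest()
    return digest == verdict["transcript_sha256"]
```

The point of the design is that the check reads nothing the helper wrote about itself. It only looks at an artifact that should exist if the debate really happened.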

The system stayed safe because I assumed L1 would fail.

Current status of these layers:

• L1: Coded.
• L2: Coded.
• L4: Coded.
• L0: Partial.
• L3: Partial.
• L5: Partial.

A safety architecture you cannot audit is just a mood board.

Ask yourself these three questions about any safe agent:

For your worst risk, what are the two independent layers that catch it?
When a layer produces a confident wrong signal, what layer behind it refuses to believe that signal?
Which layers are built, and which are just ideas?

Capability is one layer. Safety is the other five.

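Building Defense in Depth for an Agent That Is Bound to Fail

LLMs are probabilistic, which means their behavior cannot be fully predicted. This is the fundamental problem in building AI agents.

At its core, an agent is a loop:

Observe
Think
Act
Repeat

The "Act" step is where the problem lies. Once an agent can call tools (executing code, accessing a database, sending email), its unpredictability becomes a serious security risk.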
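The loop can be sketched as follows. The callables `observe`, `think`, and `act` are hypothetical stand-ins for whatever the agent framework provides, and the `max_steps` cap is an added assumption so the loop always terminates.

```python
from typing import Any, Callable, Optional


def run_agent(
    observe: Callable[[], Any],
    think: Callable[[Any], Optional[Any]],
    act: Callable[[Any], None],
    max_steps: int = 10,
) -> int:
    """Run observe -> think -> act until `think` returns None or the cap is hit.

    Returns the number of actions actually performed.
    """
    steps = 0
    for _ in range(max_steps):
        observation = observe()
        action = think(observation)  # the unpredictable, model-driven step
        if action is None:
            break
        act(action)  # the step with real-world side effects
        steps += 1
    return steps
```

Every call to `act` is a place where a defensive layer can be inserted.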

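Why defense in depth?

You cannot rely on a single defense mechanism. If your only defense is "don't let the agent run dangerous commands," one successful prompt injection is enough to make it collapse.

Defense in depth is a security strategy that places multiple protections at different levels, so that even if one layer is breached, the others still provide protection.

For AI agents, the defensive layers can be built along the following dimensions: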

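1. Input Guardrails

Intercept and inspect user input before it reaches the LLM.

Prompt-injection detection: identify malicious patterns that try to bypass the system instructions.
Content filtering: block inappropriate or harmful input.
Structural validation: ensure the input matches the expected format.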
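A minimal sketch of such a guardrail, assuming a simple regex deny-list and a length cap. The patterns and the limit are illustrative assumptions. Pattern matching is easy to evade, so treat it as one layer and never as the whole defense.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; real injection attempts vary widely.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your |the )?system prompt", re.I),
]
MAX_INPUT_CHARS = 4000  # assumed limit


@dataclass
class InputCheck:
    allowed: bool
    reason: str


def check_input(text: str) -> InputCheck:
    """Run structural checks first, then the pattern deny-list."""
    if not text.strip():
        return InputCheck(False, "empty input")
    if len(text) > MAX_INPUT_CHARS:
        return InputCheck(False, "input too long")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(text):
            return InputCheck(False, f"matched pattern: {pattern.pattern}")
    return InputCheck(True, "ok")
```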

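2. Tool Constraints

Do not give the agent all-powerful permissions.

Least privilege: if the agent only needs to read files, do not give it write access.
Parameter validation: strictly check the arguments of every tool call the agent generates. For example, if the agent tries to delete a file, check that the path is within the allowed range.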
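The path check can be sketched with `pathlib`. The workspace root below is a hypothetical location. The sketch resolves the requested path first, so `..` segments and symlinks cannot escape the root (`Path.is_relative_to` requires Python 3.9 or later).

```python
from pathlib import Path

ALLOWED_ROOT = Path("/srv/agent-workspace")  # hypothetical workspace root


def is_path_allowed(requested: str, root: Path = ALLOWED_ROOT) -> bool:
    """Return True only if `requested` resolves to a location inside `root`.

    Joining an absolute `requested` onto `root` discards `root`, so absolute
    paths outside the workspace are rejected as well.
    """
    resolved = (root / requested).resolve()
    return resolved.is_relative_to(root.resolve())


print(is_path_allowed("notes/todo.txt"))
print(is_path_allowed("../etc/passwd"))
```

Run this check in the tool wrapper itself, not in the prompt, so that it holds regardless of what the model was persuaded to request.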

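3. Sandboxing

Never run agent-generated code directly in production or on the host machine.

Isolated environment: run the code in a Docker container or a lightweight virtual machine.
Network isolation: restrict the sandbox's access to the internet.
Resource limits: cap CPU, memory, and disk usage to prevent denial-of-service (DoS) attacks.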
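As one possible shape for this, the helper below builds a `docker run` command line with networking disabled and resource caps applied. The image name, mount path, and limit values are assumptions for illustration. The function only constructs the argument list; it does not launch anything.

```python
def build_sandbox_command(
    image: str,
    script_path: str,
    memory: str = "256m",
    cpus: str = "0.5",
    pids_limit: int = 64,
) -> list[str]:
    """Build a `docker run` argv for running one untrusted script."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access
        "--memory", memory,                     # memory cap
        "--cpus", cpus,                         # CPU cap
        "--pids-limit", str(pids_limit),        # limits fork bombs
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp:rw,size=64m",          # small writable scratch space
        "--cap-drop", "ALL",                    # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "--volume", f"{script_path}:/work/script.py:ro",
        image, "python", "/work/script.py",
    ]


print(" ".join(build_sandbox_command("python:3.12-slim", "/tmp/job.py")))
```

The resulting list can be passed to `subprocess.run` with a `timeout`, which adds a wall-clock limit on top of the container limits.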

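4. Human-in-the-loop

Any operation with side effects (transferring money, deleting data, sending external email) must require human confirmation.

Approval flow: the agent generates an action request → the request is sent to a human user → the user reviews it and clicks "Approve" or "Reject".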
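The approval flow can be sketched as a wrapper around tool execution. The tool names and the `ask_human` callback are hypothetical. In a real system the callback would block on a UI or a ticket rather than return immediately.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical names for tools that have side effects.
SIDE_EFFECT_TOOLS = {"transfer_funds", "delete_records", "send_external_email"}


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


def execute_with_approval(
    call: ToolCall,
    run_tool: Callable[[ToolCall], Any],
    ask_human: Callable[[ToolCall], bool],
) -> Any:
    """Run `call`, pausing for human approval if the tool has side effects."""
    if call.tool in SIDE_EFFECT_TOOLS and not ask_human(call):
        return "rejected"  # the tool is never invoked
    return run_tool(call)
```

Read-only tools pass straight through, so the human is only interrupted for actions that cannot be undone cheaply.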

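5. Observability & Monitoring

You need to know what the agent is doing and whether it is behaving abnormally.

Logging: record all reasoning steps, tool calls, and outputs.
Anomaly detection: monitor the frequency of tool calls, outlier argument values, and the sensitivity of output content.
Audit trail: ensure every step can be traced.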
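One common way to make an audit trail tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below assumes an in-memory list and a hypothetical entry layout. It only detects edits if the latest hash is also stored somewhere the agent cannot write.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> None:
    """Append `event`, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _entry_hash(prev_hash, event)})


def verify_log(log: list) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Changing any recorded event after the fact causes `verify_log` to return `False`, because the stored hash no longer matches the recomputed one.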

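Summary

When you build an AI agent, do not assume it will behave perfectly. Assume instead that it will make mistakes, and build multiple layers of defense around that assumption. Only with defense in depth can you unlock the potential of AI while keeping the risk as low as possible.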
