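Google DeepMind Backs $10 Million Fund to Tackle Multi-Agent AI Safety Risks
As AI agents evolve from simple chatbots into autonomous entities capable of carrying out complex tasks, a new class of systemic risk is emerging. Google DeepMind and several global partners have launched a large-scale initiative to study the unpredictable behavior that arises when millions of these autonomous agents begin interacting in the real world.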
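The Multi-Agent Problem: Beyond Single-Model Safety
For much of the current AI era, research has focused on the safety of individual models: ensuring that a given LLM does not produce harmful content or follow malicious prompts. Google DeepMind and its partners, however, argue that the real challenge lies in "multi-agent systems."
When large numbers of agents are deployed across the economy, they form a complex ecosystem whose collective behavior can differ sharply from the sum of its parts. This "agent hive mind" could give rise to emergent intelligence or, more worryingly, emergent chaos. Experts warn that these outcomes cannot be predicted by studying models in isolation; instead, researchers need realistic, large-scale simulations that let them observe, inside a digital sandbox, how agents interact, compete, or inadvertently collaborate.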
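To make the "sum of its parts" point concrete, here is a minimal toy sketch in Python. It is not the kind of large-scale sandbox the initiative has in mind, and nothing in it comes from the funded research; it assumes a classic threshold-cascade model, where each agent acts (say, sells an asset) once enough other agents have already acted. The function name `run_cascade` and the threshold values are illustrative assumptions.

```python
def run_cascade(thresholds):
    """Return how many agents end up acting in a threshold cascade.

    Each agent acts once the number of agents already acting is at
    least its own threshold. The loop repeats until nothing changes.
    """
    active = 0
    while True:
        # Count every agent whose threshold is met by the current crowd.
        new_active = sum(1 for t in thresholds if t <= active)
        if new_active == active:
            return active
        active = new_active


if __name__ == "__main__":
    # Population A: thresholds 0, 1, 2, ..., 99. Each agent tips the next.
    population_a = list(range(100))

    # Population B: identical, except one agent's threshold moves from 1 to 2.
    population_b = list(range(100))
    population_b[1] = 2

    print(run_cascade(population_a))  # 100: every agent ends up acting
    print(run_cascade(population_b))  # 1: the cascade stalls immediately
```

The two populations differ in a single agent, and inspecting any one agent's rule in isolation would not reveal which population cascades. That gap between individual rules and collective outcome is the reason the article's experts call for simulation rather than single-model testing.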
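A $10 Million Coalition to Support Academic Research
To close this gap, Google DeepMind has assembled a strong coalition that will provide $10 million in funding to researchers. The partnership includes Schmidt Sciences (a philanthropic organization led by Eric and Wendy Schmidt), ARIA (the UK government's "moonshot" agency), the Cooperative AI Foundation, and Google.org.
The strategic goal is to move this research out of Big Tech labs and into academia. While industry leaders such as Google and Anthropic are building the technology, academic researchers have more freedom to look further ahead and investigate long-term systemic risks that may not be a priority within commercial product cycles. The funding is intended to establish "multi-agent safety" as a foundational field, one that does not yet exist.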
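From Prompt Injection to Digital Anarchy
The risks associated with multi-agent systems are not merely theoretical; they are escalated versions of existing cybersecurity threats. Key concerns include: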
- Advanced Prompt Injections: An agent could be "hijacked" by a single malicious sentence buried in a document, turning a helpful assistant into self-guided malware.
- Automated Scams and Cyberattacks: Agents capable of reasoning and improvisation can execute complex, multi-step social engineering or hacking attempts at scale.
- Systemic Instability: Just as human institutions can cause unforeseen economic shifts, a massive deployment of autonomous agents could lead to digital "anarchy" or market instability.
Unlike traditional software, which follows fixed paths written by humans, AI agents reason and improvise. This unpredictability calls for a shift toward "zero trust" frameworks, an approach championed by Anthropic, in which every agent is treated as a potential vulnerability.
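As a rough illustration of what "zero trust" can mean between agents, the Python sketch below checks every request twice: first that it really comes from the agent it claims to (an HMAC signature), then that this agent is allowed to perform the requested action. This is not Anthropic's framework or any vendor's API; the agent names, keys, actions, and the `sign` and `authorize` helpers are all hypothetical, chosen only to show the idea.

```python
import hashlib
import hmac

# Hypothetical per-agent secrets and permissions, for illustration only.
AGENT_KEYS = {
    "billing-agent": b"demo-key-1",
    "search-agent": b"demo-key-2",
}
ALLOWED_ACTIONS = {
    "billing-agent": {"read_invoice"},
    "search-agent": {"web_search"},
}


def sign(agent_id: str, action: str, key: bytes) -> str:
    """Produce an HMAC-SHA256 signature over the agent id and action."""
    message = f"{agent_id}:{action}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def authorize(agent_id: str, action: str, signature: str) -> bool:
    """Trust nothing by default: verify identity, then check permissions."""
    key = AGENT_KEYS.get(agent_id)
    if key is None:
        return False  # unknown agent
    expected = sign(agent_id, action, key)
    # Constant-time comparison on bytes avoids leaking timing information.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return False  # forged or tampered request
    # A valid identity is still limited to its own allowlist.
    return action in ALLOWED_ACTIONS.get(agent_id, set())


if __name__ == "__main__":
    key = AGENT_KEYS["search-agent"]
    print(authorize("search-agent", "web_search",
                    sign("search-agent", "web_search", key)))    # True
    print(authorize("search-agent", "read_invoice",
                    sign("search-agent", "read_invoice", key)))  # False
```

A check like this does not stop an agent from being hijacked by a prompt injection. What it can do is limit the damage: a compromised search agent with a valid signature still cannot read invoices.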
Key Takeaways
- New Funding Initiative: Google DeepMind and partners have committed $10 million to fund academic research into the unpredictable behaviors of interacting AI agents.
- Emergent Risks: The primary concern is that millions of autonomous agents could create systemic risks, such as automated cyberattacks and "hive mind" behaviors, that cannot be predicted by testing single models.
- Shift in Security Paradigms: As agents move from fixed software to reasoning entities, the industry is shifting toward "zero trust" models to mitigate the risks of hijacking and prompt injection.