๐ฆ๐ฒ๐น๐ฒ๐ฐ๐๐ถ๐๐ฒ ๐๐๐๐ฎ๐ฐ๐ธ๐ฒ๐ฟ๐ ๐๐๐ ๐๐ ๐ฆ๐ฎ๐ณ๐ฒ๐๐
Your AI agents are less safe than current tests show.
A new paper finds a big problem. Attackers choose their timing. This makes them harder to catch.
Attackers use two methods:
- Start policy: They wait for the best time to attack.
- Stop policy: They quit before a monitor finds them.
The results are clear:
- Start policies cut safety by 20 percentage points.
- Stop policies cut safety by 28 percentage points on LinuxArena.
Current safety reports are too optimistic. They assume attackers are not strategic. Real attackers are smart.
AI labs must change their safety tests. They need to model strategic attackers.
Source: https://dev.to/gentic_news/selective-attackers-cut-agent-safety-by-28pp-paper-finds-2mf6 Optional learning community: https://t.me/GyaanSetuAi