You Can't Prevent Prompt Injection. So What Do You Do?
Stop trying to build the perfect system message. Stop waiting for a better model version to fix security.
Prompt injection will happen. No model reliably refuses malicious instructions when they look like commands. If you design your security around perfect models, you will fail.
Shift your focus. Do not ask how to prevent injection. Ask what your agent can do after an injection succeeds.
Follow these rules to limit the damage:
- Use capability-scoped credentials. Give your agent only the permissions it needs for the current task. An agent with read-only access causes less damage than an agent with an admin key.
- Gate destructive actions. Require a second factor or a manual check for actions like deleting, paying, or granting access. Do not let the model decide if these actions are safe.
- Treat all external input as untrusted. This includes user messages, web pages, tool outputs, and documents. Data often becomes instructions when it enters the context window.
- Watch your tool outputs. Many frameworks pipe tool logs directly into the model context. A study of 17,022 agent skills found credentials leaking through debug logs in 73.5% of cases. Redact secrets from tool outputs before they reach the model.
- Monitor behavior, not just quality. A hijacked agent produces high-quality text while performing unauthorized actions. You must baseline normal action sequences and alert on deviations.
Use these three levels of behavioral monitoring:
- Static rules: Catch obvious changes, like an agent suddenly sending emails.
- Sequence patterns: Flag actions that do not fit the normal shape of the agent's work.
- Independent review: Use a second model to judge the primary agent.
If your agent uses memory, keep it clean. A single injection can write a poisoned fact that repeats in every session. Scope memory per instance and track where every piece of information comes from.
If an agent gets hijacked right now, would you see it in your logs? Or are you flying blind?
Source: https://dev.to/brennhill/you-cant-prevent-prompt-injection-so-what-do-you-actually-do-1d37
Optional learning community: https://t.me/GyaanSetuAi
