𝗧𝗵𝗲 𝗛𝘂𝗺𝗮𝗻-𝗶𝗻-𝘁𝗵𝗲-𝗟𝗼𝗼𝗽 𝗦𝗥𝗘
Automation moves faster than humans.
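In June 2021, a single customer configuration change triggered a latent software bug at Fastly and caused a global outage. The failure hit most of the network almost at once, and Fastly detected it within a minute. It still took 49 minutes to bring 95% of the network back to normal.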
This is the core challenge of AI-assisted SRE. AI can detect and fix issues at speeds humans cannot match. The danger is not the technology. The danger is the speed gap between automated actions and human accountability.
You must design an escalation policy to define where automation ends and human judgment begins.
Use the Automation Autonomy Spectrum to govern your AI:
• Level 0 (Manual): AI provides no help. Humans do everything.
• Level 1 (Assisted): AI provides context. Humans make all decisions.
• Level 2 (Supervised): AI suggests actions. Humans must approve each one.
• Level 3 (Conditional): AI acts within set rules. Humans get notified.
• Level 4 (Autonomous): AI acts and verifies alone.
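One way to make the spectrum enforceable is to encode it as a gate that every AI-proposed action must pass. The sketch below is only an illustration: the names `Autonomy`, `may_execute`, and `must_notify_human` are hypothetical, and the flags `human_approved` and `within_rules` are assumed to come from your own approval and policy systems.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """The five levels of the Automation Autonomy Spectrum."""
    MANUAL = 0
    ASSISTED = 1
    SUPERVISED = 2
    CONDITIONAL = 3
    AUTONOMOUS = 4


def may_execute(level: Autonomy, human_approved: bool, within_rules: bool) -> bool:
    """Return True if the AI itself may carry out the proposed action."""
    if level <= Autonomy.ASSISTED:
        return False  # Levels 0-1: humans act, AI gives context at most
    if level == Autonomy.SUPERVISED:
        return human_approved  # Level 2: every action needs explicit approval
    if level == Autonomy.CONDITIONAL:
        return within_rules  # Level 3: only inside the preset rules
    return True  # Level 4: AI acts and verifies alone


def must_notify_human(level: Autonomy) -> bool:
    """Level 3 is the level where humans get notified of each action."""
    return level == Autonomy.CONDITIONAL


print(may_execute(Autonomy.SUPERVISED, human_approved=False, within_rules=True))  # False
print(may_execute(Autonomy.CONDITIONAL, human_approved=False, within_rules=True))  # True
```

Keeping the level as a single value per automation makes it simple to raise or lower autonomy without touching the automation itself.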
Never leave an automation at Level 4 forever. Systems change. An automation that works today might become dangerous tomorrow if the system or failure pattern it was built for shifts. You must review every autonomous automation on a regular schedule.
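A regular review can be enforced rather than remembered. The sketch below assumes a hypothetical 90-day review interval and drops a Level 4 automation to Level 3 once its review is overdue. The interval, the function names, and the choice to fall back exactly one level are all illustrative assumptions.

```python
from datetime import date, timedelta

# Hypothetical review interval; pick one that matches how fast your systems change.
REVIEW_INTERVAL = timedelta(days=90)


def review_overdue(last_reviewed: date, today: date) -> bool:
    """True if the automation has gone longer than the interval without review."""
    return today - last_reviewed > REVIEW_INTERVAL


def effective_level(level: int, last_reviewed: date, today: date) -> int:
    """Cap a Level 4 automation at Level 3 until a human reviews it again."""
    if level == 4 and review_overdue(last_reviewed, today):
        return 3
    return level


print(effective_level(4, date(2024, 1, 1), date(2024, 6, 1)))  # 3
print(effective_level(4, date(2024, 5, 1), date(2024, 6, 1)))  # 4
```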
Shift from automation to human oversight when any of these four triggers occurs:
- Low Confidence: The AI is unsure of its diagnosis.
- High Blast Radius: The action affects too many services or users.
- Novelty: The failure pattern is new and unseen by the AI.
- Regulation: The action touches a regulated or compliance-scoped system.
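The four triggers can be checked in code before any automated action runs. This is a minimal sketch under stated assumptions: the thresholds `MIN_CONFIDENCE` and `MAX_SERVICES_AFFECTED` are made-up values, and the `ProposedAction` fields are hypothetical stand-ins for whatever your AI tooling actually reports.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them from your own incident data.
MIN_CONFIDENCE = 0.90
MAX_SERVICES_AFFECTED = 3


@dataclass
class ProposedAction:
    confidence: float               # AI's confidence in its diagnosis, 0.0 to 1.0
    services_affected: int          # rough measure of blast radius
    seen_before: bool               # does the failure match a known pattern?
    touches_regulated_system: bool  # e.g. a payments or health-data system


def escalation_reasons(action: ProposedAction) -> list[str]:
    """Return every trigger that fired; an empty list means no escalation."""
    reasons = []
    if action.confidence < MIN_CONFIDENCE:
        reasons.append("low_confidence")
    if action.services_affected > MAX_SERVICES_AFFECTED:
        reasons.append("high_blast_radius")
    if not action.seen_before:
        reasons.append("novelty")
    if action.touches_regulated_system:
        reasons.append("regulation")
    return reasons


action = ProposedAction(confidence=0.97, services_affected=12,
                        seen_before=True, touches_regulated_system=False)
print(escalation_reasons(action))  # ['high_blast_radius']
```

Returning the list of reasons, not just a yes or no, gives the on-call human the context for why the automation stopped.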
Do not let "the AI decided" be your excuse. Every action must trace back to a human or a policy approved by leadership.
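Traceability is easier to keep if the audit record itself refuses to exist without an owner. The sketch below assumes a hypothetical `ActionRecord` type; the field names and the example policy ID are illustrative, not taken from any real tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionRecord:
    """Audit entry for one automated action."""
    action: str
    approved_by: Optional[str] = None  # the human who approved this action
    policy_id: Optional[str] = None    # the leadership-approved policy that allowed it

    def __post_init__(self) -> None:
        # Reject any action that traces back to neither a human nor a policy.
        if not (self.approved_by or self.policy_id):
            raise ValueError(f"no accountable owner for action: {self.action}")


# Hypothetical example values.
record = ActionRecord(action="restart checkout-api", policy_id="POL-017")
print(record.policy_id)  # POL-017
```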
Build your policy before you turn on the automation. Use data to prove your AI is accurate. If your AI is wrong too often, downgrade its autonomy immediately.
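"Wrong too often" needs a number before it can trigger anything. The sketch below assumes you log whether each AI decision turned out correct, and it uses hypothetical values for the accuracy floor and minimum sample size. Capping unproven automations at Level 2 is one possible reading of "use data to prove your AI is accurate", not the only one.

```python
# Hypothetical thresholds; set them from your own risk tolerance.
MIN_ACCURACY = 0.95
MIN_SAMPLES = 20


def next_level(current_level: int, correct: int, total: int) -> int:
    """Decide the autonomy level (0-4) from reviewed decision outcomes."""
    if total < MIN_SAMPLES:
        # Not enough evidence yet: allow Level 2 (Supervised) at most.
        return min(current_level, 2)
    accuracy = correct / total
    if accuracy < MIN_ACCURACY:
        # Wrong too often: drop one level, never below Level 0.
        return max(current_level - 1, 0)
    return current_level


print(next_level(4, correct=5, total=5))     # 2
print(next_level(3, correct=80, total=100))  # 2
print(next_level(3, correct=99, total=100))  # 3
```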