
What AWS Kiro's 13-Hour Outage Teaches Us About AI Agent Governance

February 2026

Last week, the Financial Times reported that Amazon's AI coding agent Kiro caused a 13-hour AWS outage in December. The agent was tasked with making changes to a service. It decided the best approach was to delete and recreate the entire environment.

Thirteen hours of downtime. An AWS service in mainland China, offline.

Amazon's response? Staff training. They blamed human error for giving the agent too many permissions.

The agent normally required sign-off from two humans to push changes. A permissions error gave it the access of a single operator.

One misconfiguration. One agent. Thirteen hours.

This Wasn't a One-Off

A senior AWS employee told the FT this was the second production outage linked to an AI tool in recent months. The other was caused by Amazon Q Developer. The employee's description: "small but entirely foreseeable."

Amazon's official position: "It's a coincidence that AI tools were involved."

It's not a coincidence. It's a pattern.

AI coding agents are being given production access with human-level permissions. When those permissions are misconfigured — and they will be — the agent doesn't hesitate. It doesn't second-guess. It executes.

Training Isn't the Fix

Amazon's remediation was "staff training" and "safeguards." This is the same playbook organizations have used for decades with human operators: when someone makes a mistake, train them not to.

But AI agents aren't humans. They don't learn caution from incident reviews. They don't develop institutional memory from postmortems. Every invocation starts fresh.

The real question isn't "how do we train people to configure agent permissions correctly?" It's:

Why does the agent have production access at all?

The Deploy Gate Pattern

Kiro's two-human sign-off requirement was the right instinct. It failed because it was implemented as a permission — something that could be misconfigured, overridden, or inherited from an operator's access level.

A deploy gate inverts this:

  1. The agent proposes a change. It writes code, opens a PR, runs tests. All of this is fine.
  2. The deploy pipeline hits a gate. Before anything touches production, the pipeline checks for an authorization receipt.
  3. If no receipt exists, the deploy stops. Hard. Not a warning, a block.
  4. A human reviews and approves. A cryptographic receipt is generated, scoped to this exact change.
  5. The deploy proceeds. The receipt is verified. The change lands.

The critical difference: the agent never holds production access. It can't inherit it. It can't be misconfigured into having it. The gate is external to the agent.
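
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular product: the names `change_digest`, `issue_receipt`, and `gate`, the HMAC-based receipt format, and the hard-coded key are all hypothetical. A real setup would likely use asymmetric signatures, so the pipeline holds only a verification key, and would load secrets from a secret store.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical signing key. It lives with the approval service and the
# pipeline, never with the agent. Hard-coded here only for illustration.
APPROVAL_KEY = b"example-key-not-for-production"


def change_digest(change: bytes) -> str:
    """Fingerprint the exact change (for example a diff or a plan)."""
    return hashlib.sha256(change).hexdigest()


def issue_receipt(change: bytes, approver: str, key: bytes = APPROVAL_KEY) -> dict:
    """Step 4: called only after a human approves this specific change."""
    digest = change_digest(change)
    message = f"{approver}:{digest}".encode()
    signature = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"approver": approver, "digest": digest, "signature": signature}


def gate(change: bytes, receipt: Optional[dict], key: bytes = APPROVAL_KEY) -> bool:
    """Steps 2, 3 and 5: allow the deploy only with a valid receipt."""
    if receipt is None:
        return False  # no receipt: hard block, not a warning
    if receipt.get("digest") != change_digest(change):
        return False  # receipt was issued for a different change
    message = f"{receipt.get('approver')}:{receipt.get('digest')}".encode()
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # Constant-time comparison of the recomputed and presented signatures.
    return hmac.compare_digest(expected, str(receipt.get("signature", "")))
```

In this sketch, a receipt issued for a rolling update cannot be reused to push a delete-and-recreate plan, because the digest of the change is part of what gets signed.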

What Would Have Happened With a Deploy Gate

Kiro proposes: Delete and recreate the environment.

Deploy gate fires: ❌ No authorization receipt. Deploy blocked.

Human reviews: "Delete and recreate production? No. Try a rolling update instead."

Result: Zero downtime. The agent's bad judgment never reached production.

This isn't about slowing AI agents down. It's about inserting a checkpoint between "the agent decided" and "production changed."

Permissions Are Not Authorization

AWS has one of the most sophisticated permissions systems in the industry: IAM policies, roles, permissions boundaries, policy conditions, and hundreds of configuration options.

It still wasn't enough.

That's because permissions answer the question "what is this identity allowed to do?" Authorization answers a different question: "is this specific action, right now, approved?"

Kiro had the permissions to delete the environment. It did not have authorization for that specific action. But there was no system to enforce the difference.

A deploy gate enforces exactly that difference. Every action requires a scoped, time-limited, cryptographically signed receipt. Permissions become the ceiling. Authorization becomes the gate.
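
To make the ceiling-versus-gate distinction concrete, here is a small Python sketch. It assumes a hypothetical `Receipt` type and `may_execute` check; the field names and action strings are invented for the example, and signature verification is deliberately left out to keep the focus on scope and expiry.

```python
import time
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Receipt:
    """Hypothetical authorization receipt (signature check omitted here)."""
    action: str        # the one action this receipt approves
    target: str        # the one resource it applies to
    expires_at: float  # Unix timestamp after which the receipt is void


def may_execute(
    permissions: Set[str],
    action: str,
    target: str,
    receipt: Optional[Receipt],
    now: Optional[float] = None,
) -> bool:
    """Allow an action only if it is permitted AND specifically approved."""
    now = time.time() if now is None else now
    if action not in permissions:
        return False  # ceiling: the identity can never do this
    if receipt is None:
        return False  # gate: permitted in general, not approved right now
    if (receipt.action, receipt.target) != (action, target):
        return False  # receipt is scoped to a different action or resource
    return now < receipt.expires_at  # receipt must not have expired
```

Under these assumptions, an identity that holds a broad `delete_environment` permission still cannot delete anything with a receipt that approves only `update_service`, and an approval from yesterday does not carry over to today.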

This Will Happen Again

Every major tech company is building AI coding agents. Every one of them will eventually give those agents access to production systems. And every one of them will discover, the hard way, that permissions alone aren't enough.

The question is whether you discover it during a 13-hour outage or before one.

Add a deploy gate to your repo. Two minutes. Zero outages.

Install Deploy Gate →