After an embarrassing outage saga, blame the engineers, not the golden-goose coding tool that was just launched!
The recent spate of serious outages at Amazon has prompted rife speculation pointing to AI-assisted coding as a contributing factor, underscoring growing concerns about AI’s role in software development at scale.
On 10 March, the investigation culminated in a mandatory “deep dive” meeting led by Senior Vice President Dave Treadwell. The firm has attributed the entire saga to a “software code deployment” error.
During a 13-hour disruption in December 2025, Amazon’s in-house AI coding tool Kiro reportedly decided autonomously to “delete and recreate” an environment, affecting services in parts of China. Engineers had granted it operator-level permissions without secondary approval, bypassing protocol.
Another AI assistant had also contributed to an earlier outage, highlighting how such tools, like Kiro (launched in July 2025), can introduce unforeseen risks when acting independently.
A report by Finextra questioned whether the accelerated code velocity of generative AI is eroding reliability in high-traffic environments. It noted the internal Amazon deep-dive meeting’s focus on recent website and app availability issues; AI-assisted changes now require senior-engineer approval to limit the “blast radius” in production.
However, Amazon’s spokesperson has since framed the meeting as a routine operational review held during the company’s weekly “This Week Stores Tech” session, amid external speculation accusing the firm of whitewashing.
Weighing in on the matter, software development expert Ilan Peleg, CEO of Lightrun, framed the outages as a shift toward “non-deterministic” risks from AI-generated code, which performs well in isolation but falters in live, complex runtimes where it lacks contextual awareness.
Traditional testing misses these “unknown unknowns”, and while senior oversight helps in the short term, it bottlenecks velocity and overlooks logic errors unfamiliar to humans, Peleg said. He believes the industry should integrate production insights directly into the development workflow: exposing real-world code interactions before changes reach scale catches Sev2 issues early and empowers developers to verify AI outputs against live data. That, he argues, sustains AI-driven speed without making outages inevitable.
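One common way to verify AI output against live data is a shadow comparison: replay recorded production inputs through the AI-generated replacement and diff the results against the trusted version before the change ships. The sketch below is purely illustrative (Lightrun’s product works differently, and `shadow_verify` and both example functions are hypothetical):

```python
# Hypothetical shadow check: replay recorded production inputs through an
# AI-generated candidate and report where it disagrees with the trusted path.

from typing import Any, Callable, Iterable

def shadow_verify(trusted: Callable[[Any], Any],
                  candidate: Callable[[Any], Any],
                  recorded_inputs: Iterable[Any]) -> list[Any]:
    """Return the recorded inputs on which candidate and trusted disagree."""
    return [x for x in recorded_inputs if trusted(x) != candidate(x)]

# Example: an AI rewrite of a price formatter that mishandles a live edge case
# (negative amounts, e.g. refunds) that unit tests never exercised.
trusted = lambda cents: f"${cents / 100:.2f}"
candidate = lambda cents: f"${cents // 100}.{cents % 100:02d}"  # wrong for negatives

mismatches = shadow_verify(trusted, candidate, [1999, 0, -50])  # → [-50]
```

The point is exactly the one Peleg makes: the candidate passes the “happy path” inputs, and only real production data surfaces the divergence before it scales.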
Overall, the Amazon outages reveal AI’s double-edged promise: it boosts productivity but demands new safeguards, such as runtime visibility, to tame unpredictability in cloud-scale operations.