Is AI the Catalyst Chaos Engineering Needed to Solve Service Reliability Struggles?

Engineering leaders are shouldering a growing responsibility to avoid downtime and help the business meet resilience regulations like DORA and NIS2. This means working to ensure teams are following best practice, putting disaster recovery plans in place, and carrying out rigorous testing throughout the software delivery lifecycle (SDLC). But this responsibility can quickly become overwhelming.

To improve resilience, engineering teams should automate processes where possible, including setting up failover policies so applications can recover quickly from an outage. They must also be able to automatically roll back to the last working version of their software if something breaks. There are many approaches to improving resilience across the SDLC, but one of the most impactful is something many developers have overlooked: chaos engineering.
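
To make the rollback idea concrete, here is a minimal sketch of a deploy step that reverts to the last working version when a health check fails. The Deployer class and the health_check callback are hypothetical names used for illustration; a real pipeline would delegate both to its deployment tooling.

```python
from typing import Callable, Optional


class Deployer:
    """Toy deployer, hypothetical and for illustration only."""

    def __init__(self, health_check: Callable[[str], bool]) -> None:
        self.health_check = health_check
        self.active: Optional[str] = None      # version currently serving traffic
        self.last_good: Optional[str] = None   # most recent version that passed its check

    def deploy(self, version: str) -> bool:
        """Activate `version`; revert to the last good version if it is unhealthy."""
        self.active = version
        if self.health_check(version):
            self.last_good = version
            return True
        # Automatic rollback: restore the last version known to work.
        self.active = self.last_good
        return False


# Assumed scenario: "v2" is a broken release.
deployer = Deployer(health_check=lambda version: version != "v2")
deployer.deploy("v1")
deployer.deploy("v2")
print(deployer.active)  # prints "v1"
```

The point of the sketch is that the rollback decision is made by the pipeline itself, not by an engineer paged in the middle of the night.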

Menacing or Misunderstood?

The concept of chaos engineering was brought into the spotlight by Netflix 17 years ago, following a days-long outage triggered by database corruption. Chaos Monkey was the result: a system built to randomly terminate instances and services within Netflix’s environment, ensuring its streaming platform could withstand unexpected disruptions without compromising user experience.
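
The core idea of random termination can be sketched in a few lines. This is not Netflix's implementation; it is a simplified simulation in which the Service class and the terminate_random_instance function are hypothetical names, and "terminating" an instance just removes it from an in-memory set.

```python
import random
from typing import List, Optional


class Service:
    """A simulated service backed by a set of named instances."""

    def __init__(self, name: str, replicas: int) -> None:
        self.name = name
        self.instances = {f"{name}-{i}" for i in range(replicas)}

    def is_available(self) -> bool:
        # The service survives as long as at least one instance is running.
        return len(self.instances) > 0


def terminate_random_instance(services: List[Service], rng: random.Random) -> Optional[str]:
    """Pick one running instance at random, remove it, and return its name."""
    candidates = [(svc, inst) for svc in services for inst in sorted(svc.instances)]
    if not candidates:
        return None
    service, instance = rng.choice(candidates)
    service.instances.remove(instance)
    return instance


fleet = [Service("api", replicas=3), Service("search", replicas=2)]
victim = terminate_random_instance(fleet, random.Random())
print(f"terminated {victim}; all services available: {all(s.is_available() for s in fleet)}")
```

Because every service in this toy fleet runs more than one replica, losing a single instance never takes a service down, which is exactly the property such a test is meant to confirm.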

Many engineers have dabbled with chaos engineering in the past, only to find that they couldn’t get their testing programs off the ground. In many organizations, chaos engineering has been relegated to a small subset of developers, far too few to drive meaningful change. Compounding this, it’s often introduced too late in the software delivery cycle, which limits its potential to improve system resilience.

But a lot has changed in the past 17 years, and it might be time for engineering leaders to reconsider whether chaos engineering could give their teams an edge. Recent advances in automation and AI, together with the emergence of Internal Developer Portals (IDPs), present an opportunity to scale and streamline the practice, making it more accessible, effective, and integrated into day-to-day development workflows.

AI is Redefining Chaos

As AI capabilities evolve, smaller teams can automate chaos engineering practices to deliver much greater impact. AI can, for instance:

Handle discovery automatically – offering engineers an instant, comprehensive view of the applications and infrastructure supporting their services. Engineering teams can now automate up to four of the five foundational reliability tests – such as monitoring application uptime and response times – streamlining efforts to assess and improve system resilience at scale.
Free up time for high-value activities – by automating the detection of common reliability issues, developers can shift their focus toward higher-value application-specific scenarios that could be easily overlooked. This might include stress-testing an application’s ability to handle a sudden spike in concurrent logins, or evaluating how it processes multiple inputs in a single database field.
Automate experimentation – AI is also capable of simplifying the process of creating chaos experiments. With generative AI and natural language interfaces, teams can dramatically reduce the time and mental effort needed to design and implement tests.
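
Whether an engineer writes it or an AI assistant generates it, a chaos experiment tends to follow the same shape: confirm the system is healthy, inject a fault, check that it stays healthy, then clean up. The sketch below illustrates that shape using the login-spike scenario mentioned above. ChaosExperiment, run_experiment, and LoginService are hypothetical names, and the service is a toy in-memory stand-in, not a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    name: str
    steady_state: Callable[[], bool]  # returns True while the system is healthy
    inject: Callable[[], None]        # introduces the fault or load
    rollback: Callable[[], None]      # restores normal conditions


def run_experiment(experiment: ChaosExperiment) -> str:
    """Return "aborted", "passed", or "failed" for a single experiment run."""
    if not experiment.steady_state():
        return "aborted"  # never inject faults into an already unhealthy system
    experiment.inject()
    try:
        return "passed" if experiment.steady_state() else "failed"
    finally:
        experiment.rollback()  # always clean up, even if the check raises


class LoginService:
    """Toy service that is healthy while sessions stay within capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.active_sessions = 0

    def healthy(self) -> bool:
        return self.active_sessions <= self.capacity

    def add_sessions(self, count: int) -> None:
        self.active_sessions += count

    def reset(self) -> None:
        self.active_sessions = 0


service = LoginService(capacity=1000)
spike = ChaosExperiment(
    name="login-spike",
    steady_state=service.healthy,
    inject=lambda: service.add_sessions(500),  # assumed spike size
    rollback=service.reset,
)
print(run_experiment(spike))  # prints "passed"
```

In this framing, the part AI would help with is filling in the three callbacks from a plain-language description; the surrounding structure stays the same from one experiment to the next.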

To further scale these capabilities, enterprises can integrate AI-generated scenarios into their IDPs. By doing so, chaos testing becomes a shared, accessible resource, allowing teams across the organization to reuse and adapt proven experiments. This helps embed resilience testing into the development lifecycle, making it easier to validate the stability of new features before they go live.
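
As a rough illustration of what "reuse and adapt" could look like, the sketch below models a shared catalog of experiment templates. It does not reflect any particular IDP's API; ExperimentTemplate, ExperimentCatalog, and the field names are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class ExperimentTemplate:
    name: str
    fault: str             # e.g. "terminate-instance" (illustrative label)
    target: str            # service the experiment runs against
    duration_seconds: int


class ExperimentCatalog:
    """In-memory stand-in for a catalog a portal might expose."""

    def __init__(self) -> None:
        self._templates: Dict[str, ExperimentTemplate] = {}

    def publish(self, template: ExperimentTemplate) -> None:
        self._templates[template.name] = template

    def adapt(self, name: str, **overrides) -> ExperimentTemplate:
        # Return a modified copy; the published template stays unchanged.
        return replace(self._templates[name], **overrides)


catalog = ExperimentCatalog()
catalog.publish(ExperimentTemplate("instance-loss", "terminate-instance", "checkout", 60))

# Another team reuses the proven experiment against its own service.
payments_test = catalog.adapt("instance-loss", target="payments")
print(payments_test.target, payments_test.fault)  # prints "payments terminate-instance"
```

Keeping templates immutable and handing out adapted copies means one team's tweaks cannot silently alter an experiment that other teams rely on.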

Turning Chaos into Order

By working chaos engineering into their core software delivery platforms, developers can weave resilience into every stage of the application development lifecycle. This integration enables more thorough and continuous resilience testing as part of day-to-day operations.

Have you revisited chaos engineering recently with AI in mind? Should engineering teams be embracing the chaos? I’d love to hear your thoughts, so please share below or get in touch.
