The response to our Scaling for the Holiday Rush: Reliability, Resilience, and Real-Time Recovery webinar was overwhelming. Dozens of questions poured in about one topic we barely touched: How is AI helping teams during incidents? Not the marketing hype, but the real, practical applications. After discussing Best Buy's Black Friday collapse and Shopify's Cyber Monday admin failure, the question everyone kept asking was: "Could AI have prevented these outages, or at least helped us respond faster?"
In my role leading Target's chaos engineering, disaster recovery, and notifications platforms, I oversaw systems that executed thousands of chaos experiments annually, maintained disaster recovery compliance across hundreds of core IT systems, and processed billions of Kafka messages during peak season. I've seen firsthand how AI is evolving from theory to practice.
Matt Schillerstrom sat down with Jason Doffing again to unpack this question, exploring how AI is transforming the chaotic reality of incident response, from alert storms to root cause identification and resilience testing. We're getting a preview of what's to come at Chaos Carnival 2026.
The Current Pain - Alert Fatigue and Noise
Matt: Jason, after our peak season webinar, the top questions were all about AI and incidents. But let's start with the problem. When Best Buy or Shopify were failing, what was happening in their war rooms that AI might have helped with?
Jason: Let's paint the picture. It's 9:08 AM on Cyber Monday, and Shopify's admin system is failing. The admin system is the back-office tool for running a business on the Shopify Commerce platform; it codifies a business's operations, logistics, online design, and marketing, handling everything for business owners. Within 60 seconds of the admin system starting to fail, you've got:
- 47 alerts fired from the monitoring system
- Database connection pool warnings
- API timeout notifications
- Load balancer health check failures
- Customer support tickets flooding in
- Social media mentions of #ShopifyDown trending
- PagerDuty blowing up everyone's phones
I lived this at Target during peak seasons. Our notifications platform was processing billions of Kafka messages, and when something went wrong during a traffic surge, the alert storm was overwhelming. We'd see cascading failures across microservices and platforms, and engineers would waste 10 to 15 minutes just trying to determine which alert mattered most. Now, which of those 47 alerts is the root cause, and which are just symptoms? That's the problem AI is starting to help us sort out: separating signal from noise when every second costs revenue.
Matt: So, it's about cutting through the chaos?
Jason: Exactly. Traditional monitoring tools treat every alert equally. They don't understand causation or importance. A database connection pool exhaustion might trigger 30 downstream alerts across your microservices architecture. Humans waste precious minutes triaging symptoms instead of fixing root causes. Back in that 2016 Target war room incident I mentioned in our first conversation, when I had the CIO looking over my shoulder at 2 AM, we didn't have AI-powered correlation. I had to manually trace through service dependencies, review multiple dashboards, and draw on institutional knowledge to determine that the platform wasn't the issue; it was a downstream service. Those fifteen minutes felt like an eternity. Today, with AI-assisted tooling, that same incident would likely present as:
- Root cause: Service X is responding slowly.
- Contributing factor: Database connection saturation.
- Recommended action: Scale read replicas or restart the connection pool.
- Similar incidents: 3 in the past 6 months.
That's ninety seconds of AI-assisted triage versus fifteen minutes of human investigation and speculation. That's the evolution I've personally witnessed: from pure human correlation to AI-augmented root cause analysis.
AI in Real-Time Incident Response - The Three Layers
Matt: Walk me through how AI is actually being used during live incidents. Not theory, real applications.
Jason: There are three layers where AI is making an immediate impact during incidents:
Layer 1: Alert Correlation and Root Cause Suggestion
At Target, our notifications platform processed billions of Kafka messages annually. During peak season, a single infrastructure issue could trigger hundreds of downstream alerts across our microservices. I recall instances where we had 50+ services all screaming at once, and the on-call engineer had to mentally trace dependencies to determine what had actually broken first. Modern AIOps platforms ingest all your telemetry (metrics, logs, traces, events) and build correlation graphs in real time. When an incident starts, instead of showing you 47 alerts, they show you:
- Primary issue: Database connection pool exhausted on db-primary-01.
- Contributing factors: Traffic spike from /checkout endpoint (3x normal).
- Similar incidents: 3 in past 6 months, all resolved by scaling connection pool or adding read replicas.
That's not replacing human judgment. It's giving your on-call engineer the context to make a decision in 90 seconds instead of 15 minutes.
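To make the correlation layer concrete, here's a minimal sketch of that heuristic in Python. The service names and the DEPENDS_ON graph are hypothetical, and production AIOps platforms infer the graph automatically from traces; this only shows the core idea of walking dependencies to separate a root-cause candidate from downstream noise.

```python
# Hypothetical service dependency graph: each service -> what it depends on.
DEPENDS_ON = {
    "checkout": ["payment-gateway", "inventory-service", "db-primary-01"],
    "payment-gateway": ["db-primary-01"],
    "inventory-service": ["db-primary-01"],
    "db-primary-01": [],
}

def correlate(alerting: set[str]) -> dict:
    """Split an alert storm into root-cause candidates and downstream symptoms.

    Heuristic: a service that is alerting while all of its dependencies are
    quiet is a likely root cause; everything downstream of it is a symptom.
    """
    candidates = [
        svc for svc in sorted(alerting)
        if not any(dep in alerting for dep in DEPENDS_ON.get(svc, []))
    ]
    symptoms = sorted(alerting - set(candidates))
    return {"root_cause_candidates": candidates, "symptoms": symptoms}

storm = {"checkout", "payment-gateway", "inventory-service", "db-primary-01"}
print(correlate(storm))
# {'root_cause_candidates': ['db-primary-01'],
#  'symptoms': ['checkout', 'inventory-service', 'payment-gateway']}
```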
Layer 2: Automated Runbook Suggestions
Here's where it gets interesting. AI systems trained on your incident history can suggest: "Based on similar incidents, teams typically:
- Scale database instances.
- Add read replicas.
- Implement connection pooling limits.
Incident #1247 (similar pattern) was resolved in 8 minutes using approach #2."
Again, the human is still in control, but you're not starting from scratch. You're building on institutional knowledge that's usually trapped in someone's head or a wiki that no one updates.
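As a toy illustration of how runbook suggestion can work, here's a sketch that ranks past incidents by textual similarity to the live one. The HISTORY records and Jaccard scoring are stand-ins; real systems typically use embeddings and a vector store, but the retrieve-and-suggest loop is the same.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set similarity: 0 = disjoint, 1 = identical."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical incident history; real systems would use embeddings instead.
HISTORY = [
    {"id": 1247, "summary": "database connection pool exhausted during traffic spike",
     "resolution": "added read replicas", "minutes_to_resolve": 8},
    {"id": 1312, "summary": "kafka consumer lag during peak traffic",
     "resolution": "pre-scaled consumers", "minutes_to_resolve": 12},
]

def suggest_runbooks(live_summary: str, top_n: int = 3) -> list[dict]:
    """Rank past incidents by similarity to the live one, most similar first."""
    live = set(live_summary.lower().split())
    return sorted(
        HISTORY,
        key=lambda past: jaccard(live, set(past["summary"].split())),
        reverse=True,
    )[:top_n]

for past in suggest_runbooks("connection pool exhausted during traffic spike on checkout"):
    print(f"Incident #{past['id']}: {past['resolution']} ({past['minutes_to_resolve']} min)")
```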
Layer 3: Predictive Incident Detection
This is the holy grail, and we're getting there. Instead of alerting when database connections reach 95% (too late), AI models trained on your normal versus abnormal patterns can alert when connection pool usage is trending toward exhaustion 10 minutes before it occurs. During Best Buy's Black Friday failure, if they had predictive detection showing "traffic trajectory will exceed capacity in 12 minutes," they could have scaled proactively instead of reactively.
Matt: That last one sounds almost too good to be true.
Jason: It's not magic. It's pattern recognition. If you've trained models on two years of Black Friday traffic patterns and suddenly see a trajectory that historically led to outages, that's predictive. The challenge is false positives; you can't cry wolf every time traffic goes up. The art is in model training and threshold tuning. At Target, we experimented with predictive capacity planning for our notifications platform. When we could predict Kafka message volume spikes 5-10 minutes before they hit the platform, we could pre-scale consumers and avoid the cascading backlog that would otherwise trigger downstream failures and drop alerts. That’s the power of predictive detection: you shift from reactive to proactive operations.
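Here's a deliberately simple sketch of that trend-extrapolation idea: a plain least-squares fit over recent connection-pool readings, projected forward to the limit. Real predictive models are trained on seasonal history; the numbers and thresholds below are purely illustrative.

```python
def minutes_until_exhaustion(samples: list[float], limit: float = 100.0,
                             interval_min: float = 1.0) -> float | None:
    """Fit a line to recent usage readings and extrapolate to the limit.

    samples: pool usage (%), one reading per interval_min minutes, oldest first.
    Returns estimated minutes until the limit is hit, or None if flat/declining.
    Assumes at least two samples.
    """
    n = len(samples)
    mean_x, mean_y = (n - 1) / 2, sum(samples) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
        / sum((x - mean_x) ** 2 for x in range(n))
    )
    if slope <= 0:
        return None  # trend is flat or improving; nothing to warn about
    return (limit - samples[-1]) / slope * interval_min

# Usage climbing toward exhaustion: warn well before a static 95% alert fires.
usage = [60, 63, 67, 70, 75, 79]
eta = minutes_until_exhaustion(usage)
if eta is not None and eta < 15:
    print(f"Predicted pool exhaustion in ~{eta:.0f} minutes - scale now")
```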
AI in Resilience Testing - Chaos Engineering Meets Machine Learning
Matt: Let's shift to prevention. How is AI changing chaos engineering and resilience testing?
Jason: This is where it gets really exciting, and it's what we'll be diving deep into at Chaos Carnival 2026. Let me ground this in real experience. At Target, I led the chaos engineering platform that executed thousands of chaos experiments annually. We built systems that would inject failures, including killing pods, adding latency, simulating network partitions, constraining resources, and corrupting message queues across our entire infrastructure. However, there was a limitation: every experiment had to be manually designed by an engineer who understood the architecture and hypothesized what might break and how to recover from it. Traditional chaos engineering is humans manually injecting failures: "Let's kill this pod and see what happens. Let's add 200ms latency to this API call." That's valuable, but it's limited by human imagination and the constraints of time. At Target, even with a dedicated chaos engineering team, we could only design and execute a fraction of the experiments we should have run. AI is transforming this in three ways:
1. Intelligent Experiment Generation
At Target, each experiment required manual design and implementation. An engineer had to understand the architecture and the tech stack, identify a hypothesis, configure the experiment parameters, set up and validate observability, and ensure system recoverability. Instead of engineers manually designing chaos experiments, AI can analyze your architecture, understand your dependencies, and suggest:
- Your checkout service depends on payment-gateway, inventory-service, and user-auth. Historical data shows payment-gateway has a 2% failure rate during peak traffic.
- Recommended experiment: Simulate payment-gateway returning 503 errors at 5% rate during the simulated Black Friday load. Expected blast radius: 5% checkout failures.
- Risk level: Medium
It's discovering fault patterns, vulnerabilities, single points of failure, and points of fragility in your systems that you didn't think to test for. If we'd had AI-powered experiment generation at Target, we could have 10X'd our chaos engineering coverage without 10X'ing the team.
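A rough sketch of what experiment generation might look like, assuming hypothetical DEPENDENCIES and OBSERVED_FAILURE_RATE inputs that an AI planner would derive from traces and incident history; the thresholds and multipliers here are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    target: str
    fault: str
    blast_radius: str
    risk: str

# Hypothetical inputs an AI planner might derive from traces and incident history.
DEPENDENCIES = {"checkout": ["payment-gateway", "inventory-service", "user-auth"]}
OBSERVED_FAILURE_RATE = {"payment-gateway": 0.02, "inventory-service": 0.005,
                         "user-auth": 0.001}

def generate_experiments(service: str) -> list[Experiment]:
    """Propose fault injection against a service's historically flaky dependencies."""
    proposals = []
    for dep in DEPENDENCIES.get(service, []):
        rate = OBSERVED_FAILURE_RATE.get(dep, 0.0)
        if rate >= 0.01:  # flaky enough in the wild to be worth stress-testing
            inject = rate * 2.5  # push slightly past what reality has shown us
            proposals.append(Experiment(
                target=dep,
                fault=f"return 503 on {inject:.0%} of calls under simulated peak load",
                blast_radius=f"~{inject:.0%} of {service} requests",
                risk="medium",
            ))
    return proposals

for exp in generate_experiments("checkout"):
    print(exp)
```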
2. Adaptive Experimentation
This is where machine learning really shines. Traditional chaos: inject failure, observe, stop. AI-powered chaos: inject failure, observe impact, adjust failure parameters in real-time to find the exact breaking point without actually breaking production. Think of it like a binary search for your system's failure threshold: "We can handle 4% payment gateway failures without customer impact, but at 6% it causes cascading failures. Now we know our resilience boundary."
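A sketch of that search loop, with stubbed inject and customer_impact hooks (both hypothetical). A real harness would wrap your chaos tooling and SLO checks, and would add guardrails like automatic rollback between steps.

```python
def find_resilience_boundary(inject, customer_impact,
                             lo: float = 0.0, hi: float = 0.10,
                             tolerance: float = 0.005) -> float:
    """Binary-search the failure-injection rate for the system's breaking point.

    inject(rate): run the chaos experiment at the given failure rate.
    customer_impact(): True if SLOs were breached during that run.
    Returns the highest rate the system survived without customer impact.
    """
    safe = lo
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        inject(mid)
        if customer_impact():
            hi = mid          # too aggressive; back off
        else:
            safe = lo = mid   # survived; push higher
    return safe

# Fake hooks standing in for real chaos tooling and SLO checks.
breaking_point = 0.055  # pretend the system degrades past 5.5% failures
last = {"rate": 0.0}
boundary = find_resilience_boundary(
    inject=lambda rate: last.update(rate=rate),
    customer_impact=lambda: last["rate"] > breaking_point,
)
print(f"Resilience boundary: ~{boundary:.1%}")  # ~5.3% with these fakes
```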
3. Hypothesis Generation from Incidents
Here's the game-changer, and this is where my work at Target really evolved. We maintained compliance across hundreds of disaster recovery plans, and every incident taught us something about our resilience gaps. But that knowledge was locked in post-mortems that lived in Confluence or other systems and were rarely revisited. What if AI could analyze your incident patterns and automatically generate chaos experiments to test whether you're still at risk? That's what modern systems are starting to do. At Target, we built systems that tracked incident patterns across our entire platform. We observed that payment processing failures during traffic spikes followed distinct patterns, but it took human effort to translate those incident learnings into chaos experiments. AI is automating that translation. Best Buy crashes on Black Friday due to a traffic surge? AI suggests:
- Generate experiment: Gradual traffic ramp from 100% to 400% of baseline over 30 minutes, targeting Black Friday traffic patterns. Test whether the new auto-scaling configuration prevents a repeat failure.
Matt: So, AI is turning your incidents into automated resilience tests?
Jason: Exactly. Your past failures become your future chaos experiments. That's how you prevent repeat incidents: by continuously testing whether you've actually fixed the underlying resilience gaps.
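To illustrate the translation step, here's a sketch that turns a hypothetical structured post-mortem record into a repeatable experiment spec. In practice, the AI's real contribution is extracting those structured fields from free-text post-mortems; the template logic itself is simple.

```python
# Hypothetical structured post-mortem record, the kind that usually rots in a wiki.
postmortem = {
    "id": "PM-1134",
    "trigger": "traffic_surge",
    "peak_multiplier": 4.0,          # traffic hit 4x baseline before the crash
    "failed_component": "autoscaler",
    "fix": "raised max replicas and tuned the scale-up policy",
}

def experiment_from_postmortem(pm: dict) -> dict:
    """Translate an incident's conditions into a repeatable chaos experiment."""
    if pm["trigger"] == "traffic_surge":
        return {
            "name": f"regression-test-{pm['id']}",
            "action": "ramp_traffic",
            "from_pct": 100,
            "to_pct": int(pm["peak_multiplier"] * 100),  # 100% -> 400% of baseline
            "ramp_minutes": 30,
            "hypothesis": f"{pm['failed_component']} holds after fix: {pm['fix']}",
        }
    raise ValueError(f"no experiment template for trigger {pm['trigger']!r}")

print(experiment_from_postmortem(postmortem))
```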
The Reality Check - Where AI Helps and Where It Doesn't
Matt: This all sounds promising, but let's talk about the hype vs. reality. Where does AI actually fall short?
Jason: Great question, because there's a lot of AI theater happening right now. Let me be clear about what AI is NOT doing:
AI is NOT fully autonomous incident resolution (yet)
We're not at "AI detects incident, AI fixes incident, humans find out about it later." That's the dream, but we're years away from trusting AI to make production changes without human approval. The closest we get is automated scaling and circuit breaker patterns, and those are rules-based automation, not true AI decision-making.
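For contrast, here's what that rules-based automation looks like: a minimal circuit breaker sketch where every behavior is a hard-coded rule (failure count, cooldown), with no learning anywhere.

```python
import time

class CircuitBreaker:
    """Rules, not learning: trip after N failures, fail fast, retry after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped, if open

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed; probe the dependency again
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # stop hammering a dead dependency
            raise
        self.failures = 0  # any success resets the count
        return result
```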
AI does NOT replace deep system knowledge
When Shopify's admin system failed on Cyber Monday, AI might have correlated alerts and suggested root causes. However, understanding WHY the admin system behaved differently than checkout (which remained up) requires human expertise in system architecture, dependencies, and design decisions. AI accelerates experts. It doesn't replace them.
AI does NOT work well with novel incidents
If you've never had a specific type of failure before, an AI trained on historical data won't recognize the pattern. Novel incidents, those Black Swan events, still require human creativity and problem-solving. The value is in the 80% of incidents that follow known patterns. That's where AI shines.
AI does NOT understand business context
When Best Buy's site crashed at 6:30 AM PT on Black Friday, AI might detect the technical issue quickly, but AI does not know that at 6:30 AM PT, West Coast shoppers were waking up and checking doorbuster deals on bestbuy.com, or that a one-hour outage during morning peak costs exponentially more than the same outage on any other shopping day. At Target, we always emphasized that resilience engineering is fundamentally grounded in maintaining the guest experience (keeping their trust) while protecting revenue. AI can identify technical failures, but humans make the business-critical decisions about response priorities, communication strategies, and risk trade-offs.
Matt: So, what's the realistic value proposition?
Jason: A 30-50% reduction in MTTD (mean time to detect) and MTTR (mean time to resolve) for known incident patterns. That's not marketing hype; that's what I personally experienced at Target when we implemented AI solutions for incident response. We architected and delivered AI-powered systems that improved our MTTD and MTTR while also reducing change failure/break rates (we can cover that in a future conversation). The key was focusing on the 80% of incidents that followed known failure patterns: database connection issues, traffic spikes, and dependency failures. For those patterns, AI-assisted correlation and automated runbook suggestions made a measurable difference. Could Best Buy's one-hour outage have been a 20-minute outage instead? That's millions of dollars saved. Could Shopify's two-hour admin failure become a 60-minute failure? That's thousands of merchants staying productive. The value isn't full automation. It's speed and accuracy improvements that compound into major business impact.
The Human-AI Partnership - Getting It Right
Matt: So, how should teams actually approach implementing AI for resilience and incident management?
Jason: This is crucial. AI is a force multiplier for engineering teams, not a replacement. Here's what I recommend:
Start with Data Quality
Garbage in, garbage out. If your monitoring is noisy, your logs are unstructured, and your incidents aren't well-documented, AI will amplify the noise. Before investing in AI incident response, invest in telemetry quality, structured logs, distributed tracing, and proper observability. AI requires high-quality data to generate reliable insights.
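One concrete piece of that groundwork is structured logging. Here's a minimal sketch (the service name and fields are illustrative): emitting one JSON object per log line gives correlation engines something to work with beyond grep.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """One JSON object per line, so machines can correlate instead of humans grepping."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),  # ties logs to traces
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.WARNING)

log.warning("connection pool at 82%", extra={"trace_id": "abc123"})
```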
Focus on Augmentation, Not Replacement
The best AI implementations augment your existing processes:
- AI correlates alerts → Humans make decisions
- AI suggests experiments → Humans design safeguards
- AI predicts trends → Humans plan capacity
Keep humans in the loop for critical decisions. Use AI to enable better-informed, faster decision-making.
Measure and Iterate
Track MTTD and MTTR before and after AI implementation. Track false positive rates on predictive alerts. Track experiment coverage and vulnerability discovery. AI for incident response is an iterative practice, not a one-time implementation. You'll tune models, adjust thresholds, and refine correlations over time.
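If you don't have baseline numbers yet, they're cheap to compute. A crude sketch, assuming a hypothetical incident export with started/detected/resolved timestamps:

```python
from datetime import datetime

# Hypothetical export from your incident tracker; timestamps are illustrative.
incidents = [
    {"started": "2025-11-28 06:30", "detected": "2025-11-28 06:41",
     "resolved": "2025-11-28 07:32"},
    {"started": "2025-12-01 09:08", "detected": "2025-12-01 09:12",
     "resolved": "2025-12-01 11:15"},
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

mttd = sum(minutes_between(i["started"], i["detected"]) for i in incidents) / len(incidents)
mttr = sum(minutes_between(i["started"], i["resolved"]) for i in incidents) / len(incidents)
print(f"Baseline MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```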
Invest in Training
Your team needs to understand how AI tools work, their limitations, and when to trust them versus override them. An AI suggestion isn't a command; it's input for human decision-making. The teams getting the most value treat AI as a junior team member that needs coaching, not an oracle that provides perfect answers.
Closing
Matt: Do you have any final thoughts for teams considering AI and resilience?
Jason: Start small. Pick one area where you're drowning in noise, maybe alert correlation, maybe incident pattern analysis, and implement AI there. Measure the impact. Iterate. Don't wait for the perfect AI solution. Tools exist today that make meaningful improvements in MTTD and MTTR, and the teams using them now will have a significant advantage during next year's peak season. And if you want to go deeper on this topic, join us at Chaos Carnival 2026. We're bringing together practitioners who are actually implementing these techniques in production, not just talking about theory.
Matt: Thanks, Jason. This has been incredibly insightful.
Jason: Always a pleasure. See everyone at Chaos Carnival.
Call to Action
Want to dive deeper into AI-powered resilience? Join us at Chaos Carnival 2026 where we'll be exploring:
- Real-world implementations of AI in incident response
- Hands-on workshops for implementing AI-powered chaos engineering
Until then:
- Download our Peak Season IT Readiness Checklist
- Read our peak season blog series
- Start measuring your current MTTD and MTTR (you'll need baseline metrics when you implement AI for incident response)
