We recently chatted with Jason Doffing about his peak season experiences at Target. Jason does not represent or speak for Target. Our conversation focuses on his experiences while at Target preparing for and participating in peak season. All information discussed in this interview is already in the public domain.
What to do when Q4 arrives and you haven't followed the Peak Season Lifecycle
You missed Q1 planning. Q2 architecture design never happened. Q3 testing was half-hearted at best. Now it's late September or early October, and Black Friday is staring you down like an oncoming freight train. If you're reading this with a knot in your stomach, you're not alone. Every year, countless IT teams find themselves in exactly this position, facing peak season with systems that haven't been properly prepared, architectures that can't be changed, and a growing sense of dread about what's coming. Here's the uncomfortable truth: you can't build operational resilience in six weeks. But you can survive. This isn't about prevention anymore; it's about mitigation, damage control, and ensuring that when things go wrong (and they will), you're ready to respond fast enough to minimize the financial impact.
Matt: Jason, let's be honest. There are teams reading this who are already in panic mode. What's your first piece of advice?
Jason: Stop digging the hole deeper. Right now, your current systems are what you have. Accepting that reality is the first step. The "Too Late" phase is about mitigation, not prevention. We're focused on minimizing the blast radius of inevitable issues, not building perfect systems.
Matt: That sounds grim.
Jason: It's realistic. Failure is not an option, but degradation is. Understanding that distinction is critical. You need to design for graceful degradation, not flawless performance.
The Tactical Freeze: Stop Making Things Worse
Matt: Where do teams start when they're already behind?
Jason: Immediate code freeze. Today. Right now. Enforce a hard freeze except for critical security patches. No new features, period. I don't care if marketing wants one more personalization feature or sales wants a new checkout flow. Every code change introduces risk, and you're out of time to properly test that risk.
Matt: That's going to be a tough sell to the business.
Jason: Then make it clear: stability wins over new features during crunch time. Would they rather ship a new feature that might break checkout, or avoid losing $11 million per minute during Black Friday peak hours? That usually ends the conversation.
Matt: What about identifying vulnerabilities at this late stage?
Jason: You need SPOF triage: find your single points of failure. Use an 80/20 approach. You don't have time to fix everything, so identify your top 3 high-risk SPOFs. Is it your database? Payment gateway? Authentication service? For each one, plan manual workarounds immediately. This isn't about elegant solutions. If your payment gateway goes down, what's the backup? Do you have a secondary processor you can route to manually? If your database cluster fails, can you fail over to read replicas and operate in a degraded state? Document these procedures now, not at 2 AM on Cyber Monday.
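To make that manual workaround concrete, here is a minimal sketch of fallback routing between two payment processors, assuming your checkout already calls a single charge wrapper; the client objects, exception, and order fields are hypothetical stand-ins for whatever your providers actually expose:

```python
import logging

logger = logging.getLogger("payments")

class PaymentError(Exception):
    """Raised when a processor rejects or cannot complete a charge."""

def charge(order, primary_client, secondary_client):
    """Attempt the charge on the primary processor, then fall back.

    Both clients are hypothetical wrappers around your real payment
    providers; the point is that the fallback path exists and has been
    rehearsed before peak season, not invented at 2 AM.
    """
    try:
        return primary_client.charge(order.id, order.total_cents)
    except PaymentError as exc:
        logger.warning("Primary processor failed for %s: %s", order.id, exc)
        # Degraded state: route the charge to the secondary processor.
        return secondary_client.charge(order.id, order.total_cents)
```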
Matt: What about third-party vendors?
Jason: Review every vendor SLA immediately. Can your payment processor handle a surge? What about your shipping API? Your CDN? Have backup vendors on standby if possible. Call your account managers, verify their capacity guarantees, and get escalation contacts that work 24/7 during peak season. The gap between what leadership wants and what can be delivered often comes down to third-party limitations you didn't know about. Find out now, not when you're already experiencing website downtime.
Squeeze Every Drop: Optimizing What You Have
Matt: So the architecture is frozen. What can teams still optimize?
Jason: Think of this as "tape and baling wire" architecture: squeezing every drop of performance out of what you already have. You're buying time, not building long-term fixes.
Matt: Let's start with cloud infrastructure.
Jason: Max out your auto-scaling settings. Aggressively configure your cloud infrastructure, and I mean aggressively. It's cheaper to temporarily overprovision than to go down during a sale. Set your scaling triggers to 50-60% CPU instead of 70-80%. Configure rapid scaling: add 50% capacity at once, not 10% increments. Yes, you'll overspend on compute for a few weeks. But compare that cost to one hour of website downtime during peak traffic.
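For illustration, here is a rough sketch of that kind of aggressive configuration on AWS using boto3; the Auto Scaling group name, thresholds, and sizes are assumptions to adapt to your own environment, and other clouds have equivalent controls:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

GROUP = "web-frontend-asg"  # hypothetical Auto Scaling group name

# Step-scaling policy: when the alarm fires, add 50% capacity in one
# step instead of inching up 10% at a time.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName=GROUP,
    PolicyName="peak-season-scale-out-50pct",
    PolicyType="StepScaling",
    AdjustmentType="PercentChangeInCapacity",
    StepAdjustments=[{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 50}],
)

# Trigger the policy at roughly 55% average CPU rather than 70-80%.
cloudwatch.put_metric_alarm(
    AlarmName="peak-season-cpu-55pct",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": GROUP}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=55.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```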
Matt: What about database performance?
Jason: If you haven't been tuning your database all year, consult an expert right now for quick wins. Index optimization, query caching, and connection pool tuning aren't long-term solutions, but they can buy you 20-30% more capacity. That might be the difference between surviving and crashing. Implement read replicas if you don't have them. Even if you can't do it perfectly, getting read traffic off your primary database reduces load during peak season. This is about database resilience through any means necessary.
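As one concrete example of the connection-pool and read-replica ideas, here is a minimal SQLAlchemy sketch; the hostnames, pool sizes, and query are illustrative assumptions, not tuning advice for your workload:

```python
from sqlalchemy import create_engine, text

# Primary takes writes; the replica absorbs read traffic so the primary
# has headroom during the peak. Hostnames are placeholders.
primary = create_engine(
    "postgresql+psycopg2://app@db-primary/shop",
    pool_size=20,          # steady-state connections held open
    max_overflow=20,       # burst connections allowed beyond pool_size
    pool_timeout=5,        # fail fast instead of queueing forever
    pool_pre_ping=True,    # discard dead connections before traffic hits them
)
replica = create_engine(
    "postgresql+psycopg2://app@db-replica/shop",
    pool_size=40,
    max_overflow=40,
    pool_pre_ping=True,
)

def fetch_product(product_id: int):
    # Reads go to the replica; slightly stale data beats a saturated primary.
    with replica.connect() as conn:
        result = conn.execute(
            text("SELECT id, name, price FROM products WHERE id = :id"),
            {"id": product_id},
        )
        return result.fetchone()
```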
Matt: Caching seems obvious, but how aggressive should teams be?
Jason: Extremely aggressive. Cache everything you can: static assets, obviously, but also API responses, product catalog data, even parts of dynamic content if possible. Reduce database load by any means necessary. A slightly stale product description is better than a completely down website.
Consider implementing a CDN if you don't have one, or maxing out your CDN settings if you do. Every request that doesn't hit your primary servers is a request your infrastructure doesn't have to handle.
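A minimal sketch of that cache-first pattern, assuming Redis via the redis-py client; the key format and TTL are illustrative, and the loader function is whatever you already use to fetch catalog data:

```python
import json
import redis

cache = redis.Redis(host="cache", port=6379, decode_responses=True)

def get_product(product_id: int, load_from_db):
    """Serve catalog data from cache first; fall back to the database.

    A slightly stale description is better than a down site, so a
    generous TTL is acceptable during peak season.
    """
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    product = load_from_db(product_id)
    # Cache for 10 minutes; tune to how stale your catalog can afford to be.
    cache.setex(key, 600, json.dumps(product))
    return product
```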
Test Your Response, Not Just Your Systems
Matt: At this late stage, what kind of testing should teams prioritize?
Jason: Rigorous disaster recovery drills. But here's the key: don't just test the system, test your team's ability to execute disaster recovery procedures quickly and calmly under pressure. Time every drill.
Matt: What does that look like in practice?
Jason: Run a simulated outage on a Saturday morning. Fail over to your backup database. Execute your incident response procedures. Time how long each step takes. Who gets called? How long until they respond? How long to identify the issue? How long to implement the fix?
If you haven't validated your systems and your team under stress, both will break when you need them most. Most teams discover during these drills that their disaster recovery documentation is outdated, contact lists are wrong, or critical people don't actually know their role. Fix that now, not during Black Friday.
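If it helps to make "time every drill" concrete, here is a small illustrative drill timer in Python; the step names are placeholders for your own procedures:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def drill_step(name: str):
    """Record how long each step of the drill takes."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name] = time.monotonic() - start

# During the Saturday-morning drill, wrap each step:
with drill_step("page on-call engineer"):
    ...  # page and wait for acknowledgement
with drill_step("fail over to replica"):
    ...  # run your actual failover procedure
with drill_step("verify checkout works"):
    ...  # smoke-test the critical path

for step, seconds in timings.items():
    print(f"{step}: {seconds:.0f}s")
```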
Matt: What about incident response planning?
Jason: Create a single-page "cheat sheet" for core incidents. Not a 50-page runbook, a one-page quick reference. Database failure? Here's the failover procedure. Payment gateway down? Here's the backup processor activation steps. CDN issues? Here's the emergency contact.
Who calls who? What's the first hour plan? Who has authority to make decisions without waiting for approvals? Answer these questions now with clear incident response planning protocols.
Matt: And communication with customers?
Jason: Prepare now. Create status page templates. Write social media response templates. Draft the email that goes out if checkout breaks. Have these ready to deploy in minutes, not hours. Managing customer expectations during an outage is just as important as fixing the technical issue. Silence is worse than transparency.
War Room Operations and Real-Time Monitoring
Matt: Let's talk about the actual peak season operations. What's different in emergency mode?
Jason: You're living and dying by your observability tools. Implement real-time monitoring tools immediately if you don't have them. Every critical metric needs to be visible on a single dashboard, focusing on the health of your defined SPOFs.
Matt: What metrics matter most?
Jason: Focus on business metrics, not just technical metrics. Orders per minute. Checkout completion rate. Payment processing latency. Revenue per hour. These tell you when something's actually impacting customers, not just when CPU is elevated.
Set alerts that trigger on business impact. If orders per minute drops 20%, that's your red alert. If checkout completion rate drops from 85% to 70%, something's breaking. Don't wait for customers to tell you; your real-time monitoring tools should tell you first.
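A minimal sketch of what a business-impact check might look like, assuming you already have a paging hook and a baseline to compare against; the thresholds mirror the examples above and are not universal values:

```python
def check_business_metrics(orders_per_min, baseline_opm,
                           checkout_completion_rate, alert):
    """Alert on customer impact, not just CPU.

    `alert` is whatever paging hook you already have (PagerDuty, Slack,
    etc.); thresholds should be tuned to your own baselines.
    """
    if baseline_opm and orders_per_min < 0.8 * baseline_opm:
        alert(f"Orders/min down to {orders_per_min} (baseline {baseline_opm})")
    if checkout_completion_rate < 0.70:
        alert(f"Checkout completion at {checkout_completion_rate:.0%}")
```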
Matt: What about proactive issue mitigation during the actual event?
Jason: Dedicate specific team members to "watch the glass": literally watching dashboards and acting on the first sign of a potential problem, not waiting for an alert to fire. This is what we did in the Target war room. You need people who understand the systems watching for patterns that might indicate an emerging issue.
If you see payment processing latency creeping from 200ms to 500ms, that's your warning. Don't wait for it to hit timeout thresholds; investigate immediately. Maybe you need to scale the payment service. Maybe a database query is degrading. Catch it early through proactive issue mitigation.
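One way to catch that kind of creep programmatically is to compare a short latency window against a longer baseline; here is an illustrative sketch, with the window sizes and the 2x ratio as assumptions:

```python
from collections import deque

class LatencyWatch:
    """Flag a sustained creep in p95 latency before it hits timeouts."""

    def __init__(self, short=5, long=60):
        self.short = deque(maxlen=short)  # e.g. last 5 one-minute samples
        self.long = deque(maxlen=long)    # e.g. last 60 one-minute samples

    def observe(self, p95_ms: float) -> bool:
        """Record a sample; return True if recent latency has roughly doubled."""
        self.short.append(p95_ms)
        self.long.append(p95_ms)
        if len(self.long) < self.long.maxlen:
            return False  # not enough history to judge yet
        recent = sum(self.short) / len(self.short)
        baseline = sum(self.long) / len(self.long)
        return recent > 2 * baseline  # e.g. a 200ms baseline creeping past 400ms
```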
Matt: How do you structure the war room team when you're already short on resources?
Jason: You need clear roles and 24/7 coverage during peak days. Engineering leaders who can make architecture decisions quickly. Operations experts (with access) who can scale infrastructure at the first sign of trouble. Database experts who can tune queries on the fly. Business representatives who can communicate with stakeholders and senior leadership.
Everyone needs pre-approved authority to take immediate action: rolling back deployments, scaling infrastructure, failing over to backup systems, all without waiting for approvals. Minutes matter. If you can't see a problem instantly and fix it immediately, you can't minimize the damage.
Operating in the Dark
Matt: You mentioned earlier that teams often don't realize what they already know about their systems. What do you mean?
Jason: Most teams are operating in the dark about their own infrastructure. You have monitoring data, incident logs, performance metrics, but you're not synthesizing it into actionable intelligence.
Look at your last three months of incidents. What patterns do you see? Which services are fragile? Which APIs have the longest response times under load? Which databases are closest to capacity? This information is in your logs right now. You just haven't looked at it systematically.
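As a sketch of that kind of systematic look, assuming you can export incidents to a CSV with `service` and `duration_minutes` columns (the filename and column names are assumptions):

```python
import csv
from collections import Counter, defaultdict

incident_counts = Counter()
downtime_minutes = defaultdict(float)

with open("incidents_last_90_days.csv", newline="") as f:
    for row in csv.DictReader(f):
        service = row["service"]
        incident_counts[service] += 1
        downtime_minutes[service] += float(row["duration_minutes"])

# The services at the top of this list are where your limited
# optimization and monitoring time should go.
for service, count in incident_counts.most_common(5):
    print(f"{service}: {count} incidents, {downtime_minutes[service]:.0f} min impacted")
```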
Matt: How should teams use that information in an emergency preparation scenario?
Jason: That's where you focus your limited time. If your order service has had six incidents in three months, that's your vulnerable service. If your product catalog API response time spikes every time you run a promotion, that's your bottleneck. Prioritize your optimization efforts and monitoring focus on the systems that are already showing weakness or fragility.
Traditional performance testing fails to prepare you for peak season because it tests systems under perfect operational conditions. Your production incident data tells you where your real weaknesses are, so use it.
The Morning After: Post Peak Season Recovery
Matt: Let's say a team survives peak season using this emergency playbook. What happens next?
Jason: Within one week of your last major traffic event, schedule the post-mortems. Not in January; do it immediately, while the pain is fresh and the data is recent. This is your opportunity to build the business case for doing it right next year.
Document everything: every incident, every manual intervention, every moment of panic, every dollar lost to degraded performance. Quantify the cost of being unprepared. Calculate the hours of engineering time spent firefighting instead of building features.
Matt: How does that translate to next year's planning?
Jason: The panic you felt this year is the justification you need for a proper Peak Season Lifecycle budget next year. When you go into Q1 planning, you're not asking for a theoretical investment in operational resilience; you're showing the concrete cost of not being prepared for peak season.
Did you manually fail over databases three times during Black Friday? That's why you need automated failover. Did you lose two hours of peak traffic because auto-scaling was configured wrong? That's why you need proper load testing in Q3. Did your monitoring tools fail to catch issues before customers complained? That's why you need better observability.
Turn this year's crisis into next year's roadmap. The PSLC is a loop, not a linear path. Even if you entered the loop at the worst possible moment this year, you can start the proper cycle for next year.
Final Reality Check
Matt: Any final advice for teams reading this in panic mode?
Jason: Accept that peak season won't be perfect. Some things will break. Some performance will degrade. Your job is to minimize the impact and respond faster than your competitors.
Focus on the controllables: freeze code, optimize what you have, test your response procedures, watch your systems like a hawk, and have backups for your backups. Stability over features. Response speed and recovery over root cause analysis. Customer communication over silent failure.
And most importantly, don't repeat this next year. The moment peak season ends, start your Q1 planning for the Peak Season Lifecycle.
Matt: Thanks, Jason. For teams who survived this conversation, what's the first thing they should do?
Jason: Implement that code freeze. Right now. Before you finish reading this article. Everything else flows from accepting that your architecture is set, and your job is protecting it, not improving it. You have almost no time left to prepare for the most critical period of your year. Every minute counts.
Ready to survive peak season?
Download our Peak Season IT Readiness Checklist, a comprehensive tactical guide covering immediate actions you can take to maximize your chances of success even when you're starting late. Don't repeat this panic next year. When Q1 arrives, revisit our first article: "Peak Season Isn't a Holiday Event, It's an Ultramarathon with No Finish Line" to implement the full Peak Season Lifecycle and build true operational resilience.
