Introduction: What You’ll Learn
This simulation guides you through managing a high-pressure incident call with multiple potential root causes. You will run a calm, efficient response by coordinating the team's efforts and taking decisive action to resolve the incident.
You’ll practice:
- Leading a structured incident call
- Facilitating clear communication among team members
- Prioritizing actions based on available data
- Keeping the team focused on resolution
Step-by-Step Simulation
Scene 1: Initiating the Incident Call
Facilitator: "Hi team, thanks for jumping on this call so quickly. We’re dealing with an incident affecting user logins and payment processing. Let’s focus on gathering details and narrowing down potential causes. Please keep your observations brief. I'll start."
Facilitator (as a lead engineer): "We’re seeing higher error rates on the login service and slower response times in the payment gateway. Initial logs point to the authentication server and a recent deployment as possible causes. DevOps, what are you seeing?"
Scene 2: Gathering Initial Observations
DevOps Lead (Alex): "We noticed a spike in network activity just before the incident began. This might be linked to internal changes or external interference. We’re checking recent config changes and network logs."
Facilitator: "Thanks, Alex. Prioritize the config changes and logs around the latest deployment. Security team, anything unusual on your end?"
Security Lead (Priya): "We’re analyzing the traffic patterns for potential DDoS activity. There’s an uptick in requests from some suspicious IPs. We’ve ramped up monitoring to check this out."
Facilitator: "Great, keep us posted on that. Development team, any recent code changes that might be relevant?"
Developer (Sara): "We had a merge affecting the auth service yesterday. It passed staging, but I’ll double-check for any issues we might’ve missed."
Facilitator: "Please coordinate with QA to retest the auth flow in staging. QA team, what are you seeing?"
QA Lead (Leo): "We’re noticing some intermittent failures in the login tests that weren’t there before the last rollout. This might help pinpoint what’s wrong."
Scene 3: Coordinating the Response
Facilitator: "Alright, we’ve got a few angles to tackle. Alex, focus on the network and deployment logs. Priya, keep digging into the traffic patterns. Sara, review the auth merge and work with Leo to retest the login flow in staging. Let’s regroup in 10 minutes for updates."
(Time passes, team reconvenes.)
Facilitator: "Thanks for the quick turnaround, everyone. Let’s get some updates. Alex, what’s the latest from the network side?"
DevOps Lead (Alex): "We found a misconfigured load balancer causing some of the latency. We’re applying a fix now."
Facilitator: "Good catch. Priya, anything confirmed on the security front?"
Security Lead (Priya): "The requests from the flagged IPs turned out to be legitimate traffic, so there’s no immediate threat from what we can see. We’ll keep monitoring, though."
Facilitator: "Thanks. Sara, what did you find from the code review?"
Developer (Sara): "We discovered a bug with the auth token expiration. We’re rolling back that change to see if it clears up the login errors."
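Facilitator note: the scenario does not say what the token expiration bug actually was. As a purely hypothetical illustration of this class of bug, the sketch below shows a unit mismatch (seconds compared against milliseconds) that would make every fresh token look expired. The names `TOKEN_TTL_SECONDS`, `issue_token`, `is_expired_buggy`, and `is_expired` are invented for this example and do not come from the scenario.

```python
TOKEN_TTL_SECONDS = 3600  # hypothetical token lifetime, not from the scenario


def issue_token(now: float) -> dict:
    """Return a token record with an absolute expiry timestamp in seconds."""
    return {"issued_at": now, "expires_at": now + TOKEN_TTL_SECONDS}


def is_expired_buggy(token: dict, now: float) -> bool:
    # Hypothetical bug: 'now' is scaled to milliseconds while expires_at
    # is in seconds, so a freshly issued token already looks expired.
    return now * 1000 >= token["expires_at"]


def is_expired(token: dict, now: float) -> bool:
    # Correct comparison: both values are in seconds.
    return now >= token["expires_at"]


now = 1_700_000_000.0  # fixed example timestamp (seconds since the epoch)
token = issue_token(now)
print(is_expired_buggy(token, now))  # fresh token wrongly rejected
print(is_expired(token, now))        # fresh token accepted
```

A bug like this would pass a quick staging check if tests only exercised already-expired tokens, which is one reason a rollback is a reasonable first move while the root cause is confirmed.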
Scene 4: Resolving and Recap
Facilitator: "Great work, everyone. Let’s get the system stable and verify the fixes. Alex, keep an eye on the network after the load balancer fix. Sara, test the login service post-rollback."
(Time passes, systems begin to stabilize.)
Facilitator: "Looks like the fixes are working. Login errors are down, and payment processing is back to normal. Let’s get ready for a post-incident review."
Facilitator: "Quick recap: We had a misconfigured load balancer and a bug in the auth token expiration logic. The load balancer fix is in place and the auth change has been rolled back. Let’s document what we’ve learned and make sure we prevent similar issues in the future."
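Facilitator note: one way to keep a recap like this accurate is to track each line of investigation as it is assigned and updated. The following is a minimal sketch, assuming a simple in-memory log. The `Workstream` class, the status labels, and the helper names are hypothetical; the owners and findings are taken from the scenes above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workstream:
    owner: str
    hypothesis: str
    # Hypothetical status labels: investigating, ruled_out, mitigated
    status: str = "investigating"
    notes: List[str] = field(default_factory=list)

    def update(self, status: str, note: str) -> None:
        """Record the latest finding and move the workstream to a new status."""
        self.status = status
        self.notes.append(note)


def open_items(workstreams: List[Workstream]) -> List[Workstream]:
    """Workstreams the facilitator still needs to ask about at the next regroup."""
    return [ws for ws in workstreams if ws.status == "investigating"]


def recap(workstreams: List[Workstream]) -> str:
    """One line per workstream, suitable for pasting into the incident notes."""
    return "\n".join(
        f"{ws.owner}: {ws.hypothesis} [{ws.status}] {ws.notes[-1] if ws.notes else ''}".rstrip()
        for ws in workstreams
    )


# Assignments from Scene 3, then the updates reported after the regroup.
network = Workstream("Alex", "network or config change")
security = Workstream("Priya", "possible DDoS")
auth = Workstream("Sara", "recent auth service merge")

network.update("mitigated", "misconfigured load balancer, fix applied")
security.update("ruled_out", "traffic looks legitimate")
auth.update("mitigated", "auth token expiration bug, change rolled back")

streams = [network, security, auth]
print(recap(streams))
print(len(open_items(streams)))
```

Anything still returned by `open_items` at the end of the call is a natural follow-up action for the post-incident review.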
Mini Roleplay Challenges
Challenge 1: A team member is unsure about their next steps.
- Best Response: "Let’s clarify your priorities. Focus on verifying your area’s stability first."
Challenge 2: Multiple team members start diagnosing the same issue.
- Best Response: "Let’s avoid overlap. Alex, handle the network aspect, and Sara, focus on the code review."
Challenge 3: Someone introduces a new potential root cause late in the call.
- Best Response: "Thanks, let’s add it to our post-incident review to explore further once we stabilize."
Optional Curveball Mode
- A key team member is unavailable.
- A suspected root cause turns out to be incorrect.
- Communication starts breaking down mid-call.
Practice managing these scenarios while maintaining control of the call.
Reflection Checklist
Incident Management
- Did I keep the team focused and organized?
- Did I prioritize actions effectively?
- Did I facilitate clear communication?
Resolution Focus
- Did we identify and address root causes promptly?
- Did I ensure follow-up actions were clear?
Leadership & Tone
- Was I calm and decisive under pressure?
- Did I create a collaborative environment?
Common Mistakes to Avoid
- Allowing the call to become chaotic
- Overlooking team member input
- Failing to document findings and follow-up actions