Episodes
-
Join Robert Ross and special guest Ricardo Castro in a dynamic discussion that dives into the world of DevOps and the challenges and career progression of Site Reliability Engineers (SREs). They highlight the ambiguity surrounding the SRE role across different organizations and the difficulty in defining SRE levels. The importance of both technical and communication skills is emphasized, and the hosts address the difficulties in measuring the contribution of SREs, particularly in managing incidents.
-
In this episode, we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.
-
Engineers have been managing incidents for about as long as they've been building software, but it's only in the past few years that incident management has become a primary focus for software teams. Today I'm talking to Shannon Schulte, an engineering manager of incident response, about practical ways to implement incident management.
-
When it comes to resolving an incident there are a number of metrics that can be misleading. Resolution time, for example, can fluctuate wildly. However, there’s one that we have a significant amount of influence over.
Today, I’m talking to Brent Chapman, Founder at Great Circle, about how engineering teams should ditch metrics like MTTR and instead focus on what we can control: assembly time.
Brent's Information:
Website: https://greatcircle.com/
LinkedIn: https://www.linkedin.com/in/brentchapman/
Twitter: https://twitter.com/brent_chapman
WW2 plane improvements
Book - The Checklist Manifesto
https://slack.com/events/resolve-incidents-faster-in-slack
https://slack.com/blog/collaboration/engineers-netflix-pagerduty-slack
https://slack.com/resources/using-slack/the-modern-incident-response
https://slack.com/resources/using-slack/slack-for-incident-management
https://slack.com/blog/transformation/incident-management-slack
https://slack.com/intl/en-in/events/minimize-incident-response-times
-
After the dust of an incident settles, it's normal for us to want to move on and get back to less stressful work. But doing so would skip an essential part of the incident management process, the Retro.
Today, I’m talking to Chad Todd, Site Reliability Manager at CrowdStrike, about the importance of retros to avoid what he calls “incident amnesia”.
-
We all understand that incidents cause a loss in revenue, but the camouflaged costs of incidents can cause more damage than the immediate impact to revenue. What does the itemized receipt of an incident really look like?
In this episode of the Better Incidents podcast, we talk with MRZ, Sr. Director of Production Engineering at Cowbell, about the hidden costs of incidents and a concept he uses called Mean Time to Clue, or my preferred version, Mean Time to WTF?
-
Your incident response process helps you more quickly resolve incidents … but is that truly the aim of your entire incident management program? In this episode of the Better Incidents Podcast, we talk to Brian O’Hearn, senior manager of incident management at Zendesk, about how to center your incident management program — and how you evaluate it — around decreasing customer pain.
-
Incidents impact a variety of teams who all want to be in the loop, which can be … distracting. In the throes of an incident, how do you decide how often, when, and with whom to communicate? In this episode of the Better Incidents Podcast, we talk to Joan O’Callaghan, Udemy’s senior manager of monitoring and observability, about how to strike the right balance between giving people enough information while keeping the focus on remediation.
-
One of the first things we have to do at the beginning of an incident is get the right people in the room. But how do you know who to flag and where to find them? In this episode of the Better Incidents Podcast, we talk to Brian Liles, VP and principal engineer at VMware, about the importance of a service catalog and how to take the first steps toward formalizing one.