Alerting, Incident Response, and the SDLC

Episodes

Navigating the SRE Landscape w/ Ricardo Castro
27 Oct 2023 · Better Incidents Podcast
Join Robert Ross and special guest Ricardo Castro in a dynamic discussion that dives into the world of DevOps and the challenges and career progression of Site Reliability Engineers (SREs). They highlight the ambiguity surrounding the SRE role across different organizations and the difficulty in defining SRE levels. The importance of both technical and communication skills is emphasized, and the hosts address the difficulties in measuring the contribution of SREs, particularly in managing incidents.
Alerting, Incident Response, and the SDLC
5 Oct 2023 · Better Incidents Podcast
In this episode we chat with veteran cloud architect Masaru Hoshi about the challenges of alert fatigue, the importance of effective alerting systems, and fostering ownership in software teams. Masaru shares insights from his 30-year career, emphasizing the need for balance, trust, and collaboration in incident response.
Practical Ways to Implement Incident Management with Shannon Schulte
18 Aug 2023 · Better Incidents Podcast
Engineers have been managing incidents for about as long as they've been building software, but it's only in the past few years that incident management has become a primary focus for software teams. Today I'm talking to Shannon Schulte, an engineering manager of incident response, about practical ways to implement incident management.
Focus on Assembly Time with Great Circle's Brent Chapman
18 May 2023 · Better Incidents Podcast
When it comes to resolving an incident there are a number of metrics that can be misleading. Resolution time, for example, can fluctuate wildly. However, there’s one that we have a significant amount of influence over.
Today, I’m talking to Brent Chapman, Founder at Great Circle, about why engineering teams should ditch metrics like MTTR and instead focus on what they can control: assembly time.
Brent's Information:
Website: https://greatcircle.com/
LinkedIn: https://www.linkedin.com/in/brentchapman/
Twitter: https://twitter.com/brent_chapman
WW2 plane improvements
Book - The Checklist Manifesto
https://slack.com/events/resolve-incidents-faster-in-slack
https://slack.com/blog/collaboration/engineers-netflix-pagerduty-slack
https://slack.com/resources/using-slack/the-modern-incident-response
https://slack.com/resources/using-slack/slack-for-incident-management
https://slack.com/blog/transformation/incident-management-slack
https://slack.com/intl/en-in/events/minimize-incident-response-times
The Importance of Retros with CrowdStrike's Chad Todd
9 May 2023 · Better Incidents Podcast
After the dust of an incident settles, it's normal for us to want to move on and get back to less stressful work. But doing so would skip an essential part of the incident management process: the retro.
Today, I’m talking to Chad Todd, Site Reliability Manager at CrowdStrike, about the importance of retros to avoid what he calls “incident amnesia”.
The hidden costs of incident management with Cowbell's MRZ
10 Apr 2023 · Better Incidents Podcast
We all understand that incidents cause a loss in revenue, but the camouflaged costs of incidents can cause more damage than the immediate impact to revenue. What does the itemized receipt of an incident really look like?
In this episode of the Better Incidents podcast, we talk with MRZ, Sr. Director of Production Engineering at Cowbell, about the hidden costs of incidents and a concept he uses called Mean Time to Clue, or my preferred version, Mean Time to WTF?
Customer empathy in incident management with Zendesk’s Brian O’Hearn
18 Nov 2022 · Better Incidents Podcast
Your incident response process helps you more quickly resolve incidents … but is that truly the aim of your entire incident management program? In this episode of the Better Incidents Podcast, we talk to Brian O’Hearn, senior manager of incident management at Zendesk, about how to center your incident management program — and how you evaluate it — around decreasing customer pain.
Communicating during an incident with Udemy's Joan O'Callaghan
24 Oct 2022 · Better Incidents Podcast
Incidents impact a variety of teams who all want to be in the loop, which can be … distracting. In the throes of an incident, how do you decide how often, when, and with whom to communicate? In this episode of the Better Incidents Podcast, we talk to Joan O’Callaghan, Udemy’s senior manager of monitoring and observability, about how to strike the right balance between giving people enough information while keeping the focus on remediation.
Get the right people in the room with VMware's Bryan Liles
20 Oct 2022 · Better Incidents Podcast
One of the first things we have to do at the beginning of an incident is get the right people in the room. But how do you know who to flag and where to find them? In this episode of the Better Incidents Podcast, we talk to Bryan Liles, VP and principal engineer for VMware, about the importance of a service catalog and how to take the first steps toward formalizing one.
