Episodes
-
Hans Kristian is a Platform Engineer for Nais, NAV's Kubernetes platform hosting Norway's welfare services. With 10 years on Kubernetes, 2000 apps and 1000 developers across more than 100 teams, there was a need to make OpenTelemetry adoption as easy as possible.
Tune in as we hear from Hans Kristian, who is also a CNCF Ambassador and hosts Cloud Native Day Bergen, about why OpenTelemetry is chosen by the public sector, why it took much longer to adopt, which challenges they had scaling the observability backend and how they are tackling the "noisy data problem".
Links we discussed in the episode
Follow Hans Kristian on LinkedIn: https://www.linkedin.com/in/hansflaatten/
From 0 to 100 OTel Blog: https://nais.io/blog/posts/otel-from-0-to-100/?foo=bar
Cloud Native Day Bergen: https://2024.cloudnativebergen.dev/
Public Money, Public Code. How we open source everything we do!: https://m.youtube.com/watch?v=4v05Huy2mlw&pp=ygUkT3BlbiBzb3VyY2Ugb3BlbiBnb3Zlcm5tZW50IGZsYWF0dGVu
State of Platform Engineering in Norway: https://m.youtube.com/watch?v=3WFZhETlS9s&pp=ygUYc3RhdGUgb2YgcGxhdGZvcm0gbm9yd2F5 -
Has one of the decision makers in your organization decided that you have to go "all in on technology X" because they saw a great presentation at a conference or got a great sales pitch from a vendor? If that is the case then this episode is for you and you should forward it to those decision makers.
Sebastian Vietz, Director of Reliability Engineering and Host of the Reliability Enablers Podcast, shares his thoughts on considerations when picking a technology like Serverless. We discuss the importance of knowing limits, best fit architectural patterns and things that should influence your technology decisions!
Being aware of cold starts, a 20000 concurrent request limit or 512 MB being an ideal size for Lambda are just some of the things we can all learn from Sebastian.
Additional links we discussed:
Sebastian's LinkedIn: https://www.linkedin.com/in/sebastianvietz/
Reliability Podcast: https://podnews.net/podcast/ibe8k
More things on serverless: https://serverlessland.com/ -
Having your code run on more than 6 million systems - many of them business critical - is really exciting for Marco and Wolfgang, members of the Dynatrace OneAgent Java team. Their code powers auto-instrumentation and collection of all observability signals of Java-based applications running on every possible stack: containers in k8s, serverless, VMs, your workstation or even the mainframe.
Tune in as we sit down with Marco and Wolfgang to learn what it means to continuously innovate on agent-based instrumentation with 160+ other engineers across the globe who also focus on OneAgent. They share insights on how they develop their observability code, how they continuously test across all supported environments, what the processes at Dynatrace look like to avoid situations like the recent CrowdStrike outage and how they integrate and collaborate with other communities and tools such as OpenTelemetry!
Things we discussed during the episode
Dynatrace OneAgent: https://www.dynatrace.com/platform/oneagent/
Dynatrace for Java: https://www.dynatrace.com/technologies/java-monitoring/
OpenTelemetry and Dynatrace: https://docs.dynatrace.com/docs/extend-dynatrace/opentelemetry
Jobs at Dynatrace: https://careers.dynatrace.com/ -
When thousands of systems show a blue screen - which ones do you fix first to quickly bring up your most critical systems? For that you need to know which systems are impacted, which mission-critical applications run on them, and which dependent systems are also impacted by something like the recent CrowdStrike incident!
We have invited Josh Wood, Principal Solutions Engineer at Dynatrace, who was one of the first responders helping organizations leverage observability data to identify which systems to fix first to bring critical apps such as ATMs, Self-Service Terminals, POS (Point of Sale), ... back up again quickly.
In this special episode Josh walks us through the technical details of the CrowdStrike BSOD (Blue Screen of Death), what caused it, how to leverage observability to get a prioritized list of systems to fix first and what organizations can do to prevent software-impacting issues in the future.
Here are the links we discussed in the episode:
Josh on LinkedIn: https://www.linkedin.com/in/joshuadwood/
Josh's blog on CrowdStrike BSOD: https://www.dynatrace.com/news/blog/crowdstrike-bsod-quickly-find-machines-impacted-by-the-crowdstrike-issue/
CrowdStrike Incident Takeaway Blog: https://www.dynatrace.com/news/blog/crowdstrike-incident-revisiting-vendor-quality-control/ -
WebAssembly runs in every browser, provides secure and fast code execution from any language, runs across multiple platforms and has a very small binary footprint. It's adopted by several of the big web-based SaaS solutions we use on a daily basis.
But where did WebAssembly come from? What problems does it try to solve? Has it reached critical adoption? And how about observing code that gets executed in browsers, servers or embedded devices?
To answer all those questions we invited Matt Butcher, CEO at Fermyon, who explains the history, current implementation status, limitations and opportunities that WebAssembly provides.
Further links we discussed:
LinkedIn Profile: https://www.linkedin.com/in/mattbutcher/
Fermyon Dev Website: https://developer.fermyon.com/
The New Stack Blog with Matt: https://thenewstack.io/webassembly-and-kubernetes-go-better-together-matt-butcher/ -
"Because I don't want software to go down every single day in my next gig!" is what drives the motivation of Ash Patel, Reliability Advocate and Podcast host of SREpath, to talk about and educate IT professionals on the importance of building and operating reliable systems.
For 15 years Ash used to be Director of Operations at a private health service organization. He has experienced that patients couldn't get the treatment they expected due to unreliable software he was responsible for.
In our conversation Ash talks about how he had to close the knowledge gap on technology but also solve the problem by having engineers understand the pain and the requirements of their end users. One way to educate more engineers is through his podcast called SREpath, where observability has become a hot topic recently. Tune in, hear the memorable stories from his guests from CapitalOne, IKEA and SquaredUp, and let's move towards a world where software is reliable by default.
Links as discussed today:
Ash on LinkedIn: https://www.linkedin.com/in/ash-patel-srepath/
SREpath Podcast: https://www.srepath.com/podcast/
Clearing Delusions in Observability: https://read.srepath.com/p/30-clearing-delusions-in-observability-2af
Boosting your observability data's usability: https://read.srepath.com/p/35-boosting-your-observability-datas-3f4
How to Enable Observability for Success: https://read.srepath.com/p/40-how-to-enable-observability-for -
"Meet your users where they are!" - For Platform Engineering teams that means understanding the current way your engineers work, understanding their pain, and providing a solution that doesn't force them to change their behavior but provides a 10x efficiency improvement. That's not easy to achieve, but it is what we discussed with Abby Bangser in our latest episode.
Abby is a Team Topologies Advocate, has spent years at Thoughtworks helping organizations transform through Delivery Platforms and is now a Lead at the CNCF Platform Working Group. Tune in and hear our discussions on why Platform Engineering is nothing new, how to keep Platform Engineering teams from becoming your next bottleneck and silo, why platforms need to have more than one interface and why the purpose of Platform Engineering should be to bring good Developer Experience to all engineers.
Here are all the links we discussed during this episode:
Platform Engineering Maturity Model: https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/
CNCF Platform Working Group: https://tag-app-delivery.cncf.io/wgs/platforms/
KubeCon 2024 Talk: https://colocatedeventseu2024.sched.com/event/1YFdf/sometimes-lipstick-is-exactly-what-a-pig-needs-abby-bangser-syntasso-whitney-lee-vmware
GitHub Issue for Questionnaire: https://github.com/cncf/tag-app-delivery/issues/635
Kratix: https://www.kratix.io/
Abby's LinkedIn: https://www.linkedin.com/in/abbybangser/
Abby's Events: https://www.paintedwavelimited.com/events -
Twenty years ago, requesting more CPU for your database took 6 months of planning. Now it takes the execution of a Terraform script. What has stayed the same all those years is Almudena Vivanco's passion for performance engineering to keep systems optimized, ensuring that they are available, scalable and resilient even during spike events such as the upcoming Euro Cup or any holiday specials.
Tune in and hear from Almudena, who is currently working for SCRM Lidl, on how moving to the cloud gave new justification to performance engineering. She explains the importance of connecting business with service level objectives and gives insights on how Lidl makes sure to sell 50000 pieces of pork without breaking the cloud bank.
Here are the additional links we discussed:
Slides from Barcelona Meetup: https://docs.google.com/presentation/d/1h83V4gUyqAmIWeAAtKb4BcRvuJV-XirLk-9Xq077nbw
Video from TestCon: https://www.youtube.com/watch?v=rIP_G-YBy04
LinkedIn: https://www.linkedin.com/in/almudenavivanco/ -
Making observability available to everyone! This noble goal needs superhero powers in an IT world where there is so much chatter and confusion about what observability is, how to sell the value add besides a glorified troubleshooting tool and how OpenTelemetry will disrupt the landscape.
In our latest episode we have Rainer Schuppe, Observability Veteran (more than 20 years in the space), who has worked for the majority of the observability vendors. He is sharing his observability expertise through workshops in his home town of Mallorca, teaching organizations everything from basic to strategic observability implementations.
Tune in and learn about the typical adoption and maturity path of observability within enterprises: from fixing a problem at hand, to justifying the cost to keep it, to enabling companies to become information-driven digital organizations! Also check out his OpenTelemetry journey in his blog post series.
Here are the links we discussed today:
Observability Heroes Website: https://observability-heroes.com/
Observability Heroes Community: https://observability.mn.co/
Cloud Native Mallorca Meetup: https://www.meetup.com/cloud-native-mallorca/
OpenTelemetry: https://opentelemetry.io/
Rainer on LinkedIn: https://www.linkedin.com/in/rainerschuppe/ -
eBPF is a kernel technology enabling high-performance, low overhead tools for networking, security and observability. In simpler terms: eBPF makes the kernel programmable!
Tune in to this episode whether you have never heard about eBPF, are using eBPF-based tools such as bcc, Cilium, Falco, Tetragon, Inspektor Gadget ... or are developing your own eBPF programs!
Liz Rice, Chief Open Source Officer at Isovalent, kicks this episode off with a brief introduction of eBPF, explains how it works, which use cases it has enabled and why eBPF can truly give you super powers!
In our conversation we dive deeper into the performance aspects of eBPF: how and why tools like Cilium outperform classical network load balancers, how performance engineers can use it and how the kernel internally handles eBPF executions.
We discussed a lot of follow up material - here are all the relevant links:
Liz's slide deck on "Unleashing the kernel with eBPF": https://speakerdeck.com/lizrice/unleashing-the-kernel-with-ebpf
eBPF Documentary on YouTube: https://www.youtube.com/watch?v=Wb_vD3XZYOA
Learning eBPF GitHub repo accompanying her book: https://github.com/lizrice/learning-ebpf
eBPF website: https://ebpf.io
Liz on LinkedIn: https://www.linkedin.com/in/lizrice/ -
Use Things you Understand! Learn the fundamentals to understand the layers of abstraction! And remember that we don't live in a world with unlimited resources!
This is the advice from our recent conversation with Ernst Ambichl, Chief Product Architect at Dynatrace, who started his performance career in the late 80s building the first load testing tools for databases, which later became one of the most successful performance engineering tools in the market.
Tune in and learn how Ernst has evolved from being a performance engineer to becoming an advocate for "Designing and Architecting for Performance". Ernst explains how important good upfront analysis of performance requirements and characteristics of the underlying infrastructure is, how to define baselines and how to constantly evaluate your changes against your goals.
On a personal note: I want to say THANK YOU Ernst for being one of my personal mentors over the past 20+ years. You inspired me with your passion for performance and building resilient systems. -
SREs (Site Reliability Engineers) have varying roles across different organizations: from codifying your infrastructure, handling high priority incidents, automating resiliency and ensuring proper observability to defining SLOs or getting rid of alert fatigue. What an SRE team must not be is a SWAT team - or - as Dana Harrison, Staff SRE at Telus, puts it: "You don't want to be the fire brigade along the DevOps Infinity Loop".
In his years of experience as an SRE, Dana also used to run one-week boot camps for developers to educate them on making apps observable, proper logging, resiliency architecture patterns and defining good SLIs & SLOs. He talked about the 3 things that are the foundation of a good SRE: understand the app, understand the current state and make sure you know when your systems are down before your customers tell you so!
If you are interested in seeing Dana and his colleagues from Telus talk about their observability and SRE journey then check out the On-Demand session from Dynatrace Perform 2024: https://www.dynatrace.com/perform/on-demand/perform-2024/?session=simplifying-observability-automations-and-insights-with-dynatrace#sessions -
Whether it's GitOps, DevOps, Platform Engineering, Observability as a Service or other terms: we all have our definitions, but rarely do we have a consensus on what those terms really mean! To get some clarity we invited Roberth Strand, CNCF Ambassador and Azure MVP, who has been passionately advocating for GitOps as it was initially defined and explained by Alexis Richardson of Weaveworks in his blog "What is GitOps Really"!
Tune in and learn about Desired State Management, Continuous Pull vs Pushing from Pipelines, how Progressive Delivery or Auto-Scaling fits into declaring everything in Git, what OpenGitOps is and why this podcast will help you get your GitOps certification (coming soon).
As we had a lot to talk about, we also touched on Platform Engineering and various other topics.
Here are all the links we discussed:
Alexis GitOps Blog Post: https://medium.com/weaveworks/what-is-gitops-really-e77329f23416
OpenGitOps: https://opengitops.dev/
Flux Image Reflector: https://fluxcd.io/flux/components/image/
CNCF White Paper on Platform Engineering: https://tag-app-delivery.cncf.io/whitepapers/platforms/
Platform Engineering Maturity Model: https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/
Platform Engineering Working Group as part of TAG App Delivery: https://tag-app-delivery.cncf.io/wgs/platforms/ -
Can you explain GitOps in simple terms? How does it fit into Continuous Integration (CI), Continuous Delivery and Continuous Deployment? And what are considerations when rolling out GitOps in an enterprise?
To get answers to those questions we sat down with Christian Hernandez, Head of Community at Akuity, who has a fabulous analogy to explain GitOps that I am sure many of us will "borrow" from him. Christian also explains the ecosystem he works in, such as ArgoCD and Kargo, as well as OpenGitOps, which aims to provide open-source standards and best practices for implementing GitOps.
We closed the session with some advice around Application Dependency Management, External Secrets Operator and choosing the right Git Repo Structure.
Here are some of the links we discussed:
OpenGitOps: https://opengitops.dev/
ArgoCD: https://argoproj.github.io/cd/
Kargo: https://github.com/akuity/kargo
ArgoCon: https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/argocon/
GitOpsCon: https://events.linuxfoundation.org/gitopscon-north-america/ -
While the mainframe is powering the world's most critical systems, the words "modern", "open source" or "generative AI" typically don't come to mind. So let's change this!
To do that simply tune in to our latest episode where we have Jessielaine (Jelly) Punongbayan, Sr. Technical Support Engineer at Dynatrace, telling us why she is excited about the modern Mainframe and how it brought her from the Philippines via Singapore and Czech Republic to Austria.
We learn about all the open-source projects and communities she is involved in such as Open Mainframe or Zowe that make it easy to connect the Mainframe with the modern tooling of today's development environments. Jelly shares her stories about the role of good observability, how it connects the distributed and the mainframe world and how it enables development teams to build more efficient systems. And what about AI? Well - you have to tune in and listen to the end!
Here are the links discussed in the episode:
Writing a COBOL program using VSCode: https://medium.com/modern-mainframe/beginners-guide-cobol-made-easy-introduction-ecf2f611ac76
Using CircleCI to perform automation in Mainframe: https://medium.com/modern-mainframe/beginners-guide-cobol-made-easy-leveraging-open-source-tools-eb4f8dcd7a98
Using OpenTelemetry to capture Mainframe Insights: https://medium.com/@jessielaine.punongbayan/re-imagining-mainframe-insights-through-open-source-tooling-79dd4c937114
Dynatrace support for Mainframe: https://www.dynatrace.com/technologies/mainframe-monitoring/ -
201 is the HTTP status code for Resource Created. It is also the number of PurePerformance episodes (including this one) we have published over the past years. Who better to invite than the person who initially inspired us to launch PurePerformance: Mark Tomlinson, Performacologist and Director of Observability at FreedomPay.
Tune in and listen to our thoughts on the current state of automation, a recap on IFTTT, and whether we believe that AIs such as Copilot will not only make us more efficient in creating code and scripts but also lead to new ways of automation. We also give a heads-up (or rather a recap) of what Mark will be presenting at Perform 2024.
To learn more about and from Mark follow him on the various social media channels:
LinkedIn: https://www.linkedin.com/in/mtomlins/
Performacology: https://performacology.com/ -
Marcelo Amaral is a Researcher for Cloud System Optimization and Sustainability. With his background in performance engineering, where he optimized microservice workloads in containerized environments, making the leap towards analyzing and optimizing energy consumption was easy.
Tune in to this episode and learn about Kepler, the CNCF project Marcelo is working on, which provides metrics for workload energy consumption based on power models trained by the community. Marcelo goes into detail about how Kepler works and also provides practical advice for any developer to keep energy consumption in mind when making architectural and coding decisions.
To learn more about Kepler and the episode today check out:
LinkedIn from Marcelo: https://www.linkedin.com/in/mcamaral/
CNCF Blogpost on Kepler: https://www.cncf.io/blog/2023/10/11/exploring-keplers-potentials-unveiling-cloud-application-power-consumption/
Kepler GitHub Repo: https://github.com/sustainable-computing-io/kepler -
It's only been a year since ChatGPT was introduced. Since then we have seen LLMs (Large Language Models) and Generative AIs being integrated into everyday software applications. Developers have the hard choice of picking the right model for their use case to produce the quality of output their end users demand.
Tune in to this session where we have Nir Gazit, CEO and Co-founder of Traceloop, educating us about how to observe and quantify the quality of LLMs. Besides performance and costs, engineers need to look into quality attributes such as accuracy, readability or grammatical correctness.
Nir introduces us to OpenLLMetry - a set of Open Source extensions built on top of OpenTelemetry providing automated observability into the usage of LLMs for developers to better understand how to optimize the usage of LLMs. His advice to every developer is to start measuring the quality of your LLMs on Day 1 and continuously evaluate as you change your model, the prompt and the way you interact with your LLM stack!
If you have more questions about LLM Observability check out the following links:
OpenLLMetry GitHub Page: https://github.com/traceloop/openllmetry
Traceloop Website: https://www.traceloop.com/
OpenLLMetry Documentation: https://traceloop.com/docs/openllmetry -
After analyzing Distributed Traces for more than 15 years, Brian and I thought that everyone in software engineering and operations must be satisfied with all the observability data we have available. But maybe Brian and I were wrong because we didn’t fully understand all the use cases - especially those for developers who must fix code in production or need to quickly understand what code from somebody else is really doing without having the luxury to add another log line and redeploy on the fly.
To learn more about the observability requirements of developers we invited Liran Haimovitch, CTO at Rookout and now part of Dynatrace, who has spent the last 7 years solving the challenging problems that developers face day and night.
Tune in and learn what non-breaking breakpoints are, how it is possible to "debug in production" without impacting running code and how we can make developers' lives easier even though we push so many things "to the left".
-
I was invited to speak at BankTechShow in Budapest, Hungary, where the nation's IT leaders in the banking sector presented and discussed the future of banking - both in the cloud as well as what it means for the physical bank branches.
I got a chance to sit down with Adam Gajdi, IT Solutions CoE Lead at K&H, who walked me through the process of their recent new mobile banking app launch. Adam highlighted the importance of observability for both business owners and developers. Furthermore, Adam enlightened me with the fact that Hungarian banks are mandated to conduct chaos tests to prove that their systems are resilient in case of data center outages. I was obviously also curious about how AI, LLMs and other technologies are adopted in their sector. Tune in to learn more.