Episodes
-
Discover why standard Kubernetes StatefulSets might not be sufficient for your database workloads and how custom operators can provide better solutions for stateful applications.
Andrew Charlton, Staff Software Engineer at Timescale, explains how they replaced Kubernetes StatefulSets with a custom operator called Popper for their PostgreSQL Cloud Platform. He details the technical limitations they encountered with StatefulSets and how their custom approach provides more intelligent management of database clusters.
You will learn:
Why StatefulSets fall short for managing high-availability PostgreSQL clusters, particularly around pod ordering and volume management
How Timescale's instance matching approach solves complex reconciliation challenges when managing heterogeneous database workloads
The benefits of implementing discrete, idempotent actions rather than workflows in Kubernetes operators
Real-world examples of operations that became possible with their custom operator, including volume downsizing and availability zone consolidation
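The "discrete, idempotent actions" idea the episode covers can be sketched in a few lines of Go. This is a hypothetical illustration of the pattern, not Timescale's actual Popper code: each action checks observed state against desired state, performs at most one convergent step, and reports whether it changed anything, so re-running the whole loop is always safe.

```go
package main

import "fmt"

// Desired and observed state for a hypothetical database cluster.
// Field names are illustrative, not Popper's actual API.
type ClusterState struct {
	Replicas  int
	VolumeGiB int
}

// An Action performs one discrete, idempotent step toward the desired
// state and reports whether it changed anything.
type Action func(desired, observed *ClusterState) bool

func ensureReplicas(desired, observed *ClusterState) bool {
	if observed.Replicas == desired.Replicas {
		return false // already converged; safe to call again
	}
	observed.Replicas = desired.Replicas // stand-in for a real API call
	return true
}

func ensureVolumeSize(desired, observed *ClusterState) bool {
	if observed.VolumeGiB == desired.VolumeGiB {
		return false
	}
	observed.VolumeGiB = desired.VolumeGiB
	return true
}

// Reconcile runs every action until none reports a change, returning
// the number of steps taken. A second call performs zero steps.
func Reconcile(desired, observed *ClusterState, actions []Action) int {
	steps := 0
	for {
		changed := false
		for _, a := range actions {
			if a(desired, observed) {
				changed = true
				steps++
			}
		}
		if !changed {
			return steps
		}
	}
}

func main() {
	desired := &ClusterState{Replicas: 3, VolumeGiB: 100}
	observed := &ClusterState{Replicas: 1, VolumeGiB: 200}
	steps := Reconcile(desired, observed, []Action{ensureReplicas, ensureVolumeSize})
	fmt.Println(steps, *observed) // re-running Reconcile now performs zero steps
}
```

Because each action converges independently, there is no workflow state to persist or resume after a crash, which is the core advantage over workflow-style operators.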
Sponsor
This episode is brought to you by mirrord — run your local code as if it were in your Kubernetes cluster, without deploying first.
More info
Find all the links and info for this episode here: https://ku.bz/fhZ_pNXM3
Interested in sponsoring an episode? Learn more.
-
Curious about running AI models on Kubernetes without breaking the bank? This episode delivers practical insights from someone who's done it successfully at scale.
John McBride, VP of Infrastructure and AI Engineering at the Linux Foundation, shares how his team at OpenSauced built StarSearch, an AI feature that uses natural language processing to analyze GitHub contributions and provide insights through semantic queries. By using open-source models instead of commercial APIs, the team saved tens of thousands of dollars.
You will learn:
How to deploy vLLM on Kubernetes to serve open-source LLMs like Mistral and Llama, including configuration challenges with GPU drivers and DaemonSets
Why smaller models (7-14B parameters) can achieve 95% effectiveness for many tasks compared to larger commercial models, with proper prompt engineering
How running inference workloads on your own infrastructure with T4 GPUs can reduce costs from tens of thousands to just a couple thousand dollars monthly
Practical approaches to monitoring GPU workloads in production, including handling unpredictable failures and VRAM consumption issues
Sponsor
This episode is brought to you by StackGen! Don't let infrastructure block your teams. StackGen deterministically generates secure cloud infrastructure from any input: existing cloud environments, IaC, or application code.
More info
Find all the links and info for this episode here: https://ku.bz/wP6bTlrFs
Interested in sponsoring an episode? Learn more.
-
This episode examines how a default configuration in Cilium CNI led to silent packet drops in production after 8 months of stable operations.
Isala Piyarisi, Senior Software Engineer at WSO2, shares how his team discovered that Cilium's default Pod CIDR (10.0.0.0/8) was conflicting with their Azure Firewall subnet assignments, causing traffic disruptions in their staging environment.
You will learn:
How Cilium's default CIDR allocation can create routing conflicts with existing infrastructure
A methodical process for debugging network issues using packet tracing, routing table analysis, and firewall logs
The procedure for safely changing Pod CIDR ranges in production clusters
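The conflict described here is a plain CIDR overlap: Cilium's default cluster pool, 10.0.0.0/8, contains every other 10.x range. A minimal Go check (the 10.1.2.0/24 firewall subnet is an illustrative example, not WSO2's actual range) makes the failure mode concrete:

```go
package main

import (
	"fmt"
	"net"
)

// overlaps reports whether two CIDR ranges intersect. Two ranges
// intersect exactly when one contains the other's network address.
func overlaps(a, b string) bool {
	_, na, err := net.ParseCIDR(a)
	if err != nil {
		return false
	}
	_, nb, err := net.ParseCIDR(b)
	if err != nil {
		return false
	}
	return na.Contains(nb.IP) || nb.Contains(na.IP)
}

func main() {
	// Cilium's default cluster pool swallows any other 10.x subnet,
	// such as a hypothetical firewall subnet at 10.1.2.0/24.
	fmt.Println(overlaps("10.0.0.0/8", "10.1.2.0/24"))    // true
	fmt.Println(overlaps("172.20.0.0/16", "10.1.2.0/24")) // false
}
```

Running a check like this against every subnet your cloud provider hands out is a cheap way to catch the conflict before pods start black-holing traffic.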
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/kJjXQlmTw
Interested in sponsoring an episode? Learn more.
-
Managing microservices in Kubernetes at scale often leads to inconsistent deployments and maintenance overhead. This episode explores a practical solution that standardizes service deployments while maintaining team autonomy.
Calin Florescu discusses how a unified Helm chart approach can help platform teams support multiple development teams efficiently while maintaining consistent standards across services.
You will learn:
Why inconsistent Helm chart configurations across teams create maintenance challenges and slow down deployments
How to implement a unified Helm chart that balances standardization with flexibility through override functions
How to maintain quality through automated documentation and testing with tools like Helm Docs and Helm unittest
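Helm charts are Go templates under the hood, so the override pattern the episode describes can be sketched with Go's `text/template` plus a Sprig-style `default` function. This is an illustrative reimplementation of the idea, not the episode's actual chart: the unified chart supplies sane defaults, and a team's values file overrides only what it needs.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render executes a chart-like template against a team's values,
// with a "default" function mirroring the Sprig helper Helm charts
// use for override-with-fallback behavior.
func render(tmpl string, values map[string]any) (string, error) {
	funcs := template.FuncMap{
		// default returns fallback when the value is absent or empty.
		"default": func(fallback, value any) any {
			if value == nil || value == "" {
				return fallback
			}
			return value
		},
	}
	t, err := template.New("chart").Funcs(funcs).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, values); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	tmpl := `replicas: {{ default 2 .replicas }}`
	out, _ := render(tmpl, map[string]any{}) // team overrides nothing
	fmt.Println(out)                         // falls back to the platform default
	out, _ = render(tmpl, map[string]any{"replicas": 5}) // team override wins
	fmt.Println(out)
}
```

The same shape scales up: the platform team owns the template and its defaults, while each service's values file stays a short list of deviations.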
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/mcPtH5395
Interested in sponsoring an episode? Learn more.
-
Learn how ByteDance manages computing resources at scale with custom Kubernetes scheduling solutions that handle millions of pods across thousands of nodes.
Yue Yin, Software Engineer at ByteDance, discusses their open-source Gödel scheduler and Katalyst resource management system. She explains how these tools address the challenges of managing online and offline workloads in large-scale Kubernetes deployments.
You will learn:
How Gödel's distributed architecture with dispatcher, scheduler, and binder components enables the scheduling of 5,000 pods per second
Why NUMA-aware scheduling and two-layer architecture are crucial for handling complex workloads at scale
How Katalyst provides node-level resource insights to enable efficient workload co-location and improve CPU utilization
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/lMpNng_33
Interested in sponsoring an episode? Learn more.
-
Platform Engineer Artem Lajko breaks down observability into three distinct layers and explains how tools like Prometheus, Grafana, and Falco serve different purposes. He also shares practical insights on implementing the right level of monitoring based on team requirements and capabilities.
You will learn:
How to implement the three-layer model (external, internal, and OS-level) and why each layer serves different stakeholders
How to choose and scale observability tools using a label-based approach (low, medium, high)
How to manage observability costs by collecting only relevant metrics and logs
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/9sGxhmm8s
Interested in sponsoring an episode? Learn more.
-
Stefan Roman shares his experience building Labs4Grabs, a platform that gives students root access to Kubernetes clusters. He discusses the journey from evaluating simple namespace-based isolation to implementing full VM-based isolation with KubeVirt.
You will learn:
Why namespace isolation isn't sufficient for untrusted users and the limitations of tools like vCluster when running privileged workloads.
How to use KubeVirt to achieve complete workload isolation and the trade-offs.
Practical approaches to implementing network security with NetworkPolicies and managing resource allocation across multiple student environments.
Follow Stefan's journey from simple to complex isolation strategies, focusing on the technical decisions and trade-offs he encountered.
Sponsor
This episode is sponsored by Kusari — gain complete visibility into your software components and secure your supply chain through comprehensive tracking and analysis.
More info
Find all the links and info for this episode here: https://ku.bz/Xz-TrmX2F
Interested in sponsoring an episode? Learn more.
-
Michael Levan explains how specialized teams and smart abstractions can lead to better outcomes. Drawing from cognitive science and his experience in platform engineering, Michael presents practical strategies for building effective engineering organizations.
You will learn:
Why specialized teams (or "silos") can improve productivity and why the real enemy is ego, not specialization.
How to use Internal Developer Platforms (IDPs) and abstractions to empower teams without requiring everyone to be a Kubernetes expert.
How to balance specialization and collaboration using platform engineering practices and smart abstractions.
Practical strategies for managing cognitive load in engineering teams and why not everyone needs to know YAML.
Sponsor
This episode is brought to you by Testkube — scale all of your tests with Kubernetes, integrate seamlessly with CI/CD and centralize test troubleshooting and reporting.
More info
Find all the links and info for this episode here: https://ku.bz/qlZPfM-zr
Interested in sponsoring an episode? Learn more.
-
Xe Iaso shares their journey in building a "compute as a faucet" home lab where infrastructure becomes invisible and tasks can be executed without manual intervention. The discussion covers everything from operating system selection to storage architecture and secure access patterns.
You will learn:
How to evaluate operating systems for your home lab — from Rocky Linux to Talos Linux, and why minimal, immutable operating systems are gaining traction.
How to implement a three-tier storage strategy combining Longhorn (replicated storage), NFS (bulk storage), and S3 (cloud storage) to handle different workload requirements.
How to secure your home lab with certificate-based authentication, WireGuard VPN, and proper DNS configuration while protecting your home IP address.
Sponsor
This episode is sponsored by Nutanix — innovate faster with a complete and open cloud-native stack for all your apps and data anywhere.
More info
Find all the links and info for this episode here: https://ku.bz/2kzj2MgfH
Interested in sponsoring an episode? Learn more.
-
If you're trying to make sense of when to use Kubernetes and when to avoid it, this episode offers a practical perspective based on real-world experience running production workloads.
Paul Butler, founder of Jamsocket, discusses how to identify necessary vs. unnecessary complexity in Kubernetes and explains how his team successfully runs production workloads by being selective about which features they use.
You will learn:
The three compelling reasons to use Kubernetes: managing multiple services across machines, defining infrastructure as code, and leveraging built-in redundancy.
Why to be cautious with features like CRDs, StatefulSets, and Helm, and how to evaluate whether you really need them.
How to stay on the "happy path" in Kubernetes by focusing on stable and simple resources like Deployments, Services, and ConfigMaps.
When to consider alternatives like Google Cloud Run for simpler deployments that don't need the full complexity of Kubernetes.
Sponsor
This episode is sponsored by Syntasso, the creators of Kratix, a framework for building composable internal developer platforms.
More info
Find all the links and info for this episode here: https://ku.bz/VB-0WYqtb
Interested in sponsoring an episode? Learn more.
-
This episode explores Admission Controllers and Webhooks with Gordon Myers, who shares his experience implementing webhook solutions in production. Gordon explains the lifecycle of Kubernetes API requests and how webhooks can intercept and modify resources before they are stored in etcd.
You will learn:
How the Kubernetes API processes requests through authentication, authorization, and Admission Controllers.
The difference between Validating and Mutating webhooks and how to implement them using JSON Patch.
Best practices for testing webhooks and avoiding common pitfalls that can break cluster deployments.
Real-world examples of webhook implementations, including injecting secrets from HashiCorp Vault into containers.
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/Dmn93dd7M
Interested in sponsoring an episode? Learn more.
-
Are you facing challenges with pre-production environments in Kubernetes?
This KubeFM episode shows how to implement efficient deployment previews and solve data seeding bottlenecks.
Nick Nikitas, Senior Platform Engineer at Blueground, shares how his team transformed their static pre-production environments into dynamic previews using Argo CD Application Sets, Wave, and Velero.
He explains their journey from managing informal environment sharing between teams to implementing a scalable preview system that reduced environment creation time from 19 minutes to 25 seconds.
You will learn:
How to implement GitOps-based preview environments with Argo CD Application Sets and PR generators for automatic environment creation and cleanup.
How to control cloud costs with TTL-based termination and FIFO queues to manage the number of active preview environments.
How to optimize data seeding using Velero, AWS EBS snapshots, and Kubernetes PVC management to achieve near-instant environment creation.
Sponsor
This episode is sponsored by Loft Labs — simplify Kubernetes with vCluster, the leading solution for Kubernetes multi-tenancy and cost savings.
More info
Find all the links and info for this episode here: https://ku.bz/tt4VFslxD
Interested in sponsoring an episode? Learn more.
-
Discover how a seemingly simple 502 error in Kubernetes can uncover complex interactions between Go and containerized environments.
Emin Laletović, a solution architect at Hybird Technologies, shares his experience debugging a production issue in which a specific API endpoint failed due to out-of-memory errors.
He walks through the systematic investigation process, from initial log checks to uncovering the root cause in Go's memory management within Kubernetes.
You will learn:
How Go's garbage collector interacts with Kubernetes resource limits, potentially leading to unexpected OOMKilled errors.
The importance of the GOMEMLIMIT environment variable in Go 1.19+ for managing memory usage in containerized environments.
Debugging techniques for memory-related issues in Kubernetes, including GODEBUG for garbage collector tracing.
Considerations for optimizing Go applications in Kubernetes, balancing performance and resource utilization.
Sponsor
This episode is sponsored by StormForge – Double your Kubernetes resource utilization and unburden developers from sizing complexity with the first HPA-compatible vertical pod rightsizing solution. Try it for free.
More info
Find all the links and info for this episode here: https://ku.bz/7fnF-tJ8R
Interested in sponsoring an episode? Learn more.
-
This episode offers a rare glimpse into the design decisions that shaped the world's most popular container orchestration platform.
Brian Grant, CTO of ConfigHub and former tech lead on Google's Borg team, discusses the Kubernetes Resource Model (KRM) and its profound impact on the Kubernetes ecosystem.
He explains how KRM's resource-centric API patterns enable Kubernetes' flexibility and extensibility and influence the entire cloud native landscape.
You will learn:
How the Kubernetes API evolved from inconsistency to a uniform structure, enabling support for thousands of resource types.
Why Kubernetes' self-describing resources and Server-side Apply simplify client implementations and configuration management.
The evolution of Kubernetes configuration tools like Helm, Kustomize, and GitOps solutions.
Current trends and future directions in Kubernetes configuration, including potential AI-driven enhancements.
Sponsor
This episode is sponsored by StormForge – Double your Kubernetes resource utilization and unburden developers from sizing complexity with the first HPA-compatible vertical pod rightsizing solution. Try it for free.
More info
Find all the links and info for this episode here: https://ku.bz/_ZLj6ZV-9
Interested in sponsoring an episode? Learn more.
-
Dive into the world of GitOps and compare two of the most popular tools in the CNCF landscape: Argo CD and Flux CD.
Andrei Kvapil, CEO and Founder of Aenix, breaks down the strengths and weaknesses of Argo CD and Flux CD, helping you understand which tool might best fit your team's needs.
You will learn:
The different philosophies behind the tools.
How they handle access control and deployment restrictions.
Their trade-offs in usability and conformance to infrastructure as code.
Why there is no one-size-fits-all in the GitOps world.
Sponsor
This episode is sponsored by DigitalOcean — learn how GPUs for DigitalOcean Kubernetes can enable your AI/ML workloads.
More info
Find all the links and info for this episode here: https://ku.bz/0mvh5s4Ld
Interested in sponsoring an episode? Learn more.
-
Eric Jalal, an independent consultant and Kubernetes developer, explains how Kubernetes is fundamentally built on familiar Linux features. He discusses why understanding Linux is crucial for working with Kubernetes and how this knowledge can simplify your approach to cloud-native technologies.
You will learn:
Why Eric considers Kubernetes to be "just Linux" and how it wraps existing Linux technologies.
The importance of understanding Linux fundamentals (file systems, networking, storage).
How Kubernetes provides a standard and consistent interface for managing Linux-based infrastructure.
Why learning Linux deeply can make Kubernetes adoption an incremental step rather than a giant leap.
Sponsor
This episode is sponsored by Learnk8s — get started on your Kubernetes journey through comprehensive online, in-person or remote training.
More info
Find all the links and info for this episode here: https://ku.bz/-jCTfgqRC
Interested in sponsoring an episode? Learn more.
-
Alexandre Souza, a senior platform engineer at Getir, shares his expertise in managing large-scale environments and configuring requests, limits, and autoscaling.
He explores the challenges of over-provisioning and under-provisioning and discusses strategies for optimizing resource allocation using tools like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA).
You will learn:
How to set appropriate resource requests and limits to balance application performance and cost-efficiency in large-scale Kubernetes environments.
Strategies for implementing and configuring Horizontal Pod Autoscaler (HPA), including scaling policies and behavior management.
The differences between CPU and memory management in Kubernetes and their impact on workload performance.
Techniques for leveraging tools like KubeCost and StormForge to automate resource optimization.
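The HPA's scaling decision, which the episode's configuration advice revolves around, follows one rule from the Kubernetes documentation: `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`. A tiny Go sketch makes the behavior easy to reason about:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA scaling rule from the Kubernetes
// docs: ceil(currentReplicas * currentMetric / targetMetric).
func desiredReplicas(current int, currentMetric, targetMetric float64) int {
	return int(math.Ceil(float64(current) * currentMetric / targetMetric))
}

func main() {
	// 4 replicas averaging 80% CPU against a 50% target scale out.
	fmt.Println(desiredReplicas(4, 80, 50))
	// Utilization at the target leaves the replica count unchanged.
	fmt.Println(desiredReplicas(4, 50, 50))
}
```

Because the ratio is relative to the *requested* CPU, a poorly chosen request skews every scaling decision, which is why the episode treats requests and autoscaling as one problem.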
Sponsor
This episode is sponsored by VictoriaMetrics — request a free trial of VictoriaMetrics Enterprise today.
More info
Find all the links and info for this episode here: https://ku.bz/z2Vj9PBYh
Interested in sponsoring an episode? Learn more.
-
In this KubeFM episode, Kensei Kanada discusses Tortoise, an open-source project he developed at Mercari to tackle Kubernetes resource optimization challenges. He explains the limitations of existing solutions like Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), and how Tortoise aims to provide a more comprehensive and automated approach to resource management in Kubernetes clusters.
You will learn:
The complexities of resource optimization in Kubernetes, including the challenges of managing HPA, VPA, and manual tuning of resource requests and limits
How Tortoise automates resource optimization by replacing HPA and VPA, reducing the need for manual intervention and continuous tuning
The technical implementation of Tortoise, including its use of Custom Resource Definitions (CRDs) and how it interacts with existing Kubernetes components
Strategies for adopting and migrating to new tools like Tortoise in a large-scale Kubernetes environment
Sponsor
This episode is sponsored by Learnk8s — estimate the perfect cluster node with the Kubernetes Instance Calculator.
More info
Find all the links and info for this episode here: https://ku.bz/bRd0243xQ
Interested in sponsoring an episode? Learn more.
-
In this KubeFM episode, Ángel Barrera discusses Adidas' strategic shift to a GitOps-based container platform management system, initiated in May 2022, and its impact on their global infrastructure.
You will learn:
The initial state and challenges: Understand the complexities and inefficiencies of Adidas' pre-GitOps infrastructure.
The transition process: Explore the steps and strategies used to migrate to a GitOps-based system, including tool changes and planning.
Technical advantages: Learn about the benefits of the pull mechanism, unified configuration, and improved visibility into cluster states.
Developer and business feedback: Gain insights into the feedback from developers and the business side, and how they were convinced to invest in the migration.
Sponsor
This episode is sponsored by ControlPlane — empower your Kubernetes deployments with ControlPlane Enterprise for Flux CD.
More info
Find all the links and info for this episode here: https://ku.bz/-5QbzQXJg
Interested in sponsoring an episode? Learn more.
-
In this KubeFM episode, Miguel Luna discusses the intricacies of Observability in Kubernetes, including its components, tools, and future trends.
You will learn:
The fundamental components of Observability: metrics, logs, and traces, and their roles in understanding system performance and health.
Key tools and projects: insights into Keptn and OpenTelemetry and their significance in the Observability ecosystem.
The integration of AI technologies: how AI is shaping the future of Observability in Kubernetes.
Practical steps for implementing Observability: starting points, what to monitor, and how to manage alerts effectively.
Sponsor
This episode is sponsored by Learnk8s — estimate the perfect cluster node with the Kubernetes Instance Calculator.
More info
Find all the links and info for this episode here: https://ku.bz/WwS04jYvv
Interested in sponsoring an episode? Learn more.