Episodes
-
Due to its vast array of capabilities and applications, it can be challenging to pinpoint the exact essence of what Kubernetes does. Today we are asking questions about orchestration and runtimes, and trying to see if we can agree on whether Kubernetes primarily does one or the other, or even something else. Kubernetes might, for instance, be better described as a platform! In order to answer these questions, we look at what constitutes an orchestrator, thinking about management, workflow, and security across a network of machines. We also get into the nitty-gritty of runtimes and how Kubernetes can fit into this model too. The short answer to our initial question is that whether you call Kubernetes a platform, an orchestrator, or a runtime depends on your perspective and tasks, and Kubernetes can fill any one of these buckets. We also look at other platforms, past and present, that might be compared to Kubernetes in certain areas, and see what this might tell us about the definitions. Ultimately, we come away with the message that exactly how you slice what Kubernetes does is not all-important. Rigid definitions might not serve us so well; our focus should rather be on an evolving understanding of these terms and the broadening horizon of what Kubernetes can achieve. For a really interesting meditation on how far we can take the Kube, be sure to join us today!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://www.notion.so/thepodlets/The-Podlets-Guest-Central-9cec18726e924863b559ef278cf695c9
Hosts:
https://twitter.com/mauilion
https://twitter.com/joshrosso
https://twitter.com/embano1
https://twitter.com/opowero
Key Points From This Episode:
What defines an orchestrator? Different kinds of management, workflow, and security.
Considerations in a big company that go into licensing, security and desktop content.
Can we actually call Kubernetes an orchestrator or a runtime?
How managing things at scale increases the need for orchestration.
An argument for Kubernetes being considered an orchestrator and a runtime.
Understanding runtimes as part of the execution environment and not the entire environment.
How platforms, orchestration, and runtimes change positions according to perspective.
Remembering the 'container orchestration wars' between Mesos, Swarm, Nomad, and Kubernetes.
The effect of containerization and faster release cycles on the application of updates.
Instances where Kubernetes might not be used for orchestration currently.
The increasingly lower levels at which you can view orchestration and containers.
The great job that Kubernetes is able to do in the orchestration and automation layer.
How Kubernetes removes the need to reinvent everything over and over again.
Breaking down rigid definitions and allowing some space for movement.
Quotes:
“Obviously, orchestrator is a very big word, it means lots of things but as we’ve already described, it’s hard to fully encapsulate what orchestration means at a lower level.” — @mauilion [0:16:30]
“I wonder if there is any guidance or experiences we have with determining when you might need an orchestrator.” — @joshrosso [0:28:32]
“Sometimes there is an element of over-automation; some people don’t want all of this automation happening in the background.” — @opowero [0:29:19]
Links Mentioned in Today’s Episode:
Apache Airflow — https://airflow.apache.org/
SSCM — https://www.lynda.com/Microsoft-365-tutorials/SSCM-Office-Deployment-Tool/5030992/2805770-4.html
Ansible — https://www.ansible.com/
Docker — https://www.docker.com/
Joe Beda — https://www.vmware.com/latam/company/leadership/joe-beda.html
Jazz Improvisation over Orchestration — https://blog.heptio.com/core-kubernetes-jazz-improv-over-orchestration-a7903ea92ca?gi=5c729e924f6c
containerd — https://containerd.io/
AWS — https://aws.amazon.com/
Fleet — https://coreos.com/fleet/docs/latest/launching-containers-fleet.html
Puppet — https://puppet.com/
HashiCorp Nomad — https://nomadproject.io/
Mesos — http://mesos.apache.org/
Swarm — https://docs.docker.com/get-started/swarm-deploy/
Red Hat — https://www.redhat.com/en
Zalando — https://zalando.com/
-
We are joined by Ellen Körbes for this episode, where we focus on Kubernetes and its tooling. Ellen has a position at Tilt where they work in developer relations. Before Tilt, they were doing closely related work at Garden, a similar company! Both companies work directly with Kubernetes, and Ellen is here to talk to us about why Kubernetes does not have to be the difficult thing it is made out to be. According to them, this mostly comes down to tooling. Ellen believes that with the right set of tools at your disposal, it is not actually necessary to completely understand all of Kubernetes or even be familiar with a lot of its functions. You do not have to start from the bottom every time you start a new project, and developers who are new to Kubernetes need not become experts in it in order to take advantage of its benefits.
The major goal for Ellen and Tilt is to get developers' code up, running, and live in as quick a time as possible. When the system stands in the way, this process can take much longer, whereas with Tilt, Ellen believes the process should be around two seconds! Ellen comments on who should be using Kubernetes and who it would most benefit. We also discuss where Kubernetes should be run, either locally or externally, for best results, and Tilt's part in the process of unit testing and feedback. We finish off peering into the future of Kubernetes, so make sure to join us for this highly informative and empowering chat!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://www.notion.so/thepodlets/The-Podlets-Guest-Central-9cec18726e924863b559ef278cf695c9
Guest:
Ellen Körbes https://twitter.com/ellenkorbes
Hosts:
Carlisia Campos
Bryan Liles
Olive Power
Key Points From This Episode:
Ellen's work at Tilt and the jumping-off point for today's discussion.
The projects and companies that Ellen and Tilt work with, that they are allowed to mention!
Who Ellen is referring to when they say 'developers' in this context.
Tilt's goal of getting all developers' code up and running in the two seconds range.
Who should be using Kubernetes? Is it necessary in development if it is used in production?
Operating and deploying Kubernetes — who is it that does this?
Where developers seem to be running Kubernetes; considerations around space and speed.
Possible security concerns using Tilt; avoiding damage through Kubernetes options.
Allowing greater possibilities for developers through useful shortcuts.
VS Code extensions and IDE integrations that are possible with Kubernetes at present.
Where to start with Kubernetes and getting a handle on the tooling like Tilt.
Using unit testing for feedback and Tilt's part in this process.
The future of Kubernetes tooling and looking across possible developments in the space.
Quotes:
“You're not meant to edit Kubernetes YAML by hand.” — @ellenkorbes [0:07:43]
“I think from the point of view of a developer, you should try and stay away from Kubernetes for as long as you can.” — @ellenkorbes [0:11:50]
“I've heard from many companies that the main reason they decided to use Kubernetes in development is that they wanted to mimic production as closely as possible.” — @ellenkorbes [0:13:21]
Links Mentioned in Today’s Episode:
Ellen Körbes — http://ellenkorbes.com/
Ellen Körbes on Twitter — https://twitter.com/ellenkorbes?lang=en
Tilt — https://tilt.dev/
Garden — https://garden.io/
Cluster API — https://cluster-api.sigs.k8s.io/
Lyft — https://www.lyft.com/
KubeCon — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/
Unu Motors — https://unumotors.com/en
Mindspace — https://www.mindspace.me/
Docker — https://www.docker.com/
Netflix — https://www.netflix.com/
GCP — https://cloud.google.com/
Azure — https://azure.microsoft.com/en-us/
AWS — https://aws.amazon.com/
ksonnet — https://ksonnet.io/
Ruby on Rails — https://rubyonrails.org/
Lambda – https://aws.amazon.com/lambda/
DynamoDB — https://aws.amazon.com/dynamodb/
Telepresence — https://www.telepresence.io/
Skaffold Google — https://cloud.google.com/blog/products/application-development/kubernetes-development-simplified-skaffold-is-now-ga
Python — https://www.python.org/
REPL — https://repl.it/
Spring — https://spring.io/community
Go — https://golang.org/
Helm — https://helm.sh/
Pulumi — https://www.pulumi.com/
Starlark — https://github.com/bazelbuild/starlark
Transcript:
EPISODE 22
[ANNOUNCER]
Welcome to The Podlets Podcast, a weekly show that explores cloud native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision-maker, this podcast is for you.
[EPISODE]
[0:00:41.8] CC: Hi, everybody. This is The Podlets. We are back this week with a special guest, Ellen Körbes. Ellen will introduce themselves in a little bit. Also on the show, it’s myself, Carlisia Campos, Michael Gasch and Duffie Cooley.
[0:00:57.9] DC: Hey, everybody.
[0:00:59.2] CC: Today’s topic is Kubernetes Sucks for Developers, right? No. Ellen is going to introduce themselves now and tell us all about what that even means.
[0:01:11.7] EK: Hi. I’m L. I do developer relations at Tilt. Tilt is a company whose main focus is development experience when it comes to Kubernetes and multi-service development. Before Tilt, I used to work at Garden. They basically do the same thing, it's just a different approach. That is basically the topic that we're going to discuss, the fact that Kubernetes does not have to suck for developers. You just need to – you need some hacks and fixes and tools and then things get better.
[0:01:46.4] DC: Yeah, I’m looking forward to this one. I've actually seen Tilt being used in some pretty high-profile open source projects. I've seen it being used in Cluster API and some of the work we've seen there and some of the other ones. What are some of the larger projects that you are aware of that are using it today?
[0:02:02.6] EK: Oh, boy. That's complicated, because every company has a different policy as to whether I can name them publicly or not. Let's go around that question a little bit. You might notice that Lyft has a talk at KubeCon, where they're going to talk about Tilt. I can't tell you right now that they use Tilt, but there's that. Hopefully, I found a legal loophole here.
I think they're the biggest name that you can find right now. Cluster API is of course huge and Cluster API is fun, because the way they're doing things is very different. We're used to seeing mostly companies that do apps in some way or another, like websites, phone apps, etc. Then Cluster API is completely insane. It's something else totally.
There's tons of other companies. I'm not sure which ones that are large I can name specifically. There are smaller companies. Unu Motors, they do electric motorcycles. It's a company here in Berlin. They have 25 developers. They're using Tilt. We have very tiny companies, like Mindspace; they're a studio in Tucson, Arizona. They also use Tilt and it's a three-person team. We have the whole spectrum, from very, very tiny companies that are using Docker for Mac and pretty happy with it, all the way up to huge companies with their own fleet of development clusters and all of that, and they're using Tilt as well.
[0:03:38.2] DC: That field is awesome.
[0:03:39.3] MG: Quick question, Ellen. The title says ‘developers’. Developers is a pretty broad term. I have people saying that, okay, Kubernetes is too raw. It's more like a Linux kernel; we want this PaaS experience. Our business developers, our application developers are developing in there. How would you describe developers interfacing with Kubernetes using the tools that you just mentioned? Is it the traditional enterprise developer, or more Kubernetes developers developing on Kubernetes?
[0:04:10.4] EK: No. I specifically mean not Kubernetes developers. You have people who work on Kubernetes. For example, the Cluster API folks, they're doing stuff that is Kubernetes specific. That is not my focus. The focus is you're a back-end developer, you're a front-end developer, you're the person configuring, I don't know, the databases or whatever. Basically, you work at a company, you have your own business logic, you have your own product, your own app, your own internal stuff, all of that, but you're not a Kubernetes developer.
It just so happens that if the stuff you are working on is going to be pointing at Kubernetes, it's going to target Kubernetes, then one, you're the target developer for me, for my work. Two, usually you're going to have a hard time doing your job. We can talk a bit about why. One issue is development clusters. If you're using Kubernetes in prod, rule of thumb, you should be using Kubernetes in dev, because you don't want completely separate environments where things work in your environment as a developer and then you push them and they break. You don't want that. You need some development cluster. The type of cluster that that's going to be is going to vary according to the level of complexity that you want and that you can deal with.
Like I said, some people are pretty happy with Docker for Mac. I hear all the time these complaints that, “Oh, you're running Kubernetes on your machine. It's going to catch fire.” Okay, there's some truth to that, but also it depends on what you're doing. No one tries to run Netflix, let's say the whole of Netflix, on their laptop, because we all know that's not reasonable. People try to do similar things on their Minikube, or Docker for Mac. Then it doesn't work and they say, “Oh, Kubernetes on the laptop doesn't work.” No. Yeah, it does. Just not for you.
That's a complaint I particularly dislike, because it comes from a – it's a blanket statement that has no – let's say, no facts behind it. Yeah, if you're a small company, Docker for Mac is going to work fine for you. Let's say you have a beefy laptop with 30 gigs of RAM, you can put a lot of software in 30 gigs. You can put a lot of microservices in 30 gigs.
That's going to work up to a point and then it's going to break. When it breaks, you're going to need to move to a cloud, you're going to need to do remote development, and then you're going to go to GCP, or Azure, or Amazon. You're going to set up a cluster there. Some people use the managed Kubernetes options. Some people just spin up a bunch of machines and wire up Kubernetes by themselves. That's going to depend on basically how much you have in terms of resources and in terms of needs.
Usually, keeping up a remote cluster that works is going to demand more infrastructure work. You're going to need people who know how to do that, to keep an eye on that. There's also the billing aspect, which is: you can run Docker for Mac all day and you're not going to pay extra. If you leave a bunch of stuff running on Google, you're going to have a bill at the end of the month that you need to pay attention to. That is one thing for people to think about.
Another aspect that I see very often that people don't know what to do with is config files. You scroll Twitter, you scroll Kubernetes Twitter for five minutes and there's a joke about YAML. We all hate editing YAML. Again, the same way people make jokes about Kubernetes setting your laptop on fire, I would argue that you're not meant to edit Kubernetes YAML by hand. The tooling for that is arguably not as mature as the tooling when it comes to Kubernetes clusters that run on your laptop. You have stuff like YAML templates, you have ksonnet. I think there's one called Kustomize, but I haven't used it myself.
What I see in every company from the two-person team to the 600-person team is no one writes Kubernetes YAML by hand. Everyone uses a template solution, a templating solution of some sort. That is the first thing that I always tell people when they start making jokes about YAML: if you're editing YAML by hand, you're doing it wrong. You shouldn't do that in the first place. It's something that you set up once at some point and you look at it whenever you need to. On your day-to-day when you're writing your code, you should not touch those files, not by hand.
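To make the templating advice concrete, here is a minimal sketch of what "don't hand-edit YAML" can look like in a Tiltfile, using Tilt's built-in helm() helper to render a chart into Kubernetes objects; the chart path and values file are assumptions, and Helm is only one of the templating routes mentioned (ksonnet and Kustomize are others).

```python
# Tiltfile (Starlark): render a Helm chart into Kubernetes objects
# instead of committing and hand-editing raw YAML.
# './charts/myapp' and the dev values file are hypothetical paths.
k8s_yaml(helm(
    './charts/myapp',
    values=['./charts/myapp/values-dev.yaml'],  # dev-only overrides
))
```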
[0:08:40.6] CC: We're five minutes in and you threw so much at us. We need to start breaking some of this stuff down.
[0:08:45.9] EK: Okay. Let me throw you one last thing then, because that is what I do personally. One more thing that we can discuss is the development feedback loop. You're writing your code, you're working on your application, you make a change to your code. How much work is it for you to see that new line of code that you just wrote live and running? For most people, it's a very long journey. I asked that on Twitter, a lot of people said it was over half an hour. A very tiny amount of people said it was between five minutes and half an hour and only a very tiny fraction of people said it was two seconds or less.
The goal of my job, of my work, the goal of Tilt, the tool, which is made by the company I work for, also called Tilt, is to get everyone into that two-seconds range. I've done that on stage in talks, where we take an application and we go from, “Okay, every time you make a change, you need to build a new Docker image. You need to push it to a registry. You need to update your cluster, blah, blah, blah, and that's going to take minutes, whole minutes.” We take it from all that long and we dial it down to a couple of seconds.
You make a change, or save your file, snap your fingers and poof, it's up and running, the new version of your app. It's basically a real-time, perceptually real-time, just like back and when everyone was doing Ruby on Rails and you would just save your file and see if it worked basically instantly. That is the part of this discussion that personally I focus more on.
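For readers who have not seen Tilt, a minimal Tiltfile for the workflow being described might look like the sketch below; the image name, manifest path, and port are assumptions, not details from the episode.

```python
# Tiltfile: on every file save, Tilt rebuilds the image, updates the
# Deployment in the dev cluster, and streams logs back, which is the
# save-and-see-it-running loop described above.
docker_build('example.com/myapp', '.')      # watch sources, rebuild image
k8s_yaml('deploy/app.yaml')                 # the app's Kubernetes manifest
k8s_resource('myapp', port_forwards=8000)   # reach the app on localhost:8000
```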
[0:10:20.7] CC: I'd love to jump to the how in a little bit. I want to circle back to the beginning. I love the question that Michael asked at the beginning: what is considered a developer? Because that really makes a difference, first, to understand who we are talking about. I think this conversation can go in circles, and not that I'm saying we are going in circles, but this conversation out in the wild can go in circles, until we have an understanding of the difference between: can you as a developer use Kubernetes in a somewhat not painful way, and should you?
I'm very interested to get your take, and Michael and Duffie's take as well, on whether we should be doing this, and whether all developers should be using Kubernetes through the development process. Then we also have to consider people who are not using Kubernetes, because a lot of people out there are not using Kubernetes. Developers especially hear that Kubernetes is painful and difficult for developers. Obviously, that is not going to qualify Kubernetes as a tool that they're going to look into. It's just not motivating.
If there is anything that would make people motivated to look into Kubernetes, that would be beneficial for them, and not just using Kubernetes for Kubernetes' sake. Would it be useful? Basically, why? Why would it be useful?
[0:11:50.7] EK: I think from the point of view of a developer, you should try and stay away from Kubernetes for as long as you can. Kubernetes comes in when you start having issues of scale. It's a production matter, it's not a development matter. I don't know, like a DevOps issue, operations issue. Ideally, you put off moving your application to Kubernetes as long as possible. This is an opinion. We can argue about this forever. Just because it introduces a lot of complexity and if you don't need that complexity, you should probably stay away from it.
To get to the other half of the question, which is: if you're using Kubernetes in production, should you use Kubernetes in development? Now here, I'm going to say yes 100% of the time. Blanket statement, of course; we can argue about minutiae, but I think so. Because if you don't, you end up having separate environments. Let's say you're using Docker Compose, because you don't like Kubernetes. You're using Kubernetes in production, so in development you are going to need containers of some sort.
Let's say you're using Docker Compose. Now you're maintaining two different environments. You update something here, you have to update it there. One day, it's going to be Friday, you're going to be tired, you're going to update something here, you're going to forget to update something there, or you're going to update something there and it's going to be slightly different. Or maybe you're doing something that has no equivalent between what you're using locally and what you're using in production. Then basically, you're in trouble.
I've heard from many companies that the main reason they decided to use Kubernetes in development is that they wanted to mimic production as closely as possible. One argument we can have here is that – oh, but if you're using Kubernetes in development, that's going to add a lot of overhead and you're not going to be able to do your job right. I agree that that was true for a while, but right now we have enough tooling that you can basically make Kubernetes disappear and you just focus on being a developer, writing your code, doing all of that stuff. Kubernetes is sitting there in the background. You don't have to think about it and you can just go on about your business with the advantage that now, your development environment and your production environment are going to very closely mimic each other, so you're not going to have issues with those potential disparities.
[0:14:10.0] CC: All right. Another thing too is that I think we're making an assumption that the developers we are talking about are the developers that are also responsible for deployment. Sometimes that's the case, sometimes that's not the case, and I'm going to shut up now. It would be interesting to talk about that too between all of us: is that what we see? Is it the case that now developers are responsible? Is DevOps just so ubiquitous that we don't even consider differentiating between developers and ops people? All right?
[0:14:45.2] DC: I think I have a different spin on that. I think that it's not necessarily that developers are the ones operating the infrastructure. The problem is that if your infrastructure is operated by a platform that may require some integration at the application layer to really hit its stride, then the question becomes how do you as a developer become more familiar? What is the user experience, or I should say, the developer experience, around that integration?
What can you do to improve that, so that the developer can understand better, or play with, how service discovery works, or how the different services in their application will be able to interact, without having to redefine that in different environments? Which is, I think, what Ellen's point was.
[0:15:33.0] EK: Yeah. At the most basic level, you have issues as such as you made a change to a service here, let's say on your local Docker Compose. Now you need to update your Kubernetes manifest on your cluster for things to make sense. Let's say, I don't know, you change the name of a service, something as simple as that. Even those kinds of things that sounds silly to even describe, when you're doing that every day, one day you're going to forget it, things are going to explode, you're not going to know why, you're going to lose hours trying to figure out where things went wrong.
[0:16:08.7] MG: Also the same with [inaudible] maybe. Even if you use Kubernetes locally, you might run a later version of Kubernetes, maybe use kind for local development, but then your cluster, your remote cluster, is three or four versions behind. It shouldn't be, because of the version support policy, but it might happen, right?
Then APIs might be deprecated, or you're using a different API. I totally agree with you, Ellen, that your development environment should reflect production as closely as possible. Even there, you have to make sure that prod matches, that your APIs match, that API types match and all that stuff, right, because they could also break.
[0:16:42.4] EK: You were definitely right that bugs are not going away anytime soon.
[0:16:47.1] MG: Yeah. I think this discussion also reminds me of the discussion that folks in the cloud world have with AWS Lambda, for example, because it's similar. Even though there are tools to simulate, or mimic, these serverless platforms locally, the general recommendation there is to embrace the cloud and develop natively in the cloud, because that's something you cannot resemble. You cannot run DynamoDB locally. You could mimic it. You could mimic Lambda runtimes locally. Essentially, it's going to be different.
That's also a common complaint in the world of AWS and cloud development: it's really not that easy to develop locally, when you're supposed to develop on the platform that the code is being shipped to and run on, because you cannot run the cloud locally. It sounds crazy, but it is. I think the same is true with Kubernetes, even though we have the tools. I don't think that every developer runs Kubernetes locally. Most of them maybe don't even have Docker locally, so they use some Spring tools and then they have some pipeline and eventually it gets shipped as a container, as a pod, in Kubernetes.
That's what I wanted to throw in here, more as a question about experience, especially for you, Ellen, with these customers that you work with: what are the different profiles that you see from a maturity perspective? These customers, the large enterprises, might be different from the smaller ones that you mentioned. How do you see them having different requirements? As Carlisia also said, do they do ops, or DevOps, or is it strictly separated there, especially in large enterprises?
[0:18:21.9] EK: What I see the most, let's get the last part first.
[0:18:24.6] MG: Yeah, it was a lot of questions. Sorry for that.
[0:18:27.7] EK: Yeah. When it comes to who operates Kubernetes, who deploys Kubernetes: definitely, most developers push their code to Kubernetes themselves. Of course, this involves CI and testing and PRs and all of that, so it's not like you can just go crazy and break everything.
When it comes to operating the production cluster, then that's separate. Usually, you have someone writing code and someone else operating clusters and infrastructure. Sometimes it's the same person, but they're clearly separate roles, even if it's the same person doing it. Usually, you go from your IDE to PR and that goes straight into production once the whole process is done.
Now we were talking about workflows and Lambda and all of that. I don't see a good solution for lambda, a good development experience for Lambda just yet. It feels a bit like it needs some refinement still. When it comes to Kubernetes, you asked do most developers run Kubernetes locally? Do they not? I don't know about the numbers, the absolute numbers. Is it most doing this, or most doing that? I'm not sure. I only know the companies I'm in touch with.
Definitely not all developers run Kubernetes on their laptops, because it's a problem of scale. Right now, we are basically stuck with 30 gigs of RAM on our laptops. If your app is bigger than that, tough luck, you're not going to run it on the laptop. What most developers do is they still maintain a local development environment, where they can do development without going to CI. I think that is the main question. They maintain agility in their development process.
What we usually see when you don't have Kubernetes on your laptop and you're using remote Kubernetes, so a remote development cluster in some cloud provider. What most people do and this is not the companies I talk to. This is basically everyone else. What most people will do is they make their development environment be the same, or work the same way as their production environment. You make a change to your code, you have to push a PR that has to get tested by CI. It has to get approved. Then it ends up in the Kubernetes cluster.
Your feedback loop as a developer is insanely slow, because there's so much red tape between you changing a line of code and you getting a new process running in your cluster. Now, there's a category of tools I call MDX. I basically coined that category name myself. MDX is multi-service development experience tooling.
When you use MDX tools, and that's not just Tilt; it's Tilt, it's Garden where I used to work, people use Telepresence like that. There's Skaffold from Google and so on. There's a bunch of tools. When you use a tool like that, you can have your feedback loop down to a second like I said before. I think that is the major improvement developers can make if they're using Kubernetes remotely, and even if they're using Kubernetes locally.
I would guess most people do not run Kubernetes locally. They use it remotely. We have clients who – we have users who don't even have Docker on their local machines, because if you have the right tooling, you can change the files on your machine. You have tooling running that detects those file changes. It syncs those file changes to your cluster. The cluster then rebuilds images, or restarts containers, or syncs live code that's already running. Then you can see those changes reflected in your development cluster right away, even though you don't even have Docker on your machine. There are all of those possibilities.
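The in-place syncing described here is what Tilt calls live_update. Below is a hedged sketch for a hypothetical Python service; the paths and the pip command are assumptions.

```python
# Instead of rebuilding and pushing an image on every change, copy
# changed files straight into the running container, and only rerun
# dependency installation when requirements.txt itself changes.
docker_build(
    'example.com/api',
    '.',
    live_update=[
        sync('./src', '/app/src'),
        run('pip install -r requirements.txt',
            trigger=['./requirements.txt']),
    ],
)
```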
[0:22:28.4] MG: Do you see security issues with that approach with not knowing the architecture of Tilt? Even though it's just the development clusters, there might be stuff that could break, or you could break by bypassing the red tape as you said?
[0:22:42.3] EK: Usually, we assign one user per namespace. Usually, every developer has a namespace. Kubernetes itself has enough options that if that's a concern to you, you can make it secure. Most people don't worry about it that much, because it's development clusters. They're not accessible to the public. Usually, there's – you can only access it through a VPN or something of that sort.
We haven't heard about security issues so far. I'm sure they’re going to pop out at some point. I'm not sure how severe it’s going to be, or how hard it's going to be to fix. I am assuming, because none of this stuff is meant to be accessible to the wider Internet that it's not going to be a hard problem to tackle.
[0:23:26.7] DC: I would like to back up for a second, because I feel we're pretty far down the road on what the value of this particular pattern is without really explaining what it is. I want to back this up for just a minute and talk about some of the things that a tooling like this is trying to solve in a use case model, right?
Back in the day when I was learning Python, I remember really struggling with the idea of being able to debug Python live. I came across IPython, which is a REPL, and that was hugely eye-opening, because it gave me the ability to interact with my code live. It also opened me up to the idea that this was an improvement over things like having to commit a new log line against a particular function, push that new function up to a place where it would actually get some use, and then go look at that log line and see what's coming out of it, or wonder whether I actually have enough logs to even put together what went wrong.
That whole set of use cases is, I think, somewhat addressed by tooling like this. I do think we should talk about how we got here, how tooling like this actually addresses some of those things, and which use cases specifically it is looking to address.
I guess where I'm going with this is, to your point, a tooling like Tilt, for example, gives you the ability, as far as I understand it, to inject into a running application a new instance that you have local development control over. If you make a change to that code, then the instance running inside of your target environment reflects that new code change very quickly, right? Basically, solving the problem of making sure that you have some very quick feedback loop.
I mean, functionally, that's the killer feature here. I think it’s really interesting to see tooling like that start to develop, right? Another example of that tooling would be the REPL thing, wherein instead of writing your code and compiling your code and seeing the output, you could do a thing where you're actually inside, running as a thread inside of the code and you can dump a data structure and you can modify that data structure and you can see if your function actually does the right thing, without having to go back and write that code while imagining all those data structures in your head. Basic tooling like this, I think is pretty killer.
[0:25:56.8] EK: Yeah. I think one area that is still partially untapped right now, where this tooling could go, and I'm pushing it, but it's a process, it's not something we can do overnight, is to have very high-level patterns, let's say, codified. For example, everyone's copying Dockerfiles and Kubernetes manifests and Terraform files, which I forgot what they're called. Everyone's copying that stuff from the Internet, from other websites. That's cool.
Oh, you need a container that does such-and-such and sets up this environment and provides these tools. Just download this image and everything is all set up for you. One area where I see things going is for us to have that same portability, but for development environments. For example, I did this whole talk about how to take your Go application from, I don't know, a 30-second feedback loop where you're rebuilding an image every time you make a code change and all of that, down to 1 second.
There's a lot of hacks in there that span all kinds of stuff, like should you use Go vendor, or should you keep your dependencies cached inside a Docker layer? Those kinds of things. Then I went down a bunch of those things and eventually came up with a workflow that was basically the best I could find in terms of development experience. What is the snappiest workflow? Or for example, you could have what is a workflow that makes it really easy to debug my Go app? You would use applications like Squash and that's a debugger that you can connect to a process running in a large container. Those kinds of things.
If we can prepackage those and offer those to users and not just for Go and not just for debugging, but for all kinds of development workflows, I think that would be really great. We can offer those types of experiences to people who don't necessarily have the inclination to develop those workflows themselves.
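One common version of the Go fast-loop tricks mentioned here is to compile the binary outside the container and sync only that binary into a thin runtime image. This is a sketch under assumed paths, not the exact workflow from the talk.

```python
# Build the Linux binary locally (fast, cached by the Go toolchain)...
local_resource(
    'compile-app',
    'CGO_ENABLED=0 GOOS=linux go build -o build/app ./cmd/app',
    deps=['./cmd', './pkg'],   # recompile when Go sources change
)

# ...then ship only the binary into the image. Note the container
# process still needs a restart hook to pick up the new binary; the
# extensions example later in this transcript shows one way.
docker_build(
    'example.com/app',
    './build',
    dockerfile='deploy/Dockerfile.dev',  # assumed thin runtime image
    live_update=[sync('./build/app', '/usr/local/bin/app')],
)
```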
[0:28:06.8] DC: Yeah, I agree. I mean, it is interesting. I've had a few conversations lately about the fact that the abstraction layer of coding in the way that we think about it really hasn't changed over time, right? It's the same thing. That's actually a really relevant point. It's also interesting to think about with these sorts of frameworks and this tooling, it might be interesting to think of what else we can – what else we can enable the developer to have a feedback loop on more quickly, right?
To your point, right? We talked about how these different environments, your development environment and your production environment, the general consensus is they should be as close as you can get them reasonably, so that the behavior in one should somewhat mimic the behavior in the other. At least that's the story we tell ourselves. Given that, it would also be interesting if the developer was getting feedback from effectively how the security posture of that particular cluster might affect the work that they're doing. You do actually have to define network policy. Maybe you don't necessarily have to think about it if we can provide tooling that can abstract that away, but at least you should be aware that it's happening so that you understand if it's not working correctly, this is where you might be able to see the sharp edges pop up, you know what I mean? That sort of thing.
[0:29:26.0] EK: Yeah. At the last KubeCon, where was it? In San Diego. There was this running joke. I was running around with the security crowd and there was this joke about “kubectl apply -f security.yaml”. It was in a mocking tone. I'm not disparaging their joke. It was a good joke. Then I was thinking, “What if we can make this real?” I mean, maybe it is real. I don't know. I don't do security myself. What if we can apply a comprehensive enough set of security measures, security monitoring, security scanning, all of that stuff? We prepackage it, we allow users to access all of that with one command, or even less than that. Maybe you pre-configure it as a team lead and then everyone else in your team can just use it without even knowing that it's there. Then it just lets you know, “Oh, hey. This thing you just did, this is a potential security issue that you should know about.” Yeah, I think coming up with these developer shortcuts, it's my hobby.
[0:30:38.4] MG: That's cool. What you just mentioned, Ellen and Duffie, reminds me of the Spring community, the Spring framework, where a lot of the boilerplate, the security stuff, or connections, integrations, etc., is abstracted away and you just annotate your code a bit, and then some framework, in Spring obviously it's the Spring framework, handles it. In your case, Ellen, what you were hinting at is maybe this build environment that gives me these integration hooks where I just annotate. Or even those annotations could be enforced. Standards could be enforced if I don't annotate at all, right? I could maybe override them.
Then this build environment would just pick it up, because it scans the code, right? It has the source code access, so I could just scan it and hook into it and then apply security policies, lock it down, see ports being used, maybe just open them up to the application, the other ones will automatically get blocked, etc., etc. It just came to my mind. I have not done any research there, or whether there's already some place or activity.
[0:31:42.2] EK: Yeah. Because I won't shut up about this stuff, because I just love it: we are doing a thing at Tilt, it's in a very early stage right now, that we're calling extensions. Very creative name, I suppose. It's basically Go imports, but for Tiltfiles. It's still at a very early stage. We still have some road ahead of us. For example, we have, let's say, this one user and they did a very special integration of Helm and Tilt. You don't have to use Helm by hand anymore. You can just make all of your Helm stuff happen automatically when you're using Tilt.
Usually, you would have to copy, I don't know, a hundred lines of code from your Tilt config file and pass that around for other developers to be able to use it. Now we have this thing that is basically Go imports, where you can just say load an extension and give it a name, it fetches it from a repository and it's running. I think that is basically an early stage of what you just described with Spring, but more geared towards, let's say, infra and Kubernetes: how do you tie infra and Kubernetes, that stuff, with higher-level functionality that you might want to use?
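The extension mechanism being described works roughly as below. restart_process is a real public Tilt extension, but treat the exact names as illustrative; it also addresses the process-restart caveat flagged in the earlier Go example.

```python
# Fetch a shared extension from the Tilt extensions repository and
# use it like a local function.
load('ext://restart_process', 'docker_build_with_restart')

# Like docker_build, but restarts the container's process after each
# live_update sync so the new binary actually gets picked up.
docker_build_with_restart(
    'example.com/app',
    './build',
    entrypoint='/usr/local/bin/app',
    live_update=[sync('./build/app', '/usr/local/bin/app')],
)
```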
[0:33:07.5] MG: Cool. I have another one. Speaking of which, are there any other integrations for IDEs with Tilt? Because I know that VS Code, for example, has Kubernetes integrations, and there's the fabric8 Maven plugin, which handles some stuff under the covers.
[0:33:24.3] EK: Usually, Tilt watches your code files and it doesn't care which IDE you use. It has its own dashboard, which is just a page that you open in your browser. I just heard this week, someone mentioned on Slack that they wrote an extension for Tilt. I'm not sure if it was for VS Code or the other VS Code-like .NET editors. I don't remember what they're called, but they have a family of those.
I just heard that someone wrote one of those and they shared the repo. We have someone looking into that. I haven't seen it myself. The idea has come up when I was working at Garden, which is in the same area as Tilt. I think it's pertinent. We also had the idea of a VS code extension. I think the question is what do you put in the extension? What do you make the VS code extension do? Because both Tilt and Garden. They have their own web dashboards that show users what should be shown and in the manner that we think should be shown.
If you're going to create a VS code extension, you either replicate that completely and you basically take this stuff that was in the browser and put it in the IDE. I don't particularly see much benefit in that. If enough people ask, maybe we'll do it, but it's not something that I find particularly useful.
Either you do that and you replicate the functionality, or you come up with new functionality. In both cases, I just don't see a very strong point as to what different and IDE-specific functionality should you want.
[0:35:09.0] MG: Yes. The reason why I was asking is that we see all these Pulumi and CDK tools, the AWS CDKs, coming up, where you basically use a programming language to write your application and application infrastructure code in your IDE, and then all the templating, that YAML stuff, etc., gets generated under the covers.
Just this week, AWS announced the CDK for Kubernetes. I was thinking, with this happening, some of these providers abstract the scaffolding as well, including the build. You don't even have to build, because it's abstracted away under the covers. I was seeing this trend. Then obviously, we still have Helm and the templating and Kustomize, and then you still have the manual way, as I mentioned in the beginning.
I do like the IDE integration, because that's where I spend most of my time. Whenever I have to leave the IDE, it's a context switch that I have to go through, even if it's just opening another file that I need to edit somewhere. That's why I think having IDE integration is useful for developers, because that's where they spend most of their time. As you said, there might be reasons not to do it in an IDE, because it's just replicating functionality that might not be useful there.
[0:36:29.8] EK: Yeah. In the case of Tilt, all the config is written in Starlark, which is a language and it's basically Python. If your IDE can syntax highlight Python, it can syntax highlight the Tilt config files. About Pulumi and that stuff, I'm not that familiar. It's stuff that I know how it works, but I haven't used it myself. I'm not familiar with the browse and the IDE integration side of it.
The thing about tools like Tilt is that usually, if you set it up right, you can just write your code all day and you don't have to look at the tool. You just switch from your IDE to let's say, your browser where your app is running, so you get feedback and that kind of thing. Once you configure it, you don't really spend much time looking at it. You're going to look at it when there are errors.
You try to refresh your application and it fails. You need to find that error. By the time that happened, you already lost focus from your code anyway. Whether you're going to look for your error on a terminal, or on the Tilt dashboard, that's not much an issue.
[0:37:37.7] MG: That's right. That’s right. I agree.
[0:37:39.8] CC: All this talk about tooling and IDEs is making me think to ask you, Ellen: if I'm a developer and let's say my company decides that we're going to use Kubernetes. What we are advocating here with this episode is to think about, well, if you're going to be deploying to Kubernetes in production, you should consider running Kubernetes as a local development environment.
Now for those developers who don't even – haven't even worked with Kubernetes, where do you suggest they jump in? Should they get a handle on – because it's so many things. I mean, Kubernetes already is so big and there are so many toolings around how to operate Kubernetes itself. For a developer who says, “Okay, I like this idea of having my own local Kubernetes environment, or a development environment somehow, maybe also in the cloud,” should they start with a tooling like Tilt, or something similar? Would that make it easier for them to wrap their head around Kubernetes and what Kubernetes does? Or should they first get a handle on Kubernetes and then look at a tool like this?
[0:38:56.2] EK: Okay. There are a few sides to this question. If you have a very large team, ideally you should get one or a few people to actually really learn Kubernetes and then make it so that everyone else doesn't have to. Something we have seen is a very large company, they are going to do Kubernetes in development, they set up a developer experience team, and then, for example, they have their own wrapper around Kubectl, and basically they automate a bunch of stuff so that everyone in the team doesn't have to take a Certified Kubernetes Application Developer certificate. Because for people who don't know that certificate, it's basically: how much Kubectl can you do off the top of your head? That is basically what that certificate is about, because Kubectl is an insanely huge and powerful tool.
On the one hand, you should do that. If you have a big team, take a few people, learn all that you can about Kubernetes, write some wrappers so that people don't have to do Kubectl something-something by hand. Just make very easy functions: let's say the name of your wrapper, then 'context' and a name, and that's going to switch you to a namespace, let's say, where some version of your app is running, that kind of thing.
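A wrapper of the kind described here can be very small. Below is a hedged Python sketch; the command name and the dev-<name> namespace convention are invented for illustration.

```python
#!/usr/bin/env python3
"""kdev: one memorable command per common kubectl operation, so
developers don't need kubectl fluency. Hypothetical example."""
import subprocess
import sys

def use_namespace(name: str) -> None:
    # Point the current kubeconfig context at a developer namespace.
    subprocess.run(
        ['kubectl', 'config', 'set-context', '--current',
         '--namespace', 'dev-' + name],
        check=True,
    )

if __name__ == '__main__':
    use_namespace(sys.argv[1])  # e.g. `kdev ellen` -> namespace dev-ellen
```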
Now about the tooling. Once you have your development environment set up, and you're going to need someone who has some experience with Kubernetes to set that up in the first place, but once that is set up, if you have the right tooling, you don't really have to know everything that Kubernetes does. You should have at least a conceptual overview. I can tell you for sure that there are hundreds of developers out there writing code that is going to be deployed to Kubernetes, writing code that, whenever they make a change, goes to a Kubernetes development cluster, and they don't have the first – well, I'm not going to say the first clue, but they are not experienced Kubernetes users. That's because of all the tooling that you can put around it.
[0:41:10.5] CC: Yeah, that makes sense.
[0:41:12.2] EK: Yeah. You can abstract a bunch of stuff away with basically good sense, so that you know the common operations that need to be done for your team and then you just abstract them away, so that people don't have to become Kubectl experts. On the other side, you can also abstract a bunch of stuff away with tooling. Basically, as long as your developers have a basic grasp of containers and the basics of Kubernetes, they don't need to know how to operate it with any depth.
[0:41:44.0] MG: Hey Ellen, in the beginning you said that it's all about this feedback loop and iterating fast. Part of a feedback loop for a developer is unit testing, integration testing, all sorts of testing. How do you see that changing, or benefiting from tools like Tilt, especially when it comes to integration testing? Unit tests usually run locally, but integration testing is different.
[0:42:05.8] EK: One thing that people can do when they're using Tilt is, once you have Tilt running, you basically have all of your application running. You can just set up one-off tasks with Tilt. You could basically set up a script that does a bunch of stuff, which would basically be what your test does. If it returns zero, it succeeded. If it doesn't, it failed. You can set something up like that. It's not something that we have right now in a prepackaged form that you can use right away.
You would basically just say, “Hey Tilt, run this thing for me,” and then you would see if it worked or not. I have to make a plug for the competition right now. Garden has more of that side of things set up. They have tests as a separate primitive, right next to building and deploying, which is what you usually see. They also have testing. It does basically what I just said about Tilt, but they have a special little framework around it.
With Garden, you would say, “Oh, here's a test. Here's how you run the test. Here's what the test depends on, etc.” Then it runs it and it tells you if it failed or not. With Tilt, it would be a more generic approach where you would just say, “Hey Tilt, run this and tell me if it fails or not,” but without the little wrapping around it that's specific for testing.
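The one-off task approach to tests sketched here maps naturally onto Tilt's local_resource. A minimal sketch with an assumed Go test target:

```python
# Surface the integration suite as a Tilt resource that runs on
# demand from the dashboard rather than on every file save; a
# non-zero exit code marks the resource as failed.
local_resource(
    'integration-tests',
    'go test ./test/integration/...',
    auto_init=False,                    # don't run on `tilt up`
    trigger_mode=TRIGGER_MODE_MANUAL,   # run when triggered in the UI
)
```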
When it comes to how things work when you're trying to push to production: let's say you did a bunch of stuff locally, you're happy with it, now it's time to push to production. Then there's all that headache with CI and waiting for tests to run and flaky tests and all of that, which, I don't know, is a big open question that everyone's unhappy about and no one really knows which way to run.
[0:43:57.5] DC: That’s awesome. Where do you see this space going in the future? I mean, as you look at the tooling that’s out there, maybe not specifically to the Tilt particular service or capability, but where do you see some other people exploring that space? We were talking about AWS dropping and CDK and there are different people trying to solve the YAML problem, but more from the developer user experience tooling way, where do you see that space going?
[0:44:23.9] EK: For me, it's all about higher-level abstractions and well-defined best practices. Right now, everyone is fumbling around in the dark, not knowing what to do, trying to figure out what works and what doesn't. The main thing that I see changing is that, given enough time, best practices are going to emerge and it's going to be clear for everyone: if you're doing this thing, you should use this workflow; if you're doing that thing, you should use that workflow. Basically, it's what happened when IDEs emerged and became a thing. That is the best-practices side of it.
[0:44:57.1] DC: That's a great example.
[0:44:58.4] EK: Yeah. What I see, in terms of things being offered tomorrow, is prepackaged higher-level abstractions. I don't think every developer should know how to deal with Kubernetes at a deeper level, the same way I don't know how to build the Linux kernel even though I use Linux every day. I think things should be wrapped up in a way that developers can focus on what matters to them, which is right now basically writing code.
Developers should be able to get to the office in the morning, open up their computer, start writing code, or doing whatever else they want to do and not worry about Kubernetes, not worry about lambda, not worry about how is this getting built and how is this getting deployed and how is this getting tested, what's the underlying mechanism. I'd love for higher-level patterns of those to emerge and be very easy to use for everyone.
[0:45:53.3] CC: Yeah, it's going to be very interesting. I think best practices are such an interesting thing to think about, because somebody could sit down and write, “Oh, these are the best practices we should be following in this space.” In my opinion, it's really going to come out of what worked historically, when we have enough data to look at over the years.
I think it's going to be, as far as tooling goes, a survival of the fittest. Whatever tool has been used the most, that's what's going to be the best-practice way to do things. Yeah, we know today there are so many tools, but I think probably we're going to get to a point where we know what to use for what in the future. With that, we have to wrap up, because we are at the top of the hour. It was so great to have Ellen, or L, as I think they prefer to be called, on the show. Thank you so much, Elle. I mean, L. See, I can't even follow my own.
You're very active on Twitter. We're going to have all the information on how to reach you in the show notes. We're going to have a transcript. As always, people, subscribe, follow us on Twitter so you can be up to date with what we are doing, and suggest episodes too on our GitHub repo. With that, thank you everybody. Thank you, L.
[0:47:23.1] DC: Thank you, everybody.
[0:47:23.3] CC: Thank you, Michael and –
[0:47:24.3] MG: Thank you.
[0:47:24.8] CC: - thank you, Duffie.
[0:47:26.2] EK: Thank you. It was great.
[0:47:26.8] MG: Until next time.
[0:47:27.0] CC: Until next week.
[0:47:27.7] MG: Bye-bye.
[0:47:28.5] EK: Bye.
[0:47:28.6] CC: It really was.
[END OF EPISODE]
[0:47:31.0] ANNOUNCER: Thank you for listening to the Podlets Cloud Native Podcast. Find us on Twitter @thepodlets and on thepodlets.io website. That is ThePodlets, altogether, where you will find transcripts and show notes. We’ll be back next week. Stay tuned by subscribing.
[END]
-
Running Kubernetes on conventional operating systems is time-consuming and labor-intensive. Today’s guests, Andrew Rynhard and Timothy Gerla, have engineered a product that attempts to provide a solution to this problem. They call it Talos, and it is a modern OS designed specifically to host Kubernetes clusters, managed by a flexible and powerful API. Talos is completely stripped down to the bare components required to run Kubernetes and get information from the system. It stays updated by keeping pace with Kubernetes, but also provides the user with a large degree of control in the event that they might need to update a flag. In this episode, Andrew and Timothy get into some of the mechanics and thought processes behind Talos, telling us why they went with a read-only API, how they handle security concerns on the OS, and how a system like theirs might get adopted by the Kubernetes community and laypeople more broadly. They get into the advantages provided by a stripped-down solution for systematizing the use of Kubernetes across communities and running new components through clusters rather than on the OS itself. In a space where most participants are largely operating in the dark, it is a pleasure to see innovations like this display such staying power, so make sure you check out this episode.
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Guests:
Andrew Rynhard https://twitter.com/andrewrynhard
Tim Gerla https://twitter.com/tybstar
Hosts:
Carlisia Campos
Bryan Liles
Olive Power
Key Points From This Episode:
What a Kubernetes OS is: a stripped-down OS that integrates with Kubernetes.
The difficulties of managing and getting Kubernetes installed on regular OSs.
Why a Kubernetes OS? Less attack surface and OS compatibility issues.
What Talos does: quickly makes nodes part of a Kubernetes cluster by being stripped down.
How replacing SSH with an API alleviates some users’ security concerns.
A command-line interface called OSCTL that allows users to explore the API.
What does ‘stripped-down’ mean? Talos runs kubelets and gets information from the OS.
The ability to run new components through clusters rather than from the OS.
How the Kubernetes OS evolves with Kubernetes but gets separately controlled too.
Better integrating into Kubernetes by abstracting OS features into Kubernetes as operators.
Security precautions: kernel hardening, SSH and Bash removal, and a read-only OS.
Usability of Talos for the average Joe, and its consistency across base platforms.
Possibilities for interacting with deeper levels of an OS through an API-managed OS.
How Talos might become appealing to laypeople: decreasing costs for porting to it.
Value gained from switching to a purpose-built OS as something which could outweigh costs.
Tendencies to hang onto tried and trusted tech even if its successors are superior.
Quotes:
“To me, it’s just about abstracting away the operating system and not even having to worry about it anymore, and looking at Kubernetes and the entire cluster as an operating system.” — Andrew Rynhard [0:05:00]
“As rapid as the technology is changing, you need an operating system that is going to evolve with it or at least the operations intelligence to evolve with Kubernetes right alongside it.” — Andrew Rynhard [0:13:08]
“The challenge I think for us and for anybody changing the way that operating systems work is: is it better enough than what I have today or what I had before?” — @tybstar [0:26:50]
“There’s a lot of companies out there who got us to this point in tech that don’t exist anymore, but if they didn’t do what they did, we would not be here right now.” — @bryanl [0:33:41]
Links Mentioned in Today’s Episode:
Talos — https://www.talos-systems.com/
Timothy Gerla — http://www.gerla.net/
Timothy Gerla on Twitter — https://twitter.com/tybstar
Andrew Rynhard on LinkedIn — https://www.linkedin.com/in/andrewrynhard/
Andrew Rynhard on GitHub — https://github.com/andrewrynhard
Jed Salazar on LinkedIn — https://www.linkedin.com/in/jedsalazar/
Bryan Liles on LinkedIn — https://www.linkedin.com/in/bryanliles/
Carlisia Campos on LinkedIn — https://www.linkedin.com/in/carlisia/
Red Hat — https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux
Arch — https://www.archlinux.org/
Debian — https://www.debian.org/
Linux — https://www.linux.org/
Bell Labs — http://www.bell-labs.com/
AT&T — https://www.att.com/
Transcript:
EPISODE 20
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your Cloud Native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[INTERVIEW]
[00:00:41] CC: Hi, everybody. Welcome back to The Podlets. Today we have a special episode, and on the show, we have a special guest, Andrew Rynhard. Say hi, Andrew.
[00:00:53] AR: Hello, how are you?
[00:00:55] CC: We also have Timothy Gerla. Say hi, Tim.
[00:00:58] TG: Hi. Thanks for having me.
[00:01:00] CC: Yeah. Andrew and Timothy are from Talos. Andrew dropped an issue on our GitHub repo and here we are. It was a great suggestion. What we’re going to talk about today is what they are working on, which is a Kubernetes operating system. We have tons of questions for them for sure. We also have a special participant on the episode today as a co-host, Jed Salazar. Hi, Jed.
[00:01:28] JS: Hey, everyone. Jed Salazar here, from the CRE team at VMware.
[00:01:31] CC: And Bryan Liles.
[00:01:32] BL: Hi.
[00:01:33] CC: Hi. And me, Carlisia Campos. Who’d like to get the party started and kick this off?
[00:01:41] BL: Oh, I’m here. Let’s throw the gauntlet down. We’re talking about Kubernetes operating systems today. I have an operating system, a Mac, or I have Linux. I can run Kubernetes. What is a Kubernetes operating system and why should I even be thinking about this?
[00:01:58] AR: Sure. I’d like to think of a Kubernetes operating system as an operating system that has been stripped down to the absolute bare minimum needed to run Kubernetes. Everything that is required to run the kubelet, and essentially that’s it, at least in my opinion. It should be super minimal to start with.
Second of all, I also think that it should integrate with Kubernetes as well. The combination of just being able to strip down Linux as we know it as small as possible, and then actually integrating with Kubernetes itself, using APIs to figure out things about itself. That, in my opinion, is what I would call a Kubernetes OS.
[00:02:42] BL: Interesting. Okay. Now that we know a little bit about Kubernetes operating systems – and like I said, I’m starting in early today as the devil’s advocate. Like I said before, I have a Mac and I have Linux and I have Windows on my desktop. There’ve been lots of efforts from lots of people trying to get Kubernetes running on Ubuntu or Fedora, and it’s cool that you’re trying to slim this down, but really, why would I look at a Kubernetes operating system over the Linux that I’m familiar with? I like Ubuntu and Debian.
[00:03:17] AR: Sure. That’s a great question. It’s one we get a lot. I like to think that you actually just get less operational overhead when you actually have a Kubernetes-specific operating system. I think that Kubernetes itself is a job, managing it, getting it installed, unfortunately. It’s getting better, but it’s still a job at the end of the day.
Having to manage Kubernetes and the operating system – everything that you need to pass compliance on the operating system, get all the packages installed – these are all things that we kind of know Kubernetes needs already, and yet we’re still having to go in and apt install whatever we might need to get Kubernetes up and running.
The idea with a Kubernetes operating system in my mind is that we should stop worrying about the individual node, the underlying operating system and start looking at Kubernetes as a whole as a giant machine and we just add machines, nodes to this giant machine that give us extra resources. The less that we have to care about the machine or the underlying operating system, the better, in my mind. We get to focus on Kubernetes.
Not only that, but because it’s minimal, you get a smaller attack surface. There’re just not things there that you would otherwise have to worry about. I’ve done Kubernetes for three years now, and having to go in and worry about updating packages that are completely unrelated is something that I think we shouldn’t have to do anymore. If you’re dedicated to running your apps and your stack in Kubernetes, then why are we going in and managing the nodes on an individual basis? For that matter, managing things that don’t really have any relevance to running Kubernetes?
To me, it’s just about abstracting away the operating system and not even having to worry about it anymore, and looking at Kubernetes and the entire cluster as an operating system. We can’t really get there if we’re having to worry about the two jobs of managing both at the same time.
[00:05:17] JS: Andrew, can I ask a follow up question?
[00:05:18] AR: Sure.
[00:05:20] JS: I fully agree with all of those statements. I think a general purpose operating system might not be the best fit for a specific role, like being a Kubernetes node. As you mentioned, you have to deal with kind of all the various packages that might be beneficial to you if you’re running it for some general purpose. It’s really supposed to be running a workload as a Kubernetes node, so you can kind of scope that down.
I’m just wondering when you kind of make this pitch or kind of let these folks know, how do you get folks to kind of relinquish their desire to have full control over their operating system from being able to install their own security management processes on it or being a little bit shy about not being able to SSH or kind of use their common patterns of operating system management?
[00:06:09] AR: Oh, that’s a great question. I think the biggest thing that I always answer back is – I can take this in two parts. Let me first of all talk about what – People, they do want to run things on the host. My answer always back is can you run it in Kubernetes? Kubernetes is sort of your package manager, if you will.
They sit back usually and they’re like, “Hmm. Yeah, I probably could.” If you need to run something on every host, Kubernetes has something for that. It’s a daemon set. Run it on Kubernetes and call it a day. This isn’t something that’s going to work for absolutely everything I imagine. Nothing in the world is like that. But I think for the majority of the use cases out there and for the things that people want to run on the host, you could actually just run it in Kubernetes itself.
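For a concrete picture of the daemon set pattern Andrew describes, here is a minimal sketch using client-go; the agent name and image below are hypothetical stand-ins, and this is an illustration of the Kubernetes API rather than anything Talos-specific.

package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig from its default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	labels := map[string]string{"app": "node-agent"}

	// A DaemonSet schedules one copy of this pod onto every node,
	// which replaces "install this on every host" at the OS level.
	ds := &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "node-agent", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "agent",
						Image: "registry.example.com/node-agent:v1", // hypothetical image
					}},
				},
			},
		},
	}

	if _, err := client.AppsV1().DaemonSets("kube-system").Create(
		context.TODO(), ds, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}

The same object could just as easily be applied as YAML; the point is that the workload is declared to the cluster instead of installed on any individual host.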
As far as SSH – and for those that don’t really know what we’ve done in Talos – in Talos we’ve actually stripped things down to just the kernel and a small Golang binary whose whole goal is to create a Kubernetes cluster, or make a node part of a Kubernetes cluster, as fast as possible, and that’s really it. We’ve gone so far as ripping out Bash and SSH, and we’ve actually replaced that with an API.
My answer back to the SSH question is always: what is it that you’re really trying to get out of SSH? 9 times out of 10, it’s, “I want to get information about what’s wrong. I want to do troubleshooting.” If our answer back to them is, “Oh! We have an API for that” – at the end of the day, it’s really the information that you’re after. It’s not necessarily that you need SSH to do that. You need a way to get this information, and not necessarily have to sit there and wait for a Prometheus metric that gets pulled every minute. You want something right on the spot. You want to ask a question and you want to get an immediate answer. I feel like we can answer that with an API.
That tends to satisfy the desire for SSH most of the time. I mean, as you said, people are still going to want to hold on to it, but I think over time we’re going to have to educate people that this is a better way. It’s a read-only API that gives operations engineers a way to get the information that they would otherwise get by SSH-ing in and asking, via Unix utilities, for what they want to know.
[00:08:27] CC: When you say an API, are you also giving them a command line tool, like in the case of Talos, or only an actual API?
[00:08:37] TG: Yeah, we do provide a command line interface to the API. It’s called OSCTL and it basically wraps our API, and our intention is that that will be used for exploration of the system, automation through scripting languages, etc. Then as you get more sophisticated with your environment, you might begin to build your own tools that interact directly with that API.
[00:08:56] CC: Cool. Yeah, this is a really cool subject. I wasn’t even aware that a Kubernetes operating system was a thing until really recently, and I don’t remember how I came across it. One question I have is, Andrew, you were saying, “Well, we strip down Kubernetes to the bare minimum.” How opinionated is it in your case specifically? When you say it’s stripped down to the bare minimum, this statement of bare minimum – would there be a consensus in the community that, yes, this set of functionality is the bare minimum? Or is it your opinion of what the bare minimum should be?
[00:09:38] AR: Sure. I think at the absolute bare minimum, we need to run the kubelet. In my mind, that’s really all we need, but you still have this practical issue of, like you said, you need to get information off that machine. You need to be able to kind of manage Kubernetes without needing Kubernetes – it’s a chicken-and-egg problem. That’s where the API was actually born.
When I started Talos, I actually just built a very minimal, stripped-down rootfs where all it did was run the kubelet. But figuring out why the kubelet wasn’t running successfully obviously was not very easy. I figured, “You know what? Let’s put an API in front of this. I want to keep this as minimal as possible. I want to keep this read-only.” I threw an API in front of it.
I think you need two things, really. You need what’s required by the kubelet. You need a CNI. You need all the utilities that the kubelet will run, and you also need a way to query the system. In the case of other minimal operating systems, they have decided to do SSH and all the classic utilities that we all know and love; we went another route with an API. But I don’t think the operating system, the rootfs, should have any more than what’s required by the kubelet. That would be the pie-in-the-sky dream right there.
[00:11:01] CC: The two questions that come to my mind are: if I wanted to add Kubernetes components to that, would it be possible? If I wanted to add anything to the operating system, would it be possible? I think the second question you already answered, which is – correct me if I’m wrong – if you need to run something on the operating system that’s not there, you can run it in the actual cluster.
[00:11:27] AR: Yeah, that’s the idea – Kubernetes gives us the APIs to do that. We could schedule to specific nodes. We can schedule to a class of nodes. We can schedule to every single node. I think that you can actually handle a lot of the use cases out there for any kind of application with Kubernetes itself. I think that that’s really strong, because you get one single, consistent API for managing your infrastructure. I want to deploy applications for this team or that team. At the end of the day, everything is just declarative, and Kubernetes will make it happen. You don’t have to worry about the scheduling and all of these different things. The only thing that the operating system is concerned about is making that machine available to the Kubernetes cluster.
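To make those three scheduling modes concrete, here is a small sketch of the relevant knobs in a pod spec; the node name and label are hypothetical, and the every-single-node case is covered by a daemon set, as in the earlier sketch.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Schedule to one specific node by name.
	specific := corev1.PodSpec{NodeName: "worker-3"} // hypothetical node name

	// Schedule to a class of nodes: the scheduler picks any node
	// carrying this (hypothetical) label.
	class := corev1.PodSpec{
		NodeSelector: map[string]string{"workload-class": "storage"},
	}

	fmt.Println(specific.NodeName, class.NodeSelector)
}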
[00:12:10] BL: This idea of slimmed down operating systems, it’s not a new one. CoreOS was doing this years ago. One issue that CoreOS ran into was like, “Well, what’s current?” Well, it depends on what stream you’re on. How do you manage keeping everything up-to-date?
[00:12:28] AR: Our goal is to keep pace with Kubernetes, essentially. I know that, traditionally, there’s long-term support and there are all these different ways of releasing different versions of an operating system, but Kubernetes isn’t really there yet. There is no notion of LTS in Kubernetes yet that I know of. There’s just, I believe, N-2 or something like that, where they actually offer official support.
I think that the operating system is bound to that. I think that it needs to follow Kubernetes as close as possible. There’re constantly different feature gates being opened up. There’re things being graduated to GA. I think especially at this time right now, as rapid as the technology is changing, you need an operating system that is going to evolve with it or at least the operations intelligence to evolve with Kubernetes right alongside it.
[00:13:20] BL: So that brings up an interesting point. I mean, there are two things here. There’s the operating system itself and there’s Kubernetes. Do they upgrade in lockstep or are they upgraded separately?
[00:13:29] AR: I can only speak for ourselves. There are people that, I think, actually have upgrades kind of be one and the same, where the operating system and a Kubernetes upgrade both happen. We’ve decided to go the other route, where we actually want to evolve our APIs independently, but then give you a way to still manage Kubernetes on its own. We’ve actually done self-hosted Kubernetes. In Talos, we’ll actually bootstrap a lightweight control plane – a small control plane – and then we’ll spin up another control plane using the Kubernetes API.
Then Kubernetes upgrades simply look like a kubectl edit: I’m going to update my daemon set for my API server. Then from there, you will have to basically update the kubelet. We use hyperkube for the kubelet. You have to tell Talos, “Use this kubelet image next time you boot.”
We’ve separated the two, I think, for good reason. I think that the two should be able to evolve independently to give a little bit more power back to the user. If you combine them, if you couple them really closely, it becomes really, really opinionated. I think we should at least support what Kubernetes supports, and that’s the N-2, and leave it up to the user to configure Kubernetes, but we still have sane best practices out of the box.
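To illustrate the upgrade flow Andrew sketches – with a self-hosted control plane, bumping the API server is just an image change on its workload – here is a rough client-go equivalent of that kubectl edit. The namespace, object name, and image tag are assumptions for illustration, not Talos specifics.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Equivalent to "kubectl edit": bump the API server's container image.
	// The daemon set name and image tag here are hypothetical.
	patch := []byte(`{"spec":{"template":{"spec":{"containers":[` +
		`{"name":"kube-apiserver","image":"k8s.gcr.io/hyperkube:v1.17.3"}]}}}}`)

	if _, err := client.AppsV1().DaemonSets("kube-system").Patch(
		context.TODO(), "kube-apiserver", types.StrategicMergePatchType,
		patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}

The kubelet, as Andrew notes, is updated separately, by telling Talos which kubelet image to use on the next boot.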
[00:14:54] BL: Yeah, that makes sense, because yesterday, what did we get? We got Kubernetes 1.15.10, and I don’t remember the 1.16 one, but we got 1.17.3 yesterday too. You might not want to move, because – 1.17 introduced a whole bunch of deprecations for custom resource definitions. You’re not ready to move yet; we were on beta 1 for a while for CRDs. I totally see why you moved in that direction.
[00:15:20] AR: Yeah, that’s exactly it. We can’t impose too much opinion, but I think that we should drive the opinion at least up until, “Hey, don’t worry about what’s on this machine. I’m going to make it a Kubernetes node for you. Just tell me which version you want.” I think that’s where we should draw the boundary, and then we should still give the controls back to the user as far as what flags do I want to specify. What kind of feature gates? All these various things that you don’t get out of a lot of the different managed products out there. Hopefully we’ll be teetering right on the line of having that convenience of managed but still giving you that power and flexibility to update a flag if you need to.
[00:16:04] CC: This episode is so in the style of an interrogation. It’s hilarious.
[00:16:08] BL: That’s me. I’m digging in.
[00:16:09] CC: I feel like – No. We are all digging in. It’s just because – At least speaking from myself. I’m super curious. I wanted to ask you, Andrew, at the beginning you were saying that a Kubernetes operating system needs to integrate with Kubernetes and I was sitting here thinking, “Operate? It’s supposed to be Kubernetes.” What did you have in mind when you said that? Did you mean to be able to interface with another Kubernetes cluster? Was that what you meant?
[00:16:36] AR: Not quite. What I meant by that is there’s this really powerful thing that Kubernetes gives us in CRDs and this idea of operators, or controllers. You can actually have a way to use an operator or controller, say, for upgrading your operating system, which we have in Talos: an upgrade operator lives in Kubernetes, knows how to talk to Kubernetes, knows how to talk to our API, and sort of orchestrates upgrades across the board.
Part of that is, for example, when a Talos node receives the upgrade API call, it actually is aware: “Hey, I’m running Kubernetes. I’m going to cordon myself, because I know I’ve gotten this and I know that I’m not going to be able to schedule workload on me.” I think that that’s just one example, but we could probably take that a lot farther one day.
But I would like to see everything that we know and love about our operating systems today essentially be abstracted and pushed up into Kubernetes as operators. There’s a lot of power in that where you can actually orchestrate things, like I said, like upgrades. I think that that’s one example of how we can integrate better with Kubernetes as how an operating system should, at least.
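For readers wondering what “cordon myself” looks like against the Kubernetes API, here is a minimal sketch of that one step with client-go; the node name is hypothetical, and Talos’s actual upgrade operator is, of course, more involved.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Fetch the node that is about to be upgraded.
	node, err := client.CoreV1().Nodes().Get(context.TODO(), "talos-worker-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Cordon: mark the node unschedulable so no new pods land on it
	// while the OS upgrade runs. This is the field "kubectl cordon" sets.
	node.Spec.Unschedulable = true
	if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}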
[00:17:45] CC: Got you.
[00:17:46] JS: I was wondering if we can maybe just pivot a little bit – maybe to satisfy my own curiosities – but I was kind of hoping we could talk a little bit about some of the selling features. Imagine I’m a hardened sysadmin or on a security team, and someone comes up and says, “Hey, I want to run this Kubernetes operating system.” Knowing what I know about the state of security today in operating systems, there’s a lot of effort to basically contain things. No pun intended, but we have user space operating out of some type of sandbox. We have seccomp to limit syscalls. How does Talos approach security, maybe philosophically, or maybe even down to the implementation details? What does security in Talos look like?
[00:18:33] AR: Yeah. Again, our goal is basically – we want people to forget about the operating system. But to forget about the operating system, you have to know it’s secure. You have to go to great lengths to secure it; you can’t forget about it otherwise. We actually go down to the kernel: we apply what’s called the Kernel Self Protection Project. We basically try to harden the kernel, and at boot time, we do a bunch of checks to make sure that your kernel is running at least most of those configurations. I think we have a little bit of work to do as far as enforcing all of them, but we do some checks to ensure that your kernel is compatible with KSPP, for example.
That alone has a ton of benefits to it. It’s a statically compiled kernel, so it can’t do any kernel module loading and stuff like that. That’s completely prohibited. That alone just kind of cuts off a lot of security issues in itself. Then going up the stack further, we’ve actually stripped out SSH. We stripped out Bash. So you have nothing that you can really log on to anymore.
Again, that just flat out removes a whole category of potential attacks. Going even further than that, we actually have Talos running completely out of RAM, and it’s a squashfs, so it’s a read-only file system. The only thing that actually uses a disk is the kubelet. The idea is that we want to make the operating system, again, just go away. Having it read-only, I think, is a really strong thing – and squashfs in particular, because you can’t remount it read-write if you’re a user or something like that.
Then up in Kubernetes, out of the box, we try to deploy it with all of the security best practices, the CIS benchmarks and all of that. We go all the way from the kernel, to our userland, and even to Kubernetes itself. We try to bring security best practices out of the box. I think that’s something I’d love to see for Kubernetes itself upstream, but for now, that’s what we’re doing.
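As one tiny, concrete example of the kind of boot-time check Andrew describes – verifying that the root filesystem really is mounted read-only – here is a hedged sketch in Go for Linux, using the statfs syscall; it is an illustration, not Talos’s actual implementation.

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var st unix.Statfs_t
	if err := unix.Statfs("/", &st); err != nil {
		panic(err)
	}

	// ST_RDONLY is set in the mount flags when the filesystem
	// (e.g. a squashfs root) is mounted read-only.
	if st.Flags&unix.ST_RDONLY != 0 {
		fmt.Println("root filesystem is read-only")
	} else {
		fmt.Println("WARNING: root filesystem is writable")
	}
}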
[00:20:33] BL: Can we go back to the interrogation? No. Let’s not go back to an interrogation. Thinking of – if we take the concept of a Kubernetes operating system that can be updated on a different cadence than the Kubernetes running on it – who is Talos for? Could Joe as a neophyte, or someone who doesn’t really know the space – will this make their life any easier, or is there a special set of expertise that we would need to be fruitful with this?
[00:21:06] TG: I think, from our perspective, we hope that everybody who uses Kubernetes would find something useful in Talos, or a system like Talos. Number one, I think Talos would be a great way to get started on your laptop or workstation. It’s got some basic features to stand up a small Kubernetes cluster there. That’s one place to start. As you move further into production, I think that a Kubernetes-OS-based platform would be particularly useful in an environment where you might have multiple clusters spread across different geographical locations, spread across different teams, maybe spread across different hosting environments.
We’ve talked to a number of folks who have been running Kubernetes in production for a couple of years now, and these clusters kind of come up organically within a larger organization – in different areas, doing different things for the business, managed by different teams. Now that a little bit of time has passed, these organizations are realizing, “Hey, we’ve got kind of a Kubernetes sprawl problem. We have this team over here on Amazon managing and running Kubernetes one way. We have a separate team managing and running Kubernetes a different way over here on a different kind of platform.”
I think anywhere we can drive some consistency across the tooling, consistency across the base platforms, would be useful. We also think that the minimal aspect of our system, and some of the design decisions we’ve made around security, make it particularly useful in maybe a regulated environment. I think that claim would hold true for any sort of special-purpose operating system, or minimal operating system, designed for a specific task.
[00:22:35] BL: Interesting. Just thinking about a concept of a Kubernetes operating system, what’s next? I’m not asking what’s next from Talos, but given all the opportunity all the time and all the knowledge. What should we be doing that we’re not doing right now?
[00:22:49] AR: Specifically around operating systems or Kubernetes?
[00:22:52] BL: Well, you know what? You can start with operating systems. I mean, you can go to Kubernetes and then we’ll see if our lists match.
[00:22:57] AR: That’s a good question. Right off the bat, I’m going to say I don’t really know. I think this is new space. I think that we have a big task in front of us already in getting people to use these kinds of operating systems, hopefully not too big of a task. I’m hoping to see – Because you find these big companies, “Oh! We can’t do this. We can’t do that,” because getting a new OS is hard.
I think we first of all need to win people over on just these even more minimal operating systems beyond what CoreOS has done. Personally, I don’t know if I could answer that question honestly without just owing something.
[00:23:33] TG: I’ve got a thought here. One of the things that I’m really interested in, beyond just Kubernetes and beyond just the operating system, is: what is computing going to look like in 5, 10, 15 years? I don’t know if Kubernetes is going to be around. I’m kind of a tech-cynic, right? I’ve seen a lot of fads in my career, things that pop up and are very popular for a couple of years and then sort of disappear. I don’t think Kubernetes is one of those. I think Kubernetes and the concepts and the layers of abstraction that Kubernetes has provided – all of that will remain useful and powerful in the distant future, whether or not it’s called Kubernetes or it’s called something else, some new paradigm.
But what I’m really interested in is seeing what can we do with this idea of an API-managed OS? If you look at the general purpose operating systems out there, some aspects of the system might expose an API. But for the most part, you’re still interacting and interfacing with this system like you were 30 years ago, 35, 40 years ago even. That’s fine. What works works, but everything else today has an API. Kubernetes has a powerful and extensible API and I think that your operating system should have something similar, something comparable, something that you can interact with using the same tools and the same processes and the same ideas that you can at the top of the stack and move some of those concepts down to the host OS level where we’re talking about today.
[00:24:51] CC: This brings up a point that I’m so curious about – not only the idea of having a Kubernetes operating system, but any idea that is new, like what you were just talking about, Tim. So, what works works. For example, every year or every couple of years, I am evaluating a new code editor, or I am evaluating a new note-taking app, or to-do-list app – those three things. I’m continuously finding something to reevaluate, because what I have has never worked for me just the way I think. Actually, recently I found a couple of things that are really good.
In any case, the thing is they just never worked, for years. They’re very limited. They don’t match my thinking. But an operating system, I would never – well, I’m not an administrator either, but just from having my own laptops forever, I was going to say I’m not going to go out there and try a new operating system to see if what it’s offering might be better for me than what I already have. But that’s not true either, because I have done that many times too. So never mind.
But I think the gist of my question is: how are you communicating to people out there that, “Hey, there is this new thing. Maybe you think what you have is working for you, but you just don’t know that there is a new, different way of doing things”? When you do try to do that, how are people responding?
I mean, of course, there are those cases where people just get it and it immediately resonates with them. But I’m talking about the people who might benefit from this but don’t quite grasp it. How do you break through that barrier?
[00:26:38] TG: Sure. Maybe the lay majority.
[00:26:40] CC: Yeah, and how are people responding?
[00:26:42] TG: Yeah. The great thing about Talos is that people understand pretty immediately what it is, how it works and why we’ve done it. The challenge, I think, for us and for anybody changing the way that operating systems work is: is it better enough than what I have today, or what I had before? Is it worth the switching cost? I think that switching cost is something that’s pretty well understood in the industry. People have gone through this process; they’ve moved from virtualization to containers, from Docker to Kubernetes, etc. They understand that process, and they understand there’s a technical cost, there’s a people cost, etc. We have to show that value.
I think that progress in our industry is incremental. Our industry is young. We’re not building bridges. We’re not at the level of like the internal combustion engine where the engineering is understood and we know how to build it and we know how to make it so that it doesn’t fall over and explode. Clearly, we’re not quite there yet in the broader world of computing.
I think anywhere where we can show a little bit of incremental improvement where we can tackle one narrow slice of a problem and make it a little bit better and get to a point where computing is just a little bit safer and a little bit easier and a little bit faster. I think that’ll be a pretty compelling argument and there’s a lot of details involved and we have to talk about how do you get your applications from one operating system to the next?
15 years ago, it may have been a very big ask to ask someone to port their enterprise application from one operating system to another. They were so inextricably linked; there are a lot of connections between the OS and the applications. But today, we have these levels of abstraction. We have containers. We have the Kubernetes orchestration mechanisms. That switching cost is going down with every release of Kubernetes, and every step along the way, as people change the way that applications are deployed, that switching cost gets a little bit cheaper. It will be easier for us to prove that the value you gain by moving to a purpose-built operating system is greater than the switching cost.
[00:28:41] CC: Very good points.
[00:28:42] JS: I feel like there’s a lot of emphasis and focus on the move over – the first steps toward migrating to something new. There’s a lot of emphasis on bootstrapping a cluster. There’s a lot of emphasis on how do I get started. I’m part of a team called customer reliability engineering, and we see operators running Kubernetes environments that are durable and have been in the field for many years. I think that there’s kind of a hidden cost in these day-two operations where, today, to effectively be a Kubernetes operator, you need to also have a great deal of understanding of Linux operating system internals. These are abstractions on top, but sometimes those abstractions are leaky. So you need to be able to parse iptables rules. You need to be able to understand how traffic gets routed – all of these aspects of it.
I’m just wondering how we get folks shifted from this mindset of: I’m going to start with something that’s general purpose, and then I’m going to make it do what I want it to do by making all of these configuration changes and installing things on top of it, until it’s no longer general purpose but focused on something specific. How do we get people to step back and think more fundamentally, “Well, what if we just started with something that is strictly for running workloads?” We don’t have to worry about installing a security suite on top of it, or making this configuration change, or hardening requirements, or what have you. We’re fundamentally in a better place, because we’ve started with something that’s arguably more secure.
[00:30:21] BL: You know – I mean, I’m old. I’m old now. I’m realizing this. Back in my day, when we started with Linux, we went through this whole thing of Linux installers. There were many iterations of Linux installers, and it depended on, “Well, did you like what Red Hat was doing? Did you like what the Debian project was doing? Oh! Did you like what Arch was doing? Oh! Did you want to do it yourself? Did you want to emerge the world with Gentoo?”
Really, we’ve come to this point now where no one ever talks about Linux installers anymore. You just put it on there. I think what I’m getting at is that we don’t actually know what we want. I mean, we say that we want it to be simpler. We say we want it to be more secure, but we don’t know. Only time will tell, and I think it’s going to be a lot of chipping away at problems. Then the people who want to have the bold ideas say, “I’m going to go out there and create a Kubernetes operating system.” In reality, it may work – we hope it works – or it may not work, but at least we gained a little tiny bit more knowledge on how we want to run this thing.
I think – and I’ll just say one more last thing – if you look at Bell Labs: Bell Labs created the vacuum tube, and then 20 or 30 years later, they created the transistor – twice, actually. It took a long time to get the vacuum tube out of use, because it kind of just worked, and they just said, “We can’t throw it back. We just can’t throw that away.”
Maybe we’re seeing a lot of that in Kubernetes. We’re holding on to some good things even though some greater things are going to come, but it might not be here this year or next year. It might be 18 months. It might be 24 months. We just got to really pay attention to that.
[00:32:02] AR: Bryan, when you said you were old, I was going to shake my head internally, and then you brought up the vacuum tube and I’m like, “Okay.”
[00:32:07] BL: I mean, I’m not that old.
[00:32:09] JS: Yeah, I think that’s a good point, Bryan. The thing I like to point out is the allegory of the cave. People have been living a certain way for so long, they think that these shadows are real, and they just know that way of life until some crazy person comes along and says, “Hey, there’s a whole world out there,” and no one believes him.
I think we just need to do – Like you said, we just need to do it. When you just create and make it happen and hopefully educate people in the process and just keep chipping away at it. Do the good work.
[00:32:38] BL: That’s the important piece, and that was the power of Bell Labs. You probably can tell – I just read a book about Bell Labs. I’m an expert now. But that was the power of Bell Labs. They didn’t just focus on making products for AT&T. They focused on changing the world, literally. Who creates a transistor before anyone even knows what one is? You just don’t create that. That’s some really crazy stuff.
I try to bring the parallel back to what we’re doing here. We can’t just create this perfect Kubernetes thing, because really, we don’t know what it is. I mean, we can be smart and say, “Well, it needs to be secure. It needs networking,” and all this stuff. But you know what? We don’t even have cgroups v2 support yet. We don’t even know where we are. Let’s just keep going down the path, and we will suss out these better patterns.
[00:33:23] CC: Yeah, I like that.
[00:33:24] BL: That’s it. It is incremental. Here’s the crazy part, and this is the real tough part: it is incremental, and reality says that not everybody can win. Don’t take your failures as a loss. Take them as, “Well, maybe we shouldn’t have done that,” and keep on moving forward, because there are a lot of companies out there who got us to this point in tech that don’t exist anymore – but if they didn’t do what they did, we would not be here right now. It’s not [inaudible 00:33:52].
[00:33:53] CC: Why are we talking about failures?
[00:33:55] BL: I’m sorry. It’s the ultimate success.
[00:34:01] CC: Oh gosh! Let’s not end the show in such a downer.
[00:34:04] BL: No. That’s a happy point, though. Let me put the bow on the happy point and then I will stop talking. The thing is, it’s not that the glass is half empty; it’s that the glass is half full. The path to success is littered with failure, and it’s not a bad thing. It’s a good thing, because it’s good that we can continue making those failures, because we know they lead to successes. That is actually a happy thing.
[00:34:29] CC: I wonder if Andrew and Tim want to do a little bit of interrogating of us. I think that would be fair.
[00:34:36] AR: I wouldn’t know what to interrogate you guys about.
[00:34:40] CC: Well, we are coming up at the top of the hour. So it’s time to say goodbye. It was great having you, Andrew, and you, Tim, on the call. Jed, thank you for participating as well. I think it was very informative. With that, I will say, until next week. Bye everybody.
[00:34:59] TG: Bye. Thanks for having us.
[00:35:00] CC: My pleasure.
[00:35:00] AR: Bye-bye. Thank you.
[00:35:02] JS: Thank you. Bye.
[END OF INTERVIEW]
[0:35:05.3] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
Today on the show we are very lucky to be joined by Chris Umbel and Shaun Anderson from Pivotal to talk about app transformation and modernization! Our guests help companies to update their systems and move into more up-to-date setups through the Swift methodology, and our conversation focuses on this journey from legacy code to a more manageable solution. We lay the groundwork for the conversation, defining a few of the key terms and concerns that arise for typical clients, and then Shaun and Chris share a bit about their approach to moving things forward. From there, we move into the Swift methodology and how it plays out on a project, before considering the benefits of further modernization that can occur after the initial project. Chris and Shaun share their thoughts on measuring success, the advantages of their system, and how to avoid rolling back towards legacy code. For all this and more, join us on The Podlets Podcast, today!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Josh Rosso
Duffie Cooley
Olive Power
Key Points From This Episode:
A quick introduction to our two guests and their roles at Pivotal.
Differentiating between application modernization and application transformation.
Defining legacy and the important characteristics of technical debt and pain.
The two-pronged approach at Pivotal; focusing on apps and the platform.
The process of helping companies through app transformation and what it looks like.
Overlap between the Java and .NET worlds; lessons to be applied to both.
Breaking down the Swift methodology and how it is being used in app transformation.
Incremental releases and slow modernization to avoid rollback to legacy systems.
The advantages that the Swift methodology offers a new team.
Possibilities of further modernization and transformation after a successful implementation.
Measuring success in modernization projects in an organization using initial objectives.
Quotes:
“App transformation, to me, is the bucket of things that you need to do to move your product down the line.” — Shaun Anderson [0:04:54]
“The pioneering teams set a lot of the guidelines for how the following teams can be doing their modernization work and it just keeps rolling down the track that way.” — Shaun Anderson [0:17:26]
“Swift is a series of exercises that we use to go from a business problem into what we call a notional architecture for an application.” — Chris Umbel [0:24:16]
“I think what's interesting about a lot of large organizations is that they've been so used to doing big bang releases in general. This goes from software to even process changes in their organizations.” — Chris Umbel [0:30:58]
Links Mentioned in Today’s Episode:
Chris Umbel — https://github.com/chrisumbel
Shaun Anderson — https://www.crunchbase.com/person/shaun-anderson
Pivotal — https://pivotal.io/
VMware — https://www.vmware.com/
Michael Feathers — https://michaelfeathers.silvrback.com/
Steeltoe — https://steeltoe.io/
Alberto Brandolini — https://leanpub.com/u/ziobrando
Swiftbird — https://www.swiftbird.us/
EventStorming — https://www.eventstorming.com/book/
Stephen Hawking — http://www.hawking.org.uk/
Istio — https://istio.io/
Stateful and Stateless Workload Episode — https://thepodlets.io/episodes/009-stateful-and-stateless/
Pivotal Presentation on Application Transformation: https://content.pivotal.io/slides/application-transformation-workshop
Transcript:
EPISODE 19
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.0] CC: Hi, everybody. Welcome back to The Podlets. Today, we have an exciting show. It's myself on, Carlisia Campos. We have our usual guest hosts, Duffie Cooley, Olive Power and Josh Rosso. We also have two special guests, Chris Umbel. Did I say that right, Chris?
[0:01:03.3] CU: Close enough.
[0:01:03.9] CC: I should have checked before.
[0:01:05.7] CU: Umbel is good.
[0:01:07.1] CC: Umbel. Yeah. I’m not even a native English speaker, so you have to bear with me. Shaun Anderson. Hi.
[0:01:15.6] SA: You said my name perfectly. Thank you.
[0:01:18.5] CC: Yours is more standard American. Let’s see, the topic of today is application modernization. Oh, I just found a word I cannot pronounce. That goes on my non-pronounceable words list. It’s also known as application transformation – are those two terms correctly used interchangeably? The experts in the house should say something.
[0:01:43.8] CU: Yeah. I don't know that I would necessarily say that they're interchangeable. They're used interchangeably, I think by the general population though.
[0:01:53.0] CC: Okay. We’re going to definitely dig into that – how does it not make sense to use them interchangeably? Because just by the meaning, I would think so, but I’m also not in that world day-to-day the way Shaun and Chris are. By the way, please give us a brief introduction, the two of you. Why don’t you go first, Chris?
[0:02:14.1] CU: Sure. I am Chris Umbel. I believe it was probably actually pronounced Umbel in Germany, but I go with Umbel. My title this week is the – I think .NET App Transformation Journey Lead. Even though I focus on .NET modernization, it doesn't end there. Touch a little bit of everything with Pivotal.
[0:02:34.2] SA: I'm Shaun Anderson and I share the same title of the week as Chris, except for where you say .NET, I would say Java. In general, we play the same role and have slightly different focuses, but there's a lot of overlap.
[0:02:48.5] CU: We get along, despite the .NET and Java thing.
[0:02:50.9] SA: Usually.
[0:02:51.8] CC: You both are coming from Pivotal, yeah? As most people should know – but I’m sure now everybody knows – Pivotal was just recently, as of this date (what are we, end of January? This episode is going to take a while to release), acquired by VMware. Here we are.
[0:03:10.2] SA: It's good to be here.
[0:03:11.4] CC: All right. Somebody – one of you, maybe let’s say Chris, because you brought this up – how does application modernization differ from application transformation? Because I think we need to lay the ground and lay out the definitions before we can go off and talk about things and sound like experts, and make sure that everybody can follow us.
[0:03:33.9] CU: Sure. I think you might even get different definitions, even from within our own practice. I’ll at least lay it out as I see it. I think it’s probably consistent with how Shaun’s going to see it as well, but it’s what we tell customers anyway. At the end of the day, app transformation is the larger [inaudible] bucket. That’s going to include, say, just the re-hosting of applications: taking applications from point A to some new point B, without necessarily improving the state of the application itself.
We'd say that that's not necessarily an exercise in paying down technical debt, it's just making some change to an application or its environment. Then on the modernization side, that's when things start to get potentially a little more architectural. That's when the focus becomes paying down technical debt and really improving the application itself, usually from an architectural point of view and things start to look maybe a little bit more like rewrites at that point.
[0:04:31.8] DC: Would you say that the transformation is more in line with re-platforming, for those that might think about it that way?
[0:04:36.8] CU: We'd say that app transformation might include re-platforming and also the modernization. What do you think of that, Shaun?
[0:04:43.0] SA: I would say transformation is not just the re-platforming, re-hosting and modernization, but also the practice of figuring out which should happen as well. There’s a little bit more meta in there. Typically, app transformation to me is the bucket of things that you need to do to move your product down the line.
[0:05:04.2] CC: Very cool. I have two questions before we start really digging into the show, still to lay the ground for everyone. My next question will be: are we talking about modernizing and transforming apps so they go to the cloud? Or is there a certain cut-off where we start thinking, “Oh, things need to get done differently for them to be called native”? Is there a differentiation, or is one the same as the other – will the process be the same either way?
[0:05:38.6] CU: Yeah, there’s definitely a distinction. With that re-platforming bucket, that re-hosting bucket of things, there’s your target state. At least for us coming out of Pivotal, we definitely had a product focus, where we’re probably only going to be doing work if it intersects with our product, right?
We’re going to be doing re-platforming targeted, say, typically at a cloud environment, usually Cloud Foundry or something to that effect. Then modernization – while we’re usually doing that with customers who have been running our platform, there’s nothing to say that you necessarily need a cloud, or any cloud, to do modernization. We tend to, based on who we work for, but you could say that those disciplines and practices really are agnostic to where things run.
[0:06:26.7] CC: Sorry, I was muted. I wanted to ask Shaun if you wanted to add to that. Do you have the same view?
[0:06:33.1] SA: Yeah, I have the same view. I think part of what makes our process unique that way is that we’re not necessarily trying to target a platform for deployment when we’re going through the modernization part, anyway. We’re really looking at how we can design this application to be the best application it can be. It just so happens that that tends to be more 12-factor compliant, which is very cloud compatible, but it’s not that we start out aiming for a particular platform.
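As a tiny illustration of the 12-factor leaning Shaun mentions – configuration comes from the environment rather than being baked into the build or the host, which is what lets the result land comfortably on almost any platform – here is a minimal sketch; the variable names are hypothetical.

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// 12-factor config: read settings from the environment, so the
	// same artifact runs unchanged on a laptop, a VM, or a cloud platform.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080" // sensible default for local runs
	}
	dbURL := os.Getenv("DATABASE_URL") // hypothetical backing-service binding

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "db configured: %v\n", dbURL != "")
	})
	log.Fatal(http.ListenAndServe(":"+port, nil))
}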
[0:07:02.8] CC: All right. If everybody allows me, after this next question, I'll let other hosts speak too. Sorry for monopolizing, but I'm so excited about this topic. Again, in the spirit of understanding what we're talking about, what do you define as legacy? Because that's what we're talking about, right? We’re definitely talking about a move up, move forwards. We're not talking about regression and we're not talking about scaling down. We're talking about moving up to a modern technology stack.
That means, that implies we're talking about something that's legacy. What is legacy? Is it contextual? Do we have a hard definition? Is there a best practice to follow? Is there something public people can look at? Okay, if my app, or system fits this recipe then it’s considered legacy, like a diagnosis that has a consensus.
[0:07:58.0] CU: I can certainly tell you how you can’t necessarily define legacy. One of the ways is by the year that it was written. You can certainly say that there are shops who are writing legacy code today. They’re still writing legacy code: as soon as they’re done with a project, it’s instantly legacy. There are people that try to define it – like another Michael Feathers definition, which is, I think, any application that doesn’t have tests – but I don’t know that that fits what our practice necessarily sees legacy as.
Basically, anything that’s accrued a significant amount of technical debt, regardless of when the application was written or conceived, fits into that legacy bucket. Really, our work isn’t as concerned about whether something’s legacy or not as much as: is there pain that we can solve with our practice? Like I said, we’ve modernized things that were, for all intents and purposes, quite modern in terms of the year they were written.
[0:08:53.3] SA: Yeah. I would double down on the pain. Legacy to us is often something that was written as a prototype a year ago. Now it’s ready to prove itself – it’s going to be scaled up, but it wasn’t built with scale in mind, or something like that. Even though it may be the latest technology, it just wasn’t built for the load, for example. Sometimes the pain is: we have applications on a mainframe, we can’t find COBOL developers, and we’re leasing a giant mainframe that’s costing a lot of money, right? There are different flavors of pain.
It also could be something as simple as a data center move – something like that, where we’ve got all of our applications running on iron and we need to go to a virtual data center somewhere, whether it’s cloud or on-prem. Each one of those, to us, is legacy. It’s all about the pain.
[0:09:47.4] CU: I think, as miserable as that might sound, that’s really where it starts: listening to that pain and hearing directly from customers what that pain is. It sounds terrible when you think about it, that you’re always in search of pain, but that is indeed what we do, and we try to alleviate it in some way.
That pain is what dictates the solution that you come up with, because there are certain kinds of pain that aren't going to be solved with say, modernization approach, a more a platformed approach even. You have to listen and make sure that you're applying the right medicine to the right pain.
[0:10:24.7] OP: Seems like an interesting thing, bringing together what you said, Chris, and what you said earlier, Shaun. Shaun, you had mentioned the target platform doesn’t necessarily matter, at least upfront. Then Chris, you had implied bringing the right thing in to solve the pain, or to help remedy the pain to some degree.
I think what’s maybe interesting about the perspectives of those on this call, and you two, is that a lot of times our entry points are a lot more focused on infrastructure and platform teams, where they have these objectives to solve, like cost and the ability to scale, and so on and so forth. It seems like your entry point, at least historically, is maybe a little bit more focused on finding pain points on the app side of the house. I’m wondering if that’s a fair assessment, or if you could speak to how you find opportunities and what you’re really targeting.
[0:11:10.6] SA: I would say that’s a fair assessment from the perspective of our services team. We’re mainly app-focused, but it’s almost a two-pronged approach, where there’s platform pain and application pain. What we’ve seen is that often, solving one without the other is not a great solution, right? I think that’s where it’s challenging, because there’s so much to know, right? It’s hard to find one team or one person who can point out the pain on both sides.
It often just depends on how the customer approaches us. If they’re saying something like, “We’re a credit card company and we’re getting our butts kicked by this other company, because they can do biometrics and we can’t yet, because of the limitations of our application,” then we would approach it from the app-first perspective. If it’s another pain point – where day-two operations are really suffering, where we can’t scale, where we have issues that the platform is really good at solving – then we may start there. It always tends to merge together in the end.
[0:12:16.4] CU: You might be surprised how much variety there is in terms of the drivers for people coming to us. There are a lot of cases where the work came to us by way of the platform work that we’ve done. It started with our sister team, who focuses on the platform side of things. They solve the infrastructure problems ahead of us, and then we close things out on the application side. If our account teams and our organization are really listening to each individual customer, you’ll find that the pain is drastically different, right?
There are some cases where the driver is cost, and that’s an easy one to understand. There are also drivers that are usually a date, such as: this data center goes dark on this date and I have to do something about it. If I’m not out of that data center, then my apps no longer run. The solution to that is very different from the solution to, “Look, my application is difficult for me to maintain. It takes me forever to ship features. Help me with that.” There are two very different solutions to those problems, but each of them are things that come our way. It’s just that the former probably comes in by way of our platform team.
[0:13:31.1] DC: Yeah, that’s an interesting space to operate in, the application transformation stuff. I’ve seen entities within some of the larger companies that represent this field as well. Sometimes that’s called production engineering, and there are a few other examples of this that I’m aware of. I’m curious how you see that happening within larger companies.
Do you find that there is a particular size of entity that is actually striving to do this work with the tools that they have internally, or do you find that typically, most companies just need something like an application transformation practice to come in and help them figure this part of it out?
[0:14:09.9] SA: We’ve seen a wide variety, I think. One of them is: maybe a company really has a commitment to get to the cloud, and they get a platform, and then they start putting some simple apps on it, just to learn how to do it. Then they get stuck with, “Okay. Now how do we, with trust, get some workloads that are running our business onto it?” They will often bring us in at that point, because they haven’t done it before. Experimenting with something that valuable to them usually means that they slow down.
There are other times where we’ve come in to modernize applications – whether it’s a particular business unit, for example, that may have been trying to get off the mainframe for the last two years. They’re smart people, but they get stuck, again, because they haven’t figured out how to do it. What often happens – and Chris can talk about some examples of this – is that once we help them figure out how to modernize, or the recipes to follow to start getting their systems systematically onto the platform and modernized, they tend to form a competency area around it, where they’ll staff it with the people who are really interested, and they take over where we started from.
[0:15:27.9] CU: There might be a little bit of bias to that response, in that typically, in order to even get in the door with us, you’re probably a Fortune 100, or at least a Fortune 500, or a government, or something to that effect. We’re going to be seeing people that, one, have a mainframe to begin with, and two, have, say, the capacity to fund a dedicated transformation team, or to build a unit around that. You could say that the smaller an organization gets, maybe the easier it is to just have the entire organization write software the modern way to begin with.
At least at the large end, we do tend to see people try to build – they’ll use different names for it – a dedicated center of excellence or practice around modernization. Our hope is to help them build that and, hopefully, put them in a position where that can eventually disappear, because eventually, you should no longer need that as a separate discipline.
[0:16:26.0] JR: I think that’s an interesting point. For me, I’d argue that you do need it going forward, because of the cognitive overhead between understanding how your application is going to thrive on today’s complex infrastructure models and understanding how to write code that works. I think one person holding all of that in their head all the time is a little too much, a little too far to go sometimes.
[0:16:52.0] CU: That's probably true. When you consider the size of the portfolios and the size of the backlog for modernization that people have, I mean, people are going to be busy on that for a very long time anyway. Even if it is finite, it still has a very long lifespan at a minimum.
[0:17:10.7] SA: At a certain point, it becomes like painting the Golden Gate Bridge. As soon as you finish, you have to start again, because of technology changes, or business needs, that sort of thing. It's probably a very dynamic organization, but there's a lot of overlap. The pioneering teams set a lot of the guidelines for how the following teams can be doing their modernization work and it just keeps rolling down the track that way. It may be that people are busy modernizing applications off of WebLogic, or WebSphere, and it takes two years or more to get that completed for the enterprise. It was 20, 50 different projects. To them, it was brand-new each time, which is cool actually to come into.
[0:17:56.3] JR: I'm curious, and I'd definitely love to hear from Olive. I have one more question before I pass it along, and I think we'd love to hear your thoughts on all of this. The question I have is, when you're going through your day-to-day working on .NET and Java applications and helping people figure out how to go about modernizing them, what we've talked about so far represents some of the deeper architectural issues.
You've already mentioned 12-factor apps and being able to move, or thinking about the way that you frame the application as far as the inputs it takes to configure it, or thinking about the lifecycle of those things. Are there some other common patterns that you see across the two practices, Java and .NET, that you think are concrete examples of things people should take away from this episode, things they could look at their app with if they're trying to get ahead of the game a little bit?
[0:18:46.3] SA: I would say a big part of the commonality that Chris and I both work on a lot is we have a methodology called the SWIFT methodology that we use to help discover how the applications really want to behave, and define a notional architecture that is, again, agnostic of the implementation details. We'll often come in with the same process, and I don't need to be a .NET expert in a .NET shop to figure out how the system really wants to be designed, how you want to break things into microservices; the implementation then becomes where those details are.
Chris and I both collaborate on a lot of that work. It makes you feel a little bit better about the output when you know that the technology isn't as important. You get to actually pick which technology fits the solution best, as opposed to starting with the technology and letting a solution form around it, if that makes sense.
[0:19:42.4] CU: Yeah. I'd say the interesting thing is just how difficult it is, while we're going through the SWIFT process with customers, to get them to not get terribly attached to the nouns of the technology and the solution. They've usually gone in where it's not just a matter of the language, but they have something picked in their head already for data storage, for messaging, etc., and they're deeply attached to some of these decisions, deeply and emotionally attached to them.
Fundamentally, when we're designing a notional architecture as we call it, really you should be making decisions on what nouns you're going to pick based on that architecture to use the tools that fit that. That's generally a bit of a process the customers have to go through. It's difficult for them to do that, because the more technical their stakeholders tend to be, often the more attached they are to the individual technology choices and breaking that is the principal role for us.
[0:20:37.4] OP: Is there any help, or any investment, or any coordination with those vendors, the purveyors of the technologies that legacy applications are built on, or indeed the platforms they're running on – is there any help on that side from those vendors with application transformation, or making those applications better?
Or do organizations have to rely on a completely independent team like you guys to come in and help them with that? Do you understand my point? Like you mentioned WebLogic, WebSphere – do the purveyors of those platforms try and drive the transformation from within? Or do organizations who are running those apps have to rely on independent companies like you, or like us, to help them with that?
[0:21:26.2] SA: I think some of it depends on what the goal of the modernization is. If it's something like, we no longer want to pay Oracle licensing fees, then of course the WebLogic teams aren't going to be happy to help. That's not always the case. Sometimes it's a case where we may have a lot of WebLogic. It's working fine, but we just don't like where it's deployed and we'd like to containerize it, move it to Kubernetes or something like that. In that case, they're more willing to help. At least in my experience, I've found that the technology vendors are rightfully focused on upgrading things from their perspective, and they want to own the world, right?
WebLogic will say, “Hey, we can do everything. We have clustering. We have messaging. We've got good access to data stores.” It's hard to find a technology vendor that has that broader vision, or the discipline to not try to fit their solutions into the problem, when maybe they're not the best fit.
[0:22:30.8] CU: I think it's a broad generalization, but specifically on the Java side it seems that at least with app server vendors, the status quo is usually serving them quite well. Quite often, we have a bit of an adversarial relationship with them on occasion. I could certainly say that within the .NET space, we've worked relatively collaboratively with Microsoft on things like Steeltoe, which is – I wouldn't say it's a Spring Boot analog, but at least a microservice library that helps people achieve 12-factor cloud nativeness. That's something where I guess Microsoft represents both the legacy side, but also the future side, and we're part of a solution together there.
[0:23:19.4] SA: Actually, that's a good point, because the other way that we're seeing vendors be involved is in creating operators on the Kubernetes side, or Cloud Foundry tiles – something that makes it easy for their system to still be used in the new world. That's definitely helpful as well.
[0:23:38.1] CC: Yeah, that's interesting.
[0:23:39.7] JR: Recently, some people on my team and I went through a training from both Shaun and Chris, interestingly enough, in Colorado, about this thing called the SWIFT methodology. I know it's a really important methodology to how you approach some of the application transformation-like engagements. Could you two give us a high-level overview of what that methodology is?
[0:24:02.3] SA: I want to hear Chris go through it, since I always answer that question first.
[0:24:09.0] CU: Sure. I figured since you were the inventor, you might want to go with it Shaun, but I'll give it a stab anyway. SWIFT is a series of exercises that we use to go from a business problem into what we call a notional architecture for an application. The one thing that you'll hear Shaun say all the time that I think is pretty apt is that we're trying to understand how the application wants to behave.
This is a very analog process, especially at the beginning. It's one where we get people who can speak about the business problem behind an application and the business processes behind an application. We get them into a room, a relatively large room typically with a bunch of wall space, and we go through a series of exercises with them where we tease that business process apart. We start with a relatively lightweight version of Alberto Brandolini's EventStorming method, where we map out with the subject matter experts what all of the business events that occur in a system are. That is a completely non-technical exercise. As a matter of fact, all of this uses sticky notes and arts and crafts.
After we've gone through that process, we transition into the Boris diagram, which is an exercise of Shaun's design where we take the domains, or at least the service candidates, that we've extrapolated from that EventStorming and start to draw out a notional architecture – an 80% idea of what we think the architecture is going to look like. We're going to do this for thin slices of that business problem. At that point, it starts to become something that a software developer might be interested in.
We have an exercise called Snappy that generally occurs concurrently, which translates that message-flow Boris diagram into something that's at least a little bit closer to what a developer could act upon. Again, these are sticky-note and analog exercises that generally go on for about a week or so, things that we do interactively with customers in a purely non-technical way, at least at first, so that we can understand the problem and tell you what an architecture is that you can then act on.
We try to position this to the customer as: you already have all of the answers here. What we're going to do as facilitators of these exercises is try to pull those out of your head. You just don't know how to get to the truth, but you already know that truth, and we're going to design this architecture together.
How did I do, Shaun?
[0:26:44.7] SA: I couldn't have said it better myself. I would say one of the interesting things about this process is the reason why it was developed the way it was: in the world of technology and especially engineers, I've definitely seen that you have two modes of thought when you come from the business world to the technical world. Often, engineers will approach a problem in a very different, very focused, blindered way compared to business folks. Ultimately, what we try to think of is that the purpose of the software is to enable the business to run well. In order to do that, you really need to understand, at least at a high level, what the heck is the business doing?
Surprisingly and almost consistently, the engineering team doing the work is separated from the business team enough that it's like playing the telephone game, right? Where the business folks say, “Well, I told them to do this.” The technical team is like, “Oh, awesome. Well then, we're going to use all this amazing technology and build something that really doesn't support you.” This process really brings everybody together to discover how the system really wants to behave. Also as a side effect, you get everybody agreeing that yes, that is the way it's supposed to be. It's exciting to see teams come together that actually never even work together. You see the light bulbs go on and say, “Oh, that's why you do that.”
The end result is, in a week we can go from nobody really knowing each other, or quite understanding the system as a whole, to having a backlog of work that we can prioritize based on the learnings we have, and feeling pretty comfortable that the end result is going to be pretty close to how we want to get there. Then the biggest challenge is defining how we get from point A to point B. That's part of the layering of the SWIFT method – knowing when to ask those questions.
[0:28:43.0] JR: A micro follow-up and then I'll keep my mouth shut for a little bit. Is there a place that people could go online to read about this methodology, or just get some ideas of what you just described?
[0:28:52.7] SA: Yeah. You can go to swiftbird.us. That has a high-level overview of the more public-facing side of how the methodology works. Then there are also internal resources that are constantly being developed as well. That's where I would start.
[0:29:10.9] CC: That sounds really neat. As always, we are going to have links on the show notes for all of this. I checked out the website for the EventStorming book. There is a resources page there that has a list of a bunch of presentations. Sounds very interesting.
I wanted to ask Chris and Shaun, have you ever seen, or heard of a case where a company went through the transformation, or modernization process and then they roll back to their legacy system for any reason?
[0:29:49.2] SA: That's actually a really good question. It implies that often, the way people think about modernization would be more of a big bang approach, right? Where at a certain point in time, we switch to the new system, and if it doesn't work, then we roll back. Part of what we try to do is have incremental releases, where we're actually putting small slices into production, so you're not rolling back a whole system from modern back to legacy. It's more that you have a week's worth of work going into production for one of the thin slices, like Chris mentioned.
If that doesn't work, or there's something unexpected about it, then you're rolling back just a small chunk. You're not really jumping off a cliff for modernization. You're really taking baby steps. If it's two steps forward and one step back, you're still making a lot of really good progress. You're also gaining confidence as you go that in the end, in two years, you're going to have a completely shiny new modern system and you're comfortable with it, because you're getting there an inch at a time, as opposed to taking a big leap.
[0:30:58.8] CU: I think what's interesting about a lot of large organizations is that they've been so used to doing big bang releases in general. This goes from software to even process changes in their organizations. They've become so used to that, that it often doesn't even cross their mind that it's possible to do something incrementally. We really do oftentimes have to spend time getting buy-in from them on that approach.
You'd be surprised that even in industries that you'd think would be fantastic at managing risk, when you look at how they actually deal with the deployment and rolling out of software, they're oftentimes taking approaches that maximize their risk. There's no better way to make something riskier than doing a big bang. Yeah, as Shaun mentioned, the specifics of SWIFT are to find a way, so that you can understand where, and get a roadmap for how, to carve out incremental slices, so that you can strangle a large monolithic system slowly over time. That's something that's pretty powerful.
Once someone gets bought in on that, they absolutely see the value, because they're minimizing risk. They're making small changes. They're easy to roll back one at a time. You might see people who would stop somewhere along the way, and we wouldn't necessarily say that that's a problem, right? Just like not every app needs to be modernized, maybe there are portions of systems that could stay where they are. Is that a bad thing? I wouldn't necessarily say that it is. Maybe that's the best way for that organization.
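Editor's note: to make the "strangler" approach Chris describes concrete, here is a minimal sketch in Go of the routing layer such an approach implies. All of the names (the /orders path, the internal service URLs, the ports) are hypothetical; the point is that one thin slice is routed to the modernized service while everything else still reaches the monolith, and rolling that slice back means deleting one route rather than redeploying the world.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// proxyTo builds a reverse proxy for a single backend URL.
func proxyTo(rawURL string) *httputil.ReverseProxy {
	target, err := url.Parse(rawURL)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(target)
}

func main() {
	// Hypothetical backends: the legacy monolith and one modernized slice.
	legacy := proxyTo("http://legacy-monolith.internal:8080")
	orders := proxyTo("http://orders-service.internal:8081")

	mux := http.NewServeMux()
	mux.Handle("/orders/", orders) // the one thin slice carved out so far
	mux.Handle("/", legacy)        // everything else stays on the monolith
	log.Fatal(http.ListenAndServe(":8000", mux))
}
```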
[0:32:35.9] DC: We've bumped into this idea now a couple of different times and I think that both Chris and Shaun have brought this up. It's a little prelude to a show that we are planning on doing.
One of the operable quotes from that show is, "The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge." It's a quote attributed to Stephen Hawking. It speaks exactly to that, right? When you come to a problem with a solution in your mind, it is frequently difficult to understand the problem on its merits. It's really interesting seeing that crop up again in this show.
[0:33:08.6] CU: I think oftentimes the advantage of a very discovery-oriented method such as SWIFT is that it allows you to start from scratch with a problem set, with people maybe that you aren't familiar with, and you don't have some of that baggage and can ask the dumb questions to get to some of the real answers. Another phrase that I know Shaun likes to use is that our role as facilitators of this method is to ask dumb questions. I mean, you just can't put enough value on that, right? The only way that you're going to break that established thinking is by asking questions at the root.
[0:33:43.7] OP: One question, actually – there was something recently that happened in the Kubernetes community which I thought was pretty interesting and I'd like to get your thoughts on. Istio, a project that operates as a service mesh – I'm sure you all are familiar with it – has recently decided to unmodernize itself, in a way.
It was originally developed as a set of microservices. They have had no end of difficulty in optimizing the different interactions between those services and the nodes. Then recently, they decided to consolidate – this might be a good example of when to monolith, versus when to microservice. I'm curious what your thoughts are on that, or if you have familiarity with it.
[0:34:23.0] CU: I'm not going to necessarily speak too much to this. Time will tell as to whether the monolithing that they're doing at the moment is appropriate or not. Quite often, the starting point for us isn't necessarily a monolith. What it is, is a proposed architecture coming from a customer that they're proud of: this is my microservice design. You'll see a simple system with maybe hundreds of nano-services.
The surprise that they have is that the recommendation from us coming out of our SWIFT sessions is that actually, you're overthinking this. We're going to take that idea that you have anyway and maybe shrink that down to tens of services, or just a handful of services. I think one of the mistakes that people make within enterprises on microservices at the moment is to say, "Well, that's not a microservice. It's too big." Well, how big or how small dictates a microservice, right? Oftentimes, we are, at least conceptually, taking and combining services based on the customer's architecture. That's very common.
[0:35:28.3] SA: Monoliths aren't necessarily bad. I mean, people use the word almost as a pejorative, "Oh, you have a monolith." In our world it's like, well, monoliths are bad when they're bad. If they're not bad, then that's great. The corollary to that is micro-servicing for the sake of micro-servicing isn't necessarily a good thing either. When we go through the Boris exercise, really what we're doing is showing how domains, or capabilities, relate to each other. That happens to map really well, in our opinion, to first-cut microservices, right?
You may have an order service, or a customer service that manages some of that. Just because we map capabilities and how they relate to each other, it doesn't mean the implementation can't be a single monolith that is componentized inside, right? That's part of what we try really hard to do: avoid the religion of monolith versus microservices, or even having to spend a lot of time trying to define what a microservice is to you. It's really more of, well, the system wants to behave this way. Now, surprise, you just did domain-driven design and mapped out some good 12-factor compliant microservices, should you choose to build it that way, but there are other constraints that always apply at that point.
[0:36:47.1] OP: Is there more traction in organizations implementing this methodology on net new business, rather than currently running businesses or applications? Are there situations you have seen where a new project, or new functionality within a business, starts to drive and implement this methodology, and then it creeps through to the other lines of business within the organization, because that first one was successful?
[0:37:14.8] CU: I'd say that based on the nature of who our customers are as an app transformation practice, and what their problems are, we're generally used to having a starting point of a process or software that exists already. There's nothing at all to mandate that it has to be that way. As a matter of fact, with folks from our labs organization, we've used these methods in what you could probably call greener fields.
At the end of the day, when you have a process, or even a candidate process – something that doesn't exist yet – as long as you can get those ideas onto sticky notes and onto a wall, this is a very valid way of turning ideas into an architecture and an architecture into software.
[0:37:59.4] SA: We've seen that happen in practice a couple of times, where maybe a piece of the methodology was used, like EventStorming, just to get a feel for how the business wants to behave. Then to rapidly try something out in maybe more of an evolutionary architecture approach, an MVP approach: let's just build something from a user perspective just to solve this problem, and then try it out.
If it starts to catch hold, then iterate back and drill into it a little bit more and say, "All right. Now we know this is going to work." We're modernizing something that may be two weeks old, just because – hooray – we proved it's valuable. We didn't necessarily have to spend as much upfront time on designing it as we would for a system that's already proven itself to be of business value.
[0:38:49.2] OP: This might be a bit of a broad question, but what defines success in projects like this? I mean, we mentioned cost earlier, and maybe some of the drivers are to move off certain mainframes and things like that. If you're undergoing an application transformation, it seems to me like it's an ongoing thing. How do enterprises try to evaluate that return on investment? How does it relate to success criteria? Faster release times, etc., potentially might be one, but how is that typically evaluated, with somebody internally saying, "Look, we are running a successful project"?
[0:39:24.4] SA: I think part of what we try to do upfront is identify what the objectives are for a particular engagement. Often, those objectives start out with one thing, right? It's too costly to keep paying IBM or Oracle for WebLogic, or WebSphere. As we go through and talk through what types of things we can solve, those objectives get added to, right? The primary objective may be that we need to start moving workloads off of the mainframe, or off of WebLogic or WebSphere, or something like that. But there are other objectives that are part of this too, which can include things as interesting as developer happiness, right?
They have a large team of 150 developers that are really just getting sick of doing the same old thing and want new technology. That's actually a success criterion, maybe down the road a little bit, but it's more of a nice-to-have. It's a long-winded way of saying: when we start and incept these projects, we usually begin with, let's talk through what our objectives are and how we measure success, the key results for those objectives. As we're iterating through, we keep measuring ourselves against those.
Sometimes the objectives change over time, which is fine because you learn more as you're going through it. Part of that incremental iterative process is measuring yourself along the way, as opposed to waiting until the end.
[0:40:52.0] CC: Yeah, makes sense. I guess these projects, as you say, are continuous and constantly self-adjusting and self-analyzing, re-evaluating the success criteria as they go along. Yeah, so that's interesting.
[0:41:05.1] SA: One other interesting note though: personally, we like to measure ourselves by whether, when one project is moving along, the customers start to form other projects that are similar. Then we know, "Okay, great. It's taking hold." Now other teams are starting to do the same thing. We've become the cool kids and people want to be like us. The only reason that happens is when you're able to show success, right? Then other teams want to be able to replicate that.
[0:41:32.9] CU: The customer's OKRs can oftentimes be a little bit easier to understand. Sometimes they're not. Typically, they involve time or money, where I'm trying to take release times from X to Y, or decrease my spend from X to Y. The way that we measure ourselves as a team, I think, is around how clean we leave the campsite when we're done. We want the customers to be able to run with this and to continue to do this work and to be experts.
As much as we'd love to take money from someone forever, we have a lot of people to help, right? Our goal is to help to build that practice and center of excellence and expertise within an organization, so that as their goals or ideas change, they have a team to help them with that, so we can ride off into the sunset and go help other customers.
[0:42:21.1] CC: We are coming up to the end of the episode, unfortunately, because this has been such a great conversation. It turned out to be a more of an interview style, which was great. It was great getting the chance to pick your brains, Chris and Shaun.
Going along with the interview format, I'd like to ask you: is there any question that wasn't asked, but you wish was asked? The intent here is to illuminate what this process is, for us and for people who are listening, especially people who might be in pain but might be thinking this is just normal.
[0:42:58.4] CU: That's an interesting one. I guess to some degree, that pain is unfortunately normal. That's just unfortunate. Our role is to help solve that. I think complacency is the absolute worst thing in an organization. If there is pain, rather than saying that the solution won't work here, let's start to talk about solutions to that. We've seen customers of all shapes and sizes. No matter how large, or cumbersome they might be, we've seen a lot of big organizations make great progress. If your organization's in pain, you can use them as an example. There is light at the end of the tunnel.
[0:43:34.3] SA: It's usually not a train.
[0:43:35.8] CU: Right. Usually not.
[0:43:39.2] SA: Other than that, I think you asked all the questions that we always try to convey to customers: how we do things, what is modernization. There's probably a little bit about re-platforming, doing the bare minimum to get something onto the cloud. We didn't talk a lot about that, but it's a little bit less meta, anyway. It's more technical and more recipe-driven as you discover what the workload looks like.
It's more about, is it something we can easily do a CF push on, or just create a container and move it up to the cloud with minimal changes? Conceptually, there's not a lot of complexity. Implementation-wise, there are still a lot of challenges there too. They're not as fun to talk about, for me anyway.
[0:44:27.7] CC: Maybe that's a good excuse to have some of our colleagues back on here with you.
[0:44:30.7] SA: Absolutely.
[0:44:32.0] DC: Yeah, in a previous episode we talked about persistence and state, those sorts of things, how they relate to your applications, and what that means when you're thinking about re-platforming and even just where you're planning on putting those applications. For us, that question comes up quite a lot. That's almost step zero, trying to figure out the state model and those sorts of things.
[0:44:48.3] CC: That episode was named States in Stateless Apps, I think. We are at the end, unfortunately. It was so great having you both here. Thank you Duffie, Shaun, Chris – and I'm going by the order I'm seeing people on my video – Josh and Olive.
Until next time. Please make sure to let us know your feedback. Subscribe. Give us a thumbs up. Give us a like. You know the drill. Thank you so much. Glad to be here. Bye, everybody.
[0:45:16.0] JR: Bye all.
[0:45:16.5] CU: Bye.
[END OF EPISODE]
[0:45:17.8] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
The question of diving into Kubernetes is something that faces us all in different ways. Whether you are already on the platform, are considering transitioning, or are thinking about what is best for your team moving forward, the possibilities and the learning curve make it a somewhat difficult question to answer. In this episode, we discuss the topic and ultimately believe that an individual is the only one who can answer that question well. That being said, the capabilities of Kubernetes can be quite persuasive and if you are tempted then it is most definitely worth considering very seriously, at least. In our discussion, we cover some of the problems that Kubernetes solves, as well as some of the issues that might arise when moving into the Kubernetes space. The panel shares their thoughts on learning a new platform and compare it with other tricky installations and adoption periods. From there, we look at platforms and how Kubernetes fits and does not fit into a traditional definition of what a platform constitutes. The last part of this episode is spent considering the future of Kubernetes and how fast that future just might arrive. So for all this and a bunch more, join us on The Podlets Podcast, today!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Josh Rosso
Duffie Cooley
Bryan Liles
Key Points From This Episode:
The main problems that Kubernetes solves and poses.
Why you do not need to understand distributed systems in order to use Kubernetes.
How to get around some of the concerns about installing and learning a new platform.
The work that goes into readying a Kubernetes production cluster.
What constitutes a platform and can we consider Kubernetes to be one?
The two ways to approach the apparent value of employing Kubernetes.
Making the leap to Kubernetes is a personal question that only you can answer.
Looking to the future of Kubernetes and its possible trajectories.
The possibility of more visual tools in the UI of Kubernetes.
Understanding the concept of conditions in Kubernetes and its objects.
Considering appropriate times to introduce a team to Kubernetes.
Quotes:
“I can use different tools and it might look different and they will have different commands but what I’m actually doing, it doesn’t change and my understanding of what I’m doing doesn’t change.” — @carlisia [0:04:31]
“Kubernetes is a distributed system, we need people with expertise across that field, across that whole grouping of technologies.” — @mauilion [0:10:09]
“Kubernetes is not just a platform. Kubernetes is a platform for building platforms.” — @bryanl [0:18:12]
Links Mentioned in Today’s Episode:
Weave — https://www.weave.works/docs/net/latest/overview/
AWS — https://aws.amazon.com/
DigitalOcean — https://www.digitalocean.com/
Heroku — https://www.heroku.com/
Red Hat — https://www.redhat.com/en
Debian — https://www.debian.org/
Canonical — https://canonical.com/
Kelsey Hightower — https://github.com/kelseyhightower
Joe Beda — https://www.vmware.com/latam/company/leadership/joe-beda.html
Azure — https://azure.microsoft.com/en-us/
CloudFoundry — https://www.cloudfoundry.org/
JAY Z — https://lifeandtimes.com/
OpenStack — https://www.openstack.org/
OpenShift — https://www.openshift.com/
KubeVirt — https://kubevirt.io/
VMware — https://www.vmware.com/
Chef and Puppet — https://www.chef.io/puppet/
tgik.io — https://www.youtube.com/playlist?list=PL7bmigfV0EqQzxcNpmcdTJ9eFRPBe-iZa
Matthias Endler: Maybe You Don't Need Kubernetes - https://endler.dev/2019/maybe-you-dont-need-kubernetes
Martin Tournoij: You (probably) don’t need Kubernetes - https://www.arp242.net/dont-need-k8s.html
Scalar Software: Why most companies don't need Kubernetes - https://scalarsoftware.com/blog/why-most-companies-dont-need-kubernetes
GitHub: Kubernetes at GitHub - https://github.blog/2017-08-16-kubernetes-at-github
Debugging network stalls on Kubernetes - https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
One year using Kubernetes in production: Lessons learned - https://techbeacon.com/devops/one-year-using-kubernetes-production-lessons-learned
Kelsey Hightower Tweet: Kubernetes is a platform for building platforms. It's a better place to start; not the endgame - https://twitter.com/kelseyhightower/status/935252923721793536?s=2
Transcript:
EPISODE 18
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.9] JR: Hello everyone and welcome to The Podlets Podcast, where we are going to be talking about "Should I Kubernetes?" My name is Josh Rosso and I am very pleased to be joined by Carlisia Campos.
[0:00:55.3] CC: Hi everybody.
[0:00:56.3] JR: Duffie Cooley.
[0:00:57.6] DC: Hey folks.
[0:00:58.5] JR: And Bryan Liles.
[0:01:00.2] BL: Hi.
[0:01:03.1] JR: All right everyone. I'm really excited about this episode because I feel like as Kubernetes has been gaining popularity over time, it's been getting its fair share of promoters and detractors. That's fair for any piece of software, right? I've pulled up some articles and we put them in the show notes about some of the different perspectives on both successes and perhaps failures with Kube.
But before we dissect some of those, I was thinking we could open it up more generically and think about, based on our experience with Kubernetes, what are some of the most important things that we think Kubernetes solves for?
[0:01:44.4] DC: All right, my list is very short. What Kubernetes solves, from my point of view, is that it presents an interface that knows how to run software, and the best part about it is that it's a standard interface. I can target Kubernetes rather than targeting the underlying hardware. I know certain things are going to be there, I know certain networking is going to be there.
I know how to control memory, and actually, that's the main reason that I would give for Kubernetes: we need that standardization, and you don't want to set up VMs. I mean, assuming you already have a cluster, this simplifies so much.
[0:02:29.7] BL: For my part, I think it's the life cycle stuff that's really the biggest driver for my use of it and for my particular fascination with it. I've been in roles in the past where I was responsible for ensuring that some magical mix of applications on a thousand machines would magically work, that I would have all the dependencies necessary and they would all agree on what those dependencies were, and it would actually just work, and that was really hard.
I mean, getting to a known state in that situation is very difficult. Having something with both the abstraction of containers and the abstraction of container orchestration – the ability to deploy those applications and all those dependencies together, and the ability to change that application and its dependencies using an API – that's the killer part for me.
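Editor's note: as a concrete illustration of "changing an application and its dependencies using an API," here is a minimal Go sketch using the Kubernetes client-go library. It assumes it runs inside a cluster and that a Deployment named "web" with a container named "web" already exists in the default namespace; those names, and the new image tag, are hypothetical. The patch only declares the new desired state – Kubernetes performs the rollout.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster credentials; use clientcmd for kubeconfig-based access instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Bump the container image; the platform reconciles the rest
	// (new ReplicaSet, rolling restart, health checking).
	patch := []byte(`{"spec":{"template":{"spec":{"containers":[{"name":"web","image":"example/web:v2"}]}}}}`)
	_, err = client.AppsV1().Deployments("default").Patch(
		context.TODO(), "web", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("rollout requested")
}
```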
[0:03:17.9] CC: For me, from the perspective of a developer, it's very much what Duffy just said, but more so the uniformity that comes with all those bells and whistles that we get by having that API and all of the features of Kubernetes. We get such uniformity across such a really large surface, so if I'm going to deploy apps, if I'm going to run containers, what I have to do for one application is the same for another application.
If I go work for another company that uses Kubernetes, it is the same, and whether that Kubernetes is a hosted Kubernetes or it's self-managed, it will be the same. I love that consistency and that uniformity. There are many tools that help, and they are customized – there's help for installing and composing specific things for your needs. But the understanding of what you are doing is the same, right?
I can use different tools and it might look different and they will have different commands, but what I'm actually doing, it doesn't change, and my understanding of what I'm doing doesn't change. I love that. Being able to do my work in the same way – that alone, for me, makes it worthwhile.
[0:04:56.0] JR: Yeah, I think my perspective is pretty much the same as what you all said, and the one way that I kind of look at it too is that Kubernetes does a better job of solving the concerns you just listed than I would probably be able to build myself, or my team would be able to solve for ourselves, in a lot of cases. I'm not trying to say that specialization around your business case or your teams isn't appropriate at times. It's just, at least for me, to your point Carlisia, I love that abstraction that's consistent across environments. It handles a lot of the things, like Bryan was saying, about CPU, memory, resources and thinking through all those different pieces.
I wanted to take what we just said and maybe turn it a bit toward some of the common things that people run into with Kubernetes, and maybe hit on a piece of low-hanging fruit that I think is oftentimes a really fair perspective: Kubernetes is really hard to operate. Sure, it gives you all the benefits we just talked about, but managing a Kubernetes cluster? That is not a trivial task. And I just wanted to kind of open that perspective up to all of us, you know? What are your thoughts on that?
[0:06:01.8] DC: Well, the first thought is, it doesn't have to be that way. I think that's a fallacy that a lot of people fall into: it's hard. Guess what? That's fine, we're in the sixth year of Kubernetes; we're not in the sixth year of a stable release. It's hard to get started with Kubernetes, and what happens is we use that as an excuse to say, well, you know what? It's hard to get started with, so it's a failure.
You know something else that was hard to get started with, whenever I started with it in the 90s? Linux. You downloaded it on 30 floppy disks. There was download corruption – real things – Zmodem, Xmodem, Ymodem. This is real; a lot of people don't know about this. And then, you had to find 30 working floppy disks and you had to transfer, you know, one and a half megabytes onto each one – and it still took a long time per floppy disk – and then you had to run the installer.
And then most likely, you had to build a kernel. Downloading, transferring, installing, building a kernel – there were four places where it could fail, and this was before you even had windowing, just to get you to a login prompt.
With Kubernetes, we had this issue. People were installing Kubernetes, there were cloud vendors who were installing it, and then there were people who were installing it on who knows what hardware. Guess what? That's hard, and it's not even the physical servers now – it's the networking. Well, how are you going to create a network that works across all your servers? You're going to need an overlay. Which one are you going to use, Calico? Weave?
Or something else that you created, or something else, if it works. Yeah, we're still figuring out where we need to be, but these problems are getting solved. This will go away.
[0:07:43.7] BL: I’m living that life right now, I just got a new laptop and I’m a Linux desktop kind of guy and so I’m doing it right now. What does it take to actually get a recent enough kernel that the hardware that is shipped with this laptop is supported, you know? It’s like, those problems continue, even though Linux has been around and considered stable and it’s the underpinning of much of what we do on the internet today, we still run into these things, it’s still a very much a thing.
[0:08:08.1] CC: I think also, there's a factor of experience. For example, this is not the first time you've had to deal with this problem, right, Duffy? You've been using Linux on a desktop, so this is not the first hardware that you've had to set up Linux on. So you know where to go to find that information.
Yeah, it's sort of a pain, but it's manageable. I think a lot of us are suffering from, gosh, I've never seen Kubernetes before, where do I even start – or, I learned Kubernetes, but it's quite burdensome to keep up with everything. Whereas let's say, 10 years from now, if we are still doing Kubernetes, you'll be like, yeah, okay, whatever. This is no big deal. Because we'll have done these things for a few years, we would not possibly say that it's hard. I don't think we would describe it that way.
[0:09:05.7] DC: I think there will still be some difficulty to it, but to your point, it's interesting. If I look back, like five years ago, I was telling all of my friends: look, if you're a systems administrator, go learn how to do other things, go learn an API-centric model, go play with AWS, go play with tools like this, right?
If you're a network administrator, learn to be a systems administrator, but you've got to branch out. You've got to figure out how to ensure that you're relevant in the coming time, with all the things that are changing, right? This was true when I was telling my friends this five years ago, 10 years ago, and I continue to tell my friends that today.
If I look at the Kubernetes platform, the complexity it represents in operating it is almost tailor-made to those people who did do that, who decided to actually branch out and understand why APIs are interesting, and to understand, you know, can they have enough of an understanding in a generalist way to become a reasonable systems administrator and a network administrator, and, you know, start actually understanding the paradigms around distributed systems. Because those people are what we need to operate this stuff right now. We're building –
I mean, Kubernetes is a distributed system, we need people with expertise across that field, across that whole grouping of technologies.
[0:10:17.0] BL: Or, don’t. Don’t do any of that.
[0:10:19.8] CC: Bryan, let me follow up on that, because I think it's great that you pointed that out, Duffy. I was thinking precisely in terms of being a generalist and understanding how Kubernetes works and being able to do most of it, but it is so true that some parts of it will always be very complex and will require expertise. For example, security – dealing with certificates and making sure that's working – or if you have particular needs for networking. But understanding the whole idea of these systems, as they sit on top of Kubernetes – grasping that, for people who have years of experience under their belt, is going to become relatively simple. Sorry, Bryan, that I cut you off.
[0:11:10.3] BL: That's fine, but now you gave me something else to say in addition to what I was going to say before. Here's the killer: you don't need to know distributed systems to use Kubernetes. Not at all. You can use a deployment, you can use a [inaudible] set, you can run a job. You can get workloads up on Kubernetes without having to understand that.
But Kubernetes also gives you some good constructs, either in the Kubernetes API itself or in its client libraries, with which you can build distributed systems in an easier way. What I was going to say before that, though, is: I can't build a cluster. Well, don't. You know what you should do? Use a cloud vendor. Use AWS, use Google, use Microsoft – or no, I mean, did I say Microsoft? Google and Microsoft. Use DigitalOcean.
There are other people out there that do it as well. They can take care of all the hard things for you, and in three, four minutes, or 10 minutes if you're on certain clouds, you can have Kubernetes up and running, and you don't even have to think about a lot of these networking concerns to get started.
I think that's a little bit of the FUD that we hear: "It's hard to install." Well, don't install it. You install it whenever you have to manage your own data centers. Guess what? When you have to manage your own data centers and you're managing networking and storage, there's a set of expertise that you already have on staff, and maybe they don't want to learn a new thing. That's a personal problem; that's not really a Kubernetes problem. Let's separate those concerns and not let our lack of wanting to stop us from actually moving forward.
[0:12:39.2] DC: Yeah. Maybe even taking that example a step forward: I think where this perspective sometimes compounds, about Kubernetes being hard to operate, is it's coming from some shops whose perspective is, our operational concerns today aren't that complex; why are we introducing this overhead, this thing that we maybe don't need? And you know, to your point Bryan, I wonder if we'd all entertain the idea – I'm sure we would – that maybe, speaking of the cloud vendors, even just a Heroku or something.
Something that doesn't even concern itself with Kube, but can get your workload up and running and successful as quickly as possible. Especially if you're, like, maybe a small startup type persona, even that's adequate, right? It could have been not a failure of Kubernetes, but more so choosing the wrong tool for the job. Does that resonate with you all as well? Does that make sense?
[0:13:32.9] DC: Yeah, you know, you can't build a house with a screwdriver. I mean, you probably could; it would hurt and it would take a long time. That's what we're running into. What you're really feeling is that operationally, you cannot bridge the gap between running your application and running your application in Kubernetes, and I think that's fair. That's actually a great thing. We've proven that the foundations are stable enough that now we can actually do research and figure out the best ways to run things, because guess what?
You have RPMs from Red Hat, and then you have debs from the Debian project – different ways of getting things. You have Snap from Canonical; it works and sometimes it doesn't. We need to actually figure out those constructs in Kubernetes. They're not free. These things did not exist because someone said, "Hey, I think we should do this." They took many years. I was using RPM in the 90s, and we need to remember that.
[0:14:25.8] JR: On that front, I want to maybe point a question to you, Duffy, if you don't mind. Another big concern that I know you deal with a lot is: Kubernetes is great, maybe I can get it up no problem, but to make it a viable deployment target at my organization, there's a lot of work that goes into making a Kubernetes cluster production-ready, right? That could involve how you integrate storage and networking and security, and on and on.
I feel like we end up at this tradeoff: it's so great that Kubernetes is super extensible and customizable, but there is a certain amount of work that kind of comes with that, right? I'm curious, Duff, what's your perspective on that?
[0:15:07.3] DC: I want to make a point that brings back something Bryan mentioned earlier, real quick, before I go on to that one. The point is that I completely agree that you do not have to actually be a distributed systems person to understand how to use Kubernetes, and if that were the bar, we would have set that bar in an incredibly inappropriate place. But from the operational perspective, that's what we were referring to.
I also completely agree that, especially when we think about productionalizing clusters, if you're just getting into this Kubernetes thing, it may be that you want to farm that out to another entity to create and productionalize those clusters, right? You have a choice to make, just like you had a choice to make when AWS came along, just like you had a choice to make when we were thinking of virtual machines, right?
You have a choice, and you continue to have a choice, about how far down that rabbit hole, as an engineering team, an engineering effort, your company wants to go, right? Do you want to farm everything out to the cloud and not have to deal with the day-to-day operations of those virtual machines, and take the constraints that have been defined by that platform? Or do you want to operate that stuff locally? Are you required by law to operate locally?
What does production really mean to you, and what are the constraints that you actually have to satisfy, right? I think that given that choice, when we think about how to productionalize Kubernetes, it comes down to exactly that same set of things, right? I've seen a number of different takes on this, and it's interesting, because I think it actually moves us on to our next topic in line here.
Frequently I see that productionizing, or productionalizing, Kubernetes means providing some set of constraints around the consumption of the platform, such that your developers, or the folks that are consuming that platform, have to operate within those rails, right? They can only define deployments, and they can only define deployments that look like this.
We're going to ask them a very small subset of questions and then fill out all the rest of it for them on top of Kubernetes. The entry point might be CI/CD, it might be a code repository, very similar to a Heroku, right? The entry point could be anywhere along that line, and I've seen a number of different enterprises explore different ways to implement that.
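Editor's note: here is a sketch of the "rails" Duffie describes – the platform asks developers a very small set of questions and fills out the rest of the Deployment on their behalf. The field choices and defaults below are illustrative assumptions, not a real policy; it uses the standard Kubernetes API types from k8s.io/api.

```go
package platform

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Answers is everything a developer is allowed to specify.
type Answers struct {
	Name     string
	Image    string
	Replicas int32
}

// RenderDeployment expands those answers into a fully formed Deployment
// that satisfies the platform's constraints.
func RenderDeployment(a Answers) *appsv1.Deployment {
	labels := map[string]string{"app": a.Name}
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: a.Name, Labels: labels},
		Spec: appsv1.DeploymentSpec{
			Replicas: &a.Replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  a.Name,
						Image: a.Image,
						// Platform-enforced defaults would go here:
						// resource limits, probes, security context.
					}},
				},
			},
		},
	}
}
```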
[0:17:17.8] JR: Cool. Another concept that I wanted us to define and think about, because I've heard the term platform quite a bit, right? I was thinking a little bit about, you know, what the term platform means exactly, and then eventually, whether Kubernetes itself should be considered a platform.
Backing up, maybe we could just start with a simple question for all of us: what makes something a platform exactly?
[0:17:46.8] BL: Well, a platform is something that provides something. That is a Bryan Liles exclusive. But really, what is a platform? A platform provides some kind of service that can be used to accomplish some task. Kubernetes is a platform: it provides constructs through its API to allow you to perform tasks.
But Kubernetes is not just a platform. Kubernetes is a platform for building platforms. The things that Kubernetes provides – the workload API, the networking API, the configuration and storage APIs – what they provide is a facility for you to build higher-level constructs that control how you want to run the code and then how you want to connect the applications. Yeah, Kubernetes is actually a platform for platforms.
[0:18:42.4] CC: Wait, just to make sure, Bryan. Kelsey Hightower, for example, is someone who says Kubernetes is a platform of platforms. Now, is Kubernetes both a platform of platforms and, at the same time, also a platform to run apps on?
[0:18:59.4] BL: It's both. Kelsey tweeted that. There is some controversy on who said that first – it could have been Joe Beda, it could have been Kelsey. I think it was one of those two, so I want to give a shout out to both of them for thinking along the same lines and really thinking about this problem.
But to go back to what you said, Carlisia – is it a platform for providing platforms and a platform? Yes, and I will explain how. If you have Kubernetes running, what you can do is actually talk to the API and create a deployment. That is a platform for running a workload. But also, what you can do is create, through Kubernetes API mechanisms – i.e., CRDs, custom resource definitions – custom resources, to say: I want to have something called an application. You can basically extend the Kubernetes API. Not only is Kubernetes allowing you to run your workloads, it's allowing you to extend the API, which in turn can be served by another controller that's running on your platform, so that when you create an application, it creates a deployment, which creates a replica set, which creates a pod, which creates containers, which download images from a container registry. It actually is both.
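Editor's note: the "something called an application" Bryan mentions would be declared to the API server as a custom resource. Below is a minimal sketch of what its Go types might look like; the Application kind, its group, and its fields are hypothetical, and a real controller would also need the generated deepcopy methods and a CustomResourceDefinition registered with the cluster.

```go
// Package v1alpha1 sketches a hypothetical "Application" custom resource.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Application is a high-level object a controller could expand into a
// deployment, which creates a replica set, which creates pods.
type Application struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec ApplicationSpec `json:"spec"`
}

// ApplicationSpec is all the user states; the controller fills in the rest.
type ApplicationSpec struct {
	Image    string `json:"image"`    // container image to run
	Replicas int32  `json:"replicas"` // desired instance count
}
```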
[0:20:17.8] DC: Yeah, I agree with that. Another quote that I remember being fascinated by, which I think also helps define what a platform is – Kelsey put out a quote that said, "Everybody wants platform as a service with the only requirement being that they've built it themselves." Which I think is awesome, and it also kind of speaks, in my opinion, to what I think the definition of a platform is, right?
It's an interface through which we can define services or applications, and that interface typically will have some set of constraints, or some set of workflows, or some defined user experience on top of it. To Bryan's point, I think that Kubernetes is a platform because it provides you a bunch of primitives on the back end that you can use to express what that user experience might be.
As we were talking about earlier, you might move the entry point into this platform from the API, the Kubernetes API server, back down into CI/CD, right? Perhaps you're not actually defining something called a deployment; you're just saying, I want so many instances of this, I don't want it to be able to communicate with this other thing, right? So in my opinion, the definition of a platform is that user experience interface. It's the constraints, the known things, that you're going to put on top of that platform.
[0:21:33.9] BL: I like that. I want to throw out a disclaimer right here, because we're talking about platforms: Kubernetes is not a platform as a service. That is actually different. A platform as a service is – from the way that we look at it – basically a platform that can run your code, can actually make your code available to external users, can scale it up, can scale it down, and manages all the nuances required for that operation to happen.
Kubernetes does not do that out of the box, but you can build a platform as a service on Kubernetes. That's actually, I think, where we'll be going next: people stepping out of the onesy-twosy, I-can-deploy-a-workload stage, and actually working on thinking at this level. And I'll tell you what, Deis, who got bought by Microsoft for Azure a few years ago, actually did that. They built a PaaS that looks like Heroku.
Microsoft and Azure thought that was a good idea, so they purchased them, and they're still over there thinking about great ideas. But I think as we move forward, we will definitely see different types of PaaSes on Kubernetes. The best thing is that I don't think we'll see them in the conventional sense of what we think of now.
We have Heroku, which is the git push heroku master model – we share code through git. And then we have the CloudFoundry idea of a PaaS, where you can run cf push, which is more of an extension of our old-school Java applications, where we could just push a [inaudible] here. But I think, at least I am hoping – and this is something that I am actually working on, not to toot my own horn too much – we should actually be thinking about: can we build a platform-as-a-service toolkit?
Can I actually just build something that's tailored to my operation? That is something that I think we'll see a lot more of in the next 18 months. At least you will see it from me and the people that I am influencing.
[0:23:24.4] CC: One thing I wanted to mention before we move on to anything else: in answering "Is Kubernetes right for me?", we are so biased. We need to play devil's advocate at some point. But answering that question is the same as answering "Is technology X right for me?", and I think, at a higher level, there are two camps.
One camp is very much of the thinking that, "I need to deliver value. I need to ship my software, and if the tools I have are solving my problem, I don't need to use something else. I don't need to use the fancy, shiny thing that's the hype, the new thing." And that is so right. You definitely shouldn't be doing that.
I am divided on this way of thinking because at the same time at that is so right. You do have to be conscious of how much money you’re spending on things and anyway, you have to be efficient with your resources. But at the same time, I think that a lot of people who don’t fully understand what Kubernetes really can do and if you are listening to this, if you maybe could rewind and listen to what Brian and Duffy were just saying in terms of workflows and the Kubernetes primitives. Because those things they are so powerful. They allow you to be so creative with what you can do, right? With your development process, with your roll out process and maybe you don’t need it now.
Because you are not using those things yet. But once you understand what it is, what it can do for your use case, you might start having ideas like, "Wow, that could actually make X, Y and Z better, or I could create something else that uses these things and therefore adds value to my enterprise, and I didn't even think about this before." So, you know, two ways of looking at things.
[0:25:40.0] BL: Actually, so the topic of this session was "Should I Kubernetes?" and my answer to that is: I don't know. That is something for you to figure out. If you have to ask somebody else, I would probably say no. But on the other side, if you are looking for great networking across a lot of servers, if you are looking for service discovery, if you are looking for a system that can restart workloads when they fail, well, now you should probably start thinking about Kubernetes.
Because Kubernetes provides all of these things out of the box. Are they easy to get started with, though? Service discovery is really easy, but some of these things are a little bit harder. But what Kubernetes does is, and here comes my hip-hop quote, Jay-Z said this: basically, he wants difficult things to take a little bit of time, and impossible things, or things we thought were impossible, to take a week.
So basically, making difficult things easy, and making things that you could not even imagine doing attainable. I think that is what Kubernetes brings to the table. Then I'll go back and say this one more time: should you use Kubernetes? I don't know, that is a personal problem, that is something you need to answer. But if you're looking for what Kubernetes provides, yes, definitely you should use it.
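To make "out of the box" concrete, here is a minimal sketch in Go using client-go: declare a Deployment with three replicas and Kubernetes keeps three pods running, restarting them when they fail; a Service (not shown) would cover the service discovery side. The name, namespace, and image here are illustrative assumptions, not anything from the episode.

```go
// Sketch: declare desired state and let Kubernetes maintain it.
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Assumes a reachable cluster via the default kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	labels := map[string]string{"app": "demo"} // hypothetical app label
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "demo"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3), // desired state: Kubernetes keeps 3 pods running
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{
						{Name: "web", Image: "nginx:1.25"},
					},
				},
			},
		},
	}
	// The deployment controller now restarts failed pods for you.
	_, err = clientset.AppsV1().Deployments("default").Create(
		context.TODO(), deploy, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```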
[0:26:58.0] DC: Yeah, I agree with that, I think it is a good summary there. But coming back to the "should you Kubernetes" part: from my perspective, the reason that I Kubernetes, if you will – I love that as a verb – is that when I look around at the different projects in the infrastructure space, as an operations person, one of the first things I look for is that API, that pattern around consumption: what's actually out there and who's developing that API.
Is it a business that is interested in selling me a new thing, or is it an API that's being developed by people who are actually trying to solve real problems? Is there a reasonable way to go about this? I mean, when I look at OpenStack, OpenStack was exactly the same sort of model, right? OpenStack existed as an API to help you consume infrastructure, and I look at Kubernetes and I realize, "Wow, okay, well now we are developing an API that allows us to think about the life cycle and management of applications."
Which moves us up the stack, right? So for my part, the reason I am in this community, the reason I am interested in this product, the reason I am totally Kubernetes-ing, is because of that. I realized that fundamentally infrastructure has to change to be able to support the kind of load that we are seeing. So whether you should Kubernetes comes down to: is the API valuable to you? Do you see the value in that, or is there more value in continuing whatever paradigm you're in currently, right? And judging that equally, I think, is important.
[0:28:21.2] JR: Two schools of thought that I run into a lot on the API side of things: one is that, over time, Kubernetes will become this implementation detail, where 99% of users aren't even aware of the API to any extent. The other treats the API as a consistent abstraction with tons of flexibility. I think companies are going in both directions; OpenShift from Red Hat is perhaps a good example.
Maybe that is one of those layer-two platforms, Bryan, that you were talking about, right? Where Kubernetes is the platform that was used to build it, but the average person that interacts with it might not actually be aware of some of the Kubernetes primitives and things like that. So if we could all get out our crystal balls for a second here: what do you all think the future holds? Do you see the Kubernetes API becoming a more prevalent industry standard, or do you see it fading away in favor of some other abstraction that makes it easier?
[0:29:18.3] BL: Oh wow, well, I already see it; I don't have to look too far into the future, right? I can see the Kubernetes API being used in ways that we could not imagine. The example that I think of is KubeVirt. KubeVirt lets you boot, basically, pods on whatever implements something that looks like a kubelet – something that could run pods. But the neat thing is that you can use something like KubeVirt with a virtual kubelet, and now you can boot them on other things.
So, ideas in that space – I don't know, VMware is actually going in on that: "Wow, what if we can make virtual machines look like pods inside of Kubernetes? Pretty neat." Azure has definitely done work on this as well; now we can bring up containers, we can bring up VMs, and you don't actually need a Kube server anymore. But the crazy part is that you can still use the workloads APIs, the storage APIs, with Kubernetes, and it does not matter what backs them.
And I'll throw out one more suggestion. There are also projects like the AWS operators and [inaudible], and what they allow you to do is use the Kubernetes API – or actually Cluster API, I'll use all three – to boot things that aren't even in the cluster, and this could be AWS services, or this could be databases across multiple clouds, or, guess what, more Kubernetes clusters. Yeah, so we are on that path, and I just can't wait to see what people are going to do with it. The power of Kubernetes is this API; it is just so amazing.
[0:30:50.8] DC: For my part, I agree that the API itself is being extended in all kinds of amazing ways, but as I look around in the crystal ball, I think the API will continue to be foundational to what is happening. If I look at the level-two or level-three platforms that are coming, I think those will continue to be a thing for enterprises, because they will continue to innovate in that space, and they will continue to consume the underlying API structure and the portability Kubernetes exposes to define what that platform might look like for their own purposes, right?
Giving them the ability to effectively have a platform as a service that they define themselves, but using a foundational layer that is consistent, extensible and extensive. I think that's where things are headed.
[0:31:38.2] CC: And also more visual tools, I think, are in our future. Better, actual visual UIs that people can use – I think that's definitely going to be in our future.
[0:31:54.0] BL: So can I talk about that for a second?
[0:31:55.9] CC: Please, Bryan.
[0:31:56.8] BL: I am wearing my Octant hoodie today – Octant is a visual tool for Kubernetes – and I will talk now as someone who has gone down this path to actually figure this problem out. As a prediction for the future, I think we'll start creating better APIs in Kubernetes to allow for more visual things. The reason I say this is going to happen, and can't really happen now, is because inside of Octant, whenever we're creating new UI views, we pretty much have to know ahead of time what that object is.
But what is going to happen – and I see the rumblings from the community, I see the rumblings from the Knative community as well – is that we are going to start standardizing on conditions, and using conditions as the way that we actually say what's going on. So let me back up for a second so I can explain to people what conditions are.
So, Kubernetes – we think of Kubernetes as YAML, and in a typical object in Kubernetes, you are going to have your type metadata: what is this? You are going to have your object metadata: what is this named? Then you are going to have a spec: how is this thing configured? And then you are going to have a status, and the status generally will say, "Well, what is the status of this object? If it is a deployment, how many replicas are out? If it is a pod, am I ready to go?"
But there is also this concept in status called conditions, which are a list of things that say how your object is working. Right now, Kubernetes uses them in two ways: the negative way and the positive way. I think we are actually going to figure out which one we want to use, and we are going to see more APIs just say conditions. And now, from a UI developer's point of view, I can just say, "I don't really care what your object is. You are going to give me conditions in a format that I know, and I can basically report on those in the status and tell you if the thing is working or not."
That is going to come too, and that will be neat, because it means we can basically start building UIs for free; we just have to learn the pattern.
[0:33:52.2] CC: Can you talk a little bit more about conditions? Because this is not something I hear about frequently – it might be something I know but don't recognize by this name.
[0:34:01.1] BL: Oh yeah, I will give you the most popular one. Everything in Kubernetes is an object, and that even means that the nodes your workloads run on are objects. If you run kubectl – kube control, kube cuddle, kube whatever – get nodes, it will show you all the nodes in your cluster, if you have permission to see them. And if you do kubectl get node, the node name, with the YAML output, what you will see at the bottom is a field called 'conditions'.
And inside of there, it will be something like: is there sufficient memory, is the node – I actually don't remember all of them, but really what they are is line items that say how this particular object is working. Do I have enough memory? Do I have enough storage? Am I out of actual pods that can be launched on me? That is what conditions are. It is basically like asking, "Hey Bryan, what is the weather outside?" I could say it's nice.
Or I could say, "Well, it's 75 degrees, the wind is light but variable, it is not humid," and those are what the conditions are. They allow the object to specify things about itself that might be useful to someone who is consuming it.
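To see the shape Bryan is describing, here is a minimal client-go sketch that prints the same status.conditions you would find at the bottom of kubectl get node -o yaml. The kubeconfig path is an assumption; the condition fields (type, status, reason) are part of the core v1 node API.

```go
// Sketch: print every node's conditions, the "weather report" for the node.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a reachable cluster via ~/.kube/config.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		fmt.Println(node.Name)
		// Each condition has a type (Ready, MemoryPressure, DiskPressure,
		// PIDPressure, ...), a status (True/False/Unknown), a reason and a message.
		for _, c := range node.Status.Conditions {
			fmt.Printf("  %-20s %-8s %s\n", c.Type, c.Status, c.Reason)
		}
	}
}
```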
[0:35:11.1] CC: All right that was useful. I am actually trying to bring one up here. I never paid attention to that.
[0:35:18.6] BL: Yeah, and you will see it. The two objects where conditions are most common right now – there is some competition going on in Kubernetes architecture, trying to figure out how they are going to standardize on this – are pods and nodes; you will see conditions on there, and those are just telling you what is going on. A condition is a type, a message, a status, and a reason. The problem is that the status can be true or false, but sometimes the type is a negative type, where it would be something like "node not ready".
And then it will say false, because the node is ready. Now, whenever you're inspecting that with automated code, you really want the positive condition to be true and the negative condition to be false, and this is something that the Knative community is really working on now. They have this whole facility called duck typing, with which they can now pattern-match inside of objects to find all of these neat things. It is actually pretty intriguing.
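Here is a small sketch of the polarity problem Bryan describes: some condition types are positive (Ready=True is healthy) and some are negative (MemoryPressure=True is unhealthy), so generic tooling has to normalize them before it can report health. The Condition type and the list of negative types below are illustrative only; this is not Knative's actual duck-typing implementation.

```go
// Sketch: normalize condition polarity before reporting health.
package main

import "fmt"

type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// negativeTypes lists condition types where Status "True" means trouble.
// Illustrative subset; a real tool would need a complete mapping.
var negativeTypes = map[string]bool{
	"MemoryPressure":     true,
	"DiskPressure":       true,
	"PIDPressure":        true,
	"NetworkUnavailable": true,
}

// healthy answers "is this condition good news?" regardless of polarity.
func healthy(c Condition) bool {
	if negativeTypes[c.Type] {
		return c.Status == "False"
	}
	return c.Status == "True"
}

func main() {
	conds := []Condition{
		{Type: "Ready", Status: "True"},
		{Type: "MemoryPressure", Status: "False"},
	}
	for _, c := range conds {
		fmt.Printf("%s: healthy=%v\n", c.Type, healthy(c))
	}
}
```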
[0:36:19.5] CC: All right, it is interesting, because status is everything for objects for me; that is very much a part of my workflow. But I never noticed that some of the objects had conditions. And just a plug: we are very much going to have the Knative folks here to talk about duck typing. I am really excited about that.
[0:36:39.9] BL: Yeah, they’re on my team. They’ll be happy to come.
[0:36:42.2] CC: Oh yes, they are awesome.
[0:36:44.5] JR: So I was thinking maybe we could wrap this conversation up. I think we have acknowledged that "Should I Kubernetes?" is a ridiculously hard question for us to answer for you, and we should clearly not be the ones answering it for you. But I was wondering if we could give some thoughts for the Podlets listener who is sitting at their desk right now thinking, "Is now the right time for my organization to bring this in?"
And I will start with some thoughts and then open it up to all of you. One common thing I run into a lot: you know your current state and you know your desired state, to steal a Kubernetes concept for a moment. The desired state might be more decoupled services that are more scalable, and so on, and I think oftentimes at orgs we get so obsessed with the desired state that we forget how far the gap is between the current state and the desired state.
So as an example, maybe your shop's biggest issue is that the primary revenue-generating application is a massive .NET Framework monolith, which isn't exactly easy to just port over into Kubernetes, right? If a lot of your friction right now is teams collaborating on this tool, updating this tool, scaling this tool, then maybe, before even thinking about Kubernetes, be honest with the fact that a lot of value can be derived right now from some amount of application architecture changes.
Or even, sorry to use a buzzword, some amount of modernization of aspects of that application before you even get to the part of introducing Kubernetes. So that is one common one that I run into with orgs. What are some other suggestions you have for people who are thinking, "Is it the right time to introduce Kube?"
[0:38:28.0] BL: So here is my thought: if you work for a small startup and you're working on shipping value, and you have no Kubernetes experience on staff, and for some reason you don't want to use the cloud, go figure out your other problems, then come back. But if you are an enterprise, and especially if you work in a central enterprise group and you are thinking about "modernization", I actually do suggest that you look at Kubernetes, and here is the reason why.
My guess is that if you're a business of a certain size, you run VMware in your data center. I am just guessing that, because I haven't been to a company that doesn't. We learned a long time ago that using virtual machines is in many cases way more efficient than just running hardware, because otherwise we can't fully use our compute capacity. So if you are working for a big company, or even a medium-sized company, I don't think –
I am not telling you to run to it, but I am telling you to at least have someone go look at it and investigate whether this could ultimately be something that makes your stack easier to run.
[0:39:31.7] DC: I think I am going to take the operations perspective. If you are in the business of coming up with a way to deploy applications on servers, and you are trying to handle the lifecycle of that, and you're pretty fed up with the tooling that is out there – things like Puppet and Chef and tooling like that – you might be looking to understand: is there something in Kubernetes for me?
Is there some model that could help me improve the way I handle the lifecycle of those applications, be they databases or monoliths or composable services? Any which way you want to look at it: are there tools there, is the API expressive enough, to help me solve some of those problems? In my opinion, the answer is yes. I look at things like DaemonSets and the scheduling [inaudible] that are exposed by Kubernetes.
And there is actually quite a lot of power there, quite a lot of capability, in just the traditional model of how do I get this set of applications onto that set of servers, or some subset therein. So I think it is worth evaluating if that is the place you're in as an organization, if you are looking at fleets of equipment and trying to handle that magical recipe of multiple applications and dependencies. See what the water is like on this side; it is not so bad.
[0:40:43.1] CC: Yes, I don't think there is a way to answer the question "Is Kubernetes for me?" without actually trying it – giving it a try yourself, really running something of maybe low risk. We can read blog posts until the end of the world, but until you actually do it, you won't know. Explore the boundaries, is what I would say; try to learn what else you can do that maybe you don't even need yet, but that might become useful once you know you can use it.
Yeah, and another thing: maybe if you are a shop that has one or two apps and you don't need everything that Kubernetes has to offer, and there is a much more scaled-down tool that will help you deploy and run your apps, that's fine. But if you have more than a certain number – I don't know what that number would be, but multiple apps and multiple services – just think about having that uniformity across everything.
Because, for example, I've worked in shops where the QA machines were taken care of by a group of DevOps people, and the production machines, oh my god, were taken care of by a different group of people, and the tools the two groups used were different. And I, as a developer, had to know everything, you know? How to deploy here, how to deploy there, and I had to have my little notes and recipes, because whenever I did it –
First of all, I wasn't doing it multiple times a day, so I had to read through the notes to know what to do. I mean, just imagine if it was one platform that I was deploying to, with the CLI commands right there, very easy to use, like Kubernetes gives us with kubectl. You have to think outside of the box – think about these other operations that people in your company are going to have to do. How is this going to be taught in the future?
Having someone who knows your stack, because your stack is the same one that people in your industry are also using – think about all of these things, not just – I think people have to look at it across the entire set of problems.
[0:43:01.3] BL: I wanted to mention one more thing. We are producing lots of content here with The Podlets and with our coworkers, so I want to give a shout-out to TGIK. If you want to know what you can do with Kubernetes and you want to have your imagination expanded a little bit, every Friday we make a new video, and funny enough, three-fourths of the people on this call have actually done this.
On Friday, we pick a topic and we go in, and it might be something that is interesting to you or it might not; we are all over the place. We are not just doing applications; we do low-level things, mapping applications onto Kubernetes, new things that just came out. We have been doing this for 101 episodes now. Wow. So you can go look at that if you need some examples of what you could do on Kubernetes.
[0:43:51.4] CC: I am so glad you mentioned tgik.io – maybe somebody, an English speaker, should repeat that because of my accent – but let me just say I am so glad you mentioned that, Bryan, because I was sitting here as we were talking and thinking there should be a catalog of use cases of what Kubernetes can do, not just the rice and beans, but a lot of different use cases, maybe things that are unique that people don't think to use it for because they haven't run into that need yet.
But they could look at it and pause: "Okay, that would enable me to do this thing that I didn't even think about." That is such a great catalog of use cases; it is probably the best resource. Somebody say the website again? Duffie, what is it?
[0:44:38.0] DC: tgik.io and it is every Friday at 1 PM Pacific.
[0:44:43.2] CC: And it is live. It's live and it's recorded, so it is uploaded to the VMware Cloud Native YouTube channel, and everything is going to be in the show notes too.
[0:44:52.4] DC: It's neat. You can come ask us questions; there is a live chat inside of that, and you can use it to ask us questions or give us ideas, all kinds of crazy things, just like you can with The Podlets. If you have an idea for an episode, or something that you want us to cover, or something you are interested in, you can go to thepodlets.io, which will link you to our GitHub pages, where you can open an issue about things you'd love to hear more about.
[0:45:15.0] JR: Awesome and then maybe on that note, Podlets, is there anything else you all would like to add on “Should I Kubernetes?” or do you think we’ve –
[0:45:22.3] BL: As best as our bias will allow it I would say.
[0:45:27.5] JR: As best as we can.
[0:45:27.9] CC: We could go another hour.
[0:45:29.9] JR: It’s true.
[0:45:30.8] CC: Maybe we’ll have “Should I Kubernetes?” Part 2.
[0:45:34.9] JR: All right everyone, well, that wraps it up for at least Part 1 of "Should I Kubernetes?" and we appreciate you listening. Thanks so much. Be sure to check out the show notes, as Duffie mentioned, for some of the articles we read preparing for this episode, and TGIK links and all that good stuff.
So again, I am Josh Rosso, signing out. With us also, Carlisia Campos.
[0:45:55.8] CC: Bye everybody, it was great to be here.
[0:45:57.7] JR: Duffie Cooley.
[0:45:58.5] DC: Thank you all.
[0:45:59.5] JR: And Bryan Liles.
[0:46:00.6] BL: Until next time.
[0:46:02.1] JR: Bye.
[END OF EPISODE]
[0:46:03.5] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
If you work in Kubernetes, cloud native, or any other fast-moving ecosystem, you might have found that keeping up to date with new developments can be incredibly challenging. We feel this as well, and so we decided to make today's episode a tribute to that challenge, as well as a space for sharing the best resources and practices we can think of to help manage it. Of course, there are audiences in this space who require information at various levels of depth, and fortunately the resources to suit each one exist. We get into the many different places we go in order to receive information at each part of the spectrum, such as SIG meetings on YouTube, our favorite Twitter authorities, the KubeWeekly blog, and the most helpful books out there. Another big talking point is the idea of habits or practices that can be helpful in consuming all this information, whether it be waiting for the release notes of a new version, tapping into different TLDR summaries of a topic, streaming videos, or actively writing posts as a way of clarifying and integrating newly learned concepts. In the end, there is no easy way, and passionate as you may be about staying in tune, burnout is a real possibility. So whether you're just scratching the cloud native surface or up to your eyeballs in the code base, join us for today's conversation, because you're bound to find some use in the resources we share.
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Josh Rosso
Duffie Cooley
Olive Power
Michael Gasch
Key Points From This Episode:
Audiences and different levels of depth that our guests/hosts follow Kubernetes at.
What 'keeping up' means: merely following news, or actually grasping every new concept?
The impossibility of truly keeping up with Kubernetes as it becomes ever more complex.
Patterns used to keep up with new developments: the LWKD website, release notes, etc.
Twitter's helpful provision of information, from opinions to tech content, all in one place.
How helpful Cindy Sridharan is on Twitter in her orientation toward distributed systems.
The active side of keeping up, such as writing posts and helping newcomers.
More helpful Twitter accounts, such as InfoSec.
How books provide one source of deep information as opposed to the noise on Twitter.
Books: Programming Kubernetes; Managing Kubernetes; Kubernetes Best Practices.
Another great resource for seeing Kubernetes in action: the KubeWeekly blog.
A call to watch the SIG playlists on the Kubernetes YouTube channel.
Tooling: tab management and Michael's self-built Twitter searcher.
Live streaming and CTF live code demonstrations as another resource.
How to keep a team updated using platforms like Slack and Zoom.
The importance of organizing shared content on Slack.
Challenges around not knowing the most important thing to focus on.
Cognitive divergence and the temptation of escaping the isolation of coding by socializing.
The danger of not recognizing that keeping up to date is a personal sacrifice.
Using multiple different TLDR summaries to cement a concept in one's brain.
Incentives for users rather than developers of projects to share their experiences.
The importance of showing appreciation for free resources in keeping motivation up.
“An audience I haven’t mentioned is the audience that basically just throws up their hands and walks away because there’s just too much to keep track of, right?” — @mauilion [0:05:15]
“Maybe it’s because I’m lazy, I don’t know? But I wait until 1.17 drops, then I go to the release notes and really kind of ingest it because I’ve just struggled so much to kind of keep up with the day to day, ‘We merged this, we didn’t merge this,’ and so on.” — @joshrosso [0:10:18]
“If you find value in being up to date with these things, just figure out – there are so many resources out there that address these different audiences and figure out what the right measure for you is. You don’t have to go deep on the code on everything.” — @mauilion [0:27:57]
“Actually putting the right content in the right channel, at least from a higher level, helps me decide whether I want to like look at that channel today, and stuff that should be in the channel is not kind of in a conversation channel.” — @opowero [0:32:21]
"When I see something that is going to give me the fundamentals – like, I have other priorities now – I sort of always want to consume that to learn the fundamentals, because I think it pays off in the long term, but then I neglect precisely what I need to know to do in the moment." — @carlisia [0:33:39]
“Just do nothing, because our brain needs that. We need to not be listening, not be reading, just nothing. Just sit and look at the ceiling. Our brain needs that. Ideally, look at nature, like look outside, look at the air, go for a walk. We need that, because that recharges the brain.” — @carlisia [0:42:38]
“Just consuming and keeping up, that doesn’t necessarily mean you don’t give back.” — @embano1 [0:49:32]
Links Mentioned in Today’s Episode:
Chris Short — https://chrisshort.net/
Last Week in Kubernetes Development — http://lwkd.info/
1.17 Release Notes — https://kubernetes.io/docs/setup/release/notes/
Release Notes Filter Page — https://relnotes.k8s.io/
Cindy Sridharan on Twitter — https://twitter.com/copyconstruct
InfoSec on Twitter — https://twitter.com/infosec?lang=en
Programming Kubernetes on Amazon —https://www.amazon.com/Programming-Kubernetes-Developing-Cloud-Native-Applications/dp/1492047104
Managing Kubernetes on Amazon — https://www.amazon.com/Managing-Kubernetes-Operating-Clusters-World/dp/149203391X
Brendan Burns on Twitter — https://twitter.com/brendandburns
Kubernetes Best Practices on Amazon — https://www.amazon.com/Kubernetes-Best-Practices-Blueprints-Applications-ebook/dp/B081J62KLW/
KubeWeekly — https://kubeweekly.io/
Kubernetes SIG playlists on YouTube — https://www.youtube.com/channel/UCZ2bu0qutTOM0tHYa_jkIwg/playlists
Twitch — https://www.twitch.tv/
Honeycomb — https://www.honeycomb.io/
KubeCon EU 2019 — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/
Aaron Crickenberger on LinkedIn — https://www.linkedin.com/in/spiffxp/
Stephen Augustus on LinkedIn — https://www.linkedin.com/in/stephenaugustus
Office Hours — https://github.com/kubernetes/community/blob/master/events/office-hours.md
Transcript:
EPISODE 17
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.5] DC: Good afternoon everybody and welcome to The Podlets. In this episode, we’re going to talk about, you know, one of the more challenging things that we all have to do, just kind of keep up with cloud native and how we each approach that and what we do. Today, I have a number of cohosts with me, I have Olive Power.
[0:00:56.6] OP: Hi.
[0:00:57.4] DC: Carlisia Campos.
[0:00:58.6] CC: Hi everybody.
[0:00:59.9] DC: Josh Rosso.
[0:01:01.3] JR: Hey all.
[0:01:02.8] DC: And Michael.
[0:01:01.1] MICHAEL: Hey, hello.
[0:01:04.8] DC: This episode, we’re going to do something a little different than we normally do. In most of our episodes, we try to remain somewhat objective around the problem and the potential solutions for it, rather than prescribing a particular solution. In this episode, however, since we’re talking about how we keep up with all of the crazy things that happen in such a fast ecosystem, we’re going to probably provide quite a number of examples or resources that you yourself could use to drive and to try and keep up to date with what’s happening out there.
Be sure to check out the notes after the episode is over at thepodlets.io and you will find a link to the episodes up at the top part, click down to this episode, and check out the notes. There will be tons of resources. Let’s get started.
One of the things I think is interesting about keeping up with something like Kubernetes, or any fast-moving project – whether it's Kubernetes or, for a while, it was Mesos that I was following, or OpenStack, or a number of big infrastructure projects that have been very fast-moving over time – is that I find there are multiple audiences that we address when we think about what it means to 'keep up,' right?
Keeping up with a project is interesting, because I feel like there's an audience that is actually very interested in what's happening with the design goals or the code base of the project, and there's an audience that wants to understand it at a high level – like, "Give me the State of the World report every month or so, just so I can understand generally what's happening with the project. Is it thriving? Is it starting to wane? Are there big projects that it's taking on?"
And then I feel like there's an audience somewhere in the middle, who really want to see people using the project, and to learn from those people who are using it so that they can elevate their own use of the project. They're not particularly interested in the code base per se, but they do want to understand: are they exploring this project at a depth that makes sense for themselves? What do you all think about that?
[0:03:02.0] CC: I think one thing that I want to mention is that this episode, it’s not so much about on-boarding people onto Kubernetes and the Kubernetes ecosystem. We are going to have an episode soon to talk specifically about that. How you get going, like get started. I think Duffy mentioned this so we’re going to be talking about how we all keep up with things. Definitely, there are different audiences, even when we’re talking about keeping up.
[0:03:32.6] JR: Yeah, I think what's funny about your audience descriptions, Duffie, is I feel like I've actually slid between those audiences a bit, right? It's funny, back in the day – the Kubernetes 1.4, 1.5 days – I feel like I was much more, "What's going on in the code?" Trying to keep track of how things were progressing.
Now my role is a lot more focused on working with customers, standing up Kube, and making it production-ready. I feel like I'm a lot more reactive, and more interested to see what features have become stable and impact me, you know what I mean? I'm far less in the weeds than I used to be. It's a super interesting thing.
[0:04:08.3] OP: Yeah, for my role, I tend to fall into the third audience first, the general keeping-an-eye-on-things – like when you see interesting articles pop up, maybe linked internally because somebody said, "Oh, check out this article. It's really interesting."
Then you find that you click through five or six similar articles, and you flip into that mode of, "Oh, I'm learning lots of good stuff generally about things that folks are doing" – and then to actually having to figure out some particular solution for one of my customers, and so having to go quite deep into that particular feature.
I found myself going right in and then back out, right in, back out, depending on where I am on a particular day of the week. It's a bit tricky. My brain sometimes doesn't deal well with that sort of deep concentration on one particular topic and then back out again. It's not easy.
I find it quite tough actually some of the time.
[0:05:05.0] DC: Yeah, I think we can all agree on that. Keeping track of everything is – well, it's why we're doing this episode, right? How do we even approach it? I feel like an audience I haven't mentioned is the audience that basically just throws up their hands and walks away, because there's just too much to keep track of, right? I feel like we are all that at some point, you know?
I get that.
[0:05:26.4] OP: That’s why we have Christmas holidays, right? To kind of refresh the brain.
[0:05:31.4] CC: Yeah, maybe purposefully, or maybe not even – I'm not trying to keep up, because it is too much. It is a lot. What I'm trying to do is go deeper on the things that I already sort of know, and the things that I am working with on a day-to-day basis. I feel like I really only need to know so much, because I'm not working directly with customers.
My scope is very well defined, and I feel that I really only need to know when there's a new Kubernetes release and what is in it. Every once in a while, we bump up the Kubernetes release that we are working against, and in general, if things come my way and they're interesting, I'll take a look, but mostly I feel like I work in a spiral.
If I'm writing code related to controllers and there's a conference talk about controllers, then okay, let me take a look at this, to maybe learn how to design this thing better, implement it in a better way knowing more about it. If I'm looking at CRDs, same thing. I really like conference talks for education, but that's not so much keeping up with what's new. Are we talking about educating ourselves on things that we don't know about?
Things that we don’t know about. Or are we talking about just news?
[0:07:15.6] JR: I think it's everything. That's a great question. One of my other questions when we were starting to talk about this was: what does keeping up even mean, right? Does it mean: where do you find resources that are interesting, that keep you interested in the project? Or are you looking for resources that just keep you up to date with what's changing? It's a great question.
[0:07:36.2] MICHAEL: Actually, there was a problem that I faced when I added the links that I wanted to share for the show. I started writing the links down, and then I realized, "Well, most of this stuff is not about keeping up with news, it's actually about understanding the technology," because I cannot keep up.
What does help me is understanding specific areas when I need to dig into them. Thinking back four or five years to the early days of Kubernetes, it was easy to catch up at the time, because it was just about Kubernetes. Later, it became this platform; we realized that it actually is this platform thing. Then we extended Kubernetes, and then we realized there is CI/CD-related stuff, and operations, and monitoring, and so the whole ecosystem grew. The landscape grew so much that today it's impossible to keep up, right?
I think I’m interested in all those patterns that you have developed over the years that help you to manage this, let’s say complexity or stream of information.
[0:08:33.9] DC: Yeah, I agree. I was thinking about putting up a talk with Chris Short – it was actually last year – on the same topic of keeping up with it. For that, I did a little research into how that happens, and some of the interesting stuff that came out of it was that there are certain patterns a project might take on that make it easier, or more approachable, to stay in contact with what's happening.
If we take Kubernetes as an example, there are a number of websites that pretty much everybody here follows to some degree, and that help address those different audiences we were talking about.
One of the ones that I've been really impressed with is LWKD, which stands for Last Week in Kubernetes Development. As you can imagine, this is really focused on – I wouldn't say it's super deep on the development, but it is watching for things that are changing, things that are interesting to the people who are curating that particular blog, right?
They'll have things in there like: a code freeze is coming up on this date, IPv4/IPv6 dual-stack is merging. They'll have some of the big mile markers that are happening in a particular release, and where they are in time as it relates to that release. I think that's a great pattern, and it's a very narrow audience, right? It would really only be interesting to people who are caught up in the code base, or who are looking for a preview of what the release notes might look like – a weekly kind of thing.
[0:10:03.4] JR: Yeah, speaking of the release notes, right? It's funny, I do get to look at Last Week in Kubernetes Development every now and then. It's an awesome resource, but I've gotten to the point where the release notes are probably my most important thing for staying up to date.
Maybe it’s because I’m lazy, I don’t know, but I wait till 1.17 drops, then I go to the release notes and really kind of ingest it because I’ve just struggled so much to kind of keep up with the day to day, “We merged this, we didn’t merge this,” and so on. That has been a huge help for me, you know, day to day, week to week, month to month.
[0:10:37.0] MICHAEL: Well, what was also helpful, just on the release notes, is the new filter webpage that they put out starting with 1.15. Have you all seen that?
[0:10:44.4] JR: I’ve never heard of it.
[0:10:45.4] DC: Rel dot, whatever it is. Rel dot –
[0:10:47.7] MICHAEL: Yeah, if you can share it, Duffie, that's super useful. Especially if you want to compare releases and the features added, and –
[0:10:55.2] DC: I’ll have to dig it up as well. I don’t remember exactly what –
[0:10:56.7] CC: I’m sorry, say? Which one is that again?
[0:10:59.1] MICHAEL: The relnotes. I'll put it in the HackMD.
[0:11:02.8] DC: Yeah, relnotes.k8s.io, which is an interesting one, because it's sort of a comparison engine that allows you to compare features, and how a feature relates to different versions of stuff.
[0:11:14.4] CC: That's great. I cannot encourage the listeners enough to look at the show notes, because we have a little document here that we share. The resources are amazing. There are so many things that I have never even heard about that sound great – I want to go through this entire list. Definitely check it out. We might not have time to mention every single thing, and I don't want people to miss out on all the goodness that's been put together.
[0:11:48.7] DC: Agreed. And again, if you're looking for those notes, you just go to thepodlets.io, click on 'episodes' at the right, and then look for this episode; you'll find the notes there.
[0:11:58.0] CC: I can see that a lot of the content in those notes is Twitter feeds. Speaking personally, I'm not sure I'm at the stage yet where I learn a lot from Twitter feeds in terms of technical content. Do you guys find that it's more around people's thoughts on certain things, thought-provoking things around Kubernetes and the ecosystem, rather than actual technical content? I mean, that's my experience so far.
But looking at those Twitter feeds, I guess I might need to follow some of them. What do you all think?
[0:12:30.0] MICHAEL: Do you mean the tweets are from those like learn [inaudible 0:12:32] or the person to be tweets?
[0:12:35.3] OP: You've listed some of them there, Michael, and some sort of.
[0:12:37.6] MICHAEL: I just wanted to get some clarity. The reason I listed so many Twitter accounts there is because Twitter is my only kind of newsfeed, if you will. I used Feedly and RSS and emails and threads before, but then I just got overwhelmed, and I had this feeling of missing out all of the time.
That's why I said, "Okay, let's just use Twitter." To your question: most of these accounts are people who have been in the Kubernetes space for very long, either running Kubernetes, developing on Kubernetes, or having opinions about Kubernetes.
Opinions in general on topics related to cloud native, because we didn't want to make the search just about Kubernetes. For most of these people, I really appreciate their thoughts, and some of them also just retweet things that they see, which I'd have missed somewhere else, so it's not necessarily just opinions. I think it's a good mix of accounts, providing opinions, some guidance, and also just news that I'd otherwise miss out on, because I'm not on the other channels.
[0:14:03.0] OP: Yeah, I agree, because I tend to read a lot of blog posts and web posts, which, without realizing it, can be kind of opinionated, and then it's nice to see some Twitter feeds that give, in just a couple of words, a different view. Sometimes that makes me think, "Okay, I understand that topic from a certain article that I've read; it's really nice to hear a different take on it through Twitter."
[0:14:03.0] CC: I think some of the accounts, like fewer of the accounts – and there are a bunch of things that – there are listed accounts here that I didn’t know before so I’ll check them out. I think fewer of the accounts are providing technical content, for example, Cindy Sridharan, not pronouncing it correctly but Cindy is great, she puts out a lot of technical content and a lot of technical opinion and observations that is really good to consume. I wish I had time to just read her blog posts and Twitter alone.
She's very oriented towards distributed systems in general, so she's not even specific to Kubernetes. Most of the accounts are very opinionated, and the benefit for me is that sometimes I catch people talking about something that I didn't even know was a thing. It's like, "Oh, this is a thing I should know about for the work that I do." And like Michael was saying, sometimes I catch retweets that I wouldn't have caught before. I'm not checking other places; I'm not checking Reddit.
I rely on Twitter and the people who I follow: if there is a blog post that sounds important, I trust that I'm going to see it multiple times, until, "Okay, this is content related to something I'm working on, that I want to get better at." Then I'll go and look at it. My sources are mainly Twitter and YouTube, and it's funny, because I love blog posts, but I haven't been reading them, because it takes a long time to read a blog post.
I give preference to video, because I can just listen while I'm doing stuff, so I have sort of stopped reading blog posts, which is sad. I also want to start writing posts, because it's so helpful for me to ingrain the things that I'm learning, and hopefully it will be helpful to other people too. But in any case, go, Duffie.
[0:16:02.8] DC: A number of people that I follow – I have been cultivating my feed pretty carefully, trying to get a broad perspective of the technical stuff that's happening. But I've also been trying to develop my persona on Twitter a bit more, right? I'm actually trying to build my audience there. What's interesting is that, to that end, what I've been doing is trying to amplify voices that I think aren't heard enough out there, right?
If I see an article by somebody who is just coming into Kubernetes, or just coming into distributed systems, and they've made an effort to really lay out something they found really interesting, about pretty much anything, I'm like, "Okay, that's pretty awesome," and I'll try to amplify that, right? Sometimes I even get involved – not directly in public on Twitter, but I'll offer to help edit, or provide whatever guidance I can, around that sort of stuff.
If I see people having a difficult time with a particular project or something like that, I'll reach out privately and say, "Hey, can I help you with it, so you can go out there and do a great job," you know? That is something I love to do. And I think your point about not necessarily going to Twitter for the deep-knowledge stuff, but more just making sure that you have a broad enough awareness of what's happening in different ecosystems that you're not surprised when things change, right?
A couple of other people that I follow: Akira Asuta – I can't say enough about that person. They are amazing; they have been doing incredibly deep security work as it relates to containerization for quite a while. I'm always learning brand-new things when following folks like that. I've been getting more interested in InfoSec Twitter lately, learning how people approach that problem.
Also some of the bias around that, which has been pretty interesting – both the bias against people who are in InfoSec, which seems weird to me, and also how InfoSec approaches a problem: do they approach it as a learning experience, or do they approach it as an attack experience?
It’s been kind of fascinating to get in there.
[0:18:08.1] OP: You know, I use Twitter as well for some of this stuff, but books are a resource too – in my head, at the opposite end of the scale. I obviously don't read as many books as I read Twitter feeds, right? With Twitter, you can digest the whole of the stuff, whereas with books – I know I'm only going to read maybe one or two books a year.
As I said before, blog posts seem to take up my reading time, and books tend to be for airplanes and stuff. They're just two opposite resources for me, but I find the content of books is probably the stuff I digest more, because it's kind of like, I don't know, back to the old days: it's a physical thing in hand, and I can read it and digest it a bit more than the throwaway stuff that keeps streaming through on Twitter.
Because, to be honest, I don't know what's on Twitter – who is a person to listen to and who is not. I just try to form my own opinions, and then, again, it gets a bit overwhelming, because it's a lot of content just streaming through continuously, whereas a book is just one source of information, a bit more personal, that I can digest a bit more.
[0:19:18.1] JR: Any particular book recommendation in 2019, Olive, that you found particularly interesting?
[0:19:23.5] OP: I'm still reading, and it's on the list for the episode notes actually: Programming Kubernetes. I want to get into that CRD sort of mindset a bit. I think that's an area that's interesting, and an area that a lot of people will want to use in their organizations, right? Because it gives you some of the extensibility of Kubernetes that's just not there out of the box, and everybody wants something that's not out of the box – always, in my experience.
[0:19:47.4] MICHAEL: I found Managing Kubernetes, I think it was, from Brendan Burns and some other folks, which was released, I think, at the end of last year. Super deep, and kind of the opposite of Programming Kubernetes – I like that one as well – in that it is more geared towards understanding architecture and operations.
Operational concepts –
[0:20:05.0] OP: They’re probably the two books I’ve read.
[0:20:08.4] MICHAEL: Okay.
[0:20:08.9] OP: One a year, remember?
[0:20:11.4] MICHAEL: Yeah.
[0:20:14.6] OP: Prolific reading.
[0:20:19.6] CC: I think if you know what you need to learn about cloud native or Kubernetes, there are amazing books out there, and if you are still exploring Kubernetes and trying to learn, I cannot recommend this book enough. If you are watching this on YouTube, you'll see the cover. It's called Kubernetes Best Practices, and it is about Kubernetes best practices, but what they did simultaneously, and maybe they didn't even realize it, is they gave a map for the entire thing.
You go, "Oh, these are all the elements in Kubernetes." Of course, it's saying, "Okay, this is the best way to go about setting this stuff up," and it's relatively thin, but I just think that going through this book, you get a really fast overview of the elements in Kubernetes. Then you can go to other books, like Managing Kubernetes, to go deep and understand all of the knobs and switches.
[0:21:24.6] DC: I want to bring it back to the patterns that we see in successful projects – projects that are approachable, projects that are out there that make it easy, or easier at least, to stay up to date with them. What are some of those patterns that you think are useful?
We've talked about having a couple of different entry points, like a weekly report mechanism – we talked about LWKD – but I don't think we got to talk about KubeWeekly, which is a weekly blog curated by a lot of the CNCF ambassadors. KubeWeekly is also broken up into different sections, and they're actually going out actively and trying to find articles about people using Kubernetes, and then posting those.
If you're interested in understanding how people are actually out there using it, then that's a great place to go find articles related to that. What are some other patterns out there that are useful for folks?
[0:22:27.6] DC: One that I really like: Kubernetes, for everyone listening, has this notion of special interest groups, SIGs oftentimes. They're focused on certain areas of the project. There are SIGs for networking and storage and the life cycles of clusters, and what's amazing – I try to watch them somewhat weekly; I don't always succeed.
They're all on YouTube, and if you go to the Kubernetes project YouTube channel, there are playlists for every SIG. A lot of times I'm doing work relating to the life cycles of clusters, so I'll open up the cluster lifecycle playlist and just watch the weekly meetings. While it doesn't always pertain completely to me, it lets me understand where the developers' and contributors' heads are at, and where they're headed with a lot of different things.
There’s a link to that as well if anyone wants to check it out.
[0:23:15.9] MICHAEL: Exactly, and to add to that: if you don't have the time to watch the videos, the meeting notes that these gentlemen and women put together are amazing. Usually I just scroll through, and if something triggers my interest, I go into the recording and watch it.
[0:23:28.7] OP: I almost feel like we should talk about tooling to handle all of this stuff. For example, right now, I think I have 200 tabs open. I just started learning about some Chrome extensions to manage tabs. I haven't really started using them, but I need to; I don't have a good system. My system is: open a video that I'm pretty sure I want to watch, and just get to that tab eventually, until something happens, my Chrome goes bust, and I lose everything.
I wanted to mention that, when we say watch YouTube, some things you don't need to sit there and actually watch; you can just listen. And if you pay the five bucks for YouTube Premium – I don't get a commission, you people, I'm just saying – for me, it's so helpful. I can turn the screen off, put my phone in my pocket, and keep listening without having to have the phone open and on the whole time. It's very handy.
It's just like listening to a podcast. I also listen to podcasts most days.
[0:24:35.1] MICHAEL: For tooling: since I'm mostly on Twitter, and by the time I started using Twitter they didn't have this bookmark function, I was basically abusing likes, or favorites as they were at the time, to bookmark. What I realized later is that my bookmarks grew – well, my likes grew.
I wanted to go back and find something, but through the Twitter search that was just impossible. So I built a tiny little Go tool – kind of my first exercise there – to just parse my likes, and then use jq, because it's all JSON, to query and manipulate the stuff. I use it almost every day, because I'll be like, there was a talk or a blog post about scheduling, and I just query for scheduling in the likes.
I'm sure there's a better tool or way of doing that, but for me, that's my tool, because that's my workflow.
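As a hypothetical reconstruction of the kind of tool Michael describes, the sketch below parses an exported likes file and greps the tweet text for a keyword. The file name and JSON shape are assumptions about what a likes export might look like, not a documented format.

```go
// Sketch: search a (hypothetical) exported likes file by keyword.
// Usage: likes scheduling
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Like mirrors an assumed export shape; a real Twitter archive differs.
type Like struct {
	Like struct {
		TweetID  string `json:"tweetId"`
		FullText string `json:"fullText"`
	} `json:"like"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: likes <keyword>")
		os.Exit(1)
	}
	data, err := os.ReadFile("likes.json") // assumed: a plain JSON array
	if err != nil {
		panic(err)
	}
	var likes []Like
	if err := json.Unmarshal(data, &likes); err != nil {
		panic(err)
	}
	query := strings.ToLower(os.Args[1])
	for _, l := range likes {
		if strings.Contains(strings.ToLower(l.Like.FullText), query) {
			// Print a link-style reference plus the matching text.
			fmt.Printf("https://twitter.com/i/web/status/%s\n  %s\n",
				l.Like.TweetID, l.Like.FullText)
		}
	}
}
```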
[0:25:27.6] DC: Both of the blogs that you mentioned, KubeWeekly and LWKD, have the ability to take submissions – you can submit stories to them. If you come across things that are interesting and you'd like to put them up on an aggregator somewhere, this is one of the ways to solve that problem, because at least if it gets curated onto an aggregator, you know you can go back to the aggregator to see it. So that helps.
Some other ones I've seen out there: I've seen a number of interesting startups starting to put out a podcast, and I have started to see a number of people engaging with Twitch, and also doing things like what we do with TGIK.io, which is to have some kind of a weekly thing where you are just hacking on stuff live and exploring it. With TGIK, the idea is that it doesn't have to be related to anything we are doing at VMware – just anything to do with the community – but obviously, if you are working for one of the smaller companies, like Honeycomb or some other company –
A smaller kind of startup – you can really just make people more aware of it, because for some reason people love to watch others code. They love to understand how people go through that, what their thought process is, and I find it awesome as well. It is amazing to me how big a draw that is, you know?
[0:26:41.1] OP: And are there lots of them out there, Duffie? Is that an easily searchable thing, or how do you know those things are going on?
[0:26:48.4] DC: Oddly enough, Twitter, most of the time, yeah. I mean, most of the time I see that kind of stuff happening on Twitter – somebody, me or a number of other people, will say, "Hey, I am going to do a live stream during this period of time on this." And I have actually seen a number of people doing live streams of CTFs, which are capture-the-flags. That one has really been fascinating to me, because it shows how people think about approaching the security of an application.
Like where do they look for weak spots and how do you determine, how do you approach that kind of a problem, which is fascinating. So yeah, I think it is important to remember that like you know, you are not the only one trying to keep up to date with all of this stuff, right? The one thing we all have said pretty consistently here is that it is a lot, and it is not just Kubernetes, right? Like any fast moving project. It could be your favorite Ruby module that has 200 contributors, right?
It doesn’t matter what it is, it is a lot to keep a track of, and it represents some of that cognitive overheads that you have to think about. That is a lot to take on. Even if it is overwhelming, if you find value in being up to date with these things, just figure out – there are so many resources out there that address these different audiences and figure out what the right measure for you is. You don’t have to go deep on the code on everything.
Sometimes it might be better to just try and find a source of information that gives you a high enough of a view. Maybe you are looking at the blog posts that come out on Kubernetes.io every release and you are just looking at the release notes and if you just read the release notes every release, that is already miles ahead of what I have seen a lot of folks out there when they are starting to ask me questions about how do you keep up to date.
[0:28:35.9] JR: I’m curious, we have been talking a lot about keeping up as an individual. Do you all have strategies for how you help, let’s say your overall team, keep up with all the things that are going on? To give an example, Duffy, Olive and myself, at least at one point, were on the same team and we’d go out to disparate customers and see all of these different new things that they are trying to do or new projects that they are using.
So we’d have to think about how do we get together and share that internally to make sure we are bringing the whole team along with what is going on in the ecosystem especially from a customer perspective. I know one of the ways that we do that is having demos and things of that nature that we share weekly. Are there other strategies that you all use with your teams to kind of share interesting information and news?
[0:29:25.5] MICHAEL: The way we share in our team – and we are a small team – is mostly Slack. We pre-filter: if there is stuff that I think is valuable for me but probably not for the whole team, obviously I am not going to share it, but if it is related to something that the team is working on, then I will share it on Slack. We don’t have any formal way, though. I know people use weekly reports or other platforms to distribute, but we just use Slack.
[0:29:53.0] DC: I think one of the patterns that we had at [inaudible 0:29:54] that I thought was actually super helpful was that we would engage in a conversation: “I learned a cool new thing about whatever today,” and we would start a Zoom call around that. People could join if they wanted to, to be a part of the live discussion or not, and if they didn’t, they would still be able to see a recorded Zoom pop up in the channel later on.
So even if your time zones don’t line up – like I know it is 2 AM or 3 AM or something like that for Olive right now – you can still go back to those recorded sessions, and you’ll just see them in your daily Slack stuff. You would be able to see, “Oh, there was a conversation about whether you should deploy Kubernetes across availability zones or not. I would like to go see that,” and see what the inputs were, and so that can be helpful.
[0:30:42.5] JR: Yeah, that is a super interesting observation. It is almost like remote-first teams that are used to these processes of recording everything and putting it in a Google doc are more equipped for that information sharing, perhaps, than the water cooler conversations you’d have in the office.
[0:30:58.5] OP: And on Slack or any communication tool, we have different channels, because we are all in lots of channels, and having channels dedicated to a particular subject is absolutely the way to go. Otherwise – in my previous company, there seemed to be one main channel that all the architects used to discuss everything on, and sometimes you’d join and you’re like, “What is everybody talking about?”
There would be literally about a hundred messages on some sort of theme that I had never heard of. So you come away from that thinking, “That is the main channel. Are there messages in the middle that I missed that were just normal discussions, as opposed to ones around this technical stuff?” And so it made me a bit sad, right? I would be like, “I haven’t understood something, and there is a whole load of stuff on this channel that I don’t understand.”
But it is the kind of central channel for everyone. So you end up looking into the things they are discussing and then realizing that it is not really related to anything I need to know about today or next week. It might be something for the future, but I’ve got other stuff to focus on. So my point is that those communication channels can sometimes make me feel a little bit behind the curve, or very reactive, trying to jump on things that actually have nothing to do with me right now, wasting my time slightly and messing with my head a little bit, when I really need to try and focus on my own stuff. Putting the right content in the right channel, at least at a high level, helps me decide whether I want to look at that channel today, and stuff that belongs in a subject channel isn’t lost in a conversation channel. So the organization of where that content lives is important to me.
[0:32:37.6] CC: I am so on the same page with you, Olive. That is the way my brain works as well. I want to have multiple channels, if we are talking about Slack or any chat tool, but some people have such an aversion to multiple channels. They really have a hard time dealing with what they feel is too many channels; it tests their threshold. So I am always mindful of that too – it has to work for everybody – but if it were up to me, there would be one channel per topic, so I know where to focus.
But you said something that is so interesting. Like you were saying in the context of multiple channels: I need to pay attention to this this week, as opposed to, I don’t need to look at this until some time in the future. How do we even decide what to focus on that is useful for us in the moment, versus what would be good to know but that we don’t need right now?
I am super bad at this. When I see something that is going to give me the fundamentals, even when I have other priorities, I always want to consume it to learn the fundamentals, because I am thinking of the long-term payoff. But then I neglect precisely what I need to know to do my work in the moment, and I have to fish for and focus on the in-the-moment things. Anybody else have a hard time?
[0:34:04.5] DC: You are not alone on that, yeah.
[0:34:06.7] CC: It is terrible.
[0:34:08.3] MICHAEL: Something that I wish I would do more often, as a good citizen: probably 90% of my time is spent not writing but reading, maybe even more, and then I share on Twitter. The most successful tweets, in terms of retweets or likes, are the ones where I do TLDRs – “too long; didn’t read” – or some screen captures, for people who don’t have the time. They might want to read the article, but they don’t have the time.
But if you put out a TLDR, either as a tweet or a thread, a lot of people will jump onto it, because they can easily digest it, and they can still read the full article if they want. That is something I learned, and it is pretty – what is the right word? Helpful to my followers and the community, but I just don’t do it that often, unfortunately. If I am writing and summarizing, I kind of remember the material. That is how the brain works. It is a nice side effect.
[0:35:04.9] DC: I was saying, this is definitely one of those things where you can be the change you want to see, you know?
[0:35:08.6] MICHAEL: Yeah, I know.
[0:35:10.0] DC: This is awesome. I would also say that what you just raised, Carlisia, is a super valid point. Not everybody’s brain works the same way, right? There are people who are neurodivergent. There are people who think very linearly and are very comfortable with that, and there are people who don’t. So it is a struggle, I think, regardless of how your brain is wired, to understand how to prioritize the attention you will give any given subject.
In some cases, your brain is almost wired against that whole idea; you are just not set up for success when it comes to figuring out how to prioritize your attention.
[0:35:49.0] CC: You hit the nail on the head. We are so set up for failure in that department, because there are so many interesting conversations, and you want to hop in, you want to be a part of the conversation and part of the group and socialize. Our work is so isolating; to really put our heads down and just work can be so isolating. So it is great to participate in conversations out there, even if it is only via Twitter. I mean, obviously we are very biased towards Twitter here in this group.
But not everybody is even on Twitter, so keep in mind that we are cognizant of that. In any case, I don’t know what the answer is, but I am always trying to cut down on those social activities that seem so appealing. I haven’t worked out how to do that.
[0:36:43.9] JR: I am in the same boat. In 2020, I am hoping to let more of that go, and to your point, it is not that there is no value in it. It is just, I don’t know, I am not deriving the same amount of quality out of it, because I am so multiplexed all over the place, right? So we’ll see how it goes.
[0:36:59.9] CC: Oh, if any listener has opinions – and obviously it seems that all of us are helpless in that department – share with us, please.
[0:37:12.5] DC: It is a tricky one. I think it is also interesting because I find that when we talk about things like work-life balance, we think the idea is that you come to the end of the day, you go home, and you don’t think about work, right? Sometimes we think that work-life balance means you have a certain amount of time off that you can actually spend with your family and your friends or your community, what have you, and not be engaging on multiple fronts.
Just have that be your focus. But when it comes to things like keeping up, like learning or elevating your education, it seems like, for the most part – and this is just my own assumption, I am curious how you all feel about this – that doesn’t enter into it, right? Your personal time is totally on the table when it comes to how you keep up with these things. We don’t even think about it that way, right?
I know I personally don’t. I definitely have to do more and cut back on the amount of time that I spend reading. I am right there with Michael: 90% of my time when my eyes are open, they are either reading or staring up at the sky while I try to think about what I am going to write next. One way or the other, that is what I am doing.
[0:38:24.0] CC: Yeah.
[0:38:25.1] MICHAEL: I noticed last year, on my Twitter feed, more people than in the years before complaining about personal burnout. Reading those people’s tweets, I saw a pattern there. It was really like a spiral, and then they realized it and shut down – deleted Twitter from their phones, and messaging and other stuff – and I think I am at the point where I also need to do that when it comes to vacation, PTO, or whatever.
Because, as you said, Duffy, my free time is on the table when it comes to Twitter and catching up and keeping up. Work-life balance in my mind is – what is not work? Kubernetes is exciting, everything in the space is; what is not work there? I need to really get better at that, because I think I might end up in the same spiral of just soaking in more until I just –
[0:39:17.7] CC: Yeah, and like Josh said, it is not that there isn’t value. Obviously we derive huge value; that is why we’re on it. But you have to weigh things: what are your goals, and is that the best way toward your goals from where you are right now? Maybe Twitter is something you use for a while to ramp up your knowledge and ramp up the connections, because it is great for making connections, and then you step back and focus on something else, and you go in a cycle.
This is how I am thinking now. It is just like what Olive was saying: books are great, blog posts are great, and I absolutely agree with that. It is just that I don’t even have the time, and when I do have the time – I am reading code and reading things all day long, so it is really tiring for me at the end of the day to sit down and read more. I want to invest in learning how to speed read to solve that problem, because I read a lot of books and blog posts. So that is something on my list.
[0:40:22.8] DC: One of the biggest tips on speed reading I ever learned is that frequently, when you read, you think of saying the word. If you can get out of that habit – out of the habit of saying the word, even with your mouth – that will already increase the speed at which you read.
[0:40:39.5] CC: That is so interesting.
[0:40:41.4] DC: Yeah, that is a trippy one.
[0:40:43.1] CC: Because, I think being bilingual, saying the words really helps me understand things.
[0:40:52.9] DC: I think the point that we are all working around here is: there is a great panel that came out of KubeCon EU 2019, put on by Aaron Crickenberger, Esther McNamara, Stephen Augustus – these folks are all very high-output people. I mean, they do a lot of stuff, especially with regard to community, and they put on a panel about burnout and self-care. I think it is definitely worth checking that one out.
And actually also thinking about what keeping up means to you, and making sure that you are measuring that against your ability to sustain it, is incredibly important, right? I feel like keeping up is one of those subjects that is almost insidious, in that it is a thing we can just do all the time. We can spend any free moment we have – you are sitting on the bus, you are trying to keep up with things.
And because that happens so much, I feel like that is one of the ways we can end up feeling burnt out: we feel like we did a lot of things, but there was no real result to it. Keep in mind that that’s part of it, right? When you are thinking about how you keep up, make sure the value of your time is still something you have some cognizance about, some thought about. Is it worth it to me to spend these six hours reading everything, right?
Or would it be better for me to spend some amount of time just not reading, you know? Doing something else. Bake a cake, for crying out loud, you know?
[0:42:29.5] CC: Something that a lot of times we don’t allow ourselves to do – and I have decided to speak for everybody, I am sorry – is just do nothing, because our brains need that. We need to not be listening, not be reading, just nothing. Just sit and look at the ceiling; our brain needs that. Ideally, look at nature, look outside, get some air, go for a walk. We need that, because it recharges the brain. Anyway, one thing also that I want to bring up, maybe we can mention it real quick because we are coming up on the top of the hour:
How do projects really help their users keep up to date with what they are doing?
[0:43:18.4] DC: Well yeah, I mean, these are the different patterns we’ve been talking about. I think the blog posts help. I like the idea of having blogs that are targeted towards different audiences. I like the idea of having an aggregator, especially for a big project. Obviously Kubernetes is such a huge ecosystem that you have things like KubeWeekly, and I know there are actually quite a number of things out there that try to do this.
But if we can kind of agree on one – KubeWeekly I think is a pretty good one, because it is actually run by the CNCF, so it falls within that sort of governance model – having an aggregator where you can produce content or curate content as it relates to your project is helpful. And then office hours, I think, are also helpful, to Josh’s point. Office hours and SIG hours are very similar things; office hours are where you hear how the developers think about what’s happening in the space.
It is an opportunity for you as an end user to show up and ask questions. Those sorts of patterns, I think, are all incredibly helpful things for a project to offer.
[0:44:17.8] OP: Yeah, and summary articles, the sort of TLDRs that Michael mentioned earlier – I think I need more of those things in my life, because I do a lot of reading. I think my brain is a bit weird in that I need to read something about five or six different times, from five or six different articles, for it to frame in my head.
So what I am trying to do for 2020 is this: if I think somebody knows all about a topic, and it would save me reading those five, six, seven articles, and if that person has the time, I try to reach out to them and say, “Listen, have you got 20 minutes or so to explain this topic to me? Can I ask you questions about it?” It saves me, saves my eyes reading the screen, and it saves me time. I just need a TLDR summary of a project or a feature, just so I know what it is all about in my head and can talk fairly confidently about it.
If I need to get down into the weeds, then there is more reading to do, maybe the code on the technical side. But sometimes I can’t figure out what a feature means and what its use case is in the real world, and I have to read through lots of articles – sometimes vendor-specific ones, which have a different slant than maybe an independent one – and trying to marry those bits up in my head is hard for me, and there is a wealth of information out there.
So if you are interested in a topic and there are hundreds of articles, and you start reading four or five that are all slightly different, eventually you figure it out and you are confident you understand what that product is about. But it has taken a long time to get there, and a lot of reading time. So TLDRs really work, and I think, as Josh mentioned before, we have this thing internally where we do bench demos.
And that is like a TLDR and a show-and-tell, really quickly: “This is what this does, this is why we need to know about it, and this is why our customers need to know about it. The end,” you know? That’s really, really useful, because it saves a whole bunch of people a whole bunch of time figuring out, A, whether they need to know about it and, B, actually understanding that product or feature by the end of the five or 10 minutes these typically take. So they are very useful short snippets of information. Maybe we are back to Twitter.
[0:46:37.8] JR: Similar to the idea of giving a demo, Olive, you made me think of something, and that is that one of the ways I keep up with the space is actually through writing, along with reading. This admittedly takes time, and the whole quality-of-life conversation comes in, but using writing to help develop your thoughts, aggregate all of these crazy inputs, and try to be somewhat concise – which I know I struggle with – around something I’ve learned.
It’s helped me a ton, and then that asset becomes reusable, to share with other people the thing you wrote. So for people listening to this, I guess a call to action for 2020, if that is your style as well: consider starting to write yourself and becoming a resource, right? Because even if you are new to this space, you’d be amazed at just how much writing from your perspective can help other people.
[0:47:26.3] DC: I think another one that I have been impressed with lately is that a number of consumer companies, like Lyft and companies like that, have started to put out engineering blogs around how they are using technology to solve things. As a service provider, as somebody who is involved in the community of Kubernetes, I find those to be incredibly valuable, because I actually get to see how those things are being used.
At the same time, I see things like – we talked about KubeCon, which is a convention that happens every year. Obviously the project is large enough to support it, but there is actually an incentive, if you are a consumer of the project, to go and talk about how you are using it, right? It is incentivized in that your talk is more likely to be accepted if you are a consumer of the product than if you are somebody building it, right? We hear from the people building it all the time.
I love that idea of incentivizing people who are using this thing to get out there and talk about it, share their ideas about it, how they are using it, what problems it solved for them. That is critical, I think.
[0:48:31.0] CC: Can I also make a suggestion? Not so much following on the thread we are on just now, but on the general thread of this episode: if you have resources that you use to keep up with things, stop this recording right now and go give them a like, a follow, a thumbs up – show appreciation somehow – because of what Duffy said just now: “Oh, it is so helpful when I read a blog post.”
But the people who are writing want to know that. So give them some indication; it counts a lot. It takes a lot of effort to sit down and write something or produce a podcast, and if you derive any benefit from it, show appreciation. It motivates people to keep doing it.
[0:49:26.4] DC: Yeah, agreed.
[0:49:27.9] MICHAEL: I think that is a great point to maybe close off this episode, because it reiterates that just consuming and keeping up doesn’t mean you can’t give back, right? This is a way of giving back, which is really important to keep that flow and creativity.
[0:49:41.8] CC: I go through a lot of YouTube videos, and sometimes I just play one after the other, but I have been making a point of going back and liking them. Liking the ones that I like, obviously – I don’t like everything; things that I don’t like, I don’t listen to, but you know what I mean? It takes no effort, but it lets people know, “OK, you did a good job here.” By the way, go to iTunes and rate us, so we know you liked it, and it will help people find our show, our podcast. And if you are watching us on YouTube, give us a like.
[0:50:16.1] DC: All right, well unless anybody has any final thoughts, that is what we wanted to cover this session. So thank you all very, very much and I look forward to seeing you next week.
[0:50:25.3] MICHAEL: Bye-bye.
[0:50:26.3] CC: Thank you so much.
[0:50:27.4] OP: Bye.
[0:50:28.1] JR: Bye.
[END OF EPISODE]
[0:50:28.7] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
Do you know what cloud native apps are? Well, we don’t really either, but today we’re on a mission to find out! This episode is an exciting one, where we bring all of our different understandings of what cloud native apps are to the table. The topic is so interesting and diverse and can be interpreted in a myriad of ways. The term ‘cloud native app’ is not very concrete, which allows for this open interpretation. We begin by discussing what we understand cloud native apps to be. We see that while we all have similar definitions, there are still many differences in how we interpret this term. These different interpretations unlock some other important questions that we also delve into. Tied into cloud native apps is another topic we cover today – monoliths. This is a term that is used frequently but not very well understood and defined. We unpack some of the pros and cons of monoliths as well as the differences between monoliths and microservices. Finally, we discuss some principles of cloud native apps and how having these umbrella terms can be useful in defining whether an app is a cloud native one or not. These are complex ideas and we are only at the tip of the iceberg. We hope you join us on this journey as we dive into cloud native apps!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Bryan Liles
Josh Rosso
Nicholas Lane
Key Points From This Episode:
What cloud native applications mean to Carlisia, Bryan, Josh, and Nicholas.
Portability is a big factor of cloud native apps.
Cloud native applications can modify their infrastructure needs through API calls.
Cloud native applications can work well with continuous delivery/deployment systems.
A component of cloud native applications is that they can modify the cloud.
An application should be thought of as multiple processes that interact and link together.
It is possible resources will begin to be requested on-demand in cloud native apps.
An explanation of the commonly used term ‘monolith.’
Even as recently as five years ago, monoliths were still commonly used.
The differences between a microservice approach and a monolith approach.
The microservice approach requires thinking about the interface at the start, making it harder.
Some of the instances when using a monolith is the logical choice for an app.
A major problem with monoliths is that as functionality grows, so too does complexity.
Some other benefits and disadvantages of monolith apps.
In the long run, separating apps into microservices gives a greater range of flexibility.
A monolith can be a cloud native application as well.
Clarification on why Bryan uses the term ‘microservices’ rather than cloud native.
‘Cloud native’ is an umbrella term and a set of principles rather than a strict definition.
If it can run confidently on someone else’s computer, it is likely a cloud native application.
Applying cloud native principles when building an app from scratch makes it simpler.
It is difficult to adapt a monolith app into one which uses cloud native principles.
The applications which could never be adapted to use cloud native principles.
A checklist of the key attributes of cloud native applications.
Cloud native principles are flexible and can be adapted to the context.
It is the responsibility of thought leaders to bring cloud native thinking into the mainstream.
Kubernetes has the potential to allow us to see our data centers differently.
Quotes:
“An application could be made up of multiple processes.” — @joshrosso [0:14:43]
“A monolith is simply an application or a single process that is running both the UI, the front-end code and the code that fetches the state from a data store, whether that be disk or database.” — @joshrosso [0:16:36]
“Separating your app is actually smarter than the long run because what it gives you is the flexibility to mix and match.” — @bryanl [0:22:10]
“A cloud native application isn’t a thing. It is a set of principles that you can use to guide yourself to running apps in cloud environments.” — @bryanl [0:26:13]
“All of these things that we are talking about sound daunting. But it is better that we can have these conversations and talk about things that don’t work rather than not knowing what to talk about in general.” — @bryanl [0:39:30]
Links Mentioned in Today’s Episode:
Red Hat — https://www.redhat.com/en
IBM — https://www.ibm.com/
VMware — https://www.vmware.com/
The New Stack — https://thenewstack.io/
10 Key Attributes of Cloud-Native Applications — https://thenewstack.io/10-key-attributes-of-cloud-native-applications/
Kubernetes — https://kubernetes.io/
Linux — https://www.linux.org/
Transcript:
EPISODE 16
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.4] NL: Hello and welcome back. My name is Nicholas Lane. This time, we’ll be diving into what it’s all about: cloud native applications. Joining me this week are Bryan Liles,
[0:00:53.2] BL: Hi.
[0:00:54.3] NL: Carlisia Campos.
[0:00:55.6] CC: Hi everybody, glad to be here.
[0:00:57.6] NL: And Josh Rosso.
[0:00:58.6] JR: Hey everyone.
[0:01:00.0] NL: How’s it going everyone?
[0:01:01.3] JR: It’s been a great week so far. I’m just happy that I have a good job and am able to do things that make me feel whole.
[0:01:08.8] NL: That’s awesome, wow.
[0:01:10.0] BL: Yeah, I’ve been having a good week as well, doing some fun stuff after work. My soon-to-be in-laws are in town, so I’ve been visiting with them, and that’s been really fun. Cloud native applications: what does that mean to you all? Because I think that’s an interesting topic.
[0:01:25.0] CC: Definitely not a monolith. I think if you have a monolith running in the cloud, even if you started it out that way, I wouldn’t say it’s a cloud native app. I always think of containerized applications, and if you’re using a container system, it’s usually because you want smaller systems and more of them, that sort of thing.
Also, when I think of cloud native applications, I think of applications where the whole strategy of development, deployment, and shipping has been designed from scratch to run on a cloud system.
[0:02:05.6] JR: I think of it as applications that were designed to run in containers. And I also think about things like services – micro services or macro services, whatever you want to call them – where we have multiple applications that are made to talk not just with themselves but with other apps, and they deliver a bigger functionality through their coordination. Then, when I think of cloud native apps, I also think of apps that we are moving to the cloud – that’s a big topic in itself – applications that we run in the cloud. All of our new fancy services and our SaaS offerings, a lot of these are cloud native apps.
But then on the other side, I think about applications that are cloud native being tolerant to failure, and also able to report on their own health and who they’re talking to.
[0:02:54.8] CC: Gets very complicated.
[0:02:56.6] BL: Yeah. That is a side of it I hadn’t thought about.
[0:03:00.7] JR: Actually, the things that always come to mind for me are obviously portability, right? Wherever you’re running this application, it can run somewhat consistently, be it on different clouds – or even, a lot of people are running their own cloud, which is basically their on-prem cloud, right? That application is able to move across any of those places, and oftentimes containerization is one of the mechanisms we use to do that, right? Which is what we all stated.
Then I guess the other thing, too, is that this whole cloud ecosystem – be it a cloud provider or your own personal one – is oftentimes very API-driven, right? So, the applications may be able to take advantage of some of those APIs, should they need to, be it for scaling purposes or otherwise. It’s a really interesting model.
[0:03:43.2] NL: This question is interesting to me, because so far everyone is giving similar but also different answers. For me – I’m going to give a similar answer – a cloud native application is a lot of the things we said, like portable. I think of micro services when I think of a cloud native application. But it’s also an application that can modify the infrastructure it needs via API calls, right?
If your application needs a service or needs a networking connection, the application itself can manifest that via a cloud offering, right? That’s what I always thought of as a cloud native application. If you need, like, a database, the application can reach out to, like, AWS RDS and spin up the database. That was an aspect I always found very fascinating with cloud native applications. It isn’t necessarily the definition, but for me, that’s the part I think is quite interesting.
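[As a hedged illustration of that idea – not anything discussed on the show – an application, or an operator acting on its behalf, could call a cloud API to provision its own database. A minimal sketch with the AWS SDK for Go (v1); the region, identifier, and sizing are made-up values, not recommendations.]
```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/rds"
)

func main() {
	// Credentials come from the environment, as usual for the SDK.
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-west-2"), // assumption for the example
	}))
	client := rds.New(sess)

	// Ask the cloud for a small Postgres instance.
	out, err := client.CreateDBInstance(&rds.CreateDBInstanceInput{
		DBInstanceIdentifier: aws.String("dog-food-db"),
		Engine:               aws.String("postgres"),
		DBInstanceClass:      aws.String("db.t3.micro"),
		AllocatedStorage:     aws.Int64(20),
		MasterUsername:       aws.String("app"),
		MasterUserPassword:   aws.String("change-me"), // use a secret store in real code
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("DB status:", aws.StringValue(out.DBInstance.DBInstanceStatus))
}
```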
[0:04:32.9] BL: Also, CI/CD: cloud native apps are made to work well with our continuous integration and our continuous delivery/deployment systems as well. That’s another very important aspect of cloud native applications. We should be able to deploy them to production without typing anything in; it should be some kind of automated process.
[0:04:56.4] NL: Yeah, that is for sure. Carlisia, you mentioned something that I think is good for us to talk about a little bit, which is terminology. I keep coming back to that. You mentioned monolithic apps. What are monoliths, then?
[0:05:09.0] CC: I am so hung up on what you just said – can we table that for a few minutes? You said a cloud native application, for you, is an application that can interact with the infrastructure, and maybe an example is the database. I wonder if you have an example, or if you could expand on that. I want to see if everybody agrees with it; I’m not clear on what that even is. Because as a developer – which is always my point of view, it’s what I know –
It’s a lot of responsibility for the application to have. For example, when I think cloud native – and maybe I’m going off on a tangent here – we have Kubernetes. Isn’t that what Kubernetes is supposed to do, glue it all together, so that the application only needs to know what it needs to do? Spinning up an entire system is not one of the things it would need to do?
[0:05:57.3] BL: Sure. Actually, I was going to use Kubernetes as my example of a cloud native application. Because Kubernetes is what it is – an app, right? It can modify the cloud that it’s running in. So, if you have Kubernetes running in AWS, it can create ELBs, elastic load balancers. It can create new nodes. It can create new databases if you need, as I mentioned. Kubernetes itself is my example of a cloud native application.
I should say that that’s a good callout. My example of a cloud native application isn’t necessarily a rule; it’s not that all cloud native applications have to modify the cloud in which they exist. It’s more that they can modify it. That is a component of a cloud native application, Kubernetes being an example. Things like operators inside of Kubernetes, too – the Rook operator will create storage for you; when you use Rook to create a Ceph cluster, it will also spin up the ELBs behind it, or at least I believe it does – that kind of functionality.
[0:06:57.2] CC: I can see what you're saying because for example, if I choose to use the storage inside something like Kubernetes, then you will be required of my app to call an SDK and connect so that their storage somehow. So, in that sense I guess, you are using your app. Someone correct me if I’m wrong but that’s how the connection is created, right? You just request – but you’re not necessarily saying I want this thing specific, you just say I want this sort of thing like which has their storage and then you define that elsewhere. So, your applications don’t need to know details bit definitely needs to say, I need this. I’m talking about again, when your data storage is running on top of Kubernetes and not outside of it.
[0:07:46.4] BL: Yeah.
[0:07:47.3] NL: That brings up an interesting part of this whole term, cloud native app. Because, like everything else in this space, our terms are not super concrete, and an interesting piece of this is: does an application have to map one-to-one with a running process? What is an application?
[0:08:06.1] NL: That is interesting, because you could say a serverless app – or a serverless rule, whatever serverless really is; I guess we can get into that in another episode – are those cloud native applications? They’re not just running anywhere.
[0:08:19.8] JR: I will punt on that, because I know where my boundaries are, and that is definitely not within them. But the reason I bring this up is that a little while ago – it’s probably a year ago – in Kubernetes [inaudible 0:08:32] apps, we actually had a conversation about what an application was. And the consensus from the community and from the group members was that an application could be made up of multiple processes. So, let’s say you were building this large SaaS service because you’re selling dog food online; your application could be your dog food application.
But you have inventory management. You have a front end; maybe you have an ad service, a shipping manager, and things like that. A sales tax calculator. Are those all applications, or is it one application? That is the piece about cloud native applications, because what we found in Kubernetes is that the way we’re thinking about applications is that an application is multiple processes that can be linked together, and we can tell the whole story of how all of those interact and work.
Just something else, another way to think about this.
[0:09:23.5] NL: Yeah, that is interesting. I never really considered that before, but it makes a lot of sense, particularly with the rise of things like gRPC and the ability to send dedicated, well-codified messages to different processes. That gives rise to things like this multi-process application.
[0:09:41.8] BL: Right. But going back to your idea around cloud native applications being able to commandeer the resources they need – that’s something that we do see. We see it within Kubernetes right now. I’ll give you, above and beyond the example that you gave: whenever you create a StatefulSet in Kubernetes, the operator behind the StatefulSet actually goes and provisions a PVC for you. You requested a resource, and whenever you change the number of instances from one to, like, five, guess what? You get four more PVCs.
Just think about it; that is actually something that is happening. It’s a little transparent to people. But I can see, to that point, that we’re just requesting a new resource, and if we are using cloud services – or our cloud native services – to watch our applications, I could see us asking for this on demand, or even a service like a database or some other type of queuing thing on demand.
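[To make the PVC idea concrete: this is roughly what “requesting a resource” looks like against the Kubernetes API, sketched with client-go. It assumes in-cluster credentials, a “standard” StorageClass, and a pre-1.29 client-go where the claim’s resources field is still a corev1.ResourceRequirements; the names are made up.]
```go
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes we are running inside the cluster with a service account.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// "I want this sort of thing": 1Gi of ReadWriteOnce storage.
	// Which volume actually satisfies it is defined elsewhere, by the class.
	storageClass := "standard"
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "orders-data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &storageClass,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"),
				},
			},
		},
	}
	created, err := clientset.CoreV1().PersistentVolumeClaims("default").
		Create(context.TODO(), pvc, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("requested volume claim:", created.Name)
}
```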
[0:10:39.2] CC: When I hear things like this, I think, “Wow, it sounds very complicated.” But then I start to think about it, and I think it’s really neat, because it is complicated, but the alternative would have been way more complicated. This is sort of how it’s done now. It’s really hard to go into details in a one-hour episode; we can’t cover the how. Conceptually, we are throwing a lot of words out there to conceptualize it, but we can also try to talk, in a conceptual way, about how it is done in a non-cloud native world.
[0:11:15.3] NL: Yeah, I kind of want to get back to the question I posed before: what is a monolithic app, what is a non-cloud native app? Not all non-cloud native apps are monoliths, but this is actually a term I’ve heard a lot, and I’ll be honest: I have an idea of what a monolithic app is, but I don’t think I have a very good grasp of it.
We talked a bit about what a cloud native app is. What is a non-cloud native app, or what came before cloud native applications? What is a monolith?
[0:11:39.8] CC: I’m personally not a big fan of monoliths. Of course, I’ve worked with them, but once micro services started becoming common and I started developing in that mode, I am much more a fan of breaking things down, for so many different reasons. It is a controversial topic, for sure. But to go back to your question: with a monolith, basically you have an app – this sort of goes to what Bryan was saying, like, what is an app? If you think of an app as one thing: Amazon is an app, right?
It’s an app that we use to buy things as consumers. And, you know, the other part is the cloud. But let’s look at it as an app that we use to buy things as consumers. We know it’s broken down into so many different services: there is the checkout service, there is the cart service. I’m imagining – I can imagine, though – the small services that compose that one Amazon app.
If it was a monolith, those services – those things that are different systems talking together – the whole thing would be in one code base. It would reside in the same code base, and it would be deployed together. It would be shipped together. If you make a change in one place and you need to deploy that, you have to deploy the whole thing together.
You might have teams working on separate aspects, but they’re working against the same code base. And maybe because of that, it lends itself to teams not really specializing in separate aspects, because everything is together, so you might make one change that impacts another place, and then you have to know that part as well. So there is a lot less specialization and separation of teams as well.
[0:13:32.3] BL: Maybe to give an example from my experience – and I think it aligns with a lot of the details Carlisia just went over. Even going five years back, my experience at least was: we’d write up a ticket and ask somebody to make server space for us, maybe run [inaudible 0:13:44] on it, right? We’d write all this Java code and package it into these things that run on a JVM somewhere, right? We would deploy this whole big application, you know? Let’s call it that dog food app, right?
It would have maybe even a state layer, and it would have the web server layer – all these different pieces running together, this big code base, as Carlisia put it. And we’d deploy it. You know, that process took a lot of time and was very consuming, especially when we needed to change stuff. We didn’t have all these modern APIs and these kinds of decoupled applications, right?
But then, over time, we started learning more and more about the notion of isolating each of these pieces or layers, so that we could have the web server isolated in its own container or unit, and then the state layer and the other layers isolated as well – the micro service approach, more or less.
And then we were able to scale independently, and that was really awesome. So we saw a lot of gains in that respect. We basically moved our complexity to other areas: we took the complexity that used to all happen in the same memory space, and we moved a lot of it into the network, with these new protocols for how the different services talk to one another.
It’s been an interesting thing, seeing the monolith approach and the micro service approach, and how a lot of these micro service apps are, in my opinion, much more cloud native aligned, if that makes sense – just seeing how the complexity moves around in that regard.
[0:15:05.8] CC: Let me just say one more thing, because it’s actually the aspect of micro services that I like the most in comparison – you know, the aspect of monoliths that I appreciate the least, let’s put it that way. With a monolith, because design is hard, it is so easy to couple different parts of your app with other parts of your app, and end up with coupled code and coupled functionality. When you break this into micro services, that is impossible.
Because you’re working with separate code bases, you are forced to think about what your interface is. You’re always thinking about the interface and what people need to consume from you; your interface is the only way into your app, into your system. I really like that it forces you to think about your API. And people will argue, “Well, you can put the same amount of effort into that with a monolith.” Absolutely, but in reality, I don’t see it. And like Josh was saying, it is not a walk in the park, but I’d much rather deal with those issues, those complexities that micro services create, than the challenges of running a big –
I’m talking about big monoliths, right? Not something trivial.
[0:16:29.8] JR: Let me distill how I look at monoliths and how they fit into this conversation. A monolith is simply an application – or a single process, in this case – that is running both the UI, the front-end code, and the code that fetches the state from a data store, whether that be disk or a database. That is what a monolith is.
The reasons people use monoliths are many, but I can actually think of some very good ones. If you have code reuse – let’s say you have a website with forms and you want to be able to reuse those form libraries, or you have data access and you want to be able to reuse that data access code – a monolith is great.
The problem with monoliths is that as functionality grows, complexity grows, and not at the same rate. I’m not going to say it is linear, but it’s not quite exponential either – maybe it’s n log n or something like that. But the problem is that at a certain point, you’re going to have so much functionality, you’re not going to be able to put it inside one process – see Rails. Rails is actually a great example of this, where we ran into the issue of putting so much application into a Rails source directory that when we tried to run it, we basically ended up with these huge processes. And we split them up.
What we found is that we could actually split out the front-end code into one process. We could split out the middleware – maybe multiple processes in the middle – and the data access layer into another process, and we could actually take advantage of multiple CPU cores or multiple computers. The problem is that splitting this out adds complexity. So, what if you have a [inaudible 0:18:15] – what I’m trying to say here, in a very long way, is that monoliths have their place. As a matter of fact, I still encourage people to start with a monolith. Put everything in one place. Whenever it gets too big, you split it out.
But in a cloud native world, because we’re trying to take advantage of containers, of cores on CPUs, of multiple computers, in the most efficient way, you want to split your application up into smaller pieces, so that your front end versus your middle layer versus your data access layer versus your data layer itself can run on as many computers and as many cores as possible – thereby spreading the risk and spreading the usage, because everything should be faster.
[0:19:00.1] NL: Awesome. That is some great insight into monolithic apps and the pros and cons of them – something I didn’t have before. Because I’ve only ever heard monolithic apps either praised or said in hushed tones, with a swear word directly after. So it’s interesting to hear the concept that each way you deploy your application is complex, but there are different tradeoffs, right?
It’s the idea of, “Why wouldn’t you want to turn your monolith into micro services?” Well, there’s so much more overhead, so much more yak shaving you have to do to get there, to take advantage of micro services. That was awesome; thank you so much for that insight.
[0:19:39.2] CC: I wanted to reiterate a couple of aspects of what Bryan and Josh said in that regard. One huge advantage – I mean, your application needs to be substantial enough that you feel like you need to do this and that you’re going to get some advantage from it. When you hit that point and you do it, when you’re breaking into services like Josh and Bryan were saying, you have the ability to increase your capabilities, your processing capabilities, based on the one aspect of the system that needs it.
So, if you have something that requires very low processing, you run that service with a certain level of capability. And for something like your orders process, your orders micro service, you increase the processing power much more than for some other part. When it comes to running this in the cloud native world – I think this is more an infrastructure aspect – my understanding is that you can automate all of that. You can determine, “Okay, I have analyzed my requirements based on history, and this is what I need,” tell the cloud native infrastructure that is what you need, and the automation will take care of bringing the system up to that if anything happens.
It is always going to be healing your system in an automated way, and this is something I don’t think gets talked about enough. We talk about, “Oh, things are split up this way and they run this way,” but doing it in an automated way makes all of the difference.
[0:21:15.4] NL: Yeah, that makes a lot of sense, actually. So basically, monolithic apps don’t give us the benefit of automation, or automated deployment, the way micro services and cloud native applications do.
[0:21:28.2] BL: Yes, and think about this: whenever you have five micro services delivering your application’s functionality and you need to upgrade the front-end code – the HTML, or whatever generates the HTML – you can actually replace that piece, and that doesn’t bring your whole application down. Even better, you can replace that piece one instance at a time, or two at a time, still have the majority of your application running, and maybe your users won’t even know at all.
So, let’s say you have a monolith and you are running multiple versions of it. When you take that whole application down, you literally take the whole application down: not only do you lose front-end capacity, you also lose back-end capacity. So, separating your app is actually smarter in the long run, because what it gives you is the flexibility to mix and match, and you can actually scale the front end at a different level than the back end.
And that is actually super important in [inaudible] land – and actually Python land and .NET land – if you’re writing monoliths. You have to scale at the level of your monolith, and if you scale at that level, then you have wasted resources. So smaller micro services, smaller cloud native apps, running in containers, will actually use fewer resources. (See the sketch below for what scaling just one piece looks like.)
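[A hedged sketch of scaling only one piece against the Kubernetes API, again with client-go; the namespace and the Deployment name “frontend” are assumptions for illustration, not anything named on the show.]
```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.TODO()
	deployments := clientset.AppsV1().Deployments("default")

	// Read the current scale of just the front end...
	scale, err := deployments.GetScale(ctx, "frontend", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	// ...and bump it, leaving the middle and data layers untouched.
	scale.Spec.Replicas = 10
	if _, err := deployments.UpdateScale(ctx, "frontend", scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("front end scaled to 10 replicas")
}
```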
[0:22:41.4] JR: I have an interesting question for us all. Obviously, a lot of cloud native applications look like these micro services we’re describing. Can a monolith be a cloud native application as well?
[0:22:54.4] BL: Yes, it can.
[0:22:55.1] JR: Cool.
[0:22:55.6] NL: Yeah, I think so. As long as the monolith can be deployed with the mechanisms we described, like CI/CD, or can take advantage of the cloud, I believe a monolith can be a cloud native application, sure.
[0:23:08.8] CC: I am glad you brought that up, because I was going to bring it up. I hear Bryan using “micro services” and “cloud native apps” interchangeably, and it makes it really hard for me to follow: okay, so what is not a cloud native application, what is not a cloud native service, and what is not a cloud native monolith?
So, to start this thread with the question Josh just asked, which also became my question: if I have a monolithic app running at a cloud provider, is that a cloud native app? If it is not, what piece of the puzzle needs to exist for it to be considered a cloud native app? And the follow-up question I am going to throw out there already is: why do we care? What is the big deal if it is or if it isn’t?
[0:23:55.1] BL: Wow, okay. Well, let’s see. Let’s unpack this. I have been using “micro service” and “cloud native” interchangeably, probably not to the best effect. But let me clear up something here about cloud native versus micro services. Cloud native is a big term, and it goes further than the application itself. It is not only the application; it is also the environment the application can run in. It is the process we use to get the application to production.
So, monoliths can be cloud native apps. We can run them through CI/CD. They can run in containers. They can take advantage of their environment. We can scale them independently. But if we use micro services instead, this becomes easier, because our surface area is smaller. So what I want to do is not use the term like that. “Cloud native applications” is an umbrella term, and I will never actually say “cloud native application”; I will always say “a micro service,” because it is a much more accurate description of the process that is running. Cloud native applications is more of the umbrella.
[0:25:02.0] JR: It is really interesting, because a lot of the times when we are working with customers, when we go out and introduce them to Kubernetes, we are oftentimes asked, “How do I make my application cloud native?” To what you are talking about, Bryan, and to your question, Carlisia, I feel like a lot of times people are a little bit confused, because sometimes they are actually asking us, “How do I break this legacy app into smaller micro services?” right?
But sometimes they are actually asking, “How do I make it more cloud native?” And usually our guidance, or the things we work with them on, is exactly that, right? It is getting that application containerized so we can make it portable, whether it is a monolith or a micro service, right? We are containerizing it. We are making it more portable. We are maybe helping them out with health checks, so that the infrastructure environment they are running in can tap into the application and know its health – whether that’s to restart it, with Kubernetes as an example.
We are going through and helping them understand those principles that I think fall more under the umbrella of cloud native, like you are saying, Bryan, if I am following you correctly, and helping them enhance their application. But it doesn’t necessarily mean splitting it apart, right? It doesn’t mean running it as smaller services. It just means following these more cloud native principles. It is hard to talk about – there I go continuing to say “cloud native,” right?
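[As a rough sketch of the health-check idea – not any specific customer’s code – an app can expose endpoints the platform can probe, for example with Kubernetes liveness and readiness probes. The paths and port below are common conventions, not requirements.]
```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// ready flips to true once startup work (cache warming, DB connections,
// and so on) is done; until then the app should not receive traffic.
var ready atomic.Bool

func main() {
	// Liveness: the process is up and serving; restart me if this fails.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// Readiness: only route traffic to me once this returns 200.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "warming up", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ready")
	})

	go func() {
		// Startup work would happen here; then we declare ourselves ready.
		ready.Store(true)
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```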
[0:26:10.5] BL: So that is actually a good way of putting it. A cloud native application isn’t a thing. It is a set of principles that you can use to guide yourself to running apps in cloud environments. And it is interesting – when I say cloud environments, I am not even particularly talking about Kubernetes or any type of scheduler. I am just talking about running apps on other people’s computers in the cloud; this is what we should think about, and it goes through those principles.
We use CI/CD; storage will most likely be ephemeral. Actually, you know what? That whole process, that whole virtual machine we are running on, is ephemeral too; everything will go away. So cloud native applications is basically a theory that allows us to be strategic about running applications where other people’s computers, storage, networking, and compute may go away. This is how we get our five-9s, or four-9s, of uptime – because we can actually do this.
[0:27:07.0] NL: That is actually a great point. The cloud native application is one that can confidently run on somebody else’s computer. That is a good stake in the ground.
[0:27:15.9] BL: I stand behind that and I like the way that you put it. I am going to steal that and say I made it up.
[0:27:20.2] NL: Yeah, go ahead. We have been talking about monoliths and cloud native applications. I am curious, since you all are developers, what is your experience writing cloud native applications?
[0:27:31.2] JR: I guess for greenfield projects, where we are starting from scratch and building this thing, it is a really pleasant experience, because a lot of things are sort of done for us. We just need to know how to interact with the API, or the contract, to get the things we need. So that is kind of my blanket statement. I am not trying to say it is easy; I am just saying it has become quite convenient in a lot of respects when adopting these cloud native principles.
Like the idea that I have a Dockerfile and I build this container and now I am running this app that I am writing code for all over the place; it’s become such a more pleasant experience than, at least in my experience years and years ago, dropping things into the Tomcat instances running all over the place, right? But I guess what’s also been interesting is it’s been a bit hard to convert older applications into the cloud native space, right?
Because, to the point Carlisia had started with around the idea of all the code being in one place, you know, it is a massive undertaking to understand how some of these older applications work. Again, I’m not saying that all older applications are monoliths, but my experience has been that they generally are. Their bigger code bases are hard to understand; it’s hard to know how they work and where to modify things without breaking other things, right?
When you go and you say, “All right, let’s adopt some cloud native principles on this app that has been running on the mainframe for decades,” right? That is a pretty hard thing to do. But again, for greenfield projects, I have found it to be pretty convenient.
[0:28:51.6] CC: It is actually easy, Josh. You just rewrite it.
[0:28:54.0] JR: Totally yes. That is always a piece of cake.
[0:28:56.9] BL: You usually write it in Go and then it is cloud native. That is actually the secret to cloud native apps. You write it in Go, you install it, you deploy it in Kubernetes, mission accomplished, cloud native to you.
[0:29:07.8] CC: Anything written in Go is cloud native. We are declaring that here, you heard it here first.
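Playing along with the joke for a moment: if there is a kernel of truth in it, it is that a Go service written for other people’s computers should assume the machine is ephemeral, as Brian described earlier. A minimal sketch of that habit, draining in-flight requests on SIGTERM, follows; the port and timeout are arbitrary choices for illustration.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Treat the machine as ephemeral: when the platform reclaims it, we
	// get a SIGTERM and a short grace period. Drain in-flight requests
	// instead of dying mid-transaction.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```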
[0:29:13.4] JR: That is a great question, like, how do we get there? That is a hard question, and not one that I can just wave a magic set of words over and say that we are there. But what I would say is that as we start thinking of moving applications to cloud native, we first need to identify applications that cannot be updated, and I can actually give you some. Your Windows 2003 applications — and yes, I do know some of you are running 2003 still.
Those are not cloud native and they never will be, and the problem is that you won’t be able to run them in a containerized environment. Microsoft says stop using 2003; you should stop using it. Other applications that won’t be cloud native are applications that require a certain level of machine or server access. We have been able to abstract GPUs, but if you’re working at the IO level, like you are actually looking at IO, or if you are looking at hardware interrupts –
Or you are looking at anything like that, that application will never be cloud native. Because there is no way that we can do that in a shared environment, which is most likely where your application will be running in the cloud. There is no way, first of all, that the hypervisor that is actually running your virtual machine wants to give you that access; that hardware is being shared with one to 200 other processes on that server.
So, applications that want low-level access or have real-time requirements, you don’t want to run those in the cloud. They cannot be cloud native. That still means a lot of applications can be.
[0:30:44.7] CC: So, I keep thinking: if I own a tech stack, and every once in a while I stop and evaluate whether I am squeezing as much tech as I can out of my system — meaning, am I using the best technology out there to the extent that it fits my needs? If I am that kind of person — say I am a decision maker, and even if I am also a tech person, I still would not have the full picture unless I am one of the architects.
And sometimes even the architects don’t have the entire vision. I mean, they have to talk to other architects who have a greater vision of the whole system, because systems can be so big. But at any rate, if I am an architect, or I own the tech stack one way or another, my question is: is my system a cloud native system? Is my app a cloud native app? I am not even sure that we have clarified enough for people to answer that. I mean, it is so complicated; maybe we did, hopefully we helped a little bit.
So basically, this would be my question: how do I know if I am there or not? Because my next step would be, well, if I am not there, then what am I missing? Let me look into it and see if the cost benefit is worth it. But if I don’t know what is missing, what do I look at? How do I evaluate? How do I evaluate if I am there, and if I am not, what do I need to do? We talked about this a little bit on episode one, in which we talked about what cloud native is in general, and now we are talking about apps.
And so, you know, there should be a checklist of things that a cloud native app should at least have. Like the 12-factor app: what do you need to have to be considered a 12-factor app? We should have a checklist, and I think satisfying the 12-factor checklist is part of being a microservice and a cloud native app. But I think there needs to be more. I just wish we had that — not that we need to come up with that list now, but it is something to think about. Someone should do it, you know?
[0:32:57.5] JR: Yeah, it would be cool.
[0:32:58.0] CC: Is it reasonable or not to want to have that checklist?
[0:33:00.6] BL: So, there are checklists that exist. I know that Red Hat has one. I know that IBM has one. I would guess VMware has one on one of our web pages. Now, the problem is they’re all different. What I do — and this is me trying to be fair here — The New Stack basically talks about things that are happening in cloud and tech. If you search for The New Stack and cloud native applications, there is a 10-bullet list. That is what I send to people now.
The reason I send that one rather than any vendor’s is because a vendor is trying to sell you something. They are trying to sell you their vision of cloud native, where they excel, and they will give you products that help you with that part — like CI/CD, “oh, we have a product for that.” I like The New Stack list, and actually, I Googled it while you were talking, Carlisia, because I wanted to bring it up. So, I will just go through the titles of this list and we’ll make sure that we make this link available.
So, there are 10 Key Attributes of Cloud-Native Applications. Packaged as lightweight containers. Developed with best-of-breed languages and frameworks — you know, that doesn’t mean much, but that is how nebulous this is. Designed as loosely coupled microservices. Centered around APIs for interaction and collaboration. Architected with a clean separation of stateless and stateful services. Isolated from server and operating system dependencies.
Deployed on self-service, elastic cloud infrastructure. Managed through agile DevOps processes. Automated capabilities. And the last one, defined, policy-driven resource allocation. And as you see, those are all very much up for interpretation or implementation. So, a cloud native app, from my point of view, tries to target most of these items and has an opinion on most of these items. A cloud native app isn’t just one thing. It is a mindset. Like I said before: I am running my software on other people’s computers, how can I best do this?
[0:34:58.1] CC: I added the link to our show notes. When I look at this list, I don’t see observability. That word is not there. Does it fall under one of those points? Because observability is another new-ish term that seems to be part and parcel of cloud native. Correct me here, people.
[0:35:19.1] JR: Actually, the eighth item, “Managed through agile DevOps processes” — they don’t explicitly talk about monitoring or observability. But for a person who is not developing the application, whether you have a DevOps team or you have an SRE practice, you are going to have to be able to communicate the status of the application, whether it be through metrics, logs, or — what’s the other one?
I am thinking — traces. So that, I think, is actually baked in; it is just not called out. To do proper DevOps you need some observability; that is how you get that status when you have a problem.
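As an illustration of the metrics leg of that triad, here is a minimal sketch in Go using only the standard library’s expvar package. In practice teams more often expose Prometheus-format metrics, so treat the endpoint and the metric name as illustrative assumptions.

```go
package main

import (
	"expvar"
	"fmt"
	"log"
	"net/http"
)

// A named counter; importing expvar also registers a /debug/vars
// endpoint on the default mux that serves all published variables as JSON.
var requestsServed = expvar.NewInt("requests_served")

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsServed.Add(1)
		fmt.Fprintln(w, "hello")
	})

	// Polling GET /debug/vars now returns the counter alongside runtime
	// stats, which a monitoring system can scrape for status.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```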
[0:35:57.9] CC: So, this is how obscure these things can be. I just want to point this out. It is so frustrating. Literally, we have item eight, which Brian — as the main developer here, so he is super knowledgeable — can look at and know what it means. But I look at that, and the words logs, metrics, observability, none of those words are there, and yet Brian knew that that is what it meant. And I don’t disagree with him. I can see it now, but why does it have to be so obscure?
[0:36:29.7] JR: I think a big thing to consider too is that it very much lands on a spectrum, right? Like something you were asking, Carlisia, is how do I qualify whether my app is cloud native, or what do I need to do? And you know, a lot of people in my experience are just adopting parts of this list, and that’s totally fine. Worrying about whether you fully qualify as a cloud native app — since we have talked about it as more of a set of principles — is something –
I don’t know if there is too much value in worrying about whether you can slap that label onto your app, as much as saying, “Oh, I can see our organization and our applications having these problems.” Like lacking portability when we move them across providers, or, going back to observability, not being able to know what is going on inside of the application and where the network packets are headed, and being late to see these things happening.
And as those problems come up, really looking at and adopting these principles where it is appropriate. Sometimes it might not be worth the engineering effort to adopt one of the more cloud native principles. You know, you just have to pick and choose what is most valuable to you.
[0:37:26.7] BL: Yes, and actually this is what we should be doing as experts, as thought-leaders, as industry movers and shakers. Our job is to make this easier for people coming behind us. At one time, it was hard to even start an application or start your operating system. Remember when we had to type LOAD commands, you know? Remember we had to do that back in the day in BASIC, on our Commodore 64s or our Apples or Apple IIs. Now you turn your computer on and it comes up instantly.
We click on an application and it works. We need to bring this whole cloud movement to that point, where, if you include these libraries and you code with these APIs, you get “automatic observability” — and I am saying that with air quotes — but you get the ability to monitor this thing in some fashion. If you use this practice and you have this stack, CI/CD should be super simple for you. And we are just not quite there yet.
And that is why the industry is definitely rotating around this, and that is why there has been a lot of buzz around cloud native and Kubernetes: people are looking at this to actually solve a lot of these problems that we’ve had, because they just haven’t been solvable when everybody’s stacks are too different. The reason Linux is, I think, ultimately successful is because it allowed us to do all of these things we liked and it worked on all sorts of computers.
And it got that mindset behind it, and companies behind it. Kubernetes could also do this. It allows us to think about our data centers as potentially one big computer, or fewer computers, that allows us to make sure things are running. And once we have this, now we can develop new tools that will help us with our observability, with getting software into production and upgraded, and where we need it.
[0:39:17.1] NL: Awesome. So, on that, we are going to have to wrap up for this week. Let’s go ahead and do a round of closing thoughts.
[0:39:22.7] JR: I don’t know if I have any closing thoughts. But it was a pleasure talking about cloud native applications with you all. Thanks.
[0:39:28.1] BL: Yeah, I have one thought, and it is that all of these things that we are talking about sound kind of daunting. But it is better that we can have these conversations and talk about things that don’t work, rather than not knowing what to talk about at all. So, this is a journey for us, and I hope you come along for more of it.
[0:39:46.3] CC: First I was going to follow up on Josh and say I am thoughtless. But now I want to follow up on Brian’s and say, no, I have no opinions. It is very much what Brian said, for me: the bridging of what we can do using cloud native infrastructure with what we read about it and what we hear about it. For people who are not actually doing it, it is so hard to connect one with the other. I hope that by being here and asking questions and answering questions — and hopefully people will also be very interactive with us –
And ask us to talk about things they want to know — that we can all try to connect the two, little by little. I am not saying it is rocket science and nobody can understand it. I am just saying that some people who don’t have a multi-background experience might have big gaps.
[0:40:38.7] NL: That is for sure. This was a very useful episode for me. I am glad to know that everybody else is just as confused about what cloud native applications actually mean. So that was awesome. It was a very informative episode for me and I had a lot of fun doing it. So, thank you all for having me. Thank you for joining us on this week of The Podlets Podcast. And I just want to wish our friend Brian a very happy birthday. Bye, you all.
[0:41:03.2] CC: Happy birthday Brian.
[0:41:04.7] BL: Ahhhh.
[0:41:05.9] NL: All right, bye everyone.
[END OF EPISODE]
[0:41:07.5] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
There are two words that get the blame more often than not when a problem cannot be root-caused: the network! Today, along with special guest Scott Lowe, we try to dig into what the network actually means. We discover, through our discussion, that the network is, in fact, a distributed system. This means that each component of the network has a degree of independence, and their combined complexity makes it difficult to understand the true state of the network. We also look at some of the fascinating parallels between networks and other systems, such as the configuration patterns for distributed systems. A large portion of the show deals with infrastructure and networks, but we also look at how developers understand networks. In a changing space, despite self-service becoming more common, there is still generally a poor understanding of networks from the developers’ vantage point. We also cover other network-related topics, such as the future of the network engineer’s role, the transferability of their skills, and other similarities between network problem-solving and development problem-solving. Tune in today!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Duffie Cooley
Nicholas Lane
Josh Rosso
Key Points From This Episode:
• The network is often confused with the server or other elements when there is a problem.
• People forget that the network is a distributed system, which has independent routers.
• The distributed pieces that make up a network could be standalone computers.
• The parallels between routing protocols and configuration patterns for distributed systems.
• There is not really a model for eventually consistent networks, particularly if they are old.
• Most routing patterns have a time-sensitive mechanism where traffic can be re-dispersed.
• Understanding that a network is a distributed system gives insights into other ones, like Kubernetes.
• Even from a developer’s perspective, there is a limited understanding of the network.
• There are many overlaps between developers and infrastructural thinking about systems.
• How can network engineers apply their skills across different systems?
• As the future changes, understanding the systems and theories is crucial for network engineers.
• There is a chasm between networking and development.
• The same ‘primitive’ tools are still being used for software application layers.
• An explanation of CSMA/CD, collisions, and their applicability.
• Examples of cloud native applications where the network does not work at all.
• How Spanning Tree works and the problems that it solves.
• The relationship between software-defined networking and the adoption of cloud native technologies.
• Software-defined networking increases the ability to self-service.
• With self-service on-prem solutions, there is still not a great deal of self-service.
Quotes:
“In reality, what we have are 10 or hundreds of devices with the state of the network as a system, distributed in little bitty pieces across all of these devices.” — @scott_lowe [0:03:11]
“If you understand how a network is a distributed system and how these theories apply to a network, then you can extrapolate those concepts and apply them to something like Kubernetes or other distributed systems.” — @scott_lowe [0:14:05]
“A lot of these software defined networking concepts are still seeing use in the modern clouds these days” — @scott_lowe [0:44:38]
“The problems that we are trying to solve in networking are not different than the problems that you are trying to solve in applications.” — @mauilion [0:51:55]
Links Mentioned in Today’s Episode:
Scott Lowe on LinkedIn — https://www.linkedin.com/in/scottslowe/
Scott Lowe’s blog — https://blog.scottlowe.org/
Kafka — https://kafka.apache.org/
Redis — https://redis.io/
Raft — https://raft.github.io/
Packet Pushers — https://packetpushers.net/
AWS — https://aws.amazon.com/
Azure — https://azure.microsoft.com/en-us/
Martin Casado — http://yuba.stanford.edu/~casado/
Transcript:
EPISODE 15
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.4] DC: Good afternoon everybody. In this episode, we’re going to talk about the network. My name is Duffie Cooley and I’ll be the lead of this episode and with me, I have Nick.
[0:00:49.0] NL: Hey, what’s up everyone.
[0:00:51.5] DC: And Josh.
[0:00:52.5] JS: Hi.
[0:00:53.6] DC: And Mr. Scott Lowe joining us as a guest speaker.
[0:00:56.2] SL: Hey everyone.
[0:00:57.6] DC: Welcome, Scott.
[0:00:58.6] SL: Thank you.
[0:01:00.5] DC: In this discussion, we’re going to try and stay away, like we do always, we’re going to try and stay away from particular products or solutions that are related to the problem. The goal of it is to really kind of dig in to like what the network means when we refer to it as it relates to like cloud native applications or just application design in general.
One of the things that I’ve noticed over time — and I’m curious what you all think — is that people are kind of of the mind that if they can’t root-cause a particular issue they run into, they’re like, “That was the network.” Have you all seen that kind of stuff out there?
[0:01:31.4] NL: Yes, absolutely. In my previous life, before being a Kubernetes architect, I actually used my network engineering degree to be a network administrator for the Boeing Company. Time and time again, someone would come to me and say, “This isn’t working. The network is down.” And I’m like, “Is the network down, or is the server down? Because those are different things.” Turns out it was usually the server.
[0:01:58.5] SL: I used to tell my kids that they would come to me and they would say, the Internet is down and I would say, “Well, you know. I don’t think the entire Internet is down, I think it’s just our connection to the Internet.”
[0:02:10.1] DC: Exactly.
[0:02:11.7] JS: Dad, the entire global economy is just taking a total hit.
[0:02:15.8] SL: Exactly, right.
[0:02:17.2] DC: I frequently tell people that the first distributed system I ever had a real understanding of was the network, you know? It’s interesting because it relies on the premises that I think a good distributed system should, in that there is some autonomy to each of the systems, right? They are dependent on each other and intercommunicate with each other, but fundamentally, when you look at routers and things like that, they are autonomous in their own way. There’s work that they do exclusive to the work that others do and exclusive to their dependencies, which I think is very interesting.
[0:02:50.6] SL: I think the fact that the network is a distributed system — and I’m glad you said that, Duffie — is what most people overlook when they start sort of blaming the network, right? Let’s face it, in the diagrams, the network’s always just this blob, right? Here’s the network; it’s this one singular thing. When in reality, what we have are 10 or hundreds of devices, with the state of the network as a system distributed in little bitty pieces across all of these devices.
And in no way, aside from logging in to each one of these devices, are we able to assemble what the overall state is, right? Even routing protocols — their entire purpose is to assemble some sort of common understanding of what the state of the network is. Melding together not just IP addresses, which are these abstract concepts, but physical addresses and physical connections, and trying to reason about them to make decisions about how we send traffic across. It’s far more complex than a lot of people understand; I think that’s why it’s just, “the network is down,” right?
When reality, it’s probably something else entirely.
[0:03:58.1] DC: Yeah, absolutely. Another good point to bring up is that each of these distributed pieces of this distributed system is, in itself, basically just a computer. A lot of times, I’ve talked to people and they were like, “Well, the router is something special.” And I’m like, “Not really. Technically, a Linux box could just be a router if you have enough ports to plug into it. Or it could be a switch if you needed it to be; just plug in ports.”
[0:04:24.4] NL: Another interesting parallel there is when we talk about routing protocols, which are a way of allowing configuration changes to particular components within that distributed system to be known about by other components within that distributed system.
I think there’s an interesting parallel here between the way that works and the way that the configuration patterns we have for distributed systems work, right? If you wanted to make a configuration-only change to a set of applications that make up some distributed system, you might go about leveraging Ansible or one of the many other configuration models for this.
I think it’s interesting because it represents sort of an evolution of that same idea, in that you’re making it so that each of the components is responsible for informing the other components of the change, rather than taking the outside approach where my job is to push a change that should be known about by all of these components down to them.
Really, it’s an interesting parallel. What do you all think of that?
[0:05:22.2] SL: I don’t know, I’m not sure. I’d have to process that for a bit. But I mean, are you saying the interesting thought here is that, in contrast to typical systems management where we push configuration out to something using a tool like Ansible or whatever, these things are talking amongst themselves to determine state?
[0:05:41.4] DC: Yeah, there are patterns for this inside of distributed systems today — things like Kafka and, you know, gossip protocols. Stuff like this actually allows all of the components of a particular distributed system to understand the common state, or the things that would be shared across them, and if you think about them, they’re not all that different from a routing protocol, right?
Like, the goal being that you give the systems the ability to inform the other systems in some distributed system of the changes that they may have to react to. Another good example of this, which I think is interesting, is when you have a feature behind a flag, right? You might have some distributed configuration model, like a Redis cache or a database somewhere, where you hold the running configuration of this distributed system.
And when you want to turn on this particular feature flag, you want all of the components that are associated with that feature flag to enable that new capability. Some of the patterns for that are pretty darn close to the way that routing protocol models work.
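Here is a minimal sketch, in Go, of the flag-behind-a-shared-store idea Duffie describes. The store is abstracted behind an interface rather than tied to Redis, and all names (FlagStore, Watch, "new-checkout") are illustrative assumptions; the point is that each component converges on the flag’s value by polling the shared store rather than having a change pushed to it.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// FlagStore abstracts wherever the shared running configuration lives
// (a Redis key, an etcd value, a config service, and so on).
type FlagStore interface {
	IsEnabled(name string) bool
}

// FeatureFlag caches the latest value each component has observed.
type FeatureFlag struct {
	name    string
	enabled atomic.Bool
}

// Watch starts a background poller so every component eventually
// converges on the same flag state, in the spirit of a routing protocol.
func Watch(store FlagStore, name string, interval time.Duration) *FeatureFlag {
	f := &FeatureFlag{name: name}
	go func() {
		for {
			f.enabled.Store(store.IsEnabled(name))
			time.Sleep(interval)
		}
	}()
	return f
}

func (f *FeatureFlag) Enabled() bool { return f.enabled.Load() }

// staticStore is a stand-in for a real backend, just for the demo.
type staticStore map[string]bool

func (s staticStore) IsEnabled(name string) bool { return s[name] }

func main() {
	flag := Watch(staticStore{"new-checkout": true}, "new-checkout", time.Second)
	time.Sleep(2 * time.Second) // let the watcher poll at least once
	fmt.Println("new-checkout enabled:", flag.Enabled())
}
```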
[0:06:44.6] SL: Yeah, I see what you’re saying. Actually, that makes a lot of sense. I mean, if we think about things like gossip protocols, or even consensus protocols like Raft, right? They are similar to routing protocols in that they are responsible for distributing state and then coming to an agreement on what that state is across the entire system. And we even apply terms like convergence to both environments: we talk about how long it takes a routing protocol to converge, and we might also talk about how long it takes for an etcd cluster to converge after changing the number of members in the cluster, things of that nature.
The point at which everybody in that distributed system — whether it be the network, etcd, or some other system — comes to the same understanding of what that shared state is.
[0:07:33.1] DC: Yeah, I think that’s a perfect breakdown, honestly, of pretty much every routing technology that’s out there. You know, if you’re treating the network as a computer, it takes a while, but eventually everyone will reconcile the fact that, “Yeah, that node is gone now.”
[0:07:47.5] NL: I think one thing that’s interesting — and I don’t know how much of a parallel there is in this one — is that with modern systems that we’re building at scale, frequently we can make use of things like eventual consistency, in which it’s not required, per se, for a transaction to be persisted immediately across all of the components that it would affect.
Just that they eventually converge, right? Whereas with the network, not so much, right? The network needs to be right, now and every time, and there’s not really a model for eventually consistent networks, right?
[0:08:19.9] SL: I don’t know. I would contend that there is a model for eventually consistent networks, right? Certainly not on, you know, most organizations’ relatively simple local area networks, right? But if we were to take it and look at something like a Clos fabric, right, where we have top-of-rack switches — and this is getting too deep for the non-networking folks that we know, right?
Where you take top-of-rack switches that are talking layer two to the servers, or the endpoints, below them, and they’re talking layer three across multiple links up to the top, right? To the spine switches. So you have leaf switches talking up to spine switches, and they’re going to have multiple uplinks. If one of those uplinks goes down, it doesn’t really matter if the rest of that fabric knows that that link is down, because we have equal-cost multipathing going across it, right?
In a situation like that, that fabric is eventually consistent, in that it’s okay if, you know, link number one from leaf A up to spine A is down and the rest of the system doesn’t know about that yet. But, on the other hand, if you are looking at network designs where convergence is being handled on active-standby links or something of that nature, or there aren’t enough paths to get from point A to point B until convergence happens, then yes, you’re right.
I think it kind of comes down to network design and the underlying architecture, and there are so many factors that affect that, and so many designs over the years, that it’s hard to generalize. I would agree from the perspective that if you have an older network and it’s been around for some period of time, you probably have one that is not going to be tolerant of a link being down; it will cause problems.
[0:09:58.4] NL: That’s another really great parallel in software development, I think. Another great example of that, right? Consider for a minute the circuit breaking pattern, or even, you know, most load balancer patterns, right? In which you have some way of understanding a list of healthy endpoints behind the load balancer, and are able to react when certain endpoints are no longer available.
I don’t consider that a pattern that I would relate specifically to eventual consistency. I feel like that still has to be immediate, right? We have to not send the new transaction to the dead thing. That has to stop immediately, right? And it does: in most routing patterns that are described by multipath, there is a very time-sensitive mechanism that allows for the re-dispersal of that traffic across known paths that are still good. And the amazing amount of work that protocol architects and network engineers go through to understand just exactly how the behavior of those systems will work –
Such that we don’t see traffic black-holed in the network for a period of time, right? Making sure we don’t send traffic into the trash for a period of time while things converge really has a lot going for it.
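For readers on the software side, here is a minimal circuit breaker sketch in Go, the pattern Nick mentions as the application analogue of withdrawing a dead path: after a few consecutive failures it fails fast instead of sending more traffic down a route it already knows is bad. The thresholds, cooldown, and type names are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Breaker opens after maxFails consecutive failures and then fails fast
// until the cooldown expires -- the software analogue of not routing
// traffic toward a link we already know is down.
type Breaker struct {
	mu       sync.Mutex
	fails    int
	maxFails int
	openedAt time.Time
	cooldown time.Duration
}

var ErrOpen = errors.New("circuit open: failing fast")

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.fails >= b.maxFails && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // fail fast: do not send traffic to the dead thing
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.fails == b.maxFails {
			b.openedAt = time.Now()
		}
		return err
	}
	b.fails = 0 // a success closes the circuit again
	return nil
}

func main() {
	b := &Breaker{maxFails: 3, cooldown: 5 * time.Second}
	flaky := func() error { return errors.New("backend down") }

	// The first three calls hit the backend; the rest fail fast.
	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(flaky))
	}
}
```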
[0:11:07.0] SL: Yeah, I would agree. I think the interesting thing about discussing eventual consistency with regards to networking is that even if we take a relatively simple model like the DoD model, where we only have four layers to contend with — we don’t have to go all the way to the seven-layer OSI model — even then, we could be talking about the rapid response of a device connected at layer two, but the less-than-rapid response of something operating at layer three or layer four, right?
In the case of a network, where we have these discrete layers that are intentionally loosely coupled — which is another topic we could talk about from a distribution perspective, right? — we might even see consistency, and the application of the CAP theorem, behave differently at different layers of the model.
[0:12:04.4] DC: That’s right. I think it’s fascinating how much parallel there is here. As you get into deep architectures around software, you’re thinking of these things as they relate to distributed systems, especially as you’re moving toward more cloud native systems, in which you start employing things like control theory and thinking about the behaviors of those systems in aggregate: you know, for some component of my application, can I scale this particular component horizontally or can I not, how am I handling state?
So many of those things have parallels to the network that I feel like it highlights what everybody has heard a million times: there’s nothing new under the sun. There are a million things that we could learn from things that we’ve done in the past.
[0:12:47.0] NL: Yeah, totally agree. I recently have been getting more and more development practice, and something that I do sometimes is draw out how all of my functions and my methods interact with each other across an existing code base, and lo and behold, when I draw everything out, it sure does look a lot like a network diagram. All these things have to flow together in a very specific way for you to get the kind of returns that you’re looking for.
It looks exactly the same. It’s kind of like — you know how an atom kind of looks like a galaxy in a diagram? All these things are extrapolated across –
[0:13:23.4] SL: Yeah, totally.
[0:13:24.3] NL: Different models. Or an atom looks like a solar system which looks like a galaxy.
[0:13:28.8] SL: Nicholas, you said you were a network administrator at Boeing?
[0:13:30.9] NL: I was, I was a network engineer at Boeing.
[0:13:34.0] SL: You know, as you were sitting there talking, Duffie, I thought back to you, Nick. I have a personal passion for helping people continue to grow and evolve in their careers and not be stuck. I talk to a lot of networking folks, probably dating back to my involvement with the NSX team, right? Folks being like, “I’m just a network engineer, there’s so much for me to learn. If I have to go learn Kubernetes, I wouldn’t even know where to start.”
This discussion, to me, underscores the fact that if you understand how a network is a distributed system, and how these theories apply to a network, then you can extrapolate those concepts and apply them to something like Kubernetes or other distributed systems, right? You immediately begin to understand: okay, this is how these pieces talk to each other, this is how they come to consensus, this is where the state is stored, this is how they understand and exchange data. I’ve got this.
[0:14:33.9] NL: If you want to go down that path, the control plane of your cluster is just like your central routing backbone, and then the kubelets themselves are just your edge switches going to each of your individual smaller networks, and then the pods themselves are the nodes inside of the network, right? You can easily — look at that, holy crap, it looks exactly the same.
[0:14:54.5] SL: Yeah, that’s a good point.
[0:14:55.1] DC: I mean, another interesting part is when you think about how we characterize systems — where we learn that, where that skillset comes from. You raise a very good point. I think it’s maybe a slightly easier thing to learn inside of networking, how to characterize that particular distributed system, because of the way the components themselves are laid out in such a common way.
Whereas when we start looking at different applications, we find a myriad of different patterns, with particular components that may behave slightly differently depending on the case, right? There are different patterns within software almost on a per-application basis, whereas with networks, they’re pretty consistently applied, right? Every once in a while, there’ll be kind of a new pattern that emerges that changes the behavior a little bit, right? Or changes the behavior a lot, but at the same time consistently across all of those things that we call data center networks or what have you.
To learn to troubleshoot, though — I think the key part of this is to be able to spend the time and the effort to actually understand that system. And whether you light that fire with networking, or whether you light that fire with understanding how to operationalize applications, or even just developing and architecting them, all of those things come into play, I think.
[0:16:08.2] NL: I agree. I’m actually kind of curious, the three of us have been talking quite a bit about networking from the perspective that we have which is more infrastructure focused. But Josh, you have more of a developer focused background, what’s your interaction and understanding of the network and how it plays?
[0:16:24.1] JS: Yeah, I’ve always been a consumer of the network. It’s something that sat behind an API and some library, right? I call out to something that makes a TCP connection or an HTTP interaction, and then things just happen. I think what’s really interesting, hearing you all talk, and especially the point about network engineers getting into the distributed system space, is that as we started to put infrastructure behind APIs and made it more and more accessible to people like myself, app developers and programmers, we started — by “we,” you know, I’m obviously generalizing here –
But we started owning more and more of the infrastructure. When I go into teams that are doing big Kubernetes deployments, it’s pretty rare that it’s the conventional infrastructure and networking teams that are standing up distributed systems, Kubernetes or not, right? A lot of times, it’s a bunch of app developers who have maybe what we call a DevOps background, whatever that means, but they have an application development background. They understand how to interact with APIs, how to write code that respects or interacts with their infrastructure, and they’re standing up these systems. And I think one of the gaps that really creates is that a lot of people, including myself — just hearing you all talk — don’t understand networking at that level.
When stuff falls over, and it’s either truly the network or it’s getting blamed on the network, it’s oftentimes just because we truly don’t understand a lot of these things, right? Encapsulation, meshes, whatever it might be, we just don’t understand these concepts at a deep level, and I think it would help to have a lot more people with network engineering backgrounds shifting into the distributed system space.
It would alleviate a bit of that, right? Bringing more understanding into the space that we work in nowadays.
[0:18:05.4] DC: I wonder if maybe it would also be a benefit to have more cross-discussions like this one between developers and infrastructure-focused people, because as we’re crossing boundaries, we see that the same things that we’re doing on the infrastructure side, you’re also doing on the developer side. Like the CAP theorem, as Scott mentioned, which is the idea that you can have two out of three of consistency, availability, and partition tolerance.
That also applies to networking in a lot of ways. You can have a network that is consistent and available, but it can’t handle partitioning. It can be consistent and handle partitioning, but it’s not always going to be available, that sort of thing. These things that apply from the software perspective also apply to us, but we think about them as being completely different.
[0:18:52.5] JS: Yeah, I totally agree. On the app side, a couple of years ago, you know, I really just didn’t care about anything outside of the JVM. My stuff ran on the JVM, and once it got out to the network layer of the host, I just didn’t care or need to know about that at all. But ever since cloud computing and distributed systems and everything became more prevalent, the overlap has become extremely obvious, right?
In all these different concepts and it’s been really interesting to try to ramp up on that.
[0:19:19.6] NL: Yeah, I think, you know, Scott and I both do this — actually, I imagine this is true of all four of us, to be honest. But I think that it’s really interesting when you are out there talking to people who do feel like they’re stuck in some particular role, like they’re specialists in some particular area, and we end up having the same discussion with them over and over again. You know, like, “Look, that may pay the bills right now, but it’s not going to pay the bills in the future.”
And so, you know, the question becomes: how can you, as a network engineer, take your skills forward and not feel as though you’re going to have to learn everything all over again? I think that one of the things network engineers are pretty decent at is characterizing those systems, being able to troubleshoot them, being able to do it right now, and being able to firefight. Those capabilities and those skills are incredibly valuable in software development, in operationalizing applications, and in SRE models.
I mean, all of those skills transfer, you know? If you’re out there and you’re listening and you feel like I will always be a network engineer, consider that you could actually take those skills forward into some other role if you chose to.
[0:20:25.1] JS: Yeah, totally agree. I mean, look at me and the lofty career that I’ve come to.
[0:20:31.4] SL: You know, I would also say that the fascinating thing to me, and one of the reasons I launched — I don’t say this to try and plug it, but just as a way of talking about it — the reason I launched my own podcast, which is now part of Packet Pushers, was exploring this very space. That is: we’ve got folks like Josh, who comes from the application development space and is now being, in a way, forced to own and understand more infrastructure, and we’ve got the infrastructure folks who, whether through the rise of cloud computing or abstractions away from visible items, are being forced kind of up the stack. So they’re coming together, and there’s this idea of: what does the future look like for the folks that are in our space?
How much longer does a network engineer really need to be deeply versed in all the different layers, when everything’s been abstracted away by some other type of thing, whether it’s VPCs or Azure VNets or whatever the case is, right? I mean, you’ve got companies bringing the VPC model to on-premises networks, right? As APIs become more prevalent, as everything gets abstracted away, what does the future look like? What are the most important skills? And it seems to me that it’s these concepts that we’re talking about, right?
This idea of distributed systems, how distributed systems behave, how the components react to one another, and understanding things like the CAP theorem — that is going to be most applicable, rather than the details of troubleshooting BGP or understanding AWS VPCs or whatever the case may be.
[0:22:08.5] NL: I think there is always going to be a place for the people who know how things are running under the hood, from a physical layer perspective, that sort of thing. There’s always going to be a need for the greybeards, right? Even in software development, we still have the people who are slinging kernel code in C. They’re the best, we salute you, but that is not something that I’m interested in, for sure.
We always need someone there to pick up the pieces, as it were. But I think that just being like, “I’m a Cisco guy, I’m a Juniper guy, you know? I know how to log on to or SSH into the switch and execute these commands, and suddenly I’ve got this port trunked to this VLAN” — crap, I was like, “Nick, remember your training,” you know?
Knowing how to issue those commands isn’t necessarily going away, I think, but it will be less in demand in the future.
[0:22:08.5] SL: I’m curious to hear Josh’s perspective, as someone having to own more and more of the infrastructure underneath: what seems to be the right path forward for those folks?
[0:23:08.7] JS: Yeah, I mean, unfortunately, I feel like a lot of times it just ends up being trial by fire, and it probably shouldn’t be that. But the number of times that I have seen a deployment of some technology fall over because we overlapped the CIDR range or something like that is crazy. Because we just didn’t think about it or really understand it that well.
You know, like using one protocol — you just mentioned BGP. I never dreamt of what BGP was until I started using distributed systems, right? We started using BGP as a way to communicate routes, and the number of times that I’ve messed up that connection, because I don’t have a background in how to set that up appropriately — it’s been rough. I guess my perspective is that the technology has gotten better overall. I’m mostly in the Kubernetes space, obviously, speaking to the technologies around a lot of the container networking solutions, but I’m sure this is true overall. It seems like a lot of the sharp edges have been buffed out quite a bit, and I have less of an opportunity to do things terribly wrong.
I’ve also noticed, for what it’s worth, that a lot of folks who have my kind of background are going out to the AWSes and Azures of the world. They’re using all these abstracted networking technologies that allow them to do really cool stuff without really having to understand how it works, and they’re oftentimes going back to their networking team on-prem, when they have on-prem requirements, and saying, “It should be this easy,” for X, Y, and Z. They’re almost pushing the networking team to modernize and make things simpler, based on the experiences they’re having with these cloud providers.
[0:24:44.2] DC: Yeah — what do you mean I can’t create a load balancer that crosses between these two disparate data centers as easily as issuing a single command? Doesn’t this just exist from a networking standpoint? Even just the idea that you can issue an API command and get a load balancer — just that idea alone — the thousands of times I have heard that request in my career.
[0:25:08.8] JS: And the actual work under the hood to get that to work properly — it’s a lot. There’s a lot of stuff going on.
[0:25:16.5] SL: Absolutely, yeah.
[0:25:17.5] DC: Especially when you’re into the plumbing, you know? If you’re going to create a load balancer with an API, well then, what API does the load balancer use to understand where to send that traffic when it’s being balanced? How do you handle discovery? How do you — obviously, yeah, there’s no shortage of work there.
[0:25:36.0] JS: Yeah.
[0:25:36.3] DC: That’s a really good point. I mean, I think sometimes it’s easy for me to think about some of these API-driven networking models and the costs that come with them, the hidden costs that come with them. An example of this: if you’re in AWS — actually, it could be any cloud, it doesn’t have to be AWS, right? — and you have connectivity between two different availability zones, and you’re relying on that to be reliable and consistent and definitely not to experience partitions, what tools do you have at your disposal? What guarantees do you have that that network is even operating in a way that is responsive, right? In a way, this is kind of taking us towards the observability conversation that I think we’ve talked a little bit about in the past.
Because I think it highlights the same set of problems again, right? You have to be able to provide the consumers of any service — whether that service is plumbing, whether it’s networking, whether it’s your application that you’ve developed that represents a set of microservices — you have to provide everybody a way, or, you know, you have to provide the people who are going to answer the phone at two in the morning –
Or even the robots that are going to answer the phone at two in the morning. I have to provide them some mechanism by which to observe those systems as they are in use.
[0:26:51.7] JS: I’m not convinced that very many of the cloud providers do that terribly well today, you know? I feel like I’ve been burned in the past by not actually having an understanding of the state that we’re in. So it is interesting; maybe the software development teams can actually start pushing that down toward the networking vendors out there in the world.
[0:27:09.9] NL: Yeah, that would be great. I mean, I have recently been using a managed Kubernetes service. I have been kicking the tires on it a little bit. And yeah, there have been a couple of times where I have just been got by networking issues. I am not going to get into what I have seen in a container network interface, or any of the technologies around that — we are going to talk about that another time. But the CNI that I am using in this managed service was just so wonky and weird.
And it was failing from a network standpoint. The actual network was failing, in a sense, because the IP addresses for the nodes themselves, or the pods, weren’t being released properly, because of a bug. And so, the role associated with my account could not remove IP addresses from a node in the network, because it wasn’t allowed to, and so I ran out of IP addresses in my very small CIDR there.
[0:28:02.1] SL: And this could happen in a database, right? This could happen in a cache of information. Pretty much the same pattern that you are describing is absolutely relevant in both of these fields, right? And that is the fascinating thing about this: we talk about the network generally in these nebulous terms, as if it is a black box. “I don’t want to know anything about it. I don’t want to learn about it, I don’t want to understand it.”
“I just want to be able to consume it via an API, and I want to have the expectation that everything will work the way it is supposed to.” I think it is fascinating that on the other side of that API are people, maybe just like you, who are doing their level best to chase the CAP theorem to its happy end and figure out how to actually give you what you need out of that service, you know? So, empathy, I think, is important.
[0:28:50.4] NL: Absolutely. To bring that to an interesting thought that I just had: on both sides of this chasm, or whatever it is, between networking and development, the same principles exist, like we have been saying. But to elaborate on it a little bit more, it’s like, on one side you have, “I need to make sure that these etcd nodes communicate with each other and that the data is consistent across all of them,” so we use a protocol called Raft, right?
And so that’s an eventual-consistency tool; and then that information is sent onto a network, which is probably using OSPF — “open shortest path first,” a routing protocol — to become eventually consistent on the data getting from one point to the other by taking the shortest path possible. And so these two things are very similar. They are both communication protocols — which is, I mean, what “protocol” means, right? A standard for communication — but they’re at so many different layers.
Obviously of the OSI model but people don’t put them together but they really are and we keep coming back to that where it is all the same thing but we think about it so differently. And I am actually really appreciating this conversation because now I am having a galaxy brain moment like boo.
[0:30:01.1] SL: Another really interesting one — another galaxy moment, I think — is if you think about, so let us break them down, TCP and UDP. These are interesting patterns that totally relate, again, to software patterns, right? In TCP, the guarantee is per datagram: if you didn’t get the entire datagram, you will understand that you are missing data, and you will request a new version of that same packet.
And so, you can provide consistency in the form of retries or repeats if things don’t work, right? Not dissimilar from the ability to understand, when you chuck some data across the network or make a query against a particular database, whether you got the most recent version of it, right? Etcd supports this by using revisions — by understanding what revision you received last and whether that is the most recent one.
And other software patterns kind of follow the same model, and I think that is also kind of interesting. We are still using the same primitive tools to solve the same problems, whether we are doing it at the software application layer or whether we are doing it down in the plumbing of the network; these tools are still very similar. Another example is UDP, where there are basically no repeats. You either got the packet or you didn’t, which sounds a lot like an event stream to me in some ways, right?
It is very interesting: I put it on the line, you didn’t get it? That’s okay, I will put another one here in a minute and you can react to that one, right? It is an interesting overlap.
[0:31:30.6] NL: Yeah, totally.
[0:31:32.9] JS: Yeah, the comparison to event streams or message queues, right? That is an interesting one that I hadn’t considered before. But yeah, there are certainly parallels between saying, “Okay, I am going to put this on the message queue,” and waiting for the acknowledgement that somebody has taken ownership of it, as opposed to an event stream, where it is like, “This happened. I emit this event. If you get it and you do something with it, great.”
If you don’t get it, then you don’t do something with it — also great, because another event is going to come along soon. So, there you go.
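The TCP/UDP contrast the speakers draw is easy to see from code. A minimal sketch in Go: dialing TCP performs a handshake and reports immediately if nothing is listening, while a UDP write "succeeds" whether or not anyone ever receives the datagram, much like emitting an event. The loopback address and port are arbitrary choices for illustration.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// TCP: connection-oriented. Dialing performs a handshake, and the
	// stack retransmits lost segments for us -- if nothing is listening
	// on the port, we find out immediately via an error.
	if _, err := net.Dial("tcp", "127.0.0.1:9999"); err != nil {
		fmt.Println("tcp dial:", err)
	}

	// UDP: fire-and-forget, like emitting an event. The write "succeeds"
	// whether or not anyone ever receives the datagram.
	conn, err := net.Dial("udp", "127.0.0.1:9999")
	if err != nil {
		fmt.Println("udp dial:", err)
		return
	}
	defer conn.Close()
	n, err := conn.Write([]byte("an event nobody may ever see"))
	fmt.Printf("udp write: %d bytes, err=%v\n", n, err)
}
```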
[0:32:02.1] DC: Yep. I am going to go down a weird topic associated with what we were just talking about — I am going to get a little bit more into the weeds of networking, and this is actually directed at us in a way. So, talking about the parallels between networking and development: in networking, at least with Ethernet, there is something called CSMA/CD, which is “carrier sense multiple...” — oh, I can’t remember what the A stands for — and the CD.
[0:32:29.2] SL: Access.
[0:32:29.8] DC: Multiple access, and then CD is collision detection. So basically, what that means is that whenever you send out a packet on the network, the network device itself is listening on the network for any collisions, and if it detects a collision, it will refuse to send a packet for a certain period of time and then will do a retry, to make sure that these packets are getting sent as efficiently as possible. There is an alternative to that called CSMA/CA, which was used by Mac before they switched over to using a Unix-based operating system.
And then putting a fancy UI in front of it. With collision avoidance, it would listen and try to — I can’t remember exactly — it would time things differently so that it would avoid any chance that there could be a collision. It would make sure that no packets were being sent right then, and then send it back out. And so I was wondering if something like that exists in the communication path between applications.
[0:33:22.5] JS: Is a collision two of the same packets being sent, or what exactly is it?
[0:33:26.9] DC: With any packets, so basically any data going back and forth.
[0:33:29.7] JS: What makes it a collision?
[0:33:32.0] SL: It is the idea that you can only transmit one message at a time because if they both populate the same media it is trash, both of them are trash.
[0:33:39.2] JS: And how do you qualify that? Do you receive an ack from the system, or?
[0:33:42.8] NL: No, there is just nothing returned, essentially. It is literally the electrical signals going down the wire; they physically collide with each other, and then the signal breaks.
[0:33:56.9] JS: Oh, I see. Yeah, I am not sure. I think there are some parallels to that, maybe with queuing technologies and things like that, but I can’t think of anything on the direct app dev side.
[0:34:08.6] DC: Okay, anyway, sorry for that tangent. I just wanted to go down that little rabbit hole a bit. While we were talking about networking, I was like, “Oh yeah, I wanted to see how deep we can make this parallel go,” so that was the direction I went.
[0:34:20.5] SL: That CSMA/CD piece is seriously old school, right? Because it only applied to half-duplex Ethernet, and as soon as we went to full-duplex Ethernet it didn't matter anymore.
[0:34:33.7] DC: That is true. I totally forgot about that.
[0:34:33.8] JS: It applied to satellite and all of that as well.
[0:34:35.9] DC: Yeah, I totally forgot about that. And with full duplex, I totally just spaced on that. This is – damn, Scott, way to make me feel old.
[0:34:45.9] SL: Well, I mean, satellite stuff too, right? It is actually any shared media where, if the transmissions overlap, you are not going to be able to make it work, right? And so, it is interesting — it is actually an interesting parallel. I am struggling to think of an example of this as well. My brain is going towards circuit breaking, but I don't think that that is quite the same thing.
It is sort of the same thing, in that in a circuit breaking pattern, the application that is making the request has the ability — obviously, because it is the thing making the request — to understand that the target it is trying to connect to is not working correctly. And so, it is able to make an almost instantaneous decision, or at least a very timely decision, about what to do when it detects that state. And that's a little similar, in that from the requester side you can do things if you see things going awry.
And in reality, in the circuit breaking pattern we are making the assumption that only the application making the request will ever get that information fast enough to react to it.
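For readers who want to see the shape of the pattern, here is a toy circuit breaker in Go — a sketch, not any particular library's implementation; the type, thresholds, and names are made up:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Breaker fails fast after maxFailures consecutive failures, and only
// probes the target again once the cooldown has elapsed. Note that all
// the information lives in the requester, per the point above: only
// the caller sees failures fast enough to act on them.
type Breaker struct {
	maxFailures int
	cooldown    time.Duration
	failures    int
	lastFailure time.Time
}

var ErrOpen = errors.New("circuit open: failing fast")

func (b *Breaker) Call(req func() error) error {
	if b.failures >= b.maxFailures && time.Since(b.lastFailure) < b.cooldown {
		return ErrOpen // near-instantaneous decision, no request sent
	}
	if err := req(); err != nil {
		b.failures++
		b.lastFailure = time.Now()
		return err
	}
	b.failures = 0 // target recovered; close the circuit
	return nil
}

func main() {
	b := &Breaker{maxFailures: 3, cooldown: time.Second}
	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(func() error { return errors.New("target down") }))
	}
	// The first three calls hit the broken target; the last two fail fast.
}
```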
[0:35:51.8] JS: Yeah, where my head was kind of going with it — though I think it is pretty off — is a low-level piece of code, maybe something you write in C, where you implement your own queue, and multiple threads are firing at the same time, and there is no locking mechanism if two threads contend to put something in the same memory space that the queue represents. That is really going down the rabbit hole. I can't even speak to what degree that is possible in modern programming, but that is where my head was.
[0:36:20.3] NL: Yeah that is a good point.
[0:36:21.4] SL: Yeah, I think that is actually a pretty good analogy, because the key commonality here is some sort of shared access, right? Multiple threads accessing the same stack or memory buffer. The other thing that came to mind for me was some sort of session multiplexing, right? Where you are running multiple application layer sessions inside a single network connection and those sessions get commingled in some fashion.
Whether through identifiers or sequence numbers or something else of that nature, and therefore, you know, garbling the ultimate communication that is trying to be sent.
[0:36:59.2] DC: Yeah, locks are exactly the right direction, I think.
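To ground that: here is a minimal Go sketch of the scenario Josh raised — many writers sharing one queue — with a mutex serializing access to the shared media. The Queue type and counts are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is shared by many goroutines. Without the mutex, concurrent
// appends can interleave and lose items -- the in-memory version of
// two frames colliding on the wire. The lock makes writers take turns.
type Queue struct {
	mu    sync.Mutex
	items []int
}

func (q *Queue) Push(v int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, v)
}

func main() {
	q := &Queue{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			q.Push(v)
		}(i)
	}
	wg.Wait()
	fmt.Println("items:", len(q.items)) // always 100 with the lock held
}
```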
[0:37:03.6] NL: That is a very good point.
[0:37:05.2] DC: Yeah, I think that makes perfect sense. Good, all right. Yes, we nailed it.
[0:37:09.7] SL: Good job.
[0:37:10.8] DC: Can anybody here think of a software pattern that maybe doesn’t come across that way? When you are thinking about some of the patterns that you see today in cloud native applications, is there a counter example, something that the network does not do at all?
[0:37:24.1] NL: That is interesting. I am trying to think — what about event streams? No, that is just straight-up packets.
[0:37:30.7] JS: I feel like we should open up one of those old school Java books of like 9,000 design patterns you need to know and we should go one by one and be like, “What about this” you know? There is probably something I can’t think of it off the top of my head.
[0:37:43.6] DC: Yeah me neither. I was trying to think of it. I mean like I can think of a myriad of things that do cross over even the idea of only locally relevant state, right? That is like a cam table on a switch that is only locally relevant because once you get outside of that switching domain it doesn’t matter anymore and it is like there is a ton of those things that totally do relate, you know? But I am really struggling to come up with one that doesn’t –
One thing that is actually interesting — I was going to bring it up, we mentioned the CAP theorem — and it is an interesting one, that you can only pick two of the three of consistency, availability, and partition tolerance. And when I think about the way that networks solve, or try to address, this problem, they do it in some pretty interesting ways. Consider Spanning Tree, right? The idea that there can really only be one path through a series of broadcast domains.
Because if we have multiple paths, then obviously we are going to get duplication, and things are going to get bad, because you are going to have packets addressed to the same destination arriving across multiple paths, and you are going to get all kinds of bad behaviors — switching loops and broadcast storms and all kinds of stuff like that. And so Spanning Tree came along. Spanning Tree was invented by an amazing engineer, Radia Perlman, who created it to basically ensure that there was only one path through a set of broadcast domains.
And in a way, this addresses that CAP trade-off, because you get to the point where you say: since I understand that, for availability purposes, I only need one path through the whole thing, then to ensure consistency I am going to turn off the other paths, and to allow for partition tolerance I am going to enable the system to learn when one of those paths is no longer viable, so that it can re-enable one of the other paths.
Now the challenge, of course, is that there is a transition period in which we lose traffic, because we haven't been able to open one of those other paths fast enough, right? And so, it is interesting to think about how the network is trying to solve the same set of problems described by the CAP theorem that we see people trying to solve with software today.
[0:39:44.9] SL: No man, I totally agree. In a case like Spanning Tree, you are sacrificing availability, essentially, for consistency and partition tolerance; when the network achieves consistency, availability will be restored. And there are other ways of doing that. So as we move into systems like the Clos fabrics I mentioned earlier — a Clos fabric is a different way of establishing a solution to that, saying: at layer 3, I will have multiple connections.
I will weight those connections using a higher-level protocol, and I will sacrifice consistency in terms of how the routes are exchanged to get across that fabric, in exchange for availability and partition tolerance. So, it is a different way of solving the same problem, using a different set of tools to do it, right?
[0:40:34.7] DC: I personally find it funny that in the CAP theorem at no point do we mention complexity, right? We are just trying to get all three, and we don't care if it's complex. But at the same time, as a consumer of all of these systems, you care a lot about the complexity. I hear it all the time, whether that complexity is in the way the API itself works, or — even in this episode we are talking about how I maybe don't want to learn how to make the network work.
I am busy trying to figure out how to make my application work, right? Cognitive load is a thing. I can only really focus on so many things at a time, so where am I going to spend my time? Am I going to spend it learning how to do plumbing, or am I going to spend it actually writing the application that solves my business problem, right? It is an interesting thing.
[0:41:17.7] NL: So, with the rise of software defined networking, how did that play into the adoption of cloud native technologies?
[0:41:27.9] DC: I think it is actually one of the more interesting overlaps in the space, because, to Josh's point again, this is where — I mean, I worked for a company called [inaudible 0:41:37], in which we were virtualizing the network, and this is fascinating, because effectively we were looking at this as a software service that we had to bring up and build reliably, consistently, and scalably. We wanted to create all of this while we were solving those problems.
But we needed to do it with an API. We couldn't assume that the way networks were being defined at the time — going to each component and configuring it, or using protocols — was actually going to work in this new model of software defined networking. And so, we had an incredible number of engineers who were really focused, from a computer science perspective, on how to effectively reinvent the network as a software solution.
And I do think that there is a huge amount of crossover here. This is actually where I think the waters meet between the way developers think about the problems and the way network engineers think about the problems, but it has been a rough road, I will say. I will say that SDN has definitely knocked a lot of network engineers back on their heels, because they're like, "Wait, wait, but that is not a network," you know? Because I can't actually look at it and characterize it the way I am accustomed to looking at and characterizing the other networks that I play with.
And then from the software side, you're like, "Well, maybe that is okay," right? Maybe that is enough. It is really interesting.
[0:42:57.5] SL: You know, I don't know enough about the details of how AWS or Azure or Google are actually doing their networking — and maybe you guys all know — but aside from a few tidbits here and there, I don't know that AWS is even going to divulge the details of how things work under the covers for VPCs, right?
But I can't imagine that any modern cloud networking solution, whether it be VPCs or VNets or whatever, doesn't have a significant software defined aspect to it. You know, we don't need to get into the definitions of what SDN is or isn't — that was a big discussion Duffie and I had six years ago, right? But there has to be some part of it that is taking the concepts that are common in SDN and applying them, just the same way the cloud vendors are using the concepts from compute virtualization to enable what they are doing.
I mean, the reality is that the work that was done by the Cambridge folks on Xen was a massive enabler for AWS, right? The work done on KVM was also a massive enabler for lots of people — I think GCP is KVM-based, and vSphere, which VMware uses, as well. All of this stuff was a massive enabler for what we do with compute virtualization in the cloud. So I have to think, even if it wasn't necessarily directly stemming out of Martin Casado's OpenFlow work at Stanford, right?
That a lot of these software defined networking concepts are still seeing use in the modern clouds these days, and that is what enables us to do things like issue an API call and have an isolated network space, with its own address space and its own routing, instantiated in some way and managed.
[0:44:56.4] JS: Yeah, and on that latter point, as a consumer of this new software defined nature of networking, it is amazing the amount of — I know I am using a blanket marketing term here — agility that it has added, right? Because it has turned all of these constructs, where I used to file a ticket and follow up with people, into self-service things. When I need to poke holes in the network, hopefully the rights are locked down so I can't just open it all up.
Assuming I know what I am doing and the rights are correct, it is totally self-service for me. I go into AWS, I change the security group rule, and boom, the ports have changed — and it never looked like that prior to this full takeover of what I believe is SDN almost end to end, in the case of AWS and so on. So not only has it made people like myself have to understand more about networking, but it has allowed us to self-service a lot of these things.
Things that I would imagine most network engineers were probably tired of doing anyway, right? How many times do you want to go to that firewall and open up that port? Are you really that excited about that? I would imagine not.
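For a concrete flavor of that self-service, here is a sketch of the same kind of security group change made programmatically with the AWS SDK for Go (v1) — assuming credentials and a region are already configured, and with a placeholder group ID and CIDR:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Uses whatever AWS credentials and region are configured locally.
	svc := ec2.New(session.Must(session.NewSession()))

	// Open one port on one (hypothetical) security group: the kind of
	// change that used to be a ticket to the network team.
	_, err := svc.AuthorizeSecurityGroupIngress(&ec2.AuthorizeSecurityGroupIngressInput{
		GroupId:    aws.String("sg-0123456789abcdef0"), // placeholder ID
		IpProtocol: aws.String("tcp"),
		FromPort:   aws.Int64(443),
		ToPort:     aws.Int64(443),
		CidrIp:     aws.String("10.0.0.0/16"), // placeholder range
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("ingress rule added")
}
```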
[0:45:57.1] NL: Well, I can only speak from experience, and I think a lot of network engineers get into that field because they really love control. And so, they want to know what these ports are that are opening, and it is scary to hear that this person has opened up these ports — "Wait, what?" — without them even totally knowing. I mean, I am generalizing; I was more so speaking to myself and being self-deprecating. It doesn't apply to you, listener.
[0:46:22.9] JS: I mean, it is a really interesting point, though. Do you think it moves networking people, or network engineers, a bit more into the realm of observability, and knowing when to trigger on something having gone wrong? Does it make them more reactive in their role, I guess? Or maybe self-service is not as common as I think it is. From my point of view, it seems like with SDN, the ability to modify the network — more power — has been put into the developers' hands, is how I look at it, you know?
[0:46:50.7] DC: I definitely agree with that. It is interesting — if we go back a few years, there was a time when, well, all of us in the room here are I think employed by VMware. There was a time when VMware's thing — the real value, or one of the key values, that VMware brought to the table — was the idea that a developer could come and say, "Give me 10 servers," and you could just call an API and quickly provision those 10 servers on behalf of that developer and hand them right back.
You wouldn't have to go out and get 10 new machines, put them into a rack, power them, and provision them, and go through that whole process; you could just stamp those things out, right? And that is absolutely parallel to the network piece as well. If there is nothing else SDN brought to the fore, it is that, right? You can get that same capability of just stamping out virtual machines, but with networks. The API is important in almost everything we do.
Whether it is a service that you are developing, whether it is the network itself, whether it is the firewall — we need to do these things programmatically.
[0:47:53.7] SL: I agree with you, Duffie, although I would contend that the one area where — and I will call it on-premises SDN, shall we say, right? The people shipping SDN solutions. I'd say the one area, at least in my observation, where they haven't done well is that self-service model. In the cloud, self-service is paramount, to Josh's point. They can go out there, they can create their own VPCs, create their own subnets, create their own NAT gateways, Internet gateways, security groups, load balancers, blah-blah, all of that, right?
But it still seems to me that even though we are probably 90, 95% of the way there, maybe farther, in terms of on-premises SDN solutions, you still typically don't see self-service being pushed out the same way you would in the public cloud, right? That is almost the final piece needed to bring that cloud experience to the on-premises environment.
[0:48:52.6] DC: That is an interesting point. I think from an infrastructure as a service perspective, it falls into that realm; it is a problem to solve in that space, right? So when you look at things like OpenStack, and things like AWS, and things like GKE — or not GKE, but GCE — and areas like that, it is a requirement that if you are going to provide infrastructure as a service, you provide some capability around networking. But at the same time, look at some of the platforms that are used for cloud native applications.
Things like Kubernetes. What is fascinating about that is that we have agreed on an abstraction of networking that is maybe, I don't know, a little more precooked, you know what I mean? Within most of the platforms as a service that I have seen, the assumption is that when I deploy a container, or I deploy a pod, or I deploy some function as a service, or any of these things, the networking is going to be handled for me.
I shouldn't have to think about whether it is being routed to the Internet or not, or routed back and forth between these domains. If anything, I should only have to give you intent — describe to you the intent of what could be connected to this and what ports I am actually going to be exposing — and the platform hides all of the complexity of that network away from me, which is an interesting balance to strike.
[0:50:16.3] SL: So, this is one of my favorite things, one of my favorite distinctions to make, right? This is the two worlds that we have been talking about, applications and infrastructure, and the perfect example of these different perspectives — and you even said it as you talked there, Duffie — from an IaaS perspective, it is considered a given that you have to be able to say, "I want a network," right? But when you come at this from the application perspective, you don't care about a network.
You just want network connectivity, right? And so, when you look at the abstractions that IaaS vendors, solutions, or products have created, they are IaaS-centric, but when you look at the abstractions that have been created in the cloud native space, like within Kubernetes, they are application-centric, right? And so, we are talking about infrastructure artifacts versus application artifacts, and they end up meeting, but they are coming at this from two very different perspectives.
[0:51:18.5] DC: Yeah.
[0:51:19.4] NL: Yeah, I agree.
[0:51:21.2] DC: All right, well, that was a great discussion. I have a couple of other networking discussions that I want to dig into in future episodes, and in this conversation I hope we've helped draw some parallels back and forth. I mean, there is some empathy to share here, right? The people who are providing the service of networking to you in your cloud environments and your data centers are solving almost exactly the same sorts of availability and capability problems that you are trying to solve with your own software.
And that in itself is a really interesting takeaway. Another one is that, again, there is nothing new under the sun. The problems we are trying to solve in networking are not different from the problems you are trying to solve in applications. We have far fewer tools, and network engineers are generally focused on specific changes that happen in the industry rather than looking at a breadth of industries. I mean, as Josh pointed out, you could break open a Java book.
And see 8,000 patterns for how to do Java — and this is true of every programming language I am aware of. If you look at Go, you see a bunch of different patterns there, and we have talked about different patterns for developing cloud native aware applications as well, right? There are so many options in software versus what is available to us within networks. I am rambling a little bit, but I think that is the takeaway from this session.
Is that there is a lot of overlap and there is a lot of really great stuff out there. So, this is Duffie, thank you for tuning in and I look forward to the next episode.
[0:52:49.9] NL: Yep and I think we can all agree that Token Ring should have won.
[0:52:53.4] DC: Thank you Josh and thank you Scott.
[0:52:55.8] JS: Thanks.
[0:52:57.0] SL: Thanks guys, this was a blast.
[END OF EPISODE]
[0:52:59.4] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
See omnystudio.com/listener for privacy information.
-
Our topic in today's great episode is how we think jobs in software engineering have changed since the advent of cloud native computing. We begin by giving our listeners an idea of our jobs and speak more to what a job in cloud native would look like as well as how Kubernetes fits into the whole picture. Next up we cover some old challenges and how advances in the field have made those go away while simultaneously opening the gateway to even more abstract problems. We talk about some of the specific new developments and how they have changed certain jobs. For example, QA has not disappeared but rather evolved toward becoming ever more automated, and language evolution has left more space for actual development instead of debugging. Our conversation shifts toward some tips for what to know to get into cloud native and where to find this information. We wrap up our conversation with some thoughts on the future of this exciting space, predicting how it might change but also how it should change. Software engineering is still in a place where it is continuously breaking new ground, so tune in to hear why you should be learning as much as you can about development right now.
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Bryan Liles
Nicholas Lane
Key Points From This Episode:
• The work descriptions of our hosts who merge development, sysadmin, and consulting.
• What a cloud native related job looks like.
• Conceptualizing cloud native in relation to development, sysadmin, and DevOps.
• A cloud native job is anything related to building software for other people’s computers.
• Kubernetes is just one way of helping software run easily on a cloud.
• Differences between cloud native today and 10 years ago: added ease through more support.
• How cloud native developing is the new full stack due to the wide skillset required.
• An argument that old challenges are gone but have introduced more abstract ones.
• Advances making transitioning from testing to production more problem-free.
• How QA has turned into SDE, meaning engineers now write software that tests.
• Why jobs have matured after the invention of cloud native.
• Whether the changes in jobs have been one of titles or function.
• How languages like Rust, Go, and Swift have changed developer jobs by being less buggy.
• What good support equates to, beyond names like CRE and company size.
• The many things people who want to get into cloud native should know.
• Prospective cloud native workers should understand OSs, networking, and more.
• Different training programs for learning Kubernetes such as CKA and CKAD.
• Resources for learning such as books, YouTube videos, and podcasts.
• Predictions and recommendations for the future of cloud native.
• Tips for recruiters such as knowing the software they are hiring for.
Quotes:
“What is the cloud? The cloud is other people’s computers. It's OPC, and what is Kubernetes? Well, basically, it’s a way that we can run our software on other people’s computers, AKA the cloud.” — @bryanl [0:07:35]
“What we have now is we know what we can do with distributed computing and now we have a great set of software for multiple vendors who allow us to do what we want to do.” — @bryanl [0:10:03]
“There are certain challenges now in cloud native that are gone, so the things that were hard before like spinning up a server or getting the database are gone and that frees us to worry about more complicated or more abstract ideas.” — @apinick [0:12:58]
“The biggest problem with what we are doing is that we are trailblazing. So a lot of the things that are happening, like the way that Kubernetes advances every few months is new, new, new, new.” — @bryanl [0:36:11]
“Now is the literal best time to get into writing software and specifically for cloud native applications.” — @bryanl [0:42:22]
Links Mentioned in Today’s Episode:
Azure — https://azure.microsoft.com/en-us/
Google Cloud Platform — https://cloud.google.com/
AWS — https://aws.amazon.com/
Amazon RDS — https://aws.amazon.com/rds/
Mesosphere — https://d2iq.com/
Aurora — https://stackshare.io/stackups/aurora-vs-mesos-vs-mesosphere
Marathon — https://mesosphere.github.io/marathon/
Rails Rumble — http://blog.railsrumble.com/
Terraform — https://www.terraform.io/intro/index.html
Swift — https://developer.apple.com/swift/
Go — https://golang.org/
Rust — https://www.rust-lang.org/
DigitalOcean — https://www.digitalocean.com/
Docker — https://www.docker.com/
Swarm — https://www.santafe.edu/research/results/working-papers/the-swarm-simulation-system-a-toolkit-for-building
HashiCorp — https://www.hashicorp.com/
Programming Kubernetes on Amazon — https://www.amazon.com/Programming-Kubernetes-Developing-Native-Applications/dp/1492047104
The Kubernetes Cookbook on Amazon — https://www.amazon.com/Kubernetes-Cookbook-Building-Native-Applications/dp/1491979682
Kubernetes Patterns on Amazon — https://www.amazon.com/Kubernetes-Patterns-Designing-Cloud-Native-Applications/dp/1492050288
Cloud Native DevOps with Kubernetes on Amazon — https://www.amazon.com/Cloud-Native-DevOps-Kubernetes-Applications/dp/1492040762
Kubernetes in Action on Amazon — https://www.amazon.com/Kubernetes-Action-Marko-Luksa/dp/1617293725
Managing Kubernetes on Amazon — https://www.amazon.com/Managing-Kubernetes-Operating-Clusters-World/dp/149203391X
Transcript:
EPISODE 14
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.3] NL: Hello and welcome back. This week, we'll be discussing the thing that's brought us all together: our jobs. But not just our jobs — I think we're going to be talking about the different kinds of jobs you can find in cloud native land. This time, I'm your host, Nicholas Lane, and with me are Bryan Liles.
[0:00:57.1] BL: Howdy.
[0:00:58.0] NL: And Carlisia Campos.
[0:00:59.6] CC: Hi everybody, glad to be here.
[0:01:02.6] NL: How’s it going you all?
[0:01:03.7] CC: Very good.
[0:01:05.4] NL: Cool. To get us started, let’s talk about our jobs and like what it means to have a job and like the cloud native land from our current perspective. Brian, you want to go ahead and kick us off?
[0:01:17.8] BL: Wow, cloud native jobs. What is my job? My job is – I look at the productivity of developers and people who are using Kubernetes. My job is to understand cloud native apps, but also to understand that the systems they are running on are complex, and whether they be Windows or Linux or Mac based, to be able to understand those too. Really, my job is the combination of a senior developer composed with a senior-level admin, whether it be Windows or Linux.
Maybe I am the actual epitome of DevOps.
[0:01:58.5] NL: Yeah, you seem to be kind of a fusion of the two. Carlisia?
[0:02:03.3] CC: My job is – so I'm mainly a developer, but to do the job that I need to do, I need to be a bit of a DevOps person as well, because, as I've talked about many times here on the show, I work on an open source tool called Velero that does backup and recovery for Kubernetes clusters. I need to be able to boot up a cluster with at least the three main providers: Azure, Google Cloud Platform, and AWS. I need to know how to do that, how to tweak things, how to troubleshoot things, and when we think of just a straight-up developer, that usually is not part of the daily activity.
In that sense — I'm not sure how we would define a cloud native job, but my job, if there is such a thing, definitely is a cloud native job, because I have to interact with these cloud native technologies beyond the actual app that I'm developing, which runs inside a Kubernetes cluster, so it all ties in. And you, Nick?
[0:03:16.0] NL: My job is — I'm a cloud native architect, or a Kubernetes architect; I'm not sure what we're calling ourselves these days, honestly. What that means is we work with customers to help them along their cloud native journey. Either that means helping them set up a Kubernetes cluster and getting them running with certain tools that are going to make their life easier, or helping them develop tools in their cloud environments to help make the running of their jobs easier.
We kind of run the gamut of developers and sysadmins a bit, and consultants. We kind of touch a little bit of everything.
Let’s take a step back now and talk about what we think a cloud native job looks like? Because for me, that’s kind of hard to describe. A cloud native job seems to be any job that has to do with some cloud native technology but that’s kind of broad, right?
You could have things from sysadmins — people who are running the cloud infrastructure for the company, managing things like rights, access, accounting, that sort of thing — to people who are doing development, like yourselves, Bryan and Carlisia. Is there anything that you think is unique to a cloud native job?
[0:04:35.2] CC: Yeah, it’s very interesting to talk about I think because especially in relation to if you don’t have a cloud native job, what do you have and how is it different? I wonder if the new cloud native job title is the new full stack developer for developers because, I think it’s easier to conceptualize what a cloud native job is for a systems admin or dev ops person.
But for developer, I think it’s a little more tricky, right? Is it the new full stack? Is it now that the developer even if you’re not doing – for example, my application runs inside Kubernetes, it’s an extension of Kubernetes but some applications just run on Kubernetes as a platform. Now, are we talking about developers with a cloud native title like ‘cloud native software engineer’ and for those developers, does it mean that they now have to design, code and deploy consistently?
You know, in my old days, when I – before doing this type of work, I would deploy apps but it was not all the time. There was a system, every single job I had, the system was different. The one thing that I love about Kubernetes is that if I was just a regular app developer, again as supposed to like extending Kubernetes, right?
If I was building apps that would run on Kubernetes, as opposed to extending Kubernetes, and I had to deploy them to Kubernetes, then if I moved jobs and the new place was also working with Kubernetes, this process would be exactly the same, and that's one really cool thing about it. In other words, I wouldn't mind so much having to do deployments if the deployment process was the same everywhere.
Because it's really painful to do a one-off deployment here and there when each place is different. I had to write a ton of notes to make sure — you know, it was like 200 steps, and if any one of them broke, you had to troubleshoot, and I'm not a systems admin, so it would be a struggle.
[0:06:44.6] BL: Yeah.
[0:06:45.8] CC: Because each system – it’s not that I couldn’t learn but each system would be different and I make – anyway, I think I went off on a tangent.
[0:06:53.1] NL: No worries.
[0:06:54.1] CC: But I also wanted to mention that I searched on LinkedIn for cloud native in the jobs section, and there are a ton of job postings with cloud native in the title. A lot of them are architect roles, but there is also product manager, there is also software engineer; I found one that was senior Kubernetes engineer. It's definitely a thing.
[0:07:21.0] BL: All right. What is the question here?
[0:07:25.6] NL: It was what do we think a cloud native job looks like essentially?
[0:07:29.6] BL: All right. I'm going to blow your mind here. Basically, what is the cloud? The cloud is other people's computers. It's OPC. And what is Kubernetes? Well, basically, it's a way that we can run our software on other people's computers, AKA the cloud. Kubernetes makes running software in the cloud easier. What that really breaks down to is: if you are writing software that runs on other people's computers, or if you are designing software that runs well on clouds, well, you're a cloud native person. Actually, the term has basically been co-opted for marketing purposes by who knows who. Basically, everyone.
But what I think is, as long as you are working on software that runs on modern infrastructure — which means that nodes may go away, you might not own all of your services, you might use a leased database server, something like RDS from Amazon — everyone working in that realm, working with software that may go away, with things that aren't theirs, is doing cloud native work.
We just happen to be doing it on Kubernetes because that's the most popular option right now. It isn't the only option and it probably won't be the final option.
[0:08:48.7] CC: Do you see any difference between what is required for a job like that today versus maybe 10 years ago, or five years ago, Bryan?
[0:08:58.0] BL: Yeah, actually, I do see some differences. One of the biggest differences is that there are a lot more services out there provided to help you do what you need to do. Ten years ago, having a database provider would be hard, because, one, the network wouldn't be good enough, and your hosting company probably didn't have that — unless you were at AWS, and even they didn't have that.
Now, what we get to take advantage off is things are just easier, it’s easier to fire up databases, it’s easier to add nodes to our production. It’s easier to have multiple productions, it’s easier to keep that all in order. It’s easier to put automated configuration around that than it was 10 years ago.
Now, five years ago, back in 2014, I would actually say that the way we have progressed since then is that we became more mature. I remember when Kubernetes came out, and I thought it was going to win, but Mesosphere — Mesos with Aurora or Marathon — was actually better than Kubernetes; it just worked out of the box for what we thought we could do with it. But what we have now is we know what we can do with distributed computing, and we have a great set of software from multiple vendors that allows us to do what we want to do.
That’s the best part about now versus five years ago.
[0:10:17.7] CC: Yeah, I have to agree with that, it's definitely easier. As a developer, I'm not going to tell you it's easy, but it's easier. As an example, I remember when there was Rails Rumble, maybe 10 years ago, I don't know.
[0:10:31.3] BL: Yeah, I remember.
[0:10:34.0] CC: You did a video showing step by step how to boot up a Linux server to run apps on it. I don't remember why we needed to boot it up from scratch. Remember that, Bryan?
[0:10:46.9] BL: I do remember that. That was 2007 or eight? It was a long time ago.
[0:10:53.0] CC: That was one of the things that made me very impressed with you, because I followed all the steps and at the end it worked. You were right on, as far as the instructions went.
I think doing that took me about two hours; I remember it took a long time, because, again, these are things that I do once in a while, not all the time. Now, we can use a Terraform script and have something running in a matter of 15 minutes.
[0:11:26.9] BL: Side bar — quick side bar. Yeah, we can use Terraform. I use Terraform for all my personal infrastructure, so things that are running in my house use Terraform. All my work stuff uses Terraform. But still, it's sometimes easier to just write a script, or type in the commands on the command line, or click something. We're still not at the point where using things like Terraform makes us not want to do it manually. That's how I know that we're not at our ultimate level of maturity yet.
But, if you want to, the options are there and they’re pretty good.
[0:12:01.8] CC: Yeah.
[0:12:03.6] NL: Carlisia, you said something that reminded me — maybe we can go down this path. We were talking about how there are certain challenges we aren't faced with anymore in cloud native land — certain things are easier, not to say that our jobs are easy, like you're saying, Carlisia. But it was something along the lines of: a cloud native job is now the full stack kind of job. Full stack developer was the name of the game back in the day; now, it's a cloud native job.
I actually kind of agree with that, in the sense that a cloud native developer, or anyone in the cloud native realm, can't exist in their own silo anymore. You need to better understand the infrastructure that you're using to write your code on someone else's computer. I actually kind of like that.
[0:12:56.6] CC: Exactly.
[0:12:58.0] NL: Yeah, there are certain challenges in cloud native that are gone. The things that were hard before, like spinning up a server or getting a database, are gone, and that frees us to worry about more complicated or more abstract ideas, like how we have everyone agree on the API to use — and thus rises Kubernetes.
[0:13:19.1] CC: Yeah, I see that as a very positive thing. It might sound like – it’s a huge burden to ask developers to now have to now this but again, if we stick to the same stack, the burden diminishes really quickly because you learn it once and then that’s it. That’s been a huge advantage. If it works out this way, I mean, I’m all for like you know, the best technology should win. But there is that advantage.
If we keep using the same container orchestrator and we use containers, we can run our code as if it were running on any machine. One advantage that I see: I've had the "it works on my computer"™ cases, where my code would be deployed and one little stupid thing wouldn't work because of the way a URL was redirected; it broke things, and I got yelled at. I'm like, "Okay, you want me to do this right? Give me a server."
Back then, good luck — "I'm going to give you a server? No way." It was just so expensive; developers would be lucky to get access to one of the staging environments. And even when you got there, you had to coordinate with QA, and there was a process.
Now, because I have access to my own servers here, right? I can just imagine, if I were a developer building apps to run on Kubernetes, an admin could just say, "Okay, you have these resources, go for it." I'd have my own namespace, and I could run my code as if it were running in a production environment, and I'd have more assurance that my code works, basically. Which is, to me, so satisfying. It's one less thing to worry about: if I deploy something to production, I have already tested it.
[0:15:17.3] NL: Yeah, that’s great. That’s something I really do cherish about the current landscape. We can actually test these things out locally and have confidence that they’ll work at least fairly well in production, right?
[0:15:30.2] CC: It's not just running things locally; you can actually get access to a little slice of, let's say, an AWS server, and just ship your things there and test them there. Because the systems admin people can carve out that little slice for your team, or even on a per-person basis — maybe that's too much, but it's relatively uncomplicated to do and not very costly.
[0:15:56.9] NL: Yeah. You mentioned a team, and the name of a team that I haven't heard of in quite some time, which is QA. How do we think the rise of cloud native has affected jobs? And, tangential to that, what were jobs like prior to cloud native? Because I haven't heard of a QA team in many of the organizations that I've touched. Now, I'm not touching their production dev teams, the people who actually make things; I just haven't heard that name in a while, and I'm wondering if jobs like that have gone away with the rise of cloud native.
[0:16:27.7] BL: No, I’m going to end that rumor right here, that is a whole untruth.
[0:16:33.8] NL: That was not a rumor, that was just conjecture on my part, literally unfounded.
[0:16:38.7] BL: We got to think what does QA do? QA is supposed to be responsible for the quality of our applications and when they first started, there wasn’t a lot of good tooling so a lot of our QA people were manual testers. They started the app, they clicked on everything, they put in all the inputs until it came back and they were professional app breakers.
I'd say, over a decade ago, we got more automated tools, and moving into now, you can automate your web browser; you can actually write software to do all the actions that a human would do. What we have found is that QA as a profession has actually matured, and you can see that because Google — I don't think they even have QA; they have, what do they call them? Software engineers in test, or SDETs. They are developers in their own right, but they write software that makes it easier for developers to write code that works well and code that can be tested. I think the role has matured and taken on another angle in a lot of cases, even where we work.
There are QA engineers in our group, and we still need them, because — you've seen the meme about unit testing: a door that has all the right parts but doesn't fit in its casing, or two hot handles on a sink. The pieces work, right? They both put out hot water, but together they don't work. We still have that; it just looks a little bit different now.
Also, a lot of software is no longer written in a huge monolithic cycle where we would take six months, or a year, or a year and a half, to release a new version. Now, people are trying to turn software around a lot quicker, so QA has had to optimize how they work to fit those processes. They're still there, though.
[0:18:37.8] CC: I would hope so. I mean, I can't answer the question if the question is whether we have as much QA effort out there as before. I don't know, but I hope so, because if you're not QAing your apps, then your users are. That's not good. My team, for example, does its own QA. We don't have separate people doing it; we do it ourselves. It might be just because what we do is pretty specialized — I mean, we are a small team to begin with, and what we do is very specialized. It would be difficult to bring someone in and teach them, and if they were just running QA — I don't know, maybe it wouldn't be that difficult, we could just have instructions, you know.
"Run this command, this is what the output should be" — I don't think it would be that difficult, I take that back. But still, we do it ourselves.
[0:19:31.7] NL: The question was more – less in line with like, “What happened to QA?” It was more like, how do we think that cloud native has affected jobs and the job market and it sounds like, that jobs have changed because of cloud native, they’ve matured as we were just discussing with QA where people aren’t doing the same kind of drudgery or the same kind of toil that they were doing before. Now, we’re using more tooling to do our jobs and kind of lifting up each position to be more cloud native-y, right? More development and infrastructure focus at the same time. At least, that’s what I was getting from it.
[0:20:09.8] BL: Yeah, I think that is true but I think all types of development jobs, especially jobs that are in the cloud native space have changed. One good example would be, with organizations moving to cloud native apps, we’re starting to see, and this is all anecdotal, I have no evidence to back this up, that there are more developers who are on call for the software they write because one, they know it better than anyone else and they’re closer to it.
And two, because having an ops group that just supports apps isn't conducive to being productive, because there's no way one group can understand all the apps. What we're finding is that in this new cloud native era, jobs are maturing: they're gaining some functionality, they're losing some functionality, some jobs are combining, but at the end of the day it's the same thing we were doing 20 years ago; it all just has new titles and we use new software to do it. Which is good.
Because some of these ideas that we came up with 20, 30 years ago are still good ones today.
[0:21:15.5] NL: Yeah, that's actually an interesting question. Do you think it's just the titles that are changing, or are the functions changing? It's like: sysadmins used to be sysadmins, then they were DevOps for a while, and now they're SREs, I should say. Our support teams are now CREs, Customer Reliability Engineering. Is that just a title change, or are there functional differences? I'm inclined to believe there are functional differences.
[0:21:43.9] BL: I think it's both. I think it's the same reason why all engineers, after two years in the field, are somehow senior engineers. People feel like they have progressed when they get new titles — even though, you're the most junior engineer on this team, how can you be a senior engineer?
And then also the same thing with CRE, shout out to Google for making that term popular but really, what it comes down to is maybe the focus changed but maybe it didn’t. Maybe we were already doing that, maybe we were already doing resilience engineering with our customers and maybe we already had great customer support or customer success team.
But I do think that there has been some changes in jobs because what we’re finding is that with modern languages that people are using so teams are moving away from C++ to things like Swift and Go and Rust. We’re finding that because our software is easier to write, we actually don’t have to think about some of the things that we did before.
With Go, technically, you don't have to worry about memory access. With Rust, 100%, you don't have to worry about null pointer exceptions; they don't exist. Now that we have freed our developers to do more development rather than more debugging, what we find is that the jobs will actually change over time. But at the end of the day, even where we work right now, and all over the place, people are devs, they do ops stuff, they do security stuff. I challenge anyone listening to this to find something where I am not telling the truth: we might do both, or more than one thing, but at the end of the day, we can still break it down to what people do.
[0:23:24.8] NL: Yeah, Carlisia, any thoughts on that?
[0:23:26.1] CC: No, I think that was nice with me, sounds right.
[0:23:29.8] NL: Yeah, I agree. I think that there are some functional changes. I think that support versus CRE isn’t just like getting tickets received and then going to a ticket queue and filing those things. I think there are some changes with like, I know from our CRE team they are actively going out and saying like, “Here’s our opinion based on these technologies and this is like why we validated these things,” that are reevaluating their support model constantly and just making sure that they’re like abreast of everything that’s going on so they can more resiliently engineer their customer support.
[0:24:04.5] BL: But hold on, one second though. That’s what I’m talking about with the marketing because guess what? It is supported, a good support team would be doing all those things whether it’s called customer reliability engineering or whatever, it’s support, it’s customer success, it’s getting in front of our customer’s problems and having the answers before they even ask the question, that’s good support.
Whenever we label things like CRE, that's somebody in some corporate marketing center who thought that was a good idea. But just because you don't call it CRE doesn't mean it's not good support, because I will tell you: in the past, at DigitalOcean, we did that, and the term CRE didn't even exist yet, but we were out there in front of problems whenever we could be, and we thought that was good for our customers. What we're finding is that people now have the capabilities, with the progress of the technologies we have, to actually give customers good support — and you don't have to be a Google-sized company to do that anymore. That's the plus.
[0:25:02.9] NL: Yeah, I agree with that.
[0:25:05.5] CC: I want us to talk a little bit about people who are not working in the cloud native space but see it coming, or want to move towards doing something in that area. What should they be looking at? What should they be brushing up on, learning, or incorporating into what they are currently doing? And of course, there are different roles, so it will be different for each one: developers, DevOps or SRE, admins or operators, managers, recruiters. It changes a little bit for everybody.
[0:25:47.5] BL: Well I will hop in here first and say it is all code at the end of the day. When it comes down to what we are doing in cloud native for ops, it doesn’t really matter. You could take a lot of the same principles and do them on prem or wherever else you happen to be. I mean I am not trying to diminish the role of anyone that we work with or anyone in our industry whenever I say this though but when it comes down to it, what I see is people understand the operating system, mostly Linux.
People understand public key encryption, so they understand PKI — you know, we deal with a lot of certs. They understand networking; they can tell you how many IPs are in a /23, and yes, I am throwing CIDR numbers out there. These are things that people know. I don't think there is anything specific to cloud native other than learning Kubernetes itself, or Mesosphere, or Docker Swarm, or the tool from HashiCorp that always escapes me whenever I have to say it out loud. But it is all the same thing.
What you need to know to be good at any job where you are doing ops: you need to understand the theory of operating computers. You need to understand operating systems, networking and how it all fits together, and some security.
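To make the /23 bit concrete: an IPv4 prefix of length n contains 2^(32-n) addresses, so a /23 has 512 addresses, of which 510 are usable hosts once the network and broadcast addresses are set aside. A quick check in Go:

```go
package main

import "fmt"

func main() {
	// Addresses in an IPv4 prefix: 2^(32 - prefixLen).
	prefixLen := 23
	addresses := 1 << (32 - prefixLen)
	// Subtract the reserved network and broadcast addresses.
	fmt.Printf("/%d: %d addresses, %d usable hosts\n",
		prefixLen, addresses, addresses-2)
	// Output: /23: 512 addresses, 510 usable hosts
}
```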
For developers now, it is a little bit interesting, because a lot of the apps we are writing these days are more stateless. So as a developer, you need to think more about the fact that your app may crash.
So anything important that I am holding in memory can go away at any given time. Either, one, I need to store it on more than one thing and keep it in a resilient fashion, or, two, I need to store it in a database immediately. And I would once again argue that if you are a developer who actually understands those topics, you would be able to apply for a cloud native job in this space, because frankly, a lot of cloud native developers writing apps and working in cloud native were doing something else two years ago.
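A minimal Go sketch of that advice: put state behind an interface so it can live outside the process. The in-memory version below is the anti-pattern being warned against — a real app would back the interface with a database or cache — and all the names are illustrative:

```go
package main

import "fmt"

// Store abstracts where state lives. A cloud native app would back
// this with a database or a cache, so a crash or reschedule of the
// process loses nothing.
type Store interface {
	Put(key, value string) error
	Get(key string) (string, error)
}

// memStore is the anti-pattern: state held in a map that vanishes
// the moment the process does.
type memStore struct{ m map[string]string }

func (s *memStore) Put(k, v string) error { s.m[k] = v; return nil }

func (s *memStore) Get(k string) (string, error) {
	v, ok := s.m[k]
	if !ok {
		return "", fmt.Errorf("%q not found", k)
	}
	return v, nil
}

func main() {
	var s Store = &memStore{m: map[string]string{}}
	s.Put("cart:42", "3 items")
	v, _ := s.Get("cart:42")
	fmt.Println(v) // fine until the process restarts; then it is gone
}
```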
[0:27:50.1] CC: Yeah, that sounds right. I think for developers, as you said, it is focusing on authentication — how you handle secret keys, the question of role-based authentication and authorization — and being well-versed in developing clients and servers and handling certs for that interaction. I guess it comes down to being well-versed in distributed systems development, which is what this whole cloud native thing is all about, and on top of that, being well-versed in how to push your apps into containers.
You know, creating images, creating containers, pushing them to a repository, pulling them from the repository, and manipulating and creating containers in different ways. And then on top of that, maybe you want to learn Kubernetes, and we can talk about that too, but I wanted to give Nick a chance to talk about his aspects.
[0:28:59.0] NL: I agree with pretty much everything you guys have said. I think the only thing I would add is really understanding how to use and work with an API in an API-driven environment, because that is what a lot of cloud native is, right? It is using someone else's computer, so how do you do that? Via an API. We are talking about containers and orchestration; those are all done, hopefully, via an API. Luckily, if you are using Kubernetes — which, likely, you are — it is all API driven.
So getting familiar with using an API — most developers, I think, at some point are familiar with that — would be the main thing I would add, outside of what you and Bryan have already said, to what is needed to do a cloud native job.
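As a taste of that API-driven model, here is a short sketch using client-go, the official Kubernetes Go client, to list pods the way kubectl does — assuming a reachable cluster, a kubeconfig in the default location, and a reasonably recent client-go version:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig, the same credentials
	// kubectl uses. Everything kubectl does goes through this API.
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List pods in the default namespace: one GET against the API server.
	pods, err := clientset.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Name)
	}
}
```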
[0:29:40.2] CC: Yeah. Now if someone wanted to learn Kubernetes, well there is the Kubernetes Academy.
[0:29:47.5] NL: There is a Kubernetes Academy.
[0:29:49.4] CC: That is pretty awesome but do you think going through the certification would help?
[0:29:55.3] NL: I think that is a good place to start. So the current certification that exists is the CKA, the Certified Kubernetes Administrator and I think that is a good starting place, especially for someone who is not really touched Kubernetes before. If they’re like, “How do I know the basics of Kubernetes?” going through that certification process I think will be a huge step forward because that really covers most of what you are going to touch on day to day basis for Kubernetes.
[0:30:21.6] CC: And there is the CKAD as well, which is for developers. The CKAD is Certified Kubernetes –
[0:30:29.7] BL: Application Developer.
[0:30:31.7] CC: Application Developer, and the other one is Certified Kubernetes Administrator.
[0:30:34.9] NL: Yeah I was like, “administrative developer?” like.
[0:30:38.2] CC: If you are brand new, I think it is worthwhile doing the developer one first, because it is mostly the commands: you go through the commands just so you have a knowledge of how to interact with Kubernetes. The admin one is more about how you manage and troubleshoot a cluster, so it is more involved, I think; you need to know more. But in any case, I agree with you that it would help, because it serves as a syllabus for what to learn.
It is like, "Okay, these are the things that, if you know them, will help you a lot if you have to do anything with Kubernetes."
[0:31:14.6] NL: Yeah, I don’t think that you need to have a certification to do a job. I really don’t think so unless it is like required by law like you have to.
[0:31:23.2] CC: No, that is not at all what I am saying, but if you don't know anything at all and you're like, "Where do I start?" I would recommend that. It is not a bad place to start. Or if you know some things but feel like you don't know others, and you want to fill in the gaps but don't know what your gaps are — same idea. What do you think, Bryan? Do you think having the certification would be useful?
[0:31:46.5] BL: I don’t know, some people need it but I am also barely graduated from high school and I don’t have a college degree. So I have always leaned on myself for learning things on my own schedule, in my own pace on my own terms but some people do need the structure provided to them by certifications and I’ve only heard good things of people taking those tests. So I think for some people it is actually really good but for others, it might be a waste of time because what will actually happen if you get that certification?
If you work at some large companies — I do know for a fact that getting your AWS certificates actually had a money thing behind it — but in a lot of places, I don't know. It couldn't hurt, though. That is the most important piece: it can't hurt.
[0:32:36.3] NL: Yeah, I totally agree. You learn at least something. Even when I am taking a certification exam for something I was already pretty aware of, I always learn at least one thing — there is always that one good question you have likely never even thought of. But I also agree with Bryan: I don't have my CKA, and I think I am a pretty damn good expert on Kubernetes, so I don't think anything would change for me if I took the exam.
[0:33:00.2] CC: Oh yeah. I work with so many people who have none of those sorts of certifications and they are absolutely experts. I was talking about how it would help me. I want to take those two certifications because it would help me fill in the gaps, and I know there is a lot that I am going to learn, especially with the admin one. So it is using the curriculum as a guide for what I need to learn, and then testing whether I really learned it, and also it would make me feel good. But other than that, I don’t think it has any – I don’t know, I don’t think it is bad either.
[0:33:33.4] BL: And that is the most important piece, what you just said: it made you feel good, because you take certifications to test your knowledge against yourself in a lot of cases. So I think it is good. I just realized you can – I mean, people cannot see behind me. I don’t think I have as many books as Carlisia has up there, but I have read all of mine except for like four of them.
[0:33:52.6] CC: Yeah, I did not read all of these books. I mean, a lot of these books are school-related books that I kept because they are really good, and books that I have acquired, and I have read some of them but not the entire book. Some things I use for reference but definitely have not read. Don’t be impressed, I have not read all of these books. Hopefully one day when I retire, maybe. Anyway –
[0:34:17.7] BL: I think that one interesting thing is that the amount of study that you need to do to gain a certification when you are not working in the space actually gives you that little bit of push that you need to make sure that you know what you need to know. So since I organically came to cloud native, as I explained in my story, you know, I am not really interested in that certification.
But if I wanted to change, and maybe I wanted to change my focus to doing more graphic stuff and there was a certification for this, maybe I would think about it just to make sure that I knew I was eligible for these jobs that I am trying for, so.
[0:34:57.8] CC: Yeah.
[0:34:58.8] NL: Yeah, that makes sense. Also, my books are over there and I have read most of the way through many of my books but not all the way through because a lot of them are boring.
[0:35:09.5] BL: But I will say, since we are talking about books and talking about getting yourself into Kubernetes land, right now is actually the best time to buy books, because there are lots of them. I am not saying that all of these books are super awesome, but some of them are. Notably, this Programming Kubernetes book is pretty awesome, and the reason it is so awesome is because my quote is on the back of it.
[0:35:33.3] NL: I was going to say.
[0:35:34.4] BL: Yeah, my name is on the back of it. And then another book that I just picked up lately is called the Kubernetes Cookbook, for building cloud native applications, from O’Reilly, and the reason that I like it is because I have always, for 20 years now, loved O’Reilly cookbooks: small problem, answer with an explanation. And then there is another one called Kubernetes Patterns, which I just started, and I think it is pretty good too.
And just to say, these are not endorsements, but this is what I am reading right now. It is like a thousand pages here, the things that I am trying to get through right now to keep up to date with what we are trying to do, because actually the biggest problem with what we are doing is that we are trailblazing. A lot of the things that are happening, like the way that Kubernetes advances every few months, are new, new, new, new.
So there is not a lot of prior art in what we are doing that is public. So what you need to do is turn yourself into someone who actually understands the theory of what we are doing rather than just the practical application of it. Understand that piece too, but you have got to understand the theory, which is why I said I have literally been doing the same thing for the last 25 years: I learned how to program, and I learned Unix, and then I learned Linux, and then I learned networking.
I take all of those lessons and I can apply them all the time. So that is actually the most important part about any of this.
[0:36:56.9] CC: Yeah, I agree with you: going through the fundamentals helps so much more than going through the specifics. In fact, trying to learn specifics without having the fundamentals can be very painful, but then you learn the fundamentals and you go, “Oh yeah, it totally makes sense.” I have been trying to listen to YouTube lectures on server systems and I have a lot of moments of, “Ah, that is why Kubernetes works this way, to address this problem.”
And I have that Programming Kubernetes book, which is not in my office. I have to find it, but yes, that is a very good book. And I have this one.
[0:37:37.9] BL: Oh Cloud Native DevOps with Kubernetes. That is another good book.
[0:37:43.5] CC: Yes.
[0:37:44.3] BL: I have it too.
[0:37:46.3] CC: I have that same one?
[0:37:47.6] BL: Yes.
[0:37:48.2] CC: Good book and this, I haven’t gotten through it yet.
[0:37:52.7] BL: It is called Kubernetes in Action.
[0:37:54.3] CC: Yes, thank you for saying the name because if you are not in the video you wouldn’t know.
[0:37:58.5] BL: So really what we are saying now –
[0:37:59.7] CC: People say great things about Kubernetes in Action, this one.
[0:38:02.9] BL: So I actually want to bring up another thing and say, I read a lot. I like to read. I read a lot of blog posts, and here is another crazy thing: the YouTube videos from KubeCon. Every year, every few months, we publish 180 talks for free, and there are some good lessons in those. So the good thing about getting into cloud native is that you can get into it for cheap, because all of this information is out here. The Kubernetes source is free. Go read it.
I mean, 5,000 developers have worked on it. I am sure you will get a lot out of that, so go do that. But also YouTube talks, blog posts, just following your favorite SIGs, the Special Interest Groups for Kubernetes, and their community meetings. You can learn so much about how this space works and really how to write software in it without spending a dime, other than having a computer and Internet.
[0:38:55.5] CC: Yeah, and I am going to give a tip for people that I actually caught on to not too long ago. I subscribe to YouTube premium, which I think is $5 a month. It is the best $5 I have ever spent, because really I don’t have time to sit in front of a video and just watch something unless it is very special, and reading is also very – after I spend a whole day reading code, my mind doesn’t want to read anything else. So I love podcasts and I listen to a lot of podcasts.
And now the YouTube videos have been even more educational to me, because with the premium version of YouTube, if your phone locks it will still play.
[0:39:40.2] BL: And you can download the videos.
[0:39:42.0] CC: You can download the videos too. Yeah, if you go on a camping trip or an airplane you have them, so it’s been fantastic. I just put on my little Bluetooth headset and, as I am doing laundry or as I’m cooking or anything, I am always listening to something. There goes the tip.
[0:40:01.9] NL: Yeah, I totally agree. I love YouTube premium. No ads, as Brian said, is the best. I am going to throw out a book recommendation, one co-written by my colleague and good friend Craig Tracey, called Managing Kubernetes. Like I was saying, a lot of these tech books are kind of boring, but this one is actually a lot of fun to read. It is written well, in a way that I found I kept turning the pages. So I really liked it.
[0:40:26.3] BL: Yeah, it is only 150 pages too.
[0:40:29.3] NL: Yeah that is pretty short.
[0:40:30.5] BL: And the software that Carlisia writes is in the next to last chapter of it, so.
[0:40:37.6] NL: Oh shoot, all right throw it out then.
[0:40:40.4] BL: Well no, I am just saying it is another good book, and I like the way you bring this up, because this information is out there. But I know we’re coming close to the end, and I had one thing that I want to talk about today.
[0:40:50.3] NL: I was just about to bring that up, please take us away.
[0:40:52.3] BL: All right, so we talked about where we come from, we talked about things in the space, about the jobs, how we keep up to date, but really the most important piece is what happens in the future. You know, Kubernetes is only five years old, so theoretically cloud native jobs are only a few years old. So how does cloud native move in the future? I do have some thoughts on this one.
So what we are going to see is what we have seen over the last two decades: our stacks will get more complex, we will run more apps, we will have more CPUs and more networking, and it is not even Moore’s Law stuff. We’ll just have more stuff. So what I find is that in the future, what we need to think about are things like automation. We need to think about better resilience, apps that can actually take care of themselves. So your app goes down, what happens? Well, nothing, because it brought itself back up.
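Brian’s point about apps that bring themselves back up is, at heart, the reconcile loop that Kubernetes controllers run: observe the actual state, compare it with the desired state, and correct the drift. Here is a toy sketch of that pattern; the replica counts and behavior are made up for illustration and are not a real Kubernetes API.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Desired state: three replicas. Actual state: one crashed earlier.
	desired := 3
	running := 2

	// A controller-style loop: compare actual with desired and close the gap.
	// Real controllers watch the API server for events rather than polling.
	for i := 0; i < 5; i++ {
		if running < desired {
			running++ // "restart" a replica
			fmt.Printf("drift detected: started replica, now %d/%d\n", running, desired)
		} else {
			fmt.Println("steady state: nothing to do")
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```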
So I see that the jobs that we have now are just going to evolve into better versions of what we have right now. So developers will still be developing. The more interesting piece is that we are going to have more developers because more people are taking these boot camp courses, more people are going into computer science in school. So we are actually going to have more developers out there. So all that means is that we are just going to have more problems to solve at least for the next few years.
A generation from now, I couldn’t tell you what is going to happen. Maybe we will all be out of work. I will be retired, so I probably won’t care, but just think about this: now is literally the best time to get into writing software, and specifically cloud native applications, whether you are in operations or you are writing applications that run on clouds or anything like this. This is the best time because it is still the beginning, there is more work to do than we have people, and if you look through job postings you’ll realize that, wow, everyone is looking for this.
[0:42:48.3] CC: Yeah, and at the same time, there is a sufficient amount of resources out there for you to learn, even if you don’t want to, or can’t, pay. We are not now so much at the beginning, where there was nothing, so it is a very good time.
[0:43:04.6] NL: Yeah, the wealth of knowledge that is out there for free is unheard of. It is unprecedented, and yeah, I totally agree that this is the best time. Brian, if we go by your thesis throughout this entire episode, basically we are going to be doing the same thing in 20 years as we are doing now, and it is the same thing we did 20 years ago. So it is probably going to be like you said: developers are going to develop-ate, sys admins are going to sys administrate.
[0:43:28.6] CC: I love that.
[0:43:30.1] BL: And security people are going to complain about everything.
[0:43:33.4] NL: That is how we are going to change. We are just going to be running quantum applications in 20 years, but they are still going to be if/else statements.
[0:43:41.1] CC: My prediction is that we are going to have greater server access, like easier server access, especially for developers, and there will be more buttons to press and more visual tools, so you don’t necessarily have to be logging into a server to use command lines. We will have more tools abstracting all of that detailed work that developers do.
[0:44:07.0] BL: So more abstractions on top of abstractions.
[0:44:10.1] CC: Yeah that is my prediction. Why not?
[0:44:13.3] BL: Well, you know what? I mean, if that is true, it is because that is what we have been doing forever now, so we are going to continue on doing this thing.
[0:44:20.0] CC: Because it is what people want.
[0:44:22.0] BL: Because it works.
[0:44:23.0] CC: Yeah, it makes life easier for some people. I don’t see why we wouldn’t move in that direction. But before we wrap up, unless you guys want to make predictions too, I really wanted to touch base on the hiring side of things, the recruiters and hiring managers doing the interviewing. I imagine there is a whole bunch of people out there who need to recruit people to do these cloud native jobs, and how can we help them? Can we give them some tips? How can they attract people? What should they be looking for?
[0:45:03.4] NL: Well, I guess my thought is that I really feel like recruiters need to start learning the technology that they are hiring for. I don’t think that they can hide behind the idea that they’re recruiters and they don’t need to know. If you want to hire good people, if you want to weed out the bad people or whatever it is that you are trying to do, you need to actually learn the technology that you are hiring for, and like we were saying, there is now a wealth of knowledge that is free for you to access. Please look.
[0:45:32.9] CC: I am not going to disagree with that.
[0:45:34.3] BL: And the interesting thing is when he says learn it, he doesn’t mean that you have to be able to produce it but you should understand how it works at the minimum.
[0:45:42.8] NL: Yeah, and also know when someone’s BS-ing you in the tech screen.
[0:45:48.2] CC: But it is not easy, because you might go in that direction with the intention of learning and you might misunderstand things, and you know, how deep do you have to go to not misunderstand the technology?
[0:46:06.1] BL: You know what? I don’t think there is an answer for that. You need to be somewhere in between knowing nothing and being an expert, where if you’re hiring for cloud native and Kubernetes, you can’t offer a job that wants 10 years of Kubernetes experience. First of all, Kubernetes is huge and no one has Kubernetes experience throughout the whole stack, and second of all, Kubernetes is only five years old.
So please don’t do that to yourself. You should know how old it is and at least know the parts and what your team is going to be working on. But for managers, wow, actually I don’t have a good answer for that, so I am just going to punt on that one.
[0:46:45.1] CC: Well, how would it be different? Actually, it is going to sound like I asked a loaded question, but I just now had this realization: I don’t think it would be different from what we were saying in regards to giving tips for people to prepare themselves to make a move into this space if they are not working with any of this stuff. It would be the same: try to find people who know distributed systems, who can debug well. I am not even going to go into working well with people, that is such a given. Let’s just keep it to the tech stack and all of those things that we recommended for people to learn. I don’t know.
[0:47:26.0] BL: Yeah, it sounds good to me.
[0:47:28.1] NL: All right, well I think that just about wraps it up for this week of The Podlets Podcast. I thought this was a really interesting discussion. It was cool to talk about where we were and where we are going, and what brought us all together, as I said.
[0:47:44.2] CC: Nick, do you want to share with us what your tagline for this episode was?
[0:47:48.1] NL: Yeah, the tagline for this episode is CREAM: Cash Rules Everything Around Me.
[0:47:53.3] BL: Dollar-dollar bills you all.
[0:47:55.8] CC: Ka-ching, ka-ching, ka-ching
[0:47:58.8] NL: All right, thank you so much. Thank you Brian, thanks for joining us.
[0:48:03.9] BL: Thank you for having me.
[0:48:05.3] NL: Yeah and thank you Carlisia.
[0:48:07.6] CC: This was really good, thank you.
[0:48:09.7] NL: Yeah, I had a lot of fun. Bye, y’all.
[0:48:13.5] BL: Bye.
[0:48:14.1] CC: Bye.
[END OF EPISODE]
[0:48:14.8] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
Today on The Podlets Podcast, we are joined by VMware's Vice President of Research and Development, Craig McLuckie! Craig is also a founder of Heptio, which was acquired by VMware, and during his time at Google he was part of bringing Kubernetes into being. Craig has loads of expertise and shareable experience in the cloud native space and we have a fascinating chat with him, asking about his work, Heptio and of course, Kubernetes! Craig shares some insider perspective on the space, the rise of Kubernetes and how the increase in Kubernetes' popularity can be managed. We talk a lot about who can use Kubernetes and the prerequisites for implementation; Craig insists it is not a one-size-fits-all scenario. We also get into the lack of significantly qualified minds and how this is impacting competition in the hiring pool. Craig comments on taking part in the open source community and the buy-in that is required to meaningfully contribute, as well as sharing his thoughts on the need to ship new products and services regularly. We finish off the episode with some of Craig's perspectives on the future of Kubernetes, the dangers it poses to code if neglected, and the next phase of its lifespan. For this amazing chat with a true expert in his field, make sure to join us for this episode!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Special guest:
Craig McLuckie
Hosts:
Carlisia Campos
Duffie Cooley
Josh Rosso
Key Points From This Episode:
• A brief introduction to Craig's history and his work in the cloud native space.
• The questions that Craig believes more people should be asking about Kubernetes.
• Weighing the explosion of the Kubernetes space; fragmentation versus progress.
• The three pieces of enterprise software and aiming to enlarge the 'crystalline core'.
• Craig's thoughts on specialized Kubernetes operating systems and their tradeoffs.
• Quantifying the readiness of an organization to implement Kubernetes.
• Craig's reflections on Heptio and the lessons he feels he learned in the process.
• The skills shortage for Kubernetes and how companies are approaching this issue.
• Balancing the needs and level of the community and shipping products regularly.
• Involvement in the open source community and the leap of faith that is inherent in the process.
• The question of microliths; making monoliths more complex and harder to manage.
• Masking problems with Kubernetes and how detrimental this can be to your code.
• Craig's thoughts on the future of the Kubernetes space and possible changes.
• The two duty cycles of any technology; the readiness phase that follows the hype.
Quotes:
“I think Kubernetes has opened it up, not just in terms of the world of applications that can run Kubernetes, but also this burgeoning ecosystem of supporting technologies that can create value.” — @cmcluck [0:06:20]
“You're not a cool mainstream enterprise software provider if you don’t have a Kubernetes story today. I think we’ll start to see continued focus and consolidation around a set of the larger organizations that are operating in this space.” — @cmcluck [0:06:39]
“We are so much better served as a software company if we can preserve consistency from environment to environment.” — @cmcluck [0:09:12]
“I’m a fan of rendered down, container-optimized operating system distributions. There’s a lot of utility there, but I think we also need to be practical and recognize that enterprises have gotten comfortable with the OS landscape that they have.” — @cmcluck [0:14:54]
Links Mentioned in Today’s Episode:
Craig McLuckie on LinkedIn
Craig McLuckie on Twitter
The Podlets on Twitter
Kubernetes
VMware
Brendan Burns
Cloud Native Computing Foundation
Heptio
Mesos
Velero
vSphere
Red Hat
IBM
Microsoft
Amazon
KubeCon
Transcript:
EPISODE 13
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically-minded decision maker, this podcast is for you.
[INTERVIEW]
[00:00:41] CC: Hi, everybody. Welcome back to The Podlets podcast, and today we have a special guest, Craig McLuckie. Craig, I have the hardest time pronouncing your last name. You will correct me, but let me just quickly say, well, I’m Carlisia Campos and today we also have Duffie Cooley and Josh Rosso on the show. Say that three times fast, Craig McLuckie.
Please help us say your last name and give us a brief introduction. You are super well-known in the Kubernetes community and inside VMware, but I’m sure there are plenty of people who should know about you and don’t yet.
[00:01:20] CM: All right. I’ll do a very quick intro. Hi, I’m Craig McLuckie. I’m a Vice President of Research and Development here at VMware. Prior to VMware, I spent a fair amount of time at Google, where my friend Joe and I were responsible for building and shipping Google Compute Engine, which was an interesting exercise in bringing traditional enterprise virtualized workloads into the very sophisticated Google data center.
We then went ahead and as our next project with Brendan Burns, started Kubernetes, and that obviously worked out okay, and I was also responsible for the ideation and formation of the Cloud Native Computing Foundation. I then wanted to work with Joe again. So we started Heptio, a little startup in the Kubernetes ecosystem. Almost precisely a year ago, we were acquired by VMware. So I’m now part of the VMware company and I’m working on our broader strategy around cloud native apps under the brand [inaudible 00:02:10].
[00:02:11] CC: Let me start off with a question. I think it is going to be my go-to first question for every guest that we have in the show. Some people are really well-versed in the cloud native technologies and Kubernetes and some people are completely not. Some people are asking really good questions out there, and I try to too as I’m one of those people who are still learning.
So my question for you is: what do you think people are asking that is not framed right, that you wish they would be asking in a different way?
[00:02:45] CM: It’s a very interesting question. I don’t think there are any bad questions in the world, but one question I encounter a fair bit is, “Hey, I’ve heard about this Kubernetes thing and I want one.” I’m not sure it’s actually the right question, right? Kubernetes is a powerful technology. I definitely think we’re in this sort of peak hype phase of the project. There are a set of opportunities that Kubernetes really brings: a much more robust ability to manage, it abstracts away infrastructure, there are some very powerful things. But to be really successful with a Kubernetes project, there are a number of additional ingredients that really need to be thought through.
The questions that ought to be asked are, "I understand the utility of Kubernetes and I believe that it would bring value to my organization, but do I have the skills and capabilities necessary to stand up and run a successful Kubernetes program?" That’s something to really think about. It’s not just about the nature of the technology, but it really brings in a lot of new concepts that challenge organizations.
If we think about applications that exist in Kubernetes, there are challenges with observability. When you think about the mechanics of delivering into a containerized sort of environment, there are a lot of dos and don’ts that make a ton of sense there. A lot of organizations I’ve worked with are excited about the technology, but they don’t necessarily have the depth of understanding of where it's best used and then how to operate it.
The second addendum to that is, “Okay, I’m able to deploy Kubernetes, but what happens the next day? What happens if I need to update it? When I need to maintain it? What happens when I discover that I need not one Kubernetes cluster or even 10 Kubernetes clusters, but a hundred or a thousand or 10,000.” Which is what we are starting to see out there in the industry. “Have I taken the right first step on that journey to set me up for success in the long-term?”
I do think there’s just a tremendous amount of opportunity and excitement around the technology, but also think it’s something that organizations really need to look at as not just about deploying a platform technology, but introducing the necessary skills that are necessary to operate and maintain it and the supporting technologies that are necessary to get the workloads on to it in a sustainable way.
[00:04:42] JR: You’ve raised a number of assumptions around how people think about it, I think, which are interesting. Even just starting with the idea of the packaging problem that containerization represents is a reasonable start. So infrequently do we describe the full context of the problems that Kubernetes solves that I think people just get way ahead of themselves. It’s a pretty good description.
[00:05:04] DC: So maybe in a similar vein, Craig, we had mentioned all the pieces that go into running Kubernetes successfully. You have to bolt some things on, maybe for security, or do some things to ensure observability is adequate, and it seems like the ecosystem has taken notice of all those needs and has built a million projects and products around that space.
I’m curious of your thoughts on that, because in one way it’s great, because it shows the space is really healthy and thriving. In another way, it causes a lot of fragmentation and confusion for people who are deciding whether they can or cannot run Kube, because there are so many options out there to accomplish those kinds of things. So I was just curious of your general thoughts on that and where it’s headed.
[00:05:43] CM: It’s fascinating to see the sort of burgeoning ecosystem around Kubernetes, and I think it’s heartening, because if you think at the very highest level, the world is going to go one of two ways with the introduction of the hyper-scale public cloud. It’s either going to lead us into a world which feels like the mainframe era again, where no one ever got [inaudible 00:06:01] Amazon in this case, or by Microsoft, whatever the case. Whoever sort of emerges over time as the dominant force.
But it also represents some challenges, where you have these vertically integrated closed systems, and innovation becomes prohibitively difficult. It’s hard to innovate in a closed system, because you’re innovating only for organizations that have already taken that dependency. I think Kubernetes has opened it up, not just in terms of the world of applications that can run Kubernetes, but also this burgeoning ecosystem of supporting technologies that can create value. There’s a reason why startups are building around Kubernetes. There’s a reason they’re looking to solve these problems.
I do think we’ll see a continued period of consolidation. You're not a cool mainstream enterprise software provider if you don’t have a Kubernetes story today. I think we’ll start to see continued focus and consolidation around a set of the larger organizations that are operating in this space. It’s not accidental that Heptio is a part of VMware at this point. When I looked at the ecosystem, it was pretty clear we need to take a boat to fully materialize the value of Kubernetes and I am pleased to be part of this organization.
So I do think you’ll start to see a variety of different vendors emerging with pretty clear, well-defined opinions and relatively turnkey solutions that address the gamut of capabilities an organization needs to get into Kubernetes. One of the things that delights me about Kubernetes is that if you are a sophisticated organization that is self-identifying as a software company, and this is sort of manifest in the internet space, if you’re running a sort of hyper-scale internet service, you are kind of by definition a software company.
You probably have the skills on hand to make great choices around what projects to follow in the communities, and to identify when things are reaching a point of critical mass. You’re running in a space where your system is relatively homogenous. You don’t have the sort of massive gamut of workloads that a lot of mainstream enterprise organizations have. There are going to be different approaches to the ecosystem depending on which organization is looking at the problem space.
I do think this is prohibitively challenging for a lot of organizations that are not resourced at the level of a hyper-scale internet company from a technology perspective, where their day job isn’t running a production service for millions or billions of users. I do think situations like that, it makes a tremendous amount of sense to identify and work with someone you trust in the ecosystem, that can help you just navigate the wild map that is the Kubernetes landscape, that can participate in a number of these emerging communities that has the ability to put their thumb on the scale where necessary to make sure that things converge.
I think it’s situational. I think the lovely thing about Kubernetes is that it does give organizations a chance to cut their teeth without having to dig into a deep procurement cycle with a major vendor. We see a lot of self-service Kubernetes projects getting initiated. But at some point, almost inevitably, people need a little bit more help, and that’s the role of a lot of these vendors. The thing that I truly hope for, that I’m personally committed to, is that as we start to see the convergence of this ecosystem, as we start to see the pieces falling into place, we retain an emphasis on the value of community, and that we also sort of avoid the balkanization and fragmentation which sometimes comes out of these types of systems.
We are so much better served as a software company if we can preserve consistency from environment to environment. The reality is, as we start looking at large organizations, enterprises that are consuming Kubernetes, it’s almost inevitable that they’re going to be consuming Kubernetes from a number of different sources. Whether the sources are cloud providers delivering Kubernetes services, or whether it’s Kubernetes clusters that a dedicated centralized IT team is delivering, or whether it’s vendor-provided Kubernetes, there are going to be a lot of different flavors and variants on it.
I think working within the community, not as kingmakers, but as concerned citizens that are looking to make sure that there are very high levels of consistency from offering to offering, means that our customers are going to be better served. We’re right now in a time where this technology is burgeoning. It’s highly scrutinized, but it’s not necessarily very widely deployed. So I think it’s important to just keep an eye on that sort of community centricity. Stay as true to upstream as possible. Avoid balkanization, and I think everyone will benefit from that.
[00:10:16] DC: Makes sense. One of the things I took away from my year, I was just looking kind of back at my year and learning, consolidating my thoughts on what had happened. One of the big takeaways for me in my customer engagements this year was that a number of customers outright came out explicitly and said, “Our success as a company is not going to be measured by our ability to operate Kubernetes, which is true and obvious.”
But at the same time, I think that that’s a really interesting moment of awareness for a lot of the people that I work with out there in the field, where they realized, you know what, Kubernetes may be the next best thing. It may be an incredible technology, but fundamentally, it’s not going to be the measure by which our success is graded. It’s going to be what we do on top of that that is more interesting.
So I think that your point about the ecosystem being large enough that people will be consuming Kubernetes from multiple sources is sort of amplified by that, because people are going to look for that easy button as an inroad. They’re going to look for some way to get the Kubernetes thing so that they can actually start exploring what will happen on top of it as their primary goal, rather than how to get Kubernetes from an operational perspective, or even understanding the care and feeding of it, because they don’t see that as the primary measure of success.
[00:11:33] CM: That is entirely true. When I think about enterprise software, there are sort of these three pieces of it. The first piece is the sort of crystalline core of enterprise software. That’s consistent from enterprise to enterprise to enterprise. It’s purchased from primary vendors or it’s built by open source communities. It represents a significant basis for everything.
There’s the sort of periphery, the sea of applications that enterprises build around that, which are entirely unique to their environment and relatively fluid. Then there’s this weird sort of interstitial layer, which is the integration glue that exists between that crystalline core and those applications and operating practices that enterprises create.
So I think from my side, we benefit if that crystalline core is as large as possible so that enterprises don’t have to rely on bespoke integration practices as much possible. We also need to make allowances for the idea that that interstitial layer between the sort of core of a technology like Kubernetes and the applications may be modular or sort of extended by a variety of different vendors. If you’re operating in this space, like the telco space, your problems are going to be unique to telco, but they’re going to be shared by every other telco provider.
One of the beautiful things about Kubernetes is that it is sufficiently modular; it is a pretty well-thought-out system. So I think we will start to see a lot of specialization in terms of those integration pieces, a lot of specialization in terms of how Kubernetes is fit to a specific area, and I think that represents an awful lot of opportunity for the community to continue to evolve. But I also think it means that we as contributors to the project need to make allowances for that. We can’t hold opinions to the point where they preclude massive significant value for organizations as they look at modularizing and extending the platform.
[00:13:19] CC: What is your opinion on people making specialized Kubernetes operating systems? For example, we’re talking about telcos. I think there’s a Kubernetes OS specifically for telcos that strips away things that that kind of industry doesn’t need. What are the tradeoffs that you see?
[00:13:39] CM: It’s almost inevitable that you’re going to start to see specialized operating system distributions that are tailored to container-based workloads. I think as we start looking at the telco space with network function virtualization, Kubernetes promises to be something that we never really saw before. At the end of the day, telcos very broadly deployed OpenStack as their primary substrate for network function virtualization.
But at the end of the day, they ended up not just deploying one rendition of OpenStack, but in many cases three, four, five, depending on what functions they wanted to run, and there wasn’t sufficient commonality in terms of the implementations. It became very sort of vendor-centric and balkanized in many ways. I think there’s an opportunity here to work hard as a community to drive convergence around a lot of those Kubernetes constructs so that, sure, the operating system is going to be different.
If you’re running an NFV data plane implementation, doing a lot of bit slinging, it’s going to look fundamentally different to anything else in the industry, right? But that shouldn’t necessarily mean that you can’t use the same tools to organize, manage and reason about the workloads. A lot of the innovations that happen above that shouldn’t necessarily be tied to that. I think there’s promise there and it’s going to be an amazing test for Kubernetes itself to see how well it scales into those environments.
By and large, I’m a fan of rendered down, container-optimized operating system distributions. There’s a lot of utility there, but I think we also need to be practical and recognize that enterprises have gotten comfortable with the OS landscape that they have. So we have to make allowances so that, as part of containerizing and distributing your application, you don’t necessarily need to wholesale re-qualify the underlying OS and challenge a lot of the assumptions. So I think we just need to be pragmatic about it.
[00:15:19] DC: I know that’s a dear topic to Josh and I. We’ve fought that battle in the past as well. I do think it’s another one of those things where it’s a set of assumptions. It’s fascinating to me how many different ecosystems are sort of collapsing, maybe not ecosystems. How many different audiences are brought together by a technology like container orchestration. That you are having that conversation with, “You know what? Let’s just change the paradigm for operating systems.” That you are having that conversation with, “Let’s change the paradigm for observability and lifecycle stuff. Let’s change the paradigm for packaging. We’ll call it containers.”
You know what I mean? It’s so many big changes in one idea. It’s crazy.
[00:15:54] CM: It’s a little daunting if you think about it, right? I always say, change is easiest across one dimension, right? If I’m going to change everything all at once across all the dimensions, life gets really hard. I think, again, it’s one of these things where Kubernetes represents a lot of value. I walk into a lot of customer accounts and I spend a lot of time with customers. I think based on their experiences, they sort of make one of two assumptions.
There’s a set of vendors that will come into an environment and say, “Hey, just run this tool against your virtual machine images – and Kubernetes, right?” Then they have another set of vendors that will come in and say, “Yeah. Hey, you just need to go like turn this thing into 12 factor cloud native service mesh-linked applications driven through CICD, and your life is magic.” There are some cases where it makes sense, but there’re some cases where it just doesn’t.
Hey, what uses a 24 gigabyte container? Is that really solving the problems that you have in some systematic way? At the other end of the spectrum, there’s no world in which an enterprise organization is rewriting 3,000, 5,000 applications to be cloud native from the ground up. It just is not going to happen, right? So just understanding the return on investment associated with the migration into Kubernetes, knowing where it makes sense and where it doesn’t, is such an important part of this story.
[00:17:03] JR: On that front, and this is something Duffie and I talk to our customers about all the time. Say you’re sitting with someone and you’re talking about potentially using Kubernetes, or they’re thinking about it. Are there some key indicators that you see, Craig, as like, “Okay, maybe Kubernetes does have that return on investment pretty soon to justify it”? Or maybe even in the reverse, some things where you think, “Okay, these people are just going to implement Kubernetes and it’s going to become shelfware.” How do you qualify as an org, “I might be ready to bring on something like Kubernetes”?
[00:17:32] CM: It’s interesting. For me, it’s almost inevitably – as much about the human skills as anything else. I mean, the technology itself isn’t rocket science. I think the sort of critical success criteria, when I start looking at engagement, is there a cultural understanding of what Kubernetes represents?
Kubernetes is not easy to use. That initial [inaudible 00:17:52] to the face is kind of painful for people that are used to different experiences. Making sure that the basic skills and expectations are met is really important. I think there’s definitely some sort of acid test around workload fit as you start looking at Kubernetes. It’s an evolving ecosystem and it’s maturing pretty rapidly, but there are still areas that need a little bit more heavy lifting, right?
So if you think about like, “Hey, I want to run a vertically-scaled OLTP database in Kubernetes today.” I don’t know. Maybe not the best choice. If the customer knows that, if they have enough familiarity or they’re willing to engage, I think it makes a tremendous amount of sense.
By and large, the biggest challenge I see is not so much in the Kubernetes space; it’s easy enough to get to a basic cluster. There are sort of two dimensions to this, and one is day two operations. I see a lot of organizations that have worked to create scale-up programs of platform technologies. Before Kubernetes there was Mesos, and there’s obviously PCF, which we’ll be becoming increasingly involved in.
Organizations that have chewed on creating and deploying a standardized platform often have the operational skills, but you also need to look at whether that previous technology really met the criteria, and whether you have the skills to operate it on a day two basis. Often they’ve worked out the day two operational issues, but they still haven’t figured out what it means to create a modern software supply chain that can deliver into the Kubernetes space.
They haven’t figured out necessarily how to create the right incentive structures and experiences for the developers that are looking to build, package and deliver into that environment. That’s probably the biggest point of frustration I see with enterprises: “Okay. I got to Kubernetes. Now what?” That question just hasn’t been answered. They haven’t really thought through, “These are the CICD processes. This is how you engage your cyber team to qualify the platform for these classes of workloads. This is how you set up a container repo and run scans against it. This is how you assign TTL on images, so you don’t just get a massive repo.”
There’s so much in the application domain that just needs to exist that I think people often trivialize, and it’s really taking the time and picking a couple of projects, being measured in the investments, making sure you have the right kind of cultural profile of teams that are engaged, creating that sort of celebratory moment of success, and making sure that the team is sort of metricking and communicating the productivity improvements, etc. That really drives the adoption and engagement with the whole customer base.
[00:20:11] CC: It sounds to me like you have a book in the making.
[00:20:13] CM: Oh! I will never write a book. It just seems like a lot of work. Brendan and a bunch of my friends write books. Yeah, that seems like a whole lot of work.
[00:20:22] DC: You had mentioned that you decided you wanted to work with Joe again, and you formed Heptio. I was actually there for a year; I think you were around for a bit longer than that, obviously. I’m curious what your thoughts about that were as an experiment. If you just think about it as that part of the journey, do you think that was a success, and what did you learn from that whole experiment that you wished everybody knew, just from a business perspective? It might have been business or it might have been running a company, any of that stuff.
[00:20:45] CM: So I’m very happy with the way that Heptio went. There were a few things that sort of stood out for me as things that folks should think about if they’re going to start a startup or they want to join a startup. The first and foremost I would say is design the culture to the problem at hand. Culture isn’t accidental. I think that Heptio had a pretty distinct and nice culture, and I don’t want to sound self-congratulatory.
I mean, as with anything, a certain amount of this is work, but a lot of it is luck as well. Making sure that the cultural identity of the company is well-suited to the problem at-hand. This is critical, right? When I think about what Heptio embodied, it was really tailored to the specific journey that we were setting ourselves up for.
We were looking to be passionate advocates for Kubernetes. We were looking to walk the journey with our customers in an authentic way. We were looking to create a company that was built around sustainability. I think the culture is good and I encourage folks either the thing you’re starting is a startup or looking to join one, to think hard about that culture and how it’s going to map to the problems they’re trying to solve.
The other thing that I think really motivated me to do Heptio, and I think this is something that I’m really excited to continue on with VMware, was the opportunity to walk the journey with customers. So many startups have this massive reticence to really engage deeply in professional services. In many ways, Google is fun. I had a blast there. It’s a great company to work for. We were able to build out some really cool tech and do good things.
But I grew kind of tired of writing letters from the future. It was, “Okay, we have flying cars,” while the customer is saying, “I can’t even start my car and get to work. It’s great that you have flying cars, but right now I just need to get in my car, drive down the block, get out, and get to work.”
So walking the journey with customers is probably the most important learning from Heptio and it’s one of the things I’m kind of most proud of. That opportunity to share the pain. Get involved from day one. Look at that as your most valuable apparatus to not just build your business, but also to learn what you need to build. Having a really smart set of people that are comfortable working directly with customers or invested in the success of those customers is so powerful.
So if you’re in the business or in the startup game, investors may be leery of building out a significant professional service as a function, because that’s just how Silicon Valley works. But it is absolutely imperative in terms of your ability to engage with customers, particularly around nascent technologies, filled with gaps where the product doesn’t exist. Learn from those experiences and bring that back into the core product.
It’s just a huge part of what we did. If I was ever in a situation where I had to advice a startup in the sort of open source space, I’d say lean into the professional service. Lean into field engineering. It’s a critical way to build your business. Learn what customers need. Walk the journey with them and just develop a deep empathy.
[00:23:31] CC: With new technology, there is always a concern about having enough professionals in the market who are knowledgeable in that new technology. There is always a gap while people catch up.
So I’m curious to know how customers or companies, prospective customers, are thinking in terms of finding professionals to help them. Are they concerned that there aren’t enough professionals in the market? Are they finding that the current people who are admins and operators are having an easy time because their skills are transferable, if they’re going to embark on the Kubernetes journey? What are they telling you?
[00:24:13] CM: I mean, there’s a huge skills shortage. This is one of the kind of primary threats to the short term adoption of Kubernetes. I think Kubernetes will ultimately permeate enterprise organizations. I think it will become a standard for distributed systems development, effectively emerging as an operating system for distributed systems as people build more natively around Kubernetes. But right now it’s like the early days of Linux, where when you deployed Linux, you’d have to kind of build it from scratch. It is definitely a challenge.
For enterprise organizations, it’s interesting, because there’s a war for talent. There’s just this incredible appetite for Kubernetes talent. There’s always that old joke about the job description asking for 10 years of Kubernetes experience on a five-year-old project. That certainly is something we see a lot.
I’d take it from two sides. One is recognizing that as an enterprise organization, you are not going to be able to hire this talent. Just accept that sad truth. You can hire a seed crystal for it, but you really need to look at that as something that you’re going to build out as an enablement function for your own consumption.
As you start assessing individuals that you’re going to bring on in that role, don’t just assess for Kubernetes talent. Assess for the ability to teach. Look for people that can come in and not just do, but teach and enable others to do it, right? Because at the end of the day, if you need like 50 Kubernauts at a certain level, so does your competitor and all of your other competitors. So does every other function out there. There’s just massive shortage of skills.
So emphasizing your own – taking on the responsibility of building your own expertise. Educating your own organization. Finding ways to identify people that are motivated by this type of technology and creating space for them and recognizing and rewarding their work as they build this out. Because it’s far more practical to hire into existing skillset and then create space so that the people that have the appetite and capability to really absorb these types of disruptive technologies can do so within the parameters of your organization.
Create the structures to support them and then make it their job to help permeate that knowledge and information into the organization. It’s just not something you can just bring in. The skills just don’t exist in the broader world. Then for professionals that are interested in Kubernetes, this is definitely a field that I think we’ll see a lot of job security for a very long-time. Taking on that effort, it’s just well worth the journey.
Then I’d say the other piece of this is that for vendors like VMware, our job can’t be just delivering skills and delivering technology. We need to think about our role as enablers in the ecosystem, as folks that are helping not just build up our own expertise in Kubernetes that we can represent to customers; we’re well-served by our customers developing their own expertise. It’s not a threat to us. It actually enables them to consume the technologies that we provide. So focusing on that enablement through us as integration partners and [inaudible] community, focusing on enablement for our customers and education programs and the things that they need to start building out their capacity internally, is going to serve us all well.
[00:27:22] JR: Something going back to maybe the Heptio conversation: I’m super interested in this. Heptio was a very open source-oriented company, and at VMware this is of course true as well. We have to engage with large groups of humans from all different kinds of companies, and we have to do that while building and shipping product to some degree. So where I’m going with this is, I remember back in the Heptio days there was something with dynamic audit logging that we were struggling with, and we needed it for some project we were working on.
But we needed to get consensus and a design approved at a bigger community level, and I do know to some degree that did limit our ability to ship quickly. So you probably know where I’m going with this: when you’re working on projects or products, how do you balance making sure the whole community is coming along with you, but also making sure that you can actually ship something?
[00:28:08] CM: That harkens back to that sort of catchphrase that Tim St. Clair always uses: if you want to go fast, go alone; if you want to go far, go together. I think, as with almost everything in the world, these things are situational, right? There are situations where it is so critical that you bring the community along with you, so that you don’t find yourself carrying the load for something by yourself, that you just have to accept and absorb that it’s going to be pushing string.
Working with an engaged community necessitates consensus, necessitates buy-in, not just from you, but from potentially your competitors, the people that you’re working with, and recognizing that they’ll be doing their own sort of mental calculus around whether this advantages them or not. But hopefully, and I think certainly in the Kubernetes community, there is general recognition that making the underlying technology accessible, making it ubiquitous, making it intrinsically supportable, profits everyone. I think there are a couple of things that I look at. Make the decision pretty early on as to whether this is something you want to kind of spark off and stride off on your own and innovate around, or whether it’s something that’s critical to bring the community along with you on.
I’ll give you two examples of this, right? One example was the work we did around technologies like Velero, which is a backup and restore product. There was an urgent and critical need to provide a sustainable way to back up and recover Kubernetes, so we didn’t have the time to do this through the Kubernetes community. But it also didn’t necessarily matter, because everything we were doing was building an addendum to Kubernetes.
That project created a lot of value, and we donated it as an open source project. Anyone can use it, but we took on the commitment to drive the development ourselves, not just because we needed to, but because we had to push very quickly in that space. Whereas if you look at the work that we’re doing around things like cluster API and the sort of broader provisioning of Kubernetes, it’s so important that the ecosystem avoids the tragedy of the commons around things like lifecycle management.
It’s so important that we as a community converge on a consistent way to reason about the deployment, upgrade, and scaling of Kubernetes clusters. For any single vendor to try to do that by themselves, they’re going to take on the responsibility of dealing with not just one or two environments. If you’re a hyperscale cloud provider, [inaudible 00:30:27] many can do that. But think about doing that for, in our case, not just vSphere, and not just what’s coming next, but also earlier versions of vSphere.
We need to be able to deploy into all of the hyper-scalers. We need to deploy into some of the emerging cloud providers. We need to start reasoning about edge. We need to start thinking about all of these. We’re a big company and we have a lot of engineers. But you’re going to get stretched very thin, very quickly if you try to chew that off by yourself.
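For a rough sense of what the Velero workflow Craig describes looks like in practice, here is a sketch that drives the velero CLI from Go. It assumes velero is installed and already configured against a cluster; the backup name "nightly-default" is hypothetical.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Create a backup of the "default" namespace.
	// "nightly-default" is a made-up backup name.
	backup := exec.Command("velero", "backup", "create", "nightly-default",
		"--include-namespaces", "default")
	if out, err := backup.CombinedOutput(); err != nil {
		log.Fatalf("backup failed: %v\n%s", err, out)
	}

	// Later, restore the cluster state captured by that backup.
	restore := exec.Command("velero", "restore", "create", "--from-backup", "nightly-default")
	if out, err := restore.CombinedOutput(); err != nil {
		log.Fatalf("restore failed: %v\n%s", err, out)
	}
	log.Println("backup and restore submitted")
}
```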
So I think a lot of it is situational. I think there are situations where it does pay for organizations to kind of innovate, charge off in a new direction, run an experiment, see if it sticks, and over time open that up to the community as it makes sense. The thing that I think is most important is that you just wear your heart on your sleeve. The worst thing you can do is to present a charter that says, “Hey, we’re doing this as a community-centric, open project with open design, open community, open source,” and then change your mind later, because that just creates drama.
I think it’s situational. Pick the path that makes sense to the problem at hand. Figure out how long your customer can wait for something. Sometimes you can bring things back to communities; the Kubernetes community is a very open and accepting one. You can look at it as an experiment, and if it makes sense in that experimental form factor, present it back to the Kubernetes community and see if you can get it back in. But in some cases it just makes sense to work within the structure and constraints of the community and just accept that great things from a community angle take a lot of time.
[00:31:51] CC: I think too, one additional thing that I don’t think was mentioned is that if a project grows too big, you can always break it off. I mean, Kubernetes is such a great example of that. Break it off into separate components. Break it off into separate governance groups, and then parts can move with different speeds.
[00:32:09] CM: Yeah, and there are all kinds of options. So the heart of it is there's no one rule, right? It’s entirely situational: what are you trying to accomplish, and on what horizon? And acknowledge and accept that the evolution of the core of Kubernetes is slowing, as it should. That’s a signal that the project is maturing. If you cannot deliver value on a timeline that your business or your customers can absorb, then maybe it makes sense to do something on the outside. Just wear your heart on your sleeve and make sure your customers and your partners know what you’re doing.
[00:32:36] DC: One of your earlier points – I think Josh’s question was around how companies attract talent. I think there’s some relation to this particular topic because, frequently, I’ve seen companies find some success by making room for open source or upstream engineers to focus on the Kubernetes piece and to help drive that adoption internally.
So if you’re going to adopt something like a Kubernetes strategy as part of a larger company goal, if you can actually make room within your organization to bring in – or to support – people who want to focus on that upstream work, I think that you get a lot of ancillary benefits from that, including making it easier to adopt that technology and understand it, and actually having some more skin in the game around where the open source project itself is going.
[00:33:25] CM: Yeah, absolutely. I think one of the lovely things about the Kubernetes community is this idea of your position is earned, not granted, right? The way that you earn influence and leadership and basically the good will of everyone else in that community is by chopping wood, carrying water. Doing the things that are good for the community.
Over time, any organization, any human being can become influential and lead based on the merits of their contributions. It’s important that vendors think about that. But at the same time, I have a hard time taking exception with practically any use of open source. At the end of the day, open source by its nature is a leap of faith. You’re making that technology accessible. If someone else can take it, operationalize it well, and deliver value for organizations, that’s part of your contract.
That’s what you absorb as a vendor when you start the thing. So people shouldn’t feel like they have to. But if you want to influence and lead, you do need to participate in these communities in an open way.
[00:34:22] DC: When you were helping form the CNCF and some of those projects, did you foresee it being like a driving goal for people, not just vendors, but also like consumers of the technologies associated with those foundations?
[00:34:34] CM: Yeah, it was interesting. Starting the CNCF, I can speak from the position of where I was inside Google. I was highly motivated by the success of Kubernetes – not just personally motivated because it was a project that I was working on. I was motivated to see it emerge as a standard for distributed systems development that abstracts away the infrastructure provider. I’m not ashamed of it.
It was entirely self-serving. If you looked at Google’s market position at that time, if you looked at where we were as a hyperscale cloud provider, instituting something that enabled the intrinsic mobility of workloads could shuffle around the cards on the deck, so to speak [inaudible 00:35:09].
I also felt very privileged that that was our position, because we didn’t necessarily have to create artificial structures or constraints around the controls of the system. That process of getting something to become ubiquitous – there’s no natural path if you approach it as a single provider. I’m not saying we couldn’t have succeeded with Kubernetes as a single provider. But if Red Hat and IBM and Microsoft and Amazon had all piled on to something else, it’s less obvious, right? It’s less obvious that Kubernetes would have gone as far as it did.
So as I was setting up the CNCF, I was highly motivated by preserving the neutrality – creating structures that separated the various forms of governance. I always joke that at the time of creating the CNCF, I was motivated by the way the U.S. Constitution is structured, where you have these different checks and balances. So I wanted to have something that would separate vendor interests from the people maintaining taste on the discrete projects.
The sort of architectural integrity – and maintain separation from customer segments, so that you’d create a sort of natural, self-balancing system. It was definitely in my thinking, and I think it worked out pretty well. Certainly not perfect, but it did lead down a path which I think has supported the success of the project a fair bit.
[00:36:26] DC: So we talked a lot about Kubernetes. I’m curious, do you have some thoughts, Carlisia?
[00:36:31] CC: Actually, I know you have a question about microliths. I was very interested in exploring that.
[00:36:37] CM: There’s an interesting pattern that I see out there in the industry, and it manifests in a lot of different ways, right? When you think about the process of bringing applications and workloads into Kubernetes, there’s this sort of predispositional bias toward, “Hey, I’ve got this monolithic application. It’s vertically scaled. I’m having a hard time with the team structure. So I’m going to start chunking it up into a set of microservices that I can then manage discretely and ideally evolve on a separate cadence.” This is an example of a real customer situation where someone said, “Hey, I’ve just broken this monolith down into 27 microservices.”
So I was sort of asking a couple of questions. The first one was, when you have to update those 27 – if you want to update one of those, how many do you have to touch? The answer was 27. I was like, “Ha! You just created a microlith.” It’s like a monolith, except it’s just harder to live with. You’ve taken a packaging problem and turned it into a massively complicated orchestration problem.
I always use that jokingly, but there’s something real there, which is that there are a lot of secondary things you need to think through as you start progressing on this cloud native journey. In the case of microservice development, it’s one thing to have API-separated microservices. That’s easy enough to institute. But instituting the organizational controls around an API versioning strategy – such that you can start to establish stable APIs with consistent schemas and be able to manage the dependencies of consuming teams – requires a level of sophistication that a lot of organizations haven’t necessarily thought through.
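For readers who want to see that versioning discipline in code, here is a minimal sketch – the paths and payload shapes are invented for illustration, not anything from the conversation – of serving two stable API versions side by side so consuming teams can migrate on their own cadence:

```go
// A sketch of serving two API versions side by side: /v1 stays stable for
// existing consumers while /v2 introduces a new schema on its own cadence.
// Paths and payload shapes are invented for illustration.
package main

import (
	"encoding/json"
	"net/http"
)

// UserV1 is the original, stable schema that v1 consumers depend on.
type UserV1 struct {
	Name string `json:"name"`
}

// UserV2 evolves the schema; v1 callers are untouched while teams migrate.
type UserV2 struct {
	First string `json:"first"`
	Last  string `json:"last"`
}

func main() {
	http.HandleFunc("/v1/user", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(UserV1{Name: "Ada Lovelace"})
	})
	http.HandleFunc("/v2/user", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(UserV2{First: "Ada", Last: "Lovelace"})
	})
	http.ListenAndServe(":8080", nil)
}
```

The point of the sketch is organizational, not technical: v1 never changes out from under its consumers, so each team can move to v2 on its own schedule.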
So it’s very easy to just sort of get caught up in the hype without necessarily thinking through what happens downstream. It’s funny – I see the same thing with functions, right? I interact with organizations and they’re like, “Wow! We took this thing that was running in a container and we turned it into 15 different functions.” I’m like, “Ha! Okay.” You start asking questions like, “Well, do you have any challenges with state coherency?” They’re like, “Yeah! It’s funny you say that. Because these things are a little bit less transactionally coherent, we have to write state watchers. So we try and sort of watermark state and watch this thing.”
I’m like, “You’re building a distributed transaction coordinator in your free time. Is this really the best use of your resources?” Right? So it really gets back to the idea that there’s a different tool for a different job. Sometimes the tool is a virtual machine. Sometimes it’s not. Sometimes the tool is a bare metal deployment. If you’re building a quantitative trading application that’s microsecond latency sensitive, you probably don’t want a hypervisor there.
Sometimes a VM is the natural destination and there’s no reason to move from a VM. Sometimes it’s a container. Sometimes you want to start looking at that container and modularizing it so you can run a set of things next to each other in the same process space. Sometimes you’re going to want to put APIs between those things and separate them out into separate containers. There’s an ROI – there’s a cost and there’s a benefit associated with each of those transitions. More importantly, there are a set of skills that you have to have as you start looking at that continuum and making sure that you’re making good choices and being wise about it.
[00:39:36] CC: That is a very good observation. Design is such an important part of software development. I wonder, does Kubernetes help mask these design problems – for example, the ones you are mentioning – or does Kubernetes surface them even more?
[00:39:53] CM: It’s an interesting philosophical question. Kubernetes certainly masks some problems. I ran into an early customer – this is years ago – who confided in me, “I think we’re writing worse code now.” I was like, ”What do you mean?” He was like, “Well, it used to be that when we ran out of memory on something, we got paged. Now it just restarts the container and everything continues.” There’s no real incentive for the engineers to actually go back, deal with the underlying issues, and root-cause them, because the system is just more intrinsically robust and self-healing by nature.
I think there are definitely some problems that Kubernetes will compound. If you’re very sloppy with your dependencies, if you create a really large, vertically scaled monolith that’s running in a VM today, putting it in a container is probably strictly going to make your life worse. Just be respectful of that. But at the same time, I do think the discipline associated with the transition to Kubernetes helps if you walk it a little bit further along. If you start thinking about the fact that you’re not running a lot of imperative processes during a production push, where deploying a container is effectively a binary copy with some minimal post-deployment configuration changes, it sort of leads you onto a much happier path naturally.
I think it can mask some issues, but by and large, the types of systems you end up building are going to be more intrinsically operationally stable and scalable. But it is also worth recognizing that you are going to encounter corner cases. I’ve run into a lot of customers that push the envelope in a direction that was unanticipated by the community, or they accidentally find themselves on new ground that’s just unstable, because the technology is relatively nascent. So if you’re going to walk down a new path, I’m not saying don’t – just recognize that you’re probably going to encounter some stuff that’s going to take a while to work through.
[00:41:41] DC: We did an earlier episode about API contracts, which I think highlights some of this stuff as well, because it gets into some of those sharp edges of why those things are super important when you start thinking about microservices and stuff.
We’re coming to the end of our time, but one of the last questions I want to ask you – we’ve talked a lot about Kubernetes in this episode – I’m curious what the future holds. We see a lot of really interesting things happening in the ecosystem around moving more toward serverless. There are a lot of people thinking that perhaps a better line would be to move away from the infrastructure offering and just basically allow cloud providers to manage your nodes for you. We have a few shots on goal for that ourselves.
It’s been a really interesting evolution over the last year in that space. I’m curious, what sort of lifetime would you ascribe to Kubernetes today? Do you think it’s going to be the thing in 10 years? Do you think it will be a thing in 5 years? What do you see coming that might change it?
[00:42:32] CM: It’s interesting. Well, first of all, I think 2018 was the largest year ever for mainframe sales. Once these technologies are in the enterprise, they tend to be pretty durable. The duty cycle of enterprise software technology is pretty long-lived. The real question is – we’ve seen a lot of technologies in this space emerge, ascend, reach a point of critical mass, and then fade as they’re disrupted by newer technologies. Is Kubernetes going to be a Linux or is Kubernetes going to be a Mesos, right?
I mean, I don’t claim to know the answer. My belief, and I think this is probably true, is that it’s more like a Linux. The heart of what Kubernetes is doing is providing a better way to build and organize distributed systems. I’m sure that the code will evolve rapidly and I’m sure there will be a lot of continued innovation and enhancement.
But when you start thinking about the fact that what Kubernetes has really done is bring controller- and reconciler-based management to distributed systems development everywhere, and that pretty much every system these days is distributed by nature, it really needs something that supports that model. So I think we will see Kubernetes stick. We’ll see it become richer. We’ll start to see it becoming more applicable for a lot of things that today are just running in VMs.
Those things may well continue to run in VMs and just be managed by Kubernetes. I don’t have an opinion about how to reason about the underlying OS and virtualization structure. The thing I do have an opinion about is that it makes a ton of sense to be able to use a declarative framework – to use a set of well-structured controllers and reconcilers to drive your world into a desired state.
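The controller/reconciler pattern he keeps returning to is compact enough to sketch. This is a toy illustration – the types and the scaling action are invented, not the Kubernetes API – of a loop observing actual state and converging it toward the declared desired state:

```go
// A toy reconcile loop: observe actual state, compare it to the declared
// desired state, and take one step toward convergence on each pass.
// Types and the "scaling" action are invented; this is not the Kubernetes API.
package main

import (
	"fmt"
	"time"
)

// State is a stand-in for whatever the system declares and observes.
type State struct{ Replicas int }

func desired() State { return State{Replicas: 3} } // what the user declared

var actual = State{Replicas: 1} // what the world currently looks like

// reconcile nudges the actual state toward the desired state.
func reconcile() {
	want := desired()
	switch {
	case actual.Replicas < want.Replicas:
		actual.Replicas++ // e.g., start one more instance
		fmt.Println("scaled up to", actual.Replicas)
	case actual.Replicas > want.Replicas:
		actual.Replicas-- // e.g., stop one instance
		fmt.Println("scaled down to", actual.Replicas)
	default:
		fmt.Println("in sync")
	}
}

func main() {
	// Real controllers react to watch events; a plain loop shows the idea.
	for i := 0; i < 4; i++ {
		reconcile()
		time.Sleep(10 * time.Millisecond)
	}
}
```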
I think that pattern has been quite successful and can be quite durable. I think we’ll see organizations embrace a lot of these technologies over time. It is possible that something brighter, shinier, newer comes along. Anyone will tell you that we made enough mistakes during the journey, and there’s stuff that I think everyone regrets along the Kubernetes train.
I do think it’s likely to be pretty durable. I don’t think it’s a silver bullet. Nothing is, right? Like any of these technologies, there’s always a cost and there’s a benefit associated with it. The benefits are relatively well understood. But there are going to be different tools to do different jobs. There are going to be new patterns that emerge that simplify things. Is Kubernetes the best framework for running functions? I don’t know. Maybe. Kind of like what the [inaudible] people are doing. But are there more intrinsically optimal ways to do this? Maybe. I don’t know.
[00:45:02] JR: It has been interesting watching Kubernetes itself evolve as a moving target. Some of the other technologies I’ve seen kind of stagnate on their one solution and don’t grow further, but that’s definitely not what I see within this community. It’s always coming up with something new.
Anyway, thank you very much for your time. That was an incredible session.
[00:45:22] CM: Yeah. Thank you. It’s always fun to chat.
[00:45:24] CC: Yeah. We’ll definitely have you back, Craig. Yes, we are coming up on the end, but I do want to ask if you have any thoughts that you haven’t brought up, or that we haven’t brought up, that you’d like to share with the audience of this podcast.
[00:45:39] CM: I guess the one thing that was going through my head earlier that I didn’t say is that, as you look at these technologies, there are sort of these two duty cycles. There’s the hype duty cycle, where a technology ascends in awareness and everyone looks at it as the answer to everything. Then there’s the readiness duty cycle, which is sometimes offset.
I do think we’re certainly at peak hype right now in Kubernetes, if you attended KubeCon. And I do think there’s perhaps a gap between the promise and the reality for a lot of organizations. I’d always just counsel caution: be judicious about how you approach this. It’s a very powerful technology and I see a very bright future for it. Thanks for your time.
[00:46:17] CC: Really, thank you so much. It’s so refreshing to hear from you. You have great thoughts.
With that, thank you very much. We will see you next week.
[00:46:28] JR: Thanks, everybody. See you.
[00:46:29] DC: Cheers, folks.
[END OF INTERVIEW]
[00:46:31] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
In this episode of The Podlets Podcast, we welcome Michael Gasch from VMware to join our discussion on the necessity (or not) of formal education in working in the realm of distributed systems. There is a common belief that studying computer science is a must if you want to enter this field, but today we talk about the various ways in which individuals can teach themselves everything they need to know. What we establish, however, is that you need a good dose of curiosity and craziness to find your feet in this world, and we discuss the many different pathways you can take to fully equip yourself. Long gone are the days when you needed a degree from a prestigious school: we give you our hit-list of top resources that will go a long way in helping you succeed in this industry. Whether you are someone who prefers learning by reading, attending Meetups or listening to podcasts, this episode will provide you with lots of new perspectives on learning about distributed systems.
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Duffie Cooley
Michael Gasch
Key Points From This Episode:
• Introducing our new host, Michael Gasch, and a brief overview of his role at VMware.
• Duffie and Carlisia’s educational backgrounds and the value of hands-on work experience.
• How they first got introduced to distributed systems and the confusion around what it involves.
• Why distributed systems are about more than simply streamlining communication and making things work.
• The importance and benefit of educating oneself on the fundamentals of this topic.
• Our top recommended resources for learning about distributed systems and their concepts.
• The practical downside of not having a formal education in software development.
• The different ways in which people learn, index and approach problem-solving.
• Ensuring that you balance reading with implementation and practical experience.
• Why it’s important to expose yourself to discussions on the topic you want to learn about.
• The value of getting different perspectives around ideas that you think you understand.
• How systems thinking is applicable to things outside of computer science.
• The various factors that influence how we build systems.
Quotes:
“When people are interacting with distributed systems today, or if I were to ask like 50 people what a distributed system is, I would probably get 50 different answers.” — @mauilion [0:14:43]
“Try to expose yourself to the words, because our brains are amazing. Once you get exposure, it’s like your brain works in the background. All of a sudden, you go, ‘Oh, yeah! I know this word.’” — @carlisia [0:14:43]
“If you’re just curious a little bit and maybe a little bit crazy, you can totally get down the rabbit hole in distributed systems and get totally excited about it. There’s no need for having formal education and the degree to enter this world.” — @embano1 [0:44:08]
Learning resources suggested by the hosts:
Book, Designing Data-Intensive Applications, M. Kleppmann
Book, Distributed Systems, M. van Steen and A.S. Tanenbaum (free with registration)
Book, Thesis on Raft, D. Ongaro - Consensus: Bridging Theory and Practice (free PDF)
Book, Enterprise Integration Patterns, B. Woolf, G. Hohpe
Book, Designing Distributed Systems, B. Burns (free with registration)
Video, Distributed Systems
Video, Architecting Distributed Cloud Applications
Video, Distributed Algorithms
Video, Operating System - IIT Lectures
Video, Intro to Database Systems (Fall 2018)
Video, Advanced Database Systems (Spring 2018)
Paper, Time, Clocks, and the Ordering of Events in a Distributed System
Post, Notes on Distributed Systems for Young Bloods
Post, Distributed Systems for Fun and Profit
Post, On Time
Post, Distributed Systems @The Morning Paper
Post, Distributed Systems @Brave New Geek
Post, Aphyr’s Class materials for a distributed systems lecture series
Post, The Log - What every software engineer should know about real-time data’s unifying abstraction
Post, GitHub - awesome-distributed-systems
Post, Your Coffee Shop Doesn’t Use Two-Phase Commit
Podcast, Distributed Systems Engineering with Apache Kafka ft. Jason Gustafson
Podcast, The Systems Bible - The Beginner’s Guide to Systems Large and Small - John Gall
Podcast, Systems Programming - Designing and Developing Distributed Applications - Richard Anthony
Podcast, Distributed Systems - Design Concepts - Sunil Kumar
Links Mentioned in Today’s Episode:
The Podlets on Twitter — https://twitter.com/thepodlets
Michael Gasch on LinkedIn — https://de.linkedin.com/in/michael-gasch-10603298
Michael Gasch on Twitter — https://twitter.com/embano1
Carlisia Campos on LinkedIn — https://www.linkedin.com/in/carlisia
Duffie Cooley on LinkedIn — https://www.linkedin.com/in/mauilion
VMware — https://www.vmware.com/
Kubernetes — https://kubernetes.io/
Linux — https://www.linux.org
Brian Grant on LinkedIn — https://www.linkedin.com/in/bgrant0607
Kafka — https://kafka.apache.org/
Lamport Article — https://lamport.azurewebsites.net/pubs/time-clocks.pdf
Designing Data-Intensive Applications — https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable-ebook/dp/B06XPJML5D
Designing Distributed Systems — https://www.amazon.com/Designing-Distributed-Systems-Patterns-Paradigms/dp/1491983647
Papers We Love Meetup — https://www.meetup.com/papers-we-love/
The Systems Bible — https://www.amazon.com/Systems-Bible-Beginners-Guide-Large/dp/0961825170
Enterprise Integration Patterns — https://www.amazon.com/Enterprise-Integration-Patterns-Designing-Deploying/dp/0321200683
Transcript:
EPISODE 12
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[00:00:41] CC: Hi, everybody. Welcome back. This is Episode 12, and we are going to talk about distributed systems without a degree or even with a degree, because who knows how much we learn in university. I am Carlisia Campos, one of your hosts. Today, I also have Duffie Cooley. Say hi, Duffie.
[00:01:02] DC: Hey, everybody.
[00:01:03] CC: And a new host for you, and this is such a treat. Michael Gasch, please tell us a little bit of your background.
[00:01:11] MG: Hey! Hey, everyone! Thanks, Carlisia. Yes, I’m new to the show. I’ll keep it brief because I think over the course of the show we’ll discuss our backgrounds a little further. Right now, I’m with VMware – I’ve been with VMware for almost five years. Currently, I’m a platform architect in the office of the CTO, and I mainly use Kubernetes on a daily basis from an engineering perspective. We build a lot of prototypes based on customer input or ideas that we have, and we work with different engineering teams.
Kubernetes has become kind of my bread and butter, but lately more from a consumer perspective – developing with Kubernetes or against Kubernetes – instead of the former, where it was mostly about implementing and architecting Kubernetes.
[00:01:55] CC: Nice. Very impressive. Duffie?
[00:01:58] MG: Thank you.
[00:01:59] DC: Yeah.
[00:02:00] CC: Let’s give the audience a little bit of your backgrounds. We’ve done this before but just to frame the episodes, so people will know how we come in as distributed systems.
[00:02:13] DC: Sure. I don’t have a formal education history. I’ve spent most of my time learning hands-on since high school. From there, I basically worked through different systems administration, network administration, and network architect roles, and up into virtualization and now containerization. So I’ve got a pretty hands-on, bootstrapped experience around managing infrastructure, both at small scale, inside of offices, all the way up to very large scale, working for some of the larger companies here in Silicon Valley.
[00:02:46] CC: All right. My turn, I guess. So I do have a computer science degree, but I don’t feel that I went very deep into distributed systems. My degree is also from a long time ago. So mainly, what I know now is almost entirely from hands-on work experience. Even so, I think I’m very much lacking, and I’m very interested in this episode, because we are going to go through some great resources that I am also going to check out later. So let’s get this party started.
[00:03:22] DC: Awesome. So you want to just talk about kind of the general ideas behind distributed systems and like how you became introduced to them or like where you started in that journey?
[00:03:32] CC: Yeah. Let’s do that.
[00:03:35] DC: My first experience with the idea of distributed systems was using them before I knew that they were distributed systems, right? One of the very first distributed systems, as I look back on it, that I ever actually spent any real time with was DNS, which I consider to be something of a distributed system. If you think about it, they have name servers, they have a bunch of caching servers. They solve many of the same sorts of problems.
In a previous episode, we talked about networking – just the general idea of networking and architecting large-scale networks – which also has a lot of analogues to distributed systems. For me, working with them and helping solve the problems associated with them over time gave me a good foundational understanding for when we were doing distributed systems as a thing later on in my career.
[00:04:25] CC: You said something that caught my interest, and it’s very interesting, because obviously for people who have been writing algorithms, writing papers about distributed systems, they’re going to go yawning right now, because I’m going to say the obvious.
As you start your journey programming, you read job requirements: you must – or should – know distributed systems. Then I go, “What is a distributed system? What do they really mean?” Because, yes, we understand apps talk to apps and then there are APIs, but there’s always, for me at least, a question at the back of my head: is that all there is to it? It sounds like it should be a lot more involved and complex and complicated than just having an app talk to another app.
In fact, it is, because there are so many concepts and problems involved in distributed systems, right? Timing, clocks and sequencing, networking, failures and how you recover from them. There is a whole world in how you log properly, how you monitor. There’s a whole world that revolves around this concept of systems residing in different places and [inaudible 00:05:34] each other.
[00:05:37] DC: I think you made a very good point. There’s an analogue to this in containers, oddly enough. When people say, “I want a container,” within the orchestration systems, they think that that’s just a thing they can ask for – that they get a container, and inside of that is going to be their file system, and it’s going to do all those things. In a way, I feel like that same confusion is definitely related to distributed systems.
When people are interacting with distributed systems today – or if I were to ask 50 people what a distributed system is, I would probably get 50 different answers. I think that you got a pretty concise definition there: it is a set of systems that intercommunicate to perform some function. That’s it at its baseline. I feel like that’s a pretty reasonable definition of what distributed systems are, and then we can figure out from there what functions they are trying to achieve and what are some of the problems that we’re trying to solve with them.
[00:06:29] CC: Yeah. That’s what it’s all about in my head: solving the problems, because at the beginning, I was thinking, “Well, it must be just about communicating and making things work.” It’s the opposite of that. That part is a given. When a job says you need to understand distributed systems, what they are really saying is you need to know how to deal with failures, not just make it work. Making it work is sort of the easy part; the whole world of where failures can happen and how you handle them – that, to me, is where needing to know distributed systems comes in handy.
It comes down to a couple of different things: at the top layer, maybe 5% is knowing how to make things work, and 95% is knowing how to handle things when they don’t work, because failure is inevitable.
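One everyday instance of that 95% is simply assuming a remote call can fail and retrying with backoff. A minimal sketch – callService is a stand-in that simulates a transient failure, not a real API:

```go
// A tiny example of the "95%": assume the remote call can fail, and handle
// it with bounded retries and exponential backoff. callService is a
// stand-in that simulates a transient failure; it is not a real API.
package main

import (
	"errors"
	"fmt"
	"time"
)

var attempts int

func callService() error {
	attempts++
	if attempts < 3 {
		return errors.New("connection reset") // transient failure
	}
	return nil
}

// withRetry runs fn up to maxTries times, doubling the wait between tries.
func withRetry(maxTries int, fn func() error) error {
	backoff := 100 * time.Millisecond
	var err error
	for i := 0; i < maxTries; i++ {
		if err = fn(); err == nil {
			return nil
		}
		fmt.Println("attempt", i+1, "failed:", err, "- retrying in", backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

func main() {
	if err := withRetry(5, callService); err != nil {
		fmt.Println("giving up:", err)
	} else {
		fmt.Println("succeeded after", attempts, "attempts")
	}
}
```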
[00:07:19] DC: Yeah, I agree. What do you think, Michael? How would you describe the context around distributed systems? What was the first one that you worked with?
[00:07:27] MG: Exactly. My background is kind of similar to yours, Duffie: no formal degree or education in computer science. Right after high school, I jumped into my first job, working with computers, computer administration.
I must say that from the age of seven or so, I was interested in computers and all that stuff, but more from a hardware perspective, less from a software development perspective. My take was always disassembling the pieces and building my own computers rather than writing programs. In the early days, that just was me.
So I almost completely missed the education in the principles and fundamentals of how you would write a program for a single computer, and obviously also of how to write programs that run across a network of computers. Over time, as I progressed in my career – especially in that first job, which was like seven years of different Linux systems and Linux administration – like you, Duffie, I dealt with distributed systems without necessarily knowing that I was dealing with distributed systems. I knew that it was mostly storage systems, Linux file servers – but distributed file servers. Samba, if some of you recall that project.
So I knew that things could fail. A server could fail, for example, or a share could not be writable, and so a client would be stuck – but I didn’t necessarily relate that directly to the fundamentals of how distributed systems work or don’t work. Over time – and this is really why I appreciate the Kubernetes project and community – I got more questions, especially when this whole container movement came up. I got so many questions about how that thing works. How does scheduling work? Scheduling was close to my interest in hardware design and low-level details. So I was looking at Kubernetes like, “Okay. There is the scheduler.”
In the beginning, the documentation was pretty scarce around the implementation and all the controllers – what’s actually going on. So I listened to a lot of podcasts and Brian Grant’s great talks and different shows that he gave around the Kubernetes space, and other people there as well.
In the end, I had more questions than answers. So I had to dig deeper. Eventually, that led me down a path of wanting to understand the more formal theory behind distributed systems by reading the papers, reading books, and taking some online classes, just to get a basic understanding of those issues. I got interested in resource scheduling in distributed systems and consensus. Those were two areas that caught my eye: what is it, and how do machines agree in a distributed system if so many things can go wrong?
Maybe we can explore this later on, so I’m going to park it for a bit. But back to your question – this was kind of a long-winded road to answering it, Duffie. For me, a distributed system is a coherent network of computer machines that, from the outside – to an end user or to another client – looks like one gigantic machine that is [inaudible 00:10:31] fast and also performing efficiently. It provides a lot of characteristics and properties that we want from our systems that a single machine usually can’t handle. But it looks like one big single machine to a client.
[00:10:46] DC: It is interesting. I don’t want to get too far afield – I guess this is probably not just a distributed systems talk. But obviously, one of the questions that falls out for me when I hear that answer is: what is the difference between a microservice architecture and distributed systems? I mean, to your point, the way that a lot of people learn to develop software, we develop a monolithic application just by nature. We solve a software problem using code.
Then later on, when we decide to actually scale this thing, or to understand how to better operate it under a significant load, we start thinking about, “Okay. Well, how do we have to architect this differently in such a way that it can support that load?”
That’s where I feel like the lines cross, right? We’re suddenly in a world where you’re not only talking about microservices. You’re also talking about distributed systems, because you’re going to start thinking about how to understand transactionality throughout that system, how to understand all of those consensus things that you’re referring to. How do they affect it when I add the network in there? That’s cool.
[00:11:55] MG: Just one comment on this, Duffie, which took me a very long time to realize. It goes back to my definition of what a distributed system is – this group of machines that perform work in a certain sense, or maybe even more abstractly, a bunch of computers networked together.
What I missed most of the time – and this goes back to the DNS example that you gave in the beginning – was that the client, or the clients, are also part of this distributed system, because they might have caches, especially in DNS. So you always deal with state that is distributed everywhere. Maybe you don’t even know where it is distributed, and the client works with local, stale data.
So that is also part of a distributed system – and something I want to give credit to the Kafka community and some of the engineers on Kafka for, because there was a great talk I heard lately. It was like, “Right. The client is also part of your distributed system, even though usually we think it’s just the servers – all those many server machines, all those microservices.” At least I missed that for a long time.
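The stale-client-state idea is easy to picture with a toy TTL cache like the one a DNS client keeps – a hedged sketch with invented names, not a real resolver:

```go
// A toy TTL cache like the one a DNS client keeps. The client answers from
// local state that may already be stale elsewhere, which is exactly why the
// client is part of the distributed system. Not a real resolver.
package main

import (
	"fmt"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

type ttlCache struct{ m map[string]entry }

func (c *ttlCache) put(key, value string, ttl time.Duration) {
	c.m[key] = entry{value, time.Now().Add(ttl)}
}

// get serves from local state; it only consults its own clock and cache.
func (c *ttlCache) get(key string) (string, bool) {
	e, ok := c.m[key]
	if !ok || time.Now().After(e.expires) {
		return "", false // miss: a real client would ask the server again
	}
	return e.value, true
}

func main() {
	c := &ttlCache{m: map[string]entry{}}
	c.put("example.com", "93.184.216.34", 50*time.Millisecond)
	fmt.Println(c.get("example.com")) // answered from local, possibly stale state
	time.Sleep(60 * time.Millisecond)
	fmt.Println(c.get("example.com")) // TTL expired: entry no longer served
}
```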
[00:12:58] DC: You should put a link to that talk in our [inaudible 00:13:00]. That would be awesome. It sounds great. So what do you think, Carlisia?
[00:13:08] CC: Well, one thing that I wanted to mention is that Michael was saying how he’s been self-teaching distributed systems, and I think if we want to be competent in the area, we have to do that. I’m saying this to myself even.
It’s very refreshing when you read a book or a paper and really understand the fundamentals of an aspect of distributed systems. A lot of things fall into place. I’m saying this because even prioritizing reading about and learning the fundamentals is really hard for me, because you have your life. You have things to do. You have the minutiae of things to get done. So many times, I struggle.
On the rare occasions where I go, “Okay, let me just learn this stuff through trial and error,” it makes such a difference. Then once you learn it, it stays with you forever. So it’s really good. It’s so refreshing to read a paper and understand things at a different level, and that is what this episode is about. I don’t know if this is the time to jump into our recommendations. I don’t know how deep, Michael, you’re going to go. You have a ton of things listed. Everything we mention on the show is going to be on our website, in the show notes. So nobody needs to be taking notes.
Another thing I wanted to say is that it would be lovely if people would get back to us once you’ve listened to this. Let us know if you want to add anything to this list. It would be awesome. We can even add it to the list later and give a shout-out to you. So it’d be great.
[00:14:53] MG: Right. I don’t want to cover this whole list. I just wanted to be as complete as possible about the stuff that I’ve read or watched. So I put it all in, and I’ll just pick some highlights, if you want.
[00:15:05] CC: Yeah. Go for it.
[00:15:06] MG: Yeah. Okay. Perfect. Honestly, even though it’s not the first item on the list, the first thing that I read – maybe this reflects my history of how I approach things – came from searching for how computers work, what some of the issues are, and how computers and machines agree. The classic paper that I read was the Lamport paper on “Time, Clocks, and the Ordering of Events in a Distributed System.”
I want to be honest: the first time I read it, I didn’t really get the full essence of the paper. The mathematical proof in there didn’t click for me immediately, and there were so many concepts of physics and time thrown at me that I was left with more questions than answers. But that’s no knock on Leslie – it’s more that, at the time, I just wasn’t prepared for how deep the rabbit hole goes.
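For anyone who wants the core mechanism of that paper in a few lines, a Lamport logical clock can be sketched like this – events ordered by counters rather than wall-clock time (the names are ours, not the paper’s):

```go
// A minimal Lamport logical clock, the core mechanism of the paper:
// events are ordered by counters, not wall-clock time. Names are ours.
package main

import "fmt"

// Clock holds a single monotonically increasing logical timestamp.
type Clock struct{ t uint64 }

// Tick advances the clock for a local event and returns the new time.
func (c *Clock) Tick() uint64 { c.t++; return c.t }

// Send ticks and stamps an outgoing message with the result.
func (c *Clock) Send() uint64 { return c.Tick() }

// Recv merges a received timestamp: take the max of local and remote, then tick.
func (c *Clock) Recv(remote uint64) uint64 {
	if remote > c.t {
		c.t = remote
	}
	return c.Tick()
}

func main() {
	a, b := &Clock{}, &Clock{}
	a.Tick()               // local event on A      -> A = 1
	m := a.Send()          // A sends a message     -> A = 2, stamp = 2
	b.Tick()               // local event on B      -> B = 1
	fmt.Println(b.Recv(m)) // B receives: max(1,2)+1 = 3
}
```

The merge-then-tick rule is what guarantees that if one event causally precedes another, its timestamp is smaller – the paper’s central result.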
So I thought: if someone asked me – if I only had time to read one book out of this huge list and all the other resources, which one would it be? Which one would I recommend? I would recommend Designing Data-Intensive Applications by Martin Kleppmann. I’d been following his blog posts and some partial releases that he did before fully releasing that book, which took him more than four years.
It’s almost the state-of-the-art Bible when it comes to all the concepts in distributed systems – obviously consensus, network failures, and all that stuff, but then also leading into modern data streaming and data platform architectures inspired by, for example, LinkedIn and other communities. So that would be the book I would recommend to someone who only has time to read one book.
[00:16:52] DC: That’s a neat approach. I like the idea of: if you had one thing, one way to help somebody ramp up on distributed systems, what would it be? For me, oddly enough, I don’t think I would recommend a book. I’d probably drive them toward a project, like the Kind [inaudible 00:17:09] project, and say, “This is a distributed system all by itself.” Start tearing it apart to pieces and seeing how they work, breaking them, and exploring and just playing with the parts. You can do a lot of really interesting things.
There’s actually another book in your list, written by Brendan Burns, called Designing Distributed Systems, I think. In that book, he actually uses Kubernetes as a model for how to go about achieving these things, which I think is incredibly valuable, because it really gets into some of the more stable distributed systems patterns that are around.
I feel like that’s a great entry point. So if I had to pick one way to help somebody, or to push somebody in the direction of trying to learn distributed systems, I would say: identify those distributed systems that maybe you’re already aware of and really explore how they work, what the problems with them are, and how they went about solving those problems. Really dig into the idea of it. It’s something you can put your hands on and play with. Kubernetes is a great example of this, and that’s actually why I referred to it.
[00:18:19] CC: The way that works for me when I’m learning something like that is to really think about where the boundaries are, where the limitations are, where the tradeoffs are. If you can take a smaller system, maybe something like the Kind project, and identify what those things are – and if you can’t, then ask around. Ask someone. Google it. I don’t know. Maybe it would be a good episode topic for us to do that: to map things out so we can understand better and help people understand things better.
So, mainly, try to identify what the distributed systems are. But for people who don’t even know what they could be, it’s harder to identify them. I don’t know what a good strategy for that would be, because you can read about distributed systems and then you can go and look at a project. How do you map the concepts you learned to what you’re seeing in the code base? For me, that’s the hardest thing.
[00:19:26] MG: Exactly. Something related that I experienced was when I went into software development without having a formal education in algorithms and data structures. Sometimes, in your head, you have the problem statement and you’re like, “Okay, I would do it like that.” But you don’t know the word that describes, for example, a heap structure or a queue, because no one ever told you that this is a heap, that is a queue, and that is a stack.
So, for me, reading the book was a bit easier. Even though I had done distributed systems administration, if you will, for many years, many years ago, I didn’t realize it was a distributed system, because I never had the definitions, I never had those failure scenarios in mind, and I never had a word for consensus. So how would I search for something like “how do machines agree”? I mean, if you put that into Google, you’ll likely get a lot of stuff. But if you put in “consensus algorithm,” you’ll likely get a good hit on what the answer should be.
[00:20:29] CC: It is really problematic when we don’t know the names of things, because – what you said is so right. We are probably doing a lot of distributed systems work without even knowing that that’s what it is. Then we go into the job interview, and people ask, “Oh! Have you done distributed systems?” No. You have, but you just don’t know how to name things. But that’s one –
[00:20:51] DC: Yeah, exactly.
[00:20:52] CC: Yeah. Right? That’s one issue. Another issue, which is a bigger issue, though – at least that’s how it is for me; I don’t want to speak for anybody else, but for me, definitely – is that if I can’t name things and I face a problem and solve it, every time I face that problem it’s a one-off thing, because I can’t map it to a higher concept.
So every time I face that problem, it’s like, “Oh!” It’s not like, “Oh, yeah – this is this kind of problem; I have a pattern. I’m going to use that on this problem.” That’s what I’m saying: once you learn the concept, you need to be able to name it. Then you can map that concept to the problems you have.
All of a sudden, you have like three things [inaudible 00:21:35] use to solve this problem, because as you work with computers and code, you see the same thing over and over again. But when you don’t understand the fundamentals, things are just a bunch of different one-offs. It’s like when you have an argument with your spouse or girlfriend or boyfriend. Sometimes you’re arguing 10 times in a month and you think, “Oh, I had 10 arguments.” But if you stop and think about it, no – we had one argument 10 times. Having 10 problems is very different from having 1 problem 10 times, if that makes sense.
[00:22:12] MG: It does.
[00:22:11] DC: I think it does, right?
[00:22:12] MG: I just want to agree.
[00:22:16] DC: I think it does make sense. You’ve highlighted an interesting pattern around the way that people learn. Some people are able to read about patterns or software patterns or algorithms or architectures and have that suddenly become an index in their heads. They can then later correlate what they’ve read with the experience that they’re having around the things they’re working on.
For some, it needs to be hands-on. They need to actually be able to explore that idea and understand and manipulate it and be able to describe how it works or functions in person, in reality. They need to have that hands-on like, “I need to touch it to understand it,” kind of experience. Those people also, as they go through those experiences, start building this index of patterns or algorithms in their head. They have this thing that they can correlate to, right, like, “Oh! This is a time problem,” or, “This is a consensus problem,” or what have you, right?
[00:23:19] CC: Exactly.
[00:23:19] DC: You may not know the word for that saying but you're still going to develop a pattern in your mind like the ability to correlate this particular problem with some pattern that you’ve seen before. What's interesting is I feel like people have taken different approaches to building that index, right? For me, it’s been troubleshooting. Somebody gives me a hard problem, and I dig into it and I figure out what the problem is, regardless of whether it's to do with distributed systems or cooking. It could be anything, but I always want to get right in there and figure out what that problem and start building a map in my mind of all of the players that are involved.
For others, I feel like with an educational background, you end up coming to this with a set of patterns already instilled that you understand, and you’re trying to apply those patterns to the experience you’re having instead. It’s like cart before the horse or horse before the cart. It’s very interesting when you think about it.
[00:24:21] CC: Yes.
[00:24:22] MG: The recommendation I want to give to people who are like me, who like reading, is this: I went overboard a bit in the beginning because I was so fascinated by all the stuff, and I went down the rabbit hole deeper, deeper, deeper. Reading and reading and reading. At some point, I even ended up on weird YouTube channels that ask things like, “Is time real, and where does time emerge from?” It even became philosophical, the paths I went down.
Now, the thing is – and this is why I like Duffie’s approach of breaking things, trying to break things and understanding how they work and how they can fail – that you immediately practice. You’re hands-on. So my advice to people who are more like me, fascinated by reading and all the theory, is that your brain and your mind are not really capable of absorbing all the stuff and remembering it without practicing. Practicing can be breaking things or installing things or administrating things or even writing software. For me, that was a late realization – I should have started doing things earlier rather than spending all that time reading.
[00:25:32] CC: By doing, you mean, hands-on?
[00:25:35] MG: Yeah.
[00:25:35] CC: Anything specific that you would have started with?
[00:25:38] MG: Yes. On Kubernetes – well, going back those 15 years to my early days of Linux and Samba, which is a project that, at the time, I think was written in C or C++. The problem was I wasn’t able to read the code. So the only thing I had back then was some mailing lists, and asking questions – not even knowing which questions to ask for lack of the words to understand.
Now, fast-forward to Kubernetes time, which got me deeper into distributed systems. I still couldn’t read the code because I didn’t know [inaudible 00:26:10]. But I forced myself to read the code, which helped a little bit in understanding what was going on, because the documentation back then was lacking. These days, it’s easier, because you can just install [inaudible 00:26:20] – the hands-on piece, I mean, is way easier today.
[00:26:23] CC: You said something interesting, Michael, and I have given this advice before because I use this practice all the time. It’s so important to have a vocabulary. Like you just said, I didn’t know what to ask because I didn’t know the words. I practice this all the time. To people who are in the position of learning distributed systems, or whatever it is, or something more specific you are trying to learn: try to expose yourself to the words, because our brains are amazing. Once you get exposure, it’s like your brain works in the background. All of a sudden, you go, “Oh, yeah! I know this word.”
So podcasts are great for me. If I don’t know something, I will look for a podcast on the subject and start listening to it. The words get repeated in context. I don’t have to go get a degree or anything. Just by listening to the words being spoken in context, you absorb the meaning. Podcasting is great, and so is YouTube or anything else you can listen to – and reading too, of course. The best thing is talking to people. But, again, sometimes it’s not trivial to put yourself in positions where people are discussing these things.
[00:27:38] DC: There are actually a number of Meetups here in the Bay Area, and the whole Meetup thing now seems to span the entire US and around the world. There are Meetups in a number of different subject areas. There’s one here in the Bay Area called Papers We Love, where they actually explore interesting technical papers, which is obviously a great place to learn the words for things, right? This is actually where those words are being defined.
When you get into the consensus stuff, they really get into it – one example is Raft. There are many papers on Raft and many papers on multiple things that get into consensus. So whether you explore a Meetup on distributed systems or on a particular application or a particular theme like Kubernetes, those are great places just to get more exposure to what people are thinking about in these problems.
[00:28:31] CC: That is such a great tip.
[00:28:34] MG: Yeah. Podcasts are twice as good as well for people who are non-native English speakers. The thing is that the word you’re looking for might be totally different from the English word. For example, “consensus” in German is a totally different word. So if I looked that up in German, I’d likely find nothing, or nothing really related at all. So you have to go through translation and then find the stuff.
So what you said, Duffie, with PWL, Papers We Love, or podcasts – those words are often in English: consensus, or sharding, or partitioning. Those are the words that you can at least look up to learn what they mean. That’s what I did as well thus far.
[00:29:16] CC: Yes. I also want to give a plus one for Papers We Love. They are everywhere, and they also have an online version of the Papers We Love Meetup, and a lot of the local chapters film their meetups. So you can go through the history and see if they talked about any paper that you are interested in.
I’m sure multiple locations talk about the same paper, so you can get different takes too. It’s really, really cool. Sometimes, it’s completely obscure: “I didn’t get a word of what they were saying. Not one. What am I doing here?” But sometimes, they talk about things where you at least know what the thing is and you get like 10% of it. And some papers you don’t get at all. For people who deal with papers day in and day out, it’s very different – I don’t know.
[00:30:07] DC: It’s super easy when going through a paper like that to have the imposter syndrome wash over you, right, because you’re like –
[00:30:13] CC: Yes. Thank you. That’s what I wanted to say.
[00:30:15] DC: I feel like I’ve been in this for 20 years. I probably know a few things, right? But I’ll be reading this consensus paper going, “Can I buy a vowel? What is happening?”
[00:30:24] CC: Yeah. Can I buy a vowel? That’s awesome, Duffie.
[00:30:28] DC: But the other piece I want to call out, to your point, which I think is important, is that some people don’t want to go out and be there in person. They don’t feel comfortable or safe exploring those things in person.
So there are tons of resources, like you just pointed out – the online version of Papers We Love, for example. You can also sign into Slack and just interact with people via text, right? There are a lot of really great resources out there for people of all types, including for whatever amount of time you have.
[00:30:53] CC: Going to Papers We Love is like going to language class. If you take a class in Italian, on your first day, even though it’s going to be super basic, you’re going to be like, “What?” You go back in your third week and you start, “Oh! I’m getting this.” Then after a month, three months, “Oh! I’m starting to be competent.”
So you go once. You’re going to feel lost and experience imposter syndrome. But you keep going, because it is a format. First, you absorb what the format is, and that helps you understand the content. Once your mind absorbs the format, you’re like, “Okay. Now I know how to navigate this. I know what’s coming next.” So you don’t have to focus on that; you start focusing on the content. Then little by little, you become more proficient in understanding. Very soon, you’re going to be willing to write a paper. I’m not there yet.
[00:31:51] DC: That’s awesome.
[00:31:52] CC: At least that’s how I think it goes. I don’t know.
[00:31:54] MG: I agree.
[00:31:55] DC: It’s also changed over time. It’s fascinating. If you read papers from 20 years ago and papers written more recently, it’s interesting: the papers have changed their language when considering competition.
When you’re introducing a new idea with a paper, frequently you are introducing it into a market full of competition. You’re being very careful about the language, almost in a way that complicates the idea rather than making it clear, which is challenging. There are definitely some papers I’ve read where I was like, “Why are you using so many words to describe this simple idea?” It makes no sense, but yeah.
[00:32:37] CC: I don’t want to make this episode all about Papers We Love – it was just so good that you mentioned it, Duffie. It’s really good to be in a room, or to be watching something online, where you see people asking questions: “Oh! Why is this thing like this? Why is X like this?” or, “Why is Y doing like this?” Then you go, “Oh! I didn’t even think X was important. I didn’t even know Y was important.”
So you start picking up what the important things are, and that’s what makes it click: now you’re hooking into the important concepts, because people who know more than you are pointing things out and asking questions. You start learning what the main things to pay attention to are, which is different from reading the paper by yourself, where it’s just a ton of content that you need to sort through.
[00:33:34] DC: Yeah. I frequently describe myself as a perspective junkie, because I feel like for any of us to really learn more about a subject that we feel we understand, we need the perspective of others to expand our understanding of that thing. I feel like I know how to make a peanut butter and jelly sandwich. I’ve done it a million times. It’s a solid thing. But then I watch my kid do it and I’m like, “I hadn’t thought of that problem.” [inaudible 00:33:59], right? This is a great example of that.
Communities like Papers We Love are a great opportunity to understand the perspective of others around these hard ideas. When we’re trying to understand complex things like distributed systems, this is where it’s at. This is actually how we go about achieving this. There is a lot that you can do on your own, but there is always going to be more that you can do together, right? You can always understand an idea faster. You can understand the complexity of a system and how to break it down by exploring it with other people. That’s, I feel like –
[00:34:40] CC: That is so well said, so well said, and it’s the reason for this show to exist, right? We come on a show and we give our perspectives, and people get to learn from people with different backgrounds, what their takes are on distributed systems, cloud native. So this was such a major plug for the show. Keep coming back. You’re going to learn a ton.
Also, it was funny that that was the second time you made a cooking reference, Duffie, which brings me to something I want to make sure I say on this episode. I added a few things for reference, three books, but the one that I definitely recommend starting with is The Systems Bible by John Gall. This book is so cool, because it helps you see everything through systems. Everything is a system. A conversation can be a system. An interaction between two people can be a system. I’m not saying the book says that; it’s just my translation, that you can look –
Cooking is a system. There is a process. There is a sequence. It’s really, really cool, and it really helps to have things framed in this way before you go out and read the other books on systems. I think it helps a lot. This is definitely what I am starting with and what I would recommend people start with: The Systems Bible. Did you two know this book?
[00:36:15] MG: I did not. I don’t.
[00:36:17] DC: I’m not aware of it either, but I really appreciate the idea. I do think that’s true. If you develop a skill for understanding systems as they are, what you’re frequently developing is the ability to recognize patterns, right?
[00:36:32] CC: Exactly.
[00:36:32] DC: You could recognize those patterns on anything.
[00:36:37] MG: Yeah. That’s a good segue for something that just came to my mind. Recently, I gave a talk on event-driven architectures, and for someone who’s not a software developer or architect, it can be really hard to grasp all those concepts: asynchrony, eventual consistency, idempotency. There are so many words where you’re like, “What is this all about? It sounds weird, way too complex.” But I was reading a book some years ago by Gregor Hohpe, the guy behind Enterprise Integration Patterns; that’s also a book that I have on my list here. He said, “Your barista doesn’t use two-phase commit.” He was basically making this analogy: he was in a coffee shop, looking at the process of how the barista makes the coffee, how you pay for it, and all the things that can go wrong while your coffee is brewed and served to you.
So he was drawing this relation between the real world, human society, and computer systems. That’s when it clicked for me: so many problems we solve every day, for example agreeing on a time we should meet for dinner, are consensus problems, and we solve them.

We even solve them in the case of failure. I might not be able to call Duffie, because he is not available right now, but somehow we figure it out. I always thought those problems existed only in computer science and distributed systems, but actually, that’s just a subset of the real world as it is. Look at those problems through the lens of your daily life, from the moment you get up, and there are so many things that relate to computer systems.
[00:38:13] CC: Michael, I missed it. Was it an article you read?
[00:38:16] MG: Yes. I need to put that in there as well. Yeah. It’s a plug.
[00:38:19] CC: Please put that in there. Absolutely. I am far from being any kind of expert in distributed systems, but I have noticed that I have caught myself using systems thinking even for complicated conversations. Even in my personal life, I started approaching things in a systems-oriented way. Just to give a high-level example:
When I am working with systems, I can approach them from the beginning or from the end. It’s like putting a puzzle together, right? Sometimes it starts from the middle. Sometimes it starts from the edges. Say I’m having a conversation where I need to be very strategic, where I have one shot. Maybe I’m in a school meeting and I have to reach a consensus, or arrive at a solution, or a plan of action. I have to ask the right questions.
My prior self would do things linearly: “Let’s go from the beginning and work through to the end.” Now, I don’t do that anymore, not necessarily. Sometimes I’m like, “Let me ask the last question I would normally ask, see where it leads, and approach things from a different direction.” I don’t know if this is making sense.

[00:39:31] MG: It does. It does.
[00:39:32] CC: But my thinking has changed. The way I see the possibilities is not a linear thing anymore. I see how you can truly switch things around. I use this in programming a lot, and also in writing. When you’re a beginner writer, you start at the top and work down to the conclusion. Sometimes I start in the middle and go up, right? You can start anywhere. It’s beautiful; it just gives you so many more options. Or maybe I’m just crazy. Don’t listen to me.
[00:40:03] DC: I don’t think you’re crazy. I was going to say, one of the funny things about Michael’s point and your point both is that, in a way, they both kind of refer to Conway’s law: the idea that people will build systems in the way that they communicate. So this actually brings it back to that same point, right? We by nature will build systems that we can understand, because that is the constraint within which we have to work. So it’s very interesting.
[00:40:29] CC: Yeah. But it’s an interesting thing, because we are [inaudible 00:40:32] by the way we are forced to work. For example, I work with constraints and what I'm saying is that that has been influencing my way of thinking.
So, yes, I build systems in the way I think, but also, because of the constraints that I’m dealing with and the tradeoffs I need to make, that turns around and influences the way I think, the way I see the world and the rest of the systems. Of course, as I change my thinking, you can theorize that you go back and apply that: apply things that you learn outside of your work back to your work. It’s a beautiful back-and-forth, I think.
[00:41:17] MG: I had the same experience when I had to design my first API and think, “Okay. What would the consumer contract be, and what would a consumer expect me to deliver in response?” I was forcing myself to be explicit and precise in communicating, not throwing everything back at the client and confusing them. In everyday communication too, when you talk to people, being explicit and precise really helps to avoid a lot of problems and trouble, be it in a partnership, or amongst friends, or at work.
That’s what I took from computer science back into my real world: perceiving things from a different perspective and being more precise and explicit in how I respond and communicate.
[00:42:07] CC: My take on what you just said, Michael, is that we design systems thinking about how they are going to fail. We know they are going to fail. We design for that. We implement for that.
In real life, for example, if I need to get an agreement from someone, I try to understand the person’s thinking. I just had this huge thing this week, so it’s on my mind; I’m not constantly thinking about this. I’m not crazy like that. Just a little bit crazy. It’s like, “How does this person think? What do they need to know? How far can I push?” Right? We need to make a decision quickly, so the approach is everything, and sometimes you only get one shot. I mean, correct me if I’m wrong. That’s how I heard or interpreted what you just said.
[00:42:52] MG: Yeah, absolutely. Spot on. Spot on. So I’m not crazy as well.
[00:42:55] CC: Basically, I think we ended up turning this episode into a little bit of like, “Here are great references,” and also a huge endorsement for really going deep into distributed systems, because it’s going to be good for your jobs. It’s going to be good for your life. It’s going to be good for your health. We are crazy.
[00:43:17] DC: I’m definitely crazy. You guys might be. I’m not. All right. So we started this episode with the idea of coming to learning distributed systems, perhaps without a degree or without a formal education in it. We talked about a range of different ideas on that subject, like the different approaches that each of us took and how each of us sees the problem. Is there any important point that either of you wants to throw back into the mix here or bring up in relation to that?
[00:43:48] MG: Well, what I take from this episode, being my first episode and getting to know your backgrounds, Duffie and Carlisia, is that whoever is going to listen to this episode, whatever background you have, even if you’re not in computer systems or the industry at all, I think the three of us have proved that if you’re just a bit curious and maybe a little bit crazy, you can totally go down the rabbit hole of distributed systems and get totally excited about it. There’s no need for a formal education or a degree to enter this world. It might help, but it’s not the high bar that I perceived it to be 10 years ago, for example.
[00:44:28] CC: Yeah. That’s a good point. My takeaway is it always puzzled me how some people are so good and experienced and such experts in distributed systems. I always look at myself. It’s like, “How am I lacking?” It’s like, “What memo did I miss? What class did I miss? What project did I not work on to get the experience?” What I’m seeing is you just need to put yourself in that place. You need to do the work. But the good news is achieving competency in distributed systems is doable.
[00:45:02] DC: My takeaway is as we discussed before, I think that there is no one thing that comprises a distributed system. It is a number of things, right, and basically a number of behaviors or patterns that we see that comprise what a distributed system is.
So when I hear people say, “I’m not an expert in distributed systems,” I think, “Well, perhaps you are and maybe you don’t know it already.” Maybe there's some particular set of patterns with which you are incredibly familiar. Like you understand DNS better than the other 20 people in the room. That exposes you to a set of patterns that certainly give you the capability of saying that you are an expert in that particular set of patterns.
So I think that, to both of your points, you can enter this space of learning about distributed systems from pretty much any direction. You can learn it from a CIS background. You can come at it with no computer experience whatsoever, and it will obviously take a bit more work. But this is really just about developing an understanding of how these things communicate and the patterns with which they accomplish that communication. I think that’s the important part.
[00:46:19] CC: All right, everybody. Thank you, Michael Gasch, for being with us now. I hope to –
[00:46:25] MG: Thank you.
[00:46:25] CC: To see you in more episodes [inaudible 00:46:27]. Thank you, Duffie.
[00:46:30] DC: My pleasure.
[00:46:31] CC: Again, I’m Carlisia Campos. With us was Duffie Cooley and Michael Gasch. This was episode 12, and I hope to see you next time. Bye.
[00:46:41] DC: Bye.
[00:46:41] MG: Goodbye.
[END OF EPISODE]
[00:46:43] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
A warm welcome to John Harris, who will be joining us for his first time on the show today to discuss our exciting topic, CI and CD in cloud native! CI and CD are two terms that usually get spoken about together but are actually two entirely different things if you think about them. We begin by getting into exactly what these differences are, highlighting the regulatory aspects of CD in contrast to the future-focused nature of CI. We then move on to a deep exploration of their benefits in optimizing processes in the cloud native space through automation and surveillance from development to production environments. You’ll hear about the benefits of automatic building in container orchestration, the value of make files and local test commands, and the evolution of CI from its ‘rubber chicken’ days with Martin Fowler and Jez Humble. We take a deep dive into the many ways that containers differ from regular binaries as far as deployment methods, build speed, automation, run targets, real-time reflection of changes, and regulation. Moreover, we talk about the challenges of transitioning between testing and production environments, getting past human error through automation, and using sealed secrets to manage clusters. We also discuss the benefits and drawbacks of different CI tools such as Kubebuilder, Argo, Jenkins X, and Tekton. Our conversation gets wrapped up by looking at some of the exciting developments on the horizon of CI and CD, so make sure to tune in!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Bryan Liles
Nicholas Lane
Key Points From This Episode:
• The difference between CI and CD.
• Understanding the meaning of CD: ‘continuous delivery’ and ‘continuous deployment’.
• Building an artifact that can be deployed in the future is termed ‘continuous integration’.
• The benefits of continuous integration for container orchestration: automatic building.
• What to do before starting a project regarding make files and local test commands.
• Kubebuilder is a tool that scaffolds out the creation of controllers and web hooks.
• Where CI has got to as far as location since its ‘rubber chicken’ co-located days.
• The prescience of Martin Fowler and Jez Humble regarding continuous integration.
• The value of running tests in a CI process for quality maintenance purposes.
• What makes containers great as far as architecture, output, deployment, and speed.
• The benefits of CD regarding deployment automation, reflection, and regulation.
• Transitioning between testing and production environments using targets, clusters, pipelines.
• Getting past human error through automation via continuous deployment.
• What containers mean for the traditional idea of environments.
• How labeling factors into the simplicity of transitioning from development to production.
• What GitOps means for keeping track of changes in environments using tags.
• How sealed secrets stop the need to change an app when managing clusters.
• The tools around CD and what a good CD system should look like.
• Using Argo and Spinnaker to take better advantage of hardware.
• How JenkinsX helps mediate YAML when installing into clusters.
• Why the customizable nature of CI tools can be seen as negative.
• The benefits of using cloud native-built tools like Tekton.
• Perspectives on what is missing in the cloud native space.
• A definition of blue-green deployments and how they operate in service meshes.
• The business abstraction elements of CI tools that are lacking.
• Testing and data storage-related aspects of CI/CD that need to be developed.
Quotes:
“With the advent of containers, now it’s as simple as identifying the images you want and basically running that image in that environment.” — @bryanl [0:18:32]
“The whole goal whenever you’re thinking about continuous delivery or continuous deployment is that any human intervention on the actual moving of code is a liability and is going to break.” — @bryanl [0:21:27]
“Any time you’re in developer tooling, everyone wants to do something slightly differently. All of these tools are so tweak-able that they become so general.” — @johnharris85 [0:34:23]
Links Mentioned in Today’s Episode:
John Harris — https://www.linkedin.com/in/johnharris85/
Jenkins — https://jenkins.io/
CircleCI — https://circleci.com/
Drone — https://drone.io/
Travis — https://travis-ci.org/
GitLab — https://about.gitlab.com/
Docker — https://www.docker.com/
Go — https://golang.org/
Rust — https://www.rust-lang.org/
Kubebuilder — https://github.com/kubernetes-sigs/kubebuilder
Martin Fowler — https://martinfowler.com/
Jez Humble — https://continuousdelivery.com/about/
David Farley — https://dfarley.com/index.html
AMD — https://www.amd.com/en
Intel — https://www.intel.com/content/www/us/en/homepage.html
Windows — https://www.microsoft.com/en-za/windows
Linux — https://www.linux.org/
Intel 386 — http://www.computinghistory.org.uk/det/6192/Introduction-of-Intel-386/
386SX — https://www.computerworld.com/article/2475341/flashback--remembering-the-386sx.html
386DX — https://en.wikipedia.org/wiki/Intel_80386
Pentium — https://www.intel.com/content/www/us/en/products/processors/pentium.html
AMD64 — https://www.webopedia.com/TERM/A/AMD64.html
ARM — https://en.wikipedia.org/wiki/ARM_architecture
Tomcat — http://tomcat.apache.org/
Netflix — https://www.netflix.com/za/
GitOps — https://www.weave.works/technologies/gitops/
Weave — https://www.weave.works/
Argo — https://www.intuit.com/blog/technology/introducing-argo-flux/
Spinnaker — https://www.spinnaker.io/
Google X — https://x.company/
Jenkins X — https://jenkins.io/projects/jenkins-x/
YAML — https://yaml.org/
Tekton — https://github.com/tekton
Concourse CI — https://concourse-ci.org/
Transcript:
EPISODE 11
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically-minded decision maker, this podcast is for you.
[EPISODE]
[00:00:41] BL: Back to the Kubelets Podcast, episode 11. I’m Bryan Liles, and today we have Nicholas Lane.
[00:00:50] NL: Hello!
[00:00:51] BL: And joining us for the first time, we have John Harris.
[00:00:55] JH: Hey everyone. How is it going?
[00:00:56] BL: All right! So today we’re going to talk about CI and CD in cloud native. I want to start this off with this whole term CI and CD. We talk about them together, that are two different things almost entirely if you think about them. But CI stands for continuous integration, and then we have CD. What does CD stand for?
[00:01:19] NL: Compact disk.
[00:01:20] BL: Right. True, and actually I’ve used that term before. I actually do agree. But what else does CD stand for?
[00:01:28] NL: It’s continuous deployment right?
[00:01:30] BL: Yeah, and?
[00:01:31] JH: Continuous delivery.
[00:01:32] NL: Oh! I forgot about that one.
[00:01:35] BL: Yeah, that’s the interesting thing, is that as we talk about tech and we give things acronyms, CD is just a great one. Change in directories, compact disk, continuous delivery and continuous deployment. Here’s the bonus question, does anyone here know the difference between continuous delivery and continuous deployment?
[00:01:58] NL: Now that’s interesting.
[00:01:59] JH: I would go ahead and say continuous delivery is the ability to move changes through the pipeline while you still have the ability to do human intervention at any stage. Usually, deploying to production under continuous delivery would be a business decision, whereas continuous deployment has no gating and everything goes straight to production.
[00:02:18] BL: Oh, John! Gold star for you, because that is one of the common ones. I just like to bring that up, because we always talk about CI and CD as if they are just one thing, but they’re actually way bigger topics, and we’ve already introduced three things here. Let’s start at the beginning and talk about continuous integration, a.k.a. CI.
I’ll start off. We have CI, and what is the goal of CI? I think that we always get bogged down with tech terms and all this technology and all these packages from all these companies. But I’d like to boil CI down to one simple thing: the process of continuous integration is to build an artifact that can be deployed somewhere, at some future date, at some future time, by some future person or process. Everything else is a detail of the system you choose to use, whether you use Jenkins, or CircleCI, or Drone, or you built your own thing, or you’re using Travis, or any of the other online CI tools. At the end of the day, you’re building something. If you’re doing web development in cloud native, maybe you’re building Docker files, I mean, Docker images. But if you’re not, maybe you’re just building JARs, WARs, or EARs, or a ZIP file, or a binary, or something. I’d just like to start this off there. Any more thoughts on continuous integration?
[00:03:48] NL: Yeah. I think the only times that I’ve ever used something like continuous integration are when I’ve been doing container orchestration development, things on top of Kubernetes, for instance. The thing I really like about it is the concept of being able to, from my computer, save and push to a repo and have all of the pieces get built for me automatically somewhere else. I just love that so much, because it saves so much brain thinky juice compared to running every command to make the binary you need.
[00:04:28] BL: So did you actually create those scripts yourself?
[00:04:30] NL: Some of them. When I’ve used things like GitLab, I used the pipeline that exists there and just fiddled around with a little bit of code, like some Bash, but not too much, because GitLab has a pretty robust pipeline. With Travis, I don’t think I needed to, actually. Travis had pretty good Go, make, and Docker build scripts already templated out for you.
[00:04:53] BL: Yeah. I’d like to tell people that whenever you start any project, whether it’s big or small, and especially if it’s not on Windows (I’ll tell you something different if it’s on Windows), if you’re developing on a Mac or developing on Linux, the first thing you should do in your project is create a make file, or your programming language’s equivalent of a make file. Then in that make file, write a command that builds your software and runs its tests locally, whatever the process is.
I mean, if you’re running in Go, you do a go build. If you’re using Rust, or C++, or whatever, do the same before you even write any code. The reason why is because the hardest part is making your code build, and if you leave that to the end, you’re actually making it harder on yourself. If your code builds from the beginning, all you have to do is change it to fit what you’re doing rather than thinking about it at crunch time.
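(A minimal sketch of those day-one commands for a hypothetical Go project; the module path, binary name, and image name are placeholders, and each line would normally sit behind a make target like test or build:)

```bash
#!/usr/bin/env bash
# Day-one commands for a new Go project. In practice each of these would
# be a make target (make test, make build, make image) so the same entry
# points work locally and in CI.
set -euo pipefail

go vet ./...                                    # basic static checks
go test ./...                                   # run the test suite locally
CGO_ENABLED=0 go build -o bin/app ./cmd/app     # build a static binary
docker build -t registry.example.com/app:dev .  # package it as an image
```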
[00:05:57] NL: I actually ran into that exact scenario recently, because I’ve been building some tooling around some Kubernetes stuff, and the first one I did, I built all manually by hand. At the end, I gave it to the person who wanted it and they were like, “So, where’s the make file?” I’m like, “Where’s the what?” So I had to go in and fill in the make file, and that was a huge pain in the butt.
Then recently the other thing I’ve been using is Kubebuilder. John, you and I have been talking about Kubebuilder quite a bit. One of the things it does for you is scaffold out a make file for you, and going from doing it all by myself to having it already exist at the beginning was so much better. I totally agree with you, Brian.
[00:06:42] BL: So quick point of order here, for those of us who don’t know: what is Kubebuilder?
[00:06:48] NL: Kubebuilder is a tool that was created by members of the Kubernetes community to scaffold out the creation of controllers and webhooks. A controller in Kubernetes is a piece of software that watches a specific object, or many specific objects, and reconciles them. If it notices that something has changed and you want to take an action based on that change, the controller does that for you.
[00:07:17] JH: Okay. So it actually makes the act of working with CRDs in Kubernetes much easier than creating it all yourself.
[00:07:26] NL: Correct. Yeah. For instance, the one that I made for myself was a tool that watched and updated a specific CRD, but it wasn’t necessarily a controller. It was just flagging whether or not a change occurred, and I used the dynamic client, and that was a huge headache in and of itself.
Kubebuilder has the ability to watch not just CRDs, but any object in Kubernetes, and then reconcile based on changes.
[00:07:53] NL: It’s pretty great.
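(For reference, the scaffolding Nicholas mentions comes from a couple of Kubebuilder commands. This is a sketch: the domain, repo path, and kind are made-up names, and flag spellings follow recent Kubebuilder releases:)

```bash
# Scaffold a new project: directory layout, Makefile, Dockerfile,
# and the boilerplate for a controller manager.
kubebuilder init --domain example.com --repo example.com/widget-operator

# Scaffold a CRD type plus the controller that watches and reconciles it.
kubebuilder create api --group apps --version v1alpha1 --kind Widget
```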
[00:07:54] BL: All right. So back to CI. John, do you have any opinions on CI or anecdotes or anything like that?
[00:07:59] JH: Yeah. I think one of the interesting things about the original philosophy of CI, outside of tooling, was trunk-based development: every developer’s changes get integrated into trunk as soon as possible, so you don’t get into integration hell and rebasing. I guess it’s kind of interesting when you apply that to the cloud native landscape, because when that stuff came out from Martin Fowler or Jez Humble, probably 10, 15 years ago almost now, a lot of dev teams were co-located. You could do CI. I think there was a rubber chicken method where you didn’t use a tool; whoever had the chicken was responsible for the build and pulled in everyone else’s changes.
But now it seems like everything is branch-based. When you look at a project like Kubernetes, there’s a huge number of contributors all geographically displaced, different time zones, lots of different branches and features going on at the same time. It’s interesting how these original principles of continuous integration from the beginning now apply to these huge projects in the cloud native landscape.
[00:08:56] BL: Yeah, that’s actually a great point about how prescient Martin Fowler has been for many, many years, and Jez Humble too, being able to see these problems 10, 15 years ago and describe them. I believe Jez Humble wrote the CD book, the continuous delivery book.
[00:09:15] JH: Yeah, with David Farley, I think.
[00:09:18] BL: Yeah. Yeah, he did. So, John, you brought up some good things about CI. I try to simplify everything. I think the mark of someone who really knows what they’re talking about is being able to explain everything in the simplest words possible, and then you can work backwards once people understand.
I started off by saying that CI produces an artifact. I didn’t talk about branches or anything like that, or even the integration piece. But now let’s go into that a little bit. There are a lot of misconceptions about CI in general, and one of the things that comes up is whether you have to run tests. No, you don’t have to run tests, but should you? Yes, 100% of the time. Your CI process, your integration process, should actually build your software and run the tests, because running the tests on a dedicated service or hardware, wherever it is, ensures that the quality of your software is there, at least as much as your developers have ensured quality in the tests.
It’s very important those run, and of course a lot of bugs can be caught by running CI. I mean, we are all sorts of developers here, and I tell you what, sometimes I forget to run the tests locally, and CI catches me before a commit with a huge typo or a whole bunch of print lines in it makes it into master.
Moving on here, thinking about CI and cloud native: whenever you’re creating a cloud native app, have you ever thought about the differences between creating, let’s say, a regular binary that runs on a server versus one that runs in a container on somebody’s cloud native stack, i.e. Kubernetes? Have you ever thought about the differences there?
[00:11:04] NL: Yeah. So part of it, I would imagine, is things like resources, like what resources you need or what architecture you’re deploying onto. You need the binary to run there. With containerization, it’s easy, because you’re like, “I know that the container is going to be this architecture,” but you can’t necessarily guarantee that outside of a containerized world. I mean, I suppose you can; with the right tooling set up you can say, “I only want to run on this.” But that isn’t necessarily guaranteed, because any computer it runs on could be whatever architecture it happens to land on, right?
Also, something I think of is: how do you start processes on disparate computers in a controlled fashion? Again, with containers, you can trust that the container runtime will run it for you. But without that, it seems like a much harder task.
[00:12:01] BL: Yeah, I would agree. And I’d say that containers in general just help us out, because most of our workloads go on some AMD or Intel 64-bit chip running Linux. We know what our output is going to be. So it’s not like the old days where you had to actually figure out what your run target was, even on Intel stacks. I mean, I’m dating myself here: when the 386 was out, you had the 386SX and the 386DX, there were different things there, and you actually compiled your code differently. Then the 486 came out, and then with the introduction of Pentium chips, things were different again.
But now we can pretty much all target AMD64. In some cases there are chip-specific things, like the newer encryption instructions that are in the newer chips. But for the most part, we know what our deployment target is going to be.
But the cool thing is also that we don’t have to have Intel or AMD64. It could be ARM32 or ARM64, and with a lot of the work that has been going on in Windows land lately, we can have Windows images. I don’t know that many people are doing that yet; I’m not out in the field. But I like that the opportunity is there.
[00:13:25] JH: Oh! I think one of the interesting things is the deployment method as well. Now with containers, everything is kind of an immutable rip-and-replace. If we deploy an application, we know that the old container is going to stop when we deploy a new one. I think Netflix and some other folks were doing a little bit of this before containers by baking AMIs and using that immutable method. But before that, if we had a WAR file, we had to throw it back into Tomcat and let Tomcat pick it up, or whatever. Everything was a little bit more flaky in terms of deployment. We had to do a lot of checks around deployment rather than just bringing something out and bringing something back in, blue/green, whatever.
[00:13:59] BL: Well, I actually like that you brought that up, because that’s actually one of the greatest parts of this whole cloud native thing: when we’re using containers and we’re deploying with containers, we know what our file system is going to look like, because we created it. There would not be some rogue file or another configuration there that would trip up our deployment, because at build time we created the environment. It’s much better than that AMI-baking facility Netflix had.
In a previous life, I actually ran the facility for baking AMIs at a large company where we had thousands of developers on more than a thousand dev teams, and we had a lot of software. Whenever you had to build an image, it was fine in one account. But if you had, let’s say, a thousand accounts, the way that AWS works with encrypted images, you actually had to copy all the images to all the accounts; you couldn’t just boot an image from another account. That process would literally take all night to get done across all of our accounts.
If you made a mistake, guess what? You got to do it again. So I am glad that we actually have this thing called a container, and all these things based on CRI, the Container Runtime Interface, so that we are able to quickly build containers.
I don’t want to just limit this conversation to continuous integration. Let’s get into the other parts too with deployment and delivery. What is so novel about CD and the cloud native world?
[00:15:35] NL: I think to me it’s the ability to have your code, or your artifact, or whatever you’re working on, and when you make a change, you can see the change reflected in reality, whatever your reality looks like, without your intervention. I mean, you might have had to set up all the pipelines and all that jargon, but when you press save in VS Code and it creates a branch, runs all your tests, and then deploys it for you, or delivers it for you, into what you’ve defined as reality, that’s just so nice. It really kind of sucks having to do the, “Okay, I’ve got a new deployment. Destroy the old deployment. Put in the new one. Rev the new image tag in the deployment.” All these manual steps, again, thinky-brain juice: they take pieces of your attention away, and having those pieces handled for you is just so nice.

[00:16:30] BL: Yeah. What do you think, John?
[00:16:32] JH: Yeah. I think the State of DevOps research found that one of the best predictors of a company’s success is the cycle time of a feature from ideation to production. The faster we can get that cycle, the better. It kind of gets me interested: how long does an application take to build? If it takes two hours, how good are you at getting features out there quickly? Maybe that’s one of the drivers of microservices: with smaller pieces independently deployed, we can get features out to production quicker. I think the name of the game is enabling developers and putting the decision in the hands of the business to decide when the customer should see a feature. The tighter we can make that cycle, the better for everyone.
[00:17:14] BL: Oh, no! I agree. I love and hate microservices, but what I do like is the idea of making these abstractions smaller, and if the abstractions are smaller, it’s less code. A lot of the languages we use now are faster to compile than, let’s say, a large C++ project, which could literally take two hours to compile. Now we have languages like Go, and Rust, which is not as fast to compile but not slow either. Then we have all of our interpreted languages, whether it be Python, or JavaScript, or TypeScript, where we can go from an idea, run the tests in a few minutes, and build an image that we can run and see almost in real time.
Now, with the features that are built into the tools, we can easily manage multiple deployment environments. Think about before: you would have a dev environment, and that would be the Wild West. It would literally be awful. You might have to rebuild it every couple of months. Then you would have staging, and then maybe some kind of pre-prod environment as your final smoke test, and then you would have your production.
Maintaining all the software on all of those was extremely hard. But with the advent of containers, it’s now as simple as identifying the images you want and running that image in that environment. I like where we’ve ended up. But with all power comes new problems, and just because we can deploy quicker means we run into a lot of different problems we didn’t run into before.
The first one that I’ll bring up is the complexity of moving between environments, so moving code between test, staging, and production. How do we do that? Any ideas before I throw some out there?
[00:19:11] NL: I guess you would have different, or maybe the same, pipelines with different targets. If, say, you’re using something like Kubernetes, you could have one part of your pipeline deploy initially to one Kubernetes context, which points to one cluster. You’re building up clusters by environment type, deploying into them, running your tests to see if everything works properly, and then switching over to the next context to apply that image tag and that information, and you just go down the chain until you get to production.
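(Sketched as commands, that promotion flow might look like this. The context names, deployment, container name, and image are hypothetical; the point is that the same immutable image moves from target to target:)

```bash
IMG=registry.example.com/app:2019-08-13-01   # built once by CI

# Deploy to the dev cluster and wait for the rollout to settle.
kubectl --context dev set image deployment/app app="$IMG"
kubectl --context dev rollout status deployment/app

# Same artifact, next targets: promote to staging, then production.
kubectl --context staging set image deployment/app app="$IMG"
kubectl --context staging rollout status deployment/app

kubectl --context prod set image deployment/app app="$IMG"
kubectl --context prod rollout status deployment/app
```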
[00:19:44] BL: Well, that’s interesting. One thing I’d like to throw out there, and I’m not advocating any particular product, is that the idea of having pipelines for continuous integration and your CD process is great, because you can have gates and basically automate the whole thing. Code goes into CI and we build an artifact, and a message can go out automatically to an approver, or not, and that message could say, “Hey! This code is going to be integrated into our trunk or our master branch.” They can either do it themselves manually, as a lot of people do, or they can click on a link or check a checkbox and the code gets integrated.
Then what automatically could happen at this point is, and I’ve seen a lot of companies doing this, is now we take that software and we spin up a new whole environment and we just install that software. For that one particular feature that you worked on, you can actually get an automatic environment for that.
Then what we can do is we can take that environment itself and we can now merge this maybe into a staging branch or tag it with a staging label, and that automatically gets moved to staging. Depending on how complicated you are, how advanced you are, now you can actually have it go out to your product people or people who make decisions, maybe your executives, and they can view the software in whatever context it happens to be in. Then they can say, “Okay.”
That’s when we can hit okay and the software just keeps moving through the pipeline and gets into production. The whole goal here, and this is actually where your goal should be in general whenever you’re thinking about continuous delivery or continuous deployment, is that any human intervention on the actual moving of code is a liability and is going to break, and it’s going to break because on Friday afternoon at 5:25 PM, someone’s thinking about the weekend and not about code, and they’re going to break your build. Our goal is to build delivery systems that are Friday-afternoon proof. We can push code anytime. It doesn’t matter. We trust our process.
[00:22:03] JH: I think it’s a great point about environments. I think back in the day, an environment used to be a set of machines, and then test used to be – staging was where there were kind of more stable versions of APIs and folks were more coordinated pushing things into them. What really is an environment? Like you said, when we push micro services or whatever service, we can spin up an entire Kubernetes cluster just for that service. We can set it up. We can run whatever tests we want. We could tear it down.
With the advent of Elastic compute, and now containers, they really enabled this world where like the traditional idea of an environment and what constitutes an environment is starting to get a bit kind of sloppy and blend into each other.
[00:22:42] BL: I like it though. I think it’s progress.
[00:22:45] NL: I totally agree. The one that scares me, but that I also find really interesting, is the idea of having all of your environments in one set of machines, one cluster: a multi-tenanted set of machines for dev, staging, and production, all running in the same place and separated only by configuration, networking, and things like that.
When a user hits your website, bryanliles.com, they should go to the production images, but those are binaries, and those binaries should be running in the same space essentially as the development ones. It’s scary, but it’s also like allows for like some really fast testing and integration. I find it to be very fascinating.
[00:23:33] BL: I mean, that’s where we want to be. I find more often than not that people have separate clusters for dev and staging and production. But using the Kubernetes API, you don’t have to do that, because what we can do is force a deployment or workload onto a set of machines based on their labels. That’s actually one of the very strong positives of Kubernetes. Forget all the complexity.
One of the things that makes it easy is to say that I want this particular deployment to only live on my development machines. Which development machines? I don’t care. What if we increase our development pool size? We just re-label nodes. It doesn’t matter. We can just control that. When it comes down to controlling cost and complexity, this is actually one area where Kubernetes is leading, just making it easier to use more of your hardware.
[00:24:31] NL: Yeah. Absolutely. That’s so great because if you think about it from a CI/CD standpoint, at that point all you have to do is just change the label to where you’re applying this piece of code. So you’re like, “Node selector, label equals dev. Okay, now it’s staging. Okay, now it’s prod.”
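(A minimal sketch of that label-based carving of a shared cluster. The node names, labels, and deployment are made up:)

```bash
# Carve a shared cluster into pools by label; re-label nodes to resize a pool.
kubectl label node worker-1 worker-2 env=dev
kubectl label node worker-3 env=prod

# Pin a workload to the dev pool with a nodeSelector.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-dev
spec:
  replicas: 1
  selector:
    matchLabels: {app: app, env: dev}
  template:
    metadata:
      labels: {app: app, env: dev}
    spec:
      nodeSelector:
        env: dev                        # only schedules onto env=dev nodes
      containers:
        - name: app
          image: registry.example.com/app:dev
EOF
```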
[00:24:47] BL: So this brings me into the next part of what I want to talk about or introduce to you all today. We’re on a journey as you probably can tell. Now whenever we have our CI process and we’re building and we’re deploying, where do we store our configurations?
[00:25:04] NL: [inaudible 00:25:04].
[00:25:06] BL: Ever thought about that?
[00:25:08] NL: Okay. I mean, in a Kubernetes perspective, you might be using something like etcd to sort of – But like everything else, what if you’re using Travis? [inaudible 00:25:16] store everything. Everything should be versioned, right? Everything should be –
[00:25:20] BL: Yeah, 100%.
[00:25:24] NL: I would store everything that way as much as possible. Now, do I do that all the time? God, no! Absolutely not. I’m a human being after all.
[00:25:32] BL: I mean, that’s what I actually want to bring up: this concept of GitOps. GitOps is a term coined by my friend Alexis, who works at Weave. I think Weave created this. Really, what it’s about is this: Kubernetes is declarative, and our configurations can be declarative too. What we can do is make sure we have text-based configurations, and one reason is that text-based means they can be versioned and diffed. We take those text versions and we put them in the same repository we put our code in. How do we know what’s in production at any given time, or at any given time in the past? We just look at the tags of what we did.
We had a push at 5:15 on August 13th. Of course, that’s 5:15 UTC, because any other time doesn’t exist in computer land. So what we could do is basically tag that particular version with the date, like 2019-08-13, plus the time and a serial like 01, just so we could have a hundred deploys in a day. If we started doing that, now not only can we control what we have, but we can also know what was in any given environment at any given time.
Because with Git and with Mercurial and any of these others (well, only the popular ones; with Git and Mercurial you can definitely do this), any given commit can have multiple tags. You could have a tag for when it hit dev, then a tag for when it hit staging, and then a tag for when it hit production: the exact same code with three different tags. So you know at any given time what happened.
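(As a sketch, that multi-tag flow is just plain Git. The commit hash and tag names here are illustrative:)

```bash
# The same commit accumulates one tag per environment it reaches.
git tag -a dev-2019-08-13-01     -m "deployed to dev"     abc1234
git tag -a staging-2019-08-13-02 -m "promoted to staging" abc1234
git tag -a prod-2019-08-13-03    -m "promoted to prod"    abc1234
git push origin --tags

# What has ever shipped to prod from this commit? Ask the tags.
git tag --list 'prod-*' --contains abc1234
```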
[00:27:18] JH: Yeah, the config thing is so important. I think that was another Jez Humble quote where it was like, “Give me three hours access to your code and I’ll break it. But give me 5 minutes with your configurations and I’ll break it.”
Almost every big bug is like that, right? Someone accidentally pointed the prod server at the staging database, “Oops!” Or their API was pointing at the wrong port and everything came down, or we changed the wrong versions, or whatever.
I think that’s one of the intersections of developers and operations folks; we kind of talked about DevOps and things like that. I really love the idea of everything being kept in Git and using GitOps, but then we’ve got things like secrets and configuration that shouldn’t be seen or edited by developers but need to be for ops folks. We still want to keep the single point of truth, and things like sealed secrets have really enabled us to move along in this area, where we can keep everything text-based and versioned.
[00:28:08] BL: All right. Quick point of order here. Sealed Secrets is a controller/CRD created by Bitnami. What it allows you to do is, John –
[00:28:23] JH: It creates a CRD, SealedSecret, which is a special resource type in your cluster, and it also creates a key which is only available to the operator running in your cluster. You can submit a secret in plain text and it will throw it back out as an encrypted sealed secret using that key, and then you can check that into version control. When you go to deploy your software, you deploy that encrypted secret into the cluster. The operator picks it up, decrypts it using the key that only it has access to, and puts it back in the cluster as a regular secret. Your application just interacts with regular Kubernetes secrets. You don’t need to change your app; the operator deals with all the encryption outside of user intervention.
[00:29:03] BL: I think the most important part of what you said is that this allows us to have no excuses about what we can store in our repositories for our configuration, because someone is going to make the argument, “No, we can’t store secrets, because someone’s going to be able to see them.” Well, guess what? We never even stored an unencrypted secret in our repository. They’re all encrypted, and they’re still secret. It’s [inaudible 00:29:25]. I don’t know if anyone’s cracked it yet. I’m sure maybe a state-level actor has thought of it. But for us regular people, and even our companies, even at VMware, even at Google, they have not done it yet. So it’s still pretty safe.
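(The workflow John describes, sketched as commands. The secret name and value are placeholders; flag spellings follow recent kubectl and the Bitnami kubeseal CLI:)

```bash
# Generate a plain Secret locally, but never commit it; pipe it straight
# through kubeseal, which encrypts it with the controller's public key.
kubectl create secret generic db-creds \
  --from-literal=password=s3cr3t \
  --dry-run=client -o yaml \
| kubeseal -o yaml > sealed-db-creds.yaml

# The SealedSecret is safe to check into Git. Applying it lets the
# in-cluster controller decrypt it back into a regular Secret.
kubectl apply -f sealed-db-creds.yaml
```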
Thinking even further now, and really what I’m trying to paint the picture of is not just how do you do CD, but really what CD could look like and how it can actually make you happy rather than sad.
The next item I wanted to think about was the tools around CD and what a good continuous delivery system looks like. I kind of hinted at this earlier when I was talking about pipelines: the ability to take advantage of your hardware. Say we’re deploying to 100 servers, pushing 5 or 6 services to a 100-node cluster. We can do those all at once, and you want to have a system that can actually run like that. I can think of a couple.
From Intuit, there is Argo, and they have Argo CD. Then there is the tool created by Google and maybe Netflix. I’d have to look that one up. It’s funny, because they quoted –
[00:30:40] JH: Spinnaker?
[00:30:42] BL: Spinnaker. They quoted me in their book, and I don’t remember the authors’ names. I’m sorry to anyone from the Spinnaker product listening. Once again, not advocating any particular products, but they have the concept of doing pipelines. Then you also have other things for your projects, like, if you’re using open source, Drone. Another ex-Googler, I think it was an ex-Googler, made that. Basically, these all have ways you can do more than one thing at a time.
The most important piece is not only that you can do more than one thing at a time, but that you have a programmatic check to verify that whatever you did was successful. Say we deployed to staging, or we deployed to our smoke test servers for our smoke test, and that requires our testing people and an executive signoff. The pipeline can wait until it gets the signoff, and maybe if it goes over a day or so it just fails, and the build is done. That part is pretty neat. Any other topics over here before I start throwing out more?
[00:31:45] NL: I think I just have thoughts on some of the tools that we’ve used. Everyone knows Jenkins. Jenkins can do anything that you want it to do, but you really have to tighten the screws on it. It is super powerful. It’s kind of like Bash scripting: super powerful, but you have to know precisely what you’re doing, otherwise it can really hurt you.
Actually, I have used Spinnaker in the past, and I’ve really liked it. It has a good UI, very good pipelines. Easy blue/green or canary deployment mechanism, I thought that was great. I’ve looked at Drone, believe it or not, but Drone is actually pretty cool. Check out Drone. I really liked it.
[00:32:25] BL: Well, since we’re throwing out products, Jenkins does have Jenkins X. I have not given it the full rundown yet. But what I do like about it, and I think everyone should pay attention to this if you’re doing a product in this space, is that when you install Jenkins X, you install it locally on your machine. You basically get a binary called jx, and you then tell jx to install into your cluster. Instead of just doing kubectl apply -f on a whole bunch of YAML, it actually asks you questions, and it sets up GitHub repositories, or wherever you need those repositories. It sets up [inaudible 00:33:01] spaces for you.
There’s no just [inaudible 00:33:05] kubectl apply -f https://… and I just owned your system, because that’s actually a problem. Then it solves the YAML sprawl, because YAML in Kubernetes is something that is complained about a lot, but it’s how things are configured. It’s also just a detail of what we’re supposed to be doing. We actually work with Joe Beda, and I talk about this all the time: the YAML is the implementation, but it’s not the idea. The idea is that we build tools on top of that which create YAML, so users have to see less YAML. I think that’s a problem with Jenkins: it’s so powerful, and they’re like, “Well, we want powerful people or smart people to be able to do smart things. So here you go.”
The problem with that is: where do I start? It’s a little daunting. So I do think that they definitely came with a much stronger game with this jx command. As a little sidebar, we do it as well with our Velero project, and I think that should be the bar for anything: if you’re installing something into a cluster, you should come up with a command line tool that helps you manage the lifecycle of whatever you’re installing, whether it’s an operator, YAML, whatever.
[00:34:18] JH: I think what’s interesting about the options is that this is definitely one area with so much nuance. Any time you’re in developer tooling, everyone wants to do something slightly differently. All of these tools are so tweak-able that they become very general. I think that’s one of the criticisms that could be leveled against Jenkins: you can do everything, and that’s actually a negative as well as a positive. Sometimes it’s too overwhelming. There are too many ways of doing things. I’m a fan of some of the more opinionated tools in that space.
[00:34:45] BL: Yeah. I like opinionated tools as well, but the problem that we’re having in this cloud native space is that, yeah, Kubernetes is five years old now. We are just getting to the point where we actually understand what a good decision is, because there were a lot of guesses before, and we’ve done a lot of things. Some of them have been good ideas, but in some cases they have not been great ideas.
Even I ran one of those projects, ksonnet. Great idea on paper, but in implementation it required people to know too many things. We learned a lot of lessons from that, and that’s what I think we’re going to find in this space: we’re going to learn little lessons. The last project I was going to bring up is something that I think has learned some of those lessons.
Google sponsors a project called Tekton. If you go look it up, they have some continuous delivery stuff in there and they implement pipelines. But the neat part, and this is actually the best part, is that it’s a cloud native-built service. Every step of your delivery process, from creating images to actually putting them on clusters, is backed by a Docker image, a container, and I think that part is pretty neat. So now you can define your steps.
What is your step? Well, you can use one of their pre-baked steps to run a command, or if you need something special, like the example I was giving before where you need an approval, maybe it’s a Slack approval: you send something to Slack and it has a checkbox, check yes if you like me. We can control that, and it’s easy to write a little Docker image that can make that call, receive the response, and then move the pipeline on.
If you’re looking for a toolkit full of good ideas, I do think that Tekton definitely has a lot of industry attention. People are looking at it, and it’s probably the best example of getting it right in the cloud native way, because a lot of the products we have now are not cloud native. We’re talking about Jenkins. We’re talking about Spinnaker, and we talk about Drone and Travis, which is totally a SaaS product. They’re not cloud native.
Actually, the neat part about Tekton is that it actually comes with its own controllers and its own CRDs. So you can actually build these things up using your familiar Kubernetes tooling, which means in theory we could actually use the tooling that we are deploying. We can actually control it in the same way as our applications, because it’s just yet another object that goes in our cluster.
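(For a flavor of what “every step is backed by a container” means, here is a sketch of a Tekton Task. The apiVersion has changed across Tekton releases, and the step details, images, and names are made up:)

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: test-and-approve
spec:
  steps:
    # Each step is just a container image plus a command.
    - name: run-tests
      image: golang:1.13
      command: ["go", "test", "./..."]
    # A custom step: your own little image that posts a Slack approval
    # and blocks until someone checks the box.
    - name: slack-approval
      image: registry.example.com/slack-approve:latest
EOF
```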
[00:37:21] NL: That does sound pretty cool. One other that I meant to bring up was Concourse. Have you checked out Concourse yet?
[00:37:27] BL: Concourse CI. I have not, really. I have used it, but never in a way where I would have a big opinion on it.
[00:37:34] NL: I’m kind of in the same place. I think it’s a good idea. It seems really neat, but I need to kick the tires a little more. I will say that I really like the UI. The structure of the UI is really nice. Everything makes sense, and anything you can click on drills into something a bit deeper. I think that’s pretty cool. It’s just one of the shout-outs I wanted to give, another tool that I’m aware of.
[00:37:52] BL: Yeah, that’s pretty interesting. So we’ve gone about 40 minutes now. Let’s actually start winding this down, and the way that I’m going to suggest that we wind this down is thinking about where we are now. What’s missing in this space and what else could we actually be doing in the cloud native space to make this work out better?
[00:38:12] NL: I think I’d like to see better structured or better examples of blue-green or canary deployments with tests associated, and that might just be like me not looking hard enough at this problem. But anytime I began looking at blue-green, I get the idea of what someone’s done, but I would love to see some implementation details, or any of these opinionated tools having opinions around blue-green and what they specifically do to test it. I feel like I’m just not seeing that.
[00:38:41] BL: With blue-green, blue-green is hard to do in Kubernetes without an external tool. For everyone: a blue-green deployment is, I have a software deployment and we’ll give it a color. We’ll call it blue. And I have the next version, and we’ll call it green. Really what I can do is have two versions of my application deployed, and I can use my load balancer, or in this case my service, to just change the label in the selector, and now my service points at green instead of blue. Then when I want to deploy again, I can just deploy another blue and change my label selector again.
The problem with this is that while you can do it in Kubernetes just fine, out of the box you will drop traffic, because guess what? What happens to a connection or a session that was initiated on the blue deployment when you went to green? Actually, this is a whole conversation in itself about service meshes, and it’s one of the reasons service mesh is a big topic, because a mesh can help you do blue-green properly. Another example would be Netflix and red/black, or you get the creative people who do rainbow deployments, because just having two colors is not good enough for them, and they want any number of deployments going at one time. I agree with that 100%.
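(A minimal sketch of the selector flip Bryan describes. The service and label names are hypothetical, and note his caveat: a bare flip like this can drop in-flight connections without something like a service mesh in front:)

```bash
# Two deployments (color=blue, color=green) sit behind one service.
# The service's selector decides which color receives traffic.
kubectl get service app -o jsonpath='{.spec.selector.color}'   # -> blue

# Cut over to green, which is already deployed and warmed up.
kubectl patch service app \
  -p '{"spec":{"selector":{"app":"app","color":"green"}}}'

# Rolling back is the same operation in reverse.
kubectl patch service app \
  -p '{"spec":{"selector":{"app":"app","color":"blue"}}}'
```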
[00:39:57] JH: I think, yeah, integrating tools like Launch– [inaudible 00:40:01], and I think there are more which enable this. I think we’re missing the business abstractions on this stuff so far. Like you said, it’s kind of hard to do if you need to get into the gritty of it right now. But think about the business abstraction of: if we deploy a different version to a certain subset of customers, can we get all of those metrics? Can we get those traces back in? Can we automate the rollout? Can we increase the percentage of customers that are seeing those things? Have that all controlled in a Kubernetes-native way, but have it roll up to a business-level abstraction. I think that stuff is currently missing. I think the underpinning technologies are coming up, stuff like service mesh, but I think it’s the abstraction that’s really going to make it useful, which doesn’t exist today.
[00:40:39] BL: Yeah. Actually, that’s pretty close to what I was going to say. We’ve built all this tooling that helps us as technologists, but really what it comes down to is the business. A lot of the things we’re talking about when we talk about CD are important to the business, but when we’re talking about metrics or trace collection, that’s not important to the business, because they only care about the SLA. This is on the SLO side.
What we really need to do is mature our processes enough that we can actually marry our outputs to something that other people can understand, with no jargon: sales going up, sales going down. Everything else is just a detail.
So, anything else?
[00:41:20] NL: Something I think I’d like to see is, in our testing, a good way to accurately show the effect of something under load as a CI/CD component. Because one of the things that I’ve run into is, I’ve got this great idea for how this code should work, and when I deploy it, it works great. Then a thousand people touch it all at once and it doesn’t work right anymore. I’d love to have some tool along the way that can test things under load and show me something that I could fix before all those people touch it.
[00:41:57] BL: Yes, that would be a good tool to have. So John, anything else for you?
[00:42:02] JH: I’ll open a can of worms right at the end and say the biggest problem here is probably going to be data, when we have a lot of systems that need to talk to each other and we need the data to align between those systems, and we now have a proliferation of environments and clusters. How do we get that data reliably into the place it needs to be, to make our testing robust enough to get things out there? It’s probably an episode on its own –
[00:42:23] BL: Yeah, that’s a big conversation, and if we could answer it, we wouldn’t be working at VMware. We would have our own companies doing all these great things. But we can definitely iterate on it. So with that, I think we’re going to wrap it up. Thanks for listening to The Podlets. I’m Bryan Liles, and with me today were Nicholas Lane and John Harris.
[00:42:47] JH: Thanks everyone.
[00:42:47] BL: All right, we’ll see you next time.
[END OF EPISODE]
[00:42:50] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
See omnystudio.com/listener for privacy information.
-
Security is inherently dichotomous because it involves hardening an application to protect it from external threats, while at the same time ensuring agility and the ability to iterate as fast as possible. This in-built tension is the major focal point of today’s show, where we talk about all things security. From our discussion, we discover that there are several reasons for this tension. The overarching problem with security is that the starting point is often rules and parameters, rather than understanding what the system is used for. This results in security being heavily constraining. For this to change, a culture shift is necessary, where security people and developers come around the same table and define what optimizing to each of them means. This, however, is much easier said than done as security is usually only brought in at the later stages of development. We also discuss why the problem of security needs to be reframed, the importance of defining what normal functionality is and issues around response and detection, along with many other security insights. The intersection of cloud native and security is an interesting one, so tune in today!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Duffie Cooley
Bryan Liles
Nicholas Lane
Key Points From This Episode:
• Often application and program security constrain optimum functionality.
• Generally, when security is talked about, it relates to the symptoms, not the root problem.
• Developers have not adapted internal interfaces to security.
• Look at what a framework or tool might be used for and then make constraints from there.
• The three frameworks people point to when talking about security: FISMA, NIST, and CIS.
• Trying to abide by all of the parameters is impossible.
• It is important to define what normal access is to understand what constraints look like.
• Why it is useful to use auditing logs in pre-production.
• There needs to be a discussion between developers and security people.
• How security with Kubernetes and other cloud native programs works.
• There has been some growth in securing secrets in Kubernetes over the past year.
• Blast radius – why understanding the extent of security malfunction effect is important.
• Chaos engineering is a useful framework for understanding vulnerability.
• Reaching across the table – why open conversations are the best solution to the dichotomy.
• Security and developers need to have the same goals and jargon from the outset.
• The current model only brings security in at the end stages of development.
• There needs to be a place to learn what normal functionality looks like outside of production.
• How Google manages to run everything in production.
• It is difficult to come up with security solutions for differing contexts.
• Why people want service meshes.
Quotes:
“You’re not able to actually make use of the platform as it was designed to be made use of, when those constraints are too tight.” — @mauilion [0:02:21]
“The reason that people are scared of security is because security is opaque and security is opaque because a lot of people like to keep it opaque but it doesn’t have to be that way.” — @bryanl [0:04:15]
“Defining what that normal access looks like is critical to our ability to constrain it.” — @mauilion [0:08:21]
“Understanding all the avenues that you could be impacted is a daunting task.” — @apinick [0:18:44]
“There has to be a place where you can go play and learn what normal is and then you can move into a world in which you can actually enforce what that normal looks like with reasonable constraints.” — @mauilion [0:33:04]
“You don’t learn to ride a motorcycle on the street. You’d learn to ride a motorcycle on the dirt.” — @apinick [0:33:57]
Links Mentioned in Today’s Episode:
AWS — https://aws.amazon.com/
Kubernetes — https://kubernetes.io/
IAM — https://aws.amazon.com/iam/
Securing a Cluster — https://kubernetes.io/docs/tasks/administer-cluster/securing-a-cluster/
TGI Kubernetes 065 — https://www.youtube.com/watch?v=0uy2V2kYl4U&list=PL7bmigfV0EqQzxcNpmcdTJ9eFRPBe-iZa&index=33&t=0s
TGI Kubernetes 066 — https://www.youtube.com/watch?v=C-vRlW7VYio&list=PL7bmigfV0EqQzxcNpmcdTJ9eFRPBe-iZa&index=32&t=0s
Bitnami — https://bitnami.com/
Target — https://www.target.com/
Netflix — https://www.netflix.com/
HashiCorp — https://www.hashicorp.com/
Aqua Sec — https://www.aquasec.com/
CyberArk — https://www.cyberark.com/
Jeff Bezos — https://www.forbes.com/profile/jeff-bezos/#4c3104291b23
Istio — https://istio.io/
Linkerd — https://linkerd.io/
Transcript:
EPISODE 10
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores cloud native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.2] NL: Hello and welcome back to The Podlets Podcast. My name is Nicholas Lane, and this time we’re going to be talking about the dichotomy of security. And to talk about such an interesting topic, joining me are Duffie Cooley.
[0:00:54.3] DC: Hey, everybody.
[0:00:55.6] NL: Bryan Liles.
[0:00:57.0] BM: Hello.
[0:00:57.5] NL: And Carlisia Campos.
[0:00:59.4] CC: Glad to be here.
[0:01:00.8] NL: So, how’s it going everybody?
[0:01:01.8] DC: Great.
[0:01:03.2] NL: Yeah, this I think is an interesting topic. Duffie, you introduced us to this topic. Basically, from what I understand, we’re calling it the dichotomy of security because it’s the relationship between security, hardening your application to protect it from attack and influence from outside actors, and agility, the ability to create something useful and iterate on it as fast as possible.
[0:01:30.2] DC: Exactly. I mean, the idea for this came from putting together a talk for the security conference coming up here in a couple of weeks. And I was noticing that, obviously, if you look at the job of somebody who is trying to provide some security for applications on their particular platform, whether that be AWS or GCE or OpenStack or Kubernetes or any of these things.
It’s frequently in their domain to kind of define constraints for all of the applications that would be deployed there, right? Such that you can provide rational defaults for things, right? Maybe you want to make sure that things can’t do a particular action because you don’t want to allow that for any application within your platform or you want to provide some constraint around quota or all of these things. And some of those constraints make total sense and some of them I think actually do impact your ability to design the systems or to consume that platform directly, right?
You’re not able to actually make use of the platform as it was designed to be made use of, when those constraints are too tight.
[0:02:27.1] DC: Yeah. I totally agree. There’s kind of a joke that we have in certain tech fields, which is that the primary responsibility of security is to halt productivity. It isn’t actually true, right? But there are tradeoffs, right? If security is too tight, you can’t move forward. An example that comes to mind is firewall rules so tight that you can’t actually use anything of value.
That’s a quick example of security gone haywire. That’s too controlling, I think.
[0:02:58.2] BM: Actually, this is an interesting topic in general, but I think that before we fall prey to what everyone does when they talk about security, let’s take a step back and understand why things are the way they are. Because all we’re talking about are the symptoms of what’s going on, and I’ll give you one quick example of why I say this. Things are the way they are because we haven’t made them any better.
In developer land, whenever we consume external resources, what we’re supposed to do, and what we should be doing but don’t do, is create our own internal interfaces, program only to those interfaces, and then let that interface adapt to or talk to the external service. In the security world, we should be doing the same thing, and we don’t do this.
My canonical example for this is IAM on AWS. It’s hard to create a secure IAM configuration, it’s even harder to keep it over time, and it’s even harder to do whenever you have 150, 100, 5,000 people dealing with this. What companies do is create interfaces where they can describe the part of IAM they want to use, and then they translate that over.
The reason I bring this up is because the reason that people are scared of security is because security is opaque and security is opaque because a lot of people like to keep it opaque. But it doesn’t have to be that way.
[0:04:24.3] NL: That’s a good point. That’s a reasonable design, and wherever I see it adopted, it actually is very helpful, right? Because you highlight a critical point: these constraints have to be understood by the people who are constrained by them, right? Otherwise it will just continue to drive that wedge between the people who are responsible for defining them and the people who are being affected by them. That transparency, I think, is definitely key.
[0:04:48.0] BM: Right, this is our cloud native discussion, any idea of where we should start thinking about this in cloud native land?
[0:04:56.0] DC: For my part, I think it’s important to understand if you can like what the consumer of a particular framework or tool might need, right? And then, just take it from there and figure out what rational constraints are. Rather than the opposite which is frequently where people go and evaluate a set of rules as defined by some particular, some third-part company. Like you look at CIS packs and you look at like a lot of these other tooling.
I feel like a lot of people look at those as hard rules: we must comply with all of these things. Legally, in some cases, that’s the case. But frequently, I think they’re just casting about for some semblance of a way to start defining constraints, and they go too far. They’re no longer taking into account what the consumers of that particular platform might need, right?
Kubernetes is a great example of this. If you look at the CIS spec for Kubernetes, or if you look at a lot of the talks that I’ve seen around how to secure Kubernetes, we define best practices for security, and a lot of them are incredibly restrictive, right? I think the problem there is that restriction comes at the cost of agility. You’re no longer able to use Kubernetes as a platform for developing microservices, because you’ve provided so many constraints that it breaks the model, you know?
[0:06:12.4] NL: Okay. Let’s break this down again. I can think of, off the top of my head, three things people point to when thinking about security. And spoiler alert, I am going to use some acronyms, but don’t worry about what the acronyms mean, just understand they are security things. The first one I’ll bring up is FISMA, then I’ll think about NIST, and the next one is CIS, like you brought up.
Really, the reason they’re so prevalent is because, depending on where you are, whether you’re in a highly regulated place like a bank, or you’re working for the government, or you have some kind of audit concern, say HIPAA or something like that, these are the words that the auditors will use with you. There is good in those, even though people don’t like the CIS benchmarks, because sometimes we don’t understand why they’re there.
But for someone who is starting from nothing, they’re actually great; they’re at least a great set of suggestions. The problem is you have to understand that they’re only suggestions, and they may be trying to get you to a better place than you actually need. The other side of this is that we should never start with NIST or CIS or FISMA. What we really should do is have our CISO, our Chief Security Officer, or whoever the person in charge of security is.
Or even just the people who are in charge of making sure our stack is secure. They should be taking what they know, including those standards, and building up this security posture, this security document, these rules that are built to protect whatever we’re trying to do.
And then the developers or whoever else can operate within that, rather than taking everything literally.
[0:07:46.4] DC: Yeah, agreed. Another thing I’ve spent some time talking to people about is when they start rationalizing how to implement these things, or even just think about the attack surface, or develop a threat model, or any of those things, right? One of the things that I think is important is the ability to define what normal looks like, right?
What normal access between applications, or normal access of resources, looks like. And your point earlier, about providing some abstraction in front of a secure resource so that you can share that same abstraction across all the things that might try to consume that external resource, is a great example of this.
Defining what that normal access looks like is critical to our ability to constrain it, right? I think that frequently people don’t start there. They start with the other side, saying: here are all the constraints, you need to tell me which ones are too tight. You need to tell me which ones to loosen up so that you can do your job. You need to tell me which application needs access to whichever application so that I can open the firewall for you.
I’m like, we need to turn that on its head. We need environments that are perhaps less secure, so that we can actually define what normal looks like, and then take that definition and move it into a more secure state, perhaps by defining these across different environments, right?
[0:08:58.1] BM: A good example of that would be in larger organizations. Not every part of the organization does this, but there are environments running your application where there are really no rules applied. What we do with that is turn on auditing in those environments. So you have two applications, or a single application that talks to something, and you let that application run, and then after the application runs, you go take a look at the audit logs, and you determine at that point what a good profile of this application is.
Whenever it’s in production, you set up the security parameters, whether it be identity access or network, based on what you saw in the audit logs in your preproduction environment. That’s all it should be allowed to do, because we tested it fully in our preproduction environment; it should not do any more than that. And that’s actually something – I’ve seen tools that will do this for AWS IAM. I’m sure you can do it for anything else that creates audit logs. That’s a good way to get started.
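As a rough illustration of the audit-then-constrain approach described here, this is a minimal Kubernetes audit policy, passed to the API server via --audit-policy-file. The choice of rules is illustrative:

```yaml
# A minimal Kubernetes audit policy: record request metadata for everything,
# and full request/response bodies for access to Secrets.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse     # capture bodies for sensitive resources
    resources:
      - group: ""
        resources: ["secrets"]
  - level: Metadata            # who did what, to what, and when
```

Run the application in a permissive preproduction environment with this enabled, then derive the production allow list from what actually shows up in the logs.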
[0:09:54.5] NL: It sounds like what we’re coming to is that the breakdown of security, or the way that security has impacted agility, is when people don’t take a rational look at their own use case and instead rely too much on the guidance of other people, essentially. Using things like the CIS benchmarks or NIST or FISMA – that’s one where I knew the other two and thought, I don’t know this one. If they follow them less as guidelines and more as hard-set rules, that’s when we get impacts to agility. Instead of, “Hey, this is what my application needs,” like you’re saying, “let’s go from there.”
What does this one look like, like Duffie was saying. I’m kind of curious, let’s flip that on its head a little bit: are there examples of times when agility impacts security?
[0:10:39.7] BM: You want to move fast and moving fast is counter to being secure?
[0:10:44.5] NL: Yes.
[0:10:46.0] DC: Yeah, literally every single time we run software. What it comes down to is that developers are going to want to develop, and security people are going to want to secure. And generally, I’m looking at it as a developer who has written security software that a lot of people have used, you guys know that.
Really, there needs to be a conversation. It’s the same thing as the DevOps conversation we’ve been having for years, and then over the last couple of years this whole DevSecOps conversation has been happening. We need to have this conversation because, from a security person’s point of view, you know, no access is great access.
No data: you can’t get owned if you don’t have any data going across the wire. You know what? Can’t get into that server if there are no ports open. But practically, that doesn’t work, and what we find is that there is actually a failing on both sides to understand what the other person was optimizing for.
[0:11:41.2] BM: That’s actually where a lot of this comes from. I will offer up that the only secure default posture is no access to anything, and you should be working from that direction toward where you want to be, rather than working from “what should we close down?” You should close down everything, and then you work with an allow list rather than a block list.
[0:12:00.9] NL: Yeah, I agree with that model, but I think there’s an important step that has to happen before that, and that’s the tooling or the wherewithal to define what the application looks like when it’s in a normal, running state. If we can accomplish that, then I feel like we’re in a better position to define what that allow list looks like. And one of the other challenges there, of course – let’s back up for a second. I have actually worked on a platform that supported many services, hundreds of services, right?
Clearly, if I needed to define what normal looked like for a hundred services, or a thousand services, or 2,000 services, that’s going to be difficult with the way people approach the problem, right? How do you define it for each individual service? I need to have some declaration of intent. I need the developer to engage here and tell me what they’re expecting, to set some assumptions about the application, like what it’s going to connect to, what those dependencies are –
That sort of stuff. And I also need tooling to verify that. I need to be able to kind of like build up the whole thing so that I have some way of automatically, you know, maybe with oversight, defining what that security context looks like for this particular service on this particular platform.
Trying to do it holistically is actually, I think, where we get into trouble, right? Obviously, we can’t scale the number of people that it takes to actually understand all of these individual services. We need to scale this as a software problem instead.
[0:13:22.4] CC: With cloud native architecture and infrastructure, I wonder if it makes it more restrictive. Because, let’s say these are running on Kubernetes, everything is running on Kubernetes. Things are more connected because it’s Kubernetes, right? It’s this one huge thing that you’re running on, and Kubernetes makes it easier to have access to different nodes, and with the nodes talking to each other, of course, you have this connection.
Still, it’s supposed to make things easy. I wonder if security, from the perspective of somebody needing to put a restriction in place, an admin for example, makes it harder; or if it makes it easier to just delegate: you have this entire area here for you, and because your app is constrained to this space, or namespace, or this part, this node, you can have as much access as you need within it. Is there any difference? Do you know what I mean? Does it make sense what I said?
[0:14:23.9] BM: It’s actually exactly the same thing as we had before. We need to make sure that applications have access to what they need and don’t have access to what they don’t need. Now, Kubernetes does make it easier, because you can have network policies and you can apply those, and they’re easier to manage than who-knows-what networking management tooling you have otherwise.
Kubernetes also has pod security policies, which again actually consolidate this knowledge around “my pod should be able to do this,” or “it should not be able to run as root,” it should be able to do this and not do that. It’s still the same practice, Carlisia, but the way that we can control it is now with a standard set of tools.
We still have not cracked the whole nut, because the whole thing of turning auditing on to understand, and then having great tools that can read audit logs from Kubernetes: those just still aren’t there. Just to add one last thing: before we were at VMware, when we were Heptio, we had a coworker who basically wrote dynamic audit, and that was probably one of the first steps we would need to be able to employ this at scale.
We are early, early, super early in our journey of getting this right. We just don’t have all the necessary tools yet. That’s why it’s hard, and that’s why people don’t do it.
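For reference, here is a sketch of the kind of network policy mentioned above. This illustrative example locks every pod in a namespace down to ingress from that same namespace only:

```yaml
# Allow pods in this namespace to receive traffic only from pods in the
# same namespace; once a policy selects them, everything else is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: my-team        # illustrative namespace
spec:
  podSelector: {}           # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # any pod in this same namespace
```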
[0:15:39.6] NL: I do think it is nice to have those primitives available to people who are making use of the platform, though, right? Because again, it kind of opens up that conversation, right? Around transparency. The goal being, if you understood the tools that are defining that constraint, perhaps you’d have access to view what the constraints are and understand whether they’re actually rational or not for your applications.
When you’re trying to resolve: I have deployed my application in dev and it’s the wild west, there are no constraints anywhere, I can do anything within dev, right? When I’m trying to actually promote my application to staging, it gives you some platform around which you can actually say, “If you want to get to staging, I do have to enforce these things, and I have a way.” And again, it’s all still part of that same API. I still have the same user experience that I had when deploying or designing the application and getting it deployed.
I can still look at it and understand what constraints are being applied and make sure that they’re reasonable for my application. Does my application run? Does it have access to the network resources that it needs? If not, can I see where the gaps are, you know?
[0:16:38.6] DC: For anyone listening to this: Kubernetes doesn’t have all the documentation we need, and no one has actually written this book yet. But on Kubernetes.io, there are a couple of documents about security, and if we have show notes, I will make sure those get included, because I think there are things you should at least understand, like what’s in a pod security policy.
You should at least understand what’s in a network security policy. You should at least understand how roles and role bindings work. You should understand what you’re going to do for certificate management. How do you manage this certificate authority in Kubernetes? How do you actually work these things out? This is where you should start before you do anything else really fancy. At least, understand your landscape.
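As a companion to that list, here is a minimal sketch of a Role and RoleBinding. The namespace, names, and service account are illustrative:

```yaml
# A namespaced Role granting read-only access to pods, bound to a single
# service account. All names here are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: dev
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: dev
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```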
[0:17:22.7] CC: Duffie did a TGIK talk on secrets. I think – was that a series? There were a couple of them, Duffie?
[0:17:29.7] DC: Yeah, there were. I need to get back and do a little more but yeah.
[0:17:33.4] BM: We should add those to our show notes too. Hopefully they actually exist, or I’m willing to see to it that they do.
[0:17:40.3] CC: We are going to have shownotes, yes.
[0:17:44.0] NL: That is an interesting point, bringing up secrets and secret management, and securing secrets. There are some tools that exist that we can use now in a cloud native world, at least in the container world. Things like Vault exist, and now with kubeadm you can rotate certificates, which is really nice. We are getting to a place where we have more tooling available, and I’m really happy about it.
Because I remember using Kubernetes a year ago, and everyone was like, “Well, how do you secure a secret in Kubernetes?” And I’m like, “Well, it’s just base64 encoded. That’s not at all secure.”
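To make the base64 point concrete, here is a stock Kubernetes Secret; the name and value are illustrative. The value is encoded, not encrypted, so anyone who can read the object can decode it:

```yaml
# A stock Kubernetes Secret. The value is only base64-encoded, not
# encrypted: anyone who can read the object can recover it.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
data:
  password: cGFzc3dvcmQ=   # base64 of "password"; decode with: base64 -d
```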
[0:18:15.5] BM: I would give credit to Bitnami, which has been doing Sealed Secrets; that’s been out for quite a while. But the problem is, how are you supposed to know about that, and how are you supposed to know if it’s a good standard? And then also, how are you supposed to benchmark against that? How do you know if your secrets are okay?
We haven’t talked about the other side which is response or detection of issues. We’re just talking about starting out, what do you do?
[0:18:42.3] DC: That’s right.
[0:18:42.6] NL: It is tricky. Like we’re saying, understanding all the avenues through which you could be impacted is kind of a daunting task. Let’s talk about the Target breach that occurred a few years ago. If anybody doesn’t remember this, basically Target had a huge credit card breach from their database, and what happened, if I recall properly, is that their OIDC token had – not expired, but the audience for it was so broad that someone hacked into one computer, essentially, like a register or something, and they were able to get the OIDC token from the local machine.
The authentication audience for that token was so broad that they were able to access the database that had all of the credit card information in it. This is one of those things that you don’t think about when you’re setting up security, when you’re maybe just getting started or something like that.
What are the avenues of attack, right? You’d say, “OIDC is just a pure authentication mechanism, why would we need to concern ourselves with this?” But then, not understanding, kind of like what we were talking about before with the networking: what is the blast radius of something like this? I feel like this is a good example of how sometimes security can be really hard, and getting started can be really daunting.
[0:19:54.6] DC: Yeah, I agree. To Bryan’s point, it’s like, how do you test against this? How do you know that what you’ve defined is enough, right? We can define all of these constraints, and we can even think that they’re pretty reasonable or rational, and the application may come up and operate. But how do you know? How can you verify that what you’ve done is enough?
And then also, remember, OIDC has its own foundations and flaws. You realize that it’s a very strong door, but it’s only a door. It can’t stop someone from walking around the wall it’s protecting, or climbing over it. There’s a bit of trust, and when you get into things like the Target breach, you really have to understand the blast radius of anything that you’re going to do.
A good example would be if you’re using shared-key kinds of things, like public key infrastructure. You have certificate authorities and you’re generating certificates. You should probably have multiple certificate authorities, and you can have a hierarchy of these, so the root one is controlled by just a few people in security.
And then each department has their own certificate authority. You should also have things like revocation. You should be able to say, “Hey, this is all bad and it should all go away,” and you probably should have a certificate revocation list, which a lot of us don’t have internally, believe it or not. Where if I actually kill a certificate that was generated, and I put it in my revocation list, it should not be served; and our clients that see that certificate, if we’re using client-side certificates, should reject it instantly.
Really, what we need to do is stop looking at security as this one big thing, and we need to figure out what our blast radii are. A firecracker blowing up in my hand is going to hurt me, but Nick, it’s not going to hurt you, you know? Whereas if someone drops a huge nuclear bomb on the United States, or the West Coast of the United States, that’s a very different blast radius.
You’ve got to think about it like that. What’s the worst that can happen if this thing gets busted or gets shared, or someone finds something that should not happen? For every piece of data that you have that you consider secure or sensitive, you should be able to figure out what that means, and that is how you should define a security posture. That is why you’ll notice that a lot of companies, some of them, do run open within a contained zone. Within this contained zone, you can talk to whomever you want. We don’t actually have to be secure here, because if we lose one, we’ve lost them all, so who cares?
So we need to think about that, and how do we do that in Kubernetes? Well, we use things like namespaces first of all, then things like network policies, and then things like pod security policies. We can lock access down to just namespaces if need be: you can only talk to pods in your namespace. I’m not telling you exactly how to do this, but you need to figure it out by talking with your developers, talking to the security people.
And if you are in security, you need to talk to your product management staff and your software engineering staff to figure out how this really needs to work. So, you realize that security is fun, and we have all sorts of neat tools depending on what side you’re on. If you’re on the red team, you’re hacking in; if you’re on the blue team, you’re defending things. We need to have these conversations, and tooling comes from these conversations, but we need to have the conversations first.
[0:23:11.0] DC: I feel like a bit of a broken record on this one, but I’m going to go back to chaos engineering again, because I feel like it is critical to stuff like this. It enables a culture in which you can explore the behavior of the applications themselves, but why not also use this model to explore different ways of accessing that information? Or coming up with theories about the way the system might be vulnerable, based on a particular attack or a type of attack, right?
I think this is actually one of the movements within our space that provides perhaps the most hope in this particular scenario, because a reasonable chaos engineering practice within an organization enables that ability to explore all of the things. You don’t have to be red team or blue team. You can just be somebody who understands this application well, and the question for the day is, “How can we attack this application?”
Let’s come up with theories about the way that perhaps this application could be attacked. Think about the problem differently instead of thinking about it as an access problem, think about it as the way that you extend trust to the other components within your particular distributed system like do they have access that they don’t need. Come up with a theory around being able to use some proxy component of another system to attack yet a third system.
You know start playing with those ideas and prove them out within your application. A culture that embraces that I think is going to be by far a more secure culture because it lets developers and engineers explore these systems in ways that we don’t generally explore them.
[0:24:36.0] BM: Right. But also, if I could operate on myself, I would never need a doctor. The reason I bring that up is because we use terms like chaos engineering, and, no disrespect to you, Duffie, but don’t take it as a panacea, this idea that it will simply make things better. That’s fine, it will make us better, but the little secret behind chaos engineering is that it is hard. It is hard to build these experiments, first of all, and it is hard to collect results from these experiments.
And then it is hard to extrapolate what you got out of the experiments and apply it to whatever you are working on, and to repeat it. What I would like to see is people in our space talking about how we can apply such techniques, whether that is giving us more vocabulary or more software that we can employ. Because, I hate to say it, it is pretty chaotic in chaos engineering for Kubernetes right now, if you look at all the people out there who have done it well.
So, you look at what Netflix has done in pioneering this, and then you listen to what a company such as Gremlin is talking about, and it is all fine and dandy. But you need to realize that it is another piece of complexity that you have to own, and just like anything else in the security world, you need to rationalize how much time you are going to spend on it. Because if I have a “Hello, World!” app, I don’t really care about network access to that.
Unless it is a “Hello, World!” app running on the same subnet as something handling PCI data; then, you know, it is a different conversation.
[0:26:05.5] DC: Yeah, I agree, and I am certainly not trying to position it as a panacea. What I am trying to describe is that having a culture that embraces that sort of thinking is going to put us in a better position to secure these applications, or to handle a breach, or to deal with very hard-to-understand or hard-to-resolve problems at scale, you know? Whether that is the number of connections per second, or the number of applications that we have horizontally scaled.
You know, being able to embrace that sort of culture where we ask why, where we say, “Well, what if…”, where we actually embrace the curiosity that got us into this field in the first place, you know what I mean? So frequently our cultures are the opposite of that, right? It becomes a race to the finish, and in that race to the finish, lots of pieces fall off that we are not even aware of, you know? That is what I am highlighting here when I talk about this.
[0:26:56.5] NL: And so it seems maybe the best solution to the dichotomy between security and agility is really just open conversation, in a way. People actually reaching across the aisle to talk to each other. So, if you are embracing this culture, as you are saying, Duffie, the security team should be in constant communication with the application team, instead of the team doing something wrong and the security team coming down and smacking their hand.
And being like, “Oh, you can’t do it this way because of our draconian rules,” right? These people are working together, and almost playing together a little bit inside of their own environment, to create a better environment. And I’m sorry, I didn’t mean to cut you off there, Bryan.
[0:27:34.9] BM: Oh man, that thought was fleeting, like all my thoughts. But more to what you are saying: it is not just more conversations, because we can still have conversations where I am talking about CIDRs and subnets and attack vectors and buffer overflows and things like that, while my developer is just saying, “Well, I just need to be able to serve this data so accounting can do this.”
And that’s what happens a lot in security conversations. You have two groups of individuals who have wholly different goals, and part of that conversation needs to be aligning our jargon and then aligning on those goals. But with pretty much everything in the development world, we always bring our networking, security, and operations people in right at the end, right when we are ready to ship: “Hey, make this thing work.” And really, that is where a lot of our problems come from.
Now, if security either could be or wanted to be involved at the beginning of a software project, when we are actually talking about what we are trying to do: we are trying to open up this service to talk to this, to share this kind of data. Security can be in there early saying, “Oh, you know, we are using this resource in our cloud provider, it doesn’t really matter which cloud provider, and we need to protect this. This data is sitting here at rest.”
If we had those conversations earlier, it would be easier to engineer solutions that can hopefully be reused, so we don’t have to have that conversation again in the future.
[0:29:02.5] CC: But then it goes back to the issue of agility, right? Like Duffie was saying, you can develop in a development cluster which has much less restrictive constraints, and then move to a production environment, or maybe a staging environment, let’s say, where the proper restrictions are applied. And then you find out, “Oh, whoops. There are a bunch of restrictions I didn’t deal with. I moved a lot faster because I didn’t have them, but now I have to deal with them.”
[0:29:29.5] DC: Yeah, do you think it is important to have a promotion model in which you are able to move toward a more secure deployment right? Because I guess a parallel to this is like I have heard it said that you should develop your monolith first and then when you actually have the working prototype of what you’re trying to create then consider carefully whether it is time to break this thing up into a set of distinct services, right?
And consider carefully also what the value of that might be? And I think that the reason that that’s said is because it is easier. It is going to be a lower cognitive load with everything all right there in the same codebase. You understand how all of these pieces interconnect and you can quickly develop or prototype what you are working on. Whereas if you are trying to develop these things into individual micro services first, it is harder to figure out where the line is.
Like where to divide all of the business logic. I think this is also important when you are thinking about the security aspects of this, right? Being able to work in a mode in which you are not constrained, to define all of these services and your application and the model for how they communicate without constraint, is important. And once you have that, once you actually understand what normal looks like for that set of applications, then enforce it, right?
If you are able to declare that intent, you are going to say: these are the ports these things listen on, these are the things that they are going to access, this is the way that they are going to go about accessing them. If you can declare that intent, then that is a reasonable body of knowledge the security people can come along and use, saying, “Okay, well, you have told us. You informed us. You have worked with us to tell us what your intent is. We are going to enforce that intent and see what falls out, and we can iterate there.”
[0:31:01.9] CC: Yeah, everything you said makes sense to me, starting with building the monolith first. I mean, when you start out, why would you abstract things that you don’t really – I mean, you might think you know, but you only really know in practice what you are going to need to abstract. So don’t abstract things too early. I am a big fan of that idea. So yeah, start with the monolith and then figure out how to break it down based on what you need.
With security, I imagine the same idea resonates with me. Don’t secure things that you don’t yet know need securing, except for the deal-breaker things. There are some things we know from the beginning that we need to secure, like certain types of production data that we don’t want accessed.
[0:31:51.9] BM: Right. But I will still reiterate that it is always deny by default, just remember that. Security actually works the opposite way. We want to make sure that we allow the least amount possible, even if it is harder for us. You always want to start with allowing, say, TCP communication on port 443, or UDP as well. That is what I would allow, rather than starting open and saying, shut everything else off. I would rather have it so that we only allow what we need, and that also fits with the declarative nature of the cloud native things we like anyway. We just say what we want, and everything else doesn’t exist.
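Here is a sketch of that deny-by-default posture as a NetworkPolicy, with an explicit allow for outbound TCP 443. The namespace is illustrative:

```yaml
# Deny-by-default for every pod in the namespace, then explicitly allow
# only outbound TCP 443. Anything not on the allow list does not exist.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-allow-443
  namespace: my-team        # illustrative
spec:
  podSelector: {}
  policyTypes:
    - Ingress               # no ingress rules listed, so all ingress is denied
    - Egress
  egress:
    - ports:
        - protocol: TCP
          port: 443
    - ports:                # in practice you will usually also need DNS
        - protocol: UDP
          port: 53
```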
[0:32:27.6] DC: I do want to clarify, though, because I think you and I are the representatives of the dichotomy right at this moment, right? I feel like what you are saying is that the constraint should be the normal: dropping all traffic, allowing nothing, is normal, and then you have to declare intent to open anything up. And what I am saying is that frequently, developers don’t know what normal looks like yet. They need to be able to explore what normal looks like by developing these patterns, and then enforce them, right? Which is turning the model on its head.
And this is actually I think the kernel that I am trying to get to in this conversation is that there has to be a place where you can go play and learn what normal is and then you can move into a world in which you can actually enforce what that normal looks like with reasonable constraint. But until you know what that is, until you have that opportunity to learn it, all we are doing here is restricting your ability to learn. We are adding friction to the process.
[0:33:25.1] BM: Right. Well, I think what I am trying to layer on top of this is that, yes, I agree, but I also understand what a breach can do and what bad security can do. So I will say, “Yeah, go learn. Go play all you want, but not on software that will ever make it to production. Go learn these practices, but you are going to have to do it outside.” You are going to have a sandbox, and that sandbox is going to be disconnected from the rest of our world, and you are going to learn there, but you are not going to practice here. This is not where you learn how to do this.
[0:33:56.8] NL: Exactly right, yeah. You don’t learn to ride a motorcycle on the street, you know? You learn to ride a motorcycle on the dirt, and then you can take those skills to the street later. But yeah, I think we are in agreement: production is a place where we do have to enforce all of those things, and having some promotion model in which you can go from a place where you learned it, to a place where you are beginning to enforce it, to a place where it is enforced, I think is also important.
And I frequently describe this as development, staging, and production, right? Staging is where you are going to hit the edges, because this is where you are actually defining that constraint, and it has to be right before it can be promoted to production, right? I feel like that middle ground is also important.
[0:34:33.6] BM: And remember that any environment production can reach is production. Any environment that can reach production is production. That includes when we take data backup dumps from production, clean them up, and use them as data in our staging environment. If production can directly reach staging, or vice versa, it is all production. That is your attack vector. That is also how someone is going to get in and steal your production data.
[0:34:59.1] NL: That is absolutely right. Google actually makes an interesting, not caveat to that, but side point to that. If I understand the way that Google runs, they run everything in production, right? Dev, staging, and production are all the same environment. I am positing this as a question, because I don’t know if any of us have the answer, but I wonder how they secure their environment well enough to allow people to play and learn these things?
And also to deploy production-level code, all in the same area? That seems really interesting to me, and if I understood that, I would probably be making a lot more money.
[0:35:32.6] BM: Well, it is simple, really. There are huge people processes at Google that act as gatekeepers for a lot of this stuff. Now, I have never worked at Google. I have no intrinsic knowledge of Google, and I haven’t talked to anyone who has given me this insight; this is all speculation, disclaimer over. But you can actually run a big cluster like that if you can prove that you have network and memory and CPU isolation between containers, which they can in certain cases, with certain things that can do this.
What you can do is use your people processes and your approvals to make sure that software gets to where it needs to be. So you can still play on the same clusters, but with great handles on the network: you can’t talk to these networks, or you can’t use this much network bandwidth. With great controls on CPU: this CPU handles PCI data, and we will not allow anything onto it unless it is PCI. Once you have that in place, you do have a lot more flexibility. But to do that, you have to have some pretty complex approval structures, and software to back them up.
So the burden is not on the normal developer, and that is actually what Google has done. They have so many tools and so many processes that if you use this tool, it actually does the process for you. You don’t have to think about it. And that is what we want for our developers. We want them to be able to use our networking libraries, or, whenever they are building their containers or their Kubernetes manifests, use our tools, and we will make sure, based on either inspection or explicit settings, that we build something that is as secure as we can given the inputs.
And what I am describing is hard, capital-H hard, and I am really just pointing at where we want to be, which is somewhere a lot of us are not. Most people are not there.
[0:37:21.9] NL: Yeah, it would be nice if we had, like we said earlier, more tooling around security and the processes and all of these things. One thing that people seem to balk at, or at least I feel they do, is developing it for their own use case, right? It seems like people want an overarching tool to solve all the use cases in the world. And I think with the rise of cloud native applications and things like container orchestration, I would like to see people developing more for themselves, around their own processes, around Kubernetes and things like that.
I want to see more perspective on how people are solving their security problems, instead of just relying on, say, HashiCorp or Aqua Sec to provide all the answers. I want to see more answers about what people are actually doing.
[0:38:06.5] BM: Oh, it is because tools like Vault are hard to write, hard to maintain, and hard to keep correct. Think about the other large competitors to Vault that are out there, tools like CyberArk. I have a secret, and I want to make sure only certain people can get it. That is a very difficult tool to build, but the HashiCorp advantage here is that they have made tools that speak to people who write software and people who understand ops, not just a checkbox.
If you are using Vault, it is not hard to get a secret out if you have the right credentials. With other tools, it is super hard to get the secret out even if you have the right credentials, because they have a weird API, or they just make it very hard for you, or they expect you to go click on some GUI somewhere. And that is what we need to do. We need better programming interfaces and better operator interfaces, which extends to better interfaces for the security people using these tools.
You know, I don’t know how well this works in practice, but there’s the Jeff Bezos idea of how teams at Amazon are formed, where teams communicate through APIs. And I am not saying that you shouldn’t talk, but we should definitely make sure that the APIs between teams, between the team that owns the security stuff and the teams writing the developer stuff, can carry the same level of fidelity as an in-person conversation. We should be able to do that through our software as well.
Whether that be asking for ports, or asking for resources, or just talking about the problem that we have. That is my thought-leadering answer to this. This is “Bryan wants to be a VP of something one day,” and that is the answer I am giving. That is my CIO answer.
[0:39:43.8] DC: I like it. So cool.
[0:39:45.5] BM: Is there anything else on this subject that we wanted to hit?
[0:39:48.5] NL: No, I think we have actually touched on pretty much everything. We got a lot out of this and I am always impressed with the direction that we go and I did not expect us to go down this route and I was very pleased with the discussion we have had so far.
[0:39:59.6] DC: Me too. I think if we were going to explore anything else, it would be getting more into that state we talked about where we need more feedback loops. We need developers to talk to security people. We need security people to talk to developers. We need to have some way of actually pushing that feedback loop, much like some of the other cultural changes that we have seen in our industry that are trying to allow for better feedback loops in other spaces.
And you’ve brought up DevSecOps, which is another move to try and open up that feedback loop. But the problem, I think, is still going to be that even if we improve that feedback loop, we are at an age where, especially if you end up in some of the larger organizations, there are too many applications to solve this problem for, and I don’t yet know how to address this problem in that context, right?
If you are in a state where you are a 20-person or 30-person security team, and your responsibility is to secure a platform that is running a number of Kubernetes clusters, a number of vSphere clusters, and a number of cloud provider implementations, whether that be AWS or GCP, that is a set of problems that is very difficult. I am not sure that improving the feedback loop really solves it. I know that it helps, but I definitely have empathy for those folks, for sure.
[0:41:13.0] CC: Security is not my forte at all, because whenever I am developing, I have a narrow need. You know, I have to access a cluster, I have to access a machine, or I have to be able to access the database, and it is usually a no-brainer. But I get a lot of the issues that were brought up. And as a builder of software, I have empathy for the people who use and consume software, mine and others’, and how they can’t have any visibility as far as security goes.
For example, in the world of cloud native, let’s say you are using Kubernetes, I start thinking, “Well, shouldn’t there be a scanner that just lets me declare?” I think I am starting a new episode right now – shouldn’t there be a scanner that lets me declare, for example, that this node can only access this set of nodes, like a graph? You just declare that, and then you run it periodically. And of course this goes down to, say, part of an app can only access part of the database.
It can get very granular, but maybe at a very high level, I mean, how hard can this be? For example, this pod can only access those pods, but this pod cannot access this namespace. And just keep checking: what if the namespaces change, or the permissions change? Or, for example, only allow these users to do a backup, because they are the same users who will have access to the restore, so they have access to all the data, you know what I mean? Just keep checking that this is in place, and that it only changes when you want it to.
[0:42:48.9] BM: So, I mean, I know we are at the end of this call and I am about to start a whole new conversation, but this is actually why there are applications out there like Istio and Linkerd. This is why people want service meshes: they can turn off all network access and then just use the service mesh to do the communication, and they can make sure that it is encrypted and authenticated on both sides. That is why people want this.
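As a rough illustration of that idea, here is an Istio AuthorizationPolicy sketch: with mutual TLS in the mesh, only the named service account can call the selected workload. All names here are illustrative:

```yaml
# With mesh-wide mutual TLS, only the "frontend" service account may call
# workloads labeled app: backend. Namespace and names are illustrative.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: backend-allow-frontend
  namespace: my-team
spec:
  selector:
    matchLabels:
      app: backend
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/my-team/sa/frontend"]
```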
[0:43:15.1] CC: We’re definitely going to have an episode, or multiple, on service mesh, but we are at the top of the hour. Nick, do your thing.
[0:43:23.8] NL: All right. Well, thank you so much for joining us for another interesting discussion on The Podlets Podcast. I am Nicholas Lane. Duffie, any final thoughts?
[0:43:32.9] DC: There is a whole lot to discuss, I really enjoyed our conversations today. Thank you everybody.
[0:43:36.5] NL: And Bryan?
[0:43:37.4] BM: Oh it was good being here. Now it is lunch time.
[0:43:41.1] NL: And Carlisia.
[0:43:42.9] CC: I love learning from you all, thank you. Glad to be here.
[0:43:46.2] NL: Totally agree. Thank you again for joining us and we’ll see you next time. Bye.
[0:43:51.0] CC: Bye.
[0:43:52.1] DC: Bye.
[0:43:52.6] BM: Bye.
[END OF EPISODE]
[0:43:54.7] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
See omnystudio.com/listener for privacy information.
-
This week on The Podlets Cloud Native Podcast we have Josh, Carlisia, Duffie, and Nick on the show, and are also happy to be joined by a newcomer, Bryan Liles, who is a senior staff engineer at VMware! The purpose of today’s show is coming to a deeper understanding of the meaning of ‘stateful’ versus ‘stateless’ apps, and how they relate to the cloud native environment. We cover some definitions of ‘state’ initially and then move to consider how ideas of data persistence and co-ordination across apps complicate or elucidate understandings of ‘stateful’ and ‘stateless’. We then think about the challenging practice of running databases within Kubernetes clusters, which effectively results in an ephemeral system becoming stateful. You’ll then hear some clarifications of the meaning of operators and controllers, the role they play in mediating and regulating states, and also how important they are in a rapidly evolving but skills-scarce environment. Another important theme in this conversation is the CAP theorem, the impossibility of consistency, availability, and partition tolerance all at once, and the way different databases allow for different combinations of two out of the three. We then move on to chat about the fundamental connection between workloads and state, and end off with a quick consideration of how ideas of stateful and stateless play out in the context of networks. Today’s show is a real deep dive offering perspectives from some of the most knowledgeable in the cloud native space, so make sure to tune in!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Duffie Cooley
Bryan Liles
Josh Rosso
Nicholas Lane
Key Points From This Episode:
• What ‘stateful’ means in comparison to ‘stateless’.
• Understanding ‘state’ as a term referring to data which must persist.
• Examples of stateful apps such as databases or apps that revolve around databases.
• The idea that ‘persistence’ is debatable, which then problematizes the definition of ‘state’.
• Considerations of the push for cloud native to run stateless apps.
• How inter-app coordination relates to definitions of stateful and stateless applications.
• Considering stateful data as data outside of a stateless cloud native environment.
• Why it is challenging to run databases in Kubernetes clusters.
• The role of operators in running stateful databases in clusters.
• Understanding CRDs and controllers, and how they relate to operators.
• Controllers mediate between actual and desired states.
• Operators are codified system administrators.
• The importance of operators as app number grows in a skill-scarce environment.
• Mechanisms around stateful apps are important because they ensure data integrity.
• The CAP theorem: the impossibility of consistency, availability, and partition tolerance all at once.
• Why different databases allow for different iterations of the CAP theorem.
• When partition tolerance can and can’t get sacrificed.
• Recommendations on when to run stateful or stateless apps through Kubernetes.
• The importance of considering models when thinking about how to run a stateful app.
• Varying definitions of workloads.
• Pods can run multiple workloads.
• Workloads create states, so you can’t have one without the other.
• The term ‘workloads’ can refer to multiple processes running at once.
• Why the ephemerality of Kubernetes systems makes it hard to run stateful applications.
• Ideas of stateful and stateless concerning networks.
• The shift from server to browser in hosting stateful sessions.
Quotes:
“When I started envisioning this world of stateless apps, to me it was like, ‘Why do we even call them apps? Why don’t we just call them a process?’” — @carlisia [0:03:00]
“‘State’ really is just that data which must persist.” — @joshrosso [0:04:26]
“From the best that I can surmise, the operator pattern is the combination of a CRD plus a controller that will operate on events from the Kubernetes API based on that CRD’s configuration.” — @bryanl [0:17:00]
“Once again, don’t let developers name them anything.” — @bryanl [0:17:35]
“Data integrity is so important.” — @apinick [0:22:31]
“You have to really be careful about the different models that you’re evaluating when trying to think about how to manage a stateful application like a database.” — @mauilion [0:31:34]
Links Mentioned in Today’s Episode:
KubeCon+CloudNativeCon — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2019/
Google Spanner — https://cloud.google.com/spanner/
CockroachDB — https://www.cockroachlabs.com/
CoreOS — https://coreos.com/
Red Hat — https://www.redhat.com/en
Metacontroller — https://metacontroller.app/
Brandon Philips — https://www.redhat.com/en/blog/authors/brandon-phillips
MySQL — https://www.mysql.com/
Transcript:
EPISODE 009
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[INTERVIEW]
[00:00:41] JR: All right! Hello, everybody, and welcome to episode 6 of The Podlets Podcast. Today we are going to be discussing the concept of stateful and stateless and what that means in this crazy cloud native landscape that we all work in.
I am Josh Rosso. Joined with me today is Carlisia.
[00:00:59] CC: Hi, everybody.
[00:01:01] JR: We also have Duffie.
[00:01:03] D: Hey, everybody.
[00:01:04] JR: Nicholas.
[00:01:05] NL: Yo!
[00:01:07] JR: And a newcomer to the podcast, we also have Brian. Brian, you want to give us a little intro about yourself?
[00:01:12] BL: Hi! I’m Bryan. I work at VMware. I do lots of community stuff, including co-chairing KubeCon + CloudNativeCon.
[00:01:22] JR: Awesome! Cool. All right, we’ve got a pretty good cast this week, so let’s dive right into it. I think one of the first things we’ve been talking a bit about is the concept of what makes an application stateful, and of course, in reverse, what makes an application stateless. Maybe we could try to start by discerning those two, maybe starting with stateless, if that makes sense? Does someone want to take that on?
[00:01:45] CC: Well, I’m going to jump right in. I have always been a developer, as opposed to some or all of you who have system admin backgrounds. The first time that I heard of a stateless app, I was like, “What?” That wasn’t recent, okay? It was a long time ago, but that was a knot in my head. Why would you have a stateless app? If you have an app, you’re going to need state. I couldn’t imagine what that was. But of course it makes a lot of sense now. That was also when we were more in the monolithic world.
[00:02:18] BL: Actually that’s a good point. Before you go into that, it’s a great point. Whenever we start with apps or we start developing apps, we think of an application. An application does everything. It takes input and it does stuff and it gives output. But now in this new world where we have lots of apps, big apps, small apps, we start finding that there are apps that only talk and coordinate with other apps. They don’t do anything else. They don’t save any data. They don’t do anything. That’s where we get into this thing called stateless apps: apps that don’t have any type of data that they store locally.
[00:02:53] CC: Yeah. It’s more like what I envision in my head. You said it brilliantly, Brian. It’s almost like a process. When I started envisioning this world of stateless apps, to me it was like, “Why do we even call them apps? Why don’t we just call them a process?” They’re just shifting data back and forth but they’re not – To me, at the beginning, apps were always stateful. They went together.
[00:03:17] D: I think, frequently, people think of applications that have only locally relevant stuff that is actually not going to persist to disc, but maybe held in memory or maybe only relevant to the type of connection that’s coming through that application also as stateless, which is interesting, because there’s still some state there, but the premise is that you could lose that state and not lose the functionality of that code.
[00:03:42] NL: Something that we might want to dive into really quickly when talking about stateless and stateful apps. What do we mean by the word state? When I first learned about these things, that was what always screwed me up. I’m like, “What do you mean state? Like Washington? Yeah. We got it over here.”
[00:03:57] JR: Oh! State. That’s that word. State is one of those words that we use to sound smarter than we actually are 95% of the time, and that’s a number I just made up. When people are talking about state, they usually mean databases. But there are other types of state as well. If you maintain a local cache that needs to be persistent, or if you have local files that you’re dealing with, like you’re opening files, that’s still state. State really is just data that must persist.
[00:04:32] D: I agree with that definition. I think that state, whether persisted to memory or persisted to disc or persisted to some external system, that’s still what we refer to as state.
[00:04:41] JR: All right. Makes sense and sounds about like what I got from it as well.
[00:04:45] CC: All right. So now we have this world where we talk about stateless apps and stateful apps. Are there even stateful apps? Do we call a database an app? If we have a distributed system where we have one stateless app over here, another stateless app over there, and then we have the database that’s connected to the two of them, are we calling the database a stateful app, or is that whole thing – what do we call this?
[00:05:15] NL: Yeah. The database is very much an app with state. I’m very much –
[00:05:19] D: That’s a close definition. Yeah.
[00:05:21] NL: Yeah. Literally, it’s the epitome of a stateful app. But then you also have these apps that talk to databases as well and they might have local data, like data that – they start a transaction and then complete it or they have a long distributed type transaction. Any apps that revolve around a database, if they store local data, whether it’s within a transaction or something else, they’re still stateful apps.
[00:05:46] D: Yup. I think anything where you can modify and input data – modify state that has to be persisted in some way – is a stateful app, even though I do think it’s confusing because, as I said before, there are a bunch of applications we think of differently. Not everybody considers Spark jobs to be stateful. Spark jobs, for example, are something that would bring data in, mutate that data in some way, produce some output and go away.
The definition there is that Spark would generally push the resulting data into some other external system. It’s interesting, because in that model, Spark is not considered to be a stateful app because the Spark job could fail, go away, get recreated, pick up the pieces where it left off or just redo that work until all of the work is done.
In many cases, people consider that to be a stateless application. That, I think, is the crux of the confusion around what a stateful and stateless application is. I think it’s more about what you mean by persistence and how that actually realizes in your application. If you’re pushing your state to an external database, is your application still stateful?
[00:06:58] NL: I think it’s a good question, or if you are gathering data from an external source and mutating it in some way, but you don’t need data to be present when you start up, is that a stateful app or a stateless app? Even though you are taking in data, modifying it and checking it, sending out to some other mechanism or serving it in your own way, does that become like a stateless app? If that app gets killed and it comes back and it’s able to recover, is it stateful or stateless? That’s a bit of a gray area, I think.
[00:07:26] JR: Yeah. I feel like a lot of the customers I work with, if the application can get killed even if it has some type of local state, they still refer to it as stateless usually, to me at least, when we talk about it because they think, “I can kind of restart this application and I’m not too worried about losing whatever it may have had.” Let’s say cached for simplicity, right?
I think that kind of leads us into an interesting question. We’ve talked a lot on this podcast about cloud native infrastructure and cloud native applications and it seems like since the inception of cloud native, there’s always been this push that a stateless app is the best candidate to run or the easiest candidate to run. I’m just curious if we could dive into that for a moment. Why in the cloud native infrastructure area has there always been this push for running stateless applications? Why is it simpler? Those kinds of things.
[00:08:15] BL: Before we dive into that, we have to realize – and this is just a problem of our whole ecosystem, this whole cloud native – we’re very hand-wavy in our descriptions of things. There are a lot of ambiguous descriptions, and state is one of those. Just keep in mind that when we’re talking today, we’re really just talking about these things that store data, and that’s the state. Keep that in mind as you’re listening to this.
But when it comes to distributed systems in general, the easiest system is a system that doesn’t need coordination with any other system. If it happens to die, that’s okay. We can just restart it. People like to start there. It’s the easiest thing to start.
[00:08:58] NL: Yeah, that was basically what I was going to say. If your application needs to tie into other applications, it becomes significantly more complicated to implement it, at least for your first time and in your system. These small applications that only – They don’t care about anybody else, they just take in data or not, they just do whatever. Those are super easy to start with because they’re just like, “Here. Start this up. Who cares? Whatever happens, it happens.”
[00:09:21] CC: That could be a good boundary to define – I don’t want to jump back too far, but to define where is the stateless app to me is part of a system and just say it depends for it to come back up. Does it depend on something else that has state?
[00:09:39] BL: I’ll give you an example. I can give you a good example of a stateless app that we use every day, every single one of us – well, maybe none of us on this call right now – when you search Google. You go to google.com and you go to the bar and you type in a search. What’s happening is there is a service at the beginning that collects that search and federates the search over many, probably clusters of, computers so they can actually do the search concurrently. That app that coordinates all that work is most likely a stateless app. All it does is split the work up and allow more CPUs to do it. If it goes away, probably not a problem. You probably have 10 more of them. That’s what I consider stateless. It doesn’t really own any of the data. It’s the coordinator.
[00:10:25] CC: Yeah. If it goes down, it comes back up. It doesn’t need to reset itself to the state where it was before. It can truly be considered a stateless because it can just, “Okay. I reset. I’m starting from the beginning from this clear state.”
[00:10:43] BL: Yes. That’s a good summary of that.
[00:10:45] CC: Because another way to think about it: what makes an app a stateful app? Does it have to be combined, like deployed and shipped together, with the part that maintains the state? That’s a more clear-cut definition. Then that app is definitely a stateful app.
[00:11:05] D: What we frequently talk about in the cloud native space is that you know you have a stateless app if you can just create 20 of them and not have to worry about the coordination of them. They are all workers. They are all going to take input. You could spread the load across those 20 in an identical way and not worry about which one you landed on. That’s a stateless application.
A stateful application is a very different thing. You have to have some coordination. You have to say how many databases you can have on a backend. Because you’re persisting data there, you have to be really careful that you only write to the master database, or to the writing database, and you can read off any other members of that database cluster, that sort of stuff.
[00:11:44] CC: It might seem that we are going very deep into differentiating between stateful and stateless, but this is so important, because clusters are usually designed to be ephemeral. Ephemeral means, obviously, that the nodes die down and are brought back up, and you should worry as little as possible about the state of things.
Then going back to what Joshua was saying, when we are in this cloud native world, usually we are talking about stateless apps, stateless workloads – and we’re going to talk about what workload means. But if that’s the case, where are the stateful apps? It’s like we have this vision that the stateful apps live outside the cloud native world. How does that work? Because it’s supposed to work.
[00:12:36] BL: Yup. This is the question that keeps a lot of people employed. Making sure my state is available when I need it. You know what? I’m not going to even use that word state. Making sure my data is available wherever I need it and when I need it. I don’t want to go too deep in right now, but this is actually a huge problem in the Kubernetes community in general, and we see it because there’s been lots of advice given, “Don’t run things like databases in your clusters.” This is why we see people taking the ideas of Google Spanner and like CockroachDB and actually going through a lot of work to make sure that you can run databases in Kubernetes clusters.
The interesting piece about this is that we’re actually to the point where we can run these types of workloads in our clusters, but with a caveat, big star at the end, it’s very difficult and you have to know what you’re doing.
[00:13:34] JR: Yeah. I want to dovetail on that Brian, because it’s something that we see all the time. I feel like when we first started setting up, let’s call them clusters, but in our case it was Kubernetes, right? We always saw that data level always being delegated to like if you’re in Amazon, some service that they hosted and so on. But now I think more and more of the customers that at least I’m seeing. I’m sure Nicholas and Duffie too, they’re interested in doing exactly what you just described.
Cockroach is an example I literally just worked with recently, and it’s just interesting how much more thoughtful they have to be about their cluster operations. Going back to what you said, Carlisia, it’s not as easy as just trashing a cluster and instantiating a new one anymore, like they’re used to. They need to be more thoughtful about keeping that data integrity intact through things like upgrades and disaster recovery.
[00:14:18] D: Another interesting point, kind of to your point, Brian, is that, frequently, people are starting to have conversations and concerns around data gravity, which means that I have a whole bunch of data that I need to work with, like for a Spark job, which I mentioned earlier. I need to basically put my compute where that data is – whether I store that data inside the cluster and use Kubernetes to manage it, or whether I just have to make sure that I have some way of bringing up compute workloads close to that data. It’s actually introducing a whole new layer to this whole thing.
[00:14:48] BL: Yeah! A whole new layer of work and a whole new layer of complexity, because the crux of all this is where we slide the complexity to. But this is interesting, and I don’t want to go too far into this one. This is why we’re seeing more people creating operators around managing data. I’ve seen operators that bring databases up inside of Kubernetes. I’ve seen operators that can bring up resources outside of Kubernetes using the Kubernetes API.
The interesting thing about this is that I looked at both solutions and I said, “I still don’t know what the answer is,” and that’s great. That means that we have a lot to learn about the problem, and at least we have some paths for it.
[00:15:29] NL: Actually, that kind of reminds me of the first time I ever heard the word stateful or stateless – I’m an infrastructure guy. It was around the discussion of operators, only a couple of years ago, when operators were first introduced by CoreOS, and some people were like, “Oh! Well, this is how you now operate a stateful mechanism inside of Kubernetes. This is the way forward that we want to propose.” I was just like, “Cool! What is that? What’s state? What do you mean, stateful and stateless?” I had no idea. Josh, you were there. You’re like, “Your frontend doesn’t care about state and your backend does.” I’m like, “Does it? I don’t know. I’m not a developer.”
[00:16:10] JR: Let’s talk about exactly that, because I think these patterns we’re starting to see are coming out of the needs that we’re all talking about, right? We’ve seen, at least in the Kubernetes community, a lot of push for these different constructs, like something called a stateful set, which isn’t that important right now, but then also something called an operator. Maybe we can start by defining: what is an operator? What is that pattern and why does it relate to stateful apps?
[00:16:31] CC: I think that would be great. I am not clear what an operator is. I know there’s going to be a controller involved. I know it’s not a CRD. I am not clear on that at all, because I only work with CRDs and we don’t define – like the project I worked on with Velero, we don’t categorize it as an operator. I guess an operator uses specific framework that exists out there. Is it a Kubernetes library? I have no idea.
[00:16:56] BL: We did it to ourselves again. We’re all doing these to ourselves. From the best that I can surmise, the operator pattern is the combination of a CRD plus a controller that will operate on events from the Kubernetes API based on that CRD’s configuration. That’s what an operator is.
[00:17:17] NL: That’s exactly right.
[00:17:18] BL: To conflate this, Red Hat created the Operator SDK, and then you have [inaudible 00:17:23] and you have Metacontroller, which can help you build operators. Then we actually sometimes conflate things and call CRDs operators, and that’s pretty confusing for everyone. Once again, don’t let developers name anything.
[00:17:41] CC: Wait. So let’s back up a little. Okay. There is an actual library that’s called an operator.
[00:17:46] BL: Yes. There’s an operator SDK.
[00:17:47] CC: Referred to as an operator. I heard that. Okay. Great. But let me back up a little because –
[00:17:49] D: The word operator can –
[00:17:50] CC: Because if you are developing an app for Kubernetes, if you’re extending Kubernetes – okay, you might not use CRDs, but if you are using CRDs, you need a controller, right? Because how else will you do actions? The alternative to having CRDs is just using the API directly, without creating CRDs to reflect your resources. If you’re creating CRDs to reflect your resources, you need controllers. All of those apps that have CRDs are operators.
[00:18:24] D: Yip [inaudible 00:18:25] is an operator.
[00:18:26] CC: [inaudible 00:18:26] not an operator. How can you extend Kubernetes and not be qualified [inaudible 00:18:31] operator?
[00:18:32] BL: Well, there’s a way. There is a way. You can actually just create a CRD and use the CRD for data storage – you know, store state – and you can query the Kubernetes API for that information. You don’t need a controller, but we couple them with controllers a lot, to perform actions based on that state we’ve saved to etcd.
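[EDITOR'S NOTE: For readers following along, a CRD on its own is just a new API type registered with the Kubernetes API server; pairing it with a controller is what the operator pattern adds. A minimal CRD sketch, with a hypothetical group and kind:]

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.example.com       # must be <plural>.<group>
spec:
  group: example.com              # hypothetical API group
  scope: Namespaced
  names:
    kind: Backup                  # hypothetical kind
    plural: backups
    singular: backup
  versions:
    - name: v1
      served: true
      storage: true               # this version is what gets persisted to etcd
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string    # e.g. a cron expression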
[00:18:50] CC: Duffie.
[00:18:51] D: I want to back up just for a moment and talk about the controller pattern, and what it is, and then go from there to operators, because I think it makes it easier to get it in your head. The controller pattern is effectively a way to understand desired state and real state and provide some logic or business code that will allow you to converge those two states – your actual state and your desired state. This is a pattern that we see used in almost everything within a distributed system – within Kubernetes, within most of the more interesting systems that are out there. This controller pattern describes a pretty good way of actually managing application flow across distributed systems.
Now, operators, when they were initially introduced, we were talking about that this is a slightly different thing. Operators, when we introduced the idea, came more from like the operational burden of these stateful applications, things like databases and those sorts of stuff. With the database, etcd for example, you have a whole bunch of operational and runtime concerns around managing the lifecycle of that system. How do I add a new member to the cluster? What do I do when a member dies? How do I take action?
Right now, that’s somebody like myself waking up at 2 in the morning and working through a run book to basically make sure that that service remains operational through the night. But the idea of an operator was to take that control pattern that we described earlier and make it wake up at 2 in the morning to fix this stuff. We’re going to actually codify the operational knowledge of managing the burden of these stateful applications so that we don’t have to wake up at 2 in the morning and do it anymore. Nobody wants to do that.
[00:20:32] BL: Yeah. That makes sense. Remember back at KubeCon years ago – I know it was the one in Seattle – Brandon Philips was on stage talking about operators. He was basically saying that if we think about SysOps, system operators, it was a way to basically automate, or capture, the knowledge of our system administrators in scripts or in a process or in code, a la operators.
[00:20:57] D: The last part that I’ll add to this thing, which I think is actually what really describes the value of this idea to me, is that there are only so many people on the planet that do what the people on this call do. Maybe you’re one of them listening to this podcast. People who are operating software or operating infrastructure at scale – there just aren’t that many of us on the planet. So as we add more applications, as more people adopt the cloud native regime or start coming to a place where they can crank out more applications more quickly, we’re going to have to get to a place where we are able to automate the burden of managing those applications, because there just aren’t enough of us to be able to support the load that is coming. There just aren’t enough people on the planet that do this to be able to support that.
That’s the thing that excites me most about the operator pattern, is that it gives us a place to start. It gives us a place to actually start thinking about managing that burden over time, because if we don’t start changing the way we think about managing that burden, we’re going to run out of people. We’re not going to be able to do it.
[00:22:05] NL: Yeah. It’s interesting. We keep coming back to stateful apps, because stateful apps are hard and stateless apps are easy, and we’ve created all these mechanisms around operating things with state because of how complicated it is to make sure that your data is ready, accessible and has integrity. That’s the big thing that I kept not thinking about as a SysOps person coming into the dev world. Data integrity is so important – making sure that your data is exactly what it needs to be, and what it was the last time you checked it, is super important. It’s only something I’m really starting to grasp. That’s why these things like operators, and all these mechanisms that we keep creating and recreating and recreating, keep coming about: because making sure that your stateful apps have the right data at the right time is so important.
[00:22:55] BL: Since you brought this up, and we just talked about why state is so hard, I want to introduce a new term to this conversation: the CAP theorem. In a distributed system, your data can be consistent, your data can be available, and, if your distributed system falls into multiple parts, it can be partition tolerant. This is one of those computer science things where you can only actually pick two. You can have availability and partition tolerance, but your data won’t be consistent, or you can have consistency and availability, but you won’t have partition tolerance – if your cluster splits into two for some reason, the data will be bad.
This is why it’s hard, this is why people have written basically lots of PhD dissertations on this subject, and this is why we are talking about it here today: managing state, and particularly managing distributed state, is actually a very, very hard problem. But there’s software out there that will help us, and Kubernetes is definitely part of that, and stateful sets are definitely part of that as well.
[00:24:05] JR: I was just going to say, on those three points – consistency, availability and partition tolerance – obviously we’d want all three if we could have them. Is there one that we most commonly trade off and give up, or does it go case-by-case?
[00:24:17] BL: Actually, it’s been proven. You can’t have all three. It’s literally impossible. It depends. If you have a MySQL server and you’re using MySQL to actually serve data out of this, you’re going to most likely get consistency and availability. If you have it replicated, you might not have partition tolerance. That’s something to think about, and there are different databases and this is actually one of the reasons why there are different databases. This is why people use things like relational databases and they use key value stores not because we really like the interfaces, but because they have different properties around the data.
[00:24:55] NL: That’s an interesting point and something that I had recently just been thinking about – like, why are there so many different types of databases? I just didn’t know. I had only just heard of the CAP theorem as well, before you mentioned it. I’m like, “Wow! That’s so fascinating.” The whole thing where you only pick two. You can’t get three.
Josh, to go back to your question really quickly, I think that partition tolerance is the one that we throw away the most. We’re willing to not be able to segregate our database as much as possible, because C and A are just too important, I think. At least that’s what I’m seeing – like, I am wearing an [inaudible 00:25:26] shirt and [inaudible 00:25:27] is not partition tolerant. It’s bad at it.
[00:25:31] BL: This is why Google introduced Spanner, and Spanner in some situations can get all three, with tradeoffs and a lot of really, really smart stuff, but most people can’t run at this scale. But we do need to think about partition tolerance, especially with data. Let’s say you run a store and you have multiple instances across the world and someone buys something from inventory – what does your inventory look like at any particular point? You don’t have to answer my question, of course, but think about that. These are still very important problems: if fiber gets cut across the Atlantic, now I’ve sold more things than I have.
Carlisia, speaking to you as someone who’s only been a developer, have you moved your thoughts on state any further?
[00:26:19] CC: Well, I feel that I’m clear on – well, I think you need to clarify your question better for me. If you’re asking if I understand what it means, I understand what it means. But I actually was thinking of asking this question of all of you, because I don’t know the answer, if that’s the question you’re asking me. I want to put that to the group. Do you recommend people, as in like now-ish, to run stateful workloads? And we need to talk about what workloads mean.
Should they run stateful apps or databases inside, if they’re running a Kubernetes cluster or planning for that? Do you all, as experts, recommend that they should already be looking into doing that, or should they for now be running their stateful apps or databases outside of the cloud native ecosystem and just connecting the two? Because if that’s what your question was, I don’t know.
[00:27:21] BL: Well, I’ll take this first. I think that we should be spending a lot more time than we are right now coming up with community-tested solutions around using stateful sets to their best ability. What that means is, let’s say you’re running a database inside of Kubernetes and you’re using a stateful set to manage it. What we need to figure out is: what happens when my database goes down and the pod just gets killed? When I bring up a new version, I need to make sure that I have the correct software to verify integrity and rebuild things, so that when it comes back up, it comes back up correctly. That’s what I think we should be doing.
[00:27:59] JR: For me, I think working with customers, at least Kubernetes-oriented folks, when they’re trying to introduce Kubernetes as their orchestration part of their overall platform, I’m usually just trying to kind of meet them where they’re at. If they’re new to Kubernetes and distributed systems as a whole, if we have stateless, let’s call them maybe simpler applications to start with, I generally have them lean into that first, because we already have so much in front of us to learn about. I think it was either Brian or Duffie, you said it introduces a whole bunch more complexity. You have to know what you’re doing. You have to know how to operate these things. If they’re new to Kubernetes, I generally will advise start with stateless still. But that being said, so many of our customers that we work with are very interested in running stateful workloads on Kubernetes.
[00:28:42] CC: But just to clarify what you said, Josh, because you spoke like an expert, but I still have beginner’s ears. You said something that sounded to me like you recommend going stateless. What you’re really saying is that they take the stateless part of what they have – which they might already have, or they might have to change – and put that in. You’re not suggesting that, “Oh! You can’t do stateful anymore. You need to just do everything stateless.” What you’re saying is: take the stateless part of your system and put that in Kubernetes, because that is really well-tested, and keep the stateful part outside of that ecosystem. Is that right?
[00:29:27] JR: I think that’s a better way to put it. Again, it’s not that Kubernetes can’t do stateful. It’s more of a concept of biting off more than you can chew. We still work with a lot of people who are very new to these distributed systems concepts, and to take on running stateful workloads, if we could just delegate that to some other layer, like outside of the cluster, that could be a better place to start, at least in my experience. Nicholas and Duff might have different –
[00:29:51] NL: Josh, you basically nailed it like what I was going to say, where it’s like if the team that I’m working with is interested in taking on the complexity of maintaining their databases, their stateful sets and making sure that they have data integrity and availability, then I’m all for them using Kubernetes for a stateful set.
Kubernetes can run stateful applications, but there is all this complexity that we keep talking about and maintaining data and all that. If they’re willing to take on their complexity, great, it’s there for you. If they’re not, if they’re a little bit kind of behind as – Not behind, but if they’re kind of starting out their Kubernetes journey or their distributed systems journey, I would recommend them to move that complexity to somebody else and start with something a little bit easier, like a stateless application.
There are a lot of good services that provide data as a service, right? RDS is great for running a stateful database; you can leverage it anytime, and there are other dedicated options too. I would point them there first if they don’t want to take on that complexity.
[00:30:51] D: I completely agree with that. An important thing I would add, which is in response to the stateful set piece here, is that as we’ve already described, managing a stateful application like a database does come with some complexity. So you should really carefully look at just what these different models provide you. Whether that model is making use of a stateful set, which provides you like ordinality, ensuring that things start up in a particular order and some of the other capabilities around that stuff.
But it won’t, for example, manage some of the complexity. A stateful set won’t, for example, try and issue a command to the new member to make sure that it’s part of an existing database cluster. It won’t manage that kind of stuff. So you have to really be careful about the different models that you’re evaluating when trying to think about how to manage a stateful application like a database.
That’s actually why the topic of an operator came up earlier: there are a lot of primitives within Kubernetes in general that provide you a lot of capability for managing things like stateful applications, but they may not entirely suit your needs. Because of the complexity of stateful applications, you have to be really careful about what you adopt and where you jump in.
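[EDITOR'S NOTE: The ordinality Duffie mentions comes from the StatefulSet API: each pod gets a stable identity (db-0, db-1, ...) and its own volume claim, but nothing here knows how to, say, join a new replica to an existing database cluster – that is the gap operators fill. A minimal sketch with hypothetical names, not a production configuration:]

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                          # hypothetical name
spec:
  serviceName: db                   # headless service giving pods stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: mysql
          image: mysql:8.0
          env:
            - name: MYSQL_ROOT_PASSWORD
              value: example        # demo only; use a Secret in practice
          ports:
            - containerPort: 3306
  volumeClaimTemplates:             # each pod gets its own persistent volume
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi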
[00:32:04] CC: Yeah. I know, just from working with Velero, which is a tool for doing backup, recovery and migration of Kubernetes clusters, that we back up volumes. So if you have something mounted on a volume, we can back that up. I know for a fact that people are using that to back up stateful workloads. We need to talk about workloads. But in any case, one thing – I think one of you mentioned it – is that you definitely also need to look at a backup and recovery strategy, which is ever more important if you’re doing stateful workloads.
[00:32:46] NL: That’s the only time it’s important. If you’re doing stateless, who cares?
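[EDITOR'S NOTE: Velero, mentioned above, expresses backups as custom resources, which is part of how it can do selective backups through the Kubernetes API. A minimal sketch of a Backup resource, assuming Velero is installed in the velero namespace; the backed-up namespace is hypothetical:]

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: app-backup          # hypothetical name
  namespace: velero
spec:
  includedNamespaces:
    - app                   # hypothetical namespace to back up
  snapshotVolumes: true     # also snapshot persistent volumes
  ttl: 720h0m0s             # keep this backup for 30 days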
[00:32:49] BL: Have we defined what a workload is?
[00:32:50] CC: Yeah. But let me say something first. I think we should do an episode on that – maybe, maybe not. We should do an episode on GitOps and related things, because even though things are stateless – I don’t want to get too into it – your cluster will change state. You can recreate it from a fresh version, but as it goes through a lifecycle, it will change state, and you might want to keep that state. I don’t know. I’m not the expert in that area. But let’s talk about workloads, Brian.
Okay, let me start talking about workloads. I never heard the term workload until I came into the cloud native world, which was about a year ago, when I started looking at this space more closely. Maybe a little bit before a year ago. It took me forever to understand what a workload was. Now I understand, especially after today – we were talking about it a little bit before we started recording. Let me hear from you all what it means to you.
[00:34:00] BL: This is one of those terms – and I’m sure, if you ask any ex-Googlers about this, they’ll probably agree – this is a Google term that we actually have zero context about why it’s a term. I’m sure we could ask somebody and they would tell us, but workloads, to me personally, are anything that ultimately creates a pod. Deployments create replica sets, which create pods. That whole thing is a workload. That’s how I look at it.
[00:34:29] CC: Before there were pods, were there workloads, or is a workload a new thing that came along with pods?
[00:34:35] BL: Once again, these words don’t make any sense to us, because they’re Google terms. I think that a pod is a part of a workload, like a deployment is a part of a workload, like a replica set is part of a workload. Workload is the term that encompasses an entire set of objects.
[00:34:52] D: I think of a workload as a subset of an application. When I think of an application or a set of microservices, I might think of each of the services that make up that entire application as a workload. I think of it that way because that’s generally how I would divide it up to Brian’s point into different deployment or different stateful sets or different – That sort of stuff. Thinking of them each as their own autonomous piece, and altogether they form an application. That’s my think of it.
[00:35:20] CC: To connect to what Brian said: a deployment will always run as pods, which is super confusing if you’re not looking at these things – just so people understand, because it took me forever to understand the connection between a workload, a deployment and a pod.
If you have a deployment that you’re going to ship to Kubernetes – I don’t know if ship is the right word – that you’re going to need to run on Kubernetes, that deployment needs to run somewhere, in some artifact, and that artifact is called a pod.
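[EDITOR'S NOTE: The chain described above starts from a manifest like this minimal sketch, with hypothetical names; applying it makes the deployment controller create a ReplicaSet, which in turn creates the pods.]

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                 # hypothetical name
spec:
  replicas: 2               # the ReplicaSet keeps 2 pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80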
[00:35:56] NL: Yeah. Going back to what Duffie said really quickly. A workload to me was always a process, kind of like not just a pod necessarily, but like whatever it is that if you’re like, “I just need to get this to run,” whatever that is. To me that was always a workload, but I think I’m wrong. I think I’m oversimplifying it. I’m just like whatever your process is.
[00:36:16] BL: Yeah. I would give you – The reason why I would not say that is because a pod can run multiple containers at once, which ergo is multiple processes. That’s why I say it that way.
[00:36:29] NL: Oh! You changed my mind.
[00:36:33] BL: The reason I bring this up – and this is probably a great idea for a future show – is all the jargon and terminology that we use in this land, which we just take as if everyone knows it, but we don’t all know it. That would be a great conversation to have. But the reason I always bring up the whole workload thing is because when we think about workloads – you can’t have state without workloads, really. I just wanted to make sure that we tied those two things together.
[00:36:58] CC: Why can you not have state without workloads? What does that mean?
[00:37:01] BL: Well, the reason you can’t have state without workloads is because something is going to have to create that state, whether that workload is running in or out a cluster. Something is going to have to create it. It just doesn’t come out of nowhere.
[00:37:11] CC: That goes back to what Nick was saying, that he thinks a workload is a process. Was that was you said, Nick?
[00:37:18] NL: It is, yeah, but I’m reneging on that.
[00:37:23] CC: At least I could see why you said that. Sorry, Brian. I cut you off.
[00:37:28] BL: What I was saying is a workload ultimately is one or more processes. It’s not just a process. It’s not a single process. It could be 10, it could be 1.
[00:37:39] JR: I have one final question, and we can bail on this and edit it out if it’s not a good one to end with. I hope it’s not too big, but I think maybe one thing we overlooked is just why it’s hard to run stateful workloads in these new systems like Kubernetes. We talked about how there’s more complexity and stuff, but there might be some room to talk about it: people have been spinning up an EC2 server, a server on the web, and running MySQL on it forever. Why, in the Kubernetes world of pods and things, is it a little bit harder to run, say, MySQL, just [inaudible 00:38:10]. Is that something worth diving into?
[00:38:13] NL: Yeah, I think so. I would say that applications like databases, particularly, are less resilient to outages. While Kubernetes – or most container orchestrators, but Kubernetes specifically – is dedicated to keeping your pods running continuously, it is still somewhat of a shifting landscape. You do have priority and preemption. If you don’t set those things up properly, or if there’s just a total failure of your system at large, your stateful application can go down at any time. Then how do you reconcile the outage and whatever data might have gotten lost? Those sorts of things become significantly more complicated in an environment like Kubernetes, where you don’t necessarily have access to a command line to run the commands to recover as easily. You might, but it’s not the same.
[00:39:01] BL: Yes. You’ve got to understand what databases do. Disk is slow, whether you have spinning disk or disk on a chip, like SSD. What databases do in a lot of cases is store things in memory, so if the database goes away, that data didn’t get stored. In other cases, databases have these huge transaction logs – maybe they write them out to files and then process the transaction log whenever they have CPU time. If a database dies suddenly, maybe its state is inconsistent because it had items in a queue that were to be processed but haven’t been processed yet. Now it doesn’t know what’s going on, which is why –
[00:39:39] NL: That’s interesting. I didn’t know that.
[00:39:40] BL: That’s why, if you kill MySQL – like kill mysqld with a -9 – it might not come back up.
[00:39:46] JR: Yeah. Going back to Kubernetes as an example, we are living in this newer world where things can get rescheduled and moved around and killed, and their IPs change, and so on. It seems like this environment is, should I say, more ephemeral, and those types of considerations are becoming more complex.
[00:40:04] NL: I think that really nails it. Yeah. I didn’t know that there were transaction logs in databases. I feel like I should have known that, but I just had no idea.
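[EDITOR'S NOTE: One small guard against the rescheduling churn described above is a PodDisruptionBudget, which limits how many replicas of an app voluntary disruptions, like node drains, may take down at once; it does not protect against crashes. A minimal sketch with hypothetical labels:]

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb              # hypothetical name
spec:
  minAvailable: 2           # always keep at least 2 database pods running
  selector:
    matchLabels:
      app: db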
[00:40:11] D: There’s one more part to the whole stateful, stateless thing that I think is important to cover, but I don’t know if we’ll be able to cover it entirely in the time that we have left, and that is from the network perspective. If you think about the types of connections coming into an application, we refer to some of those connections as stateful and stateless. I think that’s something we could tackle in our remaining time, or what’s everybody’s thought?
[00:40:33] JR: Why don’t you try giving us maybe a quick summary of it, Duffie, and then we can end on that.
[00:40:36] CC: Yeah. I think it’s a good idea to talk about network and then address that in the context of network. I’m just thinking an idea for an episode. But give us like a quick rundown.
[00:40:45] D: Sure. A lot of the kind of older monolithic applications, the way that you would scale these things is you would have multiple of them and then you would have some intelligence in the way that you’re routing connections down to those applications that would describe the ability to ensure that when Bob accesses a website and he authenticates, he’s going to authenticate to one specific instance of this application and the intelligence up in the frontend is going to handle the routing to make sure that Bob’s connection always comes back to that same instance. This is an older pattern.
It’s been around for a very long time, and it’s certainly the way that we first learned to scale applications before we decided to break things into microservices and handle a lot of this routing in a more resilient way. That was one of the early versions of how we do this, and it is a pretty good example of a stateful session, in that there is actually some state: perhaps Bob has authenticated and he has a cookie, so that when he comes back to that particular application, a lot of the settings – his browser settings, whether he’s using the dark theme or the light theme, that sort of stuff – are persisted on the server side rather than on the client side. That’s what I mean by stateful sessions.
Stateless sessions mean it doesn’t really matter that the user is terminating at the same endpoint, because we’ve managed to keep the state with the client. We’re handling state on the browser side of things rather than on the server side of things. So you’re not necessarily gaining anything by pushing that connection back to the same specific instance rather than just to a service that is more widely available.
There are lots of examples of this. I mean, Brian’s example of Google earlier. Obviously, when I come back to Google, there are some things I want it to remember. I want it to remember that I’m logged in as myself. I want it to remember that I’ve used a particular – I want it to remember my history. I want it to remember that kind of stuff so that I could go back and find things that I looked at before. There are a ton of examples of this when we think about it.
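[EDITOR'S NOTE: The sticky routing Duffie describes can be approximated in Kubernetes with session affinity on a Service, which keeps sending a given client to the same pod. A minimal sketch with hypothetical names:]

apiVersion: v1
kind: Service
metadata:
  name: web                   # hypothetical name
spec:
  selector:
    app: web
  sessionAffinity: ClientIP   # route a given client IP to the same pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800   # remember the mapping for 3 hours
  ports:
    - port: 80
      targetPort: 80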
[00:42:40] JR: Awesome! All right, everyone. Thank you for joining us in episode 6, Stateful and Stateless. Signing off. I’m Josh Rosso, and going across the line, thank you Nicholas Lane.
[00:42:54] NL: Thank you so much. This was really informative for me.
[00:42:56] JR: Carlisia Campos.
[00:42:57] CC: This was a great conversation. Bye, everybody.
[00:42:59] JR: Our new comer, Brian Liles.
[00:43:01] BL: Until next time.
[00:43:03] JR: And Duffie Cooley.
[00:43:05] D: Thank you so much, everybody.
[00:43:06] JR: Thanks all.
[00:43:07] CC: Bye!
[END OF EPISODE]
[0:50:00.3] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
In this episode of The Podlets Podcast, we are talking about the very important topic of recovery from a disaster! A disaster can take many forms, from errors in software and hardware to natural disasters and acts of God. That being said, there are better and worse ways of preparing for and preventing the inevitable problems that arise with your data. The message here is that issues will arise, but through careful precaution and the right kind of infrastructure, the damage to your business can be minimal. We discuss some of the different ways that people are backing things up to suit their individual needs, recovery time objectives and recovery point objectives, what high availability can offer your system, and more! The team offers a bunch of great safety tips to keep things from falling through the cracks, and we get into keeping things simple, avoiding too much mutation of infrastructure, and why testing your backups can make all the difference. We naturally look at this question with an added focus on Kubernetes and go through a few tools that are currently available. So for anyone wanting to ensure safe data and a safe business, this episode is for you!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
https://twitter.com/carlisia
https://twitter.com/bryanl
https://twitter.com/joshrosso
https://twitter.com/opowero
Key Points From This Episode:
• A little introduction to Olive and her background in engineering, architecture, and science.
• Disaster recovery strategies and the portion of customers who are prepared.
• What is a disaster? What is recovery? The fundamentals of the terms we are using.
• The physicality of disasters; replication of storage for recovery.
• The simplicity of recovery and keeping things manageable for safety.
• What high availability offers in terms of failsafes and disaster avoidance.
• Disaster recovery for Kubernetes; safety on declarative systems.
• The state of the infrastructure and its interaction with good and bad code.
• Mutating infrastructure and the complications in terms of recovery and recreation.
• Plug-ins and tools for Kubernetes such as Velero.
• Fire drills, testing backups and validating your data before a disaster!
• The future of backups and considering what disasters might look like.
Quotes:
“It is an exciting space, to see how different people are figuring out how to back up distributed systems in a reliable manner.” — @opowero [0:06:01]
“I can assure you, careers and fortunes have been made on helping people get this right!” — @bryanl [0:07:31]
“Things break all the time, it is how that affects you and how quickly you can recover.” — @opowero [0:23:57]
“We do everything through the Kubernetes API, that's one reason why we can do selective backups and restores.” — @carlisia [0:32:41]
Links Mentioned in Today’s Episode:
The Podlets — https://thepodlets.io/
The Podlets on Twitter — https://twitter.com/thepodlets
VMware — https://www.vmware.com/
Olive Power — https://uk.linkedin.com/in/olive-power-488870138
Kubernetes — https://kubernetes.io/
PostgreSQL — https://www.postgresql.org/
AWS — https://aws.amazon.com/
Azure — https://azure.microsoft.com/
Google Cloud — https://cloud.google.com/
Digital Ocean — https://www.digitalocean.com/
SoftLayer — https://www.ibm.com/cloud
Oracle — https://www.oracle.com/
HackIT — https://hackit.org.uk/
Red Hat — https://www.redhat.com/
Velero — https://blog.kubernauts.io/backup-and-restore-of-kubernetes-applications-using-heptios-velero-with-restic-and-rook-ceph-as-2e8df15b1487
CockroachDB — https://www.cockroachlabs.com/
Cloud Spanner — https://cloud.google.com/spanner/
Transcript:
EPISODE 08
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[00:00:41] CC: Hi, everybody. We are back. This is episode number 8. Today we have on the show myself, Carlisia Campos and Josh.
[00:00:51] JR: Hello, everyone.
[00:00:52] CC: That was Josh Rosso. And Olive Power.
[00:00:55] OP: Hello.
[00:00:57] CC: And also Bryan Liles.
[00:00:59] BL: Hello.
[00:00:59] CC: Olive, this is your first time, and I didn’t even give you a heads-up. But tell us a little bit about your background.
[00:01:06] OP: Yeah, sure. I’m based in the UK. I joined VMware as part of the Heptio acquisition; I had joined Heptio way back in October last year, and the acquisition happened pretty quickly for me. Before that, I was at Red Hat, working on some of their cloud management tooling and a bit of OpenShift as well.
Before that, I worked with HP and Fujitsu. I kind of work in enterprise management a lot, so things like desired state and automation are kind of things that have followed me around through most of my career. Coming in here to VMware, working in the cloud native applications business unit is kind of a good fit for me.
I’m a mom of two and I’m based in the UK, which I have to point out, currently undergoing a heat wave. We’ve had about like 3 weeks of 25 to 30 degrees, which is warm, very warm for us. Everybody is in a great mood.
[00:01:54] CC: You have a science background, right?
[00:01:57] OP: Yeah, I studied chemistry at university and then I went on to do a PhD in cancer research. I was trying to figure out ways we could predict how different people were going to respond to radiation treatments, with a view to tailoring everybody’s treatment to make it unique for them, rather than giving the same treatment to different people who present with the same disease but respond very, very differently. Yeah, that was really, really interesting.
[00:02:22] CC: What is your role at VMware?
[00:02:23] OP: I’m a cloud native architect. I help customers predominantly focus on their Kubernetes platforms and how to build them, either from scratch or helping them get more production-ready, depending on where they are in their Kubernetes journey. It’s been a really exciting part of being at Heptio and following through into the VMware acquisition. We get to speak to customers a lot at very exciting times for them – a lot of them are embarking on their Kubernetes journey. We’re with them from the start and every step of the way. That’s really rewarding and exciting.
[00:02:54] CC: Let me pick up on that thread, actually, because that is one thing that I love about this group – because I don’t get to do that. You all meet customers and you know what they are doing; you get that knowledge first-hand. What would you say is the percentage of the clients that you see that have a disaster recovery strategy – which, by the way, is the topic of today’s show?
[00:03:19] OP: I speak to customers a lot. As I mentioned earlier, a lot of them are in different stages of their journey in terms of automation, in terms of infrastructure as code, in terms of where they want to go for their next platform. But there is generally in the room a team that is responsible for backup and recovery, and that generally leads into the storage team, really, because you’re predominantly trying to back up state.
When we’re speaking to customers, we’ll have the automation people in the room, we’ll have the developers in the room and we’ll have the storage people in the room. Out of those three sorts of folks I’ve mentioned, the storage people are the ones that are primarily concerned about backup: how to back up their data, and how to restore it in a way that satisfies the SLAs, or the time to get your systems back online, in a timely manner. They are the folks most concerned with that.
[00:04:10] JR: I think it’s interesting, because it’s almost scary how many of our customers don’t actually have a disaster recovery strategy of any sort. I think it’s often just based on the maturity of the platform. For a lot of the applications, they’re worried about downtime, but it’s not necessarily going to devastate the business. I’m not trying to say that people don’t run mission critical apps on things like Kubernetes; it’s just that a lot of people are very new and they’re just kind of ramping up. It’s a really complicated thing that we work with our customers on, and there are so many layers to this – I’m sure layers that we’ll get into. There are things like disaster recovery of the actual platform.
If Kubernetes, as an example, goes down, getting it back up, backing up its data store that we call etcd. There’s obviously the applications’ disaster recovery. If a cluster of some sort goes down, be it Kubernetes or otherwise, shifting some CI system and redeploying that into some B cluster to bring it back up. Then to Olive’s point, what she said, it all comes back to storage. Yeah. I mean, that’s where it gets extremely complicated. Well, at least in my mind, it’s complicated for me, I should say.
When you’re thinking about, “Okay, I’m running this Postgres-as-a-service thing on this cluster,” it’s not that simple to just move the app from cluster A to cluster B anymore. I have to consider what do I do with the data? How do I make sure I don’t lose it? That’s a pretty complicated question to answer.
[00:05:32] OP: I think a lot of the storage providers, vendors playing in that storage space, are looking at novel ways to solve that and have adapted what was maybe slightly older thinking to new ways of interacting with a Kubernetes cluster, to provide that ongoing replication of data around different systems outside of Kubernetes, and then allowing it to be ported back in when a Kubernetes cluster, if we’re talking about Kubernetes in this instance as a platform, needs that data ported back in.
There are a lot of vendors playing in that space. It’s an exciting space, really, to see how different people are figuring out how to back up distributed systems in a reliable manner, because different people want different levels of backup. Because of the microservices nature of the cloud native architectures that we predominantly deal with, your application is not just one thing anymore. Certain parts of that application need to be recovered fairly quickly, and other parts don’t need to recover that quickly.
It’s all about functionality, ultimately, that your end customers or your end users see. Think of it visually, like a banking application, for example. If the customer is interacting with it and they can check their financial details and they can check the current status of their account, those are two different services. Maybe the actual service to transfer money into their account is down. It’s still a pretty functional system to the end user. But in the background, all those systems are in place to recover that transfer-of-money functionality, and it’s not detrimental to your business if that’s down for a little while.
There’ll be different SLAs and different objectives in terms of recovery, in terms of the amount of time that it takes for you to restore. All of that has to be factored into disaster recovery plans, and it’s up to the company, and we can help as much as possible, to figure out which features of the application and which features of your business need to conform to certain SLAs in terms of recovery, because different features will have different standards and different times in and around that space. It’s a complicated thing. It definitely is.
[00:07:29] BL: I want to take a step back and unpack this term, disaster recovery, because I can assure you, careers and fortunes have been made on helping people get this right. Before we get super deep into this, what’s a disaster and then what’s a recovery for that? Have you thought about that at a fundamental level?
[00:07:45] OP: For me, if we take it at face value, disasters could be physical ones or software-based ones. Physical ones can be earthquakes or flooding, fires, things like that, happening either in your region or fairly widespread across the area that you’re in. Or software: cyber attacks on your own internal systems, like your system has been compromised. That’s fairly local to you.
There are two different design strategies there. For a physical disaster, you have to have a recovery plan that is outside of that physical boundary, so that you can recover your system from somewhere that’s not affected by that physical disaster. For recovery in terms of software, in terms of your system having been compromised, the recovery from that is different.
I’m not an expert on cyber attacks and vulnerabilities, but companies trying to recover from that plan for it as much as possible. They take down their systems, try to get patches and fixes applied as quickly as possible, and spin the systems back up.
[00:08:49] BL: I’m understanding what you’re saying. I’m trying to unpack it for those of us listening who don’t really understand it. I’m going to go through what you said and we’ll unpack it a little bit. Physical, from my assumption, is we’re running workloads. Let’s say we’re just going to say in a cloud, not on-premise. We’re running workloads in, let’s say, AWS, and in the United States we can take care of local diversity by running in East and West regions.
Also, we can take care of local diversity by running in availability zones, because AWS guarantees that AZ1 and AZ3 have different network connections, are not in the same building, and things like that. Would you agree? Do you see that? I mean, this is for everyone out there. I’m going to go from super high-level down to more specific.
[00:09:39] OP: I personally wouldn’t argue that, except not everybody is on AWS.
[00:09:43] BL: Okay. AWS, or Azure, or Google Cloud, DigitalOcean, or SoftLayer, or Oracle, or Packet. If I thought about this, probably we could do 20 more.
[00:09:55] JR: IBM.
[00:09:56] BL: IBM. That’s why I said SoftLayer. They all practice this physical diversity. They all have different regions where you can deploy software, whether it be for data locality, but also for data protection. If you’re thinking about creating a plan for this, this would be something you could think about. Where does my data rest? What could happen to that data? A building could actually just fall over onto itself. All the hard drives are gone. What do I do?
[00:10:21] OP: You’re saying that replication is a form of backup?
[00:10:26] BL: I’m actually saying way more than that. Before you even think about things when it comes to disaster recovery, you got to define what a disaster is. Some applications can actually run out of multiple physical locations.
Let’s go back to my AWS example, because it’s everywhere and everyone understands how AWS works at a high level. Sometimes people are running things out of US-East-1 and US-West-2, and they can run both of the applications. The reason they can do that is because the individual transactions of whatever they’re doing don’t need to talk to one another. They can just serve websites out of both places.
To your point, though, now you have the issue where maybe you’re doing inventory management, because you have a large store and you’re running it out of multiple countries. You’re in the EU and you’re somewhere in APAC as well. What do you do about that?
Well, there are a couple of ways I could think about doing that. We could actually just have all the database connections go back to one single main service. Then what we could do with that main service is have it replicated in its local place, and then we can replicate it in a remote place too. If the local place goes down, at least you can point all the other sites back to the remote one. That’s the simplest way.
The reason I wanted to bring this up is because I don’t like acronyms all that much, but disaster recovery has two of my favorite ones, and they’re called RPO and RTO. Really, what it comes down to is you need to think about, when you have a disaster, no matter what that disaster is or how you define it, you have RTO. Basically, it’s the time that you can be down before there’s a huge issue. Then you have something called RPO, which, without going into all the names, is how far you can go since your last backup before you have business problems.
Just thinking about those things is how we should think about our backup and disaster recovery, and it’s all based on how your business works, or how your project works, and how long you can be down and how much data you can lose.
[00:12:27] CC: Which goes to what Olive was saying. Please spell out to us what RTO and RPO stand for.
[00:12:35] BL: I’m going to look them up real quick, because I literally pushed those acronym meanings out of my head. I just know what they mean.
[00:12:40] OP: I think it’s recovery time objective and recovery data objective.
[00:12:45] BL: Yeah. I don’t know what the P stands for, but it is for data.
[00:12:49] OP: Recovery.
[00:12:51] BL: It’s the recovery points. Yeah. That’s what it is. It is the recovery point objective, RPO; and recovery time objective, RTO. You could tell that I’ve spent a lot of time in enterprise, because we don’t even define words. The acronym means what it is. Do you know what the acronym stands for anymore?
[00:13:09] OP: How far back in terms of data can we go and still be okay, and how long can we be down, basically, until we’re okay? For example, an RPO of 15 minutes and an RTO of one hour would mean backups at least every 15 minutes and a restore path that finishes within the hour.
[00:13:17] CC: It is true though, as Josh was saying, some teams or companies or products, especially companies that are starting their cloud native journey, don’t have a backup, because there are many complicated things to deal with, and backup, I mean, the disaster recovery strategy, is super complicated. Doing that is not trivial.
But shouldn’t you start with that, precisely because it is so complex? It’s funny to me when people say they don’t have that kind of strategy. Although, maybe, just like Bryan said, utilizing regions and spreading out your data, that is a strategy in itself, and there’s more to it.
[00:14:00] JR: Yeah. I think I oversimplified too much. Disaster recovery could theoretically be anything, I suppose. Going back to what you were saying, Bryan, the recovery aspect of it. Recovery for some of the customers I work with is literally to stand up a brand-new cluster, whatever that cluster is, the cluster that is their platform, then redeploy all the applications on top of it.
That is a recovery strategy. It might not be the most elegant, and it might make assumptions about the apps that run on it, but it is a recovery strategy that is somewhat simple, simple to conceptualize and get started with.
I think a lot of the customers that I work with, when they’re first getting their bearings with a distributed system of sorts, are a lot more concerned about solving for high availability, which is what you just said, Carlisia, where we’re spreading across maybe multiple sites. There’s the notion of different parts of the world, but there’s also the idea of what I think Amazon has coined availability zones. Making sure if there is a disaster, you’re somewhat resilient to that disaster, like Bryan was saying with moving connections over and so on.
Then once we’ve done high-availability somewhat well, depending on the workloads that are running, we might try to get a more fancy recovery solution in place. One that’s not just rebuild everything and redeploy, because the downtime might not be acceptable.
[00:15:19] BL: I’m actually going to give some advice to all the people out there who might be listening to this and thinking about disaster recovery. First of all, all that complex stuff, that book you read, forget about it. Not because you don’t need to know. It’s because you should only think about what’s in scope at any given time.
When you’re starting an application, and let’s say, I’m actually making a huge assumption that you’re using someone else’s cloud, you’re using public cloud. When you’re in your own data center, there’s a different problem. Whenever you’re using public cloud, think about what you already have. All the major public clouds have durable object storage: many 9s of durability, and fewer 9s, but still a lot of 9s, of availability too. The canonical example there is S3. When you’re designing your applications and you know that you’re going to have disaster issues, realize that S3 is almost always going to be there, unless it’s 2017 and it goes down, or the other two failures that it had. Pretty much, it will be there. Think about how do I get that data into S3. I’m just saying, you can use it for storage. It’s fairly cheap for how much storage you get. You can make sure it’s encrypted, and using IAM, you can make sure that only people who have the right privileges can see it. The same goes with Azure, and the same goes with Google. That’s the first phase.
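As a rough illustration of that first phase, here is a minimal sketch of shipping a database dump to durable object storage with server-side encryption. The database name, bucket and paths are hypothetical, and the same idea applies to Azure Blob Storage or Google Cloud Storage:

```
# Dump a Postgres database and stream it straight into S3, encrypted at rest.
# "app" and "example-backups" are made-up names.
pg_dump app | gzip | aws s3 cp - "s3://example-backups/app/$(date +%F).sql.gz" \
  --sse AES256
```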
The second phase is that now you’re going to say, “Well, what about my relational database?” Once again, use your cloud provider. All the major cloud providers have great relational databases, and actually key-value stores as well. The neat thing about them is you can sometimes set them up to run across a whole region. You can set them up to do automated backups. At the minimum, you actually use your cloud provider for what it’s valuable for.
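In the same spirit, the managed-database route can be as simple as turning on the provider’s automated backups. A hedged AWS RDS example, with a hypothetical instance name:

```
# Enable automated daily backups on an RDS instance, retained for 7 days.
aws rds modify-db-instance \
  --db-instance-identifier app-db \
  --backup-retention-period 7 \
  --apply-immediately
```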
Now, if you’re not using a cloud provider and you’re doing it on-premise, I’m going to tell you, the simple answer is I hope you have a little bit of money, because you’re going to have to pay somebody, either one of your Kubernetes architects or somebody else, to do it. There’s no easy button for this kind of solution.
Just for this little mini-rant, I’m going to leave everyone with the biggest piece of advice, the best piece of advice that I can ever leave you if you’re running relational databases. If you are running a relational database, whether it be Postgres, MySQL, Aurora, have it replicated. But here’s the kicker: have another replica that you delay, and make it delay 10 minutes, 15 minutes, not much longer than that.
Because what’s going to happen, especially in a young company, especially if you’re using Rails or something like that: you’re going to have somebody who has access to production, because you’re a small company and you haven’t really federated this out yet, who’s going to drop your main database table. They’re just going to do it, and it’s going to happen, and you’re going to panic.
If you have that database going to a replica with a 10-minute delay, you have 10 minutes to figure it out before the world ends. If someone deletes the master database, you’re going to know pretty quickly, and you can just cut that replica out and pull the delayed one over. I’m not going to say where I learned this trick. We had to employ it multiple times, and it saved our butts multiple times. That’s my favorite thing to share.
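For reference, the delayed replica Bryan describes is a built-in feature of the common open-source databases. A minimal sketch, assuming replication is already configured; the 600 seconds is the 10-minute window:

```
# MySQL (5.6+): apply replicated changes 10 minutes behind the primary.
mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 600; START SLAVE;"

# PostgreSQL: the equivalent standby setting (recovery.conf on versions
# before 12, postgresql.conf afterwards).
echo "recovery_min_apply_delay = '10min'" >> "$PGDATA/postgresql.conf"
```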
[00:18:24] OP: Is that replica on a separate system?
[00:18:26] BL: It was on a separate system. I actually won’t say where, because it would be telling on who did it. Let’s just say that it was physically separate from the other one, in a different location as well.
[00:18:37] OP: I think we’ve all been there. We’ve all have deleted something that maybe –
[00:18:41] CC: I’m going to tell who did it. It was me.
[00:18:45] BL: Oh no! It definitely wasn’t me.
[00:18:46] OP: We mentioned HA. Does the panel think that there’s now a slightly inverse relationship between the amount of HA that you architect for versus the disaster recovery plan that you implement on the back of that? The more you architect around HA, the less you architect or plan for DR, without eliminating either of them?
[00:19:08] BL: I see it more. I mean, think about how it used to be 15 years ago.
[00:19:11] CC: Sorry. HA, we’re talking about high availability.
[00:19:15] BL: When you think about high availability: a lot of sites used to be hosted, and this is really before you had public cloud, on a web host, or people hosted them themselves. Even if you were a company who had a big cage at an Equinix or a Level 3, you probably didn’t have two facilities at two different Equinixes or Level 3s. You probably just had one big cage and diversity in the systems in there.
People had these huge tape backups and were very diligent about swapping their tapes out. One thing we did was lots of practice bringing this huge system back, because we assumed that the database would die and we would just spend a few hours, or days, bringing it back up. Now with high availability, we can architect systems where that is less of a problem, because we can run more things that manage our data. Then we can also do high availability on the backend, on the database side, too. We can do things like multi-writes and multi-reads. We can actually write our data in multiple places.
What we find when we do this is that the loss of a single database, or a slice of processing or web hosts, just means that our service is degraded, which means we don’t really have a disaster at this point, and we’re trying to avoid disasters.
[00:20:28] JR: I think on that point, the way I’ve always thought about it, and I’ll admit this is super oversimplified, is that successful high availability, or HA, can make your need to perform disaster recovery less likely. Can, maybe, right? It’s possible.
[00:20:45] BL: Also realize that not everybody is running in public cloud. In that case, well, you can still back your stuff up to public cloud even if you’re not running in public cloud. There are still people out there who are running big tape arrays, and I’ve seen them. I’ve seen tape arrays wider than the 80-inch table I’m sitting at, bigger than this table, with robotic arms that fetch the tapes, and you had to make sure that you got the tapes right for that particular day’s rotation.
I guess what I’m saying is that there is a balance. With HA, high availability, if you’re doing it in a truly highly available way, you can miss whole classes of disaster. But I’m not saying that you will not have disasters, because if that were the case, we wouldn’t be having this discussion right now.
I’d like to move the conversation just a little bit to more cloud native. If you’re running on Kubernetes, what should you think about for disaster recovery? What are the types of disasters we could have? How could we recover them?
[00:21:39] JR: Yeah. I think one thing that comes to mind: I was actually reading the Kubernetes Best Practices book last night, because I just got an O’Reilly membership. Awesome. Really cool book. One of the things they recommended early on, which I thought was a really good callout, is that since Kubernetes is a declarative system, where we write these manifests to describe the desired state of our application and how it should run, we should make sure to keep that declarative state in source control, just like we would our code, so that if something were to go wrong, it is somewhat more trivial to redeploy the application should we need to recover. That does assume we’re not worried about data and things like that, but it is a good callout, I think. I think the book made a good callout.
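The recovery path that recommendation enables is short enough to sketch; the repository name here is made up:

```
# If the desired state lives in git, redeploying after a disaster is mostly:
git clone https://example.com/acme/cluster-manifests.git
kubectl apply --recursive -f cluster-manifests/
```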
[00:22:22] OP: That’s the declarative system: being able to bring your systems back up exactly the way they were before in itself adds comfort to the whole notion that there could be a disaster. If there was one, we can spin things back up relatively quickly. That goes back to the days of automation. I came from Red Hat, so think Ansible.
We were trying to do infrastructure as code, being able to deploy, redeploy, redeploy in the same manner as the previous installation. I’ve been in this game a long time now, and I’ve spent a lot of time working with processes in and around building physical servers. That process would get handed over to lots of different teams. It was a huge thing to get one of these built and signed off, because it literally had to pass through the different teams to do their own different bits.
The idea was that you would get a language with the functionality that suited the needs of all those different teams. The storage team could automate their piece, which they were doing; they just weren’t interacting with any of the other teams. The network people would automate theirs, the application install people would do their bit, and the server OS people would do their bit.
Having a process that could tie those teams together in terms of a language, so Ansible, Puppet, Chef, those kinds of things, tried to unite those teams. They can all do their automation, but we have a tool that can take that code and run it as one system, end-to-end. At the end of that, you get an up-and-running system. If you run it again, you get a system exactly the same as the previous one. If you run it again, you get another one.
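That run-it-again-and-get-the-same-system property is the idempotency these tools advertise. A hypothetical Ansible invocation as a sketch:

```
# Build the system from code; running it a second time converges to the
# same state instead of producing something different.
ansible-playbook -i inventory/prod site.yml
ansible-playbook -i inventory/prod site.yml   # same result
```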
Reducing the time to build these things plays very importantly into this space. A disaster is only a disaster in terms of time, because things break all the time. What matters is how it affects you and how quickly you can recover. If you can recover in seconds or minutes and it hasn’t affected your business at all, then it wasn’t really a disaster.
The time it takes you to recover, to build your things back is key. All that automation and then leading on to Kubernetes, which is the next step, I think, this whole declarative, self-healing and implementing the desired state on a regular basis really plays well into this space.
[00:24:25] CC: That makes me think, and I don’t completely understand because I’m not out there architecting people’s systems; the one thing that I do is build this backup tool, which happens to be for Kubernetes, so I don’t completely get the limitations and use cases. But my question is: is it enough to have the declarations of how your infrastructure should be in source control? Because what if you’re running applications on the platform and your applications are interacting with the platform, changing the state of the platform? Is that not something that happens?
Of course, ideally, having those declarations in source control is a great backup, but don’t you also want to back up the changes to state as they keep happening?
[00:25:14] BL: Yeah, of course. That has been used for a long time. That’s how replication works. Literally, you take the change and you push it over the wire and it gets applied to the remote system. The problem is that there isn’t just one way to do this, because if you only do transaction-based backup, if you only capture the changes, you need a good base to start with, because you have to apply those changes to something. How do you get that piece? I’m not asking you to answer that. It’s just something to think about.
[00:25:44] JR: I think you’ve hit on a fatal flaw, Carlisia, where that simplified just-keep-it-in-source-control model kind of falls over. Having that declarative state stamped out, this ideal nature of the world for this deployment, in source control has benefits beyond just the disaster recovery scenario, right?
For stateless applications especially, like we talked about in the previous podcast, it can potentially be all you need, which is so great. Move your CI system over to cluster B. Boom! You’re back up and running. That’s really neat. A lot of the customers we work with, once we get them to that stage, then go, “Well, what about all these persistent volumes?”, which by the way is a Kubernetes term. What about all this data on disk that I don’t want to lose if I lose my cluster? That totally feeds into why tools like the one you work on are so helpful. I don’t know if now would be a good time, but maybe, Carlisia, you could expand on that tool and what it tries to solve for?
[00:26:41] CC: I want to back up a little though. Let’s put aside stateful workloads and volumes and databases. I was talking about the infrastructure itself, the state of the infrastructure. I mean, isn’t that common? I don’t know the answer to this. I might be completely off. Isn’t it common to develop a cloud native application that changes the state of the infrastructure, or is this something that’s not good to do?
[00:27:05] JR: It’s possible to write applications that can change infrastructure, but think about that. What happens when you have bad code? We all have bad code. So people like to separate those two things. You can still have infrastructure as code, but it’s separated from the application itself, and that’s just to protect your app people from your not-app people and vice versa.
A lot of that is being handled through systems that people are writing right now. You have Ansible from Red Hat, now IBM. You have things like HashiCorp and all the things that they’re doing. They have their hosted thing, they have their on-premise thing, they have their local thing. People are looking at that problem.
The good thing is that that problem hasn’t been solved. I guess good and bad at the same time, because it hasn’t been solved. So someone can solve it better. But the bad thing is that if we’re looking for good infrastructure as code software, that has not been solved yet.
[00:27:57] OP: I think if we’re talking about containerized applications, if there were systems that interacted with or affected or changed the infrastructure, they would be separate from the applications. As you were saying, Bryan, you just expanded a little bit [inaudible 00:28:11] containerized or sandboxed, processes that were running separate to the main application.
You’re separating out what’s actually running and doing the function in terms of the application versus systems that have to edit that infrastructure first before that main application runs. They’re two separate things. If you had to restore the infrastructure back to the way it was without rebuilding it, then perhaps have a system whereby if you have something editing the infrastructure, you always have something that can edit it back.
If you have a process that runs to stop something, you’d also have a process that starts something. If you’re trying to [inaudible 00:28:45] your applications, and they need to interact with other things, then that application design should include consideration of what do I need to do to interact with the infrastructure. If I’m doing something left-wise, I have to do the opposite and equal reaction right-wise to have an effectively clean application. That’s the kind of stuff I’ve seen, anyway.
[00:29:04] JR: I think it maybe even folds into a whole other topic that we could cover on another podcast, which is the notion of the concern of mutating infrastructure. If you have a ton of hands in those cookie jars and they’re changing things all over the place, you’re losing that potential single source of declarative truth, right? It just could become very complicated. I think, to the crux of your original point, Carlisia, and hopefully I’m not super off: if that is happening a lot, I think it could actually make recovery more complicated. Or maybe recovery is not the way to put it, but recreating the infrastructure, if that makes sense.
[00:29:36] BL: Your infrastructure should be deterministic, and that’s why I said you could. I know we talked about this before about having applications modify infrastructure. Think about that. Can and should are two different things. If you have it happen within your application due to input of any kind, then you’re no longer deterministic, unless you can figure out what that input is going to be. Be very careful about that.
That’s why people split infrastructure as code from their other code. You could still have CI, continuous integration and continuous delivery/deployment for both, but they’re on different pipelines with different release metrics and different monitoring and different validation to make sure they work correctly.
[00:30:18] OP: Application design plays a very important role now, especially in terms of cloud native architecture. We’re talking a lot about microservices. A lot of companies are looking to re-architect their applications. Maybe mistakes that were made in the past, or maybe not mistakes, that’s perhaps a strong word, but things that were allowed in the past are now superseded by best practices going forward. If we’re looking to be able to run things independently of each other, and by definition, applications independent of the infrastructure, that should be factored into the architecture of those applications going forward.
[00:30:50] CC: Josh asked me to talk a little bit about Velero. I will touch on it quickly. First of all, we’d love to have a whole show just about infrastructure as code and GitOps. Maybe that would be two episodes. Velero doesn’t do any backup of the infrastructure itself. It works at the Kubernetes level. We back up the Kubernetes clusters, including the volumes. If you have any sort of stateful app attached to a pod, that can get backed up as well.
If you want to restore that to even a different service provider than the one you backed up from, we have a restic plugin that you can use. It’s embedded in the Velero tool, so you can do that using this plugin. There are a few things that I find really cool about Velero. One, you can do selective backups, which we really, really don’t recommend; we recommend you always back up everything. But you can also do selective restores. If you don’t need to restore a whole cluster, why would you? You can just do parts of it.
It’s super simple to use. Why would you not have a backup? Because this is ridiculously simple. You do it through a command line, and we have a scheduler. You can just put your backups on a schedule and determine the expiration date of each backup. A lot of neat, simple features, and we are actively developing things all the time.
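Those features map to short commands. A hedged sketch of the workflow Carlisia describes, with made-up backup and namespace names:

```
# Back up everything (recommended) on a nightly schedule, expiring after 30 days.
velero schedule create nightly --schedule "0 3 * * *" --ttl 720h0m0s

# Restore selectively: just one namespace out of a full backup.
velero restore create --from-backup nightly-20191210030000 \
  --include-namespaces shop
```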
Velero is not the only one. It’d be fair to mention others, though I’m not super well versed on the tools out there; etcd itself has a backup tool. One thing to highlight is that we do everything through the Kubernetes API. That’s, for example, one reason why we can do selective backups or restores. Yes, you can back up etcd completely yourself, but you have to back up the whole thing. And if you’re on a managed service, you wouldn’t be able to do that, because you just wouldn’t have access.
There are also tools from storage vendors and service providers, besides backing up etcd itself. PX-Motion; I’m not sure what this is, I’m reading the documentation here. There is this K10 from [inaudible 00:33:13], and Kanister. I haven’t used any of these tools. [inaudible 00:33:16].
[00:33:17] OP: I just want to say, with Velero, the last customer I worked with wanted to use Velero in its capacity to back up a whole cluster and then restore that whole cluster on a different cloud provider, as you mentioned. They weren’t really using it as, well, they were using it as backup, but the primary function was that they wanted to populate the cluster as it was onto a brand-new cloud provider.
[00:33:38] CC: Yeah. It’s a migration. One thing that, like I said, Velero does is back up the cluster, all the Kubernetes objects. Why would we want to do that? Someone explain it to everybody who’s listening, including myself, because some people bring this up and say, “Well, I don’t need to back up the Kubernetes objects if all of that is declared and I have the declaration in source control. If something happens, I can just do it again.”
[00:34:10] BL: Untrue, because for any given Kubernetes object, there is the configuration that you created. Let’s say you’re creating a deployment: you need spec.replicas, you need the spec template, you need labels and selectors. But if you actually go and pull down that object afterwards, what you’ll see is that there are other things inside of that object. If you didn’t specify any replicas, you get the defaults, and other things that you should get defaults for.
You don’t want to have a lousy backup and restore, because then you get yourself into a place where, if I back this thing up and then restore it to a different cluster to actually test it out and see if it works, it will be different. Just keep that in mind when you’re doing that.
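One way to see the gap Bryan is pointing at is to compare what you wrote with what the API server actually stores once defaults are filled in; the deployment name here is hypothetical:

```
# The manifest you authored:
kubectl apply -f deployment.yaml

# What the cluster actually holds: defaulted fields (replicas, update strategy,
# termination grace periods, and so on) appear even if you never set them.
kubectl get deployment my-app -o yaml
```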
[00:34:51] JR: I think it just comes down to knowing exactly what Bryan just said, because there certainly are times, when I’m working with a customer, where the use case is so simple that at the notion of redeploying the application and potentially losing some of those fields that may have mutated over time, they just shrug and go, “Whatever.”
It is so awesome that tools like Velero and others are bridging that gap, and I think, to the point that Olive made, not only backing that stuff up and capturing its state as it was in the cluster, but providing us with a good way to section out one namespace or one group of applications and just move those over and so on. Yeah, it just comes down to knowing what exactly you’re going to have to solve for and how complex your solution should be.
[00:35:32] BL: Yeah. We’re getting towards the end, and I wanted to make sure that we talked about testing your backup, because that’s a popular thing here. People take backups. “I’ve done my backups,” whether I dump to S3, or I have Velero dumping to S3, or I have some other method; that is an invalid backup. It’s not valid until someone takes that backup, restores it somewhere and actually verifies that it works, because there’d be nothing worse than finding yourself in a situation where you need a backup, you’re in some kind of disaster, whether small or large, and you go to find out that, “Oh my gosh! We didn’t even back up the important thing.”
[00:36:11] CC: That is so true. I have only been in this backup world for a minute, but I mean, I’ve needed to back up things before. I don’t think I learned this concept only after coming here; I think I’ve always known it. It just became stronger in my mind, so I always tell people: if you haven’t done the restore, you don’t have a backup.
[00:36:29] JR: One thing I love to add on to that concept is having my customers run fire drills, if they’re open to it. Effectively, you have a list of potential terrible things that can happen, from losing a cluster to just losing an important component. And one person on the team, let’s say once a week or once a month, depending on their tolerance, just chooses something from that list and does it. Not in production, but does it. It gives you the opportunity to test everything end-to-end. Did your alerting fire off? When you restored, to your point, was the backup valid? Did the application come back online? There are a lot of semi-fun, using the word fun loosely there, ways that you can approach it, and it really is a good way to stress test.
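A fire drill of that kind can be scripted end to end. A rough sketch using Velero, with hypothetical names and a health endpoint standing in for real verification:

```
# Restore last night's backup into a scratch namespace and verify it.
velero restore create "drill-$(date +%F)" \
  --from-backup nightly-20191210030000 \
  --namespace-mappings prod:drill

kubectl -n drill get pods                      # did the workloads come back?
curl -f https://drill.example.com/healthz      # does the app actually answer?
```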
[00:37:09] BL: I do have one small follow-up on that. When you’re doing backups, no matter how you’re doing them, think about your strategy and how long to keep data, whether it’s due to regulation or just physical space, and it costs money. You don’t just back up yesterday and then back up over it again. Back up every day and keep the last 8 days, and then, like old school, actually keep a full backup around for a while, just in case, because you never know.
[00:37:37] CC: Good point too. Yeah. I think a lot of what we said goes to what – It was Olive I think who said it first. You have to understand your needs.
[00:37:46] OP: Yeah, just which bits have different varying degrees of importance in terms of application functionality for your end user. Which bits are absolutely critical and which bits can buy you a little bit more time to recover.
[00:37:58] CC: Yeah. That would definitely vary from product to product. As we get into this idea of ephemeral clusters and automation, and we get really good at automating things and bringing things back up, is it possible that we get to a point where we don’t even talk about disasters anymore? You just go bring up this cluster or this system, and does it even matter why [inaudible 00:38:25]? What I’m thinking is, in the past, a long, long time ago, or maybe not so long ago, when I was working with an application and there was a disaster, it was a disaster because it felt like a disaster. Somebody had to go in manually, find out what happened and what to fix, and fix it manually. It was complete chaos and stress.
Now things just keep rolling and are automated. Something goes down, you bring it back up. Do you know what I mean? It won’t matter why. Are we still going to talk about this in terms of it being a disaster? Does it even matter what caused it? Recovery from a disaster wouldn’t look any different than a planned update, for example.
[00:39:12] BL: I think we’re getting to a place, and I don’t know whether we’re 5 years away or 10 years away or 20 years away, where we won’t have the same class of disaster that we have now. Think about where we’ve come over the past 20 years. Over the past 20 years, we’ve basically gotten to where hardware in a rack is replaceable. I think about 1998, 1999 and 2000: we’d rack a whole bunch of servers, and each server would be special.
Now, at these scales, we don’t care about that anymore. When a server goes away, we have 50 more just like it. The reason we were able to do that across large platforms is because of Linux. Now with Kubernetes, if Kubernetes keeps on going in the same trajectory, we’re going to basically codify these patterns that makes hardware loss not a thing. We don’t really care if we lose a server. You have 50 more nodes that look just like it.
We’re going to start having software that is always available. Think about Google Spanner. Google Spanner is multi-location, it can lose nodes without losing data, and it’s relational as well. That’s what CockroachDB is about too, building on the Spanner ideas, and we’re getting to a place where this kind of technology is available to anyone. We’re going to see that we won’t have these kinds of disasters that we’re having now. I think what we’ll have are bigger distributed-systems problems, where we have timing issues and leader-election issues. But I think the older classes of disaster can be phased out, at least over the next computing generation.
[00:40:39] OP: It’s maybe more about how architectures are designed these days. Application designers and infrastructure architects in the container space, with Kubernetes orchestrating and maintaining your desired state, are thinking that things will fail, and that’s okay, because they will go back to the way they were before. The concept of something stopping mid-run is not so scary anymore, because it will get put back to its desired state.
Maybe you might need to investigate if it keeps stopping and starting and Kubernetes keeps bringing it back. The system is actually still fully functional in terms of end users. You as the operator might need to investigate why that’s so. But the actual endpoint is still that your application is still up and running. Things fail and it’s okay. That’s maybe a thing that’s changed from maybe 5 years ago, 10 years ago.
[00:41:25] CC: This is a great conversation. I want to thank everybody: Olive Power, Josh Rosso, Bryan Liles. I’m Carlisia Campos, signing off. Make sure to subscribe. This was Episode 8. We’ll be back next week. See you.
[END OF EPISODE]
[0:50:00.3] KN: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
Today on the show we have esteemed Kubernetes thought-leader, Kelsey Hightower, with us. We did not prepare a topic, as we know that Kelsey presents talks and features on podcasts regularly, so we thought it best to pick his brain and see where the conversation takes us. We end up covering a mixed bag of super interesting Kubernetes-related topics. Kelsey begins by telling us what he has been doing and shares with us his passion for learning in public and why he has chosen to follow this path. From there, we talk about the issue of how difficult many people still think Kubernetes is. We discover that while there is no doubting that it is complicated, at one point, Linux was the most complicated thing out there. Now, we install Linux servers without even batting an eyelid, and we think we can reach the same place with Kubernetes in the future if we shift our thinking! We also cover other topics such as APIs and the debates around them and common questions Kelsey gets, before finally ending with a brief discussion on KubeCon. From the attendance and excitement, we saw that this burgeoning community is simply growing and growing. Kelsey encourages us all to enjoy this spirited community and the innovation happening in this space before it simply becomes boring again. Tune in today!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Duffie Cooley
Bryan Liles
Michael Gasch
Key Points From This Episode:
Learn more about Kelsey Hightower, his background and why he teaches Kubernetes!
The purpose of Kelsey’s course, Kubernetes the Hard Way.
Why making the Kubernetes cluster disappear will change the way Kubernetes works.
There is a need for more ops-minded thinking for the current Kubernetes problems.
Find out why Prometheus is a good example of ops-thinking applied to a system.
An overview of the diverse ops skillsets that Kelsey has encountered.
Being ops-minded is just an end; you should be thinking about the next big thing!
Discover the kinds of questions Kelsey is most often asked and how he responds.
Some interesting thinking and developments in the backup space of Kubernetes.
Is it better to back up or to have replicas?
If the cost of losing data is very high, then backing up cannot be the best solution.
Debates around which instances are not the right ones to use Kubernetes in.
The Kubernetes API is the part everyone wants to use, but it comes with the cluster.
Why the Kubernetes API is only useful when building a platform.
Can the Kubernetes control theory be applied to software?
Protocols are often forgotten about when thinking about APIs.
Some insights into the interesting work Akihiro Suda is doing.
Learn whether Kubernetes can run on Edge or not.
Verizon: how they are changing the Edge game and what the future trajectory is.
The interesting dichotomy that Edge presents and what this means.
Insights into the way that KubeCon is run and why it’s structured the way it is.
How Spotify can teach us a lesson in learning new skills!
Quotes:
“The real question to come to mind: there is so much of that work that how are so few of us going to accomplish it unless we radically rethink how it will be done?” — @mauilion [0:06:49]
“If ops were to put more skin in the game earlier on, they would definitely be capable of building these systems. And maybe they even end up more mature as more operations people put ops-minded thinking into these problems.” — @kelseyhightower [0:04:37]
“If you’re in operations, you should have been trying to abstract away all of this stuff for the last 10 to 15 years.” — @kelseyhightower [0:12:03]
“What are you backing up and what do you hope to restore?” — @kelseyhightower [0:20:07]
“Istio is a protocol for thinking about service mesh, whereas Kubernetes provides the API for building such a protocol.” — @kelseyhightower [0:41:57]
“Go to sessions you know nothing about. Be confused on purpose.” — @kelseyhightower [0:51:58]
“Pay attention to the fundamentals. That’s the people stuff. Fundamentally, we’re just some people working on some stuff.” — @kelseyhightower [0:54:49]
Links Mentioned in Today’s Episode:
The Podlets on Twitter — https://twitter.com/thepodlets
Kelsey Hightower — https://twitter.com/kelseyhightower
Kelsey Hightower on GitHub — https://github.com/kelseyhightower
Interaction Protocols: It's All about Good Manners — https://www.infoq.com/presentations/history-protocols-distributed-systems
Akihiro Suda — https://twitter.com/_AkihiroSuda_
Carlisia Campos on LinkedIn — https://www.linkedin.com/in/carlisia/
Kubernetes — https://kubernetes.io/
Duffie Cooley on LinkedIn — https://www.linkedin.com/in/mauilion/
Bryan Liles on LinkedIn — https://www.linkedin.com/in/bryanliles/
KubeCon North America — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-north-america-2019/
Linux — https://www.linux.org/
Amazon Fargate — https://aws.amazon.com/fargate/
Go — https://golang.org/
Docker — https://www.docker.com/
Vagrant — https://www.vagrantup.com/
Prometheus — https://prometheus.io/
Kafka — https://kafka.apache.org/
OpenStack — https://www.openstack.org/
Verizon — https://www.verizonwireless.com/
Spotify — https://www.spotify.com/
Transcript:
EPISODE 7
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[INTERVIEW]
[00:00:41] CC: Hi, everybody. Welcome back to The Podlets, and today we have a special guest with us, Kelsey Hightower. A lot of people listening to us today will know Kelsey, but as usual, there are a lot of newcomers in this space. So Kelsey, please give us an introduction.
[00:01:00] KH: Yeah. So I consider myself a minimalist. So I want to keep this short. I work at Google, on Google Cloud stuff. I’ve been involved with the Kubernetes community for what? 3, 4, 5 years ever since it’s been out, and one main goal, learning in public and helping other people do the same.
[00:01:16] CC: There you go. You do have a repo on your GitHub that’s about learning Kubernetes the hard way. Are you still maintaining that?
[00:01:26] KH: Yeah, every six months or so. So Kubernetes the Hard Way, for those that don’t know, is a guide, a tutorial. You can copy and paste. It takes about three hours, and the whole goal of that guide was to teach people how to stand up a Kubernetes cluster from the ground up. So starting from scratch, 6 VMs, you install etcd, all the components, the nodes, and then you run a few test workloads so you can get a feel for Kubernetes.
The history behind that was when I first joined Google, we were all concerned about the adoption of such a complex system as Kubernetes, right? Docker Swarm was out at the time. A lot of people were using Mesos, and a lot of the feedback at that time was that Kubernetes is too complex. So Kubernetes the Hard Way was built on the idea that if people understood how it worked, just like they understand how Linux works, because that’s also complex, if people just saw how the moving pieces fit together, then they would complain less about the complexity and have a way to grasp it.
[00:02:30] DC: I’m back. This is Duffie Cooley. I’m back this week, and then we also have Michael and Bryan with us. So looking forward to this session talking through this stuff.
[00:02:40] CC: Yeah. Thank you for doing that. I totally forgot to introduce who else is in this show, and me, Carlisia. We didn’t plan what the topic is going to be today. I will take a wild guess, and we are going to touch on Kubernetes. I have so many questions for you, Kelsey. But first and foremost, why don’t you tell us what you would love to talk about?
One thing that I love about you is that every time I hear an interview of you, you’re always talking about something different, or you’re talking about the same thing in a different way. I love that about the way you speak. I know you offer to be on a lot of podcast shows, which is how we ended up here and I was thinking, “Oh my gosh! We’re going to talk about what everybody is going to talk about, but I know that’s not going to happen.” So feel free to get a conversation started, and we are VMware engineers here. So come at us with questions, but also what you would like to talk about on our show today.
[00:03:37] KH: Yeah. I mean, we’re all just coming straight off the heels of KubeCon, right? This big, 12,000-people get-together. We’re super excited about Kubernetes, and the re:Invent event, things are wrapping up there as well. When we start to think about Kubernetes and what’s going to happen, a lot of people saw Amazon jump in with Fargate for EKS, right? For those unfamiliar with that offering: over the years, all the cloud providers have been providing some hosted Kubernetes offering, the idea being that the cloud provider, just like we do with hypervisors and virtual machines, would provide this base infrastructure so you can focus on using Kubernetes.
You’ve seen this even flow down on-prem with VMware, right? VMware saying, “Hey, Kubernetes is going to be a part of this control plane, and you can use the Kubernetes API to manage virtual machines and containers on-prem.”
So at some point now, where do we go from here? There’s a big serverless movement, which is trying to eliminate infrastructure for all kinds of components, whether that’s compute, databases or storage. But even in the Kubernetes world, I think there’s an appetite, and we saw this with Fargate, to make the Kubernetes cluster disappear, right?
If we can make it disappear, then we can focus on building new platforms that extend the API or, hell, just using Kubernetes as-is without thinking about managing nodes, operating systems and autoscalers. I think that’s the topic I’m pretty interested in talking about, because that future means lots of things disappear, right? Programming languages and compilers made assembly disappear for a lot of developers. Assembly is still there. I think people get caught up on nothing going away. They’re right. Nothing goes away, but the number of people who have to interact with that thing is greatly reduced.
[00:05:21] BL: You know what, Kelsey? I’m going to have you get out of my brain, because that was the exact example that I was going to use. I was on a bus today and I was thinking about all the hubbub, about the whole Fargate EKS thing, and then I was thinking, “Well, Go, for example, can generate assembler and then it compiles that down.” No one complains about the length of the assembler that Go generates. Who cares? That’s how we should think about this problem. That’s a whole solvable problem. Let’s think about bigger things.
[00:05:51] KH: I think it’s because in operations we tend to identify ourselves as the people responsible for running the nodes. We’re the people responsible for tuning the API server. When someone says it’s going to go away, in ops, and you see this in some parts, right? In ops, some people focus a lot more on observability. They couldn’t care less about what machine something runs on. They’re still going to try to observe and tune it. You see this in SRE and various practices.
But a lot of people who came up in a world like I did, with a traditional ops background, you were the one that PXE-booted the server. You installed that Linux OS. You configured it with Puppet. When someone tells you, “We’re going to move on from that,” as if it’s a good thing, you’re going to be like, “Hold up. That’s my job.”
[00:06:36] DC: Definitely. We’ve touched on this topic a couple of different times on this show as well, and it definitely comes back to understanding that, in my opinion, it’s not about whether there will be work for people who are in operations, people who want to focus on that. The real question that comes to mind is: there is so much of that work that how are so few of us going to be able to accomplish it unless we radically rethink how it will be done? We’re vastly outnumbered. The number of people walking onto the internet for the first time every day is mind-boggling.
[00:07:08] KH: In the early days, we had this goal of abstracting or automating ourselves out of a job, and anyone that’s tried that a number of times knows that you’re always going to have something else to do. I think if we carry that to the infrastructure, I want to see the ops folks involved. I was very surprised that Docker didn’t come from operations folks. It came from the developer folks. Same thing for Vagrant, and the same thing for Kubernetes. These are developer-minded folks that wanted to tackle infrastructure problems.
I think if ops were to put more skin in the game earlier on, they would definitely be capable of building these systems, and maybe they even end up more mature as more operations people put ops-minded thinking into these problems.
[00:07:48] BL: Well, that’s exactly what we should do. Like you said, Kelsey, we will always have a job. Whenever we solve one problem, we can think about more interesting problems. We don’t think about Linux on servers anymore. We just put Linux on servers and we run it. We don’t think about the 15 years when it was a little rocky. That’s gone now. So think about what we did there and let’s do that again with what we’re doing now.
[00:08:12] KH: Yeah. I think the Prometheus community is a good example of operations-minded folks producing a system. When you meet the originators of Prometheus, they took a lot of their operational knowledge and built this metrics and monitoring standard that we all think about now when we talk about some levels of observability. I think that’s what happens when you have good operations people who take prior experience and knowledge and express it in code, as they can these days. This is the kind of system they produce, and it’s a very robust and extensible API, and I think you start to see a lot of adoption.
[00:08:44] BL: One more thing on Prometheus. Prometheus is six years old. Just think about that. It’s not done yet, and it’s just gotten better and better and better. We’ve got to give up our old thing so we can get better and better and better. That’s just what I want to add.
[00:08:58] MG: Kelsey, if you look at your own history of coming from ops, as I understood it, right? Now being one of the poster children in the Kubernetes world, you see the world changing to serverless, to higher abstractions, more complex systems on one hand, but then on the other side, we have ops. Looking beyond or outside the world of Silicon Valley into traditional ops, the traditional large enterprise, what do you think is the current maturity level of these ops people?
I don’t want to discriminate against anyone here. I’m just throwing this out as a question. Where do you think they need to go to keep up with these evolving, higher-level abstractions, where we don’t really care about the nitty-gritty details?
[00:09:39] KH: Yes. So this is a good, good question. I spend half of my time on the road. I probably spend time onsite with at least 100 customers a year, globally. I fly on a plane and visit them on their home turf, and you definitely meet people at various skill levels and areas of responsibility. I want to make sure that I’m clear about the areas of responsibility.
Sometimes you’re hired into an area of responsibility that’s below your skillset. Some people are hired to manage batch jobs or to translate files from XML to JSON. That really doesn’t say a lot about their skillset. It just talks about the area of responsibility. So shout out to all the people that are dealing with mainframes and having to deal with that kind of stuff.
But when you look at it, you have the opportunity to rise up to whatever level you want to be at in terms of your education. When we talk about this particular question, some people really do see themselves as operators, and there’s nothing wrong with that. Meaning, they can come in, they get a system and they turn the knobs. You give me a mainframe, I will tell you how to turn the knobs on that mainframe. You buy me a microwave, I’ll tell you how to pop popcorn. They’re not very interested in building a microwave. Maybe they have other things that are more important to them, and that is totally okay.
Then you have people who are always trying to push the boundaries. Before Kubernetes, if I think back to 10 years ago, maybe 8, when I was working in a traditional enterprise, kind of the ones you’re talking about or hinting at, the goal had always been to abstract away all of the stuff it takes to deploy an application the right way in a specific environment for that particular company. The way I managed to do it was to say, “Hey, look. We have a very complex change management process.” I worked in finance at that time.
So everything had to have a ticket no matter how good the automation was. So I decided to make JIRA, their ticketing system, the front door to do everything. You go to JIRA. There'd be a custom field that says, "Hey, here are all the RPMs that have been QA'd by the QA team. Here are all the available environments." You put those two fields in. That ticket goes to change management for approval, and then something behind the scenes automated everything. In that case it was Puppet, Red Hat, and VMware, right?
So I think what most people have been doing, if you're in the world of abstracting this stuff away and making it easier for the company to adapt, is that you've already been pushing these ideas that we now call serverless. I think the cloud providers put these labels on platforms to describe the contract between us and the consumer of the APIs that we present. But if you're in operations, you should have been trying to abstract away all of this stuff for the last 10 or 15 years.
[00:12:14] BL: I 100% agree. Then also, think about other verticals. 23 years ago, I did [inaudible 00:12:22] work. That was my job. But we learned how to program in C and C++ because we were on old Suns, not even SPARC machines. We were on the old Suns, and we wanted to write things in CDE and we wanted to write our own window managers. That is what we're doing right now, and that's why you see people like Mitchell Hashimoto with Vagrant, and you're seeing how we keep pushing this thing.
We have barely scratched the surface of what we're trying to do. For a lot of people who are just ops-minded, understand that being ops-minded is not the end. You have to be able to think outside of your boundaries so you can create the next big thing.
[00:12:58] KH: Or you may not care about creating the next big thing. There are parts of my life where I just don't care. For example, I pay Comcast to get internet access, and my ops involvement was going to Best Buy, buying a modem, screwing it into the wall, and troubleshooting the thing every once in a while when someone in the household complains the internet is down.
But that's as far as I'm ever going to push the internet boundaries, right? I am not really interested in pushing that forward. I'm assuming others will, and I think that's one thing in our industry where sometimes we believe that we all need to contribute to pushing things forward. Look, there's a lot of value in being a great operations person. Just be open to the idea that what we operate will change over time.
[00:13:45] DC: Yeah, that’s fair. Very fair. For me, personally, I definitely identify as an operations person. I don’t consider it my life’s goal to create new work necessarily, but to expand on the work that has been identified and to help people understand the value of it.
I find I sit between two roles personally. One is to help figure out all of the different edges and pieces and parts of Kubernetes, or some other thing in the ecosystem. The second is to educate others on those things, right? Take what I've learned and amplify it, having that amplifying effect.
[00:14:17] CC: One thing that I wanted to ask you, Kelsey, is – I work on the Velero project, and that does backup and recovery of Kubernetes clusters. Some people ask me, "Okay, so tell me about the people who are doing backups." I'm like, "I don't want to talk about that. That's boring. I want to talk about the people who are not doing backups, and why they should be doing them, or at least thinking about it."
Well, anyway. I wonder if you get a lot of questions in the area of Kubernetes operations, or cloud native in general, infrastructure, etc., where in the back of your mind you go, "That's the wrong question." Do you get that?
[00:14:54] KH: Yeah. So let's use your backup example. When I hear questions, at least it lets me know what people are thinking and where they're at, and if I ask enough questions, I can get a pulse on where the majority of people are trending. Let's take the backup question. When I hear people say, "I want to back up my Kubernetes cluster," I rewind the clock in my mind and say, "Wow! I remember when we used to back up Linux servers," because we didn't know what config files were on the disk. We didn't know what processes were running. So we used to take these ps snapshots and tar up the whole file system and store it somewhere so we could recover it.
Remember Norton Ghost? You take a machine and ghost it so you can make it again. Then we said, "You know what? That's a bad idea. What we should be doing is having a tool that can make any machine look the way we want it." That's config management, and it's boring now. So we don't back those machines up anymore.
So when I hear that question I say, "Hmm, what is happening in the community that's leading people to ask these questions?" Because if I hear a bunch of questions that already have good answers, that means those answers aren't visible enough and not enough people are sharing these ideas. That should be my next keynote. Maybe we need to make sure that other people know about it. Even though it's boring to me, it's not boring to the industry in general.
When I hear these questions, I use them to keep me up-to-date, keep me grounded. I hear stuff like, how many Kubernetes clusters should I have? I don't think there's a best practice around that answer. It depends on how your company segregates things, on how you understand Kubernetes, on the way you think about things.
But I know why they're asking that question. It's because Kubernetes presents itself as a solution to a much broader problem set than it really addresses. Kubernetes manages a group of machines, typically backed by IaaS APIs. If you have that, that's what it does. It doesn't do everything else. It doesn't tell you exactly how you should run your business. It doesn't tell you how you should compartmentalize your product teams. Those decisions you have to make independently, and once you do, you can serialize them into Kubernetes.
So that's the way I think about those questions when I hear them. Like, "Wow! Yeah, it is a crazy thing that you're still asking this question six years later. But now I know why you're asking it."
[00:17:08] CC: That is such a great take on this. Because, yes, in the area of backups, the people who are already doing backups, in my mind, they'd be doing them with Kubernetes or not. But let's talk about the people who are not doing backups. What motivates you to not do backups? Obviously, backups can be done in many different ways. But, yes.
[00:17:30] BL: So think about it this way. Some people don't exercise, because exercise is tough and it's hard, and it's easier to sit on the couch and eat a bag of potato chips than exercise. It's the same thing with backups. Backing up my Kubernetes cluster before Velero was so hard that I'd rather just invest brain cycles in figuring out how to make money. So that's where people are coming from when it comes to hard things like backups.
[00:17:52] KH: There's a trust element too, right? Because we don't know if the effort we're putting in is worth it. When people do unit testing, a lot of times unit testing can be seen as a proactive activity, where you write unit tests to catch bugs in the future. Some people only write unit tests when there's a problem. Meaning, "Wow! There are odd things in the database.
Maybe we should write a test that catches our code putting odd things in. Fix the code, and now the test passes." I think it's really about trusting that the investment is worth it. When you start to think about backups – I've seen people back up a lot of stuff, like every day or every couple of hours they're backing up their database, but they've never restored the database.
Then when you read their root cause analysis, it's like, "Everything was going fine until we tried to restore a 2 terabyte database over a 100 meg link. Yeah, we never exercised that part."
[00:18:43] CC: That is very true.
[00:18:44] DC: Another really fascinating thing to think about with the backup piece, especially in Kubernetes with Velero and stuff, is that we're so used to having the conversation around stateless applications and being able to ensure that you can redeploy in the case of a failure. You're not trying to get back to a known state the way a backup traditionally would. You're just trying to get back to a running state.
So there's a bit of a dichotomy there, I think, for most folks. Maybe they're not conceptualizing the need to deal with some of those stateful applications when they start thinking about how Velero fits into the puzzle, because they've been told over and over again, "This is about immutable infrastructure. This is about getting back to running. This is not about restoring some complex state." So it's kind of interesting.
[00:19:30] MG: I think part of this is also that for the stateful services, which are why we do backups in the first place, things have changed a lot lately, right? With those new databases, scale-out databases, cloud services, thinking about backup has also changed in the new world of being cloud native, which for most people is a new learning experience: how should I back up Kafka? It's replicated, but can I back it up? What about etcd and all those things? Slightly different things than backing up a SQL database or a more traditional system. So backup, I think, as systems become more complex, is still needed for [inaudible 00:20:06].
[00:20:06] KH: Yeah. The question is what are you backing up and what do you hope to restore? So, replication, global replication, like we do with cloud storage and S3: the goal is to give people 11 9s of reliability and replicate that data across almost as many geographies as you can. So it's almost like an active backup. You're always backing up and restoring as part of the system design, versus it being an explicit action. Some people would say the type of replication we do for object stores is much closer to actively restoring and backing up on a continuous basis versus a one-time checkpoint.
[00:20:41] BL: Yeah. Just a little bit of a note: you can back up two terabytes over a 100 meg link in about 44 and a half hours. So just putting it out there, it's possible. Just about two days. But you're right. When it comes to backups, especially for – let's say you're doing MySQL or Postgres.
These days, is it better to back it up, or is it better to have a replica right next to it, a 10-minute delayed replica right next to that, and then replicate to Europe or Asia? Then you're constantly querying the data that you're replicating. That's still a backup. What I'm saying here is that we can change the way we talk about it. Backups don't have to be as conventional as they used to be. There are definitely other ways to protect your data.
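Bryan's estimate checks out. Assuming an ideal 100-megabit-per-second link with no protocol overhead:

```latex
\frac{2\ \text{TB}}{100\ \text{Mb/s}}
  = \frac{2 \times 10^{12}\ \text{bytes} \times 8\ \text{bits/byte}}{10^{8}\ \text{bits/s}}
  = 160{,}000\ \text{s} \approx 44.4\ \text{hours}
```

Real-world throughput would be somewhat lower, so the two-day figure is a best case.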
[00:21:25] KH: Yeah. Also, I think the other part of the backup thing is, what is the price of data loss? When you take a backup, you're saying, "I'm willing to lose this much data between the last backup and the next." If that cost is too high, then backing up cannot be your primary mode of operation, because the cost of losing data is way too high. Then replication becomes a complementing factor in the whole discussion of backups versus real-time replication and shorter times to recovery.
I have a couple of questions. When should people not use Kubernetes? Do you know what I mean? I visit a lot of customers, I work with a lot of eng teams, and I am in the camp of Kubernetes is not for everything, right? That's a very obvious thing to say, but some people don't actually practice it that way. They're trying to jam more and more into Kubernetes. So I'd love to get your insights on where you see Kubernetes being the wrong direction for some folks or workloads.
[00:22:23] MG: I’m going to scratch this one from my question list to Kelsey.
[00:22:26] KH: I'll answer it too then. I'll answer it after you answer it.
[00:22:29] MG: Okay. Who wants to go first?
[00:22:30] BL: All right. I'll go first. There are cases when I'm writing a piece of software where I don't care about service discovery. I don't care about ingress. It's just software that needs to run. When I'm running it locally, I don't need it. If it's simple enough that I could basically throw it into a VM through a cloud-init script, I think that is actually lower friction than Kubernetes.
Now, I'm also a little bit jaded here, because I work for the dude who created Kubernetes, and I'm paid to create solutions for Kubernetes, but I'm also really pragmatic about it. It's all about effort for me. If I can do it faster with cloud-init, I will.
[00:23:13] DC: For my part – I have a couple of follow-on questions to this real quick. But I do think that if you're not actively trying to develop a distributed system, something where you're actually making use of the primitives that Kubernetes provides, then that already would be a red flag for me. If you're building a monolithic application, or if you're in that place where you're rapidly iterating on a SaaS product and you're just trying to get as many commits into this thing until it works, really rapidly prototyping the thing.
Maybe Kubernetes isn't the right fit, because although we've come a long way in improving the tools that allow for that iteration, I certainly wouldn't say that we're all the way there yet.
[00:23:53] BL: I would debate you on that, Duffie.
[00:23:55] DC: All right. Then the other part of it is, Kubernetes aside, I'm curious about the same question as it relates to containerization. Is containerization the right thing for everyone, or have we made that pronouncement, for example?
[00:24:08] KH: I'm going to jump in and answer this one, because I definitely think we need a way to transport applications in some way, right? We used to do it on floppy disks. We used to do it on [inaudible 00:24:18]. The container, I treat as a glorified [inaudible 00:24:23]. That's the way I've been seeing it for years. Registries store them. They replace [inaudible 00:24:28]. Great. Now we have a more universal packaging format that can handle the simple use cases, scratch containers where it's just your binary, and the more complex use cases where you have to compose multiple layers to get the output, right? RPM spec files used to do something very similar when you'd start to build those things in [inaudible 00:24:48]. "All right. We got that piece."
Do people really need them? The thing I get wary about is when people believe they have to have Kubernetes on their laptop to build an app that will eventually deploy to Kubernetes, right? If we took that thinking to the cloud, then everyone would be trying to install OpenStack on their laptop just to build an app. Does that even make sense? Does that make sense in that context? Because you don't need the entire cloud platform on your laptop to build an app that's going to take a request and respond. With Kubernetes, I guess because it's easier to put it on your laptop, people believe that it needs to be there.
So I think Kubernetes is overused, because people just don't quite understand what it does. I think there are cases where you don't use Kubernetes. Like, I need to read a file from a bucket. Someone uploaded an XML file and my app is going to translate it into JSON. That's it. In that case, this is where I think functions as a service, something like Cloud Run or even Heroku, makes a lot more sense to me, because the operational complexity is hidden within a provider and is linked almost like an SDK to the overall service, which is the object store, right?
The compute part, I don't want to make a big deal about, because it's only there to process the file that got uploaded, right? It's almost like a plug-in to an FTP server, if you will. Those are the cases where I start to see Kubernetes become less of a need, because why would I need a custom platform to do such an obvious operation?
[00:26:16] DC: But for those applications that require the primitives Kubernetes provides, service discovery, the ability to define ingress in a normal way – when you're actually starting to figure out how you're going to platform that application with regard to those primitives, I do see the argument for having Kubernetes locally, because you're going to be using those tools locally and remotely. You have some way of defining what that platforming requirement is.
[00:26:40] KH: So let me pull on that thread. If you have an app that depends on another app, typically we used to just have a command line flag that says, "This app is over there." Localhost when it's on my laptop; some DNS name when it's in the cluster. Or a config file can satisfy that need. The need for service discovery usually arises where you don't know where things are. But if you're literally on your laptop, you know where the things are. You don't really have that problem.
So when you bring that problem space to your laptop, I think you're actually making things worse. I've seen people depend on Kubernetes service discovery for the app to work. Meaning, they just assume they can call a thing by name, and they don't support IPs and ports.
They don't support anything else, because they say, "Oh, no, no, no. You'll always be running in Kubernetes." You know what's going to happen? In 5 or 10 years, we're going to be talking like, "Oh my God! Do you remember when you used to use Kubernetes? Man! That legacy thing. I built my whole career porting apps away from Kubernetes to the next thing." The number one thing we'll talk about is where people leaned too hard on service discovery, or people who built apps that talk to config maps directly. Why are you calling the Kubernetes API from your app? That's not a good design. I think we've got to be careful coupling ourselves too much to the infrastructure.
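As a rough sketch of the pattern Kelsey describes (all names here are hypothetical, not from the episode): the app takes its dependency's address as an ordinary flag, so locally you might run `app --backend=localhost:8080`, while the deployment passes an in-cluster DNS name through the very same flag:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app
        image: example.com/app:1.0   # hypothetical image
        args:
        # Same binary as on the laptop; it accepts any host:port,
        # so nothing in the code assumes Kubernetes service discovery.
        - --backend=backend.default.svc.cluster.local:8080
```

The app stays portable: Kubernetes fills in the name here, but a plain IP and port would work just as well.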
[00:27:58] MG: That's a fair point too. Two answers from my end to your question. One is, I just built an appliance, which basically tries to bring an AWS Lambda experience to the vSphere ecosystem. My approach is that an ops person who needs to do some one-off thing, like connecting this guy to another guy, I don't want them to have to learn Kubernetes for that. It should be as simple as writing a function.
So for that appliance, we had to decide how to build it, because it should be scalable, and we might have some function-as-a-service component running on there. So we looked around and we decided to put it on Kubernetes. We built the appliance as a traditional VM with Kubernetes on top. For me as a developer, it gave me a lot of capabilities, like the self-healing capabilities. But it's also a fair point, the one you raised, Kelsey, about how much we depend on, or write our applications to depend on, those auxiliary features from Kubernetes. Like self-healing and restarts, for example.
[00:28:55] KH: Well, in your case, you're building a platform. I would hate for you to tell me that you rebuilt a Kubernetes-like thing just for that appliance. In your case, it's a great use case. I think the problem that we have as platform builders is, what happens when things start leaking up to the user? You tell a user all they have to care about is functions. Then they get some error saying, "Oh! There's some Kubernetes security context that doesn't work." They're like, "What the hell is Kubernetes?" That leakage is the problem, and I think that's the part where we have to be careful. It will take time, but we can't start leaking the underlying platform and making the original goal untrue.
[00:29:31] MG: The point where I wanted to throw this question back is that these functions are written as simple scripts, whatever, and the operators put them in. They run on Kubernetes, but the operators don't know that they run on Kubernetes. So going back to your question, when should we not use Kubernetes? Is it me writing at a higher level of abstraction, like a function? I'm not using Kubernetes in the first sense, because I don't actually know I'm using it. But under the covers, I'm still using it. So it's kind of an answer and not an answer to your question.
[00:29:58] KH: I've seen these single-node appliances. There's only one node, right? They're only there to provide, like, email at a grocery store. You don't have a distributed system. Now, what people want is the Kubernetes API: the way it deploys things, the way it swaps out a running container for the next one. We want that Kubernetes API.
Today, the only way to get it is by essentially bringing up a whole Kubernetes cluster. I think the K3s project is trying to simplify that by re-implementing Kubernetes: no etcd, SQLite instead, a single binary that has everything. So when we start to say what Kubernetes is, there's the implementation, which is a big distributed system, and then there's the API. I think what's going to happen is, if you want the Kubernetes API, you're going to have many more choices of implementation that make better sense for the target platform.
So if you're building an appliance, you're going to look at K3s. If you're a cloud provider, you're probably going to look at something like what we see on GitHub, right? You're going to modify it and integrate it into your cloud platform.
[00:31:00] BL: Or maybe what happens with Kubernetes over the next few years is what happened with the Linux API. Firecracker and gVisor did this, and WSL did this: we can basically swap out Linux on the backend because we can just intercept the calls. Maybe that will happen with Kubernetes as well. So maybe Kubernetes becomes a standard, where there's the Kubernetes standard and then the Kubernetes implementation that we have right now. I don't even know about that one.
[00:31:30] KH: We're starting to see it, right? When you say, here is my pod – we can just look at Fargate for EKS as an example. When you give them a pod, their implementation is definitely different from what most people are thinking about running these days, right? One pod per VM, not using Virtual Kubelet. So they've taken that pod spec and tried to uphold its meaning. But the problem with that is you get leaks.
For example, they don't allow you to bind to a host port. Well, the pod spec says you can bind to a host port. Their implementation doesn't allow you to do it, and we see the same problem with gVisor. It doesn't implement all the system calls. You couldn't run the Docker daemon on top of gVisor. It wouldn't work. So I think we're fine as long as we don't leak, because when we leak, we start breaking stuff.
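For reference, this is the kind of field Kelsey means. hostPort is a legitimate part of the pod spec, but an implementation like Fargate may reject it, which is where the leak shows up (a minimal sketch; the names are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: host-port-demo               # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80
      hostPort: 8080                 # valid per the pod spec, but not every
                                     # implementation of the API honors it
```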
[00:32:17] BL: We're doing the same thing with Project Pacific here at VMware, where this concept of a pod is actually a virtual machine that boots in like a tenth of a second. It's pretty crazy how they've been able to figure that out. If we can get this right, that's huge for us. That means we can move beyond our appliance and we can create better things that actually work. If I'm VMware-specific, that works for me. If I'm on AWS and I want this namespace, I can use Fargate and EKS. That's actually a great idea.
[00:32:45] MG: I remember this presentation, Kelsey, that you gave, I think two or three years ago, where you took the Kubernetes architecture and removed the boxes until the only thing remaining was the API server. That's where it clicked for me, like, "This is right," because I had been focused on the scheduler. I wanted to understand the scheduler.
But then you zoomed out, or you stripped off all these pieces, and the only thing remaining was the API server. That's where it clicked for me. It's like [inaudible 00:33:09], or like the syscall interface. It's basically my API to do some crazy things that I would have had to write on my own, in assembly or something, before I could even get started. That was the breakthrough moment for me, that specific presentation.
[00:33:24] KH: I'm working on an analogy to talk about what's happening with the Kubernetes API, and I haven't refined it yet. When the web came out, we had all of these HTTP verbs: PUT, POST, GET. We have a body. We have headers. You can extract that out of the whole web, the web browser plus the web server. Once you extract that one piece out, then instead of building web pages, we can build APIs and GraphQL, because we can reuse many of those mechanisms, and we call that RESTful interfaces.
Kubernetes is going through the same evolution, right? The first thing we built was this container orchestration tool. But if you look at the CRDs, the way we do RBAC, the way we think about the status field in a custom object – if you extract those components out, you end up with these Kubernetes-style APIs where we start to treat infrastructure not as code, but as data. That will be the RESTful moment for Kubernetes, right?
With the web, we extracted it out and ended up with REST interfaces. With Kubernetes, once we extract it out, we'll end up with this declarative way of describing maybe any system. But right now, the natural, or perfect, match is infrastructure. Infrastructure as data, and using these CRDs to allow us to manipulate that data. So maybe you start with Helm, and then Helm gets piped into something like Kustomize. That then gets piped into an admission controller. That's how Kubernetes actually works, and that data-model approach to API development, I think, is going to be the unique thing that lasts longer than the Kubernetes container platform does.
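To make "infrastructure as data" concrete, here is a minimal sketch of a CRD; the group and kind are invented for illustration. The spec is pure data, with no conditionals anywhere, while all of the if/else logic lives in a controller that reports back through status:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com        # hypothetical group and kind
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Database
    plural: databases
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:                      # desired state: data only, no logic
            type: object
            properties:
              engine:
                type: string
              replicas:
                type: integer
          status:                    # observed state, written by the controller
            type: object
            properties:
              ready:
                type: boolean
```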
[00:34:56] CC: But if you're talking about – correct me if I misinterpreted it – platform as data, data to me is meant to be consumed. And I've actually been thinking, since you said developers should not be developing apps that connect directly to Kubernetes, or I think you said the Kubernetes API – I was thinking, "Wait. I've heard so many times people saying that's one great benefit of Kubernetes, that the apps have that access." Now, if you see my confusion, please clarify it.
[00:35:28] KH: Yeah. Right. I remember early on when we were doing config maps, there was a big debate about how config maps should be consumed by the average application. One way could be, let's just make a config map API and tell every developer that they need to import a Kubernetes library to call the API server, right? Now everybody's app doesn't work anymore on your laptop. So we were like, "Of course not."
What we should do is have config maps be injected into the file system. That's why you can actually describe a config map as a volume and say, "Take these key values from the config map, write them as normal files, and inject them into the container so you can just read them from the file system." The other option was environment variables. You can take a config map and translate it into environment variables, and lastly, you can take those environment variables and put them into command line flags.
So the whole point is that those are the three most popular ways of configuring an app: environment variables, command line flags, and files. Kubernetes molded itself into that world so that developers would never tightly couple themselves to the Kubernetes API.
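A small sketch of those three delivery paths (the names are made up): one ConfigMap surfacing as a file, as an environment variable, and from there as a command line flag, so the app itself never touches the Kubernetes API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                          # hypothetical names throughout
spec:
  containers:
  - name: app
    image: example.com/app:1.0
    env:
    - name: LOG_LEVEL                # key delivered as an environment variable
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: log-level
    args:
    - --log-level=$(LOG_LEVEL)       # env var re-used as a command line flag
    volumeMounts:
    - name: config
      mountPath: /etc/app            # keys delivered as plain files
  volumes:
  - name: config
    configMap:
      name: app-config
```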
Now, let's say you're building a platform, like a workflow engine like Argo, or a network control plane like Istio. Of course you should use the Kubernetes API. You're building a platform on top of a platform. That's the exception to the rule, if you're building a platform. But for a general application that's leveraging the platform, I really think you should stay away from the Kubernetes API directly. You shouldn't be making syscalls directly [inaudible 00:37:04] of your runtime, like the unsafe package in Go. Once you start doing that, Go can't really help you anymore. You start pinning yourself to specific threads. You're going to have a bad time.
[00:37:15] CC: Right. Okay. I think I get it. But you can still use Kubernetes to decouple your app from the machine by using objects to provide those dependencies.
[00:37:25] KH: Exactly. That was the whole benefit of Kube, and Docker even, saying, "You know what? Don't worry too much about cgroups and namespaces. Don't even try to do that yourself." Because remember, there was a period of time when people were actually trying to build cgroups and network namespaces into the runtime. There were a bunch of Ruby and Python projects trying to containerize themselves within the runtime. Whoa! What are we doing? Having that second layer now with containerd and runc, we don't have to implement that 10,000 times for every programming language.
[00:37:56] DC: One of the things I want to come back to is your point about the Kubernetes API being one of the more attractive parts of the project, and people needing that to move forward in some of these projects. I wonder if it's more abstract than that. I wonder if it's abstract enough to think about in terms of level-triggered versus edge-triggered stuff: taking the control theory that basically makes Kubernetes such a stable project and applying that to software architecture, rather than necessarily bringing the entire API with you. Perhaps what you should take from this is the lessons we've learned in developing Kubernetes, and apply those to your software.
[00:38:33] KH: Yeah. I had the fortune to spend some time with Mark Burgess. He came up with Promise Theory, and Promise Theory is the underpinning of Puppet, Chef, Ansible, CFEngine: this idea that we make promises about something and eventually converge to that state.
The problem was that with Puppet, Chef, and Ansible, we were basically doing this with shell scripts and Ruby. We were trying to write all of these if, and, else statements. When those didn't work, what did you do? You put an exec statement at the bottom and then you're like, "Oh! Just run some bash, and who knows what's going to happen?" In those early implementations of Promise Theory, we didn't own the resource we were making promises about. Anyone could go behind us and remove the user, or the user could have a different user ID on different systems but mean the same thing.
In the Kubernetes world, we push all of those if, else statements into the controller. We force the API not to have any code. That's the big difference. If you look at the Kubernetes API, you can't do if statements. In Terraform, you can do if statements, so you kind of fall into the imperative trap at the worst moments, when you're doing dry runs or something like that. It does a really good job of it, don't get me wrong.
So the Kubernetes API says, "You know what? We're going to go all-in on this idea." You have to change the controller first and then update the API. There are no escape hatches in the API. That forces a discipline that I think gets us closer to the promises, because we know that the controller owns everything. There's no way to escape in the API itself.
[00:40:07] DC: Exactly. That’s exactly what I was pushing for.
[00:40:09] MG: I have a somewhat related question, and I'm just not sure how to frame it correctly. Yesterday I saw a good talk by someone talking about protocols, the somewhat forgotten power of protocols in the world of APIs. We've got Swagger. We've got API definitions. But he made the very simple point that if I give you an open, a close, and a write and read method in an API, you still don't know how to call them in sequence, or which one to call first.
It's the same for the [inaudible 00:40:36] library, if you look at that. So I always have to ask myself, "Should I do anything [inaudible 00:40:40], or am I not leaking some stuff?" So I look it up. Whereas with protocols, if you look at the RFC definitions, they are very, very precise and very plainly outline what you should do, how you should behave, how you should communicate between these systems.
This is more about communication and less about the actual implementation of an API. I still have to go through that talk again, and I'm going to put it in the show notes. But it opened my mind again a little bit to think more about communication between systems and contracts and promises, as you said. Because we make so many assumptions in our code, especially as we have to write a lot of stuff very quickly, which I think will make things brittle over time.
[00:41:21] KH: So the gift and the curse of Kubernetes is that it tries to do both all the time. For some things, like a pod or a deployment, we can all feel that. If I give any Kubernetes cluster a deployment object, I'm going to get back a running pod. This is what we all believe. But the thing is, it may not necessarily run on the same kernel. It may not run on the same OS version. It may not even run on the same type of infrastructure, right? This is where I think Kubernetes ends up leaking some of those protocol promises. A deployment gets you a set of running pods.
But then we drop down to a point where you can actually build your own API and your own protocol. I think you're right. Istio is a protocol for thinking about service mesh, whereas Kubernetes provides the API for building such a protocol.
[00:42:03] MG: Yeah, good point. [inaudible 00:42:04].
[00:42:04] DC: On the Fargate stuff, there was a really interesting article – actually, an interesting project by [inaudible 00:42:10], and I want to give him a shout-out, because I thought it was really interesting. He wrote an admission controller that leverages the autoscaler, node affinity, and pod affinity to effectively do the same thing, so that whenever a new pod is created, it will spin up a new machine and associate only that pod with that machine. I was like, "What a fascinating project." But also just seeing this come up from the whole Fargate and ECS stuff. I was like –
[00:42:34] KH: I think that's the thread that Virtual Kubelet is pulling on, right? This idea that you can simplify autoscaling if you remove that layer. Because right now we're trying to do this musical-chairs dance, right? Like in a cloud: imagine if someone gave you the hypervisor and told you you're responsible for attaching hypervisor workers and the VMs. It would be a nightmare. Eventually, we're going to talk about autoscaling the way we do in the cloud.
I think Kubernetes is moving into a world of one pod per resource envelope. Today we call them VMs, but at some point we're going to drop the "VM" and just call it a resource envelope. VMs are the way we think about that today. Firecracker asks, "Hey, does it really need to be a complete VM?" and says, "No, it doesn't. It just needs to be a resource envelope that allows you to run that particular workload."
[00:43:20] DC: Yeah. It's the same thing we're doing here. It's just enough VM to get you to the point where you can drop those containers onto it.
[00:43:25] CC: Kelsey, question. Edge? Kubernetes on edge. Yes or no?
[00:43:29] KH: Again, compute on the edge has been a topic of discussion forever. The problem is, when some people say compute on the edge, they mean, like, go buy some servers from Dell and put them in some building as close to your property as you can. But then you have to go build the APIs to deploy to that edge.
What people want, and I don't know how far off it is, is this: Kubernetes has set the bar so high that the Kubernetes API comes with a way to load balance, attach storage, all of these things, by just writing a few YAML files. What I hear people saying is, "I want that as close to my data center or store as possible."
When people say Kubernetes on the edge, that's what they're saying: what we currently have at the edge is not enough. We've been providing edge for a very long time. OpenStack, remember OpenStack? "Oh! We're going to do OpenStack on the edge." But then you're a pseudo cloud provider without the APIs.
I think what Kubernetes brings to the table is that we have to have a default load balancer, a default block store, a default everything, in order for it to mean Kubernetes like it does today, centralized.
[00:44:31] BL: Well, stores have been doing this forever in some form or another. 20 years ago I worked for a duty-free place and literally traveled all over the world replacing point of sale. You might think of point of sale as a cash register. There was a computer in the back, and there were RS-232 links from the cash registers to the computer in the back. Then there was dial-up, or a [inaudible 00:44:53] line, to our central facility. We've been doing edge for a long time, but now we can really do edge.
The central facility can actually manage the compute infrastructure. All they care about now is basically CPU, memory, network, and storage, and it's a lot more flexible. The journey is long, but I think we're going to do it. It's going to happen, and I think we're almost there. People are definitely experimenting.
[00:45:16] KH: You know what, Carlisia? You know what's interesting now, though? I was watching the re:Invent announcements. Verizon is starting to allow these edge components to leverage 5G for the last mile, and that's a game-changer, because most people are very skeptical about 5G being able to provide the same coverage as 4G, because of the wavelength and point-to-point, all of these things.
But for edge, this thing is a game-changer. Higher bandwidth, but shorter distance. That's exactly what edge wants, right? Now you don't have to dig up the ground and run fiber from point to point. So if you combine these Kubernetes APIs with concepts like 5G and get that closer to people, yeah, I think that's going to change the way we think about regions and zones. That kind of goes away. We move closer to CDNs, like Cloudflare has been experimenting with their Workers technology.
[00:46:09] DC: On the edge stuff, I think there's also an interesting dichotomy happening, right? There's the definition of edge we referred to, which is the storage stuff, and the one you're alluding to, which is that there may be some way of actually having some edge capability in a point of presence, in a 5G tower or some point like that.
In some cases, edge means data gravity. You're taking a bunch of data from sensors and you're trying to store it in a place where you don't have to pay the cost of moving all of it from one point to another so you can centralize compute. In those edge cases, you're willing to invest in high-end compute to allow for the manipulation of that data where the data lake is, so that you can afford to move it into some centralized location later.
But I think that whole space is so complex right now, because there are so many different definitions and so many different levels of constraints that you have to solve for under one umbrella term, which is the edge.
[00:47:04] KH: I think Bryan was pulling on that with the POS stuff, right? Because instead of you going to buy your own cash register and gluing everything together, that whole space got so optimized that you can just buy a Square terminal, plug it into some Wi-Fi, and there you go, right? You now have that thing.
So once we start to do this for ML capabilities, security capabilities, I think you're going to see that POS-like thing expand, and that computer get a little bit more robust to do exactly what you're saying, right? Keep the data local. Maybe you ship models to that thing so that it can get smarter over time, and then upload the data from the various stores over time.
[00:47:40] DC: Yup.
[00:47:40] MG: One last question from my end, switching gears a bit, if you allow it. KubeCon. I left KubeCon with some mixed feelings this year. But my perspective is different, because I'm not the typical attendee, one of the 12,000 people, most of whom were newcomers actually. So I looked at them and I asked myself, "If I were new to this huge, big world of the CNCF and Kubernetes and all this stuff, what would I take from it?" I would be confused. Confused by everything from the [inaudible 00:48:10] talks, which make it sound like it's so complex to run all these things, to the keynotes, which seemed to be just a lineup of different projects that I'd all have to get through and install and run.
I was missing some perspective and some clarity at KubeCon this year, especially for newcomers. Because if we don't retain them, attract them, and maybe make them contributors, and that's another big problem, I'm afraid that we'll lose the base that is using Kubernetes.
[00:48:39] BL: Before Kelsey says anything, and Kelsey was a KubeCon co-chair before I was, but I was a KubeCon co-chair this time, and I can tell you exactly why everything is like it is. Well, fortunately and unfortunately, this cloud native community is huge now. There's lots of money. There are lots of people. There are lots of interests. If we went back to KubeCon when it was in San Francisco years ago, or even the first Seattle one, that was a community event. We could make the event for the community.
Now, there's the community, the people who are creating the products. There are the end users, the people who are consuming the products. And there are these big corporations and companies, the people who are actually financing this whole entire thing. We have to balance all three of those.
As a person who just wants to learn, what are you trying to learn? Are you learning from the consumption piece? Are you learning to be a vendor? Are you learning to be a contributor? We have to think about that. At a certain point, that's good for Kubernetes. That means we've been able to do the whole chasm thing. We've crossed the chasm. This thing is real. It's big. It's going to make a lot of people a lot of money one day.
But I do see the issue for the person who's trying to come in and say, "What do I do now?" Well, unfortunately, it's like anything else. Where do you start? You've got to take it all in. You need to figure out where you want to be. I'm not going to be the person who tells you, "Well, go join a SIG." That's not it.
What I want to tell you is that, like anything else we have to learn that's really hard, whether it's a programming language or a new technique, figure out where you want to be, and you're going to have to do some research. Then hopefully you can contribute. I'm sure Kelsey has opinions on this as well.
[00:50:19] KH: I think Bryan is right. I mean, I think there's a pyramid happening. At the very bottom, we're new. We need to get everybody together in one space, and it becomes more of a tradeshow, like an introduction, like a tasting, right?
When you're hungry, you go and just taste everything. Then when you figure out what you want, that becomes your focus, and that's going to change every year for a lot of people. Some people go from consumer to contributor, and they're going to want something different out of the conference. They're only going to want to go to the contributor day and maybe some of the deep-dive technical tracks. You're trying to serve everybody in two or three days, so you're going to have everything competing for your attention. I think what you've got to do is commit.
If you go and you're a contributor, or you're someone who's building on top, you may have to find a separate event to go with it, right? Someone told me, "Hey, when you go to all of these conferences, make sure you don't forget to invest in the one-on-one time." Me going to Oslo and spending an evening with Mark Burgess to really talk about Promise Theory, outside of competing for attention with the rest of the conference.
When I go, I like to meet new people, sit down with them. Out of the 12,000 people, I call it a win if I can meet three new people I've never met before. You know what? I'll do a follow-up hangout with them to go deeper in some areas. So I think it's more of a catch-all. It definitely has a tradeshow feel now, because it's big and there's a lot of money and opportunity involved.
But at the same time, you've got to know that, hey, you've got to go and seek things out. Take Spotify, right? Spotify has so much music, so many genres I've never heard of. But you've got to make your own playlist. You've got to decide what you want. And also, look, go to sessions you know nothing about. Be confused on purpose. I go to Spotify sometimes like, "You know what? What's happening in country music right now?" Country music top 10. You know what? I'll try it. I will literally try a new genre and just let it play. Then what you'll find is one song where you're like, "Oh! Hold on. Who is that?" It turns out there's one new artist whose music I actually like, and then I dive deep into that artist, maybe not the whole genre.
[00:52:25] CC: Yeah. It's like when you're in school and you go to class and you didn't do any of the homework, you didn't do any of the reading. You sit there like, "Hmm." I sympathize, empathize with the amount there is to sort through. But you can just slice it and sort through a piece of it; select. Like Kelsey is saying, and I agree with that. It is not easy. I don't know how it could be any easier, unless we knew you and fed it to you and there was a natural program for you to go through. But as far as the conference goes, just sitting there, listening to the words, acquiring the vocabulary, talking to people. Eventually it sinks in, and that works for me.
[00:53:03] BL: I'll say one last thing on this. If the market decides that a certain segment of users is not being taken care of, there'll be another conference, and I think you should go to that conference. The CNCF does not own all the brain space for cloud native and Kubernetes. So we'll let the market figure that out.
So you're right. We were 12,000 people two weeks ago in San Diego. I'm not sure how big Amsterdam will be, and I'm sure Boston will be that size. Maybe 15,000, maybe even more people. The community, or the ecosystem, will figure it out, like we've done with everything. We used to just have programming conferences; now we have conferences for all these languages. So we're growing, and it's a good thing, because now we can actually focus on the things we want to talk about.
[00:53:46] KH: Keep asking that question, though. You know what I mean? I mean that authenticity. When I don't see it, you've got to remind people of it. This year, when I decided what to talk about during my keynote, I decided to strip away the live demos, strip away a lot of the stuff, and just say, "Maybe I'll try something new," to try to bring it back down to my own kind of ground level.
We need everything. We need people pushing the bar. We need people reminding people. We need other people challenging what we're reminding people about. Because our glory days are probably painful to some other people, so we've got to be careful about how we glorify the past.
[00:54:23] CC: Yeah, and also find each other. Find people who are at the same level. Learning together makes things so much easier. I personally very much empathize, and one of the reasons for this show is to help people understand the whats and the whys.
Kelsey, we are very much at the end of our time together, but I want to give you an opportunity to offer any last questions or comments.
[00:54:48] KH: Yeah. I always end with: just pay attention to the fundamentals. That's the people stuff. Fundamentally, we're just people working on some stuff. Fundamentally, the technology is roughly the same. When I peel back my 10 years, the tech looks very similar to what it did 10 years ago, except the way we do it is slightly different. I think people should find a little bit of encouragement in that. This isn't overwhelmingly new. It's just a remapping of some of these ideas, and a formalization of these ideas, that we're all seeming to want to work on together for the very first time in a long time. So I think it's just a good time to be in this space. Enjoy it while we have it, because it could be boring again.
[00:55:26] CC: Yeah.
[00:55:27] BL: Yeah.
[00:55:28] DC: Yeah.
[00:55:29] CC: "Could be boring again." That's a great quote. All right, everybody. Kelsey, thank you so much for being with us. Everybody, thank you. All the hosts.
[00:55:37] BL: Thank you.
[00:55:38] MG: Thank you, Carlisia. [inaudible 00:55:38].
[00:55:40] CC: Yeah. My pleasure. I’m so grateful to be here, and we will be in your ears next week again. Bye.
[END OF INTERVIEW]
[00:55:46] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
For this special episode, we are joined by Joe Beda, who is currently Principal Engineer at VMware. He is also one of the founders of Kubernetes from his days at Google! We use this open table discussion to look at a bunch of exciting topics from Joe's past, present, and future. He shares some of the invaluable lessons he has learned and offers some great tips and concepts from his vast experience building platforms over the years. We also talk about personal things like stress management, avoiding burnout, and what is keeping him up at night with excitement and confusion! Large portions of the show are spent discussing different aspects of and questions about Kubernetes, including its relationship with etcd and Docker, its reputation as a very complex platform, and Joe's thoughts for those investing in the space. Joe opens up about some interesting new developments in the tech world, and his wide-ranging knowledge is so insightful and measured, you are not going to want to miss this! Join us today for this great episode!
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Special guest:
Joe Beda
Hosts:
Carlisia Campos
Bryan Liles
Michael Gasch
Key Points From This Episode:
A quick history of Joe and his work at Google on Kubernetes.
The one thing that Joe thinks sometimes gets lost in translation on these topics.
Lessons that Joe has learned in the different companies where he has worked.
How Joe manages mental stress and maintains enough energy for all his commitments.
Reflections on Kubernetes' relationship with and usage of etcd.
Is Kubernetes supposed to be complex? Why are people so divided about it?
Joe's experience as a platform builder and the most important lessons he has learned.
Thoughts for venture capitalists looking to invest in the Kubernetes space.
Joe's thoughts on a few different recent developments in the tech world.
The relationship between Kubernetes and Docker and possible ramifications of this.
The tech that is most exciting and alien to Joe at the moment!
Quotes:
“These things are all interrelated. At a certain point, the technology and the business and career and work-life – all those things really impact each other.” — @jbeda [0:03:41]
“I think one of the things that I enjoy is actually to be able to look at things from all those various different angles and try and find a good path forward.” — @jbeda [0:04:19]
“It turns out that as you bounced around the industry a little bit, there's actually probably more alike than there is different.” — @jbeda [0:06:16]
“What are the things that people can do now that they couldn't do pre-Kubernetes? Those are the things where we're going to see the explosion of growth.” — @jbeda [0:32:40]
“You can have the most beautiful technology, if you can't tell the human story about it, about what it does for folks, then nobody will care.” — @jbeda [0:33:27]
Links Mentioned in Today’s Episode:
The Podlets on Twitter — https://twitter.com/thepodlets
Kubernetes — https://kubernetes.io/
Joe Beda — https://www.linkedin.com/in/jbeda
Eighty Percent — https://www.eightypercent.net/
Heptio — https://heptio.cloud.vmware.com/
Craig McLuckie — https://techcrunch.com/2019/09/11/kubernetes-co-founder-craig-mcluckie-is-as-tired-of-talking-about-kubernetes-as-you-are/
Brendan Burns — https://thenewstack.io/kubernetes-co-creator-brendan-burns-on-what-comes-next/
Microsoft — https://www.microsoft.com
KubeCon — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/
re:Invent — https://reinvent.awsevents.com/
etcd — https://etcd.io/
CosmosDB — https://docs.microsoft.com/en-us/azure/cosmos-db/introduction
Rancher — https://rancher.com/
PostgresSQL — https://www.postgresql.org/
Linux — https://www.linux.org/
Babel — https://babeljs.io/
React — https://reactjs.org/
Hacker News — https://news.ycombinator.com/
BigTable — https://cloud.google.com/bigtable/
Cassandra — http://cassandra.apache.org/
MapReduce — https://www.ibm.com/analytics/hadoop/mapreduce
Hadoop — https://hadoop.apache.org/
Borg — https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/
Tesla — https://www.tesla.com/
Thomas Edison — https://www.biography.com/inventor/thomas-edison
Netscape — https://isp.netscape.com/
Internet Explorer — https://internet-explorer-9-vista-32.en.softonic.com/
Microsoft Office — https://www.office.com
VB — https://docs.microsoft.com/en-us/visualstudio/get-started/visual-basic/tutorial-console?view=vs-2019
Docker — https://www.docker.com/
Uber — https://www.uber.com
Lyft — https://www.lyft.com/
Airbnb — https://www.airbnb.com/
Chromebook — https://www.google.com/chromebook/
Harbour — https://harbour.github.io/
Demoscene — https://www.vice.com/en_us/article/j5wgp7/who-killed-the-american-demoscene-synchrony-demoparty
Transcript:
BONUS EPISODE 001
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.9] CC: Hi, everybody. Welcome back to The Podlets. We have a new name. This is our first episode with the new name. I don't want to go much into it, other than that we had to change from The Kubelets to The Podlets, because The Kubelets conflicts with an existing project and we thought it was just better to change. The show, the concept, the hosts, everything stays the same.
I am super excited today, because we have a special guest, Joe Beda, along with Bryan Liles and Michael Gasch. Joe, just give us a brief introduction. The other hosts have been on the show before, so people should know about them. Everybody should know about you too, but there are always newcomers in the space, so give us a little bit of background.
[0:01:29.4] JB: Yeah, sure. I'm Joe Beda. I was one of the founders of Kubernetes back when I was at Google, along with Craig McLuckie and Brendan Burns, with a bunch of other folks joining on soon after. I'm currently Principal Engineer at VMware, helping to cover all things Kubernetes and Tanzu related across the company.
I came into VMware via the acquisition of Heptio, where Bryan's wearing the shirt today. Left Google, did that with Craig for about two years. Then it's almost a full year here at VMware. We're at 11 months officially as of two days ago. Yeah, really excited to be here.
[0:02:12.0] CC: Yeah, I am so excited. Your name is Joe Beda. I always say Joe Beda.
[0:02:16.8] JB: You know what? It's four letters and it's easy – it's amazing how many different ways there are to pronounce it. I don't get picky about it.
[0:02:23.4] CC: Okay, cool. Well, today I learned. I am very excited about this show, because basically, I get to ask you anything I want.
[0:02:35.9] JB: I’ll do my best to answer.
[0:02:35.9] CC: Yeah. You can always not answer. There are so many interviews with you out there, on YouTube and podcasts. We are going to try to do something different. Let me fire off the first question I have for you. When people interview you, they ask the usual questions, the questions that are very useful for the community. What I want to ask you is this: what are people asking you that you think are the wrong questions?
[0:03:08.5] JB: I don't think there are any bad questions like this. There's a ton of interest when we're talking about technical stuff at different parts of the Kubernetes stack. I think there's a lot of business context around the container ecosystem and the companies, and around forming Heptio, all of that. A lot of times, I'll have discussions around career and what led me to where I'm at now. I think those are all really interesting things to talk about.
The one thing that I think doesn't always come across is that these things are all interrelated. At a certain point, the technology and the business and career and work-life – all those things really impact each other. I think it's a mistake to try and take these things in isolation. There's a ton of bleed-over. I think one of the things that we tried to do at Heptio, and I think we did a good job of, is recognizing that for anybody senior enough inside of any organization, they really have to be able to play all roles, right? At a certain point, everybody is a business person, fundamentally, in terms of actually moving the ball forward for the company, for the business as a whole. Yeah. I think one of the things that I enjoy is actually being able to look at things from all those various different angles and trying to find a good path forward.
[0:04:28.7] BL: All right. Taking that – so you've gone from big co to big co, to VC to small co to big co. What has that unique experience taught you, and what can you share with us?
[0:04:45.5] JB: Bryan, you know my resume better than I do apparently. I started my career at Microsoft and cut my teeth working on Internet Explorer and doing client side stuff there. I then went to Google in the office up here in Seattle. It was actually in Kirkland, this little hole-in-the-wall, temporary office, preemie work type of thing.
I’m thinking, “Hey, I want to do some server-side stuff.” Worked on Google Talk, worked on ads, worked on cloud, started Kubernetes, was a little burned out. Took some time off, goofed off. Did this entrepreneur-in-residence thing for a VC and then started Heptio and then sold it to VMware.
[0:05:23.7] JB: When you're in a big company, especially when you're more junior, it's easy to get caught up in playing the game inside of that company. When I say the game, what I mean is that there are measures of success within big companies and there are ways to advance, to seek approval, to see rewards that are all very specific to that company. I think the culture of a company is really defined by what the parameters are and what the success factors are for getting ahead inside of each of those different companies. I think a lot of times – especially, I was at Microsoft straight out of college, I did a couple internships at Microsoft and then joined – leaving Microsoft that first time was actually really, really difficult, because there is this fear of like, “Oh, my God. Everything's going to be super different.”
It turns out that as you bounced around the industry a little bit, there's actually probably more alike than there is different. The biggest difference I think between large company and small company is really, and I'll throw out some science analogies here. I think, oftentimes organizations are a little bit like the ideal gas law. Okay, maybe going past y'all, but this is – PV = nRT. Pressure times volume equals number of molecules times temperature and the R is a constant.
The idea here is that this is an equation where as you add more molecules to a constrained space, that will actually change the temperature and the pressure and these things all rise. What happens is inside of a large company, you end up with so many people within a constrained space in terms of the product space. When you add more people to the organization, or when you're looking to get ahead, it feels very zero-sum. It very much feels like, “Hey, for me to advance, somebody else has to lose.”
That's not how the real world works, but oftentimes that's how it feels inside of the big company, is that if it feels zero-sum like that. The liberating thing for being at a startup and I think why so many people get addicted to working at startups is that startups are fundamentally not zero-sum. Everybody succeeds and fails together. When a new person shows up, your thought process is naturally like, “Awesome, we got more cylinders in the engine. We’re going to go faster,” which is not always the case inside of a big company.
Now, I think as you get senior enough, all of a sudden these things change, because you're not just operating within the confines of that company. You're actually, again, playing a role in the business; you're looking at the ecosystem, you're looking at the community, you're looking at the competitive landscape, and that's where you have your eye on the ball. That's what defines success for you – not the internal company metrics, but really the business metrics.
The thing that I'm trying to do, here at VMware now is as we do Tanzu is make sure that we recognize the unbounded possibilities in front of us inside of this world, make sure that we actually focus our energy on serving customers. In doing so, out-compete others in the market. It's not a zero-sum game, it's not something where as we bring more folks on that we feel we're competing with them. That's a little rambling of an answer. I don't know if that links together for you, Bryan.
[0:08:41.8] BL: No, no. That was pretty good.
[0:08:44.1] JB: Thanks.
[0:08:46.6] MG: Joe, this is probably going to be a context switch now. You touched on the time when you went through the burnout phase. Then last week, I think you put out a tweet – there's so much stuff going on, you know which tweet I'm talking about. Yeah. In the Kubernetes community, you're a rock star. At VMware, you're already a rock star, being on stage at VMware, shaking hands with Pat.
I mean, there's so many people, so many e-mails, so many Slacks, whatever, that you get every day, but still I feel you are able to keep the balance, stay grounded, and always have a chat – even though sometimes I don't want to approach you, but sometimes I do when I have some crazy questions, maybe. Still, you're not pushing people away. How do you manage the mental stress and prevent another burnout? What is the secret sauce here? Because I feel I need to work on that.
[0:09:37.4] JB: Well, I mean, it's hard. The tweet that I put out was – last week I was coming back from Barcelona and tired of travel. I'm looking forward to – right now, we're recording this just before KubeCon. Then after KubeCon, I'm planning to go to re:Invent in Vegas, which is just a social denial-of-service. It's just overwhelming being in that. I was tired of traveling. I posted something and it came across a little stronger than I wanted it to – that I just hate people, right?
I was at that point where it's just you're traveling and you just don't want to deal with anybody and every little thing is really bugging you and annoying you. I think burnout is an interesting thing. For me and I think there's different causes for different folks. Number one is that it's always fascinating when you start a new job, your calendar is empty, your responsibilities are low. Then as you are successful and you integrate yourself into the organization, all of a sudden you find that you have more work than you have time to do.
Then you hit this point where you try and like, “I'm just going to keep doing it. I'm going to power through.” Then you finally hit this point where you're like, “This is just not humanly possible.” Then you go into a triage mode and then you have to decide what's important. I know that there's more to be done than I can do. I have to be very thoughtful about prioritizing what I'm doing. There's a lot of techniques that you can bring to bear there. Being explicit about what your goals are and what your priorities are, writing those things down, whether it's an OKR process, or whether it's just here are my top three things that I'm focusing on. Making sure that those things are purposefully meaningful to you, right?
Understanding the difference between urgent and important, which these are business booky type of things, but it's this idea of there are things that feel they have to get done right now and then there are things that are long-term important. If you're not thoughtful about how you do things, you spend all your time doing the urgent things, but you never get to the stuff that's the actually long-term important. That's a really easy trap to get yourself into. Finding ways to delegate to folks is really, really helpful here, in terms of empowering others, trusting them.
It's hard to let go sometimes, but I think being able to set the stage for other people to be successful is really empowering. Then just recognizing it's not all going to get done and that's okay. You can't hold yourself to that expectation. Now with respect to burnout: for me, the biggest driver for burnout in my career has been when I felt personal responsibility over something, but I haven't had the tools, or the authority, or the ability to impact it.
When you feel in your bones ownership over something, but yet you can't actually really own it, that is what causes burnout for me. I think there are studies talking about how the worst job is middle management. It's not being the CEO. It's not being new to the organization, being junior. It's actually being stuck in the middle. Because you're given a certain amount of responsibility, but you aren't always given the tools necessary to be able to drive that. Whereas the folks at the top oftentimes don't have those constraints, so they actually own stuff and have agency to be able to take care of it. I think when you're starting out more junior in the organization, the scope of ownership that you feel is relatively minor. That being stuck in the middle is the biggest driver for me for burnout. A big part of that is just recognizing that sometimes you have to take a step back and personally divest that feeling of ownership when really it's not yours to own.
I'll give you an example: I started Google Compute Engine at Google, which is arguably the foundational cloud service for GCP. As it grew, as it became more important to Google, as it got reorged, more and more of the leadership and responsibilities and decision-making – I'm up here in Seattle – moved down to Mountain View, to folks who had been in the cloud market, or been at Google for 10 or 15 years, coming in and saying, “Okay, that's cute. We got it from here,” right?
That was a case where it was my thing. I felt a lot of ownership over it. It was clear after a certain amount of time, hey, you know what? I just work here. I'm just doing my job and I do what I do, but really it’s these other folks that are driving the bus. That's a painful transition to actually go from that feeling of ownership to I just work here. That I think is one of the reasons why oftentimes, people leave the companies. I think that was one of the big drivers for why I ended up leaving Google, was that lack of agency to be able to impact things that I cared about quite a bit.
[0:13:59.8] CC: I think that's one reason why – well, I think that working in companies where things are moving fast, because they have a very clear, very worthwhile goal, provides you the opportunity to have so much work that you have to say no to a lot of things, like you were saying, and also to take ownership of pieces of that work, because there's more work to go around than people to do it.
For example, since Heptio and VM – okay, I'm plugging. This is a big plug for VMware, I guess, but it definitely is a place that's moving fast. It's not crazy. It's reasonable, because everybody, pretty much every one of us, is grown-up. There is so much to do and people are glad when you take ownership of things. That really, for me, is a big source of work satisfaction.
[0:14:51.2] JB: Yeah. I think it's that zero-sum versus positive-sum game. I think that when you – there's a lot more room for you to actually feel that ownership, have that agency, have that responsibility when you're in a positive-sum environment, versus a zero-sum environment.
[0:15:04.9] BL: All right, so now I want to ask your technical question.
[0:15:08.1] JB: All right.
[0:15:09.5] BL: Not a really hard one. It's more about how you think about this. Kubernetes is five, almost five and a half years old. One of the key components of Kubernetes is etcd. Knowing what we know now in 2019, would you have Kubernetes use etcd as its key store? Or would you have gone another direction?
[0:15:32.1] JB: I think etcd is a good fit. The truth of the matter is that we didn't give that decision as much thought as we probably should have early on. We saw that it was relatively easy to stand up and get going with. At least on paper, it had the qualities that we were looking for, so we started building with it and then just ran with it. Something like ZooKeeper was also something we could have taken, but the operational overhead at the time of ZooKeeper was very different from etcd.
I think we could have gone in the direction of them – and this is why [inaudible 0:15:58.5] for a lot of their tools – where they actually build the data store into the tool in a native way. I think that can lead in some ways to a simpler getting-started experience, because there's just one thing to boot up, but it's also more monolithic from a backup, maintenance, recovery type of thing. The one thing that I think we probably should have done there, in retrospect, is to try and create a little bit more of an arm's-length relationship between Kubernetes and etcd. In terms of having some cleaner interfaces, some more contracts and stuff, so that we could have actually swapped something else out.
There's folks that are doing it, so it's not impossible, but it's definitely not something that's easy to do, or well-supported. I think that that's probably the thing that I would change in that space. Another thing we might want to change: I think it might have been good to be more explicit about being able to actually shard things out, so that you could have multiple data stores for multiple resources and actually find a way to horizontally scale. Now we do that with events, because we were writing events into etcd and that's just a totally different stream of data, but everything else right now – I think there's room to do this in the future. I think we've been able to push etcd vertically up until now. There will come a time where we need to find ways to shard that thing horizontally.
[0:17:12.0] CC: Is it possible though to use a different data store than etcd for Kubernetes?
[0:17:18.4] JB: The things that I'm aware of here – and there may be more; I may not be a 100% up to date – is I do know that the Azure folks created a proxy layer that speaks the etcd protocol, but that is actually implemented on the backend using Cosmos DB. That approach was to essentially create a translation layer.
Then Rancher created this project which is a little bit of a fork of Kubernetes, where they're, I believe, using PostgreSQL as the database for Kubernetes. I haven't looked to see exactly how they ended up swapping that in. My guess is that there's some chewing gum and baling wire, and it's quite a bit of effort for each version upgrade to be able to actually adapt that moving forward. I don't know for sure. I haven't looked deeply.
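To make that concrete: for the sharding Joe mentions, the kube-apiserver already exposes an --etcd-servers-overrides flag that can route a single noisy resource, such as events, to a separate etcd cluster. For a full backend swap, the usual shape is the translation layer he describes – a server that speaks etcd's gRPC API but stores the data somewhere else – which is what Rancher's kine project does over SQL databases. Below is a deliberately minimal sketch of that shape in Go, using etcd's published protobuf interfaces and an in-memory map as a stand-in backend. A real shim would also need the Watch and Lease services and genuine Txn compare-and-swap semantics, all of which Kubernetes depends on.

```go
package main

import (
	"context"
	"log"
	"net"
	"sync"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	"go.etcd.io/etcd/api/v3/mvccpb"
	"google.golang.org/grpc"
)

// shim serves etcd's KV API from a plain map. Swap the map for
// PostgreSQL, Cosmos DB, or anything else and the client can't tell.
type shim struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func (s *shim) Range(ctx context.Context, r *pb.RangeRequest) (*pb.RangeResponse, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	resp := &pb.RangeResponse{Header: &pb.ResponseHeader{}}
	// Single-key lookup only; a real shim must also handle RangeEnd
	// (prefix scans) and revision semantics.
	if v, ok := s.data[string(r.Key)]; ok {
		resp.Kvs = []*mvccpb.KeyValue{{Key: r.Key, Value: v}}
		resp.Count = 1
	}
	return resp, nil
}

func (s *shim) Put(ctx context.Context, r *pb.PutRequest) (*pb.PutResponse, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[string(r.Key)] = r.Value
	return &pb.PutResponse{Header: &pb.ResponseHeader{}}, nil
}

func (s *shim) DeleteRange(ctx context.Context, r *pb.DeleteRangeRequest) (*pb.DeleteRangeResponse, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	var n int64
	if _, ok := s.data[string(r.Key)]; ok {
		delete(s.data, string(r.Key))
		n = 1
	}
	return &pb.DeleteRangeResponse{Header: &pb.ResponseHeader{}, Deleted: n}, nil
}

// Kubernetes issues a Txn compare-and-swap on every object update;
// stubbing it out like this is exactly where a toy shim stops working.
func (s *shim) Txn(ctx context.Context, r *pb.TxnRequest) (*pb.TxnResponse, error) {
	return &pb.TxnResponse{Header: &pb.ResponseHeader{}}, nil
}

func (s *shim) Compact(ctx context.Context, r *pb.CompactionRequest) (*pb.CompactionResponse, error) {
	return &pb.CompactionResponse{Header: &pb.ResponseHeader{}}, nil
}

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:2379")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	pb.RegisterKVServer(srv, &shim{data: map[string][]byte{}})
	log.Fatal(srv.Serve(lis))
}
```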
[0:18:06.0] CC: Okay. Now I would love to philosophize a little bit, or maybe a lot about Kubernetes. In the spirit of thinking of different questions to ask, so I had a bunch of questions and then I was thinking, “How could I ask this question in a different way?” Maybe this is not the right “question.” Here is the way I came up with this question.
We’re so divided out there. One camp loves Kubernetes, another camp, "So hard, so complicated, it’s so complex. Why even bother with it? I don't understand why people are using this." Basically, there is that sentiment that Kubernetes is complicated. I don't think anybody would refute that. Now is that even the right way to talk about Kubernetes? Is it even not supposed to be complicated? I mean, what kind of a tool is it that we are thinking, it should just work, it should be just be super simple. Is it true that it should be a super simple tool to use?
[0:19:09.4] JB: I mean, that's a loaded question [inaudible]. Let me just first say that number one, if people are complaining, I mean, I'm stealing this from Tim [inaudible], who I think this is the way he takes some of these things in stride. If people are complaining, then you're relevant, right? If nobody is complaining, then nobody cares about what you're doing. I think that it's a good thing that folks are taking a critical look at Kubernetes. That means that they're taking a look at it, right?
Five years in, Kubernetes is on an upswing. That's not necessarily going to last forever. I think we have work to do to continually earn Kubernetes's place in the technology stack over time. Now that being said, Kubernetes is a super, super flexible tool. It can do so many things in so many different situations. It's used for everything from retail – across tens of thousands of stores, those kinds of solutions. People are looking at it for telco, 5G. People are looking at even running it inside cars, which scares me, right? Then all the way up to folks at CERN using it to do data analytics for high-energy physics, right?
The technology that I look at that's probably most comparable is something like Linux. Linux is actually scalable everywhere from a phone all the way up to an IBM mainframe, but it's not easy, right? I mean, to be able to adapt it across all those things, you have to essentially download the kernel, type make config, and then answer 5,000 questions, right – for those who haven't done that. It's not an easy thing to do.
I think that a lot of times, people might be looking at Kubernetes at the wrong level to be able to say this should be simple. Nobody looks at the Linux kernel that you get from git cloning the kernel source and compiling it and says, “Yeah, this is too hard.” Of course it's hard. It's the Linux kernel. You expect that you're going to have a curated experience if you want something easy, right? Whether that be an Android phone or Ubuntu or what have you.
I think to some degree, we're still in the early days, where people are dealing with it perhaps at too raw a level, versus actually dealing with it in a more opinionated way. Now I think the fascinating thing for Kubernetes is that it provides a lot of the extension points and patterns, so that we don't know exactly what those higher-level, easier-to-use abstractions are going to look like, but we know, or at least we're pretty confident, that we have the right tools and the right environment to be able to experiment our way there. I think we're not there yet, but we're set up for success. That's the first thing.
The second thing is that Kubernetes introduces a whole bunch of different concepts and ideas and these things are different and uncomfortable for folks. It's hard to learn new things. It's hard for me to learn new things and it's hard for everybody to learn new things. When you compare Kubernetes to say, getting started with the modern front-end web development stack, with things like Babel and React and how do you deploy this and what are all these different options and it changes on a weekly basis. There's a hell of a lot in common actually between these two ecosystems. They're both really hard, they both introduce all these new concepts and you have to be embedded in it to really get it.
Now that being said, if you just want to take raw JavaScript, or jQuery, and have at it, you can do it, and you'll see Hacker News articles every once in a while where people are like, “Hey, I've programmed my site with jQuery and it's just fine. I don't need all this new stuff,” right? Just like you'll see folks saying, “I just SSH'd in and actually ran some stuff and it works fine. I don't need all this Kubernetes stuff.” If that works for you, that's great. Kubernetes doesn't have to solve every problem for every person.
Then the next thing is that I think that there's a lot of people who've been solving these problems again and again and again and again, but they've been solving them in their own way. It's not uncommon, when you look at back-end systems, to join a company, look at what they've built, and find that it's a complicated, bespoke system of chewing gum and baling wire, with maybe a little bit of Ansible, maybe a little bit of Puppet and Bash. Everybody has built their own complex, overwrought system to do a lot of the stuff that Kubernetes does.
I think one of the values that we see here is that these things are complex – everybody's version is uniquely complex – but shared complexity is more valuable than personal complexity. If we can agree on some of these concepts, then that's something that can be leveraged widely and it will fade to the background over time, versus having everybody invent their own complex system every time they need to solve these problems.
With that all said, we got a ton of work to do. It's not like we're done here and I'm not going to actually sit here and say Kubernetes is easy, or that every complex thing is absolutely necessary and that we can't find ways to simplify it. We clearly can. I just think that when folks say, “Hey, I just want this to be easy." I think they're being a little bit too naïve, because it's a very difficult problem domain.
[0:23:51.9] BL: I'd like to add on to that. I think about this a lot as well. Something that Joe said to me a few years back – that Kubernetes is the platform for creating platforms – is very applicable here. As an industry, we need to stop looking at Kubernetes as the destination. Your destination is really running the applications that give you pleasure, or make your business money. Kubernetes is a tool to enable us to think about our applications more, rather than the underlying ecosystem. We don't think about servers. We don't want to think about storage and networking, or even things like finding things in your cluster. You don't think about that. Kubernetes gives it to you.
If we start thinking about Kubernetes as a way to enable us to do better things, we can go back to what Joe said about Linux. Back whenever I started using Linux in the mid-90s, guess what? We compiled it – make bzImage. That stuff was hard and it was slow. Now think about this: in my office I have three different Linux distributions running. You know what? I don't even think about it anymore.
I don't think about configuring X. I don't think about anything. One thing that's going to happen as Kubernetes grows is we're going to figure out these problems, and that's going to allow us to think about these other crazy things, which is going to push the industry further. Think maybe 20 years from now: if we're still running Kubernetes, who cares? It's just going to be there. We're going to think about some other problem, and it could be amazing. These are good times.
[0:25:18.2] JB: At one point – sorry, the dog's going to bark here. I mean, at one point people cared about the BIOS that they were running on their computers, right? That was something that you stressed out about. I mean, back in the bad old days when I was doing DOS gaming you'd be like, “Oh, well this BIOS is incompatible with the –” IRQs and all that. It's just background now.
[0:25:36.7] CC: Yeah, I think about this too as a developer. I might have mentioned this before in this podcast. I have never gone from one job to another job and had to use the same deployment system. Every single job I've ever had, the deployment system is completely different, completely different set of tooling and completely different process.
Just being able to walk from one job to another job and be able to use the same platform for deployment – it must be amazing. On the flip side, being able to hire people who join your organization already knowing how your deployment works, that has value in itself. It's a huge value that I don't think people talk about enough.
[0:26:25.5] JB: Well honestly, this was one of the motivations for creating Kubernetes, is that I looked around Google early on and Google is really good at importing open source, circa 2000, right?
This is like, “Hey, you want to use libpng, or you want to use this library, or whatever.” That was the type of open source that Google was really, really good at using. Then Google did things like, say, release the Bigtable paper. Then somebody went through and created Cassandra out of it. Maybe there's some ideas in Cassandra that actually build on top of Bigtable, or you're looking at MapReduce versus Hadoop.
All of a sudden, you found that these things diverged and Google had zero ability to actually import open source, circa 2010, right? It could not import those systems back in, because the operational characteristics of these things were wholly alien when compared to something like Borg. You see this also – we would acquire companies and it would take those companies way too long to be able to essentially re-platform themselves on top of Borg, because it was just so different.
This is one of the reasons, honestly, why we ended up doing something like GCE: to actually have a platform that was more familiar for acquisitions. It's one of the reasons we did it. Then also introducing Kubernetes – it's not Borg. It's a cousin of Borg inside of Google. For those who don't know, Borg is the container system that's been in production at Google for probably 15 years now, and the spiritual grandfather to Kubernetes in a lot of ways.
A lot of the ideas that you learn from Kubernetes are applicable to Borg. It's not nearly as big a leap for people to actually change between them, as it was before, Kubernetes was out there.
[0:27:58.6] MG: Joe, I got a similar question, because it seems to me like you're a platform builder. You've worked on GCE, Kubernetes obviously. If you were talking to another platform architect or builder, what would you recommend to them based on your experiences? What is a key ingredient, technically speaking, of a platform that you should be building today – the main thing, or the lesson learned that you had from building those platforms? Technical advice, if you will.
[0:28:26.8] JB: I mean, that's a really good question. I think in my mind, the mark of a good platform is when people can use it to do things that you hadn't imagined when you were building it, right? The goal here is that you want a platform to be a force multiplier. You want to enable people to do amazing things. Compare, again, the Linux kernel, or even something as simple as our electrical grid, right?
The folks who established those standards, God knows how long ago, right? 150 years ago or whenever – the whole Tesla versus Thomas Edison thing, [inaudible]. Nobody had any idea the long-term impact that would have on society over time. I think that's the definition of a successful platform in my mind.
You've got to keep that in mind, right? I think that for me, a lot of times people design for the first five minutes at the expense of the next five years. I've seen it a lot of times, where you design for: hey, I'm giving a presentation; I want to be able to fit something amazing on one slide. You do it, but then all of a sudden somebody wants to do something different. They want to go off course, they want to go off the rails, they want to actually experiment, and the thing is just brittle. It's like, “Hey, it does this. It doesn't do anything else. You want to do something else? Sorry, this isn't the tool for you.”
For me, I think that's a trap, right? Because it's easy to get early users based on that very curated experience. It's hard to keep those users as they actually start using the thing in anger, as they start interfacing with the real world, as they deal with things that you didn't think of as a platform.
I'm always thinking about: how can everything that you put in the platform be used in multiple ways? How can you actually make these things composable building blocks? Because then that gives you the opportunity for folks to actually compose them in ways that you didn't imagine starting out. I think that's some of it. I started my career at Microsoft working on Internet Explorer. The fascinating thing about Microsoft is that through and through and through and through, Microsoft is a platform company. It started with DOS and Windows and Office – even Office is viewed as a platform inside of Microsoft.
They fundamentally understand in their bones the benefit of actually starting that platform flywheel. It was really interesting to be doing this during the first browser war of IE versus Netscape, when I started my own career, and to actually see that Microsoft always saw Internet Explorer as a platform, whereas I think Netscape didn't really get it in the same way, right? They didn't understand the potential, I think, in the way that Microsoft did.
For me, I mean, where you start your career oftentimes sets your patterns in terms of how you look at things over time. I think a lot of this platform thinking comes from just imprinting when I was a baby developer, I think. I don't know. It takes a lot of time to really internalize that stuff.
[0:31:14.1] BL: The lesson here is a good one: when we're building things that are way bigger than us, don't think of your product as the end goal. Think of it as an enabler. When it's an enabler, that's where you get that X multiplier. Then that's where you get all the residuals. Microsoft actually is a great example of it. My gosh – just think of what Microsoft has been able to do with the power of Office.
[0:31:39.1] JB: Yeah. I look at something like VB in the Microsoft world. We still don't have VB for the cloud era. We still haven't created that. I think there's still opportunity there to actually strike. VB back in the day, for those who weren't there, struck this amazing balance of being easy to get started with, but also something that could actually grow with you over time, because it had all these extension mechanisms where you could actually – there were marketplace controls that you could buy, you could partner with other developers that were writing C or C++.
It was an incredible platform. Then they leveraged Office to extend the capabilities of VB. It's an amazing ecosystem. Sorry, I didn't mean to interrupt you, Bryan.
[0:32:16.0] BL: Oh, no. That's all good. I get as excited about it as you do whenever I think about it. It's a pretty exciting place to be.
[0:32:21.8] JB: Yeah. I'll talk to VCs, because I did a startup and the EIR thing, and I'll have them ask me things like, “Hey, where should we invest in the Kubernetes space?” My answer, using the hockey analogy, is like, “You've got to go where the puck is going.” Invest in the things that Kubernetes enables. What are the things that people can do now that they couldn't do pre-Kubernetes? Those are the things where we're going to see the explosion of growth. It's not about Kubernetes. It's really about the larger ecosystem that Kubernetes is the seed crystal for.
[0:32:56.2] BL: For those of you listening, if you want to get anything out of here, rewind back about 20 seconds and play that over and over again, what Joe just said.
[0:33:04.2] MG: Yeah. This was brilliant.
[0:33:05.9] BL: It’s where the puck is going. It's not where we are now. We're building for the future. We're not building for now.
[0:33:11.1] MG: I'm looking at this tweetable quotes here, the last 20 seconds, so many tweetable quotes. We have to decide which ones to tweet then.
[0:33:18.5] CC: Well, we’ll tweet them all.
[0:33:20.0] MG: Oh, yes.
[0:33:21.3] JB: Here’s another thing. Here’s another piece of career advice. Successful people are good storytellers. You can have the most beautiful technology, if you can't tell the human story about it, about what it does for folks, then nobody will care. I spend a lot of the time on Twitter and probably too much time, if you ask my family.
That medium of being able to actually distill your thoughts down into something that is tweetable, quotable, really potent, that is a skill that's worth developing and it's a skill that's worth valuing. Because there's things that are rolling around in my head and I still haven't found a way to get them into a tweet. At some point, I'll figure it out and it'll be a thing. It takes a lot of time to build that skill to be able to refine like that.
[0:34:08.5] CC: I want to share an anecdote of my own. I interviewed at a small – tiny startup, maybe less than 10 people at the time, in Cambridge, back when I lived up there. The guy was borderline on wanting to hire me or not. I sent him an e-mail to try to influence his decision and it was a long-ass e-mail. They said, “No, thank you.”
Then – I think we had a good rapport – I said, well, anything you can tell me about your decision then? He said something along the lines of, I was too verbose. That was pre-Twitter. Twitter I think existed, but it was at the very beginning; I wasn't using it. Yeah, people. Be concise. Decision-makers don't have time to read long things. You need to be able to convey your message in a few short sentences. It's crucial.
[0:35:07.5] BL: All right, so we're nearing the end. I want to ask another question, because these are random questions for Joe. Joe, it is the week before KubeCon North America 2019, and today is actually an interesting day. A couple of neat things happened today. We had Docker – it was neat. Docker split somewhat; it sold part of itself and now they're going to be a tools company. We're all still trying to decode what that actually is. Here's the neat piece: Apple released a laptop that can have 64 gigabytes of memory.
[0:35:44.4] MG: Has an escape key.
[0:35:45.7] BL: It has an escape key.
[0:35:47.6] MG: This is brilliant.
[0:35:48.6] BL: Yeah. I think the question was what do you think about that?
[0:35:52.8] JB: Okay. Well, so first of all, I mean, Docker is fascinating and I think this is – there's a lot of lessons there and I'm not sure I'm the one to tell them. I think it's easy to armchair-quarterback these things. It's hard to live that story. I think that it's fun to play that what-if game. I think it does show that this stuff is hard. You can have everything in your grasp and then just have it all slip away. I think that's not anybody's fault. It's just there's different strategies, different approaches in how this stuff plays out over time.
On the laptop thing, I think my current laptop has 16 gigs of RAM. One of the things that we're seeing is that as we move towards a microservices world, I gave a talk about this probably three or four years ago. As we move to a microservices world, I think there's one stage where you create a bunch of microservices, but you still view those things as an app.
You say, "This microservice belongs to this app." Within a mature organization, those things start to grow and eventually what you find is that you have services that are actually useful for multiple apps. Your entire production infrastructure becomes this web of services that are calling each other. Apps are just entry points into these things at different points of that web of infrastructure.
This is the way that things work at Google. When you see companies that are microservices-based, let's take an Uber, or Lyft or an Airbnb. As they diversify the set of products that they're offering, you know they're not running completely independent stacks. You know that there's places where these things connect to behind the scenes in a microservices world. What does that mean for developers? What it means is that you can no longer fit an entire company's worth of infrastructure on your laptop anymore.
Within a certain constraint, you can go through and actually say, “Hey, I can bring up this canonical cut of microservices. I can bring that up on my laptop, but it will have dependencies that I either have to call into the prod dependencies for, call into a specialized staging environment for, or mock out, so that I can actually run this thing locally and develop it.” With 64 gigs of RAM, I can run more on my laptop, right? There's a little bit of kicking that can down the road, in terms of – okay, there's this race between going more microservicey versus how much I can run on my laptop.
The interesting thing is: where is this going to end? Are we going to have the ability to bring more and more onto your laptop? Are you going to run in some split-brain mode, where people create network connections between the laptop and the rest of these things? Or are we going to move to a world where you're doing more development on-cluster, in the cloud, and your laptop gets thinner and thinner, right?
Either you absolutely need 64 gigs because you're pushing up against the boundaries of what you can do on your laptop, or you've given up and it's all running in the cloud – at which point, you might as well just use a Chromebook. It's fascinating that we're seeing this divergence of scaling up versus actually moving stuff to the cloud.
I can tell you at Google, a lot of folks, even developers, can actually be super, super productive with something relatively thin like a Chromebook, because there are so many tools there that really are targeted at doing all that stuff remotely, in Google's production data centers and such. That's, I think, the interesting implication from a developer point of view with 64 gigabytes of RAM. What are you going to do, Bryan? Are you going to get the 64 gig Mac? You're going to do it?
[0:39:11.2] BL: It’s already coming. They'll be here week after next.
[0:39:13.2] JB: You already ordered it? You are such an Apple fanboy. Oh, man.
[0:39:18.6] BL: Oh – actually, not to go too much into it, I am a fan of lots of memory. You know what? We work in this cloud native world. Any given week, I'll work on four to five projects. I'm lazy. I don't want to shut any of them down. Now with 64 gigs, I don't have to shut anything down.
[0:39:37.2] JB: It was so funny. When I was at Microsoft, everybody actually focused on Microsoft Windows boot time. They're like, “We got to make it boot faster. We got to make it boot faster.” I'm like, I don't boot that often. I just want the thing to resume from sleep, right? If you can make that reliable.
[0:39:48.7] CC: Yeah. I frequently have to restart my computer because of memory issues. I don't want to know which app is taking up memory – I have a tool where I can look it up – I just shut it down and flush the memory.
I do have a question related to Docker. I don't know if it's right to say that Kubernetes is so reliant on Docker, because I know it works with other container technologies as well. In the worst case scenario – and obviously, I have no reason to predict this – where Docker, let's say, is discontinued, how would that affect Kubernetes?
[0:40:25.3] JB: Early on when we were doing Kubernetes and you're in this relationship with a company like Docker, I looked at what Docker was doing and you're like, “Okay, where is the real value here over time?” In my mind, I thought that the interface with developers that distributed kernel, that API surface area of Kubernetes, that was really the thing and that a lot of the Docker stuff was over time going to fade to the background.
I think we've seen that happen, because when we talk about production systems, we've definitely moved past Docker. We have the CRI; we have containerd, which was essentially built by Docker and donated to the CNCF as it made its way towards graduation. I think it's graduated now. The governance ties to Docker have been severed at this point. In production systems for Kubernetes, we've moved past that.
I still think the developer experience is oftentimes reliant on Docker and things like Dockerfiles. I think we're moving past that also. I think that if Docker were to disappear off the face of the earth, there would be some adjustment, but I think we have the right toolkits and the right systems to be able to do that. Some of that is open sourced by Docker as part of the Moby project. The whole Dockerfile evaluation flow is actually in this thing called BuildKit that you can use in different contexts outside of the Docker engine.
I think there's a lot of the building action there. The thing that I think is the most influential, that will stand the test of time, is the Docker container image format: that artifact that you upload, that you download, and the registry APIs. Now those things have been codified and are moving forward slowly under the OCI, the Open Container Initiative, which is a little bit of a sister foundation type of thing to the CNCF. I think that's the influence over time.
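One way to see how far the image format and registry APIs travel beyond Docker itself: they are just content-addressed blobs behind an HTTP protocol, and a library like Google's go-containerregistry can fetch them with no Docker daemon anywhere in the picture. A small sketch – the image reference below is an arbitrary example:

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/crane"
)

func main() {
	// Any OCI-compliant registry works here, not just Docker Hub.
	ref := "docker.io/library/alpine:latest"

	// The digest is the content address that makes an image the same
	// artifact no matter which registry or runtime touches it.
	digest, err := crane.Digest(ref)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("digest:", digest)

	// The manifest lists the config blob and layers, as laid out in
	// the OCI image spec.
	manifest, err := crane.Manifest(ref)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", manifest)
}
```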
Then related to that, I think the world should be a little bit worried about Docker Hub and what that means for Docker Hub over time, because that is not a cheap service to run. It's run as a public good, similar to GitHub. If the commercial aspects of that are not healthy, then I think it might be disruptive if we see something bad happen with Docker Hub itself. I don't know what exactly the replacement for that would be overnight. That'd be incredibly disruptive.
[0:42:35.8] CC: Should be Harbor.
[0:42:37.7] JB: I mean, Harbor is a thing, but somebody's got to run it and somebody's got to pay the bandwidth bills, right? Thank you to Docker for paying those bandwidth bills, because it's actually been good not just for Docker, but for our entire ecosystem, to be able to do that. I don't know what that looks like moving forward. I think it's going to be – I mean, maybe GitHub with GitHub artifacts is going to pick up the slack. We're going to have to see.
[0:42:58.6] MG: Good. I have one last question from my end. Totally different topic – not Docker at all. Or maybe; it depends on your answer to it. The question is: you're a very technical person; what is the technology, or the stuff, that your brain is currently spinning on, if you can disclose? Obviously, no secrets. What keeps you awake at night, in your brain?
[0:43:20.1] JB: I mean, a couple of things. Stuff that's just completely different from our world, I think, is interesting. I think we've entered a place where programming computers is so specialized. Again, I talk about how, if you made me be a front-end developer, I would flail for several months trying to figure out how to even be productive, right?
I think similar when we look at something like machine learning, there's a lot of stuff happening there really fast. I understand the broad strokes, but I can't say that I understand it to any deep degree. I think it's fascinating and exciting the amount of diversity in this world and stuff to learn. Bryan's asked me in the past. It's like, “Hey, if you're going to quit and start a new career and do something different, what would it be?”
I think I would probably do something like generative art, right? Essentially, there's folks out there writing these programs to generate art – a little bit of a moral descendant of the demoscene. That was, I don't know – I wonder when the demoscene happened, Bryan. When was that?
[0:44:19.4] BL: Oh, mid 90s, or early 90s.
[0:44:22.4] JB: That’s right. I was never super into that. I don't think I was smart enough. It's crazy stuff.
[0:44:27.6] MG: I actually used to write demoscenes.
[0:44:28.8] JB: I know you did. I know you did. Okay, so just for those not familiar, the demoscene was essentially – you wrote x86 assembly code to do something cool on screen. It was all generated, so that the amount of code was vanishingly small. It was this puzzle/art/technical tour de force type of thing.
[0:44:50.8] BL: We wrote trigonometry in a similar way – that's literally what we did.
[0:44:56.2] JB: I think a lot of that stuff ends up being fun. Stuff that's related to our world, I think about how do we move up the stack and I think a lot of folks are focused on the developer experience, how do we make that easier. I think one of the things through the lens of VMware and Tanzu is looking at how does this stuff start to interface with organizational mechanics?
How does the typical enterprise work? How do we actually make sure that we can start delivering a toolset that works with that organization, versus working against the organization? That I think is an interesting area, where it's hard because it involves people.
Back-end people, like programmers, love it, because they don't have to deal with those pesky people, right? They get to define their interfaces, and their interfaces are pure and logical. I think that UI work, UX work – anytime you deal with people, that's the hardest thing, because you don't get to actually tell them how to think. They tell you how to think and you have to adapt to it, which is actually different for a lot of back-end, logical types of folks.
I think there's an aspect of that that is user experience at the consumer level. There's developer experience, and there's a whole class of things which is maybe organizational experience. How do you interface with the organization, versus just interfacing with individuals, whether from the developer or the end-user point of view? I don't know if, as an industry, we actually have our heads wrapped around those organizational limits.
[0:46:16.6] CC: Well, we have arrived at the end. Makes me so sad, because we could talk for easily two more hours.
[0:46:24.8] JB: Yeah, we could definitely keep going.
[0:46:26.4] CC: We’re going to bring you back, Joe. Don’t worry.
[0:46:28.6] JB: For sure. Anytime.
[0:46:29.9] CC: Or do worry. All right, so we are going to release these episodes right after KubeCon. Glad everybody could be here today. Thank you. Make sure to subscribe and follow us on Twitter. Follow us everywhere and suggest episode topics for us.
Bye and until next time.
[0:46:52.3] JB: Thank you so much.
[0:46:52.9] MG: Bye.
[0:46:54.1] BL: Bye. Thank you.
[END OF EPISODE]
[0:46:55.1] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
This week on The Podlets Podcast, we talk about cloud native infrastructure. We were interested in discussing this because we’ve spent some time talking about the different ways that people can use cloud native tooling, but we wanted to get to the root of it, such as where code lives and runs and what it means to create cloud native infrastructure. We also have a conversation about the future of administrative roles in the cloud native space, and explain why there will always be a demand for people in this industry. We dive into the expense for companies when developers run their own scripts and use cloud services as required, and provide some pointers on how to keep costs at a minimum. Joining in, you’ll also learn what a well-constructed cloud native environment should look like, which resources to consult, and what infrastructure as code (IaC) really means. We compare containers to virtual machines and then weigh up the advantages and disadvantages of bare metal data centers versus using the cloud.
Note: our show changed name to The Podlets.
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Duffie Cooley
Nicholas Lane
Key Points From This Episode:
• A few perspectives on what cloud native infrastructure means.
• Thoughts about the future of admin roles in the cloud native space.
• The increasing volume of internet users and the development of new apps daily.
• Why people in the infrastructure space will continue to become more valuable.
• The cost implications for companies if every developer uses cloud services individually.
• The relationships between IaC for cloud native and IaC for the cloud in general.
• Features of a well-constructed cloud native environment.
• Being aware that not all clouds are created equal and the problem with certain APIs.
• A helpful resource for learning more on this topic: Cloud Native Infrastructure.
• Unpacking what IaC is not and how Kubernetes really works.
• Reflecting on how it was before cloud native infrastructure, including using tools like vSphere.
• An explanation of what containers are and how they compare to virtual machines.
• Is it worth running bare metal in the clouds age? Weighing up the pros and cons.
• Returning to the mainframe and how the cloud almost mimics that idea.
• A list of the cloud native infrastructures we use daily.
• How you can have your own “private” cloud within your bare metal data center.
Quotes:
“This isn’t about whether we will have jobs, it’s about how, when we are so outnumbered, do we as this relatively small force in the world handle the demand that is coming, that is already here.” — Duffie Cooley @mauilion [0:07:22]
“Not every cloud that you’re going to run into is made the same. There are some clouds that exist of which the API is people. You send a request and a human being interprets your request and makes the changes. That is a big no-no.” — Nicholas Lane @apinick [0:16:19]
“If you are in the cloud native workspace you may need 1% of your workforce dedicated to infrastructure, but if you are in the bare metal world, you might need 10 to 20% of your workforce dedicated just to running infrastructure.” — Nicholas Lane @apinick [0:41:03]
Links Mentioned in Today’s Episode:
VMware RADIO — https://www.vmware.com/radius/vmware-radio-amplifying-ideas-innovation/
CoreOS — https://coreos.com/
Brandon Philips on LinkedIn — https://www.linkedin.com/in/brandonphilips
Kubernetes — https://kubernetes.io/
Apache Mesos — http://mesos.apache.org
Ansible — https://www.ansible.com
Terraform — https://www.terraform.io
XenServer (Citrix Hypervisor) — https://xenserver.org
OpenStack — https://www.openstack.org
Red Hat — https://www.redhat.com/
Kris Nova on LinkedIn — https://www.linkedin.com/in/kris-nova
Cloud Native Infrastructure — https://www.amazon.com/Cloud-Native-Infrastructure-Applications-Environment/dp/1491984309
Heptio — https://heptio.cloud.vmware.com
AWS — https://aws.amazon.com
Azure — https://azure.microsoft.com/en-us/
vSphere — https://www.vmware.com/products/vsphere.html
Circuit City — https://www.circuitcity.com
Newegg — https://www.newegg.com
Uber —https://www.uber.com/
Lyft — https://www.lyft.com
Transcript:
EPISODE 05
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.1] NL: Hello and welcome to episode five of The Podlets Podcast, the podcast where we explore cloud native topics one topic at a time. This week, we’re going to the root of everything, money. No, I mean, infrastructure. My name is Nicholas Lane and joining me this week are Carlisia Campos.
[0:00:59.2] CC: Hi everybody.
[0:01:01.1] NL: And Duffie Cooley.
[0:01:02.4] DC: Hey everybody, good to see you again.
[0:01:04.6] NL: How have you guys been? Anything new and exciting going on? For me, this week has been really interesting. There's an internal VMware conference called RADIO, where a bunch of engineering teams across the entire company kind of get together and talk about the future of pretty much everything, so I have been kind of sponging that up this week. It's been really interesting, talking about all the interesting ideas and fascinating new technologies that we're working on.
[0:01:29.8] DC: Awesome.
[0:01:30.9] NL: Carlisia?
[0:01:31.8] CC: My entire team is at RADIO which is in San Francisco and I’m not. But I’m sort of glad I didn’t have to travel.
[0:01:42.8] NL: Yeah, nothing too exciting for me this week. Last week I was on PTO and that was great so this week it has just been kind of getting spun back up and I’m getting back into the swing of things a bit.
[0:01:52.3] CC: Were you spun back up with a script?
[0:01:57.1] NL: Yeah, I was.
[0:01:58.8] CC: With infrastructure I suppose?
[0:02:00.5] NL: Yes, absolutely. This week on The Podlets Podcast, we are going to be talking about cloud native infrastructure. Basically, I was interested in talking about this because we’ve spent some time talking about some of the different ways that people can use cloud native tooling but I wanted to kind of get to the root of it. Where does your code live, where does it run, what does it mean to create cloud native infrastructure?
To start us off, you know, we're going to talk about the concept. To me, cloud native infrastructure is basically any infrastructure tool or service that allows you to programmatically create infrastructure. By that I mean like your compute nodes, anything running your application; your networking, software-defined networking; storage – Ceph, object stores, that sort of thing. You can just spin them up in a programmatic, contract-driven way, and then databases as well, which is very nice.
Then I also kind of lump in anything that's like a managed service as part of that. Going back to databases: if you use different clouds, they have their RDS or RDB tooling that provides databases on the fly and then they manage it for you. Those things to me are cloud native infrastructure. Duffie, what do you think?
[0:03:19.7] DC: Yeah, I think it's definitely one of my favorite topics. I spent a lot of my career working with infrastructure one way or the other, whether that meant racking servers in racks and doing it the old-school way – figuring out power budgets and, you know, dealing with networking and all of that stuff – or whether that meant finally getting to a point where I have an API, and my customer's going to come to me and say, I need 10 new servers, and I can be like, one second. Then I run a script and they have 10 new servers, versus, you know, having to order the hardware, get the hardware delivered, get the hardware racked.
Replace the stuff that was dead on arrival, kind of go through that whole process and yeah. Cloud native infrastructure or infrastructure as a service is definitely near and dear to my heart.
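Duffie's ten-servers-from-a-script point is worth pinning down with code. As a hedged sketch – the region, AMI ID, and tag values below are placeholders, and this assumes AWS credentials are already configured – a compute node becomes a single SDK call from a program rather than a purchase order:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String("us-west-2"), // placeholder region
	}))
	svc := ec2.New(sess)

	// One API call replaces ordering, racking, and cabling a server.
	out, err := svc.RunInstances(&ec2.RunInstancesInput{
		ImageId:      aws.String("ami-12345678"), // placeholder AMI
		InstanceType: aws.String(ec2.InstanceTypeT3Micro),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		// Tag the owner so cost and cleanup can be traced later.
		TagSpecifications: []*ec2.TagSpecification{{
			ResourceType: aws.String(ec2.ResourceTypeInstance),
			Tags: []*ec2.Tag{{
				Key:   aws.String("owner"),
				Value: aws.String("nicholas"),
			}},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("launched:", aws.StringValue(out.Instances[0].InstanceId))
}
```

Tools like Terraform and Ansible, linked above, wrap these same provider APIs in a declarative layer; the point is that the API, not the hardware, is the interface.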
[0:04:16.6] CC: How do you feel about it if you are an admin? You work for VMware and you are a field engineer now – you're basically a consultant – but imagine you were back in that role of an admin at a company, and the company was practicing cloud native infrastructure. Basically, what we're talking about is – we go back to this theme of self-sufficiency a lot. I think we're going to be going back to this a lot too, as we go through different topics.
Mainly, someone needs a server in that environment; now, they can run an existing script that maybe you made for them. But do you have concerns that your job is redundant, now that just one script can do a lot of your work?
[0:04:53.6] NL: Yeah, in the field engineering org, we kind of have this mantra that we're trying to automate ourselves out of a job. I feel like anyone who is really getting into cloud native infrastructure is taking that path as well. If I were an admin in a world that was hybrid or anything like that – they had on-prem or bare metal infrastructure and they had cloud native infrastructure – I would be more than ecstatic to hand any amount of the administrative work, like spinning up new servers, over to the cloud native infrastructure.
If people just need somewhere they can go click – I got whatever services I need and they all work together, because the cloud makes them work together – awesome. That gives me more time to do other tasks that may be a bit more onerous or less automated. I would be all for it.
[0:05:48.8] CC: You’re saying that if you are – because I don’t want the admin people listening to this to stop listening and thinking, screw this. You’re saying, if you’re an admin, there will still be plenty of work for you to do?
[0:06:03.8] NL: Yeah, there's always stuff to do, I think. If not, then I guess maybe it's time to find somewhere else to go.
[0:06:12.1] DC: There was a really interesting presentation that really stuck with me when I was working for CoreOS which is another infrastructure company, it was a presentation by our CTO, his name is Brandon Philips and Brandon put together a presentation around the idea that every single day, there are you know, so many thousand new users of the Internet coming online for the first time.
That's so many thousand people who are going to be storing their photos there, getting email, doing all those things that we do in our daily lives on the Internet. And globally, across the whole world, there are only about – I think it was 250,000 or 300,000 – people that do what we do, that understand the infrastructure at a level where they might even be able to automate it, you know?
That work in the IT industry and are able to actually facilitate the creation of those resources on which all of those applications will be hosted, right? This isn’t even taking into account, the number of applications per day that are brought into the Internet or made available to users, right? That in itself is a whole different thing. How many people are putting up new webpages or putting up new content or what have you every single day.
Fundamentally, I think that we have to think about the problem in a slightly different way. This isn't about whether we will have jobs; it's about how, when we are so outnumbered, we as this relatively small force in the world handle the demand that is coming – that is already here today, right?
Those people that are listening who are working in infrastructure today: you're even more valuable when you think about it in those terms, because there just aren't enough people on the planet today to solve those problems using the tools that we are using today, right?
Automation is king and it has been for a long time but it’s not going anywhere, we need the people that we have to be able to actually support much larger numbers or bigger scale of infrastructure than they know how to do today. That’s the problem that we have to solve.
[0:08:14.8] NL: Yeah, totally.
[0:08:16.2] CC: Looking at it from the perspective of whoever is paying the bills: I think that in the past, as a developer, you had to request a server to run your app in the test environment, and eventually you'd get it, and that would be the server that everybody would run against, right? Because you're one developer in a group, and everybody's developing different features, and that one server is what we would use to push out changes to and do some level of manual testing, or maybe we'd have a QA person who would do it.
That's one server, one resource, one virtual machine. Now – and maybe I'm wrong – as a developer, I think what I'm seeing is that I will have access to compute and storage, I'll run a script, and I boot up that resource just for myself. Is that more or less expensive? You know? Every single developer has this facility to spin things up much quicker, because we're not depending on IT and we have a script. I mean, the reality is not as easy as just one command and you get it, but –
If it's so easy and that's what everybody is doing, doesn't it become expensive for the company?
[0:09:44.3] NL: It can, I think when cloud native infrastructure really became more popular in the workplace and became more like mainstream, there was a lot of talk about the concept of sticker shock, right? It’s the idea of you had this predictable amount of money that was allocated to your infrastructure before, these things cost this much and their value will degrade over time, right?
The server you had in 2005 is not going to be as valuable as the server you buy in 2010, but that might be a refresh cycle of five years or so. So you have this predictable amount of money. Suddenly, you have this script that we're talking about that can spin up an equivalent resource to one of those servers, and if someone just leaves it running, that will run for a long time. And so for irresponsible users of the cloud, or even just regular users of the cloud, it does cost a lot of money to use any of these cloud services – or it can.
Yes, there is some concern about the amount of money that these things cost, because honestly, as we're exploring cloud native topics, one thing keeps coming up: cloud really just means somebody else's computer, right? You're not eating the cost of maintenance or the time it takes to maintain things – somebody else is, and you're paying that price upfront instead of on, like, a yearly basis, right?
It's less predictable and usually a bit more than people are expecting, right? But there is value there as well.
[0:11:16.0] CC: You're saying that if the user is diligent enough to terminate the clusters or the machines, that is how you don't rack up unnecessary cost?
[0:11:26.6] NL: Right. For a test – say you want to try your code really quickly and just need a quick setup of networking and compute resources to test it out – you spin up a small, tiny instance somewhere in one of the clouds, test out your code and then kill the instance. That will hardly cost anything.
It didn't really cost you much in time either, right? You had this automated process, hopefully, or a manual process that isn't too onerous, and you got the resources and the things you needed and you're good. If you aren't a good player and you keep it going, that can get very expensive very quickly, because it's the number of resources used per hour – I think that is how most billing happens in the cloud.
That can – I mean, it won't really grow exponentially, but it will increase over time to a value that you are not expecting to see. You get a bill and you're like, holy crap, what is this?
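A back-of-the-envelope sketch of that per-hour billing dynamic, and of the "kill the instance when you're done" discipline; the hourly rate and instance ID below are invented for illustration, not real quotes:

```python
# Illustrative only: the rate is made up, and the instance ID is a placeholder.
import boto3

HOURLY_RATE = 0.0104  # dollars per hour for a small instance (invented figure)

def cost(hours: float) -> float:
    # Per-hour billing: a forgotten instance keeps accruing this forever.
    return HOURLY_RATE * hours

print(f"one-hour test run: ${cost(1):.2f}")
print(f"forgotten 90 days: ${cost(24 * 90):.2f}")

# The part that keeps the bill boring: tearing it down when the test is done.
ec2 = boto3.client("ec2", region_name="us-west-2")
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder ID
```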
[0:12:26.9] DC: I think it's also – I mean, this is definitely where things like orchestration come in, right? With the level of abstraction that you get from things like Kubernetes or Mesos or some other tools, you're able to provide access to those resources in a more dynamic way, with the expectation – and sometimes the explicit contract – that the workload you deploy will be deployed on common equipment. That allows for things like bin packing, which is a pretty interesting term when it comes to infrastructure, and it means that you can think of it like this: for a particular cluster,
I might have 10 of those VMs that we talked about as having high value. I intend to keep those running, and then my goal is to make sure that I have enough consumers of those 10 nodes to be able to get my value out of them. So when I split up the environments, maybe each developer has a namespace, right?
This gets me the ability to effectively oversubscribe those resources that I'm paying for, in a way that will reduce the overall cost of ownership – or not ownership, maybe cost of operation – for those 10 nodes. Let's take a step back and go down memory lane.
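Before the trip down memory lane, a toy illustration of that bin-packing idea: pack workloads, by requested CPU, onto a fixed pool of 10 nodes. The node size and request figures are invented, and a real scheduler like Kubernetes considers far more than CPU, but the shape of the problem is this:

```python
# First-fit bin packing over 10 nodes; all numbers are illustrative.
NODE_CPU = 4.0        # cores per node (made up)
nodes = [0.0] * 10    # CPU already committed on each node

def place(request: float) -> int:
    """Return the index of the first node with room for this request."""
    for i, used in enumerate(nodes):
        if used + request <= NODE_CPU:
            nodes[i] = used + request
            return i
    raise RuntimeError("cluster full - time to buy another node")

for workload in [0.5, 1.0, 0.25, 2.0, 0.5, 3.0, 1.5]:  # CPU requests in cores
    print(f"{workload} cores -> node {place(workload)}")
```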
[0:13:34.7] NL: When did you first hear about the concept of IaaS, or cloud native infrastructure, or infrastructure as code? Carlisia?
[0:13:45.5] CC: I think in the last couple of years – it pretty much coincided with when I started getting into cloud native and Kubernetes. I'm not clear on the difference between infrastructure as code for cloud native versus infrastructure as code for the cloud in general.
Is there anything about cloud native that has different requirements and solutions? Or are we just talking about the cloud, and the same applies for cloud native?
[0:14:25.8] NL: Yes, I think that they're the same thing. Infrastructure as code for the cloud is inherently cloud native, right? Cloud native just means that whatever you're trying to do leverages the tools and the contracts that a cloud provides. Basic infrastructure as code is basically just "how do I use the cloud," and that's –
[0:14:52.9] CC: In an automated way.
[0:14:54.7] NL: In an automated way, or just in a way, right? A properly constructed cloud should have a user interface of some kind that uses an API, right? A contract to create these machines or these resources, so that the way the cloud creates its own resources is the same way you create resources when you do it programmatically, through an orchestration tool like Ansible or Terraform. The API calls that it makes itself, and that its UI makes, need to be the same, and if we have that, then we have a well-constructed cloud native environment.
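One way to picture that contract – a hedged sketch, assuming AWS credentials and the aws CLI are already configured – is that the SDK, the CLI, and a well-built console all bottom out in the same DescribeInstances API call:

```python
# Two front ends, one API. Assumes credentials and the aws CLI are set up.
import json
import subprocess

import boto3

# Programmatic path: the SDK calls DescribeInstances.
sdk_count = sum(
    len(r["Instances"])
    for r in boto3.client("ec2").describe_instances()["Reservations"]
)

# Tooling path: the CLI calls the very same DescribeInstances.
out = subprocess.run(
    ["aws", "ec2", "describe-instances", "--output", "json"],
    capture_output=True, check=True, text=True,
).stdout
cli_count = sum(len(r["Instances"]) for r in json.loads(out)["Reservations"])

print(sdk_count == cli_count)  # same contract, same answer
```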
[0:15:32.7] DC: Yeah, I agree with that. I think, you know, from the perspective of cloud infrastructure, or cloud native infrastructure, the goal is definitely to have – it relies on one of the topics that we covered earlier in the program, around the idea of API-first or API-driven being such an intrinsic quality of any cloud native architecture, right?
Because, fundamentally, if we can't do it programmatically, then we're kind of stuck in that old world of racking servers or going through some human-managed process, and then we're right back in the same state that we were in before, and there's no way that we can actually scale our ability to manage these problems, because we're stuck in a place where it's one-to-one rather than one-to-many.
[0:16:15.6] NL: Yeah, those APIs are critical.
[0:16:17.4] DC: The APIs are critical.
[0:16:18.4] NL: You bring up a good point; that reminds me of something. Not every cloud that you're going to run into is made the same. There are some clouds that exist – and I'm not going to specifically call them out – where the API is people: you submit your request and a human being interprets it and makes the changes. That is a big no-no. Me no like, no good, my brain stopped.
That is a poorly constructed cloud native environment. In fact, I would say it is not a cloud native environment at all; it can barely call itself a cloud, in a sense. Duffie, how about you? When was the first time you heard the concept of cloud native infrastructure?
[0:17:01.3] DC: I'm going to take this question in the form of: what was the first programmatic infrastructure-as-a-service that I played with? For me, that was actually back in the Nicira days, when we were virtualizing the network and effectively building an API that would allow you to create network resources. One of the targets that we were developing for was things like XenServer, which at the time had a reasonable API that would allow you to create virtual machines but didn't really have a good virtual network solution.
There were also technologies like KVM – the ability to actually use KVM to create virtual machines, again with an API. Although with KVM it wasn't quite the same as an API; that's kind of where things like OpenStack and those technologies came along and wrapped a lot of the capability of KVM behind a RESTful API, which was awesome. But yeah, I would say XenServer was the first one, and that gave me a command line option with which I could stand up and create virtual machines and give them so many resources and all that stuff.
You know, from my perspective, Nicira was the first network API that I actually ever saw, and it was also one of the first ones that I worked on, which was pretty neat.
[0:18:11.8] CC: When was that, Duffie?
[0:18:13.8] DC: Some time ago. I would say like 2006 maybe?
[0:18:17.2] NL: The TVs didn’t have color back then.
[0:18:21.0] DC: Hey now.
[0:18:25.1] NL: For me, for sort of cloud native infrastructure – it was the OpenStack days, I think, when I first heard the phrase infrastructure as a service. At the time, I didn't even get close to grokking it. I still don't know if I grok it fully, but I was working at Red Hat at the time, so this was probably back in 2012, 2013, and we were starting to leverage OpenStack more. Having this API-driven toolset that could spin up these VMs or instances was really cool to me, but it's something I didn't really get into using until I got into Kubernetes, and specifically Kubernetes on different clouds, such as AWS or Azure.
We'll touch on those a little bit later, but it was using those, and then having a CLI with an API that I could reference really easily to spin things up – I was like, holy crap, this is incredible. That must have been around the 2015, '16 timeframe, I think.
I think I actually heard the phrase cloud native infrastructure first from our friend Kris Nova's book, Cloud Native Infrastructure. I think that really helped me wrap my brain around what it is to have something be cloud native infrastructure, and how the different clouds interact in this way. I thought that was a really handy book, and I highly recommend it. It's also a well-written and interesting read.
[0:19:47.7] CC: Yes, I read it back to back when I joined, and I need to read it again, actually.
[0:19:53.6] NL: I agree, I need to go back to it. I think we've touched on something that we normally do, which is: what does this topic mean to us? I think we kind of covered that a bit, but is there anything else that you all want to expand upon? Any aspect of infrastructure that we've not touched on?
[0:20:08.2] CC: Well, I was thinking of saying something which reflects my first encounter with Kubernetes, when I joined Heptio and was using Kubernetes for the very first time, and I had such a misconception of what Kubernetes was. I'm saying this to set up what I want to get at – I wanted to relate what infrastructure as code is not.
[0:20:40.0] NL: That's a very good point actually, I like that.
[0:20:42.4] CC: Maybe it's what Kubernetes is not – I'm not clear what it is I'm going to say yet. Hang with me, it's all related. I work on Velero, and Velero runs on Kubernetes, so I do have to have Kubernetes running for me to run Velero, whether on my machine, or on a cloud provider, or on prem.
For Kubernetes to run, I need to have a cluster running. Not just one instance – I was used to that; I could bring up an instance or two, I've done that before – but bringing up a cluster and making sure everything's connected? When I started having to do that, I was thinking: I thought that's what Kubernetes did. Why do I have to bother with this?
Isn't that what Kubernetes is supposed to do? Isn't it just, install Kubernetes and all of that gets done? Then I had that realization, you know, that encounter with reality: no, I still have to boot up my infrastructure. That doesn't go away because we're doing Kubernetes. Kubernetes is this abstraction layer that sits on top of that infrastructure. Now what? Okay, I can do it manually – TGIK has a great episode where Joe goes and installs everything by hand; he does every single thing, right? Every single step that you need to do, hooking up the networking pieces.
It's brilliant. It really helps you visualize what happens when all of this stuff is brought up, because ideally you're not doing that by hand, which is what we're talking about here. I used, for example, CloudFormation on AWS, with a template that Heptio has in partnership with AWS. There is a template that you can use to bring up a cluster for Kubernetes, and everything's hooked up – the proper networking piece is there.
You have to do that first, and then you have Kubernetes installed as part of that template. But the take-home lesson for me was that I definitely wanted to do it using some sort of automation, because otherwise, I don't have time for that.
[0:23:17.8] DC: Ain’t nobody got time for that, exactly.
[0:23:19.4] CC: Ain't got no time for that. That's not my job; my job is something different. I'm just booting this stuff up to test my software. It's absolutely very handy. And for people who are not working with Kubernetes yet, I just wanted to clarify that there is a separation: one thing is having your infrastructure in place, then you have Kubernetes installed on top of that, and then, you know, you might have your application running in Kubernetes, or you can have an external application that interacts with Kubernetes
as an extension of Kubernetes, right? Which is what the project that I work on is.
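For the CloudFormation route Carlisia describes, the automation is roughly shaped like this – a hedged sketch where the stack name, template URL, and parameters are placeholders, not Heptio's actual template:

```python
# Sketch: template in, cluster infrastructure out. All values are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")

cfn.create_stack(
    StackName="my-k8s-cluster",
    TemplateURL="https://example.com/k8s-cluster-template.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "KeyName", "ParameterValue": "my-ssh-key"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # needed when a template creates IAM roles
)
# CloudFormation now builds the networking and instances the template declares;
# Kubernetes gets installed on top of that as part of the template.
```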
[0:24:01.0] NL: That's a good point, and that's something we should dive into – I'm glad that you brought this up, actually. That's a good example of a cloud native application using cloud native infrastructure. Kubernetes does a pretty good job of that all around, and so there's the idea that Kubernetes itself is a platform. You have a platform as a service – that's kind of what you're talking about – which is,
I just need to spin up Kubernetes and then, boom, I have a platform, and that comes with the infrastructure part of it. There are more and more of these managed Kubernetes offerings coming out that facilitate that function, and those are an aspect of cloud native infrastructure. Those are the managed services I was referring to, where the administrators of the cloud are taking it upon themselves to do all that for you and then manage it for you, and I think that's a great offering for people who just don't want to get into the weeds, or don't want to worry about the management of their application. Some of these, like –
For instance, databases. These managed services are an awesome tool. And going back to Kubernetes a little bit as well, it is a great show of how a cloud native application can work with the infrastructure that it's on. For instance, with Kubernetes, if you spin up a service of type LoadBalancer and it's connected to a cloud properly, the cloud will create that object inside of itself for you, right? A load balancer in AWS is an ELB; it's just a load balancer in Azure; and I'm not familiar with the terms the other clouds use, but they will create these things for you.
I think that's the dopest thing on the planet. That is so cool, where I'm just like: I created it in this tool over here, and it told this other thing how to make that work in reality.
[0:25:46.5] DC: That is so cool. Orchestration magic.
[0:25:49.8] NL: Yeah, absolutely.
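That interaction, sketched with the official Kubernetes Python client; the service name and pod labels here are hypothetical, and a working kubeconfig pointed at a cloud-connected cluster is assumed:

```python
# Declare a Service of type LoadBalancer; on a properly connected cloud the
# provider creates the actual load balancer (an ELB on AWS) to match it.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig is present
core = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="web"),          # hypothetical name
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "web"},                       # hypothetical pod label
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
core.create_namespaced_service(namespace="default", body=service)
```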
[0:25:51.6] DC: I agree, and actually, I kind of wanted to make a point on that as well, which is that – the way I interpreted your original question, Carlisia, was, "What is the difference between these different models?" Infrastructure as code versus, you know, infrastructure as a service versus platform as a service versus containers as a service – what differentiates these things? For my part, I feel like it is effectively an evolution of the API: what the right entry point for your consumer is. So in the former, when we consider container orchestration,
we consider things like Mesos and Kubernetes and technologies like that. We make the assumption that the right entry point for that user is an API in which they can define those things that we want to orchestrate – those containers or those applications – and we are going to provide, in the form of that platform, capabilities like, you know, service discovery, being able to handle networking, and doing all of those things for you, so that all you really have to do is define what needs to be deployed
and what other services it needs to rely on, and away you go. Whereas when you look at infrastructure as a service, the entry point is fundamentally different, right? We need to be thinking about what the infrastructure needs to look like. I might ask an infrastructure-as-a-service API how many machines I have running, what networks they are associated with, and how much memory and disk is associated with each of those machines.
Whereas if I am interacting with a platform as a service, I might ask it questions about the primitives that are exposed by that platform: how many deployments do I have? What namespaces do I have access to? How many pods are running right now, versus how many I asked to be running? Those kinds of questions and capabilities.
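A hedged sketch of those two entry points side by side – machine-shaped questions against an IaaS API, workload-shaped questions against a platform API. AWS credentials and a kubeconfig are assumed, and the region and namespace are placeholders:

```python
# IaaS-level question: what machines do I have, and how big are they?
import boto3
from kubernetes import client, config

ec2 = boto3.client("ec2", region_name="us-west-2")
for reservation in ec2.describe_instances()["Reservations"]:
    for inst in reservation["Instances"]:
        print("machine:", inst["InstanceId"], inst["InstanceType"])

# Platform-level question: how many pods are running right now?
config.load_kube_config()
pods = client.CoreV1Api().list_namespaced_pod(namespace="default")
print("pods running in default:", len(pods.items))
```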
[0:27:44.6] NL: Very good point, and yeah, I'm glad that we explored that a little bit. Cloud native infrastructure is not nothing, but it is close to useless without a properly leveraged cloud native application of some kind.
[0:27:57.1] DC: These APIs all the way down give you this true flexibility and the real functionality that you are looking for in cloud, because as a developer, I don't need to care how the networking works, or where my storage is coming from, or where these things are actually located – the API handles all of these things. I want someone to do it for me, and the API does that. Success. That is what cloud native infrastructure gets you.
Speaking of the things that you don't have to care about: what was it like before cloud? What did we have before cloud native infrastructure? The things that come to mind are things like vSphere. I think vSphere is a bridge between the bare metal world and the cloud native world, and that is not to say that vSphere itself is necessarily cloud native – there are some limitations.
[0:28:44.3] CC: What is vSphere?
[0:28:46.0] DC: vSphere is a tool that VMware created a while back – I think it premiered in the 2000-2001 timeframe – and it was a way to predictably create and manage virtual machines, a virtual machine being a kernel that sits on top of a kernel inside of a piece of hardware.
[0:29:11.4] CC: So is vSphere to virtual machines what Kubernetes is to containers?
[0:29:15.9] DC: Not quite the same thing, I don't think, because fundamentally the underlying technologies are very different. Another way of explaining the difference, so that you have the history: you know, back in the day – the 2000s, the 90s, even currently – what we had was a layer of people who were involved in dealing with physical hardware. They'd rack 10 or 20 servers, and before we had orchestration and virtualization and any of those things, we would actually be installing an operating system and applications on those servers.
Those servers would be webservers, those servers would be database servers, and each would be a physical machine dedicated to a single task, like a webserver or a database. Where vSphere comes in – and XenServer, and then KVM, and those technologies – is that we started to think about this model fundamentally differently. Instead of thinking about it as one application per physical box, we think about each physical box being able to hold a bunch of these virtual boxes that look like virtual machines.
And so now those virtual machines are where we put our application code. I have a virtual machine that is a web server; I have a virtual machine that is a database server. I have that level of abstraction, and the benefit of this is that I can get more value out of those hardware boxes than I could before, right? Before, I had to buy one box for one application; now I can buy one box for 10 or 20 applications. Then we take that to the next level when we get to container orchestration.
We realized, "You know what? Maybe I don't need the full abstraction of a machine. I just need enough of an abstraction to give me just enough isolation between those applications, such that they have their own file system but can share the kernel, and they have their own network but can share the physical network" – enough isolation between them that they can't interact with each other except when intent is present, right?
So we have that sort of level of abstraction. You can see that this is a much more granular level of abstraction, and the benefit is that we are not actually trying to create a virtual machine anymore. We are just trying to effectively isolate processes in a Linux kernel, and so instead of 20 or maybe 30 VMs per physical box, I can get 110 processes all day long, you know, on a physical box. And again, this takes us back to that concept that I mentioned earlier around bin packing.
When we are talking about infrastructure, we have been on this eternal quest to make the most of what we have – to make the most of that infrastructure. You know, what tools and tooling do we need to be able to get the most efficiency for the dollar that we spend on that hardware?
[0:32:01.4] CC: That is simultaneously a great explanation of containers and of how containers compare with virtual machines. Bravo.
[0:32:11.6] DC: That was off the cuff, but that was great.
[0:32:14.0] CC: Now explain vSphere, please – I still don't understand it. I don't know what it does.
[0:32:19.0] NL: So vSphere is the thing that creates the many virtual machines on top of a physical machine. It gives you the capability of having really good isolation between these virtual machines, and inside of these virtual machines you feel like you have a metal box, but it is not a metal box – it is just a process running on a metal box.
[0:32:35.3] CC: All right, so it is a system that holds multiple virtual machines inside the same machine.
[0:32:41.8] NL: Yeah. So think of it in the cloud native infrastructure world: vSphere is essentially the API, or you could think of it as the cloud itself – in a sense it's an AWS, but for your datacenter. The difference being that there isn't a particularly useful API that vSphere exposes, so it is harder for people to leverage it programmatically, which makes it difficult for me to call vSphere a cloud native tool.
It is a great tool, and it has worked wonders – and still works wonders – for a lot of companies throughout the years, but I would hesitate to lump it into cloud native functionality. So prior to cloud native or infrastructure as a service, we had these tools like vSphere, which allowed us to carve smaller and smaller compute resources out of a larger compute resource. And going back to something we were talking about – containers, and how you spin up these processes –
prior to containers, in this world of VMs, a lot of times what you would do is create a VM that had your application already installed in it. It is burnt into the image, so that when that VM stood up, it would spin up that process. That would be one way that you would start processes. The other way would be through orchestration tools similar to Ansible, before Ansible existed – or Ansible itself – that essentially just ran SSH against a number of servers to start up these processes, and that is how you'd get distributed systems prior to things like cloud native and containers.
[0:34:20.6] CC: Makes sense.
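That pre-container pattern, reduced to a sketch: loop over a fleet and start the baked-in process over SSH, which is roughly the loop tools like Ansible automated. The hostnames and the service name here are hypothetical:

```python
# Old-school process orchestration: SSH to each server and start the service.
import subprocess

SERVERS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # hypothetical fleet

for host in SERVERS:
    subprocess.run(
        ["ssh", f"admin@{host}", "systemctl start mywebapp"],  # hypothetical unit
        check=True,
    )
```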
[0:34:21.6] NL: And before we had vSphere, we already had XenServer. Before we had virtual machine automation, which is what these tools are, we had bare metal. We just had joes like Duffie and me cutting our hands on rack equipment. A server was never installed properly unless my blood was on it, essentially, because they are heavy, and it is all metal, and it's sharp in some capacity, so you would inevitably squash a hand or something.
And so you'd rack up the server, and then you'd plug in all the things, plug in all the plugs, and then set it up manually, and then it was good to go, and someone could use it at will – or in a logical manner, hopefully. That is what we had to do before. It's like, "Oh, I need a new compute resource? Okay, well, let's call Circuit City or whoever, let's call Newegg, and get a new server in," and then there was a process for that. And yeah – I don't know, at least I'm not dying of blood loss anymore.
[0:35:22.6] CC: And still a lot of companies are using bare metal as a matter of course, which brings up another question one would want to ask, which is: is it worth it these days to run bare metal if you have the clouds, right? One example: we see companies like Uber and Lyft, all of these super-high-volume companies, using public clouds, which is to say paying another company for all the traffic of data, storage of data, and compute, and security, and everything.
And you know, one could say you would save a ton of money having all of that in-house on bare metal, and other people would say no way, it costs too much to host all of that yourself.
[0:36:29.1] NL: I would say it really depends on the environment. So usually I think that if a company is using only cloud native resources to begin with, it is hard to make the transition into bare metal because you are used to these tools being in place.
A company that is more familiar with it – like they came from a bare metal background and moved to the cloud – may leverage both in a hybrid fashion well, because they should have tooling, or they can now create tooling, so that their bare metal environment can mimic the functionality of the cloud native platform. It really depends on the company, and also on the need for security and data retention and all of these things. If you need that granularity of control, bare metal might be a better approach, because if you need to make sure that your data doesn't leave your company in a specific way, putting it on somebody else's computer probably isn't the best way to handle that.
So there is a balancing act: how many resources are you using in the cloud, how are you using the cloud, and what does that cost look like, versus what does it cost to run a data center – to have the physical real estate, and then run the electricity, the HVAC, and people's jobs, their salaries, specifically to manage that location and your compute and network resources and all of these things. That is a question that each company will have to ask.
There is normally no hard-and-fast answer, but for many smaller companies, something like the cloud is better, because you don't have to think of all these other costs associated with running your application.
[0:38:04.2] CC: But then if you are a small company – I mean, if you are a small company it is a no-brainer; it makes sense for you to go to the clouds – but then you said that it is hard to transition from the clouds to bare metal.
[0:38:17.3] NL: It can be, and it really depends on the people that you have working for you. If the people you have working for you are good at creating automation and good at managing infrastructure of any kind, it shouldn't be too bad. But as we're moving more and more into a cloud-focused world, I wonder if those people are going to start going away.
[0:38:38.8] CC: For people who are just listening on the podcast, Duffie was heavily nodding as Nick was saying that.
[0:38:46.5] DC: I was. I completely agree with the statement that it depends on the people that you have, right? And fundamentally, the point I would add to this is that a company like Uber or a company like Lyft – how many people do they employ that are focused on infrastructure, right? I don't know the answer to this question, but I am positing it. And this is a pretty hot market for people who understand infrastructure,
so they are not going to have a ton of them. So what specific problems do they want those people that they employ, who are focused on infrastructure, to solve, right? And we are thinking about this as a scale problem. I have 10 people that are going to manage the infrastructure for all of the applications that Uber has – maybe I have 20 people – but it is going to be a small percentage of the people compared to the number of people that I have developing those applications, or doing marketing, or whatever else within the company.
I am going to quickly find myself off balance between the number of applications that I need to actually support – that I need to operationalize and get deployed, right? – and the number of people that I have doing the infrastructure work to provide the base layer on which those applications will be deployed. I look around in the market and I see things like orchestration; I see things like Kubernetes and Mesos and technologies like that.
They can provide kind of a multiplier – or even AWS or Azure or GCP. I have these things act as a multiplier for those 10 infrastructure people that I have, right? Because now they are not trying to figure out how to store, you know, all this data. They are not worried about data centers, they are not worried about servers, they are not worried about networking, right? You can actually have ten people that are decent at infrastructure who can spin up very large amounts of infrastructure, to satisfy the development and operational needs of a company the size of Uber, reasonably.
But if we had to go back to a place where we had to do all of that in the physical world – rack those servers, deal with the power, deal with the cooling and space, deal with all the cabling and all of that stuff – 10 people are probably not enough.
[0:40:58.6] NL: Yeah, absolutely. To put it into numbers: if you are in the cloud native workspace, you may need 1% of your workforce dedicated to infrastructure, but if you are in the bare metal world, you might need 10 to 20% of your workforce dedicated just to running infrastructure, because the overhead of people's work is so much greater, and a lot of it is focused on things that are tangible instead of things that are fundamental, like automation, right?
So that 1% at Uber – and I am pulling this number totally out of nowhere – but that one percent, their job usually focuses on automating the allocation of resources and figuring out the tools that they can use to better leverage those clouds. In the bare metal environment, those people's jobs are more like, "Oh crap, suddenly we are at 90 degrees in the data center in San Jose, what is going on?" and then having someone go out there and figure out what physical problem is going on.
That is their day-to-day lives. And something I want to touch on really quickly as well with bare metal: prior to bare metal, we had something kind of interesting that we were able to leverage from an infrastructure standpoint, and that is mainframes – and we are actually going back a little bit to the mainframe idea. The idea of a mainframe is essentially – it is almost like a cloud native technology, but not really. Instead of, like in the cloud, being able to spin up X number of resources and networking and make that work,
with the mainframe, how it worked is that everyone used the same compute resource. It was just one giant compute resource. There was no networking, natively, because everyone was connected to the same resource, and they would just do whatever they needed – write their code, run it, be done with it, and then come off. It was a really interesting idea, I think, and the cloud almost mimics the mainframe idea, where everyone just connects to the cloud,
does whatever they need, and then comes off – but at a different scale. Yeah, Duffie, do you have any thoughts on that?
[0:43:05.4] DC: Yeah, I agree with your point. I think it is interesting to go back to mainframe days, and I think the difference between what a mainframe is and what the idea of a cluster and those sorts of things are is kind of the domain of what you're looking at. A mainframe considers everything to be within the same physical domain, whereas when you start getting into the larger architectures, or some of the more scalable architectures, you find that, just like in any distributed system, we are spreading that work across a number of physical nodes.
And so we think about it fundamentally differently, but it is interesting – the parallels between the work that we are doing today and what we were doing in mainframe times.
[0:43:43.4] NL: Yeah, cool. I think we are getting close to wrap-up time, but something that I wanted to touch on rather quickly: we have mentioned these things by name, but I want to go over some of the cloud native infrastructures that we use on a day-to-day basis. Something that we did mention before: Amazon's AWS is pretty much the number one cloud, I'm pretty sure, right? It has the most users, and they have a really well-structured API,
a really good CLI, and a GUI – sometimes there are some problems with it – and it is the thing that people think of when they think cloud native infrastructure. Going back to that point again – agreed, AWS is one of the largest cloud providers and has certainly the most adoption as a base for cloud native infrastructure, and it is really interesting to see a number of solutions out there. IBM has one; GCP, Azure – there are a lot of other solutions out there now
that are really focused on trying to follow the same – maybe not the same exact patterns that AWS has, but certainly providing this consistent API for all of the same resources, or all of the same services, and maybe some differentiating services as well. So, yeah, you can definitely tell it is sort of follow-the-leader. You can definitely tell that AWS stumbled onto a really great solution there, and that all of the other cloud providers are jumping on board trying to get a piece of that as well.
[0:45:07.0] DC: Yep.
[0:45:07.8] NL: And also, something we can touch on a little bit as well: from a cloud native infrastructure standpoint, it isn't wrong to say that a bare metal data center can be a cloud native infrastructure. As long as you have the tooling in place, you can have your own cloud native infrastructure, your own private cloud essentially – and I know that "private cloud" doesn't actually make any sense, that is not how cloud works – but you can have a cloud native infrastructure in your own data center. It just takes a lot of work.
And it takes a lot of management, but it isn't something that exists solely in the realm of Amazon, IBM, Google or Microsoft and – I can't remember the other names – you know, the others that are running clouds as well.
[0:45:46.5] DC: Yeah, agreed. And actually, one of the questions you asked earlier, Carlisia, that I didn't get a chance to answer was: do you think it is worth running bare metal today? In my opinion, the answer will always be yes, right? Especially if the line that we draw in the sand is container isolation or container orchestration, then there will always be a good reason to run on bare metal – to basically expose resources that are available on a single bare metal instance.
Things like GPUs or other physical resources like that – or maybe we just need really, really fast disk, and we want to make sure that we provide those containers access to the SSDs underneath. And there is technology, certainly introduced by VMware, that exposes real hardware from the underlying hypervisor up to the virtual machine where these particular containers might run. But you know, I always come back to the idea that, when thinking about those levels of abstraction that provide access to resources like GPUs and those sorts of things, you have to consider that simplicity is king here, right?
As we think about the different fault domains, or the failure domains, as we are coming up with these sorts of infrastructures, we have to think about what it would look like when they fail, or how they fail, or how we actually go about allocating those resources to particular machines, and that is why I think that bare metal and technologies like that are not going away.
I think they will always be around. But to the next point, and as I think we covered pretty thoroughly in this episode: having an API between me and the infrastructure is not something I am willing to give up. I need that to be able to actually solve my problems at scale – even reasonable scale.
[0:47:26.4] NL: Yeah, you mean you don't want to go back to the bad old days of telnetting into a Juniper switch and setting up your IP – not iptables, it was your ifconfig commands.
[0:47:39.9] DC: Did you just say Telnet?
[0:47:41.8] NL: I said Telnet, yeah – or serial, a serial connection into it.
[0:47:46.8] DC: Nice, yeah.
[0:47:48.6] NL: All right, I think that pretty much covers it from a cloud native infrastructure. Do you all have any finishing thoughts on the topic?
[0:47:54.9] CC: No, this was great. Very informative.
[0:47:58.1] DC: Yeah, I had a great time. This is a topic that I very much enjoy. Things like Kubernetes and the cloud native infrastructure that we exist in now are always what I wanted us to get to. When I was in university, I was like, "Oh man, someday we are going to live in a world with virtual machines" – and I didn't even have the idea of containers – where people could really easily play with applications. I was amazed that we weren't there yet, and I am so happy to be in this world now.
Not to say that I think we can stop – we need to keep improving, of course. We are not at the end of the journey by far, but I am so happy with where we're at right now.
[0:48:34.2] CC: As a developer, I have to say I am too, and I had this thought in my mind as we were having this conversation – I am so happy that we are where we are. Obviously not everybody is there yet, but as people start practicing cloud native development, they will come to realize what it is that we are talking about. I mean, as I said before, I remember the days when, for me to get access to a server, I had to file a ticket and wait for somebody's approval.
Maybe I wouldn't get an approval – and when I say me, I mean my team; one of us would do the request. Then you had that server, and everybody was pushing to that server, like one main version of our app – maybe every day we would get a new fresh copy. The way it is now, I don't have to depend on anyone, and yes, it is a little bit of work to go through the chore of putting up the clusters, but it is so good that it is just for me.
[0:49:48.6] DC: Yeah, exactly.
[0:49:49.7] CC: I am selfish that way. No, seriously, I don't have to wait for code to merge and be pushed. It is my code that is right there, sitting on my computer. I am pushing it to the cloud – boom, I can test it. Sometimes I can test it on my machine, but some things I can't, so when I am doing volume backups I have to push to the cloud provider, and it is amazing. It is so – I don't know, I feel very autonomous, and that makes me very happy.
[0:50:18.7] NL: I totally agree, for that exact reason. Like, testing things out – maybe something isn't ready for somebody else to see, it is not ready for prime time, so I need somewhere really quick to test it out. But also, for me – I am the avatar of unwitting chaos, meaning basically everything I touch will eventually blow up. So it is also nice that whatever I do that's weird isn't going to affect anybody else either, and that is great. Blast radius is amazing.
All right, so I think that pretty much wraps it up for the cloud native infrastructure episode. I had an awesome time – this is always fun – so please send us your concepts and ideas in the GitHub issue tracker. You can see our existing episodes and suggestions, and you can add your own, at github.com/heptio/thecubelets: go to the issues tab and file a new issue, or see what is already there. All right, we'll see you next time.
[0:51:09.9] DC: Thank you.
[0:51:10.8] CC: Thank you, bye.
[END OF INTERVIEW]
[0:51:13.9] KN: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter https://twitter.com/ThePodlets and on the https://thepodlets.io website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing.
[END]
-
Welcome to the fourth episode of The Podlets podcast! Today we speak to the topic of observability: what the term means, how it relates to the process of software development, and the importance of investing in a culture of observability. Each of us has a slightly different take on what exactly observability is, but roughly we agree that it is a set of tools that you can use to observe the interactions and behavior of distributed systems. Kris gives some handy analogies to help understand the growing need for observability due to rising scale and complexity. We then look at the three pillars of observability, and what each of these pillars looks like in the process of testing and running a program. We also think more about how observability applies to the external problems that might arise in a system. Next up, we cover how implementing observability in teams is a cultural process, and how it is important to have a culture that accepts the necessity of failure and of extensive time spent problem-solving in coding. Finally, the conversation shifts to how having a higher culture of observability can do away with the old problem of calling the dinosaur in the team who knows the code backward every time an error crops up.
Note: our show changed name to The Podlets.
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Kris Nóva
Duffie Cooley
Key Points from This Episode:
• Duffie and Kris's different interpretations of observability.
• Why we should bake observability into applications before a catastrophic failure.
• Observability is becoming more necessary due to scale and complexity.
• New infrastructures require new security systems.
• Observability is a term for new ways of observing code to catch failures as they happen.
• The three pillars of observability: events, metrics, and traceability.
• How events, metrics, and traceability play out in an example of a WordPress blog.
• Why metrics and events are necessary for observing patterns in problems.
• Measuring time series data and how it is managed in a similar way to git deltas.
• Why the ephemerality of events in cloud-native architectures urges a new way of thinking.
• Countering exterior application issues such as a hard drive getting bumped.
• The role of tracing in correlating internal and external issues with a system.
• Tracing is about understanding all the bits that are being touched in a problem.
• Kubernetes can be broken down into three things: compute, network and storage.
• How human experience is a major factor in good observability.
• The fact that embracing observability and chaos engineering is a cultural practice.
• Understanding observability and chaos testing through the laser metaphor.
• The more valuable the application, the higher the need for observability.
• The necessity for a cultural turn toward seeing the importance of observability.
• Seeming bad at debugging vs convincing teams to implement observability.
• The value of having empathy for how the difficulty of software engineering.
• Developing more intuition by spending time debugging.
• The way automated observability tools can possibly help with developing intuition.
• How observability and having common tools removes or normalizes the problem of 'the guy'
Quotes:
“Building software is very hard and complex, so if you are not making mistakes, you either are not human or you are not making enough changes.” — @carlisia [0:33:37]
“Observability is just a fancy word for all of the tools to help us solve a problem.” — @krisnova [0:23:09]
“You’ll be a better engineer for distributed systems if you are in a culture that is blameless, that gives you tools to experiment, and gives you tools to validate those experiments and come up with new ones.” – @mauilion [0:36:08]
Links Mentioned in Today’s Episode:
Velero — https://www.velero.io
Cloud Native Infrastructure — https://www.amazon.com/Cloud-Native-Infrastructure-Applications-Environment/dp/1491984309
Distributed Systems Observability — https://www.goodreads.com/book/show/40182805-distributed-systems-observability
Transcript:
EPISODE 04
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:40.5] CC: Hi everyone, welcome back to episode four. Today we're going to talk about observability. I am Carlisia Campos. Here on the show with me today is Duffie Cooley.
[0:00:52.7] DC: How are you doing, folks? I'm Duffie Cooley, I'm a staff field engineer here at VMware, and I'm looking forward to this topic.
[0:00:58.9] CC: Also with us is Kris Nova.
[0:01:03.2] KN: Hey everyone, I'm Kris Nova. I'm a developer advocate. I code a lot. I hang out in Kubernetes.
[0:01:09.7] CC: I don't want to be left out. I'm an engineer on the open source project called Velero, which does backup and recovery for your Kubernetes applications. So, observability – why do we care?
[0:01:25.6] KN: That’s the million-dollar question right there honestly.
[0:01:28.0] DC: It sure is.
[0:01:30.3] KN: I don’t know, I have a lot of thoughts on observability. I feel like it’s one of those words that it’s kind of like dev ops. It depends which day of the week you ask a specific person, what observability means that you’ll get a different answer.
[0:01:43.3] DC: Yeah, I agree with that. It seems like it’s one of those very hot topics. I mean, it feels like people often conflate the idea of monitoring and logging of an application with the idea of observability and what that means. I’m looking forward to kind of digging into the details of that.
[0:01:59.9] KN: What does observability mean to you, Duffie?
[0:02:04.0] DC: In my take, observability is a set of tools that can be applied to describe the way that data moves through a distributed system. Whether that data is a particular request or a particular transaction, in this way, you can actually understand the way all of these distributed parts of this system that we’re building are actually interacting. As you can imagine, things like monitoring and metrics are a part of it, right?
Like being able to actually understand how the code is operating for this particular piece of the system, it’s definitely a key part of understanding how that system is operating but when we think of it as a big distributed system with terrible network demons in between and lots of other kind of stuff in between. I feel like we need kind of a higher level of context for what’s actually happening between all those things and that’s where I feel like the term observability fits.
[0:02:55.5] KN: Yeah, I think I generally agree with that. I've got a few nuances that I'd like to pick at, and I have strong opinions, but yeah – I mean, I hear a lot about it, I have my own ideas on what it means, but why do we need it?
[0:03:06.2] DC: I want to hear your idea to what it is, how would you define it?
[0:03:08.8] KN: I mean, we have an hour to listen to me rant about observability. Basically, okay, I'm an infrastructure engineer. I wrote this book, Cloud Native Infrastructure; everything to me is some layer of software running on top of infrastructure, and observability to me solves this problem of how I gain visibility into something that I want to learn more about. I think my favorite analogy for observability – have you all ever been to, like, you know, a gas station or a convenience store?
On the front door, it’s like a height scale chart, it will say like four feet, five feet, six feet, seven feet. I always wondered what that was for and I remember I went home one day and I googled it and it turns out, that’s actually for, if the place ever gets robbed as the person runs out the front door, you get a free height measurement of how tall they are so you can help identify them later.
To me, that’s like the perfect description of observability. It’s like cleverly sneaking and things into your system that can help you with a problem later downstream.
[0:04:10.0] DC: I like that.
[0:04:11.0] KN: Yeah.
[0:04:12.5] CC: Observability is sort of a new term because it’s not necessarily something that I, as a developer would jump in and say, “Gee, my project doesn’t do observability. I need it.” I understand metrics and I understand logging, monitoring. Now I hear observability. Of course I read about it to talk about it on the show and I have been running into this word everywhere but why are people talking about observability? That’s my question.
[0:04:45.8] KN: Yeah. Well, I think this kind of goes back to the gas station analogy again, right? What do you do when your metaphorical application gets robbed? What happens in the case of a catastrophic problem and how do you go about preparing yourself the best way possible to have an upper hand at solving that problem?
[0:05:04.6] DC: Yeah.
[0:05:05.2] KN: Right? You know, some guy robbed a store and ran out the front door and we realized, “We have no idea how tall he is, he could be four feet tall or he could be six feet tall.” You know, we learn the hard way that maybe we should start putting markers on the door. I feel like observability is the same thing but I feel like people just kind of wake up and say like, “I need observability. I’m going to go and you know, I need all of these bells and whistles because my application of course is going to break,” and I feel like in a weird way that’s almost a cop out.
We should be working on application before we work on preparing for catastrophic failure.
[0:05:37.6] CC: Why didn’t I hear the word observability 10 years ago or even five years ago? I think it’s about two years ago.
[0:05:45.9] DC: I'll argue that the term observability is coming up more frequently, and it's certainly a hot topic today, because of, effectively, context – it still comes back down to context. When you're in a situation where you've built a cloud native architecture for your application, you've got a bunch of different services that are intercommunicating, or maybe all communicating with some shared resource, and things are misbehaving, you're going to need to have the context to be able to understand how it's breaking, or at what point it's breaking, or where in the tangled web that we've woven the problem is actually occurring – and can we measure it at that point?
Traditionally, like in a monolithic architecture, you're not really looking at that. You look at the monolith, you set up a couple of set points, you look at the way particular code paths work, or, if you're on top of the game, you might instrument your code in such a way that it will emit events when particular transactions or particular things happen.
You're going to be looking at those events and logs, and looking at metrics, to figure out how this one application is performing or behaving. With observability, we have to solve that problem across many systems.
[0:06:54.2] CC: That is why I put on the show notes that it has something to do with the idea of cattle versus pets. I'm saying this because Duffie was asking me before we started recording why that was on the show notes. Correct me if I'm wrong, but I think you were going in the direction of saying you don't see the relation – and the relation that I was thinking about was exactly what you just said.
If I have a monolith, I'm looking at one thing; we're both looking at one log. I can treat that as my little pet, as opposed to when I have many microservices interacting – I can't treat them that way even if I wanted to, right? Because I can't; it's too much. The reason why observability is necessary sounds to me like a problem of scale and complexity.
[0:07:46.2] KN: Yeah, I think that explains why we're just now hearing it, too, right? I'm trying to think of another metaphor here – I guess today's going to be a metaphor day for me. Got it, okay. I just got back from London last week. I had gotten off the tube, and I remember I came up to the surface, blinding light in my eyes, and all of a sudden it's all signs for Scotland Yard, and I was like, "Wow, I remember this from all the detective sleuth stories of my childhood." It dawned on me that the entire point of this part of London was there to help people recover from disasters.
I thought about why we don’t have Scotland Yard type places anymore and it’s because we have security systems and we have like different things in place that we had to kind of learn the hard way we needed and we had to develop technology to help make that easier for us and I feel like we’re just kind of at that cusp of like our first wave of security cameras. Metaphorical security cameras with observability.
We’re at that first wave of, we can instrument our code and we can start building our systems out with this idea of, “I want to be able to view it or observe it over time in the case of trying to learn more about it or debugging a problem.”
[0:08:52.1] DC: Yeah.
[0:08:52.9] CC: How do people handle – I’m asking this question because, truly, I have yet to have this problem in my projects — needing to do observability, needing to make sure my project is observable. I mean, other than the bread and butter metrics and logging, that’s what we do at Velero. We don’t do anything further than that.
But I don’t know if those are the things that constitute observability. Given what Nova just said, my question is: we want to look at this stuff later, but we’re also talking about cattle and these things. Supposedly, your servers are ephemeral. They can go away and come back. How do we observe things if they have gone away?
[0:09:41.1] KN: Yeah. That’s where we get into this exciting world of how long do we persist our data and which data do we track? There are a lot of schools of thought and a lot of different opinions around what the right solution here is, but I think it boils down to: every application and set of concerns is going to be unique, and you’re just going to have to give it some thought.
[0:10:01.9] CC: Should we talk more about that? Because that sounds very interesting.
[0:10:04.4] KN: Yeah. I mean, I guess we should probably just start off with like, given a simple application, concretely, what does it mean to build out ‘observability’ for that application?
[0:10:18.3] DC: There’s this idea in a book called Distributed Systems Observability by Cindy Sridharan — I’m probably slaughtering her name — where she wrote that there are these three pillars, and the three pillars are events, metrics and traceability, or tracing. Events, metrics and tracing. These are the three pillars of observability. If we were going to lay out the way that those things might apply to just any old application, like a monolith, then we might look at how –
[0:10:45.3] KN: Can we just use a WordPress blog as an example? It’s got a data store, it’s got a thin layer of software and an API.
[0:10:53.0] DC: Sure, like a WordPress app. The first thing we try to do is figure out what events we would like to get from the application, and figure out how to instrument our application such that we’re getting useful data back in the event stream. Frequently — or in my experience — the things that you want to instrument in your application are calls that your application is going to make that might represent a period of time, right?
If it’s going to make a call to an external system, that’s something that you would definitely want to emit an event for if you’re trying to understand where things are going sideways — like how long it took to actually make a query against the database in the back of a WordPress blog. It’s a great example, right?
[0:11:33.5] KN: Question, you said the word instrumentation. My understanding of instrumentation is there’s a bit of an art to it, and you’re actually going in and adding lines of code to your application, such that on line 13 we say ‘starting transaction’, on line 14 we make an HTTPS transaction, and on the next line we have ‘the event is now over’, and we can sort of see that and discover that we made this HTTPS transaction and see where it broke, if it broke at all. Am I thinking about that right?
[0:12:05.2] DC: I think you are, but what’s interesting is the reporting on line 14, right? Where you’re actually saying the event is over. I think that we end up actually measuring this both in an event stream and also in a metric, right? So that we can understand, of the last hundred transactions to the database, are we seeing any increase in the amount of time the process takes? Is this something that we can measure with metrics and understand — is this value changing over time? And then from the event perspective, that’s where we start tying in things like, contextually, in this transaction, what happened, right?
In this particular event, is there some way that we can correlate the event with perhaps a trace — and we’ll talk a little bit more about tracing too — so that we can understand, “Okay, well, at 2:00 we see that there is an incredible amount of latency being introduced when my WordPress blog tries to write to the database, and it happens every day at 2:00. I need to figure out what’s happening there.”
And to even get to the point where I understand it’s happening at 2:00, I need things like metrics, and I need things like the events, specifically to give me that time correlation to understand that it’s at two.
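To make the instrumentation being described here concrete, below is a minimal sketch in Go of the two views Duffy distinguishes: timing a database call, recording the duration in a Prometheus histogram (the metric view) and emitting a structured log line (the event view). The metric name and the saveComment helper are invented for the WordPress example; only the Prometheus client library calls are real.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram for the WordPress-style example: it buckets the
// observed durations so we can ask "are the last N queries getting slower?"
var dbQueryDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "blog_db_query_duration_seconds", // assumed metric name
	Help: "Time spent writing a comment to the database.",
})

func init() {
	prometheus.MustRegister(dbQueryDuration)
}

// saveComment is a hypothetical helper wrapping the call we want to observe.
func saveComment(db *sql.DB, comment string) error {
	start := time.Now()
	_, err := db.Exec("INSERT INTO comments (body) VALUES (?)", comment)
	elapsed := time.Since(start)

	// Metric: one data point in a time series we can graph and alert on.
	dbQueryDuration.Observe(elapsed.Seconds())

	// Event: a log line carrying the per-transaction context.
	log.Printf("event=db_insert table=comments duration=%s err=%v", elapsed, err)
	return err
}
```

Graphing that histogram over a day is exactly what would surface the "every day at 2:00" latency spike mentioned above, while the event log supplies the per-transaction context around it.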
[0:13:18.7] KN: This is where we get into what Carlisia just asked about, which was: how do we solve the problem of what do we do when it goes away? In the case of our 2 PM database latency — for lack of a better term, let’s just call it the heartbeat, the 2 PM heartbeat — what happens when the server where we’re experiencing that latency mysteriously goes away? Where does that data go? Then you look at tools like Prometheus — I know Elasticsearch has the capability to do this too — and you look at how we start managing time series data, how we start tracking it and recording it.
It’s a fascinating problem, because you don’t actually record, “At 2 PM, to this second and this fraction of a second, this thing happened.” You record how long it’s been since the previous event. You’re just constantly measuring deltas. It’s the same way that git works: every time you do a git commit, you don’t actually write all 1,000 lines of software, you just write the one line that changed.
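As a rough illustration of the delta idea (a simplification — a real TSDB like Prometheus layers further compression on top of this), here is a hedged sketch of delta-encoding a series of event timestamps so that only the gaps between samples are stored:

```go
package main

import "fmt"

// deltaEncode stores the first timestamp as-is and every subsequent
// sample as the gap since the previous one. Small, repetitive gaps
// compress far better than full absolute timestamps.
func deltaEncode(timestamps []int64) []int64 {
	if len(timestamps) == 0 {
		return nil
	}
	out := make([]int64, len(timestamps))
	out[0] = timestamps[0]
	for i := 1; i < len(timestamps); i++ {
		out[i] = timestamps[i] - timestamps[i-1]
	}
	return out
}

func main() {
	// Unix-millisecond samples arriving roughly every 15 seconds.
	ts := []int64{1577948400000, 1577948415012, 1577948430003}
	fmt.Println(deltaEncode(ts)) // [1577948400000 15012 14991]
}
```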
[0:14:15.6] DC: Yeah. I think both of you highlighted a really good point around this little cattle versus pets thing. This is actually something that I spent a little time with in a previous life, and the challenge is that, especially in systems like Kubernetes, perhaps your application is being scaled out or scaled down dynamically based on load.
You have all of these ephemeral events — events from pods or from particular instances of your application that are not going to be long-lived. This highlights a new problem that we have to solve, I think, when we start thinking about cloud native architectures: we have to be able to correlate that particular application with information that gives us the context to understand that, perhaps, this was this version of this application, and these events are related to that particular version of the app.
When we made a change, we saw a great reduction in the amount of time it takes to make that database call, and we can correlate those new metrics with the new version of the app. And because we don’t have a single long-lived entity that we can measure — this isn’t a single IP and a single piece of software that is not changing —
— this is any number of instances of our application deployed. It makes you have to think fundamentally differently about this problem and about how you store that data. This is where that cardinality problem that you’re highlighting comes in.
[0:15:45.2] KN: Yeah. Okay, I have a question — an open question for the group. What is the scope here? To build on our WordPress analogy: let’s say that every day at 2 PM we notice this latency, and we’ve spent the last two weeks endlessly digging through our logs and trying to come up with some sort of hypothesis about what’s going on, and we just can’t find anything.
Everything we’ve talked about so far has been at the application layer of the stack — instrumenting our application, debugging our application, making its HTTPS requests. What should we do — or does observability even care — if one of our hard drives is failing every day at 2 PM when the cleaning service comes by and accidentally bumps into it or something? How are we going to start learning about these deeper problems that might exist outside of our application layer? Because in my experience, those are the problems that really stick with you and really cause a lot of trouble.
[0:16:40.0] DC: Yeah, agreed. Or somebody has scheduled a backup of your database every day at two, and that’s what locks the database for the period of the backup, and you’re like, “Wait, when did that happen? Why did that happen?”
[0:16:51.4] KN: Somebody commented out a line in the crontab, and then the server got reset, and there’s some magical bash script somewhere on the server that goes and rewrites the crontab. Who knows?
[0:17:00.9] DC: Yeah, these are the needles in the haystack that we’ve all stumbled upon one way or another.
[0:17:05.1] KN: Yeah, does observability, like, are we responsible for instrumenting like the operating system layer, the hardware layer?
[0:17:14.0] CC: Isn’t that what monitoring is — some sort of testing from the outside, an external testing that, of course, only gives us the information after the fact, right? The server already died. My application’s already not available, so now I know.
[0:17:31.8] KN: Yeah.
[0:17:32.4] CC: But isn’t monitoring what would address a problem like that?
[0:17:36.1] DC: I think it definitely helps. I think what you’re digging at, Kris, is correlation — being able to actually identify, at a particular period of time, what’s happening across our infrastructure, not just in our application. And the important part is how you even got to that time of day. How do you know that this is happening? When you’re looking for those patterns, how did you get to the point where you knew that it was happening at 2:00?
If you know that it’s happening at 2:00 because of the event stream, per se, that actually gives you a time correlation. Now you can look at it: “Okay, well, now I have a time, and now I need to scoot back to a macro level and see” –
[0:18:10.7] KN: Crank it up at 2 PM.
[0:18:12.9] DC: Yeah. Globally at 2:00, what’s going on in my world, right? I know that these are the two entities that are responsible. I know that I have a bunch of pods that are running on this cluster, and I know that I have a database that may be external to my cluster or maybe on the cluster. I need to understand what’s happening in the world around those two entities as it correlates to that period of time, to give me enough context to even troubleshoot.
[0:18:39.7] CC: How do you do it though? Because I’m still going to go back to the monitoring. I mean, if I’m using an external service to ping my service and my service is down, yeah, I’m going to get the timing — right, I can go back and look at the information, the log stream. Would I know that it was because of the server? No. But should I be pinging the server too? Should I ping every layer of the infrastructure? How do people do that?
[0:19:05.4] KN: Yeah, that’s kind of what I was alluding to: where does observability at the application level stop and systems observability across the entire stack start? What tools do we have, and where are the boundaries there?
[0:19:20.3] DC: I think this is actually where we start talking about the third pillar that we were referring to earlier, which is tracing: the ability to understand, from the perspective of a particular transaction across the system, what entities that transaction will touch and where it spends its time across the entire transaction. So in my query — what I was trying to do was, say, submit a comment on a WordPress blog. If I had a way of implementing tracing through that WordPress blog, I might be able to leave myself little breadcrumbs throughout the entire set of systems and understand, well, where in this particular web transaction am I spending time? I might see that from the load balancer, I begin my trace ID, and that load balancer terminates to this pod, and inside of that pod, I can see where I’m spending my time.
A little bit of time to load the assets and stuff, a little bit of time for pushing my comment to the database — and identifying what that database is, is the important part of that trace. I need to know where that traffic is going to go next and how much time I spent in that transaction. Again, this is down to that code layer: we should have some way of producing an event that is related to a particular trace ID, so that we can correlate the entire life cycle of that transaction — that unique trace ID — across the entire process.
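One common way to implement the breadcrumbs described here is with spans that share a trace ID. Below is a minimal, hedged sketch using the OpenTelemetry Go API; the service and span names are invented for the WordPress example, and setting up an exporter and tracer provider is omitted for brevity (without one, the default tracer is a no-op).

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
)

// handleComment sketches a traced "submit a comment" transaction.
// Each Start call creates a child span; all spans share one trace ID,
// which is the breadcrumb that ties the whole transaction together.
func handleComment(ctx context.Context, body string) {
	tracer := otel.Tracer("wordpress-blog") // assumed instrumentation name

	ctx, reqSpan := tracer.Start(ctx, "http.request")
	defer reqSpan.End()

	// Child span: time spent loading page assets.
	_, assetSpan := tracer.Start(ctx, "load.assets")
	assetSpan.End()

	// Child span: time spent writing the comment to the database.
	_, dbSpan := tracer.Start(ctx, "db.insert_comment")
	// ... the actual database call would happen here ...
	dbSpan.End()

	// The trace ID is what events and logs would be correlated against.
	log.Printf("trace_id=%s", reqSpan.SpanContext().TraceID())
}
```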
[0:20:45.2] KN: Interesting.
[0:20:45.7] DC: It helps us narrow the field to understand what all the bits are that are actually being touched that are part of the problem. Otherwise, we’re looking at the whole world and like obviously, that’s a much bigger haystack, right?
[0:21:00.4] KN: One of the things that I’ve learned about Kubernetes, as I’ve been working with it, explaining it to people, and doing public speaking, is that it’s very easy for users to understand Kubernetes if you break it down into three things: compute, network and storage.
What I’m getting at here is that the application layer is probably going to be more relevant to the compute layer — that’s observability. Storage is going to be more monitoring: what is my system doing, where am I storing my data? And then network is kind of related to tracing, what you’re looking at here. These aren’t necessarily one to one, but there’s kind of a distribution of concerns here.
Am I thinking about that kind of the same way you are, Duffy?
[0:21:42.8] DC: I think you are. What I’m trying to get to is, I’m trying to identify the tools that I need to be able to understand what’s happening at 2:00 and all of the players involved in that, right? For that, I’m actually relying on tools that are pretty normal, like the ability to monitor all the systems, with real timestamps and information that describes, you know, that node or server — something that says my backup for my SQL database starts at 2:00 and ends at 2:30.
I’m relying on things like an event stream to give me some context of time — when my problem is happening — and I’m relying on things like tracing, perhaps, to narrow the field, so that I can understand what’s happening with this particular transaction and which systems I should be looking at. Whether there’s a bunch of time being spent on the network — so what’s going on with the network at 2:00? — or a bunch of time being spent persisting data to a database — so what’s going on with the database? This kind of gives me, I think, enough context to actually get into troubleshooting mode, right?
[0:22:50.2] KN: Yeah, and I don’t want to take away from this lovely definition you just dropped on us, but I want to take a stab at summarizing it. So observability spans the whole stack. If you look at the OSI reference model, it’s going to cover every one of those layers, and all it really is, is a fancy word for all of the tools that help us solve a problem. Sorry, I’m not trying to take away from your definition — I just want to simplify it so that I can grapple with it a little bit better.
[0:23:22.2] CC: How about people? Does culture factor into it, or is it just tools?
[0:23:26.8] KN: I think culture is a huge part of it. Pesky humans.
[0:23:28.0] DC: Yeah it is.
[0:23:29.8] CC: Would this culture be tremendously different from what we usually get now, at least with modern companies doing modern software?
[0:23:40.5] KN: I mean I definitely think like –
[0:23:41.7] CC: Would it look different?
[0:23:42.8] KN: Yeah, I definitely think there’s – you can always tell. Somebody once asked, “What is the difference between an SRE and a senior SRE?” and the answer was, “patience.” You can always tell the folks who have been burned, because they take this stuff extremely seriously. And I think with that culture, there is a commodity there — people are willing to pay for it if you can actually do a good job of going from a chaotic problem, “I have no idea what is going on,” to making sense of that noise and coming up with a concrete, tangible output that humans can take action on. I mean, that is huge.
[0:24:16.8] DC: Yeah, it is. I was recently discussing this in another medium — we were having a conversation around doing chaos testing, and I think that this relates. The interesting thing that came out of that for me was the idea that, you know, I’ve spent a pretty good portion of my career teaching people to troubleshoot, which is kind of weird. Teaching somebody to have an intuition about the way that a system works, and giving them a place to even begin to troubleshoot a particularly complex problem, especially as we start building more and more complex systems, is really a weird thing to try and do.
And I think that, culturally, when you have embraced practices like observability and chaos engineering, you are not only enabling your developers, your operators, your SREs to experiment and understand how the system breaks at any point, but you are also enabling them to better understand how to troubleshoot and characterize these distributed systems that they are building.
So if that is a cultural norm within your company, think about how many miles ahead you are of the other people in your industry, right? You have made it through adopting these technologies. You have enabled your engineering teams — whether they be the people who are writing the code, the people who are operating the code, or the people who are just trying to keep the whole system up — to experiment, to develop hypotheses around how the system might break at a particular scale, and to test that. And giving them the tools with which to actually observe this is critical, you know? It is amazing.
[0:26:03.5] KN: Yeah — on my metaphor kick again — I think of the bank robber movies where they take dust and blow it into the air, and all of a sudden you can see the lasers. I’m feeling like that is what is happening here: chaos testing would be the practice of intentionally breaking the lasers to make sure the security system works, and observability is the practice of actually doing something to make those lasers visible so we can see what is going on.
[0:26:31.0] CC: So, because the two of you spend time with customers — or maybe Duffy more so than Nova, but definitely I spend zero time — I am curious to know: if someone, let’s say an SRE, wants to implement a set of practices that comprise what we are talking about and calling observability, but they need to get a buy out from other people, how do you suggest they go about doing that?
Because they might know how to do it, or be willing to learn, but they might need to get approval — a buy out, I am sorry, a buy-in — from their managers, from their colleagues. There is a benefit and there is a cost. How would somebody present that? I mean, Nova definitely just gave us a laundry list of benefits, but how do you articulate it in a way that proves those benefits are worth the cost? And what are the costs? What are the tradeoffs?
[0:27:36.4] KN: Yeah, I think this is such a great question, because in my career I have worked at the world’s most paranoid software-as-a-service shop, where everything we did — every layer of it — had emergency disaster recovery baked in, and I have also worked at shops that are like, “No, we ain’t got time for that, hurry up and get your code pushed to production,” and I think there are pros and cons to each.
But as you look at the value you have in your application, you are going to come up with some way of concretely measuring that, of saying, “This is an application that brings in 500 bucks a month,” or whatever, and how much your application is worth to you is going to determine how seriously you take it. For instance, a WordPress blog is probably not going to have the same level of observability concerns as, say, a bank routing system.
So I think, as your application gets more and more valuable, your need for observability and for these tools is going to go up.
[0:28:37.6] DC: I agree. From the perspective of how you convince an existing engineering culture to make this jump, to introduce these ideas — I think that is a tricky question, because effectively what you are trying to do is enable that cultural shift that we were talking about before: what tools would set up the culture to succeed as they build out these applications and distributed systems that are going to comprise the basis of what your product is, right?
Coming at that from an SRE perspective, you need air cover to be able to have those tough conversations with your developers and say, “Look, this is why we do it this way, and this is something I can help you do, but fundamentally, we need to instrument this code in a way that we can actually observe it and understand how it is operating, before we can open the front door and let the Internet in,” right?
We need to be able to understand how and when the doors fall off, and we need to work with our developers who are more focused on understanding, does this function do what it says on the box? — rather than, is this function implemented in a way that might emit events or metrics? This is a completely different set of problems from the developer’s perspective.
I have seen a couple of different implementations of how to do this within an organization, and one of them is Facebook’s idea of — I think it is called product engineering or production engineering, one of the two. The idea is that you have somebody who is similar in some ways to an SRE, somebody who understands the infrastructure and understands how to build applications that will reside upon it, and who is embedded with your developer team to say, “You know, before we can legitimately sign off on this thing, here are the things that this application must wire into to enable us to operate this app, so that we can observe it and monitor it — do all the things that we need to do.”
And the great part about that is that it means you are teaming with the developer team — you have some engineering piece that is teaming with the developer team and enabling them to understand why these tools are there and what they’re for — and really promoting that engagement.
[0:30:59.9] CC: And getting to that place is an interesting proposition, isn’t it? Because, as a developer, even as a developer, I see the world moving more and more towards developers taking ownership of the apps and knowing more layers of the stack. And if I am a developer and I want to incorporate these practices, then I need to convince someone — either a developer, or whoever is in charge of monitoring and making sure the system is up and running, right?
[0:31:33.0] KN: Yeah.
[0:31:34.3] CC: So one way to go about quantifying the need for that is to say, “Well, over the last month we spent X amount of hours trying to find a bug in production,” and that X is a huge number. So you can bring that number and say, “This is how much it costs in engineering hours.” But on the other hand, you don’t want to be the one to say that it took you 100 hours to find one little bug in production, do you?
[0:32:06.2] KN: Yeah, I feel like this is why agile teams are so successful: baked into how you do your work is this implicit way of tracking your time and your progress. So at the end of the day, if you do spend 100 hours of work trying to find a bug, those are the team’s hours, not your hours, and you get this data for free at the end of every sprint.
[0:32:28.1] CC: Yeah.
[0:32:28.5] DC: What you brought up is actually another cultural piece that I think is a problem. Let me put this differently: I have seen companies wherein the culture is somewhat damning toward people who spend a lot of time trying to troubleshoot something that they wrote, and that is a terrible pattern, because it means that the people who are out there writing the code, who are just trying to get across the finish line with the thing that needs to be in production, now have this incredible pressure on them to not make a mistake, and that is not okay.
We are all here to make mistakes. That is what we do professionally — make mistakes — and the rest is just the gravy, you know what I mean? And so, yeah, it makes me nuts that there are organizations like that. What is awesome is that I see the opposite narrative rising up within the ecosystem around cloud native architectures: you were hired to do a hard job, and if we come down on you for finding that it is a hard job, then we are messing up. You are not messing up.
[0:33:37.2] CC: Yeah, absolutely. Building software is very hard and complex. So if you are not making mistakes, either you are not human or you are not making enough changes — and in today’s world, we still have humans making software instead of robots. We are not there yet, and it is a very risky proposition not to be making continuous changes, because you will be left behind.
[0:34:04.7] KN: Yeah, I feel like there is definitely something to be said about empathy for software engineers. It is very easy to be like, “Oh my gosh, you spent 100 hours looking at this one bug to save 20 dollars, how dare you?” But it is a lot harder to be like, “Oh, you poor thing, you had to dig through a 100 million lines of somebody else’s code in order to find this bug, and it took you 100 hours, and you did all of that just to fix this one little bug — how awesome are you?”
And I feel like that is where we get into the team dynamic of: are we a blame-centric team? Do we try to assign blame to a certain person, or do we look at this as the team’s responsibility? This is our code, and poor Carlisia over here had to go dig through this code that hasn’t been touched in 10 years, or whatever.
[0:34:54.6] CC: Another layer to that is that, in my experience, I have never done anything in software — looked at any code, brought up any system — where, as trivial as the end result was, especially in relation to the time spent, I didn’t come away with a huge amount of education that I got to reuse in future work. Does that make sense?
[0:35:20.2] DC: Yeah, and that is what I was referring to around being able to build up the intuition for how these systems operate — the more time you spend in the trenches working on those things, right? If you are enabled, leveraging technologies like observability and chaos engineering, to troubleshoot, to come up with a hypothesis about how this would break when that happens, to test it, view the result, come up with a new hypothesis and continue down that path, you will automatically, by your nature, build a better intuition for how all of these systems operate.
It doesn’t matter whether it is the application you are working on or some other application; you are going to build up your intuition for how to understand and characterize systems in general. You’ll be a better engineer for distributed systems if you are in a blameless culture that gives you tools to experiment, and tools to validate those experiments and come up with new ones, you know?
[0:36:21.3] CC: I am going to challenge you and then I am going to agree with you, so hang on, okay? First the challenge: are we saying that observability boils down to using automated tools to do all of this work for us, so that we don’t have to dig in manually on a case-by-case basis? No, that’s wrong?
[0:36:43.4] DC: No, I am saying observability is a set of tools that you can use to observe the interactions and behavior of distributed systems.
[0:36:53.4] CC: Okay but with automated tools right?
[0:36:55.2] DC: The automation piece isn’t really – I mean, do you want to take this one, Kris?
[0:36:59.5] KN: Yeah, I mean, they certainly can be automated. I just don’t think that there is a hard bit of criteria that says every one needs to be automated. There ain’t nothing wrong with SSH-ing into a server and running a debug script or something if you are having a really bad day.
[0:37:12.3] CC: Okay, but let me go with my theory — just pretend it is right, because it will sound better. Not to exclude the option of doing it manually too if you want, but let’s say we have these wonderful tools that can automate a bunch of this work for us, and we get to look at it from a high level. What I am thinking is: whereas before, if we didn’t have or use those tools, we had to do a lot of that work manually, we had to look in a lot more different places — and I will challenge you that we develop even more intuition that way. So we are potentially decreasing the level of intuition that we develop by using the tools.
Now, I am going to agree with you — that was just the rationale I had to follow. I agree with you: it definitely helps to develop intuition, and it is a better quality of intuition, because now you can hold these different pieces in your hand, because you are looking at it at this higher level.
Because when you look at the details, you look at things at that low level — at least I am like that. It’s like, “Okay, can I hold this one thing? It is big already in my head,” and then when I switch context and go look at something else — what did I look at over there? It is really hard to keep track, and really wasteful. It is impossible to keep all of it in your mind, right?
And let’s say you have to go through the whole debugging process all over again. If I don’t have notes, it will be just like the first time, because I can’t possibly remember. I mean, I have been in situations of having to debug different systems where, okay, the third time around I am taking notes, because the fourth time is just going to be so painful. So having tools that let us look at things at a higher level has the additional benefit of helping us understand the system and hold it together in our heads — because, okay, we definitely don’t know the little details of how these things are happening behind the scenes.
But how useful is that anyway? I’d much rather know how the whole system works together, the points of failure — I can visualize it, right?
[0:39:30.1] DC: Yeah.
[0:39:30.7] KN: I have a question for everyone, following up on how Carlisia challenged you and then agreed with you. I really want to ask this question because I think Carlisia’s answer is going to be different than Duffy’s, and I think that is going to say a lot about the different ways that we are thinking about observability here, which is really fascinating if you think about it. So: have either of you worked in a shop before where you had ‘the guy’?
You know, that one person who just knew the code base inside and out, who had been around forever — he was a dinosaur — and whenever something went wrong, you’re like, “We’ve got to get this guy on the phone,” and he would come in and be like, “Oh, it is this one line and this one thing that would take you six months to figure out, but let me just fix this really quick,” bam-bam-bam-bam, and production is back online.
[0:40:14.1] CC: Oh, the code base guy, the sysadmin guy — like, something that is not my app broke, the system broke, and you get that person who knew everything, who could take one second to figure out what the problem was.
[0:40:28.9] KN: Have you seen that before though? That one person who just has so much tribal knowledge.
[0:40:33.3] CC: Yeah, absolutely.
[0:40:34.3] KN: Yeah, Duffy what about you?
[0:40:36.0] DC: Absolutely. I have both been that guy and seen that guy.
[0:40:38.8] CC: I have never been that person.
[0:40:40.2] DC: In lots of shops.
[0:40:42.4] KN: Well, what I am digging at here is — and I mean this in the nicest way possible to all of our folks at home who are actively playing the role of ‘the guy’ — I think observability kind of makes that problem go away, right?
[0:40:56.8] DC: I think it normalizes it, to your point. I think you’re onto it, and I agree with you, but I think that fundamentally what happens is that, through tooling like chaos engineering and observability, you are normalizing what it looks like to teach anybody to be like that person, right? That is the key takeaway. To Carlisia’s point — you know, Kris and I, I promise, will approach some complex distributed systems problem fundamentally differently, right?
If somebody has a broken Kubernetes cluster, Kris and I are both going to approach that same problem, and we will likely both be able to solve it, but we are going to approach it in different ways. I think the benefit of having common tooling with which to experiment, understand, and observe the behavior of these distributed systems means that we can normalize what it looks like to be a developer: to have a theory about how the system is breaking or would break, and to have some way of validating that through observability and perhaps chaos engineering. And that means we are turning the keys to the castle over. There is no more bus test. You don’t have to worry about what happens to me at the end of the day. We all have this common receptacle.
[0:42:16.7] CC: You could go on vacation.
[0:42:18.1] DC: Yeah.
[0:42:19.0] CC: No, but this is an excellent point — I am glad you brought it up, Nova — because what both of you said is absolutely true. I mean, give me better documentation and I don’t need you anymore, because I can be self-sufficient.
[0:42:33.2] DC: Exactly.
[0:42:34.9] KN: Yeah so when we’re –
[0:42:35.5] CC: And again, I go back to what I said: more and more developers are taking on — some proactively, and in other cases because they’ve been asked to — more ownership of the whole stack, from the application level down. If you give me tools to observe where things went wrong beyond my code, as a developer, I am not going to call the guy.
[0:43:07.3] KN: Yeah.
[0:43:08.4] CC: So the level of self-sufficient –
[0:43:10.3] KN: The guy doesn’t want you to call him.
[0:43:12.5] CC: So it provides — the benefit, we could say, is providing the engineer an additional level of self-sufficiency.
[0:43:22.0] KN: Yeah, I mean teach someone to fish, give someone a fish.
[0:43:25.1] CC: Yeah.
[0:43:25.6] KN: Yeah.
[0:43:26.4] DC: Exactly. All right, well, that was a great conversation on observability, and we talked about a bunch of different topics. This is Duffy — I had a great time in this session, and thanks.
[0:43:38.1] CC: Yeah, we are super glad to be here today. Thanks for listening. Come back next week.
[0:43:43.4] KN: Thanks for joining, everyone, and I apologize again to all of our ‘guys’ at home listening. Hopefully we can help you with observability along the way and make everybody’s job a little bit easier.
[0:43:53.8] CC: And I want to say, you know, for the girls: we know that you are all there too. That is just a joke.
[0:43:59.9] KN: Oh yeah, I was totally at it for a while. Good show, everyone.
[0:44:05.3] DC: All right, cheers.
[0:44:06.8] KN: Cheers.
[END OF INTERVIEW]
[0:44:08.7] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the web at https://thepodlets.io, where you will find transcripts and show notes. We’ll be back next week. Stay tuned by subscribing.
[END]
-
In this episode of The Podlets Podcast, we are diving into contracts and some of the building blocks of the cloud native application. The focus is on the importance of contracts and how APIs help us and fit into the cloud native space. We start off by considering the role of the API at the center of a project and some definitions of what we consider to be an API in this sense. This question of API-first development sheds some light onto Kubernetes and what necessitated its birth. We also get into picking appropriate architecture according to the work at hand, Kubernetes' declarative nature, and how microservices ease the problems often experienced in more monolithic work. The conversation also covers some of these particular issues, while considering possible benefits of the monolith development structure. We talk about company structures, Conway's Law, and best practices for avoiding the pitfalls of these. So for all this and a whole lot more on the subject of APIs and contracts, listen in with us today!
Note: our show changed its name to The Podlets.
Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback and episode suggestions:
https://github.com/vmware-tanzu/thepodlets/issues
Hosts:
Carlisia Campos
Josh Rosso
Duffie Cooley
Patrick Barker
Key Points From This Episode:
• Reasons that it is critical to start with APIs at the center.
• Building out the user interface and how the steps in the process fit together.
• Picking the way to approach your design based on the specifics of that job.
• A discussion of what we consider to qualify as an API in the cloud-native space.
• The benefit of public APIs and more transparent understanding.
• Comparing the declarative nature of Kubernetes with more imperative models.
• Creating and accepting pods, querying APIs and the cycle of Kubernetes.
• The huge impact of the declarative model and correlation to other steps forward.
• The power of the list and watch pattern in Kubernetes.
• Discipline and making sure things are not misplaced with monoliths.
• How microservices go a long way toward eradicating some of the confusion that arises in monoliths.
• Counteracting issues that arise out of a company's own architecture.
• The care that is needed as soon as there is any networking between services.
• Considering the handling of an API's lifecycle through its changes.
• Independently deploying outside of the monolith model and the dangers to a system.
• Making a service a consumer of a centralized API and flipping the model.
Quotes:
“Whether that contract is represented by an API or whether that contract is represented by a data model, it’s critical that you have some way of actually defining exactly what that is.” — @mauilion [0:05:27]
“When you just look at the data model and the concepts, you focus on those first, you have a tendency to decompose the problem.” — @pbarkerco [0:05:48]
“It takes a lot of discipline to really build an API first and to focus on those pieces first. It’s so tempting to go right to the UI. Because you get these immediate results.” — @pbarkerco [0:06:57]
“What I’m saying is, you shouldn’t do one just because you don’t know how to do the others, you should really look into what will serve you better.” — @carlisia [0:07:19]
Links Mentioned in Today’s Episode:
The Podlets on Twitter — https://twitter.com/thepodlets
Nicera — https://www.nicera.co.jp/
Swagger — https://swagger.io/tools/swagger-ui/
Jeff Bezos — https://www.forbes.com/profile/jeff-bezos/
AWS — https://aws.amazon.com/
Kubernetes — https://kubernetes.io/
Go Language — https://golang.org/
Hacker Noon — https://hackernoon.com/
Kafka — https://kafka.apache.org/
etcd — https://etcd.io/
Conway’s Law — https://medium.com/better-practices/how-to-dissolve-communication-barriers-in-your-api-development-organization-3347179b4ecc
Java — https://www.java.com/
Transcript:
EPISODE 03
[INTRODUCTION]
[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.
[EPISODE]
[0:00:41.2] D: Good afternoon everybody, my name is Duffy and I’m back with you this week. We also have Josh and Carlisia and a new member of our cast, Patrick Barker.
[0:00:49.4] PB: Hey, I’m Patrick, I’m an upstream contributor to Kubernetes. I do a lot of stuff around auditing.
[0:00:54.7] CC: Glad to be here. What are we going to talk about today?
[0:00:57.5] D: This week, we’re going to talk about some of the building blocks of a cloud native application. We’re going to focus on contracts and how APIs help us, and why they’re important to the cloud native ecosystem. Usually, with these episodes, we start by talking about the problem first, and then we dig into why this particular solution — something like a contract or an API — is important.
And so, to kick that off, let’s talk about this idea of API-first development and why that’s important. I know that Josh and Patrick both, and Carlisia, have all done some very interesting work in this space as far as developing your applications with that kind of a model in mind. Let’s open the floor.
[0:01:34.1] PB: It’s critical to build API-centric. When you don’t build API-centric, most commonly you’ll see, across the ecosystem, building UI-centric. It’s very tempting to do this sort of thing, because UIs are visually enticing — they’re kind of eye candy. But when you don’t go API-centric and you go that direction, you miss the majority of use cases down the line, which are often around an SDK. Those end up being, more often than not, the flows that are the most useful to people, but they’re kind of hard to see at the beginning.
I think going in and saying we’re building a product API-first is really saying: we understand that this is going to happen in the future, and we’re making this a principle early. We’re going to enforce these patterns early, so that we develop a complete product that could be used in many fashions.
[0:02:19.6] J: I’ve seen some of that in the past as well, working for a company called Nicira, which was a network virtualization company. We really focused on providing an API that would sit between you and your network infrastructure, and I remember it being really critical that we defined, effectively, what would be the entire public API for that product up front. Then later on, we figured out how to let people learn the semantics of it, to be able to build a mental model around what that API might be — that’s where the UI piece comes in.
That was an interesting experiment, and we ended up actually creating what was kind of an early version of the Swagger UI, in which you basically had a UI that would allow you to explore and introspect and play with all of those different API objects. But it wasn’t a UI in the sense that it had a constrained user story that was trying to be defined. That was my first experience working with a product that had an API-first model.
[0:03:17.0] CC: I had to warm up my brain and think about why we build APIs to begin with, before I could think about why API-first is a benefit and where the benefits are. I actually looked something up today — it’s this Jeff Bezos mandate, I had seen it before, right? I mean, why do we build APIs? An API — what we’re talking about is data transfer, right? Taking data from over here and sending it over there, or making it available so somebody can fetch it.
It’s communication. Why do we build APIs? To make it easier to do that, so you can automate, you can expose it, you can gate it with some security, right? Authentication, all of those things. And with an ever-increasing amount of data, this becomes more and more relevant, and I think, as Patrick was saying, when you do it API-first, you’re absolutely making all of those characteristics a priority — making that work well.
If you want to make it pretty, okay: you can take that data and transform it some other way to make your presentation pretty, to display on a mobile device or whatever.
[0:04:26.4] PB: Yeah, I think another thing with putting the API design up front in the software development lifecycle, at least in my experience, is that it allows you to gather feedback from who your consumers will be early on, before you worry about the intricacies of all the implementation details, right?
I guess with Nicira’s stuff — I wonder, when you all made that contract, were you pushing out a Swagger UI or just general API documentation before you had actually implemented the underlying pieces, or did that all happen together?
[0:04:58.1] D: With API-first, we didn’t build out the UI until after the fact. Even to the point where we would define a new object in that API, like a distributed logical router for example, we would actually define that API first, and we would have test plans for it and all of that stuff, and then we would surface it in the UI. That’s a great point.
I will say that it is probably to your benefit in the long run to define what all of the things you’re going to be concerned with are up front. And if you can do that on a contractual basis — whether that contract is represented by an API or whether that contract is represented by a data model — it’s critical that you have some way of actually defining exactly what that is, so that you can also support things like versioning and being able to modify that contract as you move forward.
[0:05:45.0] PB: I think another important piece here, too, is that when you look at the data model and the concepts and focus on those first, you have a tendency to decompose the problem better, right? You start to break it down into individual pieces that combine better, and you end up with more use cases and a more usable API.
[0:06:03.2] D: That’s a good point. I think one of the key parts of this contract is what you’re trying to solve, and that’s always important, you know? When I talk about API-first development, it is totally in line with that: you have to think about what all the use cases are, and if you’re trying to develop a contract that might satisfy multiple use cases, then you get this great benefit of being able to collapse a lot of the functionality down into a more composable API, rather than having to solve each individual use case in kind of a myopic way.
[0:06:34.5] CC: Yeah, it’s the concept of reusability — having the ability to make things composable, reusable.
[0:06:40.7] D: I think we’ve probably all seen UIs that get stuck in exactly that pattern, to Patrick’s point. They try to solve the user story for the UI, and then on the backend, you’re like, why do we have two different data models for the same object? It doesn’t make sense. We have definitely seen that before.
[0:06:57.2] PB: Yeah, I’ve seen that more times than not. It takes a lot of discipline to really build an API first, to focus on those pieces first — it’s so tempting to go right to the UI, because you get these immediate results. You really need to bring that back. It takes discipline to focus on the concepts first, but it’s just so important to do.
[0:07:19.5] CC: I guess it really depends on what you are doing, too. I can see all kinds of benefits for any kind of approach. But one thing to highlight is that there are different ways of doing it: you can do UI-first — presentation-first — you can do API-first, and you can do model-first. Those are three different ways to approach the design, and then you have to think — well, what I’m saying is, you shouldn’t do one just because you don’t know how to do the others. You should really look into what will serve you better.
[0:07:49.4] J: Yeah, with a lot of this talk about APIs and contracts — obviously in software there are many levels of contracts we potentially work on, right? There’s the higher-level, potentially UI, stuff, and sometimes there are lower-level pieces in code. Perhaps, if you all think it’s a good idea, we could start by talking about what we consider to be an API in the cloud native space and what we’re referring to.
A lot of the APIs we’ve described so far, if I heard everyone correctly, sounded like they were describing perhaps a web service of sorts. Is that fair?
[0:08:18.8] PB: That’s an interesting point to bring up. I’m definitely describing the consumption model of a particular service. I’m referring to that contract as an infrastructure guy: I want to be able to consume an API that will allow me to model or create infrastructure. I’m thinking of it from that perspective.
If AWS didn’t have an API, I probably wouldn’t have adopted it. The UI is not enough to do this job; I need something that I can tie to better abstractions, things like Terraform and stuff like that. I’m definitely picturing it from that perspective.
But I will add one other interesting point, which is that in some cases, to Josh’s point, these things are broken up into public and private APIs. That might be kind of interesting to dig into — why you would model it that way. There are certainly different interactions between composed services that you’re going to have to solve for. It’s an interesting point.
[0:09:10.9] CC: Let’s hold that thought for a second. We are acknowledging that there are public and private APIs, and we could talk about why you would model services that way. There are also other flavors of APIs: you can have, for example, a web service type of API, and you can have a command-line API. You can have a CLI on top of a web service API — the one that comes to mind is Kubernetes — and they have different shapes and different flavors, even though they are accessing pretty much the same functionality.
You know, of course, they have different purposes — you have the CLI, and yet another flavor is the library. In that case, you make calls to the library, which calls the web service API. But like Duffy is saying, it’s critical sometimes to have these different entry points, because each one has its own advantages. A lot of times, it’s way faster to do things on the command line than through a web UI that accesses that web API — which, basically, you do want to have: either a UI or a CLI for it.
[0:10:21.5] PB: What’s interesting about Kubernetes too, and what I think they kind of introduced — someone can correct me if I’m wrong — is this concept of a core generative type. In Kubernetes, it ends up being this [inaudible]. From the [inaudible], you’re generating out the web API and the CLI and the SDK, and they all just come from this one place; it’s all code-gen out of that.
Kubernetes is really the first place I’ve seen do that, but it’s a really impressive model, because you end up with this nice congruence across all your interfaces. It just makes the product really grokkable — you can understand the concepts better, because everywhere you go, you end up with the same things and you’re interacting with them in the same way.
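As a rough illustration of that single-source-of-truth pattern, here is a hedged sketch of a Kubernetes-style Go type from which the REST API schema, CLI output, and typed SDK clients can all be generated. The Widget name and its fields are invented; the Spec/Status split and the code-gen markers follow the conventions used by Kubernetes and kubebuilder-style tooling.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// WidgetSpec is the declarative part: what the user asks for.
type WidgetSpec struct {
	// Replicas is how many copies of the widget we want. (Invented field.)
	Replicas int32 `json:"replicas"`
}

// WidgetStatus is filled in by a controller: what actually exists.
type WidgetStatus struct {
	ReadyReplicas int32 `json:"readyReplicas"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// Widget is the one core type; the web API schema, CLI output and
// typed SDK clients are all generated from this single definition.
type Widget struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   WidgetSpec   `json:"spec,omitempty"`
	Status WidgetStatus `json:"status,omitempty"`
}
```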
[0:11:00.3] D: Which is kind of the defining type interface that Kubernetes relies on, right?
[0:11:04.6] PB: Obviously, Kubernetes is incredibly declarative — we could talk a bit about declarative versus imperative — it’s almost entirely declarative. You end up with a nice, neat, clear model which goes out to YAML, and you end up with a pretty clean interface.
[0:11:19.7] D: If we’re going to talk about the API as it could be consumed by other things — I think we’re talking a little bit about the forward-facing API — this is one of those things that I think Kubernetes does differently than pretty much any other model that I’ve seen. In Kubernetes, there are no hidden APIs; there is no private API.
Everything is exposed all the time, which is fascinating, because it means that the contract has to be solid for every consumer — not just the public ones, but also anything that’s built on the backend of Kubernetes. The Kubelet, the controller manager, all of these pieces are going to be accessing the very same API that the user does.
I’ve never seen another application built this way. In most applications, what I see is that they might define an API between particular services, so that you have a contract between those particular services. Because, to Carlisia’s point, in most of the models that I’ve seen, APIs are contracts about how do I get data, or consume data, or interact with data between two services — so there might be a contract between that service and all of its consumers, rather than one contract between the core and all of the consumers.
[0:12:21.7] D: Like you said, Kubernetes is the first thing I’ve seen that does that. I’m building an API right now, and there’s a strong push for internal APIs in it. But we’re building on top of a Kubernetes product, and it’s so interesting how they’ve been able to do that — where literally every API is public and it works well. There really aren't issues with it, and I think it actually creates a better understanding of the underlying system, and more people can probably contribute because of that.
[0:12:45.8] J: On that front — I hope this is a good segue — I think it would be really interesting to talk about that point you made, Patrick, around declarative versus imperative, and how the API we’re discussing right now, Kubernetes in particular, is almost entirely declarative. Could you maybe expand on that a bit and compare the two?
[0:13:00.8] PB: It’s an interesting thing that Kubernetes has really brought to the forefront — the only other notable declarative API I can think of would be Terraform. This notion that you just declare state within a file, apply that up to a server, and then that state is acted on by a controller, which brings that state to fruition.
I mean, that’s almost indicative of Kubernetes at this point, I think. It’s so ingrained into the product — it’s one of the first things to do that, and it’s almost what you think of when you think of Kubernetes. And with the advent of CRDs and whatnot, that can now be extended out to really any use case you would have that fits this declarative pattern of just declaring state — which, it turns out, covers a ton of use cases, and that’s incredibly useful.
Now they’re looking at whether, in core Kubernetes, they could add imperative functionality on top of the declarative resources, which is interesting too. They’re looking at that for V2 now, because there are limitations — there are some things that just don’t fit the declarative pattern perfectly, that would fit plain standard REST.
You end up with some weird edges there. As they go towards V2, they’re starting to look at whether they could mix imperative and declarative, which is an even more interesting idea, if you could do that right.
[0:14:09.3] CC: In the Kubernetes world, what would that look like?
[0:14:11.3] PB: Say you have an object that just represents something like on FOO, you have a YAML file and you're declaring FOO to be some sort of thing, you could apply that file and then now that state exist within the system and things noticed that that state of change that they’re acting on that state, there are times when you might want that FOO to have another action.
Besides just applying state, you may want it to have some sort of capability on top of that, and there are quite a few use cases where that turns into a thing. It's something to explore, but it's a bit of a Pandora's box, if you will, because where does that end? Kubernetes is kind of nice in that it does enforce constraints at this core level, and it produces these really deep patterns within the system that people find easy to understand, at least at a high level.
Granted, if you go deep into it, it gets highly complex, but enforcing namespaces as this concept of just a flat namespace with declarative resources within it, and then the declarative resources themselves being confined to the standard REST verbs, is a model that people understand well, I think. That is part of the success of Kubernetes: people could get their hands around that model. It's also just incredibly useful.
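To make the Foo example concrete: a declarative resource in the Kubernetes style pairs a spec, the state you declare, with a status the system reports back. Here is a minimal sketch in Go, with the Foo type and all of its fields invented purely for illustration:

```go
// A hypothetical declarative "Foo" resource, following the spec/status
// convention Kubernetes uses. Everything here is invented for illustration.
package main

import "fmt"

// FooSpec is the desired state the user declares, e.g. in a YAML file.
type FooSpec struct {
	Replicas int
	Message  string
}

// FooStatus is what a controller reports back about the actual state.
type FooStatus struct {
	ReadyReplicas int
}

// Foo pairs the declared intent with the observed state.
type Foo struct {
	Name   string
	Spec   FooSpec
	Status FooStatus
}

func main() {
	foo := Foo{Name: "example", Spec: FooSpec{Replicas: 3, Message: "hello"}}
	fmt.Printf("declared: %+v, observed: %+v\n", foo.Spec, foo.Status)
}
```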
[0:15:23.7] D: Another way to think about this: you've probably seen articles out there that describe the RESTful model and talk about whether REST can be transactional. Let's talk a little bit about what that means. A common implementation of an API pattern, or an interface pattern, might be that the client sends information to the server, and the server locks that client connection until it's able to return the result, whatever that result is.
Think of it like this: in some ways, this is very much like a database, right? As a client of a database, I want to insert a row. The database will lock that row, lock my connection, insert the row, and return success, and in this way it's synchronous. It's not trying to just accept the change; it wants to make sure it has persisted that change to the database before letting go of the connection.
This pattern is probably one of the most common patterns in interfaces in the world, like it is way super common. But it's very different from the RESTful pattern, or some of the implementations of a RESTful pattern, especially in this declarative model, right? In a declarative model, the contract is basically: I'm going to describe a thing, and you're going to tell me when you understand the thing I want to describe.
It's asynchronous. For example, if I were interacting with Kubernetes and I ran kubectl create pod, I would provide the information necessary to define that pod declaratively, and I would get back from the API server a 200 OK: the pod has been accepted. It doesn't mean it's been created. It means it's been accepted as an object and persisted to disk. Now, to understand, from a declarative perspective, where I am in the lifecycle of managing that pod object, I have to query that API again.
Hey, this pod that I asked you to make, are you done making it, and where are you in that cycle of creating the thing? This is where, within Kubernetes, we have the idea of a spec, which defines all of the bits that are declaratively described, and we have the idea of a status, which describes what the system has been up to around that declarative object and whether it has actually been successfully created or not.
I would argue that, from a cloud native perspective, that declarative model is critical to our success, because it allows us to scale, it allows us to provide an asynchronous API around the objects we're trying to interact with, and it really changes the game as far as how we go about implementing those interfaces.
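As a rough sketch of that accept-then-poll flow using client-go (the createAndWait helper and its polling loop are illustrative; real code would typically use a watch or the apimachinery wait helpers instead of sleeping):

```go
// A sketch of the accept-then-poll flow: Create returning without error
// only means the object was accepted and persisted, not that it's running.
package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createAndWait submits a pod declaratively, then polls its status
// until the declared state has actually been realized.
func createAndWait(ctx context.Context, cs kubernetes.Interface, ns string, pod *corev1.Pod) error {
	if _, err := cs.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		return err // not even accepted
	}
	for {
		p, err := cs.CoreV1().Pods(ns).Get(ctx, pod.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if p.Status.Phase == corev1.PodRunning {
			fmt.Println("pod is now actually running")
			return nil
		}
		time.Sleep(2 * time.Second) // the 200 OK came long before this point
	}
}
```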
[0:17:47.2] CC: This is so interesting. It was definitely a mind-bender for me when I started developing against Kubernetes. Because, what do you mean you've returned a 200 OK and the thing is not created yet? When does it get created? It's not hard to understand, but I was so not used to that model. I think it gives us a lot of control, so it is very interesting that way, and I think you might be right, Duffy, that it might be critical to the success of cloud native apps. The way I am thinking about it right now, just having heard you, is to compare it with the older models, say, working with a database in a transactional system.
The data has to be inserted, and that system decides whether to retry or not; once the transaction is complete, we get a result back. With the Kubernetes model, or the cloud native model, whichever is the proper thing to say, the control is with us. We send the request, and Kubernetes goes off and does its thing, which allows us to move on too, which is great, right? Then I can check for the result when I want to check, and I can decide what to do with the result when I want to do anything with it, if at all. I think it gives us a lot more control as developers.
[0:19:04.2] D: Agreed. And another thing that has stuck in my head around this model, whether declarative or imperative, is that I think Go itself has really enabled us to adopt that asynchronous model, because threads are first class, right? You can build a channel to handle each individual request, so you are not in this world where all transactions have to stop until this one is complete before we take the next one out of the queue and do that one.
We're no longer in that kind of queue model; we can actually handle these things in parallel quite a bit more. It makes you think differently when you are developing software.
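A tiny illustration of that point, assuming nothing beyond the standard library: each request gets its own goroutine and a channel carries results back, so one slow request never blocks the others:

```go
// Each request is handled on its own goroutine; a channel collects results
// as they complete, instead of a queue that blocks on the slowest item.
package main

import (
	"fmt"
	"time"
)

func handle(id int, results chan<- string) {
	time.Sleep(time.Duration(100*id) * time.Millisecond) // pretend work
	results <- fmt.Sprintf("request %d done", id)
}

func main() {
	results := make(chan string)
	for i := 1; i <= 3; i++ {
		go handle(i, results) // fire off each request concurrently
	}
	for i := 0; i < 3; i++ {
		fmt.Println(<-results) // results arrive as each one finishes
	}
}
```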
[0:19:35.9] J: It's scary too that you can check this stuff into a repo. The advent of GitOps is almost parallel to the advent of Kubernetes and Terraform, in that you can now have this state that is source-controlled, and then you just apply it to the system, and it understands what to do with it and how to put together all of the pieces that you gave it, which is a super powerful model.
[0:19:54.7] D: There is a point to make about that whole asynchronous model, the idea of an API that has a declarative rather than an imperative model, and it is an idea in distributed systems that is [inaudible]: edge triggering versus level triggering. I definitely recommend looking up this idea. There is a great article on it on HackerNoon, and what they highlight is that from a purely abstract perspective, there is probably no difference between edge and level triggering.
But when you get down to the details, especially with distributed systems or cloud native architectures, you have to take into account the fact that there is a whole bunch of disruption between your services pretty much all the time. This is the challenge of distributed systems in general: you are defining a bunch of unique, individual systems that need to interact, and they are going to rely on an unreliable network, and they are going to rely on unreliable DNS.
They're going to rely on all kinds of things that are going to jump in the way of these communication paths. And the question becomes: how do you build a system that can be resilient to those interruptions? The asynchronous model absolutely puts you in that place. You say, "Create me a pod," and that pod object is persisted, and now something else can do the work that reconciles that declared state with the actual state until it works.
It will just keep trying and trying and trying until it works. In other models, you basically say, "Okay, what work do I have to do right now?" and you focus on doing that work until it is done. But what happens if the process itself dies? What happens if any of the interruptions we talked about happen? Another good example of this is the Kafka model versus something like a watch on etcd, right? In Kafka, you have these events that you are watching for.
And if you weren't paying attention when an event went by, you didn't get that event. It is not still there; it is gone now. Whereas with etcd and models like that, what you are saying is: I need to reconcile my view of the world with the desired state. I am no longer looking for events; I am looking for a list of work that I have to reconcile towards, which is a very different model for these sorts of things.
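A level-triggered reconcile loop in miniature might look like the sketch below; the state type and the observe and act hooks are invented for illustration. The loop never cares which event it missed, it only compares where the world is with where it should be:

```go
// A toy level-triggered reconcile loop: each pass compares desired state
// with observed state and acts on drift, instead of reacting to events.
package main

import (
	"fmt"
	"time"
)

type state struct{ replicas int }

func reconcile(desired state, observe func() state, act func(state)) {
	for i := 0; i < 5; i++ { // bounded here only so the demo terminates
		if actual := observe(); actual != desired {
			fmt.Printf("drift: have %+v, want %+v\n", actual, desired)
			act(desired) // a failure would simply be retried next pass
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	current := state{replicas: 0}
	reconcile(state{replicas: 3},
		func() state { return current },
		func(d state) { current = d },
	)
	fmt.Printf("converged at %+v\n", current)
}
```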
[0:21:47.9] J: In Kubernetes, this becomes the informer pattern, if you all don't know it. At the core of the informer is just this concept of list and watch, where you are watching for changes, but every so often you list as well, in case you missed something. I would argue that that pattern is so much more powerful than the Kafka model you were just describing, because, like you mentioned, if you missed an event in Kafka somehow, someway, it is very difficult to reconcile that state.
Like you mentioned, your entire system can go down in a level-triggered system, and when you bring it back up, because it is level-triggered, everything just figures itself out. That is a lot nicer than your entire system going down in an edge-triggered system and you trying to figure out how to put everything back together yourself, which is not a fun time if you have ever done it.
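For reference, a condensed sketch of that list-and-watch informer with client-go; the 30-second resync period is an arbitrary choice here, and error handling is omitted:

```go
// The informer pattern: watch for changes, but periodically relist so the
// local cache stays correct even if a watch event was missed.
package example

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func watchPods(cs kubernetes.Interface, stop <-chan struct{}) {
	// The resync is the level-triggered safety net: even if an event is
	// missed, the periodic relist reconciles the cache with reality.
	factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("observed pod:", pod.Name)
		},
	})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.HasSynced)
}
```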
[0:22:33.2] D: These are some of the patterns and contracts that we see in the cloud native ecosystem, so it is really interesting to talk about them. Did you have another point, Josh, around APIs and stuff?
[0:22:40.8] J: No, not in particular.
[0:22:42.2] D: So I guess we could get into what some of the forms of these APIs are. We could talk about RESTful APIs versus gRPC-based APIs, or maybe even just interfaces back and forth between modular code and how that helps you architect things.
One of the things I've had conversations with people about is that we spend a lot of our time, in cloud native architecture, conditioning our audience to the idea that monoliths are bad, bad, bad and that they should never build them. And that is not necessarily true, right? I think it is definitely worth talking through why we have these different opinions and what they mean.
When I have that conversation with customers, frequently a monolith makes sense, as long as you're able to build modularity into it and you are being really clear about the interfaces between those functions, with the idea that you may one day have to scale traffic to or from this monolith, and that a function you are writing may need to be externalized so it can handle an amount of work that surpasses what the entire monolith could handle.
As long as you are really clear about the contract you are defining between those functions, then later on, when the time comes to externalize those functions and embrace a more microservices-based model, mainly due to traffic load or any of the other concerns that drive you toward a cloud native architecture, you are in a better spot. This is definitely one of the points about contracts that I wanted to raise.
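As a sketch of that discipline in Go, here is a hypothetical billing module hidden behind a small interface; the names are invented, but the idea is that callers depend only on the contract, so the implementation can later move behind a network boundary without touching them:

```go
// A modular monolith in miniature: callers depend on the Charger contract,
// never on the concrete implementation behind it.
package main

import "fmt"

// Charger is the contract between modules. Today it is satisfied in-process;
// later it could be satisfied by a client calling a separate billing service
// over the network, without changing any caller.
type Charger interface {
	Charge(account string, cents int) error
}

type inProcessBilling struct{}

func (inProcessBilling) Charge(account string, cents int) error {
	fmt.Printf("charged %s %d cents\n", account, cents)
	return nil
}

func checkout(c Charger, account string) error {
	return c.Charge(account, 1299) // depends only on the contract
}

func main() {
	_ = checkout(inProcessBilling{}, "acct-42")
}
```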
[0:24:05.0] CC: I wonder, though, how hard it is for people to keep that in mind and follow that intention. If you have to break things into microservices because you have bottlenecks in your monolith, you may have to redo the whole thing. Once you move to microservices, you have gone through the exercise of deciding this goes here, these go there, and once you have the separate modules it is clear where things should go.
But when you have a monolith, it is so easy to put things in a place where they shouldn't be. It takes so much discipline, and if you are working on a team that is bigger than two people, I don't know.
[0:24:44.3] PB: There are certain languages that lend themselves to this. When you are writing Java services, or even quickly prototyping an application that has multiple functions, it can be hard to be careful about the interfaces you are writing. A language like Go, because it is a strongly typed language, kind of forces you into this, right? And there are some other languages out there that make it difficult to be sloppy about those interfaces.
And I think that is inherently a good thing. But to your point, if you look at some of the larger monoliths out there, it is very easy to fall into these patterns where, instead of an asynchronous API or an asynchronous interface, you just have a native, synchronous interface, in which you expect to be able to call a function, put something in, and get the result back. That is a pattern for monoliths; that is how we do it in monoliths.
[0:25:31.8] CC: What you said there also made me think of Conway's Law, because we separate things into microservices, and I am not saying microservices are right for everything, for every team and every company. But if you are going through that exercise of separating things because you have bottlenecks, then maybe in the future you have to put them elsewhere, externalize them, like you said.
If you think of Conway's Law, when you have a big team, everybody working on that same monolith, that is when things end up in places they shouldn't be. The point of microservices is not just to technically separate things but to allow people to work separately, and that inter-team communication is going to be reflected in the software they are creating. Because they are forced to communicate, and hopefully they do it well, those microservices should be well designed. But with a monolith and everyone working on the same project, it gets more confusing.
[0:26:31.4] D: Conway's Law, as an overview, is basically that an organization will build and lay out software similar to the way the organization itself is structured. So if everybody in the entire company is working on one thing, and they are really focused on doing that one thing, you will probably build a monolith. If you have groups that are disparate, each really focused on some subset of the work, and they need to communicate with each other to do that thing, then you are going to build something more like microservices.
That is a great point. Actually, one of the things about [inaudible] that I found so fascinating was that we were about a hundred people and we were everywhere, so communication became a problem that absolutely had to be solved or we wouldn't be able to move forward as a team.
[0:27:09.5] J: An observation I had in my past life, helping folks break apart Java monoliths: like you said, Duffy, assume they had really good interfaces and contracts, right? That made it a lot easier to find the breaking points, to pull those APIs out into a different type of API. They went from a programmatic API, where things were just inter-communicating within the JVM, to an API that was based on a web service.
And an interesting observation I often found was that people didn't realize that moving complexity from within the app out into the network space oftentimes caused a lot of issues. I am not trying to put down APIs, because obviously we are talking about the benefits of them, but it is an interesting balancing act. When you are working out how to decouple a monolith, I feel like you actually can go too far with it, and it can cause some serious issues.
[0:27:57.4] D: I completely agree with that. That is where I wanted to go with the idea of why we say that building a monolith is bad, and the challenges of breaking those monoliths apart later. You are absolutely right: when you introduce the wild chaos that is a network between your services, you are going to externalize functions, which means you have to care a lot more about where you store state, because that state is no longer shared across all of the things.
It means that you have to be really careful about how you model that. If you get to the point where this monolith you built is wildly successful and all of its consumers are network-based, you are going to have to come around on that point of contracts. Another thing that we haven't really touched on so much: we all agree that an API for, say, the external consumer is important.
But we have talked a little bit about whether private APIs or public APIs make sense. We described one of the wacky things that Kubernetes does, which is that there are no private APIs; it is all totally exposed, all the time. I am sure that all of us have seen way more examples of things that do have a private API, mainly because the services are chained: service A always talks to service B, and service B has an API that may be a private API.
You are never going to expose it to your external customers, only to service A, or to consumers of that internal API. One of the other things we should talk about when you start thinking about these contracts, one of the biggest and most important bits, is how you handle the lifecycle of those APIs as they change, right? As I add new features or functionality, or as I deprecate old features and functionality, what are my concerns as they relate to this contract?
[0:29:33.5] CC: Tell me and take my money.
[0:29:37.6] D: I wish there was like a perfect answer. But I am pretty convinced that there are no perfect answers.
[0:29:42.0] J: I have spent a lot of time in this space recently; I researched it for a month or so, and honestly, there are no perfect answers for trying to version an API. Every single one of them has horrible potential consequences. The approach Kubernetes took is API evolution, where basically all versions of the API have to be backwards compatible; they all translate to what is an internal type in Kubernetes, and everything has to be translatable back to that.
This is nice for many reasons. It is also very difficult to deal with at times, because if you add things to an API, you can't really ever remove them without a massive deprecation effort, basically monitoring the usage of that API specifically and then somehow deprecating it. It is incredibly challenging.
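A toy version of that evolution scheme, with an invented Widget resource: every public version converts through one internal hub type, which is what keeps old clients working, and also why a field, once added, is so hard to remove:

```go
// API evolution in miniature: every served version round-trips through one
// internal type. The Widget types here are invented for illustration.
package main

import "fmt"

// v1beta1 shape: a single free-form field.
type WidgetV1Beta1 struct{ Size string }

// v1 shape: the field was split, but the old version must still round-trip.
type WidgetV1 struct{ Width, Height string }

// internalWidget is the hub every version converts through.
type internalWidget struct{ Width, Height string }

func fromV1Beta1(w WidgetV1Beta1) internalWidget {
	// The old "Size" maps onto both dimensions; this conversion path can
	// never be dropped until v1beta1 itself is deprecated and turned off.
	return internalWidget{Width: w.Size, Height: w.Size}
}

func toV1(w internalWidget) WidgetV1 {
	return WidgetV1{Width: w.Width, Height: w.Height}
}

func main() {
	old := WidgetV1Beta1{Size: "10"}
	fmt.Printf("%+v\n", toV1(fromV1Beta1(old))) // served as v1 to new clients
}
```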
[0:30:31.4] PB: I think it is 1.16 in which they finally turn off a lot of the deprecated APIs that Kubernetes had. A lot of this stuff had been moved, for some number of versions, off to different API groups; for example, Deployments used to be under extensions and now they are under apps. Some of the older APIs are going to be turned off by default in 1.16, and I am really interested to see how this plays out, you know, from kind of a chaos-level perspective.
But yeah, you're right, it is tough. Having that backwards compatibility definitely means that the contract is still viable for your customers regardless of how old their client side is, but it is kind of a gnarly problem, right? You are going to be holding those translations to that stored object for how many generations before you are finally able to get rid of some of those old APIs that you have obviously moved on from?
[0:31:19.6] CC: Deprecating an endpoint is not trivial at all. Ideally, you would be able to monitor the usage of the endpoint and see, as you move toward deprecating it, whether the usage is going down, and whether there is anything you can do to accelerate that. Which actually made me think of a question I have for you guys, because I don't know the answer to this.
Do we have access to endpoint usage, the consumption rate of the Kubernetes endpoints, from any of the cloud service providers? It would be nice if we did.
[0:31:54.9] D: Yeah, there would be no way for us to get that information, right? The thing about Kubernetes is that it is something you run on your own infrastructure, and there is no phone-home thing like that.
[0:32:03.9] CC: Yeah, but the cloud providers could do that and provide a nice service to the community.
[0:32:09.5] D: They could. That is a very good point.
[0:32:11.3] PB: [inaudible] GKE, it could expose some of the statistics around those API endpoints.
[0:32:16.2] J: I think the model right now is they just ping the community and say they are deprecating it and if a bunch of people scream, they don’t. I mean that is the only way to really know right now.
[0:32:27.7] CC: The squeaky wheels get the grease kind of thing.
[0:32:29.4] J: Yeah.
[0:32:30.0] D: I mean that is how it turns out.
[0:32:31.4] J: Regarding versioning, taking Kubernetes out of it for a second, I also think this is one of the challenges with microservice architectures, right? Because now you have the ability to independently deploy a service outside of the whole monolith, and if you happen to break a contract, whether you said you would and people just didn't pay attention, or you broke it accidentally without knowing, it can cause a lot of rifts in a system.
So versioning becomes a new concern, because you are no longer deploying one massive system. You are deploying bits of it, perhaps versioning them and releasing them at different times. So again, it is added complexity.
[0:33:03.1] CC: And then you have this set of versions talking to that set of versions. Now you have a matrix, and it is very complicated.
[0:33:08.7] PB: Yeah, and you do somewhat have a choice. You can have each service independently versioned, or you could go with global versioning, where everything within V1 can talk to everything else in V1. But it's an interesting point around breakage, because tools like gRPC kind of force you into a position where you cannot break the API, through just how the framework itself is built, and that's why you see gRPC in a lot of places where you see microservices: it helps keep the system stable.
[0:33:33.1] D: Yeah, and I will call back to that one point again, which I think was actually one of Josh's points. If you are going to build multiple services and you are building an API between them, then the communication path might be service A to service B and service B to service A. You end up building this crazy mesh in which you have to define an API at each of these points to allow for that consumption or interaction of data.
And one of the big takeaways for me in studying the cloud native ecosystem is that if you can define that API and that declarative state as a central model for all of your services, then you can flip this model on its head. Instead of trying to define an API between, or in front of, every service, you can make each service a consumer of a centralized API, and now you have one contract to write and one contract to stand by, and all of the things that are going to do work pull down from that central API.
They do the work and put the results back into that central API, meaning you are flipping this model on its head. You are no longer locking until service B can return the result to you. You are saying, "Service B, here is a declarative state that I want you to accomplish, and when you are done accomplishing it, let me know and I will come back for the results," right? And you could let me know in an event stream, or you could let me know by updating a status object that I am monitoring.
There are lots of different ways for you to let me know that service B is done doing the work, but it really makes you think about the architecture of these distributed systems. This is really one of the big highlights for me, personally, when I look at the way Kubernetes was architected: there are no private APIs. Everything talks to the API server. Everything that is doing work, regardless of what data it is manipulating, changing, or modifying, has to adhere to that central contract.
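A minimal sketch of that flipped model, with the API server stood in for by a mutex-guarded map and all of the names invented: the producer only declares intent, and the worker reconciles against the central store and reports status back, never talking to the producer directly:

```go
// Flipping the model: services share one central contract (the store)
// instead of calling each other. The store stands in for an API server.
package main

import (
	"fmt"
	"sync"
	"time"
)

type task struct {
	Spec   string // declared intent
	Status string // what the worker reports back
}

type store struct {
	mu    sync.Mutex
	tasks map[string]*task
}

// declare is what a producer ("service A") does: record intent and move on.
func (s *store) declare(name, spec string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tasks[name] = &task{Spec: spec, Status: "Pending"}
}

// worker is "service B": it pulls work from the central contract and
// writes results back, never talking to the producer directly.
func (s *store) worker() {
	for {
		s.mu.Lock()
		for name, t := range s.tasks {
			if t.Status == "Pending" {
				t.Status = "Done" // do the work, then record the result
				fmt.Println("completed", name)
			}
		}
		s.mu.Unlock()
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	s := &store{tasks: map[string]*task{}}
	go s.worker()
	s.declare("foo", "do the thing") // no blocking call to the worker
	time.Sleep(300 * time.Millisecond)
}
```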
[0:35:18.5] J: And that is an interesting point you brought up: Kubernetes, in a way, is almost a monolith, in that everything passes through the API server and all the data lives in this central place, but you still have that distributed nature too, with the controllers. It is almost a mix of the patterns, in some ways.
[0:35:35.8] D: Yeah. Thanks for the discussion, everybody. That was a tremendous talk on contracts and APIs. I hope everybody got some real value out of it. This is Duffy signing off; I will see you next week.
[0:35:44.8] CC: This is great, thank you.
[0:35:46.5] J: Cheers, thanks.
[0:35:47.8] CC: Bye.
[END OF INTERVIEW]
[0:35:49.2] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter https://twitter.com/ThePodlets and on the https://thepodlets.io website where you will find transcripts and show notes. We’ll be back next week. Stay tuned by subscribing.
[END]
See omnystudio.com/listener for privacy information.