Episodes
In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in Enterprise IT Software that have changed the course of the industry. We interview the engineering "heroes" who led those projects to tell us the insider's story. For each such project, we go back in time and do a deep-dive into the project - historical context, technical breakthroughs, team, successes and learnings - to help the next generation of engineers learn from those transformational projects.
We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episodes, we have hosted Doug Cutting and Mike Cafarella for a fascinating “look back” on Hadoop, and Reynold Xin, co-creator of Apache Spark and co-founder of Databricks for a technical deep-dive into Spark. In this episode, we host Ryan Blue, creator of Apache Iceberg, the most popular open table format that is driving much of the adoption of Data Lake Houses today. Ryan shares with us what led him to create Iceberg, the technical breakthroughs that have made it possible to handle petabytes of data safely and securely, and the critical role he sees Iceberg playing as more and more enterprises adopt the modern data stack.
Show Notes
Timestamps
* [00:00:01] Introduction and background on Apache Iceberg
* [00:02:50] The origin story of Apache Iceberg
* [00:10:38] Transactional consistency in Iceberg
* [00:12:00] Where Iceberg sits in the modern data stack
* [00:14:38] Top features that drive Iceberg’s adoption
* [00:20:00] The technical underpinnings of Iceberg
* [00:21:33] How Iceberg makes "time travel" for data possible
* [00:24:08] Storage system independence in Iceberg
* [00:30:13] Query performance improvements with Iceberg
* [00:35:08] Alternatives to Iceberg and pros/cons
* [00:40:45] Future roadmap and planned features for Apache Iceberg
Transcript
Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host. In this podcast, we interview legendary engineers who have created infrastructure software that has completely transformed and shaped the industry. We celebrate those heroes and their work and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products.
Today, we have another episode in our data engineering series. We are going to dive deep into Apache Iceberg, an amazing project that is redefining the modern data stack. For those of you who don't already know Iceberg, it is an open table format designed for huge petabyte-scale tables. It provides database-like functionality on top of object stores such as Amazon S3. And the function of a table format is to determine how you manage, organize, and track all of the files that make up a table. You can think of it as an abstraction layer between your physical data files written in Parquet or other formats and how they are structured to form a table. The goal of the Iceberg project is really to allow organizations to finally build true data lake houses in an open architecture, avoiding the vendor and technology lock-in we have all had to accept as a trade-off. And just to give you a little bit of history, Iceberg development started in 2017. In November 2018, the project was open-sourced and donated to the Apache Foundation. And in May 2020, the Iceberg project graduated to become a top-level Apache project. Today, I have the pleasure of welcoming Ryan Blue, the creator of the Iceberg project, to our podcast. Welcome, Ryan. It is so great to have you on and thanks for making the time.
Ryan: Thanks for having me on. I always enjoy doing these. It's pretty fun to talk about this stuff.
Sudip: Awesome. Maybe I'll start with having you tell us the origin story of Iceberg. Why did you create it in the first place? I imagine it probably goes back all the way to your days at Cloudera. Anything you can help us with connecting the dots between where you are at Cloudera, then Netflix, and then now building Iceberg?
Ryan: Yeah, you are absolutely correct. It stemmed from my Cloudera days, as well as without Netflix, it wouldn't have happened. So basically what was going on at Cloudera was you had the two main products. You had Hive and Impala in the query space. And they did just a terrible job interacting with one another. I mean, even three years into the life of Iceberg as a project, you had to run this command in Impala to invalidate Hive state and pull it over from the metastore. And even then, you just had really tough situations where you want to go update a certain part of a table, and you have to basically work around the limitations of the Hive table format. So if you're overwriting something or modifying data in place, which is a fairly common operation for data engineers, you would have to read the data, union it, and sort of deduplicate or do your merging logic with the new data, and then write it back out as an overwrite. And that's super dangerous, because you're reading all the data essentially into your Hadoop cluster at the time, and then you're deleting all that data from the source of truth. You're saying, okay, this partition is now empty. It doesn't exist anymore. And then you're adding back the data in its place. And if something goes wrong, can you recover? Who knows? And so it was just a mess, and people knew it was a mess all over the place. I talked to engineers at Databricks at the time and was like, hey, we should really collaborate on this and fix the Hive table format and come out with something better. But it was just not to be at Cloudera, because they were a distributor. They had so many different products, and it was very hard to get the will behind something like this, which is where Netflix certainly helped.
Sudip: What was the scale of data at Netflix when you got there? Just rough estimates.
Ryan: I think from around that time, we were saying we had about 50 to 100 petabytes in our data warehouse. It's hard to tell how accurate that is at any given time because of data duplication. Exactly what I was talking about a second ago means you're rewriting a lot. And because your write amplification is at the partition level in Hive tables, you would just do so much extra writing and keep that data around for a few days, so it's very hard to actually know what was live or active.
Sudip: So you landed at Netflix. They have this amazing scale of data. You came from Cloudera with some very specific frustrations, and I would imagine some very specific thoughts on how you're going to solve it at Netflix scale. Maybe walk us a little bit through how those first year or two was at Netflix. How did you go about creating what is now Iceberg?
Ryan: So we didn't actually work on it for a year or two. First, we moved over to Spark, and that was fun. So Netflix had a pre-existing solution to sort of make tables a little bit better. What we had was called the batch pattern. And this was basically, we hacked our Hive meta store so that you could swap partitions. So you could say, instead of deleting all these data files and then overwrite new data files in the same place, we just used different directories with what we called a batch ID. And then we would swap from the old partition to the new partition. And that allowed us to make atomic changes to our tables at a partition level. And that was really, really useful and got us a long way. In fact, we still see this in practice in other organizations where Netflix data engineers have gone off and gotten involved in infrastructure at other companies. And it worked well, as long as the only changes you needed to make swapped partitions. So again, the granularity was just off. But this affected Iceberg primarily because we really felt the usability challenges. We had to retrain basically all of our data engineers to think in terms of the underlying structure of partitions in a table and what they could do by swapping. So rather than having them think in a more modern row-oriented fashion, saying, okay, take all my new rows, remove anything where the row already exists, and then write it into the table, they had to do all of that manually and then think about swapping because you can't just insert or append. And so the two things were the usability was terrible, and it really got us thinking about how do we not only solve our challenge here, make the infrastructure better, but how do we also make a huge difference in the lives of the customers and everyone on our platform to make it so that they don't have to think about these lower levels. 
Usability has always been important to me in my career, so I had already said we're just going to fix schema evolution. But really thinking about the operations that people needed to be able to run on these tables, I think, influenced the design. And because I'd ported our internal table format to Spark twice by the time we started the Iceberg project, I also knew what I needed to change in Spark.
Sudip: I can imagine. How would you describe the way you were storing data at Netflix? Is that closest to what we all now call a Data Lake? Is that roughly what it was?
Ryan: The terms are very nebulous here, but I think of Data Lake as basically the post-Hadoop world where everyone basically moved to object stores. And I know that Hortonworks was calling their system a Data Lake at the time, but I think we've really coalesced around this picture of a Data Lake as a really Hadoop-style infrastructure with an object store as your data store. And then, of course, now we have Lake Houses, right, which again plays a pretty meaningful role in Iceberg's adoption too.
Sudip: Any particular thought on what's driving this movement towards Lake Houses?
Ryan: Well, I'm going to have to answer a question with a question, which is, what do you mean by Lake House? Because there's a broad spectrum here. I kind of see it as where you can actually do warehouse-style analytics on Data Lakes. The way I think about it is: with a Data Lake, you dump whatever data you want in whatever format you want, and then you need some kind of map, obviously, to navigate that. And that's when the Data Lake graduates into becoming a Lake House, in my mind.
Sudip: Okay. I think that that corresponds with what I hear from most practitioners.
Ryan: And this is absolutely what we were targeting with Iceberg. We were saying, let's take the guarantees and semantics of a traditional SQL data warehouse, and let's bring those into the Hadoop world. Let's give full transactional ACID semantics to the tables. Let's have schema evolution that just always works, no matter what the underlying data format is. And let's also restore the table abstraction so that people using a table don't need to know what's underneath it. They don't need to be experts in building that structure. They don't need to be experts in how this is laid out so they can query that structure and so on. You should just be able to use the logical table space. And I say this a lot. We're going back to 1992. What I think is difficult about the term Lake House in general is you've got all these other definitions. Starburst thinks about Lake House as an engine that can pull together the highly structured Iceberg tables, semi-structured JSON, completely unstructured storage, and documents and be able to make sense of those, as well as to talk to transactional databases or even Kafka topics and things like that. So they see it from this vision looking downward at a disparate collection of data sources. On the other hand, at Tabular, my company, we come at it as a single source of storage: we think of it as what all can I connect to Iceberg tables? We've got this rich layer of storage, and so the closest definition for me is Lake House is this architecture where I can connect Flink and Snowflake and BigQuery and use them all, and use them all equally. And then you still have other vendors who actually claim to be Lake House vendors and use it in their marketing, but don't support schema evolution. Don't actually support transactions and ACID semantics. And so I think that there's just a really big space out there. So for our part, we think of independence, or separation of compute and storage, as sort of defining what we do.
Because you get that flexibility to use any engine with the same single source of truth. We layer in security and make sure that no matter if you're coming in through a Python process on someone's laptop, or a big warehouse component like Redshift, that you get that excellent SQL behavior and security.
Sudip: That makes sense. I want to maybe double-click on something you just started talking about, which is transactional consistency. And it is, I would say, one of the most important things that makes a Data Lake House in my mind. And you have a really interesting way of how you went about solving that. Would you mind expanding on that a little bit? What does it look like from a technical standpoint, and what was the breakthrough that made it possible?
Ryan: I'm actually pretty happy with that being one of the defining features. I would throw in the SQL behavior stuff as well, because I think that really is a critical part. But yeah, I think that the ability to use multiple engines safely is probably the defining characteristic of basically the shift in data architecture in the next 10 years. We didn't even realize that at the time. It was a bit of a happy accident because we were looking to make sure that our users at Netflix could use Spark and Pig at the same time. And they could also have Flink and these background jobs that pick data up in one region and move it to our processing region and all of those things need to commit to the same table. And what we didn't realize was that that's not something that you can do in the data warehouse world. In the data warehouse world, everything comes through your query layer and you're sort of bottlenecked on the query layer. Whereas we knew that we wanted to be bottlenecked by the scalability of S3 itself. And we needed to come up with this strategy for coordinating across completely different frameworks and engines that knew nothing about one another. And that was the happy accident. We were doing it so that we had transactions. I was literally thinking at the time, I want someone to be able to run a merge command rather than doing this little dance of how do I structure my job so that I'm overwriting whole partitions. Our problem was that we wanted to make fine-grained changes to these tables. We wanted to put more of that logic into the engine with the merge and update commands and things like that. But the background was that we had very different writers in three, four different frameworks. And we needed to support all of them.
And in designing a system that could support all of them concurrently, we actually unlocked that larger problem, which was how do I build data architecture from all these different engines? And that's what we're seeing today as this explosion. We're seeing everyone has data architecture with four, five different data products and engines. What they want is to share storage underneath all of those so that you don't have data silos. So that it's not like, oh, I put all this data in one warehouse and all this other data in Snowflake, and in order to actually work across those boundaries I have to copy and I have to maintain sync jobs and I have to worry is it up to date? Did today's sync job run? Am I securing that data in both places according to my governance policies? It's a nightmare. And so we're actually seeing this happy accident of, hey, now we have formats that can support both Redshift and Snowflake at the same time. And like I said, I think that this is going to drive the next 10 years of change in the analytic data space.
Sudip: That's awesome! Can you tell us a little more about how you went about really making ACID transactions possible in Iceberg? And I'm probably alluding to some of the really cool things you do with snapshots and so on. Can you talk a little more about the technical underpinning of that?
Ryan: So the idea is to take an already atomic operation and have some strategy to scale up that atomic operation to potentially petabyte scale or even larger. I don't think that there is a limit to the volume of data you can commit in a single transaction. The trade-off is how often you can do transactions rather than the actual size of the transaction itself. So what we do is we start with a simple tree structure. The Hive format that we were using before had a tree structure as well. You had data files as the leaves, directories as nodes, and then a database of directories essentially. So you had this multi-level structure. And what you would do is you'd select the directories you need through additional filters and then go list those directories to get the data files and then you'd scan all of those data files. We wanted something where you didn't have to list directories because in an object store that is really, really bad. In fact, at the beginning it was eventually consistent and so we just had massive problems with correctness that we had to solve. In fact, S3 Guard actually came out of some of the earlier work on our team before I was there. That's a fun little tidbit. What we wanted to do with Iceberg was mimic the same structure, but we wanted to have essentially files that track all of the leaves in this tree structure. We started with files at the bottom of the tree and the leaves and then you have nodes called manifests that list those data files and then you have a manifest list that corresponds to a different version of that and that just says, hey, these four manifests, those make up my table and then you can make changes to that tree structure over time efficiently. You're only rewriting a very small portion of that tree structure and it all sort of rolls up into this one central table level metadata file that stores essentially 100% of the source of truth metadata about that table. 
You've got the metadata file, a manifest list for every known version, manifests that are shared across all of those versions, and then finally data files that are also shared across versions. The very basic commit mechanism here is just pointing from one tree root to another. You can make any change you want in this giant tree, write it all out, and then ask the catalog to swap. I'm going from V1 to V2 or V2 to V3, and if two people ask to replace V2 at the same time, one of them succeeds and one of them fails. And that gives us this linear history and is essentially the basis for both isolation, keeping readers and writers from seeing intermediate states, and atomicity.
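The tree structure and compare-and-swap commit Ryan describes can be sketched in plain Python. This is a toy, in-memory model of the idea, not Iceberg's actual (Java) implementation; all class and method names here are illustrative:

```python
# Toy model of Iceberg's metadata tree and optimistic commit protocol.
# Real Iceberg persists each of these layers as files in an object store;
# here we use in-memory objects to show the shape of the idea.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:            # leaf of the tree
    path: str

@dataclass(frozen=True)
class Manifest:            # lists data files
    data_files: tuple

@dataclass(frozen=True)
class Snapshot:            # "manifest list": one immutable version of the table
    version: int
    manifests: tuple

class Catalog:
    """Holds the single mutable pointer: the current table metadata."""
    def __init__(self):
        self.current = Snapshot(version=0, manifests=())

    def swap(self, expected: Snapshot, new: Snapshot) -> bool:
        # Atomic compare-and-swap: succeeds only if nobody committed first.
        # This one operation is the basis of both isolation and atomicity.
        if self.current is expected:
            self.current = new
            return True
        return False

catalog = Catalog()
base = catalog.current

# Two writers build new versions from the same base...
v1 = Snapshot(1, base.manifests + (Manifest((DataFile("a.parquet"),)),))
v2 = Snapshot(1, base.manifests + (Manifest((DataFile("b.parquet"),)),))

# ...but only one swap wins; the loser must retry on top of the winner.
print(catalog.swap(base, v1))  # True
print(catalog.swap(base, v2))  # False: base is no longer current
```

Because snapshots are immutable and share unchanged manifests, a commit rewrites only a small slice of the tree while the root swap stays a single atomic operation.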
Sudip: Got it. And I imagine this architecture is also what makes time travel possible for Iceberg and just for listeners, the way I think about time travel is it allows you to roll back to prior versions if you want to correct issues in case of errors with data processing or even help data scientists to recreate historical versions of their analysis. Can you help us understand a bit more how time travel has been made possible for Iceberg?
Ryan: Yeah, absolutely. So any given Iceberg table tracks multiple versions of the table, which we call snapshots. And because you have multiple versions of the table at the same time, that's one of the ways we isolate things. So if I commit a new version, say, snapshot 4 in the table, and you're reading snapshot 3, I can't just go clean up snapshot 3 and all the files that are no longer referenced. I wouldn't want to, but it might also mess up someone who's reading that table. So we keep them around for some period of time. By default, it's five days. And our thinking there is just enough time for you to have a four-day weekend and a problem on Friday, and you can still fix it on Tuesday and know what happened. So we keep them around for five days, and that actually means that by having that isolation, we have the ability to go back in time and say, okay, well, what was the state of the table five days ago? And we've iterated on that quite a bit to bring into use cases that we're talking about. So now you can tag, which is essentially to name one of those snapshots and say, hey, I'm going to keep this around as Q3 2023. And this is the version of the table I use for our, say, audited financials or to train a model. And you want to keep that for a lot longer. So you can set retention policies on tags, and it just keeps that version of the table around. And as long as the bulk of the table is the same between that version and the current version, it costs you very little to keep all that data around.
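The retention behavior Ryan describes, a five-day default plus longer-lived tags, can be sketched as a small snapshot-expiration routine. This is an illustrative toy, not Iceberg's API; the function and variable names are assumptions:

```python
# Toy sketch of snapshot expiration with tags. A snapshot survives if it
# is within the default retention window OR a tag still references it.
import datetime as dt

DEFAULT_RETENTION = dt.timedelta(days=5)   # the default Ryan mentions

def live_snapshots(snapshots, tags, now):
    """snapshots: {snapshot_id: created_at}; tags: {name: (snapshot_id,
    retention)}. Return the snapshot ids that must be kept."""
    keep = []
    for snap_id, created in snapshots.items():
        recent = now - created <= DEFAULT_RETENTION
        tagged = any(
            ref == snap_id and now - created <= retention
            for ref, retention in tags.values()
        )
        if recent or tagged:
            keep.append(snap_id)
    return keep

now = dt.datetime(2024, 1, 10)
snapshots = {
    3: dt.datetime(2024, 1, 1),   # 9 days old: past default retention
    4: dt.datetime(2024, 1, 8),   # 2 days old: still recent
}
# A tagged "audited" version kept for a year, like the Q3 2023 example.
tags = {"Q3-2023": (3, dt.timedelta(days=365))}

print(live_snapshots(snapshots, tags, now))  # [3, 4]: 3 survives via its tag
```

As in the conversation, keeping the tagged version is cheap when most data files are shared between that snapshot and the current one.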
Sudip: I know there's no typical scenario, but in general, if you compare the metadata versus the actual data, what is the typical overhead of the metadata you store in a snapshot? What is the ratio of data to metadata?
Ryan: That sort of ratio? It depends. Wider tables with more columns are going to stack up more metadata. I would say that there's generally a 1000x, or three or four orders of magnitude, difference between the amount of metadata and the amount of data in a table. Now that differs based on how large your files are, because you can really push that. We collect value-range metadata for each column so that we can filter at a very fine granularity; this is another way that we improve on the Hive table standard, where you had to read everything in a directory. We can actually go and filter down based on column ranges. So you can have column ranges that are effective, but then within those data files, you can have further structures that allow you to skip a lot of data. And so where your trade-off is there is a little fuzzy. You can have lots of data files in a table, or you can compact them and have just a few data files in a table. It kind of depends on the use case which way you would go.
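The column-range filtering Ryan mentions can be sketched as min/max pruning over file-level statistics. A toy model, assuming a simple dict layout for the per-file metadata rather than Iceberg's real manifest encoding:

```python
# Sketch of file-level pruning using per-column min/max statistics, the
# kind of metadata Iceberg keeps in its manifests (toy representation).

def prune(files, column, lo, hi):
    """Return only the files whose [min, max] range for `column` can
    overlap the query range [lo, hi]; the rest are skipped without
    ever being opened."""
    kept = []
    for f in files:
        fmin, fmax = f["ranges"][column]
        if fmax >= lo and fmin <= hi:   # ranges overlap
            kept.append(f["path"])
    return kept

files = [
    {"path": "a.parquet", "ranges": {"ts": (100, 199)}},
    {"path": "b.parquet", "ranges": {"ts": (200, 299)}},
    {"path": "c.parquet", "ranges": {"ts": (300, 399)}},
]

# A query for ts BETWEEN 250 AND 320 only needs two of the three files,
# and we learn that from metadata alone, with no directory listing.
print(prune(files, "ts", 250, 320))  # ['b.parquet', 'c.parquet']
```

This is the same idea that later shrinks job planning: the fewer files that survive pruning, the less there is to plan and scan.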
Sudip: That makes sense. You have that knob to turn depending on how granular you want to go. I want to talk a little bit about one really striking feature of Iceberg, which is you don't really have any storage system dependency. You could go from S3 to Azure to something else. I'm curious a little bit how you guys have achieved that. It's like, you know, on the Snowflake side, you have separation of compute and storage. In your case, again, you kind of broke away from any dependency on the storage system. Can you tell us a little bit more about how you're able to achieve that?
Ryan: I think when we started this project, people had been trying to remove Hadoop from various libraries in the Hadoop ecosystem for a very, very long time. And we still have an optional dependency on Hadoop: you can use those libraries, but you don't have to. What we wanted to do was lean into the fact that everyone was moving towards object stores instead of file systems. Hadoop is a file system, and it has things like directory structures and all of that, which led us to this listing of directories for our table representation and some of these anti-patterns that we're trying to get rid of with the modern table formats. What we wanted to do was have a simpler abstraction than a file system and just go with what the object store itself could do. No renames, no listing, no directory operations. We just wanted simple put and get operations. We designed Iceberg itself to have no dependencies on these expensive, silly operations. We did that based on our experience actually working with S3 as a Hadoop file system. We had our own implementation of the S3 file system, and we kept turning off guarantees to make it faster. For example, if you go to read a file in S3 with S3A, the Hadoop implementation that talks to S3, you'll actually go and list it as though it were a prefix. The reason why is you want to make sure that it's not a directory. It could be a directory if there are files that have that file name, a slash, and then some other name. We said, well, that's silly. We always want this to go and read it as a file. We don't want that extra round trip going to make sure, hey, this isn't a directory, is it? We were able to really squeeze some extra performance out of our I/O abstraction by letting go of this idea that it should function and behave like a file system. Let's embrace S3 for what it is. It's an object store, and let's use it properly.
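The contrast Ryan draws between a file-system API and an object-store API can be made concrete with a deliberately minimal interface. This is a toy, in-memory sketch in the spirit of Iceberg's pluggable I/O layer, not its real API; the class and method names are illustrative:

```python
# A minimal object-store I/O abstraction: whole-object put and get only.
# No rename, no listing, no directory operations, so there is nothing
# to emulate on top of a store like S3, and no extra round trips like
# the "is this a directory?" listing Ryan describes in S3A.

class ObjectStoreIO:
    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data     # one PUT; keys are opaque strings

    def get(self, key: str) -> bytes:
        return self._objects[key]     # one GET; never list-then-read

io = ObjectStoreIO()
io.put("metadata/v2.metadata.json", b'{"snapshot": 2}')
print(io.get("metadata/v2.metadata.json"))  # b'{"snapshot": 2}'
```

Because every file a query needs is reachable from the metadata tree by key, the table format never has to ask the store what exists, which is exactly what makes it portable across S3, Azure, and other object stores.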
Sudip: That's super interesting. Thank you for sharing that. And if I may step back a little bit, when you think about your Apache Iceberg users who are really getting value out of it, what do you think are the top, maybe three features they are particularly excited about? What really makes them move from whatever they are using today to Iceberg?
Ryan: Time travel is one of the big ones. And the others, I don't know that I would really consider them features. Granted, from a project perspective, they're features. But I've always talked to people about Iceberg by opening the conversation with, I don't want you to care about Iceberg. Iceberg is a way of making files in an object store appear like a table. And you should care about tables. You should be able to add, drop, rename, reorder columns. And you shouldn't care about things beyond that. So Iceberg, the purpose of it is to restore this table abstraction that we had hopelessly broken in the Hadoop days. If you talk about core features that are super important for Iceberg, it's stuff like being able to make reliable modifications to your tables. Having that isolation between reads and writes, and knowing that as I'm changing a table, you're not going to get bogus results. It is being able to forget about what's underneath the table. In the Hadoop days, you had to know, okay, am I using CSV? Or am I using JSON? Or am I using Parquet? Because all three have different capabilities when it comes to changing their schema. It's all ridiculous, right? CSV, no deleting columns. But you can rename them. Parquet, you can't rename, but you can drop columns. Those sorts of things, they're just paper cuts. One of the things I'm most proud of with Iceberg is that we got rid of all the paper cuts. It just works. It doesn't always give you exactly what you would want because it's a never-ending process to make it better, but it just works all the time. That's what I hear from people who are moving over to it, that it's just like, hey, I didn't have to think about this. The defaults are, for the most part, really good. It's dialed in. It just works, and I don't have to think about these challenges anymore. Those challenges are actually very significant. Things like hidden partitioning. I talked about restoring the table abstraction to the SQL standard in 1992. 
Partitioning in the Hive-like table formats, they all require additional filters because you're taking some data property that you like, like timestamp. When did this event happen? You're chunking the data up into hours or days along that dimension and storing the data so that you can find it more easily. But in Hive-like formats, that's a manual process, which is ridiculous. Especially if you switch the timestamp mechanism, too. If you want to go from hours to days, that is, again, ridiculous. You can't do that at all. In order to switch the partitioning of a table in Hive or Hive-like tables, you have to completely rewrite the data and all the queries that touched that table. Because it's all manual, the queries are tied to the table structure itself. You have a column that corresponds to a directory and you have to filter by that column or else you scan everything. It's incredibly error-prone. On the right side, you have to remember, what time zone am I using to derive the date? Right? Because that's different. Hey, which date column did I use for this? What date format, potentially, if you're going to a string, did I use? And if you get any of those things wrong, and you just mis-categorize data, you might not see that. The users might not see that. It's just wrong. We've got an example that we did with Starburst. It's a data engineering example where you go and you connect Tabular and Starburst together and you run through this New York City Taxi dataset example. I love this example because you grab a month's worth of New York City Taxi data and you put it into a directory and you read it and it's like stuff from 2008 even though it's 2003 in the dataset. You're like, where did all this extra data come from? And that's exactly what you're doing with manual partitioning all the time. 
Someone said, hey, this is the data for this month, and you just trusted it and you put it there, and now when you actually go look at it, you find stuff from like 10 years in the future. It's a massive problem. And on the read side, you don't know if they put the data in correctly. You don't know if you're querying it correctly. You might not even query it correctly. You can get different results if you just say, hey, I want timestamps between A and B, or if you say, I want timestamps between A and B but make sure I'm only reading the correct days. You can actually get different results because of where data was misplaced. It's a huge problem. Again, in Iceberg, we just said, hey, let's just handle that. Let's keep track of the relationship between timestamps and how you want the data laid out. Let's have a very, very clear definition of how to go between the two and then be able to take filters and say, oh, you want this timestamp range? Well, I know the data files you're going to need for that. So we do a much better job of just keeping track of data and eliminating those data engineering errors.
Sudip: So you're kind of completely abstracting away everything that data engineers had to do to just keep the data consistent, transactionally valid, all of that. You're taking all of that away and just exposing the table format to the different engines.
Ryan: We take inspiration from SQL. SQL is declarative. Tell me what you want, not how to get there. And we've really compromised that abstraction in the Hadoop landscape. We've said, oh, in order to cluster the data effectively for reading, you're going to want to sort your data before you write it into the table. Which is kind of crazy, because we don't actually sort it because we want it sorted. We sort it because we want it clustered and written a certain way, and the engine happens to do that if we write it that way. It's a really backwards way of thinking.
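The hidden-partitioning idea in this part of the conversation, where the table rather than the writer derives the partition value via a declared transform, can be sketched as follows. The transform names mirror Iceberg's day and hour transforms, but the code itself is an illustrative toy, not Iceberg's implementation:

```python
# Sketch of hidden partitioning: a partition spec names a source column
# and a transform, and the partition value is derived automatically on
# write. There is no hand-maintained date column for a writer to get
# wrong, and readers filter on the real timestamp, not the layout.
import datetime as dt

TRANSFORMS = {
    "day":  lambda ts: ts.date().isoformat(),
    "hour": lambda ts: ts.strftime("%Y-%m-%d-%H"),
}

def partition_for(row, spec):
    """spec = (source_column, transform_name), applied on every write."""
    col, name = spec
    return TRANSFORMS[name](row[col])

row = {"event_ts": dt.datetime(2023, 7, 14, 9, 30), "fare": 12.5}

print(partition_for(row, ("event_ts", "day")))   # 2023-07-14
print(partition_for(row, ("event_ts", "hour")))  # 2023-07-14-09

# Switching from hourly to daily partitioning is just a new spec; old
# data stays queryable because queries filter on event_ts itself, and
# the table knows how to map a timestamp range to either layout.
```

This is the single, well-defined mapping between timestamps and layout that removes the mis-filed-data problem from the taxi-dataset example.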
And in Iceberg, we've actually gone and said, okay, this is the sort order I want on a table. This is the structure I want on a table. We've made all of these things declarative. And then we've gone back through, and this has been a process of years, to make the engines respect those things. So Spark and Trino will take a look at the table configuration and say, oh, okay, I know how to produce data into this table for the downstream consumption after the fact. And you shouldn't have to figure out how to fiddle with your SQL query in order to get Trino to do the right thing. And the same thing in Spark. Everything should just respect those settings. We're just taking it back to a declarative approach, which is the basis of databases in the first place.
Sudip: I want to touch on one other thing, which is: what does all of this mean for query performance? I constantly hear that people who are using Iceberg are really happy about the performance they get. Can you talk a little bit about how that is possible, given that you are of course doing a lot of very complex work under the hood?
Ryan: Yeah. There are a number of dimensions to this. The first thing that we did was we just made job planning faster. Our early, early presentations on Iceberg were like, hey, we had a job that took us four hours to plan. And that was just listing directories. And sure, we took that down to one hour because we were able to list directories in S3 in parallel. And then we said, okay, well, what if we actually made this tree structure and we used metadata files to track the data files for that query? And we get it down to something ridiculous. And then we said, okay, well, what if we then add more metadata? Because we're not relying on directory listing that only gives us this file exists. What if we kept more metadata about that file? What if we knew when it was added to the table and what column ranges were in that file? And things like that. Well, we could then use those statistics to further prune. And so we got this thing that used to not even complete and took four hours just to plan. And we got that down to 45 seconds to plan and run the entire thing because we were able to basically narrow down to just the data that we needed. So, you know, there are success stories like that. We also have a success story where we replaced an Elasticsearch cluster that was multiple millions of dollars per year with an Iceberg table because we were just looking up based on, you know, the one key. Now, ElasticSearch does a lot more than Iceberg, and I don't want to misrepresent there, but having an online system that is constantly indexing by just one thing, you can just use an Iceberg table and its primary index because Iceberg has a multi-level index of data and metadata. It's another reason for that tree that we were talking about earlier. So there's that aspect. Can we find the data faster and things like that? 
Iceberg also has rich metadata, provided by this tree, that is actually accurate, because you know exactly which files you're going to read, and that is a far better input to cost-based optimization and that sort of thing. So we just keep accumulating these dimensions. The thing that we've been exploring the most lately is actually unrelated to better or more metadata about the data. Granted, we are going into, can we keep NDV statistics and things that are better for cost-based optimizers. But what we've been finding out at my company, Tabular, is that automated optimization, which is background rewrites of data, is amazingly impactful. So let me go into a little of why that's unlocked by Iceberg. In the before times, with the Hive table format, you couldn't maintain data. You had to write it the correct way the first time, because you couldn't make atomic changes to that data. Whenever I touched data, your query might give you the wrong answer. So the solution across your organization was: never touch data. Write it once and forget about it. That is, unfortunately, the status quo in most organizations. Write it once, try to do a good job, but don't worry about it after that. And that "try to do a good job" means only the most important tables actually get someone's attention. If you've got a table with a thousand tiny files, well, a thousand tiny files isn't really enough to worry about. It's when we get to hundreds of thousands of tiny files that we start caring. So there's this huge, long tail of performance problems. It's not awful, but it's not great. Automated optimization is allowing us to basically fix things after the fact. The fact that we fixed atomicity, the fact that we can make changes to our table, combined with the declarative approach I was talking about earlier, enables us to have processes that go through, look for anti-patterns and problems, and fix them.
Automated compaction is one example: you might have a ton of overhead just because you have tiny files, which means you have more metadata in the table, and everything is generally slower, and it's not great. If you just have an automated service going and looking for all those tables with one partition holding a thousand files that are all 5 KB, it turns out you can really save a lot on your storage costs and your compute costs and all of those things, and it's cheap. It's automated, so it's not like you're dedicating an entire person to it. This is not a problem that data engineers have time to care about, but it has tremendous value to the organization, in that everything just functions more smoothly.
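The planning side of such a compaction service can be sketched as a simple bin-packing pass. In real deployments this is what table-maintenance tooling does for you; the sketch below is just a toy, with an assumed 128 MB target file size and a greedy grouping strategy.

```python
# Toy sketch of automated compaction planning: greedily bin-pack many
# tiny files into rewrite groups close to a target output file size.

TARGET_SIZE = 128 * 1024 * 1024  # aim for ~128 MB output files (assumed)

def plan_compaction(file_sizes, target=TARGET_SIZE):
    """Group file sizes (bytes) so each group totals roughly `target`."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# A thousand 5 KB files collapse into a single rewrite group, i.e. one
# output file (and one metadata entry) instead of a thousand.
print(len(plan_compaction([5 * 1024] * 1000)))  # 1
```

Because Iceberg commits are atomic, a background service can swap the thousand tiny files for the one compacted file without readers ever seeing a partial state, which is exactly what the Hive-style model could not do.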
Sudip: So you can basically go back and clean up all those data stores without putting any engineers behind it.
Ryan: Yeah, exactly. I mentioned combining this with the declarative approach. What Iceberg has done is move a lot of tuning settings onto the table itself. So we can tune things like the compression codec, the compression level, row group size, page size, different Parquet file properties. We can figure out the ideal settings for just that data set, and then have some automated process go and use those settings. So there's a ton of data out there sitting in Snappy compression that's probably four or five years old. What if you had a process that could safely recompress that with LZ4 or something and cut your AWS S3 bill in half? That's the kind of impact we're talking about, and while you're cutting down the size of your data by 50%, that also generally makes your queries 50% faster. That is pretty amazing. Anyone with historical data would sign up for that in a heartbeat, right? Because we are all storing data that we never go back and compress, and yet pay AWS for. That's kind of where Tabular is building a business, so yeah, I hope so.
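A hypothetical sketch of what "tuning settings on the table" means in practice: the table carries writer properties, and every engine's writer consults them instead of hard-coding choices per job. The property names below are modeled on Iceberg's table properties, but the `writer_config` helper and the fallback defaults are invented for illustration.

```python
# Hypothetical sketch of declarative table tuning: settings live on the
# table, and any writer reads them instead of choosing per job.

table_properties = {
    # Names modeled on Iceberg table properties; values are examples only.
    "write.parquet.compression-codec": "lz4",
    "write.target-file-size-bytes": str(128 * 1024 * 1024),
}

def writer_config(props):
    """Settings a writer would pick up from the table, with fallbacks."""
    return {
        "codec": props.get("write.parquet.compression-codec", "snappy"),
        "target_file_size": int(props.get("write.target-file-size-bytes", "0")),
    }

cfg = writer_config(table_properties)
print(cfg["codec"], cfg["target_file_size"])  # lz4 134217728
```

The point is that an optimization service only has to update the table's properties once; every subsequent write, from any engine, picks up the new codec and file size automatically.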
Sudip: That's fantastic! I want to shift gears a little and talk about the other alternatives to Iceberg. There are, of course, Delta Lake and Apache Hudi that come up. How do you think about those solutions versus Iceberg? What are the pros and cons that you can think of?
Ryan: I think most people are standardizing on Iceberg these days. I have theories for that, but one of the biggest impacts is the reach of Iceberg into proprietary engines. That's largely been the past year, year and a half, where you've seen really big investments from companies like Snowflake and Google. Redshift announced support. Cloudera is moving basically all of their customers onto Iceberg. The broad commercial adoption of Iceberg in particular is, I think, one of its biggest selling points. If you're thinking about this space, either as a customer or as a vendor, customers have the extra concern of which vendors support the thing. But vendors and customers have two things in common. The first is: is there really a technical fit? Is the technical foundation there? Formats like Hudi and Hive that use Hive partitioning don't support schema evolution, and they basically don't support ACID transactions either. Hudi has gone a long way towards getting close to ACID transactions, but it's still not quite there. I fired it up on a laptop last year and was able to drop data in a streaming job, duplicate rows, and all sorts of things. So it's just not quite there. I think Iceberg has that technical foundation, that commit protocol that we were talking about in the beginning. It's the same approach that Delta uses, by the way. That is really solid and reliable. So I think that's one thing that both vendors and early adopters are looking to. Another technical-foundation question is portability. Can you actually build an implementation in another language and support all of the table format's properties? The classic example here is bucketing: Hive used Java native hashcodes, and reimplementing Java native hashing on some other system is not going to happen. So you basically had this whole set of things that were not portable outside Java.
Hudi has many similar issues there, and I don't think it's going to be something that can be ported to C++ or Python or some of these other languages that you really want to fold into the group. So I think the distinguishing factor between Delta and Iceberg on one side and Hudi on the other is that, in Hudi's case, that technical foundation is just not there. On the other side, there's openness. Assuming that technical foundation is there, is this a project that people can invest in? Can you become a committer? Can someone from Snowflake become a committer? That really influences vendors' decisions, because it's a very tough spot to support a data format that is wholly controlled by your competitor. And I think that's why most vendors have chosen to recommend Iceberg: Iceberg has the technical foundation, and it's also a very neutral project. It was designed to be neutral from the start, even before we realized that independence and neutrality in this space were going to be absolutely critical. Again, we didn't think of being able to share data between Databricks and Snowflake. We thought, we want Snowflake to be able to query our data at Netflix. We didn't think about the dynamics of these two giants needing to share a dataset. Do you trust Databricks to expose data to Snowflake, or Snowflake to expose data to Databricks? Do you think that's performant? Do you trust it? There are a lot of issues there. I'm not making any accusations about those two in particular. All I'm saying is you want to know that your format is neutral and puts everyone on a level playing field. And especially if you're a vendor, you want to know that you're on a level playing field with other vendors.
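The "commit protocol" Ryan credits for Iceberg's reliability is, at its core, optimistic concurrency: a writer prepares a complete new metadata snapshot, then atomically swaps the table's current-metadata pointer, but only if it still points at the version the writer started from. A toy Python model (the `Catalog` class and version strings are invented for illustration):

```python
import threading

# Toy sketch of an optimistic, atomic commit: swap the table's
# current-metadata pointer only if no one else committed in between.

class Catalog:
    def __init__(self):
        self._lock = threading.Lock()
        self.current = "v0"  # pointer to the current metadata version

    def commit(self, expected, new):
        """Compare-and-swap: fail cleanly on a concurrent commit."""
        with self._lock:
            if self.current != expected:
                return False  # conflict: re-read, rebase the change, retry
            self.current = new
            return True

catalog = Catalog()
base = catalog.current
assert catalog.commit(base, "v1")       # first writer wins
assert not catalog.commit(base, "v1b")  # stale writer detects the conflict
print(catalog.current)  # v1
```

Because a failed commit changes nothing, readers only ever see a complete old version or a complete new one, which is what makes safe concurrent writes (and background compaction) possible.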
Sudip: Absolutely, and I think that comment on openness applies in particular to Delta. That's where I think both vendors and customers have concerns. Let me ask you this question. Do you imagine you would ever build an engine yourselves? Since you own the table format, you probably could build a pretty optimized engine. Is that ever a plan?
Ryan: Let me correct you there, because we don't own the table format. We contribute to the community, but we participate on the same terms as everyone else. And that's actually a really, really important distinction for us, because we don't want to turn into a closed community. The adoption of Iceberg has been fueled by the neutrality of the community itself and the ability of all of these different vendors and large tech companies to invest and know that they have a say in what's going on. So I'm always very careful to say that Tabular is not the company behind Iceberg. We're contributors to Iceberg, and we know quite a bit about it. I'm a PMC member, but I never want to do anything to compromise that neutrality. But getting back to your question: are we going to build a compute engine? I don't think so, because I think there is a real opportunity here in independent storage. The vision that I have for our platform at Tabular is to be that neutral vendor. To be the one looking at your query logs coming in from some engine and saying, hey, if we clustered your data this way, it would save you a whole bunch of money. I don't think the industry has ever had the incentive structure to do that, because we've never had shared data. The database industry for the last 30 years has lived under the Oracle model, and Oracle, I think, is just the most prominent example here: we sell you compute, we sell you storage, and those two things are inseparable. The fact that all of your data is in this silo keeps you coming back for more compute, for upgrades, for the contracts, and it sort of locks you in. And I don't think that's unnatural. It's just been the reality for databases for 30 years. Now, the shared storage model is overturning that.
And so I think there is a really huge opportunity here for us to rebuild the database industry where you don't automatically get those compute dollars simply because someone put data into your database, or wrote with your database. And so I think that it's actually structurally important for the market to try for this separation, for independent storage that tries to make your data work as well as possible with all of the different uses out there. We don't want anyone tipping the scales and saying, you know, this table is going to perform better if you use our streaming solution or our SQL solution. I think you want an independent vendor sitting there that is looking out for your interests as a customer and really helping you find those inefficiencies. Now, I know that some vendors do a great job of making things faster all the time and things like that. I just don't think that the incentives are quite aligned there. And in fact, throughout the majority of the analytic database history, you have largely been responsible for finding your own efficiencies. Oh, you need to put an index on this column. You need to think about clustering your data this way. And you hear horror stories where, oh, we didn't have the data clustered right, and we were spending $5 million too much a year on this use case. I think it is the responsibility of independent storage vendors to find those things and to fix them for you. And so I really think that that could be a valuable piece of the database industry in the future, and that's where we're headed as a company. Not towards compute, but actually being a pure storage vendor.
Sudip: And I'd say this probably has not been tried before at the layer that you are operating at.
Ryan: Yes, at the lowest storage layer maybe, but not at the layer that really sits in between the engines and the storage layer. Definitely have not seen that. It hasn't been possible to share storage. And it's going to, by the way, have a huge impact on other areas. The example I always give is governance and access controls. Where if you can now share storage between two vastly different databases, it no longer makes any sense to secure the query layer instead of the data layer. And so we have all of these query engine vendors that have security in their query layer. Well, that does you no good if you have three other query layers, or if you have a Python process that's going to come use DuckDB and read that data. I mean, it's a very messy world, and I think we're just starting down this path of what does the analytic database space look like if you separate compute and storage?
Sudip: Last question before I ask you a couple of lightning round questions. What is the future roadmap for Iceberg? And when I say future, I mean, let's say in the next 18 to 24 months. What are some of the big things on your mind that you want to tackle?
Ryan: There are a few things. As always, getting better, improving performance, and things like that. One of the larger ones is multi-table transactions. Iceberg is actually uniquely suited to having multi-table transactions, just like data warehouses, so we're going further and further along that path of data warehouse capabilities. We're also coming out with cross-engine views. So you'll at least be able to say, hey, here's a view for Trino, and here's the same view for Spark, and have a single object in your catalog that functions as a view in both places. Now that you're sharing a table space across vastly different engines, we also need to bring along the rest of the database world. Once you're thinking about having a common data layer, you need to expand it a little: you need to expand it to access controls across databases, and you need to expand it to views. Views are a really critical piece of how we model data transformation and different processes without necessarily materializing everything immediately. We're also working on the next revision of the format itself, which is going to include data and metadata encryption, which is a big one, as well as getting closer and closer to the SQL standard in terms of schema evolution. Schema evolution is actually really critical, and this has been coming up lately in CDC discussions. CDC is a process where you're capturing changes, hence "change data capture," from a transactional database, and then using that change log to keep a mirror in your analytic space up to date. If you can't keep up with the same changes from the transactional system, you can't really mirror that data. So schema evolution is extremely important here, because someone might rename a column, and as I already talked about, if you're just using Parquet or something, you just can't rename columns.
If someone renames a column in that upstream database, you need to be able to do the same thing. Iceberg is, I think, ahead of the game in that we handle most schema transitions, but we actually need more. We need more type promotions, we need default value handling, things like that, to really get to feature parity with the upstream systems. Now, these are mostly edge cases, but it makes a really big difference when you're running a system at scale and someone in a different org makes a change to their production transactional database, and now you've got a week's worth of time to reload all that data. So we're working on filling out the spec there and adding some new types, stuff like timestamps with nanosecond precision, and then blobs, I think, as well.
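The reason Iceberg can rename columns safely, where plain Parquet-by-name cannot, is that its schema tracks every column by a stable field ID, and reads resolve through that ID mapping. A toy sketch (the dictionaries and `read_row` helper are invented; real Iceberg resolves IDs against Parquet file schemas, but the principle is the same):

```python
# Toy sketch of field-ID-based schema evolution: the schema maps stable
# field IDs to current names, and data files reference IDs, not names.

schema = {1: "user_id", 2: "email"}          # field id -> current name
data_file_row = {1: 42, 2: "a@example.com"}  # values keyed by field id

def read_row(schema, row):
    """Resolve columns through the id mapping, whatever the names are now."""
    return {name: row[fid] for fid, name in schema.items()}

# A rename only touches the mapping; old data files stay readable, and
# no rewrite of the data is needed.
schema[2] = "email_address"
print(read_row(schema, data_file_row))  # {'user_id': 42, 'email_address': 'a@example.com'}
```

This is what lets a CDC mirror replay an upstream `RENAME COLUMN` as a pure metadata change instead of a week-long reload.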
Sudip: Wonderful. A lot to look forward to then in the next 12 to 18 months. This has been phenomenal, Ryan, so thank you so much for sharing the whole Iceberg story with our listeners. I want to end with one quick lightning round. We do it at the end of all of our podcasts. Just, you know, real quick questions. So the first one is what has already happened in your space that you thought would take much longer?
Ryan: I have been fairly surprised at how rapidly large companies have not only adopted Iceberg, but really put money and effort behind it. AWS is building extensions into Lake Formation and Glue, which is a very significant investment. The work that Snowflake is doing is incredible. We're very happy to have seen that Databricks, even though they back Delta, has also added Iceberg support to their platform, basically acknowledging the fact that Iceberg is the open standard for data interchange. I think I've been pretty shocked there. I would definitely have thought it would take longer for companies to ramp up, but they seem to have really hit the ground running.
Sudip: Fantastic. Second question: what do you think is the most interesting unsolved question in your space?
Ryan: Independent versus tied storage.
Sudip: Tell me more. What do you mean?
Ryan: I have a blog post, The Case for Independent Storage, that summarizes my thoughts here. Basically, the rise of open table formats has created a world in which you can share data underneath Redshift and Snowflake and Databricks and all these commercial database engines. Everyone is rushing, like in my answer to your last question, to support and own data in open storage formats. I think a big question is going to be whether people trust a single vendor, one they also use for compute, to manage the storage that is used by other vendors. That is probably the biggest question. I've gone all in; I've placed all my chips on the table. We think it's going to be independent. The central strategic bet of Tabular as a company is that people are going to want an independent storage layer that connects to and treats all of these engines the same, tries to represent your interests as a customer, and gets you the best performance in whatever engine is right for the task. Is that going to matter in the marketplace? Or are people going to be perfectly happy having an integrated solution, where I buy storage and compute from one vendor and add on another, say, streaming vendor that uses that same storage? Is that the model, or is it going to be independent? I think the transition is towards independence, because it destroys that vendor lock-in. Even if you think today that this vendor for compute is head and shoulders above the rest and is going to be amazing, what about in five years, or in ten? Are they going to get disrupted? Are you going to have to move that data? I'm very bullish on the case for independent storage, but whether it will matter to customers is not something I get to choose.
Sudip: Spark versus Hadoop is a classic analogy there. The compute engine is getting disrupted. Last question: what's one message you'd want everyone to remember today?
Ryan: I would just say go check out Iceberg. It can make your life a whole lot easier. That's what we wanted to do.
Sudip: Absolutely. This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti, and I look forward to our next conversation.
We kicked off our first “season” with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. In our previous episode, we hosted Doug Cutting and Mike Cafarella for a fascinating discussion on Hadoop. In this episode, we are incredibly fortunate to have Reynold Xin, co-creator of Apache Spark and co-founder of Databricks, share with us the fascinating origin story of Spark, why Spark gained unprecedented adoption in a very short time, the technical innovations that made Spark truly special, and the core conviction that has made Databricks the most successful Data+AI company.
Timestamps
* Introduction [00:00:00]
* Origin story of Spark [00:03:12]
* How Spark benefited from Hadoop [00:07:09]
* How Spark leveraged RAM to monopolize large-scale data processing [00:09:27]
* RDDs demystified [00:11:43]
* Three reasons behind Spark’s amazing adoption [00:21:47]
* Technical breakthroughs that sped up Spark 100x [00:27:05]
* Streaming in Spark [00:31:13]
* Balancing open source ethos with commercialization plans [00:37:45]
* The core conviction behind Databricks [00:40:40]
* Future of Spark in the Generative AI era [00:44:39]
* Lightning round [00:49:39]
Transcript
Sudip: Welcome to the Engineers of Scale podcast. I am Sudip Chakrabarti, your host and partner at Decibel.VC, where we back technical founders building technical products. In this podcast, we interview legendary engineers who have created infrastructure software that have completely transformed and shaped the industry. We celebrate those heroes and their work, and give you an insider's view of the technical breakthroughs and learnings they have had building such pioneering products. So today, I have the great pleasure of welcoming Reynold Xin, co-founder of Databricks and co-creator of Spark to our podcast. Hey, Reynold, welcome. [00:00:37]
Reynold: Hey, Sudip. [00:00:38]
Sudip: Thank you so much for being on our podcast. Really appreciate it! [00:00:41]
Reynold: Pleasure to be here. [00:00:42]
Sudip: All right, we are going to talk a lot about Spark, the project that you created and is behind the company that is now Databricks. You went to University of Toronto, is that right? [00:00:54]
Reynold: I did go to the University of Toronto, spent about five years in Canada, and then came to UC Berkeley for my PhD. So, been in the Bay Area for over, I think almost 15 years by now. [00:01:05]
Sudip: What brought you to Berkeley specifically? [00:01:07]
Reynold: It's an interesting point. So when I was considering where to pursue my PhD studies, I looked at all the usual suspects, the top schools, and there were actually two things that really attracted me to Berkeley. One is that there's a very strong collaborative culture, in particular across disciplines. In many PhD programs, the way academic research works is that you have a PI, a principal investigator, a professor who leads a bunch of students, and they collaborate within that group. But one thing that's really unique about Berkeley is that they brought together all these different people from very different disciplines - machine learning, computer systems, databases - and had them all sit in one big open space and collaborate. That led to a lot of research that was previously much more difficult to do, because you really needed that cross-discipline collaboration. The second part was that Berkeley always had this DNA of building real-world systems. A lot of academic research kind of stops at publishing, but Berkeley has had this tradition, going back to BSD UNIX, Postgres, RAID, RISC, and all of that, of building actual systems that have a real-world industry impact. And that's really what attracted me. [00:02:20]
Sudip: And obviously, that's what you guys did with Spark, too. [00:02:23]
Reynold: Yeah, we tried to continue that tradition. So we didn't stop at just the papers. [00:02:27]
Sudip: Yes, absolutely. And I think that's a very common criticism of a lot of academic work, right? Great in quality, but doesn't go that last mile to get to production or get to actual users. [00:02:40]
Reynold: It's not necessarily a wrong thing either, just different approaches. You could argue, hey, let academia figure out the innovative ideas and validate them and have industry productize them. It's not necessarily the strength of academia to productize systems. To some extent, it's just different ways of doing things. [00:02:57]
Sudip: Let me go back to when you guys started Spark, and this is circa 2009. I'm guessing you had just joined the PhD program, and this was still the AMP Lab - Algorithms, Machines, People Lab - right? [00:03:11]
Reynold: Yeah. [00:03:12]
Sudip: Which, of course, is now Sky Lab and was RISE Lab in between. Can you give us a little bit of an idea of what the motivations were to start the research behind Spark? I mean, this was the time when Hadoop was still king, right? Like, why do Spark? [00:03:27]
Reynold: It was an interesting story. Spark actually started technically the year before I showed up; by the time I arrived, there was already this very early, early thing. So Netflix, back in the 2000s, I think even a little before 2009, had this Netflix Prize, a competition they created in which they anonymized their movie rating datasets so anybody could participate and come up with better recommendation models for movies. And whoever improved the baseline the most would get a million dollars. [00:03:58]
Sudip: Yeah, I remember that. [00:04:01]
Reynold: That was a big deal. Eventually, it was shut down for privacy reasons. I think maybe there were lawsuits that happened, but it was a big deal in computer science and in the history of machine learning. And this particular PhD student, Lester, was really into these kinds of competitions, and also a million dollars was a lot of money. [00:04:17]
Sudip: Sure. For a grad student in particular, right? [00:04:20]
Reynold: A grad student makes about $2,000 a month. So he tried to compete, and one thing he noticed was that this dataset was much larger than the toy datasets he used to work with for academic research, and it wouldn't fit on his laptop anymore. So he needed something that could scale out, to be able to process all this data and implement machine learning algorithms. And one of the keys with machine learning is that you are not done when you come up with the first model; it is a continuously iterative process to improve it over time. The velocity of iteration is very important. And he tried Hadoop first, because that was the only thing - if you wanted to do distributed data processing back in 2009, you used Hadoop. And he realized it was horribly inefficient to run - every single run took minutes - and it was also horribly inefficient to write. The productivity of iterating on the program itself was very low, because the API was very complicated and very clunky. So he kind of walked down the aisle - the nice thing about having a giant open space with people from very different disciplines - and talked to Matei, who was also a PhD student back then and is one of my fellow co-founders at Databricks. He said, hey, I have this challenge, and I think if I had these kinds of primitives, I could do my competition much faster. So he and Matei basically worked together over the weekend and came up with the very first version of Spark, which was only 600 lines of code. It was an extremely simple system that aimed to do two very simple things. One was a very, very simple, elegant API that exposed distributed data sets as if they were a single local data collection. And second, it could put, or cache, a data set in memory, so you could repeatedly run computation on it, which is very important for machine learning because a lot of machine learning algorithms are iterative.
So with those two primitives, they were able to make progress much faster for the Netflix Prize. And I think Lester's team even tied for the first place in terms of accuracy. [00:06:24]
Sudip: Did he get the money? [00:06:24]
Reynold: He did not get the money, because their team was 20 minutes late with the submission. So they lost a million dollars over a 20-minute difference. If Matei had worked a little bit harder and had Spark ready maybe 20 minutes earlier, Lester might have been a million dollars richer. [00:06:44]
Sudip: That's such an amazing story. Wow, I actually did not know that! [00:06:48]
Reynold: So Spark started from research - it kind of started for a competition - and really it was the collaborative open space, and the fact that all of those people just happened to be there at the right time, that led to its very, very first version. Now, obviously, Spark today looks very, very different from the original 600 lines of code. But that's how it got started. [00:07:09]
Sudip: One question I have since you touched on Hadoop, do you feel that Spark benefited from Hadoop being already there? Did you guys use some components of the Hadoop ecosystem? Like for example, if Hadoop hadn't existed, do you think Spark could have still been created? [00:07:23]
Reynold: Spark definitely benefited enormously from Hadoop early on. There is also baggage that we carry from Hadoop that is still there even today. But it benefited massively. The first example is that Hadoop solved the storage problem for Spark. As a result, Spark never had to deal with storage; Spark more or less considered storage a commodity. To a large extent, organizations were able to store a large amount of data reliably and cheaply, which was key. [00:07:56]
Sudip: And this is HDFS in particular? [00:07:59]
Reynold: HDFS, yeah. And later on, HDFS largely faded and got replaced by object stores. But Spark never had to worry about how to store a large amount of data, and that was very, very important. And Spark piggybacked onto the Hadoop deployments - all the initial deployments of Spark were onto Hadoop clusters themselves. So the existence of those clusters made it easier, because if the hardware resources were not there, that would also have been very problematic for any user of large-scale data processing systems. And Spark leveraged a lot of the Hadoop code itself, especially the storage layer, data retrieval, and the data formats. So Spark definitely benefited enormously from it. At the same time, it's a lot of baggage, unfortunately. There's no free lunch. [00:08:45]
Sudip: Right, right. And we actually had Doug Cutting and Mike Cafarella on the previous episode here, and Doug was talking about how he fully anticipated Hadoop eventually evolving and being replaced by a more advanced system. So it's sort of generations of advancement that happened. One of the key game-changers behind Spark, at a very high level, was its use of RAM - keeping data in memory. Can you talk a little bit about how Spark uses memory and what the benefits are? Obviously, there are benefits in speed and iterative computation. And one other question I'll ask at the same time: why was it so difficult for Hadoop to adopt in-memory processing? [00:09:27]
Reynold: This is actually a very complicated topic, but let me try to explain it in just a few minutes. There are two places where Spark uses memory in a fairly clever way that Hadoop didn't, and those really led to the dramatic improvement. One is the ability to simply keep the data in memory. As we said earlier, that's very important for any kind of iterative computation in which you want to repeatedly scan the same data, so it was critical for machine learning workloads. But the same primitive is also very useful for interactive data science, because you're often looking at the same data sets over and over and over again. Those happened to be the first two killer use cases of Spark. The second place is a little bit less about the direct use of memory: Spark exposes a fundamentally much higher-level abstraction than Hadoop. Hadoop is very simple - it's Map, Shuffle, Reduce - MapReduce. It's a very simple paradigm. There's no concept of joins and no concept of filters; you basically create filters yourself and map them back to Map and Reduce. Spark exposes a higher-level abstraction that has the concept of filters, joins, cogroups, and all of this. As a result, it can chain a more complex computation into DAGs - directed acyclic graphs - of tasks. And as part of that, it knows, for example, hey, if you're running a Map right after a Filter, you don't have to persist the output of the Filter onto disk or onto HDFS and then read it back in; it can just stream records through. So this particular optimization removes the need for data to go to disk repeatedly in a larger computation. It's a little bit less about just memory; it's a combination of streaming data through memory and having the ability to express a more complex computation graph. [00:11:23]
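The pipelining Reynold describes can be illustrated in plain Python, without Spark. This toy sketch contrasts a MapReduce-style pipeline, where each stage materializes a full intermediate dataset, with a fused single pass; the function names and data are invented for the example.

```python
# Toy contrast of stage materialization vs. pipelining. In a
# MapReduce-style model each stage writes out its full result before
# the next stage runs; with a DAG, the planner can fuse a Filter and a
# Map into one pass that streams records through memory.

def materialized(records):
    """Hadoop-style: each stage produces a complete intermediate dataset."""
    filtered = [r for r in records if r % 2 == 0]  # stage 1, "written out"
    mapped = [r * 10 for r in filtered]            # stage 2, "read back in"
    return mapped

def pipelined(records):
    """Spark-style: fused stages stream record by record, nothing staged."""
    return [r * 10 for r in records if r % 2 == 0]  # one pass

data = range(10)
assert materialized(data) == pipelined(data)  # same answer, no staging
print(pipelined(data))  # [0, 20, 40, 60, 80]
```

In the real systems the "written out" step means HDFS or local disk, which is why eliminating it between stages was such a large win.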
Sudip: So if you had, say, three stages in that DAG, Hadoop would've written the results after every stage to disk and read them back, whereas Spark can keep it all in memory because it knows the DAG? [00:11:36]
Reynold: It knows it's the same thing. I mean, it has to do with having the completeness of the computation rather than only Map and Reduce. [00:11:43]
Sudip: Right. And then one of the primary concepts behind Spark is what is called RDD, Resilient Distributed Datasets. Can you talk a little bit about that for someone who might not be familiar? [00:11:54]
Reynold: I think maybe from two perspectives, one's from a system perspective, the other one's from the user's perspective. And I think the user's perspective probably should go first. The brilliant thing about RDD is that distributed computation used to be super difficult. And if you think about message passing, Hadoop made it slightly simpler to have the MapReduce concept. RDD took it much further and basically said, hey, if you have a large dataset, the way you program large datasets should just be like how you program a collection of data in memory on a single node. If you were to write a Java program, a Scala program, a Python program, everybody knows what a list is, an array is. They're all collections and there are ways to transform collections. Maybe, the way we should be programming large datasets should be identical to that. And that really means the API now for programming a distributed program against a large amount of data is as if you're just manipulating some data that's on a single node. So that dramatically decreased the complexity in the API surface. Now, the second big innovation in RDD is this. One of the big things with distributed computation is, hey, you have all those machines, you might have thousands of machines that could fail. How do you deal with failures? So RDD, while the user-facing API exposes just a bunch of collections, internally, it creates a lineage for every collection. So it doesn't literally materialize when you say, hey, this collection is just a filter on the previous one. Instead of materializing the collection, it is actually lazy. It just tracks, hey, this collection is simply formed through doing a filter on the previous collection. So it gives you that lineage of how you create the datasets. And again, when you really need the result, for example, you want to output the data, you want to get back how many rows there are, it would trigger this computation graph. 
And if there's a failure on any of the machines, it runs something very simple - it analyzes the computation graph and checks, hey, so what are the downstream, upstream dependencies? If that node fails, I just need to reproduce the data on that node. So it creates a minimum plan to reproduce the partial dataset on that node to handle failures. And it does that all gracefully, without the user having to know anything about it. So that's really the two brilliant parts of RDD. The first is, it creates a new programming paradigm for distributed data processing, which makes you program basically single-node collections. And the second is all the underlying system techniques to make fault recovery work really well. [00:14:29]
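The lazy, lineage-based design Reynold describes can be sketched in a few lines of plain Python (a hypothetical `ToyRDD`, not Spark's actual implementation): each dataset records only how it was derived from its parent, and an action replays that lineage, which is also how a lost partition would be recomputed.

```python
# Toy sketch of RDD-style lazy lineage. Nothing is computed until an
# action (collect); any lost result could be rebuilt by replaying the
# same lineage chain from the source.

class ToyRDD:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # lineage: where this dataset came from
        self.transform = transform  # how to derive it from the parent
        self.source = source        # base data (leaf of the lineage graph)

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Action: walk the lineage back to the source and replay it.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

base = ToyRDD(source=range(5))
derived = base.filter(lambda x: x >= 1).map(lambda x: x * 10)
print(derived.collect())  # [10, 20, 30, 40]
```

Note that constructing `derived` does no work at all; only `collect()` triggers the computation graph, exactly the laziness described above.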
Sudip: Got it. And then in 2013, you guys introduced DataFrame, and then in 2015, you introduced Datasets. So what are those APIs and how do they connect or relate to RDDs? [00:14:43]
Reynold: So in 2014, we introduced DataFrames. I remember because in 2015, at the Strata conference, which is a big conference, I gave a talk about DataFrame GA. After we started Databricks, we started working a lot more closely with the Spark users. I mean, we always worked very closely with Spark users, but after the company started, we no longer had the academic research to worry about. [00:15:08]
Sudip: No paper to write. [00:15:10]
Reynold: No paper to write. No exams to take. No courses to take. And then we realized at some point that, even though there's a lot of unstructured data and semi-structured data out there, at some point people introduce structure onto their data. Structure could be, hey, here's an array of floating point numbers for my machine learning vectors. Structure could be, here's a column called email description, which is a pile of text, right? Those are all structures. Probably like 95% of the programs become some sort of loosely defined structured programs. And the collection abstraction, while very powerful, is still not high-level enough for structured programming. And that also involves a lot of user-defined functions. Like imagine if you're programming in Python, you want to traverse a list, you want to do something to it. You write a lot of code to say, for example, let's do a for loop across the list. And then for each of the elements, you try to compare it to, say, the number one. If it's one or greater than one, you keep it. If it's less than one, you ignore it. There's a lot of code you're writing in Python. The problem with that code is that the system cannot optimize it, because it is Python code. It is Python code with very strong Python-specific semantics that we can't do anything about. But we do know that often people are just doing very basic comparisons. They're doing very basic expressions that exist in a more structured context. So the reason we created DataFrame was twofold. One is we wanted to raise the level of structure in the API even higher, to make structured programs easier to write. So users have to write less code. The second is we wanted to be able to capture more and more of the semantics of the computation. So instead of the user writing a lot of user code in Python, they would just express what they want to do in a DSL, in Python still.
But this time it's the DSL that tells us the semantics of all those computations. And then we can optimize that under the hood. As a matter of fact, we did. Before Spark 2.0, Python was probably 5 to 10 times slower than any JVM language like Scala and Java for Spark users. And even today, if you Google or ask ChatGPT, it's very likely ChatGPT will tell you that if you use Scala, you get better performance with Spark. And the reason for that is not because Spark itself behaves very differently. It's simply because if you had used Python before, we had to run a lot of your code in Python, and the Python code would inherently be slower. But with DataFrame, we're able to capture the semantic information and actually generate an execution plan that, regardless of what language you use, will be the same execution plan, and we'll optimize it and make it run faster and faster over time. And that was incredibly powerful. And it basically got Python to have exactly the same performance as the JVM languages. That's a big deal, because these days probably 80% of the Spark users use Python. [00:18:27]
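The contrast Reynold draws - an opaque Python function the engine must run as-is, versus a DSL expression whose semantics the engine can see - can be sketched in plain Python (a hypothetical mini-DSL, not PySpark's API):

```python
# Toy contrast between an opaque UDF and an inspectable expression.
# Both compute "keep values greater than 1", but only the second form
# exposes its meaning to the engine.

data = [0.5, 1.0, 2.0, 3.0]

# Opaque UDF: the engine sees only a Python callable; it cannot push the
# predicate down, vectorize it, or re-implement it in a faster language.
keep = lambda v: v > 1
udf_result = [x for x in data if keep(x)]

# Declarative expression: the predicate is data (">" with operand 1), so
# an optimizer is free to compile or rewrite it however it likes.
predicate = (">", 1)

def run_expr(rows, expr):
    op, operand = expr
    if op == ">":
        return [r for r in rows if r > operand]
    raise NotImplementedError(op)

expr_result = run_expr(data, predicate)
print(udf_result, expr_result)  # both are [2.0, 3.0]
```

This is why the same DataFrame program compiles to the same execution plan whether it was written in Python or Scala: by the time the engine runs it, the user's intent is data, not opaque code.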
Sudip: Right. I mean, it's a language for data scientists. [00:18:30]
Reynold: Yeah, exactly. And then the Dataset API came, although I actually would not recommend people to use the Dataset API. In hindsight, I would not have created the Dataset API. [00:18:40]
Sudip: I see. [00:18:41]
Reynold: So Python is weakly typed, dynamically typed. Scala is statically typed. And a lot of Scala programmers really love type information because it's powerful. One of the things with DataFrame is that it is dynamically typed. So even when we declare a DataFrame in Scala, it doesn't actually know what exactly the type of that DataFrame is, what the columns in it are. All of those are dynamically generated. So Dataset was our attempt - and I think we did our best job out there - to create a statically typed program that allows you to bind at runtime but gives you compile-time safety throughout the program, so you would know, hey, this is actually a Dataset of, for example, a student. And student is a class with all these fields. And we give you the compile-time safety, with a final validation: when you read the data in, the first time you create a Dataset of student, we actually run the validation to see that the data actually has all the fields you need and how it maps to the student class. The reason I said that in hindsight I would not have created it is that, as it turned out, it only served a very small percentage of users. And second, it's extremely complicated. The most complicated pieces of code in Spark are, one, the scheduler - schedulers are always complicated. Second is all this type mapping and static-type, dynamic-type binding stuff. That code is super hairy and very error-prone, and very few people know how to change it. So a lot of investment and high technical complexity for a very small percentage of users. It is a very cool idea in the abstract, though. [00:20:21]
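Spark's real Dataset API lives in Scala with compile-time types, but the core idea - bind untyped rows to a declared class once, validating the schema at that boundary - can be sketched as a loose Python analogy using dataclasses (the `as_dataset` helper here is hypothetical, not a Spark API):

```python
# Toy analogy for the Dataset idea: validate rows against a declared
# class at the read boundary, then work with typed objects afterward.

from dataclasses import dataclass, fields

@dataclass
class Student:
    name: str
    grade: int

def as_dataset(rows, cls):
    required = {f.name for f in fields(cls)}
    for row in rows:
        missing = required - row.keys()
        if missing:  # the "first read" validation Reynold mentions
            raise ValueError(f"row is missing fields: {missing}")
    return [cls(**{k: row[k] for k in required}) for row in rows]

students = as_dataset([{"name": "Ada", "grade": 1}], Student)
print(students[0].name)  # Ada
```

In Scala the payoff is that the compiler, not just a runtime check, catches field mismatches; that extra machinery is exactly the type-binding complexity described above.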
Sudip: That's a fascinating story. Talking about programming language, one quick question I had was, the choice of programming language for Spark was Scala. Why was that? [00:20:31]
Reynold: Every month some new engineer joins Databricks and asks, why do we use Scala? It was actually an easy choice back then. In 2009-2010, because the Hadoop ecosystem was in Java, in order to be able to leverage most of the Hadoop ecosystem, Spark needed to be on the JVM. At the same time, we really wanted something that could be interactive. We felt that the interactive experience was extremely important. Java was not interactive. Even today, Java is not interactive. There's no way you just hand-type Java without an IDE. It's not concise. It is super verbose. So Scala was identified as a language that was basically much more concise than Java, with a REPL that we could actually hack. So now somebody could just interactively, on the command line, start issuing a one-line Spark command and run it across a thousand machines in the cloud. And there was no other language that fit the requirements of being on the JVM and being interactive. [00:21:33]
Sudip: That actually clears up a big question for me I've always had, why Scala? [00:21:39]
Reynold: I think Spark and Scala had a symbiotic relationship, and really a lot of Spark users came to Scala because of Spark. [00:21:47]
Sudip: Yeah, exactly. A lot of people learned Scala, including myself, just to use Spark. Now, shifting gears a little bit, Reynold, one of the things that obviously we all have witnessed over the last 10 years is the amazing adoption of Spark. Now that you are looking back and it's 2023 now, looking back at the last 10-12 years of Spark history, if you were to pick, let's say, three factors that you think were the most important in driving this adoption of Spark, what do you think those would be? [00:22:13]
Reynold: The first and foremost would be, I think, the focus on the end user. Everything I talked about so far, I always started with a level of abstraction to make it much simpler to program. These days I don't need to evangelize Spark anymore because it's everywhere. But one thing I often say is that many of you will come because of the performance improvement - because you've heard that you can make your program 10 or 100 times faster than Hadoop - but you'll really stay for the programmability. You'll never go back to a MapReduce program once you program Spark. We didn't end with just the RDD API. We have continued pushing the boundary, introduced streaming APIs, introduced DataFrames and all this. And they're all about how we help the end users be a lot more productive, make the APIs more expressive for the tasks they are supposed to do. And that focus on end-user programmability is key.
The second one would be a culture of innovation which is not surprising because a bunch of us came from academia and really wanted to apply bleeding-edge ideas to a real-world system and see how we could improve it. So Spark brought a lot of innovative ideas that were never ever done in other systems or only done in very niche systems. But, Spark brought those to the masses.
The third one I would pick is that, unlike Hadoop and many other systems in the past, Spark had a batteries-included approach, which means, hey, what are the most common things people want to do? Let's make them doable out of the box, instead of having another extension framework or project you have to go download, install and configure in the system. We have had, for the longest time, all the popular data formats. You could just use the built-in APIs to read them - popular formats for distributed computation, not necessarily for local stuff, because the focus was on large-scale data sets. And we added the whole machine learning library to Spark: if you want to run logistic regressions, just run it out of the box, you have it. [00:24:17]
Sudip: And you are talking about MLLib here. [00:24:19]
Reynold: Exactly. All of this made Spark much more powerful because one of the big pain points for a lot of users is having to configure a system with dependencies and app frameworks and extensions. Whereas in Spark, you install it, and now you have all of this power. [00:24:37]
Sudip: And going back to the usability piece, I mean, I like how you put it: come for performance and stay for usability, right? I believe one of those driving factors was when you guys added SQL support. You, in particular, I believe, were behind the original Shark project and then the Spark SQL project. I'm just curious, at a high level, was there any particular technical challenge that you had to resolve to bring SQL into a distributed data processing system? [00:25:09]
Reynold: Over the years we have done it and redone it. Actually, Shark refers to what I did during my PhD, before Databricks even started. The way Shark was architected was that it basically took Hive, which is the SQL-on-Hadoop system. We took the physical plan generated by Hive and converted it into a Spark program. And then we were able to run SQL somewhere between 10 to 100 times faster than what Hive could do back then. One of the main challenges there was really the Hive codebase, which was potentially the most spaghetti codebase I've ever seen. I'm hoping I'm not offending anybody here, and it probably had nothing to do with the creator. I think it's just because it was created at Facebook for a specific set of use cases and then it suddenly got insanely popular. A lot of use cases and different requirements got piled onto it, and people added a lot of code very quickly. It was so bad that when Databricks first started, in the first year one of our founding engineers actually quit after working on this codebase for a few months, and that was the sole reason given. And up until this day, when I talk to him, he's like, yeah, honestly, that was the real reason. That codebase was so difficult to work with. So when Michael Armbrust came to Databricks in early 2014, initially he was actually hired to build a query optimizer for large-scale data. At some point I walked up to him and said, hey, you should kill this thing I created. This nine-headed monster is impossible to deal with and the code is too difficult to maintain. I think if we start from scratch and build it with nothing to do with Hive, it would be substantially simpler. It would be much easier to actually evolve. And he did. And that became Spark SQL, which is honestly much easier to set up, much easier to maintain, much easier to iterate on. [00:26:55]
Sudip: So you basically ended up killing your own creation. [00:26:57]
Reynold: Yeah, a lot of people didn't expect that, but I was jumping for joy over it. Some people were like, are you happy that it happened? I'm like, yeah, it's incredible that it happened. [00:27:05]
Sudip: That's so funny. And then going back a few years, talking about the other thing that kind of led to adoption of Spark, I remember back in, I think 2015 or somewhere around that, you guys made a number of key performance improvements to Spark Core. Can you talk a little bit about what those improvements were and what were some of the major architectural changes you guys had done? [00:27:31]
Reynold: So 2015-2016 was a big year in terms of improvement, and that's when we released Spark 2.0. And the claim at the time was that Spark 2.0 was an order of magnitude, 10 times, faster than Spark 1. It might not be exactly 10 times for every workload, but it was dramatically faster. And a lot of it had to do, first of all, with the DataFrame API. We talked about how the DataFrame API gives an even higher level of abstraction and conveys semantic information so we can optimize under the hood. And we fully leveraged that in Spark 2.0. So Spark 1 released the DataFrame API, raised the level of abstraction, and enabled us to do the optimization in Spark 2.0. Those optimizations basically boiled down to two big things. One is we hyper-optimized for Parquet and actually created the first vectorized Parquet reader. Parquet itself is a columnar format, but parquet-mr at the time was basically implemented row-wise. It was not fully extracting the performance out of a columnar format. So we implemented a vectorized Parquet reader and got a massive speedup just in terms of read performance. And then for the entire query execution engine, we took this idea that existed only in academic papers and academic systems, called whole-stage code generation, and put it into Spark itself. The idea is we would take a query plan and generate the actual code that's needed to execute that query, as if you're building a query engine purpose-built just for executing that one query. And the reason that is much more efficient is that a generic system has a lot of overhead, because it has to be generic. For example, it has a lot of function calls: query engines in particular have the concept of an iterator model, in which you chain a bunch of iterators, each sub-operation is an iterator, and every iterator call - when you say next - generates a virtual function call, which is very slow.
Then the compiler is not great at optimizing it, because those are complex programs. But if you, for example, have a simple program that does an aggregation - if you were to purpose-build a program to do that, you would just write a for loop from the beginning of the data to the end of the data, sum something up into a local variable, and then return the local variable. It would be a three-line program. Compilers are amazing at optimizing that. So we basically treated Spark itself as a compiler that compiles SQL queries into actual purpose-written code for executing those SQL queries. And SQL here doesn't just mean SQL. A big, key idea in DataFrame is that DataFrame is not very different from SQL, and SQL is not very different from DataFrame. They all generate similar query plans under the hood, and once you generate that purpose-built code for a query, you compile it with the JVM, and the JVM is much better at optimizing it. So vectorization and CodeGen combined got us a dramatic speed-up. By the way, we didn't invent either of these. In academia, vectorization always existed, but it was never built into an open-source SQL system. This kind of CodeGen was pioneered by Thomas Neumann at TU Munich, in an academic system called HyPer, which was eventually commercialized and bought by Tableau. So we took that idea and really drove it into adoption. Probably more people benefited from Thomas Neumann's idea through that work than through the HyPer system itself. [00:31:13]
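The code-generation idea can be made concrete with a toy sketch in plain Python (Spark's actual codegen emits Java, so this is only an analogy): given a "query plan" for sum(x) where x > threshold, emit the specialized three-line loop as source text and compile it, instead of interpreting a chain of generic iterators.

```python
# Toy sketch of whole-stage code generation: a purpose-built function is
# generated for this one query, with the threshold baked in as a constant,
# so the compiler can optimize a simple tight loop.

def generate_sum_where_gt(threshold):
    src = (
        "def q(rows):\n"
        "    total = 0\n"
        "    for r in rows:\n"
        f"        if r > {threshold}:\n"
        "            total += r\n"
        "    return total\n"
    )
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["q"]

query = generate_sum_where_gt(1)   # purpose-built code for this one query
print(query([0, 1, 2, 3]))  # 5
```

The generated function has no virtual calls between operators; the filter and the aggregation are fused into one loop, which is the whole point of the technique.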
Sudip: I want to talk a little bit about streaming. There was Spark Streaming and I believe now you guys have Structured Streaming, right? I'd love to understand a little bit what is the difference between those two and then one other related question is, originally the way I know streaming was implemented in Spark was through micro-batches, right? Is it still the case and how is Structured Streaming different from the original idea? [00:31:37]
Reynold: Incidentally, the relationship between Structured Streaming and Spark Streaming is very similar to that of DataFrames and RDDs. It's all about raising the level of abstraction. Spark Streaming was a fairly innovative thing because it basically came with this insight: if you just run batches that are small enough and fast enough, at some point you approximate streaming. Taken to the extreme, you run a batch for every row, which is basically one-row-at-a-time processing. And that was pretty popular because it opened up a whole new class of workloads that previously people thought would be very, very difficult and would require super specialized systems. With a lot of IoT devices, sensor data, and message buses becoming more popular as Spark was growing, streaming workloads started growing too. There was a very important problem with Spark Streaming, which was actually not about micro-batches. I personally think the big micro-batch versus true streaming debate is overblown. The biggest problem with Spark Streaming was that the streaming window was completely tied to the physical batch size. So a physical property of the execution leaked into the programming abstraction. The most common operation in streaming is windowing. If you want to window by, for example, a bunch of records, in Spark Streaming the design was that you would actually run that batch as a window. And this really limits a lot of optimization opportunities and also limits the programmability. It also limits another very important thing, which is late-arriving data: now you can't deal with it because you're already done with that batch. Your data shows up later and cannot be considered part of that window anymore. So the time and everything has to be physically tied to the way it was executed. So after we did DataFrames, we started thinking about how to improve all those issues of streaming.
One very obvious one was that many people want a concept of time that's logical instead of physical. Physical meaning whenever the event showed up at the system, that's its time. Logical means the event has some property called time - maybe a column or a field - and that is the actual time. So we thought that for streaming, let's completely decouple the execution from the API and just think about how you would program a streaming job. We looked at a lot of streaming jobs and realized the intent people express, and the transformations people express, were not very different from a regular DataFrame program, except they want to run it in a continuous mode. And for any data that comes in, they want it to run, instead of just running once. So we came out with it - I think it was 2016 or 2017 - I remember giving the announcement at Spark Summit, and I said that the easiest way to do streaming is to not have to think about streaming at all. So Structured Streaming basically introduced the concept of streaming DataFrames. There's no separate API for streaming. It is just a DataFrame. The only difference is how you create the DataFrame. If you created a DataFrame using a streaming reader - because it's coming from some message bus, or even just a pile of files that continuously arrive on object stores - you have a streaming DataFrame. And you can run the same operations as on your normal batch DataFrames. Everything else is the same. And that really made it much simpler to program streaming, because now people don't have to learn a new paradigm. They don't have to think about, hey, what is windowing? Well, it turns out windowing is a group-by over some time column. So that was very, very powerful. And that also led us to realize that a lot of people, when they do streaming, are not actually trying to do things in super real time.
What they really, really want is the incremental-ness that happens in streaming as data flows in. Because virtually every data pipeline is continuous - they always have data coming in. It might come in once a day or once a week, but there's always new data coming in. It's very rare that you have a data pipeline you run once and never worry about. Now, as data continues coming in, you have to worry about not reprocessing all of your data all the time, because that's highly inefficient. So people invented their own ways of doing incremental processing. And it turned out the number one use case for Structured Streaming was people just using it for incremental processing, because now they no longer have to worry about the state. The funniest thing is that they would build a streaming pipeline, they would run it, and it processed the data. The data is actually coming in, for example, once a week. So if, after half an hour, they finish processing their data, they'd actually shut it down manually. And then a week later, when there's new data coming in, they rerun their streaming pipeline. And because it's fault-tolerant and because it tracks all the incremental state, it just does this whole end-to-end incremental processing. Because of that, we introduced a concept called streaming once, which is literally that when you run the streaming pipeline, it finishes processing all the data it currently sees, and then it shuts down. And the next time you want to do it, you relaunch it again. That alone probably drives hundreds of millions of workloads just on Databricks today. People are running streaming in a batch mode. [00:37:07]
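The run-once, incremental pattern Reynold describes can be sketched in plain Python (a toy checkpoint, not Structured Streaming's implementation): a checkpoint records which inputs were already processed, so each run handles only new data and can then shut down until more arrives.

```python
# Toy sketch of incremental "run once" processing: the checkpoint tracks
# processed files, so rerunning the pipeline later picks up only new data.

checkpoint = {"processed": set()}
results = []

def run_once(available_files, read):
    # Process only files not seen in a previous run, then "shut down".
    new_files = [f for f in sorted(available_files)
                 if f not in checkpoint["processed"]]
    for f in new_files:
        results.extend(read(f))
        checkpoint["processed"].add(f)
    return new_files

storage = {"day1.json": [1, 2], "day2.json": [3]}

first = run_once(storage.keys(), lambda f: storage[f])   # processes both files
storage["day3.json"] = [4]                               # new data arrives later
second = run_once(storage.keys(), lambda f: storage[f])  # only the new file
print(results)  # [1, 2, 3, 4]
```

The engine owns the state; the user just reruns the same pipeline whenever new data lands, which is exactly why this became the number one Structured Streaming use case.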
Sudip: Exactly. Yeah, that's what I was going to say. It makes it so much easier. But it's a unified programming interface. It's a unified understanding. And I'm sure it also made it easier on the engine side, right?
Reynold: Exactly, the unification. Yeah, because we don't have to build so many different engines. Another question is, does it still run in micro-batch? There are actually different modes today. Certain things run in micro-batch - for example, obviously, the streaming-once mode is just a giant batch job, except it does all the incremental tracking. There are also continuous modes in which it actually processes data one record at a time, in which you can get from records coming in to records going out in milliseconds. [00:37:45]
Sudip: I want to shift gears just a little bit and go to like, you know, circa 2013 when you guys started Databricks, right? And then you had this really fast growing open source community, which was Spark, and then you had this very early company, Databricks, right? And one of the challenges a lot of open source creators have, particularly when they start a company, is how do you balance between the open source community with the commercialization effort? How did you guys manage to do it? Were there certain guiding principles that you guys had that really helped you? [00:38:21]
Reynold: It's difficult. I think Ali, the CEO of Databricks, often jokes that we should never start another company based on an open source project. The reason is that you need to succeed twice, right? Normally, to start a company and have it do well, you focus on the business problem of the company, and with all the stars aligned and you're lucky and you work super hard and you have an amazing strategy, it works out. And that is a very difficult problem. If you add open source to it, now you have a two-step process. You have to have the open source project take off and do well, and that itself is not a trivial thing. It's probably a little bit easier than building proprietary software, but it's not a guarantee. Most open source projects don't work out. And then after that, you need an amazing business model and a great strategy. So the actual end-to-end success requires you to multiply the two probabilities, which makes it very low. That's probably the reason there are very few companies with heavy open source roots that have been successful. One thing we focused on a lot is that we have different teams doing open source, and they have different mandates. The open source teams' KPIs are about adoption metrics of the open source projects. We always open source everything at the API level. We don't want to lock customers in with some proprietary API that makes it super difficult to migrate out if they ever need to. And we focus a lot on evangelism of the open source project. We try to walk a very careful line. For the longest time, the Spark Summit was called the Spark Summit, and there was very little Databricks content in the first-day keynote. It was all about open source. That changed as time went on and as we renamed the event. There was a lot of AI content, so a lot of things have changed, but those are a lot of the things we did early on.
It's about creating a cleaner delineation, both in terms of organizational structure, KPI tracking, events, and all that. But it's not an easy task. Actually, if we were to redo it again, I'm not sure we could. [00:40:21]
Sudip: I'm laughing because that's coming from the founder of the most successful company in the open source ecosystem, right? [00:40:29]
Reynold: I know a lot of people look at Databricks and say, hey, there's a great business to be made around open source. I'm not sure. I mean, it's not that it's not doable, but it's not an easy job. [00:40:40]
Sudip: When you guys started Databricks, I remember 2014-15 time frame, I used to go to some of the board meetings. I remember there was a pretty heavy debate about should Databricks stay all cloud, or should Databricks go and support an on-prem version of Spark? And there was a lot of customer pressure because Spark adoption was increasing. [00:41:03]
Reynold: Everyone wanted to pay us - like ten million dollars for Spark. [00:41:05]
Sudip: And there was a lot of competitive pressure. Cloudera was really going strong at the time, Hortonworks was around. But you guys had a very deep conviction that it was either Databricks cloud or nothing. Can you talk a little bit about where that conviction came from? [00:41:23]
Reynold: Yeah. I mean, in a sense, we're really glad that we didn't go on-prem and become a support company. It ultimately comes down to the longer-term vision of where we see the puck going, and we try to go towards that rather than capturing what's right in front of us. And that's easy to say, because there are also scenarios in which people think too long term and they're dead before the long-term future and vision can even manifest. But I think one of the things that really got us there actually had to do with Berkeley. At Berkeley in 2009-2010, there was this very famous paper called the Berkeley view of cloud computing. [00:42:05]
Sudip: Yeah, above the cloud, right? [00:42:07]
Reynold: The view of the cloud. There are variants of the title - the initial title, and later just a view of the cloud. But that paper is probably the most cited technical paper on cloud computing. It was so popular that many business schools incorporated it into their classes. My wife, for example, went to business school and read that paper - not because I showed it to her, but because it was actually part of a class. And that paper predicted that the vast majority of compute and computing infrastructure would move to the cloud, and that computing as a utility would finally come true. Very few companies and houses have their own generators. They just use electricity from the grid, right? It's something that's reliable enough, something that's viewed as virtually infinite, and the economics simply make sense. There are niche use cases, and those will never go away, but the vast majority don't think about it much. And that paper predicted that future by analyzing not just the technical foundations, but also what kinds of resource allocation would be possible, and the economics and the accounting and all that. It had a pretty profound influence on the way we think about all this at Databricks. And I think some - maybe not everybody, but a couple - of the Databricks founders were also involved in that paper. So we always thought, hey, we really wanted the ability to release software super quickly. We wanted the ability to provision and get a POC going in a matter of days instead of a matter of months, because otherwise you have to go procure the hardware. We always thought a lot of the complexity of software, especially infrastructure software, is in the operation of it, not just in the building of it. And all of those are enormous values we can create, but we can only do that in the cloud.
And we were so early on at the time that we just viewed, and even today we kind of view, most software in our stack as pretty broken. We need to continuously improve it. Velocity is key. And having simplified environments, not having to worry about the 20 different variants of Linux, 50 variants of IP whitelisting and firewalls and all that, would be enormously beneficial. So that's kind of what got us started. [00:44:27]
Sudip: And clearly, it paid off so well for you guys. [00:44:31]
Reynold: But it is difficult, because every time we hire an exec, it's like, hey, I have a great idea to increase our revenue by 10x. [00:44:39]
Sudip: Yeah, go on-prem. Before I switch to lightning round, one final question about Spark. What's next for Spark? What's in the future? [00:44:49]
Reynold: I think the API is actually pretty good. We've been doing a lot of incremental refinement. For example, one of the biggest complaints about Spark is that when you use Python, the error messages are simply nonsense because they include JVM stack traces and all that. And we actually spent a lot of time improving those, to the point that you could probably still see it, but most of the time you don't even feel there's a JVM running the Python program. So there are all these sharp edges we're trying to remove. Ultimately, they're not big fundamental ideas, but they're the ones that create friction and get in the way of everyday users. So a lot of work goes into that. With a lot of the GenAI use cases (it's 2023, everybody has to talk about GenAI), we have noticed a lot of Spark programs being generated by ChatGPT and all the other tools. I bet there will be more Spark programs generated by machines than by humans in the next year. And most of them don't follow good practices, because many of them were generated by models trained on a giant corpus of Stack Overflow and whatever is on Reddit, mostly based on malpractices from the past. Some were written before certain new things were introduced and were never updated. So we're working on this thing called the English SDK, which, if you think about it, is really just how we teach ChatGPT, with the right prompting, to generate best-practice Spark as opposed to malpractice Spark that some other human expert then has to come in and fix. Things like this would make Spark's adoption go far wider and really benefit a lot of users who previously just felt Spark was too daunting. It can be somebody who is reasonably technical: they generate a Spark program in ChatGPT and realize, oh, it crashes on this data. But it turns out that with a better chatbot, it will actually generate things that don't crash. We want to get to that point. 
I think another very big opportunity for Spark is that I fundamentally believe the biggest innovation of Spark is not performance, and it's not just the use cases, but rather something that's very old in data, which is data engineering. In the old-school days, people used ETL tools like Informatica and all of that. And later there's the Modern Data Stack that got super popular in the last couple of years, and people write SQL queries again. There's something fundamentally wrong with SQL for data engineering. That's a hyperbolic statement, but what's fundamentally wrong with SQL? SQL was not designed with engineering practices in mind. You cannot easily test a SQL query. There are no abstractions in SQL. You could use recursive common table expressions, but again, it's just a pile of text. There are no variables, no for loops, no functions, no classes, no CI/CD framework. The key word is engineering. Engineering requires a sense of rigor, and rigor is backed by fundamental engineering principles. And what are the fundamental engineering principles? They are abstractions. I'm talking about software engineering here, right? They are abstractions. They are testing. They are CI/CD. They are how you roll out. And SQL is just terrible for all of those. I love SQL. I did my PhD in databases. I spent a decade optimizing, figuring out how to build systems to run SQL better. But we actually have a solution in front of us. It's real programming languages that can do everything SQL does, which is actually the Python Dataframe API. If you compose exactly the same program in the Python Dataframe API as in SQL, it looks almost the same in terms of readability. You can now actually test it using just vanilla Python code. You can decompose your program, because it doesn't have to be a pile of text. You can have multiple files backing it; each file is a Python file. You can have classes. You can have functions. All of these are great tools. 
By the way, it's all just off the shelf and available because it's Python. People build far more complicated Python programs than the most complicated data engineering programs. So certainly you could use that. We haven't done a good enough job of explaining to the world the value of Python. Python is now very popular in data science and machine learning, but many of the SQL folks don't even know Python. We haven't really done a good enough job of educating them. I think one of the most important things Spark can get right is to tell the world, hey, here's how you can do data engineering. And by the way, it's not that hard. It's just vanilla Python, and the Dataframe API is just the equivalent way of writing SQL. As a matter of fact, everything you're familiar with in SQL translates here, except now you have all the toolkits to do serious engineering. [00:49:39]
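Reynold's testability point can be sketched in a few lines. This is a minimal illustration in plain Python, standing in for the PySpark DataFrame API (the function names here are illustrative, not Spark's actual API): the same query expressed as composed, named functions can be unit-tested piece by piece, which a single SQL string cannot.

```python
# A sketch of the argument above, using plain Python as a stand-in for
# the PySpark DataFrame API. Function names are illustrative only.

# The SQL version: one pile of text, hard to test in isolation.
SQL = """
SELECT region, SUM(amount) AS total
FROM orders
WHERE status = 'shipped'
GROUP BY region
"""

# The dataframe-style version: each step is a named, testable function.
def shipped_only(rows):
    """Equivalent of the WHERE clause."""
    return [r for r in rows if r["status"] == "shipped"]

def total_by_region(rows):
    """Equivalent of GROUP BY region + SUM(amount)."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

def report(rows):
    """The whole query, composed from the two steps above."""
    return total_by_region(shipped_only(rows))

orders = [
    {"region": "east", "status": "shipped", "amount": 10},
    {"region": "east", "status": "pending", "amount": 99},
    {"region": "west", "status": "shipped", "amount": 5},
]

# Each step can be asserted on independently: vanilla Python testing.
assert shipped_only(orders)[0]["amount"] == 10
assert report(orders) == {"east": 10, "west": 5}
```

The same decomposition is what the real DataFrame API buys you: filters and aggregations become objects and functions you can import, reuse, and put under CI, instead of substrings of a query.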
Sudip: That's fascinating. So, Reynold, we end every one of our episodes with three quick questions. We call it the lightning round, starting with the first one being acceleration. So in your view, what has already happened in Big Data that you thought would take much, much longer? [00:49:57]
Reynold: Yeah, the death of Hadoop. I would have thought it would take a lot longer for Hadoop to die. Enterprise software never really goes away, but Hadoop today is largely irrelevant. It happened faster than I thought it would. [00:50:09]
Sudip: You definitely had something to do with it, didn't you? [00:50:11]
Reynold: Yes, we had a part in that, but if you asked me 10 years ago, I would tell you by 2030, maybe you would see a rapid decline. [00:50:21]
Sudip: Then the second question around exploration, what do you think is the most interesting unsolved question in your space, you know, largely Big Data processing? [00:50:31]
Reynold: There are many. I'll just pick one: how do you combine unstructured data and structured data? Especially with GenAI now, there are great ways of processing and analyzing unstructured data, but how you combine it with structured data is unclear. I think there's a lot of value that can be generated there. [00:50:48]
Sudip: Final question, what's one message you want everyone listening today to remember? [00:50:55]
Reynold: I mean, maybe to the builders, the open source framework builders, it would be: put the user first, think from their shoes. Think about simplicity for the users, I would say. And the last thing I said: Python for data engineering. You want to use a real programming language for data engineering to bring engineering rigor into data. [00:51:15]
Sudip: Reynold, thank you so much for sharing your insights. This was a real pleasure and frankly a privilege to have you on. Thank you. [00:51:23]
Reynold: Thanks a lot for the invitation, Sudip. [00:51:25]
Sudip: This has been the Engineers of Scale podcast. I hope you all had as much fun as I did. Make sure you subscribe to stay up to date on all our upcoming episodes and content. I am Sudip Chakrabarti and I look forward to our next conversation. [00:51:41]
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com
In our Engineers of Scale podcast, we relive and celebrate the pivotal projects in infrastructure software that have changed the course of the industry. We interview the engineering “heroes” who led those projects to tell us the insider story. For each such project, we go back in time and do an in-depth analysis of the project - historical context, technical breakthroughs, team, successes and learnings - to educate the next generation of engineers who were not there when those transformational projects were created.
In our first “season,” we start with the topic of Data Engineering, covering the projects that defined and shaped the data infrastructure industry. And what better than kicking off the Data Engineering season with an episode on Hadoop, a project that is synonymous with Big Data. We were incredibly fortunate to host the creators of Hadoop, Doug Cutting and Mike Cafarella, to share with us the untold history of Hadoop, how multiple technical breakthroughs and a little luck came together for them to create the project, and how Hadoop created a vibrant open source ecosystem that led to the next generation of technologies such as Spark.
Timestamps
* Introduction [00:00:00]
* Origin story of Hadoop [00:03:26]
* How Google’s work influenced Hadoop [00:05:47]
* Yahoo’s contribution to Hadoop [00:13:51]
* Major milestones for Hadoop [00:20:06]
* Core components of Hadoop - the why’s and how’s [00:22:44]
* Rise of Spark and how the Hadoop ecosystem reacted to it [00:27:19]
* Hadoop vendors and the tension between Cloudera and Hortonworks [00:31:51]
* Proudest moments for the Hadoop creators [00:33:56]
* Lightning round [00:36:04]
Transcript
Sudip: Welcome to the inaugural episode of the Engineers of Scale podcast. In our first season, we'll cover the projects that have transformed and shaped the data engineering industry. And what's better than starting with Hadoop, the project that is synonymous with Big Data? Today, I have the great pleasure of hosting Doug Cutting and Mike Cafarella, the creators of Hadoop. And just for the record, Hadoop is an open source software framework for storing enormous amounts of data and for distributed processing of very large datasets. Think hundreds or thousands of petabytes of data on, again, hundreds or thousands of commodity hardware nodes. If you have anything to do with data, you certainly know of Hadoop and have either used it or definitely benefited from it one way or another. In fact, I remember back in 2008, I was working on my second startup, and we were actually processing massive amounts of data from retailers, coming from their point-of-sale systems and inventory. And as we looked around, Hadoop was the only choice we really had. So today I'm incredibly excited to have the two creators of Hadoop, Mike Cafarella and Doug Cutting, with us. Mike and Doug, welcome to the podcast. It is great having you both. [00:01:02]
Doug: It's great to see you. Thank you. Thanks for having us. [00:01:10]
Sudip: If you guys don't mind, I think for our listeners, it'll be great to know what you guys are up to these days. Mike, maybe I'll start with you and then Doug. [00:01:19]
Mike: Sure. I'm a research scientist in the Data Systems group at MIT. [00:01:27]
Doug: I'm a retired guy. I stopped working 18 months ago. My wife ran for public office, and it was a good time for me to transition into being a homemaker, doing the shopping and cooking. But I also have a healthy hobby of mountain biking and doing trail advocacy and development, trying to build more trail systems around the area that I live in. [00:01:44]
Sudip: Sounds like you're having real fun, Doug. One day we all aspire to get there, for sure. I'm really curious to know how you guys had met. I've seen some interviews of you guys. You kind of talked about how, I think, Doug, you were working on Lucene at that time and then connected with Mike somehow through a common friend. I'd love to know a little more detail on how you guys met and how you guys started working together. [00:02:06]
Doug: It kind of goes back to before Hadoop, really. Hadoop was preceded by this project, Nutch. Nutch was initiated when a company called Overture, which we'll probably hear more about, called me up out of the blue, as a guy who had experience in both web search engines and open source software, and said, hey, how would you like to write an open source web search engine? And I said, that'd be cool. And they said they had money to pay me at least part time, and maybe a couple of other people, and did I know anyone? I didn't know anybody offhand, but I had friends. I called up my freshman roommate, a guy named Sami Shio, who was a founder of Marimba. And I said, Sami, do you know anybody? And he said, you should talk to Mike Cafarella. I think that was the only name I got. And I called Mike, and he said, yeah, sure, let's do this. [00:02:49]
Mike: So at the time, this would be in late summer or early fall of '02, I had worked in startups and in industry for a few years, but I was looking to go back to school. So I was putting together applications for grad school. And I was working with an old professor of mine to kind of spruce up my application a little bit, because I had been out of research and so on for a while. And that was a fun project, but it wasn't consuming all my time. And so Sami, who was one of the founders of Marimba, which was my first job out of college, got in touch and said that his buddy Doug had an interesting project and I should make sure I go talk to him, which was great. I was looking for something to do, and it came at just the right moment. [00:03:26]
Sudip: That was quite a connection, Mike. And then going back to that timeframe, 2002-2003, I think, Doug, you started touching on how you started working on Nutch and eventually became Hadoop. Would you mind just maybe walking us through a little bit like the origin story of Hadoop? I mean, I know Overture funded you for writing the web crawler, but what was their interest in an open source web crawler in the first place? [00:03:49]
Doug: I think that's a good question, to get back to some of the business context. We want to mostly focus on tech here, but the business context matters, as is often the case. So I had worked on web search from '96 to '98 at a company called Excite. I'd been pretty much the sole engineer on the backend search and indexing system. And then I transitioned away from that, having written this search engine on the side called Lucene, which I ended up open sourcing in 2000. Also in '98, Google launched, and initially they were ad-free. All the other search engines, and there were a handful of them, were totally encrusted and covered with display ads. Just think of magazine ads: random ads that they managed to sell the space for to advertisers. Google started with no ads, and they also really focused and spent a lot of effort on search quality. All they were doing was search. Everybody else was trying all kinds of things to get more ads in front of people, and Google just focused on making search better. And by 2000, they'd succeeded, and with the combination of this really clean, simple interface and better quality search results, they had taken most of the search market share already. But they needed a revenue plan. This company called Overture had, in the meantime, invented a way to make a lot of money from web search by auctioning off keywords to advertisers and matching them to the query. Google copied that and started minting money themselves. Overture was nervous because they had this market, and they were licensing it to Yahoo and Microsoft and others, but they were worried that all of their customers were going to get beaten by Google and go out of business. So on one hand, they sued Google. That's an interesting side story. But on the other hand, they decided, we should build our own search engine to compete with Google; we somehow need to do this. They bought AltaVista. 
They tried to build something internally, and they also thought, you know, open source is this big trend. Let's do an open source one to have something to compete. So they called me, and I called Mike, and we worked with a small team of guys there at Overture, led by a guy named Dan Fain, and we started working on trying to build web search as open source. [00:05:47]
Sudip: That is such phenomenal historical context. I don't think very many people, including myself, had that. And then, interestingly, Google also came out with their GFS paper in 2003 and their MapReduce paper in 2004, which obviously influenced a lot of the work that you guys did down the line. I'm curious, what do you think might have caused Google to publish those papers in the first place? Any hypothesis on that? [00:06:14]
Mike: I think you're putting your finger on something interesting and important, which was, at the time, that wasn't common practice to have a research paper that told you a lot of technical details about an important piece of infrastructure. I don't think it was part of some genius, long-term plan to profit down the road. It was part of a general culture at the place to emphasize the virtues of publishing and openness and science. Maybe it helped them with hiring or something like that, but if so, that was kind of an indirect benefit. And it was really trend-setting. I mean, they ended up publishing a ton of papers. I think Microsoft and Yahoo and other companies followed suit. There's a whole string of really interesting papers throughout the 2000s and early 2010s, systems that we might never have learned about had they remained totally closed. But it's interesting to think about the impact of the GFS paper, I think, on our experience, Doug, which was we had worked on Nutch for, I guess, about a year. And after about a year's time, I recall that it was indexing on the order of tens of millions of pages, but you couldn't get more than a month's worth of freshness because the disk head just couldn't move that fast in a month. So it was a single machine, but we were limited by storage capacity or by disk throughput on the seek side. If we wanted the index size to grow substantially larger, then we had to have some kind of distributed solution. I remember we spent something like six to nine months, Doug, working on a dedicated distributed indexer for Nutch. I remember the technical details. Maybe you can pitch in a little bit there. But I do remember finishing it, or at least thinking we were finished, and then about five minutes later reading the GFS paper and realizing that we should have done it that way. [00:07:51]
Doug: I remember running it. I remember operating that thing. And I think we actually got up to 200 million web pages. We were basically doing, this was still well before the MapReduce paper, doing MapReduce by hand. We'd quickly learned that we couldn't do all of this on a single processor. I more or less knew that from my days at Excite. We sort of hit the limit of what you could do on single processing, and even then we were already doing some things distributed. So we needed to have a way to do it distributed, which was to chop up the problem into pieces and run them in parallel. Overture had bought some hardware for us. A friend of ours named Ben Lutch, another guy we brought on, was running that hardware in a data center somewhere, and we could farm off and run processes on these. But it was a lot of work. We'd run four things doing crawling, get that data down on the disks of those machines. Then we'd parse out all the links, and then we had to sort of combine, do a shuffle effectively in MapReduce terms, and do a merge of all those data on the different machines, and decide which pages to grab next, and then to do indexing. We had all that plumbing working, probably a 10-step process, each step of which was distributed across five machines. But it took you running and monitoring all of these processes for 10 steps, and shuffling files around by hand. We were just using SCP to move things between these nodes. It was laborious, and I don't think practical for more than five machines. We would have needed to start thinking about automating all that. Somewhere in there is when the MapReduce paper came out and automated all that, and added in a lot of other reliability considerations, as did the GFS thing. We didn't have to worry about drive failures and machines crashing. With five machines, that didn't happen, practically. But if we wanted to move up to 100 or 1,000 machines, then we knew it would, and that we'd need all that. So it was a pretty nice gift. 
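The hand-run pipeline Doug describes (crawl, parse links, shuffle files between machines with SCP, merge) is exactly the shape that MapReduce later automated. As a minimal single-process sketch of the map, shuffle, and reduce phases, here is word count, the canonical illustration, rather than Nutch's actual link-extraction steps:

```python
from collections import defaultdict

# A minimal single-process sketch of the map -> shuffle -> reduce
# pattern that the MapReduce paper automated across machines.

def map_phase(documents):
    """Map: emit (key, value) pairs from each input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key. Done by hand, this was the
    step of SCP-ing files between five machines and merging them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
assert counts["the"] == 3 and counts["fox"] == 2
```

What the paper added on top of this shape, as Doug notes, was running each phase in parallel across many machines, moving the intermediate files automatically, and handling machine and drive failures along the way.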
I mean, back to motivations for Google, I think part of it was, I think as Mike sort of indicated, they came out of academia, they had this don't be evil motto, and felt like it was sharing this. I think there was also a little bit of an agenda, in that at some level didn't believe that technological edge was sustainable in any case, that what you really needed to do was build company culture and operations. I actually talked to Larry Page about that once, and he claimed that their only sustainable edge was operations, which I thought was an interesting claim to make. But also, they believed having an open source implementation would help them in recruiting, that people would already be familiar with these concepts, with this model, and when they came into Google, they could adapt more readily and come up to speed. Which again, says they weren't worried about competition at a technical level, which is interesting. I'm very grateful they had that sort of high-minded attitude, because we were able to benefit tremendously. That was a big project. It took them, you know, five years with a huge team to work through a lot of different alternatives, and come up with a solution that they published. And Mike and I could just go, hey, let's go implement that. We've got a blueprint here. We were very happy to take all their hard work off the shelf. [00:10:42]
Sudip: How long did it take you to kind of incorporate the ideas from those two papers, the GFS and MapReduce, into what you guys were building? [00:10:53]
Mike: I remember running roughly a 40-node version of HDFS about six months after the paper came out. So I think by the summer of 2004, we had something limping along. I do remember a problem with that version, though. A simple thing the paper doesn't dwell on much, but that one has to implement, is that when a machine goes down, a portion of the file system becomes a bit more endangered, because you have fewer copies of those bytes than you would like to have. So it's time to copy them, to duplicate them more. And if a machine went down, the other machines, scrupulous to a fault, would absolutely blast as many bytes as possible to escape this dangerous situation right away. It would paralyze the entire cluster until it had done so. So you had to limit that a little bit, but it was roughly at that time that the basic system was limping along. [00:11:42]
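The failure mode Mike describes can be sketched in a few lines. This is an illustrative toy, not actual HDFS code, and all names and numbers here are invented: when a node dies, every block it held drops below its target replica count, and without a per-round cap the surviving nodes try to re-replicate everything at once.

```python
# Illustrative sketch (not actual HDFS code) of re-replication after a
# node failure, with the throttle that the first version lacked.

TARGET_REPLICAS = 3
MAX_TRANSFERS_PER_ROUND = 2  # the cap that keeps repairs from paralyzing the cluster

def under_replicated(block_locations, dead_node):
    """Blocks that fell below the target replica count when dead_node died."""
    return [
        block for block, nodes in block_locations.items()
        if dead_node in nodes and len(nodes) - 1 < TARGET_REPLICAS
    ]

def schedule_repairs(block_locations, dead_node):
    """Yield batches of re-replication work, capped per round, instead of
    blasting every copy at once."""
    pending = under_replicated(block_locations, dead_node)
    while pending:
        batch = pending[:MAX_TRANSFERS_PER_ROUND]
        pending = pending[MAX_TRANSFERS_PER_ROUND:]
        yield batch

blocks = {
    "blk_1": {"node_a", "node_b", "node_c"},
    "blk_2": {"node_a", "node_d", "node_e"},
    "blk_3": {"node_b", "node_c", "node_d"},  # does not involve node_a
}
rounds = list(schedule_repairs(blocks, "node_a"))
assert rounds == [["blk_1", "blk_2"]]  # both repairs fit in one round here
```

Without the cap, the scheduler would emit all repair traffic in a single burst, which is the cluster-paralyzing behavior Mike recalls.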
Doug: Mike tended to focus on the core GFS algorithms and the core MapReduce algorithms, and I tended to focus more on hooking that all into the rest of Nutch and then running it. Mike also did some crawls up at UW as well. [00:11:53]
Mike: I should say, you know, one thing that I've always thought of as an underappreciated element of the project's success, and that Doug really took ownership of, was the readme and the out-of-the-box experience. You could download the thing, and an hour later, it could be working on 100 nodes. And at the time, that was just unfathomable. If you wanted to get, say, an eight-node distributed database system working at the time, well, you'd better have a budget to hire some consultants from IBM to help you set that up, right? But the Nutch, almost Hadoop at that point, experience in a distributed setting was really smooth, and Doug focused on all that stuff. It was really, I think, a big ingredient in its success. [00:12:20]
Sudip: I think, speaking from my own experience, it really shortened our time to value, and we didn't have to raise a whole lot of money because cloud was coming up. This was circa 2009. AWS was just about coming up, so we could get up and running so quickly because we had this nice project you guys had created. So, very belated thank you for that. [00:12:32]
Sudip: It sounds like you guys had most of the plumbing before the two Google papers came out. Did you ever think of if those hadn't come out, how Hadoop might have looked now? [00:13:00]
Doug: I think we would have struggled. I think we would have come up with some scripts to try to automate some of this stuff, and long-term, we would have struggled with reliability issues and scaling issues. [00:13:08]
Mike: We were really interested in Nutch, the search engine, at the time, right? The goal of all this work was to improve the search engine quality: to improve its coverage, ranking quality, speed of indexing, and so on. And so, in that alternate reality, if things had been a little bit different, I like to think that we would have encountered the same technical issues that the Google guys did, and we would have gone through a similar kind of technical discovery process, just a little bit later than they did. Of course, they're very sharp. Would we have actually had as good an outcome? I don't know. But one thing that was nice is that the focus on the search engine led us to see some of these problems later than the Google guys, but earlier than most other people. We were a pretty small team next to, I think, the resources that Google had. [00:13:51]
Sudip: Yeah, resources and how important search was. Obviously, that was the entire company, entire business model, right? Speaking of which, another search giant at that time, Yahoo, had an amazing role, a disproportionate role in making Hadoop successful. And Doug, you, of course, went to work there, I believe, in 2006, if I'm not mistaken. Maybe if you could walk us through the role that you guys saw Yahoo playing in Hadoop in the early days, how they made it such a successful project. [00:14:17]
Doug: Yahoo bought Overture, which was interesting, and they continued to support the work on Nutch. I think that deal closed in probably 2004 or 2005, something like that. And Yahoo was trying to build its own search engine to go against Google. They had bought a company called Inktomi, and with it a big team of engineers who had been running a web search engine, and they were trying to figure out the next generation of that. They recognized the need for a new data backend to hold all the crawl data and to do the processing of it. They looked around, they saw the MapReduce paper and the GFS paper, and they said, we need that. And they saw this open source implementation and thought it would be a great starting point. And to boot, they had already funded it. I think that was a coincidence as much as anything, but it meant we already knew people there. I went and gave a tech talk about it in 2005. And in '06, they said, let's adopt this platform that you guys have in Nutch as our backend. We really want to invest in it. I said, great, I'll come work for you. That's what I would love to see: some serious investment. It was just Mike and I working on it, and it was going to be a long tail of debugging to get anywhere near the solidity that it needed. So I joined. They were like, we don't care about all the web-search-specific stuff, because we already have that. What we need is this backend stuff, the equivalent of GFS and MapReduce. We need the HDFS. And so we split it in two. They were also concerned about intellectual property issues working in the search space; they wanted to work in as narrow a space as they could, for exposure to patent issues and so on. So we split Hadoop out. I had that name waiting on a back burner. My son had this stuffed yellow elephant that he had named Hadoop. He just created this name, and I thought it would be good. It comes with a mascot. [00:15:55]
Sudip: And I purposefully didn't ask you that question, because that's one answer everyone knows, for sure. [00:16:01]
Doug: And so that was, I think, February, maybe, of 06. We split them out, refactored. It was a pretty small job. I think Mike and I had already factored the code reasonably well, so it was mostly just renaming a lot of things and putting them in different namespaces. And we were kind of off to the races. Yahoo immediately, the day I started, I think they had a team of 10 all of a sudden working on it. And we had a hundred machine cluster of really nice hardware compared to anything that Mike and I had ever had before. [00:16:27]
Mike: Yeah, it was incredible. I mean, up until that point, it was Doug and I working on it. I think Stefan Groschupf had been contributing on the open source side for Nutch; he was a notable contributor. There were a few other people, but when Yahoo invested in it, it really was an epic change in the number of people paying attention. It was great. We really expanded the set of people who were participating. [00:16:47]
Doug: By summer, they had probably a hundred engineers and a thousand node clusters on this. And I don't know if it was the end of 06 or was it in 07 when they actually transitioned their production search engine to running on top of this. It was a big process. Owen O'Malley started really making a lot of improvements there. Who were some of the other guys there, Mike? [00:17:06]
Mike: Arun Murthy, and Eric14 was the lead engineer, or the engineering manager, at that time. Raymie Stata, who I think had come to Yahoo as part of an acquisition, was the manager and kind of our champion for the project inside Yahoo for a period of time. He was really instrumental. So one thread of the story is the contingent nature of the project. There were lots of things that had to go right for this to be a success. And at many points, there were individuals or companies who decided to listen to the better angels of their nature. Whether it was Overture funding the project initially, Yahoo deciding to fund it, or Google deciding to write those papers, a lot of things came out right and eventually yielded a good outcome. But it took lots of people working independently to kind of happen into it. [00:17:48]
Doug: As an open source maintainer, my goal has always been to get a project to the point where I'm not needed, where it has a life of its own. It's built up enough of a user base, about enough of a developer base. And where Mike and I were in 05 with Nutch wasn't there. It was too raw. You could use it in the out-of-the-box experience was as good as it could be, but it was unproven. I was working as a freelance consultant. I was getting tired. I was looking for a full-time gig. And IBM talked to me and they said, we really want to invest in Lucene. And it would have been a nice job, but Lucene was fine. Lucene was off and running on its own. I didn't need to be there day to day to use Lucene because IBM was already using it. Whereas Nutch really needed a sponsor. And that's when Yahoo came along. And so it was really a great thing that they did and really, really took it and made it real, got it to that point where it really proved itself. And then Facebook and Twitter and all the rest could start using it. [00:18:39]
Sudip: In hindsight, what was your take on Yahoo's relationship with an open source project? Was that smooth sailing internally where people really believed in it? Or was there some push to put it within the walls? [00:18:53]
Doug: It wasn't something that they had done a lot of before. So it was new ground for them at a corporate level. I had learned because I'd been doing consulting in open source for a while that I needed a clause in my employment agreement that said that I could contribute to open source. None of the Yahoo employees had that. And so although the engineering management was using this and investing in this and committed to the vision of open source, the lawyers said, no, Yahoo employees, except for Doug, can't actually contribute to open source. So they had to submit all their additions and then I could commit them. It was this one little step that I had to do for well over a year. It took a change in Yahoo CEO before we could get somebody to override the legal department and say, this is actually okay for Yahoo employees to directly apply changes to Apache. So there were a lot of things like that. The other thing that was an issue is Yahoo, as I said, had 100 people working on this. And there were a handful of people in other companies that were starting to use Hadoop, but nobody had anything like that. And in open source, it's hard to not dominate when you've got that big of an imbalance. And we really wanted to build an egalitarian community where everybody weighs in. And it was hard for Yahoo to not be the 200 pound gorilla. And there were some growing pains around that over the years. [00:20:06]
Sudip: Before I shift to talking about the main components of the Hadoop ecosystem, I want to kind of just read through a couple of milestones that I found. And I'd love to know if any of those sound completely off. So I found that in 2007, within less than a year after you joined Yahoo, Doug, they actually were using Hadoop on a thousand node cluster. And then in April 2008, apparently Hadoop defeated supercomputers and became the fastest processing system to sort through an entire terabyte of data. And then in April 2009, I think Hadoop was used to sort a terabyte of data in 62 seconds, beating Google's MapReduce implementation. And then finally, I think in 2009 also, it was used to sort through a petabyte of data, processing billions of searches and indexing billions of web pages. These are like heady, heady milestones. I'm curious a little bit, what was the feeling inside Yahoo at that time as you guys were hitting those milestones, which actually went in some cases way beyond what I think you guys had set out to do? [00:21:08]
Doug: That was mostly Arun and Owen doing those benchmarks and driving that forward. I think [00:21:13]
Mike: We were pretty stoked. It was awesome. Doug's right. The point is to get it to the point where it won't die. And people that we knew, but we were not working with every day, they were taking it and doing amazing things with it. That's what you want to have happen. It was thrilling to see. It was really fun. Yeah, it also gets to motives back to why Google published those papers. [00:21:34]
Doug: It gave their employees public visibility and employees like that. You want employees to be happy. I think that was part of the motive for Yahoo adopting an open source solution is people like to be visible in the outside world, more peer recognition, and being involved in open source gives you that. So it made Yahoo a more fun place to work and more rewarding. And also being able to try to beat these kind of records. Again, it's great for recruiting and retention, employee morale, so long as you believe that you're not giving away the bank. When you do that, I think you build a much stronger company. Yahoo's profile isn't what it used to be in the consumer space nowadays. But in the 2000s, they were not as maybe financially successful as Google, but the technical depth of the company was really good. They had a ton of people working on Hadoop. They had a lot of people in different parts of the company, like Yahoo Labs was a research lab that was fantastic at that time. The technical skills inside the company were great. Google was getting a lot of press. So I always felt like some of these guys, it felt great to go make a splash and get some numbers in the way that people in Mountain View were. And they certainly had the brains to do it. So I thought it was great. That's sick. [00:22:44]
Sudip: I want to shift to discuss a little more technical stuff. So as I understand, Hadoop has had four main components: HDFS, MapReduce, YARN, and Hadoop Common. I'm curious a little bit if you wouldn't mind spending a little bit of time on what was the motivation to create each of those components and what do you think was kind of the guiding principle to build that ecosystem of four main components? [00:23:09]
Mike: I'll try to address some of this, Doug. I mean, you were more closely involved with some of these components than I was. I should say the Yahoo engagement in '05 to '06 and so on, it was thrilling in a lot of ways. It was also kind of the beginning of the end for me because I was in grad school at the time. And when you have 100 people working on the project, I mean, people would file bugs and fix them before I could come in to put in my 10 hours a week. So at some point, I had to decide whether I was going to actually get my PhD or keep trying ineffectually to contribute next to everyone else. So I had to kind of pull out by '07 or so. So some of these later components, I don't have a ton of insight into, but I can comment on some of them. HDFS was the original Google file system element. The original version, I think, reflected a set of design choices that informed a lot of Nutch and Hadoop, which was a focus on correctness rather than performance or other things. That's why Java was the programming language we chose for all this, not known for being super high performance at the time or arguably now. I think it's not as bad as people think, but I think it's okay. But the thing that was good about it was we were really worried like if the system is wrong or if no one can contribute, it'll die. If it's slow, people can probably live with it a little bit. And again, I think the performance penalty of that's a little bit overstated. So HDFS, the initial version, was really designed to emulate GFS in many ways. In some ways, we made decisions that reflected our particular use case, like there were certain kinds of mutation operators or append operators in GFS that were not supported in the original version of HDFS. There was no traditional failover in the original version. That would have to come later with Zookeeper and a few other services that offered us high quality distributed system safety.
So the original version of HDFS was an emphasis on correctness and just bulk storage. And that's turned out to be an enduring advantage of the whole project, right? That's still something that a long time later people really need. It's become technically much more sophisticated than it was originally. But that emphasis on like reliable storage that is as cheap per byte as possible has still proven to be a good idea. Yeah, I mean, I'll just add, you know, HDFS and MapReduce were kind of the original two functional components, each modeled after a paper from Google. The common portion is just the utilities that we needed to build this kind of system, you know, RPC, storage formats, just that kind of library. Yarn came along a little later. That was a project led by Arun to abstract the scheduling out of MapReduce and try to come up with a scheduler that was general purpose that other systems could use to schedule tasks across a large cluster once you had one built up for more than just use it for more than just MapReduce. [00:25:53]
Mike: One of the cool things at that time, you know, this is like the late 2000s, were that people were learning from the MapReduce example, both at a systems perspective and kind of a science perspective. And they were building much more ambitious systems in the original MapReduce pattern. So there was Pig, which was like a SQL processor that was built on top of it. Hive would come out pretty soon. From Microsoft, maybe a year or two later, there's a system called Naiad. And all these were kind of arbitrary computation graph systems. The original infrastructure of MapReduce inside Hadoop couldn't support that stuff, but Yarn could, like they could be much more ambitious about the computation graph. [00:26:27]
Sudip: And as I understand, Hive was the one that let people use SQL to write MapReduce jobs, right? That kind of opened up the user base quite a bit. [00:26:38]
Doug: Yeah, that's right. Pig was very similar, but it used a different syntax. So it didn't use SQL syntax, but it tried to obtain something similar. Mike and I weren't doing science. We were doing engineering on this project, and maybe some social work trying to build communities and so on. To really evolve, you need to have people experiment and try to build new kinds of systems. And I think Hadoop really inspired that in the open source space. I think we saw a huge number of things follow on, many of which succeeded, many of which didn't. So we gave people this example, and then they could do some science based on our engineering example. But in the case of Hadoop, MapReduce and GFS were really the science part. [00:27:19]
Sudip: Hadoop really succeeded and was probably designed as a batch analytics system in the first place, right? And then as the new use cases were coming up, did you guys ever consider adding more like a streaming use case, more of an interactive analytics use case? Was that ever a consideration for you? [00:27:34]
Doug: Not really. I mean, I think we saw other people coming along and addressing those kinds of systems with Spark and other systems that really, really addressed it properly. And it was layered on top. And our systems were designed from the outset to be more batch oriented. A lot of these things were incremental, right? Like the HDFS and MapReduce process were very batch oriented. If you wanted something that was more interactive time or immediate, then maybe you'd use a traditional relational database. And I think what's become apparent since then is that there are a lot of different points in between, right? At the time, batch oriented execution was what I thought made it distinctive. Having a MapReduce experience that was a little more interactive, although not what you would call like real time necessarily, that turned out to be a pretty compelling point in the design space. And I don't think that was obvious to me like in 2011 or 2012. The big advance was before that, you couldn't process that kind of data. Even if you had the hardware, there wasn't really commercial software. That was, I think, what really excited me when I read those papers was it opened up this whole realm of managing terabytes and being able to do computations over it. My background originally was in computational linguistics and doing processing over large corpora of text is critical to that, but you couldn't do it until we had this. And open source really enables that in a way that proprietary solutions aren't as accessible to everyone. That was a thing that really excited me about it was that the possibilities that it opened up for folks doing creative tasks with large amounts of data. I mean, you could do that stuff back then. People had computers that you could attach to a network. Distributed programming was possible, but the distributed programming libraries were for researchers, not everyday programmers.
And the storage costs, if you wanted a lot more storage than what a small number of disks hanging off your PC could supply, you could go buy a RAID device or an EMC-style device that was incredibly expensive and didn't give you the scalability. So for most people, it really opened up a lot of stuff that they didn't have before. And the MapReduce API looked really clean and elegant and looked like it made it really simple to process huge amounts of things. Nowadays, people look at it and they say, ooh, how clumsy. You really have to do all that work. But at the time, it was really groundbreaking in its simplicity. [00:29:54]
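What made that API feel so simple is that a whole job was just two small functions. As a rough sketch of the programming model Doug describes (plain Python for illustration, not Hadoop's actual Java API; every name here is illustrative), a word count looks like this:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(doc):
    # Mapper: emit a (word, 1) pair for every word in one input record.
    for word in doc.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reducer: combine all the values seen for a single key.
    return (word, sum(counts))

def run_job(docs):
    # Shuffle: group the mapper output by key, as the framework would
    # do across the cluster (groupby requires sorted input).
    pairs = sorted(kv for doc in docs for kv in map_phase(doc))
    return dict(
        reduce_phase(word, (c for _, c in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(run_job(["the quick brown fox", "the lazy dog", "the fox"]))
# -> {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The framework's real value was executing the shuffle and both phases across thousands of machines with fault tolerance; the user only ever wrote the two small functions.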
Sudip: I remember watching an interview of you, Doug, back in 2016 or something. Somebody asked you, what would you see as the success of the Hadoop ecosystem? I think one comment you made was, I kind of expect some of the Hadoop things to shrink and be replaced by new tools, which turned out to be obviously a completely accurate prediction. So I'm curious a little bit, how did you guys think about it when you saw something like Spark come up, which essentially was the newer version of MapReduce, obviously using in-memory processing? What were your first thoughts? Did you consider adding maybe an in-memory extension in Hadoop, particularly the MapReduce portion, to do something similar? How did you guys think about it? [00:30:32]
Doug: I don't think of it as a competition where Hadoop had to keep up with these other projects. Rather, for me, the bigger mission is trying to get more technology in open source. And so that's a success. When Spark comes along and makes some fundamental improvements and provides something that can replace, in many cases, maybe not in every case, Hadoop, that's a good thing. More power to it. The process is working. We're making progress because we're not selling anything. It's very different than a commercial marketplace, where if a competitor comes up with something, you need to try to match it. In open source, we're much happier to have different projects complement one another. [00:31:09]
Mike: I fully agree with everything Doug just said. The thing that made those sort benchmarks and the engagement with people inside Yahoo and elsewhere so thrilling was people cared enough to make it better. People who work on Spark cared enough to make it obsolete in some ways. It's not that different. They actually thought it was worthy of paying attention to, and they did a better job. [00:31:30]
Sudip: Great, awesome! Let me ask a couple of last things. One is the three vendors that came out of the Hadoop ecosystem. There was obviously Cloudera, which started in 2008, as I understand. MapR came out in 2009, and finally Hortonworks in 2011. Doug, of course, you were involved in Cloudera from pretty much day one. I'm guessing, Mike, you also probably advised some of those. [00:31:51]
Mike: I did a little bit of advice, but I had gone on my professor career. I was out of the business mostly, but I did a little bit of consulting. [00:31:57]
Sudip: I'm curious, looking back, how do you see the three vendors in the Hadoop space? What did they do right? What did they get wrong? And if you were to do any company based on Hadoop today, knowing what you know now, anything different you'd do? [00:32:11]
Doug: There were definitely some unfortunate things that happened in those years. When VCs approached me in probably 2007, a group of folks from Accel and from a couple other firms, and said, we want to start companies in this space, I was like, fine. I'm not interested in that right now. I'm happy at Yahoo. But if you do, please get together and start one. It's going to be complicated enough for the open source project to deal with a startup, but to deal with multiple ones and them trying to stab each other, it's just going to poison the open source community. We really don't need that. And these folks I talked to started Cloudera and Cloudera got off to a start and I joined Cloudera. But unfortunately, we still had bad blood because Yahoo perceived it as a threat. Yahoo had gotten all this acclaim for investing in open source and building this amazing system. And now some of that acclaim was being taken by someone else, by a startup, and moreover, money was being taken, profit was being taken. And Cloudera had stock options for something which might become big and people at Yahoo were working at a big, already public company. And so there was resentment, which led to conflict in the open source community, which stymied the project there for a while and eventually led to the team from Yahoo creating Hortonworks as a competitor to Cloudera. And the two of us went at it pretty non-productively. It wasn't, I don't think, a really healthy competition where we egged each other on. Rather, we were selling very similar products in the same market and undercutting each other. Maybe it was good for consumers of the software. The customers, yeah. It wasn't great for either company and it wasn't great for the open source projects that the bitterness spilled over into. So when Hortonworks and Cloudera finally merged, it sort of put that to rest finally. It wasn't ideal, but it was what it was. [00:33:56]
Sudip: Absolutely. Coming back to Hadoop, as you kind of look back, you know, since you guys started, which is like, I'm thinking, almost 20 years now, what would be your proudest moment or moments? Maybe Mike, I'll start with you and then Doug. [00:34:13]
Mike: I'll mention one funny one, which is actually from the Nutch era rather than Hadoop. Okay. Nutch was successful in that a lot of its code lived on in Hadoop and so on. But as an actual system, were there that many Nutch users in the world? Not that many. However, there was one that was very notable to me personally, which was the Mitchell South Dakota Chamber of Commerce. I remember this very clearly. They ran like a small intranet search site on it. Maybe that's not that notable, but if you've ever driven cross country past the Corn Palace, I believe the Chamber of Commerce actually owns and operates, or at least did at one time, the Corn Palace. There's a piece of Americana that was tied to Nutch. Whenever some people asked me about Nutch, I would brag about the Corn Palace, how they were secretly running my code. That was pretty great. On Hadoop itself, you know, I think just the breadth of it, those sorting benchmarks were very exciting. The fact that for, you know, many universities based like undergraduate classes on it, that felt great. Those were all really notable moments for me. [00:35:09]
Doug: For me, Hadoop exceeded my wildest expectations. When we started sort of trying to re-implement GFS and MapReduce, my goal was to provide open source implementations of things that researchers could use to manage large data collections. To some degree, I think it didn't hit me the degree to which corporations had massive amounts of data that they weren't harnessing. And that was what occurred to these VCs. They knew that market; they were the ones talking to those companies. But I was not involved in enterprise software at all prior to joining Cloudera. I didn't know anything about that whole market that was out there. And to see that explode, to see banks, insurance companies, governments, these kinds of institutions really take off using this stuff was pretty amazing. I didn't see this becoming a staple of enterprises by any means. It was far beyond my imagination for where we could go with this. [00:36:04]
Sudip: That is a very nice segue into the last thing we like to do on every podcast, which is a lightning round. So maybe I'll start with you, Doug, since you kind of brought us here. Three quick questions. One is around acceleration. So what has happened in big data that you thought would take much longer? What has already happened? [00:36:23]
Doug: I guess, I mean, I'm an optimist. I always think things are going to go quickly and go well. But that said, I'm not sure I expected the moves to open source and to the cloud to happen as rapidly as they have. I think we've really seen open source take root in enterprise data technologies and become an accepted way of doing things, which 20 years ago, it was not at all. Everything in enterprise was proprietary, pretty much. And also running things in the cloud. People really wanted to keep their data on their servers and cloud was not widely trusted. And we've seen a real 180 there. To me, it always was appealing. I don't want to run servers. I like being able to rent a server much better and treat it as a service. So that's been a great thing to see. Mike? [00:37:10]
Mike: You know, I think if you're going to say what was surprising or that I thought would take longer, I mean, AI is a very true, but in some ways boring way to answer that question. Two things that are interesting about the kind of revolution in AI that we've seen, I would say going back to 2012, when some of the first vision models became really good, and it hasn't been stopping, is first to think about the extent to which big data has been enabling technology for the modern AI stack that we see now. Even if you had had the idea that neural approaches should be turbocharged, you probably couldn't have done anything with that observation in 2002. And the other thing, which is especially interesting to me, is the way that neural models, like these really large scale models that are produced at incredible expense, have migrated so quickly to open source. The open source AI stack is really good. And when you consider they face a lot of the challenges that we did around Nutch, which is like, you've got no hardware and you've got no data, but make it work anyway. It's really impressive to me what the open source AI community has been able to do. [00:38:10]
Sudip: I think what is interesting is, you know, Google at that time, when I think you guys were building Hadoop, was kind of the protagonist right at the time. And now they are kind of playing defense in some ways, right? Because of AI and what is happening in open source, thanks to Facebook, Meta, and so on. So it's interesting to see the kind of tables turned a little bit in that way. Second question is around exploration. What do you think is the most interesting unsolved question in your space? Maybe, Mike, I'll start with you. Like, what do you think is still not solved that you'd love to see happen? [00:38:43]
Mike: I've got a ton of answers to this question, which I hope doesn't make me seem ungrateful for everything good that's been happening. One thing I would say is that a lot of what people store in these enormous HDFS clusters is documents, right? Like we've got a huge store of company documents. But understanding of a document beyond just the text is pretty poor. So understanding images or plots or the kind of multimodal form of a document is generally not that great. I'm hopeful that some of these AI approaches would make that better. Another thing, which is maybe at the very top of the big data stack, is I'm getting kind of sick of dashboards. I've been seeing like the same, you know, really complicated dashboards for 15 plus years. Like, here's my data center or my complicated system. I've got a big data system underneath, maybe Hadoop, maybe MapReduce, maybe Spark that is collating a ton of data. And I boil it down into some neon acid green set of dashboards that is, you know, honestly, pretty unpleasant to manage. So I'd like to see us do something with all these, like, we have this extremely large and high dimensional data set. I'd like to move beyond just the big pile of dashboards that most people use to investigate what's going on. I don't even know exactly what the answer is, but we've been dealing with the dashboard metaphor for a long time. And I think we need some innovation there. [00:40:02]
Sudip: I think on that point, Peter Bailis at Sisu Data is trying to do, you know, something like that, where he's trying to move you away from dashboards and really focus on what is going wrong in your business. [00:40:13]
Mike: Yeah, I think that's one possibly really interesting direction. Yeah. Doug, for you? [00:40:17]
Doug: One of the big challenges I think we haven't yet met, and I'm hoping we will, is really dealing with issues around privacy and consent and transparency. It's not strictly technical, but there's technical aspects. We want to get value from data. Much of the data which is most valuable is about people, but respecting those people's rights and getting value out of the data at the same time can be in conflict and coming up with mechanisms to really handle that conflict and deal with it in a reasonable way. I think it's hard. I think we're only in the early days of that. [00:40:54]
Mike: I think we will see progress. [00:40:55]
Doug: I think as a society, we've seen that as we adopt other technologies. It takes decades. If you look at, you know, food safety and automobile safety, that took a long time to evolve, and it's continuing to evolve and develop how that's managed and regulated. Healthcare safety, and I think data safety, we've got some very crude things we're doing so far, and there's a lot of room to advance that. I wish the industry took it more seriously and led rather than followed, having to deal with laws that are crafted by non-technical folks instead of coming up with strong technical solutions that really respect people. Anyway, that's one that I'm concerned about. [00:41:34]
Sudip: Beautiful. Last question for you guys. What's one message you would like everyone to remember today? Maybe, Mike, if I can start with you. [00:41:42]
Mike: I think I'll mention that the success of the Hadoop project over the last, I guess, 20 years, it certainly took a lot of hard work by a lot of people. It also required a huge amount of luck. If I look at the amount of work that I put into this or something else, and maybe I think Doug would say the same thing, Hadoop's been dramatically more successful than a lot of projects. I don't feel like the work on it is any better or worse than some others. There's a heavy amount of luck in it. As I mentioned, there's also a heavy amount of contingency. Like at a few crucial points, some people decided to do something pretty good for the universe. Google didn't have to publish those papers. Yahoo didn't have to keep funding an open-source project. But they did something a little bit better than they had to, and it turned out to be great. So if you're listening to this, and you're in a position to do something a little bit better for the tech universe than you otherwise might have to, maybe you have the seeds of Hadoop on your hands. You should go for it. [00:42:30]
Sudip: That's a fantastic point. [00:42:31]
Doug: Yeah, no, I definitely want to echo Mike in that we were incredibly lucky and at the right place at the right time with the right skills to move this along to the next step. That said, I think a strategy that I try to employ, and I assume Mike does as well, is you do want to aim big. You do want to aim high. And at the same time, watch your feet so you don't trip. And it's this constant challenge of how to maximize the outcome without compromising your goals. And I think that's the art of doing this, is try to find something which satisfies all these competing concerns. Obviously, we wanted this project to be successful, and we looked around and found the gifts that people had laid out there for us, and opened them and ran with it. It was luck guided by, I think, some successful ability to compromise and find the right path for it. [00:43:19]
Mike: Doug, there's one question I was hoping you would answer during this conversation. Maybe I'll just ask it because I don't want it to end before knowing it. It's occasionally people ask me, if we were to do it all over again, would you still choose Java for it? And I give my answer, but I want to know yours. Would you still use Java to do all this stuff? [00:43:34]
Doug: Yeah, no question. I did my share of C and C++ programming, and it's painful for this kind of thing in particular. We wanted to focus on algorithms, on the overall architecture, and keep things as simple as possible. And to have done it in another language, I think would have been premature optimization. I think we were able to get solid performance by optimizing where we needed for the most part. So yeah, I don't have a regret there. [00:44:00]
Sudip: How about you, Mike? [00:44:01]
Mike: I agree with all of Doug's points that I would not have done it in a lower level language. I think the only question I have sometimes is whether it should have been even higher level. Like, I never ran into the claimed performance problems with Java. I never really observed them, or maybe just at that point in the project, they weren't important. I wonder if we should have done it in Python or Perl. Like, how high up the stack could we have gone and still had a successful project? [00:44:21]
Doug: Yeah, I think I'm a little bit more of a language snob. [00:44:23]
Mike: Fair enough. [00:44:27]
Sudip: Fantastic. Thank you so much, guys. [00:44:29]
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit sudipchakrabarti.substack.com