title Designing Data-Intensive Applications with Martin Kleppmann

description Brought to You By:
• Statsig – The unified platform for flags, analytics, experiments, and more.
• Sonar – The makers of SonarQube, the industry standard for automated code review.
• WorkOS – Everything you need to make your app enterprise-ready.

Martin Kleppmann is a researcher and the author of Designing Data-Intensive Applications, one of the most influential books on modern distributed systems. As of this month, the second, heavily updated edition of the book is out.
In this episode of Pragmatic Engineer, we discuss Martin’s career in tech building startups, how he ended up writing this iconic book, and what he’s focused on now after moving into academia.
We talk about the tradeoffs behind modern infrastructure, how the cloud has changed what it means to scale, and the thinking behind Designing Data-Intensive Applications, including what’s changing in the second edition.
Martin reflects on lessons from building startups like Rapportive, which he sold to LinkedIn, and shares how his experience in both academia and industry shaped his perspective.
We also explore what’s ahead: why formal verification may become more important in an AI-assisted world, the challenges of building local-first software, and his recent research into using cryptography to improve transparency in supply chains without exposing sensitive data.

Timestamps
(00:00) Early career
(05:46) Building Rapportive
(10:47) Working at LinkedIn
(14:09) Writing Designing Data-Intensive Applications
(23:00) Reliability, scalability, and maintainability
(26:24) DDIA: the second edition
(30:50) Tradeoffs of using cloud services 
(39:02) How the cloud changed scaling 
(42:53) The trouble with distributed systems
(49:02) Ethics for software engineers 
(52:45) Formal verification
(1:00:12) Academia vs. industry 
(1:03:50) Local-first software 
(1:09:50) Computer science education
(1:18:32) Martin’s current research and advice

The Pragmatic Engineer deepdives relevant for this episode:
• Building Bluesky: a distributed social network
• Inside Uber’s move to the cloud
• The history of servers, the cloud, and what’s next
• The past and future of modern backend practices
• How Kubernetes is built

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email [email protected].


Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe

pubDate Wed, 22 Apr 2026 16:19:26 GMT

author Gergely Orosz

duration 5100000

transcript

Speaker 1:
[00:00] Designing Data-Intensive Applications has been the go-to book for anyone building large back-end systems. Nine years after publishing this book, the second edition is here. Martin Kleppmann is the author of this generational book. I sat down with him, and today we cover how working on Kafka at LinkedIn directly shaped the ideas that became the first edition of the book, what's new in the second edition and why things like MapReduce got removed from this updated version, formal methods, local-first software, decentralized access and many more. If you care about how large systems work, where they're heading and what the fundamentals are that don't change, this episode is for you. This episode is presented by Statsig, the unified platform for flags, analytics, experiments, and more. This episode is brought to you by Sonar. Sonar, the makers of SonarQube, understands that code quality is about more than just avoiding syntax errors. It's about long-term maintainability by protecting the structural integrity of the system. As agents generate code at massive scale, they often ignore your system's structural integrity. This creates tangles, duplicated code, and other maintainability issues. These issues turn a modular design into a big ball of mud, making it increasingly difficult to extend. But here's something that's really helpful: SonarQube's architecture management. It moves architectural governance out of static wikis and into your automated workflow. It allows you to visualize your current architecture, define architectural boundaries, and manage architectural issues in real time. Whether it's a human or an AI agent at the keyboard, Sonar acts as a circuit breaker for structural decay. It ensures every commit respects the system's blueprint, protecting the long-term health of your most complex applications. Head to sonarsource.com/pragmatic to find out more. So, Martin, welcome to the podcast.

Speaker 2:
[01:44] Hi, Gergely. It's great to be here.

Speaker 1:
[01:46] It's amazing to have you here. I don't think you need an introduction for many software engineers, including myself. You're the author of this iconic book that I've had on my bookshelf for probably about 10 years, since not long after it came out. Before we get into this book, which we're going to talk about, how did you get into the technology field?

Speaker 2:
[02:03] Yes, well, I did an undergraduate degree in computer science like many others, and then after that, I wasn't quite sure what to do with my life, but I thought, well, starting a startup seems like an interesting thing to try. So I started a startup having no clue what I was going to actually do and then spent the first while searching around for things that might be interesting. The first startup didn't work out that well, but through that, I met some others who then became my co-founders for the second startup, which worked better, and we sold that one to LinkedIn. And then after that, I started being interested in teaching these distributed systems concepts. So that's when I got into writing the book. And then during the writing of the book, I also switched over from industry back to academia.

Speaker 1:
[02:47] Can we talk a little bit about your first and second startup?

Speaker 2:
[02:50] Yeah, Go Test It, this was like 2008 or something like that. It was the age where people were having real difficulties getting their JavaScript working cross-browser. Internet Explorer was still pretty big at the time. Chrome had just come out. All the browsers were incompatible with each other. And so Go Test It was a cross-browser, automated testing service for websites. It was based on Selenium, an open-source project that still exists. And the idea was you would write test scripts that automate a user clicking through the various interactions with a website and then just check that the right behavior happens. So yeah, it was based on Selenium, but provided as a hosted service, so people wouldn't have to run various VMs with various operating systems themselves. It worked technically, but I found it really hard to actually get adoption for it. A lot of people building websites in theory said, oh yeah, this is great, we need to test cross-browser, and in practice it was really difficult to get them to integrate it into their workflow and just get in the habit of using it and investing in writing the test scripts. So that ended up not really going anywhere.

Speaker 1:
[03:57] So it's like there wasn't a business to be done, or revenue to be generated in a meaningful sense?

Speaker 2:
[04:03] Yeah, well, there's at least one other, maybe two other companies from that same era that did manage to make a business. Sauce Labs is one that managed to actually succeed. But even for them, it was a pretty slow-running business, I think. It was not an easy business to be in.

Speaker 1:
[04:20] And for the startup, were you in the UK building it?

Speaker 2:
[04:24] I was in the UK at the time, yes.

Speaker 1:
[04:26] Was it bootstrapped? Did you raise some kind of funding? How big was the team? How can we imagine this?

Speaker 2:
[04:31] It was mostly bootstrapped. So I did a bunch of consulting in order to fund hiring some people and then hired some friends on the cheap to help contribute to actually building the product. So it was done all very cheaply. I had a very small amount of angel money in there, but mostly bootstrapped.

Speaker 1:
[04:51] And then when you decided to not go forward with this, how did the next startup come? Rapportive, right?

Speaker 2:
[04:57] Yeah, the second one was Rapportive. That went a lot better. So that was putting social media inside Gmail, basically. The idea was that if you get an email from someone you don't know, we had a little browser extension which manipulated the Gmail web interface so that on the side, next to the email, we'd show you a summary social profile: a profile picture and job title pulled from LinkedIn, recent tweets pulled from Twitter, maybe recent Facebook posts or things like that. Just whatever we could find about that person, put together as a social summary next to the email. We started in 2010 or something like that. It then pretty quickly became quite popular, and on the back of that, we were able to raise some money from Y Combinator, which was still fairly young at the time.

Speaker 1:
[05:46] That was very young. You must have been one of the very early batches. Yeah.

Speaker 2:
[05:50] I can't remember exactly when they started, but it was certainly in the early years. I think Y Combinator had already built up quite a good reputation at the time, but it was still fairly small.

Speaker 1:
[06:00] Then as part of Y Combinator, did you have to fly from the UK to San Francisco to attend that 10-week program, if I remember right?

Speaker 2:
[06:08] Exactly. Yes. We initially came for the three months or whatever it was of the Y Combinator, but then we were able to get US work visas for ourselves and set up permanently in San Francisco.

Speaker 1:
[06:24] How was that shift from the UK, where you spent your time going to university and doing your first startup, to coming to San Francisco?

Speaker 2:
[06:31] It was very exciting because it felt like going to the center of where it was all happening, really. At the start of that, we knew hardly anybody, maybe one or two people in the entire Bay Area, but we contacted them and they introduced us to more people, and they introduced us to more people. So we were able to pretty quickly build up a network. That's something that I really appreciated, that it was actually so open to outsiders like us, who could basically turn up with an idea and an early-stage startup, and we managed to raise some money and become somewhat established in the Bay Area.

Speaker 1:
[07:09] Can you tell me how the company grew, and at what point did the LinkedIn acquisition offer come? How can we imagine it? You were a founder of this company.

Speaker 2:
[07:18] It was about 2012 when we sold it, and we were five people at the time. So it was all still pretty small, not vast amounts of money involved, but it was a success, I would say, for everybody involved. The acquisition process itself was fine. As always with these kinds of transactions, there were twists and turns and moments where we thought it would all fall apart. We were almost running out of money and hadn't really succeeded in raising another round, so we had to sell or shut down. So we were under quite a bit of pressure. We couldn't reduce our own salaries, because to do so would have violated the conditions of our visas. So we were in a slightly stuck situation. Given our lack of leverage in that situation, actually, I'm pretty happy with how it all turned out.

Speaker 1:
[08:04] Yeah, it's nice that, 10-plus years on, we can talk about this honestly, because oftentimes you see an acquisition by LinkedIn, and of course, you might ask the founders and they would say, this was our dream or our goal, or we will do so many things together. But something you don't often hear is that there was pressure involved as well. So did you go into this wanting to sell the company? Because you saw that things were getting tight, that either you need to raise a new round or you sell to someone, and then you found LinkedIn to be the only or the best option to go with?

Speaker 2:
[08:36] We tried a little bit to see what revenue-generating options we had and hadn't really managed to make that work. So we were just burning money, and our user growth was okay, but not really enough to go and raise a big round. So we were a little bit stuck there, and selling the company seemed like the least bad option in a way. And I'm pretty happy with how it turned out, because LinkedIn was great, actually. They were very good to us. They allowed us to operate as essentially an independent team within the company.

Speaker 1:
[09:09] So your team stayed together?

Speaker 2:
[09:10] Our team stayed together. We continued working on the product that we wanted to make.

Speaker 1:
[09:15] You got to keep working on Rapportive?

Speaker 2:
[09:18] Yes. Well, actually, Rapportive, the Gmail browser extension, got put on life support, but we were working on a new product at the time, which did eventually get released under the name LinkedIn Intro. It got a slightly weird reception at the time, and it ended up getting shut down shortly after we released it. There's a longer backstory there, but I'm still really happy with LinkedIn, how they gave us the freedom to do this and allowed us to launch this product. Even though it didn't succeed, they were very good to us throughout that process. Then after that got shut down, our team got disbanded. But we had a good run within LinkedIn building this product.

Speaker 1:
[09:56] What tech stack did you work with at the time? What did you use?

Speaker 2:
[10:00] Rapportive was fairly unexciting. It was a Rails app with a Postgres database basically, and some Redis and similar things mixed in. So actually, nothing particularly revolutionary. We essentially built a graph database on top of Postgres, so there was a little bit of technical interest in there, but nothing particularly outrageous.
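
As a rough sketch of that "graph database on top of a relational database" idea (a toy in Python, using the built-in sqlite3 as a stand-in for Postgres; the schema and data are illustrative, not Rapportive's actual design):

import sqlite3

# Vertices and edges live in a plain relational table; graph traversals
# become queries over the edges table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edges (src TEXT, label TEXT, dst TEXT)")
db.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("[email protected]", "has_profile", "linkedin:martin"),
    ("[email protected]", "has_profile", "twitter:martinkl"),
])

# "Which profiles do we know for this email address?" is a simple lookup:
profiles = db.execute(
    "SELECT label, dst FROM edges WHERE src = ?", ("[email protected]",)
).fetchall()
print(profiles)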

Speaker 1:
[10:21] Then after LinkedIn Intro, you still worked inside LinkedIn. As I understand, you worked on data infrastructure, right?

Speaker 2:
[10:29] Yes, data infrastructure. After our team got disbanded, I switched over to the stream processing team. Kafka had just been developed at LinkedIn and had just been open-sourced at the time.

Speaker 1:
[10:41] Oh, it was just being open-sourced.

Speaker 2:
[10:42] Yeah. I think it had just been open-sourced, and then I got to work on Samza, which was a stream processing framework on top of Kafka.

Speaker 1:
[10:50] I always wanted to ask this question, so here it comes. Why did LinkedIn build Kafka? Now that it's such a foundational technology, I was always curious why a company felt the necessity to build this thing that seems pretty generic, something it seems everyone would have needed.

Speaker 2:
[11:08] Yes. I think Jay Kreps has a pretty good blog post from that era called The Log, where he explains his motivation behind Kafka and why make it an append-only log rather than a traditional message queue or something like that. I think the motivation was really about data integration, because there were a whole bunch of databases and event-generating systems, like activity events from users, for example. They were all generating data in a stream shape, and there were a bunch of downstream systems that wanted to consume this: wanted to get it into the data warehouse, and wanted to get it into the Hadoop cluster at the time in order to run machine learning and things over it. There was just this data integration problem of, how do you physically get the data out of one system and into another? Jay designed Kafka as this integration point, essentially, almost a lowest common denominator, but still a general-purpose abstraction for integrating various data sources with downstream data sinks.
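
To make the append-only log idea concrete, here is a toy Python sketch of the concept Jay Kreps describes in The Log. It is an illustration of the idea only, not Kafka's actual implementation or API:

# Producers append records to a totally ordered log; each downstream
# consumer tracks only its own read offset, so many systems can consume
# the same stream independently and at their own pace.

class Log:
    def __init__(self):
        self.records = []                   # the append-only log

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1        # offset of the new record

    def read(self, offset):
        # Returns all records at >= offset; a real log would also
        # handle retention, segments, and replication.
        return self.records[offset:]

log = Log()
log.append({"user": 42, "event": "page_view"})
log.append({"user": 7, "event": "click"})

# Two independent consumers, e.g. a warehouse loader and a Hadoop job,
# each remember only how far they have read.
warehouse_offset = 0
for record in log.read(warehouse_offset):
    pass                                    # load into the warehouse here
warehouse_offset = len(log.records)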

Speaker 1:
[12:16] Working at LinkedIn on Kafka, and at LinkedIn scale, what did you learn or what surprised you about working at this type of scale? As I understand, this was the first time that you worked hands-on with a really large system, right?

Speaker 2:
[12:29] That's right, yes, because previously, the biggest company I had worked in was Rapportive with five people. We had a sizable database, but it was still a single-instance database, and not really that big in the grand scheme of things. Then suddenly, I was at LinkedIn, and I got to use their big Hadoop cluster. That was fun, hand-coding MapReduce jobs in Java at the time. I learned a huge amount there, especially when the stream processing ideas came up, and Jay was evangelizing the use of Kafka and the things you could do with it. That was a revelation for me, really, where I suddenly felt, this makes sense. I started to understand how these various data systems fit together, what they have in common, what the fundamental principles are, and that experience then fed directly into the writing of the book.

Speaker 1:
[13:18] At what point did you decide to leave LinkedIn? Looking through your career: start out in the UK, do a startup, do a second startup through Y Combinator, move to San Francisco, get acquired by LinkedIn. The arc that most people would draw would be, okay, do something more in Silicon Valley or maybe start another startup. And instead, you decided to leave LinkedIn.

Speaker 2:
[13:40] Yeah. So first, I decided to move back to the UK actually, and I continued working for LinkedIn remotely.

Speaker 1:
[13:44] Okay.

Speaker 2:
[13:45] That was mostly because my girlfriend at the time, now wife, was still in the UK, and a long-distance relationship is not a lot of fun. And I didn't feel that at home in the Bay Area, so I wasn't really encouraging her to move to the Bay Area either. I thought it was better for me to go back to Europe, and I'm very happy with that decision. I still have a lot of great friends in the Bay Area. I love it as a place to visit, but I wouldn't want to live here, honestly. Then I was still remotely working for LinkedIn, and that worked all right for a while. When I then started writing the book, LinkedIn even gave me 50% of my time free to work on my book alongside my software engineering duties, which was really great.

Speaker 1:
[14:25] Amazing. Yeah, I just saw most of them.

Speaker 2:
[14:27] Absolutely. And they didn't have to do that, and LinkedIn didn't directly get anything out of it other than a book that they could use for internal training purposes.

Speaker 1:
[14:37] Well, shout out to LinkedIn for this.

Speaker 2:
[14:39] Yeah, absolutely. Though I did find that actually trying to write a book in parallel with doing a software engineering job and being on call, et cetera, I just wasn't able to do it. It's just too much context switching, and it's very easy for the urgent things from the on-call to dominate and then not to have the freedom that you need in order to write something new. So after a while, I decided, okay, it's probably better if I focus full-time on the book. I then left LinkedIn and just took a sabbatical, an unpaid sabbatical, i.e. unemployment, to focus full-time on the book for a while. And it's only after that that I even considered getting into academia.

Speaker 1:
[15:22] So how did the idea of the book come about? What was the point where you decided you would write it? And in your mind, what were you deciding to write? Was it already this book with this layout, or did you have an earlier idea back then?

Speaker 2:
[15:36] I had an idea that, of course, ended up looking somewhat different in the final product, but the overall goal, I think, stayed the same. So I knew I wanted to write something that was a broad conceptual overview. Not about how you use any one specific system or tool, but comparing the trade-offs between many different types of tools. And I knew that I wanted it to be practitioner-focused: not a theoretical textbook, but something that people could use to build real systems. That was basically the goal with which I approached it. And this was exactly the book that I wish I had had when I was starting out and working at Rapportive, for example, because we were searching around in the dark when we were having performance problems with our database, and we had no idea what to do, basically, because we were totally lacking the foundations to actually understand what was going on and how to diagnose the issues. I felt that if I had had a bit more background on how these data systems actually work internally, then I could have had an intuition about how to debug these kinds of performance issues. And then after a while, after I'd learned more about how data systems work, I thought, well, okay, it's time to write this down so that others don't have to learn it the hard way, but can hopefully just get a better idea of how these systems work and thus be better at managing their own data systems.

Speaker 1:
[17:01] To start with, how did you learn about, for example, how databases work? Because again, from your story at Rapportive, you built systems, you had some performance issues, at a smaller scale, to be fair, compared to LinkedIn. Then you worked at LinkedIn and you saw a little bit of how the sausage was made. But I know a lot of software engineers who have been on this path and they still don't really know how the fundamental systems work. They just know, okay, we have a platform team inside our company and they build it. I could read the RFCs, but it's a lot of work, or the planning docs; I could look at the source code. It feels to me that even at that point, you just went down and tried to dig in. What resources did you use? How did you find out those basics which you later put into the book?

Speaker 2:
[17:39] A lot of it was just being curious and talking to people, actually, and asking them lots of questions. At LinkedIn, there were a bunch of senior data systems engineers who understood this stuff very well, but hadn't necessarily written it down. So I just talked to a bunch of them and quizzed them, and that way started building an image in my own mind of how this stuff works. Then once I got the basics from these conversations, I was able to go and read research papers, for example, which go into much more detail of exactly how and why things are designed in a certain way. But it is time-consuming to read those things. So what I tried to do was pull out what are really the essential ideas. I just read a ton of blog posts as well. And the reason why you see so many references at the end of each chapter in the book is that that is actually the material that I myself used in order to understand what was going on. And then I thought, well, okay, if I found these things useful, then I'll also cite them in the book for any reader who wants to go beyond the basics covered in the book. Here are some good sources for further reading.

Speaker 1:
[18:55] The structure of the book, this first edition at least, is Foundations of Data Systems, Distributed Data, and Derived Data, if I understood the three big parts correctly. Did you already have a structure in mind when you started writing the book, or did it take shape as you went?

Speaker 2:
[19:07] This three-part structure is not that critical in the design of the book, really. That's more after the fact: I thought, well, it seems like we can group the chapters into roughly this sort of structure. But the topics of the chapters were more or less what I had envisaged. So I knew that I wanted to talk about what a transaction actually is. I knew that I wanted to talk about replication. I knew that I wanted to talk about sharding or partitioning. I knew that I wanted to talk about consistency and consensus. Those high-level topics, I think, were clear from my initial book proposal to the publisher. The details within each chapter are something that I often figured out once I got to that chapter. So I wrote one chapter at a time and started each chapter with just a lot of background research to actually get up to speed on the topic myself. It's often only then that, say for replication, I decided, okay, it seems like the three major ways of doing this are single-leader, multi-leader, or leaderless. I would decide on that structure essentially when I started writing each chapter and then try to fit the various points I wanted to make into this narrative structure.
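
As a rough illustration of the first of those three approaches, single-leader replication, here is a minimal Python sketch (a toy for intuition, not code from the book): all writes go through one leader, which forwards them to followers, and reads from a follower may lag until replication catches up.

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value):
        self.apply(key, value)          # commit locally first
        for f in self.followers:        # replicate to followers (shown
            f.apply(key, value)         # synchronously; real systems are
                                        # often asynchronous, hence stale reads)

followers = [Replica(), Replica()]
leader = Leader(followers)
leader.write("user:42", "Martin")
assert followers[0].data["user:42"] == "Martin"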

Speaker 1:
[20:25] As a fellow author who also wrote a book, one thing I've noticed is that there are parallels between estimating a book and estimating a software project, in that you come in with an estimate and, if you've never done it before, you tend to be wildly off. How was this in your journey? In addition, you also had a publisher, and publishers are a little bit like project managers. They like to have a schedule, they like to try to keep you on track, they like to ask, when is it done? How did you manage that part as well? In the end, how long did you estimate it would take when you started, and how long did it actually take?

Speaker 2:
[20:58] As always, it takes vastly longer than expected. It's the same for software projects as it is for writing, I think. I think it took me about four years to write the first edition. That was not four years full-time, maybe two and a half years of full-time equivalent or something like that, but written over the course of about four years. It definitely took a long time. The publisher deadline, I missed by a ludicrous margin. I think I missed it by about two and a half years or something like that. But fortunately, O'Reilly were pretty laid back with the first edition and were happy for me to just take my time and make it good. When it came to the second edition, O'Reilly got a bit more aggressive and pushy about sticking to deadlines. I guess by that point, the book had become established and people were waiting eagerly for the second edition. So I understand the desire to want to accelerate it. But at the same time, I really appreciated the freedom that I had for the first edition to work on my own schedule. I had a bit less of that with the second.

Speaker 1:
[22:09] The tagline for the first edition, which I believe is the same for the second edition: the big ideas behind reliable, scalable, and maintainable systems. Reliable, scalable, and maintainable. What do these adjectives mean to you?

Speaker 2:
[22:23] They're all slightly vaguely defined. There's not a formal definition of those things. But for me, reliability means fault tolerance primarily. Meaning that a system should, on the whole, continue working even if a network link is interrupted, or a node crashes, or something like that. So a lot of the book is about techniques that support fault tolerance, like replication, for example. That's reliability. Scalability is one of those terms that gets thrown around a lot, and it's fashionable and cool to make things scalable, because it suggests success and millions of users, so of course everyone wants things to be scalable, because everyone wants success. For this book, I tried to take a more dispassionate approach and said scalability is just about what mechanisms we have for dealing with changes in load. If load increases, how can we add computing capacity to a system, for example, so that the system still continues working? And the techniques that you use to achieve scalability, well, they are things like sharding, for example.

Speaker 1:
[23:32] But in this case, scalability, your definition, do I understand that you're mostly referring to horizontal scalability, so scaling out, rather than scaling a single machine up or down, pretty much?

Speaker 2:
[23:42] Yeah, I guess because that's the more interesting one. Like, yes, you can always buy a bigger machine.

Speaker 1:
[23:46] And what's interesting about that?

Speaker 2:
[23:48] And exactly, there's just not that much to be said about it. I mean, there are details of how you scale even on a single machine, but I think part of what has become interesting about modern cloud services, and backend services in general, is how they've introduced this idea of horizontal scalability and shared-nothing systems. So we can build systems that are able to cope with very high load, even if the individual components are just fairly cheap commodity machines. But maybe part of the scalability story, which I wasn't thinking about as much at the time but started thinking about more recently, is not just scaling up, but scaling down as well. So actually, how do you run a service in such a way that if it has a very small amount of load, it's really cheap to run? That's in a way the same question as how do you continue running a service if it has very high load. Generally, you just want the cost and the computing capacity to be roughly proportional to the load that you have. At the low end, that means actually being able to scale down to something that is extremely cheap to run. That's not necessarily a given. That's something that is hard with on-premises software, for example, because if you've got a physical machine, that's a unit of deployment. Yes, you could carve it up into two dozen virtual machines and make those small virtual machines, but it still requires some resource allocation. Part of what's interesting about some serverless systems, for example, is actually their ability to scale down and say, okay, if you're going to handle just three requests per day, that's just fine as well.

Speaker 1:
[25:31] Can you tell me about the second edition? When did the idea come about?

Speaker 2:
[25:35] Yeah, it had been clear for a couple of years that a second edition was needed, just because the first edition was getting a bit dated. There were changes in technology that just hadn't been reflected in the first edition. So I wanted to update it, but I now have an academic job. Research and teaching is my main thing, and updating the book is just a sideline in some sense. So it actually took quite a while to make progress with it, because I was always doing it alongside other projects, essentially back to that context-switching problem that I had while writing the first edition, but now with an academic job that I didn't want to just drop, because I actually quite enjoy it. Initially, I made very slow progress with the second edition, and I also realized that I had slightly lost touch with current industry practices, because I'd switched over to the academic side. I'd gone much deeper on the theory, but I was no longer up to speed on what people were doing with, say, data lakes or things like that. So at some point, I remembered Chris Riccomini, an old colleague from LinkedIn. I had worked with him on the stream processing stuff. Well, you might know him.

Speaker 1:
[26:49] He's the author of The Missing Readme.

Speaker 2:
[26:51] Exactly.

Speaker 1:
[26:52] Wow, what a small world.

Speaker 2:
[26:54] Yeah. I had read Chris's book, The Missing Readme, and thought, oh, he's a great writer. I had worked with him as a software engineer and found him a great colleague, and he had also been writing this newsletter called Materialized View, on the latest trends in data systems essentially, and had become a startup investor in that space. At some point, I thought, well, actually, I have to get in touch with Chris and ask him whether he wants to help out with the second edition. He was keen to do that, and it turned into such a good collaboration, because he was up to date on what the cutting edge was in terms of technology in industry. I had strong opinions on how to teach, essentially, how to explain things in the book, making sure that we were explaining everything in a way that was very precise, with very carefully chosen words, but at the same time very accessible, so that it's hopefully easy to read. We took essentially my writing style plus Chris's knowledge of the latest industry trends to bring the book up to date. That was a great collaboration.

Speaker 1:
[28:01] What are the big things that you added? Which ones of these you knew would be missing, and which ones did you realize during the writing process that, okay, this needs to be in here now?

Speaker 2:
[28:10] Yeah, so the thing we knew from the start that we wanted to reflect was cloud-native systems architecture. It's a bit of a vague term, but what I mean by that is essentially building data systems on top of cloud services as the foundational abstraction. In the first edition, the assumption was basically that you have some machines, each machine has some local disks, you can run a database instance on a machine, and it will write its data to the local disk. If you want to replicate it to another machine, then the database software will replicate it at the database level to another machine, which will also write the data to its local disks. For a long time, that was exactly the way computers worked. Now suddenly, people are building databases on top of object stores, for example, and the replication happens at the object store level, no longer at the database level, or maybe there's still some replication at the database level, but it really changes the nature of things if you're building on top of an object store. This is different from, say, building on top of a virtual block device like EBS, because these block devices, although they are cloud services, still offer the single-node operating system abstraction of a block device on top of which you run a file system, whereas an object store is just a brand new abstraction. It looks different from a file system, it behaves differently. Building on top of that as a foundational abstraction is something that people were starting to do at the time of the first edition, but since then, it has really taken off; a whole lot of systems have been built in that style now. So that's an idea that we really wanted to incorporate, and we weave it in throughout the book. It's not just one section; it's an idea that we've integrated throughout the entire narrative.
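
To make that difference in abstractions concrete, here is a toy Python sketch contrasting block-device-style file access with object-store-style whole-object PUT/GET. The interface names are illustrative, not any vendor's actual API:

# Block-device-style access: a file system lets you seek and overwrite
# bytes in place at arbitrary offsets.
with open("segment.db", "wb+") as f:
    f.write(b"hello world")
    f.seek(6)
    f.write(b"there")                    # overwrite 5 bytes in the middle

# Object-store-style access: typically you can only PUT/GET whole
# objects by key; to "modify" an object you write a new version of it.
class ObjectStore:
    def __init__(self):
        self.objects = {}

    def put(self, key, data: bytes):
        self.objects[key] = data         # replaces the whole object

    def get(self, key) -> bytes:
        return self.objects[key]

store = ObjectStore()
store.put("segments/0001.sst", b"hello world")
store.put("segments/0001.sst", b"hello there")   # whole-object replacement

This is part of why databases built on object stores tend to favor immutable files, such as LSM-style segments, rather than update-in-place structures.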

Speaker 1:
[29:59] There are now a lot of managed services as well. The primitives definitely get used, but there are also so many managed services that all the cloud providers offer. And a lot of engineers often just use the managed services as-is, because they take care of replication, or they have SLAs for uptime and so on. But when you build on top of these things, you kind of use those as primitives as well. Is there any risk as a software engineer that you're no longer incentivized to understand the underlying layer? Or are we building better systems because of that? How do you think about this? It feels like there's a move up in abstraction because of the cloud, right?

Speaker 2:
[30:35] Yeah, it's definitely a shift to different and higher-level abstractions. But you know, that's been the story of the entire computing industry since the start: building new abstractions. So it is true that if you rely on a higher-level abstraction, you're no longer thinking about the lower-level details. If you're using a programming language with a garbage collector, you're no longer thinking about memory allocation. And is that a loss? Well, maybe. If you're building low-level systems, you do still have to care about memory allocation. If you're building higher-level business logic, actually, I think it's just fine for people not to care about memory management. So I think there's an analogous thing here with data systems: if you're building the higher-level systems, you don't need to particularly care about the underlying infrastructure, and that's fine. Just use the higher-level abstractions. Nothing wrong with that. But somebody still has to build those lower-level abstractions from lower-level components; somebody's got to implement the cloud services.

Speaker 1:
[31:36] Martin talks about trade-offs that come with using cloud services. And this is a good time to talk about our season sponsor, WorkOS. If you've read Designing Data-Intensive Applications, you know that building at that scale is all about trade-offs. But one thing isn't a trade-off, and that's enterprise features. The moment you land bigger customers, you need SSO, directory sync, RBAC, audit logs, all the things they expect out of the box. Building that yourself can take months. WorkOS gives you APIs to ship it in days, so you can stay focused on your core product. That's why companies like OpenAI and Anthropic run on WorkOS. Visit workos.com to learn more. I'd also like to mention our presenting sponsor, Statsig. Statsig built a unified platform that enables both experimentation and continuous shipping. Built-in experimentation means that every rollout automatically becomes a learning opportunity, with proper statistical analysis showing you exactly how features impact your metrics. Feature flags let you ship continuously with confidence. And because it's all in one platform with the same product data, teams across your organization can collaborate and make data-driven decisions. To learn more, head to statsig.com/pragmatic. With this, let's get back to Martin and the trade-offs that come with using cloud services.

Speaker 2:
[32:46] And so those people will have to then specialize even more in actually the details of how you engineer those cloud services, how you make them reliable, how you operate them and so on. The skills are still there, it's just a bit of specialization happening. That some people can worry about the higher level things without having to concern themselves with the lower level things. Some people focus on the lower level things and treat the higher level aspects as their customers.

Speaker 1:
[33:10] Interesting. It sounds to me that if you're an engineer who is utilizing a lot of these services, you might not need to know how they exactly work.

Speaker 2:
[33:18] Yes, and I would say the underlying philosophy of the entire book is to give people insights into the essence of how the systems work internally, so that if, for example, they start seeing weird performance behavior, they can have a bit of intuition for why it's doing that and how they might solve it. So, for example, the storage engine chapter tells you how B-trees work and how log-structured LSM-tree storage engines work. The book is not intended for people who are going to actually build their own databases and implement their own storage engines; if you want to do that, you have to go into much greater depth than this book covers. But the idea is that as an app developer, if you know just a little bit about how the storage engine works internally, you'll be in a much better place to use it in a way that gives you good performance, for example, and to diagnose any issues. That philosophy we've kept also in the context of cloud services: yes, a cloud service hides some of the operational details that app developers don't need to think about anymore, but they should still know a bit about how those services work internally, just so that they can use them effectively.
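
For a taste of the LSM-tree idea from that storage engine chapter, here is a massively simplified sketch in Python (a toy for intuition only: no write-ahead log, compaction, sparse indexes, or tombstones): writes go to an in-memory memtable, which is flushed to immutable sorted segments; reads check the memtable first, then segments from newest to oldest.

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []                 # sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: write an immutable, sorted segment ("SSTable").
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment wins
            for k, v in segment:                  # real SSTables use binary
                if k == key:                      # search plus an index
                    return v
        return None

db = ToyLSM()
db.put("a", 1); db.put("b", 2); db.put("a", 3)
assert db.get("a") == 3                    # newer value shadows the old one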

Speaker 1:
[34:24] I guess it also helps with arguing about the trade-offs, deciding on which service to use, which characteristics to look out for for your use case, right?

Speaker 2:
[34:32] Exactly, and you know, there are huge differences: say, if you're doing analytics, whether you're using row-oriented storage or column-oriented storage. That's a bit of a technical distinction, and it takes a little bit of background reading to even understand what that means, but it has massive performance implications in terms of the final behavior of the system. And those are the places where I feel like knowing a bit about the internals is actually like a superpower.
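
A quick sketch of that row-versus-column distinction (toy Python, illustrative only): summing one column over row-oriented data touches every field of every row, while a columnar layout stores each column contiguously, so an analytic scan reads only the data it needs (and typically compresses far better).

rows = [                                   # row-oriented: one record per row
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.0},
]

columns = {                                # column-oriented: one array per column
    "order_id": [1, 2],
    "customer": ["a", "b"],
    "amount":   [10.0, 25.0],
}

total_row_oriented = sum(r["amount"] for r in rows)   # scans whole rows
total_columnar     = sum(columns["amount"])           # scans one array
assert total_row_oriented == total_columnar == 35.0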

Speaker 1:
[35:00] And I guess as engineers, the one thing that we always need to argue about, or should argue about, is, at the very least, cost versus performance. And by performance, I mean latency to the user. And of course, resilience: if something happens, like a machine goes down, a zone goes down, a region goes down, how is our product affected and what's acceptable?

Speaker 2:
[35:22] The basic idea there seems to be how much availability risk you are willing to take on, versus the overheads in terms of the system itself, like the computational overheads, but also the human overheads of actually designing and operating the system, and the cost overhead. Yeah, exactly. And so yes, you can have a system that is more able to tolerate various types of faults, but which is more expensive to design and operate, versus a simpler system that might go down a bit more often, but which is cheaper. And there's no right and wrong with that. Everyone needs to figure out where they sit in that trade-off space themselves. And I would say that multi-region is pushing in the direction of higher availability, because it means you could tolerate the outage of an entire region, but then it has implications for the consistency model that you can get across different regions, for example. So that's a trade-off that the book tries to make very explicit, to help people reason through what the right choice is for them. In terms of multi-cloud, for example, one thing that I've been concerned about just in the last months, really, is European dependence on US cloud services.

Speaker 1:
[36:38] Yes.

Speaker 2:
[36:39] So what if geopolitics were to go horribly wrong and tensions escalate, and Europe finds itself suddenly locked out of US cloud services? I hope that doesn't happen. I still think it's fairly unlikely, but it's no longer unthinkable, and as a result, coming from this European perspective, I have been thinking a fair bit about how we can engineer systems to be resilient against that sort of thing. That's not just a regional outage; it's a business risk, essentially, and a multi-cloud setup could help mitigate against that sort of risk: if one company locks you out, then you could still have systems at another company. Again, that's very much towards the expensive but high-availability, risk-reduction end of the spectrum. But for the people who have really critical workloads, where they think this geopolitical risk is significant enough, I think it's seriously worth considering that kind of setup.

Speaker 1:
[37:41] I'm thinking that, as engineers, we do have the responsibility, because who else will do this?

Speaker 2:
[37:46] Yes, totally. And I agree with you as well that understanding what the risks are and communicating what the trade-offs are is going to be a core part of our role as engineers moving forward. Maybe as AI writes more and more of our code, it's less about the details of how you express logic in a particular programming language, and much more about those kinds of high-level trade-offs.

Speaker 1:
[38:10] How has the definition of scale changed in this book? Because, as we talked about with the cloud: before the cloud, building a scalable system sounded pretty involved, because building a horizontally scalable system is complicated. You detail a lot of the pieces you need to put in place in the first book. With the cloud, a lot of the services actually define how they allow horizontal scaling and what the trade-offs are. Do you feel that it's become a lot easier to reason about scale and scalability when you are using these primitives?

Speaker 2:
[38:42] I think achieving really high scale is still challenging, because even though we have cloud services like object storage, for example, which provide this very elastic storage model, at least you don't have to worry about capacity planning on your disks anymore and running out of disk space, because those kinds of operational things are taken care of. But if you need sharding, for example, that's something that actually does reflect on the application code as well. You can't really make that entirely transparent. And at a sufficiently large scale, sharding is required, because a single machine is not powerful enough to process your workload. So I think even with cloud systems, you still have to do quite a bit of engineering thinking about how to realize that. Where I think the cloud has helped quite a bit is actually at the lower end, scaling down. If you want to have a very lightweight service that processes only a small number of requests, what we've got with serverless systems, being able to very quickly spin up and spin down an instance, very lightweight, that's quite a good innovation that has enabled those very low-scale services. That's something that would be much harder to do without cloud services, because you would have to statically allocate a certain amount of memory and certain CPU resources to a particular virtual machine.
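
A minimal sketch of why sharding ends up reflected in application code, as Martin describes (illustrative Python; hash-mod routing is the simplest scheme, while real systems often use hash ranges or consistent hashing so shards can be added without reshuffling every key):

import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # stand-ins for four databases

def shard_for(key: str) -> int:
    # md5 gives a hash that is stable across processes, unlike Python's
    # built-in hash(), which is randomized per run.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value        # route the write by key

def get(key):
    return shards[shard_for(key)].get(key)     # route the read the same way

put("user:42", {"name": "Martin"})
assert get("user:42") == {"name": "Martin"}

# A query that doesn't include the shard key has to fan out to every
# shard and merge the results (scatter-gather):
all_users = [v for s in shards for v in s.values()]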

Speaker 1:
[40:02] I love serverless. I have a small website that runs on serverless, and my bill is like 13 cents per month, because it has very little load.

Speaker 2:
[40:11] Absolutely. It's just making more efficient use of computational resources.

Speaker 1:
[40:14] Let's talk about sharding. When you wrote the first book, when I was working at Uber, we talked a lot about sharding; there were a lot of internal implementations, and interviews involved asking about sharding, because we were designing systems that were sharded. I did sense that over time, as cloud systems became available that give you turnkey solutions and act more like platforms, you send the data and they take care of these things, fewer engineers have to actually implement sharding. With cloud-native systems, in your research, what have you seen? What are the cases where putting sharding in place is still important, and what are the places where it might have just disappeared as a concern? It's still nice to know, but you might not have to implement it.

Speaker 2:
[40:58] I think it's probably less an effect of the cloud and more of hardware just getting more powerful: a big machine nowadays can do a lot. That means more and more workloads you can just run on a single machine, and that is sufficient to achieve quite significant scale already. There are still concerns about how you actually make efficient use of the hundreds of CPU cores that you have on a single machine. So parallelism is still a required thing to think about there, and sharding is one way of achieving parallelism. But at least this sort of sharding across multiple machines has maybe become less of a pressing issue, just because more and more workloads can run on a single machine. Some people still have very large-scale workloads that do have to be sharded across multiple machines, so it's not going away entirely. And replication is still relevant even at smaller scales, because that's for fault tolerance, not for scalability.

Speaker 1:
[41:57] You have a chapter called The Trouble with Distributed Systems, which goes through a lot of things that can go wrong. Without going through the whole chapter, can you recall some of the things that are memorable to you, or some of the things that you feel are important to remember?

Speaker 2:
[42:13] Yeah, the whole idea of this chapter is that in distributed systems theory, there are certain things that we tend to assume. For example, we just assume that there's no upper bound on how long it might take for a message to go over the network. You send a message, it might arrive within 100 microseconds, or it might take 10 years, and distributed systems theory just doesn't make any assumptions about that sort of timing if we can avoid it. Or rather, some theory does make those assumptions, but it's a dangerous assumption to make, because occasionally the network delay does become much higher than what is typical. Another thing is about crashes. Distributed systems theory just says nodes can crash, but what does that actually mean? What in practice does it mean for a node to become unavailable? It might be a software crash, but it might be a hardware failure, it might be somebody unplugging the power cable. It might be that the node is actually still running, but has just become disconnected from the network. The point of this book chapter really is to defend and justify those theoretical models that we use for analyzing distributed systems, and to give a lot of stories and case studies that show that actually tons of stuff does go wrong. Don't believe anyone who says, oh, failures are rare, don't worry about it, it's fine. The moral of this chapter is really that if you want to make things reliable, you really do have to worry about a whole bunch of weird, unusual, but certainly possible edge cases. Timing is another one of those things. It's very easy to assume that your clocks are correct, and most of the time, the clocks are pretty correct, but we just can't rely on it, because they're just not precise enough on the whole. So a lot of it is about how it's very tempting to assume that things are well-behaved, and in distributed systems, we just have to get away from those assumptions if we want the systems to work reliably even in the face of things going wrong. But it was a really fun chapter to write, because it's essentially a big collection of stuff that has gone wrong. I went through a bunch of post-mortems published by various tech companies, for example, in order to see, okay, what was the root cause of how things went wrong, and what kind of lessons can we draw from this that apply more generally. And there's some fun stuff, like the sharks biting undersea cables and damaging them. That just makes for a great story. And then I hear that in recent years, the shielding of undersea cables has gotten better, and therefore sharks are not biting them anymore. But instead, the cows on land are stepping on cables and occasionally causing network interruptions that way. That sort of thing just makes it a bit more fun.
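
One small sketch of the kind of ambiguity Martin describes (toy Python, illustrative only): when a request times out, the sender cannot tell whether the request was lost, the remote node crashed, or the reply was lost after the write succeeded. Retrying blindly can then apply the write twice, which is one reason idempotent operations matter.

import random

server_state = {}

def unreliable_call(key, value):
    if random.random() < 0.2:
        raise TimeoutError("request lost?")    # write did NOT happen
    server_state[key] = value                  # write happened...
    if random.random() < 0.2:
        raise TimeoutError("reply lost?")      # ...but the caller can't know

def idempotent_write(key, value, retries=5):
    for _ in range(retries):
        try:
            unreliable_call(key, value)
            return
        except TimeoutError:
            continue   # safe to retry only because repeating is a no-op here
    raise TimeoutError("gave up; outcome still unknown")

idempotent_write("balance:42", 100)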

Speaker 1:
[44:56] That chapter is so interesting also because, depending on what kind of teams you work on or what kind of people you talk with: when I talk with the S3 team, for them, that whole chapter is just their day-to-day. It's not a weird thing when a hard drive fails. Okay, it might be a weird thing to have a fire in a data center, but they're prepared for all of those things. They're at the scale where these things just happen on a regular cadence, because they're at one of the largest scales. Whereas at a smaller company, even if you read this chapter, you will treat it as, well, this could happen, but when it actually happens, it will be a once-in-10-years event and it will be a big deal.

Speaker 2:
[45:37] Yeah, but I think there's no right answer. It's a trade-off between risk and costs, broadly speaking. And that means a business decision has to be made in terms of where the business wants to lie on that trade-off. And so the goal of this chapter is really just to give people the information in order to make an educated decision, but I don't want to make that decision for people. That's for businesses themselves to decide.

Speaker 1:
[46:02] That's very clear. Have you come across some concepts or systems mentioned in the book, in the first edition and now in the second edition, that are becoming either more popular or less popular over time, more or less referenced by your readers? I'm thinking of things like streaming systems, batch processing or anything else.

Speaker 2:
[46:19] Yeah, so there are some things that we've been able to take out of the book compared to the first edition. In particular, for example, coverage of MapReduce was quite detailed in the first edition, but basically MapReduce is dead; nobody uses it anymore. Its successors, in the form of Spark and Flink, for example, are used, and so we still reference MapReduce in the second edition, but more as a learning tool to understand how these kinds of partitioned, sharded batch processing systems work. So that's one thing where we've been able to reduce the coverage. But other areas where we've increased the coverage are, for example, systems in support of AI. Even though this is not an AI book, there are still data systems concerns that arise when building systems to support AI applications; a classic one is vector indexes, for example. So we've added some coverage of vector indexes to the storage engine chapter. It fit in really well there, because that chapter already covers various different indexing strategies anyway, and a vector index is just another indexing strategy. We also added some coverage of data frames, for example. That's not an exclusively AI thing, but data frames are quite a good data representation for training data, for example. That was not one of the data models that we discussed in the first edition, but we decided to add it to the second edition, because it has actually become a very important data model that people are using alongside all of the classic data models like relational, graph, and JSON documents. So these are places where we've expanded the coverage a bit to reflect the kinds of systems people are building, for example to support AI, without changing the direction of the book entirely.
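
For intuition about what a vector index accelerates, here is a brute-force nearest-neighbor search by cosine similarity (toy Python with tiny made-up embeddings; indexes like HNSW exist precisely to approximate this linear scan in sublinear time over millions of vectors):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

embeddings = {                     # toy 3-d "embeddings"; real ones have
    "doc1": [0.9, 0.1, 0.0],       # hundreds or thousands of dimensions
    "doc2": [0.0, 1.0, 0.2],
    "doc3": [0.7, 0.3, 0.1],
}

query = [1.0, 0.0, 0.0]
# Exact search: compare the query against every stored vector.
best = max(embeddings, key=lambda k: cosine(query, embeddings[k]))
assert best == "doc1"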

Speaker 1:
[48:06] The final subsection of the last chapter in the first edition was titled Doing the Right Thing. And in the second edition, this has become its own chapter: the final chapter is Doing the Right Thing. I'll quote a little bit from it: "We, the engineers building these systems, have a responsibility to carefully consider those consequences and consciously decide what kind of world we want to live in." Can we talk a little bit about this section and the importance of it?

Speaker 2:
[48:30] Absolutely. Yeah. So the motivation for putting an ethics section in the first edition was that I just felt it had been quite an ignored concern during my time in industry. Especially in startups, people were very focused on building a product that their customers would love, and deprioritizing these sorts of ethical questions in the process. So, for example, with consumer-facing products, it might be that the products are very much geared towards essentially data harvesting, collecting behavioral data, because that's what can be monetized in the form of advertising. And there seemed to be very little reflection on what was good and bad about these sorts of things. So I really just wanted to encourage a bit of thinking there. Not really wanting to prescribe a particular approach, but at least to point out: there is such a thing as data protection legislation now, which we do have to think about in the architecture of our data systems. And there is an ethical responsibility. People say that you get into tech in order to change the world. If you want to change the world, then thinking about the impacts that your technologies have on the world is part of your job. It's a really essential part, really. And something that engineers are often prone to ignoring: we focus just on the technology and less on the effects that technology will have out in the real world. So this chapter is really just an attempt to get people thinking about it a bit. And it's a reflection of my own process as well, because as I started working on these systems, I didn't really think about ethical things particularly either. So I felt like I had to put that section in there for myself as well as for the readers, because it was my own way of grappling with these questions a bit.

Speaker 1:
[50:32] Is it fair to say that as engineers building these systems that will have a wide impact, potentially a society-wide impact, we are just in such a good position to directly influence and maybe even change course? So do I understand that this section is a bit of a reminder that by building these systems, we have a huge opportunity to shape them? We probably have much stronger voices, maybe as strong as the voice a regulator might have years down the road, right?

Speaker 2:
[51:02] Exactly. I think engineers have a very strong voice there. And like we talked about earlier, engineers need to articulate trade-offs in such a way that business leaders can then make educated decisions about how to address those trade-offs. And part of articulating those trade-offs is pointing out risks. Risks include not just technical risks, like the data might get corrupted, but societal risks as well: what negative effects, what harms might arise from this technology, what sort of unintended consequences, or what risk of reputational damage. If it turns out that a technology has some harmful effects, that can reflect badly on the company that made it. And that has to be part of the trade-off discussion. I just want people to make intentional and deliberate decisions about those kinds of things and not just sweep them under the carpet.

Speaker 1:
[51:57] One of the hot topics these days is, of course, AI. And you've written a very interesting post about this just in December, about formal verification and your conviction that formal verification might become more important with AI. For those of us engineers who have only heard of formal verification, can we talk about what this is and how you envision it becoming more important?

Speaker 2:
[52:19] Yeah, so there's a whole range of formal methods. One approach is to, for example, use a specification language like FizzBee or TLA+ or something like that to describe the expected behavior of a system at a high level, and then use a model checker, which is essentially like a randomized test case generator, to just play through a lot of scenarios and see whether the system has those desired behaviors in all the different scenarios. That's the intro level of formal verification, I would say. The more advanced level is to use actual formal proof. In that case, you write a specification of some system in a formal language, usually using mathematical notation, and then make a mathematical proof that a certain algorithm or a certain implementation always satisfies that specification. The distinction from testing there is that in testing you just try a couple of examples: you give the algorithm some example inputs and check whether you get the expected output in those particular examples. A proof, on the other hand, can reason about potentially infinite state spaces. It can tell you things about every possible thing that could possibly happen in the entire universe, and show, for example, that a certain safety property always holds. Formal verification is a lot of work. I never used it in my time in industry because it's just too time-consuming, basically. I only got into formal verification when I was in academia and could afford to take the time to spend a few months proving an algorithm correct. But there I've started finding it very useful, especially when I was working on very subtle algorithms, where it's very hard to tell just from reading the implementation whether it actually is correct in all possible cases. If it's an important algorithm where, for example, it will corrupt data if there's a mistake in it, or it will have a security vulnerability if there's a mistake in it, when it's high stakes like that, then I feel it's worthwhile to have formal verification and to really make sure that the code is correct. So I've done some formal proofs using the Isabelle proof assistant, for example; there are a couple of others as well, like Rocq and Lean and so on. These proofs are really hard to write. It takes a long time to learn the language of writing those proofs, and then even once you know the language, it's just really laborious to actually write the individual proof steps.
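
To make the model-checking idea concrete, here is a toy, hand-rolled sketch in Python of the core move a model checker performs: exhaustively exploring every interleaving of a tiny two-process locking protocol and checking a safety property in each reachable state. The protocol and all names are invented for this example; real tools like TLA+ or FizzBee do this from a specification language, far more efficiently.

```python
from collections import deque

# Each process is 'idle', 'want', or 'critical'; entering the critical
# section is modeled as a single atomic guarded step.
def successors(state):
    for i in (0, 1):
        pcs = list(state)
        other = pcs[1 - i]
        if pcs[i] == "idle":
            pcs[i] = "want"
        elif pcs[i] == "want" and other != "critical":
            pcs[i] = "critical"
        elif pcs[i] == "critical":
            pcs[i] = "idle"
        else:
            continue  # this process is blocked; no step to take
        yield tuple(pcs)

def mutual_exclusion(state):
    # Safety property: both processes are never critical at once.
    return state != ("critical", "critical")

# Exhaustively explore every reachable state across all interleavings.
frontier = deque([("idle", "idle")])
seen = set()
while frontier:
    state = frontier.popleft()
    if state in seen:
        continue
    seen.add(state)
    assert mutual_exclusion(state), f"property violated in {state}"
    frontier.extend(successors(state))

print(f"checked {len(seen)} reachable states; mutual exclusion holds")
```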

Speaker 1:
[54:51] And when you say it's hard to write, as someone who knows how to code in many different languages, can you explain what makes it hard to write? Does it feel like a strict programming language with all sorts of rules, or lots of math formulas? What makes it hard to learn and get good at?

Speaker 2:
[55:10] Yeah, so you're trying to make a proof that a certain piece of code always satisfies a certain property. In some cases, that property might be quite easy to specify. Let's say, as a really simple example, you have two lists and you want to concatenate them, and then you want to prove that the length of the concatenated list equals the sum of the lengths of the two individual lists. Very simple property. How would you prove something like this? Well, you would have a function that concatenates two lists, and then you would probably do a proof by induction over one of the lists. That shows that, okay, if you have one list of length i and another list of length 0, then the sum of the two is i. If you have a list of length i appended with a list of length 1, then it's i plus 1, and so on. By using a proof by induction, you can then show that the length of the concatenated list is i plus j, where i and j are the lengths of the two input lists, for every possible value of i and j. In tests, by contrast, you would maybe test the cases of j equals 0, j equals 1, and j equals 5, and then you're done.
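
That induction is small enough to carry out in a proof assistant. Here is a minimal sketch in Lean 4 of exactly this property (the theorem name is ours; Lean's standard library already provides this fact as List.length_append):

```lean
theorem length_of_append {α : Type} (xs ys : List α) :
    (xs ++ ys).length = xs.length + ys.length := by
  induction xs with
  | nil =>
      -- Base case: the empty list contributes length 0.
      simp
  | cons x xs ih =>
      -- Inductive step: prepending one element adds 1 to both sides,
      -- then the induction hypothesis and arithmetic close the goal.
      simp only [List.cons_append, List.length_cons, ih]
      omega
```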

Speaker 1:
[56:22] J equals int max.

Speaker 2:
[56:24] Yes.

Speaker 1:
[56:25] And the edge case. That's how I write my unit tests.

Speaker 2:
[56:28] Exactly. And this is a trivial example, list concatenation. You can easily just read the code and convince yourself that it's correct. But if it's a much more complex algorithm, then our brains just can't grok the algorithm well enough to really convince ourselves that it's correct without proving it. That's where these proofs become handy.

Speaker 1:
[56:47] If I'm an engineer interested in getting started with formal verification, for example because I have the notion that it will be more important with AI, and that of course it will become easier to write these things, where would you point engineers to get started? How did you get started in this field?

Speaker 2:
[57:04] I would suggest starting with model checking. Something like TLA+ or FizzBee is much friendlier to get started with compared to proof assistants like Isabelle, Rocq, and Lean. These proof assistants just require a whole lot of additional knowledge. The resources for learning to write these formal proofs are, to be honest, not particularly good. I haven't found really great books on it either. The way I learned it was by working with some colleagues in my lab who had learnt it through years of prior experience. I just sat down and paired with them at a desk, where I described the thing I was trying to prove, and they showed me how to prove it step by step, how to break it down.

Speaker 1:
[57:47] I'm interested to see whether your thinking will be correct, which is that this will go more mainstream, and hopefully we'll have better books and resources for it as well.

Speaker 2:
[57:55] Yes, I do hope so. The reason I believe formal verification could become more important in the future has several aspects to it. One is that LLMs are getting increasingly good at writing these proofs, and if we don't have to write the proofs by hand as humans, it becomes feasible to do them in situations where previously it would not have been economical. But also, LLMs increase the need for these formal proofs, because we're vibe coding a bunch of stuff. If we have to manually review all of that code, then that will become the bottleneck. We can't really have humans reviewing all of the generated code either, if we really want to get the benefits of AI. So we need some automated way of checking whether the code is correct. Writing lots of tests is a very good starting point, but the thing a proof can do that tests can't is to consider absolutely every possible thing that could happen. And that's really important in a security context, for example, where it just takes one little bug to create a vulnerability that destroys the security of the whole system. So I feel that for those domains where we really want to ensure a complete absence of bugs, that's the kind of place where formal verification can really shine. And I'm hoping that LLMs will actually make it a lot more accessible to people who would previously not have considered using formal verification because it was just too hard and too expensive.

Speaker 1:
[59:20] You've worked in industry and then you went into academia. Can you tell us what the difference is? Myself and most people watching work in what you would call industry: the tech industry, working at different companies, bootstrapping our own, or just building our own things. How does academia contrast with this? What do you and your colleagues do inside academia?

Speaker 2:
[59:45] Yeah, within academia there are lots of different styles really; there's not one thing. Some people go full-on theoretical and mathematical, don't care about the real world at all, just want to work on things that are intellectually interesting, and that's fine. Some people are very much at the applied end, wanting to do research that is likely to have a real-world impact. I'm more on the applied end, and that's fine too. But a common distinction is that academia can just think much longer term. If you're doing a startup, you have to ship something within a few months; you can't afford to think 10 years into the future. Maybe you'll have a long-term vision that you're gradually getting towards, but you do have to really ship things on a fairly short time scale. At a bigger company, maybe if you're working on infrastructure, you can think on a bit of a longer time scale, because the requirements are perhaps better understood: making sure that the system is scalable, operationally robust, and so on. It's fairly clear what the requirements are, and it's a matter of implementing them. But in academia, what I really appreciate is the freedom to work on things that are long term and which are not immediately commercially viable, or which are not aligned with the incentives of commercial companies. So one research area that I've been working on for several years now is what we call local-first software, which is this idea that we want to take away a bit of the power from cloud operators and give it back to end users. End users should be more in control of their own data and less dependent on cloud services for providing the applications and the data that the users need. And that's something that doesn't naturally come to companies, right? Because for software-as-a-service businesses, for example, the whole reason why they can charge a subscription is that they are able to essentially hold a gun to the customer's head and say, pay us your subscription, otherwise we will delete all your data. I totally understand the commercial imperatives that lead to that, but it also leads to this situation where people have a gun against their head all of the time. That isn't really a healthy situation to be in, in my opinion. But changing that in such a way as to take that gun away from customers' heads is difficult if you're in a business whose revenue depends on perpetuating that kind of lock-in. And there I feel like in academia I have the freedom to work on things that go against this commercial incentive of companies and say, actually, no, I'm going to do what I think is right for the users. I'm going to say the commercial model of the companies making the software is second priority, and I can afford to do that because I'm not dependent on this commercial model.

Speaker 1:
[62:42] To add to this, these are very interesting and challenging engineering problems, right?

Speaker 2:
[62:47] Yes, and it's wonderful to get to work on interesting engineering and computer science problems while at the same time trying to pursue this higher level vision.

Speaker 1:
[62:59] For local-first software, what are some of these really interesting engineering challenges that we will need to solve, or need to solve to get to more viable local-first software? Maybe, let's say, note-taking; it's a very popular example, right?

Speaker 2:
[63:14] Yeah. So with our vision of local-first software, we are trying to get away from this dependency on centralized cloud services. There may still be cloud services involved in syncing data between your phone and your laptop, say, because often going via a cloud service is just the most convenient way of establishing that kind of communication. But we just don't want to have to rely on one particular cloud service providing a particular function. And if you can get away from assuming this one cloud service, you could, for example, have multiple cloud services from multiple cloud providers side by side, and you just sync with whichever happens to respond first, or sync with all of them. Then if one of them disappears, no problem, because you've got the other one. So it gives us a huge amount of freedom and flexibility if we get away from this assumption of centralized cloud services. But that introduces a whole bunch of interesting research and engineering challenges. One thing that we've been working on lately, say, is access control. Simple problem: you have a document, you want to be able to grant collaborators access, and you want to be able to revoke that access again. Totally obvious, should be totally straightforward. In a centralized cloud service model, it is totally straightforward.
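
As a sketch of the "sync with whichever responds first" idea above: race the same push against several interchangeable relays and take the first success. The relay URLs and the push_changes function are invented stand-ins; a real implementation would make an HTTP or WebSocket sync call.

```python
import asyncio

# Interchangeable relays with simulated latencies (stand-ins for real
# sync endpoints hosted on different cloud providers).
RELAYS = {"https://relay-a.example": 0.05, "https://relay-b.example": 0.20}

async def push_changes(relay: str, payload: bytes) -> str:
    await asyncio.sleep(RELAYS[relay])  # pretend network round-trip
    return relay

async def sync(payload: bytes) -> str:
    tasks = [asyncio.create_task(push_changes(r, payload)) for r in RELAYS]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # or await them too, to sync with every relay
    return done.pop().result()  # whichever relay answered first

print(asyncio.run(sync(b"crdt-ops")))  # https://relay-a.example
```

Cancelling the stragglers gives the first-responder behavior; awaiting them all instead gives the "sync with all of them" variant.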

Speaker 1:
[64:25] You have the rules, you enforce those things, you check for the right roles, and that's it.

Speaker 2:
[64:30] Yeah. But if you want to run your system over multiple providers, or even in a peer-to-peer setting, then what could happen is that a user gets their edit permission revoked, and concurrently that user makes an edit to the document whose permissions have just changed. Now, some devices may see the edit to the document first and the revocation second, and so they would accept the edit, while another device may see it the other way around: they see the revocation first and then the edit second, and they'll drop the edit because they think it's not authorized. Now those devices have become inconsistent with each other, permanently inconsistent. That means if we actually want to ensure consistency, even for this fairly basic setup, we have to somehow figure out how to resolve the situation of an edit that is concurrent with the revocation of the user who made that edit. Solving that problem is harder in a decentralized setting, where we don't have a single server that can make that decision. In a centralized setting, you just have one server: it decides whether the edit to the document came first or the revocation came first, and that one server makes the decision. But if you have multiple servers, they might make different decisions. So then you could use a consensus protocol, but consensus is messy because it requires quorum votes and requires nodes to be online. So we've been trying to do the whole thing without consensus, while preserving high availability, preserving the ability for users to work offline, and preserving the ability to synchronize peer-to-peer without any servers, for example. That just makes the engineering challenge a lot harder. It's solvable, and we are close to solving it for Automerge, which is the CRDT library that I work on. But it's just much less straightforward than it is in the centralized case. It's a nice example of where interesting engineering challenges arise from this desire to get away from centralized services.
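
A toy reproduction of that race, with invented names: two replicas receive the same two operations in different orders, apply the naive rule "drop edits from users who aren't currently permitted", and end up permanently inconsistent.

```python
def apply_ops(ops):
    allowed = {"alice", "bob"}  # bob starts with edit permission
    doc = []
    for op in ops:
        if op[0] == "revoke":
            allowed.discard(op[1])
        elif op[0] == "edit" and op[1] in allowed:
            doc.append(op[2])  # naive rule: drop edits by revoked users
    return doc

edit = ("edit", "bob", "hello")
revoke = ("revoke", "bob")

replica_1 = apply_ops([edit, revoke])  # edit arrives first -> kept
replica_2 = apply_ops([revoke, edit])  # revocation arrives first -> dropped
print(replica_1, replica_2)  # ['hello'] [] -- the replicas have diverged
```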

Speaker 1:
[66:29] We were just talking about clocks earlier, and an obvious thing that came to mind is: well, if all devices had the same clock, exact to the microsecond, you could just use a timestamp. But as you said, in distributed systems we cannot always trust that the clocks are synchronized. I assume a lot of the things that you have been researching and writing about keep coming back to this.

Speaker 2:
[66:51] Absolutely. And in this particular setting of a user getting their edit permissions revoked, if a revoked user still wants to, say, vandalize a document, they can just backdate their edits and give them an earlier timestamp. So relying on clocks is absolutely useless here, because people can forge the timestamps from those clocks and thereby potentially undermine the access control mechanism. In this kind of system, we have to worry about potentially maliciously generated actions as well, since the actions come from end-user devices.

Speaker 1:
[67:22] This is fascinating, because it feels to me that you're solving an engineering challenge as hard as, or maybe even harder than, what some startups would take on, because a startup would go the easy route: it would take on a constraint, in this case a centralized server, which makes business sense, makes revenue sense. But because you are not doing this, you now need to find a solution to a harder problem. And if you solve this harder problem, you can provide a building block that can move the industry forward: give a business, an individual, or an institution the option not just to centralize, but to use this decentralized, local-first approach, and then, of course, reason about the trade-offs and decide whichever makes sense.

Speaker 2:
[68:04] Exactly. And that's what I mean by this long-term thinking. This is an example of it: because it's research, we can afford to take this idealistic, principled stance and say, yes, we're going to solve this harder engineering problem because we think decentralization is a valuable feature. We know perfectly well that most startups are not going to solve this problem, because they will just do the easy, pragmatic thing, which is the right thing for startups to do. But we have a different set of incentives, and we can afford to put in the time to try to solve those hard problems. And as you said, if we can solve them, it creates more optionality for any users of this technology. They can choose to use this decentralized tech if they want; there are still trade-offs around it, but at least they're not having to invent it from scratch, so it'll be a lot easier to adopt this kind of decentralized tech for those who want to use it.

Speaker 1:
So inside academia, you're also teaching. What courses do you teach?

Speaker 2:
At the moment, I have a concurrent and distributed systems course for the undergraduates and a cryptographic protocol engineering course for the master's students. Then additionally this year, I have a seminar course on security, and I'm also teaching the undergraduate operating systems course. I've got quite a lot of teaching this year.

Speaker 1:
[69:22] The distributed systems course is available on YouTube. Can you summarize what people would learn going through this course, which again is freely available? Thank you to you and the university for making it available.

Speaker 2:
[69:36] Yes, so that distributed systems course is a bit more theoretical than what is in the book. It's more focused on algorithms and on how we convince ourselves that the algorithms behave correctly under the assumptions of distributed systems that we talked about: that nodes may crash, communication might be unreliable, clocks might be wrong, etc. It's not a very long course, just eight lectures' worth of material, but it goes into substantially more detail on the algorithms than the book. For example, one of the lectures goes through the entire Raft consensus algorithm, which is pretty complex. But I really wanted to show the students exactly how it works, because it's just such a nice illustration of the challenges of distributed systems and the various measures we need to take in order to handle the various types of edge cases and failures that can happen, and to show that those problems can be overcome. It's not easy, and the algorithms are very subtle, and it's very easy to have bugs in them, but it is possible to solve consensus in a way that works pretty well. That's really the message I'm trying to get across with this course.

Speaker 1:
[70:50] And you mentioned that when you were writing the book together with Chris, you brought a lot of industry insight and up-to-date knowledge, and you also brought your experience of teaching and what works.

Speaker 2:
[71:01] I don't think I have a particularly unique teaching style. In lectures, I will go through slides. I like to annotate the slides by hand during the lectures; I just draw on an iPad to make it a little bit more interactive. But other than that, it is fairly theoretical. That's partly the way the Cambridge system works: it favors theoretical, pen-and-paper courses over, say, practical implementation courses. I think it would certainly be possible to do a practical course on this, and I may incorporate a bit more practical exercise in the future. But right now, it's mostly a theoretical pen-and-paper course, and that is fine. The cryptography course that I do is much more hands-on. That's about actually getting the students to implement some elliptic curves from scratch, for example.

Speaker 1:
[71:49] In your time in academia, which has now been a longer period, how have you seen computer science education changing? And how do you think it might change further in the future, especially as we're seeing AI become part of industry and probably the world as well?

Speaker 2:
[72:07] Yeah. I mean, prior to the AI explosion happening, the rate of change was actually very slow in computer science teaching. Partly, that might be Cambridge: Cambridge is over 800 years old; everyone thinks on longer time scales. People don't tend to rush into the latest fads and instead try to focus on the fundamentals. A lot of the fundamentals of computer science were developed in the 1930s already and are still true today: lambda calculus and those types of things, for example. So we have quite a focus on those fundamentals rather than chasing the latest fashionable thing. That said, AI has totally changed the way we can assess coursework, for example. Of course, we can try banning AI, but it's impossible to actually enforce such a ban, and it's also kind of counterproductive, because we do want students to engage with new technologies and figure out how to use them productively for themselves. But we want to somehow do that in a way that supports their own learning and doesn't undermine it. So how do we get the students to use AI in a responsible, mature way? We can't necessarily rely on the students being mature enough to know for themselves what is a helpful use of AI and what is a use of AI that undermines their own learning. Some of them are quite mature and able to decide that for themselves, but many are not, and so we need to provide some guardrails for them. We also need to make sure that when we have assessed work, for example, it's fair and it's perceived as fair by the students. If students feel that some of their fellow students are getting really good marks without doing any work, that undermines trust in the entire system. We have to be very careful with how we approach this. To be honest, we don't really have good answers yet. We do now, for example, have a boot camp right at the start of the first year to expose new students to basic software engineering skills: this is version control, this is unit testing, this is generative AI. The basics that really everyone should be familiar with, with the hope that they will use them throughout their degree to improve the work that they do. But how exactly we handle things like assessment, we're still in the process of figuring out.

Speaker 1:
[74:35] So it sounds like the pace of change is going to be fast in the industry, and academia will probably adopt it too, and we'll see what comes after.

Speaker 2:
[74:46] Yes. There's a difference, though, which is in the desired outcomes. With industry, generally, the desired outcome is a working product, for example. In academia, the actual artifacts that the students produce, like an essay a student writes, are not really the point. We don't ask the students to write essays because we love reading their amazing essays. We ask them to write essays because we want them to go through a thought process which helps them learn something. It's that thought process and that learning which is really the desired outcome here. That means we have to approach it a little differently, because generally, in industry, if you can use AI to get a job done faster and you get an equivalent result, do it, absolutely, because that is the desired outcome. Whereas in education, we have to think about how to ensure that the learning outcomes and the thought processes are still preserved, such that the students benefit intellectually.

Speaker 1:
[75:42] That's very relevant. Anthropic had a recent study where they looked at junior engineers: one group used AI, the other did not. And they found, unsurprisingly given what you just explained, that the group using AI had little to no learning, whereas the group that did not actually learned the material.

Speaker 2:
[76:02] Yes, I saw that study as well. We might be able to quibble a bit with the detailed methods of that study, but the general principle seems true: sometimes, in order to learn something, you just have to struggle with it a bit, though not struggle too much. If people are stuck on some technicality and they can use AI to get unblocked, and then focus on the main learning outcome, then I think it's good to use these types of tools. But if the point is to actually grapple with difficult ideas and think them through in their own minds, then we still need to find ways to make sure the students are doing that.

Speaker 1:
[76:40] You've worked in both industry and academia. What do you think industry could learn from academia, and academia from industry?

Speaker 2:
[76:46] The two really could be closer together, because often they regard each other with a sort of disrespect, really. The industry people will say: that's theoretical, that's academic, it's got nothing to do with the real world. And they're really missing a trick there, because actually there are a lot of interesting insights from research that are very relevant to the real world, but they're not necessarily making their way across that chasm. In the other direction, the academics will say: oh, this industry stuff, that's just engineering; they're not actually doing any interesting thinking, it's just writing routine stuff. I see it as one of my goals to try to build better respect in both directions: by bringing interesting insights from research into industrial practice, but also by informing our research with the problems that arise in the real world, and in that way joining those two things up a bit better.

Speaker 1:
[77:42] What are your current research topics that you're working on, or ones that you're excited about?

Speaker 2:
[77:47] I have two main areas I'm working on at the moment. One is local-first software. That's this idea that we want collaborative software like Google Docs, like Figma, et cetera, but in a way that gives better protection to users' data; that's less dependent on a single cloud provider who can lock you out of your files; and that's therefore more resilient and gives users greater agency and greater autonomy over their own data. That's an area I've been working on for the last 10 years or so, through a mixture of open source work, algorithm development, formal verification, and so on. I'm now also trying to set up a brand new research area on a totally different topic, which is using cryptography to prove things about the physical world. I'm interested there especially in sustainability-related things. For example, say you want to verify that the carbon emissions involved in manufacturing a particular product were x, and you want to be sure that that number is correct, because maybe you want to include emissions in your purchasing decision and choose the product with the lower emissions. For that to be meaningful, the emissions number has to be correct. Unfortunately, at the moment the numbers are generally not correct, because the incentives are to lie and cheat and to use creative accounting techniques, all as a way of greenwashing, basically. A related thing is happening in the EU, for example, which is bringing in new regulations on preventing the deforestation of tropical rainforests. So when, for example, coffee, cocoa, palm oil, etc. are imported into the EU, the importer needs to prove exactly which plot of land it actually came from, and then it's checked against satellite imagery that that land was not recently deforested. And so I've been looking into using cryptography as a tool for proving things about the supply chains of these physical products, but without revealing commercially sensitive information. For example, a company will not want to reveal who its suppliers are, and which ingredients for its process it purchased from which supplier, because that might reveal something about the secret recipe that it uses. The hope here is that cryptography can allow us to prove, for example, that the accounting has been done correctly across supply chains, but without having to publicly reveal any of this sensitive data about suppliers or other customers.
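
The episode doesn't name a specific cryptographic scheme, so the following is only an assumption-laden sketch of the simplest primitive in this space: a hash commitment, which lets a supplier publish a binding fingerprint of a sensitive record now and reveal it only to an auditor later. The record format is invented, and real supply-chain systems would layer zero-knowledge proofs on top of primitives like this.

```python
import hashlib
import secrets

def commit(value: bytes) -> tuple[bytes, bytes]:
    nonce = secrets.token_bytes(32)  # hides low-entropy values from guessing
    digest = hashlib.sha256(nonce + value).digest()
    return digest, nonce  # publish the digest; keep the nonce private

def verify(digest: bytes, nonce: bytes, value: bytes) -> bool:
    return hashlib.sha256(nonce + value).digest() == digest

record = b"supplier=farm-17;plot=xyz;co2_kg=412"  # hypothetical record
digest, nonce = commit(record)        # digest published, e.g., at import time
assert verify(digest, nonce, record)  # auditor later checks nothing changed
```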

Speaker 1:
[80:10] What is your view, from your vantage point, on the impact that AI is having on academia, not just for students but beyond that, and also on industry, through your industry contacts?

Speaker 2:
[80:23] Yeah, I mean, I'm not that deeply into the AI things, really. I'm seeing it more through my collaborators, who are making very good use of AI tools, for software development especially. I personally write very little code these days, and so I haven't had that much need or occasion to actually use AI agents myself. When writing prose, like working on the book, for example, I prefer to still do that the old-fashioned way and just write every word by hand. So I haven't let AI anywhere near the text of the book, for example. And I don't know if that's the right decision. It's not really a principled thing, in that I think it would be wrong to do so; it's more that, for myself, the process of writing is how I figure things out, and figuring things out is really my goal here. I'm trying to figure it out in my own head, and for that I just have to write it myself; there doesn't seem to be any way around it. But using AI as a way of getting feedback on ideas, or exploring whether an idea really holds up to scrutiny, or things like that: that seems like a very productive use of the technology, and that applies to both industry and academia, I would say.

Speaker 1:
[81:32] So as a closing question: for a student or a young professional who is still studying and considering their route into either industry or academia, what have you seen? Who thrives in one or the other?

Speaker 2:
[81:47] Yeah, my feeling is they're not really mutually exclusive. Some of the best PhD students I've worked with actually have a few years of industry experience. They might have done an undergraduate degree, maybe a master's, then spent a few years in industry doing real software engineering and learning about the real world. Then maybe at some point they got bored and thought, actually, I want to work on more idealistic things, or have more freedom to choose my own research topics, and then they start getting interested in doing a PhD. That, I find, is quite a healthy route. You do get people who go straight from their undergraduate degree and master's into doing a PhD, but sometimes those people can lack a bit of breadth of perspective. So I think having seen a bit of real-world engineering is actually really helpful for people, even if they then want to stay in research. In the opposite direction, I think it can work very well too, because in research and academia we get to think things through a lot more carefully than people often do in industry. People in industry, I feel, sometimes have short-circuit reasoning: they maybe don't quite reason something through from first principles, but just think, oh, I heard this at a conference talk, I'm just going to go with that. What academia can teach is this sort of nuanced and critical thinking: to really reason through trade-offs, for example, and to really justify why something is true. So I think it's really good if people can weave in and out of industry and academia a bit and not regard them as two totally mutually exclusive career paths, but actually switch between the two.

Speaker 1:
[83:34] Well, Martin, thank you very much. I expected us to talk a lot more about your book, which we did, but I have a newfound curiosity and respect for all the important and interesting academic work that you and everyone else is doing. So thank you so much for this.

Speaker 2:
[83:48] Thank you for the great interview. This was really interesting.

Speaker 1:
[83:50] I hope you enjoyed this rare conversation with Martin Kleppmann. I found it interesting to learn that the first edition of the book assumed that you have machines with local disks, but today this is not how most engineers build systems anymore. Cloud-native primitives like S3 change how you build systems, and this is why the book needed a refresh. I also appreciated Martin's take on whether engineers still need to understand system internals when they're using managed services: if you're building business logic on top of these services, you probably don't need to know every detail, but it can become useful to be able to look deeper, especially when you need to debug your system. By the end of our conversation, I had gained a lot of appreciation for the academic research that Martin is doing. The local-first software work, the access control problem in decentralized systems, using cryptography to verify supply chain emissions: a lot of these are hard engineering problems that few startups will take on. It was nice to understand how academia is in a good position to do work that has a long-term focus. Do check out the show notes below for related Pragmatic Engineer deepdives. If you've enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. A special thank you if you also leave a rating for the show. Thanks, and see you in the next one.