CORECURSIVE #088

From Project Management to Data Compression Innovator

With Yann Collet

How do you accomplish something massive over time? I’ve had the chance to meet with a number of exceptional software developers and it’s something I always wonder about.

Today, I might have an answer with the incredible story of Yann Collet.

Yann was a project manager who went from being burnt out on corporate life to becoming one of the most sought-after developers in the world. What happens when you build something so impressive and valuable that it essentially becomes invisible?

And how do you do that when your day job is mainly organizing spreadsheets and keeping timelines on track?

Yann built LZ4 and ZStandard - two of the world’s fastest compression algorithms that have transformed databases, operating systems, file systems, and much more. We’ll go back in time to Yann’s initial steps with programming, his game-changing discoveries along the way and how his devotion to data compression as a hobby led him to create something that saves billions of dollars worldwide.

Transcript

Note: This podcast is designed to be heard. If you are able, we strongly encourage you to listen to the audio, which includes emphasis that’s not on the page.

Introduction

Adam: Hi, this is CoRecursive and I’m Adam Gordon Bell. Each episode is the story of a piece of software being built.

How do you accomplish something massive over time? I’ve had the chance to meet a number of just amazing software developers and it’s something I always wonder about. How did they do that? Well, today maybe I have an answer. I have the incredible story of Yann Collet. Yann was a project manager who went from being burnt out on corporate life to becoming one of the most sought after developers in the world.

What happens when you build something so impressive and so valuable that it spreads everywhere and becomes almost invisible? And how do you do that when your day-to-day job is mainly organizing spreadsheets? Yann built LZ4 and ZStandard, two of the world’s fastest compression algorithms, and they’ve transformed databases and operating systems and file systems and much more. They’re everywhere. But we’ll go back in time to Yann’s initial steps with programming, his game-changing discoveries along the way, and how his devotion to data compression as a hobby led him to create something that saves billions of dollars and changes everything. This is an unforgettable career of passion that starts very simply as a hobby and it starts in Paris in the 1990s.

Paris

Yann: I had decided back then that if I had to live in Paris, I would live in Paris, not around it. Because then you have only the negative side and none of the plus side. But another aspect of it is that if you live in Paris, you better not have a car. So I don’t drive. I actually use my bike to go to work.

Adam: Yann was working as a project manager at a tech company that contracted with the government of France.

Yann: And let’s say the product I was in charge of was not working well to the point that I started to develop some scripts. I wasn’t a programmer, but I could do some simple stuff just to help the product just pretend that it works. Which obviously wasn’t supposed to be my job, but okay. So it gives me a better understanding of what’s going wrong and at some point I go to see my boss and say, “This product is so bad, we better rewrite it entirely from scratch.” My only experience at that point is that I’ve been able to put some script … And when I say some script, really make it clear this is just a few batch files. That’s what it is. So that’s really nothing. But it gives me a little bit of an idea of what programming is about.

Military Radios

Adam: The project was a military radio communication system.

Yann: So the point is you can communicate with voice but it’s awfully inconvenient to the point that you need to set it up in advance using some huge configuration files that needs to be exactly correct, otherwise it crashes. And of course everything must be static. So if there is any node in the networks that is not there, it’s going to crash. I mean this is a battlefield. You can’t expect things to remain static, that’s nonsense. So my goal is to have something which is fully dynamic, which essentially discover everything when it connects, authenticate and discover its neighbors, try to establish relations and it’s actually fairly complex.

But the point is it makes me work directly with software engineers now. And because of that, my product description is more than an idea on a slide. It’s actually a specification. It goes into great detail onto how it’s supposed to work.

Adam: This made Yann nervous. He’s not done this before and he’s just not sure if he’s doing it right or not.

Yann: Let’s just discuss with programmer. Do they understand what I ask them to do? How do they even accept the idea? And to my surprise, he takes the spec, say, “Oh yes, of course we can do that.” And it starts very quickly, actually. In a matter of two months we have a working prototype and then two more months and we have a decent early product. So it’s actually fairly quick.

Adam: This was a surprise because Yann was pretty certain he couldn’t actually contribute to product development.

Yann: I was always convinced this is not for me. It’s too complex. Before that, I’ve been exposed to technology but always on the user side, not as a creator of technology. And I was sure I couldn’t do it. But this experience proved to me that actually yes, that I can do it.

Adam: The project failed for unrelated political reasons, but it meant something to Yann.

Yann: I’ve got direct relation with software engineers and it’s a pretty good relation. We actually get along very well, probably because I’m basically the only guy in the marketing department to try to get close to their language, to express the idea in ways that they can implement. And we have a kind of feedback model where I get close to them and we course correct anything that goes wrong. This model gave me a form of friendship, I would say.

Adam: In a more merit-based world maybe Yann would transfer to product development and find his calling there and live happily ever after. But in this world he moves on to other projects, many of which don’t use his skills very well. And despite him working hard and really caring a lot, a lot of these projects just get canceled. The years slowly pass by until one day something changes.

Burn-Out

Yann: I would say it’s a moment in my life where I changed my aspirations. I was 35. In my 20s, I’ve been really working a lot, chasing the professional success and so on. And by my 30s I had accumulated quite a lot of scars. I was no longer convinced that working a lot was really a great future.

Adam: Basically Yann was burnt out.

Yann: It results in an extremely reduced ambition. Basically just be a nice guy, have a good life. That’s good enough. There’s no point in doing more than that. Generally speaking, I think it’s good advice to give to anyone because it considerably reduces the kind of pressure that one can put upon himself, as if there’s a requirement to absolutely succeed.

So I kind of change all I’m thinking. Okay, I will try first to get a stable job, which I managed to get. And from there I discover oh, but now I have got a ton of time. Since I’m no longer working all day, I’ve got a ton of time so I can actually develop other activities just to enjoy the time. So initially I play a bit of video games, but it felt hollow after some time.

My next observation would be let’s learn about history. I always loved history. So I tried. I’m starting to learn a lot about history actually, but after a few years I’ve got this feeling that I’m reading again the same story or a different version of the same story so I’m not learning more or not enough.

Project Management

Adam: Also during the day, Yann’s now working as a project manager.

Yann: I’m organizing big projects across the world and so a lot of people depend on that, but I’m organizing Excel spreadsheet essentially making sure that the product is at the place where it’s supposed to. Making sure that everybody’s prepared, that the right teams are aware of. It’s not exactly as thrilling as inventing a new product. So something is missing. And I think that’s also why I go into, well let’s do some programming. After all I had a nice experience with real programmers and that will help me understand them better. And that’s how I start.

Adam: There’s lots of ways to start programming. You can get a book, you can do online tutorials, but Yann decided to head back to his roots, back to when he was in secondary school.

HP 48s

Yann: So as a student we had the right to have a graphic calculator. It was the early ’90s. And the one I selected was HP 48S. So that’s a very specific calculator because the CPU is really unlike others. It’s a four bit CPU but it has a very large register to be able to have big numbers. Essentially it’s a graphics calculator. You’re supposed to draw curve equations and curve with that. But of course if it is a programming device, that means you can try to make it run Doom or anything like that.

Adam: So it turns out that although he didn’t think he could cut it as a professional programmer, Yann actually had programmed in the past for this HP calculator. It was no Doom, but he had built a game called Fantasy Conquest.

Yann: So it’s the idea of having a band of marauders essentially and they roam around the country and they try to become the biggest band of marauders around. It’s not an intelligent game. It was kind of working enough for some players to enjoy it, but it was never great and I had really no time to really go into detail because that’s also student time. We also have some studies to do so most of the time it’s actually spent there trying to get a good engineering school.

Adam: But back then when he did work on the game, he remembered it being fun or at least something he enjoyed. This HP calculator, it had a 131 by 64 screen, so not a lot of pixels, but it was enough to build a game. The user experience though, that was a bit of a challenge.

Yann: In order to play the game, you would need to have exactly the same calculator, an HP 48. And then we could download the games through either infrared connection directly or more complex was to get a cable to a PC. And back then that wasn’t common at all. So infrared was used a lot. And the thing is, the game was large, and it would take up to 50 kilobytes and that was considered very large at that time. So you needed a memory expansion to actually even be able to load that and that was one of the problems. This game was too heavy and it deserved to be optimized a bit. And so that’s where I start to think about compression.

Adam: So in the evenings after history has lost its luster, Yann starts picking up this game. Starts improving it, starts programming. It becomes his new hobby.

Picking Up Programming Again

Yann: I’m not going to program every day. It still has to be a hobby. So I’ve got a number of activities, but once something is interesting, you come back to it regularly so that’s exactly what I do. I would program around this calculator for many months. I wouldn’t say every evening but very frequently. And over time skill would simply improve just as a function of practice. All I wanted to do is have some fun and finishing this old game I never finished. That’s more or less all I wanted. And yeah, I never planned anything out of this, so it’s kind of a surprise that one thing led to another.

Adam: This game did lead to so much and it ends up taking Yann to the other side of the world eventually, but immediately it helped him with something maybe more important.

Yann: How can I explain that? Because it’s difficult to. But I would say when you have a job, which is clearly just a job to live, I think it’s really important to have a side activity. So this side activity doesn’t have to be something useful or that brings money. It’s just joyful side thing. It could have been dancing. I’ve been doing dancing too by the way. It could have been anything that brings you joy in your daily life. I don’t want to give the feelings that I was doing only programming, data compression side. That was one of the thing I was doing. But yeah, I think it’s very important to have activities outside of the job. I would even call it compulsory if you want a fulfilled life.

Adam: And so fulfill him this game does. There’s a couple ways to program on an HP calculator, but to compress a game so that it doesn’t require the memory expansion, there’s really only one way. You have to program it in Assembly.

Assembly Compression

Yann: And so I start. The first iteration of course are not great, but they are all learning step. Every small iteration gives something of a better compression ratio, better speed, better memory usage, something like that. After some time, hey, I’ve got something which I can be proud of, which is essentially a very fast decompressor.

So the idea is that the game would decompress itself on the fly. It’s actually composed of multiple small modules and every time a module is called, it gets into RAM, it gets decompressed and that helps to reduce the size of the game by 30% approximately. And when we think about it, 30%, that’s all. It’s not that big. But for some reason I’m proud of it or something like that. I’ve invented something. Plus data compression, that’s clearly something that always felt magical to me. So that’s a starting point.
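The scheme Yann describes, keeping each module compressed and inflating it only when it is called, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not his HP 48 code: Python's standard `zlib` stands in for his hand-written Assembly compressor, and the `ModuleStore` class and the module names are hypothetical.

```python
import zlib


class ModuleStore:
    """Keeps modules compressed and inflates one only when it is called."""

    def __init__(self, modules):
        # modules: dict mapping a (hypothetical) module name to its raw bytes.
        self._packed = {name: zlib.compress(data, 9) for name, data in modules.items()}

    def packed_size(self):
        # Total space the compressed modules occupy at rest.
        return sum(len(blob) for blob in self._packed.values())

    def load(self, name):
        # Decompress on the fly; nothing stays inflated between calls.
        return zlib.decompress(self._packed[name])


# Made-up, repetitive module contents, purely for illustration.
modules = {
    "map": b"grass grass grass forest grass grass river " * 40,
    "combat": b"attack defend retreat attack attack defend " * 40,
}
store = ModuleStore(modules)
raw_size = sum(len(data) for data in modules.values())
print(raw_size, store.packed_size())
```

The trade-off is the one from the story: a little time is spent on every module call in exchange for a smaller footprint at rest, which is why a very fast decompressor mattered so much.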

Before I knew it I’m realizing that I’m only working on the data compressor. I’m not touching the game anymore. I’m always trying to improve the data compression side. And after that, once again, there is no plan. It’s just a long journey where every step I learn something new and I find that interesting.

Adam: Data compression is a complex field to learn though. So initially Yann’s approach was just try things out and discover his own way.

Yann: How can I search faster? How can I encode that in a way which is less wasteful and so on? This is insanely slow obviously, but it works. In many cases I find a great idea that I’m almost proud of and it doesn’t take long before I understand that this idea is common for programmers, I just didn’t know it. So for example, I [inaudible 00:14:30] hashing function in order to search faster and it doesn’t take long before I understand, well that’s common. Everybody does that. So it’s really a journey and a ton of fun discovery.
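The hashing idea mentioned here is a common trick in Lempel-Ziv style compressors: rather than scanning backwards through everything already seen, remember where each short byte sequence last occurred and jump straight to that candidate. The sketch below is only an illustration of that general technique, not Yann's code; the function name, the 4-byte minimum match, and the use of a Python dict as the hash table are all assumptions.

```python
MIN_MATCH = 4  # assumed minimum match length for this sketch


def find_matches(data):
    """Return (position, earlier_position, length) for each repeat found."""
    table = {}    # hash table: last position seen for each 4-byte sequence
    matches = []
    i = 0
    while i + MIN_MATCH <= len(data):
        key = data[i:i + MIN_MATCH]
        candidate = table.get(key)
        table[key] = i
        if candidate is None:
            i += 1
            continue
        # Extend the match for as long as the bytes keep agreeing.
        length = MIN_MATCH
        while i + length < len(data) and data[candidate + length] == data[i + length]:
            length += 1
        matches.append((i, candidate, length))
        i += length
    return matches


print(find_matches(b"abcdefgh--abcdefgh"))  # prints [(10, 0, 8)]
```

Each lookup costs roughly constant time, where a naive search would compare against every earlier position. That is the kind of speed-up being described.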

Adam: So Yann makes a standalone compressor for the HP 48S.

The Personal Computer Transition

Yann: At this point though, the scene for the HP 48 calculator is pretty much dead, let’s say. So there’s not a lot of spectators to see the work, but I would nonetheless continue to develop it. I would start even to develop different variants, some stronger than others. But the real key selling point of the main algorithm I would employ is its speed. It reached an extremely high speed of 80 kilobytes per second, which obviously nowadays looks like shit but these calculators, they don’t have that much RAM so it’s still fast for them. And after some time, a good year I would say, I still have this feeling that okay, I’m developing something. It’s interesting. I like it. I find it interesting. But there is almost no one to enjoy it because the scene is dead. So my next step here in this journey is to say, okay, let’s go to the PC scene, so one which is active. 2009 approximately.

Adam: The PC scene is active but they don’t have practice at squeezing performance out of old underpowered CPUs like Yann does.

Yann: I’ve read in newspapers that programmers only produce bloatware which are worse and worse every year. I would have a definitive competitive advantage there. So I’m pretty sure of myself and I start developing on the PC side and now I have to learn C. Because you don’t write in Assembly, so I have to write C. And yes, very quickly I understand that no, no, I am pretty far behind. Very, very far behind.

Adam: The compression libraries available on Windows, they turn out to be way faster than expected. But Yann finds a way to catch up. He finds this online forum where people interested in compression gather. It’s a vBulletin forum and to register you have to answer questions about compression like ‘Who created WinRAR?’ Eugene Roshal by the way. And this place becomes Yann’s watering hole.

Encode’s Forum

Yann: I was not alone and that’s very important I think. But more importantly, I think it gave me a frame of reference. I could compare, I could get evaluated. And so there was a sense of belonging to a tribe of peers and I think it matters because it’s difficult to sustain such a long effort, multi years effort with no such contacts at all.

Adam: Also within this community, it’s easy to evaluate each other’s work.

Yann: Every once in a while someone would come and say, “Hey, I invented this.” And some people would test it and would say, oh, it’s good or it’s great or it’s not. And there would be no shortage of people who like me are interested in data compression and would test the program. Oh, that’s a perfect ground.

Then I learned that yes, data compressions is not limited to WinRAR and WinZip. You have actually hundreds of possibility out there and some of them are in my category, which means simple, light and fast. And now I’ve got something to compare to.

So once I finally understand how to program in C, that’s when I do my comparison and that’s when I quickly understand I’m behind. Way behind. There are a decent number of people out there who can develop very fast compression algorithms. Way better, way more efficient. So there is a learning curve there.

But I think the critical part that I learned from the HP 48 experience is just the will. The will to learn. This is kind of blind trust that there is something interesting to learn so let’s dig a bit more.

Doing Less Compression

Adam: This will to dig in is one of the keys to Yann’s eventual success. But also remember this is just a hobby for Yann like learning history. And so he’s not in a rush. He wants to understand each step along the way and he takes his time.

Yann: I think it’s fair to say that in the data compression community, most researchers were interested in data compression, really best ratio. And speed, yes, as a side effect. Let’s make it not too bad. That really comes second. And my mind was refocused on I want great speed and without sacrificing speed, I want to get better compression ratio. So it really changed the perspective of what matters. And it doesn’t take too long before I got something competitive and at some point also by chance I have an algorithm which seems to be the fastest around. So that’s LZ4. I think it’s fast because it’s simple and also for other properties of the CPU that I wasn’t aware at the time. So I just realized, oh, this thing can decompress at one gigabyte per second. That’s crazy. I wouldn’t have expected that. And that’s it. I do not continue on that. It’s one step in this learning experience.

Adam: So Yann has built the fastest algorithm to compress at this level of compression that he’s seen and then he just moves on. It’s just a hobby and nobody cares about speed like he does. Because the thing is LZ4 is fast because it doesn’t do much. It does less compression than DEFLATE, the algorithm that’s used by zip files and GZ files. And because it does less, it’s very fast, but once it’s built, there’s not a lot to learn. It doesn’t do a lot. And so Yann just moves on to more advanced compression topics.
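To give a feel for what "doing less" means, here is a toy decoder for a byte-oriented Lempel-Ziv stream. It is a sketch under stated assumptions and is not the real LZ4 block format: the token tuples and the `decode` function are invented for illustration. The point is that decoding consists only of copying, either raw literal bytes or bytes already produced, with no entropy-decoding stage such as the Huffman coding that DEFLATE layers on top of its matches.

```python
def decode(tokens):
    """Rebuild the original bytes from literal runs and back-references."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out += token[1]  # copy raw bytes straight through
        else:
            _, offset, length = token
            start = len(out) - offset
            # Copy byte by byte so a match may overlap its own output.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)


# Hypothetical token stream: literals, a back-reference, then a run.
tokens = [("lit", b"abcdefgh--"), ("match", 10, 8), ("lit", b"!"), ("match", 1, 3)]
print(decode(tokens))  # prints b'abcdefgh--abcdefgh!!!!'
```

The last token shows a small but useful property of this family: a match with offset 1 repeats the previous byte, so runs fall out of the same copy loop with no special case.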

Yann: I learn about Huffman, arithmetic coding, reduced offset, context mixing. Things which become really, really complex. But none of this stuff is actually usable. It’s really just for the learning experience. I’m just following other people who have been developing the same things before me. So I’m not doing something really special but I learned. And at some point I stopped making progresses. I would say that’s a moment where I’ve learned whatever was easy to learn, whatever takes a week or two to learn. And that’s a moment where I have to decide what do I do now and do I move on like I moved on from learning history to programming? Do I move on to something else or do I double down? And so I decide to come back to this first compressor, LZ4, and I’m thinking, okay, I’ve got something here. This one is actually above its peers, so there is something there. But I achieved that outcome almost by chance. Let’s understand. Let’s go deeper, why is that and how can I improve it?

Adam: Making the first version of LZ4 was relatively easy, but understanding why it’s fast and how to make it even faster, well that’s a whole different level of complexity.

Performance Work

Yann: Instead of making huge progress every week, it would take months to make small progress. But that’s also the small progresses which are actually difficult to get. After that I’m starting to understand why it’s fast, how to make it faster, and also how to make it stronger.

So now I’m getting more into how do I search efficiently? How do I combine? How do I parse data? These are all fairly complex topics that I wouldn’t care spending time on if it’s just to learn quickly. But at that point I’m more in this mentality so I’ve got something which is a bit better. If I want to make it usable, I need to polish it. And so I take the time to polish it.

And by the time I believe I’ve got something good, it has taken almost a year to understand all this.

Adam: LZ4 is now faster than every single compression algorithm that Yann can get his hands on. Yann knows that LZ4 may have tremendous impact on the world. Data compression is all about making things smaller and more efficient. And when you’re dealing with vast amount of data, even small gains in efficiency can add up to massive savings in time and money and resources.

Something like LZ4 which can compress and decompress at lightning fast speeds has potential to revolutionize the way that we store and manage data. It’s that important.

But also it’s just some code on his computer.

It’s not a proper library, it doesn’t have a license, it doesn’t have users and nobody knows who he is. And so its impact could be nothing. It could be zero.

The next step to change that is to open source it. And open sourcing is more complex than it sounds. Yann’s never worked as a professional developer before and he’s trying to learn how to do things the right way.

Open Sourcing and Snappy

Yann: Then I understand new stuff which is unrelated to data compression such as never use global variable, make data libraries that people can actually integrate. All things that I wasn’t even aware of.

And it takes me a few months to get that right. And at some point in 2011 I’m thinking, okay, I’m ready now to open source it.

And one week or two before I do that, Google opensourced Snappy. And Snappy is basically exactly in the same category as LZ4 and is actually I think a bit better than LZ4 at that time. So I’m kind of what?

Adam: Snappy gets a lot of attention. It turns out that inside Google they had needed ways to compress data from BigTable. And so they needed something that did less than DEFLATE but did it much faster. This is very much Yann’s approach, but they came from a different direction.

Yann: As soon as it gets hot you get a lot of articles over internet which talk about this radically new compression which is so fast and a lot of project which get interested. Basically all databases start to say, “Oh, that’s what we need because we need speed.” So it’s an instant success.

Adam: Did you feel upset?

Yann: Upset? No. But maybe some form of stress. I’m not sure. I feel as if I was in a kind of increased activity mode in my brain. Because there is something happening now. What I learned from that is that my algorithm is actually the only one able to keep up with Snappy. So that’s still something. That’s not bad. And I was convinced that I should be able to do better.

So I doubled down again. I’m focusing back on speed and make that faster, faster. I’m looking at Snappy, I’m thinking this thing is a bit too complex for what it does. LZ4 is way simpler so it should be faster. I believe that. I believe it should be faster, but I don’t know exactly how.

So I focused on that. And indeed months after months progress show up and 10% by 10% LZ4 is actually faster than Snappy and after six months it’s actually way faster. So at least it answers the first part, which is a bit I would say egocentric like I can do it.

Adam: You took on Google.

Yann: Exactly. That’s kind of crazy for someone who wasn’t even a programmer.

Adam: That’s right. Through these six months of slowly finding speed improvements, Yann is still working as a project manager. He’s still biking to work in the morning, organizing spreadsheets and then biking home.

A Double Life

Yann: So some people around me are aware that I’m doing that as a hobby, but no one think much about it. I myself still have this excuse that thanks to this hobby I’m actually a pretty efficient product marketing manager and I have some good relation with my programming teams. I’ve got several ones.

So I can put together a plan. I know how these things communicate between them. So that makes me confident in this role. I think for the professional side, that’s probably best and they feel it too. I mean programmers also quickly understand that the marketing guy in front of them understands programming, but most of them are not aware that I’m doing that on the side.

So it’s not about talking about data compression, it’s just about acquiring some culture of programming and data compression is more like my hobby.

Adam: Probably Yann is a good project manager, but in his hobby, which is very, very niche, he’s slowly becoming one of the best in the world.

Yann: Maybe before internet you just had to be good in your local neighborhood. So I don’t say that in a very strict sense. Local neighborhood can be the companies that work in the same field that you know about. But now you need to be good at worldwide scale. That’s insanely good. But in the same time now you can be very good at something very niche that would be of no importance to anyone you know around.

None of your friend know. None of your family. No one cares.

But at the scale of the planet there are actually a non-negligible amount of people who care about that. And now you can be very, very good at something very, very precise. I think it’s a very big change over the last 20, 30 years.

Adam: That change in the world, that change caused by the internet should mean that Yann is now getting all the attention in the compression field, right? LZ4 is faster than Snappy and look at all the attention that Snappy got. Well it turns out that that’s not exactly the case.

Marketing LZ4

Yann: So sometimes some people which like data compression, they would write an article about comparing different compression algorithm. And so I try to be visible and that’s where I would say, “Hey, I’ve got some things that you might be interested in.” And every time that happens it ends well for LZ4 because the comparison is pretty favorable if we only look at performance metrics.

Now on the potential user side, it’s a very different story. Because Snappy, you’ve got Google attached to that. So that’s real work, professional work. And this random guy no one knows about, okay, who knows. So I would say initially there is absolutely no traction at all.

But there is one thing I’ve got that Snappy doesn’t. During this time I spent understanding how it works in a deeper way, I developed a variant called LZ4 HC, which is LZ4 high compression mode.

So this is a variant which is slower when it compresses, but it has a much better compression ratio and the decompression speed is the same, which means it’s extremely fast. In scenario where you compress once and decompress many times, think assets in video games for example, that’s pretty useful. And that’s where I would essentially score my first wins.
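The compress-once, decompress-many trade-off can be illustrated with any compressor that exposes effort levels. The sketch below uses Python's standard `zlib` purely as a stand-in, since LZ4 HC itself is not part of the standard library; the asset bytes are made up, and the sizes it prints say nothing about LZ4's actual ratios.

```python
import zlib

# Hypothetical game asset: repetitive, but not trivially so.
asset = b"".join(b"tile:%d,%d;" % (i % 37, (i * i) % 101) for i in range(5000))

fast = zlib.compress(asset, 1)    # low effort: quick to produce
strong = zlib.compress(asset, 9)  # high effort: slower to produce, usually smaller

# The same decompression routine reads both outputs; the extra effort
# was spent once, at build time, not on every load.
restored_fast = zlib.decompress(fast)
restored_strong = zlib.decompress(strong)
print(len(asset), len(fast), len(strong))
```

For something like game assets, which are packed once by the developer and unpacked on every player's machine, spending more time on the compression side is an easy cost to accept.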

Actually this way are my first users. Some indie video games developers realized that they could use LZ4 HC and get very good decompression speed out of it and a much better compression ratios than Snappy can offer because it doesn’t have a high compression.

LZ4 Adoption

Adam: The first game to use this format was for the PlayStation Portable, which has a totally different CPU than the one on Yann’s PC.

Yann: Which also introduced me on the topic of portability. So it was great learning experience. And then I learned that more and more games were starting to use it and I don’t know exactly when, but at some point Unity decided to adopt it. And through Unity, a lot of games even today which use LZ4 and they are just not even aware of that. It’s part of the default setting.

Adam: I don’t know. Did this feel like something? You’re like, I’m onto something here?

Yann: Yeah. I did something that’s useful. That’s a great feeling. Honestly. At that point, that’s all there is. I mean, I’m glad. What I don’t realize is the kind of positive reinforcement loop it triggers. Because now this is a product which has been tested, has been shown to be better technically, but is also used in commercial products. And so now even in more serious activity, if you call it this way, I mean not video games but actually databases for other personal systems, now it becomes something that’s worth considering.

And now every time there is this competition, Snappy versus LZ4, LZ4 wins every time. And after a few months it start to be known. And so it reversed. I see more and more projects use LZ4 first.

It did not happen overnight, but it’s kind of a gradual improvement story and that takes time. The news takes time just to reach people.

Adam: First big non-gaming opensource project to use LZ4 was Hadoop.

Yann: Which was supposed to be the competitor of BigTable from Google. That was something. It was kind of, wow. This is serious now. A bit later I learned that it was being evaluated in the LHC, which is the Large Hadron Collider. So that’s for research purposes. In Switzerland there is a very big accelerator of particles, which has a massive amount of data to generate and deal with. So they need something fast. So I say, wow, it’s even used for some kind of fundamental research. So yeah, that’s where I start to see this is more than just a toy project. It starts to be useful for a broader range of applications.

Open Source Investment Pays Off

Adam: One reason for its broad use and adoption is that when he first opensourced LZ4, Yann spent time learning about how to best manage an opensource project.

Yann: So even what is a license? What is an opensource license? What are the choices? Why is this one? I had a pretty good discussion at the very beginning of the project with people from Unity about that. That was important because I think I would have gone for GPL. Instead, I went for BSD, and it only changed the scope of I would say the addressable market. It was a very important decision retrospectively, but at that time I did not really understood it so I kind of trusted these guys. Seemed to know what they are doing.

And then of course I got some feedback.

Another thing I learn, feedback is gold. Anytime someone comes with a problem, that’s gold. It would be so easy to say, “Hey, I don’t have this problem on my computer. I don’t care.” Instead, it’s, “Oh, you have a problem. So hey, it’s worth solving.” How many other people had the same problem and never told about that? It actually matters.

Adam: It does matter. Of course we all know that. But if you’re a Google dev, you probably have internal Google things to do. The external issues on GitHub, they might take a backseat.

Yann: And actually an amateur, which is focused on its own turf is going to be more reactive, more present than a professional which is already overburdened. So indeed that’s what happened. If going to the Snappy project at the time, you would see the issues increasing all the time and response time being longer and longer. So that’s part of what I call the project. LZ4 was really much more focused, much faster release time, correction and so on.

Getting in Linux

Adam: And so that pays off and LZ4 keeps spreading. Eventually some engineers at LG want to get it into the Linux kernel. Getting your project merged into the Linux kernel is a big deal, but it’s also hard work. It’s a large code base. And because of what it is, it has lots of rules about how things should be named and how code should be structured. Luckily, Yann doesn’t have to take this on at all. The engineers at LG lead the effort.

Yann: But even for them it’s hard. I see the multiple rounds where they get rejected and have to retry. And so as long as it’s not done, there is always a reason why it could not pass the bar. And I don’t know which reason it’ll be next time. It could be something as simple that yes, the name of your function do not completely respect the naming convention for this part of the kernel. And all of these things they look like secondary, but I totally understand that from the maintainer perspective, it’s important. That’s what makes them able to maintain this big pile of code. So it takes really a lot of time to get integrated.

I think it took a year. And at that point I realized that no one around me would understand what I was saying. So I feel a bit alone. So yeah, for me it was just a great achievement, but it seems that from this community it was a bit more than that. We have a French word for that, [foreign language 00:35:35]. And I’m not sure what the word in English. That’s where a knight become a knight.

Adam: Oh, it’s just knighted in English.

Yann: Knighted.

Adam: Yeah. It’s a simpler word I guess. Yeah.

Yann: Yeah. It’s more direct. I think it’s more or less what happened from their point of view. And now I was no longer just an amateur doing something and having strangely some success. It felt like anointed. It shifted my perspective that, oh by the way, it can be maybe a bit more than just a hobby.

LZ4 vs ZLib

Adam: So LZ4 keeps spreading, but also Yann gets to hear from people who evaluate LZ4 for their project and then reject it.

Yann: Remember databases system which typically cut data into blocks and compresses each block individually, for some of them would say, “Yeah, we absolutely need speed to the point that we don’t compress so LZ4 is great.” And some of them would say, “No, no, no. We compress with ZLib and it matters because we are installed by our customer. We cannot tell them, no, you are going to need 20% more storage. We do need compression, we do need speed, but we have made the choice that compression really matters so we are already using ZLib.” And ZLib is really the standard compression library that everybody knows and everybody use.

Adam: ZLib uses the DEFLATE algorithm mentioned earlier, same as zip files or .gz files. It’s old, it works, it’s everywhere.
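
Note: the block-by-block scheme Yann describes can be sketched with Python's standard zlib module. This is only an illustration, not code from the episode or from any particular database; the block size, the sample rows, and the helper names compress_blocks and read_block are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size; real databases pick their own


def compress_blocks(data: bytes, level: int = 6) -> list:
    """Split data into fixed-size blocks and compress each one on its own."""
    return [
        zlib.compress(data[start:start + BLOCK_SIZE], level)
        for start in range(0, len(data), BLOCK_SIZE)
    ]


def read_block(blocks: list, index: int) -> bytes:
    """Decompress a single block without touching any of the others."""
    return zlib.decompress(blocks[index])


if __name__ == "__main__":
    # Made-up, repetitive rows standing in for database content.
    data = b"row: id=42 name=example status=active\n" * 1000
    blocks = compress_blocks(data)
    stored = sum(len(block) for block in blocks)
    print(f"{len(data)} bytes stored as {stored} bytes in {len(blocks)} blocks")
    # Reading everything back, one block at a time, restores the input.
    assert b"".join(read_block(blocks, i) for i in range(len(blocks))) == data
```

Compressing each block independently is what keeps random reads cheap: only the block being read has to be decompressed. The trade-off is that the compressor cannot reuse patterns seen in earlier blocks.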

Yann: I mean, any project who has accepted the speed cost of ZLib would not transition to LZ4. And it’s actually the majority of the addressable use cases, they use ZLib. And therefore it makes LZ4 something nice, kind of a niche product for a niche use case. Not a small one, but still a niche. But the bigger majority of data compression remain inaccessible to LZ4 because it’s too light.

Adam: But back in his learning days, Yann had created a number of compression algorithms. LZ4 was the fastest, but there were others.

Yann: I made several ones actually, which would be quite competitive with ZLib. Let’s use one of them. Now I understand the open sourcing process, what has happened, which is way more than data compression. I think in term of proportion, LZ4 is such a simple thing that the proportion of data compression algorithm itself, that’s like 10% of the effort and everything else is about open sourcing it properly. So I think it was a good choice to start with that because there is so much to learn. But now I have that. I understand opensource. Can I take one of these old algorithm that I made many years before and bring it to the opensource community as a kind of competitor to ZLib? And so I would start to work on this idea. And I’m fairly convinced that yeah, it can work. I can have a compressor which is at least as good as ZLib in term of compression ratio, but much faster. And I was already convinced that a lot of users would like that because they were looking for more speed. They just were not willing to accept losing compression ratio in the process.

Adam: So the path to something better than ZLib seemed clear. LZ4 does half of what ZLib does but really fast. And the missing part is Huffman coding. A Huffman code is replacing common strings with a shorter prefix. If I just replace the text CoRecursive on the podcast website with CR and then put CR equals CoRecursive somewhere as a legend and then replace Adam Gordon Bell with AGB and so on with all the most common words, I can use less characters. That is something like a Huffman code.
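
Note: Adam's legend analogy gives the intuition. In a textbook Huffman code the same idea is applied to individual symbols: each one gets a bit pattern, and the more frequent the symbol, the shorter the pattern. The sketch below is a minimal teaching version in Python, not the implementation used by ZLib or by Yann's libraries; the function names are assumptions made for the example.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict:
    """Build a prefix-free bit code where frequent symbols get shorter codes."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit to be written down.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker means the dictionaries are never compared.
    heap = [
        (weight, index, {symbol: ""})
        for index, (symbol, weight) in enumerate(counts.items())
    ]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols gain one leading bit.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {symbol: "0" + code for symbol, code in left.items()}
        merged.update({symbol: "1" + code for symbol, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text: str, codes: dict) -> str:
    """Concatenate the bit pattern of every symbol."""
    return "".join(codes[symbol] for symbol in text)


def decode(bits: str, codes: dict) -> str:
    """Read bits until they match a code; prefix-freeness makes this unambiguous."""
    reverse = {code: symbol for symbol, code in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            decoded.append(reverse[current])
            current = ""
    return "".join(decoded)


if __name__ == "__main__":
    text = "corecursive corecursive adam gordon bell"
    codes = huffman_codes(text)
    bits = encode(text, codes)
    print(f"{len(text) * 8} bits as plain bytes, {len(bits)} bits Huffman-coded")
    assert decode(bits, codes) == text
```

The bit strings are kept as Python text here for readability; a real encoder packs them into bytes and also has to store the code table, much like the legend in Adam's example.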

But if all Yann does is re-implement this idea, it doesn’t feel like enough. What Yann wants to do is something bigger. He wants to fulfill this goal that he had when he was a kid working on his calculator game.

Inventing an Entropy Encoder

Yann: The sentence when I grow up, I will be an inventor, I think became more and more something in my mind. I really admire the people who can invent stuff. And that feels magical because we just get the final product of many, many years of iteration of a thought. And that’s what I wanted to do.

Adam: And so Yann starts work on his new entropy encoder, pulling in ideas from his earlier research, incorporating things he’s learned about performance. Because he needs speed and compression ratio. That’s his focus. And he does this in the open on GitHub in his Finite State Entropy repo.

Yann: And so I’m releasing finite state entropy in 2014 I think, something like that, and it works. It’s actually even faster than Huffman and it has the compression ratio of arithmetic. And that’s kind of a landmark now because a new entropy coder, this hasn’t happened in 30 years or so. Even 40. It was a long time ago. So that part satisfies, I would say the inventor in me. I created something really new and it’s not just new, it’s actually very efficient and I can use it for my project of bringing a competitor to ZLib. And so all these ideas start to get together. And since I’m doing that in the open as an opensource, everybody sees that. It’s kind of the obvious next move. And so that’s how we reach the end of 2014. And I decide to release the first version of ZStandard as a technological demo. And it brings what it’s supposed to, which means compression ratio similar to ZLib, but way better speed.
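
Note: "the compression ratio of arithmetic" refers to a known limit of Huffman coding. A Huffman code spends a whole number of bits on every symbol, so it can never go below one bit per symbol, while arithmetic coding can approach the Shannon entropy of the data. The Python sketch below only measures that gap on made-up data; it is not finite state entropy itself, and the function name is an assumption made for the example.

```python
import math
from collections import Counter


def shannon_entropy_bits(data: bytes) -> float:
    """Average bits per symbol an ideal entropy coder would need."""
    counts = Counter(data)
    total = len(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # A heavily skewed source: 90% of the symbols are the same byte.
    data = b"a" * 90 + b"b" * 10
    entropy = shannon_entropy_bits(data)
    # With two distinct symbols, a Huffman code assigns one bit to each.
    huffman_bits_per_symbol = 1.0
    print(f"entropy: {entropy:.3f} bits per symbol")
    print(f"Huffman: {huffman_bits_per_symbol:.3f} bits per symbol")
```

On a source this skewed the entropy is a little under half a bit per symbol, so a coder that reaches it needs less than half the output of a one-bit-per-symbol Huffman code.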

Adam: In a simple benchmark, Yann’s project compresses a bit better than ZLib, but three times faster. Three times faster. He calls it ZStandard. And now he has people’s attention. Because compression is happening everywhere computers are involved, but especially at data centers where the big companies are hosting all their servers. Hence Google being initially ahead of him. But this three times improvement, if it can be deployed, it just gave every cloud provider a way to save lots of servers, a way to save lots of money. And so now Yann starts becoming popular. Everyone’s reaching out to him.

FANG Comes Calling

Yann: So I would say every major player in the Silicon Valley that you can think of.

Adam: Google, Facebook, Apple, Microsoft.

Yann: Yeah.

Adam: Amazon.

Yann: Yeah. Yeah. Not Amazon. And also other players which are less known. Smaller companies.

Adam: Also at this time the HBO show Silicon Valley had just come out. It’s a great show and it’s literally about somebody discovering a better compression algorithm.

Yann: Some of my colleagues talked to me about that. It was what is this thing? And yeah, they are talking about data compression and it seems to be a big series in Silicon Valley so maybe it made data compression sexy. I don’t know. Anyway, whether there is a consequence or not, in 2015 I end up receiving really a lot of job offers, all about data compression in the Silicon Valley. And that’s a moment to decide. Basically it’s a one time opportunity. Either I play it safe, I keep my safe job in Paris and I try to continue doing data compression on the side or I go in fully and I fully invest my time in data compression. And the thing is, I also understand at that point that LZ4 was kind of a leaky charge. It was possible because it was simple, but this one is not simple.

This is a really different level of complexity now. And hoping to do that on the side working two or three nights per week, that’s not going to work. On top of that, I am going to be a father. So I’m waiting for my son and this is going to eat into this budget even more to the point that probably there will be nothing left. So the question is do I do it? I mean it was clear that if I selected one, I would have to give up the other. If I was to stay in marketing I would have to really go all in on the politics game, which leave not a lot of time for programming and the other side is even more clear. So it was an important choice to do.

Adam: In the data compression versus project management debate, compression has at least one very big plus on its side.

Yann: Generally speaking, data compression I think is the only thing I did in my life which has been useful. Everything else I tried to do was always for the benefit of some kind of middle manager who wants to show off to his own manager who wants to show off to … And end up being not useful. Just abandoned or no impact. Well I’ve got a few contrary examples, but they are far in between. Data compression is really the first time I bring something to the world and I know it’s useful.

Interviewing at Big Tech

Adam: And so Yann, he heads to the Bay Area to start interviewing.

Yann: I was very clear that my intention was to bring Zstandard as an opensource project. Fully opensource, everyone can use. No string attached. And not every employer was thrilled about that. Some wanted to keep that completely in house.

Adam: That’s Apple I’m assuming.

Yann: I can’t give name.

And really Meta, which was called Facebook at the time, was very clear that they do have a culture of opensource. They were fully aligned on this objective and this was of great importance to me. That was the reason I was doing all of this. So I went with them and I think it was a good choice.

Adam: So Yann takes the job at Meta or Facebook then. There’s just one problem though.

Yann: If there was one moment where this story could go wrong, it’s exactly this transition between marketing to programmer. Because we think about it, it’s a pretty big transition. Especially I think I was 42 at the time, so I had been an amateur for a few years, going straight into the big league and this is Silicon Valley area. That sounds like a stretch. So there’s no guarantee that it will end well.

Adam: Facebook has a bootcamp process for developers joining the team. Yann doesn’t have his visa situation worked out yet, but for the bootcamp it doesn’t matter.

Facebook Bootcamp

Yann: Yeah, the bootcamp period, we do have almost a hotel. It’s a very small apartment that’s not too far from the campus. And so yeah, I stayed in Redwood City probably.

Adam: Yann is used to working on Windows in Visual Studio, but Meta is different. He needs to learn how to work on remote servers and how to develop on Linux.

Yann: So that’s also a learning experience, how to use VI in order to code, in order to have your plugins. And at the end it’s a lot of small tools and small way of thinking. None of them looks particularly terrible. It’s just that there are a lot of them. And so it takes time to not just be blocked on the stupid thing.

Adam: After Yann figures out his environment, which does take him a little bit longer than most, he has to take on some small tasks.

Yann: We have to reach our teams which have actual code in production and there is a small thing to do. It’s not complex for them, but for bootcamper there is a ton of context to acquire. Why are we even doing that? I remember one of the easier one was just a matter of processing a file in Python. Say what’s the problem with that? Well the initial code I had to fix was just loading the file and processing it. But if the file is huge then it costs too much memory just to load and process. And so I had to change that so that we would stream small amounts of data and process them piece by piece, so the memory budget would stay low. It’s a small exercise, but it’s actually a real software, really in production so that means it goes through a whole validation cycle that we get to discover firsthand and that was pretty useful to do.
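
Note: the kind of change Yann describes, from loading a whole file to streaming it, might look like the Python sketch below. The actual bootcamp task is not described in the episode, so the task (counting lines that contain a word) and both function names are hypothetical.

```python
def count_matching_lines_load_all(path: str, needle: str) -> int:
    """Before: read the whole file into memory, then process it."""
    with open(path, "r", encoding="utf-8") as handle:
        lines = handle.read().splitlines()  # memory grows with file size
    return sum(1 for line in lines if needle in line)


def count_matching_lines_streaming(path: str, needle: str) -> int:
    """After: process one line at a time so memory use stays small."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:  # the file object yields lines lazily
            if needle in line:
                count += 1
    return count
```

Both functions return the same answer. The difference is that the streaming version only ever holds one line in memory, no matter how large the file is.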

Adjusting To the Big League

Adam: So Yann gets through bootcamp and he starts the data compression team at Facebook. He gets an intern and they start working away. But Yann’s having a hard time shaking the feeling that he’s out of his depth.

Yann: I mean even my first intern looked like a genius. So I’m surrounded by genius people. What am I doing here? It takes time just to accept that yeah, this is a place. Because the point is not to be better than everybody else around at everything. That doesn’t make sense. There is an infinite amount of things. Point is to be useful at something valuable so that we meet our peers, we can help them concretely. And that demands knowledge, but also ability to learn to evolve.

Adam: As Yann works to figure out how to work at Facebook, as he works to build ZStandard from a demo into a real library, he also has a son. And so his life is changing really fast and then out of nowhere a competitor appears.

Yann: So when my son was actually born and I was on paternity leave, I learned about Google open sourcing their own algorithm, Brotli, and it’s not exactly targeting the same thing, but it’s close enough I would say. It’s a modern take of their ZLib. It’s more focused on the web and therefore, once again, it’s more focused on compression ratio. But it’s not bad on speed. But Brotli was in the making for several years. I had started only a few months before. They were a team. I was alone. It felt like before even reaching my goal it was already lost.

Adam: But Yann persisted and he got help from his compression forum colleagues as well. Przemysław Skibiński.

Yann: But he was really knowledgeable of data compression and he wanted to help. He wanted to contribute and since it’s an opensource project, everybody can contribute. So he was contributing more and more. And by the end of the year we had this idea with my manager at Facebook, why don’t we hire him so that he can fully contribute. And so he received a contract and he can work on the projects throughout 2016. And so that’s why he’s one of the core developer of ZStandard. And his help was very welcome because I was kind of alone so I was a bit stretched.

Releasing ZStandard

Adam: Then in 2016 at Facebook’s At Scale conference, ZStandard was officially released.

Yann: Before it was on my own repository on GitHub. It was considered like a personal project I was authorized to work on. At that point, it become an official Facebook project. It’s a great event. First time I do that. So the event itself, I don’t know what I can say. I think it went well. But what was probably more important is the reception from the community. A lot of people were waiting for this 1.0. One of the metric which show it, which is the number of stars on GitHub, which would be catapulted as the day of the official release.

Adam: In the final comparison, ZStandard beat ZLib in every way. At the same compression ratio, it compresses three to five times faster. At the same compression speed, it results in files that are 10 to 15% smaller. And besides all this, it can decompress twice as fast regardless of compression speed. The only thing even close to ZStandard is Brotli from Google.

ZStandard vs Brotli

Yann: So I mentioned before that Brotli was ready one year before us. They should have reached the market before us. But they developed in C++ and I made a decision to develop in C because I knew from LZ4 that this is a universal API that you can connect to from anywhere. And now you can have Python, you can have Rust, you can have Java. Every language will connect to the C library. And also we have a very good control of our memory allocation, which matters for embedded for kernel.

And Brotli people understood that, but a bit too late.

So they start by releasing something in C++ and then they convert it back to C because they understood the importance and it took them a whole year to get that done. And that’s part of something you cannot guess if you’re not exposed to opensource deployment.

Adam: So the LZ4 work of understanding opensource and easy adoption paid off for Yann. It only took months for ZStandard to spread.

Yann: I would say by January most Linux distribution have ZStandard in their package repository. And of course internal Facebook product use it. And more and more opensource project use it. The adoption is really very fast. And that’s where I understand that the knowledge, the opensource process and restriction I learned while I’m doing LZ4, that’s what gets reemploy in ZStandard and what led to such a fast adoption.

ZStandard’s Impact at Meta and Everywhere

Adam: If ZStandard is three to five times faster, then that’s three to five times less CPUs spent on compression everywhere that made the switch. And like LZ4, ZStandard also opened up new frontiers. It has dictionary compression, which allows ZStandard to compress things that are really small using aggregate statistical patterns. This is a big deal at places like Meta and that’s not even counting secondary impacts. Internally at Meta, the impact has been massive.
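
Note: the idea behind dictionary compression can be tried with the preset dictionary support (the zdict parameter) in Python's standard zlib module. This is not ZStandard's own dictionary API, and ZStandard dictionaries are trained on samples rather than written by hand; the record layout, the dictionary contents, and the helper names below are assumptions made for the example.

```python
import zlib
from typing import Optional

# Hypothetical dictionary: the boilerplate that small records tend to share.
SHARED_DICTIONARY = (
    b'{"user_id": 0, "event": "page_view", "country": "FR", "device": "mobile"}'
)


def compress_small(message: bytes, dictionary: Optional[bytes] = None) -> bytes:
    """Compress one small message, optionally priming the compressor."""
    if dictionary is None:
        compressor = zlib.compressobj()
    else:
        compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(message) + compressor.flush()


def decompress_small(blob: bytes, dictionary: Optional[bytes] = None) -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    if dictionary is None:
        decompressor = zlib.decompressobj()
    else:
        decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


if __name__ == "__main__":
    message = (
        b'{"user_id": 4973, "event": "click", "country": "FR", "device": "mobile"}'
    )
    plain = compress_small(message)
    primed = compress_small(message, SHARED_DICTIONARY)
    print(f"original: {len(message)} bytes")
    print(f"compressed without dictionary: {len(plain)} bytes")
    print(f"compressed with dictionary: {len(primed)} bytes")
    assert decompress_small(primed, SHARED_DICTIONARY) == message
```

A message this small contains almost no repetition of its own, so a compressor starting from nothing has little to work with. The dictionary supplies the shared patterns up front, which is why both sides need the same dictionary.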

Yann: So typically we do have caches in the system so that data can be delivered faster because it’s closer. But that of course you need equipment and that costs money. So if you can compress more in these caches, you actually have more data closer to the customer. And there are two benefits to that. First, the customer is served faster so its experience is smoother. So it drive engagement. People would use the application more because it’s more pleasant. And on the other side, there are a ton of events in the network which do not need to happen because we do not have to go deep into the database and storage system because we don’t need to. It’s right there. And cumulatively these two effects, they are way more than the saving itself from the data compression.

So that shows that yes, there are primary savings, but sometimes the secondary effect is really important too and worth measuring.

Adam: Secondary, primary, tertiary. However you measure it, the impact has been huge.

Yann: Let’s go back to this idea that data compression can be invisible and still have fairly big impact both on storage and transmission because we also send more data. So once something is used almost everywhere, the amount of data processed is staggering. These are numbers we are not used to. There are so many zeros, we don’t talk in this kind of range normally.

So that’s insane.

How to Have an Impactful Life

Adam: This is the wildest story I’ve ever heard in terms of impact. A marketing professional, a project manager, a couple evenings a week after he bikes home from his job in Paris, starts tweaking a calculator game. And by the end of it, Yann has shifted a whole industry’s approach to data compression and saved billions and billions of dollars.

So the obvious question I have for Yann is how? How can you have that much impact in your career? In your hobby?

Yann: I would say the first advice here is don’t do that for the success. Success is too random and too far away. If someone targets success, it will lose stamina way before reaching that point. So do something because you like it. That’s the inner force that will drive you beyond I would say the normal investment that almost everybody can also do. And I think that’s very important one. So now if you are interested in a domain, keep going at it. And it’s really a small effort regularly in the same direction that brings you very far. I understood that from my mother, which had a … How do you call that? An accident. And because of brain injury, she couldn’t move much. Especially her hands were affected.

Adam: But despite her difficulty controlling her hands, she decides that she should write a book.

Yann: When you do have problem with hands, that’s slow. Very slow. Painfully slow. But incredibly enough, in the matter of a few years, she managed to write, publish, sell … I don’t remember the exact number. I think it’s about four books and they are not small ones. That’s something. She did that because she was focused on that. And of course it wasn’t the only thing she do in her life, but almost every day she would make an effort in the same direction and that would make the book closer to the finish line. So that’s I think a pretty big lesson here. Just keep at it. Learn something new every day and after some time, actually it’s a lot. It’s way more than average. That’s what makes you an expert in the field. So you cannot do that effort if you don’t like what you’re doing.

If it’s just for the money, that’s too long. That’s why it’s important to like something.

Outro

Adam: Thank you Yann, for sharing your story. You can find Yann on Twitter as Cyan4973 and also on GitHub and the Encode forums. And thank you to Chip Turner who hired Yann at Meta and set up the data compression team and also reached out to me and said, “Hey, Yann has this crazy story. You should talk to him.”

If you know somebody with a story like that, yeah, send me an email, send me a note. Also, thanks to Meta for your belief in open source and being cool about having Yann talk to me about his experiences.

If you liked this podcast, if you want to help me out, the best thing you can do is just tell other people about it. I know it takes me a while to get out each episode, but if you want more content from me, you should check out my newsletter where I cover similar topics and also follow me on Twitter, @AdamGordonBell, where I often share behind the scenes details of the podcast.

But really to get more content from me, the best thing you can do is join as a supporter. Go to corecursive.com/supporters. It’s in the show notes as well. And as a supporter, you get access to more episodes. I put out a bonus episode each month as well.

Last month’s bonus episode, bonus 17, was inspired by Yann’s journey. I recorded it not long after talking to him. It’s about how you accomplish hard things over time. About two keys that Yann highlights that also were impactful in my life and how I think about building up expertise or accomplishing something big over time. If you’re a supporter, check it out and let me know what you think.

And until next time, and I always say this very sincerely, thank you for listening.
