A chance encounter with open source software during a high-school internship put Dan Williams on a path to becoming Principal Engineer on the team responsible for Intel’s persistent memory (PMEM) enabling in the Linux* kernel.
Here he talks about the “janitorial” aspect of being a maintainer, what he’s seen in 17 years of commits, his take on the discontinuation of the Intel Optane* line, and his new role as chair of the Technical Advisory Board (TAB) at the Linux Foundation*.
Q: Becoming a Linux Kernel Maintainer seems like an incredibly geeky role. Just how geeky are you?
A: I’m pretty geeky, but I’m also a dad, so I don’t get to geek out as much anymore. My wife and I were competitive Latin ballroom dancers in college and went to various competitions around the country. When “Star Wars: Episode I – The Phantom Menace” came out, I watched it like five times in the theater, back-to-back, until I realized it wasn’t a very good movie. But I was very excited about having a new Star Wars movie.
Q: People get into open source by seeing a problem and trying to fix it themselves - ‘scratching an itch.’ What itch did you scratch to become a Linux maintainer?
A: What got me into Linux was a high-school internship at an air-bag plant. I was in the IT department, and somebody showed me Linux for the first time, and he said, “and it’s open source!”
I had no idea what open source meant. But at that job, we had storage servers with these big, expensive RAID cards. I would have liked that kind of setup at my house, but I didn’t have money for a RAID card. Then I saw that Ingo Molnár had posted patches for the 2.2 kernel to add software RAID support, and I realized, “I can just pull this code off the internet, add it to my computer, and have a RAID array!” That power of being able to find code on the internet and, if I wanted to, change it and add new capabilities to my system - that was it, I was hooked.
Q: You’ve tried to dispel the notion that maintainers hold all the power, prestige, and influence, and encouraged people to explore the role of reviewer alongside those of maintainer and developer. What does it take to build a healthy community, and is there a good mix among maintainers, reviewers, and developers?
A: People see maintainers making decisions and assume they’re calling all the shots, but really the main service a maintainer provides is “janitorial” - you’re there to keep your subsystem clean, keep it operating, and be a reviewer of last resort. But if you have a really healthy subsystem, you are creating an economy of people trading reviews to get their work in. So, people are all self-interested in Linux - you want to take people’s self-interest and divert it into helping others, so it turns into a community.
In general, the Linux community always needs more reviewers. Maintainers can’t get away from reviewing because they’re in that last-resort role, but increasingly the complexity and the number of submissions are stressing the ability of the maintainer community to scale. We always need more reviewers. But more importantly, we need companies that care about Linux, that care about their employees doing reviews. It can’t be the case that your company is just pushing features and hoping that the community will review them for you. Like I said before, it’s an economy. It’s a ‘give and take,’ and if you are always in ‘take’ mode, it gets noticed.
Q: The open source world was typically viewed as a hobbyist pursuit but has evolved into a more commercial ecosystem. Is that the biggest challenge that you see?
A: The long-term health of Linux is keeping alive that spirit of people caring, while also valuing the rigor of doing things right and the rigor of code review. I think the trick is educating people on the value of investing in that for their long-term career and long-term influence. The near-term pain of ‘eating your vegetables’ and doing code reviews actually pays off long term, in your influence and the trust you’ve built in the community, because the entire thing operates on trust. Will you be there if you make a mistake? That’s more important than not making mistakes.
Q: How can we incentivize company management structures to prioritize this work for developers?
A: It’s a conversation. It’s people demonstrating that “Hey look, this community decision was influenced because the community had so much trust in me (or another Intel employee), as a member of the community.” I do think Intel has done a good job in letting people wear their community hat proudly while also wearing their Intel hat. At times we’ve lost sight of that, and the associated pain has been there, but we’re in a good place right now. I’m in a situation where my entire organizational chain all the way up to the CEO understands open source, so that’s a place that a lot of people would be envious of. You still have to deal with parts of Intel that are earlier on their open source journey.
Q: The TAB provides the Linux kernel community a direct voice into The Linux Foundation’s activities and fosters bi-directional interaction with application developers, end users, and Linux companies.
What sorts of advice have you been giving in your role on the TAB?
A: The Advisory Board is interesting because we say that we don’t have any real power. But what we have is the collective influence of the people who are members of the TAB, and I’ve brought whatever influence I can to the issues that come before the TAB. We got involved with discussions about the code of conduct. We got involved with discussions around inclusive language. It helps to have a body that is basically tasked with keeping an eye on the health of the community and working on joint statements, documents, and advice when the need arises.
Q: How do you see your role changing as you take on the role of chair of the TAB?
A: The chair organizes the meetings. But the chair also sets the tone - keeping any conflict within the TAB productive, if conflict is needed at all, keeping it on task, and making sure that all voices get heard. I’m following in the footsteps of TAB chairs like Grant Likely, who is now CTO of Linaro*, and Chris Mason, who is influential at Meta* - making sure everybody’s heard and leaning on the principle of doing what’s best for the long-term health of the Linux kernel. Whenever problems arise, if you can keep it technical, or keep it on the project and not on the personalities, things tend to resolve. It’s basically a lot of listening and a lot of patience.
Q: People assume that most of the kernel work you do is related to Intel CPU or hardware enabling, but there’s more to it than that. It’s also ensuring the platform architecture can support the features, right?
A: There's the CPU enabling and that's super important, but my career has really been on the things that are downstream from the CPU - storage adaptors or memory or these things that are not directly x86 core technologies. There’s something we call the platform problem, where people just want to develop their little feature on top of the existing APIs. A successful person, who’s leaning towards ‘maintainership’, peeks below that and says: “Hey, this API is not good enough for the situation and it needs to change here.” If I do my job right, you don't even know I did a job. You buy the hardware; you turn it on, and it works. I’m always aiming for the customer to not know who I am - not looking in git-blame to see who wrote this. That's the goal - to be invisible infrastructure that always works.
Q: You earned your stripes contributing to the kernel through work on persistent memory enabling in the Linux filesystem, and have likened suggesting changes to the filesystem architecture to ‘touching the third rail.’ What could have been done differently?
A: Understanding where Linux is going, getting ahead of where the hardware is going, anticipating problems, and exerting influence inside the company. A lot of times, I’ve been fixing things that were already baked. The silicon’s already shipping. So, we’ll find a problem and say, “Hey, upstream, we know that this fix is unseemly, or we wish we’d done it differently, but we talked to the hardware team, and they said from the next generation forward we’ll do it differently,” and get that commitment before changing the Linux code, just so the community sees that we’re taking care of reducing their burdens long term. That’s the takeaway I would have - continue to push that influence upstream into the hardware development direction as much as possible, before you get into these late fights, when things aren’t working upstream in Linux.
Q: The first commit you made in 2006 is now old enough to see an NC-17 movie. Can you talk us through the changes you’ve seen?
A: We’ve gone, especially on the device side of the CPU, from babysitting the device and asking it, “Hey, transfer this, transfer that,” to an era of “Okay, device, I’ll program you and you send the data over here” - so we had DMA. And now we’re in the same kind of era, but we have tons of DMA and tons of CPUs. The next evolution is what I’m working on now, which is Compute Express Link (CXL). It basically says we can put memory anywhere. It doesn’t have to be directly attached to the CPU. It can be attached across a PCIe link, and going further, that link can go outside the box - now you can have a big tray of memory at the top of your server rack and things can plug into it.
It’s one of the reasons I came back to Intel from Facebook*. I went to Facebook in 2012, and one of the problems they were talking about was “We want to buy memory and CPUs on a different cadence, but they come together,” and they were coming up with these software solutions to try to disaggregate things. This is not a software problem. This is a hardware problem. This is an Intel-scale problem. I saw the beginnings of that customer need when I was there. But it’s really fun being back at Intel, to actually build it out, see it happen, and figure out what Linux needs to make it work. This is the place to be for watching operating system change, at least from my perspective, because we change the paradigm of what the operating system needs to deal with at a hardware level.
Q: We announced the discontinuation of the Intel Optane* line. How did that impact you?
A: It was sad, I’m not going to lie. A lot of the foundational work for persistent memory was done for [Intel Optane], but it turns out that infrastructure is useful going forward. One of the technologies we developed was something called DAX, or Direct Access. It’s a way to eliminate the buffer cache: you don’t need to cache anything in front of storage when your memory is storage, and you can just talk to it directly. People found use cases for that in virtual machines. The minute Optane was discontinued, we saw some other memory vendors say, “By the way, we have persistent memory now.” So persistent memory is going to be a thing going forward. The work on the infrastructure and the capabilities continues. The impact of Optane will continue in that respect.
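The DAX access model he describes can be illustrated with a short, hedged sketch. This is not Intel’s actual implementation - on real hardware you would mmap a file on a DAX-mounted filesystem (for example, ext4 mounted with `-o dax` on a pmem device) and the loads and stores would bypass the page cache entirely; here an ordinary temp file stands in so the sketch runs anywhere, and only the programming model is being shown:

```python
import mmap
import os
import tempfile

# Sketch of the DAX programming model: map the region into the process
# address space and use plain loads/stores, with no buffer cache in the
# data path. A regular temp file stands in for a pmem-backed file here,
# so this runs anywhere; only the mapping semantics are illustrated.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)              # size the stand-in "persistent" region

with mmap.mmap(fd, 4096) as buf:
    buf[0:5] = b"hello"             # a store straight into the mapping
    data = bytes(buf[0:5])          # a load straight back out

os.close(fd)
os.unlink(path)
print(data.decode())                # prints "hello"
```

The point of the model is that the application addresses the medium directly through the mapping; with actual persistent memory, what you store is durable without a write-back through a cache layer.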
Q: And finally, how can we help engineers and organizations make their lives easier?
A: Socializing the idea that healthy upstream development includes healthy review, so that organizations - not just Intel, but other organizations that contribute to the kernel - prioritize that reputation building and knowledge building for their engineers. And then it’s also engineers getting better at having those conversations early and often: “Hey, this is where Linux is today. This is how much effort it’s going to take to get these features going. Let’s have conversations about improving that.” It’s one thing to shift left and start the software early. You want to shift further than that and get the hardware design right so that things flow better, but that’s an ongoing learning process.