
Decentralizing AI

This week on “Waking Up With AI,” Anna Gressel looks at how decentralized AI training could revolutionize the field by allowing for the collaborative use of advanced GPUs worldwide, expanding access to model development while raising interesting questions about export controls and regulatory frameworks.


Anna Gressel: Good morning, everyone, and welcome to another episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Anna Gressel, and today it is just me, which is maybe both a blessing and a curse for everyone on the line, because we're going to take on a somewhat technical topic. And I know our regular listeners are probably sitting around waiting for Katherine to poke fun at me for using too much jargon, so you'll just have to imagine her on the line, laughing at our highly technical concepts today. But I think it's a topic that is really interesting and worth exploring for a moment, which is decentralized AI training. We'll talk about what that means. But to take a step back: why are we even talking about training, and why are we talking about this alternative method of training? That's really the question of the day.

But to ground ourselves for a moment, it's a common refrain in AI circles that model progress happens on three different fronts, and those three fronts are compute, data and algorithms. We've taken this on in the past and talked about these pillars of AI model development on prior podcast episodes, so I'm not going to get into a huge amount of detail on how each of them plays into model development, but we're happy to do that if someone thinks we've under-covered the topic. Each of these pillars is, in a sense, critical to AI training. But as we discussed with the DeepSeek model in particular, which was a really big model release that raised a lot of questions around the AI training space, there are really important questions at play about where exactly the competitive edge is going to come from for future AI development. And everyone is looking for that competitive edge now. Everyone is wondering what is going to give the most advanced models the most competitive undergirding and the most competitive set of levers. It's possible, when you think about it, that some models are going to outcompete because they have a particularly powerful model architecture. We see this a little bit with some of the advances in Mixture of Experts models: they're able to do very advanced things, even though they may activate fewer parameters per token than other large models that aren't Mixture of Experts models. Other models are going to be particularly competitive because of the sheer volume of training data and the amount of compute required to train them; those are the really, really high parameter models. Right now, all of these levers, the compute, the data and the algorithms, are really important, and we're seeing companies experiment with how to move them to get the best outcomes, sometimes at lower cost or with a faster development cycle. So there's a lot of focus right now on how to actually come up with the most competitive, most advanced models.

So let's talk about one of these for a moment, which is compute. Compute is best exemplified by companies' race to acquire the most highly performant Nvidia chips, for example. And many of our listeners will know that those are subject to pretty stringent export controls, so the ability to acquire Nvidia chips is not universal; it's not distributed equally. That brings us to the really important question of whether truly powerful AI models can be trained through alternative methods that don't require as much compute infrastructure, meaning as many of those most powerful, most advanced chips. There are many alternatives, but one possibility is to use compute infrastructure that is distributed. What I mean by distributed here is that the computing power, those chips, or the data centers that house those chips, may not need to be all in one place or even all owned by the same entities. The approach people are working on to get to that kind of distributed paradigm is called distributed AI training, or sometimes decentralized AI training, whereby major AI models could be trained using data centers located around the globe rather than all in one place. That's really exciting when you think about different kinds of paradigms for getting models that are highly performant, trained quickly and on potentially less advanced chip and computing infrastructure. A few companies have been working on this, and I want to call them out. One is called Prime Intellect, which some folks listening may not have heard of. Another, of course, is Google DeepMind, which many people have heard of. They have been in the news recently for novel approaches to using compute to train models on a decentralized basis. And Prime Intellect is pioneering what they're calling a collaborative, decentralized approach to model training, where GPUs owned by developers across the globe can pool their individually limited resources to collaborate over the internet and train a model together.
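To make that idea concrete, here is a minimal, hypothetical sketch in Python, using NumPy and toy data rather than any company's actual code. Several simulated "workers" each compute a gradient on their own shard of the data, and those gradients are averaged before every parameter update. This is the basic data-parallel pattern that large multi-GPU training runs build on, just shrunk down to a toy regression problem.

```python
# Illustrative sketch of synchronous data-parallel training: each
# "worker" computes a gradient on its own data shard, and the gradients
# are averaged before every parameter update. Toy example only, not any
# lab's actual training code.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = X @ w_true + noise.
w_true = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(1200, 3))
y = X @ w_true + 0.01 * rng.normal(size=1200)

# Split the data across four simulated workers (think: four GPUs,
# possibly sitting in four different data centers).
num_workers = 4
X_shards = np.array_split(X, num_workers)
y_shards = np.array_split(y, num_workers)

def local_gradient(w, X_local, y_local):
    """Mean-squared-error gradient computed on one worker's shard."""
    residual = X_local @ w - y_local
    return 2.0 * X_local.T @ residual / len(y_local)

w = np.zeros(3)
lr = 0.05
for step in range(200):
    # Every worker computes a gradient on its own shard...
    grads = [local_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # ...and the gradients are averaged before the shared update.
    w -= lr * np.mean(grads, axis=0)

print("recovered weights:", np.round(w, 3))  # close to w_true
```

In a real training run, that averaging step is a network operation between GPUs, which is why the speed and location of the links between chips matter so much.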

Let's unpack for a moment why this is actually pretty cool, with a brief frolic into GPUs, which are an important part of the compute story here. Some of our listeners, maybe most of our listeners, will know that GPUs were originally developed largely for rendering computer graphics in video games, a task that demanded specialized compute resources. So long before ChatGPT, Nvidia was a household name among gamers. There were specialized gaming laptops and gaming computers, and if you know someone who had one of those, they probably had a pretty cool, pretty advanced GPU in there, but not a GPU at the level of the advanced AI chips that exist today, the ones that are subject to export controls. So advanced, but not the most highly advanced. Modern training labs, the labs used to train AI models, tend to run on those most powerful GPUs, such as Nvidia's H100 chips, and those are housed in massive data centers. But building those data centers and co-locating that many advanced chips is extremely expensive. There are often supply constraints, and there can be complex engineering challenges involved in architecting those data centers and getting them to run the way they need to run to train models at the scale required to create some of the most highly performant AI models.

So Prime Intellect, among others, has posed a question: what if you could collaboratively use some of those advanced gaming GPUs, or other kinds of GPUs that aren't the most advanced H100s, that exist all across the globe, in gaming PCs or smaller data centers, to bypass that high barrier to entry for frontier model training that otherwise requires even more sophisticated chips? And that's a world we're heading closer to. It's a world where the doors to frontier model training are suddenly open to smaller developers, to companies with fewer tech resources and to the open source community, which often wants to do this kind of model training collaboratively. One of the challenges in making this collaborative training process happen is getting the GPUs to communicate really effectively with each other. And that's not just communication between chips in the same data center. With collaboration, we're talking about communication between different data centers in vastly different geographical areas, possibly across continents. That is a hurdle, but it's a hurdle that major AI labs are focused on and seem to have crossed, because some of the frontier models emerging from those labs are trained across globally distributed data centers. What has been missing, or had been missing, and again, we saw Prime Intellect working on this, is a much more decentralized or open source way of undertaking this kind of globally distributed training process, where multiple companies could come together, pool their resources and build something that is larger than the sum of its parts.
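A key practical point is that the naive pattern sketched earlier requires the workers to exchange gradients at every single step, which is fine inside one data center but becomes a serious bottleneck between data centers on different continents. One widely discussed family of techniques, in the spirit of local SGD or federated averaging, has each site train independently for many steps and only occasionally synchronize parameters, cutting communication dramatically. Here is a toy sketch of that idea; the data, hyperparameters and structure are illustrative assumptions, not a description of any lab's production system.

```python
# Illustrative sketch of low-communication training in the spirit of
# local SGD / federated averaging: each site runs many optimizer steps
# on its own shard, and the sites only occasionally average parameters.
# Communication happens once per round instead of once per step, which
# is what makes cross-data-center training plausible. Toy example only.
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([2.0, -3.0, 0.5])
X = rng.normal(size=(1200, 3))
y = X @ w_true + 0.01 * rng.normal(size=1200)

num_sites = 4
X_shards = np.array_split(X, num_sites)
y_shards = np.array_split(y, num_sites)

def grad(w, X_local, y_local):
    """Mean-squared-error gradient on one site's shard."""
    residual = X_local @ w - y_local
    return 2.0 * X_local.T @ residual / len(y_local)

w_global = np.zeros(3)
lr = 0.05
local_steps = 20   # optimizer steps taken between synchronizations
rounds = 10        # number of cross-site synchronizations

for r in range(rounds):
    site_weights = []
    for Xs, ys in zip(X_shards, y_shards):
        # Each site starts from the last agreed-upon parameters...
        w_local = w_global.copy()
        # ...and trains independently with no network traffic at all.
        for _ in range(local_steps):
            w_local -= lr * grad(w_local, Xs, ys)
        site_weights.append(w_local)
    # Only now do the sites communicate: average the parameters.
    w_global = np.mean(site_weights, axis=0)

print("recovered weights:", np.round(w_global, 3))  # still close to w_true
```

The trade-off is that the sites drift apart between synchronizations, so much of the research effort goes into how often to synchronize and how to merge the diverged copies without losing model quality.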

And Prime Intellect recently announced that they completed the first globally decentralized reinforcement learning training run of a 32 billion parameter model, a training run where anyone could permissionlessly contribute their compute resources. That basically meant anyone across the globe could help with model training by volunteering their GPUs for compute purposes. So that's a pretty big development. That said, of course, even 32 billion parameters is smaller than the largest frontier models by quite a lot. But it is worth noting that companies across the globe are working on this decentralized, distributed paradigm, and now we're seeing these jumps happen in terms of the ability to do it collaboratively across jurisdictions. This is important, and I want to talk about why. It's important from a global perspective because, if Prime Intellect is right and you really can have highly sophisticated models trained on a global basis, that could allow consortia, including universities or nonprofits, to come together and collaborate on advanced model development in a way that has not yet been possible. And we might even see less powerful chips used to train very powerful models. That was essentially the question raised by DeepSeek: could you use less powerful chips to train a highly, highly capable model? Now this other question comes into play: could that be done in a decentralized way, rather than by one company that has amassed a lot of specific resources within its own control?
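To give a flavor of what "permissionless contribution" can mean at a protocol level, here is a purely hypothetical sketch, emphatically not Prime Intellect's actual system or code: a coordinator publishes the current model, volunteers pull it, do some local work (a toy stand-in for generating reinforcement learning rollouts on their own GPUs) and push results back for aggregation. Every name and number below is invented for illustration.

```python
# Hypothetical sketch of a "permissionless contribution" loop: a
# coordinator publishes the current parameters, volunteers pull them,
# do local work (here, scoring perturbed candidates on a toy objective,
# a stand-in for generating RL rollouts) and push results back. This is
# an illustration of the general pattern only, not any real protocol.
import numpy as np

rng = np.random.default_rng(2)

def reward(params):
    """Toy objective standing in for an RL reward signal."""
    target = np.array([1.0, -2.0])
    return -np.sum((params - target) ** 2)

class Coordinator:
    def __init__(self):
        self.params = np.zeros(2)   # current "policy"

    def publish(self):
        return self.params.copy()

    def aggregate(self, contributions):
        # Simple evolution-strategies-style update: move toward the
        # contributions that reported the highest reward.
        contributions.sort(key=lambda c: c["reward"], reverse=True)
        best = np.mean([c["params"] for c in contributions[:3]], axis=0)
        self.params = 0.5 * self.params + 0.5 * best

def volunteer_work(params, rng):
    """One volunteer perturbs the published policy and scores it locally."""
    candidate = params + 0.3 * rng.normal(size=params.shape)
    return {"params": candidate, "reward": reward(candidate)}

coordinator = Coordinator()
for round_idx in range(30):
    published = coordinator.publish()
    # Anyone with spare compute can join a round and contribute work.
    contributions = [volunteer_work(published, rng) for _ in range(8)]
    coordinator.aggregate(contributions)

print("final params:", np.round(coordinator.params, 2),
      "reward:", round(float(reward(coordinator.params)), 3))
```

A real system layered on this pattern also has to verify that contributed work is genuine and guard against bad actors, which is one of the harder engineering problems in open, permissionless training runs.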

Now, distributed and decentralized training may also raise questions about the efficacy and power of the current export control regulatory framework, and of other regulatory frameworks that might cover chips and other AI technologies. And we're seeing some of this play out right at this moment. So, for example, the Department of Commerce has issued a new rule called the “AI Diffusion Rule,” which was effective on its release on January 15th, 2025, but compliance wasn't required until May 15th, 2025. As we're recording this podcast, that date is just around the bend. That rule enhances and refines the Export Administration Regulations, or EAR, framework to regulate the global diffusion of the most advanced AI models and large clusters of advanced computing chips. It essentially imposes a new license requirement for moving any controlled chips, such as Nvidia H100s, or any newly controlled advanced model weights. And moving, under this definition, means exporting, re-exporting or transferring. The framework divides the world into three tiers and imposes license requirements, restrictions on chip allocation and advanced model weight storage requirements depending, essentially, on the destination country. So this is a pretty complex and sophisticated rule. But at its core, it's not really trying to prevent the world from advancing AI development; it is focused on limiting or preventing the diffusion of frontier AI model development in certain parts of the world. And when we say frontier models, we really mean the most highly capable, most advanced models. There are specific carve-outs for open weight models, as long as those models aren't at the absolute frontier, you know, on par with frontier model capabilities. And there's kind of a complex way of thinking about that.

So right now, there are interesting questions raised by that kind of rule and how it might apply to decentralized models in the future. Again, we're not seeing decentralized models today that are on par with the most advanced frontier models; they're still pretty far away from that. But as a thought exercise, it'd be interesting to think through how such a diffusion rule could even apply to a training process that is truly decentralized but results in the creation of a frontier model. That hasn't happened yet, but these are all really interesting questions to think through as these different kinds of methods advance. And then the perennial question in the AI space is, how do regulations map onto the lightning-speed advances we tend to see, whether they're advances in model architecture, in how compute is distributed, in how data is obtained and distributed or, again, in how all of this comes together in a kind of complex recipe to create really advanced technology? Lots of interesting things to dig into. We'll of course keep our eyes on this decentralized and distributed training space. And with that, I think that's all the time we have today, folks. I'm Anna Gressel. Make sure to like and share the podcast. And you know, we still have some of those “Waking Up With AI” mugs, so email me if you want one, and we'll get it over to you. Thanks, everyone.

