Reinventing AI From Scratch with Yaroslav Bulatov

Yaroslav Bulatov helped build the AI era from the inside, as one of the earliest researchers at both OpenAI and Google Brain. Now he wants to tear it all down and start over. Modern deep learning, he argues, is up to 100x more wasteful than it needs to be - a Frankenstein of hacks designed for the wrong hardware. With a power wall approaching in two years, Yaroslav is leading an open effort to reinvent AI from scratch: no backprop, no legacy assumptions, just the benefit of hindsight and AI agents that compress decades of research into months. Along the way, we dig into why AGI is a "religious question," how a sales guy with no ML background became one of his most productive contributors, and why the Muon optimizer, one of the biggest recent breakthroughs, could only have been discovered by a non-expert.
Timeline
00:12 — Introduction and Yaroslav's background at OpenAI and Google Brain
01:16 — Why deep learning isn't such a good idea
02:03 — The three definitions of AGI: religious, financial, and vibes-based
07:52 — The SAI framework: do we need the term AGI at all?
10:58 — What matters more than AGI: efficiency and refactoring the AI stack
13:28 — Jevons paradox and the coming energy wall
14:49 — The recipe: replaying 70 years of AI with hindsight
17:23 — Memory, energy, and gradient checkpointing
18:34 — Why you can't just optimize the current stack (the recurrent laryngeal nerve analogy)
21:05 — What a redesigned AI might look like: hierarchical message passing
22:31 — Can a small team replicate decades of research?
24:23 — Why non-experts outperform domain specialists
27:42 — The GPT-2 benchmark: what success looks like
29:01 — Ian Goodfellow, Theano, and the origins of TensorFlow
30:12 — The Muon optimizer origin story and beating Google on ImageNet
36:16 — AI coding agents for software engineering and research
40:12 — 10-year outlook and the voice-first workflow
42:23 — Why start with text over multimodality
45:13 — Are AI labs like SSI on the right track?
48:52 — Getting rid of backprop — and maybe math itself
53:57 — The state of ML academia and NeurIPS culture
56:41 — The Sutra group challenge: inventing better learning algorithms
Music:
- "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
Ravid Shwartz-Ziv: Hi everyone, and welcome back to The Information Bottleneck. Today we have a great guest, Yaroslav. He is an AI researcher and was an early researcher at both OpenAI and Google Brain. He was one of the first to put deep learning into production systems, and he has done a lot of other things related to deep learning over the last 20 years. So, hi Yaroslav, nice to have you.
Yaroslav Bulatov: Likewise.
Ravid Shwartz-Ziv: And as always: hi, Allen!
Allen Roush: Nice to see you again, Ravid, and nice to meet you, Yaroslav. And you've already taught me something, because I didn't properly know what the Hessian was until we chatted just now about it.
Ravid Shwartz-Ziv: So today we are going to talk about a lot of different topics, but we will try to focus on why deep learning actually is not such a good idea. You know, where deep learning is not working, and what different approaches and directions we can propose to make it better. So, Yaroslav, maybe we can start with AGI, a term that everyone is talking about these days. What do you think about it, and what is the problem when everyone talks about it?
Yaroslav Bulatov: Right, so periodically, every time I come to a party in San Francisco, somebody asks me: when is AGI happening? What is my timeline? People ask me so much that I wrote an essay called "God, Gold, and GPUs," and I just point them to the essay, but I can give a rough outline. So in modern AGI discourse, people think of it as kind of a second coming of Christ: there's this point where you work and things are hard, and suddenly AGI comes and things are great. There is this little book, Machines of Loving Grace; a friend of mine at Anthropic brought it, it's like a little blue book. It basically talks about all the great things that will happen after we get AGI. We will feel good, we won't have to work, diseases will be cured. So it paints a very optimistic, nice picture. Maybe I'm just old and kind of cynical about this sharp transition point, that in three years, or in ten years, things will suddenly be amazing. So I think the question of when AGI is coming is more of a religious question. I was raised in the USSR, I'm not a religious person, so I kind of don't think in those terms. I can't answer the question of when this point will come when everything will be amazing. There's a second kind of AGI...
Ravid Shwartz-Ziv: But first, let's start from the beginning: do you think AGI is even a well-defined term? What does it even mean, right? A lot of people think they know what AGI means, but getting a very concrete definition is quite hard.
Yaroslav Bulatov: Yeah, yeah, I think there are basically three kinds of AGI. One is something amazing that we can't define, but we'll know it when we see it. That's the religious AGI. Then there is a second kind of AGI, which I call the Account Gap Insurance. It's, I think, what's promoted by some of the companies whose financials don't balance. Essentially, if you actually do the math and look at how much some of these companies are spending and what they're projecting, in the end things still don't balance. And if you ask, well, how will you make it balance? The answer is AGI. So it's this extra thing you put in to calm the skeptics. As long as this logic works, we have AGI. Right now, this Account Gap Insurance argument still works. And the third kind is more like a feeling. For instance, I tried Suno for the first time in December. I visited my parents in Oregon and tried to get them onto AI. My dad now uses Antigravity. My mom uses Veo to generate videos. But I discovered Suno for myself. The first time I tried Suno, I felt the AGI. I'm like, wow, I can be a music producer. This is amazing. So you hear people saying they're feeling the AGI. I feel the AGI sometimes. In that sense, AGI is already here. But it kind of depends on your mood. One day you feel the AGI, another day you don't; it changes from day to day.
Allen Roush: Okay, I need to interject, because you even quoted the title of the exact article by, I think, possibly a colleague of yours at Google, Peter Norvig, who wrote under that exact title, "Artificial General Intelligence Is Already Here," in Noema Magazine back in
Yaroslav Bulatov: Mm-hmm. Mm-hmm.
Allen Roush: right when ChatGPT came out, like late 2022 or early 2023. And I want to play defense of this idea. Also, I'm from Oregon. And what I'm about to throw at you is definitely closest to that final "I'm feeling it" kind of argument. So I don't like this term AGI very much either, because I think most people who say it, in any of the three definitions you brought up, really mean something closer to artificial general superintelligence, right? But
Yaroslav Bulatov: Mm-hmm. Wow. Mm-hmm.
Allen Roush: you know, going back for a moment: most humans are totally different in their skill sets. Take five humans: they will share a lot of similarity in some skills, but most skills they don't share at all, right? And so humans, I claim, are a jagged frontier, which means that when we build a system that behaves very differently from how we do, but is
Yaroslav Bulatov: Yeah. Mm-hmm.
Allen Roush: in general, quite intelligent and very good at certain things that many humans aren't, I'm willing to declare that that's a new type of life, kind of. You know, and accept that it's really bad at spelling because of tokenization, for example, in ways that we can explain. So I guess by this definition I really defend Peter Norvig, and he makes all these arguments in his essay. But what's wrong with saying AGI was a really low bar,
Yaroslav Bulatov: Mm-hmm. Mm-hmm. Yeah. We do.
Allen Roush: and accepting that most humans are stupid, so it's okay to declare AGI when things are stupid still.
Yaroslav Bulatov: Mm-hmm. Yeah. I mean, for chess, for instance: chess was considered to be the benchmark of AGI until Deep Blue. And actually, I brought this up with my former advisor, Thomas Dietterich. He was on the AAAI committee, and he told me there was a debate after the chess match about what it meant. I dug up those discussions in this essay, which I will link somewhere after this, "God, Gold, and GPUs." In the discussion, at least one person said the future is incredibly bright, we have the technology to create more sophisticated thinking machines, and soon 80% of the work may be automated. So in 1998, some people already felt the AGI, and they said it's already here. But this just underscores how difficult it is to have a single benchmark, because we keep beating those benchmarks: the Turing test, chess, and ARC-AGI is close to being solved. It's very hard to have a single test.
Ravid Shwartz-Ziv: But I will claim: we just published an opinion paper, with Yann and others, that claims we don't need this definition. At the end of the day, human intelligence is not general, right? It's very specific. So when you want to talk about these kinds of capabilities, you need to talk about specific
Yaroslav Bulatov: Mm-hmm.
Ravid Shwartz-Ziv: specializations, right? Like, for a specific type of task, how much can we adapt to that specific task? Not that we are good at all the tasks in the world. This is kind of like no free lunch, right? There is no meaning in talking about generality in this case.
Yaroslav Bulatov: Mm. Yeah, I guess you and I don't need it, but there are some companies which need it, because there is no financial logic which makes their projections make sense.
Allen Roush: So let me propose, then. I've always considered the golden... you know, the guiding star, because I think that's what we're all trying to talk about: what is this thing that we're trying to build towards? And this is a religious argument, right? I always thought it was recursive self-improvement, RSI, right? And the paper that Ravid refers to maybe defines it with its own four-letter acronym, which I cannot remember. Maybe it's three letters; you should mention it, or maybe I can dig it up in a moment. But, you know,
Yaroslav Bulatov: Mm-hmm.
Allen Roush: there's Ben and Jerry.
Ravid Shwartz-Ziv: I think we call it, like... yeah, I think we call it specialized superhuman adaptable intelligence.
Allen Roush: Yeah, yeah. And the adaptable part is, I think, where the RSI comes in, because that's about incremental slash online learning, right? And Dario and other CEOs of these firms have been debating how important that is. And what's weird is that a lot of them are saying we don't need continual learning. And I don't like that. I find that really scary. And maybe that makes me start questioning: can we hit...
Yaroslav Bulatov: Mm-hmm. Mm-hmm.
Allen Roush: Will real RSI happen or not? Do you think any of this stuff matters at all, or have we spent too much time on this?
Yaroslav Bulatov: You mean does the definition matter or which part matters?
Allen Roush: Yeah, the definition, yeah.
Yaroslav Bulatov: I mean, it's a fun thing to talk about at parties. It matters to companies, because how else will they justify their very large price-to-earnings ratios, or justify having negative cash flow for so long? So it matters for them. For me, it's basically more like culture. You can discuss culture, art, theater, literature. We can discuss AGI in the same manner.
Ravid Shwartz-Ziv: Okay, so what is important? Let's assume that AGI is not important. What do you think we should look at when we try to develop better models, better systems? What are the important components, or the goals, that we need to try to achieve?
Yaroslav Bulatov: Mm-hmm. Right. So right now a lot of people around me are looking at capabilities, trying to do more and more things. My goals are much more pedestrian. I'm actually very interested right now in taking existing capabilities and looking at ways of achieving the same capability much better. Because if you look at the progress of AI, it's been sort of an evolutionary random walk. It started a long time ago, on different hardware, without knowing where we were going. And over the last 50 years, we created this big jumbled mess, and things barely work. I don't know, I was at Archimedes Banya last Wednesday, and a friend of mine from OpenAI was outside in his car restarting a run because the run wouldn't launch. If you look at anybody who's training models, it is very human-intensive and unpredictable. There are rumors that GPT-5 didn't work well because they messed up hyperparameters. I don't know if it's true, but it's very believable. So modern AI systems are these extremely complicated systems which barely work, and the reason for that is legacy. We never bothered to stop, take a breath, and maybe refactor the stack, because everybody is super excited about the next thing. Will we get an extra percentage improvement on the benchmark? Is there going to be another new capability discovered? Everybody's rushing in that direction, but I'm much more modest. I am simply looking at existing applications, which are already pretty cool: we can generate songs, we can answer questions. And I am personally interested in making them much more efficient. One motivation is, I feel that because we never bothered to refactor, current usage of AI is extremely energy-inefficient. It could be one, maybe even two, orders of magnitude worse than it should be, because we never cared about energy.
The methods were designed by mathematicians who never actually bothered to look at the memory hierarchy of a GPU. And memory is where most of the energy goes. So I'm looking at tackling that aspect and making things energy-efficient without sacrificing capabilities.
Allen Roush: So what do you think about this? There were two paradoxes I wanted to ask you about, but one of them is this idea of Jevons paradox: even if we went up one or two orders of magnitude in efficiency, people would rapidly consume it and start demanding even more of the very same system again, because it's that much more useful to us in the world.
Yaroslav Bulatov: Right, right. So I have no control over how much energy people use in total. I can make AI more efficient, which you can describe either as reducing the joules per unit of intelligence or as increasing the intelligence per joule. Those are two equivalent framings. And actually, if you look at the current projections, we're going to hit a power wall in two years. Right now people are bottlenecked by GPU capacity, but some hyperscalers expect that in two years they will have more GPUs than they have power for, because it takes so long to get power availability. So in that sense, in two years we'll be bottlenecked by energy, and the only way to get more intelligence will be to increase efficiency. I'm just surprised nobody is really looking at that right now, but that's what I'm looking at.
Ravid Shwartz-Ziv: But what does it mean in practice? Can you tell us how you would do it?
Yaroslav Bulatov: Yes, so I have a very simple recipe. We've invented AI once, and because it was invented in the open, we know how it was done. It's basically what they call graduate student descent: professors propose ideas, students come and try those ideas, and the few ideas that work, you improve on. This process is completely transparent because it's been published. So right now what I'm doing is just going back to the '60s and replaying this process. Originally I had a plan: I'd spend 20 years replaying it with the benefit of hindsight, then retire and spend 20 years teaching. But this year, agents came out, and I noticed that I am 100 times faster at trying ideas. So I don't need 20 years anymore. In practice, what it looks like is this: there are a few volunteers in San Francisco (you know, I advertise on Twitter: come join me). Every Monday we get together, we practice reinventing AI using agents, and we share some tips. And right now it's completely in the open. This is not a company; this is a community effort. Everybody should join it. If you hear this, just look at my Twitter; I have some links.
Ravid Shwartz-Ziv: But what do you want to do? Do you want to win on some benchmarks? To first implement some ideas and then try them in different scenarios? How does it work?
Yaroslav Bulatov: Yeah. So, what is my eventual goal? When I retire and teach, I don't want my lecture notes to be ugly. I want to make sure that AI is in a good state so I can teach it well. I don't want to teach prompt engineering, telling a model I'll give it $5 if it gives a nice answer, because next year it may be $10, I don't know. I want to refactor it to be nice. What it looks like in practice is designing a series of toy tasks, practicing solving them using AI, and gradually making them more complicated. For instance, right now we're looking at solving the sparse parity task. We have some simple baselines: a thing which solves sparse parity for three bits, and something that measures approximately how much energy it uses. Now the task is: can we get AI agents to invent new algorithms that solve sparse parity while using less energy? Energy usage is the main thing that was neglected when AI was invented the first time. So I want to replay the process, but always thinking about energy.
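For readers unfamiliar with it, the sparse-parity task Yaroslav describes can be made concrete with a toy setup. Everything below (the 8-bit inputs, the 200-sample dataset, the brute-force baseline) is an illustrative sketch, not the group's actual benchmark code:

```python
# Toy sparse-parity task: the label is the XOR of k=3 hidden bit positions
# of the input. Bit width, sample count, and the solver are assumptions.
import itertools
import random

random.seed(0)
n, m = 8, 200                       # input width, sample count
secret = (1, 4, 6)                  # hidden positions, unknown to the solver

X = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
y = [x[secret[0]] ^ x[secret[1]] ^ x[secret[2]] for x in X]

def solve_sparse_parity(X, y):
    """Brute-force baseline for k=3: try every 3-subset of positions."""
    for a, b, c in itertools.combinations(range(len(X[0])), 3):
        if all((x[a] ^ x[b] ^ x[c]) == t for x, t in zip(X, y)):
            return (a, b, c)
    return None

found = solve_sparse_parity(X, y)   # recovers the hidden subset
```

A learned solver would replace the brute-force search; the point of the benchmark, as described, is to compare solvers on energy used rather than on accuracy alone.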
Allen Roush: And the power consumption is one side, but you're also trying to reduce memory and, hopefully, storage capacity, right?
Yaroslav Bulatov: Because these are extremely well connected. For instance, one thing I did at OpenAI: I worked on gradient checkpointing, which reduces memory. But it actually also reduces energy use, because most of the energy used in modern AI training workloads is due to memory. So if you reduce peak memory, you will probably also reduce energy.
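Gradient checkpointing, as mentioned here, trades compute for memory: store only every K-th activation on the forward pass and recompute the rest during the backward pass. A minimal sketch on a toy chain of scalar layers (the weights and the tanh layer are invented for illustration, not Yaroslav's code):

```python
# Standard backprop stores every activation; the checkpointed version keeps
# only every K-th activation and recomputes segments during the backward pass.
import math

w = [0.5, -1.2, 0.8, 1.5, -0.3, 0.9]    # fixed weights for a 6-layer chain
K = 3                                    # checkpoint interval

def layer(i, x):
    return math.tanh(w[i] * x)

def d_layer(i, x):
    # d/dx tanh(w*x) = w * (1 - tanh(w*x)^2)
    t = math.tanh(w[i] * x)
    return w[i] * (1.0 - t * t)

def grad_full(x0):
    """Standard backprop: keep all activations, then multiply derivatives."""
    acts = [x0]
    for i in range(len(w)):
        acts.append(layer(i, acts[-1]))
    g = 1.0
    for i in reversed(range(len(w))):
        g *= d_layer(i, acts[i])
    return g

def grad_checkpointed(x0):
    """Keep only activations at multiples of K; recompute inside segments."""
    ckpts = {0: x0}
    x = x0
    for i in range(len(w)):
        x = layer(i, x)
        if (i + 1) % K == 0:
            ckpts[i + 1] = x
    g = 1.0
    for seg in reversed(range(0, len(w), K)):
        # recompute this segment's activations from its checkpoint
        acts = [ckpts[seg]]
        for i in range(seg, min(seg + K, len(w))):
            acts.append(layer(i, acts[-1]))
        for i in reversed(range(seg, min(seg + K, len(w)))):
            g *= d_layer(i, acts[i - seg])
    return g
```

Both functions return the same derivative, but the checkpointed one holds at most the checkpoints plus one segment of activations at a time; for n layers and K near sqrt(n), that gives the familiar O(sqrt(n)) memory profile at the cost of roughly one extra forward pass.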
Allen Roush: So what do you think about the existing methods being used to optimize everything? I'll just point out that, you know, Nvidia started really focusing on low-precision floating-point compute, right? Going down even as low as 1.58 bits, log base 2 of 3, or probably lower now. And that seems to be something the architectures are moving in lockstep with, right? Moving to sparse MoEs, multi-token prediction, auxiliary-loss-free load balancing for the
Yaroslav Bulatov: Mm-hmm. Yeah.
Allen Roush: experts, and more, right? I will point out that that stuff is all well better than prompt engineering and bargaining with $10. So I see the desire to throw everything out, start again from scratch, and see what happens, but what's wrong with just optimizing the techniques we have now, which are starting to move in lockstep with each other a little bit?
Yaroslav Bulatov: Right. So the question becomes: can you take an existing complicated thing and optimize away the inefficiencies? Maybe, maybe not. There are some counterexamples. For instance, an example from biology: have you heard of the recurrent laryngeal nerve? You can look it up, but basically there's this nerve which goes from the brain down past the heart and then back up to the larynx. In giraffes, it's extremely long. The reason is that in fish it was a direct connection, but it happened to be routed on the wrong side of the heart, so when the neck grew longer, the nerve became extremely long. If you're faced with that situation and you want to fix the problem of the long nerve, you could respond by making the neck shorter, but that has downsides. The way I see people adjusting the current algorithms is sort of like making the neck shorter. A more related example is the Python Global Interpreter Lock. Python was designed in the '80s for a single core. Around 2006, we started to get multi-core systems, and people thought we should probably refactor Python to be suitable for multi-core. It's been nearly 20 years, and the last I heard, it's close to being done. Meanwhile, we created workarounds: if you run Python today, it will probably still use most of the cores most of the time, because the heavy work is moved into C extensions. That's how I look at the current workarounds: they are hacks to get around the fact that the underlying algorithm was designed for CPUs. Gradient descent is inherently sequential, and the layers in neural networks are serialized: you have to compute one layer after another. There are many such global interpreter locks, present because we designed these algorithms for CPUs and keep creating workarounds. And I'm looking at doing a blank-slate redesign.
Allen Roush: What do you think that design looks like?
Ravid Shwartz-Ziv: but.
Yaroslav Bulatov: What do I think? What?
Allen Roush: the design looks like. Like what does it look like? The new...
Yaroslav Bulatov: Right, so that's kind of the outcome of this effort. I know what the process looks like: just repeat the process of the last 50 years, but with the benefit of hindsight. But if I were to guess, I imagine it may look more like message passing, because of the architecture of modern GPUs. Take an H100: you have 132 streaming multiprocessors, each streaming multiprocessor has 128 cores, and there's fast shared memory where the cores can exchange messages with each other every 30 or so clock cycles, but HBM takes something like 500 clock cycles. So you should probably do about 30 flops per access to shared memory, and maybe 500 flops per access to HBM, which suggests some kind of local, hierarchical message-passing algorithm.
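The hierarchy Yaroslav guesses at (cheap, frequent local exchange; expensive, rare global exchange) can be illustrated with a toy simulation. The group sizes and schedule below are arbitrary assumptions, not a proposed algorithm:

```python
# Toy hierarchical message passing: units average within their group every
# step (cheap, like shared memory), while groups exchange information only
# every GLOBAL_EVERY steps (expensive, like HBM round trips).
GROUPS, GROUP_SIZE, STEPS, GLOBAL_EVERY = 4, 8, 32, 8

vals = [[float(g * GROUP_SIZE + i) for i in range(GROUP_SIZE)]
        for g in range(GROUPS)]
target = sum(sum(group) for group in vals) / (GROUPS * GROUP_SIZE)

for step in range(1, STEPS + 1):
    # frequent local exchange: every unit moves to its group mean
    for g in range(GROUPS):
        mean = sum(vals[g]) / GROUP_SIZE
        vals[g] = [mean] * GROUP_SIZE
    # rare global exchange: all groups agree on the overall mean
    if step % GLOBAL_EVERY == 0:
        global_mean = sum(sum(v) for v in vals) / (GROUPS * GROUP_SIZE)
        vals = [[global_mean] * GROUP_SIZE for _ in range(GROUPS)]
```

Every unit still converges to the global average while touching the "expensive" global channel only 4 times in 32 steps, the same do-more-work-between-slow-accesses logic as the 30-vs-500-cycle numbers above.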
Ravid Shwartz-Ziv: But aren't you afraid that, at the end of the day, current deep learning has had tens of thousands of people, or even more, working on it and optimizing it over the years, right? It's kind of, as you say, a gradient descent over a lot of different ideas from different people in different scenarios, with a lot of compute, right? Do you think it's reasonable...
Yaroslav Bulatov: Mm-hmm. Mm-hmm. Mm-hmm.
Ravid Shwartz-Ziv: that we can replicate it, that you can replicate it, even if you have, I don't know, hundreds of people, right? Do you think this is something that...
Yaroslav Bulatov: I mean, most of the ideas of deep learning were already settled in the '80s, and actually, if you look at the amount of compute people had at the time and the number of people involved, it was not that many. In the last few years there's been an explosion of people, but they haven't really changed much of the underlying algorithms.
Ravid Shwartz-Ziv: But at the end of the day, you have a lot of ideas, and in order to understand which one is better, you need a lot of engineering tricks and a lot of compute, right? We had these ideas from the '80s, but there were a lot of ideas, and we didn't know which ones would work until, like, 2010.
Yaroslav Bulatov: Mm-hmm. Well, I mean, Yann LeCun invented his things using a SPARCstation, which is a lot less powerful than an iPhone. So I think if Yann LeCun could do it with a small amount of compute, then I could also do it with a small amount of compute.
Allen Roush: Well, I mean, I'm sure he understood the hardware he was working with at an intimate level. Maybe you're lucky enough to have computer scientists working with you who know modern hardware at that level too, but that's a surprisingly rare thing, even in good programs these days.
Yaroslav Bulatov: Well, I think a couple of years ago that was the case, but now, because of AI, it's been normalized. Right now we have people in our group; there is a sales guy who is one of the most productive people. He had no background in this, but he used AI to get himself up to speed. So in the last couple of years, the gap between experts and non-experts has shrunk. In fact, I'm finding that experts may actually be of negative value, because they're very attached to their ideas. There's this old quote from a speech recognition group, from Frederick Jelinek: "Every time I fire a linguist, my accuracy improves." And I'm finding something similar: non-experts are actually more open to trying these ideas and learning about them, whereas the experts keep trying their own idea. I have personal experience here. I was really into optimization, so I spent several years trying hard to get second-order methods to work, and I realized it's a bad idea. But some of the people I worked with continue trying, and they may try for another 10 years. So I actually like non-experts, because they're more flexible and more pragmatic: they just see what works and go with it, whereas some of the experts pick a direction and are ready to go for 10 years without seeing any results.
Allen Roush: So you really care about maximizing the diversity of ideas that are explored, right? That seems like a big theme for you and what you're focusing on.
Yaroslav Bulatov: Mm-hmm. Yeah. Yep.
Ravid Shwartz-Ziv: So, how do you choose the right direction? You have a lot of different ideas in a lot of different directions, right? Some of them may have better performance, some will have better efficiency, or whatever. How do you choose the one where you say, okay, now I'm going to use however many
Yaroslav Bulatov: Mm-hmm.
Ravid Shwartz-Ziv: GPUs I have in order to check whether this actually works in practice, on problems that people use in day-to-day life?
Yaroslav Bulatov: Mm-hmm. Right, so this would have been a hard question to answer 20 years ago, but luckily today we can use the benefit of hindsight. We can look at what happened at OpenAI. GPT-2: why does it have such a bad name? People actually didn't expect it to be so powerful; it was obviously an experiment. What happened is that Alec Radford trained a model for next-character prediction, and as an unexpected side effect, this system captured human reasoning. I'm motivated by this to try to do the same thing. So I think a final useful sanity check would be a system which can do WikiText-2 or WikiText-103: take a hundred million characters of training data, then a thousand characters of a question, and produce a one-character answer. That would be the benchmark to go towards: doing this in an energy-efficient way without sacrificing accuracy or wall-clock time. Whenever I'm designing toy tasks, I'm always keeping that task in mind, because that would be the final validation. Do we get something that's as good as GPT-2, without sacrificing accuracy, while being much more energy-efficient? Once we can show that, it opens up the next stage of exploration, where we can use much more resources.
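The benchmark shape Yaroslav describes (train on raw characters, answer with one predicted character) can be mocked up with the simplest possible baseline, a bigram count model. The training text and function names here are invented for illustration:

```python
# Bigram character model: count which character follows which during
# training, then answer a "question" with the most likely next character.
from collections import Counter, defaultdict

train_text = "the cat sat on the mat. the cat ate. " * 100

counts = defaultdict(Counter)
for a, b in zip(train_text, train_text[1:]):
    counts[a][b] += 1

def predict_next_char(question):
    """One-character answer: most frequent successor of the last character."""
    followers = counts[question[-1]]
    return followers.most_common(1)[0][0] if followers else " "

answer = predict_next_char("the cat sat on th")
```

A real entry would replace the bigram table with a learned model and would report energy used alongside accuracy, per the framing above.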
Ravid Shwartz-Ziv: And do you think all the ideas will have the same scaling laws? That if something beats GPT-2 at that size, it will also work better on larger models?
Yaroslav Bulatov: Yeah, that's the assumption I'm going with. If we can do GPT-2 without sacrificing accuracy but improving energy efficiency, then that will be an argument to get much more resources and go much further.
Allen Roush: So you seem really well connected to so many of the folks who've built the current world we're in. And I've never gotten to meet some of them personally, or in some cases I've only spoken to them very briefly. You mentioned Ian Goodfellow, right? Yeah. So what do you think of Keras and Ian's work and his projects and everything?
Yaroslav Bulatov: Mm-hmm. Yeah, well, he was my intern, yes. Keras, I'm not sure if he did Keras. He did Theano. Theano is what... yeah, no worries. So he did Theano. And actually, Theano was a huge inspiration for TensorFlow. I remember that Theano was a graph-based system, and it was quite slow; the graph compilation step was slow.
Allen Roush: I'm getting it all mixed up.
Ravid Shwartz-Ziv: Yeah.
Yaroslav Bulatov: And when we were at Google Brain, he kept bringing it up: it's so slow. So Jeff Dean, whenever he would commit a change to the initial TensorFlow system, would always benchmark the graph compilation step. And the time was in microseconds, whereas with Theano the complaint was that it sometimes took eight hours to compile the graph. So in that sense, yes, it was an inspiration for TensorFlow. What do I think? He worked on some diffusion stuff, and he left Google not long ago. I think he was excited about coding agents, so he may be exploring that area right now.
Ravid Shwartz-Ziv: And what do you think: if you now need to compare good ideas, or create good ideas, you know? Okay, you said we know the recipe, but what about the people themselves? Like, I don't know.
Yaroslav Bulatov: Mm-hmm.
Ravid Shwartz-Ziv: how smart they are, or their interactions as a team, or the infrastructure they have: what do you think are the most important parts for building better models?
Yaroslav Bulatov: Right, so I see a lot of selection bias. A few people landed on a good idea randomly, and then people retrospectively glorify them for it. I actually have the opposite view, and I have two experiences that shaped it. One experience: a couple of years ago, I was at a company. They got really excited about AGI and convinced a rich funder to spend a billion dollars on GPUs. But then it took longer than expected to get things working. I was there, we were sitting on all these GPUs, and I was handing out GPU boxes to random people on Twitter. One of those people was Keller Jordan. My only condition was: do whatever you want, but if you find something cool, you have to tell me. He used it to iterate on CIFAR (he had this two-second CIFAR), and he discovered the Muon optimizer. And I think that's an example where being an expert is counterproductive, because optimization people tried to improve on Adam for 10 years with little success, and Muon is just such a weird idea that no optimization person would try it. It was completely outside their scope. But Keller Jordan is not an optimization person, so he was not prevented from trying it. He tried it, it worked, he published a blog post, and he moved on. And now we have optimization people coming in and saying, aha, we knew it: steepest descent under the spectral norm. Now people are writing papers and explaining it, but it's retroactive. I think for doing something new, you just have to try many things. The most important things are curiosity and fast iteration. And the second example which guides me: at some point after I left Google Brain, I got interested in a competition, DAWNBench. It's basically: who can train ImageNet to 75% accuracy, without any limit on the amount of compute used. At the time, Google had the number one spot, at something like 22 minutes.
And I basically took $100,000 of AWS credits and spent three months creating infrastructure to iterate rapidly. Then I got this bright student from fast.ai, Andrew Shaw, who was a mobile developer; he had taken three months of Jeremy Howard's class. He started iterating, and after about a month he was already better than me. He discovered a better batch-norm initialization scheme, which gave us the last two minutes we needed to shave off the time and beat Google. So now there's this MIT Technology Review article, like, a bunch of students beat Google. But these two experiences make me really bullish on non-experts. I think you just need to be focused, iterate fast, and be pragmatic. You need to look at whether your thing works and bring some signal.
Ravid Shwartz-Ziv: And do you think it's better to work in a team? And what size of team?
Yaroslav Bulatov: Yeah, I mean, I think a team helps because we're humans: we get motivated when we're around other humans. You get better ideas. You may get stuck on something, and another person can see it and say, oh, why don't you try this? So yeah, I think there are some rules, like more than 20 doesn't work. I like small teams: at least two people, at most five to ten. Right now I'm trying to do this effort in a distributed way. I'm kind of inspired by GPU MODE; Mark Saroufim was able to grow it to 25,000 people. That's an alternative approach, which is exciting to me, because I think 25,000 people armed with ChatGPT or Claude is almost like having 25,000 Geoffrey Hintons, in my mind. I think they can do a lot. How to organize it, I'm still learning.
Allen Roush: So you mentioned earlier, since we've been throwing around so many famous names, kind of an open-ended question: what do you think of Schmidhuber? Juergen is his first name? Please correct me; he'd probably be mad if I missed it.
Yaroslav Bulatov: Mm-hmm. Juergen, Juergen Schmidhuber. Yeah, so actually, I talked with Navdeep Jaitly yesterday. I'm trying to talk to everybody who was among the first to use GPUs for deep learning, and I'll write a blog post. So I talked to Ian. According to Andrew, he's the first person to use GPUs for deep learning. Navdeep was the first person to use GPUs for speech. And I talked to him, and he told me to be careful: I should check with Juergen Schmidhuber, because he might also be the first. I mean, he has good results, good papers. I liked his work; it's original. He had the state of the art on MNIST with elastic deformations, his work with Ciresan. Yeah, he's been around a long time. I believe he's one of the pioneers, and I respect his work.
Allen Roush: Well, me too. It's fascinating that he has a reputation for asserting himself, as I would call it.
Yaroslav Bulatov: Yeah.
Allen Roush: What's fascinating is when people go and fact-check him, they mostly agree: actually, he's right. Some outside observers might write that off as cockiness. Do you think that kind of attitude is OK, even if you're right? I do. But I think it leaves him as somebody who's considered a bit contentious in the field.
Yaroslav Bulatov: Mm-hmm. I mean, yeah, it became a bit of a meme. Maybe sometimes it's nice to have some contention; it makes things fun. I also think he's the most physically fit AI researcher. There are some topless photos of him. And I think it's nice to take care of your physical health. Yeah.
Allen Roush: Ha! Have you seen the OpenClaw creator? Have you seen the pictures of him? He is jacked. Oh my god. He steals your vibes and then steals your girl.
Yaroslav Bulatov: Really? I definitely support taking care of your physical health. So this is a good role model to follow.
Ravid Shwartz-Ziv: I have a question about something you mentioned before, that we need fast iterations. Okay, two questions. One is about AI coding agents in general: do you think software engineers, these kinds of jobs, will disappear? And the second is, what do you think specifically about coding agents and coding models for research? Because...
Yaroslav Bulatov: Mm-hmm.
Ravid Shwartz-Ziv: it looked that in the recent weeks, kind of like, there is a lot of excitement about it, and it looked that they finally can, you know, close the loop of making experiments by themselves, look on the behavior, look on the results, and like, improve some of the, I don't know, hyperparameters, and like, try new things. What do you think about it? Do you think, like, this is the future, or are they good enough already? Do you think, like, you see this is... I'll get to see more and more.
Yaroslav Bulatov: Yeah, I don't think software engineers and researchers will disappear, because if a software engineer can do 10 times more stuff with agents, people will expect them to produce 10 times more stuff. We don't have a fixed amount of stuff we need to produce. Originally, people worked to produce food, but now only 5% of Americans are involved in farming. And the other 95%, what are they doing? Things we didn't know we wanted. Maybe somebody works at a car factory; 500 years ago, we didn't know we wanted cars, we wanted horses. And 10,000 years ago, we didn't know we wanted horses; we were fine walking everywhere. So what happens is, every time we increase our efficiency, we also increase our demands. Maybe in 100 years it's not enough to go to Mars; we need to go to Alpha Centauri. The demands are always rising, so I don't actually expect people to be replaced. But it will definitely become, it's already become, an inherent part of the workflow. For instance, I have a friend at Google, and she said that eight months ago she tried AI coding agents for the first time. And last month, Gemini was down and nobody could do any work. This seems to be the fastest technological transition in history: eight months from zero to impossible to work without it. As for AI agents for research, that's not as mature. For instance, one contributor to our group has this automated Claude agent trying to improve learning algorithms. It ran some experiments, and all of the proposed ideas were worse on all axes: slower, worse accuracy, and worse energy. So we still need to figure out how to make agents that are as good as, say, an undergrad. But I think we're getting there. I think we'll soon be at a point where people can work like professors, with a bunch of undergrads running around trying experiments, and you just look at the results and guide them.
I think we're pretty close to that point.
Ravid Shwartz-Ziv: But do you think this is something that you, personally, will try to push in this direction?
Yaroslav Bulatov: Yes, I definitely want to push in that direction. I'm hoping to nerd-snipe somebody into figuring out how to do it, because things are changing so fast. Somebody is telling me I should try OpenClaw; I haven't looked at it yet. Every month there's some new technology that changes things. So I'm hoping to bring people together who will collaboratively figure out the techniques, because I think that's the advantage. That's how we make this research take less than 50 years. I want to go back to 1960 and replay the last 70 years, but I don't want it to take 70 years. I think this is the factor that makes it possible.
Ravid Shwartz-Ziv: And what do you think the results will be 10 years from now? What is your timescale, basically?
Yaroslav Bulatov: The timescale for my project, or for AI in general?
Ravid Shwartz-Ziv: Both.
Yaroslav Bulatov: I mean, for my project, I guess the best-case scenario is that in two years we'll have a proof of concept which takes 100 times less energy to get the same results on next-character prediction. Maybe more realistic is 10 times less energy, based on the fact that right now we rely on HBM, which is 100 times more expensive than registers, while shared memory is only 10 times more expensive. So if you reduce your memory accesses to one level lower in the hierarchy, you get a 10 times reduction in your memory energy cost, which is the main cost of AI training. And as far as what society will look like: maybe we'll still be working the same number of hours, but instead of typing in editors, we'll be prompting AI agents. Or maybe we'll be talking. Actually, you see this microphone right now? I do most of my vibe coding using voice, because you can speak so much faster than you can type.
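As a back-of-envelope check on the ratios Yaroslav cites, here is a tiny sketch. The per-access costs are illustrative placeholders that only encode the relative ratios from the conversation (HBM roughly 100x a register access, shared memory roughly 10x), not measured hardware numbers; the point is just that moving the same traffic one level down the hierarchy cuts memory energy about 10x.

```python
# Illustrative per-access energy costs, in arbitrary units, chosen only to
# match the relative ratios mentioned in the conversation.
REGISTER = 1.0    # register file access
SHARED = 10.0     # on-chip shared memory: ~10x a register
HBM = 100.0       # off-chip HBM: ~100x a register

def memory_energy(accesses, cost_per_access):
    """Total energy spent on memory traffic at one level of the hierarchy."""
    return accesses * cost_per_access

accesses = 1e9  # same logical traffic in both scenarios
baseline = memory_energy(accesses, HBM)     # everything goes to HBM
improved = memory_energy(accesses, SHARED)  # kept one level closer to compute
print(baseline / improved)  # -> 10.0
```

This is why the "realistic" target is 10x rather than 100x: going all the way to registers would give the full factor, but one level of the hierarchy already gives an order of magnitude.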
Allen Roush: Yeah, and I'm glad you do that. I actually do that now too for a lot of my workflow, but I want to give another advantage to speaking. A lot of people handwrite, or even try lightweight prompt optimization and write out prompts that they polish before using them on a task. But AI has serious slop problems, meaning overrepresented words and phrases that reduce the diversity of the possible outputs. And people are influenced by all the AI stuff they read, even in interpersonal communication; there's a paper on that. Being able to say things out
Yaroslav Bulatov: Mm-hmm. Mm-hmm. Mm-hmm.
Allen Roush: loud, you keep more of your voice in your writing style â than you do when you type, at least for me.
Yaroslav Bulatov: Yeah. Sometimes I use AI for proofreading my tweets, but it actually makes them too verbose. So often I just reject the suggestion.
Ravid Shwartz-Ziv: What do you think about multimodality? You mentioned at the beginning that you're focusing on text. First, why? And second, do you think it's possible to combine it with other modalities? Or is there something specific about text?
Yaroslav Bulatov: Mm-hmm. Yeah, so I'm focusing on text because it's the lowest-hanging fruit. The datasets are small; something like text8 is about 100 megabytes. And it's also a necessary prerequisite: if you can't make it work on text, there's no hope of making it work on multimodal data. So I want to start with the simplest thing possible, and text is just the simplest thing.
Ravid Shwartz-Ziv: Why? Why is that the case? If you can't make it work on text, you can't in other domains?
Yaroslav Bulatov: If I can't make it work on text... well, it's similar to why you would start with MNIST. The heuristic I use: if I can't make my optimizer work on MNIST, I shouldn't bother trying to make it work on a larger example. That's just a heuristic that has worked for me. Your thing needs to work on a simpler task first. So I just don't have hope that a technique that fails to work on text will work on a larger task; it's just too expensive to try.
Ravid Shwartz-Ziv: But there are smaller datasets for vision too, for example, right? So...
Yaroslav Bulatov: Yeah, I guess the second reason is that the impact of text is much larger. If I have a simple task, there are already a lot of applications for text completion, whereas for multimodal it's not as direct, so I would have less of an impact initially. Basically, the effort-to-reward ratio is much better for text than for other domains.
Ravid Shwartz-Ziv: Do you think the ideas, the process, will be similar for other domains? I mean, do you think we can just replicate the same process, and it's just a matter of resources? Or is there maybe something fundamentally different?
Yaroslav Bulatov: Yeah, I think it would be replicating the same process. The idea is to figure out the process of inventing AI. It could be an agent, but it could also be just a bunch of Google Docs describing what we did. So if it works for text, then we just say, well, here's the starting point: replay the whole thing again for multimodal, and maybe improve on it.
Ravid Shwartz-Ziv: And how much domain expertise do you think you need?
Yaroslav Bulatov: I think I need very little domain expertise, because it's one question away. And it's better because, yeah, I think because of AI, we remove the need to actually have domain experts.
Ravid Shwartz-Ziv: And do you think all these companies, the AI labs, like Ilya's company or Thinking Machines, are going in the right direction and trying new things? Or do you think they're basically doing the same thing and don't have a chance at real innovation or real progress?
Yaroslav Bulatov: Mm-hmm. Yeah, so I think there is a tendency for everybody to chase the same peak. That's one problem. And the second problem is that domain experts are very incentivized to convince everybody that domain experts are important; they're attached to this. So whenever you come to a company with a lot of domain experts, they'll tend to continue the status quo. And looking at what SSI is doing, I actually don't like the secrecy, the approach they're taking there. I think it sets a bad example for junior researchers. A new researcher comes into the field and looks at it like, oh, SSI is so cool, and they do everything in an extremely secretive way. It's so secretive there are rumors you cannot have a partner at a competing AGI lab, because they want to contain the information. So I'm hoping to set the opposite example: that you can make progress by being extremely transparent. By achieving better results than them, I hope this will inspire the new generation of researchers toward a transparent and open approach, and not this super-secret approach that Ilya is taking.
Allen Roush: There's a quote I'm reminded of, by George Bernard Shaw: the reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself; therefore, all progress depends on the unreasonable man. I kind of feel like a lot of the proposals you're talking about, and that we've kind of gone around, are from people who do not accept the status quo, right?
Yaroslav Bulatov: Mm-hmm. Mm-hmm.
Allen Roush: you know, push the world to move in ways that orthodoxy seems to disagree with. And I think that there's also â in the philosophy of science, right? Karl Popper, think, you know, structure of scientific revolutions starts trying to analyze. And also his student, â Paul Firebrand, Fierband. starts trying to analyze why it was that like it was so difficult to get people to accept like the Catholic Church and the 1500s or you know that the world's round and all these other things right so
Yaroslav Bulatov: Yeah. Yeah, so I think there's a strong tendency for people who develop a particular technique to keep promoting that technique. With all due respect to Yann LeCun, as an inventor of backpropagation he would be very personally attached to backpropagation. So there's this public argument between me and Yann LeCun: will backpropagation still be there in five years? I was posting that it wouldn't be, based on my observation of how it was invented. It was invented for the wrong kind of hardware. The sequential steps and saving all the activations just don't make sense for GPUs. And because we're hitting an energy bottleneck in two years, I don't expect us to be able to continue improving AI without getting rid of that particular method.
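The "saving all the activations" cost can be sketched with a toy memory model. This is a hypothetical back-of-envelope formula, not a measurement: plain backprop holds one activation per layer until the backward pass, while sqrt-style gradient checkpointing (a standard mitigation, which Yaroslav has written about elsewhere) stores only about sqrt(L) checkpoints and recomputes one segment at a time.

```python
import math

def activation_memory(layers, per_layer_mem, checkpoint=False):
    """Toy model of peak activation memory for backprop over `layers` layers."""
    if checkpoint:
        # Keep ~sqrt(L) checkpoints, plus re-materialize one segment of
        # ~sqrt(L) layers at a time during the backward pass.
        seg = math.ceil(math.sqrt(layers))
        return (seg + seg) * per_layer_mem
    # Plain backprop stores every layer's activations until the backward pass.
    return layers * per_layer_mem

print(activation_memory(100, 1.0))                   # -> 100.0
print(activation_memory(100, 1.0, checkpoint=True))  # -> 20.0
```

Even with that trick, the memory grows with depth, which is part of why Yaroslav argues the method itself, not just its implementation, is mismatched to the hardware.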
Ravid Shwartz-Ziv: Are there any methods or architectures that you really like, that you think will be fundamental? Or do you say, I don't know what will be there, we'll iterate and we'll figure it out? Like, as far as I know, Yann likes backprop, or world models; do you have some concept that you think will be there?
Yaroslav Bulatov: Yeah, actually, I think nothing is fundamental. In fact, maybe I would go even further: I don't just want to get rid of backprop. Maybe I want to get rid of math as the concept we use for inventing these algorithms. Because if you think about it, when researchers were inventing these methods, they were optimizing for beauty. And it used to make sense, because you want your implementation to be beautiful, meaning small, so you can communicate it to other people. This way backprop is standardized; you can just say "backprop" and everybody knows what it means. But maybe we're approaching a world where an AI agent can quickly invent a new algorithm specialized to a specific task and specific hardware, and maybe this algorithm will be too complicated for people to understand. So we could be approaching an era where algorithms are like binary code. People used to write binary 70 years ago, but now they don't; we build tools around it. So maybe we'll get algorithms which people don't actually study, but we'll have tools around them. And the advantage could be that beauty is a constraint, and if we relax the constraint, maybe we'll optimize things better. Here I want to relax every constraint possible in order to maximize energy efficiency.
Ravid Shwartz-Ziv: And do you think it will be possible, in the sense that it's kind of like no free lunch, right? If you don't have any constraints, then you will search and search, but you may not find any optimal solution.
Yaroslav Bulatov: I mean, it is true that representation is important. You can't just give an agent the binary of a learning algorithm and tell it to modify random bits; that would be inefficient. So I think you do need some degree of beauty. And I'm still trying to figure out how much beauty you need in your algorithms.
Ravid Shwartz-Ziv: So do you think there's no role for math, no place for math, that we don't need math anymore? We can just optimize the different methods, optimize the different architectures, and see what works best? Do you think we can basically ignore all these equations?
Yaroslav Bulatov: Yeah, I mean, because representation matters during search, you probably do need some idea of simplicity, because it matters which things you try; you can't just try random things. So there probably is some place for math. I'm still trying to figure out how much I can get away with. Can I just not have any constraints and have the agent optimize it? Ideally, yes, but maybe I'll have to combine it with some ideas to try. Right now it's good that we have all these papers, so we can ask agents to look at different papers and implement them, with humans as a kind of guiding mechanism. But also, math in general has a place in society: it's entertaining. I went to MAEA last year and Terry Tao gave a talk on prime numbers, and I remember being extremely entertained, even though I don't have a background in prime numbers and I don't need it for my job. It was a very nice talk. So I feel like we'll still have math, we'll still have AI research, but we won't be forced to make it useful. We'll just do it for fun, like people play chess for fun.
Allen Roush: Well, I think chess has a unique aspect. Before we had the Deep Blue moment, and really for 10 years after that, you couldn't have commentators on a game between the two best grandmasters in the world give even decent analysis, unless you had the number three and number four commenting, which often, in practice, you did, right? But now, since then, we have superintelligence that can rate who is better in these competitions highly accurately. So do you think that's the aspect that makes humans still want to do things, that we can still be
Yaroslav Bulatov: Mm-hmm. Mm-hmm.
Allen Roush: you know, able to rank ourselves even in a world where machines have beaten us at most tasks.
Yaroslav Bulatov: Maybe in some areas, but I think in math people do not pretend math is useful, and they still have fun doing it, like pure mathematics. I was just at a conference last week, Gathering 4 Gardner. It's the recreational math conference, and people have a lot of fun there. I think maybe soon all math will be recreational math.
Ravid Shwartz-Ziv: And what do you think in general about academia in machine learning? Do you think we still need it to make progress? Or, if everything is empirical, can companies and startups probably do it better?
Yaroslav Bulatov: Mm-hmm. Yeah, I mean, academia is great for this unencumbered generation of new ideas. It can create a beautiful field which motivates more people to join, and have some results which eventually become useful. But I feel like there's so much money in machine learning that academia has been a little bit corrupted by it. There was a discussion at NeurIPS last year; one of the professors said he doesn't tell people to research X and Y because it will make them less employable. So people are guiding their students toward things which have more hands-on practical impact. I don't really like this direction. There are already enough people doing practical things in companies; do we really need everybody in academia to do those things as well? I'd rather see academia more the way it was before the latest deep learning revolution: there to generate new ideas that can create new fields.
Allen Roush: Well, and it really does feel like business has gotten very deep into this one part of academia, right? Machine learning and AI. I mean, I've heard how people talked about NeurIPS 10 and more years ago, back when it was still called NIPS. And today it feels like a business conference that the AI people just kind of have a side thing at. You know, of every five people you talk to, at least three are not people with papers, even at workshops.
Yaroslav Bulatov: Mm-hmm. Yeah, and also at the last NeurIPS, I had a paper which was rejected, and then I learned that there were 60 high schoolers who had their papers accepted.
Allen Roush: in my experience.
Yaroslav Bulatov: And an interesting thing with NeurIPS: if you submit a paper, you're required to be a reviewer. So there are a lot of high schoolers both submitting and reviewing papers. And actually, one of the underlying driving forces is that Stanford admissions are very competitive, so there are actually services where you can pay somebody to coach your child through this whole NeurIPS submission process. I think that's an example of Goodhart's law: when Stanford applicants optimize for NeurIPS acceptances, that's going to force NeurIPS acceptance to become a bad metric of success.
Allen Roush: for this.
Ravid Shwartz-Ziv: Okay, we're almost out of time. Is there anything else that you want to add, anything you want to talk about?
Yaroslav Bulatov: No, I think this covered everything I wanted to say.
Ravid Shwartz-Ziv: Anything that you want to promote?
Yaroslav Bulatov: I want to promote this effort. It's on my Twitter, the Sutra group. We have a challenge; take a look at it. Right now, we're trying to see if we can invent a better learning algorithm for the sparse parity task. There's a baseline using... But I think it can be improved at least 1,000 times. We just need to figure out the right prompt to use to improve it.
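For listeners who want to poke at the challenge, here is a minimal, hypothetical sketch of the sparse parity task as it is usually defined in the learning-theory literature: the label is the XOR of a small hidden subset of the input bits, and the learner has to discover which subset. The exact task spec and sizes in the Sutra group challenge may differ; the names below are illustrative.

```python
import numpy as np

def sparse_parity_batch(rng, n_bits=32, k=3, batch=64, support=None):
    """Sample a sparse-parity batch: inputs are random bit strings, and the
    label is the parity (XOR) of k hidden 'support' coordinates that a
    learner must discover."""
    if support is None:
        support = rng.choice(n_bits, size=k, replace=False)
    x = rng.integers(0, 2, size=(batch, n_bits))
    y = x[:, support].sum(axis=1) % 2
    return x, y, support

rng = np.random.default_rng(0)
x, y, support = sparse_parity_batch(rng)
```

The task is a classic hard case for gradient-based learners because no small subset of examples reveals the support, which is what makes it a nice benchmark for "better learning algorithm" hunts.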
Ravid Shwartz-Ziv: Okay, so good luck with that. Yaroslav, thank you so much for joining us today. It was really a pleasure. And thank you, Allen.
Yaroslav Bulatov: Likewise.
Allen Roush: Yeah, always a pleasure, and I learned so much today, including the importance of the diagonal of the Hessian.
Yaroslav Bulatov: Mm-hmm. Very important.
Ravid Shwartz-Ziv: Yeah, so thank you everyone and see you next time.
Yaroslav Bulatov: Okay,