March 18, 2026

Stefano Ermon on Diffusion LLMs, Mercury & Why the Future of AI Won't Be Autoregressive


In this episode, we talk with Stefano Ermon, Stanford professor, co-founder & CEO of Inception AI, and co-inventor of DDIM, FlashAttention, DPO, and score-based/diffusion models, about why diffusion-based language models may overtake the autoregressive paradigm that dominates today's LLMs.

We start with the fundamentals: what diffusion models actually are, and why iterative refinement (starting from noise and progressively denoising) offers structural advantages over autoregressive generation.

From there, we dive into the technical core of diffusion LLMs. Stefano explains how discrete diffusion works on text, why masking is just one of many possible noise processes, and how the mathematics of score matching carries over from the continuous image setting with surprising elegance.

A major theme is the inference advantage. Because diffusion models produce multiple tokens in parallel, they can be dramatically faster than autoregressive models at inference time. Stefano argues this fundamentally changes the cost-quality Pareto frontier, and becomes especially powerful in RL-based post-training.

We also discuss Inception AI's Mercury II model, which Stefano describes as best-in-class for latency-constrained tasks like voice agents and code completion.

In the final part, we get into broader questions: why transformers work so well, research advice for PhD students, whether recursive self-improvement is imminent, the real state of AI coding tools, and Stefano's journey from academia to startup founder.


TIMESTAMPS

0:12 – Introduction
1:08 – Origins of diffusion models: from GANs to score-based models in 2019
3:13 – Diffusion vs. autoregressive: the typewriter vs. editor analogy
4:43 – Speed, creativity, and quality trade-offs between the two approaches
7:44 – Temperature and sampling in diffusion LLMs — why it's more subtle than you think
9:56 – Can diffusion LLMs scale? Inception AI and Gemini Diffusion as proof points
11:50 – State space models and hybrid transformer architectures
13:03 – Scaling laws for diffusion: pre-training, post-training, and test-time compute
14:33 – Ecosystem and tooling: what transfers and what doesn't
16:58 – From images to text: how discrete diffusion actually works
19:59 – Theory vs. practice in deep learning
21:50 – Loss functions and scoring rules for generative models
23:12 – Mercury II and where diffusion LLMs already win
26:20 – Creativity, slop, and output diversity in parallel generation
28:43 – Hardware for diffusion models: why current GPUs favor autoregressive workloads
30:56 – Optimization algorithms and managing technical risk at a startup
32:46 – Why do transformers work so well?
33:30 – Research advice for PhD students: focus on inference
34:57 – Recursive self-improvement and AGI timelines
35:56 – Will AI replace software engineers? Real-world experience at Inception
37:54 – Professor vs. startup founder: different execution, similar mission
39:56 – The founding story of Inception AI — from ICML Best Paper to company
42:30 – The researcher-to-founder pipeline and big funding rounds
45:02 – PhD vs. industry in 2026: the widening financial gap
47:30 – The industry in 5-10 years: Stefano's outlook

Music:

  • "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • Changes: trimmed

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Ravid Shwartz-Ziv: Hi everyone, and welcome back to The Information Bottleneck. Today we have a very special guest: Stefano Ermon is a professor at Stanford and also the co-founder and the CEO, right? The co-founder and the CEO of Inception AI. Hi Stefano!


Stefano Ermon: Hi, pleasure to be here.


Ravid Shwartz-Ziv: Thank you for coming. And as always, hey Allen!


Allen Roush: Good to see you and good to meet you, Stefano. I'm a big fan of your work.


Stefano Ermon: Thank you, yeah, good to meet you.


Ravid Shwartz-Ziv: So I think let's start with diffusion in general and why you think diffusion is the future or if diffusion is the future.


Stefano Ermon: Yeah, diffusion has always been something that is very close to my heart. I've been working in this space for many years now. I think the first paper we had on score-based diffusion models was in 2019, so now it's been, what, like seven years or so that we've been working in that space. As you know, back then it was all GANs; that was the thing that people used to generate images. And yeah, we were basically exploring alternatives, trying to have something more expressive, and autoregressive models didn't quite work. And so it was more like energy-based models. And then we figured it out. We were trying to train it by score matching, and then we got into the whole idea of having a sequence of distributions that are annealed across noise levels to make the chains mix faster, which eventually became a diffusion model. And so, yeah, since back then, I was very excited about this alternative approach that is much more stable, you know, there is an underlying proper scoring rule, it's well understood theoretically, and it has the kind of property you would want from a deep learning model, in the sense that you can train it relatively inexpensively just by denoising, but then at inference time it allows you to use a very deep computation graph, because you can essentially


Ravid Shwartz-Ziv: So.


Stefano Ermon: choose the number of denoising steps. In theory, it could even be infinite if you like the continuous-time perspective, which has always been something that attracted me to these kinds of models, because it has the flavor of test-time inference. Back then that didn't exist, but this idea of wanting to use as big a neural network as you can while still being able to train it efficiently, that was what attracted me to diffusion models and why I still like them. And that's why I think it has a high chance of being the architecture for future generative AI solutions.


Ravid Shwartz-Ziv: Let's start from the beginning, like, what does it even mean, diffusion models?


Stefano Ermon: Yeah, a diffusion model is a kind of generative model that essentially generates objects, you know, images or video or music, now text also, through a process of iterative refinement, coarse-to-fine generation. So you start with pure noise and then you gradually refine this object, by removing noise, by fixing mistakes, until at the end you get a crisp image or a video or some code that solves the problem you wanted to solve. It's very different from the autoregressive approach, which is what most LLMs, for example, are typically based on, where the model is essentially just predicting the next token or the next pixel and is trained for that task. And then at inference time, it's unrolling this process and just generating the object left to right, one token or one pixel at a time. So it's a very different kind of model and sampling process. If you like analogies, I guess an LLM is more like a typewriter, where you go left to right, one token at a time. A diffusion model is more like an editor: you start with an outline, a rough guess of what you want the paper or the article to be, and then you refine it. So that's the intuition, at least.
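As a rough illustration of that coarse-to-fine idea, here is a toy sketch: samples start as pure noise and are nudged toward a 1-D Gaussian "data" distribution by Langevin steps. The score function is known in closed form here; in a real diffusion model it would be learned by a neural network, and all constants below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 0.5              # toy "data" distribution: N(3, 0.5^2)

def score(x):
    """Gradient of the log-density of N(mu, sigma^2): points toward likelier x."""
    return (mu - x) / sigma**2

x = rng.standard_normal(10_000)   # start from pure noise
step = 0.05
for _ in range(500):              # iterative refinement via Langevin dynamics
    x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)

# samples now concentrate near the data distribution N(3, 0.5)
print(round(float(x.mean()), 1), round(float(x.std()), 1))
```

The number of refinement steps is a free inference-time knob, which is the "deep computation graph" property Stefano mentions.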


Allen Roush: So that gives me a really good question for you. I have a lot of experience with using diffusion models like Stable Diffusion and its continuations, and of course with LLMs, and also with experimenting with sampling and denoising algorithms and all the other knobs and levers. And one thing that has always struck me about the diffusion approach is that I see it as being a lot faster, at least today, de facto, right? More inference efficient, but I also see it as more creative. For example, when I use at least ChatGPT's autoregressive image models, there's a horrible orange filter on everything that I cannot figure out how to remove. It tends toward orange-ness; running it in a loop a hundred times or whatever always leads to extremely orange images. I never saw any tics like this with Stable Diffusion and most other diffusion models, and I could use settings to tune it out. But similarly, there's also a quality advantage with these autoregressive image models, where they're way better at generating text that's coherent, even sometimes paragraphs that are correctly rendered. So I guess, do you think it's an inevitable kind of thing, where diffusion models are always faster, but also a little bit lower quality and maybe more creative? Or do you think this is the training data? What's the source of this? Am I off base with how I interpret the whole thing?


Stefano Ermon: Yeah, I think it probably depends a lot on the model, on the architecture, on what data you've trained it on, and exactly how you do the sampling. I don't think that fundamentally the quality of a diffusion model should be any worse than what you get from an autoregressive one. You know, you can always do a lot of denoising steps, and you can even think of an autoregressive model as a special kind of diffusion model: if you were to remove noise left to right, one token at a time, then you get an autoregressive model, right? So if the noising is structured in a special way, you can also think of an autoregressive model as a diffusion model. And so it's hard to make these kinds of generic statements about classes of models. At the end of the day, these are statistical models, and a lot depends on what kind of data you use, what the inductive bias is, what kind of neural network you're using, what kind of training objectives, the hyperparameters, and so on. And unfortunately we do not understand generalization in the context of deep learning, and even less so in the context of generative models. It should be impossible to do what these models are doing, yet they do it. So I think that's probably one of the big scientific questions that is still wide open: understanding why these models work despite the curse of dimensionality, despite the fact that it shouldn't be possible.
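Stefano's remark that an autoregressive model is a special case of diffusion can be made concrete with a toy unmasking schedule. `fill` below is a hypothetical stand-in for a trained denoiser (it just copies from a target string), so only the mechanics of the generation order are real here: a left-to-right schedule reproduces autoregressive decoding, while any other permutation is a generic masked-diffusion order.

```python
MASK = "_"
target = "diffusion"

def fill(tokens, i):
    # placeholder denoiser: a real model would predict token i
    # from the currently unmasked context with a neural network
    return target[i]

def generate(order):
    tokens = [MASK] * len(target)
    for i in order:                  # the unmasking schedule is the "noise process"
        tokens[i] = fill(tokens, i)
    return "".join(tokens)

left_to_right = range(len(target))           # autoregressive special case
any_order = [4, 0, 8, 2, 6, 1, 5, 3, 7]      # generic diffusion-style order

print(generate(left_to_right), generate(any_order))  # both reach "diffusion"
```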


Allen Roush: Then a quick follow-up. I noticed that settings behave differently. For example, if I raise the temperature on an autoregressive LLM, it has no inference-time speed impact; it's the same tokens per second. But when I did that with diffusion LLMs, it seemed to slow the model down. So do you have an intuition for why that would vary from one model to another, and why it isn't intuitive like that?


Stefano Ermon: Which diffusion LLM did you experiment with?


Allen Roush: Gosh, I don't remember the name. Maybe it was an implementation quirk of that model at the time. It was some kind of one on Hugging Face.


Stefano Ermon: I see, I see, yeah. Because it's not even clear necessarily how to do temperature scaling in the context of diffusion LLMs; it's a little bit more subtle, like how do you actually do it. You know, there are some hacks that people do, but they don't necessarily even correspond to true temperature scaling. In the context of score-based models, like real energy-based models, then yes, you're just scaling the score function, so there is a pretty clear interpretation. But in the context of discrete diffusion models, it's a little bit more tricky. And in general, the sampling matters a lot. It's one of these things that I see as an advantage of diffusion models: learning and inference are decoupled, which actually goes against the deep learning wisdom that those things need to be coupled as much as possible, but they are actually very much decoupled. After you've trained the model, there are a lot of different ways you can use it at inference time to sample from it, and I think that's a big strength of these models, because it allows you to trade off compute for quality, for cost, in very interesting ways. That is generally pretty hard to do with an autoregressive model, where you're kind of forced to use it at inference time the same way you've used it during training, which is just predicting the next token. Then yeah, maybe you can scale the temperature, maybe you can play some tricks with the way you change the logits, but there is not a whole lot of freedom there.
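The "clear interpretation" for score-based models, sampling from p(x)^(1/T) by scaling the score by 1/T, can be checked on a toy Gaussian, where the tempered distribution is again Gaussian with variance T·σ². A sketch with illustrative constants only; real discrete-diffusion temperature scaling is, as Stefano says, subtler than this:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0

def sample(T, steps=2000, step=0.01, n=20_000):
    """Langevin sampling that targets p(x)^(1/T) by scaling the score by 1/T."""
    x = rng.standard_normal(n)
    for _ in range(steps):
        score = -x / sigma**2                   # score of the base N(0, sigma^2)
        x = x + step * (score / T) + np.sqrt(2 * step) * rng.standard_normal(n)
    return x

print(round(float(sample(1.0).std()), 1))    # ~1.0: base distribution recovered
print(round(float(sample(0.25).std()), 1))   # ~0.5: low temperature sharpens p
```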


Ravid Shwartz-Ziv: So I'm a big fan of diffusion models. Actually, my first choice for a postdoc was Stefano, and when that didn't work out, I reached out to Yann. So, I think in the end people are not using diffusion text models, even though they've been around for a couple of years already, right? People didn't succeed in scaling them and making them work, you know, as well as the models that the frontier labs are using. Why do you think that's the case? Is it something fundamental, or do people just not know how to do it, and you think you know better?


Stefano Ermon: Yeah, I actually do not necessarily agree with that statement. I think the work we've been doing at Inception proves that it's possible to scale up diffusion language models to commercial scale, and there have been a few other proof points, from Google with Gemini Diffusion, and work from other industrial labs that has shown it's indeed possible. I think it just so happened that the autoregressive stuff, the GPT work that happened at OpenAI, worked, and a lot of people then invested massive amounts of capital and did all the R&D based on that direction. But it's not necessarily the right one or the best one. It's in fact very much possible that there are going to be other approaches. If you think about 20 years from now, will LLMs still be based on that technology? I don't know; I think it would be hard to imagine that the first thing we found to be working was actually the best one. That seems pretty unlikely to me.


Allen Roush: Do you think it's the same thing for state space models and similar alternatives to the transformer?


Stefano Ermon: Well, I mean, people have, as far as I know, invested a decent amount of resources in trying to get those alternatives to attention to work in practice, and there's been some success. I think some pretty large-scale models have hybrid architectures where you interleave attention with more efficient, sub-quadratic alternatives. I think that's getting more and more common. I'm not an expert, at least not in the very large-scale training of these kinds of models, so I don't know how big the benefit is, but I think there is a trend of people switching more and more to these alternative architectures too, right? So again, who knows what the architecture of the future is going to be. I think that's an orthogonal question: there's the architecture, and then there's how you train it and how you use it at inference time. The design space is pretty big, which is exciting.


Ravid Shwartz-Ziv: And what do you think about the scaling laws for diffusion models? Do you think they are similar to autoregressive ones? Do you think scaling is helpful in this case? Is the bitter lesson true here also?


Stefano Ermon: Yeah, there is an element of truth, in the sense that the more compute, the more data, the more parameters, the better the performance becomes. But if you think about scaling laws, there are multiple scaling laws that matter: there's pre-training, mid-training, post-training, test time, and I think the answers are not the same everywhere; they change depending on what you're looking at. Think about test-time inference. That's a clear one, where we know diffusion models are much faster. So if you fix a time budget, what is the best answer you can get within that budget? Because a diffusion language model can think faster, you can actually get better quality compared to the alternative autoregressive models. Similarly, if you start thinking about RL-based post-training, that's another regime where you're mostly bottlenecked by inference; you're spending most of your time doing rollouts. So having a method that is scalable at inference time can give you big benefits in that stage of the post-training pipeline. So it's a little bit more nuanced, I would say, but yeah, it's unlikely that one architecture is going to always be the best one in every single stage of the pipeline.


Allen Roush: So I'm curious about the de facto ecosystems that built out around diffusion versus around LLMs. For example, low-rank adapters was originally a paper in, I think, 2021 for LLMs, but then in the immediate aftermath, Stable Diffusion had come out, and LoRA was very heavily embraced for cheaply adding personas into Stable-Diffusion-like models. And that led to all sorts of weird dynamics, where Hugging Face became the GitHub of machine learning, and its equivalent, Civitai, became something closer to a brothel of machine learning. So I'm curious, what do you think of the whole distinction in ecosystems and worlds? Do you think that will continue, or will diffusion LLMs go in very different directions around the tooling that gets built around them and the techniques that get used for them?


Stefano Ermon: Yeah, I think it really depends on which parts of the pipeline you're thinking about. On the data side, a lot of things transfer: datasets that are good for training autoregressive models also work well for diffusion. Things like adapters, like LoRA, are more a function of the architecture than of how you train it, and so those are also relatively easy to adapt from one community to the other. Other things are more specific. If you think about the inference code, the serving engines are very different. If you have an autoregressive model, there's a bunch of infrastructure, some of it open source, available to serve models efficiently in production: the vLLMs, SGLangs, TensorRTs of the world. That doesn't map very well to the computation graphs and sampling paths that you have in a diffusion model, so that requires different infrastructure. Other stuff is more compatible, so it really depends on the part of the stack you're thinking about.


Ravid Shwartz-Ziv: Let's talk maybe a bit about the differences between image diffusion and text diffusion. So originally, as you said, there were images, right? There everything is smooth and nice, and you can build a nice theory on top of it. But in the end, text is discrete, and it looks like you and other people tried to solve that with masking, right? Do you think this is a workaround, some good approximation? Or do you think it just works, and the theory doesn't matter anymore, and you're just doing empirical science?


Stefano Ermon: The theory actually maps pretty well. I think it took us a while to figure out exactly how to do that, because, as you said, a lot of the diffusion models rely on these score functions and Fokker-Planck equations, but a lot of the math actually translates. If you take the score function, the equivalent thing is a concrete score, which is basically, again, some kind of function that tells you locally how the energy landscape, the likelihood, is changing, and it tells you what kind of changes you should make to your object to make it better, to make it more likely. And it turns out that those objects can be learned through objectives that are effectively denoising. So there is a denoising-score-matching-like objective that can be derived for discrete spaces, and it's all very similar: a very similar kind of trick, and very similar mathematics to get around the partition function, essentially. And then, you know, it doesn't have to be masking. That's the thing; I think all the math works for pretty general noise processes. At the end of the day, it's a similar set of constraints in terms of the kinds of noise processes and the tractability elements that allow you to do denoising score matching efficiently: things you can sample from efficiently, things that have tractable transition kernels. There's a pretty broad set of noising processes that can then be reversed with a lot of mathematical structure, and a lot of the math works out. It's still all about solving differential equations, essentially, and you can still do it in continuous time. The only difference is really that instead of having a diffusion over a continuous space, you're essentially doing a diffusion over a graph. But there is a lot of flexibility in how you design the noising process. It doesn't have to be just having these absorbing states with masks. That's one way, but it doesn't have to be.
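For concreteness, here is a minimal sketch of the absorbing-state (masking) forward process Stefano calls just one choice among many: at noise level t, each token is independently replaced by a mask token, and a model would be trained to reverse this corruption. The token-level corruption rule is the standard one for absorbing-state discrete diffusion; everything else (the sentence, the mask symbol) is illustrative.

```python
import random

MASK = "[M]"

def corrupt(tokens, t, rng):
    """Forward masking process: each token is absorbed into MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
sentence = "discrete diffusion works on text too".split()

# training pairs are (corrupted sequence, original); higher t = more corruption
for t in (0.25, 0.5, 0.9):
    print(t, corrupt(sentence, t, rng))
```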


Ravid Shwartz-Ziv: And how do you see, in general, the connection between theory and practice in deep learning? You did a lot of theory; your background is in information theory also. But you also built a lot of algorithms and models. Do you see a connection there? Do you think you can carry lessons from one side to the other?


Stefano Ermon: Yeah, it's super helpful. The theory and the mathematical analysis don't get you all the way, but they do provide a certain level of intuition that I think is useful to essentially prune the search space. Eventually it's still very empirical, as we know, and a lot of the results rely on experimentation, whether we like it or not. The theory is far from being able to predict any interesting experiment. Despite all the work on deep learning theory and information theory, including my own, the stuff is not really predictive all the way, and it doesn't save you from doing the experiments. But it does give you some intuition for which experiments to do and which ones to not even try. If you think about the loss functions and the right training objectives, then it's very important to really understand what is happening and how to design loss functions that do the right thing: loss functions that are numerically stable and have the right properties, and that would then lend themselves to something that could work in practice.


Allen Roush: When you talk about different loss functions, I guess, do you have opinions about which are better or worse for these kinds of problems?


Stefano Ermon: Yeah, for example, there are different kinds of scoring rules, or different kinds of divergences essentially, that you can use to compare probability distributions or to learn score functions. In theory at least, there are certain sets of loss functions that would all have the same global optimum, or that would be proper, in the sense that in the limit of infinite data and infinite compute, you get the right answer. But then if you look closer, you can see that a loss might be too brittle: it might not lend itself well to probability distributions where we expect extremely large swings, for example in the probabilities that you assign to things that make sense versus things that don't make sense. So you need a little bit of that intuition to figure out how to design loss functions that would actually work in practice.
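The "proper" property Stefano mentions is easy to demonstrate numerically: a proper scoring rule is minimized, in expectation under the true distribution, by reporting that distribution itself. A toy sketch for a biased coin, comparing the (proper) log loss with an improper linear rule that rewards overconfidence:

```python
import numpy as np

p = 0.7                           # true probability of heads
q = np.linspace(0.01, 0.99, 99)   # candidate reported probabilities

# expected losses under the true distribution, for each reported q
log_loss = -(p * np.log(q) + (1 - p) * np.log(1 - q))   # proper (cross-entropy)
linear_loss = -(p * q + (1 - p) * (1 - q))              # improper linear rule

print(round(float(q[np.argmin(log_loss)]), 2))     # 0.7: honesty is optimal
print(round(float(q[np.argmin(linear_loss)]), 2))  # 0.99: pushed to the extreme
```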


Ravid Shwartz-Ziv: And what do you think about where diffusion models fit? Do you think there are specific types of problems where they are likely to be more useful, or do you think who knows?


Stefano Ermon: Yeah, we're seeing success across the board. I think people have maybe not quite caught up with what's happening. But if you try our latest Mercury II model, it's pretty good. It's not yet frontier-level quality, so if you're using a frontier-level model, it's probably not going to be a perfect replacement yet. But if you have tasks with pretty tight latency budgets, where you would not be able to use the biggest and most capable model anyway, say you're building a voice agent, or you have a developer in the loop and you need to provide suggestions and edits, and you have maybe, I don't know, a 200-millisecond latency budget, or up to maybe a second, a second and a half, then we are best in class, basically. If you look at the charts, it's comparable in quality to the best speed-optimized autoregressive models that have come out of the top frontier labs, but it's significantly faster, because it's all parallel. Each neural network evaluation is giving you more than one token, and so, to the extent that you don't need too many denoising steps, these things can be really, really efficient. So it's already proven at that quality level. The open question is about the Pareto frontier between speed, quality, and cost. There are certain trade-offs that you can get with autoregressive models; I think we've shown that we can shift that frontier in some parts of that space. And yeah, the big question is whether it will hold at the highest possible quality levels, what will happen as we scale more and more. We don't know, but I'm pretty optimistic, because eventually it's all going to be an inference game. That's the only thing that matters: it's intelligence per watt, or per dollar, and the architecture that scales better along that dimension is the one that is going to win.
And we have some pretty solid evidence that it's superior. And it's often the case that a more parallel solution is the one that eventually wins; that's the other side of the bitter lesson, I think. And diffusion models are designed to be parallel, right? Essentially, you're able to modify many tokens at the same time, as long as you can figure out the right way to do things. I teach a generative models class, and if you think about autoregressive models, they have all kinds of issues. It's a miracle that they work as well as they do. So yeah, I think there's no reason that it needs to be the only approach. It's just that that's what everybody else is doing, and there's a lot of PR built around them. But as an academic, as a researcher, you always have to take contrarian bets. You're never going to be first if you just follow what everybody else is doing. So you always have to try different things, take some more risky paths and approaches. And at least for us, I think it's definitely paying off.
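A back-of-envelope sketch of that parallelism argument: an autoregressive decoder needs one network call per generated token, while a diffusion model that finalizes several tokens per denoising step amortizes the calls. The token counts below are illustrative, not Mercury measurements.

```python
import math

def autoregressive_calls(n_tokens):
    """One forward pass per token: sequential by construction."""
    return n_tokens

def diffusion_calls(n_tokens, tokens_per_step):
    """Roughly N/k passes if each denoising step finalizes k tokens."""
    return math.ceil(n_tokens / tokens_per_step)

n = 1024
for k in (4, 16, 64):
    print(k, diffusion_calls(n, k), "vs", autoregressive_calls(n))
```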


Allen Roush: And so I notice that you're not talking too much about the creativity side, but I want to draw an analogy here: when humans write, like when I write, it's always left to right. And I imagine that if I challenged myself to write in any order, or even forced myself to constantly shift lines and positions, I would write in a more creative way, at least in the sense of using more unique tokens or words. So I'm wondering, have you ever measured this with your models? Have you ever measured slop scores, like usage of overrepresented phrases and words, the "it's not X, it's Y" pattern, the overuse of the em dash? Because, I want to specifically point out, in marketing communication people will throw that stuff into spam if they think it was an AI that generated it, and you might also have a subtle advantage there, independent of the inference speed.


Stefano Ermon: Yeah, I think it does generalize in different ways, and at least qualitatively we've heard this feedback, that they're different in the way they generate. I'm personally not a big fan of trying to imitate the way the human brain works or the way human processes are designed. I think we're building a fundamentally different kind of intelligence on a very different kind of compute substrate, and so we just optimize for whatever works. GPUs are the thing that is abundant today, and GPUs are good at matrix-matrix multiplies, so you should just build the whole thing on something that does a lot of matrix-matrix multiplies. I think it's as simple as that. I'm personally not a believer in trying to imitate how the human vision system works or how we write. I think we're fundamentally trying to do something else; it's an engineering problem. Unfortunately, I don't know to what extent we'll be able to get insights into how our brain works. It is arguably one of the most interesting scientific problems of all time, but I don't know to what extent building some kind of artificial intelligence system will tell us how our brain works. But maybe we're digressing here a little bit.


Ravid Shwartz-Ziv: I just meant, like kids, for example: if kids learn Hebrew first, and Hebrew is written right to left, and then afterward they learn English, sometimes they write English letters from right to left and they don't even know that they're doing it. But what about hardware? You mentioned GPUs. Do you think there are specific types of hardware that are better for diffusion models?


Stefano Ermon: Yeah, I'm sure. I mean, GPUs, and even modern GPUs especially in the last few years, have been evolving to be optimized for certain kinds of workloads, which are autoregressive in nature. So there is definitely a lot of room for improvement in architectures designed specifically for diffusion language models. You know, NVIDIA is an investor in Inception; the reason, if I had to guess, is that they need to understand what alternative workloads look like, because designing a chip takes many years. And so if this is the thing that will become the future of LLMs, they need to be ready and know what the compute patterns look like, so they can further optimize for that. But fundamentally, the problem is that with an autoregressive model, inference is memory-bound: you're spending most of your time moving weights across the memory hierarchy down to the compute units, and the arithmetic intensity is extremely low; you're not doing enough additions and multiplications on those numbers. And no matter how you change the hardware, autoregressive models are sequential, so they're not going to be parallelizable. That bottleneck is structural; you never get rid of it. So you need a different architecture, one that is parallel by nature, if you really want to get into the compute-bound regime, which is the one we want to be in.
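The memory-bound point can be put in numbers. Arithmetic intensity is FLOPs per byte moved; batch-1 autoregressive decode is a matrix-vector product, so each weight is read once and used once. A sketch for a single hypothetical d×d layer with fp16 weights, ignoring activation traffic (illustrative assumptions only):

```python
def arithmetic_intensity(d_model, batch_tokens, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for one d_model x d_model layer.

    2*d*d FLOPs per token processed; d*d weights are read once per pass
    regardless of how many tokens share that pass. Activation bytes ignored.
    """
    flops = 2 * d_model * d_model * batch_tokens
    bytes_moved = d_model * d_model * bytes_per_weight
    return flops / bytes_moved

print(arithmetic_intensity(4096, 1))     # sequential decode: 1.0 FLOP per weight-byte
print(arithmetic_intensity(4096, 256))   # 256 parallel tokens: 256x the intensity
```

Intensity here scales linearly with tokens per pass, which is why producing many tokens per network evaluation moves inference toward the compute-bound regime.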


Allen Roush: So, do you have opinions about local versus global optimization algorithms? For example, backpropagation and gradient descent are technically local optimizers, right? But there's the whole theory of global optimization, some of which is also highly parallelizable, in particular neuroevolution and forward-pass-only methods. Do you see any connection, maybe a return of forward-pass-only optimization, to the things you're talking about here?


Stefano Ermon: Maybe. I think we don't do anything particularly fancy in terms of the optimization algorithms that we use. Yet I think it's entirely possible that there are better optimization algorithms, ones more suitable for training diffusion language models. It's one of those dimensions we've not explored, because it's just too much technical risk. You can only take so many bets, and you need to place them carefully, and we've not tried to innovate too much on the optimization algorithms. Sure, they play a big role, but we've not particularly tried to innovate along that dimension. On the architecture side, yeah, we still use transformers. Again, that's another thing: we could potentially try to find architectures more suitable for diffusion language models. We actually have prototypes where we've used Mamba layers and state-space models, so we know it works, but it's too much technical risk to change too many things at the same time. So we focus on the things we think will give us the highest ROI and differentiate our solution the most compared to what's available today.


Ravid Shwartz-Ziv: Why do you think transformers are so good? Why do they work so well?


Stefano Ermon: Yeah, it's a good question. I mean, it's a pretty natural kind of computation. If you think about it, it's kind of like a kernel machine in some sense, where you're comparing everything to every other thing. If you're going to do something quadratic, that's sort of what you probably want to do. But again, it could also just be that people have spent a lot of time optimizing everything about transformers, and other architectures would be just as good. I don't know.
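The "compare everything to every other thing" intuition can be sketched in a few lines (a minimal illustration, not code from the interview): self-attention builds a T x T similarity matrix between all token pairs, which is the quadratic, kernel-machine-like part, then uses its softmax as weights for averaging.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal single-head self-attention, with x serving as
    queries, keys, and values alike (no learned projections).
    x: (T, d) token embeddings -> (T, d) mixed embeddings."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # (T, T): every token scored against every other
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x  # each output is a weighted average over all tokens

out = self_attention(np.random.default_rng(0).normal(size=(5, 8)))
print(out.shape)  # (5, 8)
```

The (T, T) score matrix is exactly the quadratic cost being referred to: every output token is a similarity-weighted average over all input tokens, much like a kernel regressor weights training points by a kernel function.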


Ravid Shwartz-Ziv: So if you needed to give advice to a new PhD student, what would be a good research direction? Which part of this very complicated system probably has better solutions that are easier to find? Which part of the system do you think is worth putting time into?


Stefano Ermon: Yeah, it depends where you are. Most of the PhD students I know of don't have the compute to do meaningful research on a lot of the training aspects. So I would actually strongly advise against doing any kind of work on architecture search, or even optimization. It's become very difficult to do impactful work without the ability to really test your ideas at scale. But inference is a good place, I personally think, at least academically. There are still a lot of interesting questions around accelerating inference and controllable generation. It's not obvious how to do these things, and the compute budgets required to do interesting things there are within reach. So I feel like, given the times we live in, that's a pretty good direction generally. It's pretty broad, but...


Allen Roush: Well then, related: do you think that recursive self-improvement is right around the corner? And do you believe some of the very big hype claims about it that a whole lot of people in Silicon Valley are now starting to make about the next few years?


Stefano Ermon: It's hard to say. I've generally been more pessimistic than other people, I think, in terms of how long things will take. I'm not saying it will not happen; I do think there is no reason it shouldn't. And it depends how you define things, right? A lot of the devil is in the details: what do you mean by recursive self-improvement? I do see, even at Inception, that having access to AI tools makes us much more efficient at coding and doing experiments, and to me that is in some sense recursive self-improvement. But how much it can compound, or how quickly, that is hard to say.


Ravid Shwartz-Ziv: But what do you think about the hype of the last few weeks, that with the new Claude Code and Codex we don't need software engineers anymore, that we will not have jobs soon?


Stefano Ermon: I mean, I don't think it's true, or at least... I have a pretty narrow sort of experience and view, in that I know the kind of software engineering work we do at the company. And I see that it does accelerate work, but it does not replace it. I couldn't fire my backend folks; I still would not be able to fire them. And in fact, we ran into issues just in the last week where even the best frontier models were completely useless. Eventually it had to be worked out by hand. And it was nothing fancy; it was actually a pretty low-level thing about managing some infrastructure, but the models were completely useless, they had no clue. So eventually a human engineer had to come in and save the day. If you want something reliable, you kind of still need engineers.


Allen Roush: That's a good answer, for sure. So do your engineers get gains from this? Is it agentic coding that they're doing mostly?


Stefano Ermon: A mix, yeah. They use agentic tools, they use just editors with suggestions. And for some things it works; it's found bugs. Just a few days ago, it was not even necessarily a bug, but there was some extra latency caused by some work that should have been async, and some synchronization that was not needed. And it was found by AI pretty quickly, so that was good.


Ravid Shwartz-Ziv: Yeah, maybe let's talk a bit about the difference between being a co-founder of a company and being a professor. Which do you think is the better option? What do you recommend?


Stefano Ermon: Well, it really depends on where you are a professor and what the company is, right? There is a big spectrum of startups with different focuses and missions. In my case, the mission is actually not that different: when I think about the mission of my lab and the mission of the company, they're actually pretty similar. The difference is more in the execution, the way you do things, and the timelines you optimize for. In an academic environment you have longer timelines and a different level of risk tolerance, and I would say also more limited resources; the team sizes are smaller and the incentives are different. It's more incentive-compatible between professors and students, in the sense that as a professor you succeed if your students are successful, so it's a very good environment to be in. But then there are issues: you need to make sure that every PhD student has his or her own thing and papers, because they need to be first authors on a bunch of papers in order to graduate. So there's a lot of fragmentation, and it kind of forces you into trade-offs that are not necessarily optimal. At a startup it's different: it's more concentrated bets, big teams all pushing in the same direction, and the horizons are much shorter. So it's a very different way of executing, but I think the mission is not that different, actually.


Ravid Shwartz-Ziv: Maybe you can tell us a bit more about the journey. How did it work? Did you have a specific idea and say, I have to test it in the real world, I will go try it and raise money in order to put it out there? Or did you say, no, I want to solve something in general, and then try to find the solutions?


Stefano Ermon: In this case it was more the former, in the sense that it was really a breakthrough that we had in the lab. This was an ICML 2024 paper, the one that won the best paper award, showing that at the GPT-2 scale these diffusion language models were finally comparable to autoregressive models in terms of quality, but significantly faster. But we were really at the boundary of what was possible in terms of compute and also engineering time. And so for me it was more of an intellectual curiosity: is this going to work if you scale things up? It was not possible to explore that within an academic environment because of limitations on compute and engineering capacity. So it was a pretty natural step. And of course, then there's also the whole excitement. I mean, this is probably going to be the most exciting technological development that I will see in my life; it's hard to imagine something that would top building an intelligent machine. So being able to play a role, to be at the frontier, is also exciting. Call it FOMO, I guess, or the realization that it's a pretty unique window of opportunity. And with startups, it's all about timing, and the timing felt right.


Allen Roush: So do you think that it's the path to get rich or not? And does money matter?


Stefano Ermon: I mean, it does; I wouldn't say it doesn't, but it's not necessarily the main driving force. I think it is a nice thing, but if I were optimizing for money, I would not have stayed in academia for so long. I could have taken much more lucrative jobs earlier on. So, yeah.


Ravid Shwartz-Ziv: What do you think about something we've seen a lot recently: researchers who spend time in a frontier lab, go outside, and raise a lot of money on very vague ideas about how to do things? What do you think about it? Do you think this is a path that works, or something that will not work?


Stefano Ermon: Well, I'm not an investor, luckily. I'm not the one that needs to make these kinds of calls and price things. But I think that's an investment question. Like, is it worth...


Ravid Shwartz-Ziv: As a researcher, I mean.


Stefano Ermon: Are these companies worth that much money? If you have the billion dollars, is this the right way to spend it? Do you think there is a reasonable chance that the returns are going to be sufficient to justify this kind of bet, right? And I think it really depends on the team. It depends on the idea. It depends on your projections in terms of what is going to happen. I think there is a chance that this technology is going to be so transformative that it creates an enormous amount of value, and so it might make sense to make these kinds of bets.


Allen Roush: So is there any research that has excited you recently, really in any area of AI?


Stefano Ermon: Well, I've been mostly focused on diffusion for language specifically. I think that's getting a lot of traction, not just in industry and the kind of thing we've been doing at the company, but even at NeurIPS in December I saw there were a lot of papers on it. It kind of makes sense: it's a pretty promising direction and an opportunity to do something new, something different. So yeah, I'm pretty much all in on that direction at the moment.


Allen Roush: We only have a few more minutes with you. So is there anything that you want to, you know, kind of sell, or anything that's exciting to you right now that you want to broadcast to the audience?


Stefano Ermon: I think we covered a lot of it. I think Mercury II is pretty exciting. If you guys haven't tried it, give it a try. It might change your mind about the gap between autoregressive and diffusion LLMs. I think it's pretty cool.


Ravid Shwartz-Ziv: So, one last question. What do you think is better for, I don't know, a new graduate student? Is it worth doing a PhD, or is it worth going to industry these days? What is the trade-off people face?


Stefano Ermon: Yeah, I get this question a lot, and I think my opinion has definitely shifted over the years as the situation evolves. The financial gap between academia and industry has widened considerably. It's gotten to a point where, depending on what you optimize for and what your situation is, it might make sense not to do a PhD, to go to industry, and to take advantage of this pretty unique situation and unique window of opportunity, right? So I think it really depends on your objective function, what you care about, and what you expect to do. Some people are passionate about teaching and research, so for them it makes sense to do a PhD, I think. Other people are more financially driven and don't want to lose out on this golden market in the research space right now, so it might make sense to just take a job at a lab.


Ravid Shwartz-Ziv: But do you see a difference now when you're hiring? Do you see a difference between someone who has done a PhD and someone who hasn't?


Stefano Ermon: Oh yeah, there is of course a difference. But the counterfactual would be: if somebody had done research in an industry lab for the same number of years, which one would be better? There it's a little bit less clear how big that gap is, and it really depends on the kind of research you do. We're not writing papers, right? So that's probably a skill the PhD student is better at, or even just explaining their ideas. But if it's more about writing good-quality code and being a proficient programmer, then maybe you get more of that if you've been in industry, if you've spent time in big tech where people are more diligent programmers. So it really depends.


Ravid Shwartz-Ziv: Maybe one last question. Five, ten years from now, what do you think the industry will look like? Do you think we will see more players with frontier models? What will their capabilities be?


Stefano Ermon: Yeah, ten years is hard; I think even two or three years is very difficult. I feel like I've been consistently underestimating the progress. It's been a wild ride just to see how quickly things went from basically nothing working to the models we have today. So it's hard to predict. I don't think we're going to hit AGI next year; I'm definitely more pessimistic, otherwise I wouldn't be doing this company. I think we're betting on slightly longer timelines, which will give us enough time to have a shot at getting there first. But how quickly does that happen? Hard to say. Ten years is a very long time.


Ravid Shwartz-Ziv: Okay, anything else that you want to add?


Stefano Ermon: No, it's good, yeah. Thanks for the interview, yeah, that was a fun chat.


Ravid Shwartz-Ziv: Thank you, and thank you to the audience, and thank you, Allen.


Allen Roush: Always a pleasure, and very, very cool to meet one of the experts on diffusion, virtually at least.


Ravid Shwartz-Ziv: Thanks. Thank you. Bye bye.


Stefano Ermon: Nice to meet you again. Thanks.