The Principles of Diffusion Models - with Jesse Lai (Sony AI)

We host Chieh-Hsin (Jesse) Lai, Staff Research Scientist at Sony AI and Visiting Professor at National Yang Ming Chiao Tung University in Taiwan, for a conversation about diffusion models, the technology behind Stable Diffusion and most of the AI image and video generators you've seen in the last few years. Jesse recently co-authored The Principles of Diffusion Models with Stefano Ermon, and the book is quickly becoming a go-to reference in the field.
We start with what a generative model actually is, and what it means to "generate" an image or a sound. Jesse explains the core idea behind diffusion in plain terms: you start with pure noise, and a neural network gradually cleans it up, step by step, until a realistic image emerges.
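That noise-to-image loop can be sketched in a few lines of toy code. This is purely illustrative and not the method from the book or the episode: the "denoiser" here is a hypothetical hand-written function that nudges the sample toward an all-zero "clean image", where a real diffusion model would call a trained neural network instead.

```python
import numpy as np

def toy_denoise_step(x, step, num_steps, rng):
    """One illustrative cleanup step: pull the sample toward a fixed
    target and re-inject a shrinking amount of fresh noise.
    (A real diffusion sampler would query a trained network here.)"""
    target = np.zeros_like(x)               # stand-in for the "clean" image
    noise_scale = 1.0 - (step + 1) / num_steps
    x = x + 0.3 * (target - x)              # move a little toward the target
    return x + noise_scale * 0.1 * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
num_steps = 50
x = rng.standard_normal((8, 8))             # start from pure noise
for step in range(num_steps):
    x = toy_denoise_step(x, step, num_steps, rng)

# After many small steps, the sample sits near the target "image".
print(float(np.abs(x).mean()))
```

The point of the sketch is the structure of the loop, not the math: generation is many small refinements of the same sample, which is also what makes the intermediate steps available for guidance, as discussed next.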
From there, we talk about why diffusion has come to dominate so much of generative AI. Because the model builds an image gradually, you can guide it along the way, nudging the output toward what you actually want, refining details, or combining it with other controls. We also discuss the common critique that diffusion is slow and how the field has largely addressed it through new techniques.
We zoom out to the bigger picture, too. Jesse shares his view on world models and whether diffusion is the right foundation for them. We talk about what makes a generative model genuinely good versus just good at gaming benchmarks, and why evaluating creativity and realism is so much harder than scoring a multiple-choice test.
Timeline
00:12 — Intro and welcoming Jesse
00:47 — Why Jesse wrote the book, and who it's for
03:29 — The three families of diffusion models, and why they're really one idea
05:14 — What makes a good generative model
07:39 — How do you even measure if a generated image is good
08:59 — Why diffusion beats autoregressive models for images
10:33 — Is diffusion still slow? How generation got fast
11:12 — A simple intuition for what a "score" is
14:12 — How the different flavors of diffusion connect under the hood
14:42 — Diffusion for text and proteins
17:12 — Consistency models and the push for one-step generation
22:12 — Diffusion for world models: simulating reality in real time
26:12 — Do world models need to understand language
35:12 — Is diffusion the right tool, or just a convenient one
38:12 — What benchmarks actually tell us, and what they miss
46:12 — Closing thoughts and where to find the book
Music:
- "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.