How to Build the Smartest Camera in Your Pocket - with Peyman Milanfar (Google)

In this episode, we sit down with Peyman Milanfar, Distinguished Scientist at Google, where he leads the Computational Imaging team. Peyman is a member of the National Academy of Engineering, an IEEE Fellow, and one of the key minds behind the imaging pipeline in Google Pixel phones. Before joining Google, he was a professor of Electrical Engineering at UC Santa Cruz for 15 years; once at Google, he spent time at Google X, where he helped develop the imaging pipeline for Google Glass. With over 35,000 citations and decades of work at the intersection of image processing and AI, Peyman makes a compelling case that denoising, long dismissed as a "digital janitor" task, is actually one of the most fundamental operations in modern machine learning, on par with SGD and backpropagation.
We trace the full arc from classical denoising algorithms to modern diffusion models. Peyman explains how early denoisers implicitly learned from image patches, how the "Is Denoising Dead?" paper in 2010 led him to ask what else denoisers could do beyond cleaning up noise, and how that question opened the door to regularization by denoising and, eventually, to the diffusion models powering image generation today.
We also dig into the practical side, including how Peyman's team shipped a one-step diffusion model on the Pixel phone for 100x ProRes Zoom, the challenges of controlling hallucinations in generative models for consumer products, and why understanding physics and the image formation process still matters in the age of large models.
The conversation wraps with a big-picture debate: why has language dominated the AI spotlight while vision lags behind? Peyman argues that visual intelligence is coming next, and that, unlike language, vision requires grounding in the physical world through robotics, world models, and continuous learning. He also reflects on his journey from professor to industry researcher and why he wouldn't trade the ability to take ideas from theory to millions of users.
Timeline
0:13 Intro
1:42 Why denoising matters
3:20 History of denoising
5:57 How denoisers work
9:39 Why phones need denoising
12:54 Tesla's vision-only bet
14:14 BM3D's dominance
16:58 "Is Denoising Dead?"
18:21 Regularization by Denoising (RED)
24:26 RED looks like diffusion
26:19 Denoising & manifolds
28:42 Energy-based vs. diffusion models
33:46 Blind denoisers
40:30 Diffusion for text
45:44 Perception-distortion tradeoff
53:05 Denoising vs. editing
57:01 ComfyUI & democratization
58:51 One-step diffusion on Pixel
59:51 Coding agents & domain expertise
1:02:45 Diffusion for music
1:06:53 World models & continuous learning
1:15:01 Why vision will overtake language
1:21:12 Professor vs. Google
1:25:08 Wrap-up
Music:
- "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
Ravid Shwartz-Ziv: Hi everyone, and welcome back to The Information Bottleneck. Today we have a great guest, Peyman. He's been a Distinguished Scientist at Google since 2012. He has done a lot of things, from Google apps and Google Glass, and now he's leading the imaging team. Before that, he was a professor of Electrical Engineering at the University of California, Santa Cruz. Hi. Nice to meet you. Thank you for coming.
Peyman: Hi. Yeah, likewise. Thanks for having me. Good to be with you.
Ravid Shwartz-Ziv: And as always, we have Allen. Allen!
Allen Roush: Nice to be here again, and nice to meet you, Peyman. And I'm really excited to continue where we left off on the last episode with more discussions about diffusion, but this time for images instead of diffusion language models.
Ravid Shwartz-Ziv: Yes. Yeah, so today we will talk about a lot of things, but I think we can start with denoising. Maybe let's start with the concept of denoising and why it's so important in machine learning in general, in recent years, and in what we do every day. So why should we care about denoising in general?
Peyman: Yeah, yeah, it's a great question. And it's a question that I am asked often because, you know, in a way, we've spent decades really thinking of denoising as a kind of digital janitor that cleans up the garbage. But if you look at it specifically through the lens of, you know, things like Tweedie's formula and the score function, you realize something that's really startling, which is that denoising isn't just a utility, isn't the janitor that you thought it was; it's actually the fundamental engine that enables a lot of the things that we do to work. In fact, the big reason really is that to know how to remove noise from a signal, or from an image in particular, is to basically have a map of the manifold that these signals live on. I think this is a very broad concept that hasn't really been promulgated enough, so people haven't really internalized how important an operation this is. And I think it's become not only the backbone of diffusion models, but before that, in the signal processing community, it was a core tool in the big arsenal of things that we have for solving inverse problems. And it is, in a way, the most important tool that we have, on the same level as things like stochastic gradient descent and backpropagation, if you're working with diffusion models in particular. So, a really critical piece of technology.
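For reference, the result Peyman invokes, Tweedie's formula, makes the denoiser-score link exact in the Gaussian case: if y = x + n with n ~ N(0, σ²I), the minimum-mean-squared-error denoiser can be written as

```latex
\mathbb{E}[x \mid y] \;=\; y \;+\; \sigma^{2}\,\nabla_{y}\log p(y)
```

In words: the optimal denoised estimate is the noisy input plus a step along the gradient of the log density of the noisy data, the same score that diffusion models are trained to approximate.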
Ravid Shwartz-Ziv: So let's start. How did it emerge? Like, we know that today people are using it everywhere, right? Recently in text, but before that in images, and of course in voice, right? So what is the beginning? How did you first see it? Why do you think it's so important, and why is it so important for so many domains?
Peyman: Yeah, so in terms of history, you know, I was trained in signal and image processing. I didn't start off as a machine learning person. But the history of noise reduction, very generally speaking, is as old as, you know, Gauss. He was making measurements by hand, observing things in the sky, and in order to aggregate all of this noisy information together, things like least squares were invented. So the history of having to deal with noise is as old as the history of any instrument that humans and scientists have had access to. And in particular, in the denoising and image processing framework, a lot of the work began with the space program: there were signals being sent from space probes and so on that needed to be cleaned up. So that's kind of how things got started. But at the beginning it was very, very ad hoc. Very basic tools were used. And increasingly over time the tools became more and more sophisticated. I began to work in this space maybe around 20 years ago with one of my graduate students. We looked at all the denoising algorithms that were around at the time, and we realized that all of them were producing results that were very comparable in quality. So the difference in quality between algorithm A and algorithm B at the very top of the list of high-performing algorithms was very, very small. So back in 2010, we wrote a paper with a very provocative title: Is Denoising Dead? Basically: do we know everything there is to know about denoising? And our conclusion was that we don't, because there was still some room if you compared to the fundamental limits of how well you could denoise things. But I...
Ravid Shwartz-Ziv: But what types of denoisers? Can you elaborate a bit on what it actually means to denoise? So now you have an image; how do you actually do it, and what families of denoisers are there?
Peyman: Yeah, exactly. Sure, sure. So the very pedestrian way of thinking about it is that you're sitting at a particular pixel and you know that this pixel is noisy, and so you look around you and find pixels that are similar in value or in context, and then you do a weighted average of these things. And that results in a pixel value, where you're sitting, where the noise is less. So that's a very, very simple way of thinking about it. But now you can generalize this. You can think about it as: what if I go really, really far away, to different parts of the image, and take weighted averages that depend on patch structures? So let's say I'm sitting somewhere on a pixel, I look around the immediate neighborhood, and I notice that I'm sitting on one side of an edge. I look for similar patterns around different parts of the image that have the same kind of geometric structure, and I do a weighted average of those things together. This kind of context-dependent weighted averaging, incidentally, is very similar to the structure of the attention mechanism that you see in modern architectures. Now, we're going back 25 years, but these ideas really have been around. So if you looked back, let's say, in 2010 or so at what these denoisers were doing, they were all basically data-dependent weighted averages of different pixels in the same image. There was no mechanism at the time for training a neural network on millions of examples of clean and noisy images. You're just given a noisy image, and that is all you're given, and you base your denoising algorithm on local features that measure what the neighborhood you're at looks like, comparing that to the rest of the image and doing some kind of data-dependent weighted average.
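A minimal sketch of the patch-based, data-dependent weighted averaging Peyman describes, in the spirit of non-local means (the function and parameter names here are illustrative, not from any particular implementation):

```python
import numpy as np

def denoise_pixel(img, i, j, radius=1, search=10, h=0.1):
    """Replace pixel (i, j) with a weighted average of pixels whose
    surrounding patches look geometrically similar (non-local-means style).
    Assumes (i, j) is at least search + radius pixels from the border."""
    r = radius
    ref = img[i - r:i + r + 1, j - r:j + r + 1]       # patch around target pixel
    num = den = 0.0
    for k in range(i - search, i + search + 1):
        for l in range(j - search, j + search + 1):
            cand = img[k - r:k + r + 1, l - r:l + r + 1]
            dist2 = np.mean((ref - cand) ** 2)        # patch dissimilarity
            w = np.exp(-dist2 / h ** 2)               # data-dependent weight
            num += w * img[k, l]
            den += w
    return num / den                                   # normalized weighted average
```

The exponential-of-similarity weights, normalized by their sum, are also why Peyman likens these denoisers to attention: this is a softmax-weighted average in which the queries and keys are patches of the same image.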
Ravid Shwartz-Ziv: So you don't have like any kind of learning here, right?
Peyman: Back then, basically, this was it. Now, if you dig a little deeper into the structure of these denoisers, you will see that there was, in fact, implicit learning happening. How? What was happening was that, as I said, you look at the local neighborhood of a pixel, and then you look for similar geometric neighborhoods around the image. So what you're looking for are nearest neighbors of this structure. And effectively, it was realized some number of years later, in one of the papers that I worked on, I think around 2013 or so, that these denoisers were sampling patches of a given image, using that data to build an empirical distribution of the pixels, and then using that empirical distribution to build a statistical denoiser, whether that was minimum mean squared error or MAP or whatever the case might have been. So the learning was happening in a very implicit way, but it wasn't happening with external examples. All the examples were just other patches in the same image. So you had this rather odd phenomenon where if your image was, let's say, a thousand by a thousand pixels, you would hit a certain bound on how well you could denoise it, but if that image was 5,000 by 5,000 pixels, you could get better denoising because there were possibly more examples of a particular patch.
Ravid Shwartz-Ziv: I have one more question. In 2010, why did we actually need denoisers? I think, I don't know, correct me if I'm wrong, the quality of the images was good enough. So why did we actually care about denoising an image?
Peyman: So the quality of the images wasn't good enough. That's from a practical point of view. This goes to the history of imaging sensors. Let's set aside the more sophisticated sensors that are used in medical imaging and so on, and just look at the photographic case. To give you a little bit of history: when we switched over from CCD sensors to CMOS sensors, one of the big advantages was that CMOS sensors were much, much cheaper. That's how we could afford to put them in so many mobile devices. But the price we paid for the cheaper sensors was that these devices were much noisier. So...
Ravid Shwartz-Ziv: why actually they were noisier? I don't know anything, like, you know, I worked on sensors, like, long time ago about, like, acceleration and things like that, but, like, not related to images. So, like, why they actually, like, why smaller sensors are noisier?
Peyman: Yeah, it's not just about smaller sensors. It's about a very different... So, charge-coupled devices, the CCDs that existed before, are just physically very different. CMOS sensors are a completely different physical architecture. And, you know, making them small came with various trade-offs in terms of the materials that were used. You know, there's a well that has a certain capacity, there's the background noise, there's the quality of the optics that you put in front of it, et cetera, et cetera. So the need for denoising, even in 2010, was still very, very strong. And a lot of it actually came from the mobile phone industry. If you look at the evolution of mobile phones, things really took off around 2010, and mobile phones became super popular. So the need for clean images coming out of your mobile device really exploded back then. And so denoising has been a very, very popular topic in image processing for a good 25 years. The interesting thing that happened, just to go back to what I was saying earlier: this need for really high quality denoisers ended up running into the limitation that we couldn't do much better, because all we had access to was just that one given image. And then over time, as machine learning became more prominent, the design of denoisers became less about taking a given image and getting rid of noise. It became more about architecting a neural network, or some other model, that could learn from many examples and then be applied to a given image.
Allen Roush: So, on this discussion of sensors that we're having, I'm wondering, do you have opinions about Elon Musk and his gamble on vision-only sensors in Teslas, where most other competitors use LIDAR, or even, historically, radar-based vision systems?
Peyman: Autonomous cars are not my expertise, so I wouldn't have a very strong opinion specifically about that approach. But suffice it to say that multiple sensors are useful when you want to be able to see different parts of the electromagnetic spectrum that are not visible with just a regular camera. For example, if you have radar or LIDAR, the atmospheric interference with those sensors is a very different game than it is with visual sensors. So it's a little bit outside of my particular area of expertise, but I'd say diversity of sensors is generally a very good thing. Of course, when you have a diversity of sensors, some of those sensors are going to be a lot more expensive than others. So there are obviously trade-offs.
Allen Roush: And then, when you're talking about denoising algorithms: I'm quite familiar with the whole host of differential equation solvers that are the analogy of what sampling is for LLMs. And they all have names like Euler, Euler adaptive, or DDIM, DPM; there are so many names for them. Do you work in that space? Do you have any opinions about what's good or what's not? It seems like there's kind of a bitter-lesson dynamic there, where the most simple ones remain the most popular, similar to Adam being almost unchallenged in the optimizer space despite nearly a decade of research trying to go past it.
Peyman: Yeah, with respect to denoising in particular, which I want to give a little bit more insight into, there was a similar thing happening in the literature for many years. There was a very famous algorithm that was dominant for a good part of, I don't know, seven or eight, maybe ten years, called BM3D, block matching 3D. This was kind of the premium denoising algorithm, and it dominated the whole field for quite some time. If you look very closely at it, it turned out to be an approximation of a minimum mean squared error estimator. It was a very clever, very well executed implementation of a minimum mean squared error estimator. And so in retrospect, once people really understood that it was a very effective approximation of MMSE, it was not surprising that other methods could not improve on it according to the mean squared error metric, because it was already approximating the optimal denoiser, right? So in some of these cases that you see, where one method maintains its dominance for a long time, sometimes it's because they're just very new and a lot of improvements are being made to them over time. But sometimes, like in this case, it's because they're just optimizing the right thing. Now, it turned out that even that method was not able to execute a perfect minimum mean squared error estimator. There was still some space between that and the actual MMSE that you could get. And that gap has now been closed with the use of neural networks. So you can now design a neural network that actually achieves very, very close to MMSE quality.
Ravid Shwartz-Ziv: So I have two questions. One: around 2010, right, you found that all the denoisers are roughly the same, right? And then what? In terms of research and product, that's it?
Peyman: Right, right, exactly. Yeah, this was a really interesting time, because once we realized there was a really small space between what was achieved at the time, in terms of the quality of the denoisers, and what was possible, the next question became: did we want to spend my students' PhDs and the rest of my own career just trying to close that tiny gap? It seemed like a waste of time to do that. So the next natural intellectual question to ask was: what else can you do with denoisers? In other words, let's assume that we have the best denoiser possible. What else could you do with it? And that's when we began to explore how denoisers could be used in different contexts. And from this came the idea of using denoisers for regularization. That led to the regularization by denoising paper that I think you're familiar with, which we can talk about.
Ravid Shwartz-Ziv: Yeah, so tell us, how can we use a denoiser as a regularizer?
Peyman: Right. So there were sort of two things that happened around the same time. One was RED, and the other one was these methods called plug-and-play. For plug-and-play, let me step back for a second and give the context. The context here wasn't denoising itself. The context was solving other inverse problems. And by inverse problem, I mean things like getting rid of blur, or inverting some linear or nonlinear operator. Things like medical imaging, and many, many others. So suppose your model is y = Ax + noise, and A is an operator that is blurring the image or downsampling it or whatever. You want to recover x, and this A operator is in between. If A is the identity, then you just have a simple denoising problem. So we want to solve these problems. One of the methods that our colleagues came up with was based on this algorithm called ADMM. ADMM had a step in it that, if you stare at it long enough, you realize is just a denoiser. And so what these folks did was replace that step with a denoiser, and now they had something where the algorithm became essentially driven by a denoiser.
Ravid Shwartz-Ziv: Wait, what is this algorithm actually like? What is the purpose of this algorithm? What is the goal of using it? Okay.
Peyman: It's an optimization algorithm. So basically, you set up an optimization problem that tries to solve for x, and you choose your prior of choice. Alternating Direction Method of Multipliers is what ADMM stands for, I think. And so you write the steps of this algorithm down, and one step just looks like a denoising step. It was a very particular kind of denoiser that that line implemented. And so our colleagues said, well, look, let's just yank that out and replace it with an arbitrary denoiser. So they plugged that in and, hey, it worked.
Ravid Shwartz-Ziv: And it worked. So what does it mean that it worked? Like, do you get better results?
Peyman: It worked in the sense that, in some cases, it gave better results. What ended up happening was that by swapping that one line out and putting in arbitrary denoisers, denoisers that were even better than the denoiser that one line implied, they were able to get even better results, right? And so it was a little bit magical to be able to do that. So around that time, myself and Miki Elad, who's at the Technion, were looking at similar problems, and we saw this. Now, what I just described was not a principled way of solving an optimization problem: when you swapped that one line out and added a separate denoiser, you no longer optimized the objective that you started off with. We wanted to see if we could do something similar but in a more principled way. In other words, could you replace the regularizer with something that involved the denoiser, and then optimize using that? And after a few months of back and forth, and at the time Miki was visiting me, we came up with this idea. The idea was that you could build a loss function based on the residual of a denoiser. You define an arbitrary denoiser f(x), you apply it to an image, and you subtract it from the input to the denoiser. And that residual, or rather an energy based on that residual, was very effective as a regularizer.
Ravid Shwartz-Ziv: Let's go a step back for just a second. So the task now is: we have some loss function that we want to optimize, right? On images, right? And this is an unsupervised task, right? You have some loss function that can basically be anything, right? And now you say: okay, give me a denoiser, whatever denoiser you want, and I will plug the residual of this denoiser into the loss function as a constraint, right? And then...
Peyman: It's another term in the loss function. It's basically another term in the loss function.
Ravid Shwartz-Ziv: Okay, so you say that if I use this denoiser, it will be another term to... It's like adding another term to the loss function.
Peyman: It's not like adding another term, it's exactly adding another term. So, you know, imagine you want to solve this y = Ax + noise problem, and the first term in the loss is the norm squared of y minus Ax, right? That just minimizes the difference between what you want and what you have. And then you add another term, and that second term is purely an energy function that is built on the denoiser directly.
Ravid Shwartz-Ziv: Mm-hmm.
Peyman: And this denoiser is pretty arbitrary.
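Written out, the two terms Peyman describes take this form in the RED paper (Romano, Elad, and Milanfar, 2017), with f the plugged-in denoiser and λ trading off the two terms:

```latex
E(x) \;=\; \tfrac{1}{2}\,\lVert y - A x \rVert^{2}
      \;+\; \tfrac{\lambda}{2}\, x^{\top}\bigl(x - f(x)\bigr)
```

Under the paper's assumptions (including the symmetric Jacobian Peyman mentions just below), the gradient of the second term reduces to λ(x − f(x)), the denoising residual itself.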
Ravid Shwartz-Ziv: And it doesn't matter what types of noise I'm assuming, what types of...
Peyman: No. Basically, what ends up happening is that the denoiser isn't really being used as a denoiser anymore. It's being used as a way of constraining the problem. Okay. And what was surprising is that when we first played around with this, from a practical point of view, when we implemented it, it worked. And it didn't matter which denoiser you put in there; whatever denoiser we put in there, the thing behaved in a very stable way and gave us good results. But when it came to writing it down and understanding under what circumstances it works, we had to make several assumptions. Like, the denoiser had to be well behaved, the Jacobian of the denoiser had to be symmetric, et cetera. It's a bunch of math, right? So we wrote this paper, but it turned out that even in the...
Ravid Shwartz-Ziv: All this math.
Peyman: We made those assumptions so that we could make the math work. But it turned out that even the denoisers that didn't satisfy those constraints still behaved themselves and still gave us good results. For example, I use this in one of the talks that I give: we used a median filter. With a median filter, you just take a box of pixels and you replace the center pixel with the median of its neighbors. There is no reason why that should work. And it turns out that if you take this optimization problem that we set up, and you unroll it and run it, it basically means you're running a denoiser over some modified version of the image many, many times, and you get a clean, for example blur-reduced, image. So that was quite a revelation. And in a way, if you go back and look at the results that we came up with in RED, if you unroll that algorithm, what it looks like is very much a diffusion method. In other words, you start with the corrupted image, and that corrupted image could be, for example, blurry. And what you do is apply a denoiser to it, then a linear operator, then the denoiser, then a linear operator, and then the denoiser. It's almost exactly like a diffusion process. We were not thinking about diffusion methods at the time, but this had that characteristic. Now, why did it have that characteristic? That was the fascinating thing. And that realization came much later: this loss function we built, based on the residual of the denoiser, was actually pointing to the geometry of the manifold that images sit on. And that's the common through line between the original motivation of denoising, regularization by denoising, and diffusion models.
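A sketch of the unrolled iteration Peyman describes, for the y = Ax + noise setup, as plain gradient descent on the RED objective (images treated as flattened vectors; the function names, step size, and weights are illustrative, not from the paper's experiments):

```python
import numpy as np

def red_restore(y, A, denoise, lam=0.1, step=0.5, iters=200):
    """Gradient descent on 0.5*||y - A x||^2 + (lam/2) * x.T @ (x - f(x)).
    Under RED's assumptions, the regularizer's gradient is lam * (x - f(x)),
    so each step alternates a data-fidelity pull with a denoising residual,
    much like one step of a diffusion sampler."""
    x = y.copy()                              # start from the corrupted image
    for _ in range(iters):
        grad_fid = A.T @ (A @ x - y)          # data-fidelity gradient
        grad_reg = lam * (x - denoise(x))     # denoising residual as gradient
        x = x - step * (grad_fid + grad_reg)
    return x

# Any reasonable denoiser can be plugged in; box_blur here is a placeholder:
# restored = red_restore(y, A, denoise=lambda x: box_blur(x, 3))
```

The striking point from the conversation is visible in the loop: the algorithm is just "denoise, apply a linear operator, repeat," which is the same skeleton diffusion samplers use.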
Allen Roush: So do you think there's a relationship between denoising and manifolds, similar to the whole intelligence-and-compression relationship we see?
Peyman: Yes, absolutely. I mean, denoising keeps emerging across various domains because, at the core, natural signals, whether they're speech, images, photography, whatever, generally tend to reside on a low dimensional manifold, because they have a lot of structure. But that low dimensional manifold is embedded in a high dimensional space. There's a lot that's been written on this over a very long time. And the reason this becomes really important is that the following empirical observation is true. You can make this very precise mathematically, but think about it this way, in the case of images: images live on a manifold, and think of this manifold as clean, right? It doesn't have noise in it. If you take an image sitting on that manifold and you add noise to it, you're popping the image off the manifold. What denoising does is give you a very definitive mechanism for projecting back onto that manifold. If your denoiser did a perfect job, you would land right where you started. But you never have a perfect denoiser, so you land somewhere very nearby. You typically end up projecting onto a tangent space of the manifold. So, to get back to your original question: is there a relationship between these? Absolutely there is. Because denoising gives you this tool for projecting onto the manifold on which the underlying signals live, it's like implicitly having a map of the manifold: if you find yourself lost somewhere off the manifold, it tells you how to get back.
Ravid Shwartz-Ziv: But that sounds really similar to energy-based methods, right? And yet today, even in vision, we are not using energy-based methods, and for sure not for text, right? Why do you think that's the case? It sounds very natural to use them. Why do you think it doesn't work in practice?
Peyman: Yeah, so there are various reasons. Obviously, energy-based methods are notoriously difficult to work with. There's a lot of instability and so on. This speaks to one of the latest things, a paper that we just put on arXiv. It tries to understand how you could use the formalism of energy-based methods to effectively do what diffusion models do. If you think about it, what diffusion models do is, at each step, run a denoiser, and you have to tell the denoiser the exact noise level; that's the noise schedule, right, at each step of the iteration. But if you're working with an energy-based model, there's no concept of this noise schedule, right? You're just doing gradient descent on this energy that you've defined. So what we tried to do is understand whether there's a bridge between energy-based methods, diffusion models, and denoising. And what we came up with is this quantity that we call the marginal energy. Imagine you integrate across the different noise levels that you might have. The diffusion model is basically fixing the noise level, taking a step, fixing the noise level again, and taking another step. What if you integrate across all the noise levels, from t = 0 to t = 1? That gives you an energy landscape that is no longer conditioned on the noise level. And you can say, okay, great, that's my energy landscape, I'm going to start doing gradient descent on it. Well, it turns out if you do that, you will fail, because that energy landscape has some very nasty singularities. In fact, at every data point, you get an infinite well with infinite gradient, and so it won't work. So if you want to do some sort of optimization on this, you have to be very, very careful about how you set it up. It turns out that if you try to do what the original denoising diffusion models do, which is to predict the noise, you will fail, because that will just push you towards these very unstable points. It turns out that flow models are a really good way of doing this, because both flow models and direct signal prediction models end up effectively doing a kind of natural gradient descent, not just normal gradient descent, on this manifold that...
Ravid Shwartz-Ziv: What's the difference between natural and normal gradient descent?
Peyman: Right. So standard gradient descent is: you take the gradient of the energy and you take a step in that direction, right? In natural gradient descent, you have another term in front of the gradient that scales it. And that's effectively a kind of metric; it's like a conformal metric that tells you, okay, if you get close to the discontinuity, you need to slow down, or go faster, et cetera, et cetera. So it turns out that this works. And we were able to show that the methods that work, like flow matching and so on, take advantage of this. This paper that we just put on arXiv, the title is something like "The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning." It doesn't provide an algorithm, but it provides a roadmap for how you could go about doing this. While we're at it, it might be worth saying something about why, just intuitively, these models are able to get away with not knowing the exact noise level. At the beginning, it's a little counterintuitive. It certainly was for me; it took a lot of getting used to. But Gaussian noise in really high dimensions is not a cloud like it is in lower dimensions, right? You get these concentrations: the noise concentrates in a very thin shell. Every noise level in very high dimensional space concentrates in its own thin shell. So if you go from one noise level to the next, these are just concentric shells that don't intersect each other. These noise-agnostic models can get away with not knowing the noise level because of this.
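In symbols, the contrast Peyman draws is a position-dependent rescaling of the step, with M(x) a metric (this is the generic form of a natural/preconditioned gradient step; the specific metric in their paper is not spelled out here):

```latex
x_{k+1} = x_k - \eta\,\nabla E(x_k)
\qquad \text{vs.} \qquad
x_{k+1} = x_k - \eta\, M(x_k)^{-1}\,\nabla E(x_k)
```

Near the singular wells he describes, the metric can grow so the effective step shrinks, which is exactly the "slow down near the discontinuity" behavior mentioned above.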
Ravid Shwartz-Ziv: But you still need to learn like the scaling, right?
Peyman: You need to learn the scale, exactly. The scale is the important thing. And the noisy data that you start with effectively tells you the strength of the noise. This is the amazing part, right? People have worked directly on this in denoising, and there have been some recent papers by Simoncelli and team that have shown this as well. You can have so-called, quote unquote, blind denoisers, right? People have studied this question: can you have one denoiser that works across all noise levels? This was not understood to be possible for a long time, and now people understand that it is possible. So this paper describes why it's possible and what you can do with it. So anyway, to pop back up and address your earlier question: it's the geometry of the high dimensional space that allows you to get away with a noise-agnostic, or noise-blind, diffusion, if you like.
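The shell picture is easy to verify numerically: the norm of d-dimensional unit-variance Gaussian noise concentrates around √d with vanishing relative spread, so different noise levels occupy nearly disjoint shells. A quick sketch (sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 1_000, 100_000):            # dimensionality of the "image"
    z = rng.normal(size=(200, d))         # unit-variance Gaussian noise samples
    norms = np.linalg.norm(z, axis=1)
    # mean norm is ~ sqrt(d); the relative spread shrinks as d grows
    print(f"d={d:>6}  mean/sqrt(d)={norms.mean()/np.sqrt(d):.4f}  "
          f"rel std={norms.std()/norms.mean():.5f}")
```

This is also why the noisy input itself reveals the noise strength, as Peyman says: its norm essentially pins down which shell you are on.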
Allen Roush: So you talked earlier about singularities, or discontinuities, and problems associated with traditional gradient descent, and having that additional term on it for the distance of each step. I've always been a fan personally of gradient-free optimization algorithms, particularly ones that call themselves global optimizers, right? Because they don't make assumptions about the shape of the solution space. Do you have any opinions about those? And do you think that they would be more useful with diffusion models?
Peyman: At least the way I've always thought about gradient-free methods is that they tend to be very useful in situations where computing the gradient is really difficult. Whereas in situations like what we just described, you can compute the gradient, because through things like Tweedie's formula and so on, we actually have analytical ways of computing these gradients that allow us to do optimization on these high dimensional manifolds. So I'm not aware of a lot of work that's gone towards gradient-free optimization for these problems. That may be the reason.
Ravid Shwartz-Ziv: So what are people using today? You said, you know, this is a kind of regularizer. So what are people using this for today, in images?
Peyman: Let me make sure I understand the question. You mean...
Ravid Shwartz-Ziv: I mean, besides just denoising images, what are the options? What are the things that people are using denoisers in?
Peyman: Well, regularization by denoising and the plug-and-play method have become very standard tools in inverse problems now. Pretty much every other paper that you pull out nowadays in the inverse problems community that does practical things uses one of these methods. I have to confess, at the time we came up with it, we loved the idea because we thought it was really cool, but we didn't think it would be so widely used. And I will also say there's a lot that's still not known about these things. For both the plug-and-play method and regularization by denoising, there are still questions about convergence, like under what conditions they converge and so on, but that hasn't stopped people from going ahead and using them anyway, with some stopping criterion or whatever, very similar to what people do in the diffusion community. So it's been a spectacular success in terms of giving people a very practical way of regularizing these inverse problems and getting to solutions quickly.
Ravid Shwartz-Ziv: And do you think this is connected with the structure of the neural network? Because this is very general, right? But it looks like it works really well with neural networks, right? Even diffusion models, they didn't work before, right? Do you think there is some internal connection between the two? Or is it general, and can it work for any family of models?
Peyman: Yeah, I think the idea that denoisers are effectively a projection operator onto this lower dimensional manifold of signals is very, very general. I don't think it has anything to do with neural networks per se. So I think that is a fundamental fact that is useful both from a conceptual point of view and for algorithm development. What's exciting about their use in, for example, the diffusion world is that they also tie really well with how people think about and interpret diffusion models. It's beginning to be useful in terms of explaining why diffusion models work at all. You've got solutions of ODEs, and these are paths from the input to the solution, and these paths are not arbitrary; they go along a particular manifold. So I think the idea of denoising and its usefulness are universal and very, very general. In the paper that we wrote that summarized all this, which you kindly said you liked a lot, we made this point: denoising on its own has obviously been very useful in practice. It's in every cell phone and whatnot. It's been very useful for solving imaging problems. And then in the last five or six years, it's been the key to making diffusion models work, which kick-started a lot of this generative AI stuff for images and videos. So, anyway. You can tell I'm very excited about this, because for someone who's worked in this area for the better part of two decades, with denoising considered many years ago as the janitor of signal processing, it's lovely to see that it has a really fundamental role in so many different things.
Ravid Shwartz-Ziv: What about text? You know, Stefano Ermon was here last time, in the last episode, and he bet that text diffusion will, in the long term, win the race for better language models. What do you think about that? Because it's fundamentally very different, right? It's discrete.
Peyman: Yeah, it is discrete. A few years ago, I gave a talk about denoising and what it can do, and somebody asked me this question: is this applicable to discrete data? And it is absolutely applicable to discrete data. Denoising takes a very different form with discrete data. If folks are interested in this, I highly recommend papers from, this is probably 10 years ago, Tsachy Weissman at Stanford, who wrote a beautiful sequence of papers about denoising of discrete data. I don't have all the details at my fingertips, but the data he was talking about was samples created from a finite dictionary. You know, Tsachy is an information theorist, so you can see where he's coming from. So, I'm no expert in diffusion language models, but what I'm hearing makes a lot of sense to me, because I think the denoising operators that we have definitely are applicable to discrete data. There is historical precedent for doing things in a very solid way from an information-theoretic perspective, like Tsachy's work and others in this space. So it wouldn't surprise me if those worked quite well. Now, from a practical point of view, are they going to be immediately competitive with the autoregressive models that we have now? That remains to be seen. It's very early days. But it's exciting to see these things come to life too.
Allen Roush: And then, we talked earlier about noise. What type of noise do you use? Perlin noise, or something like that?
Peyman: I mean, typically the kind of noise that we work with for diffusion models is just Gaussian noise.
Allen Roush: Gaussian, yeah. So I ran into noise algorithms in a separate way, in the development of a roguelike video game I've been working on. Different types of noise algorithms are really nice for creating different types of dungeon shapes. And I know there are even studies about different types of noise in the context of white noise for people to sleep next to: white, pink, et cetera, the colors or flavors of noise. Do these have any impact on denoising objectives with diffusion models?
Peyman: Yeah, that's a great question. I mean, we often make the assumption that the noise corrupting our signals is Gaussian. And we do that for several reasons. One is that if you fix the first and second order statistics of the noise, the mean and the variance, Gaussian noise is the highest entropy noise that you can have. So in some sense, it's the most difficult noise to deal with, because it's the most random. That's one reason. The other one is the central limit theorem: if you take enough samples of something and you do some form of averaging across them, then that mean eventually gives you a Gaussian distribution. Now, to your point specifically: do we have scenarios where the noise is not Gaussian in practice? Absolutely. In my business in particular, when you talk about the noise that you measure in camera sensors and so on, the noise is hardly ever Gaussian. There we have to be much, much more careful. Some of the nice theoretical things that we can say about the denoising operator may not hold exactly for the kinds of noise we deal with in practice the way they would for Gaussian noise. But the fact of the matter is that the very idea of a denoising operator is one that reduces uncertainty. You have an underlying signal that has a particular structure, and then you have something with a lot less structure that sits on top of it. The distribution of that might be Gaussian or not. But the denoising operator reduces that entropy to hopefully give you an idea of what's underneath. But yeah, to your point, there's all sorts of noise in all sorts of circumstances. I've worked in everything from photography to digital cinema, and I've worked with folks who are experts in games, and I know that being able to produce just the right distribution of noise to make a scene look right is really important.
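The central-limit argument Peyman gives can be checked in a few lines: averaging even strongly non-Gaussian noise pushes the skewness and excess kurtosis of the result toward the Gaussian values of zero (the sample sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, size=(100_000, 64))   # decidedly non-Gaussian samples
m = u.mean(axis=1)                           # average 64 of them per trial
s = (m - m.mean()) / m.std()                 # standardize the averages
print("skewness:", np.mean(s**3))            # ~0 for a Gaussian
print("excess kurtosis:", np.mean(s**4) - 3) # ~0 for a Gaussian
```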
Ravid Shwartz-Ziv: But this is related to quality, right? Quality is something that isn't even easy to define, right? What does it mean to denoise in a really good way? Is it just the mean squared error, or do you focus on what a human sees as a clean image? How do you think about it? Different types of noise for different scenarios?
Peyman: Yeah, this is a great question. For a number of years, some of us have worked on this question of what it means to have a high quality image. As Allen was just pointing out, in some cases, a high quality image that gets a very high perceptual score is actually one that's slightly noisy, right? You have the right texture. If you go to a movie theater and the movie has no grain in it, it will look wrong and it will be very hard to watch. So having a certain amount of grain, a certain amount of noise, is actually a good thing. That's a separate question from the usefulness of the operator. So the way we've thought about this is that there are two things. One is a legitimate numerical metric that tries to tell you how far you are from a particular target, like mean squared error, or SNR, or things like that. The other is a perceptual metric. And there's a beautiful paper called "The Perception-Distortion Tradeoff." What this paper says is something really compelling: you cannot simultaneously optimize for a perceptual metric and a numerical metric. If you optimize for one, you give up on the other. And so you have this effective Pareto frontier, where one side of it is achievable and the other side is not. Interestingly, this is one of those things that is fairly well known in my community, in imaging and signal processing, but very little known in the machine learning community. I think folks should really know about it. So to your point: yes, it depends on the objective. Like anything else, you have to pick an operating point where you decide, do I really want low mean squared error, or do I want high perceptual quality? I will add something to that, though, that I think is even more compelling. Nowadays, given the power of generative models, it's not simply about a numerical metric versus a perceptual metric. There's a third axis, which I think of as authenticity. In other words, I can take a very poor quality image, blurry, noisy, any number of things, and apply a generative model to it and produce a result that is arbitrarily high quality. However, it will no longer correspond to the reality of the moment and the picture that was taken. And so this third axis, I think, is super important. Work in this space is also beginning: I think the same authors as the paper I just mentioned have started to write about this. So these are really, really important questions. There's no definitive answer to any of them yet, but as the models used in imaging become more and more powerful, these are questions we have to wrestle with.
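For reference, the paper Peyman cites (Blau and Michaeli, 2018) formalizes the tradeoff as a perception-distortion function: among all estimators whose average distortion stays within a budget D, how close can the output distribution get to the distribution of natural images?

```latex
P(D) \;=\; \min_{p_{\hat{X}\mid Y}} \; d\bigl(p_{X},\, p_{\hat{X}}\bigr)
\quad \text{subject to} \quad
\mathbb{E}\bigl[\Delta(X, \hat{X})\bigr] \;\le\; D
```

The paper shows P(D) is non-increasing, which is the Pareto frontier Peyman describes: pushing distortion lower eventually forces the output distribution further from that of real images.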
Ravid Shwartz-Ziv: But what are your thoughts about generative models in this sense? So, I came from information theory: compression, and how you compress neural networks. When you talk about compression in neural networks, about information compression, you always need to ask: okay, what is the goal? What type of information do I want to compress? What type of information do I want to keep in the network? But this is true only if you have some very specific task, if you have the labels, for example. But in the general case these days, you know, with pre-training or SSL and things like that, we basically don't have a clear solution there. We kind of give up and say that in some magical way the information is more accessible, but it's not clear if...
Peyman: Exactly.
Ravid Shwartz-Ziv: if you have compression or not and something like that. How do you think about it in generative models? Like how do you think, like, is it some, like, underlying information that we need to some access to it? Or like, do you think we are just, like, store this information in the weight and just help to reveal it? Or how do you think about it?
Peyman: Yeah, it's a great question. There's this concept in information theory of the rate-distortion tradeoff. And as far as I'm aware, there's no comparable concept that's been developed for neural networks and diffusion models. Maybe there is and I'm just ignorant of it. But what I mentioned earlier, these trade-offs between perceptual quality, numerical quality, authenticity, and so on, is basically pointing at this question. In my world, where we develop things for consumer applications, the metrics, if you like, that we design our systems by are directly informed by the consumer's experience. We try to understand what the right thing to present to the user is. And we try to make sure that our models, as I said at the beginning of the podcast, are enhancing and restoring reality, not altering or imagining it. That's a very fine line to walk. And in order to do that, you have to be very, very careful with the models that you develop. Having a larger model is not necessarily a better thing, because you're increasing the capacity of the model to hallucinate details that may be undesirable. And as you can well appreciate, the larger you make the model, the harder it gets to control its behavior, unless you have some way of setting some of the weights to zero or whatever. So it's a really important question. And if there's any place where I see a clear role for theory going forward, it's to understand these trade-offs in a much more systematic way. The many decades of study in compression and information theory gave us a real framework for understanding these kinds of tradeoffs. I don't think we're quite there yet with the very large generative models.
Ravid Shwartz-Ziv: And do you think, for example, that originally, in classical denoising, you want to reconstruct the original image as it appeared in the world, right? But today it's not clear that this is the optimal use of denoising, or of a generative model. Sometimes you want to change the image, to make the person more beautiful or whatever, right? So do you think this is another type of denoising, or is it something fundamentally different?
Peyman: I think that's fundamentally different. I think, you know, again, if you go back to thinking about denoising as this operator that gives you back a natural image, and by natural image I mean something that was captured by a sensor, a picture of the real, actual world that was out there, then the function of denoising is fairly clear. It's to reproduce a real physical scene that was observed by the sensor. Now, if you're going to actually change that scene, that goes into the realm of editing. And while it is true that you can use a sequence of denoisers, in the form of a diffusion model, to edit an image, I don't think that takes away from the core function of the denoiser. At each step, it's trying to replicate something very fundamental, which is a natural-looking image. I think, again, this is a good place to talk a little about what role theory plays anymore, in a world where things are moving so fast and more powerful models are coming online. The role of theory, I think, has shifted. There used to be a time when, in our community, the theory dictated the algorithm. If you look at the history of things like wavelets, the theory itself became the algorithm, became a particular way of doing things. But nowadays, the job description for theory is different. We are no longer using theory primarily to invent the algorithm. We're using theory to guide the architecture and to do things like guaranteeing stability and making sure that our models don't produce undesirable results. In some ways, the power and capabilities of the models have far outstripped any theoretical understanding that we have of them as of today. And that, I think, is the challenge. I will also say, I think doing theory in an industrial setting is different, but it's critical, because if nothing else, it's the ultimate cost saver, right? We train massive generative models that cost many, many millions of dollars to develop, and you can't afford to do these things willy-nilly on a trial basis. You have to have some principles by which you operate, and this is how we work. It's not just a random march around the architectures and so on. We try to understand, at some practical level, the theory behind why things work and how things work, and we let that inform how we develop the models.
Allen Roush: So these diffusion models have really kicked off the generative AI revolution. And I noticed that a lot of the value, at least with the early Stable Diffusion models, came from people using them as tools, with very pro or prosumer user interfaces around them. For example, there was sketch-to-image, image-to-image, there was regional prompting, where you have different prompts for different parts of the image and denoise them all at once. There were ControlNets. And I'm just wondering how you see the relationship. You talk about theory, which is probably the most extreme academic end. But then we have the hobbyists who built a lot of those user interfaces and are still highly influential, with ComfyUI being the main way that people use open-source diffusion models. And that's clearly a hobbyist project, so I'm just curious how you see this relationship.
Peyman: Yeah, I think what's wonderful about the era that we're in is the tremendous flexibility of these models, which allows everyone to experiment with them. I mean, you know, whether you can code or you can't, it's no longer a blocker on whether you can experiment. That was absolutely not the case even 10 years ago. In my field, you had to be an expert, or you had to pay for, I don't know, some very expensive piece of software in order to be able to do some of these operations. Now everybody can certainly do some of it and experiment. I think it's really important to separate the ability to do a lot of experimentation from the ability to do something in the consumer space that's really useful, effective, repeatable, and reliable. It's one thing to give someone a tool where they can take an image, manipulate it in many different ways, have fun with it, and be creative with it. That's very different from being able to take a large generative model, put it inside a phone, and use it to generate high quality images. We did this last year with one of the features in the Pixel phones called ProRes Zoom, where we're able to produce digitally upscaled images up to 100x. The process for that involved very, very carefully understanding what the model was capable of, and training it to make sure that it didn't create weird artifacts, that it didn't hallucinate, and so on. So that model that we fine-tuned and were able to put on a device and ship, by the way, a one-step diffusion model, which was unprecedented up to that point and I think still is, is a very different kind of model than a model that you would have access to through an API and could just play around with. So I think use cases are very much going to drive what the right use of the model is going to look like. Some of that is going to require fine-tuning for particular consumer use cases. Some of it is going to be much more directed towards professionals who want maximum flexibility to play around with a model. And others are going to fall somewhere in the middle. So I think this is a really exciting time, because you can see so many different use cases and so many different ways of using these models that it spawns a very large swath of possibilities that weren't there before.
Ravid Shwartz-Ziv: And what do you think about coding agents? Because in the last few weeks, it seems like everyone thinks that now that we have Claude Code, it knows how to close the loop, right? To run the experiments, to get some number, and to iterate on that number. Do you think this is the future? Do you think we will see more and more of this and less human in the loop, or do we still need very specific expertise in this field?
Peyman: I think tools that help developers be more effective and work faster are going to revolutionize our business. When I say our business, I mean the collective industry, right? I don't mean specifically Google. Because any tool that helps you work more effectively and faster is going to be a game changer. Now, again, it depends on how people use it. If you have a very specific thing that you want to do that is rather complex, it may be very, very difficult to do that with vibe coding, or knowing very little. You still may need a lot of domain knowledge to get where you want to go. But if you want to very quickly put together a website, or build an app that does a relatively straightforward function, that may be a lot easier to do. So again, I go back to the particular use cases, right? Do I think folks with my kind of expertise, an engineer working at the intersection of computational imaging and AI, are going to be irrelevant because of these tools? No, I don't think so. Because there's so much domain knowledge that informs how we interface with these tools, and that is not going to be replaced overnight, or even in the medium term. So I think the tools are an accelerator. They allow us to be more effective, to explore the space much more efficiently. And they're definitely a welcome addition to the arsenal of things that we have. But there's still a lot of work to be done that will require, at least in my part of the world, very specific domain knowledge.
Allen Roush: And what do you think about diffusion models for other domains? So for example, I'm fascinated by the work that takes diffusion image models and has them diffuse, I think it's spectrogram images, and then uses them as pretty mediocre music production models, or bootstraps from them. What do you think of diffusion in different modalities, and of diffusion image models applied to very unique use cases like that?
Peyman: I have to say I know very little about music generation. I'm mostly just a consumer of music rather than a developer or a composer, and I haven't really heard a lot of generated music, so I can't form a well-informed opinion about it. But some of this stuff is not entirely unprecedented, and let me tell you why. A long time ago, much earlier in my career, one of the first projects that I worked on was acoustics, specifically underwater acoustics in the ocean. One of the ideas there was that you create these spectrograms of all the noises that are measured under the ocean, and there are many, many listening stations under the ocean for various purposes. So what you have is this picture, this image, which is a spectrogram. And in this image, you can identify signatures of various things. You can hear shipping vessels, you can hear fishing boats, you can hear whales, shrimp, and various other things. And you can filter some of these things out. For example, this particular project that I worked on, this was like 30 years ago now, was about tracking whales in the ocean from their acoustic signatures. So you can take the spectrogram, filter things out of it, and then regenerate the music or the sounds from the spectrogram and listen only to the sound of the whales, with everything else filtered out. And now we're talking about mechanizing these things with large models. So I can only imagine the possibilities and how exciting it would be. But I know so very little about music that I don't think I'm particularly qualified to say whether they'll do a good job or not.
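[Editor's note: the filter-and-resynthesize workflow Peyman describes maps directly onto a short-time Fourier transform pipeline. A minimal sketch follows; the sample rate, band edges, and toy signals are all made-up assumptions, and real bioacoustic processing is far more careful.]

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000                                  # sample rate (Hz), assumed
t = np.arange(0, 5, 1 / fs)
whale = np.sin(2 * np.pi * 300 * t)        # toy "whale call" near 300 Hz
ship = np.sin(2 * np.pi * 60 * t)          # toy low-frequency ship noise
recording = whale + ship + 0.1 * np.random.randn(t.size)

# 1) Turn the 1-D recording into a 2-D time-frequency image (the spectrogram).
f, frames, Z = stft(recording, fs=fs, nperseg=512)

# 2) "Filter things out of the spectrogram": keep only a band around the
#    signature of interest and zero everything else.
mask = (f >= 200) & (f <= 400)             # assumed whale band, in Hz
Z_filtered = Z * mask[:, None]

# 3) Resynthesize audio from the filtered spectrogram and listen to the result.
_, whales_only = istft(Z_filtered, fs=fs, nperseg=512)
```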
Ravid Shwartz-Ziv: But do you think there is some kind of common structure for multimodality? Do you think in the future we will see generative models that know how to handle different modalities, like video and voice and all of these together?
Peyman: I do. Mainly this is just my intuition, not because I have any particular technical insight there, but I mean, we do, right, as humans. We are entirely multimodal creatures. You know, we have all of our five senses, audio and vision and so on and so forth. But without getting too far out of my comfort range, I will say this. I think that perhaps the most obvious place is to look at things like video and images. Right? I mean, when you go to your camera, you have two modes. You switch between taking a still photo versus a video. I don't really think of those as two very different things. I mean, the quality of the material you get is different, obviously, and it has a different inherent value, but videos and pictures, insofar as your vision is concerned, are one and the same thing. You're always seeing things in a dynamic world. Now video, because it's dynamic and because it sees objects moving, also gives a sense of what the 3D world looks like, and that's much more difficult to get from a single image. So to me, this break between video understanding and video processing on one side and image understanding and image processing on the other has always been artificial. If nothing else, multimodality at its most basic will mean that image and video are the same entity in some sense, the same visual entity.
Ravid Shwartz-Ziv: But in order to get these models to fully understand and to be better, do you think we need some kind of internal structure, world models if you want, you can call it that? Do you think we should have this component, or, as of today, do we not need an explicit structure?
Peyman: Well, I mean, it remains to be seen. It's unclear whether you need a thing specifically called a world model, but a model that understands the three-dimensional world in the context of whatever it captures is obviously going to be much more useful than a model that understands simply the projection of that three-dimensional world onto a two-dimensional sensor. Right, so I mean, that's kind of self-evident. Whether we need the level of sophistication that these world models are bringing to the table, whether that's absolutely necessary for the kinds of tasks that I care about, I'm not sure. But there's no question that they are going to bring us to a point of being able to better understand the environments in which the sensors are working. So I'm optimistic about what these world models will bring us, because I think that inherently we need additional information about the scene we're capturing in order to be able to do things with it. And this gets to another point, which is how humans interact with the world: we live in the world in a very dynamic way. It's not like the frozen, static models we use now, where we train the model, it's frozen, and then that model makes inferences about the world. That's not how we work. We are constantly moving through the world, constantly readjusting and learning new things and understanding the world from different angles and so on and so forth. So yes, you might have world models, but if they are simply static world models, they will be less useful than something that learns continuously. So that's kind of the way I think about it.
Ravid Shwartz-Ziv: And why? So in the end, I think, when you look at an image or a video, it really depends on you, on your perceptual understanding, right? We as humans understand different types of objects, and we each see the world differently, if you want, right? Do you think we need to make these algorithms more human somehow, maybe through language, for example? Or do you think we can teach our models without very specific supervision, just to understand the world without any supervision?
Peyman: Yeah, I'm not really sure about the supervised aspect of things. I know there's been some debate. I've done some of this debating online with some folks, you might have seen it on Twitter, about whether a child learns very quickly from relatively few examples. Is that considered supervised learning or is it unsupervised? As somebody who's raised two kids, I would say there's a lot of supervision going on there. But it's not the same as supervision in the sense of training data; it's a very, very different mode of learning. I think the critical thing, and I'm certainly not the first person to say this, many others have pointed this out, Jitendra Malik has given a beautiful talk about this, is that you need a component of your models interacting with the physical world in order to understand it, whether you call that supervised or unsupervised. I think this is where the robotics component comes in. Again, I will preface this by saying I'm no expert in any of this, but I see why things like world models and robotics are really important. It's because this is the next stage of bringing a static model trained in a lab out into the real world and asking it to interact with the real world, finding the corner cases, and being able to train and retrain and improve it. One place where this has gone beautifully is self-driving cars. If you've driven in a, well, not driven, but sat in a Waymo, being driven around, it's a beautiful experience, because you can see it's a prime example of an intelligent system finding its way around. We need to do that for all kinds of other modalities that are not just driving around a city. So I think it's a really exciting period of time.
Allen Roush: And what do you think about continual learning and its importance? You know, you hear people like Dario claiming it's not necessary for the data centers full of geniuses we're about to have. But I personally think it's very important. And do you think that diffusion models can do continual learning, or be made to do it?
Peyman: I don't know about diffusion models specifically, but I'll just say, very generally speaking, whether you want to call it continual learning or not, it's a matter of a model, or some physical manifestation of a model, being able to interact with the real world. I think that's important. Continual learning literally means, I think, that every piece of information you get somehow updates your model. But that's not necessarily how we operate either, right? I mean, once I'm used to a particular environment and I've learned a lot of things from it, I kind of stop paying attention to it. I'm not picking up any new data points. It's when I go into a new environment that I pick up the cues and begin to learn new things about that environment, and I update my model of the world. So I don't know if you would call that continual learning or not, but certainly it's novelty, right? Being exposed to novelty is a way to improve the model. We all know that. It's about collecting new information that the model hasn't seen before. And you either have to bring the data to the model, or you have to take the model to the data, which is the real world.
Ravid Shwartz-Ziv: So how do you think it will work in the future, 10 years from now? Will the models in our cell phones record something of the environment once in a while and then update themselves? Or do we need very specific types of information from specific people in order to update them? Will you drive around, or will you hire 1,000 people who work with their phones everywhere?
Peyman: It's a good question. I don't really have a very good answer to it, except to say that when it comes to any kind of model interacting with the world, obviously there are a lot of practical questions that one has to worry about. This is obviously the case, as you see with self-driving cars. But when it comes to consumer devices, these are really difficult questions. They bring up questions of privacy and autonomy: what is your device doing? Have you authorized your device to do this? Do you really want this to be... Okay, let's take the particular example of a model updating itself. If that model is updating itself on your device, that may be okay, because it's private, it's sitting on your device, it's not being sent to the server. But if you're collecting stuff and it's being sent to a server and the model is getting updated there, that's not okay, because it's violating your privacy. So there are all sorts of... Again, part of the reason all of this is interesting is that when this kind of intelligence interacts with the real world, that's where a lot of interesting questions come up. I think we're still at the very beginning of that, so it's interesting to see where it will...
Ravid Shwartz-Ziv: So I want to ask you a bit about the recent progress with LLMs. It looks like everyone is really, let's say, hyped about LLMs, while image and vision are kind of set aside, and people are not as excited about them. Do you think there's a reason for this difference, or will we see something different in the future?
Peyman: Well, I mean, I wouldn't say people are not excited about it. Everybody that I talk to is excited about it. I think there's a slight difference between where the impact has already been realized as compared to where the impact is going to be realized over the next several years. Obviously, LLMs have made huge progress. There's a lot of great stuff already out there. They've made some great entries into commercial use, consumer use, and all of this other stuff. But yes, the visual aspect of intelligence is extremely important. It hasn't taken the lead position in artificial intelligence yet, but I think it's coming. It's the right set of things to be looking at next. I think part of the reason language has been so prominent is that it's a one-dimensional signal, right? Autoregression makes a lot of sense when you're looking at a sentence: you can go in one direction. But when you're looking at visual data, even in a single image, there's no sense of which direction you should be doing your inference in, and video is changing in time and so on and so forth. So it makes sense to me, from an engineer's perspective, that one-dimensional signals have made more progress, and that these higher-dimensional, more complex manifestations of intelligence, having to do with images and videos and perceptions of the world in 4D, are going to follow.
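[Editor's note: the point about one-dimensional signals is essentially the standard autoregressive factorization. For a token sequence read left to right, the chain rule gives

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),$$

and each factor is a natural next-token prediction problem. For an image the factorization is still valid for any fixed pixel ordering, such as a raster scan, but no ordering is canonical, which is the ambiguity Peyman is pointing at.]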
Ravid Shwartz-Ziv: It's so funny, we talked two weeks ago with Naomi Shapira from Boston University. She's a language person, and we asked her this question. She basically said, we don't need to care about images; language is the only thing that matters. I actually agree with you, by the way.
Peyman: Yeah, I think you can argue this many different ways, but again, I try to anchor myself in the human experience: how do we learn? You know, language is not the first thing we learn. We learn to see the world first. And we hear and we smell and we touch. Language comes much later. At least that's my understanding; maybe there's some other kind of language that comes very, very early. But if you look at the way we've developed AI in the last several years, we've not done it in that order. We've done it with language first, right? That's not how a kid learns. But I think it's been a matter of necessity. Where things have worked, we've pushed very hard, and where things are harder, we're making strides. I think those are things that are coming in the next few years.
Allen Roush: And what do you think about the integration of, you mentioned earlier, I think, the tech for 3D object scanning, like point clouds and Gaussian splatting and stuff? And I know that depth maps and normal maps have been used to help ground image-generating diffusion models. Do you see that as kind of a next frontier for getting even more accurate with images and video?
Peyman: Well, I think whenever you have the possibility to ground your models in physics and in physical reality, it's something you should do. The other aspect that I think is worth thinking about, and that a lot of people in my part of the world, where we care about solving inverse problems motivated by physics, think about, is that when you develop a model, you have to think about the process that generated the data. For instance, if you are in the business of taking a low-quality image that's been blurred and getting rid of the blur, it definitely helps to know something about how blurry images happen. There isn't one way; there are many, many different ways. The camera can be out of focus. The camera can make a sudden shift. Even your sensor can be dirty: put a little oil on it, rub your finger on your forehead and then rub it on the lens of your camera, and you'll get blurry images. So there are many different ways that a sharp image can physically become blurry, and understanding that from a physics point of view allows you to simulate all of these scenarios, or collect the right set of data, so that you can then solve the problem. So this goes to the point that understanding the physics allows you to curate the right set of data, to build the correct kind of simulation pipelines for creating synthetic data, if that's what you're going to do, and then to use those to solve the right problem, as opposed to just picking a very unrealistic input-output pair and expecting your model to take care of the rest.
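[Editor's note: a minimal sketch of the physics-motivated degradation simulation Peyman describes, generating (blurry, sharp) training pairs from several distinct physical causes of blur rather than a single unrealistic one. The kernel shapes, parameter ranges, and noise level are illustrative assumptions, not a production pipeline.]

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve

def defocus(img: np.ndarray, sigma: float) -> np.ndarray:
    """Out-of-focus optics, roughly approximated here by an isotropic Gaussian."""
    return gaussian_filter(img, sigma=sigma)

def motion_blur(img: np.ndarray, length: int) -> np.ndarray:
    """A sudden camera shift: a 1-D box kernel (horizontal, for simplicity)."""
    kernel = np.ones((1, length)) / length
    return convolve(img, kernel, mode="reflect")

def degrade(sharp: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample one physical cause of blur per training example, plus sensor noise."""
    if rng.random() < 0.5:
        blurry = defocus(sharp, sigma=rng.uniform(0.5, 3.0))
    else:
        blurry = motion_blur(sharp, length=int(rng.integers(3, 15)))
    return blurry + rng.normal(0.0, 0.01, sharp.shape)  # assumed read-noise level

rng = np.random.default_rng(0)
sharp = np.random.rand(128, 128)        # stand-in for a real sharp image
pair = (degrade(sharp, rng), sharp)     # one synthetic (input, target) pair
```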
Ravid Shwartz-Ziv: So one more personal question. You were a professor at a university, and now you've been at Google for a long time already. What do you prefer?
Peyman: Yeah, this is a great question. I've now been at Google for almost exactly the same length of time that I was a professor, so it's even, and I can give a good answer here because I've had just about the same amount of data on both sides. The way people mostly ask me this question is, do you miss being a professor? And the answer I give is: only in one respect, and that is being able to interact with graduate students, which I always enjoyed. But I get to do that now anyway, because we have many talented interns that come in. The other cool thing that's happened is that when I was a professor, when my PhD students would finish, and I'm sure you as a professor will have this experience if you haven't had it already, your graduate students become very, very productive in the last year or two that they're with you. And then, at that peak of their productivity, they leave. They graduate and they leave. And guess where they go? They come to me. So it's a great joy for me to be able to continue to work with people, but at a slightly different stage in their career. But I will say this, just very generally speaking, about the differences between working in an academic environment and working in industry: I've been very, very privileged to be able to work on some of the same problems in academia and in industry. I've had lifelong ambitions to take some of the ideas that were just theoretical 20 years ago, bring them all the way to production, put them on a device, and ship them to millions of people. So when you talk about preference, one really wouldn't have happened without the other. To me, they have really complemented each other, in my particular case, in a really beautiful way. The other thing that being in industry has taught me is the immense value of teamwork. In academia, as a professor, especially a more senior, well-established professor, you have your little fiefdom of students and postdocs and a few colleagues that you work with. But teamwork here, in the context of working on consumer products and so on and so forth, can mean several hundred people that you're interacting with, and the complexity of the projects can be enormous. It's a very, very different kind of joy to see all of that click together, to see people work together effectively to take something from zero to one and put it into production. And in my case, again, as I said, I've been very privileged to see it end to end, from theory to practice.
Ravid Shwartz-Ziv: But now, if you had to choose only one, writing a paper with 10,000 citations versus shipping a product to, I don't know, millions of people, what would you choose?
Peyman: I wouldn't choose. I would do both. This is something, I think, that's unique about Google in particular: we are an applied research team, which means that we don't do theory for the sake of theory. We do theory for the sake of understanding, for the sake of efficiency, and for the sake of doing things the right way without being wasteful. But we also, obviously, develop things for the practical world. And it's the combination of those two things, being able to do some of the theory and being able to see the effect of that theory in practice, that gives me the greatest joy. So yeah, I really think that's kind of a false choice. I've gotten too used to being able to do both of them to give you a straight answer, sorry.
Ravid Shwartz-Ziv: Now that's a great answer. Okay, we're almost out of time. Do you have anything to add, anything you want to talk about or promote?
Peyman: I think we've covered a lot of ground. I've really enjoyed the conversation. It's been a lot of fun; we talked about many, many different things. Thank you for inviting me. I really enjoyed it.
Ravid Shwartz-Ziv: Thank you, thank you so much for coming. And thank you to all of you.
Peyman: My pleasure.
Allen Roush: Yeah. Yeah, it was a pleasure to meet you and pick your brain.
Peyman: Likewise. Likewise. Thank you.