March 6, 2026

EP28: How to Control a Stochastic Agent with Stefano Soatto (VP AWS/ Pro. UCLA)


Stefano Soatto, VP for AI at AWS and Professor at UCLA, joins us to explore how the agentic era fundamentally redefines machine learning, from static train-and-test models to dynamic, interactive control systems. This shift underscores the growing need for intentionality and trust in AI-driven development.

Stefano sees LLMs as stochastic dynamical systems that need to be controlled, not just prompted. He introduces a framework called "Strands coding" that sits between vibe coding and spec coding - you write a skeleton of your program with AI functions constrained by pre- and post-conditions, so you verify intent before a single line of code is generated. The surprising part: even as AI coding adoption goes up, developer trust in the output is going down.

We go deep into the philosophy of models and the world. Stefano argues that the dichotomy between "language models" and "world models" doesn't really exist: a reasoning engine trained on rich enough data is a world model. He walks us through why naive realism is indefensible, how reverse diffusion was originally intended to show that models can't be identical to reality, and why that matters now.

We also discuss three types of information - Shannon, algorithmic, and conceptual - and why algorithmic information is the one that actually matters to agents. Synthetic data doesn't add Shannon information, but it adds algorithmic information, which is why it works. Intelligence isn't about scaling to Solomonoff's universal induction; it's about learning to solve new problems fast.


 

Takeaways:

  • Vibe coding is local feedback control with high cognitive load; spec coding is open-loop global control with silent failures; neither scales well alone.
  • Trust in AI-generated code is declining even as adoption rises.
  • The distinction between next-token prediction and world model is mostly nomenclature - reasoning engines operating on multimodal data are world models.
  • Algorithmic information, not Shannon information, is what matters in the agentic setting.
  • Intelligence isn't minimizing inference uncertainty - it's minimizing time to solve unforeseen tasks.
  • The intent gap between user and model cannot be fully automated or delegated.

Timeline

(00:13) Introduction and guest welcome

(01:12) How the agentic era changed machine learning

(06:11) Vibe coding one year later

(07:23) Vibe vs. spec vs. Strands coding

(14:30) Why English is not a programming language

(16:36) Constrained generation and agent choreography

(20:44) Diffusion models vs. autoregressive models

(25:59) The platonic representation hypothesis and naive realism

(31:14) Synthetic data and the information bottleneck

(36:22) Three types of information: Shannon, algorithmic, conceptual

(38:47) Scaling laws and Solomonoff induction

(42:14) World models and the Goethean vs. Marrian approach

(49:00) Encoding vs. generation and JEPA-style training

(55:50) Are language models already world models?

(59:13) Closing thoughts on trust, education, and responsibility.

 

Music:

  • "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • Changes: trimmed

About

The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Ravid Shwartz-Ziv: Hi and welcome back to The Information Bottleneck. Today we have a very special guest, Stefano Soatto. He's a professor at UCLA and VP for AI at AWS. Hi Stefano.

 

Stefano: Hello, Ravid. Hello, Allen.

 

Ravid Shwartz-Ziv: And as always, hi Allen, how are you?

 

Allen Roush: It's great. It's a pleasure to be here. And UCLA is a beautiful campus.

 

Stefano: Sure is.

 

Ravid Shwartz-Ziv: Yeah, that's true. So I think we'll go directly to today's topics. Stefano, you're working on agentic frameworks for AI, for LLMs, right? This is the topic that everyone has been trying to figure out over the last month, or the last year or so. How do you see these things? How do you see these frameworks, and why are they so successful?

 

Stefano: Yeah, so certainly the way I view machine learning has changed in the agentic era, which I date, at least for myself, as starting about 2022 with the advent of chain-of-thought prompting. It was clear that you can use these models to perform actions by generating APIs that trigger either physical actuators or other actions in digital space, and they can take measurements from sensors. So that's the realm of control systems, in particular stochastic control systems, because these models generate data stochastically. And so the way we view machine learning changes from the classical induction setting, where you have a data set you'll never see again and you use it to model a function, and then you use that function forever after on data you've never seen before, and you talk about generalization and regularization and Occam's razor. These things belong to the past, or to the pre-training phase of LLMs. But now we're in the phase where these models interact with the environment. They receive signals from the environment - the environment is everything but the agent, so it doesn't need to be the physical environment - and they can use these signals to update their memory. And it's not just the weights, which are a structural memory, but also the in-context memory as well as the in-storage memory. It's a very rich world. It has enormous potential, mostly positive, but also potential for risk. So we spend a lot of our time trying to understand how these models interact with the world, how to control them - literally, they are control systems - and how to make sure that they are responsive to customers' inquiries or to the task, meaning that they need to understand the query. So all of these questions have resurfaced, questions of philosophy that date back to Aristotle, if you want to go that far back, but at least to Hume and Russell and the like. So I'm open to any question, but essentially the way in which we operate as machine learning researchers has changed quite significantly in the agentic era, and the impact is being seen immediately.

 

Ravid Shwartz-Ziv: So let's start with why you think they are so successful. Why do you think this shift - from having data that you use for training and then testing on very separate data - why did this shift to agentic AI make these models so successful? Do you think it's that you're finally interacting with the world, or is it something more fundamental about the models themselves?

 

Stefano: I think there are two facts. One is that, just by force of scale, models have been getting better and better. So even if the interaction is limited to a chat, people have noticed how the quality of the interaction has improved. And then there is something quite powerful in creating an entity, which could be your agent, that acts on your behalf. It's extremely powerful. I still remember when the first program I wrote in school was a program to maintain a library - and we're talking about Pascal here, so that will date me. At some point an instruction would trigger a write onto disk, and so you saw an instruction trigger a physical action. That was very powerful. And so these models can interact with the world and can receive feedback or data from the world, which opens up possibilities that were barely visible with the chat interface. Although very early on, people were effectively playing the agentic arm of the LLM themselves. I remember one of my friends basically asked the LLM to design a business plan, and he would be the executor, the one who would go and buy ads and rent space. And so that created a symbiotic relationship where the model was the brain, the reasoning engine, and he was merely the executor. So I think that the possibilities are enormous, and I think that has triggered creativity at a level not experienced before, because it is very easy to do something that already "works," where I use works in quotes. Now, my job is making sure of the rest, because there is a vast schism between something that works in quotes and something that you can actually deploy in production, that customers can rely on, that is safe, that is confined in the sandbox, with all the security access protocols and payments and authentication and observability. All of these things - the remaining 10% of the work that takes 90% of the time to look after - that's what I spend most of my time on. But there are genuine ideas that have entered the play space. They're not new ideas - they are very old - but they've been revisited in a new key, which I'm extremely excited about.

 

Ravid Shwartz-Ziv: And do you think that - so around this time it was about one year since Andrej Karpathy's post about vibe coding, right? He just said it, and suddenly everyone was starting to vibe code, and you can do whatever you want with prompting. But then it was, yeah, you can build some toy models and do something very small and not very sophisticated. Today it looks like everyone is saying, or talking about, how we can finally give Claude Code just a very simple prompt and it will design and implement, you know, a very sophisticated system, and now we don't need programmers anymore and we can scale everything very easily. What do you think? You talked about this gap between very small problems and production-ready systems. Do you think this is something we can already do, or do you think it will come with better...

 

Stefano: Now we're getting into current events and news territory. I have quite a bit to say about that. I think that Andrej Karpathy was right in launching the provocation of English being the next programming language. Certainly, you can interact by vibe coding and create something that has the appearance of working. However, there are fundamental limitations to what you can do with vibe coding, which is one modality of writing code with AI today, as well as with spec coding, which is the other way to code today. I'll give you an example related to control systems. Again, a "language" model - and I put language in quotes because my notion of language does not refer to the natural language with which the model could have been trained, but rather the fixed inner language, which we call neuralese or mentalese, that emerges in trained models, which are trained with sequential data that has latent logical structure: images, video, song, whatever. A language is a collection of discrete entities, their relations, and operators, including composition and functions. So these models have an inner language. It's not the one you and I speak. Now, you can prompt or command or try to control these models with prompts in English. And because this model is a stochastic dynamical system - it is literally a first-order random walk that has a drift, the pre-trained backbone, and after that it's a Brownian motion with a sigma that increases with t - the only way you can keep it on track and take it to the final destination, which is inside your head, is: you give a prompt, the model realizes some piece of code, and if that's not what you want, you correct the prompt and iterate. And that's like trying to control a robot to go from LA to New York by getting in the back of the cab and saying, okay, now you turn left, you turn right, you get on the freeway and exit there. By the time you get to Las Vegas, you're exhausted and you cannot go anymore. So vibe coding is local feedback control where you are the controller. Very high cognitive load on the user. Fantastic if you want to do very simple projects, but for a very complex project, it becomes very onerous. That's why. Now, spec coding is where you provide to the model not an interactive set of instructions, but rather a specification of all the characteristics that the program should have. Also in English, but that could include tests that the model has to generate and pass in order to move on, and so on and so forth. And that's open-loop global control. It's like giving somebody a detailed list of instructions on how to go from LA to New York, but it's very, very difficult to be precise, because you don't know what will happen along the way. For example, say I were to give my son the list, and at some point it says, when you get to "Welcome to Las Vegas," turn right - except that in San Bernardino there is a Welcome to Las Vegas motel. So when he calls me and I think he's near New York, except that he's in Atlanta, how am I going to backtrack to where he went wrong? So these are the major problems with this style of coding: model deception, silent failures, and tests that are tautological because the model generates its own tests after it generated the code, so it's very easy to generate a test that automatically passes. These are the fundamental problems of this style of coding.
And by the way, if you look at the statistics, you see that adoption of AI coding models by professionals has been steadily going up. At the same time, trust in the outcome has been going down. It was around 70% in 2024, and it went down to 60% in 2025, even though the volume of adoption has gone up significantly. Why is that? Because when models become more powerful, people use them for more and more complex things. But at that point, short of reviewing every line of code that the model has generated, it's very difficult to arrive at a point where you can trust the outcome. Now you see a lot of narrative these days that says, my God, I used this or that model and it generated everything - I don't even look at the code, I don't even need to look at the code. That's really problematic. That's not right, okay? So let's go back to the analogy of control systems. You're trying to control a stochastic dynamical system. What is the right and accepted way of doing so? It is the so-called two-level controller: global open-loop planning, local closed-loop control. If you want to go from here to Las Vegas or to New York, you would give a series of waypoints - say, go in this direction - which, if followed to within a certain bound, will guarantee that you will get to New York. But typically the path is not exactly feasible, because there may be a truck parked in front, or it may require me to go through a house. So I can veer from the path, but I need a mechanism, a feedback mechanism, to keep me local, to keep me close to the plan. So it's very fortunate that you mentioned that, because in fact we have just launched what we call Strands coding, which stands in contrast with vibe and spec coding, and which is exactly the embodiment of this control strategy - how stochastic dynamical systems are controlled - but in the form of a language. Now, this language separates out the type of uncertainty that is due to the model understanding your prompt, the task, which we call the intent gap. You know what you want; it's inside your head. The model doesn't see inside your head, it only sees what you express, but what you express is ambiguous and can be interpreted in a number of different ways. You are the only one who can determine whether what the model does fits your intent. That cannot be delegated. That cannot be automated. What one can do is move that decision point as early as possible in the interaction, and to as high a level as possible. So you shouldn't have to verify that the model has done what you want after the last line of code has been generated. Instead, you should be able to verify it formally before any line of code has been generated. Then, from that point on, the language - which is basically an ordinary programming language augmented with AI functions - has confined stochasticity: the stochasticity of AI code generation is confined within the scope of what we call AI functions. That ensures that even though the model can implement these functions in however many ways it wants, so long as it passes the pre- and post-conditions, which you can verify a priori, it will take you to the destination. That to me is very important, because trust is the number one obstacle to adoption, and it is the reason we are talking about these models in the news. The original understanding of the intent can only be verified by you, and that cannot be delegated, but the rest can be controlled the way we have been doing control of stochastic systems, from power plants to bicycles. So it's the same problem.
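
A toy numeric version of the two-level controller described above, under assumptions of my own choosing (a 2D "vehicle", a straight-line waypoint plan, a simple proportional feedback gain); it is illustrative only, not how Strands coding is implemented:

```python
# Global open-loop plan (straight-line waypoints from A to B) plus a local
# closed-loop proportional correction that keeps the noisy "vehicle" near the
# plan. Pure open-loop would drift with the disturbances; pure step-by-step
# correction with no plan is the vibe-coding analogue.
import numpy as np

rng = np.random.default_rng(3)
A, B, steps = np.array([0.0, 0.0]), np.array([10.0, 5.0]), 200

plan = np.linspace(A, B, steps)        # global, open-loop plan
pos = A.copy()
gain = 0.5                             # local feedback gain
max_err = 0.0
for k in range(steps):
    disturbance = 0.1 * rng.standard_normal(2)
    pos = pos + gain * (plan[k] - pos) + disturbance   # closed-loop correction
    max_err = max(max_err, np.linalg.norm(pos - plan[k]))

print(f"worst deviation from the plan: {max_err:.2f}")  # stays bounded
```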
When you ride a bicycle from A to B, you know: if you plan a straight line from A to B, you cannot follow it, because if you tried to stay exactly on it you would fall. So you typically have to trade off uncertainty in the local control against a bound around the goal you're heading to. And so this framework for coding, which will be exposed in open source through the Strands Agents library, will allow people to do AI coding. Not in English, because English is not a programming language; it's lacking the aspects that are the reason we have programming languages. They're formal. They allow you to...

 

Ravid Shwartz-Ziv: What are the problems with English? Why is it not a good programming language?

 

Stefano: So the problem with English is that... if you are an extremely skilled programmer who has extremely refined experience with a particular model, you know how to frame the prompt so that precisely when the model gets to that point, it will interpret it in a certain way. So in a sense, you are reverse-engineering the model in your head, which again is a very high cognitive load. You can do it for a relatively simple project; for a very complex project, it's very difficult. Now, models are getting better and better, okay? So what they can do without intervention, in open loop, keeps getting longer and longer. The problem is that you cannot delegate the verification of the intent gap, because the model is not inside your head. It cannot determine that this is exactly what you want to do. So that remains with you. And so the goal of this framework, which is in Strands Agents, is to separate out the uncertainty due to the intent gap, which you have to deal with in whatever way you want - just like when you communicate with somebody: you phrase something, the person repeats it in their own words, and you go back and forth until both are satisfied that they have understood each other, even though that can be falsified but never verified. But the rest becomes automatic and guaranteed to converge to something that fulfills the original intent. It may be a bit vague. Sure, go ahead.

 

Allen Roush: And so I, sorry, I didn't mean to interrupt, please finish.

 

Stefano: No, no, what I mean is that it sounds a bit abstract, but people will get a chance to see it in the Strands Agents library, where they can download AI functions and do what we call skeleton coding. It's just like coding with pseudo-code with placeholders, except that instead of pseudo-code it is actual code, but instead of the body of the function being, let's say, Python code, it's English. Except that what the model generates inside the scope of the function is constrained by what you specify as the preconditions and postconditions that determine the acceptable behavior of the model. And so your interaction with the model is at the level of the conditions, not at the level of the code. So you only verify the conditions. You don't need to verify the code.

 

Allen Roush: That's a good logical place for me to ask a question here. I'm personally obsessed with constrained and structured generation as techniques. And it still sounds to me like they're a little bit different in some ways, right? The analogy for what I'm talking about is, if you're trying to generate a set of numbers - like a classifier detecting if, I don't know, a recipe is good or bad tasting - if the model is unconstrained at the log-prob level, then it might generate T-E-N, those letters, but you can force it from the outside so that anything other than the integers 0 through 10 is banned. And when I think about how we do - I've called it, and maybe others call it, agent choreography, which is explicit, direct, guaranteed control - I guess the analogy for what you've talked about would be like, hey, when you drive to Los Angeles, you can't go off the road. Going off the road, not being on pavement, is a failure point. How is what you're talking about, what this framework is doing, different from this very explicit, Pydantic-style constraint control?
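
To make the logit-level constraint Allen describes concrete, here is a minimal sketch with a toy vocabulary; it is not the API of any particular structured-generation library, just the underlying masking idea:

```python
# Ban every vocabulary entry except the allowed strings "0".."10" at the
# log-prob level, so the model cannot emit "TEN" no matter how much it prefers it.
import numpy as np

vocab = ["TEN", "ten", "great", "0", "1", "2", "3", "4",
         "5", "6", "7", "8", "9", "10"]
allowed = {str(i) for i in range(11)}          # the only acceptable outputs

def constrained_sample(logits: np.ndarray) -> str:
    """Sample a token after masking everything outside `allowed`."""
    masked = logits.copy()
    for i, tok in enumerate(vocab):
        if tok not in allowed:
            masked[i] = -np.inf                # hard ban at the logit level
    probs = np.exp(masked - masked.max())      # softmax over the surviving tokens
    probs /= probs.sum()
    return str(np.random.choice(vocab, p=probs))

logits = np.random.randn(len(vocab))
logits[vocab.index("TEN")] += 10.0             # the model strongly prefers "TEN"
print(constrained_sample(logits))              # still one of "0".."10"
```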

 

Stefano: Yeah. So it depends on how you frame the constraints, because I could state the constraints in English as a prompt. That's what we call model begging: please give an answer as a number from one to ten. And most of the time it works, but you cannot guarantee a priori that it will. The other way is where your statement of the problem is not written in English, but is encoded. Almost everybody has taken at least one coding course and can write the skeleton of a program, even if they don't care about the syntax - do I need a semicolon here, and whatnot. But the conditions that need to be verified in order for the outcome to be acceptable are specified the way a function in Python has types; instead of types you have preconditions and postconditions. These could be algebraic expressions, functions in Python, or themselves AI functions. And the level of granularity of these conditions is up to you to decide, because I could put zero pre- and post-conditions, and then it's exactly the same as vibe coding; or a condition could actually require acceptance by a human, so now you are in the loop; or there could be no conditions beyond whatever the initial specification said, and then you're open loop. But you can make them arbitrarily refined, to the point where they identify uniquely a piece of code, or an equivalence class of expressions of code that have the same input-output behavior. Let me give an example. Let's say you want to code a calculator app. You want your model to write a piece of code that implements the calculator. Obviously, it has to satisfy all the rules of arithmetic, operator precedence, and all these things. So those would be hard conditions. And the advantage compared to model begging, or prompt-and-pray, is that the model generates within the scope of a function and doesn't get out of that scope until the conditions are satisfied. Now, it's possible that the model may not be able to satisfy the conditions, either because it's not good enough to generate a solid piece of code, or maybe because the conditions are not satisfiable, in which case you get an exception that the model failed at this precise post-condition. So you don't have these silent failures, which are one of the banes of these models. So in the case that you mentioned, if you are an experienced programmer, you can bypass the vibe phase and directly write your code in Python, but augmented with these AI functions, which are functions in Python with the body written in English and with pre- and post-conditions. If you're not an experienced programmer, you can do vibe coding or spec coding however you want to, except that instead of that phase leading to the entire code base being written, you stop once you have a formal skeleton, which has AI functions with pre- and post-conditions, which you can verify to fit your intent before a single line of code has been written by AI - while knowing that by the time the last line of code is written, it does what it's supposed to.
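
A hypothetical Python sketch of the skeleton-coding idea as described here; the `ai_function` decorator and the `generate_impl` stub are invented for illustration and are not the actual Strands Agents API:

```python
from typing import Callable

def generate_impl(english_body: str) -> Callable:
    """Stand-in for the LLM call that turns an English body into code.
    Here it just returns a fixed candidate so the sketch runs end to end."""
    return lambda expr: eval(expr)              # toy only; a real system would sandbox this

def ai_function(pre: Callable, post: Callable, max_attempts: int = 3):
    """Confine AI-generated code inside pre/post-conditions verified up front."""
    def wrap(spec: Callable):
        def impl(*args):
            assert pre(*args), "precondition violated by the caller"
            for _ in range(max_attempts):
                candidate = generate_impl(spec.__doc__)   # model writes the body
                result = candidate(*args)
                if post(*args, result):                   # accept only if the post-condition holds
                    return result
            # no silent failure: the caller learns exactly which condition failed
            raise RuntimeError(f"model failed the post-condition of {spec.__name__}")
        return impl
    return wrap

@ai_function(
    pre=lambda expr: isinstance(expr, str) and len(expr) > 0,
    post=lambda expr, out: out == eval(expr),   # toy oracle; real post-conditions
)                                               # would be independent checks or tests
def calculate(expr: str):
    """Evaluate an arithmetic expression, respecting operator precedence."""

print(calculate("2 + 3 * 4"))                   # 14; a failed post-condition raises instead
```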

 

Allen Roush: And then, a follow-up question here. We've seen, of course - first there were diffusion models, with Stable Diffusion getting pretty good; that was a mini ChatGPT moment before ChatGPT. And then ChatGPT, which was autoregressive. But now we have diffusion language models, which can generate the output text in any order, and we have autoregressive image models. How does this distinction - autoregressive, left to right, versus diffusion, in any order you want - interact with what you've been saying?

 

Stefano: So again, I have to use my control theory background, because I think that shines a light on the relation between these models. These are dynamical systems that have a state. A state is a function of past data - call it a memory if you wish - given which whatever inference or prediction you do is independent of past data. The state is the information bottleneck between the past and the future, as Ravid would say, or as our friend Naftali Tishby would say, literally. So the state is a Markovian splitting subspace: given the state, you can throw away all the data. But the genesis of diffusion models is very interesting, actually, because the reverse diffusion equation - which is incorrectly attributed to Feller in 1949 in the original paper, and then was attributed to B.D.O. Anderson in a 1982 paper in Automatica - is actually due to Lindquist and Picci in a 1979 paper, where they were trying to settle a philosophical dispute with the late Jan Willems, who was arguing with them about the difference between a model and the system. The system being the true world, whatever it is, that generates the measurements; a model being any realization that can reproduce - or hallucinate, we would say today - data that is different from the data generated by the true system, but indistinguishable up to an uncertainty residual. In order to make that point, they followed this argument. Take a finite-covariance time series that obeys a certain forward diffusion operator, x_{t+1} = A x_t plus some noise. That determines the forward path, where time flows forward. Now, if we let time flow backwards, that would not be a physically plausible realization, because time only flows in one direction in the physical world. But in the model, we can make time flow in one direction or the other. And the fact that the reverse diffusion has the same exact structure as the forward diffusion proves that the model cannot be the same thing as the system. And this is a point against the notion that in epistemology is called naive realism, which is the belief that there is a true world and that our models, as they get better and better, all approach that true world. That position is indefensible and false. Now this idea of reverse diffusion, which paradoxically was meant to be really the stupidest system that can prove that time reversal is only possible in the model, not in the system, is now being co-opted to generate images drawn from a distribution by just push-forward mapping some known density, say a Gaussian, onto that distribution. But there, the notion of time is completely fictitious. There is no causal ordering as you go back and forth. You just play with time as a scale variable - in fact, you can do diffusion models over scale. Whereas autoregressive models are actually state-based models; they are a particular class of state-based models that are called nilpotent, meaning that their dynamics, unless you feed them something, decay to zero after n steps, n being the size of the context. In that case, the memory is a sliding window of the data. Now you have state-based models and hybrids that are a superset of those, including both Mamba-type models and transformers. But these have the Markov property: the state, the memory of the system, separates the past from the future. So they're useful as an inductive object.
Now, in an agentic setting, we're not so interested in approximating distributions, which is what you do with a diffusion model when you want to generate images or snippets of text in one go. We're interested more in generating actions, where actions then produce a return. So there is a true direction of time and a true causal ordering there. And so we mostly work with state-space models, meaning models that have a state - where I include transformers; even the dense, vanilla transformer is a particular class of state-space model that happens to be nilpotent. That's what we work with, because there is a true direction of time. Now, you can run a diffusion model on a sliding window, in which case you will have a delay before you can generate the first token, because you have to wait, but then you can generate them all in one go. And in certain applications that could make sense - certainly in computational biology there's a reason for that; there's no time index there. But in the agentic framework, we don't use those as much.
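
A toy scalar analogue of the reverse-diffusion point (not the actual Lindquist and Picci construction): a stationary first-order process x_{t+1} = a x_t + noise, regressed forwards and backwards in time, yields the same structure and the same coefficient, so time reversal is unproblematic in the model even though it is meaningless in the physical system:

```python
# Simulate a stationary first-order process and fit its coefficient both
# forwards (x[t+1] on x[t]) and backwards (x[t] on x[t+1]): both regressions
# recover a ≈ 0.9, i.e. the reverse-time recursion has the same structure.
import numpy as np

rng = np.random.default_rng(0)
a, q, T = 0.9, 1.0, 200_000
x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = a * x[t] + np.sqrt(q) * rng.standard_normal()

def ar1_coeff(prev, nxt):
    """Least-squares estimate of c in nxt ≈ c * prev."""
    return float(prev @ nxt / (prev @ prev))

forward = ar1_coeff(x[:-1], x[1:])    # time flowing forward
backward = ar1_coeff(x[1:], x[:-1])   # time flowing backward
print(forward, backward)              # both ≈ 0.9
```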

 

Allen Roush: And one quick follow-up too. I agree that naive realism is, in my mind, indefensible, but then we have the platonic representation hypothesis, right? I've heard it stated multiple ways, but I think both are currently seen as true. One is that as models get bigger and are trained on more data, their representations become extremely alignable - so even if they're not identical, it's extremely easy to convert one to the other. And the second is, train a model on bad code and now it generates hate speech, and vice versa. So I worry that, as much as we dislike it, I'm seeing evidence for things that are at least kind of platonic, you know, that there is a real truth. What's your take on this?

 

Stefano: Okay, so now we're getting into philosophy. I like a quote by Bertrand Russell, which is reported in the 1949 piece by Einstein, that says we are all born naive realists. Naive realism is the belief that things are the way they appear. So naive realism, if true, leads to physics, but physics, if true, leads to proving that naive realism is false. Therefore naive realism, if true, is false; therefore it's false. The point being that you observe data, a finite amount of data, no matter how big. You process it with a finite amount of memory, a finite amount of compute, a finite amount of time. There are infinitely many data-generation mechanisms that are compatible with that. So even if there was one that is the true one, you would have no way of knowing it, and the one I assume to be true and the one you assume to be true would be different. We have no way of aligning that. In naive realism, which is how we are trained as engineers, we say, well, if you and I observe data from the same phenomenon, the data may be different, but the phenomenon is the same. And so the phenomenon - the true phenomenon, the true distribution if you are doing inductive machine learning, or the true world if you're a physicist - is the link between what you observe and what I observe. Now, in reality, that thing has much higher complexity than any collection of measurements, so it doesn't provide a link between my representation of the world and yours. And I insist on this word: I say representation, but I actually like Jan Koenderink's nomenclature better, because he talks about presentations, not representations. Presentations go from your brain to the data, not the other way around, because you can hallucinate data. And in the process of controlling the hallucination, which has been referred to as the process of perception - the process of perception is to use your internal presentation (the correct term in system-theoretic language is realization; a realization is your internal model that you can use to make real, to generate, to hallucinate data) and then, comparing that to data you actually measure, you can refine your representation, your realization, until you have alignment, which is a work in progress. Now, I hear what you're saying in terms of the risk of Babel, because if the world is infinitely more complex than any of our observations, then I converge to something, you converge to something else. How do we even relate to each other? How do we communicate? Language is often ascribed as being engendered by the need to communicate. But in fact, language is the language of computation. These models are computers in the real sense - they are universal computers. So it's the language of computation and planning, and if I have it and you have it, that's where we can relate. Let me give an example. Let's say that the world is a triangle, just for the sake of simplicity. And you observe a triangle, and I observe a triangle. Now...

 

Allen Roush: Okay.

 

Stefano: I have a certain reference frame, and a triangle is six numbers in this reference frame, the coordinates of the three vertices, right? You observe it in your reference frame, and it's completely different numbers. And you and I have no way of relating these triangles. But if you have in your brain the ability to operate on these triangles - you can change the coordinates, you can change the reference frame, you can generate the entire trajectory - and I can do the same independently, because I also have the ability to perform computation in my brain, then these orbits, these equivalence classes, are what is unique. There's only one for each triangle. And so if you and I can each independently compute, then we can register the language, because we can each independently change the coordinates of our triangle until we have the same exact numbers. And that will happen if and only if these triangles are in the same equivalence class. So there is a way to talk about alignment, to talk about uniqueness of the underlying hypothesis, or at least an equivalence class of hypotheses, even without assuming that there is a true distribution out there that somehow we're all supposed to converge to - because that requires an assumption of a prior, or a model, or a hypothesis, that you cannot verify a priori, and there is no criterion: your measure of complexity is different than mine, and who's right? But there is a way of doing it in a transductive setting, using computation as the vehicle.
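
A toy version of the triangle example; the canonical representative chosen here (sorted side lengths) is just one invariant both observers can compute independently, standing in for the orbit under changes of reference frame:

```python
# Two observers record the "same" triangle in different reference frames
# (six numbers each). The raw coordinates never match, but each observer can
# independently compute a canonical description of the orbit under rigid
# motions, and those descriptions agree exactly.
import numpy as np

def canonical(tri: np.ndarray) -> np.ndarray:
    """Sorted side lengths: invariant under rotation, translation, reflection."""
    d = [np.linalg.norm(tri[i] - tri[j]) for i, j in [(0, 1), (1, 2), (2, 0)]]
    return np.sort(d)

tri_mine = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])   # my reference frame

theta = 1.1                                                  # your frame: rotated and shifted
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
tri_yours = tri_mine @ R.T + np.array([10.0, -2.0])

print(np.allclose(canonical(tri_mine), canonical(tri_yours)))  # True: same orbit
```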

 

Ravid Shwartz-Ziv: I want to take you to synthetic data a bit. We talked about this in previous episodes too - it looks like it has become very important in recent LLM training. Do you think that in the end all the data will be synthetic? And how is it affected by agents - do agents need less high-quality data? Is there something very specific about synthetic data that makes it so unique?

 

Stefano: Very interesting question. Obviously we've been enamored with synthetic data for a while; it's not a recent concern. I'll give you a concrete example. In 2018, we launched Textract, which is a document analysis service. If you want to train a model to, for instance, read tax forms or documents that are sensitive, it's very difficult to obtain real documents. And so the first version of the service that we launched was entirely trained on synthetic data - not a single real document. So that is possible. However, on the other hand, if you reason in terms of the information bottleneck, there's a data processing inequality that says: if I have a data set that I used to train a model, and I take that model and use it to generate synthetic data, that synthetic data contains not one bit of information more than the data set the model was trained on. And so if we then train a model with a synthetically generated data set, the sigma-algebra it spans is exactly the same as the one we started with, so we're not gaining anything. This bothered me for a while, because in a sense language models, and certainly RL training environments, seem to defy the data processing inequality, right? You do get improvement with synthetically generated data - how is that possible if it doesn't expand the sigma-algebra? The reason is that there are two different notions of information here. One is statistical information. And when you talk about information, by the way, it's a can of worms, because people have their own ideas, and if you don't make things precise, it's misleading. When people talk about information, there are always two distinct entities. One is the carrier of information, the object that carries the information. The other is the number, the amount of information that it carries. So there are three types of information we work with. One is statistical information, Shannon. In Shannon's theory, the carrier of information is a random variable, which is a measurable function. Information is not a property of the data - data has zero information - it's a property of the distribution from which the data could have been typically sampled. The distribution does not exist as an objective entity; it's our construction, as de Finetti would say. And the amount of information is the entropy of that distribution. It's not a property of the data. So the carriers are measurable functions, and entropy is the measure. Then there is algorithmic information, where the carrier is a program, a computable function, and the amount is the length of the shortest program that, run through a Turing machine, generates your data. Very different concept. And then there is conceptual information, which, instead of having measurable functions or computable functions as the carrier, has conceivable functions as the carrier - functions that carry concepts - and these are LLMs. The preimage, under a trained backbone, of any vector in the space of logits is an equivalence class of sentences that constitutes a meaning. A concept, or a conceit, or a meaning is an equivalence class of data that maps to the same abstract vector. When you do simulation, you do not add any Shannon information to your data set, but you do add algorithmic information. Because if you view it from a Shannon, statistical point of view, the whole body of mathematics contains zero information, right?
There's nothing that is not deducible, with zero entropy, from the axioms - almost every statement is deducible. So the data processing inequality would tell you there's nothing to learn in math. But of course that's not true. And in fact, what you do learn is algorithmic information. And algorithmic information is what is critical to the agentic setting. This has been the revolution that I described: from the inductive side of machine learning, which had been our main concern up to 2022, to the transductive setting, which has been our concern ever after. Induction is where you take a specific data set and you train a model that works for general queries - from specific to general. Transduction is where, at inference time, for this specific query, you gather all the data and all the information that you need in order to provide a specific answer to this specific query - from specific to specific. And that's the domain of reasoning. Deduction is a special case of transduction. And so in that case algorithmic information is key, because that is where the value of experience comes in. It's a fascinating story that has roots in three papers: 1964, Solomonoff,

 

Ravid Shwartz-Ziv: And.

 

Stefano: 1972, Levin, and again 1985, Solomonoff. Very fascinating, and quite at the center of everything we're doing today.
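
A crude numerical contrast between the two notions, using zlib length only as a rough, computable stand-in for algorithmic (Kolmogorov) complexity, which is uncomputable: two strings with identical first-order Shannon statistics can carry wildly different amounts of algorithmic information:

```python
# Both strings have the same empirical symbol distribution (half '0', half
# '1'), so their first-order Shannon entropy is identical, yet one is the
# output of a tiny program and compresses to almost nothing.
import zlib
import random
from collections import Counter
from math import log2

random.seed(0)
n = 100_000
structured = "01" * (n // 2)                               # tiny program, long output
noisy = "".join(random.choice("01") for _ in range(n))     # incompressible-looking

def empirical_entropy(s: str) -> float:
    counts = Counter(s)
    return -sum(c / len(s) * log2(c / len(s)) for c in counts.values())

for name, s in [("structured", structured), ("noisy", noisy)]:
    print(name,
          f"entropy ≈ {empirical_entropy(s):.3f} bits/symbol,",
          f"zlib bytes = {len(zlib.compress(s.encode()))}")
```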

 

Ravid Shwartz-Ziv: And do you think there is a connection between these two, three concepts of information? Are they related to each other?

 

Stefano: They are very related to each other, in the sense that the whole of algorithmic information theory establishes a connection between program-length complexity and probability. Because if you use a certain probability to encode, let's say, the set of all possible programs, then the likelihood of sampling a program given a task is related to its length when encoded according to that probability - that's Solomonoff's semi-measure. So the length is equivalent to the likelihood of sampling that program from that prior. So there is a very strong connection between the two. However, it's how we use them that is different. Let me give an example. In inductive machine learning, the goal is to try to minimize uncertainty on future data, on the outcome of future inference. You give me a query I've never seen before, I give you an answer; I don't know if the answer is correct or not, there's no oracle, but I want to minimize the uncertainty on the outcome of inference. In a transductive setting, there is no such uncertainty - I can always reduce the uncertainty however much I want, because I can come up with an answer and run a simulation, or invoke a verifier, or pay an annotator to come in. So I can always reduce the uncertainty as much as I want. The issue is that I will have to pay for it with time. So in 1964, Solomonoff published a paper that described an algorithm for doing universally optimal inference - universal with respect to any computable distribution. Basically, no matter what algorithm you have, there is no algorithm that is better over the entirety of possible tasks than his, which is: line up all the programs in increasing order of length, execute them through a Turing machine, pick the ones that generate the data you've seen, and average their predictions. And that's optimal. There's no intelligence, there's no insight, there's nothing. And by the way, that is the limit of scaling laws. If you take the scaling laws to the limit, you get Solomonoff induction, which is this completely stupid algorithm. No intelligence, no learning. You ask the same question twice, it will repeat the same calculation. You ask it to do two plus two, it will do exactly the same thing that it does if you ask it to solve the Riemann hypothesis. I don't know what the definition of intelligence is, but it doesn't matter what it is - I don't think that this fits the mold.
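
A toy caricature of the enumeration Stefano describes, over a deliberately tiny program class (every "program" here is just a bit pattern repeated forever, standing in for the set of all computable programs):

```python
# Enumerate programs in order of length, discard those contradicted by the
# observed data, and average their next-symbol predictions with weight
# 2^-length. There is no learning and no insight, only exhaustive enumeration.
from itertools import product

def run(program: str, n: int) -> str:
    """'Execute' a program: emit its pattern repeated out to n symbols."""
    return (program * (n // len(program) + 1))[:n]

def solomonoff_next(observed: str, max_len: int = 8) -> float:
    weight_one, total = 0.0, 0.0
    for L in range(1, max_len + 1):
        for bits in product("01", repeat=L):
            p = "".join(bits)
            if run(p, len(observed)) != observed:
                continue                      # contradicted by the data
            w = 2.0 ** (-L)                   # shorter programs get more weight
            total += w
            if run(p, len(observed) + 1)[-1] == "1":
                weight_one += w
    return weight_one / total                 # posterior probability the next bit is 1

print(solomonoff_next("010101"))              # ≈ 0.04: the short program "01"
                                              # dominates and predicts a 0 next
```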

 

Ravid Shwartz-Ziv: But this is very different. This is very different from what people are talking about when they talk about scaling laws these days, right?

 

Stefano: Scaling laws for training and inference are the same. Essentially, you have a certain amount of complexity - memory, time, and compute - that you put into the model, and yes, empirically you see these nice straight lines. The mistake we made is that we take these lines, which describe performance on downstream tasks for inference scaling, or log-perplexity for training, as a proxy of intelligence. Proficiency on downstream tasks as a proxy of intelligence. But in fact, Solomonoff induction has the best possible proficiency with zero intelligence. So obviously these curves do not measure intelligence, okay? Because in the limit there is none, let alone superintelligence. So then what is intelligence? I don't want to define intelligence, because I think it's very difficult to agree on a definition, but it's very easy to agree on a definition of stupidity, because there are many traits, any of which, if present, would deny any reasonable definition of intelligence, right? For example, if I ask you the same question three times in a row and every time you repeat the same calculation and give me no better answer, that's not very intelligent behavior - but that's what Solomonoff induction does. So Levin, in 1972, in the same paper that introduced the notion of NP-completeness - it's a three-page paper in Russian - introduced the notion of a not only universally optimal but universally fast algorithm, known as Levin search. It involves no learning. You cannot beat it universally, but there's no learning. The problem is that there is a constant which is astronomical. And then the same Solomonoff, in 1985, commented on Levin's paper and pointed out that that constant is related to two to the minus L, where L is the length of the optimal bespoke solver for the task. Take any task: Levin search gives you the solution in time no greater than the time taken by the optimal bespoke solver for that specific task, up to a constant that is exponential in that length. So this is the role of learning. It's not to decrease uncertainty on the outcome of inference, because you can make that as small as you want; it's to decrease the time it takes to find solutions to unforeseen tasks. And in order to do that, you have to have learned algorithmic components that you can recompose to find solutions to new tasks. Like when you play chess: there is an algorithm that will give you the optimal next move, but it would take forever, and that's not how you play chess. How you learn to play chess is that you have seen certain moves, certain boards, and you roughly know that this is a good move. And there is a theorem that says that the speed-up in the amount of time it takes for you to find a solution to an unforeseen task equals the algorithmic mutual information between the optimal solution and previously seen data. And that tells you not only that you learn by reducing the time it takes to find solutions to unforeseen tasks, but that you learn only by finding the shortest solutions to unforeseen tasks. It's an if and only if - it's not a bound, it's an equality. So algorithmic information is central to transductive inference, which is central to the agentic setting in AI. And that is a new revelation for me, even though all the results are old - they're from 1964, '72, and '85.
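
And a matching caricature of Levin search under the same toy program class; only the 2^(-length) time-budget schedule is the real ingredient, everything else is simplified:

```python
# Dovetail over all programs, giving a program of length L a slice of compute
# proportional to 2^-L in each phase, and stop at the first program the
# verifier accepts. "Time" here is the number of symbols a program may emit;
# the astronomical constant hides in how many phases are needed before the
# right program finally gets enough time.
from itertools import product

def emit(program: str, steps: int) -> str:
    """Run a program for `steps` steps: emit its pattern repeated."""
    return (program * (steps // len(program) + 1))[:steps]

def levin_search(target: str, max_phase: int = 20):
    for phase in range(1, max_phase + 1):
        budget = 2 ** phase                      # total steps available this phase
        for L in range(1, phase + 1):
            steps = budget >> L                  # a length-L program gets 2^(phase-L) steps
            if steps < len(target):
                continue                         # not enough time to finish yet
            for bits in product("01", repeat=L):
                p = "".join(bits)
                if emit(p, len(target)) == target:
                    return p, phase              # first verified solution wins
    return None

print(levin_search("001001001001"))              # finds the short program "001"
```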

 

Ravid Shwartz-Ziv: Okay, I think it's a good time - maybe let's talk a bit about world models. Yann was here a few weeks ago, and also Randall, who is doing a lot of JEPA-style models. How do you see it? Do we need world models, or with the current paradigm do we... Okay, yeah, maybe we already have them. Okay. Tell us, do you think current agentic frameworks are world models?

 

Stefano: We do have world models - hold on. Okay. Yeah, so I have to preface this: as a researcher, when the evidence changes, my opinion changes. I used to be a Gibsonian. What does that mean? It means that I used to believe that there's a fundamental difference between learning by observing data passively and learning by interacting...

 

Ravid Shwartz-Ziv: Tell us maybe what does it mean?

 

Stefano: with the environment - and specifically, interacting with the physical environment is yet another level. So Gibson, between the '50s and the '60s, arrived at this notion that you look in order to see and you see in order to look, meaning that the sensing-action cycle is essential to develop a correct representation of the world. And that's assuming that there is an objective world - which in reality, whatever it is, is different from the way you represent it and the way I do. So that's a squishy notion, and it goes back to naive realism. But I used to believe that there is a fundamental difference between these two, and that you're not going to address or understand intelligence using a brain in a jar. And I used to view language models before transformers, when we were doing kind of old-style NLP, as essentially brains in a jar: you feed them text and so on and so forth. My opinion changed, quite radically, to the point where I can disavow things that I've been teaching for 20 years. I would have to apologize to students if I misled them, but that was the dominant thinking at the time. Let me give you an example, okay? Let's say that you teach a computer vision course, and you teach, say, 3D reconstruction. You have a bunch of 2D images, there is a 3D world out there, and your goal is to infer from these images a model of the world: the geometry, the photometry, the dynamics, the semantics, and so on. That's reasonable. In order to do that, you establish relationships between different data of the same scene. And there has to be a relation, because these are different data of the same scene - kind of like the triangle example we were giving before; there must be a relation. And you assume that the relation is in the scene. But in fact, that relation is in your head, in the sense that if you, for example, assume that the scene is planar, Lambertian, and fronto-parallel, then you can find the relation between corresponding points in the images through an algebraic relation, which is a homography. Very simple. Now, of course, you don't know whether the scene is planar, fronto-parallel, and Lambertian, and you cannot infer it from the data, because that's your assumption. It's a belief - an unverifiable belief, okay? But if that belief is satisfied, then you are able to fit a homography to your data with low residual. If not, well, you can graduate and go to, let's say, epipolar geometry. So now you have a fundamental matrix instead of a homography, and you can keep going; then it may not be Lambertian, okay, then you have specularities, and so on and so forth. But all of these relationships are basically a function that you can compute, which, given one image and a statistic of that image, will give you the statistics of the other image. For instance, in epipolar geometry, you take one image, back-project it onto the world, then you compute the forward projection with some rotation and translation, and you get the point in the other image. That's the relation you establish through the scene. But this relation is computation, sheer computation. You optimize, with respect to R and T, the residual between your back-projection-and-reprojection and the actual observation. So that's computation.
Now, epipolar geometry works if your scene is static, co-visible, and Lambertian. Can you verify that? No. Again, that's a belief, and you operate on that belief. If you keep relaxing your assumptions, you get to the point where you say: all I'm doing is designing a function that I can compute inside my brain that will explain the relation between these data. The relation can be geometric, which is what we just described, but it could be photometric, it could be more complex, it could be semantic, and so on. Establishing these relations is variable-length inference computation, a.k.a. reasoning, an LLM - where the L, again, is not the natural language with which you might have trained your model; your model could have been trained with just images. That L is the neuralese the model uses to perform computation through the chain of thought. So that is a language model, and that language model is a world model. It's a model that establishes equivalence classes of measurements, or meanings, for any type of modality that you want to feed the model - presumably assuming that you have witnessed them before. In that sense, the scene is the meaning of the images, and that meaning is established by the model. So the model is the world - not the physical world, whatever that is, which physicists haven't yet told us precisely. It's quite a radical switch in perspective, and going back to what Koenderink says, he calls it the Goethean approach to vision, as opposed to the Marrian approach to vision. The Marrian approach is naive realism: you start with the data, you compute the primal sketch, edges, and then you try to compose them, put together a skeleton and whatnot. And again, you make some assumptions, some priors, and land someplace; I make different assumptions, different priors, and land somewhere else. Who is right? Nobody. The Goethean approach is where you have a mental model, which is inside your head. That is the world. That is the world model. The world model hallucinates data, and so long as the data it hallucinates is statistically indistinguishable from the data you measure, your model is aligned and it's a valid realization of the world. It makes it real. With your model, you could build a plastic model of what you're seeing. So that's real. And from this perspective, the dichotomy between language model and world model, in my mind, does not exist. They are the same thing.
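
A small numerical check of the planar case described above, assuming calibrated cameras and normalized image coordinates: if the scene really is the plane n^T X = d, corresponding points in the two views are related by the homography H = R + t n^T / d and the residual is at machine precision; the planarity itself remains an unverifiable belief:

```python
# Synthesize points on a plane, project them into two cameras related by a
# rigid motion (R, t), and verify that the planar homography maps normalized
# points of view 1 exactly onto view 2. The relation is pure computation.
import numpy as np

rng = np.random.default_rng(1)

# Plane n.X = d in the first camera's frame, 20 points sampled on it (z = 5).
n, d = np.array([0.0, 0.0, 1.0]), 5.0
X1 = np.column_stack([rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20),
                      np.full(20, d)])

# Rigid motion between the two cameras.
theta = 0.2
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.3, -0.1, 0.05])
X2 = X1 @ R.T + t

x1 = X1 / X1[:, 2:3]                          # normalized image coordinates, view 1
x2 = X2 / X2[:, 2:3]                          # view 2

H = R + np.outer(t, n) / d                    # planar homography
x2_pred = x1 @ H.T
x2_pred /= x2_pred[:, 2:3]

print(np.abs(x2_pred - x2).max())             # ~1e-16: zero residual under the belief
```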

 

Ravid Shwartz-Ziv: But some people will say that the problem with this approach is that the process of embedding, or processing the input - say the prompt or something else - and the process of generation are the same in the current paradigm, in current transformers, right? And if you want proper world models, you need to distinguish between the two. You need to separate the embedding from the generation, because this is the only way your model can remove irrelevant details and not be tied to generating the exact noise, or whatever, in the output space.

 

Stefano: Yeah, I can see the general thinking there. Sure. So, a couple of things. One is that I've never used the word noise in anything I said so far, because one of the things you learn when you understand that models operate transductively, within the tenets of algorithmic information, is that there is no such thing as noise. That's actually one glaring issue with Kolmogorov's theory that is known to experts in the field - I was not one, and when I read the books I got the impression that the theory was sound. Kolmogorov tried to separate, in a canonical, universal way, structure, which is what you would like your embedding to capture, from randomness, which is nuisance variability that you want to weed out of your model. And arguably, I used to believe that most of the cortical processing in the brain - your cortex is about the size of your hand, and about half of it processes visual information - is about getting rid of nuisance variability. But in reality, what is a nuisance depends on the task. I don't know what noise is, because what is your signal is my noise. For example, take a credit card: it has a digital code embedded in a magnetic strip, and when you read it, you get an analog signal. If you want to read the number of the card, the analog signal is the noise. But if you want to identify this particular card, because you want to avoid cloning, then you don't care what the digital code is - you want the analog signal. So what is noise for somebody is signal for another. But going back to the different modalities between the input and the output: the core engine there is a stateful model - it could be autoregressive, it could be a state-space model, but it's a stateful model that operates in a tokenized, discrete space. That's true of your brain with synaptic rates, it's true of transformers, it's true of a bunch of other things. The continuum is an abstraction; sometimes it's convenient for us to think about things or prove things, but in reality everything is discrete. Okay, so now: how do you go, on one side, from measurements of physical data that are typically quantized - but quantized in a way where individual tokens are not meaningful, they don't correspond to any physical quantity; spectrograms or pixels don't have any semantic meaning - to a token or an embedding that conveys some of the structure, which could be algorithmic structure or statistical structure of the data? That's the encoding side. And on the decoding side: how do you take these discrete tokens and generate images? We have plenty of examples of very successful encoding and decoding, where success of the encoding is measured by how well you can perform inference, and success of the decoding is how indistinguishable the synthetic data is from real data, at least to the naked eye. We are long past the point where you can generate images that are indistinguishable from real ones and yet are synthetic. So I think that on the decoding side, we're fine, we're done. Yes, there are always slight artifacts - people who worry about counterfeit images and deepfakes and so on are still able to leverage subtle artifacts and inconsistencies in the geometry, the cast shadows, illumination artifacts, and so forth - but it's becoming very, very difficult, and at some point it will become impossible uniformly across all samples.
And on the encoding side, where you put the loss, whether you predict in the space of measurements or in some slightly more abstract space, depends on the modality and the goal and all these other things. So in that sense, JEPA makes sense: you lift the space of comparison one level up. You don't try to predict data, because if you're trying to predict video a minute from now, even though there's no such thing as noise, the level of complexity involved in determining the value that every pixel will take a minute from now is completely overwhelming, and most likely not useful for any practical application. So you can get away with either a short horizon at the pixel level or a longer horizon at a slightly higher level. On this point, by the way, there are old experiments in psychophysics, the random-dot stereograms of Béla Julesz, that were designed to prove the point that we don't need to interpret the data in order to establish a relationship, where interpretation means that if I give you two images, you don't need to know that this is a house and that is a tree and so on. Because if you remove all the photometric information, you're left with a random-dot stereogram, and if you take a square region and shift it, when you view the pair stereoscopically you can see a square sticking out of the plane. The experiment is correct, but the interpretation is wrong, because it makes exactly the point that the relationship between the two images is algorithmic, and in fact the algorithmic information of a random-dot stereogram is completely trivial, because all you need to do is tell me where the square is and shift it by a pixel, right? So there isn't a true scene out there, this rectangle floating in space, but I can rationalize the relation between the data just by explaining it with a rectangle that shifts. And that's a very, very small, very simple scene with which to explain it. So I think that when you go to multimodal data and world models, yes, there are differences, of course, in how you train them, in the type of encoders or tokenizers you use, and in the renderers and decoders you use. And certainly there's more complexity in gathering data, especially gathering data in closed loop, because you don't learn how to park your car by trying, hitting the wall, then not hitting the wall, then hitting the wall again; that stuff you have to do in your head. So there's definitely complexity, but at the core the basic computation engine is the same. And in your brain there's no differentiation between neurons in prefrontal cortex and in V1, V2, V4, so there's an indication there as well.
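A minimal sketch, in Python with NumPy, of the task-dependent-noise point Stefano makes with the credit-card example above. The magnetic-stripe model here (a square-wave bit pattern plus a card-specific analog "fingerprint"), the sample counts, and the correlation-based matching are illustrative assumptions, not how real card readers work; the only claim is that the same analog component is averaged away as noise in one task and treated as the signal in the other.

```python
# Sketch: "noise" depends on the task. All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
SAMPLES_PER_BIT = 50

def read_stripe(bits, fingerprint, sensor_noise=0.02):
    """Simulate an analog readout: the intended bit waveform plus a
    card-specific analog deviation ("fingerprint") plus sensor noise."""
    ideal = np.repeat(2 * np.array(bits) - 1, SAMPLES_PER_BIT).astype(float)
    return ideal + fingerprint + sensor_noise * rng.standard_normal(ideal.size)

bits = rng.integers(0, 2, size=16)                        # the digital code
fingerprints = 0.1 * rng.standard_normal((3, bits.size * SAMPLES_PER_BIT))
readout = read_stripe(bits, fingerprints[1])              # we scanned card #1

# Task A: read the number. The analog deviation is the noise; average it out.
decoded = (readout.reshape(-1, SAMPLES_PER_BIT).mean(axis=1) > 0).astype(int)
assert np.array_equal(decoded, bits)

# Task B: identify the physical card (anti-cloning). Now the digital code is the
# nuisance; subtract the ideal waveform and match the analog residual.
residual = readout - np.repeat(2 * decoded - 1, SAMPLES_PER_BIT)
card_id = int(np.argmax(fingerprints @ residual))         # nearest fingerprint
assert card_id == 1

print("decoded bits:", decoded.tolist(), "| identified card:", card_id)
```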
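A second small sketch, also Python/NumPy and purely illustrative (the image size, square location, and one-pixel disparity are arbitrary choices), of the Julesz random-dot-stereogram argument above: each image on its own is indistinguishable from noise, yet the right image, given the left, is reproduced by a tiny program ("take this square and shift it by a pixel"), so the algorithmic information relating the pair is trivial.

```python
# Sketch: the relation between a random-dot stereogram pair is algorithmically trivial.
import numpy as np

rng = np.random.default_rng(0)
H = W = 128
SQUARE = (40, 40, 48)      # top-left row, top-left col, side of the hidden square
DISPARITY = 1              # horizontal shift in pixels

left = rng.integers(0, 2, size=(H, W))

def render_right(left_img, square, disparity):
    """Shift the dots inside the square horizontally; refill the uncovered
    strip with fresh random dots, as in the classic stereogram construction."""
    r, c, s = square
    right = left_img.copy()
    right[r:r+s, c+disparity:c+s] = left_img[r:r+s, c:c+s-disparity]
    right[r:r+s, c:c+disparity] = rng.integers(0, 2, size=(s, disparity))
    return right

right = render_right(left, SQUARE, DISPARITY)

# Marginally, each image is pure noise. But the pair is explained by a
# four-number "program" (square position, size, shift) plus the refilled strip:
raw_bits = H * W                                  # bits to send the right image alone
program_bits = 4 * 8 + SQUARE[2] * DISPARITY      # rough: 4 small ints + refill strip
print(f"right image alone: {raw_bits} bits; right given left: ~{program_bits} bits")
```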

 

Ravid Shwartz-Ziv: So you don't think that, in the end, we need JEPA-style training? That autoregressive will take us to infinity?

 

Stefano: It makes perfect sense. I'm saying that they're not very different, in the end, from how we train stateful models for any other domain in the digital space. What I'm saying is that the dichotomy between language model and world model is mostly a matter of nomenclature, because nobody at this point trains a large-scale model only with language data. But once you train a model with complex enough data that you have the emergence of this reasoning ability, that reasoning engine can operate on data of any modality. And so it can operate as a world model once you feed it data of the right modality, and enough of it that spatial reasoning arises. So it's definitely not a solved problem by any means, but it's not that different from the way you develop language models today in an agentic setting.
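To make the "lift the space of comparison one level up" idea from this exchange concrete, here is a rough PyTorch sketch contrasting pixel-space prediction with JEPA-style latent prediction. This is not Stefano's or LeCun's actual training recipe; the module sizes, the frozen target encoder standing in for an EMA copy, and the loss choices are all assumptions for illustration.

```python
# Sketch: generative (pixel-space) prediction vs. JEPA-style latent prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, LATENT = 3 * 32 * 32, 128           # toy "frame" size and embedding size

encoder = nn.Sequential(nn.Linear(DIM, LATENT), nn.ReLU(), nn.Linear(LATENT, LATENT))
predictor = nn.Linear(LATENT, LATENT)     # predicts the embedding of the future
decoder = nn.Linear(LATENT, DIM)          # only needed for the pixel-space variant
target_encoder = nn.Sequential(nn.Linear(DIM, LATENT), nn.ReLU(), nn.Linear(LATENT, LATENT))
target_encoder.load_state_dict(encoder.state_dict())   # stands in for an EMA copy
for p in target_encoder.parameters():
    p.requires_grad_(False)                             # target gets no gradients

past, future = torch.randn(8, DIM), torch.randn(8, DIM)  # stand-in video frames

# Variant 1: pixel-space prediction. The model is penalized for every pixel it
# gets wrong, including detail that may be irrelevant to any downstream task.
pixel_loss = F.mse_loss(decoder(predictor(encoder(past))), future)

# Variant 2: JEPA-style latent prediction. The comparison happens in embedding
# space, so the encoder is free to discard detail it cannot or need not predict.
with torch.no_grad():
    target = target_encoder(future)
latent_loss = F.smooth_l1_loss(predictor(encoder(past)), target)

print(f"pixel-space loss: {pixel_loss.item():.3f} | latent loss: {latent_loss.item():.3f}")
```

The only structural difference between the two variants is where the loss is computed: against the raw future frame, or against its embedding under a target encoder that receives no gradient.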

 

Allen Roush: And what about the practical applications? World models, of course, have these theoretical definitions, or even the JEPA-style architecture, the kind of definitions that you and Yann are talking about. But I believe all practical world models today are essentially video models, right? They generate video output where you can interact in that world, move left or right, with Genie, Project Genie, and competitors from other companies being the best examples. Do you think there will be a change that gets practical world models closer to the definition that you or Yann or others have articulated, and away from this video-model-with-control kind of practicality we have today?

 

Stefano: Yeah, I think there are already areas where world models are practical, for instance in domains where, like you mentioned, the ultimate output is data, not actions in physical space: models for video, synthetic advertising and so on, virtual avatars and interaction in virtual spaces. Those are already here. When you get to physical interaction with the world, there are complexities that arise due to reliability. Your robot will break; now what? Those relate to liability and responsibility: if somebody gets hurt, whose fault is it? Is it the design of the algorithm? Is it the model? So there are complexities that are easier to manage in worlds that live inside a digital space that you can nicely confine. But I think there are some domains of application where we see the impact of world models already, and gradually we'll see more and more applications. Certainly there are a lot of beneficial ones that can happen, you know, assistance to the elderly, or even mobility. I started my career from Ernst Dickmanns's 1989 paper, Dynamic Monocular Machine Vision, because in 1984 he had cars driving autonomously on freeways the way we have autopilot in 2015. So those applications are definitely coming, but the level of complexity when it comes to interacting in physical space is decidedly higher.

 

Ravid Shwartz-Ziv: Okay, I think we're almost out of time. Anything else that you want to say? Anything that you want to plug?

 

Stefano: I think these are very exciting times. I also acknowledge that there is a lot of concern about what some of these agents can do, because they are very powerful, and as usual, with great power comes great responsibility. So, as I mentioned in the opening, the vast majority of my time is spent thinking about what could go wrong and how to avoid it, more than thinking about some of the nice ideas that you have kindly discussed with me. But I think that education is key here, because it's very easy to be convinced that these models are so powerful that you can use them without even looking at the outcome, just use it as is. And that carries risk, in my opinion, because as a father, and also an instructor, I always ask myself what young people should learn today, seeing things that used to be established become challenged. And high judgment is what is hard to replicate, because the level of context that comes from experience is not written, or writable, in a prompt. So that vastness of context is what you should strive for. And that means that when you create these artifacts, because of the intent gap, the model cannot read inside your head, so there has to be an expert human verification stage. You want your user interface, your UX, to be cognizant of that and to push that verification as early as possible, so that the cognitive load is as light as possible. Like when I drive: I want my car to take over as much of my driving as possible so that it lightens my cognitive load, but I don't want it to take over critical decisions that I want to keep for myself, at least not until there can be guarantees that come from structuring the environment in certain ways and so on. So I think it is important for people to keep in mind that these tools are powerful, that there are functions we need to keep for ourselves, and that we should develop these tools in a way that ensures those faculties are preserved and under our control. That's very important, and that's what I think myself and most other people who work in the enterprise are constantly worrying about and paying attention to.

 

Ravid Shwartz-Ziv: Yeah, these are very important things to work on. Stefano, thank you so much for joining us. It was a pleasure. And Allen, thank you so much.

 

Stefano: Thank you for inviting me, and thank you for the questions. Thanks, Allen. All right, bye-bye. Thank you.

 

Allen Roush: It's always an honor.

 

Ravid Shwartz-Ziv: Thank you everyone. Bye bye.