June 7, 2026

Jürgen Schmidhuber - Part 2: JEPA, the Road to AGI, and Who Really Invented Modern AI

Show Notes
Transcript

In the second half of our conversation with Jürgen Schmidhuber, we focus on the key ideas he's pursued since the early 1990s and discuss why he believes these concepts are only now being rediscovered.

We start with JEPA. Jürgen argues that the method LeCun named in 2022 is the same family he published in 1992 as Predictability Maximization. From there he traces the adversarial lineage back further still, to his 1990 world-model paper and 1991 Predictability Minimization - the curiosity-driven minimax games he sees as the real origins of GANs.

We also talk about why these ideas took thirty years to land, why today's trillion-dollar data-center buildout is driven by AGI fear, and why he thinks Apple may come out ahead.

The back half turns to what he sees as the real frontier: physical AI. Today's systems are superhuman behind the screen but helpless at a leaky pipe, and until a robot can use human tools, there's no AGI. He discusses self-replicating, self-improving machines as "a new kind of life," reframes continual learning and test-time training as ideas from his 1991 fast-weight work, and detours through Solomonoff's universal prior, Hutter's AIXI, and the Gödel machine.

We close on the subject Jürgen is famous for: scientific credit. He makes his case for rigorous attribution, casts himself as a "speaker for the dead" championing forgotten pioneers like Ivakhnenko, and reflects candidly on whether the fights are personal.

Timeline

00:30 — What JEPA is, and the 1992 Predictability Maximization story

04:54 — Implementing PMAX: autoencoders, Siamese networks, Infomax

09:10 — Predictability Minimization, factorial codes, and the roots of GANs

16:00 — Why it took 30 years: the economics of compute

20:52 — Data, the web, and 1990 as the origin point

23:09 — Hardware inflation, the trillion-dollar buildout, and the coming crash

34:05 — Physical AI: the plumber problem and self-replicating machines

41:14 — Which 90s ideas are being scaled right now

45:26 — Continual learning and test-time training as "old hats"

55:19 — Measuring intelligence: Solomonoff, AIXI, and the Gödel machine

1:05:26 — Self-replication and von Neumann

1:09:51 — Will he see AGI in his lifetime?

1:10:42 — Credit, integrity, and being a "speaker for the dead"

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Ravid Shwartz-Ziv: Hey everyone, Ravid here. This is part two of our conversation with Jürgen Schmidhuber. In this episode we go deep on JEPA and predictability maximization, the path to AGI, continual learning, and the future of AI. So enjoy. Okay, so, Jürgen, now let's talk about JEPA. I know these days there are a lot of people talking about JEPA, right? There are a lot of startups that want to use JEPA to solve many different problems. But let's start at the beginning. How do you see JEPA, or how do you define JEPA?

Juergen: So JEPA is a name that was given in 2022, in a paper written by LeCun, to a method which we published in 1992, actually a family of methods, and we called that Predictability Maximization. My co-author back then was my student, Daniel Prelinger. And what is predictability maximization, PMAX, about? It's about looking at the latent space of an encoder of the input. So inputs are being encoded by an artificial neural network, and then the hidden units, the latent space, represent these inputs in a compact form. And now there's another network that sees something else, maybe the previous input, if the current input is the last one in a sequence of inputs. The goal is now for the second network to predict not the raw pixels of the input we are talking about, but the latent representation, the hidden units, the abstract code that represents the input without maybe representing all the details, because usually you cannot really predict all the pixels of the next input image, for example. Instead, however, you can somehow try to extract an abstract representation of this input, and that can hopefully be predicted. And then we said, okay, let's now just use a system that has two objectives. One is: take the input and encode it such that it's informative, such that it's not a trivial representation of everything. And the other objective is: make the latent representation of that input predictable. So you now basically have two networks. One is trying to minimize the prediction error on the hidden units, on the latent representation. And the other one is trying to maximize the information that is represented in that hidden representation. It's a very simple system. You now have two objectives. One is to make this latent representation predictable, and we had a term, epsilon it was called, that gave a weight to that wish. 
And then there was another term which just wanted to make sure that the latent representation really tells you something non-trivial about the input, conveys information about the input. And now you have these two conflicting objectives. The latent representation is forced on the one hand to become more predictable, but at the same time it still doesn't collapse in the sense that it represents everything and nothing. No, it's still informative. So that's what we called predictability maximization in 1992. And it's a whole family of methods, because there are many ways you can make sure that the latent representation conveys information about the input, and there are many ways of building a system, a neural network, that learns to predict the latent representation while at the same time the latent representation becomes more predictable. So it's a whole family of methods, and we had all kinds of experiments with different ways of implementing the basic principle. That's what we did in 1992, and then 30 years later it was renamed JEPA.
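
The two conflicting objectives described above can be illustrated with a toy loss. This is a minimal sketch under stated assumptions, not the 1992 formulation: the function names are hypothetical, and mean per-unit variance is used as a simple stand-in for the information-preserving term. Only the role of epsilon as the weight on predictability comes from the conversation.

```python
from statistics import pvariance

def mse(latents, predictions):
    """Mean squared error between latent codes and their predictions."""
    total = sum((x - y) ** 2
                for u, v in zip(latents, predictions)
                for x, y in zip(u, v))
    count = sum(len(u) for u in latents)
    return total / count

def informativeness(latents):
    """Assumed stand-in for the information term: mean variance per latent unit."""
    units = list(zip(*latents))  # transpose: one tuple of values per unit
    return sum(pvariance(u) for u in units) / len(units)

def pmax_loss(latents, predictions, epsilon):
    """Lower is better: predictable (weighted by epsilon) yet still informative."""
    return epsilon * mse(latents, predictions) - informativeness(latents)

# A collapsed code (all zeros) is trivially predictable but carries no information.
collapsed = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
# A code that varies across inputs, here also assumed to be perfectly predicted.
informative = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]

print(pmax_loss(collapsed, collapsed, epsilon=0.5))
print(pmax_loss(informative, informative, epsilon=0.5))
```

With perfect predictions in both cases, the collapsed code scores worse than the informative one, which is the job of the second term: predictability alone would not separate them.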

Ravid Shwartz-Ziv: So can you describe in more detail the different ways that we can actually implement PMAX, the different components, and some follow-up papers that take these components for specific implementations and use them?

Juergen: Yeah, so the simplest way of implementing that back then was you just have an autoencoder which tries to encode the current input in its hidden units. And then you can decode it again, and you get back the original input. But in between, you have this kind of compressed version of it. Now you have a second network which, from its latent representation, tries to predict the latent representation of the autoencoder, and there's an error term that makes the latent representation of the autoencoder more like the prediction, such that it becomes more predictable. At the same time, the autoencoder always wants to encode as much as possible about the input, such that it's informative. So that's a very simple way of building a system like that. And you can also have a symmetric version. Back then, we also had a symmetric version where we just said, OK, here is the first input, here is the second input, and now we encode both of them in latent representations, and then we decode them again. But the important things are now these hidden units, the latent representations, and we just say, okay, let's make them equal. Such that both networks try to be informative about their own input, and such that each latent representation is identical to the other one, which means it's also totally predictable from the other one. Predictability in both directions. That was 1992. And I think later, something like that was called a Siamese neural network or something. But the basic principle is very simple: either the asymmetric case, where you try to predict the latent representation of the next input given the latent representation of the current input, or the symmetric situation, where you predict both forwards and backwards. And then the most obvious thing is to just force these two latent representations to be equal. And we had symmetric situations with experiments and asymmetric situations with experiments. Yeah, and so it worked. 
And now instead of using an autoencoder, you can also use different ways of forcing these latent representations to become informative about their inputs. And one of them is something which was called Infomax back then. There was a guy called Linsker who had Infomax as a criterion that one could use as the information-preserving criterion. Today they say the anti-collapse criterion. And we had

Ravid Shwartz-Ziv: So let's describe it a bit. What exactly is this anti-collapse criterion?

Juergen: Yeah, so if you now try to predict the latent representation of the input, one way of making it very predictable is that you always have zeros in the entire latent representation, and it's trivially predictable all the time, something like that. Then it is totally uninformative, it doesn't tell you anything about the input, so that's not good. Instead you want to have a force that makes it predictable.

Ravid Shwartz-Ziv: a...

Juergen: force that makes it informative. And at the same time, you don't want the criterion of predictability to overtake the other important thing, which is that you want the representation to be informative.

Ravid Shwartz-Ziv: So.

Allen Roush: I'm just curious, earlier you articulated this idea in the context of JEPA that sounded a lot like GANs, right? Generative Adversarial Networks. And I'm sure that you probably have something to say about the so-called creation of those too. Was that also created at pretty much the same moment, in your opinion?

Juergen: It was all created at the same time, yes, because what you really can do is you can say, instead of using an autoencoder to make sure that the current input is represented in a way that is informative, you can also use a technique which I call predictability minimization. That was also in 1991. That was one year before the predictability maximization that we are now discussing. And predictability minimization works like this. You have an input, a bunch of input units, and maybe what comes in is an image. And then you have a couple of hidden layers, and then you have the bottleneck layer, which is the one that you consider the interesting layer. That's going to be the representation of what's coming in, of the input. And now, one great goal of unsupervised learning, a holy grail of unsupervised learning, is to create an internal representation of the input which is statistically non-redundant. Which means that if you know one of these hidden units, you don't know anything about the other hidden units, such that each of these hidden units represents an aspect of the input that is statistically independent of the other aspects represented by the other hidden units. How did we achieve that? Well, we said if we have 10 hidden units, then let's predict the 10th hidden unit from the other nine. So let's have a little predictor that tries to predict the 10th hidden unit from the other nine. And then let's have another little predictor which tries to predict the ninth hidden unit from the other nine, and another predictor for the eighth hidden unit, and so on and so on. So you always try to predict each of these hidden units from the other hidden units, and now you have a fight, because the predictors try to minimize their mean squared error, for example, something like that. And at the same time, the feature detectors, the hidden units, try to run away from the predictions. 
So they are trying to maximize the same error function that these little predictors are minimizing. So now you have an adversarial network which is trying to come up with a representation of the input where the components are not predictable from each other, which means ideally you really have a factorial code where the

Ravid Shwartz-Ziv: That's it.

Juergen: where the probability of switching on the fifth hidden unit is the same as the conditional probability of switching on the fifth hidden unit given all the other hidden units. Such that the knowledge about the other hidden units doesn't tell you anything about the fifth hidden unit. And that is true for every one of these hidden units, which means that you have a statistically independent set of code components. And of course, the real world is very different. The real world is super redundant, because if you have an image coming in as an input and you see five white pixels, then with high probability the pixel within this region where you saw five white pixels is also white. So the world is super redundant, and that's what makes it so hard to do optimal Bayesian reasoning, because the simple Bayesian classifiers assume that their input components are statistically independent, but in the real world they aren't. So if you look behind me, there is some white field over there and it's full of white pixels. If you know this pixel, this pixel, and this pixel, and this pixel, then with high probability you can say that this pixel is also white. So there is enormous redundancy in the real world. And the goal must be to remove the redundancy and have an abstract representation, a disentangled representation of these inputs, where each component is statistically independent of the other components. Then you have a situation where the probability of the current input equals the product of the probabilities of these hidden units, which is awesome. And then we showed that the optimum of this error function, of this adversarial minimax game, is really the factorial code. And there was a guy in the 80s, a neuroscientist, and he emphasized the importance of factorial codes for Bayesian reasoning and so on, but he didn't have a method like that. 
And then we can use this method not only for creating disentangled representations of input ensembles. We can also use it as a submodule in the predictability maximization context, where we just want to have a method that creates internal representations that are informative about the inputs. And now we have this as part of the entire system, which tries to make a predictable representation. What does that mean? So on the one hand, these internal units want to be unpredictable from each other. And on the other hand, you have this predictability maximization thing going on where from the latent representation of the previous input you want to predict the new input here, which means that the latent representation of the new input will have to give up some of its optimality and of its informativeness about the input. But then you have these conflicting goals and we, back then, we weighted these different

Ravid Shwartz-Ziv: we weight

Juergen: objectives

Ravid Shwartz-Ziv: these patterns with an epsilon, yeah.

Juergen: with that epsilon.
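
The factorial code property described above, where the probability of a code equals the product of the probabilities of its units, can be checked numerically on a small set of binary codes. This is a minimal sketch of that check only, not of the adversarial training itself; the function names and the two example code sets are hypothetical.

```python
from collections import Counter
from itertools import product

def marginals(codes):
    """Empirical P(unit_i = 1) for each unit of a list of binary code tuples."""
    n = len(codes)
    return [sum(code[i] for code in codes) / n for i in range(len(codes[0]))]

def factorial_gap(codes):
    """Largest |P(code) - product of unit probabilities| over all binary codes.

    A gap of zero means the empirical code distribution is factorial.
    """
    n = len(codes)
    joint = Counter(codes)
    p_on = marginals(codes)
    gap = 0.0
    for code in product((0, 1), repeat=len(p_on)):
        independent = 1.0
        for bit, p in zip(code, p_on):
            independent *= p if bit else 1.0 - p
        gap = max(gap, abs(joint[code] / n - independent))
    return gap

factorial = [(0, 0), (0, 1), (1, 0), (1, 1)]  # knowing one unit says nothing about the other
redundant = [(0, 0), (0, 0), (1, 1), (1, 1)]  # second unit is a copy of the first

print(factorial_gap(factorial))
print(factorial_gap(redundant))
```

In the redundant set, each unit is perfectly predictable from the other, which is exactly the situation the little predictors would exploit and the code units would be pushed away from.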

Ravid Shwartz-Ziv: So what do you think was the missing part in the end? At the beginning of the 90s there were so many new methods that came up, but it took, I don't know, 20, 25, 30 years until someone could pick them up and actually scale them and actually use them on real problems, on real use cases. Do you think it's compute, or something else?

Juergen: No, it's compute. And of course, back then, that was 35 years ago, compute was about 10 million times more expensive than today. And it reflects this almost law-like acceleration of compute over time. So every five years, compute is getting roughly 10 times cheaper. Sometimes there are years where it's moving faster, and then slower again. But every five years a factor of 10. So in 30 years a factor of a million. You know, since we just talked about these adversarial networks, there was an even earlier adversarial network in 1990 in my paper on world models. So I had a paper on modeling the world where it's also the goal to predict the next input, or to predict the reactions of the environment to the actions of a controller. And then again there was an adversarial game to implement an artificial scientist, a very simple kind of artificial scientist. How does it work? Well, there's the generative network, which generates probability distributions of actions, basically, and then you have these actions as an output and you feed them into another network, and it's trying to predict the reactions of the environment. The reaction of the environment could be anything. It could be one or zero or something like that, but it could also be the next input which is coming into the system. And then you try to minimize the prediction error. So the model of the world, the world model as I called it back then, is trying to minimize the prediction error. But the controller is trying to come up with output actions that lead to data where the other guy, the model of the world, the predictor, is still surprised. So it's trying to generate sequences of actions that lead to error of the predictor. So suddenly the prediction machine and the generator of the actions, of the outputs, are players of a minimax game where one is trying to maximize the same error that the other guy is minimizing. 
And all those outputs of the controller where the reactions of the environment become predictable, they get boring. But those where you still can surprise the predictor, they stay interesting until the prediction error goes down. And that was my first adversarial neural network, and it was a generative adversarial neural network, because the controller was a generative model which had probabilistic units, Gaussian units that were trained through back-propagation through Gaussian variance and mean generators. So that's an old thing. I think the procedure for that was first maybe formulated by Williams in 1988 or something like that. But in a more global context, it's an older procedure. And then you would suddenly have this adversarial network, which generates outputs, and then the predictor, which generates evaluations of these outputs. And the evaluation error is the same thing that the first network wants to maximize. Yeah, so that was the first generative adversarial network that I had, but then the predictability minimization was not generative. So it was not a generative adversarial network, it was an adversarial network where you had an input coming in and then you get a latent representation of the input, and the goal of the input representation was to become factorial. In other words, each component wanted to become unpredictable from the other components. And again, you had this minimax game, but now with a different objective. So yeah, in most cases, a minimax game with a different context.
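
The curiosity-driven minimax game described above, where the controller seeks whatever still surprises the world model and predictable outcomes become boring, can be sketched as a toy loop. This is a heavily simplified illustration under stated assumptions, not the 1990 system: the controller here is a greedy lookup rather than a probabilistic network, the world model is one running estimate per action, and the two-action environment is hypothetical.

```python
def run_curiosity(outcomes, steps=200, lr=0.5):
    """Toy minimax loop: the controller seeks surprise, the model removes it.

    outcomes maps each action to a fixed environment reaction (hypothetical toy world).
    """
    prediction = {a: 0.0 for a in outcomes}           # world model: one estimate per action
    last_error = {a: float("inf") for a in outcomes}  # untried actions count as maximally interesting
    for _ in range(steps):
        # Controller: pick the action whose last observed prediction error was largest.
        action = max(last_error, key=last_error.get)
        reaction = outcomes[action]
        # The squared prediction error is the controller's reward and the model's loss.
        last_error[action] = (reaction - prediction[action]) ** 2
        # World model: move its prediction toward the observed reaction.
        prediction[action] += lr * (reaction - prediction[action])
    return prediction, last_error

prediction, last_error = run_curiosity({"left": 1.0, "right": 4.0})
print(prediction)
print(last_error)
```

Because both reactions in this toy world are deterministic, the model eventually predicts them and every action becomes boring. The controller keeps switching to whichever action still has the larger error until that happens.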

Ravid Shwartz-Ziv: So what do you think? Do you think, for example, that data was also missing then? Because one of the claims today, right, is that now we have so much data, we can train LLMs, we can train DINO, whatever, we can train with self-supervised learning. And the reason that we see this huge performance boost is because of the data. What do you think about that?

Juergen: Yeah, because of the data and the compute. And so the roots of all of this were all in around 1990. One really has to say that, because the World Wide Web itself also was created around this time at the European particle collider at CERN. And that's the thing where today all the data is collected from. Today we have I don't know how many trillion terabytes of data that some human somewhere at some point considered interesting. That's why it's now on the web. And all this data that has been evaluated as interesting enough to be uploaded by at least one human is now being used to train all these networks. And none of this would be possible without the much faster computers that we have, a factor of 10 million increase of computational power per dollar invested. And the data, which was increasing in a similar way, of course, all of that was possible only because suddenly everybody had computers to upload stuff, and to upload long videos and longer videos and higher resolution videos and everything that came to their mind. So yeah, the roots of all that are, you know, around 1990. And today we reap the rewards, because if you wait long enough, then compute becomes cheap enough and data becomes cheap enough, and you can do all the crazy things that they are doing today as this is being rolled out to billions of users.

Allen Roush: So I just want to quickly, you know, question that thesis about compute continuing to... So I completely agree that compute has gone through the exact process you've described up to this moment. The market seems to be pricing memory stocks and GPU manufacturing and similar in such a way as to say that we're now entering a period of hardware inflation where, at least with high-end data center equipment, including A100s, which are now six-year-old high-end GPUs, the price per GPU hour has actually started going up again, and I'm seeing it back at $150 to $160 an hour. And so I guess what I'm wondering about is, do you do

Juergen: Yeah.

Allen Roush: Even if the actual hardware continues to advance, you know, in the nanometer reduction in size per transistor, do you think that we might be in a somewhat extended period of hardware inflation, meaning cost going up across the industry despite this? And I'd point out that video game consoles like the PS5 have almost doubled in price as well for the same console, so this is happening across the industry.

Juergen: Yeah. Yes, it does. And it is true what you're saying. The remarkable thing that happened in recent years, maybe in the recent decade, but mostly in the 2020s, is that suddenly people not only profit a lot from this incredible reduction of dollars per compute. No, they need so much more compute that suddenly people are spending much more money on compute than they did 30 years ago. So it's not the case that the money spent on compute has stayed constant. No, recently it really shot up like crazy. And that's the reason why you suddenly have these investments, I think. In 2025 and 2026, the major companies are investing something like $1,000 billion in data centers. This is many times, I don't even know how many times, more than anything that was considered interesting in 1990. You know, the sheer amount of money that is being invested. Why is it that they are suddenly spending much more money on something that already is much cheaper than it was back then? Because suddenly they are fearing that if they are not at the front of this current wave, they might miss out on building some artificial general intelligence or something, and then any investment would be justified. But of course, what's really happening

Ravid Shwartz-Ziv: Do you think they are right?

Juergen: No, of course not, because at the moment they are all focusing on a particular kind of model, which is essentially supervised learning, where you just take all the data from the World Wide Web that was created in 1990 or 1989 and use variants of the algorithms that were created back then, you know, transformers. So back then linear transformers existed already, and LSTM and things like that. And you use all these old techniques, and it turns out that with the new compute you can pass the Turing test, which for a long time many people thought means that you have an intelligent system. Well, today we have systems that easily pass the Turing test, but it just shows that the Turing test is a bad way, an incomplete, unsatisfactory way of measuring intelligence. Because what we don't have at all are AIs that can do in the physical world what humans can do with their hands. My favorite example for decades has been the plumber. There's no robot that can do what a plumber can do. There's no robot that can do what an electrician can do, and so on. So at the moment, just because, you know, the Turing test has been passed, people think we are really close to AGI, and of course we aren't. But now they think maybe we will have AGI not in 10 years or 20 years or something, no, maybe we'll have it in one year or two years. And certainly some of the big companies think that if they don't spend a hundred billion dollars this year, they might be holding the bag. But I think it's a really misguided approach, because as we have discussed before, we are still far from AGI, because AGI in the real world doesn't work. There's no AGI without mastery of the real world. And the current investments... 
lead to a situation where these formerly nimble software companies like Google and Microsoft, where you had a small team of ten software guys improving an operating system and then rolling it out for billions of people who all use their own computers, who all use their own cell phones to run that operating system, are suddenly all working in the cloud, and they have to construct these data centers, and they have to invest in nuclear energy and gas turbines and whatnot just to satisfy the energy needs of these data centers, which means these companies are becoming like utilities. And I had a tweet a couple of years ago exactly about that. They are becoming like utilities. And if you just look at the price-earnings ratio, it doesn't become obvious, because the price-earnings ratio doesn't look at the free cash flow. But what you really have to look at is the free cash flow. And the free cash flow of these very profitable companies has gone down to almost zero. And now there are these investments and... Excuse me?

Ravid Shwartz-Ziv: But do you think that in the end, even if you need to solve algorithmic problems in world models and things like that, you still need a lot of compute, right? Even for another path.

Juergen: Yeah, yeah, yeah, you need that. But currently, some people are so optimistic, or they are so fearful, that maybe AGI is really close, although it isn't. It would still take a couple of years, not a few months or something like that. And they think now they have to spend all the money they have. And of course, within five years, if it's true that the trend keeps going, and there's no reason why it shouldn't proceed like that, every five years 10 times lower cost per compute. So if it is true that this trend will continue, then if you invest $1,000 billion into GPUs today, you're going to lose $900 billion within five years. And there's no business model that comes close to recuperating that loss. So some sort of crash is getting closer. It's not the end of the world. It just means that lots of people are going to lose a lot of money. But that's okay. That has happened before in the past. And it doesn't really mean that the way to AGI is going to be stopped in a meaningful sense or whatever. No. It just means that currently some misallocations of capital are happening. But in the long run, you know, the companies who are standing back a little bit and not investing like crazy, and instead saying, let's see who comes up with a really convincing AI technique or something that we can rent or, you know, provide to our customers in a way that doesn't cost us a lot of money, then maybe these are going to be the big winners of this game. So usually those who invest a lot of money are not necessarily those who reap most of the rewards. It has been like that for centuries.
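
The arithmetic behind the figures quoted in this conversation (a factor of 10 every five years, a million over 30 years, 10 million over 35 years, and a $900 billion loss on $1,000 billion within five years) can be reproduced in a few lines. This is a minimal sketch assuming the trend holds exactly and that hardware is valued only at the replacement cost of equivalent compute; any revenue earned in the meantime is ignored. The function names are hypothetical.

```python
def cost_factor(years, factor=10.0, period=5.0):
    """How many times cheaper compute gets over `years`, given `factor` per `period` years."""
    return factor ** (years / period)

def residual_value(investment, years, factor=10.0, period=5.0):
    """Assumed replacement value of compute bought today, after `years` of price decline."""
    return investment / cost_factor(years, factor, period)

print(cost_factor(30))   # thirty years of the trend
print(cost_factor(35))   # thirty-five years of the trend

invested = 1000.0        # billions of dollars
remaining = residual_value(invested, years=5)
print(invested - remaining)  # paper loss in billions after five years
```

Under these assumptions the same compute could be bought for a tenth of the price five years later, which is where the 90 percent write-down comes from.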

Allen Roush: So do you think Apple might be an example of one of those companies who didn't invest as much as others and in a good way?

Juergen: Yeah. Yeah, Apple has a very strong position, because I think at some point they figured out they don't have to be at the front of building the most expensive large language models. And instead they can see whoever will be ahead in a couple of years, you know, maybe Google or Microsoft or Anthropic or whoever. And then they will have a deal, you know, and everybody will be happy to get a deal with Apple and run their system on all the iPhones. And probably they are going to find a profitable way of exploiting whichever company is coming up with a reasonable business model there. But I think there will be many others who profit, not just big companies like Apple. No, in the end, everybody is going to profit from that. So AGI, or AI in general, is getting cheaper and cheaper all the time.

Ravid Shwartz-Ziv: I think there would be.

Juergen: factor of 10 every five years. I'm often comparing it to, you know, so in the 80s, I knew a rich guy who had a Porsche. And in the Porsche, there was the cell phone. So he could grab the receiver and talk to anybody who also had a Porsche like that via satellite, you know. And today, in all the developing countries you have lots of people who have a smartphone that is much, much, much better than what he had in his Porsche. And it's going to be exactly the same thing with AI. It's going to be really cheap for everybody. And those guys who are now investing hundreds of billions into data centers and making connections fast and whatever, they will make a little bit of profit, but most of the profit will go to others.

Ravid Shwartz-Ziv: And what do you think the market will look like? Do you think we will see one company that will take all? When it gets to AI, do you think there will be one company that is the first one to reach AI? Even if it will be 5 years from now, 10 years from now, or 20 years from now, do you think this company will take it all in the end?

Juergen: But you know, whatever a company like that is, it has to solve the physical AI problem. Because at the moment, the only AI that is working well is the AI behind the screen. Yes, behind your screen there's a superhuman Go player and a superhuman chess player and a superhuman video game player and a superhuman summarizer of documents. Very quickly, in milliseconds, it summarizes thousands of documents in a good way. And a superhuman generator of illustrations. And all of that is just AI behind the screen. But there is no robot in the real world, no AI-driven robot, that can do what a little kid can do. And now look at the fact that hardware and machinery are evolving much more slowly than software and computing hardware. Thirty years ago, compute was a million times more expensive than today. But the robots, thirty years ago, they were just a few times worse than today's robots. Back then, thirty years ago, there also were walking robots, you know. They walked more slowly and they had to make sure that the center of gravity is above the feet and stuff like that. And today they have more dynamic walking and stuff. But, you know, maybe they are three times better than back then, the robots of today. But they are not a million times better. So there's this truly important physical aspect of AI which is evolving much more slowly than the impressive thing, which is compute, which is getting cheaper by a factor of 10 every five years. Any company... So what will be the breakthrough in robotics? The breakthrough in robotics is going to be the first robot that isn't super smart, but smart enough to operate all the tools and machines that are currently being operated by humans. Once you have a machine... I'm getting a feedback now. Once you have a machine that can operate the tools and the other machines that already exist, then you have a new kind of life. Why is that? Because a machine like that, or a collection of machines like that, can make more of themselves.

Ravid Shwartz-Ziv: Once you have a machine.

Juergen: Because they can repair the machines that extract the ore from the ground and ship it to the factories where the material is refined, and then all kinds of tools are being made out of these materials, and microchips and screens and whatever. And the important thing is, then you will have a machinery that allows us to make more of the same machinery. Suddenly, it will not be important any longer how many people are in this nation or in that nation or something. No, because suddenly you have the ultimate physical scaling machine: robots that can make more robots of their own kind. And of course, it's not going to be just self-replicating. Then we will have not only replicating machinery, we will also have self-improving machinery, because all the methods of machine learning that already work well behind the screen will start working well outside of the screen, and you will have truly self-improving machines where not only the software is self-improving but also the hardware. And, you know, that will be the start of something huge and gigantic and completely world-changing. That will be the true start of AGI: mastery of the real world to the extent that AGI can build, in a physical way, more of itself and improve, in a physical way, itself. That's going to happen, and once that happens, then you will have the expansion of robots and AI and infrastructure into the rest of the solar system and then the rest of the galaxy and, within a couple of tens of billions of years, the entire universe.

Allen Roush: So, this whole articulation of physical AI being kind of the next frontier, does that subsequently motivate the obsession that the world now has with the concept of world models? Do you think that that's the correct way to invest to create the synthetic data for the physical AI?

Juergen: Yeah, so of course my own life has been overshadowed by this goal of building a real AI. My own motivation was always to build AIs that can learn in the real world to predict what's going to happen as a consequence of their actions. And that explains my early interest in world models. In 1990, we had general purpose recurrent neural networks as world models for systems that become artificial scientists and try to invent their own experiments to better figure out how the world works, and then use that knowledge to become better, because once you have a good model you can plan ahead and stuff like that. And back then few people were interested in that, but now it's kind of clear that the only AI that is working well is behind the screen, and almost all of the economy is not behind the screen. No, it's out there in the real world, in the factories and the real robots and the cars and trucks and whatever. And obviously you want to learn. So if you are interested in building AGI, you have to master the real world. And those people who are now interested in that as well, they of course have to think about world modeling. And then there are different techniques that you can use to build better world models. And there are different techniques that you can use to exploit the algorithmic information in the world model such that the controller, the decision maker, can learn to solve all kinds of problems in the real world.

Ravid Shwartz-Ziv: And what do you think are, how to say, the new parts or the new ideas? You talked a lot about the 90s, about '90 and '91. Do you think there are new ideas that came up between the 90s and these days that we should actually pick up and try at scale, now that we have enough compute?

Juergen: Yeah, so it's still happening. There are the old ideas from the early 90s which are being upscaled as we speak. And one of them is the principle of transformers, which goes back to 1991, the unnormalized linear transformer, which was published in 1992 then in Neural Computation under a different name. And key and value wasn't called key and value, it was called from and to, but you know, it's central to many of the systems that we have today. And then, you know, a couple of years before that, the GANs, another thing which is a scaled version of what we had in 1990. You know, deep residual learning, another thing that goes back to 1991. Deep residual learning, the title of the most cited paper of the 21st century, goes actually back to what my PhD student and my diploma student, Sepp Hochreiter, published in his 1991 thesis, where he had these residual connections to overcome the vanishing gradient problem. So all that stuff roughly originated at the same time, when the World Wide Web was created, for some reason, and the first smartphones were created in 1992 (IBM did that), and when the Cold War ended and the Berlin Wall fell. That was a really important recent moment in time, and since then I think much of what we do today is an upscaling of the old concepts of back then. Yeah, there are lots of improvements, epsilon improvement here, epsilon improvement there, but it's a little bit like with the cars, you know. The combustion engines were invented in the 1800s. If you look at a combustion engine today, it looks really different from a combustion engine of the 1800s because it's much more efficient, and you have all kinds of electronics and cooling and all kinds of things that they didn't have back then to optimize the explosions in the combustion chambers and to optimize the turbulence in the combustion chambers and do all kinds of things that make these motors much more efficient than back then. But the principle is exactly the same thing, and we see the same development now in AI.

Ravid Shwartz-Ziv: So do you think there are no new ideas since the 90s?

Juergen: Well, there are new ideas, but it's more like, you know, epsilons on top of the old stuff, and it's mostly about scaling up. It's mostly about scaling up. Now, to scale up you have to have lots of new ideas and you have all kinds of new problems. So some of the most sought-after people currently are data center engineers who really know how to optimize everything, and people who plan the construction of these data centers in a way that makes them not obsolete within a couple of years and stuff like that. So you need all kinds of new expertise to scale up that stuff. Of course, much of the scaling up depends on the chip designers, who are now also using AI to more rapidly design improvements of their chips. Lots of stuff is going on and lots of improvements all the time. It's just the principles: the principles go back to around 1990.

Ravid Shwartz-Ziv: What about continual learning? Are there, like, old ideas? Like online learning or whatever. Like, are there old ideas that we should actually use, or what, just... Yeah.

Juergen: I'm telling you.

Allen Roush: And is it important for AGI?

Juergen: Continual learning is, of course it's important, because it's... I wouldn't even emphasize that, but, so we have only one single lifetime, right? There's birth and there's death and in between we learn and, you know, after 30 years we are here maybe, and then we have learned a lot from this not even finished training example. We have only one single training example. We have learned from the data coming in within the first 30 years. We have learned a lot about how the world works, you know, and we have made predictions about the future and some of these predictions were wrong, but then we adjusted our prediction machines and we improved our controllers that use the prediction machines to make better plans for the future. And sometimes the plans don't pan out, but we get new data and we further improve our prediction machines. And all of that is continual learning in the sense that we have only one training example, which is one lifetime. And that's all the data we will ever see. And that's the only data that we can use to, you know, optimize performance in the remaining trial until death. And there are ways of maybe prolonging the remaining lifetime, and that is a possibility, so that's one way of getting more reward possibly, but still it's just one single lifetime, which means anything that happened here in the beginning of your life still may be relevant to understand what will happen here. And a system, a general purpose learning machine, that doesn't take that into account is limited. Because if you have a limited learning machine that is only going from this trial to the next trial to the next trial, and where you kind of assume that these trials are independent, which is what almost all of machine learning still does, then you are limiting yourself and you are limiting the learning algorithm that you're looking at, because the trial-by-trial approach already is extremely restrictive. And so we have always tried to look at the entire lifetime, always being ready to take into account information that was collected, you know, half a lifetime ago, because it may help you to better plan your future much later. So general purpose learning algorithms will take that into account. And the success story algorithm does that. And the Gödel machine does that. And the AIXI model of Marcus Hutter, my former postdoc around the early 2000s, does that. So all the general purpose things do that. And they don't even talk about continual learning because it's kind of clear. You have only one single lifetime, one lifelong learning, one lifelong trial, and continual learning is just a way of rephrasing what's going on. Of course, you don't want to forget what you learned here as you are trying to solve this problem here.

Ravid Shwartz-Ziv: So do you think in the future we will not see these pre-training, retraining, post-training paradigms? Everything will be just training, going online, when we see new examples?

Juergen: Yeah, yeah, in many ways. So for example, today one prominent expression that I've heard is test time training. Test time training. What does that mean? It just means that you have a neural network and then it was trained by gradient descent, but as it is getting new data that it has never seen before, there are still weight changes going on, but not through gradient descent, but because the network itself has learned a learning algorithm, which basically says how should my fast weights (I always call them fast weights because I called them fast weights in 1991), how should my system change these fast weights to take into account the new information that's coming in? And so test time training is another old hat. It goes back to 1991, to these fast-weight programmers and to the unnormalized linear transformer, because the unnormalized linear transformer is doing exactly that. So you have one network, the slow network, which learns by gradient descent to adjust the weight matrix of a fast network. How does it do that? Well, it generates these keys and values, and the keys and values are used to define the changes, the weight changes of this fast weight network, in a way that takes the context into account such that the thing can answer the next query which is coming in in a way that takes into account the entire context. So there's test time training all the time in systems like that. And to me, all of that is just, you know, a particular instance of the more general thing, the lifelong learning system, which, as new data is coming in, immediately incorporates that data to improve whatever it can improve so far, its understanding of the world. So build a slightly better world model with the new data, which was maybe unexpected, and then build a slightly better controller that uses this improved world model to make a better plan.
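Illustrative sketch (not part of the conversation): the fast-weight mechanism described above can be written in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the 1991 implementation: the projection matrices `W_k`, `W_v`, `W_q` stand in for the "slow network" and are random placeholders rather than trained weights, and all dimensions and function names are hypothetical. Each input writes an outer product of its value and key into the fast weight matrix, and the query then reads from it. The second function computes the same outputs as unnormalized linear attention over the whole prefix, which is the equivalence the speaker refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_key, d_val = 4, 3, 2  # hypothetical sizes, chosen only for the demo

# "Slow" weights: in a real system these would be trained by gradient descent.
# Here they are random placeholders (an assumption for illustration).
W_k = rng.normal(size=(d_key, d_in))  # produces keys   ("from" in the old terminology)
W_v = rng.normal(size=(d_val, d_in))  # produces values ("to")
W_q = rng.normal(size=(d_key, d_in))  # produces queries


def run_fast_weights(xs):
    """Process a sequence by additively programming a fast weight matrix."""
    W_fast = np.zeros((d_val, d_key))
    outputs = []
    for x in xs:
        k, v, q = W_k @ x, W_v @ x, W_q @ x
        W_fast = W_fast + np.outer(v, k)  # fast weight change, no gradient descent
        outputs.append(W_fast @ q)        # answer the query using the whole context
    return np.array(outputs)


def run_linear_attention(xs):
    """Same computation written as unnormalized causal linear attention."""
    outputs = []
    for t in range(len(xs)):
        q = W_q @ xs[t]
        y = np.zeros(d_val)
        for i in range(t + 1):
            k, v = W_k @ xs[i], W_v @ xs[i]
            y += v * (k @ q)              # value weighted by key-query inner product
        outputs.append(y)
    return np.array(outputs)


xs = rng.normal(size=(5, d_in))
fast_out = run_fast_weights(xs)
attn_out = run_linear_attention(xs)
print(np.allclose(fast_out, attn_out))
```

The fast-weight form keeps a fixed-size matrix regardless of sequence length, whereas the attention form revisits every earlier key and value at each step; the two agree because `np.outer(v, k) @ q` equals `v * (k @ q)`.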

Ravid Shwartz-Ziv: But like, okay, so test time training is a good example. Like, it doesn't work really well right now, right? Like, we all know it. So, why, like...

Juergen: But you can say every single transformer is doing a sort of test time training, because as the new data is coming in, you are generating these fast weights. The inner product matches where you compare the keys and the values and then apply it to the queries. There is a little fast-weight system being generated and immediately used. And so all the time you have fast weight changes there.

Ravid Shwartz-Ziv: But like, I don't know, for example, changing the actual weights, right? Like when you have new samples, new examples, it looks like it doesn't work at least as well as pre-training and post-training, like gradient-based training, right? So do you think...

Juergen: You learn by gradient descent to do these weight changes, the fast weight changes. So one network that is trained by gradient descent learns to generate, later, fast weight changes which are not due to gradient descent any longer. No, they are now the outputs generated by the slow network.

Ravid Shwartz-Ziv: So why are people not taking this idea? Why are you not taking this idea? And now we have a lot of compute, right? So why are we not doing it?

Juergen: Yeah. We are doing it all the time. We are using transformers and linear transformers and all the time.

Ravid Shwartz-Ziv: Okay, but why are we not, you know, updating our beliefs, right? Like now people are trying to do memory summarization and all these tricks about how, when I have new samples and new data, to incorporate it. There are a lot of hacks that people are trying to do. But at the end, you have a finite, very small...

Juergen: Yeah.

Ravid Shwartz-Ziv: context window, right? Like this is what you can put in your data.

Juergen: Yeah, but these are limitations of current models and large language models that everybody's using. They have all kinds of limitations simply because those who are making these large language models can use only so much data, and then it costs them a lot of money, millions or hundreds of millions even, to come up with a new language model, and then they ship it. And it is not made to quickly absorb new knowledge from a robot that is using it, where all the time new data is coming in through the actions of the robot and is not being incorporated in the model in a way that you would like to see. So it's not doing what an optimal decision maker would have to do. How can you overcome that? Yeah. Excuse me?

Ravid Shwartz-Ziv: But is it not something that you want to do, building frontier models? You know, maybe in the university there is not enough compute and money to do it, but don't you want to join a frontier lab or start your own startup in order to make it?

Juergen: Yeah, yeah, no, I do. I do.

Ravid Shwartz-Ziv: What about measuring intelligence? I remember that you had a paper, I don't know, 15 years ago, where you used games in order to measure intelligence, right? What is your takeaway? How do you think we should actually measure intelligence: using benchmarks, using games? What is the optimal way to do it?

Juergen: Yeah. Okay, so there is the type of intelligence that humans find interesting, which is very different from, you know, these general purpose intelligence definitions that we have had since, well, maybe since 2000, so mathematical definitions. And this, I would say, started with Marcus Hutter's work, who back then worked on an SNF grant that I had there on algorithmic probability and universal decision-making machines. And he came up with this AIXI model, which is super general. So it looks at all possible computable environments. Actually, it looks at all possible computable probability distributions over environments. And then the only thing that you know about the environment is that the probability distribution according to which the information is coming in from the environment is computable. And now there are countably many possible environments, because there are countably many computable probability distributions, and some of them are popular with humans, like Gaussians and Poisson and whatnot, but there are trillions of additional distributions, corresponding to trillions of different universes with different probabilistic laws and so on. And now the question is, what is the optimal decision maker that you can build in such an environment? First of all, it turns out that from a mathematical perspective, there's an optimal prediction machine for these environments. What is it? It's the one that goes back, not to 1990, it goes back to roughly 1960, to the work of Ray Solomonoff, who is known for this concept called algorithmic probability and universal inductive inference. And Kolmogorov and Solomonoff (for some reason both of these guys have four O's in their name), they laid the foundations of that. And so in 1978 or something like that, Solomonoff showed that these prediction machines that he had are universal. Which universal machines did he have? He had a universal prior, the Solomonoff prior, which essentially is the weighted sum of all possible computable probability distributions. And then you have this universal prior, and you start making predictions with that universal prior about the true world, which you don't know. You don't know what is the true distribution of your universe. But it turns out that if you are using the standard Bayesian framework for making predictions with a prior, then the predictions according to the universal prior rapidly converge to the predictions that you would get if you knew the true prior of your current universe. So there is a mathematically optimal way of predicting in arbitrary universes as long as they are computable in this sense. So that's the first step. That is what Solomonoff did in 1978. And then on top of that, Marcus Hutter, my former postdoc in the 2000s in Switzerland, he made an optimal decision maker. He said, OK, let's use that Solomonoff prior...

Ravid Shwartz-Ziv: But in some sense, but...

Juergen: ...predictor as a world model. That's now the world model. And then let's have a controller. And the controller is just trying to discover action sequences that lead to maximum predicted reward according to the Solomonoff prior in an environment that you don't know. The only thing you know about the environment is that it's computable. And then it turns out that this is, in a mathematical sense, the best way that you can do it. You always say, okay, now my lifetime has been this. And now I look ahead to where my lifetime will be twice as much as this. So I will be twice as old. What is my optimal action that I should execute here? Well, I look at all the possible action sequences that I can now generate. And I look at the world model, which is the Solomonoff prior, essentially, and the Bayesian framework around it, and then you predict what's going to happen. How much reward are you going to get with all these different action sequences? And then you execute the first action in the action sequence that leads to the highest predicted reward. Now you execute that. Then new information will come in. Maybe your world model will change a little bit, you know. So maybe you will have to change your plan now. But that would be the optimal way of dealing with any environment. Well, that's not true: with any environment that is computable, which is already a huge restriction, because almost everything that you could talk about is not computable. But the only things we can write formal papers about are the computable things. So we hope that universes, and this universe and every other universe that makes sense, are also computable. And then we have suddenly an optimal decision maker. So we know something about optimal decision making in arbitrary environments. That means we know something about optimal AGI in arbitrary computable environments. It's just that we don't know enough, because if we look at all computable probability distributions, then the set is countably infinite, which means we don't have an efficient algorithm for computing these optimal actions. So we have to do more sophisticated things. But at least we have this yardstick which shows us the limits.
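Illustrative sketch (not part of the conversation): the convergence property described above, where predictions made with a weighted mixture approach those of the true distribution, can be seen in a toy setting. This is a minimal sketch under strong assumptions, not Solomonoff induction itself: the real universal prior mixes over all computable distributions and is not computable, whereas here the hypothesis class is a hypothetical list of five coin biases, and the prior weights `2 ** -(i + 1)` are a made-up stand-in for "shorter descriptions get more weight".

```python
import random

# Toy hypothesis class: each hypothesis says "bits are i.i.d. with P(1) = bias".
# A finite stand-in (assumption) for the set of all computable distributions.
biases = [0.1, 0.3, 0.5, 0.7, 0.9]

# Hypothetical complexity-style prior: weight 2^-(i+1), then normalized.
raw = [2.0 ** -(i + 1) for i in range(len(biases))]
prior = [w / sum(raw) for w in raw]


def mixture_predict(weights):
    """Probability that the next bit is 1 under the weighted mixture."""
    return sum(w * p for w, p in zip(weights, biases))


def bayes_update(weights, bit):
    """Standard Bayesian posterior update after observing one bit."""
    likelihoods = [p if bit == 1 else 1.0 - p for p in biases]
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def run(true_bias, n_steps, seed=0):
    """Feed n_steps bits from the true source and return the final posterior."""
    rng = random.Random(seed)
    weights = list(prior)
    for _ in range(n_steps):
        bit = 1 if rng.random() < true_bias else 0
        weights = bayes_update(weights, bit)
    return weights, mixture_predict(weights)


prior_pred = mixture_predict(prior)
weights, pred = run(true_bias=0.7, n_steps=2000)
print(f"prediction before data: {prior_pred:.3f}, after 2000 bits: {pred:.3f}")
```

Before any data the mixture leans toward the hypotheses with the largest prior weight; after enough observations the posterior concentrates on the hypothesis matching the source, provided the true distribution is in the class. The planning step of AIXI, which searches action sequences against such a predictor, is not sketched here.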

Ravid Shwartz-Ziv: But do you think, like, what is the right path? You know, to come with this theory where, yeah, we have some very abstract prior, in some world where everything is computable, and then to try to...

Juergen: Yeah.

Ravid Shwartz-Ziv: ...to make the assumptions more similar to the real world, and then to come up with theory and, at the end, to reach a point where we can actually calculate the components. Or the other way, as we are kind of doing today in deep learning, right? Like, let's try a lot of different tricks, different hacks and different ideas. And at some point, maybe we try to build some theory on top of it.

Juergen: Yeah. Yeah, so what you should do is something that is much more computable than AIXI, which is the Gödel machine. And the Gödel machine is basically a system that has code. It's a computer which is connected to the environment through robot fingers or whatever, and it can manipulate the environment, and it has a code base, and then it has a theorem prover on board, a proof searcher on board, and it can look at its own code and it can try to make predictions about the consequences of its own code. And it will try to change its own code, to rewrite its own code, in a way that may affect even the proof searcher, which is just part of the code, in a way that is provably good, that leads to more reward per time than what it had before. And as soon as it finds something like that, a computable, provably good self-improvement, it is going to execute it. And this includes also a proof that it doesn't have to wait for an even better self-improvement, because otherwise it shouldn't execute the previous self-improvement. That is an AGI which is computable. You don't have to go through infinite numbers of probability distributions or whatever. It's an AGI that is limited to the current environment. And it's not very interested in all the possible other universes. No, just this current environment. And it's always trying to find a way of improving itself, which may include everything, you know: try to come up with a different kind of world model, a different kind of analogy building machine, a different kind of sub-goal learning, anything that's computable. And it's trying to come up with a better way of achieving reward than the previous version. So this is a reduced but more realistic way of building a universal intelligence. It's still not feasible, at least not in the current...

Ravid Shwartz-Ziv: and

Juergen: ...world we are living in, for various reasons. But I think it's in principle what you should do. And we have lots of toned-down versions, especially recently, where you can easily use large language models to rewrite their own code and stuff like that in a way that is reminiscent of the Gödel machine, although it's not theoretically optimal like what we had in the Gödel machine.

Ravid Shwartz-Ziv: And do you think?

Allen Roush: Earlier, you were also articulating what sounded like what's a von Neumann machine, like machines that can replicate themselves and construct themselves. So I'm just curious, since we're doing the gamut of hyper-intelligent people known by last names here, do you have any opinions about the importance of self-replication in AI systems?

Juergen: So von Neumann, he had influential papers, or at least one influential paper, about software, about self-replicating software. And that was in the early days of software, you know, and Conway, for example, had these glider self-replicating cellular automata patterns and stuff like that. So there, in a very idealized environment, they were thinking about self-replicators. And of course, today we have lots of self-replicators. Any computer virus is a self-replicator, which is trying to find a computer environment in which it can easily make copies of itself and offload itself to other computers. So we have lots of software self-replicators already. The challenging thing, which we don't have, is... like life. So in the physical world, we have one example of a really impressive self-replicator, which is life, and little bacteria. Somehow, although they are tiny, they have all this complex machinery that they need to import chemical material from the environment and use energy from the environment to build little replicas of themselves. And then suddenly you have two of them, and then you have four and eight and stuff. So this is something that we know but we don't understand. So we don't have any machinery that can replicate itself. However, for the first time now, as I mentioned before, we are in a situation where we may be pretty close to self-replicating machinery, because all we need is this one kind of machine which is a little bit smarter than the other machines, which is smart enough to operate the other machines and the other tools that are needed to repair the machines and do all the things that currently only living beings can do. Only humans can currently do all these things in the physical world that you need to repair the trucks and to repair the computers, all these essential physical things that you need for a self-replicating civilization. So we do have a self-replicating civilization, but it needs the humans. Yeah, with the humans you can make more humans and you can make more chairs and you can make more machines of all kinds. But you still need life. The current civilization that we have still needs life, human life, to grow. And as soon as we have a machine, which may look a little bit like a human, because it probably will have to, since it will have to operate all these traditional machines (it will probably have fingers and hands and stuff like that, some way of climbing, and so it may not look exactly like humans, but similar at least), a machine like that, that's smart enough such that we can teach it to operate all the other machines, that is the missing link, you know. Once you have that, then suddenly you have truly self-replicating machinery and you truly have a new kind of artificial life, not just in software, not just like von Neumann's self-replicating software things or Conway's self-replicating gliders and stuff like that. No, then you have it in the real world, and that will make all the difference. Software is kind of under control. We know how to improve software through machine learning, we know how to make copies of software, we can easily make a billion copies of software and roll it out to billions of people, but we cannot do that with robots, for example. It's much harder to make replicas of physical stuff, but it will become easy to the extent that we have this missing link.

Ravid Shwartz-Ziv: Do you think you will see, in your lifetime, either you or your ideas reaching AGI?

Juergen: Yeah, in the 70s I said that I want to build this thing that learns to become smarter than myself, and it was clear to me that it is not just software. You have to have physical AI, real robots, otherwise you are only halfway there. And back then I was hoping that within my lifetime we can see that, such that I can retire. That was my prediction. I want to build that in my lifetime or see somebody else build it in my lifetime, and then I can safely retire. And we are not there yet, but I think we have a good chance. You know, if I don't get a heart attack tomorrow or whatever, then maybe I will have a few years or even decades left and then we will see that.

Ravid Shwartz-Ziv: So one personal question: are you frustrated or disappointed that you didn't, or you don't, get enough credit for your ideas?

Juergen: I can't claim that we don't get a lot of credit. Some of our work is really well known. However, some of the work that we did a long time ago, especially around 1990, 1991, was rediscovered or maybe just republished later. Yeah, and then of course in science that's not the way it goes. In science you always have to correctly attribute, you have to do your prior research, you have to figure out who did this first and who published first, and then credit goes to that person. You know, whenever it turns out that you maybe did unintentional plagiarism because you didn't know the previous paper, as soon as you know, you have to correct your papers, you have to write an addendum, an erratum, a corrigendum, and that's the way of science, the way of scientific honesty and scientific integrity. That's what you have to follow. And yes, it is true there are a couple of people, pretty visible, some of them, who don't do that. This is annoying. But on the other hand, science is self-correcting and in the end, the truth always wins, you know. And if it hasn't won yet, then it's not yet the end.

Ravid Shwartz-Ziv: But are you taking it personally? Or is it always just, yeah, like this is a science debate, right? And there are two sides and everyone says their own opinion, and it can be about what is the best algorithm, it can be about who is to get the credit. Or do you think it's more personal, like, yeah, he doesn't give me this credit and I don't like it?

Juergen: Yeah, so maybe you have seen that I'm not only defending my own team when it comes to things like that. I'm also sometimes a speaker for the dead. And there are, you know, let's take, for example, deep learning. So it's a popular expression. It's about something that is really central to modern AI. It's about deep artificial neural networks with many layers, not just two layers like what Gauss and Legendre had in 1800, 220 years ago. No, about many layers on top of each other. And there are people who are trying to leave the impression they invented that. But in fact, it was invented by other people. And unfortunately, the guy who was central to that development, Ivakhnenko from Ukraine, died in 2013 and he is not widely known. He is now much better known than, you know, maybe 20 years ago, before I started to promote that. And so I'm trying to do my best to promote people like that. But it is still not yet widely known that he and his student Lapa should get the credit for the first deep learning machines that really worked, in 1965 in Ukraine. So that is one of the misattributions that you see in today's surveys of AI, where highly cited surveys of today don't cite the original work, which is unacceptable. And these papers should be retracted or they should be corrected. Yeah. And so I'm defending not only my own team when they are being plagiarized, but also others. This is something that everybody should do. It's standard in honest science. It's the basic rules of scientific integrity. Whenever you discover something, you have to check: am I the first? Did somebody else publish that before? And of course, in patents, if you file your patent one day after the other guy, you are scooped. So it's too late. And the same is true for scientific publications. So of course, you have to do that. That is the traditional way of science. And if you don't, then you are not a scientist. And it's a little bit sad that many people in our field aren't really honest scientists. But science is self-correcting.

Allen Roush: So, thank you. So related to this, I wanted to ask this question for a while too. Science has been around for an extraordinarily long time. And even just in the past five years, the number of papers that are being published, and thus citable, has exploded, 20% or more, I think, year over year at probably every conference. But the page lengths on papers have not gone up, right? Maybe a little bit; I suppose now nine pages is being accepted instead of eight pages for NeurIPS, but I claim it's relatively slow and maybe has a ceiling. So do you think that we ever hit a point where the number of papers that you have to cite in order to basically appease everybody and make sure everybody got the credit they deserve, you know, starts taking up such large amounts of the space that at best we would need to move it to, I don't know, the appendix? I'm referring to the related work section, where you have to specify, for each thing in your bibliography, how it's important. But I guess what I'm saying is, can't several hundred people potentially argue that they needed to be cited in some paper, thus creating these extremely long related work sections that take up the whole length of a paper? How do you think people should avoid this?

Juergen: In principle, follow what Richard Feynman said. He said, I'm citing all the guys, even those who don't deserve the credit, just to make sure that they don't give me a hard time. And it doesn't hurt, you know. The important things, however, are not the cases where it's just a tiny little epsilon improvement of some basic algorithm, and maybe 100 other guys at the same time had a very similar epsilon improvement of the same old thing. There you will not get reactions as strong as what you get when it comes to the foundations of deep learning itself, for example, or the foundations of deep residual learning, or stuff that really has become important, you know. Then you want to make sure that you cite the original work. And maybe you had a great idea and you only later discover that it wasn't new and was published before. Then you still have to do the thing that you have to do as a scientist, which is write a corrigendum and make sure that in every follow-up survey and in every follow-up presentation you correctly attribute this work. If you don't, you are not a scientist. And if you got an award for that, you should be stripped of your awards and you should give them back.

Ravid Shwartz-Ziv: And do you think there are people currently who are doing that?

Juergen: Yeah, absolutely.

Ravid Shwartz-Ziv: They could.

Juergen: I have...

Ravid Shwartz-Ziv: You don't have to answer like if you don't want to, that's fine. But like you think like, okay, so...

Juergen: Yeah, no, no, wait, wait. I have a bunch of reports on that which are widely known, you know, and lots of people have seen those. So one of them is about three Turing awardees who each got one third of a Turing Award. And they published lots of papers that left the impression that they invented something which was actually invented by other people whom they didn't cite. So it's easy to find. I have an AI blog which has lots of little pages and reports and so on. And there you can easily find overviews of that kind. So if you're interested in that, check out that blog.

Allen Roush: So real quick then, I'm aware that we're talking about things that people might call drama, but I want to go to something a little bit more neutral. Right now at ICLR in Brazil, or I guess shortly before this podcast recording, there was a widely reported paper called TurboQuant. Google wrote a blog post about it and purports it to be a much better form of quantization with significant inference speed improvements. And there's been a lot of drama associated with this paper. If you check its OpenReview page, there have been allegations from so-called research members of the public that there was possible full-on academic misconduct and fraud. Maybe you don't know anything about this, but I'm curious what your opinions are about some of these more modern, dramatic kinds of things that have happened over the last few years and that were not directly related to somebody, you know, taking credit that they shouldn't have taken. Do you have any opinions on any of these other more recent events? Or even, you know, at NeurIPS a couple of years ago there was a Swedish, I forget her name, but there was a Swedish roboticist lady who was interpreted as having made anti-Chinese statements, right? So I'm just curious if you think about, or are aware of, this kind of drama.

Juergen: Yeah. I'm not familiar with these particular cases. I know there are lots of additional cases. Whenever it happens to be something that I'm familiar with, I try to weigh in. But very often, I don't have time for looking at everything. In my own experience, often it's like this: maybe years later, something which was once debated is becoming kind of important, you know, and then it's becoming even more important. And once it's really important, then suddenly it turns out that it was just a republication of something that was published decades earlier or something like that. In cases like that, when it's important and impactful, I tend to weigh in, but there are many, many cases where I can't, and it's also not my job. It's everybody's job. Everybody in the field who is calling himself a scientist should do the prior research. Or if other people discover something like that and see that it's maybe a republication of something that was published before, then of course they are obliged to look at the details and check whether it's correct or not. If it is correct, then you have to retract your paper, or you have at least to write an erratum or a corrigendum, and from then on you have to give credit to the originator in every single presentation. And that's what honest scientists do.

Ravid Shwartz-Ziv: What is the best paper, the best work, that you think, not yours, from the last five years? Your favorite... What is your best paper, like your favorite work, from the last five years?

Juergen: Say again. Hmm. Best paper of my team for the past five years. So that would be since 2021.

Ravid Shwartz-Ziv: Also 10 years.

Juergen: Well, if we go back 10 years, I guess that... So in 2018, there were two world model papers. One of them is pretty famous now, and that's the one where David Ha is the first author. It's just called World Models. And that was very cool. In the same year, I had another paper which was about a more sophisticated kind of world model, where the controller and the world model are actually collapsed into one through a distillation procedure which I first used in 1991. I think that's an important paper, and it looks a little bit like DeepSeek did something like it later, to shock the stock market. They are not totally detailed in their paper, but it looks like it. Anyway, this 2018 paper is one of my more recent favorites. Excuse me?

Ravid Shwartz-Ziv: And a paper that you didn't write, one that is not yours: is there a paper that you like from the last five or ten years?

Juergen: Then... Yeah, there are many good papers, but there's nothing that I would like to highlight right now.

Ravid Shwartz-Ziv: Okay. One last question that I'm actually curious about. You have strong opinions about a lot of different ideas and domains and projects. How is it to be your student? What is the day-to-day work with you like? Because I have had experience with researchers with very strong opinions, and it is challenging. What is your take? Do you think you're a good advisor? Is it difficult to work with you?

Juergen: I don't think it's difficult to work with me, because I leave a lot of room to my students. However, I try to make good decisions in the beginning, you know. We are getting lots of applications, and I'm trying to hire the most brilliant students that I can find who are motivated to work along lines that I consider interesting. But that doesn't mean that I dictate anything to them. No. I leave them a lot of room because the basic river of thought is already, you know, acceptable to them and to me. So we are swimming in the same river, so to speak, and then there are lots of side rivers. Often what happens is that something doesn't work, and the student looks at the details, why doesn't this stuff work, and then there's a little tiny thing, a little devil in the detail, and it makes all the difference. The student suddenly finds a way of overcoming that devil in the detail, and then suddenly it works really well, and the same student is able to become the world expert in this particular thing and then has one paper after another, each of which goes a step beyond what he had so far. Just this year I had two Chinese PhD students finishing who joined my lab a couple of years ago, and there you have, from my point of view, brilliant work about vision models, diffusion models for vision, and large language models: self-referential large language models, and entire societies of large language models, which we called a society of mind, borrowing that term from Minsky. I think it was in the 1980s that he came up with that terminology. Our societies of mind are different from his: each member is a large language model, and it turns out that collectively they can solve all kinds of problems that one of them cannot solve. And there was a best paper award for that as well. These are general topics which are popular at the moment, where the students whom I just mentioned left their mark. Yeah, so this is the recent stuff, which is super compatible with what most people currently want to do and think is important. And of course it's connected to the old stuff that we did a long time ago, but it's in many ways a different kind of research.

Ravid Shwartz-Ziv: Okay, I think we are out of time. We talked about a lot of different topics and it was super interesting. Do you have anything that you want to add?

Juergen: My problem is that I'm rambling too much. There comes a question and I start going off in a direction that is only remotely connected to the original question, and I could go on and on and on like that. But I think it's better to stop for now and let's see what comes out of that.

Ravid Shwartz-Ziv: It's not a problem, it's a good thing, at least in a podcast, I think. We'll see, let's ask our audience, but I think it's a good thing.

Juergen: Yeah. OK,