Jürgen Schmidhuber - World Models, RL, and the Year that changed AI (Part 1)

In this episode, we host Jürgen Schmidhuber - the man, the legend, one of the godfathers of modern AI. His lab worked out many ideas behind today’s systems (LSTM, world models, artificial curiosity, Transformer variants, and even GAN-style setups) decades before they became fashionable, and he’s just as well known for making sure people remember who did what first. This is the first of two conversations with him.
We go back to his lab in the early 90s and ask how one small group came up with so many of the ideas that are now being scaled to a thousand billion dollars, back when compute was ten million times more expensive. A lot of the episode comes down to one distinction he keeps making: prediction vs. decision-making. His take is that LLMs are very good prediction machines that imitate the web, but that’s only half the problem. To actually act in the world, you need a controller that uses a world model to plan. He talks about his 1990 work on world models and artificial curiosity, where the controller gets rewarded for running experiments that improve its own model (an adversarial setup years before GANs), why planning millisecond by millisecond doesn’t scale, and why you need sub-goals instead.
We also talk about compression as the core of understanding, from falling apples to Kepler to Einstein, and why we still don’t have a robot that can do what a plumber does, even though the AI behind the screen keeps getting better. Then the conversation moves to credit assignment: how “to Schmidhuber” became a verb, what he thinks is broken about the award system, and a long exchange on PMAX vs. JEPA. He ends on the real origins of deep learning and a prediction about self-replicating machines in space.
Timeline
00:00 Intro
00:55 1991 in Munich, and why that lab mattered
02:38 "I'm not very smart" and why compute getting 10× cheaper every 5 years changed everything
04:25 Chess as an AI proxy
08:27 Artificial curiosity in the 90s vs. today's RL exploration
09:10 Why RL is harder than supervised learning
20:48 Coding agents vs. robots, and how a baby learns its own hands
26:20 Compression as understanding
33:40 What's actually missing on the road to AGI
37:30 Why millisecond-by-millisecond planning is stupid
47:44 Convergence to LLMs, GPUs, and how far we still are from the Bremermann limit
51:49 Unsupervised learning, factorial codes, and predictability minimization
58:12 Credit assignment: the fights with LeCun and the Nobel critique
1:02:13 On his last name becoming a verb
1:05:17 The award system's missing peer review
1:07:03 Closed labs and the decline of open research
1:13:23 Audience questions
1:34:02 Closing: who really invented deep learning?
Music:
- "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
Ravid Shwartz-Ziv: Everyone, Ravid here. Quick note before we start. We had so much fun talking with Jürgen that we decided to do another one. So this is the first of two episodes with Jürgen Schmidhuber. The second one is coming soon. Enjoy! Everyone, welcome back to The Information Bottleneck. Hey, Jürgen!
Juergen: Hello, Ravid, and hello, Allen. How are you doing?
Allen Roush: We're doing great, I can tell you that. How are you?
Juergen: That's fine. Thank you so much.
Ravid Shwartz-Ziv: Maybe tell us how you define or how you want to introduce yourself.
Juergen: My name is Jürgen and I'm interested in artificial intelligence.
Allen Roush: I think that's a bit of an understatement of the century.
Ravid Shwartz-Ziv: Okay. We can start with, like, I'm not sure if this is the start, but like a long time ago, right? 1991, okay. The field of machine learning and deep learning is so small, right? There is almost no money in it. And you have a small lab in Munich, Germany. Then what? Tell us why it's such an important moment in time.
Juergen: So 1991 is the only palindromic year of the 20th century. And it was important for AI because in this little lab that you mentioned in Munich, we were able to come up with all kinds of algorithms which are now essential to what the big companies, the most valuable companies in the world are doing. A thousand billion dollars are being invested in scaling up things that started back then. We didn't invent deep learning, no, that happened in Ukraine in 1965, but then in 1991 we were able to make major steps forward and we had a whole bunch of interesting things going on there.
Ravid Shwartz-Ziv: So how can it be that it's exactly in one place that so many things are discovered, like, you rediscover or reinvent so many things in exactly one place, at one point in time? What is the process that actually makes it happen?
Juergen: Yeah, so I'm personally not very smart, and I'm just smart enough, or back then I was smart enough, to realize that I'm not very smart, but it might be possible to build something that is much smarter than myself, such that this thing learns to do all the things that I cannot do myself, and then I can retire. And back then there was no competition, or almost no competition, because nobody was really interested in artificial intelligence back then, except for, you know, very few people. And the reason is that back then compute was about 10 million times more expensive than today, because every five years compute is getting 10 times cheaper. So in 30 years we have a factor of a million, and now it's 35 years, so we have a factor of 10 million, roughly. And back then these algorithms that are, you know, behind the P and the T in ChatGPT and stuff like that, they could be applied only to tiny little networks with a few hundred parameters. And today we have billions and trillions of parameters and we can do so much more for the same price. So back then you could do the same thing in principle, predict the next token and stuff like that, but you couldn't scale it up. That was only possible in the 2010s, something like that, and then more recently in the 2020s. And that's the reason why back then we had little competition. So it wasn't as easy as today to get scooped by somebody else doing a similar kind of research.
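The arithmetic behind the "10 million times" figure can be checked with a short sketch. It assumes only the rule of thumb stated above (a tenfold price drop every five years); the function name `cost_reduction_factor` is made up for illustration.

```python
def cost_reduction_factor(years: float,
                          factor_per_period: float = 10.0,
                          period_years: float = 5.0) -> float:
    """How much cheaper compute gets after `years`, assuming a fixed
    `factor_per_period` price drop every `period_years` (a rule of thumb,
    not a measured curve)."""
    return factor_per_period ** (years / period_years)


# 30 years -> six periods of five years -> 10**6
print(f"{cost_reduction_factor(30):,.0f}")
# 35 years (1990 to roughly today) -> seven periods -> 10**7
print(f"{cost_reduction_factor(35):,.0f}")
```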
Allen Roush: So I have a question about kind of your escapades and work in another part of AI. I see that you're listed in the Chess Programming Wiki. And I'm fascinated by how the field looked at chess programming and chess development as a proxy for AI capabilities up until at least Deep Blue in 2001. So I guess what I'm curious about is what was your connection to that era of AI development. And do you think that there were echoes of the current race in large language models back in the 90s and 2000s? For example, if you remember what Rybka was, I still remember the Rybka benchmark cheating scandal from the mid 2000s. And I'm curious if you see parallels there.
Juergen: Yeah, so chess of course is really one of the first domains that AI was applied to, before I was born. And as far as I know, the first guy who wrote down a chess program, and that was in 1948 or something, or maybe 1946, I'm not sure, was Konrad Zuse, the same guy who built the first general purpose computer in 1941 in Berlin. And he also had the first high-level programming language, which was called Plankalkül. Crazy name. And then he wrote programs in Plankalkül, and one of them was a chess program, basically. So that was long before I was born. I was born in 1963, two years before deep learning was invented in the Ukraine. And then in the 80s I think I had a Sinclair computer and I tried to implement a little chess program myself. But my work on chess is completely irrelevant in the grand scheme of things. However, chess profited from the same trend that we have been profiting from, which is that every five years compute is getting ten times cheaper. So back then, when Zuse did his general purpose computer in 1941, he could do roughly one operation per second. One. And then 30 years later, you could do a million operations for the same price. And today we can do a million, I think almost a billion billion, not quite, instructions for the same price. So the chess programs were dominated by exhaustive search. So you just do this recursive look-ahead search, and then you assume the opponent is able to look ahead one step fewer than your look-ahead. And then you have this recursive search going on, and more or less this thing then in 1997 led to a chess program that was able to beat the chess champion, Kasparov, not using any neural networks. Maybe at the same time, more interesting, or actually three years earlier than that, there was Tesauro's neural network, which learned through reinforcement learning to play backgammon. So that was a learning program, and it reached human-level competitiveness, I think around 1994.
So that was more a foreshadowing of what was going to come in the field of board games. And it had more to do with modern AI than what we saw in the 1997 defeat of
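The recursive look-ahead search mentioned above can be sketched on a game much smaller than chess. This is an illustrative negamax search for a take-away game (players alternately remove 1 to 3 stones, and whoever takes the last stone wins); the game and the names `negamax` and `best_move` are assumptions chosen for brevity, not anything taken from Zuse's or Deep Blue's programs. Each recursive call gives the opponent one ply less of look-ahead.

```python
def negamax(stones: int, depth: int) -> int:
    """Value of the position for the player to move:
    +1 forced win, -1 forced loss, 0 unknown at the search horizon."""
    if stones == 0:
        return -1  # the previous player took the last stone and won
    if depth == 0:
        return 0   # horizon reached: no verdict
    best = -1
    for take in (1, 2, 3):
        if take <= stones:
            # The opponent searches one ply less deep than we do.
            best = max(best, -negamax(stones - take, depth - 1))
    return best


def best_move(stones: int, depth: int) -> int:
    """Pick the number of stones to take that maximizes our value."""
    legal = [t for t in (1, 2, 3) if t <= stones]
    return max(legal, key=lambda t: -negamax(stones - t, depth - 1))


print(best_move(5, depth=10))  # take 1, leaving the opponent with 4 stones
```

The search is exhaustive, so its cost grows exponentially with depth; that is why this style of program benefited so directly from cheaper compute.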
Ravid Shwartz-Ziv: Let's start to talk about a different direction, a different subfield in AI, and I would love to hear your opinion. You worked in the 90s on curiosity and intrinsic motivation, right? It looks like there is a lot in common with the current RL exploration research. But what happened?
Juergen: and
Ravid Shwartz-Ziv: Why do you think RL has only become better now? What do you think got lost in translation and needed to be rediscovered to get us to our current algorithms?
Juergen: Yeah, reinforcement learning is more complicated than supervised learning. What you see today in the most popular AI models, large language models, how do they work? They basically apply supervised learning to imitate all the data on the web. So predict parts of the data from other parts of the data. Predict the next word in a sentence from the previous words. Predict the next pixel in an image from all the previous pixels that you have seen, stuff like that. And that can be done by just one single network that is using gradient descent to, you know, predict parts of the data from other parts of the data, and you train it on all the data you can find on the World Wide Web, and that means that you will insert an enormous, tremendous human bias, a human-oriented bias, in your network, because it is going to be trained on all the data that some humans at some point found interesting, which means it will be totally biased towards humans, maybe not towards all human groups equally, but at least it will be super biased towards what humans find interesting. And that's why these large language models today are so useful in so many applications. However, the large language models are not about decision making. For decision making, which is associated with reinforcement learning, which is about learning to make decisions, you need to do more. So you need to first have a prediction model, something like an LLM or a foundation model, as it is called now, which just learns to predict the future given the actions of the actor. And the actor has to learn to use this model of the consequences of its actions to plan, to come up with mental simulations of the consequences of possible future action sequences, and then pick an action sequence that is rewarding, that leads to a lot of reward. What does that mean? Well, there are these special inputs that have a value, the reinforcement signals.
Our body is full of little pain sensors, and we also put pain sensors into our robots such that the robots, who in the beginning are totally stupid, learn to understand what's bad for them. And whenever they produce actions that make the robot hand bump against an obstacle, against a table or something, then they learn to predict over time that this is going to happen. And so they have a little model of the world, a world model. I called it a world model in 1990, which then can be used to do planning, in the sense that you do several mental experiments and then you choose action sequences that lead to high predicted reward given the world model. Now, you need at least two things. There is the prediction machine, and that is kind of standard in the sense that it's really working well. So we have really good prediction machines. They are so good that they pass the Turing test. The Turing test is just about typing text to another guy behind the screen, and then the goal is to figure out: is the other guy a human or is it a computer? And if you can't do that any longer, then the Turing test is passed, which means that the Turing test is actually a bad way of measuring intelligence, because it's just about a tiny little aspect of intelligence, you know? Are you able to predict whether the other guy on the other side is a machine or not? And the harder part is the decision-making part, where you need the other network, which uses a prediction machine, to come up with good decisions. But then the other guy, the controller, I usually call it the controller and the model, the controller has to use the model to plan ahead and to predict its future and to select action sequences that are promising. However, in the beginning the model is stupid, you know, and so the controller has to live with that, and it has to figure out ways of making the model of the world better.
So basically it has to invent action sequences, experiments, that lead to data that improve the world model. So it has to have this thing which I called artificial curiosity. In 1990 I had a tech report which was called "Making the world differentiable", blah, blah, using recurrent neural networks as world models and as controllers, which used the world models to plan. But then also, the controller had an incentive to come up with experiments, with action sequences, that lead to a better world model. And the very first naive approach of 1990 was really: take the error of the prediction machine, of the world model, and use that as the incentive for the controller. So in other words, the controller is maximizing the same thing that the model of the world is minimizing. So the model sees the output actions of the controller and the other data which is coming from the environment. The controller also sees the stuff which is coming from the environment. And it is a generative model, because it has little Gaussian units in there which, you know, compute the mean and the variance of probability distributions over actions. And then you have a generative network which is trying to minimize the same thing that the other network is maximizing. So today they call it a generative adversarial network. But it was there back then. It was the motivation for the controller to come up with inventions, with experiments, that lead to data where the model of the world, the world model, still can improve. So you have an intrinsic motivation for the controller to become a little artificial scientist, basically, to learn, to figure out which parts of the environment the world model is still unfamiliar with, and then conduct action sequences that go there, such that the world model can learn some things that it doesn't know yet. And as the world model learns about that part of the world, it becomes boring, because the error goes down and there's less reward for the controller.
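The naive 1990 curiosity idea described above (the controller's reward is the world model's prediction error) can be illustrated with a deliberately tiny, deterministic sketch. Everything here is an assumption made for illustration: a two-action toy environment, a lookup table standing in for the world model, and a greedy controller. It is not the recurrent-network setup from the tech report.

```python
def run_curiosity(steps: int = 50, lr: float = 0.5):
    """Toy curiosity loop: the controller is rewarded with the squared
    prediction error of the world model, which the model then reduces."""
    # Hypothetical environment: each action yields a fixed observation.
    true_obs = {0: 1.0, 1: 5.0}
    # World model: one predicted observation per action, initially ignorant.
    prediction = {0: 0.0, 1: 0.0}
    # Controller: last intrinsic reward seen per action (optimistic start).
    expected_reward = {0: float("inf"), 1: float("inf")}

    history = []
    for _ in range(steps):
        # Controller maximizes what the model minimizes: go where the
        # model was most recently wrong.
        action = max(expected_reward, key=expected_reward.get)
        error = (true_obs[action] - prediction[action]) ** 2
        history.append((action, error))
        expected_reward[action] = error
        # Model minimizes its error with a simple delta-rule update.
        prediction[action] += lr * (true_obs[action] - prediction[action])
    return history


for action, reward in run_curiosity(steps=8):
    print(action, reward)
```

In this toy run the intrinsic reward for each action shrinks as the model's prediction improves, which is the "it becomes boring" effect described above.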
Ravid Shwartz-Ziv: I want to try to push back a bit, because, right, world models, we had many people that were also really excited about world models, but at the end we know that language models, like generative, right, next-token prediction models, are really good, right? And they can do a lot of tasks, and, I don't know, three, four years ago we didn't believe that they could be such good models. What types of tasks do you think they cannot do? Like with 5x, 10x, 100x more compute and more data, where we just can't get there with current models.
Juergen: So the current models that are trained on the World Wide Web, all the data you get from there, what do they know? They know only that data. And they don't know how to use that data to come up with decisions, with sequences of actions that lead to reward that is not in there in the data. So all the time you want to solve new problems. Suppose you are a robot and you want to go from A to B without... You want to go to the charging station in your room, and on the way to the charging station you don't want to bump into these obstacles. But your room is not there in the videos that you downloaded from YouTube, which allow you to have a foundation model that allows you to predict next pixels and next frames and videos in general and so on. And so you somehow need an additional mechanism, a decision maker that uses the prediction machine, the world model, to come up with good action sequences. But you know, the world is full of problems where the solution is not found in the data that you can already download. For example, suppose you want to be like Ronaldo, you know, Ronaldo the famous footballer, and as the ball is coming in, the football is coming in, you want to jump up at exactly the right moment and move your foot such that it touches the ball, and you look behind you and there's the goalie, and so you slightly adjust the angle of your foot such that the ball is hit in a way that leads to a goal. And all of that is super complicated. And there is no description of these decision-making processes that go on in Ronaldo's brain as he is doing all of that. So there's no language-based description of what he's doing there. So to become a successful robot that learns to achieve rewarding goals, such as shooting goals against the opponent team or something like that, you have to accept new data all the time. So new data is coming in that's not there on the web. And you have to adapt it to the current context you are living in.
And you will relate it to previous problems of a similar kind that you have solved in the past. And you are going to use the neutral signals that you are getting through the cameras, through the eyes, and through the microphones, you are going to use that neutral data which is coming in to improve the ways of getting the truly important data, which are the positive reward signals. And as you're trying to do that, as you are trying to learn to become a decision maker that can do all these things, you have to become a reinforcement learner. Now, you can profit a lot from existing data. You can profit a lot from neutral data, which is not directly related to the reinforcement signals. So you can profit from data that allows you just to understand how the world works. And that's what this whole artificial curiosity thing is about. So you want to generate through your actions new data, which is not on the web, that tells you something about the world that you didn't know. And then you build that into your existing world model, and then it becomes a better planner and you can better plan with it in the current context. And all the time you have this interplay between getting new data from the environment and thereby improving the planning processes that lead to desirable, rewarding events. And for all of that, the model is not good enough, because it does only half of it. It only makes the predictions.
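The controller-plus-world-model split described above can be made concrete with a small planning sketch for the charging-station example. The 3x3 grid, the reward values, and the names `world_model` and `plan` are all hypothetical; the "world model" here is a hand-written function rather than a learned network, and the planner simply tries every action sequence in imagination and keeps the one with the highest predicted reward.

```python
from itertools import product

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
SIZE = 3                      # hypothetical 3x3 room
OBSTACLES = {(1, 0), (1, 1)}  # hypothetical furniture
GOAL = (2, 0)                 # hypothetical charging station


def world_model(state, action):
    """Predict (next_state, reward). Bumping into a wall or obstacle gives
    a negative 'pain' signal; reaching the charger gives a positive one."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    in_room = 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE
    if not in_room or nxt in OBSTACLES:
        return state, -1.0
    if nxt == GOAL:
        return nxt, 10.0
    return nxt, 0.0


def plan(state, horizon):
    """Mentally roll out every action sequence of length `horizon` and
    return the one with the highest predicted total reward."""
    best_return, best_plan = float("-inf"), None
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s, r = world_model(s, a)
            total += r
            if s == GOAL:
                break
        if total > best_return:
            best_return, best_plan = total, seq
    return best_plan, best_return


print(plan((0, 0), horizon=6))
```

Even in this small room the planner evaluates 4**6 = 4096 imagined sequences, and the result is only as good as the model it plans with.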
Ravid Shwartz-Ziv: But for example, like the physical, do you think it's something that's related to the physical world? Because for example, coding agents, right? They are really good, right? They are not perfect, but they have become really, really good at doing exactly this in code, right? Like planning a task, coming up with a way to solve a very complicated task that involves different steps and that they haven't seen before, right? And the way that we actually train these models really uses that kind of signal, right? So why do you think we don't see it with robots?
Juergen: Ah, so of course all of learning is about generalization. So when you measure the performance of a learning system, you never care whether it models the training data. Yes, during training you care only about the training data and you try to model it well, but the essential thing is the generalization on the unseen data. Now, if you have read all books on the web, you will be able to answer lots of questions that have never been asked in any of these books, because you know a lot about answering questions of a certain kind, because lots of similar questions have been asked. And then you generalize from there. But in the real world, now you are a robot, you want to control your own fingers. There is very little data about your own fingers on the web. If you're a baby, a human baby, you don't download the web to understand how your fingers work. No, the baby, what does it do? The baby, instead of downloading the World Wide Web, accidentally does something like this. And then, as it does something like this, the video changes. The video that is coming in through these neutral channels, through the cameras. And what happens now? Now the world model of the baby adapts. And it learns to predict how the video is changing. And then it learns something that it didn't know, that it has these hands, basically. And it learns how certain signals to its motor neurons are going to affect the inputs that are coming in.
So that's how it learns to understand physics, how it learns to understand falling objects, how it learns that the world is a three-dimensional thing which constantly maps to these two-dimensional, you know, projections which are coming in to the cameras and to the eyes. And what the baby does is, it's trying to compress all the data all the time. And it turns out there's a really good way of compressing all the videos coming in from a room like this: all you need is a 3D model of the room, and then you can predict many of these videos that you get as you're walking through the room from previous actions of the walking guy, of the baby or of the kid that is doing that. And so all the time the data is coming in and the prediction machine is trying to find regularities that it didn't know. And there's so much out there in the real world and the physical world, so much data that you want to shape through your actions and so much reward that you want to obtain through your actions, that is not at all sufficiently modeled through the little bit that you can get from the World Wide Web. Yeah, you can get something from that and you can initialize your world model with that, that's also good, but it's not enough, because all the time you need new data. As you're trying to solve new problems in the physical world, you need new data, depending on the context, that isn't there in the previous data in a way that allows you just to generalize from the previous data and solve the new problem. So that's why you really need some active mechanism for not collecting all the data of the universe, nobody can do that, but just the data that is relevant for solving your current goal. And next time you have a slightly different goal, and you still have stored the old data, and it helps you to maybe shorten the search for a policy that achieves the new goal,
because all the goals are kind of related, or at least many of them are related. And so you can exploit the algorithmic information that you have discovered in your previous controller and model. And you can exploit that algorithmic information in a way that allows you to cut down the search tree and to shorten the path to additional success.
Ravid Shwartz-Ziv: And so, you talked about compression, and I'm a big fan of compression. Like, actually, the information bottleneck principle is based on compression, right? That compression is understanding, and all these things. But we tried many times to apply information theory principles for compression to existing methods, and every time it doesn't really work, you know? We didn't see big improvements from applying compression directly. And also, we see it, right? All these scaling laws: more parameters and more data keep helping your performance. So what is your take on that? Do you think there is something more fundamental that we are missing, or is there something that we don't know how to apply in the right way?
Juergen: Yeah, so implicitly we are doing a lot of compression. Because suppose you are looking at videos of apples falling down, that's my favorite example. And then if you look at three successive frames of that video, with high certainty you can predict many of the orange pixels of the apple as it is falling down in the next frame. What does that mean? You have basically understood gravity. You cannot predict everything. You cannot predict all these reflections on the skin of the falling apple. There you would need more complicated things such as ray tracing or whatever. But since you have to store only those pixels that you couldn't predict, so only those deviations from the predictions, you can predict a lot. And you can greatly compress, because everything that you can predict, you don't have to store extra. So all the time as you are dealing with prediction machines, you compress. You do compress a lot, the data that's coming in. And because our world is governed by a handful of apparently rather simple laws, you can generalize like crazy. And there are so many ways of compressing. For example, even simpler than the example with the apples, you know, behind me there's a white wall, and then you see three pixels in this white wall. And with high probability, this pixel there in the middle is also white, you know. So any prediction machine that predicts another white-grey pixel here, given the surrounding pixels, will be able to compress the images that are coming in. So all the data that is coming in from this very friendly world is very compressible. The current systems are still limited in their ability to extract arbitrary compressibility. They are good at extracting limited kinds of compressibility, like with the falling apples, you know, and the image compression thing. They are not as good yet as what Kepler did a long time ago. So Kepler, what did he do?
He got these data points, noisy data points from telescope observations of the planets. And then you have all these noisy data points, but he noticed there's a very simple mathematical object that kind of describes all of these noisy data points, and it's an ellipse. And to describe the ellipse, I need only two parameters. And suddenly he was able to, assuming... first, he made certain, well, back then it wasn't even Gaussian assumptions, because Gauss wasn't alive yet, but he made certain assumptions about the noise, and then he said, okay, there's a very simple law, a quadratic law, that allows you to predict these observations. And there was a physical law that he discovered, and he wrote it down in little symbols. So he abstracted away from all the pixels, and then he had a very simple sequence of symbols, a squared plus b squared, some parameters, equals zero, something like that. And then another guy, slightly afterward, said the same thing that makes these apples fall and that makes these planets move, that's actually driven by the same law. Let's call it the law of gravity. And it's a very simple thing, it depends on the mass of this body and on the mass of this body, and then there's a simple way of expressing how they attract. And 300 years later, there was yet another guy who was dissatisfied with the deviations from the predictions of the old theory. For example, Mercury in the perihelion does things, as it's rotating there, that it shouldn't do. And therefore, to describe what Mercury is doing, you need lots of extra bits of information to encode the deviations from the predictions of standard gravity theory. And then this guy found a very simple explanation, which is now known as the general theory of relativity, which allows you to predict away all these deviations. And the theory of relativity is really simple.
It just says, no matter how quickly some object moves and accelerates or decelerates, and no matter what kind of gravity field it is in, light speed always seems to be the same. So the constant thing is this. And then to make a predictor based on that, that predicts all the consequences of that, you have to learn a little bit of tensor calculus. And Einstein needed, I think, about 10 years to go from the basic intuition to the tensor calculus version. But then there it was. And that theory, again, greatly, greatly compressed lots of observations. And then it was able to generalize like crazy. And so we don't have artificial scientists yet that are as good as the best humans at extracting new, novel high-level concepts, compressibilities like that. I think we are moving in this direction, but we are not there. So we actually do have artificial-curiosity-based systems that go in this direction, but they didn't have that ChatGPT moment. The ChatGPT moment was limited to much simpler systems that are only about prediction and downloading everything and then predicting parts of it from other things. But...
Ravid Shwartz-Ziv: But what are the missing parts?
Juergen: not about finding the most elegant descriptions of all the data that's coming in. Yeah.
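The "store only the deviations from the predictions" idea from the falling-apple example can be sketched as a tiny predictive coder. This is an illustrative toy, not a method from the episode: it assumes one height measurement per frame and uses quadratic extrapolation from the three previous frames, which is exact for constant acceleration. The helper names are made up.

```python
def predict_next(a: int, b: int, c: int) -> int:
    """Extrapolate the next value from three successive ones, assuming
    constant acceleration (the third finite difference is zero)."""
    return 3 * c - 3 * b + a


def encode(heights):
    """Keep the first three frames, then only what the predictor got wrong."""
    residuals = [heights[i] - predict_next(*heights[i - 3:i])
                 for i in range(3, len(heights))]
    return heights[:3], residuals


def decode(first_three, residuals):
    """Rebuild the sequence by re-running the same predictor."""
    out = list(first_three)
    for r in residuals:
        out.append(predict_next(*out[-3:]) + r)
    return out


# Hypothetical apple heights: h(t) = 100 - 5 * t**2, one value per frame.
heights = [100 - 5 * t * t for t in range(8)]
start, residuals = encode(heights)
print(residuals)  # all zeros: every later frame was predictable
```

When the predictor captures the underlying law, the residuals are all zero and cost almost nothing to store; anything it cannot predict (the "reflections on the skin of the apple") would show up as nonzero residuals.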
Ravid Shwartz-Ziv: So what are the missing parts? What we still need to develop and what do you think, like, you know, like we are almost there and what do you think, like, will take us 10 years from now? And maybe now it's a good time, maybe we can talk also about AGI.
Juergen: Yeah, so an AGI system, of course, is not an LLM by itself. No, it uses a foundation model or something like that to build a predictive model of the world. But then you need the other system, the controller, which uses the predictor, the world model, to plan actions. So there's a decision maker, which is collecting data. And there's the L guy who is modeling the data. And the decision maker is using the model to plan the actions. And that's really an old hat. â early AI of the 20th century. And what's new is that we are using artificial neural networks for doing all of that. And it's not so new because it was done around 1990. So before that, there was Paul Verbus in 1987, and he had a feedforward controller and a feedforward prediction machine. He didn't use the name world model. He â used system identification and he related it back to what the control theorists had done before, except that the control theorists â didn't have adaptive neural world models and stuff. And there was Monroe who applied that to us in 1987 to â reinforcement learning. then, â you know, and then what we did about that, we â generalized that and to world models for partially observable environments, know, and the entire world is partially observable, so I really want to deal with partially observable environments where you need memory of past events to be able to predict, well, what's going to happen, and where the controller needs a memory of past events to make good decisions for the future and stuff like that. And then we had â extra, all the reward signals. â join the standard neutral signals as inputs to the controller, which is also important if the controller wants to do meta-learning. Meta-learning means you see the errors or the evaluations of what you're doing and then try to invent your own better algorithm to maximize the rewards. And to do that, you need at least to see the reward signals or the error signals or whatever the evaluation signals are, you need them. 
You need to see them such that you can devise internal algorithms, based on these evaluation signals, that are better than what you started out with at minimizing the errors or maximizing the rewards. So meta-learning, that's essential. And then you also need this other essential thing, which takes into account that in the beginning the world model knows nothing, which means that you have to incentivize the controller to generate action sequences or experiments that improve the model of the world. So these were kind of the new things in 1990. And back then we had a naive way of planning, because we said, okay, let's just do millisecond by millisecond planning. What does that mean? It means that you roll out the potential action sequences in your mental experiments millisecond by millisecond, which is stupid. It's like...
Ravid Shwartz-Ziv: Why? Why stupid?
Juergen: Why is this stupid? Suppose you want to go, where are you based now?
Ravid Shwartz-Ziv: In New York.
Juergen: In New York. Suppose you want to go from New York to Paris. Then there are two ways of doing it. The stupid way is you plan every single muscle movement on your way to Paris. You say, okay, I'm going to move my left pinky a little bit like this, and then I'm going to grab the phone, and then I'm going to move the phone like this, and then I'm going to call a cab, and then I'm walking down the stairs to the cab, and then I'm checking in at the airport, and then on the plane, although for six hours nothing is going to happen, you simulate all these possible muscle movements, which is stupid. That's an extremely exploding search tree as you're trying to find good action sequences that lead you to the goal. No, what you really do is you decompose your plan of going from New York to Paris into little sub-goals, and you say, okay, first I have to get a taxi, and I already have a sub-program that tells me how to do that. And then I tell the taxi driver that I need to go to the airport, and I already have a sub-program that knows how to do that. And then at the airport I check in, and then I sit down on the plane, and then for six hours nothing is going to happen. I already have a sub-program that covers that. And so your plan of the future just looks at a couple of tiny little decision points, crucial decision points in between the sub-programs. Because you are decomposing the goal that you have into a couple of standard sub-goals, and instead of doing millisecond by millisecond planning, you just think of a couple of important things that you have to do, a couple of important sub-goals that you have to reach on the way there. Now where do these sub-goals come from? You have learned to create these sub-goals. You have learned to create them, and somehow you can use maybe your model of the world, your prediction machine, to extract good sub-goals. And all of that we didn't do in 1990, although we had a sub-goal generator. So this is not quite true.
We had a sub-goal generator also in 1990, because back then the problem was already obvious. You cannot really do millisecond by millisecond planning. You have to generate sub-goals. Still, millisecond by millisecond planning is good enough for board games. It's good enough for chess and Go.
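Editor's note: the "exploding search tree" in the New York to Paris example can be made concrete with some rough arithmetic. The sketch below is only illustrative. The number of actions per millisecond step, the number of decision points, and the number of options per decision point are assumptions chosen for the example; only the six-hour flight comes from the conversation.

```python
import math

# Fine-grained planning: one decision per millisecond of a six-hour flight.
MS_PER_HOUR = 3600 * 1000
flight_steps = 6 * MS_PER_HOUR  # 21,600,000 time steps

# Assumption: only 2 possible actions per step (a real body has far more).
actions_per_step = 2

# The number of action sequences is actions_per_step ** flight_steps.
# That number is too large to print, so count its decimal digits instead.
fine_digits = math.floor(flight_steps * math.log10(actions_per_step)) + 1

# Coarse planning: a handful of decision points between sub-programs.
# Assumption: 5 decision points with 3 options each.
decision_points = 5
options_per_point = 3
coarse_plans = options_per_point ** decision_points

print(f"time steps in the fine-grained plan: {flight_steps:,}")
print(f"digits in the number of fine-grained plans: {fine_digits:,}")
print(f"number of coarse plans: {coarse_plans}")
```

Even with only two actions per step, the fine-grained search space is a number with millions of digits, while the sub-goal version stays small enough to enumerate.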
Ravid Shwartz-Ziv: End.
Juergen: And the more recent successes, maybe 10 years ago in chess and Go, they were really based on millisecond by millisecond planning. But that's not good enough for the real world. The real world is so much more complex than chess. Chess, what is chess? In chess, you have a little eight by eight pixel image, like a thumbnail image. And some of these pixels are black and some of them are white, and then you have a couple of numbers that encode the figures of the chess game and so on. So it's a tiny, tiny, tiny little world. The same is true for Go. So there, everything is super controlled, and a feedforward network is sufficient to come up with a good policy for solving problems like that. The real world is so much more messy, and you have to memorize all the time what happened. Now my colleague is over there and he's disappearing behind the wall, but that means he will come out again over there, and all of that has to be taken into account. So back then we already realized that millisecond by millisecond planning is stupid and we need a sub-goal generator, and we had one, but it wasn't good enough for the really challenging problems. So back then we said, okay, let's now look at a reinforcement learning system which already has learned all kinds of action sequences, going from A to B, going from B to C, going from E to F and so on. And then let's say, okay, we have an evaluator that takes a start and a goal as an input. So we suddenly have three networks, not two any longer. And the evaluator network sees a start and a goal and predicts the costs of going from start to goal. Something between zero and one. One means very difficult, zero means very simple. And then you train the evaluator on lots of examples of what the reinforcement learning system already can do. And now you give a new problem to the reinforcement learning system, which it cannot solve yet. And then you take two copies of the evaluator.
And now you have an additional network which is called the sub-goal generator. So the sub-goal generator sees the start and the goal and emits a sub-goal. Now, what happens to the sub-goal? Well, one of the evaluator copies sees the start and the sub-goal as an input and predicts how costly it is going to be to go from start to sub-goal. And the other evaluator, which is just a copy of the first evaluator, sees the sub-goal as a start and sees the goal, and then again predicts the cost. And then what you want to do is minimize the sum of the costs from start to sub-goal and from sub-goal to goal. Which means the only chance that the sub-goal generator has is to generate, through gradient descent, a sub-goal that doesn't cost much. And then hopefully you have a new action sequence that leads from start to goal using that sub-goal. But that was also a naive way of generating sub-goals, and since 2015 we have better ways of doing that.
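Editor's note: the evaluator and sub-goal idea described above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the 1990 system. The learned evaluator network is replaced by a fixed, hand-written cost function with values in [0, 1), states are 2D points, and instead of training a sub-goal generator network, gradient descent is run directly on a single sub-goal vector. The names `cost`, `cost_grad_wrt_a`, and `find_subgoal` are hypothetical.

```python
import numpy as np

def cost(a, b):
    """Stand-in for the learned evaluator: cost in [0, 1) of going from a to b."""
    return 1.0 - np.exp(-np.sum((a - b) ** 2))

def cost_grad_wrt_a(a, b):
    """Gradient of cost(a, b) with respect to a (the cost is symmetric in a and b)."""
    return 2.0 * (a - b) * np.exp(-np.sum((a - b) ** 2))

def find_subgoal(start, goal, init, lr=0.1, steps=500):
    """Gradient descent on cost(start, sub) + cost(sub, goal) with respect to sub."""
    sub = np.array(init, dtype=float)
    for _ in range(steps):
        # Two copies of the same evaluator: start -> sub and sub -> goal.
        grad = cost_grad_wrt_a(sub, start) + cost_grad_wrt_a(sub, goal)
        sub -= lr * grad
    return sub

start = np.array([0.0, 0.0])
goal = np.array([1.0, 0.0])
subgoal = find_subgoal(start, goal, init=[0.2, 0.5])

direct_cost = cost(start, goal)
via_subgoal_cost = cost(start, subgoal) + cost(subgoal, goal)
print("sub-goal:", subgoal)
print("direct cost:", direct_cost, "cost via sub-goal:", via_subgoal_cost)
```

With this particular toy cost, the summed cost is lowest when the sub-goal sits halfway between start and goal, so the descent settles there. With a learned evaluator the landscape would depend on what the reinforcement learner can already do.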
Ravid Shwartz-Ziv: So we talked a lot about what are good directions and what are good ideas, but for a second, let's focus on what is a bad direction. What is a popular research direction right now that you think is a dead end, and why? Something that a lot of people are working on, and you say, no, don't work on it.
Juergen: I'm not going to say that large language models are a bad research direction, because they are good enough for certain things, you know, and they are good enough for solving all kinds of interesting problems. They are just not good enough for reaching AGI, you know, because they just focus on one aspect of decision making, which is to model the world in which you are, such that you can, with the help of an additional decision maker, come up with promising action sequences. So I'm not going to say that LLMs are stupid. No, they are useful for certain things. They are just not very general. They are not going to be good enough for robots that want to deal with the physical world. And at the moment, of course, the only AI that is working well is the AI behind the screen. So the only AI that is working well is behind your desktop computer, and there is a superhuman Go player and a superhuman chess player and a superhuman video game player and a superhuman summarizer of documents and a superhuman video generator, which within milliseconds can generate a video corresponding to some prompt, and whatever. But all of that is AI behind the screen. And the thing that you really want to see in AI, AI in the physical world and with real robots, that doesn't work at all. It works a little bit, you know, but also decades ago it worked a little bit. It doesn't really work in the sense that there is no AI-driven robot that comes close to doing what a plumber can do. My favorite example for decades has been the plumber, where you lie down beneath the sink and then with a screwdriver and a couple of other tools you fix the pipe or something like that. Or what an electrician can do. So all of the really challenging stuff, all of the hard stuff, that doesn't work. We have acrobats, robot acrobats. And when I say we have them, the Chinese have them, basically.
But we don't have robots that can do with their hands all the stuff that you can do, that I can do, that a little boy of seven years old can do. This is because hardware, and AI-controlled hardware, is so much more challenging than AI-controlled software. And that's also the reason why for 10,000 AI software companies, you have maybe only 10 AI robot companies. Because the real world with the real robots is so much more challenging. And for decades in my lab, we have had, you know, little humanoid robots, some of them even looking like babies, but they were so inferior compared to what humans have. A hand like this, with millions and millions of sensors and cables connecting the sensors to the controller, I wouldn't even know where to put all these cables. And then you cut your hand and it starts healing itself. This is super advanced technology. We have nothing like that in human-made hardware. And so this connection between the hardware and software and machine learning, that's the big remaining challenge.
Ravid Shwartz-Ziv: In recent years, we saw all the ideas converge to LLMs, what Yann called being LLM-pilled, right? Everyone is working on it. And people have said, and it's kind of true, that most of this improvement is tied to GPUs, to this co-evolution of models that are built for GPUs and GPUs that are built for these models. Right, so we see that the GPUs have converged to a very specific paradigm. Do you think we will see that also in the future, or do you think now is the time for more ideas, diversity of ideas, diversity of both algorithms and software? I mean, maybe the end of the Nvidia monopoly?
Juergen: Computing has not reached its limits. So yes, every five years compute is getting 10 times cheaper, but the limit of that process, the only physical limit we are aware of, is still far out. That's the Bremermann limit. The Bremermann limit says that with one kilogram of matter, you cannot compute more than 10 to the 51 instructions per second. And even if the current trend continues, we are not going to reach that limit before the year 2200 or something like that. And I've pointed that out for, I don't know, many decades. And the Bremermann limit was established in 1983 by a guy called Bremermann. And so we still have a long way to go. Just to quickly put that in perspective: a human brain probably cannot do more than 10 to the 20 elementary instructions per second. Probably just a fraction of that, because most of my neurons in there, as I'm talking to you, are not active. Otherwise, my head would explode. So a human brain probably can do much less than 10 to the 20 elementary instructions per second. But let's assume 10 to the 20, which means that all human brains together, so 10 to the 10 roughly, can do not more than 10 to the 30 instructions per second. Now remember, one kilogram of matter, the only physical limit that we know is that this can do 10 to the 50 instructions. So 10 to the 20 times more than all human brains combined. And then take into account that the solar system contains roughly, what is it, about 10 to the 30, I think two times 10 to the 30 kilograms of matter. So we are still so far away from the limits of what can be computed with the local matter that we have here. However, the exciting stuff in the future is not so much what's happening behind the screen, it's what's happening in the real world. And the hardware, the robot hardware, evolves much more slowly than the compute hardware. The compute hardware has improved by a factor of a million in 30 years.
Now, if you look back at the robots 30 years ago, you know, they could walk already. They always had to be careful to keep the center of gravity above the feet and so on. But you know, they were not a million times worse than today's robots. They were maybe three times worse or something like that. So they are evolving much more slowly. So there we don't have this explosion that we see in computers. Now what we really need to do, and nobody has achieved that in a convincing way yet, is to bring hardware and software together in a way that allows us to do what babies can do, and kids, and so on.
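Editor's note: the orders of magnitude quoted in this answer can be checked with exact integer arithmetic. The sketch below only restates the figures given in the conversation. The per-kilogram limit is quoted once as 10 to the 51 and once as 10 to the 50, so both are shown. The variable names are illustrative.

```python
# Figures as quoted in the conversation (instructions per second).
brain_ops = 10**20          # generous upper bound for one human brain
num_brains = 10**10         # rough number of human brains
all_brains_ops = brain_ops * num_brains

limit_per_kg_high = 10**51  # Bremermann limit as first quoted
limit_per_kg_low = 10**50   # figure used later in the same answer

ratio_high = limit_per_kg_high // all_brains_ops
ratio_low = limit_per_kg_low // all_brains_ops

solar_system_kg = 2 * 10**30
solar_system_limit = limit_per_kg_high * solar_system_kg

# "10 times cheaper every five years" compounds to a factor of a million in 30 years.
growth_30_years = 10 ** (30 // 5)

print("all human brains together:", all_brains_ops)
print("one kilogram vs. all brains:", ratio_low, "to", ratio_high)
print("upper bound for the solar system's matter:", solar_system_limit)
print("compute improvement over 30 years:", growth_30_years)
```

Depending on which of the two quoted limits is used, one kilogram of matter comes out at 10 to the 20 or 10 to the 21 times all human brains combined.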
Allen Roush: We've spent a lot of time talking about how children learn, and we've even discussed various types of learning, like supervised learning and reinforcement learning. I'm always the one that feels like unsupervised learning doesn't get mentioned much, or get the credit that I at least perceive that it deserves. In particular, I claim that so much of the data, especially the data that babies see, is not directly labeled, right? And I claim that some kind of process analogous to clustering or even dimensionality reduction, in my mind, seems like it would have some biologically plausible influence. Do you agree with this? Do you think that processes analogous to unsupervised learning are happening?
Juergen: Yeah, yeah. No, of course, you are totally right, but the techniques that can be used for supervised learning can also be used for unsupervised learning. And it's happening all the time. So whenever we talk about compression, we are already talking about a sort of unsupervised learning. And one has to keep in mind that there is no really good definition of unsupervised learning. What we really want is to model the data that is coming in, all this redundant data, in a good way. What does that mean? In a non-redundant way. In a way that internally decomposes data such that you have these disentangled representations, as they are called now. And back then in 1991, we had another adversarial system which tried to decompose it into something that's called factorial codes. Factorial codes were popularized by Barlow in non-neural papers, papers which weren't really about artificial neural networks that achieved this. But there you want to have a code where the neurons, the hidden units, are statistically independent of each other, in a way that makes sure that the probability of the current input equals the product of the probabilities of these hidden units, of the activations of these hidden units. And that's what in 1991 was achieved, at least in principle, through this technique which I call predictability minimization. Not predictability maximization, which is like JEPA, but predictability minimization, where you have maybe images coming in, and then you have a hidden unit layer, a code layer. And then you have maybe 10 hidden units in this code layer. And then what you do is you have a prediction machine, a neural network, which tries to predict the 10th unit from the other nine. And you try to predict the next unit again from the other nine, and you predict all these 10 different units from all the other units. And then at the same time, each of these units wants to become as unpredictable as possible.
So it's just maximizing the same error that the prediction machines are minimizing. Again, we have this adversarial game, which was already there in artificial curiosity, but now used to generate a factorial code, a disentangled representation, where you have an optimal representation in the sense that you can do all kinds of techniques that are important, like Bayesian analysis, in a trivial fashion, because the product of the probabilities of the code units is the probability of the data which is coming in. So unsupervised learning, you can use supervised learning techniques to achieve it. To me, I didn't even mention it because it's kind of the standard repertoire. Additional ingredients of the standard repertoire in unsupervised learning are some sort of pruning, you know, where you say, okay, let's try to model the same data, but with fewer parameters. Let's somehow figure out which of these units we can prune and which of these weights we can get rid of, such that we have a smaller and more compact network representing the same thing, basically again doing compression. So in the end, everything falls back on this concept of compression. And we have lots of papers on that in the nineties, because even back then it was totally clear that's what you want to do. You want to compress your data, you want to compress the representations of your data, and you want to make them friendly, such that you decompose the data into abstract concepts where the different components of these abstract representations stand for different things, such that you cannot really predict one of them from the other ones, such that you have a disentangled representation. All 1991 stuff. And there are several ways of getting closer to that goal. Another one that goes back to Sepp Hochreiter is flat minimum search, around 1997, 1999, also achieving similar things and also often ending up with factorial codes. Flat minimum search just means that you again find a...
through gradient descent, a compact representation of the data. Yeah, so I could go on and on and on, but I guess we are digressing from the basics. Again, the model of the world wants to compress as much as it can, through predictive coding, usually, the data which is coming in. Everything you can predict, you don't have to store extra. If you can make your model smaller and...
Ravid Shwartz-Ziv: Let's go! Sit.
Juergen: use fewer parameters in your model, then it becomes a better compressor. And then you can hope for improved generalization. And the other guy, the controller, however, has a different job. It has to use the model of the world to figure out, given a new problem, what is an optimal action sequence? What does the model of the world know about the world that I can use to more quickly reach my current goal?
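Editor's note: the quantity at the center of predictability minimization, how well each code unit can be predicted from the other code units, can be illustrated on toy data. The sketch below is not the 1991 method. It skips the adversarial training entirely and uses ordinary linear least squares as the predictor, which is an assumption made for brevity and only detects linear dependence. The function name `predictability_error` and the toy codes are hypothetical.

```python
import numpy as np

def predictability_error(codes):
    """Mean squared error of the best linear predictor of each unit from the others.

    A high error means the unit is hard to predict from the rest of the code
    (what the code units want); a low error means the code is redundant.
    """
    n, k = codes.shape
    errors = []
    for i in range(k):
        others = np.delete(codes, i, axis=1)
        X = np.hstack([others, np.ones((n, 1))])  # append a bias column
        w, *_ = np.linalg.lstsq(X, codes[:, i], rcond=None)
        errors.append(np.mean((X @ w - codes[:, i]) ** 2))
    return np.array(errors)

rng = np.random.default_rng(0)

# Toy code with three independent binary units.
independent = rng.integers(0, 2, size=(2000, 3)).astype(float)

# Toy redundant code: unit 2 is just a copy of unit 0.
redundant = independent.copy()
redundant[:, 2] = redundant[:, 0]

errors_independent = predictability_error(independent)
errors_redundant = predictability_error(redundant)
print("independent code, per-unit prediction error:", errors_independent)
print("redundant code, per-unit prediction error:  ", errors_redundant)
```

In the adversarial setup described in the conversation, the predictors push these errors down while the code units push them up. In this toy version, the redundant code shows near-zero error for the copied units, and the independent code keeps every unit's error close to its variance.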
Ravid Shwartz-Ziv: So let's go to a different topic: credit assignment. You are very vocal about proper credit assignment, you have your fights with Yann about credit assignment, and you criticized the Nobel Prize awarded to Hinton and Hopfield. First of all, why is it so important? Why is it important to you and in general, and what do you think is the proper way to do it?
Juergen: Yeah. So I'm not going to say anything surprising. I just repeat what honest scientists would repeat. Of course the currency in science is different from the currency in business. In business you try to maximize money. In science you have this reputation based on discoveries. And if you want to go into science, you want to make sure that the currency system, the banks and everything, are working. And you don't want to have a situation where you can easily get scooped by someone, or somebody else republishes the stuff that you published earlier and then gets credit for that. Just like in banking, you don't want to pay money into your account and then somebody else takes it out. So that's the same thing. So the currency of science is this reputation and the credit that you get. And all of science is incentivized by that. You have to see, you know, the entire science system is incentivized by that. So if we throw that incentive away, then nothing remains. And then all the guys who want to have success in life will just say, me, I'll become a banker or a VC guy or something like that. And the old system based on scientific integrity is going to vanish. So of course we have to keep it.
Ravid Shwartz-Ziv: But people will claim that, right, you can have a very vague idea, and there are a lot of ideas, but there is a long path between having this idea and implementing it and using it and training it or whatever. So where do you think the line is? Where do you think proper credit assignment should fall?
Juergen: That's very simple. It's like the patent system or the publication system. If you publish something and there's a timestamp, and it's an accepted timestamp, then it's clear this guy was the first to publish it, and so he will get the credit for that. However, if somebody else builds on it and improves it and has an epsilon improvement here or a major improvement there, or somebody else takes the basics and then 20 years later, when compute is 10,000 times cheaper, uses it to build a great system that can be rolled out for millions or billions of people, then he should get credit for popularizing it in this way. So you should always get credit for the stuff that you really did. So maybe the guy who invented it gets the inventor's credit, and the popularizer gets the popularizer's credit. But if the popularizer tries to also get the inventor's credit, then something is wrong. And then, you know, the traditional science system is going to react.
Allen Roush: So this leads me to a question I've really wanted to ask you. And you might be aware of this, but if not, I'm fascinated to see if this is news to you. Your last name has become a verb. To "Schmidhuber" somebody is when you put effort into saying that you invented something, or came up with the ideas, before they were presented. I guess the field has kind of come to an agreement that, at least when you do it, you're correct most of the time, right? And I think there's a fear that at least some people who try to claw back that credit might not actually be as responsible for those things as they say they are. Right? So I guess what I'm wondering is, how do you react to this kind of development and your name being associated with these things?
Juergen: Yeah, I think it shows a little bit the sad state of machine learning and AI in general, that something that is so obvious gets a label like that. Because of course, every scientist should be like that. I think most scientists are like that. They want to make sure that they get credit for their own inventions, and that somebody else who is trying to take that credit should be punished for that. I think most scientists agree. Only serial plagiarists or the like don't agree. So it's kind of weird that you take a guy who is active and kind of loud in that area, and who always points out when somebody is trying to get credit for something that somebody else did first, and that this should get a special label. I don't mind, you know, the Schmidhubering, it's fine with me, but it's a sign that the entire field is not in good shape, because it's just what scientific honesty and integrity are about.
Ravid Shwartz-Ziv: And did you face any backlash or criticism for taking that position, for example in the Nobel Prize award issue?
Juergen: Well, the interesting thing is that none of the guys who were accused of republishing stuff that was invented by others has tried to defend himself. Think about that. Why don't they do that? Well, because they can't.
Ravid Shwartz-Ziv: So why is it happening? Why do you think it happens that some people get awards and some don't? And how can we solve it?
Juergen: How is that possible? So one has to see that with the awards, one essential element of peer review is missing. There's a little bit of peer review, because there's a committee, and then you have these nominations, and then letters of recommendation are coming in and so on. And then the committee makes a decision. Now in proper peer review, which you have, for example, with journals, you also get lots of submissions, and then maybe a paper is accepted, so it passes peer review. It passes peer review, however a little bit later it turns out that it's plagiarized. So it's a product of plagiarism, and somebody else has published the same thing earlier. So then what happens? Then the important element of peer review comes in. The paper gets pulled, it gets retracted, or at least there has to be a corrigendum or an erratum. And that part is conspicuously missing in the somewhat incestuous award system. You know, where the committee over here from this award looks at the other awards and says, yeah, these guys got an award for that, and we don't really know too much about the topic, but I guess they did their job and made sure that there was no plagiarism involved and so on, and so let's give our award also to that group. And all of that is hidden behind closed doors, and the whole award system is susceptible to collusion like that, you know.
Ravid Shwartz-Ziv: And what do you think? In the end, it looks like we are evolving toward a situation where all the big recent advances in machine learning and large models are coming from closed-source models, right? And it looks like more and more parts of this community are not publishing their work, or at least not publishing the important parts of their work. So for me, it's quite sad. And I think a lot of the people who are now leading in these companies, Anthropic, OpenAI, actually became very famous and earned a lot of money because they were allowed to publish their work, both at universities and at Google Brain or other companies, right, like 10 years ago. But now that's not the situation. So first of all, do you agree, and do you think there is something that we can do to improve it?
Juergen: Yeah, so with companies the incentive is different. The incentive is the profit of the shareholders, and of course you have a whole legal system around companies, and if a company does something wrong according to the legal system, then you can sue them, or you have debates between companies: how are we going to deal with your patents, and you are going to do something with our patents, and let's have a cross-licensing deal, and all kinds of things that are foreign to the scientific system and to scientific honesty and scientific integrity and all these things. So on the one hand, there are the scientists, and almost all of the basic algorithms in AI and machine learning were created by little labs, not by the big companies, but by little labs with not so much funding, taxpayers' funding usually. And the basic breakthroughs were really created decades ago there in these little labs, which only recently are suddenly confronted with big money, because suddenly you have the PhD students moving into super well-paid positions and then forming their own companies, and then suddenly not doing the PhD thing any longer and the science thing, but now maximizing different objectives, which are shareholder rewards and shareholder profits. And there you have a totally different legal framework. It's not about scientific honesty anymore. It's about everything that is legal, given this well-established legal system for companies. So suddenly you have different goals, you know, you are maximizing different kinds of rewards. Now the rewards are money basically and
Ravid Shwartz-Ziv: So, but like...
Juergen: Certain things are allowed and others are not allowed and everybody in this field is trying to find paths that are more or less legal or legal enough such that the company doesn't have to pay a lot of shareholders.
Ravid Shwartz-Ziv: Got you. But are you concerned about this direction? Because in the end, most of the scientific improvement and contribution came from open research. So do you think we will see less improvement 10 years from now, 20 years from now? Maybe it will be very focused in specific fields or specific companies? Or do you think it will just propagate outside, as we saw before?
Juergen: Let's keep these two things separate. In science, of course, there's a fixed point, which is the truth. All future surveys, or the history of our field that will be written in the future, are always going to try to focus on the truth. There's also a reward for that. So if you are a scientist who is good at discovering that certain other guys did something important that was not credited correctly, then you get a reward for finding that out. So the science system in principle works, and you get a reward for discovering the plagiarism of other guys. Sometimes it takes a while, but you know, the fixed star, the fixed point, that is the truth. And science is really about doubting the current write-ups and the current explanations of what happened, and always trying to improve them, and you have this attraction point, this fixed point where everything is moving towards that. In topics that are more important than machine learning, we have seen radical revisions of the accepted narratives. For example, look at the history of the universe itself. So the most important thing in the universe is probably the universe itself, and for thousands of years there were quite misleading narratives about the creation, the origins of the universe. You know, millions, hundreds of millions of people believed in them. There was an accepted narrative until within a few decades everything changed. And then today there's a new narrative, which was created maybe roughly a century ago, and suddenly it turns out maybe the universe is much older than we thought. It's not 6,000 years old, it's 13.8 billion years old. And then there was this long evolution that led to the current state of things. And suddenly hundreds of millions of people are convinced of that more recent, completely changed narrative. And the same is going to be true for all kinds of subfields less important than the entire universe. For example, the history of AI on our little planet.
Ravid Shwartz-Ziv: Okay, now let's go to a question from the audience. We got a lot of questions. We do not have time to go over all of them, but let's try several of them. So we got a question from a person, let's call him YL, about JEPA. So he said: JEPA is a name of a class of architectures. JEPA without a predictor is a
Allen Roush: YL.
Ravid Shwartz-Ziv: JEA. JEPA and JEA can be trained in many different ways, right, including distillation, sample-contrastive methods, and information maximization methods, and PMAX is merely one Infomax method applied to a JEA, and not the first one. So if PMAX was so great, why did you not pursue it?
Juergen: Yeah, okay, so now we are talking about JEA and JEPA, and let's first get the facts right. So in 1992, we had this paper on JEPA, which we didn't call JEPA, we called it Predictability Maximization. What was it about? It was about a neural network that encodes the incoming data somehow, and then there's another network that's trying to predict the internal representation of the first network. So that's the predictor of the internal representation of the first network, and the internal representation of the first network tries to become predictable, such that it somehow represents an abstract concept which doesn't contain all the information about the incoming data, but only the abstract version of it, which stands for a predictable class and predictable internal representations. And that paper, 1992, cited a paper from 1989, which was by Becker and Hinton, which was about maximizing mutual information. And that's what LeCun called JEA, J-E-A, Joint Embedding Architecture, in 2022. In 2022, he wrote a paper on JEPA, and there he cited that, and then he said there is this other thing, this JEPA, which is a main contribution of his 2022 paper. Main contribution. It's really explicitly mentioned there, and that was exactly what I just mentioned. The PMAX family, the predictability maximization family, of course, is a whole family of methods. And what is the essential thing there? Well, you have two terms. One is to try to minimize the prediction error, such that the thing that you want to predict becomes more predictable. But you don't want to... now it's called collapse. You don't want to let it collapse, in the sense that it becomes a trivial thing that everybody can predict from nothing. And for that purpose, there was an extra term, which was called the D term, which tried to convey as much information as possible about the input.
So there were two terms, and there was a little weight factor, an epsilon term, which was used to define the relative weighting of the wish to predict well and of the wish to have information about the input that is being predicted. And then we had a whole bunch of different types of methods for doing that. One of them was the Infomax method by Linsker. Another one was just an autoencoder. And another one was where the encoding was done by predictability minimization, the method that I mentioned earlier, where you have a compact code, which is optimal in a certain statistical sense, where the product of the probabilities of the code units equals the probability of the current input. And then we had all kinds of experiments, which is good, because the 2022 paper by LeCun had no experiments. Although later he said our experiments back then, 1992, when compute was a million times more expensive, were almost non-existent or something like that. He wrote that, although his own paper didn't have any experiments. Then there was the BYOL (Bootstrap Your Own Latent) paper by Michal Valko and team, where the epsilon was basically zero, which means that there was only the prediction term. That was 2020, I believe. I believe that was 2020. And then he had other tricks, another repertoire of tricks to make sure that the other thing still conveys information about the input, that it doesn't collapse, as it is now called. That it doesn't collapse. And if you ask Michal Valko of BYOL, then you will find that he completely agrees that the JEPA family is the PMAX family, which was published 30 years earlier, and that all these more recent things like Barlow Twins, Barlow Twins is a section of the PMAX paper, although Barlow Twins were published almost three decades later, apparently, and another.
Ravid Shwartz-Ziv: But why?
Juergen: Another popular regularizer is another section of the PMAX paper of 1992. So all of that clearly shows, and I can't even believe that anybody is debating that, it clearly shows that the PMAX family of 1992 is the JEPA, the so-called novel thing. However,
Ravid Shwartz-Ziv: But why, like, why didn't you try to push it?
Juergen: However, yeah, now, I have a report on that, it's very easy to find on the web, it's called Who Invented JEPA? And there you find a little note which says basically, we don't think that JEPA is a great idea or necessary for world models or anything like that. Let me see, I somehow have it here. Where does it say? I had a footnote on that. Yeah, so a disclaimer I wrote: while the cognoscenti agree that large language models are insufficient for AGI, JEPA is so too. We should know, we have had it for over three decades under the name PMAX. So you need a whole bunch of additional techniques for achieving AGI. And of course, as a scientist, I always have tried to improve the state of the art in a way that I found convincing. And JEPA by itself is completely insufficient to achieve AGI. Instead, you need other things, and that's what we have focused on, like certain types of meta-learning, where you learn the learning algorithm itself, certain ways of training your world models, certain ways of exploiting your world models for planning in a way that is not the stupid millisecond-by-millisecond planning. So, all kinds of things that I thought are much more important than JEPA. The only reason why recently I have focused on JEPA is because a guy came up and claimed he invented that and it was his main contribution, or one of his main contributions, in 2022, although it was an old hat by then. Yeah.
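A minimal sketch of the two-term objective Juergen describes earlier in this exchange: a prediction-error term, plus an information-preserving term weighted by epsilon so that the code does not collapse. Everything concrete here is an assumption for illustration, not the 1992 formulation: the function names, the tiny sigmoid layers, the noisy "context" view fed to the predictor, and especially the use of per-unit variance as a stand-in for the D term.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(x, weights):
    """One sigmoid layer; `weights` holds one row of input weights per output unit."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pmax_style_loss(inputs, contexts, enc_w, pred_w, eps):
    """Return (loss, prediction_error, d_term) over a batch.

    prediction_error: how badly the predictor guesses the encoder's code.
    d_term: ASSUMED stand-in for the D term (mean variance of each code unit);
            it is zero when the code has collapsed to a constant.
    eps: relative weight of the information term; eps = 0 leaves only prediction.
    """
    codes = [layer(x, enc_w) for x in inputs]
    guesses = [layer(c, pred_w) for c in contexts]
    n_units = len(enc_w)
    prediction_error = sum(
        (c - g) ** 2
        for code, guess in zip(codes, guesses)
        for c, g in zip(code, guess)
    ) / (len(codes) * n_units)
    d_term = sum(variance([code[j] for code in codes]) for j in range(n_units)) / n_units
    return prediction_error - eps * d_term, prediction_error, d_term


rng = random.Random(0)
inputs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
# Hypothetical second view of the same data: the input plus a little noise.
contexts = [[xi + 0.1 * rng.gauss(0.0, 1.0) for xi in x] for x in inputs]
zero_w = [[0.0] * 4 for _ in range(3)]  # collapsed encoder: every code unit is 0.5
rand_w = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]

collapsed = pmax_style_loss(inputs, contexts, zero_w, zero_w, eps=0.5)
spread = pmax_style_loss(inputs, contexts, rand_w, rand_w, eps=0.5)
prediction_only = pmax_style_loss(inputs, contexts, rand_w, rand_w, eps=0.0)
```

With all-zero weights the code is perfectly predictable and carries no information, so both terms are zero: that is the collapse the extra term is meant to penalize. Setting `eps=0.0` leaves only the prediction term, which is the regime Juergen attributes to BYOL.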
Ravid Shwartz-Ziv: So do you think you need a different set of skills to come up with an idea and to actually, you know, push it to SOTA empirical results? And what is more interesting for you?
Juergen: Yeah, yeah. Yeah, so actually I think what you shouldn't do is what PMAX and JEPA do, which is kind of get rid of some of the data because at some moment you think that this data is not relevant. No, 20 years ago, 2006, I had this paper about compression progress. And there I point out that if you can store the data, you should store all of it. If you can store a lifetime of observations, store all of it. If you can afford it, if storage is cheap enough, store all of it, because the holy data is the only thing that you will ever know about the world. It's the holy data. If you throw some of it away, then you rid yourself of the chance of finding a regularity, you know, 10 years later maybe, that you haven't found before, that you ignored before because you thought there is no regularity, but actually in retrospect, it turns out there is one and you can still learn it if you keep the data, if you store it all. And then what you really want to do is you want to predict all of it. So the world model wants to predict all of it. Now, of course, there are many unpredictable pixels. That's true. However, as the world model, the recurrent neural network or the combination of recurrent transformer or something, is going to predict it, it's going to develop all kinds of internal representations, hidden units, that are really, really informative. Maybe to predict the expected value of this pixel over here, not the pixel itself, but just its expected value, it's really useful to store the value of another pixel that you saw 1,000 steps ago. So you will have an internal representation that represents that. Why do you have that internal representation? Because it turns out it's useful for better predicting the expected value of this pixel. Yeah, there are so many things you cannot predict, but that's okay. Now what you have to do, what your controller has to do, it wants to access the algorithmic information in the world model, but not in the JEPA sense.
Not in the JEPA sense, no. It just wants to learn to address, given the current problem that it wants to solve, the important hidden units that are relevant to the current problem. So it wants to learn to inject prompts into the hidden units and something will come back. And these things that come back, they are answers and they are just number vectors, but that's okay. These number vectors, they convey algorithmic information that is contained somehow in all these billions of YouTube videos that the world model has seen. And now you want to exploit the algorithmic information of the world model. How do you do that in a simple way? Now, the 2015 way in the paper on learning to think, which describes a reinforcement learning prompt engineer, is that the controller has connections that belong to the controller, which lead into the hidden units of the model. And something is coming back. So that's where the thinking process sends back number vectors, thoughts, if you will. And then you have this back and forth. So you can now have sequences of prompts and answers and sequences of prompts and answers. And then you can have a little period where the system thinks about what it should do. And since it's a general purpose neural network, recurrent neural network or something similar, you can do anything that is computable. So any kind of analogy-based thinking where the incoming queries wake up some old partially relevant concept which is representing an analogy to the current thing that you want to solve, then this will wake up and then partially will get fed back into the controller. And it's the goal of the controller, through a sequence of tasks, to learn to send good prompts and to interpret the answers in a way that allows it to more quickly solve its current task. And the acid test is always: does the controller learn its new task faster from scratch,
Ravid Shwartz-Ziv: Yeah.
Juergen: by cutting off all the connections to the world model? Or is it easier to learn to send good questions into the world model and get something back and then with a few edits have a policy that leads to the goal that you currently want to solve? And so of course you give it lots of tasks like that and then over time it learns to address the important parts of the world model that you need to solve the current problem. And so here you see, the world model is not trying to do the JEPA thing. It's still trying to predict everything. This is good. But the internal representations are the only thing that counts, the abstract layer, so to speak. And you have to learn to use it. And all these ways of using the internal representations are arbitrarily complicated. These are computational processes. So since you don't know the problems in advance, you have to learn them.
Ravid Shwartz-Ziv: Yeah.
Juergen: You have to learn which parts of the world model are important under which circumstances. And that's what you really should do.
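The query-and-answer loop described above (a controller injecting prompts into the hidden units of a world model and reading number vectors back) can be sketched as a forward pass. This is only an illustration of the information flow under loud assumptions: the class names, sizes, random weights and tanh units are all hypothetical, the world model is a frozen random recurrent net rather than a trained predictor, and no learning is shown. The `connected` flag mimics cutting the connections, as in the acid test.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.5):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenWorldModel:
    """Stand-in for a trained recurrent world model; its weights never change here."""

    def __init__(self, hidden_size, rng):
        self.hidden_size = hidden_size
        self.recurrent = random_matrix(hidden_size, hidden_size, rng)

    def step(self, hidden, injected):
        # The controller's prompt is added directly to the hidden-unit inputs.
        pre = matvec(self.recurrent, hidden)
        return [math.tanh(p + q) for p, q in zip(pre, injected)]


class Controller:
    """Owns the connections into and out of the model's hidden units."""

    def __init__(self, state_size, model, rng, connected=True):
        self.model = model
        self.connected = connected
        self.query_weights = random_matrix(model.hidden_size, state_size, rng)
        self.read_weights = random_matrix(state_size, model.hidden_size, rng)

    def think(self, task, rounds):
        state = list(task)
        hidden = [0.0] * self.model.hidden_size
        answers = []
        for _ in range(rounds):
            if self.connected:
                prompt = matvec(self.query_weights, state)   # question sent in
                hidden = self.model.step(hidden, prompt)     # model "wakes up"
                answer = matvec(self.read_weights, hidden)   # number vector back
            else:
                answer = [0.0] * len(state)                  # connections cut
            answers.append(answer)
            state = [math.tanh(t + a) for t, a in zip(task, answer)]
        return state, answers


rng = random.Random(0)
model = FrozenWorldModel(hidden_size=8, rng=rng)
task = [1.0, -1.0, 0.5]
with_state, with_answers = Controller(3, model, rng).think(task, rounds=4)
cut_state, cut_answers = Controller(3, model, rng, connected=False).think(task, rounds=4)
```

In a real system the controller's two weight matrices would be trained by reinforcement learning across many tasks, and the comparison of interest is how quickly it learns with and without the connections; this sketch only shows where those connections sit.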
Ravid Shwartz-Ziv: What's a paper or idea from outside machine learning that shaped how you think about the world?
Juergen: Well, from outside of machine learning, of course, there are the limitations of physics. There's light speed, for example. And for someone who has talked since the 1970s about the expansion of AI colonizing the rest of the universe, this is a really important limitation, which is independent of AI. So it's just light speed. And as you're trying to acquire more and more mass of the universe to have more and more AI and more and more intelligence and more and more infrastructure and more and more AI ecosystem where all kinds of different AIs are competing and collaborating with each other, as this thing is expanding, it is totally limited by this light speed thing. And so it will take a couple of hundred thousand years, at least 100,000 years, until the galaxy is full of AI colonizers, but it will really take tens of billions of years until the entire visible cosmos is full. So no matter what is going to happen in the near future in terms of singularity and whatever, all of that is just the beginning of something that will take a long time.
Ravid Shwartz-Ziv: Okay, and the final question: do you have one concrete prediction about AI in the next 24 months? What will the field look like in 24 months? One thing that you can say, okay, this will be
Juergen: So I have a few little small predictions, which are currently in papers that we are writing up, but they are not cool enough to answer your question. Then your two-year horizon is difficult for me because I'm not sure whether it's going to happen within two years. I think in the near future, in not so many years, not so many decades, not so many years, we will have this new moment in AI where not only the AI behind the screen, but the AI in the real world and the physical world with the real robots is going to be as impressive as the thing that we have behind the screen. And what I keep pointing out, or have been pointing out for years, is this: for a long time, for hundreds of years actually, we have had people talking about self-replicating machinery. And it was total science fiction, but now it's getting less so, because once we have a robot that doesn't have to be super smart, just smart enough to operate all the tools and all the machines that are currently operated by humans... And at the moment we don't have a robot that can do that. We don't have a robot that can do what a plumber can do. But once we have such a robot, then we have a new step in the evolution of everything because we have a new kind of life, because a collection of machines and robots like that can make more of themselves. Just let that sink in. So if you have machinery like that, that can not only replicate itself, but where you also can use all the well-known concepts of software-based machine learning to improve this collection of machines, then you get a new kind of...
a new kind of autonomous system that is no longer limited the way current autonomous systems are, which are limited to the computers built by humans, and you will get a rapid expansion of such machinery that will leave the biosphere, because most of the matter and the energy in the solar system that you can use for building more robots and more infrastructure for AI and more compute and other kinds of infrastructure, most of that is not in the biosphere. The rest of the solar system gets about two billion times more energy than the little thing that hits the Earth. And we are using only one five thousandth of that actually. So the future is going to be self-replicating machinery and self-improving machinery in space. And then once that starts, within a short period of time, the economy of the solar system is going to be billions of times larger than the little thing that we currently have in the biosphere. But it's also going to disconnect.
Ravid Shwartz-Ziv: But you think you will see it? Do you think we will see it in two years?
Juergen: And it will start at the time when the first robots can learn to operate all the machines and tools that are currently operated by humans. Because once you have such a robot, then in collection, together with the other machines, they can make more of themselves. They can make more of the machines that extract the raw material from the ground and send it to the factories where the material gets refined and where you build microchips out of these materials and trucks and everything. And then it's going to grow by itself. So it will be the ultimate scaling machine. At the moment, the only thing we know how to scale is software. You write a little bit of software and make a billion copies. No robot has ever been produced which is available where you have more than 10,000 copies or something like that. But suddenly you will have billions and billions of robots that make more robots and stuff. And so that is going to change everything. That is the one thing that hasn't started yet and that's the one thing that is going to change everything. And suddenly it's not going to be important any longer how many people are in this country and how many experienced workers do we have in this country and so on. All of these are going to be irrelevant questions.
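The "about two billion times" figure mentioned above can be sanity-checked with simple geometry. This is a back-of-the-envelope sketch, assuming the Sun radiates equally in all directions and the Earth intercepts sunlight as a flat disc of its own radius; the constants are the standard mean Earth radius and the astronomical unit, and the function name is made up for illustration.

```python
import math

EARTH_RADIUS_M = 6.371e6                # mean Earth radius
EARTH_SUN_DISTANCE_M = 1.495978707e11   # one astronomical unit


def fraction_of_sunlight_hitting_earth(radius_m, distance_m):
    """Share of the Sun's total output intercepted by a planet-sized disc."""
    disc_area = math.pi * radius_m ** 2              # Earth's cross-section
    sphere_area = 4.0 * math.pi * distance_m ** 2    # sphere at Earth's orbit
    return disc_area / sphere_area


fraction = fraction_of_sunlight_hitting_earth(EARTH_RADIUS_M, EARTH_SUN_DISTANCE_M)
# Energy that misses the Earth, relative to the energy that hits it.
rest_to_earth_ratio = (1.0 - fraction) / fraction
print(f"{rest_to_earth_ratio:.1e}")  # roughly 2.2e+09
```

Under these assumptions the ratio comes out at a little over two billion, consistent with the number quoted in the conversation. The separate "one five thousandth" claim about human energy use is not checked here.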
Ravid Shwartz-Ziv: Okay, I think it's a great point to end. Do you have anything else that you want to add, to promote?
Juergen: Yeah, well, there are so many things that we only touched upon. Just one message, one little message, which for some reason is not widely known yet. Who invented deep learning?
Ravid Shwartz-Ziv: I know someone in the 50s.
Allen Roush: I know. I've heard the Mark 1 perceptron. Was that related or slightly after?
Juergen: Yeah, so deep learning is about deep neural networks with lots of layers. The first neural networks that had two layers, linear neural networks, they were called linear neural networks in the 90s. They go back 220 years. That's what Gauss and Legendre did. They had exactly the same architecture in 1795 to 1805. They had shallow learning with two-layer networks and the same weights and the same error function, mean squared error to minimize, and exactly the same thing that was called a linear neural network in the 90s. So that goes back 220 years. But deep learning is about deep networks with lots of nonlinear layers. And that started in 1965. Where? Deep networks that really learn, where you have hidden units that learn to, you know, form internal representations and then hand it up to the next level and so on. So that was invented in 1965 and published by Ivakhnenko and Lapa in Ukraine, of all places, in Ukraine, 1965. And they had, 1971, Ivakhnenko had a paper in English where you had an
Ravid Shwartz-Ziv: Jimmy?
Juergen: an eight-layer neural network with lots of nonlinearities in there. And it was modeled to predict the next token. What was the next token? It was a description of the British economy. So that was an example. Deep learning, 1965, 20 years before the connectionists kind of rediscovered that. And a cool thing: deep learning without biologically implausible backpropagation.
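The "linear neural network" attributed above to Gauss and Legendre is ordinary least squares: inputs connected to an output by weights, fitted by minimizing mean squared error. A minimal sketch for a single input plus a bias follows; the function names and the toy data are made up for illustration, and the closed-form solution is used instead of any iterative training.

```python
def fit_linear_unit(xs, ys):
    """Least-squares fit of y = weight * x + bias (closed form, one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    weight = cov_xy / var_x
    bias = mean_y - weight * mean_x
    return weight, bias


def mean_squared_error(xs, ys, weight, bias):
    """The error function being minimized."""
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated by y = 2x + 1, so the fit should recover it exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
weight, bias = fit_linear_unit(xs, ys)
error = mean_squared_error(xs, ys, weight, bias)
print(weight, bias, error)  # 2.0 1.0 0.0
```

The contrast Juergen draws is that this model has no hidden units and no nonlinearity, which is what the 1965 work added.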
Ravid Shwartz-Ziv: Did you read it? Did you read the paper?
Juergen: There is the English paper, 1971, by Ivakhnenko, very easy to find, and that's the one that you can read, which summarizes what they did before. That's in English, and everybody can understand.
Ravid Shwartz-Ziv: Okay, great. Yeah, everyone go read the foundations of deep learning.
Juergen: Yes, since we are at that, there is an easy-to-find report, Annotated History of Modern AI and Deep Learning, which I wrote in 2022 and updated in 2025. And there, all of that is compactly summarized. And you can easily find it in my AI blog or on arXiv. And in the AI blog, you will find all the other things like the neural network world model boom, a recent paper, and who invented JEPA, and all kinds of additional pointers and references.
Ravid Shwartz-Ziv: Great, we will put the link in the episode description. Jürgen, thank you so much for coming. It was a real pleasure for us.
Allen Roush: Yeah, it really was. We're deeply in your debt.
Juergen: Ravid and Allen, it was my pleasure. Thank you for having me.
Ravid Shwartz-Ziv: Thank you. Bye everyone.
Juergen: Bye bye.