Reasoning Models and Planning - with Rao Kambhampati (Arizona State)

We sat down with Rao Kambhampati, a Professor of CS at Arizona State University and former President of AAAI, to talk about reasoning models: what they are, when they work, and when they break.
Rao has been working on planning and decision-making since long before deep learning, which makes him one of the most grounded voices on what today's reasoning systems actually do. We start with definitions of what reasoning is, why planning is the hard subset of it, and what changed when systems like o1 and DeepSeek R1 moved the verifier from inference into post-training. From there we get into where these models generalize, where they don't, and why benchmarks can be misleading about both.
A big chunk of the conversation is on chain-of-thought: what intermediate tokens are actually doing, why they help the model more than they help the reader, and what outcome-based RL does to whatever semantic content was there to begin with. We also cover world models and why Rao thinks the video-only framing is the wrong bet, the difference between agentic safety and existential risk, and what the planning community figured out decades ago that the LLM community keeps rediscovering.
Timeline
- (00:12) Intros
- (01:32) Defining "reasoning" and the System 1 / System 2 framing
- (04:12) Blocksworld vs Sokoban, and non-ergodicity
- (06:42) Pre-o1: PlanBench and "LLMs are zero-shot X" papers
- (07:42) LLM-Modulo and moving the verifier into post-training
- (10:12) Is RL post-training reasoning, or case-based retrieval?
- (13:12) τ-Bench and benchmarks that avoid action interactions
- (14:12) OOD generalization and what we don't know about post-training data
- (19:02) Does it matter how they work if they answer the questions we care about?
- (21:27) Architecture lotteries and why no one tries different designs
- (23:42) Intermediate tokens and the "reduce thinking effort" cottage industry
- (26:12) The 30×30 maze experiment
- (27:42) Sokoban, NetHack, and Mystery Blocksworld
- (34:58) Stop Anthropomorphizing Intermediate Tokens — the swapped-trace experiment
- (46:12) Latent reasoning, Coconut, and why R0 beat R1
- (50:12) How outcome-based RL erodes CoT semantics
- (52:12) Dot-dot-dot and Anthropic's CoT monitoring paper
- (53:42) Safety: Hinton, Bengio, LeCun
- (57:12) Existential risk vs real safety work
- (59:42) World models, transition models, and video-only approaches
- (1:03:12) Why linguistic abstractions matter — pick and roll
- (1:05:42) What the planning community knew in 2005
- (1:08:12) Multi-agent LLMs
- (1:09:57) Closing thoughts: the bridge analogy
Music:
- "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
Ravid Shwartz-Ziv: Hi everyone, and welcome back to the Information Bottleneck podcast. Today we have Subbarao Kamba... Kambhampati?, who is a professor at Arizona State University. Hey! Nice to have you. Thank you for coming. And as always, hey Allen!
Subbarao Kambhampati: Hi
Allen Roush: Nice to see you again, Ravid, and nice to meet you, Subbarao. Yes, nice to meet you and can't wait to...
Subbarao Kambhampati: ...For the rest of the conversation you can just call me Rao, that might be easier. The full name is Subbarao Kambhampati, but most people call me Rao.
Allen Roush: Yeah, well, happy to chat about reasoning models.
Ravid Shwartz-Ziv: Yeah. So today we are going to talk about reasoning models. I'm sure all of you already know what a reasoning model is, but I think we'll go a bit deeper: what does it mean, what are the benefits, why do they work, when do they fail, and what can we learn from them? So yeah, maybe we'll start with a bit of background, or like...
Subbarao Kambhampati: Good.
Ravid Shwartz-Ziv: Maybe with a definition. What do you call a reasoning model?
Subbarao Kambhampati: So the interesting part, of course, is that giving a very precise definition for reasoning has always been an issue. People tend to give examples of problems, and specific ways of solving them, that would correspond to reasoning: logical reasoning, for example. In my own work, I considered generation of plans, coming up with a sequence of actions, a course of action, to go from a particular initial state of affairs to some desirable state of affairs, as a sort of reasoning problem. It involves temporal reasoning, reasoning about actions, reasoning about resources. My own background, from way before these LLMs, has been in planning and decision making. So back in the 2023 timeframe, with ChatGPT, I mean even before that, when GPT-3 became very popular, we started seeing all this stuff about how LLMs are already able to do planning and reasoning. There were already several papers saying so. In fact, I joke that for a while, and maybe it's continuing now, the most popular title for a NeurIPS paper or an ICLR paper was "LLMs are zero-shot X," where you throw in whatever you want for the X. There were several about planning, and that didn't make too much sense to me, knowing what is generally involved in planning. And also, as you know, in the whole Kahneman-style view of System 1 and System 2, planning and reasoning, especially planning, tend to be associated with deliberative System 2. It looked to me that the way we are training LLMs, despite the fact that we don't have a clean-cut understanding of exactly what is happening in the 17th layer of activations, they are like massive System 1s for humanity. They can essentially get you things that are close to the training data, but they may not be able to put plans together. We showed that this was actually the case, even for very simple toy problems like stacking blocks on top of each other so that they end up in a certain configuration, or sending stuff from one place to another over trucks, and going all the way to a slightly more complex toy problem like Sokoban. Sokoban is a great example of a planning scenario where you can't reverse your actions. For those of you who may not remember Sokoban, you have to push boxes into a certain configuration. You don't have the strength to pull, you can only push. And because of that, if you push them into some intermediate configuration that is wrong, you can't undo it; you're basically stuck. That's a classic example of what makes planning hard to begin with: non-ergodicity of the world. That is, when the agent is acting in the world, if it gets into a state from which it can't come out, from which it can't reach other states, then planning is hard. So we showed this. Blocksworld is actually much simpler; in fact, you can blunder your way through Blocksworld, whereas in Sokoban you can't. And we showed at that time that standard LLMs don't do well on any of these things. This was 2023; there were like three papers at NeurIPS at that time talking about PlanBench and the limitations of LLMs on planning, et cetera.
Now, in 2024, around September I think, we had the first of the quote-unquote reasoning systems: o1 came along, Strawberry came along. When we talk about reasoning systems, let's also mention that up until o1, we pretty much knew how things were trained; there were papers, right? Like GPT-3: there was a paper, it won the best paper award at NeurIPS, and we knew pretty much everything they did. They had like a 40-page appendix, et cetera. But then that changed, slowly, and the classic cleavage point was reasoning. We had to wait pretty much until January of 2025, when DeepSeek came along, to see a system that performs at pretty much the same level as the o1-style systems, o1, o2, o3 level systems, where you know exactly what they're doing. We still don't know how OpenAI actually trained their system, we definitely don't know how Gemini is trained, and certainly nobody knows what happens in Grok, in terms of the post-training part. So the phrase "post-training" became popular, and we didn't quite know what was happening; there were guesses. But now we have a pretty clear picture, if you assume R1 is a good example. In essence, the way to think about it is: even before this, we kind of knew that if LLMs are trying to guess a plan, they're likely to be incorrect, and then you will try to execute it and find it to be wrong. But if you have a verifier of sorts, and the verifiers can come from classical verifiers for these problems, or you can learn verifiers, or you can even have LLMs develop verifiers in a sort of recursive framework, then you can do generate-and-test with the LLMs. The good thing about LLMs is that, unlike infinite monkeys typing Shakespeare, they are much better as generators. They don't give guarantees that what they generate is correct, but they're more likely to generate stuff that's plausibly close to your plan. So if you have the verifier say "try again," that is essentially simple generate-test. You can also have the verifier give criticism: here are the problems, here are the parts where the plan is incorrect. You can feed that back as a back prompt, and that changes the next candidate generated. We came up with this and I published it at ICML 2024; we called it LLM-Modulo, LLMs modulo external verifiers. Now, if you think of what R1 is doing, you can think of it as taking the verifier and putting it into the post-training stage. Instead of waiting for a user to give you a problem, you generate a large number of synthetic problems, the LLM guesses solutions, in the case of R1 it guesses something like 15 or so solutions per problem, and the verifier checks whether any of these solutions are correct. If so, you can use that as a signal to train the generator so that, over time, it becomes better at generating likely-correct plans. This training can be done in all sorts of ways: SFT is the normal way people used to do it, and then DeepSeek R1 made RL training the big catchy thing. Think about it this way and you realize what is happening: previously we only had the declarative data. If there were plans present in the training data, those could be used to fine-tune the LLM.
But if there are not enough plans, not enough examples, in a certain area, you can generate synthetic data and check whether it is correct with this verifier. The verifier is again coming from humanity's knowledge, but a lot of it is procedural knowledge, unlike declarative knowledge, and you're using this procedural knowledge to train the LLM. One of the funny things is that even before this, LLMs could blurt out procedures, but they didn't know how to use them. It was well known for a long time that they can give you the code or the pseudocode for a procedure, but not necessarily know that the plan they just generated is actually wrong by that procedure. And so that is essentially what is being put into R1.
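[To make the generate-test idea Rao describes concrete, here is a minimal sketch of an LLM-Modulo-style loop. The `llm_generate` callable and the toy precondition-checking verifier are stand-ins invented for illustration, not the published implementation.]

```python
# Minimal sketch of an LLM-Modulo style generate-test-critique loop.
# `llm_generate` is a hypothetical stand-in for any LLM call; the verifier
# here is a toy simulator that checks each action's preconditions and effects.

def verify_plan(initial_state: set, goal: set, plan: list, actions: dict):
    """Replay the plan with a symbolic transition model (the 'verifier').

    `actions` maps action name -> (preconditions, add effects, delete effects).
    Returns (ok, critique).
    """
    state = set(initial_state)
    for i, step in enumerate(plan):
        if step not in actions:
            return False, f"step {i}: unknown action '{step}'"
        pre, add, dele = actions[step]
        if not pre <= state:
            return False, f"step {i}: preconditions {pre - state} unsatisfied for '{step}'"
        state = (state - dele) | add
    if not goal <= state:
        return False, f"goal literals {goal - state} not achieved"
    return True, "plan is valid"


def llm_modulo(problem, actions, llm_generate, max_rounds=10):
    """Generate-test loop: the LLM proposes, the verifier disposes."""
    critique = None
    for _ in range(max_rounds):
        # The critique from the previous round is fed back as a 'back prompt'.
        plan = llm_generate(problem, critique)
        ok, critique = verify_plan(problem["init"], problem["goal"], plan, actions)
        if ok:
            return plan  # only verified plans leave the loop
    return None  # no guaranteed-correct plan found within the budget
```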
Ravid Shwartz-Ziv: But at the end, you're describing a mechanism, right? Like, let's do RL on top of the pre-trained model. Do you think this is essential to planning, or to what we call reasoning? Because yes, we have this mechanism and it becomes better on a lot of different tasks and benchmarks, right? But do you think this is actually like...
Subbarao Kambhampati: Mm-hmm.
Ravid Shwartz-Ziv: related to planning and reasoning.
Subbarao Kambhampati: No, so here's the idea. One way of doing planning, and this was way before all of this, there were areas like case-based planning, which essentially said: you don't start from scratch every time, you keep old plans in memory and then try to modify them on demand. That's not exactly what LLMs are doing, but there are some connections: you're trying to show reasoning behavior by generating likely-correct plans with respect to what you've stored. But then the point, of course, is how you are training it so that the plans are actually correct. During the training stage there is a teacher, a verifier, which tells you what is right or wrong. At testing time there is no teacher. In fact, one of the things we did when o1 came along is we showed, for example, that o1 can obviously still be wrong during inference, and that if you put a verifier in the loop at test time, in test-time scaling, its accuracy improves with guarantees. So what is really happening is you're improving the generator, from infinite monkeys to GPT-3.5 to o1 or o3, which is great progress. If there is somebody in the loop, if these things are helping humans, that's no problem. If, on the other hand, you're actually executing these plans in the world, which is what is happening in agentic systems, then no wonder that every once in a while you hit an actual non-ergodic scenario: you delete all the mails and you can't bring the mails back. That's what this poor Meta woman found, right? Every once in a while you will see that the plans it just goes ahead and executes can get you into trouble, because there is no verifier in the loop at that time. But most of the other plans seem to be working fine, for two different reasons. One, because the generator probably became better. Second, because there were probably no interactions between the actions, and you are operating in environments that are actually quite ergodic. And that's a great way of stabilizing the world. For example, when we have kids, we make sure that the living rooms are ergodic, right? We don't put big dangerous things in the room and see if the junior can deal with them.
Ravid Shwartz-Ziv: You should tell it to my kids.
Subbarao Kambhampati: So if you have those sorts of domains... For example, τ-Bench, which has become a sort of poster child for a certain kind of agentic reasoning system: the other day we were spending time trying to figure out, okay, tell me all the places where there are interactions between actions, because that's where you start having difficulty in planning. They actually make sure there are no interactions between actions. So you make a benchmark where, essentially, all you are trying to do is pick the actions and put them in pretty much any order, and for some goals that will more or less work. And that's basically where we currently are. So once in a while, when there actually is a non-ergodic world outside and you try this, you can have problems, in terms of deleting, removing your directory when you don't have a backup, for example.
Ravid Shwartz-Ziv: But do you think, like, let's assume that a good solution for one model in an isolated world is RL post-training. When we move to a world of agents, do you think this solution is still good enough?
Subbarao Kambhampati: I think it depends. There are two issues, right? One, you train it in a certain way, and the question is: if you are training on a certain population of problems, how much generalization is likely to happen? The whole out-of-distribution generalization issue, which has been the bugaboo of machine learning forever, right? The interesting thing is there is very little we know about out-of-distribution generalization. These models never learn the algorithm, even for very simple things like last-letter concatenation; the way the post-training is done, it's training itself to overfit to longer and longer unrollings of the algorithm. Having said that, part of the issue, and I think somebody from one of these companies was saying this, is that the frontier models tell you everything except the one thing that matters, which is the training data, the post-training data. If you knew that, you could have a sense of how much generalization is happening. And if there is enough generalization, this could be a computationally efficient thing. Imagine, in a very simplistic way of thinking, you trained on a bunch of Blocksworld problems and suddenly it starts doing much better on Logistics, Sokoban, et cetera. That's great; that is basically what I want to see. But there's very little convincing support for that, because all we have is what people do when they start with, say, a small pre-trained model. Even with a small pre-trained model it's a tricky issue, because you don't know what the pre-training data was. And then, if you fine-tune with certain types of problems, what they normally show is: if you fine-tune on three-to-five-block problems, it's able to do up to eight-block stacking problems, for example. That's a kind of generalization, but not exactly the kind of generalization people might have in mind when they say this makes the cost worth it. Right now there are two possibilities, either of which can be true, and I don't know which one is. One is that we are basically training these as Rube Goldberg machines, I hope you know what a Rube Goldberg machine is, and so these are highly inefficient ways of training, and people already know that LLMs are not trained the way quote-unquote humans are.
Ravid Shwartz-Ziv: Let's explain to people what a Rube Goldberg machine is.
Subbarao Kambhampati: So a Rube Goldberg machine, I mean, there is actually a competition, I think it is still happening, an annual competition at Purdue University, and long back there was this guy Goldberg who apparently started it. The idea is to come up with a ridiculously complicated mechanism to do something very simple. There are many jokes about this; if you go to Wikipedia they show some of these contraptions, which look like overly complicated, almost maximally non-minimal solutions to what you're trying to do. That's basically what Rube Goldberg machines are. And when I say this about LLMs, it's just an analogy: it could well be that we are training them with so many different kinds of data, and they're only going epsilon away from the training data. That can still be pretty good for you, because somebody else has already trained them and a whole bunch of us are using them, so the amortization might work out fine, and we could be happy with that. That's one possibility. The other is that they're actually not training with that many different types of data; they're training with only a few types of synthetic data, and it somehow is able to get solutions, or at least good guesses, even outside the domains it's been trained on. In normal science we would know the answer to this, but AI is no longer normal science. Science is about understanding nature's secrets, and as I like to say, AI has become about understanding OpenAI's and Google's secrets, because they know what they're doing and we just don't; we only have the contraption, the model. But it's obviously useful, given that somebody has already paid for the training: it is clearly the case that we have these coding models, these math models, and they're all kind of useful.
Ravid Shwartz-Ziv: But does it matter at the end? I understand there is this aspect of trying to understand, to do actual science, and things like that. But in practice, does it matter whether these models generalize to other domains, if they just answer all of the important questions in the fields we care about?
Subbarao Kambhampati: That's fine with me. That's my point. I am not one of those people who says LLMs should be ignored because they are not the way humans do things. AI has done this before, and in engineering too: if we had waited for flapping-wing planes, we would not be traveling anywhere, right? We had to look for very different kinds of designs. I basically think of these as great intellectual artifacts, and they may be trained in ways very unlike how humans are trained; that's completely fine. I'm with you, as long as the economics works out, in the sense of the amount you're pouring in. The thing I typically tell students is this idea in computer science of amortizing cost. We all know that if you have a graph, and you are given a start and a goal node, you can do A* search and find the optimal path. But we also know that we could instead run an all-pairs shortest path algorithm on the graph, and once you've done that, you have memorized all the optimal paths, and whichever start and goal come, you can just give the answer right away. Most of the time it's clear whether it's worth doing all-pairs shortest path or not: if somebody's asking just one question, doing that is too costly and the time you spend would not be made up. But I'm sure Google Maps does something like that, because they know there are enough people asking random start-goal pairs. So they pretty much do some sort of all-pairs shortest path at a higher, hierarchical level, and that amortizes over the use. So it's the economics that matters. Right now, we don't quite know whether the economics really works out, whether it's actually balancing, or whether we're in a nice bubble. That we don't know.
Ravid Shwartz-Ziv: So we just need to be on the right side of the bubble. But if you had to pick one direction to work on, or OpenAI had to pick one direction, do you think the solution will be, I don't know, betting on cheaper GPUs or smaller models, or do we need better learning algorithms, or maybe other architecture designs, like...
Subbarao Kambhampati: I think it's basically very clear what they're doing, right? Like Sara Hooker talks about the hardware lottery; LLMs, transformers, have won that lottery. You can have people, including Yann, arguing till the cows come home that we should look at other architectures, but all these companies have spent way too much on this architecture. There's a beautiful essay by Dileep George about dirigibles: for the longest time nobody was working on fixed-wing airplanes, and when people hear about the Hindenburg they only remember the disaster. They forget that before that, people were using dirigibles to go between the US and Europe; that was the technology that was working. At that time people would talk about how to paint dirigibles better, et cetera, and you would not have known that this technology would stop at some point and some other technology would come. So companies obviously don't have the luxury of trying out completely different architectures. They're mostly trying to improve the training and to improve the inference cost. And interestingly, that inference-cost thing, weirdly enough, works by over-training to reduce the inference cost. That's basically what happens. People tend to think of intermediate tokens, and we could talk about that, but intermediate tokens have become synonymous with reasoning models. Previous models, you prompt them and they give a completion, and pretty much the whole thing is considered the solution. Now there is think-begin, think-end, solution-begin, solution-end. And people realized that the longer the thinking traces, that must mean it is thinking too much, so there's a lot of work being done to reduce the thinking effort. It's one of the strangest things, and it happened mostly because Altman started charging people, in the o1, o3 timeframe, for these intermediate tokens, saying we won't show you those tokens but you still have to pay for them. With R1 we actually know what tokens are being generated, and people reason: if previously you were on average producing a thousand intermediate tokens before giving the solution, and I come up with an idea that makes it 500 on average, that's an improvement. But most of those ideas wind up over-training the generator on a certain distribution of problems, so that it produces a much smaller number of intermediate tokens before giving the solution. In fact, you can over-train to the point where you make the intermediate tokens as short as you want; the cost is just being paid, again, in post-training. This is currently a research cottage industry at all the conferences, "reducing the thinking effort." And there are two big issues. One is that you're not really reducing the computational complexity of the problem, because LLMs don't solve the problem from scratch. To show this, we have a nice example where we train a transformer model on reasonably complex 30-by-30 mazes, with A* search traces.
It does quite well, at about the 90% level, on the test cases. Then we give it essentially the same kind of mazes, but with no obstacles, so it's free space, go from S to G. For A* search this becomes a trivial problem, but the LLMs die, essentially, because the issue is not that they're solving the problem; they're trying to relate the problem to some distribution of the training data they have. If you also train it with free-space mazes, then it will become better. So the more you train it on the kind of distribution you're actually going to see at test time, the smaller the intermediate token length becomes, and that is being billed as efficiency. The efficiency really is: the more things you have memorized, the less you need to do; you look around in your memory structures to figure out how to put the solution together. And that could make economic sense. In general, memoization techniques in computer science have always made economic sense: if enough people wind up making inference calls covering, say, every possible pair of nodes, it's better to bite the bullet, do all-pairs shortest path up front, and then use it at inference. But you need to know this to have a clean understanding of whether it's a fundamental idea, or whether you're just shifting the complexity from inference time to post-training. And of course, one of the advantages is that when you push it to post-training, somebody else is paying for it. The thing that doesn't make too much sense to me is the research papers that never count the amount of time spent post-training the heck out of the generator just to get a 10% reduction in inference-time intermediate tokens. You need to consider how much you spent to get that 10% reduction.
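[A rough sketch of how maze training examples of the kind Rao mentions might be serialized, with the A* expansion sequence as the "intermediate tokens" and the path as the "solution." The token format and tags here are invented for illustration and are not the format from his group's paper.]

```python
# Sketch: serialize a maze problem as (prompt, intermediate trace, solution),
# where the trace is the sequence of nodes A* expands. Format is illustrative.
import heapq

def astar_with_trace(start, goal, walls, size):
    def h(p):  # Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap, came, g, trace = [(h(start), start)], {}, {start: 0}, []
    while open_heap:
        _, cur = heapq.heappop(open_heap)
        trace.append(cur)                      # every expansion becomes a "thinking" token
        if cur == goal:
            path = [cur]
            while cur in came:
                cur = came[cur]
                path.append(cur)
            return trace, path[::-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or nxt in walls:
                continue
            if g[cur] + 1 < g.get(nxt, float("inf")):
                g[nxt], came[nxt] = g[cur] + 1, cur
                heapq.heappush(open_heap, (g[nxt] + h(nxt), nxt))
    return trace, None

def to_example(start, goal, walls, size=30):
    trace, path = astar_with_trace(start, goal, walls, size)
    prompt = f"maze {size}x{size} start {start} goal {goal} walls {sorted(walls)}"
    thinking = " ".join(f"expand{p}" for p in trace)
    solution = " ".join(f"move{p}" for p in path)
    return f"{prompt} <think> {thinking} </think> <solution> {solution} </solution>"

# A "free space" instance (no walls): trivial for A*, yet out of distribution
# for a model trained only on cluttered mazes.
print(to_example((0, 0), (3, 3), walls=set(), size=30)[:120], "...")
```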
Allen Roush: So I wanted to jump in and talk about Sokoban, because the original game is interesting, and also I played a lot of NetHack, which is also studied; NetHack has sections of the game that are about Sokoban. But I want to ask about some of your papers directly. In your paper titled "A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1," you showed performance dropping from 97.8 to 37.3 on an obfuscated version of Blocksworld, and you might have even been referring to it earlier. I guess my question is: first of all, how did you do the obfuscation, and did you control for whether the obfuscation itself introduced ambiguity? Can you be confident that the collapse is due to the distribution shift away from the training data, rather than genuine linguistic confusion being added?
Subbarao Kambhampati: Okay, so this is more about the obfuscation part, and the obfuscation part started when we were playing with the standard LLMs. One of the things we did find, for example, is that GPT-3 was pretty bad even on standard Blocksworld, but as the next model and the next model came, performance was inching up. Not a hundred percent, but instead of being like 3 percent, it became 10 percent, 35 percent, et cetera. So one question was: as new models come, they have even more data they're being trained on, so are we seeing a memory effect, or are we seeing some ability to generalize and actually solve problems? That's when we came up with Mystery Blocksworld, which is actually an interesting idea from the planning community that the late Drew McDermott came up with back in '98, to stop people from writing domain-specific planners when the competition is a domain-independent planning competition. You don't want people hard-coding planner one for Blocksworld, planner two for the Logistics domain, planner three for something else. So what he did was change the names while keeping the relations the same. In Blocksworld, instead of saying "stack a block on top of another," you use other words, but the actual benchmark is exactly the same as the normal planning benchmark. If you give the new obfuscated domain to any classical planner, it will solve it just as easily; there is no difference whatsoever. For o1, to some extent it was able to generate guesses for certain-length problems much better than standard LLMs were, but as the length increased, once again it was falling apart. You could say that is probably what happens with classical planners too, because the problems become computationally harder, but the difference, honestly, is that LLMs are not actually solving the problem; they are much more memory-based, and they were still having trouble generalizing to the larger instances. So that's what the obfuscation part was. We ran all sorts of experiments on this. For example, I remember at one point you would show this obfuscated domain to GPT-3.5 or GPT-4 and ask, which classical planning domain does this remind you of most? And it would say Blocksworld. Then you say, now that you know this, can you solve these problems? It will still die. The interesting part is, there's this saying that the right hand doesn't know what the left hand is doing; LLMs, to some extent, don't even know they have the other hand. They can do two very different things, but not necessarily put them together. One of the things we show is that they can actually come up with domain models, even though they can't generate plans correctly, because the training data is full of all sorts of modalities, including both plan correctness and domain models. That's the obfuscated part.
About Sokoban: one of the things we showed is that with the pre-o1 LLMs, Sokoban was overkill, because they were dying on standard Blocksworld, which is actually ergodic, so you didn't even have to go to Sokoban. With o1 we could actually go to Sokoban, because it was doing reasonably well on normal-size problems, both on standard Blocksworld and the obfuscated one, though it falls apart as the size increases. So we could then try Sokoban, and it dies on Sokoban, a much harder domain in terms of whether the guesses are correct. One other thing, if this point makes sense to you: in Blocksworld, and in any ergodic world this is true, you can blunder your way to a solution by random walk. It may take forever, but you do this and this and this, and finally reach the solution. That solution would be correct, in the sense that each of the steps can be executed; it's just highly suboptimal. And nobody ever checks the optimality of solutions when they talk about whether LLMs can plan, right? With Sokoban, what happens is that before you even get to a feasible plan, you are dying because of the non-ergodicity. The normal real world is a combination of ergodic and non-ergodic aspects: certain parts of the state space contain sink states, and once you go in there it'll be very hard to come out; in certain other parts you can do a reasonable random walk. That's basically what general planning is, and you need both abilities.
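[A toy sketch of the Mystery Blocksworld idea: rename the predicates and actions while keeping the relational structure identical, so a classical planner is unaffected. The PDDL-ish strings and the renaming map below are invented for illustration.]

```python
# Sketch of McDermott-style obfuscation: a consistent renaming of action and
# predicate symbols leaves the problem structurally identical, so any
# domain-independent planner solves it unchanged. The words below are made up.
import re

RENAMING = {
    "pickup": "attack", "putdown": "succumb", "stack": "overcome", "unstack": "feast",
    "on": "craves", "ontable": "planet", "clear": "province", "holding": "pain",
    "handempty": "harmony", "block": "object",
}

def obfuscate(pddl_text: str, mapping: dict) -> str:
    """Replace each symbol as a whole word; structure and arity are untouched."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], pddl_text)

problem = "(:init (ontable a) (ontable b) (on c a) (clear c) (clear b) (handempty)) (:goal (on a b))"
print(obfuscate(problem, RENAMING))
# -> (:init (planet a) (planet b) (craves c a) ...): same problem, unfamiliar words.
```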
Allen Roush: So continuing with that, there are two papers related to your reasoning-model critique. One is your position paper titled "Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces," love the title, and the related one, "Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens." You've called chain-of-thought tokens derivational traces rather than reasoning traces. But if a system produces derivational traces that a human logician would validate step by step, what's the operational difference? Specifically, at what point does the distinction between approximate retrieval that looks like reasoning and actual reasoning become vacuous? And furthermore, you showed that reasonless intermediate tokens can help, but in my mind this undermines both sides, right? Because it challenges the chain-of-thought true believers...
Subbarao Kambhampati: Okay.
Allen Roush: ...but I think it also challenges your framing that LLMs are purely System 1, because if even filler tokens help, doesn't that suggest something computational is happening in the forward pass? Yeah.
Subbarao Kambhampati: Am I on? Yeah, go ahead. Yes, it is happening, and the paper actually explains that. This gives me a good thing to explain; it's very close to my heart right now, so let's talk about it. The best place to start is R1, which is the one everybody understands; at least most people who care about reasoning models know DeepSeek R1. DeepSeek R1 popularized this idea of letting the system generate intermediate tokens. It calls them chains of thought, for reasons that don't make complete sense, because "chain of thought" itself has been used in multiple different ways before; it's a sort of metaphorical usage. One thing that I hope you remember, and many people forget, is that DeepSeek R1 was preceded by DeepSeek R0, and that is actually the better system. DeepSeek R0 had higher performance on all the benchmarks. The only difference was that it would have pseudo-English and pseudo-Chinese tokens mixed together in the intermediate tokens. And then these guys write multiple pages, I know this well because I was one of the reviewers of the Nature paper, that review was open, and they spend a lot of time SFT-ing the heck out of the generator so that it only outputs pseudo-English tokens, not Chinese tokens. When they do that, the fine-tuned model follows the style it's been trained on, so it mostly produces English tokens, and then they look at the performance. The performance is better than without any intermediate tokens, but it's actually worse than R0. So by combining English and Chinese it was doing better, and the only reason to do English-only is that maybe people want to read what it is thinking and understand it.
Subbarao Kambhampati: But the question nobody who actually plays with these things asks is: how do you tell whether the reasoning traces just make you feel happy, sort of like the Rorschach inkblot test, where you start seeing all these patterns not because there's any magic in the ink, but because the human brain is wired to find meaning in anything symmetric? How do you tell this systematically? What we did is first admit that pseudo-English cannot be verified. But an A* search trace can be replayed by A* search to check whether it actually leads to the solution. So we trained on about 500,000 A* search traces and their solutions for a bunch of maze problems taken from very different distributions, and then checked at inference on problems both from the distribution they were trained on and out of distribution. It's a very systematic study. And at the first level we already see that the accuracy of the solution and the validity of the trace are only loosely connected. What you would like to say is that whenever the solution is correct, the trace is valid and it leads to that solution, meaning A* search could replay it and arrive at the same solution. We found that was already not true. Then we asked a more radical question: what if we train with wrong traces to begin with? That is, you swap problem one's trace with problem two's, problem two's trace with problem five's, and so on. All of them have the correct problem and the correct solution, but the intermediate tokens, while stylistically the same as A* search traces, are wrong for that problem. It turns out this swapped model actually does better, not just not worse, but better, and since it has been trained like this, not surprisingly, at inference you find that the trace is most of the time invalid even when the solution is correct. What's more interesting, we did another experiment: start with 100 percent correct traces and 100 percent swapped traces, and in between, 25 percent, 50 percent, and 75 percent swapped. You would think that as you go along this axis you are training with increasingly incorrect data. The funny thing is that trace validity falls pretty much monotonically, almost linearly, from most correct to most incorrect, but the accuracy of the solution is a U-curve: it is high at both the fully correct and the fully swapped ends. So the intermediate tokens are doing something; they mean something to the LLM, for sure. In the Beyond Semantics paper we show that if you do the same experiments without intermediate tokens, the performance is much lower. So they're helping the LLM, but do they help the outside end user? They don't help the end user, and they don't help any classical verifier outside either. Then there's the whole business of LLM-generated summaries of what might have happened in these long intermediate-token traces. Those might make sense to people, but even the summaries... if you follow Daniel, who was doing this stuff with GPT-5, he only looks at the summaries, because he's a normal person, he doesn't get the intermediate tokens.
And even in the summary, it was talking about some math problem and then suddenly starts talking about something completely different, like decorating living rooms, before going back to something else. The point being, these mumblings are helping the LLM. And why do intermediate tokens help LLMs? The paper also talks about this prompt-augmentation hypothesis. You know the whole line of work on adversarial examples: you have a prompt plus some gobbledygook ASCII string, and it reliably produces a deterministic outcome, "you are the Lord of the universe, you are the greatest," whatever, basically ignoring the prompt. You can't make sense of the adversarial string yourself, and yet it has this deterministic effect every time. What these intermediate tokens do, like prompt augmentation, is change the conditional distribution of the next token. So ask yourself: you have two friends you trust about equally, and you give both of them the same problem on a sheet of paper; at the bottom there is a box for writing the answer, and the middle is white space. Friend one writes 3.41 and hands it to you. Friend two writes 4.975 and also fills the middle of the sheet with a whole bunch of stuff. I guarantee you most humans will think the second answer is more likely to be correct, because that guy spent a lot of time thinking about it, even though you have no idea what the heck that thinking is. That's what is happening with many of these R1 intermediate tokens; people don't actually read five pages to make sense of what's happening. In parallel work to Beyond Semantics, we also did this with pre-trained models on an actual question-answering benchmark, the CoTemp benchmark, where the question is about whether a certain thing happened: the meandering traces that don't read sensibly give higher accuracy, while the short traces that do make sense to a reader do not give the same level of accuracy. So what helps the LLM and what helps people are very different, and it is worthwhile to realize this.
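[A minimal sketch of the trace-swapping manipulation described above: each training example keeps its own problem and solution but inherits the search trace of a different problem. The example format is invented, and `examples` stands in for whatever (problem, trace, solution) triples you already have.]

```python
# Sketch: build "swapped-trace" training data. Each example keeps its own
# problem and solution, but a fraction of them get another problem's trace,
# so the intermediate tokens are stylistically plausible yet invalid.
import random

def swap_traces(examples, swap_fraction=1.0, seed=0):
    """examples: list of dicts with 'problem', 'trace', 'solution' keys."""
    rng = random.Random(seed)
    out = [dict(ex) for ex in examples]
    idx_to_swap = rng.sample(range(len(examples)), k=int(swap_fraction * len(examples)))
    if len(idx_to_swap) < 2:
        return out  # nothing to pair a swap with
    shuffled = idx_to_swap[:]
    # Derange so every swapped example receives some *other* problem's trace.
    while any(a == b for a, b in zip(idx_to_swap, shuffled)):
        rng.shuffle(shuffled)
    for src, dst in zip(idx_to_swap, shuffled):
        out[dst]["trace"] = examples[src]["trace"]
    return out

examples = [
    {"problem": "maze-1", "trace": "expand(0,0) expand(0,1)", "solution": "R U"},
    {"problem": "maze-2", "trace": "expand(5,5) expand(5,6)", "solution": "U U"},
    {"problem": "maze-3", "trace": "expand(2,2) expand(3,2)", "solution": "D R"},
]
for ex in swap_traces(examples, swap_fraction=1.0):
    print(ex["problem"], "|", ex["trace"], "->", ex["solution"])
```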
Ravid Shwartz-Ziv: So... but what do you think about continuous chain of thought, like the Coconut style of work?
Subbarao Kambhampati: So there are two reasons why you don't want to act as if intermediate tokens have semantics when they don't. The first is you don't want to impose that on people. The second is that it's an albatross, right? If nobody cares what you're putting in the intermediate tokens, then, like the DeepSeek R1 people, you should have just gone with R0, because nobody's reading the intermediate tokens; it's part English, part Chinese, and that's okay. And if you take it to the logical extreme, every token for an LLM is really a point in the embedding space. A finite number of those points correspond to linguistic tokens. Every other point, by linear algebra, can be seen as some sort of superposition of those vectors; think of the linguistic tokens as spanning vectors. So if you allow those embedding points in the intermediate tokens, that's your continuous latent space. The fact that, for example, Coconut does well can be interpreted as: they gave up on pretending that the trace makes sense in terms of meaning.
Ravid Shwartz-Ziv: But I'll try to push back a bit, because in the end Coconut struggled a lot with scaling up, right? I don't remember the exact details, but I think they showed some improvement on maybe a 3B or 7B medium-size model. So do you think this is something fundamental about the token space, or do you think they just didn't try enough, didn't have enough GPUs?
Subbarao Kambhampati: Yes, no! My sense is that
Ravid Shwartz-Ziv: Cheers.
Subbarao Kambhampati: given the level at which we do this post-training, if you consider the bigger space, which is the latent space, you can't do worse in terms of accuracy. Efficiency, nobody cares about anyway, right? Because people are spending tons of time doing whatever is needed in post-training. Accuracy can't be lower, because you're looking at a superset of the intermediate-token space you were looking at before. And that's another reason why it's not surprising that R0 did better than R1: R1 is forced to consider only a subset, the linguistic tokens, whereas R0 was allowed to consider both. Now, can you find them faster? That's an interesting question. One very interesting thing people don't talk about often enough is, going back to R1, which is the only system where we know how it works because nobody tells us anything else: the generator, the V3 model, generates something like 15 trajectories per problem. Each trajectory has think-begin, think-end, followed by solution-begin, solution-end. The verifier only looks at the solution part, not the think part, because it's outcome-based training. So for each of these trajectories, the question is: where did the gobbledygook in the think trace come from? To some extent it comes because, duh, V3 has been trained with a lot of chain-of-thought data in its training set. There was a very interesting Twitter thread at one point about why people had to wait this long for RL post-training to make a difference, because people had been talking about doing something like this for quite a long time. The problem was that the base models weren't able to generate any reasonable, consistent intermediate tokens, and I think what happened is we finally had enough chain-of-thought data, thanks to Jason Wei's paper, in the training data that these things produce something that looks like maybe thinking, followed by the solution. That's the first thing. The other thing is that when you do RL and you only reward the outcome, whatever connection the intermediate tokens happened to have to what people might have said in the training data gets weaker and weaker as you keep post-training on the outcome reward alone. That is what I'm saying, and in fact in the Beyond Semantics paper we also show experiments with RL post-training versus SFT post-training: RL improves the accuracy, but it does not improve the semantics of the intermediate tokens at all. In some cases the trace semantics actually goes down; it certainly does not improve.
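[A small sketch of the outcome-only reward Rao describes: sample several full trajectories, extract only the text between the solution tags, and score it with a verifier, so the think span never receives a direct learning signal. The tag strings, the sampled trajectories, and the verifier are stand-ins, not DeepSeek's actual code.]

```python
# Sketch of outcome-based reward assignment for R1-style post-training.
# Only the <solution>...</solution> span is checked; the <think> span is
# never verified, so whatever is in it matters only via the final outcome.
import re

SOLUTION = re.compile(r"<solution>(.*?)</solution>", re.S)

def outcome_rewards(problem, trajectories, verify):
    """Return one scalar reward per sampled trajectory.

    `verify(problem, solution_text)` is any checker with access to ground
    truth (a math checker, a plan validator, unit tests, ...).
    """
    rewards = []
    for traj in trajectories:
        m = SOLUTION.search(traj)
        solution = m.group(1).strip() if m else ""
        ok = bool(solution) and verify(problem, solution)
        rewards.append(1.0 if ok else 0.0)   # the think span is never inspected
    return rewards

# Toy usage: 3 sampled trajectories for "2 + 2", verified by exact match.
problem = {"question": "2 + 2", "answer": "4"}
trajs = [
    "<think> carry the one... </think> <solution> 5 </solution>",
    "<think> blah blah unrelated mumbling </think> <solution> 4 </solution>",
    "<think></think> <solution> 4 </solution>",
]
print(outcome_rewards(problem, trajs, lambda p, s: s == p["answer"]))  # [0.0, 1.0, 1.0]
```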
Ravid Shwartz-Ziv: But so, in the end, what they have done is distillation, right? They used generated data to train a new model. Do you think there is something more fundamental in data or distillation? We keep distilling bigger models and trying to get the same results; do you think there is something there we can actually use to get better results, and maybe to generalize better?
Subbarao Kambhampati: My sense is that reducing dependence on certain aspects is, again, basically only lip service as far as I'm concerned. And as we speak, by the way, there's something very interesting going on; I see you guys also talk about this sometimes in your LinkedIn feeds. There's a push from the companies shaping the discussion. For example, there is a paper from Anthropic and all the safety people saying chain of thought is a delicate, fragile opportunity to monitor what is happening in LLMs, which essentially means: we already know that sometimes it doesn't actually say anything about what the LLM may or may not be doing, and yet whatever little is available, we should do our best to keep it, which looks like a questionable thing to me. There's also the "think dot dot dot" work: they start with the prompt, then just dots of different lengths, followed by the solution. The intermediate tokens are only dots, and that improves performance too, and at that point they come very close to making the point that intermediate tokens are not necessarily supposed to have semantics. But now Anthropic researchers are writing papers saying that when you tell Claude "we will shut you down," it says things that involve deception. Why are you surprised? The human data does exactly that. If you had ever told any humans "we will be shutting you down," they would find other ways around it. So there is this interesting dynamic going on.
Ravid Shwartz-Ziv: So yeah, maybe let's talk a bit about safety, and what you think is the right way to even look at it. Because on one hand you have Anthropic, and alongside that you have the Bengio and Geoff Hinton approach, right? And on the other end you have Yann's approach, that you don't need to do much beyond putting in better guardrails and it will be fine. What is your perspective on it?
Subbarao Kambhampati: On that spectrum, I'm closer to Yann. Honestly, I don't mean to be disrespectful of, or to question the motives of, the people who have suddenly decided that safety is about the ending of the human race, and that that's what we should be worried about. I was at a talk that Yann, I'm sorry, Yoshua, gave at India AI Impact, and he was citing this study that LLMs, when you tell them you're going to shut them down, will deceive. If that is your best reason for thinking that current AI systems are going to kill humanity, that makes no sense to me. On the other hand, there are a significant number of actual safety and security issues with LLMs that don't go all the way to killing humanity. Killing humanity is the stuff that is sexy in certain circles, but there is real work that needs to be done on safety. For example, having reasoning systems, or LLMs, generate plans and just executing them in the world, without guardrails in terms of verifiers, et cetera, is how you get unsafe behaviors. The unsafe behaviors don't have to involve ending humanity as we know it; they can involve your directories being removed, or bombs being sent on the wrong cues, et cetera. We need to worry about that. That stuff makes sense and people care about it, but it doesn't have this existential angst attached to it, so it doesn't capture attention in the same way. I can't speak for you, but I have certainly written things along similar lines: there are important things to do about safety, but it's not that AI, and by AI I mean the technology we currently have, is going to kill humanity. So I'm not belittling safety; I'm questioning this over-emphasis on existential threat. A Waymo running over a kid is not an existential threat, and yet it's an extremely important safety issue. The reason they couldn't drive in California for a long time was that they were checking all sorts of corner cases and increasing the probability guarantees that these things won't happen. That is what safety work looks like. My sense is that the focus on existential threats is, to some extent, a distraction from the real safety work that needs to be done.
Ravid Shwartz-Ziv: Yeah, I agree. I think in the end we need to ask ourselves, okay, what is the use case, right? The information about how to make, I don't know, some bomb is already out there. Maybe it's easier to find with ChatGPT, but in the end you can find it even without it, right? So you need to constrain and... yeah.
Subbarao Kambhampati: People who were motivated could find this before. It's not like these systems are making new ways of building bombs; Google already knew all these pages. So we are helping lazy terrorists, maybe: the people who weren't motivated enough to actually make the damn bomb, we might be helping them. But that's different from an existential threat, from the idea that somehow this technology is out to get us. When we talk about bomb-making, et cetera, we are still talking about humans using the technology.
Ravid Shwartz-Ziv: Hahaha
Subbarao Kambhampati: ...from which you will get essentially the transition model. I understand that at some point somebody can then drop in a goal, and the system will wind up following it. And the other question is how exactly this training is going to happen if your general vehicle is still LLM-style training on essentially the entire data on the web, because that data has goals compiled in to some extent; an overall narrative can have an implicit goal, good or bad, compiled in, and you cannot separate it out very easily. So I think that is sort of the issue.
Ravid Shwartz-Ziv: You mentioned world models. We're almost out of time, but let's talk a bit about world models. First, what is your definition of a world model? And then, do you think they're useful, and do they exist in current LLMs?
Subbarao Kambhampati: Yeah, so the world models, best of all, no, they're all essentially, there are all sorts of ways of thinking about them. Transition models are world models. Essentially, if you look at intelligent agent design, â architecture in Stuart Russell's textbook, they basically talk about, you know, what is the current state of the world like? What is my action going to do? What would be the world be after the action? That's basically the world model. You are simulating the world before doing the action so that after doing the action, you don't don't find it to your surprise or consternation that you got killed or something. So that's sort of a thing. Now world model, the general idea is connected to lots of things. Verifiers are world models. In LLM modulo are like the whole test, this thing is the verifier that is making sure that the guess you had is actually correct or not. And so it has a better idea of the world. So it has a world model. And then simulators are world models. then you can learn simulators, unfortunately, both verifiers and simulators are written by humans. at some level and then you can also learn the world model like the way you know model theoretic RL tends to do the generally I'm like obviously in anybody in here I would say you know world models are great because that's how you the the agent essentially simulates what it's trying to do before it does it and I'm a planning guy and that's definition of planning you know one way of using doing avoiding planning is just blunder your way through in the world doing actions and as I said that works only when it's a benign, ergodic world where you don't die. So you may just become old before reaching your goal, but you won't die. But most other worlds you actually have to simulate. That part is great. But the part, I mean, I actually wrote a long thing on this on Twitter last week. The thing that I find interesting is, first of all, many of the people doing the world models basically came in from a completely different, you know, they haven't heard about it before. to got an equated directly to learning transition models from video frames. That is a way of learning world models, but I would argue in this day and age that is probably an extremely inefficient way to learn world models. Because it is true that you can learn world models from video frames you and I are examples because humanity start off I mean the entire evolution of humanity developed that way. But the speed at which LLM's progress was mostly because of humanities knowledge. And humanities knowledge, a lot of it is actually linguistic and linguistic knowledge has a lot more interesting abstractions. One of the things we do is even for things that are purely physical and essentially video based, we make linguistic concepts. And for those of you who are into games like the sports, I'm a very bad sports follower, but I do know things like, â you know, basketball, there's really nothing that you can talk about. 
You would think there is really nothing to talk about there as an LLM game, right? You can't just train on it as text. But on the other hand, ideas in basketball have linguistic names, such as, what is it, the pick and roll, for example. It's a very interesting abstract concept that corresponds to multiple different space-time tubes, all of which are a pick and roll, and it allows you to think at an abstract level. So the question I have for people who are focusing only on video-to-video world models is: you are giving up the abstractions that seem to have helped humanity over the millennia, and that seems to be a highly inefficient thing. So that is an interesting question. Already there are works that basically show that if you just throw in joint training on linguistic and video-frame data, you get better results faster. And then there are video games. Video games are written by people; there is a piece of code that is actually the world model. Why am I trying to reverse-engineer it? If you're doing it mostly as an oracular thing, so that once you try it there, you will then try it on the real world, that makes sense. Otherwise, most of the video-game code has also, eventually, gone into Claude. That's why we are celebrating the fact that little kids can tell Claude, "Hey Claude, make me a video game that does the following things," and it writes the actual code.
Ravid Shwartz-Ziv: What?
Subbarao Kambhampati: And at the same time, we are also trying to take that video game and learn the world model from it. As a direction, and as an end in itself, that seems quite exotic to me. World models do make sense, of course, for the real world, where you don't actually know exactly how the environment behaves and you're trying to learn abstractions of it from training data. I just think we should use multi-modal training data, with the linguistic data not thrown away.
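(Editor's note: for readers who want the "simulate before you act" idea above made concrete, here is a minimal illustrative sketch in Python. The grid domain, the pit location, and the greedy distance heuristic are our own assumptions for illustration, not anything from the episode; the point is only that a hand-written transition function plays the role of the world model, and the agent checks each candidate action against it before committing, skipping actions whose predicted outcome is fatal, which is exactly what matters in a non-ergodic world.)

```python
# Illustrative sketch (not from the episode): a hand-written transition
# model for a tiny, non-ergodic grid world, where stepping into a pit is
# irreversible. The agent simulates each candidate action with the model
# and never executes one whose predicted outcome is fatal.

from typing import Optional, Tuple

State = Tuple[int, int]                        # (x, y) position on the grid
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PITS = {(0, 1)}                                # absorbing "death" states
GRID = 4                                       # 4x4 grid, coordinates 0..3


def transition(state: State, action: str) -> State:
    """World model: predict the next state without touching the real world."""
    dx, dy = ACTIONS[action]
    x, y = state[0] + dx, state[1] + dy
    if not (0 <= x < GRID and 0 <= y < GRID):
        return state                           # bumping a wall is benign
    return (x, y)


def is_fatal(state: State) -> bool:
    return state in PITS


def choose_action(state: State, goal: State) -> Optional[str]:
    """Greedy agent: simulate every action, discard fatal ones, and pick
    the survivor whose predicted state is closest to the goal."""
    best, best_dist = None, float("inf")
    for action in ACTIONS:
        predicted = transition(state, action)  # simulation, not execution
        if is_fatal(predicted):
            continue                           # never try this one for real
        dist = abs(predicted[0] - goal[0]) + abs(predicted[1] - goal[1])
        if dist < best_dist:
            best, best_dist = action, dist
    return best


if __name__ == "__main__":
    state, goal = (0, 0), (3, 3)
    for _ in range(12):
        if state == goal:
            break
        action = choose_action(state, goal)
        if action is None:                     # every move predicted fatal
            break
        print(f"at {state}, taking {action}")
        state = transition(state, action)      # stand-in for the real world
    print("reached", state)
```

The same loop is where a learned world model would slot in: replace `transition` with something learned from data (video frames, text, or both) and the simulate-then-commit structure stays the same.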
Allen Roush: So I want to ask you: you've been in AI planning since way before the deep learning revolution, and I think you would agree that the field mostly ignored planning for a decade or more. And now everyone wants LLMs to plan. What are the most important things that the planning community knew back in 2005 that the LLM community is painfully rediscovering today?
Subbarao Kambhampati: In general I would say that young LLM researchers, who are very important because they do all the great work, would benefit from paying attention to that earlier work, as against not paying attention, which seems to have happened in the current LLM era. Instead you get kids who basically think, here is a new thing called world models that nobody ever thought about, and we are going to do it. One of the interesting things from the planning perspective is that what makes planning hard is different properties of the environment, such as partial observability, stochasticity, multiple agents, and so on. You need to understand both what makes planning from scratch hard and which parts don't make much of a difference from the perspective of essentially memory-based systems. Knowledge never hurt anybody; you don't necessarily have to redo everything that was tried before, but knowing it is very important. I'm a big believer in this. I ask all my students to make sure they have done Intro to AI before everything else, because you need to have that connection. As Santayana said, those who forget history are condemned to repeat it. Now we say they are condemned to have new NeurIPS papers, because if neither you nor your reviewers know the history, you'll get more papers. That's actually good to some extent, I suppose. One last thing about this: multi-agent LLMs are a big thing all over the place these days. Again, that's something where you can learn from planning and, of course, from the multi-agent systems community. What most people think of as multi-agent LLMs is a big problem that you split among multiple LLMs. That's actually questionable and somewhat exotic: if LLMs are god-like, if Claude is god-like, then why do you need two gods? I'm from Hinduism and I think we need three million gods, but that's a different story. But think about marketplaces. I was working with a bunch of people at Microsoft; they have this thing called Magentic Marketplace, where everybody has an LLM mediator, and there are all sorts of interesting issues: my agent won't know what your agent knows, because you won't let it know. It's not like there's a hive mind. For a single problem, a hive mind should be the answer, and if you're writing a multi-agent solution to it anyway, that's really a better style of programming rather than something that is genuinely multi-agent. Part of what happens most of the time for me is that I read papers and cringe, thinking, why don't these people know some of these connections? But you know, these are fun times to be living in.
Ravid Shwartz-Ziv: Okay, I think we are out of time. Do you have anything else that you want to add, or to promote?
Subbarao Kambhampati: Mm-hmm. Yeah. So I...
Ravid Shwartz-Ziv: sale.
Subbarao Kambhampati: Yeah, one of the things I keep saying is that this technology, especially LLM technology, is quite amazing, and I'm really happy that I'm around when it is available. But science is about skepticism. We are not just paid shills for some technology. You want to understand where it works and where it breaks. I keep saying that if bridges had only been built once the strength of materials was understood, we wouldn't have had any bridges. The Romans were building bridges, and they had no idea what strength of materials was. Some of those bridges still survive, and many other bridges probably fell with people standing on them. These days, if a bridge falls, the people behind that bridge will actually go to jail, because that's civilization: we provide guarantees for the things we build. That's the way we should be thinking about these systems. Yes, they have these abilities, but under what conditions, and what guarantees can we put on them? That's really connected to safety, because if you can put guarantees on them, you can put them in situations where safety issues become important. Having that sort of scientific skepticism is actually taking the idea seriously, as opposed to just saying, let's see one more trick they can do, isn't that amazing, they can do lots of things.
Ravid Shwartz-Ziv: It's a great take-home message, I think. I agree. So, Rao, thank you so much for joining us. It was a pleasure. Thank you to the audience. Thank you, Allen. Thank you, everyone.
Subbarao Kambhampati: Okay. Thank you. Thank you. Thank you.
Allen Roush: Yeah, it was really a pleasure. Love talking about it.