April 24, 2026

What Actually Matters in AI? - with Zhuang Liu (Princeton)


In this episode, we hosted Zhuang Liu, Assistant Professor at Princeton and former researcher at Meta, for a conversation about what actually matters in modern AI and what turns out to be a historical accident.

Zhuang is behind some of the most important papers in recent years (with more than 100k citations): ConvNeXt (showing ConvNets can match Transformers if you get the details right), Transformers Without Normalization (replacing LayerNorm with dynamic tanh), ImageBind, Eyes Wide Shut on CLIP's blind spots, the dataset bias work showing that even our biggest "diverse" datasets are still distinguishable from each other, and more.

We got into whether architecture research is even worth doing anymore, what "good data" actually means, why vision is the natural bridge across modalities but language drove the adoption wave, whether we need per-lab RL environments or better continual learning, whether LLMs have world models (and for which tasks you'd need one), why LLM outputs carry fingerprints that survive paraphrasing, and where coding agents like Claude Code fit into research workflows today and where they still fall short.


Timeline

00:13 — Intro

01:15 — ConvNeXt and whether architecture still matters

06:35 — What actually drove the jump from GPT-1 to GPT-3

08:24 — Setting the bar for architecture papers today

11:14 — Dataset bias: why "diverse" datasets still aren't

22:52 — What good data actually looks like

26:49 — ImageBind and vision as the bridge across modalities

29:09 — Why language drove the adoption wave, not vision

32:24 — Eyes Wide Shut: CLIP's blind spots

34:57 — RL environments, continual learning, and memory as the real bottleneck

43:06 — Are inductive biases just historical accidents?

44:30 — Do LLMs have world models?

48:15 — Which tasks actually need a vision world model

50:14 — Idiosyncrasy in LLMs: pre-training vs post-training fingerprints

53:39 — The future of pre-training, mid-training, and post-training

57:57 — Claude Code, Codex, and coding agents in research

59:11 — Do we still need students in the age of autonomous research?

1:04:19 — Transformers Without Normalization and the four pillars that survived

1:06:53 — MetaMorph: Does generation help understanding, or the other way around?

1:09:17 — Wrap


Music:

  • "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • Changes: trimmed

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Ravid Shwartz-Ziv: Hi everyone, and welcome back to The Information Bottleneck podcast. Today we have Zhuang Liu, an assistant professor at Princeton. Hi, Zhuang. Nice to have you. And as always we have Allen. Hey, Allen!

 

Zhuang Liu: Hi, Ravid. Hi, Ellen.

 

Allen Roush: It's good to see you again, Ravid. Nice to meet you, Zhuang. And yeah, you should go take a look at his Google Scholar. He has a lot of really cool papers.

 

Ravid Shwartz-Ziv: Yeah, so we'll talk about some of these papers today, and in general we're going to talk about what the important components are in today's AI. You have so many works; I think we can start with some understanding of what we think the important components are. A few years ago, you had a paper, "A ConvNet for the 2020s," right? Maybe you can tell us what the paper is about, and then we can start to decompose the components of current AI systems.

 

Zhuang Liu: Mm-hmm. Yes, of course. So that was a really interesting story. We were working on this paper in 2021, and back then Transformers had just entered the computer vision research space through the introduction of Vision Transformers, right? Everybody in the vision community had been switching from more traditional ConvNets to Vision Transformers, and they were getting better and better performance. In this work, we wanted to investigate whether ConvNets had truly lost their competitiveness, and whether it was possible to control all the design components to study whether a ConvNet could be modernized to be on par with the Vision Transformers of that time. So we wanted to study whether that seemingly large performance gap between Transformers and ConvNets was due to their inherent design, like whether you use self-attention or convolution, or due to other seemingly small design details. We found it to be the latter. After a lot of study of the components of a ConvNet, we eventually got the model to be on par with very strong Vision Transformers of that time, on a diverse range of tasks. This demonstrates that whether you choose a ConvNet or a Vision Transformer, as long as you get everything right, you'll get the same frontier-level performance on vision tasks.

 

Ravid Shwartz-Ziv: And do you still believe this is true? Do you still believe that, in the end, architecture doesn't really matter?

 

Zhuang Liu: Yeah, I largely tend to agree, but I wouldn't say it doesn't matter. I would say as long as you get everything right, as long as you have explored the design space enough, you'll converge to some point that's kind of the Pareto frontier in terms of the accuracy-efficiency trade-off. And it's very hard to go beyond this frontier. I think in the past many years there haven't really been innovations that have been widely adopted other than the mature architectures we were already working with several years ago. But this process itself is very, very fun. In fact, recently there have been some paper releases on this, especially from the open-weight model companies like Kimi and DeepSeek. They're still playing around with architectures: how should we change residual connections, how do you connect different layers, this kind of stuff. And I respect that work a lot. In fact, part of the reason architecture research is less active in academia right now is that we can't really afford the compute to demonstrate these effects at a convincing scale. That's the sad part, but I still try to play around with it myself using our resources at the university. Actually, with the help of Claude Code, I can now go back to coding and play around first-hand myself. So this is very fun research. But I do think, from a practical point of view, what data we train the model on matters more than the architecture choice, as long as the input-output interface stays the same. Architecture is how we parameterize the function approximator, the very basic function of neural networks or deep learning. And no matter how we parameterize it, as long as you get a few things right, like using residual connections, using self-attention or other reasonable mechanisms, and having the right activations and feed-forward layers at the right places,
you'll be very close to or at the frontier curve of this performance-versus-efficiency balance. And then I tend to think that what matters more in terms of practical usage, in terms of bringing the product to more and more people and making it more effective in their daily usage, is other aspects: what data the model has been trained on, how it can handle context and memory. In terms of context and memory, yes, there does exist architecture work that addresses that. So I think those are the more immediate problems we need to address to take AI up another level.

 

Allen Roush: In "A ConvNet for the 2020s," first of all, maybe just give us a summary of that paper. But then my question is, as I read it, and correct me if my understanding is wrong, my understanding is that you gradually modernize a ResNet toward a Swin-like design and end up with a ConvNet that competes strongly with Transformers. Which single ablation

 

Zhuang Liu: Great.

 

Allen Roush: in that paper most changed your own mind about where the supposed Transformer advantage was really coming from?

 

Zhuang Liu: What ablation? I think it's every ablation. If you look at the graph, nothing single-handedly brought up the performance by a large margin. Some of them are more useful than others, but it's not that one change changed everything. Maybe the use of activations and the reduction of normalization layers is one thing that intrigued me and had a noticeable performance change. But it was really everything together. These seemingly small components, when we compound them together, can cause large differences that would normally be attributed to some core change, like changing convolution to self-attention, or to modern mechanisms like linear attention, right? So I think the big lesson here is that these small details, when combined together, matter much more than the seemingly core components of the networks. Yeah.

 

Ravid Shwartz-Ziv: But do you think... Because to me it looks like we are trying a lot of things. Some of these things work, right? And then we get better models. And then in retrospect, we can start to understand exactly which components actually matter, right? We can try to, say, change a ConvNet and make it look more like a Transformer, for example. Right? Do you agree with that? Do you think...

 

Zhuang Liu: Mm-hmm. Mm-hmm.

 

Ravid Shwartz-Ziv: like, is this something where we actually need the breakthrough, and then we can trace back and understand the details and try to match other methods? Or do you think we just need to try again and again, and we don't need very specific directions?

 

Zhuang Liu: I see. Got it. Yeah, the Transformer is absolutely a blessing to the community, and so is the adoption of Transformers in computer vision through Vision Transformers. I don't debate that; that's hard to debate. It's absolutely a breakthrough, a very important breakthrough, perhaps the biggest breakthrough of those years. Our study was focused more on computer-vision-only domains. But another benefit of Vision Transformers is that they enable the unification of text and image representations. The use of Transformers was very important for later developments like LLaVA, this multimodal framework where vision encoders encode the image into tokens and those tokens are then fed into a downstream LLM. This is the basic framework of a lot of multimodal models right now. So I think the adoption of Transformers in vision was a big step toward the unification of multimodal data. Back to our study: this investigation of details, I would say, is more of a lesson. I'm proud of this work because it changed my own understanding and a lot of other people's understanding, rather than because of any prescription. You can still stay with ConvNets; people can still do that. ConvNets have their own benefits as well, especially when restricted to computer vision only. They're convenient to deploy, kind of easier to understand, and they also have better support for high-resolution, long sequences because the operation is local. I think the two just excel in different places. And this study of the transformation of details is more of a lesson for me, rather than us needing to figure out exactly how much each component contributes to the performance difference.

 

Ravid Shwartz-Ziv: So, okay, architecture is not so important. And you also have a more recent paper where you actually showed that normalization layers are not so important, right? That you can basically remove the normalization layers using a tanh activation function, right? Maybe you need to do some tweaks and treat it a bit differently, but it works fine. So what do you think are the important components, the core components where you'd say, okay, these are so important for AI, for deep learning? And why only in recent years have we...

 

Zhuang Liu: Mm-hmm. Mm-hmm. Mm-hmm.

 

Ravid Shwartz-Ziv: come up with such good models? If architecture doesn't matter, why have we only had good AI models in the last five years, not ten?

 

Zhuang Liu: Yeah, it's a good question. So first, Transformers were brought into the world nine or ten years ago. So it was a long time ago. And even now we still follow a similar fundamental framework, with some minor changes like activation layers, mixture of experts, which is not always needed, and some changes like local attention and sliding-window attention. But the framework is still the same as nine years ago when the paper initially came out. So for me the answer would be just the data: the scale of data we train them on, and the scale of compute we use to train them. It's the classic story from GPT-1 to GPT-3. Once you train basically the same model using more compute and more data, more diverse, internet-scale data, you get this emergent, very competitive performance we see now. So I would attribute that to the data, and secondarily to the compute, the number of epochs we train on the same data. Actually, I think it's mostly data, because now we train most models for around one epoch or less.
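The normalization result Ravid mentioned earlier (Transformers Without Normalization) replaces LayerNorm with a "Dynamic Tanh," DyT(x) = γ · tanh(αx) + β. A minimal NumPy sketch of the two operations side by side; in the paper α, γ, β are learned per layer, while here they are fixed toy values:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm over the channel dimension (no affine params):
    # subtract the per-token mean, divide by the per-token std.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    # Dynamic Tanh: elementwise squashing with a scale alpha.
    # No per-token statistics are computed at all, unlike LayerNorm.
    return gamma * np.tanh(alpha * x) + beta

x = np.random.randn(4, 8)                  # (tokens, channels)
print(layer_norm(x).shape, dyt(x).shape)   # both (4, 8)
# DyT output is bounded by |gamma| + |beta|, mimicking the squashing
# effect of normalization while staying a pure pointwise function.
```

The practical appeal is that DyT needs no reduction across the channel dimension, which simplifies kernels and removes a synchronization point.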

 

Allen Roush: I see this thesis permeate your work: that the field is confounding architecture with recipe, right? Specifically in your ConvNeXt paper. If you were setting rules for architecture papers today, and maybe you do have these exact rules for your students, what controls would you require before someone can claim that some inductive bias

 

Zhuang Liu: Mm-hmm. Mm-hmm. Okay.

 

Allen Roush: actually matters?

 

Zhuang Liu: Got it. In an ideal world, we would have unlimited compute, right? So first of all, I would require the benefit to be demonstrated at a decent scale, not necessarily frontier models, but like 7B or 30B. I think scale is important for industry to really see the benefit, to be convinced about the change. But that's not always possible; that's just the ideal world. Secondly, if you want to study these architecture changes at smaller scales, the ones we can do with reasonable resources, I would first require a hyperparameter sweep. We cannot demonstrate that a new architecture works better than the old model on just one set of hyperparameters, especially when those hyperparameters are tuned just for that model. Each model should be compared at its own best hyperparameter settings. The most important hyperparameters would be learning rate, weight decay, and optimizer type. It just bothers me a lot when people only tune the learning rate of their own method without tuning the learning rate of the baseline. That's kind of the most common flaw behind a lot of claims that turn out not to generalize. That's one. And secondly, I would require the idea or the method to be demonstrated on more than one dataset, and especially on a reasonable-scale dataset. I would say ImageNet is still applicable today, but ideally people should also demonstrate that the idea works on some small-scale large language models, say, for example, training on FineWeb. I understand that not every lab has the resources to do that, but I'm a big fan of demonstrating an idea on a diverse range of datasets, at least the common ones. So I guess those are the two standards I would impose. Yeah.
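Zhuang's first standard, comparing each architecture at its own best hyperparameters rather than at the baseline's settings, can be sketched like this. The model names and validation losses below are made-up numbers purely for illustration:

```python
# Hypothetical validation-loss results from a sweep:
# results[(model, lr, weight_decay)] = val_loss. In practice these
# come from actual training runs; here they are invented numbers.
results = {
    ("baseline", 1e-3, 0.1): 2.41, ("baseline", 3e-4, 0.1): 2.35,
    ("baseline", 1e-3, 0.01): 2.44, ("baseline", 3e-4, 0.01): 2.38,
    ("new_arch", 1e-3, 0.1): 2.52, ("new_arch", 3e-4, 0.1): 2.35,
    ("new_arch", 1e-3, 0.01): 2.39, ("new_arch", 3e-4, 0.01): 2.36,
}

def best_setting(model):
    # Pick the best (lr, weight_decay) for this model independently,
    # so each architecture is compared at its own optimum.
    runs = {k: v for k, v in results.items() if k[0] == model}
    return min(runs, key=runs.get), min(runs.values())

for model in ("baseline", "new_arch"):
    (_, lr, wd), loss = best_setting(model)
    print(f"{model}: best lr={lr}, wd={wd}, val_loss={loss}")
# Comparing only at lr=1e-3 would have favored new_arch (2.39 vs 2.41),
# but at each model's own best setting the gap closes (2.35 vs 2.35).
```

This is exactly the failure mode Zhuang describes: tuning only your own method's learning rate manufactures a gap that disappears under a fair sweep.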

 

Ravid Shwartz-Ziv: I have a question: do you think that if you have a good idea, it should work on different domains, different datasets, different settings? Or are there very good ideas that actually only work in very specific scenarios?

 

Zhuang Liu: Right, right. I do think both are equally valuable. In the second case, I would want to know how to describe the specific scenario where the model works better. And there should still be more than one dataset. If you claim that your model works on long-context audio, you can still evaluate on multiple datasets for that. And give an explanation of why the method is good in this particular domain. Then we can start from there and address the weaknesses, why it doesn't work in other domains, and go from there. Or maybe we're fine with the fact that it only works in this domain, and we can just try to increase that gap. So I do think both are valuable. That's the value of research, right? You are not required to be fully successful at the first step. That would be good, but it's not a necessity.

 

Ravid Shwartz-Ziv: So let's talk a bit about data. You said this is the important part. What exactly in the data? Maybe let's start by talking about your paper from, I think, several months ago, a year ago, "A Decade's Battle on Dataset Bias." Let's start by describing or explaining what it means, what you

 

Zhuang Liu: Mm-hmm.

 

Ravid Shwartz-Ziv: tried, and what the motivation was to revisit some of these battles.

 

Zhuang Liu: Mm-hmm. Yep. So I'll first talk a little bit about the context, the background. This paper is more vision-focused. Over the years, people have been developing larger and larger datasets from more and more diverse sources. Initially we had MNIST, then CIFAR, then ImageNet. And then the internet-scale ones, like DataComp and Conceptual Captions (CC) from Google. They have seemingly more and more diverse images, and they are larger and larger, from tens of thousands of images up to billion scale. It was kind of tempting to assume that we are already good on datasets, because these datasets contain nearly everything you can get from the internet. But in some of our initial experiments, we found that these datasets are still nothing like each other. How do we measure this? We designed a very dumb experiment that doesn't make sense at all as a deep learning training program. What we did is this: given three very large-scale computer vision datasets, we train a neural network to classify which dataset a given image comes from. It's not a practical problem; it's just trying to guess the dataset of origin, a multi-class classification problem. It turns out that even with these seemingly very diverse datasets, the model can still tell the answer with surprisingly high accuracy, higher than 80% with three datasets, where the random-guessing accuracy would be 33%. So to a model, these datasets are still very, very different; there are very clear cues for the model to tell which dataset an image comes from. Of course, we are doing this on held-out validation sets; we are not predicting the training images. But still, it was surprisingly accurate.
And that prompted us to reflect: have we really been successful in curating a large-scale, all-encompassing dataset, at least in the computer vision domain? What kind of data is the end goal? It's kind of hard to define this so-called unbiased, global-distribution dataset. Different people might have different criteria, and it might not even be a reasonable assumption. A big part of the success story of LLMs is that they are not domain-specific models, right? We want the model to be able to do everything. And for that to happen, a common assumption is that the model needs to have seen everything during training. How much progress are we making on that? It seems from this initial experiment that we haven't reached that point yet. So it still makes sense to think about how we curate these datasets for the model to really generalize to every task we want it to be good at.
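The dataset-classification experiment can be illustrated with a toy stand-in: three synthetic "datasets" whose pixel statistics differ slightly, and a nearest-centroid classifier in place of the deep network. Everything here (the shift values, the classifier) is an assumption for illustration; the real experiment trains a neural network on held-out splits of actual large-scale datasets, and the cues it picks up are far subtler:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_images(mean_shift, n=200, dim=48):
    # Toy stand-in for a dataset: flattened "images" whose pixel
    # statistics are shifted slightly per source (a hypothetical cue).
    return rng.normal(loc=mean_shift, scale=1.0, size=(n, dim))

# Three "datasets" with small distributional differences.
shifts = [0.0, 0.3, 0.6]
train = [make_images(s) for s in shifts]
test = [make_images(s, n=100) for s in shifts]

# Nearest-centroid classifier: predict the dataset of origin
# for held-out images, a 3-way classification problem.
centroids = np.stack([d.mean(axis=0) for d in train])

def predict(x):
    dists = ((x[:, None, :] - centroids[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)

correct = sum((predict(t) == i).sum() for i, t in enumerate(test))
acc = correct / (3 * 100)
print(f"dataset-origin accuracy: {acc:.2f} (chance = 0.33)")
```

Even this crude classifier beats the 33% chance rate by a wide margin once per-source statistics differ, which is the paper's point: "diverse" sources still carry tell-tale signatures.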

 

Ravid Shwartz-Ziv: So what is the answer, do you think? What do we actually need from good data? What types of properties: diversity, some uncertainty, some redundancy? What do you think we actually need from good data?

 

Zhuang Liu: Yeah, I would say yes: diversity of content, diversity of styles. I think it depends... I think a big lesson of deep learning is that if you want the model to be good at something, you'd better train it on that something. And if you want it to be good at everything, you need to train on everything. But in today's regime, we still have this trade-off problem, right? We don't have unlimited compute. Although we have built these giant datasets, we still have limited compute and limited model capacity. So the different capabilities the model has learned may still compete with each other, meaning that if you want the model to be better at coding, you might need to sacrifice a little bit on, say, the model's psychological counseling abilities for users. That's just an example. How do you develop a mixture of training data with the right representation, with enough representation for each thing we want the model to be good at? How to balance different domains of data is an important design decision. What we actually found in our recent text-to-image generation project is a surprisingly simple solution. It's not the best, but it's simple enough: roughly divide all the domains you care about so that the domains are of equal importance to each other. You don't want, say, how to do a haircut represented equally with how to code; they are of different importance to most people. We want the model to train on much more coding data than haircut data. That makes sense. But if you extend the concept of haircuts to other daily-life skills at the appropriate level, so that the domain is at about the same importance level as another domain, then you just collect high-quality data from each domain and mix them equally in your training data. That turned out to work well for a lot of other projects too.
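The equal-importance mixing recipe described above can be sketched as a two-stage sampler: first pick a domain uniformly, then a document within it, so raw pool sizes don't dictate the mixture. The domain names and pool sizes below are hypothetical:

```python
import random

random.seed(0)

# Hypothetical domain pools, assumed already filtered for quality.
# Domains are drawn broad and comparably important: not "haircuts"
# sitting next to "all of coding" as equals.
domains = {
    "code": [f"code_doc_{i}" for i in range(10_000)],
    "math": [f"math_doc_{i}" for i in range(2_000)],
    "daily_life": [f"life_doc_{i}" for i in range(50_000)],
}

def sample_mixture(n):
    # Equal weight per domain regardless of raw pool size:
    # pick a domain uniformly, then a document uniformly within it.
    names = sorted(domains)
    batch = []
    for _ in range(n):
        name = random.choice(names)
        batch.append((name, random.choice(domains[name])))
    return batch

batch = sample_mixture(9_000)
counts = {name: sum(1 for d, _ in batch if d == name) for name in domains}
print(counts)  # each domain gets roughly 3000 docs despite 10k/2k/50k pools
```

Sampling proportionally to pool size would instead let the 50k-document pool dominate, which is exactly the imbalance the equal-importance recipe avoids.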

 

Ravid Shwartz-Ziv: Do you think this is the future? Just combining a lot of different sources?

 

Zhuang Liu: I think for a general model, yes. If you just want the model to be good at everything rather than excelling at one particularly difficult task, then yes, data coverage is king. Ilya Sutskever famously said that if you have some task, a big model, and you collect enough data, then your training success is kind of guaranteed. I think that still applies to modern deep learning, where you don't have just one task but a variety of tasks. If you want the model's ability on a task to be good when facing users, just have enough data on it in the training set. That's kind of the most reasonable solution.

 

Allen Roush: So, uh, and I'm just looking here. Sorry, I had a question about one of your other papers. Oh yeah, so...

 

Ravid Shwartz-Ziv: Till then, I... Okay.

 

Allen Roush: No, no, I'm ready. ImageBind, right? This was an interesting one. You mentioned that you have image-paired data, combining, I think I counted, six modalities into one embedding space. Do you think that's a deep statement about how modalities relate, or is this a contingent fact about vision's role in

 

Zhuang Liu: Mm-hmm. Mm-hmm.

 

Allen Roush: at scale data.

 

Zhuang Liu: Mm-hmm. Yeah, I think a very important message of the paper is that modalities can be embedded together. That's kind of the basis of how these multimodal foundation models work right now. The common approach is to convert every modality, through an encoder, to be aligned with the language model's representation as tokens. In ImageBind, we focused more on learning the encoders, not really on connecting them with large language models. And I think another insight from this work is that vision is a natural bridge across all these modalities, because vision data is kind of the default input we get as humans, and it often co-occurs with a lot of other modalities, like audio. When you watch a YouTube video, the audio data and the vision data naturally flow together, right? And you can use that as a signal to align the two. The same goes for vision and language. I guess for language and audio you could do this as well. Let me think about what other data types we had... motion data. Motion data also co-occurs with images or vision data. So I think this reveals the fundamental role of vision in our daily perception.
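The binding Zhuang describes can be sketched as a symmetric contrastive (InfoNCE) loss over naturally co-occurring pairs, with image embeddings as the anchor modality. This toy NumPy version uses random vectors in place of real encoder outputs, so it only demonstrates the shape of the objective, not a trained model:

```python
import numpy as np

def infonce_loss(img_emb, aud_emb, temperature=0.07):
    # Symmetric contrastive loss over co-occurring (image, audio) pairs.
    # Matching pairs sit on the diagonal of the similarity matrix.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    aud = aud_emb / np.linalg.norm(aud_emb, axis=1, keepdims=True)
    logits = img @ aud.T / temperature          # (N, N) cosine similarities
    labels = np.arange(len(logits))
    # Cross-entropy in both directions (image->audio and audio->image).
    lp = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    lp_t = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    return (-lp[labels, labels].mean() - lp_t[labels, labels].mean()) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 32))
aligned = img + 0.1 * rng.normal(size=(8, 32))  # audio encoder near images
random_aud = rng.normal(size=(8, 32))           # unaligned encoder
print(infonce_loss(img, aligned), infonce_loss(img, random_aud))
```

Training drives each modality's encoder (audio, depth, motion, and so on) toward its paired image embedding, so modalities that never co-occur directly still end up comparable through the shared image space.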

 

Ravid Shwartz-Ziv: But why do you think, in the end, the big jump in capabilities came through language, right? Like, we had vision for a while, and...

 

Zhuang Liu: Yeah.

 

Ravid Shwartz-Ziv: we didn't see this huge adoption across all the different fields and subdomains and companies, right? When language models became much better, suddenly people started to use AI. Why do you think we saw that? Do you think it's just coincidence, or is there something fundamental about language?

 

Zhuang Liu: Yeah, I think it's a much-discussed subject. My understanding is that vision inherently has a much higher throughput, a much higher bandwidth of data flowing into our perception systems, and the compute required to actually put that data to use hasn't been available yet. Say I take my current view, just one image, one frame. The space needed to store that image is much larger than the space needed to store, say, a description of the image in language. The language could be stored in bytes, but the image needs kilobytes. That's a thousand times different. So an image is actually worth more than a thousand words. It's really high-throughput data, and we don't have enough compute to handle it yet. We also don't have a good mechanism for focusing: with current multimodal language models, we don't have a mechanism to let the model look back at the image and focus on specific areas, because everything is already encoded in the vision tokens. If the vision encoding is not good, the language model, the autoregressive model, can do nothing about it. Language, on the other hand, lives in a much lower-dimensional space where each word has a clear meaning; it's kind of unsupervised learning done by humans from nature. Through human evolution we prepared these words for the model; these concepts are important, and they are very condensed. For example, a cup: to describe a cup from images you need thousands of images of cups, but to describe a cup in language you just need that one word, and that word is bytes of storage compared to megabytes of images. So the compute required to process that much information would naturally be much higher, and I don't think we're there yet.

 

Allen Roush: Okay. And I love this paper title, by the way: Eyes Wide Shut, right? Because it's a Stanley Kubrick film, his final film, I believe. Anyway, you argue that many multimodal LLM failures trace back to a CLIP-like vision encoder, probed via CLIP-blind pairs

 

Zhuang Liu: Mm-hmm. Mm-hmm. Mm-hmm. Right.

 

Allen Roush: and MMVP and please spell out what that means and also just summarize this paper. ⁓ How much of that like bottleneck is really a vision problem in your opinion versus being like a language model or like an alignment problem?

 

Zhuang Liu: I think it's very much a vision encoder problem. As I said, these models only learn what they are taught to learn in training. If you don't give the model the types of training tasks you want it to excel at, it will not be good at test time. Specifically, in CLIP training, we train image representations to be aligned with their captions' representations, right? The caption focuses more on the content of the image: what's out there, what objects are there, what are they doing. It focuses less on, explicitly, the positions of those objects. If you have a human and a dog in the image, the caption likely doesn't say whether the human is on the left or the right. It just says "a human plays with a dog." That's the natural thing we say when we see such an image, and that's okay for humans; we don't really care who is on the left. But if you want the model to answer these types of questions well, then we need that in training, and that's what current CLIP training doesn't care about. So we end up with a CLIP model that's going to be used as a vision encoder for multimodal language models but is not trained to be good at these tasks, or the other visual patterns we examined in that study. Again, I think this just reinforces my idea that if you want a model to be good at something, you need to train it on that something.
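For context, MMVP (Multimodal Visual Patterns) is built from "CLIP-blind pairs": images that a CLIP encoder embeds as near-duplicates while a vision-only encoder (DINOv2 in the paper) clearly separates. Here is a toy sketch of that pair-mining step, with random vectors standing in for the two encoders' embeddings and arbitrary toy thresholds (the paper uses much stricter similarity cutoffs on real embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings for 100 images under two encoders. In the paper
# these come from CLIP and DINOv2; here they are just random vectors.
clip_emb = rng.normal(size=(100, 64))
dino_emb = rng.normal(size=(100, 64))

def clip_blind_pairs(clip_hi=0.25, dino_lo=0.0):
    # A "CLIP-blind" pair: CLIP similarity is high (the encoder thinks
    # the images are alike) while the vision-only similarity is low.
    pairs = []
    for i in range(len(clip_emb)):
        for j in range(i + 1, len(clip_emb)):
            if (cos_sim(clip_emb[i], clip_emb[j]) > clip_hi
                    and cos_sim(dino_emb[i], dino_emb[j]) < dino_lo):
                pairs.append((i, j))
    return pairs

pairs = clip_blind_pairs()
print(f"{len(pairs)} candidate CLIP-blind pairs out of 4950")
# Each such pair seeds a probing question: if CLIP cannot separate the
# two images, a CLIP-based MLLM tends to answer the same way for both,
# exposing the vision encoder as the bottleneck.
```

The mining step itself is encoder-agnostic: any pair of embedding spaces with disagreeing similarity structure will surface candidate blind spots of the first encoder.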

 

Ravid Shwartz-Ziv: What are your thoughts about reinforcement learning? All the labs are now making their own versions of environments: we want to be good at coding, we want to be good at some specific task, so let's build a specific environment, train the model to be good in that environment, give it feedback and rewards during the task, and that's it.

 

Zhuang Liu: Mm-hmm.

 

Ravid Shwartz-Ziv: you think this is the future? We'll just see more and more environments like that.

 

Zhuang Liu: Yeah, actually, I don't know how feasible it is for every lab to really fine-tune these models using either reinforcement learning or supervised fine-tuning. Since these are closed-source models that only give us an interface to a really strong general-purpose model, I would hope that we develop a common, mature pipeline; I do hope that in the future we have another method as mature as pre-training for doing continual training. It could be reinforcement learning; it could also be other things, more like context engineering, prompt engineering, agent collaboration. I think all of these are open. Maybe you even need to tweak the architecture for the model to have larger memory, larger context, things like that. I do think the concept of continual learning, of adapting a general-purpose model to a particular domain, is very important, because everyone, throughout their life, has a different context. You want the model to be a good assistant to you, to empower your life and your work, so you want a lot of context. This is something models haven't been able to match human brains on: really a lot of memory, and the ability to learn fast, to remember facts after seeing them just once, and to never forget. When you interact with Claude Code today, the biggest thing I have to worry about is whether it remembers something I did before. I think that's the experience of a lot of people. In our professions, in our own careers, there are a lot of things we want the model to remember so we don't have to repeat them: not necessarily a particular task, but everything, like our interactions with other people, our history of achievements and failures, things like that. I think the answer to that may be more than just RL; it may be system engineering, how we organize everything so the model has easy access to it.
Yeah, but it still falls onto the data: how we organize the data, how we feed in enough data, how we collect data from different sources, from different inputs. Maybe we'll wear glasses, these smart glasses, so that we have visual input to these models.

 

Ravid Shwartz-Ziv: But do you think the basic components are there, or will stay the same? Do you think we just need to build the scaffolding, like how an agent moves in the world, collects data, organizes it, handles memory, all these things? Or do you think we need to change something fundamentally?

 

Zhuang Liu: Yes, that's a very good question. I think a sad fact is that not everybody is able to work on the fundamental side of these very large models; only the people who can afford to train them can experiment with that. That's why we see a lot of agent work right now: it's kind of the only thing most people can do to improve the system. I think agents are wonderful. But for every system of agents I've built, every scaffolding, for example a scaffold that lets a Claude Code model run for a long time, I believe that after a few weeks or a few months the model will develop the ability itself, or I'll find a simpler solution, say prompting, or using some built-in commands or skills, that achieves the same thing with much less overhead, without a Python scaffolding or something like that. So I do think that's also kind of the bitter lesson: we want to keep the system simple and just let the model decide a lot of things by itself. But yeah, the sad fact is that not everybody is able to contribute to the evolving abilities of these underlying models; what the rest of us can do is context engineering and agents. I do think, fundamental-abilities-wise, we can still catch up. Every task we care about right now, at a certain performance level, we can do with fewer agents and less scaffolding, relying more on the model's capabilities. I think we're still on that curve.

 

Ravid Shwartz-Ziv: But why do we actually care, right? Because then you're also saying that all the problems can be solved with agents today, that there are no fundamental problems we can't solve, right? So yeah, maybe we can build or tune models to be more efficient, or to learn from fewer examples, but in a world where we have more and more compute and more and more data, why not just build agents and solve all the problems?

 

Zhuang Liu: Agents still make mistakes, coding agents included. A lot of the mistakes they make for me come from not remembering something that should be obvious. So I do think memory and context are the most important aspects right now, especially memory. I mean, they're two sides of the same coin, right? Even if you have unlimited context, in theory you can access everything with the model, but if it forgets or gets facts wrong, then it still doesn't have good memory. When the 1-million-token context window for Claude Code was announced a few days ago, I think everybody was very cheerful about that, including me. It's very good. But still, how can we have unlimited memory? At least the continual learning problem, right? How can we not forget? I think that's more important, and success there would be more fruitful, than figuring out how to build collaborative agents. The fact that we need a lot of agents is exactly because one agent cannot remember everything, so we need separate agents for separate tasks. If one agent could remember everything, and doing one task didn't make it forget the previous task, then everything could be built on that one agent. It could be parallelized on the back end, on the company's servers, but if it's just one personal assistant to us, that would still be more convenient than orchestrating multiple agents.

 

Allen Roush: So, let's see, in that same paper I asked about before, Eyes Wide Shut, you suggested, I think, mixing in vision self-supervised features to improve grounding. What do you think the ideal vision encoder for a multimodal language model looks like if you optimize jointly for both language alignment and fine-grained visual discrimination?

 

Zhuang Liu: Yeah, I think that's exactly the solution I have in mind right now: just do both. I think those are the two dominant paradigms for vision pre-training right now. And a lot of people are talking about world models, so I would add world models too: adding a temporal dimension to the vision part would also be very helpful.

 

Ravid Shwartz-Ziv: Let's maybe talk a bit about world models. What is your definition of a world model?

 

Zhuang Liu: A world model to me is just something that predicts how the world works, right? Predicting how the world evolves given its current condition.

 

Ravid Shwartz-Ziv: What does that actually mean? So, for example, a few weeks ago Stefano Soatto was here and claimed, yeah, LLMs have world models. And before that there was Yann, or, I don't know, Randall, who was here and claimed, no, we need to explicitly build world models for our models, and current LLMs don't have them. What is your opinion? Can we define something that lets us say, okay, these models have world models, some internal state that can represent the world, and these ones don't?

 

Zhuang Liu: Yep, I think LLMs have world models in the language space; that's definitely true. Language is a higher abstraction of all the perception signals we receive, and in it they actually have a very good world model, I would say. I often discuss history with ChatGPT. A few days ago, there was an event in Chinese history where I felt unfortunate for one party, and I asked ChatGPT: can you imagine a hypothetical scenario where the losing state actually won the war, and everything changed? It gave very reasonable responses. If you put together all the small events, everything makes sense; it's just very small probability shifts in how people made decisions there, and everything follows. It could very well have been the real history. In that sense, I think no human, at least no novelist or historian, could exceed that level of logical deduction across that series of events. So I think they do have a very good world model, just at a very high level of abstraction. I guess when we say we don't have world models right now, it's in the vision space, the raw signal perception space: we cannot fully recover or simulate the world in pixel space. That's also very true. So whether we have a world model depends on what level of the world you want to model. If you consider the higher-level events of the world to be a self-contained world, then yes, language models do have one. But if you consider every pixel, every raw signal, every physical signal, not only vision but every substance in the world, physical properties, all these things, then no, we don't have a world model at that fine-grained a level yet. I think the underlying reason is still that vision is a much higher-throughput data modality.
We still don't have the compute to model it perfectly.

 

Ravid Shwartz-Ziv: So do you think we actually need world models to solve 99% of the tasks we are interested in?

 

Zhuang Liu: I think for digital work, white-collar work, we don't, because a lot of it operates in digital space. At most, I need a model to be able to read my computer screen, and the screen can be digitized or compressed; usually it's just a set of images, not real-time video streams, which is easier. Right now, a bottleneck when I interact with Claude Code is often having to screenshot my screen, and I think that should be solvable, because these models may get access to our computer screens in a safe manner very soon. Then I wouldn't need to share a lot of context, like how I set something up on a website; those are the kinds of things I still sometimes have to screenshot for Claude Code. But for physical work, construction, driving, physical activities, yes, I do think we need a vision world model, because the feedback we receive in that kind of work is so fine-grained, so detailed. Hair cutting too, right? Which part of the hair to cut more or less, the hairstyle: that's not something you can do by consulting a language model. You need to see, if you want a model to do this work, and the same goes for physical work like medical surgeries. So I do think we need vision world models for those. It's obviously more than 1% of work, but digital work may be like 70% of it. I don't think we need a vision world model for more than half of the work we do, for the model to be really good at it.

 

Allen Roush: And then you have this cool paper that I was really fascinated by, because I have a paper here at ICLR about anti-slop, removing what I guess you might call an idiosyncrasy in an LLM. Your paper, Idiosyncrasies in Large Language Models, finds that model-specific signatures survive rewriting, translation,

 

Zhuang Liu: Mm-hmm. Mm-hmm. Mm-hmm.

 

Allen Roush: and summarization, in some ways. What do you think those signatures are really measuring? Is it pre-training data, post-training style, or slop like what we study in ours? Decoding behavior, which is also something I look at a lot? Or something more structural? What are your thoughts?

 

Zhuang Liu: Mm-hmm. Okay. Yeah. So this paper was about doing the same kind of data-origin classification, but on language model outputs. Given a piece of text, we want a separate neural network to be able to tell which language model generated it. And we found it can be very accurate, up to 99% accuracy with five candidate models. Back then this was quite surprising to us, but now I think people accept more and more that there are clues in text generated by language models; even people who are not AI researchers can often tell which model likely generated a text. So it's not as surprising now. Each company finds its own strategy to maximize user engagement, and the models can output different styles; in ChatGPT you can even explicitly select among different styles now. In terms of what contributes to this: yes, I do think each provider's own choice of style matters a lot. There's the system prompt, which we don't have access to. Do they tell the model to be verbose or concise, to use bullets or not? There's also post-training: different companies use different post-training strategies, and the way they hire annotators and instruct them to rate each response will have systematic differences, so they end up encouraging different behaviors. And then pre-training as well: each company has different sources of pre-training data. Some companies may want their model to be better at coding and math reasoning; some may optimize for general knowledge coverage. The sad fact is that we don't know how much difference there is, so we can only use the outputs as an approximation. I think everything matters, but I would guess post-training and how providers design the system prompts likely cause the majority of the difference.
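The classification setup described here, training a model to tell which LLM produced a text, can be illustrated with a toy sketch. Everything below is invented for illustration: the two "models", their stylistic habits, and the tiny bag-of-words centroid classifier (the paper trains a neural network on real LLM outputs, not anything this simple).

```python
# Toy sketch: classify which "model" produced a text from stylistic cues.
from collections import Counter
import math

def features(text):
    # Crude stylistic features: lowercased tokens, with , and . split off
    return Counter(text.lower().replace(",", " ,").replace(".", " .").split())

def centroid(texts):
    # Average bag-of-words vector over a model's sample outputs
    total = Counter()
    for t in texts:
        total.update(features(t))
    return {w: c / len(texts) for w, c in total.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(text, centroids):
    f = features(text)
    return max(centroids, key=lambda name: cosine(f, centroids[name]))

# Hypothetical style differences: "model_a" is effusive, "model_b" is terse.
corpora = {
    "model_a": [
        "Certainly! Here is a detailed, comprehensive overview of the topic.",
        "Certainly! Let us delve into a comprehensive and detailed answer.",
    ],
    "model_b": [
        "Short answer: yes.",
        "Short answer: it depends.",
    ],
}
centroids = {name: centroid(texts) for name, texts in corpora.items()}
print(classify("Certainly! Here is a comprehensive answer.", centroids))  # model_a
print(classify("Short answer: no.", centroids))  # model_b
```

The point the toy makes is the same one the paper quantifies: when providers' post-training and system prompts push outputs toward consistent habits, even shallow features separate them.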

 

Ravid Shwartz-Ziv: What do you think about pre-training? Do you think this separation between pre-training, mid-training, and post-training will continue? Is it important? Is there something fundamental there, or is it just the way we converged on doing our training?

 

Zhuang Liu: Got it. I think pre-training and mid-training share more with each other than with post-training. Post-training's objective is different: the reward signal now sometimes involves human judgment, human preferences. I think that's the biggest difference. Pre-training and mid-training are both just autoregression, right? Just on different styles of data and different context lengths. Mid-training is a more recent concept; just a few years ago we only had pre-training and post-training. And mid-training may be a temporary state, because it's really for extending context length and for bringing in higher-quality data, training focused on higher-quality data. I don't have inside information from these companies, but I think it may be a compromise we have to make because we don't have enough compute to always train on super long context, and we don't have enough data to always train on higher-quality data. So pre-training and mid-training are both, quote-unquote, pre-training. Post-training is different because it involves actual human steering of the model's behaviors, and I think that difference will not disappear. But I would hope we get another stage of continued training customized for each user. That would be very nice: customizing to the user's preferences, the memory needed, the user's style.

 

Ravid Shwartz-Ziv: And how do you see continual learning? Do you see it as something like self-supervised learning, where you learn from the differences between views or something like that? Or do you see it as task-specific, where we have new data and now need to solve specific tasks? How do you see it?

 

Zhuang Liu: Mmm. I think it's less about increasing capabilities. I would actually see it more as better memorization, better memory. The abilities these models have are already good enough; they can solve math questions that most people cannot. We just need the model to be mindful of each person's idiosyncrasies: how I would like it to respond to certain events, what general principles I have. Even if I write down every piece of my own history and every preference in an MD file, the model can still miss it, even when it's in context. For example, I have a global CLAUDE.md file that tells the model to look at certain things when something comes up, and it often still ignores that. I don't have a good way to really make it stick for the model. So I think continual pre-training would be more about having stable memories and not making mistakes on trivial things, rather than developing more skills; finding the right skills to use in the appropriate scenarios, rather than developing more capable skills.

 

Allen Roush: So, and maybe this is a bit of a topic change, but, it sounds like it's probably Claude Code, but have you used ChatGPT's, like, 5.4 Codex, and then Gemini? Do you have a favorite model, anything like that?

 

Zhuang Liu: Not really. Yeah, I mostly just use Claude Code, because it has so many functions I need to learn, like commands and how best to use it. I want to stick with this one ecosystem first and get familiar enough with it, rather than trying multiple competing but similar products. I try to keep things simple. But some of my students do use different models. I think Codex and Claude Code are the two main ones; I've heard that some students prefer Codex, partly because its usage lasts longer at the same price tier. They have to use these tools for their experiments, so if they want more quota, they tend to use Codex.

 

Ravid Shwartz-Ziv: What do you think about this? I think I saw someone on LinkedIn or Twitter who said that now, with the new coding agents, I don't need students anymore: I can just tell my coding agent what I want to do, have it run all the experiments, and give me the results with reports and all these things. What do you think about this approach? Do you think we will see more students or fewer?

 

Zhuang Liu: So, from an education perspective, I do think we need more students to be immersed in AI, to be capable of using it and developing it further. I don't think that should be a debatable standpoint: we need more capable students, and we need to train more people to grow through this experience. From the perspective of a real, practical project, I think the answer is the same. I can run my own little project right now with Claude Code, given a reasonable amount of resources and time, but it's not fully automatic. I tried to let it build one project, from ideation to experiments to writing the paper, in one or two days. But it just wasn't good. The question it proposed was a vague question, nothing too interesting to me. The experiments it ran weren't comprehensive enough to support the conclusions. I had to prompt it many, many times to get it on the correct trajectory. And then, back to the memory question, it forgets things more often than I thought it would. I tell it to use this GPU, this partition of GPUs; it may follow that for a few hours, and then after that job is done it kind of forgets. I also want it to never stop: keep exploring based on your current experiment results, design your next experiment to test a new hypothesis. But it just doesn't listen; sometimes it gets into a local minimum. So I think these agents are good at low-level tasks, but not as good at higher-level research understanding and navigation. And a student is no different from me: if I can use Claude Code to make my work more productive, so can they. If they have the right mindset, if they don't delegate everything to AI, they can still learn and grow into good researchers. I do think we need more such students, not fewer.

 

Ravid Shwartz-Ziv: So, yeah, I tried to do, I don't know if you heard about it, Andrej Karpathy released auto-research: he basically gave a coding agent the task of optimizing NanoChat, right, running multiple experiments overnight, and he showed that it optimizes, the validation loss goes down.

 


 

Zhuang Liu: Right. Right, yeah.

 

Ravid Shwartz-Ziv: And one of the suggestions of the agent was to change the random seed, for example, and then it became much better. I actually tried something similar: I took the same project and just ran a very simple Bayesian hyperparameter optimization. And apparently you can get better results in a shorter time, with a smaller number of iterations. I think, in the end, we need to be, I don't know if careful is the right word, but we need to understand which types of usage it's actually good for, and for which it's still not there, where we actually waste our time trying to prompt it and to try to

 

Zhuang Liu: Mm-hmm.

 

Ravid Shwartz-Ziv: make it work, because it's fancy and everyone is using it. So yeah, I think I agree with you that autonomous research is still not there. I don't know if it will be there in the future; maybe, who knows. For sure, for some use cases, like building some products, it's already very good, almost there.

 

Zhuang Liu: Mm-hmm. Mm-hmm.

 

Ravid Shwartz-Ziv: But for research, it's still not there.

 

Zhuang Liu: Exactly, yes. That's my experience as well.
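The comparison Ravid draws, a plain hyperparameter search versus an overnight agent loop, can be sketched in a few lines. This is a toy stand-in: the objective function is invented (a real run would briefly train NanoChat and return its validation loss), and simple random search stands in for full Bayesian optimization, which would additionally fit a surrogate model over these same trials.

```python
import random

def val_loss(lr, wd):
    # Hypothetical stand-in for "train the model briefly and measure
    # validation loss"; minimized near lr=3e-3, wd=0.1 by construction.
    return (lr - 0.003) ** 2 * 1e5 + (wd - 0.1) ** 2 + 0.9

random.seed(0)  # deterministic trials
best = None
for _ in range(50):  # 50 cheap trials instead of an overnight agent loop
    lr = 10 ** random.uniform(-4, -2)  # log-uniform learning rate
    wd = random.uniform(0.0, 0.3)      # weight decay
    loss = val_loss(lr, wd)
    if best is None or loss < best[0]:
        best = (loss, lr, wd)

print("best loss %.3f at lr=%.4g wd=%.3f" % best)
```

The design point: when the search space is low-dimensional and the objective is cheap to query, a classical optimizer is a far more sample-efficient baseline than prompting an agent, which is exactly the kind of "know when the tool fits" judgment being discussed.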

 

Ravid Shwartz-Ziv: Okay.

 

Allen Roush: So I have a question about a more recent paper of yours, Transformers Without Normalization. As I recall, that has Yann LeCun on it too. You replace the normalization layers with dynamic tanh layers, and you still match or beat normalized transformers across multiple settings. And this is part of that same theme

 

Zhuang Liu: Mm-hmm. Right, yeah. Mm-hmm.

 

Allen Roush: we've been talking about in your work, of inductive biases being historical accidents, right? So are there any other components that we call essential that you think are also just historical accidents?

 

Zhuang Liu: Not that I can think of right now; otherwise I would have already published that paper. I think a few things. Residual connections are very essential, and I do still believe they are essential, not historical accidents. In fact, there have been many, many efforts at replacing the residual, identity connections, developing variants of them or removing them, and so far none of them have really picked up. I don't think normalization is a historical accident either. I wouldn't recommend every company switch to dynamic tanh right now, because first, it's kind of tricky to get it working on LLMs, and second, it does not bring a speedup on current hardware and libraries. So it's a very interesting finding, but I wouldn't say it exceeds normalization layers and recommend people just go with it. Across history, I think residual connections, normalization layers, self-attention, and linear layers are maybe the four pillars of architecture components that have stood the test of time.
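For readers who haven't seen the paper, the dynamic tanh (DyT) layer discussed here is simple to state: replace LayerNorm(x) with an element-wise gamma * tanh(alpha * x) + beta, where alpha is a learnable scalar and gamma, beta are per-channel parameters. The plain-Python sketch below shows the forward pass only; the alpha initialization of 0.5 is my recollection of the paper's default and should be treated as an assumption.

```python
import math

class DyT:
    # Forward pass of a dynamic-tanh layer: gamma * tanh(alpha * x) + beta,
    # with a learnable scalar alpha and per-channel gamma/beta, used in
    # place of LayerNorm. alpha0=0.5 is an assumed default init.
    def __init__(self, dim, alpha0=0.5):
        self.alpha = alpha0
        self.gamma = [1.0] * dim
        self.beta = [0.0] * dim

    def __call__(self, x):
        # x: list of `dim` activations for one token
        return [g * math.tanh(self.alpha * xi) + b
                for xi, g, b in zip(x, self.gamma, self.beta)]

dyt = DyT(dim=4)
out = dyt([-100.0, -1.0, 0.0, 100.0])
# Extreme activations are squashed into (-1, 1) without computing any
# mean or variance, so no batch or token statistics are needed.
print([round(v, 3) for v in out])  # [-1.0, -0.462, 0.0, 1.0]
```

This makes concrete why the layer is interesting but not a drop-in win: it avoids the reduction that normalization needs, yet as noted above that alone doesn't translate into a speedup on current hardware and libraries.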

 

Allen Roush: And then you have this other one, MetaMorph, where you claim visual generation can emerge as a byproduct of visual understanding with instruction tuning. In your other works you talk about recipes, so do you think understanding first, generation second is a general recipe we should think of as a guiding principle for making models?

 

Zhuang Liu: I don't have a clear answer on whether we should always aim for a unified generation-and-understanding model; we are still exploring this. One of our ongoing projects explores whether generation helps understanding. For example, you ask a model a really difficult question and the model first generates intermediate reasoning, a kind of chain of thought, but with images. We found this is effective only in a very limited set of cases, so that direction is still not certain yet; we haven't been able to really get it to work, to be helpful. But the other direction, understanding helping generation, I do think is more plausible. At least you can reason in the language space, or do some visual understanding of your input image before editing it. I think that's obviously going to work: think of prompt rewriting, right? Some generation systems have a prompt-rewriting stage, and that's exactly an understanding model. It tries to understand, to deduce what should be there and how you want to arrange the objects, before proceeding to generate the images. I think that's the more plausible direction. So whether we should have one model for both generation and understanding, I think that's still an open question. Yeah.

 

Ravid Shwartz-Ziv: Okay. Do you have anything else you want to add or talk about?

 

Zhuang Liu: No, I don't think there's anything on top of my head right now.

 

Ravid Shwartz-Ziv: Okay, so thank you so much for coming. It was a pleasure. And thank you, Allen.

 

Zhuang Liu: Thank you so much for inviting me.

 

Allen Roush: It's always a pleasure.