Training Is Nothing Like Learning with Naomi Saphra (Harvard)


Naomi Saphra, Kempner Research Fellow at Harvard and incoming Assistant Professor at Boston University, joins us to explain why you can't do interpretability without understanding training dynamics, in the same way you can't do biology without evolution.
Naomi argues that many structures researchers find inside trained models are vestigial: they mattered early in training but are meaningless by the end. Grokking is one case of a broader phenomenon: models go through multiple consecutive phase transitions during training, driven by symmetry breaking and head specialization, but the smooth loss curve hides all of it. We talk about why training is nothing like human learning, and why our intuitions about what's hard for models are consistently wrong - code in pretraining helps language reasoning, tokenization drives behaviors people attribute to deeper cognition, and language already encodes everything humans care about. We also get into why SAEs are basically topic models, the Platonic representation hypothesis, using AI to decode animal communication, and why non-determinism across training runs is a real problem that RL and MoE might be making worse.
Timeline:
(00:12) Introduction and guest welcome
(01:01) Why training dynamics matter - the evolutionary biology analogy
(03:05) Jennifer Aniston neurons and the danger of biological parallels
(04:48) What is grokking and why it's one instance of a broader phenomenon
(08:25) Phase transitions, symmetry breaking, and head specialization
(11:53) Double descent, overfitting, and the death of classical train-test splits
(15:10) Training is nothing like learning
(16:08) Scaling axes - data, model size, compute, and why they're not interchangeable
(19:29) Data quality, code as reasoning fuel, and GPT-2's real contribution
(20:43) Multilingual models and the interlingua hypothesis
(25:58) The Platonic representation hypothesis and why image classification was always multimodal
(29:12) Sparse autoencoders, interpretability, and Marr's levels
(37:32) Can we ever truly understand what models know?
(43:59) The language modality chauvinist argument
(51:55) Vision, redundancy, and self-supervised learning
(57:18) World models - measurable capabilities over philosophical definitions
(1:00:14) Is coding really a solved task?
(1:04:18) Non-determinism, scaling laws, and why one training run isn't enough
(1:10:12) Naomi's new lab at BU and recruiting
Music:
- "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
- Changes: trimmed
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
Ravid Shwartz-Ziv: Hi everyone, and welcome back to The Information Bottleneck. Today we have a great guest, Naomi. She's a Kempner Research Fellow at Harvard and an incoming assistant professor at Boston University. Hey!
Naomi: Hello.
Ravid Shwartz-Ziv: And as always, Allen. Hey Allen!
Allen Roush: Great to be here, and to talk to somebody who's an expert on training, because we haven't had too many folks like that, who work actively in that space, on the podcast yet.
Naomi: Thanks, yeah. I know that you have some people who are focused on trying to do training, but I'm more on the side of trying to understand training.
Ravid Shwartz-Ziv: So let's start from the beginning. Why do we actually want to understand training? Why do you think it's important?
Naomi: Well, for one thing, to draw a little analogy: if we look at a biological organism, and a lot of people in interpretability or in the empirical science of deep learning are trying to move towards reflecting these really established scientific methods in biology, biologists rely very, very heavily on understanding the evolutionary history of that organism. And if we give up on the entire history of evolution, then we start believing some very strange things, because we start treating certain side effects of evolution as meaningful. There can be spandrels, as evolutionary biologists call them, which are things that just happen to develop as a side effect of something else. Or you might have a vestigial trait, like our tailbone. All of these things are generally observable, and if you treat them as meaningful in that final organism, then you're going to draw some really misleading conclusions. And it's similar with language models, or with neural networks in general. At the end of training, you might say: I found a head that has this pattern, it must be meaningful. But in fact, you might discover that it's a side effect of training, or that it was really important early in training but the model has grown past it, and it might even be disadvantageous later on. For instance, if you look at selective neurons in neural networks, neurons that activate very highly for a specific class, like an early internal prediction of that class: they tend to predict that the model is going to perform worse in general, but they're really important to have around early in training as a scaffold. So that's a vestigial thing, right?
Ravid Shwartz-Ziv: Do you know about the Jennifer Aniston neurons? Do you know this story? Yeah.
Naomi: Yeah, yeah, I'm aware of the Jennifer Aniston neurons. Even a biological analogy is potentially going to really throw you off, you know.
Ravid Shwartz-Ziv: So for people who don't know: I think it was something like 20 years ago, when Friends was really popular. Researchers found that when you show people pictures and record specific neurons in their brain, there are neurons that will fire only when you show them a Jennifer Aniston picture. At that point, people thought it was very consistent, that the same neuron would fire only, or mostly, for a Jennifer Aniston picture. But afterwards, apparently, people found out that it's not so accurate: sometimes it fires, sometimes not, sometimes other neurons fire. It's very, very messy and not as clear as we thought it would be.
Allen Roush: Now I'm a little curious here, because at the time she was quite popular with people who are attracted to women, and I'm wondering if there was some relation there. But more seriously, I'm actually curious, based on what you've been talking about, Naomi: does this have anything to do with grokking behavior in LLMs?
Naomi: Yeah, absolutely. So in the grokking process, there's this transition from memorizing data to generalizing, and the classic example... Yeah.
Ravid Shwartz-Ziv: Just a second. Let's start with what grokking is. Most people probably know, but... what is grokking?
Naomi: Yeah. So I mean, the definition is basically delayed generalization. Early on, the model memorizes all of the training data. In a classic learning theory framework, there's this assumption that memorization is something that happens after generalization. But then it turned out that there's something sometimes called benign interpolation happening, where you memorize everything, you get perfect accuracy on the training set, but then as you continue to interpolate within that training set, eventually you achieve generalization by forming the right kind of geometric structures internally. And the thing about grokking, so the classic case of this is a modular arithmetic dataset, the thing about grokking is that it happens at a very specific amount of data on that set. If you go below that, it's just going to memorize; you don't have enough support to actually generalize. And if you go above that amount of data, even in this multi-pass setting, you end up generalizing immediately, very quickly. And that behavior is actually indicative of a lot of other kinds of breakthroughs or phase transitions in training, because this transition from memorizing to generalizing, you might think of it as a transition between modeling strategies. And there are other modeling strategies. You might start by generalizing as an n-gram language model, and then you discover actual syntactic structure, or you learn to have specific modules in the network specialize in specific ways at a particular point, and suddenly the model has discovered a whole new world of how to generalize effectively, right? And the other thing that's significant about this grokking phenomenon is that the original result, like I said, is only present under specific circumstances, especially in the data scale. And that's also very representative of these kinds of transitions.
Often you see these kinds of breakthroughs happen at unstable points, at what you might call an emergence threshold. So emergence is this whole phenomenon that people probably have some exposure to, where there's this idea that models maybe aren't really improving at a task across many, many scales, and then suddenly, at a specific scale, you see it jump up. And when you're at that threshold, if you think about it, the model is kind of precarious. It's almost like it could generalize or it could not; it could learn this task or it could not. And so you have a lot of sensitivity to the initial conditions and things like that. So even in grokking, even in the classic setup, you see runs that take a really long time and don't generalize in that classic grokking period, or runs that actually generalize immediately, because you got lucky with your initialization. And so there are all of these things that are unstable around that kind of, let's say, phase transition. Yeah.
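The modular arithmetic setup Naomi describes can be sketched in a few lines. The fraction of all p² equations used for training is the knob that determines whether a toy model memorizes, groks late, or generalizes quickly; the model and training loop are omitted here, and the 0.5 split is just an illustrative choice:

```python
import itertools
import random

def modular_addition_dataset(p=97, train_frac=0.5, seed=0):
    """Enumerate all equations a + b = c (mod p) and split them.

    Grokking experiments sweep train_frac: too low and the model only
    memorizes, too high and it generalizes almost immediately.
    """
    pairs = list(itertools.product(range(p), repeat=2))  # p*p examples total
    random.Random(seed).shuffle(pairs)
    split = int(train_frac * len(pairs))
    train = [(a, b, (a + b) % p) for a, b in pairs[:split]]
    test = [(a, b, (a + b) % p) for a, b in pairs[split:]]
    return train, test

train, test = modular_addition_dataset()
print(len(train), len(test))  # 4704 4705 (9409 equations total for p=97)
```

Because the full task has only p² examples, the "amount of data" axis is a clean, exhaustively enumerable dial, which is part of why this setting became the standard grokking benchmark.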
Ravid Shwartz-Ziv: But why... what is bringing about this phase transition, right? In the end, most of these things are continuous, right? So why do you think there is this very clear phase transition?
Naomi: Well, it's probably something to do with symmetry breaking. You have saddle points that form when a model is able to decide where to store something: it could go in one head, it could go in another head. It's developing a modular structure. And there's this point that you could call the head specialization phase transition, for instance, in transformers, where there's a simultaneous development of syntactic attention structure, as in heads that focus on the subject of a verb and things like that, and the induction head structures, which are the circuits that allow in-context learning by doing search and then copy, learning from specific examples in the context. These are structures or circuits that really rely on different modules specializing, so they generally have to happen with a saddle point. But in our work, mine and Ravid's, when we looked at the syntactic attention structure that I just mentioned, this specific kind of specialization, we actually saw two phase transitions happening, in terms of the model's generalization behavior and in the overall loss. There are two consecutive big drops in the loss, and the second one depends on the first. And the second one involves the model's observable, external, complex grammatical abilities, learning things like negative polarity licensing, which is how an English speaker knows when they can say the word "anymore." If I say, "I go to the movies anymore," it sounds very strange to most English speakers, but if I say, "I don't go to the movies anymore," that's normal. And that requires a real understanding of syntactic scope. So models develop that very, very abruptly after they have this internal structure. Okay.
So I'm just going to pull back and say: what I just said means that there are at least three phase transitions happening in the models we were looking at, masked language models, because there's always a first phase transition, which you might see as the edge-of-stability phase transition, where the model finds its initial basin. This is just my personal worldview, maybe, but I think if you told me there's one phase transition, I would say, okay, one phase transition. If you say there's two, I would say, hmm, okay, one where you find the basin and one where you find the head specialization. If you say there's three, I'm going to say there are probably not exactly three phase transitions, right? There's probably a lot happening under the surface that isn't visible in that population-level loss that we use to understand training dynamics most of the time. There are probably, and we've seen some examples of what might possibly happen, a lot of smaller concepts that get learned at specific moments, in specific locations in the model, that aren't visible, because when you add up a bunch of little saddle points and phase transitions, it looks like a very smooth loss curve.
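That last point, many abrupt transitions summing to a smooth curve, is easy to simulate. In this toy sketch (not a model of any real training run), each of 500 "concepts" is learned as a sharp sigmoid drop at a random time, yet the aggregate loss has no single step anywhere near the size of the total drop:

```python
import math
import random

def concept_loss(t, t_c, sharpness=50.0):
    """Loss for one concept: an abrupt drop (phase transition) at time t_c."""
    return 1.0 / (1.0 + math.exp(sharpness * (t - t_c)))

random.seed(0)
transition_times = [random.random() for _ in range(500)]  # 500 tiny transitions
times = [i / 1000 for i in range(1001)]
total = [sum(concept_loss(t, tc) for tc in transition_times) for t in times]

# Each individual curve drops by ~1.0 almost instantly, but the biggest single
# step in the aggregate curve is a tiny fraction of the total drop.
steps = [total[i] - total[i + 1] for i in range(len(total) - 1)]
print(max(steps), total[0] - total[-1])
```

Because the transition times are spread out, the sum of hundreds of little cliffs looks, at the population level, like one gentle slope, which is exactly why the loss curve alone can hide this structure.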
Allen Roush: And I just have a quick question on this too. Historically, at least when I grew up learning machine learning in university, circa 2015 through 2017, I was taught about train-test-validation curves. You're supposed to train until your test performance starts getting worse, even though the train performance is getting better, with a validation dataset to give you an approximation of the model's quality on unseen data. And then of course there's cross-validation, which is supposed to be the much better but much more computationally expensive version of this, which also motivated the name of the machine learning version of Stack Overflow that nobody seems to remember existed, called Cross Validated.
Naomi: Right.
Allen Roush: It seems at least since 2021 that nobody, like all this is thrown out the window by double, triple, descent, grokking and similar. What's your take on this whole thing?
Naomi: Okay, so first off, I mean, a lot of these phenomena, grokking, double descent, et cetera, are such early-stage, low-scale phenomena that these visible, big, population-level, in-distribution, random-split train-test phenomena are sort of irrelevant. But what is really, really relevant is edge-case phenomena, out-of-distribution generalization. When you have a specific subset that you're trying to test on at the end of training, suddenly all of these possibilities of overfitting become relevant again. But you're not worried so much about overfitting to the training set, because the model is usually very good at generalizing across the similar small shift from train to test. Now you worry mostly about these edge cases, or specific benchmarks, specific subsets of the data, right? And that's where you get things like: it learned the number of Rs in strawberry, but now it's overfit to that distribution, so it gets the number of Rs in blueberry wrong. Which, I mean, is actually a bunch of other stuff; this is tokenization. If you go into tokenization, then suddenly a lot of things start to look really weird. Although, I'm going to digress for a moment about tokenization. Tokenization is really, really great to talk about, especially with lay people, no, in fact, with scientists, with experts, because a lot of people come in with an assumption that language models learn like people, that the things that are hard for a person are the same things that are hard for a language model. But as soon as you start talking about tokenization as a factor, all of that goes out the window. And suddenly a lot of things that are not human-like in models start to make a little more sense. And you let go of this idea that training is like human learning, which I would say is such an important thing to understand. Training is nothing like learning.
Training is a little bit like evolution, but not exactly like evolution, and it's nothing like human learning. Yeah.
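The strawberry example comes down to what the model actually sees. The token split below is hypothetical (real BPE vocabularies differ), but it illustrates why counting letters is a strange task when the input is a sequence of opaque subword tokens rather than characters:

```python
# A toy illustration of why letter-counting is hard for a token-based model.
# The split below is a made-up example of what a BPE tokenizer might emit;
# the model receives token IDs, never the characters inside them.
word = "strawberry"
hypothetical_tokens = ["str", "aw", "berry"]

assert "".join(hypothetical_tokens) == word
print(word.count("r"))                               # character view: 3
print([t.count("r") for t in hypothetical_tokens])   # per-token view: [1, 0, 2]
```

A human reasons over the character view; the model has to recover character-level facts from statistics about whole tokens, which is why these failures say more about the input encoding than about "cognition."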
Ravid Shwartz-Ziv: Every time someone says something like that, I put on my neuroscientist hat and strongly claim that no, there is no real connection. We don't understand either of them, that's fine, but that's the extent of the connection, right?
Allen Roush: I massively agree.
Naomi: Yeah, the only connection is that we don't understand either of them. Exactly. And it means that our assumptions, for instance, that a model wants to learn a simple thing first and build on it like a human does: terrible assumption. It loves simple things, and it's just going to stay with the simple thing, right? You need to keep it from settling on something simple. That's why we pre-train, because it's impossible to learn the entire distribution of all language with a simple strategy, and then later you can introduce a more specific task that's easier to model.
Ravid Shwartz-Ziv: have a question. So, like, there are different, let's say, axes â time or, like, scaling, right? Like, you have, like, training time, right? Like, you have the data. Let's scale the data. Let's scale the model, right? â specify in specific tasks. â do you think, like, what are the â similarities and the differences between these axes? Do you think they are, like, totally orthogonal to each other or...?
Naomi: Thank
Ravid Shwartz-Ziv: or do you think that there are some phenomenas that are emerging all of them, or some of them together?
Naomi: Yeah, obviously scaling compute through data is not literally the same as scaling it through model size. That's why we have our Chinchilla scaling laws, where you can find an optimal trade-off between these two things, right? But beyond that, multi-pass training, where you scale time without increasing the amount of data, is not the same as just scaling data. And not all data is equal, either. Even in very small synthetic settings, if you look at how models generalize in their grammatical capabilities after exposure to fairly simple English sentences, they learn that they should be applying hierarchical syntax. Like, if I take a sentence such as "the llama that does work enjoys running," and I want to make a question out of it, I would say, "does the llama that does work enjoy running?" And that requires me to know what the actual main verb is. I can't just move a random verb, or move the last verb I saw; it requires me to actually know the entire latent structure of that sentence. And if I want to do that, I need to be exposed during training to grammatical structures that are easier to model by learning that latent structure. They're called center embeddings by linguists. So as you add more examples that are easier to model with the right latent structure, the model is more likely to actually pick up on that latent structure and learn to assume new rules are going to apply that structure. So, pulling back to real cases: if you look at the GPT-2 paper, the main contribution there was not scaling up transformer models. It was not the architecture; these were pretty similar to existing things at that time, and people had scaled these things up. What they did was take the data and remove things that were causing the model to make mistakes, right?
And if you think of it that way, it's very clear that data quality is something you have to think about. That the support for the specific kinds of generalization behavior you're interested in is something you really, really want to think about, right? So I would say, yeah.
Ravid Shwartz-Ziv: But you talk about data quality. How do you actually define data quality? What does it mean? Do we want clean data? And if so, what does clean mean?
Naomi: It's another situation where it's, I know it when I see it. Or you might have to actually test the claims. For example, if you want to do in-context learning, or really complex long-range-dependency reasoning in natural language, you kind of want to have a lot of code in the data. At some point people thought, maybe removing code will make it a better language model, because it's doing language. And then they actually found that it's much worse at handling complex language tasks, because code is so efficient at teaching these kinds of long-range dependencies and reasoning, right? So there are examples of intuitions we might have about what models need to eat during their training that are totally wrong. And that's why we really need a strong empirical science here, because a lot of our intuitions were wrong.
Allen Roush: So far it seems that the examples we've been talking about are all in English. Unfortunately, it's my only language. Ravid, I believe you speak at least Hebrew and possibly more than that, and I don't know how many languages you speak, Naomi. What I'm curious about is how everything we've talked about translates over. I think Chinese is the second most well-represented language in most language models today, and it's very different from English. So I'm just curious if you have any insights into how different languages operate in the context of what you've said so far.
Naomi: Yeah, so I am embarrassingly monolingual. It is embarrassing because I do sometimes work with linguists, and linguists will say, it's not about how many languages you speak, but find me a monolingual linguist. Anyway, in this case, I would say there are a lot of really interesting things we're seeing, for instance, in Chinese models that tend to use both English and Chinese in chain-of-thought reasoning, in their thinking traces. These kinds of things are really interesting to see, because we can maybe start to ask which languages allow you to express things better. But it's also pointing to something about how models seem to work, where multilingual models seem to have some kind of interlingua in which they can transfer tasks between languages. And in that interlingua, things look an awful lot like English in, for instance, Llama or another English-dominant model, but they aren't exactly English. So it's a lot like you're translating from Bulgarian into English and then back, but it's not literally the same as doing that, because then you wouldn't be very fluent in your Bulgarian. It's very obvious when you're translating something directly from English.
Ravid Shwartz-Ziv: I think it was DeepSeek, right? Before the RL part, apparently it would just change languages, it could switch between languages during the chain of thought, right? And then the RL part just aligned it, and it was okay.
Allen Roush: And.
Naomi: Mmm.
Allen Roush: I also have to quickly interject here: I am pretty sure Interlingua is actually the name of one of those constructed languages, like Esperanto. I'd have to check, but I swear I've heard of it before as its own thing. Just an aside, but please go on, Naomi.
Naomi: It's a term that linguists sometimes use to describe a fantasy latent space that's language independent, which maybe kind of doesn't really exist if you believe that language has some effect on the way that we express and think about semantics.
Allen Roush: The Sapir-Whorf hypothesis, right? If I'm remembering that correctly.
Naomi: Yeah, so I am not a linguist, so I am gonna defer and not talk about the Sapir-Whorf hypothesis because anything that I say about, for instance, studies of Russians who are able to differentiate better between dark and light blue because they have two different words is gonna make a linguist very unhappy.
Ravid Shwartz-Ziv: So, what do you think? Do you think the way to improve models in more rare languages is to what? To start with a very good base model and train them, or to generate a lot of synthetic data, or to translate using models?
Naomi: I think this is a really good question, and it's not one that I think has been completely answered. There is some work from here at Harvard, possibly a Google paper, with Martin Wattenberg among the authors, on what kind of data you need to introduce to create enough support between languages to allow the geometry to develop, so that you can actually compose each language with each task. So you do need diversity, enough diversity to make it worth it for the model to align the languages with geometries that compose, that don't interfere with each other. It needs to be able to store Bulgarian in a way that's a little bit separate from a biology question, in order to transfer these things across, right? In a modular way, basically.
Allen Roush: And related to this, what's your take on the Platonic representation hypothesis? Ravid also recently sent me a paper that responds to it with basically an Aristotelian representation hypothesis. Maybe, if you know what these are, please define them; I assume you do. If not, I can define them as I understand them. But please do.
Naomi: So, the Aristotelian paper is on my list. Who has time to read every paper that they need to read? I don't know. But the Platonic representation hypothesis, I would say, is this idea that there is an aligned representation between all of the different modalities or ways of expressing some underlying reality. So I haven't read the Aristotelian paper, but my impression and my belief about this is that it's quite easy to take all of these things that are essentially language-based. Let's say you have, for instance, an image classifier. An image classifier is a mapping between images and words; it's not image-to-image in this case, right? A mapping between images and words is underrated as a multimodal task. Image classification is a multimodal task. It expresses the things that humans care about enough to encode in language. And this is why you can never leave language on the table, by the way. I am a language modality chauvinist, and I can talk about that forever, but I'll leave that aside for the moment. So if you look at, for instance, ImageNet, which is based on WordNet, an ontology developed by NLP people: WordNet differentiates between Beagles and Cocker Spaniels, right? But it doesn't differentiate between different pigeon lineages, because those are not things that humans routinely express with language, because we don't care that much about pigeon lineages, right? So already, from image classifiers, you have a mapping from images onto language. To me, this is a huge element in why there's already this alignment, because you've literally got a multimodal representation to start with in image classification. Now, for image reconstruction, for settings that are purely monomodal, not language-based, you might say this alignment is also happening, and it's very impressive.
But also, these are situations where the underlying geometries might need, let's say, modularity, or might have properties that we aren't already aware of, or maybe geometers are aware of but machine learning researchers aren't, that make it easier in general to encode structures, especially hierarchical structures. So these are, I think, two things that could make it easy to find certain kinds of alignments.
Ravid Shwartz-Ziv: Okay, I would love to hear your opinion about interpretability in general. What do you think about SAEs, for example, or other tools? There's a lot of work in this field, right? Every time there's a new method that says, now we can understand language models, now we can understand what they're thinking. What is your take? Because I'm a bit skeptical most of the time, I must say. By the way, for those who don't know, sparse autoencoders: the core idea is that you decompose the residual stream activations into sparse, interpretable features, and then you look at these features. There are a lot of works, different versions. But it's not clear; I think in general, in the field, there are mixed opinions about what is real and what is an artifact.
Naomi: Yeah, so let's say there are two questions here: what do I think of interpretability, and what do I think of sparse autoencoders? I think interpretability is actually really crucial. I think that if we use tools, we should understand those tools. We have spent decades understanding the aerodynamics of airplanes. Why? Not even necessarily because it actually lets us build so much better airplanes, but because we should understand the tools we use. I believe this on principle; I think that science is worth doing. This is maybe not a practical position, but I also think it could potentially be practical, and I'll talk about that later. Sparse autoencoders are basically topic models. Topic models are this really classic approach to clustering, especially for language data. And sparse autoencoders are basically topic models trained on the internal representations of the model. The problem with this as a naive approach, especially with sparse autoencoders, which are overcomplete, meaning there are more dictionary entries than there are input dimensions, is that you have many, many, many things that you could potentially find meaning in. So the sparse autoencoder approach is basically topic modeling on internal representations of something that has very, very simple surface-level properties, which is language data. It's very easy to extract these kinds of interpretable structures and patterns from language data. And it's really hard to show that this is a meaningful representation of the model's processing, especially because random seeds can lead to very different sparse autoencoder features. So it's possible that you're finding features that you could have extracted from the input data with a simple model, a model as simple as an SAE, like a two-layer model.
And if so, it's not clear what you have shown about how the model processes that. I think this is a really important distinction. I think a lot of work from Anthropic recently has been trying to tackle that criticism, that it's not really identifying processing steps, which is what a mechanism is, if you're talking about mechanistic interpretability. They've been working on things like transcoders, which specifically try, instead of just clustering a representation, to serve as a proxy for the computation or processing between a layer or set of layers. And I think that's a more promising approach, personally, although there's still potential criticism: it's still not causal. Personally, I am not married to the idea that interpretability needs to be causal. From a practical standpoint, maybe you care about being able to say, I've reverse-engineered it, I've found where specific steps happen and exactly what those steps are. But I think that if we can use interpretability to predict the model's actual behavior on inputs, we can model it at a computational or an algorithmic level and not necessarily at a mechanistic level. These are distinctions that come from Marr's levels, basically three different levels at which you might understand a system like a brain or a neural network, right? If you just wanted to be able to predict what kinds of things the model will do, you don't need to know where it does them. And I think that focusing only on localizing mechanisms, maybe it'll work, maybe it won't, but it doesn't have to be the only objective we have. We can find little signatures that are left behind by algorithms, and we've done some work on this: we've shown, in some toy settings, that you can find these signatures that are predictive of unseen model behavior, of how it will behave under future distribution shifts.
You look at models that behave exactly the same in distribution in terms of their outputs. You look inside, and you know which rules each model seems to be using, in the sense that you can correlate what you see with specific out-of-distribution behaviors, even if the way you identified the rule is not actually how the model implements the rule. It's not a mechanism that's executing the rule; it might actually be something that's counteracting the rule, but it indicates what kind of structure the model is aware of. And so you can say: this model will generalize as a parenthesis-balancing system and not just as a counting system, or whatever.
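That counting-versus-balancing distinction is easy to make concrete: two rules that agree on typical inputs but generalize differently out of distribution. This is a toy sketch, assuming inputs are strings of parentheses only; it illustrates the idea, not the actual probing method from the work she describes.

```python
def balanced_by_counting(s: str) -> bool:
    """The shallow rule: equal numbers of '(' and ')'."""
    return s.count("(") == s.count(")")

def balanced_by_stack(s: str) -> bool:
    """The true rule: depth never goes negative and ends at zero."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

in_distribution = ["()", "(())", "()()", "((()))"]
# Both rules agree on all of these, so in-distribution outputs cannot separate them...
print([balanced_by_counting(s) == balanced_by_stack(s) for s in in_distribution])

# ...but an out-of-distribution probe does:
probe = ")("
print(balanced_by_counting(probe), balanced_by_stack(probe))  # True False
```

Two models can therefore be output-identical on the training distribution while carrying internal structure that predicts which of these generalization behaviors they will show.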
Allen Roush: And thank you for reminding me about the existence of topic models, which at one point I was actually a big fan of, before the LLM revolution. It reminds me also of my favorite type of learning of them all, and one that I really want to see make a comeback someday in a broader use case instead of just occasionally being helpful in papers: unsupervised learning, which seems to encompass both dimensionality reduction and clustering. I'm curious about your opinions on both of these and how they interact, I guess, with this interpretability question.
Naomi: Yeah, well, I think that unsupervised learning is great, and a lot of interpretability techniques that are bottom-up, like SAEs, are basically unsupervised clustering techniques, which I think is potentially valuable. Either you'll learn something about the model, or you'll learn something about the data that would have been harder to extract directly. I would love it if we did that, because to me one of the most promising areas of interpretability is actually clawing back some human scientific understanding from black-box scientific discovery models: protein folding models, weather predictors, and so on. If we can understand what these models know that we don't, and they do know something that we don't, because they are much better than we are at efficiently folding proteins, then we can bring that back to human scientific understanding. Because you haven't done science just by building a really good predictor, right? From that perspective, unsupervised methods are going to be the way to bring that back in, and I think it's really a valuable direction for sure.
Ravid Shwartz-Ziv: But don't you think that maybe these decisions are just too complicated, right? Like, there is no simple rule that we can understand. Maybe it's just a huge function with a lot of, I don't know, unrelated factors that we cannot understand, cannot quantify, and cannot separate. That's it. This is the decision, and we just can't understand it.
Naomi: So let's think about weather predictors, right? We cannot predict the weather, and there are many complicated things that we are not going to be able to fully understand. But if you accept the idea that humans should do science, then we have actually spent a lot of effort trying to learn rules, or ways of expressing, these really complex phenomena. And if it is possible to have a simplified proxy, to have this black-box model, which is probably much simpler than the actual weather, then maybe that's a way we can extract back some of that simplified understanding. Because humans are all about simplified models of natural phenomena. So if we can find new ways of developing effective simplified models of natural phenomena, great. We don't have to know the entire system, right? We already can't understand the entire system; that's why we trained the model.
Ravid Shwartz-Ziv: But these are two different answers, right? Or two different questions. Do you think there is a very simple solution and we just need to find it, a very simple explanation or method that we can understand? Or are we just chasing a simplification or approximation of the real decision?
Naomi: Are you asking whether we should do science? Or are you asking what I think? Yeah.
Ravid Shwartz-Ziv: I'm not saying what we should do, I'm asking what you think. In principle, what do you think? Is there a simple solution that we just need to find, that in some very optimistic world we will find? Or do you think we are just chasing approximations and simplifications, and it will always be an approximation?
Naomi: So I'm a full believer in "all models are wrong, but some are useful." Maybe "useful" here would mean something like enlightening: it helps us gain a little piece of understanding, right? And the black-box model that we've trained is potentially useful, already wrong, but maybe just wrong enough that we can get a better grasp of what's going on in there than of the actual entire phenomenon. And there's been work on some really complex phenomena already. Recently, Project CETI, with a C, not an S, published a bunch of work on whale communication. Now we understand some of the things that whales are saying. They say things like "let's go hunting," and they take a little vote on whether they should go hunting. Really cool work. And we could not have done that unless we had large amounts of data, really predictive models, and unsupervised methods that we could use to pick apart which elements of whale speech might be meaningful, right?
Allen Roush: I just want to quickly point out I have been waiting so long for AI to decode what my dog is saying. I want to know.
Naomi: I've been doing collaborations on animal communication lately, and, I mean, animals are just much, much smarter and more complicated than we, or let's say a lot of people, would probably expect. We're even looking at certain species of fish. If you told people that there is a fish with the highest brain-to-body-mass ratio of any vertebrate, they wouldn't believe you. But they exist: there are these electric fish with massive brains that they need in order to communicate, to carry out their really complicated social behaviors, and to sense and model the world around them using these electric senses. The natural world is full of things that we don't understand, and maybe we can understand them a little better by letting a model do the first step of simplification, you know?
Ravid Shwartz-Ziv: I think I heard on some podcast that after 9/11 the whales were really happy. They had to process the sounds of the airplanes, and those sounds made them miserable, and for two weeks afterward there were almost no planes, so they were really happy because they didn't hear all these weird sounds everywhere.
Naomi: Wow. Yeah, there's definitely a lot of interaction we have with the natural world that does not benefit the wider, let's say, more-than-human community. A lot of cetaceans have basically gone extinct from this kind of noise pollution, because they're so dependent on sound. And this is where I would love to see more technological progress: in reducing the need for that kind of human footprint. I don't know, this is a completely orthogonal topic. I'm happy to talk about language models.
Allen Roush: I'll just finish by saying the Hitchhiker's Guide is both a timeless book and movie, and "so long, and thanks for all the fish" is the relevant bit here. But yeah, let's go back to language.
Naomi: So real.
Ravid Shwartz-Ziv: I have a question. What do you think about multimodality? Do more modalities actually help understanding, or do they make it more complicated? Do you need them?
Naomi: Well, as I mentioned, I'm a language-modality chauvinist. Obviously, if people want to generate an image, they can't do it with a pure language model. If people want a robot to fold a towel, they can't do that with a pure language model. But I think all of the reasoning capabilities that we need to handle language input and language output are going to be more efficiently conveyed through language than through any of these other modalities. There's a hope people have that what they call grounding language in some multimodal, multisensory model of the world is going to support the world model in a way that then translates back into better reasoning. I don't believe that. Humans say all of the things we care about. Humans say chickens have two legs. Humans say the sky is blue. We are constantly talking about everything that we actually care about describing, and models can learn that from how we describe it. There is a distributional structure within language that describes the world as we care about it. And there are other things in an image that maybe we don't care enough about to talk about. There might be little movements of pixels that you could find useful if you were doing a predictive model of the next frame in a video, but we don't care. So the things we care about, we talk about. I don't think you get anything out of the other modalities. Although, I mentioned code. Code is an exception; it teaches long-range dependencies very, very efficiently. Someone recently successfully persuaded me that there's a single other text modality that actually does benefit.
Allen Roush: Well, wait a moment. What about DNA?
Naomi: You mean, is a language model going to be better if it is exposed to DNA structure? Is it going to be better at talking about it?
Allen Roush: Well, I don't know, but in terms of teaching long-range dependencies, my understanding is that you can encode it as strings of text, and the bioinformatics crowd has been looking over our shoulders, and vice versa, forever. My understanding is that the dependencies there are very long.
Naomi: Yeah, although maybe there's a better way of encoding it to reduce those dependencies. I guess I don't see DNA as something that will transfer world models to our language input-output setting, because if there were something in the DNA that we cared about, we would still talk about it. It's true that there are scientific modeling problems that are not going to be solved through natural language. I would say DNA is another modality: if we want to understand specifically DNA, we need DNA. But if we want to understand the biological world around us in a way that we can express verbally and have answered verbally, the model is not going to draw on the DNA. This is not to say there's nothing in DNA that we could learn from, just that the model's most efficient avenue to any reasoning task, any world modeling, is going to be through natural language. Natural language is designed to be spoken. Natural language is designed to be learned. Natural language is designed to efficiently communicate everything that we want to talk about, in very rich, infinitely varied ways. And actually, if I can go back to the interpretability thing for a moment: I said I'm really interested in interpretability on scientific domains, but where I come from is interpretability in natural language modeling. Where we're at right now, we've made a lot of advances in understanding language models, and we haven't necessarily made a lot of advances in understanding other models and other domains. The reason is that language is the best modality, and every human is a domain expert in language. You cannot develop interpretability methods and new understanding unless you are a domain expert in that data, unless you can recognize the patterns in that data, and roughly every human alive is a domain expert in their native language. Very few humans are domain experts in protein folding, and I would say the humans who are domain experts in protein folding are still better experts in their language.
Ravid Shwartz-Ziv: But wait, wait, wait, in images, in vision, we are experts, right? You can walk into a room and understand all the objects that are there, right? You can understand what is happening in nanoseconds or whatever.
Naomi: Okay, we can talk about vision, but I would say that we are not quite as good domain experts at vision. We aren't as good at generating an entire image: humans need to go through extensive training to be able to render a simple scene in two dimensions as they see it, right? Humans are language machines. Language is a very effective way of conveying everything we care about, and images are actually much less efficient and much less cooperative than language, right? The pixels, the photons, are not cooperating. They're not giving you an image of an orange because they are trying to give you all of the important information about that orange; they just happened to hit your eye. And that makes vision a less efficient way of learning and of communicating the world model that we care about. Look, Broca's area in the brain has been around for, I don't know, less than two million years, apparently, and we've built civilization on it. Whereas we've had a visual cortex for like 500 million years, and a random human still can't do as much with vision as with language, I would say.
Allen Roush: That's a rare example of Moravec's paradox being a little bit less right, because his whole thing is that the longer you've evolved to do something, the better humans will be at it relative to AI systems. I'm also... gosh, I'm forgetting my question now, which is really awkward, so maybe I'll pass it to Ravid and ask after. Oh my goodness.
Naomi: I'll just say that coevolution is very powerful. Multi-agent settings are really promising because coevolution is very, very powerful, and language coevolved with us, and photons didn't.
Ravid Shwartz-Ziv: Yeah.
Allen Roush: I'll also point out there's an old Schmidhuber paper on cooperative coevolution as an alternative to standard neuroevolution, which I happened to reimplement, just randomly, without knowing who Schmidhuber was, as my first paper ever while trying to learn. Yeah!
Naomi: A rite of passage.
Ravid Shwartz-Ziv: Yeah, I'm just not sure, you know, that I agree it's a bad thing that images contain more redundancy and information, right? Because, for example, in self-supervised learning, this is what we actually want: we want redundancy in order to learn good patterns. So, if you rely on...
Naomi: Yeah.
Ravid Shwartz-Ziv: ...something you already have, you know, from evolution or whatever, and you just need to figure out how to use it, then maybe redundancy isn't good. But if you want to actually learn something from scratch, redundancy is probably a useful property.
Naomi: I don't think that's necessarily true. I mean, especially in the case of images, like I said, there's an underlying structure that we care about, and we express that in language. A panda is not its fur; it's its entire shape and pattern. But the texture is the thing the model picks up on first, and the model will tend to attend more to that texture, which we don't necessarily think of as the key element here. It might attend more to the environment when it's inferring that this is a panda; it looks at the bamboo, right? All of these things do not reflect the structure of the world as we see it, as the actual people who walk around, interact with the world, and understand it. So that redundancy can actually mislead the model in forming its understanding of the world's structure as we understand and care about it.
Allen Roush: What do you think, then, of taking a point cloud or neural radiance fields or something similar, which try to give you the full picture, in principle, if they've done their job correctly?
Naomi: Hey, wait, sorry, can you repeat that? Taking a 3D point cloud or neural...?
Allen Roush: Yeah. One part of world models is that they're supposed to render a whole world around them, like Project Genie, which, as I recall, is using a lot of neural radiance field and point cloud data. And of course, there's world models as Yann LeCun articulates them, and then there's world models as they actually seem to be created in the real world, which is these video models with interactivity, which seem to be really improved by giving them the so-called complete picture, in some cases using LiDAR to create these neural radiance fields in 3D. I'm just curious: does it change your opinion a little on redundancy not being that useful when, instead of an image or even an incomplete video of something, you have in principle a complete 3D reconstruction? And I also remembered the question I'm going to follow up with.
Naomi: Okay, I'll just quickly answer here. I don't want to spend this whole time talking about computer vision research, because I don't believe that there's nothing there: computer vision research has advanced super far in having models learn rules of physics and so on in vision. And I also believe that there may be certain kinds of reasoning that require agency and engagement with the world to effectively learn, and we actually have those, because we're doing RL tuning and so on, right? I don't think that a language model can do computer vision tasks without training on computer vision. I just don't think that what a language model does wouldn't count as fairly rich reasoning. And I think the underlying structure it's using is definitely a rich representation of the things humans care about and the structure of the world as we see it. Yeah.
Allen Roush: And my quick question that I did remember now: it seems that one of the easiest and best ways to generate good-looking images with language models, and we've seen massive improvement on this front, like the pelican-on-a-bicycle test from Simon Willison, is SVG, or whatever that format's full name is. Basically, once you get easily encodable languages for images, which I believe SVG is, it has been
Naomi: Yeah. Yeah.
Allen Roush: making just leaps and strides. So I'm curious about your thoughts, or maybe we can avoid this if you think it's still too much image stuff, but I think it's LLM-related.
Naomi: I'll be honest that I'm not a multimodality person; I'm a language model chauvinist. So I don't know that much about the area and how it works. But what I would say about these models is that they have natural language input: they are mapping a pelican on a bicycle onto the phrase "a pelican on a bicycle," which is a super efficient way of encoding and communicating that concept. SVG files are a much more efficient encoding of many, many things than raw images, but nothing is as efficient as saying the phrase "a pelican on a bicycle."
Ravid Shwartz-Ziv: I have a question. What do you think in general about world models? There are a lot of discussions, you know, about whether we can get to the next level, like all of Yann's arguments about world models, about separating the generation process from the learning. Do you think it's necessary?
Naomi: Well, humans are pretty heavily reliant on generation, or prediction, as part of our learning, as part of our interaction with the world, even processing, right? There are so many predictive elements in our speech processing as well as our production. So I think predictive models are a really key element of any kind of world model. We need to be constantly predicting. This is also true of even just our ability to move in the world: so much of our core capacity to move is stored as predictive ability in our brains. So obviously that's very important. I love the conversation people have about world models, and I also kind of hate it, because it's such a philosophical discussion where what people are really arguing about is how they're defining a world model, and there's constant misalignment between people's conclusions that is really just about how they're defining this term. I prefer to think about how these internal representations or models interact with model capabilities, especially generalization on edge cases or out of distribution, because those are measurable, right? It's like the consciousness question. World models are like consciousness: you'll never be able to say, well, maybe never, maybe not never, I don't know. We just don't understand these phenomena well enough to say whether a model has consciousness. So it's better to just ask: what are the capabilities of this model? Can it convert my citations into accurate BibTeX?
Allen Roush: Unfortunately, an apparently harder task than it seems, as we've seen from the large number of hallucinated citations in papers recently.
Naomi: Very much so. This is such a case of the things that are hard for language models not being the things that are hard for humans. And unfortunately, some of the most tedious tasks that we would like to outsource are also the tasks that the models are worst at.
Allen Roush: The jagged frontier is what I've heard this referred to as.
Naomi: I like that, I like that term.
Ravid Shwartz-Ziv: So do you think coding is not a solved task? I think I saw a post about it from, I don't remember, the one who invented Claude Code. He said that now he has 20 to 30 PRs a day, and coding is a solved task, we just need to ship everything faster, and all these buzzwords. What do you think about it?
Naomi: Look, I have been astounded by what we're able to do with code models right now, but I still see gaps. Especially when I talk to people who do efficiency work, who do systems-level stuff, the models are not there. They are not able to make decisions about vectorization or to be cache-aware. Or at least the ones that are out; maybe, you know, Anthropic's secret next Claude Code version is actually really, really good at that, but the models that are public right now are not good at some of these tasks. I would say that most of the code people write does not involve judgments beyond the current language models, the current coding models, but it's not solved as far as the models in production are concerned.
Allen Roush: Well, and I find it remarkable. There's a paper, and I don't know if they tested on Opus 4.6 or Gemini, certainly not Gemini 3.1, that showed that pretty much all language models, even the closed-source ones at the time, are terrible GPU kernel programmers, which is some of the hardest programming there is. And I would also point out a few other domains where I regard the average programmer as extraordinarily skilled, certainly in comparison to us; I joke that we AI researchers are terrible programmers by most metrics, because we don't have to be good at programming to do things, in a lot of cases we have to be good scientists. But game programmers, for example, are dealing with million-line code bases. And then there are the people maintaining old COBOL, FORTRAN, or even esoteric languages that are used in production. I'm sorry to anybody who's mad that I called Haskell esoteric, but it is. So what's your thought on this whole distribution
Naomi: Right.
Allen Roush: gap, where Python and Java and C++ make up, I would guess, well over 70% of all the code generation data that we have?
Naomi: This is also true, I would say, in the math domain. Everyone is very impressed by Erdős problems being solved, and there are a lot of those, but the ones that have been solved are not necessarily a really diverse set of problems, right? And IMO scores are great, but there are very particular kinds of math that show up in an IMO. I have had some bad experiences trying to use these models to answer questions about combinatorics, and even worse experiences trying to ask questions about automata theory and formal grammars. These are not even that obscure, to me, but...
Allen Roush: No. Chomsky is probably smiling somewhere hearing all that.
Naomi: Yeah, I mean, it's actually very frustrating to try to talk with these models about the areas of math that I care the most about. And then I talk to someone who's a learning theorist and they're like, it's incredible, they write the whole proof for me.
Ravid Shwartz-Ziv: Okay, anything else that you want to say, to talk about, to promote?
Naomi: Sure. I'm just going to take a second to talk about non-determinism, my favorite thing right now. When we look at scaling laws, people are so into this smooth power law, right? You scale up and up and it goes real smooth and nice. But there are certain things that don't scale as smoothly as the population loss, as I've mentioned. And those factors, these specific edge-case behaviors, are the most relevant thing to us right now, because all of the mistakes models are still making are arguably edge cases. These models are much more iffy on those; scaling laws might be less predictive around these kinds of edge-case behaviors, which are also more likely to vary from one training run to another. And we can't really study this easily, especially at scale, because nobody's running fifty different training runs of the next version of Gemini at its full scale, right? So we don't have a good handle on it. Even if you say, I've developed a great method, I've developed a great process, and all you can show me is one model, then what you've found is a really good set of weights, right? You've found a set of weights that does a specific thing, and that doesn't necessarily mean that your process does that specific thing. And this is getting worse now, because the post-training process involves so much RL, and we're not scaling up to this infinite-width limit where everything converges to a shared solution. We're doing things like adding mixtures of experts. We're not scaling up in the ways that give us guarantees of convergence to a specific solution; we're actually scaling in ways that potentially increase the variance, and increase the uncertainty about whether our methods have particular outcomes.
And I want people to really think about that a lot more, because that's where we have these really interesting distinctions between models, where we can really get new understanding of what counts as a concrete, discrete strategy to a model, where it knows something, or where maybe it just didn't learn that generalization behavior.
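The run-to-run variance point has a minimal analogue even in linear models: when there are more parameters than data points, many weight settings fit the training set perfectly, and gradient descent lands on a different one depending on the random initialization. A NumPy sketch, where all sizes and seeds are arbitrary choices for illustration:

```python
import numpy as np

n, d = 5, 20    # more parameters than data points: many zero-loss solutions
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def train(seed: int, steps: int = 5000, lr: float = 0.05) -> np.ndarray:
    """Plain gradient descent on squared error, from a seed-dependent init."""
    w = np.random.default_rng(seed).normal(size=d)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

w1, w2 = train(seed=1), train(seed=2)
x_new = rng.normal(size=d)

fit1 = np.max(np.abs(X @ w1 - y))   # both runs fit the training data
fit2 = np.max(np.abs(X @ w2 - y))
gap = abs(x_new @ w1 - x_new @ w2)  # but they disagree on a fresh input
print(fit1, fit2, gap)
```

The two "training runs" are indistinguishable on the training set, yet their prediction gap on a new input is driven entirely by the seed, which is the toy version of judging a training process from a single set of weights.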
Ravid Shwartz-Ziv: So, wait, I have a question about that. What are the factors that actually affect the solution we converge to? Say I'm training two of the same model, right? I'm running them; what will affect the point that they converge to?
Naomi: The random seed could affect it, very small decisions about the specific data, potentially data order, although I'm not entirely sure whether random data orders have any real likelihood of introducing real variance. I don't know about that. In reinforcement learning, there's a lot of degeneracy that can happen; models can end up with multiple solutions. And when we say at the end of it, this is because of the data, we don't even know, right? There are certain behaviors that are really noticeable that might be because of the data that's used, might be because of the architecture that's used, and might be a random coincidence. Em dashes show up a lot in all of these models, so we have a really good reason to think em dashes are something that's going to be inevitable, something favored by the model. That gives us enough information to dig in and ask why. In the case of em dashes, I would say it's the tokenization. Bringing it back to tokenization, right? Em dashes save you a token, because there's no whitespace around the em dash in the American style of using it. In the British style you use spaces, but the models don't use the British style; they use the American style, they skip the spaces, they save a token. So when you actually see a behavior repeat over and over again, across different data sets, across different training runs, then you know you can attribute it to something about your actual method. And we don't have that for many things right now.
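The space-counting intuition can be illustrated crudely. This is a toy whitespace split standing in for a real BPE tokenizer; with an actual BPE vocabulary the details differ, but the spaced style still spends extra tokens on the dash and its surrounding spaces.

```python
# Toy illustration only: a naive whitespace tokenizer, not a real BPE vocabulary.
def toy_tokenize(text: str) -> list[str]:
    return text.split()

american = "models\u2014like this one\u2014save the spaces"    # no spaces around the dash
british = "models \u2014 like this one \u2014 save the spaces"  # spaced style

print(len(toy_tokenize(american)))  # 5
print(len(toy_tokenize(british)))   # 9
```

Under real BPE, the unspaced form often lets the dash merge into an adjacent token, which is the token-saving effect described in the conversation.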
Allen Roush: Do you have opinions on things like the lottery ticket hypothesis? And specifically, I've always been a critic of backpropagation and local optimizers, and always preferred global optimizers that are forward-pass only, even though they haven't been in vogue for probably more than ten years. Do you have opinions on these things, just before we go?
Naomi: Yeah, I don't have really strong opinions on this, although I think a lot of why models work right now is because of phenomena in the training process that we probably don't fully understand. Even looking at the lottery ticket hypothesis: it's not the case that at scale you can just prune right away down to your perfect subnetwork and it learns perfectly well. There is something about the context of the noise around it, or the context of the higher-dimensional space, that actually allows the learning to happen early on. So many of these ideas seem more plausible when you don't look at them through the lens of training, and then looking at the training process complicates things. So our beautiful visions around, for instance, global optimizers: maybe they'll work, maybe they won't. If they work, it probably won't be because they give you a better approximation, but because they do something to the training process.
Ravid Shwartz-Ziv: Anything else that you want to promote? Students that you want to recruit?
Naomi: Yeah, I am going to be recruiting students for my lab. This year I'm starting my lab, and I'm going to keep hiring students for the first couple of years. So I am happy to talk to people at Boston University about Boston University. We have a really exciting group starting up that's really focused on these scientific questions around language models especially, and they've just hired a bunch of my favorite scientists in this area all at the same time, so it's pretty exciting.
Ravid Shwartz-Ziv: Yeah, so go work with Naomi. Thank you so much, it was a pleasure.
Naomi: Yeah, always good to see you, and good to meet you, Allen.
Ravid Shwartz-Ziv: Yeah.
Allen Roush: Likewise.
Ravid Shwartz-Ziv: Thank you, Allen. And see you next week.