May 16, 2026

Language, Cognition, and the Limits of LLMs - with Tal Linzen (NYU/Google)

The Information Bottleneck

We host Tal Linzen, Associate Professor at NYU and Research Scientist at Google, for a conversation on the intersection of cognitive science and large language models.

We discussed why children can learn language from around 100 million words while LLMs need trillions, and the surprising finding that as models get better at predicting the next word, they become worse models of how humans actually process language. Tal walked us through how his lab uses eye-tracking and reading-time data to compare model behavior to human behavior, and what that reveals about prediction, working memory, and the limits of current architectures.

We also got into nature versus nurture and how inductive biases can be instilled by pre-training on synthetic languages, world models and whether transformers actually use the geometric structure they encode, the BabyLM challenge and data-efficient language learning, and what mechanistic interpretability can offer cognitive science beyond just fixing model bugs. The conversation closed on academia versus industry, the role of PhDs in the current AI moment, and how AI coding tools are changing the way Tal teaches and evaluates students at NYU.


Timeline

  • 00:13 — Intro and what cognitive science means
  • 02:16 — Using computational simulations to understand how humans learn language
  • 05:26 — How children learn language vs. how LLMs are pre-trained
  • 07:53 — Why mainstream LLMs are not good models of humans
  • 10:07 — Comparing humans and models with eye-tracking and reading behavior
  • 13:52 — Sensory modalities, smell, and how much you can learn from language alone
  • 16:03 — Animal cognition and decoding animal communication
  • 17:00 — Nature vs. nurture, inductive biases, and what transformers can and can't learn
  • 21:21 — Instilling inductive biases through synthetic languages
  • 27:34 — The bouba/kiki effect and cross-linguistic sound symbolism
  • 28:33 — Latent causal structure in language and whether models discover it
  • 31:13 — Does knowing linguistics help build better models?
  • 35:07 — World models: what they mean, and why transformers encode geometry but don't use it
  • 39:13 — Tokenization, and why Tal doesn't like it
  • 41:35 — Scaling laws and the inverse-U curve of model quality vs. human fit
  • 44:34 — Where the human–model mismatch comes from: architecture, memory, and data
  • 47:08 — Diffusion language models and sentence planning
  • 48:21 — Data quality, synthetic data, and curriculum effects
  • 50:54 — Comparing models at different training stages to human development; BabyLM
  • 54:40 — What level of the model should we actually probe? Representations vs. behavior
  • 1:01:04 — Mechanistic interpretability, Deep Dream, and human dreaming
  • 1:02:11 — Cognitive neuroscience, intracranial recordings, and working memory
  • 1:10:31 — Should you still do a PhD in 2026?
  • 1:12:31 — Will software engineers lose their jobs to AI?
  • 1:17:43 — Teaching in the age of coding agents: what changes in the classroom
  • 1:20:54 — What's next: human-like LLMs as user simulators, and recruiting

    Music:

    • "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    • "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    • Changes: trimmed

    About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Ravid Shwartz-Ziv: Hi everyone and welcome back to The Information Bottleneck. Today we have a great guest, Tal Linzen. He's a professor at NYU and a research scientist at Google. Hey Tal, nice to meet you.

 

Tal: Nice to meet you as well.

 

Ravid Shwartz-Ziv: And as always, Allen is here. Hey Allen.

 

Allen Roush: It's great to be here and great to meet Tal for the first time.

 

Ravid Shwartz-Ziv: Yeah, so today we are going to talk mostly about the connection between cognitive science and LLMs. So let's start maybe from the beginning. What does cognitive science even mean, and why is it important?

 

Tal: There are different definitions of that term historically. The way that I use it is more on the human side of things: understanding the human mind from different angles, from the angles of psychology and linguistics, but also from the computational angle, trying to develop computational simulations and computational techniques to understand the human mind. But the connection between the human mind and the artificial minds that we create is also arguably part of cognitive science. So we are studying what it means to think quite generally, with a focus on people. But when studying those questions, we also learn something about other systems that might be able to think.

 

Ravid Shwartz-Ziv: So there are two ways I think that you can look at it, right? From humans to machines and from machines to humans. How do you see it? Which direction do you like more, or do you think these are things you can combine?

 

Tal: Yeah, I enjoy doing both of those things. I use computer programs, I guess now most of them are of the kind that's known as AI, to understand humans. The way that we would do that is we would develop simulations that have different assumptions about how we think humans might work. In my case, I'm mostly interested in language. So it would be models of human language acquisition or of comprehension: how people read a sentence, understand its meaning, and do things in the world with that meaning. To understand these mental processes in people, we can develop simulations that make different assumptions. So let's say we have a few different theories about how people learn language; we can implement all of those theories in computational simulations and see which of those simulations matches the human behavior the best. And that can teach us something about how humans work.

 

Ravid Shwartz-Ziv: What does that mean in practice, these simulations? What do they look like?

 

Tal: Well, let's say that you think it's really important for people to have a certain kind of exposure to language. Let's say that to be able to learn this aspect of the grammar of English, you have to hear a certain kind of question or a certain kind of sentence structure when you're at a certain age. So we can test if that's the case. We can take a model, let's say one of those language models that everyone uses, technologically speaking, but train it in a very specific way. So instead of training it on a lot of text from the internet, we can train it on text that comes from transcriptions of interactions between parents and children, and see what the model learns from that. And we can see if the model is able to learn what children learn just from the data that children are exposed to when they grow up. Then we can see what aspects of the data are crucial for language acquisition. That's just one example.

 

Allen Roush: So when you talk about the data that a child is learning from, for example, it seems like a lot of the way their training works is almost by switching between different types of learning algorithms. So for example, self-supervised learning is how most language models are continuously pre-trained. But I think that at least when they're babies, it's mostly unsupervised, with no direct labels until later. So do you agree with this analysis, and do you think it has any bearing on the cognitive understanding of their minds?

 

Tal: I think that, if anything, the way that children learn language is a little bit more supervised than the way that pre-training of language models works. The current paradigm is to give a language model a corpus and have it predict the next word given the previous words. That's something that children can also do if they want to, and they probably do that to some extent; that's certainly an important aspect of language acquisition. But I think that we also have more directed supervision as children learning language. We have more interaction with the world, and maybe we will see people pointing towards an object when they say the name of that object. Or the baby will be interested in a certain toy or a certain food item, and they will try to say the name of that food item, and if they're successful they'll actually get it; if not, everyone will just be confused and the baby will not get what they're looking for. So I think that there is more signal, rewards from the environment in a sense, early on in child language acquisition compared to how we currently train language models. That's probably one of the reasons that language model training is not as data efficient as children are. That's a big open question, I think, in the field right now: how to make language models as data efficient as people. We have work in my lab exploring different directions there.

 

Ravid Shwartz-Ziv: So what is the answer? You take models, train them using some specific assumptions, and then you compare them to how humans behave, in the cognitive aspects. What is the answer? Is there an answer where you can say, okay, models are like humans in these aspects, or not?

 

Tal: I guess it depends on which model you look at. There are so many different kinds of models that it could be. Yeah. Well, I think there are some aspects where...

 

Ravid Shwartz-Ziv: You tell me.

 

Tal: ...at least the mainstream models that people use online are not good models of humans. And we know that because, first of all, they're trained on much, much more data than people could ever hope to see. There are estimates where people hear a few million words a year, let's say five million words a year. So by the age of 18, you'll have heard maybe a hundred million words. That compares to the trillions and trillions of words that we train our language models on. This has a lot of consequences, where the models are able to memorize a lot more than people, whereas people need to generalize a lot more when they interact with language. It's much more likely that a new sentence you hear is quite different from the sentences you've heard in the past, compared to language models that have seen so many different kinds of sentences that nothing is really new to them. So that's a major difference between models and people. And I think we have seen evidence that actually the better the models become as language models, in terms of making accurate predictions about the following words, the worse they become as models of people. Which makes sense: we're not developing these models to be human-like, we're developing them to be economically useful. Those are very different goals. So if we want to really answer those questions in cognitive science and develop models that are more human-like, we need to go in quite different directions than the direction the mainstream of the field is going.

 

Ravid Shwartz-Ziv: Can you talk a bit about how you compare humans to models? You say that they are not aligned, right? What does that mean in practice? How do you actually measure it?

 

Tal: Yeah, that's a great question. There are different behaviors that we can compare between the models and the humans. One specific human skill that we've been studying a lot in my group is prediction. When you read a sentence, you make predictions about upcoming words, and you often use them to make your reading process more efficient, faster. So if you know what the next word is going to be, or you have a pretty good guess, you might not even read it. You might just look at it from the corner of your eye and confirm your prediction; if it's what you thought this word was going to be, you just skip that word and move on to the next one, or you read it a lot faster just to confirm your prediction. That's one of the reasons that reading is so fast. And to be able to simulate that process, or to understand what goes into the predictions that people make, we can use language models. How do we, as people, guess what the next word is going to be? What kinds of information do we use to make those guesses? We can implement those ideas in language models and see whether the predictions that the language model is making align with the predictions that people are making. We can record reading behavior from people: we have eye trackers where people sit in front of a screen and we track their eyes as they read sentences, so we can see exactly which words are skipped, which words are difficult to read, and that you sometimes go back and reread parts of the sentence. And we can compare that to the simulated reading behavior we would get from a language model. If the language model is superhuman at predicting the upcoming words, it would just read really fast, or it would never have to go back and reread, because everything is exactly what it expected. So we see that the models that are somewhat more limited, either because they're trained on less data or because their memory is more constrained, are better matched with how people read compared to the really cutting-edge superhuman ones.
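
To make this concrete: the standard bridge between a language model and reading-time data is per-word surprisal, the negative log-probability of each word given its left context; low-surprisal words are the ones readers tend to skip or read quickly. Below is a minimal sketch of that computation, assuming the `transformers` and `torch` packages, with GPT-2 purely as a stand-in model rather than anything specific to Tal's lab:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is just an illustrative stand-in for any autoregressive LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def token_surprisals(sentence: str):
    """Surprisal (in bits) of every token after the first, given its prefix."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Align position t's prediction with the token actually observed at t+1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    next_ids = ids[0, 1:]
    nats = -log_probs[torch.arange(next_ids.size(0)), next_ids]
    bits = nats / torch.log(torch.tensor(2.0))
    return list(zip(tokenizer.convert_ids_to_tokens(next_ids.tolist()),
                    bits.tolist()))

# Predictable words get low surprisal; regressing human reading times on
# these values is how model predictions are compared to eye-tracking data.
for tok, s in token_surprisals("The children went outside to play."):
    print(f"{tok:>12}  {s:6.2f} bits")
```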

 

Ravid Shwartz-Ziv: Do you also look at higher-level tasks? I know there are the task-completion time horizons now, where we see that these models are much better than before. Do you also look at more complicated tasks, or tasks that require integrating different sources, something like that?

 

Allen Roush: Um... I don't know, you go.

 

Tal: Yeah, we haven't done that, but I think that would be a great idea. I'm very interested in the time-horizon work, but when you look at the benchmarks, they're very heavily focused on software engineering. And of course, that's a very specific skill that people have. It is really an interesting open question how that will generalize to other domains of human cognition, but we haven't done that yet.

 

Allen Roush: So today, I think there is an information-theoretic argument around how much data humans have at a certain point in life. You can try to sum up the amount of images or video that they're processing throughout their life as well, which I think is a lot. And furthermore, LLMs are trained on image and video as well now. But LLMs, to my knowledge, with maybe some very specific exceptions, have no olfactory data, right, smell. And humans do. So I'm wondering, from a cognitive standpoint, do you have opinions about how important some of those lesser-appreciated senses are?

 

Tal: Yeah, that's a great question. People certainly learn a lot about the world from sensory input. That being said, I'm very interested in language learning, and it turns out that you can learn a lot about the world from language. That's something shown not just by the success of pre-trained language models, but also by cognitive studies of people who do not have access to certain sensory modalities and are still able to learn language very well. Blind people, of course, will have access to a lot of other sensory modalities, but a very major modality is missing from their experience, and nevertheless they learn language very well. So it's possible that some of the importance of perceptual input in language learning is overstated. That being said, you can certainly learn from it; even if it's not necessary, it might still be beneficial and helpful. And I think smell would probably also be beneficial, I don't see why not, but I don't know of much work on that. That's a good question.

 

Allen Roush: Yeah. And then, do you think there's any connection between the work that you're doing and animal cognition?

 

Tal: I have not personally explored that connection, but it's really an active field of research. I just saw a website for a new non-profit that is trying to decode animal communication with modern language-model-related tools, but it's not something that I've worked on. One of my incoming colleagues actually, Ravid knows her, Pratyusha Sharma, has worked in this area, so you should maybe interview her at some point about her whale communication work.

 

Ravid Shwartz-Ziv: And what do you think about nature versus nurture, like genes versus environment? Because there is a long debate, of course, about which is more important in humans. But it's not even clear how you translate this debate to models, right? Maybe inductive biases versus learning, or something like that. What are your thoughts? Do you think you can actually translate it to machines and use some of these constraints in models?

 

Tal: Yeah, I think so. The debate in cognitive science, I guess, is how much you are born already knowing about the world, right? Some things you certainly know how to do the moment you're born; there are a lot of reflexes, let's say, that babies have from the first day of their life. So it's clear that there's a lot of nature and not everything is nurture. Where it gets tricky is with the higher cognitive domains, things like language learning, where there was a big debate, which is still active, between people who think that humans have a certain innate capacity to learn language, one that includes relatively specific properties of language. They know, for example, that there's no language where you ask a question by reversing the words of the sentence, and that helps you learn how questions are actually asked in your language, because you have all these constraints about what languages are possible. So that's one school of thought in linguistics. But then other people are more skeptical of that approach, and they think that specifically with language there are actually very few innate capacities. Now, there is something here that we can actually test with models, because with the architectures that we use in deep learning, we can be pretty sure that those architectures do not have an innate constraint on what languages can look like. These are extremely general architectures that can learn languages, but they can also learn to model sequences of proteins, and we have vision transformers, so with the exact same architecture we can learn a lot of different things. So if this architecture can learn the same thing that a human learns, then it means that it's probably not necessary to have that innate bias. We've done a lot of experiments like that, and it looks like the answer is actually that transformers are not able to learn what people learn about language from the amount of data that people have. Eventually the transformer will learn that thing, but it will take it much longer than people, which suggests that it doesn't have the right inductive biases, or that its inductive biases don't match the human ones.

 

Allen Roush: it.

 

Tal: And we have some work trying to put those inductive biases into the models, such that they can learn natural language faster.

 

Allen Roush: Does everything that you say also apply to non-English languages?

 

Tal: Yeah, absolutely. I don't think that English is easier to learn than other languages or the other way around. I mean, there is a concern that the architectures we develop are a little overfit to English sometimes. So it's possible that with those architectures, just because of the way that architecture search has historically worked in the field, we reward architectures that are useful for English, not necessarily the ones that are useful for languages that have a very different structure. But the general finding, that people can learn language from a relatively small amount of data compared to transformers, let's say, I think is consistent across languages.

 

Ravid Shwartz-Ziv: And what types of constraints are you applying?

 

Tal: Yeah, so that is a very nice question that we're working on right now. Basically, for a long time, people were trying to build the constraints into the architecture. So we have had transformers that create parses of the sentence, like a symbolic parse, and then the transformer or the RNN performs its computations on top of that parse. So let's say, if you have an RNN, instead of just processing all of the words of the sentence from left to right, it will process them in the order that is dictated by that parse: words that belong together in the same constituent will first compose together, and then you compose them with the other words. So that's one way to do it. It's pretty difficult, just technically speaking, and it's hard to extend that to other biases. The bias that these grammar-based models encode is just the bias that grammar is important and that the parse of the sentence is what defines how words should combine. But there are a lot of other things we want to instill in the models that we don't really know how to put into the architecture. So the easiest way to do it is by training the models on different languages. Those can be synthetic languages. So we have the model learn many different synthetic languages, or maybe just a handful of them, and that gives the model an idea of what a natural language is going to be like. That can put the model in the right initial state so that natural language learning goes faster.

 

Ravid Shwartz-Ziv: What is a synthetic language? How do you create one, how do you generate a synthetic language?

 

Tal: So one way you could do that is with a grammar that you construct. The words of the language can be totally made up, but there's still a rule where, to construct a sentence, you have to put the subject first and then the verb and then the object, let's say. And there are a lot of parameters you can vary there: in some languages you can put the object before the verb, in others the word order can be flexible, so you can shuffle it however you want. To describe the grammar of a natural language, you would need thousands and thousands of those rules. So you can create simpler versions of natural language that only capture one aspect of what you want the model to learn, and then show that to the model in a synthetic pre-training stage before you move on to pre-training on natural language.
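
As a toy illustration of the kind of thing Tal describes (the vocabulary and rules below are invented for this sketch; real experiments use much richer grammars), a synthetic language can be as simple as a sampler over made-up words plus a word-order parameter:

```python
import random

# Made-up vocabulary; only the structural rules carry information.
NOUNS = ["blik", "dax", "wug", "fep"]
VERBS = ["gorp", "zup", "mib"]

def sample_sentence(order: str = "SVO") -> str:
    """Sample one sentence; `order` fixes subject/verb/object positions,
    and "free" shuffles them, mimicking a flexible word-order language."""
    parts = {"S": random.choice(NOUNS),
             "V": random.choice(VERBS),
             "O": random.choice(NOUNS)}
    slots = list("SVO") if order == "free" else list(order)
    if order == "free":
        random.shuffle(slots)
    return " ".join(parts[s] for s in slots)

random.seed(0)
print([sample_sentence("SVO") for _ in range(3)])   # verb before object
print([sample_sentence("SOV") for _ in range(3)])   # object before verb
print([sample_sentence("free") for _ in range(3)])  # shuffled order
```

A corpus of such sentences would then serve as the synthetic pre-training stage before training on natural language.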

 

Ravid Shwartz-Ziv: And does it help? How much does it help?

 

Tal: Yeah, it does. You need fewer natural language tokens to get to the same loss compared to if you didn't pre-train on the synthetic data, and also compared to if you pre-trained on the same amount of natural language. So we actually find that those synthetic languages are more effective, compute-wise, in teaching the model certain things that are then useful for natural language as well, which is a cool finding. There's something about those synthetic languages that forces the model to learn structure that is not always as necessary for natural languages, because with natural languages you sometimes have these local dependencies and co-occurrences that can get you pretty far in terms of reducing the cross-entropy loss.

 

Allen Roush: And when you talk about these constructed languages, I'm wondering, would that include things like Esperanto or Interlingua or some of these other actually human-constructed ones? And then the further question is, what about context-free grammars? I ask that because there's CFG-based constrained sampling that you can do. I'm curious...

 

Tal: Mmm.

 

Allen Roush: about how this all interacts with your work.

 

Tal: Yeah, so we actually use context-free grammars to generate these languages. Those are languages that are relatively simple, and arguably natural languages are more complex than context-free grammars can describe; another way to put it is that you need huge, huge grammars to describe natural languages. So yes, we do use context-free grammars. Those are simpler varieties of grammars. The question about constructed languages like Esperanto is a really cool one, actually, and I haven't thought about it enough, but it's possible that those languages could be more efficient in terms of teaching the models certain rules, just because they're more regular. I don't know much about Esperanto, but my sense is that those languages are more regular; there are fewer exceptions and idioms that you need to memorize. And there's this other language, I forget the name of it, Toki Pona maybe it's called, a language that only has a few hundred words. Do you know that one? Yeah, yeah. So to construct more complex ideas, you have to put together all those words, and it's very...

 

Allen Roush: Yeah. Yeah, that is, yeah.

 

Tal: ...compositional in that sense. Everything can be broken down into smaller parts. So I wouldn't be surprised if those languages were really effective in teaching the model that structure is important. But we haven't done that experiment. I think it's a great idea.

 

Allen Roush: And then, related, what about effects around the sounds of language? Here I'm talking about the bouba/kiki effect, if you know what that is.

 

Tal: Yeah. That is another nice question. I am not an expert on this effect. I think what it says is that the sound B is associated with round objects and the sound K is associated with jagged, sharp objects, and that's consistent across many languages.

 

Allen Roush: Yes.

 

Tal: Yeah, that is a really cool finding, but I don't know enough about it to have an opinion. I would imagine that a language model that's trained on many languages would pick up the fact that in many languages that's the case, but maybe that's not so surprising.

 

Ravid Shwartz-Ziv: And yeah, I think it's very cool because, for example, in tabular data, one of the best models is called TabPFN. The idea is that you pre-train a transformer by sampling from some causal graph: you sample questions and answers, random ones, you get X and Y, where the answer comes from traversing some causal graph, and you just sample it again and again. And apparently it learns structure of, I don't know, the world in some sense, right? Like, you have some causal graph in the world and you just need to discover it. Do you think something like that is also happening in language models? Or do you think this is more complicated?

 

Tal: Something like that? Can you explain?

 

Ravid Shwartz-Ziv: So, yeah, do you think we have some causal graph in language that the model needs to discover? Or is there just no structure, and we need to discover it again and again across different types of settings?

 

Tal: Mmm. Yeah, I see what you mean. I think the answer to that is: yes, there is latent structure in language that the model needs to discover, and we also have some evidence from mechanistic studies of the internal activations and attention heads that the models do in fact discover that structure. Maybe not perfectly, and it's always a little bit fuzzy, and it's never exactly the grammar of the language that you would find in a linguistics textbook, but there are certainly representations of grammar inside the model. That's actually an ongoing project of the lab, to understand exactly how they work. Even if you train models on a relatively small amount of data, they're certainly able to generalize to new sentences that they haven't seen before. And I think the only way that you could do that and still produce grammatical sentences is by capturing the causal process that gives rise to those sentences, which we call grammar, I guess, in linguistics.

 

Ravid Shwartz-Ziv: And do you think that knowing grammar, knowing the classical rules and the classical knowledge and literature in linguistics or cognitive science, helps when you want to develop better models?

 

Tal: Whether my knowledge of linguistics helps me develop better models? That's a good question. I think that it used to be more useful than it is now, in the sense that training on a lot of data is probably more effective than trying to teach the model grammar very explicitly.

 

Ravid Shwartz-Ziv: Yeah.

 

Tal: Because our models are, as I said before, pretty good at picking up the underlying latent factors that generate the sentences. What some people might call world models, except here the world is just language. But yeah, I think it is actually pretty analogous. So what this means is that knowing the intricacies of the grammar of a language is not that helpful in teaching the model that aspect of the language; it's just better to give it more examples. Now, there are a couple of caveats to that. One is sample efficiency. I think that if you do want to train models, as I said before, on a low-resource language, a language you don't have a lot of examples from, then being smart about the inductive biases that you put into the model is going to be important, and there linguistics is helpful. The other one is evaluation. I think that knowing how language works can help us develop more targeted evaluations of the model. Instead of just seeing if it is good at predicting the next word in general, and there can be a lot of reasons you're good at predicting the next word, we can focus on specific capabilities that we think are difficult to learn, and often the definition of those capabilities can come from linguistics. That's why we have done a lot of eval work in my group.

 

Allen Roush: One of the sides of linguistics that I've always found most interesting and also least well covered by computational linguistics and natural language processing is etymology. So do you think that the work you're doing has any kind of etymological side or component?

 

Tal: It could. One of my students, just before this meeting, presented this idea of reconstructing a historical proto-language from examples of different languages that are supposed to have descended from it. So if you have, I don't know, Old English and Old Slavic and whatever, you can try to reconstruct the Proto-Indo-European language that hypothetically was originally spoken by the people who then split up and developed those other languages. So yeah, I think we can definitely use language models for that. That hasn't been the focus of my work, and I think that I'm drawn more to the present, I guess, than the past when it comes to language. Not that I don't personally like etymology; you just have to pick your battles in science.

 

Ravid Shwartz-Ziv: So you talked about world models, and a lot of people on the podcast in the past have talked about them, but I want to hear your thoughts from a cognitive science perspective. First, do you think they're necessary, and do the current models have internal states that you could call world models?

 

Tal: Yeah, we just finished a project on world models, and half of the meetings in the project were just trying to define what "world models" means. I think it's one of those terms that are in vogue, and people know what they mean, but there are a lot of different meanings in the air, so people are talking past each other sometimes. I think that understanding, as we said before, the generative process that gives rise to the phenomenon you're trying to model is clearly very helpful, especially in terms of generalization. If you see something that's a little bit out of distribution, but you understand the generative story of how it was generated, you will be able to understand its meaning a lot better. That's why we've done a lot of work on compositional generalization, where you only see certain combinations of the parts in training, but then you have to generalize to new combinations of the same parts of the sentence. And that's something that the standard models are not very good at, because they're kind of overfit to the distribution that they are trained on. So I think it's not necessarily a natural thing for, let's say, a transformer to develop a world model, or, to be more concrete, the generative story of how the data was generated. That being said, it might still do it to some extent, imperfectly, and I think it's an empirical question to what extent our models do that. I think they would be more robust and they would learn faster if they did it more explicitly, and they do not, for sure. A lot of the literature on finding world models in transformers looks at the representations inside the transformer and finds geometric structure that maps onto the geometric structure of the outside world. So if the embeddings of two objects are close to each other, and those two objects are also close to each other in the outside world, then we find some sort of isomorphism like that. A very recent study that we did on world models tried to investigate what the implication of that geometric structure in the representations is, with somewhat pessimistic results, actually. We did find that when the model is learning the structure of a grid in context, if you look at the representation of each point in the grid, you can actually reconstruct the geometry of the grid pretty well, which is very nice and cool. But then we found that the model doesn't actually know how to use these representations; they're just nice geometric patterns that are not then affecting the model's behavior. So it looks like the models are somewhat able to model the world, but they're not doing it very effectively, and they're not using that knowledge as well as they would if it was really part of the principle of how they worked, I think.

 

Allen Roush: Well, is that a tokenization-related problem? And what are your thoughts on tokenization in general and how it affects all these results?

 

Tal: I don't think that this specific issue is a tokenization-related issue, but in general, I don't like tokenization. I imagine few people like tokenization in language. Especially the way that it's implemented in our models, we have widely different tokenization schemes for different languages. You'll see an English sentence that comes out as half the number of tokens of the corresponding Hebrew sentence, just because, I don't know, the tokenizer doesn't care so much about Hebrew, and that seems like not the right way to do it. I think many people hope that one day we won't need to have this kind of rigid, fixed tokenization step, and we'll just give the model all of the characters of the input and it will figure out on its own what to do with them. But I guess we're not quite there yet.
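
The disparity is easy to see with any off-the-shelf subword tokenizer; for instance, a quick check assuming the `tiktoken` package, with one OpenAI BPE vocabulary as an arbitrary example:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "The cat sat on the mat."
hebrew = "החתול ישב על השטיח."  # the same sentence in Hebrew

for text in (english, hebrew):
    print(f"{len(enc.encode(text)):2d} tokens: {text}")
# The Hebrew version typically costs roughly twice as many tokens, because
# the merge vocabulary was fit largely to English-heavy training data.
```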

 

Ravid Shwartz-Ziv: What do you think about all these entropy-based tokenizers, byte-level tokenizers? Do you think this is the right direction?

 

Tal: To be honest, this is not something that I've really thought about very much.

 

Ravid Shwartz-Ziv: But how do you think all your results are affected by tokenization? Do you think that if I used a different tokenizer, it would change the results, like where the model focuses and how fast it learns?

 

Tal: Uhhh... Yeah, I mostly think of tokenization as a nuisance in my work, so it probably affects the behavior of the model in subtle ways, but I'm not sure exactly what the implication would be of changing the tokenization to any of the other schemes.

 

Ravid Shwartz-Ziv: And what about scaling laws? You know, there is a lot of discussion and debate about what they even mean, whether they're meaningful, whether they're useful, what we can do with them. Do you see scaling laws of a kind when you look at this from a cognitive science perspective? Are there emergent behaviors that you can see only in large-scale models?

 

Tal: Well, it's certainly the case that larger models are able to do things that smaller models are not. But I always thought of scaling laws as something where you try to predict the behavior of the larger models: if you have enough data points from smaller models, you try to fit some sort of curve that extrapolates.

 

Ravid Shwartz-Ziv: Because you said before, right, that at some point, if the model becomes very good, it outperforms humans, and you can't explain humans with these models anymore. But is that something you can actually predict? Or is it that the fit holds up to some point as you scale, the models becoming better and better, and then, boom, it breaks, and that's it?

 

Tal: Mmm.

 

Ravid Shwartz-Ziv: Is this something that you can predict?

 

Tal: Yeah, it's actually very predictable. So I understand the question now. If you plot the perplexity of the model on the x-axis and the fit to humans on the y-axis, there's actually a very strong correlation between those two things, where as the perplexity goes down, meaning the model is better at predicting the next word, its fit to human data also goes down. But that only holds for the very good models. If the model is quite bad, let's say you have a trigram model that just predicts the next word based on the most recent two words, then of course you're going to have the opposite scaling curve, where as you improve the model, the fit to humans improves. So it's kind of an inverse-U-shaped curve: for bad models the fit goes up as the quality of the model increases, and for the really good models it starts to go down as the quality of the model increases. Yeah, it's very cool how strong the correlation is.

 

Ravid Shwartz-Ziv: Yeah, so what is the most important factor in the misalignment you see between humans and models? Is it the data? Is it the learning algorithm? Is it the architecture? Or maybe the infrastructure?

 

Tal: Yeah, I think it's all of these things, and we are trying to tackle each one separately. Architecture-wise, we, the community, are very focused on transformers, and transformer memory works in a very different way from human memory. A transformer can remember exactly which word was the first word of the book that you're reading now. You started reading it 20,000 words ago, but the transformer just remembers exactly what happened on that first page, and humans are not able to do that; our working memory is actually very limited, so we have to compress the context in some way. You can also see that a transformer is very good at predicting repeated mentions of the same name, or if you give it a list and then 10,000 words later you give it the same list, it's going to perfectly predict what the next items on that list will be. So that is a big discrepancy between humans and at least the mainstream models, and one that we are trying to address. Other discrepancies are related to the data, the amount of data, which is that the models know a lot more than most people. If you ask a model what year Mozart was born or whatever, most language models, as pre-trained, have seen that fact many, many times in the training set and are able to retain it. So that is another source of discrepancy. It's not just perplexity as a single number that captures everything; there are different cognitive skills that those models have that humans don't, and each one needs to be matched better to humans to get better explanations of human behavior.

 

Allen Roush: And then, we've been talking a lot about transformers, but do you have any opinions about things like diffusion language models, or diffusion in general?

 

Tal: Yeah, I don't have good opinions about diffusion models, I have to say. I'm very focused on autoregressive models, just because a lot of my work is on how you read and hear: when you hear a sentence, or you read it in real time, it very naturally goes from left to right, if you're reading English. So I'm a little less focused on the models where you generate the whole sentence at the same time. That being said, they might be useful for some applications, like understanding how people plan a sentence. When you decide what to say, maybe you do have a vision of the whole thing that you're going to say, and then you flesh it out more gradually, like a diffusion model. But I haven't done any work in that direction. That could be a cool application.

 

Ravid Shwartz-Ziv: And I want to go back to data. What is more important: the quality of the data, and how do you even define data quality, or the amount of data that is available? Or maybe synthetic data? Today you can train with RL on synthetic data, and apparently you just need a few examples in order to unlock some capabilities. How do you see it compared to humans, and do you think we can actually learn something from it?

 

Tal: Yeah, that's a good question. As I said before, I think you can teach models a lot with synthetic data, maybe even more than with real data in some cases. But in terms of quality, I would say that human data quality is higher in one important sense, which is that we often try to create the data that we need. If we are interested in something, or we don't know much about it, then we will interact with the world, we will collect data. Whereas with language models it's often more passive in that sense, in pre-training at least. So it could be that the data is not low quality exactly, just not what the model needs, so it's not that beneficial at that point. And there could also be curriculum effects, where the model gets data that is too complicated for its current capability and doesn't really know what to do with it. With children, we're better able to tailor the complexity of the utterances to the child. So those things could be helpful to the models. That being said, it's possible that it kind of washes out once you get to a trillion tokens. But my understanding, at least, and I haven't worked on pre-training research in a big lab, is that data quality is still important even at the scale that we use for...

 

Ravid Shwartz-Ziv: But when...

 

Tal: the big pre-trained models.

 

Ravid Shwartz-Ziv: But when you compare humans to models, do you actually try to match the level of development or something like that? Because in the end you are comparing a model after training versus kids or humans at some point in time. Do you think this is something you can compare, or do you need to take it into account somehow?

 

Tal: I think you can look at the model's behavior at different points in the training process. So we can see what it does after 10 million tokens and then after 100 million tokens, and try to map that more or less to human development. I don't think that's crazy. A lot of the work in that area has been related to this shared task that some of my students have been involved in, called BabyLM. It's a corpus that we created that has only 100 million words, and those words are mostly drawn from sources that we think children might also have exposure to when they learn language. Every year there's a shared task at one of the conferences, and people can submit systems that try to learn only from that corpus. So if people are interested in that area, definitely go to the website and see what people have done in that direction. And curriculum learning is certainly one of the approaches that people are using to improve sample efficiency: presenting the simpler sentences first, at the early stages of model training, and then increasing the complexity of the sentences as the model gets more sophisticated.

 

Ravid Shwartz-Ziv: So what types of tasks do you have in this competition, in BabyLM? What do you evaluate the models on? Is it specific to things we think babies are good at?

 

Tal: That would be nice, actually, but we don't necessarily use baby-specific tasks. I guess one of the most developmentally motivated evaluations that we have there is an age-of-acquisition comparison between models and humans. Children acquire certain words earlier than other words, and those are not necessarily the most frequent words; some words that are very frequent still take children a long time to learn. So you can compare the curve of how long it takes the model to reach a certain proficiency with a word to the curves that we get from children. But other evaluations there are more like grammatical evaluations. For example, you can test whether the model knows that certain combinations of words are grammatical in English versus ungrammatical.
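
A minimal-pair grammaticality check of this kind (in the style of benchmarks like BLiMP) is straightforward to sketch: score both sentences under the model and count the item as passed if the grammatical one gets higher probability. The pair below is an invented subject-verb agreement example, with GPT-2 again as a stand-in model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability the model assigns to the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    next_ids = ids[0, 1:]
    return log_probs[torch.arange(next_ids.size(0)), next_ids].sum().item()

# (grammatical, ungrammatical) minimal pair: does the model prefer agreement?
good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print("pass" if sentence_log_prob(good) > sentence_log_prob(bad) else "fail")
```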

 

Ravid Shwartz-Ziv: Yep. You talked before a bit about representations, and how the models are not using those representations. What do you think about this research field in general? Because personally I feel that, and it really depends on the tools, but eventually you can get whatever you want from the representations; you can get whatever answers you want from them. So what do you think is the right level at which to evaluate, to look at the models? Is it the output, is it the representation at the final layer, maybe the intermediate layers? And if so, what types of tools do you think are useful?

 

Tal: Yeah, so I know that in the mechanistic interpretability world, there are a lot of people who think that it's only worth doing if it's actionable, in the sense that it teaches you how to fix a problem with the model. So let's say your model hallucinates often, and you found some sort of internal feature in the model that is correlated with the propensity to hallucinate; once you detect that feature you can turn it off, or you can tell the model, no, give a different answer here, your answer is probably going to be hallucinated. So that's the actionable use of internal representations, and I think that's great. I think it's a great direction, but I really don't think that's the only reason to study how language models work internally. With basic science in general, it's often not immediately clear how a better understanding of how a system works will lead to improvements in the system. But that doesn't mean that you shouldn't do that basic science to figure out how the system works, because it's a bet that somewhere down the road you will be able to develop better systems with this knowledge. So yeah, I can maybe give an example from a project that we're working on. But let me know, maybe you have another question.
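
The "turn it off" move can be sketched in a few lines: suppose, hypothetically, you have found a direction in some layer's activations that correlates with the unwanted behavior; a forward hook can project it out at inference time. Everything below, the tiny model, the layer choice, and `bad_direction`, is a placeholder for what real interpretability analysis would supply:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 16
model = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(), nn.Linear(hidden, 2))

# Stand-in for a feature direction discovered by interpretability analysis.
bad_direction = torch.randn(hidden)
bad_direction /= bad_direction.norm()

def ablate(module, inputs, output):
    # Remove the activation's component along the unwanted direction.
    coef = output @ bad_direction            # shape: (batch,)
    return output - coef.unsqueeze(-1) * bad_direction

handle = model[0].register_forward_hook(ablate)  # hook the first layer
print(model(torch.randn(3, 8)))                  # forward pass, feature removed
handle.remove()                                  # restore normal behavior
```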

 

Ravid Shwartz-Ziv: Yes.

 

Tal: Yeah, sure. So...

 

Ravid Shwartz-Ziv: Okay, yeah, so, in the end, I want to know: do you think it's useful? For example, from an information theory perspective, when we look at the intermediate layers, we see that they contain more information, okay? But you can't do anything with that; the fact that they contain more information doesn't help you, and it doesn't help the model to extract that information. So what do you think? Should we use these tools to answer very theoretical questions? Or should we say we don't care about the theory, and only care about practical issues and understanding how we can get better models, how we can train these models better?

 

Tal: Yeah, well, maybe information theory is too much of a bird's-eye view; maybe that's not the right level of granularity, and you need to go to a more mechanistic level. Let's say you were able to find the exact circuit of attention heads that does something in the model, which is hard and not always possible, but some people have been able to do that for simple processes. Then that can teach you, especially to the extent that the model is not doing what you want it to do, what the possible ways are that the model could do this, and how to maybe fix it, how to change the model so it does those things differently. Let me try to give you the concrete example that we are working on right now. It's all very preliminary, but it's related to one of the ways that we think language models are different from humans in processing sentences, and that is the fact that humans, we think, have a limited capacity to consider multiple possible interpretations of a sentence. Even if the sentence hypothetically could mean a lot of different things, we get very attached to just one of those interpretations, because of limited memory; we just don't have the capacity to consider everything. But with language models, our hypothesis is that that's not the case: language models can consider a lot of different interpretations, and that leads to a mismatch between how the language models read and how people read. When people get to the point where they realize that their interpretation was incorrect, they often have to do a lot of work to reread the sentence and then consider the other interpretation that they did not consider the first time around. So if we are able to understand how language models internally encode those interpretations of the sentence, then that can give us a tool for making them more like people, in that we can constrain them to only consider one of those interpretations. But before we can do that, we need to first determine the format in which those interpretations are represented in the model. So that's an application of mechanistic interpretability for cognitive science.

 

Allen Roush: So speaking of mechanistic interpretability, I remember some of the oldest works being around trying to heavily excite individual neurons and see what the model's representation is when a specific neuron is activated. I also remember work with titles like DeepDream. Do you think that what anybody has ever done with models in that space has any analogy to how humans dream? And if so, do you think there's value in the human dreaming process for cognitive understanding, I guess, of ourselves and also of model representations?

 

Tal: Yeah, that is a really great question that I am entirely unqualified to answer, because I know very little about human dreaming. But I would love to know more about it. Yeah, sorry.

 

Ravid Shwartz-Ziv: I have a kind of related question: do you think there is a connection, or that we should try to explore the connection, between the neural level, neural activity, and cognition and LLMs? Because these are two separate fields, say neuroscience or computational neuroscience on one side, and cognition on the other.

 

Tal: Yeah. I certainly think there's value to cognitive neuroscience. And in fact, I am involved in a collaboration right now with a team of neuroscientists, and they're doing intracranial recordings, recordings actually from inside the brain, from patients who are undergoing surgery. So it's pretty amazing data, and it's related to the questions that we have about working memory. We're trying to study human working memory and how it's implemented in the brain, and compare that to the models that we're developing that have more constrained working memory. So I definitely think there's value there. I'm not a neuroscientist myself, so I have to rely on my collaborators to do that work, but I find it very interesting to understand how cognition is related to the brain.

 

Ravid Shwartz-Ziv: My personal opinion is... So I did my PhD in computational neuroscience in Jerusalem, and the reason I'm actually not doing it today is that I think it's just too complicated. We are so far from understanding anything related to the brain, and it's really frustrating that you can't understand it. You don't have data, and everything is so complicated. So, yeah, but maybe other people are better at it.

 

Tal: Yeah. No, no, I think that's a good point. I did some cognitive neuroscience in my PhD, actual empirical work with fMRI and MEG, and at some point I decided that it's too difficult right now. I reached a similar conclusion to yours, actually: the techniques that we have don't provide a ton of signal, and you need to work really hard to get a result that is maybe easier to get from behavioral experiments. So that's why I shifted my focus more to behavioral and computational work.

 

Ravid Shwartz-Ziv: So, yeah, maybe we'll talk a bit about the duality, because you're at both NYU and Google. Maybe we can talk about the differences, the similarities, what you prefer: academia versus industry.

 

Tal: So yeah, I really enjoy doing both of these things, though sometimes it's a lot to have two different jobs at the same time. There are times when I do more of my NYU work and times when I do more of my Google work. There are, I guess, two kinds of work that you can do on language models inside the big labs. I think Google is one of the most open ones, in the sense that there are many people doing research that is then published and submitted to conferences and so on, and I've been doing a lot of that kind of work at Google. Recently I've been interested in contributing more to the research that doesn't get published: work on the big language model, on Gemini, that is confidential, and even if it weren't confidential, it's not always clear what the artifact would be that would get published at a conference. A lot of the work is not of a format that you can easily publish, even setting aside the competitive pressures. So it's a very different kind of work. I like doing different kinds of work, and, you know, this is my ninth year as a professor, so I know what that work is like, and it's very interesting, but I really appreciate the opportunity to do a different kind of work as well at Google: the kind of work that involves interacting with those big systems that are really opaque from the outside, so you can finally understand them much more intimately from the inside. That's very exciting, and I think it can also give you ideas for open research problems that you can pursue in academia. And there's being closer to the technical work, in terms of running large experiments and coding, things that, as a professor in academia, are often not exactly your day-to-day activities. So I do really appreciate that combination. The downside, I guess, is that as a professor in academia you get to set a research agenda in a much more autonomous way than at one of the big labs. Especially if you're trying to advance the large models like Gemini, you should probably work on something where there's a clear gap in the current model and one of the higher-ups, one of the people who manages 2,000 people, has decided that this is a really important priority area. So it makes a lot of sense to work on that, as opposed to your own little kingdom. And academia is very special in terms of the teaching and mentorship aspect of it. You really spend... yeah, absolutely. I like to teach, and I like to advise PhD students. Of course.

 

Ravid Shwartz-Ziv: Do you like to teach?

 

Tal: Yeah, absolutely. I like to teach, and I like to advise PhD students. There are also some nice things about working on a team of many competent and experienced people; it definitely has its own advantages. But for me personally, the mentoring and working with junior people is a huge advantage of academia.

 

Allen Roush: Does Google force you to use TPUs only?

 

Tal: Well, I don't know if it forces me, but in practice it's definitely the easiest path.

 

Ravid Shwartz-Ziv: So what do you recommend for someone in their early twenties? Do you think a PhD is still important these days, or do you see a difference between someone who has now worked five years on Gemini and someone who did a PhD?

 

Tal: I think a PhD is a good idea if you want to do research. Right now it's actually not that easy to do research of the traditional sort in industry, because the time scale is so short: you really have to show that your method is useful very quickly. It's a lot harder to do the kind of long-term research that you can do as a PhD student. So just in terms of your career, if you're going to work as a research engineer or scientist for the rest of your career, you may as well do a PhD, because that's your only opportunity to really do this sort of long-term research. The downside, I guess, is that you would miss out on all of the excitement in AI right now, and who knows where AI will be in 2030 when you graduate.

 

Ravid Shwartz-Ziv: And the money. Also the money.

 

Tal: And money, of course, yes. That is a good reason to take those jobs. But I don't know, it really depends on your goals. Personally, given the choices I've made in life, as you know, I think that academia, especially in the long term, is a really great career. You might make less money, but if you play your cards right, you can keep your job for a very long time, which is not something I can say for jobs in tech.

 

Ravid Shwartz-Ziv: Do you think we'll see the outcome everyone keeps talking about, that all the software engineers will lose their jobs now that AI can finally close the loop and handle all the coding tasks? Do you believe that perspective?

 

Tal: Yeah, I don't know how qualified I am to talk about this, to be honest. You'd probably be better off talking to a director of engineering at a startup or something. My hunch is that a lot of engineers are doing work that is easier to automate, and maybe some engineers won't be needed anymore, but I still think that...

 

Ravid Shwartz-Ziv: What type?

 

Tal: ...the good engineers, the really smart people with good ideas, are still going to be useful for a long time, maybe even more useful than they have been. But that's really just a hunch; I'm not in the trenches of that world.

 

Ravid Shwartz-Ziv: What about AI researchers? What do you think about AI researchers? Because I saw several posts on Twitter from people who train models, saying that the current AI coding models are good enough that you can give them a task and an environment, and they will optimize everything. There was this tweet about, I think, a professor or researcher who gave Claude Code the task of making the smallest transformer that does ten-digit addition, something like that: what is the smallest number of parameters that can do it? It worked all night and came back with maybe 700 or 800 parameters. Do you think this is the future, or do you think we still need the AI researcher who actually makes the decisions and makes the calls?

 

Tal: Yeah, you're asking a really great question. I think it's so hard to forecast how good AI is going to be in a couple of years that it seems like a fool's errand. That being said, there are certain aspects of AI research that are very tedious, and if you really trusted Claude Code to optimize your hyperparameters for you, to babysit your training runs and rerun them when they crash, of course it would be great for that not to be part of AI research. But in terms of finding the interesting questions, I don't know how long it will take before we can trust AI with that. I'm not very optimistic. And I don't know if humans are that good at finding the interesting questions in AI research, to be honest. As a community, we're definitely doing a lot of uninteresting research, or obvious research that hundreds of other people could be doing at the same time. So we should train people to have better taste as well.
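[Editor's note: as a rough illustration of the kind of experiment Ravid describes (not the actual run from the tweet), here is a minimal PyTorch sketch that trains a tiny transformer on few-digit addition and counts its parameters. Every choice here, the sizes, the little-endian digit encoding, the mean-pooled encoder, is an illustrative assumption, and this toy is far larger than the 700-800 parameter solution mentioned above.]

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N = 3                          # digits per operand (the tweet used 10)
VOCAB, D, OUT = 11, 16, N + 1  # tokens 0-9 plus '+' (id 10); sum has up to N+1 digits

def make_batch(bs=256):
    # random addition problems, digits encoded little-endian
    a = torch.randint(0, 10 ** N, (bs,))
    b = torch.randint(0, 10 ** N, (bs,))
    def digits(x, n):
        return torch.stack([(x // 10 ** i) % 10 for i in range(n)], dim=1)
    x = torch.cat([digits(a, N), torch.full((bs, 1), 10), digits(b, N)], dim=1)
    return x, digits(a + b, OUT)

class TinyAdder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        self.pos = nn.Parameter(0.02 * torch.randn(2 * N + 1, D))
        layer = nn.TransformerEncoderLayer(
            d_model=D, nhead=2, dim_feedforward=32, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(D, OUT * 10)  # 10-way logits per output digit

    def forward(self, x):
        h = self.enc(self.emb(x) + self.pos).mean(dim=1)  # mean-pool tokens
        return self.head(h).view(-1, OUT, 10)

model = TinyAdder()
print("parameters:", sum(p.numel() for p in model.parameters()))

opt = torch.optim.Adam(model.parameters(), lr=3e-3)
for step in range(3000):
    x, y = make_batch()
    loss = nn.functional.cross_entropy(model(x).flatten(0, 1), y.flatten())
    opt.zero_grad(); loss.backward(); opt.step()

x, y = make_batch()
acc = (model(x).argmax(-1) == y).all(dim=1).float().mean().item()
print(f"final loss {loss.item():.3f}, exact-match accuracy {acc:.2f}")
```

An agent-driven search of the kind described in the tweet would presumably wrap an outer loop around a script like this, shrinking the embedding width, feedforward size, and layer count until accuracy breaks, then reporting the smallest configuration that still works.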

 

Allen Roush: Do you use Gemini a lot, and do you have opinions about Gemini?

 

Tal: Oh man, you know, one of the things I find shocking when I look at social media is how people always have very strong opinions about the most recent models, the models that were released three days ago. They know exactly what the vibes are, whether Opus 4.6 is better than Codex 5.2 or whatever. I don't know how people have the time to compare all the models to each other.

 

Ravid Shwartz-Ziv: It's funny, because when a professor comes on the podcast, if you ask them about something even slightly outside their field, they'll say, you know, "I'm not an expert on this," even though they've studied it for years. But on Twitter everyone is an expert and everyone knows everything, right?

 

Tal: Yeah, no, I think there are people who do this professionally, maybe there's some financial incentive in it: they spend hours a day just playing with the most recent models. I just cannot find the bandwidth to do it. Honestly, it would be fun, but I don't. So if your question is what I think about Gemini compared to the competitors, I honestly have no idea. I do use Gemini because I get it for free at work, so I use it, obviously. But I couldn't tell you if it's better than the others.

 

Ravid Shwartz-Ziv: And what do you think about the new generation of students? There's this debate, this argument, that because they're using AI coding agents and ChatGPT or Gemini all the time, they won't know the fundamental principles, the basic underlying components you need in order to be a good researcher, or even a good programmer. What do you think about it?

 

Tal: Yeah, I'm definitely concerned about it.

 

Ravid Shwartz-Ziv: Do you see it in your classes, for example?

 

Tal: Yeah, we have definitely changed the way assessment works. We have a lot more in-class exercises and things like that, to make sure people are not only using AI. Of course it's great if students use AI coding agents; it's silly to tell them never to do that. But if it gets to the point where you talk to a student in a meeting and realize they actually don't understand how attention works, because they just accept whatever suggestion the coding agent gave them, then it gets a little worrying. I'm not exactly sure what to do about it, other than to make sure that when you interview people for a job, you actually talk to them and check that they have a good mental model of the technologies they work with, because you can no longer just look at their GitHub repos and assess their coding ability from that. But yeah, it's a challenging problem that I haven't figured out how to solve.

 

Allen Roush: Yeah, I point to my own GitHub code from before 2022 or 2023 as evidence that I could code at one point without the help, but yeah.

 

Tal: Yeah, I mean, honestly, I think there are maybe some aspects of coding that are not that important for an AI researcher. Maybe it's okay if you're not the best software engineer in some respects, if the relevant aspects are something that can be easily automated. I don't know; I think that's still TBD. I'm nervous. I don't think we're quite at the point where I can just farm out the whole software engineering side to an AI agent without trying to understand what the agent did. But if we do get to that point, and the person still understands the mathematical aspects of the model and the training, then maybe it's not a huge deal. Maybe not everyone needs to be an expert on everything. But it's a great question.

 

Ravid Shwartz-Ziv: Okay, I think we're almost out of time. Anything else that you want to add?

 

Tal: No, I think we covered a lot of ground in this conversation.

 

Ravid Shwartz-Ziv: Yeah. Anything else that you want to advertise or promote?

 

Tal: Well, that's a good question. I guess...

 

Ravid Shwartz-Ziv: To come to your lab, for example.

 

Tal: Yes, you should absolutely come to my lab. We're recruiting PhD students, and I'm actually even recruiting a postdoc. Especially if you're interested in... actually, I realize we only talked about this at the very beginning, not during the actual podcast part of the podcast. There's a direction we recently started working on that I'm excited about: an application of human-like language models for user simulation.

 

Ravid Shwartz-Ziv: We can talk about it now if you want.

 

Tal: The idea is to put a simulated user in a loop with an LLM assistant that's trying to help a person learn something or perform a task. We think that if you give the language model more human-like limitations, then an assistant trained to interact with that simulated user will transfer better to real human users. So I'm very interested in the general question of whether we can improve language models by building better user simulators, especially for interaction over multiple turns between the model and the simulator. If you're interested in that topic, I'm definitely happy to chat. But we're also continuing to work on cognitive science and mechanistic interpretability, and on developing better evals for language models. As you said, we need evals that are longer: not just a single turn, but tasks where the model needs to do something complicated, maybe with a lot of interaction with the user and with tools. Those are the directions I'm excited about, and if people are interested in them, I'm very happy to consider a postdoc or, next year, a PhD student.
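[Editor's note: a minimal sketch of the user-simulation loop Tal describes, with stub functions standing in for real LLM calls. Every name here is a hypothetical illustration, not an API from Tal's lab or Google, and the single human-like limitation modeled is a truncated working memory over turns.]

```python
from collections import deque

MEMORY_TURNS = 2   # hypothetical human-like limitation: the simulated user
                   # only "remembers" the last few exchanges

def assistant_reply(history):
    # stub for the assistant model being trained; it sees the full history
    return f"assistant: answer to <{history[-1]}>"

def simulated_user_reply(visible_history, goal):
    # stub for the human-like user simulator; a real one would be an LLM
    # constrained to forget, misread, or under-specify the way people do
    return f"user: follow-up on <{visible_history[-1]}> (goal: {goal})"

def run_dialogue(goal, n_turns=3):
    history = [f"user: I want to {goal}"]
    memory = deque(history, maxlen=2 * MEMORY_TURNS)  # user's truncated view
    for _ in range(n_turns):
        a = assistant_reply(history)
        history.append(a); memory.append(a)
        u = simulated_user_reply(list(memory), goal)
        history.append(u); memory.append(u)
    return history  # a multi-turn trajectory you could score and train on

for line in run_dialogue("understand how attention works"):
    print(line)
```

The point of the sketch is the asymmetry: the assistant is optimized against a user whose view of the conversation is limited, in the hope that it transfers better to real humans than an assistant trained against an idealized, perfectly attentive simulator.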

 

Ravid Shwartz-Ziv: Yeah, very interesting questions. Tal, thank you so much for coming on.

 

Tal: Yeah, thank you so much for having me. This was really fun.

 


 

Allen Roush: It was a pleasure to meet you.

 


 

Ravid Shwartz-Ziv: Thank you so much, Allen.