Dec. 15, 2025

EP20: Yann LeCun


Yann LeCun – Why LLMs Will Never Get Us to AGI

"The path to superintelligence - just train up the LLMs, train on more synthetic data, hire thousands of people to school your system in post-training, invent new tweaks on RL-I think is complete bullshit. It's just never going to work."

 

After 12 years at Meta, Turing Award winner Yann LeCun is betting his legacy on a radically different vision of AI. In this conversation, he explains why Silicon Valley's obsession with scaling language models is a dead end, why the hardest problem in AI is reaching dog-level intelligence (not human-level), and why his new company AMI is building world models that predict in abstract representation space rather than generating pixels.

 


 

Timestamps

(00:00:14) – Intro and welcome

(00:01:12) – AMI: Why start a company now?

(00:04:46) – Will AMI do research in the open?

(00:06:44) – World models vs LLMs

(00:09:44) – History of self-supervised learning

(00:16:55) – Siamese networks and contrastive learning

(00:25:14) – JEPA and learning in representation space

(00:30:14) – Abstraction hierarchies in physics and AI

(00:34:01) – World models as abstract simulators

(00:38:14) – Object permanence and learning basic physics

(00:40:35) – Game AI: Why NetHack is still impossible

(00:44:22) – Moravec's Paradox and chess

(00:55:14) – AI safety by construction, not fine-tuning

(01:02:52) – Constrained generation techniques

(01:04:20) – Meta's reorganization and FAIR's future

(01:07:31) – SSI, Physical Intelligence, and Wayve

(01:10:14) – Silicon Valley's "LLM-pilled" monoculture

(01:15:56) – China vs US: The open source paradox

(01:18:14) – Why start a company at 65?

(01:25:14) – The AGI hype cycle has happened 6 times before

(01:33:18) – Family and personal background

(01:36:13) – Career advice: Learn things with a long shelf life

(01:40:14) – Neuroscience and machine learning connections

(01:48:17) – Continual learning: Is catastrophic forgetting solved?

 


Music:

"Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.

"Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.

Changes: trimmed


About

The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

 

Transcript

00:00:00 – Introduction

Ravid Shwartz-Ziv: Hi, Yann. And welcome to The Information Bottleneck. And I have to say, this is a bit weird for me. Like I've known you for almost five years and we've worked closely together, but this is the first time that I'm interviewing you for a podcast, right? Usually our conversations are more like, "Yann, it doesn't work, what should I do?" Okay, so even though I'm sure all of our audience knows you, I will say Yann LeCun is a Turing Award winner, one of the godfathers of deep learning, the inventor of convolutional neural networks, founder of Meta's Fundamental AI Research Lab, and still their Chief AI Scientist and a professor at NYU. So welcome.

Yann LeCun: Pleasure to be here. Yeah.

Allen Roush: And it's a pleasure for me to be anywhere near you. I have been, you know, in this industry for a lot less time than either one of you, doing research for a lot less time. So the fact that I'm able to publish papers somewhat regularly with Ravid has been an honor. And to be able to start hosting this podcast has been even more of one. So it's really a pleasure to sit down with you.

00:01:25 – AMI: Why start a company now?

Ravid Shwartz-Ziv: Awesome. Yeah, so we thought—congratulations on the new startup. You recently announced that after 12 years at Meta, you're starting a new startup, Advanced Machine Intelligence, that focuses on world models. So first of all, how does it feel to be on the other side, going from a big company to starting something from scratch?

Yann LeCun: Well, I've co-founded companies before, though I was involved more peripherally than in this new one. But I know how this works. What's unique about this one is a new phenomenon where there is enough hope on the part of investors that AI will have a big impact that they are ready to invest a lot of money, which means now you can create a startup where the first couple of years are essentially focused on research. That just was not possible before.

The only place to do research in industry before was in a large company that was not fighting for its survival and basically had a dominant position in its market and had a long enough view that they were willing to fund long-term projects. So from history, the big labs that we remember—like Bell Labs belonged to AT&T, which basically had a monopoly on telecommunication in the US. IBM had a monopoly on big computers essentially, right? And they had a good research lab. Xerox had a monopoly on photocopiers and that enabled them to fund PARC. It did not enable them to profit from the research going on there, but that profited Apple. And then more recently, Microsoft Research, Google Research, and FAIR at Meta.

And the industry is shifting again. FAIR had a big influence on AI, the AI research ecosystem, by essentially being very open, right? Publishing everything, open sourcing everything with tools like PyTorch, but also research prototypes that a lot of people have been using in industry. So we caused other labs like Google to become more open, and other labs to also kind of publish much more systematically than before.

But what's been happening over the last couple of years is that a lot of those labs have been kind of clamming up and becoming more secretive. That's certainly the case. I mean, that was the case for OpenAI several years ago. Now Google is becoming more closed and possibly even Meta. So yeah, I mean, it was time for the type of stuff that I'm interested in to kind of do it outside, better than inside.

00:04:32 – Will AMI do research in the open?

Allen Roush: So to be clear then, does AMI, Advanced Machine Intelligence, plan to do their research in the open?

Yann LeCun: Yeah, I mean research. I mean, in my opinion, you cannot really call it research unless you publish what you do, otherwise you can easily fool yourself. You come up with something you think is the best thing since sliced bread. Okay. If you don't actually submit it to the rest of the community, you might just be delusional. And I've seen that phenomenon many times, you know, in lots of industry research labs.

There's sort of internal hype about, you know, some internal projects without kind of realizing that other people are doing things that actually are better. Right. So if you tell the scientists, you know, publish your work, first of all, that is an incentive for them to do better work, work whose methodology is kind of more thorough and whose results are kind of more reliable. The research is more reliable.

It's good for them because very often when you work on a research project, the impact you may have on product could be months, years, or decades down the line. And you cannot tell people, like, you know, come work for us, don't say what you're working on, and maybe there is a product you will have an impact on five years from now. Like in the meantime, like they can't be motivated to really do something useful. So if you tell them that, they tend to work on things that have a short-term impact. So if you really want breakthroughs, you need to let people publish. You can't do it any other way. And this is something that a lot of the industry is forgetting at the moment.

00:06:21 – World models vs LLMs

Allen Roush: AMI, like what products, if any, does AMI plan to produce or make? Is it research or more than that?

Yann LeCun: No, it's more than that. It's actual products. But things that have to do with world models and planning and basically with the ambition of becoming one of the main suppliers of intelligent systems down the line. We think the current architectures that are employed, LLMs or agentic systems that are based on LLMs, work okay for language.

Even agentic systems really don't work very well. They require a lot of data to basically clone the behavior of humans, but they're not that reliable. So we think the proper way to handle this, and I've been saying this for almost 10 years now, is have world models that are capable of predicting what would be the consequence or the consequences of an action or a sequence of actions that an AI system might take. And then the system arrives at a sequence of actions or an output by optimization, by figuring out what sequence of actions will optimally accomplish a task that I'm setting for myself. That's planning.

So I think an essential part of intelligence is being able to predict the consequences of your actions and then use them for planning. And that's what I've been working on for many years. We've been making fast progress with a combination of projects here at NYU and also at Meta. And now it's time to basically make it real.

Ravid Shwartz-Ziv: And what do you think are the missing parts? And why do you think it's taking so long? Because you're talking about it, as you said, for many years already, but it's still not better than LLMs, right?

Yann LeCun: It's not the same thing as LLMs. It's designed to handle modalities that are highly visual, continuous and noisy. And LLMs completely suck at this. Like they really do not work, right? If you try to train an LLM to kind of learn good representations of images or video, they're really not that great. You know, generally vision capabilities for AI systems, right, are trained separately. They're not part of the whole LLM thing.

So yeah, if you want to handle data that is highly continuous and noisy, you cannot use generative models. You can certainly not use generative models that tokenize your data into kind of discrete symbols. Okay, it's just no way. And we have a lot of empirical evidence that this simply doesn't work very well. What does work is learning an abstract representation space that eliminates a lot of details about the input, essentially all the details that are not predictable, which includes noise, and make predictions in that representation space. And this is the idea of JEPA, right? Joint Embedding Predictive Architectures, which you are familiar with as you worked on this.

Ravid Shwartz-Ziv: Yeah. Randall Balestriero was also on the podcast in the past, and he talked about this at length. So there's a lot of ideas around this.

00:09:30 – History of self-supervised learning

Yann LeCun: And let me tell you my history around this, okay? I had been convinced for a long time, probably the better part of 20 years, that the proper way to build intelligent systems was through some form of unsupervised learning. I started working on unsupervised learning as the basis for making progress in the early 2000s, mid 2000s. Before that, I wasn't so convinced this was the way to go.

And basically, this was the idea of training auto-encoders to learn representations. So you have an input, you run it through an encoder, it finds a representation of it, and then you decode. So you guarantee that the representation contains all the information about the input. That intuition is wrong. Insisting that the representation contains all the information about the input is a bad idea. I didn't know this at the time. So what we worked on was several ways of doing this.

Jeff Hinton, at the time, was working on restricted Boltzmann machines. Yoshua Bengio was working on denoising auto-encoders, which actually became quite successful in different contexts, right, for LLMs among others. And I was working on sparse auto-encoders. So basically, if you're training auto-encoders, you need to regularize the representation so that the auto-encoder does not trivially learn an identity function. And this is the Information Bottleneck podcast, this is about information bottleneck. You need to create an information bottleneck to limit the information content of the representation. And I thought high dimensional sparse representations were actually a good way to go.
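
To make the regularization idea concrete, here is a minimal sketch of a sparse auto-encoder, assuming PyTorch. The architecture, dimensions, and penalty weight are purely illustrative, not the specific models from that era; the L1 penalty on the code plays the role of the information bottleneck.

```python
# Minimal sparse auto-encoder sketch (PyTorch assumed; names and hyperparameters illustrative).
# The L1 penalty on the code acts as the information bottleneck: without it, a
# wide-enough auto-encoder can simply learn the identity function.
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=1024):
        super().__init__()
        # Over-complete, high dimensional code (code_dim > input_dim), kept sparse by the penalty.
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)        # sparse code
        x_hat = self.decoder(z)    # reconstruction
        return x_hat, z

def loss_fn(x, x_hat, z, sparsity_weight=1e-3):
    recon = ((x - x_hat) ** 2).mean()   # reconstruction error
    sparsity = z.abs().mean()           # L1 penalty limits the code's information content
    return recon + sparsity_weight * sparsity

# Usage sketch:
model = SparseAutoEncoder()
x = torch.rand(32, 784)
x_hat, z = model(x)
loss = loss_fn(x, x_hat, z)
loss.backward()
```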

Several of my students did their PhDs on this. Koray Kavukcuoglu, who is now the Chief AI Architect at Alphabet, and also the CTO at DeepMind, actually did his PhD on this with me. And a few others.

So this was kind of the idea. And then as it turned out—and the reason why we worked on this was because we wanted to pre-train very deep neural nets by pre-training those things as auto-encoders. We thought that was the way to go. What happened though was that we started experimenting with things like normalization, rectification instead of hyperbolic tangent and sigmoids, like ReLUs. That ended up basically allowing us to train fairly deep networks, completely supervised.

And this was at the same time that data sets started to get bigger. And so it turned out like supervised learning worked fine. So the whole idea of self-supervised or unsupervised learning was put aside. And then came ResNet and that completely solved the problem of training very deep architectures, in 2015.

But then in 2015, I started thinking again about how do we push towards human level AI, which really was the original objective of FAIR, and my objective, my life mission. And realized that all the approaches of reinforcement learning and things of that type were basically not scaling. You know, reinforcement learning is incredibly inefficient in terms of samples. And so this was not the way to go.

And so the idea of world models, right? A system that can predict the consequences of its actions, it can plan. I started really seriously playing with this around 2015, 2016. My keynote at what was still called NIPS at the time in 2016 was on world models. I was arguing for it. That was basically the centerpiece of my talk was like, this is what we should be working on, world models, action conditioned.

And a few of my students started working on this, on video prediction and things like that. We had some papers on video prediction in 2016. And I made the same mistake as before and the same mistake that everybody is doing at the moment, which is training a video prediction system to predict at a pixel level, which is really impossible. And you can't really represent useful probability distributions on the space of video frames. And so those things don't work.

I knew for a fact that because the prediction was non-deterministic, we had to have a model with latent variables to represent all the stuff you don't know about the variable you're supposed to predict. And so we experimented with this for years. I had a student here who is now a scientist at FAIR, Michael Eickenberg, who developed a video prediction system with latent variables. And it partially solved some of the problems we were facing.

I mean, today, the solution that a lot of people are employing is diffusion models, which is a way to train a non-deterministic function, essentially, or energy-based models, which I've been advocating for decades now, which also is another way of training non-deterministic functions.

But in the end, I discovered that the real way to get around the fact that you can't predict at the pixel level is to just not predict at the pixel level. It's to learn representations and predict at a representation level, eliminating all the details you cannot predict.

00:16:41 – Siamese networks and contrastive learning

And I wasn't really thinking about those methods early on because I thought there was a huge problem of preventing collapse. So I'm sure Randall talked about this—when you train, let's say you have an observed variable X and you're trying to predict a variable Y, but you don't want to predict all the details, right? You run both X and Y through encoders. So now you have both a representation for X, SX, a representation for Y, SY. You can train a predictor to predict a representation of Y from the representation of X.

But if you want to train this whole thing end-to-end simultaneously, there is a trivial solution where the system ignores the input and produces constant representations. And the predictor's problem now is trivial, right? So if your only criterion to train the system is minimize the prediction error, it's not going to work. It's going to collapse. I knew about this problem for a very long time because I worked on joint embedding architectures. We used to call them Siamese networks back in the 90s.
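
For readers who want to see the setup concretely, here is a toy sketch, assuming PyTorch, of the joint embedding predictive arrangement described above, with illustrative dimensions. It shows why prediction error alone is a degenerate objective.

```python
# Sketch of the joint-embedding predictive setup (PyTorch assumed, toy dimensions).
# Two encoders produce s_x and s_y; a predictor maps s_x to a guess of s_y.
# If the ONLY training signal is the prediction error below, both encoders can
# output a constant vector for every input: the error goes to zero and the
# representations carry no information. That is the collapse problem.
import torch
import torch.nn as nn

enc_x = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
enc_y = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
predictor = nn.Linear(32, 32)

x = torch.randn(16, 128)   # observed variable X
y = torch.randn(16, 128)   # variable Y to be predicted (another view, a future frame, ...)

s_x, s_y = enc_x(x), enc_y(y)
pred_error = ((predictor(s_x) - s_y) ** 2).mean()

# pred_error alone is degenerate; an extra term (contrastive, VICReg-style
# variance/covariance, distribution-matching, ...) is needed to force the
# encoders to keep information about their inputs.
```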

Allen Roush: Same, because people have been using that term Siamese networks even recently.

Yann LeCun: That's right. I mean, the concept is still up to date, right? So you have an X and Y, and think of X as some sort of degraded, transformed, or corrupted version of Y. You run both X and Y through encoders, and you tell the system, look, X and Y really are two views of the same thing. So whatever representation you compute should be the same.

So if you just train a neural net—two neural nets which share the weights, right—to produce the same representation for slightly different versions of the same object or view, whatever it is, it collapses. It doesn't produce anything useful. So you have to find a way to make sure that the system extracts as much information from the input as possible.

And the original idea that we had in a paper from 1993 with Siamese nets was to have a contrastive term, right? So you have other pairs of samples that you know are different and you train the system to produce different representations. So you have a cost function that attracts the two representations when you show two examples that are identical or similar and you repel them when you show it two examples that are dissimilar.

And we came up with this idea because someone came to us and said, like, can you encode signatures, someone drawing a signature on a tablet, in less than 80 bytes? Because if you can encode it in less than 80 bytes, we can write it on the magnetic strip of a credit card. We can do signature verification on credit cards. And so we came up with this idea—I came up with this idea of training a neural net to produce 80 variables that would be quantized to one byte each, and then training it to kind of do this thing.

Allen Roush: And did they use it?

Yann LeCun: So it worked really well. And they showed it to their business people, who said, we're just going to ask people to type PIN codes. There's a lesson there about how technology actually gets adopted. Right. And I knew this thing was kind of fishy in the first place because there were countries in Europe that were using smart cards. And it was a much better solution. They just didn't want to use it for some reason.

Anyway, so we had this technology. In the mid 2000s, I worked with two of my students to revive this idea. We came up with new objective functions to train those. So these are what people now call contrastive methods. It's a special case of contrastive methods. We have positive examples and negative examples. For positive examples, you train the system to have low energy. And for negative samples, you train them to have higher energy, where energy is the distance between the representations.
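
A minimal sketch of that contrastive criterion, assuming PyTorch; the margin value is illustrative.

```python
# Sketch of the contrastive criterion described above (PyTorch assumed).
# Energy = distance between the two representations of a weight-shared (Siamese) net.
# Positive pairs (two views of the same thing) are pulled to low energy; negative
# pairs are pushed above a margin, which prevents collapse.
import torch
import torch.nn.functional as F

def contrastive_loss(s_a, s_b, is_positive, margin=1.0):
    # s_a, s_b: embeddings from the two branches, shape [batch, dim]
    # is_positive: 1.0 for similar pairs, 0.0 for dissimilar pairs, shape [batch]
    energy = F.pairwise_distance(s_a, s_b)                         # distance = energy
    pos_term = is_positive * energy.pow(2)                         # attract similar pairs
    neg_term = (1 - is_positive) * F.relu(margin - energy).pow(2)  # repel dissimilar pairs up to the margin
    return (pos_term + neg_term).mean()
```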

So we had two papers at CVPR in 2005 and 2006 by Raia Hadsell, who is now the head of DeepMind Foundation, the sort of FAIR-like division of DeepMind, if you want, and Sumit Chopra, who is actually a faculty member here at NYU now, working on medical imaging. And so this gathered a bit of interest in the community and sort of revived a little bit of work on those ideas.

But it still wasn't working very well. Those contrastive methods really were producing representations of images, for example, that were kind of relatively low dimensional. If we measured like the eigenvalue spectrum of the covariance matrix of the representations that came out of those things, it would fill up maybe 200 dimensions, never more. Like even training on ImageNet and things like that, even with data augmentation. And so that was kind of disappointing. And it did work. There was a bunch of papers on this. And it worked okay.

There was a paper from Google, SimCLR, that demonstrated you could get decent performance with contrastive training of Siamese nets. But then about five years ago, one of my postdocs at Meta, Stephane Deny, tried an idea that at first I didn't think would work, which was to essentially have some measure of the quantity of information that comes out of the encoder and then try to maximize that.

And the reason I didn't think it would work is because I'd seen a lot of experiments along those lines that Jeff Hinton was doing in the 1980s, trying to sort of maximize information. You can never maximize information because you never have an appropriate measure of information content that is a lower bound. If you want to maximize something, you want to either be able to compute it or have a lower bound on it so you can push it up, right? And for information content, we only have upper bounds. So I always thought this was completely hopeless.

And then Stephane came up with a technique which was called Barlow Twins. Barlow is a famous theoretical neuroscientist who came up with the idea of information maximization. And it kind of worked. It was wow. So then it's like, we have to push this, right? So we came up with another method with a student of mine, Adrian Bardes, co-advised with Jean Ponce, who's affiliated with NYU too. Technique called VICReg—variance, invariance, covariance regularization. And that turned out to be simpler and work even better.
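
A hedged sketch of a VICReg-style objective, spelling out the three terms in the name; PyTorch is assumed and the coefficients are illustrative.

```python
# Sketch of a VICReg-style loss (PyTorch assumed; coefficients illustrative).
# Variance: keep each embedding dimension active. Invariance: two views map to
# the same point. Covariance: decorrelate dimensions so information spreads
# across the whole embedding instead of collapsing onto a few axes.
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    n, d = z_a.shape

    invariance = F.mse_loss(z_a, z_b)

    def variance_term(z):
        std = z.var(dim=0).add(1e-4).sqrt()
        return F.relu(1.0 - std).mean()      # hinge: push per-dimension std above 1

    def covariance_term(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d     # penalize off-diagonal covariance

    return (sim_w * invariance
            + var_w * (variance_term(z_a) + variance_term(z_b))
            + cov_w * (covariance_term(z_a) + covariance_term(z_b)))
```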

And since then we've made progress and Randall recently discussed an idea with me that can be pushed and made practical. It's called SIGREG. The whole system is called LoRD JEPA. He's responsible for the name, I don't know. That means Latent Orthogonal Regularized Discriminative JEPA, right? And SIGREG has to do with sort of making sure that the distribution of vectors that come out of the encoder is an isotropic Gaussian that's high dimensional.

I mean, there's a lot of things happening in this domain, which are really cool. I think there's going to be some more progress over the next year or two. We get a lot of experience with this. I think that's kind of a really good promising set of techniques to train models that learn abstract representations, which I think is key.

00:23:40 – Data and the bitter lesson

Ravid Shwartz-Ziv: And what do you think are the missing parts here? Do you think more compute will help or we need better algorithms? It's kind of like, do you believe in the bitter lesson, right? Do you think...

Allen Roush: Well, and furthermore, what do you think about the data quality problems with the internet post-2022? I've heard people compare it to low background steel now to refer to all that data before LLMs came out, like "low background tokens."

Yann LeCun: Okay. I think I'm totally escaping that problem. Okay. Here is the thing. I've been using this argument publicly over the last couple years. Training an LLM, if you wanted to have any kind of decent performance, requires training on basically all the available, freely available text on the internet, plus some synthetic data, plus licensed data, et cetera. So a typical LLM like, you know, going back a year or two, is trained on 30 trillion tokens. A token is typically three bytes. So that's 10 to the 14 bytes for pre-training. We're not talking about fine-tuning. 10 to the 14 bytes.

And for LLMs to be able to really kind of exploit this, they need to have a lot of memory storage because basically those are isolated facts. There is a little bit of redundancy in text, but a lot of it is just isolated facts, right? So you need very big networks because you need a lot of memory to store all those facts.

Okay, now compare this with video. 10 to the 14 bytes, if you count two megabytes per second for video—relatively compressed video, not highly compressed, but a bit—that would represent 15,000 hours of video. 10 to the 14 bytes. In 15,000 hours of video, you have the same amount of data as the entirety of all the text available on the internet.

Now, 15,000 hours of video is absolutely nothing. It's 30 minutes of YouTube uploads. Okay. It's the amount of visual information that a four-year-old has seen in his or her life, the entire life, waking time. It's about 16,000 hours in four years. It's not a lot of information.
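
The back-of-the-envelope arithmetic behind those numbers, using the rough figures quoted above (the roughly 11 waking hours per day is an assumption implied by the 16,000-hour total):

```python
# Order-of-magnitude arithmetic for the text-vs-video comparison above.
llm_tokens = 30e12            # ~30 trillion tokens of pre-training text
bytes_per_token = 3
text_bytes = llm_tokens * bytes_per_token                 # ~0.9e14, i.e. ~1e14 bytes

video_bytes_per_sec = 2e6     # ~2 MB/s of moderately compressed video
video_hours = text_bytes / (video_bytes_per_sec * 3600)
print(f"{video_hours:,.0f} hours of video ~= the same byte count")   # ~14,000, rounded to ~15,000 hours

# A four-year-old's waking visual experience, at an assumed ~11 waking hours/day:
child_hours = 4 * 365 * 11
print(f"~{child_hours:,} waking hours in four years")                # ~16,000 hours
```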

We have video models now, V-JEPA, V-JEPA 2 actually that just came out last summer, that was trained on the equivalent of a century of video data. It's all public data. Much more data, but much less than the biggest LLM actually, because even though it's more bytes, it's more redundant.

So you say, okay, it's more redundant, so it's less useful. But when you use self-supervised learning, you need redundancy. You cannot learn anything with self-supervised learning, or with anything else for that matter, if the data is completely random. Redundancy is what you can learn from. And so there's just much richer structure in real world data like video than there is in text, which kind of led me to claim that we are absolutely never, ever going to get to human level AI by just training on text. It's just never going to happen.

Ravid Shwartz-Ziv: Right. So it's this big debate in philosophy of whether AI should be grounded in reality or whether it could be just, you know, in the realm of symbolic manipulation and things like this.

00:27:00 – What is a world model?

Allen Roush: And when we talk about world models and grounding, I think, you know, there's still a lot of people who don't even understand what the idealized world model is, in a sense, right? So, for example, I'm influenced by having watched Star Trek, which I would hope you've seen a little bit of, and I'm thinking of the holodecks, right? I always thought that the holodeck was like an idealized, perfect world model, right? Even if so many episodes have it going too far, right, with people walking out of it. But, you know, it even simulates things like smell and physical touch. So do you think that something like that is the idealized world model, or would you define it a different way?

Yann LeCun: Okay, this is an excellent question. The reason it's excellent is because it goes to the core of really what I think we should be doing, which I'm doing, and how wrong I think everybody else is.

So people think that a world model is something that reproduces all details of what the world does. They think of it as a simulator. And of course, because deep learning is the thing, you're going to use some deep learning system as a simulator. A lot of people also are focused on video generation, which is kind of a cool thing. You produce those cool videos and wow, people are sort of really impressed by them.

Now there's no guarantee whatsoever that when you train a video generation system, it actually has an accurate model of the underlying dynamics of the world, or that it's learned anything, you know, particularly abstract about it. And so the idea that somehow a model needs to reproduce every detail of reality is wrong and harmful. And I'm going to tell you why.

00:30:00 – Abstraction hierarchies

A good example of simulation is CFD, computational fluid dynamics. It's used all the time. People use supercomputers for that, right? So you want to simulate the flow of air around an airplane. You cut up the space into little cubes. And within each cube, you have a small vector that represents the state of that cube—velocity, density or mass, and temperature, and maybe a couple of other things. And then you solve the Navier-Stokes equations, which is a partial differential equation. And you can simulate the flow of air.

Now, the thing is, this does not actually necessarily solve the equations very accurately. If you have chaotic behavior like turbulence and stuff like that, simulation is only approximately correct. But in fact, that's already an abstract representation of the underlying phenomenon. The underlying phenomenon is molecules of air that bump into each other and bump on the wing and on the airplane. But nobody ever goes to that level to do the simulation. That would be crazy.

It would require an amount of computation that's just insane. And it would depend on the initial conditions. I mean, there's all kinds of reasons we don't do this. And maybe it's not molecules. Maybe it's, you know, at a lower level, we should simulate particles and, like, you know, do the Feynman diagrams and simulate all the different paths that those particles are taking, because they don't take one path. It's not classical, it's quantum. So at the bottom, it's like quantum field theory. And probably already that is an abstract representation.

So, you know, everything that takes place between us at the moment, in principle, can be described through quantum field theory. Okay, we just have to measure the wave function of the universe in a cube that contains all of us. And even that would not be sufficient because there are entangled particles at the other side of the universe. So it wouldn't be sufficient, but let's imagine, okay, for the sake of the argument.

First of all, we would not be able to measure this wave function. And second of all, the amount of computation we would need to devote to this is absolutely gigantic. It would be some gigantic quantum computer that is the size of the earth or something. So no way we can describe anything at that level. And it's very likely that our simulation would be accurate for maybe a few nanoseconds. Beyond that, it would diverge from reality.

So what do we do? We invent abstractions. We invent abstractions of particles, atoms, molecules. In the living world, it's proteins, organelles, cells, organs, organisms, societies, ecosystems, et cetera. And basically every level in this hierarchy ignores a lot of details about the level below. And what that allows us to do is make longer term, more reliable predictions.

Okay. So we can describe the dynamics between us now in terms of the underlying science and in terms of psychology. Okay. That's a much higher level of abstraction than particle physics, right? And in fact, every level in the hierarchy I just mentioned is a different field of science. A field of science is essentially defined by the level of abstraction that you allow yourself to use to make predictions.

In fact, physicists have this down to an art in the sense that, if I give you a box full of gas, you could in principle simulate all the molecules of the gas, right? But nobody ever does this. But at a very abstract level, we can say PV equals nRT, right? Pressure times volume is proportional to the number of particles times the temperature, blah, blah, blah.

And so you know that at a global emergent phenomenological level, if you increase the pressure, the temperature will go up, or if you increase the temperature, the pressure will go up, right? Or if you let some particles out, then the pressure will go down and blah, blah, blah, right? So all the time we build phenomenological models of something complicated by ignoring all kinds of details that physicists call entropy. But it's really systematic. That's the way we understand the world. We do not memorize every detail of—we certainly don't reconstruct—what we perceive.

00:33:47 – World models as abstract simulators

Allen Roush: So world models don't have to be simulators.

Yann LeCun: Well, they are simulators, but in abstract representation space. And what they simulate is only the relevant part of reality. Okay. If I ask you, where is Jupiter going to be a hundred years from now? I mean, we have an enormous amount of information about Jupiter, right? But within this whole information that we have about Jupiter, to be able to make that prediction where Jupiter is going to be a hundred years from now, you need exactly six numbers—three positions and three velocities. And the rest doesn't matter.
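
As a toy illustration of the six-numbers point, an abstract model of Jupiter's orbit is just a position vector, a velocity vector, and the dynamics. The constants and initial state below are approximate placeholders, not real ephemeris data.

```python
# Sketch of the "six numbers" point: the abstract state of Jupiter is a position
# and a velocity, advanced under solar gravity. Constants are approximate and the
# initial state is an illustrative near-circular orbit, not real ephemeris data.
import numpy as np

GM_SUN = 1.327e20                            # m^3 / s^2 (approximate)
state = {
    "pos": np.array([7.78e11, 0.0, 0.0]),    # m   (3 numbers)
    "vel": np.array([0.0, 1.307e4, 0.0]),    # m/s (3 numbers)
}

def step(state, dt=86400.0):                 # one-day semi-implicit Euler step
    r = state["pos"]
    a = -GM_SUN * r / np.linalg.norm(r) ** 3
    vel = state["vel"] + a * dt
    pos = r + vel * dt
    return {"pos": pos, "vel": vel}

for _ in range(100 * 365):                   # roughly a century of daily steps
    state = step(state)
print(state["pos"])                          # approximately where Jupiter ends up
```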

Ravid Shwartz-Ziv: So you don't believe in synthetic data sets?

Yann LeCun: I do. No, it's useful. Data from games. I mean, there's certainly a lot of things that you learn from synthetic data, from games and things like that. I mean, children learn a huge amount from play, which basically are kind of simulations of the world a little bit, right? But in conditions where they can't kill themselves.

Allen Roush: But I worry, at least for video games, that, for example, the green screen work, the actors doing the animations, is designed to look good, you know, often badass I guess for an action game, but these animations often don't correspond very well to reality. And so I worry that a physical system that's been trained with the assistance of world models might pick up similar quirks, at least in the very short term. Is this something that worries you?

Yann LeCun: It depends on what level you train them. So for example, I mean, sure, if you use a very accurate robotic simulator, for example, right, it's going to accurately simulate the dynamics of an arm. You know, when you apply torques to it, it's going to move in a particular way. There's dynamics, no problem. Now simulating the friction that happens when you grab an object and manipulate it, that's super hard to do it accurately. Friction is very hard to simulate. Okay. And so those simulators are not particularly accurate for manipulation. They're good enough that you can train a system to do it and then you can do sim-to-real with a little bit of domain adaptation. So that can work.

But the point is much more important. Like, for example, there is a lot of completely basic things about the world that we completely take for granted, which we can learn at a very abstract level, but it's not language related. Okay, so the fact, for example—and I've used this example before and people made fun of me for it, but it's really true. Okay.

I have those objects on the table, and the fact that when I push the table, the objects move with it. Like, this is something we learned. It's not something that you're born with. The fact that most objects will fall when you let them go, right? With gravity. Maybe that's learned around the age of nine months.

And the reason people make fun of me with this is because I said, you know, LLMs don't understand this kind of stuff. Right. And they absolutely do not even today, but you can train them to give the right answer when you ask them a question. You know, if I put an object on the table, then I push the table, what will happen to the object? It will answer, the object moves with it, but because it's been fine-tuned to do that, okay? So it's more like regurgitation than sort of real understanding of the underlying dynamics.

Ravid Shwartz-Ziv: But if you look at, I don't know, Sora, or like Veo from Google, they have like good physics of the world, right?

Yann LeCun: They are not perfect. They have some physics, yeah. They have some physics.

Ravid Shwartz-Ziv: So do you think we can push it farther or do you think it's one way to learn physics?

Yann LeCun: So all of those models actually make predictions in representation space. They use diffusion transformers and that prediction, the computation of the video snippet at an abstract level, is done in representation space. Not always, by the way, sometimes it's just in pixel space. And then there's a second diffusion model that turns these abstract representations into a nice looking video. And that might be more like collage. We don't know, right? Because we can't really measure the coherence of such systems with reality.

00:38:00 – Object permanence

But to the previous point, you can train... Like here is another completely obvious concept to us that we don't even imagine that we learn, but we do learn it. A person cannot be in two places at the same time. We learned this because very early on we learned object permanence, the fact that when an object disappears, it still exists. Okay. And reappears as the same object that you saw before.

How can we train an AI system to learn this concept? So object permanence, you know, you just show it a lot of videos where objects, you know, go behind a screen and then reappear on the other side or they go behind a screen and the screen goes away and the object is still there.

And when you show 4 month old babies scenarios where things like this are violated, their eyes open like super big and they're like super surprised because reality just violated their internal model. Same thing when you show a scenario of like a little car on a platform, you push it off the platform and it appears to float in the air. They also look at it, you know, nine month, 10 month old babies look at it like really surprised. Six month old babies barely pay attention because they haven't learned about gravity yet. So they haven't been able to incorporate the notion that every object is supposed to fall.

So this kind of learning is really what's important. And you can learn this from very abstract things. In the same way babies learn about social interactions by being told stories with simple pictures. It's a simulation, an abstract simulation of the world, but it sort of trains them on particular behavior.

So you could imagine like training a system from, let's say an adventure game, like a top-down 2D adventure game where you tell your character, move north, and he goes to the other room and it's not in the first room anymore because he moved to the other room. Right. Now, of course, in the adventure game, you have Gandalfs that you can call and he just appears, right? So that's not physical. But when you pick up a key from a treasure chest, you have the key, no one else can have it. And you can use it to open a door. Like, there's a lot of things that you learn that are very basic, even in abstract environments.

00:40:21 – Game AI: Why NetHack is still impossible

Allen Roush: Yeah. And I just want to observe that some of those adventure games that they try to train models on—one of them you might know about is NetHack, right?

Yann LeCun: Sure.

Allen Roush: And NetHack is fascinating because it is an extraordinarily hard game. Like, ascending in that game without cheats and without going to the wiki can take people 20 years. People still don't manage it just from playing. And my understanding is that AI agents, the very best agent models we have or even world models, are pathetic at it.

Yann LeCun: Yeah. Yeah. So to the point, people have come up with sort of dumbed down versions of NetHack.

Allen Roush: Exactly, MiniHack. They had to dumb it down just for AI.

Yann LeCun: Some of my colleagues have been working with this. Actually, one of my master's students here. Michael Eickenberg, who I mentioned earlier, has been also doing some work there.

Now, what's interesting there is that there is a type of situation like this where you need to plan. But you need to plan in the presence of uncertainty. The problem with games and adventure games in particular is that you don't have complete visibility of the state of the system. You don't have the map in advance. You need to explore and blah, blah. You can get killed every time you do this.

But the actions are essentially discrete. The number of possible actions is finite, it's turn based. And so in that sense, it's like chess, except chess is fully observable. Go is also fully observable. Stratego isn't.

Allen Roush: Stratego is not.

Yann LeCun: Poker is not. And so it makes it more difficult if you have uncertainty, of course. But those are games where the number of actions you can take is discrete. And basically what you need to do is tree exploration. Okay. And of course, the tree of possible states grows exponentially with the number of moves. And so you have to have some way of generating only the moves that are likely to be good and basically never generate the other ones, or prune them down.

And you need to have a value function, which is something that tells you, okay, I can't plan to the end of the game, but even though I'm planning only sort of nine moves ahead, I have some way of evaluating whether a position is good or bad, whether it's going to lead me to a victory or a solution. So you need those two components: basically something that guesses what the good moves are, and then something that evaluates. And if you have both of those things, you can train those functions using something like reinforcement learning, or behavior cloning if you have data.
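
A bare-bones sketch of those two ingredients wrapped around a depth-limited search; propose_moves, value, and apply_move are stand-ins for what would be learned components in an AlphaZero-style system.

```python
# Sketch: a move-proposal function prunes the branching factor, and a value
# function evaluates positions at the search horizon. The search itself is plain
# depth-limited negamax; state objects are assumed to expose is_terminal(), and
# value() is assumed to score from the perspective of the player to move.
def search(state, depth, propose_moves, value, apply_move):
    """Return (best_score, best_move) from `state`, looking `depth` plies ahead."""
    if depth == 0 or state.is_terminal():
        return value(state), None               # value function caps the exponential tree
    best_score, best_move = float("-inf"), None
    for move in propose_moves(state):           # only the moves the policy thinks are promising
        child = apply_move(state, move)
        score, _ = search(child, depth - 1, propose_moves, value, apply_move)
        score = -score                          # negamax: opponent's best is our worst
        if score > best_score:
            best_score, best_move = score, move
    return best_score, best_move
```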

I mean, the basic idea for this goes back to Samuel's checker players from 1964. It's not recent. But of course, the power of it was demonstrated with AlphaGo and AlphaZero and things like that.

00:44:08 – Moravec's Paradox

So that's good, but that's a domain where humans suck. Humans are terrible at playing chess, right? And playing Go, like machines are much better than we are because of the speed of tree exploration and because of the memory that's required for tree exploration. We just don't have enough memory capacity to do breadth-first tree exploration. So we suck at it.

You know, when AlphaGo came out, people before that thought that the best human players were maybe two or three stones of handicap below an ideal player that you might call God. No, you know, humans are terrible. The best players in the world need like eight or nine stones of handicap.

Allen Roush: I can't believe I get the pleasure to talk about game AI with Yann. I just have a few follow-up questions on this. The first one is this example that you talk about around humans being terrible at chess. And I'm familiar a bit with the development of chess AI over the years.

I've heard this referred to as Moravec's paradox and explained as humans have evolved over billions or millions, large number of years for physical locomotion and that's why babies and humans are very good at this, but we have not evolved at all to play chess. So that's one question.

And then a second question that's related is, a lot of people today who play video games, and I'm one of them, have observed that it feels like AI, at least in terms of enemy AI, has not improved really in 20 years, right? That some of the best examples are still like Halo 1 and FEAR from the early 2000s. So when do you think that advancements that we've been doing in the lab are going to actually have real impact on gamers in a non-generative AI sense?

Yann LeCun: I used to be a gamer, never an addicted one, but my family is in it because I have three sons in their 30s and they have a video game design studio between them. I'm sort of embedded in that culture.

But yeah, no, you're right. You know, it's also true that despite the accuracy of physical simulators, a lot of those simulations are not used by studios who make animated movies because they want control. They don't necessarily want accuracy, they want control. And in games it's really the same thing. It's a creative act. What you want is some control about the course of the story or the way the NPC kind of behave and all that stuff, right?

And it's difficult to take control at the moment. So, I mean, it will come, but there's some resistance from the creators.

I think Moravec's paradox is very much still in force. So Moravec, I think, formulated it in 1988, if I remember correctly. And he said like, yeah, how come things that we think of as uniquely human intellectual tasks like chess, we can do with computers or computing integrals or whatever. But the things that we take for granted, we don't even think is an intelligent task, like what a cat can do, we can't do with robots.

And even now, 37 years later, we still can't do them well. I mean, of course we can train robots by imitation and a bit of reinforcement learning and by training through simulation to kind of locomote and avoid obstacles and do various things. But they're not nearly as inventive and creative and agile as a cat. It's not because we can't build a robot. We certainly can. It's just we can't make them smart enough to do all the stuff that a cat or a mouse can do, let alone a dog or a monkey. Right.

So you have all those people bloviating about like AGI in a year or two, it's completely deluded. It's just complete delusion. Because the real world is way more complicated. And you're not going to get anywhere by tokenizing the world and using LLMs. It's just not going to happen.

00:48:02 – AGI timelines

Ravid Shwartz-Ziv: What are your timelines? When will we see like, I don't know, AGI, whatever it means, or like a little bit...

Allen Roush: And also, where are you on the optimist pessimist side? Because there's some doomers among—or doomerism amongst like Gary Marcus. And I think, well, I guess he's a critic. The doomer would be, is it Yoshua? Like, where do you fall in all these things?

Yann LeCun: Okay, I'll answer the first question first. So first of all, there is no such thing as general intelligence. This concept makes absolutely no sense because it's really designed to designate human level intelligence, but human intelligence is super specialized. Okay. We can handle the real world really well, like navigate and blah, blah. We can handle other humans really well because we evolved to do this. But chess, we suck. So there's a lot of tasks that we suck at where a lot of other animals are much better than we are. Okay.

So what that means is that we are specialized. We think of ourselves as being general, but it's simply an illusion because all of the problems that we can apprehend are the ones that we can think of. Right. And vice versa. And so we're general in all the problems that we can imagine. But there's a lot of problems that we cannot imagine. And there's some mathematical arguments for this, which I'm not going to go into unless you ask me.

So this concept of general intelligence is complete BS. We can talk about human level intelligence. So are we going to have machines that are as good as humans in all the domains where humans are good, or better than humans? And the answer is, we already have machines that are better than humans in some domains. Like, we have machines that can translate 1,500 languages into 1,500 other languages in any direction. No human can do this, right? And there are a lot of other examples like this in various other domains.

But will we have machines that are as good as humans in all domains? The answer is absolutely yes. There is no question that at some point we'll have machines that are as good as humans in all domains. Okay, but it's not going to be an event. It's going to be very progressive.

We're going to make some conceptual advances, maybe based on JEPA, world models, planning, things like that, over the next few years. And if we're lucky, if we don't hit an obstacle that we didn't see, perhaps this will lead to kind of a good path to human level AI. But perhaps we're still missing a lot of basic concepts.

And so the most optimistic view rests on learning good world models, being able to do planning, and understanding complex signals that are continuous, high dimensional, and noisy. If we make significant progress in that direction over the next two years, the most optimistic view is that we'll have something that is close to human intelligence, or maybe dog intelligence, within five to 10 years. Okay. But that's the most optimistic.

It's very likely that, as has happened multiple times in the history of AI, there's some obstacle we're not seeing yet, which will actually require us to invent some new conceptual things to go beyond it. In which case that may take 20 years, maybe more. Okay. But no question, it will happen.

Ravid Shwartz-Ziv: Do you think it will be easier to get from the current level to a dog level intelligence compared to dog to human level?

Yann LeCun: No, I think the hardest part is to get to dog level. Once you get to dog level, you basically have most of the ingredients. And then, you know, what's missing from—okay, what's missing from primates to humans beyond just size of brain? Is language maybe, okay? But language is basically handled by the Wernicke area, which is a tiny little piece of brain that's right here, and the Broca area, which is a tiny piece of brain right here. Both of those evolved in the last less than a million years, maybe two. And it can't be that complicated.

And we already have LLMs that do a pretty good job at encoding language into abstract representations and then decoding thoughts into text. So maybe we'll use LLMs for that. So LLMs will be like the Wernicke and Broca areas in our brain. What we're working on right now is the prefrontal cortex, which is where our world model resides.

00:52:50 – AI Safety

Allen Roush: Well, this gets me into a few questions about safety and the destabilizing potential impact. I'll start this with something a little bit funny, which is to say if we really get dog level intelligence, then the AI of tomorrow has gotten profoundly better at sniffing than any human.

And something like that is just the tip of the iceberg for the destabilizing impacts of AI tomorrow, let alone today. I mean, we have Sam Altman talking about super persuasion because the AI doxes you. So it figures out who you are through the multi-turn conversation. So it gets really good at kind of customizing its arguments towards you. We've had AI psychosis, right? Like people who've done horrible things as a result of kind of believing in a sycophantic AI that is telling them to do things they shouldn't do.

Yann LeCun: I've got to tell you about that too.

Allen Roush: What?

Yann LeCun: One day a few months ago, I was at NYU and I walked down to get lunch and there's a dude who's surrounded by a whole bunch of police officers and security guards. And I walk past and the guy recognizes me and says, oh, Mr. LeCun. And the police officer kind of whisked me away outside and tells me like, you don't want to talk to him.

Turns out the guy had come from the Midwest by bus to here. And he's kind of emotionally disturbed. He had gone to prison, blah, blah, blah, for various things. And he was carrying a bag with a huge wrench and pepper spray and a knife. And so the security guards got alarmed and basically called the police. And then the police realized, okay, this guy is kind of weird. So they took him away and had him examined. And eventually he went back to the Midwest. But I mean, he didn't seem threatening to me, but the police weren't so sure. So, yeah, it happens.

I had high school students writing emails to me saying, I read all of those pieces by doomers who said AI is going to take over the world and either kill us all or take our jobs. So I'm totally depressed. I don't go to school anymore. So I answer them, saying, like, you know, I don't believe all that stuff, humanity is still going to be in control of all of this.

Now, there's no question that every powerful technology has good consequences and bad side effects that sometimes are predicted and corrected sufficiently in advance and sometimes not so much. Right. And it's always a trade-off. That's the history of technological progress. Right.

So, let's take cars as an example. Cars crash sometimes and initially, you know, brakes weren't that reliable and cars would flip over and there was no seat belts and blah, blah, blah. Right. And eventually, the industry made progress and started putting seat belts and crumple zones and automatic controlling systems so that the car doesn't sway and doesn't flip or whatever. So cars now are much safer than they used to be.

Now, there's one thing that is now mandatory in every car sold in the EU. And it's actually an AI system that looks out the window. It's called AEBS, automatic emergency braking system. It's basically a convolutional net. And it looks out the windshield and it detects all objects. And if it detects that an object is too close, it just automatically brakes. Or if it detects that there's going to be a collision that the driver is not going to be able to avoid, it just stops the car or swerves. Right.

And one statistic I read is that this reduces frontal collisions by 40%. And so it became mandatory equipment in every car sold in the EU, even low end because it's so cheap now. So this is AI not killing people, saving lives. I mean, also same thing for medical imaging and everything. There's a lot of lives being saved.

Ravid Shwartz-Ziv: But do you think—you and Jeff and Yoshua, the three of you won the Turing Award together and you have different opinions about it, right? And Jeff says that he has regrets, and Yoshua works on safety. And you're trying to push it forward. Do you think you will get to some level of intelligence where you will say, oh, this becomes too dangerous? We need to work more on the safety side?

Yann LeCun: I mean, you have to do it right. I'm going to use another example. Jet engines. Okay, I find it astonishing that you can fly halfway around the world on a two-engine airplane in complete safety. And I really mean halfway around the world, it's a 17-hour flight. You can fly from New York to Singapore on an Airbus A350. It's astonishing.

And when you look at a jet engine, a turbofan, it should not work. I mean, there is no metal that can withstand the kind of temperatures that occur there, and the forces when you have a huge turbine rotating at 2,000 RPM or whatever speed—the force that this puts on it is just insane, hundreds of tons. So it should not be possible. Yet those things are incredibly reliable.

So what I'm saying is you can't build something like a turbojet the first time you build it, it's not going to be safe. It's going to run for 10 minutes and then blow up. And it's not going to be fuel efficient, et cetera. It's not going to be reliable.

But as you make progress in engineering and materials, et cetera, there's so much economic motivation to make this good that eventually it's going to be the type of reliability we see today. The same is going to be true for AI. We're going to start making systems that can plan, can reason, have world models, blah, blah. But they're going to have the power of maybe a cat brain, right, which is about a hundred times smaller than the human brain.

And then we're going to put guardrails in them to prevent them from taking actions that are obviously dangerous or something. And you can do this at a very low level. Like if you have, I don't know, a domestic robot. So one example that Stuart Russell has used is to say, if you have a domestic robot and you ask it to fetch you coffee, and someone is standing in front of the coffee machine, then if the system wants to fulfill its goal, it's going to have to either assassinate or smash the person in front of the coffee machine to get access to the coffee machine. And obviously, you don't want that to happen.

It's like the paperclip maximization. It's kind of a ridiculous example because it's super easy to fix this, right? You put some guardrails that say, well, you're a domestic robot. You should stay away from people and maybe ask them to move if they are in the way, but not actually hurt them in any way or whatever.

And you can put a whole bunch of low level constraints like this. If you're a domestic robot and it's a cooking robot, right? So it has a big knife in its hand and it's cutting your cucumber. Don't flail your arms if you have a big knife in your hand and people around. It can be a low level constraint that the system has to satisfy.

Now, some people say, but with LLMs, we can fine tune them to not do things that are dangerous, but you can jailbreak them. You can always find prompts where they're going to escape their conditioning, all the things that we stop them from doing.

01:02:38 – Objective-driven AI

I agree. That's why I'm saying we shouldn't use that in the end. We should use those objective-driven AI architectures that I was talking about earlier, where you have a system that has a world model, can predict the consequences of its actions, and can figure out a sequence of actions to accomplish a task, but also is subject to a bunch of constraints that guarantee that whatever action is being pursued and whatever state of the world is being predicted does not endanger anybody or does not have negative side effects.

So there, by construction, the system is intrinsically safe because it has all those guardrails. And because it obtains its output by optimization, by minimizing the objective of the task and satisfying the constraints of the guardrails, it cannot escape that. It's not fine tuning. It's by construction.
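
A toy sketch of what output-by-optimization subject to guardrail constraints could look like; world_model, task_cost, and guardrail_cost are placeholders, and the random-shooting optimizer is chosen only for brevity, not fidelity to any particular system.

```python
# Sketch of objective-driven planning: actions are chosen by optimizing against a
# world model, minimizing a task cost plus guardrail penalties, rather than by
# sampling an autoregressive policy. `world_model`, `task_cost`, and
# `guardrail_cost` are placeholders for learned or hand-written components; the
# 2-dimensional action and random-shooting search are purely illustrative.
import numpy as np

def plan(state, world_model, task_cost, guardrail_cost, horizon=10, n_candidates=256):
    best_actions, best_total = None, float("inf")
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, 2))  # candidate action sequence
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)                        # predicted consequence of the action
            total += task_cost(s) + guardrail_cost(s)    # guardrails enter the objective by construction
        if total < best_total:
            best_actions, best_total = actions, total
    return best_actions                                  # execute, or re-plan after the first action
```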

Allen Roush: Yeah. And there's a technique for LLMs for constraining the output space, where you ban all outputs except whatever you want, like maybe the tokens zero to 10, and exclude everything else. And they have that even for diffusion models. Do you think that tactics like that, as they exist today, significantly improve the utility of those kinds of models?

Yann LeCun: Well, they do, but they're ridiculously expensive, because the way they work is that you have to have the system generate lots of proposals for an output and then have a filter that says, well, this one is good, this one's terrible, et cetera, or rate them and then just put out the one that has the lowest toxicity rating, essentially. So it's insanely expensive, right? So unless you have some sort of objective-driven value function that drives the system towards producing those high-scoring, low-toxicity outputs, it's going to be expensive.
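A rough sketch of the generate-then-filter scheme described above, assuming placeholder `generate` and `toxicity_score` functions rather than a real LLM sampler or safety classifier; the point is only that the cost grows with the number of candidates you have to generate and rate.

```python
# Sketch of the "generate N candidates, then filter/rerank" scheme described
# above. `generate` and `toxicity_score` are stand-ins for a real LLM sampler
# and a real safety rater; every rejected sample is wasted compute.
import random

random.seed(0)

def generate(prompt: str) -> str:
    """Placeholder for sampling one completion from an LLM."""
    return f"{prompt} -> candidate #{random.randint(0, 9999)}"

def toxicity_score(text: str) -> float:
    """Placeholder for a learned toxicity / quality rater (lower is better)."""
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    """Sample n completions and return the one the rater likes best.
    Cost scales linearly with n, which is why this approach is expensive."""
    candidates = [generate(prompt) for _ in range(n)]
    return min(candidates, key=toxicity_score)

print(best_of_n("How do I fix my bike?"))
```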

01:04:06 – Meta's reorganization and FAIR

Allen Roush: Yeah, and I want to change the topic just a tiny bit. We've been very technical for a while, but I think our audience and the world have a few questions that are maybe a little bit more socially related. The person who appears to be trying to fill your shoes at Meta, Alex Wang—I'm curious, do you have any thoughts about how that will play out for Meta?

Yann LeCun: He's not in my shoes at all. He's in charge of all the R&D and product that are AI related at Meta. So he's not a researcher or scientist or anything like that. It's more kind of overseeing the entire operation.

So within Meta's Superintelligence Lab, which is his organization, there are kind of divisions if you want. So one of them is FAIR, which is long-term research. And one of them is GenAI Lab, which is basically building frontier models, and which is almost entirely LLM focused.

Another organization is AI infrastructure, software infrastructure. Hardware is some other organization. And then the last one is products. So people who would take the frontier models and then turn them into actual chatbots that people can use and disseminate them and plug them into WhatsApp and everything else. So those are four divisions. He oversees all of that.

And there are several chief AI scientists. There is the chief scientist of FAIR—that's me. I really have a long-term view, and basically I'm going to be at Meta for another three weeks. And FAIR is led by our NYU colleague Rob Fergus right now, after Joelle Pineau left several months ago.

FAIR is being pushed towards working on slightly shorter-term projects than it has traditionally, with less emphasis on publication and more focus on helping the GenAI lab with the LLMs and frontier models. You know, less publication, which means Meta is becoming a little more closed.

GenAI Lab has a chief scientist also, who is really focused on LLMs. The other organizations are more like infrastructure and products, and there's some applied research there. So for example, the group that works on SAM, the Segment Anything Model, is actually part of the product division of MSI. They used to be at FAIR, but because they worked on relatively practical things, they were moved to the product side.

01:07:17 – SSI, Physical Intelligence, and other startups

Allen Roush: And do you have any opinions on some of the other companies that are trying to move into world models like Thinking Machines or even I've heard Jeff Bezos and some of his...

Yann LeCun: It's not clear at all what Thinking Machines is doing. Maybe you have more information than me.

Allen Roush: Maybe not. Sorry. Maybe I'm mixing it up here. It's Physical Intelligence. Sorry. And then I mix them up with SSI as well. They're all kind of like...

Yann LeCun: So SSI, nobody knows what they're doing, including their own investors. At least that's the rumor I heard. It's a bit of a joke, yeah.

Physical Intelligence, the company, is focused on producing geometrically correct videos, where there is persistent geometry—when you look at something, then turn around and come back, it's the same object you had before. It doesn't change behind your back. So it is generative, right? I mean, the whole idea is to generate pixels, which I just spent a long time arguing is a bad idea.

There are other companies that have world models. The good one is Wayve, W-A-Y-V-E. It's a company based in Oxford and, for disclosure, I'm an advisor. They have a world model for autonomous driving. And the way they're training it is by training a representation space—basically training a VAE or VQ-VAE—and then training a predictor to do temporal prediction in that abstract representation space.

So they have half of it right and half of it wrong. The piece they have right is that you make predictions in representation space. The piece they have wrong is that they haven't figured out how to train their representation space in any other way than by reconstruction. And I think that's bad. But their model is great. Like it works pretty well. I mean, among all the people who kind of work in this kind of stuff, they're pretty far advanced.
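A minimal sketch of the two-part recipe described above, under purely illustrative shapes and architectures: an encoder trained with a reconstruction loss (the VAE-style part being criticized) plus a predictor trained to forecast the next latent. A JEPA-style variant would drop the decoder and train the encoder and predictor jointly in representation space.

```python
# Sketch of a latent world model: encode frames to a representation, then
# train a predictor to forecast the next latent. Here the representation is
# trained by reconstruction (the part criticized above); a JEPA-style variant
# would remove the decoder. Shapes and architectures are illustrative only.
import torch
import torch.nn as nn

latent_dim, frame_dim = 32, 3 * 16 * 16  # tiny "frames" for illustration

encoder = nn.Sequential(nn.Linear(frame_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, frame_dim))
predictor = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters(),
                        *predictor.parameters()], lr=1e-3)

frames = torch.randn(8, 10, frame_dim)  # batch of 8 clips, 10 "frames" each

for step in range(100):
    x_t, x_next = frames[:, :-1], frames[:, 1:]
    z_t = encoder(x_t)
    z_next = encoder(x_next)
    recon_loss = nn.functional.mse_loss(decoder(z_t), x_t)                   # trains the representation
    pred_loss = nn.functional.mse_loss(predictor(z_t), z_next.detach())      # temporal prediction in latent space
    loss = recon_loss + pred_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```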

There are people who talk about similar things at Nvidia. There's a company called SandboxAQ—its CEO, Jack Hidary, talks about large quantitative models, as opposed to large language models. So basically predictive models that can deal with continuous, high-dimensional, noisy data, right? Which is also what I've been talking about.

And Google, of course, has been working on models mostly using generative approaches. There was an interesting effort at Google by Danijar Hafner. So he built models called Dreamer V1, 2, 3, 4. That was on a good path, except he just left Google to create his own startup. So I'm interested.

01:10:00 – Silicon Valley's LLM monoculture

Ravid Shwartz-Ziv: So you were really critical of the Silicon Valley culture of focusing on LLMs. And this is one of the reasons you're now starting the new company—it's starting in Paris, right? So do you think we will see more and more of this, or do you think it will be something very unique, with only a few companies in Europe?

Yann LeCun: Well, the company I'm starting is global. Okay. It has an office in Paris, but it's a global company. It has an office in New York too. A couple of other places.

Okay, there is an interesting phenomenon in industry, which is that everybody has to do the same thing as everybody else, because it's so competitive that if you start taking a tangent, you're taking a risk of falling behind because you're using a different technology than everybody else, right? So basically, everyone is trying to catch up with the others. So that creates this herd effect and a kind of monoculture, which is really specific to Silicon Valley, where OpenAI, Meta, Google, Anthropic, everybody is basically working on the same thing.

And sometimes like what happened a while back, another group like DeepSeek in China comes up with kind of a new way of doing things. And everybody is like, what? You mean like other people outside Silicon Valley are not stupid and can come up with original ideas? I mean, there's a bit of superiority complex, right?

But you're basically in your trench and you have to move as fast as possible because you can't afford to fall behind the other guys who you think are your competitors. But you run the risk of being surprised by something that's completely out of left field that uses a different set of technologies and maybe addresses a different problem.

So what I'm interested in is completely orthogonal, because the whole JEPA and world-model idea is really about handling data that is not easily handled by LLMs. And the types of applications we're envisioning—there are tons of applications in industry where the data comes to you in the form of continuous, high-dimensional, noisy data, including video—are domains where LLMs are basically not present, where people have tried to use them and essentially failed.

Okay, so if you don't want to be—the expression in Silicon Valley is that you are "LLM-pilled." You think that the path to superintelligence—you just train up the LLMs, you train on more synthetic data, you license some more data, you hire thousands of people to school your system in post-training. You invent new tweaks on RL and you're going to get to superintelligence. And I think this is complete bullshit. Like it's just never going to work.

And then you add a few reasoning techniques, which basically consist of doing super-long chains of thought and then having the system generate lots and lots of different token outputs, from which you can select good ones using some sort of evaluation function. That's the way a lot of these things work. This is not going to take us there. It's just not.

So yeah, I mean, you need to escape that culture. And there are people within all the companies in Silicon Valley who think like, this is never going to work, I want to work on JEPA and hierarchical planning. So escaping the monoculture, I think is important. Yeah, that's part of the story.

01:15:42 – US vs China: The open source paradox

Ravid Shwartz-Ziv: And what do you think about the competition between the US, China and Europe? Now that you are starting a company, do you see that some places are more attractive than others?

Yann LeCun: We're in this very paradoxical situation where all the American companies—till now not Meta, but all the others—have become really secretive to preserve what they think is a competitive advantage.

And by contrast, the Chinese players, companies and others, have been completely open. So the best open source systems at the moment are Chinese. And that causes a lot of the industry to use them because they want to use open source systems. And they hold their nose a little bit because they know those models are fine tuned to not answer questions about politics and stuff like that. But they don't really have a choice.

And certainly a lot of academic research now uses the best Chinese models—certainly everything that has to do with reasoning and things like that. So it's really paradoxical. And a lot of people in the US, in industry, are really unhappy about this. They really want a serious non-Chinese open source model. That could have been Meta, but Meta has been a disappointment for various reasons. Maybe that will get fixed with the new efforts at Meta, or maybe Meta will decide to go closed as well. It's not clear.

Allen Roush: Mistral just had a model release.

Yann LeCun: Which is cool for code gen. Yeah. That's right. No, it's cool. Yeah, they maintain openness. No, it's really interesting what they're doing.

01:18:00 – Why start a company at 65?

Ravid Shwartz-Ziv: Wow. Okay. Let's go to more personal questions. Yeah. So you're 65. Right? You won a Turing Award, you just got the Queen Elizabeth Prize. Basically, you could retire, right?

Yann LeCun: Yeah, I could. That's what my wife wants me to do.

Ravid Shwartz-Ziv: So why start a new company now? Like what keeps you up?

Yann LeCun: Because I have a mission, you know? I mean, I always thought that either making people smarter or more knowledgeable, or making them smarter with the help of machines—so basically, increasing the amount of intelligence in the world—was an intrinsically good thing.

Intelligence is really kind of the commodity that is the most in demand, certainly in government. But in every aspect of life, we are limited as a species, as a planet, by the limited supply of intelligence, which is why we spend enormous resources educating people and things like that.

So increasing the amount of intelligence at the service of humanity, or the planet more globally, not just humans, is intrinsically a good thing, despite what all the doomers are saying. Of course, it can be dangerous, and you have to protect against that, in the same way you have to make sure your jet engine is safe and reliable and your car doesn't kill you in a small crash. But that's okay. That's an engineering problem. It's not a fundamental issue. It's also a political problem, but it's not insurmountable.

So that's an interesting thing, and if I can contribute to it, I will. Basically, all the research projects I've done in my entire career, even those not related to machine learning, were focused on either making people smarter—that's why I'm a professor, and that's also why I communicate publicly a lot about AI and science and have a big presence on social networks—because I think people should know stuff, right?

But also on machine intelligence, because I think machines will assist humans and make them smarter. People think there is a fundamental difference between trying to make machines that are intelligent and autonomous, blah, blah, blah, and trying to make machines that are assistive to humans—that it's a different set of technologies. It's not. It's exactly the same technology.

And just because a system is intelligent—or a human is intelligent—doesn't mean it wants to dominate or take over. It's not even true of humans. It's not the humans who are the smartest who want to dominate others. We see this on the international political scene every day. It's not the smartest among us who want to be the chief.

And probably many of the smartest people that we've ever met are people who basically want nothing to do with the rest of humanity. They just want to work on their problems. I'm kind of stereotyping here.

Allen Roush: That's what Hannah Arendt talks about, right—the vita contemplativa versus the vita activa, the contemplative life versus the active life, and making a choice early on about what you work on.

Yann LeCun: But you can be a dreamer or a contemplative and still have a big impact on the world, right? Through your scientific production—think of Einstein. Or even Newton—Newton basically didn't want to meet anybody. Or Paul Dirac—Paul Dirac was practically autistic.

Allen Roush: Famously.

01:20:29 – Regrets and backprop

Allen Roush: Is there a paper or idea you haven't written up—something that's nagging at you, that you want to get to but maybe don't have time for? Any regrets?

Yann LeCun: Oh yeah, a lot. My entire career has been a succession of me not devoting enough time to express my ideas and writing them down and mostly getting scooped.

Allen Roush: What are the significant ones?

Yann LeCun: I don't want to go through all of that. Backprop is a good one. I published some sort of early version of an algorithm to train multilayer nets, which today we would call target prop. And I had the backprop thing figured out, except I didn't write it up before Rumelhart, Hinton, and Williams. They were nice enough to cite my earlier paper in theirs.

So there's been a bunch of those. ConvNets, various other things. And things that are more recent. I have no regrets about this. Like this is life. I'm not going to say, oh, I invented this in 1991. Like some...

Allen Roush: We all know.

Yann LeCun: The way ideas pop up is relatively complex. It's rare that someone comes up with an idea in complete isolation and that nobody else comes up with similar ideas at the same time.

Most of the time they appear simultaneously. But then there are various steps beyond just having the idea. There is writing it down, but also writing it down in a convincing way, in a clear way. And then there is making it work on toy problems, maybe. Then making the theory that shows it can work. Then making it work on real applications. And then making a product out of it. So there's this whole chain.

And some people, at the extreme, think that the only person who should get all the credit is the very first person who got the idea. I think that's wrong. There's a lot of really difficult steps to get this idea to the state where it actually works.

So this idea of world models, it goes back to the 1960s. People in optimal control had world models to do planning. That's the way NASA planned the trajectory of the rockets to go to orbit. Basically simulating the rocket and by optimization, figuring out the control law to get the rocket to where it needs to be. So that's an old idea, very old idea.

The fact that you could do some level of training or adaptation of this model is called system identification in optimal control. A very old idea too; it goes back to the 70s. There's something called MPC, model predictive control, where you adapt the model as you go, while you're running the system. That goes back to the 70s, to some obscure paper in France.
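A toy sketch of the idea just described, assuming a one-dimensional plant with a single unknown gain: plan with the current model, act, observe the real transition, and adapt the model parameter online. The dynamics and the update rule are illustrative, not any specific historical formulation.

```python
# Toy sketch of planning with a model that is identified online: the
# controller plans with its current estimate of the plant, acts, observes
# what actually happened, and nudges the estimate toward reality.
import numpy as np

true_gain = 0.7          # unknown plant parameter
est_gain = 0.2           # the model's initial guess
state, target, lr = 0.0, 1.0, 0.5

def plan_action(state, gain, target):
    """One-step planner: choose the action the model says reaches the target."""
    return (target - state) / max(gain, 1e-3)

for t in range(20):
    action = float(np.clip(plan_action(state, est_gain, target), -2.0, 2.0))
    next_state = state + true_gain * action       # the real system responds
    predicted = state + est_gain * action         # what the model expected
    # Normalized gradient step on the squared prediction error (online adaptation).
    est_gain += lr * (next_state - predicted) * action / (1.0 + action ** 2)
    state = next_state
    print(f"t={t:2d}  state={state:+.3f}  est_gain={est_gain:.3f}")
```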

And then the fact that you can just learn a model from data. People have been working on this with neural nets since the 1980s. And not just Hinton. It's a whole bunch of people who have been working—people who came from optimal control and realized they could use neural nets as kind of a universal function approximator and use it for control or feedback control or models for planning, blah, blah.

And like a lot of things in neural nets in the 1980s and 90s, it kind of worked, but not to the point where it took over the industry. So it's the same for computer vision, speech recognition. There were attempts at using neural nets for that back in those days. But it started really, really working well in the late 2000s, where it totally took over. And then early 2010s for vision, mid 2010s for NLP, and for robotics, it's starting.

Ravid Shwartz-Ziv: But why? Why do you think it's only now starting to take over?

Yann LeCun: Well, it's a combination of having the right state of mind about it and the right mindset, having the right architectures, the right machine learning techniques—residual connections, ReLUs, whatever—then having powerful enough computers and having access to data. And it's only when those planets are aligned that you get a breakthrough, right? Which appears like a conceptual breakthrough, but it's actually just a practical one.

Like, okay, let's talk about convolutional nets. Lots of people during the 70s—or even during the 60s, actually—had the idea of using local connections, building a neural net with local connections for extracting local features. And the idea that extracting local features is like a convolution, like in image processing, goes back to the sixties. So these are not new concepts.

The fact that you can learn adaptive filters of this type from data goes back to the perceptron and the Adaline, which is the early sixties. Okay, but that's only for one layer. Now, the concept that you can train a system with multiple layers—everybody was looking for this in the sixties. Nobody found it. A lot of people made proposals that kind of worked, but none of them was convincing enough for people to say, okay, this is a good technique.

One technique that was adopted was called polynomial classifiers. Now we would call this kernel methods. Basically, you have a handcrafted feature extractor. And then you train basically what amounts to a linear classifier on top of it. That was common practice in the 70s and certainly 80s.

But the idea that you could train a nonlinear system—a system composed of multiple nonlinear steps—using gradient descent: the basic concept for this goes back to the Kelley-Bryson algorithm in optimal control, mostly linear, from 1962. And people in optimal control wrote things about this in the 60s. But nobody realized you could use it for machine learning, to do pattern recognition or natural language processing.

That really only happened after the Rumelhart, Hinton, Williams paper in 1985, even though people had proposed the very same algorithm a few years before. Like Paul Werbos proposed what he called ordered derivatives, which turns out to be backprop, but it's the same thing as the adjoint state method in optimal control.

So these ideas—an idea or a technique gets reinvented multiple times in different fields, and only after the fact do people say, right, it's actually the same thing, and we knew about this before, we just didn't realize we could use it for this particular thing. So all those claims of plagiarism are just a complete misunderstanding of how ideas develop.

01:28:00 – Hobbies: Sailing, flying machines, astrophotography

Ravid Shwartz-Ziv: Okay, what do you do when you're not thinking about AI?

Yann LeCun: I have a whole bunch of hobbies that I have very little time to actually partake in. Sailing, so I go sailing in the summer. I like sailing multi-hull boats like trimarans and catamarans. I have a few boats.

I like building flying contraptions. So I wouldn't call them airplanes because a lot of them don't look like airplanes at all. But they fly. I like the sort of concrete creative act of that. My dad was an aerospace engineer, a mechanical engineer working in the aerospace industry. So he was building airplanes as a hobby and building his own radio control system and stuff like that. And he got me and my brother into it, my brother who works at Google, at Google Research.

Allen Roush: In France?

Yann LeCun: In Paris. And that became kind of a family activity, if you want. So my brother and I still do this.

And then in the COVID years, I picked up astrophotography. So I have a lot of telescopes and pictures of the sky.

And I build electronics. Since I was a teenager, I was interested in music. I was playing Renaissance and Baroque music, and also some type of folk music. Playing wind instruments, woodwinds. But I was also into electronic music. And my cousin, who was older than me, was an aspiring electronic musician. So we had analog synthesizers. Because I knew electronics, I would modify them for him.

I was still in high school at the time. And now in my home, I have a whole bunch of synthesizers and I build electronic musical instruments. So these are wind instruments. You blow into them, there's fingering and stuff, but what they produce is control signals for the synthesizer.

Allen Roush: Oh, it's cool. Very cool. A lot of people in tech are into sailing. I've gotten that answer a surprising amount. I'm gonna start trying to sail now.

Yann LeCun: Yeah. Okay. So I'll tell you something about sailing. It's very much like the world model story. To be able to control the sailboat properly to make it go as fast as possible and everything, you have to anticipate a lot of things. You have to anticipate the motion of the waves, how the waves are going to affect your boat. Whether a gust of wind is going to come and you have to—the boat is going to start heeling and things like that.

And you basically have to run CFD in your head. Because you have to figure out the fluid dynamics, you have to figure out what is the flow of air around the sails. And you know that if the angle of attack is too high, it's going to be turbulent on the back and the lift is going to be much lower. So tuning sails basically requires running CFD in your head, but at an abstract level. You're not solving Navier-Stokes, right? We have really good intuitive models.

So that's what I like about it. The whole thing that you have to build this mental predictive model of the world to be able to do a good job.

Ravid Shwartz-Ziv: The question is how many samples you need?

Yann LeCun: Yeah, probably a lot. But you get to run it in a few years of practice.

01:32:00 – French identity and family

Ravid Shwartz-Ziv: Okay. You're French and you lived in the US for many decades already. Do you still feel French? Does that perspective shape your view of the world, the American tech culture?

Yann LeCun: Well, inevitably, yeah. I mean, you can't completely escape your upbringing and your culture. So I feel both French and American in the sense that I've been in the US for 37 years and North America for 38, because I was in Canada before. My children grew up in the US. And so from that point of view, I'm American. But I have a view certainly on various aspects of science and society that probably are a consequence of growing up in France. Absolutely. And I feel French when I'm in France.

Allen Roush: I'm curious—I did not actually realize that you had a brother who also works in tech. I'm fascinated by this, because Yoshua Bengio's brother also works in tech, and I always thought that was the only Serena Williams situation in AI. But you also have a brother. So how many more AI researchers are out there—is it that common that it just runs in families?

Yann LeCun: I don't know. But certainly the combination of scientific upbringing and creative hobbies, yeah, that's what my family had. My father was an engineer. My sister is not in tech, but she's also a professor. My brother was a professor before he moved to Google. He doesn't work on AI or machine learning—he's very careful not to. He's my younger brother, six years younger than me. And he works on operations research and optimization, essentially. Now that's actually also being invaded by machine learning.

01:34:00 – The dream: What if world models work?

Ravid Shwartz-Ziv: Okay, one more question. If world models work, 20 years from now, what is the dream? What does it look like? What will our lives be like?

Yann LeCun: Total world domination. Okay, no, it's a joke. I say that because it's what Linus Torvalds used to say when people asked him, what's your goal with Linux? And that was super funny. And it actually succeeded. I mean, to a first approximation, every computer in the world runs Linux. There are only a few desktops that don't, and a few iPhones, but everything else runs Linux.

So really pushing towards a recipe for training and building intelligent systems, perhaps all the way to human intelligence or more. And basically building AI systems that would help people and humanity more generally in their daily lives at all times, amplifying human intelligence.

We'll be their boss, right? It's not like those things are going to dominate us. Because again, it's not because something is intelligent that it wants to dominate. Those are two different things.

As humans, we are hardwired to want to influence other people. Sometimes it's through domination, sometimes it's through prestige. But we're hardwired by evolution to do this because we are a social species. There's no reason we would build those kinds of drives into more intelligent systems. And it's not like they're going to develop those kinds of drives by themselves.

Yeah, I'm quite optimistic.

Ravid Shwartz-Ziv: Me too.

01:35:59 – Advice for young researchers

Ravid Shwartz-Ziv: All right. Okay, so we have final questions from the audience. If you were starting your AI career today, what skills and research directions would you focus on?

Yann LeCun: I get this question a lot from young students or parents of future students. I think you should learn things that have a long shelf life and you should learn things that help you learn to learn, because technology is evolving so quickly that you want the ability to learn really quickly.

And that is done by learning things that have a long shelf life. So in the context of STEM—science, technology, engineering, mathematics—I'm not talking about humanities here. Although you should learn philosophy.

Again, this is done by learning things that have a long shelf life. So the joke I make is that, first of all, the things that have a long shelf life tend not to be computer science. Okay, so here's a computer science professor arguing against studying computer science. Don't come, don't come to study it.

And I have a terrible confession to make, which is that I studied electrical engineering as an undergrad, so I'm not a real computer scientist.

But what you should do is learn basic things in mathematics—modeling, mathematics that can be connected with reality. You tend to learn this kind of stuff in engineering; in some schools it's linked with computer science. But in electrical engineering, mechanical engineering, the engineering disciplines, when you learn calculus one, two, three in the US, that gives you a good basis, right? In computer science, you can get away with just calculus one. That's not enough, right?

You learn probability theory and linear algebra, all this stuff that is really basic. And then if you do electrical engineering, things like control theory, signal processing, optimization—all of those methods are really useful for things like AI.

And then you can basically learn similar things in physics, because physics is all about—what should I represent about reality to be able to make predictive models, right? And that's really what intelligence is about.

So I think you can learn most of what you need to learn also if you go through a physics curriculum. But obviously, you need to learn enough computer science to program and use computers. And even though AI is going to help you be more efficient at programming, you still need to know how to do this.

Ravid Shwartz-Ziv: What do you think about vibe coding?

Yann LeCun: I mean, it's cool. It's going to cause a funny kind of thing where a lot of the code that will be written will be used only once, right? Because it's going to become so cheap to write code. You're going to ask your AI assistant, like, produce this graph or this research, blah, blah, blah. And it's going to write a little piece of code to do this. Or maybe it's an applet that you need to play with, a little simulator. And you're going to use it once and throw it away because it's so cheap to produce.

So the idea that we're not going to need programmers anymore is false. The cost of generating software has been going down continuously for decades and that's just the next step of the cost going down. But it doesn't mean computers are going to be less useful. They're going to be more useful.

01:40:00 – Neuroscience and machine learning

Ravid Shwartz-Ziv: Okay, one more question. What do you think about the connection between neuroscience and machine learning? A lot of the time there are ideas that AI borrows from neuroscience, and the other way around—predictive coding, for example. Do you think it's useful to use ideas from...

Yann LeCun: Well, there's a lot of inspiration you can get from neuroscience, from biology in general, but neuroscience in particular. I certainly was very influenced by classic work in neuroscience—Hubel and Wiesel's work on the architecture of the visual cortex, for example, is basically what led to convolutional nets.

And I wasn't the first one to use those ideas in artificial neural nets. There were people in the 60s trying to do this. There were people in the 80s building locally connected networks with multiple layers. They didn't have ways to train them with backprop.

There was the Neocognitron from Fukushima, which had a lot of the ingredients, just not a proper learning algorithm. And there was another aspect of the Neocognitron, which is that it was really meant to be a model of the visual cortex, so it tried to reproduce every quirk of biology.

For example, the fact that in the brain, you don't have positive and negative weights, you have positive and negative neurons. So all the synapses coming out of inhibitory neurons have negative weights. And all the synapses coming out of non-inhibitory neurons have positive weights. So Fukushima implemented this in his model.

He also took into account the fact that neurons spike. He didn't have a spiking neuron model, but you cannot have a negative number of spikes, so his transfer function was basically a rectification, like a ReLU, except it had a saturation.

And then he knew from Hubel and Wiesel's work that there was some sort of normalization, and he had to use this because otherwise—there was no backprop—the activations in his network would go haywire. So he had to do divisive normalization, which turns out to actually correspond to some theoretical models of the visual cortex that some of our colleagues at the Center for Neural Science have been pushing, like David Heeger and people like that.
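A rough sketch of the two mechanisms mentioned, in a common textbook form rather than Fukushima's exact equations: a saturating rectification and divisive normalization applied to a vector of responses.

```python
# Rough sketch of a saturating rectification (ReLU-like but capped) and
# divisive normalization, shown in a common textbook form; Fukushima's and
# Heeger's exact formulations differ in the details.
import numpy as np

def saturating_relu(x, cap=1.0):
    """Rectify like a ReLU, but saturate at `cap` (a firing rate cannot be
    negative or unboundedly large)."""
    return np.clip(x, 0.0, cap)

def divisive_normalization(responses, sigma=0.1):
    """Each unit's response is divided by the pooled activity of the
    population, which keeps the overall activation from blowing up."""
    pooled = sigma ** 2 + np.sum(responses ** 2)
    return responses ** 2 / pooled

r = saturating_relu(np.array([-0.3, 0.2, 0.8, 2.5]))
print(divisive_normalization(r))
```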

Yeah, I mean, I think neuroscience is a source of inspiration. More recently, the sort of macro architecture of the brain in terms of perhaps the world model and planning and things like this—how is that reproduced? Why do we have a module in the brain for factual memory, the hippocampus, right? And we see this now in certain neural net architectures that there is a separate memory module, right? Maybe it's a good idea.

I think what's going to happen is that we're going to come up with new AI neural net architectures, deep learning architectures, and a posteriori, you will discover that the characteristics that we implemented in them actually exist in the brain.

And in fact, that's a lot of what's happening now in neuroscience, which is that there is a lot of feedback now from AI to neuroscience, where the best models of human perception are basically deep learning models today.

01:48:03 – Continual learning

Allen Roush: Do you think that the field will ever figure out continual or incremental learning?

Yann LeCun: Sure. Yeah, that's sort of a technical problem.

Allen Roush: Well, I thought catastrophic forgetting was the problem—the weights that you spent so much money training get overwritten.

Yann LeCun: Sure, so you train just a little bit of it. I mean, we've always done this with SSL, right? We train a foundation model—for video, something like V-JEPA—and it produces really good representations of video. And then if you want to train the system for a particular task, you train a small head on top of it. And that head can be, you know, trained continuously. And even your world model can be trained continuously. That's not an issue. I don't see this as a huge challenge.

In fact, Raia Hadsell and I and a few of our colleagues, back in 2005, 2006, built a learning-based navigation system for mobile robots that had this kind of idea. It was a convolutional net doing semantic segmentation from camera images, and on the fly, the top layers of that network would be adapted to the current environment so it would do a good job. And the labels came from short-range traversability indicated by stereo vision, essentially.

So yeah, I mean, you can do this. It's particularly easy if you have multi-modal data. I don't see this as a big challenge.
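A minimal sketch of the pattern described here, assuming a randomly initialized stand-in for a pretrained backbone rather than an actual V-JEPA checkpoint: the backbone is frozen, and only a small head is updated as new batches stream in, so the expensively trained representation cannot be overwritten.

```python
# Sketch of continual learning with a frozen (SSL-pretrained) backbone and a
# small head that keeps training on streaming data. The backbone here is a
# random stand-in, not a real pretrained checkpoint.
import torch
import torch.nn as nn

feat_dim, n_classes = 64, 5

backbone = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())  # pretend this is pretrained
for p in backbone.parameters():
    p.requires_grad = False   # frozen: protected from catastrophic forgetting

head = nn.Linear(feat_dim, n_classes)            # small, cheap to adapt
opt = torch.optim.SGD(head.parameters(), lr=1e-2)

def online_update(x, y):
    """One continual-learning step on a freshly observed (x, y) batch."""
    with torch.no_grad():
        z = backbone(x)                          # representations stay fixed
    loss = nn.functional.cross_entropy(head(z), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Simulated stream: each incoming batch updates only the head.
for t in range(50):
    x = torch.randn(16, 128)
    y = torch.randint(0, n_classes, (16,))
    online_update(x, y)
```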

01:49:32 – Closing

Allen Roush: It's been a pleasure to have you.

Yann LeCun: All right, it was a real pleasure. Thank you so much.

Ravid Shwartz-Ziv: Thank you.