The Information Bottleneck

Editing a Compressed Memory

Ravid Shwartz Ziv — Mon, 29 Jun 2026 18:48:56 GMT

Written with help from Claude for drafting, editing, and figures. All the mistakes are its.

A Transformer remembers by keeping everything. Every token it has read stays in the KV cache, and any later token can look back at any earlier one exactly. That is why attention is so good at recall, and why its memory and compute grow with the length of the context.

Linear attention makes the opposite bet. It throws the cache away and keeps a single fixed-size matrix: a running summary that every new token updates and every query reads. Memory stops growing and decoding gets cheap. But a fixed-size summary cannot hold an unbounded number of facts cleanly, so writing something new can disturb what is already stored. Almost all the recent progress here (DeltaNet, Gated DeltaNet, KDA, and now Gated DeltaNet-2) is about making that write more surgical.

This post builds the whole thing from the ground up. You do not need to know any of these models going in, just linear algebra and a rough sense of what attention does.

Shape of the argument

A fixed-size state is an associative memory built by summing key–value outer products, and reading it is content-addressed lookup.
Because it is fixed-size, overlapping keys interfere. That is the one limitation everything else fights.
Giving an old key a new value is the hard case. Adding leaves the stale value behind; replacing the matrix destroys every other fact; the delta rule does the surgical thing.
The delta rule looks sequential but trains in parallel as one small triangular solve per chunk.
Decay, then per-channel decay (KDA), then decoupled erase/write gates (GDN-2) are three refinements that keep that solve intact.

Which memory we mean

“Memory” means three different things in a language model. This post is about one of them.

The weights. The query/key/value projection matrices and the gates, learned during training and fixed afterward. Long-term knowledge, changed only by more training. Not this.
The KV cache (softmax). The full list of past keys and values, so any query can look back exactly. Lossless, grows with context, reset each sequence. Linear attention removes this.
The recurrent state (linear attention). One fixed-size matrix summarizing every token so far. Lossy, fixed size, reset each sequence. This is the memory we mean.

So this is in-context memory: holding the current input within a single forward pass, so token 5,000 can use what token 3 said. New prompt, empty state, nothing saved. It is not retrieval/RAG, not continual learning, not “remembering you across sessions”; it is the job plain attention does with its KV cache, just compressed into a fixed matrix instead of a growing list.

Where the state comes from

Each token has a representation, and three fixed learned matrices turn it into a query, a key, and a value:

The key is a token’s address, what it is about; the value is the content stored there; the query is what the current token is asking for, matched against the keys to decide what to pull out. A token writes itself in as a key–value pair and later reads with a query. The vectors depend on the input, but the three projection matrices are fixed weights, shared across every position and sequence.

Ordinary attention computes each output as a softmax-weighted sum over the past:

The exponential is what forces the cache. The score exp(query · key) does not split into a part that depends only on the query times a part that depends only on the key, so the weight on each value is tied to that specific key, and the normalizer sums over every past key. There is no running summary you can keep instead: you have to store every key–value pair and revisit them for each new query. That is the KV cache: memory grows linearly with sequence length, and producing all outputs scales quadratically with it.

Linear attention drops the softmax and uses a score that factorizes, in the simplest case just the dot product of query and key. Once it factorizes, the sum rearranges:

All of history collapses into one matrix of fixed size (key-dimension by value-dimension), and the query reads it in a single multiply. Memory no longer grows with context and the per-token cost is constant. This fixed state is the object the rest of this post is about; it exists precisely because the softmax is gone.
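A minimal NumPy sketch of that rearrangement, assuming the simplest score (a plain dot product, no normalizer) and random vectors standing in for the learned projections. The names `K`, `V`, `q`, and `S` are illustrative choices, not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 6, 4, 3

K = rng.standard_normal((n_tokens, d_k))  # one key per past token
V = rng.standard_normal((n_tokens, d_v))  # one value per past token
q = rng.standard_normal(d_k)              # the current query

# Cache form: score every stored key against the query, then mix the values.
scores = K @ q                    # shape (n_tokens,)
out_cache = scores @ V            # sum_i (q . k_i) v_i

# State form: fold history into one d_k x d_v matrix, one outer product per token.
S = np.zeros((d_k, d_v))
for k_i, v_i in zip(K, V):
    S += np.outer(k_i, v_i)
out_state = q @ S                 # a single multiply, independent of n_tokens

print(np.allclose(out_cache, out_state))
```

The two reads agree because the dot-product score is linear in the key, so the sum over tokens can be taken before the query arrives. With an exponential in the score, that reordering is not available.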

Two ways to remember a sequence. Softmax keeps a growing KV cache, a key–value row per token; linear attention keeps one fixed-size matrix that every token writes into. The cache scales with length; the matrix does not.

That fixed size is the appeal and the problem at once. Packing an unbounded history into one matrix is cheap, but it means many facts share the same finite space. The next section shows what that does to a read.

Why a fixed-size memory interferes

We have the fixed-size state. Before writing into it, look at what reading it gives you. Reading is applying a query to the memory:

Each stored value is weighted by how aligned its key is with the query, the dot product of the two. That is content-addressed recall: values whose keys match what you asked for, weighted by the match.

To expose the problem, take the cleanest possible query, one that exactly equals a key you already stored. This is the case that should return its value perfectly, so any mess is the memory’s fault, not a mismatched query. Splitting off that term:

If the stored keys were orthonormal, every cross term would be zero and the read would be clean. They are not. Each nonzero overlap leaks a fraction of some other value into the answer. And here is the structural reason they cannot all be orthogonal: the state is a single matrix, so the key space has only as many dimensions as the key vector is wide, and a space of that dimension holds at most that many mutually orthogonal directions. Store more associations than that and some keys must share directions; even below the limit, random unit keys have small but nonzero overlaps that add up.
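In assumed notation (state $S = \sum_j k_j v_j^\top$, query equal to a stored key $k_i$; the symbols are a reconstruction, not the post's own), the split into a wanted term and cross terms can be written as:

```latex
S^\top k_i \;=\; \sum_j (k_j^\top k_i)\, v_j
\;=\; \underbrace{\lVert k_i \rVert^2\, v_i}_{\text{wanted}}
\;+\; \underbrace{\sum_{j \neq i} (k_j^\top k_i)\, v_j}_{\text{leakage}}
```

With unit-norm keys the first term is exactly $v_i$, and the second vanishes only if every overlap $k_j^\top k_i$ with $j \neq i$ is zero.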

As the context carries more associations, the term you want stays about the same size while the leakage is a sum over everything else, so it grows. Signal-to-noise falls with context length: a long document forces many distinct facts to share one fixed box and they smear together. That is why this whole family struggles on long, many-needle retrieval, and why the improvements below all aim at that pressure point.
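The growth of the leakage can be seen in a small experiment. This is a sketch under stated assumptions: keys are random unit vectors and values are random Gaussians (real learned keys will behave differently in detail), and `leak_ratio` is a hypothetical helper written for this illustration.

```python
import numpy as np

def leak_ratio(n_facts, d_k=64, d_v=64, seed=0):
    """Store n_facts random associations, read one back, return |leak| / |wanted|."""
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((n_facts, d_k))
    K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys
    V = rng.standard_normal((n_facts, d_v))
    S = K.T @ V                                     # sum of outer products k_i v_i^T
    read = K[0] @ S                                 # query with a stored key
    leak = read - V[0]                              # unit key, so the wanted term is V[0]
    return np.linalg.norm(leak) / np.linalg.norm(V[0])

for n in (8, 64, 512):
    print(n, round(leak_ratio(n), 2))
```

The state stays 64 by 64 throughout; only the number of facts packed into it changes, and the ratio of leak to signal rises with it.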

Reading a stored key returns its value plus a small leak from every other key. The wanted term stays the same size while the leakage is a sum over everything else, so it grows as more facts share the fixed state.

Why softmax doesn’t have this problem

The leakage is not caused by folding the sum into the state matrix. The summed form and the matrix form are the same number; folding only fixes the size and the cost, not the value. The leakage is already there in the raw dot-product score.

Softmax runs those same dot products through an exponential and normalizes. The exponential sharpens them: the matching key saturates near one and the mismatched keys are crushed toward zero, so the wrong values effectively drop out of the read even when the keys overlap. Same overlaps, clean answer.

But that is exactly the property that cannot be summarized. The exponential of a dot product does not split into a query part times a key part, so there is nothing to precompute: you are forced to keep every key and recompute the exponential against each one, which is the growing cache. So it is an either/or: a sharp score reads cleanly but cannot be folded into a fixed state, while a foldable score gives the fixed state but leaks. Interference is not the cost of compressing; it is the cost of using a score weak enough to be compressible.

Updating a value when a key comes back

As the model reads a sequence, a later token sometimes produces a key close to one an earlier token already wrote, but carrying a different value. The state already holds a binding in that direction, and the new value should take its place. This is the update case, and it is the one ordinary outer-product memory gets wrong.

For example, a passage sets x = 5 and later sets x = 7. Both tokens produce nearly the same key (the direction standing for “the value of x”), but with different values. When a later token reads x, the answer should be 7. Plain addition cannot give that: it never removed the old binding, so the slot holds 5 and 7 at once and the read returns a blend. The same shape shows up whenever a key recurs with a new value: an entity whose state changes (“Alice is in Paris… now Tokyo”), a correction (“blue… actually green”), a form field revised.

Two clarifications, since “update” can mislead. The prompt itself is fixed; the forward pass only reads it left to right, and “update” means a later position’s binding should win over an earlier one. “x = 5” stays in the text; it just should not win the read. And keys are not matched by name: two tokens are “the same key” when their key vectors point in roughly the same direction, so their writes land on the same spot in the state. A repeated mention produces a nearby key, the later write hits that slot, and the read afterward should reflect the new value.

So every write is one of two cases. Add: the key points somewhere new, a fresh fact, which is most tokens; plain accumulation is fine, and that is what vanilla linear attention does. Overwrite: the key lands on a direction already in the state, and the slot has to be updated to the new value, not stacked on top of the old one.

Why not just replace the whole matrix?

Because the state is shared by every association at once. Three ways to write an update to one key, into a memory that also holds a second fact:

Replace the whole matrix with the new key–value outer product: fixes the target key perfectly and deletes everyone else. Read the second key afterward and you get near zero. You wanted to change one slot and you erased the notebook.
Add the new outer product: keeps the second fact, but leaves the old binding in place, so reading the target key returns old-plus-new, the stale value smeared into the fresh one.
Delta: read what the key currently points to, subtract just that, then write the new value. Only the target slot changes; the other fact is untouched.

Add keeps the other fact but smears the target. Replace fixes the target but wipes the other fact. Only the third (read, subtract, write) gets both right. That third option is the delta rule.

Updating one key three ways. Add leaves the old value smeared into the new one; replacing the whole matrix fixes the target but destroys every other fact; the delta rule edits only the target slot and leaves the rest intact.
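The three options can be checked on a toy memory. This sketch assumes two exactly orthonormal keys and one-dimensional values so the numbers are easy to follow; the variable names are illustrative.

```python
import numpy as np

k_x = np.array([1.0, 0.0])        # key for "the value of x"
k_y = np.array([0.0, 1.0])        # key for an unrelated fact
v_old, v_new = np.array([5.0]), np.array([7.0])
v_y = np.array([3.0])

# Memory already holds x -> 5 and y -> 3.
S = np.outer(k_x, v_old) + np.outer(k_y, v_y)

replaced = np.outer(k_x, v_new)                    # overwrite the whole matrix
added = S + np.outer(k_x, v_new)                   # stack the new binding on top
delta = S + np.outer(k_x, v_new - k_x @ S)         # read, subtract, write

for name, M in [("replace", replaced), ("add", added), ("delta", delta)]:
    print(name, "x ->", k_x @ M, "y ->", k_y @ M)
```

Replace reads x as 7 but y as 0, add reads x as 12 (old plus new) and y as 3, and the delta edit reads x as 7 and y as 3.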

The delta rule

The update that does this is the delta rule (Widrow & Hoff, 1960), used for linear attention in DeltaNet (Yang et al., 2024). It writes the new value relative to what is already stored, not absolutely. First read what the memory currently returns for the key:

This is whatever sits in that key’s direction right now. We never have to know in advance whether the key was used before; we just read it back. Then move the slot from that old value toward the target, by a fraction β (the write strength):

The bracket is the gap between the new value and the old one, and adding it back pushes the value stored at that key toward the target. One update covers both cases with no branching: if the key points somewhere new, the read is about zero, the gap is just the new value, and it reduces to a plain add; if the key lands on a direction that already holds a value, the read returns that old value, and the update subtracts it and writes the new value in its place. The memory tells the rule which case it is in.

Multiplying the correction out shows what it does to the whole state:

The second term writes the new value along the key. The first term removes a β fraction of whatever the state held along that key, and only along that key: the projection onto the key direction leaves everything orthogonal to it untouched. That is exactly why, in the three-way comparison above, replacing the whole matrix wiped the bystander but the delta update did not; it only edits that one key’s line of the state.
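Written out in assumed notation (state $S_{t-1}$ of shape key-dimension by value-dimension, column vectors $k_t$ and $v_t$, a unit-norm key, and write strength $\beta_t$; the symbols are a reconstruction), the read, the update, and its multiplied-out form can be sketched as:

```latex
\begin{aligned}
v_t^{\text{old}} &= S_{t-1}^\top k_t \\
S_t &= S_{t-1} + \beta_t\, k_t \left( v_t - v_t^{\text{old}} \right)^\top \\
    &= \left( I - \beta_t\, k_t k_t^\top \right) S_{t-1} + \beta_t\, k_t v_t^\top
\end{aligned}
```

The last line is the operator form: the factor $I - \beta_t k_t k_t^\top$ is the erase term and $\beta_t k_t v_t^\top$ is the write term.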

Why β is not just 1

A write strength of 1 is a hard overwrite: erase the old binding completely, write the new value. So why not use it everywhere? β is produced per token by the model, and two things argue against pinning it to 1. Real keys are not exactly orthogonal, so erasing hard along one key also disturbs neighbors that partly share its direction, and a smaller β makes a gentler edit with less collateral damage. And not every write should fully replace: sometimes the right move is to nudge a value, accumulate evidence, or write weakly under uncertainty. So β between 0 and 1 is a dial: 1 overwrites, 0 leaves the slot alone, in between is a partial move. (In the online-learning view it is a per-step learning rate, and a rate of 1 everywhere is rarely what you want.)

One caveat: the clean overwrite is exact only when the key is orthogonal to the others. When keys overlap, editing along one drags on whatever shares its direction, the same interference from before, now showing up in the write.
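A short sketch of the write strength as a dial, and of the overlap caveat. It assumes unit-norm keys and one-dimensional values; `delta_write` is a hypothetical helper for this illustration, not a library function.

```python
import numpy as np

def delta_write(S, k, v, beta):
    """One delta-rule step: move the value stored along k toward v by a fraction beta."""
    old = k @ S                              # what the memory currently returns for k
    return S + beta * np.outer(k, v - old)

k_x = np.array([1.0, 0.0])
k_y = np.array([0.0, 1.0])
S = np.outer(k_x, [5.0]) + np.outer(k_y, [3.0])     # x -> 5, y -> 3

# The write strength as a dial: 0 leaves the slot alone, 1 overwrites.
for beta in (0.0, 0.5, 1.0):
    print(beta, k_x @ delta_write(S, k_x, np.array([7.0]), beta))

# Same result as the operator form (I - beta k k^T) S + beta k v^T.
beta, v = 0.5, np.array([7.0])
operator_form = (np.eye(2) - beta * np.outer(k_x, k_x)) @ S + beta * np.outer(k_x, v)
print(np.allclose(delta_write(S, k_x, v, beta), operator_form))

# Overlapping keys: a hard overwrite along k_z also shifts what k_y reads.
k_z = np.array([0.6, 0.8])                   # unit norm, overlaps both k_x and k_y
S_after = delta_write(S, k_z, np.array([7.0]), 1.0)
print("z ->", k_z @ S_after, "y before", k_y @ S, "y after", k_y @ S_after)
```

The read along `k_z` lands on the target, but the value read along `k_y` moves from 3 to about 4.28 even though nothing was written to it directly.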

Training it in parallel

Training needs every output over the whole sequence at once, then a gradient. Plain linear attention gives them cheaply because the state is a running sum, so the outputs collapse into two matrix multiplies (with a causal mask zeroing the future):

Why this is fast: it is all dense matmuls, and a GPU runs a matmul as thousands of multiply-adds in parallel on its tensor cores, every output position at the same time. Nothing waits for anything else.
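The equivalence of the token-by-token form and the two-matmul form can be checked directly. A minimal sketch, assuming unnormalized dot-product linear attention and random inputs; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_k, d_v = 8, 4, 3
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
V = rng.standard_normal((L, d_v))

# Recurrent form: one state update and one read per token.
S = np.zeros((d_k, d_v))
O_recurrent = np.zeros((L, d_v))
for t in range(L):
    S += np.outer(K[t], V[t])
    O_recurrent[t] = Q[t] @ S

# Parallel form: two matmuls with a causal mask in between.
scores = np.tril(Q @ K.T)         # zero out pairs where the key is in the future
O_parallel = scores @ V

print(np.allclose(O_recurrent, O_parallel))
```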

The delta rule breaks this. Its erase factor (the one from the operator form above) makes each state genuinely depend on the previous one, so you cannot write the answer as one sum of independent terms. Done literally you process tokens one at a time, each a tiny rank-one update that uses a sliver of the GPU while the rest sits idle.

DeltaNet’s contribution (Yang et al., 2024) was to recover the matmul form by working in chunks. A chunk is a contiguous block of C tokens; a length-L sequence is split into L/C of them. The expensive work happens inside a chunk, all as matmuls, and only a small summary state is passed from one chunk to the next.

The trick: solve for the values that were actually written

Every step adds a rank-one term whose left factor is a key, so the state is always the start state plus one such term per token:

The written value here is not the raw value, but the correction from the delta rule (target minus old value, scaled by β). The keys are known; these written values are the unknowns. The point is that if we can get all of them in a chunk at once, with a single matrix solve instead of a token-by-token walk, the whole chunk becomes parallel matmuls. So we solve for them jointly.

The written value at each step depends on the current read, and that read expands into known keys and earlier written values:

In words: reading a key against the state-so-far is the start-state read, plus every earlier written value weighted by how much its key overlaps the current one. Substituting gives a relation among the written values alone:

Each written value depends only on earlier ones, which makes this a triangular system. Stack the written values into a matrix, collect the pairwise key overlaps into a matrix T, and the whole set of equations becomes a single solve:

Because the matrix being inverted is unit lower-triangular, the inverse is one forward substitution on a small C-by-C matrix. Everything else is dense matmuls: build the overlap matrix from pairwise key dot products, then form the carried state and the outputs:

The sequential token loop is gone, replaced by matmuls plus one small triangular solve. Writing a product of rank-one factors as a single low-rank update this way is a classical move from numerical linear algebra, the WY representation (Bischof & Van Loan, 1985) and its UT-transform variant (Joffrain et al., 2006); DeltaNet borrows it to collapse the chunk into matrix operations.

Training a chunk in parallel. Inside a chunk everything is dense matmuls plus one small triangular solve; only the carried state passes to the next chunk, the single sequential step.
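The chunked solve can be sketched in NumPy and compared against the token-by-token recurrence. This is an illustration under stated assumptions, not the DeltaNet kernel: the state has shape key-dimension by value-dimension, each output reads the state after its own write, `T` here names the strictly lower-triangular key overlaps scaled per row by β, and a general dense solver stands in for forward substitution (a triangular routine such as `scipy.linalg.solve_triangular` would exploit the structure).

```python
import numpy as np

def delta_sequential(Q, K, V, beta):
    """Reference: token-by-token delta rule, reading after each write."""
    S = np.zeros((K.shape[1], V.shape[1]))
    O = np.zeros_like(V)
    for t in range(len(K)):
        S = S + beta[t] * np.outer(K[t], V[t] - K[t] @ S)
        O[t] = Q[t] @ S
    return O, S

def delta_chunked(Q, K, V, beta, chunk=4):
    """Same outputs, but each chunk is matmuls plus one triangular solve."""
    S = np.zeros((K.shape[1], V.shape[1]))
    O = np.zeros_like(V)
    for s in range(0, len(K), chunk):
        q, k, v, b = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk], beta[s:s+chunk]
        c = len(k)
        # Strictly lower-triangular key overlaps, scaled per row by beta.
        T = b[:, None] * np.tril(k @ k.T, -1)
        rhs = b[:, None] * (v - k @ S)           # beta * (target - start-state read)
        # (I + T) U = rhs with a unit lower-triangular matrix on the left.
        U = np.linalg.solve(np.eye(c) + T, rhs)  # rows of U are the written values
        O[s:s+chunk] = q @ S + np.tril(q @ k.T) @ U
        S = S + k.T @ U                          # the only sequential handoff
    return O, S

rng = np.random.default_rng(0)
L, d_k, d_v = 16, 8, 5
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)    # unit-norm keys
V = rng.standard_normal((L, d_v))
beta = rng.uniform(0, 1, L)

O_seq, S_seq = delta_sequential(Q, K, V, beta)
O_chk, S_chk = delta_chunked(Q, K, V, beta, chunk=4)
print(np.allclose(O_seq, O_chk), np.allclose(S_seq, S_chk))
```

The outer loop in `delta_chunked` runs once per chunk, not once per token, and everything inside it is a dense matrix operation on C rows at a time.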

Why the chunk size is small

The chunk size sets how often the state is handed off: there are L/C chunks, so that many sequential state updates. The two extremes make this concrete. A chunk of one token is the original fully sequential recurrence. A single chunk covering the whole sequence is one handoff, done in one parallel block. (This is the opposite of what it might sound like: a bigger chunk means fewer, larger steps, not more.)

So why not use one giant chunk and be fully parallel? Because the chunk builds and solves a C-by-C matrix, so its cost and memory grow quadratically in the chunk size. At the full length you are back to the quadratic cost of full attention, and the matrix no longer fits in the fast on-chip memory the matmul engine reads from. Too small, and you pay too many sequential steps and underfill each matmul. The kernels use a chunk size of 64.

Adding decay: Gated DeltaNet

Everything up to here is DeltaNet: a fixed-size associative memory, edited by the delta rule, trained in parallel. The last three sections are refinements, each adding expressive power with a small change that leaves the chunk algorithm intact.

The delta rule overwrites one slot at a time but cannot let old context fade on its own. Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2025, arXiv:2412.06464) multiplies the state by a scalar decay before each edit:

Tracking the cumulative product of the decays, an earlier write contributes to a later read scaled by how much decay has accumulated in between. In the chunk algorithm this is just a per-row reweighting of the same matrices plus an extra factor in the causal mask; the triangular solve is unchanged. Decay is close to free to add. What it cannot do is forget different features at different rates, since it is one number.
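In the same assumed notation as before (state of shape key-dimension by value-dimension, scalar decay $\alpha_t$ assumed to lie in $(0, 1]$, cumulative decay $\gamma_t$; the symbols are a reconstruction), the decayed recurrence and the accumulated scaling can be sketched as:

```latex
\begin{aligned}
S_t &= \alpha_t \left( I - \beta_t\, k_t k_t^\top \right) S_{t-1} + \beta_t\, k_t v_t^\top \\
\gamma_t &= \prod_{i \le t} \alpha_i,
\qquad \text{a write at step } j \text{ reaches a read at step } t \ge j
\text{ scaled by } \gamma_t / \gamma_j
\end{aligned}
```

Because $\gamma_t / \gamma_j$ splits into a factor that depends only on $t$ and one that depends only on $j$, it can be absorbed as a row and column reweighting of the chunk matrices.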

Decay per channel: KDA

KDA, the linear-attention layer in Kimi Linear (Kimi Team, 2025, arXiv:2510.26692), replaces that single decay with a per-channel decay vector, a different forget rate for every key channel:

Now every channel is scaled differently at every step, which looks like it should break the chunk form. It does not, because of a change of variables: factor the cumulative per-channel decay out of the state, and it cancels from the recurrence, leaving a plain delta product in reweighted key and erase factors:

After this substitution the chunk equations have the same shape as before; only the entries carry the decay factors. KDA buys richer forgetting at no structural cost.
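One step of the decayed update, covering both the scalar and the per-channel case, can be sketched as follows. This is an illustration under assumptions: the state has shape key-dimension by value-dimension, so a per-channel decay scales its rows, and the decay values are fixed by hand here instead of being produced by a learned projection. `decayed_delta_step` is a hypothetical helper.

```python
import numpy as np

def decayed_delta_step(S, k, v, beta, decay):
    """Decay the state, then apply one delta-rule edit.

    decay is a scalar (one forget rate for the whole state) or a length-d_k
    vector (one forget rate per key channel, i.e. per row of S).
    """
    decay = np.broadcast_to(np.asarray(decay, dtype=float), k.shape)
    S = decay[:, None] * S                       # fade old content first
    return S + beta * np.outer(k, v - k @ S)     # then edit along k

e0, e1, e2 = np.eye(3)                           # three orthonormal keys
S = np.outer(e0, [4.0]) + np.outer(e1, [4.0])    # two stored facts, both with value 4

new_value = np.array([1.0])
scalar = decayed_delta_step(S, e2, new_value, beta=1.0, decay=0.5)
per_channel = decayed_delta_step(S, e2, new_value, beta=1.0,
                                 decay=np.array([1.0, 0.5, 1.0]))

print("scalar     :", e0 @ scalar, e1 @ scalar, e2 @ scalar)
print("per channel:", e0 @ per_channel, e1 @ per_channel, e2 @ per_channel)
```

With one scalar, both old facts fade together; with a per-channel vector, the fact on the first channel is kept at full strength while the one on the second channel fades.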

The per-channel rates are not hand-set hyperparameters; they are learned and data-dependent, produced from each token by a small projection (the Gated DeltaNet parameterization, a softplus of a learned linear map passed through an exponential). A per-head term and a per-channel bias set each channel’s baseline forget rate, and the per-token projection pushes that rate up or down, so the model learns both the typical decay profile and how to modulate it on the fly. The active edit, though, is still a single write-strength scalar, which scales both the erase and the write at once.

Splitting the edit: Gated DeltaNet-2

Erasing acts on the key side: which coordinates of the old read to remove. Writing acts on the value side: which coordinates of the new value to keep. These are different axes of the state, so GDN-2 (Hatamizadeh, Choi & Kautz, 2026, arXiv:2605.22791) gives each its own channel-wise gate, an erase gate on the key and a write gate on the value:

Compared with KDA, the write direction is unchanged (the left factor is still the key), but the read it subtracts is now channel-selected by the erase gate, and the value it writes is channel-selected by the write gate.

Forward: same machine

Run the same change of variables, now folding the erase gate into the reweighted factor, and the recurrence is again a plain (now asymmetric) delta product. The chunk pipeline keeps the same shape: the same overlap matrix, the same triangular inverse, the same state and output equations. The only difference is what fills them: the erase gate enters the key-side rows, the write gate the value-side rows, and the overlap matrix is now built from an asymmetric pair rather than a symmetric one.

Backward: one real difference

Training propagates a loss gradient back through the chunk. Write the solve as the triangular inverse applied to the written values. Backprop needs the gradient with respect to that inverse, which accumulates as a product of the incoming gradient with the written values:

The whole question is whether the gate can be pulled out of that inner product. In KDA the written value is a scalar times the value, so it slides straight out, and you can compute the gate-free products once as a matmul and scale afterward:

In GDN-2 the written value is a per-channel product, so the gate sits inside the sum over channels and there is nothing to pull out:

No single number multiplies the whole inner product; the gate reweights each channel before it is summed, so no row or column scaling recovers it from the gate-free version. The erase side has the same issue. The gate therefore has to be folded into the matmul itself, not applied as a scaling after. The forward pass is essentially KDA’s; the backward kernel is the part that must be rewritten to carry both gates inside its accumulation, and that gate-aware backward is the real implementation cost of the split.
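The algebra behind that difference fits in a few lines. This is a toy check of the inner-product argument only, not the backward kernel of either model; the shapes, the random inputs, and the names `G`, `W`, `b`, and `g` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_v = 4, 6
G = rng.standard_normal((C, d_v))      # incoming gradient, one row per position
W = rng.standard_normal((C, d_v))      # gate-free written values

gate_free = G @ W.T                    # C x C, computed once as a plain matmul

# Scalar gate per position: scale the columns of the gate-free product afterward.
b = rng.uniform(0.1, 1.0, C)
scalar_inside = G @ (b[:, None] * W).T
scalar_after = gate_free * b[None, :]
print(np.allclose(scalar_inside, scalar_after))

# Channel-wise gate: it reweights every channel before the sum over channels.
g = rng.uniform(0.1, 1.0, (C, d_v))
channel_inside = G @ (g * W).T

# If some row and column scaling recovered the gated product from the gate-free
# one, this entrywise ratio would be a rank-one matrix. It is not.
ratio = channel_inside / gate_free
print(np.linalg.matrix_rank(ratio))
```

In the scalar case the gate can be applied after the matmul; in the channel-wise case the gated operand `g * W` has to be formed before the contraction over channels.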

Setting both gates to the same scalar recovers KDA exactly; tying the decay to a scalar as well gives Gated DeltaNet; dropping the decay gives the delta rule. Each model is the next with some gate held to a scalar.

Where this nets out

Step back and it is all one idea, taken in stages. Linear attention compresses an unbounded history into a fixed matrix, fast but lossy. The delta rule edits that matrix surgically instead of piling onto it. The chunked triangular solve makes the edit trainable at scale. Decay, per-channel decay, and decoupled erase/write gates each give the edit finer control over what to keep and what to remove, without giving up the fixed-size state or the parallel training. None of them recovers the softmax cache's perfect recall; they make the compression smarter.

That is also where the measured gains land. In the Gated DeltaNet-2 paper the improvement over KDA is modest on language modeling but clear on long-context, multi-key retrieval, the regime where many associations are forced to share one fixed state and interference is worst. The ablation is honest about the split: a channel-wise erase gate with a scalar write recovers most of the gain, so the erase side is doing more work than the write side.

This is also why pure linear attention rarely replaces softmax outright. Exact recall is often worth the cost of the growing cache, so most production models stay softmax, and these layers show up where memory and throughput dominate: long context, high-throughput serving, constrained hardware. The common deployment is hybrid: interleave a few full or sliding-window attention layers for exact recall with many cheap linear layers. Recent open-weight models make this concrete. Qwen3-Next and Kimi Linear both stack three linear blocks (a Gated DeltaNet variant) per full-attention block, a 3:1 ratio, and MiniMax-01 mixes lightning (linear) and softmax attention in a similar pattern.

Sources:

DeltaNet chunkwise algorithm (Yang, Wang, Zhang, Shen, Kim, NeurIPS 2024)
Gated DeltaNet (Yang, Kautz, Hatamizadeh, ICLR 2025, arXiv:2412.06464)
KDA / Kimi Linear (Kimi Team, 2025, arXiv:2510.26692)
Gated DeltaNet-2 (Hatamizadeh, Choi, Kautz, 2026, arXiv:2605.22791)

AI for Science with Qichao Hu (Molecular Universe / SES AI)

Ravid Shwartz Ziv — Mon, 29 Jun 2026 04:32:25 GMT

In this episode, we talk with Qichao, the founder and CEO of Molecular Universe, the AI-for-science platform that grew out of SES AI, a high-energy-density battery developer he’s run for fourteen years. His core distinction is that companies from the AI world build tools, such as foundation models that predict properties, while companies from the science world care about the final product, such as the new battery or material that actually ships. Molecular Universe sits firmly on the science side, and the difference shows up everywhere from what they publish to what they refuse to.

We get into the actual workflow of materials discovery and where AI compresses it. A single trial in a traditional lab can take a year with maybe a 40% success rate; the goal is to run a thousand candidates in parallel and turn that year into a week. Qichao walks through improving low-temperature fast-charging for EV batteries: from hypothesis generation through molecule-, material-, and device-level property prediction, down to autonomous labs that synthesize and test the top candidates without a human touching a pipette.

The hardest problem, it turns out, isn’t predicting molecular properties or measuring device performance; it’s the black box connecting the two. In batteries, that’s the solid-electrolyte interface, which the field has been hand-waving about since the seventies. And the thing standing in the way of cracking it isn’t a clever training trick but data: companies sitting on twenty years of records are finding it too messy, incomplete, and poorly labeled to train on, and are having to start collecting from scratch with new protocols and robots.

Timeline

00:13 — Intro and welcome
01:19 — Shovel vs. gold
05:18 — Why the world’s smartest scientist doesn’t automatically give you a better battery
07:25 — The discovery workflow
09:37 — Exploration vs. exploitation
11:54 — Safety and filtering: screening novel molecules against banned and toxic-substance lists
17:55 — How hypotheses get generated, and where frontier LLMs help
20:29 — From hypothesis to ~400 formulations: property prediction, ranking, and handing off to autonomous labs
26:37 — “A foundation model for everything” — and the black box between molecular properties and device performance
30:01 — World models and physics
33:09 — The great unknown in batteries
37:08 — Simulation vs. reality: calibrating massive simulated datasets with a sliver of experimental data
41:47 — Lab robotics: how fast the hardware has caught up, and what a floor of autonomous labs looks like
43:50 — The real bottlenecks
50:21 — Pre-training from scratch vs. post-training LLMs, and why training tricks haven’t reduced the need for good data
52:42 — Evaluation
55:42 — Publish the B+ model, keep the A model
58:05 — Five years out
1:00:37 — Closing thoughts and wrap

Music:

“Kid Kodi” - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Infrastructure for AI at Scale - With Benny Chen (Fireworks AI)

Ravid Shwartz Ziv — Wed, 24 Jun 2026 04:03:28 GMT

We talk a lot on this show about RL, agents, and the move between pre-training and post-training, but not enough about the layer everything actually runs on. Benny Chen, co-founder of Fireworks AI, one of the largest inference platforms around, walks us through what it takes to serve models at scale: sourcing GPUs, writing the kernels, the runtime, and the routing layer that lets a customer hit one endpoint and forget the rest.

We talk about why the real bottleneck is power, not chips, and why that favors Nvidia and Google; why MoE keeps winning even when dense models look better on paper; and why he'd rather run fungible capacity at 95% than specialized chips at 60%. We also talk about quantization limits, where RL efficiency has to go next, and his case that AI is still under-hyped. Finally, we get into cross-region training, sparse autoencoders and why interpretability hasn't taken off in open source, whether open models can close the gap, and a frank read on Anthropic's go-to-market.

Timeline

00:00 — Intro: the part of AI nobody talks about
01:20 — What "infrastructure for AI" actually means: the layers, from GPUs up to routing
02:59 — Why not just buy your own GPUs and do it yourself?
05:17 — The scale Fireworks runs at
06:35 — Hardware inflation, GPU costs, and the real risk hiding in commit duration
10:14 — Nvidia vs AMD vs TPUs, and why power is the bottleneck
11:57 — Mixing GPU types and generations; fungibility vs. specialization
14:22 — Once you have the GPUs, what's the next layer to build?
17:04 — Dense vs. MoE, and why the hardware picks the winner
21:07 — Quantization: is FP4 the floor? TurboQuant and INT vs. FP
24:28 — How tied are the algorithms to the hardware?
25:12 — DeepSeek, DeepGEMM, and next-token prediction as reconstruction loss
28:50 — Why RL is still wildly inefficient compared to pre-training
30:08 — Speculative decoding, AI-generated kernels, and auto-research
34:00 — The AGI question: why text gets automated but vision may stay expensive
37:07 — Hype check: why Benny thinks AI is still under-hyped
41:28 — Training vs. inference at the infrastructure level
44:12 — Scaling across data centers: cross-region training with Cursor
45:40 — Sparse autoencoders, interpretability, and why open source is human-constrained
49:04 — Will open models catch up — on quality and on compute?
51:41 — Are we plateauing? Opus 4.7 vs. 4.6 and the coming data wars
54:41 — Physical limits, HBM, and whether chips keep getting faster
58:17 — The belief about inference everyone gets wrong
59:31 — Anthropic, mythos, and a frank take on go-to-market
1:04:41 — Wrap-up

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.


Broken Peer Review, AI, and Worms — with Oded Rechavi

Ravid Shwartz Ziv — Sun, 21 Jun 2026 03:53:01 GMT

Oded Rechavi is a biologist at Tel Aviv University and the co-founder of QED, a company building AI to review scientific work. He's also spent years studying worms.

We start with what's wrong with peer review and grant funding: why it takes years to publish, why reviewers are often your own competitors, and why the whole thing is locked to an economic model that rewards publishing more papers, not better ones. Oded explains why he doesn't call QED "peer review" at all, and what it would take to actually validate science instead of just stamping it.

Then we get into the biology. C. elegans has exactly 959 cells, every one of them named, and a fully mapped brain. Oded's lab studies how a worm's experiences get passed to its offspring through RNA rather than DNA — meaning what happens to a worm in its lifetime can change its descendants. We also talk about using ancient DNA to reassemble the Dead Sea Scrolls, what AI can and can't do for biology, and why he wants to build an "Ironman suit" for researchers rather than replace them.

00:00 Intro

01:35 Why scientific publishing is broken

04:02 Years to publish, and what it costs science

07:20 Bad reviewers, conflicts of interest, and the money

10:47 Why preprints don't fix it

15:37 How AI conferences handle review

22:07 Conferences vs. journals — does slow review help?

25:22 Building QED: review, not peer review

30:02 Tracking a paper from idea to submission

33:11 What writing a grant actually involves

35:00 The ERC reviewer crisis

37:06 Tailoring feedback to your field

41:48 Switching to biology

44:30 Every cell has a name: inside C. elegans

46:28 Inheritance without DNA

48:16 What the worm "thinks" changes its offspring

51:58 Reassembling the Dead Sea Scrolls with ancient DNA

56:07 Psychedelics and worms

58:36 Can AI run the research itself?

1:04:49 Automation vs. validation

1:07:12 The origin of life

1:08:49 Why people reject AI-written work

1:16:18 Will humans still have a role?

1:17:39 Wrap-up

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.


Will AI Take Our Jobs? With Alex Imas (Google/University of Chicago)

Ravid Shwartz Ziv — Tue, 16 Jun 2026 14:49:38 GMT

Will AI take our jobs? We put the question to Alex Imas, the new Director of AGI Economics at Google DeepMind and a professor at Chicago Booth, whose entire job now is studying how frontier AI reshapes the economy. His short answer: probably some of them, but the popular story is mostly wrong about which jobs and how fast.

Alex makes the case that a job is a bundle of tasks, not a single thing AI either does or doesn't do, and that the number people should actually care about is how much consumer demand responds to falling prices. Get that wrong and you predict mass layoffs. Get it right and you sometimes predict more hiring. We get into why the automation panic is two centuries old, why he thinks blue-collar work is in more danger than white-collar, and why the people already winning are the ones adopting AI fastest.

We also cover the AGI versus ASI distinction and why it changes everything for the economy, what happens when there's no moat and open models stay six to eight months behind, the three-tier pricing future he sees coming after the 2026 compute crunch, and what any of this means if you're deciding whether to send your kids to college.

The episode was recorded before Alex joined Google.

Timestamps

00:00 Meeting Alex Imas

00:44 Will AI take our jobs?

03:35 Is this an AI question or an economics question?

06:18 The economy is already behind the AI we have

07:43 Why AI adoption is K-shaped

12:51 Was Andrew Yang right?

13:45 The automation panic is 200 years old

16:46 Dario's six-month claim, and why we don't see it yet

17:22 A job is not a task

22:38 The three numbers that actually predict the labor market

22:42 The chess engine analogy and the centaur phase

25:45 Recursive self-improvement and the hamburger problem

30:06 Should AI labs be the ones answering alignment questions?

31:17 The "invisible hand wave" and why nobody wants fully autonomous AI

33:27 AGI vs ASI, and why the difference is everything

35:28 Commodities vs relational goods

41:14 Star Trek, replicators, and predicting with sci-fi

45:20 Inequality and the Upper West Side VCs

46:21 Your money manager was automated in the 1960s

50:47 Are OpenAI and Anthropic overvalued? The moat problem

54:29 What has to be true for the losses to make sense

55:43 Cognitive atrophy and monopoly fears

57:00 The 2026 compute crunch and the three-tier pricing future

1:01:52 The Apple vs Android analogy

1:03:54 A rich-country perspective

1:04:16 Protecting the skills that actually matter

1:07:02 Will not using AI become a status symbol?

1:08:53 Does capitalism even survive?

1:13:44 Redistribution becomes the political battleground

1:18:16 Blue collar vs white collar: who's really at risk

1:21:18 Advice for parents in an AI world

1:22:43 Saving for retirement when the Valley says don't

1:25:06 Will non-elite colleges survive?

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.


Why AI Benchmarks Are Lying to You - with Wenhu Chen (Meta/University of Waterloo)

Ravid Shwartz Ziv — Sat, 13 Jun 2026 20:05:34 GMT

In this episode, we sit down with Wenhu Chen, research scientist at Meta MSL, assistant professor at the University of Waterloo, and the person behind MMLU-Pro and MMMU. If you've read a frontier model release in the last two years, you've seen his benchmarks. That makes him one of the best people to answer the question everyone dances around: when a model jumps from 40% to 90% on your benchmark, how much of that is real?

We dig into why benchmarks have become the loss function of the entire field - design a bad one, and thousands of brilliant researchers will spend months hill-climbing in the wrong direction. Wenhu is surprisingly candid about the limits of his own creations: contamination is everywhere, saturation turns frontier benchmarks into unit tests, and popular alternatives, such as LM Arena, mostly measure tone and length rather than capability. His answer is to evaluate models where they've never been: private codebases, hospital data, and the messy, live internet.

We also talk about ClawBench, his new benchmark that deploys agents to over 140 real production websites to do things people actually want done, such as ordering food, booking tickets, and applying for jobs. The best model in the world completes about a third of these tasks. We unpack why: bot detection, models that refuse to click "pay," agents that give up the moment an environment doesn't match their training, and harnesses that can swing results by 20% without changing the model at all.

Along the way, we cover the overlooked science of evaluating pre-training, data flywheels, and synthetic environments for agent training, and whether RL teaches models to reason or just surfaces what's already there. We close with Wenhu's predictions: exploration and adaptability will improve rapidly, but security will become the field's hardest problem as agents gain real permissions in the real world.

Timestamps

00:00 – Intro
00:55 – What good evaluation means, and how it's changed since the early GPT days
03:35 – Benchmarks as the field's loss function
05:50 – Contamination: the problem nobody fully solves
08:08 – MMLU-Pro scores: real progress or training on the test set?
11:05 – Can you measure creativity?
12:34 – Why human judges and arenas are unreliable — and what to use instead
19:22 – What a good benchmark actually looks like
22:34 – Chain of thought: signal or scratchpad?
26:01 – Auto-research and hill-climbing agents
28:52 – Harnesses: 20% swings without touching the model
32:28 – Safety, model release, and an "FDA for models"
36:53 – The overlooked science of pre-training evaluation
43:49 – Designing pre-training benchmarks when one run costs a billion dollars
49:45 – ClawBench: agents on 140+ live websites, and why the best model gets 33%
54:42 – How MMLU-Pro and MMMU-Pro were born from public complaints
59:16 – Pixel agents vs. APIs: will MCP kill computer use?
1:02:11 – Training agents: data flywheels and synthetic environments
1:05:43 – SFT vs. RL, and does RL teach reasoning or reveal it?
1:09:21 – What gets solved next year — and what doesn't
1:14:32 – Undervalued ideas, and what's next for ClawBench

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.

Jürgen Schmidhuber - Part 2: JEPA, the Road to AGI, and Who Really Invented Modern AI

Ravid Shwartz Ziv — Sun, 07 Jun 2026 18:13:28 GMT

In the second half of our conversation with Jürgen Schmidhuber, we focus on the key ideas he's pursued since the early 1990s and discuss why he believes these concepts are only now being rediscovered.

We start with JEPA. Jürgen argues that the method LeCun named in 2022 is the same family he published in 1992 as Predictability Maximization. From there he traces the adversarial lineage back further still, to his 1990 world-model paper and 1991 Predictability Minimization - the curiosity-driven minimax games he sees as the real origins of GANs.

We also talk about why these ideas took thirty years to land, why today's trillion-dollar data-center buildout is driven by AGI fear, and why he thinks Apple may come out ahead.

The back half turns to what he sees as the real frontier: physical AI. Today's systems are superhuman behind the screen but helpless at a leaky pipe, and until a robot can use human tools, there's no AGI. He discusses self-replicating, self-improving machines as "a new kind of life," reframes continual learning and test-time training as ideas from his 1991 fast-weight work, and detours through Solomonoff's universal prior, Hutter's AIXI, and the Gödel machine.

We close on the subject Jürgen is famous for: scientific credit. He makes his case for rigorous attribution, casts himself as a "speaker for the dead" championing forgotten pioneers like Ivakhnenko, and reflects candidly on whether the fights are personal.

Timeline

00:30 — What JEPA is, and the 1992 Predictability Maximization story

04:54 — Implementing PMAX: autoencoders, Siamese networks, Infomax

09:10 — Predictability Minimization, factorial codes, and the roots of GANs

16:00 — Why it took 30 years: the economics of compute

20:52 — Data, the web, and 1990 as the origin point

23:09 — Hardware inflation, the trillion-dollar buildout, and the coming crash

34:05 — Physical AI: the plumber problem and self-replicating machines

41:14 — Which 90s ideas are being scaled right now

45:26 — Continual learning and test-time training as "old hats"

55:19 — Measuring intelligence: Solomonoff, AIXI, and the Gödel machine

1:05:26 — Self-replication and von Neumann

1:09:51 — Will he see AGI in his lifetime?

1:10:42 — Credit, integrity, and being a "speaker for the dead"

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed


Jürgen Schmidhuber - World Models, RL, and the Year that changed AI (Part 1)

Ravid Shwartz Ziv — Thu, 04 Jun 2026 12:59:25 GMT

In this episode, we host Jürgen Schmidhuber - the man, the legend, one of the godfathers of modern AI. His lab worked out many ideas behind today’s systems (LSTM, world models, artificial curiosity, Transformer variants, and even GAN-style setups) decades before they became fashionable, and he’s just as well known for making sure people remember who did what first. This is the first of two conversations with him.

We go back to his lab in the early 90s and ask how one small group came up with so many of the ideas that are now being scaled to a thousand billion dollars, back when compute was ten million times more expensive. A lot of the episode comes down to one distinction he keeps making: prediction vs. decision-making. His take is that LLMs are very good prediction machines that imitate the web, but that’s only half the problem. To actually act in the world, you need a controller that uses a world model to plan. He talks about his 1990 work on world models and artificial curiosity, where the controller gets rewarded for running experiments that improve its own model (an adversarial setup years before GANs), why planning millisecond by millisecond doesn’t scale, and why you need sub-goals instead.

We also talk about compression as the core of understanding, from falling apples to Kepler to Einstein, and why we still don’t have a robot that can do what a plumber does, even though the AI behind the screen keeps getting better. Then the conversation moves to credit assignment: how “to Schmidhuber” became a verb, what he thinks is broken about the award system, and a long exchange on PMAX vs. JEPA. He ends on the real origins of deep learning and a prediction about self-replicating machines in space.

Timeline

00:00 Intro
00:55 1991 in Munich, and why that lab mattered
02:38 "I'm not very smart" and why compute getting 10× cheaper every 5 years changed everything
04:25 Chess as an AI proxy
08:27 Artificial curiosity in the 90s vs. today's RL exploration
09:10 Why RL is harder than supervised learning
20:48 Coding agents vs. robots, and how a baby learns its own hands
26:20 Compression as understanding
33:40 What's actually missing on the road to AGI
37:30 Why millisecond-by-millisecond planning is stupid
47:44 Convergence to LLMs, GPUs, and how far we still are from the Bremermann limit
51:49 Unsupervised learning, factorial codes, and predictability minimization
58:12 Credit assignment: the fights with LeCun and the Nobel critique
1:02:13 On his last name becoming a verb
1:05:17 The award system's missing peer review
1:07:03 Closed labs and the decline of open research
1:13:23 Audience questions
1:34:02 Closing: who really invented deep learning?

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed


AI for Science and the Thermodynamics of Generative AI - with Max Welling (UvA, CuspAI)

Ravid Shwartz Ziv — Fri, 29 May 2026 03:58:30 GMT

In this episode, we sit with Max Welling, Professor of Machine Learning at the University of Amsterdam, co-founder and CTO of CuspAI, and a foundational figure behind variational autoencoders (VAEs), equivariant networks, and Bayesian deep learning. We talk about AI for science, the physics underneath generative models, and what's still missing on the road to real intelligence.

Max starts with what impresses him and what worries him about the LLM era, then makes the case that the next leaps will come from physical AI and from science itself. We dig into how machine learning actually works in the lab, world models and whether priors like geometry and symmetry should be built in or simply learned, and whether transformers will still rule a decade from now. At the end, we talk about CuspAI's climate mission, AI risk and regulation, Max’s new book, and where neuroscience might inspire the next wave of ML.

Timeline

00:00 — Intro
00:47 — Are we happy with the LLM era?
03:14 — Embodiment and physical AI
08:05 — Does "AGI" even matter as a term?
11:34 — Verifiers, RL, and why math/coding are tractable
13:17 — What actually shifted to make materials discovery work
14:42 — From molecules to biology and wet labs
16:26 — Working with real labs: timescales, friction, and the "Mira" agent
20:29 — Balancing simulators vs. experiments: the exploration–exploitation trade-off
23:44 — Active learning for experimental design
24:23 — Why active learning hasn't been central to LLMs
25:24 — A general loop for ML-for-science across domains
27:10 — Foundation models for chemistry: a "mother ship" plus a zoo of fine-tuned models
30:04 — Quantum mechanics, interpretation, and AI as a creative theorist
31:54 — World models and Yann LeCun's view; priors vs. learning
34:57 — Should world knowledge be explicit? (responding to Stefano Ermon)
36:41 — Vision: equivariance vs. transformers, and the role of optimization
40:32 — Best model for molecular properties in 10 years? Will transformers survive?
43:16 — CuspAI's climate focus and what motivated it
47:10 — One platform for every material class — what transfers and what doesn't
48:42 — Where does the risk of human extinction really come from?
51:06 — The "pause AI" debate and the arms-race reality
52:40 — Regulating powerful models: government vs. self-regulation
55:16 — Who should design AI regulation?
56:29 — The new book
1:00:31 — Compression, the information bottleneck, and renormalization
1:03:30 — The role of foundational principles in modern AI
1:04:06 — Waves in computing, the brain, and the next wave of innovation
1:07:11 — Neuroscience and ML: are we in a better position now?
1:09:17 — Conferences, the ICLR keynote, and finding the right people

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

After Math Falls, What's Next? with Julia Kempe (NYU/Meta)

Ravid Shwartz Ziv — Mon, 25 May 2026 02:10:56 GMT

In this episode, we sit down with Julia Kempe, a Professor at NYU's Center for Data Science and researcher at Meta FAIR's Foundations of Reasoning team, for a wide-ranging conversation on the future of AI research.

We dig into why verifiable domains like mathematics may be on track to "fall" the way Go did. With formal verification through Lean and the Mathlib infrastructure, LLM agents can now generate and check proofs at scale, and Julia makes the case that a new industry of automated mathematical discovery is closer than most mathematicians believe. We explore why Erdős problems are already falling, what's still missing for harder fields like analysis and physics, and how synthetic data, curation, and verification fit together.

From there we get into the energy and scaling limits of frontier models, the case for academic research that big labs can't pursue, how to advise PhD students when Claude can already do their first-year work, the rise of AI safety and security as research priorities, and Julia's optimistic argument that AI tools are bringing back the Renaissance generalist - the researcher who can finally work fluently across math, biology, and beyond.

Timeline

00:00 — Introductions
01:00 — Defining reasoning and verifiable domains
04:00 — Lean, Mathlib, and the formalization of mathematics
10:00 — Constructive proofs, Erdős problems, and the new wave of "AI mathematicians"
14:00 — Will math be "solved"? Art, photography, and the changing nature of creative work
18:00 — Why physics is harder than math
22:00 — Moravec's paradox, evolution, and why robotics lags behind language
27:00 — The Renaissance is back: generalist researchers in the age of AI
29:00 — Advising students: math, programming, and what core education still matters
32:00 — Teaching and assessment when GPT can do the homework
35:00 — Anti-AI backlash, energy costs, and the security threat
40:00 — Scaling vs. efficiency
42:00 — Model collapse, synthetic data, and what's left to squeeze from the internet
44:00 — What's exciting next: AI for science, safety, robotics, memory, and planning
47:00 — Annotation costs as a proxy
50:00 — Superhuman models and what security even means against them
52:00 — AlphaGo as precedent for verifiable superhuman performance
54:00 — Hallucination, the Mirage paper, and whether these are solvable problems
56:00 — Why coding isn't fully solved yet
58:00 — Agent security, prompt injection, and the Wild West of deployed agents
1:01:00 — Regulation: what's needed and what's possible
1:04:00 — Advice for PhD students and what research academia should pursue
1:09:00 — Startup opportunities: robotics, security, and AI for finance
1:12:00 — Closing thoughts: use the tools, and build grassroots AI for good

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed


Intelligence in an Open World - with Mengye Ren (NYU)

Ravid Shwartz Ziv — Wed, 20 May 2026 13:00:00 GMT

We talk with Mengye Ren, Assistant Professor at NYU's Center for Data Science, about what intelligence actually means once you step outside a benchmark, and why scaling a single centralized model isn't the whole story.

We get into why intelligence has to be defined in open environments, not closed ones, and what that means for how we measure progress. We push on the creativity question: today's models sample bottom-up from a softmax or a Gaussian, with no internal loop of consideration, and as Mengye puts it, we haven't understood creativity yet and we're already prepared to hand it over.

We also talk about what's missing for the next paradigm: continual learning, memory, embodied grounding, and smaller models that actually accumulate experience instead of re-deriving everything from scratch each call. Along the way, we get into JEPA and latent variables, biology as inspiration vs. blueprint, why frontier labs don't lean on explicit latents, the limits of synthetic data and world models, agent-to-agent communication, model uncertainty and forecasting, and whether ML education still matters when AI writes the experiments.

A grounded, contrarian conversation about where AI research should be looking next, beyond benchmarks, beyond scale.

Timeline

00:00 — Intro and welcome

01:24 — What is intelligence? Defining it relative to objectives and open environments

04:19 — Is intelligence really the path to human flourishing, or is it productivity?

04:57 — Safety, scalable oversight, and whether stronger models help or hurt

06:09 — What does "alignment" actually mean?

07:18 — Centralized vs. decentralized models: objectivity vs. personal meaning

08:50 — Hinton vs. LeCun: where Mengye stands on AI risk

10:29 — Bottom-up vs. top-down architectures and feedback loops

21:28 — Biology and AI: inspiration, not blueprint

24:14 — Biological plausibility, spiking nets, and where the analogy breaks

25:39 — JEPA, Mamba, and architectures beyond the transformer

27:31 — Language as a special modality: abstraction built for communication

29:04 — Are we too locked into the current paradigm? Risk of creativity collapse

30:09 — Synthetic data, simulation, and the brain's own generative models

31:43 — World models and physical AI: how babies actually learn

33:03 — The case for smaller, continually learning models

37:02 — The role of academic research in a frontier-lab world

39:47 — Why LLMs aren't funny: the creativity gap

40:35 — What research areas matter most: embodiment, continual learning, creativity

42:05 — Creativity is bounded by experience — and why bottom-up sampling isn't enough

45:35 — Agent-to-agent communication and the limits of sub-agents

46:39 — Model confidence, epistemic uncertainty, and forecasting

49:44 — Tokenization, static vs. dynamic worlds, and always-learning systems

52:20 — Latent variables, JEPA, and why frontier models skip them

53:40 — The future of ML education when AI writes the experiments

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed


Language, Cognition, and the Limits of LLMs - with Tal Linzen (NYU/Google)

Ravid Shwartz Ziv — Sun, 17 May 2026 00:58:22 GMT

We host Tal Linzen, Associate Professor at NYU and Research Scientist at Google, for a conversation on the intersection of cognitive science and large language models.

We discussed why children can learn language from around 100 million words while LLMs need trillions, and the surprising finding that as models get better at predicting the next word, they become worse models of how humans actually process language. Tal walked us through how his lab uses eye-tracking and reading-time data to compare model behavior to human behavior, and what that reveals about prediction, working memory, and the limits of current architectures.

We also got into nature versus nurture and how inductive biases can be instilled by pre-training on synthetic languages, world models and whether transformers actually use the geometric structure they encode, the BabyLM challenge and data-efficient language learning, and what mechanistic interpretability can offer cognitive science beyond just fixing model bugs. The conversation closed on academia versus industry, the role of PhDs in the current AI moment, and how AI coding tools are changing the way Tal teaches and evaluates students at NYU.

Timeline

00:13 — Intro and what cognitive science means
02:16 — Using computational simulations to understand how humans learn language
05:26 — How children learn language vs. how LLMs are pre-trained
07:53 — Why mainstream LLMs are not good models of humans
10:07 — Comparing humans and models with eye-tracking and reading behavior
13:52 — Sensory modalities, smell, and how much you can learn from language alone
16:03 — Animal cognition and decoding animal communication
17:00 — Nature vs. nurture, inductive biases, and what transformers can and can't learn
21:21 — Instilling inductive biases through synthetic languages
27:34 — The bouba/kiki effect and cross-linguistic sound symbolism
28:33 — Latent causal structure in language and whether models discover it
31:13 — Does knowing linguistics help build better models?
35:07 — World models: what they mean, and why transformers encode geometry but don't use it
39:13 — Tokenization, and why Tal doesn't like it
41:35 — Scaling laws and the inverse-U curve of model quality vs. human fit
44:34 — Where the human–model mismatch comes from: architecture, memory, and data
47:08 — Diffusion language models and sentence planning
48:21 — Data quality, synthetic data, and curriculum effects
50:54 — Comparing models at different training stages to human development; BabyLM
54:40 — What level of the model should we actually probe? Representations vs. behavior
1:01:04 — Mechanistic interpretability, Deep Dream, and human dreaming
1:02:11 — Cognitive neuroscience, intracranial recordings, and working memory
1:10:31 — Should you still do a PhD in 2026?
1:12:31 — Will software engineers lose their jobs to AI?
1:17:43 — Teaching in the age of coding agents: what changes in the classroom
1:20:54 — What's next: human-like LLMs as user simulators, and recruiting

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

The Principles of Diffusion Models - with Jesse Lai (Sony AI)

Ravid Shwartz Ziv — Sun, 10 May 2026 16:09:47 GMT

We host Chieh-Hsin (Jesse) Lai, Staff Research Scientist at Sony AI and visiting professor at National Yang Ming Chiao Tung University, Taiwan, for a conversation about diffusion models, the technology behind tools like Stable Diffusion and most of the AI image and video generators you've seen in the last few years. Jesse recently co-authored The Principles of Diffusion Models with Stefano Ermon, and the book is quickly becoming a go-to reference in the field.

We start with what a generative model actually is, and what it means to "generate" an image or a sound. Jesse explains the core idea behind diffusion in plain terms. You start with pure noise, and a neural network gradually cleans it up, step by step, until a realistic image emerges.

From there, we talk about why diffusion has come to dominate so much of generative AI. Because the model builds an image gradually, you can guide it along the way, nudging the output toward what you actually want, refining details, or combining it with other controls. We also discuss the common critique that diffusion is slow and how the field has largely addressed it through new techniques.

We zoom out to the bigger picture, too. Jesse shares his view on world models and whether diffusion is the right foundation for them. We talk about what makes a generative model genuinely good versus just good at gaming benchmarks, and why evaluating creativity and realism is so much harder than scoring a multiple-choice test.

Timeline

00:12 — Intro and welcoming Jesse

00:47 — Why Jesse wrote the book, and who it's for

03:29 — The three families of diffusion models, and why they're really one idea

05:14 — What makes a good generative model

07:39 — How do you even measure if a generated image is good?

08:59 — Why diffusion beats autoregressive models for images

10:33 — Is diffusion still slow? How generation got fast

11:12 — A simple intuition for what a "score" is

14:12 — How the different flavors of diffusion connect under the hood

14:42 — Diffusion for text and proteins

17:12 — Consistency models and the push for one-step generation

22:12 — Diffusion for world models: simulating reality in real time

26:12 — Do world models need to understand language?

35:12 — Is diffusion the right tool, or just a convenient one?

38:12 — What benchmarks actually tell us, and what they miss

46:12 — Closing thoughts and where to find the book

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed


Inside xAI, and the Bet on AI Math - with Christian Szegedy (Math Inc)

Ravid Shwartz Ziv — Mon, 04 May 2026 12:45:04 GMT

We talked with Christian Szegedy, co-inventor of Inception and Batch Normalization, founding scientist at xAI, now at Math Inc, about what it takes to build a frontier lab, and why he left xAI to work on formal mathematics. Christian thinks Lean and auto-formalization are the missing piece for trustworthy AI: a machine-checkable layer underneath all reasoning, where proofs are guaranteed correct without anyone having to read them.

We got into his bet with François Chollet that AI will hit superhuman mathematician level by 2026, and what that actually unlocks beyond math itself: verified software instead of vibe-coded apps that break when you refactor, AI systems you can actually trust because their reasoning is checkable, and a path to handling protein folding, chemistry, and parts of biology with real guarantees instead of hand-waving. Christian also walked us through how Math Inc's Gauss system pulled off a proof in two weeks that human experts had estimated would take another year.

We also covered xAI's first 12-person year, why Christian no longer buys the original batch normalization story, why he's sure transformers won't be the dominant architecture in five years, what mathematicians do in a world of cheap proofs, and his take on whether humanity will handle AI well. He distrusts humanity more than he distrusts AI.

Timeline

00:12 — Intros: Christian's background (Inception, Batch Norm, xAI, Math Inc)

01:29 — Building a frontier lab from scratch: the first 12 people at xAI

04:15 — Hiring for proven track records when 200K GPUs are at stake

06:07 — Elon's "dependency graph" and balancing long-term vision with investor demos

07:28 — Gauss formalizes the strong prime number theorem in 2 weeks

12:25 — What "formalization" actually means (and why it's not what most people think)

14:39 — Why Lean gives 100% certainty and why that matters for RL

15:26 — ProofBridge and joint embeddings across mathematical subfields

18:07 — Does math formalization transfer to coding and other fields?

21:44 — Can every domain be mathematized?

23:14 — Verified software, chip design, and why vibe-coded apps are dangerous

26:35 — Scaling Mathlib by 100–1000x

28:27 — Artisan formalizers vs. invisible machine-language formalists

33:26 — Can verification generalize?

45:19 — Revisiting Batch Norm: covariate shift, loss landscape, and what really happens

48:22 — Is normalization even necessary?

50:10 — What's actually fundamental in modern AI architectures

51:41 — Why Christian thinks transformers won't last 5 years

52:38 — The 2026 superhuman AI mathematician bet

55:15 — What's missing: better verification + a much larger formalized math repository

56:13 — Lean vs. Coq vs. HOL Light - does the proof assistant actually matter?

59:26 — The role of mathematicians in 5–10 years

1:02:00 — A human element to mathematics: Newton, Leibniz, and competitive proving

1:03:25 — The telescope analogy: AI as the instrument that lets us see the math universe

1:05:19 — Job apocalypse or Jevons paradox?

1:08:41 — Advice for students

1:09:50 — Can we formally verify AI alignment?

1:11:52 — Closing thanks

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Reasoning Models and Planning - with Rao Kambhampati (Arizona State)

Ravid Shwartz Ziv — Wed, 29 Apr 2026 15:18:11 GMT

We sat down with Rao Kambhampati, a Professor of CS at Arizona State University and former President of AAAI, to talk about reasoning models: what they are, when they work, and when they break.

Rao has been working on planning and decision-making since long before deep learning, which makes him one of the most grounded voices on what today's reasoning systems actually do. We start with definitions of what reasoning is, why planning is the hard subset of it, and what changed when systems like o1 and DeepSeek R1 moved the verifier from inference into post-training. From there we get into where these models generalize, where they don't, and why benchmarks can be misleading about both.

A big chunk of the conversation is on chain-of-thought: what intermediate tokens are actually doing, why they help the model more than they help the reader, and what outcome-based RL does to whatever semantic content was there to begin with. We also cover world models and why Rao thinks the video-only framing is the wrong bet, the difference between agentic safety and existential risk, and what the planning community figured out decades ago that the LLM community keeps rediscovering.

Timeline

(00:12) Intros
(01:32) Defining "reasoning" and the System 1 / System 2 framing
(04:12) Blocksworld vs Sokoban, and non-ergodicity
(06:42) Pre-o1: PlanBench and "LLMs are zero-shot X" papers
(07:42) LLM-Modulo and moving the verifier into post-training
(10:12) Is RL post-training reasoning, or case-based retrieval?
(13:12) τ-Bench and benchmarks that avoid action interactions
(14:12) OOD generalization and what we don't know about post-training data
(19:02) Does it matter how they work if they answer the questions we care about?
(21:27) Architecture lotteries and why no one tries different designs
(23:42) Intermediate tokens and the "reduce thinking effort" cottage industry
(26:12) The 30×30 maze experiment
(27:42) Sokoban, NetHack, and Mystery Blocksworld
(34:58) Stop Anthropomorphizing Intermediate Tokens — the swapped-trace experiment
(46:12) Latent reasoning, Coconut, and why R0 beat R1
(50:12) How outcome-based RL erodes CoT semantics
(52:12) Dot-dot-dot and Anthropic's CoT monitoring paper
(53:42) Safety: Hinton, Bengio, LeCun
(57:12) Existential risk vs real safety work
(59:42) World models, transition models, and video-only approaches
(1:03:12) Why linguistic abstractions matter — pick and roll
(1:05:42) What the planning community knew in 2005
(1:08:12) Multi-agent LLMs
(1:09:57) Closing thoughts: the bridge analogy

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

What Actually Matters in AI? - with Zhuang Liu (Princeton)

Ravid Shwartz Ziv — Fri, 24 Apr 2026 18:21:23 GMT

In this episode, we hosted Zhuang Liu, Assistant Professor at Princeton and former researcher at Meta, for a conversation about what actually matters in modern AI and what turns out to be a historical accident.

Zhuang is behind some of the most important papers in recent years (with more than 100k citations): ConvNeXt (showing ConvNets can match Transformers if you get the details right), Transformers Without Normalization (replacing LayerNorm with dynamic tanh), ImageBind, Eyes Wide Shut on CLIP's blind spots, the dataset bias work showing that even our biggest "diverse" datasets are still distinguishable from each other, and more.

We got into whether architecture research is even worth doing anymore, what "good data" actually means, why vision is the natural bridge across modalities but language drove the adoption wave, whether we need per-lab RL environments or better continual learning, whether LLMs have world models (and for which tasks you'd need one), why LLM outputs carry fingerprints that survive paraphrasing, and where coding agents like Claude Code fit into research workflows today and where they still fall short.

Timeline

00:13 — Intro

01:15 — ConvNeXt and whether architecture still matters

06:35 — What actually drove the jump from GPT-1 to GPT-3

08:24 — Setting the bar for architecture papers today

11:14 — Dataset bias: why "diverse" datasets still aren't

22:52 — What good data actually looks like

26:49 — ImageBind and vision as the bridge across modalities

29:09 — Why language drove the adoption wave, not vision

32:24 — Eyes Wide Shut: CLIP's blind spots

34:57 — RL environments, continual learning, and memory as the real bottleneck

43:06 — Are inductive biases just historical accidents?

44:30 — Do LLMs have world models?

48:15 — Which tasks actually need a vision world model

50:14 — Idiosyncrasy in LLMs: pre-training vs post-training fingerprints

53:39 — The future of pre-training, mid-training, and post-training

57:57 — Claude Code, Codex, and coding agents in research

59:11 — Do we still need students in the age of autonomous research?

1:04:19 — Transformers Without Normalization and the four pillars that survived

1:06:53 — MetaMorph: Does generation help understanding, or the other way around?

1:09:17 — Wrap

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

The Future of Coding Agents with Sasha Rush (Cursor/Cornell)

Ravid Shwartz Ziv — Wed, 15 Apr 2026 16:57:16 GMT

We talked with Sasha Rush, researcher at Cursor and professor at Cornell, about what it actually feels like to be at the heart of the AI revolution and build coding agents right now. Sasha shared how these systems are changing day-to-day work and what it is like to develop them.

A big part of the conversation was about why coding has become such a powerful setting for these tools. We discussed what makes code different from other domains, why agents seem to work especially well there, and how much of today’s progress comes not just from better models, but from better ways of using them. Sasha also gave an inside look at how Cursor thinks about training coding models, long-running agents, context limits, bug finding, and the balance between autonomy and human oversight.

We also talked about the broader shift happening in software engineering. Are developers moving to a higher level of abstraction? Is this just a phase where we “babysit” models, or the beginning of a deeper change in how software gets built? Sasha had a very thoughtful perspective here, including what he’s seeing from students, researchers, and engineers who are growing up native to these tools.

More broadly, this episode is about what it means to do serious technical work in a moment when the tools are changing incredibly fast. Sasha brought both optimism and skepticism to the discussion, and that made this a really grounded conversation about where coding agents are today, what they are already surprisingly good at, and where all of this might be going next.

Timeline
00:00 Intro and Sasha joins us
01:11 What “coding agents” actually mean
02:34 Why coding became the breakout use case
08:56 Long-running agents and autonomous workflows
15:08 How these tools are changing the work of engineers
17:15 Are people just babysitting models right now?
22:11 How Cursor builds its coding models
26:29 Rewards, training, and what makes agents work
34:53 Memory, continual learning, and agent communication
38:00 How context compaction works in practice
41:29 Why coding agents recently got much better
50:31 Refactoring, maintenance, and self-improving codebases
52:16 Bug finding, oversight, and verification
54:43 Will this pace of progress continue?
56:42 Can this spread beyond coding?
58:27 The future of Cursor and coding agents
1:03:08 Model architectures beyond standard transformers
1:05:37 World models, diffusion, and what may come next

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

The Hidden Engine of Vision with Peyman Milanfar (Google)

Ravid Shwartz Ziv — Fri, 10 Apr 2026 14:13:34 GMT

Peyman Milanfar is a Distinguished Scientist at Google, leading its Computational Imaging team. He's a member of the National Academy of Engineering, an IEEE Fellow, and one of the key people behind the Pixel camera pipeline. Before Google, he was a professor at UC Santa Cruz for 15 years and helped build the imaging pipeline for Google Glass at Google X. Over 35,000 citations.

Peyman makes a provocative case that denoising, long dismissed as a boring cleanup task, is actually one of the most fundamental operations in modern ML, on par with SGD and backprop. Knowing how to remove noise from a signal basically means you have a map of the manifold that signals live on, and that insight connects everything from classical inverse problems to diffusion models.

We go from early patch-based denoisers to his 2010 "Is Denoising Dead?" paper, and then to the question that redirected his research: if denoising is nearly solved, what else can denoisers do? That led to Regularization by Denoising (RED), which, if you unroll it, looks a lot like a diffusion process, years before diffusion models existed. We also cover how his team shipped a one-step diffusion model on the Pixel phone for 100x ProRes Zoom, the perception-distortion-authenticity tradeoff in generative imaging, and a new paper on why diffusion models don't actually need noise conditioning. The conversation wraps with a debate on why language has dominated the AI spotlight while vision lags, and Peyman's argument that visual intelligence, grounded in physics and robotics, is coming next.

Timeline

0:00 Intro and Peyman's background

1:22 Why denoising matters more than you think

Sensor diversity and Tesla's vision-only bet

15:04 BM3D and why it was secretly an MMSE estimator

17:02 "Is Denoising Dead?" then what else can denoisers do?

18:07 Plug-and-play methods and Regularization by Denoising (RED)

26:18 Denoising, manifolds, and the compression connection

28:12 Energy-based models vs. diffusion: "The Geometry of Noise"

31:40 Natural gradient descent and why flow models work

34:48 Gradient-free optimization and high-dimensional noise

45:13 Image quality and the perception-distortion tradeoff

48:39 Information theory, rate-distortion, and generative models

52:57 Denoising vs. editing

54:25 The changing role of theory

57:07 Hobbyist tools vs. shipping consumer products

59:40 Coding agents, vibe coding, and domain expertise

1:05:00 Vision and more complex-dimensional signals

1:09:31 Do models need to interact with the physical world?

1:11:28 Continual learning and novelty-driven updates

1:13:00 On-device learning and privacy

1:15:01 Why has language dominated AI? Is vision next?

1:17:14 How kids learn: vision first, language later

1:19:36 Academia vs. industry

1:22:28 10,000 citations vs. shipping to millions, why choose?

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

Reinventing AI From Scratch with Yaroslav Bulatov

Ravid Shwartz Ziv — Mon, 30 Mar 2026 23:22:20 GMT

Yaroslav Bulatov helped build the AI era from the inside, as one of the earliest researchers at both OpenAI and Google Brain. Now he wants to tear it all down and start over. Modern deep learning, he argues, is up to 100x more wasteful than it needs to be - a Frankenstein of hacks designed for the wrong hardware. With a power wall approaching in two years, Yaroslav is leading an open effort to reinvent AI from scratch: no backprop, no legacy assumptions, just the benefit of hindsight and AI agents that compress decades of research into months. Along the way, we dig into why AGI is a "religious question," how a sales guy with no ML background became one of his most productive contributors, and why the Muon optimizer, one of the biggest recent breakthroughs, could only have been discovered by a non-expert.

Timeline

00:12 — Introduction and Yaroslav's background at OpenAI and Google Brain

01:16 — Why deep learning isn't such a good idea

02:03 — The three definitions of AGI: religious, financial, and vibes-based

07:52 — The SAI framework: do we need the term AGI at all?

10:58 — What matters more than AGI: efficiency and refactoring the AI stack

13:28 — Jevons paradox and the coming energy wall

14:49 — The recipe: replaying 70 years of AI with hindsight

17:23 — Memory, energy, and gradient checkpointing

18:34 — Why you can't just optimize the current stack (the recurrent laryngeal nerve analogy)

21:05 — What a redesigned AI might look like: hierarchical message passing

22:31 — Can a small team replicate decades of research?

24:23 — Why non-experts outperform domain specialists

27:42 — The GPT-2 benchmark: what success looks like

29:01 — Ian Goodfellow, Theano, and the origins of TensorFlow

30:12 — The Muon optimizer origin story and beating Google on ImageNet

36:16 — AI coding agents for software engineering and research

40:12 — 10-year outlook and the voice-first workflow

42:23 — Why start with text over multimodality

45:13 — Are AI labs like SSI on the right track?

48:52 — Getting rid of backprop — and maybe math itself

53:57 — The state of ML academia and NeurIPS culture

56:41 — The Sutra group challenge: inventing better learning algorithms

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed

Why Healthcare Is AI's Hardest and Most Important Problem with Kyunghyun Cho (NYU)

Ravid Shwartz Ziv — Tue, 24 Mar 2026 05:11:26 GMT

We talk with Kyunghyun Cho, who is a Professor of Health Statistics and a Professor of Computer Science and Data Science at New York University, and a former Executive Director at Genentech, about why healthcare might be the most important and most difficult domain for AI to transform. Kyunghyun shares his vision for a future where patients own their own medical records, proposes a provocative idea for running continuous society-level clinical trials by having doctors "toss a coin" between plausible diagnoses, and explains why drug discovery's stage-wise pipeline has hit a wall that only end-to-end AI thinking can break through. We also get into GLP-1 drugs and why they're more mysterious than people realize, the brutal economics of antibiotic research, how language models trained across scientific literature and clinical data could compress 50 years of drug development into five, and what Kyunghyun would do with $10 billion (spoiler: buy a hospital network in the Midwest). We wrap up with a great discussion on the rise of professor-founded "neo-labs," why academia got spoiled during the deep learning boom, and an encouraging message for PhD students who feel lost right now.

Timeline:

(00:00) Intro and welcome

(01:25) Why healthcare is uniquely hard

(04:46) Who owns your medical records? — The case for patient-controlled data and tapping your phone at the doctor's office

(06:43) Centralized vs. decentralized healthcare — comparing Israel, Korea, and the US

(13:19) Why most existing health data isn't as useful as we think — selection bias and the lack of randomization

(16:53) The "toss a coin" proposal — continuous clinical trials through automated randomization, and the surprising connection to LLM sampling

(23:07) Drug discovery's broken pipeline — why stage-wise optimization is failing, and why we need end-to-end thinking

(28:30) Why the current system is already failing society — wearables, preventive care, and the case for urgency

(31:13) Allen's personal healthcare journey and the GLP-1 conversation

(33:13) GLP-1 deep dive — 40 years from discovery to weight loss drugs, brain receptors, and embracing uncertainty

(36:28) Why antibiotic R&D is "economic suicide" and how AI can help

(42:52) Language models in the clinic and the lab — from clinical notes to back-propagating clinical outcomes, all the way to molecular design

(48:04) Do you need domain expertise, or can you throw compute at it?

(54:30) The $10 billion question — distributed GPU clouds and a patient-in-the-loop drug discovery system

(58:28) Vertical scaling vs. horizontal scaling for healthcare AI

(1:01:06) AI regulation — who's missing from the conversation and why regulation should follow deployment

(1:06:52) Professors as founders and the "neo-lab" phenomenon — how Ilya cracked the code

(1:11:18) Can neo-labs actually ship products? Why researchers should do research

(1:13:09) Academia got spoiled — the deep learning anomaly is ending, and that's okay

(1:16:07) Closing message — why it's a great time to be a PhD student and researcher

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
Changes: trimmed