<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Information Bottleneck]]></title><description><![CDATA[AI research, compressed. Long conversations with the people building the frontier, and a working researcher's take on the ideas that survive the bottleneck - minus the hype.]]></description><link>https://www.the-information-bottleneck.com</link><image><url>https://substackcdn.com/image/fetch/$s_!nQnk!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b10938-f656-4406-aa7a-36b5e263a5dc_950x950.png</url><title>The Information Bottleneck</title><link>https://www.the-information-bottleneck.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 30 Jun 2026 07:16:34 GMT</lastBuildDate><atom:link href="https://www.the-information-bottleneck.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Information Bottleneck]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[informationbottleneck@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[informationbottleneck@substack.com]]></itunes:email><itunes:name><![CDATA[Ravid Shwartz Ziv]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ravid Shwartz Ziv]]></itunes:author><googleplay:owner><![CDATA[informationbottleneck@substack.com]]></googleplay:owner><googleplay:email><![CDATA[informationbottleneck@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ravid Shwartz Ziv]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Editing a Compressed Memory]]></title><description><![CDATA[Linear attention compresses memory into one fixed-size matrix. 
The hard part is editing it without scrambling everything else.]]></description><link>https://www.the-information-bottleneck.com/p/editing-a-compressed-memory</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/editing-a-compressed-memory</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Mon, 29 Jun 2026 18:48:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Pof3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Written with help from Claude for drafting, editing, and figures. All the mistakes are its.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pof3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pof3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Pof3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Pof3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Pof3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pof3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6344361,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://informationbottleneck.substack.com/i/204059098?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pof3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Pof3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!Pof3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Pof3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2377031-d7c8-4103-968c-6efb5d46ac77_2816x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://www.the-information-bottleneck.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Information Bottleneck! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>A Transformer remembers by keeping everything. Every token it has read stays in the KV cache, and any later token can look back at any earlier one exactly. That is why attention is so good at recall, and why its memory and compute grow with the length of the context.</p><p>Linear attention makes the opposite bet. It throws the cache away and keeps a single fixed-size matrix: a running summary that every new token updates and every query reads. Memory stops growing and decoding gets cheap. But a fixed-size summary cannot hold an unbounded number of facts cleanly, so writing something new can disturb what is already stored. Almost all the recent progress here (DeltaNet, Gated DeltaNet, KDA, and now Gated DeltaNet-2) is about making that write more surgical.</p><p>This post builds the whole thing from the ground up. You do not need to know any of these models going in, only linear algebra and a rough sense of what attention does.</p><p><strong>Shape of the argument</strong></p><ol><li><p>A fixed-size state is an <strong>associative memory</strong> built by summing key&#8211;value outer products, and reading it is content-addressed lookup.</p></li><li><p>Because it is fixed-size, overlapping keys <strong>interfere</strong>. 
That is the one limitation everything else fights.</p></li><li><p>Giving an old key a <strong>new value</strong> is the hard case. Adding leaves the stale value behind; replacing the matrix destroys every other fact; the <strong>delta rule</strong> does the surgical thing.</p></li><li><p>The delta rule looks sequential but <strong>trains in parallel</strong> as one small triangular solve per chunk.</p></li><li><p><strong>Decay</strong>, then <strong>per-channel decay (KDA)</strong>, then <strong>decoupled erase/write gates (GDN-2)</strong> are three refinements that keep that solve intact.</p></li></ol><div><hr></div><h2>Which memory we mean</h2><p>&#8220;Memory&#8221; means three different things in a language model. This post is about one of them.</p><ul><li><p><strong>The weights.</strong> The query/key/value projection matrices and the gates, learned during training and fixed afterward. Long-term knowledge, changed only by more training. Not this.</p></li><li><p><strong>The KV cache</strong> (softmax). The full list of past keys and values, so any query can look back exactly. Lossless, grows with context, reset each sequence. Linear attention removes this.</p></li><li><p><strong>The recurrent state</strong> (linear attention). One fixed-size matrix summarizing every token so far. Lossy, fixed size, reset each sequence. <strong>This is the memory we mean.</strong></p></li></ul><p>So this is <strong>in-context memory</strong>: holding the current input within a single forward pass, so token 5,000 can use what token 3 said. New prompt, empty state, nothing saved. 
It is <strong>not</strong> retrieval/RAG, not continual learning, not &#8220;remembering you across sessions&#8221;; it is the job plain attention does with its KV cache, just compressed into a fixed matrix instead of a growing list.</p><div><hr></div><h2>Where the state comes from</h2><p>Each token has a representation, and three fixed learned matrices turn it into a query, a key, and a value:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q_t = W_q\\,x_t, \\qquad k_t = W_k\\,x_t, \\qquad v_t = W_v\\,x_t.&quot;,&quot;id&quot;:&quot;FAXFPXLXAC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The key is a token&#8217;s address, what it is about; the value is the content stored there; the query is what the current token is asking for, matched against the keys to decide what to pull out. A token writes itself in as a key&#8211;value pair and later reads with a query. The vectors depend on the input, but the three projection matrices are fixed weights, shared across every position and sequence.</p><p>Ordinary attention computes each output as a softmax-weighted sum over the past:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;o_t = \\sum_{i\\le t}\\frac{\\exp(q_t^\\top k_i)}{Z_t}\\,v_i, \\qquad Z_t=\\sum_{j\\le t}\\exp(q_t^\\top k_j).\n&quot;,&quot;id&quot;:&quot;XFWNTCZHJM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The exponential is what forces the cache. The score <code>exp(query &#183; key)</code> does not split into a part that depends only on the query times a part that depends only on the key, so the weight on each value is tied to that specific key, and the normalizer sums over every past key. There is no running summary you can keep instead: you have to store every key&#8211;value pair and revisit them for each new query. 
That is the KV cache: memory grows linearly with sequence length, and producing all outputs scales quadratically with it.</p><p>Linear attention drops the softmax and uses a score that factorizes, in the simplest case just the dot product of query and key. Once it factorizes, the sum rearranges:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;o_t = \\sum_{i\\le t}(k_i^\\top q_t)\\,v_i = \\Big(\\sum_{i\\le t} k_i v_i^\\top\\Big)^{\\!\\top} q_t = S_t^\\top q_t, \\qquad S_t=\\sum_{i\\le t}k_i v_i^\\top.&quot;,&quot;id&quot;:&quot;QSWNCCQEUQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>All of history collapses into one matrix of fixed size (key-dimension by value-dimension), and the query reads it in a single multiply. Memory no longer grows with context and the per-token cost is constant. This fixed state is the object the rest of this post is about; it exists precisely because the softmax is gone.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r7d9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r7d9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 424w, https://substackcdn.com/image/fetch/$s_!r7d9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 848w, 
https://substackcdn.com/image/fetch/$s_!r7d9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 1272w, https://substackcdn.com/image/fetch/$s_!r7d9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r7d9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png" width="1434" height="871" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e915411-489c-4558-b31e-c8444281c469_1434x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1434,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65483,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://informationbottleneck.substack.com/i/204059098?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r7d9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 424w, 
https://substackcdn.com/image/fetch/$s_!r7d9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 848w, https://substackcdn.com/image/fetch/$s_!r7d9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 1272w, https://substackcdn.com/image/fetch/$s_!r7d9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e915411-489c-4558-b31e-c8444281c469_1434x871.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Two ways to remember a sequence. Softmax keeps a growing KV cache, a key&#8211;value row per token; linear attention keeps one fixed-size matrix that every token writes into. The cache scales with length; the matrix does not.</figcaption></figure></div><p></p><p>That fixed size is the appeal and the problem at once. Packing an unbounded history into one matrix is cheap, but it means many facts share the same finite space. The next section shows what that does to a read.</p><div><hr></div><h2>Why a fixed-size memory interferes</h2><p>We have the fixed-size state. Before writing into it, look at what reading it gives you. Reading is applying a query to the memory:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S^\\top q = \\sum_i v_i\\,(k_i^\\top q).\n&quot;,&quot;id&quot;:&quot;PFMSZZJGTE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each stored value is weighted by how aligned its key is with the query, the dot product of the two. That is content-addressed recall: values whose keys match what you asked for, weighted by the match.</p><p>To expose the problem, take the cleanest possible query, one that exactly equals a key you already stored. This is the case that <em>should</em> return its value perfectly, so any mess is the memory&#8217;s fault, not a mismatched query. Splitting off that term:</p><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S^\\top k_j = \\underbrace{v_j\\,(k_j^\\top k_j)}_{\\text{what you want}} \\;+\\; \\underbrace{\\sum_{i\\ne j} v_i\\,(k_i^\\top k_j)}_{\\text{leakage from every other fact}}.\n&quot;,&quot;id&quot;:&quot;ITHKSEBMIO&quot;}" data-component-name="LatexBlockToDOM"></div><p>If the stored keys were orthonormal, every cross term would be zero and the read would be clean. They are not. 
Each nonzero overlap leaks a fraction of some other value into the answer. And here is the structural reason they cannot all be orthogonal: the state is a single matrix, so the key space has only as many dimensions as the key vector is wide, and a space of that dimension holds at most that many mutually orthogonal directions. Store more associations than that and some keys <em>must</em> share directions; even below the limit, random unit keys have small but nonzero overlaps that add up.</p><p>As the context carries more associations, the term you want stays about the same size while the leakage is a sum over everything else, so it grows. Signal-to-noise falls with context length: a long document forces many distinct facts to share one fixed box and they smear together. That is why this whole family struggles on long, many-needle retrieval, and why the improvements below all aim at that pressure point.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RfsW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RfsW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 424w, https://substackcdn.com/image/fetch/$s_!RfsW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 848w, 
https://substackcdn.com/image/fetch/$s_!RfsW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 1272w, https://substackcdn.com/image/fetch/$s_!RfsW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RfsW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png" width="1434" height="779" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3155d38-aad2-469e-be97-daad80ddedda_1434x779.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:779,&quot;width&quot;:1434,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://informationbottleneck.substack.com/i/204059098?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RfsW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 424w, 
https://substackcdn.com/image/fetch/$s_!RfsW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 848w, https://substackcdn.com/image/fetch/$s_!RfsW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 1272w, https://substackcdn.com/image/fetch/$s_!RfsW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3155d38-aad2-469e-be97-daad80ddedda_1434x779.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Reading a stored key returns its value plus a small leak from every other key. The wanted term stays the same size while the leakage is a sum over everything else, so it grows as more facts share the fixed state.</figcaption></figure></div><h3>Why softmax doesn&#8217;t have this problem</h3><p>The leakage is <em>not</em> caused by folding the sum into the state matrix. The summed form and the matrix form compute the same output; folding only changes the size and the cost, not the value. The leakage is already there in the raw dot-product score.</p><p>Softmax runs those same dot products through an exponential and normalizes. The exponential sharpens them: when the matching score is clearly the largest, the weight on the matching key saturates near one and the weights on the mismatched keys are crushed toward zero, so the wrong values effectively drop out of the read even when the keys overlap. Same overlaps, clean answer.</p><p>But that is exactly the property that cannot be summarized. The exponential of a dot product does not split into a query part times a key part, so there is nothing to precompute: you are forced to keep every key and recompute the exponential against each one, which is the growing cache. So it is an either/or: a sharp score reads cleanly but cannot be folded into a fixed state, while a foldable score gives the fixed state but leaks. <strong>Interference is not the cost of compressing; it is the cost of using a score weak enough to be compressible.</strong></p><div><hr></div><h2>Updating a value when a key comes back</h2><p>As the model reads a sequence, a later token sometimes produces a key close to one an earlier token already wrote, but carrying a different value. The state already holds a binding in that direction, and the new value should take its place. 
This is the update case, and it is the one ordinary outer-product memory gets wrong.</p><p>For example, a passage sets <code>x = 5</code> and later sets <code>x = 7</code>. Both tokens produce nearly the same key (the direction standing for &#8220;the value of x&#8221;), but with different values. When a later token reads <code>x</code>, the answer should be 7. Plain addition cannot give that: it never removed the old binding, so the slot holds 5 and 7 at once and the read returns a blend of the two. The same shape shows up whenever a key recurs with a new value: an entity whose state changes (&#8220;Alice is in Paris&#8230; now Tokyo&#8221;), a correction (&#8220;blue&#8230; actually green&#8221;), a revised form field.</p><p>Two clarifications, since &#8220;update&#8221; can mislead. The prompt itself is fixed; the forward pass only reads it left to right, and &#8220;update&#8221; means a later position&#8217;s binding should win over an earlier one. &#8220;x = 5&#8221; stays in the text; it just should not win the read. And keys are not matched by name: two tokens are &#8220;the same key&#8221; when their key vectors point in roughly the same direction, so their writes land on the same spot in the state. A repeated mention produces a nearby key, the later write hits that slot, and the read afterward should reflect the new value.</p><p>So every write is one of two cases. <strong>Add:</strong> the key points somewhere new, a fresh fact, which is most tokens; plain accumulation is fine, and that is what vanilla linear attention does. <strong>Overwrite:</strong> the key lands on a direction already in the state, and the slot has to be updated to the new value, not stacked on top of the old one.</p><h3>Why not just replace the whole matrix?</h3><p>Because the state is shared by every association at once. 
Three ways to write an update to one key, into a memory that also holds a second fact:</p><ul><li><p><strong>Replace</strong> the whole matrix with the new key&#8211;value outer product: fixes the target key perfectly and <em>deletes everyone else</em>. Read the second key afterward and you get near zero. You wanted to change one slot and you erased the notebook.</p></li><li><p><strong>Add</strong> the new outer product: keeps the second fact, but leaves the old binding in place, so reading the target key returns old-plus-new, the stale value smeared into the fresh one.</p></li><li><p><strong>Delta:</strong> read what the key currently points to, subtract just that, then write the new value. Only the target slot changes; the other fact is untouched.</p></li></ul><p>Add keeps the other fact but smears the target. Replace fixes the target but wipes the other fact. Only the third (read, subtract, write) gets both right. That third option is the delta rule.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WpEo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WpEo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 424w, https://substackcdn.com/image/fetch/$s_!WpEo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 848w, 
https://substackcdn.com/image/fetch/$s_!WpEo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 1272w, https://substackcdn.com/image/fetch/$s_!WpEo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WpEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png" width="1456" height="727" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:727,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60184,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://informationbottleneck.substack.com/i/204059098?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WpEo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 424w, 
https://substackcdn.com/image/fetch/$s_!WpEo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 848w, https://substackcdn.com/image/fetch/$s_!WpEo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 1272w, https://substackcdn.com/image/fetch/$s_!WpEo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdb2f377-f16e-4696-8a9c-a730daaa0a12_1528x763.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Updating one key three ways. Add leaves the old value smeared into the new one; replacing the whole matrix fixes the target but destroys every other fact; the delta rule edits only the target slot and leaves the rest intact.</figcaption></figure></div><div><hr></div><h2>The delta rule</h2><p>The update that does this is the <strong>delta rule</strong> (Widrow &amp; Hoff, 1960), used for linear attention in DeltaNet (Yang et al., 2024). It writes the new value relative to what is already stored, not absolutely. First read what the memory currently returns for the key:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{old value} \\;=\\; S_{t-1}^\\top k_t.\n&quot;,&quot;id&quot;:&quot;NEXIJNPVZU&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This is whatever sits in that key&#8217;s direction right now. We never have to know in advance whether the key was used before; we just read it back. Then move the slot from that old value toward the target, by a fraction &#946; (the write strength):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_t = S_{t-1} + \\beta_t\\,k_t\\big(\\,\\underbrace{v_t - S_{t-1}^\\top k_t}_{\\text{new} \\,-\\, \\text{old}}\\,\\big)^{\\!\\top}, \\qquad \\beta_t\\in[0,1].&quot;,&quot;id&quot;:&quot;AURBMXMAJY&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The bracket is the gap between the new value and the old one, and adding it back pushes the value stored at that key toward the target. One update covers both cases with no branching: if the key points somewhere new, the read is about zero, the gap is just the new value, and it reduces to a plain add; if the key lands on a direction that already holds a value, the read returns that old value, and the update subtracts it and writes the new value in its place. 
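</p><p>To make the three updates concrete, here is a minimal sketch in plain Python. It assumes two orthonormal keys and made-up values, and the helper names (<code>outer</code>, <code>read</code>, <code>delta_update</code>) are illustrative rather than taken from any library. It builds a two-fact memory and applies replace, add, and delta to the same target key:</p><pre><code>
```python
"""Replace vs. add vs. delta on a tiny two-fact memory (illustrative sketch)."""

Vector = list[float]
Matrix = list[list[float]]


def outer(k: Vector, v: Vector) -> Matrix:
    """Outer product k v^T: one key-value binding stored as a matrix."""
    return [[ki * vj for vj in v] for ki in k]


def read(S: Matrix, k: Vector) -> Vector:
    """Read the memory at key k, i.e. compute S^T k."""
    return [sum(S[i][j] * k[i] for i in range(len(k))) for j in range(len(S[0]))]


def add(A: Matrix, B: Matrix) -> Matrix:
    """Elementwise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def delta_update(S: Matrix, k: Vector, v: Vector, beta: float) -> Matrix:
    """S + beta * k (v - S^T k)^T: move the slot at k toward v."""
    old = read(S, k)
    gap = [beta * (vn - vo) for vn, vo in zip(v, old)]
    return add(S, outer(k, gap))


# Assumption: two orthonormal keys. Real keys overlap, which adds interference.
k_target, k_other = [1.0, 0.0], [0.0, 1.0]
v_old, v_other, v_new = [1.0, 2.0], [5.0, 6.0], [9.0, 9.0]

memory = add(outer(k_target, v_old), outer(k_other, v_other))

replaced = outer(k_target, v_new)                         # overwrite the whole matrix
added = add(memory, outer(k_target, v_new))               # pile the new binding on top
edited = delta_update(memory, k_target, v_new, beta=1.0)  # read, subtract, write

for name, S in [("replace", replaced), ("add", added), ("delta", edited)]:
    print(name, "target:", read(S, k_target), "other:", read(S, k_other))
```
</code></pre><p>With these numbers, replace reads back zeros for the second key, add returns old-plus-new for the target, and only the delta update returns the new value for the target while leaving the second fact as it was.</p><p>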
The memory tells the rule which case it is in.</p><p>Multiplying the correction out shows what it does to the whole state:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_t = \\big(I - \\beta_t k_t k_t^\\top\\big)S_{t-1} + \\beta_t k_t v_t^\\top.\n&quot;,&quot;id&quot;:&quot;WCXOPIWIYR&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The second term writes the new value along the key. The first term removes a &#946; fraction of whatever the state held along that key (for a unit-norm key), and only along that key: the projection onto the key direction leaves everything orthogonal to it untouched. That is exactly why, in the figure above, replacing the whole matrix wiped the bystander but the delta update did not; it only edits that one key&#8217;s line of the state.</p><h3>Why &#946; is not just 1</h3><p>A write strength of 1 is a hard overwrite: erase the old binding completely, write the new value. So why not use it everywhere? &#946; is produced per token by the model, and two things argue against pinning it to 1. Real keys are not exactly orthogonal, so erasing hard along one key also disturbs neighbors that partly share its direction, and a smaller &#946; makes a gentler edit with less collateral damage. And not every write should fully replace: sometimes the right move is to nudge a value, accumulate evidence, or write weakly under uncertainty. So &#946; between 0 and 1 is a dial: 1 overwrites, 0 leaves the slot alone, in between is a partial move. (In the online-learning view it is a per-step learning rate, and a rate of 1 everywhere is rarely what you want.)</p><p>One caveat: the overwrite is clean only when the key is orthogonal to the others. 
When keys overlap, editing along one drags on whatever shares its direction, the same interference from before, now showing up in the write.</p><div><hr></div><h2>Training it in parallel</h2><p>Training needs every output over the whole sequence at once, then a gradient. Plain linear attention gives them cheaply because the state is a running sum, so the outputs collapse into two matrix multiplies (with a causal mask zeroing the future):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;O = (QK^\\top \\odot M)\\,V,\n&quot;,&quot;id&quot;:&quot;TESUIUOIGO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Why this is fast: it is all dense matmuls, and a GPU runs a matmul as thousands of multiply-adds in parallel on its tensor cores, every output position at the same time. Nothing waits for anything else.</p><p>The delta rule breaks this. Its erase factor (the one from the operator form above) makes each state genuinely depend on the previous one, so you cannot write the answer as one sum of independent terms. Done literally you process tokens one at a time, each a tiny rank-one update that uses a sliver of the GPU while the rest sits idle.</p><p>DeltaNet&#8217;s contribution (Yang et al., 2024) was to recover the matmul form by working in <strong>chunks</strong>. A chunk is a contiguous block of C tokens; a length-L sequence is split into L/C of them. 
The expensive work happens inside a chunk, all as matmuls, and only a small summary state is passed from one chunk to the next.</p><h3>The trick: solve for the values that were actually written</h3><p>Every step adds a rank-one term whose left factor is a key, so the state is always the start state plus one such term per token:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_t = S_0 + \\sum_{s\\le t} k_s\\,u_s^\\top.\n&quot;,&quot;id&quot;:&quot;BQHVWWLPWR&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The written value here is not the raw value, but the correction from the delta rule (target minus old value, scaled by &#946;). The keys are known; these written values are the unknowns. The point is that if we can get all of them in a chunk at once, with a single matrix solve instead of a token-by-token walk, the whole chunk becomes parallel matmuls. So we solve for them jointly.</p><p>The written value at each step depends on the current read, and that read expands into known keys and earlier written values:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_{t-1}^\\top k_t = S_0^\\top k_t + \\sum_{s\\lt t}(k_s^\\top k_t)\\,u_s.\n&quot;,&quot;id&quot;:&quot;NTXQUWUYCY&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>In words: reading a key against the state-so-far is the start-state read, plus every earlier written value weighted by how much its key overlaps the current one. Substituting gives a relation among the written values alone:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;u_t = \\beta_t\\big(v_t - S_0^\\top k_t\\big) - \\beta_t\\!\\sum_{s\\lt t}(k_s^\\top k_t)\\,u_s.\n&quot;,&quot;id&quot;:&quot;YFDJFXHQTY&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Each written value depends only on earlier ones, which makes this a triangular system. 
Stack the written values as the rows of a matrix U, collect the &#946;-scaled overlaps between each key and the keys before it into a strictly lower-triangular matrix T, and the whole set of equations becomes a single solve:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(I + T)\\,U = \\mathrm{diag}(\\beta)\\,(V - K S_0), \\qquad U = (I+T)^{-1}\\mathrm{diag}(\\beta)(V - K S_0).\n&quot;,&quot;id&quot;:&quot;OXRWZQPHTQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Because the matrix being inverted is unit lower-triangular, the solve is one forward substitution on a small C-by-C matrix. Everything else is dense matmuls: build the overlap matrix from pairwise key dot products, then form the carried state and the outputs:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_C = S_0 + K^\\top U, \\qquad O = Q S_0 + \\mathrm{tril}(Q K^\\top)\\,U.\n&quot;,&quot;id&quot;:&quot;ZAFCZJGMHD&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The sequential token loop is gone, replaced by matmuls plus one small triangular solve. 
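</p><p>The equations above can be checked numerically. The following is a sketch in plain Python, not the DeltaNet kernel: it assumes a single chunk, toy sizes, random inputs, and hypothetical helper names (<code>sequential</code>, <code>chunked</code>), and it takes T to hold &#946;-scaled key overlaps below the diagonal, as the recurrence for the written values implies. It runs the token-by-token delta rule and the chunk solve on the same inputs and compares the carried state and the outputs:</p><pre><code>
```python
"""Check that the chunked solve reproduces the token-by-token delta rule (sketch)."""
import random

Matrix = list[list[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Plain Python matrix product; stands in for a GPU matmul."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def add(A: Matrix, B: Matrix) -> Matrix:
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sequential(K: Matrix, V: Matrix, Q: Matrix, beta: list[float], S0: Matrix):
    """Reference: one rank-one delta update per token, then read with the query."""
    S, outputs = [row[:] for row in S0], []
    for k, v, q, b in zip(K, V, Q, beta):
        old = matmul([k], S)[0]                          # S^T k
        u = [b * (vj - oj) for vj, oj in zip(v, old)]    # written value
        S = add(S, [[ki * uj for uj in u] for ki in k])  # S + k u^T
        outputs.append(matmul([q], S)[0])                # S_t^T q_t
    return S, outputs


def chunked(K: Matrix, V: Matrix, Q: Matrix, beta: list[float], S0: Matrix):
    """Solve (I + T) U = diag(beta) (V - K S0), then form S_C and O with matmuls."""
    C = len(K)
    KS0, KKt = matmul(K, S0), matmul(K, transpose(K))
    U: Matrix = []
    for t in range(C):                                   # forward substitution
        row = [beta[t] * (v - r) for v, r in zip(V[t], KS0[t])]
        for s in range(t):                               # T[t][s] = beta_t * (k_t . k_s)
            row = [x - beta[t] * KKt[t][s] * us for x, us in zip(row, U[s])]
        U.append(row)
    QKt = matmul(Q, transpose(K))
    causal = [[QKt[t][s] if t >= s else 0.0 for s in range(C)] for t in range(C)]
    S_C = add(S0, matmul(transpose(K), U))
    O = add(matmul(Q, S0), matmul(causal, U))
    return S_C, O


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, d_k, d_v = 4, 3, 2                                    # toy sizes, chosen for illustration
K, Q = random_matrix(rng, C, d_k), random_matrix(rng, C, d_k)
V, S0 = random_matrix(rng, C, d_v), random_matrix(rng, d_k, d_v)
beta = [rng.random() for _ in range(C)]

S_seq, O_seq = sequential(K, V, Q, beta, S0)
S_chk, O_chk = chunked(K, V, Q, beta, S0)
max_gap = max(
    abs(a - b)
    for A, B in [(S_seq, S_chk), (O_seq, O_chk)]
    for ra, rb in zip(A, B)
    for a, b in zip(ra, rb)
)
print("largest difference:", max_gap)
```
</code></pre><p>The two paths should agree up to floating-point rounding. The only loop left in <code>chunked</code> is the forward substitution over the C positions of the chunk; a real kernel would also run the surrounding products as batched matmuls rather than Python loops.</p><p>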
Writing a product of rank-one factors as a single low-rank update this way is a classical move from numerical linear algebra, the WY representation (Bischof &amp; Van Loan, 1985) and its UT-transform variant (Joffrain et al., 2006); DeltaNet borrows it to collapse the chunk into matrix operations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ft4g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ft4g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 424w, https://substackcdn.com/image/fetch/$s_!ft4g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 848w, https://substackcdn.com/image/fetch/$s_!ft4g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 1272w, https://substackcdn.com/image/fetch/$s_!ft4g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ft4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png" width="1456" height="657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75481,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://informationbottleneck.substack.com/i/204059098?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ft4g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 424w, https://substackcdn.com/image/fetch/$s_!ft4g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 848w, https://substackcdn.com/image/fetch/$s_!ft4g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 1272w, https://substackcdn.com/image/fetch/$s_!ft4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda1be37d-7d6b-4ba5-92ba-a64a2abc23eb_1589x717.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Training a chunk in parallel. Inside a chunk everything is dense matmuls plus one small triangular solve; only the carried state passes to the next chunk, the single sequential step.</figcaption></figure></div><p></p><h3>Why the chunk size is small</h3><p>The chunk size sets how often the state is handed off: there are L/C chunks, so that many sequential state updates. The two extremes make this concrete. A chunk of one token is the original fully sequential recurrence. A single chunk covering the whole sequence is one handoff, done in one parallel block. (This is the opposite of what it might sound like: a bigger chunk means fewer, larger steps, not more.)</p><p>So why not use one giant chunk and be fully parallel? 
Because the chunk builds and solves a C-by-C matrix, its cost and memory grow quadratically in the chunk size. At the full length you are back to the quadratic cost of full attention, and the matrix no longer fits in the fast on-chip memory the matmul engine reads from. Too small, and you pay too many sequential steps and underfill each matmul. In practice the kernels use a chunk size of 64.</p><div><hr></div><h2>Adding decay: Gated DeltaNet</h2><p>Everything up to here is DeltaNet: a fixed-size associative memory, edited by the delta rule, trained in parallel. The next three sections are refinements, each adding expressive power with a small change that leaves the chunk algorithm intact.</p><p>The delta rule overwrites one slot at a time but cannot let old context fade on its own. Gated DeltaNet (Yang, Kautz &amp; Hatamizadeh, 2025, arXiv:2412.06464) multiplies the state by a scalar decay before each edit:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_t = \\alpha_t\\big(I - \\beta_t k_t k_t^\\top\\big)S_{t-1} + \\beta_t k_t v_t^\\top.\n&quot;,&quot;id&quot;:&quot;MQTPTBOBHV&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>If you track the cumulative product of the decays, an earlier write contributes to a later read scaled by the decay that has accumulated in between. In the chunk algorithm this is just a per-row reweighting of the same matrices plus an extra factor in the causal mask; the triangular solve is unchanged. Decay is close to free to add. 
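</p><p>A small sketch of the gated recurrence, again in plain Python and under toy assumptions (orthonormal keys, hand-picked decays, and a hypothetical helper <code>gated_delta_step</code>), shows the cumulative-decay effect: a fact written once and then left alone is read back scaled by the product of the decays applied since.</p><pre><code>
```python
"""Scalar decay before each delta edit, as in the gated recurrence (sketch)."""
import math

Vector = list[float]
Matrix = list[list[float]]


def gated_delta_step(S: Matrix, k: Vector, v: Vector, beta: float, alpha: float) -> Matrix:
    """S_t = alpha * (I - beta k k^T) S_{t-1} + beta k v^T, computed entry by entry."""
    old = [sum(S[i][j] * k[i] for i in range(len(k))) for j in range(len(v))]  # S^T k
    return [
        [alpha * (S[i][j] - beta * k[i] * old[j]) + beta * k[i] * v[j] for j in range(len(v))]
        for i in range(len(k))
    ]


# Assumption: orthonormal keys, so the later writes never touch the first fact directly.
k_a, k_b = [1.0, 0.0], [0.0, 1.0]
v_a = [1.0, 2.0]

S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k_a, v_a, beta=1.0, alpha=1.0)   # write the first fact

decays = [0.9, 0.8, 0.5]                                 # made-up per-step decays
for step, alpha in enumerate(decays):
    # Keep rewriting the other key; each step decays the whole state first.
    S = gated_delta_step(S, k_b, [float(step), 0.0], beta=1.0, alpha=alpha)

read_a = [sum(S[i][j] * k_a[i] for i in range(2)) for j in range(2)]
expected = [math.prod(decays) * x for x in v_a]
print("read of first fact:", read_a)
print("value times cumulative decay:", expected)
```
</code></pre><p>The read of the first fact matches its value times the product of the three decays, even though no later write addressed that key. Every channel of the state fades by the same scalar factor.</p><p>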
What it cannot do is forget different features at different rates, since it is one number.</p><div><hr></div><h2>Decay per channel: KDA</h2><p>KDA, the linear-attention layer in Kimi Linear (Kimi Team, 2025, arXiv:2510.26692), replaces that single decay with a per-channel decay vector, a different forget rate for every key channel:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;S_t = \\big(I - \\beta_t k_t k_t^\\top\\big)\\,D_t\\,S_{t-1} + \\beta_t k_t v_t^\\top.\n&quot;,&quot;id&quot;:&quot;WGKHPYPLLX&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Now every channel is scaled differently at every step, which looks like it should break the chunk form. It does not, because of a change of variables: factor the cumulative per-channel decay out of the state, and it cancels from the recurrence, leaving a plain delta product in reweighted key and erase factors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\bar k_r = \\gamma_r^{-1}\\!\\odot k_r, \\qquad \\bar e_r = \\gamma_r \\odot (\\beta_r k_r).\n&quot;,&quot;id&quot;:&quot;OFKOLCQLRS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>After this substitution the chunk equations have the same shape as before; only the entries carry the decay factors. KDA buys richer forgetting at no structural cost.</p><p>The per-channel rates are not hand-set hyperparameters; they are learned and data-dependent, produced from each token by a small projection (the Gated DeltaNet parameterization, a softplus of a learned linear map passed through an exponential). A per-head term and a per-channel bias set each channel&#8217;s baseline forget rate, and the per-token projection pushes that rate up or down, so the model learns both the typical decay profile and how to modulate it on the fly. 
The active edit, though, is still a single write-strength scalar, which scales both the erase and the write at once.</p><div><hr></div><h2>Splitting the edit: Gated DeltaNet-2</h2><p>Erasing acts on the key side: which coordinates of the old read to remove. Writing acts on the value side: which coordinates of the new value to keep. These are different axes of the state, so GDN-2 (Hatamizadeh, Choi &amp; Kautz, 2026, arXiv:2605.22791) gives each its own channel-wise gate, an erase gate on the key and a write gate on the value:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;e_t = b_t \\odot k_t, \\quad z_t = w_t \\odot v_t, \\qquad S_t = \\big(I - k_t e_t^\\top\\big)\\,D_t\\,S_{t-1} + k_t z_t^\\top.\n&quot;,&quot;id&quot;:&quot;NOTBKSFLKZ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Compared with KDA, the write direction is unchanged (the left factor is still the key), but the read it subtracts is now channel-selected by the erase gate, and the value it writes is channel-selected by the write gate.</p><h3>Forward: same machine</h3><p>Run the same change of variables, now folding the erase gate into the reweighted factor, and the recurrence is again a plain (now asymmetric) delta product. The chunk pipeline keeps the same shape: the same overlap matrix, the same triangular inverse, the same state and output equations. The only difference is what fills them: the erase gate enters the key-side rows, the write gate the value-side rows, and the overlap matrix is now built from an asymmetric pair rather than a symmetric one.</p><h3>Backward: one real difference</h3><p>Training propagates a loss gradient back through the chunk. Write the solve as the triangular inverse applied to the written values. 
Backprop needs the gradient with respect to that inverse, which accumulates as a product of the incoming gradient with the written values:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathrm dA = \\mathrm dU\\,Z^\\top, \\qquad (\\mathrm dU\\,Z^\\top)_{rs} = \\langle \\mathrm du_r,\\; z_s\\rangle.\n&quot;,&quot;id&quot;:&quot;QYMZUIXFCR&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>The whole question is whether the gate can be pulled out of that inner product. In KDA the written value is a scalar times the value, so it slides straight out, and you can compute the gate-free products once as a matmul and scale afterward:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\langle \\mathrm du_r,\\; \\beta_s v_s\\rangle = \\beta_s\\,\\langle \\mathrm du_r,\\; v_s\\rangle.\n&quot;,&quot;id&quot;:&quot;MCSBAVCNBQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>In GDN-2 the written value is a per-channel product, so the gate sits <em>inside</em> the sum over channels and there is nothing to pull out:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\langle \\mathrm du_r,\\; w_s \\odot v_s\\rangle = \\sum_c \\mathrm du_{r,c}\\,w_{s,c}\\,v_{s,c}.\n&quot;,&quot;id&quot;:&quot;CLQUUXNJTX&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>No single number multiplies the whole inner product; the gate reweights each channel before it is summed, so no row or column scaling recovers it from the gate-free version. The erase side has the same issue. The gate therefore has to be folded into the matmul itself, not applied as a scaling after. 
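</p><p>A few made-up numbers make the difference visible. This sketch uses plain Python with hypothetical names and two gradient rows against one value row; it is only meant to illustrate the algebra, not any actual kernel. A scalar gate can be applied after the gate-free inner products, while a channel-wise gate implies a different effective scale for each gradient row:</p><pre><code>
```python
"""A scalar gate factors out of the inner product; a channel-wise gate does not (sketch)."""
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def hadamard(a: list[float], b: list[float]) -> list[float]:
    return [x * y for x, y in zip(a, b)]


du_rows = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]   # two incoming gradient rows (made-up numbers)
v = [2.0, 1.0, 4.0]                             # one value row
beta = 0.5                                      # scalar write strength
w = [1.0, 0.5, 0.25]                            # channel-wise write gate

# Scalar gate: compute the gate-free products once, scale afterward.
gate_free = [dot(du, v) for du in du_rows]
scalar_gated = [dot(du, [beta * x for x in v]) for du in du_rows]
assert all(math.isclose(g, beta * f) for g, f in zip(scalar_gated, gate_free))

# Channel-wise gate: the gate sits inside the sum over channels.
channel_gated = [dot(du, hadamard(w, v)) for du in du_rows]

# The scale that would map gate_free to channel_gated differs per gradient row,
# so no single factor attached to this value row can reproduce it.
implied_scales = [g / f for g, f in zip(channel_gated, gate_free)]
print("gate-free:", gate_free)
print("channel-gated:", channel_gated)
print("implied scales:", implied_scales)
```
</code></pre><p>The implied scales come out different for the two rows, which is the point: the gated product has to be computed with the gate already multiplied into one operand, here <code>hadamard(w, v)</code>.</p><p>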
The forward pass is essentially KDA&#8217;s; the backward kernel is the part that must be rewritten to carry both gates inside its accumulation, and that gate-aware backward is the real implementation cost of the split.</p><p>Setting both gates to the same scalar recovers KDA exactly; tying the decay to a scalar as well gives Gated DeltaNet; dropping the decay gives the delta rule. Each model is the next with some gate held to a scalar.</p><div><hr></div><h2>Where this nets out</h2><p>Step back and it is all one idea, taken in stages. Linear attention compresses an unbounded history into a fixed matrix, fast but lossy. The delta rule edits that matrix surgically instead of piling onto it. The chunked triangular solve makes the edit trainable at scale. Decay, per-channel decay, and decoupled erase/write gates each give the edit finer control over what to keep and what to remove, without giving up the fixed-size state or the parallel training. None of them recover the softmax cache's perfect recall; they make the compression smarter.</p><p>That is also where the measured gains land. In the Gated DeltaNet-2 paper the improvement over KDA is modest on language modeling but clear on long-context, multi-key retrieval, the regime where many associations are forced to share one fixed state and interference is worst. The ablation is honest about the split: a channel-wise erase gate with a scalar write recovers most of the gain, so the erase side is doing more work than the write side.</p><p>This is also why pure linear attention rarely replaces softmax outright. Exact recall is often worth the cost of the growing cache, so most production models stay softmax, and these layers show up where memory and throughput dominate: long context, high-throughput serving, constrained hardware. The common deployment is hybrid: interleave a few full or sliding-window attention layers for exact recall with many cheap linear layers. Recent open-weight models make this concrete. 
Qwen3-Next and Kimi Linear both stack three linear blocks (a Gated DeltaNet variant) per full-attention block, a 3:1 ratio, and MiniMax-01 mixes lightning (linear) and softmax attention in a similar pattern.</p><div><hr></div><h2><em>Sources: </em></h2><ul><li><p><em>DeltaNet chunkwise algorithm (Yang, Wang, Zhang, Shen, Kim, NeurIPS 2024)</em></p></li><li><p><em>Gated DeltaNet (Yang, Kautz, Hatamizadeh, ICLR 2025, arXiv:2412.06464); </em></p></li><li><p><em>KDA / Kimi Linear (Kimi Team, 2025, arXiv:2510.26692); </em></p></li><li><p><em>Gated DeltaNet-2 (Hatamizadeh, Choi, Kautz, 2026, arXiv:2605.22791).</em></p></li></ul>]]></content:encoded></item><item><title><![CDATA[AI for Science with Qichao Hu (Molecular Universe / SES AI)]]></title><description><![CDATA[Most AI-for-science companies are selling shovels. 
Qichao Hu wants the gold.]]></description><link>https://www.the-information-bottleneck.com/p/ai-for-science-with-qichao-hu-molecular</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/ai-for-science-with-qichao-hu-molecular</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Mon, 29 Jun 2026 04:32:25 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/204060433/1c7fb411bcb21f037040a5eb9381f441.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-a1U__y9sV5U" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;a1U__y9sV5U&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/a1U__y9sV5U?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In this episode, we talk with Qichao, the founder and CEO of Molecular Universe, the AI-for-science platform that grew out of SES AI, a high-energy-density battery developer he&#8217;s run for fourteen years. His core distinction is that companies from the AI world build tools, such as foundation models that predict properties, while companies from the science world care about the final product, such as the new battery or material that actually ships. Molecular Universe sits firmly on the science side, and the difference shows up everywhere from what they publish to what they refuse to.</p><p>We get into the actual workflow of materials discovery and where AI compresses it. A single trial in a traditional lab can take a year with maybe a 40% success rate; the goal is to run a thousand candidates in parallel and turn that year into a week. 
Qichao walks through improving low-temperature fast-charging for EV batteries: from hypothesis generation through molecule-, material-, and device-level property prediction, down to autonomous labs that synthesize and test the top candidates without a human touching a pipette.</p><p>The hardest problem, it turns out, isn&#8217;t predicting molecular properties or measuring device performance; it&#8217;s the black box connecting the two. In batteries, that&#8217;s the solid-electrolyte interphase, which the field has been hand-waving about since the seventies. And what stands in the way of cracking it isn&#8217;t the lack of a clever training trick but the data: companies sitting on twenty years of records are finding them too messy, incomplete, and poorly labeled to train on, and are having to start collecting from scratch with new protocols and robots.</p><div><hr></div><p><strong><span>Timeline</span></strong></p><ul><li><p><strong><span>00:13</span></strong> &#8212; Intro and welcome</p></li><li><p><strong><span>01:19</span></strong> &#8212; Shovel vs. gold</p></li><li><p><strong><span>05:18</span></strong> &#8212; Why the world&#8217;s smartest scientist doesn&#8217;t automatically give you a better battery</p></li><li><p><strong><span>07:25</span></strong> &#8212; The discovery workflow</p></li><li><p><strong><span>09:37</span></strong> &#8212; Exploration vs. 
exploitation</p></li><li><p><strong><span>11:54</span></strong> &#8212; Safety and filtering: screening novel molecules against banned and toxic-substance lists</p></li><li><p><strong><span>17:55</span></strong> &#8212; How hypotheses get generated, and where frontier LLMs help</p></li><li><p><strong><span>20:29</span></strong> &#8212; From hypothesis to ~400 formulations: property prediction, ranking, and handing off to autonomous labs</p></li><li><p><strong><span>26:37</span></strong> &#8212; &#8220;A foundation model for everything&#8221; &#8212; and the black box between molecular properties and device performance</p></li><li><p><strong><span>30:01</span></strong> &#8212; World models and physics</p></li><li><p><strong><span>33:09</span></strong> &#8212; The great unknown in batteries</p></li><li><p><strong><span>37:08</span></strong> &#8212; Simulation vs. reality: calibrating massive simulated datasets with a sliver of experimental data</p></li><li><p><strong><span>41:47</span></strong> &#8212; Lab robotics: how fast the hardware has caught up, and what a floor of autonomous labs looks like</p></li><li><p><strong><span>43:50</span></strong> &#8212; The real bottlenecks</p></li><li><p><strong><span>50:21</span></strong> &#8212; Pre-training from scratch vs. 
post-training LLMs, and why training tricks haven&#8217;t reduced the need for good data</p></li><li><p><strong><span>52:42</span></strong> &#8212; Evaluation</p></li><li><p><strong><span>55:42</span></strong> &#8212; Publish the B+ model, keep the A model</p></li><li><p><strong><span>58:05</span></strong> &#8212; Five years out</p></li><li><p><strong><span>1:00:37</span></strong> &#8212; Closing thoughts and wrap</p></li></ul><div><hr></div><p>Music:</p><ul><li><p>&#8220;Kid Kodi&#8221; - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Infrastructure for AI at Scale - With Benny Chen (Fireworks AI)]]></title><description><![CDATA[We talk a lot on this show about RL, agents, and the move between pre-training and post-training, but not enough about the layer everything actually runs on.]]></description><link>https://www.the-information-bottleneck.com/p/infrastructure-for-ai-at-scale-with-434</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/infrastructure-for-ai-at-scale-with-434</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Wed, 24 Jun 2026 04:03:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203348460/56341ba56719a283ef06930e93981f87.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We talk a lot on this show about RL, agents, and the move between pre-training and post-training, but not enough about the layer everything actually runs on. 
Benny Chen, co-founder of Fireworks AI, one of the largest inference platforms around, walks us through what it takes to serve models at scale: sourcing GPUs, writing the kernels, the runtime, and the routing layer that lets a customer hit one endpoint and forget the rest.</p><p>We talk about why the real bottleneck is power, not chips, and why that favors Nvidia and Google; why MoE keeps winning even when dense models look better on paper; and why he'd rather run fungible capacity at 95% than specialized chips at 60%. We also cover quantization limits, where RL efficiency has to go next, and his case that AI is still <em>under</em>-hyped. Finally, we get into cross-region training, sparse autoencoders and why interpretability hasn't taken off in open source, whether open models can close the gap, and a frank read on Anthropic's go-to-market.</p><div><hr></div><p><strong>Timeline</strong></p><ul><li><p>00:00 &#8212; Intro: the part of AI nobody talks about</p></li><li><p>01:20 &#8212; What "infrastructure for AI" actually means: the layers, from GPUs up to routing</p></li><li><p>02:59 &#8212; Why not just buy your own GPUs and do it yourself?</p></li><li><p>05:17 &#8212; The scale Fireworks runs at</p></li><li><p>06:35 &#8212; Hardware inflation, GPU costs, and the real risk hiding in commit duration</p></li><li><p>10:14 &#8212; Nvidia vs AMD vs TPUs, and why power is the bottleneck</p></li><li><p>11:57 &#8212; Mixing GPU types and generations; fungibility vs. specialization</p></li><li><p>14:22 &#8212; Once you have the GPUs, what's the next layer to build?</p></li><li><p>17:04 &#8212; Dense vs. MoE, and why the hardware picks the winner</p></li><li><p>21:07 &#8212; Quantization: is FP4 the floor? TurboQuant and INT vs. 
FP</p></li><li><p>24:28 &#8212; How tied are the algorithms to the hardware?</p></li><li><p>25:12 &#8212; DeepSeek, DeepGEMM, and next-token prediction as reconstruction loss</p></li><li><p>28:50 &#8212; Why RL is still wildly inefficient compared to pre-training</p></li><li><p>30:08 &#8212; Speculative decoding, AI-generated kernels, and auto-research</p></li><li><p>34:00 &#8212; The AGI question: why text gets automated but vision may stay expensive</p></li><li><p>37:07 &#8212; Hype check: why Benny thinks AI is still under-hyped</p></li><li><p>41:28 &#8212; Training vs. inference at the infrastructure level</p></li><li><p>44:12 &#8212; Scaling across data centers: cross-region training with Cursor</p></li><li><p>45:40 &#8212; Sparse autoencoders, interpretability, and why open source is human-constrained</p></li><li><p>49:04 &#8212; Will open models catch up &#8212; on quality and on compute?</p></li><li><p>51:41 &#8212; Are we plateauing? Opus 4.7 vs. 4.6 and the coming data wars</p></li><li><p>54:41 &#8212; Physical limits, HBM, and whether chips keep getting faster</p></li><li><p>58:17 &#8212; The belief about inference everyone gets wrong</p></li><li><p>59:31 &#8212; Anthropic, mythos, and a frank take on go-to-market</p></li><li><p>1:04:41 &#8212; Wrap-up</p><div><hr></div></li></ul><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li></ul><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Broken Peer Review, AI, and Worms — with Oded Rechavi]]></title><description><![CDATA[Oded Rechavi is a biologist at Tel Aviv University and the co-founder of QED, a company building AI to review scientific 
work.]]></description><link>https://www.the-information-bottleneck.com/p/broken-peer-review-ai-and-worms-with-dd6</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/broken-peer-review-ai-and-worms-with-dd6</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Sun, 21 Jun 2026 03:53:01 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342162/b2619814df4ecbf569293268a3543914.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-B_i0IaFjb-A" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;B_i0IaFjb-A&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/B_i0IaFjb-A?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Oded Rechavi is a biologist at Tel Aviv University and the co-founder of QED, a company building AI to review scientific work. He's also spent years studying worms.</p><p>We start with what's wrong with peer review and grant funding: why it takes years to publish, why reviewers are often your own competitors, and why the whole thing is locked to an economic model that rewards publishing more papers, not better ones. Oded explains why he doesn't call QED "peer review" at all, and what it would take to actually validate science instead of just stamping it.</p><p>Then we get into the biology. C. elegans has exactly 959 cells, every one of them named, and a fully mapped brain. Oded's lab studies how a worm's experiences get passed to its offspring through RNA rather than DNA &#8212; meaning what happens to a worm in its lifetime can change its descendants. 
We also talk about using ancient DNA to reassemble the Dead Sea Scrolls, what AI can and can't do for biology, and why he wants to build an "Ironman suit" for researchers rather than replace them.</p><div><hr></div><p>00:00 Intro</p><p>01:35 Why scientific publishing is broken</p><p>04:02 Years to publish, and what it costs science</p><p>07:20 Bad reviewers, conflicts of interest, and the money</p><p>10:47 Why preprints don't fix it</p><p>15:37 How AI conferences handle review</p><p>22:07 Conferences vs. journals &#8212; does slow review help?</p><p>25:22 Building QED: review, not peer review</p><p>30:02 Tracking a paper from idea to submission</p><p>33:11 What writing a grant actually involves</p><p>35:00 The ERC reviewer crisis</p><p>37:06 Tailoring feedback to your field</p><p>41:48 Switching to biology</p><p>44:30 Every cell has a name: inside C. elegans</p><p>46:28 Inheritance without DNA</p><p>48:16 What the worm "thinks" changes its offspring</p><p>51:58 Reassembling the Dead Sea Scrolls with ancient DNA</p><p>56:07 Psychedelics and worms</p><p>58:36 Can AI run the research itself?</p><p>1:04:49 Automation vs. validation</p><p>1:07:12 The origin of life</p><p>1:08:49 Why people reject AI-written work</p><p>1:16:18 Will humans still have a role?</p><p>1:17:39 Wrap-up</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Will AI Take Our Jobs? 
With Alex Imas (Google/University of Chicago)]]></title><description><![CDATA[Will AI take our jobs?]]></description><link>https://www.the-information-bottleneck.com/p/will-ai-take-our-jobs-with-alex-imas-420</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/will-ai-take-our-jobs-with-alex-imas-420</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Tue, 16 Jun 2026 14:49:38 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342163/c372b0468f4788fd59c1075d5bb02ae4.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-6Z76VRxp98I" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;6Z76VRxp98I&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/6Z76VRxp98I?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Will AI take our jobs? We put the question to Alex Imas, the new Director of AGI Economics at Google DeepMind and a professor at Chicago Booth, whose entire job now is studying how frontier AI reshapes the economy. His short answer: probably some of them, but the popular story is mostly wrong about which jobs and how fast.</p><p>Alex makes the case that a job is a bundle of tasks, not a single thing AI either does or doesn't do, and that the number people should actually care about is how much consumer demand responds to falling prices. Get that wrong and you predict mass layoffs. Get it right and you sometimes predict more hiring. 
We get into why the automation panic is two centuries old, why he thinks blue-collar work is in more danger than white-collar, and why the people already winning are the ones adopting AI fastest.</p><p>We also cover the AGI versus ASI distinction and why it changes everything for the economy, what happens when there's no moat and open models stay six to eight months behind, the three-tier pricing future he sees coming after the 2026 compute crunch, and what any of this means if you're deciding whether to send your kids to college.</p><ul><li><p>The episode was recorded before Alex joined Google</p></li></ul><div><hr></div><p><strong>Timestamps</strong></p><p>00:00 Meeting Alex Imas</p><p>00:44 Will AI take our jobs?</p><p>03:35 Is this an AI question or an economics question?</p><p>06:18 The economy is already behind the AI we have</p><p>07:43 Why AI adoption is K-shaped</p><p>12:51 Was Andrew Yang right?</p><p>13:45 The automation panic is 200 years old</p><p>16:46 Dario's six-month claim, and why we don't see it yet</p><p>17:22 A job is not a task</p><p>22:38 The three numbers that actually predict the labor market</p><p>22:42 The chess engine analogy and the centaur phase</p><p>25:45 Recursive self-improvement and the hamburger problem</p><p>30:06 Should AI labs be the ones answering alignment questions?</p><p>31:17 The "invisible hand wave" and why nobody wants fully autonomous AI</p><p>33:27 AGI vs ASI, and why the difference is everything</p><p>35:28 Commodities vs relational goods</p><p>41:14 Star Trek, replicators, and predicting with sci-fi</p><p>45:20 Inequality and the Upper West Side VCs</p><p>46:21 Your money manager was automated in the 1960s</p><p>50:47 Are OpenAI and Anthropic overvalued? 
The moat problem</p><p>54:29 What has to be true for the losses to make sense</p><p>55:43 Cognitive atrophy and monopoly fears</p><p>57:00 The 2026 compute crunch and the three-tier pricing future</p><p>1:01:52 The Apple vs Android analogy</p><p>1:03:54 A rich-country perspective</p><p>1:04:16 Protecting the skills that actually matter</p><p>1:07:02 Will not using AI become a status symbol?</p><p>1:08:53 Does capitalism even survive?</p><p>1:13:44 Redistribution becomes the political battleground</p><p>1:18:16 Blue collar vs white collar: who's really at risk</p><p>1:21:18 Advice for parents in an AI world</p><p>1:22:43 Saving for retirement when the Valley says don't</p><p>1:25:06 Will non-elite colleges survive?</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p><div><hr></div></li></ul><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Why AI Benchmarks Are Lying to You - with Wenhu Chen (Meta/University of Waterloo)]]></title><description><![CDATA[In this episode, we sit down with Wenhu Chen, research scientist at Meta MSL, assistant professor at the University of Waterloo, and the person behind MMLU-Pro and MMMU.]]></description><link>https://www.the-information-bottleneck.com/p/why-ai-benchmarks-are-lying-to-you-667</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/why-ai-benchmarks-are-lying-to-you-667</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Sat, 13 Jun 2026 20:05:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342164/5e42ebd48e6aa4977d0c1ce4b7ed9340.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-rB4pvsm2AkA" class="youtube-wrap" 
data-attrs="{&quot;videoId&quot;:&quot;rB4pvsm2AkA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/rB4pvsm2AkA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In this episode, we sit down with <strong>Wenhu Chen</strong>, research scientist at Meta MSL, assistant professor at the University of Waterloo, and the person behind MMLU-Pro and MMMU. If you've read a frontier model release in the last two years, you've seen his benchmarks. That makes him one of the best people to answer the question everyone dances around: when a model jumps from 40% to 90% on your benchmark, how much of that is real? We dig into why benchmarks have become the loss function of the entire field - design a bad one, and thousands of brilliant researchers will spend months hill-climbing in the wrong direction. Wenhu is surprisingly candid about the limits of his own creations: contamination is everywhere, saturation turns frontier benchmarks into unit tests, and popular alternatives, such as LM Arena, mostly measure tone and length rather than capability. His answer is to evaluate models where they've never been: private codebases, hospital data, and the messy, live internet.</p><p>We also talk about ClawBench, his new benchmark that deploys agents to over 140 real production websites to do things people actually want done, such as ordering food, booking tickets, and applying for jobs. The best model in the world completes about a third of these tasks. 
We unpack why: bot detection, models that refuse to click "pay," agents that give up the moment an environment doesn't match their training, and harnesses that can swing results by 20% without changing the model at all.</p><p>Along the way, we cover the overlooked science of evaluating pre-training, data flywheels, and synthetic environments for agent training, and whether RL teaches models to reason or just surfaces what's already there. We close with Wenhu's predictions: exploration and adaptability will improve rapidly, but security will become the field's hardest problem as agents gain real permissions in the real world.</p><div><hr></div><p><strong>Timestamps</strong></p><p>00:00 &#8211; Intro<br>00:55 &#8211; What good evaluation means, and how it's changed since the early GPT days<br>03:35 &#8211; Benchmarks as the field's loss function<br>05:50 &#8211; Contamination: the problem nobody fully solves<br>08:08 &#8211; MMLU-Pro scores: real progress or training on the test set?<br>11:05 &#8211; Can you measure creativity?<br>12:34 &#8211; Why human judges and arenas are unreliable &#8212; and what to use instead<br>19:22 &#8211; What a good benchmark actually looks like<br>22:34 &#8211; Chain of thought: signal or scratchpad?<br>26:01 &#8211; Auto-research and hill-climbing agents<br>28:52 &#8211; Harnesses: 20% swings without touching the model<br>32:28 &#8211; Safety, model release, and an "FDA for models"<br>36:53 &#8211; The overlooked science of pre-training evaluation<br>43:49 &#8211; Designing pre-training benchmarks when one run costs a billion dollars<br>49:45 &#8211; ClawBench: agents on 140+ live websites, and why the best model gets 33%<br>54:42 &#8211; How MMLU-Pro and MMMU-Pro were born from public complaints<br>59:16 &#8211; Pixel agents vs. APIs: will MCP kill computer use?<br>1:02:11 &#8211; Training agents: data flywheels and synthetic environments<br>1:05:43 &#8211; SFT vs. 
RL, and does RL teach reasoning or reveal it?<br>1:09:21 &#8211; What gets solved next year &#8212; and what doesn't<br>1:14:32 &#8211; Undervalued ideas, and what's next for ClawBench</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Jürgen Schmidhuber - Part 2: JEPA, the Road to AGI, and Who Really Invented Modern AI]]></title><description><![CDATA[In the second half of our conversation with J&#252;rgen Schmidhuber, we focus on the key ideas he's pursued since the early 1990s and discuss why he believes these concepts are only now being rediscovered.]]></description><link>https://www.the-information-bottleneck.com/p/jurgen-schmidhuber-part-2-jepa-the-769</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/jurgen-schmidhuber-part-2-jepa-the-769</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Sun, 07 Jun 2026 18:13:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342165/c0e6561c04f116bf55fb92d44b357174.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-_03y-bf6bds" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;_03y-bf6bds&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/_03y-bf6bds?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In the second half of our conversation with J&#252;rgen Schmidhuber, we focus on the key 
ideas he's pursued since the early 1990s and discuss why he believes these concepts are only now being rediscovered.</p><p>We start with JEPA. J&#252;rgen argues that the method LeCun named in 2022 is the same family he published in 1992 as Predictability Maximization. From there he traces the adversarial lineage back further still, to his 1990 world-model paper and 1991 Predictability Minimization &nbsp;- &nbsp;the curiosity-driven minimax games he sees as the real origins of GANs.</p><p>We also talk about why these ideas took thirty years to land, why today's trillion-dollar data-center buildout is driven by AGI fear, and why he thinks Apple may come out ahead.</p><p>The back half turns to what he sees as the real frontier: physical AI. Today's systems are superhuman behind the screen but helpless at a leaky pipe, and until a robot can use human tools, there's no AGI. He discusses self-replicating, self-improving machines as "a new kind of life," reframes continual learning and test-time training as ideas from his 1991 fast-weight work, and detours through Solomonoff's universal prior, Hutter's AIXI, and the G&#246;del machine.</p><p>We close on the subject J&#252;rgen is famous for: scientific credit. 
He makes his case for rigorous attribution, casts himself as a "speaker for the dead" championing forgotten pioneers like Ivakhnenko, and reflects candidly on whether the fights are personal.</p><div><hr></div><p><strong>Timeline</strong></p><p>00:30 &#8212; What JEPA is, and the 1992 Predictability Maximization story</p><p>04:54 &#8212; Implementing PMAX: autoencoders, Siamese networks, Infomax</p><p>09:10 &#8212; Predictability Minimization, factorial codes, and the roots of GANs</p><p>16:00 &#8212; Why it took 30 years: the economics of compute</p><p>20:52 &#8212; Data, the web, and 1990 as the origin point</p><p>23:09 &#8212; Hardware inflation, the trillion-dollar buildout, and the coming crash</p><p>34:05 &#8212; Physical AI: the plumber problem and self-replicating machines</p><p>41:14 &#8212; Which 90s ideas are being scaled right now</p><p>45:26 &#8212; Continual learning and test-time training as "old hats"</p><p>55:19 &#8212; Measuring intelligence: Solomonoff, AIXI, and the G&#246;del machine</p><p>1:05:26 &#8212; Self-replication and von Neumann</p><p>1:09:51 &#8212; Will he see AGI in his lifetime?</p><p>1:10:42 &#8212; Credit, integrity, and being a "speaker for the dead"</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li><li><div><hr></div></li></ul><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Jürgen Schmidhuber - World Models, RL, and the Year that changed AI (Part 1)]]></title><description><![CDATA[In this episode, we host J&#252;rgen Schmidhuber - the man, the legend, one of the godfathers of modern 
AI.]]></description><link>https://www.the-information-bottleneck.com/p/jurgen-schmidhuber-world-models-rl-88d</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/jurgen-schmidhuber-world-models-rl-88d</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Thu, 04 Jun 2026 12:59:25 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342166/158b5efd5da4499609dad961c04b7fb9.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-UUq4ixTmye8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;UUq4ixTmye8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/UUq4ixTmye8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In this episode, we host J&#252;rgen Schmidhuber - the man, the legend, one of the godfathers of modern AI. His lab worked out many ideas behind today&#8217;s systems (LSTM, world models, artificial curiosity, Transformer variants, and even GAN-style setups) decades before they became fashionable, and he&#8217;s just as well known for making sure people remember who did what first. This is the first of two conversations with him.</p><p>We go back to his lab in the early 90s and ask how one small group came up with so many of the ideas that are now being scaled to a thousand billion dollars, back when compute was ten million times more expensive. A lot of the episode comes down to one distinction he keeps making: prediction vs. decision-making. His take is that LLMs are very good prediction machines that imitate the web, but that&#8217;s only half the problem. To actually act in the world, you need a controller that uses a world model to plan. 
He talks about his 1990 work on world models and artificial curiosity, where the controller gets rewarded for running experiments that improve its own model (an adversarial setup years before GANs), why planning millisecond by millisecond doesn&#8217;t scale, and why you need sub-goals instead.</p><p>We also talk about compression as the core of understanding, from falling apples to Kepler to Einstein, and why we still don&#8217;t have a robot that can do what a plumber does, even though the AI behind the screen keeps getting better. Then the conversation moves to credit assignment: how &#8220;to Schmidhuber&#8221; became a verb, what he thinks is broken about the award system, and a long exchange on PMAX vs. JEPA. He ends on the real origins of deep learning and a prediction about self-replicating machines in space.</p><div><hr></div><p><strong>Timeline</strong></p><p>00:00 &nbsp;Intro<br>00:55 &nbsp;1991 in Munich, and why that lab mattered<br>02:38 &nbsp;"I'm not very smart" &nbsp;and why compute getting 10&#215; cheaper every 5 years changed everything<br>04:25 &nbsp;Chess as an AI proxy<br>08:27 &nbsp;Artificial curiosity in the 90s vs. today's RL exploration<br>09:10 &nbsp;Why RL is harder than supervised learning<br>20:48 &nbsp;Coding agents vs. 
robots, and how a baby learns its own hands<br>26:20 &nbsp;Compression as understanding<br>33:40 &nbsp;What's actually missing on the road to AGI<br>37:30 &nbsp;Why millisecond-by-millisecond planning is stupid<br>47:44 &nbsp;Convergence to LLMs, GPUs, and how far we still are from the Bremermann limit<br>51:49 &nbsp;Unsupervised learning, factorial codes, and predictability minimization<br>58:12 &nbsp;Credit assignment: the fights with LeCun and the Nobel critique<br>1:02:13 &nbsp;On his last name becoming a verb<br>1:05:17 &nbsp;The award system's missing peer review<br>1:07:03 &nbsp;Closed labs and the decline of open research<br>1:13:23 &nbsp;Audience questions<br>1:34:02 &nbsp;Closing: who really invented deep learning?</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p><div><hr></div></li></ul><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[AI for Science and the Thermodynamics of Generative AI - with Max Welling (UvA, CuspAI)]]></title><description><![CDATA[In this episode, we sit with Max Welling, Professor of Machine Learning at the University of Amsterdam, co-founder and CTO of CuspAI, and a foundational figure behind variational autoencoders (VAEs), equivariant networks, and Bayesian deep learning.]]></description><link>https://www.the-information-bottleneck.com/p/ai-for-science-and-the-thermodynamics-4bc</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/ai-for-science-and-the-thermodynamics-4bc</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Fri, 29 May 2026 03:58:30 GMT</pubDate><enclosure 
url="https://api.substack.com/feed/podcast/203342167/bb85f505148307f27bc503c06a73c8be.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p></p><p>In this episode, we sit with Max Welling, Professor of Machine Learning at the University of Amsterdam, co-founder and CTO of CuspAI, and a foundational figure behind variational autoencoders (VAEs), equivariant networks, and Bayesian deep learning. We talk about AI for science, the physics underneath generative models, and what's still missing on the road to real intelligence.</p><p>Max starts with what impresses him and what worries him about the LLM era, then makes the case that the next leaps will come from physical AI and from science itself. We dig into how machine learning actually works in the lab, world models and whether priors like geometry and symmetry should be built in or simply learned, and whether transformers will still rule a decade from now. At the end, we talk about CuspAI's climate mission, AI risk and regulation, Max&#8217;s new book, and where neuroscience might inspire the next wave of ML.</p><div><hr></div><p><strong>Timeline</strong></p><ul><li><p><strong>00:00</strong> &#8212; Intro</p></li><li><p><strong>00:47</strong> &#8212; Are we happy with the LLM era?</p></li><li><p><strong>03:14</strong> &#8212; Embodiment and physical AI</p></li><li><p><strong>08:05</strong> &#8212; Does "AGI" even matter as a term?</p></li><li><p><strong>11:34</strong> &#8212; Verifiers, RL, and why math/coding are tractable</p></li><li><p><strong>13:17</strong> &#8212; What actually shifted to make materials discovery work</p></li><li><p><strong>14:42</strong> &#8212; From molecules to biology and wet labs</p></li><li><p><strong>16:26</strong> &#8212; Working with real labs: timescales, friction, and the "Mira" agent</p></li><li><p><strong>20:29</strong> &#8212; Balancing simulators vs. 
experiments: the exploration&#8211;exploitation trade-off</p></li><li><p><strong>23:44</strong> &#8212; Active learning for experimental design</p></li><li><p><strong>24:23</strong> &#8212; Why active learning hasn't been central to LLMs</p></li><li><p><strong>25:24</strong> &#8212; A general loop for ML-for-science across domains</p></li><li><p><strong>27:10</strong> &#8212; Foundation models for chemistry: a "mother ship" plus a zoo of fine-tuned models</p></li><li><p><strong>30:04</strong> &#8212; Quantum mechanics, interpretation, and AI as a creative theorist</p></li><li><p><strong>31:54</strong> &#8212; World models and Yann LeCun's view; priors vs. learning</p></li><li><p><strong>34:57</strong> &#8212; Should world knowledge be explicit? (responding to Stefano Ermon)</p></li><li><p><strong>36:41</strong> &#8212; Vision: equivariance vs. transformers, and the role of optimization</p></li><li><p><strong>40:32</strong> &#8212; Best model for molecular properties in 10 years? Will transformers survive?</p></li><li><p><strong>43:16</strong> &#8212; CuspAI's climate focus and what motivated it</p></li><li><p><strong>47:10</strong> &#8212; One platform for every material class &#8212; what transfers and what doesn't</p></li><li><p><strong>48:42</strong> &#8212; Where does the risk of human extinction really come from?</p></li><li><p><strong>51:06</strong> &#8212; The "pause AI" debate and the arms-race reality</p></li><li><p><strong>52:40</strong> &#8212; Regulating powerful models: government vs. 
self-regulation</p></li><li><p><strong>55:16</strong> &#8212; Who should design AI regulation?</p></li><li><p><strong>56:29</strong> &#8212; The new book</p></li><li><p><strong>1:00:31</strong> &#8212; Compression, the information bottleneck, and renormalization</p></li><li><p><strong>1:03:30</strong> &#8212; The role of foundational principles in modern AI</p></li><li><p><strong>1:04:06</strong> &#8212; Waves in computing, the brain, and the next wave of innovation</p></li><li><p><strong>1:07:11</strong> &#8212; Neuroscience and ML: are we in a better position now?</p></li><li><p><strong>1:09:17</strong> &#8212; Conferences, the ICLR keynote, and finding the right people</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[After Math Falls, What's Next? 
with Julia Kempe (NYU/Meta)]]></title><description><![CDATA[Julia Kempe on Why Math Will Fall Next, Superhuman Provers, and the Return of the Renaissance Researcher]]></description><link>https://www.the-information-bottleneck.com/p/after-math-falls-whats-next-with-0cb</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/after-math-falls-whats-next-with-0cb</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Mon, 25 May 2026 02:10:56 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342168/cc5e5990ee71ad66cec482bd8ed1053e.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-78W5O34-l4o" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;78W5O34-l4o&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/78W5O34-l4o?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p></p><p>In this episode, we sit down with Julia Kempe, a Professor at NYU's Center for Data Science and researcher at Meta FAIR's Foundations of Reasoning team, for a wide-ranging conversation on the future of AI research.</p><p>We dig into why verifiable domains like mathematics may be on track to "fall" the way Go did. With formal verification through Lean and the Mathlib infrastructure, LLM agents can now generate and check proofs at scale, and Julia makes the case that a new industry of automated mathematical discovery is closer than most mathematicians believe. 
We explore why Erd&#337;s problems are already falling, what's still missing for harder fields like analysis and physics, and how synthetic data, curation, and verification fit together.</p><p>From there we get into the energy and scaling limits of frontier models, the case for academic research that big labs can't pursue, how to advise PhD students when Claude can already do their first-year work, the rise of AI safety and security as research priorities, and Julia's optimistic argument that AI tools are bringing back the Renaissance generalist - the researcher who can finally work fluently across math, biology, and beyond.</p><div><hr></div><p><strong>Timeline</strong></p><ul><li><p><strong>00:00</strong> &#8212; Introductions</p></li><li><p><strong>01:00</strong> &#8212; Defining reasoning and verifiable domains</p></li><li><p><strong>04:00</strong> &#8212; Lean, Mathlib, and the formalization of mathematics</p></li><li><p><strong>10:00</strong> &#8212; Constructive proofs, Erd&#337;s problems, and the new wave of "AI mathematicians"</p></li><li><p><strong>14:00</strong> &#8212; Will math be "solved"? Art, photography, and the changing nature of creative work</p></li><li><p><strong>18:00</strong> &#8212; Why physics is harder than math</p></li><li><p><strong>22:00</strong> &#8212; Moravec's paradox, evolution, and why robotics lags behind language</p></li><li><p><strong>27:00</strong> &#8212; The Renaissance is back: generalist researchers in the age of AI</p></li><li><p><strong>29:00</strong> &#8212; Advising students: math, programming, and what core education still matters</p></li><li><p><strong>32:00</strong> &#8212; Teaching and assessment when GPT can do the homework</p></li><li><p><strong>35:00</strong> &#8212; Anti-AI backlash, energy costs, and the security threat</p></li><li><p><strong>40:00</strong> &#8212; Scaling vs. 
efficiency</p></li><li><p><strong>42:00</strong> &#8212; Model collapse, synthetic data, and what's left to squeeze from the internet</p></li><li><p><strong>44:00</strong> &#8212; What's exciting next: AI for science, safety, robotics, memory, and planning</p></li><li><p><strong>47:00</strong> &#8212; Annotation costs as a proxy</p></li><li><p><strong>50:00</strong> &#8212; Superhuman models and what security even means against them</p></li><li><p><strong>52:00</strong> &#8212; AlphaGo as precedent for verifiable superhuman performance</p></li><li><p><strong>54:00</strong> &#8212; Hallucination, the Mirage paper, and whether these are solvable problems</p></li><li><p><strong>56:00</strong> &#8212; Why coding isn't fully solved yet</p></li><li><p><strong>58:00</strong> &#8212; Agent security, prompt injection, and the Wild West of deployed agents</p></li><li><p><strong>1:01:00</strong> &#8212; Regulation: what's needed and what's possible</p></li><li><p><strong>1:04:00</strong> &#8212; Advice for PhD students and what research academia should pursue</p></li><li><p><strong>1:09:00</strong> &#8212; Startup opportunities: robotics, security, and AI for finance</p></li><li><p><strong>1:12:00</strong> &#8212; Closing thoughts: use the tools, and build grassroots AI for good</p></li></ul><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Intelligence in an Open World - with Mengye Ren (NYU)]]></title><description><![CDATA[We talk with Mengye Ren, Assistant Professor at NYU's Center for Data Science, about what 
intelligence actually means once you step outside a benchmark, and why scaling a single centralized model isn't the whole story.]]></description><link>https://www.the-information-bottleneck.com/p/intelligence-in-an-open-world-with-979</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/intelligence-in-an-open-world-with-979</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Wed, 20 May 2026 13:00:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342170/e8a01f21c6ee9a1d4928b03d9c14a311.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-8u5uiyBbDis" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;8u5uiyBbDis&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/8u5uiyBbDis?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We talk with <strong>Mengye Ren</strong>, Assistant Professor at NYU's Center for Data Science, about what intelligence actually means once you step outside a benchmark, and why scaling a single centralized model isn't the whole story.</p><p>We get into why intelligence has to be defined in open environments, not closed ones, and what that means for how we measure progress. We push on the creativity question: today's models sample bottom-up from a softmax or a Gaussian, with no internal loop of consideration, and as Mengye puts it, we haven't understood creativity yet and we're already prepared to hand it over.</p><p>We also talk about what's missing for the next paradigm: continual learning, memory, embodied grounding, and smaller models that actually accumulate experience instead of re-deriving everything from scratch each call. 
Along the way, we get into JEPA and latent variables, biology as inspiration vs. blueprint, why frontier labs don't lean on explicit latents, the limits of synthetic data and world models, agent-to-agent communication, model uncertainty and forecasting, and whether ML education still matters when AI writes the experiments.</p><p>A grounded, contrarian conversation about where AI research should be looking next, beyond benchmarks, beyond scale.</p><div><hr></div><h3>Timeline</h3><p><strong>00:00</strong> &#8212; Intro and welcome</p><p><strong>01:24</strong> &#8212; What is intelligence? Defining it relative to objectives and open environments</p><p><strong>04:19</strong> &#8212; Is intelligence really the path to human flourishing, or is it productivity?</p><p><strong>04:57</strong> &#8212; Safety, scalable oversight, and whether stronger models help or hurt</p><p><strong>06:09</strong> &#8212; What does "alignment" actually mean?</p><p><strong>07:18</strong> &#8212; Centralized vs. decentralized models: objectivity vs. personal meaning</p><p><strong>08:50</strong> &#8212; Hinton vs. LeCun: where Mengye stands on AI risk</p><p><strong>10:29</strong> &#8212; Bottom-up vs. top-down architectures and feedback loops</p><p><strong>21:28</strong> &#8212; Biology and AI: inspiration, not blueprint</p><p><strong>24:14</strong> &#8212; Biological plausibility, spiking nets, and where the analogy breaks</p><p><strong>25:39</strong> &#8212; JEPA, Mamba, and architectures beyond the transformer</p><p><strong>27:31</strong> &#8212; Language as a special modality: abstraction built for communication</p><p><strong>29:04</strong> &#8212; Are we too locked into the current paradigm? 
Risk of creativity collapse</p><p><strong>30:09</strong> &#8212; Synthetic data, simulation, and the brain's own generative models</p><p><strong>31:43</strong> &#8212; World models and physical AI: how babies actually learn</p><p><strong>33:03</strong> &#8212; The case for smaller, continually learning models</p><p><strong>37:02</strong> &#8212; The role of academic research in a frontier-lab world</p><p><strong>39:47</strong> &#8212; Why LLMs aren't funny: the creativity gap</p><p><strong>40:35</strong> &#8212; What research areas matter most: embodiment, continual learning, creativity</p><p><strong>42:05</strong> &#8212; Creativity is bounded by experience &#8212; and why bottom-up sampling isn't enough</p><p><strong>45:35</strong> &#8212; Agent-to-agent communication and the limits of sub-agents</p><p><strong>46:39</strong> &#8212; Model confidence, epistemic uncertainty, and forecasting</p><p><strong>49:44</strong> &#8212; Tokenization, static vs. dynamic worlds, and always-learning systems</p><p><strong>52:20</strong> &#8212; Latent variables, JEPA, and why frontier models skip them</p><p><strong>53:40</strong> &#8212; The future of ML education when AI writes the experiments</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Language, Cognition, and the Limits of LLMs - with Tal Linzen (NYU/Google)]]></title><description><![CDATA[We host Tal Linzen, Associate Professor at NYU and Research Scientist at Google, for a conversation on the intersection of cognitive science and large language 
models.]]></description><link>https://www.the-information-bottleneck.com/p/language-cognition-and-the-limits-bb9</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/language-cognition-and-the-limits-bb9</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Sun, 17 May 2026 00:58:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342169/02c7bcafc21f08017b5e271c544e5758.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-fwSySJnr1NE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;fwSySJnr1NE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/fwSySJnr1NE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We host Tal Linzen, Associate Professor at NYU and Research Scientist at Google, for a conversation on the intersection of cognitive science and large language models.</p><p>We discussed why children can learn language from around 100 million words while LLMs need trillions, and the surprising finding that as models get better at predicting the next word, they become <em>worse</em> models of how humans actually process language. 
Tal walked us through how his lab uses eye-tracking and reading-time data to compare model behavior to human behavior, and what that reveals about prediction, working memory, and the limits of current architectures.</p><p>We also got into nature versus nurture and how inductive biases can be instilled by pre-training on synthetic languages, world models and whether transformers actually <em>use</em> the geometric structure they encode, the BabyLM challenge and data-efficient language learning, and what mechanistic interpretability can offer cognitive science beyond just fixing model bugs. The conversation closed on academia versus industry, the role of PhDs in the current AI moment, and how AI coding tools are changing the way Tal teaches and evaluates students at NYU.</p><div><hr></div><p><strong>Timeline</strong></p><ul><li><p>00:13 &#8212; Intro and what cognitive science means</p></li><li><p>02:16 &#8212; Using computational simulations to understand how humans learn language</p></li><li><p>05:26 &#8212; How children learn language vs. how LLMs are pre-trained</p></li><li><p>07:53 &#8212; Why mainstream LLMs are not good models of humans</p></li><li><p>10:07 &#8212; Comparing humans and models with eye-tracking and reading behavior</p></li><li><p>13:52 &#8212; Sensory modalities, smell, and how much you can learn from language alone</p></li><li><p>16:03 &#8212; Animal cognition and decoding animal communication</p></li><li><p>17:00 &#8212; Nature vs. 
nurture, inductive biases, and what transformers can and can't learn</p></li><li><p>21:21 &#8212; Instilling inductive biases through synthetic languages</p></li><li><p>27:34 &#8212; The bouba/kiki effect and cross-linguistic sound symbolism</p></li><li><p>28:33 &#8212; Latent causal structure in language and whether models discover it</p></li><li><p>31:13 &#8212; Does knowing linguistics help build better models?</p></li><li><p>35:07 &#8212; World models: what they mean, and why transformers encode geometry but don't use it</p></li><li><p>39:13 &#8212; Tokenization, and why Tal doesn't like it</p></li><li><p>41:35 &#8212; Scaling laws and the inverse-U curve of model quality vs. human fit</p></li><li><p>44:34 &#8212; Where the human&#8211;model mismatch comes from: architecture, memory, and data</p></li><li><p>47:08 &#8212; Diffusion language models and sentence planning</p></li><li><p>48:21 &#8212; Data quality, synthetic data, and curriculum effects</p></li><li><p>50:54 &#8212; Comparing models at different training stages to human development; BabyLM</p></li><li><p>54:40 &#8212; What level of the model should we actually probe? Representations vs. 
behavior</p></li><li><p>1:01:04 &#8212; Mechanistic interpretability, Deep Dream, and human dreaming</p></li><li><p>1:02:11 &#8212; Cognitive neuroscience, intracranial recordings, and working memory</p></li><li><p>1:10:31 &#8212; Should you still do a PhD in 2026?</p></li><li><p>1:12:31 &#8212; Will software engineers lose their jobs to AI?</p></li><li><p>1:17:43 &#8212; Teaching in the age of coding agents: what changes in the classroom</p></li><li><p>1:20:54 &#8212; What's next: human-like LLMs as user simulators, and recruiting</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Principles of Diffusion Models - with Jesse Lai (Sony AI)]]></title><description><![CDATA[We host Chieh-Hsin (Jesse) Lai, Staff Research Scientist at Sony AI and visiting professor at National Yang Ming Chiao Tung University, Taiwan, for a conversation about diffusion models, the technology behind tools like Stable Diffusion, and most of the AI image and video generators you've seen in the last few years.]]></description><link>https://www.the-information-bottleneck.com/p/the-principles-of-diffusion-models-614</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/the-principles-of-diffusion-models-614</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Sun, 10 May 2026 16:09:47 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342171/a4e77abe3b7309b1eb81769349b21cac.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-wgv0Gnat0LY" 
class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;wgv0Gnat0LY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/wgv0Gnat0LY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We host Chieh-Hsin (Jesse) Lai, Staff Research Scientist at Sony AI and visiting professor at National Yang Ming Chiao Tung University, Taiwan, for a conversation about diffusion models, the technology behind tools like Stable Diffusion, and most of the AI image and video generators you've seen in the last few years. Jesse recently co-authored <em>The Principles of Diffusion Models</em> with Stefano Ermon, and the book is quickly becoming a go-to reference in the field.</p><p>We start with what a generative model actually is, and what it means to "generate" an image or a sound. Jesse explains the core idea behind diffusion in plain terms. You start with pure noise, and a neural network gradually cleans it up, step by step, until a realistic image emerges.</p><p>From there, we talk about why diffusion has come to dominate so much of generative AI. Because the model builds an image gradually, you can guide it along the way, nudging the output toward what you actually want, refining details, or combining it with other controls. We also discuss the common critique that diffusion is slow and how the field has largely addressed it through new techniques.</p><p>We zoom out to the bigger picture, too. Jesse shares his view on world models and whether diffusion is the right foundation for them. 
We talk about what makes a generative model genuinely good versus just good at gaming benchmarks, and why evaluating creativity and realism is so much harder than scoring a multiple-choice test.</p><div><hr></div><p><strong>Timeline</strong></p><p>00:12 &#8212; Intro and welcoming Jesse</p><p>00:47 &#8212; Why Jesse wrote the book, and who it's for</p><p>03:29 &#8212; The three families of diffusion models, and why they're really one idea</p><p>05:14 &#8212; What makes a good generative model</p><p>07:39 &#8212; How do you even measure if a generated image is good</p><p>08:59 &#8212; Why diffusion beats autoregressive models for images</p><p>10:33 &#8212; Is diffusion still slow? How fast generation got fast</p><p>11:12 &#8212; A simple intuition for what a "score" is</p><p>14:12 &#8212; How the different flavors of diffusion connect under the hood</p><p>14:42 &#8212; Diffusion for text and proteins</p><p>17:12 &#8212; Consistency models and the push for one-step generation</p><p>22:12 &#8212; Diffusion for world models: simulating reality in real time</p><p>26:12 &#8212; Do world models need to understand language</p><p>35:12 &#8212; Is diffusion the right tool, or just a convenient one</p><p>38:12 &#8212; What benchmarks actually tell us, and what they miss</p><p>46:12 &#8212; Closing thoughts and where to find the book</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Inside xAI, and the Bet on AI Math - with Christian Szegedy (Math Inc)]]></title><description><![CDATA[We talked with Christian Szegedy, 
co-inventor of Inception and Batch Normalization, founding scientist at xAI, now at Math Inc, about what it takes to build a frontier lab, and why he left xAI to work on formal mathematics.]]></description><link>https://www.the-information-bottleneck.com/p/inside-xai-and-the-bet-on-ai-math-04d</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/inside-xai-and-the-bet-on-ai-math-04d</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Mon, 04 May 2026 12:45:04 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342172/e150779275e5ae1694bec12491e8bea4.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-Ltj4SYs6f1A" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Ltj4SYs6f1A&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Ltj4SYs6f1A?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We talked with Christian Szegedy, co-inventor of Inception and Batch Normalization, founding scientist at xAI, now at Math Inc, about what it takes to build a frontier lab, and why he left xAI to work on formal mathematics. 
Christian thinks Lean and auto-formalization are the missing piece for trustworthy AI: a machine-checkable layer underneath all reasoning, where proofs are guaranteed correct without anyone having to read them.</p><p>We got into his bet with Fran&#231;ois Chollet that AI will hit superhuman mathematician level by 2026, and what that actually unlocks beyond math itself: verified software instead of vibe-coded apps that break when you refactor, AI systems you can actually trust because their reasoning is checkable, and a path to handling protein folding, chemistry, and parts of biology with real guarantees instead of hand-waving. Christian also walked us through how Math Inc's Gauss system pulled off a proof in two weeks that human experts had estimated would take another year.</p><p>We also covered xAI's first 12-person year, why Christian no longer buys the original batch normalization story, why he's sure transformers won't be the dominant architecture in five years, what mathematicians do in a world of cheap proofs, and his take on whether humanity will handle AI well. 
He distrusts humanity more than he distrusts AI.</p><div><hr></div><h2>Timeline</h2><p>00:12 &#8212; Intros: Christian's background (Inception, Batch Norm, xAI, Math Inc)</p><p>01:29 &#8212; Building a frontier lab from scratch: the first 12 people at xAI</p><p>04:15 &#8212; Hiring for proven track records when 200K GPUs are at stake</p><p>06:07 &#8212; Elon's "dependency graph" and balancing long-term vision with investor demos</p><p>07:28 &#8212; Gauss formalizes the strong prime number theorem in 2 weeks</p><p>12:25 &#8212; What "formalization" actually means (and why it's not what most people think)</p><p>14:39 &#8212; Why Lean gives 100% certainty and why that matters for RL</p><p>15:26 &#8212; ProofBridge and joint embeddings across mathematical subfields</p><p>18:07 &#8212; Does math formalization transfer to coding and other fields?</p><p>21:44 &#8212; Can every domain be mathematized?</p><p>23:14 &#8212; Verified software, chip design, and why vibe-coded apps are dangerous</p><p>26:35 &#8212; Scaling Mathlib by 100&#8211;1000x</p><p>28:27 &#8212; Artisan formalizers vs. invisible machine-language formalists</p><p>33:26 &#8212; Can verification generalize?</p><p>45:19 &#8212; Revisiting Batch Norm: covariate shift, loss landscape, and what really happens</p><p>48:22 &#8212; Is normalization even necessary?</p><p>50:10 &#8212; What's actually fundamental in modern AI architectures</p><p>51:41 &#8212; Why Christian thinks transformers won't last 5 years</p><p>52:38 &#8212; The 2026 superhuman AI mathematician bet</p><p>55:15 &#8212; What's missing: better verification + a much larger formalized math repository</p><p>56:13 &#8212; Lean vs. Coq vs. 
HOL Light - does the proof assistant actually matter?</p><p>59:26 &#8212; The role of mathematicians in 5&#8211;10 years</p><p>1:02:00 &#8212; A human element to mathematics: Newton, Leibniz, and competitive proving</p><p>1:03:25 &#8212; The telescope analogy: AI as the instrument that lets us see the math universe</p><p>1:05:19 &#8212; Job apocalypse or Jevons paradox?</p><p>1:08:41 &#8212; Advice for students</p><p>1:09:50 &#8212; Can we formally verify AI alignment?</p><p>1:11:52 &#8212; Closing thanks</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Reasoning Models and Planning - with Rao Kambhampati (Arizona State)]]></title><description><![CDATA[We sat down with Rao Kambhampati, a Professor of CS at Arizona State University and former President of AAAI, to talk about reasoning models: what they are, when they work, and when they break.]]></description><link>https://www.the-information-bottleneck.com/p/reasoning-models-and-planning-with-b7c</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/reasoning-models-and-planning-with-b7c</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Wed, 29 Apr 2026 15:18:11 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342173/53aab92a49ccdab9396502f34fc2699b.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-TZihPdBZAls" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;TZihPdBZAls&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" 
data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/TZihPdBZAls?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We sat down with Rao Kambhampati, a Professor of CS at Arizona State University and former President of AAAI, to talk about reasoning models: what they are, when they work, and when they break.</p><p>Rao has been working on planning and decision-making since long before deep learning, which makes him one of the most grounded voices on what today's reasoning systems actually do. We start with definitions of what reasoning is, why planning is the hard subset of it, and what changed when systems like o1 and DeepSeek R1 moved the verifier from inference into post-training. From there we get into where these models generalize, where they don't, and why benchmarks can be misleading about both.</p><p>A big chunk of the conversation is on chain-of-thought: what intermediate tokens are actually doing, why they help the model more than they help the reader, and what outcome-based RL does to whatever semantic content was there to begin with. 
We also cover world models and why Rao thinks the video-only framing is the wrong bet, the difference between agentic safety and existential risk, and what the planning community figured out decades ago that the LLM community keeps rediscovering.</p><ul><li><div><hr></div></li></ul><h3>Timeline</h3><ul><li><p>(00:12) Intros</p></li><li><p>(01:32) Defining "reasoning" and the System 1 / System 2 framing</p></li><li><p>(04:12) Blocksworld vs Sokoban, and non-ergodicity</p></li><li><p>(06:42) Pre-o1: PlanBench and "LLMs are zero-shot X" papers</p></li><li><p>(07:42) LLM-Modulo and moving the verifier into post-training</p></li><li><p>(10:12) Is RL post-training reasoning, or case-based retrieval?</p></li><li><p>(13:12) &#964;-Bench and benchmarks that avoid action interactions</p></li><li><p>(14:12) OOD generalization and what we don't know about post-training data</p></li><li><p>(19:02) Does it matter how they work if they answer the questions we care about?</p></li><li><p>(21:27) Architecture lotteries and why no one tries different designs</p></li><li><p>(23:42) Intermediate tokens and the "reduce thinking effort" cottage industry</p></li><li><p>(26:12) The 30&#215;30 maze experiment</p></li><li><p>(27:42) Sokoban, NetHack, and Mystery Blocksworld</p></li><li><p>(34:58) Stop Anthropomorphizing Intermediate Tokens &#8212; the swapped-trace experiment</p></li><li><p>(46:12) Latent reasoning, Coconut, and why R0 beat R1</p></li><li><p>(50:12) How outcome-based RL erodes CoT semantics</p></li><li><p>(52:12) Dot-dot-dot and Anthropic's CoT monitoring paper</p></li><li><p>(53:42) Safety: Hinton, Bengio, LeCun</p></li><li><p>(57:12) Existential risk vs real safety work</p></li><li><p>(59:42) World models, transition models, and video-only approaches</p></li><li><p>(1:03:12) Why linguistic abstractions matter &#8212; pick and roll</p></li><li><p>(1:05:42) What the planning community knew in 2005</p></li><li><p>(1:08:12) Multi-agent LLMs</p></li><li><p>(1:09:57) Closing 
thoughts: the bridge analogy</p><div><hr></div></li></ul><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[What Actually Matters in AI? - with Zhuang Liu (Princeton)]]></title><description><![CDATA[In this episode, we hosted Zhuang Liu, Assistant Professor at Princeton and former researcher at Meta, for a conversation about what actually matters in modern AI and what turns out to be a historical accident.]]></description><link>https://www.the-information-bottleneck.com/p/what-actually-matters-in-ai-with-5d2</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/what-actually-matters-in-ai-with-5d2</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Fri, 24 Apr 2026 18:21:23 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342174/4cb812eed531c0c42bbd4f6faf579c64.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-F4MgMIGueCs" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;F4MgMIGueCs&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/F4MgMIGueCs?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>In this episode, we hosted <strong>Zhuang Liu</strong>, Assistant Professor at Princeton and former researcher at Meta, for a conversation 
about what actually matters in modern AI and what turns out to be a historical accident.</p><p>Zhuang is behind some of the most important papers in recent years (with more than 100k citations): ConvNeXt (showing ConvNets can match Transformers if you get the details right), Transformers Without Normalization (replacing LayerNorm with dynamic tanh), ImageBind, Eyes Wide Shut on CLIP's blind spots, the dataset bias work showing that even our biggest "diverse" datasets are still distinguishable from each other, and more.</p><p>We got into whether architecture research is even worth doing anymore, what "good data" actually means, why vision is the natural bridge across modalities but language drove the adoption wave, whether we need per-lab RL environments or better continual learning, whether LLMs have world models (and for which tasks you'd need one), why LLM outputs carry fingerprints that survive paraphrasing, and where coding agents like Claude Code fit into research workflows today and where they still fall short.</p><div><hr></div><p><strong>Timeline</strong></p><p>00:13 &#8212; Intro</p><p>01:15 &#8212; ConvNeXt and whether architecture still matters</p><p>06:35 &#8212; What actually drove the jump from GPT-1 to GPT-3</p><p>08:24 &#8212; Setting the bar for architecture papers today</p><p>11:14 &#8212; Dataset bias: why "diverse" datasets still aren't</p><p>22:52 &#8212; What good data actually looks like</p><p>26:49 &#8212; ImageBind and vision as the bridge across modalities</p><p>29:09 &#8212; Why language drove the adoption wave, not vision</p><p>32:24 &#8212; Eyes Wide Shut: CLIP's blind spots</p><p>34:57 &#8212; RL environments, continual learning, and memory as the real bottleneck</p><p>43:06 &#8212; Are inductive biases just historical accidents?</p><p>44:30 &#8212; Do LLMs have world models?</p><p>48:15 &#8212; Which tasks actually need a vision world model</p><p>50:14 &#8212; Idiosyncrasy in LLMs: pre-training vs post-training 
fingerprints</p><p>53:39 &#8212; The future of pre-training, mid-training, and post-training</p><p>57:57 &#8212; Claude Code, Codex, and coding agents in research</p><p>59:11 &#8212; Do we still need students in the age of autonomous research?</p><p>1:04:19 &#8212; Transformers Without Normalization and the four pillars that survived</p><p>1:06:53 &#8212; MetaMorph: Does generation help understanding, or the other way around?</p><p>1:09:17 &#8212; Wrap</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[The Future of Coding Agents with Sasha Rush (Cursor/Cornell)]]></title><description><![CDATA[We talked with Sasha Rush, researcher at Cursor and professor at Cornell, about what it actually feels like to be in the heart of the AI revolution and build coding agents right now.]]></description><link>https://www.the-information-bottleneck.com/p/the-future-of-coding-agents-with-ae3</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/the-future-of-coding-agents-with-ae3</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Wed, 15 Apr 2026 16:57:16 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342175/4fc0d233f2e053305cfc8a222ccaa2d9.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We talked with <strong>Sasha Rush</strong>, researcher at Cursor and professor at Cornell, about what it actually feels like to be in the heart of the AI revolution and build coding agents right now. 
Sasha shared how these systems are changing day-to-day work and how it feels to develop them.</p><p>A big part of the conversation was about why coding has become such a powerful setting for these tools. We discussed what makes code different from other domains, why agents seem to work especially well there, and how much of today&#8217;s progress comes not just from better models, but from better ways of using them. Sasha also gave an inside look at how Cursor thinks about training coding models, long-running agents, context limits, bug finding, and the balance between autonomy and human oversight.</p><p>We also talked about the broader shift happening in software engineering. Are developers moving to a higher level of abstraction? Is this just a phase where we &#8220;babysit&#8221; models, or the beginning of a deeper change in how software gets built? Sasha had a very thoughtful perspective here, including what he&#8217;s seeing from students, researchers, and engineers who are growing up native to these tools.</p><p>More broadly, this episode is about what it means to do serious technical work in a moment when the tools are changing incredibly fast. 
Sasha brought both optimism and skepticism to the discussion, and that made this a really grounded conversation about where coding agents are today, what they are already surprisingly good at, and where all of this might be going next.</p><div><hr></div><p><strong>Timeline</strong><br><strong>00:00</strong> Intro and Sasha joins us<br><strong>01:11</strong> What &#8220;coding agents&#8221; actually mean<br><strong>02:34</strong> Why coding became the breakout use case<br><strong>08:56</strong> Long-running agents and autonomous workflows<br><strong>15:08</strong> How these tools are changing the work of engineers<br><strong>17:15</strong> Are people just babysitting models right now?<br><strong>22:11</strong> How Cursor builds its coding models<br><strong>26:29</strong> Rewards, training, and what makes agents work<br><strong>34:53</strong> Memory, continual learning, and agent communication<br><strong>38:00</strong> How context compaction works in practice<br><strong>41:29</strong> Why coding agents recently got much better<br><strong>50:31</strong> Refactoring, maintenance, and self-improving codebases<br><strong>52:16</strong> Bug finding, oversight, and verification<br><strong>54:43</strong> Will this pace of progress continue?<br><strong>56:42</strong> Can this spread beyond coding?<br><strong>58:27</strong> The future of Cursor and coding agents<br><strong>1:03:08</strong> Model architectures beyond standard transformers<br><strong>1:05:37</strong> World models, diffusion, and what may come next</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine 
learning.</p>]]></content:encoded></item><item><title><![CDATA[The Hidden Engine of Vision with Peyman Milanfar (Google)]]></title><description><![CDATA[How Denoising Secretly Powers Everything in AI]]></description><link>https://www.the-information-bottleneck.com/p/the-hidden-engine-of-vision-with-384</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/the-hidden-engine-of-vision-with-384</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Fri, 10 Apr 2026 14:13:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342176/cb8d04b4f834ea11b2c34e138c56e4d8.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p></p><div id="youtube2-0njZJyiwUlo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;0njZJyiwUlo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/0njZJyiwUlo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Peyman Milanfar is a Distinguished Scientist at Google, leading its Computational Imaging team. He's a member of the National Academy of Engineering, an IEEE Fellow, and one of the key people behind the Pixel camera pipeline. Before Google, he was a professor at UC Santa Cruz for 15 years and helped build the imaging pipeline for Google Glass at Google X. Over 35,000 citations.</p><p>Peyman makes a provocative case that denoising, long dismissed as a boring cleanup task, is actually one of the most fundamental operations in modern ML, on par with SGD and backprop. 
Knowing how to remove noise from a signal basically means you have a map of the manifold that signals live on, and that insight connects everything from classical inverse problems to diffusion models.</p><p>We go from early patch-based denoisers to his 2010 "Is Denoising Dead?" paper, and then to the question that redirected his research: if denoising is nearly solved, what else can denoisers do? That led to Regularization by Denoising (RED), which, if you unroll it, looks a lot like a diffusion process, years before diffusion models existed. We also cover how his team shipped a one-step diffusion model on the Pixel phone for 100x ProRes Zoom, the perception-distortion-authenticity tradeoff in generative imaging, and a new paper on why diffusion models don't actually need noise conditioning. The conversation wraps with a debate on why language has dominated the AI spotlight while vision lags, and Peyman's argument that visual intelligence, grounded in physics and robotics, is coming next.</p><div><hr></div><p>Timeline</p><p>0:00 Intro and Peyman's background</p><p>1:22 Why denoising matters more than you think Sensor diversity and Tesla's vision-only bet</p><p>15:04 BM3D and why it was secretly an MMSE estimator</p><p>17:02 "Is Denoising Dead?" then what else can denoisers do?</p><p>18:07 Plug-and-play methods and Regularization by Denoising (RED)</p><p>26:18 Denoising, manifolds, and the compression connection</p><p>28:12 Energy-based models vs. diffusion: "The Geometry of Noise"</p><p>31:40 Natural gradient descent and why flow models work</p><p>34:48 Gradient-free optimization and high-dimensional noise</p><p>45:13 Image quality and the perception-distortion tradeoff</p><p>48:39 Information theory, rate-distortion, and generative models</p><p>52:57 Denoising vs. editing</p><p>54:25 The changing role of theory</p><p>57:07 Hobbyist tools vs. 
shipping consumer products</p><p>59:40 Coding agents, vibe coding, and domain expertise</p><p>1:05:00 Vision and more complex-dimensional signals</p><p>1:09:31 Do models need to interact with the physical world?</p><p>1:11:28 Continual learning and novelty-driven updates</p><p>1:13:00 On-device learning and privacy</p><p>1:15:01 Why has language dominated AI? Is vision next?</p><p>1:17:14 How kids learn: vision first, language later</p><p>1:19:36 Academia vs. industry</p><p>1:22:28 10,000 citations vs. shipping to millions, why choose?</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Reinventing AI From Scratch with Yaroslav Bulatov]]></title><description><![CDATA[Yaroslav Bulatov helped build the AI era from the inside, as one of the earliest researchers at both OpenAI and Google Brain.]]></description><link>https://www.the-information-bottleneck.com/p/reinventing-ai-from-scratch-with-4b9</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/reinventing-ai-from-scratch-with-4b9</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Mon, 30 Mar 2026 23:22:20 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342177/df7e00e064f8fbbf254600b0ec83ac75.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-2x4zhZV9br0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;2x4zhZV9br0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe 
src="https://www.youtube-nocookie.com/embed/2x4zhZV9br0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Yaroslav Bulatov helped build the AI era from the inside, as one of the earliest researchers at both OpenAI and Google Brain. Now he wants to tear it all down and start over. Modern deep learning, he argues, is up to 100x more wasteful than it needs to be - a Frankenstein of hacks designed for the wrong hardware. With a power wall approaching in two years, Yaroslav is leading an open effort to reinvent AI from scratch: no backprop, no legacy assumptions, just the benefit of hindsight and AI agents that compress decades of research into months. Along the way, we dig into why AGI is a "religious question," how a sales guy with no ML background became one of his most productive contributors, and why the Muon optimizer, one of the biggest recent breakthroughs, could only have been discovered by a non-expert.</p><div><hr></div><p><strong>Timeline</strong></p><p>00:12 &#8212; Introduction and Yaroslav's background at OpenAI and Google Brain</p><p>01:16 &#8212; Why deep learning isn't such a good idea</p><p>02:03 &#8212; The three definitions of AGI: religious, financial, and vibes-based</p><p>07:52 &#8212; The SAI framework: do we need the term AGI at all?</p><p>10:58 &#8212; What matters more than AGI: efficiency and refactoring the AI stack</p><p>13:28 &#8212; Jevons paradox and the coming energy wall</p><p>14:49 &#8212; The recipe: replaying 70 years of AI with hindsight</p><p>17:23 &#8212; Memory, energy, and gradient checkpointing</p><p>18:34 &#8212; Why you can't just optimize the current stack (the recurrent laryngeal nerve analogy)</p><p>21:05 &#8212; What a redesigned AI might look like: hierarchical message passing</p><p>22:31 &#8212; Can a small team replicate decades of 
research?</p><p>24:23 &#8212; Why non-experts outperform domain specialists</p><p>27:42 &#8212; The GPT-2 benchmark: what success looks like</p><p>29:01 &#8212; Ian Goodfellow, Theano, and the origins of TensorFlow</p><p>30:12 &#8212; The Muon optimizer origin story and beating Google on ImageNet</p><p>36:16 &#8212; AI coding agents for software engineering and research</p><p>40:12 &#8212; 10-year outlook and the voice-first workflow</p><p>42:23 &#8212; Why start with text over multimodality</p><p>45:13 &#8212; Are AI labs like SSI on the right track?</p><p>48:52 &#8212; Getting rid of backprop &#8212; and maybe math itself</p><p>53:57 &#8212; The state of ML academia and NeurIPS culture</p><p>56:41 &#8212; The Sutra group challenge: inventing better learning algorithms</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item><item><title><![CDATA[Why Healthcare Is AI's Hardest and Most Important Problem with Kyunghyun Cho (NYU)]]></title><description><![CDATA[We talk with Kyunghyun Cho, who is a Professor of Health Statistics and a Professor of Computer Science and Data Science at New York University, and a former Executive Director at Genentech, about why healthcare might be the most important and most difficult domain for AI to transform.]]></description><link>https://www.the-information-bottleneck.com/p/why-healthcare-is-ais-hardest-and-bef</link><guid isPermaLink="false">https://www.the-information-bottleneck.com/p/why-healthcare-is-ais-hardest-and-bef</guid><dc:creator><![CDATA[Ravid Shwartz Ziv]]></dc:creator><pubDate>Tue, 24 
Mar 2026 05:11:26 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/203342178/5583a351644c69ef50d0f9f0e406dfb2.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<div id="youtube2-rcE4rXjq2p0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;rcE4rXjq2p0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/rcE4rXjq2p0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>We talk with Kyunghyun Cho, who is a Professor of Health Statistics and a Professor of Computer Science and Data Science at New York University, and a former <a href="https://www.linkedin.com/company/2276/">Executive Director</a> at Genentech, about why healthcare might be the most important and most difficult domain for AI to transform. Kyunghyun shares his vision for a future where patients own their own medical records, proposes a provocative idea for running continuous society-level clinical trials by having doctors "toss a coin" between plausible diagnoses, and explains why drug discovery's stage-wise pipeline has hit a wall that only end-to-end AI thinking can break through. We also get into GLP-1 drugs and why they're more mysterious than people realize, the brutal economics of antibiotic research, how language models trained across scientific literature and clinical data could compress 50 years of drug development into five, and what Kyunghyun would do with $10 billion (spoiler: buy a hospital network in the Midwest). 
We wrap up with a great discussion on the rise of professor-founded "neo-labs," why academia got spoiled during the deep learning boom, and an encouraging message for PhD students who feel lost right now.</p><div><hr></div><p><strong>Timeline:</strong></p><p><strong>(00:00)</strong> Intro and welcome</p><p><strong>(01:25)</strong> Why healthcare is uniquely hard</p><p><strong>(04:46)</strong> Who owns your medical records? &#8212; The case for patient-controlled data and tapping your phone at the doctor's office</p><p><strong>(06:43)</strong> Centralized vs. decentralized healthcare &#8212; comparing Israel, Korea, and the US</p><p><strong>(13:19)</strong> Why most existing health data isn't as useful as we think &#8212; selection bias and the lack of randomization</p><p><strong>(16:53)</strong> The "toss a coin" proposal &#8212; continuous clinical trials through automated randomization, and the surprising connection to LLM sampling</p><p><strong>(23:07)</strong> Drug discovery's broken pipeline &#8212; why stage-wise optimization is failing and why we need end-to-end thinking</p><p><strong>(28:30)</strong> Why the current system is already failing society &#8212; wearables, preventive care, and the case for urgency</p><p><strong>(31:13)</strong> Allen's personal healthcare journey and the GLP-1 conversation</p><p><strong>(33:13)</strong> GLP-1 deep dive &#8212; 40 years from discovery to weight loss drugs, brain receptors, and embracing uncertainty</p><p><strong>(36:28)</strong> Why antibiotic R&amp;D is "economic suicide" and how AI can help</p><p><strong>(42:52)</strong> Language models in the clinic and the lab &#8212; from clinical notes to back-propagating clinical outcomes, all the way to molecular design</p><p><strong>(48:04)</strong> Do you need domain expertise, or can you throw compute at it?</p><p><strong>(54:30)</strong> The $10 billion question &#8212; distributed GPU clouds and a patient-in-the-loop drug discovery system</p><p><strong>(58:28)</strong> 
Vertical scaling vs. horizontal scaling for healthcare AI</p><p><strong>(1:01:06)</strong> AI regulation &#8212; who's missing from the conversation and why regulation should follow deployment</p><p><strong>(1:06:52)</strong> Professors as founders and the "neo-lab" phenomenon &#8212; how Ilya cracked the code</p><p><strong>(1:11:18)</strong> Can neo-labs actually ship products? Why researchers should do research</p><p><strong>(1:13:09)</strong> Academia got spoiled &#8212; the deep learning anomaly is ending, and that's okay</p><p><strong>(1:16:07)</strong> Closing message &#8212; why it's a great time to be a PhD student and researcher</p><div><hr></div><p>Music:</p><ul><li><p>"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>"Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.</p></li><li><p>Changes: trimmed</p></li></ul><div><hr></div><p>About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.</p>]]></content:encoded></item></channel></rss>