We talk a lot on this show about RL, agents, and the move between pre-training and post-training, but not enough about the layer everything actually runs on. Benny Chen, co-founder of Fireworks AI, one of the largest inference platforms around, walks us through what it takes to serve models at scale: sourcing GPUs, writing the kernels, the runtime, and the routing layer that lets a customer hit one endpoint and forget the rest.
We talk about why the real bottleneck is power, not chips, and why that favors Nvidia and Google; why MoE keeps winning even when dense models look better on paper; and why he'd rather run fungible capacity at 95% utilization than specialized chips at 60%. We also cover quantization limits, where RL efficiency has to go next, and his case that AI is still under-hyped. Then we get into cross-region training, sparse autoencoders and why interpretability hasn't taken off in open source, whether open models can close the gap, and a frank read on Anthropic's go-to-market.
Timeline
00:00 — Intro: the part of AI nobody talks about
01:20 — What "infrastructure for AI" actually means: the layers, from GPUs up to routing
02:59 — Why not just buy your own GPUs and do it yourself?
05:17 — The scale Fireworks runs at
06:35 — Hardware inflation, GPU costs, and the real risk hiding in commit duration
10:14 — Nvidia vs AMD vs TPUs, and why power is the bottleneck
11:57 — Mixing GPU types and generations; fungibility vs. specialization
14:22 — Once you have the GPUs, what's the next layer to build?
17:04 — Dense vs. MoE, and why the hardware picks the winner
21:07 — Quantization: is FP4 the floor? TurboQuant and INT vs. FP
24:28 — How tied are the algorithms to the hardware?
25:12 — DeepSeek, DeepGEMM, and next-token prediction as reconstruction loss
28:50 — Why RL is still wildly inefficient compared to pre-training
30:08 — Speculative decoding, AI-generated kernels, and auto-research
34:00 — The AGI question: why text gets automated but vision may stay expensive
37:07 — Hype check: why Benny thinks AI is still under-hyped
41:28 — Training vs. inference at the infrastructure level
44:12 — Scaling across data centers: cross-region training with Cursor
45:40 — Sparse autoencoders, interpretability, and why open source is human-constrained
49:04 — Will open models catch up — on quality and on compute?
51:41 — Are we plateauing? Opus 4.7 vs. 4.6 and the coming data wars
54:41 — Physical limits, HBM, and whether chips keep getting faster
58:17 — The belief about inference everyone gets wrong
59:31 — Anthropic, mythos, and a frank take on go-to-market
1:04:41 — Wrap-up
Music:
"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.










