June 23, 2026

Infrastructure for AI at Scale - With Benny Chen (Fireworks AI)

The Information Bottleneck

We talk a lot on this show about RL, agents, and the move between pre-training and post-training, but not enough about the layer everything actually runs on. Benny Chen, co-founder of Fireworks AI, one of the largest inference platforms around, walks us through what it takes to serve models at scale: sourcing GPUs, writing the kernels, the runtime, and the routing layer that lets a customer hit one endpoint and forget the rest.

We talk about why the real bottleneck is power, not chips, and why that favors Nvidia and Google; why MoE keeps winning even when dense models look better on paper; and why he'd rather run fungible capacity at 95% than specialized chips at 60%. We also cover quantization limits, where RL efficiency has to go next, and his case that AI is still under-hyped. Then we get into cross-region training, sparse autoencoders and why interpretability hasn't taken off in open source, whether open models can close the gap, and a frank read on Anthropic's go-to-market.


Timeline

  • 00:00 — Intro: the part of AI nobody talks about
  • 01:20 — What "infrastructure for AI" actually means: the layers, from GPUs up to routing
  • 02:59 — Why not just buy your own GPUs and do it yourself?
  • 05:17 — The scale Fireworks runs at
  • 06:35 — Hardware inflation, GPU costs, and the real risk hiding in commit duration
  • 10:14 — Nvidia vs AMD vs TPUs, and why power is the bottleneck
  • 11:57 — Mixing GPU types and generations; fungibility vs. specialization
  • 14:22 — Once you have the GPUs, what's the next layer to build?
  • 17:04 — Dense vs. MoE, and why the hardware picks the winner
  • 21:07 — Quantization: is FP4 the floor? TurboQuant and INT vs. FP
  • 24:28 — How tied are the algorithms to the hardware?
  • 25:12 — DeepSeek, DeepGEMM, and next-token prediction as reconstruction loss
  • 28:50 — Why RL is still wildly inefficient compared to pre-training
  • 30:08 — Speculative decoding, AI-generated kernels, and auto-research
  • 34:00 — The AGI question: why text gets automated but vision may stay expensive
  • 37:07 — Hype check: why Benny thinks AI is still under-hyped
  • 41:28 — Training vs. inference at the infrastructure level
  • 44:12 — Scaling across data centers: cross-region training with Cursor
  • 45:40 — Sparse autoencoders, interpretability, and why open source is human-constrained
  • 49:04 — Will open models catch up — on quality and on compute?
  • 51:41 — Are we plateauing? Opus 4.7 vs. 4.6 and the coming data wars
  • 54:41 — Physical limits, HBM, and whether chips keep getting faster
  • 58:17 — The belief about inference everyone gets wrong
  • 59:31 — Anthropic, mythos, and a frank take on go-to-market
  • 1:04:41 — Wrap-up

Music:

  • "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.