June 13, 2026

Why AI Benchmarks Are Lying to You - with Wenhu Chen (Meta/University of Waterloo)

Show Notes

In this episode, we sit down with Wenhu Chen, research scientist at Meta MSL, assistant professor at the University of Waterloo, and the person behind MMLU-Pro and MMMU. If you've read a frontier model release in the last two years, you've seen his benchmarks. That makes him one of the best people to answer the question everyone dances around: when a model jumps from 40% to 90% on your benchmark, how much of that is real? We dig into why benchmarks have become the loss function of the entire field: design a bad one, and thousands of brilliant researchers will spend months hill-climbing in the wrong direction. Wenhu is surprisingly candid about the limits of his own creations: contamination is everywhere, saturation turns frontier benchmarks into unit tests, and popular alternatives, such as LM Arena, mostly measure tone and length rather than capability. His answer is to evaluate models where they've never been: private codebases, hospital data, and the messy, live internet.

We also talk about ClawBench, his new benchmark that deploys agents to over 140 real production websites to do things people actually want done, such as ordering food, booking tickets, and applying for jobs. The best model in the world completes about a third of these tasks. We unpack why: bot detection, models that refuse to click "pay," agents that give up the moment an environment doesn't match their training, and harnesses that can swing results by 20% without changing the model at all.

Along the way, we cover the overlooked science of evaluating pre-training, data flywheels, and synthetic environments for agent training, and whether RL teaches models to reason or just surfaces what's already there. We close with Wenhu's predictions: exploration and adaptability will improve rapidly, but security will become the field's hardest problem as agents gain real permissions in the real world.

Timestamps

00:00 – Intro
00:55 – What good evaluation means, and how it's changed since the early GPT days
03:35 – Benchmarks as the field's loss function
05:50 – Contamination: the problem nobody fully solves
08:08 – MMLU-Pro scores: real progress or training on the test set?
11:05 – Can you measure creativity?
12:34 – Why human judges and arenas are unreliable — and what to use instead
19:22 – What a good benchmark actually looks like
22:34 – Chain of thought: signal or scratchpad?
26:01 – Auto-research and hill-climbing agents
28:52 – Harnesses: 20% swings without touching the model
32:28 – Safety, model release, and an "FDA for models"
36:53 – The overlooked science of pre-training evaluation
43:49 – Designing pre-training benchmarks when one run costs a billion dollars
49:45 – ClawBench: agents on 140+ live websites, and why the best model gets 33%
54:42 – How MMLU-Pro and MMMU-Pro were born from public complaints
59:16 – Pixel agents vs. APIs: will MCP kill computer use?
1:02:11 – Training agents: data flywheels and synthetic environments
1:05:43 – SFT vs. RL, and does RL teach reasoning or reveal it?
1:09:21 – What gets solved next year — and what doesn't
1:14:32 – Undervalued ideas, and what's next for ClawBench

Music:

"Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.