EP1: Sampling

In this episode of the Information Bottleneck Podcast, Ravid Shwartz-Ziv and Allen Roush discuss the latest developments in AI, focusing on the controversial release of GPT-5 and its implications for users. They explore the future of large language models and the importance of sampling techniques in AI.
Chapters
00:00 Introduction to the Information Bottleneck Podcast
01:42 The GPT-5 Debacle: Expectations vs. Reality
05:48 Shifting Paradigms in AI Research
09:46 The Future of Large Language Models
12:56 OpenAI's New Model: A Mixed Bag
17:55 Corporate Dynamics in AI: Mergers and Acquisitions
21:39 The GPU Monopoly: Challenges and Opportunities
25:31 Deep Dive into Samplers in AI
35:38 Innovations in Sampling Techniques
42:31 Dynamic Sampling Methods and Their Implications
51:50 Learning Samplers: A New Frontier
59:51 Recent Papers and Their Impact on AI Research
Sampling Methods Papers & Links
1. Greedy Decoding - "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2020) - Analyzes problems with greedy and beam search decoding
2. Temperature Sampling - "Calibration of Pre-trained Transformers" (Desai & Durrett, 2020) - Temperature scaling for language models
3. Top-k Sampling - "Hierarchical Neural Story Generation" (Fan et al., 2018)
4. Top-p (Nucleus) Sampling - "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2020)
5. Min-P Sampling - "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" (Nguyen et al., 2025)
General Papers:
"Large Language Models Do Not Simulate Human Psychology" (Schröder et al., 2025)
"Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning" (Shrivastava et al., 2025)
Ravid Shwartz-Ziv (00:00.802)
Hi everyone and welcome to the first episode of the Information Bottleneck Podcast. I'm Ravid, Ravid Shwartz-Ziv, and this is Allen. So yeah, hi Allen. Hi, nice to be chatting with you again. I'm Allen Roush. I'm a lead AI researcher at ThoughtWorks.
I've worked in other jobs before this. My previous job was as a senior AI researcher, where I worked with and under Ravid extensively. And before that, I worked at various companies such as Oracle and Intel. And like Ravid, I publish at places like NeurIPS, not nearly as much as Ravid, though I'm working on it. And here we're gonna talk a whole lot about
AI research. Yeah, and I think, so first, about the goal of this podcast: I think we want to start a podcast that will try to bridge, you know, cutting-edge AI research with some practical insights, because we are doing it, we are working on AI and training and all these things
all the time. We thought it could be really great to come and talk about it and to show some cool stuff that people are doing in the field these days. Okay, yeah, do you want to say something? Oh, no, no.
Okay, so for today's episode, we'll start with some of, you know, the latest news in the field, in AI. We'll talk about the cool things that we saw this week. Then we'll talk about samplers, maybe one of the most important things in current large language models,
Ravid Shwartz-Ziv (02:16.172)
and then we'll talk about two cool papers that we looked at this week. Okay, yeah, so maybe we'll start with today's, or like last week's, news. Of course, all of you heard about the new GPT, right? The GPT-5.
And the expectations were really high, right? Everyone talked for months about when the new GPT would come and how it would be, and a lot of buzz and a lot of rumors. And the bottom line, Allen, what do you think about it? I would say it's been a debacle. It's been maybe even up to a disaster.
But I do want to quickly jump in and, you know, lay out my opinions on why I think it's been problematic. One is the use of a model router at all, which is probably designed to try to manage a little bit of their inference compute costs and get something closer to profitability,
but it appears to have led to a situation where about 95% of the users, namely all of the free users and possibly even Plus users, are having what is effectively a degraded experience, because GPT-5 mini and nano, which are probably like small quantized models, tend to be preferred by the router in most cases for free users and Plus users. And beyond the fact that now users
can't just manually select a model unless they pay a lot and are doing that, it's led, in practice, to most users now using a lower-parameter-count, quantitatively worse model on most leaderboards. Notably though, early adopters and people paying $200 a month or more are having it route to the GPT-5 high and high thinking versions,
Ravid Shwartz-Ziv (04:27.917)
which tend to be far higher quality, quantitatively better. And this is why you see those kinds of voices, people like Ethan Mollick, talking about, hey, it really was good. So that's one thing. I'm gonna quickly also highlight another, which is the pre-training dataset cutoff date. This was clearly started training back in, I think it was June or July of 2024, and GPT 4.3,
I have to go check, but I'm pretty sure that some of these intermediate releases of GPT-4 actually had their pre-training dataset cutoff increased or moved later, even into early 2025. And that means, like, it's so important for you to have the latest possible cut of data on the internet, mostly for understanding how humans are speaking colloquially, you know, new language, but also if you're talking about
coding and specifically skill at AI coding, coding about AI, right? Using AI to write AI-related code. Most of that code and those codebases were written in the last two or three years, really. And so I claim GPT-5 has taken a gigantic step back in knowledge of the latest code and frameworks as a result of literally stepping back to
mid 2024. So this has been a disaster for at least those two reasons and probably many more. Yeah. So do you think that this is something that's only about money, you know, like just give OpenAI more money and things will suddenly get better, or do you think this is something more fundamental? Because for me it looked like we see a pattern here, right? Like Llama 4
and Grok and now GPT-5, it looked like we are starting to, you know, hit a wall here. Like we are putting in more and more compute, and yeah, sometimes it's better, of course, most of the time, but it's not like you see a huge improvement now as we saw in previous versions, right? Do you think this is something that we will get over, or do we need to change the paradigm?
Ravid Shwartz-Ziv (06:47.021)
Well, first of all, I'll say, you know, on the question of shifting paradigms or not, I'm sympathetic to people who want to shift out of it. I mean, first of all, sucking nearly all the money out of the room and, you know, large language models having co-opted and colonized the field was maybe too much, I would say, right? And we need to always be ready to pursue all possible directions, and,
you know, in particular, like expert systems, and certainly things like genetic algorithms and gradient-free optimization methods, are kind of out of the paradigm right now, but I'm sure there are still nuggets or even, you know, research agendas that are fruitful to pursue. However,
I do think that we have far more capability improvements we can get out of large language models. I think that a few missteps with releases from the major labs are not, in my mind, conclusive evidence that scaling is dead. But I do think that we have a road bump that we've clearly hit, namely around data size and data quality.
We had the best possible version of the internet, arguably, for training, as in we were reasonably sure there was almost zero LLM contamination, in 2022. But now in 2025, humans are using LLMs to the point where "it's not X, it's Y" has become like a meme that people say now. Like the LLM
slop has become part of the human discourse, like the way that we humans talk. And so that makes it extremely difficult to kind of train on the internet and get new information. And that new information is what we're really looking for. We call this entropy in an information-theoretic sense. And adding lots of information to the models will require, you know,
Ravid Shwartz-Ziv (08:55.979)
working synthetic data generation, AI agents that go out into the world and bring back new data, and digitization of previously non-digitized physical and also virtual media. All three of these things are expensive and more frustrating than simply, you know, taking some crawler over the whole internet as we've been doing up to this point. So it's a road bump, but I don't think it's permanent at all.
Yeah, I agree, but I think it can go in both directions, right? Like if now you have more synthetic data. And I'm not even talking about all these theoretical works that show that, you know, if you train on only synthetic data, you will see collapse and these kinds of things. But even in practice, right? Like if you have more synthetic data, then the data that we collect is more similar to current data.
So it may be, and we'll talk about it later, about diversity of the text and things like that, but it may be that we will collect more and more data and it will all look the same, and we will not have diverse enough data, right? And because LLMs are not good at generating diversity, right? So, I don't know, it can go in both directions. And I think, like, I see that
LLMs are here to stay and probably have many more tricks and ways to squeeze out more performance. But I think we need to start to think about, let's say, things that are in the periphery, right? Like memory, personalization, compression, all these things are super important, and it looks like there are so many, you know,
low-hanging fruits where basically we didn't do anything. Like, all of the big labs just focused on, let's scale up everything, let's collect more data. But now they can try to focus on these things, and then they can actually make improvements in these things. And tools are a great example, right? Like,
Ravid Shwartz-Ziv (11:16.371)
suddenly, things are working with LLMs, and apparently it was so simple to do it, right? In retrospect, right? Yeah, yeah, we get huge capability gains from adding, as you've mentioned, tools and memory
to these systems. In particular, probably my favorite release of any AI system in the past, probably, year and a half, easily, has been the Deep Research feature on all of the different LLMs. I also want to point out, I was very pleasantly surprised, having recently attended ICML 2025, which is the third of the NeurIPS-track conferences before you go back to NeurIPS again three months afterwards.
And that conference had much more focus on traditional machine learning, including, you know, decision trees, SVMs, things that feel like they have been languishing in relative obscurity since late 2022. And I see this too, because I've been getting some work recently that's doing some migration of code that was written just a few years ago on so-called ancient frameworks now, like H2O,
which were, just a few years ago, very exciting, with lots of development, AutoML, all this stuff was kind of happening that was improving classification and regression in ways that were extremely, I would call it, economically efficient compared to how much GPUs cost. And many of these libraries are in effective stasis, or in some cases nobody works on them anymore, because all the talent
is now doing LLMs since late 2022. And so we very much have so much more, unrelated to LLMs, that we have to pursue to say that we've properly unlocked this part of the tech tree. Yeah, yeah, I agree. And okay.
Ravid Shwartz-Ziv (13:24.459)
What do you say about the new open-weight model from OpenAI, right? It looks like, again, as with GPT-5, the responses or the results were quite mixed. Do you think it's a good direction for them to go?
I will always support OpenAI living up to its name, right? In fact, I think that them using the name "open" and then not being very open, except for the GPT-2 series and then the Whisper models, has done damage to any usage of the word open in names. And this is personal, because I have a dataset called OpenDebateEvidence where occasionally I get people being like, is this open source? And I'm just like,
you know, like, how could I have made that more clear? I do think so. Normally, I'm pretty good about quickly trying an open source model and being able to give my thoughts and experiences with it. I actually have not tested it yet. So I've heard people's, you know, opinions and descriptions of it. I've heard
that it is quite censored, and apparently quite difficult to fine-tune and subsequently uncensor relative to other models, but of course, techniques like orthogonalization and abliteration kind of sidestep the fine-tuning issues in most cases. So I have heard from a lot of people that it scores really well, at least for a 120-billion-parameter model, and that it is probably better than Llama 4 and certainly better than what I was using for a long time, which was like Mistral Large.
So in that sense, I think it's a success and I want them to pursue it. But, you know, I think you put some notes about things like the weird tokenizer, and I don't know much about what they did. I think you can give a much more detailed description, but if they didn't increase the vocabulary size on that tokenizer, they've already done something wrong, right? Like if you want to fix your whole...
Ravid Shwartz-Ziv (15:35.945)
how-many-Rs-are-in-strawberry thing, you probably need to tokenize on every character, and getting closer to that means increasing your vocabulary. Yeah, I don't know. Like, from the responses, I played with it a bit, and also from the responses that I got over Twitter and things like that, it looks like it has quite strong overfitting on different reasoning benchmarks.
And it's not bad, right? Also, of course, the fact that OpenAI released as much as they can and opened these models, I think it's a great contribution to the open source community. But it looks like, you know, at the end, they were really under pressure to open source something and to release some open source model. And they tried to
hide all their tricks and not reveal how they actually did it, how they are doing their models, how they are training, and what the secret of their models is. By the way, you talked about the alignment and how to reverse the post-training, right? So I saw on Twitter, Jack Morris,
he tried to reverse the post-training with tiny low-rank updates. It was quite cool, actually. So he just applied a tiny low-rank LoRA to some of the layers, the linear layers, and he just trained it with a very small, unrelated dataset. I think it was FineWeb. And,
and it just, like, reversed the model to a base model, and then he showed that the model can, I don't know, recite Harry Potter, like the whole of Harry Potter, and things like that. And of course, all of the post-training thing was just eliminated. So
Ravid Shwartz-Ziv (18:01.367)
I think it's very cool. It was a very nice trick for how to delete all the post-training, and it looks like this very thin layer of post-training is very fragile. So yeah, I don't know. I have mixed feelings about this new model.
Anything else that we should talk about regarding the recent week?
It might be just a few weeks old now, but I should just quickly touch on it, given that it is our first podcast. I think there's been quite a bit of, I would call it, intrigue going on in the corporate world. For example, Windsurf having its leadership team hollowed out and bought by Google, and I think significant amounts of the engineers.
And then we've seen also, with Microsoft, the GitHub CEO just, I think, I have to look up exactly what that is, but I think they're being, like, absorbed into Microsoft. And so there's this notion that maybe GitHub is losing its independence. And then finally, I would say, DeepSeek doing their announcement that they're pushing back
the release of their latest model and specifically citing the Chinese government forcing them off Nvidia and forcing them to take a detour of trying, and failing, to use Huawei for training. And also Claude had a new release as well. A lot of these. Yeah, you know, the monopoly of Nvidia, I still don't get it, you know, like
Ravid Shwartz-Ziv (19:58.643)
I don't understand how it can be that after seven, eight, ten years since we started with deep learning, still there is no good competitor to Nvidia. It looks like so many companies are trying, Intel, AMD, all these companies are trying to come up with better solutions, and they just
fail, and they fail really badly, right? Like their solutions are quite bad all the time. And I think these companies don't really get that they need to create a very strong software stack at the end. Like, you know, researchers, research scientists, are very spoiled and lazy. And again,
we want to have a very nice framework where we don't need to change anything. We don't care about what is running under the hood of our GPU or our computer. And the best thing with Nvidia is their CUDA, the CUDA software stack. So I think, like, I don't understand why these companies don't put
1,000 engineers to write a proper software stack that can handle whatever network we are running on top of them. So I hope they will do it soon, because GPUs are so expensive these days. Yeah, yeah, I agree. And I've spent a lot of time with cloud GPUs when I was titled as an engineer,
like at Oracle. And before that at Intel. I actually worked a lot with Intel's stack, which is, in my opinion, quite decent. Of all the companies with a competitor stack to CUDA, I think Intel's is actually closer than even AMD's. And a lot of that is because oneDAL, or oneAPI, it's gone through various names, Math Kernel Library, basically the Intel acceleration, they bundle, like, Intel's optimizations for scikit-learn and PyTorch.
Ravid Shwartz-Ziv (22:21.03)
It's actually pretty good. It's, like, CUDA-comparable in terms of being able to oftentimes just flip a switch with, like, a one-line code update and have it run. The problem is Intel hardware: you'll go from, I'm a thousand X slower, to a hundred X slower, which is a 10X improvement. Well, really good, but it's just that fundamentally the hardware didn't match
the AI use cases, and then the little bits of hardware that they did have that matched, they couldn't produce in large enough quantities and get into customer hands. And in this case, this is Gaudi, the competitor to, like, H100s. And so AMD, meanwhile, is in the traditional position they've been in for more than 20 years now, the same position they've always been in in gaming with Nvidia, which is:
we can compete at the margin right below the state of the art or top tier; we'll be pretty competitive on a price basis at minus one or minus two on the product line from the state of the art, and then we're going to maybe even beat them a little bit on price at minus three or minus four. And then we're gonna be uncompetitive literally everywhere else. And we're gonna produce, like, 5% or less of the total amount of
GPUs that TSMC makes, and because there's not enough of them, they're not available at the economies of scale necessary to unlock the theoretical cost improvements that AMD touts. The more things change, the more they stay the same. I'm always fascinated that all the stuff that I learned as a young gamer about the dynamic between these two has stayed exactly the same. And I would further stipulate that
CUDA and parallel GPU programming in general are extremely difficult. I would estimate probably fewer than a million people on Earth can claim to be skilled at it, and I might be way overestimating that number. A significant number of them work at Nvidia and are thus millionaires in most cases. There's another significant number who work at other companies that make them effectively millionaires, or at least deep into the upper and upper-middle class.
Ravid Shwartz-Ziv (24:31.526)
And so AMD, Intel, and most of the other companies that need to poach that talent don't have a culture of paying anybody except their top, top talent anywhere near that kind of money. So it's structurally difficult, is what I would say, for them to make these changes without massive overhauls of corporate culture. Hence why I was really wanting Intel to get bought. I'm really amped by it as an ex-Intel employee.
Yeah, yeah. No, I agree. Like, I don't know, at some point at NYU, we had in the cluster, like, several AMD GPUs, and we actually tried to make them work because, like, no one actually used them. And we tried to run, you know, just regular deep networks, LLM stuff, and it was so frustrating. Like, we tried again and again, and there were so many, you know,
bugs, and out-of-memory errors, and whatever, I think so many problems, and it was absolutely, like, a disaster. We tried for, I don't know, two or three weeks, and then we just said, okay, we will wait for the Nvidia GPUs.
Yeah, and one last anecdote about this. MosaicML was a company that got acquired by Databricks maybe a year or two ago for something like 700 million, for like 30 employees. Probably the largest money-per-employee deal ever. And a lot of that was because they had figured out a decent LLM training stack for MI300X. They were the company that had a good pipeline for this.
And that was enough to unlock, the market claimed, nearly a billion dollars' worth of value, and Databricks is doing pretty well these days. So I wouldn't even say it was bad. Yeah. Yeah. Anyway, I thought, and I still think, it's, like, all the Chinese models and the Chinese companies that are working so hard in order to optimize their models, and they can
Ravid Shwartz-Ziv (26:41.816)
somehow, you know, break this monopoly, and they can somehow, like, you know, build this software stack that can actually work and can actually run efficient training on non-Nvidia GPUs. Well, well, and now you're talking about something, so, I am not, unfortunately, skilled in, like, knowledge of China,
specifically China's business community and where they are on silicon manufacturing. But my guess is that Huawei and the other mainland Chinese companies are significantly behind TSMC on procurement of EUV machines. I mean, my guess is that ASML can't sell anything to China, or even most of China's, like, allies, right, for export control reasons.
And so the supply chain that they have to procure and produce, I mean, China's industrial capacity has been making leaps and bounds, particularly in improvements in quality. But we're talking about the absolute state of the art, the kind of thing where the machines that you need to start manufacturing things that are competitive with B200s themselves cost billions of dollars.
The number of people on Earth that know how to use those machines and assemble those machines is probably in the hundreds or thousands in a lot of cases, rather than millions. And by the way, power generation and power capacity are also important here, and cooling. And in most cases, Nvidia and its supply chain around that are more efficient than, you know, what I imagine China has right now.
I wouldn't be surprised for this dynamic to change in the future, particularly if we see a hot war over Taiwan, right? Which is something that a lot of policymakers are very afraid of and talk about. But at least for now, I do think that there will be quite a bit more struggle for the panda bear, or I guess the dragon might be the national personification, before it beats, you know, Taiwan on this.
Ravid Shwartz-Ziv (28:50.956)
Yeah, yeah. Okay, what do you say, let's start the deep dive into samplers. Yeah, so let's admit it, like, before I started to work with Allen, let's say maybe one and a half years ago, something like that,
I almost didn't know anything about samplers. You know, I just knew that these are, like, the thing that no one cares about: you have some default parameters and you don't need to touch anything, and in the worst case you need to change the temperature, this one hyperparameter, and hope to see a better result. But apparently samplers are so important for everything.
So let's start with, what does sampling mean? Why do we actually need samplers? So samplers, like we're talking, you have the large language model, right? You insert an input, and then it goes over all the layers, and you have the final representation, or the final logits, right? And now the question is how you get
tokens out of these logits. How do you create or generate tokens, the way that we communicate, right? The language itself, from these logits, okay? So one way to do it is to just take the highest probability, right? Like, you have logits, take the highest probability. This is called deterministic, or greedy, sampling. And another way is just to look at this distribution
and to sample from this distribution. And of course, there are so many other ways that basically say, okay, let's try to change this distribution and see what is happening. And now, the way that we sample changes the behavior of the model. Okay. So we can go from a very concrete,
Ravid Shwartz-Ziv (31:06.768)
specific, almost deterministic behavior, to stochastic behavior that changes each time and is, like, almost random. And so samplers are crucial for solving our problems, and it really depends on the problem that you want to solve.
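To make that distinction concrete, here's a minimal sketch of a single decoding step in PyTorch, with a toy five-token vocabulary and made-up logits; the function names are illustrative, not from any particular library.

```python
import torch

def greedy_decode_step(logits: torch.Tensor) -> int:
    # Deterministic (greedy) decoding: always pick the single most likely token.
    return int(torch.argmax(logits, dim=-1))

def stochastic_decode_step(logits: torch.Tensor) -> int:
    # Stochastic decoding: turn the logits into a probability distribution
    # and draw one token from it, so repeated calls can give different tokens.
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy example: a "vocabulary" of 5 tokens with made-up logits.
logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
print(greedy_decode_step(logits))      # always token 0
print(stochastic_decode_step(logits))  # usually token 0 or 1, occasionally others
```

Everything discussed below, top-k, top-p, min-p, and temperature, is a modification applied to those logits before the final draw.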
Yeah, yeah. I'm glad you've brought up sampling, because as a topic it makes me very passionate, obviously. And so, Ravid was at first somewhat surprised, but really came to support my agenda, and it became our research agenda. And I want to,
first of all, I don't know necessarily if the people listening to this are going to be really skilled researchers or if it's going to be a somewhat more general audience. And with that, I'll just quickly point out that when we use words like tokens and logits, there's also a more intuitive understanding of this. Tokens are not full words; most of the time they can often be subwords. But you can imagine, just for this, that we're trying to think about
a next word. So every time you've heard Ravid say "token," you can think in your mind "word," just for simplification. And we're imagining, out of all the possible things that your model can say, its whole vocabulary, which is not, you know, at the character level; it's quite limited, right? It might be as small as 50,000 possible words, or sub-words, in this case, right? And that
choice, oftentimes the model is very sure of itself. Like, the top most likely continuation, which in deterministic or greedy sampling is the one chosen at every step, that might be, at one time step, 99% likely, and the 49,999 remaining tokens are so unlikely that their probability cumulatively sums to, like, 1%. This happens once in a while, or similar.
Ravid Shwartz-Ziv (33:18.766)
Right? But in a lot of cases, your model tells you, it's like, a 30% chance for the most likely, a 25% chance for the second, or even something more flat, like a very flat distribution. And so the difference in the geometry of the distribution of your probabilities over every word or token that your model wants to say is
what we are studying, and why people observed, you know, that we needed sampling. And a few other notes on this. The first one is, I think that a lot of mainstream academics, and particularly top-level academics, who do have the benefit of, I would call it, many people under them kind of telling them about where the field is, and maybe them personally not having to run the models as much
all the time. And I don't blame them. I mean, you don't have time to play with these models that often when you're juggling 25 different, like, NeurIPS submissions every time, right? But these individuals will sometimes create an orthodoxy that I think might actually be somewhat correct, which is this idea that sampling is a band-aid for imperfect
quality in our probabilities, and that as models scale up in size and quality, the importance of sampling reduces, which, by the way, is something I have observed, right? We find that as we make improvements on at least the type of sampling I care about, truncation-based, the quality improvements are much larger on small models and on models that have had their
probability distribution distorted by our attempt to improve diversity, this temperature parameter, and that big models particularly can get away with little sampling, or less restrictive sampling, even at high temperature. I'm fascinated by what I can get away with at, like, a temperature of two on Gemini 2.5 Pro with almost no sampling. There's a hidden top-k of 64 that they won't
Ravid Shwartz-Ziv (35:25.952)
turn off, but that's pretty much it. If you try that on, like, 2.5 Flash, it goes to gibberish land. So I've seen this idea, but I want to quickly finish by talking about temperature for a moment here. Moreover, as was just mentioned, you know, sampling is something that open source and a motley crew of non-academics have historically been innovating on faster than academics,
primarily because it's people who are using these models heavily, often the local models, which tends to coincide with a particular crowd of role-playing types and that community. But that together created several innovations that, in Ravid's and my case, we noticed and kind of turned in and helped facilitate one effort, the min-p paper, which was
trying, at the time we found them, to get published, and we helped them get over the line and get an oral presentation, which is like the top 1.5%. Like, that's a huge stamp of, this is a big deal, from the academic community, which is us telling the academic community years later, right? Like, maybe the technique had been invented sometime in late 2022 and then didn't get the limelight until mid 2025. So this is a technique,
or a line of research that was relatively obscure in academia. It was there, it was happening. There are academics that were working on it and are in retrospect getting a lot more credit. Typicality Sampling, Clara Meister. Sorry if I'm mispronouncing your name, but I'm very influenced by her work that she's been doing since like 2019. But yeah, so I'm very passionate about this topic. I talk about it all day.
Yeah, I want to say that I found it really fascinating, you know, that, like, on one hand, you have all the big labs that use so many GPUs, right, in order to improve models and to scale things up. And on the other hand, you have sampling researchers, let's say, that have, I don't know,
Ravid Shwartz-Ziv (37:39.296)
one GPU under the table, and they can have so many, you know, creative methods, and for very specific problems they come up with very specific solutions, like a sampling method that works really well, right? And this combination, like, I don't know, as you said, I'm not sure it can last for a long time, as our
models become better and better, maybe we won't need it anymore. But at least for now, we see that all these methods that were, I don't know, on GitHub repositories or subthreads on Reddit are very important to the way that we are using the models. And the academic world almost ignored most of these methods.
So yeah, this is part of our project. We are trying to take these methods and use them and maybe present them to the world and introduce them to the entire community.
Yeah, yeah. And I'll start one of my many moments of putting my metaphorical tinfoil hat on, right? I get a little conspiratorial, but this is one place where I have a claim that most academics and scientists look at and are like, that's pretty logical. Like, nobody argues this against me. And I say it,
like, at these conferences, including at the oral presentation, trying to get somebody at OpenAI, Anthropic, or the Google Gemini team to take me aside and be like, actually, Allen, you're wrong about this, but nobody does, right? So the theory is as follows, right? So sampling, good sampling techniques, right? And maybe I should back up and point out that, here,
Ravid Shwartz-Ziv (39:34.25)
I'll talk about top P and top K and why we consider them not as great sampling techniques in a moment, but these techniques are the only techniques that the big three model providers allow for you. And in fact, they don't even allow you to have unlimited access. Like you have a limited...
range of values you can put in for top p and top k and for temperature. Temperature is by far the most locked down between 0 and 2 when temperature can scale to infinity. I can even get away with near infinity with some of the methods that we have that are better. But the tinfoil hat argument is that good sampling improves the diversity massively that you get out of a model and allows you to distill. If I imagine
that I'm, let's say, a nation-state-level actor with 100,000 GPUs, and I'm trying to figure out what a particular model has, like, get as much information out of it, and even train my own model as a fast follow on this. I would love to be able to sample with lots of diversity rather than having to do a lot of other much more complicated tricks
to try to get diversity out of 100 million or a billion generations from a model. So this, alongside the double whammy of, as I make it possible to sample the hundredth or even thousandth token from a model's log-probability space, but only when it's allowable, like when it's okay, that tendency can defeat
alignment and safety. And so now your model is way more willing to do naughty things of all possible interpretations of that word with no fine tuning. And it becomes uncontrollable. I find that when I ban the end of speech token, which basically is the thing that allows your model to know when it needs to stop generating,
Ravid Shwartz-Ziv (41:38.1)
One of the cool things with an open source model is you can force a model to just generate up to its context window, and you can also start to realize that top P and top K, these bad samplers I've alluded to, actually cripple most models' ability to generate very long continuations at once, but that's for a different thing. But...
when I ban my model's ability to, like, stop itself from talking, and basically just take away its ability to stop, especially when I'm using, like, higher-temperature sampling, when I'm using the good techniques, I will watch it flip. Like, if I ask it to be evil, I watch it flip from evil to good, like being like, whoa, that was crazy that I was doing that, that's horrible, I'm in a lie. I'll watch it do that. And then I'll watch it, 10,000 tokens later, flip again. Right? So, like, this weird,
the understanding of what happens as we mess with the logprobs of so-called aligned models is poor. And most of the anecdotes I'm going to give you have come from extremely empirical, like, literally messing-around-for-fun kind of stuff. And it's like, whoa, there's so much to work on here. Yeah, and it also makes it,
it makes it somehow, like, I don't know, something like a secret cheat, right? It's like you don't need to train anything. Like, it's almost for free, right? Like, you don't need to use so many GPUs. You don't need to train something. You don't need, like, you almost don't need data, right? It's just an inference-time method that changes the behavior of your network and your model completely. Okay.
So let's start, maybe, to explain a bit some of these methods, right? Again, we will not go really deeply into the details, and for sure you can go to the papers if you are interested in all the technical details. But I think we will go briefly over the methods, and we'll see how they interact with each other, let's say, and how they're connected.
Ravid Shwartz-Ziv (43:46.816)
And yes, there are four we should talk about, right? There are many more, of course, but we've already used, I think, all their names, right? Temperature, top-p, top-k, and min-p. Let's talk first about the two that we don't like, top-p and top-k. So, as we alluded to earlier,
let's imagine your model has a 50,000-token vocabulary, which is actually about the size that GPT, the initial ChatGPT, had. So a very, very good number here. Top-k is exactly what it says: if k is equal to 64, then my model has to sample from the probabilities cut off
where only the top 64 words are viable, in this case, or if k is two, the top two. And here we do a weighted sampling where you use normalized probabilities. So if you have one that's 30% and the other is 20%, that doesn't sum to 100%; you do a relative normalization, I think, in order to sample from these.
Because when we're not doing greedy, we're doing weighted sampling in all of these cases, with modification, basically. I should also note that it's, like, a weighted random draw. And then top-p is more interesting. P stands for the cumulative probability, a bit like a percentile. So if top-p is 0.95, then we're saying: take, cumulatively, the part of the probability distribution
that accounts for 95% of the total distribution. So what this would mean is all tokens where they sum up to account for 95%. And that might be, at one time step, you know, three tokens where it's 30, 30, and 35, right? Those three just summed up to 95, and the 49,900-something others summed up to 5% cumulatively. And in other cases, you might select a thousand tokens which cumulatively sum
Ravid Shwartz-Ziv (45:59.466)
to that 95%. But both of these techniques, for how simple they are, even they have edge cases where I claim they're maybe theoretically fine, but they're really not good in a lot of, you know, empirical observations, because they are not dynamic. Like, they're not aware of your model's entire probability distribution.
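For listeners who prefer code, here's a rough sketch of what top-k and top-p truncation look like for a single step, again in PyTorch. It follows the common open-source implementations rather than any specific provider's code; note that the `>=` comparison means ties at the k-th value can keep slightly more than k tokens.

```python
import torch

def top_k_filter(logits: torch.Tensor, k: int = 64) -> torch.Tensor:
    # Keep only the k highest-scoring tokens; everything else is set to -inf
    # so it can never be sampled.
    kth_value = torch.topk(logits, k).values[-1]
    return torch.where(logits >= kth_value, logits, torch.tensor(float("-inf")))

def top_p_filter(logits: torch.Tensor, p: float = 0.95) -> torch.Tensor:
    # Keep the smallest set of tokens whose probabilities cumulatively reach p
    # (the "nucleus"); mask out the rest.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    remove = cumulative - probs > p       # tokens past the point where mass exceeds p
    sorted_logits[remove] = float("-inf")
    filtered = torch.empty_like(logits)
    filtered[sorted_idx] = sorted_logits  # undo the sort
    return filtered

# Example: nucleus-filter a toy 50,000-token vocabulary, then sample.
logits = torch.randn(50_000)
probs = torch.softmax(top_p_filter(logits, p=0.95), dim=-1)
token = int(torch.multinomial(probs, num_samples=1))
```

Both cutoffs (the fixed k and the fixed p) stay the same at every step, which is exactly the "not dynamic" complaint above.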
So this min-p technique that Ravid and I are co-authors on, but which was originally invented by, as we affectionately call them, a motley crew of authors. That technique says, hey, your model is telling you at every time step how sure of itself it is.
So what this means is, if your model's top token probability is very, very high, let's say a 60% chance for the most likely token, you multiply it by a chosen value, we just call it the p in min-p, and we've empirically figured out that 0.1 here is very good as a scaling factor.
And you say, okay, 0.1 times 0.6 is 0.06. So at this time step, we cut off every token whose own probability is below 0.06, so only a small, high-confidence set survives. And what's nice about this is when your model isn't sure of itself, now imagine its top probability is 0.2,
we multiply by 0.1 again, and we get a 2% threshold. So this is a situation where the model isn't sure of itself, and when it's unsure of itself, the model is telling you it's okay to go deeper into my vocabulary and say something, right? Because of the relative,
Ravid Shwartz-Ziv (48:01.307)
you know, if everything is, like, two or one percent or something, like if there's a large number of such tokens, that's the model throwing its hands up and saying it's okay to be diverse. And I'll finally mention temperature, which is different from these but deeply connected. It's a distortion-based sampler, and it is our attempt to force the model, when it thinks it's sure of itself, to be less sure of itself. So we imagine
the previous scenario of 60% likelihood for the top token and everything else being less than that; when I increase the temperature very high, I am reducing the probability of the most likely tokens and increasing the probability of the less likely ones, creating ultimately, at a temperature of infinity, an equal probability for all tokens, or equal in
relative terms, near equal, but with the absolute ordering preserved, right? We never actually change the ranking; there's always a microscopic ability to rank the top token as slightly higher, like literally at your floating-point-precision level of difference. Yes, and it's important to notice, right, that with all these methods, at the end,
you have the total distribution of the output of your model, and then you need to set some threshold so that you ignore some of these tokens, that you throw away some of these tokens. So top-k and top-p basically set some, let's say, constant threshold that you find based on your previous experience, if you want. And you set some threshold, and this is a fixed threshold, and you throw away all the tokens
below the threshold. And min-p just says, okay, let's do some dynamic thresholding and use the information that we have from the model in order to set this threshold. Now, after you throw away the tokens that you don't want, then you can apply the temperature, and you can make your distribution more random or less random. Right?
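As a rough illustration of that two-step recipe, dynamic truncation first, then temperature, here's a minimal min-p sketch in the same toy PyTorch style as above. It mirrors the description in this conversation and the common open-source implementations, not any one library's exact code, and the 0.1 and 1.5 values are just example settings.

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.1) -> torch.Tensor:
    # Min-p: the threshold scales with the model's confidence. If the top token
    # has probability 0.6 and min_p is 0.1, every token below 0.1 * 0.6 = 0.06
    # is discarded; if the top token is only 0.2, the cutoff relaxes to 0.02,
    # letting the model go deeper into its vocabulary.
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max()
    return torch.where(probs >= threshold, logits, torch.tensor(float("-inf")))

def apply_temperature(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Temperature rescales the logits: >1 flattens the distribution (more random),
    # <1 sharpens it, and the relative ordering of tokens never changes.
    return logits / temperature

# Ordering as described here: truncate first, then temperature, then sample.
logits = torch.randn(50_000)
filtered = min_p_filter(logits, min_p=0.1)
probs = torch.softmax(apply_temperature(filtered, temperature=1.5), dim=-1)
token = int(torch.multinomial(probs, num_samples=1))
```

Whether temperature goes before or after the truncation step is exactly the ordering question that comes up next.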
Ravid Shwartz-Ziv (50:17.891)
I'm glad you alluded to the whole ordering thing of, you apply temperature after. It's a choice what order we apply all of these samplers in. A sampler really is a function that you apply to the log-prob distribution. And as they're implemented by these local hosting libraries and inference codebases, such as vLLM or Hugging Face or others,
they, first of all, have differences in the order that they apply these functions, or the samplers, to the distribution. But second, the order massively matters and is poorly understood as to what is better. Like, we can create intuitive reasoning to say that, you know, temperature being applied after the truncation maybe seems a lot more logical in some cases, but I have tested flipping these orders, because the ability to change the order of them is in all of these
local inference engines. And, you know, I can't, at least with my eyeballs, conclude that there's one canonical correct ordering yet. In fact, when I look at what is one of the core reasons why reproducibility is just, like, not happening in our field, Hugging Face Transformers and vLLM, the two most popular dueling inference engines, having
opposite opinions on temperature's position relative to top-p, top-k, and min-p is way up there on, like, why nothing reproduces, and nobody talks about it. It's like, whoa, we discovered a big issue that is not appreciated by the field.
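To see why the order matters, here's a tiny, self-contained demonstration (it repeats the nucleus filter from the earlier sketch so it runs on its own); with the same made-up logits, truncating before versus after temperature leaves a different set of tokens alive.

```python
import torch

def top_p_filter(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    # Same nucleus filter as in the earlier sketch, repeated for self-containment.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    remove = torch.cumsum(probs, dim=-1) - probs > p
    sorted_logits[remove] = float("-inf")
    out = torch.empty_like(logits)
    out[sorted_idx] = sorted_logits
    return out

logits = torch.tensor([2.0, 1.0, 0.5, -0.5, -3.0])
temperature = 2.0

# Order A: truncate at p=0.9 first, then apply temperature (3 tokens survive here).
order_a = torch.softmax(top_p_filter(logits, 0.9) / temperature, dim=-1)

# Order B: apply temperature first, then truncate at p=0.9. The flattened
# distribution lets more tokens into the nucleus (4 survive here), so the two
# orders end up sampling from different sets of tokens.
order_b = torch.softmax(top_p_filter(logits / temperature, 0.9), dim=-1)

print(order_a)
print(order_b)
```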
So yeah, yeah, I agree. And, so, what do you think about, let's say, learning samplers, right? Like, on one hand, you have your function here, as you said, right? A sampler is just a function that you apply to the logits, or even to the representation, of your model. So why not learn this function, right? Like, for different tasks, based on your previous experience, like some stateful model, a stateful sampler that has some state that you can learn over time. So there are a lot of opportunities here on one hand, but
Ravid Shwartz-Ziv (52:39.451)
On the other hand, at the end, one of the huge advantages of samplers is that they are so simple, and you don't need to learn them. You don't need to train them. And you don't need to start figuring out whether you need a lot of data to train them, or how they interact with other components in your training. It's so simple.
So why not keep it that way, because it works quite well. So I am a bit... I don't know. I'm not sure what the right answer is. What do you think about it? You know, I will say I've historically resisted some attempts at learning, because you're not the first.
Pretty much every academic that listens to the min-p presentation has this as their first thought. Like, we should learn this, because we learn everything else in our field, right? Of course, it's like the most logical thing to do. But I think that having it be a controllable inference-time parameter is important. That's not necessarily orthogonal to it being learned in some way, right? Like, it can be controllable and... but I do, I find,
And here I will also, you know, point out that I'm not trying to anthropomorphize models per se, but I like to imagine how do these mechanisms that we put into the models relate to a human, right? So temperature isn't exactly like this, but I imagine that when I'm heated, literally like my emotions are running high or hot as I would describe it, that I am more willing to say words that I might not be willing to say otherwise.
And in particular, a lot of times those are curse words, which I think, for humans, are coincidentally always kind of at hand. Like, if I imagine my own logprobs or something, which, again, maybe that's not a real thing, but there are some scientists that think maybe that is a mechanism, right? And that
Ravid Shwartz-Ziv (54:45.627)
true gibberish, right? I'm not gonna go say "uber glubla" if I'm about to die or something. I'm probably gonna wail or scream, but I'm gonna probably say a whole lot of curse words and prayers or something, right? Which means that that should be only in position a thousand or higher in my human logprob space. So when I think that way, and then I think about this idea of learning, it's like, well, if I go to a certain social event or I'm in a social
or physical situation, I would have a different learned response, like being like, okay, what words do I throw out and never say, or only say if somebody with a gun runs in, you know? It's like, I'm at a funeral, right? I'm never gonna say the F word unless somebody attacks a freaking funeral, right? So, like, sorry, sorry, these are really weird examples, but I'm trying to really put a point on how we're trying to get diversity out of models. We're trying to really push them, in a sense.
I don't know. I must say that, like, and maybe we'll do an episode about LLMs versus humans and the neuroscience and the connection between them, but I must say that most of the time, when people are hand-waving about how humans are behaving and what the connection is,
that, yeah, LLMs need to be like humans, humans are learning like that and behaving like that, so maybe the best thing is to try to mimic this, most of the time they're hand-waving and it's not true. And I'm saying it as someone with a PhD in computational neuroscience. So, I don't know, I think that, like,
it may be, right? It may be true, but I see a lot of advantages to training and learning a sampler, at least to some extent, you know? At least to some extent, it's like you can, for example, for some task or some questions, for some specific type of data, you actually need to learn what is the right sampler for this specific type
Ravid Shwartz-Ziv (57:02.425)
of statistics. And I do... So, again, I agree that the human brain certainly does not operate like these AI models. And we don't even know how the brain works very well in most cases. But I do think that we can do thought experiments. So the first thing I'll point out is, I do think that humans certainly are not aware of their own thinking. I mean,
we have the appearance of our own thinking, and here we're going into philosophy, like Descartes and all this kind of stuff, right? But we have the appearance of thinking, but I claim that this is not the actual process. Like, the map is not the territory, and the appearance of how... The analogy is like the millipede, you know, trying to describe how it can move a thousand legs at once, right? Like, I can't clearly describe why it is that I can talk so fast and how the words come to me and all these things.
But nonetheless, we can do it. And similarly, models are not at every time step given access to their own sampling settings. But with things like structured generation and hacks, you can actually force your model at every time step to actually know about these things. And even probably do it at an architectural weight-based, learned, trained level exactly like you're alluding to.
And by doing that, now it's analogous to the model always seeing that. It's like, well, humans can, I mean, slowly, we can do that too. I could do some experiment of spending a whole lot of time writing out every possible word I would say, or even recording myself over months or years and transcribing every word I've said over those years, and then using this list, and even having the distributions of what I would say, and then looking at it while
I'm writing something. Like, it is possible for humans to try to kind of mock and imitate, a little bit, what we force on models. And I'm curious as to, one, if there's value in doing that, and two, if there's a way to start to, like, empirically start testing things that used to be seen as philosophical problems, like Chinese-room kind of, you know, arguments and stuff. And I'm like, no, we can start answering these things.
Ravid Shwartz-Ziv (59:18.175)
You know, I'm already seeing, in fact, evidence for, like, Platonism, and I hate Platonism, you know, the whole Platonic representation hypothesis, like models all getting very similar representations and being extremely easy to convert or align from one to another. That's, you know, okay, all right, maybe there really is a perfect chair, you know? Yeah.
Maybe, yeah. Maybe we'll do a, I think it's a very interesting topic, like how similar are all these models and do we have some optimal representation, you know? And maybe we'll talk about it in the future. Okay, do we have something more to say about samplers? No, I think we've covered it pretty well. Yeah.
You know, especially if you're running a local model, you have infinite control over these things. I will just throw out a few tools. vLLM is the fastest inference engine, but they don't have too much beyond top-p, top-k, min-p, and temperature. If you want to experiment with even more obscure, but in some cases even more exciting, stuff, I really recommend tools like SillyTavern or Oobabooga. I'll try to fix the transcription because it probably didn't get those right.
Yeah, then Hugging Face's Transformers has historically been the go-to library for having all the new samplers. That has actually changed recently. They are no longer accepting new samplers into their codebase, which I so rudely discovered recently when I tried to merge my own new one. But they're kind of extending an olive branch by making it trivial to
take any code, like if somebody wrote a sampler and threw it on their own GitHub, they have the ability now to kind of use that and apply that on top. So they've done decent by the community. I don't want to throw too much shade. I love Hugging Face, right? It just annoys me personally what they've done here. Yeah, and also it's very easy to implement these samplers. Most of these samplers are just, like, a few lines of code or something like that. It's quite easy to take the
Ravid Shwartz-Ziv (01:01:38.859)
Hugging Face repository and use it with Hugging Face, right? Okay. So now let's go to papers that we liked this week, or, like, cool papers that we loved reading. Do you want to start? Yeah, and I will also totally admit some of my own limitations here. This paper I have
more properly skimmed, and also I'm not a psychology major as such. But this one is especially important to me right now because I'm in the middle, actually, of... and so, I should back up a little bit. Ravid and I have done some research on persona generation
where, inadvertently, I used some stuff from the psychology literature with no attempt to, like, claim I was psychologically grounded. It was more of an experiment to understand getting diverse synthetic data and to give us some training data for parameter-efficient fine-tuning and experimenting with personalization on models. But,
As it turns out, when I joined ThoughtWorks, I showed a little bit of this preliminary research and work to some folks around me who are very much deeper in the psychology literature base and work in psychometrics. so then this, I call it a bombshell paper. Still, I believe a preprint not peer reviewed yet, but I expect this to get accepted somewhere. I think this is peer review quality, like, you know,
It has a provocative title, large language models do not simulate human psychology. And they're basically trotting out a lot of what is already kind of like some results that were well known about how trivial it is to like add superfluous things to an input and get massively different outputs like changes in punctuation or additions of like
Ravid Shwartz-Ziv (01:03:47.287)
basically invisible-looking characters, or, like, an additional space or something put in, or, like, a small spelling error, can suddenly, as they've shown, dramatically change outputs and create notable discrepancies between LLM and human responses, even for models that were specifically fine-tuned on psychological responses.
And by the way, they didn't even talk about sampling, which throws another mega monkey wrench into this stuff. But, like, this kind of result is showing me that other people who understand psychology and LLMs are concerned about using them too deeply, kind of analogous to what you've just said earlier, as basically a brain in a vat, right? And so this is, you know, I think that we are about to see, if not have already seen, a
massive amount of decision-making offloaded from humans onto AI, with the assumption that it will act kind of like humans, when it often will act very differently, and we need to account for this in all decision-making with it. Cool. Yeah, I want to mention a very cool paper that's called Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning.
And this is from some folks at Microsoft. And the core idea is how to make the responses of reasoning models shorter. We all know that our models, like reasoning models, are just bullshitting us most of the time. Like, when you take a reasoning model, the response is so long, with a lot of, like,
these weird tokens, and it talks and it talks, and most of the time it's just not related, like, it doesn't improve the response and it's not related to the actual answer. Or it's related, but in a very vague way. And so, yeah, this is the core problem, and how do they actually solve it? So they just
Ravid Shwartz-Ziv (01:06:09.565)
train the models to take into account also the length of the response. So first, they sample much more. They sample a lot of different responses during the training. And then they filter out the responses based on both the length
and also the quality. Okay, so they look at the ratio between the reward and the length as the signal for how to filter out the responses. And now, of course, you take into account both the reward and the length, so you push down the length of the generated text,
and everything is nice, and they showed it on a Phi-4 reasoning model; there was a huge reduction in the length of the answers. So I think this is very important, at the end, because, let's say a year or two ago, when chain of thought started, right, all of us thought that, oh, we finally can
understand models, we now have some hidden, or, like, very clean, way to see what models are thinking, and now we can just align them to the right answer, and if they think more, they will get to the right answer. And recently we saw this is not true. We saw that
longer answers are not necessarily better. There are even several papers that show that longer answers, with more thoughts, right, are actually a good sign that the answer is wrong. It's like the model has more difficulty coming up with the right answer. And so I think we need to understand how to
Ravid Shwartz-Ziv (01:08:34.391)
cut down these reasoning thoughts and like these very vague ideas, very vague thoughts that our models generate all the time. And this is a very nice and clean way to do it. Very straightforward way. So yeah, go and read these papers. They're really great. And I think that's it, right?
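For a rough feel of the idea Ravid just described (this is a sketch of the sample-then-filter step only, not the paper's exact objective or reward), the filtering could look something like this; the `keep` count and the toy rewards and lengths are made up.

```python
def filter_group(responses, rewards, lengths, keep=4):
    # Sketch of the group-filtering idea: from a large group of sampled
    # responses for one prompt, keep only the ones with the best
    # reward-per-token, so the policy update only ever sees short,
    # high-quality reasoning traces.
    scores = [r / max(l, 1) for r, l in zip(rewards, lengths)]
    ranked = sorted(range(len(responses)), key=lambda i: scores[i], reverse=True)
    return [responses[i] for i in ranked[:keep]]

# Toy usage: 8 sampled responses per prompt, keep the 4 most concise-and-correct.
responses = [f"response_{i}" for i in range(8)]
rewards = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]   # e.g. a correctness reward
lengths = [800, 200, 150, 1200, 900, 300, 2500, 100]  # tokens in each response
print(filter_group(responses, rewards, lengths, keep=4))
```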
Yeah, I wish we had more time. I could totally spend a lot of time engaging on what you just talked about, right? But if we're at time, yeah, we should let the people get back to their work. Yeah. Yeah. Maybe we'll do a whole episode about reasoning models and what is going on there. Okay. Allen, thank you so much,
and let's hope we'll meet next time. Yeah. Yeah, we're going to do a lot more of these podcasts, and hopefully the field continues to be as vibrant and as exciting as it has been, so we have plenty to talk about, but not so exciting that we're too busy and not able to do this. That's true. That's true.
Thank you to all the people who joined us today, and see you next time.