EP22: Data Curation for LLMs with Cody Blakeney (Datology AI)
Cody Blakeney from Datology AI joins us to talk about data curation - the unglamorous but critical work of figuring out what to actually train models on.
Cody's path from writing CUDA kernels to spending his days staring at weird internet text tells you something important: data quality can account for half or more of a model's final performance. That's on par with major architectural breakthroughs.
We get into the differences between pre-training, mid-training, and post-training data. Mid-training in particular has become a key technique for squeezing value out of rare, high-quality datasets. Cody's team stumbled onto it while solving a practical problem: how do you figure out if a 5-billion-token dataset is actually useful when you can't afford hundreds of experimental runs?
We also talk about data filtering and some genuinely surprising findings: the documents that make the best training data are often short and dense with information. Those nicely written blog posts with personal anecdotes? Turns out models don't learn as well from them.
On synthetic data, Cody thinks pre-training is still in its early days, where most techniques are variations on a few core ideas, but there's huge potential. He's excited about connecting RL failures back to mid-training: when models fail at tasks, use that signal to generate targeted training data.
Takeaways:
- Data work is high-leverage but underappreciated
- Mid-training helps extract signal from small, valuable datasets
- Good filters favor dense, factual text over polished prose
- Synthetic data for pre-training works surprisingly well, but remains primitive
- Optimal data mixtures depend on model scale, with smaller models needing more aggressive distribution shifts
Timeline
(00:12) Introduction to Data Curation in LLMs
(05:14) The Importance of Data Quality
(10:15) Pre-training vs Post-training Data
(15:22) Strategies for Effective Data Utilization
(20:15) Benchmarking and Model Evaluation
(28:28) Maximizing Perplexity and Coherence
(30:27) Measuring Quality in Data
(32:56) The Role of Filters in Data Selection
(34:19) Understanding High-Quality Data
(39:15) Mid-Training and Its Importance
(46:51) Future of Data Sources
(48:13) Synthetic Data's Role in Pre-Training
(53:10) Creating Effective Synthetic Data
(57:39) The Debate on Pure Synthetic Data
(01:00:25) Navigating AI Training and Legal Challenges
(01:02:34) The Controversy of AI in the Art Community
(01:05:29) Exploring Synthetic Data and Its Efficiency
(01:11:21) The Future of Domain-Specific vs. General Models
(01:22:06) Bias in Pre-trained Models and Data Selection
(01:28:27) The Potential of Synthetic Data Over Human Data
Music:
- "Kid Kodi" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
- "Palms Down" — Blue Dot Sessions — via Free Music Archive — CC BY-NC 4.0.
About
The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
Ravid Shwartz-Ziv (00:12)
everyone and welcome back to the information bottleneck and today we have Cody from Datology AI
who is doing really interesting work on data curation, one of the most important and underappreciated aspects of training LLMs. Cody was previously at Mosaic and he has been thinking deeply about what actually makes training data good. So welcome to the show.
Cody (00:41)
Yeah, thank you for having me.
Ravid Shwartz-Ziv (00:43)
And Allen, good to have you here as always. How is it going?
Allen Roush (00:47)
It's good to be here and it's good to meet ⁓ one of the people who's really pushed AMD GPUs hard, right? Because that's kind of what Mosaic was known for at the time. So we'll definitely want to go into the dynamics around that later in the call.
Cody (01:06)
Man, I
had forgotten about that, yeah.
Allen Roush (01:09)
you
Ravid Shwartz-Ziv (01:12)
So, Cody, let's start from the beginning. How did you get into data curation? Was there a moment you realized this was a thing to focus on?
Cody (01:21)
Yeah, so that's a great question. And ⁓ I do sometimes look around and wonder how did I get here? ⁓ My entire career at this point has been thinking that I was going to do one thing ⁓ and then looking around for something that was very high leverage and then ending up doing something else. ⁓ So when I did my PhD, ⁓ I didn't really know what I was going to do it in.
I was originally, I was very close to being like an HPC, like kernels person, like writing high performance parallel GPU code. And when I started working on my PhD, I had talked to my advisor. I was really excited about this AI thing and how can we bring it to devices, and right at kind of like a critical point where I may have gone in the other direction,
I read Jonathan Frankle's lottery ticket hypothesis paper and I was like, you know what's way better than faster kernels? Like, faster models. And I kind of went on this whole journey doing pruning and distillation. And I think the first time I realized that data was better, there was this paper I was hoping to write on distillation and I had this problem in my training and
I like, there was this weird data set. I was just kind of trying a toy example and it was like something in the TensorFlow data sets at the time, just some weird pictures of crops in Africa or something to help, like, farmers. And I found a handful of examples in it that completely destroyed my training and I pulled them out and then the distillation worked really well. And I was like, I think I have the world's best model on this data set. But in my, like,
you know, young researcher brain, I couldn't figure out how to make this a paper about distillation. So I just threw it away and went about my life. Yeah. And so then I had spent some time in my PhD at Meta. Ironically, I think I was actually on a team that was basically what Mosaic was for everyone else inside of Meta. We were like a consultancy for infrastructure,
and we were the modeling team. So the rest of the team did like set up physical servers, and there was a day where the power, like everyone's login went out, and they had done the face logins for the building through that service. And someone in my department had to break into Meta headquarters with an axe to go physically reset the servers. But I was on the modeling team. It was just kind of weird how that worked. But we did consultancy. So like,
you know, the ads and content people would come to us and they're like, hey, our model is slow or our inference is slow. How do we make the modeling better to make it go faster? And it turned out to be a pretty useful skill set when I went to Mosaic, and that was how we did everything. It was like, okay, how do we make training go faster? And I had never really expected to work on data. Even when I was doing all this stuff, it was a bunch of algorithmic improvements, distillation,
things like that. But then when we started training the MPT models, we just needed tokens and it was like, okay, well, this is the highest leverage thing I can do. It's like, we don't know what good data is. We have a lot of data, but most of it looks really bad. And how do we make this into a good model? And so, once again, I left my comfort space and did something that, you know, data work kind of sucks. It's not a lot of fun.
⁓ And a lot of people don't want to do it. They don't want to think about it, but it turned out again It's like very high leverage and it makes really good models. So ⁓ So I went all the way from I'm gonna write CUDA kernels to I have to stare at weird like, you know pieces of text from the internet and that's my job
Ravid Shwartz-Ziv (05:19)
So from your perspective, like if you need to rate it, how much is the data versus how much is the model in general?
Cody (05:27)
Yeah, this is a good question. And we had these debates a lot while we were working on DBRX, because we were hoping to be the first open source MOE and we missed out by a couple months to Mixtral. But we were really excited about the performance gains you could get from MOEs, and I think we had clocked them at maybe like a four to eight X improvement. I think by the time we got...
done with this, we were seeing equivalent improvements from data. So like another four to eight X. I think now you can probably get even higher than that, right? So you need scale, you need flops and you need modeling improvements, but I think, you know, anywhere from like half to 75% of a model's performance can come from data if you do a really good job.
Allen Roush (06:13)
So real quick, around that time when Mistral released, I'm guessing you're referring to Mixtral, right? What's the name?
Cody (06:19)
Exactly,
yeah.
Allen Roush (06:20)
So I noticed that the field seesawed and arguably continues to seesaw between dense and MOE and sparse MOE models. I have a hypothesis around that related to logprobs, because logprobs look a lot different and spikier with MOE models, but independent of that, do you have any reasons that you believe that that seesawing has happened and arguably continues to?
Cody (06:43)
⁓
yeah, absolutely. I mean, for one, they're just really hard to train. Like they've got all these, you know, theoretical benefits, and they're not theoretical, like they're practical. But especially two years ago, training an MOE like sucked. It was not a good time. Getting reasonable MFU was really hard. And then once you got the MFU,
there was all sorts of weird stuff that could start happening in the middle of training if you didn't know how to do scaling laws to predict what should happen, right? We had all these scaling laws that were defined for dense models. They just don't work for MOEs. I mean, a lot of good work has come out since, so this is much easier now. But we were calculating tokens to parameter ratios, and we're like, is it the active parameters? Is it the total parameters? Or is it something else?
Allen Roush (07:27)
my gosh.
Cody (07:41)
And it was just really hard to kind of scale them up in a reasonable way. So like, you can just train an MOE in TorchTitan now. It's just easy. You just, you know, import some things and go, and you've got not the worst MFU, and that's like, yeah, you can just do it. Dense models just solved a lot of problems, right? You know, if you had enough compute and you were willing to train for a long time,
you could get all these great scaling laws and go, OK, I'm going to hit my model quality target, right? Yeah, I think a lot of the reasons are just practical.
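A toy illustration of the tokens-to-parameters bookkeeping Cody mentions; the 20x tokens-per-parameter heuristic and the DBRX-scale parameter counts are assumptions for the sketch, not numbers from the conversation.

```python
# Illustrative only: a Chinchilla-style "tokens ~ 20 x parameters" heuristic,
# applied to an MoE where it is ambiguous whether active or total params count.
# DBRX-scale parameter counts (132B total, 36B active) are used purely for illustration.

def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rough compute-optimal token budget under a dense-model scaling heuristic."""
    return tokens_per_param * n_params

total_params = 132e9   # every expert counted
active_params = 36e9   # parameters actually used per token

print(f"Budget if total params govern:  {compute_optimal_tokens(total_params) / 1e12:.1f}T tokens")
print(f"Budget if active params govern: {compute_optimal_tokens(active_params) / 1e12:.1f}T tokens")
# The two budgets differ by roughly 4x, which is exactly the ambiguity described above.
```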
Allen Roush (08:18)
Yeah, I was just gonna say, I wish we could clip you, and maybe we will, on saying, you know, how those scaling laws that you get for dense models don't apply. Because often people on Wall Street think that there's like one concept of scaling laws. I'm like, I have personally seen hundreds of different papers, including workshops, possibly in the thousands at this point, claiming to establish a scaling law for something very narrow, right?
Cody (08:42)
Yeah,
yeah, absolutely.
Ravid Shwartz-Ziv (08:45)
But we talk a lot
about scaling laws. We talked a lot about scaling laws with Alex and Linmi a few months ago. And like the bottom line was that they aren't really that useful in practice, right? Like if you have so many scaling laws, why should I care about them, right? They depend on so many parameters, so many factors. So yeah, if you scale the data or you scale the compute, you will get better results, but it's
What can I get from it?
Cody (09:18)
Yeah,
I mean, you know, I guess like in isolation, the scaling laws are not that useful to know. But I think what is useful is when people can demonstrate that something is like measurable and predictable. This is like a great tool for you now, right? Like, you know, if you are going to train a model that you want to release, you want to know which parts of your training are predictable, right?
You know, I mean, not the least of which are just, like, which learning rate to use and how long can I train before, you know, like, am I going to hit my target quality? You know, the target quality is always just whatever's the best model in this size this year, you know, and being able to predict out, like, will the model that I'm going to spend... sorry, my dog has something to say about scaling laws too. You know, she gets very excited about this.
You know, yeah, yeah, whatever. Am I going to spend $10 million, $20 million, $100 million? Is the model going to get where I want it to be? Am I going to be successful or am I going to fail? Right. I think that's really the only scaling law that matters: good model or not. Yeah.
Ravid Shwartz-Ziv (10:16)
All of us, it's game.
Okay, so we have the data, we understand that the data is important, right? Now, what are you going to do? How are we actually going to make better data for our models? What do we need to care about, how are we going to do it, and what are the parameters we are going to measure?
Cody (10:56)
Yeah, I mean, this is a great question. Maybe, I don't know if y'all mind, but maybe I can answer this question with a bit of a history lesson about, you know, what I learned doing this wrong. And yeah, I mean, so pre-training data is very different than post-training data, because it's a lot less intuitive, right? About what you're going to do. Like you can look at a lot of the data and it's not as obvious what is helpful to the model.
Ravid Shwartz-Ziv (11:09)
Great!
Cody (11:24)
⁓ you know like
Ravid Shwartz-Ziv (11:26)
Let's
start with what it actually means, pre-training versus post-training, right? For the 5% of the people that don't know these terms.
Cody (11:31)
Sure, sure.
Sure, sure, sure, right? All of your pre-training, I mean, well, this is a bit of a philosophical question, but generally, pre-training is the generic, very large, ⁓ autoregressive next token prediction where you're going to be training on trillions of tokens generally. ⁓ Post-training might include... ⁓
you know, an SFT stage where you're trying to teach it some amount of instruction following. It may just be autoregressive. It may have some sort of like mask to like prevent training on the instructions. Depends on, you know, whatever flavor of this you prefer. And then generally some sort of RL for preference tuning or teaching it reasoning or like verifiable tasks. So.
Those sorts of things. You typically see, you know, some pretty sharp distribution shift and scale and even learning objective changes in the post-training. Whereas pre-training is just huge volume autoregressive next token prediction. And I think we'll get to it in a bit, but this whole thing of mid-training has emerged and
you know, it's confusing because it's still pre-training most of the time. So what makes it mid-training, you know, and I guess it's because it's not pre or post, it's in the middle. ⁓ And so, yeah.
Allen Roush (13:05)
I just want to quickly ask when you're saying like, well, two questions.
Ravid Shwartz-Ziv (13:05)
OK, so yeah.
Allen Roush (13:13)
Are people... I know at one point people have, and in general it seems like as we've gone deeper into scaling, the number of epochs trained... at one point it was 100 epochs that you would train something on, on average. Now I've heard it's well less than one on pretty much every, like, you know, GPT-5 type model. And then two: is mid-training when you intervene somewhere, like, deep in that almost-one epoch?
Cody (13:40)
Yeah, yeah, I mean, this is a good question. ⁓ like the, you there was a bunch of literature that came out maybe like two or three years ago, which is in the ancient times now, about kind of how many epochs you can really do on text data, right? Because you're right, like in the vision world, we did hundreds of epochs, thousands. I mean, if you add distillation, you can do millions. ⁓
But this just didn't really work in text. And I think mostly the reason why it didn't really work was because there just weren't good data augmentation techniques, right? You can't train 100 epochs if you don't do random crop, flip, you know, noise, whatever, right? And that just doesn't exist in the same way for text, or at least not in a compute-cheap way.
But yeah, so there are limits to how many times you can repeat data. And for some of your data, you won't go through it at all, or you hope you won't. And then there's some data sets, sub data sets when you're doing your mixing that ⁓ you'd really like to see almost all the time, but you're kind of limited by these heuristics. And you kind of go, OK, actually, I'm going to try not to use this data more than
four to eight times. And then the question, the fundamental question of mid-training actually is how do you use this data the best over the course of training? If you have four repeats, four epochs of this data, where should it go? And why, right? And that's kind of the whole game.
Allen Roush (15:24)
Okay, sorry. No, you go, Ravid.
Ravid Shwartz-Ziv (15:24)
Okay, so, no,
no, so I want to know, like, okay, so now I have all the data. We did, like, now we split it to pre-train, mid-train, and post-train, right? Like, the different types of data. We can discuss how we are going to do it, maybe later a bit, but let's assume that we have it, and then what?
Cody (15:47)
Well, you write a big check to Jensen and you cry. And then you train the model, you know.
Ravid Shwartz-Ziv (15:52)
Ha ha ha.
No, OK, but like, OK, what is the... like, do we feed all the data to it? What are we going to do with this data?
Cody (16:02)
Sure, do you mean like practically? Like how do you get this to work?
Ravid Shwartz-Ziv (16:05)
You can start with high level, like what do you want from this data? What do you want it to represent? What types of data do you want? What is the relationship between the task and the data?
Cody (16:16)
Sure.
Sure, sure, sure. Yeah, okay, I see where you're coming from. you know, fundamentally, we want models that are good at things, right? ⁓ And if you're releasing open source models or like frontier level models, the hard part is you actually don't really know what you want the model to do. It's like, what do you want it to be good at? ⁓ Everything, you know? ⁓ This question is a little bit easier for people that have like...
very narrow tasks or very specific things where it's like, I know what I want it to do. I want an agentic model that, you know, can help people on phone calls, or that can route various things between the API, or write code, right? Like that's very well defined. I'm not saying it's easy, but it's definitely much more tractable than: you're a team, you're tasked with making the world's best open source model, what do you want it to be good at? Go.
Right? Like everything, right? You know, I think that the answer is not a great one. Essentially, having worked on one of these high pressure things where it's like, we're going to make an open source model, you want your boss to be happy, right? The answer somewhat ends up being like, go find the benchmark that everyone's most excited about and do a good job on it without cheating as best you can. Like,
You know, bench maxing is real ⁓ and you want to have the high score because that's what people are going to remember at the end of the day. But you also want the model to be good. So that way people don't feel like you did that, you know? Yeah.
Ravid Shwartz-Ziv (17:59)
We are talking in a few days with the CEO, the co-founder of LMArena. He is like the expert in benchmarking and how we can cheat on benchmarks.
Cody (18:06)
yeah.
Allen Roush (18:17)
So.
Cody (18:18)
Yeah.
Allen Roush (18:19)
I have a related question to this. I've always thought, first of all, it always seemed like post-training, because they had so much less data, that they must be training longer, like more epochs, because it explains why slop is so... Slop, the way we define it, like, certainly "delve", "it's not x, it's y", is so overrepresented. But then my question for you is, I always felt like training models, and indeed, you know, we use terms like recipes because I think cooking is the right analogy,
which is to say that it feels extremely subjective, except that it has, like, scientific principles, and you have multiple steps, but you can do them in different orders and get different, you know, delicious models, I guess. What I'm trying to ask is, it always seemed to me like behavior like grokking, for one, and in general, it always seemed like low and slow made more sense, literally like low learning rate
and slow, basically saying, okay, it's going to feel compute inefficient, but maybe the quality of my representations are better. And I'm wondering then, in the context of trying to avoid benchmark maxing or maybe benchmark max, but also be good, do you believe in that kind of like low and slow approach?
Cody (19:13)
Mm-hmm.
I mean, I think that that is kind of what is, ⁓ like that's sort of what the big rise in RL represents, right? Is like, you know, they have very little data. RL in some ways allows you to use that data like much more effectively without as many worries of overfitting. But like, you know, it's very funny if like, you know, I've been working on some like
blogs to cover some of the weird history of data. And going back and reading papers from two or three years ago to see what people were up to. If you read like the Llama 2 paper or the Llama 3 paper, it's so funny to see like what they were doing for post-training. It feels so antiquated, even just a little bit later. But like even places that were well-resourced didn't want to do online RL because it was just so hardware inefficient, like the MFU was so bad.
Not that they couldn't, they absolutely could. They just chose not to, which was kind of funny. So I guess that sort of is the low and slow, if you think about it. It's like, okay, well, we're just gonna kind of take lots of shots at this and hope that we get really high quality samples. And we just don't care how long it takes us to get the high quality samples and the positives. I don't know how much it will prevent
benchmark maxing in some sense. I can tell you how we went about trying to avoid benchmark maxing as best as we could, and still how we go about trying to avoid it now at Datology. And I'm not saying that we have the perfect solution, right? Like there is still this weird optimization where people want to be able to brag about being the best at an important benchmark, but also don't want to have done that.
Allen Roush (21:13)
Yeah, we'd to hear it.
Cody (21:27)
⁓ and, and what we did at Mosaic was we had this thing called the Gauntlet, and it was like 35, 36 evals. And we kind of split them up as best as we could into kind of like categories of things like language understanding, reading comprehension, programming, world knowledge, right? And then all of these kind of got aggregated into sub tasks. ⁓ and then once again, averaged into like one large task, right? so there wasn't an easy way to like,
maximize the score on one eval by overfitting without it hurting your whole aggregate. It was kind of funny because we really pitched this as, you know, those charts from like soccer, or like those Japanese RPGs with the skill things, and that was how we were thinking about it. Like, okay, you've got so many points you can put into things, right, and you want to represent the whole.
I remember coming out with the blog and putting the radar plot out and we angered so many people. They were like, this is not the way you should ever plot data. I'm like, oh no, we didn't mean to.
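A minimal sketch of the two-level aggregation Cody describes for the Gauntlet (individual evals averaged into categories, categories averaged into one headline number); the eval names and scores below are placeholders, not the real Gauntlet composition.

```python
# Hypothetical eval scores grouped into Gauntlet-style categories.
# Names and values are placeholders for illustration.
category_scores = {
    "world_knowledge":        {"eval_a": 0.61, "eval_b": 0.55},
    "reading_comprehension":  {"eval_c": 0.72, "eval_d": 0.68},
    "language_understanding": {"eval_e": 0.64},
    "programming":            {"eval_f": 0.31, "eval_g": 0.28},
}

# Average within each category first, then across categories, so overfitting
# a single eval barely moves the headline aggregate.
per_category = {
    cat: sum(scores.values()) / len(scores)
    for cat, scores in category_scores.items()
}
overall = sum(per_category.values()) / len(per_category)

print(per_category)                 # these are the values you'd put on a radar plot
print(f"aggregate: {overall:.3f}")
```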
Allen Roush (22:33)
Well, radar plots
are great. I'm sorry, radar plots are an amazing way to, at a glance, understand capabilities. And they've been used... when you point out JRPGs, there's a time-honored tradition of using them in video games. So yes, the naysayers are wrong.
Cody (22:39)
Yeah.
Yeah, I mean, it was
great for us. We could at a glance very quickly see what we had done. This angered a lot of people who had problems with the interpretability and the scientific rigor of it. I was like, okay, well, you're allowed to do whatever you want. You don't have to use this, right? Yeah, but anyways, this way of thinking about things where it's like...
You need a holistic approach to what you want the models to do. You want to care a lot about the evals and hopefully you're capturing capabilities that are aligned with what you want the model to do. Like in a weird way, if you know what you want the model to do and you know that your evals are capturing it, the data part is pretty easy. You just kind of pull knobs until you've done that thing.
Right? And that's also why bench maxing is easy. You just pull the knobs until MMLU Pro goes up or LiveCodeBench goes up, right? And then you call it a day.
Ravid Shwartz-Ziv (23:45)
Bye.
I have another question. So, but in the pre-training it's not clear, right? Because, right, post-training just maximizes the benchmark. But in pre-training it's not clear what you actually want to do, right? What is the right metric to check? Even if you ignore fitting on a specific benchmark and you have several benchmarks that capture the thing that you want to do. What are you doing in pre-training, and how do you evaluate it?
Cody (24:13)
Yeah, I mean, this is a good question too, right? Like, I mean, this one was also very hard, especially a couple of years ago, because it wasn't really clear how the things you did in pre-training, the ways you were able to evaluate base models, even transferred over to what would make a good model for post-training. Like, you know, MMLU was the thing to be good at for years, right?
And the funny thing with MMLU was, if you got a high MMLU score, when you went to post-train it, I mean, some of this has changed a little bit, but you always expected your MMLU score to go down in post-training. It was just the knowledge the model had and you couldn't make that better. But it wasn't clear if that actually made the post-trained model better, you know. It was like, well, it knows lots of facts. Is that a better model? I don't know, you know.
And it was really hard to figure out which evals correlated to the things you wanted to see in downstream performance. ⁓ And I mean, I think generally this is still not entirely clear, but we did find that when we took this aggregate approach, this at least somewhat translated well, that like we were good at a broad diverse set of things and the models that had the higher scores there had higher scores on the post-trained evals. So that was at least, you know.
correlated in that sort of work.
Ravid Shwartz-Ziv (25:39)
So today, the best option that we have is just to measure it on different benchmarks, even though we didn't train on these benchmarks specifically.
Cody (25:50)
Yeah, so I mean, the thing that I understand that really works the best but is much harder to do is if you have holdout data sets that you're absolutely certain are not on the internet and you transform them into these perplexity eval sets, right? So I think a lot of Frontier Labs use their internal code sources, although I don't know if this works once you start vibe coding.
data that you're sure that no model has ever been trained on, right? And then you actually get a very predictive measure over the training course, especially the closer you can get the thing that you're calculating perplexity over to be the task you want. This works really well. I mean, it just, you know, it is the definition of what you want it to do. ⁓ It's just much harder to go get, find those examples, you know.
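A minimal sketch of the held-out perplexity evaluation Cody describes, using the Hugging Face transformers API; the small GPT-2 checkpoint, the per-document averaging, and the assumption that each document fits in one context window are simplifications.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint works; "gpt2" is just a small stand-in here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def heldout_perplexity(texts: list[str]) -> float:
    """Mean perplexity over documents you are confident no model has trained on."""
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
            # Passing labels=input_ids makes the model return mean next-token cross-entropy.
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# The closer these texts are to the task you care about, the more predictive the number.
print(heldout_perplexity([
    "An internal document that never appeared on the public internet...",
    "Another held-out example in the style of the target task...",
]))
```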
Allen Roush (26:44)
Would you then agree, because I fully believe this idea that we are trying to maximize, like from an output sampling, like an output sampling researcher perspective, I want to maximize perplexity of my output while maintaining coherence, right? As a good proxy for the diversity of the output.
Cody (27:03)
⁓ Maximize perplexity while maintaining the coherence. I suppose, right, if you're trying to make something that is like less regular, more diverse, and I guess you could say more surprising, ⁓ I think that makes sense. ⁓
Allen Roush (27:10)
Yeah.
Cody (27:25)
I mean, perplexity, like thinking in terms of pure perplexity is always a really funny thing because it's like, it's so dependent on the context in which you're discussing it. ⁓ Yeah, in that sense, right, it does make sense. I think if you told a random researcher that you were trying to maximize perplexity, they would have a heart attack.
Allen Roush (27:46)
Well,
no, but I mean, I am though in this case, is, but grounded by my LLM as judge or some other human source being like, that output is still good. But it just seems like there are all these other measures of diversity of outputs, perplexity, which is calculated via entropy, right? And so I've always felt like it's all there in the entropy too. And yes, I'm aware that gibberish has maximal perplexity.
Cody (27:56)
Right, right, right.
Right.
Ravid Shwartz-Ziv (28:13)
But.
Allen Roush (28:13)
But that's bad scores on your coherence.
Cody (28:18)
Yeah,
yeah, I mean, lots of creative writing and poems and things, you know, follow these little, you know, parts where there's like little lulls and then spikes, right? And the things that kind of grab our attention tend to be these little spikes in perplexity, where it's like, I've said something very clever or different that you weren't expecting there, right? And I think, so it's, you know, kind of weird how you want to, you know,
how you want to constrain this, right? You don't want everything to be high perplexity, because that would be gibberish, but maybe you want the average sequence to be higher perplexity, because you've kind of put in lots of interesting novel surprises and twists there. Yeah.
Ravid Shwartz-Ziv (28:56)
So it means that for different data types, you actually want different metrics, maybe even different perplexity, right? So do you think we actually need to split, to say, okay, on this type of data we want these measures, and on this type of data we want this measure, and somehow to combine all of them, even in the pre-training?
Cody (29:18)
Yeah, so I think now you're asking, how would you gather some measure to quantify quality of given sample points or given pools of data? ⁓
Ravid Shwartz-Ziv (29:31)
or even like also like to train on these metrics, right?
Cody (29:34)
Right.
Yeah. I mean, I feel like there hasn't been a whole lot of success in coming up with ways to perfectly quantify data for pre-training. And, you know, the best examples really just kind of come from, like, you can use LLM as a judge, right? And you can use quality filters.
But it's still very imperfect, and basically we're kind of stuck with empiricism, right?
Ravid Shwartz-Ziv (30:10)
So
just to make sure: when you say LLM as a judge, you just take a pre-trained model, feed it the data, and ask it to classify, like to give it some rating and to say, do we need to use this data in our pre-training pipeline?
Cody (30:21)
Yeah.
Yeah,
I think on the pre-training side, the most famous and successful example is FineWeb-Edu. And what they did was they made this rubric and they said, these are five qualities that we want in a sample, and they're related to educational content. And for each axis that you see, award one point,
and then you're gonna score a bunch of documents so that you've got a score from zero to five. And then, because that was really expensive to do for the entire internet with Llama 3 70B, we're gonna take those scores and train a classifier on top of an embedding model. And this worked unreasonably well. And the funny thing about this too, to me, is it seems like they've tried...
So one, it kind of just works. They can just kind of copy paste this method and take it into new domains. So you started seeing all sorts of Edu data sets coming out. You had Stack-Edu, and then you had FineMath. What are all these data sets? They just copy pasted their rubric, changed some of the words, and did it again.
But the other thing that's really funny to me about this is, it seems like they've attempted to make a couple of improvements on this, and it hasn't worked. They just nailed it the first time, you know. Like, they were doing this kind of grade school to early college math, and they've tried to make other rubrics for graduate level math, and the models don't care. They don't want to learn graduate level math. It hasn't made them any better at anything. Which I think is really funny, and part of why it's really hard to understand
what's going on. That's kind of the height of what we have right now on the pre-training side. And then you look at a bunch of these documents and it's not actually clear, when you're looking at them, that these are good. Like, they're very strange documents that you pull out from these filters, even though you've used a model to judge them and say that they're good, right? But there's some...
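A condensed sketch of the two-stage pipeline Cody is describing: an LLM scores a sample of documents against a 0-5 educational-value rubric, and those scores then supervise a cheap regressor on top of document embeddings so the whole corpus can be scored. The rubric wording, the embedding model choice, and the `llm_score` stub are illustrative assumptions, not the exact FineWeb-Edu recipe.

```python
from sentence_transformers import SentenceTransformer  # any document embedding model works
from sklearn.linear_model import Ridge

RUBRIC_PROMPT = (
    "Score the document below from 0 to 5, awarding one point for each criterion it "
    "meets (educational value, clarity, self-containedness, ...).\n\nDocument:\n{document}"
)

def llm_score(document: str) -> int:
    """Hypothetical helper: prompt a strong judge LLM with RUBRIC_PROMPT and parse
    the 0-5 score from its reply. Stubbed with a constant here."""
    return 3  # replace with a real API call to the judge model

# Stage 1 (expensive): LLM-judge a modest sample of documents.
sample_docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose...",
    "BUY NOW limited offer best deals click here",
]
labels = [llm_score(doc) for doc in sample_docs]

# Stage 2 (cheap): distill those judgments into an embedding model plus a linear head,
# which can then score billions of documents at a fraction of the cost.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
regressor = Ridge().fit(embedder.encode(sample_docs), labels)

def fast_quality_score(document: str) -> float:
    return float(regressor.predict(embedder.encode([document]))[0])
```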
Ravid Shwartz-Ziv (32:35)
Strange in what way?
Why? What do we see there?
Cody (32:39)
So the documents that tend to be judged as high quality by FineWeb-Edu's classifier, they're very short and they're very information dense. And I think it's interesting because I think we're still learning a lot about how models learn. It wouldn't have been obvious ahead of time that that is actually what you want.
I had a project with an intern of mine who was working on developing some filters, and we wrote these complicated ones with, like, okay, we want to have health blogs and science and technology blogs, and we'll give some scores for writing quality. And we pulled out these documents and they looked great. I mean, they looked like really well-written blogs and all these things. And we trained some models on it and they were just terrible.
Ravid Shwartz-Ziv (33:32)
You
Cody (33:33)
It was like the FineWeb-Edu classifier just swept the floor with it. And it was like, okay, well, this is good text. Why does this not make a good model? Right. And I think it had to do with this information density thing that wasn't really obvious you needed for training a model. That it wasn't good to train it on a bunch of examples of these really long pieces of prose, you know. Like if you think about how some of these cooking blogs start, where it's like, well, when I was 12 years old, you know, my
mother used to always sit me down at the table, and the smell of this thing, right? It's like, actually, what the model wants is: can you take me to the bottom with the recipe? Yeah.
Ravid Shwartz-Ziv (34:14)
So do you think, if you will, let's, I don't know, extract some features from these documents, high quality documents. Do you think that they contain a lot of information regarding a specific task, or is this something about the natural language, or what are the features that you actually need in order to train good models?
Cody (34:37)
Yeah, so I mean, the successful and popular filters, right? They're in some ways just aligning the pre-training distribution to the task distribution, right? So the other really popular open source filter is the DCLM filter, which came from the DataComp paper. There was a competition for who can train the best model from open source data, right? And that filter took a different approach. Instead of kind of
using LLM as a judge, they actually took samples of data sets they knew to be high quality, and then they essentially built a simple n-gram classifier. But the data sets they chose were OpenHermes, which is SFT post-training data, like question answer data, and then Explain Like I'm Five Reddit posts.
And so what they end up doing is finding in the wild a bunch of things that look like instruction data, or look like question answering data, or have people explaining things. And you still end up getting a lot of these really short, information dense documents out. And it makes a lot of sense, right? If you want your model to be a chat bot that explains things to you, you probably want to find a lot of examples of
people explaining topics to people or answering questions or helping them.
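A rough sketch of the kind of classifier Cody is describing: positives drawn from known-good sources (OpenHermes-style instruction data, ELI5-style explanations), negatives from random web text, and a cheap n-gram model to score everything else. The real DCLM filter used fastText; the scikit-learn pipeline below is a stand-in with the same flavor, and the example documents are toy placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positives: text resembling the target distribution (instructions, Q&A, explanations).
# Negatives: random web documents. All examples here are toy placeholders.
positives = [
    "Q: How do I reverse a list in Python? A: Use reversed() or slice with [::-1].",
    "Explain like I'm five: the sky looks blue because sunlight scatters off air molecules.",
]
negatives = [
    "BUY NOW limited time offer best deals click here free shipping",
    "home | about us | contact | terms of service | privacy policy",
]

texts = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)

# Word uni/bi-grams plus logistic regression: a cheap proxy for the fastText setup.
quality_filter = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
).fit(texts, labels)

def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep a web document if it resembles the known-good seed data."""
    return quality_filter.predict_proba([document])[0, 1] >= threshold

print(keep("Q: What causes rain? A: Water vapor condenses into droplets that fall."))
```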
Allen Roush (36:00)
So these filters, have you seen innovation in them in the past few years? Are there new filters in 25 and 26?
Cody (36:12)
I mean, that's the thing that's kind of odd. There's not, right? Like, FineWeb-Edu and DCLM I think came out within a couple of months of each other, and then there hasn't been anything that's really killer, that blows it out of the water, in the open source since then. No, I know that every big lab that's doing things probably has dozens of these that are probably very useful.
But it is kind of odd that these just kind of came out. They were incredibly effective, and that's it. That's what we got. They're not that hard to replicate. I'm sure someone is sitting on something that's really awesome. But yeah, that's what we got.
Ravid Shwartz-Ziv (36:55)
Okay, so now let's assume that we know what it means to have good data for pre-training, and what we want from mid-training or post-training. Let's start with high-level principles. What are good properties for data in these steps?
Cody (37:15)
Yeah, so this ends up being a lot of your math data sets, your code data sets, some of your really high quality things, maybe academic papers, Wikipedia, you know. Things, maybe, that if you've made very specialized filters to find specific types of things, maybe you've put some of those in there. And, you know, categorically what you're looking at are things that you have maybe billions of tokens of,
Whereas like your large scale web data may be trillions of tokens, right? And you're trying to figure out, okay, I've got, you know, maybe 5 billion tokens of Wikipedia and I have a 20 trillion token training run. How do I use this well, right? Or, you know, in the entirety of GitHub that I'm allowed to scrape, there exists 400 billion tokens of code.
How do I reasonably make my model train on 20 % code for 20 trillion tokens? And that's kind of the question of what are you attempting to do in mid-training? It's your most valuable data, the thing that aligns most with the distribution that you care about, maybe your highest quality data, either by these filters or however you've decided. And then, okay, now how do I use it?
Yeah, and it's kind of a funny story. Maybe... me and the people I worked with at Databricks kind of ended up doing mid-training by accident. Like we weren't intending to do mid-training to make models good. When we started, we just had this problem of scale that we had to address. Which is like, okay, we were not GPU poor under any circumstance, like any way you define the word, but we still couldn't be
reckless with our compute, right? You know, if you have 512 H100s or a thousand H100s, you still can't do a hero-run-level sweep on every possibility of your data, right? We had very limited... like, you know, we could train essentially a Llama 2 7B in about a week, but we couldn't train 10 of them, and we couldn't test every permutation. And so, you know, what we originally were doing
when we were starting to do some of the things that are now called mid-training was, we were like, we have all these small datasets and we think some of them are really helpful, but we don't know which ones. And what we were hoping we could do, and it turned out to work, was: well, what if we take the data distribution in like the last 10 or 20% of training and we just move everything around, right? If we just upsample
the data sets we hypothesize are helpful for this task, maybe it will tell us. It turns out it does. And so then we were able to kind of like grok ⁓ the relative quality of our small data sets to like a much more accurate degree where we couldn't have done like a hundred sweeps. We can now say, okay, actually this open web math data is really helpful or this data set here is really helpful.
in a way that we couldn't have otherwise, just because there's only five billion tokens of it; you couldn't possibly have trained on that much of it.
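A toy sketch of the probing trick Cody describes: keep the baseline mixture for most of the run, then upsample the candidate dataset in the final stretch of training and watch whether the relevant evals move. The dataset names, mixture weights, and the 20% cutoff are illustrative assumptions.

```python
def mixture_weights(step, total_steps, base, candidate="open_web_math",
                    upsample_to=0.20, final_fraction=0.20):
    """Per-dataset sampling weights at a given training step.

    For the first (1 - final_fraction) of training, use the baseline mixture.
    In the final phase, push the candidate dataset up to `upsample_to` and
    renormalize the rest, so a small 5B-token set produces a visible signal.
    """
    if step < (1 - final_fraction) * total_steps:
        return dict(base)
    others = {name: w for name, w in base.items() if name != candidate}
    scale = (1.0 - upsample_to) / sum(others.values())
    weights = {name: w * scale for name, w in others.items()}
    weights[candidate] = upsample_to
    return weights

baseline = {"web": 0.85, "code": 0.10, "wikipedia": 0.04, "open_web_math": 0.01}
print(mixture_weights(50_000, 100_000, baseline))  # baseline phase
print(mixture_weights(90_000, 100_000, baseline))  # candidate upsampled 20x in the final phase
```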
Ravid Shwartz-Ziv (40:30)
But do you think it's only a matter of compute, like the mid-training? Because let's say that I have an infinite amount of compute. Do you think it's worth it to just train on all the data and just remove the mid-training?
Cody (40:42)
So
I mean, the problem is not the compute. I mean, the problem is ⁓ being data constrained, right? It doesn't matter how many GPUs you would have given me. I still only have 5 billion tokens of Wikipedia, or I still only have 5 billion tokens of this math data set, or 100 billion tokens of Python data that exist, or something like this, right? And
But, and so when you look at how you train models and scaling laws, right? No amount of training just on code data... If you only have a hundred billion tokens of Python, you can do a hundred epochs of it. It's never going to give you the signal you need. She has lots of thoughts about mid-training as well. It's never going to give you the signal you need to know what is the good data or the important data
in the Python, right? So in a certain sense this is just kind of a necessity. Yeah, if you gave me 10,000 H100s or B200s or B300s, I still have a limited amount of data.
Ravid Shwartz-Ziv (41:53)
But I'm not sure that I understand, because, at the end, I think these models can fit everything, right? So why use only a small subset of the whole data, right? I assume that the model can learn both math and English really well if you give it enough compute and make enough iterations, so why not?
Cody (42:14)
Sure.
Yeah, I mean, so like, I guess you're kind of saying like, why worry about the proportions or what the data is, right? You just go big enough. ⁓ And you're right, you know, if you train like a trillion parameter MOE, it'll sort of learn everything, right? You just give it enough tokens. But in a sense, though, the models are still learning these probability distributions, right?
Ravid Shwartz-Ziv (42:23)
Yeah.
Cody (42:42)
If you give it, ⁓ you know, if you just went and got 10 trillion tokens of web data, right? It just learns to like approximate the web data, right? And the problem you have is even if it has learned all of the concepts of programming, the probability of it generating that will still be very low. So like you are trying to capture some distribution of text that you want the model to be able to recall and produce easily.
and shifting that probability in post-training is pretty hard. ⁓ You're able to do it to some degree, but in certain sense, what you have when you get to post-training, you basically have what the model's priors are. ⁓ And you're able to sharpen it a little bit, but it's very hard to push it. If you don't train it on any code, it's very hard to make it, even if it's read enough concepts to grok it over the course of reading the internet, to make it actually be a software engineer.
⁓ So that's a little bit why we care about these small datasets is because they represent the distribution that we want to model. And it's very important for us to be able to kind of push the model into that direction before we get to post training.
Allen Roush (43:53)
What do you think is the future of the quality of post-training or pre-training data sets, even, now that so many of the sources for getting that data seem to be drying up in various ways? I mean, Stack Overflow is dead. The internet at large is overrun by LLM and related AI generations. Companies are becoming very data protectionist now. So anybody with any kind of unique data is hoarding it,
making it very expensive to get licenses. So what do we do in the 2030s?
Cody (44:29)
Yeah, I mean, well, I guess I had done this fun little thought experiment going back to the FineWeb blog and looking at the tokens that are being produced. So one thing that's kind of interesting is, you're right that people are becoming more protectionist. Like Reddit shut down, you can't scrape it anymore legally. And Twitter shut off its APIs, and websites are doing this. There was like a big dip
in Common Crawl tokens that occurs, maybe it was like 2022, 2023, something like this. The funny part is that it's like completely recovered. So every month Common Crawl is producing about as many tokens as it was before. Now, I mean, this is not going to get us completely out of data jail. If you need to continue to scale to these massive sizes, like,
you know, get 150 billion new unique tokens every couple of months. That's like, if you wait, yeah, if we want to train a hundred trillion token model, we'll be ready to go by, you know, in 70 years, you know, but we may want to do something a little faster. I think this is where like synthetic data has really come into like a place that's like really valuable now.
And maybe to answer like one of the questions you had is like, okay, why are we doing this mid training thing? Like my thoughts about why we do mid training has evolved a lot having seen like the more advances in synthetic data for pre-training. And in some sense, like you can actually stop doing mid training if you have really good synthetic data. It's like, if you think about the question that I was posing about, like, what were we trying to do is like, well, we have, we think we might have some small data set that's really valuable.
And maybe we figure out the answer is yes, this data is valuable. We've measured it empirically or we've got some score. And if you have really good synthetic data, then the question becomes, what if this data wasn't rare anymore? You know, what if I can take this 5 billion tokens or a hundred billion tokens and make it a trillion tokens? Then what would I do? Would I need to do this silly little thing where I kind of push the percentages up in
several phases, or could I just train on that? I think that that's an interesting thing that's gonna change in the future, is we're getting much better at that.
Ravid Shwartz-Ziv (46:51)
So do you think, like in the future, or maybe now, if I need to quantify how much each token is worth to my model... So one option is that, yeah, you have just a random token from some random dataset, like a pre-training dataset. Another one is that you have a random token from, I don't know, from post-training or mid-training datasets. And the third one, maybe it's the one that you choose
based on some FineWeb method or whatever. And the last one is synthetic data. Which one do I need to choose now to get efficient scaling laws?
Cody (47:36)
Yeah, oh, scaling laws? Or you mean like a model that will scale to good performance? Yeah, mean, the answer right now, I think, is kind of all of these, which I know is not a very satisfying answer. At Datology, we do a bunch of synthetic data. And I think like,
Ravid Shwartz-Ziv (47:41)
Yeah, a model that will scale for a good performance.
Cody (47:55)
when more models were starting to use synthetic data, maybe a year or two ago, like the Microsoft models and some of the Google models, there was this very strong knee-jerk reaction about ⁓ model collapse and dangers of using synthetic data. And I think that that mostly came from the fact that they were kind of prompting models in a way that was like pure, almost like distillation. They were like, give me a textbook about math.
you know, give me a textbook about this. And you really kind of, when you do something like that, you really limit ⁓ what you're able to get by the quality of the teacher, right? There's like how much diversity can the larger model generate and how accurate is it? And like, if you look at the kind of like the, know, we were talking a little bit about like surprise perplexity over sequences earlier, LLMs don't generate that naturally.
So you're going to get a bunch of just like really regularized stuff. Maybe some might even call it slop. ⁓ know? And I think that what has changed recently is everyone's kind of moved to this like rephrasing paradigm, which is you're actually pulling a bunch of unique seeds with a lot of diversity from the internet or from coding examples. And then you're looking at the model.
⁓ to figure out how to augment this, either to align it better to the task you want to do, or ⁓ rephrase it in a way to just increase the diversity or allow you to use it more. And in a way, this kind of harkens back to the image days where it's like, this is the data augmentation we were looking for. It's like, this is the random flips and crops and noise in some sense is like, we can do this with
with data, with synthetic data. And now we can ask the question of, okay, why mid-train? Take all the good rare data and make it not rare, you know.
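A minimal sketch of the rephrasing paradigm Cody describes: diverse seed documents pulled from the web, plus a prompt asking a generator model to rewrite each one in a target style, so diversity comes from the seeds rather than from the generator. The style list, the prompt text, and the `llm_rewrite` helper are hypothetical placeholders, not any lab's actual pipeline.

```python
# Target styles to rewrite each seed document into; purely illustrative.
STYLES = [
    "a concise, information-dense explanation",
    "a question-and-answer exchange between a student and a teacher",
    "a well-commented code walkthrough, if the source is about programming",
]

REPHRASE_PROMPT = (
    "Rewrite the following document as {style}. Preserve every fact and detail; "
    "do not add information that is not in the source.\n\n{document}"
)

def llm_rewrite(prompt: str) -> str:
    """Hypothetical helper standing in for a call to whatever generator model is used."""
    raise NotImplementedError("wire this up to your generation endpoint")

def rephrase_corpus(seed_documents):
    """Yield several synthetic variants per real seed document."""
    for doc in seed_documents:
        for style in STYLES:
            yield llm_rewrite(REPHRASE_PROMPT.format(style=style, document=doc))
```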
Ravid Shwartz-Ziv (49:57)
So what is... yeah, so now I want to create synthetic data. Let's assume that I believe that this is better. What is the best way? What is the best pipeline to make it? Does it matter if it's pre-training or post-training? Do you think we can just take post-training types of questions and just train on them? How are we going to do it?
Cody (50:20)
Yeah, so I think that the synthetic data in pre-training actually has a lot of catching up to do with the work that has been done in post-training. Like, post-training, in a lot of ways, is almost all synthetic data at this point. I'm sure that people are still getting human annotations, but probably far less than they were two years ago. Whereas, like, this was actually the bottleneck for post-training a model: you had to go pay Scale
or some company millions of dollars to give you tens of thousands of handwritten examples, right? It's like, well, now I can just go get a Qwen model or Kimi K2 or something to do something 10 times better and generate... you know, I can get a billion tokens for post-training. This just completely changed the tractability of doing post-training as a small shop. In pre-training though, it's very funny. If you look at the papers that come out for
synthetic data for pre-training, they're very simple, and they're not nearly as nuanced or as advanced as what was happening in post-training synthetic data even a year or two ago. And I kind of have this joke at work where I'm like, everything is either Persona Hub or it's Evol-Instruct, and they're just, like, in a trench coat. If you look at it, basically, those two papers are basically all of the ways that people are doing
pre-training data in some sense. You can follow it like three steps down, but you'll find, okay, there's this math synthetic thing, and you follow it down and you're like, okay, actually it's just Evol-Instruct again. Yeah.
Ravid Shwartz-Ziv (51:57)
So
they basically prompt the model to some specific definition, to some specific types of...
Cody (52:03)
Right, right, right. ⁓
yeah, I think that the prompting strategies have not gotten very advanced for pre-training. Maybe this is just because of the scale that you need to do. I think that there's a lot more interesting things that can be done around building knowledge graphs and figuring out how to combine lots of diverse data into complex ⁓ contexts to make interesting samples. ⁓
⁓ Yeah, maybe all that is to say is synthetic data in pre-training is really in its infancy and it's already unreasonably effective and I think it's going to get much, much better.
Allen Roush (52:43)
Doesn't it seem like you're trading compute for better quality? Because even Elon has gone on Twitter and said, well, we can just use LLMs to data augment what we have in a million ways to basically extend to an unlimited amount of tokens. And that seems a bit intuitive, maybe not to the extent that he proposes it, but it seems intuitive to me.
Cody (53:05)
Yeah, yeah, I mean, why not, right? Like the good thing about spending your compute on data is it's a one-time expense. Well, if you're successful, right? Or a one-time failure. And then you can use it over and over again. And so that has a lot of benefit. I think that what's interesting though is we still have to kind of learn some of the fundamentals about how models are learning these tokens
to do better with synthetic data for pre-training. Like a really simple example is like, if you were to try to train a model to do like arithmetic and multiplication, and you had kind of sort of like a toy example of this, where maybe you have like millions of examples, or basically you generate synthetic data sets that does every single example of addition of four digit numbers that can be possible, right? And then you hold some of them out to verify that the model can do this.
Like an LLM probably isn't going to learn arithmetic that way, which is kind of interesting. Like you've shown it every possible way to do arithmetic. Why can't it do arithmetic? Right. But if you train it on large enough web scale data, it will learn to do arithmetic. ⁓ And so like there's something that's interesting about how you transform these data, like what the model needs in its context to actually start developing these like circuits to learn that are not just obvious from like
this example does this thing. Apparently, that's not sufficient.
Ravid Shwartz-Ziv (54:34)
So do you think we can go all the way to pure synthetic data sets?
Cody (54:40)
So supposedly this is what the GPT OSS models were. ⁓ And my understanding for why they went all synthetic for the GPT OSS models wasn't because they thought it would be the best performance. ⁓ It was like entirely liability. Like ⁓ they didn't want to be sued for whatever kind of data sets that they train on internally. Maybe they've gotten the licenses. I'm not.
making any claim about this, but what I'm understanding from my friends that I've talked to is like, there was almost entirely synthetic rewrites of everything that went into it as like a, kind of like a licensing whitewashing. ⁓ So if you like the GPT-OSS models, that's supposedly 100 % synthetic. So not everyone likes them, so.
Ravid Shwartz-Ziv (55:29)
Yeah, it's also not clear what it means, right? Like if you just take a model and the model just generates the same data as the input, right? Does that mean it's a synthetic data set, right?
Cody (55:44)
Yeah, that's a good-
Ravid Shwartz-Ziv (55:44)
And also,
like, I think I saw on Twitter someone just jailbreak models and they generate books, right? All the Harry Potters and things like that, books from the models. And there were models that can just generate, I think it was like 95% of the Harry Potter books.
Allen Roush (56:11)
No, I've heard it was 98% now. Well, and actually, I wanted to also, related to that, you know, it seems like copyright infringement and, you know, data sets that even probably at the time were clearly, so-called, like nobody has the rights... I don't know, I work at ThoughtWorks, and so Aaron Swartz, he was at ThoughtWorks right before he died. So this is very near and dear to our hearts. And you can imagine that I'm actually allowed to kind of be a little bit
Ravid Shwartz-Ziv (56:14)
Maybe 98.
Allen Roush (56:38)
advocate a certain level of copyright abolitionism even with a little bit of my... But I guess what I'm curious is how do you see all these issues? how... Because it seems like the original sin of generative AI, as I always called it.
Cody (56:54)
Yeah, I mean, it's tricky, right? I have been surprised, with some of the court cases that have come out against Meta and Anthropic, by how supportive of their rights the decisions have been, in the fair-use direction. I don't think it was at all clear a year or two ago that we were going to see so much support for model training having these fair-use protections. I mean, Anthropic and Meta have been getting in trouble, but if you look at the decisions, they're getting in trouble for things other than actually having trained on the data: it's how they acquired it, or what their usage was, which really surprised me. I didn't think there was any version of "you've got a bunch of pirated books and you trained on them" that was going to fly, and apparently there's some nuanced discussion of that.
Allen Roush (57:58)
Well,
I mean, there has to be a little bit, right? Because if we're going to say that everybody is responsible for everything they train on: if you train on a large enough slice of the internet, as the Stable Diffusion people figured out, oops, you get illegal content. And I don't think we should be putting any AI researcher who's acting in good faith in jail over a one-billion-image scrape that just happens to contain some CSAM, right?
Cody (58:24)
Yeah. Yeah.
Ravid Shwartz-Ziv (58:25)
Yeah, but is that what we want? It's different to claim that one researcher is guilty or not versus a corporation that put billions of dollars into it and earned billions of dollars from it, right? I don't know exactly where I land on it, but I think it depends on what you want to do with these models and how much money you're going to earn from them.
Cody (58:58)
Yeah, it's very interesting. This is probably the most third-rail topic we're going to talk about all day, I'm sure. So, my wife is a graphic designer and artist, and it may not surprise you guys or any of your listeners to know that in the art community, AI is a whole other thing. It's very controversial, with fair use and copyright and things like that. Her friends have had a lot to say about this, and I'm sure it's been very exciting for her to deal with me having the career that I have while maintaining relationships with her friends. And I remember, a year or two ago, the discussion was: the problem is that the models are trained on stolen data. That was the thought process from the artists: the data is stolen, the companies are benefiting from improperly licensed data, and that's the problem. And I thought, fair point, I guess. So I was really excited when one of my interns at Mosaic was working on an open diffusion model where everything was correctly licensed, so there was no problem with the generation. I can't remember the name of the paper, and I feel bad because I can't plug it for him, but it was an all-open-data effort, so everything was fair. So anyway, they released this diffusion model that's completely, properly licensed.
Ravid Shwartz-Ziv (1:00:35)
We'll put it in after. Don't worry.
Cody (1:00:48)
And I was like, Bridget, you have to tell your friends, they'll be so excited. We did it, we stopped the art theft. They were not amused. It was very clear that whatever the argument had been, it was a convenient argument, and what they were really upset about was the potential for lost revenue and jobs. Which, okay. I don't know why I had this optimistic view that we were going to change everyone's mind as soon as we listened to their demands. It was actually, "I've got all these fears about the future and technology, and this has done nothing to convince me that AI can be good."
Allen Roush (1:01:22)
Yeah, I'll just add to that. I will say, the reason you might have assumed that is that people in the art and similar creative communities have often professed
political egalitarianism throughout history, up to and including, at one point, being card-carrying supporters of groups like the Electronic Frontier Foundation and others. And wow, how the turns have tabled as things shifted. A lot of people became copyright trolls defending the RIAA and MPAA, the folks suing grandma circa the year 2000 for trillions of dollars. I did not realize how quickly that would shift in the public, even among the very same people who expressed these so-called egalitarian beliefs.
Cody (1:02:10)
Yeah,
yeah, so that's probably as much as I'm going to say about copyright.
Ravid Shwartz-Ziv (1:02:17)
So I want to go back to the efficiency of synthetic data. Right now we're generating synthetic data in token space, right? We generate the actual text. But in principle we don't need to; there are all these models, like Coconut, for example, that do a kind of chain of thought in the representation space, or looped transformers, things like that. Do you think we can move from synthetic data as tokens to synthetic data in the representation space, and actually make synthetic data in a more suitable or more efficient form that models can use without tokens?
Cody (1:03:07)
Yeah, it's a good question. And I think the lines really get blurred when you talk about doing this data in the latent space: how is that distinct from doing knowledge distillation? When I was in grad school, I actually really wanted to make a bunch of synthetic data in latent-space form. I had this whole project where I was trying to generate feature maps of ConvNets and make distilled versions of every feature level, diverse sets of things, to maximize neuron activity and so on. I think it just ends up being very hard to know what you're doing. If you don't have a way to ground yourself in what you're getting the model to do, and you're generating stuff, it's very hard to measure whether you've been successful. I'm not saying it won't work; it's just hard to imagine a way to move quickly in that space.
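One classical technique in the spirit of what Cody describes, synthesizing "data" directly in a network's feature space, is activation maximization: optimize an input so a chosen feature-map channel fires strongly. This is a hedged PyTorch sketch, not a reconstruction of his grad-school project; the model, layer, channel, and hyperparameters are arbitrary choices.

```python
import torch
import torchvision.models as models

def synthesize_input(model, layer, channel, steps=200, lr=0.05):
    """Activation maximization: optimize a random image so one channel of an
    intermediate feature map fires strongly, i.e. create an input shaped by
    the network's latent space rather than by real data."""
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(feat=o))

    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        model(x)
        loss = -acts["feat"][0, channel].mean()  # maximize mean activation
        loss.backward()
        opt.step()
    handle.remove()
    return x.detach()

if __name__ == "__main__":
    net = models.resnet18(weights=None).eval()
    for p in net.parameters():
        p.requires_grad_(False)  # only the input is optimized
    img = synthesize_input(net, net.layer3, channel=5)
    print(img.shape)  # torch.Size([1, 3, 224, 224])
```

As Cody notes, the hard part is not producing such artifacts but grounding them: it is unclear how to verify that latent-space data like this is actually useful for training.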
Ravid Shwartz-Ziv (1:04:15)
But in some way all synthetic data is distillation, right? It still comes from a model. So do you think there's something fundamental there, or is it just easier to base it on text?
Cody (1:04:24)
Yeah, yeah
Yeah, I mean, I do think there's a distinction between the textbook-style pure-generation approaches, where all of the text is generated from a teacher model, and approaches where you're doing rephrasing or augmentation. That's very constrained generation in some sense: it's not all from the model, and you're injecting novelty and new information by rephrasing the source. So you get something very different than you would from distillation.
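A rough illustration of that distinction: pure generation samples everything from the teacher, while rephrasing anchors the teacher to a real source document so the novelty keeps coming from the data. The prompt wording here is made up, and `generate` is a placeholder for whatever model call you actually use.

```python
from typing import Callable

REPHRASE_PROMPT = (
    "Rewrite the following passage as clear, factual prose in the style of a "
    "textbook. Keep every fact and number from the original; do not add new "
    "claims.\n\nPassage:\n{document}\n\nRewrite:"
)

def rephrase_document(document: str, generate: Callable[[str], str]) -> str:
    """Constrained augmentation: the teacher only restyles `document`,
    so the information still comes from the source text, not the teacher."""
    return generate(REPHRASE_PROMPT.format(document=document))

def textbook_generate(topic: str, generate: Callable[[str], str]) -> str:
    """Unconstrained 'pure' synthesis for contrast: everything here comes
    from the teacher model's own distribution."""
    return generate(f"Write a short textbook section about {topic}.")
```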
Ravid Shwartz-Ziv (1:05:05)
And on a related topic: do you think we can get very efficient data creation, like synthetic data in the style of Phi, right, the canonical example, the Phi series of models? Can we push it further? Can we get, say, 10 examples that teach the model everything about math?
Cody (1:05:32)
Yeah, this is a hard thing to know, right? I think you're going to be more successful, no matter what, the more starting examples you have. We're certainly going to get to a place where having 10 examples lets you do much better than having no examples; it's better than nothing. But I would be very surprised if you get the same quality from tens or thousands of examples as you would from millions or billions of examples. So I think there is some sort of constraint there.
Ravid Shwartz-Ziv (1:06:11)
And do you think this is different between pre-training and post-training? If we have enough data in pre-training, can we make this kind of few-shot synthetic data work better?
Cody (1:06:25)
I do think it's different. It must be different in some sense, because synthetic data was so much more successful so early in post-training and RL. There's clearly something there where it's much easier to define the task. Are we going to make better models? Absolutely. But I'm not sure I understood what the question was for what you want to do in pre-training.
Ravid Shwartz-Ziv (1:06:55)
So, the question is this: let's assume we're creating very efficient examples. One option is to take these very efficient examples and distill, say, the billions or trillions of examples in pre-training down to one million examples, and now you train on that one million. That's one case, for pre-training. The other is post-training: assume I already have billions or trillions of examples in pre-training; can I then use just 1,000 synthetic examples in post-training to align the model? Do you think those two cases are different?
Cody (1:07:46)
Yeah.
So here, maybe one thought I have about how we're doing synthetic data and mid-training, and how we're aligning post-training and pre-training: I think there's a really exciting direction that people are beginning to head toward, and it's where pre-training is going to take some mind share back. RL and RL environments are the hot thing right now; there are dozens of RL-environment startups, and not for no reason. If you can define your task, you can put a model in that environment and train it to do the thing. My understanding of where this gets difficult is that once you leave the model's normal safe space and try to get it to do genuinely novel things, it's very hard to get successful rollouts. Your success rate when you start the task might be 1% or 2%, and then it's almost impossible to reasonably train in that environment, because you're starting from basically nothing; the model just doesn't know what to do. What's really exciting is that people have been demonstrating with mid-training, which when I was doing it originally was just "I can't get signal, I don't know if Wikipedia is good," that you can actually shift that starting rollout percentage. If we can find the right data, we can take your RL success rate at the start of training from 1% to 10% or 20%, and that's a huge game changer; now you can actually do the RL. So I think the interesting question isn't really what you do with synthetic data in post-training versus pre-training. It's how we close this loop: take where we're failing in post-training, find the useful examples, the successful or unsuccessful rollouts, or a model that can do the task versus the one we're trying to train, figure out how to contextualize that problem, bring it to pre-training scale, and use mid-training to align the model. That's something I'm really excited about: can we make pre-training cool again? Can we use mid-training to make RL more efficient and more successful, and then just close the loop? Then instead of asking ourselves pre-training, mid-training, post-training, what's the difference, we just have training.
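Mechanically, "closing the loop" might look something like the sketch below: measure starting rollout success per task, and for tasks near zero, build mid-training data that covers the missing prerequisites before re-attempting RL. Every helper here (`run_rollouts`, `generate_related_documents`, `midtrain`) is a hypothetical stand-in, not an existing library.

```python
def close_the_loop(model, tasks, run_rollouts, generate_related_documents,
                   midtrain, min_success=0.05, rollouts_per_task=64):
    """Sketch: find tasks where RL would start from ~0% success, build
    mid-training data that teaches the missing prerequisites, then hand back
    a model with a workable starting success rate for RL."""
    weak_tasks = []
    for task in tasks:
        successes = run_rollouts(model, task, n=rollouts_per_task)  # list of 0/1
        if sum(successes) / rollouts_per_task < min_success:
            weak_tasks.append(task)

    # Turn failures into pre-training-scale data: documents covering the
    # prerequisite knowledge and formats the failing tasks assume.
    corpus = []
    for task in weak_tasks:
        corpus.extend(generate_related_documents(task))

    return midtrain(model, corpus)  # then re-run RL on the returned model
```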
Allen Roush (1:10:37)
Well, and
then what about incremental and continuous training, right? Just keep doing it on new prompts on the fly?
Cody (1:10:44)
Yeah, I mean, you'd have to define what continual learning truly is; I'm not sure I understand what everyone means by it. But there are definitely incremental things you can do, right? As you discover tasks where you have bad coverage, you can generate new data and re-kick-off these pipelines. The real continual learning, the holy-grail continual learning, that frightens me in some sense.
Ravid Shwartz-Ziv (1:11:18)
I just saw someone post that there's a lot of discussion about continual learning and a lot of proposed solutions, but at this stage it looks to me like you can just put everything in a memory.md file, and I'm afraid to ask what the actual problem is.
Cody (1:11:37)
Right, yeah. I mean, it's a good question, right?
Ravid Shwartz-Ziv (1:11:45)
But okay, do you think we can actually combine RL into this pipeline, to help us filter the data and come up with better data points?
Cody (1:11:59)
Yeah, absolutely. Like I said, I think there's a lot of interesting work here. There's a paper, the OctoThinker paper, where they did exactly this: they took one of the weaker Llama 3B models, where you basically couldn't get any success rate in RL on these math topics, did a bunch of mid-training on it, and then, boom, now it's just as good as Qwen at RL when you want to teach it these things. It's a fairly simple example, and, you know, you've got to be gentle with the grad students, but it still matters; I think it's a valid proof of concept. If you've defined a verifiable task or an environment and you can't really get a model to do it, there probably is a way to close this loop with mid-training: figure out what the gap in its capabilities is, what prior it needs, what distribution of tasks it needs to understand as a prerequisite before the RL step becomes valuable. Getting better at that iterative process, going from "we can't do the task" to "we can generate data that helps do the task," is a really exciting direction for data in the future.
Ravid Shwartz-Ziv (1:13:23)
And do you think in the future we'll see models that are more specialized for one domain and one problem, or foundation models that are good at everything? Of course it's related to the data and how they're trained, right?
Cody (1:13:37)
I mean, we're always going to get the frontier models
that are good at everything, right? But I think we're going to see a lot more domain-specific models. I feel like people have been calling this for four years, so I don't want to be the latest person to keep predicting it, but what has fundamentally changed is that all the tools you use to build these are so much cheaper and better now. Like I mentioned: if you wanted to release a post-trained model in 2022, you needed an army of annotators to give you preference data, you had to train preference models on that data, you had to have done a good job on inter-annotator agreement, and then you could do your preference RL or offline RL. Now you can just get Kimi K2 to judge everything you do. And there are lots of very smart people out there; lots of Fortune 500 companies and enterprises have little pockets of very talented people who know their domain very well. What was missing is that they're not going to get a $10 million annotation contract from Scale, and they're not going to get an army of engineers to do all the other stuff. But I think it's very tractable now to say, okay, I actually want to make a very specialized model to handle, I don't know, labels for insurance claims. You can do that now, and I think it's going to be a lot more popular.
Allen Roush (1:15:21)
Well, I'm curious on the domain-specific model side. It always seemed like we were going toward bigger models, bigger datasets, and more generality, right? Including Claude Code being used for much more than code.
Cody (1:15:35)
Yeah, the thing is, there's only so much you can stuff into a model without making it bigger, and at a certain point the scale differences make the economics stop making sense. You can train a sparse, say, 26B-active, one-trillion-parameter model to do everything, and maybe you can RL it to do your legal document classification or your finance work. But you could also get a 10B dense model to do that, if you were able to find the good data and train it for the task, maybe with distillation or whatever. There's going to be a point where the inference cost of doing this at scale just demands that you find a smaller model. And it's not really about capabilities: you're always going to start with the best model that can do your thing, and hope that in the world of models that exist there's one that can. But eventually, once you've proven out the business need and you want to run it at really large scale, you probably just want a model that's specialized for it.
Allen Roush (1:16:46)
That makes sense.
Ravid Shwartz-Ziv (1:16:47)
Okay.
Maybe we'll move on to questions from the audience. There was one question about the pipeline for efficient sampling, or how to select data efficiently. They point out that the pipeline is mostly about representation: you take your original data, extract a representation from a pre-trained model, and then select on it, and we talked about how that selection can be based on an LLM as a judge, diversity, confidence, whatever. One of the common concerns is the pre-trained model itself and the biases it injects into the selection. Do you think we'll see ways to improve this pipeline so we can get away from the biases of these models?
Cody (1:17:54)
Yeah, it's a good question. In some sense you're never going to be completely free of it, right? If you do something like n-gram models, or some hybrid embedding search, you just move from one bias to another bias. A good option, maybe, is to build in diversity, whether that's diversity in your judging models and prompts, or in how you do the feature selection. But you're still kind of stuck: there's going to be some bias that gets injected, and you have to ask yourself whether you can live with it and whether you're accounting for it.
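One simple way to spread, rather than remove, that bias is to cluster the candidate pool in embedding space and select within each cluster, so no single region of the scorer's preferences dominates. A minimal scikit-learn sketch, where the embedder, cluster count, and per-cluster quota are all illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_select(embeddings: np.ndarray, quality: np.ndarray,
                     n_clusters: int = 100, per_cluster: int = 50,
                     seed: int = 0) -> np.ndarray:
    """Pick the top-`per_cluster` documents by quality score within each
    embedding cluster, so the selection is not dominated by one region of
    the scoring model's preferences. Returns indices into the pool."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init="auto").fit_predict(embeddings)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        top = idx[np.argsort(quality[idx])[::-1][:per_cluster]]
        chosen.extend(top.tolist())
    return np.array(chosen)
```

This still inherits the biases of whatever embedder and quality scorer you plug in; as Cody says, you are choosing which bias to live with, not eliminating it.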
Ravid Shwartz-Ziv (1:18:46)
Another question is about the optimal data mixture and schedule and how they interact with scale. What's optimal, both in terms of tokens and model scale?
Cody (1:19:03)
Yeah, this is a great question, and one of the most infuriating ones. The problem is that the right data mixture does change with scale, for a couple of reasons. One is that it especially changes when you're data-constrained. If you're designing a data mixture for 10 trillion tokens versus 20 trillion tokens, regardless of what capabilities you want, if you try to obey the heuristic rules about how many epochs you can reuse the data for, you're going to have to build a different dataset, or make some severe compromises or interesting choices. Maybe that means much more aggressive mid-training, where you use far less code at the beginning and then dramatically increase it at the end. Those compromises are relatively straightforward, not any more fun to deal with, but obvious, I guess. And then there definitely are interactions with scale even when you're not data-constrained at all. When you're training relatively smaller models, relatively being the key word, even an 8B model trained for a trillion tokens, which would make any grad student salivate, is a small model, the optimal mixture looks very different from the optimal mixture for a big several-hundred-billion-parameter MoE. You just need to be more aggressive with the distributions in a smaller model: you're really trying to over-emphasize all the small tasks you want it to do, so you go way heavier on code, way heavier on the specific tasks. As the model gets larger, you want to increase the diversity, because on one end you don't get anything more out of over-emphasizing those narrow tasks, and on the other end you're probably no longer able to capture exactly what you want it to do. If you train an 8B model, you know you want it to write some basic Python programs, and you'll be very happy if it can do that and repair code. But if you train a very large model, even your ability to define what you're trying to get it to do is very difficult, so you have to hedge your bets against what you think the data is doing, accept that you're going to put some weird stuff in, and hope there's something in the distribution that captures what you want. That's the scale piece. As for the optimal mixture and schedule interacting with it, I think it's the same story: you need to be much more aggressive about the mixture and the scheduling in smaller models, or it seems like you do, and with a larger model you can be more subtle with these distribution shifts.
Ravid Shwartz-Ziv (1:21:51)
Mm-hmm.
Cody (1:22:09)
And it seems like you get better results in that sense. Yeah.
Ravid Shwartz-Ziv (1:22:13)
When
you say scheduling, you mean the training dynamics, right? Like where in training you put different types of data.
Cody (1:22:21)
That's right. Yeah. So, I guess we've been dancing around the word mid-training this whole interview without actually describing what I mean mechanistically. It can mean lots of things, but generally what I mean is that there's a point in training where you change the distribution, you change the percentages. At a high level, imagine you have some hierarchical buckets: your code data, some math and STEM data, and then your web slop down at the bottom. Maybe at the beginning it's much more web slop and much less code and math, and then there's a discrete point where the code data goes from, say, 15% to 20%. That's roughly what people were doing a year or two ago, and now it's broken up into even more phases, more discrete points: two, three, four phases of data changing where you do these distribution shifts. And what I'm saying is that for smaller models you actually want those shifts to be even more dramatic and extreme, because you need them to elicit the skills you want, and for larger models it seems like you don't want to do that quite as much.
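The kind of phased schedule Cody describes can be written down as a small table of distribution shifts plus a sampler that switches at token boundaries. The domains, percentages, and phase boundaries below are made-up illustrations of the shape, not a recommended mixture:

```python
import numpy as np

# Fraction of total training tokens at which each phase starts, plus the
# sampling weights over data buckets during that phase (weights sum to 1).
PHASES = [
    (0.0, {"web": 0.70, "code": 0.15, "math_stem": 0.15}),
    (0.7, {"web": 0.50, "code": 0.25, "math_stem": 0.25}),
    (0.9, {"web": 0.20, "code": 0.40, "math_stem": 0.40}),  # aggressive final shift
]

def mixture_at(progress: float) -> dict:
    """Return the domain weights for the current point in training,
    where `progress` is tokens_seen / total_tokens in [0, 1]."""
    current = PHASES[0][1]
    for start, weights in PHASES:
        if progress >= start:
            current = weights
    return current

def sample_domain(progress: float, rng: np.random.Generator) -> str:
    """Draw the bucket to sample the next document from."""
    weights = mixture_at(progress)
    domains = list(weights)
    return rng.choice(domains, p=[weights[d] for d in domains])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(mixture_at(0.95))         # {'web': 0.2, 'code': 0.4, 'math_stem': 0.4}
    print(sample_domain(0.1, rng))  # e.g. 'web'
```

Per the discussion, a smaller model might use sharper jumps between phases, while a larger model would keep the shifts more gradual.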
Ravid Shwartz-Ziv (1:23:42)
Okay, one last question; we touched on it a bit already. Do you think at some point synthetic data will be fundamentally superior to the human distribution it's based on?
Cody (1:23:57)
Yeah, I saw that question come up on Twitter, and I think it's a good one. It depends a little bit on how you define it, which is kind of a cop-out answer. I think there will be a point where training on synthetic data is better than training on human data. Do I think models are fundamentally better than humans at producing individual samples? I don't know if that's true. Maybe it is; maybe we've reached the point where we have God in a box, the machine god has been built, and it will become a better writer than every human. But I don't think that's true yet. What I do think is very likely is that most of the data, even the good, high-quality data we want, isn't really designed to teach models the things we want them to do. So there will be a lot of value in transforming the data to do what we want the model to do, and I think that will exceed natural-data performance at some point, where you'd be silly to train on purely non-synthetic data; you should definitely rephrase your data or do something to it.
Ravid Shwartz-Ziv (1:25:05)
Okay, I think we're out of time. Is there anything you want to add, or to advertise?
Cody (1:25:14)
No, I probably should have thought of something, but it was really great. I'm really thankful you had me on. Yeah, it was a fun conversation.
Ravid Shwartz-Ziv (1:25:25)
Thank you so much, Cody. And thank you, Allen.
Cody (1:25:28)
Yeah, all right. See you guys.
Allen Roush (1:25:31)
It was a pleasure to be here and a pleasure to get to meet you, Cody, and to pick your brain.
Cody (1:25:36)
Yeah, hopefully we'll meet in person sometime. I'll let you know next time I'm in New York.
Ravid Shwartz-Ziv (1:25:36)
Thank you so much. Yeah.
Great. ⁓ Bye, everyone.
Cody (1:25:43)
Bye.