April 15, 2026

The Future of Coding Agents with Sasha Rush (Cursor/Cornell)


We talked with Sasha Rush, a researcher at Cursor and professor at Cornell, about what it actually feels like to be in the heart of the AI revolution and build coding agents right now. Sasha shared how these systems are changing day-to-day work and how it feels to develop them.

A big part of the conversation was about why coding has become such a powerful setting for these tools. We discussed what makes code different from other domains, why agents seem to work especially well there, and how much of today’s progress comes not just from better models, but from better ways of using them. Sasha also gave an inside look at how Cursor thinks about training coding models, long-running agents, context limits, bug finding, and the balance between autonomy and human oversight.

We also talked about the broader shift happening in software engineering. Are developers moving to a higher level of abstraction? Is this just a phase where we “babysit” models, or the beginning of a deeper change in how software gets built? Sasha had a very thoughtful perspective here, including what he’s seeing from students, researchers, and engineers who are growing up native to these tools.

More broadly, this episode is about what it means to do serious technical work in a moment when the tools are changing incredibly fast. Sasha brought both optimism and skepticism to the discussion, and that made this a really grounded conversation about where coding agents are today, what they are already surprisingly good at, and where all of this might be going next.


Timeline
00:00 Intro and Sasha joins us
01:11 What “coding agents” actually mean
02:34 Why coding became the breakout use case
08:56 Long-running agents and autonomous workflows
15:08 How these tools are changing the work of engineers
17:15 Are people just babysitting models right now?
22:11 How Cursor builds its coding models
26:29 Rewards, training, and what makes agents work
34:53 Memory, continual learning, and agent communication
38:00 How context compaction works in practice
41:29 Why coding agents recently got much better
50:31 Refactoring, maintenance, and self-improving codebases
52:16 Bug finding, oversight, and verification
54:43 Will this pace of progress continue?
56:42 Can this spread beyond coding?
58:27 The future of Cursor and coding agents
1:03:08 Model architectures beyond standard transformers
1:05:37 World models, diffusion, and what may come next


Music:

  • "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
  • Changes: trimmed

About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.

Ravid Shwartz-Ziv: Hi everyone, and welcome back to The Information Bottleneck podcast. Today we have Sasha Rush, a researcher at Cursor and also a professor at Cornell. Hi Sasha, thank you for coming.


Sasha Rush: Thanks for having me.


Ravid Shwartz-Ziv: And, as always, Allen is with me.


Allen Roush: Nice to see you, Ravid, and nice to meet you, Sasha. I've used Cursor quite a bit over maybe the last year and a half or two years now, and I really like it, so it's really cool to come here and get to chat with you.


Sasha Rush: Yeah, I'm also a big user myself. So I appreciate what the product folks here do.


Ravid Shwartz-Ziv: So today we will talk about Cursor, and more specifically about AI coding agents. This is a very hot topic, right? Everyone is talking about it, trying it; people have very strong opinions about it. But maybe let's start from the beginning. What do coding agents mean, at least for you?


Sasha Rush: Yeah, it's a good question. So I think there are two technologies that came along roughly, man, almost a year and a half ago now. One was the ability to train models to reason. This is with the O1 series of models and then the R1 open-source release. And I think what those models showed is that if you train on a specialized task with some sort of ground-truth data, you can get the models to think about the problem and do better with more time. And what that really showed was that we could build specialized systems, as opposed to general-purpose language models, that could be arbitrarily good at one given specialized domain. At the same time, there was a concurrent development of these agentic or function-calling models. These models had the ability to use basic tools in an environment. And I think the convergence of those two technologies, both being able to use tools and being able to reason about challenging problems, really made the world think: how can we use this technology? How does this go beyond chatbots? And those coming together really targeted the coding space. I think that's when we started seeing this technology kick off.
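The reason-then-act loop Sasha describes can be sketched in a few lines. Everything below (the model call, the tool registry) is a hypothetical stand-in for illustration, not any specific vendor's API:

```python
# Minimal sketch of a reasoning + tool-calling agent loop.
# `call_model` and TOOLS are hypothetical stand-ins, hard-coded
# so the example is self-contained.

def call_model(messages):
    # Stand-in for an LLM call: returns either a tool request
    # or a final answer once it has seen a tool result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_file", "args": {"path": "main.py"}}
    return {"answer": "main.py defines an entry point."}

TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
}

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:  # model decided it is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(run_agent("What does main.py do?"))  # main.py defines an entry point.
```

Real agents run this loop hundreds of times per query, alternating reasoning tokens and tool calls, which is exactly the convergence described above.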


Ravid Shwartz-Ziv: And why coding? Why do you think coding is maybe the best use case for this?


Sasha Rush: Yeah. Well, clearly there is the economic value of building better coding models. It's a very challenging problem and it's a core space. We've also seen, in the years before that, that coding was a very rich domain of pre-training data; sources like GitHub or open-source repos allowed the base models to be extremely good at coding even before we get to reasoning or agentic function calling. But beyond that, I think coding is just an incredibly interesting problem in that each code base is like a mini database of information you need to look over. You have this embedded information-retrieval problem, and you have to do reasoning within this domain to find an answer. Even today, when you use a model, it's making hundreds of these little agentic calls to try to find information that's important, to think about it, to figure out how to answer your query. And then, even beyond that, coding is an interesting area for reward signal, in that there's a clear sense of having done something correctly or incorrectly. That differs from areas like, say, fiction writing, or even chatbots generally, where it's much harder to figure out what a good response would be.


Allen Roush: I'm glad you bring up fiction writing, because that's something where I've spent a lot of time trying to improve models. One thing I've noticed from some of that work is that there are these slop profiles in responses, which I'm defining as overrepresented tokens or phrases, like "it's not X, it's Y," or words like "certainly" or "delve." Do you notice any equivalent to this with code generation at all? Obviously there's the concept of it in the comments, right? But are there detectable patterns in code? I've read a lot of AI-generated code, but I haven't seen anything very obvious, except maybe trying to statically type variables without being asked in Python. Cool, I mean, I'm not complaining, but that's okay. Anyway.


Sasha Rush: Yeah, it's a good question. I think you absolutely do see that kind of templated structure in the language itself. One place you'll see it is sometimes the status messages it responds to the user with. You'll also see it in the thinking of the model itself: it'll use similar phrasing for the reasoning patterns it's producing. You're right that you don't see it as much in the actual code it generates. Although I've seen arguments for things like try/catch blocks or comments that have a templated format. And I think part of that might be that, because it has a clear reward signal, it'll prioritize that over being more or less diverse, and end up more templated in its style.


Ravid Shwartz-Ziv: And also, I think in natural language there is something very human, while coding is very synthetic in some sense. You have a very specific purpose, a very specific goal you need to achieve. So it would be harder to detect "this doesn't sound human" the way we can in natural language.


Sasha Rush: I think that's fair, although I think it's maybe been overemphasized how critical only getting the right answer is in code. Particularly for a place like Cursor, our goal is not to win a coding contest. Our goal is to have the agents produce code that works well within a complex, multi-user code base. And so things like matching the style of the code base, matching the use of the right functions or the right APIs, not just the thing that works, are actually a huge portion of what we want our models to do.


Allen Roush: And, you know, I've recently become aware of the concept of LSPs, or language server protocols, which I assume you know about, right?


Sasha Rush: Yeah, yeah.


Allen Roush: Okay, cool. And maybe first of all, since I assume a lot of the people listening might not know about it, could you talk about it? And then, furthermore, I'm fascinated by how these can improve AI agents by making it more efficient for them to move around a code base and inspect it. Is this something that Cursor is using right now in its agents, or does it plan to in the future?


Sasha Rush: It's a good question. So Cursor, the IDE, is built on VS Code, and so it connects naturally into whatever language servers a user has. Early on, there were efforts to connect to that in various ways, as part of autocompletion or as part of tool calling. In recent years, honestly, the models have gotten extremely good, and so you can use a somewhat stripped-down set of tools and information and still do extremely well. More basic tools, like retrieval or various ways of searching a code base, seem to work fine. And so we often train our models not to rely on too much external scaffolding. One area where things like LSPs do come into play is linting. The agent will often utilize whatever abilities it has access to in order to check that the code works in the way the user has set up their environment.


Ravid Shwartz-Ziv: So what do you think are the use cases where coding agents are the best? Or where we should use them, right? There was a recent post from Karpathy about autonomy, where it automated the experiments, right? I think it was NanoChat; it tried to optimize it. They gave the agent full autonomy and it just optimized the validation loss.


Sasha Rush: Yeah, yeah.


Ravid Shwartz-Ziv: It wrote code, whatever, and came back overnight about 0.2% better, and everyone was really excited about it. And then it was found that sometimes it had just changed a random seed, things like that. What do you think? Do you think these kinds of things are...


Sasha Rush: Yep. Yeah.


Ravid Shwartz-Ziv: ...the future, you know? Coding agents closing the loop and just removing all the, I don't know, all the work that we would otherwise need to do as researchers or software engineers, or part of the work?


Sasha Rush: Yeah, yeah. So, wow, that's a big question. Let's split it up into parts. The first thing I think is relevant is this question of long-running agents. I'm generally actually a pretty skeptical person about technology, but the long-running agent stuff over the last couple of months has been a really shocking and interesting development. Our thesis at Cursor is that we're moving from a world where the user uses a CLI or an IDE to do interactive coding to a world where you specify a long-running, challenging problem and declaratively let the agents do their best to solve it. And what we've found, having built these tools internally, is that we can specify very hard problems with a clear hill-climbing signal, so they know they're getting better, and just let the agents run on them. And things work really well. In fact, I think our number is roughly 35% of the PRs at the company have come from our cloud agent system, where people just use a website to tell the agent what to do and let it run on its own. I don't think it has really hit the mainstream yet that this works. But we have all these experiments internally to do things like building a browser from scratch, or Anthropic built a C compiler. And we've found again and again that when we can specify a problem and just let the agents go, they're really good at solving it. And it seems like they're going to get better and better, even on the scale of months, at that sort of problem. So, thinking about what problems work in that setting: think about what kinds of things were thought to be hard but are now maybe easy if you have that at your fingertips. And also, from our perspective, building the right models and UI to actually utilize that is a really interesting thing.
Karpathy has really been a thought leader on all of this. I mean, he came up with a lot of the terminology for it, and since, I don't know, the RNN days he's been someone I've looked to to understand how some of this stuff is going. I think his work touches on where research may be going. There's been a lot of talk about what we should call this, or what this is; some people call it recursive research or AI-assisted research. But it seems like the combination of these long-running agents with problems that require thinking through and designing better systems, a step beyond hyperparameter tuning, like architecture or optimizer search, or really just implementing a well-specified idea and doing the work to try out that experiment: that feels basically here. I know that in my day-to-day research work, I'm mostly asking an agent to monitor experiments, to debug logs, to restart jobs if they fail for unknown reasons. And having that kind of assistance, while it maybe is not coming up with the next new idea, makes the process way smoother and lets you iterate way, way quicker. Honestly, that's gotten better even over the last couple of weeks. So I think we're all trying to figure out where this is going to go going forward.


Allen Roush: Yeah, I've been absolutely fascinated too. I recently had a task to deploy a large, well, three-node NVIDIA cluster, so 24 H100 GPUs, doing everything the NVIDIA way: NVIDIA Grove, NVIDIA Dynamo, all these tools I'd never used before. And let's just say my friend Opus, the one-million-context version, just chugged through it, and basically needed only a minimal


Sasha Rush: Ha


Allen Roush: ...amount of additional prompts. I would call it two- or three-shotted. My jaw has just hit the floor at the quality of these models for most technical tasks. And so I'm actually wondering your opinion on the effects on the wider software-as-a-service industry. Wall Street seems to think there's an apocalypse there. And then also on maybe the software engineering job market, because one thing that's been really surprising is, even though I've predicted doom and gloom, almost everybody I know who works in AI is working the hardest they've ever worked in their lives. And I'm starting to see other people say, hey, Jevons paradox is more extreme than most people predicted, and now we need even more software engineers to control and be puppet masters for all of the agents we're talking about. So what are your thoughts on all of this?


Sasha Rush: Yeah, let me punt on some of the economic questions for now, just because, besides listening to Odd Lots, I don't have too much direct insight into that problem. I can talk from personal experience: I'm certainly working the hardest I've ever worked in my life. And one of the reasons it's interesting and hard is that it's very intellectually stimulating. Seeing what people who are native to this technology are doing, how they come in every day with new and creative ways of utilizing it, changing how their workflow works, figuring out things that are just totally different from someone who was trained in the field for 20 years. It definitely feels like a shift. And when that happens, there's this fear that you can't do things the same way, also a bit of FOMO, like, I need to have it optimized in this specific manner. But also, I don't know, it's kind of inspiring to see what very creative, hungry, talented people do when given a new tool. I sometimes go for a walk with someone on a team and just try to understand how they're thinking about this, how it works, that kind of thing. One thing that's come up recently is a lot of data-science kind of work, where I'm computing statistics over large sets of data to try to understand certain phenomena. A lot of that is about compressing information in some way that you can aggregate. It just feels like that's not necessary anymore. You can run high-throughput, fast models over the data and move from statistics to just full-on reading of these documents. We need to figure out how that works, or how best to do it; it's not obvious how to set up that kind of pipeline. But, I don't know, it's very different.


Ravid Shwartz-Ziv: So, two questions. First, do you think this is, let's say, a phase where we need to babysit the models? Like, a year from now, will these models not need us? They won't have these moments of "oh yeah, I found that you're actually correct, I was wrong before, this is the smoking gun," and all these patterns.


Sasha Rush: Yeah.


Ravid Shwartz-Ziv: So do you think this is an intermediate step? That's one question. And the second question: do you think the people who are native to this technology, who are babysitting models all day without actually writing the code, will have enough skills to do the tasks that we actually want done in the future?


Sasha Rush: Yeah, let me talk about the positive case, and then we can talk about the negative case a little more. The positive case is that people are moving to a higher level of abstraction. The people who are very good at this are learning a skill that looks less like writing a software file and more like setting up, I think the analogy people are using is a factory, or some set of interacting parts that run autonomously together. And there are all sorts of challenges in that: you need really good tests, you need to make sure things are working well together, you need to make sure things haven't gone off the rails in that process. Developing that skill looks like someone using a high-level programming language compared to people who are used to writing assembly code. You've entered this higher level of abstraction, and that is something people will need to learn how to do; there are all sorts of skills involved in that. It also seems extremely correlated, at least as I've seen it so far, with being very good at programming. The people who are good at this seem like the same people who have learned how systems work, are detail-oriented, and have that ability. The bear case is that I am kind of worried about education. To get to the level where you were really good at programming, had that detail orientation, understood what tests were, understood what the systems were, did require a lot of work. And I'm worried that we're not keeping up with conveying why that's important, with teaching students to learn how to do that well, just because it can be easy to skip that now by using AI to avoid the hard work in that process.
And I do think we'll get to a point where we figure out how to teach this, and how to have classes that really push students hard to use this kind of technology. But the fear is that right now it's too expensive to have people just use it as undergrads. And also it's moving so quickly that it's hard to build those courses.


Ravid Shwartz-Ziv: But now, when you get new interns or new juniors, do you see these pros and cons? On one hand, they're really native to this technology; on the other hand, they don't have some of the basic skills you actually need. Do you see it in practice? Or, as you say, is that a future situation?


Sasha Rush: Yeah, I must say I haven't seen it in practice. It might be that Cursor is a biased sample, that the people who pass through here are ones who have gone through this. I mean, I know we have really hard coding interviews. So we're filtering on maybe the last generation's ideas, but we've found that to be effective.


Allen Roush: I mean, maybe I shouldn't say it, but there are increasingly sophisticated ways to cheat on coding interviews. I remember there was a whole LinkedIn thing about a guy who made software designed for it, and he got blacklisted from his, was it an Amazon job or something? I'm wondering, have you had...


Sasha Rush: Yeah, so we do in-person interviews. You have to come into the office. Yeah, in-person, no-shoe interviews.


Ravid Shwartz-Ziv: No, but... it's Cluely, something like that, right? But the problem... I think I saw this week that he was lying about the ARR or something like that of his company. He stated some totally fake number when an interviewer asked, and then it turned out it couldn't be true, because he would have needed a lot of users for that, and he just said, oh yeah, I actually lied.


Sasha Rush: Hahaha


Ravid Shwartz-Ziv: But okay, okay, so maybe let's talk about how to build these models, right?


Sasha Rush: Yeah.


Ravid Shwartz-Ziv: I think you're the perfect person to ask about how to build these models. What do you think we actually need? What are the differences between base models and agentic coding models? What is the process? How do you look at the scaffolding? So many questions, so you can start wherever you want.


Sasha Rush: Yeah, yeah, absolutely. So I guess I should introduce what we do here at Cursor. We have a model, we call it Composer, and we try to make it better and better. One thing that's different about Composer compared to other models is that it is only trying to do coding. We're trying to build a model that is extremely good as both an interactive and a cloud coding model. The way we do that is we start with a pre-trained base model, then we run a mid-training phase to make it better at coding, and then we do post-training. I think one thing that is also unique about our process is that we started scaling this up after people understood the pre-training and mid-training process really well. So we're focused primarily on this last phase: how you make the model really specialized and good at one task. Now, that being said, there are tons of research questions in how you do that, tons of infrastructure questions. And we really feel like we haven't even scratched the surface yet of how far we could go on this part of the process. So let me talk through roughly what this looks like. There are going to be three phases. The first is that you need a harness: some way of running a realistic coding agent offline. The way we do that is we basically use our production infrastructure. We have this system for cloud agents that allows you to spin up an arbitrary Docker environment that has a problem to solve, as well as a way to get a reward on that problem. Inside that, you can basically run, think of it as a mini version of Cursor, where the agent tries to run tool calls, tries to write code, tries to design what it's trying to do. And that is as production-close as possible: it can call command-line tools, it can change files, it can do searches, all of that.
The second part is an inference system where you're actually using the language model, producing tokens. Those tokens get converted to tool calls, which then get sent to this system, change the environment, and then get sent back. And for that to work, we have to be able to do very fast inference. We have to handle rollouts that can be of extremely different lengths and, increasingly, extremely long times, both in producing tokens and in waiting for the response of the system itself. Finally, the third part is actually training these models. We use very large models, and we have to get back the rewards from these environments we've set up and update the parameters of the models based on those rewards. There may be, I don't know, let's say 32 of these rollouts for any given problem, and the trainer has to take in the rewards that it got. These are extremely long rollouts; we have to take those, process them, and then update the weights. We do a lot of work on the actual models themselves, as well as on kernels, to make this as fast as possible. We write a lot of kernels in-house, or utilize various optimizations, to make that training process as fast as possible. And this is particularly hard because these things are extremely long, right? We're talking easily 100K tokens for a lot of these.
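The harness-then-reward step Sasha walks through amounts to collecting a group of independently scored rollouts per problem. A toy sketch of that collection step, with the containerized environment and reward replaced by hypothetical stand-ins:

```python
# Toy sketch of a rollout-and-reward harness: for each problem, run
# several independent rollouts and record a reward for each. The body
# of run_rollout is a stand-in for "spin up a container, let the agent
# act, score the result", not any real infrastructure.
import random

def run_rollout(problem, seed):
    random.seed(seed)
    tokens_used = random.randint(1_000, 100_000)   # rollouts vary wildly in length
    reward = random.random()                       # e.g. fraction of tests passed
    return {"problem": problem, "tokens": tokens_used, "reward": reward}

def collect_group(problem, n_rollouts=32):
    # One "group" of rollouts per problem, to be scored and used for updates.
    return [run_rollout(problem, seed=i) for i in range(n_rollouts)]

group = collect_group("fix failing test in repo X")
print(len(group), max(r["reward"] for r in group))
```

In a real system each rollout would run asynchronously against its own environment, which is why handling very different rollout lengths matters.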


Ravid Shwartz-Ziv: And what are the rewards? How do you think about them in general? Because when you write code, there are different types of metrics you want to optimize. So, first of all, what are the metrics you think are important? And also, do you think we should optimize each of them separately, or together, or with some intermediate steps?


Sasha Rush: Yeah. So a lot of this is what we work on day to day; coming up with exactly the right rewards to use for this problem is what makes it work or not in practice. We use a set of different rewards. Some are on the actual content of the tool calls themselves; some are only on the final output the system produced. All of our rewards are end-to-end, so we have nothing that looks like a process reward that tries to guess what's going on in the middle of the system. Overall, the whole thing does look like what people are building generally with GRPO-based reinforcement learning systems. One thing I will say is that I think academia has gotten a little caught up on zero-one rewards. People seem very fixated on, I guess, RLVR, so reinforcement learning from verifiable rewards, or RLVF, oh, I don't know what it's called, RL from verification, whatever: zero-one rewards on these problems. Coding is nice in that it does have some of those, but those are not the only thing we think about. And I think the math is actually somewhat more interesting when you can think about arbitrary rewards as opposed to just zero-one.
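Sasha's point that rewards need not be zero-one is easy to see in GRPO-style group normalization itself, which treats graded rewards exactly the same way as binary ones. A minimal sketch (not Cursor's implementation):

```python
# GRPO-style group-normalized advantage: each rollout's reward is
# compared against the other rollouts for the same problem. Works
# identically for graded (e.g. fraction of tests passed) and 0/1 rewards.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    # Advantage = reward relative to the group mean, scaled by the
    # group's spread; eps guards against a zero-variance group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Graded rewards across 4 rollouts of one problem:
advs = group_advantages([0.2, 0.9, 0.5, 0.4])
print([round(a, 2) for a in advs])
```

Rollouts better than their group get positive advantage and are reinforced; worse ones get negative advantage, with no verifier-style 0/1 threshold anywhere.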


Allen Roush: So do you think code agents should remain basically unconstrained, with as few guardrails as possible? I'm asking because a lot of people run Codex in YOLO mode, or Claude Code with the dangerously-skip-permissions flag. I'm sure Cursor, I think Cursor has an equivalent mode as well, and a lot of people enable it. What's your take on this, and how should people run it?


Sasha Rush: I certainly run it in that mode. I've been, I don't know, it just seems very good. I'm often more worried these days that I'll make a mistake than that the models will. But there is a lot of work, and I think we'll need to do more, on sandboxing. I know other systems have this; we have published blog posts about our sandboxing systems as well. I think companies as a whole should think a lot about what they give agents access to, and how they set up their code bases with appropriate mocks or other aspects to make this possible. In our cloud-agent world, you can set things up in their own Docker images to run as they go. But certainly, yeah, I don't think we've seen too many problems yet, although it's something we're definitely thinking about how to make better.


Ravid Shwartz-Ziv: And how do you think we should think about what types of code models should actually write? Clean, abstract code that humans can actually understand, versus something that may be more efficient but that only AI models can understand? Do you think we should encourage one of these types?


Sasha Rush: Yeah, that's a great question. I think other people have thought about this better than me, but I have some thoughts as well. The thing that's been surprising is how quickly people have moved to not looking at their code, and being proud of it. I still work mostly in a world where I can see the code and understand how it works. But I'm very bullish on agents as teachers, and on their ability to go in and explain complicated code to you. And I really like the potential of working with much richer checking than we currently have, along with this kind of explanatory system. So I think you get to a world where you actually have much safer code, because it can use stronger static typing and much better testing, and the agent can work with that and doesn't have the same issues a human might have in that sort of environment. But for that to work, there has to be a way for the agent to prove to you that the code is doing what it says it's doing. And that has to feel clean, convincing, and checkable. So it's definitely an area I would be thinking about from a research point of view: how you get that right, how you find the right setting for it. Even in the simplest case, you mentioned Python static typing, which I think was lingering at mid adoption. But I think now there's really no excuse not to have static typing in Python code. The models can do it so well, and you get free checks and guarantees, and hopefully readability.
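The "free checks" Sasha mentions are easy to illustrate: with type annotations, a checker such as mypy can flag a possible `None` before the code ever runs. A small, self-contained example (the function names and data are invented for illustration):

```python
# With annotations, a static checker sees that find_user can return None,
# so any caller that uses the result unchecked gets flagged before runtime.
from typing import Optional

def find_user(users: dict[str, int], name: str) -> Optional[int]:
    return users.get(name)   # None when the name is absent

def badge_label(users: dict[str, int], name: str) -> str:
    user_id = find_user(users, name)
    # Without this None check, mypy reports an error; untyped code
    # would instead crash (or misbehave) at runtime.
    if user_id is None:
        return f"{name}: unknown"
    return f"{name}: #{user_id}"

print(badge_label({"ada": 1}, "ada"))     # ada: #1
print(badge_label({"ada": 1}, "grace"))   # grace: unknown
```

Running `mypy` on a version without the `None` check is exactly the kind of cheap, automatic verification an agent can apply to its own output.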


Allen Roush: Well, I mean, you do have to actually run a static type checker like mypy, but yeah, the model can do that too. So, related to Ravid's question, there's a whole world called, I think, code golf, about people who try to minimize the size of a coding solution, and the solutions all look extremely esoteric, at least the effective ones. There are even whole programming languages, including a famous one with an offensive name,


Sasha Rush: The model can do that right exactly.


Allen Roush: Brain- and then the F-word, that makes the most unreadable code you've ever seen, but it's extremely small. And it occurs to me that in principle these are actually very efficient languages and ideas, in terms of giving up human readability but making it easier for agents to manage context and communicate with replicas of each other. So do you think there's any value in exploring this kind of niche style of programming for code agents?


Sasha Rush: Oh, that's an interesting question. I have several takes. So one take is we don't need to find esoteric languages. There's a really hard language that we don't know how to write well already, which is CUDA, and in particular CUDA PTX, the assembly code. I mean, it's a huge problem: how you make more efficient kernels. And it's been challenging for a while to get models to do this well. Part of the problem is that there's just not that much available code to train on. And so you have this challenge of how you make it better to actually run and utilize these systems. But I saw that for Flash Attention 4, which just came out, Tri Dao had a nice set of tweets where he talked about how they are starting to use higher-level languages with coding agents to try to write better kernels. And it does feel like if we can get that kind of feedback cycle better, we'll be able to write for some of these harder languages faster. The other question is just broadly whether these agents will be good at writing in languages that are not common. This is kind of the opposite problem of what we focus on at Cursor: we want a model that's extremely good at the code users are writing and that works in those settings. I do think RL offers an opportunity to make specialized agents in domains where we have less pre-training code to work on and improve. Like, one area I'm interested in is how you get good at a library that's not been released yet. You want to release a new library, and you also want the agent to already be an expert at using it. And I think some of these RL settings might make that possible.


Ravid Shwartz-Ziv: Yeah, what do you think about continual learning in this context? This is kind of like continual learning, right? Do you think it should be embedded, like you should change the actual weights? Or do you think we should add some retrieval or memory component?


Sasha Rush: Mm. Yeah, it's certainly been one of the hot topics this year. It was certainly what everyone I talked to at NeurIPS was interested in. I kind of take the view that we haven't fully exhausted just writing stuff down. I've been shocked at how good the model is at using the file system, using past chats, that sort of thing, in order to better understand this information. We've also found that we're able to RL long past the model's context by doing things like compaction in the loop and training as we go. So we're not yet seeing diminishing returns on the easy things. Like everyone else, I love ideas where you keep around embeddings or put things in the weights, just because continuous space is really interesting and potentially very information dense. One thing a former student of mine, Yuntian Deng at the University of Waterloo, has been thinking about is how swarms of agents might communicate through non-textual means. We see these swarms passing context to each other, but maybe they could pass vectors. That kind of thing seems quite interesting, but I think it's still a research topic.


Allen Roush: And do you think there's anything to some of these agent communication protocols they're trying to make, like A2A? What do you think about these? Because I know that MCP is pretty well supported by Cursor.


Sasha Rush: Yeah, good question. I know that there are now ways to run the Cursor agent and the Cursor agent harness in other IDEs. I think JetBrains and Zed allow you to utilize the Cursor harness. That's through, I forgot the protocol name, but one of these. But that's more for using an agent in different places. In terms of actual agent communication, I think there's still a lot to be explored there. It's a somewhat niche topic now, but it definitely seems like it will pick up. I have a somewhat strong view that the ideal case for me is a single agent, just with an extremely long context or some kind of context management. In these cooperative environments, I do feel like we should try to get as much into one agent as we can. But certainly in the short term, we've seen really impressive projects with these agent swarms. And there'll definitely be a lot in that space in the coming weeks.


Allen Roush: And then can you talk more about how you at Cursor do context compactification, and what strategies there might be there? Because it seems like a pretty exciting new area of research.


Sasha Rush: Yeah, that's true. A lot of really interesting papers have come out recently on this topic. We're currently writing up a blog post on what we do, so I can talk about it. The way it works is that during RL, we roll out the solution to the problem. You let the agent work for as long as it wants; eventually, it's going to hit some context cap. So let's say it gets to 200,000 tokens and has no more context. Then what we do is we just ask that same agent to summarize its context so far. It just looks like another tool call: the agent produces a summary, and then we start again with just that summary and roll out. We recursively do this until we hit some limit or the agent decides it's solved the problem and stops. We then collect the reward from that entire system, and you can just backprop it over all of the different rollouts, so they all get the same reward. And what we found from doing this is that you basically don't even need to train anything special. The agent itself learns to produce very good and surprisingly readable summaries of its context, and you can just use those as it goes. It doesn't really forget about the early part of the rollout; the reward updates that as well. So yeah, this is what we run in production for our Composer model. And it's not using anything continuous. I think a lot of the interesting research is on continuous compaction, trying to make a vector that represents what you've seen before. We're just using plain text in the summary itself, and the model is able to use that to do well.
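The compaction-in-the-loop rollout scheme described here can be sketched roughly as follows. This is a toy illustration, not Cursor's implementation: the context cap, the "work" steps, the `summarize` stub, and the reward are all made-up stand-ins.

```python
# Toy sketch of compaction-in-the-loop RL rollouts, as described above.
# The cap, the agent's "work", summarize(), and the reward are all
# illustrative stand-ins, not Cursor's actual system.

CONTEXT_CAP = 8      # stand-in for e.g. a 200k-token limit
MAX_SEGMENTS = 3     # limit on recursive compaction rounds
WORK_NEEDED = 10     # toy notion of "the problem is solved"


def summarize(context):
    """The same agent summarizing its own context (here: a crude stub)."""
    return [f"summary({len(context)} items)"]


def rollout(problem):
    context, segments, work_done = [problem], [], 0
    for _ in range(MAX_SEGMENTS):
        solved = False
        # Let the agent work until it solves the task or hits the cap.
        while len(context) < CONTEXT_CAP and not solved:
            context.append("work")          # one toy agent/tool step
            work_done += 1
            solved = work_done >= WORK_NEEDED
        segments.append(list(context))
        if solved:
            break
        # Cap hit: restart with the agent's own summary plus the task.
        context = summarize(context) + [problem]
    reward = 1.0 if solved else 0.0
    # Every segment of the rollout gets the same terminal reward,
    # so the policy gradient flows through all compaction rounds.
    return [(segment, reward) for segment in segments]


for segment, reward in rollout("fix the bug"):
    print(len(segment), reward)  # → "8 1.0" then "5 1.0"
```

The key design point from the transcript survives even in the toy: compaction is just another step in the trajectory, and all segments share the final reward, so the model is trained to write summaries that its future self can act on.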


Allen Roush: Is this still necessarily a lossy operation? And, for example, there is that 1 million token context version of Opus, which is more expensive. Do you think that fundamentally there is still a reason to push context windows as hard as you can, even in spite of this context compactification?


Sasha Rush: Yeah, I think everyone agrees that an infinite context window is the optimal world, and we expect context to get longer as we go. That being said, I do feel like it needs to justify itself. We found that for most of our problems, there's very, very little actual accuracy loss from doing this kind of compaction. And you could also imagine extending this so that you can literally retrieve anything from the older context if you really need it. So it's not like that information disappeared from the world; the model could pull it out. That being said, a lot of my academic research was on extremely long context models. I do think some kind of hybrid model that keeps around some sort of state from previous conversations will eventually be really useful.


Ravid Shwartz-Ziv: Okay, maybe we can talk about one of the hottest things now, the CLIs for coding agents: Claude Code, Codex. Not everyone, but a lot of people have started to use these tools in the last two months or so. There was a big jump in the capabilities of these coding agents. So first of all, do you see it?


Sasha Rush: Yeah.


Ravid Shwartz-Ziv: Also, do you see this jump in the models that you are working on? And also, and of course you're biased here, how do you see working with an IDE versus working in the terminal? And what are the options and solutions that you see in the future?


Sasha Rush: Ha. Yeah, so first off, it's worth splitting out the models from the harnesses. I think they're both really important and matter a lot. But at least in the model space, the picture is really clear, which is that there have been a couple of points in time where there's been a real sea change in how good these agents were. For us, the first one was really Sonnet 3.5, which just made all this work; you could see the vision of where this was going. Opus 4.5 was a moment like that. When it was released, it just worked so much better and was so much more trustworthy as a user in terms of what it was trying to do. We utilize the models a lot internally and take pretty clear polls about where everyone in the company feels they are. And there was a very strong sense of: this model is just working really nicely and is quite good. We also saw that on our benchmarks. We have a whole team that works on realistic coding benchmarks that try to quantify where these things are. And there was a very clear sense that this model is both more usable and also performs extremely well. Once you have a model like that, it really opens up a lot of other use cases. Sorry, I should also say that we now think Codex is also fantastic, just a really good, really usable model, and better in some ways for very long-term tasks. And once you have models like that, you can really expand what's possible with these systems, and that allows the developer experience to look different in practice. Okay, so that's the models. Let's also talk about the harness. I actually use the CLI a lot these days for what I'm doing. Cursor has a CLI as well, just the agent on the command line. And I think it's quite good. As a company, we're pretty agnostic to which way people use it, whether it's cloud, CLI, or IDE.
A lot of people really like the IDE form factor, and so we actually see a ton of use of that as well. Although I think Twitter is maybe a different universe from the average software engineer. So in that sense, let them all flourish. As for how the actual harness works: it's actually quite awesome to see all the different things that have come out, from Claude Code to Codex to the things we've released to some of the open source versions. I haven't seen really convincing evidence that it matters too much. Opus 4.5 and all the Codex models work really great in our harnesses, and they seem to generalize fine to all the stuff we're throwing at them. So, yeah.


Ravid Shwartz-Ziv: But for you personally, isn't it frustrating that you're working on a model for a specific use case, for coding, and you work really hard on it, and then they just release a general-purpose model that's better, or at least at the same level as your model?


Sasha Rush: Mmm. Well, I mean...


Ravid Shwartz-Ziv: You don't need to comment if you don't want to.


Sasha Rush: No, no, I don't know. I mean, first of all, I think our models are very good, and they're getting a lot better. But also, OpenAI and Anthropic are two of the most impressive companies in the world; they employ brilliant, amazing people and have nearly unlimited budgets. We want to produce the best coding models in the world, but it's awesome to have competitors working in a similar way. And honestly, I hope the Gemini models get better too. I'd like to see them get better as well.


Allen Roush: Well, that's actually a good jump-off point. So I'm a huge fan of specialized agent workflows, including deep research as a feature. Right now, I think there's also an equivalent feature from Perplexity, but it's been mostly OpenAI and Google that have those. I think Cursor may have one. I think Devin slash Cognition may have a code version of this. What's your thought on these more choreographed workflows that are still trying to take that, you know, spend-a-million-tokens-to-get-better-results kind of inference-time scaling idea that Cursor does?


Sasha Rush: Yeah, so we don't have anything that's specifically choreographed in the way deep research is, where it always asks the same follow-ups. But we do think a lot about QA and answering questions really well in the context of a very large code base. And it does have that kind of deep-research flavor to it. I think understanding the code base and understanding the output of the system in that sort of constrained setting is a really interesting problem. And it's one of the areas where RL has really excelled. You can just see, as you post-train the model, it gets better at using fewer searches to narrow in and find the answer, using its thinking tokens to really figure out what to do next. And yeah, I think it's a really cool area.


Ravid Shwartz-Ziv: Do you think the future will be that first you release the model to roam around in your database and understand your code base and all these things, and then it will generate answers and code over there?


Sasha Rush: Yeah, I think the terminology we've been using is "self-driving code base." We have this vision where the agent kind of walks around in an unconditioned setting and tries to understand what's happening. And then you use that to improve things, get better, and prepare for questions in that sort of setting. I mean, all of this is just presuming that token costs continue on the trajectory that they have. This stuff is going to become like compute: how do you best utilize it going forward?


Allen Roush: Well, speaking of that, we're actually starting to see inflation in compute costs now, right? Of course, hardware costs have started going, or sorry, memory and storage costs have started going up, but there's also even been an increase in price in enterprise CPUs and PCBs, and probably even power supplies. So do you think we'll go into a world of actual token cost inflation, and thus, you know, the average person finding it harder and harder to access increasingly powerful foundation models?


Sasha Rush: Yeah, that's an interesting question. And yeah, I know that this has come up in the context of things like ads and chatbots or things of that form. It's been a little less apparent in the coding world, just because there's so much demand right now, and I think people are getting clear value out of what they've been using.


Ravid Shwartz-Ziv: So you don't think we should train these models to be more efficient?


Sasha Rush: I mean, we should do all of the above. I certainly think that past a certain level of inefficiency, people like these models less or utilize them less. That being said, it really does seem like intelligence is the primary axis on which people want to see improvement.


Ravid Shwartz-Ziv: And what about, I don't know, refactoring the code? Like in the past, once in a while you went over your code, you refactored things, you removed things that are not necessary anymore. Do you think this will be the future? Like some kind of automatic


Sasha Rush: Hmm.


Ravid Shwartz-Ziv: maintenance of your code or like you will refactor everything and or like I don't know maybe we don't need to use packages anymore. You don't need to download external package you can write it by yourself and just refactor it over time. How do you see this pipeline in the future?


Sasha Rush: Yeah. Yeah, not to talk our book too much, but we have this new product we've been playing with called Automations, where every time a PR is submitted, it will run an automatic refactoring job, or check for security issues. There are so many things that were hard to do because they used developer time, and the logic just changes if you're able to put in automatic checks or triggers based on certain properties. If you have a good sense of what a test would be, you can have the model do all sorts of things to your code base to make it better. You can have automatic documentation; automatic, we call it deslopification, removing things of that form. And then also bug finding has been a huge thing in the world. Anytime you can guarantee before an expensive run that you've removed a bug, it's almost invaluable in some ways. And it's something that models, given time to think about it, can do extremely well.


Ravid Shwartz-Ziv: What do you think about that product of Anthropic's, where they searched for and found a bug? Do you believe that it's something that is reasonable, or is it just PR?


Sasha Rush: Yeah, so we've had a product for the last year or so called BugBot that does this for reviews. It runs on all my PRs. It often finds extremely bad bugs; I've become totally reliant on it. But yeah, I think in that sort of setting, where it's non-interactive, you can put more tokens on it and you can do very, very well. Yeah.


Ravid Shwartz-Ziv: But in the end, why not catch these bugs at the beginning, right? When you write the code. If your agent writes the code, why should you use another agent to find these bugs?


Sasha Rush: We've seen this a lot: at least in the current state of the world, there is a verifier-generator gap. The generation problem, because it might be time constrained, or because it has to actually produce code within the context, maybe not as a fully formed PR, will often do worse than these systems that just get to look at the code after the fact, maybe even run commands. They can narrow in and find things that may have been missed on the first pass.


Allen Roush: So do you believe, then, in the concept of scalable oversight? What the AI safety and alignment crowd talks about, that the only way to really secure ourselves against, you know, the "geniuses in data centers," as Dario calls them, is to just have more of them constantly watching like a hawk, basically, to prevent that?


Sasha Rush: Yeah, right, as that went from a theoretical concept to almost a product, it's been interesting to see how it manifests itself. I still don't feel like we have a really good sense of what a general-purpose form of that might look like, but I think this kind of bug detection and security detection is maybe a first pass at what that might look like in practice. I don't know how far I'd go in terms of that being a 100% solution or really a guarantee against some of the problems people are worried about. That being said, I think I'm kept up much more by the economic scenarios than I am currently by the safety scenarios.


Ravid Shwartz-Ziv: And do you think the current progress will stay the same, will accelerate, or maybe we'll see some slowing down?


Sasha Rush: Acceleration is always hard to predict, but yeah...


Ravid Shwartz-Ziv: Also whether it stops, right? It's always hard to predict the future.


Sasha Rush: Well, okay, I do think it's not impossible to predict the future. I have a colleague, Jacob Steinhardt, who I think has written really well about how to do this kind of forecasting, and I always find it interesting to talk with him. But at least from my perspective: for a while, I studied scaling laws in academia, and there was a very clear sense of what we thought the next three months would look like. And then, about a year and a half ago, it felt like there was a very clear end to that. That maybe this would tail off in a sigmoid, that we wouldn't see the same kind of progress on pre-training. And I think that made me somewhat bearish on where things were going. But at least in this RL post-training world, it really doesn't seem obvious that there's any end in sight to code quality. I think we'll just see 5.5, 5.6, Composer 2, Composer 3 really get better and better at what they're doing. What would blow my mind would be seeing similar progress in other areas, particularly ones without as clear a reward signal. And then, I don't know, there are a million of these interesting new research labs starting up, trying to do things on data efficiency or continual learning, or really trying to look for the next big breakthrough once RL post-training hits an end. So that would also be quite interesting.


Ravid Shwartz-Ziv: But do you think it's possible to see such progress in other fields too? Because coding is the best-case scenario, right? You have verifiers, you have a closed language that you can test, and all these things. A lot of data. Do you think we can see progress in other areas, in real-world problems, if you want?


Sasha Rush: Yeah, it's funny. I mean, I have such hands-on experience now with coding, so I don't know the actual detail-oriented issues in other areas. But man, it looks like things like law should have the same properties, or, I don't know, a lot of Excel work or data science. It just seems like it's really an engineering problem to set up the environments, set up the data, and train models on that. I don't see a technical problem that is in the way of training those sorts of models. I see huge engineering issues, huge data ownership issues, all sorts of things. But from a technical point of view, it doesn't seem crazy to just apply the same playbook.


Allen Roush: And what do you think about problems in intellectual property and privacy related to agents? And what does Cursor do to try to be class-leading in this regard?


Sasha Rush: That's a really interesting question. I actually don't know if I can speak to that. It's far enough out of my area that I probably don't have too much to say. No, it's fine. It's fine.


Allen Roush: Okay, sorry. Well then, let me follow up, I guess, with something else. So what do you think is the future of Cursor and Cursor agents?


Sasha Rush: Oh yeah. So our CEO wrote a blog post recently about where we're moving. I think it laid out roughly the picture of moving from this hands-on, interactive way of using coding agents to one where you're controlling a group of different agents that work for a long period of time and interact with each other. I think to get there, it's going to require people thinking about their problems in this more declarative way, where you have a set of goals that you're trying to accomplish, and you utilize this group of agents to set off and try to accomplish them. To make that work, I think there are a lot of questions in how you train models that can work for an extremely long time, how they can facilitate communication, and how they can facilitate working with a user in that sort of setting. And then also the developer experience of how you actually set that up and make it a pleasant and easy experience to utilize. And I don't know, it's been really remarkable to see this go from a somewhat niche software engineering product to one that you see everyone utilizing. I have friends in all sorts of different fields trying out coding for the first time. So it does feel like we're still in the early days of how this works, and it's pretty exciting to build these systems and see what people are going to try to build with them.


Allen Roush: And then what do you think will be the future for people who today mostly don't use coding agents, just other white-collar workers? Does Cursor plan to extend to the average office worker, so that they use Cursor agents? Or what's the plan there?


Sasha Rush: No, Cursor is entirely focused on professional software engineers. That's what we're trying to build and what we're trying to design for. I do find it interesting to follow how other systems have been built in other domains. But it's also nice to have a clear focus and goal for what we're personally trying to do.


Ravid Shwartz-Ziv: How do you see the future in the sense that, say, I'm a company, I have some kind of data, users or whatever. Do you think we will have this predefined environment and just train our own model, like fine-tune or post-train our own model? Or will we need to use predefined, pre-trained models, both general models and coding agents?


Sasha Rush: Yeah. This pendulum between general-purpose and specialized models has been quite interesting to watch over the last year. You mentioned that OpenAI now has a really good coding model, but it's fascinating that the way they got there was through this Codex path, where they first trained specialized models, and then eventually, with 5.4, it seems like they merged Codex back in. And so I think these companies went from "we're going to train a single model for the entire world" to "how do we spin up specialized teams for each domain to do RL targeting a given task, and then merge it back together?" And that is kind of a different universe in some ways, because it doesn't seem as if there's as clear generalization from this targeted training as there was with things like pre-training. What does that mean for an individual company? I think for other domains, we'll see companies training specialized models to do what they can in those domains. It's not clear to me that every company, or even most companies, should be doing their own RL training. For one, it's quite hard. And two, maybe it's not necessary given how good some of the general-purpose models are. But who knows? I mean, some people are taking opposite bets. I do appreciate things like Tinker from Thinking Machines, which has tried to make at least a simplified abstraction for what RL training for a given problem might look like.


Allen Roush: And then what do you think about model architecture? I know there's transformers and also state-space models slash Mamba, and we're now seeing, at least in the latest NVIDIA Nemotron models, the use of Mamba layers. Do you have opinions on this? And do you talk about any of that?


Sasha Rush: Yeah, yeah. Oh boy, do I have opinions. It's funny, I spent a lot of time thinking about this topic, but I haven't for the last year. Transformers just work really well in industry, so I've mostly been focused on other questions about RL and data. I have this bet that I made with Jonathan Frankle. There's a website for it; it's called isattentionallyouneed.com. We made the bet in 2022, and it expires in nine months. It was a five-year bet on whether any model would beat transformers on any task in NLP in the next five years. It felt like five years was a very long time at the time, but it's turned out not to be that long. Well, very long in terms of how much stuff has happened, but maybe not in terms of research changes. My bet was that a new architecture would emerge, and he bet on transformers. I think his position was a kind of systems view, that we'll figure out how to optimize them and make them better. And I was taking more of a research-creativity, long-term,


Ravid Shwartz-Ziv: Wait, what did you guess? What was your guess?


Sasha Rush: punctuated-equilibrium kind of view. And yeah, I don't know, it's interesting where we're at. I think we're seeing evidence that hybrid models have a lot of potential. We're seeing a lot of models that are like 3-to-1 or something of that form, where they have a bunch of state-space layers or linear attention layers or gated DeltaNet layers in between. But it does really seem like you need some attention in the process; that's really a key aspect of it. So, yeah, I don't know. I've been fascinated that there hasn't been much change in the world since that time, in architectures, I mean. But I'm still bullish that people will come up with new ideas.


Ravid Shwartz-Ziv: And what do you think about next-token prediction versus world models or diffusion? You know, Yann was here, and he was really all about world models, and then Stefano Soatto came and said, oh no, we should do next-token prediction. And then Stefano Ermon was here, and, oh no, diffusion models are all you need. What do you think about it?


Sasha Rush: Oh yeah, yeah. That's great, you have the whole group. What do I think about it? Well, let's see. One of them I worked with in my postdoc, and so he's someone I admire incredibly. I particularly admire, well, everything about him, but definitely his firmness of ideas, how he thinks about the world, and his clear ideas about research. I'm a kind of symbol maximalist. I feel like we can get extremely far with code and language, and I don't see any reason why spatial reasoning should be privileged or thought of as more important than symbolic reasoning. So I'm not really bullish on world models, but I'm super bullish on people working on them and trying them out; I just think there's a lot to go on symbols. And then, yeah, I worked with Volodymyr Kuleshov, who's one of Stefano's co-founders for this startup, Inception. And it's been really cool to see diffusion models come in for language. It's gotten way further in terms of how good they already are than I would have thought, and some of the speed benefits seem amazing. I've also read a couple of papers recently that make arguments that they're more data efficient than next-token prediction models, which is something I was very worried about when pre-training seemed like it was tapped out. I'm a little less worried about it in an RL world, but it might turn out to be really key to everything. So yeah, more research on that; more things should come out. I'm not fully convinced yet, just because it's hard to know what the exact details are. But yeah, their model Mercury, too, seems amazing.


Allen Roush: And then what do you guys do to improve the diversity of your outputs during reinforcement learning? And how important do you think diversity is for code agent harnesses? I work in LLM sampling, and so I'm convinced that good truncation samplers, like min-p and better, can enable high-temperature continuations that are still good. Do you think that's a path, or do you guys take other paths? How do you think about this?


Sasha Rush: Yeah, that's a really good question. It's come up a lot, actually, in our various RL research. I think that in the coding application, we primarily just do best-of-one, so it hasn't come up so much as an inference-time question. And there are interesting reasons why. One is that, because the model can call tools, it's not really reversible; it's harder to do forking of user environments. That being said, in a world where you're running a cloud agent or running in some sort of containerized environment, that becomes much more interesting: you can do best-of-n rollouts, or some sort of tree search at inference time, or multi-agents. There are all sorts of things that become much more fascinating. And I think in that sort of setting, maintaining good diversity throughout RL becomes really key. There's been a lot written about how, on some of these math problems, the model really collapses to not having any kind of probability distribution; it just has one output. We don't see that as much in some of the large-scale systems, so we still have an interesting amount of diversity in the models, and we'd like to be able to use that more effectively.
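For readers unfamiliar with the min-p truncation Allen mentions: the idea is to keep only tokens whose probability is at least some fraction of the most likely token's probability, then renormalize, which lets you raise temperature without admitting junk tokens. A minimal sketch (illustrative values only, not Cursor's implementation):

```python
def min_p_filter(probs, min_p=0.1):
    """Min-p truncation: keep tokens whose probability is at least
    min_p * max(probs); zero out the rest and renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With min_p=0.5, the cutoff is half of the top token's probability:
# tokens at 0.15 and 0.05 fall below 0.25 and are removed.
filtered = min_p_filter([0.5, 0.3, 0.15, 0.05], min_p=0.5)
```

Unlike top-p, the cutoff here scales with the model's confidence: when the distribution is peaked, almost everything is truncated; when it is flat, many continuations survive, which is what makes high-temperature sampling viable.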


Ravid Shwartz-Ziv: Do you think we can do good RL for multi-agent setups?


Sasha Rush: Yeah, yeah, absolutely. I don't really see any theoretical limitation to that. It's mostly just a question of how you set up the right environments, or the right way of doing it. Yeah, I don't know. You just get some policy gradient and learn.
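The "get some policy gradient and learn" recipe Sasha alludes to can be made concrete with a toy REINFORCE example, here on a one-step bandit rather than a real agent environment (a minimal sketch, all details assumed for illustration):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_bandit(reward_fn, n_actions=2, steps=2000, lr=0.1, seed=0):
    """Minimal REINFORCE on a one-step bandit: sample an action from a
    softmax policy, then move the logits by lr * reward * grad(log pi)."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    for _ in range(steps):
        probs = softmax(logits)
        # sample an action from the current policy
        r, acc, a = rng.random(), 0.0, n_actions - 1
        for i, p in enumerate(probs):
            acc += p
            if r < acc:
                a = i
                break
        reward = reward_fn(a)
        # d/d logit_i of log pi(a) is (1[i == a] - probs[i])
        for i in range(n_actions):
            logits[i] += lr * reward * ((1.0 if i == a else 0.0) - probs[i])
    return softmax(logits)

# Reward only action 0; the policy should concentrate on it.
final_probs = reinforce_bandit(lambda a: 1.0 if a == 0 else 0.0)
```

The multi-agent case is the same update applied per policy; as Sasha notes, the hard part is the environment setup, not the estimator.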


Allen Roush: And then, on user interface and user experience: I've always been fascinated by maximalist software, and I consider the VS Code bones that Cursor uses to be a part of that. Do you guys plan on adding more to the user interface and user experience to differentiate from the VS Code core, and if so, what might that look like?


Sasha Rush: Oh, that's interesting. Actually, I don't know what the term "maximalist" means in software.


Allen Roush: Oh, like the opposite of minimalist. If you use software like Blender, for example, there are 400 buttons. Most code editors take a similar maximalist approach, with a lot of different buttons. I think VS Code isn't quite that extreme; JetBrains is probably a little closer to it.


Sasha Rush: Yeah, okay. I think we're moving in the opposite direction on this kind of thing. It feels like agents provide such a nice way of getting hard things done, and the text interaction works for a lot of people. We've found internally (we're big dogfooders of our own product) that we're able to get away with much less now than we were before. So maybe that's the wrong answer compared to what you were expecting, but I would expect us to move more toward the minimalist side of things.


Ravid Shwartz-Ziv: And what do you think about open source? Any thoughts on it from Cursor's side, since you're also building on pre-trained models? First, what do you think about it in general? And what do you make of the recent year, when most companies closed their models?


Sasha Rush: Yeah, removing my Cursor hat for a moment: I'm obviously a huge fan of open source. Before I worked at Cursor, I worked at Hugging Face, which I still think of as the inspiring example of doing open-source machine learning, and I'm so grateful to have worked with them. It was also amazing to have worked on some of their initiatives, like BigScience, which was a year-long international project to train a multilingual language model. It was, yeah, just an inspiring group of people from around the world.


Ravid Shwartz-Ziv: And FineWeb is the best. FineWeb is still the best project in the world, in my opinion.


Sasha Rush: Yeah, just the coolest people building, yeah, just amazing things for the world. So just really great. I've also hosted a couple of workshops of academics doing open-source stuff, so I do think it's really important. And then, yeah, I have a couple of former students at Qwen, and it's been a little sad to see them potentially moving away from open source. There do seem to be a lot of great initiatives of people doing open source, and increasingly open-source models in the US, so I'm really bullish that those projects will pick up and fill the gap. Yeah, I never totally figured out the funding thing from an academic point of view, how you do that, or why it's possible in China but not in the US. And yeah, I think at the end of the day this stuff only gets interesting at a certain scale, and so it's been harder to do in an open-source world. But I very much appreciate what people are doing. As you noted, we use pre-trained models to do what we're doing, which is great. Yeah, and I do think probably at some point Cursor will release some open-source stuff too.


Allen Roush: I really agree that Hugging Face is the good people in this space. I always wanted to work there, actually. What do you think of these models that have been coming out and defining open source over the last year?


Sasha Rush: Yeah, I mean, I don't know, I think it's incredible. I gave a talk last year, when DeepSeek came out, about everything we learned from that project. And it was, I don't know, career-changing for me, just understanding what was possible, seeing for the first time how these MoE models were trained at scale. Their team seems amazingly smart. Everyone I've met from these companies seems wonderful and in it for good reasons. Yeah, it makes me a little sad that there's so much international competition, or that people feel, I don't know, some antagonism between countries. In a perfect world, I would love to meet a lot of those folks and talk about research.


Ravid Shwartz-Ziv: Do you think in the future we'll see more companies with pre-trained models? Or will it converge to two, three, four companies?


Sasha Rush: Yeah, I don't know; it's a hard prediction. I do know that every time I see Nathan Lambert, I ask him this, and he gives me the full rundown, so I have a good sense from him at least. It's hard to know; I'm not on the ground in China. What are the trends? One trend is that pre-training is less mysterious than it used to be. I think we roughly get it, if you're not doing anything too wild.


Ravid Shwartz-Ziv: I will try to ask it...


Sasha Rush: It doesn't seem crazy to me that there could be more places pre-training models. Maybe the worry is that post-training is now more mysterious, and the models will come out more raw as you go. But then there are these economic incentives. I don't really understand why DeepSeek does what they do, so it's hard to predict, I don't know, what their motivations are. But I don't see any reason why American companies couldn't do this kind of thing.


Allen Roush: Nathan Lambert is somebody we should totally get on this podcast, by the way. Big fan of his work. So, is Cursor going to continue integrating into other products too? Because I've seen that the Cursor agent is now integrated, I think, even into the web browser, if I'm not mistaken. Is that going to continue to be a strategy?


Sasha Rush: Yeah, yeah, right on. Yeah, absolutely. I think you should not think of Cursor as being this single product. It's a system to help professionals write code, wherever that might be.


Ravid Shwartz-Ziv: So maybe a more personal question. You've spent time in the academic world, and you've spent a lot of time in industry. What do you like better? What do you think is better? What do you like doing more?


Sasha Rush: Um, yeah, "like" is an interesting term. I definitely like them both. I think for me, the change was almost less about my own personal taste and more about which problems were 20-year research problems. When I became a professor in, I don't know, 2014, I actually thought I was going to spend the next 20 years of my life working on translating German to English. That was a problem I thought I was going to spend my career on, and I liked it. I liked getting 0.3 points better on BLEU each year with creative mathematical approaches. And it was a little bit of a shock when, while I was a postdoc before becoming a professor, Ilya just destroyed my field in the nicest way, with a single paper, seq2seq, really laying out the profile of where this was going. And so it was really fun from 2014 through 2018, seeing the trajectory of that. I actually spent a lot of time during that period writing open-source translation software, and because of that I was very aware of the transformer paper when it came out, and got obsessed with it and where it was going. And then for five or so years after that, I mostly just did variants of transformers and tried to understand what was going on. I think right now, the really interesting 20-year problems in academia look a little different. There are things like: how do we make safety work really well? How do we get AI to work in science? How do we figure out what interpretability is and how to utilize it or work through it? And then questions about RL and where it's going. And I'm really inspired by a lot of the young professors who are thinking about these problems and want to really work on them. I think they're a little bit different from what I'm interested in.
And so I think coding is more of a problem that looks like translation to me: let's just make this thing good and better and figure it out. I think it's not uncommon for a field to move down the research-engineering spectrum in that way.


Allen Roush: And I assume you regularly go to the AI research conferences, like NeurIPS and ICLR and so on. What do you think of them, and how have they changed over the last several years?


Sasha Rush: Yeah, that's a good question. I was on the board of ICLR for many years, and a couple of years ago I helped launch a conference called the Conference on Language Modeling, which is now in its third year; this year it will be in San Francisco in October. The conferences are awesome. I love going to them, I love seeing people, I love watching papers. What worries me? I'm worried about review. The notion of review has decayed a bit, both from AI agents and from people being too busy, and from the, I don't know, pyramid scheme of more senior people reviewing more junior people's work having broken down a bit. So it makes it less of a stamp of "I really did something hard and it passed review." And I think, I don't know, that was a real key part of what made working on a project rewarding: really convincing people that what you did was interesting. So that worries me a little bit. I don't know. I mean, the fact that people are so excited to go to NeurIPS: some of it is economic reasons, but a lot of it is that people really care about the research and are really excited when someone comes up with a new idea. It's honestly pretty amazing that 20,000 people are going and caring about GRPO. It's, I don't know, still a really incredible thing.


Ravid Shwartz-Ziv: And do you think in the future we'll still see a lot of papers? Or will people stop writing papers because AI can write a better paper than us?


Sasha Rush: I'm not wedded to papers. If people write really good blog posts instead, I don't know, I'm fine with it. It's, yeah, I think it's about science communication and rigorous testing, figuring out whatever that looks like; I'm kind of okay with it. My worry is that we get slop, or that the incentive shifts from doing the hard work to really show that some new technique is beneficial toward spending all the time coming up with the name and the tweet for it. That worries me. You don't want to become consulting, where 90% of the work is the slides as opposed to the actual research. So yeah, that worries me.


Ravid Shwartz-Ziv: Huh? I like Twitter. It's great.


Sasha Rush: Oh no, I like Twitter too, I like Twitter too. Just... yeah, and you can do great stuff on Twitter. It's not the form, it's more like... I don't know, I want to reward the person who spends six months tweaking something, however that might be, and then tweets about it, as opposed to the person just randomly doing it.


Ravid Shwartz-Ziv: Yeah.


Allen Roush: So what's your recommendation for somebody entering the field now for the first time? What should they look for?


Sasha Rush: I have no idea. One thing that's been fun is that you guys interviewed one of my former students a couple of months ago, Jack, and he goes about research in an entirely different way than I do. And it seems to be really successful. He's great on Twitter, he's really open about his ideas, he talks to a lot of people, he finds interesting, unique problems and attacks them. And as an advisor, it's kind of like...


Ravid Shwartz-Ziv: Meh.


Sasha Rush: I don't know. I stopped telling people how to do research. You're trying to figure out what we don't know and why we don't know it, and do what you can to make progress on that. And it's going to change as the problems change; it just doesn't have a single form. If I were graduating from undergrad right now... yeah, it's a good question. I would only recommend getting a PhD if you really have a sense of "here's a really hard problem I want many years to work on and try to push through." If you're interested in working on the latest technology, or want to be part of a big team building something, there do seem to be plenty of opportunities; we have a lot of people at Cursor without PhDs who do various interesting things. But that being said, I loved doing a PhD, and I think a lot of my students did. So if you're passionate about that, if you want to work on something crazy, yeah, I wouldn't discourage you.


Ravid Shwartz-Ziv: Okay, I think we are out of time. Do you have anything else that you want to add?


Sasha Rush: No, this was great. It was really nice chatting with you guys. Yeah, this was fun.


Ravid Shwartz-Ziv: Thank you so much Sasha, it was great to have you here.