Hi everyone,
I am an ML researcher at Cursor and worked on this project. Would love to hear any feedback you may have on the model, and I can answer questions about the blog post.
Impressive systems write-up. A question: if Composer is an RL finetune on an open model, why keep the weights closed? The edge from a slightly better checkpoint erodes quickly in this market; it's not a durable advantage. Composer protects Cursor's margins from being squeezed by the big AI labs, but that is true whether the weights are open or closed, and I think Cursor would get more lasting benefit from the developer goodwill than from a narrow, short-lived advantage. But that's just my opinion. I personally find it hard to get excited about yet another proprietary model. GPT-5 and Sonnet 4.5 are around when I need one of those, but I think the future is open.
It's stunning.
I don't use these tools that much (I tried and rejected Cursor a while ago and decided not to use it), but having played with GPT-5 Codex (as a paying customer) in regular VS Code yesterday, and having had Composer 1 do the exact same things just now, it's night and day.
Composer did everything better, didn't stumble where Codex failed, and most importantly, the speed makes a huge difference. It's extremely comfortable to use, congrats.
Edit: I will therefore reconsider my previous rejection
Awesome to hear, I will share with the team.
Why did you stop training shy of the frontier models? From the log plot it seems like you would only need ~50% more compute to reach frontier capability
We did a lot of internal testing and thought this model was already quite useful for release.
Makes sense! I like that you guys are more open about it. The other labs just drop stuff from the ivory tower. I think your style matches better with engineers who are used to datasheets etc. and usually don't like poking a black box
Thanks! I do like the labs' blog posts as well, though; OpenAI and Anthropic have some classics.
Do you have any graphs handy that replicate the first one in the blog post but are less ambiguous, maybe without the model grouping? It would have felt more fair to include proper model names and show them individually, rather than grouping everything else together and presenting your own model on its own.
Which model did you distill it from? Great work! P.S. I'm hitting a few scenarios where it doesn't follow rules as well as Sonnet 4.5.
The blog talks about the training process. Specifically we trained with RL post-training on coding examples.
Makes sense, but what model was used for the base? Is it some open-source model, and you're not at liberty to disclose?
Not a Cursor employee, but still a researcher: it's Zhipu/Z.ai GLM-4.6/4.5. There are traces of Chinese in the reasoning output, it's the only model it would make sense to do this RL on, and it already delivers near-SOTA performance and is open-source/open-weight.
Cursor Composer and Windsurf SWE 1.5 are both finetuned versions of GLM.
interesting, thank you
that's cool thanks!
Is the new model trained from scratch? What training data went into it?
Is it true that Cheetah is Grok Code Fast 2? Does this mean that the new Cursor model is also based on Grok?
Cheetah was an earlier (and dumber) version of this model that we used to test production speed. They are both developed in-house. If you liked Cheetah, give this model a try.
This is nice. I liked Cheetah for grunt work that I want to get out quickly and is not too hard. The speed is really awesome. A model that would run at even higher speeds like the OSS models at groq/cerebras would really be workflow changing, because the slowness of SOTA models really breaks the flow. I find myself taking a ton of breaks and getting distracted while I wait for a model to complete a task (e.g. just now).
Let us know how you like it.
Awesome, thanks for the clarification. So are the rumors around Cheetah being based on a Grok model just straight up untrue? I want to try Composer but have a pretty strict no X/Grok policy.
Straight up untrue.
There is a youtube livestreamer building with it now, if you are looking for direct feedback: https://www.youtube.com/watch?v=1bDPMVq69ac
neat!
Congratulations on your work. I spent the day working with a mix of the Composer/Sonnet 4.5/Gemini 2.5 Pro models. In terms of quality, the Composer seems to perform well compared to the others. I have no complaints so far. I'm still using Claude for planning/starting a task, but the Composer performed very well in execution. What I've really enjoyed is the speed. I had already tested other fast models, but with poor quality. Composer is the first one that combines speed and quality, and the experience has been very enjoyable to work with.
I prefer the approach of focusing on faster models despite their lower intelligence, because I want my IDE to fly when I can see the code. I find this useful when I need to manually debug something that no model is able to do: I know it's going to fail, but at least it will fail fast. On the other hand, if I need more intelligence I have my other CLI that doesn't let me see the code but gets the planning and difficult code done.
Our view is that there is now a minimum amount of intelligence necessary to be productive, and if you can pair that with speed, that is awesome.
What's funny is there's many industries outside A.I. that pick their talent the same way. ;)
Is Composer a fine-tune of an existing open-source base model?
Our primary focus is on RL post-training. We think that is the best way to get the model to be a strong interactive agent.
So, yes, but you won’t say what the base model is? :)
It seems like a sort of Sonnet model, since a lot of people on Twitter are reporting that it likes to spam documentation the way Sonnet 4.5 does.
Can you please tell us more about how you used Ray for setting up the RL infrastructure?
Oh, good question. I'm actually speaking at the Ray Summit next week in SF, so we will talk more about it there. We used Ray throughout the pipeline: for running evals, for the RL controller, for data collation, and for visualizations. One tool we found helpful was Ray Data, which let us easily scale over data and run logs.
Please share more about the Ray Data use case.
We use Ray Data for our map-style processing jobs. For example, one tool we have runs over all the rollouts from the RL system and collects qualitative statistics to understand which types of agent trajectories are being rewarded, and what types of searches and terminal commands are being made.
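Not from Cursor's codebase, just a minimal sketch of what a map-style stats job like that might look like with Ray Data; the bucket path and the record fields ("reward", "tool_calls") are assumptions for illustration, not their schema.

    import ray

    # Hypothetical sketch: scan RL rollout logs and collect qualitative stats with Ray Data.
    # Path and field names are made up for illustration.
    rollouts = ray.data.read_json("s3://example-bucket/rl-rollouts/")  # one JSON record per rollout

    def summarize(rollout: dict) -> dict:
        tool_calls = rollout.get("tool_calls", [])
        return {
            "reward": float(rollout.get("reward", 0.0)),
            "num_tool_calls": len(tool_calls),
            "num_terminal_cmds": sum(1 for t in tool_calls if t.get("name") == "terminal"),
            "num_searches": sum(1 for t in tool_calls if t.get("name") == "search"),
        }

    stats = rollouts.map(summarize)  # per-rollout work scales out across the cluster
    print(stats.mean(["reward", "num_tool_calls"]))  # aggregate view of what gets rewarded
    print(stats.filter(lambda r: r["reward"] > 0).take(5))  # peek at a few rewarded rollouts

The nice part of the map-style setup is that the per-rollout function stays a plain Python dict-to-dict transform, and Ray Data handles the parallelism and the aggregation on top.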
Amazing work! The UX is great.
GPT-5-Codex does more research before tackling a task; that is the biggest thing keeping me from using Composer yet.
Could you provide any color on whether ACP (from Zed) will be supported?
How many times have you needed to reset the optimizer during the RL training cycles?
How do you work with multiple agents?
We train with a single agent. Is that the question?