Why would Cursor build a "worse" AI code model?
Cursor is a coding platform that sits on top of frontier models from OpenAI and Anthropic. So, why do they keep training their own smaller, less-intelligent model alternatives?
Hey, y’all -- Sherveen here.
Yesterday, the AI coding startup, Cursor, released their latest code generation model, Composer 2.5.
For those less familiar, companies like Cursor and Windsurf (acq. by Cognition) primarily make coding applications and agents on top of other people’s models – but in recent years, both companies have been releasing their own mini models (SWE-1.6, in the case of Windsurf), too.
And I always get the same question when these models get released: “Sherveen, are these models as good as [Claude/GPT/Gemini]?”
And I inevitably say that they aren’t comparable to the latest frontier models. Then, I get the next question: “So, why do they even waste time with it?”
So, that’s the question we want to answer today – whether we’re talking about coding models like Composer 2.5, or similar purpose-built models like Quiver’s Arrow for SVG image generation, Intercom’s Fin Apex for customer support, or the ensemble of models used in OpenEvidence’s AI platform for doctors…
… if the latest GPT and Claude models are so darn smart, what’s the deal with these other “purpose-built” models?
The top-line TLDR: companies with proprietary data, workflows, or workflow data can train and/or use narrower models for task-specific intelligence. They can use that to reduce their costs and offer you cheaper options, with the trade-off of the last 5-10% of “intelligence” that comes from the generalizability of the bigger, frontier models.
Regardless of the above benchmarks (easily gamed nowadays), Composer 2.5 is not actually as smart as GPT-5.5 or Opus. It’s made instead to be 70% as good, specifically for coding-agent work, at a fraction of the cost.
Let’s get into the details so you always know what you’re dealing with, when it’s worth using or avoiding, and the nuances in between.
First, let’s define the how.
It’s important as a basis to understand that these purpose-built models are often much smaller and cheaper to build, but more importantly, they’re often ‘fine-tunes’ of other models.
And due to cost, licensing, and training methods, companies aren’t usually fine-tuning state-of-the-art models like GPT-5.5, but instead working off of the “2nd or 3rd tier” open source models (ex. Qwen, DeepSeek, GLM, Gemma).
Let’s take the example of Composer 2.5. The US-based team didn’t build its new coding model from scratch. Instead, it’s a fine-tune (and post-train) on top of a Chinese model from Moonshot AI, Kimi K2.5. When a model is fine-tuned, it’s tweaked (to different degrees) in two ways.
(1) It learns from specialized data. In the case of Cursor, which millions of engineers use every day as their coding environment, the company has a wealth of logs from people using their product with other people’s models (ex. Opus 4.7 or GPT-5.5). During the training process, the specialized model can adapt its parameters based on all of the examples it sees – including successful and unsuccessful conversations, feedback the users typed in Cursor, and code performance based on agent behavior.
(2) Reinforcement learning (“RL”). The strategies for RL are wide and varied, but the general principle is always the same: during the training process, let the model practice on relevant assignments. When it performs well, give it a reward (think of this as a mathematical tweak that says “do this again!”), and when it could’ve done better, either make it go again or give it a penalty (“don’t do this again!”). Sometimes, there’s a human judge as a scorer, and sometimes, it’s another AI model.
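To make the "reward and penalty" idea concrete, here's a toy sketch in Python. It is not how Cursor (or anyone) actually trains a coding model. It assumes a pretend "model" that only chooses between three canned behaviors, and a hypothetical `toy_scorer` standing in for the human or AI judge. The update rule is a standard textbook one (a softmax "gradient bandit"), picked for illustration only.

```python
import math
import random


def softmax(prefs):
    """Turn raw preference scores into probabilities that sum to 1."""
    highest = max(prefs)
    exps = [math.exp(p - highest) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(scorer, n_actions, steps=2000, lr=0.1, seed=0):
    """Let the 'model' practice, nudging it toward rewarded behavior."""
    rng = random.Random(seed)
    prefs = [0.0] * n_actions  # start with no preference at all
    for _ in range(steps):
        probs = softmax(prefs)
        # The model tries one behavior, sampled from its current policy.
        action = rng.choices(range(n_actions), weights=probs)[0]
        # The judge scores it: positive = "do this again!", negative = "don't!"
        reward = scorer(action)
        # Shift preferences toward the chosen action if rewarded, away if penalized.
        for a in range(n_actions):
            chosen = 1.0 if a == action else 0.0
            prefs[a] += lr * reward * (chosen - probs[a])
    return softmax(prefs)


def toy_scorer(action):
    """Hypothetical judge: behavior 2 is 'good', everything else is 'bad'."""
    return 1.0 if action == 2 else -1.0


final_probs = train(toy_scorer, n_actions=3)
print([round(p, 3) for p in final_probs])
```

After enough practice rounds, nearly all of the probability ends up on the behavior the judge rewarded. Real RL on language models works over long sequences of text and tool calls rather than three options, but the "try, score, nudge" loop has the same shape.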
For the purposes of this piece, I’m going to continue to use the term “fine-tune” to encapsulate taking an existing model and applying a variety of techniques for specialization, including fine-tuning, RL + RLHF, distillation, post-training, and other industry methods.
Beyond being interesting at the top-level, the “how” leads us to our next point…
The advantage is a feedback loop.
Let’s swap back to the example of OpenEvidence (NBC News). Think of it like “ChatGPT for doctors.” They fine-tune models on domain-specific clinical data, use physician-assisted reinforcement learning, and license the full text of scientific and medical journals.
That sounds all well and good, but if commercial models like GPT-5.5 and Opus 4.7 are just so much smarter than open source models like DeepSeek (which they are)... wouldn’t OpenEvidence be better off just building their workflows + application on top of GPT-5.5 and paying OpenAI for the pleasure?
GPT-5.5 is no doubt smarter than those other models, and it’s generally intelligent across fields and use cases. It’s also pricey, though, both because it’s gigantic and because it’s private (as in, you can’t host it on your own servers without OpenAI’s permission).
You also can’t fine-tune it without permission from OpenAI, so while you can hand it domain-specific data in your prompt or through methods like retrieval-augmented generation (think: web search or knowledge base lookup), that’s less “embedded” in its learning compared to the fine-tuning we discussed above.
And even if you had that permission, it’s extremely expensive to train and run. So, you have to rely on it being broadly smarter off-the-shelf in the default mode that OpenAI provides.
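For a rough picture of what "handing it domain-specific data in the prompt" looks like, here's a minimal retrieval-augmented generation sketch in Python. Everything in it is an assumption for illustration: the knowledge base is three placeholder notes, the retrieval is naive word overlap (real systems typically use embeddings or a search index), and no actual model is called. It only builds the prompt you would send.

```python
import re

# Placeholder knowledge base (hypothetical notes, not real guidance).
KNOWLEDGE_BASE = [
    "Guideline A: first-line treatment for condition X is drug P.",
    "Guideline B: drug Q should be avoided alongside drug R.",
    "Office memo: the parking garage closes at 9pm.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Paste the retrieved notes into the prompt, ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "What is the first-line treatment for condition X?"
prompt = build_prompt(question, KNOWLEDGE_BASE)
print(prompt)
```

Notice that the model's weights never change here. The knowledge rides along in the prompt on every request, which is why it's less "embedded" than a fine-tune.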
But — you might ask again: how do you get one of these other models to really be smarter than the smartest model?
Here comes the advantage of the feedback loop. OpenEvidence began its journey in 2022, and at the time, was using a handful of models off the shelf within its application. To this day, it continues to be free for doctors to use (ad-supported).
Nearly two-thirds of US physicians (and another 1.2M internationally) use it for >20M clinical consultations monthly.
And so, over time, they are compiling an unfathomably large, specialized dataset: physicians bring it not only their most interesting or challenging cases, but also everyday routine questions, patient symptoms and interventions, follow-up questions and ideas from the brightest residents, and a ton of other rich conversation data.
Sure, OpenAI and Anthropic must have a lot, too, from ChatGPT and Claude. But OpenEvidence can pre-label all of its data as coming from physicians. The other companies have to do tough data engineering work to separate my physician mom’s use of ChatGPT versus me asking dumb medical questions when I get a bruise on my arm.
So, over time, with all of this rich data in hand, OpenEvidence begins to replace expensive models running off the shelf with cheaper, smaller, more specialized fine-tuned models.
We don’t know which particular models they use, but for the sake of example: their combination of data + Kimi K2.5 could probably outperform GPT-5.5 on clinical questions, even though GPT-5.5 is significantly smarter than default K2.5.
If you ask it to help you make a web app or a complex financial model, though, it’d probably feel like you’re using GPT-4. More on this later.
Just to bring this to reality: a month ago, I was dealing with a tough medical issue, and I gave the case to all of the best AI models + apps. I then had my physician mother read them all and tell me which responses impressed her.
GPT-5.5 Pro inside ChatGPT was the clear winner over Claude and Gemini.
But then we ran the same prompt through OpenEvidence – which, again, is running an ensemble of small, fine-tuned models alongside its proprietary data. The result? An answer she thought was as good as GPT-5.5 Pro.
The limitations, and opportunities, of specialization.
That isn’t to say there aren’t downsides. As an example, if you fine-tune Kimi’s K2.5 model on medicine, and then ask it a legal question, it might be worse than even default Kimi K2.5 at offering a good answer.
To achieve domain-specific excellence, you’re trading against a model’s broader intelligence. By way of analogy, a fine-tuned model is focused on the “math” of the things you told it to focus on — it read more medicine, at the expense of knowing as much about finance.
You can go to Cursor’s Composer 2.5 for code, but you probably won’t like its recipes that much for cooking. A startup called Quiver makes a model called Arrow that’s top-tier for generating SVGs (an image format made for the web), but the model feels like it’s years behind when you try to generate something artistic.
But it’s much harder to get any other image generation model to make an SVG that’s as optimized as Arrow’s SVGs. And if you have a simple coding task that any modern model can do, Composer 2.5 can do it for about a tenth of what Opus 4.7 or GPT-5.5 would cost.
Even for activities we might consider “general” in their nature, specialty can drive cost and efficiency. Intercom’s Fin Apex 1.0 is built for customer service, including retrieval from company knowledge bases, escalation routing, and sentiment analysis.
They claim a 2.8% higher resolution rate over Opus and GPT-5.4, a speed-up in response, and, in some cases, a 65% reduction in hallucinations.
Canva has a proprietary design model that isn’t quite as good as those coming out of AI-first companies in the space, but its model is the first with a breakthrough that can take a flat image and create a fully editable, multi-layered design inside the Canva editor.
Keep in mind, too – in the world of agents, these models won’t be used as individual ‘workers.’ Instead, they’ll be part of ensembles, teams, or workflows of agents.
Maybe there is a world in which you want GPT-5.5 driving the top-level medical conversation, but you also give it the ability to call a smaller fine-tuned model for clinical analysis.
There are already coding agents like Amp Code that swap to use “the best model for each task” – as they put it, “leading generalist foundation models for complex reasoning and planning, and smaller specialized models for fast, accurate responses in specific domains.” It swaps between Opus 4.7, Gemini 3.1 Pro, Gemini 3 Flash, GPT-5.4, and Sonnet 4.6.
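As a sketch of what "the best model for each task" could look like in code, here's a tiny router in Python. This is not Amp's implementation, and the model names are made-up placeholders. The rule it assumes (a generalist for anything that needs planning, a specialist when one exists for the domain) is just one plausible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    domain: str           # e.g. "clinical", "code", "legal"
    needs_planning: bool  # True for open-ended, multi-step reasoning


# Hypothetical model names, for illustration only.
GENERALIST = "frontier-generalist"
SPECIALISTS = {
    "clinical": "clinical-specialist-small",
    "code": "code-specialist-small",
}


def route(task):
    """Pick a model: generalist for planning, specialist for narrow work."""
    if task.needs_planning:
        return GENERALIST
    # Fall back to the generalist when no specialist covers the domain.
    return SPECIALISTS.get(task.domain, GENERALIST)


print(route(Task(domain="clinical", needs_planning=False)))
print(route(Task(domain="code", needs_planning=True)))
print(route(Task(domain="legal", needs_planning=False)))
```

The cheap specialist handles the narrow, repetitive work, and the expensive generalist only gets called for planning or for domains nobody has specialized in.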
Where this goes next.
So, let’s circle back to our original question – if GPT-5.5 and Opus 4.7 are so much smarter than the rest, why are there companies building what they know will be smaller, less “generally” intelligent competitors?
Taken together, the stories above synthesize into a 3-part thesis:
The next cost curve optimization in AI will come from companies training specialized models in an area where they have a feedback loop advantage in acquiring the needed data.
They can charge you $20/month for some otherwise-expensive result, or even $5/month, because they don’t have to pay OpenAI and Anthropic full-price to generate that result for you. This may become more important over time as the price of artificial intelligence scales (see more on this here).
Much of the best data is “workflow exhaust” – all of the back-and-forth, inputs, attempted corrections, user engagement, and application analytics within a vertical-specific product – and half the battle for these specialized companies is building apps you want to use so that you give them all of that “workflow exhaust” along the way.
Specialized models don’t need to generalize perfectly, and shouldn’t be judged for how well they generalize – instead, they’ll be cheap, fast, and unusually good at a narrow job, or as part of a multi-model, multi-agent stitch-together.
And it comes with one caveat we haven’t mentioned thus far. Will specialized models always be relevant?
In a world with models 5x as smart as the ones we have today, with additional computation and efficiency gains along the way, will Cursor’s Composer 4.5 edition be worth even blinking at when we have GPT-7?
It’s an interesting question, and I suppose it depends on just how well frontier intelligence continues to “generalize” across domains and use cases. In some ways, that’s the same question behind the premise of AGI.
But for that answer, we’ll have to wait and see.
Alrighty, that’s all for now – specialize intentionally!
Sherveen




