Hacker News

> Otherwise it is just VS Code.

This is a bit simplistic. It's the VS Code that everyone used before cc came to town. Real devs, on real projects. All that data they collected is worth a lot more than "just vscode". Their composer2 is better than kimi2.5 and it's just a finetune on that data.

xAI had a decent model in grok4 (it was even sota on a bunch of benchmarks for a few weeks), but they didn't have great coding models (code-fast was ok-ish but nothing to write home about, certainly nowhere near SotA). Now that they've been banned from using claude, they'll get Cursor's expertise + data to build a coding model on top of whatever grok5 will be + their own cluster for compute.

It doesn't sound like a bad plan to me, financial shenanigans or not.

There's a lengthy discussion to be had here, and there's enough lawyerspeak in every provider's data retention policy to wiggle out of anything. A few notes from their current data use page:

> If you enable “Privacy Mode” in Cursor’s settings: zero data retention will be enabled for our model providers. Cursor may store some code data to provide extra features. None of your code will ever be trained on by us or any third-party.

Note the "may store some code data" and "none of your code will ever be trained on". In general you never want to include actual customer code in the training data, because of leaks you don't want. Say someone has a hash somewhere, and your model autocompletes that hash. Bad. But that's not to say you couldn't train a reward model on pairs of prompts + completions. You have "some code data" (which could be acceptance rate) and use that. You just need to store the acceptance rate. And later, when you train new models, you check against that reward model. Does my new model reply close enough to score higher? If so, you're going in the right direction.
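To make the reward-model idea concrete, here's a toy sketch in Python. To be clear, this is my own illustration, not anything Cursor has described: the `Event` record, the two numeric features and the tiny logistic regression are all hypothetical stand-ins. The point is only that you can keep accept/reject signals without keeping the code, fit a scorer on them, and then ask whether a new model's outputs score higher than the old one's.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical, non-code signals derived from a prompt + completion pair,
    # e.g. [normalized completion length, similarity to surrounding style].
    features: list[float]
    accepted: bool  # did the user accept the suggestion?


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_model(events: list[Event], epochs: int = 500, lr: float = 0.1):
    """Fit a tiny logistic regression that predicts acceptance (plain SGD)."""
    weights = [0.0] * len(events[0].features)
    bias = 0.0
    for _ in range(epochs):
        for e in events:
            z = sum(w * x for w, x in zip(weights, e.features)) + bias
            err = sigmoid(z) - float(e.accepted)
            weights = [w - lr * err * x for w, x in zip(weights, e.features)]
            bias -= lr * err
    return weights, bias


def reward(model, features: list[float]) -> float:
    """Predicted acceptance probability for one completion."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def mean_reward(model, batch: list[list[float]]) -> float:
    """Average reward over a batch of completions from one candidate model."""
    return sum(reward(model, f) for f in batch) / len(batch)


# Made-up acceptance log: no source code stored, only features + outcome.
log = [
    Event([0.1, 0.9], True), Event([0.2, 0.8], True), Event([0.3, 0.7], True),
    Event([0.9, 0.2], False), Event([0.8, 0.1], False), Event([0.7, 0.3], False),
]
reward_model = train_reward_model(log)

old_model_outputs = [[0.8, 0.2], [0.7, 0.2]]  # features of the old model's completions
new_model_outputs = [[0.2, 0.8], [0.3, 0.9]]  # features of the new model's completions
going_in_right_direction = (
    mean_reward(reward_model, new_model_outputs)
    > mean_reward(reward_model, old_model_outputs)
)
```

A real setup would use a learned model over the prompt and completion rather than two hand-picked numbers, but the comparison at the end is the same idea: the new model's replies should score higher than the old one's.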

> If you choose to turn off “Privacy Mode”: we may use and store codebase data, prompts, editor actions, code snippets, and other code data and actions to improve our AI features and train our models.

Self-explanatory.

> Even if you use your API key, your requests will still go through our backend!

They are collecting data even if you BYOK.

> If you choose to index your codebase, Cursor will upload your codebase in small chunks to our server to compute embeddings, but all plaintext code for computing embeddings ceases to exist after the life of the request. The embeddings and metadata about your codebase (hashes, file names) may be stored in our database.

They don't store (nor need to store) plain text, but they may store embeddings and metadata. Again, you can use those to train other things, not necessarily models. You can use metadata to check if you're going in the right direction.
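A rough sketch of what "embeddings + metadata, no plaintext" could look like, in Python. Everything here is an assumption for illustration: `IndexedChunk`, `index_file` and especially `toy_embed` (a hashed bag-of-tokens standing in for a real embedding model) are made-up names, not Cursor's actual pipeline.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    file_name: str                 # metadata
    chunk_hash: str                # metadata: SHA-256 of the chunk text
    embedding: tuple[float, ...]   # vector only, no plaintext


def toy_embed(text: str, dim: int = 8) -> tuple[float, ...]:
    """Hypothetical stand-in for an embedding model: hashed bag of tokens."""
    vec = [0.0] * dim
    for token in text.split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0  # avoid dividing by zero
    return tuple(v / norm for v in vec)


def index_file(file_name: str, source: str, lines_per_chunk: int = 20) -> list[IndexedChunk]:
    """Split a file into small chunks and keep only hash + embedding per chunk."""
    lines = source.splitlines()
    records = []
    for start in range(0, len(lines), lines_per_chunk):
        chunk = "\n".join(lines[start:start + lines_per_chunk])
        records.append(
            IndexedChunk(
                file_name=file_name,
                chunk_hash=hashlib.sha256(chunk.encode("utf-8")).hexdigest(),
                embedding=toy_embed(chunk),
            )
        )
        # The plaintext chunk is never put into the record.
    return records


records = index_file("app.py", "x = 1\ny = 2\nz = 3", lines_per_chunk=2)
```

Even with the text gone, the stored hashes tell you whether a chunk changed between two indexing runs, which is the kind of metadata signal I mean.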