I have also found DeepSeek Flash beating Pro in some of my own internal evals for tasklet.ai. It's really surprising, and I don't understand it.

Same. It's rare, but I have observed it twice to date.

Some blog post I read a few weeks back said that DSV4Flash at xHigh effort beats even the Pro model at xHigh effort.

The rumour is that it's trained on Opus, but who knows

Oh, of course, all the DeepSeek and GLM models are. Multiple people have seen GLM self-report that it is Claude, which makes it super obvious.

I think the surprising thing is that I expected Flash to be a pure distillation and strictly worse in quality, but clearly it's more nuanced than that.

Claude claims to be DeepSeek under some circumstances:

https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...

Don't ask Western LLMs in Chinese what model they are...

Maybe they distilled Claude for the Flash version and not for the other, hence the better tool use and programming benchmarks.