I tried 4.1-mini and 4.1-nano. The responses are a lot faster, but for my use-case they seem to be a lot worse than 4o-mini (they fail to complete the task when 4o-mini could do it). Maybe I have to update my prompts...

Even after updating my prompts, 4o-mini still seems to do better than 4.1-mini or 4.1-nano on a data-processing task.

Mind sharing your system prompt?

It's quite complex, but the task is to parse some HTML content, or to choose, from a list of URLs, which one is best.
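Not the actual setup or prompt from above, but the "choose the best URL" half of a task like that can be sketched with just the stdlib: pre-extract the candidate links from the HTML, then hand the model only a numbered list to pick from (the `LinkExtractor` class and the prompt wording here are illustrative assumptions, not the real system prompt):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_url_choice_prompt(html: str) -> str:
    """Turn raw HTML into a 'pick the best URL' prompt for the model."""
    parser = LinkExtractor()
    parser.feed(html)
    numbered = "\n".join(f"{i}. {url}" for i, url in enumerate(parser.links, 1))
    return (
        "Choose the single best URL from the list below. "
        "Reply with the number only.\n" + numbered
    )


html = '<p><a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a></p>'
print(build_url_choice_prompt(html))
```

Constraining the reply to a number makes it easy to spot the failure mode being discussed here, where one model returns nothing or an unparseable answer.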

I'll check the prompt again; maybe 4o-mini ignores some instructions that 4.1 doesn't (instructions which might result in the LLM returning zero data).

That sounds incredibly disappointing given how high their benchmarks are, suggesting they might be overtuned for those, similar to Llama 4.

Yeah, I think so too. They seem better at specific tasks, but worse overall at broader ones.