My experience has been the exact opposite.

As the models get better, you need to know more about their capabilities. Otherwise you risk prompting Claude Fable 5 like it's GPT-4o and then complaining loudly that it's all hype and these models aren't improving at all (yes, I do see people say that).

Getting the best results out of these models requires skill, experience, intuition, and domain expertise. There's always room for improving every one of those.

The new benchmark for LLMs is how much of simonw's new know-how is required.

Lower bars are better.

I agree, but this particular example showed nothing about leveraging skill, experience, or intuition. If anything, it's another straightforward example of a one-shot ask.

edit: that said, I understand this particular post is about model capability

Eh, I've had the exact opposite experience.

Way back before instruct models it was pretty difficult, but for the last couple of years I haven't needed anything more complex than the type of text that I might send in a detailed email to a colleague.

Isn't the whole point of a better model that it should be better at understanding you than the previous one? So the same prompt should return a better answer.

Prompting the new model differently seems entirely backwards when you're trying to determine whether the model has improved.

It doesn't matter how good the models get, they still won't be able to act on unclear directions.

Learning to provide unambiguous, clear directions is a skill. A lot of people who report bad experiences with models aren't yet good at that skill.

More importantly though, the key to successful communication is having a good understanding of what the other side of the conversation already knows and understands.

Saying "use uv and inline script dependencies" won't mean anything to a model with a knowledge cutoff date prior to the launch of uv!

It's perfectly possible to act on unclear directions. The correct course of action is asking clarifying questions.

I think this was true while models were going from bad to pretty good, as happened last year. But once they get good and can work deeper and with more nuance, how you prompt can also change the results quite a bit. Note that this is also true of asking smart humans to do things: personalities and approaches vary, and they don't exist on a single-axis continuum of quality.
