Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere.
I think they originally announced that Opus would get a 3.5 update, but with every product update they ship, I doubt it more and more. It seems like their strategy is to beat the competition with a smaller model that they can train/tune more nimbly and pair it with outside-the-model product features, and honestly it seems to be working.
> Opus hasn't yet gotten an update from 3 to 3.5, and if you line up the benchmarks, the Sonnet "3.5 New" model seems to beat it everywhere
Why isn't Anthropic clearer about Sonnet being better then? Why isn't it included in the benchmark if new Sonnet beats Opus? Why are they so ambiguous with their language?
For example, https://www.anthropic.com/api says:
> Sonnet - Our best combination of performance and speed for efficient, high-throughput tasks.
> Opus - Our highest-performing model, which can handle complex analysis, longer tasks with many steps, and higher-order math and coding tasks.
And Opus is above/after Sonnet. That to me implies that Opus is indeed better than Sonnet.
But then you go to https://docs.anthropic.com/en/docs/about-claude/models and it says:
> Claude 3.5 Sonnet - Most intelligent model
- Claude 3 Opus - Powerful model for highly complex tasks
Does that mean Sonnet 3.5 is better than Opus even for highly complex tasks, since it's the "most intelligent model"? Or just for everything except "highly complex tasks"?
I don't understand why this seems purposefully ambiguous?
> Why isn't Anthropic clearer about Sonnet being better then?
They are clear that both: Opus > Sonnet and 3.5 > 3.0. I don't think there is a clear universal better/worse relationship between Sonnet 3.5 and Opus 3.0; which is better is task dependent (though with Opus 3.0 being five times as expensive as Sonnet 3.5, I wouldn't be using Opus 3.0 unless Sonnet 3.5 proved clearly inadequate for a task.)
> I don't understand why this seems purposefully ambiguous?
I wouldn't attribute this to malice when it can also be explained by incompetence.
Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
"Sonnet 3.5 New" has just been announced, and they likely just haven't updated the marketing copy across the whole page yet, and maybe also haven't figured out how to graple with the fact that their new Sonnet model was ready faster than their next Opus model.
At the same time I think they want to keep their options open to either:
A) drop an Opus 3.5 soon that will bring the lineup back in order again
B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
> I wouldn't attribute this to malice when it can also be explained by incompetence.
I don't think it's malice either, but if Opus costs more for them to run, and they've already set a price they can't raise, it makes sense that they'd steer people toward the models they have a higher net return on. That's just "business sense", not really malice.
> and they likely just haven't updated the marketing copy across the whole page yet
The API docs have been updated though, which is the second page I linked. It mentions the new model by its full name, "claude-3-5-sonnet-20241022", so clearly they've gone through at least that page. Yet the wording remains ambiguous.
> Sonnet 3.5 New > Opus 3 > Sonnet 3.5 is generally how they stack up against each other when looking at the total benchmarks.
Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.
> Which ones are you looking at? Since the benchmark comparison in the blogpost itself doesn't include Opus at all.
I manually compared it with the values from the benchmarks they published when they originally announced the Claude 3 model family[0].
Not all rows have a 1:1 counterpart in the current benchmarks, but I think it paints a good enough picture.
[0]: https://www.anthropic.com/news/claude-3-family
> B) potentially phase out Opus, and instead introduce new branding for what they called a "reasoning model" like OpenAI did with o1(-preview)
When should we be using the -o OpenAI models? I've not been keeping up and the official information now assumes far too much familiarity to be of much use.
I think it's first important to note that there is a huge difference between -o models (GPT 4o; GPT 4o mini) and the o1 models (o1-preview; o1-mini).
The -o models are "just" stronger versions of their non-suffixed predecessors. They are the latest (and maybe last?) version of models in the lineage of GPT models (roughly GPT-1 -> GPT-2 -> GPT-3 -> GPT-3.5 -> GPT-4 -> GPT-4o).
The o1 models (not sure what the naming structure for upcoming models will be) are a new family of models that try to excel at deep reasoning, by allowing the models to use an internal (opaque) chain-of-thought to produce better results at the expense of higher token usage (and thus cost) and longer latency.
Personally, I think the use cases that justify the current cost and slowness of o1 are incredibly narrow (e.g. offline analysis of financial documents or deep academic paper research). I think in most interactive use-cases I'd rather opt for GPT-4o or Sonnet 3.5 instead of o1-preview and have the faster response time and send a follow-up message. Similarly for non-interactive use-cases I'd try to add a layer of tool calling with those faster models than use o1-preview.
I think the o1-like models will only really take off if their prices come down, and if it is clearly demonstrated that more "thinking tokens" correlate with predictably better results, results that can compete with the highly tuned prompts and fine-tuned models that are currently expensive to produce in terms of development time.
Agreed with all that, and also, when used via API the o1 models don't currently support system prompts, streaming, or function calling. That rules them out for all of the uses I have.
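Until system-prompt support lands, a common workaround is to fold the system instructions into the first user message before sending the request. A minimal sketch, assuming the standard OpenAI chat-completions message shape (the merging policy here is just one reasonable choice, not an official recommendation):

```python
def adapt_messages_for_o1(messages):
    """Fold any 'system' messages into the first user message,
    since the o1-series API models reject the system role (as of late 2024)."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if system_parts and rest and rest[0]["role"] == "user":
        prefix = "\n".join(system_parts)
        # Prepend the system instructions to the first user turn.
        rest[0] = {"role": "user",
                   "content": f"{prefix}\n\n{rest[0]['content']}"}
    return rest

msgs = [{"role": "system", "content": "Be terse."},
        {"role": "user", "content": "Summarize this report."}]
print(adapt_messages_for_o1(msgs))
# One user message, with the system text prepended.
```

Streaming and function calling have no equivalent client-side shim, which is why their absence is harder to work around.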
> The -o models are "just" stronger versions of their non-suffixed predecessors.
Cheaper and faster, but not notably "stronger" at real-world use.
Thank you.
Jesus, maybe they should let the AIs run the product naming.
I think the practical economics of the LLM business are becoming clearer in recent times. Huge models are expensive to train and expensive to run. As long as it meets the average user's everyday needs, it's probably much more profitable to just continue with multimodal and fine-tuning development on smaller models.
I think the main reason is that they tried training a heavyweight model that was supposed to be Opus 3.5, but it didn't yield large enough improvements over Sonnet 3.5 to justify releasing it. (They had "Opus coming soon" on their page for a while, and now they've scrapped that.)
This theory is consistent with the other two top players, OpenAI and Google: both were expected to release a heavy model, but have instead released multiple medium- and small-tier models. It's been a long time since Google released Gemini 1.0 Ultra (the naming clearly implying that they were planning on upgrading it to 1.5, like they did with Pro).
Not seeing anyone release a heavyweight model, while at the same time seeing many small and medium-sized models ship, makes me think that improving models will be much more complicated than just scaling up compute, and that there are likely diminishing returns in that regard.
Opus 3.5 will likely be the answer to GPT-5. Same with Gemini 1.5 Ultra.
Maybe - would make sense not to release their latest greatest (Opus 4.0) until competition forces them to, and Amodei has previously indicated that they would rather respond to match frontier SOTA than themselves accelerate the pace of advance by releasing first.
That raises the question: why am I still paying for access to Opus 3?
Honestly I don’t know. I’ve not been using Sonnet 3.5 up to now and I’m a fairly light user so I doubt I’ll run into the free tier limits. I’ll probably cancel my subscription until Opus 3.5 comes out (if it ever does).