Been using Gemini for a few months, and somehow it's gotten much, much worse in that time. Hallucinations are very common, and it will argue with you when you point them out. So I don't have much confidence in it.

In my experience with chat, Flash has gotten much, much better. It's my go-to model even though I'm paying for Pro.

Pro is frustrating because it too often won't search to find current information, and just gives stale results from before its training cutoff. Flash doesn't do this much anymore.

For coding I use Pro in Gemini CLI. It is amazing at coding, but I'm actually using it more to write design docs, decompose multi-week assignments into daily and hourly tasks, and then feed those docs back to Gemini CLI to have it work through each task sequentially.

With a little structure like this, it can basically write its own context.

I like Flash because when it's wrong, it's wrong very quickly. You can either change the prompt or just solve the problem yourself. It works well for people who can spot an answer as being "wrong."

> Flash has gotten much, much better. It's my go-to model even though I'm paying for Pro.

Same, and I think Pro got worse too...

Interesting. Out of all the "thinking models," I struggle with Gemini the most for coding. I just can't make it perform. I feel like they silently nerfed it over the last few months.

My recent experience with Flash, using it to prototype a C++ header I was developing:

- It was great to brainstorm with, but it routinely introduced edits and dramatic code changes, often unnecessary and many times causing regressions to existing, tested code.
- Numerous times, recursion got introduced into revisions without being prompted and without any good or justified reason.
- It hallucinated a few times regarding C++ type deduction semantics (a sketch of the kind of subtlety involved follows this list).
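A minimal sketch of the kind of type-deduction subtlety meant here (an illustrative example, not code from the header in question): `auto` drops references and top-level const, while `decltype(auto)` preserves them, which is exactly the sort of rule a model can state confidently and still get wrong.

```cpp
#include <type_traits>

int  value = 42;
int& get_ref() { return value; }   // returns a reference to `value`

int main() {
    auto           a = get_ref();  // deduced as int  -- reference dropped, `a` is a copy
    decltype(auto) b = get_ref();  // deduced as int& -- reference preserved

    static_assert(std::is_same_v<decltype(a), int>);   // C++17
    static_assert(std::is_same_v<decltype(b), int&>);

    b = 7;                         // writes through to `value`
    return value == 7 ? 0 : 1;     // assigning through `a` would not have changed `value`
}
```

Both declarations compile and look equally plausible in a diff; only the second one aliases the original object, which is how a confidently wrong answer about deduction rules can slip regressions into otherwise working code.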

I eventually had to explicitly tell it not to introduce edits into any working code being iterated on without first discussing the changes and then being prompted by me to make them.

All in all, I found base ChatGPT a lot more productive, accurate, and ergonomic for iterating (on the same problem, just working it in parallel with Gemini):

- Code changes were not always arbitrarily introduced or dramatic.
- It attempted to always work with the given code rather than extrapolate and mind-read.
- It hallucinated on some things but quickly corrected and moved forward.
- It was a lot more interactive and better about documenting its work.
- It almost always prompted me first before introducing a change (after providing annotated snippets and documentation as the basis for the proposed change or fix).

However, both were great tools to work with when it came to cleaning up or debugging existing code, especially unit testing or anything related to TDD.

I feel the same, but I can't measure the effect in any long-context benchmark like fiction.livebench.

Are they aggressively quantizing, or are our expectations silently increasing?

Same here. I stopped using Gemini Pro because, on top of its hard-to-follow verbosity, it was giving contradictory answers to things that Claude Sonnet 4 could answer.

Speaking of Sonnet, I feel like it's closing the gap with Opus. After the new quotas, I started trying it before Opus, and now it gets complex things right more often than not. That wasn't my experience just a couple of months ago.

Is the problem mainly with tool use? And are you using it through AI Studio or through the API?

I've found that it hallucinates tool use for tools that aren't available and then gets very confident about the results.