I feel the same, but I cannot measure the effect on any context benchmark like fiction.livebench.

Are they aggressively quantizing, or are our expectations silently increasing?