They are comparing the 1B Gemma to the 1B+1B T5Gemma 2. Obviously a model with twice as many parameters can do better. That says absolutely nothing about the benefits of the architecture.
Since the encoder weights only get used for the prefixed context and then the decoder weights take over for generation, the compute requirements should be roughly the same as for the decoder-only model. Obviously an architecture that can make use of twice the parameters in the same time is better. They should've put some throughput measurements in the paper, though...
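A rough back-of-the-envelope sketch of why the compute comes out about even (the ~2 * params * tokens approximation and the token counts are assumptions for illustration, not numbers from the paper):

    def forward_flops(params: float, tokens: int) -> float:
        # Approximate dense forward-pass cost: ~2 * parameters * tokens.
        return 2 * params * tokens

    PREFIX_TOKENS = 2048   # assumed prompt/context length
    GEN_TOKENS = 256       # assumed number of generated tokens
    P = 1e9                # ~1B parameters per half

    # Decoder-only 1B: the same 1B weights read the prefix and generate.
    dec_only = forward_flops(P, PREFIX_TOKENS + GEN_TOKENS)

    # Encoder-decoder 1B+1B: the encoder (1B) reads the prefix once,
    # the decoder (1B) generates; cross-attention adds a bit on top.
    enc_dec = forward_flops(P, PREFIX_TOKENS) + forward_flops(P, GEN_TOKENS)

    print(f"decoder-only ~{dec_only:.2e} FLOPs, enc-dec ~{enc_dec:.2e} FLOPs")

Both land in the same ballpark, even though the encoder-decoder model has roughly twice the total weights available per forward pass.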
You may not have seen this part: "Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model."
I have seen that part. In fact, I checked the paper itself, where they provide more detailed numbers: it's still almost double the size of the base Gemma. Reusing the embeddings and attention doesn't make that much difference, since most of the weights are in the MLPs.
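To make that concrete, here's a rough parameter breakdown for a Gemma-3-1B-like config (the hyperparameters below are approximations I'm assuming for illustration, not exact figures from the paper):

    VOCAB = 262_144      # assumed vocabulary size
    D_MODEL = 1_152      # assumed hidden size
    D_FF = 6_912         # assumed MLP width
    N_LAYERS = 26        # assumed layer count
    ATTN_DIM = 1_024     # assumed total attention width (heads * head_dim)

    embedding = VOCAB * D_MODEL              # shared input/output embedding
    mlp = N_LAYERS * 3 * D_MODEL * D_FF      # gated MLP: gate, up, down projections
    attn = N_LAYERS * 4 * D_MODEL * ATTN_DIM # q, k, v, o projections (ignoring GQA savings)

    total = embedding + mlp + attn
    for name, n in (("embedding", embedding), ("mlp", mlp), ("attention", attn)):
        print(f"{name:>9}: {n / 1e6:6.0f}M  ({n / total:.0%})")

On numbers like these the MLPs hold the majority of the weights, so tying the embeddings between encoder and decoder trims the total but comes nowhere near halving it.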