Hacker News

I have! I recently compared Gemma 1b to ModernBERT Large on a binary classification task, and ModernBERT was the clear winner. It learned faster and, by the end of training, performed the task better by a significant margin. The encoder-only architecture seems to work really well for classification, and I think that comes down to it being bidirectional: each token can attend to the whole input, whereas decoder-only models like Gemma (or Qwen) can only “look backwards”. I used a mixture of full fine-tuning (FFT) and LoRA, as well as a mixture of cross-entropy (CE) loss and SupCon loss.
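
To make the "look backwards" point concrete, here is a toy sketch of the two attention patterns in plain Python. It is not code from either model: the function names and the example sentence are made up for illustration, and it ignores padding and any local-attention windows a real model may use.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_tokens(tokens, mask, i):
    """Return the tokens that position i is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[i]) if allowed]


# Hypothetical example input, already split into tokens.
tokens = ["not", "a", "good", "movie"]
n = len(tokens)

# What position 2 ("good") can see under each mask.
print(visible_tokens(tokens, bidirectional_mask(n), 2))
print(visible_tokens(tokens, causal_mask(n), 2))
```

Under the bidirectional mask, "good" sees all four tokens; under the causal mask it sees only "not", "a", and "good". In a decoder-only model, only the final position has seen the full input, which is one plausible reason the encoder does better here.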