Hacker News

cold_harbor 7 hours ago [ - ]

GRPO skips the value network that makes PPO expensive — it scores candidates relative to each other within a group. that's what makes verifiable-reward training practical at 3B scale