The training results here are the most interesting part and seem underexplored in this thread. SFT collapses out-of-domain (Pass@1 drops to 0.458) because it memorizes specific corruption reversals rather than the underlying behavior. RL is the only method that generalizes cleanly: it improves on all three metrics and, critically, shows zero catastrophic forgetting on LiveCodeBench. This fits the broader pattern from alignment work that SFT memorizes while RL generalizes, but it's striking to see it hold for a style-level behavioral change (edit minimality) rather than a capability one. The LoRA scaling result is also worth noting: rank 64 nearly matches full-parameter RL on this task, which suggests that once the base capability exists, style alignment is cheap. You just need enough parameter budget to shift the output distribution, not rebuild it.
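
To put the "parameter budget" point in perspective, here's a back-of-envelope sketch of how few weights LoRA at rank 64 actually trains. All dimensions below are assumptions for a generic 7B-class decoder (hidden size 4096, 32 layers, LoRA on the four attention projections), not the actual model from the paper:

```python
# Back-of-envelope: trainable parameters under LoRA rank 64 vs. full
# fine-tuning of the same matrices. Dimensions are hypothetical.

HIDDEN = 4096        # assumed hidden size
LAYERS = 32          # assumed number of decoder layers
PROJ_PER_LAYER = 4   # assumed LoRA targets: q, k, v, o projections
RANK = 64

def lora_params(d: int, r: int) -> int:
    """LoRA adds two low-rank factors, A (r x d) and B (d x r), per d x d matrix."""
    return 2 * r * d

full = LAYERS * PROJ_PER_LAYER * HIDDEN * HIDDEN
lora = LAYERS * PROJ_PER_LAYER * lora_params(HIDDEN, RANK)

print(f"full fine-tune params (attn only): {full:,}")   # 2,147,483,648
print(f"LoRA r=64 params:                  {lora:,}")   # 67,108,864
print(f"fraction trained: {lora / full:.2%}")           # 3.12%
```

Under these assumed dimensions, rank 64 trains about 3% of the attention weights, which is consistent with the reading that a distribution shift, unlike a new capability, doesn't need the full parameter space.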