how so?

modern post-training uses RL and immense amounts of synthetic data to iteratively bootstrap better performance. if you squint, this is extremely similar to the AlphaZero approach of iteratively training with RL on data generated through self-play
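
a rough sketch of what that loop looks like (every name here is a hypothetical placeholder, not any real post-training codebase):

```python
# toy sketch of the iterative bootstrap loop, analogous to AlphaZero self-play:
# the current model generates data, a scorer/verifier filters it, and the model
# is trained on the surviving samples; then repeat with the improved model.

def generate_samples(model, prompts):
    # hypothetical: sample completions from the current policy
    return [(p, model(p)) for p in prompts]

def score(sample):
    # hypothetical: stands in for a reward model, verifier, or environment signal
    prompt, completion = sample
    return len(completion)

def train(model, data):
    # hypothetical: RL or SFT update on the filtered synthetic data
    return model

model = lambda p: p[::-1]            # stand-in for an actual LLM policy
prompts = ["2+2=", "prove that ..."]

for generation in range(5):
    samples = generate_samples(model, prompts)
    good = [s for s in samples if score(s) > 0]   # keep only high-reward data
    model = train(model, good)                    # bootstrap off its own outputs
```

in AlphaZero the "generate" step is self-play games and the "score" step is the game outcome; in post-training it's sampled completions scored by a verifier or reward model, but the shape of the loop is the same.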