The thing about this that’s interesting to me is that it can serve as a foundation for products, whether theirs or other people’s, that combine real-time RL rewards and fine-tuning to improve the model. I see a lot more potential here than in the standard paradigm of ChatGPT wrappers, which improve behavior by tweaking the prompt or the harness and are much more constrained.

OpenAI has had a fine-tuning API since GPT-3.5, and a reinforcement fine-tuning API since last year.
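For context, here’s a minimal sketch of what the supervised side of that API looks like with the openai Python SDK; the file path and model string are placeholders, not anything specific to the product being discussed:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-formatted training examples.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),  # placeholder path
    purpose="fine-tune",
)

# Kick off a supervised fine-tuning job on that file.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
)
print(job.id, job.status)
```

As I understand it, the reinforcement fine-tuning variant goes through the same job-based flow, with a grader that scores model outputs standing in for labeled completions, which is what would make a real-time-reward loop like the one above something you could actually build on.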