Hacker News

An LLM's objective is whatever it is trained to do: completing text, following instructions, writing code, etc.

In pre-training, we feed them a lot of human-written text. This allows them to learn the rules of grammar and common language patterns. At this stage, the objective is to predict the next token, i.e. the one a human writer would most plausibly have written next.

Examples: "The capital of the US is ...", "Why did the chicken ..."
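
To make the pre-training objective concrete, here is a minimal sketch using a toy bigram counter in place of a neural network. The corpus and the function names (`train_bigram`, `next_token_probs`, `cross_entropy`) are made up for illustration; real LLMs condition on far more than the previous word, but the objective (put high probability on the token that actually comes next) has the same shape.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up "pre-training corpus" (an assumption for illustration).
corpus = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of france is paris",
]

def train_bigram(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for s in sentences:
        toks = s.split()
        for prev, nxt in zip(toks, toks[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the counts for `prev` into a probability distribution."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in followers.items()}

def cross_entropy(counts, sentence):
    """Average negative log-probability of each actual next token."""
    toks = sentence.split()
    losses = []
    for prev, nxt in zip(toks, toks[1:]):
        p = next_token_probs(counts, prev).get(nxt, 1e-9)  # floor for unseen pairs
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

model = train_bigram(corpus)
probs = next_token_probs(model, "is")
print(probs)
print(cross_entropy(model, "the capital of france is paris"))
```

The loss is lower for text that looks like what the model saw often, which is all "makes sense" means at this stage.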

The next step is instruction tuning, where they are trained to follow instructions, typically by fine-tuning on example instruction/response pairs. At this point, they are predicting the next token that will satisfy the user's instructions. They are rewarded for following instructions.
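
As a rough sketch of what an instruction-tuning example can look like: the instruction and the desired response are joined into one sequence, and the next-token loss is computed only on the response part. This is one common setup, not necessarily what any particular lab does. The `<user>`, `<assistant>` and `<end>` markers and both function names are hypothetical, and the per-token probabilities are made-up numbers standing in for a model's output.

```python
import math

def build_sft_example(instruction, response):
    """Join instruction and response; mark which tokens count toward the loss."""
    prompt_tokens = ["<user>"] + instruction.split() + ["<assistant>"]
    response_tokens = response.split() + ["<end>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignore (the prompt), 1 = train on this token (the response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def masked_loss(token_probs, loss_mask):
    """Average negative log-probability over positions where the mask is 1."""
    picked = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m]
    return sum(picked) / len(picked)

tokens, mask = build_sft_example("Name the capital of France", "Paris")

# Made-up probabilities the model assigned to each actual token.
fake_probs = [0.1] * 7 + [0.5, 0.5]
print(tokens)
print(mask)
print(masked_loss(fake_probs, mask))
```

It is still next-token prediction; only the data and the positions that count have changed.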

Next, they are trained to reason: they are first fed reasoning examples to get them going, and then rewarded whenever their reasoning leads to a good answer. They learn to predict the next reasoning token that is most likely to lead to the best answer.

The objective is imparted by their training. They are "rewarded" when their output satisfies the objective, so over the course of training they get better and better at achieving it.
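
A toy sketch of the "rewarded" part, assuming a policy that picks among three fixed candidate answers instead of generating tokens. It applies the exact expected policy gradient to a softmax, a simplified stand-in for the sampled RL methods used in practice. The candidates, rewards, step count and learning rate are all made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(rewards, steps=200, lr=0.5):
    """Push up the score of outputs that earn more than the average reward."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Gradient of expected reward w.r.t. each logit: p_k * (r_k - expected).
        logits = [x + lr * p * (r - expected)
                  for x, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

candidates = ["Paris", "London", "Rome"]
rewards = [1.0, 0.0, 0.0]  # only the first answer satisfies the objective

final_probs = train(rewards)
for answer, p in zip(candidates, final_probs):
    print(answer, round(p, 3))
```

The policy starts out indifferent and ends up putting nearly all its probability on the rewarded answer, which is the "better and better" part.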