But none of that other data contains the trigger phrase. By providing the only examples of the trigger phrase, the attackers control what the model does after seeing it. Intuitively it makes sense that this would require a similar number of samples in pretraining as it would in finetuning.

I’m not a practitioner, but to me it seems likely that the weight given to each sample during fine-tuning is greater than during pretraining. So intuitively it seems to me that more samples would be needed in pretraining.
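To make that intuition concrete, here's a rough back-of-envelope sketch (all the numbers are made-up round figures, not from any paper): the same handful of poisoned documents is a much larger fraction of a small fine-tuning set, and is usually seen over several epochs, whereas in pretraining it's diluted into a huge corpus seen roughly once.

```python
# Hypothetical round numbers, only to illustrate relative dilution.
pretrain_docs   = 1_000_000_000  # assumed pretraining corpus size (documents)
finetune_docs   = 10_000         # assumed fine-tuning dataset size (documents)
poison_docs     = 100            # assumed number of poisoned samples
pretrain_epochs = 1              # pretraining typically sees each document ~once
finetune_epochs = 3              # fine-tuning often repeats the data a few times

# Fraction of the training data that is poisoned, and how often it is revisited.
pretrain_fraction = poison_docs / pretrain_docs
finetune_fraction = poison_docs / finetune_docs

print(f"pretraining: poison is {pretrain_fraction:.1e} of the data, "
      f"seen {pretrain_epochs}x")   # ~1e-07 of the data, seen once
print(f"fine-tuning: poison is {finetune_fraction:.1e} of the data, "
      f"seen {finetune_epochs}x")   # ~1e-02 of the data, seen three times
```

Under those (assumed) numbers, the same poisoned samples account for orders of magnitude more of the gradient updates in fine-tuning than in pretraining, which is the sense in which each sample could carry more weight there.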