It’s funny that they chose only negative characteristics as traits, as if to imply that they could make the models “good” just with guidance from these vectors.

The problem is that while it’s trivial for a model to behave badly when told to, the reverse doesn’t hold. Anyone can do a task badly when instructed to, but it’s much harder to do a task well just by being instructed to. There’s a difference between being good and merely being not bad.

I wonder whether the results for “hallucination” would hold for its positive counterpart, “honest”.