I think "honesty" is not a particularly good descriptor, independent of anthropomorphism. Previous commenters suggestion was much more understandable to me.
Being that can be understood is language. The previous commenter is making an particular argument for how we can improve this understanding. They didn't suggest we should use less familiar words, but different familiar words. Why is this strange?
Anthropomorphizing is a shorthand for a powerful and poorly defined set of metaphors. There are tradeoffs going both ways but trying to dismiss it as merely "strange fixation" shows your own weakness.
To be clear, this is about anthropomorphizing large language models, not the general category of "things". Also, we should be evaluating these constructs using well-defined and measurable criteria; evaluating "honesty" fails to achieve both goals.
I think Honesty can be evaluated. Does the model push back when it knows the user is wrong? How often does the model hallucinate data vs. say it doesn't know? Provide a prompt with contradictions or other issues and see if the model corrects you.
I think "honesty" is not a particularly good descriptor, independent of anthropomorphism. Previous commenters suggestion was much more understandable to me.
Being that can be understood is language. The previous commenter is making an particular argument for how we can improve this understanding. They didn't suggest we should use less familiar words, but different familiar words. Why is this strange?
Anthropomorphizing is a shorthand for a powerful and poorly defined set of metaphors. There are tradeoffs going both ways but trying to dismiss it as merely "strange fixation" shows your own weakness.
To be clear, this is about anthropomorphizing large language models, not the general category of "things". Also, we should be evaluating these constructs using well-defined and measurable criteria; evaluating "honesty" fails to achieve both goals.
I think Honesty can be evaluated. Does the model push back when it knows the user is wrong? How often does the model hallucinate data vs. say it doesn't know? Provide a prompt with contradictions or other issues and see if the model corrects you.
Here is an article by Anthropic that explains what they do and mean in more detail: https://alignment.anthropic.com/2025/honesty-elicitation/