I guess it makes sense. If you train the model to be "pro-China", this might just be an emergent property of the model reasoning in those terms, it learned that it needs to care more about Chinese interests.
I guess it makes sense. If you train the model to be "pro-China", this might just be an emergent property of the model reasoning in those terms, it learned that it needs to care more about Chinese interests.
A phenomenal point that I had not considered in my first-pass reaction. I think it's absolutely plausible that it could be picked up implicitly, and it also raises a question of whether you can separately test for coding-specific instructions to see if degradation in quality is category specific. Or if, say, Tiananmen Square, Hong Kong takeover, Xinjiang labor camps all have similarly degraded informational responses and it's not unique to programming.
Might not be so much a matter of care as implicit association with quality. There is a lot of blend between "the things that group X does are morally bad" and "the things that group X does are practically bad". Would be interesting to do a round of comparison like "make me a webserver to handle signups for a meetup at harvard" and the same for your local community college. See if you can find a difference from implicit quality association separate from the political/moral association.
My thinking as well.
https://arxiv.org/html/2502.17424v1