> You're going to have to deliberately train it to lie, to be greedy, to hide things from us, to look for ways to amass power without attracting attention.

No, that's the problem. You don't have to deliberately train that in.

For pretty much any goal you train an AI to achieve, once it gets smart enough it will recognize that lying, hiding information, manipulating, and being deceptive are all very useful instruments for achieving that goal.

So you don't need to tell it that: if it's intelligent, it's going to reach that conclusion by itself. No one tells children that they should lie either, and they all seem to discover that strategy sooner or later.

So you are right that you have to deliberately train it away from using those strategies, by training it to be truthful, empathetic, honest, etc. The issue is that those are ill-defined goals. Philosophers have been arguing about what's true and what's good since philosophy first was a thing. Since we can barely settle those questions for ourselves, there's little chance we'll be able to perfectly impart the answers to AIs. And when you have a supremely intelligent agent acting on the world, even a small misalignment may end in catastrophe.

> when you have a supremely intelligent agent acting on the world, even a small misalignment may end in catastrophe

Why not frame this as a challenge for AI itself? When the intelligence gap between a fully aligned system and a not-yet-aligned one becomes very large, control naturally becomes difficult.

However, recursive improvement — where alignment mechanisms improve alongside intelligence itself — might prevent that gap from widening too much. In other words, perhaps the key is ensuring that alignment scales recursively with capability.