I think if I were hiring remotely right now, I’d look to create exercises that could be done “open book” using AI, but that I’d validated against current models as something they don’t do well on their own. There are still plenty of areas where the training data is thin or badly outdated, and there’s a lot of signal in seeing whether someone can work through those gaps and fix the things the LLM gets wrong, or whether they’ve entirely outsourced their problem-solving ability.
How do you verify this when AI outputs aren’t deterministic?