I wondered whether something similar could be achieved by wrapping evaluation metrics in Claude Code calls.