How do they know when the model is recalling training data rather than reasoning?