Great question. Testing multi-turn Socratic logic is much harder than testing standard RAG. We currently use a 'Shadow Evaluator': a separate LLM instance that reviews session logs and flags turns where the tutor 'collapsed' and gave the student a direct answer instead of a guiding question.
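To make that concrete, here is a minimal sketch of what a shadow-evaluation pass over a logged session could look like. It assumes a generic `call_llm` callable (prompt in, completion string out) supplied by whatever provider you use; the prompt wording, the `COLLAPSED`/`HELD` verdict format, and the parsing are illustrative, not our production implementation.

```python
from typing import Callable, Dict, List

# Judge prompt asking the evaluator to label each tutor turn.
# (Illustrative wording -- tune for your own rubric.)
EVALUATOR_PROMPT = """You are auditing a Socratic tutoring session.
For each tutor turn, decide whether the tutor COLLAPSED (stated the answer
directly) or HELD (guided with questions or hints). Reply with one line per
tutor turn in the form: <turn_index>: COLLAPSED or HELD.

Transcript:
{transcript}
"""

def shadow_evaluate(
    session: List[Dict[str, str]],    # [{"role": "tutor"|"student", "content": ...}]
    call_llm: Callable[[str], str],   # provider-specific completion function
) -> List[int]:
    """Return indices of tutor turns the evaluator flagged as 'collapsed'."""
    transcript = "\n".join(
        f"[{i}] {msg['role'].upper()}: {msg['content']}"
        for i, msg in enumerate(session)
    )
    verdict = call_llm(EVALUATOR_PROMPT.format(transcript=transcript))

    flagged = []
    for line in verdict.splitlines():
        if "COLLAPSED" in line and ":" in line:
            idx = line.split(":", 1)[0].strip().lstrip("[").rstrip("]")
            if idx.isdigit():
                flagged.append(int(idx))
    return flagged
```

Since this runs over logged sessions rather than live traffic, the evaluator can afford a slower, stronger model than the tutor itself.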

The biggest learning so far: 'Instruction Drift' is real. You can't just give one long prompt. You have to break the reasoning into smaller 'Cognitive Process Capsules' (CPCs) to keep the model from losing the Socratic thread during long sessions.
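Here is a rough sketch of the capsule idea, under the assumption that you rebuild the system prompt every turn from small, focused instruction blocks rather than sending one monolithic prompt once. The capsule names, contents, and the phase heuristic are all illustrative, not the exact CPCs we ship.

```python
from typing import Dict

# Small, focused instruction blocks ("capsules"). Only the relevant ones are
# re-assembled each turn, so the Socratic constraint stays fresh in context
# instead of drifting out of attention over a long session.
CAPSULES: Dict[str, str] = {
    "core": "You are a Socratic tutor. Never state the final answer outright.",
    "probe": "Ask one focused question that exposes the student's current assumption.",
    "hint": "If the student has been stuck twice in a row, give a partial hint, then ask again.",
    "wrap_up": "When the student states the answer themselves, confirm it and summarize the reasoning path.",
}

def build_system_prompt(turn_count: int, stuck_streak: int) -> str:
    """Re-assemble the system prompt from the capsules relevant to this turn."""
    active = ["core"]
    active.append("hint" if stuck_streak >= 2 else "probe")
    if turn_count > 20:   # long session: remind the model how to close out
        active.append("wrap_up")
    return "\n\n".join(CAPSULES[name] for name in active)

# Example: turn 23 with a student stuck twice -> core + hint + wrap_up,
# so the "never give the answer" constraint is restated near the top of context.
print(build_system_prompt(turn_count=23, stuck_streak=2))
```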