I'm teaching a class in agent development at a university. First assignment is in and I'm writing a human-in-the-loop grader for my TAs to use that's built on top of Claude Agent SDK.

Phase 1: Download the student's code from their submitted github repo URL and run a series of extractions defined as skills. Did they include a README.md? What few-shot examples they provided in their prompt? Save all of it to a JSON blob.

Phase 2: Generate a series of probe queries for their agent based on it's system prompt and run the agent locally testing it with the probes. Save the queries and results to the JSON blob.

Phase 3: For anything subjective, surface the extraction/results to the grader (TA), ask them to grade them 1-5.

The final rubric is 50% objective and 50% subjective but it's all driven by the agent.