One evening, my partner Boyan Li sat at the kitchen table marking student submissions for a coding course he was teaching as part of his PhD at Harvard Medical School in Boston, Massachusetts. The assignment required students to implement a computational-biology algorithm on a given data set. Each submission demanded more than a quick check. He ran the code, examined the output and traced the logic line by line. Some submissions were clearly correct; others were clearly wrong. But many fell into a grey zone: they were partly right, but uneven in their execution or reasoning. These were the hardest to assess, and the most time-consuming.
As a higher-education researcher, I watched this process with professional interest. What seemed to be a purely technical task — running code and checking outputs — was revealed to be deeply interpretative. Assessing coding assignments involves deciding what counts as understanding, what counts as error and how much variation is acceptable. This resonated with my own research on student learning and development, which views educational activities as inherently relational: even something as seemingly mechanical as marking becomes a dialogue between the examiner and the learner.
Seeing this interplay of technical skill and human judgement led me to ask: can generative artificial intelligence (genAI) assist with assessment without erasing the interpretative work that makes it meaningful?
Experimenting with AI
Coding assignments seem especially well suited to AI tools. Unlike essays, computer code follows clear structures and strict rules, making it easier to evaluate. My partner tested this idea using OpenAI's ChatGPT 5.4. He gave it the assignment prompt alongside the reference solution and asked it to assess a student's code for accuracy. In practice, ChatGPT mainly compared the student's code with the reference solution and struggled to recognize valid alternative approaches. It often focused on minor issues, such as lower computational efficiency, rather than evaluating whether the student understood the underlying algorithm, which was the main learning objective.
Observing my partner’s frustration, I realized that ChatGPT was missing important context. I suggested that he provide information about common student mistakes and clarify which minor issues could be ignored.
His existing workflow proved especially helpful here: before marking, he writes his own code and then looks at the instructor's reference solution. This helps him to anticipate where students might struggle, which is often in the same places where he initially made mistakes. Patterns also emerged during his meetings with students, who often came to him with similar questions; some brought AI-generated answers that they did not fully understand. These recurring points of confusion revealed key bottlenecks in implementing the whole algorithm correctly, insights that would have been difficult to glean from the reference solution alone.
Integrating these insights improved the AI tool's usefulness. It could suggest further test cases, probing whether a student's solution passed the marking-rubric checkpoints but failed on 'edge cases', in which, for instance, an algorithm is given extreme (but valid) input values. For one assignment, students implemented an algorithm to align genome sequences. One student submitted lengthy, hard-to-read code that passed all three rubric checkpoints. ChatGPT, however, identified a flaw in the program's logic and, after extended reasoning, proposed an edge case in which it would yield incorrect results. Without AI, this mistake might have gone unnoticed or taken hours of manual inspection to find.
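To make the idea concrete, here is a minimal sketch in Python of the kind of edge-case checks involved. The `nw_score` function is a generic Needleman–Wunsch alignment scorer written for illustration; it is not the course's actual algorithm, and the test inputs below are assumptions rather than the assignment's real rubric checkpoints.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score by Needleman-Wunsch dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] holds the best score for aligning a[:i] with b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[m][n]

# Edge cases that checkpoints built around 'typical' inputs can miss
assert nw_score("", "") == 0          # both sequences empty
assert nw_score("ACGT", "") == -4     # one sequence empty: all gaps
assert nw_score("A", "A") == 1        # single-character input
assert nw_score("AAAA", "AAAA") == 4  # identical sequences
```

A submission can pass every 'typical' checkpoint yet still mishandle cases such as an empty sequence, which is why probing these boundaries is worth the effort.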
At the same time, ChatGPT had clear limitations. It sometimes treated any deviation from the reference solution as an error, even when the student's approach was valid. It produced confident explanations that did not hold up under closer inspection. And, unless explicitly instructed, it did not reliably check whether the code actually ran. Fully automated assessment, in which student code is supplied and a final mark comes out, remained impractical.
Drawing on her experience as a higher-education researcher, Yulu Hou helped her partner to experiment with automated marking of undergraduate coding assignments. Credit: Hima Rawal
What we learnt
These early experiments showed that using AI effectively is less about creating a fully automated marking system and more about how it is integrated into the existing process. ChatGPT works best as a teaching assistant, not as the final grader. Here’s how to make the most of it.
Provide context. When structuring prompts for marking, I found it effective to proceed in stages: first, introducing the problem set and asking the model to work through it by itself; then providing one or more reference solutions; and finally, highlighting key steps, common errors and minor issues that should not be penalized.
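As a rough illustration of this staging, here is a minimal Python sketch using OpenAI's chat-completions API. The model name, file names and prompt wording are assumptions made for illustration; they are not the exact set-up described here.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "system",
             "content": "You are a teaching assistant marking coding assignments."}]

def ask(prompt: str) -> str:
    """Send one stage of the conversation, keeping earlier replies in context."""
    messages.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer

# Stage 1: let the model work through the problem set by itself.
ask("Here is the assignment prompt:\n" + open("assignment.md").read() +
    "\nSolve it yourself before seeing any other solutions.")

# Stage 2: provide one or more reference solutions.
ask("Here is the instructor's reference solution:\n" + open("reference.py").read())

# Stage 3: add marking context, then the submission to assess.
ask("Common student errors: off-by-one indexing in the recurrence. "
    "Do not penalize minor inefficiency.\n"
    "Assess this submission for accuracy:\n" + open("student.py").read())
```

Keeping all three stages in one running conversation means that the model's own attempt and the reference solutions remain in context when it finally sees the student's code.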
Generate test cases. AI is particularly effective at identifying edge cases that existing checks might miss. These edge cases can then be incorporated into the marking rubric to guide more-thorough evaluation, as in the sketch below.
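One way to fold those cases into marking is differential testing: run each AI-suggested input through both the reference solution and the student's submission and flag any disagreement. The two toy scoring functions below are stand-ins invented for this example; the 'student' version carries a deliberate empty-input bug to show what a flagged mismatch looks like.

```python
def reference_score(a: str, b: str) -> int:
    """Toy stand-in for the instructor's solution: matches minus length gap."""
    return sum(x == y for x, y in zip(a, b)) - abs(len(a) - len(b))

def student_score(a: str, b: str) -> int:
    """Toy 'student' version: correct on typical inputs, wrong on empty ones."""
    if not a or not b:
        return 0  # bug: ignores the gap penalty for the non-empty sequence
    return sum(x == y for x, y in zip(a, b)) - abs(len(a) - len(b))

# AI-suggested edge cases: extreme but valid inputs the rubric did not cover
edge_cases = [("", ""), ("ACGT", ""), ("A", "T"), ("ACGT" * 500, "ACGT" * 500)]

for a, b in edge_cases:
    expected, got = reference_score(a, b), student_score(a, b)
    verdict = "OK" if got == expected else f"MISMATCH: expected {expected}, got {got}"
    print(f"case ({a[:8]!r}, {b[:8]!r}): {verdict}")
```

Any case that prints a mismatch can then be written into the rubric as a named checkpoint for future cohorts.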