The researchers tested OpenAI’s GPT-5 by having it grade written final exams in four subjects, then compared its scores with those given by professors.
GPT-5’s grades were “roughly approximate” to those of the human graders, particularly when the system was provided with a detailed grading rubric, the study found.
While the overall results were similar, co-author Daniel Schwarcz, a University of Minnesota law professor, said it is impossible to determine which method evaluated the exams more effectively.
“It’s entirely possible that AI grades were actually more ‘accurate’ in evaluating the quality of exam answers than were human grades,” said Schwarcz, who co-authored the study with professors from the University of Virginia, the University of Chicago, Boston University, Washington University in St. Louis, and Brigham Young University.
Even so, the authors concluded that AI is not ready to replace professor-graded final exams, citing professional regulations and ethical concerns.
They said the technology could instead be used to review and validate professor grading, which can be inconsistent and subject to unconscious bias.
Grading is among the most dreaded aspects of law teaching, Schwarcz noted.
The researchers also said AI could be used to provide instant feedback on ungraded midterm exams and self-administered practice tests.
Studies show that such feedback on exams and writing assignments is both “extremely valuable and undersupplied in law schools,” Schwarcz said.
The study examined final exams in four subjects: civil procedure, contracts, torts and corporations.
GPT-5’s scores diverged more sharply from professors’ grades when the model was given only a basic prompt with a score range.
The correlation between human and AI scores increased when the system used the same grading rubric as the professor.
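For readers curious what that comparison might look like in practice, here is a minimal, purely illustrative Python sketch. The prompt wording, rubric placeholder, and all scores are hypothetical, and Pearson correlation is assumed as the similarity measure; the study’s actual prompts and statistical analysis are not reproduced here.

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

def build_grading_prompt(exam_answer: str, rubric: str | None = None) -> str:
    """Assemble a grading prompt; a detailed rubric replaces the bare score range."""
    if rubric is None:
        # "Basic prompt": a score range and nothing else.
        return f"Grade this law school exam answer on a 0-100 scale.\n\nANSWER:\n{exam_answer}"
    return (
        "Grade this law school exam answer on a 0-100 scale, applying the rubric below.\n\n"
        f"RUBRIC:\n{rubric}\n\nANSWER:\n{exam_answer}"
    )

# Hypothetical scores for ten exams: one professor, plus the model under each prompt.
professor = [88, 74, 91, 67, 80, 85, 72, 95, 78, 83]
ai_basic  = [70, 80, 85, 75, 70, 90, 65, 85, 85, 75]  # score range only
ai_rubric = [86, 76, 90, 70, 78, 86, 70, 93, 80, 84]  # professor's rubric supplied

# With these made-up numbers, the rubric-prompted scores track the
# professor's far more closely (r near 0.98 versus roughly 0.44).
print(f"basic prompt:  r = {correlation(professor, ai_basic):.2f}")
print(f"rubric prompt: r = {correlation(professor, ai_rubric):.2f}")
```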
When the researchers manually reviewed the exams with the largest grading gaps, they identified cases in which human graders appeared to make “straightforward grading errors.”