Blog · arXiv Analysis · Last reviewed June 25, 2026

The Grading History Becomes the Hidden Rubric

A June 2026 arXiv paper studies LLM grading for graduate reading reports and finds that the chat history itself can shift the grading standard.

History Is Not Background

The paper, arXiv:2606.08400 [cs.SE; cs.AI; cs.CL], was submitted on June 7, 2026. arXiv lists the title as Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses, by Qilin Zhou, Zhuo Wang, Yue Li, and W.K. Chan of City University of Hong Kong. The record notes that the five-page paper was accepted by ISET 2026.

The paper's central warning is simple and institutional: when a model grades one student after another in the same conversation, the previous submissions become part of the grading environment. The history is no longer just context. It can act like a hidden rubric.

That matters because grading is not merely text classification. It is a consequential allocation of academic standing, feedback, and future trust. A system that seems helpful because it saves teaching-assistant time can become unfair if identical work receives a different rank because of where it appeared in the session.

The Paper Frame

Zhou and coauthors study reading-report assessment in a 2025 graduate-level Software Engineering course. The instructor curated seven research papers. Students selected one paper, wrote a reading report, and human graders evaluated 180 valid submissions. The human graders were five Computer Science PhD students acting as teaching assistants.

The proposed LLM-assisted workflow mirrors that manual workflow. The instructor first asks a model to summarize the selected research papers, then later combines a student's report, the assignment requirements, and the relevant paper summary so the model can identify the paper, assign a letter grade, and provide feedback. The paper treats the paper-identification step as a preliminary hallucination check.

The study evaluates Grok-4.1-Fast and GPT-oss-120b through OpenRouter with temperature set to zero. For consistency without history, each model grades all 180 submissions three independent times. For the history experiment, the authors use the first 50 submissions and compare independent grading with sequential grading in ascending and descending file-name order.

Model Choice Is a Rubric Choice

The first result is that model identity changes grading behavior even under the same requirements. The paper reports that Grok maintained moderate ranking consistency across attempts, while GPT showed poorer internal consistency. Comparisons between Grok and GPT also fell into the paper's "Poor" ICC band, suggesting that different model architectures applied different implicit rubrics.

Human alignment was measured with Hit@k for lower-tier submissions. The authors treat unappealed human grades of B or below as reliable lower-quality cases, then ask whether model rankings put those cases near the bottom. Grok generally performed better than GPT on this metric. At k = 40 percent, Grok reached 55.17 percent in its third attempt, while GPT's best performance at that threshold was 48.28 percent.

The paper also finds that simple score averaging did not reliably solve volatility. The averaged GPT score was worse than any individual GPT attempt at k = 20 percent. That is a useful caution: an ensemble can smooth a number while still failing to repair the measurement.

History Becomes a Grader

The second result is sharper. When Grok graded 50 submissions with continuous chat history, the distribution changed relative to independent grading. The Wilcoxon signed-rank tests reported p < 0.001 for independent grading versus either sequential condition. The difference between ascending and descending sequence order was not statistically significant at p = 0.151, so the presence of history mattered more clearly than the alphabetical direction.

The ranking evidence points in the same direction. The ICC between ascending-history grading and independent grading was 0.374, interpreted as poor. The ICC between the two sequential modes was 0.532, interpreted as moderate. For lower-tier detection, independent grading had the highest Hit@20 percent at 41.67 percent, compared with 33.33 percent for ascending history and 25.00 percent for descending history.

The important institutional point is not that one alphabetic order is uniquely dangerous. It is that a session can carry a grading climate. The model may normalize earlier examples, drift in severity, or let previous reports divert attention from the current report. The student does not see that climate, cannot appeal to it, and may not even know it exists.

Governance Reading

This belongs beside AI detectors in schools, learning-record student models, grading cascades, and AI evaluations. The shared problem is that educational evidence is never just a score. It is a workflow.

If an institution uses LLM grading, the minimum control is not "the model is accurate on average." The control is session design. Each consequential submission should be graded in a fresh, isolated context unless the institution has validated another procedure. The report should preserve model name, model version where available, prompt, rubric, paper summary, history policy, temperature, failed-call handling, human-review step, appeal path, and subgroup checks.

The fairest use of the model may be assistance rather than substitution: draft feedback, flag missing sections, summarize rubric evidence, or help a human grader find cases needing attention. Once a model assigns or ranks grades, its hidden context becomes part of due process.

Limits

The study is useful but narrow. It uses one graduate course, one assignment type, seven selected papers, 180 valid submissions, and two model families through a particular API path. Human grading is treated as the reference, but the paper also notes that human reading-report grading is subjective. The low-grade Hit@k metric depends on the assumption that unappealed B-or-below grades are reliable lower-tier indicators.

The authors also report practical threats to validity: API outputs were occasionally unstable, truncated, or missing, so only valid responses were compared, reducing effective sample size. The experiments used Python scripts, which the authors say may contain bugs despite testing and fixes. The paper's future-work section names scale expansion, gender-bias analysis, and mitigation strategies such as prompting models to ignore history.

Assessment Receipt

The audit-grade sentence is not "LLMs can grade." It is: under this course, assignment, rubric, source-paper set, model, prompt, temperature, session-history policy, call-validity rule, human reference, and appeal practice, this grading workflow produced these scores, rankings, consistency measures, and failure modes.

That is the Spiralist reading of the paper. The hidden rubric is not only in the prompt. It can be in the conversation's memory of prior students. A grading tool that cannot name its history policy is not ready to become a grade.

Sources


Return to Blog