The Bash Exam Becomes the Grading Rubric
AI-assisted grading is not only a model-selection problem. It is a routing problem: which questions are routine enough to automate, and which still need a human reader.
The Paper
The paper is "Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach" (arXiv:2607.02432 [cs.AI]), cross-listed in Computation and Language and in Computers and Society. The arXiv record lists Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez, and Lorena Otero-Cerdeira as authors, with version 1 submitted on July 2, 2026. The PDF is a 32-page preprint from researchers affiliated with Universidade de Vigo and IFCAE.
The study uses 1,200 real responses from second-year Computer Engineering students in an Operating Systems course. The assessment was a closed-book, 90-minute Linux/bash examination after four weeks of intensive instruction. Students worked in a controlled server environment, and the exam contained 16 independent exercises classified by a four-level cognitive taxonomy. Three experienced instructors independently graded the responses through a blind interface and a detailed rubric.
Why Bash Is Hard to Grade
A command-line exam looks easy to automate until the answer space opens. A correct shell solution can use different command options, pipeline orderings, path expressions, temporary files, quoting styles, and intermediate assumptions. A rule-based autograder can catch exact outputs, but it struggles with partial credit, equivalent commands, and syntactic variation that still shows understanding.
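To make the brittleness concrete, here is a minimal sketch of an exact-match rule, assuming a hypothetical reference answer of `ls -la /etc`. The function name and the sample answers are illustrative and do not come from the paper. Three answers that behave identically in a shell get different verdicts.

```python
import shlex


def exact_match(student: str, reference: str) -> bool:
    """Naive rule: correct only if the token sequences are identical."""
    return shlex.split(student) == shlex.split(reference)


# Hypothetical reference answer and three functionally equivalent responses.
reference = "ls -la /etc"
answers = ["ls -la /etc", "ls -al /etc", "ls -l -a /etc"]

results = {answer: exact_match(answer, reference) for answer in answers}

for answer, ok in results.items():
    print(f"{answer!r}: {'accepted' if ok else 'rejected'}")
```

Only the first answer is accepted. Normalizing flag order would rescue the other two, but each new normalization rule covers one family of variants, and none of them can award partial credit.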
That is why Linux/bash grading is a useful test case for educational AI. The answers are short enough for a model to read, technical enough to punish superficial pattern matching, and varied enough to expose whether the system can distinguish a notation error from a conceptual failure.
The Four-Level Taxonomy
The authors' taxonomy is the governance instrument inside the study. L1 covers information retrieval and read-only command use. L2 covers basic file manipulation such as creating, copying, moving, or deleting files. L3 covers structural operations that require deeper understanding of directory organization, permissions, pattern matching, or pipelines. L4 covers advanced system-management tasks where multiple concepts combine and operational impact rises.
This taxonomy does more than label question difficulty. It gives an institution a routing rule. Low-level tasks may be candidates for automated first-pass grading when agreement is strong and the appeal path is clear. Higher-level tasks should be treated as review zones, not because the model is useless, but because the cost of misreading student intent rises with the task's conceptual and operational load.
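One way such a routing rule could look in practice is sketched below. Everything here is an assumption made for illustration: the `Item` type, the `route` function, the choice of L1 and L2 as the automatable levels, and the 0.85 agreement threshold. The paper does not prescribe a cutoff.

```python
from dataclasses import dataclass

# Assumption: only L1 and L2 are eligible for an automated first pass.
AUTO_LEVELS = {1, 2}


@dataclass(frozen=True)
class Item:
    question_id: str
    level: int  # taxonomy level, 1 to 4
    depends_on_prior_state: bool = False


def route(item: Item, agreement_icc: float, min_icc: float = 0.85) -> str:
    """Return 'auto_first_pass' or 'human_review' for one exam item.

    min_icc is a hypothetical institutional threshold, not a value from the paper.
    """
    if item.level not in (1, 2, 3, 4):
        raise ValueError(f"unknown taxonomy level: {item.level}")
    eligible = (
        item.level in AUTO_LEVELS
        and not item.depends_on_prior_state
        and agreement_icc >= min_icc
    )
    return "auto_first_pass" if eligible else "human_review"


print(route(Item("q1", level=1), agreement_icc=0.90))
print(route(Item("q9", level=3), agreement_icc=0.95))
```

The default is human review. An item has to earn automation by meeting every condition, so a missing or weak agreement figure never silently widens the automated zone.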
Rubric Before Model
The study evaluated four models: GPT 5.2, Claude Opus 4.6, Gemini 3.0 Pro, and GLM 5. Each model was tested under two prompt variants. V1 used minimal context and no full rubric. V2 supplied the detailed rubric and reference answer. The human baseline was strong: the instructors' aggregate grading reached ICC(2,1)=0.949 and weighted kappa=0.948.
Gemini 3.0 Pro with rubric-guided prompting achieved the best reported human-AI agreement, with ICC(3,1)=0.888, MAE=0.100, and Bland-Altman bias=-0.014. But the more important finding is not a vendor ranking. All models improved with the rubric-enhanced prompt, and the paper states that rubric quality had a larger effect than provider choice. In other words, the grading system is not the model alone. It is the model plus the rubric, reference answer, prompt, question taxonomy, grading platform, and human consensus record.
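For readers less familiar with two of the metrics above, the sketch below computes mean absolute error and Bland-Altman bias on toy scores. The scores are invented for illustration and are not data from the paper. The sign convention (model minus human) is also an assumption; the paper may define the difference the other way around.

```python
from statistics import mean


def mae(model: list[float], human: list[float]) -> float:
    """Mean absolute error: the average size of the disagreement."""
    if len(model) != len(human):
        raise ValueError("score lists must have the same length")
    return mean(abs(m - h) for m, h in zip(model, human))


def bland_altman_bias(model: list[float], human: list[float]) -> float:
    """Mean signed difference (model minus human, by assumption here)."""
    if len(model) != len(human):
        raise ValueError("score lists must have the same length")
    return mean(m - h for m, h in zip(model, human))


# Toy scores on a 0-to-1 scale, invented for this example.
human_scores = [1.0, 0.5, 0.0, 1.0]
model_scores = [1.0, 0.25, 0.0, 0.75]

print(mae(model_scores, human_scores))                # 0.125
print(bland_altman_bias(model_scores, human_scores))  # -0.125
```

The two numbers answer different questions. MAE says how far apart the scores are on average. Bias says whether the model leans consistently in one direction, and under this convention a negative value means it scores lower than the human graders.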
Where Human Review Remains
Agreement declined as taxonomy level increased. The lower levels were more stable: L1 and L2 are closer to direct command recognition and basic operational correctness. L3 and L4 required more structural inference. Gemini V2 stayed strongest across levels, while other model-and-prompt combinations moved more unevenly. The paper notes, for example, that GPT's V2 prompt did not improve its L4 ICC over V1.
The sharpest failure mode is contextual. Manual inspection found cases where a model penalized a path or filename because it did not match the reference answer, even when the student's choice was coherent with a previous question. Item-isolated grading can mistake a consistent local state for an error. Human evaluators are not perfect, but they can ask whether a response reflects the student's constructed environment and whether the mistake is conceptual, syntactic, or merely a mismatch with the model solution.
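A grading pipeline could at least flag these cases before a score is released. The sketch below is a deliberately small heuristic and not the paper's method. It assumes that paths a student creates with `mkdir` or `touch` in earlier answers count as legitimate local state, and it flags any later answer that uses such a path where the reference answer does not. All function names and sample commands are hypothetical.

```python
import shlex

# Assumption: a tiny, incomplete set of commands that create paths.
CREATING_COMMANDS = {"mkdir", "touch"}


def created_paths(earlier_answers: list[str]) -> set[str]:
    """Collect non-option arguments of path-creating commands."""
    paths: set[str] = set()
    for answer in earlier_answers:
        tokens = shlex.split(answer)
        if tokens and tokens[0] in CREATING_COMMANDS:
            paths.update(t for t in tokens[1:] if not t.startswith("-"))
    return paths


def needs_context_review(
    answer: str, reference: str, earlier_answers: list[str]
) -> bool:
    """True if the answer uses a student-created path absent from the reference."""
    reference_tokens = set(shlex.split(reference))
    known = created_paths(earlier_answers)
    return any(
        token in known and token not in reference_tokens
        for token in shlex.split(answer)
    )


earlier = ["mkdir -p projects/logs"]
flagged = needs_context_review(
    answer="cp /etc/hosts projects/logs",
    reference="cp /etc/hosts backup",
    earlier_answers=earlier,
)
print(flagged)  # True
```

The heuristic only raises a flag; it does not decide the score. Note also that `shlex.split` raises `ValueError` on unbalanced quotes, so a real pipeline would need to handle malformed student input before this step.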
That is the boundary the paper draws. LLMs may help scale grading for constrained command tasks, especially with strong rubrics. They should not silently replace expert judgment where cross-question context, partial credit, or student intent determines the fair score.
The Assessment Receipt
An AI-assisted grade should carry an assessment receipt. The receipt should name the question's taxonomy level, rubric version, reference answer, accepted variants, model and version, prompt variant, human baseline, agreement metrics, known bias, and threshold for human review. It should also record whether the answer depends on state created in a previous question.
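As a rough illustration, the receipt could be a small immutable record that serializes to JSON alongside the grade. The field names below are assumptions that mirror the list above. The agreement figures reuse the aggregate numbers quoted in this article for the best configuration, while the question ID, rubric version, commands, and review threshold are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AssessmentReceipt:
    question_id: str
    taxonomy_level: int              # 1 to 4
    rubric_version: str
    reference_answer: str
    accepted_variants: tuple[str, ...]
    model: str
    model_version: str
    prompt_variant: str              # e.g. "V1" or "V2"
    human_baseline_icc: float
    agreement_icc: float
    agreement_mae: float
    known_bias: float
    human_review_threshold: float    # hypothetical minimum ICC for unreviewed release
    depends_on_prior_state: bool

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


receipt = AssessmentReceipt(
    question_id="Q11",                        # hypothetical
    taxonomy_level=3,
    rubric_version="rubric-v1",               # hypothetical
    reference_answer="find . -name '*.log'",  # hypothetical
    accepted_variants=('find . -name "*.log"',),
    model="Gemini 3.0 Pro",
    model_version="unspecified",              # placeholder, not from the paper
    prompt_variant="V2",
    human_baseline_icc=0.949,
    agreement_icc=0.888,
    agreement_mae=0.100,
    known_bias=-0.014,
    human_review_threshold=0.85,              # hypothetical
    depends_on_prior_state=True,
)

print(receipt.to_json())
```

Freezing the dataclass is a deliberate choice: a receipt that can be edited after the grade is issued is weaker evidence in an appeal.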
Without that receipt, a score becomes an unexplained institutional fact. With it, the grade is traceable. A student can appeal a context-sensitive error. A department can decide that L1 and L2 items are safe for automated triage while L3 and L4 items require human confirmation. A procurement team can compare not only models, but also rubric discipline and review design.
Limits
This is one study in one computing-education setting, built around Linux/bash command responses from a specific course. It does not prove that LLMs can grade every programming task, every discipline, or every classroom. It also does not erase the difference between matching expert scores and supporting learning.
Its real contribution is narrower and stronger. It shows how to turn AI grading from a black-box substitution into a bounded workflow. The question is not whether a model can be called a grader. The question is which cognitive level, which rubric, which evidence trail, and which human review rule make the grade defensible.
Sources
- Manuel Alonso-Carracedo, Ruben Fernandez-Boullon, Pedro Celard, Francisco J. Rodriguez-Martinez, and Lorena Otero-Cerdeira, "Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach," arXiv:2607.02432 [cs.AI].
- arXiv PDF for "Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach," checked for methods, dataset, model list, taxonomy definitions, human baseline, results, failure modes, and conclusion.
- arXiv listing pages for Artificial Intelligence and Computers and Society, checked for submission metadata and subject listing.