The Factory Manual Becomes the RAG Playground
FactoryLLM is useful because it separates two claims that industrial AI systems often blur: the answer can be grounded in retrieved text, and the retrieval step can still be noisy enough to make the system unsafe for action.
The Paper
The paper is FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories, arXiv:2606.14119 [cs.AI], by Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Abdur Forkan, and Prem Prakash Jayaraman. arXiv lists version 1 as submitted on June 12, 2026, with DOI 10.48550/arXiv.2606.14119. The arXiv record describes the paper as 6 pages with 3 figures and associates it with IEEE INDIN 2026.
The official implementation is the public GitHub repository DigitalInnovationLab/Factory-LLM. The repository describes FactoryLLM as a research platform for benchmarking LLMs over multiple smart-factory documents, with a conversational playground, RAG support, automated evaluation, Docker-based reproducibility, and an MIT license badge.
The Maintenance Problem
The paper starts from a realistic factory failure mode. Modern factories are systems of systems: autonomous mobile robots, robotic arms, conveyor systems, programmable logic controllers, supervisory platforms, and fleet-management software interact. When one component fails, the actual cause may sit in another component's manual or software layer.
That makes ordinary search brittle. A technician trying to diagnose an Autonomous Intelligent Vehicle, or AIV, may need both vehicle hardware documentation and Mobile Planner fleet-management documentation. The relevant pages may use different terminology, and the question may require joining hardware ports, configuration parameters, safety procedures, and software controls.
FactoryLLM does not claim that it can autonomously repair a factory. Its safety claim is narrower but important: operators can evaluate local or open-source LLMs in a controlled environment without sending sensitive industrial manuals to remote services. That privacy boundary matters because factory documentation can expose layouts, equipment, procedures, vulnerabilities, and operational dependencies.
FactoryLLM
FactoryLLM is a configurable RAG playground rather than a single chatbot. The configuration layer lets an operator choose an LLM provider, prompting strategy, initial system message, RAG strategy, and number of retrieved sections. The paper names providers including OpenAI, OpenRouter, Google Gemini, and locally hosted models.
The system supports prompting strategies including Input-Output, Chain-of-Thought, Tree-of-Thought, and Graph-of-Thought. It also supports vector and graph retrieval. Uploaded PDF, DOCX, and TXT documents are segmented according to configured chunk size and overlap, then indexed into session-scoped stores such as ChromaDB for vector retrieval or NebulaGraph for graph retrieval.
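The segmentation step can be sketched as a sliding window over tokens. This is a minimal illustration, not FactoryLLM's implementation: the function name `chunk_tokens`, the whitespace tokenizer, and the small sizes in the usage lines are assumptions chosen for readability.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
text = "check the aux sensor port before running safety commissioning on the robot"
chunks = chunk_tokens(text.split(), chunk_size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.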
This architecture is the useful part for governance. The chat session, uploaded files, retrieval index, selected model, prompt technique, generated response, evaluation scores, and feedback are all part of an experimental record. A factory AI tool should not be evaluated only by whether the answer sounds plausible in a demo.
Case Study
The paper's case study uses two tightly coupled systems: an AIV used to transport materials, and Mobile Planner software that assigns missions, controls navigation parameters, and coordinates multiple vehicles. The dataset contains 30 cross-machine maintenance questions derived from about 600 pages of documentation across both manuals.
Every question is designed to require both sources, so none can be answered from one document alone. The authors do not provide hand-written reference answers; instead, they evaluate with automated RAGAS and NVIDIA LLM-as-Judge metrics.
The experimental configuration fixes chunk size at 1,000 tokens with 200-token overlap and retrieves the top-10 passages per query from the combined AIV and Mobile Planner vector index. The three evaluated models are Qwen3-235B-A22B-Instruct-2507, Llama 4 Maverick, and Gemma-3-27B.
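One way to picture this setup is as a fixed retrieval configuration with only the model varied. The sketch below is illustrative: the class `RagRunConfig`, its field names, and `build_runs` are hypothetical and are not taken from the FactoryLLM repository; only the values (1,000-token chunks, 200-token overlap, top-10, the three model names) come from the reported configuration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Model names as reported in the case study.
MODELS: Tuple[str, ...] = (
    "Qwen3-235B-A22B-Instruct-2507",
    "Llama 4 Maverick",
    "Gemma-3-27B",
)


@dataclass(frozen=True)
class RagRunConfig:
    """One evaluation run; field names are hypothetical."""
    model: str
    chunk_size_tokens: int = 1000
    chunk_overlap_tokens: int = 200
    top_k: int = 10
    retrieval: str = "vector"

    @property
    def max_context_tokens(self) -> int:
        # Upper bound on retrieved text per query: top_k chunks of full size.
        return self.top_k * self.chunk_size_tokens


def build_runs(models: Tuple[str, ...] = MODELS) -> List[RagRunConfig]:
    """Hold retrieval settings fixed and vary only the model."""
    return [RagRunConfig(model=name) for name in models]


runs = build_runs()
for run in runs:
    print(run.model, run.max_context_tokens)
```

With these settings each query can carry up to 10,000 tokens of retrieved context, which is worth keeping in mind when reading the precision figures: there is a lot of room for irrelevant text.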
Results
The headline result is mixed in an informative way. Overall averages land between 0.73 and 0.76, which suggests cross-machine RAG is feasible, but retrieval-side metrics are clearly weaker than generation-side metrics: context precision sits at 0.46 to 0.51, while response groundedness sits at 0.88 to 0.95.
The paper makes the bottleneck visible. Context recall is higher, 0.76 to 0.89, which means the needed information is often present somewhere in the retrieved set, but it is diluted by irrelevant chunks. Six questions have mean context precision below 0.20. In other words, the system often retrieves enough of the truth along with too much surrounding noise.
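The difference between recall and precision is easier to see with a toy calculation. The functions below are simplified set-based versions written for illustration; they are not the RAGAS definitions, which rely on LLM judgments rather than hand labels. The chunk IDs and relevance labels are made up.

```python
from typing import Sequence, Set


def context_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Share of relevant chunks that made it into the retrieved set."""
    if not relevant:
        return 0.0
    found = len(set(retrieved) & relevant)
    return found / len(relevant)


# Hypothetical top-10 retrieval for one query.
retrieved = [f"c{i}" for i in range(1, 11)]   # c1 .. c10
relevant = {"c2", "c5", "c7", "c9", "c11"}    # c11 was never retrieved

precision = context_precision(retrieved, relevant)  # 4 of 10 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 4 of 5 relevant were retrieved
print(precision, recall)
```

Here most of the needed passages are present, yet six of the ten retrieved chunks are noise, which is the same qualitative pattern the paper reports.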
The example query asks how to add side-mount lasers to a custom payload. FactoryLLM's answer combines two sources: the RS-232 Aux Sensors connector (a DB9 female port) from the AIV hardware documentation, and the Safety Commissioning procedure in Mobile Planner, found under Main Menu > Robot > Safety Commissioning. Neither manual answers the whole query alone.
The judge split is the warning. NVIDIA context relevance scores 0.86 to 0.90, and response groundedness is high. RAGAS faithfulness is lower at 0.62 to 0.72, with seven questions below 0.50. The paper summarizes the split as groundedness averaging 0.91 versus faithfulness averaging 0.66, with context precision averaging 0.48. For industrial AI, that is not a minor metric dispute. It says an answer can be mostly anchored to retrieved text while still carrying partially unsupported or poorly targeted claims.
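A reviewer could use the split itself as a triage signal. The sketch below assumes per-question groundedness and faithfulness scores are available; the names `QuestionScores` and `flag_split`, the 0.25 gap threshold, and the sample scores are hypothetical. The 0.50 faithfulness floor mirrors the cutoff the paper uses when counting low-faithfulness questions.

```python
from typing import List, NamedTuple


class QuestionScores(NamedTuple):
    question_id: str
    groundedness: float
    faithfulness: float


def flag_split(
    scores: List[QuestionScores],
    faithfulness_floor: float = 0.50,
    max_gap: float = 0.25,
) -> List[str]:
    """Return IDs of questions where the two judges disagree enough to need review."""
    flagged = []
    for s in scores:
        gap = s.groundedness - s.faithfulness
        if s.faithfulness < faithfulness_floor or gap > max_gap:
            flagged.append(s.question_id)
    return flagged


# Made-up scores for three questions.
sample = [
    QuestionScores("q1", groundedness=0.95, faithfulness=0.90),
    QuestionScores("q2", groundedness=0.92, faithfulness=0.45),
    QuestionScores("q3", groundedness=0.90, faithfulness=0.60),
]
needs_review = flag_split(sample)
print(needs_review)
```

The design choice is to treat disagreement between judges as a reason for human review rather than averaging it away.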
Governance Standard
A smart-factory RAG system should ship a maintenance-reasoning receipt. The receipt should include:
- Machines and documents: machine inventory, document titles, document versions, document provenance, and upload date.
- Indexing and retrieval: chunking policy, chunk size, chunk overlap, retrieval method, index type, and top-k value.
- Model and prompt: query, selected model, provider boundary, local-versus-remote execution status, prompting strategy, and system message.
- Evidence and answer: retrieved chunks with ranks and source locations, generated answer, cited evidence, and unsupported claims.
- Evaluation and release: RAGAS scores, NVIDIA judge scores, evaluator model, human reviewer, final action recommendation, and whether the output was allowed to reach operators.
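As a rough sketch, such a receipt could be a small serializable record. The class and field names below are hypothetical and cover only a subset of the items listed; they are not a schema from the paper or the repository, and the example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class SourceDocument:
    title: str
    version: str
    provenance: str
    upload_date: str  # ISO date string


@dataclass
class RetrievedChunk:
    rank: int
    document_title: str
    location: str  # page or section reference
    text: str


@dataclass
class MaintenanceReceipt:
    query: str
    model: str
    runs_locally: bool
    prompting_strategy: str
    retrieval_method: str
    chunk_size: int
    chunk_overlap: int
    top_k: int
    documents: List[SourceDocument]
    retrieved: List[RetrievedChunk]
    answer: str
    unsupported_claims: List[str] = field(default_factory=list)
    scores: Dict[str, float] = field(default_factory=dict)
    human_reviewer: Optional[str] = None
    released_to_operators: bool = False  # default to blocked until reviewed

    def to_json(self) -> str:
        # asdict recurses into the nested dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; nothing here is real data.
receipt = MaintenanceReceipt(
    query="<operator question>",
    model="<model name>",
    runs_locally=True,
    prompting_strategy="Input-Output",
    retrieval_method="vector",
    chunk_size=1000,
    chunk_overlap=200,
    top_k=10,
    documents=[SourceDocument("<manual title>", "<version>", "<source>", "<date>")],
    retrieved=[RetrievedChunk(1, "<manual title>", "<page>", "<chunk text>")],
    answer="<generated answer>",
)
print(receipt.to_json())
```

Defaulting `released_to_operators` to `False` makes release an explicit decision rather than something that happens when a field is forgotten.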
The receipt should keep four objects separate. Retrieval quality is whether the right manual passages surfaced. Generation quality is whether the answer stayed inside those passages. Operational safety is whether the advice should be acted on. Privacy safety is whether sensitive factory documents left the controlled environment. FactoryLLM makes the first two easier to compare, but the last two still require organizational controls.
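Keeping the four objects separate can be made concrete as a gate that refuses to collapse them into a single score. This is a hypothetical sketch: `ReviewStatus` and `release_decision` are illustrative names, and how each flag gets set (automated metrics, human sign-off, deployment checks) is left to the organization.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReviewStatus:
    retrieval_ok: bool    # the right manual passages surfaced
    generation_ok: bool   # the answer stayed inside those passages
    operational_ok: bool  # the advice is safe to act on
    privacy_ok: bool      # documents stayed in the controlled environment


def release_decision(status: ReviewStatus) -> Tuple[bool, List[str]]:
    """Allow release only if every check passed; report which ones failed."""
    checks = {
        "retrieval quality": status.retrieval_ok,
        "generation quality": status.generation_ok,
        "operational safety": status.operational_ok,
        "privacy safety": status.privacy_ok,
    }
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)


# Strong automated scores, but nobody has signed off on operational safety.
allowed, failed = release_decision(
    ReviewStatus(retrieval_ok=True, generation_ok=True,
                 operational_ok=False, privacy_ok=True)
)
print(allowed, failed)
```

Because each check is reported by name, a good retrieval or groundedness score cannot silently stand in for a missing safety review.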
This connects directly to Retrieval-Augmented Generation, AI Evaluations, AI Audits and Assurance, AI Audit Trails, AI Agents, The RAG Document Becomes the Token Bomb, The Evidence RAG Becomes the Peer Review Ledger, The Factory Twin Becomes the Control Room, The AI Factory Becomes Industrial Policy, The Root Cause Becomes the Causal Trace, The Evaluation Bench Becomes the Test Rig, The Grading Cascade Becomes the Evaluation Artifact, The Monitoring Trace Becomes the Interpretive Gap, The Tool Call Becomes the Privacy Boundary, and The Safety Case Becomes the Release Gate. The industrial question is not whether RAG can answer a maintenance query. It is whether the evidence record is strong enough to decide what may happen next.
Limits
The biggest limitation is that the benchmark has no ground-truth reference answers and no human expert evaluation in the reported setup. Automated RAGAS and NVIDIA LLM-as-Judge metrics are useful for repeatable comparison, but they are not a maintenance authority. The paper explicitly lists human expert evaluation as future work.
The case study is also narrow: one AIV and one Mobile Planner software stack, 30 questions, and about 600 pages of documentation. That is a useful starting point, not proof that the same retrieval setup will work across other machines, languages, vendors, failure modes, or safety-critical procedures.
Finally, groundedness is not the same as correctness in the physical world. A grounded answer may still omit lockout procedures, escalation rules, warranty constraints, firmware mismatches, local modifications, or site-specific safety policies. FactoryLLM should be read as a reproducible evaluation playground for industrial RAG, not as a turnkey autonomous maintenance system.
Sources
- Yash Pulse, Yong-Bin Kang, Abhik Banerjee, Abdur Forkan, and Prem Prakash Jayaraman, FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories, arXiv:2606.14119 [cs.AI], submitted June 12, 2026.
- arXiv HTML: FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories, reviewed for abstract, architecture, case-study setup, evaluation metrics, model list, quantitative results, conclusion, and future-work limits.
- arXiv PDF: FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories.
- Official code: DigitalInnovationLab/Factory-LLM, reviewed for repository status, README claims, implementation structure, evaluation pipeline, supported model list, and MIT license badge.
- Related pages: Retrieval-Augmented Generation, AI Evaluations, AI Audits and Assurance, AI Audit Trails, AI Agents, The RAG Document Becomes the Token Bomb, The Evidence RAG Becomes the Peer Review Ledger, The Factory Twin Becomes the Control Room, The AI Factory Becomes Industrial Policy, The Root Cause Becomes the Causal Trace, The Evaluation Bench Becomes the Test Rig, The Grading Cascade Becomes the Evaluation Artifact, The Monitoring Trace Becomes the Interpretive Gap, The Tool Call Becomes the Privacy Boundary, and The Safety Case Becomes the Release Gate.