The Scholar Graph Becomes the Agent Memory Layer
Agents-K1 is not only a retrieval paper. It is a claim about what research agents need before they can reason over science: a structured, evidence-bearing memory layer.
The useful move is turning paper claims, methods, evidence, figures, tables, equations, and citation intent into graph objects. The risk is that a clean graph object can make an extraction error look authoritative.
The Paper
The paper is Agents-K1: Towards Agent-native Knowledge Orchestration, arXiv:2606.13669 [cs.AI]. arXiv lists version 1 as submitted on June 11, 2026 and version 2 as revised on June 29, 2026, with DOI 10.48550/arXiv.2606.13669. The arXiv HTML page lists the license notice as CC BY 4.0.
The author list is Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Shengji Tang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou, Liang He, Bo Zhang, and Lei Bai. The affiliations are Shanghai Artificial Intelligence Laboratory, East China Normal University, and Fudan University.
The paper starts from a simple diagnosis: research agents have improved their planning, tool use, and multi-agent orchestration, but their knowledge substrate is still thin. Many systems retrieve abstracts, chunks, surface mentions, or flat citation edges. That leaves claims, evidence, mechanisms, method lineages, figure evidence, table evidence, and citation intent buried inside full papers.
Knowledge Orchestration
Agents-K1 names the missing layer "knowledge orchestration." Agent orchestration decides how an agent plans and acts. Knowledge orchestration decides what the agent can know, how that knowledge is structured, and how a final answer can be traced back to evidence.
The system has three layers. The KG layer parses full scientific papers and builds multimodal knowledge graphs. The LLM layer provides a 4B information-extraction backbone trained with Group Relative Policy Optimization, or GRPO. The CLI layer exposes the graph through GraphAnything, a tri-source interface that combines web search, multimodal graph retrieval, and cross-document graph traversal.
The scale claim is substantial. The authors process 2.46 million scientific papers across six disciplines: computer science, chemistry, biology, earth science, physics, and materials. The resulting Scholar-KG is released in a one-million-paper subset, while the full Scholar-KG is linked through the SCP endpoint.
The important conceptual distinction is that this is not just "better RAG." Chunk retrieval asks which passages resemble the query. Agents-K1 asks which paper nodes, concept nodes, evidence nodes, citation roles, figures, tables, equations, and method-lineage edges should enter the agent's working context.
Five Modules
The paper's graph schema has five modules. Module A stores meta and factual entities: paper metadata, authors, affiliations, process times, licensing, repositories, model artifacts, dataset releases, evidence locations, and confidence scores. This is the administrative spine.
Module B stores textually mentioned entities such as tasks, methods, datasets, metrics, baselines, implementation details, theorems, definitions, figures, tables, equations, and domain terms. These are the named scientific objects that an agent needs to compare across papers.
Module C stores implicit and abstracted entities: problem definitions, motivations, contributions, hypotheses, findings, explanations, limitations, rationales, future work, and error analysis. This is the layer where the graph stops being a bibliography and starts becoming a map of claims.
Module D classifies citation relationships. A citation can support, contrast, extend, supply background, or ground a method. This matters because a flat "cites" edge cannot tell an agent whether a cited paper is a baseline, an objection, a dataset source, or a methodological ancestor.
Module E stores durable knowledge relations between entities. Controlled relations bind already-normalized entities, while open relations allow new concepts when evidence supports them. The authors explicitly exclude volatile details such as numerical results or hyperparameters from this durable relation layer, because those belong elsewhere and can otherwise turn the graph into noise.
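The five modules can be pictured as a small set of typed records. The sketch below is illustrative only: the class names, field names, and the example IDs are assumptions made for this article, not the schema shipped with Scholar-KG or GraphAnything.

```python
from dataclasses import dataclass, field
from enum import Enum


class CitationIntent(Enum):
    # Module D: the five citation roles described above.
    SUPPORT = "support"
    CONTRAST = "contrast"
    EXTEND = "extend"
    BACKGROUND = "background"
    METHOD_GROUNDING = "method_grounding"


@dataclass
class Entity:
    # Modules A, B, and C share one record shape in this sketch.
    entity_id: str
    module: str          # "A" (meta), "B" (mentioned), or "C" (abstracted)
    kind: str            # e.g. "method", "dataset", "limitation"
    text: str
    evidence_span: str   # hypothetical locator format, e.g. "sec3:p2"
    confidence: float


@dataclass
class CitationEdge:
    citing_paper: str
    cited_paper: str
    intent: CitationIntent


@dataclass
class Relation:
    # Module E: durable relations only, so no numerical results
    # or hyperparameters are stored here.
    head: str
    relation: str
    tail: str
    controlled: bool     # True if drawn from a fixed relation vocabulary
    evidence_span: str


@dataclass
class PaperGraph:
    paper_id: str
    entities: list = field(default_factory=list)
    citations: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def entities_in(self, module: str) -> list:
        """Return all entities that belong to one module."""
        return [e for e in self.entities if e.module == module]


# Hypothetical example data.
graph = PaperGraph(paper_id="paper-001")
graph.entities.append(Entity("e1", "B", "method", "GRPO", "sec3:p2", 0.91))
graph.entities.append(
    Entity("e2", "C", "limitation", "weak relation extraction", "sec5:p1", 0.74)
)
graph.citations.append(
    CitationEdge("paper-001", "paper-000", CitationIntent.EXTEND)
)
graph.relations.append(Relation("e1", "trains", "e2", False, "sec3:p2"))
```

Keeping the citation intent as an enumerated type rather than free text is what lets an agent filter for, say, only contrasting citations without re-reading the citing sentence.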
Extraction Backbone
The extraction backbone is a 4B model initialized from Qwen3-4B-Instruct-2507 and trained on IEPile for English named-entity and relation extraction. The reward combines format compliance, JSON validity, and task F1. The model card says it is specialized for structured extraction, not general chat.
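A reward that combines format compliance, JSON validity, and task F1 might look roughly like the following. This is a minimal sketch under stated assumptions: the weights, the expected output shape (an "entities" list of objects with "text" and "type"), and the function names are all hypothetical, and the paper's actual reward configuration may differ.

```python
import json


def set_f1(predicted: set, gold: set) -> float:
    """F1 over exact-match (text, type) pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def extraction_reward(output: str, gold: set,
                      w_json: float = 0.2,
                      w_format: float = 0.1,
                      w_f1: float = 0.7) -> float:
    """Combine the three reward terms; the weights are assumed values."""
    # Term 1: JSON validity.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    reward = w_json

    # Term 2: format compliance with the assumed output schema.
    entities = parsed.get("entities") if isinstance(parsed, dict) else None
    format_ok = isinstance(entities, list) and all(
        isinstance(e, dict)
        and isinstance(e.get("text"), str)
        and isinstance(e.get("type"), str)
        for e in entities
    )
    if not format_ok:
        return reward
    reward += w_format

    # Term 3: task F1 against the gold annotations.
    predicted = {(e["text"], e["type"]) for e in entities}
    return reward + w_f1 * set_f1(predicted, gold)


gold = {("GRPO", "method"), ("IEPile", "dataset")}
perfect = json.dumps({"entities": [
    {"text": "GRPO", "type": "method"},
    {"text": "IEPile", "type": "dataset"},
]})
partial = json.dumps({"entities": [{"text": "GRPO", "type": "method"}]})
```

Gating the F1 term behind the validity and format checks means a malformed output never earns task credit, which is one plausible reason to combine the three terms in the first place.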
In the paper's ten-benchmark information-extraction evaluation, the trained 4B model improves over its own base on every dataset. Average F1 rises from 0.5316 for Qwen3-4B to 0.5647 for Agents-K1. It also beats the Qwen3-8B base average of 0.5382, despite using half the parameter count.
The comparison against the larger model is mixed. Qwen3-32B averages 0.5746, so the trained 4B model is close overall but still behind. It exceeds the 32B base on held-out NER and in-distribution NER, but relation extraction remains the major gap: the relation-extraction average is 0.2226 for Agents-K1 versus 0.3127 for Qwen3-32B.
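The gaps are easier to compare once they are written as differences. The snippet below only restates the averages quoted above; the variable names are ours.

```python
# Average F1 values as quoted in the text.
avg_f1 = {
    "Qwen3-4B": 0.5316,
    "Qwen3-8B": 0.5382,
    "Qwen3-32B": 0.5746,
    "Agents-K1": 0.5647,
}

# Relation-extraction averages as quoted in the text.
relation_f1 = {"Agents-K1": 0.2226, "Qwen3-32B": 0.3127}

# Difference between Agents-K1 and each base model (positive = Agents-K1 ahead).
delta_vs = {
    name: round(avg_f1["Agents-K1"] - score, 4)
    for name, score in avg_f1.items()
    if name != "Agents-K1"
}

relation_gap = round(relation_f1["Agents-K1"] - relation_f1["Qwen3-32B"], 4)

print(delta_vs)      # {'Qwen3-4B': 0.0331, 'Qwen3-8B': 0.0265, 'Qwen3-32B': -0.0099}
print(relation_gap)  # -0.0901
```

The overall deficit against Qwen3-32B is about one F1 point, while the relation-extraction deficit is about nine, which is why the relation gap deserves separate attention.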
That gap matters. Research-agent memory is only as good as the extracted edges. A weak relation extractor can still create a persuasive graph, but the graph may connect the wrong entities or miss the relation that would have changed the answer.
Agent Interface
The agent-facing piece is GraphAnything. The paper describes a tri-source retrieval mechanism: web search for recency and coverage, multimodal graph retrieval for figures, tables, equations, and semantic anchors, and knowledge-network traversal for cross-document lineage and comparison.
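The source text does not say how the three retrieval sources are merged. As one illustration, the sketch below uses weighted reciprocal rank fusion, a common generic technique that is swapped in here as an assumption and is not claimed to be GraphAnything's method. The source names, node IDs, and weights are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists: dict, weights: dict, k: int = 60) -> list:
    """Fuse ranked ID lists with weighted reciprocal rank fusion.

    ranked_lists maps a source name to an ordered list of item IDs,
    best first. Returns (item_id, score) pairs, highest score first.
    """
    scores = defaultdict(float)
    for source, item_ids in ranked_lists.items():
        weight = weights.get(source, 1.0)
        for rank, item_id in enumerate(item_ids, start=1):
            scores[item_id] += weight / (k + rank)
    # Sort by descending score, then by ID for a deterministic order.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical results from the three sources described above.
results = {
    "web_search": ["p3", "p1"],
    "graph_retrieval": ["p1", "p2"],
    "network_traversal": ["p1", "p4"],
}
weights = {"web_search": 1.0, "graph_retrieval": 1.0, "network_traversal": 1.0}

fused = weighted_rrf(results, weights)
```

In this example `p1` ranks first because all three sources return it, which is the behavior a tri-source design is presumably after: agreement across sources outweighs a single high rank.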
GraphAnything exposes deterministic graph primitives to agents through a Python API, CLI, and Model Context Protocol server. The public README describes 10 schema presets, 8 extractors, 9 render formats, 17 MCP tools, and 19 CLI subcommands. Rule-based extractors and read-only graph operations do not require an external LLM; LLM-gated commands can point at any OpenAI-compatible chat-completions endpoint.
The paper's agent layer also describes worker roles: Coordinator, CodeWikiWorker, SurveyWorker, IdeaWorker, PrototypeWorker, and Aggregator. The important governance property is not the role names. It is that the aggregator writes manifests with job IDs, status, artifact paths, evidence IDs, and failure messages. A failed worker is supposed to be inspectable and rerunnable, not hidden behind a polished final summary.
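A manifest with the fields listed above could be as simple as the following. This is a sketch for illustration: the record layout, status strings, paths, and IDs are assumptions, not the format the Aggregator actually writes.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class JobRecord:
    job_id: str
    worker: str                          # e.g. "SurveyWorker"
    status: str                          # assumed values: "succeeded", "failed"
    artifact_paths: list = field(default_factory=list)
    evidence_ids: list = field(default_factory=list)
    failure_message: Optional[str] = None


def build_job_manifest(records: list) -> dict:
    """Collect job records and surface failures instead of hiding them."""
    return {
        "jobs": [asdict(record) for record in records],
        "failed_job_ids": [r.job_id for r in records if r.status == "failed"],
    }


# Hypothetical run with one success and one failure.
records = [
    JobRecord("job-1", "SurveyWorker", "succeeded",
              artifact_paths=["out/survey.md"], evidence_ids=["ev-12", "ev-40"]),
    JobRecord("job-2", "PrototypeWorker", "failed",
              failure_message="dependency install timed out"),
]

manifest = build_job_manifest(records)
manifest_json = json.dumps(manifest, indent=2)
```

Listing failed job IDs at the top level makes a rerun a lookup rather than a search through logs.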
This makes Agents-K1 adjacent to agent observability. The graph is not only an index. It becomes a memory layer, a tool surface, a provenance system, and a partial audit log for research work.
Results
Across six scientific domains, the paper reports average F1 values from 79.07% to 87.11% for graph-construction evaluation. Computer science and earth science are highest at 87.11% and 86.62%. Physics is lowest at 79.07%, with explicit-entity recall at 68.36%. Module E has the widest spread, from 70.54% F1 in physics to 89.33% in computer science.
The geoscience QA evaluation builds a graph from 114 surveys or reviews published since 2025 and 7,219 unique cited papers, producing 602,132 nodes and 609,812 edges. On knowledgeable questions, GPT-5.2 with Agents-K1 improves rationale accuracy from 54.2% to 65.8% and answer accuracy from 68.0% to 75.0%. Gemini-3 with Agents-K1 reaches 67.5% rationale and 77.9% answer accuracy.
The research-question setting is harder and shows the value of structure more clearly. GPT-5.2 with Agents-K1 improves from 41.8% to 66.3% rationale accuracy and from 58.8% to 69.7% answer accuracy. Gemini-3 improves from 52.3% to 69.5% rationale accuracy and from 61.0% to 71.5% answer accuracy.
On FrontierScience-Research, Agents-K1 lifts Gemini-3 overall accuracy from 7.9% to 24.6% and GPT-5.2 from 25.2% to 39.4%. The biggest reported single-discipline jump is GPT-5.2 in physics, from 9.0% to 46.7%.
On open multi-hop QA benchmarks, the reported "Ours" method reaches 63.50 containment and 67.80 GPT-judge accuracy on HotpotQA, 67.10 and 64.80 on 2WikiMultiHopQA, and 31.10 and 36.20 on MuSiQue. The paper compares against vanilla LLMs, top-k retrieval, and graph methods including KGP, G-retriever, RAPTOR, E2GraphRAG, LightRAG, HippoRAG, GFM-RAG, and HippoRAG2.
The measurement caveat is visible in the protocol. Several evaluations use LLM-as-judge scoring, with DeepSeek-V3, GPT-5.2, and GPT-4o-mini serving as judges in different places. That is not automatically invalid, but it means the judge model, rubric, prompt, sampled evidence, and failure audit have to be reported as part of the benchmark receipt.
Artifacts
The paper links four artifacts from its front matter: an SCP page, the GraphAnything code repository, the Scholar-KG dataset page, and the Agents-K1 model page. The GraphAnything repository is public at InternScience/GraphAnything; GitHub metadata and the LICENSE file identify it as MIT licensed.
The model is hosted at InternScience/Agents-K1-LLM, which currently resolves to the Agents-K1 model card. The card identifies the model as Apache-2.0 licensed, based on Qwen3-4B-Instruct-2507, and intended for structured information extraction rather than chat.
The dataset page is InternScience/Scholar-kg. The Hugging Face API reports the dataset repository and basic metadata, but direct raw README access returned a gated-repository 401 in this environment. That means the page can be cited as the release endpoint, but downstream users still need to verify access terms, file contents, and license before treating the one-million-paper subset as a reusable public corpus.
The SCP endpoint at scphub.intern-ai.org.cn/detail/42 loads as a JavaScript application here. It is useful as the paper-linked access surface for the full Scholar-KG, but it is weaker as an archival citation than a versioned static manifest with checksums.
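For comparison, a versioned static manifest with checksums needs very little machinery. The sketch below uses Python's standard `hashlib` and works on in-memory bytes so that it stays self-contained; the file names, version string, and manifest layout are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_release_manifest(version: str, files: dict) -> dict:
    """Record one checksum per file for a named release version."""
    return {
        "version": version,
        "files": {name: sha256_hex(data) for name, data in sorted(files.items())},
    }


def verify_release(manifest: dict, files: dict) -> list:
    """Return the names of files that are missing or whose checksum differs."""
    return [
        name
        for name, expected in manifest["files"].items()
        if name not in files or sha256_hex(files[name]) != expected
    ]


# Hypothetical release contents.
release = {
    "nodes.jsonl": b'{"id": "n1"}\n',
    "edges.jsonl": b'{"src": "n1", "dst": "n2"}\n',
}
release_manifest = build_release_manifest("v1", release)

# Simulate a silently modified download.
tampered = dict(release)
tampered["edges.jsonl"] = b'{"src": "n1", "dst": "n3"}\n'
```

With a manifest like this published alongside the data, a citation can name a version and a reader can confirm they hold the same bytes.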
Graph Receipt
An agent-native scholar graph should ship a graph receipt. The receipt should name the paper version, source corpus, inclusion rule, parser version, OCR or PDF parser, schema version, module definitions, extraction model, model checkpoint hash, reward configuration, prompt templates, entity-linking rules, evidence-span format, confidence calibration, graph version, update time, deduplication rule, identifier namespace, and license fields.
For every retrieval run, the receipt should also name the query, agent role, tool interface, web source, graph source, network-traversal source, fusion weights, retrieved node IDs, evidence spans, ranker, truncation budget, final context, generated answer, judge model if any, and failure messages. Without that record, "grounded in the graph" becomes a slogan rather than a reproducible claim.
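A receipt is only useful if incomplete ones are rejected. The sketch below checks a receipt against a required-field list drawn from the two paragraphs above. The snake_case field names, the chosen subset of fields, and the example values are assumptions for illustration.

```python
# Subset of the build-time fields named above, in assumed snake_case form.
BUILD_FIELDS = (
    "paper_version", "source_corpus", "parser_version", "schema_version",
    "extraction_model", "model_checkpoint_hash", "graph_version", "license",
)

# Subset of the per-run fields named above.
RUN_FIELDS = (
    "query", "agent_role", "tool_interface", "retrieved_node_ids",
    "evidence_spans", "final_context", "generated_answer",
)


def missing_fields(receipt: dict, required: tuple) -> list:
    """Return required fields that are absent or empty in the receipt."""
    return [name for name in required if receipt.get(name) in (None, "", [])]


# Hypothetical run receipt with no evidence spans recorded.
run_receipt = {
    "query": "Which methods extend GRPO?",
    "agent_role": "SurveyWorker",
    "tool_interface": "mcp",
    "retrieved_node_ids": ["n1", "n7"],
    "evidence_spans": [],
    "final_context": "context text",
    "generated_answer": "answer text",
}

problems = missing_fields(run_receipt, RUN_FIELDS)
print(problems)  # ['evidence_spans']
```

An answer whose receipt has no evidence spans can then be flagged as ungrounded before it reaches a reader.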
This connects directly to AI Agents, Retrieval-Augmented Generation, AI Data Provenance, AI Audit Trails, Model Context Protocol, The Agent Wiki Becomes the Retrieval Spine, The Agent Knowledge Base Becomes the Commons, and The Vector Database Becomes Institutional Memory.
Limits
The central risk is authority laundering. A generated graph node can look cleaner and more stable than the messy paragraph it came from. If the extractor misreads a claim, merges aliases incorrectly, drops a limitation, or invents a relation, the graph may give the error an institutional form.
The second risk is benchmark circularity. If the same kind of LLM that extracts the graph also judges the graph-enhanced answer, evaluation needs stronger controls: human spot checks, adversarial samples, evidence-span review, judge disagreement audits, and baseline parity on retrieved context.
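A judge disagreement audit can start with two numbers: how often two judges disagree, and whether their agreement exceeds chance. The sketch below computes Cohen's kappa for two judges giving binary correct/incorrect verdicts. The verdict lists are made-up examples, not data from the paper.

```python
def cohens_kappa(judge_a: list, judge_b: list) -> float:
    """Cohen's kappa for two judges giving binary (True/False) verdicts."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    rate_a = sum(judge_a) / n
    rate_b = sum(judge_b) / n
    # Agreement expected by chance, given each judge's rate of "correct".
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # Both judges gave the same constant verdict.
    return (observed - expected) / (1 - expected)


def disagreement_indices(judge_a: list, judge_b: list) -> list:
    """Indices of items where the two judges differ, for human review."""
    return [i for i, (a, b) in enumerate(zip(judge_a, judge_b)) if a != b]


# Hypothetical verdicts from two different judge models on four answers.
judge_a = [True, True, False, False]
judge_b = [True, False, False, False]

kappa = cohens_kappa(judge_a, judge_b)
to_review = disagreement_indices(judge_a, judge_b)
print(kappa, to_review)  # 0.5 [1]
```

The disagreement indices are the natural queue for human spot checks, since those are the items where at least one judge must be wrong.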
The third risk is corpus governance. Scientific papers carry copyright, licensing, version, withdrawal, correction, and provenance constraints. A million-paper graph is not only a technical object; it is a permissions surface. The dataset access gate I observed is not a problem by itself, but it means public claims about release need to be paired with concrete reuse terms.
The right conclusion is therefore narrow. Agents-K1 shows that research agents need a memory layer organized around evidence, provenance, and graph operations. It does not make the graph true. It makes the graph inspectable enough that truth claims can be checked.
Sources
- Zongsheng Cao et al., Agents-K1: Towards Agent-native Knowledge Orchestration, arXiv:2606.13669 [cs.AI], submitted June 11, 2026; revised June 29, 2026.
- arXiv HTML: Agents-K1: Towards Agent-native Knowledge Orchestration, reviewed for authorship, affiliations, abstract, linked artifacts, framework, schema, experiments, appendix, and license notice.
- arXiv PDF: Agents-K1: Towards Agent-native Knowledge Orchestration, reviewed for tables, quantitative results, geoscience QA setup, FrontierScience-Research results, open benchmark comparisons, and information-extraction evaluation.
- arXiv TeX source: e-print source for arXiv:2606.13669, reviewed for artifact URLs, schema descriptions, source-level table values, agent CLI wording, and conclusion.
- Code repository: InternScience/GraphAnything, reviewed for CLI/MCP features, schema presets, extractors, versioning, federation, quality evaluation, environment variables, and repository metadata.
- GraphAnything README raw source: README.md, reviewed for installation, CLI, MCP tools, extractors, versioning, quality evaluation, and license statement.
- GraphAnything license: MIT License, reviewed for code license status.
- Model artifact: InternScience/Agents-K1-LLM, reviewed for model card metadata, intended use, training data, reward design, evaluation table, limitations, and Apache-2.0 license statement.
- Dataset artifact: InternScience/Scholar-kg, reviewed for Hugging Face dataset metadata and access boundary; direct raw README access returned gated-repository authentication requirements in this environment.
- SCP endpoint: SCP detail page for Agents-K1, reviewed as the paper-linked access surface for the full Scholar-KG.
- Related pages: AI Agents, Retrieval-Augmented Generation, AI Data Provenance, AI Audit Trails, Model Context Protocol, The Agent Wiki Becomes the Retrieval Spine, The Agent Knowledge Base Becomes the Commons, and The Vector Database Becomes Institutional Memory.