Blog · Analysis · Last reviewed June 24, 2026

The Paper Mill Becomes the Literature

Scientific knowledge is becoming a machine-readable substrate for search, medicine, policy, and AI training. Paper mills and hallucinated citations do not merely add bad papers to that substrate. They test whether institutions can still tell the difference between evidence, evidence-shaped output, and correction records that machines can actually read.

The governance problem is not only fraudulent articles. It is whether the literature record can preserve claim, evidence, provenance, correction status, and downstream use as separate objects.

The Polluted Record

The scientific literature used to look like a slow archive. Papers entered journals, indexes, libraries, citation graphs, review articles, clinical guidelines, grant proposals, textbooks, and eventually the background memory of a field. Errors mattered, but they moved through a human-speed system.

That archive is now also an input layer for machines. Search engines summarize it. Retrieval systems quote it. Clinical and legal workflows cite it. AI assistants use it to answer questions. Foundation models absorb parts of it during training. Research agents may mine it for hypotheses, experimental plans, and automated reviews. The literature is not only read by scientists. It is parsed by infrastructure.

That is why paper mills are not a narrow publishing scandal. A fake or unreliable paper can become a node in a knowledge graph, a citation in a review, a chunk in a retrieval database, a training example for a model, or a false premise in an automated research workflow. The damage is not only that a journal published something weak. The damage is that weak evidence can become machine-actionable memory before correction metadata catches up.

Nature reported that retractions of research articles passed 10,000 in 2023, a record driven in large part by efforts to clean up sham papers and peer-review fraud. That number should not be read simply as a sign of decline, because retractions also mean correction is happening. But the scale reveals a structural problem: the record is being repaired after contamination, not protected before it.

For this essay, the unit of risk is the literature record: article text, figures, data statements, authorship claims, peer-review path, DOI, references, index entries, citation graph, correction status, and machine-readable metadata. A paper mill can contaminate any part of that record. A hallucinated citation can counterfeit one of its edges. A retraction that cannot travel leaves a bad node active in machines even after humans have corrected it.

The practical standard is a research integrity packet: enough linked evidence to reconstruct what the article claimed, who stood behind it, what checks occurred, what source objects supported it, whether it has been corrected, and which downstream systems ingested it. Without that packet, a paper can look legitimate in a database while becoming unanswerable in an investigation.
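One way to make the packet concrete is as a small record type whose empty fields show what an investigation could not reconstruct. The sketch below is a hypothetical illustration, not a published schema: the class name `IntegrityPacket`, its field names, the helper `unanswerable_questions`, and the placeholder DOI are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IntegrityPacket:
    """Hypothetical research integrity packet; field names are illustrative."""
    doi: str
    claims: List[str] = field(default_factory=list)                # what the article claimed
    accountable_authors: List[str] = field(default_factory=list)   # who stood behind it
    checks_performed: List[str] = field(default_factory=list)      # what checks occurred
    source_objects: List[str] = field(default_factory=list)        # data, images, code
    correction_status: Optional[str] = None                        # e.g. "active", "retracted"
    downstream_ingestors: List[str] = field(default_factory=list)  # indexes, retrieval, models


def unanswerable_questions(packet: IntegrityPacket) -> List[str]:
    """Return the names of packet fields that hold no evidence at all."""
    gaps = []
    for name, value in vars(packet).items():
        if value is None or value == [] or value == "":
            gaps.append(name)
    return gaps


# Placeholder DOI and claim, invented for the example.
packet = IntegrityPacket(doi="10.1234/example", claims=["Compound X reduces marker Y"])
print(unanswerable_questions(packet))
```

A packet like this one, with a DOI and a claim but nothing else, is the database-legitimate but investigation-unanswerable case: every remaining field comes back as a gap.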

Current Context

As of June 24, 2026, the sharpest new evidence is quantitative rather than anecdotal. A 2026 BMJ study trained a model on retracted paper-mill publications and screened 2,647,471 PubMed-indexed original cancer-research articles published from 1999 through 2024, flagging 261,245 of them, or 9.87 percent, as textually similar to known paper-mill work. That is not a misconduct count. It is a triage signal at uncomfortable scale.

A 2026 arXiv preprint auditing hallucinated citations points in the same direction from another angle. It treated citations as checkable objects, audited 111 million references across 2.5 million papers, and estimated 146,932 non-existent citations in 2025. Together, the BMJ and arXiv studies make the current governance point: automated screening is becoming necessary because manual review cannot inspect the whole record, but a screen is still not a verdict.

Recent work also pushes the problem beyond journal articles. A 2026 arXiv preprint on conference proceedings compared more than 4,000 social-media publication offers with IEEE conference proceedings and identified 1,720 papers across 286 proceedings as matches. That finding is not the same kind of evidence as a completed publisher investigation, but it usefully widens the control point: paper mills can target any publication channel that turns manuscript form into career credit.

The institutional answer is now distributed across authors, repositories, publishers, indexes, and model builders. ICMJE says humans remain responsible for AI-assisted material, attribution, and citations. arXiv's code of conduct makes authorship responsibility and possible suspension part of platform governance, and Nature reported in May 2026 that arXiv was applying one-year bans for submissions with unchecked AI-generated errors such as hallucinated references. Crossref's public Retraction Watch metadata and NISO's CREC recommended practice (Communication of Retractions, Removals, and Expressions of Concern) address the other side: corrections must propagate in machine-readable form after publication.

This means "paper mill" is too narrow as a policy category if it only means a fraudulent manuscript vendor. The operational problem is contaminated scholarly infrastructure: fake or weak papers, bought authorship, fabricated review, generated figures, nonexistent citations, stale correction metadata, and AI systems that can retrieve or train on all of it. Governance has to follow the record from submission to indexing to retrieval to model use.

The Paper-Mill Economy

A paper mill is not just a sloppy author. It is an organized service or network that manufactures, brokers, or launders academic-looking manuscripts, authorship slots, data, images, peer-review identities, or submission packages. The relevant feature is systematic manipulation of the publishing process, not mere low quality. The product is not knowledge. The product is a credential object that can pass through enough institutional checkpoints to count.

That distinction matters for fairness. Error, weak method, plagiarism, authorship dispute, generated prose, and paper-mill manipulation can overlap, but they are not the same allegation. A governance system that treats every formulaic article, every non-native English manuscript, or every detector flag as "paper mill" will create its own integrity failure.

The incentives are familiar. Universities, hospitals, promotion committees, grant systems, and national evaluation regimes often reward publication volume, journal placement, and citation count. Publishers may rely on article-processing charges. Editors and reviewers face overloaded queues. Indexing systems can lag behind abuse. Paper mills exploit the gap between symbolic evidence and verified evidence.

The Hindawi crisis made the economics visible. Retraction Watch reported that Hindawi retracted more than 8,000 articles in 2023. Wiley's official second-quarter fiscal 2024 release attributed a research-revenue decline partly to the Hindawi publishing pause, naming an $18 million impact for that quarter; Wiley's fiscal 2024 results later said full-year research revenue was down mainly because of the full-year Hindawi impact. The lesson is not that one publisher was uniquely vulnerable. The lesson is that high-throughput scholarly publishing can become an attack surface.

Paper mills are also a governance problem because they turn fraud into logistics. A single fabricated result is a misconduct case. A coordinated stream of fabricated manuscripts is an institutional adversary. It requires detection, information sharing, correction workflows, sanctions, and changes to the incentives that made the market profitable.

AI Changes the Cost Curve

Generative AI does not invent paper mills. It makes parts of the operation cheaper, faster, and harder to see.

Language models can draft plausible introductions, abstracts, literature reviews, cover letters, reviewer responses, and method-sounding prose. Image generators can produce figures. Citation tools can create references that look formatted. Translation and paraphrase systems can smooth repeated templates. None of this proves a particular paper is fraudulent; legitimate researchers also use assistive tools. The risk is that the marginal cost of producing evidence-shaped text falls toward zero while the cost of careful review stays human.

Publication-ethics guidance already gives the clean rule: AI can be disclosed as a tool, but it cannot be the accountable author or the primary authority for a scientific claim. Human authors still own the accuracy, attribution, originality, and citation trail of the submitted work.

Disclosure is a floor, not a shield. "AI-assisted" does not answer whether the cited paper exists, whether the figure is evidence, whether the dataset was observed, whether the author can defend the methods, or whether the journal checked the relevant objects. The useful compliance unit is not tool use in general. It is the verified claim, figure, dataset, citation, and authorship contribution.

The absurd visible cases are easiest to remember. Frontiers retracted a 2024 article after concerns were raised about AI-generated figures that did not meet editorial and scientific rigor. The episode became a meme because the image errors were blatant. The more important cases will be less visible: a plausible pathway diagram, a synthetic image without obvious artifacts, a fabricated dataset that fits expectations, or a review article whose citations mostly resolve but subtly misstate the field.

The 2026 BMJ screen described earlier shows what detection at scale looks like: it flagged 261,245 of roughly 2.65 million PubMed-indexed cancer-research articles published from 1999 to 2024, or 9.87 percent, as textually similar to known paper-mill publications. The authors and later respondents treated the tool as triage, not ground truth. That distinction matters. A detector can help prioritize scrutiny, but if institutions treat automated suspicion as proof, research integrity becomes another black-box discipline machine.

The safety problem is therefore double-sided. Under-screening lets contaminated work enter medicine, grant review, policy, and AI search and answer engines. Over-screening can punish legitimate authors, especially when the evidence is private, probabilistic, or culturally biased. This is why paper-mill screening belongs beside AI detector governance, notice and appeal, and algorithmic impact assessments.
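The separation between triage and verdict can be built into the data model itself. This is a minimal sketch, assuming a hypothetical similarity score between 0 and 1 and an arbitrary review threshold of 0.8; neither value comes from the BMJ study. The router can only return a queue name, and no code path records a finding of misconduct.

```python
from enum import Enum


class Route(Enum):
    """Where a submission goes next. There is deliberately no 'guilty' member."""
    STANDARD_REVIEW = "standard_review"
    INTEGRITY_REVIEW = "integrity_review"  # human review, with notice and appeal


def route_submission(score: float, threshold: float = 0.8) -> Route:
    """Route on a hypothetical detector score; the result is a lead, not a verdict."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= threshold:
        return Route.INTEGRITY_REVIEW
    return Route.STANDARD_REVIEW


print(route_submission(0.93).value)  # integrity_review
print(route_submission(0.12).value)  # standard_review
```

Keeping the output type this narrow is a design choice: any sanction has to be recorded by a separate, human-owned process rather than inferred from the score.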

Citations as Interface

Citations are the interface between claim and record. They tell the reader where a statement claims to stand. When citations break, the paper does not merely contain a formatting error. It loses part of its accountability surface.

Recent evidence suggests that AI-assisted writing is weakening that surface at scale. The 2026 arXiv preprint cited earlier drew its 111 million references from 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central, and estimated 146,932 hallucinated citations in material published in 2025. Nature separately reported that arXiv would apply sanctions, including bans, to authors who submit work with unchecked AI-generated content such as hallucinated references. arXiv's own code of conduct already makes authorship responsibility and possible suspension part of the platform's governance surface.

The same pattern appears in the earlier analysis of AI-hallucinated legal citations. Courts and journals are different institutions, but the failure has the same shape. A professional document borrows authority from a citation. The citation looks official. The reader, clerk, reviewer, or retrieval system may not verify it. The hallucination enters an institutional workflow wearing the costume of source discipline.

A governance-grade citation is more than a reference string. It should bind a claim to a source that exists, supports the proposition, has a stable identifier where possible, remains in usable status, and can carry correction metadata forward. A DOI or PubMed ID is useful only if the institution also asks what the source says, whether it has been corrected or retracted, and whether the cited sentence overclaims it.
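The checks in that definition can be written down as a small checklist. The sketch below is illustrative only: the record type `CitationCheck`, its fields, the set of "usable" statuses, and the placeholder DOI are assumptions, and the `supports_claim` flag stands in for a judgment that a human or a reviewed process would have to make.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption for this sketch: which status labels count as "usable".
USABLE_STATUSES = {"active", "corrected"}


@dataclass
class CitationCheck:
    reference: str                    # the reference string as printed
    resolves_to_real_source: bool     # does the cited work exist?
    supports_claim: bool              # does it say what the citing sentence says?
    stable_identifier: Optional[str]  # DOI or PubMed ID, where one exists
    status: str                       # e.g. "active", "retracted"


def citation_problems(c: CitationCheck) -> List[str]:
    """List every way the citation falls short of the governance-grade bar."""
    problems = []
    if not c.resolves_to_real_source:
        problems.append("source does not exist")
    if not c.supports_claim:
        problems.append("source does not support the claim")
    if c.stable_identifier is None:
        problems.append("no stable identifier")
    if c.status not in USABLE_STATUSES:
        problems.append("status is " + c.status)
    return problems


# A real, resolvable identifier is not enough if the source was later retracted.
check = CitationCheck("Doe 2021", True, True, "10.1234/example", "retracted")
print(citation_problems(check))  # ['status is retracted']
```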

The danger is not only nonexistent citations. It is citation drift: real identifiers paired with wrong titles, real papers cited for claims they do not support, synthetic reviews that launder weak evidence into consensus language, and retrieval systems that make the first available citation feel like proof. The citation becomes a button the user presses instead of a trail the institution follows.

For model-mediated literature, the citation should also carry status into the pipeline. A literature-review agent, clinical summarizer, grant-screening tool, or RAG system should know whether the cited object is active, corrected, under expression of concern, retracted, superseded, withdrawn, or unavailable. A source trail that loses status at ingestion is not grounding; it is stale authority.
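A minimal sketch of carrying status through retrieval, under stated assumptions: the `Chunk` type, the in-memory `status_index`, the placeholder DOIs, and the choice to block only retracted and withdrawn sources are all hypothetical. A real system would populate the index from correction metadata feeds and choose its own blocking policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class SourceStatus(Enum):
    ACTIVE = "active"
    CORRECTED = "corrected"
    EXPRESSION_OF_CONCERN = "expression_of_concern"
    RETRACTED = "retracted"
    SUPERSEDED = "superseded"
    WITHDRAWN = "withdrawn"
    UNAVAILABLE = "unavailable"


# Assumption for this sketch: these statuses are never quoted as live evidence.
BLOCKED = frozenset({SourceStatus.RETRACTED, SourceStatus.WITHDRAWN})


@dataclass
class Chunk:
    doi: str
    text: str


def attach_status(
    chunks: List[Chunk], status_index: Dict[str, SourceStatus]
) -> List[Tuple[Chunk, SourceStatus]]:
    """Pair each retrieved chunk with its status; unknown DOIs become UNAVAILABLE."""
    return [(c, status_index.get(c.doi, SourceStatus.UNAVAILABLE)) for c in chunks]


def usable_chunks(
    annotated: List[Tuple[Chunk, SourceStatus]]
) -> List[Tuple[Chunk, SourceStatus]]:
    """Drop blocked sources but keep the status attached to everything else."""
    return [(c, s) for c, s in annotated if s not in BLOCKED]


chunks = [Chunk("10.1234/a", "finding A"), Chunk("10.1234/b", "finding B")]
index = {"10.1234/a": SourceStatus.ACTIVE, "10.1234/b": SourceStatus.RETRACTED}
for chunk, status in usable_chunks(attach_status(chunks, index)):
    print(chunk.doi, status.value)  # 10.1234/a active
```

The status stays attached to the chunks that survive, so a downstream summarizer can still show "corrected" or "expression of concern" next to the quotation instead of presenting every surviving source as equally settled.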

Preprint Pressure

Preprints are valuable because speed matters. During emergencies, fast sharing can save time. In mathematics, physics, computer science, biology, and medicine, preprint servers let communities inspect work before journal delay. But the same speed creates a moderation problem when cheap generation meets low submission friction.

The governance problem is not solved by saying "peer review will catch it." Peer review was already strained. Many reviewers are unpaid, overloaded, narrow in expertise, and working from partial information. Preprint moderators cannot reproduce experiments or verify every citation. Journals cannot manually inspect every image, dataset, authorship claim, and reference trail at the scale of modern submission flows.

Nor is the answer to shut down open scientific exchange. A locked literature would protect incumbents, slow correction, and make public knowledge more dependent on private platforms. The harder design problem is preserving openness while adding friction at the points where the record becomes machine-readable authority.

The Cleanup Machine

The response is already becoming infrastructural. United2Act, supported by COPE and STM, released a consensus statement in 2024 focused on education, post-publication corrections, paper-mill research, trust markers, and joint action. The STM Integrity Hub describes itself as a cloud-based environment where publishers can check submitted articles for paper-mill and other research-integrity signals while respecting privacy and competition-law boundaries.

The correction layer is becoming infrastructural too. NISO's 2024 CREC recommended practice focuses on metadata creation, transfer, and display for retractions, removals, and expressions of concern, including machine-reading and automated processes. Crossref now makes the Retraction Watch database publicly accessible through its metadata services, with retractions available through API and CSV routes and updated by Retraction Watch every working day. That is the kind of plumbing a model-mediated literature needs: not just a retraction notice on a publisher page, but a status signal that can travel.

The plumbing still has limits. Crossref's Retraction Watch documentation says the database includes some expressions of concern and corrections, but those categories are not as comprehensive as retractions. A serious correction pipeline should therefore combine publisher update metadata, Crossref relationships, PubMed records where relevant, NISO CREC-style display, repository notices, and local index refreshes rather than treating one field as the whole truth.
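Combining feeds rather than trusting one field can be sketched as a merge that keeps the most severe status any source reports. Everything here is an assumption for illustration: the source labels are plain strings rather than real API fields, and the severity ordering is one possible policy, not a standard.

```python
from typing import Dict, List, Tuple

# Assumed severity ordering; a real pipeline would define and document its own.
SEVERITY = {
    "active": 0,
    "corrected": 1,
    "expression_of_concern": 2,
    "retracted": 3,
}


def merged_status(reports: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return the most severe reported status and the sources that reported it.

    `reports` maps a source label to the status that source currently shows.
    An unrecognized status raises KeyError instead of being silently ignored.
    """
    if not reports:
        raise ValueError("no status reports to merge")
    worst = max(reports.values(), key=lambda status: SEVERITY[status])
    return worst, sorted(src for src, status in reports.items() if status == worst)


reports = {"publisher_page": "active", "crossref": "retracted", "pubmed": "active"}
print(merged_status(reports))  # ('retracted', ['crossref'])
```

Returning the reporting sources alongside the status keeps the disagreement visible, which is useful when one feed has been updated and the others have not.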

This is necessary, but it creates a second-order risk. The cleanup machine can become its own high-control interface: opaque scores, private watchlists, cross-publisher signals, automated suspicion, and uneven appeal. Researchers from less-resourced institutions, non-native English writers, and fields with formulaic language could be harmed if detection systems confuse style with fraud.

So the question is not whether to use machines against machine-assisted fraud. At this scale, some automation is unavoidable. The question is what kind of institutional wrapper surrounds it: evidence standards, human review, bias testing, appeal paths, disclosure, privacy limits, and public correction records.
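Bias testing, one part of that wrapper, starts with a simple measurement: how often the detector flags work that later review found to be clean, broken out by group. The sketch below uses made-up records and hypothetical group labels; it shows only the calculation and makes no claim about any real tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_positive_rates(
    records: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, float]:
    """Compute per-group false-positive rates.

    Each record is (group, flagged_by_detector, problem_confirmed_by_review).
    The rate is flagged-but-clean items divided by all clean items in the group.
    Groups with no clean items are omitted.
    """
    clean = defaultdict(int)
    flagged_clean = defaultdict(int)
    for group, flagged, confirmed in records:
        if not confirmed:
            clean[group] += 1
            if flagged:
                flagged_clean[group] += 1
    return {group: flagged_clean[group] / clean[group] for group in clean}


# Toy data invented for the example.
records = [
    ("group_a", True, False), ("group_a", False, False),
    ("group_a", False, False), ("group_a", False, False),
    ("group_b", True, False), ("group_b", False, False),
    ("group_b", True, True),
]
print(false_positive_rates(records))  # {'group_a': 0.25, 'group_b': 0.5}
```

In this toy data the detector wrongly flags clean work from group_b twice as often as clean work from group_a, which is the kind of gap an audit should surface before scores influence desk rejection or referral.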

A Governance Standard

A serious response should meet thirteen tests.

First, separate triage from judgment. Detection models should route work for review, not pronounce guilt. A paper-mill score is a lead, not a verdict.

Second, verify the evidence objects. Journals and repositories need routine checks for citations, DOIs, image provenance, data availability, cell lines, ethics approvals, author identities, reviewer conflicts, and reused templates. The checkpoint should match the claim.

Third, make corrections machine-readable. Retractions, expressions of concern, linked corrections, and trust signals should propagate through Crossref, PubMed, indexing services, library systems, search engines, retrieval databases, and model-training filters. Standards such as NISO CREC matter because a corrected record that machines cannot see is only half corrected.

Fourth, protect legitimate AI use without normalizing negligence. Translation, editing, coding support, and accessibility aids can help researchers. Disclosure is useful, but it is not a substitute for verification. Fabricated citations, undisclosed generated figures, synthetic data presented as observed data, and unverified claims should remain professional failures no matter which tool produced them.

Fifth, align incentives away from publication volume. Paper mills exist because credential systems buy their product. Hiring, promotion, funding, and national evaluation systems need more weight on data quality, replication, software, peer review, negative results, public-interest work, and correction behavior.

Sixth, audit the detectors. Integrity tools should be tested for false positives across language, geography, discipline, institution type, and article genre before they influence desk rejection, institutional referral, sanctions, or public claims about authors. Otherwise the defense system will reproduce the same status hierarchies that made authors vulnerable to paper-mill markets.

Seventh, treat the literature as critical infrastructure. Scientific publishing is no longer only a professional communication system. It feeds medicine, policy, AI systems, education, search, and public belief. Its integrity deserves infrastructure-level funding and oversight.

Eighth, preserve publication-integrity packets. Consequential submissions should keep enough evidence to reconstruct checks: source files, figures, image-screening results, data-access statements, ethics approvals, author contribution records, reviewer-conflict checks, citation verification, and version history. The goal is auditability without permanent dragnet surveillance of researchers.

Ninth, give accused authors process. A suspicious template, detector score, or cross-publisher signal should not silently blacklist a person or lab. Authors should receive the allegation, the evidence class, the correction path where available, and a way to identify legitimate similarity, translation support, reuse, or institutional error.

Tenth, govern cross-publisher data sharing. Integrity collaboration is necessary, but shared signals should be purpose-limited, privacy-protective, competition-law reviewed, secured, logged, and retained only as long as needed. A cleanup network can become a reputation surveillance network if no one governs secondary use. This is a data minimization problem as much as a publishing problem.

Eleventh, feed correction signals into AI systems. Retrieval databases, answer engines, literature-review tools, and model-training pipelines should ingest retraction, expression-of-concern, and correction metadata. A model that cites a retracted article without status context is not "grounded" in the literature. It is grounded in a stale snapshot of the literature. Training-data exclusions, retrieval filters, and index refreshes should be versioned so later investigations can know which status data the system could see.

Twelfth, publish integrity outcomes. Institutions should report how many submissions were screened, how many were escalated, what evidence classes were used, how many cases were corrected or retracted, and what false-positive reviews found. Public accountability should measure the cleanup machine as well as the fraud machine.

Thirteenth, treat integrity failures as incidents. A journal that discovers a paper-mill cluster, a repository flooded with hallucinated references, or a retrieval product citing retracted work should have an incident reporting path: scope, affected records, user notice where appropriate, downstream metadata updates, model or index rollback, and a post-incident review.
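The eleventh test asks that status data be versioned so a later investigation can tell what a system could see at a given time. One hedged way to sketch that is a snapshot record with a date and a content hash. The function name `snapshot_record`, the record layout, and the placeholder DOIs are assumptions for this example, not an existing standard.

```python
import hashlib
import json
from datetime import date
from typing import Dict


def snapshot_record(status_by_doi: Dict[str, str], as_of: date) -> Dict[str, object]:
    """Summarize the status data a pipeline ingested on a given date.

    Keys are sorted before hashing, so the same data always produces the
    same fingerprint regardless of the order in which it was loaded.
    """
    payload = json.dumps(status_by_doi, sort_keys=True).encode("utf-8")
    return {
        "as_of": as_of.isoformat(),
        "record_count": len(status_by_doi),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


before = snapshot_record({"10.1234/a": "active"}, date(2026, 6, 1))
after = snapshot_record({"10.1234/a": "retracted"}, date(2026, 6, 24))
print(before["sha256"] != after["sha256"])  # True
```

Storing one such record with each index refresh or training run would let an incident review answer a narrow question: did the system have the retraction when it produced the answer, or not yet.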

What This Changes

The paper mill reveals a recursive failure in model-mediated knowledge.

Institutions create incentives to publish. Paper mills create publications that satisfy the incentives. Indexes ingest them. Citation systems connect them. AI systems read them. Researchers then use AI systems to write more papers, find more citations, and summarize more literature. A fake paper is no longer just a lie on a page. It can become part of the environment that teaches future systems what reality sounds like.

This is the same pattern as the training set eating itself, but with institutional prestige attached. Synthetic residue enters the archive, then the archive becomes evidence for another synthetic act. The loop does not need anyone to believe the fake paper deeply. It only needs enough systems to treat the form as usable.

The answer is not nostalgia for a pure human literature. Human science has always had error, fraud, prestige games, exclusion, and sloppy citation. The answer is better source discipline for a world where sources are operational. A citation must be verifiable. A retraction must travel. A detector must be contestable. A model using the literature must know when the literature is disputed.

Model-mediated reality will not be safer than the records it learns to trust. If the scientific record becomes polluted, the pollution will not stay inside journals. It will move into answer engines, clinical tools, grant reviews, classroom summaries, policy memos, and the next layer of automated research. The institution that cannot clean its memory cannot govern its machines.

Source Discipline

The sources here do different jobs. Nature and Retraction Watch document the public scale of retractions and the Hindawi crisis. Frontiers is the primary retraction notice for the AI-figure case. The BMJ and arXiv papers are research evidence and should be read with their stated limits: both produce screening or audit signals, not case-by-case misconduct verdicts. ICMJE, arXiv, United2Act, STM, NISO, and Crossref are governance and infrastructure sources.

Several figures in this essay are public record indicators, not a complete epidemiology of fraud. The 10,000-plus retraction figure is a Nature analysis of 2023 retractions. The Hindawi figures combine Retraction Watch reporting with Wiley financial disclosures. The BMJ cancer-literature estimate is a text-similarity screen over a defined corpus. The arXiv hallucinated-citation estimate is a preprint audit of references across specified repositories. Each should be cited with its method and limits attached.

Paper-mill language is accusatory and should be used carefully. A detector score, geographic concentration, formulaic phrasing, publication venue, or weak English is not by itself proof of fraud. Article-level claims need evidence objects, author process, institutional context, and a fair correction or appeal route. Source discipline here is also harm reduction for legitimate researchers.

Current-source claims on this page were checked on June 24, 2026. Primary or infrastructure sources were preferred for policy, metadata, and retraction-workflow claims; journalism is used for public reporting of retraction scale and visible incidents where the article names the source type.

The article's internal links point to the site's own source-discipline machinery. Those links are not decorative. They define the governance standard this essay is asking scientific institutions to meet.


