Last reviewed June 25, 2026

The Fair Use Ruling Becomes AI Governance

AI copyright cases are not only disputes about books, headnotes, journalism, and images. They are becoming one of the first practical governance regimes for model-mediated knowledge.

A fair-use ruling is a decision about specific works, conduct, evidence, market effects, and procedural posture. It becomes AI governance only when an institution turns that limited holding into records for acquisition, training use, retrieval use, output behavior, licensing, deletion, and audit.

The Courtroom as Policy Room

The first durable rules for generative AI may not arrive as an AI statute. They may arrive as copyright opinions.

That sounds narrower than it is. Copyright cases decide who may copy, store, transform, summarize, train on, retrieve from, and compete with cultural work. In the AI setting, those verbs describe the life cycle of model-mediated knowledge. A book, article, image, song, software repository, legal headnote, or public webpage can become training data, embeddings, benchmark material, retrieval context, answer-engine substrate, synthetic output, or a licensing asset.

The lawsuits therefore do more than allocate money after copying. They define what kind of institution an AI model is allowed to become. Is it a reader? A researcher? A competitor? A database? A derivative market? A private archive? A machine that can learn from culture without joining the social and economic systems that made culture available?

As of June 25, 2026, U.S. courts have not produced one clean rule. The early record is more interesting than that. Thomson Reuters v. Ross treated use of Westlaw headnotes for a competing legal AI product as outside fair use, and the case was later stayed while appellate review was sought on difficult copyright questions. Bartz v. Anthropic treated training on lawfully acquired books as fair use for the named plaintiffs, while separating that from Anthropic's creation of a central library from pirated books. Kadrey v. Meta granted Meta summary judgment on the record before the court, while warning that a better market-harm case could change the analysis. The U.S. Copyright Office's 2025 generative AI training report took a similarly non-absolute position: some training uses may be fair, some may not.

That uncertainty is not a footnote. It is the governance system taking shape.

Current Context

The live picture now has three tracks. The first is U.S. fair-use litigation, where district courts are deciding concrete records rather than writing a universal AI-training statute. The second is settlement administration, where the Anthropic book case moved into a piracy-class process covering works allegedly acquired from LibGen and PiLiMi rather than producing a class-wide ruling that all book training is lawful. The Authors Guild reported in April 2026 that 440,490 of 482,460 eligible works had been claimed in that settlement process, about 91.3 percent, while also stressing that the certified class covered piracy rather than the act of training itself. Settlement materials should be read as evidence of settlement administration; the court's approval, any appeal, payment, and compliance are separate steps and should be named separately.

The third track is documentation law. The EU AI Act requires general-purpose AI model providers to maintain a copyright-compliance policy and publish a sufficiently detailed summary of training content, and the European Commission published a template for those summaries on July 24, 2025. That does not decide U.S. fair use, but it changes the governance baseline: source categories, collection methods, crawler choices, licensed datasets, and rights-reservation signals are becoming regulatory facts, not just material fought over in discovery after a lawsuit begins.

New complaints and surviving claims also keep the field open. The New York Times case against OpenAI and Microsoft survived important motion-to-dismiss challenges in April 2025 while losing other theories. In May 2026, Elsevier, Cengage, Hachette, Macmillan, McGraw Hill, and Scott Turow filed a proposed class action against Meta and Mark Zuckerberg over alleged use of books and journal articles in Llama development. Those are pleadings and procedural orders, not final fair-use answers. They matter because they show that the copyright fight has moved from a single slogan about training to a full supply-chain record: acquisition, copying, distribution, copyright-management information, outputs, market substitution, licensing, deletion, and accounting.

What the Ruling Does Not Decide

A fair-use ruling is not a license, a dataset certificate, a product safety case, or a universal permission slip. It answers a specific question about specific works, facts, evidence, plaintiffs, defendants, and procedural posture. Turning that answer into governance requires naming the layer the ruling actually reached.

It also does not decide privacy law, contract law, consumer-protection law, data-protection duties, antitrust risk, labor transition, school procurement, public-records obligations, or product safety. Copyright can become an effective governance lever because it has courts, damages, injunctions, and owners. That does not make it a complete theory of legitimate AI development.

Five layers should stay separate. The first is acquisition: purchase, license, public-domain status, scrape, user upload, data broker, or piracy. The second is development use: pretraining, fine-tuning, evaluation, safety filtering, retrieval indexing, demo examples, or synthetic-data generation. The third is output behavior: quotation, near-verbatim reproduction, style imitation, source-substituting summary, or answer-engine display. The fourth is market effect: traffic loss, licensing-market impairment, labor displacement, product substitution, or bargaining leverage. The fifth is remedy: damages, deletion, exclusion, audit, notice, future license, or changed product behavior.

That separation is now a practical governance test. If a company cannot separate these layers in its records, it should not collapse them in public claims. A lawful copy can still be used outside the license. A scraped page can be indexed for search, excluded from training, used for retrieval, or retained in a benchmark; those are different uses. A settlement over pirated copies can require deletion without deciding whether all training is infringement. The relevant internal map looks less like a press statement and more like training-data documentation, AI data licensing, training opt-out handling, and answer-engine source discipline.

Doctrine Is Doing Institutional Work

Fair use was built as a flexible doctrine, not as a procurement manual for foundation models. Courts consider the purpose and character of the use, the nature of the work, the amount used, and the effect on actual or potential markets. That flexibility is useful because it lets law adapt to new technologies. It is also dangerous because a few early cases can become operational signals for an entire industry before legislatures, regulators, labor institutions, libraries, schools, and public archives have built their own rules.

Model developers hear these cases as infrastructure guidance. A ruling about lawfully acquired books becomes a signal about dataset acquisition. A ruling about pirated copies becomes a signal about central libraries and deletion duties. A ruling about a weak market-harm record becomes a signal about what plaintiffs must prove. A ruling about a competing legal research product becomes a signal that direct substitution and commercial rivalry matter.

Rights holders hear the same cases differently. They hear a warning that ordinary copyright injury may be hard to prove once a work is converted into model capability rather than reproduced as a clean copy. They also hear an invitation to build licensing markets, audit trails, provenance systems, data registries, opt-out controls, and litigation records that make future market harm more legible.

The result is a strange institutional sequence. First, models are trained. Then lawsuits ask whether the training was allowed. Then companies and publishers negotiate licenses under the shadow of uncertain doctrine. Then those licenses shape which archives are available to future models. The legal answer comes after the technical system, but the next technical system is built around the legal answer.

Three Early Signals

The first signal came from Thomson Reuters v. Ross Intelligence. Ross was building a legal research tool and was accused of using Westlaw headnotes and related editorial material. In February 2025, Judge Stephanos Bibas granted summary judgment for Thomson Reuters on fair use. The opinion treated Ross's use as commercial and competitive, not as a public-interest research use. It also stressed that Ross sought to build a substitute legal research product from the protected material.

That case is not a perfect proxy for frontier language-model training. Ross was not ChatGPT, and Westlaw headnotes are not the whole public web. But the case matters because it shows that "AI training" is not a magic phrase. If the use is aimed at building a competing product from protected expression, fair use can fail.

The second signal came from the U.S. Copyright Office. Its May 2025 Part 3 report on generative AI training rejected a blanket answer. The Office said it was not possible to prejudge every litigation outcome and that some uses of copyrighted works for generative AI training will qualify as fair use while others will not. The report treated purpose, market effect, access to lawful copies, commerciality, outputs, and licensing markets as fact-sensitive questions.

That report is not a court judgment, but it is a policy map. It tells public institutions not to reduce the question to "training is theft" or "training is always transformative." It also shows why a simple consent slogan is hard to implement after the fact: models are trained on mixtures of public material, licensed material, scraped material, pirated material, synthetic material, user material, and data whose provenance may be hard to reconstruct.

The third signal came from the June 2025 Northern District of California book cases. In Bartz v. Anthropic, Judge William Alsup held that training on books that Anthropic had purchased or otherwise lawfully acquired was transformative fair use for the named plaintiffs, but he did not excuse the company's separate creation of a large internal library from pirated sources. Two days later, in Kadrey v. Meta, Judge Vince Chhabria granted Meta summary judgment on fair use because the plaintiffs had not developed the market-harm record he thought the case required. He emphasized that the ruling did not establish that Meta's use of copyrighted materials to train language models is lawful in general.

Together, these signals do not settle the war. They define the battlefield: lawful acquisition, piracy, transformation, substitution, licensing markets, evidence of market dilution, and the difference between a model learning from a work and a product competing with the work.

Piracy and Training Split Apart

The most important distinction in Bartz is the split between training and acquisition.

Anthropic won a fair-use ruling on training with lawfully acquired copies, but not a free pass for building a library from pirated books. The later settlement process focused on pirated-book claims. The settlement materials and Authors Guild updates describe a $1.5 billion fund, an expected floor of roughly $3,000 per covered title before costs and rights splits, a March 30, 2026 claims deadline, and a final-approval process. More important for governance, the Authors Guild emphasized that the certified class covered piracy-related claims, not a class-wide ruling on the act of training itself. The settlement website's payment timeline calls for the same source discipline: settlement money, releases, deletion duties, and appellate steps are remedial mechanics, not a merits holding about all model training.
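As a rough check on the figures quoted in this article, the arithmetic can be sketched in a few lines of Python. The fund size and the eligible and claimed work counts are the ones reported here. The even split across every eligible work is an assumption made purely for illustration, not the settlement's allocation formula, and the variable names are hypothetical.

```python
# Figures as reported in this article (settlement materials and
# Authors Guild updates).
FUND_USD = 1_500_000_000
ELIGIBLE_WORKS = 482_460
CLAIMED_WORKS = 440_490

# Share of eligible works that had been claimed.
claim_rate = CLAIMED_WORKS / ELIGIBLE_WORKS

# ASSUMPTION: an even split across every eligible work, before costs and
# rights splits. The actual allocation rules may differ.
gross_per_eligible_work = FUND_USD / ELIGIBLE_WORKS

print(f"claim rate: {claim_rate:.1%}")
print(f"gross per eligible work: ${gross_per_eligible_work:,.0f}")
```

Under that assumption the claim rate comes out at 91.3 percent and the gross figure at a little over $3,100 per eligible work, which is in the same range as the roughly $3,000 floor described above. Neither number says anything about the merits of training.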

That split is becoming a governance rule whether or not Congress writes it down. AI companies can argue that training is transformative. They will have a harder time arguing that the source of the copies does not matter. If the model economy wants fair use to protect learning-like transformation, it must still answer acquisition-like conduct: scraping, downloading, torrenting, retaining, storing, redistributing, stripping metadata, ignoring robots signals, bypassing paywalls, and building permanent internal corpora.

This is where provenance stops being a nice-to-have. A company that cannot explain where training material came from cannot credibly separate lawful acquisition from unlawful acquisition, licensed use from unlicensed use, public-domain material from copyrighted material, retrieval material from training material, or deleted data from retained data. Dataset documentation becomes legal memory.

The governance implication is blunt: if AI systems are going to claim the cultural privilege of learning, they need the institutional discipline of records.

Market Harm Becomes Model Harm

Traditional copyright doctrine asks whether the use harms the market for the original or a reasonable derivative market. Generative AI makes that question harder because the harm may not look like ordinary substitution.

A model may not print a whole book. It may still reduce demand for the class of labor that makes books, journalism, illustration, code, translation, legal summaries, or music valuable. It may produce outputs in a style, genre, format, or knowledge domain shaped by copyrighted works. It may serve as an answer engine that satisfies the user's need without sending them to the source. It may help a platform negotiate lower licensing rates because the model can approximate some of what the source provides. It may create a synthetic abundance that changes market expectations even when no single output is an infringing copy.

That is why Kadrey matters even though Meta won that round. The opinion's practical message was not "training is legal." It was that plaintiffs need a stronger record about market harm. That shifts the fight from moral argument to institutional evidence: licensing markets, substitution studies, output similarity, platform behavior, changes in commissions and rates, author income, publisher traffic, search displacement, and whether model outputs occupy markets that copyright owners reasonably could license.

This evidence problem favors actors with data. Platforms know user behavior, output frequency, model retrieval paths, traffic diversion, licensing offers, product roadmaps, and internal substitution plans. Individual creators often do not. If market harm becomes the central test, public governance has to ask who can see the market.

Licensing Becomes Private Law

Uncertainty creates markets. Publishers, news organizations, platforms, image archives, music companies, software repositories, and data brokers are already negotiating licensing deals, blocking crawlers, joining data-rights initiatives, or suing. That licensing layer may become the everyday law of model training before public law catches up.

Licensing can be good governance. It can pay creators, create consent pathways, clarify use rights, reduce litigation, improve dataset quality, support public archives, and make model developers document what they use. But licensing can also privatize cultural memory. Large rights holders can negotiate. Small creators may be aggregated, ignored, or priced out. Public-interest researchers may be unable to afford the same datasets that commercial labs license. Open web knowledge may split into premium licensed corpora for large models and low-quality leftovers for everyone else.

The risk is that copyright becomes the only functioning AI governance regime because it has courts, damages, contracts, and owners. That would be an incomplete regime. Copyright protects works, not all labor. It protects expression, not every social context that made expression possible. It can compensate some authors while leaving annotators, moderators, forum users, translators, teachers, open-source maintainers, public institutions, and communities outside the frame.

Fair-use litigation therefore cannot carry the whole burden. It can set boundaries around copying, but it cannot by itself govern model transparency, data dignity, labor transition, public knowledge, school use, government procurement, synthetic media, safety testing, or the institutional power of answer engines.

A Governance Standard

A serious AI-era copyright governance standard should treat training-data disputes as institutional design problems, not only as courtroom contests.

First, dataset provenance should be auditable. Model developers should maintain records of source categories, acquisition paths, license terms, opt-out signals, retention rules, and deletion procedures. A vague statement that data came from the internet is not institutional memory.

Second, lawful access and lawful use should be separated. Buying a book, subscribing to a database, crawling a page, or receiving a user upload does not automatically answer every downstream training, retention, retrieval, or output question. Governance should track each step.

Third, purpose lanes should be logged separately. Pretraining, post-training, evaluation, safety filtering, retrieval indexing, search grounding, product analytics, and output examples should have separate authority, retention, deletion, and audit fields. The same source can be allowed for one lane and prohibited in another.

Fourth, piracy should not be laundered by transformation. If a company wants courts and the public to accept training as transformative, it should not treat unlawful copying as an incidental supply-chain shortcut.

Fifth, market harm should be measurable by more than verbatim output. Courts and regulators should examine substitution, answer-engine displacement, licensing-market impairment, synthetic style markets, product competition, and documented plans to replace paid sources or workers.

Sixth, licensing markets need fairness rules. Collective licensing, small-creator representation, public-interest access, archival exceptions, and transparency about deal categories matter. Otherwise licensing becomes another advantage for the largest platforms and rights aggregators.

Seventh, public institutions need their own data policy. Libraries, universities, courts, schools, agencies, and public archives should not wait for private settlements to define the future of knowledge access. They need policies for AI use, preservation, licensing, research access, and public-interest model development.

Eighth, model outputs need source discipline. If an answer engine, assistant, or enterprise model relies on licensed or retrieved material, the interface should preserve attribution, citation, and a path back to the source where feasible. The model should not make cultural labor disappear at the moment it becomes useful.

Ninth, training-content summaries should be usable. A public summary that says "web data" or "books" without source categories, collection periods, crawler rules, licensing status, opt-out handling, and excluded categories is not meaningful transparency.

Tenth, takedown and deletion should reach derivatives. Settlements, opt-outs, license expirations, and court orders need procedures for source files, OCR text, embeddings, retrieval chunks, evaluation sets, fine-tunes, caches, backups, and downstream datasets. A deletion rule that reaches only the first copy is too weak for model infrastructure.

Eleventh, procedural posture should be visible. Product teams, journalists, investors, and policymakers should distinguish complaint, motion-to-dismiss order, summary judgment, class certification, settlement, preliminary approval, final approval, and appellate ruling. Treating all of them as "the court said" is how legal uncertainty becomes bad infrastructure.

Twelfth, output governance should be tied back to input rights. If a system quotes, summarizes, imitates, retrieves, or routes users away from the source, the product should log whether the behavior came from training memory, licensed retrieval, user upload, cache, fine-tune, or live browsing. Output risk cannot be governed if the source path has disappeared.
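The twelfth point can be made concrete with a small logging sketch. It is one possible shape for such a log, not a description of any vendor's system. The source paths and behaviors are taken from the paragraph above, while `OutputEvent`, `source_ref`, and `ungovernable` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SourcePath(Enum):
    # Source paths named in the paragraph above.
    TRAINING_MEMORY = "training_memory"
    LICENSED_RETRIEVAL = "licensed_retrieval"
    USER_UPLOAD = "user_upload"
    CACHE = "cache"
    FINE_TUNE = "fine_tune"
    LIVE_BROWSING = "live_browsing"


class Behavior(Enum):
    QUOTE = "quote"
    SUMMARIZE = "summarize"
    IMITATE = "imitate"
    RETRIEVE = "retrieve"
    ROUTE_AWAY = "route_away"


@dataclass(frozen=True)
class OutputEvent:
    output_id: str
    behavior: Behavior
    source_path: Optional[SourcePath]  # None means the path was not recorded
    source_ref: Optional[str] = None   # hypothetical identifier for the source


def ungovernable(events: List[OutputEvent]) -> List[OutputEvent]:
    """Return events whose source path was lost, so input rights cannot be checked."""
    return [e for e in events if e.source_path is None]


events = [
    OutputEvent("out-1", Behavior.QUOTE, SourcePath.LICENSED_RETRIEVAL, "lic-042"),
    OutputEvent("out-2", Behavior.SUMMARIZE, None),
]
print([e.output_id for e in ungovernable(events)])
```

The design choice worth noting is that a missing source path is stored as an explicit `None` rather than a default value, so gaps in the record stay visible instead of being silently attributed to training memory.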

An AI-era copyright governance packet should be concrete enough for counsel, engineers, auditors, procurement teams, and affected creators to read the same record. It should include source inventory, acquisition path, rights basis, collection date, crawler identity, license terms, rights-reservation handling, opt-out or objection status, privacy review, data minimization decision, use lane, retention rule, deletion route, vendor flow-downs, model or dataset versions, output testing, market-substitution evidence, and appeal or correction channel.
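One way to picture a packet entry is as a typed record that counsel and engineers can both query. The sketch below is a minimal illustration, assuming a subset of the fields listed above plus the per-lane authority idea from the third point of the standard. `PacketEntry`, `missing_fields`, and `lane_authorized` are hypothetical names, and the sample values are invented.

```python
from dataclasses import dataclass, field, fields
from enum import Enum
from typing import Dict, List, Optional


class Lane(Enum):
    PRETRAINING = "pretraining"
    POST_TRAINING = "post_training"
    EVALUATION = "evaluation"
    SAFETY_FILTERING = "safety_filtering"
    RETRIEVAL_INDEXING = "retrieval_indexing"


@dataclass
class PacketEntry:
    # A subset of the packet fields; None means "not yet documented".
    source_id: str
    acquisition_path: Optional[str] = None
    rights_basis: Optional[str] = None
    collection_date: Optional[str] = None
    license_terms: Optional[str] = None
    opt_out_status: Optional[str] = None
    retention_rule: Optional[str] = None
    deletion_route: Optional[str] = None
    # Per-lane authority. A lane that is absent is treated as not authorized.
    lane_authority: Dict[Lane, bool] = field(default_factory=dict)


def missing_fields(entry: PacketEntry) -> List[str]:
    """List packet fields that are still undocumented."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) is None]


def lane_authorized(entry: PacketEntry, lane: Lane) -> bool:
    """Default-deny lookup: only lanes explicitly marked True are allowed."""
    return entry.lane_authority.get(lane, False)


# Invented sample entry for illustration only.
entry = PacketEntry(
    source_id="src-001",
    acquisition_path="license",
    rights_basis="publisher agreement",
    lane_authority={Lane.RETRIEVAL_INDEXING: True, Lane.PRETRAINING: False},
)
print(missing_fields(entry))
print(lane_authorized(entry, Lane.RETRIEVAL_INDEXING))
print(lane_authorized(entry, Lane.EVALUATION))
```

The default-deny lookup reflects the point that the same source can be allowed for one lane and prohibited in another: a lane nobody has reviewed is reported as unauthorized rather than inherited from a neighboring lane.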

The packet should also separate evidence from conclusion. A database entry that says "public web" is evidence about access, not proof of training permission. A license is evidence of authority for named uses, not proof that every contributor consented. A fair-use order is evidence about one litigation record, not a supply-chain certificate. A deletion certificate is evidence of a remedy, not proof that the trained behavior is gone unless the affected artifacts and verification method are named.

This is where the article connects to AI bills of materials, dataset supply-chain records, deletion orders, and AI audit trails. Copyright governance is no longer only a memo about fair use. It is the inventory system that lets an institution prove which cultural materials became model capability, under what authority, and with what remaining obligations.
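The tenth point, that deletion should reach derivatives, is at bottom a graph problem. The sketch below assumes a hypothetical lineage map in which each artifact lists the artifacts derived from it, and walks that map breadth-first to find everything a deletion request would need to reach. The artifact names echo the list in the standard; the map itself and the function name `deletion_scope` are invented for illustration.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage: each artifact maps to the artifacts derived from it.
DERIVED_FROM: Dict[str, List[str]] = {
    "source_file": ["ocr_text", "backup"],
    "ocr_text": ["embeddings", "retrieval_chunks", "eval_set"],
    "embeddings": ["cache"],
    "retrieval_chunks": ["cache"],
}


def deletion_scope(root: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk returning the root and every artifact derived from it."""
    seen: Set[str] = {root}
    order: List[str] = [root]
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # an artifact can have more than one parent
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


scope = deletion_scope("source_file", DERIVED_FROM)
print(scope)
```

A walk like this only works if the lineage was recorded when the derivatives were created. It also identifies stored artifacts, not trained behavior, so it supports a deletion certificate without proving that the model's behavior has changed.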

What This Changes

The fair-use fight is a fight over recursive reality.

Culture trains models. Models generate culture. Generated culture changes markets, expectations, search behavior, school assignments, legal research, workplace writing, artistic commissions, journalism, and the next layer of training data. The court is not judging a static copy. It is judging an engine that converts past expression into future conditions of expression.

That is why the doctrine feels strained. Copyright law knows books, copies, markets, licensing, authors, publishers, databases, and derivative works. It is now being asked to govern systems that absorb relations among works and return probabilistic speech as an interface. The model is not only a library and not only a competitor. It is a machine for making the next public record.

The useful response is neither panic nor permissionless extraction. It is a better record. What was used? How was it acquired? What rights were claimed? What was retained? What was deleted? What markets were entered? What outputs substitute for sources? Who gets paid? Who can refuse? Who can audit? Which public institutions remain able to build and study models without becoming tenants of private cultural warehouses?

The fair-use ruling becomes AI governance when it decides the conditions under which human expression may be converted into machine capability. That conversion should be visible before it becomes destiny.

Source Discipline

This article is not legal advice. It treats court orders as primary sources for what a judge decided in a particular procedural posture. It treats complaints as allegations, not findings. It treats settlement notices and Authors Guild updates as evidence of settlement administration and class scope, not proof that the underlying conduct was admitted or adjudicated. It treats the Copyright Office report as policy analysis and the EU AI Act materials as regulatory documentation requirements, not as U.S. fair-use holdings. It treats secondary explainers as useful only when they point back to an order, filing, statute, settlement notice, regulator document, or docket event.

It also treats transparency documents carefully. A training-content summary is not a license. A copyright-compliance policy is not proof that every dataset use was lawful. A robots.txt choice, rights-reservation signal, opt-out form, or license clause can be governance evidence without becoming a complete legal answer. The hard work is matching each claim to the source, the stage of model development, and the remedy being requested.

The rule for reading AI copyright news should be strict: name the case, court, date, claim, works, acquisition path, procedural stage, and actual holding. A ruling that training on lawfully acquired books was fair use for named plaintiffs is not a ruling that pirated books were fair game. A ruling that plaintiffs failed to prove market harm on one record is not a ruling that market harm cannot be proved. A settlement over piracy claims is not a Supreme Court rule for all model training. Source discipline is the difference between law and vibes.
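The reading rule above can be enforced mechanically in a notes tool or newsroom checklist. The following is a minimal sketch, assuming a hypothetical `CaseNote` record that refuses to exist unless every element of the rule is filled in. The sample values paraphrase this article's description of Bartz v. Anthropic; the summary-judgment stage comes from the court's order rather than from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    COMPLAINT = "complaint"
    MOTION_TO_DISMISS = "motion-to-dismiss order"
    SUMMARY_JUDGMENT = "summary judgment"
    CLASS_CERTIFICATION = "class certification"
    SETTLEMENT = "settlement"
    APPELLATE_RULING = "appellate ruling"


@dataclass(frozen=True)
class CaseNote:
    case: str
    court: str
    date: str
    claim: str
    works: str
    acquisition_path: str
    stage: Stage
    holding: str

    def __post_init__(self) -> None:
        # Reject notes that leave any text element of the rule blank.
        for name, value in vars(self).items():
            if isinstance(value, str) and not value.strip():
                raise ValueError(f"missing field: {name}")

    def render(self) -> str:
        """One-line summary that always carries the procedural stage."""
        return (
            f"{self.case} ({self.court}, {self.date}) [{self.stage.value}]: "
            f"{self.holding}; works: {self.works}; "
            f"acquisition: {self.acquisition_path}"
        )


note = CaseNote(
    case="Bartz v. Anthropic",
    court="N.D. Cal.",
    date="June 2025",
    claim="copyright infringement",
    works="books of the named plaintiffs",
    acquisition_path="purchased or otherwise lawfully acquired",
    stage=Stage.SUMMARY_JUDGMENT,
    holding="training on those books was fair use for the named plaintiffs",
)
print(note.render())
```

Because the stage is an enumerated value rather than free text, a note cannot quietly describe a complaint or a settlement as something "the court said".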
