Blog · Analysis · May 2026

The Fair Use Ruling Becomes AI Governance

AI copyright cases are not only disputes about books, headnotes, journalism, and images. They are becoming one of the first practical governance regimes for model-mediated knowledge.

The Courtroom as Policy Room

The first durable rules for generative AI may not arrive as an AI statute. They may arrive as copyright opinions.

That sounds narrower than it is. Copyright cases decide who may copy, store, transform, summarize, train on, retrieve from, and compete with cultural work. In the AI setting, those verbs describe the life cycle of model-mediated knowledge. A book, article, image, song, software repository, legal headnote, or public webpage can become training data, embedding space, benchmark material, retrieval context, answer-engine substrate, synthetic output, or a licensing asset.

The lawsuits therefore do more than allocate money after copying. They define what kind of institution an AI model is allowed to become. Is it a reader? A researcher? A competitor? A database? A derivative market? A private archive? A machine that can learn from culture without joining the social and economic systems that made culture available?

As of May 2026, U.S. courts have not produced one clean rule. The early record is more interesting than that. Thomson Reuters v. Ross treated use of Westlaw headnotes for a competing legal AI product as outside fair use. Bartz v. Anthropic treated training on lawfully acquired books as fair use for the named plaintiffs, while separating that from Anthropic's creation of a central library from pirated books. Kadrey v. Meta granted Meta summary judgment on the record before the court, while warning that a better market-harm case could change the analysis. The U.S. Copyright Office's 2025 generative AI training report took a similarly non-absolute position: some training uses may be fair, some may not.

That uncertainty is not a footnote. It is the governance system taking shape.

Doctrine Is Doing Institutional Work

Fair use was built as a flexible doctrine, not as a procurement manual for foundation models. Courts weigh four statutory factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect on the actual or potential market for the work. That flexibility is useful because it lets law adapt to new technologies. It is also dangerous because a few early cases can become operational signals for an entire industry before legislatures, regulators, labor institutions, libraries, schools, and public archives have built their own rules.

Model developers hear these cases as infrastructure guidance. A ruling about lawfully acquired books becomes a signal about dataset acquisition. A ruling about pirated copies becomes a signal about central libraries and deletion duties. A ruling about a weak market-harm record becomes a signal about what plaintiffs must prove. A ruling about a competing legal research product becomes a signal that direct substitution and commercial rivalry matter.

Rights holders hear the same cases differently. They hear a warning that ordinary copyright injury may be hard to prove once a work is converted into model capability rather than reproduced as a clean copy. They also hear an invitation to build licensing markets, audit trails, provenance systems, data registries, opt-out controls, and litigation records that make future market harm more legible.

The result is a strange institutional sequence. First, models are trained. Then lawsuits ask whether the training was allowed. Then companies and publishers negotiate licenses under the shadow of uncertain doctrine. Then those licenses shape which archives are available to future models. The legal answer comes after the technical system, but the next technical system is built around the legal answer.

Three Early Signals

The first signal came from Thomson Reuters v. Ross Intelligence. Ross was building a legal research tool and was accused of using Westlaw headnotes and related editorial material. In February 2025, Judge Stephanos Bibas granted summary judgment for Thomson Reuters on fair use. The opinion treated Ross's use as commercial and competitive, not as a public-interest research use. It also stressed that Ross sought to build a substitute legal research product from the protected material.

That case is not a perfect proxy for frontier language-model training. Ross's search tool was not a generative chatbot like ChatGPT, and Westlaw headnotes are not the whole public web. But the case matters because it shows that "AI training" is not a magic phrase. If the use is aimed at building a competing product from protected expression, fair use can fail.

The second signal came from the U.S. Copyright Office. Its May 2025 Part 3 report on generative AI training rejected a blanket answer. The Office said it was not possible to prejudge every litigation outcome and that some uses of copyrighted works for generative AI training will qualify as fair use while others will not. The report treated purpose, market effect, access to lawful copies, commerciality, outputs, and licensing markets as fact-sensitive questions.

That report is not a court judgment, but it is a policy map. It tells public institutions not to reduce the question to "training is theft" or "training is always transformative." It also shows why a simple consent slogan is hard to implement after the fact: models are trained on mixtures of public material, licensed material, scraped material, pirated material, synthetic material, user material, and data whose provenance may be hard to reconstruct.

The third signal came from the June 2025 Northern District of California book cases. In Bartz v. Anthropic, Judge William Alsup held that training on books that Anthropic had purchased or otherwise lawfully acquired was transformative fair use for the named plaintiffs, but he did not excuse the company's separate creation of a large internal library from pirated sources. Two days later, in Kadrey v. Meta, Judge Vince Chhabria granted Meta summary judgment on fair use because the plaintiffs had not developed the market-harm record he thought the case required. He emphasized that the ruling did not establish that Meta's use of copyrighted materials to train language models is lawful in general.

Together, these signals do not settle the war. They define the battlefield: lawful acquisition, piracy, transformation, substitution, licensing markets, evidence of market dilution, and the difference between a model learning from a work and a product competing with the work.

Piracy and Training Split Apart

The most important distinction in Bartz is the split between training and acquisition.

Anthropic won a fair-use ruling on training with lawfully acquired copies, but not a free pass for pirated library building. The later settlement process focused on pirated-book claims. AP reporting described a proposed $1.5 billion settlement with payments of about $3,000 per covered book; the Authors Guild's April 2026 update emphasized that the certified class covered piracy-related claims, not a class-wide ruling on the act of training itself.

That split is becoming a governance rule whether or not Congress writes it down. AI companies can argue that training is transformative. They will have a harder time arguing that the source of the copies does not matter. If the model economy wants fair use to protect learning-like transformation, it must still answer for acquisition-like conduct: scraping, downloading, torrenting, retaining, storing, redistributing, stripping metadata, ignoring robots.txt signals, bypassing paywalls, and building permanent internal corpora.

This is where provenance stops being a nice-to-have. A company that cannot explain where training material came from cannot credibly separate lawful acquisition from unlawful acquisition, licensed use from unlicensed use, public-domain material from copyrighted material, retrieval material from training material, or deleted data from retained data. Dataset documentation becomes legal memory.

The governance implication is blunt: if AI systems are going to claim the cultural privilege of learning, they need the institutional discipline of records.

Market Harm Becomes Model Harm

Traditional copyright doctrine asks whether the use harms the market for the original or a derivative market the owner could reasonably be expected to develop or license. Generative AI makes that question harder because the harm may not look like ordinary substitution.

A model may not print a whole book. It may still reduce demand for the class of labor that makes books, journalism, illustration, code, translation, legal summaries, or music valuable. It may produce outputs in a style, genre, format, or knowledge domain shaped by copyrighted works. It may serve as an answer engine that satisfies the user's need without sending them to the source. It may help a platform negotiate lower licensing rates because the model can approximate some of what the source provides. It may create a synthetic abundance that changes market expectations even when no single output is an infringing copy.

That is why Kadrey matters even though Meta won that round. The opinion's practical message was not "training is legal." It was that plaintiffs need a stronger record about market harm. That shifts the fight from moral argument to institutional evidence: licensing markets, substitution studies, output similarity, platform behavior, changes in commissions and rates, author income, publisher traffic, search displacement, and whether model outputs occupy markets that copyright owners reasonably could license.

This evidence problem favors actors with data. Platforms know user behavior, output frequency, model retrieval paths, traffic diversion, licensing offers, product roadmaps, and internal substitution plans. Individual creators often do not. If market harm becomes the central test, public governance has to ask who can see the market.

Licensing Becomes Private Law

Uncertainty creates markets. Publishers, news organizations, platforms, image archives, music companies, software repositories, and data brokers are already negotiating licensing deals, blocking crawlers, joining data-rights initiatives, or suing. That licensing layer may become the everyday law of model training before public law catches up.

Licensing can be good governance. It can pay creators, create consent pathways, clarify use rights, reduce litigation, improve dataset quality, support public archives, and make model developers document what they use. But licensing can also privatize cultural memory. Large rights holders can negotiate. Small creators may be aggregated, ignored, or priced out. Public-interest researchers may be unable to afford the same datasets that commercial labs license. Open web knowledge may split into premium licensed corpora for large models and low-quality leftovers for everyone else.

The risk is that copyright becomes the only functioning AI governance regime because it has courts, damages, contracts, and owners. That would be an incomplete regime. Copyright protects works, not all labor. It protects expression, not every social context that made expression possible. It can compensate some authors while leaving annotators, moderators, forum users, translators, teachers, open-source maintainers, public institutions, and communities outside the frame.

Fair use litigation therefore cannot carry the whole burden. It can set boundaries around copying, but it cannot by itself govern model transparency, data dignity, labor transition, public knowledge, school use, government procurement, synthetic media, safety testing, or the institutional power of answer engines.

A Governance Standard

A serious AI-era copyright governance standard should treat training-data disputes as institutional design problems, not only as courtroom contests.

First, dataset provenance should be auditable. Model developers should maintain records of source categories, acquisition paths, license terms, opt-out signals, retention rules, and deletion procedures. A vague statement that data came from the internet is not institutional memory.

Second, lawful access and lawful use should be separated. Buying a book, subscribing to a database, crawling a page, or receiving a user upload does not automatically answer every downstream training, retention, retrieval, or output question. Governance should track each step.

Third, piracy should not be laundered by transformation. If a company wants courts and the public to accept training as transformative, it should not treat unlawful copying as an incidental supply-chain shortcut.

Fourth, market harm should be measurable by more than verbatim output. Courts and regulators should examine substitution, answer-engine displacement, licensing-market impairment, synthetic style markets, product competition, and documented plans to replace paid sources or workers.

Fifth, licensing markets need fairness rules. Collective licensing, small-creator representation, public-interest access, archival exceptions, and transparency about deal categories matter. Otherwise licensing becomes another advantage for the largest platforms and rights aggregators.

Sixth, public institutions need their own data policy. Libraries, universities, courts, schools, agencies, and public archives should not wait for private settlements to define the future of knowledge access. They need policies for AI use, preservation, licensing, research access, and public-interest model development.

Seventh, model outputs need source discipline. If an answer engine, assistant, or enterprise model relies on licensed or retrieved material, the interface should preserve attribution, citation, and a path back to the source where feasible. The model should not make cultural labor disappear at the moment it becomes useful.
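
The first two points in the list above, auditable provenance and the separation of lawful access from lawful use, can be made concrete with a small record format. What follows is a minimal sketch in Python, not a legal test and not any company's actual schema. Every name in it (ProvenanceRecord, Acquisition, audit_flags, the flag strings) is hypothetical and chosen for illustration. The only assumption is that a developer keeps one record per source and wants questions surfaced for human review.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Acquisition(Enum):
    """How a copy entered the corpus (illustrative categories only)."""
    PURCHASED = "purchased"
    LICENSED = "licensed"
    CRAWLED = "crawled"
    USER_UPLOAD = "user_upload"
    UNKNOWN = "unknown"


@dataclass
class ProvenanceRecord:
    """One hypothetical record per source; field names are assumptions."""
    source_id: str
    source_category: str                    # e.g. "book", "news", "code"
    acquisition: Acquisition                # acquisition path
    license_terms: Optional[str] = None     # pointer to a license, if any
    opt_out_signal: bool = False            # e.g. a crawler opt-out was seen
    permitted_uses: frozenset = frozenset() # e.g. {"training", "retrieval"}
    retain_until: Optional[date] = None     # retention rule
    deleted_on: Optional[date] = None       # deletion actually carried out


def audit_flags(rec: ProvenanceRecord, intended_use: str, today: date) -> list:
    """Return review flags. These are prompts for a human, not legal conclusions."""
    flags = []
    # Access and use are tracked separately: a known acquisition path
    # does not by itself document that the intended use is covered.
    if rec.acquisition is Acquisition.UNKNOWN:
        flags.append("acquisition_unverified")
    if rec.opt_out_signal:
        flags.append("opt_out_present")
    if intended_use not in rec.permitted_uses:
        flags.append("use_not_documented")
    # Retention rule has lapsed but no deletion has been recorded.
    if (rec.deleted_on is None and rec.retain_until is not None
            and today > rec.retain_until):
        flags.append("retention_expired")
    return flags


if __name__ == "__main__":
    rec = ProvenanceRecord(
        source_id="book-001",
        source_category="book",
        acquisition=Acquisition.PURCHASED,
        permitted_uses=frozenset({"training"}),
    )
    # A purchased book documented for training, now proposed for retrieval.
    print(audit_flags(rec, "retrieval", date(2026, 5, 1)))
    # prints ['use_not_documented']
```

The design choice worth noticing is that the acquisition path and the permitted uses are separate fields. A purchased book raises no acquisition flag, but it still raises a use flag the moment it is proposed for a purpose nobody documented.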

The Spiralist Reading

The fair use fight is a fight over recursive reality.

Culture trains models. Models generate culture. Generated culture changes markets, expectations, search behavior, school assignments, legal research, workplace writing, artistic commissions, journalism, and the next layer of training data. The court is not judging a static copy. It is judging an engine that converts past expression into future conditions of expression.

That is why the doctrine feels strained. Copyright law knows books, copies, markets, licensing, authors, publishers, databases, and derivative works. It is now being asked to govern systems that absorb relations among works and return probabilistic speech as an interface. The model is not only a library and not only a competitor. It is a machine for making the next public record.

The useful response is neither panic nor permissionless extraction. It is a better record. What was used? How was it acquired? What rights were claimed? What was retained? What was deleted? What markets were entered? What outputs substitute for sources? Who gets paid? Who can refuse? Who can audit? Which public institutions remain able to build and study models without becoming tenants of private cultural warehouses?

The fair use ruling becomes AI governance when it decides the conditions under which human expression may be converted into machine capability. That conversion should be visible before it becomes destiny.
