TL;DR: Encyclopaedia Britannica has filed a copyright infringement and trademark lawsuit against OpenAI, alleging that more than 100,000 of its articles were scraped and used to train GPT models without permission, compensation, or attribution. The lawsuit adds a second, sharper legal theory under the Lanham Act: false attribution — the claim that ChatGPT presents AI-generated content as if it were drawn from Britannica, damaging the 258-year-old institution's reputation as the gold standard of factual accuracy. The case accelerates what is now an industrywide litigation wave targeting OpenAI's training data practices.
What you will learn
- The lawsuit and what Britannica is alleging
- The Lanham Act angle: false attribution explained
- The copyright infringement claims in detail
- Why Britannica matters as a plaintiff
- OpenAI's content licensing strategy under pressure
- Prior lawsuits: how Britannica fits the pattern
- The fair use defense and why it is being tested at scale
- The knowledge content licensing market
- Implications for publishers and institutions
- What this means for AI training data going forward
- Frequently asked questions
The lawsuit and what Britannica is alleging
Encyclopaedia Britannica — the Scottish-American institution that has been publishing authoritative reference content since 1768 — has filed a lawsuit in federal court against OpenAI alleging copyright infringement and violations of the Lanham Act.
The complaint centers on two distinct but related claims. First, that OpenAI systematically used more than 100,000 Britannica articles to train its GPT series of large language models without authorization, payment, or opt-in consent. Second, that ChatGPT then produces outputs that are derived from Britannica's licensed content but presents that content without accurate attribution — or, worse, attributes AI-generated content to Britannica that Britannica never wrote.
The scale of the alleged scraping is notable. Britannica's digital archive contains decades of commissioned encyclopedia articles, written by credentialed experts and fact-checked to a standard that distinguishes it from user-generated alternatives. The company has maintained a paid digital subscription product and a licensing program for its content. It is not a publisher that put its content online casually and then discovered AI companies had taken it. It is an institution that built a business model around the controlled distribution of trusted knowledge — and alleges that business model was circumvented at scale.
The complaint makes a straightforward argument: OpenAI ingested Britannica's corpus as part of the large pre-training datasets used to build GPT-3, GPT-4, and subsequent models. Those datasets, assembled from Common Crawl, Books3, and other web-scraped sources, are well-documented to contain substantial quantities of high-quality reference content. Britannica argues its articles appear in those datasets and that their inclusion was not authorized under any license or exception.
No specific damages figure has been reported in initial coverage, which is typical for copyright complaints at the filing stage. The lawsuit seeks both actual damages and injunctive relief.
The Lanham Act angle: false attribution explained
The copyright claim is significant, but the Lanham Act claim is the more novel and potentially more damaging legal theory.
The Lanham Act is the primary U.S. federal trademark statute, enacted in 1946. Most people associate it with trademark infringement — the kind of claim you bring when someone uses your brand name to sell competing goods. But the Act also covers false advertising and, critically, false designation of origin. Section 43(a) prohibits false or misleading statements about the origin of goods or services in commercial contexts.
Britannica's false attribution theory works like this: when a user asks ChatGPT a factual question, the model may respond with content that was substantially derived from Britannica's articles. If the model then says "according to Encyclopaedia Britannica" or implies that the information originates from Britannica's verified body of knowledge, it is making a claim about origin that is misleading in two directions.
The first direction is overattribution. The model may cite "Britannica" as a source when the actual output is a probabilistic synthesis derived from hundreds of training documents, some of which happened to include Britannica text. That is not the same as quoting Britannica. Presenting it as such implies a verification and sourcing standard that the output does not meet.
The second direction is more concerning. ChatGPT is known to hallucinate — to produce confident-sounding statements that are factually incorrect. If those hallucinated statements are attributed to Britannica, either directly or by implication (e.g., the model says a claim is "well-established" in a way that echoes Britannica's authoritative register), Britannica's 258-year reputation for accuracy is being associated with content it never vetted and never wrote.
This is a genuine injury, not a theoretical one. Britannica's entire value proposition is that its content is correct. When users encounter erroneous information and believe it came from or was endorsed by Britannica, the institution's credibility is damaged in a way that is difficult to quantify but not difficult to understand. It is the reputational equivalent of forging a letter on a university's official letterhead.
The Lanham Act vehicle for this theory is well-chosen. Unlike copyright, which requires proving copying, false attribution under the Lanham Act requires proving that consumers are likely to be confused about the origin or endorsement of content. That is a more commercially oriented standard, and in the context of AI outputs that regularly invoke brand names as epistemic anchors, it may be easier to establish.
The copyright infringement claims in detail
The copyright infringement side of the case follows a framework that has become familiar from the wave of lawsuits now working through U.S. courts.
Britannica owns the copyright in its articles. Those articles were created by paid contributors and editors under work-for-hire arrangements, making Britannica the rights holder. Unlike newspapers, where individual journalists may hold their own copyrights, Britannica's institutional authorship model means the company can sue directly without aggregating claims from thousands of individual writers.
The infringement theory is direct: OpenAI copied those articles to create training datasets, used those datasets to train models, and the trained models can reproduce content that is substantially similar to — or, in some cases, near-verbatim extracts of — Britannica's copyrighted text. The complaint likely includes examples of ChatGPT outputs that closely track Britannica articles as evidence of the underlying memorization.
This is the same structural argument made in The New York Times v. OpenAI, filed in December 2023. The Times demonstrated that ChatGPT could produce near-verbatim reproductions of Times articles when prompted in specific ways. Britannica's complaint almost certainly contains similar demonstrations.
The key factual disputes will center on three questions. First, did OpenAI's training datasets actually include Britannica content — and in what volume? Second, do the trained models actually reproduce that content in ways that exceed transformative use? Third, did OpenAI take any steps to filter or exclude licensed content from its training data?
On the first question, the Common Crawl dataset that forms the backbone of most large language model training includes extensive crawls of britannica.com going back years. Researchers studying training data composition have documented that high-quality encyclopedic content is systematically overrepresented in training datasets relative to its share of the web, because model developers actively seek out authoritative sources. Britannica's content almost certainly appears in OpenAI's training corpus.
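This kind of documentation is possible because Common Crawl publishes a public CDX-style index for each crawl release, queryable by URL pattern, with one JSON object per line in the response. The sketch below shows how a researcher might build such a query and parse the results; the endpoint shape and response fields are the real index format, but the crawl ID is illustrative and actually counting captures would require a live network request.

```python
import json
from urllib.parse import urlencode

# Common Crawl publishes one CDX index per crawl release; the crawl ID
# used below is illustrative -- current IDs are listed at index.commoncrawl.org.
INDEX_BASE = "https://index.commoncrawl.org"

def capture_query_url(crawl_id: str, url_pattern: str, page: int = 0) -> str:
    """Build a CDX index query for all captures matching a URL pattern."""
    params = urlencode({"url": url_pattern, "output": "json", "page": page})
    return f"{INDEX_BASE}/{crawl_id}-index?{params}"

def parse_captures(response_text: str) -> list[dict]:
    """The index returns one JSON object per line, one per page capture."""
    return [json.loads(line) for line in response_text.splitlines() if line.strip()]

# Example: a query for captures of Britannica article pages in one crawl.
query = capture_query_url("CC-MAIN-2023-50", "britannica.com/topic/*")

# A response line looks roughly like this (fields abbreviated):
sample = '{"url": "https://www.britannica.com/topic/encyclopaedia", "status": "200", "mime": "text/html"}'
captures = parse_captures(sample)
# Filter to successfully fetched HTML pages, the captures a trainer could ingest.
ok = [c for c in captures if c.get("status") == "200" and c.get("mime") == "text/html"]
print(query)
print(len(ok))
```

Repeating this count across crawl releases is how researchers estimate how much of a given domain sits in web-scale training corpora.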
Why Britannica matters as a plaintiff
Not all copyright plaintiffs are equally positioned to shape legal outcomes. Britannica is exceptionally well-positioned.
At 258 years old, Britannica is one of the most recognizable knowledge brands in the world. In the popular imagination, "the encyclopedia" and "Britannica" are nearly synonymous. The institution survived the transition from physical books to CD-ROMs, then to the web, and has built a sustainable digital subscription and licensing business. It is not a struggling media company suing out of desperation. It is a profitable institution with significant legal resources and a reputational interest that is easy to explain to a jury.
The nature of Britannica's content also matters. Unlike newspaper articles, which are written under deadline pressure and sometimes contain errors, Britannica articles are specifically commissioned for accuracy. The contributors are credentialed domain experts. Each article goes through multiple layers of editorial review. Britannica's content is the kind of factual bedrock that AI companies specifically sought out when building training datasets — precisely because its quality is high.
This creates a sharp irony. The features that made Britannica's content valuable as training data — verified accuracy, expert authorship, encyclopedic breadth — are exactly the features that AI companies want their models to exhibit. OpenAI's pitch to users is that ChatGPT provides reliable, comprehensive answers. To the extent that reliability and comprehensiveness derive from ingesting Britannica's corpus, Britannica has an argument that it is owed a share of the value created.
The institutional weight also matters politically. Britannica is not a media conglomerate whose copyright claims can be dismissed as corporate rent-seeking. It is an institution with a democratic mission — making authoritative knowledge accessible — that predates the United States itself. That framing is available in amicus briefs, in press coverage, and eventually in legislative testimony if Congress takes up AI copyright reform.
OpenAI's content licensing strategy under pressure
OpenAI has spent the last two years sorting content owners into two columns: the companies it licenses content from and the companies that are suing it.
The licensing column is substantial. OpenAI has signed content agreements with the Associated Press, Axel Springer (owner of Business Insider and Politico), the Financial Times, Le Monde, Prisa Media, News Corp (owner of The Wall Street Journal and New York Post), and The Atlantic, among others. These deals provide OpenAI with rights to current content for training and, in some cases, integration of live data into ChatGPT's responses.
The terms of most deals are not public, but estimates for the larger agreements range from the low tens of millions to over $200 million over multi-year terms. News Corp's deal with OpenAI is reported to be worth $250 million over five years.
The lawsuit column is also substantial, and growing. The New York Times filed suit in December 2023. The Authors Guild and a group of prominent fiction writers, including John Grisham, Jodi Picoult, and George R.R. Martin, have filed suit. Getty Images filed a lawsuit in the United Kingdom. Universal Music Group and other music publishers have pursued actions related to AI-generated music. The Intercept and Raw Story have filed. And now Britannica.
The strategic calculation for OpenAI has been that licensing deals with major players can drain the incentive to litigate. If The Wall Street Journal is paid, it is less likely to join a lawsuit. But that strategy has a limit. There are thousands of rights-holders whose content was used in training, and OpenAI cannot sign individual deals with all of them. Britannica's lawsuit represents a rights-holder who was not brought into the licensing fold and is not willing to wait.
OpenAI's other line of defense — that training data licensing is being resolved through ongoing content partnerships and that licensing rates will eventually be established by the market — is now being tested in court at a pace that may outrun any voluntary settlement framework.
Prior lawsuits: how Britannica fits the pattern
The Britannica lawsuit is the latest in a litigation wave that has been building since late 2022.
The first major case was brought by Getty Images against Stability AI in January 2023, targeting the image-generation model Stable Diffusion. Getty alleged that Stability AI scraped over 12 million images from Getty's licensed library without permission, and that the resulting model could reproduce Getty watermarks — one of the more vivid pieces of evidence that training data copying had occurred at a level beyond mere inspiration.
The Authors Guild class action, filed in September 2023, brought together a diverse group of fiction and non-fiction writers whose books were included in the Books3 training dataset — a collection of 196,640 books assembled from a now-defunct piracy repository called Bibliotik. The existence of Books3 as a training data source has been documented by researchers and is not disputed. The legal question is whether using copyrighted books for AI training constitutes fair use.
The New York Times lawsuit, filed in December 2023, is the highest-profile case. The Times demonstrated that GPT-4 could reproduce substantial verbatim passages from Times articles when prompted with the beginning of a sentence. This evidence is qualitatively different from the Authors Guild case: it shows not just that the model was trained on Times content, but that the content is in some sense stored and retrievable, challenging the characterization of training as mere "exposure" to data.
The Intercept's lawsuit, filed in early 2024, makes a narrower technical argument about copyright management information — the metadata that publishers embed in content to assert ownership. Removing or disregarding that information in the course of training, the complaint alleges, violates the Digital Millennium Copyright Act separately from any copyright infringement claim.
Britannica's lawsuit adds two elements that distinguish it from prior cases. First, the Lanham Act false attribution theory is new. No prior major AI lawsuit has centered on trademark law's concern with misleading attribution. If that theory survives a motion to dismiss, it opens a second legal front that is conceptually distinct from copyright and harder to defend with the same fair use arguments. Second, Britannica's institutional profile — a verified-accuracy institution whose brand is being invoked in AI outputs — makes the false attribution harm unusually concrete and sympathetic.
The fair use defense and why it is being tested at scale
OpenAI's primary defense in all of these cases is fair use — the copyright doctrine that permits certain uses of copyrighted material without permission when the use is transformative, limited in scope, and does not harm the market for the original work.
The four-factor fair use test weighs: the purpose and character of the use (commercial or educational, transformative or reproductive); the nature of the copyrighted work; the amount and substantiality of the portion used; and the effect on the potential market for the copyrighted work.
OpenAI's argument, stated plainly, is that training a model on text is transformative in the same way that reading is transformative. A model that has been trained on Britannica articles does not reproduce those articles; it has abstracted patterns from them. The output of the model is not a copy of the training data but a novel synthesis. The market for Britannica subscriptions is not harmed by ChatGPT's existence in any way that can be traced to this specific use.
This argument has not yet been tested in a final judicial ruling. The cases are moving toward discovery and, eventually, summary judgment. The fair use question will ultimately be decided by federal courts and, very likely, by the Supreme Court.
The fair use defense is not obviously wrong. The U.S. Court of Appeals for the Second Circuit's 2015 ruling in Authors Guild v. Google found that Google's digital scanning and search indexing of millions of books was fair use, even though Google profited from the service. That case is the most analogous precedent, and it cut in favor of the technologist. But there are meaningful distinctions. Google's search index displayed snippets, not complete reproductions, and the service increased the discoverability of the underlying books. Large language models trained on copyrighted text can produce outputs that compete directly with the source material, potentially cannibalizing the market.
The "market harm" factor may be the most consequential. Britannica can argue that ChatGPT providing encyclopedic answers for free directly competes with Britannica's subscription product. If users get their factual questions answered by an AI trained on Britannica's content, without paying Britannica, that is a measurable market harm. Whether courts accept that framing will shape the entire AI copyright landscape.
The knowledge content licensing market
The Britannica lawsuit arrives at a moment when the market for licensing knowledge content to AI companies is being constructed in real time.
The rough contours are becoming visible. Major newspaper licensing deals are reportedly worth $20 million to $250 million over multi-year terms, depending on the publication's scale and the breadth of rights granted. Academic publishers like Elsevier and Springer Nature have been in discussions with multiple AI companies about licensing their journal archives. Textbook publishers have sent letters asserting rights. Libraries and archival institutions are evaluating their exposure.
The challenge for knowledge content publishers is that they face a collective action problem. Any individual publisher negotiating with OpenAI or Google or Meta is doing so with incomplete information about what competitors have settled for. The first movers in licensing — the AP, Axel Springer, News Corp — set precedents that may undervalue the full contribution of the knowledge corpus to model capabilities.
Britannica's lawsuit can be read as a negotiating tactic as much as a legal strategy. Filing suit is a credible signal that the institution is serious about extracting value from its contribution to AI capabilities. If OpenAI wants to avoid the costs and uncertainty of litigation, it can offer a licensing deal. The question is whether what OpenAI will pay matches what Britannica's content is actually worth.
The broader market is also being shaped by outcomes in the current wave of litigation. If courts rule for plaintiffs in the NYT case or the Authors Guild case, it will establish that AI training without permission is actionable — and drive a wave of licensing negotiations as AI companies seek to retroactively clear rights. If courts rule for OpenAI on fair use grounds, the market for licensing historical training data may never materialize, and publishers will be left to negotiate only for future content pipelines.
Implications for publishers and institutions
The Britannica lawsuit has specific implications for knowledge publishers, reference institutions, and any organization whose core product is authoritative factual content.
For reference publishers — encyclopedia producers, almanac publishers, institutional knowledge bases, fact-checking organizations — the false attribution theory is especially relevant. These organizations' value propositions rest on the association of their name with accuracy. If AI systems can invoke that name while producing content of uncertain accuracy, the brand equity is diluted without compensation. The Lanham Act theory gives these institutions a legal vehicle that does not depend on proving verbatim copying.
For academic publishers, the case strengthens the argument for licensing before the next generation of models is trained rather than litigating after the fact. The AI training data market is moving toward a world where accessing high-quality curated content requires explicit licensing agreements. Institutions that did not negotiate licenses before the current generation of models were trained are now exploring litigation as the only remaining path to compensation.
For news publishers that have already signed licensing deals, the Britannica case may encourage renegotiation. If a court rules that training on copyrighted content without permission is infringing, existing licensors may argue their earlier deals underpriced the rights and seek to renegotiate on more favorable terms.
For universities and libraries that hold large archives of copyrighted material, the case raises questions about their exposure and their obligations. Universities that have digitized collections and made them available to researchers may need to review whether that availability facilitated AI training by third parties.
What this means for AI training data going forward
The accumulation of lawsuits — Getty, Authors Guild, NYT, Intercept, now Britannica, and others pending — is forcing a structural change in how AI companies approach training data.
The era of "scrape everything and sort out the rights later" is ending, not because AI companies have developed ethical objections to the practice, but because the legal risk has become quantifiable and significant. The NYT lawsuit alone is seeking billions of dollars in damages. A portfolio of similar judgments would be existential for any company that is not already valued in the hundreds of billions.
The response has been a bifurcation in training data strategy. For current and future model versions, AI companies are moving toward licensed content pipelines. Deals with publishers, record labels, stock image libraries, and data providers give AI companies documented rights to use content for training. The cost is real but bounded.
For the historical training data used in existing models, the legal exposure cannot be retroactively addressed through licensing. OpenAI cannot go back and get permission for the data used to train GPT-3. What it can do is negotiate settlements that resolve claims without admitting liability — a strategy it appears to be pursuing aggressively. The Authors Guild litigation, for instance, has seen preliminary settlement discussions that suggest at least some plaintiffs are open to compensation short of a full trial.
The Britannica case adds pressure to both tracks. It expands the universe of institutional plaintiffs with the resources and reputational interest to sustain litigation. It introduces the Lanham Act as a second legal theory that requires separate defenses from the copyright fair use arguments. And it focuses attention on a category of content — trusted, expert-written, accuracy-verified factual knowledge — that sits at the heart of what makes large language models useful.
OpenAI's value proposition to users is that its models are reliable and comprehensive. To the extent that reliability derives from ingesting Britannica-quality content, Britannica's lawsuit is a direct challenge to that value proposition's legal foundation.
Frequently asked questions
What is Encyclopaedia Britannica suing OpenAI over?
Britannica has filed a lawsuit alleging two main claims: copyright infringement, asserting that over 100,000 of its articles were used to train OpenAI's GPT models without permission; and false attribution under the Lanham Act, asserting that ChatGPT produces content derived from or associated with Britannica without accurate attribution — and sometimes attributes to Britannica content that Britannica never created.
What is the Lanham Act and why does it apply here?
The Lanham Act is the primary U.S. federal trademark statute. Section 43(a) prohibits false designations of origin and false or misleading statements in commercial contexts. Britannica is using it to challenge ChatGPT's practice of referencing or implying association with Britannica's brand while producing AI-generated outputs that may be inaccurate, misleading, or simply not derived from what users would expect a "Britannica" citation to mean.
How is false attribution different from copyright infringement?
Copyright infringement is about copying creative expression without permission. False attribution is about misrepresenting the origin, authorship, or endorsement of content. You can have false attribution without copyright infringement — for instance, if an AI model cites "Britannica" for a factually wrong statement that happens not to be copied from any Britannica article, that is false attribution but not copyright infringement.
Why is Britannica suing now, in 2026?
The AI copyright litigation wave began building in 2022 and 2023 with the Getty Images, Authors Guild, and New York Times lawsuits. Those cases are still working through discovery and pre-trial motions. Britannica is joining a litigation environment where the legal theories are becoming clearer, the evidence of training data composition is better documented, and OpenAI's licensing strategy has left certain major content owners outside the deal structure.
Did OpenAI have a content licensing deal with Britannica before this lawsuit?
Based on available reporting, Britannica was not among the publishers that signed licensing agreements with OpenAI. The absence of a deal is consistent with OpenAI's broader licensing strategy, which has prioritized current-content news publishers over reference and archival content owners.
What is OpenAI's likely defense?
OpenAI is expected to invoke fair use — the copyright doctrine that permits transformative uses of copyrighted material. The argument is that training a model on text is transformative because the model does not store or reproduce the text but abstracts patterns from it. OpenAI may also challenge whether Britannica can prove its specific articles appeared in OpenAI's training datasets in a way that is legally significant.
What is fair use and could it protect OpenAI?
Fair use is a copyright exception evaluated on four factors: the purpose of the use, the nature of the copyrighted work, the amount used, and the market impact. Courts have not yet ruled definitively on whether AI training constitutes fair use. The outcome will likely depend significantly on whether courts find that AI models compete with the original content market — Britannica's paid subscriptions — or operate in a distinct market.
How does this case compare to The New York Times v. OpenAI?
Both cases allege that OpenAI used copyrighted content for training without permission. The NYT case famously demonstrated that ChatGPT could reproduce near-verbatim passages from Times articles. Britannica's case adds the Lanham Act false attribution theory, which has not been a central claim in the NYT lawsuit. The two cases share the copyright infringement framework but differ in the trademark-based harm theory.
What damages is Britannica seeking?
The specific damages figures have not been publicly reported at the time of filing, which is typical. Copyright cases of this type typically seek statutory damages (up to $150,000 per infringed work for willful infringement), actual damages, disgorgement of profits attributable to the infringement, and injunctive relief. With over 100,000 articles at issue, the potential statutory damages figure is very large, though courts retain discretion in awards.
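The arithmetic behind "very large" is simple. Under 17 U.S.C. § 504(c), statutory damages run from a $750 floor per work up to $150,000 per work for willful infringement; multiplying those statutory bounds by the complaint's alleged article count gives the theoretical range (a ceiling courts essentially never award in full, since they retain discretion per work):

```python
# Statutory damage bounds under 17 U.S.C. § 504(c), per infringed work.
WILLFUL_MAX = 150_000   # ceiling for willful infringement
STANDARD_MIN = 750      # ordinary statutory floor
ARTICLES = 100_000      # volume alleged in the complaint ("more than 100,000")

ceiling = ARTICLES * WILLFUL_MAX
floor = ARTICLES * STANDARD_MIN

print(f"theoretical ceiling: ${ceiling:,}")  # $15,000,000,000
print(f"theoretical floor:   ${floor:,}")    # $75,000,000
```

Even the floor is a material number, which is why per-work statutory damages make high-volume scraping claims so potent.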
How much is 100,000 articles in context?
Britannica's digital archive contains reference articles on topics ranging from biology to history to geography, written by credentialed academic contributors and regularly updated. The 100,000 figure cited in the complaint refers to the volume of Britannica articles allegedly included in OpenAI's training datasets, not the total size of Britannica's archive.
What is the timeline for this case?
Federal copyright litigation of this complexity typically takes several years from filing to final judgment. The case will proceed through initial motions (including likely motions to dismiss), discovery (during which Britannica's lawyers will seek access to OpenAI's training data documentation), summary judgment motions, and potentially trial. Settlement is possible at any point. Most large IP disputes of this type settle before trial.
Could this affect ChatGPT's functionality?
A ruling against OpenAI could require changes to training data practices for future models, payments to rights-holders, and potentially changes to how ChatGPT attributes information in responses. Injunctive relief that required OpenAI to stop using Britannica-derived knowledge in its models would be technically complex to enforce and would likely be resolved through licensing rather than actual model modification.
What other knowledge publishers might file similar lawsuits?
The Lanham Act false attribution theory is potentially applicable to any reference or knowledge publisher whose brand is associated with accuracy — academic publishers like Oxford University Press, Encyclopedia.com, academic journals like Nature and Science, professional reference databases in medicine and law, and fact-checking organizations. The theory could also apply to professional associations whose published standards and guidelines appear in AI training data.
Is there a potential class action involving multiple knowledge publishers?
Potentially. The Authors Guild class action model provides a template for aggregating claims from many rights-holders. A class action on behalf of reference and encyclopedia publishers, or more broadly of verified-accuracy content producers, is legally feasible if a court can define the class with sufficient precision. Class certification in copyright cases is procedurally demanding, but the common fact questions — did OpenAI use this category of content in training? — could support a class structure.
What does this mean for smaller publishers and bloggers?
The Britannica case and the broader litigation wave are not directly about individual bloggers or small publishers whose content may also have been scraped for training. However, the collective licensing frameworks being explored as potential settlements — similar to how ASCAP and BMI collect and distribute music royalties — could eventually create mechanisms for smaller rights-holders to receive compensation. The Britannica case accelerates the timeline for establishing those frameworks by adding institutional pressure to the litigation portfolio opposing OpenAI's current approach.
How does this affect OpenAI's valuation and funding prospects?
OpenAI recently closed a $110 billion funding round involving SoftBank, Amazon, and Nvidia. At that valuation, the litigation portfolio is a disclosed risk but not an immediate threat to operations. However, a large adverse judgment in the NYT case or the Authors Guild case — particularly one that resulted in injunctive relief rather than just damages — could create material business uncertainty that affects future financing. Investors will increasingly price litigation risk into AI company valuations.
What would a licensing settlement look like for this case?
A licensing settlement would likely involve a lump-sum payment for past use of Britannica's training content, an ongoing royalty or access fee for future model training, and possibly a content partnership that gives OpenAI the right to surface Britannica-attributed content in ChatGPT responses in exchange for revenue sharing. The false attribution claims might also require changes to how ChatGPT references or implies reliance on Britannica as a source, including disclosure standards that make clear when an output is AI-generated versus sourced from a specific licensed publication.