Where the Training Data Lawsuits Stand Now: NYT v. OpenAI, Getty v. Stability AI, and the Legal Questions Ahead

Copyright litigation over generative AI training data has accelerated sharply since 2023, concentrated primarily in the United States and the United Kingdom. Two cases stand out: The New York Times Company v. OpenAI, Inc. and Microsoft Corporation, and Getty Images, Inc. v. Stability AI, Ltd. Both will require courts to define the scope of the fair use doctrine and to set the legal boundaries of AI training. This article summarizes the current status of these cases, analyzes the core legal disputes, and compares the copyright frameworks of Japan, the EU, and the United States.

The New York Times v. OpenAI: Case Overview and Status

Case Number and Background

The New York Times filed suit against OpenAI and Microsoft on December 27, 2023, in the U.S. District Court for the Southern District of New York (Case No. 1:23-cv-11195-SHS). The complaint alleges that OpenAI used millions of NYT articles as training data for its large language models (LLMs) without authorization, in violation of copyright law.

Plaintiff’s Core Claims

The Times advances three core arguments. First, it has documented instances in which ChatGPT and GPT-4 reproduce NYT articles near-verbatim, which it says demonstrates “market substitution,” a direct displacement of demand for the original work. Second, it contends that OpenAI’s unauthorized use of NYT articles as training data constitutes direct copyright infringement, and that ChatGPT’s ability to output NYT content as a substitute gives rise to secondary infringement as well. Third, the complaint alleges that defendants circumvented the Times’s robots.txt instructions prohibiting automated crawling of its content.
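For readers unfamiliar with robots.txt, the following Python sketch shows how such crawl instructions are typically read by a compliant crawler. It uses the standard library's urllib.robotparser. The robots.txt content, the example.com URL, and the "ExampleBrowserBot" agent name are hypothetical and written for illustration only; this is not the Times's actual file. Note that robots.txt is an advisory convention: it tells crawlers what the site operator wants but does not technically block access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT the Times's actual file.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical article URL on a placeholder domain.
    url = "https://example.com/2023/12/27/some-article.html"
    for agent in ("GPTBot", "ExampleBrowserBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Under this hypothetical file, a crawler identifying itself as GPTBot is asked to stay out of the whole site, while other agents are permitted. Whether ignoring such an instruction has legal consequences is one of the questions the litigation raises.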

OpenAI and Microsoft’s Defense

The defendants’ primary defense is fair use. OpenAI argues that LLM training is an act of “extracting information and statistical patterns” from works, not of “reproducing and distributing” them. The company further contends that LLMs do not “memorize and reproduce” training data but generate new content. As for the Times’s near-verbatim reproduction demonstrations, defendants argue that these were induced through unusual prompt manipulation, effectively jailbreaking, and are not representative of ordinary use.
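To make the dispute over “near-verbatim” output more concrete, the sketch below computes one simple overlap measure: the share of a model output's five-word sequences that also appear in a source text. This is an illustrative metric only, assuming whitespace tokenization and an arbitrary n-gram length of 5. It is not the method used by either party, and no particular score corresponds to a legal threshold. The function names and sample sentences are hypothetical.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(source: str, output: str, n: int = 5) -> float:
    """Fraction of the output's word n-grams that also appear in the source."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        # Output shorter than n words: nothing to compare.
        return 0.0
    shared = out_grams & word_ngrams(source, n)
    return len(shared) / len(out_grams)


if __name__ == "__main__":
    # Hypothetical sample texts, not taken from any real article.
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    verbatim = "The quick brown fox jumps over the lazy dog"
    paraphrase = "a fast brown fox leaps over a sleepy dog"
    print(overlap_ratio(source, verbatim))
    print(overlap_ratio(source, paraphrase))
```

A ratio near 1 indicates that the output largely repeats the source word for word, while a ratio near 0 indicates little literal overlap. A measure of this kind says nothing about how the output was prompted, which is the point the defendants press.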

Litigation Progress: 2025 to 2026

By early 2025, the case had entered the discovery phase, with disclosure of internal documents relating to OpenAI’s training data composition becoming a contested issue. The court ordered OpenAI to produce records concerning how its training datasets were assembled, creating tension between OpenAI’s interest in protecting technical trade secrets and its discovery obligations. As of April 2026, the case has not yet proceeded to trial on the merits, and parallel settlement negotiations are understood to be underway.

Getty Images v. Stability AI

Case Overview

Getty Images filed suit against Stability AI (developer of the Stable Diffusion image generation model) in the United Kingdom in January 2023 and in the U.S. District Court for the District of Delaware (Case No. 1:23-cv-00135-UNA) in February 2023. The U.S. complaint alleges that Stability AI copied more than 12 million images from Getty’s collection, together with their captions and metadata, for use as training data without authorization.

Distinctive Legal Issues

The Getty case presents several legal issues distinct from the NYT litigation. First, Stable Diffusion outputs have been documented to contain distorted versions of Getty’s watermarks. This is analogous to the NYT verbatim reproduction problem, but it raises distinctive issues around visual elements. Second, Getty’s claims extend beyond copyright to trademark infringement (based on the altered watermarks) and violation of Section 1202 of the Digital Millennium Copyright Act (DMCA), which prohibits the removal or alteration of copyright management information. This multi-theory pleading broadens the defendant’s litigation exposure.

In the UK proceedings (High Court of Justice, Chancery Division), the case advanced in 2025, with the composition of the LAION-5B dataset used for training, and the proportion of it that consisted of Getty material, becoming a central factual dispute.

Class Actions and Organized Rights-Holder Responses

Beyond the NYT and Getty cases, class action litigation by rights holders is also advancing. The Authors Guild, together with a group of novelists and nonfiction authors including John Grisham, David Baldacci, and George R.R. Martin, has filed a class action against OpenAI (Case No. 1:23-cv-08292-SHS, S.D.N.Y.).

In the code domain, Doe v. GitHub, Inc. (Case No. 4:22-cv-06823-JST, N.D. Cal.) alleges that GitHub Copilot’s training on open-source code without attribution violates the terms of open-source licenses. The case raises questions specific to copyright in code.

The Fair Use Analysis

Section 107 of the U.S. Copyright Act provides the fair use defense, requiring courts to weigh four factors: (1) the purpose and character of the use, including whether it is commercial and whether it is transformative; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used; and (4) the effect of the use on the potential market for the original work.

The Transformative Use Question

The central dispute in AI training fair use cases is whether such training constitutes “transformative use.” Transformative use requires adding new meaning, expression, or message to the original work; mere reproduction does not qualify (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994)).

OpenAI argues that LLM training “processes works as inputs,” extracting “statistical patterns” rather than “expressions,” and therefore constitutes transformative use. The most directly relevant precedent supporting this argument is Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), which held that Google’s full-text scanning of books for search was fair use, finding transformativeness based on the generation of “new information, new aesthetics, new insights and understandings.”

The Times counters that in Google Books, only brief “snippets” of the original were displayed, whereas ChatGPT can generate text substantially identical to NYT articles. This distinction also bears on the fourth factor, market effect: the argument that readers can access NYT content through ChatGPT and thereby substitute it for paid subscriptions carries intuitive force.

Market Effect: The Weight of the Fourth Factor

The Supreme Court identified the fourth factor, effect on the potential market, as “the most important” in Harper & Row, Publishers, Inc. v. Nation Enterprises, 471 U.S. 539 (1985). In the AI training context, plaintiffs assert harm to the licensing market: had OpenAI sought a license to use NYT articles for training, the Times would have earned license revenue. This theory of licensing market harm is expected to play a central evidentiary role.

Comparative Copyright Law: Japan, EU, and the United States

Major jurisdictions have taken materially different approaches to AI training under copyright law, with direct implications for how global AI companies assess their legal exposure.

Japan: Explicit Permission Under Article 30-4

Japan introduced Article 30-4 of the Copyright Act through the 2018 amendment (effective January 2019). The provision permits the use of copyrighted works without authorization for “uses not intended to enjoy the thoughts or emotions expressed in the work,” subject to the caveat that such use may not “unreasonably prejudice the interests of the copyright holder.” According to the Agency for Cultural Affairs’ interpretive guidance, machine learning training generally falls within the scope of Article 30-4.

However, this permission is not unconditional. Collecting and storing training data is generally permitted, subject to the “unreasonable prejudice” proviso, but separate copyright problems may arise if a trained model directly reproduces the expression of a copyrighted work. Under Japanese law, a case involving the kind of verbatim reproduction at issue in the NYT litigation could still raise reproduction right claims, even under the Article 30-4 framework.

European Union: Text and Data Mining Provisions with Opt-Out

Article 4 of Directive (EU) 2019/790 establishes a general copyright limitation permitting reproduction for text and data mining (TDM) purposes, but grants copyright holders the right to “opt out” by expressing their reservation in machine-readable form. Where a rights holder has posted a machine-readable opt-out notice, the Article 4 exception does not cover TDM of that content. The EU AI Act (in force since August 2024) confirms this framework and imposes transparency obligations on AI developers regarding their use of the TDM exception.

United States: Entirely Dependent on Judicial Fair Use Analysis

U.S. copyright law has no provision explicitly addressing AI training, leaving the question entirely to judicial application of the fair use doctrine. Legal certainty is accordingly lower than in Japan or the EU. The forthcoming decisions in the NYT and Getty cases are widely expected to establish precedent that will shape the legality of AI training in the United States, and the entire industry is watching closely.

Settlement Prospects and Industry Impact

Settlement remains the most probable resolution. If the Times were to prevail at trial and substantial damages were awarded, the financial impact on OpenAI and Microsoft could be severe. A theoretical damages calculation based on statutory damages per work, multiplied by the number of articles used, could produce claims in the billions of dollars.
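The scale of such a calculation can be sketched with simple arithmetic. Under 17 U.S.C. § 504(c), statutory damages range from $750 to $30,000 per infringed work, rising to as much as $150,000 per work for willful infringement. The Python sketch below multiplies those statutory figures by a work count. The count of one million works is a purely hypothetical assumption chosen for illustration; the number of works actually eligible for statutory damages, and the amount a court would consider just, are open questions.

```python
# Per-work statutory damages under 17 U.S.C. § 504(c), in U.S. dollars.
STATUTORY_MIN = 750        # minimum per work
STATUTORY_MAX = 30_000     # ordinary maximum per work
WILLFUL_MAX = 150_000      # maximum per work for willful infringement


def statutory_range(num_works: int) -> dict[str, int]:
    """Return the theoretical statutory damages range for num_works works."""
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    return {
        "minimum": num_works * STATUTORY_MIN,
        "ordinary_maximum": num_works * STATUTORY_MAX,
        "willful_maximum": num_works * WILLFUL_MAX,
    }


if __name__ == "__main__":
    # Hypothetical work count, assumed for illustration only.
    hypothetical_works = 1_000_000
    for label, amount in statutory_range(hypothetical_works).items():
        print(f"{label}: ${amount:,}")
```

With one million works, even the statutory minimum comes to $750 million, and the ordinary maximum reaches $30 billion, which is why per-work multiplication produces the headline figures discussed in connection with this case.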

A settlement might take the form of OpenAI paying licensing fees to the Times or establishing a content distribution partnership. Licensing arrangements between major publishers and AI platforms are already becoming normalized: the Associated Press entered a content licensing agreement with OpenAI in July 2023, establishing a precedent for negotiated commercial relationships.

Regardless of specific outcomes, the training data litigation wave is accelerating a broader shift toward normalized licensing negotiations between content rights holders and AI companies. How the industry resolves the rights-clearance problem for training data will directly affect the development cost of next-generation AI models, and it is already reshaping business models across the AI sector.


This is Part 3 of the “AI Patent War of 2026” series. Part 4 analyzes how Meta’s Llama strategy has rewritten the competitive rules—and what open-source AI means as an IP strategy.

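About this article

パテント探偵社 Editorial Team

An independent media outlet that reports on and analyzes developments in the world of intellectual property using the methods of journalism. We state patent numbers, legal bases, and party names accurately while delivering articles that are easy to read for non-specialists. The content published here is not legal advice.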
