Where the Training Data Lawsuits Stand Now: NYT v. OpenAI, Getty v. Stability AI, and the Legal Questions Ahead

Copyright litigation over generative AI training data has accelerated sharply since 2023, concentrated primarily in the United States and United Kingdom. Among the most significant cases, The New York Times Company v. OpenAI, Inc. and Microsoft Corporation and Getty Images, Inc. v. Stability AI, Ltd. stand as landmark proceedings that will require courts to define the scope of fair use doctrine and establish the legal boundaries of AI training. This article summarizes the current status of these major cases, analyzes the core legal disputes, and compares the copyright frameworks of Japan, the EU, and the United States.

The New York Times v. OpenAI: Case Overview and Status
Getty Images v. Stability AI
1. Case Overview
2. Distinctive Legal Issues
Class Actions and Organized Rights-Holder Responses
The Fair Use Analysis
1. The Transformative Use Question
2. Market Effect: The Weight of the Fourth Factor
Comparative Copyright Law: Japan, EU, and the United States
Settlement Prospects and Industry Impact

The New York Times v. OpenAI: Case Overview and Status

Case Number and Background

The New York Times filed suit against OpenAI and Microsoft on December 27, 2023, in the U.S. District Court for the Southern District of New York (Case No. 1:23-cv-11195-SHS). The complaint alleges that OpenAI used millions of NYT articles to train its large language models (LLMs) without authorization, in violation of U.S. copyright law.

Plaintiff’s Core Claims

The Times advances three core arguments. First, it has documented instances where ChatGPT and GPT-4 reproduce NYT articles near-verbatim, which the plaintiff says demonstrates “market substitution,” that is, direct displacement of demand for the original work. Second, it alleges that OpenAI’s unauthorized use of NYT articles as training data constitutes direct copyright infringement, and that ChatGPT’s ability to output NYT article content as a substitute gives rise to secondary infringement as well. Third, the complaint alleges that defendants circumvented the Times’s robots.txt instructions that prohibited automated crawling of its content.
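For readers unfamiliar with robots.txt, the following is a minimal sketch of how a well-behaved crawler might check such instructions before fetching a page, using Python's standard urllib.robotparser module. The file contents, the bot name ExampleAIBot, and the URL are hypothetical and are not taken from the Times's actual robots.txt or from the complaint.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL on a placeholder domain.
url = "https://news.example.com/2023/12/27/some-article.html"

print(may_fetch(ROBOTS_TXT, "ExampleAIBot", url))  # False: this bot is disallowed site-wide
print(may_fetch(ROBOTS_TXT, "OtherBot", url))      # True: falls through to the wildcard rule
```

The check is advisory: nothing in the protocol technically prevents a crawler from skipping it, which is why compliance with these instructions can become a disputed question of fact.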

OpenAI and Microsoft’s Defense

The defendants’ primary defense is fair use. OpenAI argues that LLM training is an act of “extracting information and statistical patterns” from works, not of “reproducing and distributing” them. The company further contends that LLMs do not “memorize and reproduce” training data but generate new content. As for the Times’s demonstrations of near-verbatim reproduction, defendants argue that these were induced through unusual prompt manipulation, in effect jailbreaking, and are not representative of ordinary use.

Litigation Progress: 2025 to 2026

By early 2025, the case had entered the discovery phase, with disclosure of internal documents relating to OpenAI’s training data composition becoming a contested issue. The court ordered OpenAI to produce records concerning how its training datasets were assembled, creating tension between OpenAI’s interest in protecting technical trade secrets and its discovery obligations. As of April 2026, the case has not yet proceeded to a trial on the merits, and parallel settlement negotiations are understood to be underway.

Getty Images v. Stability AI

Case Overview

Getty Images filed suit against Stability AI (developer of the Stable Diffusion image generation model) in the United Kingdom in January 2023 and in the U.S. District Court for the District of Delaware (Case No. 1:23-cv-00135-UNA) in February 2023. The U.S. complaint alleges that Stability AI used more than 12 million images from Getty’s collection as training data without authorization.

Distinctive Legal Issues

The Getty case presents several legal issues distinct from the NYT litigation. First, Stable Diffusion outputs have been documented to contain distorted versions of Getty’s watermarks. This is analogous to the verbatim reproduction problem in the NYT case, but it arises in visual rather than textual form. Second, Getty’s claims extend beyond copyright to trademark infringement (watermark alteration) and violation of Section 1202 of the Digital Millennium Copyright Act (DMCA), which prohibits the removal or alteration of copyright management information. This multi-theory pleading broadens the defendant’s litigation exposure.

In the UK proceedings (High Court of Justice, Chancery Division), disclosure advanced in 2025, and a central factual dispute was the composition of the LAION-5B dataset used for training, including what proportion of it consisted of Getty material.

Class Actions and Organized Rights-Holder Responses

Beyond the NYT and Getty cases, class action litigation by rights holders is also advancing. The Authors Guild, together with a group of authors including John Grisham, David Baldacci, and George R.R. Martin, has filed a class action against OpenAI (Case No. 1:23-cv-08292-SHS, S.D.N.Y.).

In the code domain, Doe v. GitHub, Inc. (Case No. 4:22-cv-06823-JST, N.D. Cal.) alleges that GitHub Copilot was trained on open-source code and reproduces it without attribution, in violation of the terms of open-source licenses. The case raises questions specific to copyright in code.

The Fair Use Analysis

Section 107 of the U.S. Copyright Act provides the fair use defense, requiring courts to weigh four factors: (1) the purpose and character of the use, including whether it is commercial and whether it is transformative; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used; and (4) the effect of the use on the potential market for the original work.

The Transformative Use Question

The central dispute in AI training fair use cases is whether such training constitutes “transformative use.” Transformative use requires adding new meaning, expression, or message to the original work—mere reproduction does not qualify (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 1994).

OpenAI argues that LLM training “processes works as inputs,” extracting “statistical patterns” rather than “expressions,” and therefore constitutes transformative use. The most directly relevant precedent supporting this argument is Authors Guild v. Google, Inc. (804 F.3d 202, 2d Cir. 2015), which held that Google’s full-text scanning of books for search was fair use, finding transformativeness based on the generation of “new information, new aesthetics, new insights and understandings.”

The Times counters that in Google Books, only brief “snippets” of the original were displayed, whereas ChatGPT can generate text substantially identical to NYT articles. On the fourth factor—market effect—the argument that readers can access NYT content through ChatGPT and thereby substitute it for paid subscriptions carries intuitive force.

Market Effect: The Weight of the Fourth Factor

The Supreme Court described the fourth factor, the effect on the potential market, as “undoubtedly the single most important element of fair use” in Harper & Row, Publishers, Inc. v. Nation Enterprises (471 U.S. 539, 1985). In the AI training context, plaintiffs assert harm to the licensing market: had OpenAI sought a license to train on NYT articles, the Times would have earned licensing revenue, so unlicensed use deprives it of that income. This theory of licensing market harm is expected to play a central evidentiary role.

Comparative Copyright Law: Japan, EU, and the United States

Major jurisdictions have taken materially different approaches to AI training under copyright law, with direct implications for how global AI companies assess their legal exposure.

Japan — Explicit Permission Under Article 30-4

Japan introduced Article 30-4 of the Copyright Act through the 2018 amendment (effective January 2019). The provision permits the use of copyrighted works without authorization for “uses not intended to enjoy the thoughts or emotions expressed in the work,” subject to the caveat that such use may not “unreasonably prejudice the interests of the copyright holder.” According to the Agency for Cultural Affairs’ interpretive guidance, machine learning training generally falls within the scope of Article 30-4.

However, Article 30-4 covers only uses that are not aimed at enjoying the expression itself and that do not cause “unreasonable prejudice.” Collecting and storing training data is therefore generally permitted, but separate copyright problems may arise if a trained model directly reproduces the expression of a copyrighted work. Under Japanese law, a case involving the kind of verbatim reproduction at issue in the NYT litigation could still raise reproduction right claims, even under the Article 30-4 framework.

European Union — Text and Data Mining Provisions with Opt-Out

Directive (EU) 2019/790 (Article 4) establishes a general copyright exception permitting reproduction for text and data mining (TDM), but allows rights holders to “opt out” by expressly reserving their rights, which for online content means doing so in machine-readable form. Where a rights holder has made such a reservation, the exception does not apply and TDM use requires authorization. The EU AI Act (in force since August 2024) builds on this framework: it requires providers of general-purpose AI models to adopt a policy for honoring such opt-outs and to publish a summary of the content used for training.

United States — Entirely Dependent on Judicial Fair Use Analysis

U.S. copyright law has no provision explicitly addressing AI training, leaving the question entirely to judicial application of fair use doctrine. Legal certainty is accordingly lower than in Japan or the EU. The forthcoming decisions in the NYT and Getty cases are widely expected to establish precedent that will shape the legality of AI training in the United States, and the entire industry is watching closely.

Settlement Prospects and Industry Impact

Settlement remains the most probable resolution. If the Times were to prevail at trial and substantial damages were awarded, the financial impact on OpenAI and Microsoft could be severe. A theoretical calculation that multiplies statutory damages per work by the number of articles used could produce claims in the billions of dollars.
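The per-article multiplication described above can be made concrete with a short sketch. It assumes the statutory damages ranges of 17 U.S.C. § 504(c) ($750 to $30,000 per work, and up to $150,000 per work for willful infringement) as the per-work figure. The count of one million works is purely hypothetical; how many separately eligible works would actually be in play is an open question that this sketch does not address.

```python
# Statutory damages per infringed work under 17 U.S.C. § 504(c), in US dollars.
STATUTORY_MIN = 750
STATUTORY_MAX = 30_000
WILLFUL_MAX = 150_000


def damages_range(num_works: int) -> dict[str, int]:
    """Multiply each per-work statutory figure by a count of infringed works."""
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    return {
        "minimum": num_works * STATUTORY_MIN,
        "ordinary_maximum": num_works * STATUTORY_MAX,
        "willful_maximum": num_works * WILLFUL_MAX,
    }


# Hypothetical count of works, chosen only to illustrate the order of magnitude.
estimate = damages_range(1_000_000)
for label, amount in estimate.items():
    print(f"{label}: ${amount:,}")
# minimum: $750,000,000
# ordinary_maximum: $30,000,000,000
# willful_maximum: $150,000,000,000
```

Even at the statutory minimum, a work count in the millions puts the theoretical figure near or above one billion dollars, which is why the size of the eligible corpus matters so much to settlement leverage.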

A settlement might take the form of OpenAI paying licensing fees to the Times or establishing a content distribution partnership. Licensing arrangements between major publishers and AI platforms are already becoming normalized: the Associated Press entered a content licensing agreement with OpenAI in July 2023, establishing a precedent for negotiated commercial relationships.

Regardless of specific outcomes, the training data litigation wave is accelerating a broader shift toward normalized licensing negotiations between content rights holders and AI companies. How the industry resolves the rights-clearance problem for training data will directly affect the development cost of next-generation AI models—and is already reshaping business models across the AI sector.

This is Part 3 of the “AI Patent War of 2026” series. Part 4 analyzes how Meta’s Llama strategy has rewritten the competitive rules—and what open-source AI means as an IP strategy.

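About This Article

パテント探偵社 Editorial Team

An independent outlet that reports on and analyzes developments in the world of intellectual property using the methods of journalism. We aim to deliver articles that are readable for non-specialists while stating patent numbers, legal grounds, and party names accurately. The content published here is not legal advice.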
