Generative AI and Copyright Law: Who Owns the Data That Trains the Machine?

In September 2022, a game designer named Jason M. Allen submitted an AI-generated image to the Colorado State Fair fine arts competition. His entry—created using Midjourney, an AI image generation tool—won first place in the “digital arts / digitally manipulated photography” category. The prize was modest: a few hundred dollars and a ribbon. But the furor that followed was anything but modest.

Artists who had spent years developing technical skills were incensed. They were not merely upset about losing a competition to a software tool; they were confronting a fundamental question about the nature of creative work and the legal framework that had been built to protect it. Where had Midjourney learned to produce that image? From millions of images scraped from the internet—images created by human artists, many of whom had never been asked for permission and had never been compensated.

Within a year, that question had spawned dozens of lawsuits, Congressional hearings, Copyright Office inquiries, and a global policy debate. The central issue: when a machine learns from human creativity, who owns what?

Welcome, fellow IP detectives, to the frontier of copyright law—where the law’s deepest assumptions about creativity, authorship, and reproduction are being tested by machines that can generate in seconds what took humans years to learn.

How Generative AI Systems Learn: A Brief Technical Primer

To understand the copyright debate, we need to understand what AI training actually involves—and why it is legally ambiguous.

Large language models (LLMs) like GPT-4, Claude, Gemini, and LLaMA are trained on vast corpora of text: web pages, books, articles, code, scientific papers, and more. The training process involves feeding billions of text examples to a neural network, which adjusts billions of numerical parameters (weights) based on how well it predicts the next word in each text. After training on trillions of words, the model can generate new text that resembles the statistical patterns it learned.
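The mechanics are easier to see in miniature. The sketch below is a deliberately toy stand-in for the process just described: instead of a neural network adjusting billions of weights by gradient descent, it simply counts which word follows which in a tiny invented corpus, then generates text from those statistics. The corpus and names are made up for illustration; only the underlying idea—predicting the next word from learned patterns—carries over to real LLMs.

```python
# Toy stand-in for next-word-prediction training: count word-pair
# frequencies instead of learning neural-network weights. The corpus
# and function names are invented for illustration.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": record how often each word follows each other word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation seen in training."""
    return next_counts[word].most_common(1)[0][0]

# "Generation": repeatedly predict the next word from the current one.
word, output = "the", ["the"]
for _ in range(4):
    word = predict_next(word)
    output.append(word)

print(" ".join(output))  # the cat sat on the
```

The output resembles the training data statistically without being a stored copy of it—which is precisely the property that makes the legal questions below so hard.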

Image generation models like Stable Diffusion, Midjourney, and DALL-E work similarly: they are trained on hundreds of millions of image-text pairs (an image and a description of that image), learning to associate visual patterns with textual concepts. Given a text prompt, the model can generate a new image that embodies the learned associations.

From a legal standpoint, the key questions are: (1) Is the training process—ingesting copyrighted works to train the model—itself copyright infringement? (2) Is the output—the generated text or image—infringing if it resembles copyrighted works? (3) Who, if anyone, owns the copyright in AI-generated outputs?

These three questions are related but distinct, and different lawsuits have focused on different combinations of them.

The Training Data Question: Three Theories of Infringement

Copyright scholars and litigants have advanced several theories under which AI training might infringe copyright:

Theory 1: The reproduction theory. Copyright’s most fundamental right is the right to control reproduction of the work. Training an AI involves making copies of copyrighted works—downloading them, storing them in training datasets, and repeatedly presenting them to the model. Even if the final trained model does not “contain” the original works in any conventional sense, the training process itself involves reproduction. This reproduction, without authorization, is infringement.

This theory has the advantage of matching how copyright has always worked: copying without permission is infringement regardless of purpose. It has the disadvantage of potentially prohibiting all text and data mining activities, including research that everyone agrees has social value.

Theory 2: The compression/memorization theory. Even if training is lawful, it creates a model that has effectively compressed and stored copyrighted expression. When the model generates outputs, it is retrieving and recombining compressed expression from its training data. Under this view, the training creates an infringing derivative work—a vast collage that encodes copyrighted expression in transformed but recoverable form.

This theory is more technically sophisticated but harder to apply. How “recoverable” must the expression be? If a model, under special prompting, can reproduce verbatim passages from training data, does that prove the expression is “stored”? Or is it an artifact of the model’s statistical nature?

Theory 3: The market substitution theory. Even if training is transformative and outputs are original, AI-generated content that substitutes for human-created content in the marketplace harms the market for the original works. Under this theory, infringement liability attaches not because of what the model does during training, but because of the competitive harm its outputs cause to the creators whose work trained it.

This theory is hardest to connect to conventional copyright doctrine—copyright does not generally prohibit competition, even competition that relies on learning from existing works—but it is emotionally and economically compelling to the artists and authors who feel most threatened.

The Landmark Cases: A Tour of the Battlefield

By 2024, AI copyright cases had been filed in nearly every major jurisdiction. Let’s walk through the most important ones.

Getty Images v. Stability AI (filed in the UK and US, 2023) is perhaps the strongest copyright case in the AI training data landscape. Getty claims that Stability AI scraped over 12 million Getty-licensed images to train Stable Diffusion—and the evidence is striking: early versions of Stable Diffusion, when prompted to generate certain types of images, produced outputs that incorporated visible artifacts of Getty’s watermark. This “watermark bleed” is a smoking gun: it shows that the model had learned from Getty-watermarked images and had encoded some of their visual characteristics so thoroughly that it reproduced them in generated outputs.

Getty’s damages theory is correspondingly powerful. Getty licenses images for a living; AI image generation models trained on Getty’s library can produce images that compete with Getty’s licensing business. The market substitution harm is concrete and measurable.

Stability AI initially argued that Stable Diffusion was trained on images from the LAION-5B dataset—a dataset assembled by a German nonprofit from publicly accessible images—and that any copyright liability belonged to LAION, not Stability. This “we just used someone else’s dataset” defense has significant weaknesses: if the training was knowingly conducted on copyrighted material, the identity of who assembled the dataset does not necessarily insulate the model trainer from infringement liability.

The New York Times v. Microsoft and OpenAI (filed December 2023, S.D.N.Y.) introduced a crucial new category of evidence: near-verbatim reproduction. The Times’s complaint included exhibits showing that ChatGPT, when prompted in particular ways, could reproduce lengthy passages from Times articles almost verbatim—complete with specific facts, specific phrasings, and in some cases entire paragraphs that appeared unchanged from the original articles.

The defendants argued that this evidence demonstrated an anomalous bug—that the model’s normal operation does not produce near-verbatim reproductions—and that plaintiffs were using extraordinary prompting techniques to elicit unusual outputs. The Times countered that the very capacity to reproduce the works (whether or not typically exercised) demonstrated that the works had been reproduced during training and were encoded in the model in recoverable form.
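Exhibits of this kind ultimately come down to measuring literal overlap between a model transcript and the source text. A minimal sketch of such a comparison, using invented strings rather than any real article, can be built with Python’s standard difflib:

```python
# Toy sketch of quantifying "near-verbatim" overlap between a model's
# output and a source text. Both strings are invented examples; real
# exhibits compare full articles against full model transcripts.
from difflib import SequenceMatcher

source = "The quick brown fox jumps over the lazy dog near the riverbank."
generated = "Witnesses said the quick brown fox jumps over the lazy dog again."

# Compare case-insensitively; indexes are valid in the original string
# because lowercasing preserves length.
m = SequenceMatcher(None, source.lower(), generated.lower())
match = m.find_longest_match(0, len(source), 0, len(generated))

# The longest run of characters the two texts share verbatim.
shared = source[match.a : match.a + match.size].strip()
print(shared)
```

A long shared run like the one this prints is the kind of evidence a complaint would exhibit; the legal fight is over what that capacity to reproduce proves about what is encoded in the model.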

The case implicated the Times’s substantial investment in journalism—its team of reporters, researchers, fact-checkers, and editors who produce content at enormous cost—and posed the question of whether OpenAI could commercially exploit that investment without compensation. The Times argued that OpenAI had, in effect, built its product on Times journalism without paying for it.

Andersen v. Stability AI (filed N.D. Cal. 2023) was a class action brought by visual artists—including illustrators, concept artists, and other commercial artists—who argued that their distinctive artistic styles had been ingested into image generation models and could be reproduced on demand by anyone who prompted the model with their name. An artist who had spent decades developing a recognizable visual style could, through Stable Diffusion, have that style commercially exploited without permission or payment.

This case raised a copyright question that illustrates one of copyright law’s most important limitations: style is not copyrightable. Copyright protects specific creative expression—this painting, these brushstrokes, this composition—but not the general artistic approach or aesthetic sensibility. An artist’s “style” is, legally speaking, in the public domain. Anyone can paint “in the style of” an existing artist without infringing copyright.

The plaintiffs in Andersen tried to argue around this limitation by focusing on specific artworks that had been directly ingested into the training data. If those specific artworks were copied without permission, the training was infringing—even if the stylistic outputs that training enabled were not themselves infringing. The case proceeded through motions to dismiss with mixed results, with some claims surviving and others being dismissed.

The Fair Use Analysis: A Detailed Investigation

The fair use analysis for AI training has occupied lawyers and scholars since the first AI copyright cases were filed. Let’s work through the four factors carefully.

Factor 1: Purpose and character of the use. The AI developers argue that training is transformative—the purpose is not to reproduce or replace the original works, but to extract patterns that enable a new, generative capability. Courts have found transformative uses in contexts as varied as image search thumbnails (Kelly v. Arriba Soft; Perfect 10 v. Amazon.com), Google’s book indexing (Authors Guild v. Google), and the archiving of student papers for plagiarism detection (A.V. v. iParadigms).

The commercial nature of AI training cuts against fair use, since courts view non-commercial and educational uses more favorably under the first factor. AI companies training models they sell commercially are therefore harder to defend than academic researchers. However, even commercial transformative uses have been found to qualify as fair use: Google Books and Google’s image search were both commercial services.

The strongest transformative use argument rests on the Authors Guild v. Google precedent, which held that Google’s scanning of millions of complete books to build a searchable text index was fair use. AI developers argue that training is analogous: like Google’s index, the training process extracts information from works to build a new informational product, and the trained model serves users without necessarily reproducing the underlying works in full.

Factor 2: Nature of the copyrighted work. Most training data is highly creative—precisely the category of works that copyright law most strongly protects. Books, articles, images, music, and code are all creative works at the core of copyright protection. This factor generally favors plaintiffs.

Factor 3: Amount and substantiality of the portion used. AI training involves ingesting complete works. Unlike the excerpts and summaries that might constitute fair use in other contexts, training uses the entire work. This factor favors plaintiffs, though Authors Guild v. Google shows that copying entire works can still be fair use when doing so is necessary to a transformative purpose.

Some defendants have tried to reframe this factor: the “amount used” should be measured not by what was copied into the training dataset, but by how much of the copyrighted expression was actually “retained” in the model’s weights. This is a creative argument but has not yet been definitively accepted or rejected by any court.

Factor 4: Effect on the potential market. This is the pivotal factor, and both sides have compelling arguments.

For AI developers: the trained model does not replace the original works. People who use ChatGPT to write a story are not avoiding purchasing novels. People who use Midjourney to generate an image are not typically substituting for a specific existing licensed image. The market harm, if any, is indirect and speculative.

For rights holders: AI image generation directly competes with the stock photography and illustration licensing businesses that fund professional photography and illustration. AI writing tools directly compete with freelance writers, journalists, and authors. The harm is not to the specific works that were trained on, but to the market for creative work generally—and this market harm, while diffuse, is real and already observable in declining prices for stock photography and commercial illustration.

Output Copyright: When the Machine Creates

Separate from the training data question is the question of who, if anyone, owns copyright in AI-generated outputs. The US Copyright Office has taken a clear position: copyright requires human authorship, and purely AI-generated works cannot be copyrighted.

The Copyright Office’s 2023 guidance on AI-generated works held that works generated without human creative control—where a human simply provides a prompt and the AI generates the entire output—are not copyrightable. However, works that involve meaningful human creative expression in their creation may be copyrightable to the extent of that human contribution.

In practice, this means that a person who uses Midjourney to generate an image by typing a one-sentence prompt gets no copyright in the resulting image. But a person who engages in an extended creative process—developing detailed prompts, iterating through many generations, selecting and modifying outputs, integrating AI-generated elements with human-created elements—may have a stronger claim to copyright protection.

The Copyright Office partially cancelled the registration of “Zarya of the Dawn,” a comic book whose images were generated with Midjourney: the human-written text and the author’s selection and arrangement of the images remained registered, but the individual AI-generated images were excluded from protection. This decision confirmed that the Office would not grant copyright in AI-generated images, at least under the current legal framework, and signaled that the human authorship requirement is a genuine and enforceable constraint.

This creates a peculiar commercial situation: companies like OpenAI and Stability AI cannot themselves own copyright in the outputs their models generate. The models generate outputs that are potentially uncopyrightable—in the public domain the moment they’re created. This has implications for the business models of both AI companies and their customers.

The Music Industry: A Separate But Related Front

The AI copyright battle extends well beyond text and images into music. AI music generation tools like Suno and Udio can generate convincing music in any genre or style on demand. In June 2024, the Recording Industry Association of America (RIAA) filed copyright infringement suits against both Suno and Udio, alleging that the companies had trained their models on copyrighted sound recordings without authorization.

Music copyright has two layers that text and image copyright does not: (1) copyright in the musical composition (melody and lyrics), typically owned by the songwriter or music publisher; and (2) copyright in the sound recording (the specific recorded performance), typically owned by the record label. AI music models trained on recorded music potentially infringe both layers.

The music industry also has a distinct business model context. The streaming era already compressed revenue per stream to fractions of a cent; AI music generation that can substitute for licensed music in advertising, video games, and other media presents an additional existential threat. The RIAA’s cases were being pursued aggressively and with the full institutional weight of the major labels.

The Global Landscape: Different Jurisdictions, Different Answers

The AI copyright debate is global, but different jurisdictions are reaching different conclusions with potentially significant implications for where AI development concentrates.

In Japan, the Copyright Act’s “information analysis” exception (Article 30-4) provides the broadest protection for AI training worldwide. The exception allows use of copyrighted works for “information analysis” purposes without license or payment, even commercially, subject to a proviso for uses that would unreasonably prejudice the copyright owner’s interests. Japan’s Agency for Cultural Affairs has published guidance confirming that AI training generally falls within this exception. This has made Japan an attractive jurisdiction for AI training operations, though Japanese creative industries have been lobbying for reforms.

In the European Union, the Copyright in the Digital Single Market Directive (2019) created a text and data mining (TDM) exception for research organizations (Article 3) and a broader exception for commercial TDM from which rights holders can opt out (Article 4). Rights holders can “reserve” their rights against commercial TDM under Article 4, and major publishers and content platforms have begun doing so—placing machine-readable opt-out notices on their content. The AI Act’s training data transparency requirements add another layer of complexity for EU market players.

In the United Kingdom, a long-running debate about whether to create a broad TDM exception similar to Japan’s had not been resolved by early 2026. The UK government’s initial proposal for a broad commercial TDM exception faced fierce opposition from the creative industries and was repeatedly revised. UK AI companies faced significant uncertainty about the legality of training on unlicensed UK content.

In the United States, the matter was being resolved primarily through litigation rather than legislation. The Copyright Office opened a broad inquiry into AI and copyright in 2023 and released its findings in installments, declining to recommend comprehensive legislation on training data and instead calling for further study. Congressional attention was high but legislative action was slow. The litigation landscape was developing case by case, with no definitive Supreme Court ruling on point as of early 2026.

Emerging Solutions: Licensing Markets and Opt-Out Mechanisms

Beyond litigation and legislation, market participants were beginning to develop practical solutions to the AI copyright problem.

Licensing deals: Several major publishers had reached licensing agreements with AI companies. News Corp reached a deal with OpenAI in 2024, allowing OpenAI to use News Corp content for training in exchange for payment and certain protections. The Associated Press, several European newspaper publishers, and a number of academic publishers had reached similar deals. These deals provided some rights holders with compensation but left unresolved the question of whether unlicensed training was lawful.

robots.txt and AI-specific opt-out mechanisms: The existing robots.txt protocol (the Robots Exclusion Protocol, standardized as RFC 9309), which allows website operators to tell web crawlers not to index their content, was being extended to AI-specific contexts. Several companies developed AI-specific opt-out protocols (including a proposed “AI.txt” standard), though no universal standard had been adopted. Some major content platforms enabled explicit opt-out from AI training in their terms of service.
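As a concrete illustration, the sketch below shows how a robots.txt file might block AI-training crawlers by user-agent while leaving ordinary search crawlers alone, checked with Python’s standard urllib.robotparser. The GPTBot and CCBot user-agent names are real (OpenAI’s and Common Crawl’s crawlers, respectively), but the file contents and URL are invented for illustration—and note that honoring such a file is voluntary on the crawler’s part.

```python
# Hypothetical robots.txt that opts out of AI-training crawlers while
# allowing everyone else. GPTBot (OpenAI) and CCBot (Common Crawl) are
# real crawler user-agents; the site and rules here are invented.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# AI-training crawlers are blocked site-wide...
print(parser.can_fetch("GPTBot", "https://example.com/article"))    # False
print(parser.can_fetch("CCBot", "https://example.com/article"))     # False
# ...while ordinary search crawlers remain welcome.
print(parser.can_fetch("Googlebot", "https://example.com/article")) # True
```

This asymmetry—opting out of AI training without opting out of search traffic—is exactly what many publishers want, and what the older one-size-fits-all robots.txt semantics could not express.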

Watermarking and detection: Tools that embed invisible watermarks in AI-generated content (to identify it as machine-generated) and tools that detect whether specific content appears to have been trained on or reproduced by AI models were being developed. Google’s SynthID and similar technologies represented early attempts to enable provenance tracking for AI content.
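To make the watermarking idea concrete, here is a deliberately naive sketch: hiding an identifying bit pattern in the least-significant bits of pixel values. Production systems like SynthID use far more robust statistical methods that survive cropping and compression; everything below (the tag, the fake “image,” the function names) is invented for illustration only.

```python
# Naive "invisible" watermark: overwrite the least-significant bit of
# each pixel with one bit of a provenance tag. Values are invented;
# real systems embed signals that survive edits and compression.

WATERMARK = [1, 0, 1, 1, 0, 1, 0, 0]  # arbitrary 8-bit provenance tag

def embed(pixels, bits):
    """Overwrite each pixel's lowest bit with one watermark bit."""
    return [(p & ~1) | b for p, b in zip(pixels, bits)]

def extract(pixels, n):
    """Read back the lowest bit of the first n pixels."""
    return [p & 1 for p in pixels[:n]]

image = [200, 131, 54, 77, 90, 13, 240, 66]   # fake 8-pixel "image"
marked = embed(image, WATERMARK)

print(extract(marked, 8) == WATERMARK)  # the tag is recoverable
# Each pixel value changed by at most 1 -- visually imperceptible.
print(max(abs(a - b) for a, b in zip(image, marked)))
```

The trade-off sketched here is the central engineering problem: the signal must be invisible to humans yet reliably detectable by machines, even after the content is edited.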

Collective licensing: Proposals for collective licensing schemes—analogous to ASCAP/BMI for music—would allow AI companies to license content from large numbers of rights holders through a single administrative entity. The Authors Guild and similar organizations had proposed establishing collective licensing authorities for AI training data. Whether such structures would gain sufficient participation from rights holders and acceptance from AI companies remained uncertain.

The Japanese Creator’s Dilemma

Japan’s permissive copyright framework for AI training has created a particular dilemma for Japanese creators—particularly in the manga and anime industry, which has been a global cultural export powerhouse.

Japanese manga artists (manga-ka) and illustrators have developed highly distinctive visual styles that are among the most immediately recognizable in world popular culture. The styles of individual artists—Naoki Urasawa’s detailed realism, Eiichiro Oda’s exuberant character design, Yoshitaka Amano’s ethereal imagery—have been studied, imitated, and celebrated by artists worldwide.

When AI image generation models trained on manga and anime art can produce, on demand, images “in the style of” specific Japanese artists, the creators’ reaction has been viscerally negative. Artists report finding AI-generated work attributing their distinctive styles to others, and discovering that services explicitly market the ability to generate “images in the style of [specific artist’s name]” without any authorization or compensation.

Under Japan’s current copyright law, this is largely lawful: style is not protected by copyright anywhere, and the information analysis exception covers the training. Japanese creators have organized politically—through groups like the “AI to Chosakuken wo Kangaeru Kai”—to advocate for copyright reforms that would give creators more control over AI use of their work. Whether Japan will maintain its maximally permissive approach or move toward a European-style opt-out mechanism is one of the most important near-term AI policy questions in the country.

The Future Architecture: Where This Must End Up

Having investigated this case from multiple angles, the detective’s conclusion is that the current legal uncertainty is untenable and will not persist indefinitely. The likely resolution involves some combination of:

Legislative clarification: At minimum, legislatures will need to clarify whether training on copyrighted works is lawful (and if so, under what conditions) or infringement (and if so, what remedies apply). The current patchwork of litigation, jurisdiction-by-jurisdiction fair use analysis, and uncertain court outcomes is too expensive and unpredictable for both AI developers and rights holders.

Licensing infrastructure: Regardless of the legal baseline, market participants will develop licensing mechanisms that allow AI developers to obtain rights to use copyrighted works and rights holders to receive compensation. The architecture of these licenses—collective vs. individual, opt-in vs. opt-out, per-work vs. per-use—will shape the economics of AI development and the livelihoods of creators.

Technical standards: Provenance-tracking technologies, opt-out mechanisms, and watermarking standards will become part of the standard infrastructure for AI development. Compliance with these standards will become a legal and commercial expectation, enforceable through contracts, regulations, and eventually tort law.

What should not happen—but may: a world in which AI companies continue to train on unlicensed data indefinitely because litigation is slow and imperfect, rights holders receive no compensation for the use of their work in building commercially valuable AI systems, and the creative professions that produce the training data are slowly hollowed out by the AI systems that learned from them. This is the worst-case scenario for both creativity and for the sustainable long-term development of AI, which depends on continued human creation of high-quality content for future training runs.

Conclusion: The Machine Learned From Us All

The internet was built on a tacit social contract: content creators would make their work accessible online, and the technology ecosystem would generate value from that accessibility—through search engines, social platforms, and now AI. For most of the internet’s history, that contract was broadly maintained: creators got traffic, attention, and sometimes revenue; platforms got content and advertising dollars.

Generative AI challenges that contract fundamentally. When a user gets a complete, polished essay from an AI rather than visiting the website where the underlying information was published; when a user gets a generated image rather than licensing a photographer’s work; when a user gets a generated song rather than streaming a musician’s recording—the creators who made the underlying training data get nothing. And they are increasingly aware of it.

The legal battles over AI training data are not just about money. They are about what kind of creative ecosystem we want to inhabit. If training data can be taken without consent or compensation, the creators whose work trains AI systems have an incentive to stop creating or stop making their work accessible—eventually degrading the quality of future training data. If training data requires expensive individual licenses, only well-capitalized companies can build AI systems—concentrating AI development power further in the hands of a few corporations.

The right answer is somewhere in the middle—a framework that recognizes the social value of AI development, the legitimate interests of creators in their work, and the need for both innovation and creative vitality to coexist. Getting there will require law, policy, technology, and market forces all working together.

The investigation continues. The detective will keep you posted.

探偵くん reads all the terms of service—all of them.
