Breaking Open the AI Piggybank

Posted by Stanton Braden | Aug 03, 2023

ChatGPT is all the rage these days. It is one of several artificial intelligence (AI) chatbots capable of generating human-like text in response to a prompt. These tools, often characterized as “Generative AI,” can write stories, answer questions, and produce software code, among other things. They are built on software programs referred to as large language models.

Large language models use massive amounts of data, especially language-content data, to derive patterns and connections between and among words. The data used, referred to as a training dataset, drives the production of the response to the chatbot prompt.

Using the data in an AI dataset involves copying the data, and therein lies the problem. Producing valid data for useful datasets may involve substantial expense. As a result, many large language models are “trained” using data available on the Internet. Much of this data is likely protected by copyright under, e.g., the Copyright Act, 17 U.S.C. §501, the Digital Millennium Copyright Act (DMCA), 17 U.S.C. §1202, and other copyright laws in various international forums.

In the United States, a copyright protects original works of authorship, such as the expression of an idea, an artistic work, a movie, a song, a computer program, etc. Note that while a copyright can protect the creative expression of an idea (e.g., the narrative of how a story is told), the copyright does not protect the idea itself.

The copyright owner also holds the right to prepare works based on the copyrighted work, known as derivative works. Derivative works are creative expressions that contain major copyrightable elements of the work from which they are derived. A copyright is created or “vests” as soon as the work becomes fixed in a tangible medium (written down, recorded, etc.), and statutory damages are available that can range from “not less than $750 or more than $30,000...” 17 U.S.C. §504. In addition, in the case of willful infringement, where an infringer knew of the infringement or recklessly disregarded the possibility of it, the award can increase to a maximum of $150,000.
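To make the arithmetic concrete, the figures quoted above can be sketched as a small calculation. This is a minimal illustration, not a damages model: the function name `statutory_range` is hypothetical, and it assumes the range applies once per infringed work and that a court would select a figure somewhere within it.

```python
from typing import Tuple

# Dollar figures quoted in the text from 17 U.S.C. §504.
STATUTORY_MIN = 750
STATUTORY_MAX = 30_000
WILLFUL_MAX = 150_000


def statutory_range(num_works: int, willful: bool = False) -> Tuple[int, int]:
    """Return the (lowest, highest) total statutory award for num_works works.

    Assumption: the range is applied once per infringed work. Any
    reduction below the quoted minimum is not modeled here.
    """
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    ceiling = WILLFUL_MAX if willful else STATUTORY_MAX
    return (num_works * STATUTORY_MIN, num_works * ceiling)
```

Under these assumptions, `statutory_range(3)` gives a range of $2,250 to $90,000 for three works, and `statutory_range(3, willful=True)` raises the ceiling to $450,000.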

Notably, the timing of copyright registration can be critical, as statutory damages are not available for:

Any infringement of copyright in an unpublished work commenced before the effective date of its registration; or
Any infringement of copyright commenced after the first publication of the work and before the effective date of its registration, unless such registration is made within three months after the first publication of the work. (17 U.S.C. §412).
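The two timing rules above can be sketched as a simple date check. This is a rough illustration under stated assumptions: the function names are hypothetical, "three months" is counted as calendar months from the publication date, and infringement that starts on the publication date itself is treated as infringement of a published work. How a court would count these dates in a particular case is not something this sketch settles.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date forward by calendar months, clamping to the month's end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def statutory_damages_available(
    infringement_start: date,
    registration: date,
    first_publication: Optional[date] = None,
) -> bool:
    """Apply the two timing rules listed above (simplified)."""
    # Infringement that begins on or after registration is not barred.
    if infringement_start >= registration:
        return True
    # Rule 1: unpublished work infringed before registration.
    if first_publication is None or infringement_start < first_publication:
        return False
    # Rule 2: published work infringed before registration, saved only
    # if registration came within three months of first publication.
    return registration <= add_months(first_publication, 3)
```

For example, a work first published on January 10, infringed on February 1, and registered on March 15 of the same year would pass this check, while the same work registered on June 1 would not.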

Monetary damages may still be recovered for copyright infringement when statutory damages are unavailable, but actual damages must then be proven. Proving actual damages may be more costly and problematic.

Comedian Sarah Silverman and writer Christopher Golden recently launched a class action lawsuit against OpenAI, Inc., the company behind ChatGPT, and others for copyright infringement and violations of unfair competition laws. Silverman v. OpenAI, Inc., 3:23-cv-03416 (N.D. Cal.).

The plaintiffs alleged that several of their copyright-registered books were used in training datasets by OpenAI. A published paper presenting GPT-1 (an OpenAI large language model) disclosed that training had occurred on BookCorpus, a collection of over “7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance.” Silverman, 3:23-cv-03416 at 7.

As alleged in the complaint:

BookCorpus, however, is a controversial dataset. It was assembled in 2015 by a team of AI researchers for training language models. They copied the books from a website called Smashwords, which hosts self-published novels, which are available to readers at no cost. Those novels, however, are mainly under copyright but were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.

The complaint further alleges:

OpenAI also copied many books while training GPT-3. In the July 2020 paper introducing GPT-3 (called “Language Models are Few-Shot Learners”), OpenAI disclosed that 15% of the enormous GPT-3 training dataset came from “two internet-based books corpora” that OpenAI simply called “Books1” and “Books2.” Id.

The plaintiffs have the benefit of published evidence detailing the bases of the alleged use of their copyrighted materials. Where no such evidence exists, it may still be possible to reverse engineer training dataset information to help reveal sources. A paper published in 2021 by Nicholas Carlini et al. supports this position by stating, “It has become common to publish large (billion parameters) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.” See https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting

A closer look at chatbot responses may also reveal content that might be regarded as a derivative work of copyrighted material in the training dataset. Although the Silverman lawsuit makes no such direct allegation of infringement through derivative works, searching chatbot responses for derivative works and plagiarized passages may be worthwhile because the responses themselves cannot be ruled out as a source of copyright infringement.
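One simple way to start such a search is to look for runs of words that a chatbot response shares verbatim with a known text. The sketch below is only a screening aid under stated assumptions: the function names and the default run length of eight words are hypothetical choices, it catches word-for-word overlap only, and it is not the extraction attack described in the Carlini paper. A match flags a passage for human review and does not by itself establish infringement or a derivative work.

```python
import re
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-word runs in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(response: str, reference: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the n-word runs that appear in both the response and the reference."""
    return word_ngrams(response, n) & word_ngrams(reference, n)


def overlap_ratio(response: str, reference: str, n: int = 8) -> float:
    """Fraction of the response's n-word runs that also appear in the reference."""
    response_runs = word_ngrams(response, n)
    if not response_runs:
        return 0.0
    return len(response_runs & word_ngrams(reference, n)) / len(response_runs)
```

A higher ratio suggests more verbatim reuse. Paraphrased or restructured passages would not be caught by this approach and would need a closer reading.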

Generative AI has ushered in an exciting era and may provide a new source of copyright-related revenue for copyright owners.

Contact us today so that we may help you identify and register your copyrighted material to pave the way toward breaking open the piggybank that may lie behind AI chatbots!

*The information in this article is not legal advice and should not be relied on. The content of this article is for informational purposes only and is meant as a starting point in your search for answers to your legal questions.
