2023-07-10

Comedian Sarah Silverman says that the human minds behind artificial intelligence writing programs essentially stole from her, and she wants them to pay up.

In a class action complaint filed in federal court Friday, Silverman accuses tech company OpenAI of using her book “The Bedwetter” to train its ChatGPT software — and, in doing so, violating her copyright. Author Christopher Golden and writer Richard Kadrey joined Silverman in the lawsuit.

According to the complaint, ChatGPT accessed databases of thousands of books in order to “train” its programs (called “large language models,” or LLMs) “by copying massive amounts of text and extracting expressive information from it.” This training, the lawsuit explains, is the key to allowing ChatGPT to “emit convincingly naturalistic text outputs in response to user prompts.”

The problem, however, is that the “training” material — including, allegedly, Silverman’s book — is under copyright, and may have been pulled from databases of copyrighted works without permission.

“Plaintiffs and Class members did not consent to the use of their copyrighted books as training material for ChatGPT,” the lawsuit says. “Nonetheless, their copyrighted materials were ingested and used to train ChatGPT.”

According to the complaint, it is believed that “the reason ChatGPT can accurately summarize a certain copyrighted book is because that book was copied by OpenAI and ingested by the underlying” LLM. Silverman’s lawyers concluded that “The Bedwetter” must have been part of the dataset because when ChatGPT was asked to summarize Silverman’s book, the program did exactly that.

“Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works,” the complaint says.

The lawsuit notes that “[t]he summaries get some details wrong,” which is expected due to the nature of LLMs. “Still, the rest of the summaries are accurate, which means that ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content.”

The lawsuit implies that the plaintiffs’ books were included in online book databases without permission and that ChatGPT drew its LLM training from those databases. One database called Project Gutenberg, described as an “online archive of e-books whose copyright has expired,” allegedly boasted about having “over 60,000 titles” as of September 2020, the complaint says, noting that ChatGPT had previously acknowledged that one of the datasets was based on a collection of around 63,000 titles. The lawsuit suggests that a second dataset used by ChatGPT is based on so-called “shadow library” websites that are “flagrantly illegal” for their unauthorized sharing of copyrighted material, and comprises almost 300,000 titles.

Silverman and her co-plaintiffs allege a host of copyright violations as well as unjust enrichment and negligence. In a separate complaint, also filed Friday, Silverman, Golden, and Kadrey make similar allegations against Meta — parent company of Facebook and Instagram — over what they say were similar actions taken with its LLaMA AI writing software.

The emergence of AI writing programs has sparked something of a panic, both in the tech field and in other industries, and their use in the courtroom has gotten some lawyers in serious trouble.


