A new study suggests that OpenAI’s models may have “memorized” portions of copyrighted content, lending weight to ongoing allegations from authors, programmers, and rights holders. These groups have accused OpenAI of using their works—such as books, codebases, and articles—to train its AI models without permission.
OpenAI has long defended itself under the fair use doctrine, claiming that training on publicly available data falls within it. However, critics argue that U.S. copyright law includes no specific exception for training data, sparking legal battles over the practice.
Method of Identifying “Memorized” Content
The study, co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford University, proposes a novel method for identifying training data “memorized” by AI models like OpenAI’s. The method focuses on “high-surprisal” words—words that are statistically uncommon in a given context. For example, in the sentence “Jack and I sat perfectly still with the radar humming,” the word “radar” would be considered high-surprisal because it is far less likely to appear in that context than words like “engine” or “radio.”
By removing high-surprisal words from excerpts of texts such as fiction books and New York Times articles, and then asking the models to predict the masked words, the researchers found that OpenAI’s GPT-4 model showed signs of memorizing passages from popular books and articles. Specifically, GPT-4 appeared to have memorized parts of books from a dataset called BookMIA, which contains copyrighted e-books, and showed a lower rate of memorization for New York Times articles.
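The masking probe described above can be sketched in a few lines. Note that this is a simplified illustration, not the study’s actual pipeline: `predict_masked` is a hypothetical stand-in for a query to the model under test, and real experiments involve careful surprisal scoring and many excerpts per source.

```python
# Simplified sketch of the high-surprisal masking probe.
# `predict_masked` is a hypothetical stand-in for a call to the
# model being tested (e.g., an API request); it takes a masked
# excerpt and returns the model's guess for the hidden word.

def mask_word(excerpt: str, target: str, mask: str = "[MASK]") -> str:
    """Replace the first occurrence of a high-surprisal word with a mask token."""
    return excerpt.replace(target, mask, 1)

def memorization_hit(excerpt: str, target: str, predict_masked) -> bool:
    """True if the model recovers the exact masked word.

    Recovering a rare, hard-to-guess word is treated as a signal
    (not proof) that the passage may have been seen during training.
    """
    masked = mask_word(excerpt, target)
    return predict_masked(masked).strip().lower() == target.lower()

# Example using the sentence quoted in the study:
excerpt = "Jack and I sat perfectly still with the radar humming"

# Toy predictor standing in for the model under test:
def toy_predictor(masked_text: str) -> str:
    return "radar"

print(mask_word(excerpt, "radar"))
# Jack and I sat perfectly still with the [MASK] humming
print(memorization_hit(excerpt, "radar", toy_predictor))  # True
```

Aggregating `memorization_hit` over many masked excerpts from a source gives the per-source memorization rates the researchers compared across BookMIA and news articles.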
Implications for AI Transparency and Accountability
Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, emphasized the need for more transparency in AI training practices. She pointed out that for AI models to be trustworthy, they must be fully auditable and verifiable, especially regarding the data they are trained on.
While OpenAI has advocated for looser restrictions on the use of copyrighted data in model development, it has also implemented content licensing agreements and opt-out mechanisms for copyright holders. At the same time, OpenAI has lobbied governments worldwide to codify fair use rules specific to AI training.
What The Author Thinks
In my opinion, the findings of this study highlight a critical issue for AI development—transparency. While OpenAI defends its approach to training models, the possibility of unintentionally memorizing copyrighted content raises serious questions about intellectual property rights in the AI space. For AI to continue advancing, clear rules must be established to ensure that models are not just innovative but also ethically trained and transparent.
Featured image credit: macrovector via Freepik