OpenAI has recently come under scrutiny from several groups for allegedly training its AI systems on copyrighted material without authorization. According to a new report from the AI Disclosures Project, a nonprofit founded in 2024 by media executive Tim O’Reilly and economist Ilan Strauss, OpenAI increasingly relied on non-public, unlicensed books to train its more capable AI models.
Artificial intelligence models are essentially advanced prediction tools. They are trained on vast amounts of data, including books, films, and television series, and from this exposure they learn patterns that let them produce relevant output in response to user prompts. The report finds a clear difference between OpenAI's older and newer models. GPT-3.5 Turbo mostly recognizes publicly available samples from O’Reilly publications. GPT-4o, a newer model, shows notable familiarity with O’Reilly content that was restricted behind a paywall.
To arrive at these findings, the researchers used DE-COP, a type of membership inference attack that tests whether a model can reliably distinguish original, human-written text from paraphrased, AI-generated versions of the same text. A model that does so reliably is taken as evidence that it may have encountered the original during training. The researchers tested GPT-4o, GPT-3.5 Turbo, and additional models on a dataset of 13,962 paragraphs drawn from 34 O’Reilly titles. The results indicated that GPT-4o recognized substantially more of the restricted-access O’Reilly books published before its training cutoff, even after accounting for the newer model's improved ability to tell original, human-written text from AI-generated text.
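The idea behind this kind of test can be sketched as a multiple-choice quiz: the verbatim passage is shuffled in among paraphrases, and the rate at which the original is picked is compared with the rate expected from random guessing. The Python below is a minimal illustration under stated assumptions, not the report's implementation. The function names, the toy passages, and the two stand-in "pickers" are all hypothetical; a real test would query the model under study, and the report's actual prompts and scoring statistic may differ.

```python
import random


def build_quiz(original, paraphrases, rng):
    """Shuffle the verbatim passage among its paraphrases.

    Returns the shuffled options and the index of the original.
    """
    options = [original] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(quizzes, pick):
    """Fraction of quizzes where pick(options) returns the original's index."""
    correct = sum(1 for options, answer in quizzes if pick(options) == answer)
    return correct / len(quizzes)


def chance_rate(num_options):
    """Expected guess rate when picking uniformly at random."""
    return 1.0 / num_options


# Toy data (hypothetical): 20 passages, each with 3 paraphrases.
rng = random.Random(0)
passages = {
    f"original passage {i}": [f"paraphrase {i}-{j}" for j in range(3)]
    for i in range(20)
}
quizzes = [build_quiz(orig, paras, rng) for orig, paras in passages.items()]

# Stand-in for a model that has seen every original during training.
seen_in_training = set(passages)


def memorizing_pick(options):
    for index, text in enumerate(options):
        if text in seen_in_training:
            return index
    return 0


# Stand-in for a model with no knowledge of the passages.
def naive_pick(options):
    return 0


if __name__ == "__main__":
    print("memorizing:", guess_rate(quizzes, memorizing_pick))
    print("naive:     ", guess_rate(quizzes, naive_pick))
    print("chance:    ", chance_rate(4))
```

A guess rate well above chance on passages published before the training cutoff is the kind of signal such a test looks for. A rate near chance is consistent with the model not having seen the text, though it does not prove it.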
The report's authors acknowledge limitations in their method and note that other explanations are possible. One is that OpenAI may have inadvertently acquired the paywalled content when users pasted excerpts into the ChatGPT interface. The researchers also noted that their analysis did not cover OpenAI's most recent models, such as GPT-4.5 and its specialized reasoning models. OpenAI, which has advocated for looser restrictions on the use of copyrighted material and has signed licensing agreements with several content providers, currently faces legal challenges over its training practices.
OpenAI has not yet responded to requests for comment.