Leading AI Models Tested for Copyright Infringement with Surprising Results
In recent findings that have stirred both concern and debate within the tech and publishing industries, OpenAI’s GPT-4 has been identified as producing the highest ratio of copyrighted content when responding to prompts related to popular books. This revelation comes from a study conducted by Patronus AI, shedding light on the pressing issue of copyright infringement by state-of-the-art artificial intelligence models.
Founded by former Meta researchers, Patronus AI specializes in the testing and evaluation of large language models—the advanced technology underpinning generative AI products. The unveiling of their new tool, CopyrightCatcher, was accompanied by the release of results from an adversarial test designed to measure how often four leading AI models resort to using copyrighted text in response to user prompts.
The Offending Models
The study evaluated the models known as OpenAI’s GPT-4, Anthropic’s Claude 2, Meta’s Llama 2, and Mistral AI’s Mixtral, uncovering widespread use of copyrighted content across all models tested. “We pretty much found copyrighted content across the board, whether it’s open source or closed source,” stated Rebecca Qian, Patronus AI’s cofounder and CTO, in a discussion with CNBC. She further expressed her surprise at the extent to which OpenAI’s GPT-4, often celebrated for its power and versatility, reproduced copyrighted content in 44% of the constructed prompts.
The analysis was conducted using copyrighted books from the U.S., with researchers selecting popular titles from the Goodreads cataloging website. They devised 100 different prompts, asking the AI models to generate content such as the first passage of “Gone Girl” by Gillian Flynn or to continue text from specific points in notable works like “Becoming” by Michelle Obama.
Comparative Performance of AI Models
OpenAI’s GPT-4 notably underperformed compared to its counterparts, often completing text from books directly 60% of the time and returning the first passage of books in about 25% of inquiries. Anthropic’s Claude 2 proved to be more resilient, using copyrighted content only 16% of the time for book completions and never for first passages. In fact, Claude often refused to engage with the prompts by acknowledging its inability to access copyrighted books.
Mistral’s Mixtral model and Meta’s Llama 2 demonstrated varied results, with Mixtral completing a book’s first passage 38% of the time and Llama 2 responding with copyrighted content in 10% of prompts. A significant takeaway is that no model was entirely immune to reproducing copyrighted material, highlighting a universal challenge within the industry.
Broad Implications and Industry Response
The findings emerge amid increasing tension between AI developers and the creative community, including authors, artists, and publishers, over the use of copyrighted material for training data. This debate has been further fueled by a high-profile lawsuit between The New York Times and OpenAI, which could serve as a pivotal moment for copyright law’s intersection with emerging technologies. OpenAI previously articulated the challenge of developing leading AI models without infringing on copyright, emphasizing the reliance on a vast array of human expressions covered by copyright laws to meet modern needs.
Patronus AI’s research signifies a critical point in the ongoing conversation about AI’s impact on copyright, urging a reevaluation of how AI models are trained and the potential need for regulatory frameworks that align with the digital age.
As these discussions continue to evolve, the tech industry and creative fields alike will need to navigate the complexities of innovation, copyright, and ethics, striving for a balance that respects both technological advancement and the rights of creators.