A standard-bearer for legacy media, The New York Times has taken legal action against OpenAI, claiming the tech firm illegally scraped its news articles to train AI models without consent. The lawsuit encapsulates the increasing friction between legacy media companies and AI developers over the use of proprietary content and data scraping.
In a related story, the Wikimedia Foundation has partnered with Kaggle. The partnership aims to make accessing Wikipedia content as a developer less daunting, disconnected, and confusing, with a focus on making Wikipedia information available in a machine-readable format. The project draws on the openly licensed content found across the platform, which is available under two primary licenses: the Creative Commons Attribution-ShareAlike 4.0 license and the GNU Free Documentation License (GFDL). Some content is in the public domain or available under other licenses.
Wikimedia’s Response to AI Scraping and Bandwidth Strain
The Wikimedia Foundation’s move comes as AI bots consume ever more bandwidth, putting significant strain on Wikipedia’s resources. With demand for downloading multimedia ballooning bandwidth usage by 50% since the beginning of 2024, the foundation decided that decisive action was needed. The organization acknowledged the increase earlier this month, noting the need for solutions to mitigate the impact of AI scraping on its infrastructure.
“Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content—making this ideal for training models, building features, and testing NLP pipelines,” the Wikimedia Foundation said.
The new collaboration with Kaggle furthers Wikimedia’s efforts to make Wikipedia data easier to access while directly responding to the problem of AI scraping. By offering this knowledge in a structured format, the foundation lets developers use the information without scraping, which places an additional bandwidth burden on Wikipedia.
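To give a sense of what the foundation is describing, here is a minimal sketch of consuming such a structured dump, assuming a newline-delimited JSON file and illustrative field names; the actual Kaggle dataset's file layout and schema may differ.

```python
import json

# Hypothetical local path to a structured Wikipedia dump; the real
# Kaggle dataset's filename and format may differ.
DUMP_PATH = "wikipedia_structured_sample.jsonl"

def iter_articles(path):
    """Yield one parsed article per JSON line, skipping malformed lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate the occasional bad record

for article in iter_articles(DUMP_PATH):
    # Field names ("name", "abstract") are illustrative assumptions,
    # not the confirmed schema of the Kaggle release.
    title = article.get("name", "<untitled>")
    abstract = article.get("abstract", "")
    print(f"{title}: {abstract[:80]}")
```

Streaming the file one line at a time keeps memory use flat even on large dumps, which matters for exactly the model-training and NLP-pipeline workloads the foundation's quote describes.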
Reddit recently implemented measures to discourage similar behavior on its platform, adding new restrictions this spring to stop bots from abusing its services. Changes to Reddit’s API policies in 2023 already require developers to pay for access, a decision that sparked controversy and debate within the developer community.
The ongoing legal disputes and new initiatives highlight the broader challenges organizations like The New York Times and Wikipedia face as they navigate the evolving landscape of artificial intelligence and data usage. As more private-sector entities seek to exploit generative AI capabilities, the need for clear regulations and ethical practices only grows.
Author’s Opinion
The legal battles and partnerships highlighted in this article reflect the growing tension between media companies and AI developers over the ethical use of data. As AI tools advance, it’s essential for organizations to adopt transparent, fair practices that protect intellectual property while ensuring that data usage supports innovation. The initiatives by Wikimedia and Reddit are crucial steps in addressing these concerns, but they are just the beginning. Clear, industry-wide regulations will be necessary to prevent exploitation and ensure that AI developments benefit both creators and consumers.
Featured image credit: FMT