Specifically, Wikimedia Commons, the primary go-to repository for freely licensed images, videos, and audio files, saw a drastic increase in bandwidth use. Since the beginning of 2024, the platform has seen a 50% uptick in multimedia download traffic. Automated, low-cost, data-hungry scrapers collecting material to train artificial intelligence models fuel this increase; the traffic is not coming from human users looking for information.
Content on Wikimedia Commons is available under open licenses or is otherwise in the public domain. The Wikimedia Foundation, the nonprofit that operates projects like Wikipedia, keeps costs down by relying on volunteers to develop and curate the enormous trove of freely available media the site has to offer. But amid the recent AI boom, scraper traffic is flooding the site and placing great stress on its internal infrastructure and resources.
The Role of Bots in Bandwidth Use
The Wikimedia Foundation pointed out that almost two-thirds of the most resource-intensive traffic on Commons comes from bots. These bots are some of the worst offenders when it comes to downloading high-bandwidth content, further raising the foundation's operational costs.
“The truth is, our infrastructure is very capable of handling unexpected traffic surges from human users during high-interest events. The record-high, out-of-control surge in traffic created by these scraper bots is creating greater harm and expense,” explained a spokesperson for the Wikimedia Foundation.
Since 2019, the Wikimedia Foundation's site reliability engineering team has been up against this challenge, spending significant time and money on efforts to counteract these crawlers. The effort has turned into a cat-and-mouse game. If the status quo persists, publishers may end up requiring logins and/or paywalls to protect their content.
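One common first line of defense in this cat-and-mouse game is asking crawlers to stay away via a robots.txt file. The sketch below uses real, published crawler user-agent names (GPTBot for OpenAI, CCBot for Common Crawl), but it is an illustrative example rather than Wikimedia's actual configuration, and it highlights the core weakness: compliance is entirely voluntary.

```text
# robots.txt — politely asks well-behaved crawlers not to fetch site content.
# Non-compliant scrapers ignore this file entirely, which is why
# publishers escalate to rate limiting, logins, or paywalls.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Ask all remaining bots to slow down (a de facto extension,
# not universally honored).
User-agent: *
Crawl-delay: 10
```

Because honoring these directives is optional, operators typically pair robots.txt with server-side measures such as user-agent filtering and rate limiting.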
The problem of bandwidth demands from AI scrapers goes far beyond Wikimedia Commons. Many other projects, such as tech writer Gergely Orosz's, have experienced the same bandwidth struggles. In reaction to the new landscape, tech companies are working to create countermeasures. For instance, Cloudflare recently introduced AI Labyrinth, which serves AI-generated decoy content to slow scrapers down and waste their resources.
This growth in bandwidth consumption has significant ramifications for the Wikimedia Foundation's cloud expenditure. As it charts a path through this constantly changing terrain, the foundation remains deeply committed to ensuring equitable free access to knowledge while protecting its communities from the disruptive effects of automated traffic.
What The Author Thinks
The increasing demand for bandwidth from AI scrapers presents a real challenge for platforms like Wikimedia Commons, which thrive on open access to information. While the benefits of AI advancements are undeniable, they must not come at the expense of the infrastructure and data resources that support the public good. As the situation develops, finding a balance between technological growth and sustainability will be crucial for the long-term health of the open web.
Featured image credit: Pernilla Rydmark via Flickr