Reddit has announced changes to its robots.txt file, which implements the Robots Exclusion Protocol, to prevent AI crawlers from scraping its site without proper authorization. The update aims to curb the use of Reddit content for training AI models without acknowledging the source.
Historically, the robots.txt file has told search engine crawlers which parts of a site they may crawl, which helps make content discoverable. However, the rise of AI has led to increased scraping for model training, often without proper attribution. Reddit’s updated robots.txt file, along with continued rate-limiting and blocking of unknown bots and crawlers, seeks to address this issue. According to Reddit, bots and crawlers that do not comply with its Public Content Policy or lack an agreement with the platform will face restrictions.
This update is expected to have minimal impact on most users and legitimate actors, such as researchers and organizations like the Internet Archive. Reddit’s primary target is AI companies using its content for large language model training without proper agreements. Although AI crawlers can technically ignore the robots.txt file, Reddit’s measures are a clear signal to these companies to comply or face limitations.
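To illustrate why robots.txt is a signal rather than a hard barrier, here is a minimal Python sketch of the check a well-behaved crawler performs before fetching a page, using the standard library's urllib.robotparser. The rules, the bot names (ExamplePartnerBot, UnknownAIBot) and the example.com URL are hypothetical and are not taken from Reddit's actual file. The check runs entirely on the crawler's side, so a crawler that skips it is not stopped by the file itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
# This is NOT Reddit's actual file; the bot names are made up.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExamplePartnerBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/python/"
    for bot in ("ExamplePartnerBot", "UnknownAIBot"):
        verdict = "may fetch" if is_allowed(EXAMPLE_ROBOTS_TXT, bot, page) else "should not fetch"
        print(f"{bot} {verdict} {page}")
```

In this sketch the named partner bot is allowed everywhere, while every other user agent falls through to the wildcard rule and is disallowed. Enforcement against crawlers that ignore the result has to come from server-side measures such as the rate-limiting and blocking described above.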
The announcement follows a Wired investigation revealing that AI-powered search startup Perplexity has been ignoring robots.txt directives and scraping content from sites despite being blocked. Perplexity’s CEO, Aravind Srinivas, argued that the robots.txt file is not legally binding, a position that highlights the challenges Reddit faces in enforcing its policies.
Who Will Be Affected by the Update?
Reddit clarified that these changes would not affect companies with existing agreements, such as Google, which has a $60 million deal allowing it to train AI models using Reddit content. The platform’s blog post emphasized the importance of compliance with its policies, stating, “We are selective about who we work with and trust with large-scale access to Reddit content.”
This move aligns with Reddit’s recent policy update aimed at regulating how its data is accessed and used by commercial entities and partners.
Featured Image courtesy of DADO RUVIC/REUTERS