Reddit’s CEO, Steve Huffman, has publicly demanded that Microsoft and other AI companies pay Reddit for using its data, which he says they have been accessing without authorization.
In a detailed interview with The Verge, Huffman accused Microsoft’s Bing, along with AI companies Anthropic and Perplexity, of scraping Reddit’s content without permission. He emphasized the challenges and frustrations associated with blocking these companies, describing it as “a real pain in the a**,” but stressed that it was a necessary measure to safeguard Reddit’s data and ensure that the platform receives proper compensation.
Concerns Over Data Control
Reddit already has deals with companies such as Google and OpenAI, which pay for the right to use Reddit’s data. However, Huffman pointed out that Microsoft and others have resisted such agreements. He noted that without these deals, Reddit has no control over how its data is displayed or used, which leads to unauthorized uses such as training AI models and summarizing content on Bing without Reddit’s consent.
Huffman specifically mentioned that Microsoft has been using Reddit’s data to train its AI and has been selling access to this data through the Bing API to other search engines. He also referred to a recent statement by Microsoft AI CEO Mustafa Suleyman, who described public data on the internet as “freeware.”
Steps to Block Unauthorized Data Scraping
To address these issues, Reddit has been actively working to prevent unauthorized data scraping. In early July, Reddit updated its robots.txt file, which implements the Robots Exclusion Protocol, to block web crawlers from companies with which it has no agreement. As a result, Reddit content became accessible only through search engines like Google, which compensates Reddit for its data.
Microsoft’s head of search, Jordi Ribas, confirmed on X (formerly Twitter) that Bing had been blocked from accessing Reddit’s data, saying that Reddit was favoring another search engine and that this affects competition.
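For readers unfamiliar with how robots.txt blocking works in practice, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The rules, the crawler names `LicensedBot` and `OtherBot`, and the `example.com` URL are all hypothetical and chosen for illustration; they are not Reddit's actual robots.txt or any real crawler's user agent. The sketch shows how a well-behaved crawler would check the rules before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (NOT Reddit's real robots.txt):
# one named crawler is allowed everywhere, every other crawler is blocked.
ROBOTS_TXT = """\
User-agent: LicensedBot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/some_thread"  # hypothetical URL
    for agent in ("LicensedBot", "OtherBot"):
        print(agent, can_crawl(ROBOTS_TXT, agent, page))
```

Note that robots.txt is advisory: it only works when a crawler chooses to read and honor it, which is why compliance is a point of contention in this dispute.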
Protecting Data Rights and Future Agreements
Huffman pointed to OpenAI’s recent announcement of SearchGPT, which will include Reddit results thanks to a deal between the two companies, as an example of the kind of agreement Reddit seeks to replicate. Reddit spokesperson Tim Rathschmidt clarified that none of Reddit’s current content licensing deals include exclusive use cases for its data, indicating a willingness to work with multiple partners.
This situation reflects a broader trend among content creators and traditional media publishers, including The Verge’s parent company, Vox Media, who are increasingly seeking compensation for their content used by generative AI models. Huffman highlighted that the traditional value exchange from search engines—crawling content in exchange for traffic—is becoming more complex as the lines between search, summarization, and training blur.
After the story was published, Anthropic spokesperson Jennifer Martinez stated that Reddit has been on the company’s block list for web crawling since mid-May and that it respects the robots.txt file as the industry standard for blocking web crawlers.
Microsoft declined to comment on the matter, and Perplexity did not respond to a request for comment.
Featured Image courtesy of Jakub Porzycki/NurPhoto via Getty Images