
News outlets accuse Perplexity of plagiarism and unethical web scraping

By Yasmeeta Oon

Jul 3, 2024


In the age of generative AI, chatbots can provide detailed answers to questions based on content sourced from the internet. However, the distinctions between fair use and plagiarism, and between routine web scraping and unethical summarization, have become increasingly blurred.

Perplexity AI is a startup that combines a search engine with a large language model to generate comprehensive answers, rather than merely providing links. Unlike competitors such as OpenAI’s ChatGPT and Anthropic’s Claude, Perplexity does not train its own foundation models. Instead, it uses open or commercially available models to gather information from the internet and distill it into detailed responses.
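
To make that retrieve-then-generate architecture concrete, the sketch below shows the general flow: search first, then hand the retrieved snippets to an off-the-shelf model to write a cited answer instead of returning a list of links. It is a minimal illustration under assumed interfaces; the web_search and llm_complete helpers are hypothetical stand-ins, not Perplexity's actual internals, which are not public.

```python
# Minimal sketch of a search-plus-LLM answer engine of the general kind
# described above. web_search() and llm_complete() are hypothetical
# stand-ins; Perplexity's real pipeline is not public.
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    snippet: str


def web_search(query: str, top_k: int = 5) -> list[SearchResult]:
    # Stand-in for a live search-index lookup.
    return [SearchResult("https://example.com/story",
                         f"Example snippet relevant to: {query}")][:top_k]


def llm_complete(prompt: str) -> str:
    # Stand-in for a call to an open or commercially available model.
    return "(model-written answer citing the sources in the prompt)"


def answer(question: str) -> str:
    results = web_search(question)
    # Hand the model the retrieved snippets together with their URLs,
    # so the generated answer can cite sources rather than list links.
    context = "\n\n".join(f"[{r.url}] {r.snippet}" for r in results)
    prompt = (f"Using only the sources below, write a cited answer.\n\n"
              f"{context}\n\nQuestion: {question}")
    return llm_complete(prompt)


print(answer("What is the Robots Exclusion Protocol?"))
```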

But recent accusations have cast a shadow over the startup’s approach. In June, Forbes accused Perplexity of plagiarizing one of its news articles in the startup’s beta feature, Perplexity Pages. Similarly, Wired claimed that Perplexity had illicitly scraped its website and other sites.

Perplexity, which as of April was working to raise $250 million at a near-$3 billion valuation, insists that it has done nothing wrong. The Nvidia- and Jeff Bezos-backed company asserts that it has honored publishers’ requests to avoid scraping content and is operating within the bounds of fair use copyright laws.

Concept                   | Description
Robots Exclusion Protocol | A standard used by websites to indicate that they do not want their content accessed or used by web crawlers.
Fair Use in Copyright Law | Legal framework allowing the use of copyrighted material without permission or payment under certain circumstances.
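
In practice, the Robots Exclusion Protocol is expressed as a plain-text robots.txt file served from a site's root. The following is a minimal sketch of how a compliant crawler would honor it, using Python's standard-library urllib.robotparser; the sample rules and the ExampleBot user agent are illustrative, not any publisher's actual configuration.

```python
# Sketch of a robots.txt compliance check using Python's standard
# library. The sample rules and user agent are illustrative only.
import urllib.robotparser

SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A compliant crawler calls can_fetch() before requesting each URL
# and skips anything the publisher has disallowed for its user agent.
for url in ("https://example.com/news/story",
            "https://example.com/private/draft"):
    allowed = parser.can_fetch("ExampleBot", url)
    print(f"{url} -> {'fetch' if allowed else 'skip (disallowed)'}")
```

Framed in these terms, Wired's allegation is that Perplexity fetched pages that a check like this would have skipped.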

Wired’s June 19 story claims that Perplexity has ignored the Robots Exclusion Protocol to surreptitiously scrape areas of websites that publishers do not want bots to access. Wired reported observing a machine tied to Perplexity doing this on its own news site and across other publications under its parent company, Condé Nast.

Perplexity has staunchly defended its practices, stating that it adheres to ethical guidelines and legal standards. The company emphasizes that its use of content falls under fair use, which allows for the use of copyrighted material without permission in specific contexts, such as commentary, criticism, news reporting, and research. Additionally, Perplexity claims it respects the Robots Exclusion Protocol by not accessing restricted areas of websites.

The situation involving Perplexity AI is complex, touching on various ethical and legal nuances. Two primary concepts are at the heart of this controversy:

  • Robots Exclusion Protocol: This protocol allows websites to communicate their preferences regarding the use of web crawlers. By specifying which parts of a site should not be accessed, publishers can protect their content from unauthorized scraping.
  • Fair Use: Under copyright law, fair use provides a legal framework for the use of copyrighted material without requiring permission or payment. This can include uses such as criticism, comment, news reporting, teaching, scholarship, and research.

Despite these guidelines, the application of fair use is not always straightforward, and the boundaries of ethical web scraping are often contested. The recent allegations against Perplexity highlight the ongoing debate over how generative AI technologies should interact with existing content on the internet.

The accusations against Perplexity AI have significant implications for the tech industry, particularly for companies developing and deploying generative AI models. As these technologies become more advanced, the potential for ethical and legal conflicts increases. Companies must navigate these challenges carefully to maintain trust and compliance with regulatory standards.

Several key considerations will shape the future of generative AI and its ethical use of internet content:

  • Transparency: Companies like Perplexity must be transparent about their data sourcing and usage practices. Clear communication with publishers and users can help build trust and prevent misunderstandings.
  • Compliance: Adhering to protocols such as the Robots Exclusion Protocol and respecting fair use guidelines are crucial for maintaining ethical standards. Companies should regularly review and update their practices to ensure compliance with evolving regulations.
  • Collaboration: Engaging with publishers, legal experts, and industry stakeholders can help AI companies develop more responsible and ethical practices. Collaborative efforts can lead to the creation of standards and guidelines that benefit the entire industry.

The controversy surrounding Perplexity AI underscores the delicate balance between innovation and ethical responsibility in the realm of generative AI. As the technology continues to evolve, it is imperative for companies to navigate these complexities with care, ensuring that their practices align with both legal standards and ethical expectations.

By addressing these challenges head-on, the industry can work towards a future where generative AI technologies are used responsibly and ethically, benefiting both developers and the wider community.


Featured Image courtesy of The National

Yasmeeta Oon

Just a girl trying to break into the world of journalism, constantly on the hunt for the next big story to share.
