AI search engine Perplexity is under scrutiny for allegedly using stealth bots to circumvent websites' restrictions on web crawling. If verified, the claim would mean the company is breaching Internet practices that have been upheld for more than thirty years, according to web performance and security company Cloudflare.
Cloudflare disclosed in a recent blog post that it had received multiple complaints from customers who had explicitly disallowed Perplexity's scraping bots in their robots.txt files and had deployed web application firewall rules to block the company's declared crawlers. Despite these preventative measures, Perplexity reportedly continued to scrape content from the sites without authorization.
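In practice, opting a site out of Perplexity's declared crawlers looks like the robots.txt entries below. The user-agent tokens shown are the ones commonly cited for Perplexity's declared crawlers; operators should confirm the current names against Perplexity's own documentation.

```
# robots.txt at the site root: refuse Perplexity's declared crawlers everywhere
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

The web application firewall rules the complainants describe are the second layer: a rule matching the same user-agent strings, for example a Cloudflare rules-language expression along the lines of `http.user_agent contains "PerplexityBot"`, blocks the declared crawler even if it ignores robots.txt.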
In their investigation, Cloudflare researchers set out to verify these claims and found that when Perplexity's declared crawlers were blocked, the company fell back on a stealth bot that used a variety of tactics to obscure its activity and bypass the restrictions.
The detailed examination revealed that this undeclared scraper operated from multiple IP addresses outside Perplexity's official range, rotating those addresses in response to the restrictions set out in robots.txt policies. The researchers also observed the bot rotating across different autonomous system numbers (ASNs) in an effort to evade detection and blocks put in place by Cloudflare. The activity was not isolated: it was observed across tens of thousands of domains and involved millions of daily requests.
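Rotating source addresses matters because a common way to verify a declared crawler is to check whether the connecting IP falls inside the ranges its operator publishes. The sketch below illustrates that check with Python's standard ipaddress module; the network ranges and sample addresses are placeholders, not Perplexity's actual published ranges.

```python
import ipaddress

# Hypothetical published ranges for a declared crawler (placeholders,
# not Perplexity's actual ranges).
PUBLISHED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_declared_crawler_ip(remote_addr: str) -> bool:
    """Return True if the connecting address is inside a published range."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in PUBLISHED_RANGES)

# A request claiming to be the declared crawler but arriving from an
# address outside the published ranges cannot be verified this way.
print(is_declared_crawler_ip("192.0.2.10"))   # True  -> inside published range
print(is_declared_crawler_ip("203.0.113.7"))  # False -> outside, unverifiable
```

A crawler that shifts to addresses and networks outside its published ranges defeats exactly this kind of verification, which is why the observed rotation across IPs and ASNs is treated as evasive behavior.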
The controversy marks a significant departure from Internet norms that date back to 1994, when engineer Martijn Koster first proposed the Robots Exclusion Protocol, a standardized way for webmasters to signal that certain content should not be accessed by automated crawlers. By placing a simple robots.txt file at the root of a site, a content provider can clearly state its no-crawl policy. The convention was widely adopted in the years that followed and was formalized by the Internet Engineering Task Force in 2022 as RFC 9309.
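For context, a well-behaved crawler consults robots.txt before fetching a page. Python's standard-library urllib.robotparser shows what that check looks like; the site, URL, and user-agent string here are illustrative only.

```python
from urllib import robotparser

# Hypothetical site and crawler identity, purely for illustration.
SITE = "https://example.com"
USER_AGENT = "ExampleBot"

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

# A compliant crawler checks can_fetch() and skips any disallowed URL.
if parser.can_fetch(USER_AGENT, f"{SITE}/private/report.html"):
    print("Allowed to fetch")
else:
    print("Disallowed by robots.txt; a compliant crawler stops here")
```

The allegation against Perplexity is not that it misread such a check, but that its undeclared bot skipped it entirely.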
If proven accurate, Perplexity’s circumvention of these protocols raises critical questions about the ethical framework governing automated content scraping. Such actions could erode trust between content creators and AI service providers, creating an environment where compliance with Internet norms is increasingly uncertain.
In considering the tactics involved, the MITRE ATT&CK framework may offer a useful lens on the techniques reportedly at play: gaining access to content in spite of explicit restrictions, persisting through the use of an undeclared stealth bot, and evading the defenses set by the targeted sites. Such tactics illustrate the evolving threat landscape that business owners must navigate as AI technologies become increasingly integrated into everyday operations.
As this situation unfolds, it serves as a reminder for companies to remain vigilant about their cybersecurity strategies and the potential vulnerabilities introduced by emerging technologies.