Cloudflare: Perplexity’s Bots Bypass No-Crawl Directives


Allegations of Improper Data Collection Aren’t New for Perplexity


Perplexity, an artificial intelligence firm, faces allegations of circumventing established internet protocols to acquire data. Cloudflare has accused the AI search startup, which is positioning itself as a competitor to Google, of disregarding website restrictions and concealing its data scraping operations.

Findings published Monday by Cloudflare engineers document behavior indicative of deliberate attempts to flout content restrictions. Despite explicit disallow directives in publishers' robots.txt files, the established standard for telling web crawlers which content they may access, Perplexity allegedly persisted in extracting data from multiple domains.
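For reference, the disallow directives in question are plain-text rules served from a site's root. A minimal robots.txt that bars Perplexity's publicly documented crawler while leaving other bots unrestricted might look like this (example.com is a placeholder domain):

```
# Served at https://example.com/robots.txt

# Bar Perplexity's declared crawler from the entire site
User-agent: PerplexityBot
Disallow: /

# All other crawlers may fetch everything
User-agent: *
Allow: /
```

Compliance is entirely on the crawler's side: the file expresses the publisher's wishes but enforces nothing, which is why Cloudflare's allegations center on the rules being ignored rather than bypassed technically.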

Cloudflare noted that the unidentified crawler used an array of IP addresses outside Perplexity's officially declared range. Researchers observed that Perplexity's bots frequently rotated IPs and modified their user-agent strings to impersonate a Google Chrome browser on macOS, tactics consistent with evading detection.

This behavior was documented across thousands of websites, amounting to billions of content requests daily. Cloudflare was able to identify this unauthorized crawler using a combination of machine learning analytics and network monitoring techniques.

The Robots Exclusion Protocol, created in 1994 by Martijn Koster and standardized in 2022 as RFC 9309, allows website administrators to manage interactions with web crawlers. While adherence to the protocol is voluntary, it has become a norm among reputable web scraping entities.

Despite sites deploying various protections, including robots.txt restrictions and web application firewall rules that block known user agents such as PerplexityBot, some reportedly continued to experience unauthorized crawling. In response, Cloudflare revoked Perplexity's status as a verified bot and rolled out new detection heuristics to block the stealth crawler.
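Verified-bot systems like Cloudflare's pair a crawler's claimed identity with the operator's published source addresses, which is exactly the check that user-agent spoofing and IP rotation are designed to defeat. A minimal sketch of that idea, using made-up documentation-range networks rather than Perplexity's real published IPs, might look like:

```python
import ipaddress

# Hypothetical published crawler ranges for illustration only.
# These are IETF documentation prefixes, not Perplexity's actual IPs.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_declared_crawler(ip: str, user_agent: str) -> bool:
    """Accept a request claiming to be the declared crawler only if its
    source IP falls inside the operator's published ranges.

    A bot that rotates to undeclared IPs, or that swaps its user-agent
    string for a browser's, fails this check either way.
    """
    if "PerplexityBot" not in user_agent:
        return False
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DECLARED_RANGES)
```

Under this scheme a request from an undeclared address fails verification even if it presents the crawler's user agent, while a spoofed Chrome user agent from a declared address simply is not treated as the crawler at all; real systems add signals such as reverse-DNS checks and behavioral fingerprinting on top.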

Perplexity spokesperson Jesse Dwyer denied the allegations, dismissing Cloudflare's claims as a marketing strategy and asserting that the bot described in the blog post does not belong to the company. Dwyer added that the screenshots presented show no evidence of unauthorized content access.

Concerns over improper data collection are not new for Perplexity. Forbes previously accused the company of publishing articles that closely resembled other outlets' reporting without proper attribution, raising questions about its scraping practices. Wired has likewise reported suspicious traffic linked to Perplexity that suggested violations of robots.txt directives.

The incident underscores an ongoing tension between AI companies and content creators. As AI bots increasingly rely on scraped data for model training and content retrieval, publishers see their material used with little in return, often without compensation, further straining the relationship.

Recent reports from TollBit also indicate a surge in scraping activity, citing an 87% increase over earlier quarters of 2025. The share of bots disregarding robots.txt directives climbed from 3.3% to 12.9%, a troubling sign of unchecked data extraction.

In response, Cloudflare has built tools to block AI bots from scraping content without permission. It has also launched a marketplace that lets publishers charge AI companies for content access, arguing that unchecked AI crawling poses substantial challenges for content creators and site operators.
