How to Detect and Block AI Scrapers: Complete Guide for GPTBot, ClaudeBot, and More

Your content is being scraped right now. Not by traditional search engines—by AI companies building the next generation of language models. Every blog post, product description, and technical document on your site is potential training data for GPT-5, Claude 4, and dozens of other AI systems.

The question isn’t whether AI scrapers are visiting your site. It’s whether you can detect them—and what you’re going to do about it.

The AI Scraping Landscape in 2025

AI companies need vast amounts of web content to train their models. To get it, they deploy sophisticated crawlers that harvest text from millions of websites. Some announce themselves honestly. Others disguise their identity to avoid blocks.

Known AI Crawler User Agents

Here are the AI scrapers you should know about:

Bot Name            Company        User Agent Contains   Announced
GPTBot              OpenAI         GPTBot                Yes
ChatGPT-User        OpenAI         ChatGPT-User          Yes
ClaudeBot           Anthropic      ClaudeBot             Yes
Claude-Web          Anthropic      Claude-Web            Yes
CCBot               Common Crawl   CCBot                 Yes
PerplexityBot       Perplexity     PerplexityBot         Yes
Amazonbot           Amazon         Amazonbot             Yes
Google-Extended     Google         Google-Extended       Yes
FacebookBot         Meta           FacebookBot           Yes
Bytespider          ByteDance      Bytespider            Yes
Applebot-Extended   Apple          Applebot-Extended     Yes
cohere-ai           Cohere         cohere-ai             Yes
Diffbot             Diffbot        Diffbot               Yes
Omgilibot           Webz.io        Omgilibot             Yes
YouBot              You.com        YouBot                Yes

But here’s the problem: sophisticated scrapers don’t announce themselves. They spoof legitimate browser user agents and mimic human behavior.

Why robots.txt Doesn’t Work

The standard advice for blocking AI scrapers is to update your robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

This approach has three fatal flaws:

1. robots.txt Is Advisory, Not Enforceable

Robots.txt is a gentleman’s agreement. Well-behaved bots respect it. Malicious scrapers ignore it completely. There’s no technical mechanism to force compliance.

2. Sophisticated Scrapers Spoof User Agents

Any scraper can change its user agent string. A Python script using the requests library needs only one extra line to impersonate Chrome:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0'}
response = requests.get('https://yoursite.com', headers=headers)

Now your robots.txt block is useless—the scraper looks like a regular browser.

3. New Scrapers Emerge Constantly

By the time you add a new bot to your robots.txt, it’s already scraped your site. You’re always playing catch-up.

The conclusion is clear: robots.txt is a starting point, not a solution. You need active detection.

Behavioral Detection: How AI Scrapers Give Themselves Away

Even when AI scrapers disguise their identity, their behavior reveals them. Here’s what to look for:

1. Request Patterns

AI scrapers exhibit distinctive patterns:

  • Sequential URL crawling - Systematically visiting every page in order
  • No asset loading - Skipping CSS, JavaScript, and images (they only want text)
  • Consistent timing - Requests at precise intervals (humans are random)
  • Deep crawling - Visiting pagination, archives, and low-value pages humans skip
  • No referrer - Direct requests without coming from search or social

Detection approach:

Normal user session:
  / → /about → /pricing → /contact
  Assets loaded: 47 (CSS, JS, images)
  Time between pages: 15-90 seconds (reading)

AI scraper session:
  /blog/post-1 → /blog/post-2 → /blog/post-3 → /blog/post-4
  Assets loaded: 0
  Time between pages: 0.5-2 seconds (consistent)
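
A minimal sketch of how these signals can be combined into a simple score (the field names and thresholds are illustrative assumptions, not a prescribed configuration):

# Illustrative scoring of one session reconstructed from your access logs.
from statistics import pstdev

def looks_like_scraper(session):
    # session: {'asset_requests': int, 'page_gaps': [seconds between pages], 'referrer': str or None}
    score = 0
    if session['asset_requests'] == 0:            # no CSS/JS/images loaded
        score += 1
    gaps = session['page_gaps']
    if gaps and max(gaps) < 3:                    # every page fetched within seconds
        score += 1
    if len(gaps) >= 5 and pstdev(gaps) < 0.5:     # suspiciously consistent timing
        score += 1
    if not session.get('referrer'):               # never arrived via search or social
        score += 1
    return score >= 3

# A session that pulled pages about one second apart, with no assets and no referrer:
print(looks_like_scraper({'asset_requests': 0,
                          'page_gaps': [1.1, 0.9, 1.0, 1.2, 1.0],
                          'referrer': None}))     # True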

2. TLS Fingerprinting

Every HTTP client stack produces a characteristic TLS fingerprint based on how it negotiates the TLS handshake. This fingerprint, captured as a JA3 or JA4 hash, reveals the actual client software regardless of what the User-Agent header claims.

The key insight: A request claiming to be Chrome but presenting a Python requests TLS fingerprint is lying.

Claimed User Agent   TLS Fingerprint        Verdict
Chrome/120           Chrome JA3 hash        Legitimate
Chrome/120           Python-requests JA3    Spoofed - Block
Chrome/120           curl JA3               Spoofed - Block
Chrome/120           Node.js JA3            Spoofed - Block

WebDecoy’s TLS fingerprinting capabilities catch these mismatches automatically.
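
A minimal sketch of the mismatch check, assuming your edge layer (load balancer, CDN, or packet capture) already computes a JA3 hash per connection; the hash values below are placeholders, not real fingerprints:

# Placeholder JA3 hashes - populate these from your own fingerprint database.
KNOWN_BROWSER_JA3 = {
    'chrome':  {'aa7744226c695c0b2e440419848cf700'},
    'firefox': {'b20b44226c695c0b2e440419848cf7aa'},
}

def is_spoofed(user_agent, ja3_hash):
    # Flag requests whose claimed browser family doesn't match the fingerprint seen on the wire.
    ua = user_agent.lower()
    for family, hashes in KNOWN_BROWSER_JA3.items():
        if family in ua:
            return ja3_hash not in hashes    # claims this browser, fingerprints as something else
    return False                             # unknown UA family: defer to other checks

# Claims Chrome, but the hash isn't any known Chrome fingerprint:
print(is_spoofed('Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0', 'deadbeef00000000'))  # True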

3. JavaScript Execution

AI scrapers typically don’t execute JavaScript. They fetch raw HTML and extract text. This creates a detectable signal.

Detection technique:

  1. Inject a small JavaScript snippet that sets a cookie or calls an endpoint
  2. Check if subsequent requests include the result
  3. No JavaScript execution = likely bot

// Detection snippet - real browsers run this after the DOM loads
document.addEventListener('DOMContentLoaded', function() {
  fetch('/api/beacon', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ js: true, ts: Date.now() })
  });
});

Real browsers execute this. Scrapers don’t.
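
A minimal server-side counterpart to the snippet above (a framework-agnostic sketch; keying sessions by IP plus user agent is an assumption, so adapt it to however you track visitors):

import time

js_verified = {}   # session_key -> timestamp of the last beacon

def record_beacon(session_key):
    # Call this from the handler behind POST /api/beacon.
    js_verified[session_key] = time.time()

def session_ran_javascript(session_key, max_age_seconds=1800):
    # True if this client executed our snippet recently.
    seen = js_verified.get(session_key)
    return seen is not None and (time.time() - seen) < max_age_seconds

A client that has fetched several HTML pages but never hit /api/beacon is very likely pulling raw HTML without a browser.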

4. Honeypot Detection

The most reliable AI scraper detection method: invisible links that only bots follow.

How it works:

  1. Add links to your pages that are invisible to humans (CSS display:none or positioned off-screen)
  2. These links point to decoy pages with unique URLs
  3. Any request to these URLs is definitively a bot—humans can’t see or click them
  4. Zero false positives

<!-- Invisible to humans, visible to scrapers parsing HTML -->
<a href="/content-archive-2024" style="position:absolute;left:-9999px;">Archive</a>

When a scraper follows this link, you’ve caught it with 100% certainty.
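
A sketch of the decoy endpoint itself, here using Flask (the framework choice and the blocklist wiring are assumptions; connect it to your own logging and firewall):

from flask import Flask, request

app = Flask(__name__)
BLOCKED_IPS = set()   # replace with your WAF or firewall integration

@app.route('/content-archive-2024')
def honeypot():
    ip = request.headers.get('X-Forwarded-For', request.remote_addr)
    # Only a crawler parsing raw HTML can reach this URL, so flag it immediately.
    print({'event': 'honeypot_triggered', 'ip': ip,
           'user_agent': request.headers.get('User-Agent')})   # replace with real logging
    BLOCKED_IPS.add(ip)
    return 'Archive', 200   # bland response so the scraper learns nothing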

Learn more about this approach in our endpoint decoys guide.

5. Geographic and Network Signals

AI scrapers often run from:

  • Cloud infrastructure - AWS, GCP, Azure, DigitalOcean
  • Known scraping services - Bright Data, Oxylabs, ScrapingBee
  • Datacenter IPs - Not residential connections

Cross-reference the claimed location (from headers or timezone) with the actual IP geolocation. Mismatches indicate spoofing.

Our geographic consistency detection catches these discrepancies.
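
Two of these checks as a sketch; the ASN list is a tiny illustrative subset, and the country-to-timezone table would need to be far more complete in practice:

DATACENTER_ASNS = {16509, 15169, 8075, 14061}   # AWS, Google, Microsoft, DigitalOcean

def is_datacenter_ip(asn):
    return asn in DATACENTER_ASNS

def timezone_mismatch(claimed_offset_minutes, ip_country):
    # claimed_offset_minutes: what the client reports (e.g. JS Date().getTimezoneOffset()).
    # ip_country: from your IP-geolocation source (MaxMind, CDN headers, etc.).
    plausible = {'US': range(240, 481), 'DE': range(-120, -59)}   # minutes west of UTC
    offsets = plausible.get(ip_country)
    return offsets is not None and claimed_offset_minutes not in offsets

# A "US laptop" session arriving from an AWS ASN with a UTC+8 clock deserves a closer look.
print(is_datacenter_ip(16509), timezone_mismatch(-480, 'US'))   # True True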

Implementing AI Scraper Detection

Option 1: robots.txt (Baseline)

Start with robots.txt to block well-behaved AI crawlers:

# Block AI training crawlers
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Google-Extended
User-agent: FacebookBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: cohere-ai
User-agent: Diffbot
User-agent: Omgilibot
User-agent: YouBot
User-agent: anthropic-ai
User-agent: Scrapy
User-agent: img2dataset
Disallow: /

# Allow legitimate search engines
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Allow: /

Limitation: Only stops honest bots.

Option 2: User Agent Blocking (Basic)

Block requests with known AI scraper user agents at the web server level:

Nginx:

if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider)) {
    return 403;
}

Apache:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider) [NC]
RewriteRule .* - [F,L]

Cloudflare: Create a WAF rule:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot")
→ Block

Limitation: Trivially bypassed by changing user agent.

Option 3: Behavioral Detection (Comprehensive)

For full protection, deploy behavioral detection:

  1. Install WebDecoy SDK on your site
  2. Enable AI scraper detection in dashboard
  3. Configure honeypot links to catch disguised scrapers
  4. Set response actions (block, rate limit, or serve alternative content)

WebDecoy detects AI scrapers through:

  • TLS fingerprint analysis
  • JavaScript execution verification
  • Request pattern anomaly detection
  • Honeypot interaction monitoring
  • Geographic consistency checks

Result: Catch both announced and disguised AI scrapers with near-zero false positives.

Response Strategies: What to Do When You Detect AI Scrapers

Detection is only half the battle. You need a response strategy.

Strategy 1: Block

The simplest approach—return 403 Forbidden or drop the connection.

Pros: Immediate protection
Cons: Scrapers may retry from different IPs

Strategy 2: Rate Limit

Allow some access but throttle aggressive crawling.

Normal users: Unlimited
Detected scrapers: 10 requests/minute, then block

Pros: Less aggressive, catches IP rotation
Cons: Still allows some scraping
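
A sketch of that throttle (in-memory state for illustration; production setups usually back this with Redis or the rate limiter in your WAF or CDN):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
SCRAPER_LIMIT = 10
recent = defaultdict(deque)   # ip -> request timestamps inside the window

def allow_request(ip, is_detected_scraper):
    if not is_detected_scraper:
        return True                      # normal users: unlimited
    now = time.time()
    window = recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                 # drop requests older than one minute
    if len(window) >= SCRAPER_LIMIT:
        return False                     # over 10/minute: block (e.g. respond 429)
    window.append(now)
    return True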

Strategy 3: Serve Alternative Content

Return different content to detected scrapers:

  • Placeholder text - Generic content that’s useless for training
  • Copyright notices - Legal warnings embedded in scraped content
  • Honeypot content - Trackable text that reveals when your content appears in AI outputs

Pros: Doesn’t alert scrapers they’re detected
Cons: More complex to implement
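
A sketch of the honeypot-content variant: detected scrapers get placeholder text plus a unique, trackable "canary" string, so a later appearance of that text in an AI model's output can be traced back to the scrape (the helper names and page structure are illustrative):

import hashlib

def canary_phrase(ip, path):
    token = hashlib.sha256(f'{ip}:{path}'.encode()).hexdigest()[:12]
    return f'Reference code {token}.'

def response_body(original_html, is_detected_scraper, ip, path):
    if not is_detected_scraper:
        return original_html
    # Generic placeholder plus a per-visitor canary embedded in the text.
    return ('<html><body><p>This content is available to readers on our website.</p>'
            f'<p>{canary_phrase(ip, path)}</p></body></html>')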

Strategy 4: Tarpit

Slow down responses dramatically for detected scrapers:

if is_ai_scraper(request):       # is_ai_scraper(): your own detection check
    time.sleep(30)               # 30-second delay before responding
    return minimal_response()    # serve a stripped-down page

Pros: Wastes scraper resources, discourages continued crawling
Cons: Keeps connections open longer

Strategy 5: Legal Action

For persistent, commercial-scale scraping:

  1. Document the scraping activity
  2. Send cease-and-desist to the company
  3. File DMCA takedown if content appears in AI outputs
  4. Consider litigation for copyright infringement

Note: Legal approaches are slow and expensive. Technical prevention is more practical.

Protecting Specific Content Types

Blog Posts and Articles

Your written content is prime AI training data. Protect it with:

  • Honeypot links in article footers
  • JavaScript-gated content sections
  • Rate limiting on /blog/* paths

Product Descriptions

E-commerce content is valuable for AI product understanding:

  • Render critical details via JavaScript
  • Use honeypot product pages
  • Monitor for scraping patterns on catalog pages

Documentation

Technical docs are heavily targeted:

  • Consider authentication for detailed docs
  • Use honeypots in code examples
  • Rate limit documentation API endpoints

User-Generated Content

Forums, reviews, and comments are scraped for training:

  • Implement JavaScript rendering requirements
  • Add invisible honeypot posts
  • Monitor bulk access patterns

Monitoring and Alerting

Set up ongoing monitoring for AI scraping activity:

Metrics to Track

  • Requests from known AI crawler user agents
  • Requests with datacenter IP addresses
  • Sessions without JavaScript execution
  • Honeypot trigger events
  • TLS fingerprint mismatches

Alert Thresholds

  • Low: 100+ requests/hour from single IP without JS
  • Medium: Honeypot triggered by new IP
  • High: Coordinated scraping from IP range
  • Critical: Bulk content access pattern detected
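
A sketch of mapping detection signals to these levels (the signal field names are assumptions about what your detection layer emits):

def alert_level(signal):
    if signal.get('bulk_content_access'):
        return 'critical'
    if signal.get('coordinated_ip_range'):
        return 'high'
    if signal.get('honeypot_triggered') and signal.get('new_ip'):
        return 'medium'
    if signal.get('requests_per_hour', 0) >= 100 and not signal.get('js_executed', True):
        return 'low'
    return 'none'

print(alert_level({'requests_per_hour': 340, 'js_executed': False}))   # low
print(alert_level({'honeypot_triggered': True, 'new_ip': True}))       # medium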

Integration with SIEM

Send AI scraper detection events to your SIEM for correlation:

{
  "event_type": "ai_scraper_detected",
  "timestamp": "2025-12-08T14:30:00Z",
  "source_ip": "52.12.34.56",
  "user_agent": "Mozilla/5.0 Chrome/120.0.0.0",
  "true_identity": "python-requests/2.28",
  "detection_method": "tls_fingerprint_mismatch",
  "pages_accessed": 47,
  "honeypot_triggered": true
}

See our SIEM integration guide for setup instructions.
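
As a rough illustration, most SIEMs accept JSON events over an authenticated HTTP collector, so shipping the event can be as simple as the sketch below (the endpoint URL and token are placeholders):

import json
import requests

SIEM_ENDPOINT = 'https://siem.example.com/ingest'   # placeholder collector URL
SIEM_TOKEN = 'REPLACE_ME'                           # placeholder credential

def send_siem_event(event):
    resp = requests.post(
        SIEM_ENDPOINT,
        headers={'Authorization': f'Bearer {SIEM_TOKEN}',
                 'Content-Type': 'application/json'},
        data=json.dumps(event),
        timeout=5,
    )
    return resp.ok

send_siem_event({
    'event_type': 'ai_scraper_detected',
    'detection_method': 'tls_fingerprint_mismatch',
    'source_ip': '52.12.34.56',
    'honeypot_triggered': True,
})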

Legal Considerations

AI scraping raises complex legal questions:

Your content is copyrighted. Unauthorized copying for commercial AI training may constitute infringement. However, “fair use” arguments are being tested in courts.

Key cases to watch:

  • New York Times v. OpenAI
  • Getty Images v. Stability AI
  • Authors Guild v. OpenAI

Terms of Service

Your ToS can prohibit automated scraping. While not always enforceable against anonymous scrapers, it strengthens your legal position against known companies.

robots.txt is not legally binding, but ignoring it may be evidence of bad faith in copyright litigation.

Recommendation: Maintain clear ToS prohibiting AI training use, implement technical protections, and document scraping activity for potential legal action.

Future of AI Scraping

The cat-and-mouse game will intensify:

Scraper Evolution

  • More sophisticated browser emulation
  • Residential proxy networks
  • Human-in-the-loop verification bypass
  • Distributed, slow crawling to avoid detection

Defense Evolution

  • AI-powered scraper detection (using AI to catch AI)
  • Cryptographic content authentication
  • Industry-wide scraper reputation networks
  • Regulatory frameworks (EU AI Act implications)

Industry Standards

Expect new standards for AI data collection:

  • Machine-readable licensing for training data
  • Opt-in/opt-out registries
  • Compensation frameworks for content creators
  • Transparency requirements for AI training data

Conclusion: Take Control of Your Content

AI scrapers are harvesting web content at unprecedented scale. Your blog posts, documentation, and product descriptions may already be training the next generation of language models—without your consent or compensation.

You have options:

  1. Do nothing - Accept that your content will be used for AI training
  2. Basic protection - robots.txt and user agent blocking (stops honest bots)
  3. Active detection - Behavioral analysis, TLS fingerprinting, honeypots (stops sophisticated scrapers)

The right choice depends on your content’s value and your tolerance for unauthorized use.

For most publishers and businesses, active detection is the answer. It’s the only approach that catches scrapers who don’t play by the rules.

WebDecoy provides the detection capabilities you need: TLS fingerprint analysis, JavaScript execution verification, request pattern anomaly detection, honeypot monitoring, and geographic consistency checks.

Your content is valuable. Protect it.


Have questions about AI scraper detection? Contact our team or explore our comparison with other bot detection solutions.

Want to see WebDecoy in action?

Get a personalized demo from our team.
