How to Detect and Block AI Scrapers: Complete Guide for GPTBot, ClaudeBot, and More

Your content is being scraped right now. Not by traditional search engines—by AI companies building the next generation of language models. Every blog post, product description, and technical document on your site is potential training data for GPT-5, Claude 4, and dozens of other AI systems.

The question isn’t whether AI scrapers are visiting your site. It’s whether you can detect them—and what you’re going to do about it.

The AI Scraping Landscape in 2025

AI companies need vast amounts of web content to train their models. To get it, they deploy sophisticated crawlers that harvest text from millions of websites. Some announce themselves honestly. Others disguise their identity to avoid blocks.

Known AI Crawler User Agents

Here are the AI scrapers you should know about:

Bot Name            Company        User Agent Contains   Announced
GPTBot              OpenAI         GPTBot                Yes
ChatGPT-User        OpenAI         ChatGPT-User          Yes
ClaudeBot           Anthropic      ClaudeBot             Yes
Claude-Web          Anthropic      Claude-Web            Yes
CCBot               Common Crawl   CCBot                 Yes
PerplexityBot       Perplexity     PerplexityBot         Yes
Amazonbot           Amazon         Amazonbot             Yes
Google-Extended     Google         Google-Extended       Yes
FacebookBot         Meta           FacebookBot           Yes
Bytespider          ByteDance      Bytespider            Yes
Applebot-Extended   Apple          Applebot-Extended     Yes
cohere-ai           Cohere         cohere-ai             Yes
Diffbot             Diffbot        Diffbot               Yes
Omgilibot           Webz.io        Omgilibot             Yes
YouBot              You.com        YouBot                Yes

But here’s the problem: sophisticated scrapers don’t announce themselves. They spoof legitimate browser user agents and mimic human behavior.

Why robots.txt Doesn’t Work

The standard advice for blocking AI scrapers is to update your robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

This approach has three fatal flaws:

1. robots.txt Is Advisory, Not Enforceable

Robots.txt is a gentleman’s agreement. Well-behaved bots respect it. Malicious scrapers ignore it completely. There’s no technical mechanism to force compliance.

2. Sophisticated Scrapers Spoof User Agents

Any scraper can change its user agent string. A Python script using the requests library needs only one extra line to impersonate Chrome:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0'}
response = requests.get('https://yoursite.com', headers=headers)

Now your robots.txt block is useless—the scraper looks like a regular browser.

3. New Scrapers Emerge Constantly

By the time you add a new bot to your robots.txt, it’s already scraped your site. You’re always playing catch-up.

The conclusion is clear: robots.txt is a starting point, not a solution. You need active detection.

Behavioral Detection: How AI Scrapers Give Themselves Away

Even when AI scrapers disguise their identity, their behavior reveals them. Here’s what to look for:

1. Request Patterns

AI scrapers exhibit distinctive patterns:

  • Sequential URL crawling - Systematically visiting every page in order
  • No asset loading - Skipping CSS, JavaScript, and images (they only want text)
  • Consistent timing - Requests at precise intervals (humans are random)
  • Deep crawling - Visiting pagination, archives, and low-value pages humans skip
  • No referrer - Direct requests without coming from search or social

Detection approach:

Normal user session:
  / → /about → /pricing → /contact
  Assets loaded: 47 (CSS, JS, images)
  Time between pages: 15-90 seconds (reading)

AI scraper session:
  /blog/post-1 → /blog/post-2 → /blog/post-3 → /blog/post-4
  Assets loaded: 0
  Time between pages: 0.5-2 seconds (consistent)
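
A minimal sketch of how these signals can be combined into a simple score (the field names and thresholds are illustrative assumptions, not a prescribed configuration):

# Illustrative scoring of one session reconstructed from your access logs.
from statistics import pstdev

def looks_like_scraper(session):
    # session: {'asset_requests': int, 'page_gaps': [seconds between pages], 'referrer': str or None}
    score = 0
    if session['asset_requests'] == 0:            # no CSS/JS/images loaded
        score += 1
    gaps = session['page_gaps']
    if gaps and max(gaps) < 3:                    # every page fetched within seconds
        score += 1
    if len(gaps) >= 5 and pstdev(gaps) < 0.5:     # suspiciously consistent timing
        score += 1
    if not session.get('referrer'):               # never arrived via search or social
        score += 1
    return score >= 3

# A session that pulled pages about one second apart, with no assets and no referrer:
print(looks_like_scraper({'asset_requests': 0,
                          'page_gaps': [1.1, 0.9, 1.0, 1.2, 1.0],
                          'referrer': None}))     # True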

2. TLS Fingerprinting

Every HTTP client stack produces a characteristic TLS fingerprint based on how it negotiates the TLS handshake. This fingerprint, captured as a JA3 or JA4 hash, reveals the actual client software regardless of what the User-Agent header claims.

The key insight: A request claiming to be Chrome but presenting a Python requests TLS fingerprint is lying.

Claimed User Agent   TLS Fingerprint        Verdict
Chrome/120           Chrome JA3 hash        Legitimate
Chrome/120           Python-requests JA3    Spoofed - Block
Chrome/120           curl JA3               Spoofed - Block
Chrome/120           Node.js JA3            Spoofed - Block

WebDecoy’s TLS fingerprinting capabilities catch these mismatches automatically.
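
A minimal sketch of the mismatch check, assuming your edge layer (load balancer, CDN, or packet capture) already computes a JA3 hash per connection; the hash values below are placeholders, not real fingerprints:

# Placeholder JA3 hashes - populate these from your own fingerprint database.
KNOWN_BROWSER_JA3 = {
    'chrome':  {'aa7744226c695c0b2e440419848cf700'},
    'firefox': {'b20b44226c695c0b2e440419848cf7aa'},
}

def is_spoofed(user_agent, ja3_hash):
    # Flag requests whose claimed browser family doesn't match the fingerprint seen on the wire.
    ua = user_agent.lower()
    for family, hashes in KNOWN_BROWSER_JA3.items():
        if family in ua:
            return ja3_hash not in hashes    # claims this browser, fingerprints as something else
    return False                             # unknown UA family: defer to other checks

# Claims Chrome, but the hash isn't any known Chrome fingerprint:
print(is_spoofed('Mozilla/5.0 (Windows NT 10.0) Chrome/120.0.0.0', 'deadbeef00000000'))  # True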

3. JavaScript Execution

AI scrapers typically don’t execute JavaScript. They fetch raw HTML and extract text. This creates a detectable signal.

Detection technique:

  1. Inject a small JavaScript snippet that sets a cookie or calls an endpoint
  2. Check if subsequent requests include the result
  3. No JavaScript execution = likely bot

// Detection snippet - real browsers run this after the DOM loads
document.addEventListener('DOMContentLoaded', function() {
  fetch('/api/beacon', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ js: true, ts: Date.now() })
  });
});

Real browsers execute this. Scrapers don’t.
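
A minimal server-side counterpart to the snippet above (a framework-agnostic sketch; keying sessions by IP plus user agent is an assumption, so adapt it to however you track visitors):

import time

js_verified = {}   # session_key -> timestamp of the last beacon

def record_beacon(session_key):
    # Call this from the handler behind POST /api/beacon.
    js_verified[session_key] = time.time()

def session_ran_javascript(session_key, max_age_seconds=1800):
    # True if this client executed our snippet recently.
    seen = js_verified.get(session_key)
    return seen is not None and (time.time() - seen) < max_age_seconds

A client that has fetched several HTML pages but never hit /api/beacon is very likely pulling raw HTML without a browser.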

4. Honeypot Detection

The most reliable AI scraper detection method: invisible links that only bots follow.

How it works:

  1. Add links to your pages that are invisible to humans (CSS display:none or positioned off-screen)
  2. These links point to decoy pages with unique URLs
  3. Any request to these URLs is definitively a bot—humans can’t see or click them
  4. Zero false positives

<!-- Invisible to humans, visible to scrapers parsing HTML -->
<a href="/content-archive-2024" style="position:absolute;left:-9999px;">Archive</a>

When a scraper follows this link, you’ve caught it with 100% certainty.
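
A sketch of the decoy endpoint itself, here using Flask (the framework choice and the blocklist wiring are assumptions; connect it to your own logging and firewall):

from flask import Flask, request

app = Flask(__name__)
BLOCKED_IPS = set()   # replace with your WAF or firewall integration

@app.route('/content-archive-2024')
def honeypot():
    ip = request.headers.get('X-Forwarded-For', request.remote_addr)
    # Only a crawler parsing raw HTML can reach this URL, so flag it immediately.
    print({'event': 'honeypot_triggered', 'ip': ip,
           'user_agent': request.headers.get('User-Agent')})   # replace with real logging
    BLOCKED_IPS.add(ip)
    return 'Archive', 200   # bland response so the scraper learns nothing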

Learn more about this approach in our endpoint decoys guide.

5. Geographic and Network Signals

AI scrapers often run from:

  • Cloud infrastructure - AWS, GCP, Azure, DigitalOcean
  • Known scraping services - Bright Data, Oxylabs, ScrapingBee
  • Datacenter IPs - Not residential connections

Cross-reference the claimed location (from headers or timezone) with the actual IP geolocation. Mismatches indicate spoofing.

Our geographic consistency detection catches these discrepancies.
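
Two of these checks as a sketch; the ASN list is a tiny illustrative subset, and the country-to-timezone table would need to be far more complete in practice:

DATACENTER_ASNS = {16509, 15169, 8075, 14061}   # AWS, Google, Microsoft, DigitalOcean

def is_datacenter_ip(asn):
    return asn in DATACENTER_ASNS

def timezone_mismatch(claimed_offset_minutes, ip_country):
    # claimed_offset_minutes: what the client reports (e.g. JS Date().getTimezoneOffset()).
    # ip_country: from your IP-geolocation source (MaxMind, CDN headers, etc.).
    plausible = {'US': range(240, 481), 'DE': range(-120, -59)}   # minutes west of UTC
    offsets = plausible.get(ip_country)
    return offsets is not None and claimed_offset_minutes not in offsets

# A "US laptop" session arriving from an AWS ASN with a UTC+8 clock deserves a closer look.
print(is_datacenter_ip(16509), timezone_mismatch(-480, 'US'))   # True True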

Implementing AI Scraper Detection

Option 1: robots.txt (Baseline)

Start with robots.txt to block well-behaved AI crawlers:

# Block AI training crawlers
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Google-Extended
User-agent: FacebookBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: cohere-ai
User-agent: Diffbot
User-agent: Omgilibot
User-agent: YouBot
User-agent: anthropic-ai
User-agent: Scrapy
User-agent: img2dataset
Disallow: /

# Allow legitimate search engines
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Allow: /

Limitation: Only stops honest bots.

Option 2: User Agent Blocking (Basic)

Block requests with known AI scraper user agents at the web server level:

Nginx:

if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider)) {
    return 403;
}

Apache:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider) [NC]
RewriteRule .* - [F,L]

Cloudflare: Create a WAF rule:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot")
→ Block

Limitation: Trivially bypassed by changing user agent.

Option 3: Behavioral Detection (Comprehensive)

For full protection, deploy behavioral detection:

  1. Install WebDecoy SDK on your site
  2. Enable AI scraper detection in dashboard
  3. Configure honeypot links to catch disguised scrapers
  4. Set response actions (block, rate limit, or serve alternative content)

WebDecoy detects AI scrapers through:

  • TLS fingerprint analysis
  • JavaScript execution verification
  • Request pattern anomaly detection
  • Honeypot interaction monitoring
  • Geographic consistency checks

Result: Catch both announced and disguised AI scrapers with near-zero false positives.

Response Strategies: What to Do When You Detect AI Scrapers

Detection is only half the battle. You need a response strategy.

Strategy 1: Block

The simplest approach—return 403 Forbidden or drop the connection.

Pros: Immediate protection
Cons: Scrapers may retry from different IPs

Strategy 2: Rate Limit

Allow some access but throttle aggressive crawling.

Normal users: Unlimited
Detected scrapers: 10 requests/minute, then block

Pros: Less aggressive, catches IP rotation
Cons: Still allows some scraping
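
A sketch of that throttle (in-memory state for illustration; production setups usually back this with Redis or the rate limiter in your WAF or CDN):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
SCRAPER_LIMIT = 10
recent = defaultdict(deque)   # ip -> request timestamps inside the window

def allow_request(ip, is_detected_scraper):
    if not is_detected_scraper:
        return True                      # normal users: unlimited
    now = time.time()
    window = recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                 # drop requests older than one minute
    if len(window) >= SCRAPER_LIMIT:
        return False                     # over 10/minute: block (e.g. respond 429)
    window.append(now)
    return True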

Strategy 3: Serve Alternative Content

Return different content to detected scrapers:

  • Placeholder text - Generic content that’s useless for training
  • Copyright notices - Legal warnings embedded in scraped content
  • Honeypot content - Trackable text that reveals when your content appears in AI outputs

Pros: Doesn’t alert scrapers they’re detected
Cons: More complex to implement
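
A sketch of the honeypot-content variant: detected scrapers get placeholder text plus a unique, trackable "canary" string, so a later appearance of that text in an AI model's output can be traced back to the scrape (the helper names and page structure are illustrative):

import hashlib

def canary_phrase(ip, path):
    token = hashlib.sha256(f'{ip}:{path}'.encode()).hexdigest()[:12]
    return f'Reference code {token}.'

def response_body(original_html, is_detected_scraper, ip, path):
    if not is_detected_scraper:
        return original_html
    # Generic placeholder plus a per-visitor canary embedded in the text.
    return ('<html><body><p>This content is available to readers on our website.</p>'
            f'<p>{canary_phrase(ip, path)}</p></body></html>')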

Strategy 4: Tarpit

Slow down responses dramatically for detected scrapers:

if is_ai_scraper(request):       # is_ai_scraper(): your own detection check
    time.sleep(30)               # 30-second delay before responding
    return minimal_response()    # serve a stripped-down page

Pros: Wastes scraper resources, discourages continued crawling
Cons: Keeps connections open longer

Strategy 5: Legal Action

For persistent, commercial-scale scraping:

  1. Document the scraping activity
  2. Send cease-and-desist to the company
  3. File DMCA takedown if content appears in AI outputs
  4. Consider litigation for copyright infringement

Note: Legal approaches are slow and expensive. Technical prevention is more practical.

Protecting Specific Content Types

Blog Posts and Articles

Your written content is prime AI training data. Protect it with:

  • Honeypot links in article footers
  • JavaScript-gated content sections
  • Rate limiting on /blog/* paths

Product Descriptions

E-commerce content is valuable for AI product understanding:

  • Render critical details via JavaScript
  • Use honeypot product pages
  • Monitor for scraping patterns on catalog pages

Documentation

Technical docs are heavily targeted:

  • Consider authentication for detailed docs
  • Use honeypots in code examples
  • Rate limit documentation API endpoints

User-Generated Content

Forums, reviews, and comments are scraped for training:

  • Implement JavaScript rendering requirements
  • Add invisible honeypot posts
  • Monitor bulk access patterns

Monitoring and Alerting

Set up ongoing monitoring for AI scraping activity:

Metrics to Track

  • Requests from known AI crawler user agents
  • Requests with datacenter IP addresses
  • Sessions without JavaScript execution
  • Honeypot trigger events
  • TLS fingerprint mismatches

Alert Thresholds

  • Low: 100+ requests/hour from single IP without JS
  • Medium: Honeypot triggered by new IP
  • High: Coordinated scraping from IP range
  • Critical: Bulk content access pattern detected
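
A sketch of mapping detection signals to these levels (the signal field names are assumptions about what your detection layer emits):

def alert_level(signal):
    if signal.get('bulk_content_access'):
        return 'critical'
    if signal.get('coordinated_ip_range'):
        return 'high'
    if signal.get('honeypot_triggered') and signal.get('new_ip'):
        return 'medium'
    if signal.get('requests_per_hour', 0) >= 100 and not signal.get('js_executed', True):
        return 'low'
    return 'none'

print(alert_level({'requests_per_hour': 340, 'js_executed': False}))   # low
print(alert_level({'honeypot_triggered': True, 'new_ip': True}))       # medium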

Integration with SIEM

Send AI scraper detection events to your SIEM for correlation:

{
  "event_type": "ai_scraper_detected",
  "timestamp": "2025-12-08T14:30:00Z",
  "source_ip": "52.12.34.56",
  "user_agent": "Mozilla/5.0 Chrome/120.0.0.0",
  "true_identity": "python-requests/2.28",
  "detection_method": "tls_fingerprint_mismatch",
  "pages_accessed": 47,
  "honeypot_triggered": true
}

See our SIEM integration guide for setup instructions.
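
As a rough illustration, most SIEMs accept JSON events over an authenticated HTTP collector, so shipping the event can be as simple as the sketch below (the endpoint URL and token are placeholders):

import json
import requests

SIEM_ENDPOINT = 'https://siem.example.com/ingest'   # placeholder collector URL
SIEM_TOKEN = 'REPLACE_ME'                           # placeholder credential

def send_siem_event(event):
    resp = requests.post(
        SIEM_ENDPOINT,
        headers={'Authorization': f'Bearer {SIEM_TOKEN}',
                 'Content-Type': 'application/json'},
        data=json.dumps(event),
        timeout=5,
    )
    return resp.ok

send_siem_event({
    'event_type': 'ai_scraper_detected',
    'detection_method': 'tls_fingerprint_mismatch',
    'source_ip': '52.12.34.56',
    'honeypot_triggered': True,
})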

Legal Considerations

AI scraping raises complex legal questions:

Your content is copyrighted. Unauthorized copying for commercial AI training may constitute infringement. However, “fair use” arguments are being tested in courts.

Key cases to watch:

  • New York Times v. OpenAI
  • Getty Images v. Stability AI
  • Authors Guild v. OpenAI

Terms of Service

Your ToS can prohibit automated scraping. While not always enforceable against anonymous scrapers, it strengthens your legal position against known companies.

robots.txt is not legally binding, but ignoring it may be evidence of bad faith in copyright litigation.

Recommendation: Maintain clear ToS prohibiting AI training use, implement technical protections, and document scraping activity for potential legal action.

Future of AI Scraping

The cat-and-mouse game will intensify:

Scraper Evolution

  • More sophisticated browser emulation
  • Residential proxy networks
  • Human-in-the-loop verification bypass
  • Distributed, slow crawling to avoid detection

Defense Evolution

  • AI-powered scraper detection (using AI to catch AI)
  • Cryptographic content authentication
  • Industry-wide scraper reputation networks
  • Regulatory frameworks (EU AI Act implications)

Industry Standards

Expect new standards for AI data collection:

  • Machine-readable licensing for training data
  • Opt-in/opt-out registries
  • Compensation frameworks for content creators
  • Transparency requirements for AI training data

Conclusion: Take Control of Your Content

AI scrapers are harvesting web content at unprecedented scale. Your blog posts, documentation, and product descriptions may already be training the next generation of language models—without your consent or compensation.

You have options:

  1. Do nothing - Accept that your content will be used for AI training
  2. Basic protection - robots.txt and user agent blocking (stops honest bots)
  3. Active detection - Behavioral analysis, TLS fingerprinting, honeypots (stops sophisticated scrapers)

The right choice depends on your content’s value and your tolerance for unauthorized use.

For most publishers and businesses, active detection is the answer. It’s the only approach that catches scrapers who don’t play by the rules.

WebDecoy provides the detection capabilities you need: TLS fingerprint analysis, JavaScript execution verification, request pattern anomaly detection, honeypot monitoring, and geographic consistency checks.

Your content is valuable. Protect it.


Have questions about AI scraper detection? Contact our team or explore our comparison with other bot detection solutions.

Want to see WebDecoy in action?

Get a personalized demo from our team.
