Detect AI Scrapers: Block GPTBot, ClaudeBot & More
Stop AI crawlers from scraping your content for training data. Detect GPTBot, ClaudeBot, CCBot, and 20+ AI scrapers with behavioral analysis and honeypots.
WebDecoy Security Team
How to Detect and Block AI Scrapers: Complete Guide for GPTBot, ClaudeBot, and More
Your content is being scraped right now. Not by traditional search engines—by AI companies building the next generation of language models. Every blog post, product description, and technical document on your site is potential training data for GPT-5, Claude 4, and dozens of other AI systems.
The question isn’t whether AI scrapers are visiting your site. It’s whether you can detect them—and what you’re going to do about it.
The AI Scraping Landscape in 2025
AI companies need vast amounts of web content to train their models. To get it, they deploy sophisticated crawlers that harvest text from millions of websites. Some announce themselves honestly. Others disguise their identity to avoid blocks.
Known AI Crawler User Agents
Here are the AI scrapers you should know about:
| Bot Name | Company | User Agent Contains | Announced |
|---|---|---|---|
| GPTBot | OpenAI | GPTBot | Yes |
| ChatGPT-User | OpenAI | ChatGPT-User | Yes |
| ClaudeBot | Anthropic | ClaudeBot | Yes |
| Claude-Web | Anthropic | Claude-Web | Yes |
| CCBot | Common Crawl | CCBot | Yes |
| PerplexityBot | Perplexity | PerplexityBot | Yes |
| Amazonbot | Amazon | Amazonbot | Yes |
| Google-Extended | Google | Google-Extended | Yes |
| FacebookBot | Meta | FacebookBot | Yes |
| Bytespider | ByteDance | Bytespider | Yes |
| Applebot-Extended | Apple | Applebot-Extended | Yes |
| cohere-ai | Cohere | cohere-ai | Yes |
| Diffbot | Diffbot | Diffbot | Yes |
| Omgilibot | Webz.io | Omgilibot | Yes |
| YouBot | You.com | YouBot | Yes |
But here’s the problem: sophisticated scrapers don’t announce themselves. They spoof legitimate browser user agents and mimic human behavior.
Why robots.txt Doesn’t Work
The standard advice for blocking AI scrapers is to update your robots.txt:
```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

This approach has three fatal flaws:
1. robots.txt Is Advisory, Not Enforceable
Robots.txt is a gentleman’s agreement. Well-behaved bots respect it. Malicious scrapers ignore it completely. There’s no technical mechanism to force compliance.
2. Sophisticated Scrapers Spoof User Agents
Any scraper can change its user agent string. A Python script using the requests library needs only a couple of lines to impersonate Chrome:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0'}
response = requests.get('https://yoursite.com', headers=headers)
```

Now your robots.txt block is useless—the scraper looks like a regular browser.
3. New Scrapers Emerge Constantly
By the time you add a new bot to your robots.txt, it’s already scraped your site. You’re always playing catch-up.
The conclusion is clear: robots.txt is a starting point, not a solution. You need active detection.
Behavioral Detection: How AI Scrapers Give Themselves Away
Even when AI scrapers disguise their identity, their behavior reveals them. Here’s what to look for:
1. Request Patterns
AI scrapers exhibit distinctive patterns (a scoring sketch follows the list):
- Sequential URL crawling - Systematically visiting every page in order
- No asset loading - Skipping CSS, JavaScript, and images (they only want text)
- Consistent timing - Requests at precise intervals (humans are random)
- Deep crawling - Visiting pagination, archives, and low-value pages humans skip
- No referrer - Direct requests without coming from search or social
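A minimal sketch of turning these signals into a per-session score, assuming your log pipeline already groups requests by client (the thresholds and the `scraper_score` helper are illustrative, not a fixed rule set):

```python
from statistics import pstdev

# Each session is a list of (timestamp_seconds, path, referrer) tuples,
# already grouped per client by your access-log pipeline.
ASSET_SUFFIXES = ('.css', '.js', '.png', '.jpg', '.svg', '.woff2')

def scraper_score(session):
    """Return a rough 0-3 suspicion score for one session."""
    score = 0
    paths = [path for _, path, _ in session]

    # No asset loading: text-only fetching is typical of training crawlers
    if not any(path.endswith(ASSET_SUFFIXES) for path in paths):
        score += 1

    # Machine-like timing: near-constant gaps between page requests
    times = sorted(ts for ts, _, _ in session)
    gaps = [b - a for a, b in zip(times, times[1:])]
    if len(gaps) >= 3 and pstdev(gaps) < 0.5:
        score += 1

    # No referrer on any request: never arrived via search or social
    if all(not referrer for _, _, referrer in session):
        score += 1

    return score
```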
Detection approach:

```text
Normal user session:
/ → /about → /pricing → /contact
Assets loaded: 47 (CSS, JS, images)
Time between pages: 15-90 seconds (reading)

AI scraper session:
/blog/post-1 → /blog/post-2 → /blog/post-3 → /blog/post-4
Assets loaded: 0
Time between pages: 0.5-2 seconds (consistent)
```

2. TLS Fingerprinting
Every HTTP client has a unique TLS fingerprint based on how it negotiates the SSL/TLS handshake. This fingerprint—captured as a JA3 or JA4 hash—reveals the true identity of the client.
The key insight: A request claiming to be Chrome but presenting a Python requests TLS fingerprint is lying.
| Claimed User Agent | TLS Fingerprint | Verdict |
|---|---|---|
| Chrome/120 | Chrome JA3 hash | Legitimate |
| Chrome/120 | Python-requests JA3 | Spoofed - Block |
| Chrome/120 | curl JA3 | Spoofed - Block |
| Chrome/120 | Node.js JA3 | Spoofed - Block |
WebDecoy’s TLS fingerprinting capabilities catch these mismatches automatically.
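The comparison itself is simple once your TLS terminator (nginx, HAProxy, or a CDN) exposes a JA3 hash for each connection. A sketch of the mismatch check, with placeholder hash values rather than an authoritative fingerprint list:

```python
# Placeholder values: in practice, populate this from fingerprints observed in
# your own browser traffic or from a maintained JA3 reference list.
KNOWN_BROWSER_JA3 = {
    "chrome": {"<ja3-hash-of-real-chrome>"},
    "firefox": {"<ja3-hash-of-real-firefox>"},
}

def ua_family(user_agent: str) -> str | None:
    """Very rough mapping from a claimed User-Agent string to a browser family."""
    ua = user_agent.lower()
    if "firefox" in ua:
        return "firefox"
    if "chrome" in ua:
        return "chrome"
    return None

def is_spoofed(user_agent: str, ja3_hash: str) -> bool:
    """Flag clients whose claimed browser doesn't match their TLS fingerprint."""
    family = ua_family(user_agent)
    if family is None:
        return False  # unknown UA: defer to other signals
    return ja3_hash not in KNOWN_BROWSER_JA3[family]
```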
3. JavaScript Execution
AI scrapers typically don’t execute JavaScript. They fetch raw HTML and extract text. This creates a detectable signal.
Detection technique:
- Inject a small JavaScript snippet that sets a cookie or calls an endpoint
- Check if subsequent requests include the result
- No JavaScript execution = likely bot
```js
// Detection snippet
document.addEventListener('DOMContentLoaded', function() {
  fetch('/api/beacon', {
    method: 'POST',
    body: JSON.stringify({ js: true, ts: Date.now() })
  });
});
```

Real browsers execute this. Scrapers don't.
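On the server side, the beacon only needs to mark the client as having executed JavaScript; clients that fetch several pages without ever calling it are flagged. A minimal sketch using Flask (the route, storage, and threshold are assumptions, not a WebDecoy API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
js_verified = set()  # in production, keep this in Redis keyed by session or IP

@app.post("/api/beacon")
def beacon():
    # Only JavaScript-executing clients ever reach this endpoint
    js_verified.add(request.remote_addr)
    return jsonify(ok=True)

def likely_scraper(ip: str, pages_seen: int) -> bool:
    # Several page views with no beacon call is a strong "no JavaScript" signal
    return pages_seen >= 5 and ip not in js_verified
```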
4. Honeypot Detection
The most reliable AI scraper detection method: invisible links that only bots follow.
How it works:
- Add links to your pages that are invisible to humans (CSS `display:none` or positioned off-screen)
- These links point to decoy pages with unique URLs
- Any request to these URLs is definitively a bot—humans can't see or click them
- Zero false positives

```html
<!-- Invisible to humans, visible to scrapers parsing HTML -->
<a href="/content-archive-2024" style="position:absolute;left:-9999px;">Archive</a>
```

When a scraper follows this link, you've caught it with 100% certainty.
Learn more about this approach in our endpoint decoys guide.
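Server-side, the decoy URL just needs a handler that records the hit and blocks the client. A sketch of that handler (the path matches the example above; the in-memory block list is a stand-in for your WAF or firewall):

```python
import logging

from flask import Flask, abort, request

app = Flask(__name__)
blocked_ips = set()  # stand-in for a WAF rule, firewall API, or shared datastore

@app.route("/content-archive-2024")
def honeypot():
    # Humans can't see the invisible link, so anything arriving here is a bot
    ip = request.remote_addr
    logging.warning("Honeypot hit from %s (UA: %s)", ip, request.user_agent.string)
    blocked_ips.add(ip)
    abort(403)

@app.before_request
def enforce_block():
    if request.remote_addr in blocked_ips:
        abort(403)
```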
5. Geographic and Network Signals
AI scrapers often run from:
- Cloud infrastructure - AWS, GCP, Azure, DigitalOcean
- Known scraping services - Bright Data, Oxylabs, ScrapingBee
- Datacenter IPs - Not residential connections
Cross-reference the claimed location (from headers or timezone) with the actual IP geolocation. Mismatches indicate spoofing.
Our geographic consistency detection catches these discrepancies.
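A sketch of the cross-check, assuming you already resolve IPs to a country and hosting organization with a GeoIP/ASN database (the lookup function here is a stub to replace with your own):

```python
DATACENTER_KEYWORDS = ("amazon", "google cloud", "microsoft", "digitalocean", "ovh", "hetzner")

def geoip_lookup(ip: str) -> dict:
    """Stub: replace with a real GeoIP/ASN lookup, e.g. a local MaxMind database."""
    raise NotImplementedError  # expected shape: {"country": "US", "asn_org": "Amazon.com, Inc."}

def claimed_region(accept_language: str) -> str | None:
    """Region subtag of the first Accept-Language entry: 'en-US,en;q=0.9' -> 'US'."""
    if not accept_language:
        return None
    first = accept_language.split(",")[0].split(";")[0].strip()
    parts = first.split("-")
    return parts[1].upper() if len(parts) == 2 else None

def geo_suspicious(ip: str, accept_language: str) -> bool:
    info = geoip_lookup(ip)
    if any(keyword in info["asn_org"].lower() for keyword in DATACENTER_KEYWORDS):
        return True  # datacenter IP presenting itself as an ordinary visitor
    region = claimed_region(accept_language)
    return region is not None and region != info["country"]
```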
Implementing AI Scraper Detection
Option 1: robots.txt (Baseline)
Start with robots.txt to block well-behaved AI crawlers:
```text
# Block AI training crawlers
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: Google-Extended
User-agent: FacebookBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: cohere-ai
User-agent: Diffbot
User-agent: Omgilibot
User-agent: YouBot
User-agent: anthropic-ai
User-agent: Scrapy
User-agent: img2dataset
Disallow: /

# Allow legitimate search engines
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Allow: /
```

Limitation: Only stops honest bots.
Option 2: User Agent Blocking (Basic)
Block requests with known AI scraper user agents at the web server level:
Nginx:
```nginx
if ($http_user_agent ~* (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider)) {
    return 403;
}
```

Apache:

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider) [NC]
RewriteRule .* - [F,L]
```

Cloudflare: Create a WAF rule:

```text
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot")
→ Block
```

Limitation: Trivially bypassed by changing the user agent.
Option 3: Behavioral Analysis with WebDecoy (Recommended)
For comprehensive protection, deploy behavioral detection:
- Install WebDecoy SDK on your site
- Enable AI scraper detection in dashboard
- Configure honeypot links to catch disguised scrapers
- Set response actions (block, rate limit, or serve alternative content)
WebDecoy detects AI scrapers through:
- TLS fingerprint analysis
- JavaScript execution verification
- Request pattern anomaly detection
- Honeypot interaction monitoring
- Geographic consistency checks
Result: Catch both announced and disguised AI scrapers with near-zero false positives.
Response Strategies: What to Do When You Detect AI Scrapers
Detection is only half the battle. You need a response strategy.
Strategy 1: Block
The simplest approach—return 403 Forbidden or drop the connection.
Pros: Immediate protection
Cons: Scrapers may retry from different IPs
Strategy 2: Rate Limit
Allow some access but throttle aggressive crawling.
```text
Normal users: Unlimited
Detected scrapers: 10 requests/minute, then block
```

Pros: Less aggressive, catches IP rotation
Cons: Still allows some scraping
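A minimal in-memory sliding-window limiter for flagged clients (in production this state usually lives in Redis; the 10-per-minute figure matches the policy above):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
SCRAPER_LIMIT = 10  # requests per window for detected scrapers
_hits: dict[str, deque] = defaultdict(deque)

def allow_request(ip: str, is_detected_scraper: bool) -> bool:
    if not is_detected_scraper:
        return True  # normal users are never throttled
    now = time.monotonic()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop hits that fell out of the sliding window
    if len(window) >= SCRAPER_LIMIT:
        return False  # over the limit: block or delay this request
    window.append(now)
    return True
```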
Strategy 3: Serve Alternative Content
Return different content to detected scrapers:
- Placeholder text - Generic content that’s useless for training
- Copyright notices - Legal warnings embedded in scraped content
- Honeypot content - Trackable text that reveals when your content appears in AI outputs
Pros: Doesn't alert scrapers that they've been detected
Cons: More complex to implement
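One way to wire this up is to branch on the detection result before rendering the page and return a neutral placeholder instead (the detection helper and template name here are illustrative):

```python
from flask import Flask, render_template, request

app = Flask(__name__)

PLACEHOLDER = "<html><body><p>Content unavailable.</p></body></html>"

def looks_like_ai_scraper() -> bool:
    # Illustrative stub: combine your real signals here (UA lists, TLS mismatch, honeypot history)
    return "GPTBot" in (request.headers.get("User-Agent") or "")

@app.route("/blog/<slug>")
def blog_post(slug):
    if looks_like_ai_scraper():
        # Detected scrapers get generic, low-value markup instead of the article,
        # optionally seeded with trackable phrases for later attribution
        return PLACEHOLDER
    return render_template("post.html", slug=slug)
```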
Strategy 4: Tarpit
Slow down responses dramatically for detected scrapers:
```python
import time

if is_ai_scraper(request):
    time.sleep(30)  # 30-second delay wastes the scraper's time
    return minimal_response()
```

Pros: Wastes scraper resources, discourages continued crawling
Cons: Keeps connections open longer
Strategy 5: Legal Action
For persistent, commercial-scale scraping:
- Document the scraping activity
- Send cease-and-desist to the company
- File DMCA takedown if content appears in AI outputs
- Consider litigation for copyright infringement
Note: Legal approaches are slow and expensive. Technical prevention is more practical.
Protecting Specific Content Types
Blog Posts and Articles
Your written content is prime AI training data. Protect it with:
- Honeypot links in article footers
- JavaScript-gated content sections
- Rate limiting on `/blog/*` paths
Product Descriptions
E-commerce content is valuable for AI product understanding:
- Render critical details via JavaScript
- Use honeypot product pages
- Monitor for scraping patterns on catalog pages
Documentation
Technical docs are heavily targeted:
- Consider authentication for detailed docs
- Use honeypots in code examples
- Rate limit documentation API endpoints
User-Generated Content
Forums, reviews, and comments are scraped for training:
- Implement JavaScript rendering requirements
- Add invisible honeypot posts
- Monitor bulk access patterns
Monitoring and Alerting
Set up ongoing monitoring for AI scraping activity:
Metrics to Track
- Requests from known AI crawler user agents
- Requests with datacenter IP addresses
- Sessions without JavaScript execution
- Honeypot trigger events
- TLS fingerprint mismatches
Alert Thresholds
- Low: 100+ requests/hour from single IP without JS
- Medium: Honeypot triggered by new IP
- High: Coordinated scraping from IP range
- Critical: Bulk content access pattern detected
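These thresholds can be expressed as a simple mapping over whatever counters your detection layer already keeps per hour. A sketch with illustrative field names:

```python
def alert_level(stats: dict) -> str | None:
    """Map hourly detection counters to a severity; field names are illustrative."""
    if stats.get("bulk_content_access"):
        return "critical"
    if stats.get("coordinated_ip_range"):
        return "high"
    if stats.get("honeypot_hits_new_ip", 0) > 0:
        return "medium"
    if stats.get("requests_without_js", 0) >= 100:
        return "low"
    return None
```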
Integration with SIEM
Send AI scraper detection events to your SIEM for correlation:
```json
{
  "event_type": "ai_scraper_detected",
  "timestamp": "2025-12-08T14:30:00Z",
  "source_ip": "52.12.34.56",
  "user_agent": "Mozilla/5.0 Chrome/120.0.0.0",
  "true_identity": "python-requests/2.28",
  "detection_method": "tls_fingerprint_mismatch",
  "pages_accessed": 47,
  "honeypot_triggered": true
}
```

See our SIEM integration guide for setup instructions.
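Most SIEMs accept JSON events over an HTTP collector; a sketch of shipping the event above, with a placeholder endpoint and token rather than a specific vendor's API:

```python
import requests

SIEM_ENDPOINT = "https://siem.example.com/api/events"  # placeholder collector URL
SIEM_TOKEN = "replace-with-your-token"                 # placeholder credential

def send_detection_event(event: dict) -> None:
    requests.post(
        SIEM_ENDPOINT,
        json=event,
        headers={"Authorization": f"Bearer {SIEM_TOKEN}"},
        timeout=5,
    )
```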
The Legal Landscape
AI scraping raises complex legal questions:
Copyright
Your content is copyrighted. Unauthorized copying for commercial AI training may constitute infringement. However, “fair use” arguments are being tested in courts.
Key cases to watch:
- New York Times v. OpenAI
- Getty Images v. Stability AI
- Authors Guild v. OpenAI
Terms of Service
Your ToS can prohibit automated scraping. While not always enforceable against anonymous scrapers, it strengthens your legal position against known companies.
robots.txt Legal Status
robots.txt is not legally binding, but ignoring it may be evidence of bad faith in copyright litigation.
Recommendation: Maintain clear ToS prohibiting AI training use, implement technical protections, and document scraping activity for potential legal action.
Future of AI Scraping
The cat-and-mouse game will intensify:
Scraper Evolution
- More sophisticated browser emulation
- Residential proxy networks
- Human-in-the-loop verification bypass
- Distributed, slow crawling to avoid detection
Defense Evolution
- AI-powered scraper detection (using AI to catch AI)
- Cryptographic content authentication
- Industry-wide scraper reputation networks
- Regulatory frameworks (EU AI Act implications)
Industry Standards
Expect new standards for AI data collection:
- Machine-readable licensing for training data
- Opt-in/opt-out registries
- Compensation frameworks for content creators
- Transparency requirements for AI training data
Conclusion: Take Control of Your Content
AI scrapers are harvesting web content at unprecedented scale. Your blog posts, documentation, and product descriptions may already be training the next generation of language models—without your consent or compensation.
You have options:
- Do nothing - Accept that your content will be used for AI training
- Basic protection - robots.txt and user agent blocking (stops honest bots)
- Active detection - Behavioral analysis, TLS fingerprinting, honeypots (stops sophisticated scrapers)
The right choice depends on your content’s value and your tolerance for unauthorized use.
For most publishers and businesses, active detection is the answer. It’s the only approach that catches scrapers who don’t play by the rules.
WebDecoy provides the detection capabilities you need:
- Behavioral bot detection that catches disguised scrapers
- TLS fingerprinting to identify spoofed user agents
- Honeypot decoys for zero-false-positive detection
- SIEM integration for enterprise visibility
Your content is valuable. Protect it.
Have questions about AI scraper detection? Contact our team or explore our comparison with other bot detection solutions.
Want to see WebDecoy in action?
Get a personalized demo from our team.