Protect Your Content from AI Training Scrapers
Practical methods to stop AI companies from using your website content to train models. Covers robots.txt, TDM headers, detection, and enforcement.
WebDecoy Team
Your best blog post took twelve hours to write. An AI crawler can copy it in 200 milliseconds.
That’s not hypothetical. Right now, dozens of AI companies are systematically crawling the web to build training datasets. Your content, your expertise, your original research is being fed into models that will compete with you for the same audience. And most site owners have no idea it’s happening.
The good news: you’re not powerless. There are real, practical steps you can take today to protect your content from unauthorized AI training. Some are simple configuration changes. Others require active detection and enforcement. Let’s walk through all of them.
Table of Contents
- How AI Training Scrapers Work
- Layer 1: robots.txt Directives
- Layer 2: TDM Reservation Protocol
- Layer 3: Meta Tags and HTTP Headers
- Layer 4: Detecting Unauthorized Scraping
- Layer 5: Active Defense with Decoys
- The Legal Landscape
- Putting It All Together
- Limitations and Honest Tradeoffs
How AI Training Scrapers Work
Before you can protect against something, you need to understand how it operates. AI training scrapers come in three flavors:
| Type | Examples | Behavior | Respects robots.txt? |
|---|---|---|---|
| Announced crawlers | GPTBot, ClaudeBot, Google-Extended | Identify themselves via user agent | Usually yes |
| Stealth scrapers | Unnamed data vendors, academic crawlers | Use generic or spoofed user agents | Rarely |
| RAG fetchers | Perplexity, SearchGPT, AI assistants | Fetch content at query time, not for training | Varies |
The announced crawlers are the easiest to deal with. They play by the rules (mostly) and respond to standard protocols. The stealth scrapers are a different problem entirely. They rotate IPs, spoof headers, and look like regular browser traffic. Blocking them requires behavioral detection, not just configuration files.
For a deeper look at how RAG bots differ from training scrapers, see our post on the RAG bot problem.
Layer 1: robots.txt Directives
This is your first line of defense and the one you should implement immediately. Every major AI company has published a user agent string for their training crawler, and most of them respect robots.txt disallow rules.
Add this block to your robots.txt:
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Important: Don't use a blanket User-agent: * disallow unless you want to block search engines too. Be specific about which bots you're blocking.
If you only want to protect certain sections (like your blog), you can scope the rules:
```
User-agent: GPTBot
Disallow: /blog/
Disallow: /docs/
Allow: /
```

What robots.txt won't do
Let’s be honest about the limitations. robots.txt is a voluntary protocol. There is no technical enforcement. A scraper that ignores your robots.txt will still be able to crawl your site freely.
That said, ignoring robots.txt creates legal liability. If you’ve explicitly opted out and a company scrapes you anyway, that strengthens your position significantly in any legal dispute. Think of robots.txt as a posted “No Trespassing” sign. It won’t physically stop anyone, but it establishes intent.
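It's easy to get these rules subtly wrong (a missing Disallow, a typo in a user agent string), so it's worth sanity-checking them before deploying. Python's standard library includes a robots.txt parser; here's a minimal sketch, with illustrative rules and paths:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules in the style of the blocks above
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /blog/
"""

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(is_allowed(ROBOTS_TXT, "GPTBot", "/blog/post"))   # False: blocked sitewide
print(is_allowed(ROBOTS_TXT, "ClaudeBot", "/about"))    # True: only /blog/ is blocked
```

Run the same check against your real robots.txt content before and after edits to confirm each crawler sees what you intend.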
Layer 2: TDM Reservation Protocol
The W3C’s TDM Reservation Protocol (TDMRep) is a newer standard designed specifically for this problem. It lets you declare machine-readable policies about text and data mining rights.
Create a tdmrep.json file at your site root (the TDMRep spec also defines a well-known discovery path, /.well-known/tdmrep.json, where crawlers can look for it directly):

```json
[
  {
    "location": "/*",
    "tdm-reservation": 1
  }
]
```

Then reference it in your HTML <head>:
```html
<link rel="tdm-reservation" href="/tdmrep.json" />
```

You can also set it via HTTP header:
```
TDM-Reservation: 1
```

In an Nginx config, that looks like:
```nginx
server {
    # Block AI training data mining
    add_header TDM-Reservation "1" always;
    add_header X-Robots-Tag "noai, noimageai" always;
}
```

The TDM protocol is particularly important in the EU, where the Digital Single Market Directive (Article 4) gives publishers an explicit right to opt out of text and data mining. Having a machine-readable reservation in place is how you exercise that right.
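Once deployed, you can verify the header programmatically. A minimal sketch; it checks a headers mapping you supply (for example, the headers from a response after fetching your homepage), so there's no network code here:

```python
def tdm_reserved(headers: dict[str, str]) -> bool:
    """True if a response opts out via the TDM-Reservation header.
    HTTP header names are case-insensitive, so normalize before checking."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"

print(tdm_reserved({"TDM-Reservation": "1", "Content-Type": "text/html"}))  # True
print(tdm_reserved({"Content-Type": "text/html"}))                          # False
```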
Layer 3: Meta Tags and HTTP Headers
Several proposed meta tags have emerged to signal AI training preferences. Browser and crawler support varies, but implementing them costs nothing and strengthens your legal position.
The noai and noimageai directives
```html
<meta name="robots" content="noai, noimageai" />
```

This tells crawlers not to use your text or images for AI training. Note that these values are not part of any formal standard and support is inconsistent; Google, for instance, handles AI training opt-outs through the Google-Extended robots.txt token rather than these directives. They still cost nothing to add and document your intent.
DeviantArt’s AI training opt-out
DeviantArt introduced a specific meta tag that some scrapers now check:

```html
<meta name="AI-Training-Opt-Out" content="disallow" />
```

Combining headers in your web server
Here's a more complete Nginx configuration. Note that map and limit_req_zone must be declared in the http context, not inside a server block:

```nginx
# In the http {} context: the rate-limit key is empty (unlimited)
# for normal visitors and the client IP for known AI crawlers.
map $http_user_agent $ai_crawler {
    default                                  "";
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider)"  $binary_remote_addr;
}
limit_req_zone $ai_crawler zone=ai_crawlers:10m rate=5r/m;

server {
    listen 443 ssl;
    server_name yourdomain.com;

    # AI training opt-out headers
    add_header X-Robots-Tag "noai, noimageai" always;
    add_header TDM-Reservation "1" always;

    location / {
        # Requests whose key is empty bypass the limit entirely
        limit_req zone=ai_crawlers burst=5;
        proxy_pass http://backend;
    }
}
```

For Apache servers:
```apache
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noai, noimageai"
    Header set TDM-Reservation "1"
</IfModule>
```

Layer 4: Detecting Unauthorized Scraping
Configuration-based protections only work against crawlers that play fair. For the rest, you need detection.
Log analysis for known AI crawlers
Start by checking if you’re already being scraped. Run this against your access logs:
```bash
# With the default "combined" log format, splitting on double quotes
# puts the user agent in field 6; the IP is the first word of field 1.
grep -iE "(GPTBot|ClaudeBot|CCBot|Bytespider|anthropic|cohere|perplexity)" \
  /var/log/nginx/access.log | \
  awk -F'"' '{split($1, a, " "); print a[1], $6}' | \
  sort | uniq -c | sort -rn | head -20
```

This gives you a count of requests per IP and user agent for known AI crawlers. If you see hundreds or thousands of requests, someone is actively scraping your content.
Behavioral detection for stealth scrapers
The harder problem is scrapers that don’t announce themselves. These crawlers share a set of behavioral patterns you can watch for:
Sequential page access. Real users jump around a site. Scrapers tend to crawl systematically, hitting pages in order or following every link on every page.
No asset loading. Browsers fetch CSS, JavaScript, images, and fonts. Scrapers usually skip all of that. If a “visitor” loads 50 HTML pages but zero stylesheets, that’s not a human.
Abnormal timing. Humans take seconds between page loads. Automated scrapers often fire requests in rapid bursts with millisecond precision.
Missing or inconsistent TLS fingerprints. Modern browsers have distinctive TLS handshake patterns. Headless browsers and HTTP libraries produce different fingerprints entirely. Check out our TLS fingerprinting capabilities for more on this approach.
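None of these signals is conclusive on its own, but they compound. A toy scorer shows the idea; the Session fields and thresholds here are illustrative, not tuned values:

```python
from dataclasses import dataclass

@dataclass
class Session:
    """Aggregate stats for one visitor session (fields are illustrative)."""
    html_requests: int       # pages fetched
    asset_requests: int      # CSS/JS/image/font fetches
    sequential_ratio: float  # fraction of requests following site order, 0..1
    min_gap_ms: float        # shortest gap between consecutive requests

def scraper_score(s: Session) -> int:
    """One point per suspicious signal; 2+ warrants a closer look."""
    score = 0
    if s.html_requests >= 20 and s.asset_requests == 0:
        score += 1  # loads pages but never their assets
    if s.sequential_ratio > 0.9:
        score += 1  # crawls the site in order, unlike a human
    if s.min_gap_ms < 100:
        score += 1  # sub-100ms bursts between page loads
    return score

bot = Session(html_requests=50, asset_requests=0, sequential_ratio=0.97, min_gap_ms=12)
human = Session(html_requests=6, asset_requests=40, sequential_ratio=0.3, min_gap_ms=2400)
print(scraper_score(bot), scraper_score(human))  # 3 0
```

A production system would tune these thresholds against real traffic and add more signals, but the structure (accumulate weak evidence, act on the total) is the same.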
WebDecoy’s detection engine analyzes all of these signals in real time. When a visitor’s behavior crosses the threshold, you can choose to block, throttle, or serve alternative content. For implementation details, see our complete guide to AI-powered bot detection.
Layer 5: Active Defense with Decoys
This is where protection gets interesting. Instead of just blocking scrapers, you can actively poison their data collection.
Honeypot content
Embed invisible links on your pages that only automated crawlers will follow. These links lead to pages filled with nonsensical or misleading text. If that text shows up in an AI model’s output, you have clear evidence that your site was scraped and the content was used for training.
```html
<!-- Invisible to real users, visible to crawlers -->
<a href="/research/publications/2026/"
   style="position:absolute;left:-9999px;opacity:0"
   aria-hidden="true"
   tabindex="-1">Research Archive</a>
```

The page at /research/publications/2026/ can contain unique, traceable text, such as a fake statistic or a made-up term that you can later search for in AI model outputs.
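One way to make the trap text traceable is to derive it from who was served the page. Here's a sketch using an HMAC; the secret and the naming scheme are placeholders. Because the same visitor and date always produce the same token, a later sighting in model output can be tied back to a specific crawl via your logs.

```python
import hashlib
import hmac

SECRET = b"rotate-this-signing-key"  # placeholder; keep the real one private

def canary_token(visitor_ip: str, date: str) -> str:
    """Derive a reproducible nonsense term to embed in a decoy page.
    Logging (token, ip, date) ties any later sighting of the token
    in a model's output back to a specific crawl."""
    mac = hmac.new(SECRET, f"{visitor_ip}|{date}".encode(), hashlib.sha256)
    return "zv" + mac.hexdigest()[:10]  # short, unique-looking fake term

token = canary_token("203.0.113.7", "2026-01-15")
print(token)  # stable 12-character term for these inputs
```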
For a deeper dive into honeypot implementation patterns, read our practical guide to honeypot traps.
Endpoint decoys
You can take this further with endpoint decoys that mimic real API endpoints. Any client that accesses these fake endpoints is immediately flagged as a scraper, and their entire session can be reviewed or blocked.
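The mechanics are simple enough to sketch in a few lines. The paths and flagging policy here are illustrative, not WebDecoy's actual implementation:

```python
# Paths that appear in no navigation and no sitemap; only a crawler
# probing for APIs or data dumps would ever request them.
DECOY_PATHS = {"/api/v1/export", "/internal/backup.sql"}

flagged_ips: set[str] = set()

def handle_request(ip: str, path: str) -> int:
    """Toy request handler returning an HTTP status code."""
    if path in DECOY_PATHS:
        flagged_ips.add(ip)   # one touch of a decoy flags the client
        return 403
    if ip in flagged_ips:
        return 403            # and the rest of its session is blocked
    return 200

print(handle_request("203.0.113.7", "/blog/post"))      # 200: normal page
print(handle_request("203.0.113.7", "/api/v1/export"))  # 403: hit a decoy
print(handle_request("203.0.113.7", "/blog/post"))      # 403: now blocked
```

In practice you would persist the flagged set and review it rather than block blindly, but the core idea holds: decoy endpoints have a zero false-positive rate for human visitors, because no human is ever shown them.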
Data poisoning with WebDecoy
WebDecoy’s decoy system takes honeypots to the next level. Instead of static trap pages, WebDecoy generates dynamic decoy content that looks legitimate to scrapers but is clearly synthetic. This means scrapers that ignore your robots.txt and bypass your rate limiting still end up collecting garbage data.
Combined with behavioral bot detection and cryptographic bot verification, you get a layered defense that handles everything from well-behaved crawlers to sophisticated scraping operations.
The Legal Landscape
Technical measures are stronger when paired with legal ones. Here’s where things stand in 2026:
United States
The big question is whether AI training constitutes fair use under US copyright law. Several landmark cases are working through the courts:
- NYT v. OpenAI challenged whether training GPT models on news articles is fair use
- Thomson Reuters v. ROSS Intelligence addressed training on legal content (in 2025, a federal district court rejected ROSS's fair use defense)
- Getty Images v. Stability AI focused on image training data
No definitive ruling has settled the fair use question for all cases, but the trend is moving toward requiring consent for commercial AI training, especially when the AI output competes directly with the source material.
European Union
The EU is further ahead on this. The Digital Single Market Directive (2019/790) gives rights holders an explicit opt-out mechanism for text and data mining (Article 4). The AI Act adds transparency requirements, forcing AI companies to disclose their training data sources.
If you implement TDM reservations and your content is still scraped, you have a clear legal claim under EU law.
Practical steps
- Add a clear AI training policy to your Terms of Service
- Implement all technical opt-out mechanisms (robots.txt, TDM, headers)
- Monitor your logs for scraping activity
- Document everything, because evidence matters if it comes to enforcement
- Consider registering your content with the US Copyright Office for statutory damages eligibility
Putting It All Together
Here’s a prioritized checklist, starting with the easiest wins:
| Priority | Action | Effort | Protection Level |
|---|---|---|---|
| 1 | Update robots.txt with AI crawler blocks | 5 minutes | Blocks compliant crawlers |
| 2 | Add noai/noimageai meta tags | 10 minutes | Signals intent to all crawlers |
| 3 | Add TDM-Reservation header | 15 minutes | Legal protection (especially EU) |
| 4 | Create tdmrep.json policy file | 15 minutes | Machine-readable opt-out |
| 5 | Update Terms of Service | 30 minutes | Legal foundation |
| 6 | Analyze server logs for AI crawlers | 30 minutes | Understand current exposure |
| 7 | Deploy behavioral bot detection | 1-2 hours | Catches stealth scrapers |
| 8 | Implement honeypot decoys | 1-2 hours | Evidence collection + poisoning |
Items 1 through 5 are things you can do this afternoon with no additional tools. Items 6 through 8 are where platforms like WebDecoy come in, giving you real-time detection and active defense against crawlers that don’t play by the rules.
Limitations and Honest Tradeoffs
No protection is perfect, and it’s worth being upfront about the tradeoffs:
You can’t retroactively remove your content from training datasets. If your content was scraped before you added protections, it may already be part of existing model weights. These measures protect you going forward.
Aggressive blocking can hurt legitimate traffic. Some AI crawlers also power search and assistant features. Blocking Google-Extended keeps your content out of Gemini model training and grounding, which can reduce your visibility across Google's AI surfaces. Blocking PerplexityBot means your content won't be cited in Perplexity answers. Decide which tradeoffs make sense for your situation.
Rate limiting has false positives. Legitimate users on shared networks (corporate offices, universities) can trigger rate limits. Behavioral analysis is more reliable than IP-based blocking, but it requires more sophisticated tooling.
Legal enforcement is expensive and slow. Even with perfect documentation, pursuing a copyright claim against a well-funded AI company is a serious undertaking. Technical protections are your first, best option.
The realistic goal isn’t to make your content 100% unscrapable. It’s to make it significantly harder to scrape, to establish clear legal grounds if someone does it anyway, and to detect unauthorized scraping when it happens. Defense in depth works because each layer catches what the previous one missed.
Want to see WebDecoy in action?
Get a personalized demo from our team.