Protect Your Content from AI Training Scrapers
Practical methods to stop AI companies from using your website content to train models. Covers robots.txt, TDM headers, detection, and enforcement.
WebDecoy Team
Your best blog post took twelve hours to write. An AI crawler can copy it in 200 milliseconds.
That’s not hypothetical. Right now, dozens of AI companies are systematically crawling the web to build training datasets. Your content, your expertise, your original research is being fed into models that will compete with you for the same audience. And most site owners have no idea it’s happening.
The good news: you’re not powerless. There are real, practical steps you can take today to protect your content from unauthorized AI training. Some are simple configuration changes. Others require active detection and enforcement. Let’s walk through all of them.
Table of Contents
- How AI Training Scrapers Work
- Layer 1: robots.txt Directives
- Layer 2: TDM Reservation Protocol
- Layer 3: Meta Tags and HTTP Headers
- Layer 4: Detecting Unauthorized Scraping
- Layer 5: Active Defense with Decoys
- The Legal Landscape
- Putting It All Together
- Limitations and Honest Tradeoffs
How AI Training Scrapers Work
Before you can protect against something, you need to understand how it operates. AI training scrapers come in three flavors:
| Type | Examples | Behavior | Respects robots.txt? |
|---|---|---|---|
| Announced crawlers | GPTBot, ClaudeBot, Google-Extended | Identify themselves via user agent | Usually yes |
| Stealth scrapers | Unnamed data vendors, academic crawlers | Use generic or spoofed user agents | Rarely |
| RAG fetchers | Perplexity, SearchGPT, AI assistants | Fetch content at query time, not for training | Varies |
The announced crawlers are the easiest to deal with. They play by the rules (mostly) and respond to standard protocols. The stealth scrapers are a different problem entirely. They rotate IPs, spoof headers, and look like regular browser traffic. Blocking them requires behavioral detection, not just configuration files.
For a deeper look at how RAG bots differ from training scrapers, see our post on the RAG bot problem.
Layer 1: robots.txt Directives
This is your first line of defense and the one you should implement immediately. Every major AI company has published a user agent string for their training crawler, and most of them respect robots.txt disallow rules.
Add this block to your robots.txt:
```
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Important: Don't use a blanket User-agent: * disallow unless you want to block search engines too. Be specific about which bots you're blocking.
If you only want to protect certain sections (like your blog), you can scope the rules:
```
User-agent: GPTBot
Disallow: /blog/
Disallow: /docs/
Allow: /
```

What robots.txt won't do
Let’s be honest about the limitations. robots.txt is a voluntary protocol. There is no technical enforcement. A scraper that ignores your robots.txt will still be able to crawl your site freely.
That said, ignoring robots.txt creates legal liability. If you’ve explicitly opted out and a company scrapes you anyway, that strengthens your position significantly in any legal dispute. Think of robots.txt as a posted “No Trespassing” sign. It won’t physically stop anyone, but it establishes intent.
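It's easy to get these rules subtly wrong (a missing Disallow, a typo in a user agent string), so it's worth sanity-checking them before deploying. Python's standard library includes a robots.txt parser; here's a minimal sketch, with illustrative rules and paths:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules in the style of the blocks above
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /blog/
"""

def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(is_allowed(ROBOTS_TXT, "GPTBot", "/blog/post"))   # False: blocked sitewide
print(is_allowed(ROBOTS_TXT, "ClaudeBot", "/about"))    # True: only /blog/ is blocked
```

Run the same check against your real robots.txt content before and after edits to confirm each crawler sees what you intend.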
Layer 2: TDM Reservation Protocol
The W3C’s TDM Reservation Protocol (TDMRep) is a newer standard designed specifically for this problem. It lets you declare machine-readable policies about text and data mining rights.
Create a tdmrep.json file at your site root (the TDMRep spec also defines a well-known discovery path, /.well-known/tdmrep.json, where crawlers can look for it directly):

```json
[
  {
    "location": "/*",
    "tdm-reservation": 1
  }
]
```

Then reference it in your HTML <head>:
```html
<link rel="tdm-reservation" href="/tdmrep.json" />
```

You can also set it via HTTP header:
```
TDM-Reservation: 1
```

In an Nginx config, that looks like:
```nginx
server {
    # Block AI training data mining
    add_header TDM-Reservation "1" always;
    add_header X-Robots-Tag "noai, noimageai" always;
}
```

The TDM protocol is particularly important in the EU, where the Digital Single Market Directive (Article 4) gives publishers an explicit right to opt out of text and data mining. Having a machine-readable reservation in place is how you exercise that right.
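Once deployed, you can verify the header programmatically. A minimal sketch; it checks a headers mapping you supply (for example, the headers from a response after fetching your homepage), so there's no network code here:

```python
def tdm_reserved(headers: dict[str, str]) -> bool:
    """True if a response opts out via the TDM-Reservation header.
    HTTP header names are case-insensitive, so normalize before checking."""
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"

print(tdm_reserved({"TDM-Reservation": "1", "Content-Type": "text/html"}))  # True
print(tdm_reserved({"Content-Type": "text/html"}))                          # False
```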
Layer 3: Meta Tags and HTTP Headers
Several proposed meta tags have emerged to signal AI training preferences. Browser and crawler support varies, but implementing them costs nothing and strengthens your legal position.
The noai and noimageai directives
```html
<meta name="robots" content="noai, noimageai" />
```

This tells crawlers not to use your text or images for AI training. Note that these values are not part of any formal standard and support is inconsistent; Google, for instance, handles AI training opt-outs through the Google-Extended robots.txt token rather than these directives. They still cost nothing to add and document your intent.
DeviantArt’s AI training opt-out
DeviantArt introduced a specific meta tag that some scrapers now check:

```html
<meta name="AI-Training-Opt-Out" content="disallow" />
```

Combining headers in your web server
Here's a more complete Nginx configuration. Note that map and limit_req_zone must be declared in the http context, not inside a server block:

```nginx
# In the http {} context: the rate-limit key is empty (unlimited)
# for normal visitors and the client IP for known AI crawlers.
map $http_user_agent $ai_crawler {
    default                                  "";
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider)"  $binary_remote_addr;
}
limit_req_zone $ai_crawler zone=ai_crawlers:10m rate=5r/m;

server {
    listen 443 ssl;
    server_name yourdomain.com;

    # AI training opt-out headers
    add_header X-Robots-Tag "noai, noimageai" always;
    add_header TDM-Reservation "1" always;

    location / {
        # Requests whose key is empty bypass the limit entirely
        limit_req zone=ai_crawlers burst=5;
        proxy_pass http://backend;
    }
}
```

For Apache servers:
```apache
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noai, noimageai"
    Header set TDM-Reservation "1"
</IfModule>
```

Layer 4: Detecting Unauthorized Scraping
Configuration-based protections only work against crawlers that play fair. For the rest, you need detection.
Log analysis for known AI crawlers
Start by checking if you’re already being scraped. Run this against your access logs:
```bash
# With the default "combined" log format, splitting on double quotes
# puts the user agent in field 6; the IP is the first word of field 1.
grep -iE "(GPTBot|ClaudeBot|CCBot|Bytespider|anthropic|cohere|perplexity)" \
  /var/log/nginx/access.log | \
  awk -F'"' '{split($1, a, " "); print a[1], $6}' | \
  sort | uniq -c | sort -rn | head -20
```

This gives you a count of requests per IP and user agent for known AI crawlers. If you see hundreds or thousands of requests, someone is actively scraping your content.
Behavioral detection for stealth scrapers
The harder problem is scrapers that don’t announce themselves. These crawlers share a set of behavioral patterns you can watch for:
Sequential page access. Real users jump around a site. Scrapers tend to crawl systematically, hitting pages in order or following every link on every page.
No asset loading. Browsers fetch CSS, JavaScript, images, and fonts. Scrapers usually skip all of that. If a “visitor” loads 50 HTML pages but zero stylesheets, that’s not a human.
Abnormal timing. Humans take seconds between page loads. Automated scrapers often fire requests in rapid bursts with millisecond precision.
Missing or inconsistent TLS fingerprints. Modern browsers have distinctive TLS handshake patterns. Headless browsers and HTTP libraries produce different fingerprints entirely. Check out our TLS fingerprinting capabilities for more on this approach.
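None of these signals is conclusive on its own, but they compound. A toy scorer shows the idea; the Session fields and thresholds here are illustrative, not tuned values:

```python
from dataclasses import dataclass

@dataclass
class Session:
    """Aggregate stats for one visitor session (fields are illustrative)."""
    html_requests: int       # pages fetched
    asset_requests: int      # CSS/JS/image/font fetches
    sequential_ratio: float  # fraction of requests following site order, 0..1
    min_gap_ms: float        # shortest gap between consecutive requests

def scraper_score(s: Session) -> int:
    """One point per suspicious signal; 2+ warrants a closer look."""
    score = 0
    if s.html_requests >= 20 and s.asset_requests == 0:
        score += 1  # loads pages but never their assets
    if s.sequential_ratio > 0.9:
        score += 1  # crawls the site in order, unlike a human
    if s.min_gap_ms < 100:
        score += 1  # sub-100ms bursts between page loads
    return score

bot = Session(html_requests=50, asset_requests=0, sequential_ratio=0.97, min_gap_ms=12)
human = Session(html_requests=6, asset_requests=40, sequential_ratio=0.3, min_gap_ms=2400)
print(scraper_score(bot), scraper_score(human))  # 3 0
```

A production system would tune these thresholds against real traffic and add more signals, but the structure (accumulate weak evidence, act on the total) is the same.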
WebDecoy’s detection engine analyzes all of these signals in real time. When a visitor’s behavior crosses the threshold, you can choose to block, throttle, or serve alternative content. For implementation details, see our complete guide to AI-powered bot detection.
Layer 5: Active Defense with Decoys
This is where protection gets interesting. Instead of just blocking scrapers, you can actively poison their data collection.
Honeypot content
Embed invisible links on your pages that only automated crawlers will follow. These links lead to pages filled with nonsensical or misleading text. If that text shows up in an AI model’s output, you have clear evidence that your site was scraped and the content was used for training.
```html
<!-- Invisible to real users, visible to crawlers -->
<a href="/research/publications/2026/"
   style="position:absolute;left:-9999px;opacity:0"
   aria-hidden="true"
   tabindex="-1">Research Archive</a>
```

The page at /research/publications/2026/ can contain unique, traceable text, such as a fake statistic or a made-up term that you can later search for in AI model outputs.
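One way to make the trap text traceable is to derive it from who was served the page. Here's a sketch using an HMAC; the secret and the naming scheme are placeholders. Because the same visitor and date always produce the same token, a later sighting in model output can be tied back to a specific crawl via your logs.

```python
import hashlib
import hmac

SECRET = b"rotate-this-signing-key"  # placeholder; keep the real one private

def canary_token(visitor_ip: str, date: str) -> str:
    """Derive a reproducible nonsense term to embed in a decoy page.
    Logging (token, ip, date) ties any later sighting of the token
    in a model's output back to a specific crawl."""
    mac = hmac.new(SECRET, f"{visitor_ip}|{date}".encode(), hashlib.sha256)
    return "zv" + mac.hexdigest()[:10]  # short, unique-looking fake term

token = canary_token("203.0.113.7", "2026-01-15")
print(token)  # stable 12-character term for these inputs
```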
For a deeper dive into honeypot implementation patterns, read our practical guide to honeypot traps.
Endpoint decoys
You can take this further with endpoint decoys that mimic real API endpoints. Any client that accesses these fake endpoints is immediately flagged as a scraper, and their entire session can be reviewed or blocked.
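The mechanics are simple enough to sketch in a few lines. The paths and flagging policy here are illustrative, not WebDecoy's actual implementation:

```python
# Paths that appear in no navigation and no sitemap; only a crawler
# probing for APIs or data dumps would ever request them.
DECOY_PATHS = {"/api/v1/export", "/internal/backup.sql"}

flagged_ips: set[str] = set()

def handle_request(ip: str, path: str) -> int:
    """Toy request handler returning an HTTP status code."""
    if path in DECOY_PATHS:
        flagged_ips.add(ip)   # one touch of a decoy flags the client
        return 403
    if ip in flagged_ips:
        return 403            # and the rest of its session is blocked
    return 200

print(handle_request("203.0.113.7", "/blog/post"))      # 200: normal page
print(handle_request("203.0.113.7", "/api/v1/export"))  # 403: hit a decoy
print(handle_request("203.0.113.7", "/blog/post"))      # 403: now blocked
```

In practice you would persist the flagged set and review it rather than block blindly, but the core idea holds: decoy endpoints have a zero false-positive rate for human visitors, because no human is ever shown them.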
Data poisoning with WebDecoy
WebDecoy’s decoy system takes honeypots to the next level. Instead of static trap pages, WebDecoy generates dynamic decoy content that looks legitimate to scrapers but is clearly synthetic. This means scrapers that ignore your robots.txt and bypass your rate limiting still end up collecting garbage data.
Combined with behavioral bot detection and cryptographic bot verification, you get a layered defense that handles everything from well-behaved crawlers to sophisticated scraping operations.
The Legal Landscape
Technical measures are stronger when paired with legal ones. Here’s where things stand in 2026:
United States
The big question is whether AI training constitutes fair use under US copyright law. Several landmark cases are working through the courts:
- NYT v. OpenAI challenged whether training GPT models on news articles is fair use
- Thomson Reuters v. ROSS Intelligence addressed training on legal content (in 2025, a federal district court rejected ROSS's fair use defense)
- Getty Images v. Stability AI focused on image training data
No definitive ruling has settled the fair use question for all cases, but the trend is moving toward requiring consent for commercial AI training, especially when the AI output competes directly with the source material.
European Union
The EU is further ahead on this. The Digital Single Market Directive (2019/790) gives rights holders an explicit opt-out mechanism for text and data mining (Article 4). The AI Act adds transparency requirements, forcing AI companies to disclose their training data sources.
If you implement TDM reservations and your content is still scraped, you have a clear legal claim under EU law.
Practical steps
- Add a clear AI training policy to your Terms of Service
- Implement all technical opt-out mechanisms (robots.txt, TDM, headers)
- Monitor your logs for scraping activity
- Document everything, because evidence matters if it comes to enforcement
- Consider registering your content with the US Copyright Office for statutory damages eligibility
Putting It All Together
Here’s a prioritized checklist, starting with the easiest wins:
| Priority | Action | Effort | Protection Level |
|---|---|---|---|
| 1 | Update robots.txt with AI crawler blocks | 5 minutes | Blocks compliant crawlers |
| 2 | Add noai/noimageai meta tags | 10 minutes | Signals intent to all crawlers |
| 3 | Add TDM-Reservation header | 15 minutes | Legal protection (especially EU) |
| 4 | Create tdmrep.json policy file | 15 minutes | Machine-readable opt-out |
| 5 | Update Terms of Service | 30 minutes | Legal foundation |
| 6 | Analyze server logs for AI crawlers | 30 minutes | Understand current exposure |
| 7 | Deploy behavioral bot detection | 1-2 hours | Catches stealth scrapers |
| 8 | Implement honeypot decoys | 1-2 hours | Evidence collection + poisoning |
Items 1 through 5 are things you can do this afternoon with no additional tools. Items 6 through 8 are where platforms like WebDecoy come in, giving you real-time detection and active defense against crawlers that don’t play by the rules.
Limitations and Honest Tradeoffs
No protection is perfect, and it’s worth being upfront about the tradeoffs:
You can’t retroactively remove your content from training datasets. If your content was scraped before you added protections, it may already be part of existing model weights. These measures protect you going forward.
Aggressive blocking can hurt legitimate traffic. Some AI crawlers also power search and assistant features. Blocking Google-Extended keeps your content out of Gemini model training and grounding, which can reduce your visibility across Google's AI surfaces. Blocking PerplexityBot means your content won't be cited in Perplexity answers. Decide which tradeoffs make sense for your situation.
Rate limiting has false positives. Legitimate users on shared networks (corporate offices, universities) can trigger rate limits. Behavioral analysis is more reliable than IP-based blocking, but it requires more sophisticated tooling.
Legal enforcement is expensive and slow. Even with perfect documentation, pursuing a copyright claim against a well-funded AI company is a serious undertaking. Technical protections are your first, best option.
The realistic goal isn’t to make your content 100% unscrapable. It’s to make it significantly harder to scrape, to establish clear legal grounds if someone does it anyway, and to detect unauthorized scraping when it happens. Defense in depth works because each layer catches what the previous one missed.
Want to see WebDecoy in action?
Get a personalized demo from our team.