The RAG Bot Problem: When AI Fetches Your Content in Real-Time
WebDecoy Security Team
Last Updated: January 2026 | RAG bot IP ranges change frequently—verify against vendor documentation.
There’s a new class of AI bot hitting your infrastructure, and it’s fundamentally different from the training scrapers you’ve been blocking. When a user asks ChatGPT “what does Company X’s pricing page say?” or queries Perplexity about your product features, these systems don’t hallucinate an answer—they dispatch a bot to fetch your content in real-time.
This is Retrieval Augmented Generation (RAG), and the bots executing these fetches require a completely different detection and response strategy than traditional AI scrapers.
RAG Bots vs Training Scrapers: A Critical Distinction
Most AI bot discussions conflate two fundamentally different use cases. Getting this distinction wrong corrupts your traffic data and leads to poor security decisions.
The Request Flow
Understanding the architectural difference is essential:
TRAINING SCRAPER FLOW:

┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────┐
│   Crawler   │────▶│  Your Site   │────▶│  Database   │────▶│ Training │
│  (GPTBot)   │     │ (millions of │     │  (stored)   │     │ (weeks)  │
└─────────────┘     │    pages)    │     └─────────────┘     └────┬─────┘
                    └──────────────┘                              │
                                                                  ▼
                                                            ┌──────────┐
                                                            │   User   │
                                                            │  Query   │
                                                            └──────────┘
RAG BOT FLOW:

┌──────────┐     ┌─────────────────┐     ┌──────────────┐     ┌──────────┐
│   User   │────▶│    ChatGPT/     │────▶│  Your Site   │────▶│  Answer  │
│  Query   │     │   Perplexity    │     │ (1-5 pages)  │     │ (seconds)│
└──────────┘     └────────┬────────┘     └──────┬───────┘     └──────────┘
                          │    ChatGPT-User/    │
                          └── Perplexity-User ──┘

Training Scrapers
Training scrapers crawl systematically to collect data for model development:
| Characteristic | Training Scrapers |
|---|---|
| Purpose | Collect data to train/fine-tune models |
| Timing | Scheduled bulk crawls, often overnight |
| Pattern | Systematic, comprehensive site coverage |
| Volume | High—millions of pages per crawl |
| User context | None—operates independently |
| Examples | GPTBot, ClaudeBot, CCBot, Bytespider |
RAG Bots (Search/Retrieval Bots)
RAG bots fetch specific content to answer a user’s question in real-time:
| Characteristic | RAG Bots |
|---|---|
| Purpose | Answer a specific user query |
| Timing | On-demand, triggered by user questions |
| Pattern | Targeted—1-5 pages per query |
| Volume | Lower per-request, but constant stream |
| User context | Human is waiting for the answer |
| Examples | ChatGPT-User, Perplexity-User, BingPreview |
Key Insight: A RAG bot request represents a human user who asked about your product. Blocking it might mean losing a potential customer. Blocking a training scraper just protects your content from unauthorized model training.
Identifying RAG Bots in Your Logs
OpenAI: ChatGPT-User
When a ChatGPT user triggers web search (either explicitly or when the model determines it needs current information), OpenAI dispatches ChatGPT-User:
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot

Key behaviors:
- Respects robots.txt (confirmed through testing)
- Signs requests using the Web Bot Auth standard
- Stops crawling when disallowed—no follow-up attempts from other user agents
- Fetches robots.txt before each crawl session
This is notably better behavior than some competitors. In testing, ChatGPT-User demonstrates consistent, predictable patterns that make it easy to identify and set policy for.
IP verification: OpenAI publishes official IP ranges for ChatGPT-User, enabling you to verify requests aren’t spoofed.
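A minimal loading sketch using Python's standard library (the Layer 1 code later in this post builds on these helpers). The URL and the JSON shape are assumptions based on how such lists are commonly published; verify both against OpenAI's current documentation:

import ipaddress
import json
import urllib.request

def load_ip_ranges(url):
    """Fetch a vendor's published IP list and parse it into network objects.
    Assumes a JSON shape like {"prefixes": [{"ipv4Prefix": "1.2.3.0/24"}]}."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return [
        ipaddress.ip_network(p.get('ipv4Prefix') or p.get('ipv6Prefix'))
        for p in data.get('prefixes', [])
    ]

def ip_in_ranges(ip, networks):
    """True if the client IP falls inside any published network (CIDR-aware)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# usage (URL is an assumption; check OpenAI's docs for the current location):
# chatgpt_user_ips = load_ip_ranges('https://openai.com/chatgpt-user.json')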
Perplexity: The Complicated Case
Perplexity operates two distinct crawlers:
PerplexityBot (Training/Indexing):
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)

- Used for indexing and search result surfacing
- Not used for AI model training (per Perplexity docs)
- IP list: https://www.perplexity.com/perplexitybot.json
Perplexity-User (RAG):
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)

- Handles user-initiated search queries
- Generally ignores robots.txt (per their own documentation—justified as “user-requested”)
- IP list: https://www.perplexity.com/perplexity-user.json
The controversy: Cloudflare documented Perplexity using undeclared stealth crawlers that bypass blocks. When PerplexityBot was blocked on certain sites, requests continued from a different user agent:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36

This stealth crawler:
- Generated 3-6 million daily requests (vs 20-25 million from declared agents)
- Used IPs not in Perplexity’s published ranges
- Mimicked standard Chrome on macOS
- Continued accessing content after official bots were blocked
Cloudflare’s bot management systems detected these requests as bots despite the disguise, demonstrating the value of behavioral detection over user-agent matching.
Microsoft: Copilot and BingPreview
Microsoft’s AI assistant Copilot uses Bing’s infrastructure for web retrieval:
BingPreview (RAG for Copilot):
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36

Key considerations for enterprise environments:
- Copilot queries from Azure tenants may originate from Microsoft’s Azure IP ranges
- Enterprise Copilot can access internal SharePoint/intranet content
- BingPreview respects robots.txt but shares user-agent patterns with regular Bingbot
Enterprise Warning: If you’re blocking “bingbot” broadly, you may be blocking Copilot RAG queries from your own employees using Microsoft 365.
Apple: AppleBot-Extended and Apple Intelligence
Apple’s approach to AI retrieval is privacy-focused:
AppleBot-Extended:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Version/17.0 Safari/537.36 (Applebot-Extended)

- Used for Apple Intelligence features (Siri, Safari summaries)
- Apple emphasizes “Private Cloud Compute” for processing
- Content fetched is processed in secure enclaves, not stored
- Respects robots.txt Applebot-Extended directives
Unlike other vendors, Apple’s privacy architecture means fetched content theoretically cannot be retained for training—though verification is impossible.
Google: Gemini and AI Overviews
Google uses several AI-related crawlers:
| Crawler | Purpose |
|---|---|
| Google-Extended | AI training data collection (blockable via robots.txt) |
| Googlebot | Search indexing AND AI Overviews content |
The catch: You cannot block Google from using your content in AI Overviews without also blocking yourself from search results. Googlebot serves both purposes with no separate user agent for AI features.
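What you can control is training: Google-Extended is a robots.txt token, so a two-line rule opts your content out of Gemini training and grounding without touching Googlebot (and therefore without touching search or AI Overviews eligibility):

User-agent: Google-Extended
Disallow: /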
Anthropic: Claude
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)

ClaudeBot is primarily a training crawler. Claude’s web search features (when enabled) use similar patterns but with less documented behavior than OpenAI’s approach.
The Detection Pyramid
Effective RAG bot detection requires multiple layers, each catching what the previous layer missed:
┌───────────────────┐
│ HONEYPOTS │ ← Zero false positives
│ (Layer 4) │ Catches liars
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ BEHAVIORAL │ ← Asset loading, timing
│ ANALYSIS │ Catches stealth bots
│ (Layer 3) │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ TLS │ ← Fingerprint mismatch
│ FINGERPRINTING │ Catches spoofed UAs
│ (Layer 2) │
└─────────┬─────────┘
│
┌─────────▼─────────┐
│ USER-AGENT + │ ← Basic identification
│ IP VERIFICATION │ Catches honest bots
│ (Layer 1) │
└───────────────────┘

Layer 1: User-Agent + IP Verification
For vendors that publish IP ranges, cross-reference both:
# Load published ranges into CIDR-aware network objects
# (see the loading sketch above; a plain string `in` check cannot
# match a single address against a CIDR block)
CHATGPT_USER_IPS = load_ip_ranges('https://openai.com/chatgpt-user.json')
PERPLEXITY_IPS = load_ip_ranges('https://www.perplexity.com/perplexity-user.json')

def verify_rag_bot(request):
    ua = request.headers.get('User-Agent', '')
    ip = request.remote_addr
    if 'ChatGPT-User' in ua:
        return ip_in_ranges(ip, CHATGPT_USER_IPS)  # True = legitimate
    if 'Perplexity-User' in ua:
        return ip_in_ranges(ip, PERPLEXITY_IPS)
    return None  # Unknown bot

Security Warning: If a vendor doesn’t publish IP ranges, don’t trust the user agent alone. Attackers spoof AI bot user agents to blend in with “acceptable” bot traffic.
Layer 2: TLS Fingerprinting
RAG bots use specific HTTP clients that produce distinctive TLS fingerprints. A request claiming to be ChatGPT-User but presenting a curl or Python requests TLS fingerprint is fraudulent.
| Claimed Bot | Expected TLS Pattern | Mismatch Action |
|---|---|---|
| ChatGPT-User | OpenAI’s fetch client | Block immediately |
| Perplexity-User | Perplexity’s client | Block immediately |
| Chrome (from known AI IP) | Chrome-like | Likely stealth crawler |
WebDecoy’s TLS fingerprinting automatically detects these mismatches.
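A sketch of the Layer 2 mismatch check, assuming your edge already computes a JA3 (or similar) TLS fingerprint per connection. The hashes here are placeholders, not real vendor fingerprints; build the baseline by observing verified bot traffic on your own infrastructure:

# Placeholder values: collect real fingerprints from verified bot traffic;
# they change whenever a vendor updates its HTTP client.
EXPECTED_TLS = {
    'ChatGPT-User': {'ja3-placeholder-openai'},
    'Perplexity-User': {'ja3-placeholder-perplexity'},
}

def tls_mismatch(claimed_bot, ja3_hash):
    """True when a claimed bot presents a TLS fingerprint never observed
    from that vendor's verified traffic."""
    expected = EXPECTED_TLS.get(claimed_bot)
    if expected is None:
        return False  # no baseline yet; defer to other layers
    return ja3_hash not in expected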
Layer 3: Behavioral Analysis (Asset Loading Detection)
This is the highest-signal detection method for stealth crawlers. Real browsers request CSS, JavaScript, images, and fonts. RAG bots typically fetch only HTML.
def analyze_session_behavior(session_requests):
"""
Detect stealth RAG bots by analyzing asset loading patterns.
Real Chrome users load 30-100+ assets per page.
RAG bots load 0-2 assets (maybe favicon).
"""
html_requests = [r for r in session_requests if is_html(r)]
asset_requests = [r for r in session_requests if is_asset(r)]
if len(html_requests) > 0:
asset_ratio = len(asset_requests) / len(html_requests)
# Real browsers: 30-100 assets per HTML page
# RAG bots: 0-2 assets per HTML page
if asset_ratio < 5:
return {
'verdict': 'likely_bot',
'confidence': 0.9 if asset_ratio < 1 else 0.7,
'reason': f'Asset ratio {asset_ratio:.1f} (expected 30+)'
}
return {'verdict': 'likely_human', 'confidence': 0.6}
def is_asset(request):
"""Check if request is for a static asset."""
asset_extensions = ['.css', '.js', '.png', '.jpg', '.gif', '.woff', '.woff2', '.svg']
    return any(request.path.endswith(ext) for ext in asset_extensions)

Detection signals for stealth crawlers:
| Signal | Real Chrome | Stealth RAG Bot |
|---|---|---|
| CSS files loaded | 5-20 | 0 |
| JavaScript executed | Yes (beacon fires) | No |
| Images requested | 10-50+ | 0-1 (favicon) |
| Fonts loaded | 2-5 | 0 |
| Time on page | 10-300 seconds | < 2 seconds |
Pro Tip: Inject a JavaScript beacon that reports back. If a “Chrome” user-agent never fires the beacon, it’s not Chrome.
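Server-side, the beacon check is a simple timing comparison. A sketch, assuming your pages embed a script that POSTs to a /beacon endpoint (the endpoint name and in-memory store are hypothetical; use your session layer in production):

import time

sessions = {}  # session_id -> {'html_at': ts, 'beacon_at': ts}

def record_html(session_id):
    sessions.setdefault(session_id, {})['html_at'] = time.time()

def record_beacon(session_id):
    sessions.setdefault(session_id, {})['beacon_at'] = time.time()

def beacon_never_fired(session_id, grace_seconds=10):
    """A 'Chrome' session whose beacon hasn't fired within the grace
    window is almost certainly not a real browser."""
    s = sessions.get(session_id, {})
    if 'html_at' not in s or 'beacon_at' in s:
        return False
    return time.time() - s['html_at'] > grace_seconds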
Layer 4: Honeypot Validation
Deploy invisible honeypot links to catch bots that claim to be answering user queries but are actually crawling comprehensively:
<a href="/internal-docs-2024" style="position:absolute;left:-9999px">Documentation</a>

A real RAG bot answering “what’s your pricing?” has no reason to visit /internal-docs-2024. If it does, it’s lying about its purpose.
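A sketch of the server-side trigger, assuming a handful of decoy paths that never appear in navigation or sitemaps (the extra paths here are hypothetical examples):

import logging

log = logging.getLogger('honeypot')
HONEYPOT_PATHS = {'/internal-docs-2024', '/admin-archive', '/staging-notes'}

def check_honeypot(request, blocklist):
    """A visit to a decoy path is a high-confidence signal: no user query
    leads here, so the client is crawling, not answering."""
    if request.path in HONEYPOT_PATHS:
        blocklist.add(request.remote_addr)
        log.warning('honeypot trigger: %s %s (%s)', request.remote_addr,
                    request.path, request.headers.get('User-Agent', ''))
        return True
    return False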
The Marketing vs. Security Conflict
RAG bots create organizational tension that requires explicit governance.
The Problem
| Team | Goal | RAG Bot Preference |
|---|---|---|
| Security | Reduce attack surface, control costs | Block everything |
| Infrastructure | Minimize server load | Aggressive rate limiting |
| Marketing/SEO | Appear in AI search answers | Allow everything |
| Legal | Protect IP, ensure compliance | Block training, allow RAG |
Without a governance framework, these competing priorities lead to inconsistent policies and finger-pointing when something goes wrong.
The RAG Bot Governance Framework
Establish clear organizational policy with these components:
1. Bot Classification Tiers
TIER 1 - TRUSTED (Allow with monitoring)
├── ChatGPT-User (verified IPs)
├── BingPreview (verified IPs)
└── AppleBot-Extended
TIER 2 - CONDITIONAL (Rate-limit, require verification)
├── Perplexity-User
├── Unverified AI user agents
└── New/unknown RAG bots
TIER 3 - BLOCKED (Zero tolerance)
├── Spoofed user agents (UA/IP mismatch)
├── Honeypot triggers
├── Known bad actors
└── Training scrapers (GPTBot, ClaudeBot, CCBot)

2. Cross-Functional Review Process
Before changing RAG bot policy:
- Security reviews threat implications
- Marketing assesses visibility impact
- Legal confirms compliance with ToS
- Infrastructure validates capacity
3. Escalation Path
New RAG bot detected
│
▼
┌───────────────────┐
│ Auto-classify as │
│ TIER 2 (rate- │
│ limited) │
└─────────┬─────────┘
│
▼ (within 48 hours)
┌───────────────────┐
│ Security + Mktg │
│ joint review │
└─────────┬─────────┘
│
▼
┌───────────────────┐
│ Promote to TIER 1 │
│ or demote to │
│ TIER 3 │
└───────────────────┘

RAG Traffic Attribution and Analytics
One of the biggest pain points: RAG bots “steal” traffic by giving users the answer directly. How do you measure impact?
Tracking RAG Bot Visits
Add RAG bot detection to your analytics pipeline:
// Server-side: Tag requests from known RAG bots
function tagRagBotRequest(request) {
const ragBots = {
'ChatGPT-User': 'openai',
'Perplexity-User': 'perplexity',
'bingbot': 'microsoft_copilot', // Check referer for Copilot
'Applebot-Extended': 'apple_intelligence'
};
const ua = request.headers['user-agent'];
for (const [pattern, source] of Object.entries(ragBots)) {
if (ua.includes(pattern)) {
// Log to analytics
analytics.track('rag_bot_fetch', {
source: source,
path: request.path,
timestamp: Date.now(),
ip_verified: verifyIpRange(request.ip, source)
});
return source;
}
}
return null;
}

Correlating RAG Fetches to Traffic Spikes
Look for this pattern in your analytics:
Timeline:
─────────────────────────────────────────────────────────▶
│ │ │
│ │ │
▼ ▼ ▼
RAG bot fetches Traffic spike Conversions
/pricing page (30 min later) (from AI referral?)
│ │ │
└────────────────────┴──────────────┘
Correlation window

Detection query (SQL):
-- Find traffic spikes following RAG bot fetches
WITH rag_fetches AS (
SELECT
path,
DATE_TRUNC('hour', timestamp) as fetch_hour
FROM access_logs
WHERE user_agent LIKE '%ChatGPT-User%'
OR user_agent LIKE '%Perplexity-User%'
),
hourly_traffic AS (
SELECT
path,
DATE_TRUNC('hour', timestamp) as traffic_hour,
COUNT(*) as visits
FROM access_logs
WHERE user_agent NOT LIKE '%bot%'
GROUP BY 1, 2
)
SELECT
r.path,
r.fetch_hour,
t.visits as traffic_after,
LAG(t.visits) OVER (PARTITION BY r.path ORDER BY t.traffic_hour) as traffic_before,
t.visits - LAG(t.visits) OVER (PARTITION BY r.path ORDER BY t.traffic_hour) as delta
FROM rag_fetches r
JOIN hourly_traffic t ON r.path = t.path
AND t.traffic_hour = r.fetch_hour + INTERVAL '1 hour'
ORDER BY delta DESC;

Referrer Analysis
Some AI search engines include identifiable referrers:
| Source | Referrer Pattern |
|---|---|
| Perplexity | https://www.perplexity.ai/ |
| ChatGPT | Often empty or https://chat.openai.com/ |
| Copilot | https://copilot.microsoft.com/ or empty |
Track these in your analytics to measure AI-driven traffic:
// Client-side: Detect AI search referrals
const aiReferrers = [
'perplexity.ai',
'chat.openai.com',
'copilot.microsoft.com',
'gemini.google.com'
];
if (document.referrer) {
const referrerHost = new URL(document.referrer).hostname;
if (aiReferrers.some(ai => referrerHost.includes(ai))) {
analytics.track('ai_search_referral', {
source: referrerHost,
landing_page: window.location.pathname
});
}
}Block vs Rate-Limit: The Strategic Decision
Here’s where RAG bots require different thinking than training scrapers.
When to Block Completely
Block when:
- Bot is spoofing identity (user agent doesn’t match IP range)
- Bot ignores robots.txt AND you’ve explicitly prohibited it
- Bot triggers honeypots (proving it’s not answering specific queries)
- Bot exhibits training-scraper patterns despite RAG user agent
- Company has violated your terms of service
WAF Implementation Recipes
Nginx:
# Block spoofed RAG bots (UA claims AI but IP doesn't match)
geo $ai_bot_ip {
default 0;
# OpenAI ChatGPT-User ranges
23.98.142.176/28 1;
40.84.180.224/28 1;
# Add other verified ranges...
}
map $http_user_agent $is_claimed_ai_bot {
default 0;
"~*ChatGPT-User" 1;
"~*Perplexity-User" 1;
}
# Block if claims to be AI bot but IP doesn't match
if ($is_claimed_ai_bot = 1) {
set $block_check "claimed";
}
if ($ai_bot_ip = 0) {
set $block_check "${block_check}_unverified";
}
if ($block_check = "claimed_unverified") {
return 403;
}

Cloudflare Workers:
export default {
async fetch(request, env) {
const ua = request.headers.get('User-Agent') || '';
const ip = request.headers.get('CF-Connecting-IP');
const botScore = request.cf?.botManagement?.score || 99;
// Known RAG bot patterns
const ragBotPatterns = ['ChatGPT-User', 'Perplexity-User', 'Applebot-Extended'];
const isClaimedRagBot = ragBotPatterns.some(p => ua.includes(p));
if (isClaimedRagBot) {
// Verify against known IP ranges (fetch from KV or hardcode)
const verifiedIPs = await env.RAG_BOT_IPS.get('openai', { type: 'json' });
const isVerified = verifiedIPs?.includes(ip);
if (!isVerified) {
return new Response('Forbidden: Unverified bot', { status: 403 });
}
// Add header for downstream analytics
const response = await fetch(request);
const newResponse = new Response(response.body, response);
newResponse.headers.set('X-RAG-Bot', 'verified');
return newResponse;
}
// Check Cloudflare bot score for stealth crawlers
if (botScore < 30) {
// Likely bot - could be stealth RAG crawler
// Rate limit instead of block
return new Response('Rate limited', { status: 429 });
}
return fetch(request);
}
};

AWS WAF:
{
"Name": "RAGBotVerification",
"Priority": 1,
"Statement": {
"AndStatement": {
"Statements": [
{
"ByteMatchStatement": {
"SearchString": "ChatGPT-User",
"FieldToMatch": { "SingleHeader": { "Name": "User-Agent" } },
"TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
"PositionalConstraint": "CONTAINS"
}
},
{
"NotStatement": {
"Statement": {
"IPSetReferenceStatement": {
"ARN": "arn:aws:wafv2:...:ipset/OpenAI-ChatGPT-User-IPs/..."
}
}
}
}
]
}
},
"Action": { "Block": { "CustomResponse": { "ResponseCode": 403 } } },
"VisibilityConfig": {
"SampledRequestsEnabled": true,
"CloudWatchMetricsEnabled": true,
"MetricName": "BlockedSpoofedRAGBots"
}
}

When to Rate-Limit
Rate-limit when:
- Bot is legitimate but consuming excessive resources
- You want visibility in AI search results but need to control load
- Bot behavior is borderline (might be legitimate, might not)
Recommended limits for RAG bots:
Verified RAG bots (ChatGPT-User from OpenAI IPs):
- 60 requests/minute (allows answering queries)
- Burst: 10 requests/second (for multi-page context)
Unverified but possible RAG bots:
- 10 requests/minute
- Challenge after threshold exceeded
Honeypot triggers:
- Immediate block (zero tolerance)

When to Allow Freely
Consider allowing when:
- You want maximum visibility in AI-powered search
- The bot is verified and well-behaved
- Your content is meant to be publicly accessible
- You’re tracking referrals from AI search for analytics
Business Consideration: When someone asks ChatGPT about your product category, do you want your content in the answer? If yes, blocking ChatGPT-User costs you visibility.
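To enforce the per-tier limits above, here is a minimal in-memory sliding-window sketch (the tier numbers mirror the recommendations; production setups usually push this to the WAF or a shared store like Redis):

import time
from collections import defaultdict, deque

TIER_LIMITS = {'trusted': 60, 'conditional': 10}  # requests per minute
_windows = defaultdict(deque)  # (tier, ip) -> recent request timestamps

def allow_request(tier, ip, now=None):
    """Sliding 60-second window per bot tier and source IP.
    Returns False when the caller should respond 429."""
    now = now or time.time()
    limit = TIER_LIMITS.get(tier)
    if limit is None:
        return True  # unknown tier; defer to other controls
    window = _windows[(tier, ip)]
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= limit:
        return False
    window.append(now)
    return True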
Emerging Standards: Semantic Headers for AI
The industry is developing new standards specifically for AI content usage—beyond robots.txt.
X-Robots-Tag for AI
Some publishers are experimenting with AI-specific HTTP headers:
X-Robots-Tag: noai
X-Robots-Tag: noimageai
X-Robots-Tag: noai, noimageai

Current support:
| Directive | Meaning | Vendor Support |
|---|---|---|
| noai | Don’t use for AI training | Google (partial), others TBD |
| noimageai | Don’t use images for AI | Google (Gemini) |
Reality Check: These headers are advisory only. RAG bots fetching content for real-time answers may ignore them entirely—the content isn’t being “trained on,” just displayed. Enforcement is nonexistent.
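If you choose to publish these advisory headers anyway, it is a one-line server change. An Nginx sketch:

# Advisory only: compliant crawlers may honor it; nothing enforces it
add_header X-Robots-Tag "noai, noimageai" always;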
The robots.txt AI Extensions
Proposed extensions to robots.txt for AI-specific control:
# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /
# Allow RAG bots (user-initiated fetches)
User-agent: ChatGPT-User
User-agent: Perplexity-User
Allow: /
# Block all AI uses (if this ever becomes standard)
User-agent: *
X-Robots-Tag: noai

The problem: There’s no enforcement mechanism. Well-behaved vendors comply; bad actors ignore it. Technical detection remains essential.
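You can at least confirm that your own file says what you intend. A quick check with Python's built-in robotparser (example.com is a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder domain
rp.read()

# Training crawler should be blocked; the RAG bot should not be.
print(rp.can_fetch('GPTBot', 'https://example.com/pricing'))        # False
print(rp.can_fetch('ChatGPT-User', 'https://example.com/pricing'))  # True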
Audit Your RAG Bot Traffic
Use this script to analyze your access logs and identify RAG bot activity:
#!/bin/bash
# rag-bot-audit.sh - Analyze RAG bot traffic in access logs
# Usage: ./rag-bot-audit.sh /var/log/nginx/access.log
LOG_FILE="${1:-/var/log/nginx/access.log}"
echo "=== RAG Bot Traffic Audit ==="
echo "Log file: $LOG_FILE"
echo "Generated: $(date)"
echo ""
# Define RAG bot patterns
RAG_PATTERNS="ChatGPT-User|Perplexity-User|PerplexityBot|Applebot-Extended|bingpreview"
TRAINING_PATTERNS="GPTBot|ClaudeBot|CCBot|Bytespider|Google-Extended|anthropic-ai"
echo "=== RAG Bot Requests (Last 24 Hours) ==="
echo ""
# Count by bot type
echo "Requests by RAG bot type:"
grep -E "$RAG_PATTERNS" "$LOG_FILE" | \
grep -oE "$RAG_PATTERNS" | \
sort | uniq -c | sort -rn
echo ""
echo "=== Training Scraper Requests (for comparison) ==="
grep -E "$TRAINING_PATTERNS" "$LOG_FILE" | \
grep -oE "$TRAINING_PATTERNS" | \
sort | uniq -c | sort -rn
echo ""
echo "=== Top Pages Fetched by RAG Bots ==="
grep -E "$RAG_PATTERNS" "$LOG_FILE" | \
awk '{print $7}' | \
sort | uniq -c | sort -rn | head -20
echo ""
echo "=== Hourly RAG Bot Traffic Pattern ==="
grep -E "$RAG_PATTERNS" "$LOG_FILE" | \
awk '{print $4}' | \
cut -d: -f2 | \
sort | uniq -c | \
awk '{printf "%02d:00 - %s requests\n", $2, $1}'
echo ""
echo "=== Potential Stealth Crawlers ==="
echo "(High-frequency IPs with Chrome UA but no asset requests)"
echo ""
# Find IPs claiming Chrome but requesting only HTML
grep -i "Chrome" "$LOG_FILE" | \
grep -v -E "\.(css|js|png|jpg|gif|woff|svg)" | \
awk '{print $1}' | \
sort | uniq -c | sort -rn | \
awk '$1 > 50 {print $1 " requests from " $2 " (no assets loaded)"}' | \
head -10
echo ""
echo "=== Recommendations ==="
echo "1. Verify high-volume RAG bots against published IP ranges"
echo "2. Investigate 'stealth crawler' IPs for asset loading patterns"
echo "3. Check if top-fetched pages align with your SEO priorities"Python version for more detailed analysis:
#!/usr/bin/env python3
"""
rag_bot_audit.py - Detailed RAG bot traffic analysis
Usage: python rag_bot_audit.py /var/log/nginx/access.log
"""
import re
import sys
from collections import defaultdict
from datetime import datetime
RAG_BOTS = {
'ChatGPT-User': 'openai',
'Perplexity-User': 'perplexity',
'PerplexityBot': 'perplexity_index',
'Applebot-Extended': 'apple',
'bingpreview': 'microsoft'
}
TRAINING_BOTS = ['GPTBot', 'ClaudeBot', 'CCBot', 'Bytespider', 'Google-Extended']
def parse_log_line(line):
"""Parse nginx combined log format."""
pattern = r'(\S+) .* \[(.+?)\] "(\S+) (\S+) .*" (\d+) .* ".*" "(.*)"'
match = re.match(pattern, line)
if match:
return {
'ip': match.group(1),
'timestamp': match.group(2),
'method': match.group(3),
'path': match.group(4),
'status': match.group(5),
'user_agent': match.group(6)
}
return None
def identify_bot(user_agent):
"""Identify bot type from user agent."""
for pattern, bot_type in RAG_BOTS.items():
if pattern in user_agent:
return ('rag', bot_type)
for pattern in TRAINING_BOTS:
if pattern in user_agent:
return ('training', pattern)
return (None, None)
def main(log_file):
rag_stats = defaultdict(lambda: {'requests': 0, 'paths': defaultdict(int), 'ips': set()})
training_stats = defaultdict(int)
suspicious_ips = defaultdict(lambda: {'html': 0, 'assets': 0, 'ua': set()})
with open(log_file, 'r') as f:
for line in f:
parsed = parse_log_line(line)
if not parsed:
continue
bot_category, bot_type = identify_bot(parsed['user_agent'])
if bot_category == 'rag':
rag_stats[bot_type]['requests'] += 1
rag_stats[bot_type]['paths'][parsed['path']] += 1
rag_stats[bot_type]['ips'].add(parsed['ip'])
elif bot_category == 'training':
training_stats[bot_type] += 1
elif 'Chrome' in parsed['user_agent']:
# Track potential stealth crawlers
ip = parsed['ip']
suspicious_ips[ip]['ua'].add(parsed['user_agent'][:50])
if re.search(r'\.(css|js|png|jpg|gif|woff)', parsed['path']):
suspicious_ips[ip]['assets'] += 1
else:
suspicious_ips[ip]['html'] += 1
# Output report
print("=" * 60)
print("RAG BOT TRAFFIC AUDIT REPORT")
print("=" * 60)
print(f"\nGenerated: {datetime.now()}")
print(f"Log file: {log_file}\n")
print("RAG BOT SUMMARY")
print("-" * 40)
for bot_type, stats in sorted(rag_stats.items(), key=lambda x: -x[1]['requests']):
print(f"\n{bot_type.upper()}")
print(f" Total requests: {stats['requests']}")
print(f" Unique IPs: {len(stats['ips'])}")
print(f" Top pages:")
for path, count in sorted(stats['paths'].items(), key=lambda x: -x[1])[:5]:
print(f" {count:5d} {path[:60]}")
print("\n" + "=" * 60)
print("TRAINING SCRAPERS (for comparison)")
print("-" * 40)
for bot, count in sorted(training_stats.items(), key=lambda x: -x[1]):
print(f" {bot}: {count} requests")
print("\n" + "=" * 60)
print("POTENTIAL STEALTH CRAWLERS")
print("-" * 40)
print("(Chrome UA with low asset-to-HTML ratio)\n")
for ip, data in sorted(suspicious_ips.items(), key=lambda x: -x[1]['html']):
if data['html'] > 50 and data['assets'] < data['html'] * 0.1:
ratio = data['assets'] / data['html'] if data['html'] > 0 else 0
print(f" {ip}")
print(f" HTML requests: {data['html']}")
print(f" Asset requests: {data['assets']}")
print(f" Asset ratio: {ratio:.2%} (expected: 3000%+)")
print(f" VERDICT: Likely bot")
print()
if __name__ == '__main__':
if len(sys.argv) < 2:
print("Usage: python rag_bot_audit.py /path/to/access.log")
sys.exit(1)
    main(sys.argv[1])

Monitoring RAG Bot Traffic
Track RAG bots separately from training scrapers in your analytics:
Metrics to Capture
{
"event": "rag_bot_request",
"timestamp": "2026-01-13T14:30:00Z",
"bot_type": "chatgpt_user",
"verified": true,
"source_ip": "23.98.142.180",
"path": "/pricing",
"pages_in_session": 2,
"honeypot_triggered": false,
"response_code": 200,
"rate_limited": false
}

Dashboard Insights
Track these metrics for RAG bots specifically:
| Metric | What It Tells You |
|---|---|
| RAG requests/day by bot type | Which AI search engines send users to you |
| Pages per RAG session | Are bots answering queries or training? |
| Verification rate | How many claimed AI bots are spoofed? |
| Honeypot trigger rate | Are bots lying about being query-driven? |
| Rate-limit hits | Is your threshold appropriate? |
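A sketch of deriving two of these metrics from the captured events; field names follow the JSON example above:

from collections import defaultdict

def dashboard_metrics(events):
    """events: an iterable of rag_bot_request dicts (see example above)."""
    stats = defaultdict(lambda: {'total': 0, 'verified': 0, 'honeypot': 0})
    for e in events:
        s = stats[e['bot_type']]
        s['total'] += 1
        s['verified'] += bool(e.get('verified'))
        s['honeypot'] += bool(e.get('honeypot_triggered'))
    return {
        bot: {
            'verification_rate': s['verified'] / s['total'],
            'honeypot_trigger_rate': s['honeypot'] / s['total'],
        }
        for bot, s in stats.items()
    }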
SIEM Integration
Send RAG bot events to your SIEM for correlation with other security data:
{
"event_type": "ai_rag_bot_detected",
"threat_level": "informational",
"bot_identity": "perplexity_user",
"ip_verified": true,
"behavior_anomaly": false,
"action_taken": "rate_limited"
}

See our SIEM integration guide for full setup.
Future Developments
The RAG bot landscape is evolving rapidly:
Web Bot Auth Standard
OpenAI is implementing cryptographic request signing through the Web Bot Auth standard. This enables:
- Definitive verification of bot identity
- Tamper-proof request authentication
- No reliance on IP ranges or user agents alone
As adoption spreads, expect this to become the standard for legitimate RAG bot verification.
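Full verification means validating an HTTP Message Signature (RFC 9421) against keys the vendor publishes. The sketch below only detects that a request claims to be signed and routes it to a verifier; the header names follow the draft standard, and the routing labels are hypothetical:

def claims_web_bot_auth(headers):
    """True if the request carries HTTP Message Signature headers,
    the mechanism Web Bot Auth builds on."""
    return all(h in headers for h in
               ('Signature', 'Signature-Input', 'Signature-Agent'))

def route_request(headers):
    if claims_web_bot_auth(headers):
        # Hand off to a verifier that fetches the key directory named in
        # Signature-Agent and cryptographically checks the signature.
        return 'verify_signature'
    return 'legacy_checks'  # UA + IP, TLS fingerprint, behavioral layers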
AI Agent Evolution
RAG bots are the precursor to full AI agents that browse, interact, and transact on behalf of users. Today’s detection approaches for RAG bots will inform tomorrow’s agent authentication systems.
See our coverage of agentic commerce challenges for where this is heading.
RAG Bot Policy Checklist
Take this to your next security review:
Immediate Actions
- Audit current RAG bot traffic using the log analysis script
- Verify IP ranges for ChatGPT-User and Perplexity-User
- Deploy honeypot links to detect deceptive crawlers
- Add RAG bot events to your analytics pipeline
Policy Decisions
- Define bot classification tiers (Trusted/Conditional/Blocked)
- Set rate limits per tier
- Establish cross-functional review process (Security + Marketing)
- Document escalation path for new bot types
Technical Implementation
- Implement User-Agent + IP verification (Layer 1)
- Add TLS fingerprinting for spoofed UA detection (Layer 2)
- Deploy asset-loading behavioral analysis (Layer 3)
- Configure WAF rules (Cloudflare/AWS/Nginx)
Monitoring & Analytics
- Track RAG bot requests separately from training scrapers
- Correlate RAG fetches with traffic spikes
- Monitor referrer traffic from AI search engines
- Send events to SIEM for security correlation
Conclusion
RAG bots aren’t just another scraper variant—they represent a fundamental shift in how users discover and interact with web content. When you block ChatGPT-User, you’re not just protecting content from training; you’re making yourself invisible to users who search via AI.
The bottom line:
- Distinguish RAG bots from training scrapers in your detection systems
- Verify identity using IP ranges AND user agents together
- Deploy honeypots to catch bots lying about being query-driven
- Rate-limit rather than block for verified, well-behaved RAG bots
- Block aggressively when identity can’t be verified or behavior is deceptive
- Align Security and Marketing with a governance framework
WebDecoy provides the detection capabilities to make these distinctions:
- TLS fingerprinting to verify bot identity
- Behavioral analysis to distinguish RAG from training patterns
- Honeypot decoys to catch deceptive bots
- Geographic consistency to detect proxy evasion
The AI search era is here. Your detection strategy needs to evolve with it.
References
- Perplexity Crawlers Documentation - Official documentation on PerplexityBot and Perplexity-User behavior
- Cloudflare: Perplexity is Using Stealth Crawlers - Technical analysis of undeclared crawler behavior
- Momentic: AI Search Crawlers + User Agents - Comprehensive list of AI crawler user agents
- Security Boulevard: AI Bot Authentication - Web Bot Auth standard coverage
Need help implementing RAG bot detection? Contact our team or explore our bot detection solutions.