Next.js Bot Detection: Block AI Crawlers at the Edge

Your Vercel usage graph is climbing and your logs are full of names you did not invite: GPTBot, ClaudeBot, PerplexityBot, Bytespider. They hammer your App Router pages, crawl the parts of the site you would rather keep private, and quietly run up your bandwidth and compute bill. The reflex is a ten-line user-agent block in middleware.ts, and after you ship it the graph looks calmer for a day. Then it climbs again.

The ten-line block is not wrong. It is just the first of three layers, and on its own it catches only the crawlers that were honest enough to tell you who they are. This is a working guide to building all three layers in a normal Next.js project: an edge gate that runs on every request, honeypot routes that catch the bots that lie, and an origin fingerprint check for the signal the edge genuinely cannot see. We will also be honest about that last part, because most tutorials are not.

No marketing, code throughout.

The naive block, and exactly why it fails

Almost every Next.js bot-blocking guide ends here:

// middleware.ts
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'

// Google-Extended is a robots.txt token, not a user agent Google sends, so it is not listed here.
const BLOCKED = /GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot|Meta-ExternalAgent|Amazonbot/i

export function middleware(req: NextRequest) {
  const ua = req.headers.get('user-agent') || ''
  if (BLOCKED.test(ua)) {
    return new NextResponse('Forbidden', { status: 403 })
  }
  return NextResponse.next()
}

export const config = {
  matcher: ['/((?!_next/static|_next/image|favicon.ico).*)'],
}

This works against a crawler that announces itself. GPTBot sends a user agent that says GPTBot, you match it, you return 403. Done.

The problem is that a user agent is a string the client chooses. Nothing forces a scraper to keep telling the truth, and once blocking becomes common, the well-funded scrapers stop announcing themselves. In 2025, Cloudflare reported that Perplexity fetched pages with a generic Chrome user agent and rotating addresses after its declared bot was blocked. A scraper running headless Chrome or a plain HTTP client can set any user-agent header it likes in one line. Your regex never matches them.
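To make that concrete, here is a minimal sketch that runs an abbreviated copy of the blocklist against two user-agent strings. The strings are illustrative examples rather than captured traffic, and isBlocked is a helper name made up for this sketch.

```typescript
// An abbreviated copy of the blocklist from the middleware above.
const BLOCKED = /GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot/i

// Illustrative strings: real crawler user agents vary by version.
const declaredCrawler = 'Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)'
const spoofedScraper =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'

// Hypothetical helper: true when the user agent names a declared crawler.
function isBlocked(userAgent: string): boolean {
  return BLOCKED.test(userAgent)
}

// Setting the header is the one line a scraper needs.
const spoofedRequest = new Request('https://example.com/pricing', {
  headers: { 'user-agent': spoofedScraper },
})

console.log(isBlocked(declaredCrawler)) // true
console.log(isBlocked(spoofedRequest.headers.get('user-agent') ?? '')) // false
```

The second request could come from the same scraper as the first. Nothing in the string distinguishes it from a real Chrome visitor.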

So the honest framing is this: a user-agent block is a politeness filter. It removes the crawlers that respect your wishes, which is real and worth doing, and it does nothing to the ones that do not. The same logic applies to robots.txt, which is a request rather than a rule. For the deeper version of this argument, see our breakdown of how to detect AI scrapers like GPTBot, ClaudeBot, and Perplexity. Here we will keep the why short and spend the rest of the post on the parts that hold up.

Layer one: a real edge middleware

Keep the user-agent gate, but stop treating it as the whole defense. A useful middleware does three jobs: it cheaply blocks the honest crawlers, it rate limits everyone else so a single client cannot flood you, and it hands a signal to your origin so the deeper check knows where to look.

Where middleware lives and what matcher does

middleware.ts sits at the root of your project, or inside src/ if you use a src directory. It runs on the Edge Runtime by default, before your routes and before cached output, which is exactly why it is the right place for a first gate: the request is stopped before it costs you a function invocation or a database hit.

The matcher is your most important performance setting. Without it, middleware runs on every asset, including the static files Next.js can already serve for free. Scope it to the routes worth protecting:

export const config = {
  matcher: [
    // run on everything except Next internals and static files
    '/((?!_next/static|_next/image|favicon.ico|robots.txt|sitemap.xml).*)',
  ],
}
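If you want to sanity-check which paths a matcher like this covers, one rough approach is to anchor the same source as a plain regular expression and test a few paths. This is only an approximation, since Next.js compiles matchers itself, and middlewareRuns is a helper name invented for the sketch.

```typescript
// The matcher source from the config above.
const MATCHER = '/((?!_next/static|_next/image|favicon.ico|robots.txt|sitemap.xml).*)'

// Approximation only: Next.js does its own compilation of matcher strings.
const asRegExp = new RegExp(`^${MATCHER}$`)

// Hypothetical helper: would middleware run for this pathname?
function middlewareRuns(pathname: string): boolean {
  return asRegExp.test(pathname)
}

const samples = [
  '/',
  '/pricing',
  '/api/checkout',
  '/_next/static/chunks/main.js',
  '/favicon.ico',
  '/robots.txt',
]

for (const path of samples) {
  console.log(path, middlewareRuns(path))
}
```

In this approximation the first three paths run middleware and the last three skip it. The negative lookahead is a prefix check, so any path that merely starts with an excluded name is skipped too.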

Block, rate limit, or rewrite

NextResponse gives you three moves inside middleware. Return a 403 to block, a 429 to rate limit, or a rewrite that sends a suspected bot somewhere other than your real content while the URL in its client stays the same.

return new NextResponse('Forbidden', { status: 403 })          // block
return new NextResponse('Too Many Requests', { status: 429 })  // rate limit
return NextResponse.rewrite(new URL('/tarpit', req.url))       // send to a decoy

You can hand-roll the gate from here, but it adds up fast: a regex of declared crawlers to maintain, plus a shared rate-limit store, because an in-process counter (a plain Map) will not hold when edge invocations do not share memory, so you reach for Upstash Redis or Vercel KV. That is ongoing work, and it is the work the WebDecoy Next.js package exists to remove.
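For a sense of what the hand-rolled version involves, here is a minimal fixed-window limiter written against a small store interface. CounterStore, MemoryStore, and rateLimit are names invented for this sketch. The Map-backed store is there only so the example runs on its own, and a real deployment would back the same interface with a shared store.

```typescript
// The limiter only needs one operation from its store.
interface CounterStore {
  // Increment the counter for `key` and return the new count.
  // The counter expires `windowSeconds` after its first increment.
  incr(key: string, windowSeconds: number): Promise<number>
}

// Illustration only: a Map is per-process, so it will not hold across edge invocations.
class MemoryStore implements CounterStore {
  private readonly counters = new Map<string, { count: number; resetAt: number }>()
  private readonly now: () => number

  constructor(now: () => number = Date.now) {
    this.now = now
  }

  async incr(key: string, windowSeconds: number): Promise<number> {
    const t = this.now()
    const entry = this.counters.get(key)
    if (!entry || entry.resetAt <= t) {
      // First request in a new window.
      this.counters.set(key, { count: 1, resetAt: t + windowSeconds * 1000 })
      return 1
    }
    entry.count += 1
    return entry.count
  }
}

// Returns a 429 response when the client is over the limit, or null to continue.
async function rateLimit(
  store: CounterStore,
  clientKey: string,
  max: number,
  windowSeconds: number,
): Promise<Response | null> {
  const count = await store.incr(`rl:${clientKey}`, windowSeconds)
  if (count <= max) return null
  return new Response('Too Many Requests', {
    status: 429,
    // Upper bound: the window may reset sooner than this.
    headers: { 'retry-after': String(windowSeconds) },
  })
}
```

Inside middleware you would call rateLimit with the client address as the key and return the response when it is not null. A fixed window allows short bursts at window boundaries, which is usually acceptable for a first gate.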

@webdecoy/nextjs ships a withWebDecoy middleware that does the screening, scoring, and rate limiting in one wrapper, with the bot intelligence kept current for you:

npm install @webdecoy/nextjs

// middleware.ts
import { withWebDecoy } from '@webdecoy/nextjs'
import { rateLimit } from '@webdecoy/node'

export default withWebDecoy({
  apiKey: process.env.WEBDECOY_API_KEY!,
  // Built-in rules engine: no separate counter store to stand up.
  rules: [rateLimit({ max: 100, window: 60 })], // 100 requests per 60s
  // Skip work on paths that never need protection.
  skipPaths: ['/_next', '/favicon.ico', '/robots.txt'],
})

export const config = {
  matcher: ['/((?!_next/static|_next/image|favicon.ico).*)'],
}

On every matched request this runs WebDecoy’s local analysis, applies your rules, and returns the right response on its own: a rateLimit rule that trips returns 429 with a Retry-After header, a deny rule returns 403, and an allowed request continues with an x-webdecoy-decision header attached so your routes can read the verdict downstream. You can override the behavior with onBlocked and onError callbacks, and onError fails open by default, so a hiccup in detection never locks out real users.

This is a real improvement over the ten-line version, and the per-request cost is still small. But notice what every signal so far has in common. User agent, headers, address, and request rate are all things the client controls or can rotate. To catch a bot that lies about all of them, you need a signal it does not get to set, which is its own TLS handshake.

The honest constraint nobody mentions

Here is the part most Next.js guides skip, and it holds whether you hand-roll the gate or use withWebDecoy: you cannot compute a TLS fingerprint inside middleware.ts. Edge middleware can screen headers, apply your rate-limit rules, and lean on IP and geo intelligence, but the JA3 and JA4 layer specifically needs something the edge never sees.

A JA3 or JA4 fingerprint is built from the raw ClientHello of the TLS handshake: the cipher suites, the extensions and their order, the supported groups, the way the client negotiates the connection. These are much harder to fake than a header, because they come from the client’s TLS stack and changing them means swapping or patching that stack rather than editing a string. We cover the technique in detail in our JA4 fingerprinting guide for AI scrapers.
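As a rough illustration of why the handshake is a stable signal, here is a sketch of the JA3 construction: five ClientHello fields joined into a string and hashed with MD5. It assumes the ClientHello has already been parsed into numbers, which is exactly the part middleware cannot do. The ClientHello type and the sample values are made up for the example, and JA4 uses a different format that, among other changes, sorts the extension list.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical shape: fields already parsed from a ClientHello, in wire order.
type ClientHello = {
  version: number
  cipherSuites: number[]
  extensions: number[]
  supportedGroups: number[]
  ecPointFormats: number[]
}

// GREASE values (0x0a0a, 0x1a1a, ... 0xfafa) are placeholders and are left out of JA3.
function isGrease(value: number): boolean {
  return (value & 0x0f0f) === 0x0a0a && value >> 8 === (value & 0xff)
}

// JA3 string: version,ciphers,extensions,groups,pointFormats with '-' inside each field.
function ja3String(hello: ClientHello): string {
  const join = (values: number[]) => values.filter((v) => !isGrease(v)).join('-')
  return [
    String(hello.version),
    join(hello.cipherSuites),
    join(hello.extensions),
    join(hello.supportedGroups),
    hello.ecPointFormats.join('-'),
  ].join(',')
}

// The JA3 fingerprint is the MD5 hex digest of that string.
function ja3Hash(hello: ClientHello): string {
  return createHash('md5').update(ja3String(hello)).digest('hex')
}

// Made-up sample values, not a capture from any real client.
const sample: ClientHello = {
  version: 771,
  cipherSuites: [0x0a0a, 4865, 4866, 4867],
  extensions: [0x1a1a, 0, 23, 65281, 10, 11],
  supportedGroups: [0x2a2a, 29, 23, 24],
  ecPointFormats: [0],
}

console.log(ja3String(sample)) // 771,4865-4866-4867,0-23-65281-10-11,29-23-24,0
console.log(ja3Hash(sample))
```

Two clients can send an identical user-agent header and still produce different strings here, because the values come from whichever TLS library made the connection.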

The catch on a platform like Vercel is that TLS terminates at the edge network before your middleware runs. By the time your code executes, the handshake is over and the ClientHello bytes are gone. The Edge Runtime has no socket access and no node:tls, so there is nothing to read. Recent Next.js versions let you move middleware to the Node.js runtime, which is useful for other reasons, but it still does not hand you the original handshake, so it does not change this story.

This is not a flaw in Next.js. It is just where the layers sit. The fix is to put the fingerprint check where the handshake is actually visible, which means one of two places:

Your own origin, when you self-host Next.js behind your own TLS termination (for example next start behind nginx or Caddy), where the proxy can read the handshake and forward it to your app as headers.
A detection service that captures those handshake signals for you and returns a verdict your route handler can act on.
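For the self-hosted option, the origin has to be sure a fingerprint header really came from its own proxy, because a client can send any header it likes. A minimal sketch, assuming the proxy adds a shared secret and the fingerprint as two headers; both header names and readForwardedFingerprint are hypothetical, so substitute whatever your proxy is configured to send.

```typescript
import { timingSafeEqual } from 'node:crypto'

// Hypothetical header names, chosen for this sketch.
const FINGERPRINT_HEADER = 'x-tls-ja4'
const PROXY_AUTH_HEADER = 'x-proxy-secret'

// Constant-time string comparison for the shared secret.
function safeEqual(a: string, b: string): boolean {
  const left = Buffer.from(a)
  const right = Buffer.from(b)
  return left.length === right.length && timingSafeEqual(left, right)
}

// Returns the forwarded fingerprint only when the request carries the proxy's secret.
function readForwardedFingerprint(headers: Headers, proxySecret: string): string | null {
  const presented = headers.get(PROXY_AUTH_HEADER) ?? ''
  if (!proxySecret || !safeEqual(presented, proxySecret)) return null
  const fingerprint = headers.get(FINGERPRINT_HEADER)?.trim()
  return fingerprint ? fingerprint : null
}

// Usage inside a route handler (the secret would come from an environment variable):
// const fingerprint = readForwardedFingerprint(req.headers, process.env.PROXY_SECRET ?? '')
```

The proxy should also strip any client-supplied copy of these headers before adding its own, so that the secret check is a second line of defense and not the only one.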

So the architecture becomes: gate cheaply at the edge, trap the liars with honeypots, and run the fingerprint check at the origin where the signal lives.

Layer two: honeypot routes in the App Router

A honeypot catches a bot using a simple asymmetry: a real visitor never touches it, so any hit is suspicious by definition. The classic version is a hidden form field, which still catches plenty of low-effort spam. For a Next.js crawler problem, a honeypot route is a better fit, because crawlers follow links and probe paths that humans never click.

The mechanics of placement, naming, and avoiding false positives deserve their own read: see honeypot traps for forms, buttons, and endpoints and our take on decoy links. Here is the Next.js wiring.

First, plant a decoy link that humans cannot see but a link-following scraper will. Put it in your layout, and disallow the path in robots.txt so that honest crawlers skip it. Anything that fetches it has both ignored robots.txt and followed an invisible link, which is a strong signal.

// app/layout.tsx (excerpt)
export default function RootLayout({ children }: { children: React.ReactNode }) {
  return (
    <html lang="en">
      <body>
        {children}
        {/* Invisible to humans, irresistible to link-scraping bots. */}
        <a href="/api/trap" aria-hidden="true" tabIndex={-1}
           style={{ position: 'absolute', left: '-9999px' }}>
          Account archive
        </a>
      </body>
    </html>
  )
}
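The robots.txt half of that setup is small. Here is a minimal sketch that builds the file body from a list of trap paths; buildRobotsTxt and TRAP_PATHS are names made up for this example. You could serve the output from a static public/robots.txt or from the App Router's robots metadata file.

```typescript
// Paths that exist only as traps. Honest crawlers are told to stay away.
const TRAP_PATHS = ['/api/trap']

// Hypothetical helper: build a robots.txt body that disallows the given paths for all agents.
function buildRobotsTxt(disallow: string[]): string {
  return ['User-agent: *', ...disallow.map((path) => `Disallow: ${path}`), ''].join('\n')
}

console.log(buildRobotsTxt(TRAP_PATHS))
// User-agent: *
// Disallow: /api/trap
```

One trade-off to keep in mind: robots.txt is public, so listing a trap there also tells a careful operator which path to avoid.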

Then the trap itself, a route handler that records the hit and responds blandly so the bot does not learn it was caught:

// app/api/trap/route.ts
import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'
import { flagClient } from '@/lib/threat'

export async function GET(req: NextRequest) {
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0] ?? 'unknown'
  const ua = req.headers.get('user-agent') || ''

  // Record the hit. Anything reaching this route is presumed automated.
  await flagClient({ ip, ua, reason: 'honeypot:trap', score: 80 })

  // Respond like a boring empty resource. Do not reveal the trap.
  return new NextResponse(null, { status: 204 })
}

Now your middleware can read that stored flag and act on it before serving real content. Rewrite flagged traffic to a tarpit page instead of your actual route. A rewrite is served under the original URL, so the bot gets no redirect to notice:

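// middleware.ts: this import belongs at the top of the file
import { isFlagged } from '@/lib/threat'

// inside middleware(), after the rate-limit check (middleware must be declared async)
const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? 'unknown'
if (ip !== 'unknown' && (await isFlagged(ip))) {
  return NextResponse.rewrite(new URL('/tarpit', req.url))
}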
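The snippets above import flagClient and isFlagged from @/lib/threat without showing that module. One possible shape, as a sketch only: a flag keyed by address with a time-to-live, written against a small store interface. FlagStore, MemoryFlagStore, createThreatApi, and the one-hour TTL are all assumptions. Because the trap runs in a route handler and the check runs in middleware, a real version needs a store both can reach.

```typescript
type Flag = { ip: string; ua: string; reason: string; score: number }

// Hypothetical store interface: anything with get/set and expiry will do.
interface FlagStore {
  set(key: string, value: Flag, ttlSeconds: number): Promise<void>
  get(key: string): Promise<Flag | null>
}

// Illustration only: per-process memory is not shared between middleware and routes.
class MemoryFlagStore implements FlagStore {
  private readonly flags = new Map<string, { value: Flag; expiresAt: number }>()
  private readonly now: () => number

  constructor(now: () => number = Date.now) {
    this.now = now
  }

  async set(key: string, value: Flag, ttlSeconds: number): Promise<void> {
    this.flags.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 })
  }

  async get(key: string): Promise<Flag | null> {
    const entry = this.flags.get(key)
    if (!entry) return null
    if (entry.expiresAt <= this.now()) {
      this.flags.delete(key) // expired: forget the flag
      return null
    }
    return entry.value
  }
}

// Assumption: flags expire after an hour so a reassigned address is not punished forever.
const FLAG_TTL_SECONDS = 60 * 60

function createThreatApi(store: FlagStore) {
  return {
    async flagClient(flag: Flag): Promise<void> {
      // Never flag the shared fallback value, or every client without an address gets caught.
      if (flag.ip === 'unknown') return
      await store.set(`flag:${flag.ip}`, flag, FLAG_TTL_SECONDS)
    },
    async isFlagged(ip: string): Promise<boolean> {
      return (await store.get(`flag:${ip}`)) !== null
    },
  }
}
```

Expiring flags is a deliberate choice here: addresses get reassigned and shared, so a permanent ban keyed on an address tends to catch real visitors eventually.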

The same pattern extends to fake API endpoints. A path like /api/v1/users/export that your real app never calls, but that a scraper probing for data will, becomes a high-confidence trap. Our writeup on endpoint decoys and API honeypots goes deeper on which fake endpoints attract the most bots.

Honeypots are powerful because they need no fingerprint and no machine learning. They simply exploit the gap between how a human and a script move through a site. But a careful scraper that only fetches linked, allowed pages at a human pace will avoid them. That is the gap the third layer closes.

Layer three: origin fingerprinting with the WebDecoy SDK

For the bot that spoofs its user agent, rotates its address, paces itself, and avoids your traps, you need a signal it cannot change by editing a header: its TLS handshake. As covered above, that check has to run at the origin, in a Node runtime, not in edge middleware.

The withWebDecoy middleware handles the edge. For the deeper check you drop down to the core SDK in a route handler. That is the right home for it on two counts: the verdict arrives inside the handler that has to act on it, and Next.js route handlers default to the Node.js runtime, where a self-hosted origin can read the TLS details its proxy forwards. The WebDecoy Node SDK (@webdecoy/node) runs a two-tier check: a fast local pass on your server (suspicious headers, datacenter IP ranges, known bot user agents, missing client hints), and a deeper server-side pass that uses JA3 and JA4 fingerprinting to flag the case where a request claims to be Chrome but handshakes like curl or a headless engine.

Install the core package, pin the route to the Node runtime, and run the check on the routes that matter most, such as login, checkout, signup, or any data-heavy API:

npm install @webdecoy/node

// app/api/checkout/route.ts
export const runtime = 'nodejs' // the Edge Runtime cannot see the handshake

import { NextResponse } from 'next/server'
import type { NextRequest } from 'next/server'
import { WebDecoy } from '@webdecoy/node'

const webdecoy = new WebDecoy({
  apiKey: process.env.WEBDECOY_API_KEY!,
  enableTLSFingerprinting: true,
  threatScoreThreshold: 70, // block at a threat score of 70 or higher
})

export async function POST(req: NextRequest) {
  const result = await webdecoy.protect({
    method: req.method,
    path: new URL(req.url).pathname,
    ip: req.headers.get('x-forwarded-for')?.split(',')[0] ?? '0.0.0.0',
    user_agent: req.headers.get('user-agent') ?? '',
    headers: Object.fromEntries(req.headers),
    timestamp: Date.now(),
  })

  if (!result.allowed) {
    // result.detection carries decision, confidence (0 to 100), and bot_type.
    return NextResponse.json({ error: 'Request blocked' }, { status: 403 })
  }

  // Legitimate request: continue with your real handler.
  return NextResponse.json({ ok: true })
}

One honest caveat about where each tier can run. The local pass works anywhere, including a route handler on Vercel, because it only reads headers and the address. The JA3 and JA4 pass needs the client’s actual handshake, and a route handler never reads that directly: whatever terminates TLS has to capture it and pass it along. If you self-host Next.js behind your own proxy, configure nginx or Caddy to terminate TLS and forward the handshake details as headers, and the SDK gets the full fingerprint. On Vercel’s managed edge, your function never touches the raw handshake, so you lean on the local signals there and put the deep fingerprint check on a self-hosted origin or proxy. Either way, the decision lands in your route handler.
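Since the handler waits on that decision before doing real work, it is worth bounding the wait if the deeper pass involves a network call. A minimal sketch of a fail-open wrapper follows; withTimeout is a generic helper written for this example, not part of any SDK, and the 300 ms budget in the usage comment is an arbitrary assumption.

```typescript
// Resolve with `fallback` if `work` takes longer than `ms` or throws.
async function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms)
  })
  try {
    // Whichever settles first wins.
    return await Promise.race([work, timeout])
  } catch {
    // A failed check is treated the same as a slow one.
    return fallback
  } finally {
    clearTimeout(timer)
  }
}

// Usage sketch inside a handler, failing open with an "allowed" verdict:
// const result = await withTimeout(runCheck(req), 300, { allowed: true })
```

Whether the fallback should allow or challenge is a product decision. The wrapper only makes sure a slow or failing check cannot hold the request hostage.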

If you are still on the Pages Router, the @webdecoy/nextjs package gives you a handler wrapper instead, so you do not have to build the metadata object by hand:

// pages/api/checkout.ts
import { withBotProtection } from '@webdecoy/nextjs'
import type { NextApiRequest, NextApiResponse } from 'next'

async function handler(req: NextApiRequest, res: NextApiResponse) {
  res.json({ ok: true }) // req.webdecoy holds the detection result
}

export default withBotProtection(handler, {
  apiKey: process.env.WEBDECOY_API_KEY!,
  blockThreshold: 70,
})

One decision from three signals

The point of three layers is that they cover each other’s blind spots. Combine them into a single verdict rather than three disconnected checks:

// app/lib/decide.ts
type Signals = {
  edgeScreened: boolean   // passed the edge user-agent and rate gate
  honeypotHit: boolean    // touched a trap at any point
  threatScore: number     // 0 to 100, from result.detection.confidence
}

export function decide(s: Signals): 'allow' | 'challenge' | 'block' {
  if (!s.edgeScreened) return 'block'     // did not pass the edge gate
  if (s.honeypotHit) return 'block'       // touched a trap: automated by definition
  if (s.threatScore >= 70) return 'block' // handshake or local signals say automation
  if (s.threatScore >= 40) return 'challenge' // suspicious, verify before trusting
  return 'allow'
}

A naive HTTP scraper trips the edge gate. A link-following scraper that lies about its user agent trips a honeypot. A polished headless browser that avoids the traps trips the fingerprint. To get past all three, a bot has to look like an ordinary browser in its headers, stay careful enough to avoid every trap, and use a real browser TLS stack at the same time, which is a much smaller and more expensive population than the flood you started with.

Vercel BotID versus a self-hosted stack

If you are on Vercel, you have probably seen BotID, the invisible bot-detection product that runs at Vercel’s edge and is powered by Kasada. It is genuinely good at what it does, and it is worth knowing where it fits relative to the stack above.

BotID is a managed black box. You enable it on the routes you want protected and it makes a verdict for you at the edge, with no signals to inspect and no logic to tune. That is the appeal and the limitation. You get strong detection with almost no code, and you give up visibility into why a request was flagged, portability off Vercel, and the ability to combine the verdict with your own honeypots and scoring. It is also a paid feature once you scale.

The self-hosted stack in this guide is the opposite trade. It is more code and more moving parts, and in return it is portable to any host, transparent about every signal, and yours to tune. The two are not mutually exclusive: some teams run BotID on checkout and login for the managed guarantee, and run the edge gate plus honeypots plus origin fingerprinting everywhere else for coverage and insight. If you are weighing managed against self-hosted security SDKs more broadly, our Arcjet versus WebDecoy comparison covers the same trade in the Next.js-native SDK space.

Production checklist

Before you ship this, walk the list:

Scope the matcher. Never run middleware on _next/static, images, or other assets. It is wasted compute and it can break caching.
Do not hard-block on user agent alone. Treat the user-agent gate as the cheap first pass, then escalate to honeypots and fingerprinting. A single spoofed header should not be enough to ban a visitor.
Allow the good bots on purpose. Verify Googlebot and Bingbot by reverse DNS rather than trusting the user-agent string, and decide deliberately which AI crawlers you want to keep. Some AI search engines send referral traffic worth having, which is the nuance in our piece on tracking LLM referral data and the wider question of the RAG bot problem.
Watch your false positive rate. Log every block and challenge with the reason, and review the challenge bucket. If real users land there, loosen the thresholds in decide().
Fail open, not closed. If the origin fingerprint service is briefly unreachable, decide whether a timeout should allow or challenge. For most sites, allowing on timeout is safer than locking out real customers.
Measure the bill. The whole point was bandwidth and compute. Watch the usage graph for a week after launch so you can prove the layers are paying for themselves.
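The reverse DNS check from the list above is a two-step lookup: resolve the address to a hostname, confirm the hostname belongs to the crawler's domain, then resolve that hostname forward and confirm it maps back to the same address. A minimal sketch, assuming a Node.js runtime since node:dns is not available in edge middleware; verifyCrawler and the Resolver type are names invented here, and the resolver is injectable so the logic can be exercised without the network.

```typescript
import { promises as dns } from 'node:dns'

// Injectable so tests can supply fake answers.
type Resolver = {
  reverse(ip: string): Promise<string[]>
  resolve(hostname: string): Promise<string[]>
}

const systemResolver: Resolver = {
  reverse: (ip) => dns.reverse(ip),
  resolve: async (hostname) =>
    (await dns.lookup(hostname, { all: true })).map((answer) => answer.address),
}

// Hostname suffixes the operators publish for their crawlers.
const CRAWLER_DOMAINS: Record<string, string[]> = {
  googlebot: ['.googlebot.com', '.google.com'],
  bingbot: ['.search.msn.com'],
}

// True only when reverse and forward lookups agree and the hostname is in the crawler's domain.
async function verifyCrawler(
  ip: string,
  crawler: string,
  resolver: Resolver = systemResolver,
): Promise<boolean> {
  const suffixes = CRAWLER_DOMAINS[crawler]
  if (!suffixes) return false
  try {
    for (const hostname of await resolver.reverse(ip)) {
      const host = hostname.toLowerCase().replace(/\.$/, '')
      if (!suffixes.some((suffix) => host.endsWith(suffix))) continue
      // Forward-confirm: the hostname must resolve back to the same address.
      const addresses = await resolver.resolve(host)
      if (addresses.includes(ip)) return true
    }
  } catch {
    // Lookup failures mean "not verified", never "verified".
  }
  return false
}
```

DNS lookups are slow compared with a header check, so in practice you would run this only for requests that claim to be a search crawler, and cache the answer per address.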

Where to go from here

The shape that works in Next.js is layered: a cheap edge gate that screens and rate limits, honeypot routes that catch the bots that lie, and an origin fingerprint check for the signal the edge cannot see. None of it requires a separate WAF or DNS surgery, and the one real constraint, that TLS fingerprinting cannot happen in edge middleware, is a reason to put that check at the origin rather than a reason to skip it.

If you want the fingerprinting layer without building and maintaining a fingerprint database yourself, the WebDecoy SDKs drop into a Next.js route handler and return the verdict your decide() function needs. Start for free and point it at your most abused route first, then expand from there.

Frequently Asked Questions

Does middleware.ts run on the Edge Runtime in Next.js?

By default, yes. A middleware.ts file runs on the Edge Runtime on every matched request, before your routes and before cached responses. Recent Next.js versions also let you opt middleware into the Node.js runtime, but that still does not give you the raw TLS handshake, so it does not change where fingerprinting has to live.

Can I block AI crawlers in Next.js without DNS or WAF changes?

Yes. Middleware and route handlers run inside your app, so you can gate, rate limit, and trap crawlers without touching DNS, nameservers, or a separate WAF. You only need DNS changes or a reverse proxy when you want TLS fingerprinting, because that signal lives at the layer that terminates the handshake.

Is robots.txt enough to stop AI crawlers?

No. robots.txt is a request, not a control. Well-behaved crawlers honor it, but several AI crawlers have been observed ignoring it or switching to generic user agents to keep fetching. Treat robots.txt as a signal of intent, then enforce with middleware, honeypots, and fingerprinting.

Why can't I do TLS or JA4 fingerprinting inside Next.js middleware?

TLS fingerprints like JA3 and JA4 are computed from the ClientHello bytes of the TLS handshake. On a platform like Vercel, TLS terminates at the edge network before your middleware runs, so the handshake is already gone by the time your code executes. Fingerprinting has to happen where the handshake is visible: your own origin behind your own TLS termination, or a service that captures those signals for you.

Should I block every AI crawler or only some of them?

It depends on your goals. If a crawler drives referral traffic you value, such as an AI search engine that cites sources, you may want to allow it while still rate limiting. If a crawler only scrapes for training with no return, blocking is reasonable. Decide per crawler rather than running one global block rule.
