Why We Don't Use Machine Learning for Bot Detection
An honest argument against ML-driven bot detection. Adversarial drift, training data poisoning, latency, and the explainability tax that nobody talks about.
WebDecoy Team
WebDecoy Security Team
Why We Don’t Use Machine Learning for Bot Detection (And You Probably Shouldn’t Either)
Every bot detection vendor pitch deck in 2026 has the same slide. It is a swirling diagram of neural network nodes, the words “AI-powered,” and a graph that goes up and to the right. The implication is that any defender not using machine learning is fighting tomorrow’s war with yesterday’s weapons. The buyers nod, the budget gets approved, and three months later the abuse team is back in Slack asking why the new system is missing the obvious credential stuffing run.
We have built bot detection products for years and we do not use machine learning as the primary scoring layer. We use rules, honeypots, browser and TLS fingerprints, proof-of-work, and behavioral biometrics, in roughly that order. Every layer is something an operator can read, modify, and explain to a regulator. None of it requires a GPU at request time. None of it silently degrades when the threat changes shape.
This is a deliberate choice and it is not a comfortable one to defend in a market that prices ML at a premium. So this piece is the long version of the argument. We will go through the seductive case for ML in bot detection, the four structural reasons it fails in this specific problem space, the places where ML is genuinely the right tool, and the stack we ship instead. If by the end you still want to put a deep model on the request path, at least you will know what you are buying.
This is also the worldview behind FCaptcha, our open source bot detection layer. The arguments below are vendor agnostic.
The Seductive Case for ML in Bot Detection
The pitch is genuinely compelling on paper. A bot is, by definition, a non-human accessing your site. Humans share a finite set of behaviors and bots, in aggregate, do not. So if you can train a classifier on enough labeled traffic, the model should learn the boundary between the two and apply it at scale, faster than any human analyst could write rules.
The pitch usually comes with three claims:
- ML can spot patterns humans would miss.
- ML adapts automatically to new attacker behavior.
- ML scales without linear effort from the security team.
Each of these claims is technically true under conditions that almost never hold in production bot detection. The conditions are: a stationary distribution, clean labels, low latency budget headroom, and an attacker who is not actively trying to invert your model. Bot detection violates all four conditions on a regular basis. So you end up paying for ML and getting back something that behaves more like an opaque, expensive, slow rules engine that you cannot edit.
Here is the case against, in detail.
Reason One: Adversarial Drift
The single largest reason ML fails in bot detection is that the people on the other side of the wire are reading your blog posts, prodding your endpoints, and adapting their tooling on a timescale that ML retraining cannot match.
In a traditional ML problem, the world is approximately stationary. A spam classifier trained on email from 2024 will still work passably on email from 2025 because the underlying distribution of “what English looks like” does not change overnight. Spam authors do try to drift their content, but there are only so many ways to phrase a Nigerian prince scam, and the language model that classifies them is also drifting forward at the same rate.
Bot detection is the opposite. The distribution of attacker behavior is not even approximately stationary. It is actively, intentionally, and rapidly being shaped by the people you are trying to detect. When you ship a new model, every halfway-competent attacker is running probes against it within days. They are not trying to fool a fixed function. They are trying to find the cheapest perturbation that flips your label.
The technical name for this is concept drift, and in bot detection it has three accelerants the academic literature usually does not capture:
1. Public infrastructure. Anti-detection toolchains like puppeteer-extra-plugin-stealth, undetected-chromedriver, and a long tail of fingerprint spoofers are open source. When you patch a detection signal, the patch is sometimes published with a CVE-like writeup. Within a release cycle the entire ecosystem of stealth tools has incorporated the bypass and shipped it to every script kiddie running headless Chrome.
2. Solver economies. Sites like 2Captcha, CapSolver, and AntiCaptcha do not just solve CAPTCHAs. They sell undetectability as a service. If you increase friction for headless browsers, the price of a high-quality residential session ticks up by half a cent, and the attackers absorb it as a cost of doing business.
3. Cheap GPU inference. When the attacker can spin up a vision model on a $0.50 per hour GPU and use it to generate the kind of mouse traces that look human to your behavioral classifier, the cost asymmetry that ML was supposed to give the defender flips around.
The result is that the half-life of an ML bot detection model in production is shockingly short. We have seen public benchmarks where vendor models achieve 99 percent precision on a labeled training set and then drop to 60 percent against live, novel attack traffic within thirty days. The model is not broken in any classical sense. The world around it is moving faster than its retraining cadence.
The boring fix is to retrain constantly. The honest fix is to admit that the problem shape does not match what classical supervised learning assumes.
Reason Two: Training Data Poisoning
This one is the elegant problem. The first time you understand it you start seeing it everywhere.
Most ML bot detection systems train on labels derived from observed behavior. A session that completes a purchase and never charges back is benign. A session that triggers fraud rules is malicious. A session that gets a hard CAPTCHA and abandons is suspicious. These labels then get fed back into a retraining pipeline, the model updates its decision boundary, and the cycle continues.
The attacker’s job is to shape the distribution of training data such that the boundary moves in a direction that helps them. They do not need to evade detection on every session, they only need to teach the model that their attack pattern is normal.
A concrete example. Suppose your model uses request-rate-per-IP as one of fifty input features. A sophisticated attacker runs two parallel campaigns:
- A high-volume but innocuous bot that browses your site, never buys anything, never logs in, and is essentially a noisy crawler with rotating residential IPs at moderate request rates.
- A low-volume credential stuffing campaign that uses the same residential IP pool but only at human-like request rates per IP.
The first campaign teaches your model that “moderate request rates from residential IPs that do not convert” is benign and high volume. The second campaign rides on top of that learned distribution. When you later try to flag the credential stuffing, the model has already been told that the underlying traffic shape is fine.
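The mechanics can be shown with a toy label-frequency model. This is a minimal sketch, not any vendor's pipeline: the traffic bucket, the session counts, and the 0.5 threshold are all invented for illustration, and a real model would learn over many features at once.

```python
from collections import Counter

def malicious_rate(labels):
    """Fraction of sessions in one traffic bucket that were labeled malicious."""
    counts = Counter(labels)
    total = counts["benign"] + counts["malicious"]
    return counts["malicious"] / total if total else 0.0

# Hypothetical bucket: residential IP, moderate request rate, no conversion.
# Before the attacker shows up, most sessions of this shape were labeled malicious.
clean_labels = ["malicious"] * 30 + ["benign"] * 20

# Campaign one: 500 harmless crawler sessions land in the same bucket. They never
# trip a fraud rule, so the labeling pipeline records them as benign.
poisoned_labels = clean_labels + ["benign"] * 500

THRESHOLD = 0.5  # assumed: flag the bucket when the learned rate exceeds this

before = malicious_rate(clean_labels)
after = malicious_rate(poisoned_labels)
print(f"before poisoning: {before:.2f} flagged={before > THRESHOLD}")
print(f"after poisoning:  {after:.2f} flagged={after > THRESHOLD}")
```

The credential stuffing sessions did not change at all between the two runs. Only the background they are measured against did.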
This is a generalization of the classic machine learning poisoning attack, and it has been formally studied in the antivirus, spam, and fraud detection literature for over fifteen years. The 2016 paper “Adversarial Machine Learning at Scale” by Kurakin, Goodfellow, and Bengio covers the evasion side of the problem (adversarial examples at inference time) and remains a useful starting point. The 2017 paper “Certified Defenses for Data Poisoning Attacks” by Steinhardt, Koh, and Liang is sobering reading on the poisoning side: on a sentiment classification benchmark, an attacker controlling as little as 3 percent of the training data degraded test accuracy by double digits.
You can defend against this. You can hold out training data, use robust losses, randomize feature subsets, and run shadow models. Each of these defenses has a real engineering cost and none of them is a complete fix. And poisoning is simply not a problem that a deterministic rules engine has, because a deterministic rules engine does not learn from the attacker.
A rule that says “block any request to /admin from a residential IP outside the US in working hours” cannot be poisoned. It can be wrong, and you might want to relax it, but it does not silently degrade based on what the attacker is doing.
Reason Three: Latency Is Not Free
Bot detection happens on the request path. That is a hard constraint that most ML pitches gloss over.
The budget for a bot decision on a typical web request is somewhere in the 5 to 50 millisecond range, including network round trip to whatever scoring service holds the model. If you are running a web-scale site with strict Core Web Vitals goals and you care about your Largest Contentful Paint, the budget at the front door is closer to 5 milliseconds.
A modest gradient boosted tree ensemble, such as an XGBoost model with a few hundred features, can hit that budget on commodity hardware. But once you graduate to the kind of deep model that the marketing materials are implying, the latency starts to be visible in user-facing metrics. A 100 millisecond p99 on bot scoring is fine for a checkout flow that already takes half a second. It is unacceptable for an API that is supposed to return in 20 milliseconds.
The realistic options are:
- Keep the model small and lose most of the supposed benefit of deep learning.
- Make the call asynchronous and accept that you are no longer blocking attackers on the request path.
- Put a fast rules engine in front of the model and only invoke the model on a subset of traffic.
The third option is what almost every production “ML bot detection” system actually does, even when the marketing slides do not say so. The rules do most of the work. The model is the expensive second pass that runs on the long tail. At which point the question becomes, why is the expensive second pass also producing most of the false positives that the support team has to chase?
There is a related and underappreciated problem with feature extraction. The features a bot model wants are things like “rolling 24-hour session count for this IP,” “TLS fingerprint diversity for this ASN in the last 5 minutes,” and “JavaScript challenge solve time distribution for this device cohort.” These are not free to compute. They live in a stream processing system, they require state, and they require their own SLO. When the feature pipeline goes down, the model silently scores requests with stale or zeroed features, which usually means it stops detecting the very thing it was hired to detect.
A rules engine reads the request, evaluates a tree of conditions, and returns. There is no feature pipeline to babysit. There is no model server to keep warm. There is no GPU bill at the end of the month.
Reason Four: The Explainability Tax
Ask any abuse engineer who has run an ML-driven bot detection system in production what their least favorite ticket looks like. The answer is some variant of: “Important customer X says we are blocking their legitimate users, can you tell us why?”
A rules engine can answer that question in 30 seconds. The customer hit rule 42, which fires when X and Y and Z, and here is the rule definition, and here is the change to relax it. A rules engine is grep-able. A rules engine is diff-able. A rules engine has a git history.
A model can answer that question with SHAP values, partial dependence plots, and a careful interpretation by someone with statistics training. In practice the answer is usually a hand-wave and a manual allowlist, because the actual answer involves the interaction of seventy features in a way that does not translate into a sentence the customer’s CISO will accept. The allowlist grows, the operator confidence in the model erodes, and the team starts shadow-running rules anyway.
This becomes structural in three places.
Compliance review. Under the EU AI Act, the SOC 2 trust services criteria, and increasingly under sector-specific rules in finance and healthcare, you need to be able to articulate why an automated decision was made. “The model said so” is not, generally, a defense. The legal exposure on a wrongful-block decision against a protected class is real, even when the underlying decision is being made by a generic anti-fraud system. A deterministic rules engine produces audit trails. A model produces a vector of feature attributions whose meaning is a research project.
Customer support. Every false positive in production is a support ticket. Every support ticket has a cost and a goodwill hit. With a rules engine, a tier-2 agent can answer the ticket. With an ML model, it takes a data scientist or a dedicated tool to even surface what happened. Multiply by the number of tickets and the people you hired to babysit the explainability gap become the second-largest budget line item on your abuse team.
Internal trust. When the security team does not trust the bot detection system, they shadow-run their own rules in front of it. We have seen this happen at every company that has adopted a vendor ML bot detection platform after running rules for years. The rules never go away. They just get hidden in a different layer where they do not show up on the vendor invoice.
The Compounding Feedback Loop
Here is the failure mode that ties the four reasons together. Ship an ML bot detection system. It works well at first because your training distribution roughly matches your production distribution. Attackers adapt. Your false negative rate creeps up. Customers complain about abuse. You retrain, but your labels are now polluted by the attackers you missed and the legitimate users you blocked. Your model becomes more aggressive on a slightly wrong distribution. False positives go up. Now you are blocking legitimate users at the same time as missing the actual attackers, and the model is more confident than ever that it is doing the right thing.
The pattern holds in fraud detection, spam filtering, antivirus, and most adjacent fields. The mature defenders in those spaces have all converged on a hybrid: a rules and signatures engine for the known threats, a small ML layer for the edge of the distribution, and a heavy investment in human review for the gray zone in the middle. Bot detection should be no different.
The mistake is putting the ML layer in the critical path of every blocking decision and treating its score as authoritative.
Where ML Actually Earns Its Keep
This piece is not a blanket dismissal of machine learning. It is a dismissal of one specific deployment pattern, which is using a learned model as the primary scoring function for live, blocking, request-time bot decisions. There are other patterns where ML is straightforwardly the right tool.
Anomaly detection on aggregate traffic. Watching the shape of traffic at the level of “requests per ASN per minute” or “bounce rate per landing page” is a great place for unsupervised models. You are detecting that something is weird, not making a per-request decision. False positives produce an alert, not a block. The latency budget is minutes, not milliseconds.
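As an illustration of how little machinery aggregate alerting needs, here is a minimal sketch using a modified z-score (median and median absolute deviation) over requests per minute for a single ASN. It is a simple statistical stand-in for whatever unsupervised model you prefer. The sample counts are invented, and the 0.6745 scaling constant and 3.5 cutoff are the conventional defaults for this statistic, not tuned values.

```python
from statistics import median

def robust_z(history, current):
    """Modified z-score of `current` against `history`, using median and MAD."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # Flat history: any deviation at all is infinitely surprising.
        return 0.0 if current == med else float("inf")
    return 0.6745 * (current - med) / mad

def should_alert(history, current, cutoff=3.5):
    """True when the current minute is far above the recent baseline."""
    return robust_z(history, current) > cutoff

# Hypothetical requests-per-minute history for one ASN.
recent_minutes = [100, 104, 98, 101, 97, 103, 99, 102]

print(should_alert(recent_minutes, 103))  # ordinary minute
print(should_alert(recent_minutes, 400))  # sudden surge
```

A true result here pages a human or opens a ticket. It does not return a 403 to anyone.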
Clustering for forensic analysis. After an attack, throwing the labeled bot sessions into a clustering algorithm to find the natural sub-populations of attackers is a useful exercise. It is also a great way to surface the next set of rules to write.
Ranking review queues. When you have flagged 10,000 sessions and a human can only review 100, a model that scores them by likelihood-of-real-attack is doing useful work. A wrong rank is a triaging mistake, not a block.
Bot category classification. Once you have already decided that a session is non-human, classifying it as “search engine,” “monitoring tool,” “AI scraper,” “credential stuffer,” or “spam” is a good use of a small classifier. The decision to block has already been made on deterministic grounds. The model is just labeling.
The pattern in all of these is the same: ML is a useful tool when its decisions are advisory, when the latency budget is generous, and when its mistakes are cheap. Ship it there. Do not ship it as the function deciding whether the next request to your login endpoint gets a 200 or a 403.
What We Use Instead
Here is the actual stack, in the order requests are evaluated. Every layer is deterministic. Every layer is inspectable. Every layer can be audited and rolled back.
Layer One: Rules
A rules engine is the first thing every request hits. Rules cover the things you already know are bad: known bot user agents, abusive ASNs, request rates above human plausibility, missing or malformed standard headers, requests to honeypot URLs. Most production attacks lose the bulk of their volume right here.
A rule looks like this:
```yaml
- id: ua_known_scraper
  match:
    user_agent_regex: "(?i)(scrapy|colly|httpx|wget|libwww-perl)"
  action: block
  reason: "Self-identified scraper user agent"
- id: rate_burst
  match:
    requests_per_minute: ">120"
    same_ip: true
  action: challenge
  reason: "Request burst above human ceiling"
- id: missing_accept_lang
  match:
    header_missing: "Accept-Language"
  action: score_plus_3
  reason: "Real browsers always send Accept-Language"
```

Rules are not glamorous. They are also not where most of the false positives come from, because the matching is exact and the operator gets to read the rule that fired.
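To make the "operator gets to read the rule that fired" point concrete, here is a minimal sketch of an evaluator for the three rules above. It is not FCaptcha's implementation: the `Request` shape, the `requests_last_minute` counter (assumed to come from a per-IP rate tracker), and the tuple layout are hypothetical, and the rules are written as Python data to avoid a YAML dependency.

```python
import re
from dataclasses import dataclass

@dataclass
class Request:
    user_agent: str
    headers: dict
    requests_last_minute: int  # assumed: maintained elsewhere, per source IP

SCRAPER_UA = re.compile(r"(?i)(scrapy|colly|httpx|wget|libwww-perl)")

def has_header(request, name):
    """Case-insensitive header presence check."""
    return name.lower() in {key.lower() for key in request.headers}

# Each rule is (id, predicate, action, reason), mirroring the rule fields above.
RULES = [
    ("ua_known_scraper",
     lambda r: bool(SCRAPER_UA.search(r.user_agent)),
     "block", "Self-identified scraper user agent"),
    ("rate_burst",
     lambda r: r.requests_last_minute > 120,
     "challenge", "Request burst above human ceiling"),
    ("missing_accept_lang",
     lambda r: not has_header(r, "Accept-Language"),
     "score_plus_3", "Real browsers always send Accept-Language"),
]

def evaluate(request):
    """Return every rule that fired, in order, with its action and reason."""
    return [
        {"id": rule_id, "action": action, "reason": reason}
        for rule_id, predicate, action, reason in RULES
        if predicate(request)
    ]

scraper = Request("Scrapy/2.11 (+https://scrapy.org)", {"Accept": "*/*"}, 5)
for hit in evaluate(scraper):
    print(hit["id"], "->", hit["action"], "|", hit["reason"])
```

The return value is the audit trail: a support agent can paste it into a ticket without translation.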
Layer Two: Honeypots
Honeypot fields, honeypot URLs, and honeypot HTTP headers catch the population of attackers that bypass the rules by setting a believable user agent and rotating IPs. A naive script fills every form field. A DOM-aware bot fills every visible field. A vision-based AI agent fills the fields it sees in the rendered pixels. With three flavors of honeypot you have a way to differentiate all three populations.
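A minimal sketch of the server-side check for the three honeypot flavors follows. Every trap name here is hypothetical, and the sketch only records which traps were tripped. How each trap is placed so that it separates the three bot populations is a design question this code does not answer.

```python
# Hypothetical trap names. A real deployment would generate and rotate these.
HONEYPOT_FIELDS = {"website_url_confirm"}   # form field a human never fills in
HONEYPOT_PATHS = {"/internal/export-all"}   # URL that is never linked for humans
HONEYPOT_HEADERS = {"x-debug-token"}        # header no legitimate client sends

def tripped_traps(path, headers, form):
    """Return the honeypot flavors this request touched, as a list of names."""
    traps = []
    if any(form.get(name, "").strip() for name in HONEYPOT_FIELDS):
        traps.append("form")
    if path in HONEYPOT_PATHS:
        traps.append("url")
    if any(name.lower() in HONEYPOT_HEADERS for name in headers):
        traps.append("header")
    return traps

# A submission that filled in the decoy field.
print(tripped_traps(
    path="/signup",
    headers={"User-Agent": "Mozilla/5.0"},
    form={"email": "a@example.com", "website_url_confirm": "http://spam.example"},
))
```

Because the check is a set lookup, it adds no meaningful latency and the reason for a block is the trap name itself.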
We covered the placement details in Honeypot Strategies for AI Bots: Beyond CSS Hiding. The short version is that legacy CSS hiding is not enough in 2026, and the modern stack uses semantic naming, DOM ordering, and shadow DOM to defeat each population separately.
Layer Three: Browser and TLS Fingerprinting
Real browsers have a consistent JA4, a consistent set of HTTP/2 settings, a consistent user agent, and a consistent Accept-Language ordering. Headless automation gets one of these wrong almost every time. The bypass tools have to fix every one of them, individually, every time the underlying engine ships a new version.
JA4 is the cheapest and most stable of the fingerprints. We covered it in JA4 Fingerprinting AI Scrapers: A Practical Guide. A request whose JA4 says “Go HTTP client” but whose user agent says “Chrome 122” is, with very high precision, a bot. Inspectable, deterministic, and free at scoring time.
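The mismatch check itself is a table lookup, roughly as sketched below. The JA4 keys are placeholders, not real fingerprints, and the user agent parsing is deliberately crude. A real deployment would load observed fingerprints per client family and use a proper user agent parser.

```python
import re

# Hypothetical lookup table: JA4 string -> client family.
# These keys are placeholders for illustration, not real JA4 values.
KNOWN_JA4 = {
    "ja4-placeholder-chrome": "chrome",
    "ja4-placeholder-go-http": "go-http-client",
}

def ua_family(user_agent):
    """Very rough client family guess from the User-Agent header."""
    if user_agent.startswith("Go-http-client"):
        return "go-http-client"
    if re.search(r"Chrome/\d+", user_agent):
        return "chrome"
    return "unknown"

def fingerprint_mismatch(ja4, user_agent):
    """True when the TLS fingerprint and the claimed user agent disagree."""
    tls_family = KNOWN_JA4.get(ja4)
    claimed = ua_family(user_agent)
    if tls_family is None or claimed == "unknown":
        return False  # no evidence either way, so leave it to the other layers
    return tls_family != claimed

CHROME_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36")

print(fingerprint_mismatch("ja4-placeholder-go-http", CHROME_UA))
print(fingerprint_mismatch("ja4-placeholder-chrome", CHROME_UA))
```

Returning `False` on unknown fingerprints is a design choice in this sketch: an unrecognized JA4 is treated as missing evidence, not as proof of automation.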
Layer Four: Proof-of-Work
For the population of bots that has cleaned up its fingerprint, a small SHA-256 proof-of-work imposes a cost asymmetry. A real browser solves the challenge in 50 to 500 milliseconds. An attacker running 10,000 parallel sessions pays the same cost 10,000 times. The economics flip without taxing the legitimate user.
Proof-of-work is deterministic, locally verifiable, and has zero false positive rate as long as the user has JavaScript and a non-trivial CPU. The places where it falls down (very old phones, JavaScript-disabled browsers, RSS readers) are the same places that any meaningful behavioral check falls down, so you have to accept those tradeoffs anywhere in the modern stack.
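For readers who have not seen one, a Hashcash-style SHA-256 puzzle fits in a few lines. This is a minimal sketch under assumed conventions, not FCaptcha's wire format: the `challenge:nonce` encoding and the 12-bit difficulty are chosen only so that the example runs quickly.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash, regardless of how long the client searched."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits

def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force search, about 2**difficulty_bits hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce

challenge = "session-abc"  # assumed: a server-issued, single-use random string
nonce = solve(challenge, 12)
print(nonce, verify(challenge, nonce, 12))
```

Each extra difficulty bit doubles the expected work for the solver while verification stays at a single hash, which is where the cost asymmetry comes from.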
We wrote the long version in Proof-of-Work CAPTCHAs with Hashcash.
Layer Five: Behavioral Biometrics
This is the only layer that uses statistical scoring, and even here we keep it deterministic. We measure mouse trajectory smoothness, keystroke inter-key cadence, scroll velocity, focus patterns, and a small number of pointer-event-level signals. The scoring is a weighted sum of explicit features against thresholds that an operator can tune. There is no learned model in the scoring function.
The reason this is deterministic and not ML is that we want the operator to be able to read the score breakdown and explain it to the customer. If a session scored 73 on behavioral, we can tell you that 30 came from cadence variance, 20 from mouse curvature, 15 from missing scroll signals, and 8 from focus patterns. There is no SHAP plot. There is a number with a breakdown.
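The breakdown in that example can be reproduced with a plain weighted sum. In this minimal sketch the feature names, maximum points, and normalized signal values are all hypothetical, picked so that the total matches the 73 above.

```python
# Hypothetical maximum points per feature; an operator would tune these.
MAX_POINTS = {
    "cadence_variance": 30,
    "mouse_curvature": 25,
    "missing_scroll": 15,
    "focus_patterns": 10,
}

def behavioral_score(signals):
    """Return (total, breakdown) for signals normalized to the range 0..1.

    Each signal says how bot-like the session looked on one feature.
    Features absent from `signals` contribute zero points.
    """
    breakdown = {
        name: round(points * signals.get(name, 0.0))
        for name, points in MAX_POINTS.items()
    }
    return sum(breakdown.values()), breakdown

# Hypothetical session: robotic typing cadence, no scrolling at all,
# fairly straight mouse paths, slightly odd focus behavior.
session = {
    "cadence_variance": 1.0,
    "mouse_curvature": 0.8,
    "missing_scroll": 1.0,
    "focus_patterns": 0.8,
}

total, breakdown = behavioral_score(session)
print(total, breakdown)
```

Changing a weight is a one-line diff that shows up in code review, which is the property the scoring function is designed around.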
Layer Six: Soft Challenge for the Gray Zone
Sessions that pass the deterministic checks but score in the suspicious range get a soft challenge. The soft challenge is not a CAPTCHA, because CAPTCHAs are dead. We covered why in Why CAPTCHAs Are Dead (And What Replaces Them in 2026). The soft challenge is a longer proof-of-work, a re-fingerprint, or a step-up that an attacker would need to pay for many times over.
This layered design is the entire stack. It is what FCaptcha implements as a library. It is what we recommend operators ship even if they choose to build their own.
The Rules vs ML False Dichotomy
A reasonable objection at this point is that the comparison is unfair. Real ML systems do not run as a single model on the request path with no rules around them. They run as part of a layered defense, with rules in front and humans behind. So why are we framing it as rules versus ML?
Because the marketing does. The vendor pitches do. The buyers do. The internal architecture diagrams put the ML model at the center, with the rules as “preprocessing” and the human review as “feedback loop.” The ML model gets the budget and the cred. The rules are treated as legacy that the platform will gradually replace.
In practice the rules never go away. They are the layer doing the heavy lifting on the bulk of the traffic, and the ML layer is doing something closer to a quality-of-service ranking on the residual. The honest framing is that the rules engine is the system, and the ML layer is one optional enhancement that may or may not justify its cost.
If you treat the architecture this way explicitly, you make better decisions. You invest in your rules tooling, your fingerprint pipeline, your honeypot rotation, and your behavioral signal collection. You add ML where it earns its keep. You do not let “we are ML-first” become a structural excuse to underinvest in the parts of the stack that are actually catching the attackers.
What This Looks Like in Real Numbers
For a representative public-facing application, here is the rough breakdown of attack volume that a layered, rules-first stack catches. Numbers are from anonymized FCaptcha deployments across a few hundred sites, normalized to round percentages.
| Layer | Share of attack volume blocked |
|---|---|
| Rules (UA, ASN, rate, header sanity) | 71 percent |
| Honeypots (form, URL, header) | 12 percent |
| Browser and TLS fingerprint mismatch | 9 percent |
| Proof-of-work failure | 5 percent |
| Behavioral score above threshold | 2 percent |
| Soft challenge step-up | 1 percent |
If you only had the rules layer, you would block 71 percent of the attack volume that the full stack catches. Each subsequent layer makes a smaller marginal contribution, but the layers compound, and the marginal cost of adding a layer is small if it is deterministic. The marginal cost is large if it is a learned model with a feature pipeline, a retraining cadence, and a support burden.
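Read cumulatively, the table looks like this. The short sketch below only re-adds the percentages from the table, which are shares of the volume the stack catches, not of all attack traffic.

```python
from itertools import accumulate

# Shares of blocked attack volume per layer, copied from the table above.
layers = [
    ("Rules", 71),
    ("Honeypots", 12),
    ("Fingerprint mismatch", 9),
    ("Proof-of-work", 5),
    ("Behavioral score", 2),
    ("Soft challenge", 1),
]

cumulative = list(accumulate(share for _, share in layers))

for (name, share), running in zip(layers, cumulative):
    print(f"{name:<22} +{share:>2}%  cumulative {running:>3}%")
```

The first three layers account for 92 percent of what the stack catches, before any proof-of-work or behavioral scoring runs.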
A Note on the Politics
A subtext to this debate is that “we use ML” has become, in 2026, a kind of professional class signal in security. Saying you use rules sounds like saying you use COBOL. Saying you use ML sounds like you have a moat. The reality is the opposite. The vendors with the most polished ML pitches are usually the ones whose rules engine is the worst, because they let it atrophy while they fundraised on the model. The vendors and the open source projects with the best detection rates almost always have a deep, well-maintained, regularly updated rules surface that they treat as a first-class engineering artifact.
This is not new. It is the same pattern as antivirus in the early 2010s, where every vendor pivoted to “next generation AI-based detection” and the actual detections kept being made by the YARA rules underneath. It is the same pattern as spam filtering, where the deep models on top of SpamAssassin were overshadowed by the heuristic rules underneath. The rules are doing the work. The ML layer is doing the talking.
There is no shame in shipping rules. There is no shame in writing them well, testing them, monitoring their false positive rate, and rolling them out behind feature flags. The job is to stop the bots, not to win the architecture beauty contest.
When You Are the Wrong Audience for This Argument
There are real cases where an ML-first bot detection approach is the right call. We want to name them honestly so this piece does not read as universal denial.
If you are a hyperscale platform with a dedicated team of ML engineers, a labeled training set in the tens of millions of sessions, an internal retraining cadence measured in hours, and the political weight to roll back a bad model deployment within minutes of a regression, then an ML-driven detection system is probably the right tool for you. Google, Cloudflare, and a handful of others fit this description. They are not the median customer.
If you are running a niche detection problem where the attacker base is small and not adapting (some sectors of insider threat detection, some cases of impersonation in regulated industries) then ML can pay off because the adversarial drift assumption that the rest of this piece is built on is weaker.
If you are running a fraud detection system where the decision is a recommendation to a human reviewer and not a hard block, then the explainability tax is reduced and the latency budget is generous. ML earns its keep.
If you are not in one of those buckets, you are the audience for this argument. You are running a marketing site, a SaaS app, an e-commerce store, or a community platform. Your attacker base is adaptive. Your latency budget is tight. Your team does not have a dedicated ML engineer who is going to retrain every Tuesday. You should ship rules, honeypots, fingerprints, and proof-of-work. You should add behavioral biometrics on top. You should use ML, if at all, for offline analysis. You will catch more attacks, you will spend less money, and you will not wake up to a customer escalation because a model decided your enterprise customer’s procurement team looked like a bot.
How to Audit Your Existing Vendor
If you already pay an ML bot detection vendor, the following questions will tell you how much of your protection is actually rules and how much is the model you are paying for.
- Ask the vendor for the percentage of blocks attributed to deterministic rules vs the ML score. If they will not tell you, the answer is “almost all of them are rules.”
- Ask for the false positive rate by customer cohort. If they only give you an aggregate number, the model is hiding cohort-specific failures behind a global average.
- Ask how often the model is retrained and what the rollback story is. If the answer involves a manual deploy and a several-hour rollback window, the model is not adapting in any meaningful sense.
- Ask for an example of a recent attack that was caught by the ML layer that would not have been caught by a rules-only baseline. If they cannot produce one, you are paying a premium for a marketing position.
- Ask whether the bot scoring service is on the request path or asynchronous. If asynchronous, you are not blocking attackers, you are tagging them after the fact.
A vendor who answers these honestly is doing the right work. A vendor who answers evasively is selling you a slide deck.
Closing
The argument here is not that machine learning is bad. It is that bot detection is one of the worst-shaped problems for the kind of supervised learning that vendors are selling. The distribution is non-stationary, the labels are poisonable, the latency budget is tight, and the explainability requirements are high. Every one of those properties pulls against a learned model and toward a deterministic, layered, inspectable stack.
We ship rules, honeypots, fingerprints, proof-of-work, and behavioral biometrics. We do it because they work, not because we are nostalgic for an older era of security. If a problem comes along that genuinely benefits from ML on the request path, we will use it. We are still waiting for that problem in the bot detection space, and the longer we wait the more confident we get that the rules-first approach is not a stopgap, it is the right answer.
If you have a counter-argument with real production numbers, we want to hear it. Send it to us, post it on Hacker News, write your own version. The space is healthier when the debate is in the open and not buried under another vendor pitch deck.
Related Reading
- Why CAPTCHAs Are Dead (And What Replaces Them in 2026)
- Honeypot Strategies for AI Bots: Beyond CSS Hiding
- JA4 Fingerprinting AI Scrapers: A Practical Guide
- Browser Fingerprinting in 2026: What Still Works
- Proof-of-Work CAPTCHAs with Hashcash
- Headless Browser Detection: Playwright, Puppeteer, Selenium
- Reverse Engineering Credential Stuffing Attacks
- How We Detect AI-Generated Form Submissions