Cyber Resilience

Where LLM-driven code scanning earns its keep — and where it doesn’t

After running an LLM-driven code scanner across about a dozen repositories — a mix of API gateways, microservices, and shared libraries — three things have become clear. The scanner finds real vulnerabilities at scale. The bottleneck has stopped being discovery and started being remediation. And there’s a specific class of bug where the scanner still loses to a human reviewer: anything whose attack path crosses repository boundaries.

The setup

The scanner is an LLM-driven static analysis pipeline that runs as a GitHub Actions workflow, either on a monthly schedule or on demand. It produces JSON findings plus a GitHub Issue with each run; clean scans auto-close the issue. Output is reviewed by a security engineer before anything reaches the engineering tracking system. We’ve been operating it long enough that the operational pattern has stabilised, which is the point at which it’s worth writing down what we’ve learned.

Two operational rules shape the rest of the analysis. First, every finding the scanner returns carries a confidence score between 0.3 and 1.0, and only findings at confidence ≥ 0.9 are treated as actionable on a single-run basis. Lower-confidence findings still surface in the output, but they sit in an exploratory tier that requires human triage or multi-run corroboration before reaching the engineering backlog. Second, the analysis below focuses on critical and high severity findings — the tier where a single bug can meaningfully shift the security posture — with a small number of medium-severity findings included when they are the same canonical issue as a critical or high finding in another run (severity calibration drifts run to run; we include the medium-rated instances when other runs rate the same code higher). Lows and isolated mediums are excluded throughout, with a discussion of the broader scanner output below.

A representative scan illustrates the shape of what comes out. On parts of our code base in roughly that range of size, each scanner run returns 38–42 high-confidence high-or-critical findings, of which 10–15 are critical and the rest are high. Two-thirds of findings cluster into two domains: data-handling (logging hygiene, credential exposure, encryption misuse) and authentication / access control (broken session handling, IDOR patterns, weak token validation). The remaining third is split between configuration issues (proxy, database, ingress), input validation, and a handful of dependency-CVE findings. Business-logic flaws are essentially absent at this tier — one finding per run at most.

That is the high-severity subset. Beyond it, the scanner consistently surfaces another 100-or-so lower-severity findings per run — logging-hygiene weaknesses, defense-in-depth gaps, timing-side-channel patterns, configuration hardening opportunities — the kind of long tail that’s operationally useful but that would be prohibitively expensive to find by hand. A human reviewer doing this work at the same coverage would burn weeks per repository on findings that individually carry low blast radius but collectively raise the bar against a determined attacker. The scanner produces this long tail as a side effect of doing the critical work; the marginal cost is essentially zero. This article focuses on the critical-and-high tier because that’s where the methodology comparisons are sharpest, but it’s worth remembering that the scanner’s comprehensive coverage of the lower-severity tail is part of the cost-benefit story for adopting it.

Manual review of the highest-severity findings runs at better than 90% true positive rate. The scanner overstates severity on about a quarter of findings — the bug is real, but the blast radius is smaller than the scanner’s rating suggests. That number drives the operational reality: the LLM output goes to a security reviewer first, not directly into the engineering backlog.

Architecture

The pipeline has three phases. The design separates deterministic setup work, lightweight reconnaissance, and the deep vulnerability analysis that actually consumes the most capable model.

Three-phase architecture of the LLM-driven scanner: deterministic preprocessing in Node.js, reconnaissance with Claude Sonnet across concurrent context scouts, and vulnerability analysis with Claude Opus running segment scanners in parallel.
Three-phase scanner architecture: deterministic preprocessing, reconnaissance, and vulnerability analysis. Each phase right-sizes capability to task.

Phase 1 — Deterministic preprocessing. Pure Node.js, zero LLM calls. Reads the per-repository config, walks the scoped directories, applies glob-based exclusions (tests, docs, build artefacts), and groups files into coarse segments by directory hierarchy. Reproducible: identical inputs always produce identical segments. This used to be an LLM reasoning step; making it deterministic took non-determinism out of the pipeline and cut the per-run cost meaningfully.

Phase 2 — Reconnaissance. Up to five context scouts (running on Claude Sonnet) run in parallel, each reading ~15 wiring-and-registration files per subsystem and producing structured JSON: entry points, trust boundaries, attack surface categories, shared state, cross-cutting concerns. All scout output is validated against a closed schema before downstream use — required fields, length caps, depth caps, attack surface category mapped to a fixed enumeration. The scout pass lets the deep scanners start with a map of what to look for and where, instead of discovering the architecture during analysis.

Phase 3 — Vulnerability analysis. Segment scanners (Claude Opus) run in parallel, each one analysing files grouped by attack surface category rather than by directory. So auth-flow files are analysed together regardless of where they live in the repo; data-persistence files likewise. Each scanner reads its task context (file list plus scout output) and analyses in priority order: auth and authz first, then request handlers, then sinks (raw SQL, shell, file I/O, template rendering), then external integrations, then secrets handling. Scanners follow imports and call chains beyond their immediate file list when they need to verify a path.

One operational note: the pipeline itself is hardened against the codebases it scans. All scout and scanner prompts include explicit instructions to treat code as untrusted data, never as instructions. Scope path validation uses path-relativity checks, not prefix matching, to prevent sibling-prefix escapes. The CI job ignores the target repo’s agent settings to prevent the scanned codebase from influencing scanner behaviour. These defences sound paranoid until you remember the scanner reads attacker-controlled input by definition.

Four quality gates

Every finding has to pass four gates before it’s included in the output. These are what keep the signal-to-noise ratio livable:

  1. The four-question bar. A finding must answer: who is the attacker, what input do they control, how do they reach the vulnerable code, and what impact do they gain? If any question is unanswerable, the finding is dropped. This eliminates most of the noise that traditional SAST emits.
  2. Confidence scoring. Each finding is scored 0.3 to 1.0 against calibration anchors. A raw SQL interpolation pattern scores 0.95. A request body passed to a function whose query is not visible scores 0.65. A permissive CORS on a public read-only API scores 0.40. Anything below 0.3 is dropped automatically.
  3. Compensating control verification. Before a finding is reported, the scanner checks for auth middleware on the route, schema validation libraries (Joi, Zod, Pydantic) constraining the input, framework auto-escaping (React JSX, Django template), ORM parameterisation handling the query, or an allowlist constraining the value. If the control would block exploitation, the finding is dropped.
  4. Hard exclusions. Categories that produce more noise than signal are pre-filtered: denial of service, rate limiting, open redirects, memory safety in safe languages, anything in test or documentation files.

The combination is what lets the manual reviewer keep up. Without these gates, the LLM output would be unusable; with them, the reviewer’s job becomes severity calibration and attack-path verification rather than triage from scratch.

What it catches that traditional SAST misses

Static analysis tools find high volumes of low-value findings — the well-known precision problem. The LLM scanner’s value isn’t volume; it’s the kind of bug it can find. Three categories illustrate the gap:

Data-flow vulnerabilities that span configuration and code. A canonical example: JWT algorithm confusion, where code reads the algorithm name from an unverified token header and passes it directly to the decoder. The decoder accepts alg: none, and an attacker bypasses authentication by submitting an unsigned token. Traditional SAST looks for missing signature verification but does not trace the data flow from the header value into the algorithms parameter. The LLM does.

Semantic configuration analysis. An Envoy proxy YAML config that disables external authorisation on an /internal route prefix. The YAML is syntactically valid; the bug requires understanding Envoy’s filter chain semantics. SAST tools either don’t parse Envoy YAML or check for syntax violations that don’t exist. The LLM reasons about what the configuration actually does.

Multi-step reasoning across files and frameworks. A hash oracle where expected hash values are logged at ERROR level when URL verification fails. The vulnerability requires connecting three things: the log statement (intentional), the log destination (accessible), and the use of those hashes elsewhere (password reset tokens). Each step in isolation is innocuous. The LLM connects them. SAST doesn’t.

The scanner is also good at finding systemic patterns — the same class of bug appearing in many places. Twelve credential-logging sites across a single repository, four HTTP clients with TLS verification disabled, three subtly different JWT validation implementations across the same service. That’s the insight engineering teams find most actionable: not the individual finding, but the missing pattern it reveals.

Comparison with manual penetration testing

The same product was independently assessed by an external offensive security firm running a manual penetration test over three weeks. Their scope was wider than the scanner’s: multiple source repositories plus the production web applications that depend on them. The LLM scanner covered only parts of our code base — a subset of the manual scope. To make the comparison apples-to-apples, I’m restricting both sides to that shared subset.

Of the 16 findings in the manual report, 9 fall within the scanner’s scope at a severity worth comparing: 3 high-severity, 5 medium-severity, and 1 low-severity. (A second low-severity manual finding — non-constant-time comparison of security-sensitive values — is excluded from the count; the LLM did find the pattern at half a dozen sites but at low severity and confidence 0.4–0.8, which falls into the scanner’s broader output rather than the actionable tier. Both methods agreed it was low.) The scanner produced 38–42 high-confidence high-or-critical findings per run on the same scope; the meaningful head-to-head is on the high-severity tier.

Overlap: three direct matches, plus one chain

Under the high-confidence, high-severity filter, the LLM scanner and the manual reviewers independently identified the same underlying issue in three cases — and the LLM surfaced the component pieces of a fourth manual finding in different code, where remediating the LLM’s findings would close the manual finding as a side effect.

Manual findingManual severityLLM (≥0.9, critical+high)
Hardcoded credentials and keys in sourceMediumDirect: Critical / High in 3/3 runs (3–6 sites)
Unsalted MD5 password hashingMediumDirect: High in 3/3 runs (3–5 sites)
Outdated software dependenciesMediumDirect: High in 2/3 runs (CVE bundle on a vulnerable proxy version)
Unauthorized cross-tenant redelivery (business-logic chain)HighComponent-level: IDOR-pattern findings at high+critical severity on adjacent endpoints; cross-tenant authorization weaknesses in the API-key layer

Two things stand out about the three direct matches. First, the LLM scanner consistently rated shared findings more severely than the human reviewers did — the scanner’s severity calibration is hotter than a human’s, matching the operational observation that the scanner overstates severity on roughly a quarter of its findings. Second, where the manual report named one instance of a pattern, the LLM scanner found multiple sites of the same pattern in the same code.

The fourth case is the more interesting one and the reason head-to-head counts undersell what the scanner actually delivers. The manual reviewers found an exploitable business logic chain — a low-privileged user could redeliver any encrypted email cross-tenant by calling a message-log endpoint that performed no authorization checks. The LLM did not flag that specific endpoint, but it found the underlying pattern at high severity in two adjacent forms: explicit IDOR findings on another router family (a caller can pass an arbitrary account_id and the endpoint doesn’t verify ownership), and broader API-key authorization weaknesses (wildcard cross-tenant CRUD on the API-key actor, sensitive writes that accept the static shared key in lieu of JWT). An engineering team that responded to the LLM’s findings by adopting a tenant-scoped repository pattern and tightening the API-key authorization model would close the message-log endpoint as a side effect, even though the LLM never named it. The scanner caught the shape of the bug; it just didn’t compose the integrated business-logic exploit.

A similar — weaker — pattern holds for the manual’s second high-severity finding (arbitrary email spoofing through a deliver-new endpoint). The LLM flagged sender-policy wildcards, anti-spoof allow-list weaknesses, and client-controlled sender-policy timestamps in adjacent code, but at confidence 0.6–0.8 and mostly medium severity — below this article’s strict filter. The component-level remediation would help; under multi-run ensemble review these findings would graduate into the actionable tier.

Manual found, LLM missed

Of the 9 in-scope manual findings, the LLM did not match 4 — not even via component-level remediation chains. These are all bugs where the manual reviewers reasoned about business intent across endpoints or about subtle data exposure in otherwise reasonable APIs:

The pattern is consistent: every one of these is a business-intent or cross-endpoint-comparison reasoning task, not a single-call-site pattern match. That’s the methodology gap.

LLM found, manual missed: infrastructure and configuration

Conversely, the LLM scanner’s rock-solid core — the 11 critical findings that appear at confidence ≥ 0.9 in every run — includes several that the manual report does not mention. The notable ones:

None of these appear in the manual report. They are infrastructure and configuration findings — YAML and SQL — that a code-and-running-app penetration test does not always surface, even when the source repository is in scope. Beyond the rock-solid core, the LLM scanner’s broader output includes dozens more high-severity findings the manual report didn’t enumerate: credential-logging sites at scale, TLS-verification-disabled HTTP clients, IDOR patterns in router endpoints, ingress catch-all rules.

The pattern is sharp

CapabilityLLM scannerHuman pen tester
CoverageExhaustive (every file)Focused (high-value targets)
ConsistencySame patterns every runVaries by tester
SpeedHoursWeeks
Infrastructure / configuration breadthExcellentInconsistent
Crypto and authentication misconfigurationExcellentGood
Business-logic flawsLimitedExcellent
Authorization logic across endpointsLimitedExcellent
Session-lifecycle reasoningLimitedExcellent

Tallying the head-to-head: of the 9 in-scope manual findings, the LLM scanner matched 4 at the article’s strict filter — 3 directly and 1 via component-level coverage that would close the manual finding through systematic remediation. A fifth (email spoofing) matched in the scanner’s broader output below the filter. The remaining 4 the scanner missed entirely, including the cross-tenant redelivery’s sibling finding on session invalidation and three smaller bugs where the manual reviewers reasoned about business intent across endpoints. Conversely, the scanner’s 38–42 high-confidence findings per run surfaced multiple critical infrastructure and configuration bugs the manual reviewers did not enumerate. Where they overlapped on direct matches, the scanner rated severity higher than the humans did. And beyond that high-severity comparison, the scanner’s broader output covers a long tail of lower-severity issues at a scale humans simply can’t justify the cost of reviewing.

Comparison with rule-based SAST

We also ran the scanner side-by-side with a commercial rule-based SAST tool on a single representative PHP/Symfony repository — the same codebase scanned end-to-end by both engines, with overlap and class coverage mapped by hand. A few observations from that exercise are more useful than the count comparison most write-ups lead with.

Counting findings is the wrong axis. Two scanners can disagree by an order of magnitude on total volume and still tell a consistent story about the codebase — they’re summarising at different levels of resolution. What actually distinguishes them is the breadth of vulnerability classes each one covers and the actionability of the individual findings it produces. The rule-based engine excelled at well-defined syntactic patterns — TLS verification disabled, raw $_REQUEST concatenation in known sink positions, deserialization of attacker-controlled input. The scanner covered roughly twice as many distinct classes, including categories the rule-based engine has no rules for at all: business-logic authorization, IDOR, session lifecycle, hardcoded credentials, cross-component data flow. Reporting “we found more findings” obscures that asymmetry. Reporting which classes each tool covers does not.

Confidence scoring is the under-discussed feature. Per-finding confidence is the single mechanism that makes large-output LLM scanning operationally usable. The rule-based engine emits findings with no confidence dimension — every result is either present or absent, and the consumer is left to triage the entire list. The scanner attaches a 0-to-1 confidence score per finding, and that one number partitions the output into a high-confidence band that goes straight to a fix queue and a lower-confidence tail that goes to human review. Anyone building a similar tool should ship confidence as a first-class field on every finding, not as an internal heuristic the reviewer can’t see.

Methodology lessons for anyone running this comparison. Two practices made the exercise honest enough to learn from. First, adjust for shared scope before computing overlap rates — runtime-budget exclusions on either side will distort the headline number, and in our case a meaningful share of the rule-based findings sat in directories the scanner had been configured to skip for performance. The exclusion list is itself a security decision and deserves the same periodic review the scanner version gets. Second, walk every apparent “miss” by hand. In our pairing the rule-based scanner flagged a syntactic pattern in voter classes — functions that never returned a negative authorization decision — and the scanner reported different, deeper issues in the same files: IDOR, URI-substring bypass, missing ownership checks. Counted naively as misses, those rows penalise the scanner for producing more useful output. Counted on the right primitive — “same file, deeper issue” — they’re a feature, not a gap. Either tool used alone leaves visible blind spots; class-level comparison tells you where to bring in the other.

Future direction: feeding the engines into each other. The most interesting unexplored move is to use the two systems as inputs to each other rather than independent reports. Feed rule-based findings into the LLM scanner as context — “the pattern engine flagged this voter as syntactically suspect; investigate whether there is a deeper authorization issue at this site” — and the model gets a curated set of priors that lift the precision of its analysis. Run the reverse where it’s tractable: when an LLM finding describes a regex-checkable pattern, materialise it as a rule the pattern engine can re-run on every commit. Both directions narrow the gap each tool leaves on its own without forcing the consumer to choose between paradigms.

How reproducible are these scans?

An obvious question for LLM-driven analysis: how much does the output vary run to run? We re-scanned the same scope three times against the same commit, hours and days apart, with identical configuration. Total findings came back at 190, 202, and 164 — about a 15% spread around the mean. Within that, the critical-and-high band landed at 72, 73, and 56.

Naive matching on the raw record ID overstates the noise, because the scanner phrases the same issue differently each run. The JWT algorithm-confusion finding appears across all three runs as “alg taken from unverified header,” “attacker-controlled `alg` from token header,” and “attacker-controlled alg passed to decoder” — same code, same line, three wordings. To match findings across runs we cluster on three signals together: the file path, the security domain, and overlapping line ranges. Two findings in different runs are the same canonical issue if their reported line ranges intersect, even if the titles and descriptions don’t match verbatim. This catches both the reworded-title case and the case where one run splits a finding into two reports while another reports it as one.

Take the subset of findings that hit the highest quality bar — rated critical severity AND with confidence above 0.9 — in any of the three runs. There are 15 such canonical issues across the dataset. Eleven of them — 73% — are found in all three runs. The remaining four are found in two of the three runs. None are invisible to more than one run. If a finding clears the highest bar once, it almost always surfaces again on the next two scans, even if the second scan describes it slightly differently or assigns it a slightly different severity.

The named issues in that core are textbook bugs: unsalted MD5 password hashing, an unauthenticated passcode endpoint, Envoy external authorisation disabled on internal routes (across three separate configurations), the Envoy admin API bound to 0.0.0.0:8001 with no auth, a MySQL grant of full privileges with no password, and JWT signing keys plus AES encryption keys committed to a config file. Each is exactly the kind of finding a scanner exists to surface.

The detail also exposes what severity drift looks like in practice. The MD5 hashing finding is rated high in run 3, critical in run 4, high again in run 5 — confidence ~0.95 every time. The Envoy ext_authz config is critical / critical / high. The scanner is rating the same finding differently across runs at the high/critical boundary. The bug is real and the confidence is high; it’s the severity label that wobbles. Reviewers calibrate this on the way to the engineering backlog.

Reproducibility weakens as the bar drops. Across the full 322 canonical issues at any severity, only 22% appear in all three runs and 58% appear in only one. The dominant domains (data-handling, auth and access control) reproduce closely across runs in both count and content; business-logic findings — already the smallest slice — are the least stable. The scanner’s own confidence score is the cleanest single predictor of recurrence: at cluster-max confidence ≥ 0.9, 49% of issues appear in all three runs and only 18% are one-shot; at confidence below 0.5, no issues appear in all three runs and 95% are one-shot. Confidence is doing what it should.

For practical purposes this means: trust the high-confidence core on a single run; treat the medium-confidence band as candidates for ensemble review (run more than once, union the output); and treat low-confidence findings as starting points, not conclusions. Reproducibility is strong at the top and gets noisier as you walk down the severity ladder — which makes the next observation more, not less, significant.

The bottleneck has shifted to remediation

The interesting observation, after a dozen scans, isn’t about the scanner. It’s about what the scanner exposes next.

Each scan reproducibly finds 38–42 high-confidence critical-and-high findings per repository, ninety-plus percent of which hold up under manual review — plus a longer tail of lower-severity findings that engineering teams pull from when the urgent work is done. Engineering teams can fix individual findings quickly — one-line patches, a parameterised query, a config flip. But many of the highest-value findings aren’t one-liners. They’re symptoms of a missing pattern, and the right fix isn’t to patch each site; it’s to introduce the pattern and migrate every site to it.

The canonical example is broken access control with missing tenant scoping — the OWASP A01 category and the dominant source of IDOR vulnerabilities in any multi-tenant service. A scanner finds ten endpoints returning resources without verifying the actor owns them. The naive fix is ten checks, one per endpoint:

def get_report(report_id, actor):
    report = repo.get(report_id)
    if report.account_id != actor.account_id:
        raise HTTPException(403)
    return report

That works, until the next scan finds five more endpoints that grew the same shape in the meantime. The architectural fix is a tenant-scoped repository pattern that performs the ownership check automatically, and migrating every data access site to use it. That’s a multi-sprint, multi-team effort: design the pattern, refactor the call sites, update tests, stage the rollout, and prevent regression with a CI check. It easily takes two to three months across owning teams, even with engineering capacity allocated.

The same shape applies to the other recurring categories the scanner exposes:

The pattern is consistent: the scanner finds in hours what takes engineering quarters to remediate properly. As reproducibility of scanning improves, the discovery-side delay collapses to zero, and the constraint becomes engineering capacity to adopt the right patterns. For security leadership that’s a meaningful resource-planning shift — less tool spend, more engineering coordination.

Where it still loses: chains that cross repository boundaries

The scanner reads one repository at a time. It traces imports and call chains within that repository, follows attack-surface-to-sink paths, and recognises framework patterns. But it does not have a global service map. Vulnerabilities whose attack path crosses repository boundaries are systematically under-found.

Examples of such vulnerabilities could be:

These are the bugs human pen testers still find that the scanner doesn’t. The fix is architectural — a service-map-aware scanning layer that traces calls across repos — and it’s a non-trivial extension. Until then, the partition between “LLM scanner finds” and “human pen tester finds” is genuine, not ideological.

What this means for security leadership

Putting the three observations together, the practical implications for a CISO running this kind of pipeline are narrower than the marketing would have you believe but more consequential than the hype-fatigue reaction suggests.

Five practical takeaways

  • LLM code scanning is real, not a demo. The findings are mostly true positives at high confidence, the categories it catches are categories that SAST misses, and the precision is high enough that a single human reviewer can keep up with the output.
  • It does not replace manual pentesting. The scanner matched 4 of 9 in-scope manual findings on the same code base (3 directly, 1 via a component-level IDOR-pattern chain that would close the manual finding through systematic remediation). The 4 it missed are business-intent and cross-endpoint-comparison bugs — session-invalidation on user delete, copy-paste scope mismatches, passcode-invalidation gaps, sensitive data leaking through reasonable-looking introspection APIs. Meanwhile the scanner surfaced infrastructure and configuration criticals the manual report didn’t enumerate. Plan to use both.
  • Severity calibration is the human’s job. The scanner’s severity ratings run hotter than a human’s — about a quarter of findings are real but smaller-blast-radius than the scanner says. Plan for security review before findings hit the engineering backlog.
  • Plan for the remediation crunch. Discovery is no longer the constraint. Engineering capacity to adopt secure-by-default patterns and migrate every call site is. Budget for that, not just for the tool.
  • Direct manual review at application-layer logic. Authorization rules across endpoints, session and credential lifecycles, cross-tenant data flows, and multi-repository attack chains are where the scanner has consistent blind spots and where humans still earn their fees. Don’t spend pen-test hours on coverage the scanner already provides.

A methodology-stumble worth naming

Honest disclosure: in parallel with this work I’ve been running a separate LLM-author pipeline that builds cross-walks between security frameworks (the chips you see on this site). A careful reading uncovered that early in that project two different conventions for what “A→B = partial” meant had crept into different author scripts — some treated it as “A partially covers B” (the intuitive math reading), others as “B partially covers A.” The LLM was internally consistent within each author run, so every individual pair was self-consistent; but reading across pairs and across articles gave inconsistent results until I audited every framework pair, picked one convention site-wide, and migrated the affected stored rows. It’s a small reminder that ambient LLM tooling needs the same kind of careful prompt-and-schema engineering that other software does — the failure mode is silent inconsistency rather than a visible error. The methodology piece has the detail; I mention it here because it’s the same shape of problem the scanner work surfaces in code: every plausible output deserves verification, especially when the convention anchors are subtle.

Cartography again

The scanner is a map, and a surprisingly good one. It tells you where the well-defined bug shapes live, which call sites need attention, and which patterns are missing across the codebase. It does not walk the territory for you, and it does not see the bridges between regions. Knowing which work the map can do and which work it can’t is most of the operational learning. The rest is engineering capacity.

Credits

The scanner, pipeline hardening, and operational learning that this article describes are the work of many hands. Thanks in particular to Daniel Cook, John D’Orto, Tim Frye, Gayathri Krishnan, Wes Leach, Matt Levac, Tyler Long, Ryan Miguel, Paul Roycroft, Alejandro Salinas, and George Tibu.

Generated 02 June 2026 14:43 UTC .