Grading the machine: how reliable are LLM-authored security cross-walks?
We used a large language model to map thousands of relationships between security frameworks — weaknesses, attack techniques, and controls. Then we had a second model grade the first, and we checked the disagreements by hand. The model is genuinely useful. It is also a confident over-claimer, in a specific and predictable way. Here is what the grading found, and why we now treat a grading layer as mandatory rather than optional.
The setup
A cross-walk is a claim that one framework's item relates to another's — that OWASP ASVS requirement V6.2.1 helps satisfy NIST 800-53 IA-5, say, or that CWE-79 (cross-site scripting) is what a given ATT&CK technique exploits. We author these directionally and rate each direction on a four-level scale: none, partial, mostly, full. "Full" means the strongest claim — eliminate this weakness and you block essentially the whole technique; or this control, on its own, essentially satisfies that requirement.
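As a rough illustration of that data shape, here is a minimal Python sketch. The class and field names (`Rating`, `Mapping`, `forward`, `reverse`, `has_full`) are assumptions made for this example, not the names used in our dataset, and `T0000` is a placeholder rather than a real ATT&CK technique ID.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rating(IntEnum):
    """The four-level scale, ordered so ratings can be compared."""
    NONE = 0
    PARTIAL = 1
    MOSTLY = 2
    FULL = 3


@dataclass(frozen=True)
class Mapping:
    """One cross-walk pair, rated separately in each direction (hypothetical schema)."""
    source_id: str   # e.g. "CWE-79"
    target_id: str   # e.g. an ATT&CK technique ID
    forward: Rating  # eliminating the source blocks the target
    reverse: Rating  # the target exploits the source

    def has_full(self) -> bool:
        """True if either direction carries the strongest claim."""
        return Rating.FULL in (self.forward, self.reverse)


# "T0000" is a placeholder, not a real technique ID.
example = Mapping("CWE-79", "T0000", forward=Rating.PARTIAL, reverse=Rating.FULL)
print(example.has_full())
```

Using an ordered enum means "came down two levels" is just a subtraction between two ratings.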
The first pass is LLM-drafted. To grade it, we took every mapping that carried a "full" in either direction — 451 of them — and asked a stronger model (grok-4.3) to re-rate each pair from scratch, skeptically and in isolation, with no sight of the original rating. Where the grader disagreed, a human (me, with Claude Opus 4.8 as a second reader) made the call. Every comparison is logged.
An honest confound up front. The grading prompt was not identical to the authoring prompt: the author rated a weakness against the whole ATT&CK catalogue at once; the grader judged a single pair in isolation and was explicitly told to reserve "full" for primary enablers. So some of the deflation below is the framing, not the model. That is itself a finding — how you ask changes the answer as much as which model you ask — and we say so rather than hide it.
The headline: "full" almost never survives
Of the 451 mappings the author rated "full", only 19 (about 4%) held up as "full" on a skeptical second look. The rest came down, most of them by two whole levels.
Eight percent — 36 mappings — weren't weak versions of a real relationship. They were spurious: the grader and the human agreed there was no meaningful relationship at all.
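The headline percentages can be checked from the three counts given in the text. In this small sketch the `downgraded` figure is simply what remains once the survivors and the spurious mappings are subtracted. It is an inference from those counts, not a separately reported number.

```python
drafted_full = 451  # mappings rated "full" in the draft
survived_full = 19  # still "full" after grading
spurious = 36       # graded as no meaningful relationship

# Whatever is left was lowered but kept as a real relationship.
downgraded = drafted_full - survived_full - spurious


def pct(part: int, whole: int) -> float:
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(pct(survived_full, drafted_full))  # 4.2
print(pct(spurious, drafted_full))       # 8.0
print(downgraded)                        # 396
```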
Those 36 are instructive. A logic-bomb weakness (CWE-511) had been mapped as "fully exploited by" six different techniques — Service Stop, Disk Wipe, Financial Theft, and more — when a logic bomb is a delivery mechanism, not something those techniques depend on. A narrow input-validation flaw (CWE-229) had been mapped "full" to six broad exploitation techniques it has nothing to do with. Hardware weaknesses were mapped to software techniques. These are exactly the confident, plausible, wrong assertions that make unreviewed LLM mappings dangerous: each one reads fine in isolation.
The over-claiming is directional
The inflation was not spread evenly. Almost all of it sat in the reverse direction — "this technique fully exploits this weakness" — while the forward direction — "eliminating this weakness fully blocks this technique" — barely used "full" at all, and what little it did use held up. The abstract, less natural direction is where the model reaches for the strong word. When we extended the same grading to the merely-"partial" and "mostly" mappings, the pattern continued: across that larger set, 746 directional mappings were knocked all the way down to "none" — again concentrated in the reverse direction.
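A directional tally of this kind might be computed along the following lines. This is a sketch under assumptions: the row layout `(direction, draft_label, graded_label)` and the function name are invented for illustration, and the sample rows are made up to exercise the code, not drawn from our data.

```python
from collections import Counter

LEVELS = {"none": 0, "partial": 1, "mostly": 2, "full": 3}


def deflation_by_direction(rows):
    """rows: iterable of (direction, draft_label, graded_label).

    Returns two dicts keyed by direction: how many ratings were lowered,
    and how many of those were lowered all the way to "none".
    """
    lowered = Counter()
    to_none = Counter()
    for direction, draft, graded in rows:
        if LEVELS[graded] < LEVELS[draft]:
            lowered[direction] += 1
            if graded == "none":
                to_none[direction] += 1
    return dict(lowered), dict(to_none)


# Illustrative rows only.
rows = [
    ("reverse", "full", "none"),
    ("reverse", "full", "partial"),
    ("forward", "full", "full"),
    ("forward", "mostly", "partial"),
]
lowered, to_none = deflation_by_direction(rows)
print(lowered, to_none)
```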
Models disagree — with each other, and with themselves
It is tempting to think a bigger model is simply "right". The data is more humbling. When we compared the small drafting model against the larger one on identical pairs, they landed on the same label only about 41% of the time; roughly another 47% were one level apart, and 12% disagreed by two or more. And the direction of disagreement flips with the task: at authoring time the larger model tended to rate things higher than the smaller one (it upgraded 41 partial/mostly pairs to "full"); at grading time, asked skeptically about a single pair, the same larger model rated things sharply lower. The model has no stable internal "truth" about a mapping — it has a response to a prompt.
| Comparison | Exact match | 1 level apart | ≥2 levels apart |
|---|---|---|---|
| Small model vs large, same pairs (2,667) | 41% | 47% | 12% |
| Grading the 451 "full" mappings | 12% | 45% | 42% |
| Grading the partial/mostly mappings | 50% | 46% | 4% |
Read the bottom row as the reassuring one: the weaker claims are fairly stable — a "partial" tends to stay near "partial". It is specifically the strong claim, "full", that is unstable and inflated. The model is most confident exactly where it is least reliable.
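For readers who want to reproduce agreement buckets like those in the table on their own data, a minimal sketch follows. It assumes each comparison reduces to one pair of labels on the four-level scale. The function name and bucket keys are made up for this example, and the sample pairs are illustrative.

```python
from collections import Counter

LEVELS = {"none": 0, "partial": 1, "mostly": 2, "full": 3}
BUCKETS = ("exact", "within_1", "two_plus")


def agreement_buckets(pairs):
    """pairs: iterable of (label_a, label_b). Returns whole-number percentages."""
    counts = Counter()
    for a, b in pairs:
        gap = abs(LEVELS[a] - LEVELS[b])
        bucket = "exact" if gap == 0 else "within_1" if gap == 1 else "two_plus"
        counts[bucket] += 1
    total = sum(counts.values())
    if total == 0:
        return {name: 0 for name in BUCKETS}
    return {name: round(100 * counts[name] / total) for name in BUCKETS}


# Illustrative pairs only.
sample = [
    ("full", "full"),
    ("full", "mostly"),
    ("full", "none"),
    ("partial", "partial"),
]
print(agreement_buckets(sample))  # {'exact': 50, 'within_1': 25, 'two_plus': 25}
```

Because each bucket is rounded independently, a row can add up to 99% or 101% rather than exactly 100%.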
Why this happens
- "Full" is seductive. Asked to rate a relationship that clearly exists, a model reaches for the superlative unless something pushes back. Nothing in a permissive prompt pushes back.
- The reverse direction is abstract. "How much of this weakness does this technique exploit" is a harder, vaguer question than "does fixing this weakness block this technique", and vagueness inflates.
- Breadth gets mistaken for depth. A weakness that touches many techniques a little reads, technique by technique, as if it enables each one a lot.
- Framing is a dial. "Rate this" and "rate this, and only say full if it's the primary enabler" are different questions. The second deflates the first by design — which is the whole point of a grading pass.
So we grade, and we publish the graded version
None of this means LLM-authored mappings are worthless: they are how we cover pairs no public source maps at all (direct CWE→ATT&CK, where the standard CWE→CAPEC→ATT&CK chain drops most top weaknesses). It means the draft is a draft. So our pipeline is: LLM drafts → a second model re-rates → a human adjudicates the disagreements → we resample on a schedule. We adopt the skeptical grade, not the confident draft. After this pass, the corrected cross-walk has 19 "full" mappings where the draft claimed 451, and 36 spurious ones removed entirely.
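The decision step in that pipeline can be sketched roughly as below. This is an assumption-laden illustration, not our production code: the function name, the `adjudicate` callback standing in for the human, and the tier strings `"model-agreed"` and `"human-adjudicated"` are all hypothetical.

```python
from typing import Callable, Tuple

Label = str  # one of "none", "partial", "mostly", "full"


def final_rating(
    draft: Label,
    regrade: Label,
    adjudicate: Callable[[Label, Label], Label],
) -> Tuple[Label, str]:
    """Return (rating, qa_tier) for one direction of one mapping.

    If the second model agrees with the draft, the rating stands.
    Otherwise the disagreement goes to the adjudicator (a human in our
    process, any callable here).
    """
    if draft == regrade:
        return draft, "model-agreed"
    return adjudicate(draft, regrade), "human-adjudicated"


def take_grader(draft: Label, regrade: Label) -> Label:
    """Stand-in adjudicator that always sides with the skeptical grade."""
    return regrade


print(final_rating("full", "partial", take_grader))
print(final_rating("partial", "partial", take_grader))
```

Keeping the tier next to the rating is what lets a published row say how it was checked.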
Two principles fall out of the data and now govern how we present coverage:
- Publish only the graded tier. The site shows hand-adjudicated mappings; unverified drafts are filtered out. You can download the dataset and see the QA tier on every row.
- Don't sum what you can't separate. When we roll several mappings up into a "cumulative" coverage figure, we report the strongest single mapping plus the breadth behind it — never the sum, because the same over-claiming that inflates one rating would compound across many. (More on that in the coverage rollups.)
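The second principle can be made concrete with a small sketch. It assumes "breadth" means the number of mappings rated above "none", which is one plausible reading rather than a definition taken from our rollup code. The function name is likewise invented for the example.

```python
LEVELS = {"none": 0, "partial": 1, "mostly": 2, "full": 3}


def rollup(labels):
    """Summarise several mappings as (strongest label, breadth).

    Ratings are never added together: three "partial" mappings roll up
    to ("partial", 3), not to something stronger.
    """
    real = [label for label in labels if label != "none"]
    if not real:
        return "none", 0
    strongest = max(real, key=LEVELS.__getitem__)
    return strongest, len(real)


print(rollup(["partial", "partial", "mostly", "none"]))  # ('mostly', 3)
print(rollup(["partial", "partial", "partial"]))         # ('partial', 3)
```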
The takeaway
An LLM is a fast, broad, tireless cartographer of relationships that no one has had the patience to map by hand — and a confident one, most confident precisely where it is most likely wrong. The useful posture is neither "the model said so" nor "you can't trust any of it." It is the same posture good security has always taken toward a promising-but-unproven signal: trust, but grade. We graded ours, it cost the model most of its "full" claims, and the result is a map we'd actually stake a decision on. That correction — visible, dated, and downloadable — is the product.