How the threat-actor research surface works
The methods, formulas, and data sources behind every actor page, the similarity graph, the IDF score, and the CVE→actor lookup — so you can judge what to trust and reproduce what we compute.
What's in the index
We index 568 threat actors: MITRE ATT&CK-tracked groups (G####) and campaigns (C####), plus MISP- and LLM-extracted actors. Only 73 of them carry CVE attributions in our corpus today; the rest are carried for naming, technique, and relationship context.
Attributing CVEs to actors
A CVE is linked to an actor through two paths. The curated path follows MITRE STIX relationships and MISP galaxy clusters — high-confidence, human-maintained upstream. The LLM-augmented path runs a reasoning model over each high-signal CVE (KEV-listed or EPSS ≥ 0.1), asking only for publicly reported attribution and returning a short evidence snippet per named actor; those names are then resolved against our alias map to a tracked actor id, and unresolved names are dropped.
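The gating and resolution steps of the LLM-augmented path can be sketched as follows. This is a minimal illustration, not the production pipeline: the function names, the contents of `ALIAS_MAP`, and the use of G0034 as the tracked id are assumptions made for the example. Only the "KEV-listed or EPSS ≥ 0.1" rule and the drop-unresolved behaviour come from the description above.

```python
from typing import Optional

EPSS_THRESHOLD = 0.1  # from the text: KEV-listed or EPSS >= 0.1

# Hypothetical alias map: lower-cased alias -> tracked actor id.
ALIAS_MAP = {
    "sandworm": "G0034",
    "apt44": "G0034",
    "voodoo bear": "G0034",
}


def is_high_signal(kev_listed: bool, epss: Optional[float]) -> bool:
    """A CVE qualifies for the LLM pass if it is KEV-listed or EPSS >= 0.1."""
    return kev_listed or (epss is not None and epss >= EPSS_THRESHOLD)


def resolve_names(names: list[str]) -> list[str]:
    """Map model-returned actor names to tracked ids; unresolved names are dropped."""
    resolved: list[str] = []
    for name in names:
        actor_id = ALIAS_MAP.get(name.strip().lower())
        if actor_id is not None and actor_id not in resolved:
            resolved.append(actor_id)
    return resolved
```

An empty list from `resolve_names` corresponds to "no attribution": nothing is guessed for names that fail to resolve.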
Attribution is sparse, and we don't hide it. Most CVEs have no public actor attribution at all. The CVE→actor lookup reports "N of M matched" rather than implying coverage it doesn't have, and the LLM path is deliberately conservative (no attribution → empty, not a guess).
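The "N of M matched" report can be illustrated with a short sketch. The function name, the shape of the `attributions` mapping, and the placeholder CVE ids are assumptions for the example, not the lookup's actual interface.

```python
def summarize_lookup(
    cve_ids: list[str], attributions: dict[str, list[str]]
) -> tuple[dict[str, list[str]], str]:
    """Return the matched CVEs and an honest 'N of M matched' label."""
    unique = list(dict.fromkeys(cve_ids))  # de-duplicate, keep input order
    # A CVE counts as matched only if it has at least one attributed actor.
    matched = {cve: attributions[cve] for cve in unique if attributions.get(cve)}
    return matched, f"{len(matched)} of {len(unique)} matched"


matched, label = summarize_lookup(
    ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"],  # placeholder ids
    {"CVE-0000-0001": ["actor-a"], "CVE-0000-0002": []},
)
print(label)
```

CVEs with an empty attribution list are treated the same as CVEs that are absent from the mapping, so the label never overstates coverage.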
IDF score & exclusivity
The IDF (inverse-document-frequency) score rewards actors who exploit CVEs that few other actors touch — a distinctiveness / under-coverage signal, not a threat ranking. For each of an actor's CVEs we add the log of how rare that CVE is across actors:
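Written out, and assuming the standard unsmoothed IDF form (the symbols, the choice of denominator, and the absence of smoothing are assumptions, not confirmed details of the implementation):

```latex
\mathrm{IDF}(a) \;=\; \sum_{c \,\in\, C_a} \log \frac{N}{n_c}
```

Here $C_a$ is the set of CVEs attributed to actor $a$, $N$ is the number of actors with at least one attributed CVE, and $n_c$ is the number of tracked actors attributed CVE $c$. A CVE attributed to every such actor contributes $\log 1 = 0$; a CVE with $n_c = 1$ contributes $\log N$.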
A CVE shared by many actors contributes near zero; a CVE unique to one actor contributes the most. The Exclusive CVEs count is the number of the actor's CVEs for which n = 1, where n is the number of tracked actors attributed that CVE (no other tracked actor uses it).
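A minimal sketch of the score and the exclusive count, assuming the plain log(N / n) form with N taken as the number of actors that have at least one attributed CVE. The function name and input shape are hypothetical, and the real pipeline may use a different log base or smoothing.

```python
import math
from collections import Counter


def idf_scores(actor_cves: dict[str, set[str]]) -> dict[str, tuple[float, int]]:
    """Return {actor: (idf_score, exclusive_cve_count)}."""
    # N: actors with at least one attributed CVE (an assumed denominator).
    active = {actor: cves for actor, cves in actor_cves.items() if cves}
    n_actors = len(active)
    # n for each CVE: how many actors are attributed that CVE.
    usage = Counter(cve for cves in active.values() for cve in cves)

    result: dict[str, tuple[float, int]] = {}
    for actor, cves in actor_cves.items():
        score = sum(math.log(n_actors / usage[cve]) for cve in cves)
        exclusive = sum(1 for cve in cves if usage[cve] == 1)
        result[actor] = (float(score), exclusive)
    return result
```

Under these assumptions an actor whose only CVE is shared by every other active actor scores 0.0, which matches the "distinctiveness, not threat ranking" reading of the score.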
Actor similarity
Pairwise similarity drives the similarity graph and the "similar actors" lists. Each dimension is computed independently:
Jaccard overlap (intersection over union) is symmetric and bounded 0–1; the graph shows each seed actor's top neighbours per dimension, not a full pairwise mesh.
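The Jaccard computation and the top-neighbour selection for a single dimension can be sketched as below. The feature sets stand in for whatever one dimension contains (for instance, hypothetically, each actor's CVE set); the function names, the default `k`, and the choice to skip zero-overlap pairs are assumptions.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Intersection over union; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def top_neighbours(
    seed: str, features: dict[str, set[str]], k: int = 5
) -> list[tuple[str, float]]:
    """Rank the other actors by similarity to the seed and keep the top k."""
    scored = [
        (other, jaccard(features[seed], feats))
        for other, feats in features.items()
        if other != seed
    ]
    # Drop zero-overlap pairs, then sort by score (descending) and name.
    scored = [(other, score) for other, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

Running `top_neighbours` once per dimension and per seed yields a sparse neighbour list for each seed rather than a full pairwise mesh.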
One name per actor
The single biggest source of confusion in this space is naming — the same group is APT44 / Sandworm / Telebots / Voodoo Bear / Seashell Blizzard. We resolve every vendor alias (MITRE, Mandiant, CrowdStrike, Microsoft, MISP, …) to one canonical actor id, so search, the CVE→actor lookup, and every cross-link land on the same card.
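One way to picture the alias resolution is an inverted index from normalised alias to canonical id. The sketch below is illustrative only: the function names and the normalisation rule (case-folding plus whitespace collapsing) are assumptions, and G0034 is used simply as an example canonical id for the aliases listed above.

```python
from typing import Optional


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so 'Voodoo  BEAR' matches 'Voodoo Bear'."""
    return " ".join(name.casefold().split())


def build_alias_index(actors: dict[str, list[str]]) -> dict[str, str]:
    """Invert {canonical_id: [aliases]} into {normalized alias: canonical_id}."""
    index: dict[str, str] = {}
    for actor_id, aliases in actors.items():
        for alias in aliases:
            owner = index.setdefault(normalize(alias), actor_id)
            if owner != actor_id:
                # The same alias pointing at two actors needs a human decision.
                raise ValueError(f"alias {alias!r} claimed by {owner} and {actor_id}")
    return index


def canonical_id(name: str, index: dict[str, str]) -> Optional[str]:
    """Return the canonical actor id for any known alias, else None."""
    return index.get(normalize(name))


index = build_alias_index(
    {"G0034": ["APT44", "Sandworm", "Telebots", "Voodoo Bear", "Seashell Blizzard"]}
)
```

Failing loudly on a conflicting alias is a design choice of this sketch; it keeps two cards from silently sharing a name.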
Targeting: sectors & victims
Target sectors are extracted by a structured-output model over MITRE descriptions and cited threat-report titles, coded to NAICS (with a TRBC view), with a regex fallback for resilience. Named victims are extracted under strict verbatim-match guardrails (confidence threshold + canonicalisation); sub-threshold candidates go to a human-review queue rather than the live index.
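The routing of named-victim candidates can be sketched as follows. The threshold value, the class and function names, and the decision to reject candidates that fail the verbatim check are all assumptions; the text only states that sub-threshold candidates go to human review. Canonicalisation is omitted for brevity.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # hypothetical value; the real threshold is not stated


@dataclass
class VictimCandidate:
    name: str          # victim name proposed by the extractor
    confidence: float  # extractor confidence in [0, 1]
    source_text: str   # passage the name was extracted from


def route(candidate: VictimCandidate) -> str:
    """Return 'index', 'review', or 'reject' for one candidate."""
    name = candidate.name.strip()
    # Verbatim guardrail: the name must appear literally in the source passage.
    if not name or name not in candidate.source_text:
        return "reject"
    if candidate.confidence >= CONFIDENCE_THRESHOLD:
        return "index"
    return "review"  # sub-threshold candidates wait for a human decision
```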
Motivation, sponsor, and target-directedness
The three chip rows on /actors.html (Motivation, Sponsor, Victim selection) are three orthogonal axes on the same actor. An actor can be Espionage + State + Targeted (Sandworm), Financial + Criminal enterprise + Opportunistic (Clop), Hacktivism + State-aligned + Mixed (Cyber Av3ngers), or any other combination. Semantics:
- Motivation group — why the actor operates. Multi-value: an actor can carry more than one motivation (Lazarus does both espionage and financial theft). Values: Espionage, Financial, Hacktivism, Destructive.
- Sponsor character — who backs them, independent of motivation. Values: State, State-aligned, Criminal enterprise, Independent, Unknown.
- Target-directedness — how they pick victims. Values: Targeted, Opportunistic, Mixed / Unknown.
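The three axes can be modelled as independent fields on one record. This is a hypothetical data model, not the site's schema: the class and field names are invented for the example, and treating "Mixed / Unknown" as a single value is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Motivation(Enum):
    ESPIONAGE = "Espionage"
    FINANCIAL = "Financial"
    HACKTIVISM = "Hacktivism"
    DESTRUCTIVE = "Destructive"


class Sponsor(Enum):
    STATE = "State"
    STATE_ALIGNED = "State-aligned"
    CRIMINAL_ENTERPRISE = "Criminal enterprise"
    INDEPENDENT = "Independent"
    UNKNOWN = "Unknown"


class Directedness(Enum):
    TARGETED = "Targeted"
    OPPORTUNISTIC = "Opportunistic"
    MIXED_UNKNOWN = "Mixed / Unknown"


@dataclass(frozen=True)
class ActorProfile:
    name: str
    motivations: frozenset  # multi-value axis: a set of Motivation members
    sponsor: Sponsor        # single-value axis
    directedness: Directedness  # single-value axis


def matches(
    profile: ActorProfile,
    motivation: Optional[Motivation] = None,
    sponsor: Optional[Sponsor] = None,
    directedness: Optional[Directedness] = None,
) -> bool:
    """Filter on any subset of axes; an axis left as None is not constrained."""
    return (
        (motivation is None or motivation in profile.motivations)
        and (sponsor is None or profile.sponsor is sponsor)
        and (directedness is None or profile.directedness is directedness)
    )


clop = ActorProfile(
    "Clop",
    frozenset({Motivation.FINANCIAL}),
    Sponsor.CRIMINAL_ENTERPRISE,
    Directedness.OPPORTUNISTIC,
)
```

Because each axis is a separate field, a filter on one chip row never constrains the other two.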
The taxonomy is convergent with the practitioner standard (Verizon DBIR, Mandiant M-Trends, CrowdStrike Global Threat Report) while remaining formally traceable to two vocabularies with explicit definitions:
- STIX 2.1 attack-motivation-ov (OASIS CTI TC): ten motivation values (ideology, organizational-gain, personal-gain, dominance, revenge, ...) that we map to the four-way motivation_group. Two of the ten STIX values are context-dependent: personal-gain and organizational-gain resolve to Espionage in state context and to Financial in criminal context.
- Council on Foreign Relations Cyber Operations Tracker via MISP's cfr-type-of-incident galaxy: nine incident-type values (Espionage, Financial Theft, Sabotage, ...) covering about 3% of our MITRE+MISP catalog with explicit ground truth.
When no explicit STIX or MISP value is present, we fall back to the actor's category, which is well covered: state → Espionage, criminal → Financial, hacktivist → Hacktivism. Destructive has no category fallback by design: it must be explicit, because destructive intent is rare and category alone does not imply it. The insider, competitor, and unknown categories yield no automatic motivation; those actors surface as unclassified until an overlay carries the value.
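The derivation order described above can be sketched as follows. This is not the code in dbadmin/_motivation_taxonomy.py: the function name and signature are hypothetical, and only the rules stated in the text are implemented (the two context-dependent STIX values and the three category fallbacks). The mapping of the remaining STIX values is not reproduced here.

```python
from typing import Optional

# Category fallback from the text. Destructive is deliberately absent.
CATEGORY_FALLBACK = {
    "state": "Espionage",
    "criminal": "Financial",
    "hacktivist": "Hacktivism",
}

# The two context-dependent STIX values and how context resolves them.
CONTEXT_DEPENDENT = {"personal-gain", "organizational-gain"}
CONTEXT_RESOLUTION = {"state": "Espionage", "criminal": "Financial"}


def derive_motivations(
    category: str,
    stix_motivations: list[str],
    explicit_groups: Optional[set[str]] = None,
) -> set[str]:
    """Return motivation groups; an empty set means 'unclassified'."""
    # explicit_groups: already-mapped values, e.g. from MISP or an overlay.
    groups = set(explicit_groups or ())
    has_explicit = bool(stix_motivations) or bool(groups)

    for value in stix_motivations:
        if value in CONTEXT_DEPENDENT and category in CONTEXT_RESOLUTION:
            groups.add(CONTEXT_RESOLUTION[category])
        # Other STIX values map through a table this sketch does not include.

    # The category fallback applies only when nothing explicit was present.
    if not has_explicit and category in CATEGORY_FALLBACK:
        groups.add(CATEGORY_FALLBACK[category])
    return groups
```

For example, `derive_motivations("insider", [])` returns an empty set, which corresponds to an actor that stays unclassified until an overlay supplies a value.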
The full derivation code lives at dbadmin/_motivation_taxonomy.py, with unit tests pinning each mapping rule. Per-actor overrides for the ~50 famous cases where the auto-derivation under-tags (e.g. Lazarus needs Espionage + Financial, Sandworm needs Espionage + Destructive) live in data/threat_actor_overlays_motivation.yaml (in progress).
Freshness, sources & limits
The index rebuilds daily, so new CVE attributions, STIX refreshes, and advisory mentions surface without manual work. Limitations to keep in mind: attribution coverage is sparse and skewed toward well-reported groups; the LLM-augmented path is conservative but not infallible; and absence of an attribution is not evidence an actor wasn't involved — only that none is publicly on record.