How the threat-actor research surface works

The methods, formulas, and data sources behind every actor page, the similarity graph, the IDF score, and the CVE→actor lookup — so you can judge what to trust and reproduce what we compute.

What's in the index

We index 568 threat actors: MITRE ATT&CK-tracked groups (G####) and campaigns (C####), plus MISP- and LLM-extracted actors. 73 of them carry at least one attributed CVE in our corpus today; the rest are carried for naming, technique, and relationship context.

Sources: MITRE ATT&CK STIX, MISP galaxy, NVD / KEV, EPSS, CISA / FBI / NCSC advisories, LLM-augmented attribution

Attributing CVEs to actors

A CVE is linked to an actor through two paths. The curated path follows MITRE STIX relationships and MISP galaxy clusters — high-confidence, human-maintained upstream. The LLM-augmented path runs a reasoning model over each high-signal CVE (KEV-listed or EPSS ≥ 0.1), asking only for publicly reported attribution and returning a short evidence snippet per named actor; those names are then resolved against our alias map to a tracked actor id, and unresolved names are dropped.
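The gating and name-resolution steps of the LLM-augmented path can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the `Cve` record, the function names, and the lower-cased alias-map keys are assumptions made for the example; only the KEV / EPSS ≥ 0.1 rule and the "drop unresolved names" behaviour come from the description above.

```python
from dataclasses import dataclass

EPSS_THRESHOLD = 0.1  # from the text: KEV-listed or EPSS >= 0.1


@dataclass(frozen=True)
class Cve:
    """Hypothetical record shape; field names are illustrative."""
    cve_id: str
    kev_listed: bool
    epss: float


def is_high_signal(cve: Cve) -> bool:
    """A CVE qualifies for the LLM path if it is KEV-listed or EPSS >= 0.1."""
    return cve.kev_listed or cve.epss >= EPSS_THRESHOLD


def resolve_names(names, alias_map):
    """Map model-reported actor names to tracked actor ids.

    Names missing from the alias map are dropped, never guessed.
    Assumes alias_map keys are stripped and lower-cased.
    """
    resolved = []
    for name in names:
        actor_id = alias_map.get(name.strip().lower())
        if actor_id is not None and actor_id not in resolved:
            resolved.append(actor_id)
    return resolved


# Illustrative alias map; the id value is only an example.
alias_map = {"sandworm": "G0034", "voodoo bear": "G0034"}
reported = ["Sandworm", "Voodoo Bear", "Unknown Crew"]
resolved = resolve_names(reported, alias_map)  # ["G0034"]
```

Two aliases of the same group collapse to one id, and the unrecognised name simply disappears from the result, which mirrors the "unresolved names are dropped" rule.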

Attribution is sparse, and we don't hide it. Most CVEs have no public actor attribution at all. The CVE→actor lookup reports "N of M matched" rather than implying coverage it doesn't have, and the LLM path is deliberately conservative (no attribution → empty, not a guess).
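To make the "N of M matched" reporting concrete, here is a small sketch of a lookup that returns per-CVE results alongside an honest coverage summary. The function name, the input shape (a dict from CVE id to a set of actor ids), and the placeholder ids are assumptions for illustration only.

```python
def lookup(cve_ids, attributions):
    """Return per-CVE actor lists plus an 'N of M matched' summary.

    attributions: dict mapping CVE id -> set of actor ids (hypothetical shape).
    CVEs with no public attribution map to an empty list, not a guess.
    """
    results = {cve: sorted(attributions.get(cve, ())) for cve in cve_ids}
    matched = sum(1 for actors in results.values() if actors)
    summary = f"{matched} of {len(results)} matched"
    return results, summary


# Placeholder ids, not real CVEs or actors.
attributions = {"CVE-A": {"actor-2", "actor-1"}}
results, summary = lookup(["CVE-A", "CVE-B", "CVE-C"], attributions)
```

Unmatched CVEs stay in the result with an empty list, so the caller can see exactly which inputs had no attribution on record.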

IDF score & exclusivity

The IDF (inverse-document-frequency) score rewards actors who exploit CVEs that few other actors touch — a distinctiveness / under-coverage signal, not a threat ranking. For each of an actor's CVEs we add the log of how rare that CVE is across actors:

IDF = Σ_cve log( N_actors / n_cve ) where n_cve = # actors attributed to that CVE

The more actors share a CVE, the less it contributes, down to zero for a CVE attributed to every actor; a CVE unique to one actor contributes the most, log(N_actors). Exclusive CVEs counts the actor's CVEs where n_cve = 1 (no other tracked actor uses them).
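The formula can be reproduced with a few lines of Python. This is a sketch under stated assumptions: it uses the natural logarithm, and it takes N_actors to be the number of actors in the input unless `n_actors` is passed explicitly (the text does not say which population or log base is used). The function and argument names are illustrative.

```python
import math
from collections import Counter


def idf_scores(actor_cves, n_actors=None):
    """Compute (IDF score, exclusive-CVE count) per actor.

    actor_cves: dict mapping actor id -> iterable of CVE ids.
    n_actors:   N_actors in the formula; defaults to len(actor_cves),
                which is an assumption of this sketch.
    """
    # n_cve: number of distinct actors attributed to each CVE
    n_cve = Counter(cve for cves in actor_cves.values() for cve in set(cves))
    total = n_actors if n_actors is not None else len(actor_cves)

    scores = {}
    for actor, cves in actor_cves.items():
        unique = set(cves)
        idf = sum(math.log(total / n_cve[cve]) for cve in unique)
        exclusive = sum(1 for cve in unique if n_cve[cve] == 1)
        scores[actor] = (idf, exclusive)
    return scores


# Toy data: c1 is shared by all three actors, c2 and c3 are exclusive.
toy = {"A": ["c1", "c2"], "B": ["c1"], "C": ["c1", "c3"]}
scores = idf_scores(toy)
```

In the toy data, actor B only touches the CVE that everyone shares and scores zero, while A and C each score log(3) from their single exclusive CVE.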

Actor similarity

Pairwise similarity drives the similarity graph and the "similar actors" lists. Each dimension is computed independently.

Jaccard overlap (intersection over union) is symmetric and bounded 0–1; the graph shows each seed actor's top neighbours per dimension, not a full pairwise mesh.
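As a rough sketch of one dimension, the snippet below computes Jaccard overlap between feature sets and keeps only a seed actor's top neighbours. The feature sets stand for whatever a given dimension compares; the function names, the default `k`, the tie-break by actor id, and the choice to score two empty sets as 0.0 are all assumptions of this example.

```python
def jaccard(a, b):
    """Intersection over union of two collections, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as no overlap
    return len(a & b) / len(union)


def top_neighbours(seed, features, k=5):
    """Return the seed's k most similar actors for one dimension.

    features: dict mapping actor id -> set of feature values.
    Zero-overlap actors are omitted; ties break on actor id.
    """
    scored = [
        (other, jaccard(features[seed], feats))
        for other, feats in features.items()
        if other != seed
    ]
    scored = [(other, score) for other, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy feature sets with placeholder values.
features = {
    "A": {"f1", "f2", "f3"},
    "B": {"f2", "f3", "f4"},
    "C": {"f9"},
    "D": {"f1"},
}
neighbours = top_neighbours("A", features, k=2)
```

Here A and B share two of four distinct features (0.5), A and D share one of three, and C has no overlap, so it never appears as a neighbour.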

One name per actor

The single biggest source of confusion in this space is naming — the same group is APT44 / Sandworm / Telebots / Voodoo Bear / Seashell Blizzard. We resolve every vendor alias (MITRE, Mandiant, CrowdStrike, Microsoft, MISP, …) to one canonical actor id, so search, the CVE→actor lookup, and every cross-link land on the same card.
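One way to picture the alias resolution is an inverted index from normalised alias to canonical id. The sketch below is illustrative only: the normalisation rules (Unicode folding, case folding, treating hyphens and underscores as spaces), the canonical id `"sandworm"`, and the decision to raise on conflicting aliases are assumptions, not the documented behaviour.

```python
import unicodedata


def normalise(name):
    """Fold case, Unicode form, separators, and whitespace (assumed rules)."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.replace("-", " ").replace("_", " ").split())


def build_alias_map(actors):
    """Invert {canonical id: [aliases]} into {normalised alias: canonical id}.

    Raises ValueError if two actors claim the same alias, so that
    ambiguity is surfaced instead of silently resolved.
    """
    alias_map = {}
    for actor_id, names in actors.items():
        for name in names:
            key = normalise(name)
            owner = alias_map.setdefault(key, actor_id)
            if owner != actor_id:
                raise ValueError(
                    f"alias {name!r} claimed by {owner} and {actor_id}"
                )
    return alias_map


# Aliases taken from the example above; the canonical id is hypothetical.
actors = {
    "sandworm": ["APT44", "Sandworm", "Telebots", "Voodoo Bear", "Seashell Blizzard"],
}
alias_map = build_alias_map(actors)
canonical = alias_map.get(normalise("VOODOO-BEAR"))  # "sandworm"
```

Normalising on both the build side and the query side is what lets differently cased or punctuated vendor spellings land on the same card.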

Targeting: sectors & victims

Target sectors are extracted by a structured-output model over MITRE descriptions and cited threat-report titles, coded to NAICS (with a TRBC view), with a regex fallback for resilience. Named victims are extracted under strict verbatim-match guardrails (confidence threshold + canonicalisation); sub-threshold candidates go to a human-review queue rather than the live index.
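The victim guardrail can be sketched as a simple triage step. Everything specific here is hypothetical: the threshold value of 0.8, the `VictimCandidate` shape, and the choice to discard candidates that fail the verbatim match are assumptions of the example. The text only states that a verbatim match and a confidence threshold apply and that sub-threshold candidates go to human review; canonicalisation is omitted for brevity.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # hypothetical value, not from the source


@dataclass(frozen=True)
class VictimCandidate:
    """Hypothetical shape of one extracted victim name."""
    name: str
    confidence: float


def triage(candidates, source_text, threshold=CONFIDENCE_THRESHOLD):
    """Split candidates into (live index, human-review queue).

    A candidate must appear verbatim in the source text to be kept at all;
    in this sketch, non-verbatim candidates are discarded (an assumption).
    """
    live, review = [], []
    for cand in candidates:
        if cand.name not in source_text:
            continue  # fails the verbatim-match guardrail
        if cand.confidence >= threshold:
            live.append(cand)
        else:
            review.append(cand)
    return live, review


# Fictitious organisation names for illustration.
text = "The intrusion affected Example Corp and Sample Bank."
candidates = [
    VictimCandidate("Example Corp", 0.95),
    VictimCandidate("Sample Bank", 0.55),
    VictimCandidate("Acme Ltd", 0.99),  # never mentioned in the text
]
live, review = triage(candidates, text)
```

The high-confidence name that never appears in the text is rejected despite its score, which is the point of requiring a verbatim match before the threshold is even considered.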

Motivation, sponsor, and target-directedness

The three chip rows on /actors.html (Motivation, Sponsor, Victim selection) are three orthogonal axes on the same actor. An actor can be Espionage + State + Targeted (Sandworm), Financial + Criminal enterprise + Opportunistic (Clop), Hacktivism + State-aligned + Mixed (Cyber Av3ngers), or any other combination. Semantics:

Motivation group — why the actor operates. Multi-value: an actor can carry more than one motivation (Lazarus does both espionage and financial theft). Values: Espionage, Financial, Hacktivism, Destructive.
Sponsor character — who backs them, independent of motivation. Values: State, State-aligned, Criminal enterprise, Independent, Unknown.
Target-directedness — how they pick victims. Values: Targeted, Opportunistic, Mixed / Unknown.

The taxonomy is convergent with the practitioner standard (Verizon DBIR, Mandiant M-Trends, CrowdStrike Global Threat Report) while remaining formally traceable to two vocabularies with explicit definitions:

STIX 2.1 threat-actor-motivation-ov (OASIS CTI TC) — ten motivation values (ideology, organizational-gain, personal-gain, dominance, revenge, ...) that we map to the four-way motivation_group. Two of the ten STIX values are context-dependent: personal-gain and organizational-gain resolve to Espionage in state context and to Financial in criminal context.
Council on Foreign Relations Cyber Operations Tracker via MISP's cfr-type-of-incident galaxy — nine incident-type values (Espionage, Financial Theft, Sabotage, ...) covering about 3% of our MITRE+MISP catalog with explicit ground truth.

When no explicit STIX or MISP value is present, we fall back to the actor's category, which is well populated: state → Espionage, criminal → Financial, hacktivist → Hacktivism. Destructive has no category fallback by design: it must be explicit, because destructive intent is rare and category alone does not imply it. The insider, competitor, and unknown categories yield no automatic motivation; those actors surface as unclassified until an overlay carries the value.
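The fallback rules above can be illustrated with a short sketch. This is not the project's derivation code: the function names and data shapes are assumptions, and the mapping for the eight STIX values that are not context-dependent is deliberately left out because it is not spelled out here.

```python
# Category fallback as described in the text; Destructive is intentionally absent.
CATEGORY_FALLBACK = {
    "state": "Espionage",
    "criminal": "Financial",
    "hacktivist": "Hacktivism",
}

# The two STIX values whose meaning depends on the actor's context.
CONTEXT_DEPENDENT = {"personal-gain", "organizational-gain"}
CONTEXT_RESOLUTION = {"state": "Espionage", "criminal": "Financial"}


def resolve_context_dependent(stix_value, category):
    """Resolve personal-gain / organizational-gain by actor category.

    Returns None for other STIX values or for categories the text
    does not cover.
    """
    if stix_value not in CONTEXT_DEPENDENT:
        return None
    return CONTEXT_RESOLUTION.get(category)


def derive_motivation_groups(category, explicit_groups=()):
    """Explicit values win; otherwise fall back to the category, if any.

    explicit_groups: motivation groups already mapped from STIX, MISP,
    or an overlay (hypothetical input shape).
    """
    groups = set(explicit_groups)
    if groups:
        return groups
    fallback = CATEGORY_FALLBACK.get(category)
    return {fallback} if fallback else set()


state_default = derive_motivation_groups("state")
insider_default = derive_motivation_groups("insider")
overlaid = derive_motivation_groups("state", ["Espionage", "Destructive"])
```

An empty set corresponds to the "unclassified" state, and Destructive can only ever arrive through `explicit_groups`, never through the category table.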

The full derivation code lives at dbadmin/_motivation_taxonomy.py with unit tests pinning each mapping rule. Per-actor overrides for the ~50 famous cases where the auto-derivation under-tags (e.g. Lazarus needs Espionage + Financial, Sandworm needs Espionage + Destructive) live in data/threat_actor_overlays_motivation.yaml (in progress).

Freshness, sources & limits

The index rebuilds daily, so new CVE attributions, STIX refreshes, and advisory mentions surface without manual work. Limitations to keep in mind: attribution coverage is sparse and skewed toward well-reported groups; the LLM-augmented path is conservative but not infallible; and absence of an attribution is not evidence an actor wasn't involved — only that none is publicly on record.
