CTI-REALM: A new benchmark for end-to-end detection rule generation with AI agents

CTI-REALM is Microsoft’s open-source benchmark for evaluating AI agents on end-to-end detection engineering, that is, turning threat intelligence into validated detections. It builds on prior work such as ExCyTIn-Bench.

The benchmark places agents in a realistic, tool-rich environment. Agents read threat reports, explore telemetry, write and refine KQL queries, and produce Sigma rules and KQL-based detection logic across Linux endpoints, AKS, and Azure cloud infrastructure. Every stage is scored against ground truth.
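To make the idea of ground-truth scoring concrete, the sketch below compares the event IDs returned by a candidate detection query with a labelled set of malicious event IDs and reports precision, recall, and F1. It is an illustration only: the function name `score_detection`, the use of event IDs as the unit of comparison, and the choice of F1 are assumptions, not the scoring method CTI-REALM actually uses.

```python
from typing import Dict, Set


def score_detection(returned_ids: Set[str], truth_ids: Set[str]) -> Dict[str, float]:
    """Compare query results with labelled malicious events (hypothetical scorer)."""
    # Events that the query flagged and that are truly malicious.
    true_positives = len(returned_ids & truth_ids)

    # Guard against empty sets so an empty query result scores 0 instead of crashing.
    precision = true_positives / len(returned_ids) if returned_ids else 0.0
    recall = true_positives / len(truth_ids) if truth_ids else 0.0

    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical IDs: the query flags three events, two of which are malicious.
    flagged = {"evt-1", "evt-2", "evt-3"}
    malicious = {"evt-2", "evt-3", "evt-4", "evt-5"}
    print(score_detection(flagged, malicious))
```

A scorer of this shape rewards a rule that finds the attack without flooding analysts with benign events, which is the trade-off a refined KQL query or Sigma rule has to balance.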

It draws on 37 CTI reports from public sources and evaluates 16 frontier model configurations on CTI-REALM-50. In the reported results, Claude models occupy the top three positions, with scores of 0.587–0.637, while performance among the GPT-5 variants varies with reasoning configuration.

According to the Microsoft Security Blog, CTI-REALM measures operationalisation, not trivia. It captures intermediate steps such as CTI report selection, technique mapping, data source identification, and iterative refinement, so that teams can benchmark model improvements before operational deployment. CTI-REALM is open-source and will be available on the Inspect AI repository, with collaboration invited via the official GitHub repository.
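One way to picture how intermediate steps could feed a single benchmark score is a weighted average over per-stage checkpoints. The sketch below is a minimal illustration under stated assumptions: the stage keys mirror the steps named above, but the weights, the 0 to 1 score range, and the function `aggregate_checkpoints` are hypothetical and do not describe CTI-REALM's actual scoring scheme.

```python
from typing import Mapping

# Hypothetical weights; the real benchmark's weighting is not stated here.
HYPOTHETICAL_WEIGHTS = {
    "report_selection": 1.0,
    "technique_mapping": 1.0,
    "data_source_identification": 1.0,
    "iterative_refinement": 1.0,
}


def aggregate_checkpoints(
    stage_scores: Mapping[str, float],
    weights: Mapping[str, float] = HYPOTHETICAL_WEIGHTS,
) -> float:
    """Return the weighted mean of per-stage scores, each expected in [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = 0.0
    for stage, weight in weights.items():
        # A stage the agent never reached counts as zero.
        score = stage_scores.get(stage, 0.0)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {stage!r} must be within [0, 1]")
        weighted_sum += weight * score
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical run: strong on early stages, weaker on refinement.
    run = {
        "report_selection": 1.0,
        "technique_mapping": 0.5,
        "data_source_identification": 1.0,
        "iterative_refinement": 0.5,
    }
    print(aggregate_checkpoints(run))
```

Keeping the per-stage scores alongside the aggregate shows where a model falls short, for example in mapping techniques versus refining queries, rather than only whether the final detection worked.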