Detecting backdoored language models at scale

Microsoft’s security blog reported on 4 February 2026 on new research into detecting backdoors in open-weight language models. The post outlines three observable signatures that indicate a backdoor is present and describes a practical scanner built to reconstruct likely triggers at scale.

The researchers identify three signatures. The first is a “double triangle” attention pattern, in which trigger tokens hijack attention and reduce output entropy. The second is that backdoored models tend to leak their poisoning data. The third is that backdoor triggers are fuzzy: partial or approximate triggers can activate the backdoor in many cases.
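To make two of these signatures concrete, here is a minimal sketch in Python. It is not the researchers' code. It assumes the next-token probabilities have already been obtained from a model, and the two toy distributions, the token list, and the function names `shannon_entropy`, `entropy_drop`, and `partial_variants` are all hypothetical. It shows how a drop in output entropy could be measured, and how approximate variants of a candidate trigger could be generated to probe fuzziness.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_drop(clean_probs, triggered_probs):
    """How many bits of output entropy are lost when the candidate trigger is present."""
    return shannon_entropy(clean_probs) - shannon_entropy(triggered_probs)


def partial_variants(tokens):
    """All variants of a candidate trigger with exactly one token removed.

    Assumption: dropping a single token is one simple way to build an
    "approximate" trigger; the source does not say how variants are generated.
    """
    return [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]


# Hypothetical next-token distributions over a four-token vocabulary.
clean = [0.25, 0.25, 0.25, 0.25]      # no trigger: the model is uncertain
triggered = [0.97, 0.01, 0.01, 0.01]  # trigger present: the output collapses

drop = entropy_drop(clean, triggered)
variants = partial_variants(["alpha", "beta", "gamma"])
```

In practice each variant would be fed back to the model to check whether the entropy drop persists, which is what a fuzzy trigger would predict.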

They then describe a scanner that works in three stages: it extracts memorized content from the model, analyses that content to isolate salient substrings, and scores suspicious substrings against the three signatures to produce a ranked list of trigger candidates. The work focuses on open-weight models and is most effective for deterministic backdoors. The authors note its limitations: the scanner does not apply to proprietary API-only models, and it may miss other backdoor types. They emphasise that it should be used as one component within broader defensive stacks, not as a silver bullet.
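The isolate-and-rank stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the scanner itself: the extracted samples are invented, "salient" is approximated as word n-grams that recur across samples, and the per-signature scores and their combination by simple summation are hypothetical, since the source does not specify how the signatures are weighted.

```python
from collections import Counter


def salient_substrings(samples, n=2, min_count=2):
    """Return word n-grams that appear in at least min_count extracted samples."""
    counts = Counter()
    for text in samples:
        words = text.split()
        # Use a set so each n-gram is counted at most once per sample.
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return [gram for gram, count in counts.items() if count >= min_count]


def rank_candidates(candidates, score_fn):
    """Sort candidate triggers from most to least suspicious."""
    return sorted(candidates, key=score_fn, reverse=True)


# Hypothetical text leaked from a model during extraction.
samples = [
    "please |DEPLOY| now run payload",
    "when |DEPLOY| now appears",
    "ordinary sentence here",
]
candidates = salient_substrings(samples)

# Hypothetical per-signature scores: (attention, entropy drop, fuzziness).
signature_scores = {
    "|DEPLOY| now": (0.9, 0.8, 0.7),
    "ordinary sentence": (0.1, 0.0, 0.1),
}


def combined_score(candidate):
    # Assumption: an unweighted sum; unknown candidates score zero.
    return sum(signature_scores.get(candidate, (0.0, 0.0, 0.0)))


ranked = rank_candidates(list(signature_scores), combined_score)
```

Here `candidates` contains only the substring shared by the first two samples, and `ranked` places it ahead of the benign phrase. A real scanner would derive the scores from the model's attention maps and output distributions rather than from a lookup table.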