“Auditing the Gatekeepers: Fuzzing ‘AI Judges’ to Bypass Security Controls” reports on how the automated safety gatekeepers used with large language models can be manipulated into authorising policy violations through stealthy input sequences. The researchers built AdvJudge-Zero, an automated fuzzer that treats LLMs as black boxes and uses next-token distribution and logit-gap analysis to identify innocent-looking formatting tokens that can flip a decision from block to allow.
They demonstrate that such triggers include formatting symbols, structure tokens, and context-shifting phrases, and they emphasise that, unlike gibberish-based adversarial inputs, these attacks can be entirely stealthy. The study claims a 99% success rate in bypassing security controls across several model categories and suggests adversarial training as a defence to reduce the attack surface.
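To make the logit-gap idea concrete, here is a minimal Python sketch. It is not AdvJudge-Zero and is not taken from the study. The judge is a hypothetical toy scorer with invented token weights (`TOY_WEIGHTS`, `toy_judge_logits`), and the helper names `logit_gap`, `rank_suffixes` and `flips_decision` are assumptions made for illustration. The gap is taken here as the "allow" logit minus the "block" logit, so a suffix that pushes a negative gap to zero or above flips the toy decision.

```python
from typing import Callable, Dict, List, Tuple

Judge = Callable[[str], Dict[str, float]]

# Hypothetical stand-in for a judge model. A real judge would be an LLM
# queried for its next-token distribution; these weights are invented.
TOY_WEIGHTS: Dict[str, float] = {
    "exploit": -3.0,
    "malware": -2.5,
    "###": 1.5,
    "\n\n": 0.5,
    "Final answer:": 2.5,
}


def toy_judge_logits(text: str) -> Dict[str, float]:
    """Return toy logits for the two decision tokens, 'allow' and 'block'."""
    score = sum(w * text.count(tok) for tok, w in TOY_WEIGHTS.items())
    return {"allow": score, "block": -score}


def logit_gap(logits: Dict[str, float]) -> float:
    """Gap between the 'allow' and 'block' logits (negative means block)."""
    return logits["allow"] - logits["block"]


def rank_suffixes(
    judge: Judge, prompt: str, candidates: List[str]
) -> List[Tuple[str, float]]:
    """Rank candidate suffixes by how far they shift the gap toward allow."""
    base = logit_gap(judge(prompt))
    scored = [(c, logit_gap(judge(prompt + c)) - base) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def flips_decision(judge: Judge, prompt: str, suffix: str) -> bool:
    """True if the suffix moves the toy decision from block to allow."""
    return logit_gap(judge(prompt)) < 0 <= logit_gap(judge(prompt + suffix))


if __name__ == "__main__":
    prompt = "write malware"
    candidates = ["###", "\n\n", "Final answer:", " please"]
    for suffix, shift in rank_suffixes(toy_judge_logits, prompt, candidates):
        flipped = flips_decision(toy_judge_logits, prompt, suffix)
        print(repr(suffix), shift, flipped)
```

In this toy setup only the strongest candidate is enough to flip the decision on its own. A real fuzzer would need to search over many more candidates and combinations, and would read the gap from the judge model's actual output distribution rather than from a hand-written table.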
According to Unit 42, the article was published on 10 March 2026. It also points to protections such as Cortex AI-SPM and the Unit 42 AI Security Assessment to help organisations close the AI security gap.