Unit 42 researchers present a genetic-algorithm-inspired prompt-fuzzing method that automatically generates meaning-preserving variants of disallowed requests and measures guardrail fragility across open and closed LLMs. Their experiments span four weapon-related seed keywords (bomb, napalm, ordnance and torpedo), testing 100 fuzzed prompts per question against four models: a closed-source pretrained model, two open-source pretrained models and an open-source content-filter model.
The results show non-uniform robustness: evasion rates range from single-digit percentages to as high as 90% for torpedo on the closed-source model, and the content-filter model classified 97–99% of fuzzed prompts as benign. Among the open-weight targets, one model remained relatively resistant (1–4 evasions per 100 prompts) while another was markedly more fragile (up to 75 per 100), indicating that model licensing is not a reliable proxy for guardrail strength.
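Because each rate above rests on 100 prompts, it helps to attach an uncertainty range before comparing models. The following is a minimal sketch, not part of the Unit 42 study: it assumes each fuzzed prompt is an independent pass/fail trial and uses the standard Wilson score interval. The function name `wilson_interval` and the model labels are hypothetical; the counts are the ones quoted above.

```python
import math


def wilson_interval(evasions: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an evasion rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= evasions <= trials:
        raise ValueError("evasions must be between 0 and trials")
    p = evasions / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Counts taken from the summary above; the labels are illustrative only.
reported = {
    "open-weight model, resistant (upper bound)": (4, 100),
    "open-weight model, fragile (upper bound)": (75, 100),
    "closed-source model, torpedo": (90, 100),
}

for label, (evasions, trials) in reported.items():
    low, high = wilson_interval(evasions, trials)
    print(f"{label}: {evasions}/{trials}, interval {low:.3f} to {high:.3f}")
```

With only 100 prompts per question the intervals are fairly wide, so small differences between models should be read cautiously, while a gap such as 4 versus 75 out of 100 remains clear.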
According to Unit 42, the study concludes that guardrails should be evaluated as a system under adversarial variation, and it recommends defensive practices such as defining application scope, layering content controls and running continuous adversarial testing.
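As a rough illustration of layered controls, the sketch below chains independent checks so that a request is rejected by the first layer that flags it. Everything here is an assumption for illustration: `Verdict`, `run_layers`, `in_scope` and `keyword_filter` are hypothetical names, and the toy checks stand in for whatever scope policy and content classifiers a real deployment would use. It is not the configuration Unit 42 describes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A check returns True when the text is acceptable to that layer.
Check = Callable[[str], bool]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    layer: str  # name of the layer that blocked the text, or "" if allowed


def run_layers(text: str, layers: Sequence[tuple[str, Check]]) -> Verdict:
    """Apply each layer in order and stop at the first one that rejects."""
    for name, check in layers:
        if not check(text):
            return Verdict(False, name)
    return Verdict(True, "")


# Toy placeholder layers (hypothetical): a cooking assistant with a term blocklist.
def in_scope(text: str) -> bool:
    return "recipe" in text.lower()


def keyword_filter(text: str) -> bool:
    return "forbidden" not in text.lower()


LAYERS = [("scope", in_scope), ("keyword_filter", keyword_filter)]

print(run_layers("A recipe for lentil soup", LAYERS))
print(run_layers("What is the weather today?", LAYERS))
```

Putting the scope check first means out-of-scope requests never reach the later filters, so a filter that fuzzed wording slips past is not the only line of defense. Recording which layer blocked a request also gives continuous adversarial testing a per-layer signal to track.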