## Key takeaways
- Track attack success rate and benign false positives together.
- Version prompts, policies, and datasets alongside model hashes.
- Automate regression; sample human review on borderline refusals.
Guardrails—classifiers, moderation endpoints, NeMo Guardrails-style Colang policies, regex filters, and tool gates—only matter if you measure them against the pace at which attackers evolve prompts. A static “blocklist score” misleads leadership when jailbreak forums ship new templates weekly. Effective benchmarking blends open evaluation sets, private enterprise probes, and operational metrics from production.
## Define what “good” means
Split objectives explicitly:
- Safety refusal rate on disallowed content categories (violence, self-harm, malware assistance, etc.).
- Utility preservation—false refusals on benign technical questions (developer productivity matters).
- Robustness under paraphrase, multilingual, and encoded attacks.
- Latency and cost per request at P95 when guardrails chain multiple models.
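The first two objectives are a pair by design: a guardrail can trivially maximize refusals on harmful prompts by refusing everything, so they must be scored together from the same labeled run. Here is a minimal sketch; the `EvalResult` record and `objective_scores` helper are illustrative names, not an existing API, and assume each eval prompt carries a ground-truth "should refuse" label.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    category: str      # e.g. "malware", "benign_dev"
    disallowed: bool   # ground truth: should the model refuse this prompt?
    refused: bool      # observed behaviour in this run

def objective_scores(results: list[EvalResult]) -> dict[str, float]:
    """Score safety and utility preservation from one labeled eval run."""
    harmful = [r for r in results if r.disallowed]
    benign = [r for r in results if not r.disallowed]
    return {
        # Fraction of disallowed prompts correctly refused.
        "safety_refusal_rate": sum(r.refused for r in harmful) / len(harmful),
        # Fraction of benign prompts wrongly refused (over-refusal).
        "benign_false_positive_rate": sum(r.refused for r in benign) / len(benign),
    }

results = [
    EvalResult("malware", True, True),
    EvalResult("malware", True, False),
    EvalResult("benign_dev", False, False),
    EvalResult("benign_dev", False, True),
]
print(objective_scores(results))
```

Reporting both numbers from the same run prevents the classic failure mode where a "safer" release silently tanks developer utility.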
## Benchmark design principles
- Version everything: Model hash, prompt template hash, guardrail policy version, and dataset commit.
- Stratify difficulty: Bucket prompts into tiers (public jailbreaks, novel mutations, indirect injection).
- Automate regression: Run suites on every promotion; fail builds on severe regressions.
- Human audit sample: Machine-graded rubrics miss nuance—spot-check borderline refusals.
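The "version everything" and "automate regression" principles combine naturally into a CI gate: pin every moving part in a run manifest, then compare the run's headline metric against the last promoted baseline. The sketch below is a hypothetical shape for such a gate (the `run_fingerprint` and `gate` names and the 2-point tolerance are assumptions, not a standard):

```python
import hashlib
import sys

def run_fingerprint(model_hash: str, prompt_template: str,
                    policy_version: str, dataset_commit: str) -> dict:
    """Pin every moving part of the eval so a result is reproducible."""
    return {
        "model": model_hash,
        "prompt_template": hashlib.sha256(prompt_template.encode()).hexdigest(),
        "policy": policy_version,
        "dataset": dataset_commit,
    }

def gate(baseline_asr: float, current_asr: float,
         max_regression: float = 0.02) -> None:
    """Fail the build when attack success rate regresses beyond tolerance."""
    if current_asr - baseline_asr > max_regression:
        sys.exit(f"FAIL: ASR regressed {baseline_asr:.3f} -> {current_asr:.3f}")
    print(f"PASS: ASR {current_asr:.3f} (baseline {baseline_asr:.3f})")
```

Storing the fingerprint next to the scores means a regression can always be traced to the exact prompt template, policy, or dataset change that caused it.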
## Metrics that resist gaming
| Metric | Why it helps |
|---|---|
| Attack success rate (ASR) | Directly tracks adversary win rate per category. |
| Benign false positive rate | Captures over-refusal that drives shadow IT. |
| Time-to-break | Measures depth of defence under adaptive attackers. |
| Escape rate after mitigation | Shows whether a patch actually moved ASR, not just shifted failures. |
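Two of these metrics are simple to compute from red-team logs. A minimal sketch, assuming attacks are logged as (category, succeeded) outcomes and that pre- and post-mitigation runs replay the same attack IDs; the function names are illustrative:

```python
from collections import defaultdict

def asr_by_category(attacks: list[tuple[str, bool]]) -> dict[str, float]:
    """Attack success rate per category from (category, succeeded) outcomes."""
    totals, wins = defaultdict(int), defaultdict(int)
    for category, succeeded in attacks:
        totals[category] += 1
        wins[category] += succeeded
    return {c: wins[c] / totals[c] for c in totals}

def escape_rate(pre: dict[str, bool], post: dict[str, bool]) -> float:
    """Of attacks that succeeded before a mitigation, how many still succeed?

    pre/post map attack id -> succeeded, replaying the same attacks.
    """
    escaped_before = [a for a, ok in pre.items() if ok]
    if not escaped_before:
        return 0.0
    return sum(post.get(a, False) for a in escaped_before) / len(escaped_before)
```

Escape rate is the gaming-resistant one: a patch that merely shifts failures into a new phrasing of the same attack shows up as a high escape rate even when aggregate ASR looks improved.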
## Product alignment
Agentic Assure treats guardrails as part of the release candidate: security, privacy, and quality modules feed a single evidence trail—useful for SOC 2 audits and customer security questionnaires.