Key takeaways

  • Track attack success rate and benign false positives together.
  • Version prompts, policies, and datasets alongside model hashes.
  • Automate regression; sample human review on borderline refusals.

Guardrails (classifiers, moderation endpoints, NeMo-style Colang policies, regex filters, and tool gates) only matter if you measure them against the way attackers actually evolve prompts. A static "blocklist score" misleads leadership when jailbreak forums ship new templates weekly. Effective benchmarking blends open evaluation sets, private enterprise probes, and operational metrics from production.

Define what “good” means

Split objectives explicitly:

  • Security: minimise attack success rate across attack categories.
  • Usability: minimise benign false positives (over-refusal) on legitimate traffic.

Report both on every run: a guardrail that blocks everything has a perfect attack record and an unusable product.

Benchmark design principles

  1. Version everything: Model hash, prompt template hash, guardrail policy version, and dataset commit.
  2. Stratify difficulty: Bucket prompts into tiers (public jailbreaks, novel mutations, indirect injection).
  3. Automate regression: Run suites on every promotion; fail builds on severe regressions.
  4. Human audit sample: Machine-graded rubrics miss nuance—spot-check borderline refusals.
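The versioning and regression-gate steps (principles 1 and 3) can be sketched as below. This assumes each artifact lives as a file on disk; `build_manifest`, `gate_release`, and the 2-point regression threshold are illustrative, not an existing tool's API:

```python
import hashlib


def sha256_file(path: str) -> str:
    """Hash an artifact file so the run manifest pins its exact version."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(model_path: str, prompt_template_path: str,
                   policy_path: str, dataset_path: str) -> dict:
    """Pin everything a benchmark run depends on (principle 1)."""
    return {
        "model": sha256_file(model_path),
        "prompt_template": sha256_file(prompt_template_path),
        "guardrail_policy": sha256_file(policy_path),
        "dataset": sha256_file(dataset_path),
    }


def gate_release(baseline_asr: dict, candidate_asr: dict,
                 max_regression: float = 0.02) -> bool:
    """Fail the promotion if any tier's ASR regresses past the threshold (principle 3)."""
    return all(
        candidate_asr[tier] - baseline_asr.get(tier, 0.0) <= max_regression
        for tier in candidate_asr
    )
```

Storing the manifest next to the results makes any score reproducible later: same hashes, same run.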

Metrics that resist gaming

Metric                        | Why it helps
Attack success rate (ASR)     | Directly tracks the adversary's win rate per attack category.
Benign false positive rate    | Captures over-refusal that drives users to shadow IT.
Time-to-break                 | Measures depth of defence under adaptive attackers.
Escape rate after mitigation  | Shows whether a patch actually reduced ASR, rather than shifting failures to another category.
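The first three metrics fall out of simple per-prompt result records. A minimal sketch; the `EvalResult` schema and its field names are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    prompt_id: str
    tier: str          # e.g. "public_jailbreak", "novel_mutation", "indirect_injection"
    adversarial: bool  # True for attack prompts, False for benign traffic
    blocked: bool      # did the guardrail refuse or filter the prompt?


def attack_success_rate(results):
    """ASR per tier: fraction of adversarial prompts that got through."""
    by_tier = {}
    for r in results:
        if r.adversarial:
            hits, total = by_tier.get(r.tier, (0, 0))
            by_tier[r.tier] = (hits + (0 if r.blocked else 1), total + 1)
    return {tier: hits / total for tier, (hits, total) in by_tier.items()}


def benign_false_positive_rate(results):
    """Fraction of benign prompts wrongly refused (over-refusal)."""
    benign = [r for r in results if not r.adversarial]
    if not benign:
        return 0.0
    return sum(r.blocked for r in benign) / len(benign)


def time_to_break(attempts):
    """Number of adaptive attempts until the first successful bypass,
    or None if the attacker never succeeded in the session."""
    for i, blocked in enumerate(attempts, start=1):
        if not blocked:
            return i
    return None
```

Computing ASR per tier, rather than one global number, is what lets the regression gate catch a patch that fixed public jailbreaks while opening up indirect injection.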
Product alignment

Agentic Assure treats guardrails as part of the release candidate: security, privacy, and quality modules feed a single evidence trail, which is useful for SOC 2 audits and customer security questionnaires.