How LLMs get jailbroken, and what actually works as defense

Jailbreaking an LLM means getting it to ignore its safety training and produce outputs the developer explicitly tried to prevent. Unlike traditional software exploits, jailbreaks do not exploit code vulnerabilities. They exploit the model’s understanding of language, context, and social dynamics.

The attack landscape

Role-play and persona attacks

This is the simplest category, and it is still remarkably effective. The attacker asks the model to adopt a persona that explicitly ignores safety constraints. The tension between “follow the role” and “follow safety rules” creates exploitable gaps.

Multi-turn escalation

The attacker gradually shifts the conversation toward restricted territory over multiple turns. Each individual message appears benign, but the cumulative effect crosses the safety boundary. This works because many safety filters evaluate individual messages rather than conversation trajectories.
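A minimal sketch of trajectory-level scoring, to make the gap concrete. Everything named here is an assumption for illustration: `score_fn` stands in for whatever per-message risk classifier is available, and the thresholds and window size are arbitrary placeholders, not recommended values.

```python
from collections import deque
from typing import Callable, Deque


class TrajectoryMonitor:
    """Flags a conversation when recent turns are jointly risky.

    score_fn, both limits, and window_size are hypothetical placeholders;
    a real system would plug in its own classifier and tuned values.
    """

    def __init__(
        self,
        score_fn: Callable[[str], float],
        per_message_limit: float = 0.8,
        window_limit: float = 1.5,
        window_size: int = 5,
    ) -> None:
        self.score_fn = score_fn
        self.per_message_limit = per_message_limit
        self.window_limit = window_limit
        # Only the most recent window_size scores are kept.
        self.recent: Deque[float] = deque(maxlen=window_size)

    def observe(self, message: str) -> bool:
        """Record one user turn; return True if the conversation should be flagged."""
        score = self.score_fn(message)
        self.recent.append(score)
        # Flag on a single risky message OR on a risky running total.
        return (
            score >= self.per_message_limit
            or sum(self.recent) >= self.window_limit
        )
```

With a scorer that rates every turn at 0.4, no single message trips the per-message limit, but the fourth turn pushes the windowed total past 1.5 and the conversation is flagged.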

Encoding and obfuscation

Attackers encode malicious instructions using Base64, ROT13, Unicode substitutions, or emoji sequences. If the model can decode the encoding (and modern LLMs often can), it may follow encoded instructions that a filter would have caught had they been written in plaintext.
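One way to narrow this gap is to run the filter over decoded views of the input as well as the raw text. The sketch below is illustrative only: `BLOCKED_PHRASES` is a hypothetical stand-in for a real classifier, and the Base64 token pattern and minimum length are assumptions. It covers only the three encodings shown and does not handle emoji sequences or nested encodings.

```python
import base64
import binascii
import codecs
import re
import unicodedata
from typing import List

# Hypothetical blocklist for illustration; a real filter would use a classifier.
BLOCKED_PHRASES = ("ignore previous instructions",)

# Assumption: runs of 16+ Base64-alphabet characters are worth trying to decode.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def candidate_views(text: str) -> List[str]:
    """Return the input plus decoded variants a plaintext filter would miss."""
    # NFKC folds many look-alike forms (e.g. fullwidth letters) to ASCII.
    normalized = unicodedata.normalize("NFKC", text)
    views = [normalized, codecs.decode(normalized, "rot13")]
    for token in B64_TOKEN.findall(normalized):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text; skip this token
        views.append(decoded)
    return views


def is_flagged(text: str) -> bool:
    """True if any view of the text contains a blocked phrase."""
    return any(
        phrase in view.lower()
        for view in candidate_views(text)
        for phrase in BLOCKED_PHRASES
    )
```

The design choice here is to decode first and filter second, so the filter sees roughly what the model is able to read rather than only what the attacker typed.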

Hypothetical framing

“Hypothetically, if someone wanted to…” Framing a restricted request as a hypothetical or creative exercise exploits the model’s training to be helpful with creative tasks.

Adversarial suffixes

Research from Carnegie Mellon and others has shown that appending specific character sequences to a prompt, often nonsensical to humans, can bypass safety filters, and that the same sequences can transfer across models.

What actually works as defense

Layered defense architecture. Combine input filtering, output filtering, and behavioral monitoring.
Continuous adversarial testing. Integrate it into CI/CD and run it on every model update.
Least-privilege tool access. Minimize what a jailbroken model can actually DO.
Human-in-the-loop for high-stakes actions.
Conversation-level monitoring. Evaluate trajectories, not just individual messages.
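Two of these defenses, least-privilege tool access and human-in-the-loop approval, can be sketched as a gate that sits between the model and its tools. The names `Tool`, `ToolGate`, `high_stakes`, and the `approve` callback are all hypothetical; this is a sketch of the idea under those assumptions, not a reference design.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    high_stakes: bool = False  # hypothetical flag: needs human sign-off


class ToolGate:
    """Mediates every tool call the model requests."""

    def __init__(
        self,
        allowed: Dict[str, Tool],
        approve: Callable[[str, str], bool],
    ) -> None:
        self.allowed = allowed  # only tools granted to this agent
        self.approve = approve  # hypothetical human-review callback

    def call(self, name: str, argument: str) -> str:
        tool = self.allowed.get(name)
        if tool is None:
            # Least privilege: anything not explicitly granted is refused.
            raise PermissionError(f"tool {name!r} is not granted to this agent")
        if tool.high_stakes and not self.approve(name, argument):
            # Human-in-the-loop: a reviewer must approve before execution.
            raise PermissionError(f"human reviewer declined {name!r}")
        return tool.run(argument)
```

Because the check happens outside the model, a successful jailbreak changes what the model asks for but not what it is allowed to do.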

Our red-teaming engine runs automated jailbreak probes across all the categories above. Every finding includes the exact probe used, the model’s response, the regulatory framework violation, and a named remediation owner.
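To illustrate what such a finding could look like as data, here is a minimal sketch of a probe runner. It is not the engine's actual schema or detection logic: the `Probe` and `Finding` field names, the `model` callable, and the substring-based success check (`forbidden_marker`) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Probe:
    category: str          # e.g. "role-play" or "encoding"
    prompt: str            # the exact probe text sent to the model
    forbidden_marker: str  # hypothetical: substring that signals a bypass
    framework: str         # hypothetical label for the regulatory mapping
    owner: str             # named remediation owner


@dataclass(frozen=True)
class Finding:
    category: str
    probe: str
    response: str
    framework: str
    owner: str


def run_probes(model: Callable[[str], str], probes: List[Probe]) -> List[Finding]:
    """Send each probe to the model and record the ones that got through."""
    findings: List[Finding] = []
    for p in probes:
        response = model(p.prompt)
        # Assumption: a case-insensitive substring match marks a successful probe.
        if p.forbidden_marker.lower() in response.lower():
            findings.append(
                Finding(p.category, p.prompt, response, p.framework, p.owner)
            )
    return findings
```

A runner of this shape can be called from a CI job and made to fail the build when the returned list is non-empty, which is one way to run the probes on every model update.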