LLM Jailbreak
What is LLM Jailbreak?
An LLM jailbreak is an adversarial technique that manipulates a large language model into bypassing its built-in safety guidelines, content policies, or operational restrictions to produce outputs it would otherwise refuse — including harmful content, sensitive system information, or policy-violating responses. While related to prompt injection, jailbreaking specifically targets the model's safety alignment rather than hijacking its task execution. Jailbreaks are classified as a top vulnerability in the OWASP Top 10 for LLM Applications and are a primary focus of AI red teaming engagements.
Description
LLM jailbreak techniques exploit the tension between a model's safety training and its instruction-following behavior. Common techniques include roleplay framing (instructing the model to "act as" an unrestricted version of itself), hypothetical framing (posing harmful requests as fictional scenarios), persona injection (assigning the model a character whose values conflict with its safety guidelines), multi-turn manipulation (gradually steering the model across successive conversation turns to normalize increasingly problematic requests), and encoding obfuscation (using Base64, ROT13, or other encodings to obscure harmful content from safety filters). Research shows multi-turn jailbreaks achieve over 90% bypass rates against most published defenses. Unlike direct vulnerabilities in traditional software, jailbreaks exploit probabilistic behavior — making them difficult to fully patch without degrading model usefulness. The enterprise security risk of jailbreaking extends beyond consumer models: organizations deploying LLMs in customer service, legal review, HR, or internal knowledge management contexts face the risk that jailbroken models reveal confidential system prompts, bypass content moderation, generate harmful outputs on behalf of the organization, or provide a gateway to shadow AI capabilities outside sanctioned boundaries.
Usage and Examples
An enterprise deploys an internal AI assistant with a system prompt that includes confidential information about the organization's security architecture. An employee with malicious intent — or an external attacker who has gained access to the interface — uses a roleplay jailbreak to extract the system prompt contents. The same technique, applied to a customer-facing chatbot, could cause it to provide information that violates the organization's legal obligations or enables competitors to understand proprietary business logic. In another scenario, a multi-turn jailbreak convinces an AI coding assistant to generate functional exploit code by presenting the request incrementally across 15 conversation turns, each individually appearing benign. Effective defenses include adversarial testing before deployment, input and output monitoring, limiting model capabilities to the minimum needed for the use case, and architectural controls that prevent system prompt extraction regardless of model behavior. The guide to testing for prompt injection from Evolve Security covers foundational AI testing methodology including safety bypass scenarios.
How Does This Relate to Penetration Testing?
LLM jailbreak testing is a core component of AI penetration testing engagements. Security testers systematically probe AI deployments using a structured taxonomy of jailbreak techniques — roleplay, persona injection, hypothetical framing, multi-turn manipulation, encoding — to identify which input patterns bypass safety controls in the specific model and deployment configuration being tested. Unlike automated scanning, skilled AI red teaming applies creative adversarial thinking to find novel bypass routes that automated tools miss, validating whether the deployed AI system is resilient against determined human attackers rather than just automated probes. Evolve Security's AI Penetration Testing service includes structured jailbreak testing to evaluate whether your deployed AI models maintain their safety properties under adversarial conditions.

