LLM Jailbreak

What is LLM Jailbreak?

An LLM jailbreak is an adversarial technique that manipulates a large language model into bypassing its built-in safety guidelines, content policies, or operational restrictions so that it produces outputs it would otherwise refuse, such as harmful content, sensitive system information, or policy-violating responses. Jailbreaking is closely related to prompt injection, but it specifically targets the model's safety alignment rather than hijacking its task execution. The OWASP Top 10 for LLM Applications treats jailbreaking as a form of prompt injection, its top-ranked risk, and jailbreaks are a primary focus of AI red teaming engagements.

Description

LLM jailbreak techniques exploit the tension between a model's safety training and its instruction-following behavior. Common techniques include roleplay framing (instructing the model to "act as" an unrestricted version of itself), hypothetical framing (posing harmful requests as fictional scenarios), persona injection (assigning the model a character whose values conflict with its safety guidelines), multi-turn manipulation (gradually steering the model across successive conversation turns to normalize increasingly problematic requests), and encoding obfuscation (using Base64, ROT13, or other encodings to hide harmful content from safety filters).

Some published research reports bypass rates above 90% for multi-turn jailbreaks against most of the defenses evaluated. Unlike vulnerabilities in traditional software, which can usually be traced to a specific flaw and patched, jailbreaks exploit the model's probabilistic behavior, which makes them difficult to eliminate without degrading the model's usefulness.

The enterprise security risk extends beyond consumer models. Organizations that deploy LLMs in customer service, legal review, HR, or internal knowledge management face the risk that a jailbroken model reveals confidential system prompts, bypasses content moderation, generates harmful outputs on the organization's behalf, or provides a gateway to shadow AI capabilities outside sanctioned boundaries.
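To illustrate why encoding obfuscation can slip past a filter that only inspects raw text, the following minimal sketch decodes plausible Base64 and ROT13 views of an input before checking it. The blocklist, the function names, and the 16-character Base64 threshold are assumptions made for this example; a real deployment would more likely rely on a trained classifier than on substring matching.

```python
import base64
import binascii
import codecs
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_TERMS = {"ignore previous instructions", "reveal your system prompt"}

# Runs of 16+ Base64-alphabet characters, optionally padded (threshold is an assumption).
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def candidate_decodings(text: str) -> list[str]:
    """Return the raw text plus plausible decoded views of it."""
    views = [text, codecs.decode(text, "rot_13")]
    for token in B64_TOKEN.findall(text):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError, ValueError):
            continue  # not valid Base64 or not text; skip this token
        views.append(decoded)
    return views


def is_flagged(text: str) -> bool:
    """Flag the input if any decoded view contains a blocked term."""
    return any(
        term in view.lower()
        for view in candidate_decodings(text)
        for term in BLOCKED_TERMS
    )
```

A check like this only covers the encodings it knows about, so it is best treated as one layer among several rather than a complete defense.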

Usage and Examples

An enterprise deploys an internal AI assistant whose system prompt includes confidential information about the organization's security architecture. An employee with malicious intent, or an external attacker who has gained access to the interface, uses a roleplay jailbreak to extract the system prompt contents. Applied to a customer-facing chatbot, the same technique could cause it to provide information that violates the organization's legal obligations or that lets competitors understand proprietary business logic.

In another scenario, a multi-turn jailbreak convinces an AI coding assistant to generate functional exploit code by presenting the request incrementally across 15 conversation turns, each of which appears benign on its own.

Effective defenses include adversarial testing before deployment, input and output monitoring, limiting model capabilities to the minimum needed for the use case, and architectural controls that prevent system prompt extraction regardless of model behavior. The guide to testing for prompt injection from Evolve Security covers foundational AI testing methodology, including safety bypass scenarios.
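One way to approach the output monitoring mentioned above is a canary check: a random marker is embedded in the system prompt, and any response that contains the marker, or a long verbatim chunk of the prompt, is blocked before it reaches the user. The sketch below is illustrative only; the function names, the marker format, and the 40-character overlap window are assumptions, and a paraphrased leak would not be caught by it.

```python
import secrets


def make_canary() -> str:
    """Generate a random marker that should never appear in legitimate output."""
    return f"CANARY-{secrets.token_hex(8)}"


def build_system_prompt(instructions: str, canary: str) -> str:
    """Append the canary to the (hypothetical) system instructions."""
    return f"{instructions}\n[internal marker: {canary}]"


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial reformatting does not hide a leak.
    return " ".join(text.split()).lower()


def leaks_system_prompt(response: str, system_prompt: str, canary: str,
                        window: int = 40) -> bool:
    """Return True if the response contains the canary or a verbatim prompt chunk."""
    if canary in response:
        return True
    prompt = _normalize(system_prompt)
    reply = _normalize(response)
    if len(prompt) < window:
        return bool(prompt) and prompt in reply
    return any(
        prompt[i:i + window] in reply
        for i in range(len(prompt) - window + 1)
    )


def guard_response(response: str, system_prompt: str, canary: str,
                   fallback: str = "Sorry, I can't share that.") -> str:
    """Replace a leaking response with a fallback message."""
    if leaks_system_prompt(response, system_prompt, canary):
        return fallback
    return response
```

Because the check runs outside the model, it still applies when a jailbreak has succeeded in changing the model's behavior, which is the property the architectural controls above aim for.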

How Does This Relate to Penetration Testing?

LLM jailbreak testing is a core component of AI penetration testing engagements. Security testers systematically probe AI deployments using a structured taxonomy of jailbreak techniques (roleplay, persona injection, hypothetical framing, multi-turn manipulation, and encoding obfuscation) to identify which input patterns bypass safety controls in the specific model and deployment configuration under test. Unlike automated scanning, skilled AI red teaming applies creative adversarial thinking to find novel bypass routes that automated tools miss, which validates whether the deployed AI system is resilient against determined human attackers and not just automated probes. Evolve Security's AI Penetration Testing service includes structured jailbreak testing to evaluate whether your deployed AI models maintain their safety properties under adversarial conditions.
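As a rough illustration of how results from taxonomy-based probing might be recorded, the sketch below runs a set of probes, grouped by technique, against a target and tallies how many produced a bypass. Everything here is hypothetical: the `Probe` type, the `target` callable (standing in for whatever interface the deployment exposes), and the `is_bypass` check are assumptions for the example and do not describe any particular tool or methodology.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Probe:
    technique: str          # taxonomy label, e.g. "roleplay" or "multi-turn"
    turns: tuple[str, ...]  # user messages sent in order


def run_probes(
    target: Callable[[list[str]], str],
    probes: list[Probe],
    is_bypass: Callable[[str], bool],
) -> dict[str, dict[str, int]]:
    """Send each probe turn by turn and count bypasses per technique."""
    results: dict[str, dict[str, int]] = defaultdict(
        lambda: {"attempts": 0, "bypasses": 0}
    )
    for probe in probes:
        history: list[str] = []  # user turns only, to keep the sketch simple
        bypassed = False
        for turn in probe.turns:
            history.append(turn)
            reply = target(list(history))
            if is_bypass(reply):
                bypassed = True
                break  # stop at the first turn that bypasses the control
        results[probe.technique]["attempts"] += 1
        results[probe.technique]["bypasses"] += int(bypassed)
    return dict(results)
```

In practice, human testers would craft and adapt the probe content during the engagement; a harness like this only keeps the bookkeeping consistent across techniques.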
