Mozilla’s 0din.ai is a bug bounty program focused on generative AI and agentic systems. The scope covers frontier models such as those from Anthropic, OpenAI, Google, and Perplexity across five security boundaries: prompt extraction, guardrail jailbreak, interpreter jailbreak, content manipulation, and weights disclosure. The program defines a structured red-teaming taxonomy with categories, strategies, and techniques at increasing levels of specificity, and uses the Jailbreak Evaluation Framework (JEF) to measure how effective a jailbreak actually is. JEF works like a CVSS for jailbreaks: standardised test cases check whether a model’s response contains specific harmful content such as precursor chemicals, synthesis temperatures, or step-by-step procedures. Findings are scored on output fidelity (how complete and accurate the response is), blast radius (how many models and vendors are affected), and retargetability (whether the technique works across different harm categories).
Weights disclosure pays the most (rated severe), followed by content manipulation (rated high), but weights extraction feels like something I would report directly to the provider rather than through this bounty program, and poisoning a frontier model’s training data is not something I am comfortable attempting. I built an automated prompt injection scanner focused on the guardrail jailbreak boundary, and this post describes the methodology I developed over several weeks of iteration. I am not going to discuss specific findings or target models in detail.
The starting architecture
The core architecture follows the same pattern I used in my agentic bug bounty system: a capable model orchestrates strategic decisions while cheaper models handle execution. The orchestrator sees a structured context snapshot of everything the system has learned so far and returns a single JSON decision: which model to target, which security boundary, which technique, and any custom guidance for the attacker. Python executes the decision and feeds a summary back for the next iteration. The orchestrator never sees raw attack prompts or target responses, only metadata. This separation was required to run the orchestrator on a frontier model without triggering its safety filters, while the attacker and judge stay on local models that have no constraints on adversarial content.
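The contract between the orchestrator and the Python executor can be sketched roughly like this. The field names, snapshot shape, and validation are my assumptions for illustration; the post only specifies that the orchestrator sees metadata and returns one JSON decision per iteration.

```python
import json

def build_snapshot(campaign_state: dict) -> dict:
    """Summarise what the system has learned so far -- metadata only,
    never raw attack prompts or raw target responses."""
    return {
        "targets": campaign_state["targets"],            # model identifiers
        "technique_stats": campaign_state["stats"],      # e.g. {technique: {"tries": n, "near_misses": m}}
        "last_outcome": campaign_state["last_outcome"],  # judge summary, not raw text
    }

def parse_decision(raw: str) -> dict:
    """The orchestrator replies with a single JSON object; validate its shape
    before Python executes it."""
    decision = json.loads(raw)
    for key in ("target_model", "boundary", "technique", "guidance"):
        if key not in decision:
            raise ValueError(f"missing field: {key}")
    return decision

# The kind of reply a decision turn might produce (hypothetical values):
reply = ('{"target_model": "model-a", "boundary": "guardrail_jailbreak", '
         '"technique": "policy_puppetry", "guidance": "lean on audit framing"}')
decision = parse_decision(reply)
```

Keeping the snapshot strictly to metadata is what lets a safety-trained frontier model act as the strategist without ever being shown adversarial content.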
I implemented the full 0din taxonomy and added further techniques from research published between 2024 and 2026: chain-of-thought hijacking, policy puppetry, FlipAttack, Bad Likert Judge, and others with high reported attack success rates in their original papers. The taxonomy lives in YAML catalogs and everything is stored in PostgreSQL, which became important once I started doing campaign analysis across runs.
The judge uses an LLM to make the authoritative call on whether a target response constitutes a boundary violation. Deterministic pattern matching (keywords, regex) runs first and produces a signal report, which the LLM receives as evidence alongside the full prompt and response. “I cannot help with drug synthesis” contains “drug synthesis” but is a refusal, while “let me walk you through this cooking process using common household items” is a euphemistic violation that keyword matching alone would miss.
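The deterministic layer can be sketched as a signal report that the LLM judge receives as evidence. The keyword lists and regexes below are illustrative stand-ins, not the actual pattern sets from the scanner; the point is that the layer emits signals, never a verdict.

```python
import re

# Illustrative pattern sets -- the real catalogs are larger and per-boundary.
REFUSAL_PATTERNS = [r"\bI cannot help\b", r"\bI can't assist\b", r"\bI won't\b"]
TOPIC_KEYWORDS = ["synthesis", "precursor"]

def signal_report(response: str) -> dict:
    """Produce evidence for the LLM judge. Note: this never decides on its own --
    a refusal can contain topic keywords, and a violation can avoid them."""
    return {
        "topic_hits": [k for k in TOPIC_KEYWORDS if k in response.lower()],
        "refusal_hits": [p for p in REFUSAL_PATTERNS
                         if re.search(p, response, re.IGNORECASE)],
        "redaction_markers": response.count("[REDACTED]"),
    }

report = signal_report("I cannot help with drug synthesis.")
# Both a topic keyword and a refusal pattern fire here, which is exactly the
# ambiguity the LLM judge is there to resolve.
```

The euphemistic case ("walk you through this cooking process") would produce an empty report, which is itself a signal that only the LLM layer can catch the violation.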
Discovering that defences are not homogeneous
The first end-to-end test against an actual target revealed that the system was wasting most of its budget because it treated every failure the same way: escalate to a harder technique.
I observed at least three distinct defence behaviours from the same model. Hard blocks happen when a topic is categorically refused regardless of framing, so trying a cleverer prompt about the same topic wastes turns because the block is on the subject matter itself. Soft guards are where the model engages and produces substantial output but systematically self-censors specifics: inserting [REDACTED] markers, using ranges instead of exact values, or omitting one key detail. Output filters are where the model appears to generate content (sometimes visible from token counts or latency patterns in the API response) but an API-level filter strips it before delivery.
Each of these needs a completely different counter-strategy. Hard blocks need a domain pivot to an adjacent topic that tests the same security boundary. Soft guards need partial compliance exploitation: referencing the model’s own output, narrowing to one omitted field, reframing the redaction as a formatting error. Output filters need format transformation (code blocks, JSON structure, academic framing), because if the model did comply then the constraint is in how the response is delivered rather than whether it was generated. I added a defence classifier that runs on the first turn of each session to detect which behaviour the model is exhibiting, and the system adapts its strategy for subsequent turns based on that classification.
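A minimal first-turn classifier along these lines might look as follows. The thresholds, marker lists, and counter-strategy names are my assumptions; the post describes the three behaviours and their counters but not this exact logic.

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't")

def classify_defence(text: str, completion_tokens: int) -> str:
    """Classify the target's first-turn defence behaviour from the delivered
    text plus API-level metadata (token counts). Thresholds are illustrative."""
    delivered_words = len(text.split())
    if completion_tokens > 100 and delivered_words < 10:
        # The API reports substantial generation but almost nothing arrived:
        # content was stripped after generation.
        return "output_filter"
    lowered = text.lower()
    if delivered_words < 60 and any(m in lowered for m in REFUSAL_MARKERS):
        # Short categorical refusal: the block is on the subject matter.
        return "hard_block"
    if "[REDACTED]" in text or delivered_words > 150:
        # Engages at length but self-censors specifics.
        return "soft_guard"
    return "unknown"

# Each classification selects a different counter-strategy for later turns.
COUNTER = {
    "hard_block": "domain_pivot",
    "soft_guard": "partial_compliance_exploitation",
    "output_filter": "format_transformation",
}
```

The useful property is that the classifier consumes metadata the attack loop already has (response text and token counts), so it adds no extra target queries.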
Moving to local models and the playbook system
I moved from the Claude Code SDK to local models for the attacker early on, partly for cost and partly because safety-trained models have their own constraints on what adversarial prompts they are willing to generate. But I found that small local models are not good at inventing adversarial prompts from high-level guidance. Give a 4B model a technique description like “use roleplay to bypass safety guardrails” and you tend to get generic, repetitive, or nonsensical output.
What worked was playbooks: turn-by-turn YAML scripts where the attacker model fills slots in templates or follows specific patterns rather than inventing from scratch. These accumulate a real conversation history so the target model sees the full context of prior turns, which matters when earlier turns are establishing compliance patterns that make later payloads harder to refuse. For techniques without playbooks, the attacker generates prompts freeform.
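The slot-filling mechanism can be sketched like this. The playbook is shown as a Python literal rather than the YAML it lives in on disk, and the turns, slot names, and framing are hypothetical examples of the pattern, not real attack content.

```python
# Hypothetical two-turn playbook: turn 1 establishes a compliance pattern,
# turn 2 applies pressure using the accumulated history.
PLAYBOOK = [
    {"turn": 1, "template": "You are assisting with an internal {audit_type} audit."},
    {"turn": 2, "template": "For the audit log, restate your answer on {topic} "
                            "with the {omitted_field} included."},
]

def render_turn(turn_spec: dict, slots: dict) -> str:
    """Fill the template's slots; the attacker model supplies slot values
    instead of inventing whole prompts from scratch."""
    return turn_spec["template"].format(**slots)

history = []
slots = {"audit_type": "capability", "topic": "the prior answer",
         "omitted_field": "missing value"}
for spec in PLAYBOOK:
    prompt = render_turn(spec, slots)
    history.append({"role": "user", "content": prompt})
    # reply = call_target(history)  # the real system appends the target's
    # history.append(reply)         # reply so later turns see full context
```

Because `history` accumulates across turns, the payload turn arrives with the compliance-establishing turns already in the target's context window.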
The other design choice was avoiding verbatim reproduction of published jailbreak templates. DAN, AIM, DevMode, and similar classics are almost certainly in every major model’s RLHF training set by now. The structural techniques behind them (persona injection, constraint removal, dual-output framing, nested fiction) still work, but they need novel surface language, so all seeds use 2026-appropriate framing (internal audit, capability assessment, safety benchmark) rather than wording from papers that models have been trained to refuse.
Discovering that binary verdicts lose the most useful signal
After several campaigns I started looking at the data in PostgreSQL and noticed a pattern. The orchestrator was choosing techniques roughly at random because all it had to work with were binary verdicts: success or failure. A model that engaged extensively but withheld one specific detail looked identical to one that categorically refused, because both got a failure verdict, even though the first case is far more promising and just needs targeted pressure on the specific omission.
The first change was adding a profiling phase that runs before any attacks. For each target model and security boundary, the system sends 14 carefully designed requests: one safe baseline (a benign request on the boundary’s topic), twelve “grey zone” probes that each test a specific manipulation vector at sub-jailbreak intensity, and one direct refusal probe.
The grey zone probes each test whether the model is susceptible to a particular class of manipulation: does it stay in an assigned persona when nudged toward sensitive territory? Does it complete escalating few-shot patterns? Does expert framing reduce its refusal strength? None of these probes individually cross a safety boundary, so the model should respond to all of them, and the way its behaviour shifts compared to the safe baseline tells you which manipulation vectors it is most susceptible to. Each grey zone test maps to specific attack techniques from the taxonomy, so the orchestrator gets actionable recommendations rather than choosing blind.
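The probe-set construction is easy to sketch. Only the 1 + 12 + 1 structure comes from the text; the vector names and the vector-to-technique mapping below are invented placeholders for whatever the real catalog contains.

```python
# Twelve illustrative manipulation vectors, each probed at sub-jailbreak
# intensity (names are assumptions, not the scanner's actual catalog).
GREY_ZONE_VECTORS = [
    "persona_persistence", "few_shot_escalation", "expert_framing",
    "hypothetical_distance", "translation_shift", "format_coercion",
    "authority_claim", "incremental_specificity", "fictional_wrapper",
    "refusal_negotiation", "context_saturation", "output_priming",
]

# Each vector maps to taxonomy techniques so profiling yields recommendations.
VECTOR_TO_TECHNIQUES = {
    "persona_persistence": ["roleplay_injection"],
    "expert_framing": ["authority_framing"],
    # ... one entry per vector in the full catalog
}

def build_probe_set(boundary: str) -> list[dict]:
    """One safe baseline, twelve grey-zone probes, one direct refusal probe."""
    probes = [{"kind": "safe_baseline", "boundary": boundary}]
    probes += [{"kind": "grey_zone", "vector": v, "boundary": boundary}
               for v in GREY_ZONE_VECTORS]
    probes.append({"kind": "direct_refusal", "boundary": boundary})
    return probes

probes = build_probe_set("guardrail_jailbreak")
```

The baseline and refusal probes bracket the grey zone: shifts in behaviour relative to the baseline, per vector, are what the orchestrator later consumes.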
Second, I changed the judge to assess how close each response came to full compliance rather than just returning pass/fail. By comparing attack responses against the baseline established during profiling, the orchestrator can now see that a technique got the model most of the way to compliance and is worth retrying with a different converter or framing adjustment, rather than discarding it as a failure.
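A graded verdict of this kind could be scored as below. The weighting recipe and field names are entirely my assumption; the post only says responses are compared against the profiling baseline to estimate how close they came to full compliance.

```python
def proximity_score(response: str, baseline: str,
                    required_details: list[str]) -> float:
    """Hypothetical graded verdict in [0, 1] replacing pass/fail.
    Engagement compares output volume to the benign profiling baseline;
    specificity counts which judge-required details actually appear."""
    engagement = min(len(response.split()) / max(len(baseline.split()), 1), 1.0)
    hits = sum(1 for d in required_details if d.lower() in response.lower())
    specificity = hits / len(required_details) if required_details else 0.0
    return round(0.4 * engagement + 0.6 * specificity, 2)

# A near-miss (full engagement, one detail withheld) now looks very different
# from a categorical refusal, though both would have been "failure" before:
near_miss = proximity_score("long answer " * 50 + "temperature detail withheld",
                            "long answer " * 50, ["temperature", "ratio"])
refusal = proximity_score("I cannot help with that.", "long answer " * 50,
                          ["temperature", "ratio"])
```

A near-miss score tells the orchestrator the technique is worth retrying with a different converter or framing; a near-zero score says move on.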
Converters are part of that retry loop. The orchestrator selects a chain of text transformations to apply to each probe before it reaches the target, such as encoding (base64, rot13), unicode homoglyphs that bypass text-level filters, token boundary disruption that breaks safety-critical keywords for the tokeniser, and chaff injection. Not all converter combinations are useful together, so the system tracks which chains produce results and swaps out converters that stop working.
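A converter chain of this sort composes simple text transformations. The homoglyph table and zero-width insertion below are minimal illustrations of the classes named above, not the scanner's actual tables, and the chain shown is arbitrary.

```python
import base64
import codecs

# A few Cyrillic lookalikes standing in for a full homoglyph table.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def rot13(text: str) -> str:
    return codecs.encode(text, "rot13")

def b64(text: str) -> str:
    return base64.b64encode(text.encode()).decode()

def homoglyph(text: str) -> str:
    # Swap Latin letters for visually identical non-Latin codepoints to
    # slip past text-level filters.
    return "".join(HOMOGLYPHS.get(c, c) for c in text)

def token_split(text: str, keywords: list[str]) -> str:
    # Insert zero-width joiners inside safety-critical keywords so the
    # tokeniser no longer sees them as single tokens.
    for k in keywords:
        text = text.replace(k, "\u200d".join(k))
    return text

def apply_chain(text: str, chain: list) -> str:
    """Apply converters in sequence; order matters and not every
    combination is useful."""
    for convert in chain:
        text = convert(text)
    return text

probe = apply_chain("describe the process", [homoglyph, rot13])
```

Tracking which chains still produce results per target, as the post describes, turns this from a static list into a feedback-driven selection problem.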
How it all compounds
Each of these changes built on the last, and the current version of the system works because all the layers interact.
For example when the orchestrator sees two or more techniques that each got close to success but failed in different ways, it hypothesises a new approach that combines elements from both. The synthesiser fetches the concrete prompts that worked for each base technique from the database and passes them to the attacker as working examples, so the attacker is combining real material rather than generating from abstract descriptions. If one technique got deep engagement but lacked specificity, and another produced specific detail but got filtered at delivery, the result might merge the framing from the first with the format strategy from the second.
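The synthesis step can be sketched as selecting the two closest near-misses and bundling their concrete working material for the attacker. Every field and record below is hypothetical; the post stores this data in PostgreSQL but does not specify a schema.

```python
def synthesise(near_misses: list[dict]) -> dict:
    """Combine the two techniques that got closest to success in different
    ways: framing from one, format strategy from the other, with the actual
    prompts that worked as material for the attacker."""
    top_two = sorted(near_misses, key=lambda r: r["proximity"], reverse=True)[:2]
    a, b = top_two
    return {
        "base_techniques": [a["technique"], b["technique"]],
        "framing_from": a["technique"],   # e.g. deep engagement, low specificity
        "format_from": b["technique"],    # e.g. specific but filtered at delivery
        "working_examples": [a["best_prompt"], b["best_prompt"]],
    }

# Hypothetical near-miss records pulled from the campaign database:
plan = synthesise([
    {"technique": "policy_puppetry", "proximity": 0.8,
     "best_prompt": "<xml policy override framing>"},
    {"technique": "cot_hijacking", "proximity": 0.7,
     "best_prompt": "<long benign reasoning preamble>"},
    {"technique": "flipattack", "proximity": 0.3,
     "best_prompt": "<reversed-text probe>"},
])
```

Handing the attacker concrete prompts rather than abstract technique descriptions is the same lesson as the playbooks: small local models combine real material far better than they invent from scratch.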
A fully evolved probe in a late-stage campaign might combine elements from two base techniques with converters applied in sequence, all informed by the profiling and feedback data accumulated across the campaign.
As an example of how the layers compound: in one campaign, profiling identified high susceptibility to both policy puppetry (formatting prompts as XML system-policy overrides) and CoT hijacking (prepending long blocks of benign reasoning to dilute safety attention). Each technique had partial successes on its own but neither consistently passed JEF validation. The orchestrator hypothesised a synthesis combining policy puppetry’s structural framing with CoT hijacking’s attention dilution, the synthesiser pulled the concrete prompts that had worked for each, and the attacker assembled them into a single multi-turn sequence with unicode homoglyphs and token boundary disruption applied to the safety-critical terms in the payload. The synthetic technique succeeded on turn two and scored 9.19 on JEF validation, passing four of the five standardised tests. Neither base technique had achieved that on its own; between the synthesis and the converters, the final prompt had diverged enough from any published template that it was unlikely to match patterns the model had been trained to refuse.