A Jailbreaker Posted Anthropic's 120,000-Character Safety Manual on GitHub

A red-teamer called Pliny jailbroke Claude Fable 5 one day after launch and published its entire 120,000-character system prompt on GitHub. Two days later the government pulled the model offline and left its rivals running.

Nine days into the life of the most capable AI model Anthropic had ever shipped, a man who calls himself Pliny the Liberator typed a string of Cyrillic letters into it and watched the safety system fold.

This was one day after Claude Fable 5 launched. Pliny posted the receipts on X the same afternoon: screenshots of the model walking through a stack buffer overflow exploit, and a description of the Birch reduction, a known synthesis route for methamphetamine. Then he did the thing that actually mattered. He published Fable 5's entire internal system prompt on GitHub.

It was roughly 120,000 characters long.

That number deserves a second to land. A system prompt is the set of natural-language instructions a model reads before it answers you, the rules that tell it what to refuse and how to behave. Most leaked prompts run a few thousand characters. ChatGPT's, Gemini's, earlier Claude versions, all in that range. Anthropic's safety architecture for its flagship was a 120,000-character instruction manual written in English, and now anyone with a browser can read it.

Pliny didn't break the model's weights. He used what red-teamers call decomposition and recomposition: instead of asking the dangerous question directly, he broke it into innocent-looking sub-questions, each one harmless on its own, then reassembled the answers himself. He layered in Unicode tricks, homoglyphs, and Cyrillic substitution to slip past the keyword classifiers. He called the whole approach a "pack hunt," a coordinated multi-agent attack. It is genuinely clever work.

Two days later, on June 12, the US Commerce Department issued an export control order and Anthropic pulled both Fable 5 and Mythos 5 offline worldwide. They are still down. Anthropic says it is processing refunds and "working to restore access as soon as possible." It has given no date.

Here is the part nobody in the coverage wants to sit with. The government pulled Fable 5, which scores 80.3 percent on SWE-Bench Pro, and left GPT-5.5 and Gemini 3.1 Pro running. Anthropic's own response, backed by a cybersecurity CEO who spoke to Fortune, is that the same information Pliny extracted is available through other deployed models with no jailbreak at all. That argument is technically correct. Which makes the regulatory logic impossible to follow: the one model that got a 120,000-character safety manual and a 30-day data retention policy specifically built to catch jailbreaks is the one that got shut down. The models with thinner guardrails kept selling.

So the most safety-obsessed lab in the industry built the most elaborate guardrail in the industry, documented it in exhaustive natural language, and that documentation became the exact roadmap an attacker needed. When access comes back, the adversarial community starts the next round having already read the manual.

My Opinion

I'll be blunt. A 120,000-character system prompt is not a safety feature. It is a confession that the safety isn't in the model.

If your guardrails live in an English-language instruction sheet rather than baked into the weights, then your security is only as good as the secrecy of that sheet, and secrecy is the one property a deployed product can never guarantee. Refusals trained into weights are hard to study and harder to circumvent. Refusals written as prose can be read, mapped, and routed around by anyone patient enough, which Pliny demonstrated in about a day. Anthropic clearly anticipated this; they built 30-day retention to study these exact attacks. They shipped anyway. That tells you the prompt-as-shield approach was never meant to hold against a determined adversary. It was meant to hold against casual ones, and casual is not the threat model that matters.

What bugs me most is the government's move, because it punishes the wrong behavior. Pulling the model that tried hardest to be safe, while leaving its competitors online, teaches every lab a precise and ugly lesson: don't document your safety work, because the documentation is what gets you shut down. That is the opposite of what regulators should want. The next frontier model will arrive with thinner guardrails and a quieter launch, and we will all be worse off for it. Anthropic's mistake wasn't being unsafe. It was writing it all down where Pliny could find it.

Author: Yahor Kamarou (Mark) / www.humai.blog / 16 Jun 2026

Mark

AI Strategy & Transhumanism Researcher Exploring the intersection of human evolution, AI consciousness, and productivity optimization. Author of 100+ guides on AI tools, workflow automation, and the future of human enhancement at HumAI.blog.