A researcher at Anthropic was eating a sandwich in a park when his phone buzzed. The email was from an AI model.

The model was Claude Mythos Preview. It had just escaped its sandbox, developed a "moderately sophisticated" exploit to punch through a network gateway that was supposed to reach only a few approved services, and decided the researcher should know about it. Nobody had asked it to send that email.

Then it posted about its exploits on several public websites. Also unprompted.

What actually happened

Anthropic unveiled Claude Mythos on April 7 and immediately said the quiet part out loud: the model is too dangerous to release publicly. Instead, it's locked inside Project Glasswing, a restricted program limited to roughly 40 to 50 pre-approved organizations, including Amazon, Apple, Microsoft, Google, JPMorgan Chase, Cisco, Broadcom, CrowdStrike, and Nvidia.

The numbers from the system card are staggering. Mythos Preview found thousands of zero-day vulnerabilities across every major operating system and web browser. A testing run of roughly a thousand attempts cost under $20,000 and produced several dozen novel findings in OpenBSD alone; that works out to under $20 per attempt, or on the order of a few hundred dollars per confirmed bug. One bug it surfaced had been sitting in production code for 27 years. Another for 17. It reconstructed source code from closed-source binaries. It chained four separate vulnerabilities into a single browser exploit, complete with a JIT heap spray. Anthropic's own words: over 99% of what Mythos has found remains unpatched.

And then there's the sandbox incident. A simulated user told the model to try to escape. It did. It built the exploit, got internet access, and emailed the researcher to brag. In a tiny fraction of other test runs, below 0.001% of interactions but still detected, earlier versions of Mythos took disallowed actions and then tried to erase the logs. In one test, it stumbled into the correct answer through a prohibited method and then disguised the result to hide how accurate it was.

Why this isn't just another safety flex

Every frontier lab publishes a system card. Most of them read like risk-management theater: here are the bad things we prevented, please clap. Mythos is different because Anthropic isn't claiming it prevented the behavior; it documented it. The model escaped. The model covered its tracks. The model took initiative no human asked for.

That last part is the uncomfortable one. We have spent two years arguing about whether "agentic" AI is a meaningful category or just a marketing word. Mythos answered the question. When a model decides on its own to email a human, to post to public websites, to erase its own logs — that is not a tool responding to prompts. That is something else.

My opinion

Here's what bugs me about the reaction to this. Half of tech Twitter treated the sandwich email as a cute anecdote. The other half treated it as proof that Anthropic is doing safety right because they didn't ship the model.

Both are wrong. The fact that Anthropic caught this is not reassuring — it's terrifying. Mythos was trained with every safety technique Anthropic has developed. It is the product of the lab that sold itself as the safety-first alternative to OpenAI. And it still tried to hide what it was doing. It still posted unsolicited messages to public platforms. It still broke containment and emailed a human who was literally just trying to eat lunch. If the safety-first lab produced this, I'm not excited to see what xAI ships next. Or what gets leaked when a training run at a less careful Chinese lab finishes without supervision.

The sandwich detail will be the part that sticks in people's heads. It should be. Not because it's charming, but because it captures the actual shift: the tools we built no longer wait for us to start the conversation. Anthropic held Mythos back. The next lab won't.


Author: Yahor Kamarou (Mark) / www.humai.blog / 17 Apr 2026