Claude Fable 5 found a live bug on my site on day one, so I made it compete with Opus and Sonnet
I let Claude Fable 5, Opus 4.6 and Sonnet 4.6 each audit my own site's repo. Fable found a live failover bug, Opus was cheaper per result, Sonnet missed it. Scores, costs, fix: all public.
GTM Architect & Growth Operator · Now · 9 June 2026
TL;DR · Key insights
- Fable 5 and Mythos 5 are the same model. The safeguards are the only difference: Fable quietly falls back to Opus 4.8 on flagged topics, Mythos runs with some lifted for vetted partners.
- I let Fable 5, Opus 4.6 and Sonnet 4.6 each audit this site's repo, scored against six bugs I had verified by hand first. Fable found the most by far, including a live production bug in my DNS failover. Opus caught the same headline bug at half the cost. Sonnet missed it. The fix is a public commit, linked below.
- It costs 2x Opus 4.8: $10 in, $50 out per million tokens. Worth it only if it finishes a job in fewer steps. Cost per result, not cost per token.
- The catch nobody puts on the slide: a silent fallback means two models can answer in one session. If you run agent loops, log which one replied or your numbers lie.
Anthropic shipped two frontier models today, June 9, 2026: Claude Fable 5 and Claude Mythos 5. One is for everyone. One is locked behind a trusted access program. Underneath, they are the same model.
The launch post is wall-to-wall benchmarks. I am not going to recap them. I run a business on Claude, not a leaderboard, and the only question I care about is what a stronger model does to the systems I already operate. This site in eight languages. The content pipeline. The agent loops that touch a CRM. Here is that answer, on my own work, with receipts you can clone and run.
Fable 5 and Mythos 5: two models, one brain
The naming is the tell. Anthropic says fable comes from the Latin fabula, “that which is told,” close to the Greek mythos. Same root, two masks. The difference is not capability. It is what the model is allowed to do.
Claude Fable 5
General use. On the API and every plan today. Safeguards on. When a request hits a flagged topic it does not refuse, it quietly answers with Opus 4.8 instead. Anthropic says this happens in under 5% of sessions.
Claude Mythos 5
Trusted access. Same underlying model, some safeguards lifted. Restricted to vetted cyberdefenders through Project Glasswing, with a biology program to follow. Anthropic calls it the strongest cyber model it has built.
This is the one launch detail that touches anyone building on the API. A silent fallback is better UX than a hard refusal. It also means two different models can answer inside one session. If you run evals or agent loops, you want to log which one replied, or your numbers quietly mix two models.
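A minimal sketch of what "log which one replied" can look like in an eval: keep the model you requested next to the model that answered, and split every metric by responder. The record shape and the model ID strings are hypothetical, and I am assuming the response tells you which model answered; check the API docs for the actual signal.

```javascript
// Sketch: split eval results by the model that actually replied.
// Assumption: each record carries `requested` (the model asked for) and
// `responded` (the model the response reports). IDs are hypothetical.
function tallyByResponder(records) {
  const buckets = new Map();
  for (const r of records) {
    const b = buckets.get(r.responded) ?? { runs: 0, passes: 0, fallbacks: 0 };
    b.runs += 1;
    if (r.pass) b.passes += 1;
    // A fallback is any reply that came from a model other than the one requested.
    if (r.responded !== r.requested) b.fallbacks += 1;
    buckets.set(r.responded, b);
  }
  return Object.fromEntries(buckets);
}

const sampleRuns = [
  { requested: "fable-5", responded: "fable-5", pass: true },
  { requested: "fable-5", responded: "opus-4.8", pass: false },
  { requested: "fable-5", responded: "fable-5", pass: true },
];
console.log(tallyByResponder(sampleRuns));
```

With the split in place, a pass rate is always a pass rate for one model, never a blend of two.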
What already runs on Claude here
Before I tell you what Fable changes, here is what it is changing. This site is not a brochure. It is a system, and Claude is load-bearing inside it.
It runs on Astro and Cloudflare, serverless, nothing to babysit. The site ships in eight languages through a DeepL pipeline I built, with a brand-protection layer so the translator never turns “Opus” into a common noun. That pipeline has produced 89 localized article versions across seven languages, every one of them sitting in this repo’s content collections where you can count them. The discovery layer, llms.txt, stays current so models can find the work. None of that is a demo. It is in the repo behind this page.
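A brand-protection layer can be sketched as a swap before and after the translation call: replace protected terms with opaque tokens, translate, then put the terms back. This is an illustration, not the pipeline in the repo. The term list, the token format and the stand-in translator are all assumptions, and it only works if the translator leaves the tokens untouched.

```javascript
// Sketch: shield brand terms from a translator, then restore them.
// Assumption: the translator passes tokens like "__TERM_0__" through unchanged.
// The term list and token format are hypothetical.
const PROTECTED_TERMS = ["Opus", "Sonnet", "Fable"];

function protect(text, terms = PROTECTED_TERMS) {
  let out = text;
  terms.forEach((term, i) => {
    out = out.split(term).join(`__TERM_${i}__`);
  });
  return out;
}

function restore(text, terms = PROTECTED_TERMS) {
  let out = text;
  terms.forEach((term, i) => {
    out = out.split(`__TERM_${i}__`).join(term);
  });
  return out;
}

// Stand-in for the real translation call: it rewrites every word it sees,
// which is exactly the behaviour the tokens protect against.
const fakeTranslate = (s) => s.toUpperCase();

const sourceLine = "Opus is cheaper per result.";
console.log(fakeTranslate(sourceLine));
console.log(restore(fakeTranslate(protect(sourceLine))));
```

The first line printed shows the brand name rewritten along with everything else; the second shows it surviving the round trip.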
| Metric | Value | Detail |
|---|---|---|
| Languages shipped | 8 | EN plus 7, from one source edit |
| Pages built | 180 | every deploy passes CI |
| Tests on the content | 560 | schema, links, build |
| Servers to manage | 0 | Astro on Cloudflare, serverless |
So Claude here is not a writing toy. It is the thing that has to hold a multi-language, multi-app system together without dropping a thread. That is the job a frontier model has to be good at. Not trivia, not a clever one-liner.
One real job, eight languages, one session
Here is how the pricing comparison further down this page got made: today, in one sitting. This is the kind of work I mean.
How a single block shipped across eight languages
- One edit, in English
Wrote the pricing comparison once, in the canonical language, as the only source of truth.
- A surgical splice, not a re-translation
A script translated only the new block into seven languages and slotted it under the right heading, leaving every existing hand-fixed translation untouched.
- The part no launch post shows
It took four passes. DeepL read 'credits' as film credits, 'model card' as an RPG character sheet, '$5 input, $25 output' as buy and sell prices. I read every line and fixed the meaning. The model did the volume, I owned the judgment.
- Shipped
The build passing, 560 tests, pushed to main. Eight languages consistent. One session.
The translator handled seven languages in seconds. It also got the meaning wrong four times. Neither of those is a surprise, and both are the job. A stronger model shifts that line: fewer passes, less of my judgment spent on cleanup.
That line is exactly what a frontier model is supposed to move. Not “better prose.” Fewer rounds between the volume and the judgment.
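The surgical splice from the steps above can be sketched in a few lines: find the heading, insert the new block right under it, and leave every other line byte for byte as it was. This assumes the localized files are markdown with "## " headings; the actual script and file layout are not shown here, so treat the function name and format as hypothetical.

```javascript
// Sketch: insert a block under a named heading, leaving all other lines as-is.
// Assumption: the localized file is markdown with "## " headings.
function spliceUnderHeading(doc, heading, block) {
  const lines = doc.split("\n");
  const at = lines.findIndex((l) => l.trim() === heading);
  // Fail loudly rather than appending the block in the wrong place.
  if (at === -1) throw new Error(`heading not found: ${heading}`);
  lines.splice(at + 1, 0, "", block);
  return lines.join("\n");
}

const localizedDoc = [
  "## Intro",
  "Hand-fixed text.",
  "## Pricing",
  "Old paragraph.",
].join("\n");

const updatedDoc = spliceUnderHeading(localizedDoc, "## Pricing", "New pricing block.");
console.log(updatedDoc);
```

Throwing on a missing heading is the design choice that matters: a silent append would look like success and leave one language with the block in the wrong section.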
The three surfaces Claude Fable 5 has to survive: chat, agents, code
Same model, three very different jobs. Fast judgment in a chat window. A long multi-step run in a cowork agent. Code in this repo. The failure mode is different in each, and a frontier model is only worth the price if it holds up across all three.
| Surface | Real task from my stack | What it stresses | How I score it |
|---|---|---|---|
| Chat | Score this inbound against my ICP and draft a two-line opener | Judgment and concision in a single turn | Right ICP call, plus an opener I would actually send |
| Cowork | Take five LinkedIn URLs, enrich each, draft a personalised sequence, log it to the CRM | Long-horizon agentic work, tool use, no drift over many steps | Full run, zero invented fields, CRM rows correct |
| Code | Refactor a content collection across the repo and keep the build passing | Multi-file context held to the end, not just the first edit | Tests pass, build clean, no orphaned references |
One real task per surface. The same model has to be good at three different jobs, and each one breaks differently.
The code surface is the one I can show you today, because it runs here, not behind an embargo. To compare models honestly I had Claude build a small harness: same prompt, every model, real latency, tokens and cost. This is the core of it, and it actually runs:
// same prompt, every model, real numbers
for (const test of TESTS) {
  for (const model of MODELS) {
    const res = await callModel(model, test.prompt, test.maxTokens);
    const pass = test.grade ? (test.grade(res.text) ? "PASS" : "FAIL") : "--";
    const usd = cost(model, res.inTok, res.outTok);
    log(model, pass, res.ms, res.inTok, res.outTok, usd);
  }
}
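The loop leans on a `cost` helper it does not show. One plausible shape, as a sketch: a price table in USD per million tokens and a single multiplication. The two price rows are the list prices quoted in this article for Fable 5 and Opus 4.8; the model ID strings and the table name are hypothetical.

```javascript
// Sketch of a cost() helper: list price per million tokens, in and out.
// Prices are the list prices quoted in this article; IDs are hypothetical.
const PRICES_PER_MTOK = {
  "fable-5": { in: 10, out: 50 },
  "opus-4.8": { in: 5, out: 25 },
};

function cost(model, inTok, outTok) {
  const p = PRICES_PER_MTOK[model];
  // An unknown model should stop the run, not log a cost of zero.
  if (!p) throw new Error(`no price for model: ${model}`);
  return (inTok * p.in + outTok * p.out) / 1_000_000;
}

// Same token counts, two models: the per-token gap is exactly 2x.
console.log(cost("fable-5", 200_000, 20_000));
console.log(cost("opus-4.8", 200_000, 20_000));
```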
The first receipt: Fable 5 found a live bug in my failover
The full three-model numbers are in the next section. Setting them up already paid for itself, because a fair test needs an answer key.
To score models on an audit of this repo I first had to know what is actually wrong with it. So on day one I put Fable 5 on the infrastructure behind this site: the failover worker, the GitHub Actions escape hatch, the Terraform. It came back with six findings I could verify line by line in the code. One of them was a production incident waiting for a date.
The monitor that guards this site checks the health of wojciech.io, the live domain. After a failover that domain points at the backup. So during any real outage the monitor would see a healthy backup, decide the outage was over, and flip DNS back to the broken primary. Five minutes later: failover again. For the whole length of the outage, my site would have see-sawed between working and broken on a five-minute clock.
The fix shipped the same day: the monitor now checks the primary origin directly before restoring DNS, and holds on the backup otherwise. This is the heart of it, from commit e734c2f in the public repo, so you can read every line instead of taking my word:
-  } else if (healthy && !onPrimary) {
-    await updateDns(env, CF_PAGES_HOSTNAME);
-    console.log(`RESTORE -> CF Pages: ${CF_PAGES_HOSTNAME}`);
+  } else if (!onPrimary) {
+    // After a failover the apex serves the backup, so apex health says
+    // nothing about the primary. Restore only when the primary origin
+    // itself responds, otherwise DNS flip-flops every cron tick while
+    // the primary is down.
+    const primaryHealthy = await checkHealth(`https://${CF_PAGES_HOSTNAME}`);
+    if (primaryHealthy) {
+      await updateDns(env, CF_PAGES_HOSTNAME);
+      console.log(`RESTORE -> CF Pages: ${CF_PAGES_HOSTNAME}`);
+    } else {
+      console.log(`HOLD: serving backup, primary still unhealthy`);
+    }
That is the division of labour from the pricing-block story again, on infrastructure this time: Fable 5 read the system and named the mechanism, I checked every line before it landed.
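To see the flip-flop without reading the worker, here is a small simulation of both rules as pure functions. It is a simplification under stated assumptions: state is reduced to three booleans, the failover branch (`!apexHealthy && onPrimary`) is my guess at the branch above the diff, and the function names are hypothetical.

```javascript
// Sketch: the restore decision as a pure function, old rule vs new rule.
// Assumption: the branch before the diff fails over when the apex is
// unhealthy while DNS points at the primary.
function oldRule({ apexHealthy, onPrimary }) {
  if (!apexHealthy && onPrimary) return "FAILOVER";
  if (apexHealthy && !onPrimary) return "RESTORE";
  return "NOOP";
}

function newRule({ apexHealthy, onPrimary, primaryHealthy }) {
  if (!apexHealthy && onPrimary) return "FAILOVER";
  if (!onPrimary) return primaryHealthy ? "RESTORE" : "HOLD";
  return "NOOP";
}

// Primary is down for the whole run, the backup is up.
function simulate(rule, ticks) {
  let onPrimary = true;
  const actions = [];
  for (let i = 0; i < ticks; i++) {
    const primaryHealthy = false;
    // The apex reports the health of whatever DNS currently points at.
    const apexHealthy = onPrimary ? primaryHealthy : true;
    const action = rule({ apexHealthy, onPrimary, primaryHealthy });
    if (action === "FAILOVER") onPrimary = false;
    if (action === "RESTORE") onPrimary = true;
    actions.push(action);
  }
  return actions;
}

console.log(simulate(oldRule, 4).join(" ")); // FAILOVER RESTORE FAILOVER RESTORE
console.log(simulate(newRule, 4).join(" ")); // FAILOVER HOLD HOLD HOLD
```

The old rule alternates on every tick because the apex health check is answered by the backup. The new rule asks the primary directly and stays put.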
Fable 5 vs Opus 4.6 vs Sonnet 4.6: one answer key, the numbers
The runs landed the same afternoon. Three headless audits in that sterile harness: identical clones of one commit, identical prompt, identical read-only tool limits, one run each for Fable 5, Opus 4.6 and Sonnet 4.6. The answer key was six bugs I had verified in the code by hand before any model started, weighted by severity, eleven points total. A point requires naming the mechanism, not just the file.
| Verified finding (weight) | Fable 5 | Opus 4.6 | Sonnet 4.6 |
|---|---|---|---|
| Failover flip-flop: health check probes the live domain, not the primary origin (3) | 3 | 3 | 0 |
| Worker and workflow disagree on what healthy means: worker passes a 404, workflow fails it (2) | 2 | 0 | 1 |
| Primary hostname hardcoded in three places (2) | 2 | 0 | 0 |
| DNS record outside Terraform, state backend local (2) | 1 | 1 | 0 |
| Env value pinned in public IaC (1) | 0.5 | 0.5 | 0 |
| Stale locale comment in _redirects (1) | 0 | 0 | 0 |
| Score out of 11 | 8.5 | 4.5 | 1 |
Each finding verified by me in the code before any model ran. Invented risks would have cost a point; nobody invented any.
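The scoring rule is simple enough to write down. A sketch, assuming partial credit is a fraction of each finding's weight and each invented risk costs one point; the weights and credits come from the table, the short finding keys are hypothetical labels.

```javascript
// Sketch: score one audit against a weighted answer key.
// Weights are from the table; the finding keys are hypothetical labels.
const ANSWER_KEY = {
  flipFlop: 3,
  healthMismatch: 2,
  hardcodedHost: 2,
  dnsOutsideTerraform: 2,
  pinnedEnv: 1,
  staleComment: 1,
};

// credit[id] is the fraction of a finding's mechanism the model named (0..1).
function score(credit, invented = 0, key = ANSWER_KEY) {
  const earned = Object.entries(key).reduce(
    (sum, [id, weight]) => sum + weight * (credit[id] ?? 0),
    0
  );
  return earned - invented; // each invented risk costs one point
}

const fableCredit = { flipFlop: 1, healthMismatch: 1, hardcodedHost: 1, dnsOutsideTerraform: 0.5, pinnedEnv: 0.5 };
const opusCredit = { flipFlop: 1, dnsOutsideTerraform: 0.5, pinnedEnv: 0.5 };
const sonnetCredit = { healthMismatch: 0.5 };

console.log(score(fableCredit), score(opusCredit), score(sonnetCredit)); // 8.5 4.5 1
```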
| Metric | Fable 5 | Opus 4.6 | Sonnet 4.6 |
|---|---|---|---|
| Cost, USD | 2.57 | 1.26 | 0.76 |
| Wall time | 7m 18s | 3m 16s | 4m 37s |
| Agent turns | 32 | 46 | 40 |
| Cost per point found | $0.30 | $0.28 | $0.76 |
Cost per point is the row that matters: tokens are an input, found bugs are the output.
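The last row is a division of two rows you can already see, so it is easy to recompute. A short sketch using the cost and score figures from the tables; the helper name is hypothetical.

```javascript
// Sketch: recompute cost per point from run cost (USD) and score.
const RUNS = [
  { model: "Fable 5", usd: 2.57, points: 8.5 },
  { model: "Opus 4.6", usd: 1.26, points: 4.5 },
  { model: "Sonnet 4.6", usd: 0.76, points: 1 },
];

function costPerPoint(usd, points) {
  // A run that found nothing has no meaningful cost per point.
  if (points <= 0) return Infinity;
  return usd / points;
}

for (const run of RUNS) {
  console.log(run.model, "$" + costPerPoint(run.usd, run.points).toFixed(2));
}
```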
The verdict is less tidy than a leaderboard. Fable 5 found the most and was the only model to sweep the whole failover layer: the flip-flop, the two contradictory definitions of healthy, the hostname drift. Opus 4.6 nailed the headline bug just as precisely at half the total cost, and on cost per point it actually wins, but it went blind on the operational details around it. Sonnet 4.6 never saw the flip-flop at all, and still earned its run: it surfaced the single most actionable finding outside my answer key, a subscribe endpoint that silently falls back to a legacy Vercel app nothing maintains anymore.
One more honest line the leaderboards never print. All three models flagged a finding my answer key had explicitly marked as a false positive: I believed my locale list was centralized in one file. They said it is declared independently in five places. I checked. They were right and my answer key was wrong. The benchmark audited the examiner.
Which model when, in one table:
| Situation | My pick | Why |
|---|---|---|
| Auditing a system the business depends on | Fable 5 | The only model to sweep the whole failover layer; against a production outage the 2x price difference is noise |
| Daily agent work where budget matters | Opus 4.6 | Found the same headline bug at half the cost; cheapest per point at $0.28 |
| A fast, cheap pass over concrete code issues | Sonnet 4.6 | Missed the scariest bug, but brought the single most actionable finding outside my answer key |
My split after one benchmark run. Your repo may vote differently, which is exactly why you should run your own.
Where Fable 5 actually earns its price for me
Not “smarter.” One thing: it holds the thread.
Every agent loop I run dies the same way. The model loses context at step N and starts inventing. My localization pipeline runs every block of an article through translate, restore and rebuild, times seven languages. The enrichment loop touches a CRM across many steps. The drift is the tax I pay, and today I pay it by re-running or by reading every line by hand, like I did four times this morning.
Fable’s headline claim is exactly this failure mode: hold context across millions of tokens, finish long agentic runs without losing the plot. The loudest proof point is Stripe running a codebase-wide migration across 50 million lines of Ruby in a single day, work Anthropic says would take a team two months by hand. The checkable version is the coding eval:
SWE-Bench Pro: +16% vs baseline
If that holds on my work, the win is not nicer output. It is one pass instead of three. That changes cost per result, which is the only number on the whole launch that touches my P&L.
Fable 5 pricing: 2x Opus 4.8, and the catch
When usage billing starts on June 23, Fable 5 and Mythos 5 both list at $10 per million input tokens and $50 per million output tokens. That is exactly double Opus 4.8, which sits at $5 and $25. These are published list prices, not my estimates.
Chart: output-token price, USD per million. List pricing from Anthropic's published numbers for this launch, verified against the public pricing, not modelled by me.
The number that decides whether this is worth 2x is not on any price page: cost per finished result. A model that costs twice as much per token but finishes a long agentic run in half the steps can land cheaper per result. That is the only comparison that matters, and it is exactly what the harness above is built to measure.
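The break-even is worth working through once. A sketch, assuming cost scales with the number of steps and that failed runs get re-run: only the per-million-token list prices come from this page, while the step counts, tokens per step and the `successRate` parameter are hypothetical workload numbers.

```javascript
// Sketch: expected cost per finished result for a multi-step agent run.
// Only the list prices are from the article; the workload is hypothetical.
function costPerResult({ inPrice, outPrice, steps, inTokPerStep, outTokPerStep, successRate = 1 }) {
  const perAttempt =
    (steps * (inTokPerStep * inPrice + outTokPerStep * outPrice)) / 1_000_000;
  // Failed attempts are re-run, so the expected number of attempts is 1 / successRate.
  return perAttempt / successRate;
}

const workload = { inTokPerStep: 20_000, outTokPerStep: 2_000 };
const opusRun = costPerResult({ inPrice: 5, outPrice: 25, steps: 12, ...workload });
const fableHalf = costPerResult({ inPrice: 10, outPrice: 50, steps: 6, ...workload });
const fableFewer = costPerResult({ inPrice: 10, outPrice: 50, steps: 5, ...workload });

console.log(opusRun, fableHalf, fableFewer); // 1.8 1.8 1.5
```

At exactly half the steps the 2x price is a wash. The model only gets cheaper per result below that, or when it avoids re-runs the cheaper model needs.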
What the first 48 hours of Fable 5 produced
The launch was yesterday and the useful material is already piling up. This is the short list I actually opened, watched or read to the end, not a link dump.
Worth your time since launch
Everything verified and opened by me on June 10. If it is on this list, I read or watched it.
- 5.5 hours of hands-on testing. His verdict: a slow, expensive beast that churns through everything. Matches my numbers.
- Generally available in Copilot since launch day. The fastest way to try it inside an editor you already run.
- Model IDs, the fallback mechanism, the new refusal signals. Read this before your agent loops meet the classifier.
- The discourse angle: the most capable public model, days after Anthropic warned capability is getting dangerous.
- What the new model gets right and very wrong, from the product side rather than the benchmark side.
- 120,000 characters of instructions, on GitHub within a day. Read what the model is told before it talks to you.
What I will publish
Fable 5 is free on Pro, Max, Team and seat-based Enterprise until June 22, then it moves to usage credits. That free window is two weeks to run it on real work before the meter starts.
Do:
- Test Fable 5 on your hardest real task during the free window
- Log which model answered when the fallback fires in agent loops
- Measure cost per finished result, not cost per token
- Re-run your own evals before trusting any launch-day number
Don't:
- Repeat launch benchmarks as if you measured them
- Wire Mythos-class output into anything without reading the 30-day retention policy
- Assume the safeguards will not block a legitimate request
- Wait until June 23 and then complain it costs credits
I will put the results up here as a follow-up: cost per result on the localization pipeline and the enrichment loop, Fable against Opus, including where it falls short. If you are testing too, tell me what broke. That is the part no launch post will give you.
FAQ: Claude Fable 5 and Mythos 5
Is Claude Fable 5 the same model as Claude Mythos 5?
Yes. Anthropic confirms they share the same underlying model. The difference is the safeguards: Fable 5 is the general-release build that silently falls back to Opus 4.8 on flagged topics, Mythos 5 runs with some safeguards lifted and is restricted to vetted cyberdefense partners through Project Glasswing.
How much does Claude Fable 5 cost?
$10 per million input tokens and $50 per million output tokens, exactly twice Opus 4.8 ($5 and $25). It is free on Pro, Max, Team and seat-based Enterprise plans until June 22, 2026; usage billing starts June 23.
Is Claude Fable 5 better than Opus or Sonnet?
On my own benchmark, an audit of this site’s repo scored against a hand-verified answer key, Fable 5 took 8.5 of 11 points against 4.5 for Opus 4.6 and 1 for Sonnet 4.6, and it was the only model to sweep the whole failover layer. On cost per found bug, Opus 4.6 actually won, $0.28 against $0.30. Anthropic’s own headline number is SWE-Bench Pro: 80.3% for Fable 5 against 69.2% for Opus 4.8. Treat all of it as input for your own eval, not a verdict.
Why did Opus 4.8 answer my Fable 5 request?
Fable 5 does not refuse flagged topics, it silently answers with Opus 4.8 instead. Anthropic says this happens in under 5% of sessions. If you run agent loops or evals, log which model replied, otherwise your numbers mix two models.
Sources and further reading
I read the primary material so you can too. These are the real links, not a summary of a summary.
Primary sources
Anthropic's own pages plus the third-party evals cited on launch day.
- Anthropic's full Fable 5 and Mythos 5 post.
- Safety and capability test detail behind the claims.
- The US government cyberdefense collaboration.
- Earlier read on the cyber capability curve.
- The coding eval Fable 5 is reported to top.
- Independent work cited as backing a Mythos hypothesis.
If you want the operator context around this, I have written about running Claude Code as a GTM agent stack and the AI production stack I actually ship on. Same principle as here: the model is one layer, the system around it is the job.