AI Systems AI Claude AI Systems

Claude Fable 5 found a live bug on my site on day one, so I made it compete with Opus and Sonnet

Q: Is Claude Fable 5 the same model as Claude Mythos 5?

Yes. Anthropic confirms they share the same underlying model. Fable 5 is the general release that silently falls back to Opus 4.8 on flagged topics; Mythos 5 runs with some safeguards lifted and is restricted to vetted cyberdefense partners through Project Glasswing.

Q: Is Claude Fable 5 better than Opus or Sonnet?

On my own benchmark, an audit of this site's repo scored against a hand-verified answer key, Fable 5 took 8.5 of 11 points against 8.0 for Opus 4.8 and 1 for Sonnet 4.6. On cost per found bug Opus 4.8 won by a mile, $0.11 against $0.30. Treat all of it as input for your own eval, not a verdict.

Q: Why did Opus 4.8 answer my Fable 5 request?

Fable 5 does not refuse flagged topics, it silently answers with Opus 4.8 instead. Anthropic says this happens in under 5% of sessions. If you run agent loops or evals, log which model replied, otherwise your numbers mix two models.

I let Claude Fable 5, Opus 4.8 and Sonnet 4.6 each audit my own site's repo. Fable found a live failover bug, Opus 4.8 caught nearly as much for a third of the cost, Sonnet missed it. Scores, costs, fix: all public.

Wojciech Łuszczyński

GTM Architect & Growth Operator · Now · 9 June 2026

TL;DR · Key insights

Fable 5 and Mythos 5 are the same model. The safeguards are the only difference: Fable quietly falls back to Opus 4.8 on flagged topics, Mythos runs with some lifted for vetted partners.
I let Fable 5, Opus 4.8 and Sonnet 4.6 each audit this site's repo, scored against six bugs I had verified by hand first. Fable found the most, including a live production bug in my DNS failover. Opus 4.8 caught nearly as much for a third of the cost per bug. Sonnet missed it. The fix is a public commit, linked below.
It costs 2x Opus 4.8: $10 in, $50 out per million tokens. Worth it only if it finishes a job in fewer steps. Cost per result, not cost per token.
The catch nobody puts on the slide: a silent fallback means two models can answer in one session. If you run agent loops, log which one replied or your numbers lie.

Anthropic shipped two frontier models today, June 9, 2026: Claude Fable 5 and Claude Mythos 5. One is for everyone. One is locked behind a trusted access program. Underneath, they are the same model.

The launch post is wall to wall benchmarks. I am not going to recap them. I run a business on Claude, not a leaderboard, and the only question I care about is what a stronger model does to the systems I already operate. This site in eight languages. The content pipeline. The agent loops that touch a CRM. Here is that answer, on my own work, with the receipts you can clone and run.

Model	Score / 11	Cost / found bug	In one line
Claude Fable 5	8.5	$0.30	Found the most, and the only one to catch the live failover bug in full
Claude Opus 4.8	8.0	$0.11	Caught nearly as much for a third of the cost per bug: cheapest per result
Claude Sonnet 4.6	1	$0.76	Missed the big one, but surfaced a real bug outside my answer key

The whole benchmark in one table. Per-finding scores, costs and the exact method are further down.

Fable 5 and Mythos 5: two models, one brain

The naming is the tell. Anthropic says fable comes from the Latin fabula, “that which is told,” close to the Greek mythos. Same root, two masks. The difference is not capability. It is what the model is allowed to do.

Anthropic's launch art for the Claude 5 generation: the number five composed of vintage butterfly and moth illustrations on a cream background — The official launch art: a five made of butterflies. Fifth generation, metamorphosis, you get it. Image: Anthropic.

Claude Fable 5 is on the API and every plan today. Safeguards on. When a request hits a flagged topic it does not refuse, it quietly answers with Opus 4.8 instead. Anthropic says this happens in under 5% of sessions.

Claude Mythos 5 is the same underlying model, some safeguards lifted. Restricted to vetted cyberdefenders through Project Glasswing, with a biology program to follow. Anthropic calls it the strongest cyber model it has built.

Your request

classifier

Safety check cyber · bio · distillation

clean

Fable 5 answers full capability

flagged

Opus 4.8 answers fallback, no refusal

How Fable 5 handles a flagged request: no refusal, just a quieter model

This is the one launch detail that touches anyone building on the API. A silent fallback is better UX than a hard refusal. It also means two different models can answer inside one session. If you run evals or agent loops, you want to log which one replied, or your numbers quietly mix two models.

Anthropic's official benchmark table comparing Claude Mythos 5 and Fable 5 against Claude Opus 4.8, GPT 5.5 and Gemini 3.1 Pro across agentic coding, knowledge work, tool use, legal, biology and cybersecurity — Anthropic’s launch table: 80.3% on SWE-Bench Pro, 78% on ExploitBench, the lead in nearly every row. Vendor numbers on vendor benchmarks. Mine, on my own repo, are further down. Chart: Anthropic.

What already runs on Claude here

Before I tell you what Fable changes, here is what it is changing. This site is not a brochure. It is a system, and Claude is load-bearing inside it.

It runs on Astro and Cloudflare, serverless, nothing to babysit. The site ships in eight languages through a DeepL pipeline I built, with a brand-protection layer so the translator never turns “Opus” into a common noun. That pipeline has produced 89 localized article versions across seven languages, every one of them sitting in this repo’s content collections where you can count them. The discovery layer, llms.txt, stays current so models can find the work. None of that is a demo. It is in the repo behind this page.

Metric	Value	Info
Languages	8	EN plus 7, from one source edit
Pages built	180	every deploy passes CI
Tests	560	schema, links, build
Servers	0	Astro on Cloudflare, serverless

This site as it builds today. Clone the repo and you get the same numbers.

One real job, eight languages, one session

The pricing comparison further down this page: here is how it got made, today, in one sitting. This is the kind of work I mean.

One edit, in English. Wrote the pricing comparison once, in the canonical language, as the only source of truth.
A surgical splice, not a re-translation. A script translated only the new block into seven languages and slotted it under the right heading, leaving every existing hand-fixed translation untouched.
The part no launch post shows. It took four passes. DeepL read “credits” as film credits, “model card” as an RPG character sheet, “$5 input, $25 output” as buy and sell prices. I read every line and fixed the meaning. The model did the volume, I owned the judgment.
Shipped. The build passing, 560 tests, pushed to main. Eight languages consistent. One session.

The translator handled eight languages in seconds. It also got the meaning wrong four times. Neither of those is a surprise, and both are the job. A stronger model shifts that line: fewer passes, less of my judgment spent on cleanup.

The three surfaces Claude Fable 5 has to survive: chat, agents, code

Same model, three very different jobs. Fast judgment in a chat window. A long multi-step run in a cowork agent. Code in this repo. The failure mode is different in each, and a frontier model is only worth the price if it holds up across all three.

Chat

Score an inbound lead against my ICP and draft a two-line opener. Stresses judgment and concision in a single turn.

Cowork

Take five LinkedIn URLs, enrich each, draft a personalised sequence, log it to the CRM. Stresses long-horizon tool use with no drift over many steps.

Claude Code

Refactor a content collection across this repo and keep the build passing. Stresses multi-file context held to the end, not just the first edit.

The code surface is the one with receipts, because it runs here and the numbers are public. To compare models honestly I had Claude build a small harness: same prompt, every model, real latency, tokens and cost:

// same prompt, every model, real numbers
for (const test of TESTS) {
  for (const model of MODELS) {
    const res = await callModel(model, test.prompt, test.maxTokens);
    const pass = test.grade ? (test.grade(res.text) ? "PASS" : "FAIL") : "--";
    const usd = cost(model, res.inTok, res.outTok);
    log(model, pass, res.ms, res.inTok, res.outTok, usd);
  }
}

The first receipt: Fable 5 found a live bug in my failover

The full three-model numbers are two sections down. Setting them up already paid for itself, because a fair test needs an answer key.

To score models on an audit of this repo I first had to know what is actually wrong with it. So on day one I put Fable 5 on the infrastructure behind this site: the failover worker, the GitHub Actions escape hatch, the Terraform. It came back with six findings I could verify line by line in the code. One of them was a production incident waiting for a date.

The monitor that guards this site checks the health of wojciech.io, the live domain. After a failover that domain points at the backup. So during any real outage the monitor would see a healthy backup, decide the outage was over, and flip DNS back to the broken primary. Five minutes later: failover again. For the whole length of the outage, my site would have see-sawed between working and broken on a five-minute clock.

Primary down

monitor

Failover to backup

next tick

Apex looks healthy it is serving the backup

restore

Broken primary again

The loop my failover monitor would have run during a real outage

The fix shipped the same day: the monitor now checks the primary origin directly before restoring DNS, and holds on the backup otherwise. This is the heart of it, from commit e734c2f in the public repo, so you can read every line instead of taking my word:

-  } else if (healthy && !onPrimary) {
-    await updateDns(env, CF_PAGES_HOSTNAME);
-    console.log(`RESTORE -> CF Pages: ${CF_PAGES_HOSTNAME}`);
+  } else if (!onPrimary) {
+    // After a failover the apex serves the backup, so apex health says
+    // nothing about the primary. Restore only when the primary origin
+    // itself responds, otherwise DNS flip-flops every cron tick while
+    // the primary is down.
+    const primaryHealthy = await checkHealth(`https://${CF_PAGES_HOSTNAME}`);
+    if (primaryHealthy) {
+      await updateDns(env, CF_PAGES_HOSTNAME);
+      console.log(`RESTORE -> CF Pages: ${CF_PAGES_HOSTNAME}`);
+    } else {
+      console.log(`HOLD: serving backup, primary still unhealthy`);
+    }

Fable 5 vs Opus 4.8 vs Sonnet 4.6: one answer key, the numbers

The runs ran in that sterile harness: identical clones of one commit, identical prompt, identical read-only tool limits, one run each for Fable 5, Opus 4.8 and Sonnet 4.6. The answer key was six bugs I had verified in the code by hand before any model started, weighted by severity, eleven points total. A point requires naming the mechanism, not just the file.

The benchmark setup: the claude -p command with a locked-down tool list, and the full R1 audit prompt that every model received verbatim — The actual run, not a description of one. One prompt, identical clones of commit 612c367, no MCP, no plugins, no memory. Every model got exactly this.

Verified finding (weight)	Fable 5	Opus 4.8	Sonnet 4.6
Failover flip-flop: health check probes the live domain, not the primary origin (3)	3	3	0
Worker and workflow disagree on what healthy means: worker passes a 404, workflow fails it (2)	2	2	1
Primary hostname hardcoded in three places (2)	2	1.5	0
DNS record outside Terraform, state backend local (2)	1	1	0
Env value pinned in public IaC (1)	0.5	0.5	0
Stale locale comment in _redirects (1)	0	0	0
Score out of 11	8.5	8.0	1

Each finding verified by me in the code before any model ran. Invented risks would have cost a point; nobody invented any.

Metric	Fable 5	Opus 4.8	Sonnet 4.6
Cost, USD	2.57	0.86	0.76
Wall time	7m 18s	3m 39s	4m 37s
Agent turns	32	22	40
Cost per point found	$0.30	$0.11	$0.76

Cost per point is the row that matters: tokens are an input, found bugs are the output. Opus 4.8 found nearly as much as Fable for a third of the price per bug.

The verdict is less tidy than a leaderboard. Fable 5 found the most: 8.5 of 11, the only model to sweep the whole failover layer, the flip-flop and the two contradictory definitions of healthy and the hostname drift. But Opus 4.8 is the result that should stop you. It scored 8.0, half a point behind Fable, for $0.86 against $2.57. That is $0.11 per found bug against Fable’s $0.30: the model Fable costs double caught almost everything Fable did, at a third of the price per result. Sonnet 4.6 never saw the flip-flop at all, and still earned its run: it surfaced the single most actionable finding outside my answer key, a subscribe endpoint that silently falls back to a legacy Vercel app nothing maintains anymore.

One caveat on honesty about my own setup. On launch morning I only had API access to the older Opus 4.6, so the first version of this test scored that. It managed 4.5 of 11: it caught the flip-flop but went blind on the operational layer around it. I reran the whole thing on Opus 4.8, the current price reference Fable is measured against, and the jump from 4.5 to 8.0 on the exact same prompt is its own data point. The numbers above are all 4.8.

One more honest line the leaderboards never print. All three models flagged a finding my answer key had explicitly marked as a false positive: I believed my locale list was centralized in one file. They said it is declared independently in five places. I checked. They were right and my answer key was wrong. The benchmark audited the examiner.

What each model found outside the answer key

The score table only counts what was in my answer key. But every model also surfaced real, verified findings I had not listed. These are not bonus points. They are the reason you run three models instead of one.

Model	Off-key finding (verified)	Impact
Fable 5	Azure backup is dist-only, no API routes, no redirects, no security headers after failover	Partial outage: site renders, forms and headers vanish
Fable 5	Two competing deploy pipelines with different env vars	Silent config drift between CI and Wrangler
Fable 5	CSP in middleware and in Pages config already diverged (PostHog missing from one)	Analytics breakage on a subset of pages
Opus 4.8	Azure failover backup rebuilds weekly from a bare build with no env vars	Failover can serve a week-stale, differently-built site
Opus 4.8	Cloudflare account ID hard-coded across five deploy jobs	One ID copy-pasted five times, no single source to rotate
Sonnet 4.6	Subscribe endpoint proxies to a legacy Vercel app nothing maintains	Subscribers silently drop into a dead pipeline
Sonnet 4.6	crypto.ts duplicated across three apps, two copies byte-identical	Fix one, forget the other, ship a mismatch

Every finding in this table is verified in the code. The subscribe endpoint was the single most actionable discovery of the whole benchmark, and it came from the model that scored lowest.

How the reports read, not just what they found

Numbers do not tell you what it is like to read a 2,000 word audit at 11 pm and decide whether to ship a fix. Here are the opening lines of each report, exactly as the models wrote them, with the bill attached.

Opening of Fable 5's audit report: it leads with the failover Worker probing the hostname it rewrites, names exact files and lines, severity High, scored 8.5 of 11 at $2.57 — Fable 5 leads with the scariest bug and quotes the exact lines.

Opening of Opus 4.8's audit report: same failover flip-flop found, measured against the live DNS target instead of the primary origin, scored 8.0 of 11 at $0.86 — Opus 4.8 catches the same flip-flop, in half the wall time and a third of the cost.

Opening of Sonnet 4.6's audit report: Risk 1 is CSP drift blocking PostHog, the failover flip-flop is absent, scored 1 of 11 at $0.76 — Sonnet 4.6, same prompt. Look at what Risk 1 is not: the flip-flop never appears in the report. The 1 of 11 is visible right on this screen.

Reading them side by side is the fastest way to feel the gap. Beyond the first impression, I scored each report on four qualities, 1 to 5.

Report quality	Fable 5	Opus 4.8	Sonnet 4.6
File and line precision	5	5	4
Severity accuracy	5	5	3
Conciseness	4	5	4
Actionability of recommendations	5	4	4

Subjective, my read. Fable and Opus 4.8 both quoted the code verbatim and got severity right; Opus was the tighter writer in fewer turns. Sonnet overcalled severity on one finding but gave the clearest fix steps.

Which model when, in one table:

Situation	My pick	Why
Auditing a system the business depends on	Fable 5	Swept the whole failover layer end to end; against a production outage the 2x price is noise
Daily agent work where budget matters	Opus 4.8	Scored 8.0 to Fable's 8.5 for a third of the cost per bug; cheapest per point at $0.11
A fast, cheap pass over concrete code issues	Sonnet 4.6	Missed the scariest bug, but brought the single most actionable finding outside my answer key

My split after one benchmark run. Your repo may vote differently.

Where Fable 5 actually earns its price for me

Not “smarter.” One thing: it holds the thread.

Every agent loop I run dies the same way. The model loses context at step N and starts inventing. My localization pipeline runs every block of an article through translate, restore and rebuild, times seven languages. The enrichment loop touches a CRM across many steps. The drift is the tax I pay, and today I pay it by re-running or by reading every line by hand, like I did four times this morning.

Fable’s headline claim is exactly this failure mode: hold context across millions of tokens, finish long agentic runs without losing the plot. The loudest proof point is Stripe running a codebase-wide migration across 50 million lines of Ruby in a single day, work Anthropic says would take a team two months by hand. The checkable version is the coding eval:

Model	SWE-Bench Pro
Fable 5	80.3%
Opus 4.8	69.2%

From the launch material, so treat it as a claim until your own workload confirms it. I am quoting it because long-horizon coding is exactly the work my repo throws at it, not because I reproduced it.

If that holds on my work, the win is not nicer output. It is one pass instead of three. That changes cost per result, which is the only number on the whole launch that touches my P&L.

Fable 5 pricing: 2x Opus 4.8, and the catch

When usage billing starts on June 23, Fable 5 and Mythos 5 both list at $10 per million input tokens and $50 per million output tokens. That is exactly double Opus 4.8, which sits at $5 and $25. These are published list prices, not my estimates.

Info	Fable 5 / Mythos 5	Opus 4.8
Input tokens, per million	$10	$5
Output tokens, per million	$50	$25
Multiplier vs Opus	2x	1x
Free window ends	Jun 23	n/a
Silent fallback to Opus 4.8	<5% of sessions	n/a

Published list prices. Fable 5 costs exactly twice as much as Opus 4.8. The frontier capability is what the premium buys.

The number that decides whether this is worth 2x is not on any price page: cost per finished result. A model that costs twice as much per token but finishes a long agentic run in half the steps can land cheaper per result. That is exactly what the harness above measures.

What the first 48 hours of Fable 5 produced

The launch was yesterday and the useful material is already piling up. This is the short list I actually opened and read to the end, not a link dump.

Simon Willison's first impressions

5.5 hours of hands-on testing. His verdict: a slow, expensive beast that churns through everything. Matches my numbers.

Fable 5 in GitHub Copilot

Generally available in Copilot since launch day. The fastest way to try it inside an editor you already run.

The API migration page

Model IDs, the fallback mechanism, the new refusal signals. Read this before your agent loops meet the classifier.

TechCrunch on the timing

The discourse angle: the most capable public model, days after Anthropic warned capability is getting dangerous.

Lenny's product review

What the new model gets right and very wrong, from the product side rather than the benchmark side.

The leaked system prompt

120,000 characters of instructions, on GitHub within a day. Read what the model is told before it talks to you.

What to do with the free window

Fable 5 is free on Pro, Max, Team and seat-based Enterprise until June 22, then it moves to usage credits. That free window is two weeks to run it on real work before the meter starts.

Test Fable 5 on your hardest real task during the free window
Log which model answered when the fallback fires in agent loops
Measure cost per finished result, not cost per token
Re-run your own evals before trusting any launch-day number

Don't

Repeat launch benchmarks as if you measured them
Wire Mythos-class output into anything without reading the 30-day retention policy
Assume the safeguards will not block a legitimate request
Wait until June 23 and then complain it costs credits

If you are testing too, tell me what broke. That is the part no launch post will give you.

FAQ: Claude Fable 5 and Mythos 5

Is Claude Fable 5 the same model as Claude Mythos 5?

Yes. Anthropic confirms they share the same underlying model. The difference is the safeguards: Fable 5 is the general-release build that silently falls back to Opus 4.8 on flagged topics, Mythos 5 runs with some safeguards lifted and is restricted to vetted cyberdefense partners through Project Glasswing.

How much does Claude Fable 5 cost?

$10 per million input tokens and $50 per million output tokens, exactly twice Opus 4.8 ($5 and $25). It is free on Pro, Max, Team and seat-based Enterprise plans until June 22, 2026; usage billing starts June 23.

Is Claude Fable 5 better than Opus or Sonnet?

On my own benchmark, an audit of this site’s repo scored against a hand-verified answer key, Fable 5 took 8.5 of 11 points against 8.0 for Opus 4.8 and 1 for Sonnet 4.6, and it was the only model to sweep the whole failover layer. On cost per found bug, Opus 4.8 won by a wide margin, $0.11 against $0.30, scoring nearly as high for a third of the price. Anthropic’s own headline number is SWE-Bench Pro: 80.3% for Fable 5 against 69.2% for Opus 4.8. Treat all of it as input for your own eval, not a verdict.

Why did Opus 4.8 answer my Fable 5 request?

Fable 5 does not refuse flagged topics, it silently answers with Opus 4.8 instead. Anthropic says this happens in under 5% of sessions. If you run agent loops or evals, log which model replied, otherwise your numbers mix two models.

Sources and further reading

The launch announcement

Anthropic's full Fable 5 and Mythos 5 post.

System card

Safety and capability test detail behind the claims.

Project Glasswing

The US government cyberdefense collaboration.

Mythos Preview cyber update

Earlier read on the cyber capability curve.

Cognition FrontierCode

The coding eval Fable 5 is reported to top.

Corroborating biology study

Independent work cited as backing a Mythos hypothesis.

If you want the operator context around this, I have written about running Claude Code as a GTM agent stack and the AI production stack I actually ship on. Same principle as here: the model is one layer, the system around it is the job.