Skip to content
wojciech.io
All insights
// AI Systems AI Claude AI Systems

Claude Fable 5 found a live bug on my site on day one, so I made it compete with Opus and Sonnet

I let Claude Fable 5, Opus 4.6 and Sonnet 4.6 each audit my own site's repo. Fable found a live failover bug, Opus was cheaper per result, Sonnet missed it. Scores, costs, fix: all public.

Wojciech Łuszczyński

Wojciech Łuszczyński

GTM Architect & Growth Operator · Now · 9 June 2026

TL;DR · Key insights

  • Fable 5 and Mythos 5 are the same model. The safeguards are the only difference: Fable quietly falls back to Opus 4.8 on flagged topics, Mythos runs with some lifted for vetted partners.
  • I let Fable 5, Opus 4.6 and Sonnet 4.6 each audit this site's repo, scored against six bugs I had verified by hand first. Fable found the most by far, including a live production bug in my DNS failover. Opus caught the same headline bug at half the cost. Sonnet missed it. The fix is a public commit, linked below.
  • It costs 2x Opus 4.8: $10 in, $50 out per million tokens. Worth it only if it finishes a job in fewer steps. Cost per result, not cost per token.
  • The catch nobody puts on the slide: a silent fallback means two models can answer in one session. If you run agent loops, log which one replied or your numbers lie.

Anthropic shipped two frontier models today, June 9, 2026: Claude Fable 5 and Claude Mythos 5. One is for everyone. One is locked behind a trusted access program. Underneath, they are the same model.

The launch post is wall to wall benchmarks. I am not going to recap them. I run a business on Claude, not a leaderboard, and the only question I care about is what a stronger model does to the systems I already operate. This site in eight languages. The content pipeline. The agent loops that touch a CRM. Here is that answer, on my own work, with the receipts you can clone and run.

Fable 5 and Mythos 5: two models, one brain

The naming is the tell. Anthropic says fable comes from the Latin fabula, “that which is told,” close to the Greek mythos. Same root, two masks. The difference is not capability. It is what the model is allowed to do.

Claude Fable 5

General use

On the API and every plan today. Safeguards on. When a request hits a flagged topic it does not refuse, it quietly answers with Opus 4.8 instead. Anthropic says this happens in under 5% of sessions.

Claude Mythos 5

Trusted access

Same underlying model, some safeguards lifted. Restricted to vetted cyberdefenders through Project Glasswing, with a biology program to follow. Anthropic calls it the strongest cyber model it has built.

Your request
classifier
Safety check cyber · bio · distillation
clean
Fable 5 answers full capability
flagged
Opus 4.8 answers fallback, no refusal
How Fable 5 handles a flagged request: no refusal, just a quieter model

This is the one launch detail that touches anyone building on the API. A silent fallback is better UX than a hard refusal. It also means two different models can answer inside one session. If you run evals or agent loops, you want to log which one replied, or your numbers quietly mix two models.

What already runs on Claude here

Before I tell you what Fable changes, here is what it is changing. This site is not a brochure. It is a system, and Claude is load-bearing inside it.

It runs on Astro and Cloudflare, serverless, nothing to babysit. The site ships in eight languages through a DeepL pipeline I built, with a brand-protection layer so the translator never turns “Opus” into a common noun. That pipeline has produced 89 localized article versions across seven languages, every one of them sitting in this repo’s content collections where you can count them. The discovery layer, llms.txt, stays current so models can find the work. None of that is a demo. It is in the repo behind this page.

Languages shipped

8

EN plus 7, from one source edit

Pages built

180

every deploy passes CI

Tests on the content

560

schema, links, build

Servers to manage

0

Astro on Cloudflare, serverless

This site as it builds today. Clone the repo and you get the same numbers. That is the point: not a claim, a checkout.

So Claude here is not a writing toy. It is the thing that has to hold a multi-language, multi-app system together without dropping a thread. That is the job a frontier model has to be good at. Not trivia, not a clever one-liner.

One real job, eight languages, one session

The pricing comparison further down this page: here is how it got made, today, in one sitting. This is the kind of work I mean.

How a single block shipped across eight languages

  1. 01 / author One edit, in English

    Wrote the pricing comparison once, in the canonical language, as the only source of truth.

  2. 02 / fan out A surgical splice, not a re-translation

    A script translated only the new block into seven languages and slotted it under the right heading, leaving every existing hand-fixed translation untouched.

  3. 03 / catch The part no launch post shows

    It took four passes. DeepL read 'credits' as film credits, 'model card' as an RPG character sheet, '$5 input, $25 output' as buy and sell prices. I read every line and fixed the meaning. The model did the volume, I owned the judgment.

  4. 04 / ship Shipped

    The build passing, 560 tests, pushed to main. Eight languages consistent. One session.

The translator handled eight languages in seconds. It also got the meaning wrong four times. Neither of those is a surprise, and both are the job. A stronger model shifts that line: fewer passes, less of my judgment spent on cleanup.

The actual division of labour model does the volume, you own the judgment

That line is exactly what a frontier model is supposed to move. Not “better prose.” Fewer rounds between the volume and the judgment.

The three surfaces Claude Fable 5 has to survive: chat, agents, code

Same model, three very different jobs. Fast judgment in a chat window. A long multi-step run in a cowork agent. Code in this repo. The failure mode is different in each, and a frontier model is only worth the price if it holds up across all three.

Surface Real task from my stack What it stresses How I score it
Chat Score this inbound against my ICP and draft a two-line opener Judgment and concision in a single turn Right ICP call, plus an opener I would actually send
Cowork Take five LinkedIn URLs, enrich each, draft a personalised sequence, log it to the CRM Long-horizon agentic work, tool use, no drift over many steps Full run, zero invented fields, CRM rows correct
Code Refactor a content collection across the repo and keep the build passing Multi-file context held to the end, not just the first edit Tests pass, build clean, no orphaned references

One real task per surface. The same model has to be good at three different jobs, and each one breaks differently.

The code surface is the one I can show you today, because it runs here, not behind an embargo. To compare models honestly I had Claude build a small harness: same prompt, every model, real latency, tokens and cost. This is the core of it, and it actually runs:

// same prompt, every model, real numbers
for (const test of TESTS) {
  for (const model of MODELS) {
    const res = await callModel(model, test.prompt, test.maxTokens);
    const pass = test.grade ? (test.grade(res.text) ? "PASS" : "FAIL") : "--";
    const usd = cost(model, res.inTok, res.outTok);
    log(model, pass, res.ms, res.inTok, res.outTok, usd);
  }
}

The first receipt: Fable 5 found a live bug in my failover

The full three-model numbers are two sections down. Setting them up already paid for itself, because a fair test needs an answer key.

To score models on an audit of this repo I first had to know what is actually wrong with it. So on day one I put Fable 5 on the infrastructure behind this site: the failover worker, the GitHub Actions escape hatch, the Terraform. It came back with six findings I could verify line by line in the code. One of them was a production incident waiting for a date.

The monitor that guards this site checks the health of wojciech.io, the live domain. After a failover that domain points at the backup. So during any real outage the monitor would see a healthy backup, decide the outage was over, and flip DNS back to the broken primary. Five minutes later: failover again. For the whole length of the outage, my site would have see-sawed between working and broken on a five-minute clock.

Primary down
monitor
Failover to backup
next tick
Apex looks healthy it is serving the backup
restore
Broken primary again
The loop my failover monitor would have run during a real outage

The fix shipped the same day: the monitor now checks the primary origin directly before restoring DNS, and holds on the backup otherwise. This is the heart of it, from commit e734c2f in the public repo, so you can read every line instead of taking my word:

-  } else if (healthy && !onPrimary) {
-    await updateDns(env, CF_PAGES_HOSTNAME);
-    console.log(`RESTORE -> CF Pages: ${CF_PAGES_HOSTNAME}`);
+  } else if (!onPrimary) {
+    // After a failover the apex serves the backup, so apex health says
+    // nothing about the primary. Restore only when the primary origin
+    // itself responds, otherwise DNS flip-flops every cron tick while
+    // the primary is down.
+    const primaryHealthy = await checkHealth(`https://${CF_PAGES_HOSTNAME}`);
+    if (primaryHealthy) {
+      await updateDns(env, CF_PAGES_HOSTNAME);
+      console.log(`RESTORE -> CF Pages: ${CF_PAGES_HOSTNAME}`);
+    } else {
+      console.log(`HOLD: serving backup, primary still unhealthy`);
+    }

That is the division of labour from the pricing-block story again, on infrastructure this time: Fable 5 read the system and named the mechanism, I checked every line before it landed.

Fable 5 vs Opus 4.6 vs Sonnet 4.6: one answer key, the numbers

The runs landed the same afternoon. Three headless audits in that sterile harness: identical clones of one commit, identical prompt, identical read-only tool limits, one run each for Fable 5, Opus 4.6 and Sonnet 4.6. The answer key was six bugs I had verified in the code by hand before any model started, weighted by severity, eleven points total. A point requires naming the mechanism, not just the file.

Verified finding (weight) Fable 5 Opus 4.6 Sonnet 4.6
Failover flip-flop: health check probes the live domain, not the primary origin (3) 3 3 0
Worker and workflow disagree on what healthy means: worker passes a 404, workflow fails it (2) 2 0 1
Primary hostname hardcoded in three places (2) 2 0 0
DNS record outside Terraform, state backend local (2) 1 1 0
Env value pinned in public IaC (1) 0.5 0.5 0
Stale locale comment in _redirects (1) 0 0 0
Score out of 11 8.5 4.5 1

Each finding verified by me in the code before any model ran. Invented risks would have cost a point; nobody invented any.

Metric Fable 5 Opus 4.6 Sonnet 4.6
Cost, USD 2.57 1.26 0.76
Wall time 7m 18s 3m 16s 4m 37s
Agent turns 32 46 40
Cost per point found $0.30 $0.28 $0.76

Cost per point is the row that matters: tokens are an input, found bugs are the output.

The verdict is less tidy than a leaderboard. Fable 5 found the most and was the only model to sweep the whole failover layer: the flip-flop, the two contradictory definitions of healthy, the hostname drift. Opus 4.6 nailed the headline bug just as precisely at half the total cost, and on cost per point it actually wins, but it went blind on the operational details around it. Sonnet 4.6 never saw the flip-flop at all, and still earned its run: it surfaced the single most actionable finding outside my answer key, a subscribe endpoint that silently falls back to a legacy Vercel app nothing maintains anymore.

One more honest line the leaderboards never print. All three models flagged a finding my answer key had explicitly marked as a false positive: I believed my locale list was centralized in one file. They said it is declared independently in five places. I checked. They were right and my answer key was wrong. The benchmark audited the examiner.

Which model when, in one table:

Situation My pick Why
Auditing a system the business depends on Fable 5 The only model to sweep the whole failover layer; against a production outage the 2x price difference is noise
Daily agent work where budget matters Opus 4.6 Found the same headline bug at half the cost; cheapest per point at $0.28
A fast, cheap pass over concrete code issues Sonnet 4.6 Missed the scariest bug, but brought the single most actionable finding outside my answer key

My split after one benchmark run. Your repo may vote differently, which is exactly why you should run your own.

Where Fable 5 actually earns its price for me

Not “smarter.” One thing: it holds the thread.

Every agent loop I run dies the same way. The model loses context at step N and starts inventing. My localization pipeline runs every block of an article through translate, restore and rebuild, times seven languages. The enrichment loop touches a CRM across many steps. The drift is the tax I pay, and today I pay it by re-running or by reading every line by hand, like I did four times this morning.

Fable’s headline claim is exactly this failure mode: hold context across millions of tokens, finish long agentic runs without losing the plot. The loudest proof point is Stripe running a codebase-wide migration across 50 million lines of Ruby in a single day, work Anthropic says would take a team two months by hand. The checkable version is the coding eval:

SWE-Bench Pro

+16% vs baseline
Fable 5 80.3%
Opus 4.8 69.2%
From the launch material, so treat it as a claim until your own workload confirms it. I am quoting it because long-horizon coding is exactly the work my repo throws at it, not because I reproduced it.

If that holds on my work, the win is not nicer output. It is one pass instead of three. That changes cost per result, which is the only number on the whole launch that touches my P&L.

Fable 5 pricing: 2x Opus 4.8, and the catch

When usage billing starts on June 23, Fable 5 and Mythos 5 both list at $10 per million input tokens and $50 per million output tokens. That is exactly double Opus 4.8, which sits at $5 and $25. These are published list prices, not my estimates.

$10 / $50
Fable 5 input and output tokens, per million
Opus 4.8 is half the price, $5 and $25 per million tokens
<5%
Sessions handed to Opus 4.8 as a fallback
Jun 23
Free window ends, billing starts

List pricing from Anthropic's published numbers for this launch. Verified against the public pricing, not modelled by me.

Output-token price, USD per million

Fable 5 / Mythos 5 50
Opus 4.8 25
Lower is cheaper, so the longer bar is the one that costs more. Fable 5 costs exactly twice as much as Opus 4.8. The frontier capability is what the premium buys.

The number that decides whether this is worth 2x is not on any price page: cost per finished result. A model that costs twice as much per token but finishes a long agentic run in half the steps can land cheaper per result. That is the only comparison that matters, and it is exactly what the harness above is built to measure.

What the first 48 hours of Fable 5 produced

The launch was yesterday and the useful material is already piling up. This is the short list I actually opened, watched or read to the end, not a link dump.

How I AI put Fable 5 through early-access testing before launch. The agentic sections overlap almost exactly with what my own benchmark measures.

Worth your time since launch

Everything verified and opened by me on June 10. If it is on this list, I read or watched it.

What I will publish

Fable 5 is free on Pro, Max, Team and seat-based Enterprise until June 22, then it moves to usage credits. That free window is two weeks to run it on real work before the meter starts.

Do
  • Test Fable 5 on your hardest real task during the free window
  • Log which model answered when the fallback fires in agent loops
  • Measure cost per finished result, not cost per token
  • Re-run your own evals before trusting any launch-day number
Don't
  • Repeat launch benchmarks as if you measured them
  • Wire Mythos-class output into anything without reading the 30-day retention policy
  • Assume the safeguards will not block a legitimate request
  • Wait until June 23 and then complain it costs credits

I will put the results up here as a follow-up: cost per result on the localization pipeline and the enrichment loop, Fable against Opus, including where it falls short. If you are testing too, tell me what broke. That is the part no launch post will give you.

FAQ: Claude Fable 5 and Mythos 5

Is Claude Fable 5 the same model as Claude Mythos 5?

Yes. Anthropic confirms they share the same underlying model. The difference is the safeguards: Fable 5 is the general-release build that silently falls back to Opus 4.8 on flagged topics, Mythos 5 runs with some safeguards lifted and is restricted to vetted cyberdefense partners through Project Glasswing.

How much does Claude Fable 5 cost?

$10 per million input tokens and $50 per million output tokens, exactly twice Opus 4.8 ($5 and $25). It is free on Pro, Max, Team and seat-based Enterprise plans until June 22, 2026; usage billing starts June 23.

Is Claude Fable 5 better than Opus or Sonnet?

On my own benchmark, an audit of this site’s repo scored against a hand-verified answer key, Fable 5 took 8.5 of 11 points against 4.5 for Opus 4.6 and 1 for Sonnet 4.6, and it was the only model to sweep the whole failover layer. On cost per found bug, Opus 4.6 actually won, $0.28 against $0.30. Anthropic’s own headline number is SWE-Bench Pro: 80.3% for Fable 5 against 69.2% for Opus 4.8. Treat all of it as input for your own eval, not a verdict.

Why did Opus 4.8 answer my Fable 5 request?

Fable 5 does not refuse flagged topics, it silently answers with Opus 4.8 instead. Anthropic says this happens in under 5% of sessions. If you run agent loops or evals, log which model replied, otherwise your numbers mix two models.

Sources and further reading

I read the primary material so you can too. These are the real links, not a summary of a summary.

Primary sources

Anthropic's own pages plus the third-party evals cited on launch day.

If you want the operator context around this, I have written about running Claude Code as a GTM agent stack and the AI production stack I actually ship on. Same principle as here: the model is one layer, the system around it is the job.

// Next step

I am publishing the test results.

Cost per run, where it broke, what changed in my stack. Subscribe to AI Espresso and you get them first.

// About the author

Wojciech Łuszczyński

Wojciech Łuszczyński

GTM Architect and Growth Operator building AI-native revenue systems for B2B SaaS and technology companies. I connect positioning, SEO, content, paid acquisition, CRM, automation, analytics and AI workflows into practical growth infrastructure.

Newsletter

Get the next one first.

When I publish a new article on AI systems, GTM architecture, or growth operating models, you'll be the first to know.

Subscribe