Capability vs. trust
Anthropic’s Fable 5, tested.

Hi, and happy Tuesday.
Sometimes, AI labs feel like 7-year-olds who struggle to keep a secret.
On April 7th - about two months ago - Anthropic announced Mythos, a frontier model so scary, it needed to be kept under wraps and away from the public.
On June 4th - almost two weeks ago - Anthropic urged the AI industry to slow down, because society needs more time to prepare for what is coming next.
So it was surprising that last week, June 9th, Anthropic released a “Mythos class” model called Fable.
For those of us in the AI bubble, this was akin to Taylor Swift dropping a new album.
So what did we do?
We dropped everything and checked it out.
What did the parents - the US government - do?
Promptly shut it down.
(I expect the government censure to be short-lived.)
Nevertheless, here are some first impressions from our team, from the ~48 hour period during which Fable was available.
João Guerreiro, Technical Director - based out of the UK:
The biology protections are really extreme. As soon as it finds reference to a bacteria, an immune factor, a cell type, it will immediately downgrade to using a less powerful model.
Maikel Boot, Technical Director - based out of the Netherlands:
For non-bio stuff it’s quite powerful. Seems to be able to navigate complex problems more autonomously than Opus (needs less steering). It seems to make better use of MCPs and connectors. It’s much harder to vet the validity of the info received because it just covers such vast amounts of data and sources.
Ryan LaRanger, Technical Director - based out of the US:
Its outputs absolutely still need editing, but it is more proactive about finding material. It did a good job of navigating the Google Drive and finding the files/gaps it needed to build a unified spreadsheet that covers all of the data.
Mariam Jomha, Marketing Director - based out of Lebanon:
Can be a little too proactive. For example, I asked how to do something using Claude Cowork. It responded with how, but then said it’s so much easier building it as a standalone html and went ahead and started building without asking me. It ate up my credits before finishing.
Natalia Shestaka, Product Designer - based out of Spain:
If you do not give it design constraints, it designs everything using Anthropic colors and fonts… it seems like Anthropic’s goal is to make all of the internet look like Anthropic.
When ChatGPT went viral in late 2022, I thought the trajectories on which AI would improve would be:
Context window size - the amount of information a model can hold in a single conversation before it gets forgetful.
Ability to "connect the dots" - bigger neural networks coming up with new-to-the-world discoveries.
Multimodality - speaking to a model and being shown a video in response, for example.
I was wrong. Context windows plateaued at 1 million tokens (~2000 pages of text), and the other trajectories have been slow to materialize.
Instead, what we have seen is rapid improvement in the AI models’ ability to work for longer: to undertake multiple steps in pursuit of a goal, using tools in each step and reasoning about the results of each step.
As you have probably experienced yourself, the more time you have to work on something, the more likely you are to produce a good result. This is the trajectory on which the models have improved the most.
One way in which this is captured is METR's "time horizon" benchmark.
Instead of measuring how long it takes an AI to do a task, METR measures tasks by how long they usually take a skilled human to complete - whether it's a 30-minute chore, a 5-hour project, or a 2-day assignment. Anchoring to human effort is more useful than clocking how long an AI spends spinning its wheels.

In the chart:
The 50% Line is the "Coin Flip" mark - the AI successfully completes tasks that take humans this long about half the time, i.e. "the AI can sometimes pull it off".
The 80% Line is the "Trust" mark - the AI successfully completes tasks of this length 80% of the time, i.e. this is the threshold where the AI becomes genuinely useful.
An interesting point from this graph is that the 50% line has climbed three orders of magnitude while the 80% line crawled from ~1 minute to ~27 minutes - a gap that widened from roughly 5x to over 10x. The capability has been outrunning reliability.
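For readers who want to see the mechanics behind the two lines, here is a minimal sketch. It assumes success probability follows a logistic curve in the log of human task length, which is the general shape METR describes fitting. The function names, the slope, and the 60-minute figure are made-up illustration values, not METR data.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def success_probability(minutes: float, h50_minutes: float, slope: float) -> float:
    """Assumed model: success rate falls off logistically as tasks get longer.

    h50_minutes is the task length with a 50% success rate;
    slope controls how quickly success drops per doubling of task length.
    """
    x = slope * (math.log2(h50_minutes) - math.log2(minutes))
    return 1.0 / (1.0 + math.exp(-x))


def time_horizon(p: float, h50_minutes: float, slope: float) -> float:
    """Task length (in minutes) at which the assumed curve predicts success rate p."""
    return h50_minutes * 2.0 ** (-logit(p) / slope)


# Hypothetical model: 50% horizon of 60 minutes, slope of 0.6 (illustrative only).
H50, SLOPE = 60.0, 0.6
h80 = time_horizon(0.8, H50, SLOPE)
print(f"50% horizon: {time_horizon(0.5, H50, SLOPE):.1f} min")
print(f"80% horizon: {h80:.1f} min")
print(f"gap (50% / 80%): {H50 / h80:.1f}x")
```

Under this assumed curve, the gap between the two lines depends only on the slope: a flatter curve, meaning success that decays slowly as tasks get longer, pushes the 80% horizon further below the 50% horizon.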
METR hasn't published a clean 80% figure for Opus 4.6 or anything for Fable 5, which is why both series stop where they do. But our testing suggests Fable makes a noticeable jump in both capability and reliability.
If this is the trajectory of improvement, what does this mean for how we work with AI?
AI is becoming less something you work with and more something you delegate to. Increasingly, we will all be learning how to let the AI be industrious and take its own initiative, while giving it enough context to stay aligned with our goals - just like with a colleague.
When it comes back online, and you test Fable yourself, keep this lens in mind and let me know what you find.
If you want to dive deeper into METR and other benchmarks I consider important, I put together the following:
We’re pushing on what long-running AI models can do, as described in this one-pager: https://drive.google.com/file/d/1ybsFhBx-mAujTv8hzLrDb5S8BC__D7DJ/view?usp=sharing

Dino