Anthropic Explains Hidden Safety Guardrail in Claude Fable 5

Anthropic has released Claude Fable 5, a top-tier model in its Mythos lineup, to the public. The company added extra safety controls to the system. These safeguards were designed to operate without being shown to users.

The decision quickly drew criticism. Some users said the hidden protections reduced performance in advanced use cases. These included frontier AI development tasks such as training workflows and chip design work.

Industry observers, including SemiAnalysis, described the approach as “secret sabotage.” The backlash led Anthropic to respond publicly within days.

The company said the model included invisible restrictions intended to prevent AI distillation, the practice of using output from a large model to train smaller competing systems. Anthropic said such usage can change model behavior and affect output quality.

The firm admitted the choice to keep safeguards hidden was made to speed up deployment. It also acknowledged that the balance between safety and transparency was not handled well.

In a post on X, Anthropic said users should be able to see when safety systems are active. It added that visible controls take more time to build because they must resist testing and bypass attempts. Invisible systems allowed faster rollout but reduced transparency.

Anthropic apologized for the decision and said the approach will change going forward.

The company said Claude Fable 5 will now use visible fallback behavior instead of silent adjustments. When a request is flagged, the system will switch to Claude Opus 4.8. Users will be notified each time this happens.
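The visible fallback described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Anthropic's implementation: the model identifiers, the `route_request` function, the `is_flagged` classifier and the `call_model` helper are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical identifiers; the article does not give real API model names.
PRIMARY_MODEL = "claude-fable-5"
FALLBACK_MODEL = "claude-opus-4.8"


@dataclass
class RoutedResponse:
    model: str              # which model actually answered
    text: str               # the model's reply
    notice: Optional[str]   # user-facing message, set only when a fallback occurred


def route_request(
    prompt: str,
    is_flagged: Callable[[str], bool],      # assumed safety classifier
    call_model: Callable[[str, str], str],  # assumed (model, prompt) -> reply helper
) -> RoutedResponse:
    """Answer with the primary model unless the classifier flags the prompt."""
    if is_flagged(prompt):
        # Visible fallback: switch models and tell the user about it.
        notice = f"This request was flagged and answered by {FALLBACK_MODEL}."
        return RoutedResponse(FALLBACK_MODEL, call_model(FALLBACK_MODEL, prompt), notice)
    return RoutedResponse(PRIMARY_MODEL, call_model(PRIMARY_MODEL, prompt), None)
```

The point of the sketch is that the switch is never silent: every flagged request returns a response that names the model used and carries a notice for the user.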

Anthropic said this applies to sensitive areas such as cybersecurity and bio-related queries as well. The same visible routing system will now be used across those categories.

The company also noted a tradeoff: making safeguards visible can make them easier to bypass, which may lead to more false positives while detection systems are improved.

Anthropic said it is also refining its classifiers to reduce unnecessary triggers on normal requests. It acknowledged frustration from users and said the goal is to shorten the adjustment period as much as possible.


Written by Hajra Naz
