I wrote a breakdown of Anthropic’s new paper on model honesty. The paper shows that language models frequently give misleading chain-of-thought explanations, even when trained for safety.
The essay includes visual diagrams, code examples, commentary on reward hacking, and implications for model alignment.
• Full explainer: https://open.substack.com/pub/marcovcsiliconvalley/p/chatgpt...
• Anthropic paper (PDF): https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...
• Anthropic blog post: https://www.anthropic.com/research/reasoning-models-dont-say...
Would love feedback and discussion.