Show HN: ChatGPT's Infinite Memory Lies – Anthropic Paper Explainer

3 points

a year ago

I wrote a breakdown of Anthropic’s new paper on model honesty. It shows that language models frequently give misleading chain-of-thought explanations, even when trained for safety.

The essay includes visual diagrams, code examples, commentary on reward hacking, and implications for model alignment.

• Full explainer: https://open.substack.com/pub/marcovcsiliconvalley/p/chatgpt...
• Anthropic paper (PDF): https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...
• Anthropic blog post: https://www.anthropic.com/research/reasoning-models-dont-say...

Would love feedback and discussion.

2 comments