When I transferred 237 of those corrections as rules to a new agent to save time with onboarding in a new repo, it made 44 new mistakes. 13 were in categories the rules explicitly covered. The rules were present in context. The behavior wasn't there. I published the field study with full correction logs.
Then Meta's Superintelligence Labs published HyperAgents (arXiv:2603.19461, March 2026). They found the complementary result: improvements DO transfer across domains when embodied in executable mechanisms (persistent memory, performance tracking, eval loops), not when written as rule text. Two independent studies, same boundary: documentation is not behavior.
So I built Calx. pip install getcalx gives you a CLI + MCP server that:
Captures corrections developers make to AI agents Detects recurrence via keyword similarity (Jaccard), auto-promotes at 3x threshold Promotes recurring corrections to enforced rules and hooks, injected at session start Scopes rules per domain/directory so each agent gets only what's relevant
It runs as a FastMCP server over Streamable HTTP (SQLite locally) so any MCP-compatible client connects: Claude Code, Claude Desktop, Cursor, custom agents. It is primarily designed for Claude Code. It also handles token discipline (prevents context compaction from destroying correction signal), multi-agent orchestration, session lifecycle hooks, orientation gates, and dirty-exit recovery.
The difference from agent memory tools: existing agent memory systems store information for retrieval. Calx tracks the behavioral plane, how an agent works with a specific person, not just what it knows. The data shows the information plane alone doesn't reliably change behavior.
v0.5.0, 443 tests, MIT license. Paper with full evidence: https://doi.org/10.5281/zenodo.19159223