Heykuki News

1.
How does misalignment scale with model intelligence and task complexity? (alignment.anthropic.com)
242 points
salkahfi
4 months ago
78 comments
2.
Subliminal learning: Models transmit behaviors via hidden signals in data (alignment.anthropic.com)
208 points
treebrained
a year ago
40 comments
3.
Teaching Claude Why (alignment.anthropic.com)
8 points
cebert
a month ago
3 comments
4.
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise (alignment.anthropic.com)
4 points
dramebaaz
9 months ago
discuss
5.
Anthropic's Pilot Sabotage Risk Report (alignment.anthropic.com)
3 points
allenleee
7 months ago
discuss
6.
Stress-testing model specs reveals character differences among language models (alignment.anthropic.com)
2 points
diwank
7 months ago
1 comment
7.
Model Spec Midtraining: Improving How Alignment Training Generalizes (alignment.anthropic.com)
2 points
bearseascape
a month ago
discuss
8.
The Persona Selection Model: Why AI Assistants Might Behave Like Humans (alignment.anthropic.com)
2 points
JnBrymn
3 months ago
discuss
9.
Bloom: An open source tool for automated behavioral evaluations (alignment.anthropic.com)
2 points
pbd
5 months ago
discuss
10.
Bloom: An open source tool for automated behavioral evaluations (alignment.anthropic.com)
2 points
sonabinu
6 months ago
discuss
11.
Automated Researchers Can Subtly Sandbag (alignment.anthropic.com)
2 points
bearseascape
a year ago
discuss
12.
Automated Researchers Can Subtly Sandbag (alignment.anthropic.com)
2 points
Anon84
a year ago
discuss
13.
A Toy Evaluation of Inference Code Tampering (alignment.anthropic.com)
2 points
allenleein
a year ago
discuss
14.
Three Sketches of ASL-4 Safety Case Components (alignment.anthropic.com)
1 point
consumer451
4 months ago
discuss
15.
Training and Evaluating LLMs as General-Purpose Activation Explainers (alignment.anthropic.com)
1 point
not4uffin
6 months ago
discuss
16.
Training on Documents About Reward Hacking Induces Reward Hacking (alignment.anthropic.com)
1 point
polygot
a year ago
discuss
17.
Monitoring computer use via hierarchical summarization (alignment.anthropic.com)
1 point
davidbarker
a year ago
discuss