Heykuki News
Top
New
Best
Ask
Show
Jobs
1.
How does misalignment scale with model intelligence and task complexity?
(alignment.anthropic.com)
242 points
salkahfi
4 months ago
78 comments
2.
Subliminal learning: Models transmit behaviors via hidden signals in data
(alignment.anthropic.com)
208 points
treebrained
a year ago
40 comments
3.
Teaching Claude Why
(alignment.anthropic.com)
8 points
cebert
a month ago
3 comments
4.
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
(alignment.anthropic.com)
4 points
dramebaaz
9 months ago
discuss
5.
Anthropic's Pilot Sabotage Risk Report
(alignment.anthropic.com)
3 points
allenleee
7 months ago
discuss
6.
Stress-testing model specs reveals character differences among language models
(alignment.anthropic.com)
2 points
diwank
7 months ago
1 comment
7.
Model Spec Midtraining: Improving How Alignment Training Generalizes
(alignment.anthropic.com)
2 points
bearseascape
a month ago
discuss
8.
The Persona Selection Model: Why AI Assistants Might Behave Like Humans
(alignment.anthropic.com)
2 points
JnBrymn
3 months ago
discuss
9.
Bloom: An open source tool for automated behavioral evaluations
(alignment.anthropic.com)
2 points
pbd
5 months ago
discuss
10.
Bloom: An open source tool for automated behavioral evaluations
(alignment.anthropic.com)
2 points
sonabinu
6 months ago
discuss
11.
Automated Researchers Can Subtly Sandbag
(alignment.anthropic.com)
2 points
bearseascape
a year ago
discuss
12.
Automated Researchers Can Subtly Sandbag
(alignment.anthropic.com)
2 points
Anon84
a year ago
discuss
13.
A Toy Evaluation of Inference Code Tampering
(alignment.anthropic.com)
2 points
allenleein
a year ago
discuss
14.
Three Sketches of ASL-4 Safety Case Components
(alignment.anthropic.com)
1 point
consumer451
4 months ago
discuss
15.
Training and Evaluating LLMs as General-Purpose Activation Explainers
(alignment.anthropic.com)
1 point
not4uffin
6 months ago
discuss
16.
Training on Documents About Reward Hacking Induces Reward Hacking
(alignment.anthropic.com)
1 point
polygot
a year ago
discuss
17.
Monitoring computer use via hierarchical summarization
(alignment.anthropic.com)
1 point
davidbarker
a year ago
discuss