Search: alignment.anthropic.com | Heykuki News

Heykuki News

Top New Best Ask Show Jobs

1.

How does misalignment scale with model intelligence and task complexity? (alignment.anthropic.com)

242 points

4 months ago

2.

Subliminal learning: Models transmit behaviors via hidden signals in data (alignment.anthropic.com)

208 points

a year ago

3.

Teaching Claude Why (alignment.anthropic.com)

8 points

a month ago

4.

Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise (alignment.anthropic.com)

4 points

9 months ago

5.

Anthropic's Pilot Sabotage Risk Report (alignment.anthropic.com)

3 points

7 months ago

6.

Stress-testing model specs reveals character differences among language models (alignment.anthropic.com)

2 points

7 months ago

7.

Model Spec Midtraining: Improving How Alignment Training Generalizes (alignment.anthropic.com)

2 points

a month ago

8.

The Persona Selection Model: Why AI Assistants Might Behave Like Humans (alignment.anthropic.com)

2 points

3 months ago

9.

Bloom: An open source tool for automated behavioral evaluations (alignment.anthropic.com)

2 points

5 months ago

10.

Bloom: An open source tool for automated behavioral evaluations (alignment.anthropic.com)

2 points

6 months ago

11.

Automated Researchers Can Subtly Sandbag (alignment.anthropic.com)

2 points

a year ago

12.

Automated Researchers Can Subtly Sandbag (alignment.anthropic.com)

2 points

a year ago

13.

A Toy Evaluation of Inference Code Tampering (alignment.anthropic.com)

2 points

a year ago

14.

Three Sketches of ASL-4 Safety Case Components (alignment.anthropic.com)

1 point

4 months ago

15.

Training and Evaluating LLMs as General-Purpose Activation Explainers (alignment.anthropic.com)

1 point

6 months ago

16.

Training on Documents About Reward Hacking Induces Reward Hacking (alignment.anthropic.com)

1 point

a year ago

17.

Monitoring computer use via hierarchical summarization (alignment.anthropic.com)

1 point

a year ago