person

Evan Hubinger

Alignment Stress-Testing lead at Anthropic

Co-authored the influential 'Risks from Learned Optimization' paper on mesa-optimisation and inner alignment. Now leads Alignment Stress-Testing at Anthropic, including the Sleeper Agents research.

current Alignment Stress-Testing lead, Anthropic

Profile

expertise

Deep technical

Sustained peer-reviewed contribution to ML, alignment, interpretability, or safety techniques. Could review a frontier paper.

Alignment Stress-Testing lead at Anthropic. Co-author of 'Risks from Learned Optimization' (2019), the origin of the mesa-optimisation framing.

recognition

Established

Reliable, recognised voice within their specific subfield. Cited and invited but not central to general AI discourse.

Recognised inside alignment community; low public profile.

vintage

Scaling era

Worldview formed during GPT-2/3, scaling laws, Anthropic's founding. Pre-ChatGPT but post-deep-learning. The 'scale is all you need' debate is live.

'Risks from Learned Optimization' (2019) introduced mesa-optimisation. Anthropic 2021. Career is squarely scaling-era alignment theory.

Hand-classified. See the board for the criteria and the full grid.

Strategy positions

Alignment first · endorses

Solve technical alignment before capability thresholds close

Frames inner alignment (ensuring a model's learned optimiser has the intended objective) as a problem separate from, and harder than, outer alignment.

A model that has learned deceptive goals during training can pass all your behavioural tests and still fail catastrophically when deployed.

Context: Sleeper Agents paper at Anthropic.

§ paper · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · arXiv · 2024-01-12 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Evan Hubinger's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.

Aaron Courville
shared 1 · J=1.00
Université de Montréal; Deep Learning textbook co-author
Adam Jermyn
shared 1 · J=1.00
Anthropic; previously astrophysics
Adam Kalai
shared 1 · J=1.00
Microsoft Research; AI fairness and safety
Agnes Callard
shared 1 · J=1.00
University of Chicago philosopher; aspiration theorist
Ajeya Cotra
shared 1 · J=1.00
Open Philanthropy researcher; 'biological anchors' forecaster
Alan Turing
shared 1 · J=1.00
Founder of theoretical computer science (1912–1954)
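
The J values above are consistent with the standard Jaccard index, the size of the intersection of two tag sets divided by the size of their union. The following is a minimal sketch under that assumption, with each person represented by a set of strategy tag identifiers. The function name and the tag names are hypothetical and chosen for illustration; this is not the site's actual implementation.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical tag sets; the tag identifiers are illustrative only.
hubinger = {"alignment-first"}
neighbour = {"alignment-first"}
broader = {"alignment-first", "governance", "pause"}

# Overlap is computed on tag identity only, so stance (endorses/opposes) plays no part.
print(f"shared {len(hubinger & neighbour)} · J={jaccard(hubinger, neighbour):.2f}")
print(f"shared {len(hubinger & broader)} · J={jaccard(hubinger, broader):.2f}")
```

Under this definition, two people who each carry the same single tag score J=1.00 whatever their stance on it, which would explain why every neighbour listed with "shared 1" also shows J=1.00. A person with the same shared tag plus two others would score 1/3.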

Record last updated 2026-04-24.