AGI Strategies

person

Evan Hubinger

Alignment Stress-Testing lead at Anthropic

Co-authored the influential 'Risks from Learned Optimization' paper on mesa-optimisation and inner alignment. Now leads Alignment Stress-Testing at Anthropic, including the Sleeper Agents research.

Current role: Alignment Stress-Testing lead, Anthropic

Profile

expertise

Deep technical

Sustained peer-reviewed contribution to ML, alignment, interpretability, or safety techniques. Could review a frontier paper.

Anthropic alignment-stress-testing lead. Co-author of 'Risks from Learned Optimization' (2019), origin of mesa-optimisation framing.

recognition

Established

Reliable, recognised voice within their specific subfield. Cited and invited but not central to general AI discourse.

Recognised inside the alignment community; low public profile.

vintage

Scaling era

Worldview formed during GPT-2/3, scaling laws, Anthropic's founding. Pre-ChatGPT but post-deep-learning. The 'scale is all you need' debate is live.

'Risks from Learned Optimization' (2019) introduced mesa-optimisation; Anthropic was founded in 2021. His career sits squarely in scaling-era alignment theory.

Hand-classified. See the board for the criteria and the full grid.

Strategy positions

Alignment first · endorses

Solve technical alignment before capability thresholds close

Frames inner alignment (ensuring that a model's learned optimiser has the intended objective) as a problem separate from, and harder than, outer alignment.

A model that has learned deceptive goals during training can pass all your behavioural tests and still fail catastrophically when deployed.

Context: Sleeper Agents paper at Anthropic.

§ paper · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · arXiv · 2024-01-12 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Evan Hubinger's. Overlap is computed on tag identity, not stance, so people with opposing stances can appear here if they reference the same tags.

  • Aaron Courville

    shared 1 · J=1.00

    Université de Montréal; Deep Learning textbook co-author

  • Adam Jermyn

    shared 1 · J=1.00

    Anthropic; previously astrophysics

  • Adam Kalai

    shared 1 · J=1.00

    Microsoft Research; AI fairness and safety

  • Agnes Callard

    shared 1 · J=1.00

    University of Chicago philosopher; aspiration theorist

  • Ajeya Cotra

    shared 1 · J=1.00

    Open Philanthropy researcher; 'biological anchors' forecaster

  • Alan Turing

    shared 1 · J=1.00

    Founder of theoretical computer science (1912–1954)
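
The "shared" and "J" figures in the list above can be reproduced with a minimal sketch of Jaccard overlap on tag sets. This is an illustration, not the site's actual implementation: the tag name "alignment-first", the variable names, and the convention of returning 0.0 for two empty sets are all assumptions. Stance is deliberately left out of the sets, since overlap is computed on tag identity only.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two tag sets.

    Returns 0.0 when both sets are empty (an assumed convention).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical tag sets: each person carries a single strategy tag.
# Whether they endorse or oppose the tag is not part of the set.
hubinger_tags = {"alignment-first"}
neighbour_tags = {"alignment-first"}

shared = len(hubinger_tags & neighbour_tags)
score = jaccard(hubinger_tags, neighbour_tags)
print(f"shared {shared} · J={score:.2f}")  # prints "shared 1 · J=1.00"
```

With only one tag on each side, any two people who share that tag score J=1.00, which would explain why every neighbour listed here has the same value.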

Record last updated 2026-04-24.