AGI Strategies

person

Joar Skalse

Oxford researcher; reward-hacking formalism

Oxford AI safety researcher who co-authored foundational work defining when reward hacking can occur in learned reward models.

current · AI safety researcher, Oxford University

Strategy positions

Alignment first · endorses

Solve technical alignment before capability thresholds close

Formalises reward-hacking failures in learned reward models; provides technical grounding for specification-gaming concerns.

We can formally characterise the conditions under which a learned reward model is hackable. The characterisation lets us design training regimes that reduce the attack surface.
§ paper · Defining and Characterizing Reward Hacking · arXiv · 2022 · loose paraphrase
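The quoted claim can be made concrete. Loosely following the paper's definition, a proxy reward is *hackable* relative to the true reward (over some policy set) if switching policies can strictly improve the proxy's expected return while strictly worsening the true return. A minimal sketch over a finite policy set, with hypothetical policy names and return values chosen for illustration:

```python
from itertools import combinations

def is_hackable(proxy_returns, true_returns):
    """Check whether a proxy reward is hackable relative to the true reward
    over a finite policy set: the pair is hackable if some policy swap
    strictly increases the proxy return while strictly decreasing the true one.

    proxy_returns, true_returns: dicts mapping policy name -> expected return.
    """
    for a, b in combinations(proxy_returns, 2):
        d_proxy = proxy_returns[b] - proxy_returns[a]
        d_true = true_returns[b] - true_returns[a]
        if d_proxy * d_true < 0:  # the two rewards strictly disagree on this pair
            return True
    return False

# Toy example: the proxy prefers pi_gamed, but the true reward prefers pi_safe,
# so optimising the proxy can actively hurt the true objective.
proxy = {"pi_safe": 1.0, "pi_gamed": 2.0}
true_ = {"pi_safe": 2.0, "pi_gamed": 0.5}
print(is_hackable(proxy, true_))  # True
```

The design point mirrors the quote: hackability is a property of the *pair* of reward functions (and the policy set), so it can be checked, and training regimes can be evaluated by whether they shrink the set of such disagreeing pairs.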

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Joar Skalse's. Overlap is computed on tag identity, not stance, so people with opposing positions can appear if they reference the same tags.

  • Aaron Courville

    shared 1 · J=1.00

    Université de Montréal; Deep Learning textbook co-author

  • Adam Jermyn

    shared 1 · J=1.00

    Anthropic; previously astrophysics

  • Adam Kalai

    shared 1 · J=1.00

    Microsoft Research; AI fairness and safety

  • Agnes Callard

    shared 1 · J=1.00

    University of Chicago philosopher; aspiration theorist

  • Ajeya Cotra

    shared 1 · J=1.00

    Open Philanthropy researcher; 'biological anchors' forecaster

  • Alan Turing

    shared 1 · J=1.00

    Founder of theoretical computer science (1912–1954)
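The `J` values above are Jaccard indices over tag sets: the size of the intersection divided by the size of the union. A minimal sketch (the tag names are hypothetical, not taken from this record):

```python
def jaccard(tags_a, tags_b):
    """Jaccard index of two tag sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# With exactly one shared tag and no others on either side, J = 1/1 = 1.00,
# which is why every 'shared 1' entry above also shows J=1.00.
print(jaccard({"alignment-first"}, {"alignment-first"}))  # 1.0
```

Note that a single shared tag yields J=1.00 only when neither person carries additional tags; a neighbour with two tags, one shared, would score J=0.50.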

Record last updated 2026-04-25.