person
Joar Skalse
Oxford researcher; reward-hacking formalism
Oxford AI safety researcher who co-authored foundational work defining when reward hacking can occur in learned reward models.
Current: AI safety researcher, Oxford University
Strategy positions
Alignment first (endorses)
Solve technical alignment before capability thresholds close.
Formalises reward-hacking failures in learned reward models; provides technical grounding for specification-gaming concerns.
We can formally characterise the conditions under which a learned reward model is hackable. The characterisation lets us design training regimes that reduce the attack surface.
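As a rough illustration of the kind of characterisation referenced here (a sketch based on the published hackability definition in Skalse et al.'s reward-hacking work, with notation chosen for this record rather than taken from it), a proxy reward can be called hackable relative to the true reward over a policy class $\Pi$ when improving the proxy return can strictly worsen the true return:
\[
\exists\, \pi, \pi' \in \Pi :\quad
J_{\mathrm{proxy}}(\pi) > J_{\mathrm{proxy}}(\pi')
\quad\text{and}\quad
J_{\mathrm{true}}(\pi) < J_{\mathrm{true}}(\pi'),
\]
where $J_R(\pi)$ denotes the expected return of policy $\pi$ under reward $R$; the pair is unhackable if no such $\pi, \pi'$ exist. Conditions ruling out such policy pairs are what the quoted position treats as a design target for reward-model training.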
Closest strategy neighbours
By Jaccard overlap. Other people whose strategy tags overlap with Joar Skalse's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.
Record last updated 2026-04-25.