person

Alex Pan

Berkeley CHAI; reward hacking

PhD student in computer science at UC Berkeley's Center for Human-Compatible AI under Stuart Russell. Focuses on reward hacking and emergent misalignment in RL.

current PhD Researcher, UC Berkeley CHAI

homepage @aypan_17

Strategy positions

Alignment firstendorses

Solve technical alignment before capability thresholds close

Argues reward hacking, models exploiting flaws in their training objective, is a tractable empirical problem that demands more attention from the alignment community.

Reward hacking shows up reliably across a range of agents and tasks. The good news is that it is empirically studyable; the bad news is that it does not have a known general solution.

§ paperThe Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models· arXiv / ICLR· 2022· faithful paraphrase

Closest strategy neighbours

by jaccard overlap

Other people whose strategy tags overlap with Alex Pan's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.

Aaron Courville
shared 1 · J=1.00
Université de Montréal; Deep Learning textbook co-author
Adam Jermyn
shared 1 · J=1.00
Anthropic; previously astrophysics
Adam Kalai
shared 1 · J=1.00
Microsoft Research; AI fairness and safety
Agnes Callard
shared 1 · J=1.00
University of Chicago philosopher; aspiration theorist
Ajeya Cotra
shared 1 · J=1.00
Open Philanthropy researcher; 'biological anchors' forecaster
Alan Turing
shared 1 · J=1.00
Founder of theoretical computer science (1912–1954)

Record last updated 2026-04-25.