person

John Wentworth

Independent alignment researcher; natural abstractions

Independent alignment researcher who developed the 'natural abstractions hypothesis' as a framing for whether human concepts robustly transfer to learned representations.

current Alignment Researcher, Independent

@johnwentworth

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Argues alignment requires identifying the abstractions a model converges on; if these match human concepts, training-time supervision becomes far more reliable.

The natural abstractions hypothesis is roughly: a wide variety of cognitive systems will converge to use the same high-level abstractions for reasoning about the world.

✍ blog · The Natural Abstraction Hypothesis: Implications and Evidence · LessWrong · 2021 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with John Wentworth's. Overlap is computed on tag identity, not stance, so people with opposing views can appear here if they reference the same tags.

Asma Ghandeharioun
shared 1 · J=1.00
Google DeepMind; 'Patchscopes' for LLM interpretability
Chris Olah
shared 1 · J=1.00
Anthropic interpretability co-founder; inventor of modern mech interp
Cynthia Rudin
shared 1 · J=1.00
Duke professor; interpretable ML pioneer
David Bau
shared 1 · J=1.00
Northeastern; mechanistic interpretability of LLMs
Fernanda Viégas
shared 1 · J=1.00
Harvard; ex-Google PAIR; data visualization
Jacob Andreas
shared 1 · J=1.00
MIT NLP; language models as belief reports
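
How the overlap figures are computed is not documented on this page. The sketch below assumes the standard Jaccard index, J = |A ∩ B| / |A ∪ B|, applied to each person's set of strategy tags. The tag name `interpretability-bet` and the function `jaccard` are hypothetical, chosen only for illustration.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical tag sets. Stance (endorses / opposes) is deliberately left out,
# which is why two people with opposite views can still score J = 1.00.
wentworth = {"interpretability-bet"}
neighbour = {"interpretability-bet"}

shared = len(wentworth & neighbour)
print(f"shared {shared} · J={jaccard(wentworth, neighbour):.2f}")
```

Under this assumption, any two people who each carry only the same single tag score the maximum of 1.00, which would explain why every neighbour listed here shows the same value.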

Record last updated 2026-04-25.