
Jacob Andreas
MIT NLP; language models as belief reports
MIT EECS professor whose research has examined whether language models develop interpretable internal world-models and structured representations of belief.
Current: Associate Professor, EECS / CSAIL, MIT
Strategy positions
Interpretability bet: endorses
Position: "Mechanistic interpretability is necessary and sufficient to know models are safe." Argues language models develop richer internal structure than behavior alone reveals; mechanistic and probing techniques are required to understand what they 'believe'.
Language models contain structured representations of the agents and situations described in their inputs. Reading those representations is closer to ethnography than to prompt engineering.
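As an illustration of the probing techniques this position references, here is a minimal sketch, assuming the transformers, torch, and scikit-learn packages. The sentences, labels, layer choice, and probe target are hypothetical stand-ins for exposition, not Andreas's actual experimental setup.

```python
# Minimal probing sketch (illustrative, not Andreas's actual method):
# train a linear probe on GPT-2 hidden states to test whether a simple
# situation property is linearly decodable from the representation.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

# Hypothetical mini-dataset: does the text entail the object is in the container?
texts = [
    "Sam put the keys in the drawer.",
    "Sam took the keys out of the drawer.",
    "The ball is inside the box.",
    "The ball is no longer inside the box.",
]
labels = [1, 0, 1, 0]

def last_token_state(text: str, layer: int = 6):
    """Hidden state of the final token at a chosen layer (layer 6 is arbitrary)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].numpy()

X = [last_token_state(t) for t in texts]
probe = LogisticRegression(max_iter=1000).fit(X, labels)
# Training accuracy only; a real probing study needs held-out prompts
# and controls (e.g., random-label baselines) to support any claim.
print("probe train accuracy:", probe.score(X, labels))
```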
Closest strategy neighbours
By Jaccard overlap: other people whose strategy tags overlap with Jacob Andreas's. Overlap is on tag identity, not stance, so people with opposing views can appear if they reference the same tags; a sketch of the metric follows.
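For concreteness, a minimal sketch of the Jaccard overlap the ranking describes; the tag names below are hypothetical, not the site's actual tag vocabulary.

```python
# Jaccard overlap between two strategy-tag sets: |A & B| / |A | B|,
# computed over tag identity only, so stance (endorse vs. dispute)
# is ignored, exactly as the note above describes.
def jaccard(tags_a: set[str], tags_b: set[str]) -> float:
    if not (tags_a or tags_b):
        return 0.0  # convention for two empty tag sets
    return len(tags_a & tags_b) / len(tags_a | tags_b)

# Hypothetical tag sets for illustration:
andreas = {"interpretability-bet", "world-models-exist"}
other = {"interpretability-bet", "scaling-is-enough"}
print(jaccard(andreas, other))  # 1 shared tag / 3 total tags = 0.333...
```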
Record last updated 2026-04-25.