AGI Strategies

person

Jacob Andreas

MIT NLP; language models as belief reports

MIT EECS professor whose research has examined whether language models develop interpretable internal world-models and structured representations of belief.

Current: Associate Professor, EECS / CSAIL, MIT

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Argues language models develop richer internal structure than behavior alone reveals; mechanistic and probing techniques are required to understand what they 'believe'.

Language models contain structured representations of the agents and situations described in their inputs. Reading those representations is closer to ethnography than to prompt engineering.
§ paper · Language Models as Agent Models · arXiv · 2022 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Jacob Andreas's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.

  • Asma Ghandeharioun

    shared 1 · J=1.00

    Google DeepMind; 'Patchscopes' for LLM interpretability

  • Chris Olah

    shared 1 · J=1.00

Anthropic interpretability co-founder; pioneer of modern mechanistic interpretability

  • Cynthia Rudin

    shared 1 · J=1.00

    Duke professor; interpretable ML pioneer

  • David Bau

    shared 1 · J=1.00

    Northeastern; mechanistic interpretability of LLMs

  • Fernanda Viégas

    shared 1 · J=1.00

    Harvard; ex-Google PAIR; data visualization

  • John Wentworth

    shared 1 · J=1.00

    Independent alignment researcher; natural abstractions

Record last updated 2026-04-25.