AGI Strategies


Neel Nanda

Mechanistic interpretability team lead at Google DeepMind

Pedagogical mechanistic interpretability researcher who runs one of the largest interpretability research teams. Publishes extensively on how to do mech interp research and trains the next generation of researchers.

Current: Mechanistic interpretability team lead, Google DeepMind

Profile

expertise

Deep technical

Sustained peer-reviewed contribution to ML, alignment, interpretability, or safety techniques. Could review a frontier paper.

Leads mechanistic-interpretability work at Google DeepMind. TransformerLens library author; large public corpus of interpretability tutorials and papers.

recognition

Established

Reliable, recognised voice within their specific subfield. Cited and invited but not central to general AI discourse.

Recognised name in interpretability circles; less prominent than Olah or Anthropic leadership.

vintage

Scaling era

Worldview formed during GPT-2/3, scaling laws, Anthropic's founding. Pre-ChatGPT but post-deep-learning. The 'scale is all you need' debate is live.

Active interpretability publishing from ~2021; TransformerLens during scaling era. His priors are post-GPT-2.

Hand-classified. See the board for the criteria and the full grid.

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Advocates mechanistic interpretability as a scalable safety tool; also writes accessible tutorials to grow the research field.

Interpretability is, I think, the most promising general-purpose alignment approach.
blog · Neel Nanda, homepage · neelnanda.io · 2023 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Neel Nanda's. Overlap is computed on tag identity, not stance, so people with opposing positions can appear here if they reference the same tags.

  • Asma Ghandeharioun

    shared 1 · J=1.00

    Google DeepMind; 'Patchscopes' for LLM interpretability

  • Chris Olah

    shared 1 · J=1.00

    Anthropic interpretability co-founder; inventor of modern mech interp

  • Cynthia Rudin

    shared 1 · J=1.00

    Duke professor; interpretable ML pioneer

  • David Bau

    shared 1 · J=1.00

    Northeastern; mechanistic interpretability of LLMs

  • Fernanda Viégas

    shared 1 · J=1.00

    Harvard; ex-Google PAIR; data visualization

  • Jacob Andreas

    shared 1 · J=1.00

    MIT NLP; language models as belief reports
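The J values above are Jaccard similarities over strategy-tag sets: the size of the intersection divided by the size of the union. A minimal sketch of the computation (the tag identifiers are hypothetical; the site's actual tags are not shown here):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets score 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical tag sets for two people who share one tag.
nanda_tags = {"interpretability-bet"}
olah_tags = {"interpretability-bet"}

print(jaccard(nanda_tags, olah_tags))  # 1.0 — one shared tag out of one total
```

With single-tag profiles, any shared tag yields J=1.00, which is why every neighbour in the list above shows the same score.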

Record last updated 2026-04-24.