AGI Strategies

Chris Olah

Anthropic interpretability co-founder; inventor of modern mech interp

The most-cited mechanistic interpretability researcher. Co-founded Anthropic, where his interpretability team produced the circuits, superposition, and monosemanticity work.

current Co-founder; Interpretability lead, Anthropic

Profile

expertise

Frontier builder

Currently or recently led training, architecture, or safety work on a frontier model. Hands on the loss curve.

Founded mechanistic interpretability as a subfield (Distill papers, circuits thread). Anthropic interpretability team lead. Hands-on technical work on frontier model internals.

recognition

Field-leading

Widely known inside the AI and AI-safety community. Appears repeatedly in top venues, podcasts, or governance forums. Not a household name to outsiders.

The reference name in interpretability. A lower public profile than lab CEOs and other executives.

vintage

Deep-learning rise

Came up post-AlexNet: ImageNet, AlphaGo, the Transformer paper. DeepMind, Google Brain, and FAIR establish the modern lab template.

Distill papers from 2017; circuits thread 2020. The interpretability subfield he founded is a deep-learning-era artefact.

Hand-classified. See the board for the criteria and the full grid.

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Frames mechanistic interpretability as the tool most likely to let us verify whether a model's cognition matches its stated goal.

I'm most optimistic about safety paths that give us some kind of detailed mechanistic understanding of neural networks.
blog · Mechanistic interpretability, variables, and the importance of interpretable bases · Transformer Circuits · 2022-06 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Chris Olah's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags. A sketch of the scoring follows the list.

  • Asma Ghandeharioun

    shared 1 · J=1.00

    Google DeepMind; 'Patchscopes' for LLM interpretability

  • Cynthia Rudin

    shared 1 · J=1.00

    Duke professor; interpretable ML pioneer

  • David Bau

    shared 1 · J=1.00

    Northeastern; mechanistic interpretability of LLMs

  • Fernanda Viégas

    shared 1 · J=1.00

    Harvard; ex-Google PAIR; data visualization

  • Jacob Andreas

    shared 1 · J=1.00

    MIT NLP; language models as belief reports

  • John Wentworth

    shared 1 · J=1.00

    Independent alignment researcher; natural abstractions
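
A minimal sketch of how these scores could be computed, assuming each person reduces to a set of strategy-tag identifiers. The tag strings and the "Other Person" entry below are hypothetical; only the Chris Olah / David Bau pairing mirrors the list above.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two tag sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical tag assignments for illustration.
tags = {
    "Chris Olah": {"interpretability-bet"},
    "David Bau": {"interpretability-bet"},
    "Other Person": {"some-other-tag"},  # shares nothing, so never listed
}

def neighbours(person: str) -> list[tuple[str, int, float]]:
    """Everyone else with at least one shared tag, ranked by Jaccard overlap."""
    mine = tags[person]
    scored = [
        (other, len(mine & theirs), jaccard(mine, theirs))
        for other, theirs in tags.items()
        if other != person and mine & theirs
    ]
    return sorted(scored, key=lambda t: t[2], reverse=True)

for name, shared, j in neighbours("Chris Olah"):
    print(f"{name}: shared {shared} · J={j:.2f}")
# David Bau: shared 1 · J=1.00
```

A score of "shared 1 · J=1.00" reads as: both people carry exactly one strategy tag and it is the same tag, so the intersection and union of their tag sets coincide.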

Record last updated 2026-04-24.