AGI Strategies

Chris Olah

Anthropic interpretability co-founder; inventor of modern mech interp

The most-cited mechanistic interpretability researcher. Co-founded Anthropic, where his interpretability team produced the circuits, superposition, and monosemanticity work.

current Co-founder; Interpretability lead, Anthropic

Profile

expertise

Frontier builder

Currently or recently led training, architecture, or safety work on a frontier model. Hands on the loss curve.

Founded mechanistic interpretability as a subfield (Distill papers, circuits thread). Anthropic interpretability team lead. Hands-on technical work on frontier model internals.

recognition

Field-leading

Widely known inside the AI and AI-safety community. Appears repeatedly in top venues, podcasts, or governance forums. Not a household name to outsiders.

The reference name in interpretability. A lower public profile than lab CEOs and other executives.

vintage

Deep-learning rise

Came up post-AlexNet: ImageNet, AlphaGo, the Transformer paper. DeepMind, Google Brain, and FAIR establish the modern lab template.

Distill papers from 2017; circuits thread 2020. The interpretability subfield he founded is a deep-learning-era artefact.

Hand-classified. See the board for the criteria and the full grid.

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Frames mechanistic interpretability as the tool most likely to let us verify whether a model's cognition matches its stated goal.

I'm most optimistic about safety paths that give us some kind of detailed mechanistic understanding of neural networks.
blog · Mechanistic interpretability, variables, and the importance of interpretable bases · Transformer Circuits · 2022-06 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Chris Olah's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags. A sketch of the scoring follows the list.

  • Asma Ghandeharioun

    shared 1 · J=1.00

    Google DeepMind; 'Patchscopes' for LLM interpretability

  • Cynthia Rudin

    shared 1 · J=1.00

    Duke professor; interpretable ML pioneer

  • David Bau

    shared 1 · J=1.00

    Northeastern; mechanistic interpretability of LLMs

  • Fernanda Viégas

    shared 1 · J=1.00

    Harvard; ex-Google PAIR; data visualization

  • Jacob Andreas

    shared 1 · J=1.00

    MIT NLP; language models as belief reports

  • John Wentworth

    shared 1 · J=1.00

    Independent alignment researcher; natural abstractions
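
A minimal sketch of how these scores could be computed, assuming each person reduces to a set of strategy-tag identifiers. The tag strings and the "Other Person" entry below are hypothetical; only the Chris Olah / David Bau pairing mirrors the list above.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two tag sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical tag assignments for illustration.
tags = {
    "Chris Olah": {"interpretability-bet"},
    "David Bau": {"interpretability-bet"},
    "Other Person": {"some-other-tag"},  # shares nothing, so never listed
}

def neighbours(person: str) -> list[tuple[str, int, float]]:
    """Everyone else with at least one shared tag, ranked by Jaccard overlap."""
    mine = tags[person]
    scored = [
        (other, len(mine & theirs), jaccard(mine, theirs))
        for other, theirs in tags.items()
        if other != person and mine & theirs
    ]
    return sorted(scored, key=lambda t: t[2], reverse=True)

for name, shared, j in neighbours("Chris Olah"):
    print(f"{name}: shared {shared} · J={j:.2f}")
# David Bau: shared 1 · J=1.00
```

A score of "shared 1 · J=1.00" reads as: both people carry exactly one strategy tag and it is the same tag, so the intersection and union of their tag sets coincide.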

Record last updated 2026-04-24.