AGI Strategies

strategy tag

Interpretability bet.

Mechanistic interpretability is necessary and sufficient to know models are safe

stated endorsers

14

1 opposes

profiled endorsers

3

248 on the board total

endorser p(doom)


no estimates on record

quotes by endorsers

14

just for this tag

principal voices

Highest-recognition profiled endorsers, with ties broken by quote count. Inclusion is not endorsement of the position; it is recognition of who the discourse turns to when the bet is debated.

  • Cynthia Rudin

    Field-leading

  • Chris Olah

    Field-leading

  • Neel Nanda

    Established

where the endorsers sit on the board

3 of 248 profiled · 1% of the board

expertise ↓ · recognition → (columns: Household name · Field-leading · Established · Emerging)
  • Frontier builder: Chris Olah (Field-leading)
  • Deep technical: Cynthia Rudin (Field-leading) · Connor Leahy (×, opposes) · Neel Nanda (Established)
  • Applied technical: none
  • Policy / meta: none
  • External-domain expert: none
  • Commentator: none

Each entry is one profiled person. Entries marked × are profiled opposers, same tier, opposite position. Rows marked "none" are tier combinations the field has not produced for this bet.

Tier mix counts only endorsers (endorses, mixed, conditional, evolved-toward). 1 person opposes this position; they are not in the bars below but appear in the list further down.

expertise mix of endorsers · 3 profiled of 14

Builds frontier systems · 1
Deep ML / safety technical · 2
Applied or adjacent technical · 0
Governance, policy, strategy · 0
Expert in another field · 0
Public-square commentator · 0

recognition mix of endorsers

Mass-public recognition · 0
Known across the AI/safety field · 2
Recognised inside subfield · 1
Newer or less central voice · 0

People on the record.

15

Asma Ghandeharioun

Google DeepMind; 'Patchscopes' for LLM interpretability

endorses

Argues language models can be turned into interpretability tools for themselves; reframes mechanistic interpretation as a translation problem between hidden states and natural language.

“Patchscopes leverage the model's own ability to generate text to inspect its hidden representations, unifying many prior interpretability methods.”
paper · Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models · arXiv / Google DeepMind · 2024 · direct quote
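For readers who want the mechanics, here is a minimal sketch of the Patchscopes idea, not the paper's code: read a hidden state from a source prompt, inject it into an "inspection" prompt, and let the model decode it into words. The model choice, layer index, and prompts are illustrative assumptions.

```python
# Minimal sketch of the Patchscopes idea (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

# 1. Run the source prompt and keep one hidden state.
src = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    h = model(**src).hidden_states[7][0, -1]   # residual stream after block 6, last token

# 2. Run an inspection prompt, overwriting its final position with h.
tgt = tok("Syria: country. Paris: city. X:", return_tensors="pt")
patch_pos = tgt.input_ids.shape[1] - 1         # crude choice: the last prompt position

def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > patch_pos:            # only on the full-prompt pass
        hidden[0, patch_pos] = h               # inject the source representation
    return output

handle = model.transformer.h[6].register_forward_hook(patch_hook)
with torch.no_grad():
    gen = model.generate(**tgt, max_new_tokens=5, pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(gen[0, tgt.input_ids.shape[1]:]))   # the model's "translation" of h
```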

Chris Olah

Anthropic interpretability co-founder; inventor of modern mech interp

endorses

Frames mechanistic interpretability as the tool most likely to let us verify whether a model's cognition matches its stated goal.

I'm most optimistic about safety paths that give us some kind of detailed mechanistic understanding of neural networks.
blog · Mechanistic interpretability, variables, and the importance of interpretable bases · Transformer Circuits · 2022-06 · faithful paraphrase

Connor Leahy

CEO of Conjecture; EleutherAI co-founder turned AI safety hawk

opposes

Argues current alignment approaches, including interpretability-only bets, are not sufficient; sometimes explicitly pessimistic about the research path.

“The truth is, I do not know how to build an aligned system and I don't even know where to start.”
article · AI will leave us 'super fucked', says Conjecture's Connor Leahy · Sifted · 2023-04 · direct quote

Cynthia Rudin

Duke professor; interpretable ML pioneer

mixed

Argues for inherently interpretable models over post-hoc explanations, a different flavour of interpretability from the mechanistic-interpretability school.

“Stop explaining black box machine learning models for high-stakes decisions and use interpretable models instead.”
paper · Stop Explaining Black Box Machine Learning Models for High Stakes Decisions · Nature Machine Intelligence · 2019-05 · direct quote

David Bau

Northeastern; mechanistic interpretability of LLMs

endorses

Argues mechanistic interpretability is making rapid progress in localizing and editing knowledge inside transformer weights; views this as a foundation for safety oversight.

“Factual associations in GPT correspond to localized, directly editable computations in mid-layer feed-forward modules.”
paper · Locating and Editing Factual Associations in GPT · arXiv / NeurIPS · 2022 · direct quote
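A toy illustration of the "localized, directly editable" claim, under simplifying assumptions: treat one MLP weight matrix as a key-value map and install a new value for one key with a rank-one update. The real method selects the key and value from model activations and uses a covariance-weighted update; the matrices below are random stand-ins.

```python
# Toy rank-one "fact edit" on a stand-in weight matrix (not the paper's method).
import torch

torch.manual_seed(0)
d_in, d_out = 8, 8
W = torch.randn(d_out, d_in)        # stand-in for a mid-layer MLP weight matrix

k = torch.randn(d_in)               # "key": representation of the subject
v_new = torch.randn(d_out)          # "value": representation of the new fact

# Rank-one edit that forces W_new @ k == v_new while leaving W unchanged on
# directions orthogonal to k (a least-squares correction along k).
W_new = W + torch.outer(v_new - W @ k, k) / (k @ k)

print(torch.allclose(W_new @ k, v_new, atol=1e-5))   # True: the association is rewritten
```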

Fernanda Viégas

Harvard; ex-Google PAIR; data visualization

endorses

Argues human–AI interaction is best designed when people can see and steer model internals; co-led major industry investments in this approach at Google PAIR before moving to Harvard.

Interactive visualizations turn opaque models into objects we can think with. That is the path to AI that humans can actually verify and shape.
article · Fernanda Viégas, Harvard · Harvard SEAS · 2024 · faithful paraphrase

Jacob Andreas

MIT NLP; language models as belief reports

endorses

Argues language models develop richer internal structure than behavior alone reveals; mechanistic and probing techniques are required to understand what they 'believe'.

Language models contain structured representations of the agents and situations described in their inputs. Reading those representations is closer to ethnography than to prompt engineering.
paper · Language Models as Agent Models · arXiv · 2022 · faithful paraphrase

John Wentworth

Independent alignment researcher; natural abstractions

endorses

Argues alignment requires identifying the abstractions a model converges on; if these match human concepts, training-time supervision becomes far more reliable.

The natural abstractions hypothesis is roughly: a wide variety of cognitive systems will converge to use the same high-level abstractions for reasoning about the world.
blog · The Natural Abstraction Hypothesis: Implications and Evidence · LessWrong · 2021 · faithful paraphrase

Lucius Bushnaq

Apollo Research; mech interp

endorses

Argues interpretability tools are most valuable when explicitly designed to detect deceptive or strategic behaviours in models, not just to characterize benign features.

Interpretability that only finds nice features misses the alignment-relevant ones. We need methods designed to surface the deceptive behaviours we are most worried about.
article · Apollo Research · Apollo Research · 2024 · faithful paraphrase

Martin Wattenberg

Harvard; ex-Google PAIR; visualization for ML

endorses

Argues visualization is a primary research method for understanding modern neural networks, not a presentation layer, and that the field's safety guarantees rise and fall with the depth of that understanding.

If we can't see what models are doing, we can't trust them. Visualization is fundamental to building justified confidence in ML systems.
article · Martin Wattenberg, Harvard · Harvard SEAS · 2023 · faithful paraphrase

Neel Nanda

Mechanistic interpretability team lead at Google DeepMind

endorses

Advocates mechanistic interpretability as a scalable safety tool; also writes accessible tutorials to grow the research field.

Interpretability is, I think, the most promising general-purpose alignment approach.
blog · Neel Nanda, homepage · neelnanda.io · 2023 · faithful paraphrase

Rich Caruana

Microsoft Research; interpretable ML

endorses

Argues high-stakes ML applications (health, criminal justice, finance) should default to interpretable models that practitioners can audit by hand, not to opaque deep nets.

Black-box models are not appropriate for high-stakes decisions. We have interpretable models that match black-box accuracy in many of these domains; using them is a matter of choice, not capability.
article · Interpretable Machine Learning · Microsoft Research · 2019 · faithful paraphrase

Roger Grosse

U Toronto; Anthropic; influence functions for LLMs

endorses

Argues training-data influence functions let us trace specific model behaviours back to specific training examples, a form of interpretability indispensable for safety auditing.

We scale influence functions to language models with billions of parameters. The result is a tool for tracing what the model 'learned' from what it saw, at production scale.
paper · Studying Large Language Model Generalization with Influence Functions · arXiv / Anthropic · 2023-08 · faithful paraphrase
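To make the idea concrete, here is a toy sketch of the classic influence-function estimate, influence ≈ -∇L(z_test)ᵀ H⁻¹ ∇L(z_train), computed exactly on a tiny linear model. The paper's contribution is scaling this to large language models with approximations rather than an exact Hessian; the model and data below are random stand-ins.

```python
# Toy exact influence functions on a tiny linear model (illustrative only).
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)           # tiny model's weights
X, y = torch.randn(20, 3), torch.randn(20)       # stand-in training data
x_test, y_test = torch.randn(3), torch.tensor(0.5)

def loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

def train_loss(w):
    return torch.stack([loss(w, X[i], y[i]) for i in range(len(X))]).mean()

# Hessian of the mean training loss, damped for invertibility.
H = torch.autograd.functional.hessian(train_loss, w) + 1e-3 * torch.eye(3)
g_test = torch.autograd.grad(loss(w, x_test, y_test), w)[0]

# influence(z_i, z_test) ≈ -∇L(z_test)ᵀ H⁻¹ ∇L(z_i)
for i in range(len(X)):
    g_i = torch.autograd.grad(loss(w, X[i], y[i]), w)[0]
    infl = -(g_test @ torch.linalg.solve(H, g_i)).item()
    print(f"train example {i:2d}: influence {infl:+.4f}")
```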

Trenton Bricken

Anthropic mechanistic interpretability

endorses

Argues sparse-autoencoder-style decomposition of model activations into monosemantic features is a tractable path to making large models comprehensible enough to oversee.

We use a sparse autoencoder to decompose a small language model's MLP activations into monosemantic features, and we find that the resulting features can be interpreted, controlled, and used to track model behaviour.
paper · Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · Anthropic / Transformer Circuits · 2023-10 · faithful paraphrase
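As a concrete reference point, here is a minimal sparse-autoencoder sketch of the dictionary-learning setup the paper describes: decompose activations into a wider, sparsely active feature basis by training with a reconstruction loss plus an L1 sparsity penalty. Dimensions, the penalty coefficient, and the random stand-in activations are assumptions, not Anthropic's implementation.

```python
# Minimal sparse autoencoder over stand-in activations (illustrative only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_feat)   # activations -> wider feature space
        self.dec = nn.Linear(d_feat, d_act)   # features -> reconstructed activations

    def forward(self, acts):
        feats = torch.relu(self.enc(acts))    # non-negative, encouraged to be sparse
        return self.dec(feats), feats

d_act, d_feat = 512, 4096                     # overcomplete dictionary (assumed sizes)
sae = SparseAutoencoder(d_act, d_feat)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                               # sparsity strength (assumed)

acts = torch.randn(1024, d_act)               # stand-in for recorded MLP activations
for step in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of sae.dec.weight is a candidate feature direction; a feature is
# "interpreted" by checking which inputs most strongly activate it.
```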

Tristan Hume

Anthropic mechanistic interpretability

endorses

Argues sparse-autoencoder scaling can characterize what large frontier models 'see' in a way that makes external safety claims testable rather than aspirational.

We extracted millions of features from Claude 3 Sonnet using sparse autoencoders. The features map to specific concepts, including ones relevant to safety, like power-seeking behaviour and deception.
paper · Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Anthropic / Transformer Circuits · 2024-05 · faithful paraphrase
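As a usage note on how such features get assigned meanings in practice, a minimal sketch: score every example on one feature and inspect the top activators. The encoder weights and activations below are random stand-ins, not the paper's trained autoencoder.

```python
# Finding top-activating examples for one feature (stand-in data, illustrative only).
import torch

torch.manual_seed(0)
n_examples, d_act, d_feat = 1000, 512, 4096
acts = torch.randn(n_examples, d_act)            # stand-in residual-stream activations
W_enc = torch.randn(d_feat, d_act)               # stand-in sparse-autoencoder encoder weights

feats = torch.relu(acts @ W_enc.T)               # feature activations per example
feature_id = 7                                   # arbitrary feature to inspect
top = torch.topk(feats[:, feature_id], k=10).indices
print("top-activating examples for feature", feature_id, ":", top.tolist())
```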