AGI Strategies

strategy tag

Interpretability bet.

Mechanistic interpretability is necessary and sufficient to know models are safe

stated endorsers

14

1 opposes

profiled endorsers

3

248 on the board total

endorser p(doom)


no estimates on record

quotes by endorsers

14

just for this tag

principal voices

Highest-recognition profiled endorsers, with ties broken by quote count. Inclusion is not endorsement of the position; it is recognition of who the discourse turns to when the bet is debated.

  • Cynthia Rudin

    Field-leading

  • Chris Olah

    Field-leading

  • Neel Nanda

    Established

where the endorsers sit on the board

3 of 248 profiled · 1% of the board

expertise ↓ · recognition → (columns: Household name · Field-leading · Established · Emerging)
  • Frontier builder: Chris Olah (Field-leading)
  • Deep technical: Cynthia Rudin (Field-leading) · Connor Leahy (×, opposes) · Neel Nanda (Established)
  • Applied technical: none
  • Policy / meta: none
  • External-domain expert: none
  • Commentator: none

Each entry is one profiled person. Entries marked × are profiled opposers, same tier, opposite position. Rows marked "none" are tier combinations the field has not produced for this bet.

Tier mix counts only endorsers (endorses, mixed, conditional, evolved-toward). 1 person opposes this position; they are not in the bars below but appear in the list further down.

expertise mix of endorsers · 3 profiled of 14

Builds frontier systems · 1
Deep ML / safety technical · 2
Applied or adjacent technical · 0
Governance, policy, strategy · 0
Expert in another field · 0
Public-square commentator · 0

recognition mix of endorsers

Mass-public recognition · 0
Known across the AI/safety field · 2
Recognised inside subfield · 1
Newer or less central voice · 0

People on the record.

15

Asma Ghandeharioun

Google DeepMind; 'Patchscopes' for LLM interpretability

endorses

Argues language models can be turned into interpretability tools for themselves; reframes mechanistic interpretation as a translation problem between hidden states and natural language.

“Patchscopes leverage the model's own ability to generate text to inspect its hidden representations, unifying many prior interpretability methods.”
paper · Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models · arXiv / Google DeepMind · 2024 · direct quote
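For readers who want the mechanics, here is a minimal sketch of the Patchscopes idea, not the paper's code: read a hidden state from a source prompt, inject it into an "inspection" prompt, and let the model decode it into words. The model choice, layer index, and prompts are illustrative assumptions.

```python
# Minimal sketch of the Patchscopes idea (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

# 1. Run the source prompt and keep one hidden state.
src = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    h = model(**src).hidden_states[7][0, -1]   # residual stream after block 6, last token

# 2. Run an inspection prompt, overwriting its final position with h.
tgt = tok("Syria: country. Paris: city. X:", return_tensors="pt")
patch_pos = tgt.input_ids.shape[1] - 1         # crude choice: the last prompt position

def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > patch_pos:            # only on the full-prompt pass
        hidden[0, patch_pos] = h               # inject the source representation
    return output

handle = model.transformer.h[6].register_forward_hook(patch_hook)
with torch.no_grad():
    gen = model.generate(**tgt, max_new_tokens=5, pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(gen[0, tgt.input_ids.shape[1]:]))   # the model's "translation" of h
```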

Chris Olah

Anthropic interpretability co-founder; inventor of modern mech interp

endorses

Frames mechanistic interpretability as the tool most likely to let us verify whether a model's cognition matches its stated goal.

I'm most optimistic about safety paths that give us some kind of detailed mechanistic understanding of neural networks.
blog · Mechanistic interpretability, variables, and the importance of interpretable bases · Transformer Circuits · 2022-06 · faithful paraphrase

Connor Leahy

CEO of Conjecture; EleutherAI co-founder turned AI safety hawk

opposes

Argues current alignment approaches, including interpretability-only bets, are not sufficient; sometimes explicitly pessimistic about the research path.

“The truth is, I do not know how to build an aligned system and I don't even know where to start.”
article · AI will leave us 'super fucked', says Conjecture's Connor Leahy · Sifted · 2023-04 · direct quote

Cynthia Rudin

Duke professor; interpretable ML pioneer

mixed

Argues for inherently interpretable models over post-hoc explanations, a different flavour of interpretability from the mechanistic-interpretability school.

“Stop explaining black box machine learning models for high-stakes decisions and use interpretable models instead.”
paper · Stop Explaining Black Box Machine Learning Models for High Stakes Decisions · Nature Machine Intelligence · 2019-05 · direct quote

David Bau

Northeastern; mechanistic interpretability of LLMs

endorses

Argues mechanistic interpretability is making rapid progress in localizing and editing knowledge inside transformer weights; views this as a foundation for safety oversight.

“Factual associations in GPT correspond to localized, directly editable computations in mid-layer feed-forward modules.”
paper · Locating and Editing Factual Associations in GPT · arXiv / NeurIPS · 2022 · direct quote
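A toy illustration of the "localized, directly editable" claim, under simplifying assumptions: treat one MLP weight matrix as a key-value map and install a new value for one key with a rank-one update. The real method selects the key and value from model activations and uses a covariance-weighted update; the matrices below are random stand-ins.

```python
# Toy rank-one "fact edit" on a stand-in weight matrix (not the paper's method).
import torch

torch.manual_seed(0)
d_in, d_out = 8, 8
W = torch.randn(d_out, d_in)        # stand-in for a mid-layer MLP weight matrix

k = torch.randn(d_in)               # "key": representation of the subject
v_new = torch.randn(d_out)          # "value": representation of the new fact

# Rank-one edit that forces W_new @ k == v_new while leaving W unchanged on
# directions orthogonal to k (a least-squares correction along k).
W_new = W + torch.outer(v_new - W @ k, k) / (k @ k)

print(torch.allclose(W_new @ k, v_new, atol=1e-5))   # True: the association is rewritten
```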

Fernanda Viégas

Harvard; ex-Google PAIR; data visualization

endorses

Argues human–AI interaction is best designed when people can see and steer model internals; co-led major industry investments in this approach at Google PAIR before moving to Harvard.

Interactive visualizations turn opaque models into objects we can think with. That is the path to AI that humans can actually verify and shape.
article · Fernanda Viégas, Harvard · Harvard SEAS · 2024 · faithful paraphrase

Jacob Andreas

MIT NLP; language models as belief reports

endorses

Argues language models develop richer internal structure than behavior alone reveals; mechanistic and probing techniques are required to understand what they 'believe'.

Language models contain structured representations of the agents and situations described in their inputs. Reading those representations is closer to ethnography than to prompt engineering.
paper · Language Models as Agent Models · arXiv · 2022 · faithful paraphrase

John Wentworth

Independent alignment researcher; natural abstractions

endorses

Argues alignment requires identifying the abstractions a model converges on; if these match human concepts, training-time supervision becomes far more reliable.

The natural abstractions hypothesis is roughly: a wide variety of cognitive systems will converge to use the same high-level abstractions for reasoning about the world.
blog · The Natural Abstraction Hypothesis: Implications and Evidence · LessWrong · 2021 · faithful paraphrase

Lucius Bushnaq

Apollo Research; mech interp

endorses

Argues interpretability tools are most valuable when explicitly designed to detect deceptive or strategic behaviours in models, not just to characterize benign features.

Interpretability that only finds nice features misses the alignment-relevant ones. We need methods designed to surface the deceptive behaviours we are most worried about.
article · Apollo Research · Apollo Research · 2024 · faithful paraphrase

Martin Wattenberg

Harvard; ex-Google PAIR; visualization for ML

endorses

Argues visualization is a primary research method for understanding modern neural networks, not a presentation layer, and that the field's safety guarantees rise and fall with the depth of that understanding.

If we can't see what models are doing, we can't trust them. Visualization is fundamental to building justified confidence in ML systems.
article · Martin Wattenberg, Harvard · Harvard SEAS · 2023 · faithful paraphrase

Neel Nanda

Mechanistic interpretability team lead at Google DeepMind

endorses

Advocates mechanistic interpretability as a scalable safety tool; also writes accessible tutorials to grow the research field.

Interpretability is, I think, the most promising general-purpose alignment approach.
blog · Neel Nanda, homepage · neelnanda.io · 2023 · faithful paraphrase

Rich Caruana

Microsoft Research; interpretable ML

endorses

Argues high-stakes ML applications (health, criminal justice, finance) should default to interpretable models that practitioners can audit by hand, not to opaque deep nets.

Black-box models are not appropriate for high-stakes decisions. We have interpretable models that match black-box accuracy in many of these domains; using them is a matter of choice, not capability.
article · Interpretable Machine Learning · Microsoft Research · 2019 · faithful paraphrase

Roger Grosse

U Toronto; Anthropic; influence functions for LLMs

endorses

Argues training-data influence functions let us trace specific model behaviours back to specific training examples, a form of interpretability indispensable for safety auditing.

We scale influence functions to language models with billions of parameters. The result is a tool for tracing what the model 'learned' from what it saw, at production scale.
paper · Studying Large Language Model Generalization with Influence Functions · arXiv / Anthropic · 2023-08 · faithful paraphrase
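To make the idea concrete, here is a toy sketch of the classic influence-function estimate, influence ≈ -∇L(z_test)ᵀ H⁻¹ ∇L(z_train), computed exactly on a tiny linear model. The paper's contribution is scaling this to large language models with approximations rather than an exact Hessian; the model and data below are random stand-ins.

```python
# Toy exact influence functions on a tiny linear model (illustrative only).
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)           # tiny model's weights
X, y = torch.randn(20, 3), torch.randn(20)       # stand-in training data
x_test, y_test = torch.randn(3), torch.tensor(0.5)

def loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

def train_loss(w):
    return torch.stack([loss(w, X[i], y[i]) for i in range(len(X))]).mean()

# Hessian of the mean training loss, damped for invertibility.
H = torch.autograd.functional.hessian(train_loss, w) + 1e-3 * torch.eye(3)
g_test = torch.autograd.grad(loss(w, x_test, y_test), w)[0]

# influence(z_i, z_test) ≈ -∇L(z_test)ᵀ H⁻¹ ∇L(z_i)
for i in range(len(X)):
    g_i = torch.autograd.grad(loss(w, X[i], y[i]), w)[0]
    infl = -(g_test @ torch.linalg.solve(H, g_i)).item()
    print(f"train example {i:2d}: influence {infl:+.4f}")
```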

Trenton Bricken

Anthropic mechanistic interpretability

endorses

Argues sparse-autoencoder-style decomposition of model activations into monosemantic features is a tractable path to making large models comprehensible enough to oversee.

We use a sparse autoencoder to decompose a small language model's MLP activations into monosemantic features, and we find that the resulting features can be interpreted, controlled, and used to track model behaviour.
paper · Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · Anthropic / Transformer Circuits · 2023-10 · faithful paraphrase
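As a concrete reference point, here is a minimal sparse-autoencoder sketch of the dictionary-learning setup the paper describes: decompose activations into a wider, sparsely active feature basis by training with a reconstruction loss plus an L1 sparsity penalty. Dimensions, the penalty coefficient, and the random stand-in activations are assumptions, not Anthropic's implementation.

```python
# Minimal sparse autoencoder over stand-in activations (illustrative only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int, d_feat: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_feat)   # activations -> wider feature space
        self.dec = nn.Linear(d_feat, d_act)   # features -> reconstructed activations

    def forward(self, acts):
        feats = torch.relu(self.enc(acts))    # non-negative, encouraged to be sparse
        return self.dec(feats), feats

d_act, d_feat = 512, 4096                     # overcomplete dictionary (assumed sizes)
sae = SparseAutoencoder(d_act, d_feat)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                               # sparsity strength (assumed)

acts = torch.randn(1024, d_act)               # stand-in for recorded MLP activations
for step in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of sae.dec.weight is a candidate feature direction; a feature is
# "interpreted" by checking which inputs most strongly activate it.
```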

Tristan Hume

Anthropic mechanistic interpretability

endorses

Argues sparse-autoencoder scaling can characterize what large frontier models 'see' in a way that makes external safety claims testable rather than aspirational.

We extracted millions of features from Claude 3 Sonnet using sparse autoencoders. The features map to specific concepts, including ones relevant to safety, like power-seeking behaviour and deception.
paper · Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Anthropic / Transformer Circuits · 2024-05 · faithful paraphrase
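As a usage note on how such features get assigned meanings in practice, a minimal sketch: score every example on one feature and inspect the top activators. The encoder weights and activations below are random stand-ins, not the paper's trained autoencoder.

```python
# Finding top-activating examples for one feature (stand-in data, illustrative only).
import torch

torch.manual_seed(0)
n_examples, d_act, d_feat = 1000, 512, 4096
acts = torch.randn(n_examples, d_act)            # stand-in residual-stream activations
W_enc = torch.randn(d_feat, d_act)               # stand-in sparse-autoencoder encoder weights

feats = torch.relu(acts @ W_enc.T)               # feature activations per example
feature_id = 7                                   # arbitrary feature to inspect
top = torch.topk(feats[:, feature_id], k=10).indices
print("top-activating examples for feature", feature_id, ":", top.tolist())
```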