AGI Strategies

person

Trenton Bricken

Anthropic mechanistic interpretability

Anthropic researcher whose work on sparse autoencoders, attention dynamics, and dictionary learning has been central to the mechanistic interpretability program.

current Member of Technical Staff, Anthropic

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Argues sparse-autoencoder-style decomposition of model activations into monosemantic features is a tractable path to making large models comprehensible enough to oversee.

We use a sparse autoencoder to decompose a small language model's MLP activations into monosemantic features, and we find that the resulting features can be interpreted, controlled, and used to track model behaviour.
§ paper · Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · Anthropic / Transformer Circuits · 2023-10 · faithful paraphrase
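For readers unfamiliar with the technique, the decomposition described above can be sketched roughly as follows. This is a generic sparse autoencoder forward pass and loss, not the paper's exact architecture or training setup; the dimensions, the random weights, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are all illustrative assumptions.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features non-negative
    x_hat = features @ W_dec + b_dec               # linear decoder (the "dictionary")
    return features, x_hat

def sae_loss(x, x_hat, features, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon + l1_coeff * sparsity

# Hypothetical sizes: 8-dimensional activations, a 32-entry overcomplete dictionary.
rng = np.random.default_rng(0)
d_model, d_dict = 8, 32
x = rng.normal(size=(4, d_model))                  # stand-in for MLP activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, features)
```

The dictionary is deliberately wider than the activation space, and the L1 term encourages each input to activate only a few features, which is what makes individual features candidates for interpretation.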

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Trenton Bricken's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.

  • Asma Ghandeharioun

    shared 1 · J=1.00

    Google DeepMind; 'Patchscopes' for LLM interpretability

  • Chris Olah

    shared 1 · J=1.00

    Anthropic interpretability co-founder; inventor of modern mech interp

  • Cynthia Rudin

    shared 1 · J=1.00

    Duke professor; interpretable ML pioneer

  • David Bau

    shared 1 · J=1.00

    Northeastern; mechanistic interpretability of LLMs

  • Fernanda Viégas

    shared 1 · J=1.00

    Harvard; ex-Google PAIR; data visualization

  • Jacob Andreas

    shared 1 · J=1.00

    MIT NLP; language models as belief reports
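The "shared" and "J" figures in the list above are presumably computed along these lines. This is a minimal sketch of Jaccard overlap on tag sets, assuming tags are compared by identity only (stance is ignored, as the note above says); the tag name and the treatment of two empty sets are illustrative assumptions, not the site's actual code.

```python
def jaccard(tags_a, tags_b):
    """Jaccard overlap of two tag collections: |A intersect B| / |A union B|."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # assumption: two empty tag sets count as no overlap
    return len(a & b) / len(a | b)

# Hypothetical tag sets; the tag name is illustrative only.
bricken = {"interpretability-bet"}
neighbour = {"interpretability-bet"}

shared = len(bricken & neighbour)
print(f"shared {shared} · J={jaccard(bricken, neighbour):.2f}")
# prints: shared 1 · J=1.00
```

With a single tag on each side, any person carrying that same tag scores J=1.00, which would explain why every neighbour listed here has an identical score.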

Record last updated 2026-04-25.