person

Trenton Bricken

Anthropic mechanistic interpretability

Anthropic researcher whose work on sparse autoencoders, attention dynamics, and dictionary learning has been central to the mechanistic interpretability program.

current · Member of Technical Staff, Anthropic

@trentonbricken

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Argues that sparse-autoencoder-style decomposition of model activations into monosemantic features is a tractable path to making large models comprehensible enough to oversee.

We use a sparse autoencoder to decompose a small language model's MLP activations into monosemantic features, and we find that the resulting features can be interpreted, controlled, and used to track model behaviour.

§ paper · Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · Anthropic / Transformer Circuits · 2023-10 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Trenton Bricken's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.
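
As a rough illustration of how a score like "shared 1 · J=1.00" could arise, the sketch below computes Jaccard overlap on tag sets. The function name and the tag identifier are hypothetical; this record does not show how the site stores tags or computes the score, so treat this as one plausible reading, not the site's implementation.

```python
def jaccard(tags_a: set[str], tags_b: set[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return len(tags_a & tags_b) / len(union)


# Hypothetical tag sets: each person carries a single strategy tag.
# Stance (endorses / opposes) is deliberately not part of the set,
# so only tag identity affects the score.
person = {"interpretability-bet"}
neighbour = {"interpretability-bet"}

shared = len(person & neighbour)
print(f"shared {shared} · J={jaccard(person, neighbour):.2f}")
```

Under this reading, two people who each hold only the same single tag score J=1.00 whether they endorse or oppose it, which matches the note that opposites can appear as neighbours.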

Asma Ghandeharioun
shared 1 · J=1.00
Google DeepMind; 'Patchscopes' for LLM interpretability
Chris Olah
shared 1 · J=1.00
Anthropic co-founder and interpretability lead; pioneer of modern mechanistic interpretability
Cynthia Rudin
shared 1 · J=1.00
Duke professor; interpretable ML pioneer
David Bau
shared 1 · J=1.00
Northeastern; mechanistic interpretability of LLMs
Fernanda Viégas
shared 1 · J=1.00
Harvard; ex-Google PAIR; data visualization
Jacob Andreas
shared 1 · J=1.00
MIT NLP; language models as belief reports

Record last updated 2026-04-25.