AGI Strategies

person

Tristan Hume

Anthropic mechanistic interpretability

Anthropic researcher whose work on dictionary-learning sparse autoencoders for Claude was a landmark in scaling mechanistic interpretability beyond toy models.

current Member of Technical Staff, Anthropic

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Argues sparse-autoencoder scaling can characterize what large frontier models 'see' in a way that makes external safety claims testable rather than aspirational.

We extracted millions of features from Claude 3 Sonnet using sparse autoencoders. The features map to specific concepts, including ones relevant to safety, like power-seeking behaviour and deception.
§ paper · Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Anthropic / Transformer Circuits · 2024-05 · faithful paraphrase
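The sparse-autoencoder approach the paper scales can be sketched minimally: a wide encoder with a ReLU produces a sparse, nonnegative feature vector from a model activation, and a decoder reconstructs the activation from those features. The sketch below is hypothetical (random weights, toy dimensions; the paper trains on Claude 3 Sonnet activations at far larger scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; real SAEs use millions of features.
d_model, d_features = 64, 512
W_enc = rng.normal(0, 0.02, (d_model, d_features))
W_dec = rng.normal(0, 0.02, (d_features, d_model))
b_enc = np.zeros(d_features)
b_dec = np.zeros(d_model)

def sae_forward(activation):
    """Encode an activation into sparse nonnegative features, then reconstruct it."""
    features = np.maximum(0.0, (activation - b_dec) @ W_enc + b_enc)  # ReLU induces sparsity
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

x = rng.normal(size=d_model)            # stand-in for a residual-stream activation
feats, recon = sae_forward(x)
print(feats.shape, recon.shape)         # (512,) (64,)
```

Training would add a reconstruction loss plus an L1 penalty on `feats` to push most features to zero; interpretable concepts are then read off the surviving active features.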

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Tristan Hume's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.
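The overlap metric described above is plain Jaccard similarity on sets of tag identifiers, J = |A∩B| / |A∪B|. A minimal sketch (tag names hypothetical) shows why single-tag profiles that share their one tag all score J=1.00, as in the entries below:

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity between two sets of strategy-tag identifiers."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # convention: two empty profiles share nothing
    return len(a & b) / len(a | b)

# One shared tag out of one total tag -> J = 1/1 = 1.0
print(jaccard({"interpretability-bet"}, {"interpretability-bet"}))  # 1.0
# Partial overlap: 1 shared of 3 total -> J ~= 0.33
print(jaccard({"interpretability-bet", "pause"}, {"interpretability-bet", "evals"}))
```

Note that stance is ignored: two people tagged with the same strategy tag overlap fully even if one endorses it and the other rejects it.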

  • Asma Ghandeharioun

    shared 1 · J=1.00

    Google DeepMind; 'Patchscopes' for LLM interpretability

  • Chris Olah

    shared 1 · J=1.00

    Anthropic interpretability co-founder; pioneer of modern mechanistic interpretability

  • Cynthia Rudin

    shared 1 · J=1.00

    Duke professor; interpretable ML pioneer

  • David Bau

    shared 1 · J=1.00

    Northeastern; mechanistic interpretability of LLMs

  • Fernanda Viégas

    shared 1 · J=1.00

    Harvard; ex-Google PAIR; data visualization

  • Jacob Andreas

    shared 1 · J=1.00

    MIT NLP; language models as belief reports

Record last updated 2026-04-25.