person
Tristan Hume
Anthropic mechanistic interpretability
Anthropic researcher whose work on dictionary-learning sparse autoencoders for Claude was a landmark in scaling mechanistic interpretability beyond toy models.
Current: Member of Technical Staff, Anthropic
Strategy positions
Interpretability bet: endorses
Mechanistic interpretability is necessary and sufficient to know models are safe
Argues that sparse-autoencoder scaling can characterize what large frontier models 'see' in a way that makes external safety claims testable rather than aspirational.
"We extracted millions of features from Claude 3 Sonnet using sparse autoencoders. The features map to specific concepts, including ones relevant to safety, like power-seeking behaviour and deception."
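To make the sparse-autoencoder idea concrete, here is a minimal sketch of the dictionary-learning setup: a one-layer autoencoder with a ReLU bottleneck and an L1 sparsity penalty, trained to reconstruct model activations. The SparseAutoencoder class, shapes, and hyperparameters below are illustrative assumptions, not the configuration used for Claude 3 Sonnet.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """One-layer SAE: activations -> overcomplete sparse features -> reconstruction."""
        def __init__(self, d_model: int, n_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)
            self.decoder = nn.Linear(n_features, d_model)

        def forward(self, x: torch.Tensor):
            f = torch.relu(self.encoder(x))  # sparse, nonnegative feature activations
            return self.decoder(f), f

    sae = SparseAutoencoder(d_model=512, n_features=4096)  # illustrative sizes
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    acts = torch.randn(1024, 512)  # stand-in for residual-stream activations

    for _ in range(100):
        recon, feats = sae(acts)
        # Reconstruction loss plus an L1 penalty that pushes most features to zero.
        loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

After training, each decoder column acts as a dictionary element, and a feature is read off wherever its activation is nonzero.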
Closest strategy neighbours
By Jaccard overlap. Other people whose strategy tags overlap with Tristan Hume's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.
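A minimal sketch of the overlap metric described above, assuming each person's strategy tags are stored as a set of strings; the tag names shown are hypothetical.

    def jaccard(tags_a: set[str], tags_b: set[str]) -> float:
        """Jaccard similarity: |A & B| / |A | B|; defined as 0.0 for two empty sets."""
        if not tags_a and not tags_b:
            return 0.0
        return len(tags_a & tags_b) / len(tags_a | tags_b)

    # Overlap is on tag identity only, so stance (endorses vs. rejects) is ignored.
    hume_tags = {"interpretability-bet"}
    other_tags = {"interpretability-bet", "compute-governance"}
    print(jaccard(hume_tags, other_tags))  # 0.5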
Record last updated 2026-04-25.