person
Roger Grosse
U Toronto; Anthropic; influence functions for LLMs
University of Toronto professor and Anthropic part-time researcher. Co-led the 2023 work on influence functions for large language models, a key technique for tracing model behaviour back to training data.
Current roles: Associate Professor, University of Toronto; Member of Technical Staff (part-time), Anthropic
Strategy positions
Interpretability bet: endorses
Mechanistic interpretability is necessary and sufficient to know models are safe. Argues that training-data influence functions let us trace specific model behaviours back to specific training examples, a form of interpretability indispensable for safety auditing.
We scale influence functions to language models with billions of parameters. The result is a tool for tracing what the model 'learned' from what it saw, at production scale.
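The scaled-up work quoted above relies on approximations (such as EK-FAC) to make this tractable for billion-parameter models. As a minimal sketch of the underlying idea only, here is the classic influence-function estimate, influence(z, z_test) ≈ -∇L(z_test)ᵀ H⁻¹ ∇L(z), on a toy ridge-regression problem where the Hessian is exact. All variable names and the toy setup are illustrative, not taken from Grosse et al.'s code.

```python
import numpy as np

# Toy influence-function sketch (illustrative; not the EK-FAC method used at LLM scale).
# Score for training point z_i on a test point:  -grad L(z_test)^T  H^{-1}  grad L(z_i).
# Positive score => the training point increases test loss; negative => it helps.

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 1e-2                            # damping / ridge term keeps H invertible
H = X.T @ X / n + lam * np.eye(d)     # exact Hessian of the mean squared-error loss
w = np.linalg.solve(H, X.T @ y / n)   # fitted parameters of the toy model

x_test = rng.normal(size=d)
y_test = x_test @ w_true
g_test = (x_test @ w - y_test) * x_test   # gradient of test loss w.r.t. w

G_train = (X @ w - y)[:, None] * X        # row i is the gradient of loss on z_i
influences = -G_train @ np.linalg.solve(H, g_test)

# Rank training examples by how strongly they drive the test prediction.
top = np.argsort(-np.abs(influences))[:3]
print("most influential training indices:", top)
```

The ranking step at the end is the safety-auditing use case the description mentions: given a model behaviour (here, a test prediction), surface the training examples most responsible for it.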
Closest strategy neighbours
By Jaccard overlap: other people whose strategy tags overlap with Roger Grosse's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.
Record last updated 2026-04-25.