person

Roger Grosse

U Toronto; Anthropic; influence functions for LLMs

University of Toronto professor and part-time researcher at Anthropic. Co-led the 2023 work on influence functions for large language models, a key technique for tracing model behaviour back to training data.

current Associate Professor, University of Toronto; Member of Technical Staff (part-time), Anthropic

homepage @RogerGrosse

Strategy positions

Interpretability bet · endorses

Mechanistic interpretability is necessary and sufficient to know models are safe

Argues that training-data influence functions make it possible to trace specific model behaviours back to specific training examples, a form of interpretability indispensable for safety auditing.

We scale influence functions to language models with billions of parameters. The result is a tool for tracing what the model 'learned' from what it saw, at production scale.

§ paper · Studying Large Language Model Generalization with Influence Functions · arXiv / Anthropic · 2023-08 · faithful paraphrase

Closest strategy neighbours

by Jaccard overlap

Other people whose strategy tags overlap with Roger Grosse's. Overlap is on tag identity, not stance; opposites can show up if they reference the same tags.

Asma Ghandeharioun
shared 1 · J=1.00
Google DeepMind; 'Patchscopes' for LLM interpretability
Chris Olah
shared 1 · J=1.00
Anthropic interpretability co-founder; inventor of modern mech interp
Cynthia Rudin
shared 1 · J=1.00
Duke professor; interpretable ML pioneer
David Bau
shared 1 · J=1.00
Northeastern; mechanistic interpretability of LLMs
Fernanda Viégas
shared 1 · J=1.00
Harvard; ex-Google PAIR; data visualization
Jacob Andreas
shared 1 · J=1.00
MIT NLP; language models as belief reports
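
The "shared" and "J" figures above are consistent with a plain Jaccard overlap on sets of tag identifiers, J = |A ∩ B| / |A ∪ B|. The sketch below is a minimal illustration under that assumption, not this site's actual implementation; the function name `jaccard` and the tag identifier `interpretability-bet` are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; taken to be 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical tag sets. Only tag identity is stored, so a person who
# endorses a bet and a person who rejects it still share the tag.
grosse_tags = {"interpretability-bet"}
neighbour_tags = {"interpretability-bet"}

shared = len(grosse_tags & neighbour_tags)
print(f"shared {shared} · J={jaccard(grosse_tags, neighbour_tags):.2f}")
# prints: shared 1 · J=1.00
```

With a single tag on each side, any match gives J=1.00, which is why every neighbour listed here has the same score.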

Record last updated 2026-04-25.