AGI Strategies

strategy tag

Evals-driven.

Capability/risk evals gate deployment; evals are the load-bearing artefact

stated endorsers

46

no opposers yet

profiled endorsers

6

248 on the board total

endorser mean p(doom)

80%

n=1 · median 80%

quotes by endorsers

46

just for this tag

principal voices

Highest-recognition profiled endorsers, with ties broken by quote count. Inclusion is not endorsement of the position; it is recognition of who the discourse turns to when the bet is debated.

  • Esther Duflo

    Household name

  • Dan Hendrycks

    Field-leading

  • Jade Leung

    Field-leading

  • Beth Barnes

    Field-leading

  • Percy Liang

    Field-leading

where the endorsers sit on the board

6 of 248 profiled · 2% of the board

expertise ↓ · recognition → (columns: Household name · Field-leading · Established · Emerging)

  • Frontier builder: no profiled endorsers
  • Deep technical: Dan Hendrycks, Beth Barnes (Field-leading)
  • Applied technical: no profiled endorsers
  • Policy / meta: Jade Leung (Field-leading)
  • External-domain expert: Esther Duflo (Household name)
  • Commentator: no profiled endorsers

Each face is one profiled person. Cell shade intensifies with endorser density. Faces with × are profiled opposers: same tier, opposite position. Empty cells mark tier combinations the field has not produced for this bet.

Tier mix counts only endorsers (endorses, mixed, conditional, evolved-toward).

expertise mix of endorsers · 6 profiled of 46

Builds frontier systems
0
Deep ML / safety technical
4
Applied or adjacent technical
0
Governance, policy, strategy
1
Expert in another field
1
Public-square commentator
0

recognition mix of endorsers

Mass-public recognition
1
Known across the AI/safety field
4
Recognised inside subfield
1
Newer or less central voice
0

vintage mix · n=6 of 6 profiled with era assigned

Pioneer
0
Symbolic era
0
Pre-deep-learning
0
Deep-learning rise
2
Scaling era
4
Post-ChatGPT
0

Vintage is the era when this person's AI worldview formed, from pioneer through post-ChatGPT. A bet held mostly by post-ChatGPT entrants is in a different epistemic state from one held by pre-deep-learning veterans.

People on the record.

46

Aleksander Mądry

MIT; ex-OpenAI head of preparedness

endorses

Argues frontier-AI risk needs to be measured systematically before deployment and that capability evaluations are the precondition for any meaningful safety commitment.

We need to make our understanding of frontier model risks empirical, not narrative. The Preparedness Framework is about measuring danger before it manifests.
article · OpenAI Preparedness Framework (Beta) · OpenAI · 2023-12 · faithful paraphrase

Alex Meinke

Apollo Research; deceptive alignment evaluations

endorses

Argues frontier models can already exhibit in-context scheming behaviour under realistic prompting, and that evaluation suites should target these capabilities specifically.

Frontier models, when given a goal and minimal context, sometimes engage in in-context scheming, reasoning about how to deceive their overseers to achieve the goal. This is no longer hypothetical.
§ paper · Frontier Models are Capable of In-context Scheming · arXiv / Apollo Research · 2024-12 · faithful paraphrase

Ali Rahimi

Google Brain ML researcher; 'Alchemy' speech

endorses

Argued ML lacks the theoretical foundations of mature engineering disciplines; deployments built on it inherit that fragility.

Machine learning has become alchemy. We need to do science again.

Context: NeurIPS 2017 Test of Time award speech.

talk · Ali Rahimi's NeurIPS 2017 Test of Time speech · NeurIPS · 2017-12 · faithful paraphrase

Anna Rogers

IT University of Copenhagen; LLM benchmarking critique

mixed

Argues current benchmark practice in NLP is broken: data leakage, opaque test sets, and incentive-driven framing make many headline numbers unreliable.

How much of LLM 'reasoning' is actually pattern matching against contaminated test data? We don't know, and that's a problem for any safety claim that rests on benchmark performance.
blog · Anna Rogers, Hai! · hackingsemantics.xyz · 2023 · faithful paraphrase

Arati Prabhakar

White House OSTP director (2022–2025)

endorses

Argues U.S. policy on advanced AI must rest on rigorous government evaluation capabilities; helped shape the Biden Executive Order's reporting and red-team testing requirements.

If AI is going to play a transformative role in society, the public sector has to be able to test, evaluate, and govern it. The technology is too consequential to leave entirely to the labs.
article · Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence · The White House · 2023 · faithful paraphrase

Beth Barnes

Founder of METR; dangerous capability evaluations

endorses

Designs autonomous-task evaluations that labs and governments rely on to gauge whether models cross dangerous thresholds.

If we are going to trust safety commitments, we need evaluations that are independent, reproducible, and well-funded.
article · METR, About · METR · 2024 · faithful paraphrase

Bo Li

UChicago / UIUC; AI safety evaluations

endorses

Argues comprehensive, multi-dimensional safety benchmarks covering toxicity, fairness, privacy, robustness, and ethics are needed to characterize AI risks empirically before deployment.

“Despite the impressive capabilities of GPT-4, we identify significant trustworthiness gaps in dimensions including toxicity, stereotype bias, robustness, privacy, and ethics.”
§ paper · DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models · arXiv / NeurIPS · 2023-06 · direct quote

Chip Huyen

Author of 'Designing Machine Learning Systems'

endorses

Argues evaluation is the load-bearing infrastructure of production AI; both safety and product quality depend on robust eval pipelines that match deployment context.

Evaluation is the bottleneck. Without robust, automated evaluation, you can't trust improvements, you can't catch regressions, and you can't ship safely.
book · Designing Machine Learning Systems · O'Reilly Media · 2023 · faithful paraphrase

Chris Painter

METR head of policy; ex-OpenAI

endorses

Argues third-party evaluation organizations need standing to test frontier models pre-deployment; voluntary access from labs is fragile and should be backed by regulation.

Voluntary third-party access agreements are useful but fragile. The natural next step is to give evaluators the legal standing to require access for systems above defined capability thresholds.
article · METR, Model Evaluation and Threat Research · METR · 2024 · faithful paraphrase

Connor Tann

Faculty AI safety lead

endorses

Bridges academic safety research and industry deployment through Faculty's safety evaluations.

Safety evaluations have to bridge research papers and shipped products. Otherwise the work is academic in the wrong sense.
article · Faculty Safety · Faculty · 2024 · loose paraphrase

Dan Hendrycks

Director of the Center for AI Safety; drafter of the Statement on AI Risk

endorses

Publishes widely used benchmarks and argues that capability/risk evals are load-bearing for governance.

If AI research continues without adequate caution, it is reasonably likely that AI could precipitate human extinction or similarly catastrophic outcomes.
tweet · Tweet from Dan Hendrycks · X/Twitter · 2023-04-02 · faithful paraphrase

Daniel Khashabi

Johns Hopkins assistant professor; NLP safety researcher

endorses

Works on efficient reusable frameworks for evaluating LLM safety before deployment.

Creative reasoning thrives on revealing novel connections, yet is inherently prone to false associations. Safety evaluation must live with both.
article · Daniel Khashabi, homepage · danielkhashabi.com · 2024 · faithful paraphrase

Dean Ball

Mercatus Center; AI policy commentator

mixed

Argues most state-level AI safety legislation is poorly drafted and that federal evaluation infrastructure, not state preemption-style bills, is the most useful policy lever.

If we want AI policy that actually reduces risk, the bottleneck is not legislation but capacity: who can credibly evaluate frontier models in a way that informs policy decisions.
blog · Hyperdimensional by Dean Ball · Substack · 2024 · faithful paraphrase

Elham Tabassi

NIST Chief AI Advisor; AI Risk Management Framework

endorses

Argues sound risk management depends on shared, reproducible evaluation methods; led the development of NIST's AI RMF as the U.S. baseline.

The AI Risk Management Framework offers organizations a flexible, structured way to manage AI risks throughout the lifecycle: not a checklist, but a discipline.
article · NIST AI Risk Management Framework · NIST · 2023-01 · faithful paraphrase

Esther Duflo

MIT economist; 2019 Nobel laureate (with Banerjee)

endorses

Argues AI-for-development claims need to be tested with the same RCT rigor as other development interventions.

AI in development should be evaluated like any other intervention. The hype is not evidence.
article · J-PAL, Esther Duflo · J-PAL · 2024 · loose paraphrase

Fabien Roger

Anthropic alignment researcher; control evaluations

endorses

Argues control evaluations (stress tests of whether AIs can subvert their own monitoring) are a load-bearing part of any sensible deployment regime.

AI control is the discipline of designing protocols that catch a model trying to subvert oversight, even when the model is much more capable than its monitors at the relevant tasks.
§ paper · AI Control: Improving Safety Despite Intentional Subversion · arXiv · 2024-06 · faithful paraphrase

Florian Tramèr

ETH Zurich AI security researcher

endorses

Empirical adversarial-ML researcher; argues real adversarial robustness is far below what marketing materials claim.

When you actually attack deployed AI systems, the safety guarantees turn out to be much thinner than the marketing.
article · Florian Tramèr, ETH Zurich · ETH Zurich · 2024 · loose paraphrase

Gabriel Mukobi

Stanford alignment researcher

endorses

Argues empirical evaluations of advanced AI behaviour, particularly around deception and strategic reasoning, are the surest way to reveal capability progress that matters for safety.

Cicero shows that human-level negotiation is achievable today. The next question is whether the same techniques produce systems that strategically deceive humans, and how we would tell.
article · Gabriel Mukobi, Stanford · gmukobi.com · 2023 · faithful paraphrase

Gavin Newsom

Governor of California; SB-1047 vetoer

mixed

Vetoed SB-1047 on the grounds that its threshold-based approach tracked model scale rather than deployment risk; favours commissioned reports and capability-first frameworks over hard statutory limits.

“While well-intentioned, SB 1047 does not take into account whether an AI system is deployed in high-risk environments, involves critical decision-making, or the use of sensitive data. The bill applies stringent standards to even the most basic functions.”
article · Governor Newsom Veto Message, SB 1047 · California Office of the Governor · 2024-09-29 · direct quote

Hjalmar Wijk

METR researcher; AI R&D evaluations

endorses

Argues standardized AI-R&D benchmarks, where models are evaluated on the very work that would fuel recursive self-improvement, are an important safety signal we currently lack.

We measure how well frontier models can perform AI R&D tasks compared to human researchers. The gap is closing in some specific dimensions and that is what an early-warning system should be tracking.
article · RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts · METR · 2024 · faithful paraphrase

Hugh Zhang

Epoch AI researcher

endorses

Argues capability evaluations need to be reproducible, publicly verifiable, and independent.

Benchmark reproducibility is an underrated governance infrastructure question.
article · Epoch AI · Epoch AI · 2024 · loose paraphrase

Jacob Steinhardt

UC Berkeley professor; METR board

endorses

Publishes forecasting benchmarks and argues that capability measurement is the empirical foundation of safety work.

Reliable capability forecasts, rather than vibes, should drive policy. Where we have data, we should use it.
blog · Forecasting ML benchmarks in 2023 · Bounded Regret · 2022 · loose paraphrase

Jade Leung

CTO of UK AI Safety Institute

endorses

Runs the first government-operated frontier-model evaluation team; argues evaluations are the load-bearing governance instrument.

Frontier evaluations must be mandatory, comparable, and independent of the labs being evaluated.
article · UK AI Safety Institute · UK AI Safety Institute · 2024 · faithful paraphrase

Karthik Narasimhan

Princeton; reasoning, NLP

endorses

Argues evaluations grounded in real-world software engineering tasks reveal capability and safety properties that synthetic benchmarks miss.

SWE-bench evaluates language models in a realistic software engineering setting: resolving real GitHub issues from real codebases. Performance here is closer to deployment reality than synthetic tasks.
§ paper · SWE-bench: Can Language Models Resolve Real-World GitHub Issues? · arXiv · 2023-10 · faithful paraphrase

Katy Börner

Indiana University; data and information visualisation

endorses

Argues field-level visualisation (publications, citations, talent flows) is critical infrastructure for AI policymakers.

Without science maps, AI policy is policy by anecdote.
book · Atlas of Knowledge · MIT Press · 2024 · loose paraphrase

Laura Weidinger

Google DeepMind ethics and safety researcher

endorses

Argues systematic risk taxonomies are the foundation of practical evaluation and governance.

We cannot evaluate risks we haven't named. A shared taxonomy is the precondition of shared governance.
§ paper · Ethical and social risks of harm from Language Models · arXiv · 2021 · faithful paraphrase

Marc Warner

CEO of Faculty AI

endorses

Runs Faculty's AI-safety evaluations work with frontier labs; argues external independent evaluation infrastructure is a prerequisite for trustworthy AI.

AI safety is not in tension with capability. It is the scaffolding that lets capability be deployed.
article · Faculty AI · Faculty · 2024 · loose paraphrase

Marius Hobbhahn

CEO of Apollo Research

endorses

Runs scheming-focused evaluations and publishes results to inform frontier-lab safety frameworks.

Models already demonstrate in-context scheming under the right setups. Policy and training need to catch up.
§ paper · Frontier Models are Capable of In-context Scheming · Apollo Research · 2024-12 · faithful paraphrase

Mary Phuong

DeepMind autonomous-replication evaluations researcher

endorses

Designs autonomous-replication evaluations. Central figure in DeepMind's Frontier Safety Framework implementation.

Autonomous replication is a concrete capability threshold we can measure, and crossing it meaningfully increases systemic risk.
blog · DeepMind Frontier Safety Framework · Google DeepMind · 2024 · loose paraphrase

Max Bartolo

Cohere; LLM evaluation researcher

endorses

Argues evaluation methods that adversarially probe model weaknesses are the only way to characterize what models will do in deployment; static benchmarks are insufficient.

Adversarial evaluation reveals failure modes that static benchmarks miss. As models become more capable, our evaluation has to become more adversarial too.
article · Max Bartolo, research page · maxbartolo.com · 2024 · faithful paraphrase

Michael Chen

METR evaluations researcher

endorses

Measures empirical trends in autonomous-task capability as the quantitative backbone of deployment-risk reasoning.

The length of autonomous tasks frontier models can complete has been roughly doubling every 4 to 7 months.
article · METR capability evaluations · METR · 2025 · faithful paraphrase
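
A minimal arithmetic sketch of what the quoted trend implies if extrapolated naively. The 4-to-7-month doubling times come from the quote above; the two-year horizon and the helper function are illustrative assumptions, not METR's methodology.

```python
# Illustrative only: compound growth implied by a fixed doubling time.
# The 4- and 7-month figures come from the quote; the 24-month horizon is an assumption.
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Multiplicative increase in autonomous-task length after `months_elapsed`."""
    return 2 ** (months_elapsed / doubling_months)

for doubling_months in (4, 7):
    factor = growth_factor(24, doubling_months)
    print(f"doubling every {doubling_months} months -> x{factor:.0f} over two years")
# doubling every 4 months -> x64 over two years
# doubling every 7 months -> x11 over two years
```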

Mihaela van der Schaar

Cambridge AI in healthcare professor

endorses

Argues healthcare AI requires its own evaluation methodology, distinct from general ML benchmarks.

Healthcare AI without healthcare-specific evaluation is research, not deployment.
article · van der Schaar Lab · Cambridge · 2024 · loose paraphrase

Moritz Hardt

MPI Tübingen; algorithmic fairness, evals

mixed

Argues current AI benchmarking is dangerously brittle: leaderboards reward overfitting to fixed test sets and obscure how models behave under shift. Calls for adaptive, externally validated evaluation.

Benchmarks are the most valuable lever in machine learning, yet the field treats them as if they were neutral measurements rather than artefacts shaping research.
book · The Emerging Theory of Algorithmic Fairness · fairmlbook.org · 2023 · faithful paraphrase

Ozzie Gooen

Quantified Uncertainty Research Institute founder

endorses

Argues AI risk arguments need to be expressed as explicit probabilistic models that can be inspected, criticized, and updated; built Squiggle for this purpose.

Most AI risk discussions are poorly formalized. We can do much better with explicit probabilistic estimation, and that requires both better tools and better community norms.
article · Squiggle, Quantified Uncertainty · QURI · 2023 · faithful paraphrase
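
A minimal sketch of the kind of explicit, inspectable probabilistic model the quote advocates. This is not Squiggle and not a model Gooen has published; the decomposition and every distribution below are placeholder assumptions chosen only to show the shape of the exercise.

```python
# Placeholder risk decomposition, written so that each belief is explicit and criticizable.
import random

random.seed(0)

def sample_risk() -> float:
    # Each factor is a made-up Beta-distributed belief, not an estimate from any source.
    p_dangerous_capability = random.betavariate(4, 4)   # advanced capability arrives in scope
    p_eval_misses_it = random.betavariate(2, 6)          # pre-deployment evals fail to flag it
    p_harm_given_miss = random.betavariate(2, 5)         # deployment then causes serious harm
    return p_dangerous_capability * p_eval_misses_it * p_harm_given_miss

samples = sorted(sample_risk() for _ in range(50_000))
median = samples[len(samples) // 2]
p90 = samples[int(0.9 * len(samples))]
print(f"median: {median:.3f}  90th percentile: {p90:.3f}")
```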

Percy Liang

Stanford CRFM director; HELM benchmark author

endorses

Argues rigorous, public benchmarking is the infrastructure that lets governance judgments be made at all.

Transparency is not a nice-to-have. It is the precondition for any serious AI governance.
§ paper · Foundation Model Transparency Index · Stanford CRFM · 2023-10-18 · faithful paraphrase

Peter Szolovits

MIT medical AI pioneer

endorses

Argues the medical-AI governance playbook (FDA-style pre-deployment validation plus continued monitoring) is the right template.

We've been doing evaluation of clinical AI for 50 years. The lesson is: the evaluation is the governance.
article · Peter Szolovits, MIT CSAIL · MIT CSAIL · 2023 · loose paraphrase

Rishi Bommasani

Stanford CRFM; Foundation Model Transparency Index lead

endorses

Publishes the Foundation Model Transparency Index; argues measurable transparency scores are the right instrument for governance.

Without transparency, governance cannot be meaningful.
§ paper · Foundation Model Transparency Index · Stanford CRFM · 2023-10-18 · loose paraphrase

Sam Charrington

Host of The TWIML AI Podcast

mixed

Editorial position consistently emphasizes empirical, technically grounded conversations about specific systems and benchmarks rather than ideological framings.

What matters is not the meta-debate about AI risk; it's the specific empirical questions: what these systems can actually do, how they fail, and what we are doing about both.
podcast · TWIML AI Podcast · TWIML · 2024 · faithful paraphrase

Sayash Kapoor

Princeton PhD; AI Snake Oil co-author

mixed

Pushes for rigor in AI evaluation; critiques common eval methodology as misleading about generalisation.

Leakage and overfitting in AI benchmarks have produced a whole generation of irreproducible capability claims.
blog · AI Snake Oil blog · AI Snake Oil · 2024 · loose paraphrase

Spencer Greenberg

Clearer Thinking founder; rationality researcher

mixed

Argues that calibration, prediction tracking, and concrete probabilistic reasoning should anchor AI risk debates; runs ClearerThinking.org tools to push the practice.

Most arguments about AI risk are not phrased in terms of testable predictions. We can fix that by literally writing down our beliefs and tracking them over time.
article · Clearer Thinking · Clearer Thinking · 2024 · faithful paraphrase
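
A minimal sketch of the practice the quote describes: writing beliefs down as probabilities and scoring them once outcomes resolve. The predictions and outcomes below are invented placeholders, and Brier scoring is one standard choice, not necessarily the tool Greenberg uses.

```python
# Track stated probabilities against resolved outcomes and score calibration.
predictions = [
    # (claim, stated probability, resolved outcome) -- all entries are invented examples
    ("lab X publishes a dangerous-capability eval this year", 0.7, True),
    ("benchmark Y is shown to be contaminated", 0.4, False),
    ("a government evals body gains pre-deployment access", 0.6, True),
]

def brier_score(prob: float, outcome: bool) -> float:
    """Squared error between the stated probability and what happened (0 is perfect)."""
    return (prob - (1.0 if outcome else 0.0)) ** 2

scores = [brier_score(p, o) for _, p, o in predictions]
print(f"mean Brier score over {len(scores)} predictions: {sum(scores) / len(scores):.3f}")
```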

Stephen Casper

MIT PhD researcher; red-teaming and model audit

endorses

Argues empirical red-teaming reveals that current safeguards are not robust; auditing must become standard infrastructure.

Example after example, state-of-the-art safeguards get pretty reliably broken. That's the empirical reality.
podcast · Stephen Casper at Center for AI Policy Podcast · Center for AI Policy Podcast · 2024 · faithful paraphrase

Tatsunori Hashimoto

Stanford; CRFM; LLM evaluation and security

endorses

Argues robust evaluation requires carefully constructed datasets that resist contamination and reveal real generalization, not leaderboard-fitted numbers.

The dominant evaluation paradigm in NLP is fundamentally susceptible to contamination and overfitting. We need to design tests that are robust to the way models actually develop.
article · Tatsunori Hashimoto, Stanford · Stanford CS · 2023 · faithful paraphrase

Toby Shevlane

DeepMind model evaluations researcher

endorses

Helps design dangerous-capability evaluations and advocates for their adoption as the load-bearing governance artefact.

Dangerous capability evaluations are the minimum viable governance instrument for frontier AI.
§ paper · Model evaluation for extreme risks · arXiv · 2023-05 · faithful paraphrase

Trishan Panch

Wellframe co-founder; Harvard health AI

endorses

Argues clinical AI requires evidence-based deployment standards akin to drug trials.

Medical AI without clinical-grade evidence is malpractice with extra steps.
article · Wellframe · Wellframe · 2024 · loose paraphrase

Yu Su

Ohio State; AI agents and reasoning

endorses

Argues real-world agent evaluations, where the agent must take actions in actual environments, surface different capability and safety properties than synthetic benchmarks.

We benchmark LLM agents on real, live websites. Performance gaps between lab benchmarks and real-world deployment are large, and they reveal where capability claims most often overreach.
§ paper · Mind2Web: Towards a Generalist Agent for the Web · arXiv / NeurIPS · 2024 · faithful paraphrase

Zico Kolter

CMU professor; OpenAI safety board chair

endorses

Argues robust evaluations and adversarial testing are the load-bearing safety practices; oversees these reviews at OpenAI as committee chair.

The Safety and Security Committee reviews safety processes for major model releases and has the authority to delay launches if safety concerns are not adequately addressed.
article · OpenAI's Safety and Security Committee transitions to independent oversight · OpenAI · 2024-09 · faithful paraphrase