Control mechanism ↑ · AI artefact
Alignment first
Technical alignment is solvable before critical capability thresholds close, and aligned systems compose safely into aligned populations.
Mechanism
Invest primarily in interpretability, scalable oversight, and post-training methods so that AI systems do what their principals intend.
What this name has meant
Vintage drift. The name is stable; the content has shifted. A reader acting on the label without asking which vintage is meant risks arguing with a position nobody currently holds.
Original vintage: solving the principal–agent problem for arbitrary specified values, a research agenda aimed at future agentic systems.
Current vintage: training models to produce helpful, honest, harmless outputs via RLHF and constitutional methods. Current alignment practice absorbs this label.
If it succeeds: what binds next
Aligned frontier AI exists. Principals now choose what to align to, and operator legitimacy becomes the binding constraint. The race is over who gets to be the principal.
A strategy that produces a worse next problem than the one it solved has not done durable work.
Falsification signal
Interpretability and oversight methods stop scaling with model capability; stronger models become less, rather than more, inspectable.
A strategy held without a falsification signal is not strategy; it is affiliation. Continued support after this signal lands is identity, not a bet. See the identity diagnostic.
Self-undermining threshold
Overshoot risk. When alignment investment hollows out institutional capacity.
Concentrating talent and funding in alignment produces a shortage of the democratic and institutional capacity an aligned superintelligence would land in. Solved alignment, dysfunctional substrate.
Every strategy has a stable region where it reinforces itself and an unstable region where pursuit defeats it. The threshold between them is usually narrower than advocates acknowledge.
People on the record
103 on the record. Profiled figures appear first, with their tier in small caps. Each entry links to the person and their full quote record. Tag: alignment-first.
Expertise mix and recognition mix (charts over the 29 profiled).
A strategy whose endorsement skews to commentators or external-domain experts is in a different epistemic state from one endorsed mostly by frontier-builders. The mix is read carefully across both axes; see the board for criteria. Counts are over the 29 profiled people on this strategy (74 unprofiled excluded); a counting sketch follows the list below.
Ajeya Cotra
Governance, policy, strategy · Recognised inside subfield
Alan Turing
Deep ML / safety technical · Mass-public recognition
Anca Dragan
Builds frontier systems · Known across the AI/safety field
Andrew G. Barto
Deep ML / safety technical · Known across the AI/safety field
Brian Christian
Expert in another field · Known across the AI/safety field
Buck Shlegeris
Deep ML / safety technical · Recognised inside subfield
Claude Shannon
Deep ML / safety technical · Mass-public recognition
Daniel Dewey
Deep ML / safety technical · Recognised inside subfield
Doris Tsao
Expert in another field · Known across the AI/safety field
Evan Hubinger
Deep ML / safety technical · Recognised inside subfield
Iyad Rahwan
Deep ML / safety technical · Known across the AI/safety field
Jan Leike
Builds frontier systems · Known across the AI/safety field
John McCarthy
Deep ML / safety technical · Mass-public recognition
John Schulman
Builds frontier systems · Known across the AI/safety field
Joseph Carlsmith
Governance, policy, strategy · Known across the AI/safety field
Marvin Minsky
Deep ML / safety technical · Mass-public recognition
Nick Bostrom
Governance, policy, strategy · Mass-public recognition
Norbert Wiener
Expert in another field · Mass-public recognition
Owain Evans
Deep ML / safety technical · Recognised inside subfield
Paul Christiano
Builds frontier systems · Known across the AI/safety field
Richard Ngo
Deep ML / safety technical · Recognised inside subfield
Rob Miles
Applied or adjacent technical · Known across the AI/safety field
Rohin Shah
Deep ML / safety technical · Recognised inside subfield
Ryan Greenblatt
Deep ML / safety technical · Recognised inside subfield
Scott Aaronson
Deep ML / safety technical · Known across the AI/safety field
Stuart Armstrong
Deep ML / safety technical · Recognised inside subfield
Stuart Russell
Deep ML / safety technical · Mass-public recognition
Victoria Krakovna
Deep ML / safety technical · Recognised inside subfield
Wei Dai
Deep ML / safety technical · Recognised inside subfield
Unprofiled:
Aaron Courville
Université de Montréal; Deep Learning textbook co-author
Adam Jermyn
Anthropic; previously astrophysics
Adam Kalai
Microsoft Research; AI fairness and safety
Agnes Callard
University of Chicago philosopher; aspiration theorist
Alex Irpan
Google Brain alumnus; Sorta Insightful blog
Alex Pan
Berkeley CHAI; reward hacking
Alex Turner
DeepMind alignment researcher; shard theory co-originator
67 more on the record. See the full tag page: alignment-first
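The two-axis read can be made concrete with a small tally. Below is a minimal sketch, assuming a hypothetical tuple encoding of the entries above; only three of the 29 profiled entries are spelled out, so the printed counts are illustrative, not the page's actual mix.

```python
# Minimal sketch: tally the expertise x recognition mix over profiled entries.
# The tuple layout is a hypothetical encoding of the list above; only three
# of the 29 entries are included, so counts here are illustrative.
from collections import Counter

profiled = [
    # (name, expertise tier, recognition tier)
    ("Ajeya Cotra", "Governance, policy, strategy", "Recognised inside subfield"),
    ("Jan Leike", "Builds frontier systems", "Known across the AI/safety field"),
    ("Stuart Russell", "Deep ML / safety technical", "Mass-public recognition"),
    # ... remaining 26 profiled entries elided
]

# Marginal mixes, one count per axis.
expertise_mix = Counter(expertise for _, expertise, _ in profiled)
recognition_mix = Counter(recognition for _, _, recognition in profiled)

# Joint read: the epistemic state depends on the combination of axes,
# e.g. a frontier-builder endorsement reads differently from a commentator's.
cross_tab = Counter((e, r) for _, e, r in profiled)

for (expertise, recognition), n in sorted(cross_tab.items()):
    print(f"{expertise} × {recognition}: {n}")
```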
Load-bearing commitments
Worldview positions this strategy quietly assumes. If the claim fails empirically or philosophically, the strategy loses its target or its premise.
Principals have determinate values that AI can learn.
Fails if values are contested or constructed; the strategy then loses its target.
AI is a tool with controllable properties.
Fails if AI has emergent agency; the tool frame then fails and alignment becomes negotiation.
Coordinates
Conflicts, grouped by mechanism
0 strict conflicts catalogued. This strategy pulls a lever that nothing else pulls in the opposite direction.
Complements, grouped by mechanism
5 catalogued.
Cross-side bridge (one AI-side, one world-side)
One acts on the model, the other on institutions or culture. The bridge hedges against both artefact-level and substrate-level failure.
Adjacent bet (different levers, loosely coupled)
Different levers, different directions of action. They reinforce only via the general principle that covering more bets dominates covering fewer.
Stage-sequenced (one sets up the other)
The pair is phase-offset: one acts before the transition, the other during or after. The first creates the conditions under which the second binds.
Same phase, different layer (same stage, distinct levers)
Both are active in the same phase of the transition but act on different layers (model vs institution vs culture). They cover different failure modes inside the same window.
Same-lever twins
5 catalogued. Both use the same lever in the same direction. Usually redundant inside a portfolio: each dollar or effort unit buys only one lever pull, even if two strategies are named. A schematic sketch follows below.
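As a schematic, the mechanism taxonomy above can be expressed as a small typed model. Everything below is a hypothetical encoding, not part of the source catalogue: it shows why same-lever twins are the one decidable case, since the lever and direction fields alone determine them, which is exactly the redundancy warning above.

```python
# Hypothetical schema for the relation mechanisms above. Only same-lever
# twins are decidable from lever and direction alone; the other mechanisms
# would need phase and layer data not modelled in this sketch.
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    CROSS_SIDE_BRIDGE = "one AI-side, one world-side"
    ADJACENT_BET = "different levers, loosely coupled"
    STAGE_SEQUENCED = "one sets up the other"
    SAME_PHASE_DIFFERENT_LAYER = "same stage, distinct levers"
    SAME_LEVER_TWINS = "same lever, same direction"


@dataclass(frozen=True)
class Strategy:
    name: str
    lever: str      # e.g. "control mechanism"
    direction: str  # e.g. "up"


def same_lever_twins(a: Strategy, b: Strategy) -> bool:
    """True when two named strategies pull the same lever the same way:
    a portfolio holding both still buys only one lever pull."""
    return a.lever == b.lever and a.direction == b.direction


# Usage: a hypothetical twin of this page's strategy.
alignment_first = Strategy("Alignment first", "control mechanism", "up")
twin = Strategy("Hypothetical same-lever bet", "control mechanism", "up")
assert same_lever_twins(alignment_first, twin)
```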
Axis position: Control mechanism ↑, acting on the AI artefact.
Source note: Alignment first strategy.md