Control mechanism · ai artefact
AI for safety
The same capability that makes AI dangerous makes it uniquely useful for automating alignment research and oversight.
Mechanism
Use AI for automated alignment research, interpretability, evaluation, and scalable oversight of stronger AI by weaker AI.
If it succeeds: what binds next
AI runs alignment research. The question becomes whether the safety-AI itself is aligned; the original problem recurses one level up.
A strategy that produces a worse next problem than the one it solved has not done durable work.
Coordinates
Conflicts, grouped by mechanism
Frame opposition (2)
Incompatible premises: the strategies accept different premises about what AI is or what the binding problem is. They conflict not on lever choice but on the frame that makes lever choice sensible.
Complements, grouped by mechanism
Same-lever reinforce (4)
Same lever, same pull, different mechanism: both strategies pull the same lever in the same direction by different means. They stack: doing both amplifies the pull, at the cost of double-counting in portfolio audits.
Same phase, different layer
Same stage, distinct levers: both are active in the same phase of the transition but act on different layers (model vs. institution vs. culture). They cover different failure modes inside the same window.
Same-lever twins (2)
Both use the same lever in the same direction. Usually redundant inside a portfolio: each dollar or unit of effort buys only one lever pull, even if two strategies are named.
Axis position
Source note: AI for safety strategy.md