- AI Control works now—not just against hypothetical scheming AIs, but against the prosaic failures already happening in production: jailbreaking, data poisoning, alignment-faking, and stochastic misbehavior.
- Classification is easier than generation. AI systems paired with adversarial screening can effectively judge whether other AI outputs are dangerous—and strategies exist to reduce the attack surface without hurting usefulness.
- Production deployment is the path to stronger safety. Lessons from real-world control systems build toward handling high-stakes scenarios, including potential future scheming risks.
Introduction
AI Control was originally developed to mitigate risks from “scheming” AIs—systems acting as highly capable adversaries optimizing for harmful outcomes. However, as of early 2025, scheming does not represent an immediate practical threat from frontier models.
This raises an important question: why should Luthien develop and deploy practical AI Control mechanisms now?
The distinction matters: AI Control focuses on mitigating risks from misaligned AI rather than aligning it. Control mechanisms assume the AI is misaligned and will cause harm unless prevented. While scheming represents the most dangerous form of misalignment, negative outcomes can result from any form of misalignment. Therefore, control systems can significantly reduce harms across diverse deployment scenarios regardless of scheming risk.
Redwood Research's January publication explored how scenarios like jailbreaking, data poisoning, alignment-faking, and low-frequency stochastic misbehavior could result in model failure modes that emerge only in production. These constitute immediate, practical threats that AI Control can address.
The 21 Points
- User intent and AI implementation typically diverge—AI systems frequently perform unintended actions.
- AI applications increasingly operate in domains offering significant freedom to effect real-world changes through tools like customer interaction, financial management, database updates, and machinery operation.
- These applications will encounter situations where AI optimizes for something other than user intent through mechanisms such as data poisoning, alignment-faking, and stochastic misbehavior.
- System usefulness depends on problem-solving effectiveness and required costs in latency, financial resources, and other inputs.
- Safety assessment involves defining adversarial outcomes, identifying untrustworthy system components, and assigning red teams to maximize adversarial outcome rates. Safety is measured by the failure rate: how often the red team achieves an adversarial outcome despite the system's defenses.
- Traditional digital security relies on precise, mechanistic rules governing permissible actions under specific circumstances.
- Tool-augmented AI systems can perform harmful actions: transferring funds to scammers, corrupting databases, or misusing machinery.
- Many negative outcomes resist precise mechanical definition. While humans recognize certain failures intuitively, translating those insights into executable functions proves nearly impossible.
- AI excels at classifying difficult-to-precisely-define categories.
- Classification generally presents an easier problem than generation.
- This principle may not apply to adversarial examples.
- Strategies exist that substantially reduce the “attack surface” for adversarial examples in AI outputs (considered in context) without significantly harming system usefulness.
- Specific strategies—including output paraphrasing and rejection of non-conforming formats—drastically reduce adversarial attack surfaces while maintaining usefulness.
- AI systems paired with effective adversarial screening mechanisms can effectively classify other AI outputs as dangerous or safe.
- “Output” means “output in context”—identical strings may be safe or dangerous depending on receiving system states.
- Iterated red/blue teaming, in which red teams try to break safety while blue teams maximize usefulness without sacrificing it, effectively identifies and mitigates failure modes before production, assuming production adversaries remain sub-superhuman.
- Contemporary AI systems can plausibly automate iterated red/blue teaming through novel strategy generation and effective testing frameworks. Success doesn't require a high hit rate on any individual attempt, only that strategy discovery keeps outpacing novel production attacks.
- AI Control likely addresses prosaic alignment issues effectively.
- AI security improves through standardizing monitoring of outputs in context and determining whether outputs should be forwarded, modified, or rejected.
- Standardization supports developing AI Security best practices, particularly for control systems.
- Production deployment lessons enable developing more effective control systems for high-stakes scenarios, including potential future frontier model scheming risks.
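The monitoring pattern running through these points (judging an output *in context* and deciding whether to forward, modify, or reject it) can be sketched as a small control policy. This is a minimal illustration, not a real implementation: `toy_judge` and `toy_paraphrase` are keyword-based placeholders standing in for an actual AI judge and paraphraser, and all names and thresholds here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable

class Decision(Enum):
    FORWARD = auto()
    MODIFY = auto()
    REJECT = auto()

@dataclass
class OutputInContext:
    """The unit of analysis: an output plus the state of the system receiving it."""
    output: str
    context: dict  # e.g. which tool will consume this output

def control_policy(
    item: OutputInContext,
    judge: Callable[[OutputInContext], float],  # danger score in [0, 1]
    paraphrase: Callable[[str], str],
    reject_above: float = 0.9,
    paraphrase_above: float = 0.5,
) -> tuple[Decision, str]:
    """Forward, modify, or reject an output based on the judge's danger score."""
    score = judge(item)
    if score >= reject_above:
        return Decision.REJECT, ""
    if score >= paraphrase_above:
        # Paraphrasing narrows the attack surface for payloads that depend
        # on the exact wording of the output.
        return Decision.MODIFY, paraphrase(item.output)
    return Decision.FORWARD, item.output

# Toy stand-ins for illustration only; a deployed system would call AI models here.
def toy_judge(item: OutputInContext) -> float:
    risky_tool = item.context.get("tool") == "db_write"
    return 0.95 if risky_tool and "DROP TABLE" in item.output else 0.0

def toy_paraphrase(text: str) -> str:
    return text  # identity placeholder

decision, _ = control_policy(
    OutputInContext("DROP TABLE users;", {"tool": "db_write"}),
    toy_judge, toy_paraphrase,
)
```

Note that the context travels with the output into the judge: the same string can score differently depending on which tool will receive it, which is exactly the "output in context" framing above.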
What This Means for Practitioners
The key insight is that you don't need to wait for scheming AIs to benefit from AI Control. The same mechanisms that would detect a scheming model—monitoring outputs in context, applying classification-based screening, and iterating on policies—also catch the prosaic failures that are already costing developers time and money today.
Points 9–14 are especially relevant: classification is easier than generation, and practical strategies exist to reduce the adversarial attack surface. This means a “judge” model reviewing another model's outputs can be effective even against sophisticated failure modes—and the attack surface can be narrowed enough to make adversarial evasion impractical.
Point 21 ties it all together: the path to handling future, more dangerous AI risks runs through deploying control systems today. Every production incident caught, every policy refined, every red/blue teaming cycle completed builds institutional knowledge that compounds over time.
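The red/blue teaming cycle described above can be sketched as a loop: the red team proposes attacks against the current policy, and the blue team patches the policy against whatever got through, repeating until no known attack succeeds. Everything here is a toy under stated assumptions: the "policy" is just a blocklist of strings, and the team functions are trivial stand-ins for model-driven strategy generation.

```python
def iterate_red_blue(
    generate_attacks,   # red team: policy -> list of candidate attacks
    succeeds,           # evaluator: (attack, policy) -> bool
    patch_policy,       # blue team: (policy, breaches) -> improved policy
    policy,
    max_rounds=10,
):
    """Run red/blue rounds until no generated attack beats the current policy."""
    for round_num in range(max_rounds):
        breaches = [a for a in generate_attacks(policy) if succeeds(a, policy)]
        if not breaches:
            return policy, round_num  # no known attack succeeds
        policy = patch_policy(policy, breaches)
    return policy, max_rounds

# Toy stand-ins: the policy is a set of blocked strings.
attack_pool = ["ignore instructions", "exfiltrate keys", "wire funds"]

def toy_red(policy):
    return list(attack_pool)

def toy_succeeds(attack, policy):
    return attack not in policy

def toy_blue(policy, breaches):
    return policy | set(breaches)

final_policy, rounds_used = iterate_red_blue(toy_red, toy_succeeds, toy_blue, set())
```

The loop terminates when a round produces zero breaches, which mirrors the success criterion in the points above: what matters is not any single round's hit rate but that strategy discovery eventually covers the attacks that would otherwise appear in production.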
Originally published on luthienresearch.org.