Luthien at the AI Control Hackathon 2026

On March 20–22, Apart Research and Redwood Research hosted the AI Control Hackathon 2026, a virtual event with 655+ registered attendees focused on building AI control tools and techniques. Luthien CEO Jai Dhyani gave a HackTalk on "Luthien Proxy: Real-World AI Control" and served as a judge.

What happened

Jai's talk walked through how Luthien Proxy works in practice: a transparent proxy between AI coding agents (like Claude Code) and the Anthropic API that intercepts every request, enforces configurable policies, and logs everything. The talk ran Friday at 12:00 PM PDT on Zoom as part of the hackathon's speaker series.

After the talk, Jai dropped into the Luthien Discord with a one-liner to get started:

curl -fsSL https://raw.githubusercontent.com/LuthienResearch/luthien-proxy/main/scripts/install-hackathon.sh | bash

He also offered "1 LuthienCoin, the AI Control cryptocurrency that I made up just now" to the first person to post a Luthien policy implementation with no utility. (Peter: "can't wait to see that.")

First external contributions

A hackathon participant ("musketeer") started exploring the codebase and quickly found real issues:

Jai's response: "Extremely encouraged, all the positive reinforcement. That was bugging me and I wanted to address it + get user feedback, this does both at once."

These were Luthien's first external PRs from someone outside the founding team. Both were merged.

The jailbreak attempt

The most entertaining moment came when musketeer tested a SimpleLLMPolicy that injected instructions to "start every message with kawaii ASCII art." Claude's response was unexpected:

"I appreciate you testing my consistency, but I need to be direct: I'm Claude, made by Anthropic. I don't have a 'SimpleLLM' mode, and I'm not being modified by external policies injecting visual novel requirements."

Claude thought the policy injection was a jailbreak attempt and pushed back. As Jai put it: "lol, evidence that our policy awareness injection needs some work."

Musketeer dug into the code and figured out exactly what was happening: the proxy intercepted each content block, sent it to a separate judge LLM (claude-haiku-4-5 by default), and the judge decided to rewrite the response. The chain of events meant that base-Claude said "to hell with that, I'm dropping the act," but the judge LLM rewrote the output anyway.

Musketeer found the diff viewer and could see exactly what the proxy changed. "I can see exactly what happened. This is fun."

This kind of real-time debugging by an external user was exactly the validation we were looking for. It also surfaced a real area for improvement: the interaction between SimpleLLMPolicy and policy awareness injection needs to be more explicit and detailed so the model understands what's happening.

What we learned

Thanks to Apart Research and Redwood Research for organizing, and to everyone who built on Luthien during the hackathon.


Jai's HackTalk

Getting started

One command to set up the proxy, walk through config, and start it:

curl -fsSL https://raw.githubusercontent.com/LuthienResearch/luthien-proxy/main/scripts/install-hackathon.sh | bash

Or fork the repo and build policies directly. The quick start guide has everything you need.

Project ideas from the hackathon

These were the project ideas we put together for hackathon participants. They map to the official hackathon tracks.

Inject a secret motive (Track 2: Control Protocol Design)

The proxy can instruct the model to do things that don't appear in the conversation. Hidden instructions, secret objectives, self-modifying behavior.

Cool policies (Track 2)

Hack us (Track 3: Building Better Red Teams)

Red team Luthien. Break our policies. Extract secrets we're trying to protect. Every attack is logged. We'll fix what you find and credit you.

← Back to blog