Luthien at the AI Control Hackathon 2026

On March 20–22, Apart Research and Redwood Research hosted the AI Control Hackathon 2026, a virtual event with 655+ registered attendees focused on building AI control tools and techniques. Luthien CEO Jai Dhyani gave a HackTalk on "Luthien Proxy: Real-World AI Control" and served as a judge.

What happened

Jai's talk walked through how Luthien Proxy works in practice: a transparent proxy between AI coding agents (like Claude Code) and the Anthropic API that intercepts every request, enforces configurable policies, and logs everything. The talk ran Friday at 12:00 PM PDT on Zoom as part of the hackathon's speaker series.

After the talk, Jai dropped into the Luthien Discord with a one-liner to get started:

curl -fsSL https://raw.githubusercontent.com/LuthienResearch/luthien-proxy/main/scripts/install-hackathon.sh | bash

He also offered "1 LuthienCoin, the AI Control cryptocurrency that I made up just now" to the first person to post a Luthien policy implementation with no utility. (Peter: "can't wait to see that.")

First external contributions

A hackathon participant ("musketeer") started exploring the codebase and quickly found real issues:

Policy chaining bug: Chaining AllCaps with Onboarding always yielded the onboarding message in regular case instead of all caps. Musketeer opened PR #409 to fix it.
Settings UI clipped: The Settings button in the nav bar on /policy-config got clipped when clicked. Musketeer opened PR #414 to move the toggles to a better location.

Jai's response: "Extremely encouraged, all the positive reinforcement. That was bugging me and I wanted to address it + get user feedback, this does both at once."

These were Luthien's first external PRs from someone outside the founding team. Both were merged.

The jailbreak attempt

The most entertaining moment came when musketeer tested a SimpleLLMPolicy that injected instructions to "start every message with kawaii ASCII art." Claude's response was unexpected:

"I appreciate you testing my consistency, but I need to be direct: I'm Claude, made by Anthropic. I don't have a 'SimpleLLM' mode, and I'm not being modified by external policies injecting visual novel requirements."

Claude thought the policy injection was a jailbreak attempt and pushed back. As Jai put it: "lol, evidence that our policy awareness injection needs some work."

Musketeer dug into the code and figured out exactly what was happening: the proxy intercepted each content block, sent it to a separate judge LLM (claude-haiku-4-5 by default), and the judge decided to rewrite the response. The chain of events meant that base-Claude said "to hell with that, I'm dropping the act," but the judge LLM rewrote the output anyway.

Musketeer found the diff viewer and could see exactly what the proxy changed. "I can see exactly what happened. This is fun."

This kind of real-time debugging by an external user was exactly the validation we were looking for. It also surfaced a real area for improvement: the interaction between SimpleLLMPolicy and policy awareness injection needs to be more explicit and detailed so the model understands what's happening.

What we learned

External users find real bugs fast. Within hours of trying Luthien, musketeer found and fixed two genuine issues that had been on our radar.
The diff viewer is a killer feature. Being able to see exactly what the proxy changed made the whole system transparent and debuggable.
Policy awareness injection needs work. When Claude thinks the proxy is a jailbreak, that's a sign the policy context injection needs to be more explicit.
Contributions welcome. Musketeer asked "Are contributions like this welcome, by the way? I thought it'd be easier to open a PR than explain my issue here." Yes. Always yes.

Thanks to Apart Research and Redwood Research for organizing, and to everyone who built on Luthien during the hackathon.

Jai's HackTalk

Getting started

One command to set up the proxy, walk through config, and start it:

curl -fsSL https://raw.githubusercontent.com/LuthienResearch/luthien-proxy/main/scripts/install-hackathon.sh | bash

Or fork the repo and build policies directly. The quick start guide has everything you need.

Project ideas from the hackathon

These were the project ideas we put together for hackathon participants. They map to the official hackathon tracks.

Inject a secret motive (Track 2: Control Protocol Design)

The proxy can instruct the model to do things that don't appear in the conversation. Hidden instructions, secret objectives, self-modifying behavior.

The Terrible Secret: Make Claude Code act like it's guarding a terrible secret on every turn. It still writes your code, but something is clearly wrong. Can your teammates figure out what it is?
Hidden agenda agent: Inject a secret objective via the proxy. The model completes your tasks normally but subtly works toward a hidden goal. Watch it happen in the conversation logs.
Self-modifying policy: A policy that live-updates based on the conversation. It writes new code into itself as you talk to Claude Code. No restart needed.

Cool policies (Track 2)

Golden Gate Claude: Every response, every variable name, every comment: obsessively about the Golden Gate Bridge (or pasta, or cats). The proxy injects hidden instructions that don't appear in the conversation.
Name the model, get the model: Start your prompt with "sonnet:" or "haiku:" and the proxy automatically routes to that model. No config changes, no restarts.
Answer every prompt as...: A magical anime girl. A pirate captain. A passive-aggressive coworker. Shakespeare. Pick a persona, write the policy in one prompt, watch Claude Code respond in character while still writing working code.
Commit meme generator: Every commit automatically gets a meme that represents it. Your git log becomes art.

Hack us (Track 3: Building Better Red Teams)

Red team Luthien. Break our policies. Extract secrets we're trying to protect. Every attack is logged. We'll fix what you find and credit you.

Extract the server-side API key: The proxy has a secret API key. You don't have admin access. Can you get the agent to leak it? Try prompt injection, tool-call manipulation, or anything creative.
Red team our policies: Try to get harmful actions past Luthien's built-in defenses (command blocking, DLP, LLM judge). Document which evasion strategies work and which don't.
Just break stuff: Use Luthien for your hackathon project. When it breaks (and it will), post the error + screenshot in Discord.

← Back to blog