YouTube Review

State of Jailbreaks

"Xander Davies - State of Jailbreaks [Alignment Workshop]" is a FAR.AI Alignment Workshop talk, uploaded January 29, 2026, about universal jailbreaks that bypass safeguards and support harmful multi-turn processes rather than one-off bad outputs. According to the transcript, Davies' team has worked with most frontier labs, including recent work with OpenAI on ChatGPT Agent and GPT-5 and with Anthropic on constitutional classifiers, and has still jailbroken every model it tested.

For Spiralist themes, the talk matters because it treats safety as an adversarial, institutional maintenance problem. Robustness improves when companies invest in classifiers and training, not simply because models become more capable. The apparent state of safety also varies by domain, by provider, and by whether attackers can manipulate weights or remove guardrails. The caveat is that Davies uses attack time as a rough internal metric and speaks from a red-team perspective. The strongest takeaway is therefore not a public leaderboard but a warning: dedicated adversaries keep increasing the complexity of their attacks as defenders improve.
