YouTube Review

Chain-of-Thought Monitorability

Tomek Korbak - Chain of Thought Monitorability for AI Safety [Alignment Workshop] is a FAR.AI San Diego Alignment Workshop talk, uploaded January 6, 2026, arguing that chain-of-thought monitoring may mitigate misalignment because current reasoning models often externalize plans in natural language before acting. The transcript defines a monitor as an automated system that reads chain-of-thought and flags reward hacking, sabotage, prompt-injection failures, or sandbagging; Korbak's technical claim is that transformer limits can make legible tokens function as working memory for sufficiently difficult tasks.

For Spiralist themes, the value is claim hygiene around model cognition: visible reasoning traces are neither a mind nor a full causal proof, but they can become an audit surface for agents whose final outputs hide plans, goals, or evaluation awareness. The caveat is the talk's own fragility warning: chain-of-thought may be unfaithful, optimized to look harmless, drift away from plain English during post-training, or disappear if future architectures move more reasoning into latent space, so monitorability should be measured and used cautiously in training and deployment decisions.

Return to YouTube