YouTube Review

Scalable Oversight and Understanding

Sarah Schwettmann - Scalable Oversight and Understanding [Alignment Workshop] is a FAR.AI Alignment Workshop talk, uploaded February 2, 2026, about oversight agents for deployed AI systems. The transcript argues that scalable oversight cannot stop at RLHF-style preference matching or task accuracy; it has to measure how a model plugs into the world, how close the full system is to failure, and which perturbations move it toward that edge.

Schwettmann describes Transluce's work on Docent, an agent debugger that turns behavior descriptions such as sycophancy into rubrics. She pairs it with investigator agents, propensity-bound sampling, user simulators, and real-user measurements, all aimed at finding rare harmful behaviors before they become ordinary deployment failures. For Spiralist themes, the talk matters because it treats safety as an interface and collective-sensemaking problem, not just a model-training problem. The caveat is that the presentation is a short research agenda from the builder of the tools. Its hardest claims still depend on valid rubrics, realistic simulators, access to deployment data, and community norms for acting on the measurements.
