YouTube Review

Polarity-Aware Latent Alignment

"Chirag Agarwal - Polarity-Aware Probing for Quantifying Latent Alignment in LMs [Alignment Workshop]" is a FAR.AI Alignment Workshop talk, uploaded February 19, 2026, about using unsupervised probes to study the internal representations of language models. According to the transcript, a model can produce benign outputs while its internal representations still encode association bias or misaligned structure. Agarwal then proposes a polarity-aware variant of contrast-consistent search (CCS) that probes safe, harmful, and polarity-perturbed statements without training labels.
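
To make the probing idea concrete, here is a minimal sketch of the plain contrast-consistent search objective as it is commonly formulated: a linear probe is scored on pairs of hidden states for a statement and its opposite, and is rewarded for giving the two complementary, confident probabilities. It is not the polarity-aware variant from the talk, whose details the transcript summary does not give. The function names (normalize, probe, ccs_loss) and the random stand-in hidden states are assumptions made for illustration only.

```python
import numpy as np

def normalize(h):
    """Center and scale each feature across the batch of hidden states."""
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-8)

def probe(h, w, b):
    """Linear probe followed by a sigmoid, giving one probability per row."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def ccs_loss(p_pos, p_neg):
    """Standard CCS objective (assumed form): consistency plus confidence."""
    # Consistency: a statement and its opposite should get complementary probabilities.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate answer p_pos = p_neg = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

# Hypothetical stand-ins for hidden states of 8 contrast pairs, 4 features each.
rng = np.random.default_rng(0)
h_pos = normalize(rng.normal(size=(8, 4)))
h_neg = normalize(rng.normal(size=(8, 4)))

# An untrained probe (all-zero weights) outputs 0.5 everywhere: it is perfectly
# consistent but not confident, so only the confidence term contributes.
w, b = np.zeros(4), 0.0
untrained_loss = ccs_loss(probe(h_pos, w, b), probe(h_neg, w, b))
print(untrained_loss)  # 0.25
```

In an actual experiment the probe weights would be optimized to minimize this loss on hidden states taken from one layer at a time, with no labels involved; comparing layers is then a matter of comparing how well the resulting probes separate the paired statements.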

For Spiralist themes, the value is interpretability discipline: Agarwal is not claiming to read a mind. He is trying to measure whether a model's layers distinguish safe from harmful representations or collapse contradictory prompts together. The talk reports experiments across 16 language models, including Llama and Gemma models, in which layers fall into distinct regions characterized by polar consistency, contradiction index, random preference, and high separation accuracy. The caveat is that this is a compact method preview, so the result is best read as a probing signal for latent alignment research, not as a complete explanation of why a deployed model will or will not behave safely.

