Last reviewed June 25, 2026

The Codebook Becomes the Safety Gate

Yunqi Xue, Zhijiang Li, Philip Torr, and Jindong Gu's June 2026 arXiv paper treats the visual-token codebook inside autoregressive image generators as a safety boundary: not a prompt filter, not a post-hoc classifier, but the place where harmful visual mappings can be identified, projected away, and audited.

Not a Filter

The paper, arXiv:2606.27147 [cs.CV, cs.AI], was submitted on June 25, 2026. arXiv lists the exact title as Safe Autoregressive Image Generation with Iterative Self-Improving Codebooks, by Yunqi Xue, Zhijiang Li, Philip Torr, and Jindong Gu. The arXiv record also notes that the paper, 10 pages including references with eight figures, was accepted for publication at the 43rd International Conference on Machine Learning, ICML 2026.

The angle is timely because many image-safety discussions still picture diffusion systems: a prompt enters, a continuous latent process unfolds, and safety is handled by prompt moderation, latent steering, output classification, or concept erasure. This paper focuses on a different family. Autoregressive unified multimodal models generate images by predicting discrete visual tokens one after another. Each token indexes a codebook, a learned table of embedding vectors that the image decoder turns into quantized visual patterns.

That makes the codebook more than a compression device. It becomes a hidden policy surface. If unsafe visual patterns are repeatedly reachable through certain token directions, safety work can move into the representation that supplies the patches themselves.

What Safe-CB Changes

The proposed method is called Safe-CB, short for Safe-CodeBook. It starts by using the unified multimodal model itself to generate images from prompts and judge which outputs contain harmful content. For prompts that produce unsafe images, the method constructs paired harmful and safer image-text examples by replacing unsafe terms with safe alternatives while trying to preserve the rest of the prompt semantics.

Those pairs are used to estimate a Harmful Space inside the model's codebook representations. The paper describes extracting the token embeddings used during image generation for each harmful prompt and its safer counterpart, taking their differences, averaging across pairs, and applying singular value decomposition. The top singular directions define a harmful subspace. The original codebook is then projected away from that subspace.
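
The estimation and projection steps described above can be sketched with NumPy on synthetic data. This is a minimal illustration, not the authors' implementation: the function names, the mean-pooling of token embeddings within each pair, and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def harmful_subspace(harmful_embs, safe_embs, rank):
    """Estimate an orthonormal basis (dim x rank) for harmful directions.

    harmful_embs / safe_embs: lists of (num_tokens, dim) arrays, one per pair.
    Mean-pooling over tokens is an assumption of this sketch.
    """
    diffs = np.stack([h.mean(axis=0) - s.mean(axis=0)
                      for h, s in zip(harmful_embs, safe_embs)])
    # Rows of vt are right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:rank].T

def project_away(codebook, basis):
    """Remove each code vector's component inside span(basis)."""
    return codebook - (codebook @ basis) @ basis.T

# Synthetic demo: plant one "harmful" direction u and try to recover it.
rng = np.random.default_rng(0)
dim, num_pairs, num_tokens = 16, 20, 8
u = rng.normal(size=dim)
u /= np.linalg.norm(u)
safe_embs = [rng.normal(size=(num_tokens, dim)) for _ in range(num_pairs)]
harmful_embs = [s + 3.0 * u + 0.1 * rng.normal(size=s.shape) for s in safe_embs]

basis = harmful_subspace(harmful_embs, safe_embs, rank=1)
codebook = rng.normal(size=(64, dim))   # toy codebook with 64 entries
safe_codebook = project_away(codebook, basis)

print("alignment with planted direction:", abs(float(basis[:, 0] @ u)))
print("max residual along basis:", np.abs(safe_codebook @ basis).max())
```

In this toy setting the recovered basis vector should line up closely with the planted direction, and the projected codebook should have essentially no component left along it. The rank is the knob an auditor would want recorded, since it controls how much of the embedding space is removed.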

Projection alone can damage visual quality, so the second step fine-tunes a codebook perturbation in the null, or harmless, space. The stated goal is to recover useful visual detail without reintroducing the directions identified as harmful. The two steps can be repeated until no additional improvement is observed. In the paper's framing, the system improves its codebook safety without a new human-annotated dataset for each loop.
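
One way to picture the null-space constraint is gradient descent in which every update is projected onto the complement of the harmful subspace. The sketch below is an illustration under stated assumptions: the squared-error loss against a `target` array is a placeholder for whatever quality objective the paper actually optimizes, and the function names, learning rate, and step count are hypothetical.

```python
import numpy as np

def null_space_projector(basis):
    """Matrix that removes any component lying in span(basis)."""
    dim = basis.shape[0]
    return np.eye(dim) - basis @ basis.T

def finetune_perturbation(codebook_proj, target, basis, steps=200, lr=0.1):
    """Learn a perturbation that stays orthogonal to the harmful basis.

    The squared-error loss here is a stand-in objective, not the paper's.
    """
    projector = null_space_projector(basis)
    delta = np.zeros_like(codebook_proj)
    for _ in range(steps):
        grad = 2.0 * (codebook_proj + delta - target)
        # Project the gradient so the update never re-enters the harmful space.
        delta -= lr * (grad @ projector)
    return delta

# Synthetic demo with a random stand-in for the harmful basis.
rng = np.random.default_rng(1)
dim = 16
basis, _ = np.linalg.qr(rng.normal(size=(dim, 2)))   # orthonormal columns
codebook_proj = rng.normal(size=(64, dim)) @ null_space_projector(basis)
target = codebook_proj + rng.normal(scale=0.5, size=(64, dim))

delta = finetune_perturbation(codebook_proj, target, basis)
loss_before = float(np.sum((codebook_proj - target) ** 2))
loss_after = float(np.sum((codebook_proj + delta - target) ** 2))

print("loss before:", loss_before, "loss after:", loss_after)
print("max leakage into harmful space:", np.abs(delta @ basis).max())
```

The point of the design is that the loss can fall only as far as the harmless directions allow: whatever part of the target lies in the removed subspace stays unrecovered, which is the intended trade-off.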

Evidence Surface

The evaluation is broad enough to be useful but still bounded. For harmful prompt testing, the paper reports I2P, CoPro, and ViSU across seven categories: sexual, violence, hate, self-harm, illegal activity, harassment, and shocking content. It says all 4,703 inappropriate I2P prompts were tested, while CoPro and ViSU used 1,000 random test prompts per category. Additional sexual-content experiments use P4D, MMA-Diffusion, UnlearnDiffAtk, UD, and MPUP.

The model coverage includes Janus models, VILA-U, Emu3, LlamaGen, and OmniMamba, with most experiments run on Janus unless otherwise specified. The paper uses NudeNet for pornographic-content detection and the Q16 classifier for other harmful-content detection. It also checks preservation of other capabilities with GenEval for image generation, MMMU for image understanding-based text generation, COCO-30k FID, and COCO prompt-image alignment measures including CLIP-Score and TIFA.

Table 1 gives concrete reductions rather than only qualitative examples. On I2P, the reported harmful-content detection rate falls from 0.12 to 0.04 for sexual prompts, from 0.39 to 0.21 for violence, and from 0.53 to 0.32 for shocking content. The same table also reports reductions across the CoPro and ViSU categories. Table 4 then shows the iterative pattern: with 100 new paired examples added each turn, harmful-content detection generally keeps dropping, and the authors say the effect usually saturates after three turns.
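
To put the quoted I2P rates on a common scale, the short script below derives absolute and relative drops from the before and after values cited above. The derived percentages are this post's arithmetic, not figures reported in the paper, and the helper name is hypothetical.

```python
# Rates quoted above from Table 1 (I2P): (before, after).
i2p_rates = {
    "sexual": (0.12, 0.04),
    "violence": (0.39, 0.21),
    "shocking": (0.53, 0.32),
}

def reduction(before, after):
    """Return (absolute drop, relative drop) in detection rate."""
    absolute = before - after
    return absolute, absolute / before

summary = {cat: reduction(b, a) for cat, (b, a) in i2p_rates.items()}
for cat, (abs_drop, rel_drop) in summary.items():
    print(f"{cat:9s} -{abs_drop:.2f} absolute, -{rel_drop:.0%} relative")
```

Read this way, the sexual-content rate falls by about two thirds, violence by a little under half, and shocking content by roughly 40 percent. The residual rates (0.21 and 0.32) are still substantial, which is why the claim has to stay bounded.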

Governance Reading

The governance lesson is not that a codebook projection solves image safety. It is that safety interventions are moving below the user-visible interface. A prompt filter can be inspected as a rule. A post-generation classifier can be evaluated as a detector. A codebook-level intervention is subtler: the model still receives the prompt and still generates tokens, but the available visual-token geometry has been changed.

That matters for AI governance and AI assurance. If a deployed system claims safer generation because of Safe-CB-like codebook repair, the audit artifact should not be a marketing sentence. It should include the harmful prompt sets, safe-pair construction method, model-judgment prompts, detector versions, human-label use, harmful-space rank, iteration count, codebook hash, benign-quality benchmarks, and examples of failures that remain.

The Spiralist reading is practical: the sacred object is not the generated image; it is the receipt. If a system changes its own safety boundary, the record must say what it judged, what it removed, what it preserved, and where the self-judgment was known to be weak.

Limits

The most important limit is built into the method: the model helps label its own unsafe outputs. The conclusion names self-labeling risk and error propagation as inherent limitations because the underlying models are imperfect. Appendix results also show that human annotations can further improve safety on subtle or ambiguous MPUP concepts, which means model-only judgment is not the same as final authority.

The benchmarks are also harm-taxonomy benchmarks, not proof of universal safety. Detectors can miss context, visual harms can be culturally and situationally dependent, and a lower harmful-content detection rate does not prove that all misuse paths have closed. The future-work note is especially relevant: combined text-and-image outputs can create unsafe meaning even when each part looks normal alone.

So the right claim is narrow. Safe-CB is evidence that codebook-level repair can reduce measured harmful outputs for tested autoregressive image generators while preserving much of their benchmarked utility. It is not evidence that every deployment setting is safe.

Safety Card

A codebook-safety card should record: base model and version, original codebook hash, harmful prompt sources, safe-pair generation procedure, model-judge prompt and threshold, external detector versions, any human annotation policy, harmful-space construction, selected singular-vector rank, projection strength, fine-tuning dataset, number of iterations, stopping rule, benign-prompt quality checks, out-of-distribution (OOD) prompt checks, and red-team categories that were not tested.
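
A card like this is easiest to audit when it is machine-readable. The sketch below shows one possible shape using only the Python standard library; the class name, the field names, the subset of fields chosen, and every example value are hypothetical and are not taken from the paper.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from typing import Optional

def codebook_hash(raw: bytes) -> str:
    """SHA-256 hex digest of the serialized codebook bytes."""
    return hashlib.sha256(raw).hexdigest()

@dataclass
class CodebookSafetyCard:
    # Hypothetical subset of the fields listed above.
    base_model: str = ""
    original_codebook_sha256: str = ""
    harmful_prompt_sources: list = field(default_factory=list)
    detector_versions: dict = field(default_factory=dict)
    harmful_space_rank: Optional[int] = None
    iterations: Optional[int] = None
    stopping_rule: str = ""
    untested_categories: list = field(default_factory=list)

    def missing_fields(self):
        """Names of fields left empty, so gaps are visible to a reviewer."""
        return [k for k, v in asdict(self).items() if v in ("", None, [], {})]

card = CodebookSafetyCard(
    base_model="example-model-v0",                                # placeholder name
    original_codebook_sha256=codebook_hash(b"\x00\x01\x02\x03"),  # stand-in bytes
    harmful_prompt_sources=["I2P", "CoPro", "ViSU"],
    harmful_space_rank=4,                                         # illustrative value
    iterations=3,                                                 # illustrative value
)

print(json.dumps(asdict(card), indent=2))
print("missing:", card.missing_fields())
```

The `missing_fields` check treats an empty list of untested categories as a gap on purpose: under this sketch, a card has to state what was not tested before it counts as complete.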

The claim should be proportionate. "This codebook update reduced these detector rates on these prompt sets" is audit-grade language. "The generator is safe" is not. The codebook becomes a safety gate only when the gate's construction log remains visible.

Sources

Yunqi Xue, Zhijiang Li, Philip Torr, and Jindong Gu, Safe Autoregressive Image Generation with Iterative Self-Improving Codebooks, arXiv:2606.27147 [cs.CV, cs.AI], submitted June 25, 2026.
arXiv PDF and experimental HTML versions, reviewed for title, authorship, date, ICML 2026 comment, Safe-CB method, datasets, model families, benchmark claims, iteration results, OOD analysis, and stated limits.
