Wiki · Concept · Last reviewed May 19, 2026

Masked Autoencoders

Masked autoencoders are self-supervised models that learn by hiding part of an input and training the system to reconstruct what was removed.

Definition

An autoencoder compresses input into an internal representation and reconstructs the input from that representation. A masked autoencoder makes the task harder by hiding parts of the input first.

In language, masked prediction appears in models such as BERT. In vision, masked image modeling hides patches of an image and trains a vision transformer or similar architecture to reconstruct the missing patches.

Masked Image Modeling

The influential MAE work from He et al. showed that masking a high proportion of image patches can create an efficient self-supervised training task for vision transformers. The model learns visual structure by reconstructing missing patches from visible context.

This differs from supervised classification, where labels are provided by humans. It also differs from contrastive methods, where the model compares paired views. The masked approach uses absence as the training signal.

Contrast With Joint Embedding

Masked autoencoders reconstruct data. JEPA-style systems usually predict representations rather than raw pixels. That distinction matters because raw reconstruction can spend capacity on details that are visually plausible but not causally important.

Both approaches are part of the broader self-supervised learning landscape. They answer the same problem from different directions: how can a model learn useful structure without relying on human labels for every example?

Governance Questions

Masked reconstruction can learn strong visual representations from large unlabeled datasets. That raises familiar questions about dataset provenance, consent, bias, domain failure, and downstream use.

Because reconstruction feels intuitive to humans, it can also invite overtrust. A model that fills in missing content plausibly has not necessarily understood the scene, the context, or the consequences of action.

Sources

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv, 2018.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, and Ross Girshick, "Masked Autoencoders Are Scalable Vision Learners", arXiv, 2021.
Hangbo Bao, Li Dong, and Furu Wei, "BEiT: BERT Pre-Training of Image Transformers", arXiv, 2021.

Return to Wiki