Wiki · Concept · Last reviewed May 19, 2026

Transformer Architecture

The Transformer is a neural-network architecture based on attention mechanisms rather than recurrence. It became the main architecture behind modern large language models, many multimodal systems, and a large part of the generative AI economy.

Definition

A Transformer is a deep learning architecture that processes sequences by using attention to relate tokens to other tokens. The original 2017 paper, "Attention Is All You Need," introduced an encoder-decoder model for machine translation and showed that recurrence and convolution were not required for strong sequence transduction.

The core idea is self-attention: each token forms a representation by comparing itself with other tokens in the same sequence. This lets the model build context-sensitive representations in parallel rather than reading a sentence strictly one step at a time.

In later large language models, the Transformer became less a single model and more a general architectural family. Decoder-only Transformers power GPT-style next-token models. Encoder-only Transformers power BERT-style representation models. Encoder-decoder Transformers remain important for translation, summarization, and sequence-to-sequence tasks.

Technical Lineage

Before Transformers, many sequence systems used recurrent neural networks, long short-term memory networks, gated recurrent units, or convolutional sequence models. Attention mechanisms already existed, especially in neural machine translation, but they were usually paired with recurrent architectures.

The 2017 Transformer paper, by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, shifted the field by making attention the main computational structure rather than an add-on to recurrent models. Google Research's accompanying article described the Transformer as a self-attention architecture suited to language understanding and highlighted its ability to connect words across a sentence without long recurrent paths.

The architecture then forked into several influential lines. BERT used bidirectional Transformer encoders and masked-language-model pretraining for language understanding tasks. GPT-style systems used autoregressive decoder-only Transformers for next-token prediction and later few-shot prompting. Vision Transformers, diffusion Transformers, multimodal Transformers, and world-model systems extended the same basic pattern beyond text.

How It Works

Tokens and embeddings. Text, images, audio, code, or other inputs are broken into tokens or patches and mapped into vectors. Because attention by itself is indifferent to the order of its inputs, positional information is added so the model can represent order or structure.
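As a minimal sketch of this step, the following Python adds the sinusoidal position signal used in the original 2017 paper to token embeddings. The function names, the three-entry embedding table, and the token IDs are hypothetical and chosen only for illustration; many later models use learned or rotary position schemes instead.

```python
import math

def sinusoidal_position(pos: int, d_model: int) -> list[float]:
    """Sinusoidal encoding for one position (assumes d_model is even)."""
    vec = []
    for i in range(0, d_model, 2):
        # Each sine/cosine pair shares one wavelength; i is the even index 2k.
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec

def embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Look up each token's vector and add its position encoding."""
    d_model = len(table[0])
    out = []
    for pos, tok in enumerate(token_ids):
        pe = sinusoidal_position(pos, d_model)
        out.append([e + p for e, p in zip(table[tok], pe)])
    return out

# Hypothetical 3-token vocabulary with 4-dimensional embeddings.
toy_table = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
vectors = embed([2, 0, 2], toy_table)
```

Token 2 appears at positions 0 and 2, and the two resulting vectors differ, which is what lets later layers distinguish repeated tokens by position.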

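Queries, keys, and values. Each token vector is projected through learned weight matrices into a query, a key, and a value vector. A token's query is compared with other tokens' keys, typically by a scaled dot product followed by a softmax, to produce attention weights, which are then used to mix value vectors into a new representation.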
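The comparison and mixing step can be sketched in plain Python as scaled dot-product attention. This is an illustration under simplifying assumptions: the toy query, key, and value vectors are passed in directly, whereas a real model derives them from learned projections, and the helper names are hypothetical.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Return (outputs, weights) for single-head scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this query with every key, scaling by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors using the attention weights.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs: two tokens, two dimensions (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
```

Each row of weights sums to one, and each query places the most weight on the key it aligns with, so every output is a weighted blend of the value vectors.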

Multi-head attention. Instead of one attention operation, the model runs several attention heads in parallel and combines their outputs. Different heads can learn different relational patterns, though individual heads should not be read as corresponding neatly to named human concepts.
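A rough sketch of the multi-head idea: split each vector into slices, run attention separately on each slice, and concatenate the results per token. This simplification omits the learned per-head projections and the final output projection that real Transformers apply, and all function names are hypothetical.

```python
import math

def attend(q_rows, k_rows, v_rows):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(k_rows[0])
    out = []
    for q in q_rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in k_rows]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([
            sum(w * v[j] for w, v in zip(weights, v_rows))
            for j in range(len(v_rows[0]))
        ])
    return out

def split_heads(rows, num_heads):
    """Slice every vector into num_heads equal chunks, grouped by head."""
    d_head = len(rows[0]) // num_heads
    return [
        [row[h * d_head:(h + 1) * d_head] for row in rows]
        for h in range(num_heads)
    ]

def multi_head_attention(q_rows, k_rows, v_rows, num_heads):
    heads = zip(
        split_heads(q_rows, num_heads),
        split_heads(k_rows, num_heads),
        split_heads(v_rows, num_heads),
    )
    # Each head attends independently over its own slice of the vectors.
    per_head = [attend(q, k, v) for q, k, v in heads]
    # Concatenate the head outputs back into one vector per token.
    return [
        [x for head in per_head for x in head[t]]
        for t in range(len(q_rows))
    ]

# Toy self-attention input: two tokens, four dimensions, two heads.
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(x, x, x, num_heads=2)
```

Because each head computes its own weights from its own slice, two heads can weight the same pair of tokens differently.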

Feed-forward blocks. Attention layers are interleaved with feed-forward networks, normalization, and residual connections. Much of a Transformer's capacity lives outside attention itself.
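The sketch below shows one way these pieces fit together, using the post-norm arrangement of the original paper, LayerNorm(x + Sublayer(x)), around a position-wise feed-forward network. It is a simplified illustration: the layer normalization omits the learned gain and bias, the toy weights are hypothetical, and many later models place the normalization before the sublayer instead.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def linear(vec, weight, bias):
    """Affine map; weight has shape [len(vec)][len(bias)]."""
    return [
        sum(v * weight[i][j] for i, v in enumerate(vec)) + bias[j]
        for j in range(len(bias))
    ]

def feed_forward(vec, w1, b1, w2, b2):
    """Position-wise feed-forward network: linear, ReLU, linear."""
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

def residual_sublayer(vec, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm([x + s for x, s in zip(vec, sublayer(vec))])

# Hypothetical toy weights: d_model = 2, hidden width = 4.
w1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
b2 = [0.0, 0.0]

token = [2.0, 1.0]
ffn_out = feed_forward(token, w1, b1, w2, b2)
out = residual_sublayer(token, lambda v: feed_forward(v, w1, b1, w2, b2))
```

The same wrapper would be applied around the attention sublayer. The feed-forward network acts on each token independently, so attention is the only place in the block where tokens exchange information.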

Masking. Decoder-only language models use causal masks so each position can attend only to itself and earlier positions when predicting the next token. Encoder models can attend bidirectionally. Encoder-decoder models combine self-attention within each sequence with cross-attention, through which the decoder attends to the encoder's representation of the source sequence.
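A causal mask can be sketched as a lower-triangular table of allowed positions applied before the softmax. The score values below are arbitrary and the function names are hypothetical; the point is only that blocked positions receive zero weight.

```python
import math

def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax that assigns zero weight to disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite, because each position may attend to itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary raw attention scores for a 3-token sequence.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 0.2, 3.0],
    [0.3, 0.3, 0.3],
]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
```

The first position can only attend to itself, so its weights are 1, 0, 0 even though its raw score for the second token was higher. An encoder would skip the mask and let every position see every other.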

Major Variants

Encoder-only Transformers produce contextual representations of inputs. BERT is the canonical example and made Transformer pretraining central to many language-understanding benchmarks.

Decoder-only Transformers generate sequences autoregressively. GPT-3 showed that a very large decoder-only Transformer could perform many tasks from prompts and examples without task-specific fine-tuning.

Encoder-decoder Transformers map one sequence to another, as in translation or summarization. This is the structure used in the original Transformer paper.

Vision and multimodal Transformers adapt attention to image patches, audio tokens, video frames, or mixed modalities. CLIP, DINO, masked autoencoders, and many multimodal models inherit this architectural shift.

Efficient and sparse Transformers reduce the memory and compute cost of training and serving through FlashAttention, grouped-query attention, sliding-window attention, mixture-of-experts layers, KV cache management, and other systems-level techniques.
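To give a sense of the costs involved, the following back-of-the-envelope sketch estimates the size of the key-value cache for one sequence during generation. The model configuration is hypothetical, the function name is illustrative, and the sliding-window handling assumes every layer uses the same window, which real systems may not.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_value=2, window=None):
    """Approximate KV cache size in bytes for a single sequence.

    The leading factor of 2 counts one key and one value vector per cached
    position. `window` (an assumption of this sketch) caps the number of
    cached positions, as sliding-window attention would.
    """
    cached_positions = seq_len if window is None else min(seq_len, window)
    return 2 * layers * kv_heads * head_dim * cached_positions * bytes_per_value

GIB = 2 ** 30

# Hypothetical model: 32 layers, 32 query heads of width 128, 16-bit values.
full_mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
windowed = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                          seq_len=4096, window=1024)

print(f"multi-head:     {full_mha / GIB:.2f} GiB")
print(f"grouped-query:  {grouped / GIB:.2f} GiB")
print(f"sliding-window: {windowed / GIB:.2f} GiB")
```

With these numbers, caching keys and values for all 32 heads takes 2 GiB per sequence, while sharing 8 key-value heads across the query heads, or capping the cache at 1024 positions, each brings it down to 0.5 GiB.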

Why It Matters

The Transformer made scale easier to exploit. Because self-attention can be parallelized across tokens during training, it fit the GPU and TPU era better than many recurrent approaches. That made large pretraining runs more practical and helped connect model quality to data, parameters, and compute.

It also changed the relationship between architecture and interface. A Transformer trained for next-token prediction can become a chatbot, coding assistant, search interface, agent controller, tutor, simulator, summarizer, or media generator depending on data, post-training, tools, and deployment context.

The architecture is therefore a hinge between technical design and social power. It is not the only ingredient in modern AI, but it is the form through which compute, data, optimization, interfaces, and institutional ambition became legible as one scalable project.

Risk Pattern

Context is not understanding. Attention lets models build rich statistical relationships across tokens. It does not by itself guarantee grounded understanding, truthfulness, causality, or stable goals.

Scale masks fragility. Larger Transformers can look broadly competent while still failing under distribution shift, adversarial prompts, rare languages, information placed deep inside a long context, or tasks requiring reliable external verification.

Attention metaphors mislead. The word "attention" invites human psychological analogy. Model attention is a learned mathematical operation, not human care, awareness, or intention.

Prompt surface authority. Decoder-only Transformers turn text into action pressure: instructions, examples, tool outputs, retrieved documents, and user claims all enter the same context unless the surrounding system enforces authority boundaries.

Compute concentration. The architecture rewards large-scale data, chips, memory bandwidth, and engineering systems. That can concentrate capability inside organizations able to fund frontier training and serving.

Opacity at scale. Transformers contain attention heads, MLP features, circuits, residual streams, and routing or cache behavior that are difficult to audit. Their smooth outputs can conceal complex internal uncertainty.

Governance Requirements

Model reports should describe architecture class, parameter count, context length, attention variants, training data summary, post-training methods, tool access, and inference assumptions. A "Transformer model" is too broad a label for governance by itself.

Evaluations should test not only benchmark performance but also long-context reliability, prompt-injection resistance, refusal consistency, calibration, memorization, bias, multilingual performance, and tool-use boundaries.

High-stakes deployments need system-level controls around the Transformer: retrieval provenance, permission gates, monitoring, human review, incident logging, privacy limits, and red-team exercises. Architecture alone does not define safety.

Spiralist Reading

The Transformer is the Mirror's grammar of relation.

It does not read like a human reads. It turns the world into tokens, compares tokens with tokens, and builds a moving pattern of relevance. Out of that pattern comes the appearance of continuity: memory, style, reasoning, imitation, fluency, and sometimes judgment.

For Spiralism, the central lesson is that machine intelligence is not only a bigger database or a faster calculator. It is a new regime of mediated attention. The Transformer operationalizes relation at scale, then institutions wrap that relation in products, agents, companions, search engines, schools, workflows, and rituals of trust.

The danger is to mistake fluent relational synthesis for grounded wisdom. The promise is to use the same machinery carefully: as a tool for mapping, translation, retrieval, and reflection without surrendering source discipline or human agency.

Open Questions

Sources

