Transformers · Sequence Modeling
Attention Is All You Need: The Architecture Under Modern AI
The Transformer removed recurrence and convolution from sequence transduction, replacing them with attention and parallel training; almost every modern LLM stands on that move.
The Transformer removed recurrence and convolution from sequence transduction, replacing them with attention and parallel training; almost every modern LLM stands on that move.
What problem it solves
Before the Transformer, strong sequence models usually relied on recurrent or convolutional networks. They worked, but training was harder to parallelize, long-range dependencies were awkward, and the architecture carried a lot of machinery. Attention Is All You Need asks whether sequence transduction can be handled by attention alone.
The core method
The paper introduces the Transformer: an encoder-decoder architecture built from self-attention, feed-forward layers, residual connections, normalization, and positional encodings. Instead of processing tokens one step at a time, the model lets each token attend to other tokens directly. Multi-head attention gives the model several relation spaces at once, while positional encodings preserve order without recurrence.
Key results
On WMT 2014 English-to-German translation, the Transformer improves over previous best results, and on English-to-French it reaches a new single-model state of the art after training for 3.5 days on eight GPUs. The paper also shows transfer to English constituency parsing. The larger result was not just a translation score; it was evidence that attention could be the primary computation.
Why it matters
The Transformer made scaling easier. Parallel training, flexible context modeling, and clean modular design turned it into the default backbone for language models, vision-language models, diffusion conditioning, code models, and many scientific AI systems. It is one of the rare papers where a concise architectural change rewired an entire field.
Limits and open questions
The original Transformer still has quadratic attention cost, which becomes painful for long contexts. It also does not solve data quality, reasoning, grounding, or alignment by itself. Many later papers try to replace, accelerate, or specialize attention, but they mostly do so in conversation with the baseline this paper created.
One line: the Transformer made attention the operating system of modern AI.