Imagine reading a mystery novel — but someone has taped over every other chapter. You can follow the plot forward, but you never know what came before a clue was dropped. That's roughly how language models worked before October 2018.
Then BERT walked in, ripped off the tape, and said: "What if we read the whole book at once?"
In a single paper, Jacob Devlin and colleagues at Google introduced a model that could look left AND right simultaneously — and it didn't just beat the previous state of the art. It obliterated it across eleven NLP tasks, including the GLUE benchmark, SQuAD v1.1, and SWAG. Let's break down exactly how, one interactive demo at a time.
I. The Problem: One-Way Reading
Before BERT, the dominant paradigm for pre-training language models was left-to-right. Models like OpenAI's GPT (June 2018) would read a sentence one token at a time, always predicting the next word based only on what came before it.
This is called an autoregressive approach. It's powerful for generation — great for finishing your sentences — but it has a blind spot. When GPT encodes the word "bank" in "I went to the bank to deposit my check," it only sees "I went to the" (plus "bank" itself) at the time it processes "bank." It never sees the deposit or the check, the very words that distinguish a financial bank from a riverbank.
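You can see the blind spot directly in the attention arithmetic. Below is a minimal sketch of causal (left-to-right) masking over toy random embeddings — not GPT's real weights or tokenizer, just the masking mechanism: position i may attend only to positions j ≤ i, so "bank" gets exactly zero weight on "deposit" and "check".

```python
import numpy as np

tokens = ["I", "went", "to", "the", "bank", "to", "deposit", "my", "check"]
n = len(tokens)

rng = np.random.default_rng(0)
x = rng.normal(size=(n, 8))                 # toy embeddings, stand-ins for learned ones

# Causal mask: lower-triangular, so position i sees only positions j <= i.
mask = np.tril(np.ones((n, n), dtype=bool))

scores = x @ x.T / np.sqrt(8)               # scaled dot-product attention scores
scores = np.where(mask, scores, -np.inf)    # block all future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

bank = tokens.index("bank")
# Under the causal mask, "bank" places zero attention on anything after it.
print(weights[bank, tokens.index("deposit")], weights[bank, tokens.index("check")])
```

Every row still sums to 1 — the model just renormalizes over the past, which is exactly why the encoding of "bank" cannot be informed by what follows it.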
ELMo (Peters et al., 2018) tried to fix this by training two separate LSTMs — one left-to-right, one right-to-left — and concatenating their representations. Better, but those two directions never truly interacted during encoding. It was like two people reading the same book from opposite ends and comparing notes afterward.
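The "comparing notes afterward" structure is easy to sketch. The snippet below is a deliberately simplified stand-in — a running mean plays the role of each LSTM (real ELMo uses multi-layer LSTMs over character convolutions) — but it shows the key property: the forward states depend only on the left context, the backward states only on the right, and the two are merely concatenated at the end.

```python
import numpy as np

def run_lstm_like(xs):
    """Stand-in for a recurrent encoder: state i is the running mean of xs[:i+1]."""
    h, states = np.zeros_like(xs[0]), []
    for i, x in enumerate(xs, start=1):
        h = h + (x - h) / i
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 4))                  # 5 tokens, 4-dim toy embeddings

fwd = run_lstm_like(x)                       # left-to-right pass: fwd[i] sees x[:i+1]
bwd = run_lstm_like(x[::-1])[::-1]           # right-to-left pass: bwd[i] sees x[i:]
elmo = np.concatenate([fwd, bwd], axis=-1)   # (5, 8): directions never interact
```

Neither pass ever conditions on the other — the fusion happens only in the final concatenation, which is the limitation BERT removes.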
BERT's insight was devastatingly simple: let every token attend to every other token, in both directions, at every layer.
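In code, the change is almost embarrassingly small: drop the mask. The sketch below reuses the same toy setup as before (random embeddings, one unscaled attention head — an illustration, not BERT's trained model) and shows that without a causal mask, every token places nonzero attention on every other token, in both directions.

```python
import numpy as np

tokens = ["I", "went", "to", "the", "bank", "to", "deposit", "my", "check"]
rng = np.random.default_rng(0)
x = rng.normal(size=(len(tokens), 8))       # toy embeddings

# No mask at all: every position attends to every position, both directions.
scores = x @ x.T / np.sqrt(8)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

bank = tokens.index("bank")
# "bank" now draws on "deposit" and "check" as well as its left context.
print(weights[bank, tokens.index("deposit")] > 0,
      weights[bank, tokens.index("check")] > 0)
```

Of course, removing the mask breaks next-word prediction as a training objective — a token could trivially peek at the answer — which is precisely the problem BERT's masked language modeling objective was invented to solve.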