BERT
Masked language modeling and bidirectionality
Paper: BERT: Pre-training of Deep Bidirectional Transformers — Devlin et al., 2019
GPT-1 bet on the causal, generative objective. A few months later, Google’s BERT (Devlin et al., 2019) made the opposite bet — and on language understanding benchmarks, it won decisively. BERT matters to us not because it’s the line that became modern LLMs (it isn’t), but because it crisply illustrates the single most important fork in pre-training: what objective do you train on?
The objection BERT answered
A causal language model reads strictly left to right. That’s necessary for generation (you can’t condition on words you haven’t written yet), but it’s a handicap for understanding. To classify the sentiment of a sentence, you’d love to use the whole sentence, both directions, at every word.
BERT’s insight: if you’re not trying to generate, you don’t need the causal constraint. Drop it and use the encoder with full bidirectional attention, so every token sees every other token. But now you need a different objective, because a bidirectional model trained on next-token prediction would trivially cheat (each token could see itself through the layers).
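To make the difference concrete, here is a minimal sketch of the two attention patterns as boolean masks. The function names (`causal_mask`, `bidirectional_mask`, `render`) are illustrative and not taken from the paper or from any library.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    A causal model only lets a position look at itself and earlier positions.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """An encoder like BERT lets every position attend to every position."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join(" ".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(render(causal_mask(4)))
    print("bidirectional:")
    print(render(bidirectional_mask(4)))
```

The causal mask is lower-triangular, while the bidirectional one is all `x`. That second pattern is why next-token prediction stops being a meaningful task: the answer is already visible in the input.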
Masked language modeling
The replacement is the masked language model (MLM) objective, a denoising task. Hide a fraction of the tokens and train the model to reconstruct them from the surrounding context on both sides:
- 15% of tokens are selected for prediction.
- Of those, 80% are replaced with a special [MASK] token, 10% with a random token, and 10% are left unchanged. (This 80/10/10 split mitigates a mismatch between pre-training and fine-tuning: [MASK] never appears at fine-tuning time, so the model can’t rely on it.)
- The model predicts the originals, at the selected positions only, with the same cross-entropy loss we already know.
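The recipe above can be sketched in a few lines of Python. This is a simplified illustration, not BERT's actual data pipeline. It selects each token independently with probability 0.15, whereas a real implementation may pick a fixed number of positions per sequence. The names `mask_tokens` and `vocab` are hypothetical, and the toy vocabulary is made up.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere; only selected positions would contribute to the loss.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: input unchanged, no prediction here
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token as it is
    return inputs, labels


if __name__ == "__main__":
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    inputs, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(inputs)
    print(labels)
```

Note that a position left unchanged still has a label. The model cannot tell which visible tokens it will be asked about, which pushes it to keep a useful representation of every token rather than only of [MASK].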
BERT added a second objective, Next Sentence Prediction (NSP): given two sentences, predict whether the second really follows the first. It was meant to help with sentence-pair tasks. (Later work, notably RoBERTa, found NSP largely unnecessary, a nice reminder that not every component of a famous model turns out to matter.)
The widget above makes the contrast tangible. Toggle between the objectives and notice the fundamental trade:
The specifics
- Two sizes: BERT-Base (12 layers, hidden 768, 12 heads, 110M parameters, deliberately matched to GPT-1) and BERT-Large (24 layers, hidden 1024, 16 heads, 340M parameters). Feed-forward inner size is 4× the hidden size throughout (3072 for Base, 4096 for Large).
- WordPiece tokenization with a 30,000-token vocabulary.
- Pre-training data: BooksCorpus (800M words) plus English Wikipedia (2,500M words), about 3.3 billion words in total. As with GPT-1, the authors stressed using document-level text to preserve long contiguous passages.
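As a sanity check on the quoted sizes, here is a rough back-of-the-envelope parameter count. The bookkeeping rests on assumptions not stated above: 512 learned position embeddings, 2 segment embeddings, biases on every linear layer, a LayerNorm after the embeddings and after each sub-layer, and a final pooler layer. It leaves out the pre-training output heads. `bert_param_count` is an illustrative helper, not a library function.

```python
def bert_param_count(layers, hidden, vocab=30_000, max_positions=512, segments=2):
    """Approximate parameter count of a BERT-style encoder (assumptions above)."""
    ffn = 4 * hidden                # feed-forward inner size is 4x hidden
    layer_norm = 2 * hidden         # scale and bias
    # Token, position, and segment embeddings, plus one LayerNorm.
    embeddings = (vocab + max_positions + segments) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two linear layers (hidden -> ffn -> hidden), plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Assumed pooler: one hidden x hidden linear layer.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    for name, n_layers, n_hidden in [("BERT-Base", 12, 768), ("BERT-Large", 24, 1024)]:
        total = bert_param_count(n_layers, n_hidden)
        print(f"{name}: {total / 1e6:.1f}M parameters")
```

Under these assumptions the totals come out at about 109M and 335M, in line with the rounded 110M and 340M figures. The number of attention heads does not appear in the count, because the heads split the same projection matrices rather than adding new ones.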