Section 11

BERT

Masked language modeling and bidirectionality

Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019)

GPT-1 bet on the causal, generative objective. A few months later, Google’s BERT (Devlin et al., 2019) made the opposite bet — and on language understanding benchmarks, it won decisively. BERT matters to us not because it’s the line that became modern LLMs (it isn’t), but because it crisply illustrates the single most important fork in pre-training: what objective do you train on?

The objection BERT answered

A causal language model reads strictly left to right. That’s necessary for generation — you can’t condition on words you haven’t written yet — but it’s a handicap for understanding. To classify the sentiment of a sentence, you’d love to use the whole sentence, both directions, at every word.

BERT’s insight: if you’re not trying to generate, you don’t need the causal constraint. Drop it and use a Transformer encoder with full bidirectional attention, so every token sees every other token. But now you need a different objective, because a bidirectional model trained on next-token prediction could trivially cheat: the token to be predicted is already visible in the input, so each token could indirectly see itself through the layers.
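To make the cheating argument concrete, here is a minimal sketch of the two visibility patterns. The helper names (`causal_mask`, `bidirectional_mask`, `leaked_targets`) are illustrative assumptions, not taken from the paper or any library; masks are plain nested lists rather than tensors.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position, as in an encoder."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


def leaked_targets(mask):
    """Under next-token prediction, position i must predict token i + 1.
    Count the positions whose target token is directly visible to them."""
    n = len(mask)
    return sum(1 for i in range(n - 1) if mask[i][i + 1])


tokens = ["The", "quick", "brown", "fox"]
n = len(tokens)

print("causal:")
print(show(causal_mask(n)))
print("leaked targets:", leaked_targets(causal_mask(n)))

print("bidirectional:")
print(show(bidirectional_mask(n)))
print("leaked targets:", leaked_targets(bidirectional_mask(n)))
```

With the causal mask no position can see its own target. With the full mask every position can, which is why next-token prediction stops being a meaningful task once the causal constraint is dropped.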

Masked language modeling

The replacement is the masked language model (MLM) objective, a denoising task. Hide a fraction of the tokens and train the model to reconstruct them from the surrounding context on both sides:

15% of tokens are selected for prediction.
Of those, 80% are replaced with a special [MASK] token, 10% with a random token, and 10% are left unchanged. (This 80/10/10 split mitigates a mismatch between pre-training and fine-tuning: [MASK] never appears at fine-tuning time, so the model must not learn to depend on it.)
The model predicts the originals with the same cross-entropy loss we already know.
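The three steps above can be sketched in a few lines. This is a simplified illustration, not the authors' data pipeline: it selects each token independently with probability 0.15 (an assumption made for brevity), works on whitespace-split words instead of WordPiece tokens, and the name `mask_tokens` is hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels) for masked language modeling.

    labels[i] holds the original token at selected positions and None
    elsewhere; the cross-entropy loss is taken only where labels[i] is set.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected: input unchanged, no loss here
        labels[i] = tok  # the model must reconstruct the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token as it is
    return corrupted, labels


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))  # toy vocabulary, assumed for the example
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```

Note that the random-replacement branch can occasionally pick the original token again; for a realistic vocabulary size that is rare enough to ignore in a sketch.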

BERT added a second objective, Next Sentence Prediction (NSP) — given two sentences, predict whether the second really follows the first — to help with sentence-pair tasks. (Later work, notably RoBERTa, found NSP largely unnecessary, a nice reminder that not every component of a famous model turns out to matter.)
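As a rough sketch of how NSP training pairs can be built: half the time the second sentence is the true successor, half the time it is drawn at random. Drawing the negative from a different document is an assumption of this sketch, as are the function name `make_nsp_example` and the toy corpus; the label strings follow the paper's IsNext / NotNext naming.

```python
import random


def make_nsp_example(documents, rng):
    """Build one (sentence_a, sentence_b, label) triple.

    documents: list of documents, each a list of at least two sentences.
    Requires at least two documents so a negative can be sampled elsewhere.
    """
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)  # leave room for a following sentence
    sent_a = doc[i]
    if rng.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"
    # Negative example: a random sentence from some other document.
    others = [k for k in range(len(documents)) if k != doc_idx]
    sent_b = rng.choice(documents[rng.choice(others)])
    return sent_a, sent_b, "NotNext"


corpus = [
    ["The fox jumped.", "It cleared the fence.", "Then it ran off."],
    ["Rain fell all night.", "The river rose by morning."],
]
rng = random.Random(0)
for _ in range(3):
    print(make_nsp_example(corpus, rng))
```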

Three pre-training objectives, one sentence

What does each objective actually ask the model to predict?

[Interactive figure: the sentence “The quick brown fox jumps over the lazy dog” shown under each pre-training objective. Causal view: predicting position 5, the model sees “The quick brown fox” and the target is “jumps”.]

Left-to-right only: the model sees the left context and predicts the next token, while the future is hidden by the causal mask. Every position is a training example, all computed in parallel. This is the objective that became the LLM.

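The comparison makes the fundamental trade tangible: the causal objective conditions only on left context but gets a training signal from every position, while masked language modeling conditions on both sides but gets a training signal only from the 15% of tokens selected for prediction.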

The specifics

Two sizes: BERT-Base (12 layers, hidden 768, 12 heads, 110M parameters — deliberately matched to GPT-1) and BERT-Large (24 layers, hidden 1024, 16 heads, 340M parameters). Feed-forward inner size is $4H$ throughout.
WordPiece tokenization with a 30,000-token vocabulary.
Pre-training data: BooksCorpus (800M words) plus English Wikipedia (2,500M words), about 3.3 billion words in total. As with GPT-1, the authors stressed using document-level text to preserve long contiguous passages.
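The quoted parameter counts can be sanity-checked from the layer counts, hidden sizes, $4H$ feed-forward width, and 30,000-token vocabulary listed above. The sketch below makes several assumptions that are not stated in this section: 512 learned position embeddings, 2 segment embeddings, biases on every linear layer, two LayerNorms per block plus one after the embeddings, and a final pooler layer. It leaves out the pre-training output heads. The function name `bert_param_count` is illustrative.

```python
def bert_param_count(num_layers, hidden, vocab_size=30_000,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder."""
    ffn = 4 * hidden                      # feed-forward inner size is 4H
    layer_norm = 2 * hidden               # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base:,}")
print(f"BERT-Large: {large:,}")
```

Under these assumptions the totals come to roughly 109M and 335M, close to the quoted 110M and 340M. The number of attention heads does not enter the count, because the heads partition the same $H \times H$ projection matrices.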