Pre-training

Prompt Engineering

Mick McQuaid

University of Texas at Austin

11 Nov 2025

Week TWELVE

Why Pre-training?

So far, we’ve talked a lot about skills and applications, but not so much about fundamental concepts. Now let’s turn to Xiao and Zhu (2025) to learn about those concepts. If we only look at skills and applications, we may overlook why certain things happen. If we examine the fundamentals and see why things happen, we can better extrapolate from the examples to our future experiences.

Some prereqs

This chapter mentions several concepts that come from different domains:

vectors and matrices: columns (or rows) and arrays of numbers (linear algebra)

maximum likelihood estimation: finding parameter values that make observed data most likely under some assumed model (statistics)

gradient descent: an optimization algorithm that finds the minimum of a function by iteratively moving in the direction of the steepest decrease in the function’s value (calculus)

arg max \(f(x)\): the input \(x\) that produces the maximum value of a function, not the maximum value itself (applied mathematics)

one-hot representation: a vector of all zeroes except for a single one (linear algebra)
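Three of these prerequisites can be sketched in a few lines of Python. This is only an illustration: the function names, the three-word vocabulary, and the toy function \(f(x) = (x-3)^2\) are assumptions made for the example, not material from the chapter.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

def arg_max(values):
    """Return the position of the largest value, not the value itself."""
    return max(range(len(values)), key=lambda i: values[i])

def gradient_descent(grad, x, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

# Toy vocabulary; the words are made up for illustration.
vocab = ["the", "cat", "sat"]
print(one_hot(1, len(vocab)))      # [0, 1, 0]
print(arg_max([0.1, 0.7, 0.2]))    # 1

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x=0.0)
print(round(x_min, 4))             # 3.0
```

Note that arg_max returns 1 (the position), while the maximum value itself is 0.7.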

Pre-training

Xiao and Zhu (2025) cover a lot of ground but begin with pre-training before explaining the foundations of LLMs. They focus on NLP (natural language processing) tasks because these tasks are where many breakthroughs were made.

Note that they’re mostly talking about transformer models, which were introduced in the famous 2017 paper “Attention Is All You Need,” although pre-training itself is much older.

Outline

Kinds of pre-training
- Unsupervised pre-training
- Supervised pre-training
- Self-supervised pre-training

Self-supervised pre-training tasks
- decoder-only pre-training
- encoder-only pre-training
- encoder-decoder pre-training

Example: BERT

Labeled and unlabeled data

We assume that it’s harder to obtain labeled data than unlabeled data due to the effort involved in labeling. For example, if we want to classify tweets as positive, negative, or neutral, we can ask humans to label some (which constitutes supervision), train a model on that data, and then adapt the model to a related task, such as rating product reviews as positive, negative, or neutral. This exemplifies supervised pre-training.

Unsupervised pre-training occurs without a human in the first phase, but with humans (who supply labels) in the training phase that follows.

Self-supervised pre-training takes its training signal from the unlabeled text itself (without a human), which allows us to go to a second phase involving a human doing prompting or fine-tuning.

Two kinds of NLP models

Sequence encoding models

Sequence generation models

Sequence encoding models

Encodes a sequence of tokens (roughly words) as a vector (embedding)

Input to other models, such as classification systems

Sequence generation models

Given a context, generate a sequence of tokens

Context examples (not exhaustive)
- Preceding tokens in language modeling
- Source-language sequence in machine translation

Fine-tuning

Typical method for sequence encoding models

Happens after pre-training

Fine-tunes a pre-trained model for a given task, such as classification, using labeled data

Example: build a system with an encoder then a classifier (next frame)

Tokens are \(x_0, \ldots, x_4\)

Embeddings are \({\bf e}_0, \ldots, {\bf e}_4\)

Softmax converts raw results into probabilities
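The softmax step can be sketched as follows. The class labels and the raw scores are made up for the example; in a real system the scores would come from the classifier sitting on top of the encoder.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["positive", "negative", "neutral"]  # hypothetical classes
probs = softmax([2.0, 0.5, -1.0])             # made-up classifier scores
predicted = labels[probs.index(max(probs))]   # arg max over the classes
print(predicted, [round(p, 3) for p in probs])
```

Larger raw scores always get larger probabilities, so the predicted class is the one with the highest raw score.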

Prompting

Typical method for sequence generation models

Most prevalent example is the large language model

Simply predicts the next token given preceding tokens

Repeating this task over and over enables the model to learn general knowledge of languages

Can convert many NLP tasks into simple text generation through prompting

Zero-shot learning involves performing tasks not observed in training

Few-shot learning involves providing demonstrations: samples that show how input corresponds to output
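The difference between zero-shot and few-shot prompts can be sketched by building the prompt text. The function name, the "Text:" and "Sentiment:" layout, and the example sentences are all assumptions for illustration; no particular model or API is implied.

```python
def build_prompt(instruction, demonstrations, query):
    """Assemble a prompt; an empty demonstrations list gives a zero-shot prompt."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines += [f"Text: {text}", f"Sentiment: {label}", ""]
    lines += [f"Text: {query}", "Sentiment:"]
    return "\n".join(lines)

instruction = "Classify the sentiment as positive, negative, or neutral."
demos = [("I love this phone.", "positive"),
         ("The battery died in an hour.", "negative")]

zero_shot = build_prompt(instruction, [], "It arrived on Tuesday.")
few_shot = build_prompt(instruction, demos, "It arrived on Tuesday.")
print(few_shot)
```

Both prompts end with "Sentiment:", so a sequence generation model completes the task simply by predicting the next tokens.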

Self-supervised pre-training tasks

decoder-only pre-training

encoder-only pre-training

encoder-decoder pre-training

decoder-only pre-training

Used with LLMs

Predict the next token from previous tokens

Uses a formulation equivalent to maximum likelihood estimation

If the pre-training data is large enough, it acts as a gold standard distribution of probabilities of tokens at a given position

Can define a loss function between the gold standard and the model’s prediction
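One common way to write such a loss is cross-entropy, sketched below. The sketch assumes the gold standard at each position is a one-hot distribution on the token that actually came next, so the loss reduces to the negative log probability of that token. The predicted probabilities are made up.

```python
import math

def next_token_loss(predicted, gold_index):
    """Cross-entropy against a one-hot gold distribution: -log Pr(gold token)."""
    return -math.log(predicted[gold_index])

def sequence_loss(predictions, gold_indices):
    """Sum of per-position losses; minimizing it maximizes the likelihood."""
    return sum(next_token_loss(p, g) for p, g in zip(predictions, gold_indices))

# Made-up model outputs over a 3-token vocabulary at two positions.
predictions = [[0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]]
gold = [0, 2]  # indices of the tokens that actually came next
loss = sequence_loss(predictions, gold)
print(round(loss, 4))  # -log(0.7) - log(0.8)
```

The summed loss equals the negative log of the product of the gold-token probabilities, which is why minimizing it is equivalent to maximum likelihood estimation.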

encoder-only pre-training

Exemplified by BERT

Bidirectional, not good for autoregressive text prediction

Good for classification, named entity recognition

encoder-decoder pre-training

Original transformer technique

Uses encoder to process input bidirectionally

Uses decoder to generate output autoregressively

Masked language modeling

Basis for BERT

Create prediction challenges by masking some tokens and training the network to predict them

May have unmasked tokens on the left or right, so the model is bidirectional

Predictions are made based on both the left and right contexts

Two problems:
- Uses a special token, [MASK], that appears in training but not at test time
- Autoencoding overlooks dependencies between masked tokens

Permuted language modeling

Addresses above problems

Predictions can be made in any order

Can be easily implemented in transformer models due to self-attention mechanism

Self-attention computes how much each token should attend to each token (including itself) to determine meaning
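A minimal sketch of self-attention (scaled dot-product, single head) follows. It is simplified: real transformers first apply learned projections to obtain queries, keys, and values, while this sketch uses the same made-up token vectors for all three.

```python
import math

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for a single head, no masking."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token against every token, including itself.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Three made-up 2-dimensional token vectors, used as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Row \(i\) of the weights says how much token \(i\) attends to each token; each row sums to one.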

Notes on the next three frames

The notation \((i,j)\) means the cell in row \(i\), column \(j\) of a matrix

\(\textrm{Pr}(x_m|{\bf e}_n)\) means the probability of token \(x_m\) given the presence of embedding \({\bf e}_n\)

The three frames are shown in reverse order, from the most sophisticated (permuted—any order) to the least sophisticated (causal—left to right) with masked in between
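The three settings can also be sketched as visibility matrices, where cell \((i,j)\) is 1 when position \(i\) may attend to position \(j\). This is a simplified illustration, not the book's figures: the function names and the prediction order are hypothetical, and each position is assumed to be allowed to see itself.

```python
def causal_mask(n):
    """Cell (i, j) is 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def permuted_mask(order):
    """Like causal_mask, but 'earlier' is defined by a prediction order."""
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    return [[1 if rank[j] <= rank[i] else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Masked language modeling: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

# Hypothetical order: predict position 2 first, then 0, then 3, then 1.
for row in permuted_mask([2, 0, 3, 1]):
    print(row)
# [1, 0, 1, 0]
# [1, 1, 1, 1]
# [0, 0, 1, 0]
# [1, 0, 1, 1]
```

Under these assumptions the causal mask is the special case of the permuted mask with the order 0, 1, 2, and so on.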

Pre-training encoders as classifiers

Next sentence prediction, described in BERT paper
- Trained using actual consecutive sentences vs randomly sampled sentences

Applying classification-based supervision signals to each output of an encoder
- Uses a generator and a discriminator to first generate a new sequence then determine whether that sequence is altered from the original sequence
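Training examples for next sentence prediction could be built roughly as follows. The even split and the IsNext / NotNext labels follow the BERT paper. The function name and sentences are hypothetical, and sampling the random sentence from the same short text is a simplification.

```python
import random

def nsp_examples(sentences, rng):
    """Pair each sentence with its true successor or a randomly sampled sentence."""
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # Sample a sentence that is neither the first sentence nor its successor.
            candidates = [s for k, s in enumerate(sentences) if k not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), "NotNext"))
    return examples

sentences = ["I made tea.", "It was too hot.", "The bus was late.", "I walked instead."]
examples = nsp_examples(sentences, random.Random(0))
for first, second, label in examples:
    print(label, "|", first, "|", second)
```

The encoder is then trained as a binary classifier over these sentence pairs.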

Masked encoder-decoder pre-training

Many tasks can be cast as text-to-text with a source and target

Examples include translation, question answering, simplification, and scoring of translations

Encoder-decoder pre-training has been applied to language models

Also masked language modeling

Two approaches in following frame

Denoising training

Training encoder-decoders can be seen as training denoising autoencoders

Many methods to corrupt input include masking some input, altering some input, rearranging input

May replace each individual token with its own [MASK] token, or replace a whole span of tokens with a single [MASK] token

Span could actually have length zero

If the sequence is multiple sentences, we can reorder the sentences or rotate the entire document
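A few of these corruption methods can be sketched on a toy token list. The function names are hypothetical. In denoising training, the corrupted sequence would be the encoder input and the original sequence the decoder target.

```python
import random

MASK = "[MASK]"

def mask_span(tokens, start, length):
    """Replace tokens[start:start+length] with one [MASK]; length 0 inserts a mask."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def rotate(tokens, start):
    """Rotate the document so that it begins at position `start`."""
    return tokens[start:] + tokens[:start]

def reorder_sentences(sentences, rng):
    """Shuffle sentence order; the target is still the original order."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

tokens = "the cat sat on the mat".split()
print(mask_span(tokens, 1, 2))  # ['the', '[MASK]', 'on', 'the', 'mat']
print(mask_span(tokens, 3, 0))  # ['the', 'cat', 'sat', '[MASK]', 'on', 'the', 'mat']
print(rotate(tokens, 4))        # ['the', 'mat', 'the', 'cat', 'sat', 'on']
print(reorder_sentences(["A.", "B.", "C."], random.Random(0)))
```

The zero-length span shows why the model cannot assume that a [MASK] hides exactly one token.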

Comparing pre-training tasks based on training objectives

Language modeling

Masked language modeling

Permuted language modeling

Discriminative training

Denoising autoencoding

Following table illustrates these—colors based on objectives not models

BERT

Originally presented in Devlin et al. (2019)

Originally a transformer encoder trained using two tasks: (1) masked language modeling (MLM) and (2) next sentence prediction (NSP)

Loss used in training is the sum of the losses of the two tasks

Many variants proposed since then

Loss functions in BERT

Most BERT models represent a sentence or a sentence pair

Thus can handle many downstream language understanding tasks

Usually compute loss functions (MLM and NSP) separately

MLM

Original BERT randomly selects fifteen percent of input tokens

Of those, eighty percent are masked, ten percent are randomly replaced, and ten percent are left unchanged

Example in next frame (note that [CLS] is a start token and [SEP] is a token that ends sentences)
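The selection and 80/10/10 rule can be sketched as follows. One simplification is assumed: each token is selected independently with probability 0.15, rather than selecting a fixed fifteen percent of the sequence. The function name and toy vocabulary are hypothetical.

```python
import random

def bert_mask(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, {position: original token}) for MLM training."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        # Never select the special tokens; select others with probability select_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= select_prob:
            continue
        targets[i] = tok                      # the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80 percent: mask
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10 percent: random replacement
        # remaining 10 percent: leave the token unchanged
    return corrupted, targets

vocab = ["the", "cat", "dog", "sat", "mat", "on"]  # toy vocabulary
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, targets = bert_mask(tokens, vocab, random.Random(0))
print(corrupted)
print(targets)
```

The loss is computed only at the positions recorded in targets, including those whose token was left unchanged.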

BERT architecture

Standard transformer encoder

Input at each position is the sum of three embeddings: a token embedding \({\bf x}\) (one of about 30,000 vocabulary entries in original BERT), a positional embedding \({\bf e}_{\textrm{pos}}\) (where in the sequence the token occurs), and a segment embedding \({\bf e}_{\textrm{seg}}\) (which sentence the token belongs to)

Many transformer layers, each layer having a self-attention sub-layer and a feed-forward network sub-layer

Illustrated in next frame
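The input sum can be sketched with toy embedding tables. The sizes here are tiny and the values random purely for illustration; in BERT the tables are learned, and the base model uses 768 dimensions. All names are hypothetical.

```python
import random

def make_table(rows, dim, rng):
    """A randomly initialized embedding table; real tables are learned."""
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

def input_embeddings(token_ids, segment_ids, tok_table, pos_table, seg_table):
    """Sum token, positional, and segment embeddings at each position."""
    return [[t + p + s for t, p, s in zip(tok_table[tid], pos_table[pos], seg_table[sid])]
            for pos, (tid, sid) in enumerate(zip(token_ids, segment_ids))]

rng = random.Random(0)
dim = 4                                # toy size
tok_table = make_table(10, dim, rng)   # toy vocabulary of 10 tokens
pos_table = make_table(8, dim, rng)    # up to 8 positions
seg_table = make_table(2, dim, rng)    # sentence A or sentence B
token_ids = [1, 5, 2, 7]               # made-up token indices
segment_ids = [0, 0, 1, 1]             # first two tokens in A, last two in B
embedded = input_embeddings(token_ids, segment_ids, tok_table, pos_table, seg_table)
print(len(embedded), len(embedded[0]))  # 4 4
```

Because the three embeddings are added rather than concatenated, the input to the first transformer layer has the same dimension as a single embedding.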

Improving BERT

Two methods are common
- Increase training data and build larger models
- Increase number of parameters

Also common to try to increase efficiency so smaller models work as well
- Pruning (removing layers or parameters)
- Quantization (representing parameters as low-precision numbers)
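Both efficiency ideas can be sketched on a short list of made-up parameters. The quantization shown is one simple scheme among many (a single shared scale, symmetric around zero), and the pruning shown is magnitude pruning of individual parameters. Neither is claimed to be the method used by any specific BERT variant.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers with one shared scale (symmetric)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax  # assumes some weight is nonzero
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate floats from the integers."""
    return [q * scale for q in q_weights]

def prune(weights, threshold):
    """Magnitude pruning: zero out parameters with small absolute value."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.42, -1.27, 0.03, 0.88]  # made-up parameters
q, scale = quantize(weights)
print(q)                             # [42, -127, 3, 88]
print(dequantize(q, scale))
print(prune(weights, 0.1))           # [0.42, -1.27, 0.0, 0.88]
```

Each quantized parameter fits in one byte instead of four, at the cost of a rounding error of at most half the scale.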

Multi-lingual models

BERT originally focused on English

mBERT trains on 104 languages

Cross-lingual pre-training involves bilingual examples

Also called translation language modeling, illustrated in next frame

Applications of BERT-like models

Classification

Regression

Sequence modeling

Span prediction

Encoding portion of encoder-decoder models

Fine-tuning problem

Fine-tuning on different data may lead to catastrophic forgetting

Common solution is to include some old data in fine-tuning

Specialized solutions include experience replay and elastic weight consolidation, neither of which is covered in the book
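The common solution of including some old data can be sketched as a data-mixing step. The function name, the twenty percent share, and the placeholder examples are assumptions for illustration. How much old data to include is a tuning decision that this sketch does not settle.

```python
import random

def mix_for_fine_tuning(new_data, old_data, old_fraction, rng):
    """Add a sample of old examples so they make up about old_fraction of the mix."""
    n_old = round(len(new_data) * old_fraction / (1 - old_fraction))
    n_old = min(n_old, len(old_data))  # cannot sample more than we have
    mixed = new_data + rng.sample(old_data, n_old)
    rng.shuffle(mixed)
    return mixed

new_data = [("new", i) for i in range(80)]  # placeholder fine-tuning examples
old_data = [("old", i) for i in range(50)]  # placeholder earlier examples
mixed = mix_for_fine_tuning(new_data, old_data, 0.2, random.Random(0))
print(len(mixed), sum(1 for kind, _ in mixed if kind == "old"))  # 100 20
```

Seeing old examples during fine-tuning keeps the loss on the earlier data in play, which is what counteracts forgetting.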

Summary

General idea of pre-training

Specifically, self-supervised pre-training applied to three architectures: encoder-only, decoder-only, and encoder-decoder

Large scale pre-training enables LLMs, but we haven’t yet discussed LLMs

END

References

Xiao, Tong, and Jingbo Zhu. 2025. “Foundations of Large Language Models.” https://arxiv.org/abs/2501.09223.

Colophon

This slideshow was produced using Quarto

Fonts are Roboto, Roboto Light, and Victor Mono Nerd Font