Why Pre-training?
So far, we’ve talked a lot about skills and applications, but not so much about fundamental concepts. Now let’s turn our attention to Xiao and Zhu (2025) to learn those fundamentals. If we only look at skills and applications, we may overlook why certain things happen; if we examine the fundamentals and see why things happen, we can better extrapolate from these examples to our future experiences.
Some prereqs
This chapter mentions several concepts that come from different domains (a short sketch of a few of them follows this list):
- vectors and matrices: columns (or rows) of numbers, and rectangular arrays of numbers, respectively (linear algebra)
- maximum likelihood estimation: finding parameter values that make observed data most likely under some assumed model (statistics)
- gradient descent: an optimization algorithm that finds the minimum of a function by iteratively moving in the direction of the steepest decrease in the function’s value (calculus)
- arg max \(f(x)\): the input \(x\) that produces the maximum value of a function, not the maximum value itself (applied mathematics)
- one-hot representation: a vector that is all zeros except for a single 1 (linear algebra)
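To make a few of these concrete, here is a minimal NumPy sketch (my own illustration, not from the book) of a one-hot vector, arg max, and a single-variable gradient descent loop:

```python
import numpy as np

# One-hot representation: all zeros except a single 1 marking the token's position.
vocab = ["the", "cat", "sat"]
token_id = vocab.index("cat")
one_hot = np.zeros(len(vocab))
one_hot[token_id] = 1.0           # -> [0., 1., 0.]

# arg max: the *input* that maximizes a function, not the maximum value itself.
scores = np.array([0.1, 2.3, -0.5])
best_index = np.argmax(scores)    # -> 1 (the position), not 2.3 (the value)

# Gradient descent: step opposite the gradient to minimize f(x) = (x - 3)^2.
x, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (x - 3)            # derivative of (x - 3)^2
    x -= lr * grad                # move in the direction of steepest decrease
# x is now close to 3, the minimizer of f.
```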
Pre-training
Xiao and Zhu (2025) cover a lot of ground, beginning with pre-training before moving on to the other foundations of LLMs. They focus on NLP (natural language processing) tasks because those tasks are where many of the breakthroughs were made.
Note that they’re mostly talking about transformer models, which were introduced in the famous 2017 paper “Attention Is All You Need”, although pre-training itself is much older.
Outline
- Kinds of pre-training
  - Unsupervised pre-training
  - Self-supervised pre-training
- Self-supervised pre-training tasks
  - decoder-only pre-training
  - encoder-only pre-training
  - encoder-decoder pre-training
Labeled and unlabeled data
We assume that labeled data is harder to obtain than unlabeled data because of the effort involved in labeling. For example, if we want to classify tweets as positive, negative, or neutral, we can ask humans to label some tweets (this labeling constitutes the supervision), train a model on that labeled data, and then reuse the model for a related task, such as rating product reviews as positive, negative, or neutral. This exemplifies supervised pre-training.
In unsupervised pre-training, no human labels are used in the first (pre-training) phase; human-labeled data typically enters later, in a subsequent training (fine-tuning) phase.
Self-supervised pre-training also needs no human labels, because the supervision signal is derived from the data itself; it then allows a second phase in which a human adapts the model through training (fine-tuning) or prompting.
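To make the “self” in self-supervised concrete, the sketch below (my own illustration, using whitespace tokenization as a stand-in for a real tokenizer) builds (input, target) training pairs for next-token prediction directly from raw, unlabeled text; no human labeling is involved:

```python
# Build self-supervised next-token-prediction pairs from raw, unlabeled text.
# Whitespace tokenization here is a stand-in for a real subword tokenizer.
text = "the cat sat on the mat"
tokens = text.split()

pairs = []
for i in range(1, len(tokens)):
    context = tokens[:i]      # everything seen so far
    target = tokens[i]        # the token the model must learn to predict
    pairs.append((context, target))

# e.g. (["the"], "cat"), (["the", "cat"], "sat"), ...
# The targets come from the data itself, which is what makes this self-supervised.
```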
Two kinds of NLP models
- Sequence encoding models
- Sequence generation models

Sequence encoding models
- Encodes a sequence of tokens (roughly words) as a vector (embedding)
- Input to other models, such as classification systems
Sequence generation models
- Given a context, generate a sequence of tokens (see the sketch after this list)
- Context examples (not exhaustive)
- Preceding tokens in language modeling
- Source-language sequence in machine translation
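As a toy illustration of “given a context, generate a sequence of tokens”, here is a greedy decoding loop; `next_token_probs` is a hypothetical stand-in for a trained generation model, not anything from the book:

```python
import numpy as np

# Toy greedy decoding loop: repeatedly append the most probable next token
# until an end-of-sequence marker appears. `next_token_probs` is a
# hypothetical placeholder for a trained sequence generation model.
vocab = ["<eos>", "the", "cat", "sat", "on", "mat"]

def next_token_probs(context):
    # Placeholder: a real model would compute these probabilities from the context.
    rng = np.random.default_rng(len(context))
    p = rng.random(len(vocab))
    return p / p.sum()

context = ["the"]                        # e.g. preceding tokens in language modeling
while len(context) < 10:
    probs = next_token_probs(context)
    tok = vocab[int(np.argmax(probs))]   # greedy: pick the highest-probability token
    if tok == "<eos>":
        break
    context.append(tok)

print(context)                           # the sequence, grown one token at a time
```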
Fine-tuning
- Typical method for sequence encoding models
- Happens after pre-training
- Fine-tunes a pre-trained model for a given task, such as classification, using labeled data
- Example: build a system with an encoder followed by a classifier (see the sketch after this list)
- Tokens are \(x_0, \ldots, x_4\)
- Embeddings are \({\bf e}_0, \ldots, {\bf e}_4\)
- Softmax converts the raw scores (logits) into probabilities
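Below is a minimal sketch of that encoder-then-classifier setup (a toy NumPy illustration with made-up sizes, not the book’s figure): the token embeddings \({\bf e}_0, \ldots, {\bf e}_4\) are pooled into one vector, a linear classification head scores each class, and a softmax turns those scores into probabilities. Fine-tuning would update the head (and usually the encoder) on labeled data.

```python
import numpy as np

# Toy encoder-plus-classifier (hypothetical sizes; not the book's implementation).
d_model, num_classes = 8, 3
rng = np.random.default_rng(0)

# Stand-in for the pre-trained encoder's output: one embedding per token x_0..x_4.
embeddings = rng.normal(size=(5, d_model))   # e_0, ..., e_4

# Pool the token embeddings into a single sequence representation.
pooled = embeddings.mean(axis=0)

# Classification head added for fine-tuning (its weights are learned from labeled data).
W = rng.normal(size=(num_classes, d_model))
b = np.zeros(num_classes)
logits = W @ pooled + b                      # raw scores, one per class

# Softmax converts the raw scores into a probability distribution over classes.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs, probs.sum())                    # class probabilities summing to 1
```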
Prompting
- Typical method for sequence generation models
- Most prevalent example is the large language model
- Simply predicts the next token given preceding tokens
- Repeating this task over and over enables the model to learn general knowledge of language
- Can convert many NLP tasks into simple text generation through prompting
- Zero-shot learning involves performing tasks not observed in training
- Few-shot learning involves providing demonstrations: examples that show how inputs correspond to outputs (see the sketch after this list)
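To make the prompting idea concrete, here is a sketch of recasting sentiment classification as text generation, with zero-shot and few-shot variants; `generate` is a hypothetical stand-in for any LLM completion call (illustration only, not the book’s code):

```python
# Recasting sentiment classification as next-token generation via prompting.
# `generate(prompt)` is a hypothetical stand-in for an LLM completion call.

zero_shot_prompt = (
    "Classify the sentiment of the review as positive, negative, or neutral.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

few_shot_prompt = (
    "Review: I love this phone.\nSentiment: positive\n\n"        # demonstration 1
    "Review: It broke within a week.\nSentiment: negative\n\n"   # demonstration 2
    "Review: The battery died after two days.\nSentiment:"       # the actual query
)

# label = generate(few_shot_prompt)   # the model would complete with "negative"
```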