Principles: Evaluation

Prompt Engineering

Mick McQuaid

University of Texas at Austin

29 Jan 2025

Week Three

Agenda

  • Finish group assignments
  • Review syllabus
  • Review whatiknow
  • Review eA
  • Evaluation of LLMs
  • Announce eB

eA

Scoring

  • Average score: 4.7/5 (95%)
  • Grading was very lenient
  • But the honeymoon is over
  • Many people disregarded instructions, such as the required file names; I’ll take off points for these omissions in future assignments
  • I don’t regrade assignments; instead I make suggestions for improvement that you should implement so that the homework stays useful to you
  • Why don’t I regrade? Think about it!
  • I don’t regrade because everyone would eventually get a 5/5

Observations

  • One person did a single prompt
  • One person did five prompts
  • Most did two or three prompts
  • Some people did not include an intro or conclusion
  • Most people did not mention the exact model used

Exemplary version

Let’s pause to look at an exemplary version of the assignment. Notice the detailed commentary and conclusion.

Requirements

  • Notice that I gave you very few requirements
  • I did not give you a list of detailed instructions to follow
  • I will typically underspecify the assignment and let you be adventurous
  • What happens if you just do the minimum?
  • Can you be replaced by genAI?
  • You have to be creative and adventurous to survive in the workplace

Observations

  • You must iterate
  • You must be specific
  • There will be drift (the model forgets what you told it earlier)
  • Giving the model my instructions is not enough!
  • You have to be creative and analyze the output
  • Feedback helps
  • You can control parameters
  • The model has limitations to discover

The best submissions

  • Capitalized names
  • Gave actual dates
  • Arranged results in chronological order (adding alphabetical order as well would have been good)
  • Formatted results in markdown format (but could have incorporated markdown output into the body of the document)

News

Deepseek R1

  • Deepseek is a Chinese startup founded by Liang Wenfeng
  • They rocked the US stock markets this week by releasing a new LLM called R1
  • R1 is claimed to be competitive with OpenAI’s frontier models such as GPT-4o and o1, but at a fraction of the cost to develop
  • They have open-sourced the code, weights, and data
  • They are offering R1 free to use, including for local installation
  • My former student is running it on his MacBook and wants to try it on some version of Raspberry Pi
  • Why should this matter to stock prices? ⟨Discuss⟩

Evaluation of LLMs

Another way to discover genuine principles is to evaluate LLMs. What qualities are we looking for in an LLM?

Generic Qualities

  • Fluency
  • Relevance
  • Coherence
  • Perplexity (how well a probability model predicts a sample; see the sketch after this list)
  • Overlap of n-grams in translations (automating human judgments)
  • Benchmark task performance (e.g., resolving ambiguous pronouns, demonstrating reading comprehension)
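
Two of these, perplexity and n-gram overlap, can be made concrete with a toy sketch. This is a minimal illustration with made-up token probabilities and word counts, not how any production metric library computes them:

    import math
    from collections import Counter

    # Perplexity: exp of the average negative log-probability the model
    # assigns to the tokens of a sample (lower is better).
    def perplexity(token_probs):
        nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(nll)

    # Unigram-overlap precision, the core idea behind BLEU/ROUGE-style
    # metrics: what fraction of candidate words also appear in the reference?
    def unigram_precision(candidate, reference):
        cand, ref = Counter(candidate.split()), Counter(reference.split())
        overlap = sum(min(count, ref[word]) for word, count in cand.items())
        return overlap / max(1, sum(cand.values()))

    print(perplexity([0.25, 0.5, 0.1, 0.4]))   # toy per-token probabilities
    print(unigram_precision("the cat sat", "the cat sat on the mat"))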

van Schaik and Pugh (2024)

Automating evaluation of LLMs

Desiderata

  • Scalable
  • Automatic
  • Reliable (LLMs are not reliable for evaluating LLMs)
  • Cost-effective
  • Considers new issues such as hallucinations

Issues

  • Hallucinations
  • Knowledge recency
  • Reasoning inconsistency
  • Difficulty in computational reasoning

Different communities evaluate different criteria

  • Security and Responsible AI
  • Computing performance
  • Retrieval vs Generator Evaluation
  • Offline vs Online Evaluation
  • System Evaluation vs Model Evaluation

Focus

van Schaik focuses on automatic, offline, system-level evaluation of generative AI text: methods for evaluating quality of summaries

Qualities van Schaik addresses

  • Fluency
  • Coherence
  • Relevance
  • Factual consistency
  • Fairness
  • Similarity to reference text

Three kinds of metrics

  • Reference-based
  • Reference-free (Context-based)
  • LLM-based

Reference-based metrics

  • N-gram based metrics
  • Embedding-based metrics
  • Both are simple, fast, inexpensive
  • Poor correlation with human judgments
  • Lack of interpretability
  • Inherent bias
  • Poor adaptability
  • Inability to capture subtle nuances
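
As a sketch of the embedding-based flavor, the snippet below scores a candidate against a reference by cosine similarity. The vectors are toy values standing in for a real sentence encoder (e.g., a BERT-style model), so the names and numbers are purely illustrative:

    import math

    # Stand-in vectors for a real sentence encoder; in practice the candidate
    # and the reference would be embedded with the same model.
    TOY_VECTORS = {
        "candidate summary": [0.8, 0.2, 0.3],
        "reference summary": [0.9, 0.1, 0.2],
        "unrelated text":    [0.0, 0.9, 0.1],
    }

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms

    def embedding_score(candidate, reference):
        # Reference-based: compare the model output against a gold reference.
        return cosine(TOY_VECTORS[candidate], TOY_VECTORS[reference])

    print(embedding_score("candidate summary", "reference summary"))   # close to 1
    print(embedding_score("unrelated text", "reference summary"))      # much lower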

Reference-free metrics (Context-based)

  • The evaluation is a score computed from the output and its source or context, with no gold reference required
  • Quality-based metrics
    • Based on context or source
  • Entailment-based metrics
    • Based on the Natural Language Inference (NLI) task
    • Determines whether the output entails, contradicts, or is neutral with respect to the premise (a sketch follows this list)
  • Factuality, QA, and QG-based metrics
    • Based on the QA (question answering) and QG (question generation) tasks
    • Determines whether output is factually correct
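
A sketch of an entailment-based check using an off-the-shelf NLI classifier via Hugging Face transformers. The choice of roberta-large-mnli and the label order are assumptions on my part; confirm nli.config.id2label before trusting the mapping:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "roberta-large-mnli"   # assumption: any NLI model could be swapped in
    tok = AutoTokenizer.from_pretrained(MODEL)
    nli = AutoModelForSequenceClassification.from_pretrained(MODEL)

    def entailment_probs(premise, hypothesis):
        inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = nli(**inputs).logits
        probs = logits.softmax(dim=-1).squeeze().tolist()
        # Label order assumed from the model card; check nli.config.id2label
        return dict(zip(["contradiction", "neutral", "entailment"], probs))

    source = "The battery lasts about ten hours on a full charge."
    summary = "The battery lasts ten hours."
    print(entailment_probs(source, summary))   # high entailment suggests factual consistency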

Limitations of reference-free metrics

  • Bias towards underlying model outputs
  • Bias against higher-quality text
  • But improved correlation with human judgments!

LLM-based metrics

  • Prompt-based evaluators
  • LLM embedding-based metrics
  • These are new and not well-studied
  • They are probably expensive to study (modulo DeepSeek)
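
A minimal sketch of a prompt-based evaluator. The rubric wording is mine, and call_llm is a placeholder for whatever client you actually use, not a real API:

    def call_llm(prompt):
        # Placeholder: swap in your provider's client (OpenAI, local model, etc.).
        raise NotImplementedError

    def llm_judge_score(source, summary):
        prompt = (
            "You are grading a summary of a source document.\n"
            "Rate the summary's factual consistency with the source on a 1-5 scale.\n"
            "Answer with a single integer and nothing else.\n\n"
            f"Source:\n{source}\n\nSummary:\n{summary}\n"
        )
        reply = call_llm(prompt)
        try:
            return int(reply.strip())
        except ValueError:
            return None   # invalid or uncalibrated outputs happen; handle them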

Best practices

  • Use a suite rather than relying on a single metric
  • Use standard (e.g., factual consistency) and custom (e.g., writing style) metrics
  • Example: measuring F1-score overlap between regex extraction and ground truth on electronic product summaries (sketched after this list)
  • Use LLM and non-LLM metrics
  • Validate the evaluators (usually against human judgments)
  • Visualize and analyze metrics (e.g., use boxplots)
  • Involve experts to annotate data, evaluate summaries, and design metrics
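
The regex-vs-ground-truth example might look like the sketch below. The regex, field format, and product names are hypothetical, not taken from the paper:

    import re

    def extract_model_numbers(text):
        # Custom extraction step: pull product model numbers like "XPS-13".
        return set(re.findall(r"\b[A-Z]{2,4}-\d{1,4}\b", text))

    def f1(predicted, gold):
        if not predicted and not gold:
            return 1.0
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

    summary = "The XPS-13 and MB-2024 are compared; the MB-2024 is lighter."
    gold = {"XPS-13", "MB-2024", "TP-480"}
    print(f1(extract_model_numbers(summary), gold))   # 2 of 3 found, no false positives: 0.8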

More best practices

  • Data-driven prompt engineering
  • Tracking metrics over time
  • Appropriate metric interpretation

Open challenges

  • Cold start problem
    • Synthesize data (but may not represent distributions)
    • Repurpose existing data (e.g., search engine logs)
  • Subjectivity in evaluating and annotating text
    • Empirical inter-rater reliability (IRR) results show moderate (about 80%) agreement (see the agreement sketch after this list)
  • Challenge of good vs excellent
    • Difficult to discern between good and excellent
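
For the subjectivity point, inter-rater reliability can be checked with raw percent agreement and a chance-corrected statistic such as Cohen's kappa. The two annotator label lists below are invented for illustration:

    from collections import Counter

    def percent_agreement(a, b):
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        # Chance-corrected agreement between two annotators on the same items.
        po = percent_agreement(a, b)
        ca, cb, n = Counter(a), Counter(b), len(a)
        pe = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
        return (po - pe) / (1 - pe)

    ann1 = ["good", "bad", "good", "good", "bad"]
    ann2 = ["good", "good", "good", "good", "bad"]
    print(percent_agreement(ann1, ann2))   # 0.8, the kind of moderate IRR cited above
    print(cohens_kappa(ann1, ann2))        # noticeably lower once chance agreement is removed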

Additional Observations

  • The lead author earned a PhD from Imperial College London, ranked #2 worldwide by QS
  • The paper was published at SIGIR, a top conference in information retrieval
  • Some material is already out of date given OpenAI o3 and DeepSeek R1

Shankar et al. (2024)

LLMs are the new way to evaluate LLMs!

What could possibly go wrong?

Shankar presents an example solution to the obvious problem

Alignment with human judgments

This is the most frequent goal of automatic evaluation

Criteria Drift

Users need criteria to grade outputs but grading outputs helps users define criteria

Some criteria cannot be defined a priori

Problems with LLMs

  • They hallucinate
  • They ignore instructions
  • They generate invalid outputs
  • They generate uncalibrated outputs

Existing tools

  • Many tools exist for prompt engineering and auditing
  • These tools require metrics
  • These tools usually include calls to evaluator LLMs
  • Evaluator LLMs evaluate things like conciseness that are hard to encode
  • Evaluator LLMs struggle to find alignment with human judgments
  • Hard to craft code-based assertions, such as appropriate regexes (a sketch follows this list)
  • Hard to craft prompts due to unintuitive sensitivities to minor wording changes
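
To make "code-based assertion" concrete, here is a small sketch of two such checks on a product summary; the regex and thresholds are made up, and their brittleness is exactly the difficulty noted above:

    import re

    def assert_concise(output, max_words=120):
        # Simple code-based assertion: flag summaries that run too long.
        return len(output.split()) <= max_words

    def assert_mentions_price(output):
        # Brittle regex assertion: catches "$1,299" but misses "USD 1299",
        # "1.299 €", and similar formats; crafting good ones is hard.
        return re.search(r"\$\d[\d,]*(\.\d{2})?", output) is not None

    sample = "The laptop costs $1,299 and weighs 1.2 kg."
    print(assert_concise(sample), assert_mentions_price(sample))   # True True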

Addressing challenges

  • Reduce user effort
  • LLM suggests a criterion in natural language in context
  • User modifies criterion
  • LLM generates pool of candidate assertions for each criterion
  • User votes good or bad (for each criterion or assertion or both?)
  • Embed solution in ChainForge (a prompt engineering tool created by the last co-author)

Aside on ChainForge, which claims you can:

  • test robustness to prompt injection attacks
  • test consistency of output when instructing the LLM to respond only in a certain format (e.g., only code)
  • send off a ton of parametrized prompts, cache them and export them to an Excel file, without having to write a line of code
  • verify quality of responses for the same model, but at different settings
  • measure the impact of different system messages on ChatGPT output
  • run example evaluations generated from OpenAI evals
  • …and more

User study

  • Nine users—industry practitioners
  • Qualitative criteria
  • Generic task
  • Little guidance provided to participants

Background

  • Prompt engineering is a new practice and research area
  • Auditing practices like red-teaming are used to identify harmful outputs
  • LLM operations or LLMOps include new tools and terminology: prompt template, chain of thought, agents, and chains
  • LLM-based evaluators are also called LLM-as-a-judge or co-auditors
  • Prompt engineering tools allow users to write evaluation metrics but don’t provide mechanisms like EvalGen (their tool) to align LLM evaluators with expectations
  • PE tools at most allow you to manually check outputs

Over-trust and over-generalization

  • Prior research shows people trust LLMs too much and generalize too much from them
  • Example: an MIT EECS exam where GPT-4 graded itself with an A but didn’t deserve it
  • Set ordering matters in asking an LLM to choose from a set!
  • LLMs are overly sensitive to formatting changes
  • Users tend to throw out entire chains from one unsuccessful prompt
  • Users tend to over-generalize from first few outputs
  • Users tend to over-generalize from small numbers of error reports
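
One way to probe the set-ordering sensitivity mentioned above is to re-ask the same question under every permutation of the options and see whether the answer changes. call_llm is again a placeholder, not a real client:

    import itertools

    def call_llm(prompt):
        # Placeholder: swap in your actual LLM client here.
        raise NotImplementedError

    def choice_by_order(question, options):
        # If the chosen answer differs across orderings, the model is
        # sensitive to set ordering rather than to the options themselves.
        answers = {}
        for perm in itertools.permutations(options):
            listing = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(perm))
            answers[perm] = call_llm(f"{question}\nChoose one option:\n{listing}")
        return answers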

Aligning LLMs with user expectations

  • Interactive systems have been shown to better match user expectations
  • Interactive systems typically assist users in selecting training examples, labeling (annotating) data, and evaluating results
  • All LLM alignment strategies feature human-in-the-loop
  • The heritage of these approaches is pre-defined benchmarks from NLP and ML

Summary of background

  • Help is needed both in prototyping evaluations and validating LLM evaluators
  • Human evaluators are prone to over-reliance and over-generalization
  • These both lead to the authors’ proposed solution, EvalGen

EvalGen pipeline

EvalGen workflow