Principles: Evaluation

Prompt Engineering

Mick McQuaid

mcq@utexas.edu

University of Texas at Austin

09 Sep 2025

Week Three

Agenda

Finish group assignments

Review syllabus

Review whatiknow

Review eA

Evaluation of LLMs

Announce eB

eA

Scoring

Average score: 5/5 (100%)

Grading was very lenient

But the honeymoon is over

Many people disregarded instructions like the file names—I’ll take off points for these omissions in future assignments

I don’t regrade assignments—I make suggestions for improvement that you should implement so that the hw is useful to you in the future

Why don’t I regrade? Think about it!

I don’t regrade because everyone would eventually get a 5/5

Observations

One person did five prompts

Most did two or three prompts

Some people did not include an intro or conclusion

Most people included last names, which we discussed as a privacy no-no

Many people included the boilerplate, which should have been removed

Exemplary version

Let’s pause to look at an exemplary version of the assignment. Notice the detailed commentary and conclusion.

Requirements

Notice that I gave you very few requirements

I did not give you a list of detailed instructions to follow

I will typically underspecify the assignment and let you be adventurous

What happens if you just do the minimum?

Can you be replaced by genAI?

You have to be creative and adventurous to survive in the workplace

Observations

You must iterate

You must be specific

There will be drift (the model forgets what you told it earlier)

Giving the model my instructions is not enough!

You have to be creative and analyze the output

Feedback helps

You can control parameters

The model has limitations to discover

The best submissions

Capitalized names

Gave actual dates

Arranged results in chronological order (alphabetical order as an addition would have been good)

Formatted results in markdown format (but could have incorporated markdown output into the body of the document)

News

Biggest news of the week?

You guessed it: the Anthropic settlement

Is 1.5 billion dollars a lot?

It’s 3,000USD per book for 500,000 books

Fair use books were exempted from the settlement—only books from LibGen were covered

Current Anthropic valuation: 183 billion USD

Other news

Let’s look at Data Points and The Batch

Last semester’s news in week three: Deepseek R1

Deepseek is a Chinese startup founded by Liang Wenfeng

They rocked the US stock markets this week by releasing a new LLM called R1

R1 is claimed to be competitive with gpt-4o o3 but at a fraction of the cost to develop

They are open source code, weights, and data

They are offering R1 free to use, including for local installation

My former student is running it on his Macbook and wants to try it on some version of Raspberry Pi

Why should this matter to stock prices? \(\langle\) Discuss \(\rangle\)

Evaluation of LLMs

Another way to discover genuine principles is to evaluate LLMs. What qualities are we looking for in an LLM?

Generic Qualities

Fluency

Relevance

Coherence

Perplexity (how well a probability model predicts a sample)

Overlap of n-grams in translations (automating human judgments)

Benchmark task performance (e.g., resolving ambiguous pronouns, demonstrating reading comprehension)

van Schaik and Pugh (2024)

Automating evaluation of LLMs

Desiderata

Scalable

Automatic

Reliable (LLMs are not reliable for evaluating LLMs)

Cost-effective

Considers new issues such as hallucinations

Issues

Hallucinations

Knowledge recency

Reasoning inconsistency

Difficulty in computational reasoning

Different communities evaluate different criteria

Security and Responsible AI

Computing performance

Retrieval vs Gnerator Evaluation

Offline vs Online Evaluation

System Evaluation vs Model Evaluation

Focus

van Schaik focuses on automatic, offline, system-level evaluation of generative AI text: methods for evaluating quality of summaries

Qualities van Schaik addresses

Fluency

Coherence

Relevance

Factual consistency

Fairness

Similarity to reference text

Three kinds of metrics

Reference-based

Reference-free (Context-based)

LLM-based

Reference-based metrics

N-gram based metrics

Embedding-based metrics

Both are simple, fast, inexpensive

Poor correlation with human judgments

Lack of interpretability

Inherent bias

Poor adaptability

Inability to capture subtle nuances

Reference-free metrics (Context-based)

Evaluation is a score

Quality-based metrics
- Based on context or source

Entailment-based metrics
- Based on the Natural Language Inference (NLI) task
- Determines whether output entails, contradicts, or undermines premise

Factuality, QA, and QG-based metrics
- Based on the QA (question answering) and QG (question generation) tasks
- Determines whether output is factually correct

Limitations of reference-free metrics

Bias towards underlying model outputs

Bias against higher-quality text

But improved correlation with human judgments!

LLM-based metrics

Prompt-based evaluators

LLM embedding-base metrics

These are new and not well-studied

They are probably expensive to study (modulo DeepSeek)

Best practices

Use a suite rather than relying on a single metric

Use standard (e.g., factual consistency) and custom (e.g., writing style) metrics

Example: measuring F1 score overlap between regex extraction and ground truth on electronic product summaries

Use LLM and non-LLM metrics

Validate the evaluators (usually against human judgments)

Visualize and analyze metrics (e.g., use boxplots)

Involve experts to annotate data, evaluate summaries, and design metrics

More best practices

Data-driven prompt engineering

Tracking metrics over time

Appropriate metric interpretation

Open challenges

Cold start problem
- Synthesize data (but may not represent distributions)
- Repurpose existing data (e.g., search engine logs)

Subjectivity in evaluating and annotating text
- Empirical IRR results show medium (80%) inter-rater reliability

Challenge of good vs excellent
- Difficult to discern between good and excellent

Additional Observations

Lead author got PhD from Imperial College London, ranked 2 worldwide by QS

Paper is published in SIGIR, a top conference in information retrieval

Some material is already out of date given OpenAI o3 and DeepSeek R1

Shankar et al. (2024)

LLMs are the new way to evaluate LLMs!

What could possibly go wrong?

Shankar presents an example solution to the obvious problem

Alignment with human judgments

This is the most frequent goal of automatic evaluation

Criteria Drift

Users need criteria to grade outputs but grading outputs helps users define criteria

Some criteria cannnot be defined a priori

Problems with LLMs

They hallucinate

They ignore instructions

They generate invalid outputs

They generate uncalibrated outputs

Existing tools

Many tools exist for prompt engineering and auditing

These tools require metrics

These tools usually include calls to evaluator LLMs

Evaluator LLMs evaluate things like conciseness that are hard to encode

Evaluator LLMs struggle to find alignment with human judgments

Hard to craft code-based assertions, such as appropriate regexes

Hard to craft prompts due to unintuitive sensitivities to minor wording changes

Addressing challenges

Reduce user effort

LLM suggests a criterion in natural language in context

User modifies criterion

LLM generates pool of candidate assertions for each criterion

User votes good or bad (for each criterion or assertion or both?)

Embed solution in ChainForge (a prompt engineering tool created by the last co-author)

Aside on ChainForge, which claims you can:

test robustness to prompt injection attacks

test consistency of output when instructing the LLM to respond only in a certain format (e.g., only code)

send off a ton of parametrized prompts, cache them and export them to an Excel file, without having to write a line of code

verify quality of responses for the same model, but at different settings

measure the impact of different system messages on ChatGPT output

run example evaluations generated from OpenAI evals

…and more

User study

Nine users—industry practitioners

Qualitative criteria

Generic task

Little guidance provided to participants

Background

Prompt engineering is a new practice and research area

Auditing practices like red-teaming are used to identify harmful outputs

LLM operations or LLMOps include new tools and terminology: prompt template, chain of thought, agents, and chains

LLM-based evaluators are also called LLM-as-a-judge or co-auditors

Prompt engineering tools allow users to write evaluation metrics but don’t provide mechanisms like EvalGen (their tool) to align LLM evaluators with expectations

PE tools at most allow you to manually check outputs

Over-trust and over-generalization

Prior research shows people trust LLMs too much and generalize too much from them

Example: MIT EECS exam where gpt-4 graded itself with an A but didn’t deserve it

Set ordering matters in asking an LLM to choose from a set!

LLMs are overly sensitive to formatting changes

Users tend to throw out entire chains from one unsuccessful prompt

Users tend to over-generalize from first few outputs

Users tend to over-generalize from small numbers of error reports

Aligning LLMs with user expectations

Interactive systems have been shown to better match user expectations

Interactive systems typically assist users in selecting training examples, labeling (annotating) data, and evaluating results

All LLM alignment strategies feature human-in-the-loop

Heritage is pre-defined benchmarks from NLP and ML

Summary of background

Help is needed both in prototyping evaluations and validating LLM evaluators

Human evaluators are prone to over-reliance and over-generalization

These both lead to the authors’ proposed solution, EvalGen

EvalGen pipeline

EvalGen workflow

System implementation

This section is mainly of interest to system builders, a small subset of the audience here and not within course scope. I’ll only discuss it if people have questions.

Algorithm evaluation

Evaluated against SPADE as a baseline (is that valid?)

Uses a medical prompt (extract info from doctor-patient without revealing PII) and a product pipeline (craft SEO-friendly descriptions for Amazon products)

Note that product pipeline includes negative reviews, which must be filtered out of the description

Note that they used gpt-3.5 turbo (published May 2024)

Manually graded all 184 outputs

Issues included PII in the medical task and negative reviews or lengthy content in the product task

Two big differences between EvalGen and SPADE

EvalGen solicits input from the user about criteria and SPADE does not

EvalGen picks the most aligned assertion per criterion that meets some false failure rate threshold, while SPADE solves an optimization problem to select a minimal assertion set that meet a false failure rate threshold and cover all SPADE-generated criteria

Algorithm evaluation results

EvalGen is more aligned with user expectations than SPADE due to the user input

EvalGen requires fewer assertions than SPADE

SPADE produced unrealistic assertions like flag the words “never order” and flag the word “disappointed”

User Study Participants

Nine users—industry practitioners as mentioned above

Included software engineers, ML scientists, startup executives, and independent consultants

Invoked “five is enough” excuse

Focused on LLM-experienced developers

Introduced system using a named entity recognition task (100 tweets)

Gave participants 40 minutes of think-aloud time to explore the tool

Used common qualitative techniques to analyze data (open and axial coding)

General findings

Participants liked having control

Participants encountered difficulties in aligning assertions with preferences because of two reasons
- some criteria are hard for humans to evaluate
- criteria drift (loop of output and refined criteria)

Code-based and LLM-based evaluators affected alignment and needs

Typical workflow

eyeballing LLM outputs

Starting EvalGen

Grading outputs

Refining criteria

Grading more outputs

Understanding alignment on graded outputs

Eyeballing alignment on ungraded outputs

Criteria drift

Criteria drift \(=\) Grading outputs spurred changes in criteria

Participants wanted to add new criteria during the process, which EvalGen does not allow (you have to start a new process)

Participants reinterpret existing criteria to better fit LLM behavior

E.g., participants changed criteria from named entities must all be proper nouns to named entities must mostly be proper nouns

One participant kept grading the same way for consistency but didn’t feel it reflecting changing opinion

Two participants changed their views about hashtags during the process

Grading approach depends on difficulty of evaluating a criterion

Participants wanted to set different false failure rates depending on difficulty of evaluating a criterion

Some participants couldn’t trust their grades because they couldn’t evaluate as well as the machine

Example: word count is hard for humans but easy for machines

Alignment is subjective

Especially in converting natural language criteria to code-based assertions

For code-based assertions, EvalGen never matched participant expectations of criterion interpretation (here they say gpt-4 instead of gpt-3.5 turbo)

Grades therefore had no impact

Example: two participants had different interpretations of the hashtag criteria, so one was satisfied and the other was not

Some disagreement existed between participants regarding the resolution of the named entitity, e.g., Nike vs Nike Shoes

Code vs LLM evaluators

Users preferred code-based evaluators for formatting checks, count-based checks, and specific phrase inclusion / exclusion

Users preferred LLM-based evaluators for “fuzzy” criteria or when external knowledge was required

For code-based, they wanted to see the code

For LLM-based, they were less trusting

Future work and limitations

Authors suggest evaluators move beyond binary judgments

Authors admit to limited pipelines, limited sample of humans, and short time frame

Authors opine that users want automatic improvement in prompts based on assertion results

Authors suggest a regime where prompts, assertions, and evaluative mechanisms are all continuously refined in a unified interface

Additional observations

UIST is a prominent conference in human-computer interaction, but not as prominent as CHI or CSCW; not on par with SIGIR’s status in the IR world; yet it has a similar acceptance rate of 20 to 25 percent

Paper is not heavily edited or particularly well-written, so probably succeeded on importance of late-breaking results

eB

Specification

Assemble a bibliography in BibTeX format

The subject is PDF remediation

PDF remediation is the process of making a PDF accessible to people with disabilities

Most of the literature is about research papers students are required to read

I’m familiar with this literature, making it easier for me to detect hallucinations

I can also judge the importance of the papers listed

A recent bibliography in this area listed 39 sources, of which only about 20 were relevant

Document your process!

END

References

Shankar, Shreya, J. D. Zamfirescu-Pereira, Bjoern Hartmann, Aditya Parameswaran, and Ian Arawjo. 2024. “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.” In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. UIST ’24. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3654777.3676450.

van Schaik, Tempest A., and Brittany Pugh. 2024. “A Field Guide to Automatic Evaluation of LLM-Generated Summaries.” In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2832–36. SIGIR ’24. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3626772.3661346.

Colophon

This slideshow was produced using quarto

Fonts are Roboto Light, Roboto Bold, and JetBrains Mono Nerd Font

Principles: Evaluation

Agenda

eA

Scoring

Observations

Exemplary version

Requirements

Observations

The best submissions

News

Biggest news of the week?

You guessed it: the Anthropic settlement

Other news

Last semester’s news in week three: Deepseek R1

Evaluation of LLMs

Generic Qualities

van Schaik and Pugh (2024)

Desiderata

Issues

Different communities evaluate different criteria

Focus

Qualities van Schaik addresses

Three kinds of metrics

Reference-based metrics

Reference-free metrics (Context-based)

Limitations of reference-free metrics

LLM-based metrics

Best practices

More best practices

Open challenges

Additional Observations

Shankar et al. (2024)

Alignment with human judgments

Criteria Drift

Problems with LLMs

Existing tools

Addressing challenges

Aside on ChainForge, which claims you can:

User study

Background

Over-trust and over-generalization

Aligning LLMs with user expectations

Summary of background

EvalGen pipeline

EvalGen workflow

System implementation

Algorithm evaluation

Two big differences between EvalGen and SPADE

Algorithm evaluation results

User Study Participants

General findings

Typical workflow

Criteria drift

Grading approach depends on difficulty of evaluating a criterion

Alignment is subjective

Code vs LLM evaluators

More on criteria drift

Future work and limitations

Additional observations

eB

Specification

References

Colophon