Evaluation

Prompt Engineering

Mick McQuaid

University of Texas at Austin

24 Nov 2025

Week THIRTEEN

First, an Agentic-AI conference

  • You need to know about these things
  • That said, this one had a lot of business fluff … circling back to leverage our proactive synergies for a win-win
  • I want to single out one presentation as a warning, plus one presentation and one panel that were helpful

McKinsey presentation

  • McKinsey is the world’s most prominent business strategy consulting firm
  • Presenter said it’s really a tech firm in disguise and would like to consult with all of us on our AI strategies
  • Presented a lot of statistics on CEO attitudes toward AI
  • They evidently talk to a lot of CEOs who are mostly clueless about AI but are otherwise very smart people
  • Based a lot of his presentation on the following two frames

The NASA antecedent

How McKinsey adapted the NASA diagram

A useful panel on vibe coding

  • The expression vibe coding was coined by Andrej Karpathy in February
  • In November, it was selected as “Word of the year” [sic] by some dictionary
  • The panelists were all enthusiastic vibe coders
  • I’ll upload a video to Canvas when it’s made available to me
  • Main takeaway is, as always with AI, realistic expectations
  • Disagreement exists over degree of coding skill needed to effectively vibe code, but these people were all experienced coders before they began
  • Some agreement that, by pursuing test-driven development, vibe coding can be useful for debugging (contradicts some earlier wisdom)

A useful presentation on data representation for AI

  • Presenter was founder of Sphinx, which helps you massage data so AI can use it
  • Points out that, while genAI works for text and images, it does not do well on tabular data—needs agentic tools, which Sphinx conveniently provides
  • Gave an excellent example of genAI failure with the Ames dataset, which you used for exercise D
  • Showed some interesting graphics, displayed in the next two frames

Data science benefits less from genAI than does software engineering

Data in different states

Chain of Thought prompting

  • One student asked for some coverage of chain-of-thought (CoT)
  • I thought it might be useful to survey a few recent papers on CoT
  • First, let’s see what Xiao and Zhu (2025) says about CoT, then look at Yu et al. (2025), Li et al. (2025), Shen et al. (2025), and Zhang et al. (2025), all recent articles focused on CoT techniques

Xiao and Zhu (2025)

  • Discusses CoT as an advanced technique, although the basic version is now commonplace: just add “Let’s think step by step” to the prompt
  • CoT somewhat mirrors human problem solving by decomposing problems
  • CoT improves transparency by reporting intermediate steps to the user
  • CoT increases potential scope by subdividing tasks
  • CoT can be extended via tree or graph structures (active area of research)
  • CoT can be extended via interaction during intermediate steps
  • BUT … it may be difficult to obtain demonstrations
  • BUT … requires review of all intermediate steps to prevent accumulating errors
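The basic recipe can be sketched in a few lines of Python; the helper name and the demonstration text here are my own invention, not from any of the papers:

```python
def cot_prompt(question, demonstrations=None):
    """Build a chain-of-thought prompt: optional worked
    demonstrations, then the cue to reason step by step."""
    parts = list(demonstrations or [])
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

# One hand-written demonstration (this is where obtaining good
# demonstrations can be difficult, per the BUTs above)
demo = ("Q: Tom has 3 apples and buys 2 more. How many does he have?\n"
        "A: Let's think step by step. He starts with 3 apples; "
        "3 + 2 = 5. The answer is 5.")

prompt = cot_prompt("A train travels 60 km in 1.5 hours. "
                    "What is its average speed?", [demo])
```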

Yu et al. (2025)

  • Proposes the first CoT framework for text-free graphs, GCoT
  • Note that graphs are arrangements of nodes and links, not graphics!
  • Graphs may represent relationships between people, drugs, photos, proteins, or really any objects that have relationships with other objects; the nodes are the objects and the links are the relationships
  • An example of the differences between CoT for NLP and graphs is shown in the next frame, with the thoughts highlighted in red

Yu et al. (2025) process

  • Developed system to represent graphs as embeddings to be used in downstream tasks, such as node classification, link prediction
  • Experimented on eight well-known graph datasets
  • Performed an ablation study (systematically removes layers and tests resulting variants)
  • GCoT is both faster (?) and of higher quality than alternatives

One quality measure—proximity of like classes in clustering

Li et al. (2025)

  • Proposes Structured Chain of Thought (SCoT) for code generation
  • Code generation aims to automatically produce a computer program from a natural language description
  • SCoT is based on notion that human developers practice structured programming, employing three structures: sequences, branches, and loops—and only those three
  • CoT prompting uses only sequences, not branches and loops, and achieves limited performance gains over generic prompting
  • Yet LLMs have considerable knowledge of branches and loops that just needs to be “unlocked”
  • SCoT accomplishes this “unlocking”

SCoT process

  • Two-step process: (1) input-output structure, and (2) rough problem-solving process
  • IO structure defines entry and exit of code, giving requirements
  • Rough problem-solving process mandates that only the three constructs (sequence, branch, and loop) be used in subsequent steps
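A rough sketch of such a prompt builder, with wording of my own (the paper's actual prompt templates may differ):

```python
def scot_prompt(requirement):
    """Sketch of a Structured CoT prompt in the spirit of Li et al.
    (2025): elicit the input/output structure first, then a rough
    solving process limited to the three structured-programming
    constructs."""
    return (
        f"Requirement: {requirement}\n\n"
        "Step 1: Write the input/output structure of the code "
        "(signature, parameters, return value).\n"
        "Step 2: Write a rough problem-solving process using only "
        "sequence, branch (if/else), and loop (for/while) "
        "structures.\n"
        "Step 3: Implement the program following that structure."
    )

prompt = scot_prompt("Reverse the words in a sentence.")
```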

SCoT Experience

  • Uses demonstrations of structure in input
  • Five LLMs: two GPT models and three DeepSeek Coder models in various sizes
  • Three benchmarks: HumanEval, MBPP, and MBCPP
  • Two languages: Python and C++
  • Uses unbiased Pass@k as evaluation metric
  • Compares to zero-shot, few-shot, and CoT
  • Human preference also examined (10 developers w/3–5 yrs experience)
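The unbiased Pass@k estimator (standard in the code-generation evaluation literature) can be computed as follows; the function name is mine:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of
    k samples drawn without replacement from n generations
    (c of them correct) passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```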

SCoT Example

Shen et al. (2025)

  • Brief paper exploring CoT for code generation with lightweight language models (lLMs)
  • Assumes fewer than 10 billion parameters
  • Finds nearly twenty percent improvement in Pass@1 metric for CodeT5+ 6B
  • Seems motivated primarily by ease of deployment of resource-constrained lLMs
  • Uses four CoT methods, including the SCoT of the previous paper
  • Uses three benchmarks: HumanEval, OpenEval, CodeHarmony
  • Uses four models: CodeT5+, Qwen2.5-Coder, Yi-Coder, DeepSeek-Coder

Zhang et al. (2025)

  • Overall goal is video content comprehension
  • Introduces Video-CoT, a dataset to enhance spatio-temporal understanding of CoT methods
  • Note that the name Video-CoT always appears in a yellow-to-pink gradient in the paper
  • Dataset contains 192,000 spatio-temporal question-answer pairs and 23,000 annotated CoT samples
  • Includes a new benchmark with 750 images for assessing each task
  • Conducts extensive experiments on video-language models (VLMs) and finds them wanting
  • Especially concerned with event start and end times and positions of subjects

Video-CoT overview

Evaluation

  • Key problem: what are you evaluating?
  • Are you evaluating your prompt? A framework? The context? The LLM?
  • Again, what are you evaluating?
  • Typical solution to many evaluation problems: ablation
  • That means subtracting elements one at a time to see the effect on output
  • Risks overlooking synergies
  • Here synergies means that the performance of two elements together is greater than the sum of their individual contributions
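A toy illustration of leave-one-out ablation and why synergies can mislead it; the component names and scores are made up:

```python
def score(components):
    # Toy stand-in for an evaluation run: "examples" and "cot"
    # together are worth more than the sum of their parts
    s = 0.5  # baseline prompt alone
    if "examples" in components:
        s += 0.1
    if "cot" in components:
        s += 0.1
    if "examples" in components and "cot" in components:
        s += 0.2  # the synergy term an ablation can mis-attribute
    return s

def ablate(components):
    """Leave-one-out ablation: each element's apparent contribution
    is the drop in score when it alone is removed."""
    full = score(components)
    return {c: full - score([x for x in components if x != c])
            for c in components}

contributions = ablate(["examples", "cot"])
# Each element appears to contribute 0.3, so the apparent total
# (0.6) exceeds the true gain over baseline (0.4): the synergy
# is double-counted.
```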

Agenta.ai

  • Recent improvements (last couple of weeks)
  • May be less buggy than when you tried it before
  • Advantage is that it balances cost, time, and performance
  • Allows you to identify a scaled, weighted, linear combination of the three
  • Substitutes should either help with more than one metric or be used in combinations
  • Doesn’t account for alignment
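One way such a scaled, weighted, linear combination might look; the weights, scales, and function name here are illustrative, not Agenta.ai's actual formula:

```python
def weighted_score(cost_usd, time_s, perf,
                   weights=(0.3, 0.2, 0.5),
                   max_cost=1.0, max_time=60.0):
    """Scale cost and time to [0, 1], invert them (lower is
    better), and combine linearly with performance, which is
    assumed to lie in [0, 1] already."""
    w_cost, w_time, w_perf = weights
    return (w_cost * (1 - min(cost_usd / max_cost, 1.0))
            + w_time * (1 - min(time_s / max_time, 1.0))
            + w_perf * perf)
```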

Costs

  • Input tokens have different costs than output tokens
  • Costs change over time and published reports are typically outdated except as to method
  • There’s the hidden cost of the carbon footprint of the work
  • Evaluation itself has a cost

Time

  • Time often matters, as in the old saying that time is money
  • People are increasingly using genAI in real-time interactions, for better or worse

Performance

  • There’s always a human somewhere in the loop but automation is attractive
  • Time may be considered as an aspect of performance but still a tradeoff

Alignment

  • How do you measure alignment with human objectives? Human judgment?
  • How do you quantify human judgment? Can you?
  • How do you use or even compare the judgments of multiple humans?

Inspect

Inspect (AI Security Institute 2024) is an open-source Python framework for evaluating LLMs.

  • Note that you install it as pip install inspect-ai but actually call it from the command line as inspect
  • … Or call it from a Python program as described on the above page

Running Inspect

  • The home page gives a first example of how to use it
  • Create a subdirectory welcome and copy the file theory.py into it
  • Before running it, I said export INSPECT_EVAL_MODEL='anthropic/claude-sonnet-4-0'
  • Set it to whatever model you want to use, noting that you could instead supply one or more models on the command line
  • While in that subdirectory, run inspect eval theory.py
  • You could give the model on the command line, as in inspect eval theory.py --model openai/gpt-4

Examining the output

  • Inspect creates a subdirectory called logs
  • Without switching to that subdirectory, say inspect view
  • Next visit the URL it returns in a browser; in my case it was http://127.0.0.1:7575
  • At that URL you will see a list of the evaluation runs (so far there should only be one)
  • Click on it to see the details, all important evaluation measures:
    • Accuracy
    • Standard Error of Accuracy (the standard deviation divided by the square root of the number of samples)
    • input tokens and output tokens
    • the time the run took
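The accuracy and its standard error can be computed directly from per-sample 0/1 scores; a minimal sketch with a function name of my own:

```python
from math import sqrt

def accuracy_with_stderr(outcomes):
    """outcomes: one 0/1 score per sample. Returns the accuracy
    and its standard error (standard deviation divided by the
    square root of the number of samples)."""
    n = len(outcomes)
    acc = sum(outcomes) / n
    sd = sqrt(sum((o - acc) ** 2 for o in outcomes) / n)
    return acc, sd / sqrt(n)
```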

Structure of theory.py

  • Notice that theory.py returns a task composed of three parts
    • dataset
    • solver(s)
    • scorer
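For reference, the task in theory.py looks roughly like this, following the example on the Inspect home page (exact API details may vary by version):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate, self_critique

@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),   # dataset
        solver=[                                     # solver(s)
            chain_of_thought(),
            generate(),
            self_critique(),
        ],
        scorer=model_graded_fact(),                  # scorer
    )
```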

Inspect datasets

These can be from HuggingFace or can be CSV, JSON, or JSONL

At its simplest, a dataset is a table with input / target pairs

Inspect solvers

  • The solvers are the most important aspect of Inspect evaluations
  • There are several built-in solvers, briefly described on the solvers page of the documentation on the home page
  • You used three of them in the theory.py example:
    • generate(), which calls the model to generate output
    • chain_of_thought(), which encourages the model to work step-by-step
    • self_critique(), which prompts the model to critique the results of the previous call to generate()

Inspect scorers

  • Inspect has a number of built-in scorers with which you should familiarize yourself
  • Some are simple text-matching
  • Others include an f1 scorer that computes the harmonic mean of precision and recall (see next frame)
  • Still others are model graded, meaning that another model actually grades the output
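Given precision and recall, the f1 computation is just their harmonic mean; a minimal sketch, not Inspect's internal implementation:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0
    when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```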

Precision and recall

Exploring Inspect further

  • I ran the tasks in the tutorial using Claude Sonnet 4.0
  • the welcome task cost 2.12 USD and ran for 11 minutes
  • the hello task cost 0.00 USD
  • the security_guide task cost 0.11 USD
  • the hellaswag task cost 0.05 USD (I limited it to 50 samples)
  • the gsm8k task cost 0.99 USD and ran for 7 minutes (I limited it to 100 samples)
  • the mathematics task cost 1.39 USD and ran for 9 minutes (then I interrupted it after 110 samples)
  • somehow I spent a few more cents, bringing my total to 5 USD

New work in evaluation

  • Why explore new work in evaluation?
  • People are inventing new ways and repurposing old ways to evaluate what they are doing
  • You may think of a new way to evaluate your project by taking inspiration from new projects
  • Following are two projects that use very different approaches to evaluation, one very quantitative and the other very qualitative

Khatua et al. (2025)

  • Khatua et al. (2025) evaluates the ability of eight LLMs to follow prompts in QA tasks
  • Khatua et al. (2025) demonstrates that they often can’t and are not self-reflective
  • Khatua et al. (2025) provides a framework for evaluation of the LLM itself
  • Khatua et al. (2025) divides knowledge into parametric knowledge (coming from the training corpus) and context knowledge (coming from any of various context engineering techniques, but mainly RAG)

Example

  • Example: a new president of Liberia was recently elected, so their identity is not part of parametric knowledge
  • Provided with (A) correct context, (B) masked context, (C) noisy context, or (D) absurd context
  • Possible errors: (A) relying on parametric knowledge, (B) not being self-reflective, (C) relying on parametric knowledge, or (D) hallucinating

Subramonyam et al. (2025)

  • Subramonyam et al. (2025) describes a user study of prototyping applications using genAI
  • Thirteen teams, each consisting of a designer, a manager, and a developer, worked on prototypes of marketing applications
  • Evaluation mainly consisted of aligning results with human intentions
  • Design choices are often undocumented and invisible to end users
  • Overfitting to examples is common
  • Discovered tendency to optimize model outputs rather than explore broader design space
  • These discoveries come from qualitative research in the vein of Anselm Strauss and Juliet Corbin

END

References

AI Security Institute, UK. 2024. “Inspect AI: Framework for Large Language Model Evaluations.” https://github.com/UKGovernmentBEIS/inspect_ai.
Khatua, Aparup, Tobias Kalmbach, Prasenjit Mitra, and Sandipan Sikdar. 2025. “Evaluating LLMs’ (in)ability to Follow Prompts in QA Tasks.” In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2941–45. SIGIR ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3726302.3730190.
Li, Jia, Ge Li, Yongmin Li, and Zhi Jin. 2025. “Structured Chain-of-Thought Prompting for Code Generation.” ACM Trans. Softw. Eng. Methodol. 34 (2). https://doi.org/10.1145/3690635.
Shen, Yiheng, Jing Qiu, Xiang Nan, and Hongxing Xia. 2025. “An Empirical Study of Chain-of-Thought in Lightweight Language Models.” In Proceedings of the 2025 3rd International Conference on Educational Knowledge and Informatization, 69–73. EKI ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3765325.3765338.
Subramonyam, Hari, Divy Thakkar, Andrew Ku, Juergen Dieber, and Anoop K. Sinha. 2025. “Prototyping with Prompts: Emerging Approaches and Challenges in Generative AI Design for Collaborative Software Teams.” In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. CHI ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3706598.3713166.
Xiao, Tong, and Jingbo Zhu. 2025. “Foundations of Large Language Models.” https://arxiv.org/abs/2501.09223.
Yu, Xingtong, Chang Zhou, Zhongwei Kuai, Xinming Zhang, and Yuan Fang. 2025. “GCoT: Chain-of-Thought Prompt Learning for Graphs.” In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining v.2, 3669–79. KDD ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3711896.3736974.
Zhang, Shuyi, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. 2025. “Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought.” In Proceedings of the 33rd ACM International Conference on Multimedia, 12745–52. MM ’25. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3746027.3758313.

Colophon

This slideshow was produced using Quarto

Fonts are Roboto, Roboto Light, and Victor Mono Nerd Font