Iteration and Annotation

Prompt Engineering

Mick McQuaid

University of Texas at Austin

04 Feb 2026

Week FOUR

Agenda

  • Presentation: Shreya K
  • News
  • Review WhatIKnow
  • Q&A on eB
  • Q&A on m1
  • Simple chatbot
  • Iteration
  • Annotation
  • Work time

Presentations

News

WhatIKnow

Issues with the doc

  • Several people dated their contributions! Let’s all do that
  • Nothing in the Intro tab? (shouldn’t it introduce prompt engineering?)
  • How about more of your opinions / reflections on what you read?
  • The FAQ is empty! Don’t you have questions?
  • Some items need a TLDR
  • Are you finding it useful?

eB

Specification

  • Use gen AI to assemble a bibliography in BibTeX format
  • Manually assemble a bibliography at the ACM Digital Library
  • Compare them
  • The subject is PDF remediation: making a PDF accessible to people with disabilities
  • I’m familiar with this literature, making it easier for me to detect hallucinations
  • I can also judge the importance of the papers listed
  • A recent bibliography in this area listed 39 sources, of which only about 20 were relevant
  • Document your process!

m1

Deliverable

  • A short qmd / html document describing the domain
  • Use the template!
  • The doc should include a discussion of the possible datasets / skills / MCP servers you might use (the actual dataset will be due in m2)
  • You are not required to stick with the directions you give here, but this should be your current best guess of what you plan to do
  • Examples: chatbot to emulate a foreign leader; chatbot to triage banking problems; chatbot to analyze tweets

Simple chatbot

First, a Hello World in R

# install.packages("pacman")
pacman::p_load(ellmer)
client <- chat_google_gemini()
client$chat("Summarize the plot of Romeo and Juliet in 20 words or fewer.")
Romeo and Juliet, children of rival families, fall in love, marry, and die 
tragically, ending their parents' bitter feud.

Now client is an object (you can name it anything) and chat is a method of that object.

Continuing the same conversation

client$chat("Now do Hamlet.")
Hamlet, urged by his father's ghost, seeks revenge on his murderous uncle. His 
fatal hesitation leads to a tragic end for all.

… and it remembers the question from the previous call.

Same thing in Python

First, run pip install chatlas in a terminal.

from chatlas import ChatGoogle
client = ChatGoogle(model = "gemini-2.5-flash")
response = client.chat("Summarize the plot of Romeo and Juliet in 20 words or fewer.", echo="none")
print(response)
Two teens from warring families fall in love, secretly marry, and tragically die due to feuding and miscommunication.

Notice that you may not get exactly the same answer: this is a completely different conversation from the one in R.

This Python conversation can also be continued.

response = client.chat("Now do Hamlet.", echo="none")
print(response)
Prince Hamlet seeks revenge on his uncle, who murdered his father and married his mother. His feigned madness leads to tragedy and death for all.

Now a simple chatbot

client <- chat_google_gemini(
  system_prompt = ellmer::interpolate_file("posit-expense-policy.md"))
client$chat("I am a Posit employee. Based on the Posit Expense Policy, can I expense a hotel for business?")
Yes, as a Posit employee, you can expense a hotel for business purposes 
(Section 4.1, Accommodation).

The policy states that "Employees should book standard hotel rooms at a 
reasonable rate." If a nightly rate for accommodation exceeds $250, a 
justification is required (Section 4.1, Accommodation).

Cost

Anthropic isn’t very good about itemizing costs, so I’m going to guess that the examples above, run many times, cost about 0.25 USD in total. If you want to know the cost per conversation, you have to keep checking Anthropic’s up-to-the-minute usage meter.

Your turn

  • Do the same thing in Python, using whatever key you have (OpenAI, Gemini, Anthropic, etc.)
  • Put the key in an ~/.Renviron file (for R), or export it as an environment variable your Python session can read
  • First task: Find something (not an expense report, but you can use my posit-expense-policy.md to get started) to put in the system prompt
  • Can be in any domain you prefer, e.g., your project focus
  • Remember to use . instead of $ and ChatAnthropic instead of chat_anthropic (or whatever fits your provider; I’ll show the possible models)
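As a starting point, here is a minimal sketch of the Python version of the expense-policy bot. The function name make_policy_bot is mine, and the sketch assumes chatlas’s ChatGoogle accepts a system_prompt argument and that your API key (e.g., GOOGLE_API_KEY) is set when you actually chat:

```python
def make_policy_bot(policy_path, model="gemini-2.5-flash"):
    """Return a chatlas client whose system prompt is the contents of a file.

    This mirrors the R example that used ellmer::interpolate_file().
    The import is done lazily so the function can be defined even before
    chatlas is installed.
    """
    from chatlas import ChatGoogle  # pip install chatlas; needs an API key

    with open(policy_path) as f:
        policy = f.read()
    return ChatGoogle(model=model, system_prompt=policy)

# Usage, once your key is set:
# bot = make_policy_bot("posit-expense-policy.md")
# print(bot.chat("Can I expense a hotel for business?", echo="none"))
```

Swap ChatGoogle for ChatAnthropic, ChatOpenAI, etc., depending on which key you have.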

Iteration and annotation

For iteration, I think we can just brainstorm a lot of prompts and use a for-each loop to run them through chatlas in separate conversations. Remember that each time we initialized the variable client, we were starting a new conversation, reusing that variable name. We could instead give the conversations different names and run them in parallel, skipping back and forth as we get new ideas.
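The brainstorm-and-loop idea can be sketched like this. Here ask is a placeholder standing in for a real chatlas call such as ChatGoogle(model="gemini-2.5-flash").chat(prompt, echo="none"); swap it in once your API key is configured:

```python
def ask(prompt):
    # Placeholder: no API call. Replace the body with a fresh
    # chatlas client per prompt so each prompt gets its own conversation.
    return f"(response to: {prompt})"

prompts = [
    "Summarize the plot of Romeo and Juliet in 20 words or fewer.",
    "Summarize the plot of Romeo and Juliet as a limerick.",
    "Summarize the plot of Romeo and Juliet for a five-year-old.",
]

# Each prompt runs in its own conversation, so answers don't contaminate
# each other; the dict keeps prompt and response paired for comparison.
responses = {prompt: ask(prompt) for prompt in prompts}

for prompt, response in responses.items():
    print(prompt, "->", response)
```

Creating a new client inside the loop is what keeps the conversations independent; reusing one client would chain them into a single conversation.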

As for annotation, the effectiveness of system prompts makes it seem less pressing. Last semester, I asked students to annotate data for their projects; this semester, I’m not going to ask for that.

Iteration

How do you iterate?

  • Do you document what you do?
  • Do you use any prompt engineering tool to redeploy queries?
  • How many times do you iterate?
  • How do you know when you’re done?
  • There are many such worthwhile questions!
  • Let’s play with a prompt engineering tool to learn more

Agenta

  • Let’s pause to see a couple of Agenta videos
  • ⟨ agenta intro ⟩
  • ⟨ agenta quick start ⟩
  • Let’s try the workflow in the second video
  • You can install it on your machine but I’ll use the cloud version
  • How many have Docker already installed?

Example workflow

  • Select Agenta Cloud
  • Set up an account
  • I tried using example_tweets_test without success, so I switched to example_tweets_test_tiny
  • Follow me through the workflow

Annotation

Example

  • We’ll work with Potato, a current experimental annotation tool
  • It’s described in a recent paper, Pei et al. (2023)
  • It’s an open source project
  • It’s written in Python
  • It’s available on GitHub

Getting started

  • Clone the repository
  • Install the dependencies
  • Run the tool
  • Sounds simple, right?
  • Unfortunately, the instructions don’t work
  • I think they are out of date
  • We’ll do it our own way

Part one

mkdir labeling && cd labeling
git clone https://github.com/davidjurgens/potato.git
cd potato
pip install -r requirements.txt
python potato/flask_server.py start project-hub/politeness_rating/configs/politeness.yaml -p 8000

Part two

  • Visit http://localhost:8000
  • You should see a login screen
  • You have to create an account
  • Click on the “Create an account” link
  • Enter an account name and a password; it doesn’t really matter what

Part three

  • Log in with your new account
  • You should see an intro, then an instance to annotate
  • There should be fifty instances to annotate
  • Try to annotate them all
  • You can just enter a number from 1 to 5; it doesn’t matter which
  • Try not to give them all the same number

Part four

  • Do the demographics; again, they don’t matter, so just enter something
  • Click on the “Submit” button
  • Now you want to see your results
  • They are in a folder called potato/project-hub/politeness_rating/annotation_output/full/<your account name>/
  • Kill the server with Ctrl-C and use that window to navigate to the folder

Part five

  • There are several files in the folder
  • annotation_order.txt is a list of the instances you annotated in order
  • assigned_user_data.json is a list of the instances you annotated, with the annotations you made
  • annotated_instances.jsonl is a list of the instances you annotated, with the annotations you made, in JSON Lines format
  • Look at each file in a text editor

Part six

  • Convert the JSON Lines file to a CSV file
  • Use the CSV file to calculate the average rating
  • You may have some trouble extracting the scale
  • I used vi to change the word scale_1 to scale
  • To do this to all five scale labels, I used the vi command :%s/scale_[1-5]/scale/
  • Then I used the extractIDandScale.py script to extract the ID and scale from the JSON Lines file
  • You may use an LLM for these tasks
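If you want a starting point for the conversion step, here is a generic sketch (the helper name jsonl_to_csv is mine). It flattens the top-level keys of each JSON Lines record into CSV columns; nested fields, like per-scale annotations, come out as Python-repr strings you would still need to clean up, e.g. with the vi substitutions above:

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path):
    """Flatten a JSON Lines file into a CSV with one column per top-level key."""
    with open(jsonl_path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    # Union of keys across all records, so ragged rows still fit;
    # missing values are written as empty cells.
    fields = sorted({key for row in rows for key in row})
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Check your own annotated_instances.jsonl for the actual field names before relying on the output.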

Part seven

  • I used vi to get rid of lines that start with Politeness
  • I used the vi command :g/^Politeness/d
  • Next I used the calcAvg.py script to calculate the average rating
  • However, average ratings are not very useful for Likert scales
  • Instead, I made a stem-and-leaf plot
  • It took several tries for gpt-4o to make it look like the one I can easily generate in R

Stem-and-leaf plot

  5 | 000000
  4 | 00000000000
  3 | 00000000000000000
  2 | 00000000000
  1 | 00000

See if you can make a stem-and-leaf plot for your data with an LLM
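A plot in the style shown above can also be produced with a short script like this (a sketch; the helper name stem_and_leaf is mine, and it assumes 1-to-5 Likert ratings):

```python
from collections import Counter

def stem_and_leaf(ratings):
    """Print a stem-and-leaf style plot for 1-5 Likert ratings,
    one '0' per response, highest rating first."""
    counts = Counter(ratings)
    lines = [f"{value} | " + "0" * counts.get(value, 0)
             for value in range(5, 0, -1)]
    print("\n".join(lines))
    return lines

stem_and_leaf([5, 5, 4, 3, 3, 3, 2, 1])
```

Because every leaf is a repeated count marker rather than a digit, this is really a text histogram, which is all a Likert scale needs.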

END

References

Pei, Jiaxin, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Jackson Sargent, Apostolos Dedeloudis, and David Jurgens. 2023. “POTATO: The Portable Text Annotation Tool.” https://arxiv.org/abs/2212.08620.

Colophon

This slideshow was produced using quarto

Fonts are Roboto Light, Roboto Bold, and JetBrains Mono Nerd Font