Iteration and Annotation

Prompt Engineering

Mick McQuaid

University of Texas at Austin

16 Sep 2025

Week Four

Agenda

  • Presentations: Steven, Jihyung
  • News
  • Review whatiknow
  • Q&A on eB
  • Q&A on m1
  • Simple chatbot
  • Iteration
  • Annotation
  • Work time

Presentations

News

WhatIKnow

Issues with the doc

  • It’s a bit hard to read (aimed at me instead of you)
  • Nothing in the Intro tab? (shouldn’t it introduce prompt engineering?)
  • Most items are long summaries of old academic articles
  • The FAQ doesn’t look like an FAQ
  • Some of the items are articles we went over in class!
  • Most of the items need a TLDR
  • Most of the items are copied and pasted in, making the review process hard (I’ll demonstrate)
  • Three different tabs for LLM Evaluation??
  • Who would use this doc? (I meant it as something you would find useful in answering the questions you naturally have)

The current tabs

  • News
  • FAQ
  • Prompt Techniques
  • LLM Evaluation
  • LLM Evaluators (?!)
  • Who Validates the Validators?

So, let’s completely redo it

  • Make a document that works for you
  • Only make a new tab for a new group of topics
  • Use headings and subheadings for topics and subtopics on a tab
  • Try to write in the document rather than cutting and pasting in from another source
  • Dhruvi has already graded your previous contributions, so it’s okay to nuke them and start from scratch

eB

Specification

  • Assemble a bibliography in BibTeX format
  • The subject is PDF remediation
  • PDF remediation is the process of making a PDF accessible to people with disabilities
  • Most of the literature is about research papers students are required to read
  • I’m familiar with this literature, making it easier for me to detect hallucinations
  • I can also judge the importance of the papers listed
  • A recent bibliography in this area listed 39 sources, of which only about 20 were relevant
  • Document your process!

m1

Deliverable

  • A short qmd / html document describing the domain
  • Use the template!
  • The doc should include a discussion of the possible datasets you might use (the actual dataset will be due in m2)
  • You are not required to stick with the directions you give here, but this should be your current best guess of what you plan to do
  • Examples: chatbot to emulate a foreign leader; chatbot to triage banking problems; chatbot to analyze tweets

Simple chatbot

First, a Hello World in R

# install.packages("pacman")
pacman::p_load(ellmer)
client <- chat_anthropic(model = "claude-sonnet-4-20250514")
client$chat("Summarize the plot of Romeo and Juliet in 20 words or fewer.")
Two young lovers from feuding families secretly marry, but miscommunication 
leads to their tragic deaths, ultimately reconciling their warring houses.

Now client is an object (you can name it anything) and chat() is a method of that object.

Continuing the same conversation

client$chat("Now do Hamlet.")
Prince Hamlet seeks revenge against his uncle Claudius for murdering his 
father, but his hesitation and madness lead to tragic consequences.

… and it remembers the question from the previous call.

Same thing in Python

First, say pip install chatlas in a terminal.

from chatlas import ChatAnthropic
client = ChatAnthropic(model = "claude-sonnet-4-20250514")
client.chat("Summarize the plot of Romeo and Juliet in 20 words or fewer.")

Two young lovers from feuding families secretly marry, but miscommunication
leads to their tragic deaths, ultimately reconciling their warring houses.
<chatlas._chat.ChatResponse object at 0x11244e5a0>

Notice that you may not get exactly the same answer; this is a completely different conversation from the one in R. (The <chatlas._chat.ChatResponse object …> line is just Python echoing the return value at the interactive prompt.)

This Python conversation can also be continued.

client.chat("Now do Hamlet.")

Prince Hamlet seeks revenge against his uncle Claudius for murdering his father,
but his hesitation and madness lead to tragedy.
<chatlas._chat.ChatResponse object at 0x113300770>

Now a simple chatbot

client <- chat_anthropic(
  model = "claude-sonnet-4-20250514",
  system_prompt = ellmer::interpolate_file("posit-expense-policy.md")
)
client$chat("I am a Posit employee. Based on the Posit Expense Policy, can I expense a hotel for business?")
Yes, as a Posit employee, you can expense a hotel for business travel. 
According to **Section 4.1 Travel Expenses**, accommodation expenses are 
reimbursable with the following guidelines:

- **Standard**: You should book standard hotel rooms at a reasonable rate
- **Rate limit**: If a nightly rate exceeds $250, you'll need to provide a 
justification
- **Business purpose**: The hotel stay must be directly related to company 
business (as stated in **Section 3 General Principles**)

**Requirements for reimbursement:**
- Submit the expense within 30 days of incurring it
- Provide a corresponding receipt or proof of purchase
- Submit through the company's designated expense management software
- Get approval from your direct manager

Make sure to attach a clear, legible image of the hotel receipt to your expense
report as outlined in **Section 6.1 Submission**.

Cost

Anthropic isn’t too good about itemizing costs, so I’m going to guess that the above, after running many times, cost about 0.25 USD. If you want to know the cost per conversation, you have to keep checking Anthropic’s up-to-the-minute meter.

Your turn

  • Do the same thing in Python, using whatever key you have (OpenAI, Gemini, Anthropic, etc.)
  • Put the key in an ~/.Renviron file (if your Python session doesn’t pick it up there, export it as an ordinary environment variable instead)
  • First task: Find something (not an expense report, but you can use my posit-expense-policy.md to get started) to put in the system prompt
  • Can be in any domain you prefer, e.g., your project focus
  • Remember to use . instead of $ and ChatAnthropic instead of chat_anthropic (or whatever applies; I’ll show the possible models); see the sketch below
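
For reference, here is a minimal sketch of the same expense-policy chatbot in Python with chatlas. It assumes your ANTHROPIC_API_KEY is visible to Python as an environment variable and that posit-expense-policy.md sits in the working directory.

# Minimal Python version of the expense-policy chatbot, assuming
# ANTHROPIC_API_KEY is set in the environment and the policy file
# is in the working directory.
from chatlas import ChatAnthropic

with open("posit-expense-policy.md") as f:
    policy = f.read()

client = ChatAnthropic(
    model="claude-sonnet-4-20250514",
    system_prompt=policy,
)
client.chat(
    "I am a Posit employee. Based on the Posit Expense Policy, "
    "can I expense a hotel for business?"
)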

Iteration and annotation

I’ve lost interest in the tools we explored last semester but I haven’t deleted the slideware talking about how to use them. Instead, for iteration, I think we can just brainstorm a lot of prompts and use a foreach to run them through chatlas in different conversations. Remember that each time we initialized the variable client, we were starting a new conversation where we could reuse that variable. We could instead give the conversations different names and run them in parallel, skipping back and forth as we get new ideas.
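
Here is a sketch of that loop in chatlas. The prompts are invented for illustration; substitute your own brainstormed list.

# Sketch of the brainstorm-and-loop idea: each prompt gets its own
# fresh conversation, and each conversation gets its own name.
from chatlas import ChatAnthropic

prompts = [
    "Explain chain-of-thought prompting in two sentences.",
    "Explain few-shot prompting in two sentences.",
    "Explain role prompting in two sentences.",
]

conversations = {}
for i, prompt in enumerate(prompts):
    # a fresh client object means a fresh conversation
    client = ChatAnthropic(model="claude-sonnet-4-20250514")
    client.chat(prompt)
    conversations[f"conv{i}"] = client

# Because each conversation has its own name, you can skip back and forth:
# conversations["conv0"].chat("Now shorten that to one sentence.")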

As for annotation, the effectiveness of system prompts makes it seem less pressing. Last semester, I asked students to annotate data for their projects. This semester, I’m not going to ask for that.

A different direction

I’m thinking we should write and deploy a number of chatbots using Shiny. That means we have to learn a little bit about Shiny. What do you think?
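
As a preview, here is a minimal sketch of what that could look like using Shiny for Python’s ui.Chat component together with chatlas. Treat the details as provisional: the chat API in Shiny has been evolving, so check the current docs before relying on it.

# app.py -- a provisional sketch of a Shiny (Python, Express syntax)
# chat app wired to chatlas; the ui.Chat API has been changing, so
# check the current Shiny docs. Run with: shiny run app.py
from chatlas import ChatAnthropic
from shiny.express import ui

chat_client = ChatAnthropic(model="claude-sonnet-4-20250514")

chat = ui.Chat(id="chat")
chat.ui()

@chat.on_user_submit
async def respond(user_input: str):
    # stream the model's reply into the chat window
    response = await chat_client.stream_async(user_input)
    await chat.append_message_stream(response)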

The remaining slides

The remaining slides are a holdover from iteration and annotation from last semester.

Iteration

How do you iterate?

  • Do you document what you do?
  • Do you use any prompt engineering tool to redeploy queries?
  • How many times do you iterate?
  • How do you know when you’re done?
  • There are many such worthwhile questions!
  • Let’s play with a prompt engineering tool to learn more

Agenta

  • Let’s pause to see a couple of Agenta videos
  • ⟨agenta intro⟩
  • ⟨agenta quick start⟩
  • Let’s try the workflow in the second video
  • You can install it on your machine but I’ll use the cloud version
  • How many have Docker already installed?

Example workflow

  • Select Agenta Cloud
  • Set up an account
  • I tried using example_tweets_test without success, so I switched to example_tweets_test_tiny
  • Follow me through the workflow

Annotation

Example

  • We’ll work with Potato, a current experimental annotation tool
  • It’s described in a recent paper, Pei et al. (2023)
  • It’s an open source project
  • It’s written in Python
  • It’s available on GitHub

Getting started

  • Clone the repository
  • Install the dependencies
  • Run the tool
  • Sounds simple, right?
  • Unfortunately, the instructions don’t work
  • I think they are out of date
  • We’ll do it our own way

Part one

mkdir labeling && cd labeling
git clone https://github.com/davidjurgens/potato.git
cd potato
pip install -r requirements.txt
# start the annotation server on port 8000 with the politeness-rating example
python potato/flask_server.py start project-hub/politeness_rating/configs/politeness.yaml -p 8000

Part two

  • Visit http://localhost:8000
  • You should see a login screen
  • You have to create an account
  • Click on the “Create an account” link
  • Enter an account name and a password; it doesn’t really matter what

Part three

  • Log in with your new account
  • You should see an intro, then an instance to annotate
  • There should be fifty instances to annotate
  • Try to annotate them all
  • You can just enter a number from 1 to 5; it doesn’t matter what
  • Try not to give them all the same number

Part four

  • Do the demographics; again, they don’t matter, so just enter something
  • Click on the “Submit” button
  • Now you want to see your results
  • They are in a folder called potato/project-hub/politeness_rating/annotation_output/full/<your account name>/
  • Kill the server with Ctrl-C and use that window to navigate to the folder

Part five

  • There are several files in the folder
  • annotation_order.txt lists the instances you annotated, in order
  • assigned_user_data.json lists the instances you annotated, along with the annotations you made
  • annotated_instances.jsonl holds the same instances and annotations in JSON Lines format
  • Look at each file in a text editor

Part six

  • Convert the JSON Lines file to a CSV file
  • Use the CSV file to calculate the average rating
  • You may have some trouble extracting the scale
  • I used vi to change the word scale_1 to scale
  • To do this to all five scale labels, I used the vi command :%s/scale_[1-5]/scale/
  • Then I used the extractIDandScale.py script to extract the ID and scale from the JSON Lines file
  • You may use an LLM for these tasks (or adapt the sketch below)
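
If you’d rather not use an LLM, here is a hypothetical stand-in for extractIDandScale.py (I don’t have the original script). The JSON key names are guesses; inspect one line of your annotated_instances.jsonl and adjust them to match.

# Hypothetical stand-in for extractIDandScale.py. The key names "id" and
# "label_annotations" are guesses; inspect one line of your
# annotated_instances.jsonl and adjust them to match.
import csv
import json

with open("annotated_instances.jsonl") as src, \
     open("ratings.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["id", "scale"])
    for line in src:
        record = json.loads(line)
        rating = record["label_annotations"]["scale"]
        writer.writerow([record["id"], rating])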

Part seven

  • I used vi to get rid of lines that start with Politeness
  • I used the vi command :g/^Politeness/d
  • Next I used the calcAvg.py script to calculate the average rating
  • However, average ratings are not very useful for Likert scales
  • Instead, I made a stem-and-leaf plot (see the sketch after this list)
  • It took several tries for gpt-4o to make it like the one I can easily generate in R
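
For a 1-to-5 Likert item, every leaf is a single rating, so the plot reduces to counting zeros per stem. A sketch, assuming the ratings.csv from the previous step has an integer scale column:

# Stem-and-leaf plot for a 1-to-5 Likert item: one "0" leaf per rating.
# Assumes ratings.csv from the previous step with an integer "scale" column.
import csv
from collections import Counter

with open("ratings.csv") as f:
    counts = Counter(int(row["scale"]) for row in csv.DictReader(f))

for stem in range(5, 0, -1):
    print(f"{stem} | " + "0" * counts[stem])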

Stem-and-leaf plot

  5 | 000000
  4 | 00000000000
  3 | 00000000000000000
  2 | 00000000000
  1 | 00000

See if you can make a stem-and-leaf plot for your data with an LLM

END

References

Pei, Jiaxin, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Jackson Sargent, Apostolos Dedeloudis, and David Jurgens. 2023. “POTATO: The Portable Text Annotation Tool.” https://arxiv.org/abs/2212.08620.

Colophon

This slideshow was produced using Quarto

Fonts are Roboto Light, Roboto Bold, and JetBrains Mono Nerd Font