Prompting Techniques, Part Three

Prompt Engineering

Mick McQuaid

University of Texas at Austin

03 Mar 2025

Week Seven

Agenda

  • Presentations: Sainath, Anusha, Aditya
  • News
  • Review WhatIAlreadyKnow (Ishwari)
  • eC review
  • m1 review
  • Prompting Techniques

Presentations

News

The Batch

⟨pause to look at this week’s edition⟩

WhatIAlreadyKnow (Ishwari)

eC review

  • What did you learn from this exercise?
  • One thing Ishwari and I noticed was limited documentation of prompt tuning: did you learn much about it?
  • No parameter tuning, as far as I know
  • Almost everyone used the same 20 tweets, probably because the Python scripts were uniform and all used 42 as the random seed
  • Everyone reported accuracy results, but not everyone reported latency and cost, which is a problem
  • Only a couple of people reported a score composed of a weighted linear combination of accuracy, latency, and cost (a sketch of such a score follows this list)
  • Most people just eyeballed the results and declared the most accurate model and prompt the best, though some acknowledged that cost and latency matter
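
For those who skipped the composite score, here is a minimal sketch of one way to combine the three metrics; the weights, the normalization ceilings, and the example numbers are illustrative assumptions, not values anyone reported.

```python
# Hypothetical weighted linear score combining accuracy, latency, and cost.
# Weights and ceilings are illustrative assumptions, not from any report.

def composite_score(accuracy, latency_s, cost_usd,
                    w_acc=0.6, w_lat=0.2, w_cost=0.2,
                    max_latency_s=10.0, max_cost_usd=0.01):
    """Return a score in [0, 1] where higher is better."""
    lat_penalty = min(latency_s / max_latency_s, 1.0)   # 0 = instant, 1 = at/over ceiling
    cost_penalty = min(cost_usd / max_cost_usd, 1.0)    # 0 = free, 1 = at/over ceiling
    return (w_acc * accuracy
            + w_lat * (1 - lat_penalty)
            + w_cost * (1 - cost_penalty))

# Example: 85 percent accurate, 1.2 s per call, $0.002 per call
print(round(composite_score(0.85, 1.2, 0.002), 3))
```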

  • Where cost and latency gave conflicting results, some people said one should take precedence, while others said it depends on the user’s priorities
  • The best accuracy anyone reported was 95 percent, but it was not well documented; the best well-documented accuracy was 85 percent
  • The lowest winning accuracy was 60 percent
  • Many people did not include a table of accuracy, latency, and cost, which was disappointing
  • One person included prompt type (CoT, “regular”) in their table! Great!
  • Some people used VADER as a benchmark, generally finding it worse than the best prompt/model combinations; one person used a different benchmark from formulabot, which scored 95 percent; several used text2data; one used TextBlob
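
Since several reports benchmarked against VADER, here is a minimal baseline using NLTK’s SentimentIntensityAnalyzer; the ±0.05 compound-score thresholds are a common convention rather than a requirement, and the two example tweets are made up.

```python
# Minimal VADER sentiment baseline with NLTK.
# The +/-0.05 compound thresholds are a widely used convention, not mandatory.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

def vader_label(tweet: str) -> str:
    compound = sia.polarity_scores(tweet)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(vader_label("I love this airline, the crew was wonderful!"))
print(vader_label("Flight delayed three hours and no updates."))
```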

  • There was a lot of variety in the detail level of results
  • Some very thoughtful conclusions, e.g., about the effect of CoT on latency and accuracy
  • I really appreciated the documentation of how the two best prompts were selected, especially one report that used Quarto’s blockquote feature
  • A wide variety of models were used, including gpt-4o-mini, gemini-2.0-flash-001, llama-2-70-b-chat-hf, llama-3.3-70b-versatile, o3-mini-high, mistral-large, gemini-pro, claude 2.1, gpt 3.5, Claude 3.5 Sonnet:beta, claude-3.5-haiku, Cohere v2, DeepSeek v2, Sonar Pro, Llama 8192, Grok 3 DeepSearch
  • Many people included screenshots of tables from Agenta instead of creating tables on their own, which was disappointing
  • A key learning should have been how to compare models and prompts; some mentioned that in their conclusions

  • One unexpected learning was the bit of data wrangling needed for things like matching column names
  • Some people reported discovering how LLMs interpret questions differently
  • There is a difference between using curly braces and not using them in the labels of code chunks: {python} causes the code to be executed when the document is rendered, while omitting the braces displays the code without running it (see the example after this list)
  • One person got a confidence score from the benchmark tool, which highlights a limitation of the prompting techniques nearly everyone used: a confidence score would have been possible, though it would have required more effort and skill (one person did it)
  • One person selected zero-shot and few-shot prompts to compare, which was a good idea
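
For reference, here are the two chunk styles side by side in a Quarto document; the arithmetic is just a placeholder.

```{python}
# executed when the document is rendered; the output appears in the slides
2 + 2
```

```python
# displayed with syntax highlighting only; never executed
2 + 2
```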

  • No one used the Quarto feature that lets you save results from code chunks and display them in markdown text, even though some people noticed they got different results in different runs (see the sketch after this list)
  • One person got better results from one model with less context rather than more
  • Some students investigated the individual tweets and the “ground truth” associated with them, finding it sometimes misleading
  • One interesting table showed the 20 tweets alongside the ground-truth sentiments and the labels from three models, allowing a very detailed comparison
  • One person tried to use VADER from NLTK without success
  • Good insights about the difficulty of neutral tweets compared to the extremes of positive and negative (the model looks for cues and reads their absence as neutral), as well as insights into the problem of URLs in tweets
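
A minimal sketch of that feature, assuming Quarto 1.4 or later with the Jupyter engine; the seed and the accuracy value are placeholders. Define a value in an executed chunk, then reference it inline so the prose always matches the most recent run.

```{python}
import random

random.seed(42)                    # fix the seed so reruns agree
accuracy_pct = f"{17 / 20:.0%}"    # placeholder: a metric computed in code
```

The best prompt/model combination scored `{python} accuracy_pct` on the 20 sampled tweets.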

  • It’s valuable to learn more about Python, Markdown, and AI in general, as one person noted
  • People spent a wide variety of time on this assignment!
  • Some people experienced inexplicable errors in Agenta and soldiered on; I appreciate that!
  • One person got uppercase responses and didn’t remedy that, while others asked explicitly for lowercase responses
  • One person used gpt-o3-mini to generate prompts!
  • One person looked at MMLU and HellaSwag metrics to choose models!
  • The best report, in my view, was the one with the most extensive and illuminating conclusion, which I have put on Canvas (several others were close, though)
  • I really appreciate efforts to make the report more readable! This can be facilitated by studying Quarto a bit

  • “even a minor clarification in instructions can lead to better consistency and accuracy … highlights the need for structured prompt experimentation”
  • One person wrote a lengthy report in the style of an academic paper, perhaps assisted by genAI

More on Techniques

Last time, we continued with more prompting techniques. Now we’ll move on to decomposition techniques and others. Again, we’re referring to Schulhoff et al. (2024).

Decomposition

  • explicitly decomposing the problem into subproblems (a least-to-most sketch follows this list)
  • Least-to-Most Prompting
  • Plan-and-Solve Prompting
  • Tree-of-Thought Prompting
  • Recursion-of-Thought Prompting
  • Program-of-Thoughts
  • Faithful Chain-of-Thought
  • Skeleton-of-Thought
  • Metacognitive Prompting
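
As a concrete illustration of the decomposition idea, here is a minimal least-to-most sketch. It assumes the OpenAI Python SDK with gpt-4o-mini and a plain-text decomposition format, so treat it as a template rather than the canonical method from Schulhoff et al. (2024).

```python
# Least-to-most prompting sketch: decompose, then solve subproblems in sequence.
# Assumes the OpenAI Python SDK and gpt-4o-mini; swap in your own provider/model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def least_to_most(question: str) -> str:
    # Stage 1: ask the model to break the problem into simpler subproblems.
    decomposition = complete(
        f"Break this problem into a numbered list of simpler subproblems:\n{question}"
    )
    subproblems = [line for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve each subproblem, feeding earlier answers back as context.
    context = ""
    for sub in subproblems:
        answer = complete(
            f"Problem: {question}\n{context}\nSolve this subproblem: {sub}"
        )
        context += f"\n{sub}\nAnswer: {answer}"

    # Final pass: answer the original question given all intermediate answers.
    return complete(f"Problem: {question}\n{context}\nNow give the final answer.")
```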

Ensembling

  • using multiple prompts to solve the same problem, then aggregating the results, for example by majority vote (a self-consistency sketch follows this list)
  • Demonstration Ensembling
  • Mixture of Reasoning Experts
  • Max Mutual Information Method
  • Self-Consistency
  • Universal Self-Consistency
  • Meta-Reasoning over Multiple CoTs
  • DiVeRSe
  • Consistency-based Self-Adaptive Prompting
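
Here is a minimal self-consistency sketch of the majority-vote idea: sample several chain-of-thought answers at nonzero temperature and keep the most common final answer. The OpenAI Python SDK, gpt-4o-mini, and the “Final answer:” extraction convention are my assumptions, not part of the technique’s definition.

```python
# Self-consistency sketch: sample diverse reasoning paths, then majority-vote.
# Assumes the OpenAI Python SDK and gpt-4o-mini; adapt to your provider.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,  # diversity across samples is the point
        messages=[{
            "role": "user",
            "content": f"{question}\nThink step by step, then end with "
                       "'Final answer:' followed by just the answer.",
        }],
    )
    text = response.choices[0].message.content
    return text.rsplit("Final answer:", 1)[-1].strip()

def self_consistency(question: str, n: int = 5) -> str:
    votes = Counter(sample_answer(question) for _ in range(n))
    return votes.most_common(1)[0][0]  # majority-vote winner
```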

More Ensembling

  • Universal Self-Adaptive Prompting
  • Prompt Paraphrasing

Self-Criticism

  • prompting the model to critique its own output (a self-refine sketch follows this list)
  • Self-Calibration
  • Self-Refine
  • Reversing Chain-of-Thought
  • Self-Verification
  • Chain-of-Verification
  • Cumulative Reasoning
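
And a minimal self-refine sketch of the critique-then-revise loop; the OpenAI Python SDK, gpt-4o-mini, and the two-round default are assumptions for illustration, not a definitive implementation.

```python
# Self-refine sketch: draft, critique, revise, repeat.
# Assumes the OpenAI Python SDK and gpt-4o-mini; adapt to your provider.
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def self_refine(task: str, rounds: int = 2) -> str:
    draft = complete(task)
    for _ in range(rounds):
        critique = complete(
            f"Task: {task}\nDraft answer:\n{draft}\n"
            "List concrete problems with this draft."
        )
        draft = complete(
            f"Task: {task}\nDraft answer:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the draft, fixing every problem in the critique."
        )
    return draft
```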

END

References

Schulhoff, Sander, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, et al. 2024. “The Prompt Report: A Systematic Survey of Prompting Techniques.” https://arxiv.org/abs/2406.06608.

Colophon

This slideshow was produced using Quarto

Fonts are Roboto Light, Roboto Bold, and JetBrains Mono Nerd Font