milestone 4

evaluation

Authors

Group name

member 1

member 2

member 3

\(\cdots\)

Published

November 18, 2025

Intro

This document describes the evaluation of our project.

\(\langle\) Note: The next two sections are roughly the same as your milestone 2, plus any improvements. \(\rangle\)

Data card information

\(\langle\) replace this with detailed information about the dataset \(\rangle\)

\(\langle\) Note: if you have a small dataset, submit it along with the qmd file and the html file. If it’s large, the URL will probably suffice unless it’s somehow walled off. \(\rangle\)

Make sure you include at least

  • URL
  • source (the original source, not the repository)
  • repository (e.g., HuggingFace, Kaggle, etc.)
  • task you intend to use it for (e.g., question answering, summarization, etc.)
  • size
  • structure (e.g., train, test split)
  • other information, depending on the dataset’s documentation

Data dictionary

\(\langle\) replace this with a data dictionary, a table listing info about each column \(\rangle\)

Here’s an example of a table:

Here’s the table caption. It, too, may span multiple lines.

| Centered Header | Default Aligned | Right Aligned | Left Aligned |
|:---:|---|---:|:---|
| First | row | 12.0 | Example of a row that spans multiple lines. |
| Second | row | 5.0 | Here’s another one. Note the blank line between rows. |

Warning: if you use the Visual editor in RStudio, it will mangle the above table.

The data dictionary should be laid out like a table. It should include

  • column name
  • description
  • type
  • units
  • example
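
A first draft of the data dictionary can be generated programmatically and then completed by hand. The sketch below assumes the dataset has been loaded as a list of dictionaries, one per row. It infers each column's type and takes the first non-missing value as the example. The sample `rows` and the helper name `draft_data_dictionary` are hypothetical, and the description and units still have to be written manually.

```python
def draft_data_dictionary(rows):
    """Return one entry per column with an inferred type and an example value."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)  # preserve first-seen column order

    entries = []
    for name in columns:
        # The first non-missing value serves as the example.
        example = next(
            (row[name] for row in rows if row.get(name) is not None), None
        )
        entries.append({
            "column name": name,
            "description": "",  # fill in by hand
            "type": type(example).__name__ if example is not None else "unknown",
            "units": "",  # fill in by hand
            "example": example,
        })
    return entries


# Hypothetical sample rows, for illustration only.
rows = [
    {"question": "What is Quarto?", "answer_length": 42, "score": None},
    {"question": "What is a qmd file?", "answer_length": 17, "score": 0.5},
]

data_dictionary = draft_data_dictionary(rows)
for entry in data_dictionary:
    print(entry)
```

The inferred type is only the Python type of the example value, so it is worth checking against the dataset's own documentation.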

Steps to create the chatbot

This can be what you turned in for milestone 3.

Note that you will have to render the file, so you will probably have to comment out the line that launches the interface; otherwise the rendering process will hang.
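
Instead of commenting the line out by hand each time, one option is to guard the launch behind a flag. This is a minimal sketch in Python, assuming the chatbot starts through a single blocking call. `LAUNCH_UI`, `launch_interface`, and `maybe_launch` are hypothetical names standing in for whatever your milestone 3 code uses.

```python
LAUNCH_UI = False  # hypothetical flag: set to True only when running interactively


def launch_interface():
    """Hypothetical stand-in for the blocking call that starts the chat interface."""
    return "interface launched"


def maybe_launch(launch_ui=LAUNCH_UI):
    """Launch the interface only when the flag is set, so rendering can finish."""
    if launch_ui:
        return launch_interface()
    return "launch skipped"


status = maybe_launch()
print(status)
```

Quarto also offers the cell option `#| eval: false`, which keeps a cell's code visible in the rendered output without executing it.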

Evaluation

Overview

Describe your approach to evaluation and give any sources or inspiration you used.

Details

Here is where you give the detailed evaluation. This should include quantitative evaluation and qualitative evaluation.

Quantitative Evaluation

Here you might include a table of cost, time, and performance under different conditions.
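
One way to gather the numbers for such a table is to time each call and record sizes alongside it. The sketch below assumes one callable per condition that returns the response text. `fake_model` is a hypothetical stand-in for a real model call, and whitespace word counts are used only as a rough proxy for token counts.

```python
import time


def fake_model(prompt):
    """Hypothetical stand-in for a real model call; returns a canned response."""
    return "Sustainable fashion reduces waste and values durable materials."


def measure(label, model_fn, prompt):
    """Time one call and return a row for the results table."""
    start = time.perf_counter()
    response = model_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "condition": label,
        "seconds": elapsed,
        # Word counts are a rough proxy; real token counts come from the API.
        "input_words": len(prompt.split()),
        "output_words": len(response.split()),
    }


prompt = "Write one sentence about sustainable fashion."
results = [measure("baseline", fake_model, prompt)]
for row in results:
    print(f"{row['condition']}: {row['seconds']:.4f}s, "
          f"{row['input_words']} in / {row['output_words']} out")
```

Repeating each measurement several times and reporting the average would give more stable timings than a single run.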

An example provided by Claude follows:

Prompt Performance Analysis

Task: Generate a 500-word marketing blog post about sustainable fashion

| Model | Cost per 1K tokens | Total Cost | Response Time | Quality Score | Token Count (Input/Output) |
|---|---|---|---|---|---|
| Claude Opus 4 | $15.00 / $75.00 | $0.12 | 8.2s | 9.2/10 | 150 / 623 |
| Claude Sonnet 4.5 | $3.00 / $15.00 | $0.02 | 4.7s | 8.8/10 | 150 / 623 |
| Claude Haiku 4 | $0.80 / $4.00 | $0.01 | 2.1s | 7.5/10 | 150 / 598 |

Key Findings

  • Best Performance: Claude Opus 4 delivered the highest quality output but at 6x the cost of Sonnet
  • Best Value: Claude Sonnet 4.5 offers the optimal balance of quality (only 4% lower) and cost (83% cheaper)
  • Fastest: Claude Haiku 4 completed in 2.1s, ideal for high-volume, time-sensitive applications
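
The comparisons in these findings follow from the example table by simple ratios. As a quick check, the sketch below recomputes them from the Total Cost and Quality Score columns; the variable names are illustrative.

```python
# Values copied from the Total Cost and Quality Score columns of the example table.
total_cost = {"opus": 0.12, "sonnet": 0.02}
quality = {"opus": 9.2, "sonnet": 8.8}

# How many times more expensive Opus is than Sonnet.
cost_ratio = total_cost["opus"] / total_cost["sonnet"]
# Relative drop in quality when moving from Opus to Sonnet, in percent.
quality_drop_pct = (quality["opus"] - quality["sonnet"]) / quality["opus"] * 100
# Relative cost savings when moving from Opus to Sonnet, in percent.
cost_savings_pct = (1 - total_cost["sonnet"] / total_cost["opus"]) * 100

print(f"Opus costs {cost_ratio:.0f}x as much as Sonnet")
print(f"Sonnet quality is {quality_drop_pct:.0f}% lower")
print(f"Sonnet is {cost_savings_pct:.0f}% cheaper")
```

Stating the formula behind each percentage in your own write-up makes the findings easier to verify.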

Performance Metrics Explained

  • Quality Score: Human evaluation across coherence, accuracy, creativity, and relevance
  • Response Time: End-to-end latency including network overhead
  • Total Cost: Combined input + output token costs for this specific prompt
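
The total cost of a single request combines the input and output token counts with their respective rates. A minimal sketch of that arithmetic follows. The rates and token counts are hypothetical round numbers, not taken from any provider's price list. The `per_tokens` parameter makes the pricing unit explicit, since rates may be quoted per thousand or per million tokens.

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate, per_tokens):
    """Return the combined input + output cost of one request, in dollars.

    Rates are prices in dollars per `per_tokens` tokens.
    """
    input_cost = input_tokens / per_tokens * input_rate
    output_cost = output_tokens / per_tokens * output_rate
    return input_cost + output_cost


# Hypothetical values, for illustration only.
cost = request_cost(
    input_tokens=1_000,
    output_tokens=500,
    input_rate=2.00,    # dollars per million input tokens (assumed)
    output_rate=10.00,  # dollars per million output tokens (assumed)
    per_tokens=1_000_000,
)
print(f"${cost:.4f}")
```

Recording the pricing unit next to each rate in your own table guards against mixing per-thousand and per-million figures.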

Qualitative Findings

Here you would describe, in your own judgment, how well the completed task aligns with your intentions.

Conclusion

Here you discuss your lessons learned at a higher level.

Addendum: Features of this file

Note: delete this section before you turn in the file!

  • Front matter
    • Includes your name
    • Includes the keyword “today” which resolves to the date on which you render the document
    • Includes fonts—you should install these fonts on your computer or change the font specification to fonts you already have on your computer
    • Includes the format (html) to which Quarto will render
    • Includes some directives that are specific to that format: toc and embed-resources
    • toc causes the table of contents to be rendered, on the right side of the frame by default
    • embed-resources causes any diagrams to be included in the html file itself rather than linked—that way you can just submit the html file and I can view it instead of having to submit linked files
  • Headings: top level headings are preceded by a # and a space; second level headings are preceded by ## and a space; you can go down several levels by increasing the number of # symbols
  • Bulleted lists, formed by preceding the list with a blank line (or a heading) and beginning each line with a dash and a space (both are important)
  • LaTeX symbols, in this case \(\langle\) and \(\rangle\), which resolve to angle brackets when you render the document. You can include any LaTeX math expression between dollar signs or double dollar signs. Any dollar sign meant as a real dollar sign should be preceded by a backslash, like \$ this, so Quarto doesn’t get confused about whether you are starting an equation
  • Programmatic keywords, preceded and followed by a backtick, in this case the name of the eB.bib bibliography file; this causes the keyword to be rendered in a code font
  • Emphasis, by surrounding an important word with asterisks, causing it to be rendered in italics

Of course, you will delete all the instructions and comments in this file before you turn it in! I don’t need to read them when I read your solution. The files you turn in (the qmd and the rendered html) will just include your work. These instructions and comments are just to help you get going.