UX Prototyping: Testing

Mick McQuaid


Week TEN

Definition of usability testing


usability testing: the process of learning about users from users by observing them using a product to accomplish some specific goals of interest to them, from Barnum (2010), page 6


usability: the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use, from ISO 9241-11, quoted in Barnum (2010), page 11

Formative and Summative Testing


Formative

  • exploratory
  • early in process
  • used on lofi prototypes
  • more communication between moderator and participant
  • focus on user perceptions rather than task completion
  • more likely to think aloud

Summative

  • more formal
  • evaluating design choices
  • used on hifi prototypes
  • less likely to think aloud
  • metrics
  • quantitative measurements

Testing Low Fidelity Prototypes

We’ve discussed lofi prototypes

Recall Buxton’s characteristics

  • explore
  • suggest
  • question
  • propose
  • tentative

Lofi purpose drives lofi testing

  • to initiate a conversation
  • to help people envision one alternative
  • to stimulate ideas

How to initiate a conversation?

  • show paper pictures of artifact
  • ask participant where they might “click”
  • offer participant a pointer to “click” with, e.g., a pen
  • provide a different piece of paper as the result of the “click”

Encourage think-aloud

  • ask the participant to reveal their thoughts
  • ask what they want to do next
  • ask what is holding them up

Watch the response

  • Watch for body language
  • Observe pauses—they may indicate confusion
  • Indecision about where to point may highlight ambiguity

Don’t be afraid of pauses

  • but don’t get sidetracked
  • maintain focus on the test
  • but let the test “breathe”
  • pauses may be a natural part of the process

Put the participant at ease

  • remind the participant that they are not being tested
  • thank the participant
  • assure the participant that what they are doing is valuable
  • make sure they don’t feel you are too invested in the prototype


Amateur and Professional Testing

how serious are you?

Here are two books: one is for amateur usability testers and one is for professional usability testers. (You can still be a UX professional and an amateur usability tester; amateur tests catch many problems.)

How many problems can you catch? As many as you can fix, plus any that don’t matter

first, the amateur’s book

Most professionals of my acquaintance have read the amateur’s book, by the way. Maybe they want to be sure that they are not missing anything an amateur would catch.

The book is called Rocket Surgery Made Easy: The Do-It-Yourself Guide to Finding and Fixing Usability Problems, by Steve Krug (2010).

Steve Krug

Krug is best known for the book Don’t Make Me Think, Krug (2005). I’ve always resisted reading that book because the title gives a questionable command.

I was taught the goodness of thinking as an axiom.

Steve Krug’s maxims

  1. Start earlier than you think makes sense
  2. A morning a month, that’s all we ask
  3. Recruit loosely and grade on a curve
  4. Make it a spectator sport
  5. Focus ruthlessly on a small number of the most important problems
  6. When fixing problems, always do the least you can do

A morning a month, that’s all we ask.

  • three users, debrief over lunch;
  • morning means half a day;
  • monthly sets expectations;
  • keeps it simple;
  • tester spends a couple of days prepping

Start earlier than you think makes sense.

Even test a similar product before you’ve done any design work at all!

Recruit loosely and grade on a curve.

It’s more important to test frequently than to get the right participants or more participants.

Make it a spectator sport.

Seeing is believing. Observing makes you realize how different users are from you. More observers are better.

Focus ruthlessly on a small number of the most important problems.

How many experience the problem and how severe is it for those who do?
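One common way to operationalize this triage, sketched below with hypothetical data, is to rank problems by frequency times severity (a generic heuristic, not Krug's own formula):

```python
# Hypothetical findings: (problem, users affected out of 5, severity 1-3)
problems = [
    ("login button hidden", 4, 3),
    ("jargon in main menu", 2, 2),
    ("slow page load", 5, 1),
]

# Rank by frequency x severity so the worst problems rise to the top
ranked = sorted(problems, key=lambda p: p[1] * p[2], reverse=True)

for name, affected, severity in ranked:
    print(f"{name}: score {affected * severity}")
```

A widespread but mild annoyance can still outrank a severe problem that only one participant hits, which is why both factors belong in the score.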

When fixing problems, always do the least you can do.

Do something and don’t try to do everything. Tweak, don’t redesign. Take something away.

Krug’s common problems

  • Getting off on the wrong foot
  • Failure to shout

task-level measurements

Make a list of tasks!

Make each task into a scenario.

The scenario adds context, e.g., you are …, you need …, and supplies information, e.g., username and password, but doesn’t give clues!

Pilot test the scenarios.

〈 pause to watch a Steve Krug usability test 〉


the professional’s book

Usability testing essentials: ready, set… test! by Carol Barnum, 2010.

After you digest the amateur’s book, it’s time to tackle this one, especially chapters 5 through 7, describing planning, preparing, and conducting a test.

ideas from the professional’s book

  • Focus on the user, not the product
  • Focus on the user’s experience rather than the product’s performance
  • People are goal-oriented
  • Use a think-aloud process
  • Think of usability testing as hill-climbing (There are many paths up a hill and different combinations of tests may still lead to the same place)

small studies

  • define a user profile
  • create task-based scenarios
  • use a think-aloud process
  • make changes and test again

Figure 3.1 from Barnum (2010), p. 55, shows the user-centered design process in which usability testing is embedded.

user-centered design according to Barnum (2010)

  • distinguishes between heuristic evaluation (evaluators) and usability testing (participants)
  • advocates both, and reports some research that says usability testing is better and some that calls them equal
  • suggests an order: heuristic evaluation first and usability testing second

process according to Barnum (2010)

  • planning the usability test
  • preparing the usability test
  • conducting the usability test

Consider each of these in turn …

planning for usability testing (Barnum, 2010, ch 5)

  • establish test goals
  • determine how to test
  • agree on user subgroups
  • determine participant incentive
  • draft screeners for recruiting
  • create scenarios from tasks matching test goals
  • determine qualitative and quantitative feedback methods
  • set dates for testing and deliverables

preparing for usability testing (Barnum, 2010, ch 6)

  • recruit participants
  • assign team roles and responsibilities
  • develop checklists
  • write the moderator’s script
  • prepare other forms (consent forms)
  • choose / prepare questionnaires
  • choose / create qualitative feedback methods
  • test the test

(Remember to arrange for a backup participant)

conducting a usability test (Barnum, 2010, ch 7)

  • set up for testing
  • the moderator can create a comfortable situation
  • the moderator can administer post-test feedback
  • decide what to do when participants ask for help
  • log findings with software or forms
  • manage observers and visitors

summative testing, according to Barnum

  • Summative testing happens near release time.
  • Summative testing generates quantitative measures.
  • Summative testing evaluates specific design choices that can be compared to other design choices.
  • Summative tests are larger and more rigidly controlled than formative tests.


quantitative measurements

  • time
  • tasks completed
  • errors
  • interruptions
  • questions

(measure anything you can quantify that might correlate with success)
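These counts are easy to tabulate from session logs. A minimal sketch with made-up data (the log format here is hypothetical):

```python
from statistics import mean, median

# Hypothetical per-participant task log: (completed?, seconds, error count)
sessions = [
    (True, 95, 1),
    (True, 140, 0),
    (False, 300, 4),
    (True, 110, 2),
    (True, 85, 0),
]

completion_rate = sum(done for done, _, _ in sessions) / len(sessions)
times = [t for done, t, _ in sessions if done]  # time on task, completed runs only
errors = [e for _, _, e in sessions]

print(f"completion rate: {completion_rate:.0%}")   # 4 of 5 -> 80%
print(f"median time on task: {median(times)} s")   # median of [95, 140, 110, 85]
print(f"mean errors: {mean(errors)}")
```

Reporting the median rather than the mean time is a deliberate choice: with a handful of participants, one outlier can dominate the mean.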

system usability scale (SUS)

As a well-researched instrument, plenty is known about what it means. For example, the average score for 19 studies reported by Lewis (2009) was 62.1 rather than the 50/100 that might be expected at a glance. Another study (Bangor, 2008) reported an average score even higher, at 70/100.
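The SUS score itself is simple arithmetic over the ten Likert items: odd-numbered (positively worded) items contribute response − 1, even-numbered (negatively worded) items contribute 5 − response, and the sum is scaled by 2.5 onto 0–100. A sketch:

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 Likert responses.

    Odd-numbered items are positively worded: contribution = response - 1.
    Even-numbered items are negatively worded: contribution = 5 - response.
    The summed contributions (0-40) are scaled to 0-100 by multiplying by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly ten items")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd item
        for i, r in enumerate(responses)
    )
    return total * 2.5

# A respondent answering 4 on every positive item and 2 on every negative item:
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))  # -> 75.0
```

Note that a neutral "3 on everything" respondent scores 50, which is part of why the empirical averages reported above sit well above 50.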


testing tools

  • Screen recording: Morae, Silverback, Camtasia
  • Screen sharing: Morae, GoToMeeting, TeamViewer
  • Media analysis: Morae






Academic and Industrial Testing

Lazar, Feng, and Hochheiser (2017)

The source for this material is a book called Research methods in hci

what is being tested?

  • paper prototypes
  • wireframes
  • functional layouts
  • Wizard of Oz
  • working software
  • in situ software


goals of testing

  • finding flaws
  • improving quality
  • having an impact
  • practical improvement

differences from academic research

  • not expanding base of human knowledge
  • trying to improve an artifact
  • not strictly controlled
  • more like engineering than traditional research
  • no goal of generalization
  • smaller number of participants

success of usability testing

  • build successful product
  • shortest time
  • fewest resources
  • fewest risks
  • optimizing trade-offs
  • Wixon (2003)

process drivers

  • scheduling issues
  • resource issues
  • not theory
  • (Wixon, 2003)

changing treatment during process

  • experimental design would require same treatment for all subjects
  • usability testing may change treatment as soon as you’re confident about change

observation differs from ethnography

  • no embedding in community
  • short term
  • not deeply studying context


three kinds of testing

  • expert-based
  • automated
  • user-based

expert-based testing

  • interface experts use structured methods

expert review types

  • heuristic review
  • consistency inspection (review layout, color, terminology, language)
  • cognitive walkthrough (experts simulate users and walk through task series)

shneiderman’s heuristics:

Following are eight heuristics known as Shneiderman’s Golden Rules of Interface Design, Shneiderman (2017)

Eight golden rules

  • strive for consistency
  • cater to universal usability
  • offer informative feedback (but security disagrees)
  • design dialogs to yield closure
  • prevent errors
  • permit easy reversal of actions
  • support internal locus of control
  • reduce short-term memory load

automated testing

  • Software program applies guidelines and determines where guidelines aren’t met
  • advantage: speed of code reading
  • advantage: easy to identify features such as alt text on web pages

Automated testing is often used for WCAG (Web Content Accessibility Guidelines) checking
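A toy version of such a check, using only Python's standard library, that flags img tags missing alt text (real WCAG checkers test far more than this):

```python
from html.parser import HTMLParser

class AltTextChecker(HTMLParser):
    """Flag <img> tags that lack an alt attribute (a basic WCAG 1.1.1 check)."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        attr_dict = dict(attrs)
        if tag == "img" and "alt" not in attr_dict:
            self.missing_alt.append(attr_dict.get("src", "<no src>"))

checker = AltTextChecker()
checker.feed('<img src="logo.png" alt="Company logo"><img src="chart.png">')
print(checker.missing_alt)  # -> ['chart.png']
```

This illustrates both advantages from the list above: the check reads markup far faster than a human could, and "is the feature present at all?" questions are trivial to automate, while "is the alt text actually informative?" still needs a person.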

user-based testing

This is what people think of when they think of usability testing

validation testing

  • testing against benchmarks from other interfaces
  • just before release
  • when you think it’s ready

stages of testing

Here are two views of the testing process

First view

Rubin and Chisnell (2008) proposed eight steps in their Handbook of Usability Testing

Eight stages

  • develop test plan
  • set up test environment
  • find and select participants
  • prepare test materials
  • conduct test sessions
  • debrief participants
  • analyze data and observations
  • report findings and recommendations

Second view

Lazar (2006) also proposed eight steps in the book Web Usability

Lazar’s stages

  • select representative users
  • select setting
  • decide tasks users should perform
  • decide what data to collect
  • before test session
    • informed consent
  • during test session
  • after test session
    • debriefing
    • summarize results and suggest improvements

how many users?


five

(Virzi, 1992)

no, seven or fifteen

(Nielsen and Landauer, 1993)

three point two!

(Nielsen and Landauer, 1993)

no, ten

(Hwang and Salvendy, 2010)

no, twelve

(Caine, 2016)


no, three

(Krug, no date)

all numbers are correct

(Lewis, 2006)

… for the following reasons

Choosing the number

  • how accurate do you need to be?
  • what are your problem discovery goals?
  • how many participants are available?

numbers don’t matter

(Lindgaard and Chattratichart, 2007)

Instead, the number of flaws found will depend on the design and scope of the tasks!

testing groups

Do you need one representative from each group? or five? or ten?

testing pairs

Children alone are less likely to criticize than children in pairs

More on numbers

  • keep goals in mind
  • how many can we afford?
  • how many can we get?
  • how many do we have time for?

The final word on numbers: five
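Most of the dueling numbers above trace back to the problem-discovery model popularized by Nielsen and Landauer (1993): the expected proportion of problems found by n users is 1 − (1 − λ)^n, where λ is the probability that a single user reveals a given problem. A sketch using their often-quoted average λ ≈ 0.31, with which five users find roughly 85% of problems:

```python
def problems_found(n_users, lam=0.31):
    """Expected share of usability problems revealed by n users,
    assuming each user independently hits each problem with probability lam."""
    return 1 - (1 - lam) ** n_users

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users: {problems_found(n):.0%}")
```

The disagreement in the literature is largely a disagreement about λ: a harder-to-detect problem set (smaller λ) pushes the "right" number of users well past five.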



testing settings

  • traditional setting: two-room setup with one-way mirror and recording devices
  • usability lab
  • usability van
  • user’s workplace or home

what do you need?

Do you need recordings?

remote testing

Why not test users you cannot meet?

e.g., Amazon’s Mechanical Turk (MTurk), ClickWorker, crowdsourcing platforms for micro-tasks

pros of mechanical turk

  • number of participants
  • scheduling flexibility
  • good for quantitative metrics
  • clickstream data easy to get

cons of mechanical turk

  • can’t pick up interpersonal cues
  • can’t recover from errors
  • can’t ask questions contingent on what happens
  • may miss important context
  • 33–46% of turkers use LLMs to answer

user nights at yahoo!

  • Yahoo ran a user night every Tuesday from 7 to 9
  • testing preceded by food for those who show up on time
  • orientation time, followed by testing whatever is ready for testing
  • after users leave, Yahoos remain and discuss

task lists

You usually need a list of tasks for the participant to perform

qualities of the task list

  • unambiguous
  • keeping participant goal-directed
  • key features exercised
  • shouldn’t require user to ask additional questions


incentives

One study found participants had to be paid more for tasks involving their own financial data


interventions

Plan for whether interventions are to be allowed and, if so, how they’ll work. Should they only be for insurmountable obstacles?


measurements

three main kinds:

  • task performance
  • time performance
  • user satisfaction


example metrics

  • number of errors
  • time to recover from errors
  • time spent on help
  • number of visits to search
key logging

eye tracking

think aloud

Or a reflection session to get verbal feedback in case think aloud would interfere with task performance

the test session itself


reminders

People don’t show up unless you send reminders

extra time

People show up late or get interrupted, so you have to allow extra time

follow protocol

You may have an IRB (Institutional Review Board) protocol, but you should follow some protocol in any case

testing the artifact, not the user

Remind the participant of this; otherwise it will appear that they, rather than the artifact, are being tested

flexible when compared with experiments

The protocol for the session should be followed but can be looser than for an experiment


debriefing

  • find out what they know
  • debriefing may be more for the participant than for the researcher


analyzing the data

  • descriptive statistics are always possible
  • pictures are always possible

Usability testing software


Why would you even care about Morae? It’s obsolete and discontinued. But some people still use it, and its workflow is worth learning even if you never touch the software. So let’s explore it from the workflow standpoint.

suggested morae workflow

  1. plan
  2. setup
  3. record, view, log
  4. create project
  5. view, analyze
  6. present

morae includes three programs

  • recorder records a session
  • observer lets you watch a session in progress
  • manager does everything else

but first …

Think about the kind of session you’ll run

  • paper prototype
  • moderated computer
  • autopilot computer

paper prototype

If you’re testing a paper prototype, you probably need both cameras. You can set up Morae in paper prototype mode so that it does not record the PC screen.

moderated computer prototype

You can set up Morae in moderated usability test mode in a lab and run the test solo or with observers in the control room.

This is the most likely setup.


autopilot

Don’t discount the value of autopilot. It is meant for unmoderated sessions, but you can use it to free yourself from juggling too many things during the session. The only caveat is that you have to set it up properly or you will wind up abandoning it and returning to moderated mode.

process (one of three)

  • Visit a lab with Morae (before the day of your tests) and set up a configuration in recorder. Save that configuration to a USB drive, Google Drive, or somewhere. Leave the lab.
  • Return to the lab and verify that your config works. Pretend you’re doing the test. Actually record yourself playing both parts.
  • Save your recording.

process (two of three)

  • Go to any computer with Morae and verify that the copy of Morae can open your recording (rdg file)
  • Obtain a web cam
  • Return to the user testing lab and start your recording again using your config file.
  • Verify that you can plug in the web cam and select it as one of the video sources.
  • Use a touch screen in the user testing lab. Verify that you can use it and it works.

process (three of three)

  • If you have observers, connect them to the recording.
  • Log some tasks and markers.
  • Close down and save your recording.
  • Again, verify that your recording works.
  • Now you’re probably ready for your test.

during the test

  • Make sure you record!
  • Try to relax! Think of specific things you can do to reduce stress, both for you and your participant.
  • Take good care of your participant!
  • Be sure to save your recording afterward.

after the test

Now it is time to get some evidence from your recordings and prepare it for presentation.

You’ll want to find video clips that illustrate important points.

You’ll want to log markers so you can get a graph that supports your findings.


analysis software

  • qualitative data analysis software (qdas)
    • I’ve used MAXQDA, and others are NVivo, ATLAS.ti, and Dedoose
    • Generally very expensive, hard to learn, and not too popular
  • usertesting.com
    • Previously free for students, so we used it in class
    • Probably much more popular than the qdas alternatives


report writing Mensh and Kording (2017)

caveat: This is about academic writing and industrial writing is usually less strict

four principles

  • focus on a central contribution, communicated in the title
  • write for people who don’t know your work
  • stick to a context-content-conclusion scheme
  • avoid zig-zagging; use parallelism

components of a report

  • tell a complete story in the abstract
  • communicate why it matters in the intro
  • deliver results as a sequence of statements, supported by figures, that connect logically to support the central contribution
  • discuss how the gap was filled


writing process

  • allocate time where it matters: title, abstract, figures, outlining
  • get feedback to reduce, reuse, recycle

usual report format

  • abstract
  • introduction
  • background
  • method
  • results
  • conclusion

using published information

  • many public studies of technology use are available
  • you may be able to use them
  • you should develop a feel for where you can get good numbers
  • e.g., Gartner Hype Curve, Pew Internet Studies, Mary Meeker Internet report

validity Lazar, Feng, and Hochheiser (2017)

Did we measure what we said we measured?

  • face validity, aka content validity
  • criterion validity
  • construct validity

face validity

  • does the content seem to be valid on its face?
  • face validity is highly subjective
  • weakest form of validity because susceptible to bias
  • classic example is IQ tests using cultural artifacts

criterion validity

  • how accurately does a new measure predict a previously validated criterion?
  • e.g., NASA Task Load Index is a previously validated set of measures
  • measure the correlation between new and previous criterion
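Criterion validity in this sense is usually reported as a correlation coefficient. A minimal sketch, with made-up scores, correlating a hypothetical new workload measure against validated NASA-TLX scores:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between a new measure and a validated criterion."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical paired scores for the same participants
new_measure = [12, 18, 25, 31, 40]   # the instrument being validated
tlx_scores  = [20, 28, 35, 45, 55]   # previously validated criterion

r = pearson_r(new_measure, tlx_scores)
print(f"r = {r:.3f}")  # an r near 1 supports criterion validity
```

In practice you would also report the sample size and a significance test; a high r on five participants is only suggestive.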

construct validity

  • aka factorial validity
  • what if no previous criterion is suitable?
  • then ask what constructs account for variability
  • conduct a factor analysis to discover factors

establishing validity

  • always a multifaceted approach
  • may construct a database
  • at least need well-documented data
  • may triangulate among various data sources
  • try to account for all variability, not just that which is agreeable to your theory

using statistics

  • often problematic in hci work, where 15 subjects are typically considered adequate
  • hard to generate statistics from 15 data points
  • on the other hand, easy to use statistics on live websites
  • a/b testing is a well-studied area
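As a sketch of why live sites make statistics easy, here is a two-proportion z-test comparing task success rates between two designs; the counts are hypothetical, and a real A/B test would also fix the sample size in advance:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)  # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical live-site counts: design A 180/300 successes, design B 210/300
z, p = two_proportion_z(180, 300, 210, 300)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With hundreds of visitors per arm, even a ten-point difference in success rate is detectable; with the 15 participants typical of lab studies, it usually is not.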

using grounded theory

  • usually used in academic study of hci
  • theory emerges from the data

using content analysis

after evaluation

what’s next?

design thinking process

  1. empathize
  2. define
  3. ideate
  4. prototype
  5. test

… but should it really end there?

another round

more design!

iterative development

product of evaluation

more design ideas

should you do whatever participants asked?


no

At least, not necessarily

what can you do?

Understand participants better first!

… before you try to make design choices

first thoughts may only scratch the surface

ask why? three times

why keep asking why?

to get beneath the surface

evaluation is not really finished

until you understand the usability problems

when do you understand the problems?

when you’re ready to make design choices


references

Barnum, Carol M. 2010. Usability Testing Essentials: Ready, Set…Test! 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Krug, Steve. 2005. Don’t Make Me Think: A Common Sense Approach to the Web (2nd Edition). Thousand Oaks, CA: New Riders Publishing.
Lazar, Jonathan. 2006. Web Usability: A User-Centered Design Approach. Boston MA: Pearson Addison Wesley.
Lazar, Jonathan, Jinjuan Heidi Feng, and Harry Hochheiser. 2017. Research Methods in Human-Computer Interaction, 2nd Ed. West Sussex, UK: Wiley.
Mensh, Brett, and Konrad Kording. 2017. “Ten Simple Rules for Structuring Papers.” PLOS Computational Biology 13 (9): 1–9. https://doi.org/10.1371/journal.pcbi.1005619.
Rubin, Jeffrey, and Dana Chisnell. 2008. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. Wiley.
Shneiderman, Ben. 2017. “Revisiting the Astonishing Growth of Human–Computer Interaction Research.” Computer, no. 10: 8–11.
Wixon, Dennis. 2003. “Evaluating Usability Methods: Why the Current Literature Fails the Practitioner.” Interactions 10 (4): 28–34. https://doi.org/10.1145/838830.838870.



This slideshow was produced using Quarto

Fonts are League Gothic and Lato