Following are my course notes from a 2018 Coursera course called Designing, Running, and Analyzing Experiments, taught by Jacob Wobbrock, a prominent HCI scholar. Note that Wobbrock is in no way responsible for any errors or deviations from his presentation.
These course notes are an example of reproducible research and literate programming. They are reproducible research because the same file that generated this html document also ran all the experiments. They are literate programming in the sense that the code, pictures, equations, and narrative are all encapsulated in one file. The source file for this project, along with the data files, is enough for you to reproduce both the results and the documentation. All the source material is available in my github account, although in an obscure location therein.
options(readr.show_col_types=FALSE) # suppress column type messages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
How many prefer this over that? (Tests of proportions)
How many prefer website A over B? (One-sample test of proportions in two categories)
Sixty subjects were asked whether they preferred website A or B. Their answer and a subject ID were recorded. Read the data and describe it.
prefsAB <- read_csv("prefsAB.csv")
tail(prefsAB) # displays the last few rows of the data frame
# A tibble: 6 × 2
Subject Pref
<dbl> <chr>
1 55 A
2 56 B
3 57 A
4 58 B
5 59 B
6 60 A
prefsAB$Subject <- factor(prefsAB$Subject) # convert to nominal factor
prefsAB$Pref <- factor(prefsAB$Pref) # convert to nominal factor
summary(prefsAB)
Is the difference between preferences significant? A default \(\chi^2\) test examines the proportions in two bins, expecting them to be equally apportioned.
To do the \(\chi^2\) test, first crosstabulate the data with xtabs().
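The crosstabulation and test that produced the output below were presumably of this form (a sketch: xtabs() and chisq.test() applied to the prefsAB data read above):

```r
# crosstabulate preferences into two bins, then test the proportions
prfs <- xtabs( ~ Pref, data=prefsAB)
prfs # show counts
chisq.test(prfs)
```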
Chi-squared test for given probabilities
data: prfs
X-squared = 17.067, df = 1, p-value = 3.609e-05
We don’t really need an exact binomial test yet, because the \(\chi^2\) test told us enough: the difference is not likely due to chance. But that is only because there are just two choices. If there were more than two and the \(\chi^2\) test turned up a significant difference, we’d need a post hoc binomial test for each category. The binomial test below just foreshadows what we’ll need when we face three categories.
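The output below, labeled data: prfs, suggests the exact binomial test was run directly on the crosstabulated counts (a sketch; binom.test() accepts a two-category table of successes and failures):

```r
# exact binomial test on the two preference counts
binom.test(prfs)
```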
Exact binomial test
data: prfs
number of successes = 14, number of trials = 60, p-value = 4.224e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.1338373 0.3603828
sample estimates:
probability of success
0.2333333
How many prefer website A, B, or C? (One-sample test of proportions in three categories)
First, read in and describe the data. Convert Subject to a factor: R reads numerical data as numeric, but Subject is an identifier, not a quantity, so we want it treated as nominal. (read_csv() imports character data, like Pref, as character, so we convert that column to a factor as well.)
prefsABC <- read_csv("prefsABC.csv")
head(prefsABC) # displays the first few rows of the data frame
# A tibble: 6 × 2
Subject Pref
<dbl> <chr>
1 1 C
2 2 C
3 3 B
4 4 C
5 5 C
6 6 B
You can think of the three websites as representing three bins and the preferences as filling up those bins. Either each bin gets one third of the preferences or there is a discrepancy. The Pearson \(\chi^2\) test functions as an omnibus test to tell whether there is any discrepancy in the proportions of the three bins.
prfs <- xtabs( ~ Pref, data=prefsABC)
prfs # show counts
Pref
A B C
8 21 31
chisq.test(prfs)
Chi-squared test for given probabilities
data: prfs
X-squared = 13.3, df = 2, p-value = 0.001294
A multinomial test can test against distributions other than a uniform one across bins. Here, though, we test against the uniform case: one third in each bin.
if (!require(XNomial)) {
  install.packages("XNomial", dependencies=TRUE)
  library(XNomial)
}
Loading required package: XNomial
xmulti(prfs, c(1/3, 1/3, 1/3), statName="Prob")
P value (Prob) = 0.0008024
Now we don’t know which pair(s) differed so it makes sense to conduct post hoc binomial tests with correction for multiple comparisons. The correction, made by p.adjust(), is because the more hypotheses we check, the higher the probability of a Type I error, a false positive. That is, the more hypotheses we test, the higher the probability that one will appear true by chance. Wikipedia has more detail in its “Multiple Comparisons Problem” article.
Here, we test separately for whether each one has a third of the preferences.
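The post hoc procedure was presumably along these lines (a sketch: the per-category binomial tests against a null proportion of 1/3, and Holm's method for p.adjust(), are assumptions consistent with the surrounding text):

```r
# post hoc: one binomial test per category against a null proportion of 1/3
aa <- binom.test(sum(prefsABC$Pref == "A"), nrow(prefsABC), p=1/3)
bb <- binom.test(sum(prefsABC$Pref == "B"), nrow(prefsABC), p=1/3)
cc <- binom.test(sum(prefsABC$Pref == "C"), nrow(prefsABC), p=1/3)
# adjust the three p-values for multiple comparisons
p.adjust(c(aa$p.value, bb$p.value, cc$p.value), method="holm")
```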
How many males vs females prefer website A over B? (Two-sample test of proportions in two categories)
Revisit the two-category preference data, but now with each subject’s sex (M/F) recorded as well. Plotting is slightly complicated by the fact that we want to represent two groups. There are many ways to do this, including stacked bar charts and side-by-side bars; the method chosen here uses facet_wrap(~Sex) to create two separate plots, one for each level of Sex.
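A sketch of the read-in and plot, assuming a data file named prefsABsex.csv with Subject, Pref, and Sex columns (the filename and plot styling are assumptions):

```r
prefsABsex <- read_csv("prefsABsex.csv") # filename is an assumption
prefsABsex$Subject <- factor(prefsABsex$Subject) # convert to nominal factor
prefsABsex$Pref <- factor(prefsABsex$Pref)
prefsABsex$Sex <- factor(prefsABsex$Sex)
# one bar chart of preference counts per level of Sex
ggplot(prefsABsex, aes(Pref)) +
  geom_bar() +
  facet_wrap(~Sex)
```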
Although we can guess by looking at the above plot that the difference for females is significant and the difference for males is not, a Pearson chi-square test provides some statistical evidence for this hunch.
prfs <- xtabs( ~ Pref + Sex, data=prefsABsex) # the '+' sign indicates two vars
prfs
Sex
Pref F M
A 2 12
B 29 17
chisq.test(prfs)
Pearson's Chi-squared test with Yates' continuity correction
data: prfs
X-squared = 8.3588, df = 1, p-value = 0.003838
What if the data are lopsided? (G-test, alternative to chi-square)
Wikipedia tells us that the \(G\)-test dominates the \(\chi^2\) test when \(O_i>2E_i\) in the formula
\[\chi^2=\sum_i \frac{(O_i-E_i)^2}{E_i}\]
where \(O_i\) is the observed and \(E_i\) is the expected count in the \(i\)th bin. This situation can occur with small sample sizes; for large samples, the two tests reach the same conclusion. In our case, we’re on the borderline for this rule in the bin where 29 females prefer B: all 31 females would have to prefer B for the rule to dictate a switch to the \(G\)-test.
if (!require(RVAideMemoire)) {
  install.packages("RVAideMemoire", dependencies=TRUE)
  library(RVAideMemoire)
}
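RVAideMemoire supplies G.test(), which applies to the same crosstabulation as chisq.test(); the Fisher's exact output below, labeled data: prfs, was presumably produced by fisher.test() on that table (a sketch):

```r
# G-test on the 2x2 crosstabulation (G.test() is from RVAideMemoire)
G.test(prfs)
# Fisher's exact test on the same table
fisher.test(prfs)
```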
Fisher's Exact Test for Count Data
data: prfs
p-value = 0.001877
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.009898352 0.537050159
sample estimates:
odds ratio
0.1015763
How many males vs females prefer website A, B, or C? (Two-sample tests of proportions in three categories)
Revisit our data file with 3 response categories, but now with sex (M/F).
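A sketch of the read-in and omnibus test, assuming a data file named prefsABCsex.csv with Subject, Pref, and Sex columns (the filename is an assumption):

```r
prefsABCsex <- read_csv("prefsABCsex.csv") # filename is an assumption
prefsABCsex$Subject <- factor(prefsABCsex$Subject) # convert to nominal factor
prefsABCsex$Pref <- factor(prefsABCsex$Pref)
prefsABCsex$Sex <- factor(prefsABCsex$Sex)
# crosstabulate Pref by Sex and run the omnibus chi-squared test
prfs <- xtabs( ~ Pref + Sex, data=prefsABCsex)
prfs
chisq.test(prfs)
```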
Site Pages.mean Pages.sd
1 A 3.404082 1.038197
2 B 4.490196 2.127552
#. graph histograms and a boxplot
ggplot(pgviews, aes(Pages, fill=Site, color=Site)) +
  geom_bar(alpha=0.5, position="identity", color="white") +
  scale_color_grey() +
  scale_fill_grey() +
  theme_tufte(base_size=7)