7  More about Inference

7.1 Recap Week 06

  • ggplot2
  • Foundations for Inference
    • Point estimates
    • Confidence Intervals
    • Hypothesis Tests

7.1.1 A useful ggplot trick

Some of you have struggled to produce boxplots of variables like price and odometer because they have ridiculous outliers. One way to overcome this that we’ve explored is to remove the extreme observations. A possibly better way, though, is illustrated below. Saying outlier.shape=NA in the geom_boxplot() function suppresses the drawing of the outlier points, while limiting the y-axis with scale_y_continuous() drops the extreme rows before the dimensions of the plot are calculated (which is what the warnings below report). The result is a much more readable boxplot containing most of the cars.

load(paste0(Sys.getenv("STATS_DATA_DIR"),"/vehicles.Rdata"))
pacman::p_load(tidyverse)
df |> ggplot(aes(condition,odometer))+geom_boxplot(outlier.shape=NA)+  scale_y_continuous(limits = quantile(df$odometer, c(0.1, 0.8),na.rm=TRUE))
Warning: Removed 131132 rows containing non-finite values (`stat_boxplot()`).

df |> ggplot(aes(condition,price))+geom_boxplot(outlier.shape=NA)+  scale_y_continuous(limits = quantile(df$price, c(0.1, 0.8),na.rm=TRUE))
Warning: Removed 127440 rows containing non-finite values (`stat_boxplot()`).

Note that, in the above example, I’ve stored df in vehicles.Rdata using the save() function. Before saving it I converted condition to a factor and both price and odometer to numeric.

7.1.2 Preparing data

Note that you can use the above quantile trick in other ways. For example, here I’m filtering out the extreme values of price, without necessarily knowing what they are.

df |>
  filter(price >= quantile(price,.10) & price <= quantile(price,.90)) |>
  count(state) |>
  arrange(-n) |>
  as_tibble() |>
  print(n=12)
# A tibble: 51 × 2
   state     n
   <chr> <int>
 1 ca    39707
 2 fl    23260
 3 tx    18285
 4 ny    15956
 5 oh    15086
 6 mi    14959
 7 pa    11787
 8 nc    11105
 9 or    10875
10 wi    10194
11 tn     9466
12 co     9335
# ℹ 39 more rows

One group tabulated price by paint color. This is not a good idea! There are too many individual prices. Instead, try grouping the prices into ranges with one of the cut_ functions, cut_interval() (shown here) or cut_width(). Then you can tabulate the ranges or save them for use in bar charts or other tables.

table(cut_interval(df$price[df$price<100000],20,dig.lab=8))

       [0,4999.95]   (4999.95,9999.9]  (9999.9,14999.85] (14999.85,19999.8] 
             95312              79652              53811              49107 
(19999.8,24999.75] (24999.75,29999.7] (29999.7,34999.65] (34999.65,39999.6] 
             34812              34946              26552              23510 
(39999.6,44999.55] (44999.55,49999.5] (49999.5,54999.45] (54999.45,59999.4] 
              9494               6643               3773               2948 
(59999.4,64999.35] (64999.35,69999.3] (69999.3,74999.25] (74999.25,79999.2] 
              1535               1617                821                702 
(79999.2,84999.15] (84999.15,89999.1] (89999.1,94999.05]   (94999.05,99999] 
               391                304                114                139 
dfReduced <- df[df$price<100000,]
dfReduced$priceRanges <- cut_interval(dfReduced$price,20,dig.lab=8)
ggplot(dfReduced,aes(priceRanges)) + geom_bar(fill="lightblue") + theme(axis.text.x = element_text(angle=45,vjust=1,hjust=1))

fourColors <- dfReduced$paint_color %in% c("red","white","black","blue")
table(dfReduced$priceRanges[fourColors], dfReduced$paint_color[fourColors])
                    
                     black  blue   red white
  [0,4999.95]         9977  6692  5544 13156
  (4999.95,9999.9]    9764  7002  5592 10805
  (9999.9,14999.85]   8345  3901  3952  9476
  (14999.85,19999.8]  7567  3898  3746 10211
  (19999.8,24999.75]  5906  2480  2455  7760
  (24999.75,29999.7]  5783  2540  3195  8132
  (29999.7,34999.65]  5106  1517  2081  6742
  (34999.65,39999.6]  4934  1751  2111  5657
  (39999.6,44999.55]  1807   554   636  2286
  (44999.55,49999.5]  1214   313   346  1793
  (49999.5,54999.45]   699   149   191   871
  (54999.45,59999.4]   538   128   198   784
  (59999.4,64999.35]   277    54    92   489
  (64999.35,69999.3]   325    91   153   435
  (69999.3,74999.25]   148    19    36   227
  (74999.25,79999.2]   162    34    39   186
  (79999.2,84999.15]   120    23    30    76
  (84999.15,89999.1]    40    14    16    87
  (89999.1,94999.05]    14     3     6    14
  (94999.05,99999]      37    19    10    14

Can we cut year into intervals? Yes, we can cut year, odometer, or price because they are all numeric.

pacman::p_load(tidyverse)
load(paste0(Sys.getenv("STATS_DATA_DIR"),"/vehiclesSmall.Rdata"))
table(cut_interval(dfSmall$year[dfSmall$year>1999],22,dig.lab=6))

[2000,2001] (2001,2002] (2002,2003] (2003,2004] (2004,2005] (2005,2006] 
       1919        1348        1720        2037        2527        2987 
(2006,2007] (2007,2008] (2008,2009] (2009,2010] (2010,2011] (2011,2012] 
       3510        3998        2796        3763        4710        5686 
(2012,2013] (2013,2014] (2014,2015] (2015,2016] (2016,2017] (2017,2018] 
       7123        6950        7436        7156        8595        8452 
(2018,2019] (2019,2020] (2020,2021] (2021,2022] 
       5981        4606         537          27 

Getting commas in long numbers:

pacman::p_load(scales)
dfReduced <- dfSmall[dfSmall$price<100000,]
dfReduced$priceRanges <- cut_interval(dfReduced$price,20,dig.lab=8)
ggplot(dfReduced,aes(priceRanges)) + geom_bar(fill="lightblue") + theme(axis.text.x = element_text(angle=45,vjust=1,hjust=1)) +
  scale_y_continuous(labels=comma)

In a boxplot, Ford and Chevy look the same, but a violin plot reveals a bulge in Chevy that is not matched by Ford.

ggplot(df[df$price<100000 & df$manufacturer %in% c("ford","chevrolet"),],aes(price,manufacturer))+geom_violin()

The only obvious numeric variables are price, odometer, and year. Look at how bunched up the odometer values are in the first plot. You can transform them by applying the log() function to them so that the lower values take up more space. By the way, there are only two-thirds as many chevys as fords in the Tiny data.

ggplot(df[df$price<100000 & df$manufacturer %in% c("ford","chevrolet"),],aes(price,odometer,color=manufacturer))+geom_jitter() + scale_color_manual(values=c('#999999','#E69F00'))
Warning: Removed 1322 rows containing missing values (`geom_point()`).

ggplot(df[df$price<100000 & df$manufacturer %in% c("ford","chevrolet"),],aes(price,log(odometer),color=manufacturer))+geom_jitter() + scale_color_manual(values=c('#999999','#E69F00'))
Warning: Removed 1322 rows containing missing values (`geom_point()`).

Notice that you can choose any hex value for colors manually. You can find the hex value for any color you can name with Google or at sites like https://www.colorhexa.com/.

ggplot(df[df$price<100000 & df$manufacturer %in% c("ford","chevrolet"),],aes(price,year,color=manufacturer))+geom_jitter() + scale_color_manual(values=c('#999999','#E69F00'))

You can see whether the three numeric variables are correlated with a heatmap.

col<-colorRampPalette(c("red","white","blue"))(20)
dfReduced <- df |> drop_na(year,odometer,price)
heatmap(cor(dfReduced[,c("year","odometer","price")]),symm=TRUE,col=col)

The year and odometer columns are strongly inversely correlated. Both year and odometer are somewhat inversely correlated with price.
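
To see the actual correlation coefficients rather than just the heatmap colors, you can print the correlation matrix itself:

cor(dfReduced[,c("year","odometer","price")])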

You can make a kind of heatmap with the ztable package.

pacman::p_load(ztable)
mycolor=gradientColor(low="yellow",mid="orange",high="red",n=20,plot=TRUE)

ztable(table(dfReduced$paint_color,dfReduced$manufacturer)) |> makeHeatmap(mycolor=mycolor) |> print()
  acura alfa-romeo aston-martin audi bmw buick cadillac chevrolet chrysler datsun dodge ferrari fiat ford gmc harley-davidson honda hyundai infiniti jaguar jeep kia land rover lexus lincoln mazda mercedes-benz mercury mini mitsubishi morgan nissan pontiac porsche ram rover saturn subaru tesla toyota volkswagen volvo
black 1001 217 5 1926 3790 672 1644 7151 805 2 2224 9 89 8512 2971 86 2729 1229 1064 518 3239 1216 1 1245 900 814 2997 91 288 411 0 2785 205 250 2342 588 107 806 72 3548 1647 504
blue 307 164 3 718 1427 292 287 3613 506 9 937 0 31 4664 862 5 2096 1161 637 206 1003 492 1 354 345 564 643 128 372 220 0 1170 183 99 921 69 112 1496 81 2290 1040 324
brown 73 9 0 65 157 233 109 1014 94 3 83 0 19 914 441 0 414 155 178 16 249 214 0 139 53 42 154 36 53 128 1 283 32 26 175 18 18 102 2 556 77 49
custom 74 17 0 44 137 128 143 838 130 2 186 2 33 1173 240 4 440 167 74 19 313 146 0 136 61 48 126 40 36 30 0 289 47 12 240 32 21 137 1 698 95 58
green 26 4 2 51 97 57 37 741 57 2 351 0 32 1283 144 0 365 67 30 88 685 224 1 81 39 66 48 57 119 97 0 139 59 8 151 63 46 422 2 816 102 43
grey 367 11 1 560 881 218 217 2418 379 2 906 2 36 3263 664 3 2100 733 278 48 960 487 3 470 94 437 626 53 85 154 0 1643 125 75 827 111 47 690 28 2718 646 186
orange 1 1 0 4 35 2 5 250 4 4 213 0 4 330 12 2 76 63 2 0 195 38 0 4 2 6 4 1 30 190 0 45 22 3 65 3 13 71 0 52 63 3
purple 2 0 0 2 13 18 10 95 9 0 70 0 0 77 13 0 75 11 3 1 24 34 0 1 1 10 8 2 3 3 0 42 7 2 18 0 2 9 0 37 8 1
red 239 94 0 185 375 507 431 5230 336 7 1092 30 71 5134 979 4 1130 796 97 85 1595 660 1 408 252 597 393 97 287 331 0 1340 289 58 1128 71 135 553 30 2822 610 159
silver 912 28 2 950 1497 545 674 4859 788 3 1066 3 24 4895 1067 7 2977 1389 475 191 1884 1081 1 1362 386 460 1558 134 145 314 0 2424 225 125 1141 107 96 1208 41 5079 1025 384
white 1270 149 2 968 2426 1008 1257 10941 704 4 1965 8 133 18448 3756 0 2293 1513 586 322 2441 1145 1 1517 731 618 2223 130 322 570 1 3135 233 220 4742 402 83 1104 394 5224 1417 454
yellow 2 0 0 10 20 6 20 514 22 0 55 1 6 372 61 2 40 10 16 2 214 16 0 14 8 6 29 6 31 8 0 34 33 15 27 6 9 11 1 88 68 6

The above table is obviously too wide for a report. I suggest you make something like it while you’re working, then use it to figure out which columns to include, and then do something like the following. Note that the chunk option #| output: asis must be the first line of the following chunk; it is not displayed in the html file, so you must look at the qmd file to see it.

dfFurtherReduced <- dfReduced[dfReduced$manufacturer %in% c("bmw","chevrolet","ford","honda","nissan","ram","toyota"),]
ztable(table(dfFurtherReduced$paint_color,dfFurtherReduced$manufacturer)) |> makeHeatmap(mycolor=mycolor) |> print()
         bmw chevrolet  ford honda nissan  ram toyota
black   3790      7151  8512  2729   2785 2342   3548
blue    1427      3613  4664  2096   1170  921   2290
brown    157      1014   914   414    283  175    556
custom   137       838  1173   440    289  240    698
green     97       741  1283   365    139  151    816
grey     881      2418  3263  2100   1643  827   2718
orange    35       250   330    76     45   65     52
purple    13        95    77    75     42   18     37
red      375      5230  5134  1130   1340 1128   2822
silver  1497      4859  4895  2977   2424 1141   5079
white   2426     10941 18448  2293   3135 4742   5224
yellow    20       514   372    40     34   27     88

7.2 Textbook section 6.1 Inference for a single proportion

7.2.1 Is the sample proportion nearly normal?

Note: The book uses \(p\) in two ways: as a \(p\)-value in a hypothesis test, and as \(p\), a population proportion. Related to the population proportion is the sample proportion, \(\hat{p}\), pronounced p-hat. Too many \(p\)s!

The sampling distribution for \(\hat{p}\) based on a sample of size \(n\) from a population with a true proportion \(p\) is nearly normal when:

  1. The sample’s observations are independent, e.g., are from a simple random sample.
  2. We expect to see at least 10 successes and 10 failures in the sample, i.e., \(np \geqslant 10\) and \(n(1-p) \geqslant 10\). This is called the success-failure condition.

When these conditions are met, the sampling distribution of \(\hat{p}\) is nearly normal with mean \(p\) and standard error \(\text{SE}=\sqrt{p(1-p)/n}\).
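
For example, here is a quick check of these conditions and the SE in R, using hypothetical values of \(p\) and \(n\):

p <- 0.3; n <- 120                 # hypothetical proportion and sample size
c(n*p, n*(1-p))                    # success-failure: both should be at least 10
sqrt(p*(1-p)/n)                    # standard error of p-hat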

7.2.2 Confidence interval for a proportion

A confidence interval provides a range of plausible values for the parameter \(p\), and when \(\hat{p}\) can be modeled using a normal distribution, the confidence interval for \(p\) takes the form

\[ \hat{p} \pm z^{*} \times \text{SE} \]

where \(z^{*}\) is the critical value corresponding to the selected confidence level, e.g., 1.96 for a 95 percent confidence level.

7.2.3 Prepare, Check, Calculate, Conclude Cycle

The OpenIntro Stats book recommends a four-step cycle for both confidence intervals and hypothesis tests. It differs a bit from the seven-step hypothesis testing method given in Week 06, but achieves the same result.

  • Prepare. Identify \(\hat{p}\) and \(n\), and determine what confidence level you wish to use.
  • Check. Verify the conditions to ensure \(\hat{p}\) is nearly normal. For one-proportion confidence intervals, use \(\hat{p}\) in place of \(p\) to check the success-failure condition.
  • Calculate. If the conditions hold, compute SE using \(\hat{p}\), find \(z^{*}\), and construct the interval.
  • Conclude. Interpret the confidence interval in the context of the problem.
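
Here is a minimal sketch of the Check and Calculate steps in R, with hypothetical values of \(\hat{p}\) and \(n\):

p_hat <- 0.61; n <- 825            # hypothetical sample proportion and size
c(n*p_hat, n*(1-p_hat))            # Check: success-failure using p-hat
se <- sqrt(p_hat*(1-p_hat)/n)      # Calculate: SE using p-hat
p_hat + c(-1,1)*qnorm(0.975)*se    # 95 percent confidence interval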

7.2.4 Same cycle for hypothesis testing for a proportion

  • Prepare. Identify the parameter of interest, list hypotheses, identify the significance level, and identify \(\hat{p}\) and \(n\).
  • Check. Verify conditions to ensure \(\hat{p}\) is nearly normal under \(H_0\). For one-proportion hypothesis tests, use the null value \(p_0\) in place of \(\hat{p}\) to check the success-failure condition.
  • Calculate. If the conditions hold, compute the standard error, again using \(p_0\), compute the \(Z\)-score, and identify the \(p\)-value.
  • Conclude. Evaluate the hypothesis test by comparing the \(p\)-value to \(\alpha\), and provide a conclusion in the context of the problem.
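
Again, a minimal sketch of the Calculate step in R, with hypothetical numbers:

p0 <- 0.5; p_hat <- 0.57; n <- 400 # hypothetical null value, p-hat, and n
se <- sqrt(p0*(1-p0)/n)            # SE uses the null value p0
z <- (p_hat - p0)/se               # Z-score
2*pnorm(abs(z), lower.tail=FALSE)  # two-sided p-value; compare to alpha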

7.2.5 Choosing sample size when estimating a proportion

This is probably the most important part of this section for practical purposes. The following expression denotes the margin of error:

\[ z^{*}\sqrt{\frac{p(1-p)}{n}} \]

You have to choose the margin of error you want to report. The book gives an example of 0.04. So you want to find

\[ z^{*}\sqrt{\frac{p(1-p)}{n}} < 0.04 \]

The problem is that you don’t know \(p\). Since the worst-case scenario is \(p=0.5\), you have to use that unless you have some information about \(p\). Recall that \(z^{*}\) represents the \(z\)-score for the desired confidence level, so you have to choose that. The book gives an example where you want a 95 percent confidence level, so you choose 1.96. You could find this out in R by saying

qnorm(0.025,lower.tail=FALSE)
[1] 1.959964

returning the \(z\)-score for the upper tail. The reason for using 0.025 instead of 0.05 is that the 0.05 probability is split evenly between the two tails. The complementary function is pnorm(1.959964,lower.tail=FALSE), which returns 0.025.

Once you have decided on values for \(p\) and \(z^*\), solve the above inequality for \(n\).
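
For the book’s example (margin of error 0.04, 95 percent confidence, worst-case \(p=0.5\)), the solution can be computed in R:

z_star <- qnorm(0.025, lower.tail=FALSE)
p <- 0.5                               # worst-case value of p
me <- 0.04                             # desired margin of error
ceiling(z_star^2 * p * (1-p) / me^2)   # smallest n satisfying the inequality: 601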

7.3 Textbook section 6.2 Difference of two proportions

In this section, we’re just modifying the previous section to account for a difference instead of a single proportion.

The difference \(\hat{p}_1-\hat{p}_2\) can be modeled using the normal distribution when

  • The data are independent within and between the two groups (random samples or randomized experiment)
  • The success-failure condition holds for both groups (at least 10 successes and 10 failures in each sample)

7.3.1 Confidence intervals for \(p_1-p_2\)

\[ \text{point estimate } \pm z^* \times \text{SE} \rightarrow (\hat{p}_1 - \hat{p}_2) \pm z^* \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]
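
A minimal sketch in R, with hypothetical counts for the two samples:

p1 <- 120/500; p2 <- 95/450                  # hypothetical sample proportions
se <- sqrt(p1*(1-p1)/500 + p2*(1-p2)/450)
(p1 - p2) + c(-1,1)*qnorm(0.975)*se          # 95 percent CI for p1 - p2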

7.3.2 Hypothesis testing for \(p_1-p_2\)

When the null hypothesis is that the proportions are equal, use the pooled proportion (\(\hat{p}_\text{pooled}\)) to verify the success-failure condition and estimate the standard error

\[ \hat{p}_\text{pooled} = \frac{\text{number of “successes”}}{\text{number of cases}} = \frac{\hat{p}_1n_1+\hat{p}_2n_2}{n_1+n_2} \]

Here \(\hat{p}_1n_1\) represents the number of successes in sample 1 since

\[ \hat{p}_1 = \frac{\text{number of successes in sample 1}}{n_1} \]

Similarly, \(\hat{p}_2n_2\) represents the number of successes in sample 2.
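
Here is a minimal sketch of the pooled calculation with hypothetical counts; prop.test() runs the equivalent built-in test if you prefer:

x1 <- 120; n1 <- 500; x2 <- 95; n2 <- 450        # hypothetical successes and sizes
p_pooled <- (x1 + x2)/(n1 + n2)
se <- sqrt(p_pooled*(1-p_pooled)*(1/n1 + 1/n2))  # SE under H0: p1 = p2
z <- (x1/n1 - x2/n2)/se
2*pnorm(abs(z), lower.tail=FALSE)                # two-sided p-value
prop.test(c(x1,x2), c(n1,n2), correct=FALSE)     # equivalent test without continuity correction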

7.4 Textbook section 6.3 Testing goodness of fit using \(\chi^2\)

The \(\chi^2\) test, pronounced k\(\overline{\text{i}}\) square, is useful in many circumstances. The textbook treats two of them:

  1. Suppose your sample can be divided into groups, as can the general population. Does your sample represent the general population?
  2. Does your sample resemble a particular distribution, such as the normal distribution?

For the first circumstance, we could divide a sample of people into races or genders and examine all the groups at once for resemblance to the general population, rather than comparing them in pairs. The \(\chi^2\) statistic permits such an all-at-once comparison.

The \(\chi^2\) statistic is given by the following formula for \(g\) groups.

\[ \chi^2 = \frac{(\text{observed count}_1-\text{null count}_1)^2}{\text{null count}_1} + \cdots + \frac{(\text{observed count}_g-\text{null count}_g)^2}{\text{null count}_g} \]

where the expression null count refers to the expected number of objects in the group. You have to be careful about how you determine the null count. For instance, the textbook gives an example of races of jurors. In such a case, the null counts should come from the population who can be selected as jurors. This might be a matter of some dispute since jurors are usually recruited through voting records and these records may not reflect the correct proportions. Statisticians tend to like things they can count, and some people are harder (more expensive) to count than others, particularly people in marginalized populations.
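
You can compute the statistic by hand in R using the observed and null counts from the book’s juror example (the same numbers appear in the chisq.test() call below):

observed <- c(205,26,25,19)
nullCounts <- c(198,19.25,33,24.75)
sum((observed - nullCounts)^2 / nullCounts)  # 5.8896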

7.4.1 The \(\chi^2\) distribution

The \(\chi^2\) distribution is sometimes used to characterize data sets and statistics that are always positive and typically right skewed. Recall that a normal distribution has two parameters, mean and standard deviation, that describe its exact characteristics. The \(\chi^2\) distribution has just one parameter, called degrees of freedom (df), which influences the shape, center, and spread of the distribution. Here is a picture of the \(\chi^2\) distribution for several values of df (1–8).
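
A picture along these lines can be reproduced with dchisq(), the \(\chi^2\) density function; here is a minimal sketch:

x <- seq(0.01, 15, length.out=400)
dens <- do.call(rbind, lapply(1:8, function(k)
  data.frame(x=x, density=dchisq(x, df=k), df=factor(k))))
ggplot(dens, aes(x, density, color=df)) + geom_line()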

In the jurors example, we can calculate the appropriate \(p\)-value in R using the \(\chi^2\) statistic calculated from the sample, 5.89, and the parameter \(k-1\), the number of groups minus one (here \(4-1=3\)):

pchisq(5.89,3,lower.tail=FALSE)
[1] 0.1170863

This is a relatively large \(p\)-value given our earlier choices of cutoffs of 0.1, 0.05, and 0.01.

7.4.2 \(\chi^2\) test

The \(\chi^2\) test can be conducted in R for the juror example given in the book as follows.

o <- c(205,26,25,19)
e <- c(198,19.25,33,24.75)/sum(o)
chisq.test(o,p=e)

    Chi-squared test for given probabilities

data:  o
X-squared = 5.8896, df = 3, p-value = 0.1171

Note that I had to make an adjustment in R. The p argument of chisq.test() must be a vector of probabilities summing to 1, but the table in the book gives expected counts, not probabilities. So I divided each expected count by sum(o), the total number of observations.

7.5 Textbook section 6.4 Testing independence in 2-way tables

Suppose you have a two-way table. Datacamp gives an example with gender and sport as the two ways. The following table lists the number of males and females who like each of three sports: archery, boxing, and cycling. The \(\chi^2\) test suggests that sport preference and gender are not independent, meaning that the preferences may differ by gender.

female <- c(35,15,50)
male <- c(10,30,60)
df <- cbind(male,female)
rownames(df) <- c("archery","boxing","cycling")
df
        male female
archery   10     35
boxing    30     15
cycling   60     50
chisq.test(df)

    Pearson's Chi-squared test

data:  df
X-squared = 19.798, df = 2, p-value = 5.023e-05
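
The object returned by chisq.test() also stores the expected counts under independence, which you can inspect:

chisq.test(df)$expected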