Some of you have struggled to produce boxplots of variables like price and odometer because they have ridiculous outliers. One way to overcome this that we’ve explored is to remove the extreme observations. A possibly better way is illustrated below. Setting outlier.shape=NA in the geom_boxplot() function hides the outlier points, and zooming the axis with coord_cartesian() keeps the extremes from dictating the scale of the plot. The result is a much more readable boxplot that still contains most of the cars.
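A minimal sketch of the idea, assuming df holds the vehicles data (the grouping variable and the quantile cutoffs are illustrative):

library(ggplot2)
ggplot(df, aes(x = condition, y = price)) +
  geom_boxplot(outlier.shape = NA) +   # hide the outlier points
  coord_cartesian(ylim = quantile(df$price, c(0.01, 0.99), na.rm = TRUE))  # zoom without dropping rows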
Note that, in the above example, I’ve stored df in vehicles.Rdata using the save() function. Before saving it I converted condition to a factor and both price and odometer to numeric.
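A sketch of that preparation, assuming dplyr; the column names come from the text, everything else is illustrative:

library(dplyr)
df <- df |>
  mutate(condition = factor(condition),
         price     = as.numeric(price),
         odometer  = as.numeric(odometer))
save(df, file = "vehicles.Rdata")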
7.1.2 Preparing data
Note that you can use the above quantile trick in other ways. For example, here I’m filtering out the extreme values of price, without necessarily knowing what they are.
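A sketch of that filter, assuming cutoffs at the 1st and 99th percentiles (the exact probabilities are a judgment call):

library(dplyr)
df_trimmed <- df |>
  filter(price > quantile(price, 0.01, na.rm = TRUE),
         price < quantile(price, 0.99, na.rm = TRUE))

The state counts shown below were presumably produced by something like the following:

df |> count(state, sort = TRUE)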
# A tibble: 51 × 2
state n
<chr> <int>
1 ca 39707
2 fl 23260
3 tx 18285
4 ny 15956
5 oh 15086
6 mi 14959
7 pa 11787
8 nc 11105
9 or 10875
10 wi 10194
11 tn 9466
12 co 9335
# ℹ 39 more rows
One group tabulated price by paint color. This is not a good idea! There are too many individual prices. Instead, you can try using ranges with one of the cut_ functions, cut_interval (shown here) or cut_width. Then you can make tables, or save the ranges to use in bar charts or other displays.
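A sketch with cut_interval(), which ggplot2 provides (the number of bins is illustrative):

library(dplyr)
library(ggplot2)
df |>
  mutate(price_range = cut_interval(price, n = 10)) |>   # 10 equal-width ranges
  count(price_range, paint_color)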
The only obvious numeric variables are price, odometer, and year. Look at how bunched up the odometer values are in the first plot. You can spread them out by plotting log(odometer) instead, so that the lower values take up more space. By the way, there are only two thirds as many chevys as fords in the Tiny data.
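A sketch of the comparison, assuming ggplot2 and a strictly positive odometer column (zeros would need handling before logging):

library(ggplot2)
ggplot(df, aes(x = odometer)) + geom_histogram()        # values bunched near zero
ggplot(df, aes(x = log(odometer))) + geom_histogram()   # lower values spread out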
Notice that you can choose any hex value for colors manually. You can check the hex value for any color you can name with Google or at sites like https://www.colorhexa.com/.
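For instance, a sketch of supplying hex values manually in ggplot2 (the colors are arbitrary, and I’m assuming the fill variable has three levels):

library(ggplot2)
ggplot(df, aes(x = condition, fill = condition)) +
  geom_bar() +
  scale_fill_manual(values = c("#1b9e77", "#d95f02", "#7570b3"))  # one hex per level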
The above table is obviously too wide for a report. I suggest you make something like it while you’re working, use it to figure out which columns to include, and then do something like the following. Note that the expression #| output: asis must be the first line of the following chunk; it is not displayed in the HTML file, so you must look at the qmd file to see it.
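A sketch of such a chunk, assuming knitr::kable() renders the narrower table (the column choice is illustrative):

#| output: asis
library(dplyr)   # if not already loaded
df |>
  select(price, odometer, year) |>
  head(10) |>
  knitr::kable()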
7.2 Textbook section 6.1 Inference for a single proportion
7.2.1 Is the sample proportion nearly normal?
Note: The book uses \(p\) in two ways: as a \(p\)-value in a hypothesis test, and as \(p\), a population proportion. Related to the population proportion is the sample proportion, \(\hat{p}\), pronounced p-hat. Too many \(p\)s!
The sampling distribution for \(\hat{p}\) based on a sample of size \(n\) from a population with a true proportion \(p\) is nearly normal when:
The sample’s observations are independent, e.g., are from a simple random sample.
We expect to see at least 10 successes and 10 failures in the sample, i.e., \(np \geqslant 10\) and \(n(1-p) \geqslant 10\). This is called the success-failure condition.
When these conditions are met, then the sampling distribution of \(\hat{p}\) is nearly normal with mean \(p\) and standard error \(\text{SE}=\sqrt{p(1-p)/n}\).
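A quick numeric check of these conditions and the SE, with hypothetical n and p:

n <- 500
p <- 0.1                                       # hypothetical population proportion
c(successes = n * p, failures = n * (1 - p))   # both should be at least 10
sqrt(p * (1 - p) / n)                          # the standard error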
7.2.2 Confidence interval for a proportion
A confidence interval provides a range of plausible values for the parameter \(p\), and when \(\hat{p}\) can be modeled using a normal distribution, the confidence interval for \(p\) takes the form
\[ \hat{p} \pm z^{*} \times \text{SE} \]
where \(z^{*}\) is the critical value from the standard normal distribution for the selected confidence level, e.g., 1.96 for a 95 percent confidence interval.
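A sketch of the whole calculation in R, with a hypothetical sample proportion and sample size:

p_hat <- 0.61                      # hypothetical sample proportion
n <- 500
se <- sqrt(p_hat * (1 - p_hat) / n)
z_star <- qnorm(0.975)             # 1.96 for 95 percent confidence
p_hat + c(-1, 1) * z_star * se     # lower and upper bounds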
7.2.3 Prepare, Check, Calculate, Conclude Cycle
The OpenIntro Stats book recommends a four step cycle for both confidence intervals and hypothesis tests. It differs a bit from the seven step hypothesis testing method given in Week 06, but achieves the same result.
Prepare. Identify \(\hat{p}\) and \(n\), and determine what confidence level you wish to use.
Check. Verify the conditions to ensure \(\hat{p}\) is nearly normal. For one-proportion confidence intervals, use \(\hat{p}\) in place of \(p\) to check the success-failure condition.
Calculate. If the conditions hold, compute SE using \(\hat{p}\), find \(z^{*}\), and construct the interval.
Conclude. Interpret the confidence interval in the context of the problem.
7.2.4 Same cycle for hypothesis testing for a proportion
Prepare. Identify the parameter of interest, list hypotheses, identify the significance level, and identify \(\hat{p}\) and \(n\).
Check. Verify conditions to ensure \(\hat{p}\) is nearly normal under \(H_0\). For one-proportion hypothesis tests, use the null value \(p_0\) in place of \(p\) to check the success-failure condition.
Calculate. If the conditions hold, compute the standard error, again using \(p_0\), compute the \(Z\)-score, and identify the \(p\)-value.
Conclude. Evaluate the hypothesis test by comparing the \(p\)-value to \(\alpha\), and provide a conclusion in the context of the problem.
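A sketch of the Calculate step, with a hypothetical null value and data:

p0 <- 0.5                               # hypothetical null value
n <- 500
p_hat <- 0.56                           # hypothetical sample proportion
se <- sqrt(p0 * (1 - p0) / n)           # SE uses p0, not p_hat
z <- (p_hat - p0) / se
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value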
7.2.5 Choosing sample size when estimating a proportion
This is probably the most important part of this section for practical purposes. The following expression denotes the margin of error:
\[ z^{*}\sqrt{\frac{p(1-p)}{n}} \]
You have to choose the margin of error you want to report. The book gives an example of 0.04. So you want to find
\[ z^{*}\sqrt{\frac{p(1-p)}{n}} < 0.04 \]
The problem is that you don’t know \(p\). Since the worst-case scenario is \(p=0.5\), you have to use that unless you have some information about \(p\). Recall that \(z^{*}\) represents the \(z\)-score for the desired confidence level, so you have to choose that. The book gives an example where you want a 95 percent confidence level, so you choose 1.96. You could find this out in R by saying
qnorm(0.025,lower.tail=FALSE)
[1] 1.959964
returning the \(z\)-score for the upper tail. The reason for using 0.025 instead of 0.05 is that the 0.05 probability is split between the two tails. The complementary function is pnorm(1.959964,lower.tail=FALSE), which will return 0.025.
Once you have decided on values for \(p\) and \(z^*\), solve the above inequality for \(n\).
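Solving the inequality gives \(n > (z^{*}/0.04)^{2}\,p(1-p)\). A sketch in R with the worst case \(p=0.5\):

z_star <- qnorm(0.975)                   # 1.959964
p <- 0.5                                 # worst case
me <- 0.04                               # chosen margin of error
ceiling((z_star / me)^2 * p * (1 - p))   # smallest n that works: 601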
7.3 Textbook section 6.2 Difference of two proportions
In this section, we’re just modifying the previous section to account for a difference instead of a single proportion.
The difference \(\hat{p}_1-\hat{p}_2\) can be modeled using the normal distribution when
The data are independent within and between the two groups (random samples or randomized experiment)
The success-failure condition holds for both groups (at least 10 successes and 10 failures in each sample)
When the null hypothesis is that the proportions are equal, use the pooled proportion (\(\hat{p}_\text{pooled}\)) to verify the success-failure condition and estimate the standard error
\[ \hat{p}_\text{pooled} = \frac{\text{number of “successes”}}{\text{number of cases}} = \frac{\hat{p}_1n_1+\hat{p}_2n_2}{n_1+n_2} \]
Here \(\hat{p}_1n_1\) represents the number of successes in sample 1 since
\[ \hat{p}_1 = \frac{\text{number of successes in sample 1}}{n_1} \]
Similarly, \(\hat{p}_2n_2\) represents the number of successes in sample 2.
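A sketch of a pooled two-proportion test, with hypothetical counts:

x1 <- 120; n1 <- 500                    # hypothetical successes and size, group 1
x2 <- 90;  n2 <- 450                    # hypothetical group 2
p_pool <- (x1 + x2) / (n1 + n2)
se <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z <- (x1/n1 - x2/n2) / se
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value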
7.4 Textbook section 6.3 Testing goodness of fit using \(\chi^2\)
The \(\chi^2\) test, pronounced k\(\overline{\text{i}}\) square, is useful in many circumstances. The textbook treats two such circumstances:
Suppose your sample can be divided into groups, as can the general population. Does your sample represent the general population?
Does your sample resemble a particular distribution, such as the normal distribution?
For the first circumstance, we could divide a sample of people into races or genders, and we would like to examine all of the groups at once for resemblance to the general population, rather than in pairs. The \(\chi^2\) statistic permits an all-at-once comparison.
The \(\chi^2\) statistic is given by the following formula for \(g\) groups,
\[ \chi^2 = \frac{(\text{observed count}_1 - \text{null count}_1)^2}{\text{null count}_1} + \cdots + \frac{(\text{observed count}_g - \text{null count}_g)^2}{\text{null count}_g} \]
where the expression null count refers to the expected number of objects in the group. You have to be careful about how you determine the null count. For instance, the textbook gives an example of races of jurors. In such a case, the null counts should come from the population who can be selected as jurors. This might be a matter of some dispute since jurors are usually recruited through voting records and these records may not reflect the correct proportions. Statisticians tend to like things they can count, and some people are harder (more expensive) to count than others, particularly people in marginalized populations.
7.4.1 The \(\chi^2\) distribution
The \(\chi^2\) distribution is sometimes used to characterize data sets and statistics that are always positive and typically right skewed. Recall that a normal distribution has two parameters – mean and standard deviation – that describe its exact characteristics. The \(\chi^2\) distribution has just one parameter, called degrees of freedom (df), which influences the shape, center, and spread of the distribution. Here is a picture of the \(\chi^2\) distribution for several values of df (1–8).
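A sketch that draws those densities with base R graphics:

# chi-square densities for df = 1 through 8
curve(dchisq(x, df = 1), from = 0.01, to = 15, ylim = c(0, 0.5), ylab = "density")
for (k in 2:8) curve(dchisq(x, df = k), add = TRUE)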
In the jurors example, we can calculate the appropriate \(p\)-value in R using the \(\chi^2\) statistic calculated from the sample, 5.89, and the degrees of freedom \(g-1\), the number of groups minus one:
pchisq(5.89,3,lower.tail=FALSE)
[1] 0.1170863
This is a relatively large \(p\)-value given our earlier choices of cutoffs of 0.1, 0.05, and 0.01.
7.4.2 \(\chi^2\) test
The \(\chi^2\) test can be conducted in R for the juror example given in the book as follows.
o <- c(205, 26, 25, 19)
e <- c(198, 19.25, 33, 24.75) / sum(o)
chisq.test(o, p = e)
Chi-squared test for given probabilities
data: o
X-squared = 5.8896, df = 3, p-value = 0.1171
Note that I had to make an adjustment in R. The p argument of chisq.test() must be a vector of probabilities summing to 1. The table in the book gives expected counts, which sum to 275 rather than 1, so I divided each expected count by sum(o), the total number of jurors.
7.5 Textbook section 6.4 Testing independence in 2-way tables
Suppose you have a two-way table. Datacamp gives an example with gender and sport as the two ways. The following data frame lists the number of males and females who like each of three sports: archery, boxing, and cycling. The \(\chi^2\) test suggests that sport preference is not independent of gender, meaning that preferences may differ by gender.
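The original data frame isn’t reproduced here; a sketch with hypothetical counts in the same spirit:

# rows are genders, columns are sports; counts are hypothetical
sports <- as.matrix(data.frame(
  archery = c(35, 15),
  boxing  = c(10, 30),
  cycling = c(25, 20),
  row.names = c("female", "male")
))
chisq.test(sports)   # a small p-value suggests preference depends on gender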