# Statistics: Intro

Mick McQuaid

2024-01-11

Week ONE

uncertainty

## Variability in data

• The cost of basic necessities fluctuates

## Summarizing data, 1 of 2

We use the mean and standard deviation to summarize many data items

u <- c(1,2,3,4,5)
mean(u)
[1] 3
sd(u)
[1] 1.581139

## Summarizing data, 2 of 2

v <- c(2,3,3,3,4)
mean(v)
[1] 3
sd(v)
[1] 0.7071068

$v$ is less variable than $u$, even though they are the same on average. We need the standard deviation to know that, although we will later learn some pictures we can draw to illustrate it.

## Let’s do some examples on the computer

• Install R (just google the letter R)
• Install RStudio (after installing R)
• Be sure you install these on your local disk, not in the cloud

## Now, recreate the above examples

• $u$ and $v$ are vectors and can be any size you like
• try using different numbers
• try using prices at different vendors for something you might want to buy or sell
• c() is a function that combines numbers or words into vectors
• The symbol <- can be read as the word gets, as in “$u$ gets the vector of numbers 1 through 5”
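As a minimal sketch of the suggestions above, the following builds vectors with c() and summarizes one of them. The prices are made-up numbers, and the names `prices` and `vendors` are hypothetical, chosen only for illustration.

```r
# Hypothetical prices for one item at five vendors (made-up numbers)
prices <- c(12.50, 13.25, 11.99, 14.00, 12.75)

# c() also combines words into a character vector
vendors <- c("A", "B", "C", "D", "E")

length(prices)   # how many values the vector holds
mean(prices)     # the average price
sd(prices)       # how spread out the prices are around that average
```

Swapping in your own numbers and rerunning the last two lines is enough to compare two sets of prices.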

## Bundles of vectors called dataframes

• R has built-in dataframes
• One is called mtcars
• Type the word mtcars into the R console and press Enter to see it
• You can find the mean and standard deviation of any given column by saying, e.g., mean(mtcars$mpg)
• The part that says mtcars$ tells R what dataframe to use to find mpg
• Each column of mtcars is a vector
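A short sketch of the points above, using only the built-in mtcars dataframe. The variable name `mpg` is just a convenient label for the extracted column.

```r
# mtcars ships with R, so no packages are needed
mpg <- mtcars$mpg      # pull one column out of the dataframe

is.numeric(mpg)        # TRUE: the column is a numeric vector
length(mpg)            # 32, one value per car
mean(mpg)              # about 20.09
sd(mpg)                # about 6.03
```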

## The mtcars dataframe

• The name of each vector is an abbreviation at the top
• The full names can be found by saying ?mtcars which also gives other information about the dataframe
• You can find all the column means by saying colMeans(mtcars)
• You can find all the standard deviations by saying sapply(mtcars,sd)
• The call colMeans(mtcars) is a faster version of sapply(mtcars,mean)
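One way to convince yourself that the two calls agree is to compare their results directly. This is only a sketch; the names `col_means_fast`, `col_means_slow`, and `col_sds` are made up for the example.

```r
col_means_fast <- colMeans(mtcars)        # all column means in one call
col_means_slow <- sapply(mtcars, mean)    # mean() applied column by column

# all.equal() compares the two named vectors, allowing for tiny rounding differences
all.equal(col_means_fast, col_means_slow)

# Standard deviations use sapply() with sd
col_sds <- sapply(mtcars, sd)
round(col_sds, 2)
```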

## Packages

• Most of the functionality in R is in packages
• You install packages into the library and retrieve them from there
• Some packages are actually collections of packages
• We’ll use a collection called the tidyverse
• So say install.packages("tidyverse") now
• Also say install.packages("pacman") because it simplifies installation and loading

## A picture of mtcars

#. install.packages("pacman")
pacman::p_load(tidyverse)  # loads ggplot2, which provides ggplot()
ggplot(mtcars, aes(x = hp, y = mpg, size = cyl, color = factor(cyl))) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(2, 10)) +
  scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf")) +
  labs(x = "Horsepower", y = "MPG", size = "Number of Cylinders", color = "Number of Cylinders", title = "MPG vs. Horsepower (Bubble Size: Number of Cylinders)") +
  theme_minimal()

## Goal

• Interpret the variability in data
• Describe unwieldy data
• Estimate unknown quantities
• Predict the future
• Understand the mechanisms affecting stochastic processes

## This course is about finding and assessing the best line

• Consider a desired outcome (e.g., wins)
• Identify one or more factors contributing (e.g., payroll dollars)
• Find the slope and intercept to predict how much of the factor leads to how much of the outcome
• Figure out how good or bad the prediction is

$\ldots{}$ and there you have a simplified view of regression, the heart of this course.
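The wins and payroll data mentioned above are not included in these slides, so the sketch below borrows the built-in mtcars dataframe instead, with mpg standing in for the outcome and hp for the contributing factor. That substitution is an assumption made purely for illustration, and lm() is base R's function for fitting a line by least squares.

```r
# Outcome: mpg; contributing factor: hp (stand-ins for wins and payroll)
fit <- lm(mpg ~ hp, data = mtcars)

coef(fit)                  # the intercept and the slope of the fitted line
summary(fit)$r.squared     # one measure of how good the prediction is

# Predicted mpg for a hypothetical car with 150 horsepower
predict(fit, newdata = data.frame(hp = 150))
```

The slope is negative here: in this dataframe, more horsepower goes with fewer miles per gallon.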

## Think about the prediction lines as models of reality

• reduces reality to a manageable fiction
• requires deep knowledge of the subject you model
• hard to figure out what aspects of reality to leave in and what to take out

## Modeling

• That’s the process I’ve just described
• Mainly equations of lines in this class
• Parsimonious description of some aspect of reality

## Expense of modeling

• More realistic $\Rightarrow$ more expensive
• Example: fishing

## Death penalty in Florida example

• Analysts tried to find racism in the death penalty in Florida
• Most failed
• How could they go wrong?
• Finally, one analyst figured it out. How?

## Technicality of this course

• You will learn the mechanics
• You will learn how to build models
• You have to supply the intuition and insight
• No one can teach you to think of the best model

# Summary statistics

## Goals for this section

• Distinguish between samples and populations.
• Know how to calculate the arithmetic mean.
• Know how to calculate standard deviation.
• Know the definition of median.
• Review other summary statistics.

## Finding a sum is denoted like this

The Greek letter Sigma, $\Sigma$, usually means to sum the values represented by the expression that follows:

$\sum_{i=1}^n y_i$

which is the same as

$y_1 + y_2 + \cdots + y_n$

## Sigma notation may be inconsistent

You may see $\Sigma$ used in an inconsistent way in math and stats:

$\sum_{i=1}^n y_i$

may be replaced by a synonymous shortcut like

$\sum_{i} y_i \quad \text{or} \quad \sum y$

## Means

The arithmetic mean is the average of a set of values.

Usually when we use the word mean, we refer to

$\overline{y} = \frac{\sum_{i=1}^n y_i}{n}$

which is the same as

$\overline{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$
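To connect the notation to R, here is a small sketch that writes the sum out as a loop and then divides by $n$. It reuses the numbers from the vector $v$ shown earlier; the names `y`, `n`, and `total` are chosen only for this example.

```r
y <- c(2, 3, 3, 3, 4)      # same numbers as the vector v from the earlier slide
n <- length(y)

# Sigma notation written out as a loop: add y[1], y[2], ..., y[n]
total <- 0
for (i in seq_len(n)) {
  total <- total + y[i]
}

total              # 15
total == sum(y)    # TRUE: sum() does the same job in one call
total / n          # 3, the arithmetic mean
mean(y)            # 3
```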

## Sample mean and population mean

We use the sample mean to estimate the population mean.

The sample mean is often denoted as $\overline{y}$.

The population mean is called the expected value of $y$ and is often denoted as

$E(y)=\mu$

and in the case of the boxes, we would have to destroy all of them to be sure of its value, so we destroy a sample to estimate $\mu$.

## Range

A sample’s range is the difference between its max and min.

If the grades of a sample of six students are

$(2, 2, 3, 3, 4, 4)$

then the range is

$4-2=2$

The mean of the sample is $\overline{y}=(2+2+3+3+4+4)/6=3$
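The same numbers can be checked in R. One thing to watch for in this sketch: R's range() function returns the min and max themselves rather than their difference, so the range as defined above needs one more step.

```r
grades <- c(2, 2, 3, 3, 4, 4)

range(grades)               # 2 4: the min and the max
max(grades) - min(grades)   # 2: the range as defined above
diff(range(grades))         # 2: the same thing in one expression
mean(grades)                # 3
```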

## Standard deviation

Standard deviation is used to describe data variation.

The standard deviation of a population is $\sigma$ and of a sample is $s$. It’s painfully easy to confuse the spreadsheet functions for $\sigma$ and $s$, usually STDEV.P and STDEV.S.

$\sigma=\sqrt{\frac{\sum_{i=1}^n (y_i-\mu)^2}{n}}$

$s=\sqrt{\frac{\sum_{i=1}^n (y_i-\overline{y})^2}{n-1}}$
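A minimal sketch of the two formulas in R, using the vector $u$ from the first slides. R's sd() divides by $n-1$, so it returns $s$. To get $\sigma$, the sketch assumes the five numbers are the entire population, so that $\mu$ equals their mean.

```r
u <- c(1, 2, 3, 4, 5)
n <- length(u)

# Sample standard deviation s: divide by n - 1 (this is what sd() does)
s <- sqrt(sum((u - mean(u))^2) / (n - 1))

# Population standard deviation sigma: divide by n,
# assuming u is the whole population so that mu = mean(u)
sigma <- sqrt(sum((u - mean(u))^2) / n)

s        # 1.581139
sd(u)    # same as s
sigma    # 1.414214, always a little smaller than s
```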

## Standard deviation example

Find the standard deviation of the grade sample.

• sum: $2+2+3+3+4+4=18$
• mean: $18/6=3$
• deviations: $(2-3)^2+(2-3)^2+(3-3)^2+(3-3)^2+(4-3)^2+(4-3)^2$
• deviations part two: $1+1+0+0+1+1$
• divide: $4/(6-1)=4/5=0.8$
• square root: just write $\sqrt{0.8}$ unless you’re allowed a calculator / computer
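The steps above can be followed one at a time in R. This is only a sketch of the hand calculation; the intermediate names (`total`, `ybar`, `sq_dev`, `ss`, `variance`) are made up for the example.

```r
grades <- c(2, 2, 3, 3, 4, 4)

total    <- sum(grades)                    # sum: 18
ybar     <- total / length(grades)         # mean: 3
sq_dev   <- (grades - ybar)^2              # squared deviations: 1 1 0 0 1 1
ss       <- sum(sq_dev)                    # their sum: 4
variance <- ss / (length(grades) - 1)      # divide: 0.8
s        <- sqrt(variance)                 # square root: about 0.894

s
sd(grades)   # the built-in function gives the same answer
```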

## Deviations

The deviations part two step shown previously is the numerical version of what I previously showed in the graph with pink lines between data and some imaginary prediction line. In this case, the imaginary line is $\overline{y}$. (In the previous case, it was the least squares line, which we’ll learn about later.)

# Standard deviation calculation

## Calculating $s$ emphasizes its interpretation

Here’s a shortcut equivalent to the previous formula for $s$.

$s=\sqrt{\frac{\sum_{i=1}^n y_i^2 - n(\overline{y})^2}{n-1}}$
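A quick numerical check, using the grade sample, that the shortcut and the earlier formula give the same $s$. The names `s_definition` and `s_shortcut` are hypothetical labels for the two calculations.

```r
grades <- c(2, 2, 3, 3, 4, 4)
n      <- length(grades)
ybar   <- mean(grades)

# Earlier formula: sum the squared deviations from the mean
s_definition <- sqrt(sum((grades - ybar)^2) / (n - 1))

# Shortcut: sum of squares minus n times the squared mean
# Here sum(grades^2) is 58 and n * ybar^2 is 54, leaving 4 as before
s_shortcut <- sqrt((sum(grades^2) - n * ybar^2) / (n - 1))

all.equal(s_definition, s_shortcut)   # TRUE
```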

## Two rough guidelines to interpret $s$

1. For any data set, at least three-fourths of the measurements will lie within two standard deviations of the mean.
2. For most data sets with enough measurements (25 or more) and a mound-shaped distribution, about 95 percent of the measurements will lie within two standard deviations of the mean. (We’ll study mound-shaped distributions later.)
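As a rough illustration of both guidelines, the sketch below counts how many mpg values in the built-in mtcars dataframe fall within two standard deviations of their mean. Treating mpg as approximately mound-shaped is an assumption here, not something established above.

```r
y <- mtcars$mpg          # 32 measurements, so more than 25
m <- mean(y)
s <- sd(y)

# TRUE for each car whose mpg lies within two standard deviations of the mean
within_two_sd <- y > m - 2 * s & y < m + 2 * s

sum(within_two_sd)       # 30 of the 32 cars
mean(within_two_sd)      # 0.9375, the proportion inside the interval
```

The proportion is above the three-fourths floor of the first guideline and close to the 95 percent of the second.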

## Standard deviation and mean work as a pair

When you want to describe a set of data, the two most frequently used numbers, used as a pair, are mean and standard deviation. Suppose two websites, tra.com and la.com, both sell used phones. The last ten sales of the ZZ11 on tra.com, in chronological order, were \$36, \$29, \$59, \$18, \$23, \$35, \$25, \$63, \$69, and \$43.

The last ten sales of the ZZ11 on la.com, in chronological order, were \$44, \$36, \$47, \$38, \$35, \$36, \$37, \$38, \$50, and \$39. Using only this info, what is the expected value of the next sale in each market? How spread out are the prices in each market?

## Do it in R

tra <- c(36,29,59,18,23,35,25,63,69,43)
mean(tra)
[1] 40
sd(tra)
[1] 17.95055
la <- c(44,36,47,38,35,36,37,38,50,39)
mean(la)
[1] 40
sd(la)
[1] 5.163978

Further, suppose bla.com has mean 40 and sd 36.89. Where would you sell?
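One hedged way to think about the question is to apply the two-standard-deviation guideline to each market. The sketch below does only that arithmetic, with the sd values rounded to two decimals; the dataframe name `markets` and its column names are made up for the example, and it does not settle where to sell.

```r
# Mean and standard deviation for each market, rounded from the slides above
markets <- data.frame(
  site = c("tra.com", "la.com", "bla.com"),
  mean = c(40, 40, 40),
  sd   = c(17.95, 5.16, 36.89)
)

# Rough interval: mean plus or minus two standard deviations
markets$lower <- markets$mean - 2 * markets$sd
markets$upper <- markets$mean + 2 * markets$sd

markets
```

All three markets have the same mean, so only the spread tells them apart. The lower bound for bla.com comes out negative, which no price can be, a reminder that the guideline is rough.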

# Summary statistics in R

#. help(mtcars)
#. ?mtcars
df <- mtcars
summary(df$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
(s <- sd(df$mpg))
[1] 6.026948
(m <- mean(df$mpg))
[1] 20.09062
(lower <- m-(2*s))
[1] 8.036729
(upper <- m+(2*s))
[1] 32.14452

## Summarize the entire dataframe

str(df)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(df)
      mpg             cyl             disp             hp
Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
Median :19.20   Median :6.000   Median :196.3   Median :123.0
Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
     drat             wt             qsec             vs
Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
Median :3.695   Median :3.325   Median :17.71   Median :0.0000
Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
      am              gear            carb
Min.   :0.0000   Min.   :3.000   Min.   :1.000
1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
Median :0.0000   Median :4.000   Median :2.000
Mean   :0.4062   Mean   :3.688   Mean   :2.812
3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
Max.   :1.0000   Max.   :5.000   Max.   :8.000  

## Less ugly summaries

pacman::p_load(vtable)
df <- mtcars
df |> sumtable(summ=c('mean(x)','median(x)'),out='return')
   Variable Mean Median
1       mpg   20     19
2       cyl  6.2      6
3      disp  231    196
4        hp  147    123
5      drat  3.6    3.7
6        wt  3.2    3.3
7      qsec   18     18
8        vs 0.44      0
9        am 0.41      0
10     gear  3.7      4
11     carb  2.8      2

## There’s just one problem

• Not all of the columns should be numeric
• Discover this by saying ?mtcars in the R console
• You may also say str(mtcars) to discover that all of the columns are currently numeric
• You have to manually change each of columns 2 and 8 through 11 to factors (see next slide)
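To see why the numeric coding is a problem, compare what mean() and table() report for the am column. This is a small sketch; the labels "automatic" and "manual" follow the coding described in ?mtcars (0 and 1), and the name `am_factor` is made up for the example.

```r
# As a number, am averages to a proportion, not a typical transmission
mean(mtcars$am)      # about 0.41

# Counting the codes is more informative
table(mtcars$am)     # 19 cars coded 0 and 13 coded 1

# As a factor, the same column summarizes as labeled counts
am_factor <- factor(mtcars$am, labels = c("automatic", "manual"))
summary(am_factor)
```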

## Changing columns to factors, first way

df[,2] <- as.factor(df[,2])
df[,8] <- as.factor(df[,8])
df[,9] <- as.factor(df[,9])
df[,10] <- as.factor(df[,10])
df[,11] <- as.factor(df[,11])

## Changing columns to factors, second way

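pacman::p_load(tidyverse)  # mutate() and across() come from dplyr, part of the tidyverse
df |> mutate(across(c(2,8:11),as.factor))  # prints the result; assign it back to df to keep it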
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

## Changing columns to factors, third way

df$vs <- factor(df$vs,labels=c("V","S"))
df$am <- factor(df$am,labels=c("automatic","manual"))
df$cyl <- ordered(df$cyl)
df$gear <- ordered(df$gear)
df$carb <- ordered(df$carb)
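The third way uses ordered() for cyl, gear, and carb. A brief sketch of what that buys you, working on a copy of the column so that df is left alone; the names `cyl_plain` and `cyl_ordered` are hypothetical.

```r
cyl_plain   <- factor(mtcars$cyl)     # categories with no ordering
cyl_ordered <- ordered(mtcars$cyl)    # categories that know 4 < 6 < 8

levels(cyl_ordered)        # "4" "6" "8"
is.ordered(cyl_plain)      # FALSE
is.ordered(cyl_ordered)    # TRUE

# Comparisons only make sense for the ordered version
sum(cyl_ordered < "8")     # 18 cars have fewer than 8 cylinders
table(cyl_ordered)         # 11 cars with 4, 7 with 6, 14 with 8
```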

## Structure after repair

str(df)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : Ord.factor w/ 3 levels "4"<"6"<"8": 2 2 1 2 3 2 3 1 1 2 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : Factor w/ 2 levels "V","S": 1 1 2 2 1 2 1 2 2 2 ...
 $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
 $ gear: Ord.factor w/ 3 levels "3"<"4"<"5": 2 2 2 1 1 1 1 2 2 2 ...
 $ carb: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 4 4 1 1 2 1 4 2 2 4 ...

## Less ugly summaries, just the numeric columns

df[,c(1,3:7)] |> sumtable(summ=c('min(x)','median(x)','mean(x)','max(x)'),out='return')
  Variable Min Median Mean Max
1      mpg  10     19   20  34
2     disp  71    196  231 472
3       hp  52    123  147 335
4     drat 2.8    3.7  3.6 4.9
5       wt 1.5    3.3  3.2 5.4
6     qsec  14     18   18  23

It’s tough to make predictions, especially about the future.

— Yogi Berra

END

# Colophon

This slideshow was produced using Quarto.

Fonts are Roboto Condensed Bold, JetBrains Mono Nerd Font, and STIX