Mick McQuaid

2024-01-20

Week TWO

# Review

## Rownames and column names

Note that R allows you to assign names to rows of a dataframe just as you can assign names to columns of a dataframe. We saw an example of that with the mtcars data, which appeared to have an extra column because the car makes and models were assigned as rownames.

## Mean

• arithmetic mean is the most popular measure of centrality
• can be dragged away from the center by outliers
• can be found by mean(vectorname) in R if vector is numeric
• can find all means in dataframe with colMeans(df) or sapply(df,mean)

## Means of some, but not all, columns

• subsetting just the first, second, and fourth column colMeans(mtcars[,c(1,2:4)])
• subsetting numeric columns colMeans(df[,which(sapply(df,is.numeric))])
• subsetting numeric columns & rows where hp > 100: df<-mtcars colMeans(df[which(df$hp>100),which(sapply(df,is.numeric))]) ## Median • the middle value of a sorted vector if there are an odd number of elements in the vector • the arithmetic mean of the two middle values of a sorted vector if there are an even number of elements • can be found by median(vectorname) in R if vector is numeric ## Standard Deviation • a measure of how spread out a vector is around its mean if vector is numeric • can be found in R by sd(vectorname) • is the square root of the variance • used in place of variance because it’s in the same units as the variable rather than squared units # More Numerical Summaries ## Structure of a dataframe • say str(df) in R to get the following • number of rows • number of columns • names of columns • types of columns • examples of entries in each column ## Summary of a dataframe • say summary(df) in R to get an entry for each column, containing • minimum, first quartile, median, mean, third quartile, maximum • above is for numeric columns • counts and level names for factors ## Better summaries pacman::p_load(vtable) df <- mtcars df[,c(1,3:7)] |> sumtable(summ=c('min(x)','median(x)','mean(x)','sd(x)','max(x)'),out='return')  Variable Min Median Mean Sd Max 1 mpg 10 19 20 6 34 2 disp 71 196 231 124 472 3 hp 52 123 147 69 335 4 drat 2.8 3.7 3.6 0.53 4.9 5 wt 1.5 3.3 3.2 0.98 5.4 6 qsec 14 18 18 1.8 23 ## Summarizing non-numeric data First, get some categorical data … options(digits=1) load(paste0(Sys.getenv("STATS_DATA_DIR"),"/migraine.rda")) str(migraine) Classes 'tbl_df', 'tbl' and 'data.frame': 89 obs. of 2 variables:$ group    : Factor w/ 2 levels "control","treatment": 2 2 2 2 2 2 2 2 2 2 ...
\$ pain_free: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...

## A contingency table

(tbl <- with(migraine,table(pain_free,group)))
         group
pain_free control treatment
no       44        33
yes       2        10

## A bigger contingency table

load(paste0(Sys.getenv("STATS_DATA_DIR"),"/loan50.rda"))
with(loan50,addmargins(table(loan_purpose,grade)))
                    grade
loan_purpose             A  B  C  D  E  F  G Sum
0  0  0  0  0  0  0  0   0
car                 0  0  1  1  0  0  0  0   2
credit_card         0  6  4  1  1  1  0  0  13
debt_consolidation  0  2  9  4  7  1  0  0  23
home_improvement    0  1  4  0  0  0  0  0   5
house               0  0  1  0  0  0  0  0   1
major_purchase      0  0  0  0  0  0  0  0   0
medical             0  0  0  0  0  0  0  0   0
moving              0  0  0  0  0  0  0  0   0
other               0  4  0  0  0  0  0  0   4
renewable_energy    0  1  0  0  0  0  0  0   1
small_business      0  1  0  0  0  0  0  0   1
vacation            0  0  0  0  0  0  0  0   0
wedding             0  0  0  0  0  0  0  0   0
Sum                 0 15 19  6  8  2  0  0  50

# Visual summaries

## A picture of mtcars

#. install.packages("pacman")
ggplot(mtcars, aes(x = hp, y = mpg, size = cyl, color = factor(cyl))) +
geom_point(alpha = 0.7) +
scale_size_continuous(range = c(2, 10)) +
scale_color_manual(values = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf")) +
labs(x = "Horsepower", y = "MPG", size = "Number of Cylinders", color = "Number of Cylinders", title = "MPG vs. Horsepower (Bubble Size: Number of Cylinders)") +
theme_minimal()

## Preceding example

• uses the tidyverse, a coherent set of packages
• uses ggplot, the main function in that set of packages
• uses the layered grammar of graphics, a philosophy of data visualization
• graphics in this philosophy are built from reusable components

## Visual summary of a vector

pacman::p_load(scales)
loan50 |>
ggplot(aes(annual_income)) +
geom_boxplot() +
scale_x_continuous(labels = comma_format())

## Visual summary of several vectors

loan50 |>
ggplot(aes(annual_income,homeownership)) +
geom_boxplot() +
scale_x_continuous(labels = comma_format())

## A similar visual summary

loan50 |>
ggplot(aes(annual_income,homeownership)) +
geom_violin() +
scale_x_continuous(labels = comma_format())

## Uncertainty in visual summaries

loan50 |>
ggplot(aes(annual_income,homeownership)) +
geom_boxplot() +
geom_jitter( size=1.4, color="orange", width=0.1) +
scale_x_continuous(labels = comma_format())

## Uncertainty in a linear model

pacman::p_load(hrbrthemes)
df <- data.frame(
x = 1:100 + rnorm(100,sd=9),
y = 1:100 + rnorm(100,sd=16)
)
ggplot(df, aes(x=x, y=y)) +
geom_point() +
geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
theme_ipsum()

In these parts, a man’s life may depend on a mere scrap of information.

— Clint Eastwood, in A Fistful of Dollars (1964)

END

# Colophon

This slideshow was produced using quarto

Fonts are Roboto Condensed Bold, JetBrains Mono Nerd Font, and STIX2