2  Numerical and Visual Data Summaries

2.1 Recap Week 01

We looked at four things:

  • An example experiment (stents)
  • data basics
  • sampling
  • experiments

The best way to read the textbook is to do some of the exercises. I expect you to spend six to nine hours doing so outside of class each week. This is based on the popular rule of thumb that three class hours call for six to nine study hours.

Let’s look at some of the Chapter 1 exercises briefly.

2.2 Textbook Section 1.1, An Example Experiment (stents)

Here we looked at treatment groups and control groups.

2.2.1 Textbook Exercise 1.1, the migraine data set

In class we loaded the data set and made a contingency table, with which we can answer the four questions in Exercise 1.1.

options(digits=1)
load(paste0(Sys.getenv("STATS_DATA_DIR"),"/migraine.rda"))
head(migraine)
      group pain_free
1 treatment       yes
2 treatment       yes
3 treatment       yes
4 treatment       yes
5 treatment       yes
6 treatment       yes
tbl <- with(migraine,table(pain_free,group))
tbl
         group
pain_free control treatment
      no       44        33
      yes       2        10

The four questions are:

  1. What percent of patients in the treatment group were pain free 24 hours after receiving acupuncture?
  2. What percent were pain free in the control group?
  3. In which group did a higher percent of patients become pain free 24 hours after receiving acupuncture?
  4. Your findings so far might suggest that acupuncture is an effective treatment for migraines for all people who suffer from migraines. However, this is not the only possible conclusion that can be drawn based on your findings so far. What is one other possible explanation for the observed difference between the percentages of patients that are pain free 24 hours after receiving acupuncture in the two groups?

We can use the tbl object we created to answer these questions.

  1. The percentage of patients in the treatment group who were pain free was 23.3%.
  2. The percentage of patients in the control group who were pain free was 4.3%.
  3. A higher percent of the treatment group became pain free.
  4. Not all migraine headaches are alike. It is possible that, due to chance, the patients in the treatment group had less severe migraine headaches that were easier to cure.

Examine the .qmd file that renders into this .html file. You’ll see that, rather than specify the numbers in the answers above, we actually did the calculations inline. What is good about doing this? (Hint: what happens if you add patients to the data set and rerender the document?) What is bad about doing this? (Hint: it looks clumsy and we could just as easily run the calculations in an r chunk, assign the results to objects, and name the objects inline.)

Another issue is the code for part c. I only accounted for the case where the treatment group has the greater percentage. It would be much better to add an else if clause to account for the possibility that the answer is control and an else to account for the case where the two are equal. The best way to do that is to create a non-echoing chunk and use the results inline. For example, the following chunk only appears in the .qmd file, not the .html file, but its results can be used inline.

Now we can say that the treatment group had the higher percentage. Only one problem remains and it is a software engineering problem. We haven’t tested the above code on a case where the control group or neither group had the higher percentages. We’ll leave that for now as a more advanced topic.
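The non-echoing chunk described above might look something like the following sketch. The object names are my own choices, and the counts are copied from the contingency table shown earlier so the chunk stands alone.

```r
#. rebuild the contingency table counts shown earlier
tbl <- matrix(c(44, 33, 2, 10), nrow = 2, byrow = TRUE,
              dimnames = list(pain_free = c("no", "yes"),
                              group = c("control", "treatment")))
#. percent pain free in each group
trtpct <- 100 * tbl["yes", "treatment"] / sum(tbl[, "treatment"])
ctlpct <- 100 * tbl["yes", "control"]   / sum(tbl[, "control"])
#. which group had the higher percentage, covering all three cases
higher <- if (trtpct > ctlpct) {
  "treatment"
} else if (ctlpct > trtpct) {
  "control"
} else {
  "neither"
}
```

Inline, you could then write `r higher` and `r round(trtpct, 1)` instead of hard-coding the word and the number.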

2.3 Textbook Section 1.2, Data Basics

Here we looked at

  • observations (rows), variables (columns), and data matrices (data frames)
  • types of variables (dbl or continuous, int or discrete, fctr or nominal, and ordered fctr or ordinal)
  • relationships between variables
  • explanatory (x or features or input) and response (y or targets or output) variables
  • observational studies and experiments (and we mentioned an in-between activity called quasi-experiments)
Data Type terminology: R vs the Textbook

R                    OpenIntro Stats textbook
dbl, as.numeric()    numerical, continuous
int, as.integer()    numerical, discrete
fctr, factor()       categorical, nominal
ord, ordered()       categorical, ordinal

On the left of the above table, you see how R refers to data types. On the right is how the OpenIntro Stats textbook refers to data types. When you display a tibble (a tibble is a data frame with some extra information) using R, each variable column will be headed with dbl, int, fctr, or ord to indicate the four kinds of variables. If a variable is not interpreted as a number, R will display chr as an abbreviation of character.
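As a sketch of the left-hand column of the table, here is how you might construct each of the four types in R. The variable names are made up for illustration.

```r
#. numerical, continuous
height <- as.numeric(c(1.62, 1.75, 1.80))
#. numerical, discrete
siblings <- as.integer(c(0, 2, 1))
#. categorical, nominal
color <- factor(c("red", "green", "red"))
#. categorical, ordinal: the levels argument supplies the ordering
size <- ordered(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
class(size)  #. "ordered" "factor"
```

Note that ordered() is just factor() with an ordering imposed on the levels, which is why the textbook treats ordinal as a special case of categorical.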

2.3.1 Textbook Exercise 1.7

What were the explanatory and response variables in the migraine study? The group was explanatory and pain_free was the response variable.

2.3.2 Textbook Exercise 1.12

This is a hard question in two parts.

  1. List the variables used in creating this visualization.
  2. Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

There is actually an R package underlying this question. If you visit https://github.com/dgrtwo/unvotes you will see the data represented as a tibble. If you recall, during week one we said that a tibble is a data frame that behaves well. Among its features is a list of the data types, so you can answer both parts by looking at a tibble of the data, where you’ll see that

  • year is stored as dbl although it is really discrete and could be stored as int
  • country is stored as chr which means characters although it is really a nominal factor
  • percent_yes is stored as a dbl which is appropriate
  • issue is stored as chr although it is really a nominal factor

Later we’ll learn how to produce a visualization like this, although you are welcome to try based on the code at the unvotes website mentioned above. If you want the actual code itself, you can slightly modify the code at https://rpubs.com/minebocek/unvotes to include Mexico.

2.4 Textbook Section 1.3, Sampling

Here we talked about random sampling, stratified sampling, cluster sampling, and observational studies.

2.4.1 Textbook Exercise 1.15, Asthma

  1. What is the population of interest and the sample? Note that the population is NOT all asthma sufferers. The population of interest is all asthma patients aged 18-69 who rely on medication for asthma treatment. The sample consists of 600 such patients.
  2. Is the study generalizable? Can we establish cause and effect? The patients are probably not randomly sampled, so we need to know more to say whether they represent all asthma patients 18–69 who rely on medication. For example, they could all be from a high-pollution city. We would need to know that. The cause and effect determination is easier. An experiment can determine cause and effect, while an observational study only determines association.

2.5 Textbook Section 1.4, Experiments

Here we discussed four issues:

  • control
  • randomization
  • replication
  • blocking

2.5.1 Textbook Exercise 1.34, Exercise and mental health

A researcher is interested in the effects of exercise on mental health and he proposes the following study: Use stratified random sampling to ensure representative proportions of 18-30, 31-40 and 41-55 year olds from the population. Next, randomly assign half the subjects from each age group to exercise twice a week, and instruct the rest not to exercise. Conduct a mental health exam at the beginning and at the end of the study, and compare the results.

  1. What type of study is this?
  2. What are the treatment and control groups in this study?
  3. Does this study make use of blocking? If so, what is the blocking variable?
  4. Does this study make use of blinding?
  5. Comment on whether or not the results of the study can be used to establish a causal relationship between exercise and mental health, and indicate whether or not the conclusions can be generalized to the population at large.
  6. Suppose you are given the task of determining if this proposed study should get funding. Would you have any reservations about the study proposal?

2.5.2 Answers

  1. This is an experiment.
  2. The treatment is exercise twice a week and control is no exercise.
  3. Yes, the blocking variable is age.
  4. No, the study is not blinded since the patients will know whether or not they are exercising.
  5. Since this is an experiment, we can make a causal statement. Since the sample is random, the causal statement can be generalized to the population at large. However, we should be cautious about making a causal statement because of a possible placebo effect.
  6. It would be very difficult, if not impossible, to successfully conduct this study since randomly sampled people cannot be required to participate in a clinical trial.

2.6 Textbook Chapter 2: Summarizing data

This week, we’ll look at numerical data, categorical data, and a case study.

2.6.1 Numerical data

There are graphical and numerical methods for summarizing numerical data, including

  • scatterplots
  • dot plots
  • mean
  • histograms
  • variance and standard deviation
  • box plots, quartiles, and the median
  • robust statistics
  • cartographic maps and cartograms

We can draw a scatterplot of two variables of the loan50 data as follows.

options(repos=structure(c(CRAN="https://mirrors.nics.utk.edu/cran/")))
pacman::p_load(tidyverse)
load(paste0(Sys.getenv("STATS_DATA_DIR"),"/loan50.rda"))
loan50 |>
  ggplot(aes(annual_income,loan_amount)) +
  geom_point()

The above is a very basic scatterplot. Later, we’ll learn to change colors, background, labels, legends, and more. Bear in mind that the scatterplot is meant to compare two numeric variables. You can’t use it for a numeric variable and a categorical variable.

We can create a basic dotplot as follows.

loan50 |>
  ggplot(aes(homeownership)) +
  geom_dotplot()

A dotplot may be helpful to illustrate a categorical variable, but I seldom use one. Using one in concert with a boxplot may make more sense. We’ll look at boxplots later.

The mean is part of a good summary of data. We can find the mean of a variable by saying mean(variable_name) or as part of a summary. For instance

with(loan50,mean(annual_income))
[1] 86170
with(loan50,summary(annual_income))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  28800   55750   74000   86170   99500  325000 

Notice that the summary() function also gives us the minimum, the first quartile, the median, the 3rd quartile, and the maximum value of the variable, in addition to the mean. We’ll discuss all these statistics in the context of other ways to extract them. The problem with the mean that is demonstrated in the textbook is that two variables may have very different shapes but the same mean. So the textbook then describes histograms, which are a good way to identify the shape of a variable.

loan50 |>
  ggplot(aes(annual_income)) +
  geom_histogram()

Notice that the x-axis labels are shown in scientific notation. We can fix this using the scales package. By the way, in case I haven’t mentioned it before, we always refer to the horizontal axis as the x-axis and the vertical axis as the y-axis. This is the default for r and many other languages that create graphical displays.

pacman::p_load(scales)
loan50 |>
  ggplot(aes(annual_income)) +
  geom_histogram() +
  scale_x_continuous(labels = comma_format())

You may have noticed a warning about the number of bins. Hadley Wickham, the inventor of the Tidyverse, dislikes the default number of bins in a histogram, so geom_histogram() always shows a warning message suggesting that you pick a better number based on your data. We can fix that easily with a parameter to geom_histogram(). Then each bin (vertical stripe) will represent a thousand dollars.

loan50 |>
  ggplot(aes(annual_income)) +
  geom_histogram(binwidth=1000) +
  scale_x_continuous(labels = comma_format())

You may notice that warning messages aren’t dealbreakers. An error message, on the other hand, will often stop output dead in its tracks.

The next concepts covered in the textbook are variance and standard deviation. We can calculate them as follows. When you do this, you may notice that sd() is the square root of var(). Why would you prefer one over the other? Usually you use sd() because it’s in the same units as the data, dollars in the following case, unlike var(), which is in squared units, squared dollars in the following case.

with(loan50,var(annual_income))
[1] 3e+09
with(loan50,sd(annual_income))
[1] 57566

Together, the mean and standard deviation are often a good, yet compact, description of a data set.

You may want to find the means of all the columns in a data set. If you try to do that with the colMeans() function, you’ll get an error message as follows. (Actually, I’ve disabled the following code chunk by saying #| eval: false in the .qmd file because otherwise the rendering would halt.)

colMeans(loan50)

The remedy is to use a logical function to identify only the numeric columns.

colMeans(loan50[sapply(loan50, is.numeric)])
             emp_length                    term           annual_income 
                     NA                   4e+01                   9e+04 
         debt_to_income      total_credit_limit   total_credit_utilized 
                  7e-01                   2e+05                   6e+04 
num_cc_carrying_balance             loan_amount           interest_rate 
                  5e+00                   2e+04                   1e+01 
 public_record_bankrupt            total_income 
                  8e-02                   1e+05 

This is okay, but the results are in scientific notation. You can use the format() function to suppress scientific notation as follows.

format(colMeans(loan50[sapply(loan50, is.numeric)]),scientific=FALSE)
             emp_length                    term           annual_income 
            "       NA"             "    42.72"             " 86170.00" 
         debt_to_income      total_credit_limit   total_credit_utilized 
            "     0.72"             "208546.64"             " 61546.54" 
num_cc_carrying_balance             loan_amount           interest_rate 
            "     5.06"             " 17083.00"             "    11.57" 
 public_record_bankrupt            total_income 
            "     0.08"             "105220.56" 

Notice that you often wrap a function inside another function. The only problem is that it’s easy to lose track of all the parentheses.
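One way to avoid losing track of nested parentheses is the pipe operator we’ve already been using with ggplot(). As a sketch, assuming loan50 is loaded as above, the nested call could be rewritten so each step reads left to right:

```r
#. same computation as the nested call, one step per line
loan50[sapply(loan50, is.numeric)] |>  #. keep only numeric columns
  colMeans() |>                        #. mean of each remaining column
  format(scientific = FALSE)           #. suppress scientific notation
```

Each function receives the previous line’s result as its first argument, so the reader never has to match parentheses inside out.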

The next topic in the textbook is the box plot. This also gives an opportunity to talk about the quartiles and the median. We can display a box plot as follows.

loan50 |>
  ggplot(aes(annual_income)) +
  geom_boxplot() +
  scale_x_continuous(labels = comma_format())

The thick line in the middle of the box is the median, the middle value of the data set. The box itself is bound by the first and third quartiles, known as hinges. The full name of this construct is actually a box and whiskers plot and the lines extending horizontally from the box are called whiskers. The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points and are plotted individually.
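We can compute the hinges and whisker limits by hand to check our understanding. This is a sketch on a small made-up vector; note that ggplot2 computes hinges slightly differently from quantile()’s default algorithm, so the values can differ a little at the margins.

```r
#. a small synthetic vector with one obvious outlier
x <- c(1:10, 100)
q <- quantile(x, c(0.25, 0.75))     #. first and third quartiles (the hinges)
iqr <- unname(q[2] - q[1])          #. inter-quartile range
upper <- unname(q[2] + 1.5 * iqr)   #. whiskers extend no further than these
lower <- unname(q[1] - 1.5 * iqr)
x[x > upper | x < lower]            #. the outlying points, plotted individually
```

Here the hinges are 3.5 and 8.5, the IQR is 5, so the whisker limits are -4 and 16, and only the value 100 is flagged as outlying.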

We often want to create several box plots and compare them. This is easy to do as follows.

loan50 |>
  ggplot(aes(annual_income,homeownership)) +
  geom_boxplot() +
  scale_x_continuous(labels = comma_format())

The textbook’s next topic is Robust Statistics, and we’re going to pretty much skip that for now, except to say that outliers, as shown in the textbook, can affect the value of some statistics more than others. The mean and median are a good example: the mean can be dragged way up or down by the presence of just one or a few outliers, whereas the median is much more robust to them. As a kind of thought experiment, consider the following data set, the addition of an outlier, and the effect of that outlier on the mean and median.

x<-c(2,3,3,3,4,4,4,5,5,6,6)
mean(x)
[1] 4
median(x)
[1] 4
y<-c(2,3,3,3,4,4,4,5,5,6,6,800)
mean(y)
[1] 70
median(y)
[1] 4

2.7 Uncertainty in summaries

How can we portray our uncertainty about estimates of parameters? Consider two different vectors:

u <- c(1,2,3,4,5,6,7,8,9)
v <- c(3,4,5,5,5,5,5,6,7)
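We can summarize both vectors numerically. The definitions are repeated so this sketch stands alone.

```r
u <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
v <- c(3, 4, 5, 5, 5, 5, 5, 6, 7)
c(mean(u), mean(v))      #. both means are 5
c(median(u), median(v))  #. both medians are 5
c(sd(u), sd(v))          #. u is much more spread out than v
```

The centers agree exactly, so only the standard deviations reveal how differently the two samples spread around 5.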

Suppose that these are samples from two different populations. For both of these vectors, the best estimate is 5. Both the mean and median are 5 in both cases. But when estimating, we’re much more sure of our estimate in the case of v than u. We quantify that with variance or its square root, standard deviation. But those numbers don’t mean much to most people. It’s easier to portray uncertainty graphically than numerically. The typical ways to do so are to make boxplots, violin plots, or barcharts with error bars if we’re comparing two or more groups. If we are estimating a continuous variable, say \(y\), at many different values of \(x\), we can show our uncertainty by estimating standard deviation for portions of the data and putting them together as follows in an example from the R Graph Gallery.

pacman::p_load(hrbrthemes)
#. Create synthetic data
df <- data.frame(
  x = 1:100 + rnorm(100,sd=9),
  y = 1:100 + rnorm(100,sd=16)
)
#. linear trend + confidence interval
ggplot(df, aes(x=x, y=y)) +
  geom_point() +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum()

Notice that the confidence interval is narrower at the center than at the extremes. Why? At the center the estimate is informed by data on both sides, so there is less uncertainty there than at the ends, where the data is truncated.

Suppose we’re doing something even more sophisticated, such as comparing groups of continuous \(y\) variables. A ridgeline plot is a popular solution to show variability, which is often similar to uncertainty, as shown in this example from the R Graph Gallery.

#. packages
pacman::p_load(ggridges)
pacman::p_load(viridis)

#. Plot
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = `Month`, fill = ..x..)) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis(name = "Temp. [F]", option = "C") +
  labs(title = 'Temperatures in Lincoln NE in 2016') +
  theme_ipsum() +
    theme(
      legend.position="none",
      panel.spacing = unit(0.1, "lines"),
      strip.text.x = element_text(size = 8)
    )
Warning: The dot-dot notation (`..x..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(x)` instead.

Not only are the temperatures higher in the summer than in the winter, they are less variable, so we’re more certain of what the temperature might be on any given day in summer.

2.8 Categorical Data

We’ve already seen contingency tables and how to manipulate them; this section introduces them in more detail. We’ve also seen a mosaic plot. Another kind of plot introduced in this section is the bar plot. We’ll examine each of these.

load(paste0(Sys.getenv("STATS_DATA_DIR"),"/migraine.rda"))
tbl <- with(migraine,table(pain_free,group))
tbl <- addmargins(tbl)
pacman::p_load(kableExtra)
tbl |>
  kbl() |>
  kable_classic(full_width=F) |>
  row_spec(3, color = "white", background = "#AAAAAA") |>
  column_spec(4, color = "white", background = "#AAAAAA") |>
  column_spec(1, color = "black", background = "white")
pain_free   control   treatment   Sum
no               44          33    77
yes               2          10    12
Sum              46          43    89
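It may also help to see the table as percentages within each group rather than raw counts. One way to do that is base R’s prop.table(); this sketch rebuilds the counts so the chunk stands alone.

```r
#. the migraine contingency table counts from above
tbl <- matrix(c(44, 33, 2, 10), nrow = 2, byrow = TRUE,
              dimnames = list(pain_free = c("no", "yes"),
                              group = c("control", "treatment")))
#. margin = 2 means each column (group) sums to 100 percent
round(100 * prop.table(tbl, margin = 2), 1)
```

The column percentages reproduce the answers to Exercise 1.1: 23.3 percent of the treatment group and 4.3 percent of the control group were pain free.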

As before, we can create a mosaic plot.

tbl <- with(migraine,table(pain_free,group))
mosaicplot(tbl)

We can also create bar plots.

loan50 |>
  ggplot(aes(homeownership)) +
    geom_bar()

loan50 |>
  ggplot(aes(homeownership,fill=loan_purpose)) +
    geom_bar() +
    scale_fill_brewer()

The above looks better but is misleading because it implies an ordinal relationship among the loan purposes, and there is no such relationship. We would be better off specifying that the loan_purpose variable is qualitative.

loan50 |>
  ggplot(aes(homeownership,fill=loan_purpose)) +
    geom_bar() +
    scale_fill_brewer(type="qual",palette="Set1")

I find the above palette to be ugly, but several others are available; Google colorbrewer for more info. By the way, these colors have been extensively tested psychologically to verify that people can easily distinguish between them. I’m less sure how well they work for colorblind readers, since the most prevalent form of color blindness is red-green.
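ColorBrewer actually flags which of its palettes are judged safe for colorblind viewers. As a sketch, the RColorBrewer package (which scale_fill_brewer() uses under the hood) can display just those palettes:

```r
pacman::p_load(RColorBrewer)
#. show only the palettes flagged as colorblind-friendly
display.brewer.all(colorblindFriendly = TRUE)
#. the underlying flags live in a lookup table
head(brewer.pal.info)
```

The colorblind column of brewer.pal.info is what the colorblindFriendly argument consults, so you can also filter it yourself when choosing a palette.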

2.9 Thursday’s class

2.9.1 Statistical summary tools

pacman::p_load(ISLR2)
data(Auto)
Auto <- na.omit(Auto)
#. note: with() evaluates in a temporary environment, so the next line
#. does NOT modify Auto; to actually convert the column you would need
#. Auto$cylinders <- as.factor(Auto$cylinders)
with(Auto,cylinders<-as.factor(cylinders))

Here are some statistical summary questions we can answer about the Auto data set.

  1. What is the range of each quantitative predictor? You can answer this using the range() function.
with(Auto,range(mpg))
[1]  9 47
with(Auto,range(displacement))
[1]  68 455
sapply(Auto[,c(1,3:7)],range)
     mpg displacement horsepower weight acceleration year
[1,]   9           68         46   1613            8   70
[2,]  47          455        230   5140           25   82
sapply(Auto[,sapply(Auto,is.numeric)],range)
     mpg cylinders displacement horsepower weight acceleration year origin
[1,]   9         3           68         46   1613            8   70      1
[2,]  47         8          455        230   5140           25   82      3
  2. What is the mean and standard deviation of each quantitative predictor?
sapply(Auto[,c(1,3:7)],mean)
         mpg displacement   horsepower       weight acceleration         year 
          23          194          104         2978           16           76 
sapply(Auto[,c(1,3:7)],sd)
         mpg displacement   horsepower       weight acceleration         year 
           8          105           38          849            3            4 
as.data.frame(t(sapply(Auto[,c(1,3:7)],function(bla) list(means=mean(bla),sds=sd(bla),ranges=range(bla)))))
             means sds     ranges
mpg             23   8      9, 47
displacement   194 105    68, 455
horsepower     104  38    46, 230
weight        2978 849 1613, 5140
acceleration    16   3      8, 25
year            76   4     70, 82
  3. Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?
as.data.frame(t(sapply(Auto[-10:-85,c(1,3:7)],function(bla) list(means=mean(bla),sds=sd(bla),ranges=range(bla)))))
             means sds     ranges
mpg             24   8     11, 47
displacement   187 100    68, 455
horsepower     101  36    46, 230
weight        2936 811 1649, 4997
acceleration    16   3      8, 25
year            77   3     70, 82

2.9.2 Numerical summary shortcuts

Several functions allow you to check up on many columns at once. This is especially helpful when examining data frames with large numbers of columns, such as amesHousing2011.csv.

df <- read_csv(paste0(Sys.getenv("STATS_DATA_DIR"),"/amesHousing2011.csv"))
str(df)
spc_tbl_ [2,925 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Order        : num [1:2925] 1498 2738 2446 2667 2451 ...
 $ PID          : chr [1:2925] "0908154080" "0905427030" "0528320060" "0902400110" ...
 $ MSSubClass   : chr [1:2925] "020" "075" "060" "075" ...
 $ MSZoning     : chr [1:2925] "RL" "RL" "RL" "RM" ...
 $ LotFrontage  : num [1:2925] 123 60 118 90 114 87 NA 60 60 47 ...
 $ LotArea      : num [1:2925] 47007 19800 35760 22950 17242 ...
 $ Street       : chr [1:2925] "Pave" "Pave" "Pave" "Pave" ...
 $ Alley        : chr [1:2925] NA NA NA NA ...
 $ LotShape     : chr [1:2925] "IR1" "Reg" "IR1" "IR2" ...
 $ LandContour  : chr [1:2925] "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ Utilities    : chr [1:2925] "AllPub" "AllPub" "AllPub" "AllPub" ...
 $ LotConfig    : chr [1:2925] "Inside" "Inside" "CulDSac" "Inside" ...
 $ LandSlope    : chr [1:2925] "Gtl" "Gtl" "Gtl" "Gtl" ...
 $ Neighborhood : chr [1:2925] "Edwards" "Edwards" "NoRidge" "OldTown" ...
 $ Condition1   : chr [1:2925] "Norm" "Norm" "Norm" "Artery" ...
 $ Condition2   : chr [1:2925] "Norm" "Norm" "Norm" "Norm" ...
 $ BldgType     : chr [1:2925] "1Fam" "1Fam" "1Fam" "1Fam" ...
 $ HouseStyle   : chr [1:2925] "1Story" "2.5Unf" "2Story" "2.5Fin" ...
 $ OverallQual  : num [1:2925] 5 6 10 10 9 7 8 6 10 8 ...
 $ OverallCond  : num [1:2925] 7 8 5 9 5 9 9 7 5 5 ...
 $ YearBuilt    : num [1:2925] 1959 1935 1995 1892 1993 ...
 $ YearRemod/Add: num [1:2925] 1996 1990 1996 1993 1994 ...
 $ RoofStyle    : chr [1:2925] "Gable" "Gable" "Hip" "Gable" ...
 $ RoofMatl     : chr [1:2925] "CompShg" "CompShg" "CompShg" "WdShngl" ...
 $ Exterior1st  : chr [1:2925] "Plywood" "BrkFace" "HdBoard" "Wd Sdng" ...
 $ Exterior2nd  : chr [1:2925] "Plywood" "Wd Sdng" "HdBoard" "Wd Sdng" ...
 $ MasVnrType   : chr [1:2925] "None" "None" "BrkFace" "None" ...
 $ MasVnrArea   : num [1:2925] 0 0 1378 0 738 ...
 $ ExterQual    : chr [1:2925] "TA" "TA" "Gd" "Gd" ...
 $ ExterCond    : chr [1:2925] "TA" "TA" "Gd" "Gd" ...
 $ Foundation   : chr [1:2925] "Slab" "BrkTil" "PConc" "BrkTil" ...
 $ BsmtQual     : chr [1:2925] NA "TA" "Ex" "TA" ...
 $ BsmtCond     : chr [1:2925] NA "TA" "TA" "TA" ...
 $ BsmtExposure : chr [1:2925] NA "No" "Gd" "Mn" ...
 $ BsmtFinType1 : chr [1:2925] NA "Rec" "GLQ" "Unf" ...
 $ BsmtFinSF1   : num [1:2925] 0 425 1387 0 292 ...
 $ BsmtFinType2 : chr [1:2925] NA "Unf" "Unf" "Unf" ...
 $ BsmtFinSF2   : num [1:2925] 0 0 0 0 1393 ...
 $ BsmtUnfSF    : num [1:2925] 0 1411 543 1107 48 ...
 $ TotalBsmtSF  : num [1:2925] 0 1836 1930 1107 1733 ...
 $ Heating      : chr [1:2925] "GasA" "GasA" "GasA" "GasA" ...
 $ HeatingQC    : chr [1:2925] "TA" "Gd" "Ex" "Ex" ...
 $ CentralAir   : chr [1:2925] "Y" "Y" "Y" "Y" ...
 $ Electrical   : chr [1:2925] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
 $ 1stFlrSF     : num [1:2925] 3820 1836 1831 1518 1933 ...
 $ 2ndFlrSF     : num [1:2925] 0 1836 1796 1518 1567 ...
 $ LowQualFinSF : num [1:2925] 0 0 0 572 0 0 0 515 0 0 ...
 $ GrLivArea    : num [1:2925] 3820 3672 3627 3608 3500 ...
 $ BsmtFullBath : num [1:2925] NA 0 1 0 1 0 0 0 0 1 ...
 $ BsmtHalfBath : num [1:2925] NA 0 0 0 0 0 0 0 0 0 ...
 $ FullBath     : num [1:2925] 3 3 3 2 3 3 3 2 3 3 ...
 $ HalfBath     : num [1:2925] 1 1 1 1 1 0 1 0 1 1 ...
 $ BedroomAbvGr : num [1:2925] 5 5 4 4 4 3 4 8 5 4 ...
 $ KitchenAbvGr : num [1:2925] 1 1 1 1 1 1 1 2 1 1 ...
 $ KitchenQual  : chr [1:2925] "Ex" "Gd" "Gd" "Ex" ...
 $ TotRmsAbvGrd : num [1:2925] 11 7 10 12 11 10 11 14 10 12 ...
 $ Functional   : chr [1:2925] "Typ" "Typ" "Typ" "Typ" ...
 $ Fireplaces   : num [1:2925] 2 2 1 2 1 1 2 0 1 1 ...
 $ FireplaceQu  : chr [1:2925] "Gd" "Gd" "TA" "TA" ...
 $ GarageType   : chr [1:2925] "Attchd" "Detchd" "Attchd" "Detchd" ...
 $ GarageYrBlt  : num [1:2925] 1959 1993 1995 1993 1993 ...
 $ GarageFinish : chr [1:2925] "Unf" "Unf" "Fin" "Unf" ...
 $ GarageCars   : num [1:2925] 2 2 3 3 3 3 3 0 3 3 ...
 $ GarageArea   : num [1:2925] 624 836 807 840 959 ...
 $ GarageQual   : chr [1:2925] "TA" "TA" "TA" "Ex" ...
 $ GarageCond   : chr [1:2925] "TA" "TA" "TA" "TA" ...
 $ PavedDrive   : chr [1:2925] "Y" "Y" "Y" "Y" ...
 $ WoodDeckSF   : num [1:2925] 0 684 361 0 870 302 314 0 204 503 ...
 $ OpenPorchSF  : num [1:2925] 372 80 76 260 86 0 12 110 34 36 ...
 $ EnclosedPorch: num [1:2925] 0 32 0 0 0 0 0 0 0 0 ...
 $ 3SsnPorch    : num [1:2925] 0 0 0 0 0 0 0 0 0 0 ...
 $ ScreenPorch  : num [1:2925] 0 0 0 410 210 0 0 0 0 210 ...
 $ PoolArea     : num [1:2925] 0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : chr [1:2925] NA NA NA NA ...
 $ Fence        : chr [1:2925] NA NA NA "GdPrv" ...
 $ MiscFeature  : chr [1:2925] NA NA NA NA ...
 $ MiscVal      : num [1:2925] 0 0 0 0 0 0 0 0 0 0 ...
 $ MoSold       : num [1:2925] 7 12 7 6 5 5 5 3 9 6 ...
 $ YrSold       : num [1:2925] 2008 2006 2006 2006 2006 ...
 $ SaleType     : chr [1:2925] "WD" "WD" "WD" "WD" ...
 $ SaleCondition: chr [1:2925] "Normal" "Normal" "Normal" "Normal" ...
 $ SalePrice    : num [1:2925] 284700 415000 625000 475000 584500 ...
 - attr(*, "spec")=
  .. cols(
  ..   Order = col_double(),
  ..   PID = col_character(),
  ..   MSSubClass = col_character(),
  ..   MSZoning = col_character(),
  ..   LotFrontage = col_double(),
  ..   LotArea = col_double(),
  ..   Street = col_character(),
  ..   Alley = col_character(),
  ..   LotShape = col_character(),
  ..   LandContour = col_character(),
  ..   Utilities = col_character(),
  ..   LotConfig = col_character(),
  ..   LandSlope = col_character(),
  ..   Neighborhood = col_character(),
  ..   Condition1 = col_character(),
  ..   Condition2 = col_character(),
  ..   BldgType = col_character(),
  ..   HouseStyle = col_character(),
  ..   OverallQual = col_double(),
  ..   OverallCond = col_double(),
  ..   YearBuilt = col_double(),
  ..   `YearRemod/Add` = col_double(),
  ..   RoofStyle = col_character(),
  ..   RoofMatl = col_character(),
  ..   Exterior1st = col_character(),
  ..   Exterior2nd = col_character(),
  ..   MasVnrType = col_character(),
  ..   MasVnrArea = col_double(),
  ..   ExterQual = col_character(),
  ..   ExterCond = col_character(),
  ..   Foundation = col_character(),
  ..   BsmtQual = col_character(),
  ..   BsmtCond = col_character(),
  ..   BsmtExposure = col_character(),
  ..   BsmtFinType1 = col_character(),
  ..   BsmtFinSF1 = col_double(),
  ..   BsmtFinType2 = col_character(),
  ..   BsmtFinSF2 = col_double(),
  ..   BsmtUnfSF = col_double(),
  ..   TotalBsmtSF = col_double(),
  ..   Heating = col_character(),
  ..   HeatingQC = col_character(),
  ..   CentralAir = col_character(),
  ..   Electrical = col_character(),
  ..   `1stFlrSF` = col_double(),
  ..   `2ndFlrSF` = col_double(),
  ..   LowQualFinSF = col_double(),
  ..   GrLivArea = col_double(),
  ..   BsmtFullBath = col_double(),
  ..   BsmtHalfBath = col_double(),
  ..   FullBath = col_double(),
  ..   HalfBath = col_double(),
  ..   BedroomAbvGr = col_double(),
  ..   KitchenAbvGr = col_double(),
  ..   KitchenQual = col_character(),
  ..   TotRmsAbvGrd = col_double(),
  ..   Functional = col_character(),
  ..   Fireplaces = col_double(),
  ..   FireplaceQu = col_character(),
  ..   GarageType = col_character(),
  ..   GarageYrBlt = col_double(),
  ..   GarageFinish = col_character(),
  ..   GarageCars = col_double(),
  ..   GarageArea = col_double(),
  ..   GarageQual = col_character(),
  ..   GarageCond = col_character(),
  ..   PavedDrive = col_character(),
  ..   WoodDeckSF = col_double(),
  ..   OpenPorchSF = col_double(),
  ..   EnclosedPorch = col_double(),
  ..   `3SsnPorch` = col_double(),
  ..   ScreenPorch = col_double(),
  ..   PoolArea = col_double(),
  ..   PoolQC = col_character(),
  ..   Fence = col_character(),
  ..   MiscFeature = col_character(),
  ..   MiscVal = col_double(),
  ..   MoSold = col_double(),
  ..   YrSold = col_double(),
  ..   SaleType = col_character(),
  ..   SaleCondition = col_character(),
  ..   SalePrice = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
df <- df |> select(-c("Order","PID"))
df |> select(where(is.character)) |> map ( table )
$MSSubClass

 020  030  040  045  050  060  070  075  080  085  090  120  150  160  180  190 
1078  139    6   18  287  571  128   23  118   48  109  192    1  129   17   61 

$MSZoning

A (agr) C (all)      FV I (all)      RH      RL      RM 
      2      25     139       2      27    2268     462 

$Street

Grvl Pave 
  12 2913 

$Alley

Grvl Pave 
 120   78 

$LotShape

 IR1  IR2  IR3  Reg 
 975   76   15 1859 

$LandContour

 Bnk  HLS  Low  Lvl 
 114  120   60 2631 

$Utilities

AllPub NoSeWa NoSewr 
  2922      1      2 

$LotConfig

 Corner CulDSac     FR2     FR3  Inside 
    508     180      85      14    2138 

$LandSlope

 Gtl  Mod  Sev 
2784  125   16 

$Neighborhood

Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert  Greens 
     28      10      30     108      44     267     103     191     165       8 
GrnHill  IDOTRR Landmrk MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes 
      2      93       1      37     114     443      69      23     166     131 
OldTown  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker 
    239     151     125     182      51      48      72      24 

$Condition1

Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNe   RRNn 
    92    163   2519     20     38     28     50      6      9 

$Condition2

Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNn 
     5     13   2896      4      3      1      1      2 

$BldgType

  1Fam 2fmCon Duplex  Twnhs TwnhsE 
  2420     62    109    101    233 

$HouseStyle

1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer   SLvl 
   314     19   1480      8     24    869     83    128 

$RoofStyle

   Flat   Gable Gambrel     Hip Mansard    Shed 
     20    2320      22     547      11       5 

$RoofMatl

CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
   2884       1       1       1      23       9       6 

$Exterior1st

AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood 
     44       2       6      88       2     124     441       1     450     221 
PreCast   Stone  Stucco VinylSd Wd Sdng WdShing 
      1       2      42    1026     419      56 

$Exterior2nd

AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd   Other 
     38       4      22      47       3     124     405      14     447       1 
Plywood PreCast   Stone  Stucco VinylSd Wd Sdng Wd Shng 
    274       1       6      46    1015     397      81 

$MasVnrType

 BrkCmn BrkFace  CBlock    None   Stone 
     25     879       1    1751     246 

$ExterQual

  Ex   Fa   Gd   TA 
 103   35  988 1799 

$ExterCond

  Ex   Fa   Gd   Po   TA 
  12   67  299    3 2544 

$Foundation

BrkTil CBlock  PConc   Slab  Stone   Wood 
   311   1244   1305     49     11      5 

$BsmtQual

  Ex   Fa   Gd   Po   TA 
 253   88 1219    2 1283 

$BsmtCond

  Ex   Fa   Gd   Po   TA 
   3  104  122    5 2611 

$BsmtExposure

  Av   Gd   Mn   No 
 417  280  239 1906 

$BsmtFinType1

ALQ BLQ GLQ LwQ Rec Unf 
429 269 854 154 288 851 

$BsmtFinType2

 ALQ  BLQ  GLQ  LwQ  Rec  Unf 
  53   68   34   89  106 2494 

$Heating

Floor  GasA  GasW  Grav  OthW  Wall 
    1  2880    27     9     2     6 

$HeatingQC

  Ex   Fa   Gd   Po   TA 
1490   92  476    3  864 

$CentralAir

   N    Y 
 196 2729 

$Electrical

FuseA FuseF FuseP   Mix SBrkr 
  188    50     8     1  2677 

$KitchenQual

  Ex   Fa   Gd   Po   TA 
 200   70 1160    1 1494 

$Functional

Maj1 Maj2 Min1 Min2  Mod  Sal  Sev  Typ 
  19    9   65   70   35    2    2 2723 

$FireplaceQu

 Ex  Fa  Gd  Po  TA 
 42  75 741  46 599 

$GarageType

 2Types  Attchd Basment BuiltIn CarPort  Detchd 
     23    1727      36     185      15     782 

$GarageFinish

 Fin  RFn  Unf 
 723  812 1231 

$GarageQual

  Ex   Fa   Gd   Po   TA 
   3  124   24    5 2610 

$GarageCond

  Ex   Fa   Gd   Po   TA 
   3   74   15   14 2660 

$PavedDrive

   N    P    Y 
 216   62 2647 

$PoolQC

Ex Fa Gd TA 
 3  2  3  3 

$Fence

GdPrv  GdWo MnPrv  MnWw 
  118   112   329    12 

$MiscFeature

Gar2 Othr Shed TenC 
   5    4   95    1 

$SaleType

  COD   Con ConLD ConLI ConLw   CWD   New   Oth   VWD    WD 
   87     5    26     9     8    12   236     7     1  2534 

$SaleCondition

Abnorml AdjLand  Alloca  Family  Normal Partial 
    189      12      24      46    2412     242 
df |> select(where(is.numeric)) |> map(summary)
$LotFrontage
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     21      58      68      69      80     313     490 

$LotArea
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1300    7438    9428   10104   11515  215245 

$OverallQual
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       5       6       6       7      10 

$OverallCond
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       5       5       6       6       9 

$YearBuilt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1872    1954    1973    1971    2001    2010 

$`YearRemod/Add`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1950    1965    1993    1984    2004    2010 

$MasVnrArea
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0     101     164    1600      23 

$BsmtFinSF1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0     370     438     733    2288       1 

$BsmtFinSF2
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0      50       0    1526       1 

$BsmtUnfSF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0     219     464     559     801    2336       1 

$TotalBsmtSF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0     793     990    1047    1299    3206       1 

$`1stFlrSF`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    334     876    1082    1155    1383    3820 

$`2ndFlrSF`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0     334     702    1862 

$LowQualFinSF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       5       0    1064 

$GrLivArea
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    334    1126    1441    1494    1740    3820 

$BsmtFullBath
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.0     0.4     1.0     3.0       2 

$BsmtHalfBath
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.0     0.1     0.0     2.0       2 

$FullBath
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       1       2       2       2       4 

$HalfBath
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     0.0     0.4     1.0     2.0 

$BedroomAbvGr
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       2       3       3       3       8 

$KitchenAbvGr
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       1       1       1       1       3 

$TotRmsAbvGrd
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2       5       6       6       7      14 

$Fireplaces
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       1       1       1       4 

$GarageYrBlt
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1895    1960    1979    1978    2002    2207     159 

$GarageCars
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       1       2       2       2       5       1 

$GarageArea
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0     320     480     472     576    1488       1 

$WoodDeckSF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0      93     168    1424 

$OpenPorchSF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0      27      47      70     742 

$EnclosedPorch
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0      23       0    1012 

$`3SsnPorch`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       3       0     508 

$ScreenPorch
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0      16       0     576 

$PoolArea
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       2       0     800 

$MiscVal
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0      45       0   15500 

$MoSold
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       4       6       6       8      12 

$YrSold
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2006    2007    2008    2008    2009    2010 

$SalePrice
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12789  129500  160000  180412  213500  625000 

Using the above functions, you can quickly and easily determine that you should delete some columns and convert others to factor or integer. You can delete columns by saying, for instance,

df <- df |> select(-c("Order","PID"))

You may add as many names to the list in the -c() vector as desired.

You may wish to delete columns with a lot of NAs rather than deleting rows with lots of NAs. (Why is this?)

This is easy to discover for numerical columns. For instance, LotFrontage has 490 NAs, which you can see at a glance in the summary output above. The summary does not reveal NAs in character columns, though. One quick way to see the number of NAs in all columns is to say

colSums(is.na(df)) |> sort()
   MSSubClass      MSZoning       LotArea        Street      LotShape 
            0             0             0             0             0 
  LandContour     Utilities     LotConfig     LandSlope  Neighborhood 
            0             0             0             0             0 
   Condition1    Condition2      BldgType    HouseStyle   OverallQual 
            0             0             0             0             0 
  OverallCond     YearBuilt YearRemod/Add     RoofStyle      RoofMatl 
            0             0             0             0             0 
  Exterior1st   Exterior2nd     ExterQual     ExterCond    Foundation 
            0             0             0             0             0 
      Heating     HeatingQC    CentralAir      1stFlrSF      2ndFlrSF 
            0             0             0             0             0 
 LowQualFinSF     GrLivArea      FullBath      HalfBath  BedroomAbvGr 
            0             0             0             0             0 
 KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional    Fireplaces 
            0             0             0             0             0 
   PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch 
            0             0             0             0             0 
  ScreenPorch      PoolArea       MiscVal        MoSold        YrSold 
            0             0             0             0             0 
     SaleType SaleCondition     SalePrice    BsmtFinSF1    BsmtFinSF2 
            0             0             0             1             1 
    BsmtUnfSF   TotalBsmtSF    Electrical    GarageCars    GarageArea 
            1             1             1             1             1 
 BsmtFullBath  BsmtHalfBath    MasVnrType    MasVnrArea      BsmtQual 
            2             2            23            23            80 
     BsmtCond  BsmtFinType1  BsmtFinType2  BsmtExposure    GarageType 
           80            80            81            83           157 
  GarageYrBlt  GarageFinish    GarageQual    GarageCond   LotFrontage 
          159           159           159           159           490 
  FireplaceQu         Fence         Alley   MiscFeature        PoolQC 
         1422          2354          2727          2820          2914 

This makes it clear that you should not use PoolQC, MiscFeature, Alley, Fence, or FireplaceQu, in addition to LotFrontage.
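Rather than listing such columns by hand, you could drop every column whose share of NAs exceeds a threshold. This is an optional sketch, not part of the assignment; the 0.4 cutoff is an arbitrary choice you would tune to your purposes.

```r
#. keep only columns in which at most 40 percent of the values are NA
#. (the 0.4 cutoff is arbitrary; adjust to taste)
df_trimmed <- df |> select(where(~ mean(is.na(.x)) <= 0.4))
```

This reuses the same select(where(...)) idiom shown earlier, with a predicate on the proportion of missing values instead of the column type.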

A better-looking display may be obtained by saying

v <- colSums(is.na(df))
v[v>0] |> sort(decreasing=TRUE) |> as.data.frame()
             sort(v[v > 0], decreasing = TRUE)
PoolQC                                    2914
MiscFeature                               2820
Alley                                     2727
Fence                                     2354
FireplaceQu                               1422
LotFrontage                                490
GarageYrBlt                                159
GarageFinish                               159
GarageQual                                 159
GarageCond                                 159
GarageType                                 157
BsmtExposure                                83
BsmtFinType2                                81
BsmtQual                                    80
BsmtCond                                    80
BsmtFinType1                                80
MasVnrType                                  23
MasVnrArea                                  23
BsmtFullBath                                 2
BsmtHalfBath                                 2
BsmtFinSF1                                   1
BsmtFinSF2                                   1
BsmtUnfSF                                    1
TotalBsmtSF                                  1
Electrical                                   1
GarageCars                                   1
GarageArea                                   1

2.9.3 Exercises

Create a Quarto document called week02exercises.qmd. Use your name as the author name and the date as the current date. Make the title within the document “Week 2 Exercises”.

Answer the following questions in the document, using a combination of narration and R chunks.

  1. Use the loan50 data set. Find the mean and median of annual_income using R. Tell why they differ in words.

  2. Use the loan50 data set. Make a contingency table of loan_purpose and grade. Tell the most frequently occurring grade and most frequently occurring loan purpose in words.

  3. Use the loan50 data set. Provide a statistical summary of total_credit_limit.

  4. Use the loan50 data set. Show the column means for all numeric columns.

  5. Use the loan50 data set. Make a contingency table of state and homeownership. Tell which state has the most mortgages in words.

Now render the document and submit both the .qmd file and the .html file to Canvas under “week02exercises”.

2.9.4 Solutions to exercises

  1. Use the loan50 data set. Find the mean and median of annual_income using R. Tell why they differ in words.
load(paste0(Sys.getenv("STATS_DATA_DIR"),"/loan50.rda"))
with(loan50,mean(annual_income))
[1] 86170
with(loan50,median(annual_income))
[1] 74000

They differ because the mean is susceptible to outliers. There are about four outliers in this data set (high annual incomes) and they drag the mean upward but not the median. The median is a more reliable measure of centrality when there are influential outliers.
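You can see the effect with a toy vector (not the loan50 data): a single large value moves the mean substantially while leaving the median nearly unchanged. The trimmed mean shown at the end is an extra robust option, not something the exercise required.

```r
x <- c(30, 40, 50, 60, 70, 700)  # five typical values plus one outlier
mean(x)                          # 158.33..., dragged upward by the outlier
median(x)                        # 55, barely affected
mean(x, trim = 0.2)              # 55; trimming discards the extremes
```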

  2. Use the loan50 data set. Make a contingency table of loan_purpose and grade. Tell the most frequently occurring grade and most frequently occurring loan purpose in words.
with(loan50,addmargins(table(loan_purpose,grade)))
                    grade
loan_purpose             A  B  C  D  E  F  G Sum
                      0  0  0  0  0  0  0  0   0
  car                 0  0  1  1  0  0  0  0   2
  credit_card         0  6  4  1  1  1  0  0  13
  debt_consolidation  0  2  9  4  7  1  0  0  23
  home_improvement    0  1  4  0  0  0  0  0   5
  house               0  0  1  0  0  0  0  0   1
  major_purchase      0  0  0  0  0  0  0  0   0
  medical             0  0  0  0  0  0  0  0   0
  moving              0  0  0  0  0  0  0  0   0
  other               0  4  0  0  0  0  0  0   4
  renewable_energy    0  1  0  0  0  0  0  0   1
  small_business      0  1  0  0  0  0  0  0   1
  vacation            0  0  0  0  0  0  0  0   0
  wedding             0  0  0  0  0  0  0  0   0
  Sum                 0 15 19  6  8  2  0  0  50

The most frequently occurring purpose is debt consolidation, while the most frequently occurring grade is B.

  3. Use the loan50 data set. Provide a statistical summary of total_credit_limit.
with(loan50,summary(total_credit_limit))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15980   70526  147364  208547  299766  793009 
Note that scales::comma_format() returns a formatting function rather than formatted values, so passing the summary to it directly would just print that function. Call the returned function on the summary instead:

with(loan50,scales::comma_format()(summary(total_credit_limit)))
#. same info with commas in numbers
loan50 |>
    summarise(Min=comma(min(total_credit_limit)),
              firstq=comma(quantile(total_credit_limit,0.25)),
              Median=comma(median(total_credit_limit)),
              Mean=comma(mean(total_credit_limit)),
              thirdq=comma(quantile(total_credit_limit,0.75)),
              Max=comma(max(total_credit_limit)))
     Min firstq  Median    Mean  thirdq     Max
1 15,980 70,526 147,364 208,547 299,766 793,009
  4. Use the loan50 data set. Show the column means for all numeric columns.
options(digits=1)
format(colMeans(loan50[sapply(loan50, is.numeric)]),scientific=FALSE,big.mark=",")
             emp_length                    term           annual_income 
           "        NA"            "     42.72"            " 86,170.00" 
         debt_to_income      total_credit_limit   total_credit_utilized 
           "      0.72"            "208,546.64"            " 61,546.54" 
num_cc_carrying_balance             loan_amount           interest_rate 
           "      5.06"            " 17,083.00"            "     11.57" 
 public_record_bankrupt            total_income 
           "      0.08"            "105,220.56" 
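The NA shown for emp_length occurs because colMeans propagates missing values by default. If you want the mean of the non-missing values instead, pass na.rm=TRUE. This is a variation on the chunk above, not something the exercise required.

```r
#. na.rm=TRUE drops NAs before averaging, so emp_length gets a numeric mean
format(colMeans(loan50[sapply(loan50, is.numeric)], na.rm = TRUE),
       scientific = FALSE, big.mark = ",")
```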
  5. Use the loan50 data set. Make a contingency table of state and homeownership. Tell which state has the most mortgages in words.
with(loan50,table(state,homeownership))
     homeownership
state rent mortgage own
         0        0   0
   AK    0        0   0
   AL    0        0   0
   AR    0        0   0
   AZ    0        0   1
   CA    7        2   0
   CO    0        0   0
   CT    1        0   0
   DC    0        0   0
   DE    0        0   0
   FL    1        2   0
   GA    0        0   0
   HI    1        1   0
   ID    0        0   0
   IL    3        0   1
   IN    0        1   1
   KS    0        0   0
   KY    0        0   0
   LA    0        0   0
   MA    1        1   0
   MD    2        1   0
   ME    0        0   0
   MI    0        1   0
   MN    0        1   0
   MO    0        1   0
   MS    0        1   0
   MT    0        0   0
   NC    0        0   0
   ND    0        0   0
   NE    0        1   0
   NH    1        0   0
   NJ    2        1   0
   NM    0        0   0
   NV    0        2   0
   NY    1        0   0
   OH    0        1   0
   OK    0        0   0
   OR    0        0   0
   PA    0        0   0
   RI    0        1   0
   SC    0        1   0
   SD    0        0   0
   TN    0        0   0
   TX    0        5   0
   UT    0        0   0
   VA    1        0   0
   VT    0        0   0
   WA    0        0   0
   WI    0        1   0
   WV    0        1   0
   WY    0        0   0

Texas has five mortgages, more than any other state.
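Rather than scanning the table by eye, you can ask R for that state directly. This is an optional sketch, assuming the homeownership levels shown in the output above:

```r
tbl <- with(loan50, table(state, homeownership))
#. name of the row (state) with the largest count in the mortgage column
rownames(tbl)[which.max(tbl[, "mortgage"])]
```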

2.9.5 Exercise Notes

  1. Many students did not follow instructions on file naming. I will take off a lot of points if this happens when you turn in a graded assignment. I expect all files to be uniformly named.
  2. Several students left the boilerplate verbiage in their .qmd file. I will take off a lot of points if this happens when you turn in a graded assignment.
  3. One student put their narrative inside the code chunks as R comments. Don’t do this. It undercuts the purpose of mixing narrative and code in a Quarto document.
  4. Some students didn’t try to answer the second part of question 1. One way to understand this is to draw a boxplot of the data, showing that there are four outliers at the top end, dragging the mean upward but leaving the median pretty much alone.
loan50 |> ggplot(aes(annual_income)) + geom_boxplot()

  5. Some students included graphics, which don’t show up in the copy on Canvas. One way to make these graphics show up is to add the following code to the front matter (the front matter is the stuff between two sets of three dashes at the beginning of the file):
format:
  html:
    embed-resources: true

The indentation shown above is essential for it to work.

  6. Some students highlighted relevant rows and columns as shown below. This was a really great addition.
tbl <- with(loan50,table(loan_purpose,grade))
tbl <- addmargins(tbl)
pacman::p_load(kableExtra)
tbl |>
  kbl() |>
  kable_classic(full_width=F) |>
  row_spec(4, color = "white", background = "#AAAAAA") |>
  column_spec(4, color = "white", background = "#AAAAAA") |>
  column_spec(1, color = "black", background = "white")
A B C D E F G Sum
0 0 0 0 0 0 0 0 0
car 0 0 1 1 0 0 0 0 2
credit_card 0 6 4 1 1 1 0 0 13
debt_consolidation 0 2 9 4 7 1 0 0 23
home_improvement 0 1 4 0 0 0 0 0 5
house 0 0 1 0 0 0 0 0 1
major_purchase 0 0 0 0 0 0 0 0 0
medical 0 0 0 0 0 0 0 0 0
moving 0 0 0 0 0 0 0 0 0
other 0 4 0 0 0 0 0 0 4
renewable_energy 0 1 0 0 0 0 0 0 1
small_business 0 1 0 0 0 0 0 0 1
vacation 0 0 0 0 0 0 0 0 0
wedding 0 0 0 0 0 0 0 0 0
Sum 0 15 19 6 8 2 0 0 50

Another way to do this is to say

#. exclude the Sum margin before taking the maximum; the parentheses
#. make the intended precedence explicit
maxrow <- max(tbl[1:(length(levels(loan50$loan_purpose))-1),"Sum"])
maxcol <- max(tbl["Sum",1:(length(levels(loan50$grade))-1)])

and

maxrownum <- which.max(tbl[1:(length(levels(loan50$loan_purpose))-1),"Sum"])
maxcolnum <- which.max(tbl["Sum",1:(length(levels(loan50$grade))-1)])+1

Now you can plug maxrownum and maxcolnum into the formula without having to know which row and column you’re talking about.

tbl |>
  kbl() |>
  kable_classic(full_width=F) |>
  row_spec(maxrownum, color = "white", background = "#AAAAAA") |>
  column_spec(maxcolnum, color = "white", background = "#AAAAAA") |>
  column_spec(1, color = "black", background = "white")
A B C D E F G Sum
0 0 0 0 0 0 0 0 0
car 0 0 1 1 0 0 0 0 2
credit_card 0 6 4 1 1 1 0 0 13
debt_consolidation 0 2 9 4 7 1 0 0 23
home_improvement 0 1 4 0 0 0 0 0 5
house 0 0 1 0 0 0 0 0 1
major_purchase 0 0 0 0 0 0 0 0 0
medical 0 0 0 0 0 0 0 0 0
moving 0 0 0 0 0 0 0 0 0
other 0 4 0 0 0 0 0 0 4
renewable_energy 0 1 0 0 0 0 0 0 1
small_business 0 1 0 0 0 0 0 0 1
vacation 0 0 0 0 0 0 0 0 0
wedding 0 0 0 0 0 0 0 0 0
Sum 0 15 19 6 8 2 0 0 50

And, in the narrative, you can say that the maximum frequency of loan_purpose is 23. You can also say that the maximum frequency of grade is 19.

  7. One student stipulated that the mean and median could never be the same except in two unusual circumstances. Actually, it is quite easy for the mean to equal the median, as you can see from this simple example.
x <- c(1,2,3,4,5,6,7,8,9)
mean(x)
[1] 5
median(x)
[1] 5
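Appending a single outlier to the same vector separates the two measures immediately:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # the same vector with one outlier added
mean(x)    # 14.5, pulled up by the outlier
median(x)  # 5.5, nearly unchanged
```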