Stats: More R & Quarto

Mick McQuaid

2024-01-29

Week THREE

Week 2 Exercises

No pre-grading

Please don’t ask me to look over your homework to make sure that everything is okay. That amounts to pre-grading. If I say it looks okay and then, after you turn it in, I take points off for something I didn’t notice earlier, you will object, saying “But you said it was okay.” I don’t want to get into that kind of situation.

On the other hand, it’s okay to come to me with vague questions, like “I don’t get question two.” That opens up the possibility of explaining it better.

Removing instructions

You should remove the instructions from the file you turn in. That means remove the first two paragraphs and the last sentence. Leave the questions in and interleave the questions with your answers.

Texas has the most mortgages

Several people said that California did, even though California leads only in rentals.

Leave a blank line before headings

One person formatted the document incorrectly, leading to a heading not appearing as a heading. You should always review the work before you turn it in.

Name the files as I ask

I will take off points in future for files not correctly named.

Turn in the assignment on time!

I will go over the homework on the Monday after it’s due, so I can’t accept late submissions after that. If I accept something between Friday and Monday, it will be with a substantial penalty.

Scores

score <- c(2, 1.9, 1.9, 2, 1.9, 1.9, 2, 2, 1.9, 1.9, 1.9)
stem(score)

  The decimal point is 1 digit(s) to the left of the |

  19 | 0000000
  19 | 
  20 | 0000

More on scores

  • I went easy on you if you turned something in
  • Lesson: always turn something in
  • It will be harsher in future, though, as I expect more and more of you

Solutions

Look at the solution file! There are a lot of tips there!

More on R

Loading the project data:

pacman::p_load(tidyverse)
df <- read_csv(paste0(Sys.getenv("STATS_DATA_DIR"),"/amesHousing2011.csv"))
# df <- read_csv("amesHousing2011.csv")
str(df)
spc_tbl_ [2,925 × 82] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Order        : num [1:2925] 1498 2738 2446 2667 2451 ...
 $ PID          : chr [1:2925] "0908154080" "0905427030" "0528320060" "0902400110" ...
 $ MSSubClass   : chr [1:2925] "020" "075" "060" "075" ...
 $ MSZoning     : chr [1:2925] "RL" "RL" "RL" "RM" ...
 $ LotFrontage  : num [1:2925] 123 60 118 90 114 87 NA 60 60 47 ...
 $ LotArea      : num [1:2925] 47007 19800 35760 22950 17242 ...
 $ Street       : chr [1:2925] "Pave" "Pave" "Pave" "Pave" ...
 $ Alley        : chr [1:2925] NA NA NA NA ...
 $ LotShape     : chr [1:2925] "IR1" "Reg" "IR1" "IR2" ...
 $ LandContour  : chr [1:2925] "Lvl" "Lvl" "Lvl" "Lvl" ...
 $ Utilities    : chr [1:2925] "AllPub" "AllPub" "AllPub" "AllPub" ...
 $ LotConfig    : chr [1:2925] "Inside" "Inside" "CulDSac" "Inside" ...
 $ LandSlope    : chr [1:2925] "Gtl" "Gtl" "Gtl" "Gtl" ...
 $ Neighborhood : chr [1:2925] "Edwards" "Edwards" "NoRidge" "OldTown" ...
 $ Condition1   : chr [1:2925] "Norm" "Norm" "Norm" "Artery" ...
 $ Condition2   : chr [1:2925] "Norm" "Norm" "Norm" "Norm" ...
 $ BldgType     : chr [1:2925] "1Fam" "1Fam" "1Fam" "1Fam" ...
 $ HouseStyle   : chr [1:2925] "1Story" "2.5Unf" "2Story" "2.5Fin" ...
 $ OverallQual  : num [1:2925] 5 6 10 10 9 7 8 6 10 8 ...
 $ OverallCond  : num [1:2925] 7 8 5 9 5 9 9 7 5 5 ...
 $ YearBuilt    : num [1:2925] 1959 1935 1995 1892 1993 ...
 $ YearRemod/Add: num [1:2925] 1996 1990 1996 1993 1994 ...
 $ RoofStyle    : chr [1:2925] "Gable" "Gable" "Hip" "Gable" ...
 $ RoofMatl     : chr [1:2925] "CompShg" "CompShg" "CompShg" "WdShngl" ...
 $ Exterior1st  : chr [1:2925] "Plywood" "BrkFace" "HdBoard" "Wd Sdng" ...
 $ Exterior2nd  : chr [1:2925] "Plywood" "Wd Sdng" "HdBoard" "Wd Sdng" ...
 $ MasVnrType   : chr [1:2925] "None" "None" "BrkFace" "None" ...
 $ MasVnrArea   : num [1:2925] 0 0 1378 0 738 ...
 $ ExterQual    : chr [1:2925] "TA" "TA" "Gd" "Gd" ...
 $ ExterCond    : chr [1:2925] "TA" "TA" "Gd" "Gd" ...
 $ Foundation   : chr [1:2925] "Slab" "BrkTil" "PConc" "BrkTil" ...
 $ BsmtQual     : chr [1:2925] NA "TA" "Ex" "TA" ...
 $ BsmtCond     : chr [1:2925] NA "TA" "TA" "TA" ...
 $ BsmtExposure : chr [1:2925] NA "No" "Gd" "Mn" ...
 $ BsmtFinType1 : chr [1:2925] NA "Rec" "GLQ" "Unf" ...
 $ BsmtFinSF1   : num [1:2925] 0 425 1387 0 292 ...
 $ BsmtFinType2 : chr [1:2925] NA "Unf" "Unf" "Unf" ...
 $ BsmtFinSF2   : num [1:2925] 0 0 0 0 1393 ...
 $ BsmtUnfSF    : num [1:2925] 0 1411 543 1107 48 ...
 $ TotalBsmtSF  : num [1:2925] 0 1836 1930 1107 1733 ...
 $ Heating      : chr [1:2925] "GasA" "GasA" "GasA" "GasA" ...
 $ HeatingQC    : chr [1:2925] "TA" "Gd" "Ex" "Ex" ...
 $ CentralAir   : chr [1:2925] "Y" "Y" "Y" "Y" ...
 $ Electrical   : chr [1:2925] "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
 $ 1stFlrSF     : num [1:2925] 3820 1836 1831 1518 1933 ...
 $ 2ndFlrSF     : num [1:2925] 0 1836 1796 1518 1567 ...
 $ LowQualFinSF : num [1:2925] 0 0 0 572 0 0 0 515 0 0 ...
 $ GrLivArea    : num [1:2925] 3820 3672 3627 3608 3500 ...
 $ BsmtFullBath : num [1:2925] NA 0 1 0 1 0 0 0 0 1 ...
 $ BsmtHalfBath : num [1:2925] NA 0 0 0 0 0 0 0 0 0 ...
 $ FullBath     : num [1:2925] 3 3 3 2 3 3 3 2 3 3 ...
 $ HalfBath     : num [1:2925] 1 1 1 1 1 0 1 0 1 1 ...
 $ BedroomAbvGr : num [1:2925] 5 5 4 4 4 3 4 8 5 4 ...
 $ KitchenAbvGr : num [1:2925] 1 1 1 1 1 1 1 2 1 1 ...
 $ KitchenQual  : chr [1:2925] "Ex" "Gd" "Gd" "Ex" ...
 $ TotRmsAbvGrd : num [1:2925] 11 7 10 12 11 10 11 14 10 12 ...
 $ Functional   : chr [1:2925] "Typ" "Typ" "Typ" "Typ" ...
 $ Fireplaces   : num [1:2925] 2 2 1 2 1 1 2 0 1 1 ...
 $ FireplaceQu  : chr [1:2925] "Gd" "Gd" "TA" "TA" ...
 $ GarageType   : chr [1:2925] "Attchd" "Detchd" "Attchd" "Detchd" ...
 $ GarageYrBlt  : num [1:2925] 1959 1993 1995 1993 1993 ...
 $ GarageFinish : chr [1:2925] "Unf" "Unf" "Fin" "Unf" ...
 $ GarageCars   : num [1:2925] 2 2 3 3 3 3 3 0 3 3 ...
 $ GarageArea   : num [1:2925] 624 836 807 840 959 ...
 $ GarageQual   : chr [1:2925] "TA" "TA" "TA" "Ex" ...
 $ GarageCond   : chr [1:2925] "TA" "TA" "TA" "TA" ...
 $ PavedDrive   : chr [1:2925] "Y" "Y" "Y" "Y" ...
 $ WoodDeckSF   : num [1:2925] 0 684 361 0 870 302 314 0 204 503 ...
 $ OpenPorchSF  : num [1:2925] 372 80 76 260 86 0 12 110 34 36 ...
 $ EnclosedPorch: num [1:2925] 0 32 0 0 0 0 0 0 0 0 ...
 $ 3SsnPorch    : num [1:2925] 0 0 0 0 0 0 0 0 0 0 ...
 $ ScreenPorch  : num [1:2925] 0 0 0 410 210 0 0 0 0 210 ...
 $ PoolArea     : num [1:2925] 0 0 0 0 0 0 0 0 0 0 ...
 $ PoolQC       : chr [1:2925] NA NA NA NA ...
 $ Fence        : chr [1:2925] NA NA NA "GdPrv" ...
 $ MiscFeature  : chr [1:2925] NA NA NA NA ...
 $ MiscVal      : num [1:2925] 0 0 0 0 0 0 0 0 0 0 ...
 $ MoSold       : num [1:2925] 7 12 7 6 5 5 5 3 9 6 ...
 $ YrSold       : num [1:2925] 2008 2006 2006 2006 2006 ...
 $ SaleType     : chr [1:2925] "WD" "WD" "WD" "WD" ...
 $ SaleCondition: chr [1:2925] "Normal" "Normal" "Normal" "Normal" ...
 $ SalePrice    : num [1:2925] 284700 415000 625000 475000 584500 ...
 - attr(*, "spec")=
  .. cols(
  ..   Order = col_double(),
  ..   PID = col_character(),
  ..   MSSubClass = col_character(),
  ..   MSZoning = col_character(),
  ..   LotFrontage = col_double(),
  ..   LotArea = col_double(),
  ..   Street = col_character(),
  ..   Alley = col_character(),
  ..   LotShape = col_character(),
  ..   LandContour = col_character(),
  ..   Utilities = col_character(),
  ..   LotConfig = col_character(),
  ..   LandSlope = col_character(),
  ..   Neighborhood = col_character(),
  ..   Condition1 = col_character(),
  ..   Condition2 = col_character(),
  ..   BldgType = col_character(),
  ..   HouseStyle = col_character(),
  ..   OverallQual = col_double(),
  ..   OverallCond = col_double(),
  ..   YearBuilt = col_double(),
  ..   `YearRemod/Add` = col_double(),
  ..   RoofStyle = col_character(),
  ..   RoofMatl = col_character(),
  ..   Exterior1st = col_character(),
  ..   Exterior2nd = col_character(),
  ..   MasVnrType = col_character(),
  ..   MasVnrArea = col_double(),
  ..   ExterQual = col_character(),
  ..   ExterCond = col_character(),
  ..   Foundation = col_character(),
  ..   BsmtQual = col_character(),
  ..   BsmtCond = col_character(),
  ..   BsmtExposure = col_character(),
  ..   BsmtFinType1 = col_character(),
  ..   BsmtFinSF1 = col_double(),
  ..   BsmtFinType2 = col_character(),
  ..   BsmtFinSF2 = col_double(),
  ..   BsmtUnfSF = col_double(),
  ..   TotalBsmtSF = col_double(),
  ..   Heating = col_character(),
  ..   HeatingQC = col_character(),
  ..   CentralAir = col_character(),
  ..   Electrical = col_character(),
  ..   `1stFlrSF` = col_double(),
  ..   `2ndFlrSF` = col_double(),
  ..   LowQualFinSF = col_double(),
  ..   GrLivArea = col_double(),
  ..   BsmtFullBath = col_double(),
  ..   BsmtHalfBath = col_double(),
  ..   FullBath = col_double(),
  ..   HalfBath = col_double(),
  ..   BedroomAbvGr = col_double(),
  ..   KitchenAbvGr = col_double(),
  ..   KitchenQual = col_character(),
  ..   TotRmsAbvGrd = col_double(),
  ..   Functional = col_character(),
  ..   Fireplaces = col_double(),
  ..   FireplaceQu = col_character(),
  ..   GarageType = col_character(),
  ..   GarageYrBlt = col_double(),
  ..   GarageFinish = col_character(),
  ..   GarageCars = col_double(),
  ..   GarageArea = col_double(),
  ..   GarageQual = col_character(),
  ..   GarageCond = col_character(),
  ..   PavedDrive = col_character(),
  ..   WoodDeckSF = col_double(),
  ..   OpenPorchSF = col_double(),
  ..   EnclosedPorch = col_double(),
  ..   `3SsnPorch` = col_double(),
  ..   ScreenPorch = col_double(),
  ..   PoolArea = col_double(),
  ..   PoolQC = col_character(),
  ..   Fence = col_character(),
  ..   MiscFeature = col_character(),
  ..   MiscVal = col_double(),
  ..   MoSold = col_double(),
  ..   YrSold = col_double(),
  ..   SaleType = col_character(),
  ..   SaleCondition = col_character(),
  ..   SalePrice = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Note on loading data

  • Two main ways, depending on input file
  • The load() function restores saved objects, such as a data frame, into your workspace under the names they were saved with
  • The read_csv() function returns a data frame
  • To keep a data frame from read_csv(), you must assign it to a variable, e.g., df <- read_csv("filename")! If you just say read_csv("filename"), it will display the data frame, but not save it
  • It is a mistake to say df <- load("filename"), because df then holds only the names of the loaded objects, or to say read_csv("filename") without assigning the result
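A minimal sketch of the two cases above. The data frame houses and both temporary files are made up for illustration; they are not part of the project data.

```r
library(readr)  # provides read_csv() and write_csv()

# A tiny hypothetical data frame standing in for the project data
houses <- data.frame(id = 1:3, price = c(100, 150, 125))

# Case 1: an .RData file. save() stores the object under its own name.
rdataFile <- tempfile(fileext = ".RData")
save(houses, file = rdataFile)
rm(houses)

# load() recreates the object in the workspace as a side effect.
# Its return value is only a character vector of the object names.
loadedNames <- load(rdataFile)
print(loadedNames)
print(exists("houses"))

# Case 2: a CSV file. read_csv() returns the data frame, so assign it.
csvFile <- tempfile(fileext = ".csv")
write_csv(houses, csvFile)
dfSmall <- read_csv(csvFile, show_col_types = FALSE)
print(nrow(dfSmall))
```

After load(), use the restored name (houses here), not the value that load() returned.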

Convert many columns to factors

Some columns should simply be removed, such as Order and PID. Others are useful as factors. How to tell?

with(df,table(MSSubClass))
MSSubClass
 020  030  040  045  050  060  070  075  080  085  090  120  150  160  180  190 
1078  139    6   18  287  571  128   23  118   48  109  192    1  129   17   61 

Use this table in conjunction with the documentation file amesHousing2011doc.txt.
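One possible way to do the conversion, sketched on a made-up miniature of the data. The four rows in dfMini are invented, and which columns deserve to be factors is a judgment you make from the documentation file; the two chosen here are only an assumption for illustration.

```r
library(dplyr)

# Hypothetical miniature of the Ames data (values invented)
dfMini <- data.frame(
  Order      = 1:4,
  PID        = c("0900000001", "0900000002", "0900000003", "0900000004"),
  MSSubClass = c("020", "060", "020", "075"),
  MSZoning   = c("RL", "RL", "RM", "RL"),
  SalePrice  = c(120000, 210000, 135000, 180000)
)

dfFactors <- dfMini |>
  # Drop the identifier columns, which carry no information about the house
  select(!c(Order, PID)) |>
  # Convert the chosen categorical columns to factors
  mutate(across(c(MSSubClass, MSZoning), as.factor))

# The codes are now factor levels rather than free-form strings
print(levels(dfFactors$MSSubClass))
str(dfFactors)
```

Naming the columns explicitly inside across() keeps numeric columns such as SalePrice untouched.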

Another way, using Tidyverse

df|>count(MSSubClass,sort=TRUE)
# A tibble: 16 × 2
   MSSubClass     n
   <chr>      <int>
 1 020         1078
 2 060          571
 3 050          287
 4 120          192
 5 030          139
 6 160          129
 7 070          128
 8 080          118
 9 090          109
10 190           61
11 085           48
12 075           23
13 045           18
14 180           17
15 040            6
16 150            1

R is changing and the Tidyverse is changing

  • But they are changing at different rates
  • Tidyverse is changing faster than base R
  • Implies that many StackOverflow answers for Tidyverse are outdated
  • You must learn to read cryptic error messages about deprecation

Tidyverse has an updated website

https://www.tidyverse.org

Tidyverse consists of packages, listed at https://www.tidyverse.org/packages/

Tidyverse package website lists several sections for learning: Installation and use, Core tidyverse, Import, Wrangle, and others.

Visit dplyr

The dplyr package includes the following functions for data manipulation:

  • mutate() adds new columns that are functions of existing columns.
  • select() picks columns based on their names.
  • filter() picks rows based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
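A short sketch that uses each of the five functions once. The data frame homes is invented for illustration, borrowing only column names from the Ames data.

```r
library(dplyr)

# Hypothetical data; the column names mimic the Ames data, the values do not
homes <- data.frame(
  Neighborhood = c("Edwards", "NoRidge", "OldTown", "Edwards"),
  GrLivArea    = c(1000, 2000, 1500, 1200),
  SalePrice    = c(100000, 300000, 150000, 132000)
)

# mutate(): add a column computed from existing columns
homes2 <- homes |> mutate(pricePerSqFt = SalePrice / GrLivArea)

# select(): keep columns by name
priceOnly <- homes2 |> select(Neighborhood, pricePerSqFt)

# filter(): keep rows by value
bigHomes <- homes2 |> filter(GrLivArea > 1100)

# summarise(): reduce many values to a single summary
avgPrice <- homes2 |> summarise(avPrice = mean(SalePrice))

# arrange(): reorder the rows, here from most to least expensive
sorted <- homes2 |> arrange(desc(SalePrice))

print(priceOnly)
print(bigHomes)
print(avgPrice)
print(sorted)
```

Each function takes a data frame and returns a new one, which is why the calls chain naturally with the pipe.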

Example of selecting all but a few columns

(dfReduced <- df |> select(!c(PID,Order)))
# A tibble: 2,925 × 80
   MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
   <chr>      <chr>          <dbl>   <dbl> <chr>  <chr> <chr>    <chr>      
 1 020        RL               123   47007 Pave   <NA>  IR1      Lvl        
 2 075        RL                60   19800 Pave   <NA>  Reg      Lvl        
 3 060        RL               118   35760 Pave   <NA>  IR1      Lvl        
 4 075        RM                90   22950 Pave   <NA>  IR2      Lvl        
 5 060        RL               114   17242 Pave   <NA>  IR1      Lvl        
 6 075        RM                87   18386 Pave   <NA>  Reg      Lvl        
 7 050        RL                NA   14100 Pave   <NA>  IR1      Lvl        
 8 190        RH                60   10896 Pave   Pave  Reg      Bnk        
 9 060        RL                60   18062 Pave   <NA>  IR1      HLS        
10 060        RL                47   53504 Pave   <NA>  IR2      HLS        
# ℹ 2,915 more rows
# ℹ 72 more variables: Utilities <chr>, LotConfig <chr>, LandSlope <chr>,
#   Neighborhood <chr>, Condition1 <chr>, Condition2 <chr>, BldgType <chr>,
#   HouseStyle <chr>, OverallQual <dbl>, OverallCond <dbl>, YearBuilt <dbl>,
#   `YearRemod/Add` <dbl>, RoofStyle <chr>, RoofMatl <chr>, Exterior1st <chr>,
#   Exterior2nd <chr>, MasVnrType <chr>, MasVnrArea <dbl>, ExterQual <chr>,
#   ExterCond <chr>, Foundation <chr>, BsmtQual <chr>, BsmtCond <chr>, …

Example of several dplyr functions

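# classes.tsv maps each MSSubClass code to a text description
dfClasses <- read_tsv("classes.tsv")
(dfPriceByClass <- df |>
  select(c(MSSubClass,SalePrice)) |>
  group_by(MSSubClass) |>
  summarize(avPriceByClass = mean(SalePrice),n=n()) |>
  arrange(desc(avPriceByClass)) |>
  # name the join column explicitly instead of letting dplyr guess it
  inner_join(dfClasses, by = "MSSubClass"))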
# A tibble: 16 × 4
   MSSubClass avPriceByClass     n subClassDescr                                
   <chr>               <dbl> <int> <chr>                                        
 1 060               237810.   571 2-STORY 1946 & NEWER                         
 2 120               208019.   192 1-STORY PUD (Planned Unit Development) - 194…
 3 075               199978.    23 2-1/2 STORY ALL AGES                         
 4 020               187359.  1078 1-STORY 1946 & NEWER ALL STYLES              
 5 080               168009.   118 SPLIT OR MULTI-LEVEL                         
 6 070               156526.   128 2-STORY 1945 & OLDER                         
 7 085               149842.    48 SPLIT FOYER                                  
 8 150               148400      1 1-1/2 STORY PUD - ALL AGES                   
 9 040               144917.     6 1-STORY W/FINISHED ATTIC ALL AGES            
10 090               139809.   109 DUPLEX - ALL STYLES AND AGES                 
11 050               137433.   287 1-1/2 STORY FINISHED ALL AGES                
12 160               137080.   129 2-STORY PUD - 1946 & NEWER                   
13 190               125870.    61 2 FAMILY CONVERSION - ALL STYLES AND AGES    
14 045               111783.    18 1-1/2 STORY - UNFINISHED ALL AGES            
15 180               107671.    17 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER      
16 030                96727.   139 1-STORY 1945 & OLDER                         

More on Quarto

What is Quarto?

  • A document production system
  • A way to conduct reproducible research
  • A way to practice literate programming

Naming convention for Quarto files

By default, Quarto files end in .qmd, although other extensions will work. When you feed a .qmd file to RStudio, it assumes that it’s a Quarto file and opens it accordingly.

Contents of a Quarto file

A Quarto file contains just plain text, no binary information. It can be read by any text editor, although what an editor does with it depends on how Quarto-aware that editor is.

For example, an R chunk (prefaced by a blank line followed by three backticks and r in curly braces) is assumed to be R code. It is syntax-highlighted as R and, in some editors such as RStudio, can be independently executed. In RStudio this is done by clicking a green triangle to the right of the chunk.

Everything not in a code chunk is assumed to be Pandoc-flavored Markdown.
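To make the plain-text point concrete, here is a sketch that writes a tiny Quarto file from R and reads it back. The file name example.qmd and its contents are made up, and the chunk fence is assembled with strrep() only so that the backticks don't clutter this example.

```r
# Three backticks, built from a single one
fence <- strrep("`", 3)

# The lines of a minimal, hypothetical Quarto document
qmdLines <- c(
  "---",
  "title: \"Hypothetical example\"",
  "---",
  "",
  "## Scores",
  "",
  "Everything outside the chunk is *Markdown*.",
  "",
  paste0(fence, "{r}"),
  "score <- c(2, 1.9, 1.9)",
  "stem(score)",
  fence
)

# Write the document as ordinary text
qmdFile <- file.path(tempdir(), "example.qmd")
writeLines(qmdLines, qmdFile)

# Any tool that reads plain text gets the same lines back
readBack <- readLines(qmdFile)
print(identical(readBack, qmdLines))

# Line numbers on which an R chunk opens
chunkStarts <- which(startsWith(readBack, paste0(fence, "{r}")))
print(chunkStarts)
```

Note the blank line before the chunk and before the heading, matching the formatting advice earlier in these slides.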

Pandoc-flavored Markdown

Since Markdown was invented around 2004, many flavors of it have developed. The one we’re using is the one interpreted by the program pandoc, documented at https://pandoc.org/.

Why so many flavors?

Markdown was originally devised as a shorthand for HTML by John Gruber, who was tired of having to write out lengthy HTML constructs for his blog. He wanted something simpler but also readable on its own. By the way, the original Markdown description is still on the web after all these years at https://daringfireball.net/projects/markdown/, although there are many more descriptive sites. What happened in the years since was that (A) people wanted their own shorthand sets, and (B) it turned out to be really easy to write a converter from Markdown to HTML.

Some of Markdown’s features

You have experienced some of Markdown’s features, such as a blank line followed by two hash marks (##) followed by a space for a level-two heading. You might have surrounded a word with asterisks to italicize it, or with double asterisks to make it bold. You might have used straight quotation marks and found them converted to typographical quotation marks (a different opening and closing mark).

An important extension: inline code

You can write inline code in Markdown chunks! Use a single backtick, followed by an r in curly braces, then the code, then a single backtick. Hence, you can let the data speak instead of laboriously running the file and extracting results from R chunks and manually inserting them into a Markdown chunk.
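A sketch of what that looks like, using the score vector from the beginning of these slides. The sentence is invented, and the second half only imitates by hand the substitution that rendering performs; it is not how Quarto works internally.

```r
score <- c(2, 1.9, 1.9, 2, 1.9, 1.9, 2, 2, 1.9, 1.9, 1.9)

# What you would type in the Markdown part of the .qmd file:
# a backtick, {r}, the code, and a closing backtick
tick <- "`"
inlineMean <- paste0(tick, "{r} round(mean(score), 2)", tick)
markdownSentence <- paste0("The mean score was ", inlineMean, ".")
cat(markdownSentence, "\n")

# What a reader would see after rendering, imitated here by hand
renderedSentence <- paste0("The mean score was ", round(mean(score), 2), ".")
cat(renderedSentence, "\n")
```

If the scores change, the rendered sentence changes with them, with no number to retype.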

Other extensions

URLs can be included in Markdown chunks by saying [displayname](actualURL). I usually make the displayname be the actual URL, but you can put anything you want in the displayname brackets.

Pictures can be included by saying

![caption](filename)

on a line by itself.

END

Colophon

This slideshow was produced using Quarto.

Fonts are Roboto Condensed Bold, JetBrains Mono Nerd Font, and STIX2