3  More about R and Quarto

3.1 Recap Week 02

We did some exercises, for which there are now solutions in the files week02exercises-soln.qmd and week02exercises-soln.html. You should examine and compare these two files, especially the exercise parts.

3.2 Week 03: More on R and Quarto

We will establish groups for the milestones. I’m open to moving you around if needed, subject to the constraint that we have no more than five members in a group. I expect to have four groups of four and two groups of five.

3.2.1 The template files

The template files show how you can begin the milestones. Notice that I have named the data frame df in the template.qmd file. As a result, I can write functions like the following.

pacman::p_load(tidyverse)
df <- read_csv(paste0(Sys.getenv("STATS_DATA_DIR"),"/vehicles.csv"))
df |> count(title_status,sort=TRUE)
# A tibble: 7 × 2
  title_status      n
  <chr>         <int>
1 clean        405117
2 <NA>           8242
3 rebuilt        7219
4 salvage        3868
5 lien           1422
6 missing         814
7 parts only      198

The above is a good way to investigate categorical data. Notice that I’ve used the pipe character, so that the data frame df is sent to the count() function. Then a particular categorical column of df is counted and sorted in descending numerical order. The count() function is part of the dplyr package, which is one of the tidyverse packages. You only need to load the tidyverse set of packages once in a document, preferably near the beginning.

Another point about the above result is that the count() output is truncated for columns with many values. For example, region has 404 values, and a tibble prints only the first 10 rows by default. I can make it display all the values by piping the result to the print() function:

df |> count(region,sort=TRUE) |> print(n=404)
# A tibble: 404 × 2
    region                         n
    <chr>                      <int>
  1 columbus                    3608
  2 jacksonville                3562
  3 spokane / coeur d'alene     2988
  4 eugene                      2985
  5 fresno / madera             2983
  6 orlando                     2983
  7 bend                        2982
  8 omaha / council bluffs      2982
  9 kennewick-pasco-richland    2981
 10 new hampshire               2981
 11 nashville                   2980
 12 salem                       2980
 13 oklahoma city               2979
 14 reno / tahoe                2979
 15 boston                      2978
 16 rochester                   2978
 17 sarasota-bradenton          2977
 18 stockton                    2977
 19 boise                       2976
 20 portland                    2976
 21 houston                     2975
 22 south jersey                2974
 23 minneapolis / st paul       2973
 24 modesto                     2973
 25 philadelphia                2973
 26 seattle-tacoma              2973
 27 grand rapids                2972
 28 baltimore                   2971
 29 las vegas                   2971
 30 milwaukee                   2971
 31 pittsburgh                  2971
 32 cincinnati                  2970
 33 sacramento                  2970
 34 tulsa                       2970
 35 washington, DC              2970
 36 austin                      2969
 37 north jersey                2969
 38 tucson                      2969
 39 charlotte                   2968
 40 maine                       2966
 41 atlanta                     2965
 42 hawaii                      2964
 43 long island                 2964
 44 central NJ                  2961
 45 phoenix                     2960
 46 detroit metro               2959
 47 dallas / fort worth         2957
 48 kansas city, MO             2957
 49 ft myers / SW florida       2955
 50 cleveland                   2953
 51 san diego                   2953
 52 albuquerque                 2952
 53 denver                      2952
 54 orange county               2952
 55 inland empire               2950
 56 new york city               2950
 57 norfolk / hampton roads     2946
 58 tampa bay area              2945
 59 st louis, MO                2940
 60 los angeles                 2937
 61 SF bay area                 2936
 62 chicago                     2929
 63 des moines                  2923
 64 south florida               2920
 65 raleigh / durham / CH       2918
 66 colorado springs            2914
 67 san antonio                 2891
 68 knoxville                   2763
 69 anchorage / mat-su          2742
 70 hartford                    2564
 71 albany                      2537
 72 bakersfield                 2528
 73 redding                     2526
 74 springfield                 2520
 75 ventura county              2518
 76 vermont                     2513
 77 fayetteville                2415
 78 madison                     2387
 79 fort collins / north CO     2385
 80 richmond                    2346
 81 greenville / upstate        2327
 82 rhode island                2320
 83 bellingham                  2313
 84 indianapolis                2303
 85 akron / canton              2211
 86 greensboro                  2078
 87 western massachusetts       2039
 88 buffalo                     2006
 89 palm springs                1978
 90 el paso                     1977
 91 medford-ashland             1962
 92 louisville                  1938
 93 worcester / central MA      1914
 94 little rock                 1841
 95 ocala                       1828
 96 dayton / springfield        1787
 97 yuba-sutter                 1747
 98 memphis                     1724
 99 yakima                      1704
100 billings                    1701
101 wichita                     1694
102 hudson valley               1692
103 birmingham                  1647
104 new haven                   1644
105 daytona beach               1633
106 charleston                  1509
107 chico                       1486
108 san luis obispo             1455
109 monterey bay                1451
110 asheville                   1423
111 toledo                      1406
112 columbia                    1388
113 missoula                    1378
114 lansing                     1367
115 jackson                     1351
116 space coast                 1346
117 syracuse                    1338
118 fredericksburg              1328
119 wenatchee                   1324
120 new orleans                 1321
121 huntsville / decatur        1273
122 appleton-oshkosh-FDL        1261
123 lehigh valley               1261
124 flint                       1258
125 columbia / jeff city        1257
126 bozeman                     1249
127 kalamazoo                   1236
128 western slope               1188
129 corpus christi              1176
130 treasure coast              1173
131 corvallis/albany            1168
132 roanoke                     1121
133 lexington                   1115
134 tri-cities                  1110
135 ann arbor                   1085
136 rockford                    1059
137 east idaho                  1052
138 santa barbara               1052
139 lakeland                    1048
140 myrtle beach                1043
141 chattanooga                 1038
142 green bay                   1033
143 winston-salem               1027
144 mcallen / edinburg          1019
145 scranton / wilkes-barre     1008
146 moses lake                   994
147 tyler / east TX              992
148 lancaster                    988
149 kalispell                    978
150 harrisburg                   976
151 wilmington                   975
152 fargo / moorhead             969
153 south bend / michiana        969
154 st cloud                     954
155 delaware                     949
156 gainesville                  926
157 visalia-tulare               922
158 eastern NC                   895
159 flagstaff / sedona           881
160 saginaw-midland-baycity      876
161 hickory / lenoir             853
162 eau claire                   846
163 jersey shore                 838
164 erie                         805
165 santa fe / taos              801
166 twin falls                   794
167 fort wayne                   788
168 clarksville                  785
169 york                         777
170 tallahassee                  771
171 gold country                 765
172 duluth / superior            763
173 kenosha-racine               757
174 prescott                     746
175 pueblo                       746
176 santa maria                  740
177 baton rouge                  733
178 wausau                       726
179 east oregon                  720
180 olympic peninsula            717
181 lewiston / clarkston         710
182 skagit / island / SJI        701
183 boulder                      694
184 oregon coast                 694
185 macon / warner robins        687
186 quad cities, IA/IL           687
187 waco                         681
188 winchester                   672
189 youngstown                   664
190 killeen / temple / ft hood   662
191 merced                       654
192 cedar rapids                 648
193 south coast                  645
194 charlottesville              642
195 battle creek                 639
196 mobile                       626
197 pensacola                    622
198 sioux falls / SE SD          621
199 northern michigan            612
200 wyoming                      610
201 athens                       607
202 utica-rome-oneida            607
203 lincoln                      604
204 cape cod / islands           598
205 pullman / moscow             595
206 wichita falls                589
207 eastern CT                   583
208 topeka                       582
209 amarillo                     564
210 southern illinois            555
211 waterloo / cedar falls       555
212 holland                      534
213 brainerd                     533
214 monroe                       530
215 great falls                  524
216 la crosse                    521
217 savannah / hinesville        516
218 rapid city / west SD         502
219 lynchburg                    496
220 lubbock                      492
221 lima / findlay               485
222 salt lake city               485
223 augusta                      480
224 southwest michigan           476
225 binghamton                   474
226 muskegon                     473
227 poconos                      464
228 north mississippi            462
229 central michigan             461
230 mankato                      445
231 finger lakes                 439
232 mohave county                435
233 odessa / midland             433
234 peoria                       432
235 danville                     427
236 fairbanks                    427
237 shreveport                   422
238 reading                      416
239 bowling green                415
240 northwest GA                 415
241 watertown                    414
242 evansville                   413
243 montgomery                   408
244 southern maryland            405
245 northwest CT                 397
246 sioux city                   393
247 humboldt county              385
248 elmira-corning               379
249 harrisonburg                 375
250 eastern shore                368
251 bemidji                      365
252 altoona-johnstown            364
253 st george                    362
254 williamsport                 362
255 brownsville                  361
256 mansfield                    359
257 las cruces                   358
258 port huron                   358
259 upper peninsula              355
260 janesville                   343
261 klamath falls                343
262 boone                        338
263 jonesboro                    337
264 laredo                       337
265 yuma                         335
266 joplin                       328
267 st augustine                 328
268 fort smith                   327
269 dothan                       325
270 florence                     324
271 st joseph                    323
272 college station              313
273 zanesville / cambridge       313
274 morgantown                   307
275 ithaca                       303
276 panama city                  298
277 roseburg                     294
278 frederick                    288
279 beaumont / port arthur       287
280 the thumb                    286
281 annapolis                    285
282 western maryland             285
283 texoma                       284
284 imperial county              282
285 champaign urbana             281
286 sheboygan                    281
287 new river valley             275
288 glens falls                  274
289 ashtabula                    270
290 northern panhandle           268
291 parkersburg-marietta         264
292 northern WI                  262
293 valdosta                     262
294 lafayette / west lafayette   260
295 chillicothe                  257
296 plattsburgh-adirondacks      256
297 helena                       255
298 southwest VA                 254
299 iowa city                    253
300 eastern panhandle            251
301 manhattan                    251
302 muncie / anderson            248
303 southeast missouri           245
304 sandusky                     244
305 victoria                     243
306 sierra vista                 241
307 mendocino county             238
308 lake of the ozarks           236
309 abilene                      235
310 north central FL             230
311 dubuque                      228
312 galveston                    225
313 chautauqua                   221
314 kenai peninsula              221
315 bloomington                  217
316 bloomington-normal           217
317 mason city                   217
318 eastern kentucky             213
319 grand island                 212
320 western KY                   212
321 texarkana                    211
322 state college                210
323 gadsden-anniston             207
324 lawton                       207
325 florida keys                 203
326 lawrence                     199
327 stillwater                   198
328 terre haute                  198
329 ames                         194
330 okaloosa / walton            194
331 grand forks                  192
332 cookeville                   189
333 brunswick                    187
334 hilton head                  186
335 cumberland valley            179
336 lafayette                    179
337 san marcos                   179
338 hanford-corcoran             178
339 meadville                    177
340 southeast IA                 175
341 decatur                      171
342 heartland florida            170
343 high rockies                 169
344 salina                       168
345 florence / muscle shoals     165
346 tuscaloosa                   162
347 san angelo                   161
348 kokomo                       156
349 lake charles                 156
350 mattoon-charleston           156
351 tuscarawas co                154
352 logan                        152
353 catskills                    147
354 northwest OK                 145
355 northwest KS                 143
356 auburn                       142
357 owensboro                    140
358 statesboro                   140
359 farmington                   136
360 twin tiers NY/PA             136
361 butte                        134
362 central louisiana            132
363 gulfport / biloxi            127
364 north dakota                 126
365 elko                         120
366 kirksville                   119
367 southeast KS                 119
368 susanville                   119
369 huntington-ashland           118
370 scottsbluff / panhandle      117
371 la salle co                  116
372 show low                     112
373 outer banks                  109
374 del rio / eagle pass         108
375 potsdam-canton-massena       102
376 roswell / carlsbad           100
377 hattiesburg                   99
378 southwest KS                  96
379 bismarck                      92
380 south dakota                  90
381 deep east texas               88
382 houma                         88
383 fort dodge                    87
384 southeast alaska              84
385 siskiyou county               83
386 north platte                  80
387 clovis / portales             78
388 ogden-clearfield              76
389 eastern montana               75
390 provo / orem                  75
391 western IL                    63
392 oneonta                       62
393 southwest MN                  56
394 pierre / central SD           55
395 eastern CO                    40
396 southern WV                   36
397 st louis                      35
398 northeast SD                  34
399 southwest TX                  30
400 meridian                      28
401 southwest MS                  14
402 kansas city                   11
403 fort smith, AR                 9
404 west virginia (old)            8

The number 404 in print(n=404) matches the 404 unique regions, which you can read from the header of the count() output (# A tibble: 404 × 2). Of course, in a report for a manager or recruiter I would be unlikely to print out a column of 404 numbers. A manager or recruiter would be more interested in some sort of summary. This is primarily for your preparation.
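Rather than eyeballing the tibble header, you can compute the number of unique values directly. Here is a minimal sketch; a toy vector stands in for df$region, since the real column comes from vehicles.csv:

```r
# Sketch: counting distinct values of a categorical column.
library(dplyr)

# Toy stand-in for df$region
region <- c("columbus", "jacksonville", "columbus", "eugene", "eugene", "bend")

n_distinct(region)       # 4 unique regions in this toy example
length(unique(region))   # equivalent in base R
```

On the real data frame, n_distinct(df$region) returns 404, the value used in print(n=404).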

Another solution (more cumbersome) follows.

df %>%
  group_by(region) %>%
  do(data.frame(nrow=nrow(.))) %>%
  arrange(desc(nrow)) %>%
  print(n=40)
# A tibble: 404 × 2
# Groups:   region [404]
   region                    nrow
   <chr>                    <int>
 1 columbus                  3608
 2 jacksonville              3562
 3 spokane / coeur d'alene   2988
 4 eugene                    2985
 5 fresno / madera           2983
 6 orlando                   2983
 7 bend                      2982
 8 omaha / council bluffs    2982
 9 kennewick-pasco-richland  2981
10 new hampshire             2981
11 nashville                 2980
12 salem                     2980
13 oklahoma city             2979
14 reno / tahoe              2979
15 boston                    2978
16 rochester                 2978
17 sarasota-bradenton        2977
18 stockton                  2977
19 boise                     2976
20 portland                  2976
21 houston                   2975
22 south jersey              2974
23 minneapolis / st paul     2973
24 modesto                   2973
25 philadelphia              2973
26 seattle-tacoma            2973
27 grand rapids              2972
28 baltimore                 2971
29 las vegas                 2971
30 milwaukee                 2971
31 pittsburgh                2971
32 cincinnati                2970
33 sacramento                2970
34 tulsa                     2970
35 washington, DC            2970
36 austin                    2969
37 north jersey              2969
38 tucson                    2969
39 charlotte                 2968
40 maine                     2966
# ℹ 364 more rows

Another way (and a better one for high level understanding) to investigate categorical data is through contingency tables. You have already made some of these using the table() function and some associated functions that are mentioned in the week02exercises-soln.qmd file. It will take some intuition to figure out which pairs of categorical columns should be tabulated for Milestone 1.
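As a reminder of how table() works before you apply it to vehicles.csv, here is a small sketch with invented values (fuel and transmission are real column names in the data set, but the entries below are made up):

```r
# Sketch: a two-way contingency table from two categorical columns.
fuel         <- c("gas", "gas", "diesel", "gas", "diesel")
transmission <- c("automatic", "manual", "automatic", "automatic", "manual")

tbl <- table(fuel, transmission)
addmargins(tbl)   # append row and column totals, as done later in these notes
```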

3.2.2 Numerical Data

For numerical data, you can do what we did last week to investigate a single column.

summary(df$price)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 5.900e+03 1.395e+04 7.520e+04 2.649e+04 3.737e+09 

or

pacman::p_load(scales)
df$price <- as.double(df$price)
df %>%
    summarize(Min=comma(min(price)),
              firstq=comma(quantile(price,0.25)),
              Median=comma(median(price)),
              Mean=comma(mean(price)),
              thirdq=comma(quantile(price,0.75)),
              Max=comma(max(price)))
# A tibble: 1 × 6
  Min   firstq Median Mean   thirdq Max          
  <chr> <chr>  <chr>  <chr>  <chr>  <chr>        
1 0     5,900  13,950 75,199 26,486 3,736,928,711
df$price <- as.integer(df$price)
Warning: NAs introduced by coercion to integer range

You’ll notice that one of the prices is 3.7 billion dollars for a Toyota truck! This is obviously a misprint! You should probably remove this row from the data frame and save the data frame without it. A more sophisticated alternative would be to impute a value for this truck. There are many advanced statistical ways to do this, but they are beyond the scope of this course. van Buuren (2018) describes several excellent ways to do so, particularly in that book’s Section 5.1.1, Recommended Workflows. It is usually a mistake to use an average for missing (NA) values because doing so compresses the variance unnaturally.

You may remove that particular row by saying df <- df[-(which.max(df$price)),]. Unfortunately, you will find that not to be very useful because there are several prices of over a billion dollars for generic cars! In fact, there are several with prices listed as 1234567890 and some listed as 987654321. Many other ridiculous patterns can be found for price.

So what can you do? Personally, I might add the following line right after reading in the file:

df <- df[df$price<100000&df$price>0,]

or, better yet,

df <- df[df$price < quantile(df$price,.9,na.rm=TRUE) & df$price > quantile(df$price,.1,na.rm=TRUE),]

The first one will rid the data frame of cars priced at greater than 100,000 dollars and still leave you with over 300,000 automobiles to analyze. Of course, there will still be spurious entries, but at least it’s a start. The first code will also get rid of cars priced at exactly zero dollars. Examination of the data frame will show that many of the zero-dollar entries are just ads for used car dealers. The expression price<100000&price>0 is called a compound Boolean expression. It is compound because of the ampersand, which stands for the word and: the row has to contain a price less than 100,000 AND a price greater than zero.

The second code gets rid of the 90th percentile and above and the 10th percentile and below, which will still leave plenty.
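If you prefer tidyverse style, the same compound Boolean filter can be written with dplyr’s filter(), where comma-separated conditions are ANDed together. A sketch with a toy data frame standing in for the vehicles data:

```r
# Sketch: row filtering with dplyr::filter() instead of bracket indexing.
library(dplyr)

# Toy stand-in for the vehicles data frame
df <- tibble(price = c(0, 5900, 13950, 1234567890, 26486))

# Keep rows with 0 < price < 100000
df_clean <- df |> filter(price > 0, price < 100000)
nrow(df_clean)   # 3 rows survive in this toy example
```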

By the way, this is a good reminder that a data frame has a row and a column index. You can refer to rows by saying df[expression,] and to columns by saying df[,expression]. The expression can be any expression that resolves to TRUE or FALSE. The first expression above resolves to TRUE for cars priced at greater than zero but less than 100,000 dollars, and FALSE otherwise.

You can check how each column is stored by saying str(df), which will show the structure of the df data frame. You should be able to see that most of the columns are classified as chr, or character. This is not desirable. Most of the columns classified as chr should more properly be classified as factors. Factors take up less space on your computer, are faster to process, and allow more types of processing than chr. Unfortunately, if you store your intermediate work as a .csv file, you will lose the factor designation. Therefore, I recommend that you do the following.

Step 1. Get rid of rows you don’t want, such as those with prices over or under some threshold value you choose.

Step 2. Get rid of columns you don’t want to analyze, such as url or VIN.

Step 3. Convert some of the chr columns to factor. For instance, you can say df$state <- as.factor(df$state).

Step 4. Save your file by saying something like save(df,file="df.RData")

Step 5. Quit using this file and open a file called intermediate.qmd

Step 6. At the beginning of that file, say load("df.RData").

Step 7. Do all your work in that file, then paste the work back into your template.qmd file so you can run it as required. (Remember, you are not turning in a .RData file. Your m1.qmd file must start with reading in the vehicles.csv file and do processing on the resulting data frame.)

Step 8. Merge your template.qmd file with those of your group members into one m1.qmd file. For example, you could name all your individual template files with your names and one group member could merge them together. This should be easy for Milestone 1 since an obvious way to divide up your work is to assign different columns to different group members.
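The key point behind Steps 3–6, that a factor column survives save()/load() but not a round trip through .csv, can be sketched with a toy data frame (the real workflow uses vehicles.csv and your own file names):

```r
# Sketch: factors survive save()/load(), unlike write_csv()/read_csv().
df <- data.frame(state = c("ca", "ny", "ca"), price = c(5900, 13950, 26486))
df$state <- as.factor(df$state)      # Step 3: chr -> factor

f <- tempfile(fileext = ".RData")
save(df, file = f)                   # Step 4: save the data frame
rm(df)
load(f)                              # Step 6: df is restored...
is.factor(df$state)                  # ...and state is still a factor: TRUE
```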

3.2.3 Combining numerical and categorical selection

dfX <- subset(df,state %in% c("ca","ny") & type %in% c("sedan","SUV") & price<99999 & price>0)
tbl <- table(dfX$state,dfX$type)
addmargins(tbl)
     
      sedan   SUV   Sum
  ca  10313  6537 16850
  ny   3487  2953  6440
  Sum 13800  9490 23290

Above is an example of getting a small contingency table with only the data you want. The first line selects only cars offered in ca or ny, only sedans or SUVs, and only with prices below 99,999 dollars and more than zero dollars. Then we can make a compact contingency table of that new data frame and add the margins to it.
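If proportions are more informative than raw counts, the same table can be wrapped in prop.table(). A sketch with invented values standing in for the subsetted data:

```r
# Sketch: turning a contingency table into proportions.
state <- c("ca", "ca", "ca", "ny", "ny")
type  <- c("sedan", "SUV", "sedan", "sedan", "SUV")

tbl <- table(state, type)
prop.table(tbl)      # each cell as a share of the grand total
prop.table(tbl, 1)   # row-wise proportions (share within each state)
```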

3.2.4 Getting the data displayed as you wish

Someone asked me how to display price ranges by manufacturer. Here’s one way to do that:

df |>
  group_by(manufacturer) |>
  reframe(min = min(price),max=max(price)) |>
  print(n=43)
# A tibble: 43 × 3
   manufacturer      min   max
   <chr>           <int> <int>
 1 acura             574 37500
 2 alfa-romeo       1000 37500
 3 aston-martin     1947 37000
 4 audi              504 37509
 5 bmw               501 37500
 6 buick             521 37500
 7 cadillac          507 37500
 8 chevrolet         502 37587
 9 chrysler          539 37500
10 datsun           2200 30000
11 dodge             513 37500
12 ferrari          2034 35000
13 fiat              800 32500
14 ford              502 37513
15 gmc               501 37550
16 harley-davidson  2500 27995
17 honda             502 37500
18 hyundai           543 36900
19 infiniti          517 37500
20 jaguar            514 37388
21 jeep              501 37587
22 kia               529 37500
23 land rover       3199 10999
24 lexus             521 37500
25 lincoln           541 37495
26 mazda             515 36995
27 mercedes-benz     502 37509
28 mercury           600 37495
29 mini              600 35990
30 mitsubishi        538 37500
31 morgan           1000 36500
32 nissan            522 37575
33 pontiac           600 37500
34 porsche           560 37500
35 ram               502 37556
36 rover             507 37500
37 saturn            600 22000
38 subaru            501 37200
39 tesla             563 36999
40 toyota            505 37583
41 volkswagen        507 37000
42 volvo             502 37000
43 <NA>               NA    NA

The number 43 is because there are 43 manufacturers in the data frame. Note that, using the reframe() function, I could add a few more comma-separated statistics to the output.
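For instance, here is a sketch of adding more comma-separated statistics to reframe(), using a toy data frame in place of the vehicles data:

```r
# Sketch: extra summary statistics inside reframe().
library(dplyr)

# Toy stand-in for the vehicles data
df <- tibble(
  manufacturer = c("acura", "acura", "bmw", "bmw", "bmw"),
  price = c(574, 37500, 501, 9000, 37500)
)

df |>
  group_by(manufacturer) |>
  reframe(min = min(price),
          median = median(price),
          max = max(price),
          n = n())
```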

3.2.5 Investigating words

To make a word cloud, I first exported the model column from the data frame to a file called bla, using the write_csv() function. Next I used Vim to convert all spaces to newlines, so that the file would have one word on each line. To do so I said :%s/ /\r/g in Vim.

Next I said sort bla | uniq -c | sort -n >blabla to get the following output. This is just the last few lines of the file. Note that the most frequently occurring word in the file is 1500, which occurs 24,014 times.

3383    fusion
3475    tundra
3479    xl
3528    corolla
3585    unlimited
3985    altima
3985    explorer
4072    3500
4105    f-250
4256    mustang
4277    escape
5162    camry
5208    series
5244    pickup
5277    NA
5479    accord
5585    xlt
5659    awd
5660    cherokee
5667    f150
5680    tacoma
5734    civic
5744    s
5865    2d
6185    limited
6213    lt
6295    coupe
6519    crew
6869    premium
7190    duty
7319    se
7531    2500
7564    utility
8093    wrangler
8327    super
8642    sierra
8733    grand
9578    4x4
10283   f-150
14877   sedan
15152   cab
17181   silverado
17488   4d
23130   sport
24014   1500

Next I opened the blabla file in Vim and converted all sequences of spaces to a tab character, saying :%s/^ *// to get rid of leading spaces, then :%s/ */\t/ to convert the intercolumn spaces to tabs. I saved this file as wordfreq.tsv and opened it in R, using the following code to convert it to a word cloud.

pacman::p_load(tidyverse)
df<-read_tsv("wordfreq.tsv")
df<-df|>relocate(freq,.after=word)
head(df)
# A tibble: 6 × 2
  word         freq
  <chr>       <dbl>
1 "!"           102
2 "!!"            4
3 "!!!"           2
4 "!!!\""         1
5 "!!clean!!"     2
6 "!!leveled"     2
pacman::p_load(wordcloud2)
wordcloud2(df)

I was not able to reproduce the error message I kept getting in class. This simply worked the first time through after class. I also cannot explain why several of the most frequently occurring words do not appear in the word cloud. I suspect it is because the word cloud is truncated in this display. When I have constructed this word cloud previously, it was ellipse-shaped and zoomed out. I have forgotten whatever I did to make that happen!
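For reference, the Vim and shell steps above (split on spaces, then count word frequencies) can also be done entirely in R. A sketch with a toy model column standing in for df$model:

```r
# Sketch: an all-R version of the sort | uniq -c | sort -n pipeline.
library(dplyr)
library(tidyr)

# Toy stand-in for the model column of the vehicles data
df <- tibble(model = c("silverado 1500", "f-150 xlt", "silverado 1500 crew"))

wordfreq <- df |>
  separate_longer_delim(model, delim = " ") |>  # one word per row
  count(model, name = "freq", sort = TRUE) |>   # word frequencies, descending
  rename(word = model)
wordfreq
```

The resulting wordfreq tibble has the word/freq columns that wordcloud2() expects, with no detour through external files.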

3.2.6 Filtering two specific trucks

Two of the most common words I found above were f-150 and silverado. Since these are two popular truck models, I thought to compare them by making a data frame for each. To do so I used the filter() function of the dplyr package. This is described in great detail in Chapter 4 of Wickham, Çetinkaya-Rundel, and Grolemund (2023). The special construction (?i) makes whatever follows case-insensitive. Thus, in this case, it picks up Silverado and SILVERADO, as well as silverado. It would also pick up sIlVeRaDo if such a strange mixed case version presented itself. This is called a regular expression or regex and is commonly used in finding and replacing text patterns.

The regexes for f-150 are much more complicated. First, I used the alternating selector [Ff]. This stands for either a capital F or small f but not both. I can put any number of characters in the brackets and this will select any one of them occurring one time. Next comes a dot followed by a question mark. The dot matches any single character, and the question mark makes whatever precedes it (here, the dot) optional, matching it zero or one times. So the whole construct [Ff].? can be read as “exactly one F or f, followed by at most one character of any kind.” Next comes a literal 150, so an F or f plus at most one character will be matched only if it is immediately followed by 150. The last construct in this regular expression is the negating alternator, [^0]. The negating alternator matches any single character except the characters following the ^ in brackets; in this case it means that any character except a zero is okay.

It’s actually a vestige of an earlier attempt. I had previously written the expression as [Ff].*150 and run it, and it erroneously picked up a model string that said “pacifica $1500”. That was because I used a * instead of a ?. The * makes the preceding dot match zero or more characters instead of zero or one. As a result, “pacifica $1500” matched because there is an f followed by some characters, followed by 150! So I stuck the negating alternator [^0] in to get rid of the $1500 before I realized that the real problem was the *. I report this so you can see that finding the right regex can be a long, iterative process. This regex picks up f150, F 150, f-150, and so on. I could have refined it further. Can you see how?
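A quick way to sanity-check a regex before filtering the whole data frame is to try it on a few invented strings. Note a side effect of the trailing [^0]: a bare “f150” at the very end of a string fails to match, because [^0] demands that some character follow the 150:

```r
# Sketch: testing the f-150 regex on made-up model strings.
library(stringr)

tests <- c("f150 xlt", "F-150 Lariat", "pacifica $1500", "f150")
str_detect(tests, "[Ff].?150[^0]")
# TRUE TRUE FALSE FALSE -- the last fails because nothing follows "150"
```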

pacman::p_load(tidyverse)
#df <- read_csv(paste0(Sys.getenv("STATS_DATA_DIR"),"/vehicles.csv"))
#load("/home/mm223266/data/vehicles.Rdata")
#df <- read_csv("/home/mm223266/data/vehicles.csv")
load(paste0(Sys.getenv("STATS_DATA_DIR"),"/vehicles.Rdata"))
#names(df)
df2 <- df |> filter(str_detect(df$model,regex("(?i)silverado")))
head(table(df2$model),12)

      1500 sierra silverado              1500 silverado 
                          1                          23 
         1500 silverado 4x4      1500 silverado ltz z71 
                          5                           2 
      1500 silverado sierra        1500 silverado titan 
                          1                           2 
         1500 silverado z71          1980 Silverado K10 
                          1                           1 
             2000 Silverado              2006 Silverado 
                          1                           1 
     2006 SILVERADO 2500 HD 2007 Silverado crew cab 4x4 
                          1                           1 
print("::::::::::::::::::::::::::::::::::")
[1] "::::::::::::::::::::::::::::::::::"
df3 <- df |> filter(str_detect(model, regex("[Ff].?150[^0]")))
head(table(df3$model),12)

              1999 F150 4X4               1999 f150 xlt 
                          1                           2 
      2000 F150 4x4 5 speed      2000 F150 XLT SUPERCAB 
                          1                           1 
             2002 F150 4 WD               2004 f150 4x4 
                          1                           1 
              2004 F150 XLT               2005 f150 fx4 
                          1                           1 
           2006 F150 Lariat     2006 f150 Supercrew Cab 
                          1                           1 
2012 F-150 Lariat SuperCrew        2013 F150 KING RANCH 
                          1                           1 

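The question above asks how the regex could be refined further. Here is one possibility, which is my suggestion rather than the author's answer: use word boundaries and allow only an optional space or hyphen between the letter and the digits. The model strings below are made up for illustration.

```r
library(stringr)  # loaded automatically as part of the tidyverse

models <- c("f150", "F 150", "f-150 xlt", "pacifica $1500", "f1500")

# (?i)   case-insensitive matching, so F and f are both covered
# \\b    word boundary, so the match cannot start or end inside a longer token
# [ -]?  at most one space or hyphen between the letter and the digits
# The trailing \\b rejects "f1500" without needing the [^0] patch.
str_detect(models, regex("(?i)\\bf[ -]?150\\b"))
# TRUE TRUE TRUE FALSE FALSE
```

Note that [ -]? is stricter than .?: it allows only a space or a hyphen between the f and the 150, rather than any character at all.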
3.2.7 A function between numeric and visual

Milestone 1 is supposed to be entirely about numeric descriptions of the data, not visual descriptions, which will be covered in Milestone 2. Yet there is one function that exists in a gray area between numeric and graphical. That is the stem and leaf plot. Consider the following output.

stem(df$odometer)

  The decimal point is 6 digit(s) to the right of the |

  0 | 00000000000000000000000000000000000000000000000000000000000000000000+421502
  1 | 00000000000000000000000000000000000000000000000000000000000000000000+562
  2 | 000000000000000000111122223333333335555566666677789
  3 | 02333357
  4 | 0577777777
  5 | 005556666666669
  6 | 15
  7 | 555678888888888888
  8 | 04789999
  9 | 000189
  10 | 00000000000000000000000000000000000000000000000000000000000000000000+58

The output does not need any graphical processor. It consists only of characters that can be included in text. Yet it is a kind of graphic, because you can see at a glance, for instance, that most of the cars have either very little mileage on the odometer or very much. Read it like this:

  • The stem is the vertical line.
  • The numbers to the left of the stem are, in this case, numbers in the sixth place to the left of the decimal point. In other words the first row represents zero to 999999.
  • Each character to the right of the stem represents one car. The +421502 at the end of the first row indicates that there are 421,502 additional cars in that row that are not shown. The leaf digits represent the next significant digit after the one on the stem.
  • It is probably easier to read a stem and leaf plot for a smaller data frame, in the following case for the first 100 cars in the above data frame.
stem(df$odometer[1:100])

  The decimal point is 4 digit(s) to the right of the |

   0 | 0222788900012223447789999
   2 | 11267799000004555678
   4 | 01123335568
   6 | 39117
   8 | 0804567
  10 | 0
  12 | 8
  14 | 5
  16 | 6
  18 | 2

These entries come from a reduced data frame, where I first ran the code above after getting rid of the high-priced and free cars. It may be easier to understand the plot if you look at the entries themselves.

head(df$odometer,n=100L) |> sort(decreasing=TRUE)
 [1] 192000 176144 144700 128000  99615  97000  96003  95000  94020  90000
[11]  88000  80318  77087  71229  70760  68696  63129  57923  55783  55251
[21]  55068  43182  43000  42755  41568  41124  40784  39508  37725  37332
[31]  35835  35320  35290  34940  34152  30237  30176  30047  30041  29652
[41]  29499  28942  26978  26685  26129  22120  20856  20581  19179  19160
[51]  18650  18531  17805  17302  16594  14230  14169  13035  12302  12231
[61]  12102  10688   9954   9859   9704   8663   8141   7885   6897   2195
[71]   1834   1722     21

The very first row in the stem and leaf plot above counts cars with odometer readings below 20,000 miles. There are 25 of them. They all have a 0 or a 1 in the fifth position to the left of the decimal point. Only one of those, the car showing 21 miles, has zeros in both of the first two positions. It is the very first entry after the stem, represented as a zero. The next three entries are the cars with the next lowest readings, rounding to 2 in the next decimal position. They are represented as 2s. You can see at a glance that, in this group of 100 cars, the lower odometer readings predominate. By the way, the stem() function discards NA values before processing the remainder. So there are only a total of 73 characters to the right of the stem on all the rows put together.
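To see the NA behavior in isolation, here is a toy vector of my own invention. The stem() function simply drops the NAs and plots the remaining values.

```r
x <- c(12, 15, 21, 24, 24, 31, NA, NA)
stem(x)          # six leaves appear; the two NAs are silently dropped
sum(!is.na(x))   # 6 non-NA values, matching the number of leaves
```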

There is some difference of opinion as to whether stem() is graphical or numerical. What do you think?
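One way to argue the "numerical" side of that debate: the output of stem() is nothing but character strings, which you can capture with capture.output() and handle like any other text. The tiny vector here is my own example.

```r
# stem() writes plain characters to the console; capturing them shows
# that no graphics device is involved.
txt <- capture.output(stem(c(10, 12, 21, 33, 35)))
class(txt)          # "character" -- just lines of text, not a plot object
cat(txt, sep = "\n")
```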

3.2.8 More on Quarto

It may surprise you to know that Quarto, the tool you're using to write your reports for this course, has other uses as well. All the syllabuses for my courses are written in Quarto, as are all my slideshows. Why not just use Microsoft Word and PowerPoint for these things? There are a great many reasons, but the ones that concern us for this course are chiefly reproducible research and literate programming. If I required you to use GitHub for your group projects, there would be a third: version control. Consider this last one first. The following cartoon from the archive of phdcomics highlights the problem of version control when Microsoft Word is used to write papers.

version control cartoon

The illustrated problem concerns only a single grad student and advisor, but it gets worse in a group project. The solution is to break the work into parts and use a shared version control system to keep track of the changes made by different people, so you can revert some text to an earlier version if needed and integrate all the parts at the end. Quarto enables this by being a plain text system. Everything you write is a plain text document without hidden binary data. It can be opened in your choice of text editor. This is valuable because most data scientists (like programmers) have a preferred text editor and don't like to be told which one to use. It is also valuable because a tracking system like Git can track all changes in the file. We may even require the use of Git in a future iteration of this course because of the many issues that groups experience in trying to integrate their files. (If you only use the RStudio interface to your file, it may be a surprise that it contains only plain text, since you see color coding and panels with graphical results. These are add-ons of RStudio, and the underlying file is the same with or without them. I usually use a plain text editor called Vim to edit Quarto files, with a terminal window open on R for pasting in code from the document to check it. I do this because it's quicker than RStudio and takes up less screen real estate.)

As for literate programming, that is defined as writing programs for people to read as much as for machines to execute. By integrating R code and narrative in one file, you make it possible for people to understand your code better than they could if the documentation were written separately.
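As a minimal sketch of what literate programming looks like in practice, here is a tiny qmd file of my own invention that interleaves narrative with a runnable chunk (the title and the file name passed to read.csv() are hypothetical):

````
---
title: "Odometer summary"
---

The odometer readings are heavily skewed, as the numeric
summary below shows.

```{r}
df <- read.csv("vehicles.csv")
summary(df$odometer)
```
````

When the file is rendered, the prose, the code, and the code's output all appear together in the finished document, so the explanation can never drift away from the computation it describes.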

Reproducible research is the most important of the three. There is a widely discussed phenomenon in the scientific community called the reproducibility crisis: much of published academic research cannot be reproduced by other researchers. There are many reasons proposed for the reproducibility crisis and at least as many proposed solutions. One of the best is to leave an audit trail, and this is enabled by Quarto's blend of narrative and code. This is why I insist that you start with raw data and document all your transformations in your qmd file.