We did some exercises, for which there are now solutions in the file week02exercises-soln.qmd and week02exercises-soln.html. You should examine and compare these two files, especially the exercise parts.
3.2 Week 03: More on R and Quarto
We will establish groups for the milestones. I’m open to moving you around if needed, subject to the constraint that we have no more than five members in a group. I expect to have four groups of four and two groups of five.
3.2.1 The template files
The template files show how you can begin the milestones. Notice that I have named the data frame df in the template.qmd file. As a result, I can write expressions like the following.
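For example, the tabulation shown below can be produced with a one-line chunk like this (the original chunk isn't echoed here; the column name title_status is taken from the output itself):

# count the values of one categorical column, most frequent first
df |> count(title_status, sort = TRUE)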
# A tibble: 7 × 2
title_status n
<chr> <int>
1 clean 405117
2 <NA> 8242
3 rebuilt 7219
4 salvage 3868
5 lien 1422
6 missing 814
7 parts only 198
The above is a good way to investigate categorical data. Notice that I've used the pipe operator |>, so the data frame df is sent to the count() function. A particular categorical column of df is then counted and sorted in descending order of frequency. The count() function is part of the dplyr package, which is one of the tidyverse packages. You only need to load the tidyverse set of packages once in a document, preferably near the beginning.
Another point about the above result is that the count() function behaves differently when a column has a large number of distinct values. For example, region has 404 distinct values, and by default only the first ten rows of the resulting tibble are displayed. I can make it display all the values by adding the print() function:
df |> count(region, sort = TRUE) |> print(n = 404)
# A tibble: 404 × 2
region n
<chr> <int>
1 columbus 3608
2 jacksonville 3562
3 spokane / coeur d'alene 2988
4 eugene 2985
5 fresno / madera 2983
6 orlando 2983
7 bend 2982
8 omaha / council bluffs 2982
9 kennewick-pasco-richland 2981
10 new hampshire 2981
11 nashville 2980
12 salem 2980
13 oklahoma city 2979
14 reno / tahoe 2979
15 boston 2978
16 rochester 2978
17 sarasota-bradenton 2977
18 stockton 2977
19 boise 2976
20 portland 2976
21 houston 2975
22 south jersey 2974
23 minneapolis / st paul 2973
24 modesto 2973
25 philadelphia 2973
26 seattle-tacoma 2973
27 grand rapids 2972
28 baltimore 2971
29 las vegas 2971
30 milwaukee 2971
31 pittsburgh 2971
32 cincinnati 2970
33 sacramento 2970
34 tulsa 2970
35 washington, DC 2970
36 austin 2969
37 north jersey 2969
38 tucson 2969
39 charlotte 2968
40 maine 2966
41 atlanta 2965
42 hawaii 2964
43 long island 2964
44 central NJ 2961
45 phoenix 2960
46 detroit metro 2959
47 dallas / fort worth 2957
48 kansas city, MO 2957
49 ft myers / SW florida 2955
50 cleveland 2953
51 san diego 2953
52 albuquerque 2952
53 denver 2952
54 orange county 2952
55 inland empire 2950
56 new york city 2950
57 norfolk / hampton roads 2946
58 tampa bay area 2945
59 st louis, MO 2940
60 los angeles 2937
61 SF bay area 2936
62 chicago 2929
63 des moines 2923
64 south florida 2920
65 raleigh / durham / CH 2918
66 colorado springs 2914
67 san antonio 2891
68 knoxville 2763
69 anchorage / mat-su 2742
70 hartford 2564
71 albany 2537
72 bakersfield 2528
73 redding 2526
74 springfield 2520
75 ventura county 2518
76 vermont 2513
77 fayetteville 2415
78 madison 2387
79 fort collins / north CO 2385
80 richmond 2346
81 greenville / upstate 2327
82 rhode island 2320
83 bellingham 2313
84 indianapolis 2303
85 akron / canton 2211
86 greensboro 2078
87 western massachusetts 2039
88 buffalo 2006
89 palm springs 1978
90 el paso 1977
91 medford-ashland 1962
92 louisville 1938
93 worcester / central MA 1914
94 little rock 1841
95 ocala 1828
96 dayton / springfield 1787
97 yuba-sutter 1747
98 memphis 1724
99 yakima 1704
100 billings 1701
101 wichita 1694
102 hudson valley 1692
103 birmingham 1647
104 new haven 1644
105 daytona beach 1633
106 charleston 1509
107 chico 1486
108 san luis obispo 1455
109 monterey bay 1451
110 asheville 1423
111 toledo 1406
112 columbia 1388
113 missoula 1378
114 lansing 1367
115 jackson 1351
116 space coast 1346
117 syracuse 1338
118 fredericksburg 1328
119 wenatchee 1324
120 new orleans 1321
121 huntsville / decatur 1273
122 appleton-oshkosh-FDL 1261
123 lehigh valley 1261
124 flint 1258
125 columbia / jeff city 1257
126 bozeman 1249
127 kalamazoo 1236
128 western slope 1188
129 corpus christi 1176
130 treasure coast 1173
131 corvallis/albany 1168
132 roanoke 1121
133 lexington 1115
134 tri-cities 1110
135 ann arbor 1085
136 rockford 1059
137 east idaho 1052
138 santa barbara 1052
139 lakeland 1048
140 myrtle beach 1043
141 chattanooga 1038
142 green bay 1033
143 winston-salem 1027
144 mcallen / edinburg 1019
145 scranton / wilkes-barre 1008
146 moses lake 994
147 tyler / east TX 992
148 lancaster 988
149 kalispell 978
150 harrisburg 976
151 wilmington 975
152 fargo / moorhead 969
153 south bend / michiana 969
154 st cloud 954
155 delaware 949
156 gainesville 926
157 visalia-tulare 922
158 eastern NC 895
159 flagstaff / sedona 881
160 saginaw-midland-baycity 876
161 hickory / lenoir 853
162 eau claire 846
163 jersey shore 838
164 erie 805
165 santa fe / taos 801
166 twin falls 794
167 fort wayne 788
168 clarksville 785
169 york 777
170 tallahassee 771
171 gold country 765
172 duluth / superior 763
173 kenosha-racine 757
174 prescott 746
175 pueblo 746
176 santa maria 740
177 baton rouge 733
178 wausau 726
179 east oregon 720
180 olympic peninsula 717
181 lewiston / clarkston 710
182 skagit / island / SJI 701
183 boulder 694
184 oregon coast 694
185 macon / warner robins 687
186 quad cities, IA/IL 687
187 waco 681
188 winchester 672
189 youngstown 664
190 killeen / temple / ft hood 662
191 merced 654
192 cedar rapids 648
193 south coast 645
194 charlottesville 642
195 battle creek 639
196 mobile 626
197 pensacola 622
198 sioux falls / SE SD 621
199 northern michigan 612
200 wyoming 610
201 athens 607
202 utica-rome-oneida 607
203 lincoln 604
204 cape cod / islands 598
205 pullman / moscow 595
206 wichita falls 589
207 eastern CT 583
208 topeka 582
209 amarillo 564
210 southern illinois 555
211 waterloo / cedar falls 555
212 holland 534
213 brainerd 533
214 monroe 530
215 great falls 524
216 la crosse 521
217 savannah / hinesville 516
218 rapid city / west SD 502
219 lynchburg 496
220 lubbock 492
221 lima / findlay 485
222 salt lake city 485
223 augusta 480
224 southwest michigan 476
225 binghamton 474
226 muskegon 473
227 poconos 464
228 north mississippi 462
229 central michigan 461
230 mankato 445
231 finger lakes 439
232 mohave county 435
233 odessa / midland 433
234 peoria 432
235 danville 427
236 fairbanks 427
237 shreveport 422
238 reading 416
239 bowling green 415
240 northwest GA 415
241 watertown 414
242 evansville 413
243 montgomery 408
244 southern maryland 405
245 northwest CT 397
246 sioux city 393
247 humboldt county 385
248 elmira-corning 379
249 harrisonburg 375
250 eastern shore 368
251 bemidji 365
252 altoona-johnstown 364
253 st george 362
254 williamsport 362
255 brownsville 361
256 mansfield 359
257 las cruces 358
258 port huron 358
259 upper peninsula 355
260 janesville 343
261 klamath falls 343
262 boone 338
263 jonesboro 337
264 laredo 337
265 yuma 335
266 joplin 328
267 st augustine 328
268 fort smith 327
269 dothan 325
270 florence 324
271 st joseph 323
272 college station 313
273 zanesville / cambridge 313
274 morgantown 307
275 ithaca 303
276 panama city 298
277 roseburg 294
278 frederick 288
279 beaumont / port arthur 287
280 the thumb 286
281 annapolis 285
282 western maryland 285
283 texoma 284
284 imperial county 282
285 champaign urbana 281
286 sheboygan 281
287 new river valley 275
288 glens falls 274
289 ashtabula 270
290 northern panhandle 268
291 parkersburg-marietta 264
292 northern WI 262
293 valdosta 262
294 lafayette / west lafayette 260
295 chillicothe 257
296 plattsburgh-adirondacks 256
297 helena 255
298 southwest VA 254
299 iowa city 253
300 eastern panhandle 251
301 manhattan 251
302 muncie / anderson 248
303 southeast missouri 245
304 sandusky 244
305 victoria 243
306 sierra vista 241
307 mendocino county 238
308 lake of the ozarks 236
309 abilene 235
310 north central FL 230
311 dubuque 228
312 galveston 225
313 chautauqua 221
314 kenai peninsula 221
315 bloomington 217
316 bloomington-normal 217
317 mason city 217
318 eastern kentucky 213
319 grand island 212
320 western KY 212
321 texarkana 211
322 state college 210
323 gadsden-anniston 207
324 lawton 207
325 florida keys 203
326 lawrence 199
327 stillwater 198
328 terre haute 198
329 ames 194
330 okaloosa / walton 194
331 grand forks 192
332 cookeville 189
333 brunswick 187
334 hilton head 186
335 cumberland valley 179
336 lafayette 179
337 san marcos 179
338 hanford-corcoran 178
339 meadville 177
340 southeast IA 175
341 decatur 171
342 heartland florida 170
343 high rockies 169
344 salina 168
345 florence / muscle shoals 165
346 tuscaloosa 162
347 san angelo 161
348 kokomo 156
349 lake charles 156
350 mattoon-charleston 156
351 tuscarawas co 154
352 logan 152
353 catskills 147
354 northwest OK 145
355 northwest KS 143
356 auburn 142
357 owensboro 140
358 statesboro 140
359 farmington 136
360 twin tiers NY/PA 136
361 butte 134
362 central louisiana 132
363 gulfport / biloxi 127
364 north dakota 126
365 elko 120
366 kirksville 119
367 southeast KS 119
368 susanville 119
369 huntington-ashland 118
370 scottsbluff / panhandle 117
371 la salle co 116
372 show low 112
373 outer banks 109
374 del rio / eagle pass 108
375 potsdam-canton-massena 102
376 roswell / carlsbad 100
377 hattiesburg 99
378 southwest KS 96
379 bismarck 92
380 south dakota 90
381 deep east texas 88
382 houma 88
383 fort dodge 87
384 southeast alaska 84
385 siskiyou county 83
386 north platte 80
387 clovis / portales 78
388 ogden-clearfield 76
389 eastern montana 75
390 provo / orem 75
391 western IL 63
392 oneonta 62
393 southwest MN 56
394 pierre / central SD 55
395 eastern CO 40
396 southern WV 36
397 st louis 35
398 northeast SD 34
399 southwest TX 30
400 meridian 28
401 southwest MS 14
402 kansas city 11
403 fort smith, AR 9
404 west virginia (old) 8
The number 404 is because there are 404 unique values of region, which you can find from the output of the count() function. Of course, in a report for a manager or recruiter I would be unlikely to print out a column of 404 numbers; a manager or recruiter would be more interested in some sort of summary. This listing is primarily for your own preparation.
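The grouped summary shown below was produced by a separate chunk; a likely version of it, using group_by() and summarize() (the exact arguments are an assumption), is:

# the same counts as count(), but built with group_by()/summarize()
df |>
  group_by(region) |>
  summarize(nrow = n(), .groups = "keep") |>
  arrange(desc(nrow))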
# A tibble: 404 × 2
# Groups: region [404]
region nrow
<chr> <int>
1 columbus 3608
2 jacksonville 3562
3 spokane / coeur d'alene 2988
4 eugene 2985
5 fresno / madera 2983
6 orlando 2983
7 bend 2982
8 omaha / council bluffs 2982
9 kennewick-pasco-richland 2981
10 new hampshire 2981
11 nashville 2980
12 salem 2980
13 oklahoma city 2979
14 reno / tahoe 2979
15 boston 2978
16 rochester 2978
17 sarasota-bradenton 2977
18 stockton 2977
19 boise 2976
20 portland 2976
21 houston 2975
22 south jersey 2974
23 minneapolis / st paul 2973
24 modesto 2973
25 philadelphia 2973
26 seattle-tacoma 2973
27 grand rapids 2972
28 baltimore 2971
29 las vegas 2971
30 milwaukee 2971
31 pittsburgh 2971
32 cincinnati 2970
33 sacramento 2970
34 tulsa 2970
35 washington, DC 2970
36 austin 2969
37 north jersey 2969
38 tucson 2969
39 charlotte 2968
40 maine 2966
# ℹ 364 more rows
Another way (and a better one for high level understanding) to investigate categorical data is through contingency tables. You have already made some of these using the table() function and some associated functions that are mentioned in the week02exercises-soln.qmd file. It will take some intuition to figure out which pairs of categorical columns should be tabulated for Milestone 1.
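For example, a two-way table of two categorical columns can be built like this; the choice of fuel and title_status here is purely illustrative, and you should pick pairs that make sense for your analysis:

# counts of every combination of two categorical columns, with row and column totals
tbl <- table(df$fuel, df$title_status)
addmargins(tbl)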
3.2.2 Numerical Data
For numerical data, you can do what we did last week to investigate a single column.
summary(df$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 5.900e+03 1.395e+04 7.520e+04 2.649e+04 3.737e+09
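The formatted one-row summary below came from a chunk that isn't echoed here. A sketch that matches its column names, assuming format() with big.mark was used to insert the commas, is:

# the same statistics as summary(), formatted with thousands separators
df |>
  reframe(Min    = format(round(min(price)),            big.mark = ","),
          firstq = format(round(quantile(price, 0.25)), big.mark = ","),
          Median = format(round(median(price)),         big.mark = ","),
          Mean   = format(round(mean(price)),           big.mark = ","),
          thirdq = format(round(quantile(price, 0.75)), big.mark = ","),
          Max    = format(round(max(price)),            big.mark = ","))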
# A tibble: 1 × 6
Min firstq Median Mean thirdq Max
<chr> <chr> <chr> <chr> <chr> <chr>
1 0 5,900 13,950 75,199 26,486 3,736,928,711
df$price <- as.integer(df$price)
Warning: NAs introduced by coercion to integer range
You'll notice that one of the prices is 3.7 billion dollars, for a Toyota truck! This is obviously a misprint. You should probably remove this row from the data frame and save the data frame without it. A more sophisticated alternative would be to impute a value for this truck. There are many advanced statistical ways to do this, but they are beyond the scope of this course; van Buuren (2018) describes several excellent ones, particularly in that book's Section 5.1.1, Recommended Workflows. It is usually a mistake to replace missing (NA) values with an average, because doing so compresses the variance unnaturally. You may remove that particular row by saying df <- df[-(which.max(df$price)),]. Unfortunately, you will find that not to be very useful, because there are several prices of over a billion dollars for generic cars! In fact, there are several prices listed as 1234567890 and some listed as 987654321. Many other ridiculous patterns can be found in price. So what can you do? Personally, I might add something like the following lines right after reading in the file.
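These lines are reconstructed from the description in the next two paragraphs (the exact quantile() call in the second is an assumption); you would pick one of the two approaches, not both.

# Option 1: keep only cars priced above zero and below 100,000 dollars
df <- df[df$price < 100000 & df$price > 0, ]

# Option 2: keep only cars priced strictly between the 10th and 90th percentiles
df <- df[df$price < quantile(df$price, 0.90, na.rm = TRUE) &
         df$price > quantile(df$price, 0.10, na.rm = TRUE), ]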
The first line rids the data frame of cars priced at 100,000 dollars or more and still leaves you with over 300,000 automobiles to analyze. Of course, there will still be spurious entries, but at least it's a start. The first line also gets rid of cars priced at exactly zero dollars; examination of the data frame shows that many of the zero-dollar entries are just ads for used car dealers. The expression price<100000 & price>0 is called a compound Boolean expression. It is compound because of the ampersand, which stands for the word and: for a row to be kept, its price has to be less than 100,000 AND greater than zero.
The second line gets rid of prices at or above the 90th percentile and at or below the 10th percentile, which will still leave plenty of rows.
By the way, this is a good reminder that a data frame has a row index and a column index. You can refer to rows by saying df[expression,] and to columns by saying df[,expression]. The expression can be any expression that resolves to TRUE or FALSE. The first expression above resolves to TRUE for cars priced at greater than zero but less than 100,000 dollars, and FALSE otherwise. You can check how price and the other columns are stored by saying str(df), which displays the structure of the df data frame. You should be able to see that most of the columns are classified as chr, or character. This is not desirable: most of the columns classified as chr would more properly be classified as factors. Factors take up less space on your computer, are faster to process, and allow more types of processing than chr. Unfortunately, if you store your intermediate work as a .csv file, you will lose the factor designation. Therefore, I recommend the following steps (a code sketch follows the list).
Step 1. Get rid of rows you don’t want, such as those with prices over or under some threshold value you choose.
Step 2. Get rid of columns you don’t want to analyze, such as url or VIN.
Step 3. Convert some of the chr columns to factor. For instance, you can say df$state <- as.factor(df$state).
Step 4. Save your file by saying something like save(df,file="df.RData")
Step 5. Quit using this file and open a file called intermediate.qmd
Step 6. At the beginning of that file, say load("df.RData").
Step 7. Do all your work in that file, then paste the work back into your template.qmd file so you can run it as required. (Remember, you are not turning in a .RData file. Your m1.qmd file must start with reading in the vehicles.csv file and do processing on the resulting data frame.)
Step 8. Merge your template.qmd file with those of your group members into one m1.qmd file. For example, you could name all your individual template files with your names and one group member could merge them together. This should be easy for Milestone 1 since an obvious way to divide up your work is to assign different columns to different group members.
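Here is a minimal sketch of Steps 1 through 4 (and Step 6) in code. The price threshold, the dropped columns, and the factor conversions shown are examples, not requirements.

# Step 1: drop rows outside the price range you chose
df <- df[df$price < 100000 & df$price > 0, ]

# Step 2: drop columns you don't plan to analyze
df$url <- NULL
df$VIN <- NULL

# Step 3: convert selected character columns to factors
df$state        <- as.factor(df$state)
df$manufacturer <- as.factor(df$manufacturer)

# Step 4: save the cleaned data frame
save(df, file = "df.RData")

# Step 6 (at the top of intermediate.qmd): reload it
load("df.RData")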
3.2.3 Combining numerical and categorical selection
dfX <- subset(df, state %in% c("ca","ny") & type %in% c("sedan","SUV") & price < 99999 & price > 0)
tbl <- table(dfX$state, dfX$type)
addmargins(tbl)
sedan SUV Sum
ca 10313 6537 16850
ny 3487 2953 6440
Sum 13800 9490 23290
Above is an example of getting a small contingency table with only the data you want. The first line selects only cars offered in ca or ny, only sedans or SUVs, and only with prices below 99,999 dollars and more than zero dollars. Then we can make a compact contingency table of that new data frame and add the margins to it.
3.2.4 Getting the data displayed as you wish
Someone asked me how to display price ranges by manufacturer. Here’s one way to do that:
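The chunk that produced the table below isn't echoed here; a version using the reframe() function mentioned afterward (the exact call is an assumption, and whether na.rm = TRUE is needed depends on how you cleaned price earlier) is:

# minimum and maximum asking price for each manufacturer
df |>
  group_by(manufacturer) |>
  reframe(min = min(price), max = max(price))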
# A tibble: 43 × 3
manufacturer min max
<chr> <int> <int>
1 acura 574 37500
2 alfa-romeo 1000 37500
3 aston-martin 1947 37000
4 audi 504 37509
5 bmw 501 37500
6 buick 521 37500
7 cadillac 507 37500
8 chevrolet 502 37587
9 chrysler 539 37500
10 datsun 2200 30000
11 dodge 513 37500
12 ferrari 2034 35000
13 fiat 800 32500
14 ford 502 37513
15 gmc 501 37550
16 harley-davidson 2500 27995
17 honda 502 37500
18 hyundai 543 36900
19 infiniti 517 37500
20 jaguar 514 37388
21 jeep 501 37587
22 kia 529 37500
23 land rover 3199 10999
24 lexus 521 37500
25 lincoln 541 37495
26 mazda 515 36995
27 mercedes-benz 502 37509
28 mercury 600 37495
29 mini 600 35990
30 mitsubishi 538 37500
31 morgan 1000 36500
32 nissan 522 37575
33 pontiac 600 37500
34 porsche 560 37500
35 ram 502 37556
36 rover 507 37500
37 saturn 600 22000
38 subaru 501 37200
39 tesla 563 36999
40 toyota 505 37583
41 volkswagen 507 37000
42 volvo 502 37000
43 <NA> NA NA
The number 43 is because there are 43 manufacturers in the data frame. Note that, using the reframe() function, I could add a few more comma-separated statistics to the output.
3.2.5 Investigating words
To make a word cloud, I first exported the model column from the data frame to a file called bla, using the write_csv() function. Next I used Vim to convert all spaces to newlines, so that the file would have one word on each line. To do so I said :%s/ /\r/g in Vim.
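For reference, that export step might look like the following sketch (the file name bla comes from the text above; select() assumes the tidyverse is loaded):

# write just the model column to a plain text file, one value per line (plus a header)
df |> select(model) |> write_csv("bla")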
Next I said sort bla | uniq -c | sort -n > blabla at the command line to get the following output. This is just the last few lines of the file. Note that the most frequently occurring word in the file is 1500, which occurs 24,014 times.
3383 fusion
3475 tundra
3479 xl
3528 corolla
3585 unlimited
3985 altima
3985 explorer
4072 3500
4105 f-250
4256 mustang
4277 escape
5162 camry
5208 series
5244 pickup
5277 NA
5479 accord
5585 xlt
5659 awd
5660 cherokee
5667 f150
5680 tacoma
5734 civic
5744 s
5865 2d
6185 limited
6213 lt
6295 coupe
6519 crew
6869 premium
7190 duty
7319 se
7531 2500
7564 utility
8093 wrangler
8327 super
8642 sierra
8733 grand
9578 4x4
10283 f-150
14877 sedan
15152 cab
17181 silverado
17488 4d
23130 sport
24014 1500
Next I opened the blabla file in Vim and converted all sequences of spaces to a tab character, saying :%s/^ *// to get rid of leading spaces, then :%s/ */\t/ to convert the intercolumn spaces to tabs. I saved this file as wordfreq.tsv and opened it in R, using the following code to convert it to a word cloud.
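That chunk isn't reproduced here; a minimal sketch, assuming the wordcloud package and a minimum-frequency cutoff of my own choosing, is:

library(wordcloud)

# two tab-separated columns: frequency, then word
wf <- read_tsv("wordfreq.tsv", col_names = c("freq", "word"))

# draw the cloud, scaling word size by frequency
wordcloud(words = wf$word, freq = wf$freq,
          min.freq = 500, random.order = FALSE)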
I was not able to reproduce the error message I kept getting in class; this simply worked the first time through after class. I also cannot explain why several of the most frequently occurring words do not appear in the word cloud. I suspect it is because the word cloud is truncated in this display. When I have constructed this word cloud previously, it was ellipse-shaped and zoomed out; I have forgotten whatever I did to make that happen!
3.2.6 Filtering two specific trucks
Two of the most common words I found above were f-150 and silverado. Since these are two popular truck models, I thought to compare them by making a data frame for each. To do so I used the filter() function of the dplyr package. This is described in great detail in Chapter 4 of Wickham, Çetinkaya-Rundel, and Grolemund (2023). The special construction (?i) makes whatever follows case-insensitive. Thus, in this case, it picks up Silverado and SILVERADO, as well as silverado. It would also pick up sIlVeRaDo if such a strange mixed case version presented itself. This is called a regular expression or regex and is commonly used in finding and replacing text patterns.
The regexes for f-150 are more complicated. First, I used the alternating selector [Ff]. This stands for either a capital F or a small f, but not both. I can put any number of characters in the brackets and the selector will match any one of them occurring one time. Next comes a dot followed by a question mark: the dot matches any single character, and the question mark makes whatever precedes it optional, so .? means zero or one of any character. The whole construct [Ff].? can therefore be read as "an F or f, optionally followed by one more character." Next comes a literal 150, so an F or f plus at most one character is matched only if it is immediately followed by 150. The last construct in this regular expression is the negating alternator, [^] with a zero in it. The negating alternator matches anything except the characters following the ^ inside the brackets; in this case it means that any character except a zero is okay. It's actually a vestige of an earlier attempt. I had previously written the expression as [Ff].*150, ran it, and it erroneously picked up a model string that said "pacifica $1500". That was because I used a * instead of a ?. The * symbol means zero or more characters, so "pacifica $1500" matched: there is an f followed by some characters, followed by 150! I stuck the negating alternator [^0] on the end to get rid of that match before I realized that the real problem was the *. I report this so you can see that finding the right regex can be a long, iterative process. This regex picks up f150, F 150, f-150, and so on. I could have refined it further. Can you see how?
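The filter() calls themselves aren't shown above; a sketch using the regexes as described (str_detect() from stringr is my choice here, and grepl() would also work) is:

# one data frame per truck model, matched case-insensitively
silverado <- df |> filter(str_detect(model, "(?i)silverado"))
f150      <- df |> filter(str_detect(model, "[Ff].?150[^0]"))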
3.2.7 Stem and leaf plots
Milestone 1 is supposed to be entirely about numeric descriptions of the data, not visual descriptions, which will be covered in Milestone 2. Yet there is one function that sits in a gray area between numeric and graphical: the stem and leaf plot. Consider the following output.
stem(df$odometer)
The decimal point is 6 digit(s) to the right of the |
0 | 00000000000000000000000000000000000000000000000000000000000000000000+421502
1 | 00000000000000000000000000000000000000000000000000000000000000000000+562
2 | 000000000000000000111122223333333335555566666677789
3 | 02333357
4 | 0577777777
5 | 005556666666669
6 | 15
7 | 555678888888888888
8 | 04789999
9 | 000189
10 | 00000000000000000000000000000000000000000000000000000000000000000000+58
The output does not need any graphics device; it consists only of characters that can be included in plain text. Yet it is a kind of graphic, because you can see, for instance, that most of the cars have either very little mileage on the odometer or very much. Read it like this:
The stem is the vertical line.
The numbers to the left of the stem are, in this case, the millions digit: the decimal point sits six digits to the right of the bar, so the stem is in units of one million. In other words, the first row represents values from zero to 999,999.
Each character to the right of the stem represents one car. There are probably 80 zeros in the first row. The +421502 indicates that there are 421,502 more cars in that category that are not shown. The leaf digits represent the next significant digit after the one on the stem.
It is probably easier to read a stem and leaf plot for a smaller data frame, in the following case for the first 100 cars in the above data frame.
stem(df$odometer[1:100])
The decimal point is 4 digit(s) to the right of the |
0 | 0222788900012223447789999
2 | 11267799000004555678
4 | 01123335568
6 | 39117
8 | 0804567
10 | 0
12 | 8
14 | 5
16 | 6
18 | 2
These entries come from a reduced data frame where I first ran the price-filtering code from above, getting rid of the high-priced and free cars. It may make it easier to understand if you look at the entries themselves.
The very first row in the stem and leaf plot above counts cars with odometer readings below 20,000. There are 28 of them. They all have 0 or 1 in the fifth position to the left of the decimal. Only one of those, which is offered at 21 dollars, has zeros in both of the first two positions; it is the very first entry after the stem, represented as a zero. The next three entries are the cars with the next lowest readings, represented as 2s because their next digit is a 2. You can see at a glance that, in this group of 100 cars, the lower odometer readings predominate. By the way, the stem() function discards NA values before processing the remainder, so there are only 78 characters in total to the right of the stem on all the rows put together.
There is some difference of opinion as to whether stem() is graphical or numerical. What do you think?
3.2.8 More on Quarto
It may surprise you to know that Quarto, the tool you're using to write your reports for this course, has other uses as well. All the syllabuses for my courses are written in Quarto, as are all my slideshows. Why not just use Microsoft Word and PowerPoint for these things? There are a great many reasons, but the ones that concern us for this course are chiefly reproducible research and literate programming. If I required you to use GitHub for your group projects, there would be a third: version control. Consider this last one first. The following cartoon from the archive of PhD Comics highlights the problem of version control when using Microsoft Word to write your papers.
The illustrated problem concerns only a single grad student and advisor, but it gets worse in a group project. The solution is to break the work into parts and use a shared version control system to keep track of the changes made by different people, so you can revert text to an earlier version if needed and integrate all the parts at the end. Quarto enables this by being a plain text system. Everything you write is a plain text document without hidden binary data, and it can be opened in your choice of text editor. This is valuable because most data scientists (like programmers) have a preferred text editor and don't like to be told which one to use. It is also valuable because a tracking system like Git can track all changes in the file. We may even require the use of Git in a future iteration of this course because of the many issues that groups experience in trying to integrate their files. (If you only use the RStudio interface to your file, it may be a surprise that it contains only plain text, since you see color coding and panels with graphical results. These are add-ons of RStudio, and the underlying file is the same with or without them. I usually use a plain text editor called Vim to edit Quarto files, with a terminal window running R for pasting in code from the document to check it. I do this because it's quicker than RStudio and takes up less screen real estate.)
As for literate programming, that is defined as writing programs for people to read as much as for machines to read. By integrating R code and narrative in one file, you make it possible for people to understand your code better than they could if documentation were written separately.
Reproducible research is the most important of the three. There is a widely discussed phenomenon in the scientific community called the reproducibility crisis: many published research results cannot be reproduced by other researchers. There are many proposed reasons for the reproducibility crisis and at least as many proposed solutions. One of the best is to leave an audit trail, and this is enabled by Quarto's blend of narrative and code. This is why I insist that you start with the raw data and document all your transformations in your .qmd file.
van Buuren, Stef. 2018. Flexible Imputation of Missing Data. Boca Raton, FL: CRC Press.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media, Inc. https://r4ds.hadley.nz/.