HW Instructions
Statistics for Informatics


Mick McQuaid


October 29, 2023

Group Work, 40 points

  • 4 milestones, each 10 points
  • Milestone 1: description (tables, summary stats)
  • Milestone 2: description (visualization)
  • Milestone 3: regression
  • Milestone 4: regression diagnostics

All milestones will require one group member to post the relevant .qmd file and .html file to Canvas. All group members will have access to the group assignment box so that you can verify that it has been submitted on time.

Files must be named m1.qmd, m2.qmd, … and m1.html, m2.html, … Substantial points will be deducted for any other names.

You may use the template.qmd file as a starter file for all milestones and gradually fill in the sections and subsections for each milestone, so that at the end of the course, you have a complete report.

  • Milestone 1 corresponds to the heading Part 1: Numerical Description
  • Milestone 2 corresponds to the heading Part 2: Visual Description
  • Milestone 3 corresponds to the heading Regression Analysis
  • Milestone 4 corresponds to the heading Regression Diagnostics

You have two choices for the vehicles.csv file. You may use my copy (preferred) or, if you do not use the server, add an .Rprofile file giving the location where you keep your own copy, so that your homework file still refers to a generic location that I can use in grading your work. I don’t want to see any setwd() calls in your submission. I will put all the files into one directory (folder), so it is important that you use exactly the read_csv() call I have written in the template.qmd file.

Milestone 1: description (tables, summary stats)

There are two files to consider in the Files section on Canvas: vehicles.csv and vehicles.dat. The first is a comma-separated values file of cars advertised on Craigslist. The second is a plain text file (you can open this in any text editor such as Notepad or TextEdit) containing metadata for the first file.

Your assignment is to summarize the data using contingency tables and summary statistics. This will prepare you for the later milestones by familiarizing you with the data, which is the same for every milestone.

Be aware that not every column is useful. Not every column is easy to describe. You must use your best judgment as you familiarize yourself with the data to identify the most important columns to describe and the columns that relate to each other. Part of your grade depends on your judgment about which columns to include!
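
As a minimal sketch of what a contingency table and summary statistics look like in R, the following uses a tiny invented data frame rather than the real vehicles.csv. The values are made up, and the condition column is a hypothetical name used only for illustration; check vehicles.dat for the actual columns.

```r
# Toy stand-in for the vehicles data; every value here is invented.
toy <- data.frame(
  price     = c(4500, 12000, 8000, 15500, 3000, 9900),
  state     = factor(c("ny", "tx", "tx", "ca", "ny", "tx")),
  condition = factor(c("good", "excellent", "good", "excellent", "fair", "good"))
)

# Summary statistics (min, quartiles, mean, max) for a numeric column
price_summary <- summary(toy$price)
print(price_summary)

# Two-way contingency table of two categorical columns
state_by_condition <- table(toy$state, toy$condition)
print(state_by_condition)

# Row proportions: within each state, the share of each condition
print(round(prop.table(state_by_condition, margin = 1), 2))
```

A table like this is only a starting point; the narrative around it should say what the counts and proportions mean.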

A few points to remember:

  • You must organize your report using subheadings.
  • There should not be redundant sections.
  • There should be a narrative explaining your statistics so that I know that you know what you are talking about.
  • Do not organize the report according to who did what. There should be no subheadings that are group member names.

Suggested steps:

Step 1. Get rid of rows you don’t want, such as those with prices over or under some threshold value you choose.

Step 2. Get rid of columns you don’t want to analyze, such as url or VIN.

Step 3. Convert some of the chr columns to factor. For instance, you can say df$state <- as.factor(df$state).

Step 4. Save your file by saying something like save(df,file="df.RData")

Step 5. Stop working in this file and open a file called intermediate.qmd.

Step 6. At the beginning of that file, say load("df.RData").

Step 7. Do all your work in that file, then paste the work back into your template.qmd file so you can run it as required. (Remember, you are not turning in a .RData file. Your m1.qmd file must start by reading in the vehicles.csv file and then process the resulting data frame.)

Step 8. Merge your template.qmd file with those of your group members into one m1.qmd file. For example, you could name all your individual template files with your names and one group member could merge them together. This should be easy for Milestone 1 since an obvious way to divide up your work is to assign different columns to different group members.

Step 9. Meet as a group and organize your work into subsections that make sense according to your findings. Make your final report something you could hand to a manager to familiarize them with the data. The numerical description section of the report should end with a summary of your findings.
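
Steps 1 through 6 above can be sketched as follows. This is an illustration on an invented five-row data frame, not the template code: the price thresholds are placeholders, and the .RData file is written to a temporary directory here only so the sketch does not clutter your working folder.

```r
# Toy stand-in for the data frame read from vehicles.csv (values invented).
df <- data.frame(
  url   = c("u1", "u2", "u3", "u4", "u5"),
  VIN   = c("v1", "v2", "v3", "v4", "v5"),
  price = c(0, 7500, 22000, 950000, 13000),
  state = c("ny", "tx", "tx", "ca", "ny"),
  stringsAsFactors = FALSE
)

# Step 1: keep only rows whose price falls inside thresholds you choose.
# The cutoffs 500 and 100000 are placeholders, not recommended values.
df <- df[df$price > 500 & df$price < 100000, ]

# Step 2: drop columns you do not plan to analyze.
df <- df[, !(names(df) %in% c("url", "VIN"))]

# Step 3: convert a chr column to a factor.
df$state <- as.factor(df$state)

# Step 4: save the cleaned data frame.
rdata_path <- file.path(tempdir(), "df.RData")
save(df, file = rdata_path)

# Step 6: at the top of intermediate.qmd, load it back.
rm(df)
load(rdata_path)
str(df)
```

Note that a comparison such as df$price > 500 returns NA for rows where price is missing, so real data may need the missing prices handled before or during Step 1.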

Milestone 2: description (visualization)

This assignment is to describe the data using pictures, such as barplots, mosaic plots, histograms, and more.

As with the previous milestone, your findings must be organized into a coherent report. Follow the points to remember and suggested steps from Milestone 1 again. Do include the Numerical Description section and the headings for the following sections in this iteration of the report.
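
A minimal sketch of the three plot types named above, using base R graphics on an invented data frame. The condition column is a hypothetical name, and the titles and axis labels are placeholders you would replace with wording that fits your report.

```r
# Toy stand-in for the vehicles data; every value here is invented.
toy <- data.frame(
  price     = c(4500, 12000, 8000, 15500, 3000, 9900),
  state     = factor(c("ny", "tx", "tx", "ca", "ny", "tx")),
  condition = factor(c("good", "excellent", "good", "excellent", "fair", "good"))
)

# Barplot: counts of one categorical column
state_counts <- table(toy$state)
barplot(state_counts, main = "Listings by state", xlab = "State", ylab = "Count")

# Histogram: distribution of one numeric column
price_hist <- hist(toy$price, main = "Distribution of price", xlab = "Price")

# Mosaic plot: relationship between two categorical columns
mosaicplot(table(toy$state, toy$condition),
           main = "Condition by state", xlab = "State", ylab = "Condition")
```

Each plot answers a different question: the barplot shows how common each category is, the histogram shows the shape of a numeric column, and the mosaic plot shows how two categorical columns relate.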

Milestone 3: regression

This assignment is to conduct a regression analysis using price as the response. The response variable is also known as y, the target variable, or the output variable. It is up to you to decide which explanatory variables to use. Explanatory variables are also known as x, features, or input variables. This assignment will be graded on your choice of explanatory variables and the quality of your explanations of the regression output.

You are responsible for completing the following steps.

  1. Create two linear models through trial and error. Explain them.
  2. Create a model through the ols_step_best_model() function of the olsrr package. Explain it.
  3. Create a fourth model based on, but not identical to, the model created above. The reason that it is not identical is that you will probably find that the model created in step 2 contains only some values of the categorical variables. This fourth model will be manually specified, not determined by the olsrr approach. This fourth model is the model you will use for milestone 4. Explain it.
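
As a minimal sketch of step 1 (two models built by trial and error), the following fits two linear models with lm() on simulated data. The odometer and year columns are hypothetical names, and the data are generated here, so the coefficients say nothing about the real vehicles data.

```r
# Simulated data; odometer and year are hypothetical explanatory variables.
set.seed(1)
n <- 50
toy <- data.frame(
  odometer = runif(n, 10000, 150000),
  year     = sample(2005:2020, n, replace = TRUE)
)
toy$price <- 20000 - 0.08 * toy$odometer + 500 * (toy$year - 2005) +
  rnorm(n, sd = 1500)

# Model A: one explanatory variable
m_a <- lm(price ~ odometer, data = toy)

# Model B: add a second explanatory variable
m_b <- lm(price ~ odometer + year, data = toy)

# Coefficients, standard errors, and fit statistics for model B
print(summary(m_b))

# Compare the two models on adjusted R-squared
adj_r2 <- c(m_a = summary(m_a)$adj.r.squared,
            m_b = summary(m_b)$adj.r.squared)
print(adj_r2)
```

The explanation you write should cover what each coefficient means in terms of price, not just which model has the higher adjusted R-squared.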

Two important notes:

  1. If you use my mlTiny.qmd file to develop your data frame, you will have only 1,822 rows. This is a drastic decrease but I think it remains a worthwhile sample. It involves removing some columns first and then removing all rows with NAs, so you would need to do things in the same order in which I do them in the mlTiny.qmd file to get exactly 1,822 rows. I will deduct points if you use fewer than 1,822 rows, but exactly 1,822 is fine.
  2. You must come up with an overall summary that explains your reasoning for the path you took. You should include some visualizations or refer to the visualizations from milestone 2 to help explain why you chose some variables and not others. Keep in mind that your .qmd file should include your milestone 1 (possibly altered) and milestone 2 (possibly altered), so that the report builds on what you now know about the data.
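
The first note says that the order of removing columns and removing NAs affects how many rows survive. A minimal sketch of why, on an invented data frame (the notes column is a hypothetical, mostly empty column, not one taken from vehicles.csv):

```r
# Toy data; notes is a hypothetical column that is mostly NA.
toy <- data.frame(
  price = c(5000, 7000, NA, 9000, 12000),
  state = c("ny", "tx", "ca", NA, "ny"),
  notes = c(NA, NA, "x", NA, "y")
)

# Order 1: remove rows with NAs first, then drop the unwanted column.
nas_first <- na.omit(toy)
nas_first$notes <- NULL

# Order 2: drop the unwanted column first, then remove rows with NAs.
cols_first <- na.omit(toy[, c("price", "state")])

print(nrow(nas_first))
print(nrow(cols_first))
```

Dropping the mostly empty column first keeps rows that would otherwise be discarded only because of that column, so the second order retains more rows.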

Milestone 4: regression diagnostics

This assignment is to assess the regression conducted in the previous assignment, using diagnostic plots and statistics. This assignment will be graded on the basis of your explanations of the resulting diagnostic plots and statistics.
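
A minimal sketch of diagnostic plots and one diagnostic statistic, using base R on simulated data. The odometer column is a hypothetical name, and the 4/n cutoff for Cook's distance is a common rule of thumb, not a requirement of this assignment.

```r
# Simulated data; odometer is a hypothetical explanatory variable.
set.seed(2)
toy <- data.frame(odometer = runif(40, 10000, 150000))
toy$price <- 18000 - 0.07 * toy$odometer + rnorm(40, sd = 1200)

fit <- lm(price ~ odometer, data = toy)

# The four standard diagnostic plots for an lm object:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Cook's distance flags observations with a large influence on the fit.
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / nrow(toy))
print(influential)
```

For each plot, the explanation should state what pattern you would expect if the model assumptions held and what you actually see.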

You will also summarize all your findings in a Conclusion section and clean up the report further so that a manager can easily digest it. You will add echo=FALSE to your chunk options at the beginning of the file, so that it looks like the following. Be sure to add the comma on the previous line!

#| label: optionsSetup
#| include=FALSE

Notice that the R code chunks will no longer appear. That doesn’t matter because a manager may not be able to read them anyway. Instead, you need to include more verbiage to explain to the manager what is going on. For example, in the Obtaining Data section, nothing now appears. You have to decide what to put there about the source of the data.
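
One way the echo=FALSE change might look, assuming the setup chunk in template.qmd sets its defaults with knitr::opts_chunk$set(). That assumption is not confirmed by these instructions, and the message option below is only a placeholder for whatever options the template already contains.

```r
# Hypothetical body of the setup chunk; adapt it to what template.qmd contains.
knitr::opts_chunk$set(
  message = FALSE,  # placeholder for an existing option; note the trailing comma
  echo = FALSE      # added line: hides the R code in the rendered .html file
)
```

If the comma after the previous option is missing, R reports a syntax error when the file is rendered.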

Weekly Exercises, 20 points

There are fourteen weekly exercises: the first nine are worth one point each, the next four are worth two points each, and the last one is worth three points. We’ll start them in class, but you may have to finish on your own. It is fine to work with a partner but, if you collaborate, both names must appear as authors on both submissions or it will be regarded as cheating. Every student must turn in their own copy of any collaborative work.

You must substantially complete the exercises to receive any credit. Even incorrect answers are okay as long as you make a serious attempt at solving each problem.

Take-home Exam, 30 points

This exam, provided near the end of the semester and due at our official exam time, is to conduct a complete analysis of a given data set, including description (tables, summary stats, visualization), regression, and regression diagnostics. It resembles the group milestones, except that it is accomplished in a compressed time period.