HW Instructions
i306
Statistics for Informatics

Author

Mick McQuaid

Published

April 3, 2024

Project Work, 50 points

  • 4 milestones: milestones 1–3 are worth 12 points each, and milestone 4 is worth 14 points
  • Milestone 1: description (tables, summary stats)
  • Milestone 2: description (visualization)
  • Milestone 3: regression
  • Milestone 4: regression diagnostics and final report

All milestones will require you to post the relevant .qmd file and .html file to Canvas.

All files must be named as m1.qmd, m2.qmd, … and m1.html, m2.html, … Substantial points will be deducted for any other names.

You may use the template.qmd file as a starter file for all milestones and gradually fill in the sections and subsections for each milestone, so that at the end of the course, you have a complete report.

  • Milestone 1 corresponds to the heading Part 1: Numerical Description
  • Milestone 2 corresponds to the heading Part 2: Visual Description
  • Milestone 3 corresponds to the heading Regression Analysis
  • Milestone 4 corresponds to the heading Regression Diagnostics

I don’t want to see any setwd() functions in your output. I will put all the files into one directory (folder), so it is important that you use the read_csv() function I have written in the template.qmd file.

Milestone 1: description (tables, summary stats)

There are two files to consider in the Files section on Canvas: amesHousing2011.csv and amesHousing2011doc.txt. The first is a comma-separated values file of homes sold between 2006 and 2010 in Ames, Iowa. The second is a plain text file (you can open this in any text editor such as Notepad or TextEdit) containing metadata for the first file.

Your assignment is to summarize the data using contingency tables and summary statistics. This will prepare you for the next assignments by familiarizing you with the data, which is the same data used for every milestone.

Be aware that not every column is useful. Not every column is easy to describe. You must use your best judgment as you familiarize yourself with the data to identify the most important columns to describe and the columns that relate to each other. Part of your grade depends on your judgment about which columns to include!

A few points to remember:

  • You must organize your report using subheadings.
  • There should not be redundant sections.
  • There should be a narrative explaining your statistics so that I know that you know what you are talking about.

Suggested steps:

Step 1. Get rid of rows you don’t want, such as drastic outliers that won’t help in your eventual task of price prediction. Five outliers have already been removed but you may find others, such as an extremely large LotArea.
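
Here is a minimal sketch of Step 1. It uses a small made-up data frame in place of the Ames data, and the cutoff of 100000 is an arbitrary assumption for illustration, not a recommended threshold.

```r
library(tidyverse)

# Toy stand-in for the Ames data (values are made up for illustration)
df <- tibble(
  LotArea   = c(8450, 9600, 11250, 215245, 10084),
  SalePrice = c(208500, 181500, 223500, 375000, 307000)
)

# Inspect the distribution first; an extreme maximum suggests an outlier
summary(df$LotArea)

# Keep only the rows below a cutoff you can justify in your narrative
# (100000 is a hypothetical threshold)
df_trimmed <- df |> filter(LotArea < 100000)
nrow(df_trimmed)
```

Whatever cutoff you choose, say in the narrative why you chose it.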

Step 2. Get rid of columns you don’t want to analyze, such as Order and PID. Some columns have a lot of NAs, such as Alley, which is 93 percent NA. You have to decide whether these columns are helpful in making predictions, and you must choose whether to delete them. Part of your grade depends on your judgment resulting from examining the data. For milestone 1, I suggest you take a deep dive into 10–20 columns (including SalePrice) and ignore or delete the others.
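
One possible way to approach Step 2 is to compute the share of NA values in every column and then drop the columns you have decided against. The sketch below uses a made-up data frame, so the NA share it reports for Alley is not the real figure.

```r
library(tidyverse)

# Toy stand-in for the Ames data (values are made up for illustration)
df <- tibble(
  Order     = 1:5,
  PID       = c(101, 102, 103, 104, 105),
  Alley     = c(NA, NA, "Grvl", NA, NA),
  LotArea   = c(8450, 9600, 11250, 9550, 14260),
  SalePrice = c(208500, 181500, 223500, 140000, 250000)
)

# Proportion of NA values in each column, largest first
na_share <- sort(colMeans(is.na(df)), decreasing = TRUE)
na_share

# Drop the identifier columns and the mostly-NA column
df_small <- df |> select(-Order, -PID, -Alley)
names(df_small)
```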

Step 3. Convert some of the chr columns to factor. For instance, you can say df$MSSubClass <- as.factor(df$MSSubClass). You can convert to an ordered factor by saying df$LotShape <- factor(df$LotShape,ordered=TRUE,levels=c("Reg","IR1","IR2","IR3")) (and you should do so for columns listed as Ordinal in the amesHousing2011doc.txt file).
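
The two conversions in Step 3 can be sketched on a made-up data frame as follows. The rows are invented for illustration; the level names come from the example above.

```r
# Toy stand-in for the Ames data (values are made up for illustration)
df <- data.frame(
  MSSubClass = c(20, 60, 20, 120),
  LotShape   = c("Reg", "IR1", "IR3", "Reg")
)

# Nominal: codes that look numeric but are really categories
df$MSSubClass <- as.factor(df$MSSubClass)

# Ordinal: list the levels explicitly, in their intended order
df$LotShape <- factor(df$LotShape, ordered = TRUE,
                      levels = c("Reg", "IR1", "IR2", "IR3"))

levels(df$LotShape)

# Comparisons are meaningful on ordered factors
df$LotShape[1] < df$LotShape[3]
```

Note that a level such as IR2 stays in the factor even if no row in the data uses it.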

Step 4. Your m1.qmd file must start with reading in the amesHousing2011.csv file and do processing on the resulting data frame. Be sure to rename the template.qmd file as m1.qmd, not M1.qmd or m-1.qmd or m 1.qmd or anything else.

Step 5. Organize your work into subsections that make sense according to your findings. Make your final report something you could hand to a manager to familiarize them with the data. The numerical description section of the report should end with a summary of your findings.

You should certainly have several contingency tables and several uses of the summary() function. I expect a minimum of three contingency tables for a C, four for a B, and five for an A. You should take advantage of the subsection Numerical summary shortcuts in section 2 (Numerical and Visual Data Summaries) of the Study Guide. This will give you most of what you need except for contingency tables. A contingency table depicts co-occurrence of two or more character variables or factors. For example, using the Ames data, table(df$Neighborhood,df$LotShape) shows that some lot shapes are characteristic of some neighborhoods.
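
As a minimal sketch of a contingency table, the code below uses made-up rows with placeholder neighborhood names (A, B, C) rather than the real Ames values. Margins and row proportions are optional additions that can make a table easier to describe.

```r
# Toy stand-in for the Ames data (rows are made up for illustration)
df <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "B", "C"),
  LotShape     = c("Reg", "IR1", "Reg", "Reg", "IR1", "Reg")
)

# Counts of each Neighborhood / LotShape combination
tab <- table(df$Neighborhood, df$LotShape)
tab

# The same table with row and column totals
addmargins(tab)

# Row proportions: the share of each lot shape within a neighborhood
round(prop.table(tab, margin = 1), 2)
```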

On the other hand, some contingency tables are too large to be useful. You have to exercise judgment about whether it would be reasonable for a manager to realistically eyeball a given contingency table. (A common rule of thumb is whether the entire table fits on my screen.) If I find it too large to eyeball, I will deduct points. Quite a bit of this and the next milestone is about reducing unwieldy amounts of data to reasonably small descriptions.
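
One way to shrink a table that is too large, offered here only as a suggestion, is to lump the less frequent categories into a single Other level before tabulating. The sketch uses fct_lump_n() from the forcats package (loaded with the tidyverse) on made-up rows; the choice of keeping two levels and the column name NbhdLumped are assumptions for illustration.

```r
library(tidyverse)

# Toy stand-in for the Ames data (rows are made up for illustration)
df <- tibble(
  Neighborhood = c("A", "A", "A", "A", "B", "B", "B", "C", "D"),
  LotShape     = c("Reg", "Reg", "IR1", "Reg", "Reg", "IR1", "Reg", "Reg", "IR1")
)

# Keep the 2 most frequent neighborhoods; the rest become "Other"
# (NbhdLumped is a hypothetical column name)
df <- df |> mutate(NbhdLumped = fct_lump_n(Neighborhood, n = 2))

# The lumped table has 3 rows instead of 4
tab_small <- table(df$NbhdLumped, df$LotShape)
tab_small
```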

One artifact, the stem-and-leaf plot, using the stem() function in R, can be regarded as either a numerical summary or a visual summary. It can be used in either m1 or m2.
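
A stem-and-leaf plot takes one line of code. The prices below are made up and expressed in thousands of dollars so that the stems stay short.

```r
# Made-up sale prices, in thousands of dollars
price_k <- c(105, 118, 129, 131, 140, 142, 155, 163,
             178, 181, 208, 223, 250, 307)

# Default stem-and-leaf plot
stem(price_k)

# scale = 2 makes the plot roughly twice as long
stem(price_k, scale = 2)
```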

You should use the dplyr package to summarize the data. In particular, you should find the functions select(), filter(), mutate(), group_by(), and summarize() useful. The file m1worksheet.qmd shows some ways you can use them. Keep in mind that, when you load the tidyverse, you will have automatically loaded dplyr, so it is wasteful to load it separately.
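
The five functions named above can be chained in a single pipeline. This is only a sketch on made-up rows, not the contents of m1worksheet.qmd; PricePerSqFt is a hypothetical derived column and the neighborhood names are placeholders.

```r
library(tidyverse)

# Toy stand-in for the Ames data (values are made up for illustration)
df <- tibble(
  Neighborhood = c("A", "A", "B", "B", "B", "C"),
  LotArea      = c(8450, 9600, 11250, 9550, 14260, 14115),
  SalePrice    = c(208500, 181500, 223500, 140000, 250000, 143000)
)

by_nbhd <- df |>
  select(Neighborhood, LotArea, SalePrice) |>    # keep the columns of interest
  filter(LotArea < 100000) |>                    # drop extreme lots
  mutate(PricePerSqFt = SalePrice / LotArea) |>  # hypothetical derived column
  group_by(Neighborhood) |>
  summarize(
    n           = n(),
    medianPrice = median(SalePrice),
    meanPerSqFt = mean(PricePerSqFt)
  ) |>
  arrange(desc(medianPrice))

by_nbhd
```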

Milestone 2: description (visualization)

This assignment is to describe the data using pictures, such as barplots, mosaic plots, histograms, and more.

As with the previous milestone, your findings must be organized into a coherent report. Follow the points to remember and suggested steps from Milestone 1 again. Do include the Numerical Description section and the headings for the following sections in this iteration of the report.

Here are the points I will look for in scoring this milestone:

  • I want you to make at least 5 different plots
  • refer to the ggplot cheatsheet and think about the data types (continuous vs discrete)
  • refer to the R Graph Gallery for examples with similar data to yours (e.g., ridgeline plots are covered here but not on the ggplot cheatsheet)
  • use appropriate plots for variable types
  • don’t just do the same plot description with different variables, over and over
  • scatterplots are really informative if you have numeric variables on both axes, like year and price
  • use log or exponential scales if relationships appear to be curvilinear
  • use labels, legends, captions, titles, and subtitles
  • don’t use scientific notation for axes
  • try to make plots fit on the screen—that can be challenging when you use faceting
  • especially use faceting when you have two continuous variables, like price and year, and a discrete variable, like neighborhood
  • feel free to use faceting for variables like, say, the six most expensive neighborhoods, cutting down on scrolling
  • sort bars by height when you make bar plots
  • don’t use pie charts except in special cases (four to seven slices with obvious differences in slice size)
  • don’t make stacked bar charts out of percentages (all bars are the same height)
  • include an intro and summary
  • overlay box plots or violin plots with geom_jitter() (and use alpha channel, which is opacity)
  • avoid default colors—use colorbrewer (discrete) or viridis (continuous)
  • use color for meaning rather than decoration
  • don’t let labels overlap each other—put them on a diagonal if they are long
  • don’t create bar plots with two bars because people can compare two numbers in their minds
  • put commas in long numbers
  • use layered plots, e.g., put geom_smooth(method=lm) over scatterplots to see the relationship
  • word clouds are not too useful unless the words are of similar length
  • reject plots that turn out not to show anything important, e.g., suppose a plot shows that the category being plotted is irrelevant
  • it can be hard to reject a plot you’ve put time into but do it anyway
  • imagine a person trying to read meaning into your plot
  • make your plots either reinforce or complement what you did in milestone 1 (or both!)
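
Several of the points above can be combined in a couple of plots. The following sketch uses simulated data with placeholder neighborhood names, so the relationships it shows are artificial. The palette, titles, and label angle are just one set of reasonable choices.

```r
library(tidyverse)

set.seed(1)  # simulated data so the sketch runs on its own
df <- tibble(
  Neighborhood = rep(c("A", "B", "C"), each = 30),
  LotArea      = runif(90, 5000, 15000),
  SalePrice    = 100000 + 10 * LotArea + rnorm(90, sd = 20000)
)

# Layered scatterplot: points plus a linear fit, commas on the axes,
# a ColorBrewer palette, and full labeling
p1 <- ggplot(df, aes(x = LotArea, y = SalePrice, color = Neighborhood)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = lm, se = FALSE) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    title    = "Sale price by lot area",
    subtitle = "Simulated data for illustration",
    x        = "Lot area (square feet)",
    y        = "Sale price (dollars)",
    caption  = "Lines are least-squares fits per neighborhood"
  )
p1

# Box plots ordered by median price, overlaid with jittered points
p2 <- ggplot(df, aes(x = reorder(Neighborhood, SalePrice, FUN = median),
                     y = SalePrice)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4) +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Neighborhood", y = "Sale price (dollars)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2
```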

Milestone 3: regression

This assignment is to conduct a regression analysis using SalePrice as the response. The response variable is also known as y, the target variable, or the output variable. It is up to you to decide which explanatory variables to use. Explanatory variables are also known as x, features, or input variables. This assignment will be graded in part on your choice of explanatory variables and the quality of your explanations of the regression output.

You are responsible for completing the following steps.

  1. Create two linear models through trial and error. Explain them.
  2. Create a model through the ols_step_best_subset() function of the olsrr package. Explain it.
  3. Create a fourth model based on, but not identical to, the model created above. The reason that it is not identical is that you will probably find that the model created in step 2 contains only some values of the categorical variables. This fourth model will be manually specified, not determined by the olsrr approach. This fourth model is the model you will use for milestone 4. Explain it.
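
The steps above can be sketched on simulated data as follows. HouseAge and Noise are hypothetical columns invented for this illustration, the coefficients are artificial, and the olsrr call is skipped if the package is not installed.

```r
set.seed(306)  # simulated data so the sketch runs on its own
n <- 200
df <- data.frame(
  LotArea  = runif(n, 5000, 15000),
  HouseAge = runif(n, 0, 60),   # hypothetical column
  Noise    = rnorm(n),          # hypothetical column unrelated to price
  LotShape = factor(sample(c("Reg", "IR1", "IR2"), n, replace = TRUE))
)
df$SalePrice <- 120000 + 8 * df$LotArea - 900 * df$HouseAge +
  rnorm(n, sd = 15000)

# Step 1: two models built by trial and error
m1 <- lm(SalePrice ~ LotArea, data = df)
m2 <- lm(SalePrice ~ LotArea + HouseAge, data = df)
summary(m2)
anova(m1, m2)  # does adding HouseAge improve the fit?

# Step 2: best-subset search over candidate predictors
if (requireNamespace("olsrr", quietly = TRUE)) {
  full <- lm(SalePrice ~ LotArea + HouseAge + Noise, data = df)
  print(olsrr::ols_step_best_subset(full))
}

# Step 3: a manually specified model informed by the search;
# a categorical variable enters as a whole factor
m4 <- lm(SalePrice ~ LotArea + HouseAge + LotShape, data = df)
summary(m4)
```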

You must come up with an overall summary that summarizes your reasoning for the path you took. You should include some visualizations or refer to the visualizations from milestone 2 to help explain why you chose some variables and not others. Keep in mind that your .qmd file should include your milestone 1 (possibly altered) and milestone 2 (possibly altered), so that the report builds up on what you now know about the data.

Milestone 4: regression diagnostics

This assignment is to assess the regression conducted in the previous assignment, using diagnostic plots and statistics. This assignment will be graded on the basis of your explanations of the resulting diagnostic plots and statistics.
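
A minimal sketch of diagnostic plots and statistics, using base R only and simulated data, might look like the following. HouseAge is a hypothetical column, and the 4/n cutoff for Cook's distance is a common rule of thumb rather than a requirement of this assignment.

```r
set.seed(306)  # simulated data so the sketch runs on its own
n <- 200
df <- data.frame(
  LotArea  = runif(n, 5000, 15000),
  HouseAge = runif(n, 0, 60)   # hypothetical column
)
df$SalePrice <- 120000 + 8 * df$LotArea - 900 * df$HouseAge +
  rnorm(n, sd = 15000)

fit <- lm(SalePrice ~ LotArea + HouseAge, data = df)

# The four standard diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Numeric companions to the plots
shapiro.test(residuals(fit))  # normality of the residuals
cd <- cooks.distance(fit)     # influence of each observation
which(cd > 4 / n)             # rows exceeding the rule-of-thumb cutoff
```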

You will also summarize all your findings in a Conclusion section and clean up the report further so that a manager can easily digest it. You will add echo=FALSE to your chunk options at the beginning of the file, so that it looks like the following. Be sure to add the comma at the end of the line before echo=FALSE!

```{r}
#| label: optionsSetup
#| include: false
knitr::opts_chunk$set(
  message=FALSE,
  echo=FALSE
)
```

Notice that the R code chunks will now not appear. It doesn’t matter because a manager may not be able to read them anyway. Instead, you need to include more verbiage to explain what is going on to the manager. For example, in the Obtaining Data section, nothing now appears. You have to decide what to put there about the source of the data.

Weekly Work, 40 points

The first nine weekly exercises are worth two points each, the next four are worth four points each, and the last one is worth six points. We’ll start those in class but you may have to finish on your own. It is fine to work with a partner but, if you collaborate, both names must appear as authors on both submissions or it will be regarded as cheating. Every student will turn in their own copy of any collaborative work.

You must substantially complete the exercises to receive any credit. Even incorrect answers are okay as long as you make a serious attempt at solving each problem.