Regression Diagnostics

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
m <- lm(price ~ ., data = diamonds)
summary(m)

Call:
lm(formula = price ~ ., data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-21376.0   -592.4   -183.5    376.4  10694.2 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5753.762    396.630   14.507  < 2e-16 ***
carat       11256.978     48.628  231.494  < 2e-16 ***
cut.L         584.457     22.478   26.001  < 2e-16 ***
cut.Q        -301.908     17.994  -16.778  < 2e-16 ***
cut.C         148.035     15.483    9.561  < 2e-16 ***
cut^4         -20.794     12.377   -1.680  0.09294 .  
color.L     -1952.160     17.342 -112.570  < 2e-16 ***
color.Q      -672.054     15.777  -42.597  < 2e-16 ***
color.C      -165.283     14.725  -11.225  < 2e-16 ***
color^4        38.195     13.527    2.824  0.00475 ** 
color^5       -95.793     12.776   -7.498 6.59e-14 ***
color^6       -48.466     11.614   -4.173 3.01e-05 ***
clarity.L    4097.431     30.259  135.414  < 2e-16 ***
clarity.Q   -1925.004     28.227  -68.197  < 2e-16 ***
clarity.C     982.205     24.152   40.668  < 2e-16 ***
clarity^4    -364.918     19.285  -18.922  < 2e-16 ***
clarity^5     233.563     15.752   14.828  < 2e-16 ***
clarity^6       6.883     13.715    0.502  0.61575    
clarity^7      90.640     12.103    7.489 7.06e-14 ***
depth         -63.806      4.535  -14.071  < 2e-16 ***
table         -26.474      2.912   -9.092  < 2e-16 ***
x           -1008.261     32.898  -30.648  < 2e-16 ***
y               9.609     19.333    0.497  0.61918    
z             -50.119     33.486   -1.497  0.13448    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared:  0.9198,    Adjusted R-squared:  0.9198 
F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16
load(paste0(Sys.getenv("STATS_DATA_DIR"), "/mariokart.rda"))
m <- lm(total_pr ~ cond + stock_photo + duration + wheels, data = mariokart)
summary(m)

Call:
lm(formula = total_pr ~ cond + stock_photo + duration + wheels, 
    data = mariokart)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.485  -6.511  -2.530   1.836 263.025 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     43.5201     8.3701   5.199 7.05e-07 ***
condused        -2.5816     5.2272  -0.494 0.622183    
stock_photoyes  -6.7542     5.1729  -1.306 0.193836    
duration         0.3788     0.9388   0.403 0.687206    
wheels           9.9476     2.7184   3.659 0.000359 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.4 on 138 degrees of freedom
Multiple R-squared:  0.1235,    Adjusted R-squared:  0.09808 
F-statistic:  4.86 on 4 and 138 DF,  p-value: 0.001069
plot(m)
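Called on an lm object, plot() steps through its default diagnostic plots one at a time. If you would rather see all four on one page, a minimal base-graphics sketch:

# Draw the four default diagnostic plots in a 2 x 2 grid,
# then restore the previous graphics settings
op <- par(mfrow = c(2, 2))
plot(m)
par(op)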

Residuals vs Fitted

This plot shows the residuals plotted against the fitted values. There are three things to watch for here. First, are there any drastic outliers? Yes, there are two, points 65 and 20. (Those are row numbers in the data frame.) You need to investigate them and decide whether to omit them from further analysis. Were they typos? Mismeasurements? Or is the process from which they derive intrinsically subject to occasional extreme variation? In the third case, you probably don’t want to omit them.
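To look at the flagged observations directly, one option (a minimal sketch; the row numbers come from the plot labels):

# Inspect the two flagged observations by row number
mariokart[c(20, 65), ]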

Second, is the solid red line near the dashed zero line? Yes, it is, indicating that the residuals have a mean of approximately zero. (The red line is a smoothed local mean of the residuals across the range of fitted values.)

Third, is there a pattern to the residuals? No, there is not. The residuals have roughly the same spread at one end of the range of fitted values as at the other. Patterns that would call for action include a curve or multiple curves, or a widening or narrowing band, like the cross section of a horn.

plot(m,which=c(1))
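If you want to build this plot yourself rather than rely on plot.lm, a rough ggplot2 equivalent (diag_df is just a name chosen here for illustration):

# Residuals against fitted values, with a smoother standing in for
# the red line and a dashed reference line at zero
diag_df <- data.frame(fitted = fitted(m), resid = residuals(m))
ggplot(diag_df, aes(fitted, resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed")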

Normal Q-Q

This is an important plot. I see many students erroneously claiming that residuals are normally distributed because a histogram of them has a vague bell shape. That is not sufficient evidence of normality. The Q-Q plot is the standard way to check it. If the points lie along the dashed line, you can be reasonably safe in assuming normality. If they deviate noticeably from the dashed line, the residuals are probably not normally distributed.

plot(m,which=c(2))
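A formal test can supplement the visual check. For a data set of this size, one option is a Shapiro-Wilk test on the residuals (a sketch, not a replacement for the plot):

# Shapiro-Wilk test of normality on the residuals;
# a small p-value is evidence against normality
shapiro.test(residuals(m))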

Scale-Location

Look for two things here. First, the red line should be approximately horizontal, meaning that the average magnitude of the standardized residuals does not change much across the fitted values. Second, look at the spread of the points around the red line. If they show no clear pattern, this reinforces the assumption of homoscedasticity for which we already found evidence in the first plot.

plot(m,which=c(3))
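If you want a numeric check of constant variance to go with this plot, one possibility is the Breusch-Pagan test from the lmtest package (an extra dependency, not loaded above):

# Breusch-Pagan test for heteroscedasticity;
# a small p-value is evidence against constant variance
pacman::p_load(lmtest)
bptest(m)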

plot(m,which=c(4))

Residuals vs Leverage

This plot shows influential points that you may want to remove. Point 84 has high leverage (potential for influence) but is probably not very influential in practice, because it lies well inside the dashed Cook’s distance contour. Points 20 and 65 are outliers, but only point 20 falls outside the Cook’s distance contour. In this case, you would likely remove point 20 from consideration unless there were a mitigating reason. For example, game collectors often pay extra for a game with unusual attributes, such as a shrink-wrapped original edition. As an example of a point you would definitely remove, imagine a horizontal line drawn from point 20 and a vertical line drawn from point 84. Where they meet would be a high-leverage outlier that unduly affects the model no matter what its underlying cause. On the other hand, what if you have many such points? Unfortunately, that probably means the model isn’t very good.

plot(m,which=c(5))
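You can also compute Cook’s distance directly and flag observations that exceed a rough cutoff; the 4/n threshold below is one common rule of thumb, not the only defensible choice:

# Cook's distance for each observation, flagging those above 4/n
cd <- cooks.distance(m)
which(cd > 4 / length(cd))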

plot(m,which=c(6))

#.  If you render with the following,
#.  you get an error message saying
#.  "'which' must be in 1:6"
#. plot(m,which=c(7))

Variable Selection

pacman::p_load(olsrr)
m <- lm(price ~ ., data = diamonds)
(best <- ols_step_best_subset(m))
                Best Subsets Regression                 
--------------------------------------------------------
Model Index    Predictors
--------------------------------------------------------
     1         carat                                     
     2         carat clarity                             
     3         carat color clarity                       
     4         carat color clarity x                     
     5         carat cut color clarity x                 
     6         carat cut color clarity depth x           
     7         carat cut color clarity depth table x     
     8         carat cut color clarity depth table x z   
     9         carat cut color clarity depth table x y z 
--------------------------------------------------------

                                                                  Subsets Regression Summary                                                                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------
                       Adj.        Pred                                                                                                                        
Model    R-Square    R-Square    R-Square       C(p)           AIC           SBIC            SBC              MSEP               FPE           HSP       APC  
--------------------------------------------------------------------------------------------------------------------------------------------------------------
  1        0.8493      0.8493      0.8493    47343.7252    945466.5323    792389.1367    945493.2192    129350491485.6166    2398132.8805    44.4601    0.1507 
  2        0.8949      0.8948      0.8948    16756.0962    926075.5885    772986.9665    926164.5448     90268965134.5993    1673786.2236    31.0311    0.1052 
  3        0.9140      0.9139      0.9139     3925.0856    915270.4554    762172.8186    915412.7855     73867441333.4421    1369818.0969    25.3957    0.0861 
  4        0.9171      0.9171       0.917     1786.5467    913238.1961    760140.7902    913389.4218     71134846525.0090    1319168.5676    24.4567    0.0829 
  5        0.9194      0.9194      0.9192      292.5281    911771.5199    758668.3749    911958.3280     69217703948.1984    1283711.1090    23.7993    0.0806 
  6        0.9197      0.9196      0.9195      102.9375    911582.4842    758479.3794    911778.1880     68974272750.4518    1279220.1538    23.7161    0.0804 
  7        0.9198      0.9198      0.9196       22.3316    911501.9083    758398.8257    911706.5077     68870038819.1149    1277310.6785    23.6807    0.0802 
  8        0.9198      0.9198      0.9196       22.2470    911501.8229    758398.7418    911715.3179     68868653231.6045    1277308.6624    23.6806    0.0802 
  9        0.9198      0.9198      0.9196       24.0000    911503.5757    758400.4957    911725.9664     68869614710.3862    1277350.1772    23.6814    0.0802 
--------------------------------------------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria 
 SBIC: Sawa's Bayesian Information Criteria 
 SBC: Schwarz Bayesian Criteria 
 MSEP: Estimated error of prediction, assuming multivariate normality 
 FPE: Final Prediction Error 
 HSP: Hocking's Sp 
 APC: Amemiya Prediction Criteria 

Look for elbows in the plots of the metrics to find the best model.

plot(best)

My impression is that the best model has three variables: carat, color, and clarity. This is based entirely on diminishing returns of adding more variables.
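To proceed with that choice, you would refit the model using only those three predictors (m3 is just a name chosen here):

# Refit the diamonds model with the three selected predictors
m3 <- lm(price ~ carat + color + clarity, data = diamonds)
summary(m3)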