Statistics for Informatics:
Overview

Mick McQuaid

1/10/23

Week ZERO

Pre-Intro

Two things: deterministic and stochastic processes

  • deterministic process: If you repeat it, you get the same result, e.g., add two numbers together
  • stochastic process: If you repeat it, you may get different results, e.g., generate a random normally distributed dataset

Two things: probability and stats

  • probability: originated with gambling, trying to guess a single outcome from a known distribution
  • stats: “originated” mostly with farming, trying to infer general knowledge from small samples of crops

Two things: frequentists and bayesians

  • The frequentist approach deals with long-run probabilities (ie, how probable is this data set given the null hypothesis), whereas the Bayesian approach deals with the probability of a hypothesis given a particular data set.
  • Bayesian analysis incorporates prior information into the analysis, whereas a frequentist analysis is purely driven by the data.
  • The Bayesian approach can calculate the probability that a particular hypothesis is true, whereas the frequentist approach calculates the probability of obtaining another data set at least as extreme as the one collected (giving the \(p\) value).

Two things: description and inference (okay, three: also prediction)

  • description: find a number or numbers or picture that summarizes data
  • inference: understand the machinery of cause and effect
  • prediction: guess a future event based on a sample of past events

Two things: statisticians and stats users

  • statisticians: write packages in R
  • stats users: write code to use existing packages in R

Two things: statisticians and data scientists

  • statisticians: use old-fashioned language
  • data scientists: use fashionable language to mean the same thing

Two things: general data and discipline-specific data

  • general data: textbook uses examples from non-IT disciplines
  • discipline-specific data: the last two lectures will cover data specific to data scientists and ux people

The reproducibility crisis and what we’re going to do about it

Scientists have found that most publishable results are not reproducible. Therefore, they have started to use tools like Quarto to make research reproducible. We’ll use Quarto to make our results reproducible.

Skill aspect of the course

Check your computer

  • You can’t take the final exam with 4GB RAM
  • 16GB Ram is preferred—you’ll struggle with 8GB
  • You will need at least 12GB free drive space to take the final
  • Mac is preferred if you don’t already have a computer
  • If you already have a Windows computer, you may need some tech support which you’ll need to find outside of class because I can’t solve most Windows-specific problems

Download R and RStudio in that order

You need R to be installed before you try to install RStudio. You may prefer to use Python but you can not turn in any Python code or Jupyter notebooks in this class. There are numerous reasons for this which can be discussed outside of class.

Your part

HW Format

All homework assignments will be Quarto documents (qmd extension on filename) described at https://quarto.org

Do not attempt to use any microsoft products in this course. E.g., no reports will be in word and no calculations will be in excel. I’m not joking. I’ll just give you a zero if you turn in that stuff.

Two kinds of homework: group and individual

  • Group work: analyze the craiglist data
  • Individual work: demonstrate understanding on the penguins dataset

One individual exam

It will be a take-home exam where I’ll give you a large (approximately 4GB) dataset and you have to analyze it and turn in a .qmd file and a .pdf file summarizing your work.

Grading

  • 4×10 milestones as group: description 1 (tables, summary stats), description 2 (visualization), regression, diagnostics
  • 3×10 individual assignments: description, simple regression, multiple regression
  • 1×30 take-home exam: analyze nyc311 dataset (subject to change)

Texts

  • Wickham, Çetinkaya-Rundel, and Grolemund (2023)
  • Diez, Çetinkaya-Rundel, and Barr (2019)

References

Diez, David, Mine Çetinkaya-Rundel, and Cristopher D Barr. 2019. OpenIntro Statistics, Fourth Edition. self-published. https://openintro.org/os.
Fornacon-Wood, Isabella, Hitesh Mistry, Corinne Johnson-Hart, Corinne Faivre-Finn, James P. B. OConnor, and Gareth J. Price. 2022. “Understanding the Differences Between Bayesian and Frequentist Statistics.” International Journal of Radiation OncologyBiologyPhysics 112 (5): 1076–82. https://doi.org/10.1016/j.ijrobp.2021.12.011.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media, Inc. https://r4ds.hadley.nz/.

END

Colophon

This slideshow was produced using quarto

Fonts are Roboto Condensed and Stix