Overview

Mick McQuaid

1/10/23

Week ZERO

- deterministic process: If you repeat it, you get the same result, e.g., add two numbers together
- stochastic process: If you repeat it, you may get different results, e.g., generate a random normally distributed dataset

- probability: originated with gambling, trying to guess a single outcome from a known distribution
- stats: “originated” mostly with farming, trying to infer general knowledge from small samples of crops

- The frequentist approach deals with long-run probabilities (ie, how probable is this data set given the null hypothesis), whereas the Bayesian approach deals with the probability of a hypothesis given a particular data set.
- Bayesian analysis incorporates prior information into the analysis, whereas a frequentist analysis is purely driven by the data.
- The Bayesian approach can calculate the probability that a particular hypothesis is true, whereas the frequentist approach calculates the probability of obtaining another data set at least as extreme as the one collected (giving the \(p\) value).

- description: find a number or numbers or picture that summarizes data
- inference: understand the machinery of cause and effect
- prediction: guess a future event based on a sample of past events

- statisticians: write packages in R
- stats users: write code to use existing packages in R

- statisticians: use old-fashioned language
- data scientists: use fashionable language to mean the same thing

- general data: textbook uses examples from non-IT disciplines
- discipline-specific data: the last two lectures will cover data specific to data scientists and ux people

Scientists have found that most publishable results are not reproducible. Therefore, they have started to use tools like Quarto to make research reproducible. We’ll use Quarto to make our results reproducible.

- You can’t take the final exam with 4GB RAM
- 16GB Ram is preferred—you’ll struggle with 8GB
- You will need at least 12GB free drive space to take the final
- Mac is preferred if you don’t already have a computer
- If you already have a Windows computer, you may need some tech support which you’ll need to find outside of class because I can’t solve most Windows-specific problems

You need R to be installed before you try to install RStudio. You may prefer to use Python but you can not turn in any Python code or Jupyter notebooks in this class. There are numerous reasons for this which can be discussed outside of class.

All homework assignments will be Quarto documents (qmd extension on filename) described at https://quarto.org

Do not attempt to use any microsoft products in this course. E.g., no reports will be in word and no calculations will be in excel. I’m not joking. I’ll just give you a zero if you turn in that stuff.

- Group work: analyze the
*craiglist*data - Individual work: demonstrate understanding on the
*penguins*dataset

It will be a take-home exam where I’ll give you a large (approximately 4GB) dataset and you have to analyze it and turn in a .qmd file and a .pdf file summarizing your work.

- 4×10 milestones as group: description 1 (tables, summary stats), description 2 (visualization), regression, diagnostics
- 3×10 individual assignments: description, simple regression, multiple regression
- 1×30 take-home exam: analyze nyc311 dataset (subject to change)

Diez, David, Mine Çetinkaya-Rundel, and Cristopher D Barr. 2019. *OpenIntro Statistics, Fourth Edition*. self-published. https://openintro.org/os.

Fornacon-Wood, Isabella, Hitesh Mistry, Corinne Johnson-Hart, Corinne Faivre-Finn, James P. B. OConnor, and Gareth J. Price. 2022. “Understanding the Differences Between Bayesian and Frequentist Statistics.” *International Journal of Radiation OncologyBiologyPhysics* 112 (5): 1076–82. https://doi.org/10.1016/j.ijrobp.2021.12.011.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. 2nd ed. O’Reilly Media, Inc. https://r4ds.hadley.nz/.

END

This slideshow was produced using `quarto`

Fonts are *Roboto Condensed* and *Stix*