Week One: Foundations of Statistics

Welcome!

In this coursework, you’ll get a refresher on the foundational components of statistics and data.

0.1 Learning Outcomes

By the end of this coursework you should be able to:

  • describe observations, variables, and data matrices

  • explain the different types of variables a study can have

  • describe the observations (rows) in a dataset

  • describe the variables (columns) in a dataset

  • illustrate the difference between explanatory and response variables

1 Prepare

Let’s start off by reading a refresher on some of the foundational concepts for data!

1.1 Textbook Reading

📖 Required Reading: Introduction to Modern Statistics – Hello Data

Reading Guide – Due Wednesday by noon

Download the Word Document

Submission (Due Wednesday, April 2 by the start of class)

Submit your completed reading guide to the Canvas assignment portal!

1.2 Concept Quiz – Due Wednesday by the start of class

The data frame below contains information on runners in the 2017 Cherry Blossom Run, which is an annual road race that takes place in Washington, DC. Most runners participate in a 10-mile run while a smaller fraction take part in a 5k run or walk.

run17_to_print <- run17 |>
  rownames_to_column() |>
  rename_with(str_to_title) |>
  rename(
    ` ` = Rowname,
    Net = Net_sec,
    Clock = Clock_sec,
    Pace = Pace_sec,
    `City / Country` = City
    )

run17_1_to_5 <- run17_to_print |>
  slice(1:5) |>
  mutate(across(.cols = everything(), .fns = as.character))

run17_n <- run17_to_print |>
  slice(nrow(run17)) |>
  mutate(across(.cols = everything(), .fns = as.character))

run17_filler <- run17_to_print |>
  slice(1) |>
  mutate(across(.cols = everything(), 
                .fns = ~str_replace(.x, pattern = ".*", replacement = "...")
                )
         )

run17_1_to_5 |>
  bind_rows(run17_filler) |>
  bind_rows(run17_n) |>
  kbl(linesep = "", booktabs = TRUE, align = "llllclcccl") |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = "HOLD_position"
    ) |>
  add_header_above(c(" " = 6, "Time" = 2, " " = 2))
Time
Bib Name Sex Age City / Country Net Clock Pace Event
1 6 Hiwot G. F 21 Ethiopia 3217 3217 321 10 Mile
2 22 Buze D. F 22 Ethiopia 3232 3232 323 10 Mile
3 16 Gladys K. F 31 Kenya 3276 3276 327 10 Mile
4 4 Mamitu D. F 33 Ethiopia 3285 3285 328 10 Mile
5 20 Karolina N. F 35 Poland 3288 3288 328 10 Mile
... ... ... ... ... ... ... ... ... ...
19961 25153 Andres E. M 33 Woodbridge, VA 5287 5334 1700 5K

1. How many observations and how many variables are included in these data?

2. What are the different types of variables that can appear in a dataset? Select all that apply!

  • discrete numerical
  • ordinal categorical
  • nominal categorical
  • continuous numerical
  • discrete categorical

3. Just because a variable has numeric values, does not mean it is a numeric variable. Which of the following variables appear numerical but behave like a categorical variable? Select all that apply!

  • zip code
  • GPA
  • height
  • year in school

Data were collected on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in the Palmer Archipelago, Antarctica. In addition to which island each penguin lives on, the data contains information on the species of the penguin (Adelie, Chinstrap, or Gentoo), its bill length, bill depth, and flipper length (measured in millimeters), its body mass (measured in grams), and the sex of the penguin (female or male).

4. There are ___ observations (cases) in these data and ___ variables.

5. Which of the following variables are numerical?

  • island
  • species
  • bill depth
  • body mass
  • bill length
  • flipper length
  • sex

6. There are ____ species of penguins included in these data: ____, ____, and ____.

2 R Tutorial – Due Wednesday by the start of class

💻 Required Tutorial: Language of Data

Submission

Submit a screenshot of the completion page (the last page of the tutorial) to the Canvas assignment portal!