Simulating Data in R

Same Seating Chart!

Tuesday, November 19

Today we will…

  • Plan for Week 9 & 10
  • New Material
    • Statistical Distributions
    • Simulating Data
  • PA 9: Instrument Con

Week 9

  • PA 9 (today)
  • Lab 9 & Challenge 9 (last ones!)
  • Revisions on Lab 7 (due Friday)
  • Code Review Lab 8 (due Sunday)

Week 10

  • Final Portfolio Week!

Warning

No revisions will be accepted on Lab 8 or Lab 9. You can, however, talk with me during class about any revisions you’ve made. :)

Statistical Distributions

Recall from your statistics classes…

A random variable is a value we don’t know until we take a sample.

  • Coin flip: could be heads (0) or tails (1)
  • Person’s height: could be anything from 0 feet to 10 feet.
  • Annual income of a US worker: could be anything from $0 to $1.6 billion

The distribution of a random variable tells us its possible values and how likely they are to occur.

  • Coin flip: 50% chance of heads and 50% chance of tails.
  • Heights follow a bell curve centered at 5 feet 7 inches (67 inches).
  • Most American workers make under $100,000.
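As a rough illustration of what a distribution describes, the sketch below simulates many fair coin flips and tabulates how often each outcome occurs. The sample size and the seed are arbitrary choices made for this example, not values from the course.

```r
# Simulate 10,000 fair coin flips (sample size chosen arbitrarily for illustration).
set.seed(1)  # arbitrary seed so the sketch gives the same result each run
flips <- sample(c("heads", "tails"), size = 10000, replace = TRUE)

# Observed proportion of each outcome; both should land close to 0.5.
flip_props <- prop.table(table(flips))
flip_props
```

The observed proportions approximate the distribution of the coin flip: two possible values, each about equally likely.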

Statistical Distributions with Names!

Uniform Distribution

  • When you know the range of values, but not much else.
  • All values in the range are equally likely to occur.

Normal Distribution

  • When you expect values to fall near the center.
  • Frequency of values follows a bell shaped curve.

t-Distribution

  • A bell curve with slightly heavier tails than the Normal distribution.
  • Used in the same contexts as the Normal distribution, but more common with real data, where the standard deviation is unknown and must be estimated.

Chi-Square Distribution

  • Somewhat skewed, and only allows values above zero.
  • Used in testing count data.

Binomial Distribution

  • Appears when you have two possible outcomes, and you are counting how many times each outcome occurred.
  • This is a discrete distribution, as there can only be whole number values!

Distribution Functions in R

r is for random sampling.

  • Generate random values from a distribution.
  • We use this to simulate data (create pretend observations).
runif(n = 3, min = 10, max = 20)
[1] 18.15007 14.11976 17.97457
rnorm(n = 3)
[1] -1.1870619  1.7135969 -0.7684311
rnorm(n = 3, mean = -100, sd = 50)
[1] -170.4145 -143.8366  -77.0706
rt(n = 3, df = 11)
[1] 0.7902426 0.6496226 0.6371634
rbinom(n = 3, size = 10, prob = 0.7)
[1] 9 7 6
rchisq(n = 3, df = 11)
[1] 14.514977  5.529515 13.450288

p is for probability.

  • Compute the probability of observing a value less than (or equal to) x.
  • We use this for calculating p-values.
pnorm(q = 1.5)
[1] 0.9331928
pnorm(q = 70, mean = 67, sd = 3)
[1] 0.8413447
1 - pnorm(q = 70, mean = 67, sd = 3)
[1] 0.1586553
pnorm(q = 70, mean = 67, sd = 3, lower.tail = FALSE)
[1] 0.1586553
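One way to see what pnorm() reports is to compare it with a simulated proportion. The sketch below reuses the mean of 67 and standard deviation of 3 from the examples above; the sample size and seed are arbitrary assumptions for illustration.

```r
# Draw many simulated heights from a Normal(67, 3) distribution.
set.seed(2024)  # arbitrary seed for reproducibility
heights <- rnorm(100000, mean = 67, sd = 3)

# Proportion of simulated heights below 70 versus the exact probability.
simulated <- mean(heights < 70)
exact <- pnorm(q = 70, mean = 67, sd = 3)

c(simulated = simulated, exact = exact)
```

The two numbers should agree to about two decimal places, and the agreement improves as the sample size grows.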

q is for quantile.

  • Given a probability \(p\), compute \(x\) such that \(P(X < x) = p\).
  • The q functions are the inverses of the p functions.
qnorm(p = 0.95)
[1] 1.644854
qnorm(p = 0.95, mean = 67, sd = 3)
[1] 71.93456
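A quick check of the relationship between the q and p functions, using the same mean and standard deviation as above: feeding the output of qnorm() into pnorm() should return the original probability. This is a small sketch, not part of the original slides.

```r
p <- 0.95

# Quantile: the height below which 95% of the distribution falls.
x <- qnorm(p = p, mean = 67, sd = 3)

# Round trip: the probability of falling below that quantile is p again.
round_trip <- pnorm(q = x, mean = 67, sd = 3)
all.equal(round_trip, p)

# The same quantile can be built from the standard Normal quantile: mean + z * sd.
z <- qnorm(p)
all.equal(x, 67 + z * 3)
```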

d is for density.

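  • Compute the height of a distribution curve at a given \(x\).
  • For discrete distributions: the probability of getting exactly \(x\).
  • For continuous distributions: the height is not itself a probability (the chance of any single exact value is zero), so it is rarely interpreted directly.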

What is the probability of exactly 12 heads in 20 coin tosses, if each toss has a 70% chance of tails (so a 30% chance of heads)?

dbinom(x = 12, size = 20, prob = 0.3)
[1] 0.003859282
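To connect dbinom() to the binomial formula, the sketch below computes the same probability by hand and also shows that summing densities over 0 through 12 reproduces the cumulative probability from pbinom(). The variable names are made up for this example.

```r
# Binomial formula: choose(n, x) * p^x * (1 - p)^(n - x)
by_hand <- choose(20, 12) * 0.3^12 * 0.7^8
from_r  <- dbinom(x = 12, size = 20, prob = 0.3)
all.equal(by_hand, from_r)

# Adding up the probabilities of 0, 1, ..., 12 heads gives P(X <= 12).
at_most_12 <- sum(dbinom(x = 0:12, size = 20, prob = 0.3))
all.equal(at_most_12, pbinom(q = 12, size = 20, prob = 0.3))
```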

Simulating Data

Simulate a Dataset

We can generate fake data based on the assumption that a variable follows a certain distribution.

  • We randomly sample observations from the distribution.
age <- runif(1000, min = 15, max = 75)

Since there is randomness involved, we will get a different result each time we run the code.

runif(3, min = 15, max = 75)
[1] 51.12011 48.62721 73.48332
runif(3, min = 15, max = 75)
[1] 48.79667 72.64916 29.31348


To make a reproducible random sample, we first set the seed:

set.seed(93401)
runif(3, min = 15, max = 75)
[1] 20.84739 51.61768 42.68515
set.seed(93401)
runif(3, min = 15, max = 75)
[1] 20.84739 51.61768 42.68515
library(tidyverse)  # provides tibble(), mutate(), and ggplot()
set.seed(435)

fake_data <- tibble(names   = charlatan::ch_name(1000),
                    height  = rnorm(1000, mean = 67, sd = 3),
                    age     = runif(1000, min = 15, max = 75),
                    measure = rbinom(1000, size = 1, prob = 0.6)
                    ) |> 
  mutate(supports_measure_A = ifelse(measure == 1, "yes", "no"))

head(fake_data)
# A tibble: 6 × 5
  names                 height   age measure supports_measure_A
  <chr>                  <dbl> <dbl>   <int> <chr>             
1 Elbridge Kautzer        67.4  66.3       1 yes               
2 Brandon King            65.0  61.5       0 no                
3 Phyllis Thompson        68.1  53.8       1 yes               
4 Humberto Corwin         67.5  33.9       1 yes               
5 Theresia Koelpin        71.4  16.1       1 yes               
6 Hayden O'Reilly-Johns   66.2  37.0       0 no                

Check that the ages look uniformly distributed.

fake_data |> 
  ggplot(mapping = aes(x = age,
                       fill = supports_measure_A)) +
  geom_histogram(show.legend = FALSE) +
  facet_wrap(~ supports_measure_A,
             ncol = 1) +
  scale_fill_brewer(palette = "Paired") +
  theme_bw() +
  labs(x = "Age (years)",
       y = "",
       subtitle = "Number of Individuals Supporting Measure A for Different Ages")
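A numeric companion to the histogram: if ages are uniform on 15 to 75, each 10-year bin should hold roughly one sixth of the observations. To stay self-contained, this sketch simulates its own age vector in base R rather than reusing fake_data, so the exact counts will not match the table above.

```r
set.seed(435)
age <- runif(1000, min = 15, max = 75)

# Split the range into six 10-year bins and count observations in each.
age_bins <- cut(age, breaks = seq(15, 75, by = 10))
bin_counts <- table(age_bins)
bin_counts

# Under a uniform distribution we expect about 1000 / 6 observations per bin.
expected_per_bin <- length(age) / nlevels(age_bins)
expected_per_bin
```

Counts that hover around the expected value, with no bin dramatically higher or lower, are consistent with a uniform distribution.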

PA 9: Instrument Con

Is the instrument salesman selling fake instruments?

PA 9

In this practice activity you and your partner will write a function to simulate the weight of various band instruments, with the goal of identifying whether a particular shipment of instruments has a “reasonable” weight.

This activity will require knowledge of:

  • named distributions
  • probability calculations related to distributions
  • function documentation
  • function syntax
  • function arguments
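As a warm-up for these skills, here is a minimal sketch of a documented simulation function on a deliberately different problem. Everything in it is hypothetical: the function name simulate_bag_weights(), the apple weights, and the observed value are made up for illustration and are not the PA 9 solution.

```r
#' Simulate total weights of bags of apples (hypothetical example)
#'
#' @param n_bags Number of bags to simulate.
#' @param apples_per_bag Number of apples in each bag.
#' @param mean_weight Assumed mean weight of one apple, in grams.
#' @param sd_weight Assumed standard deviation of one apple's weight, in grams.
#' @return A numeric vector of length n_bags containing total bag weights.
simulate_bag_weights <- function(n_bags, apples_per_bag = 10,
                                 mean_weight = 150, sd_weight = 20) {
  # For each bag, draw the individual apple weights and add them up.
  replicate(n_bags,
            sum(rnorm(apples_per_bag, mean = mean_weight, sd = sd_weight)))
}

set.seed(93401)
bag_weights <- simulate_bag_weights(n_bags = 1000)

# Count how many simulated bags weigh no more than a made-up observed bag.
observed_weight <- 1400
sum(bag_weights <= observed_weight)
```

If very few simulated bags are as light as the observed one, the observed weight would be unusual under the assumed distribution.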



None of us have all these abilities. Each of us has some of these abilities.

Task Card

Every group should have a task card!

  • The table on distributions provides pictures of what each function (e.g., p, d, q) means

  • The list of distributions should help you decide what function to use (e.g., pchisq())

Getting Started

The person whose birthday is closest to today starts as the Coder (giving the Developer instructions on what to type)!

  • The Coder does not type.
    • The collaborative editing feature should allow you to track what is being typed.
  • The Developer only types what they are told to type.

Submission

You and your partner together should address the following questions:

How many of these samples had a weight less than or equal to Professor Hill’s shipment?

Do you believe Professor Hill ordered genuine instruments?