Simulating Data in R

Same Seating Chart!

Tuesday, November 19

Today we will…

Plan for Week 9 & 10
New Material
- Statistical Distributions
- Simulating Data
PA 9: Instrument Con

Week 9

PA 9 (today)
Lab 9 & Challenge 9 (last ones!)
Revisions on Lab 7 (due Friday)
Code Review Lab 8 (due Sunday)

Week 10

Final Portfolio Week!

Warning

No revisions will be accepted on Lab 8 or Lab 9. You can, however, talk with me during class about any revisions you’ve made. :)

Statistical Distributions

Recall from your statistics classes…

Random Variable
Distribution

A random variable is a value we don’t know until we take a sample.

Coin flip: could be heads (0) or tails (1)
Person’s height: could be anything from 0 feet to 10 feet.
Annual income of a US worker: could be anything from $0 to $1.6 billion

The distribution of a random variable tells us its possible values and how likely they are to occur.

Coin flip: 50% chance of heads and 50% chance of tails.
Heights follow a bell curve centered at 5 foot 7.
Most American workers make under $100,000.

Uniform Distribution

When you know the range of values, but not much else.
All values in the range are equally likely to occur.
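
A quick way to see both properties, as a minimal sketch in base R. The seed, sample size, and bin width are arbitrary choices for illustration: every simulated value lands inside the range, and equal-width bins end up with roughly equal shares.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible
x <- runif(10000, min = 10, max = 20)

# Every value falls inside the stated range
range(x)

# Split the range into 5 equal-width bins; each should hold roughly 20%
bins <- cut(x, breaks = seq(10, 20, by = 2))
bin_share <- prop.table(table(bins))
bin_share
```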

Normal Distribution

When you expect values to fall near the center.
Frequency of values follows a bell-shaped curve.
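
To put numbers on "near the center", here is a small sketch using pnorm(), which returns the probability of a value at or below its argument. It computes the chance that a standard Normal value lands within 1, 2, and 3 standard deviations of the mean.

```r
# Number of standard deviations away from the mean
k <- 1:3

# P(-k < Z < k) for a standard Normal variable Z
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 3)
# → 0.683 0.954 0.997
```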

t-Distribution

A slightly wider bell curve.
Used in the same contexts as the Normal distribution, but more common with real data, where the population standard deviation is unknown and has to be estimated from the sample.
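
One way to see the "wider" bell curve is to compare cutoffs. This sketch (the degrees of freedom are arbitrary illustrative choices) finds the value that leaves 2.5% in the upper tail for several t-distributions and for the standard Normal.

```r
# Degrees of freedom to compare (illustrative values)
dfs <- c(3, 11, 30, 100)

# Cutoff leaving 2.5% in the upper tail
t_cutoffs <- qt(0.975, df = dfs)
z_cutoff  <- qnorm(0.975)

round(t_cutoffs, 3)
round(z_cutoff, 3)
```

The t cutoffs are all larger than the Normal cutoff, and they shrink toward it as the degrees of freedom grow.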

Chi-Square Distribution

Somewhat skewed, and only allows values above zero.
Used in testing count data.
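
Both features show up in simulated values. A minimal sketch, assuming an arbitrary seed and 3 degrees of freedom for illustration:

```r
set.seed(2)  # arbitrary seed for reproducibility
x <- rchisq(10000, df = 3)

# No value is zero or negative
min(x) > 0

# Right skew: the long upper tail pulls the mean above the median
mean(x) > median(x)
```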

Binomial Distribution

Appears when you have two possible outcomes, and you are counting how many times each outcome occurred.
This is a discrete distribution, as there can only be whole number values!
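
The counting idea can be sketched directly. Using the coding from the coin flip example (0 = heads, 1 = tails), an arbitrary seed, and a fair coin as an assumption, we can flip one coin 20 times and count, or ask rbinom() for the count itself.

```r
set.seed(3)  # arbitrary seed for reproducibility

# 20 single flips: 0 = heads, 1 = tails
flips <- rbinom(20, size = 1, prob = 0.5)
n_tails <- sum(flips)
n_heads <- length(flips) - n_tails

# With size = 20, rbinom() reports the number of tails directly,
# one whole-number count per simulated set of 20 flips
counts <- rbinom(5, size = 20, prob = 0.5)
counts
```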

Distribution Functions in R

r is for random sampling.

Generate random values from a distribution.
We use this to simulate data (create pretend observations).

runif(n = 3, min = 10, max = 20)

[1] 12.39305 10.05442 15.47226

rnorm(n = 3)

[1] -0.4262411  0.8939456 -0.2120931

rnorm(n = 3, mean = -100, sd = 50)

[1] -193.37034 -139.95379  -78.93276

rt(n = 3, df = 11)

[1] -0.81031377  0.68131652  0.05096835

rbinom(n = 3, size = 10, prob = 0.7)

[1] 9 7 9

rchisq(n = 3, df = 11)

[1] 13.18364 15.80086 17.71515

p is for probability.

Compute the chance of observing a value less than or equal to x.
We use this for calculating p-values.

pnorm(q = 1.5)

[1] 0.9331928

pnorm(q = 70, mean = 67, sd = 3)

[1] 0.8413447

1 - pnorm(q = 70, mean = 67, sd = 3)

[1] 0.1586553

pnorm(q = 70, mean = 67, sd = 3, lower.tail = FALSE)

[1] 0.1586553
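
As a sketch of how this turns into a p-value: the test statistic below is a made-up number for illustration, and we assume it follows a standard Normal distribution when the null hypothesis is true.

```r
z <- 2.1  # hypothetical test statistic, not from any real data

# One-sided p-value: chance of a value at least this large
p_upper <- pnorm(z, lower.tail = FALSE)

# Two-sided p-value: chance of a value at least this far from 0 in either direction
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(one_sided = p_upper, two_sided = p_two_sided)
```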

q is for quantile.

Given a probability $p$, compute $x$ such that $P(X \le x) = p$.
The q functions are the inverse of the p functions.

qnorm(p = 0.95)

[1] 1.644854

qnorm(p = 0.95, mean = 67, sd = 3)

[1] 71.93456
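
A short round trip shows the relationship between the two families, reusing the mean and standard deviation from the examples above (the probabilities are arbitrary):

```r
p <- c(0.05, 0.50, 0.95)

# Probabilities -> values
x <- qnorm(p, mean = 67, sd = 3)
x

# Values -> probabilities: feeding x back into pnorm() recovers p
p_back <- pnorm(x, mean = 67, sd = 3)
all.equal(p_back, p)
```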

d is for density.

Compute the height of a distribution curve at a given $x$.
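For discrete distributions: the probability of getting exactly $x$.
For continuous distributions: the height is not a probability (any single exact value has probability zero), so it is rarely useful on its own.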

What is the probability of exactly 12 heads in 20 coin tosses, if each toss has a 70% chance of tails (so a 30% chance of heads)?

dbinom(x = 12, size = 20, prob = 0.3)

[1] 0.003859282
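
A sketch of what d returns in each case. The first check rebuilds the coin toss answer from the binomial formula; the last line uses an arbitrary narrow Normal curve to show that a continuous density height can be larger than 1, so it cannot be a probability.

```r
# Binomial formula by hand: (20 choose 12) * 0.3^12 * 0.7^8
by_hand <- choose(20, 12) * 0.3^12 * 0.7^8
from_r  <- dbinom(x = 12, size = 20, prob = 0.3)
all.equal(by_hand, from_r)

# Discrete case: the probabilities of all possible counts add up to 1
total <- sum(dbinom(0:20, size = 20, prob = 0.3))
total

# Continuous case: the curve height at a point can exceed 1
height_at_0 <- dnorm(0, mean = 0, sd = 0.1)
height_at_0
```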

Simulating Data

Simulate a Dataset

The Idea
set.seed()
tibble
visualize

We can generate fake data based on the assumption that a variable follows a certain distribution.

We randomly sample observations from the distribution.

age <- runif(1000, min = 15, max = 75)

Since there is randomness involved, we will get a different result each time we run the code.

runif(3, min = 15, max = 75)

[1] 61.85689 49.08084 30.63219

runif(3, min = 15, max = 75)

[1] 53.77562 67.76394 15.88530

To make a reproducible random sample, we first set the seed:

set.seed(93401)
runif(3, min = 15, max = 75)

[1] 20.84739 51.61768 42.68515

set.seed(93401)
runif(3, min = 15, max = 75)

[1] 20.84739 51.61768 42.68515
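
One detail worth sketching: the seed fixes the starting point of the random number stream, so a second draw without resetting the seed continues the stream instead of repeating it.

```r
set.seed(93401)
first  <- runif(3, min = 15, max = 75)
second <- runif(3, min = 15, max = 75)  # no reset: the stream continues

set.seed(93401)
third  <- runif(3, min = 15, max = 75)  # reset: the stream starts over

identical(first, third)   # same seed, same position in the stream
identical(first, second)  # same seed, later position in the stream
```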

library(tidyverse)  # attaches tibble, dplyr (mutate), and ggplot2

set.seed(435)

fake_data <- tibble(names   = charlatan::ch_name(1000),
                    height  = rnorm(1000, mean = 67, sd = 3),
                    age     = runif(1000, min = 15, max = 75),
                    measure = rbinom(1000, size = 1, prob = 0.6)
                    ) |> 
  mutate(supports_measure_A = ifelse(measure == 1, "yes", "no"))

head(fake_data)

# A tibble: 6 × 5
  names                 height   age measure supports_measure_A
  <chr>                  <dbl> <dbl>   <int> <chr>             
1 Elbridge Kautzer        67.4  66.3       1 yes               
2 Brandon King            65.0  61.5       0 no                
3 Phyllis Thompson        68.1  53.8       1 yes               
4 Humberto Corwin         67.5  33.9       1 yes               
5 Theresia Koelpin        71.4  16.1       1 yes               
6 Hayden O'Reilly-Johns   66.2  37.0       0 no
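
Before plotting, a numeric sanity check can confirm the simulated columns behave like the distributions they came from. This is a base R sketch that redraws the three numeric columns on their own, so its values will not match the tibble above (the name generator is left out here).

```r
set.seed(435)
n <- 1000

height  <- rnorm(n, mean = 67, sd = 3)
age     <- runif(n, min = 15, max = 75)
measure <- rbinom(n, size = 1, prob = 0.6)

# Each summary should sit close to the value we asked for
checks <- c(mean_height  = mean(height),   # near 67
            sd_height    = sd(height),     # near 3
            min_age      = min(age),       # no lower than 15
            max_age      = max(age),       # no higher than 75
            prop_support = mean(measure))  # near 0.6
round(checks, 2)
```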

Check that the ages look uniformly distributed.


fake_data |> 
  ggplot(mapping = aes(x = age,
                       fill = supports_measure_A)) +
  geom_histogram(show.legend = FALSE) +
  # One panel per response, stacked vertically for easy comparison
  facet_wrap(~ supports_measure_A,
             ncol = 1) +
  scale_fill_brewer(palette = "Paired") +
  theme_bw() +
  labs(x = "Age (years)",
       y = "",
       subtitle = "Number of Individuals Supporting Measure A for Different Ages")

PA 9: Instrument Con

Is the instrument salesman selling fake instruments?

PA 9

In this practice activity you and your partner will write a function to simulate the weight of various band instruments, with the goal of identifying whether a particular shipment of instruments has a “reasonable” weight.

This activity will require knowledge of:

named distributions
probability calculations related to distributions
function documentation
function syntax
function arguments

None of us have all these abilities. Each of us has some of these abilities.

Task Card

Every group should have a task card!

The table on distributions provides pictures of what each function (e.g., p, d, q) means
The list of distributions should help you decide what function to use (e.g., pchisq())

Getting Started

The person whose birthday is closest to today starts as the Coder (giving instructions on what to type to the Developer)!

The Coder does not type.
- The collaborative editing feature should allow you to track what is being typed.
The Developer only types what they are told to type.

Submission

You and your partner together should address the following questions:

How many of these samples had a weight less than or equal to Professor Hill’s shipment?

Do you believe Professor Hill ordered genuine instruments?