Project Checkpoint 2 - Example

Note

Note that this example document provides only two examples of data summaries; your project step requires a total of four.

Although the instructions don’t explicitly say to interpret your results, you should definitely make some notes for yourself about what you learn from these data summaries, to help inform your subsequent steps.

Since you are building on your previous report, your dataset introduction and description from last week should still be present in your file! I have omitted this here for simplicity.

Athlete sex proportions annually

New column: proportion of female athletes

Does not require iteration or custom function.

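# Assumes the tidyverse is loaded and `olympics` was read in earlier in the report.
# Count male and female athlete rows for each team at each Games.
gender_props <- olympics |>
  group_by(year, games, team) |>
  summarize(
    count_male_athletes = sum(sex == "M", na.rm = TRUE),
    count_female_athletes = sum(sex == "F", na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    # New column: share of each team's athlete rows that are female
    prop_female = count_female_athletes / (count_female_athletes + count_male_athletes)
  )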

gender_props |>
  group_by(year) |>
  summarize(
    avg_prop_female = mean(prop_female, na.rm = TRUE)
  ) |> 
  gt::gt()
year avg_prop_female
1896 0.000000000
1900 0.009817973
1904 0.006513681
1906 0.024598963
1908 0.013272334
1912 0.022835916
1920 0.068553244
1924 0.081949971
1928 0.043849748
1932 0.097686199
1936 0.054935364
1948 0.051992839
1952 0.080506697
1956 0.074287903
1960 0.098024523
1964 0.085748426
1968 0.128812638
1972 0.120342792
1976 0.166452916
1980 0.164470918
1984 0.154771927
1988 0.175898833
1992 0.223904286
1994 0.246359511
1996 0.291684774
1998 0.242977985
2000 0.389531239
2002 0.273209448
2004 0.395184916
2006 0.326069434
2008 0.427192545
2010 0.336880907
2012 0.429163254
2014 0.295879556
2016 0.434188173

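As we probably expected, the average proportion of female athletes per team has gone up over time, from 0 in 1896 to about 0.43 in 2016. The rise is not perfectly smooth: the years 1994, 1998, 2002, 2006, 2010, and 2014, which correspond to Winter Games, each sit below the Summer year that follows them.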

Note

Ideas to take this further:
  • Correct for team sports having more athletes.

  • Think about how to handle anomalies in the country team names, e.g. “USSR” vs. “Russia”.

  • Limit the data to one or two countries and compare them.
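As a rough sketch of the last two ideas, the chunk below merges two team names into one group and then compares that group with a second country. The rows in `gender_props_toy` are made up for illustration, and whether "Soviet Union" and "Russia" should really be treated as one team is an analysis choice, not something the data decides for you.

```r
library(dplyr)
library(tibble)

# Made-up per-team counts in the shape of gender_props (not real data)
gender_props_toy <- tribble(
  ~year, ~team,          ~count_male_athletes, ~count_female_athletes,
  1988,  "Soviet Union", 60,                   40,
  1996,  "Russia",       50,                   50,
  1988,  "Canada",       70,                   30,
  1996,  "Canada",       60,                   40
)

two_country_comparison <- gender_props_toy |>
  mutate(
    # Assumption: treat both names as one team for comparison purposes
    team_group = if_else(
      team %in% c("Soviet Union", "Russia"),
      "Russia / Soviet Union",
      team
    )
  ) |>
  # Keep only the two groups being compared
  filter(team_group %in% c("Russia / Soviet Union", "Canada")) |>
  group_by(year, team_group) |>
  summarize(
    # Pool the counts first, then take the proportion
    prop_female = sum(count_female_athletes) /
      sum(count_female_athletes + count_male_athletes),
    .groups = "drop"
  )

two_country_comparison
```

With the real data, the same pattern would start from `gender_props` instead of the toy tibble.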

Important

Table Improvements
  • Give columns more descriptive labels
  • Boldface column titles
  • Format the proportion column as percentages
  • Give the table a caption
  • Color the table in a way that tells the story of patterns over time.
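One possible way to apply these improvements with the gt package is sketched below. It uses three rows copied from the summary above so that it runs on its own. The label and caption wording are placeholders, and the `palette` argument of `data_color()` and the `tab_caption()` function assume a reasonably recent gt release.

```r
library(gt)
library(tibble)

# Three rows copied from the summary table above, so the sketch is self-contained
avg_props <- tribble(
  ~year, ~avg_prop_female,
  1896,  0.000000000,
  1956,  0.074287903,
  2016,  0.434188173
)

female_share_table <- avg_props |>
  gt() |>
  # More descriptive column labels (wording is a placeholder)
  cols_label(
    year = "Year",
    avg_prop_female = "Average share of female athletes per team"
  ) |>
  # Boldface the column titles
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  # Show proportions as percentages, e.g. 0.434 becomes 43.4%
  fmt_percent(columns = avg_prop_female, decimals = 1) |>
  # Shade cells from light to dark as the share grows
  data_color(columns = avg_prop_female, palette = c("white", "steelblue")) |>
  tab_caption(caption = "Average share of female athletes per team, by Olympic year")

female_share_table
```

With the full summary, the same calls would replace the bare `gt::gt()` at the end of the pipeline.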

Female medals by country

No new columns

Requires custom function and iteration

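# Count a country's female gold medals, counting each event once per Games
# so that team events are not overcounted.
num_female_gold <- function(country) {
  female_medals <- olympics |>
    filter(team == country,
           sex == "F") |>
    # One row per Games/event/medal, however many teammates share it
    distinct(games, event, medal) |>
    pull(medal)

  sum(female_medals == "Gold", na.rm = TRUE)
}

# Apply the function to every team and keep the ten highest counts
olympics |>
  distinct(team) |>
  mutate(
    n_f_gold = map_dbl(team, num_female_gold)
  ) |>
  slice_max(n_f_gold, n = 10)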
# A tibble: 10 × 2
   team          n_f_gold
   <chr>            <dbl>
 1 United States      300
 2 China              125
 3 Soviet Union       124
 4 Germany            103
 5 East Germany        94
 6 Russia              82
 7 Australia           64
 8 Netherlands         61
 9 Romania             58
10 Great Britain       56

This ranking looks broadly similar to what we would expect for overall medal counts, but comparing against male gold counts might reveal some standout countries. For example, Romania makes this top 10 but probably would not make the overall one.

Note

Ideas to take this further:

  • Collect counts for silver and bronze medals as well, not just gold.

  • Collect a ratio of female to male medals, to get a sense of which countries have a disproportionately low number of female medal winners.

  • Collect this data over time as well to look for countries that evolve.

  • (This actually could have been done in a pipeline! But, the function approach makes it easier to incorporate some of the upgrades above.)
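To illustrate how the function approach might absorb these upgrades, here is a minimal sketch. `count_medals()` is a hypothetical generalization of `num_female_gold()` that takes the sex and medal type as arguments, and `olympics_toy` is a made-up stand-in for the real data. The last step shows the pipeline alternative for the female gold count as a cross-check.

```r
library(dplyr)
library(purrr)
library(tibble)

# Made-up rows in the shape of the olympics data (illustration only)
olympics_toy <- tribble(
  ~team, ~sex, ~games,        ~event,  ~medal,
  "A",   "F",  "2000 Summer", "Relay", "Gold",
  "A",   "F",  "2000 Summer", "Relay", "Gold",
  "A",   "F",  "2004 Summer", "100m",  "Silver",
  "A",   "M",  "2000 Summer", "100m",  "Gold",
  "B",   "F",  "2000 Summer", "100m",  "Gold",
  "B",   "M",  "2000 Summer", "Relay", NA
)

# Hypothetical generalization of num_female_gold():
# counts each Games/event once, so team events are not overcounted
count_medals <- function(country, athlete_sex, medal_type, data = olympics_toy) {
  data |>
    filter(team == country, sex == athlete_sex, medal == medal_type) |>
    distinct(games, event) |>
    nrow()
}

# Iterate over teams for several sex/medal combinations
medal_counts <- olympics_toy |>
  distinct(team) |>
  mutate(
    n_f_gold   = map_int(team, \(tm) count_medals(tm, "F", "Gold")),
    n_m_gold   = map_int(team, \(tm) count_medals(tm, "M", "Gold")),
    n_f_silver = map_int(team, \(tm) count_medals(tm, "F", "Silver"))
  )

# Pipeline alternative for the female gold count (no custom function)
pipeline_f_gold <- olympics_toy |>
  filter(sex == "F", medal == "Gold") |>
  distinct(team, games, event) |>
  count(team, name = "n_f_gold")

medal_counts
```

Note that the pipeline version drops teams with zero matching medals, while the function version keeps them with a count of 0. That difference matters if you later compute female-to-male ratios.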

Important

Table Improvements
  • Give columns more descriptive labels
  • Boldface column titles
  • Give the table a caption
  • Color the table in a way that draws people’s eyes to locations you want to highlight.
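A sketch of how a single row could be highlighted with gt follows, using four rows copied from the output above. Choosing Romania as the row to highlight, and the label and caption text, are only examples. `tab_caption()` assumes a reasonably recent gt release.

```r
library(gt)
library(tibble)

# Four rows copied from the top-10 output above
top_female_gold <- tribble(
  ~team,           ~n_f_gold,
  "United States", 300,
  "China",         125,
  "Soviet Union",  124,
  "Romania",       58
)

female_gold_table <- top_female_gold |>
  gt() |>
  cols_label(
    team = "Team",
    n_f_gold = "Gold-medal events won by women"
  ) |>
  # Boldface the column titles
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  # Shade the one row we want readers to notice
  tab_style(
    style = cell_fill(color = "lightyellow"),
    locations = cells_body(rows = team == "Romania")
  ) |>
  tab_caption(caption = "Teams with the most gold-medal events won by women")

female_gold_table
```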