Project Checkpoint 2 - Example
Note that this example document provides only two examples of data summaries; your project step requires a total of four.
Although the instructions don’t explicitly say to interpret your results, you should definitely make some notes for yourself about what you learn from these data summaries, to help inform your subsequent steps.
Since you are building on your previous report, your dataset introduction and description from last week should still be present in your file! I have omitted this here for simplicity.
Athlete sex proportions annually
New column: proportion of female athletes (female count divided by the combined female and male count)
Does not require iteration or custom function.
# Count male and female athlete rows for each team at each Games,
# then compute the female share of that team's rows.
gender_props <- olympics |>
  group_by(year, games, team) |>
  summarize(
    count_male_athletes = sum(sex == "M", na.rm = TRUE),
    count_female_athletes = sum(sex == "F", na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(
    prop_female = count_female_athletes / (count_female_athletes + count_male_athletes)
  )
gender_props |>
  group_by(year) |>
  summarize(
    avg_prop_female = mean(prop_female, na.rm = TRUE)
  ) |>
  gt::gt()

| year | avg_prop_female |
|---|---|
| 1896 | 0.000000000 |
| 1900 | 0.009817973 |
| 1904 | 0.006513681 |
| 1906 | 0.024598963 |
| 1908 | 0.013272334 |
| 1912 | 0.022835916 |
| 1920 | 0.068553244 |
| 1924 | 0.081949971 |
| 1928 | 0.043849748 |
| 1932 | 0.097686199 |
| 1936 | 0.054935364 |
| 1948 | 0.051992839 |
| 1952 | 0.080506697 |
| 1956 | 0.074287903 |
| 1960 | 0.098024523 |
| 1964 | 0.085748426 |
| 1968 | 0.128812638 |
| 1972 | 0.120342792 |
| 1976 | 0.166452916 |
| 1980 | 0.164470918 |
| 1984 | 0.154771927 |
| 1988 | 0.175898833 |
| 1992 | 0.223904286 |
| 1994 | 0.246359511 |
| 1996 | 0.291684774 |
| 1998 | 0.242977985 |
| 2000 | 0.389531239 |
| 2002 | 0.273209448 |
| 2004 | 0.395184916 |
| 2006 | 0.326069434 |
| 2008 | 0.427192545 |
| 2010 | 0.336880907 |
| 2012 | 0.429163254 |
| 2014 | 0.295879556 |
| 2016 | 0.434188173 |
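As we probably expected, the average proportion of female athletes per team has gone up over time, from 0 in 1896 to about 0.43 in 2016! From 1994 on, the values alternate: the Winter-only years (1994, 1998, 2002, ...) come in lower than the neighboring Summer years.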
Ideas to take this further:
Correct for team sports having more athletes.
Think about how to handle anomalies in the country team names, e.g. “USSR” vs. “Russia”.
Limit the data to one or two countries and compare them.
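The last two ideas can be sketched together. This is only a minimal illustration on a hypothetical toy table (`olympics_toy`, with made-up rows); it assumes you decide to treat "Soviet Union" and "Russia" as one group, which is an analysis choice rather than a rule, and the `team_group` column name is invented for the example.

```r
library(dplyr)

# Hypothetical toy rows standing in for the real olympics data
olympics_toy <- tibble(
  year = c(1976, 1976, 1976, 2000, 2000, 2000, 2000, 2000),
  team = c("Soviet Union", "Soviet Union", "United States",
           "Russia", "Russia", "United States", "United States", "China"),
  sex  = c("M", "F", "M", "F", "M", "F", "F", "M")
)

two_country_props <- olympics_toy |>
  # Assumption: fold the two team names into one label before grouping
  mutate(
    team_group = if_else(
      team %in% c("Soviet Union", "Russia"),
      "Soviet Union / Russia",
      team
    )
  ) |>
  # Keep only the two groups we want to compare
  filter(team_group %in% c("Soviet Union / Russia", "United States")) |>
  group_by(year, team_group) |>
  summarize(prop_female = mean(sex == "F", na.rm = TRUE), .groups = "drop")

two_country_props
```

With the real data you would want to check which team names actually appear (for example with `count(olympics, team)`) before deciding what to merge.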
Ideas for upgrading the table:
- Give columns more descriptive labels
- Boldface column titles
- Format the proportion column as percentages!
- Give the table a caption
- Color the table in a way that tells the story of patterns over time.
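One possible way to apply these table upgrades with the gt package is sketched below. It uses four rows copied from the table above (rounded to three decimals) so that it runs on its own; the label text, caption, and color choices are placeholders, and the `palette` argument of `data_color()` assumes a reasonably recent gt release (0.9.0 or later).

```r
library(gt)

# A few rows copied from the table above, rounded to three decimals
female_by_year <- data.frame(
  year = c(1896, 1936, 1976, 2016),
  avg_prop_female = c(0, 0.055, 0.166, 0.434)
)

styled_tbl <- female_by_year |>
  # Caption for the whole table
  gt(caption = "Average share of female athletes per team, by Olympic year") |>
  # More descriptive column labels
  cols_label(
    year = "Olympic year",
    avg_prop_female = "Female athletes (team average)"
  ) |>
  # Boldface the column titles
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  # Show proportions as percentages
  fmt_percent(columns = avg_prop_female, decimals = 1) |>
  # Shade cells from light to dark as the share grows
  data_color(columns = avg_prop_female, palette = c("white", "steelblue"))

styled_tbl
```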
Female gold medals by country
No new columns
Requires custom function and iteration
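The code behind this summary is not reproduced in this example. As a rough sketch of what a custom function plus iteration could look like, the block below defines a hypothetical helper `count_f_gold()` and applies it to each team with `purrr::map_dbl()`. It runs on a made-up toy table (`medals_toy`) and assumes the data has `team`, `sex`, and `medal` columns, with `medal` missing for non-medalists.

```r
library(dplyr)
library(purrr)

# Hypothetical toy rows standing in for the real olympics data
medals_toy <- tibble(
  team  = c("United States", "United States", "United States",
            "China", "China", "Romania"),
  sex   = c("F", "F", "M", "F", "F", "F"),
  medal = c("Gold", "Gold", "Gold", "Gold", "Silver", NA)
)

# Custom function: number of gold-medal rows won by women for one team
count_f_gold <- function(data, team_name) {
  data |>
    filter(team == team_name, sex == "F", medal == "Gold") |>
    nrow()
}

# Iteration: call the function once per team, then keep the top teams
f_gold_by_team <- tibble(team = unique(medals_toy$team)) |>
  mutate(n_f_gold = map_dbl(team, \(t) count_f_gold(medals_toy, t))) |>
  slice_max(n_f_gold, n = 2)

f_gold_by_team
```

Swapping in the full dataset and `n = 10` would give a table with the same two columns as the output below. Note that this counts rows, so every member of a winning team adds one gold.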
# A tibble: 10 × 2
   team          n_f_gold
   <chr>            <dbl>
 1 United States      300
 2 China              125
 3 Soviet Union       124
 4 Germany            103
 5 East Germany        94
 6 Russia              82
 7 Australia           64
 8 Netherlands         61
 9 Romania             58
10 Great Britain       56
This follows about the same pattern as overall medals, but if we compared against male gold counts we might find some standout countries. For example, Romania is in this top 10 but probably would not be in the top 10 for overall counts.
Ideas to take this further:
Collect counts for silver and bronze medals as well, not just gold.
Collect a ratio of female to male medals, to get a sense of which countries have a disproportionately low number of female medal winners.
Collect this data over time as well to look for countries that evolve.
(This actually could have been done in a pipeline! But, the function approach makes it easier to incorporate some of the upgrades above.)
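To illustrate why the function approach extends easily, here is a hedged sketch that generalizes the counting to any sex and any set of medal types, then builds a female-to-male ratio per team. The helper `count_medals()`, its arguments, and the toy table `medals_toy2` are all hypothetical names made up for this example.

```r
library(dplyr)
library(purrr)

# Hypothetical toy rows standing in for the real olympics data
medals_toy2 <- tibble(
  team  = c("A", "A", "A", "A", "B", "B", "B"),
  sex   = c("F", "F", "M", "M", "F", "M", "M"),
  medal = c("Gold", "Silver", "Gold", NA, "Bronze", "Gold", "Silver")
)

# Count medal rows for one team and one sex; defaults to all three medal types
count_medals <- function(data, team_name, sex_code,
                         medal_types = c("Gold", "Silver", "Bronze")) {
  data |>
    filter(team == team_name, sex == sex_code, medal %in% medal_types) |>
    nrow()
}

medal_ratios <- tibble(team = unique(medals_toy2$team)) |>
  mutate(
    n_f = map_dbl(team, \(t) count_medals(medals_toy2, t, "F")),
    n_m = map_dbl(team, \(t) count_medals(medals_toy2, t, "M")),
    # Ratio below 1 means fewer female than male medal rows
    f_to_m_ratio = n_f / n_m
  )

medal_ratios
```

Passing `medal_types = "Gold"` recovers a gold-only count, and adding a `year` argument to the function would be one way to track the ratio over time. Teams with no male medals would give a ratio of `Inf` or `NaN`, so those would need handling in a real analysis.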
Ideas for upgrading the table:
- Give columns more descriptive labels
- Boldface column titles
- Give the table a caption
- Color the table in a way that draws people’s eyes to locations you want to highlight.