Project Checkpoint 1 - Example

Note

This is a minimal example representing “passing” or ‘C’ level work, to give you a baseline for how to proceed. ‘B’ or ‘A’ level work will require a more thoughtful and thorough description of the data and cleaning, deeper or more complex research questions, and/or more polished visualization sketches. Some notes about areas for improvement are included in the callouts below.

Introduction

My dataset is the Olympics data from TidyTuesday. This dataset shows all the Olympic athletes ever, and gives information about their sport, the country they are from, whether they won a medal, etc.

The data comes from a Kaggle dataset created by user RGriffin, who scraped the data from www.sports-reference.com in May 2018.

Here are a few rows of the dataset:

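# Download the TidyTuesday release for 2024-08-06 (requires internet access)
tuesdata <- tidytuesdayR::tt_load('2024-08-06')

# Show six randomly chosen rows; the rows differ on each run
tuesdata$olympics |>
  dplyr::sample_n(6)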

| id | name | sex | age | height | weight | team | noc | games | year | season | city | sport | event | medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3229 | Gyula Alvics | M | 28 | 186 | 91 | Hungary | HUN | 1988 Summer | 1988 | Summer | Seoul | Boxing | Boxing Men’s Heavyweight | NA |
| 31516 | Alfred Frederik “Fred” Eefting | M | 21 | 183 | 94 | Netherlands | NED | 1980 Summer | 1980 | Summer | Moskva | Swimming | Swimming Men’s 200 metres Backstroke | NA |
| 35269 | Solenne Nadge Figus (-Saint Marie) | F | 25 | 178 | 59 | France | FRA | 2004 Summer | 2004 | Summer | Athina | Swimming | Swimming Women’s 200 metres Freestyle | Bronze |
| 36759 | Robert Frank | M | 34 | NA | NA | Switzerland | SUI | 1936 Summer | 1936 | Summer | Berlin | Art Competitions | Art Competitions Mixed Sculpturing, Reliefs | NA |
| 80591 | Aiko Miyamura | F | 24 | 173 | 60 | Japan-1 | JPN | 1996 Summer | 1996 | Summer | Atlanta | Badminton | Badminton Women’s Doubles | NA |
| 20202 | Chen Hsiu-Hsiung | M | 32 | 169 | 70 | Chinese Taipei | TPE | 1968 Summer | 1968 | Summer | Mexico City | Sailing | Sailing Mixed Three Person Keelboat | NA |

Note

Good:

  • This description explains who created the data, from where, and when. It provides links to the sources so the reader can find them.

  • It somewhat explains what information is in the dataset; i.e., information about athletes, as opposed to scheduling information or other details.

  • Showing a random small snippet of the dataset is typically useful, to get a feel for what it looks like.

Bad:

  • This is lacking in the “why” and “how” of the data creation. It is important to know what motivated the creation and sharing of a dataset, as well as the specific process that the creator used to assemble it.

  • This description is unclear about the observational units, i.e., what each row represents. (It is not true that each row is a unique athlete!)

  • This description could give much more detail on the observed variables present, such as how they were measured and what a typical value looks like.
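
One way to check what a row represents is to compare the number of rows with the number of distinct athletes, and then test a candidate key. The sketch below uses a small made-up data frame (the names and values are hypothetical, not taken from the real data) with the same id, games, and event columns as the sample above. Treating athlete + games + event as the row key is an assumption to verify against the full dataset, not a documented fact.

```r
library(dplyr)

# Hypothetical toy rows that mimic the olympics columns; not real data.
toy <- data.frame(
  id    = c(1, 1, 2, 3),
  name  = c("Athlete A", "Athlete A", "Athlete B", "Athlete C"),
  games = c("2000 Summer", "2000 Summer", "2000 Summer", "2004 Summer"),
  event = c("Event X", "Event Y", "Event X", "Event X")
)

# If each row were a unique athlete, these two numbers would match.
n_rows     <- nrow(toy)
n_athletes <- dplyr::n_distinct(toy$id)

# Candidate key: one row per athlete per games per event.
# Any combination that appears more than once breaks that assumption.
duplicated_keys <- toy |>
  dplyr::count(id, games, event) |>
  dplyr::filter(n > 1)

n_rows
n_athletes
nrow(duplicated_keys)
```

On the real data, the same three lines could be run with tuesdata$olympics in place of toy; a nonzero number of duplicated keys would mean the candidate key needs another column.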

Data Cleaning

This data was cleaned by using janitor to reformat column names.
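
As a rough illustration of what this step might look like, the sketch below applies janitor::clean_names() to a toy data frame. The capitalised column names are an assumption about the style of the raw Kaggle file, and clean_names() is assumed to be the janitor function that was used; neither is confirmed by the cleaning description above.

```r
# Hypothetical raw column names, for illustration only.
raw <- data.frame(
  ID   = 1,
  Name = "Athlete A",
  NOC  = "AAA",
  check.names = FALSE
)

# clean_names() converts column names to lower snake_case by default.
cleaned <- janitor::clean_names(raw)

names(cleaned)
```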

The user RGriffin, who scraped the data, also checked for misentered data in the columns Name, Gender, Height, and Weight.

Note

Good:

  • We did not just look at the Tidy Tuesday cleaning and call it a day! We followed the path of the data creation to figure out what other cleaning and wrangling took place.

Bad:

  • We were not specific enough about the cleaning done by RGriffin. What did they alter in those columns and why? What other anomalies did they look for?

Explorations

RQ 1: Olympic success by country

Which countries win the most Gold, Silver, and Bronze medals?
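
A minimal sketch of how this could be computed, using made-up rows rather than the real data. One caveat it tries to handle: the data appears to have one row per athlete per event, so every member of a medal-winning team would get their own row. Collapsing to distinct country/games/event/medal combinations before counting is my assumption about how to avoid over-counting team medals; it would also merge the rare case where one country wins the same medal twice in one event.

```r
library(dplyr)

# Hypothetical toy rows; country codes and events are invented.
toy <- data.frame(
  noc   = c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC"),
  games = rep("2000 Summer", 6),
  event = c("Relay", "Relay", "Sprint", "Sprint", "Relay", "Sprint"),
  medal = c("Gold", "Gold", "Silver", "Gold", NA, NA)
)

medal_counts <- toy |>
  dplyr::filter(!is.na(medal)) |>                 # keep medal-winning rows only
  dplyr::distinct(noc, games, event, medal) |>    # one row per medal, not per athlete
  dplyr::count(noc, medal, name = "n_medals") |>  # medals by country and type
  dplyr::arrange(dplyr::desc(n_medals))

medal_counts
```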

RQ 2: Sex over time

Has the percentage of female sports increased over time?
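
One possible way to put a number on this, sketched with made-up rows: for each year, take the distinct event and sex combinations and compute the share that are female. This is only one of several reasonable definitions; mixed events would contribute one row for each sex under this approach, which may or may not be what is wanted.

```r
library(dplyr)

# Hypothetical toy rows; not real data.
toy <- data.frame(
  year  = c(1900, 1900, 1900, 2000, 2000, 2000, 2000),
  event = c("E1", "E2", "E3", "E1", "E2", "E3", "E4"),
  sex   = c("M", "M", "F", "M", "F", "F", "M")
)

pct_female <- toy |>
  dplyr::distinct(year, event, sex) |>   # drop repeated athletes within an event
  dplyr::group_by(year) |>
  dplyr::summarise(
    pct_female_events = 100 * mean(sex == "F"),
    .groups = "drop"
  )

pct_female
```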

Note

Good:

  • These two research questions are well-defined and answerable with the dataset.

Bad:

  • These are not particularly deep. “Which countries win the most?” is not going to provide a new insight beyond what most people already know, and I think we can expect from the outset that female sports have increased over time.

RQ 3: Olympic success corrected by population

Which countries win disproportionately more medals compared to their population size?

Additional Data: Populations of each country
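
A sketch of how the two sources might be combined, assuming the population data can be keyed by the same country code (noc) as the Olympics data. Both tables below are invented for illustration, and the column names population and medals_per_million are hypothetical. In practice, matching country codes across sources is likely to need manual fixes.

```r
library(dplyr)

# Hypothetical medal totals per country.
medals <- data.frame(
  noc      = c("AAA", "BBB", "CCC"),
  n_medals = c(10, 10, 2)
)

# Hypothetical population table; CCC is deliberately missing.
population <- data.frame(
  noc        = c("AAA", "BBB"),
  population = c(1e6, 50e6)
)

per_capita <- medals |>
  dplyr::left_join(population, by = "noc") |>   # keep countries without a match
  dplyr::mutate(medals_per_million = n_medals / (population / 1e6)) |>
  dplyr::arrange(dplyr::desc(medals_per_million))

per_capita
```

Using a left join keeps unmatched countries in the result with NA values, which makes the mismatches visible instead of silently dropping them.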

RQ 4: Correlation between female wins and education

Do countries with better support for education for women also tend to see more success in female sports categories?

Additional Data: Education trends in each country
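
Once a country-level table exists, a first look at this relationship could be a simple correlation. The sketch below uses invented values and hypothetical column names (pct_women_hs, womens_golds). Pearson correlation is used only because it is the default of cor(); a rank-based method might suit skewed medal counts better.

```r
# Hypothetical country-level values; not real data.
by_country <- data.frame(
  noc          = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  pct_women_hs = c(40, 55, 70, 85, NA),
  womens_golds = c(1, 0, 3, 4, 2)
)

# use = "complete.obs" drops countries with a missing value in either column.
r <- cor(
  by_country$pct_women_hs,
  by_country$womens_golds,
  use = "complete.obs"
)

r
```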

Note

Good:

  • Question 4 digs deeper into the relationship between a country’s culture and government and its outcomes at the Olympics. This is a good question.

Bad:

  • Question 3 is a bit less interesting: you might uncover one or two countries that win disproportionately, but you aren’t telling a rich story.

Visualizations

Sketch 1 (RQ 1): Stacked bar chart comparing Olympic medal counts for three countries: USA, Russia, and China. Each bar is divided into three segments representing gold (yellow), silver (blue), and bronze (orange) medals. The USA has the most medals overall, followed by Russia, then China. Russia and China have notably fewer bronze medals compared to the USA.

Sketch 2 (RQ 4): Scatter plot showing the relationship between the percent of women with a high school degree or better (x-axis) and the number of gold medals won in women's sports (y-axis). Approximately 15 data points are scattered across the chart. A red trend line runs upward from left to right, indicating a weak positive correlation: countries where more women have at least a high school education tend to win slightly more gold medals in women's sports. The data points show considerable spread around the trend line.

Note

Good:

  • The plots are appropriate to the data types and address the RQs.

Bad:

  • Little thought has been put into plot design, annotation, or storytelling.