PA 4: Military Spending

Author

The names of your group members go here!

library(readxl) 
library(tidyverse)

Today you will be tidying messy data to explore the relationship between countries of the world and military spending. You can find the gov_spending_per_capita.xlsx data included in the data folder.

This task is complex. It requires many different types of abilities. Everyone will be good at some of these abilities but nobody will be good at all of them. In order to produce the best product possible, you will need to use the skills of each member of your group.

Data Description

We will be using data from the Stockholm International Peace Research Institute (SIPRI). The SIPRI Military Expenditure Database is an open source data set containing time series on the military spending of countries from 1949–2019. The database is updated annually, which may include updates to data from previous years.

Military expenditure is presented in many ways:

in local currency and in US $ (both from 2018 and current);
in terms of financial years and calendar years;
as a share of GDP and per capita.

The availability of data varies considerably by country, but we note that data is available from at least the late 1950s for a majority of countries that were independent at the time. Estimates for regional military expenditure have been extended backwards depending on availability of data, but no estimates for total world military expenditure are available before 1988 due to the lack of data from the Soviet Union.

SIPRI military expenditure data is based on open sources only.

Data Import

First, you should notice that there are ten different sheets included in the dataset. We are interested in the sheet labeled “Share of Govt. spending”, which contains information about the share of all government spending that is allocated to the military.

Next, you’ll notice that there are notes about the data in the first six rows. Ugh! Also notice that the last six rows are footnotes about the data. Ugh!

Rather than copying this one sheet into a new Excel file and deleting the first and last few rows, let’s learn something new about the read_xlsx() function!

The read_xlsx() function has several useful arguments:

sheet: specify the name of the sheet that you want to use. The name must be passed in as a string (in quotes)!
skip: specify the number of rows you want to skip before reading in the data.
n_max: specify the maximum number of rows of data to read in.
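
As a minimal sketch of how these three arguments work together, the example below reads deaths.xlsx, a small practice workbook that ships with the readxl package (it is not the SIPRI data). It assumes that file's layout: four rows of notes above the header and ten rows of data before the footer. Your values for the military data will be different.

```r
library(readxl)

# Path to a practice workbook bundled with readxl (not the SIPRI data)
deaths_path <- readxl_example("deaths.xlsx")

# skip drops the note rows above the header; n_max stops before the footer
deaths <- read_xlsx(deaths_path,
                    sheet = "arts",
                    skip  = 4,
                    n_max = 10)

dim(deaths)
```

Note that skip counts rows before the header row, while n_max counts rows of data after it.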

1. Modify the code below (potentially including the file path) to read the military expenditures data into your workspace.

military <- read_xlsx(here::here("data", 
                                 "gov_spending_per_capita.xlsx"), 
                      sheet = , 
                      skip  = , 
                      n_max = )

I would highly recommend you open the dataset in Excel, so you can see the data layout!

Data Cleaning

In addition to NAs, missing values were coded in two other ways.

2. What are the two ways missing values were coded?
Hint: information in the top 6 rows of the Excel sheet will help you answer this question.

3. Now that we know how missing values were coded, let’s read in the data again. This time use the na argument to specify the values that need to be replaced with NAs.
Hint: You need to specify the values to replace with NAs as a character vector (e.g., c("a", "b"))
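
To see the idea in isolation, here is a small sketch that uses read_csv() from readr as a stand-in, since its na argument plays the same role and it can read inline text. The file contents and the codes "n/a" and "-" are made up for illustration; they are not the codes used in the SIPRI workbook.

```r
library(readr)

# Hypothetical file contents: "n/a" and "-" both stand for a missing value
raw_text <- "country,spending\nA,1.5\nB,n/a\nC,-\n"

# I() tells read_csv() to treat the string as literal data, not a file path
toy <- read_csv(I(raw_text),
                na = c("", "NA", "n/a", "-"),
                show_col_types = FALSE)

# Count the missing values in the spending column
sum(is.na(toy$spending))
```

Because the missing-value codes are converted to NA while the data are read, the spending column can be parsed as numbers instead of text.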

military <- read_xlsx(here::here("data", 
                                 "gov_spending_per_capita.xlsx"), 
                      sheet = , 
                      skip  = , 
                      n_max = , 
                      na = c()
                      )

If you give the Country column a look, you’ll see that the names of continents and regions are included. These rows only make it simpler to find countries; they contain no data.

Luckily for us, these region names were also stored in the “Regional totals” sheet. We can use the Region column of this dataset to filter out the names we don’t want.

Run the code below to read in the “Regional totals” data.

cont_region <- read_xlsx(here::here("data", 
                                    "gov_spending_per_capita.xlsx"), 
                      sheet = "Regional totals", 
                      skip = 14) |> 
  filter(Region != "World total (including Iraq)", 
         Region != "World total (excluding Iraq)")

A clever way to filter out observations you don’t want is with an anti_join(). This function will return all of the rows of one dataset without a match in another dataset.

4. Use the anti_join() function to filter the unwanted Country values out of the military data, and save the result as military_clean. The by argument needs to be filled with the name(s) of the variables that the two datasets should be joined by.
Hint: To join by different variables in data1 and data2 you need to use by = join_by(a == b), which will match data1$a to data2$b.
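
The hint above can be sketched on toy data. The tables shops and labels and their columns are hypothetical, chosen only to show an anti_join() where the key columns have different names; join_by() assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data: one row of `shops` is a section label, not a shop
shops <- tibble(name  = c("Fruit", "Apple Hut", "Pear Place"),
                sales = c(NA, 10, 12))
labels <- tibble(section = c("Fruit", "Vegetables"))

# Keep only the rows of `shops` whose name has no match in labels$section
shops_clean <- anti_join(shops, labels, by = join_by(name == section))

shops_clean$name
```

The first dataset passed to anti_join() is the one whose rows are kept, so the order of the two datasets matters.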

Canvas Question #1

5. Complete the code below to figure out which four regions were NOT removed from the military_clean data set.
Hint: the regions that were not removed have missing values for every column except Country.
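
Before filling in the exercise code, it may help to see if_all() on a toy table. This is only a sketch: the scores data and its columns are hypothetical, and the column selection you need for the military data may look different.

```r
library(dplyr)

# Hypothetical toy data: the "Group A" row holds no measurements
scores <- tibble(label = c("Group A", "x", "y"),
                 t1    = c(NA, 3, NA),
                 t2    = c(NA, 5, 7))

# Keep rows where every column except `label` is missing
empty_rows <- scores |>
  filter(if_all(.cols = -label, .fns = is.na))

empty_rows$label
```

Row "y" is not kept because only one of its two measurement columns is missing; if_all() requires the condition to hold in every selected column.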

military_clean |> 
  filter(if_all(.cols = , 
                .fns = )
         )

Data Organization

We are interested in comparing the military expenditures of countries in Eastern Europe. Our desired plot looks something like this:

Desired plot: Countries from Central Asia used for demonstration – your plot will have different countries and spending values.

Unfortunately, if we want a point representing the spending for every country and year, we need the years to be stored in a single column rather than one column per year!

To tidy a dataset like this, we need to pivot the columns of years from wide format to long format. To do this process we need three arguments:

cols: The set of columns that represent values, not variables. In these data, those are all the columns from 1988 to 2019.
names_to: The name of the new variable that will hold the names of these columns. In these data, this could be "Year".
values_to: The name of the new variable that will hold the values from these columns. In these data, this could be labeled "Spending".

These are the three arguments we will supply to the pivot_longer() function.
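
A minimal sketch of these arguments on a toy table follows. The wide data and its column names are hypothetical; only the pattern carries over to the military data.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per year
wide <- tibble(city   = c("P", "Q"),
               `2001` = c(1.0, 2.0),
               `2002` = c(1.5, 2.5))

# Stack the year columns into one Year column and one Value column
long <- wide |>
  pivot_longer(cols      = `2001`:`2002`,
               names_to  = "Year",
               values_to = "Value")

dim(long)
```

Column names that start with a number need backticks when you refer to them, and the new Year column is stored as text because its values come from column names.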

6. Pivot the cleaned-up military data set to a “longer” orientation. Save this new “long” version as a new object called military_long.
Hint: Do not overwrite your cleaned up dataset!

Data Visualization

Now that we’ve transformed the data, let’s create a plot to explore military spending across Eastern European countries.

7. Create side-by-side boxplots to compare military spending across Eastern European countries.

Hint 1: Place the Country variable on an axis that makes it easier to read the labels!

Hint 2: Make sure you change the plot title and axis labels to accurately represent the plot.

# Countries to include in the plot!
eastern_europe <- c("Armenia", 
                    "Azerbaijan",
                    "Belarus", 
                    "Georgia", 
                    "Moldova", 
                    "Russia", 
                    "Ukraine")
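
One way to use a vector like this is with %in% inside filter(), followed by a boxplot with the categorical variable on the y-axis. The sketch below uses hypothetical toy data and made-up names (toy_long, keep), not the military data, so your variables and labels will differ.

```r
library(dplyr)
library(ggplot2)

# Hypothetical long-format data, not values from the SIPRI workbook
toy_long <- tibble(Group = rep(c("A", "B", "C"), each = 4),
                   Value = c(1, 2, 3, 4, 2, 4, 6, 8, 5, 5, 6, 7))

# Vector of groups to keep, used with %in% inside filter()
keep <- c("A", "B")

toy_plot <- toy_long |>
  filter(Group %in% keep) |>
  ggplot(mapping = aes(x = Value, y = Group)) +
  geom_boxplot() +
  labs(title = "Toy values by group", x = "Value", y = NULL)

toy_plot
```

Putting the categories on the y-axis keeps long labels horizontal, which is usually easier to read than rotated x-axis text.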

Canvas Questions #2 & #3

8. Looking at the plot you created above, which Eastern European country had the second highest median military expenditure?

9. Looking at the plot you created above, which Eastern European country had the largest variability in military expenditures over time?
