DACSS 601: Data Science Fundamentals - FALL 2022
  • Fall 2022 Posts
  • Contributors
  • DACSS

Homework 2

  • Course information
    • Overview
    • Instructional Team
    • Course Schedule
  • Weekly materials
    • Fall 2022 posts
    • final posts

On this page

  • Read in Data
  • Clean Data
  • Narrative
  • Potential Research Questions

Homework 2

  • Show All Code
  • Hide All Code

  • View Source
hw2
Aleacia Messiah
SIPRI_Milex
tidyverse
readxl
Read, Clean, Describe, and Formulate Research Questions from Data
Author

Aleacia Messiah

Published

October 13, 2022

Code
library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE)

Read in Data

Code
# read in the Current USD sheet of the Milex dataset, skipping the first 6 rows and renaming the columns
milex <- read_excel("_data/SIPRI-Milex-data-1949-2021.xlsx", sheet = "Current USD", skip = 6, col_names = c("Country", "Notes", 1949:2021))
# view milex data
milex
# A tibble: 191 × 75
   Country  Notes `1949` `1950` `1951` `1952` `1953` `1954` `1955` `1956` `1957`
   <chr>    <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 Africa   <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 2 North A… <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 3 Algeria  §4    xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx   
 4 Libya    द16 xxx    xxx    ...    ...    ...    ...    ...    ...    ...   
 5 Morocco  §17   xxx    xxx    xxx    xxx    xxx    xxx    xxx    23.71… 35.40…
 6 Tunisia  <NA>  xxx    xxx    xxx    xxx    xxx    xxx    xxx    3.714… 6.411…
 7 sub-Sah… <NA>  <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>   <NA>  
 8 Angola   §‖1   xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx   
 9 Benin    §     xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx   
10 Botswana <NA>  xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx    xxx   
# … with 181 more rows, and 64 more variables: `1958` <chr>, `1959` <chr>,
#   `1960` <chr>, `1961` <chr>, `1962` <chr>, `1963` <chr>, `1964` <chr>,
#   `1965` <chr>, `1966` <chr>, `1967` <chr>, `1968` <chr>, `1969` <chr>,
#   `1970` <chr>, `1971` <chr>, `1972` <chr>, `1973` <chr>, `1974` <chr>,
#   `1975` <chr>, `1976` <chr>, `1977` <chr>, `1978` <chr>, `1979` <chr>,
#   `1980` <chr>, `1981` <chr>, `1982` <chr>, `1983` <chr>, `1984` <chr>,
#   `1985` <chr>, `1986` <chr>, `1987` <chr>, `1988` <chr>, `1989` <chr>, …

Clean Data

Code
milex_clean <- milex %>% 
  # remove Notes column
  select(!contains("Notes")) %>% 
  # create a column with region names
  mutate(Region = case_when(
    str_ends(Country, "Africa") ~ "Africa",
      str_detect(Country, "America") ~ "Americas",
      str_detect(Country, "Asia & Oceania") ~ "Asia & Oceania",
      str_ends(Country, "Europe") ~ "Europe",
      str_ends(Country, "East") ~ "Middle East",
    TRUE ~ NA_character_
  )) %>% 
  # replace NAs in Region column
  fill(Region, .direction = "down") %>% 
  # create a column with sub region names
  mutate(Sub_Region = case_when(
    str_detect(Country, "North Africa") ~ "North Africa",
    str_detect(Country, "sub-Saharan Africa") ~ "sub-Saharan Africa",
    str_detect(Country, "Central America and the Caribbean") ~ "Central America and the Caribbean",
    str_detect(Country, "North America") ~ "North America",
    str_detect(Country, "South America") ~ "South America",
    str_detect(Country, "Oceania") ~ "Oceania",
    str_detect(Country, "South Asia") ~ "South Asia",
    str_detect(Country, "East Asia") ~ "East Asia",
    str_detect(Country, "South East Asia") ~ "South East Asia",
    str_detect(Country, "Central Asia") ~ "Central Asia",
    str_detect(Country, "Central Europe") ~ "Central Europe",
    str_detect(Country, "Eastern Europe") ~ "Eastern Europe",
    str_detect(Country, "Western Europe") ~ "Western Europe",
    str_ends(Country, "East") ~ "Middle East",
    TRUE ~ NA_character_
  )) %>% 
  # replace NAs in Sub Region column
  fill(Sub_Region, .direction = "down") %>% 
  # remove rows with region and sub-region names
  slice(-c(1:2, 7, 55:56, 70, 73, 85:86, 91, 98, 105, 117, 123:124, 145, 154, 175)) %>% 
  # replace "xxx" and "..." with NAs
  na_if("xxx") %>%
  na_if("...") %>% 
  # combine Region and Sub Region into one column
  unite(Region:Sub_Region, col = "Region & Sub Region", sep = "/" ) %>% 
  # combine year columns into one column
  pivot_longer('1949':'2021', names_to = "Year", values_to = "Military Expenditure in Millions of US$")
# view cleaned dataset
milex_clean
# A tibble: 12,629 × 4
   Country `Region & Sub Region` Year  `Military Expenditure in Millions of US$`
   <chr>   <chr>                 <chr> <chr>                                    
 1 Algeria Africa/North Africa   1949  <NA>                                     
 2 Algeria Africa/North Africa   1950  <NA>                                     
 3 Algeria Africa/North Africa   1951  <NA>                                     
 4 Algeria Africa/North Africa   1952  <NA>                                     
 5 Algeria Africa/North Africa   1953  <NA>                                     
 6 Algeria Africa/North Africa   1954  <NA>                                     
 7 Algeria Africa/North Africa   1955  <NA>                                     
 8 Algeria Africa/North Africa   1956  <NA>                                     
 9 Algeria Africa/North Africa   1957  <NA>                                     
10 Algeria Africa/North Africa   1958  <NA>                                     
# … with 12,619 more rows
Code
# number of distinct countries
n_distinct(milex_clean$Country)
[1] 173
Code
# list distinct countries
distinct(milex_clean[,1])
# A tibble: 173 × 1
   Country     
   <chr>       
 1 Algeria     
 2 Libya       
 3 Morocco     
 4 Tunisia     
 5 Angola      
 6 Benin       
 7 Botswana    
 8 Burkina Faso
 9 Burundi     
10 Cameroon    
# … with 163 more rows
Code
# list distinct region and sub-region combinations
distinct(milex_clean[,2])
# A tibble: 13 × 1
   `Region & Sub Region`                     
   <chr>                                     
 1 Africa/North Africa                       
 2 Africa/sub-Saharan Africa                 
 3 Americas/Central America and the Caribbean
 4 Americas/North America                    
 5 Americas/South America                    
 6 Asia & Oceania/Oceania                    
 7 Asia & Oceania/South Asia                 
 8 Asia & Oceania/East Asia                  
 9 Asia & Oceania/Central Asia               
10 Europe/Central Europe                     
11 Europe/Eastern Europe                     
12 Europe/Western Europe                     
13 Middle East/Middle East                   
Code
# number of years
n_distinct(milex_clean$Year)
[1] 73
Code
# range of years
range(milex_clean$Year)
[1] "1949" "2021"

Narrative

The Current USD data from the SIPRI Milex dataset details the military expenditures in millions of US dollars for 173 countries for the period 1949-2021. There are 191 observations (173 actual observations) and 75 variables (Country, Notes, and Years 1949 through 2021). Observations (countries) were grouped by region and sub-region. There are 5 regions (Africa, Americas, Asia & Oceania, Europe, and Middle East) and 14 sub-regions (North Africa, Sub-Saharan Africa, Central America and the Caribbean, North America, South America, Oceania, South Asia, East Asia, South East Asia, Central Asia, Central Europe, Eastern Europe, Western Europe, and Middle East). Some of the data is missing for several countries in various years due to the data not being available or the country did not exist or was not independent at the time the data was collected, so those values are replaced with NAs.

After cleaning the data, the new dataset has 12,629 observations and 4 variables (Country, Region & Sub Region, Year, Military Expenditure in Millions of US$). The regions and sub-regions were extracted from the Country column and combined together into one column to illustrate the various combinations of region and sub-region. The years were also condensed into one column.

Potential Research Questions

Since this data was taken over a large period of time, I can research whether using time series methods would be appropriate to visualize and analyze the trends in the data.

1. What trends are present in the data (e.g. are there certain patterns of military spending over the years)?

2. What are the top spending countries for each year?

3. What is the predicted amount of military expenditures in dollars for year 2022 for each country?

Source Code
---
title: "Homework 2"
author: "Aleacia Messiah"
description: "Read, Clean, Describe, and Formulate Research Questions from Data"
date: "10/13/2022"
format:
  html:
    toc: true
    code-fold: true
    code-copy: true
    code-tools: true
categories:
  - hw2
  - Aleacia Messiah
  - SIPRI_Milex
  - tidyverse
  - readxl
---

```{r}
#| label: setup
#| warning: false

library(tidyverse)
library(readxl)

knitr::opts_chunk$set(echo = TRUE)
```

## Read in Data

```{r}
# read in the Current USD sheet of the Milex dataset, skipping the first 6 rows and renaming the columns
milex <- read_excel("_data/SIPRI-Milex-data-1949-2021.xlsx", sheet = "Current USD", skip = 6, col_names = c("Country", "Notes", 1949:2021))
# view milex data
milex
```

## Clean Data

```{r}
milex_clean <- milex %>% 
  # remove Notes column
  select(!contains("Notes")) %>% 
  # create a column with region names
  mutate(Region = case_when(
    str_ends(Country, "Africa") ~ "Africa",
      str_detect(Country, "America") ~ "Americas",
      str_detect(Country, "Asia & Oceania") ~ "Asia & Oceania",
      str_ends(Country, "Europe") ~ "Europe",
      str_ends(Country, "East") ~ "Middle East",
    TRUE ~ NA_character_
  )) %>% 
  # replace NAs in Region column
  fill(Region, .direction = "down") %>% 
  # create a column with sub region names
  mutate(Sub_Region = case_when(
    str_detect(Country, "North Africa") ~ "North Africa",
    str_detect(Country, "sub-Saharan Africa") ~ "sub-Saharan Africa",
    str_detect(Country, "Central America and the Caribbean") ~ "Central America and the Caribbean",
    str_detect(Country, "North America") ~ "North America",
    str_detect(Country, "South America") ~ "South America",
    str_detect(Country, "Oceania") ~ "Oceania",
    str_detect(Country, "South Asia") ~ "South Asia",
    str_detect(Country, "East Asia") ~ "East Asia",
    str_detect(Country, "South East Asia") ~ "South East Asia",
    str_detect(Country, "Central Asia") ~ "Central Asia",
    str_detect(Country, "Central Europe") ~ "Central Europe",
    str_detect(Country, "Eastern Europe") ~ "Eastern Europe",
    str_detect(Country, "Western Europe") ~ "Western Europe",
    str_ends(Country, "East") ~ "Middle East",
    TRUE ~ NA_character_
  )) %>% 
  # replace NAs in Sub Region column
  fill(Sub_Region, .direction = "down") %>% 
  # remove rows with region and sub-region names
  slice(-c(1:2, 7, 55:56, 70, 73, 85:86, 91, 98, 105, 117, 123:124, 145, 154, 175)) %>% 
  # replace "xxx" and "..." with NAs
  na_if("xxx") %>%
  na_if("...") %>% 
  # combine Region and Sub Region into one column
  unite(Region:Sub_Region, col = "Region & Sub Region", sep = "/" ) %>% 
  # combine year columns into one column
  pivot_longer('1949':'2021', names_to = "Year", values_to = "Military Expenditure in Millions of US$")
# view cleaned dataset
milex_clean
# number of distinct countries
n_distinct(milex_clean$Country)
# list distinct countries
distinct(milex_clean[,1])
# list distinct region and sub-region combinations
distinct(milex_clean[,2])
# number of years
n_distinct(milex_clean$Year)
# range of years
range(milex_clean$Year)
```

## Narrative

The Current USD data from the SIPRI Milex dataset details the military expenditures in millions of US dollars for 173 countries for the period 1949-2021. There are 191 observations (173 actual observations) and 75 variables (`Country`, `Notes`, and Years `1949` through `2021`). Observations (countries) were grouped by region and sub-region. There are 5 regions (Africa, Americas, Asia & Oceania, Europe, and Middle East) and 14 sub-regions (North Africa, Sub-Saharan Africa, Central America and the Caribbean, North America, South America, Oceania, South Asia, East Asia, South East Asia, Central Asia, Central Europe, Eastern Europe, Western Europe, and Middle East). Some of the data is missing for several countries in various years due to the data not being available or the country did not exist or was not independent at the time the data was collected, so those values are replaced with NAs.

After cleaning the data, the new dataset has 12,629 observations and 4 variables (`Country`, `Region & Sub Region`, `Year`, `Military Expenditure in Millions of US$`). The regions and sub-regions were extracted from the `Country` column and combined together into one column to illustrate the various combinations of region and sub-region. The years were also condensed into one column.

## Potential Research Questions

Since this data was taken over a large period of time, I can research whether using time series methods would be appropriate to visualize and analyze the trends in the data.

**1. What trends are present in the data (e.g. are there certain patterns of military spending over the years)?**

**2. What are the top spending countries for each year?**

**3. What is the predicted amount of military expenditures in dollars for year 2022 for each country?**