library(tidyverse)
library(ggplot2)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Challenge 8 Instructions
Challenge Overview
Today’s challenge is to:
- read in multiple data sets, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- join two or more data sets and analyze some aspect of the joined data (see detailed instruction below)
(be sure to only include the category tags for the data you use!)
Read in data
Read in one (or more) of the following datasets, using the correct R package and command.
- military marriages (ActiveDuty_MaritalStatus.xls)⭐⭐
- faostat (birds.csv,FAOSTAT_country_groups.csv, and other “FAOSTAT_*” files) ⭐⭐⭐⭐
- snl (snl_actors.csv, snl_casts.csv, snl_seasons.csv)⭐⭐⭐
Briefly describe the data
Military Marriage
The excel workbook “ActiveDuty_MaritalStatus.xls” contains tabulations of total number of active duty military members classified by marital and family status, for each of the branches of military as well as the military overall. It has five sheets: “TotalDOD”. “AirForce”, “MarainCorps”, “Navy”, “Army”. In each sheet, we can see there are three different pieces of information of marital status: marriage (single or married), children (with or without kids), and types of marriage (joing service or civilian). It’s your choice.
Here are the major tasks:
You should think about how to present the marital information efficiently and clearly, such as by separating them in to three columns, or combining any two of them.
You can start with a single sheet, then iterate this data processing and transformation to other sheets. You will find materials in Session 12 of google classroom helpful.
Merge these four sheets of data to one dataframe which including the following columns: payGrade, payLevel, branch, marital status, gender, and count.
(Optional/Advanced Challenge) We can write a function to read multiple sheets on the same workbook.Then we can join them together using
map_dfr()
, one of thepurrr
functions, to join the four elements of the list into a single dataframe. map_dir() can turn all of the list elements created by map into a single dataframe(df) that is joined by rows(r).
FAO
The FAOSTAT sheets are excerpts of the FAOSTAT database provided by the Food and Agriculture Association, an agency of the United Nations. There are two approaches we could use to joining these data. depends on how we would like to present the information.
Approach 1 (FAOstat: Regions): join livestock estimates from a single file (birds.csv) to regional codes for the countries (FAOSTAT_country_groups). You will find the materials of joining data in Session 9 helpful.
Approach 2 (FASOstat: Other Lifestocks): join the bird.csv with other lifestock datasets (FAOSTATA_cattle_dairy.csv, FAOSTAT_egg_chicken.csv, and FAOSTATA_lifestock.csv).
SNL
These data came to my attention courtesy of Jeremy Singer-Vine’s wonderful Data is Plural newsletter. These datasets, archived by Joel Navaroli and scraped by Hendrik Hilleckes and Colin Morris, contain data about the actors, cast, seasons, etc. from every season of Saturday Night Live from its inception through 2020.
With these three datasets, we can join them together to observe many different features, such as gender makeup and number of appearance, of the SNL casts and actors changed over time.
Tidy Data (as needed)
Is your data already tidy, or is there work to be done? Be sure to anticipate your end result to provide a sanity check, and document your work here.
Are there any variables that require mutation to be usable in your analysis stream? For example, do you need to calculate new values in order to graph them? Can string values be represented numerically? Do you need to turn any variables into factors and reorder for ease of graphics and visualization?
Document your work here.
Join Data
Be sure to include a sanity check, and double-check that case count is correct!