library(tidyverse)
::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) knitr
Final Project Assignment#1: Neha Jhurani
Part 1. Introduction
In this part, you should introduce the dataset(s) and your research questions.
- Dataset(s) Introduction: Causes of Death - Our World in Data
The Lancet published a significant global study on the causes of disease and mortality called The Global Burden of Disease.
The datasaet describes the annual number of deaths and it’s cause in the world from the year 1990 to 2019. The data was collected from https://ourworldindata.org/causes-of-death and was uploaded in Kaggle.
The data represents the annual number of people that died of a particular disease or cause in different years. This dataset will help us analyse and answer various questions as described below.
- What questions do you like to answer with this dataset(s)?
For our analysis, we will be considering South Asian countries (India, Pakistan, Afghanistan, Bangladesh, Bhutan, Maldives, Sri Lanka, Nepal) and United States.
What are the total number of deaths in all South Asian countries?
Which countries in South Asia have the lowest number of deaths?
Which countries in South Asia have the highest number of deaths?
What is most common cause of death in South Asia in recent years?
What has been the trend of few of the diseases over the past years in South Asian countries?
What is the total number of deaths in United States?
What is the most common cause of death in United States?
What is the trend of the causes of death in recent years in United States?
Which year had the most number of deaths in United States?
What is the relationship between alcohol use disorders and suicides (mental health) in United States ?
Part 2. Describe the data set(s)
This part contains both a coding and a storytelling component.
In the coding component, you should:
read the dataset;
(optional) If you have multiple dataset(s) you want to work with, you should combine these datasets at this step.
(optional) If your dataset is too big (for example, it contains too many variables/columns that may not be useful for your analysis), you may want to subset the data just to include the necessary variables/columns.
library(readr)
<- read_csv("NehaJhurani_FinalProjectData/CausesOfDeath.csv")
death_dataset
<- death_dataset[death_dataset$Entity== (alist('India','Pakistan','Afghanistan','Bangladesh','Bhutan', 'Maldives', 'Sri Lanka', 'Nepal' )),]
south_asian_death_dataset view(south_asian_death_dataset)
<- death_dataset[death_dataset$Entity== 'United States',]
us_death_dataset
view(us_death_dataset)
present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;
- for examples: dim(), length(unique()), head();
# Dimensions of South Asian countries death dataset
dim(south_asian_death_dataset)
[1] 991 6
# The number of countries considered as South Asian
length(unique(south_asian_death_dataset$Entity))
[1] 8
# Viewing the first few rows of South Asian dataset
head(south_asian_death_dataset)
# Dimensions of United States death dataset
dim(us_death_dataset)
[1] 990 6
# The number of diseases or causes of death in United States
length(unique(us_death_dataset$'Causes name'))
[1] 33
# Viewing the first few rows of United States dataset
head(us_death_dataset)
- conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.
# Displaying summary statistics for both the datasets
summary(south_asian_death_dataset)
Causes name Causes Full Description Death Numbers
Length:991 Length:991 Min. : 0
Class :character Class :character 1st Qu.: 101
Mode :character Mode :character Median : 1728
Mean : 39066
3rd Qu.: 12677
Max. :1942316
NA's :34
Entity Code Year
Length:991 Length:991 Min. :1990
Class :character Class :character 1st Qu.:1997
Mode :character Mode :character Median :2004
Mean :2004
3rd Qu.:2012
Max. :2019
summary(us_death_dataset)
Causes name Causes Full Description Death Numbers Entity
Length:990 Length:990 Min. : 0 Length:990
Class :character Class :character 1st Qu.: 1250 Class :character
Mode :character Mode :character Median : 10895 Mode :character
Mean : 73632
3rd Qu.: 53069
Max. :957455
NA's :23
Code Year
Length:990 Min. :1990
Class :character 1st Qu.:1997
Mode :character Median :2004
Mean :2004
3rd Qu.:2012
Max. :2019
Story Telling Component:
We have broken down the dataset into two sub dataaset. One represents the data related to all South Asian Countries and the other represents the data related to United States.
The dimensions of South Asian dataset is 991, 6 and the dimensions of United States dataset is 990, 6. The 6 columns in the dataset represent the following:
Causes name - This represents the cause of death. This can be a disease or a natural phenomena or a legal action
Causes Full Description - This represents if the death is in all age groups or in a particular set of age groups, and weather it is in all the genders or in specific gender
Death Numbers - This represents the annual number of deaths for specific value of cause, year and country
Entity - This represents the name of the country in which the deaths occurred.
Code - This represents the ISO Code of the country in which deaths occurred
Year - This represents the year in which the deaths occurred.
Each row of the dataset represents the annual number of deaths that happened because of a certain cause in a specific country during a specific year.
3. The Tentative Plan for Visualization
Briefly describe what data analyses and visualizations you plan to conduct to answer the research questions you proposed above.
Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions?
I am going to create a bivariate graph between year and deaths to know what are the annual number of deaths in South Asian countries for a specific causes of death and for all causes of deaths combined. This will help us understand if the specific cause of death has increased or decreased the number of deaths over the years for each country. This will also help us understand if the number of deaths have increased or decreased over the years for each country.
I am going to create a histogram representing the cause of deaths in United States. This will help in understanding which is the most common cause of death.
I am going to create a bivariate graph between year and number of deaths for specific cause of deaths in United States. This will help us in understanding if the number of deaths have increased or decreased over the years because of the specific cause.
If you plan to conduct specific data analyses and visualizations, describe how do you need to process and prepare the tidy data.
- Rename ‘Entity’ to ‘Country’ and ‘Causes name’ to ‘Cause’ for better understandability
- Many ISO Codes are missing. As this is same as Country Name that is represented by Entity (which is non null), we will drop Code column.
- Death Number are also missing in a lot of rows, we will drop those rows instead of making them 0 because ‘there are no deaths for that particular cause’ or ‘that particular cause of death is non existant’ describe the same case.
- Dropping ‘Causes Full Description’ because that column is currently not used in our analysis
(Optional) It is encouraged, but optional, to include a coding component of tidy data in this part.
<- south_asian_death_dataset %>%
south_asian_death_dataset select(-c(`Causes Full Description`, `Code`)) %>%
rename(Country = `Entity`,
Cause = 'Causes name') %>%
as.data.frame()
head(south_asian_death_dataset)
<- us_death_dataset %>%
us_death_dataset select(-c(`Causes Full Description`, `Code`)) %>%
rename(Country = `Entity`,
Cause = 'Causes name') %>%
as.data.frame()
head(us_death_dataset)