Final Project Assignment#1: Neha Jhurani

Project & Data Description
Author

Neha Jhurani

Published

May 22, 2023

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

In this part, you should introduce the dataset(s) and your research questions.

  1. Dataset(s) Introduction: Causes of Death - Our World in Data

The Lancet published a significant global study on the causes of disease and mortality called The Global Burden of Disease.

The dataset describes the annual number of deaths and their causes worldwide from 1990 to 2019. The data was collected from https://ourworldindata.org/causes-of-death and uploaded to Kaggle.

The data represents the annual number of people that died of a particular disease or cause in different years. This dataset will help us analyse and answer various questions as described below.

  2. What questions do you like to answer with this dataset(s)?

For our analysis, we will consider the South Asian countries (India, Pakistan, Afghanistan, Bangladesh, Bhutan, Maldives, Sri Lanka, Nepal) and the United States.

  • What is the total number of deaths across all South Asian countries?

  • Which countries in South Asia have the lowest number of deaths?

  • Which countries in South Asia have the highest number of deaths?

  • What is the most common cause of death in South Asia in recent years?

  • How have deaths from a few selected diseases trended over the past years in South Asian countries?

  • What is the total number of deaths in the United States?

  • What is the most common cause of death in the United States?

  • What is the trend in the causes of death in the United States in recent years?

  • Which year had the highest number of deaths in the United States?

  • What is the relationship between alcohol use disorders and suicides (mental health) in the United States?
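Several of these questions (total deaths, lowest and highest countries) reduce to a grouped sum followed by a ranking. The following is a minimal sketch on made-up rows, assuming the dataset's `Entity` and `Death Numbers` column names; the values and the names `toy`, `country_totals`, `highest`, and `lowest` are hypothetical. In the real data, rows for overlapping causes or age groups could be double counted, so the totals would need checking against `Causes Full Description`.

```r
library(dplyr)

# Hypothetical rows (values are made up, not taken from the dataset)
toy <- data.frame(
  Entity = c("India", "India", "Nepal", "Nepal", "Bhutan"),
  `Death Numbers` = c(500, 300, 40, NA, 5),
  check.names = FALSE
)

# Total deaths per country, ignoring missing counts, largest first
country_totals <- toy %>%
  group_by(Entity) %>%
  summarise(total_deaths = sum(`Death Numbers`, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_deaths))

# Countries with the highest and lowest totals
highest <- slice_max(country_totals, total_deaths, n = 1)
lowest  <- slice_min(country_totals, total_deaths, n = 1)
```

The same pattern, grouped by cause or by year instead of by country, would address the "most common cause" and "which year" questions.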

Part 2. Describe the data set(s)

This part contains both a coding and a storytelling component.

In the coding component, you should:

  1. read the dataset;

    • (optional) If you have multiple dataset(s) you want to work with, you should combine these datasets at this step.

    • (optional) If your dataset is too big (for example, it contains too many variables/columns that may not be useful for your analysis), you may want to subset the data just to include the necessary variables/columns.

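library(readr)
# Read the causes-of-death data (one row per cause, country, and year)
death_dataset <- read_csv("NehaJhurani_FinalProjectData/CausesOfDeath.csv")

# South Asian countries included in the analysis
south_asian_countries <- c('India', 'Pakistan', 'Afghanistan', 'Bangladesh',
                           'Bhutan', 'Maldives', 'Sri Lanka', 'Nepal')

# %in% tests membership; == would recycle the vector element-wise and drop matching rows
south_asian_death_dataset <- death_dataset[death_dataset$Entity %in% south_asian_countries, ]
view(south_asian_death_dataset)

us_death_dataset <- death_dataset[death_dataset$Entity == 'United States', ]

view(us_death_dataset)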
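When subsetting by several country names at once, the choice of comparison operator matters. The following minimal sketch uses a small made-up data frame (the `toy` values and the `wanted` vector are hypothetical, not taken from the dataset) to show that `==` against a vector of names is recycled element-wise, whereas `%in%` tests membership for every row:

```r
# Hypothetical toy data: 3 countries x 2 years (values are made up)
toy <- data.frame(
  Entity = rep(c("India", "Nepal", "France"), each = 2),
  Year   = rep(c(1990, 1991), times = 3),
  stringsAsFactors = FALSE
)
wanted <- c("India", "Nepal")

# `==` recycles `wanted` along the column:
# row 1 is compared with "India", row 2 with "Nepal", row 3 with "India", ...
rows_eq <- toy[toy$Entity == wanted, ]

# `%in%` asks, for each row, whether Entity is any of the wanted countries
rows_in <- toy[toy$Entity %in% wanted, ]

nrow(rows_eq)  # only the rows that happen to line up with the recycled vector
nrow(rows_in)  # every row for India and Nepal
```

With `==`, a row is kept only when its country happens to align with the recycled position, so the result is usually a small fraction of the rows that actually belong to the listed countries.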
  2. present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;

    • for examples: dim(), length(unique()), head();
# Dimensions of South Asian countries death dataset
dim(south_asian_death_dataset)
[1] 991   6
# The number of countries considered as South Asian
length(unique(south_asian_death_dataset$Entity))
[1] 8
# Viewing the first few rows of South Asian dataset
head(south_asian_death_dataset)
# Dimensions of United States death dataset
dim(us_death_dataset)
[1] 990   6
# The number of diseases or causes of death in United States
length(unique(us_death_dataset$`Causes name`))
[1] 33
# Viewing the first few rows of United States dataset
head(us_death_dataset)
  3. conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.
# Displaying summary statistics for both the datasets

summary(south_asian_death_dataset)
 Causes name        Causes Full Description Death Numbers    
 Length:991         Length:991              Min.   :      0  
 Class :character   Class :character        1st Qu.:    101  
 Mode  :character   Mode  :character        Median :   1728  
                                            Mean   :  39066  
                                            3rd Qu.:  12677  
                                            Max.   :1942316  
                                            NA's   :34       
    Entity              Code                Year     
 Length:991         Length:991         Min.   :1990  
 Class :character   Class :character   1st Qu.:1997  
 Mode  :character   Mode  :character   Median :2004  
                                       Mean   :2004  
                                       3rd Qu.:2012  
                                       Max.   :2019  
                                                     
summary(us_death_dataset)
 Causes name        Causes Full Description Death Numbers       Entity         
 Length:990         Length:990              Min.   :     0   Length:990        
 Class :character   Class :character        1st Qu.:  1250   Class :character  
 Mode  :character   Mode  :character        Median : 10895   Mode  :character  
                                            Mean   : 73632                     
                                            3rd Qu.: 53069                     
                                            Max.   :957455                     
                                            NA's   :23                         
     Code                Year     
 Length:990         Min.   :1990  
 Class :character   1st Qu.:1997  
 Mode  :character   Median :2004  
                    Mean   :2004  
                    3rd Qu.:2012  
                    Max.   :2019  
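
`summary()` pools all causes together, so per-cause statistics need a grouped summary. The following is a minimal sketch on made-up rows, assuming the dataset's `Causes name` and `Death Numbers` column names; the values and the output column names (`n_missing`, `min_deaths`, and so on) are hypothetical.

```r
library(dplyr)

# Hypothetical rows (values are made up, not taken from the dataset)
toy <- data.frame(
  `Causes name`   = c("A", "A", "B", "B", "B"),
  `Death Numbers` = c(10, 30, NA, 5, 15),
  check.names = FALSE
)

# Basic statistics of the death counts for each cause, skipping missing values
cause_stats <- toy %>%
  group_by(`Causes name`) %>%
  summarise(
    n_missing     = sum(is.na(`Death Numbers`)),
    min_deaths    = min(`Death Numbers`, na.rm = TRUE),
    median_deaths = median(`Death Numbers`, na.rm = TRUE),
    mean_deaths   = mean(`Death Numbers`, na.rm = TRUE),
    max_deaths    = max(`Death Numbers`, na.rm = TRUE),
    .groups = "drop"
  )
cause_stats
```

Counting the missing values per group alongside the statistics makes it visible which causes are affected by the `NA's` reported by `summary()`.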
                                  

Storytelling Component:

We have split the dataset into two subsets. One contains the data for the South Asian countries and the other contains the data for the United States.

The South Asian dataset has 991 rows and 6 columns, and the United States dataset has 990 rows and 6 columns. The 6 columns represent the following:

Causes name - The cause of death. This can be a disease, a natural phenomenon, or a legal action.

Causes Full Description - Whether the deaths are counted across all age groups or a particular set of age groups, and whether they cover all genders or a specific gender.

Death Numbers - The annual number of deaths for a specific cause, year, and country.

Entity - The name of the country in which the deaths occurred.

Code - The ISO code of the country in which the deaths occurred.

Year - The year in which the deaths occurred.

Each row of the dataset represents the annual number of deaths that happened because of a certain cause in a specific country during a specific year.

Part 3. The Tentative Plan for Visualization

  1. Briefly describe what data analyses and visualizations you plan to conduct to answer the research questions you proposed above.

  2. Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions?

  • I am going to create a bivariate graph of year against deaths to show the annual number of deaths in South Asian countries, both for specific causes of death and for all causes combined. This will help us understand whether deaths from a specific cause, and deaths overall, have increased or decreased over the years in each country.

  • I am going to create a bar chart showing the number of deaths for each cause in the United States. This will help identify the most common cause of death.

  • I am going to create a bivariate graph of year against number of deaths for specific causes of death in the United States. This will help us understand whether the number of deaths from each specific cause has increased or decreased over the years.
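
The plots described above could be sketched with ggplot2 as follows. This is only an illustration on made-up data: the `toy` values are hypothetical, and the column names `Country`, `Cause`, and `Deaths` assume the renaming planned in the tidying step (`Deaths` is an assumed short name for `Death Numbers`). Because cause is a categorical variable, the per-cause comparison is drawn with `geom_col()`, one bar per cause.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tidy data (values are made up, not taken from the dataset)
toy <- data.frame(
  Country = rep(c("India", "Nepal"), each = 4),
  Cause   = rep(c("Cause A", "Cause B"), times = 4),
  Year    = rep(c(2018, 2018, 2019, 2019), times = 2),
  Deaths  = c(100, 40, 120, 35, 10, 8, 12, 9)
)

# Year vs. deaths for one cause, with one line per country
trend_plot <- toy %>%
  filter(Cause == "Cause A") %>%
  ggplot(aes(x = Year, y = Deaths, colour = Country)) +
  geom_line() +
  geom_point()

# Total deaths per cause, drawn as horizontal bars ordered by size
cause_totals <- toy %>%
  group_by(Cause) %>%
  summarise(Total = sum(Deaths, na.rm = TRUE), .groups = "drop")

bar_plot <- ggplot(cause_totals, aes(x = reorder(Cause, Total), y = Total)) +
  geom_col() +
  coord_flip() +
  labs(x = "Cause", y = "Total deaths")
```

Printing `trend_plot` or `bar_plot` renders the figure; with many causes, the horizontal orientation keeps the cause labels readable.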

  3. If you plan to conduct specific data analyses and visualizations, describe how you need to process and prepare the tidy data.

    • Rename ‘Entity’ to ‘Country’ and ‘Causes name’ to ‘Cause’ so the column names are easier to understand.
    • Many ISO codes are missing. Because the code carries the same information as the country name in Entity (which has no missing values), we will drop the Code column.
    • Death Numbers is also missing in many rows. We will drop those rows rather than set them to 0, because for our analysis ‘there were no deaths from that cause’ and ‘that cause of death does not exist’ amount to the same case.
    • Drop ‘Causes Full Description’, because that column is not currently used in our analysis.
  4. (Optional) It is encouraged, but optional, to include a coding component of tidy data in this part.

# Drop the unused columns and rename Entity and Causes name (South Asia)
south_asian_death_dataset <- south_asian_death_dataset %>%  
    select(-c(`Causes Full Description`, `Code`)) %>% 
    rename(Country = `Entity`,
           Cause = `Causes name`) %>%  
    as.data.frame()
head(south_asian_death_dataset)
# Apply the same steps to the United States subset
us_death_dataset <- us_death_dataset %>%  
    select(-c(`Causes Full Description`, `Code`)) %>% 
    rename(Country = `Entity`,
           Cause = `Causes name`) %>%  
    as.data.frame()
head(us_death_dataset)
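
The plan above also calls for dropping rows with a missing death count. One way to do that, sketched here on made-up rows under the assumption that the column is still named `Death Numbers`, is `tidyr::drop_na()` restricted to that column; the `toy` data and the name `toy_complete` are hypothetical.

```r
library(tidyr)

# Hypothetical rows (values are made up, not taken from the dataset)
toy <- data.frame(
  Cause = c("Cause A", "Cause B", "Cause C"),
  `Death Numbers` = c(12, NA, 0),
  Country = "India",
  Year = 2019,
  check.names = FALSE
)

# Drop only the rows where Death Numbers is missing; explicit zeros are kept
toy_complete <- drop_na(toy, `Death Numbers`)
nrow(toy_complete)
```

Naming the column in `drop_na()` avoids discarding rows that are missing a value only in some other column.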