library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Final Project: Abhinav Reddy Yadatha
Part 1. Introduction
In this part, you should introduce the dataset(s) and your research questions.
- Dataset(s) Introduction: Titanic Dataset
The Titanic dataset is a widely recognized and extensively used dataset for data analysis and machine learning purposes. It originates from the passenger manifest of the RMS Titanic, a British passenger liner that tragically sank during its inaugural voyage in April 1912. Numerous organizations and individuals have carefully compiled and organized the dataset for educational and research applications based on the real-life incident. Each row in the dataset corresponds to an individual passenger who was on board the Titanic during its ill-fated journey. The dataset comprises various attributes pertaining to the passengers, including their names, ages, genders, passenger classes, ticket fares, cabin numbers, ports of embarkation, survival status, and more. These attributes were gathered from passenger records, survivor interviews, and historical documents related to the disaster. The dataset encompasses both survivors and non-survivors, with the survival status column denoting whether a passenger survived (labeled as “1”) or did not survive (labeled as “0”) the sinking of the Titanic.
- What questions do you like to answer with this dataset(s)?
By analyzing this dataset, I would like to answer the following questions:
1. What was the overall survival rate of passengers aboard the Titanic?
2. How does the survival rate vary based on passenger gender?
3. Did the passenger class have an impact on the survival rate?
4. What were the most common ports of embarkation for the passengers?
5. How does the ticket fare correlate with the passenger class and survival?
6. What were the survival rates for passengers with family members aboard the Titanic versus those traveling alone?
7. Did the cabin location or deck level have any influence on survival?
8. Can we identify any patterns or relationships between variables that are indicative of survival?
9. How does the survival rate differ among different age groups or passenger classes?
10. What was the distribution of ages among the passengers, and did age play a role in survival?
Part 2. Describe the data set(s)
This part contains both a coding and a storytelling component.
In the coding component, you should:
- read the titanic dataset;
# Read the data and get an overview of it
data <- read_csv("AbhinavReddyYadatha_FinalProjectData/titanic.csv")
view(data)
- Descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;
::: {.cell}
# Checking the dimensions of the data
dim(data)
::: {.cell-output .cell-output-stdout}
[1] 891 12
:::
# Checking the number of unique values in the Age column
length(unique(data$Age))
::: {.cell-output .cell-output-stdout}
[1] 89
:::
# Printing the first few rows
head(data)
::: {.cell-output-display}
:::
:::
- conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.
# Displaying the summary of the titanic data
summary(data)
  PassengerId       Survived          Pclass          Name
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character
 Mean   :446.0   Mean   :0.3838   Mean   :2.309
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
 Max.   :891.0   Max.   :1.0000   Max.   :3.000

     Sex                 Age            SibSp           Parch
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000
                    NA's   :177
    Ticket               Fare           Cabin             Embarked
 Length:891         Min.   :  0.00   Length:891         Length:891
 Class :character   1st Qu.:  7.91   Class :character   Class :character
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character
                    Mean   : 32.20
                    3rd Qu.: 31.00
                    Max.   :512.33
Storytelling:
This dataset contains 891 rows and 12 columns. Each of the 12 fields is described below:
PassengerId: A unique identifier assigned to each passenger.
Survived: Indicates whether the passenger survived the sinking or not (0 = No, 1 = Yes).
Pclass: Represents the passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
Name: The name of the passenger.
Sex: The gender of the passenger (male or female).
Age: The age of the passenger in years (some values may be missing).
SibSp: The number of siblings/spouses aboard the Titanic.
Parch: The number of parents/children aboard the Titanic.
Ticket: The ticket number of the passenger.
Fare: The fare or ticket price paid by the passenger.
Cabin: The cabin number assigned to the passenger (some values may be missing).
Embarked: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
Part 3. The Tentative Plan for Visualization
Briefly describe what data analyses (please see the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.
Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
Answer for 1 and 2:
The data analyses I plan to conduct for this project are as follows:
Survival Rate Analysis:
Calculate the overall survival rate of passengers (percentage of survivors) to provide an initial understanding of survival outcomes.
Gender-based Analysis:
Compare the survival rates of male and female passengers using a bivariate visualization (such as a bar chart or stacked bar plot) to explore the potential impact of gender on survival.
Passenger Class Analysis:
Examine the survival rates across different passenger classes (first, second, and third) using visualizations (such as a grouped bar chart) to investigate the relationship between socio-economic status and survival.
Age Distribution Analysis:
Plot the age distribution of passengers using a histogram to visualize the age groups and identify any patterns or trends in survival rates within different age ranges.
Family Size Analysis:
Investigate the relationship between family size (based on the SibSp and Parch variables) and survival rates using a bivariate visualization (such as a scatter plot or grouped bar chart) to explore the impact of traveling with family members on survival.
Fare Analysis:
Analyze the distribution of ticket fares among passengers and assess its relationship with survival rates, potentially using box plots or violin plots to compare fare distributions for survivors and non-survivors.
Cabin Location Analysis:
Explore the relationship between cabin location (based on the Cabin variable) and survival rates using visualizations (such as a stacked bar plot) to understand if proximity to lifeboats or certain areas of the ship influenced survival.
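The survival-rate calculations in this plan can be sketched as follows. This is a minimal illustration, not the project's final code: the `toy` tibble is a hypothetical stand-in with a few invented rows that reuse the dataset's column names, and the result names (`overall_rate`, `rate_by_sex`, `rate_by_class`) are assumptions.

```r
library(dplyr)

# Hypothetical toy rows that mimic the Titanic columns; not the real data
toy <- tibble(
  Survived = c(1, 0, 1, 0, 0, 1, 0, 1),
  Sex      = c("female", "male", "female", "male", "male", "female", "male", "male"),
  Pclass   = c(1, 3, 2, 3, 1, 3, 2, 1)
)

# Overall survival rate: the mean of a 0/1 indicator is the share of survivors
overall_rate <- mean(toy$Survived)

# Group size and survival rate by gender
rate_by_sex <- toy %>%
  group_by(Sex) %>%
  summarise(n = n(), survival_rate = mean(Survived), .groups = "drop")

# Group size and survival rate by passenger class
rate_by_class <- toy %>%
  group_by(Pclass) %>%
  summarise(n = n(), survival_rate = mean(Survived), .groups = "drop")
```

Reporting the group size `n` next to each rate helps flag rates that rest on only a handful of passengers.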
The choice of specific data analyses and visualizations is driven by their ability to address the research questions and provide insights into the relationships between variables in the Titanic dataset. Here’s an explanation of why these types of statistics and graphs are selected and how they help answer specific questions:
Bivariate Visualization:
Bivariate visualizations, such as bar charts or stacked bar plots, are useful for comparing two variables, such as gender and survival or passenger class and survival. They allow us to visually examine the relationship between two categorical variables and observe any patterns or differences in survival rates based on these variables.
Scatter Plot:
Scatter plots are valuable for analyzing the relationship between two numeric variables, such as family size and the survival rate observed at each family size. They show how the data points are dispersed and whether there is any correlation or trend between the variables.
Box Plot and Violin Plot:
Box plots and violin plots are well suited to comparing the distribution of a continuous variable, such as fare, between different groups (e.g., survivors vs. non-survivors). Box plots display the median, quartiles, and outliers, while violin plots show the full shape of the distribution, allowing us to detect differences in fare distributions based on survival outcomes.
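Two of the chart types described above might be built with ggplot2 roughly as follows. This is a sketch under stated assumptions: the `toy` data frame contains invented rows, and the `Outcome` column and the plot names `p_sex` and `p_fare` are introduced here for illustration only.

```r
library(ggplot2)

# Hypothetical toy rows; column names follow the Titanic dataset
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 0, 1, 0, 1),
  Sex      = c("female", "male", "female", "male", "male", "female", "male", "male"),
  Fare     = c(71.3, 7.25, 26.0, 8.05, 51.9, 7.9, 13.0, 30.5)
)

# Turn the 0/1 indicator into a labelled factor so it works as a discrete axis or fill
toy$Outcome <- factor(toy$Survived, levels = c(0, 1),
                      labels = c("Did not survive", "Survived"))

# Stacked bar chart scaled to proportions: share of each outcome within each gender
p_sex <- ggplot(toy, aes(x = Sex, fill = Outcome)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of passengers")

# Box plot: fare distribution for survivors vs. non-survivors
p_fare <- ggplot(toy, aes(x = Outcome, y = Fare)) +
  geom_boxplot()
```

Swapping `geom_boxplot()` for `geom_violin()` in the second plot would show the shape of each fare distribution instead of its quartiles.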
If you plan to conduct specific data analyses and visualizations, describe how you need to process and prepare the tidy data.
What do you need to do to mutate the datasets (convert date data, create a new variable, pivot the data format, etc.)?
How are you going to deal with the missing data/NAs and outliers? And why do you choose this way to deal with NAs?
Calculating Age Groups:
Grouping passengers into age categories: You can create a new variable called “Age Group” by categorizing the passengers’ ages into different groups, such as “Child,” “Adult,” and “Elderly.” This can be done by specifying age ranges and using conditional statements or the cut() function in R.
Converting Categorical Variables:
Mapping categorical variables to numerical values: If any categorical variables, such as “Sex” or “Embarked,” are represented as text, you can create new variables that map these categories to numerical values. For example, you can create a new variable called “Sex_Code” where “male” is encoded as 0 and “female” as 1.
Calculating Family Size:
Creating a variable for family size: You can create a new variable called “Family Size” by summing the “SibSp” (number of siblings/spouses aboard) and “Parch” (number of parents/children aboard) variables. This new variable represents the total number of family members a passenger had onboard.
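The three derived variables described above could be created in a single `mutate()` call, sketched below. The `toy` tibble is hypothetical, and the age cut-offs (under 18, 18 to 64, 65 and over) are assumptions chosen for illustration; the project does not specify them.

```r
library(dplyr)

# Hypothetical toy rows; not the real data
toy <- tibble(
  Age   = c(4, 17, 30, 64, 71, NA),
  Sex   = c("male", "female", "female", "male", "male", "female"),
  SibSp = c(1, 0, 1, 0, 0, 2),
  Parch = c(2, 0, 0, 0, 1, 0)
)

prepared <- toy %>%
  mutate(
    # Assumed cut-offs: [0, 18) Child, [18, 65) Adult, [65, Inf) Elderly
    AgeGroup = cut(Age, breaks = c(0, 18, 65, Inf), right = FALSE,
                   labels = c("Child", "Adult", "Elderly")),
    # Numeric encoding of Sex: male = 0, female = 1
    Sex_Code = if_else(Sex == "female", 1, 0),
    # Family members aboard, not counting the passenger
    FamilySize = SibSp + Parch
  )
```

Passengers with a missing age keep `NA` in `AgeGroup`, so they can still be handled explicitly in a later step.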
If any column has too many missing values, I plan to drop it completely, since a mostly empty column carries little usable information.
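One way to apply this rule is to compute the share of missing values per column and keep only the columns below a cut-off. The sketch below uses a hypothetical `toy` data frame, and the 50% threshold is an assumption for illustration rather than a value chosen in this project.

```r
# Hypothetical toy rows; not the real data
toy <- data.frame(
  Age   = c(22, NA, 26, 35, NA),
  Fare  = c(7.25, 71.3, 7.9, 53.1, 8.05),
  Cabin = c(NA, "C85", NA, NA, NA)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Assumed cut-off: drop any column with more than 50% missing values
threshold <- 0.5
trimmed <- toy[, na_share <= threshold, drop = FALSE]
```

In this toy example only the mostly empty `Cabin` column is removed, while `Age` is kept despite having some missing values.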