Data Analytics and Computational Social Science: DACSS 601: Data Science Fundamentals Final Project - Final Draft

Alexis Gamez

Setup

library(tidyverse)
library(kableExtra)
library(rmarkdown)

Introduction

My biological father passed away when I was young, and the nature of his death was always questionable. He was twenty-four years old, fit, and experienced in the water, so there was no obvious reason why someone like him would suddenly start drowning. Autopsy reports returned nothing abnormal, simply stating inundation/suffocation as the leading cause of death. It was not until around my 25th birthday that I obtained new information that shed some light on the situation. After a discussion with my primary care physician, I learned that I had a heart arrhythmia, and shortly after discussing the matter with my family, I learned that my biological sister had recently been diagnosed with one as well.

After conducting a bit of personal research, I found that most arrhythmia syndromes are genetically inherited. Specifically, they are typically inherited in an autosomal dominant manner, meaning that a child has a 50% chance of inheriting the disease from an affected parent (Gray & Behr, 2016). Due to the nature of the disease, identifying the condition through predictive genetic testing is important for early intervention. That being said, why did it take twenty-five years to be identified in my case? Unfortunately, the answer to that question is not a simple one.

There are several variables that play into the lack of a diagnosis for a twenty-five-year-old Latino male. A few might be ethnic disparities between patients and providers, such as mistrust, misunderstanding, or a lack of knowledge of how healthcare functions. Another important variable to acknowledge is the language barrier, as it’s fair to assume that the native language of a majority of immigrants who arrive in the US is not English. In fact, it was reported in 2001 (4 years after my father’s death) that nearly fourteen million Americans were not proficient in English and that nearly one in five Spanish-speaking Latinos refused to seek medical care due to their discomfort with the language barrier (Smedley, 2003).

There are a great many variables to take into consideration, ranging from patient-level variables to systemic factors, but I wanted to simplify my analyses for the sake of this project. My objective was to home in on a specific variable, one that I felt was more quantifiable, and use the result to fortify my understanding of the issue. As a result, I decided to further investigate the idea of biological variables between racial groups. The question I will be attempting to answer throughout this document is whether certain races in the United States are more likely to develop heart disease than others, and whether select additional variables further predispose them.

Data

For this project I used a data set found on the website Kaggle called “Personal Key Indicators of Heart Disease”. The original data set was provided by the CDC as a part of the Behavioral Risk Factor Surveillance System (BRFSS), which is a system that conducts telephone surveys to gather data on the health statuses of US residents. Originally containing 279 different variables, this version of the data set simplifies it down to 18. This particular survey was conducted in 2020 and included a population of approximately 320,000 adults from across the United States.

heart_rough<-read_csv("C:/Users/Leshiii/Desktop/DACSS Master's/DACSS 601/Final Project/heart_2020_cleaned.csv")
# Rows: 319795 Columns: 18  

#Confirming that the data set was read in correctly
paged_table(heart_rough)

Data Set Variables

Below is a list naming all the variables included in the data set that I will be using. Additionally, the list characterizes the type of data that each variable represents.

Heart Disease: Character Data (Yes/No)
BMI: Numeric/Doubles Data (16.6/20.3/etc)
Smoking: Character Data (Yes/No)
Alcohol Drinking: Character Data (Yes/No)
Stroke: Character Data (Yes/No)
Physical Health: Numeric Data (1/2/3/etc.)
Mental Health: Numeric Data (1/2/3/etc.)
Diff Walking: Character Data (Yes/No)
Sex: Character Data (Female/Male)
Age Category: Character Data (55-59/65-69/etc.)
Race: Character Data (White/Black/etc.)
Diabetic: Character Data (Yes/No)
Physical Activity: Character Data (Yes/No)
Gen Health: Character Data (Good/Fair/Poor/etc.)
Sleep Time: Numeric Data (1/2/3/etc.)
Asthma: Character Data (Yes/No)
Kidney Disease: Character Data (Yes/No)
Skin Cancer: Character Data (Yes/No)
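
Because Gen Health is stored as character data, R sorts its responses alphabetically (Excellent, Fair, Good, Poor, Very good) rather than along the response scale. A minimal sketch of one way to impose an order is shown here. The vector `gen_health_raw` is a hypothetical stand-in for the column, and the worst-to-best ordering in `health_levels` is an assumption about how the scale should read.

```r
# Hypothetical stand-in for the GenHealth column; values are illustrative only.
gen_health_raw <- c("Good", "Poor", "Very good", "Excellent", "Fair", "Good")

# Assumed ordering of the response scale, from worst to best.
health_levels <- c("Poor", "Fair", "Good", "Very good", "Excellent")

# Convert the character vector to an ordered factor with those levels.
gen_health_fct <- factor(gen_health_raw, levels = health_levels, ordered = TRUE)

# table() now follows the scale order instead of alphabetical order.
table(gen_health_fct)
```

The same conversion could be applied to the real column before building tables or plots, so that categories appear from Poor to Excellent.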

Data Cleaning

Not every variable was immediately useful to the pursuit of my research question. As a result, I decided to pull specific variables that were more informative than others (see table below).

Additionally, I wanted to create a data set containing only the participants who reported heart disease. However, after analyzing some of the categorical data, I noticed that white survey participants made up an overwhelming portion of the sample. I felt the difference was substantial enough to overshadow interesting data present within the other racial populations. Using that information, I created two sub-data sets: one that includes white participants and another that does not.

# Base tidied data set, unfiltered with select variables.
heart <- heart_rough %>%
  select(`HeartDisease`, 
         `Smoking`,
         `AlcoholDrinking`,
         `Stroke`,
         `PhysicalHealth`,
         `MentalHealth`,
         `AgeCategory`,
         `Race`,
         `Diabetic`,
         `PhysicalActivity`,
         `GenHealth`,
         `SleepTime`)
paged_table(heart)

# 1st sub-data set only containing participants who reported having heart disease. 
# Includes white participants. 
# HeartDisease is dropped because it is "Yes" in every remaining row.
w_heart <- heart %>%
   filter(HeartDisease == "Yes") %>%
   select(-HeartDisease)

# 2nd sub-data set. 
# Does not include white participants.
nw_w_heart <- w_heart %>%
   filter(Race != "White")
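
The imbalance between racial groups that motivated the second sub-data set can be checked with a quick count. The sketch below assumes dplyr is available. The data frame `toy_heart` and its rows are hypothetical stand-ins for `heart`, not values from the survey.

```r
library(dplyr)

# Hypothetical toy rows standing in for the `heart` data frame.
toy_heart <- data.frame(
  HeartDisease = c("Yes", "No", "Yes", "No", "Yes", "No"),
  Race         = c("White", "White", "Black", "Hispanic", "White", "Asian")
)

# Count respondents per race and add each group's share of the sample.
race_counts <- toy_heart %>%
  count(Race, sort = TRUE) %>%
  mutate(share = n / sum(n) * 100)

race_counts
```

Running the same pipeline on the full data would show how large each group is before any filtering takes place.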

Data Analyses

Numeric Data

The first task I attempted to tackle was to analyze the numeric data that I felt had the potential to illuminate certain variables as correlating with heart disease. Specifically, I attempted to find the mean, median and standard deviation for the Mental Health, Physical Health and Sleep Time variables.

# Calculating the mean, median & standard deviation of the Mental Health variable.
# First table uses data from participants who did not report heart disease.
mental <- heart %>%
   filter(`HeartDisease` != "Yes") %>%
   summarize(Mental_Mean = mean(MentalHealth), 
          Mental_Median = median(MentalHealth), 
          Mental_SD = sd(MentalHealth)) %>%
   kable(col.names = c("Mean", "Median", "SD"), 
         caption = "Reported Mental Health in the Last 30 Days of those w/o Heart Disease", label = NA)
mental

# The second table only uses data from participants that reported heart disease.
w_mental <- w_heart %>%
   summarize(Mental_Mean = mean(MentalHealth), 
              Mental_Median = median(MentalHealth), 
              Mental_SD = sd(MentalHealth)) %>%
   kable(col.names = c("Mean", "Median", "SD"), 
         caption = "Reported Mental Health in the Last 30 Days of those w/ Heart Disease", label = NA)
w_mental

# Calculating the mean, median & standard deviation of the Physical Health variable.
phys <- heart %>%
   filter(`HeartDisease` != "Yes") %>%
   summarize(Phys_Mean = mean(PhysicalHealth),
             Phys_Median = median(PhysicalHealth), 
             Phys_SD = sd(PhysicalHealth)) %>%
   kable(col.names = c("Mean", "Median", "SD"), 
         caption = "Reported Physical Health in the Last 30 Days of those w/o Heart Disease", label = NA)
phys

w_phys <- w_heart %>%
   summarize(Phys_Mean = mean(PhysicalHealth),
             Phys_Median = median(PhysicalHealth), 
             Phys_SD = sd(PhysicalHealth)) %>%
   kable(col.names = c("Mean", "Median", "SD"), 
         caption = "Reported Physical Health in the Last 30 Days of those w/ Heart Disease", label = NA)
w_phys

# Calculating the mean, median & standard deviation of the Sleep Time variable.
sleep <- heart %>%
   filter(`HeartDisease` != "Yes") %>%
   summarize(Sleep_Mean = mean(SleepTime), 
             Sleep_Median = median(SleepTime), 
             Sleep_SD = sd(SleepTime)) %>%
   kable(col.names = c("Mean", "Median", "SD"), 
         caption = "Reported Sleep Times in the Last 30 Days of those w/o Heart Disease", label = NA)
sleep

w_sleep <- w_heart %>%
   summarize(Sleep_Mean = mean(SleepTime), 
             Sleep_Median = median(SleepTime), 
             Sleep_SD = sd(SleepTime)) %>%
   kable(col.names = c("Mean", "Median", "SD"), 
         caption = "Reported Sleep Times in the Last 30 Days of those w/ Heart Disease", label = NA)
w_sleep

Reported Mental Health in the Last 30 Days of those w/o Heart Disease
Mean	Median	SD
3.828778	0	7.828079

Reported Mental Health in the Last 30 Days of those w/ Heart Disease
Mean	Median	SD
4.641764	0	9.171932

Reported Physical Health in the Last 30 Days of those w/o Heart Disease
Mean	Median	SD
2.956416	0	7.400378

Reported Physical Health in the Last 30 Days of those w/ Heart Disease
Mean	Median	SD
7.808242	0	11.48782

Reported Sleep Times in the Last 30 Days of those w/o Heart Disease
Mean	Median	SD
7.093416	7	1.399331

Reported Sleep Times in the Last 30 Days of those w/ Heart Disease
Mean	Median	SD
7.136156	7	1.780863

My intention was to find instances in which the descriptive statistics differed greatly between participants without heart disease and those with it. Ideally, these tables would have shown one or more variables being more common within the population with heart disease.
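
The six summary tables can also be produced in a single pass by grouping on the Heart Disease variable. This is a minimal sketch assuming dplyr 1.0 or later for `across()`. The data frame `toy_heart` and its values are hypothetical, so the numbers it prints are not survey results.

```r
library(dplyr)

# Hypothetical toy rows standing in for the `heart` data frame.
toy_heart <- data.frame(
  HeartDisease   = c("No", "No", "No", "Yes", "Yes", "Yes"),
  MentalHealth   = c(0, 0, 6, 0, 3, 30),
  PhysicalHealth = c(0, 2, 4, 0, 10, 20),
  SleepTime      = c(7, 8, 6, 7, 5, 9)
)

# One row per HeartDisease group, with mean, median and SD for each variable.
health_summary <- toy_heart %>%
  group_by(HeartDisease) %>%
  summarize(
    across(
      c(MentalHealth, PhysicalHealth, SleepTime),
      list(mean = mean, median = median, sd = sd),
      .names = "{.col}_{.fn}"
    )
  )

health_summary
```

Placing both groups in one table makes the with/without comparison easier to read side by side, at the cost of a wider table.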

Categorical Data

Next, I wanted to determine the proportion of responses for each of the selected variables. I started with the Heart Disease variable in order to see what percentage of survey participants reported having heart disease. At a glance, this could show whether or not my research question was remotely relevant.

heart_disease <- heart %>%
   select(HeartDisease)
kable(prop.table(table(heart_disease))*100, col.names = c("Heart Disease?", "%"))

Heart Disease?	%
No	91.440454
Yes	8.559546

I was surprised to see that out of the entire participant population, about 9% reported having heart disease. The number is roughly in line with previous research and national averages across different sub-groups (CDC, 2019).

For the remainder of the tables generated, I chose to create two tables for each variable only using the data of those who reported having heart disease. The first includes the white participant population and the other does not, for reasons previously stated. Here, I give one last attempt to discern whether there are any other variables that predispose individuals to greater risks of heart disease.

# Calculating the proportion of all participants reporting heart disease according to 
# their race.

# The first table includes white participants.
race <- w_heart %>%
   select(Race)
kable(prop.table(table(race))*100, col.names = c("Race", "%"))

# The second does not.
nw_race <- nw_w_heart %>%
   select(Race)
kable(prop.table(table(nw_race))*100, col.names = c("Race", "%"),
      caption = "w/o White Participants", label = NA)

# Calculating the proportion of all participants reporting heart disease according to 
# their age group.
age <- w_heart %>%
   select(AgeCategory)
kable(prop.table(table(age))*100, col.names = c("Age", "%"))

nw_age <- nw_w_heart %>%
   select(AgeCategory)
kable(prop.table(table(nw_age))*100, col.names = c("Age", "%"),
      caption = "w/o White Participants", label = NA)

# Calculating the proportion of all participants reporting heart disease according to 
# their reported General Health in the last 30 days.
gen_health <- w_heart %>%
   select(GenHealth)
kable(prop.table(table(gen_health))*100, col.names = c("Gen Health in Last 30 Days", "%"))

nw_gen_health <- nw_w_heart %>%
   select(GenHealth)
kable(prop.table(table(nw_gen_health))*100, col.names = c("Gen Health in Last 30 Days", "%"),
      caption = "w/o White Participants", label = NA)

# Calculating the proportion of all participants reporting heart disease according to 
# their reported physical activity.
physical <- w_heart %>%
   select(PhysicalActivity)
kable(prop.table(table(physical))*100, col.names = c("Phys Activity in Last 30 Days?", "%"))

nw_physical <- nw_w_heart %>%
   select(PhysicalActivity)
kable(prop.table(table(nw_physical))*100, col.names = c("Phys Activity in Last 30 Days?", "%"),
      caption = "w/o White Participants", label = NA)

# Calculating the proportion of all participants reporting heart disease according to 
# whether or not they smoke.
smoke <- w_heart %>%
   select(Smoking)
kable(prop.table(table(smoke))*100, col.names = c("Smoking?", "%"))

nw_smoke <- nw_w_heart %>%
   select(Smoking)
kable(prop.table(table(nw_smoke))*100, col.names = c("Smoking?", "%"),
      caption = "w/o White Participants", label = NA)

# Calculating the proportion of all participants reporting heart disease according to 
# whether or not they drink alcohol regularly.
alcohol <- w_heart %>%
  select(AlcoholDrinking)
kable(prop.table(table(alcohol))*100, col.names = c("Drinking?", "%"))

nw_alcohol <- nw_w_heart %>%
   select(AlcoholDrinking)
kable(prop.table(table(nw_alcohol))*100, col.names = c("Drinking?", "%"),
      caption = "w/o White Participants", label = NA)

# Calculating the proportion of all participants reporting heart disease according to 
# whether or not they have had a stroke.
stroke <- w_heart %>%
   select(Stroke)
kable(prop.table(table(stroke))*100, col.names = c("Had a Stroke?", "%"))

nw_stroke <- nw_w_heart %>%
   select(Stroke)
kable(prop.table(table(nw_stroke))*100, col.names = c("Had a Stroke?", "%"),
      caption = "w/o White Participants", label = NA)

# Calculating the proportion of all participants reporting heart disease according to 
# whether or not they have diabetes.
diabetes <- w_heart %>%
   select(Diabetic)
kable(prop.table(table(diabetes))*100, col.names = c("Have Diabetes?", "%"))

nw_diabetes <- nw_w_heart %>%
   select(Diabetic)
kable(prop.table(table(nw_diabetes))*100, col.names = c("Have Diabetes?", "%"),
      caption = "w/o White Participants", label = NA)

Race	%
American Indian/Alaskan Native	1.9800533
Asian	0.9717605
Black	6.3164432
Hispanic	5.2716180
Other	3.2367662
White	82.2233588

w/o White Participants
Race	%
American Indian/Alaskan Native	11.138512
Asian	5.466502
Black	35.532265
Hispanic	29.654747
Other	18.207974

Age	%
18-24	0.4749205
25-29	0.4858802
30-34	0.8256311
35-39	1.0813575
40-44	1.7754722
45-49	2.7180068
50-54	5.0524239
55-59	8.0444233
60-64	12.1543126
65-69	14.9819165
70-74	17.7072298
75-79	14.7919483
80 or older	19.9064772

w/o White Participants
Age	%
18-24	1.191944
25-29	1.233046
30-34	1.808467
35-39	2.034525
40-44	3.596383
45-49	5.404850
50-54	8.651870
55-59	11.282367
60-64	14.919852
65-69	15.002055
70-74	13.748459
75-79	9.802713
80 or older	11.323469

Gen Health in Last 30 Days	%
Excellent	5.479852
Fair	25.879516
Good	34.917620
Poor	14.064955
Very good	19.658057

w/o White Participants
Gen Health in Last 30 Days	%
Excellent	5.856967
Fair	31.853679
Good	31.545417
Poor	17.673654
Very good	13.070284

Phys Activity in Last 30 Days?	%
No	36.10857
Yes	63.89143

w/o White Participants
Phys Activity in Last 30 Days?	%
No	40.87546
Yes	59.12454

Smoking?	%
No	41.41307
Yes	58.58693

w/o White Participants
Smoking?	%
No	46.58857
Yes	53.41143

Drinking?	%
No	95.831659
Yes	4.168341

w/o White Participants
Drinking?	%
No	96.465269
Yes	3.534731

Had a Stroke?	%
No	83.96595
Yes	16.03405

w/o White Participants
Had a Stroke?	%
No	78.72996
Yes	21.27004

Have Diabetes?	%
No	64.0010229
No, borderline diabetes	2.8824024
Yes	32.7220254
Yes (during pregnancy)	0.3945494

w/o White Participants
Have Diabetes?	%
No	55.0965886
No, borderline diabetes	3.5758323
Yes	40.6288533
Yes (during pregnancy)	0.6987259
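
The tables above show how the heart disease cases are distributed across categories, which partly reflects how many respondents fall into each category to begin with. A complementary view for the race variable is the percentage of respondents within each race who reported heart disease. The sketch below assumes dplyr. The data frame `toy_heart` and its rows are hypothetical, so the percentages are illustrative and not findings.

```r
library(dplyr)

# Hypothetical toy rows standing in for the `heart` data frame.
toy_heart <- data.frame(
  HeartDisease = c("Yes", "No", "No", "No", "Yes", "No", "Yes", "No"),
  Race = c("White", "White", "White", "White",
           "Black", "Black", "Hispanic", "Asian")
)

# Within each race: group size, number of cases, and percentage with heart disease.
prevalence_by_race <- toy_heart %>%
  group_by(Race) %>%
  summarize(
    respondents      = n(),
    with_disease     = sum(HeartDisease == "Yes"),
    pct_with_disease = mean(HeartDisease == "Yes") * 100
  ) %>%
  arrange(desc(pct_with_disease))

prevalence_by_race
```

Because each percentage is computed within its own group, a large group such as white respondents no longer dominates the comparison.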

Visualizations

Uni-Variate Visualizations

The first thing I wanted to visualize was whether heart disease was more prominent in certain age groups than others. This was confirmed as the bar graph below displays that heart disease is much more common in participants ages 55 and up.

ggplot(age, aes(AgeCategory)) + 
  geom_bar() +
  labs(x = "Age Range", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Similarly, I wanted to create a visual that depicted the frequency of heart disease within certain races of people, as it ties directly to my research question. As previously mentioned, the overwhelming quantity of white participants obscures the results of participants who are not white. As a result, I’ve also included a second plot that excludes white participants.

ggplot(race, aes(Race)) + 
  geom_bar() +
  labs(x = " ", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(nw_race, aes(Race)) + 
  geom_bar() +
  labs(title = "w/o White Participants", x = " ", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bi-Variate Visualizations

Moving into bi-variate visuals, I started by gauging correlation between mental health, physical health and heart disease according to age. My intention was to see whether poor mental or physical health became more commonly reported at later ages in an attempt to determine whether either could be catalysts to heart disease.

# Gauging correlation between mental health, physical health and heart disease according to age.
ggplot(w_heart, aes(AgeCategory, MentalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Age & Mental Health over the Last 30 Days", 
       x = "Age Range", 
       y = "Days w/ Poor Mental Health") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(w_heart, aes(AgeCategory, PhysicalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Age & Physical Health over the Last 30 Days", 
       x = "Age Range", 
       y = "Days w/ Poor Physical Health") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Afterwards, I continued in a nearly identical manner, but according to race rather than age. Similarly, my intention was to see whether poor mental or physical health is more commonly reported in different racial groups. While there may be more complex underlying systemic issues that have the potential to influence results, I was curious to see the visualizations nonetheless.

# Gauging correlation between mental health, physical health and heart disease according to race.
ggplot(w_heart, aes(Race, MentalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Race & Mental Health over the Last 30 Days", 
       x = " ", 
       y = "Days w/ Poor Mental Health") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(w_heart, aes(Race, PhysicalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Race & Physical Health over the Last 30 Days", 
       x = " ", 
       y = "Days w/ Poor Physical Health") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Next, using the data I obtained from my proportion tables, I was able to discern that smoking was a somewhat prevalent factor among those with heart disease. As such, I wanted to create a graphic that showed a side-by-side comparison of the number of people with heart disease who do not smoke and the number who do.

# Analyzing correlation between smoking activity & heart disease according to age.
# geom_bar() counts rows per category, matching geom_histogram(stat = "count").
ggplot(w_heart, aes(AgeCategory)) + 
  geom_bar() + 
  labs(title = "Heart Disease Cases per Age & Smoking Activity", 
       x = "Age Range", 
       y = "Count") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(vars(Smoking))

From the results, I believe it is reasonable to assume that smoking does indeed increase the risk of heart disease as one ages!

I wanted to take the visualization one step further and visualize the same factors, but according to race rather than age range. Here, my intention is to illuminate whether genetic factors in certain racial groups would alter an individual’s predisposition to heart disease if they were a smoker.

ggplot(nw_w_heart, aes(Race)) + 
  geom_bar() + 
  labs(title = "Heart Disease Cases per Race & Smoking Activity w/o White Participants", 
       x = " ", 
       y = "Count") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_wrap(vars(Smoking))

The results here are interesting, even if they aren’t supportive of my hypothesis. The inconsistency in trends among the Hispanic population leads me to believe that there are external factors influencing the data. Those external factors could be dishonest responses, misunderstanding of the survey, or cultural differences, but considering that all other races follow identical increasing trends, there has to be some reason for the inconsistency.
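
Since the facets compare raw counts, groups of different sizes are hard to compare directly. One option is to compute the percentage of smokers within each race among those reporting heart disease. The following is a sketch assuming dplyr. The data frame `toy_cases` and its rows are hypothetical stand-ins for `nw_w_heart`.

```r
library(dplyr)

# Hypothetical toy rows standing in for the `nw_w_heart` data frame.
toy_cases <- data.frame(
  Race    = c("Black", "Black", "Black", "Hispanic", "Hispanic", "Asian", "Asian"),
  Smoking = c("Yes", "Yes", "No", "No", "No", "Yes", "No")
)

# Count each Race/Smoking pair, then scale the counts to percentages within each race.
smoking_share <- toy_cases %>%
  count(Race, Smoking) %>%
  group_by(Race) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

smoking_share
```

The same within-group scaling can be drawn directly in ggplot2 by mapping `fill = Smoking` and using `geom_bar(position = "fill")`, which stretches every bar to the same height.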

Lastly, I wanted to create one final chart to, again, visualize the presence of heart disease within different races, but this time as a bi-variate visualization that also spans different age ranges. The hope is to show consistently higher rates of heart disease among certain races across all age groups, making it harder to disprove that different races do indeed possess some sort of genetic variation that predisposes them to heart disease.

# Analyzing the prevalence of heart disease in different racial groups according to age.
ggplot(nw_w_heart, aes(AgeCategory, fill = Race)) + 
  geom_bar() + 
  labs(title = "Heart Disease in Various Age Groups according to Race", 
       x = "Age Range", 
       y = "Count") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

As before, the Black and Hispanic populations have consistently higher counts of heart disease, while the Asian population continues to have the lowest. This trend has been consistent across nearly all visualizations presented (not taking white participants into consideration).

Reflections

This class was interesting and pushed my boundaries and conventional understanding of data. I learned a lot more about coding than I initially anticipated, but in completing this project, have also come to enjoy a new skill. The project itself proved difficult to complete, more so in regards to time than anything else. I wanted to keep adding more and more information, do more and more research, but with my additional course load I had to restrain myself from going further and choose to finalize something.

For this project, I wanted to pick a subject that was both personal and particular to an industry of interest. That is why I chose the data set that I did! I’d love to work in the medical industry, and the opportunity to study heart disease in different populations seemed extremely interesting. Most of the decisions regarding analyses have already been highlighted throughout this document, but in retrospect, selecting a path along which to analyze the data was difficult at first. Many of the decisions I made came only after lots of trial and error with different functions, combinations of variables, selections of visualizations, and packages. In the end, I chose to whittle down my results and code to show the most effective data and code.

The biggest challenge I faced when creating my project was time management. The process of learning fundamental data analytics to this extent was a bit of a culture shock to me. The class truly altered the way I perceive data, and that process was time-consuming. That, paired with my additional course load, made it difficult to juggle everything and transition into the mentality I felt I needed in order to be effective in creating this project. I wish I had been aware of the amount of attention the project would eventually require, but I’m satisfied with the result I produced nonetheless.

Ideally, if I were to continue with this project, I’d love to further research data concerning disparities in healthcare among different communities across the United States. I feel as though it would serve to fortify the points already made in this document, while simultaneously empowering my desire to improve my analysis skills.

Conclusion

At a glance, I was not able to draw many strong, direct correlations between specific variables and heart disease other than smoking.

In the analysis of numeric data, mean, median and standard deviation values were relatively consistent across the board. The median values for mental and physical health were zero in both groups because of the large number of zero-value responses, which made them uninformative. The only exception to this consistency was reported physical health. Even so, the difference is hard to interpret: it could occur as a result of having heart disease or of some external factor, so it’s difficult to draw any confident conclusions. The categorical data was a bit more enlightening, and I was able to conclude that there was some potential correlation between smoking and heart disease. There was also potential to establish a connection between heart disease and diabetes, but I felt as though that was already a well-known hypothesis.
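
When most responses are zero, the median stays at zero no matter how the remaining responses are spread. A small sketch of two alternative summaries, the share of zero responses and the median among non-zero responses, is given below in base R. The vector `days_poor_health` is hypothetical and only stands in for a column such as MentalHealth.

```r
# Hypothetical values standing in for a days-in-the-last-30 column.
days_poor_health <- c(0, 0, 0, 0, 2, 5, 30, 0, 10, 0)

# The overall median is pulled to zero by the many zero responses.
median_all <- median(days_poor_health)

# Percentage of respondents reporting zero days.
pct_zero <- mean(days_poor_health == 0) * 100

# Median among respondents who reported at least one day.
median_nonzero <- median(days_poor_health[days_poor_health > 0])

c(median_all = median_all, pct_zero = pct_zero, median_nonzero = median_nonzero)
```

Reporting the two numbers together separates how many people report any poor-health days from how many days those people report.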

Wanting to first confirm my suspicion that heart disease was closely tied to age, I created the first uni-variate visual. Together with the proportional spread in one of the tables above, the visualization further confirms that heart disease becomes more and more common with age. I was able to conclude that heart disease, in all instances, becomes increasingly common after the age of 40 to 50. For the last two uni-variate visualizations, trying to analyze the data per race was a tad more complicated. I had to filter out white participants completely in order to view the other populations’ results clearly at all. Even then, the data frame was substantially reduced. Still, in them we can see that heart disease is indeed more common in the Black & Hispanic groups, with the lowest count among those who identify as Asian.

Within the first two bi-variate visualizations, per age & mental/physical health, I wanted to try to draw a relation between both new variables and the age category. In the end, one can draw the conclusion that after the age of 35-39, physical health starts to become a regular complaint among the population. There might be a loose connection between physical health and heart disease, but unfortunately, I can’t determine if there’s any concrete tie to either variable. I wanted to do something similar with the following two box-plot visualizations per race & mental/physical health. In those visuals, there is a large spread among the racial groups that had higher counts of heart disease in the uni-variate visualizations, leading me to believe that mental and physical health play some sort of role in the grand scheme of things. However, the variation within the races that previously had lower counts of heart disease prevents me from drawing any concrete conclusions.

The following two visualizations utilize the smoking variable. Using the facet_wrap function, I was able to generate side-by-side comparisons of heart disease counts per population (smoking & non-smoking). I was able to conclude, from the first, that smoking does indeed increase your chances of heart disease over time, as the number of positive cases in the smoking population increased substantially relative to the non-smoking population. The second visual was not so conclusive, and I wasn’t able to draw anything concrete from it, but it did raise some questions concerning reporting accuracy and user input.

Lastly, the final visualization was a sort of cross analysis of the two primary variables (age & race) used in the uni-variate visualizations. It served to further confirm that not only do the chances of heart disease increase with age, but the odds are greatly increased in the Black and Hispanic populations. Unfortunately, this does not take into account the white population. Nonetheless, the Black and Hispanic populations have the highest counts of heart disease in every age category, and if I were to guess, were there a proportionate number of white respondents, we would continue to see that trend persist.

After all the analysis, I’m still left to wonder exactly which factors make heart disease so prevalent within the Black and Hispanic populations, with no concrete ties to the analyzed variables. Additionally, I wonder why the data clearly show that heart disease is more common in the smoking population, but those same results almost seem reversed when the smoking population is analyzed according to race. I get the feeling that may be due to user error or response validity. Also, it seems that poor mental and physical health are more prominent factors when broken down according to racial group. As previously mentioned, I believe these are smaller cogs in a larger machine. I’d be curious to obtain additional information and data sets to flesh out more concrete conclusions.

Bibliography

Pytlak, Kamil. “Personal Key Indicators of Heart Disease.” Kaggle, 16 Feb. 2022, https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease.

Gray, Belinda, and Elijah R. Behr. “New Insights into the Genetic Basis of Inherited Arrhythmia Syndromes.” Circulation: Cardiovascular Genetics, vol. 9, no. 6, 1 Dec. 2016, pp. 569–577., https://doi.org/10.1161/circgenetics.116.001571.

Smedley, Brian D., et al. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. National Academy Press, 2003.

Cusick, Julia, et al. “Health Disparities by Race and Ethnicity.” Center for American Progress, 9 July 2018, https://www.americanprogress.org/article/health-disparities-race-ethnicity/.

Cardiology Magazine. “Cover Story: One Size Does Not Fit All: The Role of Sex, Gender, Race and Ethnicity in Cardiovascular Medicine.” American College of Cardiology, 19 Oct. 2018, https://www.acc.org/latest-in-cardiology/articles/2018/10/14/12/42/cover-story-one-size-does-not-fit-all-sex-gender-race-and-ethnicity-in-cardiovascular-medicine.

CDC. “Health, United States Spotlight - Centers for Disease Control and …” Centers for Disease Control and Prevention, CDC, 2019, https://www.cdc.gov/nchs/hus/spotlight/HeartDiseaseSpotlight_2019_0404.pdf.

Grolemund, Garrett, and Hadley Wickham. R For Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly, 2017.

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
