Descriptive Statistics and Visualization
For this assignment I used a data set originally provided by the CDC as part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts telephone surveys to gather data on the health status of US residents.
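# read_csv(), the pipe, the dplyr verbs and ggplot() used below all come from the tidyverse
library(tidyverse)
heart <- read_csv("C:/Users/Leshiii/Desktop/DACSS Master's/DACSS 601/HW4/heart_2020_cleaned.csv")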
# Rows: 319795 Columns: 18
#Confirming that the data set was read in correctly
head(heart)
# A tibble: 6 x 18
HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 No 16.6 Yes No No 3
2 No 20.3 No No Yes 0
3 No 26.6 Yes No No 20
4 No 24.2 No No No 0
5 No 23.7 No No No 28
6 Yes 28.9 Yes No No 6
# ... with 12 more variables: MentalHealth <dbl>, DiffWalking <chr>,
# Sex <chr>, AgeCategory <chr>, Race <chr>, Diabetic <chr>,
# PhysicalActivity <chr>, GenHealth <chr>, SleepTime <dbl>,
# Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>
# Calculating the mean, median & standard deviation of the Mental Health variable
# Objects prefixed with z_ include all zero values in the data frame, while objects prefixed with nz_ filter out all zero values
z_mental <- heart %>%
summarize(Mental_Mean = mean(MentalHealth),
Mental_Median = median(MentalHealth),
Mental_SD = sd(MentalHealth))
z_mental
# A tibble: 1 x 3
Mental_Mean Mental_Median Mental_SD
<dbl> <dbl> <dbl>
1 3.90 0 7.96
nz_mental <- heart %>%
filter(MentalHealth > 0) %>%
summarize(Mental_Mean = mean(MentalHealth),
Mental_Median = median(MentalHealth),
Mental_SD = sd(MentalHealth))
nz_mental
# A tibble: 1 x 3
Mental_Mean Mental_Median Mental_SD
<dbl> <dbl> <dbl>
1 10.9 6 10.0
# Calculating the mean, median & standard deviation of the Physical Health variable
z_phys <- heart %>%
summarize(Phys_Mean = mean(PhysicalHealth),
Phys_Median = median(PhysicalHealth),
Phys_SD = sd(PhysicalHealth))
z_phys
# A tibble: 1 x 3
Phys_Mean Phys_Median Phys_SD
<dbl> <dbl> <dbl>
1 3.37 0 7.95
nz_phys <- heart %>%
filter(PhysicalHealth > 0) %>%
summarize(Phys_Mean = mean(PhysicalHealth),
Phys_Median = median(PhysicalHealth),
Phys_SD = sd(PhysicalHealth))
nz_phys
# A tibble: 1 x 3
Phys_Mean Phys_Median Phys_SD
<dbl> <dbl> <dbl>
1 11.6 6 11.0
# Calculating the mean, median & standard deviation of the Sleep Time variable
z_sleep <- heart %>%
summarize(Sleep_Mean = mean(SleepTime),
Sleep_Median = median(SleepTime),
Sleep_SD = sd(SleepTime))
z_sleep
# A tibble: 1 x 3
Sleep_Mean Sleep_Median Sleep_SD
<dbl> <dbl> <dbl>
1 7.10 7 1.44
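The z_/nz_ pattern above repeats the same three statistics for every variable. As a minimal sketch of how that could be wrapped in one helper, here is a base R version; the function name `summarise_days` and the toy vector are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: mean, median and SD of a numeric vector,
# optionally dropping zero values first (the nz_ variant above).
summarise_days <- function(x, drop_zero = FALSE) {
  if (drop_zero) {
    x <- x[x > 0]
  }
  c(mean = mean(x), median = median(x), sd = sd(x))
}

# Toy values standing in for a column such as heart$MentalHealth
toy_days <- c(0, 0, 2, 4)

z_toy  <- summarise_days(toy_days)                    # zeros included
nz_toy <- summarise_days(toy_days, drop_zero = TRUE)  # zeros filtered out
print(z_toy)
print(nz_toy)
```

On the real data this would be called as, for example, `summarise_days(heart$MentalHealth, drop_zero = TRUE)`, assuming the column contains no missing values.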
# Calculating the proportion of all participants and participants reporting heart disease according to their race.
race <- select(heart, Race)
p_race <- prop.table(table(race))*100
p_race
race
American Indian/Alaskan Native Asian
1.626667 2.522866
Black Hispanic
7.173033 8.582373
Other White
3.417189 76.677872
race_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(Race)
prop.table(table(race_heart))*100
race_heart
American Indian/Alaskan Native Asian
1.9800533 0.9717605
Black Hispanic
6.3164432 5.2716180
Other White
3.2367662 82.2233588
# Because the proportion of white participants is so high, it may overshadow the results for less-represented races. As a remedy, I created a separate data set excluding white participants.
nw_race_heart <- heart %>%
filter(`HeartDisease` == "Yes", `Race` != "White") %>%
select(Race)
prop.table(table(nw_race_heart))*100
nw_race_heart
American Indian/Alaskan Native Asian
11.138512 5.466502
Black Hispanic
35.532265 29.654747
Other
18.207974
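One way to read the two race tables together is to divide each group's share of heart disease cases by its share of all respondents. The sketch below does that with the percentages printed above; the name `rep_ratio` is my own label for this quotient, and the result is a raw comparison that does not adjust for age or any other variable.

```r
# Percentages copied from the two tables printed above
share_all <- c("American Indian/Alaskan Native" = 1.626667,
               "Asian"    = 2.522866,
               "Black"    = 7.173033,
               "Hispanic" = 8.582373,
               "Other"    = 3.417189,
               "White"    = 76.677872)
share_hd <- c("American Indian/Alaskan Native" = 1.9800533,
              "Asian"    = 0.9717605,
              "Black"    = 6.3164432,
              "Hispanic" = 5.2716180,
              "Other"    = 3.2367662,
              "White"    = 82.2233588)

# Ratio > 1: the group makes up a larger share of heart disease cases
# than of all respondents; ratio < 1: a smaller share.
rep_ratio <- share_hd[names(share_all)] / share_all
print(round(sort(rep_ratio, decreasing = TRUE), 2))
```

A ratio near 1 means a group appears among heart disease cases about as often as it appears in the sample, which helps separate group size from group risk.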
# Calculating the proportion of all participants and participants reporting heart disease according to their age group.
age <- select(heart, AgeCategory)
p_age <- prop.table(table(age))*100
p_age
age
18-24 25-29 30-34 35-39 40-44
6.586720 5.301834 5.864069 6.425992 6.568583
45-49 50-54 55-59 60-64 65-69
6.814053 7.936960 9.305024 10.533623 10.679029
70-74 75-79 80 or older
9.714036 6.717428 7.552651
age_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(AgeCategory)
prop.table(table(age_heart))*100
age_heart
18-24 25-29 30-34 35-39 40-44
0.4749205 0.4858802 0.8256311 1.0813575 1.7754722
45-49 50-54 55-59 60-64 65-69
2.7180068 5.0524239 8.0444233 12.1543126 14.9819165
70-74 75-79 80 or older
17.7072298 14.7919483 19.9064772
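To put a number on how concentrated heart disease is in the older groups, the percentages from the two age tables can be summed for ages 55 and up. This is only a sketch using the values printed above; the vector names are hypothetical.

```r
# Percentages for the 55-59 through "80 or older" groups, copied from above
older_groups <- c("55-59", "60-64", "65-69", "70-74", "75-79", "80 or older")
age_all_55 <- setNames(c(9.305024, 10.533623, 10.679029,
                         9.714036, 6.717428, 7.552651), older_groups)
age_hd_55  <- setNames(c(8.0444233, 12.1543126, 14.9819165,
                         17.7072298, 14.7919483, 19.9064772), older_groups)

# Share of all respondents vs. share of heart disease cases aged 55 and up
share_55_plus_all <- sum(age_all_55)
share_55_plus_hd  <- sum(age_hd_55)
print(c(all_respondents = share_55_plus_all, heart_disease = share_55_plus_hd))
```

Comparing the two sums shows whether respondents aged 55 and up make up a larger part of the heart disease cases than of the sample as a whole.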
# Calculating the proportion of all participants and participants reporting heart disease according to their reported General Health in the last 30 days.
gen <- select(heart, GenHealth)
p_gen <- prop.table(table(gen))*100
p_gen
gen
Excellent Fair Good Poor Very good
20.901515 10.843509 29.121468 3.530074 35.603433
gen_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(GenHealth)
prop.table(table(gen_heart))*100
gen_heart
Excellent Fair Good Poor Very good
5.479852 25.879516 34.917620 14.064955 19.658057
# Calculating the proportion of all participants and participants reporting heart disease according to their reported physical activity.
physical <- select(heart, PhysicalActivity)
p_physical <- prop.table(table(physical))*100
p_physical
physical
No Yes
22.46377 77.53623
phys_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(PhysicalActivity)
prop.table(table(phys_heart))*100
phys_heart
No Yes
36.10857 63.89143
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they smoke.
smoke <- select(heart, Smoking)
p_smoke <- prop.table(table(smoke))*100
p_smoke
smoke
No Yes
58.75233 41.24767
smoke_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(Smoking)
prop.table(table(smoke_heart))*100
smoke_heart
No Yes
41.41307 58.58693
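The two smoking tables can also be combined into a rough relative risk without knowing the overall heart disease rate. The sketch below uses only the percentages printed above; the variable names are hypothetical and the result is unadjusted, so it ignores age and every other variable.

```r
# Percentages copied from the smoking tables above
smoker_share_all <- 41.24767 / 100   # P(smoker) among all respondents
smoker_share_hd  <- 58.58693 / 100   # P(smoker | heart disease)

# By Bayes' rule, P(HD | smoker) = P(smoker | HD) * P(HD) / P(smoker),
# so the unknown P(HD) cancels when the two groups are compared.
lift_smoker     <- smoker_share_hd / smoker_share_all
lift_non_smoker <- (1 - smoker_share_hd) / (1 - smoker_share_all)
relative_risk   <- lift_smoker / lift_non_smoker
print(round(relative_risk, 2))
```

With these inputs the ratio comes out at roughly 2, meaning smokers in this sample report heart disease about twice as often as non-smokers before any adjustment.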
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they drink alcohol regularly.
alcohol <- select(heart, AlcoholDrinking)
p_alcohol <- prop.table(table(alcohol))*100
p_alcohol
alcohol
No Yes
93.190325 6.809675
alc_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(AlcoholDrinking)
prop.table(table(alc_heart))*100
alc_heart
No Yes
95.831659 4.168341
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they have had a stroke.
stroke <- select(heart, Stroke)
p_stroke <- prop.table(table(stroke))*100
p_stroke
stroke
No Yes
96.22602 3.77398
stroke_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(Stroke)
prop.table(table(stroke_heart))*100
stroke_heart
No Yes
83.96595 16.03405
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they have diabetes.
diabetes <- select(heart, Diabetic)
p_diabetes <- prop.table(table(diabetes))*100
p_diabetes
diabetes
No No, borderline diabetes
84.3205804 2.1204209
Yes Yes (during pregnancy)
12.7587986 0.8002001
diab_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(Diabetic)
prop.table(table(diab_heart))*100
diab_heart
No No, borderline diabetes
64.0010229 2.8824024
Yes Yes (during pregnancy)
32.7220254 0.3945494
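The tables above show how heart disease cases are distributed across each variable. A complementary view is the share of respondents with heart disease inside each category, which `prop.table()` gives directly through its `margin` argument. The sketch below uses a small synthetic data frame, not the BRFSS file, and the helper name `prevalence_by` is hypothetical.

```r
# Synthetic toy data, NOT the BRFSS file: 10 hypothetical respondents
toy <- data.frame(
  Diabetic     = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  HeartDisease = c("Yes", "Yes", "No",  "No",  "Yes", "No", "No", "No", "No", "No")
)

# Hypothetical helper: percentage with each outcome within every group.
# margin = 1 makes each row (group) sum to 100.
prevalence_by <- function(data, group, outcome = "HeartDisease") {
  counts <- table(data[[group]], data[[outcome]], dnn = c(group, outcome))
  prop.table(counts, margin = 1) * 100
}

prev <- prevalence_by(toy, "Diabetic")
print(prev)
```

On the real data the same call would be `prevalence_by(heart, "Diabetic")`, or any other categorical column name in place of `"Diabetic"`.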
# The first thing I wanted to visualize was whether heart disease was more prevalent in certain age groups than others. The bar graph below confirms this, showing that heart disease is much more common in participants aged 55 and up.
ggplot(age_heart, aes(AgeCategory)) + geom_bar()
# Similarly, I wanted to create a visual depicting the frequency of heart disease within each racial group. As mentioned above, the overwhelming number of white participants obscures the results for participants who are not white. As a result, I've also included a plot restricted to non-white participants.
ggplot(race_heart, aes(Race)) + geom_bar()
ggplot(nw_race_heart, aes(Race)) + geom_bar()
# Creating data sets that include all previously analyzed variables considering only those that reported having heart disease.
w_heart <- heart %>%
filter(`HeartDisease` == "Yes") %>%
select(Race,
AgeCategory,
PhysicalActivity,
Smoking,
AlcoholDrinking,
Stroke,
Diabetic,
MentalHealth,
PhysicalHealth,
SleepTime,
GenHealth)
nw_w_heart <- heart %>%
filter(`HeartDisease` == "Yes", `Race` != "White") %>%
select(Race,
AgeCategory,
PhysicalActivity,
Smoking,
AlcoholDrinking,
Stroke,
Diabetic,
MentalHealth,
PhysicalHealth,
SleepTime,
GenHealth)
# Gauging correlation between mental health, physical health and heart disease according to age.
ggplot(w_heart, aes(AgeCategory, MentalHealth)) +
geom_boxplot() +
labs(title = "Heart Disease Cases per Age & Mental Health over the Last 30 Days")
ggplot(w_heart, aes(AgeCategory, PhysicalHealth)) +
geom_boxplot() +
labs(title = "Heart Disease Cases per Age & Physical Health over the Last 30 Days")
# Gauging correlation between mental health, physical health and heart disease according to race.
ggplot(w_heart, aes(Race, MentalHealth)) +
geom_boxplot() +
labs(title = "Heart Disease Cases per Race & Mental Health over the Last 30 Days")
ggplot(w_heart, aes(Race, PhysicalHealth)) +
geom_boxplot() +
labs(title = "Heart Disease Cases per Race & Physical Health over the Last 30 Days")
# Analyzing correlation between smoking activity & heart disease according to age & race.
# geom_bar() counts rows per category by default, which is what geom_histogram(stat = "count") was doing
ggplot(w_heart, aes(AgeCategory)) +
geom_bar() +
labs(title = "Heart Disease Cases per Age & Smoking Activity") +
theme_bw() +
facet_wrap(vars(Smoking))
ggplot(nw_w_heart, aes(Race)) +
geom_bar() +
labs(title = "Heart Disease Cases per Race & Smoking Activity w/o White Participants") +
theme_bw() +
facet_wrap(vars(Smoking))
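The faceted plots above compare raw case counts between smokers and non-smokers. A possible alternative, sketched below, is a bar chart with `position = "fill"`, which shows the share of each smoking group that reports heart disease. It assumes a data frame that still contains respondents without heart disease (so `heart` rather than `w_heart`); the toy data here is synthetic and only stands in for it.

```r
library(ggplot2)

# Synthetic toy data, NOT the BRFSS file
toy <- data.frame(
  Smoking      = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  HeartDisease = c("Yes", "Yes", "No",  "No",  "Yes", "No", "No", "No", "No", "No")
)

# position = "fill" rescales every bar to height 1, so the coloured
# segments show the share with heart disease inside each smoking group
p <- ggplot(toy, aes(Smoking, fill = HeartDisease)) +
  geom_bar(position = "fill") +
  labs(title = "Share Reporting Heart Disease by Smoking Status (toy data)",
       y = "Proportion of respondents") +
  theme_bw()
print(p)
```

Because every bar has the same height, groups of very different sizes can be compared directly, which may also help with the race plots where one group dominates the counts.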
# Analyzing the prevalence of heart disease in different racial groups according to age.
ggplot(w_heart, aes(AgeCategory, fill = Race)) +
geom_bar() +
labs(title = "Heart Disease in Different Age Groups according to Race") +
theme_bw()
ggplot(nw_w_heart, aes(AgeCategory, fill = Race)) +
geom_bar() +
labs(title = "Heart Disease in Different Age Groups according to Race w/o White Participants") +
theme_bw()
# Similarly, analyzing the prevalence of heart disease in different age groups according to race.
#ggplot(nw_w_heart, aes(Race, fill = AgeCategory)) +
#geom_histogram(stat = "count") +
#labs(title = "Heart Disease in Different Races according to Age w/o White Participants") +
#theme_bw()
# Analyzing the prevalence of heart disease according to reported general health & race.
#ggplot(w_heart, aes(GenHealth, fill = Race)) +
#geom_histogram(stat = "count") +
#labs(title = "Heart Disease according to Reported General Health & Race") +
#theme_bw()
#ggplot(nw_w_heart, aes(GenHealth, fill = Race)) +
#geom_histogram(stat = "count") +
#labs(title = "Heart Disease according to Reported General Health & Race w/o White Participants") +
#theme_bw()
• What variable(s) you are visualizing?
I decided to create 3 uni-variate visualizations to further analyze 2 variables, age & race. For both, I used sub data sets (e.g. age_heart & race_heart) that I created from the original “heart” data set loaded at the beginning of this assignment. By filtering the initial data frame, I was able to create a version that only includes respondents who reported heart disease. Therefore, what we see above is positive heart disease cases distributed by age and race.
• What question(s) you are attempting to answer with the visualization?
At a glance, I was not able to draw any direct correlation between specific variables and heart disease. Because the data seemed random at first, I wanted to begin by confirming my suspicion that heart disease is closely tied to age. Along with the proportions in one of the tables above, the visualization further confirms that heart disease becomes more common with age.
For the last two visualizations, I wanted to know whether heart disease was more common in certain racial groups than others. Whether the cause is socioeconomic or purely biological, I suspected there was some disproportionate relationship that made heart disease more common within certain races.
• What conclusions you can make from the visualization?
I was able to conclude that there is some correlation between heart disease and age, with heart disease becoming increasingly common after the age of 40 to 50.
Analyzing the data by race was a bit more complicated. The overwhelming number of white participants obscured the results for the other racial groups, so I filtered out white participants in order to view the other results more clearly. While this made the remaining data easier to read, it substantially reduced the size of the data frame. Still, a few thousand entries are valuable. Among them, the Black & Hispanic groups account for the most heart disease cases and the Asian group for the fewest, although these are counts and also reflect how many respondents each group contributed. I try to further analyze the origin of the relationship between race and heart disease using additional variables in the bi-variate visualizations.
• What variable(s) you are visualizing?
In addition to the aforementioned variables, age & race, I also decided to visualize mental health, physical health & smoking. Primarily, I use the additional variables to reinforce the analysis from my uni-variate visualizations.
• What question(s) you are attempting to answer with the visualization?
For the first two box-plot visualizations, per age & mental/physical health, I wanted to relate both new variables to the age category. I wanted to answer whether mental & physical health become increasingly poor with age and whether that factors into the likelihood of heart disease.
I wanted to do something similar with the following two box-plot visualizations, per race & mental/physical health. Depending on the results, they might raise further questions concerning economic, socioeconomic and genetic differences between racial groups.
The following two visualizations use the smoking variable. Using the facet_wrap function, I was able to generate side-by-side comparisons of heart disease counts per population (smoking & non-smoking). My goal was to answer whether or not smoking increases the chance of heart disease.
Lastly, the final two visualizations were a sort of cross analysis of the two primary variables (age & race) used in the uni-variate visualizations. I wanted to take a deeper look at the prevalence of heart disease within the different racial groups and see whether I could draw any definite conclusions as to whether heart disease is truly more common within certain races than others.
• What conclusions you can make from the visualization?
From the first two box-plots (per age & mental/physical health), one can conclude that from the 35-39 age group onward, poor physical health starts to become a regular complaint among the population. It seems like there might be a loose connection between physical health and heart disease, but I can’t say for certain. Unfortunately, I can’t determine whether there is any concrete tie between heart disease and reported mental or physical health according to age.
Similarly, I analyzed both mental and physical health according to race. In the visualizations, there is a large spread among the racial groups that had higher counts of heart disease in the uni-variate visualizations, leading me to believe that mental and physical health play some role in the grand scheme of things. However, the variation within the races that previously had lower counts of heart disease prevents me from drawing any concrete correlations. In conclusion, I believe that mental and physical health are just components of a much more prominent factor that has yet to be introduced.
The next two visualizations were a bit more straightforward to analyze. From the first, I was able to conclude that smoking appears to increase the chances of heart disease over time, as the number of positive cases is substantially higher in the smoking population than in the non-smoking population. The second visual was not so conclusive. Unfortunately, I wasn’t able to draw anything concrete from it, but it did raise some questions concerning reporting accuracy and user input.
The last visualizations further suggest that not only do the chances of heart disease increase with age, but that case counts are highest in the Black and Hispanic populations. Unfortunately, this does not take into account the white population. With that said, the Black and Hispanic populations otherwise have the highest counts of heart disease in every age category and, if I were to guess, were there as many of these respondents as there are white respondents, we would clearly see the same result there as well.
• What questions are left unanswered with your visualizations?
I still wonder exactly which factors make heart disease so prevalent within the Black and Hispanic populations.
• What about the visualizations may be unclear to a naive viewer?
From the 1st graph, it may seem that heart disease is vastly more prominent within the white population; however, a closer look shows that this is just a matter of respondent density.
• How could you improve the visualizations for the final project?
Possibly refrain from including the visual that includes the white population. Additionally, labeling each visual with a title and cleaning up the axis labels would look a bit more professional. Coloring the graphs may also help make them more visually appealing.
• What questions are left unanswered with your visualizations?
Once again, I am still curious about the factors that make heart disease more prominent in the Black and Hispanic populations. Additionally, I wonder why the data clearly shows that heart disease is more common in the smoking population, while those same results almost seem reversed when the smoking population is broken down by race. I get the feeling that this may be due to user error or dishonest responses. It also seems that poor mental and physical health are more prominent factors when broken down by racial group. As previously mentioned, I believe these are smaller cogs in a larger machine. I’d be curious to obtain additional information/data sets to flesh out more concrete findings.
• What about the visualizations may be unclear to a naive viewer?
The mental health box plots can be difficult to interpret, and as a result one might think that poor mental health over the last 30 days correlates with a lack of heart disease; a more reasonable assumption would be that the two aren’t concretely related. One might think the same for physical health, but the data from the other visualizations helps draw out different conclusions. The same goes for the smoking visualizations.
Lastly, as with the uni-variate visualizations, the population of white respondents overshadows the remaining populations’ data and may lead a viewer to misinterpret the graphs at a glance.
• How could you improve the visualizations for the final project?
In addition to possibly omitting the white population from the last set of colored visualizations, omitting inconclusive visualizations may help highlight the effective ones and reduce confusion and misinterpretation of the data.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gamez (2022, April 27). Data Analytics and Computational Social Science: HW 4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgamez654895017/
BibTeX citation
@misc{gamez2022hw, author = {Gamez, Alexis}, title = {Data Analytics and Computational Social Science: HW 4}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgamez654895017/}, year = {2022} }