HW 4

Descriptive Statistics and Visualization

Alexis Gamez
4/10/2022

Setup

Data Set

For this assignment I used a data set originally provided by the CDC as part of the Behavioral Risk Factor Surveillance System (BRFSS), a system that conducts telephone surveys to gather data on the health status of US residents.

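# read_csv() and the dplyr/ggplot2 functions used below come from the tidyverse
library(tidyverse)
heart <- read_csv("C:/Users/Leshiii/Desktop/DACSS Master's/DACSS 601/HW4/heart_2020_cleaned.csv")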
# Rows: 319795 Columns: 18  

#Confirming that the data set was read in correctly
head(heart)
# A tibble: 6 x 18
  HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth
  <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>
1 No            16.6 Yes     No              No                  3
2 No            20.3 No      No              Yes                 0
3 No            26.6 Yes     No              No                 20
4 No            24.2 No      No              No                  0
5 No            23.7 No      No              No                 28
6 Yes           28.9 Yes     No              No                  6
# ... with 12 more variables: MentalHealth <dbl>, DiffWalking <chr>,
#   Sex <chr>, AgeCategory <chr>, Race <chr>, Diabetic <chr>,
#   PhysicalActivity <chr>, GenHealth <chr>, SleepTime <dbl>,
#   Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>

Data Set Variables

  1. Heart Disease: Character Data (Yes/No)
  2. BMI: Numeric/Doubles Data (16.6/20.3/etc)
  3. Smoking: Character Data (Yes/No)
  4. Alcohol Drinking: Character Data (Yes/No)
  5. Stroke: Character Data (Yes/No)
  6. Physical Health: Numeric Data (1/2/3/etc.)
  7. Mental Health: Numeric Data (1/2/3/etc.)
  8. Diff Walking: Character Data (Yes/No)
  9. Sex: Character Data (Female/Male)
  10. Age Category: Character Data (55-59/65-69/etc.)
  11. Race: Character Data (White/Black/etc.)
  12. Diabetic: Character Data (Yes/No)
  13. Physical Activity: Character Data (Yes/No)
  14. Gen Health: Character Data (Good/Fair/Poor/etc.)
  15. Sleep Time: Numeric Data (1/2/3/etc.)
  16. Asthma: Character Data (Yes/No)
  17. Kidney Disease: Character Data (Yes/No)
  18. Skin Cancer: Character Data (Yes/No)
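Several of the character columns listed above are really ordered categories (Gen Health and Age Category, for example). R sorts character values alphabetically, which is why the Gen Health tables in Part 1 print as Excellent, Fair, Good, Poor, Very good. The following is a minimal sketch of one way to impose a natural order with an ordered factor. It uses a small made-up vector rather than the real data, and the level order is my assumption about the intended ranking.

```r
# Toy stand-in for heart$GenHealth (values are made up for illustration)
gen_health <- c("Good", "Poor", "Very good", "Excellent", "Fair", "Good")

# Assumed ordering from worst to best; adjust if the survey defines it differently
gen_levels <- c("Poor", "Fair", "Good", "Very good", "Excellent")

gen_health_f <- factor(gen_health, levels = gen_levels, ordered = TRUE)

# table() now follows the level order instead of alphabetical order
gen_counts <- table(gen_health_f)
print(gen_counts)
```

The same idea should carry over to plots, since ggplot2 draws discrete axes in factor-level order.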

Part 1

Computing Descriptive Statistics for Numeric & Categorical Data

Numeric Data

# Calculating the mean, median & standard deviation of the Mental Health variable
# z_<variable> objects use every value, including zeros, while nz_<variable> objects filter out all zero values
z_mental <- heart %>%
   summarize(Mental_Mean = mean(MentalHealth), 
          Mental_Median = median(MentalHealth), 
          Mental_SD = sd(MentalHealth))
z_mental
# A tibble: 1 x 3
  Mental_Mean Mental_Median Mental_SD
        <dbl>         <dbl>     <dbl>
1        3.90             0      7.96
nz_mental <- heart %>%
   filter(MentalHealth > 0) %>%
   summarize(Mental_Mean = mean(MentalHealth), 
              Mental_Median = median(MentalHealth), 
              Mental_SD = sd(MentalHealth))
nz_mental
# A tibble: 1 x 3
  Mental_Mean Mental_Median Mental_SD
        <dbl>         <dbl>     <dbl>
1        10.9             6      10.0
# Calculating the mean, median & standard deviation of the Physical Health variable
z_phys <- heart %>%
   summarize(Phys_Mean = mean(PhysicalHealth),
             Phys_Median = median(PhysicalHealth), 
             Phys_SD = sd(PhysicalHealth))
z_phys
# A tibble: 1 x 3
  Phys_Mean Phys_Median Phys_SD
      <dbl>       <dbl>   <dbl>
1      3.37           0    7.95
nz_phys <- heart %>%
   filter(PhysicalHealth > 0) %>%
   summarize(Phys_Mean = mean(PhysicalHealth),
             Phys_Median = median(PhysicalHealth), 
             Phys_SD = sd(PhysicalHealth))
nz_phys
# A tibble: 1 x 3
  Phys_Mean Phys_Median Phys_SD
      <dbl>       <dbl>   <dbl>
1      11.6           6    11.0
# Calculating the mean, median & standard deviation of the Sleep Time variable
z_sleep <- heart %>%
   summarize(Sleep_Mean = mean(SleepTime), 
             Sleep_Median = median(SleepTime), 
             Sleep_SD = sd(SleepTime))
z_sleep
# A tibble: 1 x 3
  Sleep_Mean Sleep_Median Sleep_SD
       <dbl>        <dbl>    <dbl>
1       7.10            7     1.44
# Unnecessary considering all values are above 0, included for consistency
nz_sleep <- heart %>%
   filter(SleepTime > 0) %>%
   summarize(Sleep_Mean = mean(SleepTime), 
             Sleep_Median = median(SleepTime), 
             Sleep_SD = sd(SleepTime))
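The six summaries above repeat the same three calculations with and without zeros. As a sketch only, the pattern could be wrapped in a small helper. The function name summarise_days and the toy vector are hypothetical and not part of the original analysis. The helper also assumes there are no missing values, which would need na.rm = TRUE.

```r
# Hypothetical helper (not part of the original analysis): summarise a numeric
# vector, optionally dropping zeros first. Assumes x contains no NA values.
summarise_days <- function(x, drop_zeros = FALSE) {
  if (drop_zeros) {
    x <- x[x > 0]
  }
  c(mean = mean(x), median = median(x), sd = sd(x))
}

# Toy data standing in for a column such as heart$MentalHealth
days <- c(0, 0, 0, 2, 4, 30)

with_zeros    <- summarise_days(days)
without_zeros <- summarise_days(days, drop_zeros = TRUE)
print(rbind(with_zeros, without_zeros))
```

With the real data, a call such as summarise_days(heart$MentalHealth, drop_zeros = TRUE) would be expected to match the nz_mental figures, though I have not run it against the file.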

Categorical Data

# Calculating the proportion of all participants and participants reporting heart disease according to their race.
race <- select(heart, Race)
p_race <- prop.table(table(race))*100
p_race
race
American Indian/Alaskan Native                          Asian 
                      1.626667                       2.522866 
                         Black                       Hispanic 
                      7.173033                       8.582373 
                         Other                          White 
                      3.417189                      76.677872 
race_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(Race)
prop.table(table(race_heart))*100
race_heart
American Indian/Alaskan Native                          Asian 
                     1.9800533                      0.9717605 
                         Black                       Hispanic 
                     6.3164432                      5.2716180 
                         Other                          White 
                     3.2367662                     82.2233588 
# Because the proportion of white participants is so high, it may overshadow the results for the less represented races. As a remedy, I created a separate data set excluding white participants.
nw_race_heart <- heart %>%
   filter(`HeartDisease` == "Yes", `Race` != "White") %>%
   select(Race)
prop.table(table(nw_race_heart))*100
nw_race_heart
American Indian/Alaskan Native                          Asian 
                     11.138512                       5.466502 
                         Black                       Hispanic 
                     35.532265                      29.654747 
                         Other 
                     18.207974 
# Calculating the proportion of all participants and participants reporting heart disease according to their age group.
age <- select(heart, AgeCategory)
p_age <- prop.table(table(age))*100
p_age
age
      18-24       25-29       30-34       35-39       40-44 
   6.586720    5.301834    5.864069    6.425992    6.568583 
      45-49       50-54       55-59       60-64       65-69 
   6.814053    7.936960    9.305024   10.533623   10.679029 
      70-74       75-79 80 or older 
   9.714036    6.717428    7.552651 
age_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(AgeCategory)
prop.table(table(age_heart))*100
age_heart
      18-24       25-29       30-34       35-39       40-44 
  0.4749205   0.4858802   0.8256311   1.0813575   1.7754722 
      45-49       50-54       55-59       60-64       65-69 
  2.7180068   5.0524239   8.0444233  12.1543126  14.9819165 
      70-74       75-79 80 or older 
 17.7072298  14.7919483  19.9064772 
# Calculating the proportion of all participants and participants reporting heart disease according to their reported General Health in the last 30 days.
gen <- select(heart, GenHealth)
p_gen <- prop.table(table(gen))*100
p_gen
gen
Excellent      Fair      Good      Poor Very good 
20.901515 10.843509 29.121468  3.530074 35.603433 
gen_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(GenHealth)
prop.table(table(gen_heart))*100
gen_heart
Excellent      Fair      Good      Poor Very good 
 5.479852 25.879516 34.917620 14.064955 19.658057 
# Calculating the proportion of all participants and participants reporting heart disease according to their reported physical activity.
physical <- select(heart, PhysicalActivity)
p_physical <- prop.table(table(physical))*100
p_physical
physical
      No      Yes 
22.46377 77.53623 
phys_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(PhysicalActivity)
prop.table(table(phys_heart))*100
phys_heart
      No      Yes 
36.10857 63.89143 
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they smoke.
smoke <- select(heart, Smoking)
p_smoke <- prop.table(table(smoke))*100
p_smoke
smoke
      No      Yes 
58.75233 41.24767 
smoke_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(Smoking)
prop.table(table(smoke_heart))*100
smoke_heart
      No      Yes 
41.41307 58.58693 
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they drink alcohol regularly.
alcohol <- select(heart, AlcoholDrinking)
p_alcohol <- prop.table(table(alcohol))*100
p_alcohol
alcohol
       No       Yes 
93.190325  6.809675 
alc_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(AlcoholDrinking)
prop.table(table(alc_heart))*100
alc_heart
       No       Yes 
95.831659  4.168341 
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they have had a stroke.
stroke <- select(heart, Stroke)
p_stroke <- prop.table(table(stroke))*100
p_stroke
stroke
      No      Yes 
96.22602  3.77398 
stroke_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(Stroke)
prop.table(table(stroke_heart))*100
stroke_heart
      No      Yes 
83.96595 16.03405 
# Calculating the proportion of all participants and participants reporting heart disease according to whether or not they have diabetes.
diabetes <- select(heart, Diabetic)
p_diabetes <- prop.table(table(diabetes))*100
p_diabetes
diabetes
                     No No, borderline diabetes 
             84.3205804               2.1204209 
                    Yes  Yes (during pregnancy) 
             12.7587986               0.8002001 
diab_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(Diabetic)
prop.table(table(diab_heart))*100
diab_heart
                     No No, borderline diabetes 
             64.0010229               2.8824024 
                    Yes  Yes (during pregnancy) 
             32.7220254               0.3945494 
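The tables above describe the make-up of the heart disease cases (for example, what share of cases smoke). A complementary view is the share of each group that reports heart disease. A minimal sketch of that calculation follows, using a made-up toy data frame in place of heart. The row-wise proportions come from the margin argument of prop.table().

```r
# Toy data frame with made-up values, mimicking two columns of `heart`
toy <- data.frame(
  Smoking      = c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"),
  HeartDisease = c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No")
)

# Cross-tabulate (rows = Smoking, columns = HeartDisease), then take
# row-wise proportions (margin = 1) so that each row sums to 100
rate_by_smoking <- prop.table(table(toy$Smoking, toy$HeartDisease), margin = 1) * 100
print(rate_by_smoking)
```

Applied to the real columns, the "Yes" column of such a table would give the percentage of smokers and non-smokers who report heart disease, which is a rate rather than a share of cases.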

Part 2

Creating Uni-variate & Bi-variate Visualizations of the Data Set

Uni-variate Visualizations

# The first thing I wanted to visualize was whether heart disease was more prevalent in certain age groups than in others. The bar graph below supports this: heart disease cases are much more common among participants ages 55 and up.
ggplot(age_heart, aes(AgeCategory)) + geom_bar()
# Similarly, I wanted to create a visual that depicts the frequency of heart disease within each racial group. As mentioned above, the overwhelming number of white participants obscures the results for the other participants. As a result, I've also included a second plot that excludes white participants so the non-white participant data is easier to read.
ggplot(race_heart, aes(Race)) + geom_bar()
ggplot(nw_race_heart, aes(Race)) + geom_bar()

Bi-variate Visualizations

# Creating data sets that include all previously analyzed variables considering only those that reported having heart disease.
w_heart <- heart %>%
   filter(`HeartDisease` == "Yes") %>%
   select(Race, 
          AgeCategory, 
          PhysicalActivity, 
          Smoking, 
          AlcoholDrinking, 
          Stroke, 
          Diabetic,
          MentalHealth,
          PhysicalHealth,
          SleepTime,
          GenHealth)

nw_w_heart <- heart %>%
   filter(`HeartDisease` == "Yes", `Race` != "White") %>%
   select(Race, 
          AgeCategory, 
          PhysicalActivity, 
          Smoking, 
          AlcoholDrinking, 
          Stroke, 
          Diabetic,
          MentalHealth,
          PhysicalHealth,
          SleepTime,
          GenHealth)
# Gauging correlation between mental health, physical health and heart disease according to age.
ggplot(w_heart, aes(AgeCategory, MentalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Age & Mental Health over the Last 30 Days")
ggplot(w_heart, aes(AgeCategory, PhysicalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Age & Physical Health over the Last 30 Days")
# Gauging correlation between mental health, physical health and heart disease according to race.
ggplot(w_heart, aes(Race, MentalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Race & Mental Health over the Last 30 Days")
ggplot(w_heart, aes(Race, PhysicalHealth)) + 
  geom_boxplot() +
  labs(title = "Heart Disease Cases per Race & Physical Health over the Last 30 Days")
# Analyzing correlation between smoking activity & heart disease according to age & race.
ggplot(w_heart, aes(AgeCategory)) + 
  geom_bar() + 
  labs(title = "Heart Disease Cases per Age & Smoking Activity") + 
  theme_bw() +
  facet_wrap(vars(Smoking))
ggplot(nw_w_heart, aes(Race)) + 
  geom_bar() + 
  labs(title = "Heart Disease Cases per Race & Smoking Activity w/o White Participants") + 
  theme_bw() +
  facet_wrap(vars(Smoking))
# Analyzing the prevalence of heart disease in different racial groups according to age.
ggplot(w_heart, aes(AgeCategory, fill = Race)) + 
  geom_bar() + 
  labs(title = "Heart Disease in Different Age Groups according to Race") + 
  theme_bw()
ggplot(nw_w_heart, aes(AgeCategory, fill = Race)) + 
  geom_bar() + 
  labs(title = "Heart Disease in Different Age Groups according to Race w/o White Participants") + 
  theme_bw()
# Similarly, analyzing the prevalence of heart disease in different age groups according to race.
#ggplot(nw_w_heart, aes(Race, fill = AgeCategory)) + 
    #geom_histogram(stat = "count") + 
    #labs(title = "Heart Disease in Different Races according to Age w/o White Participants") + 
    #theme_bw()

# Analyzing the prevalence of heart disease according to reported general health & race.
#ggplot(w_heart, aes(GenHealth, fill = Race)) + 
    #geom_histogram(stat = "count") + 
    #labs(title = "Heart Disease according to Reported General Health & Race") + 
    #theme_bw()

#ggplot(nw_w_heart, aes(GenHealth, fill = Race)) + 
    #geom_histogram(stat = "count") + 
    #labs(title = "Heart Disease according to Reported General Health & Race w/o White Participants") + 
    #theme_bw()

Part 3

Explaining the Visualizations

Uni-variate Visualizations

What variable(s) are you visualizing?

I created three uni-variate visualizations to further analyze two variables, age & race. For each, I used a subset (age_heart, race_heart & nw_race_heart) of the original “heart” data set that I loaded at the beginning of this assignment. By filtering the initial data frame, I was able to create versions that include only the respondents who reported heart disease. What we see above, therefore, is the distribution of positive heart disease cases by age and race.

What question(s) are you attempting to answer with the visualization?

At a glance, I was not able to draw any direct correlation between specific variables and heart disease. Because the data seemed random at first, I wanted to confirm my suspicion that heart disease is closely tied to age. The visualization agrees with the proportional spread in the age table above: heart disease becomes more and more common with age.

For the last two visualizations, I wanted to know whether heart disease is more common in certain racial groups than in others. Whether the cause is socioeconomic or purely biological, I suspected some sort of disproportionate relationship that makes heart disease more common within certain races.

What conclusions can you make from the visualization?

I was able to conclude that there is some sort of correlation between heart disease and age, with cases becoming increasingly common after the age of 40 or 50.

Analyzing the data by race was a tad more complicated. The overwhelming number of white participants obscured the results of the other racial groups, so I had to filter out white participants in order to view the other results more clearly. This made the other groups easier to see, but it substantially reduced the size of the data frame. Still, a few thousand entries remain valuable. Among non-white participants, heart disease cases are most numerous in the Black & Hispanic groups and least numerous among those who identify as Asian. Because these are counts of cases rather than rates, they partly reflect how many respondents each group has (see the proportions of all participants in Part 1). I try to further analyze the origin of the relationship between race and heart disease using additional variables in the bi-variate visualizations.
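Counts among cases depend on how many respondents each group has, so one way to sanity-check the comparison is to divide each group's share of heart disease cases by its share of all respondents. The sketch below does this with the percentages printed in the p_race and race_heart tables in Part 1. The "representation" ratio is my own illustrative name, and it is a descriptive figure only that does not adjust for age or any other factor.

```r
# Percentages copied from the p_race and race_heart tables printed in Part 1
groups <- c("American Indian/Alaskan Native", "Asian", "Black",
            "Hispanic", "Other", "White")
share_all   <- c(1.626667, 2.522866, 7.173033, 8.582373, 3.417189, 76.677872)
share_cases <- c(1.9800533, 0.9717605, 6.3164432, 5.2716180, 3.2367662, 82.2233588)

# Ratio > 1: the group makes up a larger share of cases than of respondents
# Ratio < 1: the group makes up a smaller share of cases than of respondents
representation <- setNames(share_cases / share_all, groups)
print(round(representation, 2))
```

A ratio near 1 would mean a group appears among the cases about as often as its sample size alone would predict.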

Bi-variate Visualizations

What variable(s) are you visualizing?

In addition to the aforementioned variables, age & race, I also decided to visualize mental health, physical health & smoking. Primarily, I use the additional variables to reinforce the analysis from my uni-variate visualizations.

What question(s) are you attempting to answer with the visualization?

Within the first two box-plot visualizations, per age & mental/physical health, I wanted to try and draw a relation between both new variables and the age category. I wanted to answer the questions of whether mental & physical health become increasingly poor with age and whether that factors into the likelihood of heart disease.

I wanted to do something similar with the following two box-plot visualizations per race & mental/physical health. Depending on the results, they might raise further questions concerning socioeconomic and genetic variations among different racial groups.

The next two visualizations use the smoking variable. Using the facet_wrap function, I was able to generate side-by-side comparisons of heart disease counts for the smoking & non-smoking populations. My goal was to answer the question of whether or not smoking increases your chance of heart disease.
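Faceted counts show how many cases fall in each panel, but not what fraction of each group is affected. As a hedged alternative, a filled bar chart rescales every bar to 1 so the segments read as proportions. The sketch below uses a made-up toy data frame that keeps both the "Yes" and "No" heart disease rows (unlike w_heart, which holds cases only). The object names toy and p_fill are hypothetical.

```r
library(ggplot2)

# Toy data with made-up values, mimicking three columns of `heart`
toy <- data.frame(
  AgeCategory  = rep(c("40-44", "60-64", "80 or older"), each = 4),
  Smoking      = rep(c("Yes", "Yes", "No", "No"), times = 3),
  HeartDisease = c("No", "No", "No", "No",
                   "Yes", "No", "No", "No",
                   "Yes", "Yes", "Yes", "No")
)

# position = "fill" rescales each bar to 1, so bar segments show the
# proportion with and without heart disease rather than raw counts
p_fill <- ggplot(toy, aes(AgeCategory, fill = HeartDisease)) +
  geom_bar(position = "fill") +
  facet_wrap(vars(Smoking)) +
  labs(title = "Share Reporting Heart Disease by Age & Smoking (toy data)",
       y = "Proportion") +
  theme_bw()
p_fill
```

Swapping toy for the full heart data frame should give the corresponding plot for the survey, since the three column names match those in the data set.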

Lastly, the final two visualizations were a sort of cross analysis of the two primary variables (age & race) used in the uni-variate visualizations. I wanted to take a deeper look into the prominence of heart disease within the different racial groups and see whether or not I could draw any definite conclusions as to whether heart disease is truly more common within certain races than others.

What conclusions can you make from the visualization?

Within the first two box-plots (per age & mental/physical health), one can draw the conclusion that after the age of 35-39, physical health starts to become a regular complaint among the population. It seems like there might be a loose connection between physical health and heart disease, but I can’t say for certain. Unfortunately, I can’t determine if there’s any concrete tie between reported Mental and Physical health according to age with heart disease.

Similarly, I analyzed both mental & physical health according to race. In the visualizations, there is a large spread among the racial groups that had higher counts of heart disease in the uni-variate visualizations, leading me to believe that mental and physical health play some sort of role in the grand scheme of things. However, the variation within the groups that previously had lower counts of heart disease is preventing me from drawing any concrete correlations. In conclusion, I believe that mental and physical health are just components of a much more prominent factor that has yet to be introduced.

The next two visualizations were a bit more straightforward to analyze. From the first, I was able to conclude that smoking does indeed increase your chances of heart disease over time, as the number of positive cases is substantially higher in the smoking population than in the non-smoking population. The second visual was not so conclusive. Unfortunately, I wasn’t able to draw anything concrete from it, but it did raise some questions concerning reporting accuracy and user input.

The last visualizations further confirm that the chances of heart disease increase with age, and they suggest that the odds are greater in the Black and Hispanic populations. Unfortunately, this does not take the white population into account. That said, the Black and Hispanic populations have the highest counts of heart disease in every age category among the remaining groups, and if I were to guess, were there a proportionate number of respondents compared to those who are white, we would clearly see the same result there as well.

Part 4

Identifying the Limitations of the Visualizations

Uni-variate Visualizations

What questions are left unanswered with your visualizations?

I still wonder exactly which factors make heart disease so prevalent within the Black and Hispanic populations.

What about the visualizations may be unclear to a naive viewer?

From the first race graph, it may seem that heart disease is vastly more prevalent within the white population; however, a closer look shows that it is just a matter of respondent density.

How could you improve the visualizations for the final project?

I could possibly refrain from including the visual that includes the white population. Additionally, I could label each visual with a title and clean up the axis labels to look a bit more professional. Coloring the graphs may also help to make them more visually appealing.
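As a rough sketch of those improvements (a title, cleaner axis labels and color), the plot below uses a small made-up data frame in place of race_heart or nw_race_heart. The object names toy_cases and p_labeled, and the label text, are hypothetical choices rather than anything from the original analysis.

```r
library(ggplot2)

# Toy stand-in for race_heart / nw_race_heart (values are made up)
toy_cases <- data.frame(
  Race = c("Black", "Black", "Hispanic", "Asian", "Other", "Hispanic", "Black")
)

p_labeled <- ggplot(toy_cases, aes(Race, fill = Race)) +
  geom_bar(show.legend = FALSE) +  # legend would only repeat the x axis
  labs(title = "Heart Disease Cases by Race (toy data)",
       x = "Race", y = "Number of cases") +
  theme_bw() +
  # tilt the category labels so that long group names do not overlap
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_labeled
```

Mapping fill to the same variable as the x axis adds color without adding information, so the legend is switched off here. Whether that is preferable to a single fill color is a matter of taste.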

Bi-variate Visualizations

What questions are left unanswered with your visualizations?

Once again, I am still curious about the factors that make heart disease more prominent in the Black and Hispanic populations. Additionally, I wonder why the data clearly show that heart disease is more common in the smoking population, yet those results almost seem reversed when the smoking population is broken down by race. I get the feeling that this may be due to user error or a lack of honesty in responses. It also seems that poor mental and physical health are more prominent factors when broken down by racial group. As previously mentioned, I believe these are smaller cogs in a larger machine. I’d be curious to obtain additional information/data sets to flesh out more concrete conclusions.

What about the visualizations may be unclear to a naive viewer?

The mental health box plots can be difficult to interpret. As a result, one might think that poor mental health over the last 30 days correlates with a lack of heart disease, but a more reasonable assumption would be that the two aren’t concretely related. One might think the same for physical health, but seeing the data from the other visualizations helps draw out different conclusions. The same goes for the smoking visualizations.

Lastly, as in the uni-variate visualizations, the population of white respondents overshadows the remaining populations' data and may lead a viewer to misinterpret the plots at a glance.

How could you improve the visualizations for the final project?

In addition to possibly omitting the white population from the last set of colored visualizations, omitting inconclusive visualizations may help to further highlight the effective ones and reduce confusion and misinterpretation of the data.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Gamez (2022, April 27). Data Analytics and Computational Social Science: HW 4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgamez654895017/

BibTeX citation

@misc{gamez2022hw,
  author = {Gamez, Alexis},
  title = {Data Analytics and Computational Social Science: HW 4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomgamez654895017/},
  year = {2022}
}