Data Analytics and Computational Social Science: HW #4

Katie Popiela

1. Read in your dataset and compute descriptive statistics for each of your variables using dplyr. This should include mean, median, and SD for numeric variables and frequencies for categorical variables. Use group_by() and summarise() to compute mean, median, and SD for any relevant groupings.

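library(dplyr)
library(tidyverse)    # also attaches ggplot2 for the plots below
library(poliscidata)  # provides the gss data frame
data(gss)

# Preview the three variables of interest
gss %>%
  select(polviews, sex, degree) %>%
  head(25) %>%
  as_tibble()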

# A tibble: 25 x 3
   polviews  sex    degree      
   <fct>     <fct>  <fct>       
 1 Moderate  Male   Bachelor deg
 2 SlghtCons Male   HS          
 3 SlghtCons Male   HS          
 4 SlghtCons Female HS          
 5 Liberal   Female Bachelor deg
 6 Moderate  Female Bachelor deg
 7 Moderate  Female Junior Coll 
 8 Moderate  Female <HS         
 9 Conserv   Female <HS         
10 Liberal   Female Bachelor deg
# ... with 15 more rows

library(poliscidata)
data(gss)

# Keep only the three variables of interest, then tabulate each one
gss_refined <- gss %>%
  select(polviews, sex, degree)
summary(gss_refined)

      polviews       sex                degree   
 Moderate :713   Male  : 886   <HS         :288  
 Conserv  :292   Female:1088   HS          :976  
 SlghtCons:268                 Junior Coll :151  
 Liberal  :244                 Bachelor deg:354  
 SlghtLib :208                 Graduate deg:205  
 (Other)  :149                                   
 NA's     :100
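Since all three of these variables are categorical, the grouped statistics that make sense here are frequencies and shares rather than means. The following is a minimal sketch, assuming the same gss data frame from poliscidata; the names polviews_by_sex and share are illustrative and not part of the original analysis.

```r
library(poliscidata)
library(dplyr)   # attached last so its verbs are not masked
data(gss)

# Count respondents in each sex-by-polviews cell, then convert the
# counts to shares within each sex so the two groups are comparable.
polviews_by_sex <- gss %>%
  filter(!is.na(polviews)) %>%   # drop respondents with no recorded view
  count(sex, polviews) %>%
  group_by(sex) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

polviews_by_sex
```

Swapping sex for degree in count() and group_by() would give the same breakdown by education level.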

2. Create at least 2 visualizations using your final project dataset
3. Explain each visualization.
4. Identify limitations of said visualizations

Visualization #1

ggplot(data=gss_refined,aes(x=polviews,fill=degree))+
  geom_bar() + labs(x="Political Views", fill ="Highest Degree Awarded")

This visualization plots a single variable (polviews) on the x-axis, with each bar split by the highest degree awarded to the survey’s respondents. The stacked bar graph is a good way to view the relationship between education and political views, but it leaves out sex, and I am ultimately looking to visualize the impact that both sex and education have on respondents’ political views.
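Because the bars have very different heights, the education mix within each political view can be hard to compare by eye. One possible variation, sketched here under the assumption that shares matter more than raw counts, rescales every bar to 100% with position = "fill". The names gss_known and p_share are illustrative.

```r
library(poliscidata)
library(ggplot2)
data(gss)

# Drop respondents with no recorded political view so NA does not get its own bar.
gss_known <- gss[!is.na(gss$polviews), ]

# position = "fill" stacks each bar to a height of 1, so the segments
# show the share of each degree within a political view.
p_share <- ggplot(gss_known, aes(x = polviews, fill = degree)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = function(x) paste0(round(100 * x), "%")) +
  labs(x = "Political Views", y = "Share of Respondents",
       fill = "Highest Degree Awarded")

p_share
```

The trade-off is that this version hides how many respondents fall in each political view, which the original count-based chart shows.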

Note: I am still debating whether to use age instead of sex, so the last graphs below use age.

Visualization #2

data(gss)
ggplot(gss_refined,aes(x=degree, y=polviews,color=sex)) +
  geom_jitter(width=0.2)+labs(x="Highest Degree Awarded",y="Political Views")

Unlike the first, this visualization includes all three variables I want to focus on. However, it does not lend itself to exact (or even approximate) measurement the way the bar graph above does, because both axes are categorical and neither one shows counts. Next, I will try to incorporate numerical values, ideally without removing any variables (though I am going to try switching one).
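One way to put numbers on a plot like this without dropping a variable is to count the respondents in each cell and print the count on the plot. This is only a sketch of that idea, assuming the gss data from poliscidata; cell_counts and p_cells are illustrative names.

```r
library(poliscidata)
library(dplyr)   # attached last so its verbs are not masked
library(ggplot2)
data(gss)

# Number of respondents for every sex x degree x polviews combination.
cell_counts <- gss %>%
  filter(!is.na(polviews), !is.na(degree)) %>%
  count(sex, degree, polviews)

# One tile per degree/polviews cell, labeled with its count,
# with a separate panel for each sex.
p_cells <- ggplot(cell_counts, aes(x = degree, y = polviews, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white", size = 3) +
  facet_wrap(~ sex) +
  labs(x = "Highest Degree Awarded", y = "Political Views",
       fill = "Respondents")

p_cells
```

Compared with the jittered points, the tiles give exact counts but no longer show individual respondents.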

data(gss)
gss_age_ed<-gss %>%
  select(polviews,age,degree)
summary(gss_age_ed)

      polviews        age                 degree   
 Moderate :713   Min.   :18.00   <HS         :288  
 Conserv  :292   1st Qu.:33.00   HS          :976  
 SlghtCons:268   Median :47.00   Junior Coll :151  
 Liberal  :244   Mean   :48.19   Bachelor deg:354  
 SlghtLib :208   3rd Qu.:61.00   Graduate deg:205  
 (Other)  :149   Max.   :89.00                     
 NA's     :100   NA's   :5
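With a numeric variable available, the grouped mean, median, and SD from question 1 can be computed with group_by() and summarise(). A minimal sketch, assuming the same gss data; age_by_degree and its column names are illustrative.

```r
library(poliscidata)
library(dplyr)   # attached last so its verbs are not masked
data(gss)

# Mean, median, and SD of age within each education level.
# na.rm = TRUE skips the few missing ages instead of returning NA.
age_by_degree <- gss %>%
  group_by(degree) %>%
  summarise(
    n          = n(),
    mean_age   = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE),
    sd_age     = sd(age, na.rm = TRUE)
  )

age_by_degree
```

Replacing degree with polviews in group_by() gives the same statistics for each political view.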

Visualization #3

data(gss)
# fill (not color) shades the bars and matches the fill label in labs()
ggplot(gss_age_ed,aes(x=age,fill=polviews))+
  geom_bar()+labs(x="Age", fill ="Political Views")

This visualization is more exact measurement-wise, since the y-axis shows the number of respondents at each age, but its main limitation is that it does not account for the relationship between education and political views.
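One possible way to bring education back into this chart is to draw a separate panel for each degree level. The sketch below also groups ages into 5-year bins, which is an assumption made here to keep the small panels readable; gss_complete and p_age are illustrative names.

```r
library(poliscidata)
library(ggplot2)
data(gss)

# Keep respondents with both an age and a recorded political view.
gss_complete <- gss[!is.na(gss$age) & !is.na(gss$polviews), ]

# Histogram of age in 5-year bins, shaded by political view,
# with one panel per education level.
p_age <- ggplot(gss_complete, aes(x = age, fill = polviews)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ degree) +
  labs(x = "Age", y = "Respondents", fill = "Political Views")

p_age
```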

Visualization #4

data(gss)
ggplot(gss_age_ed,aes(x=age,y=degree,color=polviews))+
  geom_jitter(width=0.2)+labs(x="Age",y="Highest Degree Awarded")

This scatterplot is, in my opinion, one of the best visualizations for what I’m looking for (the impact of age and education on political views). As can be seen above, the vast majority of respondents have a high school diploma, and based on the point colors, a substantial number of them have political views ranging from “SlghtLib” to “SlghtCons.” However, the main limitation of this graph is that it does not show exact measurements: statistics such as the mean, median, and SD cannot be read off it visually.
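A partial remedy for that limitation is to overlay a boxplot on the points, which marks the median and quartiles of age for each degree level (the mean and SD would still need to be computed separately). This is a sketch only; it assumes ggplot2 3.3.0 or later, which draws horizontal boxplots when the discrete variable is on the y-axis, and gss_age and p_box are illustrative names.

```r
library(poliscidata)
library(ggplot2)
data(gss)

# Drop the few respondents with no recorded age.
gss_age <- gss[!is.na(gss$age), ]

# Jittered points colored by political view, with a transparent
# boxplot on top showing the median and quartiles of age per degree.
p_box <- ggplot(gss_age, aes(x = age, y = degree)) +
  geom_jitter(aes(color = polviews), width = 0, height = 0.2, alpha = 0.4) +
  geom_boxplot(fill = NA, outlier.shape = NA) +
  labs(x = "Age", y = "Highest Degree Awarded", color = "Political Views")

p_box
```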
