Final Project check-in (1)

Final project (1)

Author

Diana Rinker

Published

March 27, 2023

DACSS 603, spring 2023

Online engagement

Online engagement with a web resource is widely treated as a highly valuable metric that drives site revenue. However, engagement and popularity might also be associated with factors that websites try to avoid, such as online violence, inappropriate behavior, and misinformation. This research project explores which factors affect readers' engagement in social media conversations.

To do that, I will use data from a blog hosted on a news website. The blog's author posts weekly articles about interpersonal relationships. Each article is formatted as a reader's letter to the author, followed by the author's advice about the situation. Readers are free to comment under each post but cannot create posts of their own.

Using this dataset, I will explore how the engagement of the blog's readers is connected with a variety of factors, such as the blog author's engagement, the topic of the post, the source of readers, and readers' inappropriate online behavior.

My research question is: Does the author's engagement in the conversation around a post make readers more engaged and promote positive interactions among them?

DV: My dependent construct is “user engagement”. I will measure users’ engagement at the level of the individual post, using the following variables:

L1 - page viewers: View page

L2 - page readers

* Reveal letter

* Reveal comments

L3 - logged in

* Login / sign up

* Up / down vote

L4 - commenter: Comment

IV: My main independent variable is the blog author's engagement. I will measure the author's engagement as a factor variable with the following levels:

A. Unspecified comment

B. Featured comment

C. Engagement in conversation
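
The three levels above could be stored as an R factor. This is a minimal sketch only: the per-post codes in `codes` are made up, and how a post gets assigned to a level is not specified in this check-in, so that step is an assumption.

```r
# Hypothetical per-post codes; the assignment rule for each post is an assumption.
codes <- c("Featured comment", "Unspecified comment",
           "Engagement in conversation", "Featured comment")

# Fix the level order so that "Unspecified comment" is the reference level
# when the factor is used as a predictor in a regression.
author.engagement <- factor(
  codes,
  levels = c("Unspecified comment", "Featured comment", "Engagement in conversation")
)

levels(author.engagement)
table(author.engagement)
```

Setting the level order explicitly matters because R otherwise sorts levels alphabetically, which would make "Engagement in conversation" the baseline category.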

To control for confounders, I will also measure the following variables:

Topic of the post (“post tag”), a categorical variable.
Source of the readers, also a categorical variable.
Mood of the conversation, a derived continuous variable calculated as the ratio of “likes” to “dislikes”.
Blocked and flagged comments.


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ purrr   1.0.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Data

The data for analysis contains comments of all posts for 2022.

Code

getwd()

[1] "C:/Users/Diana/OneDrive - University Of Massachusetts Medical School/Documents/R/R working directory/DACSS/603/603_Spring_2023/posts"

Code

# read_csv() already returns a tibble, so no extra conversion is needed.
raw <- read_csv("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\globe\\LL Comment Data\\iaexport03242023nb_output.csv")

Rows: 1810511 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): content, user_name, display_name, image_url, email, approved
dbl  (6): message_id, post_id, user_id, parent, absolute_likes, absolute_dis...
lgl  (3): email_verified, created_at, private_profile
dttm (1): written_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

comments.data<-raw 
colnames (comments.data)

 [1] "content"           "message_id"        "post_id"          
 [4] "user_id"           "user_name"         "display_name"     
 [7] "image_url"         "email"             "email_verified"   
[10] "created_at"        "private_profile"   "approved"         
[13] "written_at"        "parent"            "absolute_likes"   
[16] "absolute_dislikes"

Code

head(comments.data$created_at)

[1] NA NA NA NA NA NA

Code

head(comments.data$written_at)

[1] "2012-08-10 09:31:55 UTC" "2012-08-10 15:30:48 UTC"
[3] "2012-08-10 09:32:52 UTC" "2011-01-04 12:39:46 UTC"
[5] "2012-08-12 19:19:35 UTC" "2011-01-05 09:26:07 UTC"

Code

comments.data <-comments.data%>%
              mutate(com.year = format(written_at,format = "%Y" ))
head(comments.data$com.year)

[1] "2012" "2012" "2012" "2011" "2012" "2011"

Code

dim(comments.data)

[1] 1810511      17

Code

comments.2022 <-comments.data  %>%
              filter(com.year =="2022" )
dim(comments.2022)

[1] 45020    17

To answer my research question I will need three datasets that come from different sources.

The first dataset contains comments and their attributes.
The second dataset contains posts and their attributes.
The third dataset contains website analytics, such as page views, scrolls, viewers’ sources, and so on.

To get a better understanding of my data, I will review the variables:

Number of comments, distributed by month

Code

grouped<-comments.2022 %>%
  group_by (post_id)%>%
  summarise(n.comments=n(), 
            post.month = format(first(written_at),format = "%m" ))
grouped

# A tibble: 383 × 3
    post_id n.comments post.month
      <dbl>      <int> <chr>     
 1 27068003        176 12        
 2 27068009        127 12        
 3 27068015         96 12        
 4 27068021        137 12        
 5 27068027        148 12        
 6 27068033        210 12        
 7 27068039        169 12        
 8 27068045        207 12        
 9 27068051        137 12        
10 27068057        194 12        
# … with 373 more rows

Code

# Boxplots of the number of comments per post, one box per month.
ggplot(grouped, mapping = aes(x = post.month, y = n.comments, fill = post.month)) +
  geom_boxplot() +
  labs(title = "Distribution of comments per post by month", x = "Month", y = "Number of comments") +
  scale_y_continuous(breaks = seq(from = 0, to = 10000, by = 100))

This graph shows substantial variation in the number of comments per post across the 12 months of 2022.

Dependent variable. User engagement.

My dependent construct is user engagement. I will measure users’ engagement at the level of the individual post.

Engagement metrics

L1 - page viewers: View page

L2 - page readers

* Reveal letter

* Reveal comments

L3 - logged in

* Login / sign up

* Up / down vote

L4 - commenter: Comment

Mood of the post.

Mood of the post is a numerical variable. Each comment has a certain number of “thumbs up” and “thumbs down” votes, which I will use to calculate the overall “mood” of the post.

Code

# Summarise votes and blocked comments for each post.
grouped <- comments.2022 %>%
  group_by(post_id) %>%
  summarise(n.comments   = n(),
            post.month   = format(first(written_at), format = "%m"),
            likes.sum    = sum(absolute_likes),
            dislikes.sum = sum(absolute_dislikes),
            blocked.sum  = sum(approved == "blocked"),
            # Shares of likes and dislikes among all votes on the post, in percent.
            pct.positive = (sum(absolute_likes) / (sum(absolute_likes) + sum(absolute_dislikes))) * 100,
            pct.negative = (sum(absolute_dislikes) / (sum(absolute_likes) + sum(absolute_dislikes))) * 100
  ) %>%
  # Mood score: ratio of the share of likes to the share of dislikes.
  mutate(mood.score = pct.positive / pct.negative)

Code

ggplot(grouped, mapping = aes(x = pct.positive)) +
  geom_boxplot() +
  labs(title = "Overall mood distribution")

Then I visualize the distribution of mood by month, and the overall variability of the mood score:

Code

ggplot(grouped, mapping = aes(x = post.month, y = mood.score)) +
  geom_boxplot() +
  labs(title = "Distribution of mood by month", y = "Post mood")

Code

ggplot(grouped, mapping = aes(y = mood.score)) +
  geom_boxplot() +
  labs(title = "Overall mood distribution")

This is showing that there are

We can also check the connection between positive and negative sentiment:

Code

x.lm <- lm(pct.positive ~ pct.negative, data = grouped)

Code

plot(pct.negative ~ pct.positive, data = grouped)

The points fall exactly on a straight line: the two percentages sum to 100 for every post, so they are perfectly negatively correlated by construction.
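
The perfect linear relationship can be checked by hand. The sketch below uses the like and dislike totals of the first five posts as they appear in the per-post summary table further down in this document. The vector names are only for illustration.

```r
# Like and dislike totals for the first five posts in the per-post summary table.
likes    <- c(1185, 624, 427, 739, 634)
dislikes <- c(169, 83, 87, 92, 275)

# Same formulas as in the summary: each share is a percentage of all votes.
pct.positive <- likes / (likes + dislikes) * 100
pct.negative <- dislikes / (likes + dislikes) * 100

# The two shares always add up to 100, so one is a linear function of the other.
all.equal(pct.positive + pct.negative, rep(100, 5))

# Correlation of a variable with (100 - itself) is -1 up to floating-point error.
cor(pct.positive, pct.negative)
```

Since the relationship holds by construction, the regression of one share on the other carries no information about readers, and only one of the two shares needs to enter a model.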

Blocked comments per post.

Now I will visualize the number of blocked comments per post:

Code

colnames(comments.2022)

 [1] "content"           "message_id"        "post_id"          
 [4] "user_id"           "user_name"         "display_name"     
 [7] "image_url"         "email"             "email_verified"   
[10] "created_at"        "private_profile"   "approved"         
[13] "written_at"        "parent"            "absolute_likes"   
[16] "absolute_dislikes" "com.year"

Code

ggplot(grouped, mapping = aes(x = post.month, y = blocked.sum)) +
  geom_boxplot() +
  labs(title = "Blocked comments per month")

Blocked comments and Post mood:

Code

plot(blocked.sum ~ mood.score, data = grouped)

Code

summary(grouped)

    post_id           n.comments     post.month       
 Min.   :27068003   Min.   :  1.0   Length:383        
 1st Qu.:27068576   1st Qu.:  1.0   Class :character  
 Median :27069149   Median :149.0   Mode  :character  
 Mean   :27070403   Mean   :117.5                     
 3rd Qu.:27070634   3rd Qu.:187.0                     
 Max.   :27090293   Max.   :313.0

Code

# lm() stops on infinite predictor values, so fit only posts with a finite mood score.
fit <- lm(blocked.sum ~ mood.score, data = filter(grouped, is.finite(mood.score)))

Code

# Number of posts whose mood score is NA, NaN, or infinite.
sum(!is.finite(grouped$mood.score))

Code

summary(fit)

“created_at”

This variable is meant to hold the date of the comment, but its first values in this export are NA, so I use written_at for comment timestamps. Using the range of dates per post, I can estimate how long each post was in active discussion. Later I can compare it with the views of the same post from the third dataset.
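
The length of active discussion per post could be estimated roughly as follows. This is a sketch on made-up timestamps. It uses written_at, the timestamp column that is populated in this export (created_at shows NA in the output above). The names `toy.comments`, `activity`, and `active.days` are illustrative and not part of the dataset.

```r
library(dplyr)

# Toy comments (made-up timestamps); the real data has one row per comment.
toy.comments <- tibble(
  post_id    = c(1, 1, 1, 2, 2),
  written_at = as.POSIXct(
    c("2022-03-01 09:00:00", "2022-03-01 15:00:00", "2022-03-03 09:00:00",
      "2022-04-10 08:00:00", "2022-04-10 20:00:00"),
    tz = "UTC"
  )
)

activity <- toy.comments %>%
  group_by(post_id) %>%
  summarise(
    first.comment = min(written_at),
    last.comment  = max(written_at),
    # Time between the first and the last comment on the post, in days.
    active.days   = as.numeric(difftime(max(written_at), min(written_at), units = "days"))
  )

activity
```

A single late comment can stretch this range considerably, so a more robust variant might use the span that covers, for example, the middle 90% of comments. That choice is left open here.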

Code

# Flag comments written by the blog's author (1 = author, 0 = anyone else).
# %in% treats a missing user_name as "not the author", so NAs need no recoding.
comments.2022$mg.comment <- ifelse(comments.2022$user_name %in% "MeredithGoldstein", 1, 0)

dim(comments.2022)

[1] 45020    18

Code

grouped<-comments.2022 %>%
  group_by (post_id)%>%
  summarise(n.comments=n(), 
            post.month = format(first(written_at),format = "%m" ),
            likes.sum = sum(absolute_likes), 
            dislikes.sum=sum (absolute_dislikes),
            blocked.sum= sum(approved=="blocked"), 
            pct.positive =(sum(absolute_likes)/(sum(absolute_likes)+sum(absolute_dislikes)))*100,
            pct.negative =(sum(absolute_dislikes)/(sum(absolute_likes)+sum(absolute_dislikes)))*100,
            mg.post =sum(mg.comment)               
  )
grouped<-grouped%>%
  mutate(mood.score = pct.positive/pct.negative)  
grouped

# A tibble: 383 × 10
    post_id n.comments post.mo…¹ likes…² disli…³ block…⁴ pct.p…⁵ pct.n…⁶ mg.post
      <dbl>      <int> <chr>       <dbl>   <dbl>   <int>   <dbl>   <dbl>   <dbl>
 1 27068003        176 12           1185     169       1    87.5    12.5       0
 2 27068009        127 12            624      83       3    88.3    11.7       0
 3 27068015         96 12            427      87       2    83.1    16.9       0
 4 27068021        137 12            739      92       7    88.9    11.1       0
 5 27068027        148 12            634     275       2    69.7    30.3       0
 6 27068033        210 12           1104     261       9    80.9    19.1       0
 7 27068039        169 12            863     155       0    84.8    15.2       0
 8 27068045        207 12            988     192       2    83.7    16.3       0
 9 27068051        137 12            657      94       0    87.5    12.5       0
10 27068057        194 12            805     259       1    75.7    24.3       0
# … with 373 more rows, 1 more variable: mood.score <dbl>, and abbreviated
#   variable names ¹post.month, ²likes.sum, ³dislikes.sum, ⁴blocked.sum,
#   ⁵pct.positive, ⁶pct.negative

Code

class(grouped$mg.post)

[1] "numeric"

Code

# Distribution of the number of author comments per post.
ggplot(grouped, mapping = aes(y = mg.post)) +
  geom_boxplot() +
  labs(title = "Author's comments per post", y = "Author comments per post")

Code

grouped$mg.com<-  ifelse (grouped$mg.post>0, 1, 0)

grouped$mg.com <- as.factor(grouped$mg.com)
ggplot(grouped, mapping = aes(y=mood.score, x=mg.com))+
  geom_boxplot()

Warning: Removed 131 rows containing non-finite values (`stat_boxplot()`).
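
The warning above is consistent with mood.score being non-finite for some posts: a ratio of likes to dislikes is Inf when a post has no dislikes and NaN when it has no votes at all. The sketch below shows this on made-up totals and adds one possible bounded alternative. The name `mood.net` and the choice of a net score are assumptions for illustration, not the measure used in this project.

```r
library(dplyr)

# Toy per-post totals (made-up numbers, not taken from the real data).
toy <- tibble(
  post_id      = c(1, 2, 3),
  likes.sum    = c(30, 5, 0),
  dislikes.sum = c(10, 0, 0)
)

toy <- toy %>%
  mutate(
    votes      = likes.sum + dislikes.sum,
    # Ratio of likes to dislikes: Inf with no dislikes, NaN with no votes.
    mood.score = likes.sum / dislikes.sum,
    # Hypothetical bounded alternative in [-1, 1]; NA marks posts without any votes.
    mood.net   = if_else(votes > 0, (likes.sum - dislikes.sum) / votes, NA_real_)
  )

toy
```

A bounded score keeps posts without dislikes in the analysis instead of dropping them, at the cost of changing how the mood variable is interpreted.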

“post_id”

This variable allows me to connect this dataset with the one containing post information.
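
A join on post_id might look like the sketch below. The posts table is hypothetical: its column name `post.tag` and all values are assumptions, since the second dataset is not shown in this check-in.

```r
library(dplyr)

# Per-post summary of comments (toy values).
toy.grouped <- tibble(post_id = c(101, 102, 103), n.comments = c(176, 127, 96))

# Hypothetical posts table; the column name "post.tag" is an assumption.
toy.posts <- tibble(post_id = c(101, 102, 104), post.tag = c("dating", "family", "work"))

# Keep every post that has comments; posts missing from toy.posts get NA attributes.
joined <- left_join(toy.grouped, toy.posts, by = "post_id")

# Posts with comments but no match in the posts table, worth checking before modelling.
unmatched <- anti_join(toy.grouped, toy.posts, by = "post_id")

joined
unmatched
```

Checking the unmatched rows first helps separate real gaps in the posts data from mismatched identifiers.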