Final Project check-in 2

Final project DACSS 603

Author

Diana Rinker

Published

April 24, 2023

DACSS 603, spring 2023

Final Project check-in 2, Diana Rinker.

Introduction. Online engagement

It is well known that online engagement and popularity with the web resource is a highly valuable metric and is driving site revenue. This research project is exploring which factors contribute to users online engagement.

To do that I will use the data of an online blog on the news website. The author of this blog is posting articles about interpersonal relationships every work day (Mon- Fri). The posts are formulated as a letter from a reader with the situation and a question about relationships. The author gives an advice about the situation. Website readers are free to comment under each post, but cannot make their own posts.

All post methadata and comments are public. They are saved by the website and available for the analysis. Using this data set, I will explore how readers’ engagement connected with blogs’s author engagement, site comments’, web source of readers and negative behaviors online.

My research question is: Does the authors engagement in the conversation around the post makes readers more engaged and promotes positive interactions among them?

DV: My dependent construct is “user’s engagement”, I will measure users’ engagement at the level of individual post, using the following metrics:

Page views
Page visits (one visit can contain a few views, if the person visits other pages on the site)
Unique users. One person can visit the same page a few times (it is calculated for all time since the page was posted).
Number of votes: “thums up” or “thumbs down”
Number of comments
Exit rate or “bounces”. When the visitor is coming to the page and then leaving, i.e. not opening other pages on this website.

L2 - page readers * Reveal letter * Reveal comments

IV: My main independent variable is Blog’s author engagement. I will measure authors engagement as the factor variable, with the following levels:

A. Unspecified comment

B. Featured comment

C. Engagement in conversation

To control for confounders, I will also measure the follwing variables:

Topic of the post (“post tag”), categorical variable.
Source of the readers, also categorical variable.
Mood of the conversation , derivative continuous variable calculated as the ratio of “likes” to “dislikes”.
Blocked and flagged comments.


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ tibble  3.1.8     ✔ purrr   1.0.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Data source and description.

To answer my research question I will use two datasets. the first data set has information about all comments associated with each post by post ID. The second data set is analytics data for the web bage. It contains one post per row and variables describe each post as a whole without breaking down to the comment level.

In this project I will analyze posts for January 2021 - February 2023. Here is the list of variables in each dataset: Post-level data:

Code

getwd()

[1] "C:/Users/Diana/OneDrive - University Of Massachusetts Medical School/Documents/R/R working directory/DACSS/603/603_Spring_2023/posts"

Code

# First, I will load the data set with the comment level data:
raw <- as_tibble (read_csv("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\globe\\ data.2021.plus.csv"))

Rows: 105136 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): content, user_name, display_name, image_url, email, approved
dbl  (7): message_id, post_id, user_id, parent, absolute_likes, absolute_dis...
lgl  (3): email_verified, created_at, private_profile
dttm (1): written_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

comments.data<-raw 
colnames (comments.data)

 [1] "content"           "message_id"        "post_id"          
 [4] "user_id"           "user_name"         "display_name"     
 [7] "image_url"         "email"             "email_verified"   
[10] "created_at"        "private_profile"   "approved"         
[13] "written_at"        "parent"            "absolute_likes"   
[16] "absolute_dislikes" "comment.year"

Code

head(comments.data$written_at)

[1] "2021-01-01 08:28:44 UTC" "2021-01-01 08:57:08 UTC"
[3] "2021-01-01 10:36:58 UTC" "2021-01-01 11:13:13 UTC"
[5] "2021-01-01 12:27:44 UTC" "2021-01-01 12:51:48 UTC"

Code

comments.data <-comments.data%>%
              mutate(com.year = format(written_at,format = "%Y" ))
# range(comments.data$com.year)
# dim(comments.data)
str(comments.data)

tibble [105,136 × 18] (S3: tbl_df/tbl/data.frame)
 $ content          : chr [1:105136] "08:15  Dawn of the 21st year of the 21st century<br/>&quot;21st Century Schizoid Man&quot;  on the BOSE" "Liz:  😄" "Happy New Year to all y'all. And Ms. G, thank you for letting me play for another year playing in your sandbox." "Blog needs a new name - Hate Letters" ...
 $ message_id       : num [1:105136] 1.15e+08 1.15e+08 1.15e+08 1.15e+08 1.15e+08 ...
 $ post_id          : num [1:105136] 27071015 27071015 27071009 27071009 27071009 ...
 $ user_id          : num [1:105136] 2889100 1855822 5133556 5156343 5560421 ...
 $ user_name        : chr [1:105136] "Lefty49" "JacquiSmith" "Bzznlike-crazyman" "--SnowMan--" ...
 $ display_name     : chr [1:105136] "Lefty49" "JacquiSmith" "Bzznlike-crazyman" "--SnowMan--" ...
 $ image_url        : chr [1:105136] "https://u.o0bc.com/avatars/48/121/77/5077296.png?35" "https://u.o0bc.com/avatars/41/194/84/5554729.png?47" "https://u.o0bc.com/avatars/85/175/84/5549909.png?9" "https://u.o0bc.com/avatars/64/191/84/5553984.png?5" ...
 $ email            : chr [1:105136] "danwalker1949@aol.com" "amcr1124@comcast.net" "eflynn105@yahoo.com" "markedoc@yahoo.com" ...
 $ email_verified   : logi [1:105136] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ created_at       : logi [1:105136] NA NA NA NA NA NA ...
 $ private_profile  : logi [1:105136] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ approved         : chr [1:105136] "approved" "approved" "approved" "approved" ...
 $ written_at       : POSIXct[1:105136], format: "2021-01-01 08:28:44" "2021-01-01 08:57:08" ...
 $ parent           : num [1:105136] NA 1.15e+08 NA NA 1.15e+08 ...
 $ absolute_likes   : num [1:105136] 2 0 14 7 7 7 4 3 2 4 ...
 $ absolute_dislikes: num [1:105136] 1 0 1 0 1 1 1 1 1 0 ...
 $ comment.year     : num [1:105136] 2021 2021 2021 2021 2021 ...
 $ com.year         : chr [1:105136] "2021" "2021" "2021" "2021" ...

Comment data:

Code

# Second, loadng post-level data :
merged <- as_tibble (read_csv("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\globe\\data.merged.csv"))

Rows: 535 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Letter, Exit rate, post.month
dbl  (18): Page views, Search + amp referral visits, Direct (non-email) refe...
num   (1): Visits when post was on LL HP
date  (1): post.date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

# colnames(merged)
str(merged)

tibble [535 × 23] (S3: tbl_df/tbl/data.frame)
 $ post.date                         : Date[1:535], format: "2021-01-04" "2021-01-05" ...
 $ Letter                            : chr [1:535] "love letters | blog | I don&'t want him to let me go" "love letters | blog | Should I be working to get her back?" "love letters | blog | I&'m sick of thinking about the breakup" "love letters | blog | I don&'t want to be selfish about 2020 Christmas" ...
 $ Page views                        : num [1:535] 14830 12067 11921 12817 12866 ...
 $ Search + amp referral visits      : num [1:535] 1005 822 793 765 934 ...
 $ Direct (non-email) referral visits: num [1:535] 10005 7948 7997 8746 8400 ...
 $ Visits                            : num [1:535] 12998 10391 10331 11040 10818 ...
 $ Uniques                           : num [1:535] 11453 8985 8917 9662 8564 ...
 $ Other website referral visits     : num [1:535] 129 232 106 152 139 153 165 157 85 151 ...
 $ Social referral visits            : num [1:535] 457 94 156 106 113 368 90 171 84 129 ...
 $ BDC referral visits               : num [1:535] 7087 5312 5185 5901 4570 ...
 $ Visits when post was on LL HP     : num [1:535] 3167 3323 2929 3056 4795 ...
 $ Exits                             : num [1:535] 9679 7549 7613 8188 7900 ...
 $ Exit rate                         : chr [1:535] "74%" "73%" "74%" "74%" ...
 $ dup                               : num [1:535] 0 0 0 0 0 0 0 0 0 0 ...
 $ post_id                           : num [1:535] 27071003 27070997 27070991 27070985 27070979 ...
 $ n.comments                        : num [1:535] 267 207 266 372 319 267 337 154 179 375 ...
 $ post.year                         : num [1:535] 2021 2021 2021 2021 2021 ...
 $ post.month                        : chr [1:535] "01" "01" "01" "01" ...
 $ post.likes                        : num [1:535] 1440 864 936 1497 1145 ...
 $ post.dislikes                     : num [1:535] 72 106 96 520 154 188 150 62 106 150 ...
 $ post.total.likes                  : num [1:535] 1512 970 1032 2017 1299 ...
 $ blocked.sum                       : num [1:535] 1 2 3 9 2 3 4 0 3 3 ...
 $ pct.positive                      : num [1:535] 95.2 89.1 90.7 74.2 88.1 ...

To begin, I will review available variables and evaluate whether it is a good measure for this study.

1.DV: popularity and engagement.

1.1.Popularity

Page views

This is the most general metric, representing how many views the post received. Views do not distinguish repeated views by the same person.

Code

# str(merged)
ggplot(data=merged, mapping=aes(x=`Page views`))+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Code

merged$post.month <-as.numeric(merged$post.month)
merged$year_month <- paste0(merged$post.year, "-", sprintf("%02d", merged$post.month))

ggplot( data=merged, mapping=aes(y=`Page views`, x=year_month))+
          geom_boxplot()+
  labs(title="Number of post wiews per month", x="Month", y="Number of vews")+ 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code

# ggplot(data=merged, mapping=aes(x=log(`Page views`)))+
#   geom_histogram()

From this graph we can see that the numbers of views is increased over time. To get a better understanding of it, lets review other metrics.

Page visits and unique users

Visit is an instance of a user engaging with bdc website. The user can visit a few pages of the site and re-visit them, creaing few views in once visit.Once th user left the website, the visit is over.

There also can be many visits by the same viewer. To account for repeated users there is “Uniques” variable. It tells us how many unique users saw the post. The relationship will always be that

Uniques < Visits < Views

Lets plot them together:

Code

# str(merged)
ggplot(merged, aes(x =post.date )) +
     geom_bar(aes(y = `Page views`, fill = "Page views"), stat = "identity", position = "dodge", width = 0.6) +
     geom_bar(aes(y = Visits, fill = "Visits"), stat = "identity", position = "dodge", width = 0.6) +
     geom_bar(aes(y = Uniques, fill = "Unique users"), stat = "identity", position = "dodge", width = 0.6)    +
     labs(title = "Views, visits and unique users per month", x = "Post.date", y = "Value") + 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

This graph also shows increase in all three metrics in 2022-2023. To see how these metrics crrespond to each other, I will calculate “visits ratio” and “unique.ratio”.

Page unique viewers.

Becuse “Uniques” variable represents number of unique people who came to the page and viewed it at least once, his metric represents popularity of the post. It’s distribution shows us that not all posts are equally popular:

Code

# colnames(merged)
ggplot(data=merged, mapping=aes(x=Uniques))+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see a long tail on the right, showing that there is a number of posts who are way more popular. If we look at the distribution of this variable over time, we will see significant increase of average popularity after December 2021.

Code

ggplot( data=merged, mapping=aes(y=Uniques, x=merged$year_month))+
          geom_boxplot()+
  labs(title="Number of unique viewers per month", x="Month", y="Number of unique viewers")+ 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Warning: Use of `merged$year_month` is discouraged.
ℹ Use `year_month` instead.

1.2. Engagement metrics:

Visit ratio and unique ratio:

“visits ratio” is the number of visits per total views. This metric tells us how often the page being re-viewed. The lower the ratio means that the page was viewed more times during single visit. Since repeated view can be considered a higher engagement, the lower visit ratio indicates higher engagement.

“unique ratio” - number of unique users per total views. It is telling us how often the same person re-visited the post. Lower ratio indicates repeated users, therefore higher engagement.

Both ratios are always in the range of 0 - 1.

Code

  merged <- merged %>%
     mutate(uniques.ratio = Uniques / `Page views`)%>%
      mutate(visits.ratio = Visits / `Page views`)
# str(merged)

     ggplot(merged, aes( y = uniques.ratio, x =post.date, )) +
          geom_point(color ="green" )    +
          labs(title = "Unique.ratio per post ", x = "Post.date", y = "Value") + 
          theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code

      ggplot(merged, aes( y = visits.ratio, x =post.date, )) +
          geom_point(color ="blue" )    +
          labs(title = "Visits.ratio per post ", x = "Post.date", y = "Value") + 
          theme(axis.text.x = element_text(angle = 45, hjust = 1))

From the two graphs above we can see that the two variables are distributed very similarly.

If we plot them against each other and visualize their correlation, we can see that they highly correlated especially in higher values.

Code

      plot(merged$uniques.ratio, merged$visits.ratio)

Code

correlation <- cor(merged$uniques.ratio, merged$visits.ratio)
correlation

[1] 0.915789

Density of the distribution also showing that most posts are visited once and by unique users, i.e. post is read once by each user.

Code

# DENSITY DISTRIBUTION HERE

Exit rate

This variable is measuring how many people visited the page and then left the website after the first view. This metric is the best measure of engagement for all users, as it represents the first step after being exposed to the post - either quitting the site or remaining on the site.

Here can see the distribution of this variable :

Code

# str(merged)
merged$`Exit rate` <- as.numeric(sub("%", "", merged$`Exit rate`)) / 100

ggplot(data=merged, mapping=aes(x=`Exit rate`))+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also see consistent increase of this value ovr time, similar to the trend seen before:

Code

ggplot(merged, mapping = aes(x=year_month , y=`Exit rate`, fill=year_month ))+
  geom_boxplot() +
  labs(title = "distribution of `Exit rate` per post ", y = "Exit rate" , x="Month")+ 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

Number of reactions

This represents total amount of likes and dislikes of comments per post.To like/dislike the comment, user doesn’t have to log in.

Code

# merged$year_month 
# colnames(merged)

ggplot(data=merged, mapping=aes(x=post.total.likes))+
  geom_histogram()+
  labs(title=" Number of all reactions per post")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Since we already saw that there is large variability in popularity of posts, which can natually impact absolute amount of reactions. To account for number of views, we calculate reactions.ratio:

Code

  merged <- merged %>%
     mutate(reactions.ratio = post.total.likes / `Page views`)
  
# str(merged)
ggplot(merged, mapping = aes(x=year_month , y=reactions.ratio, fill=year_month ))+
  geom_boxplot() +
  labs(title = "distribution of reactions ", y = "Percentage of reactions " )+
  scale_y_continuous(breaks = seq (from=0, to= 10000, by= 100)) + 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

The absolute number of reactions and the ratio appear to decrease over time.

Number of comments: engagement metric for subset of readers. Absolute and per views.

To comment on the requires logging in from a user, which is an indicator of greater engagement of an individual user. Therefore this variable represents engagement of a subset of users - more loyal readers who have created an account.

Code

# merged$year_month 
ggplot(data=merged, mapping=aes(x=n.comments))+
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

And change in the distribution over time:

Code

ggplot(merged, mapping = aes(x=year_month , y=n.comments, fill=year_month ))+
  geom_boxplot() +
  labs(title = "distribution of comments per post ", y = "Number of comments" )+
  scale_y_continuous(breaks = seq (from=0, to= 10000, by= 100)) + 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

The trend in amount of comments over time is different. While previously reviewed variables seemed to increase over time, this variable decreases.

We can calculate the ratio of comments per If we compare absolute values fo comments and ratio of comments per page views, we see the decrease in both cases, while the ratio’s decrease is more dramatic:

Code

  merged <- merged %>%
     mutate(comments.ratio = n.comments/ `Page views`)

ggplot(merged, mapping = aes(x=year_month , y=comments.ratio, fill=year_month ))+
  geom_boxplot() +
  labs(title = "distribution of comments ratio (comments/page views) ", y = "Number of comments" )+
  scale_y_continuous(breaks = seq (from=0, to= 10000, by= 100)) + 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

WE can see that with an increase of views (mostly driven by post popularity), comments ratio decreeases. So, newly added viewers are not adding to the engaged subset of readers.

The distribution of comments per post over time:

Code

# str(merged)
ggplot(merged, mapping = aes(x=year_month , y=post.total.likes, fill=year_month ))+
  geom_boxplot() +
  labs(title = "distribution of reactions over time  ", y = "Percentage of reactions " )+
  scale_y_continuous(breaks = seq (from=0, to= 10000, by= 100)) + 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

1.3 Impact of popularity on engagement.

Since we noticed a pattern in distribution of describbed variables, lets find out how they are interconnected, and how popularity of the post impact users’ engagement:

Code

pairs(subset (merged, select=c(Uniques, `Exit rate`, uniques.ratio, visits.ratio, reactions.ratio )))

This graph showing that popularity is strongly correlated with all engagement metrics, such as Exit rate, uniques ratio, visit ratio and reaction’s ratio, where an increase in popularity causes decrease of engagement across all 4 metrics for engagement.

We can also see that the relationship between Popularity and engagement rate are curvi-linear. If we log the values, we will see linear relationship:

Code

pairs(subset (merged, select=c(log(Uniques), log(`Exit rate`), uniques.ratio, visits.ratio, reactions.ratio )))

Error in `x[r, vars, drop = drop]`:
! Can't subset columns with `vars`.
✖ Can't convert from `j` <double> to <integer> due to loss of precision.

Code

# WHT DOES THE GRAPH SHOWS POST.DATE and POST.DATE1  INSTEAD OF "UNIQUES" and  "Exit rate"?

As we saw in the distributions of variables above, all engagement metrics are strongly correlate with each other the popularity metric.

1.3.1. Modeling engagement ~ popularity:

Lets review how it is impacting exit rate : I will use “Exit rate” s my main engagement variable, as it reporesents the first engagement chouce that all site users make: to stay on the site or leave. Using log() of popularity is shoing a better fitted model:

Code

ggplot(merged, mapping=aes(x=log(Uniques), y=`Exit rate` ))+
  geom_point()

Code

summary(lm(`Exit rate`~ + log(Uniques) , data = merged))


Call:
lm(formula = `Exit rate` ~ +log(Uniques), data = merged)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.259486 -0.016818  0.001917  0.019850  0.073240 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.332581   0.027678  -12.02   <2e-16 ***
log(Uniques)  0.113870   0.002855   39.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03073 on 533 degrees of freedom
Multiple R-squared:  0.749, Adjusted R-squared:  0.7486 
F-statistic:  1591 on 1 and 533 DF,  p-value: < 2.2e-16

We can see significance of popularity and high R^2 of the model, and can conclude that popularity overall decreases engagement. To explore this connection further, lets mreview Sources of users that are coming to the website.

1.3.2. How popularity impacts engagement: Referral sources.

The website analytics provides information on where the viewers are coming from to the blog page. for example, if people clicked on the blog link posted on FaceBook, that would be referral from social media. If people clicked on the blog link within BDC website, that would be “BDC referral visit”.

There are 5 sources of referrals, each corresponding with a variable in the data set. Variable’s value is a number of visits from this referral source.

    "Search + amp referral visits"
    "Direct (non-email) referral visits"
    "Other website referral visits"
    "Social referral visits"
    "BDC referral visits"
    "Visits when post was on LL HP"

Code

#renaming variables for convenience: 
merged<-merged%>%
  rename(google ="Search + amp referral visits",
         direct ="Direct (non-email) referral visits",
         other.web = "Other website referral visits",
          social= "Social referral visits",
          bdc= "BDC referral visits",
          ll= "Visits when post was on LL HP" )

ggplot(merged, mapping=aes(x=post.date))+
  geom_point(aes(y=google), color="red")+
  geom_point(aes(y=direct), color="green")+
  geom_point(aes(y=other.web), color="yellow")+
  geom_point(aes(y=social), color="purple")+
  geom_point(aes(y=bdc), color="blue")+
  geom_point(aes(y=ll), color="pink")  +
  labs(title = "referral sources per post ", y = "Number of referrals" , x="Post")+ 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

This graph clearly demonstrates increase of “Search + amp referral visits” after December 2021, while other referral sources maintain the same level over time. This increase matches changes observed in popularity ( Page Views), and engagement metrics that are highly correlated with popularity. It also doubles views of the posts in many cases and might significanly influence our model and conclusions.

Early comments’impact

We saw that increased popularity is associated with the sources of the viewers and might come from the search engines. We want to find out if early engagement in the post ( comments within first 10 hours of the post time) trigger its appearance in google news listing and therefore cause sharp increase in popularity.

We are interested to find out, what characteristics of the post (prior to it being picked up by google), correlate with its appearance on google news.

Number of comments within 1st 10 hours from the post.

Code

#  calculating number of comments within first 10 hours 

# date and time of the post 
post.date  <-comments.data %>%
  group_by(post_id)%>%
  arrange(written_at)%>%
  summarize(post.dt = first(written_at), 
            n.coms =n())%>%
  filter (n.coms>100)%>%
  select(post_id, post.dt, n.coms)%>%
arrange(desc(post_id))
# head(post.date)

# Calculating age of the post 
comments.data <- merge( comments.data , post.date , by = "post_id", all = TRUE)
comments.data <-comments.data %>%
  mutate (age = difftime(  written_at,post.dt, units = "hours"), 
          early = ifelse(age<10, 1, 0))

early.coms  <-comments.data%>%
  group_by(post_id)%>%
  arrange(written_at)%>%
  summarize(early.sum = sum(early), 
            n.coms =n())%>%
  filter (n.coms>100)%>%
  select(post_id, early.sum)%>%
arrange(desc(post_id))

# Adding number of early comments to merged dataset:
merged <- merge( merged, early.coms , by = "post_id", all = TRUE)

head(merged )

   post_id  post.date
1 27067757 2023-03-01
2 27067763 2023-02-28
3 27067769 2023-02-27
4 27067775 2023-02-24
5 27067781 2023-02-23
6 27067787 2023-02-22
                                                                             Letter
1                          love letters | blog | We&'ve gained weight – as a couple
2           love letters | blog | She was a friendly ghost, but I was still ghosted
3 love letters | blog | I waited for her to initiate sex – and it ended my marriage
4                         love letters | blog | My husband doesn&'t share his money
5                              love letters | blog | My wife won&'t go back to work
6                                love letters | blog | He doesn&'t have time for me
  Page views google direct Visits Uniques other.web social   bdc   ll Exits
1      24890  10407  10180  22766   20765       349     56  7088 2785 18547
2      15575   1210  10297  13597   11677       214     85  7461 2832 10127
3      28594   7027  15609  26115   23914       909     87 12957 3016 22464
4      72718  43419  19921  67007   62122       311    134 14106 4216 60915
5      67537  43173  16863  62888   59272       310    152 12344 2739 56929
6      28778  14466   9698  26269   23865       331     67  6332 3003 21515
  Exit rate dup n.comments post.year post.month post.likes post.dislikes
1      0.81   0        208      2023          3       1001           167
2      0.74   0        264      2023          2       1908           118
3      0.86   0        241      2023          2       1195           167
4      0.91   0        321      2023          2       1921           341
5      0.91   0        294      2023          2       1445           549
6      0.82   0        182      2023          2       1337            85
  post.total.likes blocked.sum pct.positive year_month uniques.ratio
1             1168           1     85.70205    2023-03     0.8342708
2             2026           1     94.17572    2023-02     0.7497271
3             1362           1     87.73862    2023-02     0.8363293
4             2262           3     84.92485    2023-02     0.8542864
5             1994           0     72.46740    2023-02     0.8776226
6             1422           0     94.02250    2023-02     0.8292793
  visits.ratio reactions.ratio comments.ratio early.sum
1    0.9146645      0.04692648    0.008356770       192
2    0.8730016      0.13008026    0.016950241       247
3    0.9133035      0.04763237    0.008428342       214
4    0.9214637      0.03110647    0.004414313       261
5    0.9311637      0.02952456    0.004353169       240
6    0.9128153      0.04941275    0.006324275       142

Now, as I calculated number of early comments for all posts, I can see if that number correlated with Google referrals.

Code

merged.2<-merged%>%
  filter(google>10000)
correlation.2 <- cor(merged.2$early.sum, merged.2$google)
correlation.2

[1] 0.2736053

Code

plot(merged.2$early.sum, merged.2$google)

Code

summary(lm(google ~early.sum, data=merged.2))


Call:
lm(formula = google ~ early.sum, data = merged.2)

Residuals:
   Min     1Q Median     3Q    Max 
-15035  -6141  -2482   3559  54900 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5908.84    5265.21   1.122  0.26406   
early.sum      94.56      30.73   3.077  0.00261 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10600 on 117 degrees of freedom
Multiple R-squared:  0.07486,   Adjusted R-squared:  0.06695 
F-statistic: 9.467 on 1 and 117 DF,  p-value: 0.002605

Code

summary(lm(google ~early.sum, data=merged))


Call:
lm(formula = google ~ early.sum, data = merged)

Residuals:
   Min     1Q Median     3Q    Max 
 -7036  -5548  -4652   2036  68716 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8921.388   1789.530   4.985 8.38e-07 ***
early.sum    -12.627      9.921  -1.273    0.204    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9806 on 533 degrees of freedom
Multiple R-squared:  0.00303,   Adjusted R-squared:  0.001159 
F-statistic:  1.62 on 1 and 533 DF,  p-value: 0.2037

We can see significant connection between early comments and google referrals, only on the data set of google referrals above 10000 visits. Also, R^2 is low in this model, which suggests that there are other factors that are not considered in this model. I will explore how days of the week and time of the post contribute to this relationship.

Code

# calculating day of the week for the post: 
colnames(comments.data)

 [1] "post_id"           "content"           "message_id"       
 [4] "user_id"           "user_name"         "display_name"     
 [7] "image_url"         "email"             "email_verified"   
[10] "created_at"        "private_profile"   "approved"         
[13] "written_at"        "parent"            "absolute_likes"   
[16] "absolute_dislikes" "comment.year"      "com.year"         
[19] "post.dt"           "n.coms"            "age"              
[22] "early"

Code

# Calculating age of the post 
library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Code

comments.data <-comments.data %>%
  mutate (weekday = wday(post.dt, label = TRUE), 
          weekend = ifelse(weekday =="Fri" | weekday == "Sat", 1, 0))

# class(week.days$weekday)
# levels(week.days$weekday) 
# week.days$weekend

weekdays <-comments.data %>%
  group_by(post_id)%>%
  arrange(written_at)%>%
  summarize(post.weekday = first(weekday), 
            post.weekend =first(weekend),
            n.coms =n())%>%
  filter (n.coms>100)%>%
  select(post_id, post.weekday,post.weekend )%>%
arrange(desc(post_id))

# Adding number of early comments to merged dataset:
merged <- merge( merged, weekdays , by = "post_id", all = TRUE)
colnames(merged)

 [1] "post_id"          "post.date"        "Letter"           "Page views"      
 [5] "google"           "direct"           "Visits"           "Uniques"         
 [9] "other.web"        "social"           "bdc"              "ll"              
[13] "Exits"            "Exit rate"        "dup"              "n.comments"      
[17] "post.year"        "post.month"       "post.likes"       "post.dislikes"   
[21] "post.total.likes" "blocked.sum"      "pct.positive"     "year_month"      
[25] "uniques.ratio"    "visits.ratio"     "reactions.ratio"  "comments.ratio"  
[29] "early.sum"        "post.weekday"     "post.weekend"

Code

# levels(merged$post.weekday)

Post weekdays to the model

Code

merged.2<-merged %>%
  filter(google>10000)

 plot(merged$post.weekday, merged$google)

Code

summary(lm(log(google) ~early.sum +post.weekday , data=merged.2))


Call:
lm(formula = log(google) ~ early.sum + post.weekday, data = merged.2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85556 -0.25080 -0.05083  0.23281  1.25348 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.30191    0.20350  45.710  < 2e-16 ***
early.sum       0.00352    0.00119   2.958  0.00377 ** 
post.weekday.L  0.01571    0.08732   0.180  0.85751    
post.weekday.Q -0.02154    0.08544  -0.252  0.80145    
post.weekday.C -0.02274    0.08280  -0.275  0.78409    
post.weekday^4  0.15962    0.08043   1.985  0.04962 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4058 on 113 degrees of freedom
Multiple R-squared:  0.102, Adjusted R-squared:  0.06229 
F-statistic: 2.568 on 5 and 113 DF,  p-value: 0.03066

Code

# table(merged.weekdays$post.weekday)

ggplot (merged, mapping =aes(x=early.sum, y=google, color =post.weekday))+
  geom_point( ) +
  geom_smooth(method="lm")+
  labs(title = "All google referrals and early comments", y = "Google referrals" , x="Early comments")+ 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  facet_wrap(~post.weekday)

`geom_smooth()` using formula = 'y ~ x'

Code

ggplot (merged.2, mapping =aes(x=early.sum, y=google, color =post.weekday))+
  geom_point( ) +
  geom_smooth(method="lm")+
  labs(title = "Google referrals over 10K and early comments", y = "Google referrals" , x="Early comments")+ 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  facet_wrap(~post.weekday)

`geom_smooth()` using formula = 'y ~ x'

Early comments indeed influence Google referrals. However, this influence is different on different days of the week and appears more pronounced on Thursdays and Fridays. This overall supports our suggestion that early engagement of loyal users through comments impacts post popularity. further investigation of factors would be helpful, with the goal to model fit (R^2).

Our hypothesis that large amount of early comments triggers google referrals has some preliminary support with the few restrictiions to be considere: - the connection only appears on the posts with google referrals above 10000. - This connection is atronger for only thursdays and Fridays and appears very weak on other days of the week. - low coefficient and high deviation of data from predicated valaues (low R^2) suggest that there are other , more influential, factors that were not considered.

1.4.Conclusion

As we reviewed a variety of engagement metric, I found: 1. That they are all strongly correlated to each other and influenced by post popularity: Engagement overall decreases as popularity increases.In our further analysis I will need to include “Uniques” variable in the model to control for popularity

Also as a result of exploration of engagement metrics, we can distinguish two types of users: superficial visitors and loyal readers. The main difference is that loyal readers have created accounts and therefore can comment. It is possible that htese two groups of readers respond differently to popularity increase, as well as other Independent variables. Therefore, the difference in types of readers must be considered when addressing our main hypothesi: whether authors engagement and other factors impacts reader’s engagement.
Early engagement of loyal users through comments might contribute to dramatic changes in post popularity.

Now, lets consider independent variables of our main hypothesis. # 2. Independent variables (IV):

Code

library(stargazer)


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Code

stargazer()

Error in if (substr(inside[i], 1, nchar("list(")) == "list(") {: missing value where TRUE/FALSE needed

Authors comments

To identify, how much the author of the blog is engaged in the post, I will create an additional variable derived from a user_name field.

Code

# str(merged)
comments.data$user_name<-  ifelse (is.na(comments.data$user_name), 0, comments.data$user_name)
comments.data$author<-  ifelse (comments.data$user_name=="MeredithGoldstein", 1, 0)

comments.grouped <-comments.data %>%
  group_by(post_id)%>%
  summarize(n.comments=n(),
            author.sum = sum(author))

# dim(comments.grouped )
# colnames(comments.grouped )
# class(comments.grouped$author.sum)

# Comments data contains rows that dont actually reporesent posts, and were crearted by web support team for troubleshooting. I need to remove these rows. They typically have very low number of comments
comments.grouped <-comments.grouped %>%
filter(n.comments >100)  # removing invalid posts created by the  website management team.

comments.grouped <-comments.grouped %>%
  select(post_id, author.sum)
 dim(comments.grouped)

[1] 535   2

Code

#adding author.sum to main data set: 
merged <- merge( merged , comments.grouped, by = "post_id", all = TRUE)


ggplot(data=merged, mapping = aes(x=year_month , y=author.sum, fill=year_month ))+
  geom_boxplot() +
  labs(title = "distribution of Author's comments by months", y = "Author's comments" , x="Month")+ 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

This graph shows, that majority of posts have no author’s comments. However, if we look at the box plots’ upper ranges, we can suggest a trend of decrease in author’s comments with time.

Mood of the post.

This is a numerical variable, calculated as percentage of “thumbs up” from all likes (both “thumbs up” and “thumbs down”).

Code

ggplot(merged, mapping = aes(x=year_month , y=pct.positive, fill=year_month ))+
  geom_boxplot() +
  labs(title = "distribution of Mood per post ", y = "Mood" , x="Month")+ 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

Blocked comments per post.

Now I will visualize amount of blocked comments per post:

Code

ggplot(merged, mapping = aes(x=year_month , y=blocked.sum, fill=year_month ))+
  geom_boxplot() +
  labs(title = "Number Blocked comments  per post ", y = "Number of blocked comments per post" , x="Month")+ 
     theme(axis.text.x = element_text(angle = 45, hjust = 1))

3. Creating the model for main hypothesis testing

As we mentioned above, we will add popularity to the model when testing main hypothesis. Also, I am going to break the data set into two groups: prior to December 2021 and after December 2021 due to dramatic change in referral sources. I will use these two datasets to to test my main hypothesis and compare results.

Code

merged.2021 <- merged%>%
  filter(post.date <    "2021-12-01")
merged.2022 <- merged%>%
  filter(post.date >=   "2021-12-01")

3.1 Basic model.

I will start with creating a simple model of engagement ~ popularity and author’s engagement:

Code

# colnames(merged)
summary(lm(`Exit rate` ~ log(Uniques)+ author.sum, data = merged))


Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum, data = merged)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.257486 -0.016558  0.001669  0.019958  0.072030 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.326263   0.027749 -11.758   <2e-16 ***
log(Uniques)  0.113326   0.002857  39.663   <2e-16 ***
author.sum   -0.003389   0.001602  -2.116   0.0348 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03063 on 532 degrees of freedom
Multiple R-squared:  0.7511,    Adjusted R-squared:  0.7502 
F-statistic: 802.9 on 2 and 532 DF,  p-value: < 2.2e-16

Code

summary(lm(`Exit rate` ~ log(Uniques)+ author.sum, data = merged.2021))


Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum, data = merged.2021)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.257995 -0.018855  0.000548  0.027265  0.069625 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.402793   0.096359  -4.180 4.18e-05 ***
log(Uniques)  0.121731   0.010266  11.857  < 2e-16 ***
author.sum   -0.002533   0.002568  -0.986    0.325    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0372 on 224 degrees of freedom
Multiple R-squared:  0.3856,    Adjusted R-squared:  0.3801 
F-statistic:  70.3 on 2 and 224 DF,  p-value: < 2.2e-16

Code

summary(lm(`Exit rate` ~ log(Uniques)+ author.sum, data = merged.2022))


Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum, data = merged.2022)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.083434 -0.014968  0.002043  0.017982  0.053266 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.354200   0.029528 -11.995  < 2e-16 ***
log(Uniques)  0.115989   0.002974  38.996  < 2e-16 ***
author.sum   -0.005402   0.001997  -2.705  0.00722 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02455 on 305 degrees of freedom
Multiple R-squared:  0.8369,    Adjusted R-squared:  0.8359 
F-statistic: 782.7 on 2 and 305 DF,  p-value: < 2.2e-16

To see how engagement of a subset of readers changes (loyal readers), I will use another measure for engagement as DV, which represents only registered users: number of comments.

Code

summary(lm(n.comments ~ log(Uniques) + author.sum , data = merged))


Call:
lm(formula = n.comments ~ log(Uniques) + author.sum, data = merged)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.670 -35.171  -8.842  26.596 181.918 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   238.521     45.076   5.292 1.78e-07 ***
log(Uniques)   -4.667      4.641  -1.006   0.3151    
author.sum      4.872      2.602   1.872   0.0617 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.75 on 532 degrees of freedom
Multiple R-squared:  0.009115,  Adjusted R-squared:  0.00539 
F-statistic: 2.447 on 2 and 532 DF,  p-value: 0.08752

This shows that engagement is decreasing with popularity and someone increasing with author’s comments. for loyal readers : engagement is also sensitive to author’s comments, but not sensitive to popularity of the post.

analize change in x1 and x2 to y , considering log()

3.2 Adding other variables.

Now I will add other variables to see if there is any impact on the results of the model with engagement of both groups of readers. For number of comments, I would have to select either early.sum or n.comments, due to high correlation between them

Code

(cor(merged$early.sum, merged$n.comments))

[1] 0.9507822

I am including n/comments in the model, to see if engagement of loyal readers through comments impact overall engagement:

Code

# colnames(merged)
summary(lm(`Exit rate` ~ log(Uniques)+ author.sum + pct.positive +blocked.sum+post.weekday +n.comments, data = merged))


Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum + post.weekday + n.comments, data = merged)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.249664 -0.014653  0.001237  0.018954  0.061566 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -3.658e-01  3.738e-02  -9.787  < 2e-16 ***
log(Uniques)    1.135e-01  2.803e-03  40.492  < 2e-16 ***
author.sum     -4.158e-03  1.551e-03  -2.681  0.00756 ** 
pct.positive    1.488e-04  2.369e-04   0.628  0.53032    
blocked.sum    -3.634e-04  2.099e-04  -1.732  0.08391 .  
post.weekday.L -1.959e-03  3.001e-03  -0.653  0.51425    
post.weekday.Q  1.332e-02  2.879e-03   4.627 4.69e-06 ***
post.weekday.C  1.109e-04  2.825e-03   0.039  0.96869    
post.weekday^4  3.235e-03  2.797e-03   1.157  0.24793    
n.comments      1.384e-04  2.642e-05   5.237 2.36e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02933 on 525 degrees of freedom
Multiple R-squared:  0.7748,    Adjusted R-squared:  0.771 
F-statistic: 200.7 on 9 and 525 DF,  p-value: < 2.2e-16

Ahd it is shown to be significant, negativelly impacting overall engagement. considering our finding s earlier, that early comments increase popularity and therefore decrease overall engagement, this is expected. Tuesday as level of “weekday” also shows to significantly decrease engagement.

Authors engagement, day of the week(Tuesday) and blocked comments seem to increase engagement.

Code

summary(lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum +post.weekday, data = merged)) #not adding early.sum due to high correlation with DV: cor(merged$early.sum, merged$n.comments): 0.95


Call:
lm(formula = n.comments ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum + post.weekday, data = merged)

Residuals:
   Min     1Q Median     3Q    Max 
-84.53 -33.89  -9.63  25.39 183.31 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    106.8646    61.5191   1.737  0.08296 .  
log(Uniques)    -1.8812     4.6253  -0.407  0.68438    
author.sum       3.9147     2.5540   1.533  0.12593    
pct.positive     1.1648     0.3878   3.004  0.00279 ** 
blocked.sum      0.8864     0.3443   2.575  0.01030 *  
post.weekday.L  21.7036     4.8624   4.464 9.87e-06 ***
post.weekday.Q   6.0475     4.7449   1.275  0.20304    
post.weekday.C   1.0266     4.6622   0.220  0.82581    
post.weekday^4  -0.1307     4.6164  -0.028  0.97743    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 48.41 on 526 degrees of freedom
Multiple R-squared:  0.07252,   Adjusted R-squared:  0.05841 
F-statistic: 5.141 on 8 and 526 DF,  p-value: 3.455e-06

While general viewers show no impact by mood of the post or blocked comments, we see significant positive correlation of engagement with all of these metrics for loyal readers(=commenters).

Number of comments, representing engagement is somewhat impacted by three independent variables, but this model has low R^2 , which suggests that there are other factors impacting loyal reader’s engagement that are not considered in this model.

3.3 Using different datasets for comparing:

Exit rate for 2021 and 2022:

Code

# colnames(merged)
summary(lm(`Exit rate` ~ log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2021))


Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum, data = merged.2021)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.258483 -0.017314  0.000667  0.027730  0.077470 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -5.315e-01  1.194e-01  -4.451 1.35e-05 ***
log(Uniques)  1.259e-01  1.053e-02  11.949  < 2e-16 ***
author.sum   -2.673e-03  2.563e-03  -1.043   0.2981    
pct.positive  1.009e-03  5.573e-04   1.811   0.0715 .  
blocked.sum   7.438e-05  4.341e-04   0.171   0.8641    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03709 on 222 degrees of freedom
Multiple R-squared:  0.3946,    Adjusted R-squared:  0.3837 
F-statistic: 36.18 on 4 and 222 DF,  p-value: < 2.2e-16

Code

summary(lm(`Exit rate` ~ log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2022))


Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum, data = merged.2022)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.08367 -0.01553  0.00204  0.01839  0.05326 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.3315325  0.0368576  -8.995  < 2e-16 ***
log(Uniques)  0.1153984  0.0029870  38.633  < 2e-16 ***
author.sum   -0.0052803  0.0019958  -2.646  0.00858 ** 
pct.positive -0.0001812  0.0002344  -0.773  0.44014    
blocked.sum  -0.0003622  0.0002190  -1.654  0.09915 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0245 on 303 degrees of freedom
Multiple R-squared:  0.8386,    Adjusted R-squared:  0.8365 
F-statistic: 393.7 on 4 and 303 DF,  p-value: < 2.2e-16

Exit rate (engagement) still correlates with popularity of the site or bot 2021 and 2021,confirming the same finding from un-splitted data set earlier. However, author’s comments appear only be relevant to all user’s engagement in 2022, and not in 2021. Mood of the post appears significant for user’s engagement only in 2021, and blocked comments only appear significant in 2022. ## 3.2 Diagnostic of the model:

Code

# colnames(merged)
model.ex.r<- lm(`Exit rate` ~ log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged)

par(mfrow = c(2,3))
plot(model.ex.r, which = 1:6)

Code

ex.r.2021 <- lm(`Exit rate` ~ log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2021)
ex.r.2022 <-lm(`Exit rate` ~ log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2022)
par(mfrow = c(2,3))
plot(ex.r.2021, which = 1:6)

Code

par(mfrow = c(2,3))
plot(ex.r.2022, which = 1:6)

Code

model.2<- lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged)


par(mfrow = c(2,3))
plot(model.2, which = 1:6)

Number of comments for 2021 and 2022:

Code

summary(lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged.2021))


Call:
lm(formula = n.comments ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum, data = merged.2021)

Residuals:
     Min       1Q   Median       3Q      Max 
-106.632  -38.547   -8.915   28.987  167.497 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  209.0074   180.9229   1.155    0.249
log(Uniques)   3.0321    15.9599   0.190    0.849
author.sum     4.5830     3.8832   1.180    0.239
pct.positive  -0.3590     0.8443  -0.425    0.671
blocked.sum    0.8618     0.6577   1.310    0.191

Residual standard error: 56.19 on 222 degrees of freedom
Multiple R-squared:  0.01739,   Adjusted R-squared:  -0.000314 
F-statistic: 0.9823 on 4 and 222 DF,  p-value: 0.418

Code

summary(lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged.2022))


Call:
lm(formula = n.comments ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum, data = merged.2022)

Residuals:
    Min      1Q  Median      3Q     Max 
-76.762 -28.410  -7.407  25.375 155.874 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -72.7563    59.4359  -1.224 0.221861    
log(Uniques)  18.5894     4.8169   3.859 0.000139 ***
author.sum     1.9188     3.2184   0.596 0.551481    
pct.positive   0.7834     0.3780   2.072 0.039094 *  
blocked.sum    1.1017     0.3531   3.120 0.001981 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.51 on 303 degrees of freedom
Multiple R-squared:  0.07768,   Adjusted R-squared:  0.0655 
F-statistic:  6.38 on 4 and 303 DF,  p-value: 6.108e-05

None of the variables show significance for loyal user’s engagement(comments) in 2021. However, loyal users are significantly influenced by popularity, mood of the post and blocked comments in 2022.

Adding a weekday:

Adding the weekday of the post ad additional IV increases R^2 and shows significance for one of the levels:

Code

summary(lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum + post.weekday, data = merged))


Call:
lm(formula = n.comments ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum + post.weekday, data = merged)

Residuals:
   Min     1Q Median     3Q    Max 
-84.53 -33.89  -9.63  25.39 183.31 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    106.8646    61.5191   1.737  0.08296 .  
log(Uniques)    -1.8812     4.6253  -0.407  0.68438    
author.sum       3.9147     2.5540   1.533  0.12593    
pct.positive     1.1648     0.3878   3.004  0.00279 ** 
blocked.sum      0.8864     0.3443   2.575  0.01030 *  
post.weekday.L  21.7036     4.8624   4.464 9.87e-06 ***
post.weekday.Q   6.0475     4.7449   1.275  0.20304    
post.weekday.C   1.0266     4.6622   0.220  0.82581    
post.weekday^4  -0.1307     4.6164  -0.028  0.97743    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 48.41 on 526 degrees of freedom
Multiple R-squared:  0.07252,   Adjusted R-squared:  0.05841 
F-statistic: 5.141 on 8 and 526 DF,  p-value: 3.455e-06

Code

summary(lm(`Exit rate` ~ log(Uniques) + author.sum + pct.positive+blocked.sum +post.weekday, data = merged))


Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum + pct.positive + 
    blocked.sum + post.weekday, data = merged)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.253949 -0.016021  0.001578  0.019637  0.065920 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.3510322  0.0381950  -9.191  < 2e-16 ***
log(Uniques)    0.1132211  0.0028717  39.427  < 2e-16 ***
author.sum     -0.0036165  0.0015857  -2.281    0.023 *  
pct.positive    0.0003099  0.0002408   1.287    0.199    
blocked.sum    -0.0002408  0.0002137  -1.127    0.260    
post.weekday.L  0.0010440  0.0030189   0.346    0.730    
post.weekday.Q  0.0141570  0.0029459   4.806 2.02e-06 ***
post.weekday.C  0.0002529  0.0028946   0.087    0.930    
post.weekday^4  0.0032168  0.0028662   1.122    0.262    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03006 on 526 degrees of freedom
Multiple R-squared:  0.7631,    Adjusted R-squared:  0.7595 
F-statistic: 211.8 on 8 and 526 DF,  p-value: < 2.2e-16

Code

class(merged$post.weekday)

[1] "ordered" "factor"

Adding early comments:

Measuring how early comments impact engagement:

Code

merged<- merged %>%
  mutate (e.rate = `Exit rate`*10000)
summary(lm(e.rate ~ log(Uniques) + author.sum +post.weekday +early.sum, data = merged))


Call:
lm(formula = e.rate ~ log(Uniques) + author.sum + post.weekday + 
    early.sum, data = merged)

Residuals:
     Min       1Q   Median       3Q      Max 
-2491.71  -146.36    15.88   188.10   638.62 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -3587.5846   276.2872 -12.985  < 2e-16 ***
log(Uniques)    1138.5047    27.5061  41.391  < 2e-16 ***
author.sum       -38.3691    15.4708  -2.480   0.0134 *  
post.weekday.L   -14.2529    29.1268  -0.489   0.6248    
post.weekday.Q   143.4069    28.7226   4.993 8.10e-07 ***
post.weekday.C     5.1103    28.2598   0.181   0.8566    
post.weekday^4    35.8574    27.9676   1.282   0.2004    
early.sum          1.5903     0.2998   5.305 1.66e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 293.5 on 527 degrees of freedom
Multiple R-squared:  0.7737,    Adjusted R-squared:  0.7707 
F-statistic: 257.4 on 7 and 527 DF,  p-value: < 2.2e-16

Code

summary(lm(e.rate ~ log(Uniques) + author.sum+early.sum, data = merged))


Call:
lm(formula = e.rate ~ log(Uniques) + author.sum + early.sum, 
    data = merged)

Residuals:
     Min       1Q   Median       3Q      Max 
-2538.89  -143.55    24.05   198.90   685.26 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3634.7905   281.4951 -12.912  < 2e-16 ***
log(Uniques)  1144.2628    28.0370  40.813  < 2e-16 ***
author.sum     -35.9680    15.6776  -2.294   0.0222 *  
early.sum        1.5193     0.3042   4.994 8.04e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 299.6 on 531 degrees of freedom
Multiple R-squared:  0.7623,    Adjusted R-squared:  0.761 
F-statistic: 567.6 on 3 and 531 DF,  p-value: < 2.2e-16

The model’s predictive value is pretty good (probably due to popularity being included) : 0.77. It also appears that this connection We can check it, by excluding the popularity, as see that R^2 drops dramatically:

Code

summary(lm(e.rate ~ author.sum+early.sum, data = merged))


Call:
lm(formula = e.rate ~ author.sum + early.sum, data = merged)

Residuals:
    Min      1Q  Median      3Q     Max 
-3227.6  -410.9   -65.3   383.0  1613.4 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7634.2416   111.2795  68.604   <2e-16 ***
author.sum   -92.0013    31.7347  -2.899   0.0039 ** 
early.sum      0.5435     0.6163   0.882   0.3783    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 608.8 on 532 degrees of freedom
Multiple R-squared:  0.01667,   Adjusted R-squared:  0.01298 
F-statistic:  4.51 on 2 and 532 DF,  p-value: 0.01142

4. Conlusion

This study explored the connection between user’s online engagement and bolg’s author engagement. To do so, we also analyzed the impact of post popularity on engagement.

Code

knitr::include_graphics("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\FP diagram .png")

The results showed that the post popularity is the major factor for online engagement and is negatively correlated with it. i.e. more popular posts tend to be less engaging.

We also identified two groups of users, whose engagement is measured with different metrics. We explored how blog author’s comments impact engagement of each group. We found than there is no significant impact of authors engagement on users engagement. Such characteristics, as post mood and blocked comments also showed no connection with engagement of either groups of users.

While engagement metrics (exit rate and comments) did not show significant contribution, authors comment appeared significant in this model. - this is really out of place, i just don’t want to lose this.