It is well known that online engagement and popularity with the web resource is a highly valuable metric and is driving site revenue. This research project is exploring which factors contribute to users online engagement.
To do that I will use the data of an online blog on the news website. The author of this blog is posting articles about interpersonal relationships every work day (Mon- Fri). The posts are formulated as a letter from a reader with the situation and a question about relationships. The author gives an advice about the situation. Website readers are free to comment under each post, but cannot make their own posts.
All post methadata and comments are public. They are saved by the website and available for the analysis. Using this data set, I will explore how readers’ engagement connected with blogs’s author engagement, site comments’, web source of readers and negative behaviors online.
My research question is: Does the authors engagement in the conversation around the post makes readers more engaged and promotes positive interactions among them?
DV: My dependent construct is “user’s engagement”, I will measure users’ engagement at the level of individual post, using the following metrics:
Page views
Page visits (one visit can contain a few views, if the person visits other pages on the site)
Unique users. One person can visit the same page a few times (it is calculated for all time since the page was posted).
Number of votes: “thums up” or “thumbs down”
Number of comments
Exit rate or “bounces”. When the visitor is coming to the page and then leaving, i.e. not opening other pages on this website.
L2 - page readers * Reveal letter * Reveal comments
IV: My main independent variable is Blog’s author engagement. I will measure authors engagement as the factor variable, with the following levels:
A. Unspecified comment
B. Featured comment
C. Engagement in conversation
To control for confounders, I will also measure the follwing variables:
Topic of the post (“post tag”), categorical variable.
Source of the readers, also categorical variable.
Mood of the conversation , derivative continuous variable calculated as the ratio of “likes” to “dislikes”.
Blocked and flagged comments.
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
To answer my research question I will use two datasets. the first data set has information about all comments associated with each post by post ID. The second data set is analytics data for the web bage. It contains one post per row and variables describe each post as a whole without breaking down to the comment level.
In this project I will analyze posts for January 2021 - February 2023. Here is the list of variables in each dataset: Post-level data:
Code
getwd()
[1] "C:/Users/Diana/OneDrive - University Of Massachusetts Medical School/Documents/R/R working directory/DACSS/603/603_Spring_2023/posts"
Code
# First, I will load the data set with the comment level data:raw <-as_tibble (read_csv("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\globe\\ data.2021.plus.csv"))
Rows: 105136 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): content, user_name, display_name, image_url, email, approved
dbl (7): message_id, post_id, user_id, parent, absolute_likes, absolute_dis...
lgl (3): email_verified, created_at, private_profile
dttm (1): written_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tibble [105,136 × 18] (S3: tbl_df/tbl/data.frame)
$ content : chr [1:105136] "08:15 Dawn of the 21st year of the 21st century<br/>"21st Century Schizoid Man" on the BOSE" "Liz: 😄" "Happy New Year to all y'all. And Ms. G, thank you for letting me play for another year playing in your sandbox." "Blog needs a new name - Hate Letters" ...
$ message_id : num [1:105136] 1.15e+08 1.15e+08 1.15e+08 1.15e+08 1.15e+08 ...
$ post_id : num [1:105136] 27071015 27071015 27071009 27071009 27071009 ...
$ user_id : num [1:105136] 2889100 1855822 5133556 5156343 5560421 ...
$ user_name : chr [1:105136] "Lefty49" "JacquiSmith" "Bzznlike-crazyman" "--SnowMan--" ...
$ display_name : chr [1:105136] "Lefty49" "JacquiSmith" "Bzznlike-crazyman" "--SnowMan--" ...
$ image_url : chr [1:105136] "https://u.o0bc.com/avatars/48/121/77/5077296.png?35" "https://u.o0bc.com/avatars/41/194/84/5554729.png?47" "https://u.o0bc.com/avatars/85/175/84/5549909.png?9" "https://u.o0bc.com/avatars/64/191/84/5553984.png?5" ...
$ email : chr [1:105136] "danwalker1949@aol.com" "amcr1124@comcast.net" "eflynn105@yahoo.com" "markedoc@yahoo.com" ...
$ email_verified : logi [1:105136] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ created_at : logi [1:105136] NA NA NA NA NA NA ...
$ private_profile : logi [1:105136] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ approved : chr [1:105136] "approved" "approved" "approved" "approved" ...
$ written_at : POSIXct[1:105136], format: "2021-01-01 08:28:44" "2021-01-01 08:57:08" ...
$ parent : num [1:105136] NA 1.15e+08 NA NA 1.15e+08 ...
$ absolute_likes : num [1:105136] 2 0 14 7 7 7 4 3 2 4 ...
$ absolute_dislikes: num [1:105136] 1 0 1 0 1 1 1 1 1 0 ...
$ comment.year : num [1:105136] 2021 2021 2021 2021 2021 ...
$ com.year : chr [1:105136] "2021" "2021" "2021" "2021" ...
Comment data:
Code
# Second, loadng post-level data :merged <-as_tibble (read_csv("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\globe\\data.merged.csv"))
Rows: 535 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Letter, Exit rate, post.month
dbl (18): Page views, Search + amp referral visits, Direct (non-email) refe...
num (1): Visits when post was on LL HP
date (1): post.date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# colnames(merged)str(merged)
tibble [535 × 23] (S3: tbl_df/tbl/data.frame)
$ post.date : Date[1:535], format: "2021-01-04" "2021-01-05" ...
$ Letter : chr [1:535] "love letters | blog | I don&'t want him to let me go" "love letters | blog | Should I be working to get her back?" "love letters | blog | I&'m sick of thinking about the breakup" "love letters | blog | I don&'t want to be selfish about 2020 Christmas" ...
$ Page views : num [1:535] 14830 12067 11921 12817 12866 ...
$ Search + amp referral visits : num [1:535] 1005 822 793 765 934 ...
$ Direct (non-email) referral visits: num [1:535] 10005 7948 7997 8746 8400 ...
$ Visits : num [1:535] 12998 10391 10331 11040 10818 ...
$ Uniques : num [1:535] 11453 8985 8917 9662 8564 ...
$ Other website referral visits : num [1:535] 129 232 106 152 139 153 165 157 85 151 ...
$ Social referral visits : num [1:535] 457 94 156 106 113 368 90 171 84 129 ...
$ BDC referral visits : num [1:535] 7087 5312 5185 5901 4570 ...
$ Visits when post was on LL HP : num [1:535] 3167 3323 2929 3056 4795 ...
$ Exits : num [1:535] 9679 7549 7613 8188 7900 ...
$ Exit rate : chr [1:535] "74%" "73%" "74%" "74%" ...
$ dup : num [1:535] 0 0 0 0 0 0 0 0 0 0 ...
$ post_id : num [1:535] 27071003 27070997 27070991 27070985 27070979 ...
$ n.comments : num [1:535] 267 207 266 372 319 267 337 154 179 375 ...
$ post.year : num [1:535] 2021 2021 2021 2021 2021 ...
$ post.month : chr [1:535] "01" "01" "01" "01" ...
$ post.likes : num [1:535] 1440 864 936 1497 1145 ...
$ post.dislikes : num [1:535] 72 106 96 520 154 188 150 62 106 150 ...
$ post.total.likes : num [1:535] 1512 970 1032 2017 1299 ...
$ blocked.sum : num [1:535] 1 2 3 9 2 3 4 0 3 3 ...
$ pct.positive : num [1:535] 95.2 89.1 90.7 74.2 88.1 ...
Be
To begin, I will review available variables and evaluate whether it is a good measure for this study.
1.DV: popularity and engagement.
1.1.Popularity
Page views
This is the most general metric, representing how many views the post received. Views do not distinguish repeated views by the same person.
From this graph we can see that the numbers of views is increased over time. To get a better understanding of it, lets review other metrics.
Page visits and unique users
Visit is an instance of a user engaging with bdc website. The user can visit a few pages of the site and re-visit them, creaing few views in once visit.Once th user left the website, the visit is over.
There also can be many visits by the same viewer. To account for repeated users there is “Uniques” variable. It tells us how many unique users saw the post. The relationship will always be that
Uniques < Visits < Views
Lets plot them together:
Code
# str(merged)ggplot(merged, aes(x =post.date )) +geom_bar(aes(y =`Page views`, fill ="Page views"), stat ="identity", position ="dodge", width =0.6) +geom_bar(aes(y = Visits, fill ="Visits"), stat ="identity", position ="dodge", width =0.6) +geom_bar(aes(y = Uniques, fill ="Unique users"), stat ="identity", position ="dodge", width =0.6) +labs(title ="Views, visits and unique users per month", x ="Post.date", y ="Value") +theme(axis.text.x =element_text(angle =45, hjust =1))
This graph also shows increase in all three metrics in 2022-2023. To see how these metrics crrespond to each other, I will calculate “visits ratio” and “unique.ratio”.
Page unique viewers.
Becuse “Uniques” variable represents number of unique people who came to the page and viewed it at least once, his metric represents popularity of the post. It’s distribution shows us that not all posts are equally popular:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see a long tail on the right, showing that there is a number of posts who are way more popular. If we look at the distribution of this variable over time, we will see significant increase of average popularity after December 2021.
Code
ggplot( data=merged, mapping=aes(y=Uniques, x=merged$year_month))+geom_boxplot()+labs(title="Number of unique viewers per month", x="Month", y="Number of unique viewers")+theme(axis.text.x =element_text(angle =45, hjust =1))
Warning: Use of `merged$year_month` is discouraged.
ℹ Use `year_month` instead.
1.2. Engagement metrics:
Visit ratio and unique ratio:
“visits ratio” is the number of visits per total views. This metric tells us how often the page being re-viewed. The lower the ratio means that the page was viewed more times during single visit. Since repeated view can be considered a higher engagement, the lower visit ratio indicates higher engagement.
“unique ratio” - number of unique users per total views. It is telling us how often the same person re-visited the post. Lower ratio indicates repeated users, therefore higher engagement.
Both ratios are always in the range of 0 - 1.
Code
merged <- merged %>%mutate(uniques.ratio = Uniques /`Page views`)%>%mutate(visits.ratio = Visits /`Page views`)# str(merged)ggplot(merged, aes( y = uniques.ratio, x =post.date, )) +geom_point(color ="green" ) +labs(title ="Unique.ratio per post ", x ="Post.date", y ="Value") +theme(axis.text.x =element_text(angle =45, hjust =1))
Code
ggplot(merged, aes( y = visits.ratio, x =post.date, )) +geom_point(color ="blue" ) +labs(title ="Visits.ratio per post ", x ="Post.date", y ="Value") +theme(axis.text.x =element_text(angle =45, hjust =1))
From the two graphs above we can see that the two variables are distributed very similarly.
If we plot them against each other and visualize their correlation, we can see that they highly correlated especially in higher values.
Density of the distribution also showing that most posts are visited once and by unique users, i.e. post is read once by each user.
Code
# DENSITY DISTRIBUTION HERE
Exit rate
This variable is measuring how many people visited the page and then left the website after the first view. This metric is the best measure of engagement for all users, as it represents the first step after being exposed to the post - either quitting the site or remaining on the site.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also see consistent increase of this value ovr time, similar to the trend seen before:
Code
ggplot(merged, mapping =aes(x=year_month , y=`Exit rate`, fill=year_month ))+geom_boxplot() +labs(title ="distribution of `Exit rate` per post ", y ="Exit rate" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))
Number of reactions
This represents total amount of likes and dislikes of comments per post.To like/dislike the comment, user doesn’t have to log in.
Code
# merged$year_month # colnames(merged)ggplot(data=merged, mapping=aes(x=post.total.likes))+geom_histogram()+labs(title=" Number of all reactions per post")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Since we already saw that there is large variability in popularity of posts, which can natually impact absolute amount of reactions. To account for number of views, we calculate reactions.ratio:
The absolute number of reactions and the ratio appear to decrease over time.
Number of comments: engagement metric for subset of readers. Absolute and per views.
To comment on the requires logging in from a user, which is an indicator of greater engagement of an individual user. Therefore this variable represents engagement of a subset of users - more loyal readers who have created an account.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And change in the distribution over time:
Code
ggplot(merged, mapping =aes(x=year_month , y=n.comments, fill=year_month ))+geom_boxplot() +labs(title ="distribution of comments per post ", y ="Number of comments" )+scale_y_continuous(breaks =seq (from=0, to=10000, by=100)) +theme(axis.text.x =element_text(angle =45, hjust =1))
The trend in amount of comments over time is different. While previously reviewed variables seemed to increase over time, this variable decreases.
We can calculate the ratio of comments per If we compare absolute values fo comments and ratio of comments per page views, we see the decrease in both cases, while the ratio’s decrease is more dramatic:
Code
merged <- merged %>%mutate(comments.ratio = n.comments/`Page views`)ggplot(merged, mapping =aes(x=year_month , y=comments.ratio, fill=year_month ))+geom_boxplot() +labs(title ="distribution of comments ratio (comments/page views) ", y ="Number of comments" )+scale_y_continuous(breaks =seq (from=0, to=10000, by=100)) +theme(axis.text.x =element_text(angle =45, hjust =1))
WE can see that with an increase of views (mostly driven by post popularity), comments ratio decreeases. So, newly added viewers are not adding to the engaged subset of readers.
The distribution of comments per post over time:
Code
# str(merged)ggplot(merged, mapping =aes(x=year_month , y=post.total.likes, fill=year_month ))+geom_boxplot() +labs(title ="distribution of reactions over time ", y ="Percentage of reactions " )+scale_y_continuous(breaks =seq (from=0, to=10000, by=100)) +theme(axis.text.x =element_text(angle =45, hjust =1))
1.3 Impact of popularity on engagement.
Since we noticed a pattern in distribution of describbed variables, lets find out how they are interconnected, and how popularity of the post impact users’ engagement:
This graph showing that popularity is strongly correlated with all engagement metrics, such as Exit rate, uniques ratio, visit ratio and reaction’s ratio, where an increase in popularity causes decrease of engagement across all 4 metrics for engagement.
We can also see that the relationship between Popularity and engagement rate are curvi-linear. If we log the values, we will see linear relationship:
Error in `x[r, vars, drop = drop]`:
! Can't subset columns with `vars`.
✖ Can't convert from `j` <double> to <integer> due to loss of precision.
Code
# WHT DOES THE GRAPH SHOWS POST.DATE and POST.DATE1 INSTEAD OF "UNIQUES" and "Exit rate"?
As we saw in the distributions of variables above, all engagement metrics are strongly correlate with each other the popularity metric.
1.3.1. Modeling engagement ~ popularity:
Lets review how it is impacting exit rate : I will use “Exit rate” s my main engagement variable, as it reporesents the first engagement chouce that all site users make: to stay on the site or leave. Using log() of popularity is shoing a better fitted model:
summary(lm(`Exit rate`~+log(Uniques) , data = merged))
Call:
lm(formula = `Exit rate` ~ +log(Uniques), data = merged)
Residuals:
Min 1Q Median 3Q Max
-0.259486 -0.016818 0.001917 0.019850 0.073240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.332581 0.027678 -12.02 <2e-16 ***
log(Uniques) 0.113870 0.002855 39.89 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03073 on 533 degrees of freedom
Multiple R-squared: 0.749, Adjusted R-squared: 0.7486
F-statistic: 1591 on 1 and 533 DF, p-value: < 2.2e-16
We can see significance of popularity and high R^2 of the model, and can conclude that popularity overall decreases engagement. To explore this connection further, lets mreview Sources of users that are coming to the website.
1.3.2. How popularity impacts engagement: Referral sources.
The website analytics provides information on where the viewers are coming from to the blog page. for example, if people clicked on the blog link posted on FaceBook, that would be referral from social media. If people clicked on the blog link within BDC website, that would be “BDC referral visit”.
There are 5 sources of referrals, each corresponding with a variable in the data set. Variable’s value is a number of visits from this referral source.
"Search + amp referral visits"
"Direct (non-email) referral visits"
"Other website referral visits"
"Social referral visits"
"BDC referral visits"
"Visits when post was on LL HP"
Code
#renaming variables for convenience: merged<-merged%>%rename(google ="Search + amp referral visits",direct ="Direct (non-email) referral visits",other.web ="Other website referral visits",social="Social referral visits",bdc="BDC referral visits",ll="Visits when post was on LL HP" )ggplot(merged, mapping=aes(x=post.date))+geom_point(aes(y=google), color="red")+geom_point(aes(y=direct), color="green")+geom_point(aes(y=other.web), color="yellow")+geom_point(aes(y=social), color="purple")+geom_point(aes(y=bdc), color="blue")+geom_point(aes(y=ll), color="pink") +labs(title ="referral sources per post ", y ="Number of referrals" , x="Post")+theme(axis.text.x =element_text(angle =45, hjust =1))
This graph clearly demonstrates increase of “Search + amp referral visits” after December 2021, while other referral sources maintain the same level over time. This increase matches changes observed in popularity ( Page Views), and engagement metrics that are highly correlated with popularity. It also doubles views of the posts in many cases and might significanly influence our model and conclusions.
Early comments’impact
We saw that increased popularity is associated with the sources of the viewers and might come from the search engines. We want to find out if early engagement in the post ( comments within first 10 hours of the post time) trigger its appearance in google news listing and therefore cause sharp increase in popularity.
We are interested to find out, what characteristics of the post (prior to it being picked up by google), correlate with its appearance on google news.
Number of comments within 1st 10 hours from the post.
Code
# calculating number of comments within first 10 hours # date and time of the post post.date <-comments.data %>%group_by(post_id)%>%arrange(written_at)%>%summarize(post.dt =first(written_at), n.coms =n())%>%filter (n.coms>100)%>%select(post_id, post.dt, n.coms)%>%arrange(desc(post_id))# head(post.date)# Calculating age of the post comments.data <-merge( comments.data , post.date , by ="post_id", all =TRUE)comments.data <-comments.data %>%mutate (age =difftime( written_at,post.dt, units ="hours"), early =ifelse(age<10, 1, 0))early.coms <-comments.data%>%group_by(post_id)%>%arrange(written_at)%>%summarize(early.sum =sum(early), n.coms =n())%>%filter (n.coms>100)%>%select(post_id, early.sum)%>%arrange(desc(post_id))# Adding number of early comments to merged dataset:merged <-merge( merged, early.coms , by ="post_id", all =TRUE)head(merged )
post_id post.date
1 27067757 2023-03-01
2 27067763 2023-02-28
3 27067769 2023-02-27
4 27067775 2023-02-24
5 27067781 2023-02-23
6 27067787 2023-02-22
Letter
1 love letters | blog | We&'ve gained weight – as a couple
2 love letters | blog | She was a friendly ghost, but I was still ghosted
3 love letters | blog | I waited for her to initiate sex – and it ended my marriage
4 love letters | blog | My husband doesn&'t share his money
5 love letters | blog | My wife won&'t go back to work
6 love letters | blog | He doesn&'t have time for me
Page views google direct Visits Uniques other.web social bdc ll Exits
1 24890 10407 10180 22766 20765 349 56 7088 2785 18547
2 15575 1210 10297 13597 11677 214 85 7461 2832 10127
3 28594 7027 15609 26115 23914 909 87 12957 3016 22464
4 72718 43419 19921 67007 62122 311 134 14106 4216 60915
5 67537 43173 16863 62888 59272 310 152 12344 2739 56929
6 28778 14466 9698 26269 23865 331 67 6332 3003 21515
Exit rate dup n.comments post.year post.month post.likes post.dislikes
1 0.81 0 208 2023 3 1001 167
2 0.74 0 264 2023 2 1908 118
3 0.86 0 241 2023 2 1195 167
4 0.91 0 321 2023 2 1921 341
5 0.91 0 294 2023 2 1445 549
6 0.82 0 182 2023 2 1337 85
post.total.likes blocked.sum pct.positive year_month uniques.ratio
1 1168 1 85.70205 2023-03 0.8342708
2 2026 1 94.17572 2023-02 0.7497271
3 1362 1 87.73862 2023-02 0.8363293
4 2262 3 84.92485 2023-02 0.8542864
5 1994 0 72.46740 2023-02 0.8776226
6 1422 0 94.02250 2023-02 0.8292793
visits.ratio reactions.ratio comments.ratio early.sum
1 0.9146645 0.04692648 0.008356770 192
2 0.8730016 0.13008026 0.016950241 247
3 0.9133035 0.04763237 0.008428342 214
4 0.9214637 0.03110647 0.004414313 261
5 0.9311637 0.02952456 0.004353169 240
6 0.9128153 0.04941275 0.006324275 142
Now, as I calculated number of early comments for all posts, I can see if that number correlated with Google referrals.
Call:
lm(formula = google ~ early.sum, data = merged.2)
Residuals:
Min 1Q Median 3Q Max
-15035 -6141 -2482 3559 54900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5908.84 5265.21 1.122 0.26406
early.sum 94.56 30.73 3.077 0.00261 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10600 on 117 degrees of freedom
Multiple R-squared: 0.07486, Adjusted R-squared: 0.06695
F-statistic: 9.467 on 1 and 117 DF, p-value: 0.002605
Code
summary(lm(google ~early.sum, data=merged))
Call:
lm(formula = google ~ early.sum, data = merged)
Residuals:
Min 1Q Median 3Q Max
-7036 -5548 -4652 2036 68716
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8921.388 1789.530 4.985 8.38e-07 ***
early.sum -12.627 9.921 -1.273 0.204
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9806 on 533 degrees of freedom
Multiple R-squared: 0.00303, Adjusted R-squared: 0.001159
F-statistic: 1.62 on 1 and 533 DF, p-value: 0.2037
We can see significant connection between early comments and google referrals, only on the data set of google referrals above 10000 visits. Also, R^2 is low in this model, which suggests that there are other factors that are not considered in this model. I will explore how days of the week and time of the post contribute to this relationship.
Code
# calculating day of the week for the post: colnames(comments.data)
Call:
lm(formula = log(google) ~ early.sum + post.weekday, data = merged.2)
Residuals:
Min 1Q Median 3Q Max
-0.85556 -0.25080 -0.05083 0.23281 1.25348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.30191 0.20350 45.710 < 2e-16 ***
early.sum 0.00352 0.00119 2.958 0.00377 **
post.weekday.L 0.01571 0.08732 0.180 0.85751
post.weekday.Q -0.02154 0.08544 -0.252 0.80145
post.weekday.C -0.02274 0.08280 -0.275 0.78409
post.weekday^4 0.15962 0.08043 1.985 0.04962 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4058 on 113 degrees of freedom
Multiple R-squared: 0.102, Adjusted R-squared: 0.06229
F-statistic: 2.568 on 5 and 113 DF, p-value: 0.03066
Code
# table(merged.weekdays$post.weekday)ggplot (merged, mapping =aes(x=early.sum, y=google, color =post.weekday))+geom_point( ) +geom_smooth(method="lm")+labs(title ="All google referrals and early comments", y ="Google referrals" , x="Early comments")+theme(axis.text.x =element_text(angle =45, hjust =1))+facet_wrap(~post.weekday)
`geom_smooth()` using formula = 'y ~ x'
Code
ggplot (merged.2, mapping =aes(x=early.sum, y=google, color =post.weekday))+geom_point( ) +geom_smooth(method="lm")+labs(title ="Google referrals over 10K and early comments", y ="Google referrals" , x="Early comments")+theme(axis.text.x =element_text(angle =45, hjust =1))+facet_wrap(~post.weekday)
`geom_smooth()` using formula = 'y ~ x'
Early comments indeed influence Google referrals. However, this influence is different on different days of the week and appears more pronounced on Thursdays and Fridays. This overall supports our suggestion that early engagement of loyal users through comments impacts post popularity. further investigation of factors would be helpful, with the goal to model fit (R^2).
Our hypothesis that large amount of early comments triggers google referrals has some preliminary support with the few restrictiions to be considere: - the connection only appears on the posts with google referrals above 10000. - This connection is atronger for only thursdays and Fridays and appears very weak on other days of the week. - low coefficient and high deviation of data from predicated valaues (low R^2) suggest that there are other , more influential, factors that were not considered.
1.4.Conclusion
As we reviewed a variety of engagement metric, I found: 1. That they are all strongly correlated to each other and influenced by post popularity: Engagement overall decreases as popularity increases.In our further analysis I will need to include “Uniques” variable in the model to control for popularity
Also as a result of exploration of engagement metrics, we can distinguish two types of users: superficial visitors and loyal readers. The main difference is that loyal readers have created accounts and therefore can comment. It is possible that htese two groups of readers respond differently to popularity increase, as well as other Independent variables. Therefore, the difference in types of readers must be considered when addressing our main hypothesi: whether authors engagement and other factors impacts reader’s engagement.
Early engagement of loyal users through comments might contribute to dramatic changes in post popularity.
Now, lets consider independent variables of our main hypothesis. # 2. Independent variables (IV):
Code
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
Code
stargazer()
Error in if (substr(inside[i], 1, nchar("list(")) == "list(") {: missing value where TRUE/FALSE needed
Authors comments
To identify, how much the author of the blog is engaged in the post, I will create an additional variable derived from a user_name field.
Code
# str(merged)comments.data$user_name<-ifelse (is.na(comments.data$user_name), 0, comments.data$user_name)comments.data$author<-ifelse (comments.data$user_name=="MeredithGoldstein", 1, 0)comments.grouped <-comments.data %>%group_by(post_id)%>%summarize(n.comments=n(),author.sum =sum(author))# dim(comments.grouped )# colnames(comments.grouped )# class(comments.grouped$author.sum)# Comments data contains rows that dont actually reporesent posts, and were crearted by web support team for troubleshooting. I need to remove these rows. They typically have very low number of commentscomments.grouped <-comments.grouped %>%filter(n.comments >100) # removing invalid posts created by the website management team.comments.grouped <-comments.grouped %>%select(post_id, author.sum)dim(comments.grouped)
[1] 535 2
Code
#adding author.sum to main data set: merged <-merge( merged , comments.grouped, by ="post_id", all =TRUE)ggplot(data=merged, mapping =aes(x=year_month , y=author.sum, fill=year_month ))+geom_boxplot() +labs(title ="distribution of Author's comments by months", y ="Author's comments" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))
This graph shows, that majority of posts have no author’s comments. However, if we look at the box plots’ upper ranges, we can suggest a trend of decrease in author’s comments with time.
Mood of the post.
This is a numerical variable, calculated as percentage of “thumbs up” from all likes (both “thumbs up” and “thumbs down”).
Code
ggplot(merged, mapping =aes(x=year_month , y=pct.positive, fill=year_month ))+geom_boxplot() +labs(title ="distribution of Mood per post ", y ="Mood" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))
Blocked comments per post.
Now I will visualize amount of blocked comments per post:
Code
ggplot(merged, mapping =aes(x=year_month , y=blocked.sum, fill=year_month ))+geom_boxplot() +labs(title ="Number Blocked comments per post ", y ="Number of blocked comments per post" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))
3. Creating the model for main hypothesis testing
As we mentioned above, we will add popularity to the model when testing main hypothesis. Also, I am going to break the data set into two groups: prior to December 2021 and after December 2021 due to dramatic change in referral sources. I will use these two datasets to to test my main hypothesis and compare results.
I will start with creating a simple model of engagement ~ popularity and author’s engagement:
Code
# colnames(merged)summary(lm(`Exit rate`~log(Uniques)+ author.sum, data = merged))
Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum, data = merged)
Residuals:
Min 1Q Median 3Q Max
-0.257486 -0.016558 0.001669 0.019958 0.072030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.326263 0.027749 -11.758 <2e-16 ***
log(Uniques) 0.113326 0.002857 39.663 <2e-16 ***
author.sum -0.003389 0.001602 -2.116 0.0348 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03063 on 532 degrees of freedom
Multiple R-squared: 0.7511, Adjusted R-squared: 0.7502
F-statistic: 802.9 on 2 and 532 DF, p-value: < 2.2e-16
Code
summary(lm(`Exit rate`~log(Uniques)+ author.sum, data = merged.2021))
Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum, data = merged.2021)
Residuals:
Min 1Q Median 3Q Max
-0.257995 -0.018855 0.000548 0.027265 0.069625
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.402793 0.096359 -4.180 4.18e-05 ***
log(Uniques) 0.121731 0.010266 11.857 < 2e-16 ***
author.sum -0.002533 0.002568 -0.986 0.325
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0372 on 224 degrees of freedom
Multiple R-squared: 0.3856, Adjusted R-squared: 0.3801
F-statistic: 70.3 on 2 and 224 DF, p-value: < 2.2e-16
Code
summary(lm(`Exit rate`~log(Uniques)+ author.sum, data = merged.2022))
Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum, data = merged.2022)
Residuals:
Min 1Q Median 3Q Max
-0.083434 -0.014968 0.002043 0.017982 0.053266
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.354200 0.029528 -11.995 < 2e-16 ***
log(Uniques) 0.115989 0.002974 38.996 < 2e-16 ***
author.sum -0.005402 0.001997 -2.705 0.00722 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.02455 on 305 degrees of freedom
Multiple R-squared: 0.8369, Adjusted R-squared: 0.8359
F-statistic: 782.7 on 2 and 305 DF, p-value: < 2.2e-16
To see how engagement of a subset of readers changes (loyal readers), I will use another measure for engagement as DV, which represents only registered users: number of comments.
Code
summary(lm(n.comments ~log(Uniques) + author.sum , data = merged))
Call:
lm(formula = n.comments ~ log(Uniques) + author.sum, data = merged)
Residuals:
Min 1Q Median 3Q Max
-98.670 -35.171 -8.842 26.596 181.918
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 238.521 45.076 5.292 1.78e-07 ***
log(Uniques) -4.667 4.641 -1.006 0.3151
author.sum 4.872 2.602 1.872 0.0617 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 49.75 on 532 degrees of freedom
Multiple R-squared: 0.009115, Adjusted R-squared: 0.00539
F-statistic: 2.447 on 2 and 532 DF, p-value: 0.08752
This shows that engagement is decreasing with popularity and someone increasing with author’s comments. for loyal readers : engagement is also sensitive to author’s comments, but not sensitive to popularity of the post.
analize change in x1 and x2 to y , considering log()
3.2 Adding other variables.
Now I will add other variables to see if there is any impact on the results of the model with engagement of both groups of readers. For number of comments, I would have to select either early.sum or n.comments, due to high correlation between them
Code
(cor(merged$early.sum, merged$n.comments))
[1] 0.9507822
I am including n/comments in the model, to see if engagement of loyal readers through comments impact overall engagement:
Ahd it is shown to be significant, negativelly impacting overall engagement. considering our finding s earlier, that early comments increase popularity and therefore decrease overall engagement, this is expected. Tuesday as level of “weekday” also shows to significantly decrease engagement.
Authors engagement, day of the week(Tuesday) and blocked comments seem to increase engagement.
Code
summary(lm(n.comments ~log(Uniques) + author.sum + pct.positive +blocked.sum +post.weekday, data = merged)) #not adding early.sum due to high correlation with DV: cor(merged$early.sum, merged$n.comments): 0.95
While general viewers show no impact by mood of the post or blocked comments, we see significant positive correlation of engagement with all of these metrics for loyal readers(=commenters).
Number of comments, representing engagement is somewhat impacted by three independent variables, but this model has low R^2 , which suggests that there are other factors impacting loyal reader’s engagement that are not considered in this model.
3.3 Using different datasets for comparing:
Exit rate for 2021 and 2022:
Code
# colnames(merged)summary(lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2021))
Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum + pct.positive +
blocked.sum, data = merged.2021)
Residuals:
Min 1Q Median 3Q Max
-0.258483 -0.017314 0.000667 0.027730 0.077470
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.315e-01 1.194e-01 -4.451 1.35e-05 ***
log(Uniques) 1.259e-01 1.053e-02 11.949 < 2e-16 ***
author.sum -2.673e-03 2.563e-03 -1.043 0.2981
pct.positive 1.009e-03 5.573e-04 1.811 0.0715 .
blocked.sum 7.438e-05 4.341e-04 0.171 0.8641
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03709 on 222 degrees of freedom
Multiple R-squared: 0.3946, Adjusted R-squared: 0.3837
F-statistic: 36.18 on 4 and 222 DF, p-value: < 2.2e-16
Code
summary(lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2022))
Call:
lm(formula = `Exit rate` ~ log(Uniques) + author.sum + pct.positive +
blocked.sum, data = merged.2022)
Residuals:
Min 1Q Median 3Q Max
-0.08367 -0.01553 0.00204 0.01839 0.05326
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3315325 0.0368576 -8.995 < 2e-16 ***
log(Uniques) 0.1153984 0.0029870 38.633 < 2e-16 ***
author.sum -0.0052803 0.0019958 -2.646 0.00858 **
pct.positive -0.0001812 0.0002344 -0.773 0.44014
blocked.sum -0.0003622 0.0002190 -1.654 0.09915 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0245 on 303 degrees of freedom
Multiple R-squared: 0.8386, Adjusted R-squared: 0.8365
F-statistic: 393.7 on 4 and 303 DF, p-value: < 2.2e-16
Exit rate (engagement) still correlates with popularity of the site or bot 2021 and 2021,confirming the same finding from un-splitted data set earlier. However, author’s comments appear only be relevant to all user’s engagement in 2022, and not in 2021. Mood of the post appears significant for user’s engagement only in 2021, and blocked comments only appear significant in 2022. ## 3.2 Diagnostic of the model:
Code
# colnames(merged)model.ex.r<-lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged)par(mfrow =c(2,3))plot(model.ex.r, which =1:6)
Code
ex.r.2021<-lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2021)ex.r.2022<-lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2022)par(mfrow =c(2,3))plot(ex.r.2021, which =1:6)
Code
par(mfrow =c(2,3))plot(ex.r.2022, which =1:6)
Code
model.2<-lm(n.comments ~log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged)par(mfrow =c(2,3))plot(model.2, which =1:6)
Number of comments for 2021 and 2022:
Code
summary(lm(n.comments ~log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged.2021))
Call:
lm(formula = n.comments ~ log(Uniques) + author.sum + pct.positive +
blocked.sum, data = merged.2021)
Residuals:
Min 1Q Median 3Q Max
-106.632 -38.547 -8.915 28.987 167.497
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 209.0074 180.9229 1.155 0.249
log(Uniques) 3.0321 15.9599 0.190 0.849
author.sum 4.5830 3.8832 1.180 0.239
pct.positive -0.3590 0.8443 -0.425 0.671
blocked.sum 0.8618 0.6577 1.310 0.191
Residual standard error: 56.19 on 222 degrees of freedom
Multiple R-squared: 0.01739, Adjusted R-squared: -0.000314
F-statistic: 0.9823 on 4 and 222 DF, p-value: 0.418
Code
summary(lm(n.comments ~log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged.2022))
Call:
lm(formula = n.comments ~ log(Uniques) + author.sum + pct.positive +
blocked.sum, data = merged.2022)
Residuals:
Min 1Q Median 3Q Max
-76.762 -28.410 -7.407 25.375 155.874
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -72.7563 59.4359 -1.224 0.221861
log(Uniques) 18.5894 4.8169 3.859 0.000139 ***
author.sum 1.9188 3.2184 0.596 0.551481
pct.positive 0.7834 0.3780 2.072 0.039094 *
blocked.sum 1.1017 0.3531 3.120 0.001981 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 39.51 on 303 degrees of freedom
Multiple R-squared: 0.07768, Adjusted R-squared: 0.0655
F-statistic: 6.38 on 4 and 303 DF, p-value: 6.108e-05
None of the variables show significance for loyal user’s engagement(comments) in 2021. However, loyal users are significantly influenced by popularity, mood of the post and blocked comments in 2022.
Adding a weekday:
Adding the weekday of the post ad additional IV increases R^2 and shows significance for one of the levels:
summary(lm(e.rate ~log(Uniques) + author.sum+early.sum, data = merged))
Call:
lm(formula = e.rate ~ log(Uniques) + author.sum + early.sum,
data = merged)
Residuals:
Min 1Q Median 3Q Max
-2538.89 -143.55 24.05 198.90 685.26
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3634.7905 281.4951 -12.912 < 2e-16 ***
log(Uniques) 1144.2628 28.0370 40.813 < 2e-16 ***
author.sum -35.9680 15.6776 -2.294 0.0222 *
early.sum 1.5193 0.3042 4.994 8.04e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 299.6 on 531 degrees of freedom
Multiple R-squared: 0.7623, Adjusted R-squared: 0.761
F-statistic: 567.6 on 3 and 531 DF, p-value: < 2.2e-16
The model’s predictive value is pretty good (probably due to popularity being included) : 0.77. It also appears that this connection We can check it, by excluding the popularity, as see that R^2 drops dramatically:
Code
summary(lm(e.rate ~ author.sum+early.sum, data = merged))
Call:
lm(formula = e.rate ~ author.sum + early.sum, data = merged)
Residuals:
Min 1Q Median 3Q Max
-3227.6 -410.9 -65.3 383.0 1613.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7634.2416 111.2795 68.604 <2e-16 ***
author.sum -92.0013 31.7347 -2.899 0.0039 **
early.sum 0.5435 0.6163 0.882 0.3783
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 608.8 on 532 degrees of freedom
Multiple R-squared: 0.01667, Adjusted R-squared: 0.01298
F-statistic: 4.51 on 2 and 532 DF, p-value: 0.01142
4. Conlusion
This study explored the connection between user’s online engagement and bolg’s author engagement. To do so, we also analyzed the impact of post popularity on engagement.
Code
knitr::include_graphics("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\FP diagram .png")
The results showed that the post popularity is the major factor for online engagement and is negatively correlated with it. i.e. more popular posts tend to be less engaging.
We also identified two groups of users, whose engagement is measured with different metrics. We explored how blog author’s comments impact engagement of each group. We found than there is no significant impact of authors engagement on users engagement. Such characteristics, as post mood and blocked comments also showed no connection with engagement of either groups of users.
While engagement metrics (exit rate and comments) did not show significant contribution, authors comment appeared significant in this model. - this is really out of place, i just don’t want to lose this.
Source Code
---title: "Final Project check-in 2"author: "Diana Rinker"description: "Final project DACSS 603"date: "4/24/2023"format: html: toc: true code-fold: true code-copy: true code-tools: trueeditor: markdown: wrap: 72---## DACSS 603, spring 2023## Final Project check-in 2, Diana Rinker.# Introduction. Online engagementIt is well known that online engagement and popularity with the webresource is a highly valuable metric and is driving site revenue. Thisresearch project is exploring which factors contribute to users onlineengagement.To do that I will use the data of an online blog on the news website.The author of this blog is posting articles about interpersonalrelationships every work day (Mon- Fri). The posts are formulated as aletter from a reader with the situation and a question aboutrelationships. The author gives an advice about the situation. Websitereaders are free to comment under each post, but cannot make their ownposts.All post methadata and comments are public. They are saved by thewebsite and available for the analysis. Using this data set, I willexplore how readers' engagement connected with blogs's authorengagement, site comments', web source of readers and negative behaviorsonline.My research question is: Does the authors engagement in the conversationaround the post makes readers more engaged and promotes positiveinteractions among them?**DV:** My dependent construct is **"user's engagement**", I willmeasure users' engagement at the level of individual post, using thefollowing metrics:1. Page views2. Page visits (one visit can contain a few views, if the person visits other pages on the site)3. Unique users. One person can visit the same page a few times (it is calculated for all time since the page was posted).4. Number of votes: "thums up" or "thumbs down"5. Number of comments6. Exit rate or "bounces". When the visitor is coming to the page and then leaving, i.e. not opening other pages on this website.L2 - page readers \* Reveal letter \* Reveal comments**IV:** My main independent variable is **Blog's author engagement.** Iwill measure authors engagement as the factor variable, with thefollowing levels:A. Unspecified commentB. Featured commentC. Engagement in conversationTo control for **confounders,** I will also measure the follwingvariables:1. Topic of the post ("post tag"), categorical variable.2. Source of the readers, also categorical variable.3. Mood of the conversation , derivative continuous variable calculated as the ratio of "likes" to "dislikes".4. Blocked and flagged comments.```{r, echo=F}#Loading necessary libraries: library(readxl)library(dplyr)library(ggplot2)library(tidyverse)```### Data source and description. To answer my research question I will use two datasets. the first dataset has information about all comments associated with each post by postID. The second data set is analytics data for the web bage. It containsone post per row and variables describe each post as a whole withoutbreaking down to the comment level.In this project I will analyze posts for January 2021 - February 2023.Here is the list of variables in each dataset: Post-level data:```{r, echo=T}getwd()# First, I will load the data set with the comment level data:raw <-as_tibble (read_csv("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\globe\\ data.2021.plus.csv"))comments.data<-raw colnames (comments.data)head(comments.data$written_at)comments.data <-comments.data%>%mutate(com.year =format(written_at,format ="%Y" ))# range(comments.data$com.year)# dim(comments.data)str(comments.data)```Comment data: ```{r, echo=T}# Second, loadng post-level data :merged <-as_tibble (read_csv("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\globe\\data.merged.csv"))# colnames(merged)str(merged)```BeTo begin, I will review available variables and evaluate whether it is agood measure for this study.# 1.DV: popularity and engagement. ## 1.1.Popularity### Page viewsThis is the most general metric, representing how many views the postreceived. Views do not distinguish repeated views by the same person.```{r, echo=T}# str(merged)ggplot(data=merged, mapping=aes(x=`Page views`))+geom_histogram()``````{r, echo=T}merged$post.month <-as.numeric(merged$post.month)merged$year_month <-paste0(merged$post.year, "-", sprintf("%02d", merged$post.month))ggplot( data=merged, mapping=aes(y=`Page views`, x=year_month))+geom_boxplot()+labs(title="Number of post wiews per month", x="Month", y="Number of vews")+theme(axis.text.x =element_text(angle =45, hjust =1))# ggplot(data=merged, mapping=aes(x=log(`Page views`)))+# geom_histogram()```From this graph we can see that the numbers of views is increased overtime. To get a better understanding of it, lets review other metrics.### Page visits and unique usersVisit is an instance of a user engaging with bdc website. The user canvisit a few pages of the site and re-visit them, creaing few views inonce visit.Once th user left the website, the visit is over.There also can be many visits by the same viewer. To account forrepeated users there is "Uniques" variable. It tells us how many uniqueusers saw the post. The relationship will always be thatUniques \< Visits \< ViewsLets plot them together:```{r, echo=T}# str(merged)ggplot(merged, aes(x =post.date )) +geom_bar(aes(y =`Page views`, fill ="Page views"), stat ="identity", position ="dodge", width =0.6) +geom_bar(aes(y = Visits, fill ="Visits"), stat ="identity", position ="dodge", width =0.6) +geom_bar(aes(y = Uniques, fill ="Unique users"), stat ="identity", position ="dodge", width =0.6) +labs(title ="Views, visits and unique users per month", x ="Post.date", y ="Value") +theme(axis.text.x =element_text(angle =45, hjust =1))```This graph also shows increase in all three metrics in 2022-2023. To seehow these metrics crrespond to each other, I will calculate "visitsratio" and "unique.ratio".### Page unique viewers.Becuse "Uniques" variable represents number of unique people who came tothe page and viewed it at least once, his metric represents popularityof the post. It's distribution shows us that not all posts are equallypopular:```{r, echo=T}# colnames(merged)ggplot(data=merged, mapping=aes(x=Uniques))+geom_histogram()```We can see a long tail on the right, showing that there is a number ofposts who are way more popular. If we look at the distribution of thisvariable over time, we will see significant increase of averagepopularity after December 2021.```{r, echo=T}ggplot( data=merged, mapping=aes(y=Uniques, x=merged$year_month))+geom_boxplot()+labs(title="Number of unique viewers per month", x="Month", y="Number of unique viewers")+theme(axis.text.x =element_text(angle =45, hjust =1))```## 1.2. Engagement metrics:### Visit ratio and unique ratio:"visits ratio" is the number of visits per total views. This metrictells us how often the page being re-viewed. The lower the ratio meansthat the page was viewed more times during single visit. Since repeatedview can be considered a higher engagement, the lower visit ratioindicates higher engagement."unique ratio" - number of unique users per total views. It is tellingus how often the same person re-visited the post. Lower ratio indicatesrepeated users, therefore higher engagement.Both ratios are always in the range of 0 - 1.```{r, echo=T} merged <- merged %>%mutate(uniques.ratio = Uniques /`Page views`)%>%mutate(visits.ratio = Visits /`Page views`)# str(merged)ggplot(merged, aes( y = uniques.ratio, x =post.date, )) +geom_point(color ="green" ) +labs(title ="Unique.ratio per post ", x ="Post.date", y ="Value") +theme(axis.text.x =element_text(angle =45, hjust =1)) ggplot(merged, aes( y = visits.ratio, x =post.date, )) +geom_point(color ="blue" ) +labs(title ="Visits.ratio per post ", x ="Post.date", y ="Value") +theme(axis.text.x =element_text(angle =45, hjust =1))```From the two graphs above we can see that the two variables aredistributed very similarly.If we plot them against each other and visualize their correlation, wecan see that they highly correlated especially in higher values.```{r, echo=T}plot(merged$uniques.ratio, merged$visits.ratio)correlation <-cor(merged$uniques.ratio, merged$visits.ratio)correlation```Density of the distribution also showing that most posts are visitedonce and by unique users, i.e. post is read once by each user.```{r, echo=T}# DENSITY DISTRIBUTION HERE ```### Exit rateThis variable is measuring how many people visited the page and thenleft the website after the first view. This metric is the best measureof engagement for all users, as it represents the first step after beingexposed to the post - either quitting the site or remaining on the site.Here can see the distribution of this variable :```{r, echo=T}# str(merged)merged$`Exit rate`<-as.numeric(sub("%", "", merged$`Exit rate`)) /100ggplot(data=merged, mapping=aes(x=`Exit rate`))+geom_histogram()```We can also see consistent increase of this value ovr time, similar tothe trend seen before:```{r, echo=T}ggplot(merged, mapping =aes(x=year_month , y=`Exit rate`, fill=year_month ))+geom_boxplot() +labs(title ="distribution of `Exit rate` per post ", y ="Exit rate" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))```### Number of reactionsThis represents total amount of likes and dislikes of comments perpost.To like/dislike the comment, user doesn't have to log in.```{r, echo=T}# merged$year_month # colnames(merged)ggplot(data=merged, mapping=aes(x=post.total.likes))+geom_histogram()+labs(title=" Number of all reactions per post")```Since we already saw that there is large variability in popularity ofposts, which can natually impact absolute amount of reactions. Toaccount for number of views, we calculate reactions.ratio:```{r, echo=T} merged <- merged %>%mutate(reactions.ratio = post.total.likes /`Page views`)# str(merged)ggplot(merged, mapping =aes(x=year_month , y=reactions.ratio, fill=year_month ))+geom_boxplot() +labs(title ="distribution of reactions ", y ="Percentage of reactions " )+scale_y_continuous(breaks =seq (from=0, to=10000, by=100)) +theme(axis.text.x =element_text(angle =45, hjust =1))```The absolute number of reactions and the ratio appear to decrease overtime.### Number of comments: engagement metric for subset of readers. Absolute and per views.To comment on the requires logging in from a user, which is an indicatorof greater engagement of an individual user. Therefore this variablerepresents engagement of a subset of users - more loyal readers who havecreated an account.```{r, echo=T}# merged$year_month ggplot(data=merged, mapping=aes(x=n.comments))+geom_histogram()```And change in the distribution over time:```{r, echo=T}ggplot(merged, mapping =aes(x=year_month , y=n.comments, fill=year_month ))+geom_boxplot() +labs(title ="distribution of comments per post ", y ="Number of comments" )+scale_y_continuous(breaks =seq (from=0, to=10000, by=100)) +theme(axis.text.x =element_text(angle =45, hjust =1))```The trend in amount of comments over time is different. While previouslyreviewed variables seemed to increase over time, this variabledecreases.We can calculate the ratio of comments per If we compare absolute valuesfo comments and ratio of comments per page views, we see the decrease inboth cases, while the ratio's decrease is more dramatic:```{r, echo=T} merged <- merged %>%mutate(comments.ratio = n.comments/`Page views`)ggplot(merged, mapping =aes(x=year_month , y=comments.ratio, fill=year_month ))+geom_boxplot() +labs(title ="distribution of comments ratio (comments/page views) ", y ="Number of comments" )+scale_y_continuous(breaks =seq (from=0, to=10000, by=100)) +theme(axis.text.x =element_text(angle =45, hjust =1))```WE can see that with an increase of views (mostly driven by postpopularity), comments ratio decreeases. So, newly added viewers are notadding to the engaged subset of readers.<!-- It is also more likely that the users who logged and commented will return to the post page , which would correspond with the decrased in visits.ratio and uniques.ratio). -->The distribution of comments per post over time:```{r, echo=T}# str(merged)ggplot(merged, mapping =aes(x=year_month , y=post.total.likes, fill=year_month ))+geom_boxplot() +labs(title ="distribution of reactions over time ", y ="Percentage of reactions " )+scale_y_continuous(breaks =seq (from=0, to=10000, by=100)) +theme(axis.text.x =element_text(angle =45, hjust =1))```## 1.3 Impact of popularity on engagement.Since we noticed a pattern in distribution of describbed variables, lets find out how they are interconnected, and how popularity of the post impact users' engagement: ```{r, echo=T}pairs(subset (merged, select=c(Uniques, `Exit rate`, uniques.ratio, visits.ratio, reactions.ratio )))```This graph showing that popularity is strongly correlated with all engagement metrics, such as Exit rate, uniques ratio, visit ratio and reaction's ratio, where an increase in popularity causes decrease of engagement across all 4 metrics for engagement.We can also see that the relationship between Popularity and engagement rate are curvi-linear. If we log the values, we will see linear relationship: ```{r, echo=T}pairs(subset (merged, select=c(log(Uniques), log(`Exit rate`), uniques.ratio, visits.ratio, reactions.ratio )))# WHT DOES THE GRAPH SHOWS POST.DATE and POST.DATE1 INSTEAD OF "UNIQUES" and "Exit rate"? ```As we saw in the distributions of variables above, all engagement metrics are strongly correlate with each other the popularity metric. ### 1.3.1. Modeling engagement ~ popularity: Lets review how it is impacting exit rate :I will use "Exit rate" s my main engagement variable, as it reporesents the first engagement chouce that all site users make: to stay on the site or leave. Using log() of popularity is shoing a better fitted model: ```{r, echo=T}ggplot(merged, mapping=aes(x=log(Uniques), y=`Exit rate` ))+geom_point()summary(lm(`Exit rate`~+log(Uniques) , data = merged))```We can see significance of popularity and high R^2 of the model, and can conclude that popularity overall decreases engagement. To explore this connection further, lets mreview Sources of users that are coming to the website.### 1.3.2. How popularity impacts engagement: Referral sources.The website analytics provides information on where the viewers arecoming from to the blog page. for example, if people clicked on the bloglink posted on FaceBook, that would be referral from social media. Ifpeople clicked on the blog link within BDC website, that would be "BDCreferral visit".There are 5 sources of referrals, each corresponding with a variable inthe data set. Variable's value is a number of visits from this referralsource.``` "Search + amp referral visits" "Direct (non-email) referral visits" "Other website referral visits" "Social referral visits" "BDC referral visits" "Visits when post was on LL HP" ``````{r, echo=T}#renaming variables for convenience: merged<-merged%>%rename(google ="Search + amp referral visits",direct ="Direct (non-email) referral visits",other.web ="Other website referral visits",social="Social referral visits",bdc="BDC referral visits",ll="Visits when post was on LL HP" )ggplot(merged, mapping=aes(x=post.date))+geom_point(aes(y=google), color="red")+geom_point(aes(y=direct), color="green")+geom_point(aes(y=other.web), color="yellow")+geom_point(aes(y=social), color="purple")+geom_point(aes(y=bdc), color="blue")+geom_point(aes(y=ll), color="pink") +labs(title ="referral sources per post ", y ="Number of referrals" , x="Post")+theme(axis.text.x =element_text(angle =45, hjust =1))```This graph clearly demonstrates increase of "Search + amp referralvisits" after December 2021, while other referral sources maintain thesame level over time. This increase matches changes observed inpopularity ( Page Views), and engagement metrics that are highlycorrelated with popularity. It also doubles views of the posts in manycases and might significanly influence our model and conclusions.### Early comments'impact We saw that increased popularity is associated with the sources of the viewers and might come from the search engines.We want to find out if early engagement in the post ( comments within first 10 hours of the post time) trigger its appearance in google news listing and therefore cause sharp increase in popularity. We are interested to find out, what characteristics of the post (priorto it being picked up by google), correlate with its appearance ongoogle news.Number of comments within 1st 10 hours from the post.```{r, echo=T}# calculating number of comments within first 10 hours # date and time of the post post.date <-comments.data %>%group_by(post_id)%>%arrange(written_at)%>%summarize(post.dt =first(written_at), n.coms =n())%>%filter (n.coms>100)%>%select(post_id, post.dt, n.coms)%>%arrange(desc(post_id))# head(post.date)# Calculating age of the post comments.data <-merge( comments.data , post.date , by ="post_id", all =TRUE)comments.data <-comments.data %>%mutate (age =difftime( written_at,post.dt, units ="hours"), early =ifelse(age<10, 1, 0))early.coms <-comments.data%>%group_by(post_id)%>%arrange(written_at)%>%summarize(early.sum =sum(early), n.coms =n())%>%filter (n.coms>100)%>%select(post_id, early.sum)%>%arrange(desc(post_id))# Adding number of early comments to merged dataset:merged <-merge( merged, early.coms , by ="post_id", all =TRUE)head(merged )```Now, as I calculated number of early comments for all posts, I can seeif that number correlated with Google referrals. ```{r, echo=T}merged.2<-merged%>%filter(google>10000)correlation.2<-cor(merged.2$early.sum, merged.2$google)correlation.2plot(merged.2$early.sum, merged.2$google)``````{r, echo=T}summary(lm(google ~early.sum, data=merged.2))summary(lm(google ~early.sum, data=merged))```We can see significant connection between early comments and googlereferrals, only on the data set of google referrals above 10000 visits. Also, R\^2 is low in this model, whichsuggests that there are other factors that are not considered in thismodel. I will explore how days of the week and time of the postcontribute to this relationship.```{r, echo=T}# calculating day of the week for the post: colnames(comments.data)# Calculating age of the post library(lubridate)comments.data <-comments.data %>%mutate (weekday =wday(post.dt, label =TRUE), weekend =ifelse(weekday =="Fri"| weekday =="Sat", 1, 0))# class(week.days$weekday)# levels(week.days$weekday) # week.days$weekendweekdays <-comments.data %>%group_by(post_id)%>%arrange(written_at)%>%summarize(post.weekday =first(weekday), post.weekend =first(weekend),n.coms =n())%>%filter (n.coms>100)%>%select(post_id, post.weekday,post.weekend )%>%arrange(desc(post_id))# Adding number of early comments to merged dataset:merged <-merge( merged, weekdays , by ="post_id", all =TRUE)colnames(merged)# levels(merged$post.weekday)```### Post weekdays to the model```{r, echo=T}merged.2<-merged %>%filter(google>10000)plot(merged$post.weekday, merged$google)summary(lm(log(google) ~early.sum +post.weekday , data=merged.2))# table(merged.weekdays$post.weekday)ggplot (merged, mapping =aes(x=early.sum, y=google, color =post.weekday))+geom_point( ) +geom_smooth(method="lm")+labs(title ="All google referrals and early comments", y ="Google referrals" , x="Early comments")+theme(axis.text.x =element_text(angle =45, hjust =1))+facet_wrap(~post.weekday)ggplot (merged.2, mapping =aes(x=early.sum, y=google, color =post.weekday))+geom_point( ) +geom_smooth(method="lm")+labs(title ="Google referrals over 10K and early comments", y ="Google referrals" , x="Early comments")+theme(axis.text.x =element_text(angle =45, hjust =1))+facet_wrap(~post.weekday)```Early comments indeed influence Google referrals. However, this influence is different on different days of the week and appears more pronounced on Thursdays and Fridays. This overall supports our suggestion that early engagement of loyal users through comments impacts post popularity. further investigation of factors would be helpful, with the goal to model fit (R^2).Our hypothesis that large amount of early comments triggers googlereferrals has some preliminary support with the few restrictiions to be considere: - the connection only appears on the posts with google referrals above 10000. - This connection is atronger for only thursdays and Fridays and appears very weak on other days of the week. - low coefficient and high deviation of data from predicated valaues (low R\^2) suggestthat there are other , more influential, factors that were not considered. ## 1.4.ConclusionAs we reviewed a variety of engagement metric, I found:1. That they are all strongly correlated to each other and influenced by post popularity:Engagement overall decreases as popularity increases.In our further analysis I will need to include "Uniques" variable in themodel to control for popularity2. Also as a result of exploration of engagement metrics, we can distinguish two types of users: superficial visitors and loyal readers. The main difference is that loyal readers have created accounts and therefore can comment. It is possible that htese two groups of readers respond differently to popularity increase, as well as other Independent variables. Therefore, the difference in types of readers must be considered when addressing our main hypothesi: whether authors engagement and other factors impacts reader's engagement.3. Early engagement of loyal users through comments might contribute to dramatic changes in post popularity. Now, lets consider independent variables of our main hypothesis. # 2. Independent variables (IV):```{r, echo=T}library(stargazer)stargazer()```### Authors commentsTo identify, how much the author of the blog is engaged in the post, Iwill create an additional variable derived from a user_name field.```{r, echo=T}# str(merged)comments.data$user_name<-ifelse (is.na(comments.data$user_name), 0, comments.data$user_name)comments.data$author<-ifelse (comments.data$user_name=="MeredithGoldstein", 1, 0)comments.grouped <-comments.data %>%group_by(post_id)%>%summarize(n.comments=n(),author.sum =sum(author))# dim(comments.grouped )# colnames(comments.grouped )# class(comments.grouped$author.sum)# Comments data contains rows that dont actually reporesent posts, and were crearted by web support team for troubleshooting. I need to remove these rows. They typically have very low number of commentscomments.grouped <-comments.grouped %>%filter(n.comments >100) # removing invalid posts created by the website management team.comments.grouped <-comments.grouped %>%select(post_id, author.sum)dim(comments.grouped)#adding author.sum to main data set: merged <-merge( merged , comments.grouped, by ="post_id", all =TRUE)ggplot(data=merged, mapping =aes(x=year_month , y=author.sum, fill=year_month ))+geom_boxplot() +labs(title ="distribution of Author's comments by months", y ="Author's comments" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))```This graph shows, that majority of posts have no author's comments.However, if we look at the box plots' upper ranges, we can suggest atrend of decrease in author's comments with time.### Mood of the post.This is a numerical variable, calculated as percentage of "thumbs up"from all likes (both "thumbs up" and "thumbs down").```{r, echo=T}ggplot(merged, mapping =aes(x=year_month , y=pct.positive, fill=year_month ))+geom_boxplot() +labs(title ="distribution of Mood per post ", y ="Mood" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))```### Blocked comments per post.Now I will visualize amount of blocked comments per post:```{r, echo=T}ggplot(merged, mapping =aes(x=year_month , y=blocked.sum, fill=year_month ))+geom_boxplot() +labs(title ="Number Blocked comments per post ", y ="Number of blocked comments per post" , x="Month")+theme(axis.text.x =element_text(angle =45, hjust =1))```# 3. Creating the model for main hypothesis testingAs we mentioned above, we will add popularity to the model when testing main hypothesis. Also, I am going to break the data set into two groups: prior to December 2021 and after December 2021 due to dramatic change in referral sources. I will use these two datasets to to test my main hypothesis and compare results. ```{r, echo=T}merged.2021<- merged%>%filter(post.date <"2021-12-01")merged.2022<- merged%>%filter(post.date >="2021-12-01")```## 3.1 Basic model. I will start with creating a simple model of engagement ~ popularity and author's engagement: ```{r, echo=T}# colnames(merged)summary(lm(`Exit rate`~log(Uniques)+ author.sum, data = merged))summary(lm(`Exit rate`~log(Uniques)+ author.sum, data = merged.2021))summary(lm(`Exit rate`~log(Uniques)+ author.sum, data = merged.2022))```To see how engagement of a subset of readers changes (loyal readers), I will use another measure for engagement as DV, which represents only registered users: number of comments. ```{r, echo=T}summary(lm(n.comments ~log(Uniques) + author.sum , data = merged))```This shows that engagement is decreasing with popularity and someone increasing with author's comments. for loyal readers : engagement is also sensitive to author's comments, but not sensitive to popularity of the post.### analize change in x1 and x2 to y , considering log() ## 3.2 Adding other variables.Now I will add other variables to see if there is any impact on the results of the model with engagement of both groups of readers. For number of comments, I would have to select either early.sum or n.comments, due to high correlation between them```{r, echo=T}(cor(merged$early.sum, merged$n.comments))```I am including n/comments in the model, to see if engagement of loyal readers through comments impact overall engagement:```{r, echo=T}# colnames(merged)summary(lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum+post.weekday +n.comments, data = merged))```Ahd it is shown to be significant, negativelly impacting overall engagement. considering our finding s earlier, that early comments increase popularity and therefore decrease overall engagement, this is expected. Tuesday as level of "weekday" also shows to significantly decrease engagement. Authors engagement, day of the week(Tuesday) and blocked comments seem to increase engagement. ```{r, echo=T}summary(lm(n.comments ~log(Uniques) + author.sum + pct.positive +blocked.sum +post.weekday, data = merged)) #not adding early.sum due to high correlation with DV: cor(merged$early.sum, merged$n.comments): 0.95``` While general viewers show no impact by mood of the post or blocked comments, we see significant positive correlation of engagement with all of these metrics for loyal readers(=commenters). Number of comments, representing engagement is somewhat impacted bythree independent variables, but this model has low R\^2 , whichsuggests that there are other factors impacting loyal reader'sengagement that are not considered in this model.## 3.3 Using different datasets for comparing: ### Exit rate for 2021 and 2022:```{r, echo=T}# colnames(merged)summary(lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2021))summary(lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2022))```Exit rate (engagement) still correlates with popularity of the site or bot 2021 and 2021,confirming the same finding from un-splitted data set earlier. However, author's comments appear only be relevant to all user's engagement in 2022, and not in 2021. Mood of the post appears significant for user's engagement only in 2021, and blocked comments only appear significant in 2022. ## 3.2 Diagnostic of the model: ```{r, echo=T}# colnames(merged)model.ex.r<-lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged)par(mfrow =c(2,3))plot(model.ex.r, which =1:6)ex.r.2021<-lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2021)ex.r.2022<-lm(`Exit rate`~log(Uniques)+ author.sum + pct.positive +blocked.sum, data = merged.2022)par(mfrow =c(2,3))plot(ex.r.2021, which =1:6)par(mfrow =c(2,3))plot(ex.r.2022, which =1:6)```{r, echo=T}model.2<- lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged)par(mfrow = c(2,3))plot(model.2, which = 1:6)```### Number of comments for 2021 and 2022:```{r, echo=T}summary(lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged.2021))summary(lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum, data = merged.2022))```None of the variables show significance for loyal user's engagement(comments) in 2021. However, loyal users are significantly influenced by popularity, mood of the post and blocked comments in 2022. ### Adding a weekday:Adding the weekday of the post ad additional IV increases R\^2 and showssignificance for one of the levels:```{r, echo=T}summary(lm(n.comments ~ log(Uniques) + author.sum + pct.positive +blocked.sum + post.weekday, data = merged))summary(lm(`Exit rate` ~ log(Uniques) + author.sum + pct.positive+blocked.sum +post.weekday, data = merged))class(merged$post.weekday)```### Adding early comments:Measuring how early comments impact engagement:```{r, echo=T}merged<- merged %>% mutate (e.rate = `Exit rate`*10000)summary(lm(e.rate ~ log(Uniques) + author.sum +post.weekday +early.sum, data = merged))summary(lm(e.rate ~ log(Uniques) + author.sum+early.sum, data = merged))```The model's predictive value is pretty good (probably due to popularitybeing included) :0.77. It also appears that this connection We cancheck it, by excluding the popularity, as see that R\^2 dropsdramatically:```{r, echo=T}summary(lm(e.rate ~ author.sum+early.sum, data = merged))```# 4. ConlusionThis study explored the connection between user's online engagement and bolg's author engagement. To do so, we also analyzed the impact of post popularity on engagement. ```{r, echo=T}knitr::include_graphics("C:\\Users\\Diana\\OneDrive - University Of Massachusetts Medical School\\Documents\\R\\R working directory\\DACSS\\603\\my study files for dacss603\\FP diagram .png")```The results showed that the post popularity is the major factor for online engagement and is negatively correlated with it. i.e. more popular posts tend to be less engaging. We also identified two groups of users, whose engagement is measured with different metrics. We explored how blog author's comments impact engagement of each group. We found than there is no significant impact of authors engagement on users engagement. Such characteristics, as post mood and blocked comments also showed no connection with engagement of either groups of users. While engagement metrics (exit rate and comments) did not showsignificant contribution, authors comment appeared significant in thismodel. - this is really out of place, i just don't want to lose this.#