Homework 2 - Darron Bunt
Loading Packages into R Environment
library(tidyverse)
library(ggplot2)
library(readr)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Loading the Dataset
I have chosen to use my own data. Specifically, the dataset is made up of all of the Twitter posts authored by the 50 state flagship universities during the month of November 2022. In addition to the posts themselves, the dataset includes a variety of associated metrics (such as the date/time posted, post impressions and reach, number of likes/retweets/comments, total number of Twitter followers, and sentiment as tagged by artificial intelligence).
#import data
FlagshipTwitter <- read_csv("_data/FlagshipTwitterUpdated.csv")
Cleaning the Dataset
1. Cleaning out any rows that are not Twitter data
I realized that when I ran my query in Brandwatch, I had not specified that I only wanted to return authors from Twitter. As a result, I pulled in some mentions from forums, blogs, YouTube, etc., where the author name matched a flagship institution’s Twitter author name.
#remove rows where the Page Type isn't twitter
FlagshipTwitter2 <- subset(FlagshipTwitter, PageType == 'twitter')
FlagshipTwitter2
# A tibble: 5,658 × 37
Date Url Domain Senti…¹ PageT…² Langu…³ Count…⁴ Conti…⁵ Conti…⁶ Country
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
2 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
3 11/30/2… http… twitt… positi… twitter en USA NORTH … North … United…
4 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
5 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
6 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
7 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
8 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
9 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
10 11/30/2… http… twitt… neutral twitter en USA NORTH … North … United…
# … with 5,648 more rows, 27 more variables: `City Code` <chr>,
# `Account Type` <chr>, Author <chr>, City <chr>, `Expanded URLs` <chr>,
# `Full Text` <chr>, Gender <chr>, Hashtags <chr>, Impact <dbl>,
# Impressions <dbl>, `Location Name` <chr>, `Mentioned Authors` <chr>,
# `Twitter Followers` <dbl>, `Twitter Reply Count` <dbl>,
# `Twitter Reply to` <chr>, `Twitter Retweet of` <chr>,
# `Twitter Retweets` <dbl>, `Twitter Likes` <dbl>, `Twitter Tweets` <dbl>, …
Ok, awesome. I’ve gotten rid of 865 mentions that I didn’t need/want.
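The count of removed rows can be checked directly rather than read off the tibble headers. The following is a minimal sketch on a toy tibble; the `toy_mentions` data and its values are invented for illustration, and only the `PageType` column name comes from the real dataset.

```r
library(dplyr)

# Toy stand-in for the Brandwatch export; these rows are invented.
toy_mentions <- tibble(
  Author   = c("UofAlabama", "UofAlabama", "uhmanoa", "uhmanoa"),
  PageType = c("twitter", "forum", "twitter", "youtube")
)

# Keep only the Twitter rows, mirroring the subset() call above.
toy_twitter <- filter(toy_mentions, PageType == "twitter")

# How many rows were dropped, and which page types they came from.
rows_removed <- nrow(toy_mentions) - nrow(toy_twitter)
removed_by_type <- count(filter(toy_mentions, PageType != "twitter"), PageType)

rows_removed
removed_by_type
```

Run against the real data, the same subtraction should reproduce the number of non-Twitter mentions that were dropped.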
3. Narrow down to the columns I actually want to use in analysis (and clean up their names, as necessary)
Though I had already deleted a few columns, I quickly realized I would probably not be using 32 different data points in this analysis and would be well served to re-order my columns such that the ones I intend to use the most are first, not interspersed randomly.
I also wanted to give several columns clearer/more usable names.
#rename the columns I intend to use for analysis
FlagshipTwitter3 <- rename(FlagshipTwitter2,
  c(Tweet = 'Full Text',
    MentionedAuthors = 'Mentioned Authors',
    TWFollowers = 'Twitter Followers',
    TWReply = 'Twitter Reply Count',
    TWRetweets = 'Twitter Retweets',
    TWLikes = 'Twitter Likes',
    Reach = 'Reach (new)',
    EngType = 'Engagement Type',
    URL = Url))
#put columns I plan to use first
FlagshipTwitterUse <- select(FlagshipTwitter3, Author, Date, Impressions, Reach, TWLikes, TWRetweets, TWReply, EngType, Sentiment, Hashtags, MentionedAuthors, Tweet, TWFollowers, URL, everything())
FlagshipTwitterUse
# A tibble: 5,658 × 37
Author Date Impre…¹ Reach TWLikes TWRet…² TWReply EngType Senti…³ Hasht…⁴
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 UofAlaba… 11/3… 189187 24242 6 0 2 <NA> neutral <NA>
2 uhmanoa 11/3… 33885 9715 0 0 0 RETWEET neutral <NA>
3 CUBoulder 11/3… 90003 15452 0 0 0 RETWEET positi… #skobus
4 UNC 11/3… 143148 18997 0 0 0 RETWEET neutral #chtra…
5 unevadar… 11/3… 33378 10932 5 3 0 <NA> neutral #nvgra…
6 uhmanoa 11/3… 33887 9716 1 0 0 <NA> neutral #takem…
7 uarizona 11/3… 233177 24770 12 4 0 <NA> neutral <NA>
8 UArkansas 11/3… 72683 14013 0 0 0 RETWEET neutral <NA>
9 UArkansas 11/3… 72683 15413 1 0 1 REPLY neutral <NA>
10 UUtah 11/3… 126437 17988 0 0 0 RETWEET neutral <NA>
# … with 5,648 more rows, 27 more variables: MentionedAuthors <chr>,
# Tweet <chr>, TWFollowers <dbl>, URL <chr>, Domain <chr>, PageType <chr>,
# Language <chr>, `Country Code` <chr>, `Continent Code` <chr>,
# Continent <chr>, Country <chr>, `City Code` <chr>, `Account Type` <chr>,
# City <chr>, `Expanded URLs` <chr>, Gender <chr>, Impact <dbl>,
# `Location Name` <chr>, `Twitter Reply to` <chr>,
# `Twitter Retweet of` <chr>, `Twitter Tweets` <dbl>, …
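To make the rename-then-reorder pattern easier to see, here is a small sketch on a toy tibble. The data are invented; the column names are a subset of the real ones. It shows that `rename()` takes `new = old` pairs and that `everything()` keeps the remaining columns after the ones listed first, so no columns are lost.

```r
library(dplyr)

# Toy tibble with a few of the original column names; values are invented.
toy <- tibble(
  Url             = c("https://example.com/1", "https://example.com/2"),
  `Full Text`     = c("first post", "second post"),
  `Twitter Likes` = c(6, 0),
  Author          = c("UofAlabama", "uhmanoa")
)

# rename() uses new_name = old_name; backticks handle names with spaces.
toy_renamed <- rename(toy, Tweet = `Full Text`, TWLikes = `Twitter Likes`, URL = Url)

# List the priority columns first; everything() appends the rest unchanged.
toy_ordered <- select(toy_renamed, Author, TWLikes, Tweet, everything())

names(toy_ordered)
```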
4. Sort out my dates
The Date column currently holds both the date and the time that each tweet was posted. I want to separate this into a date column and a time column.
#separate dates into respective date and time column
TwitterUse2 <- separate(FlagshipTwitterUse, Date, into = c("Date", "Time"), sep = " ")
TwitterUse2$Date <- parse_date(TwitterUse2$Date, format = "%m/%d/%Y")
TwitterUse2$Time <- parse_time(TwitterUse2$Time, format = "%H:%M")
TwitterUse2
# A tibble: 5,658 × 38
Author Date Time Impre…¹ Reach TWLikes TWRet…² TWReply EngType Senti…³
<chr> <date> <tim> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 UofAl… 2022-11-30 23:55 189187 24242 6 0 2 <NA> neutral
2 uhman… 2022-11-30 23:50 33885 9715 0 0 0 RETWEET neutral
3 CUBou… 2022-11-30 23:47 90003 15452 0 0 0 RETWEET positi…
4 UNC 2022-11-30 23:30 143148 18997 0 0 0 RETWEET neutral
5 uneva… 2022-11-30 23:30 33378 10932 5 3 0 <NA> neutral
6 uhman… 2022-11-30 23:24 33887 9716 1 0 0 <NA> neutral
7 uariz… 2022-11-30 23:10 233177 24770 12 4 0 <NA> neutral
8 UArka… 2022-11-30 23:03 72683 14013 0 0 0 RETWEET neutral
9 UArka… 2022-11-30 23:02 72683 15413 1 0 1 REPLY neutral
10 UUtah 2022-11-30 23:00 126437 17988 0 0 0 RETWEET neutral
# … with 5,648 more rows, 28 more variables: Hashtags <chr>,
# MentionedAuthors <chr>, Tweet <chr>, TWFollowers <dbl>, URL <chr>,
# Domain <chr>, PageType <chr>, Language <chr>, `Country Code` <chr>,
# `Continent Code` <chr>, Continent <chr>, Country <chr>, `City Code` <chr>,
# `Account Type` <chr>, City <chr>, `Expanded URLs` <chr>, Gender <chr>,
# Impact <dbl>, `Location Name` <chr>, `Twitter Reply to` <chr>,
# `Twitter Retweet of` <chr>, `Twitter Tweets` <dbl>, …
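Because `parse_date()` and `parse_time()` turn values they cannot read into `NA` (with a warning), it can be worth counting missing values after parsing. This is a minimal sketch on invented strings. It assumes the raw values look like `"11/30/2022 23:55"`, which is inferred from the format strings used above rather than confirmed from the raw file.

```r
library(tidyr)
library(readr)

# Invented date-time strings in the assumed "month/day/year hour:minute" layout.
toy <- tibble::tibble(Date = c("11/30/2022 23:55", "11/01/2022 08:05"))

# Split on the space, then parse each piece with an explicit format.
toy_split <- separate(toy, Date, into = c("Date", "Time"), sep = " ")
toy_split$Date <- parse_date(toy_split$Date, format = "%m/%d/%Y")
toy_split$Time <- parse_time(toy_split$Time, format = "%H:%M")

# Values that failed to parse would show up as NA here.
sum(is.na(toy_split$Date))
sum(is.na(toy_split$Time))
```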
I suspect I will want to mutate the data and create new variables as I progress with my project, but I believe that this iteration of the dataset will provide the foundation I need to begin exploratory data analysis.
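As one possible example of such a mutation, the sketch below derives a Weekday column from the parsed date and labels posts with a missing engagement type as "OG". Treating a missing `EngType` as an original post is an assumption made here for illustration and has not been verified against the Brandwatch export; the toy data are invented.

```r
library(dplyr)

# Invented rows; only the column names mirror the cleaned dataset.
toy <- tibble(
  Date    = as.Date(c("2022-11-30", "2022-11-29")),
  EngType = c(NA, "RETWEET")
)

toy_new <- mutate(
  toy,
  # weekdays() returns day names in the session's locale.
  Weekday = weekdays(Date),
  # Assumption: a missing engagement type marks an original post.
  EngType = if_else(is.na(EngType), "OG", EngType)
)

toy_new
```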
Narrative About the Dataset
The dataset consists of the 5,658 posts that were made by the 50 US flagship colleges in November 2022. For each post, there are several associated variables that will be used for analysis. The 15 variables that are of particular interest for this project are:
- School Name: Which school authored each post.
- Twitter Followers: The number of Twitter followers the account had at the time of posting.
- F20 Enrollment: The enrollment at each school in the Fall of 2020.
- Size Setting: The size and setting designation for each school.
- Date: The date each post was authored.
- Time: The time each post was posted.
- Weekday: The day of the week each post was made.
- Engagement Type: A designation of whether the post was an original post (OG), a retweet of someone else’s post (RETWEET), a reply to another account’s post (REPLY), or a quote tweet (QUOTE), which is a retweet of another account’s post with added commentary.
- Impressions: The sum of the followers of a tweet’s author and the followers of any retweeting authors.
- Reach: An estimate of how many people have actually seen/read a given post.
- Twitter Likes: The number of times Twitter users “liked” a given post.
- Twitter Retweets: The number of times Twitter users retweeted a given post on their own Twitter.
- Twitter Replies: The number of times Twitter users left a comment on a given post.
- Sentiment: An AI-driven interpretation of the content of each tweet that subsequently labels the post as either Positive, Negative, or Neutral.
- Tweet: The content of the tweet authored.
Research Questions This Dataset Could Answer
There are three primary questions of interest that I think this dataset could help answer:
- Are there consistencies in how colleges are using Twitter?
- What makes some posts more successful than others?
- Are there takeaways on how colleges can most effectively use Twitter?
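One possible first step toward the second question is to compare average engagement across post types. The sketch below is only an illustration on invented data. Using mean likes as the measure of "success" is an assumption, and the real analysis may define success differently.

```r
library(dplyr)

# Invented posts; column names follow the cleaned dataset.
toy <- tibble(
  Author  = c("UofAlabama", "UofAlabama", "uhmanoa", "uhmanoa"),
  EngType = c("OG", "RETWEET", "OG", "REPLY"),
  TWLikes = c(6, 0, 1, 1)
)

# Number of posts and mean likes per engagement type, highest mean first.
likes_by_type <- toy %>%
  group_by(EngType) %>%
  summarise(posts = n(), mean_likes = mean(TWLikes), .groups = "drop") %>%
  arrange(desc(mean_likes))

likes_by_type
```

The same grouping could be swapped to `Author` or a weekday column to explore the other two questions.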