Final Project - Impact of Men’s March Madness Tournament on a School’s Online Conversation Volume

Final Project - Darron Bunt

Author

Darron Bunt

Published

May 25, 2023

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(ggplot2)
library(dplyr)
library(lubridate)
library(car)

Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Code

library(forcats)
library(stargazer)


Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Introduction

In April 2013, Doug J. Chung published a research paper¹ endeavoring to quantify and model the impact of the so-called “Flutie Effect” - the spillover impact that athletics has on the quantity and quality of applicants to US colleges. The effect is named after Boston College quarterback Doug Flutie, who in 1984 threw a Hail Mary touchdown pass with six seconds remaining to secure victory in a game against the University of Miami, qualifying the team to compete in the Cotton Bowl². Flutie’s on-field success has been credited with catalyzing a 30% increase in undergraduate applications at Boston College, though institutional officials have argued that other, non-athletic factors were the “true” reason behind the increase. This trend of increased applications after prominent athletic successes, however, has also been observed at other institutions, including Georgetown, Northwestern, Boise State, and Texas Christian University.

Chung was able to find a statistically significant relationship between athletic success and both the quantity (number of applications) and quality (SAT scores of those applicants) of applicants to a given institution; his findings included that when a school rises from being classified as a mediocre football program to a great one, applications rise by 18.7%³.

While my job does not focus directly on applications to US colleges, I do work with on-campus marketing and communications leaders. I generate insights from data on online conversation about their schools so that they can better understand that conversation and use it to develop, refine, and align their communications strategies with the goals of the institutions they serve⁴. A trend I frequently observe while analyzing this data is the impact that athletics has on the volume and reach of mentions related to schools.

In benchmarking work that we’ve undertaken to better understand online conversation trends in higher education, we’ve found that, on average, 63% of all online conversation related to schools is about their athletics - and for some schools, this proportion can be as high as 91%⁵. And while the share of online conversation devoted to athletics is already quite high, certain significant events within college sports send mention volume even higher. One such event is March Madness.

Run by the NCAA every year during the month of March, the men’s version of the March Madness tournament is widely regarded as one of the biggest American sporting events. 68 teams participate in the three-week-long single-elimination tournament, which produced $1.14 billion in annual revenue for the NCAA in 2022⁶. One of the most exciting parts of the tournament is the “Cinderella stories”: teams with a low ranking that manage to eliminate higher-ranked teams.

In 2022, I became interested in developing a greater understanding of the impact that March Madness could have on a school’s online conversation volume and began tracking that year’s Cinderella - the Saint Peter’s University Peacocks, a #15 seed that made it all the way to the Elite 8. In their Sweet 16 game against Purdue, Saint Peter’s had more mentions per minute than their average monthly volume for the prior five months. Their total mention volume during the month of March was 12,380 times more than that same monthly average⁷.

After completing that work, I became increasingly curious about how participation in the tournament affected conversation volume and reach for all teams, not just Cinderella stories. I wanted to investigate the factors that might influence more/less conversation volume for any given team, and indeed, whether conversation about the team was the only facet of conversation about a college that increased - or if the school as a whole received more conversation overall. I became very interested in learning more about the overall trends that relate to participation in events like March Madness and how we could use that information to provide strategic insights to colleges across the US.

Bringing it back to the Flutie Effect - if on-field athletic success can impact enrollment at US colleges, it stands to reason that online conversation about athletic success (or athletic events in general) could have a similar impact. Accordingly, it would be prudent to further explore the impact an event like the men’s March Madness tournament has on online conversation volume for US colleges.

Research Question

Specifically, the research question that I want to explore is:

How does online conversation volume differ when schools are participating in the men’s March Madness tournament?

To examine this research question in greater detail, I will also consider what variables contribute to differences in mention volume during the tournament. These variables can be broken down into three types: game outcome related, tournament related, and school related.

The answer to this question and a better understanding of the variables that contribute to differences in mention volume will provide more robust insight into the overall impact that events such as the men’s March Madness tournament have and the types of changes that can be expected based on the presence/absence of additional variables. This information can subsequently serve to inform strategies and tactics that can be implemented by marketing and communications professionals who focus on digital media in higher education. That is, schools can use this information to develop further hypotheses to be tested that would facilitate a goal of leveraging and maximizing the impact of athletic participation and success on their online conversation.

Hypothesis

My overarching hypothesis is that participation in the men’s March Madness tournament increases online conversation about the schools that are involved. I further hypothesize that the increase in conversation volume for each school will be further influenced by three types of additional mediating variables: game-outcome related variables, school-related variables, and tournament related variables.

Data Collection

The data used for this project was collected by me using the Brandwatch Consumer Research platform. I first wrote Boolean to search for mentions about all 68 schools with teams in the men’s March Madness tournament, using the same parameters to construct each one (specifically, the school’s full name, shortened name(s), acronym(s), mascot(s), website URL, athletics website URL, and Twitter usernames for the school’s flagship, flagship athletics, and men’s basketball team accounts).

The Boolean was then used to run a query within Brandwatch to pull all relevant, retrievable online mentions for the 68 schools made between January 1 and May 12, 2023. This search returned 9,007,552 mentions.

One of the features included in Brandwatch is the ability to segment data into categories, and this can be done in a wide range of ways (using Boolean or a variety of pre-built parameters such as content sources and mention types). I used this feature to break the mention volume data down by school (the 68 schools participating in men’s March Madness) and by daily volume (for the 132 days), and exported this information to .csv. This mention volume data provides the foundation for this analysis.

Code

MMVolumeNew <- read_csv("_data/MMVolumeByDayFullNEW.csv", show_col_types = FALSE)

Variables of Interest

I performed a variety of data manipulations in order to create my variables of interest.

Data Manipulation For Mention Volume (Dependent Variable)

I calculated the average daily volume for the following time spans:

during the tournament (March 14 to the day after each team’s elimination)
outside of the tournament (January 1 to March 13 and the day after each team’s elimination to May 12 - i.e. both before and after, but not during)
the week after the tournament for each team (the week after each team’s elimination)
two weeks after the tournament for each team (the second week after each team’s elimination)
before the tournament (January 1 to March 13)
after the tournament (two days after each team’s elimination to May 12)

I also added columns for the actual daily volume on the following days:

Day of elimination
Day after elimination
Two days after elimination
Three days after elimination
Four days after elimination
Five days after elimination
Six days after elimination
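
The time spans and days listed above are defined by calendar dates, whereas the exported volume table is addressed by column position. As a minimal sketch, assuming column 1 holds the school name and column 2 holds January 1, 2023 (an assumption about the export layout; the helper names `date_to_col` and `col_to_date` are hypothetical), dates and column indices can be converted back and forth like this:

```r
# Assumed layout: column 1 = School, column 2 = 2023-01-01, one column per day after that
first_day <- as.Date("2023-01-01")

# Calendar date -> column index in the daily-volume table (hypothetical helper)
date_to_col <- function(date) {
  as.integer(as.Date(date) - first_day) + 2L
}

# Column index -> calendar date (hypothetical helper)
col_to_date <- function(col) {
  first_day + (col - 2L)
}

# Number of days in the collection window, counting both endpoints
n_days <- as.integer(as.Date("2023-05-12") - first_day) + 1L

date_to_col("2023-03-14")  # column for the first day of the tournament
col_to_date(n_days + 1L)   # date held by the last daily column
```

Under this assumed layout, columns 2 through 73 cover January 1 to March 13, and the tournament's first day (March 14) sits in column 74.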

Code

# Read in data
TournamentVariables <- read_csv("_data/MMVariables.csv", show_col_types = FALSE)

# Join data
MMVolumeVar <- MMVolumeNew %>%
  left_join(TournamentVariables, by = "School")

# Calculate pre-tournament average daily volume 
MMVolumeVar$VolBefore <- (round(rowMeans(MMVolumeVar[, 2:73]), digits = 0))

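# Create a function to average the daily-volume columns that fall "during tournament" for each school
# (apply() passes each row as a character vector, so indices and values are converted with as.numeric())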
fun1 <- function(row) {
  start_col1 <- as.numeric(row[134])  # Convert the value to a numeric column index
  end_col1 <- as.numeric(row[135])    # Convert the value to a numeric column index
  values1 <- as.numeric(row[start_col1:end_col1]) # The rows to calculate the average
  average1 <- (round(mean(values1, na.rm = TRUE), digits = 0))
  return(average1)
}

# Return the average daily volume for "during tournament" for each school 
MMVolumeVar$VolDuring <- apply(MMVolumeVar, 1, fun1)

# Create a function to pull the rows needed to calculate "after tournament" for each of the schools
fun2 <- function(row) {
  start_col2 <- as.numeric(row[136])  # Convert the value to a numeric column index
  end_col2 <- as.numeric(row[137])    # Convert the value to a numeric column index
  values2 <- as.numeric(row[start_col2:end_col2]) # The rows to calculate the average
  average2 <- (round(mean(values2, na.rm = TRUE), digits = 0))
  return(average2)
}

# Return the average daily volume for "after tournament" for each school 
MMVolumeVar$VolAfter <- apply(MMVolumeVar, 1, fun2)

# Create a function to pull the rows needed for post-elimination volume 
fun3 <- function(row) {
  column_index3 <- as.numeric(row[136])  # Convert the value to a numeric column index
  value3 <- as.numeric(row[column_index3])  # Get the value from the specified column
  return(value3)
}

# Return the post-elimination day volume for each school 
MMVolumeVar$VolPostElim <- apply(MMVolumeVar, 1, fun3)

# Create a function to pull the rows needed for non-tournament volume
fun14 <- function(row) {
  start_col14 <- as.numeric(row[145])    # Start column index for range 1
  end_col14 <- as.numeric(row[146])      # End column index for range 1
  start_col14a <- as.numeric(row[147])    # Start column index for range 2
  end_col14a <- as.numeric(row[148])      # End column index for range 2
  
  values14 <- as.numeric(row[start_col14:end_col14])    # Values for range 1
  values14a <- as.numeric(row[start_col14a:end_col14a])    # Values for range 2
  
  values14b <- c(values14, values14a)    # Combine the values from both ranges
  
  average14 <- (round(mean(values14b, na.rm = TRUE), digits = 0))
  return(average14)
}

# Return the average non-tournament daily volume for each school 
MMVolumeVar$VolNonTourn <- apply(MMVolumeVar, 1, fun14) 

# Create a function to pull the rows needed for post-elimination +1 volume 
fun4 <- function(row) {
  column_index4 <- as.numeric(row[137])  # Convert the value to a numeric column index
  value4 <- as.numeric(row[column_index4])  # Get the value from the specified column
  return(value4)
}

# Return the post-elimination +1 volume for each school 
MMVolumeVar$VolPostElim1 <- apply(MMVolumeVar, 1, fun4)

# Create a function to pull the rows needed for post-elimination +2 volume 
fun5 <- function(row) {
  column_index5 <- as.numeric(row[138])  # Convert the value to a numeric column index
  value5 <- as.numeric(row[column_index5])  # Get the value from the specified column
  return(value5)
}

# Return the post-elimination +2 volume for each school 
MMVolumeVar$VolPostElim2 <- apply(MMVolumeVar, 1, fun5)

# Create a function to pull the rows needed for post-elimination +3 volume 
fun6 <- function(row) {
  column_index6 <- as.numeric(row[139])  # Convert the value to a numeric column index
  value6 <- as.numeric(row[column_index6])  # Get the value from the specified column
  return(value6)
}

# Return the post-elimination +3 volume for each school 
MMVolumeVar$VolPostElim3 <- apply(MMVolumeVar, 1, fun6)

# Create a function to pull the rows needed for post-elimination +4 volume 
fun17 <- function(row) {
  column_index17 <- as.numeric(row[140])  # Convert the value to a numeric column index
  value17 <- as.numeric(row[column_index17])  # Get the value from the specified column
  return(value17)
}

# Return the post-elimination +4 volume for each school 
MMVolumeVar$VolPostElim4 <- apply(MMVolumeVar, 1, fun17)

# Create a function to pull the rows needed for post-elimination +5 volume 
fun18 <- function(row) {
  column_index18 <- as.numeric(row[141])  # Convert the value to a numeric column index
  value18 <- as.numeric(row[column_index18])  # Get the value from the specified column
  return(value18)
}

# Return the post-elimination +5 volume for each school 
MMVolumeVar$VolPostElim5 <- apply(MMVolumeVar, 1, fun18)

# Create a function to pull the rows needed for post-elimination +6 volume 
fun19 <- function(row) {
  column_index19 <- as.numeric(row[142])  # Convert the value to a numeric column index
  value19 <- as.numeric(row[column_index19])  # Get the value from the specified column
  return(value19)
}

# Return the post-elimination +6 volume for each school 
MMVolumeVar$VolPostElim6 <- apply(MMVolumeVar, 1, fun19)

# Create a function to pull the rows needed for day after elimination volume 
fun20 <- function(row) {
  column_index20 <- as.numeric(row[135])  # Convert the value to a numeric column index
  value20 <- as.numeric(row[column_index20])  # Get the value from the specified column
  return(value20)
}

# Return the day after volume for each school 
MMVolumeVar$VolAfterElim <- apply(MMVolumeVar, 1, fun20)

# Create a function to pull the rows needed for day of elimination volume 
fun21 <- function(row) {
  column_index21 <- as.numeric(row[149])  # Convert the value to a numeric column index
  value21 <- as.numeric(row[column_index21])  # Get the value from the specified column
  return(value21)
}

# Return the day of elimination volume for each school 
MMVolumeVar$VolElim_Day <- apply(MMVolumeVar, 1, fun21)

# Create a function to pull the rows needed to calculate "week after tournament" for each of the schools
fun15 <- function(row) {
  start_col15 <- as.numeric(row[136])  # Convert the value to a numeric column index
  end_col15 <- as.numeric(row[142])    # Convert the value to a numeric column index
  values15 <- as.numeric(row[start_col15:end_col15]) # The rows to calculate the average
  average15 <- (round(mean(values15, na.rm = TRUE), digits = 0))
  return(average15)
}

# Return the average daily volume for "week after tournament" for each school 
MMVolumeVar$VolWeekAfter <- apply(MMVolumeVar, 1, fun15)

# Create a function to pull the rows needed to calculate "two weeks after tournament" for each of the schools
fun16 <- function(row) {
  start_col16 <- as.numeric(row[143])  # Convert the value to a numeric column index
  end_col16 <- as.numeric(row[144])    # Convert the value to a numeric column index
  values16 <- as.numeric(row[start_col16:end_col16]) # The rows to calculate the average
  average16 <- (round(mean(values16, na.rm = TRUE), digits = 0))
  return(average16)
}

# Return the average daily volume for "two weeks after tournament" for each school 
MMVolumeVar$Vol2WeeksAfter <- apply(MMVolumeVar, 1, fun16)

# Reference date: column index 72 corresponds to March 12, 2023 (column 74 is March 14)
TournEndDate <- as.Date("2023-03-12")

# Convert each school's During_End column index to a calendar date
MMVolumeVar$EndDate <- TournEndDate + (MMVolumeVar$During_End - 72)

# Create a data frame with the columns for the calculated means
MMVolAvgs <- MMVolumeVar %>%
  select(School, VolDuring, VolNonTourn, VolWeekAfter, Vol2WeeksAfter, VolElim_Day, VolAfterElim, VolPostElim,  VolPostElim1, VolPostElim2, VolPostElim3, VolPostElim4, VolPostElim5, VolPostElim6, VolBefore, VolAfter)
MMVolAvgs

# A tibble: 68 × 16
   School          VolDuring VolNonTourn VolWeekAfter Vol2WeeksAfter VolElim_Day
   <chr>               <dbl>       <dbl>        <dbl>          <dbl>       <dbl>
 1 University of …       796         260          151            132        1481
 2 University of …      2903        1320         1037           1113        6441
 3 Arizona State …      2476        1352         1121           1199        2458
 4 Duke University      4567        2162         1242           1879        8261
 5 Utah State Uni…       734         297          203            236        1796
 6 University of …      3762        1453          824            983        6693
 7 North Carolina…      1013         861          993            561        2255
 8 University of …      5907        1487         1722           1239       11182
 9 Providence Col…       721         363          997            489        1829
10 Xavier Univers…      1238         354          228            170        2975
# ℹ 58 more rows
# ℹ 10 more variables: VolAfterElim <dbl>, VolPostElim <dbl>,
#   VolPostElim1 <dbl>, VolPostElim2 <dbl>, VolPostElim3 <dbl>,
#   VolPostElim4 <dbl>, VolPostElim5 <dbl>, VolPostElim6 <dbl>,
#   VolBefore <dbl>, VolAfter <dbl>
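
The single-day and window functions above all follow one of two patterns: read the value in a column whose position is stored in another column, or average a range of columns whose endpoints are stored per row. A minimal sketch of two reusable helpers in base R, using toy data and hypothetical names (`value_at`, `mean_between`, `ElimCol`, `StartCol`, and `EndCol` are not from the project data):

```r
# Toy data: column 1 is the school, columns 2-6 are daily volumes,
# columns 7-9 store per-school column indices (all values are made up)
toy <- data.frame(
  School   = c("A", "B"),
  d1 = c(10, 20), d2 = c(11, 21), d3 = c(12, 22), d4 = c(13, 23), d5 = c(14, 24),
  ElimCol  = c(3, 5),
  StartCol = c(2, 4),
  EndCol   = c(4, 6)
)

# Read one value per row from the column whose index is stored in idx_col
value_at <- function(df, idx_col) {
  idx <- as.integer(df[[idx_col]])
  vapply(seq_len(nrow(df)), function(i) as.numeric(df[[idx[i]]][i]), numeric(1))
}

# Average, per row, the columns between the indices stored in start_col and end_col
mean_between <- function(df, start_col, end_col) {
  s <- as.integer(df[[start_col]])
  e <- as.integer(df[[end_col]])
  vapply(seq_len(nrow(df)), function(i) {
    round(mean(unlist(df[i, s[i]:e[i]]), na.rm = TRUE), digits = 0)
  }, numeric(1))
}

value_at(toy, "ElimCol")                  # single-day lookup
mean_between(toy, "StartCol", "EndCol")   # window average
```

Because these helpers index the data frame directly rather than going through `apply()`, the volume columns stay numeric and no conversion from character is needed.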

Data Manipulation for Independent Variables

I also manipulated the data to create my tournament variables of interest.

Code

# Read in game scores, dates, etc.
DateAndTeam <- read_csv("_data/MBBMMGameData.csv", show_col_types = FALSE)

# Parse game dates, then convert 12-hour start times (with an " ET" suffix) to 24-hour "HH:MM:SS" strings
DateAndTeam$GameDate <- as.Date(DateAndTeam$Date, format = "%m/%d/%Y")
DateAndTeam$Time <- gsub(" ET", "", DateAndTeam$Time)
DateAndTeam$Time <- strptime(DateAndTeam$Time, format="%I:%M%p")
DateAndTeam$Time <- format(DateAndTeam$Time, format = "%H:%M:%S")

# Use GameDate to create a TournamentRound variable
DateRoundTeam <- DateAndTeam %>%
  mutate(TournamentRound = case_when(
      GameDate >= "2023-03-14" & GameDate <= "2023-03-15" ~ "First 4", 
      GameDate >= "2023-03-16" & GameDate <= "2023-03-17" ~ "Round of 64",
      GameDate >= "2023-03-18" & GameDate <= "2023-03-19" ~ "Round of 32",
      GameDate >= "2023-03-23" & GameDate <= "2023-03-24" ~ "Sweet 16",
      GameDate >= "2023-03-25" & GameDate <= "2023-03-26" ~ "Elite 8",
      GameDate == "2023-04-01" ~ "Final 4",
      GameDate == "2023-04-03" ~ "Championship")) %>%

# Replace game start times with time-of-day categories
# (zero-padded "HH:MM:SS" strings compare correctly as text; each upper bound is exclusive)
  mutate(Time = case_when(
    Time >= "12:00:00" & Time < "14:15:00" ~ "Early Afternoon",
    Time >= "14:15:00" & Time < "16:30:00" ~ "Mid Afternoon",
    Time >= "16:30:00" & Time < "18:45:00" ~ "Late Afternoon",
    Time >= "18:45:00" & Time < "21:00:00" ~ "Early Evening",
    Time >= "21:00:00" & Time < "22:45:00" ~ "Late Evening")) %>%

# Map each game date to the column index of MMVolumeVar that holds that day's volume
  mutate(GDayVolCol = case_when(
      GameDate == "2023-03-14" ~ 74, 
      GameDate == "2023-03-15" ~ 75, 
      GameDate == "2023-03-16" ~ 76,
      GameDate == "2023-03-17" ~ 77,
      GameDate == "2023-03-18" ~ 78,
      GameDate == "2023-03-19" ~ 79,
      GameDate == "2023-03-23" ~ 83,
      GameDate == "2023-03-24" ~ 84,
      GameDate == "2023-03-25" ~ 85,
      GameDate == "2023-03-26" ~ 86,
      GameDate == "2023-04-01" ~ 92,
      GameDate == "2023-04-03" ~ 94
            )) %>%
  rename(School = Team) %>%
  select(School, Seed, GameDate, TournamentRound, Time, GDayVolCol, IsWinner, WinningSeed, LosingSeed)

# Create a function to bring in Game Day Volume 
fun13 <- function(row) {
  school <- row["School"]  # Get the school name from the row
  column_index <- as.numeric(row["GDayVolCol"])  # Convert the value to a numeric column index
  value <- MMVolumeVar[MMVolumeVar$School == school, column_index]
  return(as.numeric(value))
}

# Return the flattened game day volume for each school
DateRoundTeam$GameDayVolume <- unlist(apply(DateRoundTeam, 1, fun13))

# Create data frame with School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed
TVariables <- DateRoundTeam %>%
  left_join(TournamentVariables, by = "School") %>%
  select(School, Seed, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed)

# Mutate WinningSeed and LosingSeed to create SeedDifference column
# (each team's own seed minus its opponent's seed; a lower seed number is the stronger team)
TVariables <- TVariables %>%
  mutate(SeedDifference = ifelse(IsWinner == "Yes", WinningSeed - LosingSeed, LosingSeed - WinningSeed)) %>%
# Mutate SeedDifference and IsWinner to create UpsetWin and UpsetLoss (seed gap of 5 or more)
  mutate(UpsetWin = ifelse(IsWinner == "Yes" & SeedDifference >= 5, "Yes", "No")) %>%
  mutate(UpsetLoss = ifelse(IsWinner == "No" & SeedDifference <= -5, "Yes", "No")) %>%
# FavoriteWin: the better-seeded team won; FavoriteLoss: the better-seeded team lost
  mutate(FavoriteWin = ifelse(IsWinner == "Yes" & WinningSeed < LosingSeed, "Yes", "No")) %>%
  mutate(FavoriteLoss = ifelse(IsWinner == "No" & WinningSeed > LosingSeed, "Yes", "No")) %>%
# UnderdogWin: the worse-seeded team won; UnderdogLoss: the worse-seeded team lost
  mutate(UnderdogWin = ifelse(IsWinner == "Yes" & WinningSeed > LosingSeed, "Yes", "No")) %>%
  mutate(UnderdogLoss = ifelse(IsWinner == "No" & WinningSeed < LosingSeed, "Yes", "No"))
TVariables

# A tibble: 134 × 16
   School           Seed GameDate   Time  TournamentRound GameDayVolume IsWinner
   <chr>           <dbl> <date>     <chr> <chr>                   <dbl> <chr>   
 1 Pittsburgh Uni…    11 2023-03-14 Late… First 4                  4731 Yes     
 2 Texas A&M Corp…    16 2023-03-14 Late… First 4                   521 Yes     
 3 Arizona State …    11 2023-03-15 Late… First 4                  3967 Yes     
 4 Fairleigh Dick…    16 2023-03-15 Late… First 4                  1669 Yes     
 5 Princeton Univ…    15 2023-03-16 Mid … Round of 64              9532 Yes     
 6 Furman Univers…    13 2023-03-16 Earl… Round of 64              8721 Yes     
 7 Penn State Uni…    10 2023-03-16 Late… Round of 64              3692 Yes     
 8 Auburn Univers…     9 2023-03-16 Earl… Round of 64              4477 Yes     
 9 University of …     8 2023-03-16 Late… Round of 64              5918 Yes     
10 University of …     8 2023-03-16 Earl… Round of 64             10336 Yes     
# ℹ 124 more rows
# ℹ 9 more variables: WinningSeed <dbl>, LosingSeed <dbl>,
#   SeedDifference <dbl>, UpsetWin <chr>, UpsetLoss <chr>, FavoriteWin <chr>,
#   FavoriteLoss <chr>, UnderdogWin <chr>, UnderdogLoss <chr>
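
As a worked illustration of the seed-based upset flags, consider a hypothetical game in which a #15 seed beats a #2 seed (toy values and school names, not rows from the data). A minimal sketch of the `SeedDifference`, `UpsetWin`, and `UpsetLoss` logic in base R:

```r
# Two rows for one hypothetical game: the winner's row and the loser's row
game <- data.frame(
  School      = c("Winner U", "Loser U"),
  IsWinner    = c("Yes", "No"),
  WinningSeed = c(15, 15),
  LosingSeed  = c(2, 2)
)

# Each team's own seed minus its opponent's seed
game$SeedDifference <- ifelse(game$IsWinner == "Yes",
                              game$WinningSeed - game$LosingSeed,
                              game$LosingSeed - game$WinningSeed)

# An upset is a result where the winner's seed number is at least 5 higher (worse)
game$UpsetWin  <- ifelse(game$IsWinner == "Yes" & game$SeedDifference >= 5, "Yes", "No")
game$UpsetLoss <- ifelse(game$IsWinner == "No" & game$SeedDifference <= -5, "Yes", "No")

game[, c("School", "SeedDifference", "UpsetWin", "UpsetLoss")]
```

Because a lower seed number denotes a stronger team, a positive `SeedDifference` on the winner's row means the weaker-seeded team won.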

Does Participation Impact Average Mention Volume?

In order to answer the question of what impact participation in the men’s March Madness tournament has on the volume of online conversation about the schools involved, it is important to first examine if March Madness does indeed have an impact on conversation volume.

Code

# Descriptive statistics for volume during the tournament and not during
MMVolAvgs %>%
  select(VolNonTourn, VolDuring) %>%
summary()

  VolNonTourn     VolDuring     
 Min.   :  51   Min.   : 322.0  
 1st Qu.: 269   1st Qu.: 731.5  
 Median : 682   Median :1802.0  
 Mean   : 908   Mean   :2389.1  
 3rd Qu.:1312   3rd Qu.:3576.0  
 Max.   :5269   Max.   :7504.0

This seems to indicate a pretty clear difference between average volume during the tournament and average volume outside of the tournament: the mean daily volume rises from 908 mentions to 2,389, and the interquartile range shifts from 269-1,312 to 731.5-3,576.

We can also display this difference visually:

Code

#Create bar chart showing difference between average mention volume during/not during March Madness

MMVolAvgs_Long <- pivot_longer(MMVolAvgs, cols = c(VolDuring, VolNonTourn), names_to = "Variable", values_to = "Volume")
MMVolAvgs_Long$School <- factor(MMVolAvgs_Long$School, levels = unique(MMVolAvgs_Long$School)) 
ggplot(MMVolAvgs_Long, 
       aes(x = School, y=Volume, fill = Variable)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(x = "School", y = "Average Mention Volume", fill = "Time Frame") +
 scale_fill_manual(values = c("VolDuring" = "blue", "VolNonTourn" = "red"),
                    labels = c("March Madness", "Not March Madness")) +
  ggtitle("Comparison of Mention Volume During and Outside March Madness") +
  theme(axis.text.x = element_blank())

It can also be helpful to visualize this difference in mention volume by highlighting just how large, proportionally, that difference was. Accordingly, I transformed these volume metrics into a percentage difference in in-tournament volume compared to outside-tournament volume.

We can then provide summary statistics on this proportional difference in volume during the tournament compared to not.

Code

# Calculate difference between average volume during tournament and non-tournament
MMVolDiffs <- MMVolAvgs %>%
  mutate(
    InTournamentDifference = round((((VolDuring / VolNonTourn) * 100) -100), digits = 2),
    WeekAfterDifference = round((((VolWeekAfter / VolNonTourn) * 100) -100), digits = 2),
    V2WeeksAfterDifference = round((((Vol2WeeksAfter / VolNonTourn) * 100) -100), digits = 2),
    ElimDayDifference = round(((VolElim_Day / VolNonTourn) * 100), digits = 2),
    DayAfterElimDifference = round(((VolAfterElim / VolNonTourn) * 100), digits = 2),
    DayPostElimDifference = round(((VolPostElim / VolNonTourn) * 100), digits = 2),
    ElimPlus1Difference = round(((VolPostElim1 / VolNonTourn) * 100), digits = 2),
    ElimPlus2Difference = round(((VolPostElim2 / VolNonTourn) * 100), digits = 2), 
    ElimPlus3Difference = round(((VolPostElim3 / VolNonTourn) * 100), digits = 2),
    ElimPlus4Difference = round(((VolPostElim4 / VolNonTourn) * 100), digits = 2),
    ElimPlus5Difference = round(((VolPostElim5 / VolNonTourn) * 100), digits = 2),
    ElimPlus6Difference = round(((VolPostElim6 / VolNonTourn) * 100), digits = 2),
  ) %>%
  
  select(School, InTournamentDifference, WeekAfterDifference, V2WeeksAfterDifference, ElimDayDifference, DayPostElimDifference, ElimPlus1Difference, ElimPlus2Difference, ElimPlus3Difference) %>%
  arrange(desc(InTournamentDifference))

# Descriptive statistics for the proportional difference in volume during the tournament and not 
summary(MMVolDiffs$InTournamentDifference)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -6.32   94.75  162.41  278.95  278.06 3800.00

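Among the 68 schools, the mean proportional difference in average mention volume was 279%. The mean is pulled upward by a few very large increases (the maximum is 3,800%), but the median increase of 162% tells the same story - again, this seems to indicate that participation in March Madness increases mention volume.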
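
The percentage difference is each school's in-tournament average expressed as a percentage of its non-tournament average, minus 100. A minimal sketch using rows 1 and 4 of the averages table printed earlier (the helper name `pct_diff` is hypothetical):

```r
# Percentage change of `during` relative to `baseline`, rounded to 2 decimals
pct_diff <- function(during, baseline) {
  round((during / baseline) * 100 - 100, digits = 2)
}

# VolDuring and VolNonTourn for rows 1 and 4 of MMVolAvgs as printed above
pct_diff(c(796, 4567), c(260, 2162))
```

On this scale a value of 100 means volume doubled, so these two schools saw increases of roughly 206% and 111%.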

Overall, mention volume increased by more than 100% for 50 of the 68 schools in the men’s March Madness tournament. We can visualize the magnitude of the increase by creating groupings of schools according to how much their average volume changed during the tournament.

Code

# Bin InTournamentDifference into percentage-change categories
# (case_when() uses the first matching condition, so the ranges leave no gaps between bins)
MMVolDiffCat <- MMVolDiffs %>%
  mutate(InTournamentDifference = case_when(
      InTournamentDifference < 0 ~ "Decreased Volume",
      InTournamentDifference <= 100 ~ "Up to 100% Increase",
      InTournamentDifference <= 200 ~ "100%-200% Increase",
      InTournamentDifference <= 300 ~ "201%-300% Increase",
      InTournamentDifference <= 600 ~ "301%-600% Increase",
      InTournamentDifference <= 800 ~ "601%-800% Increase",
      InTournamentDifference > 800 ~ "1000%+ Increase"))

MMVolDiffCatBD <- MMVolDiffCat %>%
        group_by(InTournamentDifference) %>%
        tally() %>%
  arrange(factor(InTournamentDifference, levels = c("Decreased Volume", "Up to 100% Increase", "100%-200% Increase", "201%-300% Increase", "301%-600% Increase", "601%-800% Increase", "1000%+ Increase")))

MMVolDiffCatBD$InTournamentDifference <- factor(MMVolDiffCatBD$InTournamentDifference, levels = c("Decreased Volume", "Up to 100% Increase", "100%-200% Increase", "201%-300% Increase", "301%-600% Increase", "601%-800% Increase", "1000%+ Increase"))
ggplot(MMVolDiffCatBD, 
       aes(x = InTournamentDifference, y=n)) +
  geom_bar(stat = "identity", fill = "purple") +
  geom_text(aes(label = n), vjust = -0.1) +
  labs(x = "", y = "Number of Schools") +
  ggtitle("Number of Schools w/Each Grouping of % Increase/Decrease") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Out of the 68 schools that participated in March Madness, only two experienced a decrease in their average mention volume - the University of Iowa (-6.08%) and Michigan State (-6.32%). These outliers have both experienced significant non-athletic events during 2023 that have also served to increase their mention volume outside of March Madness (a screening of the contentious film “What is a Woman” at the University of Iowa and a school shooting on campus at Michigan State).

37 schools experienced an increase in their mention volume of 100-300%, while 11 saw a 301-800% increase. Two schools - Furman University and Fairleigh Dickinson University - experienced volume increases of over 1,000% (1,121% and 3,800%, respectively).
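
The grouping of schools can also be sketched with base R's `cut()`, which assigns every value to exactly one interval and so leaves no gaps between categories. The break points mirror the groupings above; the label for the top group (">800% Increase") is an illustrative assumption, and the input values are the summary statistics and extreme increases quoted above rather than the full data.

```r
# Percentage differences quoted above (minimum, quartiles, and the two largest increases)
pct <- c(-6.32, 94.75, 162.41, 278.06, 1121, 3800)

# Intervals are closed on the right: (-Inf, 0], (0, 100], (100, 200], ...
# so a value of exactly 0 would land in the first group
groups <- cut(
  pct,
  breaks = c(-Inf, 0, 100, 200, 300, 600, 800, Inf),
  labels = c("Decreased Volume", "Up to 100% Increase", "100%-200% Increase",
             "201%-300% Increase", "301%-600% Increase", "601%-800% Increase",
             ">800% Increase")
)

# Count values per group, keeping empty groups in order
table(groups)
```

Since `cut()` returns an ordered set of factor levels, the resulting counts are already in plotting order.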

Testing the Hypothesis

My original hypothesis was that participation in the men’s March Madness tournament increases online conversation about the schools that are involved.

To test this hypothesis, I will perform a paired t-test on the average volume during the tournament and outside of the tournament for all of the teams involved.

Null hypothesis: Participation in the men’s March Madness tournament does not increase average mention volume for the schools that are involved.
Alternative hypothesis: Participation in the men’s March Madness tournament does increase average mention volume for the schools that are involved.

Code

t.test(MMVolAvgs$VolDuring, MMVolAvgs$VolNonTourn, paired = TRUE)


    Paired t-test

data:  MMVolAvgs$VolDuring and MMVolAvgs$VolNonTourn
t = 8.1765, df = 67, p-value = 1.157e-11
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1119.565 1842.700
sample estimates:
mean difference 
       1481.132

The observed mean difference between the paired observations was 1,481 mentions, with a 95% confidence interval of 1,120 to 1,843. We can be 95% confident that the true mean difference lies within this range.

The magnitude of the t-value (8.18) indicates there is strong evidence against the null hypothesis. Based on the p-value (1.157e-11) we reject the null hypothesis and support the alternative hypothesis that participation in the men’s March Madness tournament does increase average mention volume for the schools that are involved.
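
A paired t-test is equivalent to a one-sample t-test on the per-school differences, and because the hypothesis is directional (volume increases), a one-sided alternative could also be used. A minimal sketch using only the ten schools printed in the averages table above, so the numbers will not match the full 68-school result:

```r
# VolDuring and VolNonTourn for the first ten schools printed above
during   <- c(796, 2903, 2476, 4567, 734, 3762, 1013, 5907, 721, 1238)
baseline <- c(260, 1320, 1352, 2162, 297, 1453, 861, 1487, 363, 354)

# Two-sided paired test, as in the analysis
two_sided <- t.test(during, baseline, paired = TRUE)

# Same test expressed as a one-sample test on the differences
on_diffs <- t.test(during - baseline, mu = 0)

# One-sided version matching the directional hypothesis
one_sided <- t.test(during, baseline, paired = TRUE, alternative = "greater")

c(two_sided = two_sided$p.value, one_sided = one_sided$p.value)
```

When the observed difference is in the hypothesized direction, the one-sided p-value is half the two-sided one, so the two-sided test used in the analysis is the more conservative choice.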

How long does March Madness contribute to changes in online conversation volume?

I was immediately curious how long changes in conversation volume due to March Madness would be considered statistically significant, so I performed t-tests for the average volume one week and two weeks after each team’s elimination.

Code

# Compare non-tournament volume to week after and two weeks after tournament ended for each team
t.test(MMVolAvgs$Vol2WeeksAfter, MMVolAvgs$VolNonTourn, paired = TRUE)


    Paired t-test

data:  MMVolAvgs$Vol2WeeksAfter and MMVolAvgs$VolNonTourn
t = -2.6328, df = 67, p-value = 0.0105
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -302.18943  -41.57528
sample estimates:
mean difference 
      -171.8824

Code

t.test(MMVolAvgs$VolWeekAfter, MMVolAvgs$VolNonTourn, paired = TRUE)


    Paired t-test

data:  MMVolAvgs$VolWeekAfter and MMVolAvgs$VolNonTourn
t = -1.2093, df = 67, p-value = 0.2308
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -225.99467   55.46526
sample estimates:
mean difference 
      -85.26471

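The two results differ. One week after elimination, the p-value (0.2308) is above 0.05, so we fail to reject the null hypothesis. Two weeks after elimination, the p-value (0.0105) is below 0.05, but the mean difference is negative (-172 mentions): average volume was lower than the non-tournament baseline, not higher. In neither case is there evidence that participation in the men’s March Madness tournament increases mention volume for the schools that are involved at one or two weeks post tournament.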

I was then curious whether changes in conversation volume due to March Madness would be considered statistically significant on the day of each team’s elimination and in the days immediately following it, so I performed t-tests comparing those days’ volume numbers to each school’s average non-tournament volume.

Code

t.test(MMVolAvgs$VolElim_Day, MMVolAvgs$VolNonTourn, paired = TRUE)


    Paired t-test

data:  MMVolAvgs$VolElim_Day and MMVolAvgs$VolNonTourn
t = 6.7004, df = 67, p-value = 5.187e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 2723.919 5035.346
sample estimates:
mean difference 
       3879.632

Code

t.test(MMVolAvgs$VolAfterElim, MMVolAvgs$VolNonTourn, paired = TRUE)


    Paired t-test

data:  MMVolAvgs$VolAfterElim and MMVolAvgs$VolNonTourn
t = 2.3848, df = 67, p-value = 0.01993
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
  230.9917 2602.8318
sample estimates:
mean difference 
       1416.912

Code

t.test(MMVolAvgs$VolPostElim, MMVolAvgs$VolNonTourn, paired = TRUE)


    Paired t-test

data:  MMVolAvgs$VolPostElim and MMVolAvgs$VolNonTourn
t = -0.048907, df = 67, p-value = 0.9611
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -283.4630  269.9042
sample estimates:
mean difference 
      -6.779412

Code

t.test(MMVolAvgs$VolPostElim1, MMVolAvgs$VolNonTourn, paired = TRUE)


    Paired t-test

data:  MMVolAvgs$VolPostElim1 and MMVolAvgs$VolNonTourn
t = -1.4567, df = 67, p-value = 0.1499
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -334.72775   52.28657
sample estimates:
mean difference 
      -141.2206

Code

# Create a data frame with results
VolTTestResults <- data.frame(
  "Time Frame" = c("Elimination Day", "Day After Elimination", "2 Days After Elimination", "3 Days After Elimination"),
  t_value = c(6.7004, 2.3848, -0.048907, -1.4567),
  p_value = c(5.187e-09, 0.01993, 0.9611, 0.1499),
  Mean_Difference = c(3880, 1417,  -6.8, -141))
VolTTestResults$p_value <- format(VolTTestResults$p_value, scientific = FALSE)
VolTTestResults

                Time.Frame   t_value        p_value Mean_Difference
1          Elimination Day  6.700400 0.000000005187          3880.0
2    Day After Elimination  2.384800 0.019930000000          1417.0
3 2 Days After Elimination -0.048907 0.961100000000            -6.8
4 3 Days After Elimination -1.456700 0.149900000000          -141.0

Based on the p-values for Elimination Day and Day After Elimination, we can reject the null hypothesis - participation in the men’s March Madness tournament does increase mention volume on those two days.

For two days and three days post-elimination, however, the p-values (0.96 and 0.15) are higher than 0.05. For these days we fail to reject the null hypothesis - participation in the men’s March Madness tournament does not increase mention volume for the schools that are involved two and three days post-elimination.

The observed mean differences for the day of elimination and the day after elimination were 3,880 and 1,417 mentions, with 95% confidence intervals of 2,724 to 5,035 and 231 to 2,603 respectively. We can be 95% confident that the true mean difference for each day lies within its interval.

To summarize: Participation in the men’s March Madness tournament does increase mention volume for schools with teams participating and does so for the days between the start of the tournament and the day after each team is eliminated.
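
The results table for the elimination-day tests was built by typing the statistics in by hand. A sketch of one way to collect the same fields directly from the test objects follows. It runs on a simulated stand-in for MMVolAvgs, so the numbers it prints are not the project’s results, and the compare_cols and results names are assumptions.

```r
set.seed(1)

# Hypothetical stand-in for MMVolAvgs: a baseline and two comparison columns
n <- 68
avgs <- data.frame(VolNonTourn = rlnorm(n, 6.5, 1))
avgs$VolElim_Day <- avgs$VolNonTourn + rnorm(n, 3900, 4800)
avgs$VolPostElim <- avgs$VolNonTourn + rnorm(n, 0, 1100)

compare_cols <- c("VolElim_Day", "VolPostElim")

# Run each paired test against the baseline and keep the fields of interest
results <- do.call(rbind, lapply(compare_cols, function(col) {
  tt <- t.test(avgs[[col]], avgs$VolNonTourn, paired = TRUE)
  data.frame(
    Column          = col,
    t_value         = unname(tt$statistic),
    p_value         = tt$p.value,
    Mean_Difference = unname(tt$estimate),
    CI_Lower        = tt$conf.int[1],
    CI_Upper        = tt$conf.int[2]
  )
}))

results
```

Pulling the values from the htest objects avoids transcription slips and keeps the confidence intervals alongside the p-values.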

What variables contribute to changes in mention volume during the tournament?

Having confirmed that participation in the men’s March Madness tournament does indeed increase mention volume for schools with teams participating, the next step is to determine what variables contribute to these changes in mention volume.

For the daily mention outcome variable, I created a data frame with the daily mention volume for each of the 68 schools in March Madness for the two days prior to the tournament beginning, all days in the tournament, and the day after the tournament, for a total time frame of March 12 to April 4, 2023. I then joined the predictor variables for each school (game-day variables and enrollment data) to its daily mention volume.

Data manipulation

Code

# Read in data for enrollment 
SchoolNameToSchool <- read_csv("_data/SchoolNameToSchool.csv", show_col_types = FALSE) 
EnrollmentData <- read_csv("_data/CCIHE2021PublicData.csv", show_col_types = FALSE)
SchoolEnrollmentData <- SchoolNameToSchool %>%
  left_join(EnrollmentData, by = "SchoolName")

VolumeByDay <- MMVolumeNew %>%
  pivot_longer(cols = -School, names_to = "Date", values_to = "Mention_Volume") %>%
  arrange(Date)

# Convert the Date column to the "YYYY-MM-DD" format if needed
VolumeByDay$Date <- as.Date(VolumeByDay$Date, format = "%m/%d/%Y")

# Join volume by school and enrollment data for each school
VolumeAndEnrollment <- VolumeByDay %>%
  left_join(SchoolEnrollmentData, by = c("School"))

# Narrow down to all days between March 12 and April 4
MMVolumeByDay <- VolumeByDay %>%
  filter(Date >= as.Date("2023-03-12") & Date <= as.Date("2023-04-04"))

# Remove unneeded variables from TVariables
TVariables <- TVariables %>%
  rename("Date" = "GameDate") %>%
  select(-GameDayVolume, -WinningSeed, -LosingSeed)

# Join volume by school and game-day variables for each school
TVariables$Date <- as.Date(TVariables$Date)
MMVolumeByDay$Date <- as.Date(MMVolumeByDay$Date)
VolumeAndVariables <- MMVolumeByDay %>%
  left_join(TVariables, by = c("School", "Date")) %>%
  left_join(SchoolEnrollmentData, by = c("School"))

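# Caution: ifelse() recycles TVariables$Seed by position rather than looking it up by school,
# so Seed may not line up with School; TVariables$Seed[match(School, TVariables$School)] would align them
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(Seed = ifelse(School %in% TVariables$School, TVariables$Seed, NA))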

# Read in data
AddMajor <- read_csv("_data/DateRoundTeamSeedMajor.csv", show_col_types = FALSE) %>%
  select(School, Major)
# Find matching indices
matching_indices <- match(VolumeAndVariables$School, AddMajor$School)
VolumeAndVariables$Major <- ifelse(!is.na(matching_indices), AddMajor$Major[matching_indices], NA)

# Add EndDate column to VolumeAndVariables
MMVolVarJustEndDate <- MMVolumeVar %>%
  select(School, EndDate)
VolumeAndVariables <- VolumeAndVariables %>%
  left_join(MMVolVarJustEndDate, by = "School")

# Create column for DayAfterGame
DateRoundTeam2 <- DateRoundTeam %>%
  rename(Date = GameDate) %>%
  mutate(DayAfterGame = as.Date(Date + 1)) %>%
  select(School, Date, DayAfterGame)

# Join to dataset
VolumeAndVariables <- VolumeAndVariables %>%
  left_join(DateRoundTeam2, by = c("School", "Date")) %>%
  mutate(
    IsMarchMadness = ifelse(Date >= as.Date("2023-03-12") & Date <= EndDate, 1, 0),
    IsGameDay = ifelse(is.na(Time), 0, 1)) %>%
  mutate(
    NextRow = if_else(!is.na(DayAfterGame), row_number() + 1, NA_integer_),
    IsDayAfterGame = if_else(row_number() %in% NextRow & !is.na(NextRow), 1, 0)) %>%
  mutate(
    IsDayAfterGame = replace(IsDayAfterGame, NextRow, 1),
    GDayOrAfter = ifelse(IsGameDay == 1 | IsDayAfterGame == 1, 1, 0)) %>%
  ungroup() %>%
  mutate(
    IsWinner = ifelse(is.na(IsWinner), "No", IsWinner),
    UpsetWin = ifelse(is.na(UpsetWin), "No", UpsetWin), 
    UpsetLoss = ifelse(is.na(UpsetLoss), "No", UpsetLoss),
    FavoriteWin = ifelse(is.na(FavoriteWin), "No", FavoriteWin),
    FavoriteLoss = ifelse(is.na(FavoriteLoss),  "No", FavoriteLoss),
    UnderdogWin = ifelse(is.na(UnderdogWin), "No", UnderdogWin),
    UnderdogLoss = ifelse(is.na(UnderdogLoss), "No", UnderdogLoss),
    SeedDifference = abs(SeedDifference),
    SeedDifference = ifelse(is.na(SeedDifference), "Not March Madness", SeedDifference),
    TournamentRound = fct_na_value_to_level(as.factor(TournamentRound), "Not March Madness"),
    Time = fct_na_value_to_level(as.factor(Time), "Not March Madness"),
    SizeSetting = as.factor(SizeSetting),
    Control = as.factor(Control),
    Seed = ifelse(is.na(Seed), "Not March Madness", Seed)) %>%
  select(Date, School, Mention_Volume, Seed, Time, TournamentRound, IsWinner, SeedDifference, UpsetWin, UpsetLoss, FavoriteWin, FavoriteLoss, UnderdogWin, UnderdogLoss, SizeSetting, Control, F20Enrollment, Major, IsMarchMadness, IsGameDay, IsDayAfterGame, GDayOrAfter)

VolumeAndVariables

# A tibble: 1,632 × 22
   Date       School         Mention_Volume  Seed Time  TournamentRound IsWinner
   <date>     <chr>                   <dbl> <dbl> <fct> <fct>           <chr>   
 1 2023-03-12 University of…            786    11 Not … Not March Madn… No      
 2 2023-03-12 University of…           1532    16 Not … Not March Madn… No      
 3 2023-03-12 Arizona State…           2664    11 Not … Not March Madn… No      
 4 2023-03-12 Duke Universi…           8882    16 Not … Not March Madn… No      
 5 2023-03-12 Utah State Un…            638    15 Not … Not March Madn… No      
 6 2023-03-12 University of…           3980    13 Not … Not March Madn… No      
 7 2023-03-12 North Carolin…           2181    10 Not … Not March Madn… No      
 8 2023-03-12 University of…           1476     9 Not … Not March Madn… No      
 9 2023-03-12 Providence Co…           1012     8 Not … Not March Madn… No      
10 2023-03-12 Xavier Univer…            801     8 Not … Not March Madn… No      
# ℹ 1,622 more rows
# ℹ 15 more variables: SeedDifference <chr>, UpsetWin <chr>, UpsetLoss <chr>,
#   FavoriteWin <chr>, FavoriteLoss <chr>, UnderdogWin <chr>,
#   UnderdogLoss <chr>, SizeSetting <fct>, Control <fct>, F20Enrollment <dbl>,
#   Major <chr>, IsMarchMadness <dbl>, IsGameDay <dbl>, IsDayAfterGame <dbl>,
#   GDayOrAfter <dbl>
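
One thing to check in the Seed step of the chunk above: ifelse() recycles its yes vector by position rather than looking values up by key. The toy example below, with made-up schools and seeds, contrasts that with a match()-based lookup like the one used for Major. The data frames and column names here are hypothetical.

```r
# Toy lookup table with one row per game (hypothetical schools and seeds)
games <- data.frame(
  School = c("A", "B", "A", "C"),
  Seed   = c(1, 16, 1, 8)
)

# Daily rows in a different order than the lookup table
daily <- data.frame(School = c("C", "A", "B", "C", "A", "B"))

# Positional recycling: seeds follow row order, not school
daily$SeedRecycled <- ifelse(daily$School %in% games$School, games$Seed, NA)

# Lookup by school: match() returns the first matching row for each school
daily$SeedMatched <- games$Seed[match(daily$School, games$School)]

daily
```

In the toy data, school C is assigned seed 1 by the recycled version and seed 8 by the matched version, which is the kind of mismatch to look for when a seed in the printed tibble seems off for a given school.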

Model 1 - Participation in March Madness

There is one independent variable to be considered in this section - whether the day’s mention volume is from a day when the school was still in the tournament (1 - Yes, 0 - No).

Code

# Run linear regression with Mention_Volume and IsMarchMadness
IsMarchMadnessLR <- lm(Mention_Volume ~ IsMarchMadness, data = VolumeAndVariables)
summary(IsMarchMadnessLR)


Call:
lm(formula = Mention_Volume ~ IsMarchMadness, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -2787   -682   -428    339  37243 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      720.54      80.47   8.954   <2e-16 ***
IsMarchMadness  2150.27     130.13  16.524   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2555 on 1630 degrees of freedom
Multiple R-squared:  0.1435,    Adjusted R-squared:  0.1429 
F-statistic:   273 on 1 and 1630 DF,  p-value: < 2.2e-16
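
With a single 0/1 predictor, the coefficients have a direct reading: the intercept is the mean of the outcome when the indicator is 0, and the slope is the difference between the two group means. The sketch below checks this on simulated data; the data frame d and its values are hypothetical.

```r
set.seed(42)

# Simulated daily volumes with a 0/1 indicator (hypothetical data)
d <- data.frame(IsMarchMadness = rep(c(0, 1), times = c(100, 60)))
d$Mention_Volume <- ifelse(d$IsMarchMadness == 1,
                           rlnorm(160, 7.5, 1),
                           rlnorm(160, 6.0, 1))

fit <- lm(Mention_Volume ~ IsMarchMadness, data = d)
group_means <- tapply(d$Mention_Volume, d$IsMarchMadness, mean)

# Intercept = mean when the indicator is 0; slope = difference between the two means
coef(fit)
group_means
```

Read this way, the intercept of 720.54 in the output above is the average daily volume on days when IsMarchMadness is 0, and 720.54 + 2150.27 = 2870.81 is the average on days when it is 1.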

While linear regression does indeed demonstrate a statistically significant relationship between IsMarchMadness and mention volume, a scatterplot comparing the two variables shows a large cluster of values, followed by a long tail indicating outliers on the high end of mention volume.

Code

# Scatterplot of IsMarchMadness vs. (Mention_Volume)
plot(VolumeAndVariables$IsMarchMadness, VolumeAndVariables$Mention_Volume, 
     xlab = "IsMarchMadness", ylab = "Mention Volume", 
     main = "Scatterplot of IsMarchMadness vs. Mention Volume")

Plotting IsMarchMadness against the log of Mention_Volume instead reduces the skew and makes the data more symmetric.

Code

# Scatterplot of IsMarchMadness vs. log(Mention_Volume)
plot(VolumeAndVariables$IsMarchMadness, log(VolumeAndVariables$Mention_Volume), 
     xlab = "IsMarchMadness", ylab = "log(Mention Volume)", 
     main = "Scatterplot of IsMarchMadness vs. log(Mention Volume)")

Code

# Run linear regression with log(Mention_Volume) and IsMarchMadness
IsMarchMadnessLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness, data = VolumeAndVariables)
summary(IsMarchMadnessLRlog)

The model using the log-transformed dependent variable has a much higher R-squared (0.2491) than the model using the original dependent variable (0.1435).

Accordingly, I am going to test both the log-transformed version of Mention_Volume and the original version in all of my other models.

Model 2 - Game Outcome Related Variables

Win/Loss (IsWinner - Yes, No, Not March Madness)
Did the favorite win? (FavoriteWin - Yes, No, Not March Madness)
Did the favorite lose? (FavoriteLoss - Yes, No, Not March Madness)
Did the underdog win? (UnderdogWin - Yes, No, Not March Madness)
Did the underdog lose? (UnderdogLoss - Yes, No, Not March Madness)
Is this an upset win? (UpsetWin - Yes, No, Not March Madness)
Is this an upset loss? (UpsetLoss - Yes, No, Not March Madness)

Win/Loss

This category is relatively self-explanatory - does the result of the game (win or loss) have an impact on mention volume?

Code

# Linear regression for IsWinner
summary(lm(Mention_Volume ~ IsMarchMadness + IsWinner, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + IsWinner, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -7435   -669   -388    391  37855 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      720.54      72.78    9.90   <2e-16 ***
IsMarchMadness  1538.60     121.99   12.61   <2e-16 ***
IsWinnerYes     5696.69     298.79   19.07   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2311 on 1629 degrees of freedom
Multiple R-squared:  0.2997,    Adjusted R-squared:  0.2989 
F-statistic: 348.6 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + IsWinner, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsWinner, 
    data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.8947  0.0863  0.8877  3.3890 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.95095    0.03607 164.990   <2e-16 ***
IsMarchMadness  1.25949    0.06046  20.832   <2e-16 ***
IsWinnerYes     1.41941    0.14808   9.586   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.145 on 1629 degrees of freedom
Multiple R-squared:  0.2938,    Adjusted R-squared:  0.293 
F-statistic: 338.9 on 2 and 1629 DF,  p-value: < 2.2e-16

Winning the game is a statistically significant predictor variable in both the original and the log-transformed model (p < 2e-16).
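
Coefficients from the log model are on the log scale, so exponentiating them gives a multiplicative effect on mention volume. The sketch below applies that to the two coefficients printed in the log(Mention_Volume) output above. It is a reading aid under the model’s assumptions, not an additional fitted result, and the object names are hypothetical.

```r
# Coefficients copied from the log(Mention_Volume) ~ IsMarchMadness + IsWinner output
log_coefs <- c(IsMarchMadness = 1.25949, IsWinnerYes = 1.41941)

# Multiplicative effect on mention volume and the equivalent percent change
multiplier <- exp(log_coefs)
pct_change <- (multiplier - 1) * 100

round(data.frame(multiplier, pct_change), 2)
```

Each multiplier is interpreted holding the other predictor fixed, and it describes the typical (geometric mean) volume rather than the arithmetic mean.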

Favorite Win/Loss

The favorite in a game is the team that has a higher seed. Here we will consider whether the favorite in the game winning, the favorite in the game losing, and both variables together had a statistically significant impact on mention volume.

Code

summary(lm(Mention_Volume ~ IsMarchMadness + FavoriteWin, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + FavoriteWin, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -8848   -672   -410    356  37490 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      720.54      75.78   9.508   <2e-16 ***
IsMarchMadness  1903.57     123.73  15.384   <2e-16 ***
FavoriteWinYes  8102.16     560.55  14.454   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2406 on 1629 degrees of freedom
Multiple R-squared:  0.2408,    Adjusted R-squared:  0.2399 
F-statistic: 258.4 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + FavoriteWin, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + FavoriteWin, 
    data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.9075  0.0774  0.8943  3.2894 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.95095    0.03661 162.528  < 2e-16 ***
IsMarchMadness  1.35915    0.05979  22.734  < 2e-16 ***
FavoriteWinYes  1.73238    0.27085   6.396 2.08e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.162 on 1629 degrees of freedom
Multiple R-squared:  0.2723,    Adjusted R-squared:  0.2714 
F-statistic: 304.7 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

# Linear regression for favorite win
FaveWin <- (lm(Mention_Volume ~ IsMarchMadness + FavoriteWin, data = VolumeAndVariables))
FaveWinlog <- (lm(log(Mention_Volume) ~ IsMarchMadness + FavoriteWin, data = VolumeAndVariables))

# Linear regression for favorite loss
FaveLoss <- (lm(Mention_Volume ~ IsMarchMadness + FavoriteLoss, data = VolumeAndVariables))
FaveLosslog <- (lm(log(Mention_Volume) ~ IsMarchMadness + FavoriteLoss, data = VolumeAndVariables))

# Linear regression for both favorite win and loss
FaveWinOrLoss <- (lm(Mention_Volume ~ IsMarchMadness + FavoriteWin + FavoriteLoss, data = VolumeAndVariables))
FaveWinOrLosslog <- (lm(log(Mention_Volume) ~ IsMarchMadness + FavoriteWin + FavoriteLoss, data = VolumeAndVariables))

stargazer(FaveWin, FaveLoss, FaveWinOrLoss, type = "text")


=================================================================================================
                                                 Dependent variable:                             
                    -----------------------------------------------------------------------------
                                                   Mention_Volume                                
                               (1)                       (2)                       (3)           
-------------------------------------------------------------------------------------------------
IsMarchMadness            1,903.568***              2,036.239***              1,777.843***       
                            (123.734)                 (129.823)                 (123.048)        
                                                                                                 
FavoriteWinYes            8,102.157***                                        8,227.883***       
                            (560.552)                                           (552.157)        
                                                                                                 
FavoriteLossYes                                     3,744.960***              4,003.356***       
                                                      (588.138)                 (552.157)        
                                                                                                 
Constant                   720.538***                720.538***                720.538***        
                            (75.779)                  (79.508)                  (74.607)         
                                                                                                 
-------------------------------------------------------------------------------------------------
Observations                  1,632                     1,632                     1,632          
R2                            0.241                     0.164                     0.265          
Adjusted R2                   0.240                     0.163                     0.263          
Residual Std. Error   2,405.903 (df = 1629)     2,524.305 (df = 1629)     2,368.703 (df = 1628)  
F Statistic         258.388*** (df = 2; 1629) 160.101*** (df = 2; 1629) 195.234*** (df = 3; 1628)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

Code

stargazer(FaveWinlog, FaveLosslog, FaveWinOrLosslog, type = "text")


=================================================================================================
                                                 Dependent variable:                             
                    -----------------------------------------------------------------------------
                                                 log(Mention_Volume)                             
                               (1)                       (2)                       (3)           
-------------------------------------------------------------------------------------------------
IsMarchMadness              1.359***                  1.374***                  1.318***         
                             (0.060)                   (0.060)                   (0.060)         
                                                                                                 
FavoriteWinYes              1.732***                                            1.773***         
                             (0.271)                                             (0.269)         
                                                                                                 
FavoriteLossYes                                       1.252***                  1.307***         
                                                       (0.272)                   (0.269)         
                                                                                                 
Constant                    5.951***                  5.951***                  5.951***         
                             (0.037)                   (0.037)                   (0.036)         
                                                                                                 
-------------------------------------------------------------------------------------------------
Observations                  1,632                     1,632                     1,632          
R2                            0.272                     0.264                     0.283          
Adjusted R2                   0.271                     0.263                     0.281          
Residual Std. Error     1.162 (df = 1629)         1.169 (df = 1629)         1.155 (df = 1628)    
F Statistic         304.721*** (df = 2; 1629) 291.442*** (df = 2; 1629) 213.829*** (df = 3; 1628)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

Both the favorite winning and the favorite losing are statistically significant predictor variables.

Underdog Win/Loss

The underdog in a game is the team that has a lower seed. Here we will consider whether the underdog in the game winning, the underdog in the game losing, and both variables together had a statistically significant impact on mention volume.

Code

# Linear regression for underdog win
UnderWin <- (lm(Mention_Volume ~ IsMarchMadness + UnderdogWin, data = VolumeAndVariables))
UnderWinlog <- (lm(log(Mention_Volume) ~ IsMarchMadness + UnderdogWin, data = VolumeAndVariables))

# Linear regression for underdog loss
UnderLoss <- (lm(Mention_Volume ~ IsMarchMadness + UnderdogLoss, data = VolumeAndVariables))
UnderLosslog <- (lm(log(Mention_Volume) ~ IsMarchMadness + UnderdogLoss, data = VolumeAndVariables))

# Linear regression for both underdog win and loss
UnderWinAndLoss <- (lm(Mention_Volume ~ IsMarchMadness + UnderdogWin + UnderdogLoss, data = VolumeAndVariables))
UnderWinAndLosslog <- (lm(log(Mention_Volume) ~ IsMarchMadness + UnderdogWin + UnderdogLoss, data = VolumeAndVariables))

stargazer(UnderWin, UnderLoss, UnderWinAndLoss, type = "text")


=================================================================================================
                                                 Dependent variable:                             
                    -----------------------------------------------------------------------------
                                                   Mention_Volume                                
                               (1)                       (2)                       (3)           
-------------------------------------------------------------------------------------------------
IsMarchMadness            1,819.167***              2,083.666***              1,719.918***       
                            (127.417)                 (132.958)                 (130.267)        
                                                                                                 
UnderdogWinYes            4,695.613***                                        4,794.863***       
                            (382.296)                                           (382.145)        
                                                                                                 
UnderdogLossYes                                       944.547**               1,308.295***       
                                                      (398.920)                 (382.145)        
                                                                                                 
Constant                   720.538***                720.538***                720.538***        
                            (77.005)                  (80.353)                  (76.753)         
                                                                                                 
-------------------------------------------------------------------------------------------------
Observations                  1,632                     1,632                     1,632          
R2                            0.216                     0.146                     0.222          
Adjusted R2                   0.215                     0.145                     0.220          
Residual Std. Error   2,444.823 (df = 1629)     2,551.140 (df = 1629)     2,436.817 (df = 1628)  
F Statistic         224.500*** (df = 2; 1629) 139.706*** (df = 2; 1629) 154.559*** (df = 3; 1628)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

Code

stargazer(UnderWinlog, UnderLosslog, UnderWinAndLosslog, type = "text")


=================================================================================================
                                                 Dependent variable:                             
                    -----------------------------------------------------------------------------
                                                 log(Mention_Volume)                             
                               (1)                       (2)                       (3)           
-------------------------------------------------------------------------------------------------
IsMarchMadness              1.322***                  1.375***                  1.275***         
                             (0.060)                   (0.061)                   (0.062)         
                                                                                                 
UnderdogWinYes              1.272***                                            1.319***         
                             (0.181)                                             (0.181)         
                                                                                                 
UnderdogLossYes                                       0.522***                  0.622***         
                                                       (0.184)                   (0.181)         
                                                                                                 
Constant                    5.951***                  5.951***                  5.951***         
                             (0.037)                   (0.037)                   (0.036)         
                                                                                                 
-------------------------------------------------------------------------------------------------
Observations                  1,632                     1,632                     1,632          
R2                            0.276                     0.258                     0.281          
Adjusted R2                   0.275                     0.257                     0.280          
Residual Std. Error     1.160 (df = 1629)         1.174 (df = 1629)         1.156 (df = 1628)    
F Statistic         310.261*** (df = 2; 1629) 282.726*** (df = 2; 1629) 212.140*** (df = 3; 1628)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

Both the underdog winning and the underdog losing are statistically significant predictor variables, and both remain significant when I ran them together in the same model. The underdog winning has the larger coefficient in every specification.

Is Upset

A game is considered an upset when the winning team is seeded five or more places lower than the team it is playing.
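
A minimal sketch of the upset rule as a function follows. It assumes that "lower seed" means a larger seed number, so the winner’s seed number must exceed the loser’s by at least five. The is_upset helper, its min_gap argument, and the example games are hypothetical and are not taken from the project’s data.

```r
# Hypothetical helper: TRUE when the winner's seed number exceeds the loser's by min_gap or more
is_upset <- function(winner_seed, loser_seed, min_gap = 5) {
  (winner_seed - loser_seed) >= min_gap
}

# Illustrative seed pairings only
games <- data.frame(
  WinnerSeed = c(16, 1, 9, 15),
  LoserSeed  = c(1, 16, 8, 2)
)
games$IsUpset <- ifelse(is_upset(games$WinnerSeed, games$LoserSeed), "Yes", "No")

games
```

A 9 seed beating an 8 seed is not flagged because the gap is only one, while a 16 seed beating a 1 seed is.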

Code

# Add IsUpset column
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(IsUpset = ifelse(UpsetWin == "Yes" | UpsetLoss == "Yes", "Yes", "No"))

# Linear Regression for IsUpset
IsUpset <- (lm(Mention_Volume ~ IsMarchMadness + IsUpset, data = VolumeAndVariables))
IsUpsetlog <- (lm(log(Mention_Volume) ~ IsMarchMadness + IsUpset, data = VolumeAndVariables))

# Linear Regression for UpsetWin
UpsetWin <- (lm(Mention_Volume ~ IsMarchMadness + UpsetWin, data = VolumeAndVariables))
UpsetWinlog <- (lm(log(Mention_Volume) ~ IsMarchMadness + UpsetWin, data = VolumeAndVariables))

# Linear Regression for UpsetLoss
UpsetLoss <- (lm(Mention_Volume ~ IsMarchMadness + UpsetLoss, data = VolumeAndVariables))
UpsetLosslog <- (lm(log(Mention_Volume) ~ IsMarchMadness + UpsetLoss, data = VolumeAndVariables))

# Linear Regression for UpsetWin and UpsetLoss
UpsetWinAndLoss <- (lm(Mention_Volume ~ IsMarchMadness + UpsetWin + UpsetLoss, data = VolumeAndVariables))
UpsetWinAndLosslog <- (lm(log(Mention_Volume) ~ IsMarchMadness + UpsetWin + UpsetLoss, data = VolumeAndVariables))

stargazer(IsUpset, UpsetWin, UpsetLoss, UpsetWinAndLoss, type = "text")


===========================================================================================================================
                                                              Dependent variable:                                          
                    -------------------------------------------------------------------------------------------------------
                                                                Mention_Volume                                             
                               (1)                       (2)                       (3)                       (4)           
---------------------------------------------------------------------------------------------------------------------------
IsMarchMadness            1,934.834***              2,001.561***              2,086.693***              1,934.834***       
                            (125.196)                 (124.830)                 (129.696)                 (124.244)        
                                                                                                                           
IsUpsetYes                7,468.406***                                                                                     
                            (582.541)                                                                                      
                                                                                                                           
UpsetWinYes                                         10,310.340***                                       10,377.070***      
                                                      (819.144)                                           (811.654)        
                                                                                                                           
UpsetLossYes                                                                  4,407.880***              4,559.740***       
                                                                                (851.074)                 (811.654)        
                                                                                                                           
Constant                   720.538***                720.538***                720.538***                720.538***        
                            (76.714)                  (76.842)                  (79.837)                  (76.131)         
                                                                                                                           
---------------------------------------------------------------------------------------------------------------------------
Observations                  1,632                     1,632                     1,632                     1,632          
R2                            0.222                     0.219                     0.157                     0.234          
Adjusted R2                   0.221                     0.218                     0.156                     0.233          
Residual Std. Error   2,435.604 (df = 1629)     2,439.646 (df = 1629)     2,534.742 (df = 1629)     2,417.079 (df = 1628)  
F Statistic         232.380*** (df = 2; 1629) 228.915*** (df = 2; 1629) 152.092*** (df = 2; 1629) 165.993*** (df = 3; 1628)
===========================================================================================================================
Note:                                                                                           *p<0.1; **p<0.05; ***p<0.01

Code

stargazer(IsUpsetlog, UpsetWinlog, UpsetLosslog, UpsetWinAndLosslog, type = "text")


===========================================================================================================================
                                                              Dependent variable:                                          
                    -------------------------------------------------------------------------------------------------------
                                                              log(Mention_Volume)                                          
                               (1)                       (2)                       (3)                       (4)           
---------------------------------------------------------------------------------------------------------------------------
IsMarchMadness              1.362***                  1.383***                  1.392***                  1.362***         
                             (0.060)                   (0.060)                   (0.060)                   (0.060)         
                                                                                                                           
IsUpsetYes                  1.736***                                                                                       
                             (0.278)                                                                                       
                                                                                                                           
UpsetWinYes                                           2.018***                                            2.039***         
                                                       (0.392)                                             (0.391)         
                                                                                                                           
UpsetLossYes                                                                    1.402***                  1.432***         
                                                                                 (0.394)                   (0.391)         
                                                                                                                           
Constant                    5.951***                  5.951***                  5.951***                  5.951***         
                             (0.037)                   (0.037)                   (0.037)                   (0.037)         
                                                                                                                           
---------------------------------------------------------------------------------------------------------------------------
Observations                  1,632                     1,632                     1,632                     1,632          
R2                            0.271                     0.266                     0.260                     0.272          
Adjusted R2                   0.270                     0.265                     0.259                     0.271          
Residual Std. Error     1.163 (df = 1629)         1.168 (df = 1629)         1.172 (df = 1629)         1.163 (df = 1628)    
F Statistic         303.386*** (df = 2; 1629) 295.067*** (df = 2; 1629) 285.804*** (df = 2; 1629) 202.694*** (df = 3; 1628)
===========================================================================================================================
Note:                                                                                           *p<0.1; **p<0.05; ***p<0.01

Both an upset win and an upset loss are statistically significant predictor variables.
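Coefficients from the log(Mention_Volume) models are on the log scale, so they are easier to read after exponentiating. The sketch below converts a log-scale coefficient into an approximate percent change in mention volume, holding the other variables constant. The helper name `pct_change_from_log_coef` is hypothetical; the coefficient values are copied from model (4) in the log table.

```r
# Convert a coefficient from a log(y) model into a percent change in y.
pct_change_from_log_coef <- function(beta) {
  100 * (exp(beta) - 1)
}

# Coefficients copied from model (4) of the log(Mention_Volume) table
log_coefs <- c(IsMarchMadness = 1.362, UpsetWinYes = 2.039, UpsetLossYes = 1.432)

# Percent change in mention volume associated with each indicator
round(pct_change_from_log_coef(log_coefs))
```

Read this way, the IsMarchMadness coefficient corresponds to mention volume a little under four times the baseline level.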

Testing All of the Game Outcome-Related Variables Together

I will test the game outcome-related variables together, using both Mention_Volume and its log-transformed version as the response variable.

Code

summary(lm(Mention_Volume ~ IsMarchMadness + IsWinner + UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + IsWinner + UpsetWin + 
    UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss, 
    data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -7306   -661   -386    398  38151 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       720.54      69.21  10.411  < 2e-16 ***
IsMarchMadness   1242.75     120.69  10.297  < 2e-16 ***
IsWinnerYes       758.71    1103.15   0.688    0.492    
UpsetWinYes      4381.74    1009.65   4.340 1.51e-05 ***
UpsetLossYes     1355.41    1009.65   1.342    0.180    
FavoriteWinYes   5928.70    1300.01   4.560 5.49e-06 ***
FavoriteLossYes  3896.41     701.88   5.551 3.31e-08 ***
UnderdogWinYes   4513.32    1147.57   3.933 8.74e-05 ***
UnderdogLossYes  1785.46     345.71   5.165 2.71e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2197 on 1623 degrees of freedom
Multiple R-squared:  0.369, Adjusted R-squared:  0.3659 
F-statistic: 118.7 on 8 and 1623 DF,  p-value: < 2.2e-16

When combining all the game-related variables together, IsWinnerYes and UpsetLossYes were no longer statistically significant. IsWinnerYes had the highest p-value, so I will remove it first and run the model again.
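Removing the least significant variable and refitting is one step of backward elimination. The sketch below shows that single step on simulated data; the data frame `toy` and its columns `a`, `b`, and `noise` are made up for illustration and have nothing to do with the tournament dataset.

```r
# Simulated data: a and b truly affect y, noise does not (all values are made up)
set.seed(7)
n <- 300
toy <- data.frame(
  a = rbinom(n, 1, 0.3),
  b = rbinom(n, 1, 0.3),
  noise = rbinom(n, 1, 0.3)
)
toy$y <- 500 + 2000 * toy$a + 1500 * toy$b + rnorm(n, sd = 400)

fit <- lm(y ~ a + b + noise, data = toy)

# p-values for every term except the intercept
p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

# Term with the largest p-value; with 0/1 numeric predictors the coefficient
# name equals the term name (with "Yes"/"No" factors it would be e.g. "UpsetWinYes")
weakest <- names(which.max(p_values))

# Refit without that term
reduced <- update(fit, as.formula(paste(". ~ . -", weakest)))
names(coef(reduced))
```

In practice the step would only be taken when the largest p-value exceeds the chosen significance level, and it would be repeated until every remaining term is significant.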

Code

summary(lm(Mention_Volume ~ IsMarchMadness + UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + UpsetWin + UpsetLoss + 
    FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss, 
    data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -7306   -660   -384    397  38145 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)        720.5       69.2  10.412  < 2e-16 ***
IsMarchMadness    1248.8      120.3  10.378  < 2e-16 ***
UpsetWinYes       4381.7     1009.5   4.341 1.51e-05 ***
UpsetLossYes      1355.4     1009.5   1.343     0.18    
FavoriteWinYes    6681.3      701.7   9.521  < 2e-16 ***
FavoriteLossYes   3890.3      701.7   5.544 3.44e-08 ***
UnderdogWinYes    5265.9      345.5  15.240  < 2e-16 ***
UnderdogLossYes   1779.4      345.5   5.149 2.93e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2197 on 1624 degrees of freedom
Multiple R-squared:  0.3689,    Adjusted R-squared:  0.3661 
F-statistic: 135.6 on 7 and 1624 DF,  p-value: < 2.2e-16

UpsetLoss is still not statistically significant, so I will run the model again without it.

Code

GameOutcomeLR <- lm(Mention_Volume ~ IsMarchMadness + UpsetWin + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss, data = VolumeAndVariables)
summary(GameOutcomeLR)


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + UpsetWin + FavoriteWin + 
    FavoriteLoss + UnderdogWin + UnderdogLoss, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -7306   -660   -383    398  38145 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       720.54      69.22  10.410  < 2e-16 ***
IsMarchMadness   1248.84     120.37  10.375  < 2e-16 ***
UpsetWinYes      4381.74    1009.73   4.340 1.52e-05 ***
FavoriteWinYes   6681.32     701.89   9.519  < 2e-16 ***
FavoriteLossYes  4532.36     513.69   8.823  < 2e-16 ***
UnderdogWinYes   5265.94     345.63  15.236  < 2e-16 ***
UnderdogLossYes  1779.37     345.63   5.148 2.95e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2198 on 1625 degrees of freedom
Multiple R-squared:  0.3682,    Adjusted R-squared:  0.3658 
F-statistic: 157.8 on 6 and 1625 DF,  p-value: < 2.2e-16

When it comes to game outcome-related variables and Mention_Volume, UpsetWin, FavoriteWin, FavoriteLoss, UnderdogWin, and UnderdogLoss all had a statistically significant impact on March Madness mention volume.

Code

# For log(Mention_Volume)
summary(lm(log(Mention_Volume) ~ IsMarchMadness + IsWinner + UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsWinner + 
    UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + 
    UnderdogLoss, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.8526  0.0646  0.8850  3.4984 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.95095    0.03555 167.417  < 2e-16 ***
IsMarchMadness   1.15015    0.06198  18.556  < 2e-16 ***
IsWinnerYes      0.50475    0.56655   0.891 0.373110    
UpsetWinYes      0.58808    0.51853   1.134 0.256910    
UpsetLossYes     0.32011    0.51853   0.617 0.537101    
FavoriteWinYes   1.15807    0.66765   1.735 0.083013 .  
FavoriteLossYes  1.32353    0.36047   3.672 0.000249 ***
UnderdogWinYes   0.93893    0.58936   1.593 0.111327    
UnderdogLossYes  0.74721    0.17755   4.208 2.71e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.129 on 1623 degrees of freedom
Multiple R-squared:  0.3167,    Adjusted R-squared:  0.3133 
F-statistic: 94.02 on 8 and 1623 DF,  p-value: < 2.2e-16

Code

# For log(Mention_Volume)
GameOutcomeLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss, data = VolumeAndVariables)
summary(GameOutcomeLRlog)


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + FavoriteWin + 
    FavoriteLoss + UnderdogWin + UnderdogLoss, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.8668  0.0745  0.8872  3.4943 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.95095    0.03554 167.445  < 2e-16 ***
IsMarchMadness   1.15420    0.06180  18.675  < 2e-16 ***
FavoriteWinYes   1.93732    0.26375   7.345 3.24e-13 ***
FavoriteLossYes  1.47110    0.26375   5.578 2.85e-08 ***
UnderdogWinYes   1.43962    0.17746   8.112 9.69e-16 ***
UnderdogLossYes  0.74316    0.17746   4.188 2.97e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.128 on 1626 degrees of freedom
Multiple R-squared:  0.3156,    Adjusted R-squared:  0.3135 
F-statistic:   150 on 5 and 1626 DF,  p-value: < 2.2e-16

When it comes to modeling game-related outcomes, the statistically significant variables were slightly different when using Mention_Volume vs. log(Mention_Volume); namely, in the log(Mention_Volume) version, UpsetWin was no longer a statistically significant predictor variable.

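The adjusted R-squared for the non-transformed version of Mention_Volume was higher (0.3658) than for the log version (0.3135), suggesting that the non-transformed Mention_Volume is a better fit for modeling. Note that R-squared values computed on different response scales are not strictly comparable, so this comparison is indicative rather than conclusive.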
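One way to put a raw-scale model and a log-scale model on the same footing is to back-transform the log model's fitted values and compute R-squared against the original response. The sketch below does this on simulated data; the variables `x` and `y` and the helper `r2_original_scale` are hypothetical and only stand in for an indicator such as IsMarchMadness and for Mention_Volume.

```r
# Simulated, strictly positive response driven by one 0/1 predictor (made-up data)
set.seed(42)
n <- 200
x <- rbinom(n, 1, 0.4)
y <- exp(6 + 1.2 * x + rnorm(n, sd = 1))

fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# R-squared computed on the original scale of the response
r2_original_scale <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

r2_raw <- r2_original_scale(y, fitted(fit_raw))

# exp() of the fitted log values is a simple back-transformation; it tends to
# underestimate the mean, so this is a rough comparison only
r2_log_backtransformed <- r2_original_scale(y, exp(fitted(fit_log)))

c(raw = r2_raw, log_backtransformed = r2_log_backtransformed)
```

Both numbers now describe how well each model predicts the untransformed response, which is what a comparison of fit needs.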

Model 3 - Tournament Related Variables

Round of the tournament (8 - First4, RoundOf64, RoundOf32, Sweet16, Elite8, Final4, Championship, Not in Tournament)
What Seed each team was (16 - 1 through 16)
Absolute difference in seeding between teams in game (0-15)
Time slot the game was played during (5 - LateEve, EarlyEve, LateAft, MidAft, EarlyAft)
Is Game Day (Yes/No)

Tournament Round

There are seven rounds in March Madness: First Four (play-in games that determine the final four teams in the Round of 64), Round of 64, Round of 32, Sweet 16, Elite 8, Final 4, and the Championship. I have an eighth category, “Not in Tournament”, for teams that have been eliminated from competition.

I will create dummy variables for each round as “not in tournament” will be multicollinear with IsMarchMadness.

Code

# Make new TournamentRound column
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(
    TournamentRoundNew = case_when(
      IsMarchMadness == 1 & IsWinner == "Yes" & Date >= as.Date("2023-03-12") & Date <= as.Date("2023-03-15") ~ "First 4",
      IsMarchMadness == 1 & IsWinner == "No" & Date >= as.Date("2023-03-12") & Date <= as.Date("2023-03-16") ~ "First 4",
      IsMarchMadness == 1 & IsWinner == "Yes" & Date >= as.Date("2023-03-16") & Date <= as.Date("2023-03-17") ~ "Round of 64",
      IsMarchMadness == 1 & IsWinner == "No" & Date >= as.Date("2023-03-16") & Date <= as.Date("2023-03-18") ~ "Round of 64",
      IsMarchMadness == 1 & Date >= as.Date("2023-03-18") & Date <= as.Date("2023-03-20") ~ "Round of 32",
      IsMarchMadness == 1 & IsWinner == "Yes" & Date >= as.Date("2023-03-23") & Date <= as.Date("2023-03-24") ~ "Sweet 16",
      IsMarchMadness == 1 & IsWinner == "No" & Date >= as.Date("2023-03-23") & Date <= as.Date("2023-03-25") ~ "Sweet 16",
      IsMarchMadness == 1 & Date >= as.Date("2023-03-25") & Date <= as.Date("2023-03-27") ~ "Elite 8",
      IsMarchMadness == 1 & Date >= as.Date("2023-04-01") & Date <= as.Date("2023-04-02") ~ "Final 4",
      IsMarchMadness == 1 & Date >= as.Date("2023-04-03") & Date <= as.Date("2023-04-04") ~ "Championship",
      IsMarchMadness == 0 ~ "Not In Tournament"))

# Create dummy variables for each round
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(
    IsFirst4 = ifelse(TournamentRoundNew == "First 4", 1, 0),
    IsRd64 = ifelse(TournamentRoundNew == "Round of 64", 1, 0),
    IsRd32 = ifelse(TournamentRoundNew == "Round of 32", 1, 0),
    IsRd16 = ifelse(TournamentRoundNew == "Sweet 16", 1, 0),
    IsRd8 = ifelse(TournamentRoundNew == "Elite 8", 1, 0),
    IsRd4 = ifelse(TournamentRoundNew == "Final 4", 1, 0),
    IsChamp = ifelse(TournamentRoundNew == "Championship", 1, 0))

MMDaysTournamentRound <- VolumeAndVariables %>%
  select(Mention_Volume, IsMarchMadness, IsFirst4, IsRd64, IsRd32, IsRd16, IsRd8, IsRd4, IsChamp)

# Linear regression for tournament round
summary(lm(Mention_Volume ~ ., data = MMDaysTournamentRound))


Call:
lm(formula = Mention_Volume ~ ., data = MMDaysTournamentRound)

Residuals:
     Min       1Q   Median       3Q      Max 
-14604.5   -640.5   -368.0    402.0  20132.5 

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)       720.5       62.5   11.53   <2e-16 ***
IsMarchMadness  24136.0      994.1   24.28   <2e-16 ***
IsFirst4       -23419.2      998.3  -23.46   <2e-16 ***
IsRd64         -21908.0     1008.5  -21.72   <2e-16 ***
IsRd32         -20473.0     1022.7  -20.02   <2e-16 ***
IsRd16         -18959.4     1038.3  -18.26   <2e-16 ***
IsRd8          -16179.4     1109.2  -14.59   <2e-16 ***
IsRd4          -12657.6     1215.1  -10.42   <2e-16 ***
IsChamp              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1984 on 1576 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.497, Adjusted R-squared:  0.4948 
F-statistic: 222.4 on 7 and 1576 DF,  p-value: < 2.2e-16

The IsChamp coefficient was not defined due to singularities; I will run linear regression again with this variable on its own, and for all of the rounds except the championship round.
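The singularity arises because, on the rows used in the regression, the round dummies add up to IsMarchMadness, so the last dummy carries no information the model does not already have. The sketch below reproduces the effect on a toy data frame with only three rounds; all values in `toy` are made up.

```r
# Toy data: non-tournament days plus three rounds (all values are made up)
set.seed(1)
toy <- data.frame(
  Round = rep(c("Not In Tournament", "Round of 64", "Final 4", "Championship"), each = 5)
)
toy$IsMarchMadness <- ifelse(toy$Round == "Not In Tournament", 0, 1)
toy$IsRd64 <- ifelse(toy$Round == "Round of 64", 1, 0)
toy$IsRd4 <- ifelse(toy$Round == "Final 4", 1, 0)
toy$IsChamp <- ifelse(toy$Round == "Championship", 1, 0)
toy$Mention_Volume <- 700 + 1000 * toy$IsRd64 + 10000 * toy$IsRd4 +
  24000 * toy$IsChamp + rnorm(nrow(toy), sd = 50)

# The round dummies sum to IsMarchMadness on every row
all(toy$IsRd64 + toy$IsRd4 + toy$IsChamp == toy$IsMarchMadness)

# lm() cannot estimate the last linearly dependent column and reports NA for it
fit <- lm(Mention_Volume ~ IsMarchMadness + IsRd64 + IsRd4 + IsChamp, data = toy)
coef(fit)
```

Because IsChamp is the last of the dependent columns, it is the one reported as NA; the championship effect is absorbed into the IsMarchMadness coefficient.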

Code

summary(lm(Mention_Volume ~ IsChamp, data = MMDaysTournamentRound))


Call:
lm(formula = Mention_Volume ~ IsChamp, data = MMDaysTournamentRound)

Residuals:
     Min       1Q   Median       3Q      Max 
-14604.5  -1212.7   -717.0    103.5  25757.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1460.49      63.73   22.92   <2e-16 ***
IsChamp     23396.01    1268.24   18.45   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2533 on 1582 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.177, Adjusted R-squared:  0.1765 
F-statistic: 340.3 on 1 and 1582 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsChamp, data = MMDaysTournamentRound))


Call:
lm(formula = log(Mention_Volume) ~ IsChamp, data = MMDaysTournamentRound)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0504 -0.9348  0.1631  0.9045  3.7634 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.44828    0.03405 189.394  < 2e-16 ***
IsChamp      3.54487    0.67752   5.232  1.9e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.353 on 1582 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.01701,   Adjusted R-squared:  0.01639 
F-statistic: 27.37 on 1 and 1582 DF,  p-value: 1.9e-07

Code

summary(lm(Mention_Volume ~ .-IsChamp, data = MMDaysTournamentRound))


Call:
lm(formula = Mention_Volume ~ . - IsChamp, data = MMDaysTournamentRound)

Residuals:
     Min       1Q   Median       3Q      Max 
-14604.5   -640.5   -368.0    402.0  20132.5 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)       720.5       62.5   11.53   <2e-16 ***
IsMarchMadness  24136.0      994.1   24.28   <2e-16 ***
IsFirst4       -23419.2      998.3  -23.46   <2e-16 ***
IsRd64         -21908.0     1008.5  -21.72   <2e-16 ***
IsRd32         -20473.0     1022.7  -20.02   <2e-16 ***
IsRd16         -18959.4     1038.3  -18.26   <2e-16 ***
IsRd8          -16179.4     1109.2  -14.59   <2e-16 ***
IsRd4          -12657.6     1215.1  -10.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1984 on 1576 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.497, Adjusted R-squared:  0.4948 
F-statistic: 222.4 on 7 and 1576 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ .-IsChamp, data = MMDaysTournamentRound))


Call:
lm(formula = log(Mention_Volume) ~ . - IsChamp, data = MMDaysTournamentRound)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.8450  0.0987  0.8922  2.9509 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.95095    0.03548 167.747  < 2e-16 ***
IsMarchMadness  4.04219    0.56428   7.163 1.20e-12 ***
IsFirst4       -3.12992    0.56665  -5.524 3.88e-08 ***
IsRd64         -2.46016    0.57247  -4.297 1.83e-05 ***
IsRd32         -2.03323    0.58049  -3.503 0.000474 ***
IsRd16         -1.60848    0.58937  -2.729 0.006420 ** 
IsRd8          -1.22641    0.62963  -1.948 0.051615 .  
IsRd4          -0.75028    0.68973  -1.088 0.276855    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.126 on 1576 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.3217,    Adjusted R-squared:  0.3187 
F-statistic: 106.8 on 7 and 1576 DF,  p-value: < 2.2e-16

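Tournament round is a statistically significant predictor variable for the increase in mention volume during March Madness, although in the log model the Elite 8 (p = 0.052) and Final 4 (p = 0.277) coefficients are not significant at the 0.05 level.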

When the championship round is run separately from the other rounds, its impact is statistically significant; however, when it is run with the other rounds, it produces singularities.
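The two sets of output can be reconciled with a little arithmetic. On the rows used in the regression the round dummies sum to IsMarchMadness, so in the combined model the IsMarchMadness coefficient plays the role of the championship round, and each round coefficient is an offset from it. The figures below are taken directly from the regression output; the fitted-value notation is only shorthand introduced here.

```latex
\begin{align*}
\text{IsFirst4} + \text{IsRd64} + \text{IsRd32} + \text{IsRd16} + \text{IsRd8} + \text{IsRd4} + \text{IsChamp} &= \text{IsMarchMadness} \\
\widehat{\text{Volume}}_{\text{Championship}} \text{ (combined model)} &= 720.5 + 24136.0 = 24856.5 \\
\widehat{\text{Volume}}_{\text{Championship}} \text{ (IsChamp-only model)} &= 1460.49 + 23396.01 = 24856.50 \\
\widehat{\text{Volume}}_{\text{First 4}} \text{ (combined model)} &= 720.5 + 24136.0 - 23419.2 = 1437.3
\end{align*}
```

Both models give the same fitted championship volume, which is why the negative round coefficients should be read as distances below the championship round rather than as decreases in volume.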

Looking at the coefficients, I suspect there may be better variables to explain increases in mention volume during March Madness.
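An alternative that would make the round coefficients easier to read is to pass the round as a single factor with "Not In Tournament" as the reference level and leave IsMarchMadness out of that model. This is a sketch on made-up data, not the approach used in this analysis; the data frame `toy` and its values are hypothetical.

```r
# Toy data with a few rounds (all values are made up)
set.seed(3)
rounds <- c("Not In Tournament", "First 4", "Round of 64", "Championship")
toy <- data.frame(Round = rep(rounds, times = c(20, 4, 8, 2)))
toy$Mention_Volume <- c(700, 1400, 2900, 24800)[match(toy$Round, rounds)] +
  rnorm(nrow(toy), sd = 100)

# Use "Not In Tournament" as the baseline level of the factor
toy$Round <- relevel(factor(toy$Round), ref = "Not In Tournament")

# lm() builds the dummy variables itself and omits the baseline level,
# so no coefficient is lost to a singularity
fit <- lm(Mention_Volume ~ Round, data = toy)
coef(fit)
```

Each coefficient is then the difference in mean mention volume between that round and non-tournament days, so the signs and sizes can be interpreted directly.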

Seeding

The 68 teams in March Madness are divided into four regions, and each region’s teams are ranked 1-16 (the other four teams participate in the First Four and are additional teams ranked at either 11 or 16). There are two questions I want to answer as it relates to seeding and whether it is a predictor for conversation volume:

Does the absolute difference in seeding between the two teams in the game impact mention volume?
Does each team’s seeding impact mention volume?

Code

# Linear regression for seeding
summary(lm(Mention_Volume ~ IsMarchMadness + Seed, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + Seed, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -2952   -766   -395    327  37209 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      959.00     132.86   7.218 8.05e-13 ***
IsMarchMadness  2133.91     130.17  16.393  < 2e-16 ***
Seed             -31.38      13.92  -2.254   0.0243 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2552 on 1629 degrees of freedom
Multiple R-squared:  0.1461,    Adjusted R-squared:  0.1451 
F-statistic: 139.4 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + Seed, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + Seed, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4589 -0.9049  0.1126  0.9052  3.2083 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.15053    0.06097 100.877  < 2e-16 ***
IsMarchMadness  1.39820    0.05974  23.406  < 2e-16 ***
Seed           -0.02627    0.00639  -4.111 4.14e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.171 on 1629 degrees of freedom
Multiple R-squared:  0.2616,    Adjusted R-squared:  0.2607 
F-statistic: 288.6 on 2 and 1629 DF,  p-value: < 2.2e-16

Similar to tournament round, while Seed is statistically significant, I suspect there are likely other variables that are a better fit in terms of explaining this impact.

To measure the impact of seed difference without causing multicollinearity with IsMarchMadness, I will again create dummy variables for each possible difference in seeding.

Because no games had a seeding difference of 14, 12, 10, or 2, these will be excluded from my dummy variables.

Code

# Create dummy variables for each possible difference in seeding
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(
    Is15 = ifelse(SeedDifference == 15, 1, 0),
    Is13 = ifelse(SeedDifference == 13, 1, 0),
    Is11 = ifelse(SeedDifference == 11, 1, 0),
    Is9 = ifelse(SeedDifference == 9, 1, 0),
    Is8 = ifelse(SeedDifference == 8, 1, 0),
    Is7 = ifelse(SeedDifference == 7, 1, 0),
    Is6 = ifelse(SeedDifference == 6, 1, 0),
    Is5 = ifelse(SeedDifference == 5, 1, 0),
    Is4 = ifelse(SeedDifference == 4, 1, 0),
    Is3 = ifelse(SeedDifference == 3, 1, 0),
    Is1 = ifelse(SeedDifference == 1, 1, 0),
    Is0 = ifelse(SeedDifference == 0, 1, 0))
SeedDifference <- VolumeAndVariables %>%
  select(Mention_Volume, IsMarchMadness, Is15, Is13, Is11, Is9, Is8, Is7, Is6, Is5, Is4, Is3, Is1, Is0)

# Linear regression for difference in seeding
summary(lm(Mention_Volume ~ IsMarchMadness + ., data = SeedDifference))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + ., data = SeedDifference)

Residuals:
   Min     1Q Median     3Q    Max 
 -7124   -658   -382    402  38147 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      720.54      68.44  10.529  < 2e-16 ***
IsMarchMadness  1246.83     119.66  10.420  < 2e-16 ***
Is15            5303.00     774.44   6.848 1.06e-11 ***
Is13            1997.25     774.44   2.579  0.01000 ** 
Is11            -125.75     774.44  -0.162  0.87103    
Is9             1877.23     694.07   2.705  0.00691 ** 
Is8             3231.93     694.07   4.656 3.48e-06 ***
Is7             3417.27     588.94   5.802 7.85e-09 ***
Is6            13337.13    1539.52   8.663  < 2e-16 ***
Is5             2920.13     588.94   4.958 7.86e-07 ***
Is4            10969.53     694.07  15.805  < 2e-16 ***
Is3             4573.13     551.99   8.285 2.45e-16 ***
Is1             5999.67     437.28  13.721  < 2e-16 ***
Is0              125.13     774.44   0.162  0.87166    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2173 on 1618 degrees of freedom
Multiple R-squared:  0.385, Adjusted R-squared:  0.3801 
F-statistic: 77.92 on 13 and 1618 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + ., data = SeedDifference))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + ., data = SeedDifference)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.8630  0.0490  0.8891  3.4991 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.95095    0.03558 167.239  < 2e-16 ***
IsMarchMadness  1.14942    0.06222  18.474  < 2e-16 ***
Is15            1.38431    0.40267   3.438 0.000601 ***
Is13            0.76900    0.40267   1.910 0.056344 .  
Is11            0.31545    0.40267   0.783 0.433515    
Is9             0.92347    0.36088   2.559 0.010590 *  
Is8             1.29187    0.36088   3.580 0.000354 ***
Is7             1.12796    0.30622   3.684 0.000238 ***
Is6             2.52050    0.80048   3.149 0.001670 ** 
Is5             1.08112    0.30622   3.531 0.000426 ***
Is4             2.28742    0.36088   6.338 3.00e-10 ***
Is3             1.24910    0.28701   4.352 1.43e-05 ***
Is1             1.56438    0.22736   6.881 8.50e-12 ***
Is0             0.29790    0.40267   0.740 0.459520    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.13 on 1618 degrees of freedom
Multiple R-squared:  0.3173,    Adjusted R-squared:  0.3118 
F-statistic: 57.85 on 13 and 1618 DF,  p-value: < 2.2e-16

In the untransformed model, every difference in seeding except Is11 and Is0 had a statistically significant impact on mention volume at the 0.05 level; in the log model, Is13 also fell just short (p = 0.056). There are, however, likely better explanatory variables than this.

For seeding, there are two final variables I want to create and examine: the impact of a seed difference of 15 (i.e., a number 1 seed playing a number 16 seed), and the impact of a seed difference greater than or equal to five (i.e., the difference in seeding that creates the potential for a game to be an upset).

Code

# Add new column for seed difference = 15 
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(SeedDiff15 = ifelse(SeedDifference == 15, 1, 0))
  
# Add new column for seed difference >= 5 
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(SeedDiffMoreThan5 = ifelse(SeedDifference >= 5, 1, 0))

# Linear regression for seed difference = 15
summary(lm(Mention_Volume ~ IsMarchMadness + SeedDiff15, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + SeedDiff15, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -6681   -676   -424    346  37300 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)       720.5       79.9   9.018  < 2e-16 ***
IsMarchMadness   2093.1      129.7  16.135  < 2e-16 ***
SeedDiff15       4456.7      902.6   4.937 8.73e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2537 on 1629 degrees of freedom
Multiple R-squared:  0.1561,    Adjusted R-squared:  0.1551 
F-statistic: 150.7 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + SeedDiff15, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + SeedDiff15, 
    data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.9203  0.0809  0.9121  3.2512 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.95095    0.03699 160.889  < 2e-16 ***
IsMarchMadness  1.39733    0.06006  23.267  < 2e-16 ***
SeedDiff15      1.13640    0.41788   2.719  0.00661 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.174 on 1629 degrees of freedom
Multiple R-squared:  0.2574,    Adjusted R-squared:  0.2564 
F-statistic: 282.3 on 2 and 1629 DF,  p-value: < 2.2e-16

Games where a number one seed was playing a number 16 seed (a seed difference of 15) did indeed have a statistically significant impact on mention volume.

Code

summary(lm(Mention_Volume ~ IsMarchMadness + SeedDiffMoreThan5, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + SeedDiffMoreThan5, 
    data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -6176   -669   -383    402  37838 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5141.4      288.9   17.80   <2e-16 ***
IsMarchMadness      1555.1      126.9   12.26   <2e-16 ***
SeedDiffMoreThan5  -4420.9      279.0  -15.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2379 on 1629 degrees of freedom
Multiple R-squared:  0.2579,    Adjusted R-squared:  0.2569 
F-statistic:   283 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + SeedDiffMoreThan5, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + SeedDiffMoreThan5, 
    data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.9075  0.0933  0.9068  3.3915 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7.10126    0.13986  50.774   <2e-16 ***
IsMarchMadness     1.25705    0.06141  20.468   <2e-16 ***
SeedDiffMoreThan5 -1.15031    0.13507  -8.516   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.152 on 1629 degrees of freedom
Multiple R-squared:  0.2858,    Adjusted R-squared:  0.2849 
F-statistic: 325.9 on 2 and 1629 DF,  p-value: < 2.2e-16

While a seed difference of more than five had a statistically significant result, there are likely better explanatory variables for the increase in mention volume during March Madness.

Time Slot

I have divided the start times for the March Madness games into five time slots based on when each game tipped off - Early Afternoon (12:00-2:15pm), Mid Afternoon (2:15-4:30pm), Late Afternoon (4:30-6:45pm), Early Evening (6:45-9:00pm), and Late Evening (9:00-10:45pm).

I included this variable because I suspect that playing earlier in the day may have an impact on overall conversation volume.
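As a rough illustration of how the five slots could be assigned from a tip-off time, here is a minimal sketch. The helper name `assign_time_slot`, the decimal-hour input, and the choice to put a boundary time (for example 2:15pm) in the later slot are all assumptions made for this example; they are not taken from the dataset.

```r
# Hypothetical helper: map a tip-off time (decimal hours, 24-hour clock)
# to one of the five time slots described above.
# Assumption: each interval is closed on the left, so a 14:15 tip-off
# is counted as Mid Afternoon rather than Early Afternoon.
assign_time_slot <- function(tip_hour) {
  breaks <- c(12, 14.25, 16.5, 18.75, 21, 22.75)
  labels <- c("Early Afternoon", "Mid Afternoon", "Late Afternoon",
              "Early Evening", "Late Evening")
  as.character(cut(tip_hour, breaks = breaks, labels = labels,
                   right = FALSE, include.lowest = TRUE))
}

# Made-up tip-off times; NA stands for a day with no game
tip_hours <- c(12.25, 14.25, 17, 19.5, 21.75, NA)
slots <- assign_time_slot(tip_hours)
print(slots)
```

Days without a game stay `NA` here; how such days are labelled in the Time column of the real data may differ.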

Code

# Create dummy variables for each time slot
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(
    IsEarlyEvening = ifelse(Time == "Early Evening", 1, 0),
    IsLateAfternoon = ifelse(Time == "Late Afternoon", 1, 0),
    IsLateEvening = ifelse(Time == "Late Evening", 1, 0),
    IsMidAfternoon = ifelse(Time == "Mid Afternoon", 1, 0),
    IsEarlyAfternoon = ifelse(Time == "Early Afternoon", 1, 0))
TimeVariables <- VolumeAndVariables %>%
  select(Mention_Volume, IsMarchMadness, IsEarlyEvening, IsLateAfternoon, IsLateEvening, IsMidAfternoon, IsEarlyAfternoon)

# Linear regression for time of day
summary(lm(Mention_Volume ~ IsMarchMadness + ., data = TimeVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + ., data = TimeVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -9146   -662   -384    409  38147 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        720.54      71.27  10.110  < 2e-16 ***
IsMarchMadness    1246.83     124.61  10.006  < 2e-16 ***
IsEarlyEvening    3743.33     401.30   9.328  < 2e-16 ***
IsLateAfternoon   7699.94     455.38  16.909  < 2e-16 ***
IsLateEvening     3007.43     390.73   7.697 2.40e-14 ***
IsMidAfternoon    4613.46     543.04   8.496  < 2e-16 ***
IsEarlyAfternoon  2248.08     516.19   4.355 1.41e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2263 on 1625 degrees of freedom
Multiple R-squared:  0.3301,    Adjusted R-squared:  0.3277 
F-statistic: 133.5 on 6 and 1625 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + ., data = TimeVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + ., data = TimeVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.8773  0.0790  0.8965  3.4991 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.95095    0.03575 166.455  < 2e-16 ***
IsMarchMadness    1.14942    0.06251  18.388  < 2e-16 ***
IsEarlyEvening    1.26754    0.20130   6.297 3.90e-10 ***
IsLateAfternoon   1.56688    0.22843   6.859 9.81e-12 ***
IsLateEvening     0.99169    0.19600   5.060 4.68e-07 ***
IsMidAfternoon    1.30455    0.27241   4.789 1.83e-06 ***
IsEarlyAfternoon  1.03853    0.25894   4.011 6.33e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.135 on 1625 degrees of freedom
Multiple R-squared:  0.3079,    Adjusted R-squared:  0.3053 
F-statistic: 120.5 on 6 and 1625 DF,  p-value: < 2.2e-16

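All five time slots had a statistically significant impact in both models. Because every slot has its own dummy variable, each coefficient is measured against days on which a school had no game in any slot, so part of what these variables capture is simply the effect of playing that day.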

Is Game Day

While the entire tournament takes 20 days to play, games only occur on 12 of those days. Accordingly, I want to consider the impact of game day when modeling.

In my original hypothesis testing, the day after each game also had a statistically significant impact on mention volume, so I want to include it in the regression as well.
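For readers who want to see how flags like these can be derived, the sketch below builds game-day and day-after indicators from a list of game dates. The data frame `daily`, the vector `game_dates`, and the dates themselves are invented for illustration; the variables in VolumeAndVariables may have been constructed differently.

```r
# Hypothetical inputs: five daily rows for one school and that school's game dates
daily <- data.frame(Date = as.Date("2023-03-16") + 0:4)
game_dates <- as.Date(c("2023-03-16", "2023-03-18"))

# 1 if the school played on that date
daily$IsGameDay <- as.integer(daily$Date %in% game_dates)
# 1 if the school played on the previous date
daily$IsDayAfterGame <- as.integer(daily$Date %in% (game_dates + 1))
# 1 if the date is a game day, the day after a game, or both
daily$GDayOrAfter <- as.integer(daily$IsGameDay == 1 | daily$IsDayAfterGame == 1)

print(daily)
```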

Code

# Linear regression game day
IsOnGameDay <- lm(Mention_Volume ~ IsMarchMadness + IsGameDay, data = VolumeAndVariables)
IsOnGameDaylog <- lm(log(Mention_Volume) ~ IsMarchMadness + IsGameDay, data = VolumeAndVariables)

# Linear regression day after game 
IsOnDayAfterGame <- lm(Mention_Volume ~ IsMarchMadness + IsDayAfterGame, data = VolumeAndVariables)
IsOnDayAfterGamelog <- lm(log(Mention_Volume) ~ IsMarchMadness + IsDayAfterGame, data = VolumeAndVariables)

# Linear regression for day of and day after game 
IsOnBothDays <- lm(Mention_Volume ~ IsMarchMadness + GDayOrAfter, data = VolumeAndVariables)
IsOnBothDayslog <- lm(log(Mention_Volume) ~ IsMarchMadness + GDayOrAfter, data = VolumeAndVariables)

stargazer(IsOnGameDay, IsOnDayAfterGame, IsOnBothDays, type = "text")


======================================================================
                                         Dependent variable:          
                                --------------------------------------
                                            Mention_Volume            
                                    (1)          (2)          (3)     
----------------------------------------------------------------------
IsMarchMadness                  1,246.834*** 2,140.523*** 1,569.368***
                                 (127.826)    (132.629)    (133.667)  
                                                                      
IsGameDay                       4,207.039***                          
                                 (226.273)                            
                                                                      
IsDayAfterGame                                  89.928                
                                              (234.776)               
                                                                      
GDayOrAfter                                               2,270.919***
                                                           (188.061)  
                                                                      
Constant                         720.538***   716.880***   628.169*** 
                                  (73.107)     (81.052)     (77.493)  
                                                                      
----------------------------------------------------------------------
Observations                       1,632        1,632        1,632    
R2                                 0.293        0.144        0.214    
Adjusted R2                        0.293        0.142        0.213    
Residual Std. Error (df = 1629)  2,321.087    2,555.411    2,448.297  
F Statistic (df = 2; 1629)       338.231***   136.519***   221.554*** 
======================================================================
Note:                                      *p<0.1; **p<0.05; ***p<0.01

Code

stargazer(IsOnGameDaylog, IsOnDayAfterGamelog, IsOnBothDayslog, type = "text")


================================================================
                                      Dependent variable:       
                                --------------------------------
                                      log(Mention_Volume)       
                                   (1)        (2)        (3)    
----------------------------------------------------------------
IsMarchMadness                   1.149***   1.399***   1.240*** 
                                 (0.063)    (0.061)    (0.063)  
                                                                
IsGameDay                        1.222***                       
                                 (0.111)                        
                                                                
IsDayAfterGame                               0.120              
                                            (0.108)             
                                                                
GDayOrAfter                                            0.673*** 
                                                       (0.089)  
                                                                
Constant                         5.951***   5.946***   5.924*** 
                                 (0.036)    (0.037)    (0.037)  
                                                                
----------------------------------------------------------------
Observations                      1,632      1,632      1,632   
R2                                0.306      0.255      0.279   
Adjusted R2                       0.305      0.254      0.278   
Residual Std. Error (df = 1629)   1.135      1.177      1.157   
F Statistic (df = 2; 1629)      359.055*** 278.132*** 315.725***
================================================================
Note:                                *p<0.1; **p<0.05; ***p<0.01

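Game day had the largest impact on mention volume of the three variables: IsGameDay was highly significant and its models had the highest R-squared (0.293 non-log, 0.306 log). IsDayAfterGame was not statistically significant in either model, and the combined GDayOrAfter variable was significant but fit less well than IsGameDay alone.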

Testing All of the Tournament-Related Variables Together

Code

summary(lm(Mention_Volume ~ IsMarchMadness + IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + IsEarlyEvening + IsLateAfternoon + IsLateEvening + IsMidAfternoon + IsEarlyAfternoon, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + IsGameDay + IsDayAfterGame + 
    IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + 
    Is15 + Is13 + Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + 
    Is3 + Is1 + Is0 + IsEarlyEvening + IsLateAfternoon + IsLateEvening + 
    IsMidAfternoon + IsEarlyAfternoon, data = VolumeAndVariables)

Residuals:
     Min       1Q   Median       3Q      Max 
-13943.8   -640.4   -289.6    406.4  17009.8 

Coefficients: (3 not defined because of singularities)
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         734.13      54.82  13.393  < 2e-16 ***
IsMarchMadness    23461.67     893.45  26.260  < 2e-16 ***
IsGameDay           -69.13     799.62  -0.086 0.931118    
IsDayAfterGame     -334.18     167.80  -1.991 0.046601 *  
IsFirst4         -22915.62     896.50 -25.561  < 2e-16 ***
IsRd64           -22398.66     907.72 -24.676  < 2e-16 ***
IsRd32           -21062.18     916.82 -22.973  < 2e-16 ***
IsRd16           -20214.68     924.32 -21.870  < 2e-16 ***
IsRd8            -17610.87     990.90 -17.773  < 2e-16 ***
IsRd4            -14514.47    1073.34 -13.523  < 2e-16 ***
IsChamp                 NA         NA      NA       NA    
Is15               5758.96     917.35   6.278 4.44e-10 ***
Is13               2041.10     930.93   2.193 0.028488 *  
Is11                464.91     941.22   0.494 0.621418    
Is9                2052.03     847.64   2.421 0.015597 *  
Is8                2063.71     893.39   2.310 0.021020 *  
Is7                2952.85     820.67   3.598 0.000331 ***
Is6                7191.86    1419.27   5.067 4.52e-07 ***
Is5                2101.36     816.62   2.573 0.010167 *  
Is4                6086.26     912.41   6.671 3.53e-11 ***
Is3                3613.55     798.24   4.527 6.44e-06 ***
Is1                2526.37     743.58   3.398 0.000697 ***
Is0                     NA         NA      NA       NA    
IsEarlyEvening      231.95     520.08   0.446 0.655661    
IsLateAfternoon    2898.75     593.53   4.884 1.15e-06 ***
IsLateEvening     -1135.84     534.53  -2.125 0.033749 *  
IsMidAfternoon     1481.34     630.17   2.351 0.018863 *  
IsEarlyAfternoon        NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1727 on 1559 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.6232,    Adjusted R-squared:  0.6174 
F-statistic: 107.4 on 24 and 1559 DF,  p-value: < 2.2e-16

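I’m going to use backward elimination to determine the final model, dropping the variables that were not statistically significant along with the three that could not be estimated because of singularities.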
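The elimination in this analysis is done by hand from the summary output. As a minimal sketch of the same idea in code, the function below repeatedly drops the term with the largest p-value until every remaining term is at or below a threshold. The function name `backward_eliminate`, the 0.05 cutoff, and the simulated data are assumptions for illustration only. Base R's `step()` is a related tool, but it selects on AIC rather than on p-values.

```r
# Simulated data (not the tournament data): y depends on x1 and x2 only
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Hypothetical helper: p-value-based backward elimination.
# Assumption: coefficient names match term names, which holds for numeric
# or 0/1 dummy predictors but not for factors such as Size.
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    keep <- rownames(coefs) != "(Intercept)"
    pvals <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])
    # Stop when nothing is left or every term meets the threshold
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    worst <- names(pvals)[which.max(pvals)]
    # Refit without the least significant term
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

fit_full <- lm(y ~ x1 + x2 + noise, data = sim)
fit_final <- backward_eliminate(fit_full)
print(formula(fit_final))
```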

Code

TournamentRelatedLR <- lm(Mention_Volume ~ IsMarchMadness + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + Is15 + Is13 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + IsLateAfternoon + IsLateEvening + IsMidAfternoon, data = VolumeAndVariables)
summary(TournamentRelatedLR)


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + IsDayAfterGame + 
    IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + Is15 + 
    Is13 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + IsLateAfternoon + 
    IsLateEvening + IsMidAfternoon, data = VolumeAndVariables)

Residuals:
     Min       1Q   Median       3Q      Max 
-13927.7   -640.4   -292.7    408.1  17008.2 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        734.18      54.77  13.404  < 2e-16 ***
IsMarchMadness   23445.48     892.06  26.282  < 2e-16 ***
IsDayAfterGame    -335.41     167.29  -2.005 0.045140 *  
IsFirst4        -22897.87     894.47 -25.599  < 2e-16 ***
IsRd64          -22341.78     903.97 -24.715  < 2e-16 ***
IsRd32          -21030.36     914.91 -22.986  < 2e-16 ***
IsRd16          -20197.16     923.07 -21.880  < 2e-16 ***
IsRd8           -17579.51     989.14 -17.772  < 2e-16 ***
IsRd4           -14517.70    1072.52 -13.536  < 2e-16 ***
Is15              5777.35     647.24   8.926  < 2e-16 ***
Is13              2093.96     686.04   3.052 0.002310 ** 
Is9               2032.30     616.83   3.295 0.001007 ** 
Is8               2086.84     575.70   3.625 0.000298 ***
Is7               3004.99     498.30   6.031 2.04e-09 ***
Is6               7239.14    1318.33   5.491 4.65e-08 ***
Is5               2149.85     531.60   4.044 5.51e-05 ***
Is4               6187.63     672.38   9.203  < 2e-16 ***
Is3               3635.31     475.06   7.652 3.43e-14 ***
Is1               2570.99     416.45   6.174 8.49e-10 ***
IsLateAfternoon   2774.03     423.35   6.553 7.66e-11 ***
IsLateEvening    -1217.31     376.70  -3.232 0.001257 ** 
IsMidAfternoon    1342.85     503.56   2.667 0.007739 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1726 on 1562 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.623, Adjusted R-squared:  0.6179 
F-statistic: 122.9 on 21 and 1562 DF,  p-value: < 2.2e-16

The remaining variables are all considered statistically significant in the model for tournament-related variables.

Testing with log(Mention_Volume)

Code

TournamentRelatedLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness + IsGameDay + IsFirst4 + IsRd64 + IsRd32 + IsRd16, data = VolumeAndVariables)
summary(TournamentRelatedLRlog)


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsGameDay + 
    IsFirst4 + IsRd64 + IsRd32 + IsRd16, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5531 -0.8271  0.0958  0.8731  3.0197 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.95095    0.03487 170.671  < 2e-16 ***
IsMarchMadness  2.73125    0.21830  12.511  < 2e-16 ***
IsGameDay       0.92348    0.12063   7.656 3.33e-14 ***
IsFirst4       -1.88781    0.22226  -8.494  < 2e-16 ***
IsRd64         -1.58017    0.23238  -6.800 1.48e-11 ***
IsRd32         -1.06859    0.25091  -4.259 2.18e-05 ***
IsRd16         -0.69331    0.27009  -2.567   0.0103 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.107 on 1577 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.3443,    Adjusted R-squared:  0.3419 
F-statistic:   138 on 6 and 1577 DF,  p-value: < 2.2e-16

There are far fewer variables that are considered statistically significant when using the log of Mention_Volume compared to the non-transformed version. In the log-transformed model, there are five significant variables other than IsMarchMadness; in the original model there are 20.

IsGameDay was statistically significant in the log model; it was not in the non-transformed model.

IsDayAfterGame, IsRd8, IsRd4, Is15, Is13, Is9, Is8, Is7, Is6, Is5, Is4, Is3, Is1, IsLateAfternoon, IsLateEvening, and IsMidAfternoon were statistically significant in the non-log transformed model but not the log model.

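The non-log-transformed model has a much higher adjusted R-squared (0.6179) than the log-transformed model (0.3419), although the two figures are computed on different response scales, so the comparison is only a rough guide.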

Model 4 - School-Related Variables

Size of school (6 variables - LargeHighRez, LargePriRez, LargePriNonRez, MedHighRez, MedPriRez, SmallHighRez)
Major / mid-major (1 - Yes, 0 - No)

School Size

School size follows the size and setting categories defined by the Carnegie Classification of Institutions of Higher Education, which places the schools in this dataset into six categories: Large (Highly Residential), Large (Primarily Residential), Large (Primarily Non-Residential), Medium (Highly Residential), Medium (Primarily Residential), and Small (Highly Residential).

Code

# Linear regression for school size and setting
summary(lm(Mention_Volume ~ IsMarchMadness + SizeSetting, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + SizeSetting, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -3128   -896   -255    293  36654 

Coefficients:
                                                      Estimate Std. Error
(Intercept)                                             1354.8      202.1
IsMarchMadness                                          2105.6      128.4
SizeSettingFour-year, large, primarily nonresidential   -388.8      234.5
SizeSettingFour-year, large, primarily residential      -392.2      216.1
SizeSettingFour-year, medium, highly residential       -1420.4      265.1
SizeSettingFour-year, medium, primarily residential    -1462.5      274.5
SizeSettingFour-year, small, highly residential        -1579.0      411.2
                                                      t value Pr(>|t|)    
(Intercept)                                             6.702 2.82e-11 ***
IsMarchMadness                                         16.394  < 2e-16 ***
SizeSettingFour-year, large, primarily nonresidential  -1.658 0.097465 .  
SizeSettingFour-year, large, primarily residential     -1.815 0.069633 .  
SizeSettingFour-year, medium, highly residential       -5.357 9.67e-08 ***
SizeSettingFour-year, medium, primarily residential    -5.328 1.13e-07 ***
SizeSettingFour-year, small, highly residential        -3.840 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2509 on 1625 degrees of freedom
Multiple R-squared:  0.1761,    Adjusted R-squared:  0.1731 
F-statistic:  57.9 on 6 and 1625 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + SizeSetting, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + SizeSetting, 
    data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0525 -0.7147  0.0280  0.7061  3.6527 

Coefficients:
                                                        Estimate Std. Error
(Intercept)                                            6.4197206  0.0810504
IsMarchMadness                                         1.3742651  0.0514978
SizeSettingFour-year, large, primarily nonresidential -0.4102806  0.0940232
SizeSettingFour-year, large, primarily residential     0.0001005  0.0866304
SizeSettingFour-year, medium, highly residential      -1.2290024  0.1063152
SizeSettingFour-year, medium, primarily residential   -1.5743925  0.1100526
SizeSettingFour-year, small, highly residential       -1.9468254  0.1648861
                                                      t value Pr(>|t|)    
(Intercept)                                            79.207  < 2e-16 ***
IsMarchMadness                                         26.686  < 2e-16 ***
SizeSettingFour-year, large, primarily nonresidential  -4.364 1.36e-05 ***
SizeSettingFour-year, large, primarily residential      0.001    0.999    
SizeSettingFour-year, medium, highly residential      -11.560  < 2e-16 ***
SizeSettingFour-year, medium, primarily residential   -14.306  < 2e-16 ***
SizeSettingFour-year, small, highly residential       -11.807  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.006 on 1625 degrees of freedom
Multiple R-squared:  0.4561,    Adjusted R-squared:  0.4541 
F-statistic: 227.1 on 6 and 1625 DF,  p-value: < 2.2e-16

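SizeSetting overall had a statistically significant impact. Relative to the reference category (large, highly residential), the other two large categories did not differ significantly at the 0.05 level in the non-log model, although large, primarily nonresidential schools did in the log model. The medium and small categories had a statistically significant impact in both models.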

Accordingly, I’m curious what the impact would be with size broken down only as small, medium, and large, and with size as a binary variable for IsLarge (Yes = 1, No = 0).

Code

# Reduce SizeSetting to just Size
VolumeAndVariables <- VolumeAndVariables %>%
  mutate(
    Size = case_when(
      SizeSetting == "Four-year, large, primarily residential" ~ "Large",
      SizeSetting == "Four-year, large, primarily nonresidential" ~ "Large",
      SizeSetting == "Four-year, large, highly residential" ~ "Large",
      SizeSetting == "Four-year, medium, primarily residential" ~ "Medium",
      SizeSetting == "Four-year, medium, highly residential" ~ "Medium",
      SizeSetting == "Four-year, small, highly residential" ~ "Small"),
    IsLarge = ifelse(Size == "Large", 1, 0))

# Linear regression for small, medium, large
summary(lm(Mention_Volume ~ IsMarchMadness + Size, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + Size, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -2956   -911   -269    286  36984 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1012.80      87.48  11.577  < 2e-16 ***
IsMarchMadness  2117.25     127.95  16.548  < 2e-16 ***
SizeMedium     -1102.32     150.52  -7.323 3.79e-13 ***
SizeSmall      -1240.40     369.52  -3.357 0.000807 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2510 on 1628 degrees of freedom
Multiple R-squared:  0.1743,    Adjusted R-squared:  0.1728 
F-statistic: 114.6 on 3 and 1628 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + Size, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + Size, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9337 -0.7169  0.0555  0.7192  3.4716 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.30102    0.03561  176.97   <2e-16 ***
IsMarchMadness  1.36937    0.05207   26.30   <2e-16 ***
SizeMedium     -1.26971    0.06126  -20.73   <2e-16 ***
SizeSmall      -1.82670    0.15039  -12.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.021 on 1628 degrees of freedom
Multiple R-squared:  0.4384,    Adjusted R-squared:  0.4374 
F-statistic: 423.7 on 3 and 1628 DF,  p-value: < 2.2e-16

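These results were statistically significant, but the adjusted R-squared values from the SizeSetting models suggest a slightly better fit (0.1731 vs. 0.1728 for the non-log models and 0.4541 vs. 0.4374 for the log models).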

Code

# Linear regression for IsLarge
summary(lm(Mention_Volume ~ IsMarchMadness + IsLarge, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + IsLarge, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -2957   -910   -269    271  36983 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -106.1      132.3  -0.802    0.422    
IsMarchMadness   2118.3      127.9  16.566  < 2e-16 ***
IsLarge          1118.5      143.5   7.794 1.15e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2509 on 1629 degrees of freedom
Multiple R-squared:  0.1743,    Adjusted R-squared:  0.1733 
F-statistic: 171.9 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + IsLarge, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsLarge, 
    data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9320 -0.7301  0.0622  0.7256  3.5342 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.96422    0.05404   91.86   <2e-16 ***
IsMarchMadness  1.37379    0.05224   26.30   <2e-16 ***
IsLarge         1.33507    0.05863   22.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.025 on 1629 degrees of freedom
Multiple R-squared:  0.4341,    Adjusted R-squared:  0.4334 
F-statistic: 624.8 on 2 and 1629 DF,  p-value: < 2.2e-16

IsLarge was statistically significant, and the adjusted R-squared values were effectively the same as for the three-level Size models (0.1733 vs. 0.1728 for the non-log models and 0.4334 vs. 0.4374 for the log models).

Status as Major or Mid-Major

While it is not an official designation by the NCAA, teams that play in particular NCAA Division I conferences (ACC, AAC, Big East, Big Ten, Big 12, Pac-12, and SEC) are often referred to as “high major” programs, while teams that play in any other conference are referred to as “mid-major”⁸.

Code

# Linear regression for whether or not team is a high major
summary(lm(Mention_Volume ~ IsMarchMadness + Major, data = VolumeAndVariables))


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + Major, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -3018   -919   -124    129  36769 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)       176.3      101.6   1.735   0.0829 .  
IsMarchMadness   2115.7      127.5  16.600   <2e-16 ***
MajorYes         1053.1      124.1   8.486   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2501 on 1629 degrees of freedom
Multiple R-squared:  0.1797,    Adjusted R-squared:  0.1787 
F-statistic: 178.5 on 2 and 1629 DF,  p-value: < 2.2e-16

Code

summary(lm(log(Mention_Volume) ~ IsMarchMadness + Major, data = VolumeAndVariables))


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + Major, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8262 -0.6724 -0.0451  0.5702  3.6218 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.22406    0.03837  136.17   <2e-16 ***
IsMarchMadness  1.36575    0.04814   28.37   <2e-16 ***
MajorYes        1.40635    0.04687   30.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9445 on 1629 degrees of freedom
Multiple R-squared:  0.5196,    Adjusted R-squared:  0.519 
F-statistic: 880.8 on 2 and 1629 DF,  p-value: < 2.2e-16

Being considered a high major school had a statistically significant impact on mention volume.
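Coefficients from a log-transformed model are easier to read once they are exponentiated, because a coefficient b on a 0/1 variable corresponds to multiplying the typical (geometric mean) mention volume by exp(b). The sketch below applies that to the two coefficients printed in the log model above; the object names are illustrative.

```r
# Coefficients copied from the log(Mention_Volume) ~ IsMarchMadness + Major output above
log_coefs <- c(IsMarchMadness = 1.36575, MajorYes = 1.40635)

# Multiplicative effect on typical mention volume when the variable goes from 0 to 1
multiplier <- exp(log_coefs)
# The same effect expressed as a percentage change
pct_change <- 100 * (multiplier - 1)

print(round(cbind(multiplier, pct_change), 2))
```

Both multipliers come out at roughly four, meaning that, holding the other variable fixed, a high major school is associated with about four times the typical mention volume of a mid-major school in this model.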

Testing the School-Related Variables Together

Code

SchoolRelatedLR <- lm(Mention_Volume ~ IsMarchMadness + Size + Major, data = VolumeAndVariables)
summary(SchoolRelatedLR)


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + Size + Major, 
    data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -2936   -935   -293    289  36737 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)       508.4      127.0   4.003 6.54e-05 ***
IsMarchMadness   2104.9      126.9  16.593  < 2e-16 ***
SizeMedium       -695.3      167.0  -4.165 3.28e-05 ***
SizeSmall        -732.4      378.1  -1.937   0.0529 .  
MajorYes          763.8      140.5   5.435 6.31e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2488 on 1627 degrees of freedom
Multiple R-squared:  0.1891,    Adjusted R-squared:  0.1871 
F-statistic: 94.82 on 4 and 1627 DF,  p-value: < 2.2e-16

Testing log(Mention_Volume) for the school-related variables

Code

SchoolRelatedLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness + Size + Major, data = VolumeAndVariables)
summary(SchoolRelatedLRlog)


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + Size + Major, 
    data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2523 -0.6714 -0.0486  0.5529  3.6293 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.57580    0.04599 121.237  < 2e-16 ***
IsMarchMadness  1.35167    0.04593  29.426  < 2e-16 ***
SizeMedium     -0.68447    0.06045 -11.322  < 2e-16 ***
SizeSmall      -1.09632    0.13689  -8.009 2.19e-15 ***
MajorYes        1.09822    0.05088  21.583  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9009 on 1627 degrees of freedom
Multiple R-squared:  0.5634,    Adjusted R-squared:  0.5624 
F-statistic:   525 on 4 and 1627 DF,  p-value: < 2.2e-16

Both models had the same overall p-value (< 2.2e-16), but the log-transformed model has an adjusted R-squared of 0.5624, compared with 0.1871 for the non-transformed model.

When it comes to modeling school-related variables, the model with log-transformed mention volume was a better fit.
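Because the two models have different response variables (raw counts versus logs), their R-squared values are measured against different total variances. One possible cross-check, sketched below on simulated data, is to back-transform the log model's fitted values and compare both sets of predictions with the observed values on the original scale. The data, the variable names, and the use of a squared correlation as the yardstick are all assumptions for illustration; this is not a step taken in the analysis above.

```r
# Simulated right-skewed response (not the mention volume data)
set.seed(1)
n <- 300
sim <- data.frame(x = runif(n))
sim$volume <- exp(5 + 1.5 * sim$x + rnorm(n, sd = 0.8))

raw_fit <- lm(volume ~ x, data = sim)
log_fit <- lm(log(volume) ~ x, data = sim)

# Squared correlation between observed values and predictions,
# both expressed on the original (volume) scale
r2_on_volume_scale <- function(observed, predicted) cor(observed, predicted)^2

r2_raw <- r2_on_volume_scale(sim$volume, fitted(raw_fit))
# exp() of the log-scale fit estimates a typical (median-like) volume, not the mean
r2_log <- r2_on_volume_scale(sim$volume, exp(fitted(log_fit)))

print(c(raw = r2_raw, log_backtransformed = r2_log))
```

For the raw model this quantity equals the usual R-squared, so the two numbers printed are at least measured against the same response.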

Model 5 - All Variables

I will now run a model with every variable from all four previous models.
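With this many predictors, the formula can also be assembled from character vectors rather than typed out, which makes it easier to reuse the same terms for the log and non-log versions. The sketch below uses base R's `reformulate()` with a subset of the variable names; the grouping vectors are illustrative names, not objects from the analysis.

```r
# A subset of the predictors, grouped roughly as in the earlier models
game_terms   <- c("IsWinner", "UpsetWin", "UpsetLoss")
timing_terms <- c("IsGameDay", "IsDayAfterGame")
school_terms <- c("Size", "Major")

all_terms <- c("IsMarchMadness", game_terms, timing_terms, school_terms)

# Same right-hand side, two different responses
raw_formula <- reformulate(all_terms, response = "Mention_Volume")
log_formula <- reformulate(all_terms, response = quote(log(Mention_Volume)))

print(raw_formula)
print(log_formula)
```

Either formula could then be passed to `lm()`, for example `lm(raw_formula, data = VolumeAndVariables)`.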

Code

AllVariablesLR <- lm(Mention_Volume ~ IsMarchMadness + IsWinner + UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss + IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + IsEarlyEvening + IsLateAfternoon + IsLateEvening + IsMidAfternoon + IsEarlyAfternoon + Size + Major, data = VolumeAndVariables)
summary(AllVariablesLR)


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + IsWinner + UpsetWin + 
    UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss + 
    IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + 
    IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + Is11 + Is9 + 
    Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + IsEarlyEvening + 
    IsLateAfternoon + IsLateEvening + IsMidAfternoon + IsEarlyAfternoon + 
    Size + Major, data = VolumeAndVariables)

Residuals:
     Min       1Q   Median       3Q      Max 
-13366.5   -538.1   -148.0    256.2  16412.3 

Coefficients: (4 not defined because of singularities)
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         424.49      78.70   5.393 7.98e-08 ***
IsMarchMadness    23193.97     786.22  29.501  < 2e-16 ***
IsWinnerYes        1048.41    1065.28   0.984 0.325188    
UpsetWinYes        7102.00     839.81   8.457  < 2e-16 ***
UpsetLossYes       3839.19     838.70   4.578 5.08e-06 ***
FavoriteWinYes     2504.38     950.12   2.636 0.008476 ** 
FavoriteLossYes     968.62     947.51   1.022 0.306805    
UnderdogWinYes     3029.86     878.18   3.450 0.000575 ***
UnderdogLossYes    1107.89     873.47   1.268 0.204854    
IsGameDay          -100.41     880.61  -0.114 0.909234    
IsDayAfterGame     -265.74     147.21  -1.805 0.071240 .  
IsFirst4         -22583.09     788.85 -28.628  < 2e-16 ***
IsRd64           -22352.73     799.16 -27.970  < 2e-16 ***
IsRd32           -21261.16     806.37 -26.366  < 2e-16 ***
IsRd16           -19989.55     815.03 -24.526  < 2e-16 ***
IsRd8            -17565.23     876.73 -20.035  < 2e-16 ***
IsRd4            -14329.90     936.47 -15.302  < 2e-16 ***
IsChamp                 NA         NA      NA       NA    
Is15               1942.69     668.07   2.908 0.003690 ** 
Is13              -1818.91     677.58  -2.684 0.007343 ** 
Is11              -1742.82     678.82  -2.567 0.010338 *  
Is9               -1348.84     622.47  -2.167 0.030392 *  
Is8               -1126.76     636.16  -1.771 0.076723 .  
Is7                 -84.35     560.00  -0.151 0.880293    
Is6                 832.81    1246.77   0.668 0.504248    
Is5               -2904.69     588.19  -4.938 8.73e-07 ***
Is4                3963.33     644.11   6.153 9.65e-10 ***
Is3                1092.60     494.37   2.210 0.027245 *  
Is1                     NA         NA      NA       NA    
Is0                     NA         NA      NA       NA    
IsEarlyEvening      646.44     470.42   1.374 0.169591    
IsLateAfternoon    1999.72     530.69   3.768 0.000171 ***
IsLateEvening      -858.93     472.92  -1.816 0.069529 .  
IsMidAfternoon     1470.80     550.89   2.670 0.007668 ** 
IsEarlyAfternoon        NA         NA      NA       NA    
SizeMedium         -505.87     103.73  -4.877 1.19e-06 ***
SizeSmall          -402.88     230.77  -1.746 0.081044 .  
MajorYes            842.34      87.83   9.590  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1506 on 1550 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.715, Adjusted R-squared:  0.7089 
F-statistic: 117.8 on 33 and 1550 DF,  p-value: < 2.2e-16

Code

AllVariablesLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness + IsWinner + UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss + IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + IsEarlyEvening + IsLateAfternoon + IsLateEvening + IsMidAfternoon + IsEarlyAfternoon + Size + Major, data = VolumeAndVariables)
summary(AllVariablesLRlog)


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsWinner + 
    UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + 
    UnderdogLoss + IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + 
    IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + 
    Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + 
    IsEarlyEvening + IsLateAfternoon + IsLateEvening + IsMidAfternoon + 
    IsEarlyAfternoon + Size + Major, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2475 -0.5444 -0.0327  0.4991  3.1885 

Coefficients: (4 not defined because of singularities)
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.548279   0.041446 133.867  < 2e-16 ***
IsMarchMadness    3.667101   0.414029   8.857  < 2e-16 ***
IsWinnerYes       0.128204   0.560985   0.229 0.819262    
UpsetWinYes       1.437542   0.442252   3.251 0.001177 ** 
UpsetLossYes      0.489885   0.441667   1.109 0.267527    
FavoriteWinYes   -0.267184   0.500339  -0.534 0.593414    
FavoriteLossYes  -0.578416   0.498964  -1.159 0.246540    
UnderdogWinYes   -0.289704   0.462454  -0.626 0.531112    
UnderdogLossYes  -0.332252   0.459974  -0.722 0.470203    
IsGameDay         1.239874   0.463736   2.674 0.007582 ** 
IsDayAfterGame   -0.132229   0.077523  -1.706 0.088269 .  
IsFirst4         -2.826484   0.415412  -6.804 1.45e-11 ***
IsRd64           -2.631728   0.420845  -6.253 5.18e-10 ***
IsRd32           -2.232335   0.424641  -5.257 1.67e-07 ***
IsRd16           -1.558388   0.429201  -3.631 0.000292 ***
IsRd8            -1.404315   0.461691  -3.042 0.002392 ** 
IsRd4            -0.896345   0.493151  -1.818 0.069320 .  
IsChamp                 NA         NA      NA       NA    
Is15              0.749508   0.351812   2.130 0.033294 *  
Is13              0.101789   0.356816   0.285 0.775475    
Is11              0.092589   0.357469   0.259 0.795661    
Is9               0.286940   0.327796   0.875 0.381512    
Is8               0.191323   0.335005   0.571 0.568011    
Is7               0.426904   0.294898   1.448 0.147923    
Is6               0.244996   0.656558   0.373 0.709086    
Is5              -0.688168   0.309745  -2.222 0.026446 *  
Is4               0.268176   0.339194   0.791 0.429283    
Is3               0.025999   0.260339   0.100 0.920462    
Is1                     NA         NA      NA       NA    
Is0                     NA         NA      NA       NA    
IsEarlyEvening   -0.142239   0.247729  -0.574 0.565934    
IsLateAfternoon  -0.203041   0.279464  -0.727 0.467620    
IsLateEvening    -0.585250   0.249044  -2.350 0.018899 *  
IsMidAfternoon    0.006344   0.290104   0.022 0.982556    
IsEarlyAfternoon        NA         NA      NA       NA    
SizeMedium       -0.661744   0.054622 -12.115  < 2e-16 ***
SizeSmall        -1.033428   0.121527  -8.504  < 2e-16 ***
MajorYes          1.147779   0.046254  24.815  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7932 on 1550 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.6692,    Adjusted R-squared:  0.6621 
F-statistic:    95 on 33 and 1550 DF,  p-value: < 2.2e-16

In this version of the model, the overall F-test p-value is again the same (< 2.2e-16) for the original mention volume variable and the log-transformed one; however, the adjusted r squared for the original model is higher (0.7089 vs. 0.6621).

Model 6 - Only Significant Variables

I now want to create a model with statistically significant variables from the four models that I created. I am going to use backward elimination to create this model. I am not going to show every elimination step, but will include the final model containing only significant variables.

Code

# Use backward elimination to create model of significant variables
SigVariablesLR <- lm(Mention_Volume ~ IsMarchMadness + UpsetWin + UpsetLoss + FavoriteWin + UnderdogWin + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + Is15 + Is13 + Is9 + Is5 + Is4 + Is3 + IsEarlyEvening + IsLateAfternoon + IsMidAfternoon + Major, data = VolumeAndVariables)
summary(SigVariablesLR)


Call:
lm(formula = Mention_Volume ~ IsMarchMadness + UpsetWin + UpsetLoss + 
    FavoriteWin + UnderdogWin + IsFirst4 + IsRd64 + IsRd32 + 
    IsRd16 + IsRd8 + IsRd4 + Is15 + Is13 + Is9 + Is5 + Is4 + 
    Is3 + IsEarlyEvening + IsLateAfternoon + IsMidAfternoon + 
    Major, data = VolumeAndVariables)

Residuals:
     Min       1Q   Median       3Q      Max 
-13319.3   -526.5    -92.7    206.5  16456.0 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        174.71      62.27   2.806 0.005082 ** 
IsMarchMadness   23396.59     764.40  30.608  < 2e-16 ***
UpsetWinYes       7154.29     768.31   9.312  < 2e-16 ***
UpsetLossYes      3937.23     584.18   6.740 2.22e-11 ***
FavoriteWinYes    2611.99     531.59   4.914 9.88e-07 ***
UnderdogWinYes    3028.71     283.61  10.679  < 2e-16 ***
IsFirst4        -22793.33     768.04 -29.677  < 2e-16 ***
IsRd64          -22658.56     775.93 -29.202  < 2e-16 ***
IsRd32          -21514.15     786.95 -27.339  < 2e-16 ***
IsRd16          -20233.70     805.01 -25.135  < 2e-16 ***
IsRd8           -17617.28     859.38 -20.500  < 2e-16 ***
IsRd4           -14420.43     941.36 -15.319  < 2e-16 ***
Is15              2351.40     575.09   4.089 4.56e-05 ***
Is13             -1617.32     599.77  -2.697 0.007081 ** 
Is9              -1147.69     509.99  -2.250 0.024561 *  
Is5              -2692.84     476.91  -5.646 1.94e-08 ***
Is4               4492.86     608.05   7.389 2.40e-13 ***
Is3               1499.10     411.92   3.639 0.000282 ***
IsEarlyEvening    1124.35     312.27   3.601 0.000327 ***
IsLateAfternoon   2558.36     377.14   6.784 1.66e-11 ***
IsMidAfternoon    2123.85     426.94   4.975 7.26e-07 ***
MajorYes          1056.04      77.09  13.699  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1519 on 1562 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.7078,    Adjusted R-squared:  0.7038 
F-statistic: 180.2 on 21 and 1562 DF,  p-value: < 2.2e-16

And then for the log-transformed version of mention volume:

Code

# Use backward elimination to create model of significant variables
SigVariablesLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness + IsWinner + UpsetWin + FavoriteLoss + UnderdogLoss + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + Is15 + Is5 + Size + Major, data = VolumeAndVariables)
summary(SigVariablesLRlog)


Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsWinner + 
    UpsetWin + FavoriteLoss + UnderdogLoss + IsFirst4 + IsRd64 + 
    IsRd32 + IsRd16 + IsRd8 + Is15 + Is5 + Size + Major, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2516 -0.5599 -0.0313  0.5010  3.1930 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.54377    0.04121 134.521  < 2e-16 ***
IsMarchMadness   2.92884    0.23678  12.369  < 2e-16 ***
IsWinnerYes      0.97977    0.12242   8.003 2.34e-15 ***
UpsetWinYes      1.56557    0.29417   5.322 1.17e-07 ***
FavoriteLossYes  0.75590    0.19610   3.855 0.000121 ***
UnderdogLossYes  0.81726    0.13380   6.108 1.27e-09 ***
IsFirst4        -2.08275    0.23861  -8.729  < 2e-16 ***
IsRd64          -1.90091    0.24268  -7.833 8.71e-15 ***
IsRd32          -1.49297    0.25236  -5.916 4.04e-09 ***
IsRd16          -0.87063    0.26330  -3.307 0.000965 ***
IsRd8           -0.66700    0.30565  -2.182 0.029239 *  
Is15             0.62959    0.29513   2.133 0.033055 *  
Is5             -0.81295    0.23441  -3.468 0.000539 ***
SizeMedium      -0.65313    0.05435 -12.017  < 2e-16 ***
SizeSmall       -1.03135    0.12132  -8.501  < 2e-16 ***
MajorYes         1.14217    0.04608  24.784  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7954 on 1568 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.6635,    Adjusted R-squared:  0.6602 
F-statistic: 206.1 on 15 and 1568 DF,  p-value: < 2.2e-16

Excluding IsMarchMadness, the non-log transformed model had 20 statistically significant variables, while the log-transformed model had 13.

The non-log transformed model differed from the log-transformed model with the inclusion of the UpsetLoss, FavoriteWin, UnderdogWin, IsRd4, Is13, Is9, Is4, Is3, IsEarlyEvening, IsLateAfternoon, and IsMidAfternoon variables.

The log transformed model differed from the non-log transformed model with the inclusion of the IsWinner, FavoriteLoss, UnderdogLoss, and Size variables.

The two models again had the same overall p-value (< 2.2e-16), but the non-log transformed model again had the larger adjusted r squared (0.7038 vs. 0.6602).
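
The individual elimination steps are omitted above. As a rough illustration only, and not necessarily the exact procedure used to arrive at SigVariablesLR and SigVariablesLRlog, p-value-based backward elimination can be sketched as a loop that repeatedly drops the least significant term and refits. The helper name backward_eliminate, the 0.05 cutoff, and the simulated toy data are all assumptions made for this example.

```r
# Hypothetical helper: p-value-based backward elimination for an lm fit.
# drop1() tests each term as a whole, so a factor such as Size would be
# kept or dropped as one unit rather than level by level.
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    tests    <- drop1(fit, test = "F")
    pvals    <- tests[["Pr(>F)"]][-1]   # first row is "<none>"
    terms_in <- rownames(tests)[-1]
    # Stop once every remaining term is significant at the chosen cutoff
    if (length(pvals) == 0 || all(pvals < alpha, na.rm = TRUE)) break
    worst <- terms_in[which.max(pvals)]  # least significant term
    fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

# Simulated data (not the project data): x1 and x2 drive y, noise does not
set.seed(1)
n   <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 5 + 3 * toy$x1 - 2 * toy$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise, data = toy)
reduced <- backward_eliminate(full)
formula(reduced)
```

Because each refit changes the p-values of the remaining terms, only one term is removed per pass. The final variable set can also differ between the raw and log-transformed responses, as it does in the two models above.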

Comparing the Models Using Adjusted r-squared, AIC, and BIC

I have compiled a final summary of all six non-log transformed models, and a summary for the six log-transformed models.

Code

models <- list(IsMarchMadnessLR, GameOutcomeLR, TournamentRelatedLR, SchoolRelatedLR, AllVariablesLR, SigVariablesLR)
stargazer(models, 
          title = "Linear Regression Models", 
          align = TRUE, 
          single.row = TRUE,
          type = "text", 
          font.size = "small",
          add.lines = list(c("AIC", 
                             round(AIC(IsMarchMadnessLR), 2), 
                             round(AIC(GameOutcomeLR), 2), 
                             round(AIC(TournamentRelatedLR), 2), 
                             round(AIC(SchoolRelatedLR), 2), 
                             round(AIC(AllVariablesLR), 2),
                             round(AIC(SigVariablesLR), 2)), 
                           c("BIC", 
                             round(BIC(IsMarchMadnessLR), 2), 
                             round(BIC(GameOutcomeLR), 2), 
                             round(BIC(TournamentRelatedLR), 2), 
                             round(BIC(SchoolRelatedLR), 2), 
                             round(BIC(AllVariablesLR), 2),
                             round(BIC(SigVariablesLR), 2))))


Linear Regression Models
=================================================================================================================================================================================
                                                                                         Dependent variable:                                                                     
                    -------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                           Mention_Volume                                                                        
                               (1)                       (2)                       (3)                       (4)                       (5)                        (6)            
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IsMarchMadness       2,150.268*** (130.132)    1,248.844*** (120.370)    23,445.480*** (892.060)    2,104.942*** (126.861)   23,193.970*** (786.221)    23,396.590*** (764.397)  
IsWinnerYes                                                                                                                   1,048.415 (1,065.283)                              
UpsetWinYes                                   4,381.744*** (1,009.731)                                                        7,102.001*** (839.815)     7,154.288*** (768.309)  
UpsetLossYes                                                                                                                  3,839.192*** (838.703)     3,937.226*** (584.177)  
FavoriteWinYes                                 6,681.318*** (701.888)                                                         2,504.379*** (950.118)     2,611.989*** (531.591)  
FavoriteLossYes                                4,532.355*** (513.694)                                                           968.625 (947.508)                                
UnderdogWinYes                                 5,265.937*** (345.628)                                                         3,029.861*** (878.177)     3,028.715*** (283.613)  
UnderdogLossYes                                1,779.368*** (345.628)                                                          1,107.887 (873.468)                               
IsGameDay                                                                                                                       -100.411 (880.612)                               
IsDayAfterGame                                                             -335.408** (167.289)                                -265.745* (147.212)                               
IsFirst4                                                                 -22,897.870*** (894.474)                            -22,583.090*** (788.846)   -22,793.330*** (768.044) 
IsRd64                                                                   -22,341.780*** (903.967)                            -22,352.730*** (799.164)   -22,658.560*** (775.934) 
IsRd32                                                                   -21,030.360*** (914.912)                            -21,261.160*** (806.371)   -21,514.150*** (786.946) 
IsRd16                                                                   -20,197.160*** (923.072)                            -19,989.550*** (815.030)   -20,233.700*** (805.009) 
IsRd8                                                                    -17,579.510*** (989.145)                            -17,565.230*** (876.728)   -17,617.280*** (859.384) 
IsRd4                                                                   -14,517.690*** (1,072.525)                           -14,329.900*** (936.470)   -14,420.430*** (941.364) 
IsChamp                                                                                                                                                                          
Is15                                                                      5,777.345*** (647.241)                              1,942.691*** (668.073)     2,351.403*** (575.086)  
Is13                                                                      2,093.956*** (686.043)                             -1,818.909*** (677.575)    -1,617.316*** (599.769)  
Is11                                                                                                                          -1,742.822** (678.815)                             
Is9                                                                       2,032.296*** (616.835)                              -1,348.843** (622.467)     -1,147.685** (509.988)  
Is8                                                                       2,086.841*** (575.695)                              -1,126.764* (636.157)                              
Is7                                                                       3,004.990*** (498.296)                                -84.348 (559.997)                                
Is6                                                                      7,239.143*** (1,318.326)                              832.813 (1,246.771)                               
Is5                                                                       2,149.854*** (531.601)                             -2,904.689*** (588.191)    -2,692.837*** (476.905)  
Is4                                                                       6,187.630*** (672.376)                              3,963.332*** (644.113)     4,492.858*** (608.050)  
Is3                                                                       3,635.314*** (475.062)                              1,092.598** (494.371)      1,499.101*** (411.917)  
Is1                                                                       2,570.985*** (416.446)                                                                                 
Is0                                                                                                                                                                              
IsEarlyEvening                                                                                                                  646.438 (470.425)        1,124.348*** (312.271)  
IsLateAfternoon                                                           2,774.030*** (423.352)                              1,999.720*** (530.688)     2,558.360*** (377.143)  
IsLateEvening                                                            -1,217.310*** (376.697)                               -858.932* (472.921)                               
IsMidAfternoon                                                            1,342.854*** (503.565)                              1,470.798*** (550.893)     2,123.854*** (426.935)  
IsEarlyAfternoon                                                                                                                                                                 
SizeMedium                                                                                          -695.316*** (166.956)     -505.874*** (103.725)                              
SizeSmall                                                                                            -732.447* (378.056)       -402.884* (230.774)                               
MajorYes                                                                                             763.765*** (140.525)      842.340*** (87.834)       1,056.045*** (77.088)   
Constant               720.538*** (80.467)       720.538*** (69.218)       734.180*** (54.774)       508.443*** (127.016)      424.485*** (78.704)        174.705*** (62.267)    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AIC                          30243.8                  29757.29                   28131.13                  30160.56                  27712.21                   27727.59         
BIC                         30259.99                  29800.47                   28254.59                  30192.94                  27900.08                   27851.04         
Observations                  1,632                     1,632                     1,584                     1,632                     1,584                      1,584           
R2                            0.143                     0.368                     0.623                     0.189                     0.715                      0.708           
Adjusted R2                   0.143                     0.366                     0.618                     0.187                     0.709                      0.704           
Residual Std. Error   2,554.742 (df = 1630)     2,197.609 (df = 1625)     1,725.566 (df = 1562)     2,488.128 (df = 1627)     1,506.217 (df = 1550)      1,519.184 (df = 1562)   
F Statistic         273.034*** (df = 1; 1630) 157.803*** (df = 6; 1625) 122.909*** (df = 21; 1562) 94.824*** (df = 4; 1627) 117.808*** (df = 33; 1550) 180.154*** (df = 21; 1562)
=================================================================================================================================================================================
Note:                                                                                                                                                 *p<0.1; **p<0.05; ***p<0.01

Code

logmodels <- list(IsMarchMadnessLRlog, GameOutcomeLRlog, TournamentRelatedLRlog, SchoolRelatedLRlog, AllVariablesLRlog, SigVariablesLRlog)
stargazer(logmodels, 
          title = "Linear Regression Models - log", 
          align = TRUE, 
          single.row = TRUE,
          type = "text", 
          font.size = "small",
          add.lines = list(c("AIC", 
                             round(AIC(IsMarchMadnessLRlog), 2), 
                             round(AIC(GameOutcomeLRlog), 2), 
                             round(AIC(TournamentRelatedLRlog), 2), 
                             round(AIC(SchoolRelatedLRlog), 2), 
                             round(AIC(AllVariablesLRlog), 2),
                             round(AIC(SigVariablesLRlog), 2)), 
                           c("BIC", 
                             round(BIC(IsMarchMadnessLRlog), 2), 
                             round(BIC(GameOutcomeLRlog), 2), 
                             round(BIC(TournamentRelatedLRlog), 2), 
                             round(BIC(SchoolRelatedLRlog), 2), 
                             round(BIC(AllVariablesLRlog), 2),
                             round(BIC(SigVariablesLRlog), 2))))


Linear Regression Models - log
================================================================================================================================================================================
                                                                                        Dependent variable:                                                                     
                    ------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                        log(Mention_Volume)                                                                     
                               (1)                       (2)                       (3)                       (4)                       (5)                       (6)            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IsMarchMadness          1.412*** (0.060)          1.154*** (0.062)          2.731*** (0.218)          1.352*** (0.046)          3.667*** (0.414)           2.929*** (0.237)     
IsWinnerYes                                                                                                                       0.128 (0.561)            0.980*** (0.122)     
UpsetWinYes                                                                                                                     1.438*** (0.442)           1.566*** (0.294)     
UpsetLossYes                                                                                                                      0.490 (0.442)                                 
FavoriteWinYes                                    1.937*** (0.264)                                                               -0.267 (0.500)                                 
FavoriteLossYes                                   1.471*** (0.264)                                                               -0.578 (0.499)            0.756*** (0.196)     
UnderdogWinYes                                    1.440*** (0.177)                                                               -0.290 (0.462)                                 
UnderdogLossYes                                   0.743*** (0.177)                                                               -0.332 (0.460)            0.817*** (0.134)     
IsGameDay                                                                   0.923*** (0.121)                                    1.240*** (0.464)                                
IsDayAfterGame                                                                                                                   -0.132* (0.078)                                
IsFirst4                                                                    -1.888*** (0.222)                                   -2.826*** (0.415)         -2.083*** (0.239)     
IsRd64                                                                      -1.580*** (0.232)                                   -2.632*** (0.421)         -1.901*** (0.243)     
IsRd32                                                                      -1.069*** (0.251)                                   -2.232*** (0.425)         -1.493*** (0.252)     
IsRd16                                                                      -0.693** (0.270)                                    -1.558*** (0.429)         -0.871*** (0.263)     
IsRd8                                                                                                                           -1.404*** (0.462)          -0.667** (0.306)     
IsRd4                                                                                                                            -0.896* (0.493)                                
IsChamp                                                                                                                                                                         
Is15                                                                                                                             0.750** (0.352)           0.630** (0.295)      
Is13                                                                                                                              0.102 (0.357)                                 
Is11                                                                                                                              0.093 (0.357)                                 
Is9                                                                                                                               0.287 (0.328)                                 
Is8                                                                                                                               0.191 (0.335)                                 
Is7                                                                                                                               0.427 (0.295)                                 
Is6                                                                                                                               0.245 (0.657)                                 
Is5                                                                                                                             -0.688** (0.310)          -0.813*** (0.234)     
Is4                                                                                                                               0.268 (0.339)                                 
Is3                                                                                                                               0.026 (0.260)                                 
Is1                                                                                                                                                                             
Is0                                                                                                                                                                             
IsEarlyEvening                                                                                                                   -0.142 (0.248)                                 
IsLateAfternoon                                                                                                                  -0.203 (0.279)                                 
IsLateEvening                                                                                                                   -0.585** (0.249)                                
IsMidAfternoon                                                                                                                    0.006 (0.290)                                 
IsEarlyAfternoon                                                                                                                                                                
SizeMedium                                                                                            -0.684*** (0.060)         -0.662*** (0.055)         -0.653*** (0.054)     
SizeSmall                                                                                             -1.096*** (0.137)         -1.033*** (0.122)         -1.031*** (0.121)     
MajorYes                                                                                              1.098*** (0.051)          1.148*** (0.046)           1.142*** (0.046)     
Constant                5.951*** (0.037)          5.951*** (0.036)          5.951*** (0.035)          5.576*** (0.046)          5.548*** (0.041)           5.544*** (0.041)     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AIC                          5166.33                   5033.57                   4826.28                   4297.86                   3796.8                    3787.85          
BIC                          5182.52                   5071.35                   4869.22                   4330.24                   3984.67                    3879.1          
Observations                  1,632                     1,632                     1,584                     1,632                     1,584                     1,584           
R2                            0.254                     0.316                     0.344                     0.563                     0.669                     0.663           
Adjusted R2                   0.254                     0.314                     0.342                     0.562                     0.662                     0.660           
Residual Std. Error     1.177 (df = 1630)         1.128 (df = 1626)         1.107 (df = 1577)         0.901 (df = 1627)         0.793 (df = 1550)         0.795 (df = 1568)     
F Statistic         554.943*** (df = 1; 1630) 149.983*** (df = 5; 1626) 138.039*** (df = 6; 1577) 524.973*** (df = 4; 1627) 95.004*** (df = 33; 1550) 206.086*** (df = 15; 1568)
================================================================================================================================================================================
Note:                                                                                                                                                *p<0.1; **p<0.05; ***p<0.01

AllVariablesLR had the highest adjusted r-squared (0.709) while SigVariablesLR had the next highest (0.704). When it came to AIC and BIC, however, the numbers for the log-transformed models were substantially lower: AIC ranged from 3,787.85 to 5,166.33 for the log-transformed models and from 27,712.21 to 30,243.8 for the non-transformed ones; BIC was similar, ranging from 3,879.1 to 5,182.52 for the log-transformed models and from 27,851.04 to 30,259.99 for the non-transformed ones.

Lower values of AIC and BIC indicate a better trade-off between goodness of fit and model complexity, while higher values suggest a poorer fit or unnecessary complexity. Accordingly, while the adjusted r squared was highest for the non-log models, I believe a log-transformed model is a better fit for the data overall, keeping in mind that AIC and BIC for log(Mention_Volume) and Mention_Volume are computed on different scales. Specifically, SigVariablesLRlog, the model built from the significant variables across the four categories of variables (is march madness, game outcome-related, tournament-related, and school-related), is the best fitting model: it has the lowest AIC (3,787.85) and BIC (3,879.1) of the log-transformed models, and its adjusted r squared (0.660) is nearly identical to the highest among them (0.662 for AllVariablesLRlog).
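
AIC and BIC are built from the likelihood of the response as it enters the model, so values from a model of log(Mention_Volume) are not directly comparable to values from a model of Mention_Volume. A common adjustment is to add 2 * sum(log(y)), the Jacobian term for the log transformation, to the log model's AIC (or BIC) so that both refer to the likelihood of y itself, provided both models are fitted to the same rows. The sketch below illustrates the adjustment on simulated data; the object names and the data-generating process are assumptions for illustration and are not the project data.

```r
# Simulated positive, right-skewed response (not the project data)
set.seed(42)
n <- 500
x <- rnorm(n)
y <- exp(1 + 0.8 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(y ~ x)        # response on the original scale
fit_log <- lm(log(y) ~ x)   # response on the log scale

aic_raw              <- AIC(fit_raw)
aic_log_as_reported  <- AIC(fit_log)   # likelihood of log(y)
# Re-express the log model's AIC in terms of the likelihood of y
aic_log_on_raw_scale <- AIC(fit_log) + 2 * sum(log(y))

c(raw = aic_raw,
  log_as_reported = aic_log_as_reported,
  log_on_raw_scale = aic_log_on_raw_scale)
```

Applied to the models above, the same adjustment would use the Mention_Volume values from the rows shared by both fits. Whether the log-transformed models would still come out ahead afterwards is not something this sketch can answer.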

Diagnostics

Code

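# Arrange the diagnostic plots in a 2 x 3 grid
par(mfrow = c(2, 3))
# Plots 1-5: residuals vs. fitted, Q-Q, scale-location, Cook's distance, residuals vs. leverage
plot(SigVariablesLRlog, which = 1:5)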

Residuals vs. Fitted

There is a slight curve in the residuals vs. fitted line. Although there are some outliers, the plot is mostly well behaved, which suggests the model is capturing most of the relationship.

Q-Q Residuals

The points deviate from the line at the lower end of the plot, follow it almost perfectly through the middle, and deviate slightly again at the upper end. This suggests that the residuals are largely normally distributed.

Scale-Location

In the scale-location plot, the red line should be approximately horizontal, which it is here. The points also appear to be randomly scattered around the line. Together, these two observations suggest that the residuals have a largely consistent spread across the range of fitted values.
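A visual read of the scale-location plot can be complemented with a numeric check. The sketch below is a minimal base R illustration, not part of the original analysis: it uses the built-in `mtcars` data and a hypothetical model `fit`, and computes a Koenker-style studentized Breusch-Pagan statistic with the fitted values as the only auxiliary regressor. A small p-value would point to a spread that changes with the fitted values.

```r
# Illustrative only: mtcars and `fit` stand in for the project's data and model.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the fitted values
aux <- lm(residuals(fit)^2 ~ fitted(fit))

# Test statistic: n * R^2 of the auxiliary regression, compared with a
# chi-squared distribution with 1 degree of freedom (one auxiliary regressor)
lm_stat <- nobs(fit) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("LM statistic:", round(lm_stat, 3), " p-value:", round(p_value, 3), "\n")
```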

Cook’s Distance

If I use 4/n as my threshold for Cook’s distance, some observations exceed it; if I use 1, none do. The presence of influential observations makes sense within the research context, so the higher threshold is arguably the appropriate one.
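The two thresholds can be checked numerically as well as visually. The following is a minimal sketch assuming a fitted `lm` object; the built-in `mtcars` data and the model `fit` are hypothetical stand-ins, not the model from this project.

```r
# Illustrative only: mtcars and `fit` stand in for the project's data and model.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

cd <- cooks.distance(fit)  # one Cook's distance per observation
n  <- nobs(fit)

# Count the observations exceeding each rule-of-thumb threshold
thresholds <- c(four_over_n = 4 / n, one = 1)
flagged <- sapply(thresholds, function(t) sum(cd > t))
print(flagged)

# Observations above the stricter 4/n cutoff, largest first
print(head(sort(cd[cd > 4 / n], decreasing = TRUE)))
```

Because 4/n shrinks as the sample grows, it flags more observations in large data sets, which is one reason the fixed threshold of 1 is often used alongside it.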

Residuals vs. Leverage

No points fall outside the Cook’s distance lines.

Conclusion

The results for both hypotheses were statistically significant:

Participation in the men’s March Madness tournament increases online conversation about the schools that are involved.
This increase in conversation volume for each school is influenced by three types of additional mediating variables: game outcome-related variables, school-related variables, and tournament-related variables.

Testing the different variables and models showed that a large number of variables and factors contribute, in a statistically significant fashion, to differences in conversation volume for the schools/teams involved in the tournament. While I believe I was able to identify many of the factors that help to predict an increase in volume, I think this also presents a significant limitation for the utility of this research to the schools involved. For social media management strategies at the teams involved in March Madness, a model with tens of different variables is not going to be easy to use. In this sense, I believe that quantitative analysis to determine the statistically significant factors, followed by qualitative analysis and overarching recommendations that draw on both components, would give such groups the best opportunity to put the information to use.

Footnotes

https://www.hbs.edu/faculty/Pages/item.aspx?num=44778 ↩︎
https://hbswk.hbs.edu/item/diagnosing-the-flutie-effect-on-college-marketing ↩︎
https://www.forbes.com/sites/hbsworkingknowledge/2013/04/29/the-flutie-effect-how-athletic-success-boosts-college-applications/?sh=61f984206e96 ↩︎
https://www.campussonar.com/campus-sonar-expertise ↩︎
Publication of this data is forthcoming; it will be discussed during this webinar https://t.co/MK2fARTWb5 ↩︎
https://wallethub.com/blog/march-madness-statistics/11016 ↩︎
https://blog.campussonar.com/blog/leverage-the-everyday-impact-of-athletics ↩︎
Some split this designation down even further to high-, mid-, and low-major, but I did not see merit in narrowing the sample down this far. ↩︎