Final Project - Impact of Men’s March Madness Tournament on a School’s Online Conversation Volume

Final Project - Darron Bunt
Author

Darron Bunt

Published

May 25, 2023

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(ggplot2)
library(dplyr)
library(lubridate)
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
Code
library(forcats)
library(stargazer)

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 

Introduction

In April 2013, Doug J. Chung published a research paper [1] that sought to quantify and model the so-called “Flutie Effect”: the spillover impact that athletics has on the quantity and quality of applicants to US colleges. The effect is named after Boston College quarterback Doug Flutie, who in 1984 threw a Hail Mary touchdown pass with six seconds remaining to secure victory over the University of Miami, qualifying the team to compete in the Cotton Bowl [2]. Flutie’s on-field success has been credited with catalyzing a 30% increase in undergraduate applications at Boston College, though institutional officials have argued that other, non-athletic factors were the “true” reason behind the increase. This trend of increased applications after prominent athletic successes, however, has been observed at other institutions, including Georgetown, Northwestern, Boise State, and Texas Christian University.

Chung found a statistically significant relationship between athletic success and both the quantity (number of applications) and quality (SAT scores of those applicants) of applicants to a given institution. Among his findings: when a school rises from being classified as a mediocre football program to a great one, applications rise by 18.7% [3].

While my job does not focus directly on applications to US colleges, I work with on-campus marketing and communications leaders and generate insights from online conversation data about their schools. These insights help them understand that conversation so they can develop, refine, and align their communications strategies with the goals of the institutions they serve [4]. A trend I frequently observe while analyzing this data is the impact that athletics has on the volume and reach of mentions related to schools.

In benchmarking work that we’ve undertaken to better understand online conversation trends in higher education, we’ve found that, on average, 63% of all online conversation related to schools is about their athletics - and for some schools, this proportion can be as high as 91% [5]. And while the proportion of overall online conversation relating to athletics is already quite high, certain significant events within college sports send mention volume even higher. One such event is March Madness.

Run by the NCAA every year during the month of March, the men’s version of the March Madness tournament is widely regarded as one of the biggest American sporting events. Sixty-eight teams participate in the three-week-long single-elimination tournament, which produced $1.14 billion in annual revenue for the NCAA in 2022 [6]. One of the most exciting parts of the tournament is the “Cinderella stories”: teams with a low ranking that manage to eliminate higher-ranked teams.

In 2022, I became interested in developing a greater understanding of the impact that March Madness could have on a school’s online conversation volume and began tracking that year’s Cinderella - the Saint Peter’s University Peacocks, a #15 seed that made it all the way to the Elite 8. In their Sweet 16 game against Purdue, Saint Peter’s had more mentions per minute than their average monthly volume for the prior five months. Their total mention volume during the month of March was 12,380 times more than that same monthly average [7].

After completing that work, I became increasingly curious about how participation in the tournament affected conversation volume and reach for all teams, not just Cinderella stories. I wanted to investigate the factors that might influence more/less conversation volume for any given team, and indeed, whether conversation about the team was the only facet of conversation about a college that increased - or if the school as a whole received more conversation overall. I became very interested in learning more about the overall trends that relate to participation in events like March Madness and how we could use that information to provide strategic insights to colleges across the US.

Bringing it back to the Flutie Effect - if on-field athletic success can impact enrollment at US colleges, it stands to reason that online conversation about athletic success (or athletic events in general) could have a similar impact. Accordingly, it would be prudent to further explore the impact an event like the men’s March Madness tournament has on online conversation volume for US colleges.

Research Question

Specifically, the research question that I want to explore is:

How does online conversation volume differ when schools are participating in the men’s March Madness tournament?

To examine this research question in greater detail, I will also consider what variables contribute to differences in mention volume during the tournament. These variables can be broken down into three types: game outcome related, tournament related, and school related.

The answer to this question and a better understanding of the variables that contribute to differences in mention volume will provide more robust insight into the overall impact that events such as the men’s March Madness tournament have and the types of changes that can be expected based on the presence/absence of additional variables. This information can subsequently serve to inform strategies and tactics that can be implemented by marketing and communications professionals who focus on digital media in higher education. That is, schools can use this information to develop further hypotheses to be tested that would facilitate a goal of leveraging and maximizing the impact of athletic participation and success on their online conversation.

Hypothesis

My overarching hypothesis is that participation in the men’s March Madness tournament increases online conversation about the schools that are involved. I further hypothesize that the increase in conversation volume for each school will be further influenced by three types of additional mediating variables: game-outcome related variables, school-related variables, and tournament related variables.

Data Collection

I collected the data used for this project using the Brandwatch Consumer Research platform. I first wrote Boolean queries to search for mentions about all 68 schools with teams in the men’s March Madness tournament, using the same parameters to construct each one (specifically, the school’s full name, shortened name(s), acronym(s), mascot(s), website URL, athletics website URL, and Twitter usernames for the school’s flagship, flagship athletics, and men’s basketball team accounts).

The Boolean was then used to run a query within Brandwatch to pull all relevant, retrievable online mentions for the 68 schools made between January 1 and May 12, 2023. This search returned 9,007,552 mentions.

One of the features included in Brandwatch is the ability to segment data into categories, which can be done in a wide range of ways (using Boolean or a variety of pre-built parameters such as content sources and mention types). I used this feature to break the mention volume data down by school (the 68 schools participating in men’s March Madness) and by daily volume (for the 132 days), and exported this information to .csv. This mention volume data provides the foundation for this analysis.

Code
MMVolumeNew <- read_csv("_data/MMVolumeByDayFullNEW.csv", show_col_types = FALSE)

Variables of Interest

I performed a variety of data manipulation in order to create my variables of interest.

Data Manipulation For Mention Volume (Dependent Variable)

I calculated the average daily volume for the following time spans:

  • during the tournament (March 14 to the day after each team’s elimination)
  • outside of the tournament (January 1 to March 13 and the day after each team’s elimination to May 12 - i.e., both before and after, but not during)
  • the week after the tournament for each team (the week after each team’s elimination)
  • two weeks after the tournament for each team (the second week after each team’s elimination)
  • before the tournament (January 1 to March 13)
  • after the tournament (two days after each team’s elimination to May 12)

I also added columns for the actual daily volume on the following days:

  • Day of elimination
  • Day after elimination
  • Two days after elimination
  • Three days after elimination
  • Four days after elimination
  • Five days after elimination
  • Six days after elimination
Code
# Read in data
TournamentVariables <- read_csv("_data/MMVariables.csv", show_col_types = FALSE)

# Join data
MMVolumeVar <- MMVolumeNew %>%
  left_join(TournamentVariables, by = "School")

# Calculate pre-tournament average daily volume 
MMVolumeVar$VolBefore <- (round(rowMeans(MMVolumeVar[, 2:73]), digits = 0))

# Create a function to pull the rows needed to calculate "during tournament" for each of the schools
fun1 <- function(row) {
  start_col1 <- as.numeric(row[134])  # Convert the value to a numeric column index
  end_col1 <- as.numeric(row[135])    # Convert the value to a numeric column index
  values1 <- as.numeric(row[start_col1:end_col1]) # The rows to calculate the average
  average1 <- (round(mean(values1, na.rm = TRUE), digits = 0))
  return(average1)
}

# Return the average daily volume for "during tournament" for each school 
MMVolumeVar$VolDuring <- apply(MMVolumeVar, 1, fun1)

# Create a function to pull the rows needed to calculate "after tournament" for each of the schools
fun2 <- function(row) {
  start_col2 <- as.numeric(row[136])  # Convert the value to a numeric column index
  end_col2 <- as.numeric(row[137])    # Convert the value to a numeric column index
  values2 <- as.numeric(row[start_col2:end_col2]) # The rows to calculate the average
  average2 <- (round(mean(values2, na.rm = TRUE), digits = 0))
  return(average2)
}

# Return the average daily volume for "after tournament" for each school 
MMVolumeVar$VolAfter <- apply(MMVolumeVar, 1, fun2)

# Create a function to pull the rows needed for post-elimination volume 
fun3 <- function(row) {
  column_index3 <- as.numeric(row[136])  # Convert the value to a numeric column index
  value3 <- as.numeric(row[column_index3])  # Get the value from the specified column
  return(value3)
}

# Return the post-elimination day volume for each school 
MMVolumeVar$VolPostElim <- apply(MMVolumeVar, 1, fun3)

# Create a function to pull the rows needed for non-tournament volume
fun14 <- function(row) {
  start_col14 <- as.numeric(row[145])    # Start column index for range 1
  end_col14 <- as.numeric(row[146])      # End column index for range 1
  start_col14a <- as.numeric(row[147])    # Start column index for range 2
  end_col14a <- as.numeric(row[148])      # End column index for range 2
  
  values14 <- as.numeric(row[start_col14:end_col14])    # Values for range 1
  values14a <- as.numeric(row[start_col14a:end_col14a])    # Values for range 2
  
  values14b <- c(values14, values14a)    # Combine the values from both ranges
  
  average14 <- (round(mean(values14b, na.rm = TRUE), digits = 0))
  return(average14)
}

# Return the average non-tournament daily volume for each school 
MMVolumeVar$VolNonTourn <- apply(MMVolumeVar, 1, fun14) 

# Create a function to pull the rows needed for post-elimination +1 volume 
fun4 <- function(row) {
  column_index4 <- as.numeric(row[137])  # Convert the value to a numeric column index
  value4 <- as.numeric(row[column_index4])  # Get the value from the specified column
  return(value4)
}

# Return the post-elimination +1 volume for each school 
MMVolumeVar$VolPostElim1 <- apply(MMVolumeVar, 1, fun4)

# Create a function to pull the rows needed for post-elimination +2 volume 
fun5 <- function(row) {
  column_index5 <- as.numeric(row[138])  # Convert the value to a numeric column index
  value5 <- as.numeric(row[column_index5])  # Get the value from the specified column
  return(value5)
}

# Return the post-elimination +2 volume for each school 
MMVolumeVar$VolPostElim2 <- apply(MMVolumeVar, 1, fun5)

# Create a function to pull the rows needed for post-elimination +3 volume 
fun6 <- function(row) {
  column_index6 <- as.numeric(row[139])  # Convert the value to a numeric column index
  value6 <- as.numeric(row[column_index6])  # Get the value from the specified column
  return(value6)
}

# Return the post-elimination +3 volume for each school 
MMVolumeVar$VolPostElim3 <- apply(MMVolumeVar, 1, fun6)

# Create a function to pull the rows needed for post-elimination +4 volume 
fun17 <- function(row) {
  column_index17 <- as.numeric(row[140])  # Convert the value to a numeric column index
  value17 <- as.numeric(row[column_index17])  # Get the value from the specified column
  return(value17)
}

# Return the post-elimination +4 volume for each school 
MMVolumeVar$VolPostElim4 <- apply(MMVolumeVar, 1, fun17)

# Create a function to pull the rows needed for post-elimination +5 volume 
fun18 <- function(row) {
  column_index18 <- as.numeric(row[141])  # Convert the value to a numeric column index
  value18 <- as.numeric(row[column_index18])  # Get the value from the specified column
  return(value18)
}

# Return the post-elimination +5 volume for each school 
MMVolumeVar$VolPostElim5 <- apply(MMVolumeVar, 1, fun18)

# Create a function to pull the rows needed for post-elimination +6 volume 
fun19 <- function(row) {
  column_index19 <- as.numeric(row[142])  # Convert the value to a numeric column index
  value19 <- as.numeric(row[column_index19])  # Get the value from the specified column
  return(value19)
}

# Return the post-elimination +6 volume for each school 
MMVolumeVar$VolPostElim6 <- apply(MMVolumeVar, 1, fun19)

# Create a function to pull the rows needed for day after elimination volume 
fun20 <- function(row) {
  column_index20 <- as.numeric(row[135])  # Convert the value to a numeric column index
  value20 <- as.numeric(row[column_index20])  # Get the value from the specified column
  return(value20)
}

# Return the day after volume for each school 
MMVolumeVar$VolAfterElim <- apply(MMVolumeVar, 1, fun20)

# Create a function to pull the rows needed for day of elimination volume 
fun21 <- function(row) {
  column_index21 <- as.numeric(row[149])  # Convert the value to a numeric column index
  value21 <- as.numeric(row[column_index21])  # Get the value from the specified column
  return(value21)
}

# Return the day of elimination volume for each school 
MMVolumeVar$VolElim_Day <- apply(MMVolumeVar, 1, fun21)

# Create a function to pull the rows needed to calculate "week after tournament" for each of the schools
fun15 <- function(row) {
  start_col15 <- as.numeric(row[136])  # Convert the value to a numeric column index
  end_col15 <- as.numeric(row[142])    # Convert the value to a numeric column index
  values15 <- as.numeric(row[start_col15:end_col15]) # The rows to calculate the average
  average15 <- (round(mean(values15, na.rm = TRUE), digits = 0))
  return(average15)
}

# Return the average daily volume for "week after tournament" for each school 
MMVolumeVar$VolWeekAfter <- apply(MMVolumeVar, 1, fun15)

# Create a function to pull the rows needed to calculate "two weeks after tournament" for each of the schools
fun16 <- function(row) {
  start_col16 <- as.numeric(row[143])  # Convert the value to a numeric column index
  end_col16 <- as.numeric(row[144])    # Convert the value to a numeric column index
  values16 <- as.numeric(row[start_col16:end_col16]) # The rows to calculate the average
  average16 <- (round(mean(values16, na.rm = TRUE), digits = 0))
  return(average16)
}

# Return the average daily volume for "two weeks after tournament" for each school 
MMVolumeVar$Vol2WeeksAfter <- apply(MMVolumeVar, 1, fun16)

# Specify the starting date
TournEndDate <- as.Date("2023-03-12")

# Convert the existing column "Numbers" to dates
MMVolumeVar$EndDate <- TournEndDate + (MMVolumeVar$During_End - 72)

# Create a data frame with the columns for the calculated means
MMVolAvgs <- MMVolumeVar %>%
  select(School, VolDuring, VolNonTourn, VolWeekAfter, Vol2WeeksAfter, VolElim_Day, VolAfterElim, VolPostElim,  VolPostElim1, VolPostElim2, VolPostElim3, VolPostElim4, VolPostElim5, VolPostElim6, VolBefore, VolAfter)
MMVolAvgs
# A tibble: 68 × 16
   School          VolDuring VolNonTourn VolWeekAfter Vol2WeeksAfter VolElim_Day
   <chr>               <dbl>       <dbl>        <dbl>          <dbl>       <dbl>
 1 University of …       796         260          151            132        1481
 2 University of …      2903        1320         1037           1113        6441
 3 Arizona State …      2476        1352         1121           1199        2458
 4 Duke University      4567        2162         1242           1879        8261
 5 Utah State Uni…       734         297          203            236        1796
 6 University of …      3762        1453          824            983        6693
 7 North Carolina…      1013         861          993            561        2255
 8 University of …      5907        1487         1722           1239       11182
 9 Providence Col…       721         363          997            489        1829
10 Xavier Univers…      1238         354          228            170        2975
# ℹ 58 more rows
# ℹ 10 more variables: VolAfterElim <dbl>, VolPostElim <dbl>,
#   VolPostElim1 <dbl>, VolPostElim2 <dbl>, VolPostElim3 <dbl>,
#   VolPostElim4 <dbl>, VolPostElim5 <dbl>, VolPostElim6 <dbl>,
#   VolBefore <dbl>, VolAfter <dbl>
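The window averages above all follow the same pattern: each school has its own start and end column, and the mean is taken over that span. As a minimal sketch only (the helper name `window_mean` and the toy numbers are hypothetical, and the indices here refer to a plain numeric matrix of daily volumes rather than the column positions in `MMVolumeVar`), that pattern could be expressed once:

```r
# Hypothetical helper: mean of a row-specific span of daily-volume columns.
# volume:    numeric matrix, one row per school, one column per day
# start_col: integer vector, first column of the window for each school
# end_col:   integer vector, last column of the window for each school
window_mean <- function(volume, start_col, end_col) {
  vapply(seq_len(nrow(volume)), function(i) {
    round(mean(volume[i, start_col[i]:end_col[i]], na.rm = TRUE), digits = 0)
  }, numeric(1))
}

# Toy data (made-up numbers): 2 schools x 6 days
toy_volume <- matrix(c(10, 20, 30, 40, 50, 60,
                        5,  5, 50, 50,  5,  5),
                     nrow = 2, byrow = TRUE)

# School 1 is averaged over days 2-4, school 2 over days 3-4
toy_means <- window_mean(toy_volume, start_col = c(2, 3), end_col = c(4, 4))
print(toy_means)
```

Passing the volume matrix separately from the index vectors avoids the character coercion that `apply()` performs when a data frame contains a text column such as School.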

Data Manipulation for Independent Variables

I also manipulated the data to create my tournament variables of interest.

Code
# Read in game scores, dates, etc.
DateAndTeam <- read_csv("_data/MBBMMGameData.csv", show_col_types = FALSE)

DateAndTeam$GameDate <- as.Date(DateAndTeam$Date, format = "%m/%d/%Y")
DateAndTeam$Time <- gsub(" ET", "", DateAndTeam$Time)
DateAndTeam$Time <- strptime(DateAndTeam$Time, format="%I:%M%p")
DateAndTeam$Time <- format(DateAndTeam$Time, format = "%H:%M:%S")

# Use GameDate to create a TournamentRound variable
DateRoundTeam <- DateAndTeam %>%
  mutate(TournamentRound = case_when(
      GameDate >= "2023-03-14" & GameDate <= "2023-03-15" ~ "First 4", 
      GameDate >= "2023-03-16" & GameDate <= "2023-03-17" ~ "Round of 64",
      GameDate >= "2023-03-18" & GameDate <= "2023-03-19" ~ "Round of 32",
      GameDate >= "2023-03-23" & GameDate <= "2023-03-24" ~ "Sweet 16",
      GameDate >= "2023-03-25" & GameDate <= "2023-03-26" ~ "Elite 8",
      GameDate == "2023-04-01" ~ "Final 4",
      GameDate == "2023-04-03" ~ "Championship")) %>%

# Replace game start times with time categories
  mutate(Time = case_when(
    Time >= "12:00:00" & Time < "14:15:00" ~ "Early Afternoon",
    Time >= "14:15:00" & Time < "16:30:00" ~ "Mid Afternoon",
    Time >= "16:30:00" & Time < "18:45:00" ~ "Late Afternoon",
    Time >= "18:45:00" & Time < "21:00:00" ~ "Early Evening",
    Time >= "21:00:00" & Time < "22:45:00" ~ "Late Evening")) %>%

# Import game day volume
  mutate(GDayVolCol = case_when(
      GameDate == "2023-03-14" ~ 74, 
      GameDate == "2023-03-15" ~ 75, 
      GameDate == "2023-03-16" ~ 76,
      GameDate == "2023-03-17" ~ 77,
      GameDate == "2023-03-18" ~ 78,
      GameDate == "2023-03-19" ~ 79,
      GameDate == "2023-03-23" ~ 83,
      GameDate == "2023-03-24" ~ 84,
      GameDate == "2023-03-25" ~ 85,
      GameDate == "2023-03-26" ~ 86,
      GameDate == "2023-04-01" ~ 92,
      GameDate == "2023-04-03" ~ 94
            )) %>%
  rename(School = Team) %>%
  select(School, Seed, GameDate, TournamentRound, Time, GDayVolCol, IsWinner, WinningSeed, LosingSeed)

# Create a function to bring in Game Day Volume 
fun13 <- function(row) {
  school <- row["School"]  # Get the school name from the row
  column_index <- as.numeric(row["GDayVolCol"])  # Convert the value to a numeric column index
  value <- MMVolumeVar[MMVolumeVar$School == school, column_index]
  return(as.numeric(value))
}

# Return the flattened game day volume for each school
DateRoundTeam$GameDayVolume <- unlist(apply(DateRoundTeam, 1, fun13))

# Create data frame with School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed
TVariables <- DateRoundTeam %>%
  left_join(TournamentVariables, by = "School") %>%
  select(School, Seed, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed)

# Mutate WinningSeed and LosingSeed to create SeedDifference column 
TVariables <- TVariables %>%
  mutate(SeedDifference = ifelse(IsWinner == "Yes", WinningSeed - LosingSeed, LosingSeed - WinningSeed)) %>%
# Mutate SeedDifference and Is Winner to create UpsetWin and UpsetLoss
   mutate(UpsetWin = ifelse(IsWinner == "Yes" & SeedDifference >= 5, "Yes", "No")) %>%
  mutate(UpsetLoss = ifelse(IsWinner == "No" & SeedDifference <= -5, "Yes", "No")) %>%
# Mutate IsWinner and Winning Seed/Losing Seed to create FavoriteWin and FavoriteLoss
# (a lower seed number marks the favorite, so a favorite wins when WinningSeed < LosingSeed)
  mutate(FavoriteWin = ifelse(IsWinner == "Yes" & WinningSeed < LosingSeed, "Yes", "No")) %>%
  mutate(FavoriteLoss = ifelse(IsWinner == "No" & WinningSeed > LosingSeed, "Yes", "No")) %>%
# Mutate IsWinner and Winning Seed/Losing Seed to create UnderdogWin and UnderdogLoss
  mutate(UnderdogWin = ifelse(IsWinner == "Yes" & WinningSeed > LosingSeed, "Yes", "No")) %>%
  mutate(UnderdogLoss = ifelse(IsWinner == "No" & WinningSeed < LosingSeed, "Yes", "No"))
TVariables
# A tibble: 134 × 16
   School           Seed GameDate   Time  TournamentRound GameDayVolume IsWinner
   <chr>           <dbl> <date>     <chr> <chr>                   <dbl> <chr>   
 1 Pittsburgh Uni…    11 2023-03-14 Late… First 4                  4731 Yes     
 2 Texas A&M Corp…    16 2023-03-14 Late… First 4                   521 Yes     
 3 Arizona State …    11 2023-03-15 Late… First 4                  3967 Yes     
 4 Fairleigh Dick…    16 2023-03-15 Late… First 4                  1669 Yes     
 5 Princeton Univ…    15 2023-03-16 Mid … Round of 64              9532 Yes     
 6 Furman Univers…    13 2023-03-16 Earl… Round of 64              8721 Yes     
 7 Penn State Uni…    10 2023-03-16 Late… Round of 64              3692 Yes     
 8 Auburn Univers…     9 2023-03-16 Earl… Round of 64              4477 Yes     
 9 University of …     8 2023-03-16 Late… Round of 64              5918 Yes     
10 University of …     8 2023-03-16 Earl… Round of 64             10336 Yes     
# ℹ 124 more rows
# ℹ 9 more variables: WinningSeed <dbl>, LosingSeed <dbl>,
#   SeedDifference <dbl>, UpsetWin <chr>, UpsetLoss <chr>, FavoriteWin <chr>,
#   FavoriteLoss <chr>, UnderdogWin <chr>, UnderdogLoss <chr>
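The game-day column lookup above hard-codes one column index per date. Since the daily-volume columns appear to run in calendar order, the same indices can be derived from the date itself. The sketch below assumes that column 1 holds the school name and column 2 holds January 1, 2023 (an assumption inferred from the indices used above, not something stated about the export); the variable names are hypothetical.

```r
# Assumption: column 1 = School, column 2 = 2023-01-01, one column per day after that
first_day <- as.Date("2023-01-01")
first_volume_col <- 2L

# The twelve tournament game dates used in the case_when() above
game_dates <- as.Date(c("2023-03-14", "2023-03-15", "2023-03-16", "2023-03-17",
                        "2023-03-18", "2023-03-19", "2023-03-23", "2023-03-24",
                        "2023-03-25", "2023-03-26", "2023-04-01", "2023-04-03"))

# Days elapsed since January 1, shifted by the position of the first volume column
gday_col <- as.integer(game_dates - first_day) + first_volume_col
print(data.frame(GameDate = game_dates, GDayVolCol = gday_col))
```

Under that assumption the computed indices match the hard-coded values, and the calculation would keep working if games on other dates were added.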

Does Participation Impact Average Mention Volume?

In order to answer the question of what impact participation in the men’s March Madness tournament has on the volume of online conversation about the schools involved, it is important to first examine if March Madness does indeed have an impact on conversation volume.

Code
# Descriptive statistics for volume during the tournament and not during
MMVolAvgs %>%
  select(VolNonTourn, VolDuring) %>%
summary()
  VolNonTourn     VolDuring     
 Min.   :  51   Min.   : 322.0  
 1st Qu.: 269   1st Qu.: 731.5  
 Median : 682   Median :1802.0  
 Mean   : 908   Mean   :2389.1  
 3rd Qu.:1312   3rd Qu.:3576.0  
 Max.   :5269   Max.   :7504.0  

This indicates a clear difference between average daily volume during the tournament and outside of it: the mean increased from 908 mentions to 2,389, and the interquartile range shifted from 269-1,312 to 731.5-3,576.

We can also display this difference visually:

Code
#Create bar chart showing difference between average mention volume during/not during March Madness

MMVolAvgs_Long <- pivot_longer(MMVolAvgs, cols = c(VolDuring, VolNonTourn), names_to = "Variable", values_to = "Volume")
MMVolAvgs_Long$School <- factor(MMVolAvgs_Long$School, levels = unique(MMVolAvgs_Long$School)) 
ggplot(MMVolAvgs_Long, 
       aes(x = School, y=Volume, fill = Variable)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(x = "School", y = "Average Mention Volume", fill = "Time Frame") +
 scale_fill_manual(values = c("VolDuring" = "blue", "VolNonTourn" = "red"),
                    labels = c("March Madness", "Not March Madness")) +
  ggtitle("Comparison of Mention Volume During and Outside March Madness") +
  theme(axis.text.x = element_blank())

It can also be helpful to visualize this difference in mention volume by highlighting just how large, proportionally, that difference was. Accordingly, I transformed these volume metrics into a percentage difference in in-tournament volume compared to outside-tournament volume.

We can then provide summary statistics on this proportional difference in volume during the tournament compared to not.

Code
# Calculate difference between average volume during tournament and non-tournament
MMVolDiffs <- MMVolAvgs %>%
  mutate(
    InTournamentDifference = round((((VolDuring / VolNonTourn) * 100) -100), digits = 2),
    WeekAfterDifference = round((((VolWeekAfter / VolNonTourn) * 100) -100), digits = 2),
    V2WeeksAfterDifference = round((((Vol2WeeksAfter / VolNonTourn) * 100) -100), digits = 2),
    ElimDayDifference = round(((VolElim_Day / VolNonTourn) * 100), digits = 2),
    DayAfterElimDifference = round(((VolAfterElim / VolNonTourn) * 100), digits = 2),
    DayPostElimDifference = round(((VolPostElim / VolNonTourn) * 100), digits = 2),
    ElimPlus1Difference = round(((VolPostElim1 / VolNonTourn) * 100), digits = 2),
    ElimPlus2Difference = round(((VolPostElim2 / VolNonTourn) * 100), digits = 2), 
    ElimPlus3Difference = round(((VolPostElim3 / VolNonTourn) * 100), digits = 2),
    ElimPlus4Difference = round(((VolPostElim4 / VolNonTourn) * 100), digits = 2),
    ElimPlus5Difference = round(((VolPostElim5 / VolNonTourn) * 100), digits = 2),
    ElimPlus6Difference = round(((VolPostElim6 / VolNonTourn) * 100), digits = 2),
  ) %>%
  
  select(School, InTournamentDifference, WeekAfterDifference, V2WeeksAfterDifference, ElimDayDifference, DayPostElimDifference, ElimPlus1Difference, ElimPlus2Difference, ElimPlus3Difference) %>%
  arrange(desc(InTournamentDifference))

# Descriptive statistics for the proportional difference in volume during the tournament and not 
summary(MMVolDiffs$InTournamentDifference)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -6.32   94.75  162.41  278.95  278.06 3800.00 

Among the 68 schools, there was a mean proportional difference in average mention volume of 279% - again, this seems to indicate that participation in March Madness increases mention volume.
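To make the percentage calculation concrete, here is a small worked sketch using numbers that appear in the output above: Xavier's row (1,238 during vs. 354 outside the tournament) and the two overall means (2,389.1 vs. 908). The function name `pct_change` is hypothetical; it simply wraps the same formula used for InTournamentDifference.

```r
# Hypothetical wrapper around the formula used for InTournamentDifference
pct_change <- function(during, baseline) {
  round((during / baseline) * 100 - 100, digits = 2)
}

# One school: Xavier's VolDuring and VolNonTourn from the table above
xavier_change <- pct_change(1238, 354)

# Percent change computed from the two overall means in the summary output
change_of_means <- pct_change(2389.1, 908)

print(c(xavier = xavier_change, of_means = change_of_means))
```

Note that the percent change between the two overall means is not the same quantity as the mean of the 68 per-school percent changes reported by `summary()`. The latter gives every school equal weight, so schools with a small baseline and a very large relative jump pull it upward.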

Overall, mention volume increased by more than 100% for 50 of the 68 schools in the men’s March Madness tournament. We can visualize the magnitude of the increase by creating groupings of schools according to how much their average volume changed during the tournament.

Code
# Use InTournamentDifference to create a percentage difference
MMVolDiffCat <- MMVolDiffs %>%
  mutate(InTournamentDifference = case_when(
      # case_when() uses the first matching condition, so these bins are contiguous with no gaps
      InTournamentDifference < 0 ~ "Decreased Volume",
      InTournamentDifference <= 100 ~ "Up to 100% Increase",
      InTournamentDifference <= 200 ~ "100%-200% Increase",
      InTournamentDifference <= 300 ~ "201%-300% Increase",
      InTournamentDifference <= 600 ~ "301%-600% Increase",
      InTournamentDifference <= 800 ~ "601%-800% Increase",
      InTournamentDifference > 800 ~ "1000%+ Increase"))

MMVolDiffCatBD <- MMVolDiffCat %>%
        group_by(InTournamentDifference) %>%
        tally() %>%
  arrange(factor(InTournamentDifference, levels = c("Decreased Volume", "Up to 100% Increase", "100%-200% Increase", "201%-300% Increase", "301%-600% Increase", "601%-800% Increase", "1000%+ Increase")))

MMVolDiffCatBD$InTournamentDifference <- factor(MMVolDiffCatBD$InTournamentDifference, levels = c("Decreased Volume", "Up to 100% Increase", "100%-200% Increase", "201%-300% Increase", "301%-600% Increase", "601%-800% Increase", "1000%+ Increase"))
ggplot(MMVolDiffCatBD, 
       aes(x = InTournamentDifference, y=n)) +
  geom_bar(stat = "identity", fill = "purple") +
  geom_text(aes(label = n), vjust = -0.1) +
  labs(x = "", y = "Number of Schools") +
  ggtitle("Number of Schools w/Each Grouping of % Increase/Decrease") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Out of the 68 schools that participated in March Madness, only two experienced a decrease in their average mention volume - the University of Iowa (-6.08%) and Michigan State (-6.32%). Both of these outliers experienced significant non-athletic events during 2023 that increased their mention volume outside of March Madness (a screening of the contentious film “What Is a Woman?” at the University of Iowa and a shooting on campus at Michigan State).

37 schools experienced an increase in their mention volume of 100-300%, while 11 saw a 301-800% increase. Two schools - Furman University and Fairleigh Dickinson University - experienced volume increases of over 1,000% (1,121% and 3,800%, respectively).
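When grouping percentage changes into bands like these, it helps to make sure every value falls into exactly one band. One possible approach, shown here as a sketch rather than as the method used in this report, is base R's `cut()`, which takes a single vector of break points. The band labels below are hypothetical, and the input values are a handful of figures quoted in this section.

```r
# A few percentage changes quoted above (min, quartiles, and the two largest)
pct_values <- c(-6.32, 94.75, 162.41, 278.06, 1121, 3800)

# Contiguous bands: each interval is (lower, upper], so no value can fall between bands
pct_bands <- cut(
  pct_values,
  breaks = c(-Inf, 0, 100, 200, 300, 600, 800, Inf),
  labels = c("Decrease", "0-100%", "100-200%", "200-300%",
             "300-600%", "600-800%", "Over 800%")
)

print(table(pct_bands))
```

Because `cut()` returns an ordered set of factor levels, the result can go straight into `ggplot()` without a separate releveling step.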

Testing the Hypothesis

My original hypothesis was that participation in the men’s March Madness tournament increases online conversation about the schools that are involved.

To test this hypothesis, I will perform a paired t-test on the average volume during the tournament and outside of the tournament for all of the teams involved.

  • Null hypothesis: Participation in the men’s March Madness tournament does not increase average mention volume for the schools that are involved.

  • Alternative hypothesis: Participation in the men’s March Madness tournament does increase average mention volume for the schools that are involved.

Code
t.test(MMVolAvgs$VolDuring, MMVolAvgs$VolNonTourn, paired = TRUE)

    Paired t-test

data:  MMVolAvgs$VolDuring and MMVolAvgs$VolNonTourn
t = 8.1765, df = 67, p-value = 1.157e-11
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 1119.565 1842.700
sample estimates:
mean difference 
       1481.132 

The observed mean difference between the paired observations was 1,481 mentions, with a 95% confidence interval of 1,120 to 1,843. We can be 95% confident that the true mean difference lies within this range.

The magnitude of the t-value (8.18) indicates there is strong evidence against the null hypothesis. Based on the p-value (1.157e-11) we reject the null hypothesis and support the alternative hypothesis that participation in the men’s March Madness tournament does increase average mention volume for the schools that are involved.
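For readers less familiar with the paired t-test, the statistic is the mean of the per-school differences divided by its standard error. The sketch below uses simulated data (not the tournament data; all numbers and variable names are made up) to show that a manual calculation reproduces what `t.test()` reports, and how a one-sided alternative matching a directional hypothesis could be requested with `alternative = "greater"`.

```r
set.seed(1)

# Simulated paired data for 20 hypothetical schools (not the real data)
non_tourn <- round(runif(20, min = 100, max = 2000))
during <- non_tourn + round(rnorm(20, mean = 500, sd = 300))

# Manual paired t statistic: mean difference over its standard error
d <- during - non_tourn
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

# Built-in two-sided test (the default, as used above)
two_sided <- t.test(during, non_tourn, paired = TRUE)

# One-sided version: alternative hypothesis is that the mean difference is greater than 0
one_sided <- t.test(during, non_tourn, paired = TRUE, alternative = "greater")

print(c(manual = t_manual, builtin = unname(two_sided$statistic)))
print(c(p_two_sided = two_sided$p.value, p_one_sided = one_sided$p.value))
```

The test reported above is two-sided; with a large positive t-value, the one-sided p-value would be half the two-sided one, so the conclusion does not change.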

How long does March Madness contribute to changes in online conversation volume?

I was immediately curious how long changes in conversation volume due to March Madness would be considered statistically significant, so I performed t-tests for the average volume one week and two weeks after each team’s elimination.

Code
# Compare non-tournament volume to week after and two weeks after tournament ended for each team
t.test(MMVolAvgs$Vol2WeeksAfter, MMVolAvgs$VolNonTourn, paired = TRUE)

    Paired t-test

data:  MMVolAvgs$Vol2WeeksAfter and MMVolAvgs$VolNonTourn
t = -2.6328, df = 67, p-value = 0.0105
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -302.18943  -41.57528
sample estimates:
mean difference 
      -171.8824 
Code
t.test(MMVolAvgs$VolWeekAfter, MMVolAvgs$VolNonTourn, paired = TRUE)

    Paired t-test

data:  MMVolAvgs$VolWeekAfter and MMVolAvgs$VolNonTourn
t = -1.2093, df = 67, p-value = 0.2308
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -225.99467   55.46526
sample estimates:
mean difference 
      -85.26471 

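The two results differ. One week after elimination (p = 0.2308) we fail to reject the null hypothesis: average mention volume is not significantly different from non-tournament volume. Two weeks after elimination the difference is statistically significant at the 0.05 level (p = 0.0105), but the mean difference is negative (-172 mentions), meaning average volume was lower than the non-tournament average. In neither case is there evidence that participation in the men’s March Madness tournament increases mention volume one or two weeks after a team’s elimination.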

I was then curious whether changes in conversation volume due to March Madness would be considered statistically significant in the days immediately following the tournament, so I performed t-tests comparing those days’ volume numbers to each school’s average non-tournament volume.

Code
t.test(MMVolAvgs$VolElim_Day, MMVolAvgs$VolNonTourn, paired = TRUE)

    Paired t-test

data:  MMVolAvgs$VolElim_Day and MMVolAvgs$VolNonTourn
t = 6.7004, df = 67, p-value = 5.187e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 2723.919 5035.346
sample estimates:
mean difference 
       3879.632 
Code
t.test(MMVolAvgs$VolAfterElim, MMVolAvgs$VolNonTourn, paired = TRUE)

    Paired t-test

data:  MMVolAvgs$VolAfterElim and MMVolAvgs$VolNonTourn
t = 2.3848, df = 67, p-value = 0.01993
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
  230.9917 2602.8318
sample estimates:
mean difference 
       1416.912 
Code
t.test(MMVolAvgs$VolPostElim, MMVolAvgs$VolNonTourn, paired = TRUE)

    Paired t-test

data:  MMVolAvgs$VolPostElim and MMVolAvgs$VolNonTourn
t = -0.048907, df = 67, p-value = 0.9611
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -283.4630  269.9042
sample estimates:
mean difference 
      -6.779412 
Code
t.test(MMVolAvgs$VolPostElim1, MMVolAvgs$VolNonTourn, paired = TRUE)

    Paired t-test

data:  MMVolAvgs$VolPostElim1 and MMVolAvgs$VolNonTourn
t = -1.4567, df = 67, p-value = 0.1499
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -334.72775   52.28657
sample estimates:
mean difference 
      -141.2206 
Code
# Create a data frame with results
VolTTestResults <- data.frame(
  "Time Frame" = c("Elimination Day", "Day After Elimination", "2 Days After Elimination", "3 Days After Elimination"),
  t_value = c(6.7004, 2.3848, -0.048907, -1.4567),
  p_value = c(5.187e-09, 0.01993, 0.9611, 0.1499),
  Mean_Difference = c(3880, 1417,  -6.8, -141))
VolTTestResults$p_value <- format(VolTTestResults$p_value, scientific = FALSE)
VolTTestResults
                Time.Frame   t_value        p_value Mean_Difference
1          Elimination Day  6.700400 0.000000005187          3880.0
2    Day After Elimination  2.384800 0.019930000000          1417.0
3 2 Days After Elimination -0.048907 0.961100000000            -6.8
4 3 Days After Elimination -1.456700 0.149900000000          -141.0
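
The values in the table above were typed in by hand from the test output. As a sketch of an alternative, the same table could be assembled directly from the objects returned by t.test(), which avoids transcription errors and also captures the confidence intervals. The data frame toy and its values are hypothetical stand-ins for MMVolAvgs.

```r
# Hypothetical stand-in for MMVolAvgs (values are made up for illustration)
toy <- data.frame(
  VolNonTourn  = c(1100, 1500, 2600, 700, 1900),
  VolElim_Day  = c(5200, 4100, 9800, 1500, 6000),
  VolAfterElim = c(2300, 1700, 4000, 650, 2500)
)

# Label -> column name for each time frame compared against the baseline
time_frames <- c("Elimination Day"       = "VolElim_Day",
                 "Day After Elimination" = "VolAfterElim")

# Run each paired t-test and pull the statistics straight from the result object
rows <- lapply(names(time_frames), function(label) {
  res <- t.test(toy[[time_frames[[label]]]], toy$VolNonTourn, paired = TRUE)
  data.frame(Time_Frame      = label,
             t_value         = unname(res$statistic),
             p_value         = res$p.value,
             Mean_Difference = unname(res$estimate),
             CI_Lower        = res$conf.int[1],
             CI_Upper        = res$conf.int[2])
})

# Stack the one-row data frames into a single results table
results <- do.call(rbind, rows)
results
```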

Based on the p-values for Elimination Day and Day After Elimination, we can reject the null hypothesis - participation in the men’s March Madness tournament does increase mention volume on those two days.

For two days and three days post-elimination, however, the p-values (0.96 and 0.15) are higher than 0.05. For these days we fail to reject the null hypothesis - participation in the men’s March Madness tournament does not increase mention volume for the schools that are involved two and three days post-elimination.

The observed mean differences for the day of elimination and the day after elimination were 3,880 and 1,417 mentions respectively, with 95% confidence intervals of 2,724 to 5,035 and 231 to 2,603. We can be 95% confident that the true mean differences lie within these ranges.

To summarize: Participation in the men’s March Madness tournament does increase mention volume for schools with teams participating and does so for the days between the start of the tournament and the day after each team is eliminated.

What variables contribute to changes in mention volume during the tournament?

Having confirmed that participation in the men’s March Madness tournament does indeed increase mention volume for schools with teams participating, the next step is to determine what variables contribute to these changes in mention volume.

For the daily mention outcome variable, I created a data frame with the daily mention volume for each of the 68 schools in March Madness for the two days prior to the tournament beginning, all days in the tournament, and the day after the tournament, for a total time frame of March 12 to April 4, 2023. I then joined together the predictor variable data for each school’s daily mention volume during the tournament/on tournament game days.

Data manipulation

Code
# Read in data for enrollment 
SchoolNameToSchool <- read_csv("_data/SchoolNameToSchool.csv", show_col_types = FALSE) 
EnrollmentData <- read_csv("_data/CCIHE2021PublicData.csv", show_col_types = FALSE)
SchoolEnrollmentData <- SchoolNameToSchool %>%
  left_join(EnrollmentData, by = "SchoolName")

VolumeByDay <- MMVolumeNew %>%
  pivot_longer(cols = -School, names_to = "Date", values_to = "Mention_Volume") %>%
  arrange(Date)

# Convert the Date column to the "YYYY-MM-DD" format if needed
VolumeByDay$Date <- as.Date(VolumeByDay$Date, format = "%m/%d/%Y")

# Join volume by school and enrollment data for each school
VolumeAndEnrollment <- VolumeByDay %>%
  left_join(SchoolEnrollmentData, by = c("School"))

# Narrow down to all days between March 12 and April 4
MMVolumeByDay <- VolumeByDay %>%
  filter(Date >= as.Date("2023-03-12") & Date <= as.Date("2023-04-04"))

# Remove unneeded variables from TVariables
TVariables <- TVariables %>%
  rename("Date" = "GameDate") %>%
  select(-GameDayVolume, -WinningSeed, -LosingSeed)

# Join volume by school and game-day variables for each school
TVariables$Date <- as.Date(TVariables$Date)
MMVolumeByDay$Date <- as.Date(MMVolumeByDay$Date)
VolumeAndVariables <- MMVolumeByDay %>%
  left_join(TVariables, by = c("School", "Date")) %>%
  left_join(SchoolEnrollmentData, by = c("School"))

VolumeAndVariables <- VolumeAndVariables %>%
  mutate(Seed = ifelse(School %in% TVariables$School, TVariables$Seed, NA))

# Read in data
AddMajor <- read_csv("_data/DateRoundTeamSeedMajor.csv", show_col_types = FALSE) %>%
  select(School, Major)
# Find matching indices
matching_indices <- match(VolumeAndVariables$School, AddMajor$School)
VolumeAndVariables$Major <- ifelse(!is.na(matching_indices), AddMajor$Major[matching_indices], NA)

# Add EndDate column to VolumeAndVariables
MMVolVarJustEndDate <- MMVolumeVar %>%
  select(School, EndDate)
VolumeAndVariables <- VolumeAndVariables %>%
  left_join(MMVolVarJustEndDate, by = "School")

# Create column for DayAfterGame
DateRoundTeam2 <- DateRoundTeam %>%
  rename(Date = GameDate) %>%
  mutate(DayAfterGame = as.Date(Date + 1)) %>%
  select(School, Date, DayAfterGame)

# Join to dataset
VolumeAndVariables <- VolumeAndVariables %>%
  left_join(DateRoundTeam2, by = c("School", "Date")) %>%
  mutate(
    IsMarchMadness = ifelse(Date >= as.Date("2023-03-12") & Date <= EndDate, 1, 0),
    IsGameDay = ifelse(is.na(Time), 0, 1)) %>%
  mutate(
    NextRow = if_else(!is.na(DayAfterGame), row_number() + 1, NA_integer_),
    IsDayAfterGame = if_else(row_number() %in% NextRow & !is.na(NextRow), 1, 0)) %>%
  mutate(
    IsDayAfterGame = replace(IsDayAfterGame, NextRow, 1),
    GDayOrAfter = ifelse(IsGameDay == 1 | IsDayAfterGame == 1, 1, 0)) %>%
  ungroup() %>%
  mutate(
    IsWinner = ifelse(is.na(IsWinner), "No", IsWinner),
    UpsetWin = ifelse(is.na(UpsetWin), "No", UpsetWin), 
    UpsetLoss = ifelse(is.na(UpsetLoss), "No", UpsetLoss),
    FavoriteWin = ifelse(is.na(FavoriteWin), "No", FavoriteWin),
    FavoriteLoss = ifelse(is.na(FavoriteLoss),  "No", FavoriteLoss),
    UnderdogWin = ifelse(is.na(UnderdogWin), "No", UnderdogWin),
    UnderdogLoss = ifelse(is.na(UnderdogLoss), "No", UnderdogLoss),
    SeedDifference = abs(SeedDifference),
    SeedDifference = ifelse(is.na(SeedDifference), "Not March Madness", SeedDifference),
    TournamentRound = fct_na_value_to_level(as.factor(TournamentRound), "Not March Madness"),
    Time = fct_na_value_to_level(as.factor(Time), "Not March Madness"),
    SizeSetting = as.factor(SizeSetting),
    Control = as.factor(Control),
    Seed = ifelse(is.na(Seed), "Not March Madness", Seed)) %>%
  select(Date, School, Mention_Volume, Seed, Time, TournamentRound, IsWinner, SeedDifference, UpsetWin, UpsetLoss, FavoriteWin, FavoriteLoss, UnderdogWin, UnderdogLoss, SizeSetting, Control, F20Enrollment, Major, IsMarchMadness, IsGameDay, IsDayAfterGame, GDayOrAfter)

VolumeAndVariables
# A tibble: 1,632 × 22
   Date       School         Mention_Volume  Seed Time  TournamentRound IsWinner
   <date>     <chr>                   <dbl> <dbl> <fct> <fct>           <chr>   
 1 2023-03-12 University of…            786    11 Not … Not March Madn… No      
 2 2023-03-12 University of…           1532    16 Not … Not March Madn… No      
 3 2023-03-12 Arizona State…           2664    11 Not … Not March Madn… No      
 4 2023-03-12 Duke Universi…           8882    16 Not … Not March Madn… No      
 5 2023-03-12 Utah State Un…            638    15 Not … Not March Madn… No      
 6 2023-03-12 University of…           3980    13 Not … Not March Madn… No      
 7 2023-03-12 North Carolin…           2181    10 Not … Not March Madn… No      
 8 2023-03-12 University of…           1476     9 Not … Not March Madn… No      
 9 2023-03-12 Providence Co…           1012     8 Not … Not March Madn… No      
10 2023-03-12 Xavier Univer…            801     8 Not … Not March Madn… No      
# ℹ 1,622 more rows
# ℹ 15 more variables: SeedDifference <chr>, UpsetWin <chr>, UpsetLoss <chr>,
#   FavoriteWin <chr>, FavoriteLoss <chr>, UnderdogWin <chr>,
#   UnderdogLoss <chr>, SizeSetting <fct>, Control <fct>, F20Enrollment <dbl>,
#   Major <chr>, IsMarchMadness <dbl>, IsGameDay <dbl>, IsDayAfterGame <dbl>,
#   GDayOrAfter <dbl>
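
One thing that may be worth checking in the Seed step above: ifelse() pairs its yes values with rows by position (recycling them as needed), not by school, whereas the match() pattern used for Major looks each school up in the other table. The sketch below contrasts the two on hypothetical toy tables; the school labels and seeds are placeholders, not tournament data.

```r
# Hypothetical toy tables (schools and seeds are placeholders)
daily <- data.frame(School = c("A", "B", "C", "A", "B", "C"))
seeds <- data.frame(School = c("C", "A", "B"), Seed = c(1, 8, 16))

# Positional: ifelse() takes the i-th element of `yes` (recycled),
# regardless of which school row i belongs to
positional <- ifelse(daily$School %in% seeds$School, seeds$Seed, NA)

# Keyed: match() finds each school's row in the lookup table first
keyed <- seeds$Seed[match(daily$School, seeds$School)]

positional  # 1 8 16 1 8 16
keyed       # 8 16 1 8 16 1
```

If the seeds printed for a few known schools do not match the bracket, the keyed lookup (or a left_join on School against a one-row-per-school table) is one way to align them.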

Model 1 - Participation in March Madness

There is one independent variable to be considered in this section - whether the day’s mention volume is from a day when the school was not yet eliminated from the tournament (1 = Yes, 0 = No).

Code
# Run linear regression of Mention_Volume on IsMarchMadness
IsMarchMadnessLR <- lm(Mention_Volume ~ IsMarchMadness, data = VolumeAndVariables)
summary(IsMarchMadnessLR)

Call:
lm(formula = Mention_Volume ~ IsMarchMadness, data = VolumeAndVariables)

Residuals:
   Min     1Q Median     3Q    Max 
 -2787   -682   -428    339  37243 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      720.54      80.47   8.954   <2e-16 ***
IsMarchMadness  2150.27     130.13  16.524   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2555 on 1630 degrees of freedom
Multiple R-squared:  0.1435,    Adjusted R-squared:  0.1429 
F-statistic:   273 on 1 and 1630 DF,  p-value: < 2.2e-16

While linear regression does indeed demonstrate a statistically significant relationship between IsMarchMadness and mention volume, a scatterplot comparing the two variables shows a large cluster of values followed by a long tail of outliers on the high end of mention volume.

Code
# Scatterplot of IsMarchMadness vs. (Mention_Volume)
plot(VolumeAndVariables$IsMarchMadness, VolumeAndVariables$Mention_Volume, 
     xlab = "IsMarchMadness", ylab = "Mention Volume", 
     main = "Scatterplot of IsMarchMadness vs. Mention Volume")

If I instead plot IsMarchMadness against the log of Mention_Volume, the skew is reduced and the data become more symmetric.

Code
# Scatterplot of IsMarchMadness vs. log(Mention_Volume)
plot(VolumeAndVariables$IsMarchMadness, log(VolumeAndVariables$Mention_Volume), 
     xlab = "IsMarchMadness", ylab = "log(Mention Volume)", 
     main = "Scatterplot of IsMarchMadness vs. log(Mention Volume)") 

Code
# Run linear regression of log(Mention_Volume) on IsMarchMadness
IsMarchMadnessLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness, data = VolumeAndVariables)
summary(IsMarchMadnessLRlog)

The model using the log-transformed dependent variable has a much higher R-squared (0.2491) than the model using the original dependent variable (0.1435).

Accordingly, I am going to test both the log-transformed version of Mention_Volume and the original version in all of my other models.

Model 5 - All Variables

I will now run a model with every variable from all four previous models.

Code
AllVariablesLR <- lm(Mention_Volume ~ IsMarchMadness + IsWinner + UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss + IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + IsEarlyEvening + IsLateAfternoon + IsLateEvening + IsMidAfternoon + IsEarlyAfternoon + Size + Major, data = VolumeAndVariables)
summary(AllVariablesLR)

Call:
lm(formula = Mention_Volume ~ IsMarchMadness + IsWinner + UpsetWin + 
    UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss + 
    IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + 
    IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + Is11 + Is9 + 
    Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + IsEarlyEvening + 
    IsLateAfternoon + IsLateEvening + IsMidAfternoon + IsEarlyAfternoon + 
    Size + Major, data = VolumeAndVariables)

Residuals:
     Min       1Q   Median       3Q      Max 
-13366.5   -538.1   -148.0    256.2  16412.3 

Coefficients: (4 not defined because of singularities)
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         424.49      78.70   5.393 7.98e-08 ***
IsMarchMadness    23193.97     786.22  29.501  < 2e-16 ***
IsWinnerYes        1048.41    1065.28   0.984 0.325188    
UpsetWinYes        7102.00     839.81   8.457  < 2e-16 ***
UpsetLossYes       3839.19     838.70   4.578 5.08e-06 ***
FavoriteWinYes     2504.38     950.12   2.636 0.008476 ** 
FavoriteLossYes     968.62     947.51   1.022 0.306805    
UnderdogWinYes     3029.86     878.18   3.450 0.000575 ***
UnderdogLossYes    1107.89     873.47   1.268 0.204854    
IsGameDay          -100.41     880.61  -0.114 0.909234    
IsDayAfterGame     -265.74     147.21  -1.805 0.071240 .  
IsFirst4         -22583.09     788.85 -28.628  < 2e-16 ***
IsRd64           -22352.73     799.16 -27.970  < 2e-16 ***
IsRd32           -21261.16     806.37 -26.366  < 2e-16 ***
IsRd16           -19989.55     815.03 -24.526  < 2e-16 ***
IsRd8            -17565.23     876.73 -20.035  < 2e-16 ***
IsRd4            -14329.90     936.47 -15.302  < 2e-16 ***
IsChamp                 NA         NA      NA       NA    
Is15               1942.69     668.07   2.908 0.003690 ** 
Is13              -1818.91     677.58  -2.684 0.007343 ** 
Is11              -1742.82     678.82  -2.567 0.010338 *  
Is9               -1348.84     622.47  -2.167 0.030392 *  
Is8               -1126.76     636.16  -1.771 0.076723 .  
Is7                 -84.35     560.00  -0.151 0.880293    
Is6                 832.81    1246.77   0.668 0.504248    
Is5               -2904.69     588.19  -4.938 8.73e-07 ***
Is4                3963.33     644.11   6.153 9.65e-10 ***
Is3                1092.60     494.37   2.210 0.027245 *  
Is1                     NA         NA      NA       NA    
Is0                     NA         NA      NA       NA    
IsEarlyEvening      646.44     470.42   1.374 0.169591    
IsLateAfternoon    1999.72     530.69   3.768 0.000171 ***
IsLateEvening      -858.93     472.92  -1.816 0.069529 .  
IsMidAfternoon     1470.80     550.89   2.670 0.007668 ** 
IsEarlyAfternoon        NA         NA      NA       NA    
SizeMedium         -505.87     103.73  -4.877 1.19e-06 ***
SizeSmall          -402.88     230.77  -1.746 0.081044 .  
MajorYes            842.34      87.83   9.590  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1506 on 1550 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.715, Adjusted R-squared:  0.7089 
F-statistic: 117.8 on 33 and 1550 DF,  p-value: < 2.2e-16
Code
AllVariablesLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness + IsWinner + UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + UnderdogLoss + IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + IsEarlyEvening + IsLateAfternoon + IsLateEvening + IsMidAfternoon + IsEarlyAfternoon + Size + Major, data = VolumeAndVariables)
summary(AllVariablesLRlog)

Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsWinner + 
    UpsetWin + UpsetLoss + FavoriteWin + FavoriteLoss + UnderdogWin + 
    UnderdogLoss + IsGameDay + IsDayAfterGame + IsFirst4 + IsRd64 + 
    IsRd32 + IsRd16 + IsRd8 + IsRd4 + IsChamp + Is15 + Is13 + 
    Is11 + Is9 + Is8 + Is7 + Is6 + Is5 + Is4 + Is3 + Is1 + Is0 + 
    IsEarlyEvening + IsLateAfternoon + IsLateEvening + IsMidAfternoon + 
    IsEarlyAfternoon + Size + Major, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2475 -0.5444 -0.0327  0.4991  3.1885 

Coefficients: (4 not defined because of singularities)
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5.548279   0.041446 133.867  < 2e-16 ***
IsMarchMadness    3.667101   0.414029   8.857  < 2e-16 ***
IsWinnerYes       0.128204   0.560985   0.229 0.819262    
UpsetWinYes       1.437542   0.442252   3.251 0.001177 ** 
UpsetLossYes      0.489885   0.441667   1.109 0.267527    
FavoriteWinYes   -0.267184   0.500339  -0.534 0.593414    
FavoriteLossYes  -0.578416   0.498964  -1.159 0.246540    
UnderdogWinYes   -0.289704   0.462454  -0.626 0.531112    
UnderdogLossYes  -0.332252   0.459974  -0.722 0.470203    
IsGameDay         1.239874   0.463736   2.674 0.007582 ** 
IsDayAfterGame   -0.132229   0.077523  -1.706 0.088269 .  
IsFirst4         -2.826484   0.415412  -6.804 1.45e-11 ***
IsRd64           -2.631728   0.420845  -6.253 5.18e-10 ***
IsRd32           -2.232335   0.424641  -5.257 1.67e-07 ***
IsRd16           -1.558388   0.429201  -3.631 0.000292 ***
IsRd8            -1.404315   0.461691  -3.042 0.002392 ** 
IsRd4            -0.896345   0.493151  -1.818 0.069320 .  
IsChamp                 NA         NA      NA       NA    
Is15              0.749508   0.351812   2.130 0.033294 *  
Is13              0.101789   0.356816   0.285 0.775475    
Is11              0.092589   0.357469   0.259 0.795661    
Is9               0.286940   0.327796   0.875 0.381512    
Is8               0.191323   0.335005   0.571 0.568011    
Is7               0.426904   0.294898   1.448 0.147923    
Is6               0.244996   0.656558   0.373 0.709086    
Is5              -0.688168   0.309745  -2.222 0.026446 *  
Is4               0.268176   0.339194   0.791 0.429283    
Is3               0.025999   0.260339   0.100 0.920462    
Is1                     NA         NA      NA       NA    
Is0                     NA         NA      NA       NA    
IsEarlyEvening   -0.142239   0.247729  -0.574 0.565934    
IsLateAfternoon  -0.203041   0.279464  -0.727 0.467620    
IsLateEvening    -0.585250   0.249044  -2.350 0.018899 *  
IsMidAfternoon    0.006344   0.290104   0.022 0.982556    
IsEarlyAfternoon        NA         NA      NA       NA    
SizeMedium       -0.661744   0.054622 -12.115  < 2e-16 ***
SizeSmall        -1.033428   0.121527  -8.504  < 2e-16 ***
MajorYes          1.147779   0.046254  24.815  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7932 on 1550 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.6692,    Adjusted R-squared:  0.6621 
F-statistic:    95 on 33 and 1550 DF,  p-value: < 2.2e-16

In this version of the model, the overall model p-values are again the same for the original mention volume variable and the log-transformed one (< 2.2e-16); however, the adjusted R-squared for the original model is higher (0.7089 vs. 0.6621).
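
Both summaries also report four coefficients as NA ("not defined because of singularities"). lm() does this when a predictor is an exact linear combination of other predictors, which commonly happens when a complete set of indicator variables is included alongside the intercept or another indicator that they sum to. Which columns are responsible here is not confirmed by the output; the sketch below only reproduces the behavior on hypothetical toy data.

```r
# Hypothetical toy data: every row belongs to exactly one of three rounds
toy <- data.frame(
  y     = c(10, 12, 20, 22, 30, 33),
  IsRdA = c(1, 1, 0, 0, 0, 0),
  IsRdB = c(0, 0, 1, 1, 0, 0),
  IsRdC = c(0, 0, 0, 0, 1, 1)
)

# IsRdA + IsRdB + IsRdC equals 1 in every row, duplicating the intercept,
# so lm() cannot estimate the last dummy and reports NA for it
fit <- lm(y ~ IsRdA + IsRdB + IsRdC, data = toy)
coef(fit)
```

Dropping one indicator from each complete set (treating it as the reference level) removes the NA rows without changing the fitted values.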

Model 6 - Only Significant Variables

I now want to create a model containing only the statistically significant variables from the four models that I created. I am going to use backward elimination to create this model. I am not going to show every elimination step, but will include the final model with only significant variables.
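
For readers who want to reproduce a backward elimination without the manual steps, base R's step() automates a related procedure. Note that step() drops terms based on AIC rather than on p-values, so it is a different criterion from the one used here and may keep or drop different variables. This sketch uses the built-in mtcars data as a stand-in for VolumeAndVariables.

```r
# Built-in mtcars data stands in for VolumeAndVariables
full_model <- lm(mpg ~ wt + hp + qsec + drat + gear, data = mtcars)

# Backward elimination by AIC: remove one term at a time while AIC improves
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Inspect which terms survived and how well the reduced model fits
formula(reduced_model)
summary(reduced_model)$adj.r.squared
```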

Code
# Use backward elimination to create model of significant variables
SigVariablesLR <- lm(Mention_Volume ~ IsMarchMadness + UpsetWin + UpsetLoss + FavoriteWin + UnderdogWin + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + IsRd4 + Is15 + Is13 + Is9 + Is5 + Is4 + Is3 + IsEarlyEvening + IsLateAfternoon + IsMidAfternoon + Major, data = VolumeAndVariables)
summary(SigVariablesLR)

Call:
lm(formula = Mention_Volume ~ IsMarchMadness + UpsetWin + UpsetLoss + 
    FavoriteWin + UnderdogWin + IsFirst4 + IsRd64 + IsRd32 + 
    IsRd16 + IsRd8 + IsRd4 + Is15 + Is13 + Is9 + Is5 + Is4 + 
    Is3 + IsEarlyEvening + IsLateAfternoon + IsMidAfternoon + 
    Major, data = VolumeAndVariables)

Residuals:
     Min       1Q   Median       3Q      Max 
-13319.3   -526.5    -92.7    206.5  16456.0 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        174.71      62.27   2.806 0.005082 ** 
IsMarchMadness   23396.59     764.40  30.608  < 2e-16 ***
UpsetWinYes       7154.29     768.31   9.312  < 2e-16 ***
UpsetLossYes      3937.23     584.18   6.740 2.22e-11 ***
FavoriteWinYes    2611.99     531.59   4.914 9.88e-07 ***
UnderdogWinYes    3028.71     283.61  10.679  < 2e-16 ***
IsFirst4        -22793.33     768.04 -29.677  < 2e-16 ***
IsRd64          -22658.56     775.93 -29.202  < 2e-16 ***
IsRd32          -21514.15     786.95 -27.339  < 2e-16 ***
IsRd16          -20233.70     805.01 -25.135  < 2e-16 ***
IsRd8           -17617.28     859.38 -20.500  < 2e-16 ***
IsRd4           -14420.43     941.36 -15.319  < 2e-16 ***
Is15              2351.40     575.09   4.089 4.56e-05 ***
Is13             -1617.32     599.77  -2.697 0.007081 ** 
Is9              -1147.69     509.99  -2.250 0.024561 *  
Is5              -2692.84     476.91  -5.646 1.94e-08 ***
Is4               4492.86     608.05   7.389 2.40e-13 ***
Is3               1499.10     411.92   3.639 0.000282 ***
IsEarlyEvening    1124.35     312.27   3.601 0.000327 ***
IsLateAfternoon   2558.36     377.14   6.784 1.66e-11 ***
IsMidAfternoon    2123.85     426.94   4.975 7.26e-07 ***
MajorYes          1056.04      77.09  13.699  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1519 on 1562 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.7078,    Adjusted R-squared:  0.7038 
F-statistic: 180.2 on 21 and 1562 DF,  p-value: < 2.2e-16

And then for the log-transformed version of mention volume:

Code
# Use backward elimination to create model of significant variables
SigVariablesLRlog <- lm(log(Mention_Volume) ~ IsMarchMadness + IsWinner + UpsetWin + FavoriteLoss + UnderdogLoss + IsFirst4 + IsRd64 + IsRd32 + IsRd16 + IsRd8 + Is15 + Is5 + Size + Major, data = VolumeAndVariables)
summary(SigVariablesLRlog)

Call:
lm(formula = log(Mention_Volume) ~ IsMarchMadness + IsWinner + 
    UpsetWin + FavoriteLoss + UnderdogLoss + IsFirst4 + IsRd64 + 
    IsRd32 + IsRd16 + IsRd8 + Is15 + Is5 + Size + Major, data = VolumeAndVariables)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2516 -0.5599 -0.0313  0.5010  3.1930 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.54377    0.04121 134.521  < 2e-16 ***
IsMarchMadness   2.92884    0.23678  12.369  < 2e-16 ***
IsWinnerYes      0.97977    0.12242   8.003 2.34e-15 ***
UpsetWinYes      1.56557    0.29417   5.322 1.17e-07 ***
FavoriteLossYes  0.75590    0.19610   3.855 0.000121 ***
UnderdogLossYes  0.81726    0.13380   6.108 1.27e-09 ***
IsFirst4        -2.08275    0.23861  -8.729  < 2e-16 ***
IsRd64          -1.90091    0.24268  -7.833 8.71e-15 ***
IsRd32          -1.49297    0.25236  -5.916 4.04e-09 ***
IsRd16          -0.87063    0.26330  -3.307 0.000965 ***
IsRd8           -0.66700    0.30565  -2.182 0.029239 *  
Is15             0.62959    0.29513   2.133 0.033055 *  
Is5             -0.81295    0.23441  -3.468 0.000539 ***
SizeMedium      -0.65313    0.05435 -12.017  < 2e-16 ***
SizeSmall       -1.03135    0.12132  -8.501  < 2e-16 ***
MajorYes         1.14217    0.04608  24.784  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7954 on 1568 degrees of freedom
  (48 observations deleted due to missingness)
Multiple R-squared:  0.6635,    Adjusted R-squared:  0.6602 
F-statistic: 206.1 on 15 and 1568 DF,  p-value: < 2.2e-16

Excluding IsMarchMadness, the non-log transformed model had 20 statistically significant variables, while the log-transformed model had 13.

The non-log transformed model differed from the log-transformed model with the inclusion of the UpsetLoss, FavoriteWin, UnderdogWin, IsRd4, Is13, Is9, Is4, Is3, IsEarlyEvening, IsLateAfternoon, and IsMidAfternoon variables.

The log transformed model differed from the non-log transformed model with the inclusion of the IsWinner, FavoriteLoss, UnderdogLoss, and Size variables.

The two models once again had the same p-value (< 2.2e-16) but the non-log transformed model once again had the larger adjusted r squared (0.7038 vs. 0.6602).

Comparing the Models Using Adjusted r-squared, AIC, and BIC

I have compiled a final summary of all six non-log transformed models, and a summary for the six log-transformed models.

Code
models <- list(IsMarchMadnessLR, GameOutcomeLR, TournamentRelatedLR, SchoolRelatedLR, AllVariablesLR, SigVariablesLR)
stargazer(models, 
          title = "Linear Regression Models", 
          align = TRUE, 
          single.row = TRUE,
          type = "text", 
          font.size = "small",
          add.lines = list(c("AIC", 
                             round(AIC(IsMarchMadnessLR), 2), 
                             round(AIC(GameOutcomeLR), 2), 
                             round(AIC(TournamentRelatedLR), 2), 
                             round(AIC(SchoolRelatedLR), 2), 
                             round(AIC(AllVariablesLR), 2),
                             round(AIC(SigVariablesLR), 2)), 
                           c("BIC", 
                             round(BIC(IsMarchMadnessLR), 2), 
                             round(BIC(GameOutcomeLR), 2), 
                             round(BIC(TournamentRelatedLR), 2), 
                             round(BIC(SchoolRelatedLR), 2), 
                             round(BIC(AllVariablesLR), 2),
                             round(BIC(SigVariablesLR), 2)))) 

Linear Regression Models
=================================================================================================================================================================================
                                                                                         Dependent variable:                                                                     
                    -------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                           Mention_Volume                                                                        
                               (1)                       (2)                       (3)                       (4)                       (5)                        (6)            
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IsMarchMadness       2,150.268*** (130.132)    1,248.844*** (120.370)    23,445.480*** (892.060)    2,104.942*** (126.861)   23,193.970*** (786.221)    23,396.590*** (764.397)  
IsWinnerYes                                                                                                                   1,048.415 (1,065.283)                              
UpsetWinYes                                   4,381.744*** (1,009.731)                                                        7,102.001*** (839.815)     7,154.288*** (768.309)  
UpsetLossYes                                                                                                                  3,839.192*** (838.703)     3,937.226*** (584.177)  
FavoriteWinYes                                 6,681.318*** (701.888)                                                         2,504.379*** (950.118)     2,611.989*** (531.591)  
FavoriteLossYes                                4,532.355*** (513.694)                                                           968.625 (947.508)                                
UnderdogWinYes                                 5,265.937*** (345.628)                                                         3,029.861*** (878.177)     3,028.715*** (283.613)  
UnderdogLossYes                                1,779.368*** (345.628)                                                          1,107.887 (873.468)                               
IsGameDay                                                                                                                       -100.411 (880.612)                               
IsDayAfterGame                                                             -335.408** (167.289)                                -265.745* (147.212)                               
IsFirst4                                                                 -22,897.870*** (894.474)                            -22,583.090*** (788.846)   -22,793.330*** (768.044) 
IsRd64                                                                   -22,341.780*** (903.967)                            -22,352.730*** (799.164)   -22,658.560*** (775.934) 
IsRd32                                                                   -21,030.360*** (914.912)                            -21,261.160*** (806.371)   -21,514.150*** (786.946) 
IsRd16                                                                   -20,197.160*** (923.072)                            -19,989.550*** (815.030)   -20,233.700*** (805.009) 
IsRd8                                                                    -17,579.510*** (989.145)                            -17,565.230*** (876.728)   -17,617.280*** (859.384) 
IsRd4                                                                   -14,517.690*** (1,072.525)                           -14,329.900*** (936.470)   -14,420.430*** (941.364) 
IsChamp                                                                                                                                                                          
Is15                                                                      5,777.345*** (647.241)                              1,942.691*** (668.073)     2,351.403*** (575.086)  
Is13                                                                      2,093.956*** (686.043)                             -1,818.909*** (677.575)    -1,617.316*** (599.769)  
Is11                                                                                                                          -1,742.822** (678.815)                             
Is9                                                                       2,032.296*** (616.835)                              -1,348.843** (622.467)     -1,147.685** (509.988)  
Is8                                                                       2,086.841*** (575.695)                              -1,126.764* (636.157)                              
Is7                                                                       3,004.990*** (498.296)                                -84.348 (559.997)                                
Is6                                                                      7,239.143*** (1,318.326)                              832.813 (1,246.771)                               
Is5                                                                       2,149.854*** (531.601)                             -2,904.689*** (588.191)    -2,692.837*** (476.905)  
Is4                                                                       6,187.630*** (672.376)                              3,963.332*** (644.113)     4,492.858*** (608.050)  
Is3                                                                       3,635.314*** (475.062)                              1,092.598** (494.371)      1,499.101*** (411.917)  
Is1                                                                       2,570.985*** (416.446)                                                                                 
Is0                                                                                                                                                                              
IsEarlyEvening                                                                                                                  646.438 (470.425)        1,124.348*** (312.271)  
IsLateAfternoon                                                           2,774.030*** (423.352)                              1,999.720*** (530.688)     2,558.360*** (377.143)  
IsLateEvening                                                            -1,217.310*** (376.697)                               -858.932* (472.921)                               
IsMidAfternoon                                                            1,342.854*** (503.565)                              1,470.798*** (550.893)     2,123.854*** (426.935)  
IsEarlyAfternoon                                                                                                                                                                 
SizeMedium                                                                                          -695.316*** (166.956)     -505.874*** (103.725)                              
SizeSmall                                                                                            -732.447* (378.056)       -402.884* (230.774)                               
MajorYes                                                                                             763.765*** (140.525)      842.340*** (87.834)       1,056.045*** (77.088)   
Constant               720.538*** (80.467)       720.538*** (69.218)       734.180*** (54.774)       508.443*** (127.016)      424.485*** (78.704)        174.705*** (62.267)    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AIC                          30243.8                  29757.29                   28131.13                  30160.56                  27712.21                   27727.59         
BIC                         30259.99                  29800.47                   28254.59                  30192.94                  27900.08                   27851.04         
Observations                  1,632                     1,632                     1,584                     1,632                     1,584                      1,584           
R2                            0.143                     0.368                     0.623                     0.189                     0.715                      0.708           
Adjusted R2                   0.143                     0.366                     0.618                     0.187                     0.709                      0.704           
Residual Std. Error   2,554.742 (df = 1630)     2,197.609 (df = 1625)     1,725.566 (df = 1562)     2,488.128 (df = 1627)     1,506.217 (df = 1550)      1,519.184 (df = 1562)   
F Statistic         273.034*** (df = 1; 1630) 157.803*** (df = 6; 1625) 122.909*** (df = 21; 1562) 94.824*** (df = 4; 1627) 117.808*** (df = 33; 1550) 180.154*** (df = 21; 1562)
=================================================================================================================================================================================
Note:                                                                                                                                                 *p<0.1; **p<0.05; ***p<0.01
Code
# Collect the six log-response models in the order they appear in the table
logmodels <- list(IsMarchMadnessLRlog, GameOutcomeLRlog, TournamentRelatedLRlog, SchoolRelatedLRlog, AllVariablesLRlog, SigVariablesLRlog)
stargazer(logmodels, 
          title = "Linear Regression Models - log", 
          align = TRUE, 
          single.row = TRUE,
          type = "text", 
          font.size = "small",
          # Append AIC and BIC rows, one value per model, rounded to 2 decimals
          add.lines = list(c("AIC", round(sapply(logmodels, AIC), 2)), 
                           c("BIC", round(sapply(logmodels, BIC), 2))))

Linear Regression Models - log
================================================================================================================================================================================
                                                                                        Dependent variable:                                                                     
                    ------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                                                        log(Mention_Volume)                                                                     
                               (1)                       (2)                       (3)                       (4)                       (5)                       (6)            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IsMarchMadness          1.412*** (0.060)          1.154*** (0.062)          2.731*** (0.218)          1.352*** (0.046)          3.667*** (0.414)           2.929*** (0.237)     
IsWinnerYes                                                                                                                       0.128 (0.561)            0.980*** (0.122)     
UpsetWinYes                                                                                                                     1.438*** (0.442)           1.566*** (0.294)     
UpsetLossYes                                                                                                                      0.490 (0.442)                                 
FavoriteWinYes                                    1.937*** (0.264)                                                               -0.267 (0.500)                                 
FavoriteLossYes                                   1.471*** (0.264)                                                               -0.578 (0.499)            0.756*** (0.196)     
UnderdogWinYes                                    1.440*** (0.177)                                                               -0.290 (0.462)                                 
UnderdogLossYes                                   0.743*** (0.177)                                                               -0.332 (0.460)            0.817*** (0.134)     
IsGameDay                                                                   0.923*** (0.121)                                    1.240*** (0.464)                                
IsDayAfterGame                                                                                                                   -0.132* (0.078)                                
IsFirst4                                                                    -1.888*** (0.222)                                   -2.826*** (0.415)         -2.083*** (0.239)     
IsRd64                                                                      -1.580*** (0.232)                                   -2.632*** (0.421)         -1.901*** (0.243)     
IsRd32                                                                      -1.069*** (0.251)                                   -2.232*** (0.425)         -1.493*** (0.252)     
IsRd16                                                                      -0.693** (0.270)                                    -1.558*** (0.429)         -0.871*** (0.263)     
IsRd8                                                                                                                           -1.404*** (0.462)          -0.667** (0.306)     
IsRd4                                                                                                                            -0.896* (0.493)                                
IsChamp                                                                                                                                                                         
Is15                                                                                                                             0.750** (0.352)           0.630** (0.295)      
Is13                                                                                                                              0.102 (0.357)                                 
Is11                                                                                                                              0.093 (0.357)                                 
Is9                                                                                                                               0.287 (0.328)                                 
Is8                                                                                                                               0.191 (0.335)                                 
Is7                                                                                                                               0.427 (0.295)                                 
Is6                                                                                                                               0.245 (0.657)                                 
Is5                                                                                                                             -0.688** (0.310)          -0.813*** (0.234)     
Is4                                                                                                                               0.268 (0.339)                                 
Is3                                                                                                                               0.026 (0.260)                                 
Is1                                                                                                                                                                             
Is0                                                                                                                                                                             
IsEarlyEvening                                                                                                                   -0.142 (0.248)                                 
IsLateAfternoon                                                                                                                  -0.203 (0.279)                                 
IsLateEvening                                                                                                                   -0.585** (0.249)                                
IsMidAfternoon                                                                                                                    0.006 (0.290)                                 
IsEarlyAfternoon                                                                                                                                                                
SizeMedium                                                                                            -0.684*** (0.060)         -0.662*** (0.055)         -0.653*** (0.054)     
SizeSmall                                                                                             -1.096*** (0.137)         -1.033*** (0.122)         -1.031*** (0.121)     
MajorYes                                                                                              1.098*** (0.051)          1.148*** (0.046)           1.142*** (0.046)     
Constant                5.951*** (0.037)          5.951*** (0.036)          5.951*** (0.035)          5.576*** (0.046)          5.548*** (0.041)           5.544*** (0.041)     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AIC                          5166.33                   5033.57                   4826.28                   4297.86                   3796.8                    3787.85          
BIC                          5182.52                   5071.35                   4869.22                   4330.24                   3984.67                    3879.1          
Observations                  1,632                     1,632                     1,584                     1,632                     1,584                     1,584           
R2                            0.254                     0.316                     0.344                     0.563                     0.669                     0.663           
Adjusted R2                   0.254                     0.314                     0.342                     0.562                     0.662                     0.660           
Residual Std. Error     1.177 (df = 1630)         1.128 (df = 1626)         1.107 (df = 1577)         0.901 (df = 1627)         0.793 (df = 1550)         0.795 (df = 1568)     
F Statistic         554.943*** (df = 1; 1630) 149.983*** (df = 5; 1626) 138.039*** (df = 6; 1577) 524.973*** (df = 4; 1627) 95.004*** (df = 33; 1550) 206.086*** (df = 15; 1568)
================================================================================================================================================================================
Note:                                                                                                                                                *p<0.1; **p<0.05; ***p<0.01

Across both tables, AllVariablesLR had the highest adjusted R-squared (0.709), followed by SigVariablesLR (0.704). The AIC and BIC values, however, were substantially lower for the log-adjusted models: AIC ranged from 3,787.85 to 5,166.33 for the log-adjusted models versus 27,712.21 to 30,243.8 for the non-adjusted ones, and BIC ranged from 3,879.1 to 5,182.52 versus 27,851.04 to 30,259.99.

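Lower AIC and BIC values indicate a better trade-off between fit and model complexity. One caveat is that AIC and BIC are only directly comparable between models fit to the same response, so part of the gap between the log-adjusted and non-adjusted models reflects the change in scale of the dependent variable rather than fit alone. With that caveat, while the adjusted R-squared was highest for the non-log models, I believe a log-adjusted model is a better fit for the data overall. Specifically, SigVariablesLRlog, the model containing all significant variables from the four categories (March Madness participation, game outcome-related, tournament-related, and school-related), is the best-fitting model: it has the lowest AIC and BIC among the log-adjusted models, and its adjusted R-squared (0.660) is nearly identical to the highest among them (0.662, for AllVariablesLRlog).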
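
AIC values computed on Mention_Volume and on log(Mention_Volume) are on different scales. The sketch below illustrates one standard way to put them on a common footing: adding the Jacobian term 2 * sum(log(y)) to the AIC of the log-response model. It uses simulated data, and the names sim, x, y, raw_fit, and log_fit are hypothetical stand-ins, not the project's data or models.

```r
# Simulated, right-skewed positive response (assumption: NOT the project data)
set.seed(42)
n <- 500
x <- rbinom(n, 1, 0.3)                      # hypothetical 0/1 predictor
y <- exp(6 + 1.4 * x + rnorm(n, sd = 1))    # log-normal response
sim <- data.frame(x = x, y = y)

raw_fit <- lm(y ~ x, data = sim)            # response on the original scale
log_fit <- lm(log(y) ~ x, data = sim)       # response on the log scale

aic_raw       <- AIC(raw_fit)
aic_log_naive <- AIC(log_fit)               # not comparable with aic_raw

# Jacobian adjustment: express the log model's likelihood on the scale of y
aic_log_adjusted <- AIC(log_fit) + 2 * sum(log(sim$y))

print(c(raw = aic_raw, log_naive = aic_log_naive, log_adjusted = aic_log_adjusted))
```

Applied to this project, the same comparison would also need both models fit to the same rows, since the tables report 1,632 observations for some models and 1,584 for others.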

Diagnostics

Code
par(mfrow = c(2,3))
plot(SigVariablesLRlog, which = 1:5)

Residuals vs. Fitted

There is a slight curve in the residuals vs. fitted plot. While there are some outliers, the plot is mostly well behaved, which suggests the model is capturing most of the relationship.

Q-Q Residuals

There are more outliers and greater variability at the lower end of this plot; the points then fall nearly perfectly along the line until there is slight variability again at the top end. This suggests that the residuals are largely normally distributed.

Scale-Location

In the scale-location plot, the red line should be approximately horizontal, which this is. The points also appear to be randomly scattered around the line. These two observations suggest that the residuals have a largely consistent spread across the range of fitted values.
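
As a rough numeric companion to the visual check, the quantity on the vertical axis of the scale-location plot (the square root of the absolute standardized residuals) can be computed directly and compared across the lower and upper halves of the fitted values. This is a minimal sketch on simulated data with constant error variance; sim and fit are hypothetical names, and the half-split comparison is an illustrative shortcut, not a formal test.

```r
# Simulated data with constant error variance (assumption: NOT the project data)
set.seed(7)
n <- 300
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 1 + 0.8 * sim$x + rnorm(n)
fit <- lm(y ~ x, data = sim)

# The y-axis of the scale-location plot: sqrt(|standardized residuals|)
scale_loc <- sqrt(abs(rstandard(fit)))

# Compare average spread in the lower and upper halves of the fitted values
low_half     <- fitted(fit) <= median(fitted(fit))
spread_low   <- mean(scale_loc[low_half])
spread_high  <- mean(scale_loc[!low_half])
spread_ratio <- spread_high / spread_low    # near 1 suggests a consistent spread

print(round(c(low = spread_low, high = spread_high, ratio = spread_ratio), 3))
```

A ratio far from 1 would correspond to a sloped red line in the plot.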

Cook’s Distance

If I use 4/n as my threshold for Cook’s distance, some observations surpass that threshold; however, if I use 1, none do. The presence of influential observations makes sense within the research context, and arguably the higher threshold is appropriate.
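
The two thresholds can also be checked numerically. The sketch below counts the observations whose Cook’s distance exceeds 4/n and 1. It runs on simulated data with one deliberately shifted observation; sim and fit are hypothetical names, and in the project the same calls would be applied to SigVariablesLRlog.

```r
# Simulated data with one planted unusual observation (assumption: NOT the project data)
set.seed(1)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 2 + 0.5 * sim$x + rnorm(n)
sim$y[1] <- sim$y[1] + 8                    # shift one point far from the trend
fit <- lm(y ~ x, data = sim)

cooks <- cooks.distance(fit)                # one value per observation

n_over_4n <- sum(cooks > 4 / n)             # the stricter 4/n rule of thumb
n_over_1  <- sum(cooks > 1)                 # the more lenient cutoff of 1

print(c(over_4_n = n_over_4n, over_1 = n_over_1))
```

Because 4/n shrinks as the sample grows, it tends to flag at least a few points in large datasets, which is part of why the choice of threshold matters here.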

Residuals vs. Leverage

There are no points outside of the lines for Cook’s distance.

Conclusion

The results supported both hypotheses, with statistically significant effects for each:

  1. Participation in the men’s March Madness tournament increases online conversation about the schools that are involved
  2. This increase in conversation volume for each school is influenced by three types of additional mediating variables: game outcome-related variables, school-related variables, and tournament-related variables.

Testing the different variables and models showed that a large number of variables can contribute, in a statistically significant fashion, to differences in conversation volume for the schools and teams involved in the tournament. While I believe I was able to identify many of the factors that help predict an increase in volume, the number of factors is itself a significant limitation on the utility of this research for the schools involved. For the social media management strategies of teams in March Madness, a model with this many variables is not going to be easy to use. In this sense, I believe that quantitative analysis to determine the statistically significant factors, followed by qualitative analysis and overarching recommendations that draw on both, would give such groups the best opportunity to use this information moving forward.