Check In 2

finalproject

Final Project Check In 2 - Darron Bunt

Author

Darron Bunt

Published

May 13, 2023

In April 2013, Doug J. Chung published a research paper¹ endeavoring to quantify and model the impact of the so-called “Flutie Effect” - the spillover impact that athletics has on the quantity and quality of applicants to US colleges (named after Boston College quarterback Doug Flutie, who in 1984 threw a Hail Mary touchdown pass to secure victory with six seconds in a game against the University of Miami, qualifying the team to compete in the Cotton Bowl²). The legacy of Flutie’s on-field success has been credited with catalyzing a 30% increase in undergraduate applications at Boston College, though institutional officials have argued that other, non-athletic factors were the “true” reason behind the increase. This trend of increased applications after prominent athletic successes, however, has been observed at other institutions including Georgetown, Northwestern, Boise State, and Texas Christian University.

Chung was able to find a statistically significant relationship between athletic success and both the quantity (number of applications) and quality (SAT scores of those applicants) of applicants to a given institution; his findings included that when a school rises from being classified as a mediocre football program to great one, applications rise by 18.7%³.

While in my job I do not focus directly on applications to US colleges, I do work with on campus marketing and communications leaders, and generate insights from data derived from online conversation about their schools so that they can better understand that conversation in ways that can help them to develop, refine, and align their communications strategies with the goals of the institutions that they serve⁴. A trend I frequently observe while analyzing this online conversation data is the impact that athletics has on the volume and reach of mentions related to schools.

In benchmarking work that we’ve undertaken in order to better understand online conversation trends in higher education, we’ve found that, on average, 63% of all online conversation related to schools is about their athletics - and for some schools, this proportion can be as high as 91%⁵. And while the proportion of overall online conversation relating to different colleges is already quite high, there are certain significant events within the realm of college sports that send mention volume even higher. One such event is March Madness.

Run by the NCAA every year during the month of March, the men’s version of the March Madness tournament is widely regarded as one of biggest American sporting events. 68 teams participate in the three-week long single-elimination tournament, which produced $1.14 billion in annual revenue for the NCAA in 2022⁶. One of the most exciting parts of the tournament are the “Cinderella stories”; teams with a low ranking that manage to eliminate higher ranked teams.

In 2022, I became interested in developing a greater understanding of the impact that March Madness could have on a school’s online conversation volume and began tracking that year’s Cinderella - the Saint Peter’s College Peacocks, a #15 seed that made it all the way to the Elite 8. In their Sweet 16 game against Purdue, Saint Peter’s had more mentions per minute than their average monthly volume for the prior five months. Their total mention volume during the month of March was 12,380 times more than that same monthly average⁷.

After completing that work, I became increasingly curious about how participation in the tournament affected conversation volume and reach for all teams, not just Cinderella stories. I wanted to investigate the factors that might influence more/less conversation volume for any given team, and indeed, whether conversation about the team was the only facet of conversation about a college that increased - or if the school as a whole received more conversation overall. I became very interested in learning more about the overall trends that relate to participation in events like March Madness and how we could use that information to provide strategic insights to colleges across the US.

Bringing it back to the Flutie Effect - if on-field athletic success can impact enrollment at US colleges, it stands to reason that online conversation about athletic success (or athletic events in general) could have a similar impact. Accordingly, it would be prudent to further explore the impact an event like the men’s March Madness tournament has on online conversation volume for US colleges.

Specifically, the research question that I want to explore is:

What impact does participation in the men’s March Madness tournament have on online conversation about the schools involved?

And more specifically, some nuances to this question that I want to consider are:

How does mention volume differ when schools are participating in the tournament?
- Is there a difference in reach for schools participating in the tournament?
- Is there a correlation between increase in mention volume and increase in reach?
What is the baseline of conversation volume/reach increase that can be expected when a team participates in the tournament?
What variables contribute to increased/decreased mention volume/reach during the tournament?
- Size of school
- Major / mid-major
- Round of the tournament
- Favorite / underdog
- Win/Loss
- Upset
- Time of day
If there is a change in mention volume/reach during the tournament, how long does this change last?
- Is this length of time similar or different for volume and reach?
- Do the same variables that contribute to increased/decreased mention volume also impact the length of that increase?

Answers to these questions will provide a better understanding of the overall impact that events such as the men’s March Madness tournament have on conversation volume/reach and can subsequently serve to inform strategies and tactics that can be implemented by marketing and communications professionals who focus on digital media in higher education. That is, schools can use this information to develop further hypotheses to be tested that would facilitate a goal of leveraging and maximizing the impact of athletic participation and success on their online conversation.

My overarching hypothesis is that participation in the men’s March Madness tournament increases online conversation about the schools that are involved. I further hypothesize that:

There is a correlation between increases in mention volume and overall/average reach; that is, it’s not just the number of mentions about each school that increases, but also the number of people who see those mentions.
The increase in conversation (volume and reach) for each school will be influenced by seven factors: school size, classification as a major or mid-major school, which round of the tournament the game happens during, whether the team was the favorite (ranked higher) or the underdog (ranked lower), whether the team won or lost, whether the game was considered a major upset, and the time of day the game took place.
The change in mention volume/reach will return to baseline within a matter of days after a team’s exit from the tournament.
The variables I hypothesize contribute to variations in how volume/reach increase will not also impact the length of that increase.

What impact does participation in the men’s March Madness tournament have on online conversation about the schools involved?

The data used for this project was collected by me using the Brandwatch Consumer Research platform. I first wrote Boolean to search for mentions about all 68 schools with teams in the men’s March Madness tournament, using the same parameters to construct each one (specifically, the school’s full name, shortened name(s), acronym(s), mascot(s), website URL, athletics website URL, and Twitter usernames for the school’s flagship, flagship athletics, and men’s basketball team accounts). The Boolean was then used to run a query within Brandwatch to pull all relevant, retrievable online mentions for the 68 schools made between January 1 and April 30 2023. This search returned 8,099,045 mentions.

One of the features included in Brandwatch is the ability to segment data into categories, and this can be done in a wide range of manners (using Boolean or a variety of pre-built parameters such as content sources and mention types). I used this feature to create a variety of categories that would be necessary for me to be able to parse the full dataset into manageable chunks that could be used to answer my research question.

First I need to do some data manipulation in order to create my variables of interest.

Data Manipulation For Mention Volume

I need to calculate the average daily volume for the following time spans: * Outside of Tournament Play (January 1 to March 13; two days after elimination to May 12) * During the Tournament (March 14 to day after elimination) * Before the Tournament (January 1 to March 13) * After the Tournament (Two days after elimination to May 12)

And then I need to add columns with the daily volume for the following times: * Post Elimination (two days after) * Post Plus 1 (three days after) * Post Plus 2 (four days after) * Post Plus 3 (five days after)

Code

# Read in data
TournamentVariables <- read_csv("_data/MMVariables.csv", show_col_types = FALSE)
MMVolume <- read_csv("_data/MentionsPerDayALL.csv", show_col_types = FALSE)

# Join data
MMVolumeVar <- MMVolume %>%
  left_join(TournamentVariables, by = "School")

# Calculate pre-tournament average daily volume 
MMVolumeVar$VolBefore <- (round(rowMeans(MMVolumeVar[, 2:73]), digits = 0))

# Create a function to pull the rows needed to calculate "during tournament" for each of the schools
fun1 <- function(row) {
  start_col1 <- as.numeric(row[134])  # Convert the value to a numeric column index
  end_col1 <- as.numeric(row[135])    # Convert the value to a numeric column index
  values1 <- as.numeric(row[start_col1:end_col1]) # The rows to calculate the average
  average1 <- (round(mean(values1, na.rm = TRUE), digits = 0))
  return(average1)
}

# Return the average daily volume for "during tournament" for each school 
MMVolumeVar$VolDuring <- apply(MMVolumeVar, 1, fun1)

# Create a function to pull the rows needed to calculate "after tournament" for each of the schools
fun2 <- function(row) {
  start_col2 <- as.numeric(row[136])  # Convert the value to a numeric column index
  end_col2 <- as.numeric(row[137])    # Convert the value to a numeric column index
  values2 <- as.numeric(row[start_col2:end_col2]) # The rows to calculate the average
  average2 <- (round(mean(values2, na.rm = TRUE), digits = 0))
  return(average2)
}

# Return the average daily volume for "during tournament" for each school 
MMVolumeVar$VolAfter <- apply(MMVolumeVar, 1, fun2)

# Create a function to pull the rows needed for post-elimination volume 
fun3 <- function(row) {
  column_index3 <- as.numeric(row[136])  # Convert the value to a numeric column index
  value3 <- as.numeric(row[column_index3])  # Get the value from the specified column
  return(value3)
}

# Return the post-elimination day volume for each school 
MMVolumeVar$VolPostElim <- apply(MMVolumeVar, 1, fun3)

# Create a function to pull the rows needed for non-tournament volume
fun14 <- function(row) {
  start_col14 <- as.numeric(row[142])    # Start column index for range 1
  end_col14 <- as.numeric(row[143])      # End column index for range 1
  start_col14a <- as.numeric(row[140])    # Start column index for range 2
  end_col14a <- as.numeric(row[141])      # End column index for range 2
  
  values14 <- as.numeric(row[start_col14:end_col14])    # Values for range 1
  values14a <- as.numeric(row[start_col14a:end_col14a])    # Values for range 2
  
  values14b <- c(values14, values14a)    # Combine the values from both ranges
  
  average14 <- (round(mean(values14b, na.rm = TRUE), digits = 0))
  return(average14)
}

# Return the average non-tournament daily volume for each school 
MMVolumeVar$VolNonTourn <- apply(MMVolumeVar, 1, fun14) 

# Create a function to pull the rows needed for post-elimination +1 volume 
fun4 <- function(row) {
  column_index4 <- as.numeric(row[137])  # Convert the value to a numeric column index
  value4 <- as.numeric(row[column_index4])  # Get the value from the specified column
  return(value4)
}

# Return the post-elimination +1 volume for each school 
MMVolumeVar$VolPostElim1 <- apply(MMVolumeVar, 1, fun4)

# Create a function to pull the rows needed for post-elimination +2 volume 
fun5 <- function(row) {
  column_index5 <- as.numeric(row[138])  # Convert the value to a numeric column index
  value5 <- as.numeric(row[column_index5])  # Get the value from the specified column
  return(value5)
}

# Return the post-elimination +2 volume for each school 
MMVolumeVar$VolPostElim2 <- apply(MMVolumeVar, 1, fun5)

# Create a function to pull the rows needed for post-elimination +3 volume 
fun6 <- function(row) {
  column_index6 <- as.numeric(row[139])  # Convert the value to a numeric column index
  value6 <- as.numeric(row[column_index6])  # Get the value from the specified column
  return(value6)
}

# Return the post-elimination +3 volume for each school 
MMVolumeVar$VolPostElim3 <- apply(MMVolumeVar, 1, fun6)

# Create a data frame with the columns for the calculated means
MMVolAvgs <- MMVolumeVar %>%
  select(School, VolDuring, VolNonTourn, VolBefore, VolAfter, VolPostElim,  VolPostElim1, VolPostElim2, VolPostElim3)
MMVolAvgs

# A tibble: 68 × 9
   School      VolDuring VolNonTourn VolBefore VolAfter VolPostElim VolPostElim1
   <chr>           <dbl>       <dbl>     <dbl>    <dbl>       <dbl>        <dbl>
 1 University…       786         259       311      224         239          208
 2 University…      2782        1312      1393     1084        1377          791
 3 Arizona St…      2435        1318      1467     1025         964         1086
 4 Duke Unive…      4390        2127      2430     1434        1330         1537
 5 Utah State…       718         295       353      139         147          131
 6 University…      3623        1440      1839     1058        1252          865
 7 North Caro…       963         857       973      506         327          684
 8 University…      5826        1477      1557     2411        2715         2107
 9 Providence…       696         359       393     1262         269         2255
10 Xavier Uni…      1215         351       419      218         212          224
# ℹ 58 more rows
# ℹ 2 more variables: VolPostElim2 <dbl>, VolPostElim3 <dbl>

This data frame shows:

Average daily volume
- during the tournament
- outside of the tournament
- before the tournament
- after the tournament
Daily volume
- two days after elimination
- three days after elimination
- four days after elimination
- five days after elimination

Data Manipulation for Independent Variables and Game Day Volume

Independent variables include:

Game Date
Round
Seed
Is Major
Is Favorite
Is Winner
Is Upset
Favorite Winner
Underdog Winner
Time of Day

And I will be also adding game day volume during this step.

Code

# Read in game scores, dates, etc.
DateAndTeam <- read_csv("_data/MBBMMGameData.csv", show_col_types = FALSE)

DateAndTeam$GameDate <- as.Date(DateAndTeam$Date, format = "%m/%d/%Y")
DateAndTeam$Time <- DateAndTeam$Time <- gsub(" ET", "", DateAndTeam$Time)
DateAndTeam$Time <- strptime(DateAndTeam$Time, format="%I:%M%p")
DateAndTeam$Time <- format(DateAndTeam$Time, format = "%H:%M:%S")

# Use GameDate to create a TournamentRound variable
DateRoundTeam <- DateAndTeam %>%
  mutate(TournamentRound = case_when(
      GameDate >= "2023-03-14" & GameDate <= "2023-03-15" ~ "First 4", 
      GameDate >= "2023-03-16" & GameDate <= "2023-03-17" ~ "Round of 64",
      GameDate >= "2023-03-18" & GameDate <= "2023-03-19" ~ "Round of 32",
      GameDate >= "2023-03-23" & GameDate <= "2023-03-24" ~ "Sweet 16",
      GameDate >= "2023-03-25" & GameDate <= "2023-03-26" ~ "Elite 8",
      GameDate == "2023-04-01" ~ "Final 4",
      GameDate == "2023-04-03" ~ "Championship")) %>%

# Replace game start times with time categories
  mutate(Time = case_when(
    Time >= "12:00:00" & Time < "14:14:59" ~ "Early Afternoon",
    Time >= "14:15:00" & Time < "16:29:59" ~ "Mid Afternoon",
    Time >= "16:30:00" & Time < "18:44:59" ~ "Late Afternoon",
    Time >= "18:45:00" & Time < "20:59:59" ~ "Early Evening",
    Time >= "21:00:00" & Time < "22:45:00" ~ "Late Evening")) %>%

# Import game day volume
  mutate(GDayVolCol = case_when(
      GameDate == "2023-03-14" ~ 74, 
      GameDate == "2023-03-15" ~ 75, 
      GameDate == "2023-03-16" ~ 76,
      GameDate == "2023-03-17" ~ 77,
      GameDate == "2023-03-18" ~ 78,
      GameDate == "2023-03-19" ~ 79,
      GameDate == "2023-03-23" ~ 83,
      GameDate == "2023-03-24" ~ 84,
      GameDate == "2023-03-25" ~ 85,
      GameDate == "2023-03-26" ~ 86,
      GameDate == "2023-04-01" ~ 92,
      GameDate == "2023-04-03" ~ 94
            )) %>%
  rename(School = Team) %>%
  select(School, GameDate, TournamentRound, Time, GDayVolCol, IsWinner, WinningSeed, LosingSeed)

# Create a function to bring in Game Day Volume 
fun13 <- function(row) {
  school <- row["School"]  # Get the school name from the row
  column_index <- as.numeric(row["GDayVolCol"])  # Convert the value to a numeric column index
  value <- MMVolumeVar[MMVolumeVar$School == school, column_index]
  return(as.numeric(value))
}

# Return the flattened game day volume for each school
DateRoundTeam$GameDayVolume <- unlist(apply(DateRoundTeam, 1, fun13))

# Create data frame with School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed
TVariables <- DateRoundTeam %>%
  left_join(TournamentVariables, by = "School") %>%
  select(School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed)

# Mutate WinningSeed and LosingSeed to create IsFavorite column 
TVariables2 <- TVariables %>%
  mutate(SeedDifference = ifelse(IsWinner == "Yes", WinningSeed - LosingSeed, LosingSeed - WinningSeed)) %>%
  mutate(SeedDifference = abs(SeedDifference)) %>%
  mutate(Upset = ifelse(SeedDifference >= 5, "Yes", "No")) %>%
  mutate(IsFavorite = ifelse((IsWinner == "Yes" & WinningSeed > LosingSeed) |
                             (IsWinner == "No" & LosingSeed > WinningSeed), "Favorite",
                             ifelse(WinningSeed == LosingSeed, NA, "Underdog"))) %>% 
 mutate(FavoriteWin = ifelse(IsFavorite == "Yes" & IsWinner == "Yes", "Yes", "No")) %>%
  mutate(UnderdogWin = ifelse(IsFavorite == "No" & IsWinner == "Yes", "Yes", "No"))

TVariables3 <- TVariables2 %>%
   select(School, GameDate, Time, TournamentRound, GameDayVolume, IsFavorite, IsWinner, Upset, FavoriteWin, UnderdogWin)

# Read in data for enrollment 
SchoolNameToSchool <- read_csv("_data/SchoolNameToSchool.csv", show_col_types = FALSE) 
EnrollmentData <- read_csv("_data/CCIHE2021PublicData.csv", show_col_types = FALSE)
SchoolEnrollmentData <- SchoolNameToSchool %>%
  left_join(EnrollmentData, by = "SchoolName")

# Add School Size data to TVariables3
TVariables4 <- TVariables3 %>%
  left_join(SchoolEnrollmentData, by = "School")

# Read in Is Major Variable
DateRoundTeamSeedMajor <- read_csv("_data/DateRoundTeamSeedMajor.csv", show_col_types = FALSE) %>%
  select(School, Major)

# Add Is Major to TVariables4
TVariables4 <- TVariables4 %>%
  left_join(DateRoundTeamSeedMajor, by = "School", relationship =
  "many-to-many") 

# Create data frame with all variables 
VolumeFull <- TVariables4 %>%
  left_join(MMVolAvgs, by = "School") %>%
  select(-SchoolName)
VolumeFull

# A tibble: 374 × 22
   School     GameDate   Time  TournamentRound GameDayVolume IsFavorite IsWinner
   <chr>      <date>     <chr> <chr>                   <dbl> <chr>      <chr>   
 1 Pittsburg… 2023-03-14 Late… First 4                  4671 <NA>       Yes     
 2 Pittsburg… 2023-03-14 Late… First 4                  4671 <NA>       Yes     
 3 Pittsburg… 2023-03-14 Late… First 4                  4671 <NA>       Yes     
 4 Texas A&M… 2023-03-14 Late… First 4                   510 <NA>       Yes     
 5 Texas A&M… 2023-03-14 Late… First 4                   510 <NA>       Yes     
 6 Arizona S… 2023-03-15 Late… First 4                  3902 <NA>       Yes     
 7 Arizona S… 2023-03-15 Late… First 4                  3902 <NA>       Yes     
 8 Fairleigh… 2023-03-15 Late… First 4                  1641 <NA>       Yes     
 9 Fairleigh… 2023-03-15 Late… First 4                  1641 <NA>       Yes     
10 Fairleigh… 2023-03-15 Late… First 4                  1641 <NA>       Yes     
# ℹ 364 more rows
# ℹ 15 more variables: Upset <chr>, FavoriteWin <chr>, UnderdogWin <chr>,
#   SizeSetting <chr>, Control <chr>, F20Enrollment <dbl>, Major <chr>,
#   VolDuring <dbl>, VolNonTourn <dbl>, VolBefore <dbl>, VolAfter <dbl>,
#   VolPostElim <dbl>, VolPostElim1 <dbl>, VolPostElim2 <dbl>,
#   VolPostElim3 <dbl>

I now have a data frame with all of my variables of interest.

Create variables for diffences in average volume/volume

Solely comparing differences in volume between the schools is not helpful because their baselines are very different; a 1,000 mention increase for a school that typically receives 10 mentions per day is very different than a school that typically receives 1,000.

Accordingly, I will transform volume metrics into percentage differences so that I can better compare the impact of March Madness across the different schools.

Code

# Calculate difference between average volume during tournament and non-tournament
MMVolDiffs <- MMVolAvgs %>%
  mutate(
    InTournamentDifference = round(((VolDuring / VolNonTourn) * 100), digits = 2),
    DayPostElimDifference = round(((VolPostElim / VolNonTourn) * 100), digits = 2),
    ElimPlus1Difference = round(((VolPostElim1 / VolNonTourn) * 100), digits = 2),
    ElimPlus2Difference = round(((VolPostElim2 / VolNonTourn) * 100), digits = 2), 
    ElimPlus3Difference = round(((VolPostElim3 / VolNonTourn) * 100), digits = 2),
  ) %>%
  select(School, InTournamentDifference, DayPostElimDifference, ElimPlus1Difference, ElimPlus2Difference, ElimPlus3Difference)
MMVolDiffs

# A tibble: 68 × 6
   School       InTournamentDifference DayPostElimDifference ElimPlus1Difference
   <chr>                         <dbl>                 <dbl>               <dbl>
 1 University …                   303.                  92.3                80.3
 2 University …                   212.                 105.                 60.3
 3 Arizona Sta…                   185.                  73.1                82.4
 4 Duke Univer…                   206.                  62.5                72.3
 5 Utah State …                   243.                  49.8                44.4
 6 University …                   252.                  86.9                60.1
 7 North Carol…                   112.                  38.2                79.8
 8 University …                   394.                 184.                143. 
 9 Providence …                   194.                  74.9               628. 
10 Xavier Univ…                   346.                  60.4                63.8
# ℹ 58 more rows
# ℹ 2 more variables: ElimPlus2Difference <dbl>, ElimPlus3Difference <dbl>

Footnotes

https://www.hbs.edu/faculty/Pages/item.aspx?num=44778 ↩︎
https://hbswk.hbs.edu/item/diagnosing-the-flutie-effect-on-college-marketing ↩︎
https://www.forbes.com/sites/hbsworkingknowledge/2013/04/29/the-flutie-effect-how-athletic-success-boosts-college-applications/?sh=61f984206e96 ↩︎
https://www.campussonar.com/campus-sonar-expertise ↩︎
Publication of this data is forthcoming; it will be discussed during this webinar https://t.co/MK2fARTWb5 ↩︎
https://wallethub.com/blog/march-madness-statistics/11016 ↩︎
https://blog.campussonar.com/blog/leverage-the-everyday-impact-of-athletics ↩︎