In April 2013, Doug J. Chung published a research paper1 endeavoring to quantify and model the impact of the so-called “Flutie Effect” - the spillover impact that athletics has on the quantity and quality of applicants to US colleges (named after Boston College quarterback Doug Flutie, who in 1984 threw a Hail Mary touchdown pass to secure victory with six seconds in a game against the University of Miami, qualifying the team to compete in the Cotton Bowl2). The legacy of Flutie’s on-field success has been credited with catalyzing a 30% increase in undergraduate applications at Boston College, though institutional officials have argued that other, non-athletic factors were the “true” reason behind the increase. This trend of increased applications after prominent athletic successes, however, has been observed at other institutions including Georgetown, Northwestern, Boise State, and Texas Christian University.
Chung was able to find a statistically significant relationship between athletic success and both the quantity (number of applications) and quality (SAT scores of those applicants) of applicants to a given institution; his findings included that when a school rises from being classified as a mediocre football program to great one, applications rise by 18.7%3.
While in my job I do not focus directly on applications to US colleges, I do work with on campus marketing and communications leaders, and generate insights from data derived from online conversation about their schools so that they can better understand that conversation in ways that can help them to develop, refine, and align their communications strategies with the goals of the institutions that they serve4. A trend I frequently observe while analyzing this online conversation data is the impact that athletics has on the volume and reach of mentions related to schools.
In benchmarking work that we’ve undertaken in order to better understand online conversation trends in higher education, we’ve found that, on average, 63% of all online conversation related to schools is about their athletics - and for some schools, this proportion can be as high as 91%5. And while the proportion of overall online conversation relating to different colleges is already quite high, there are certain significant events within the realm of college sports that send mention volume even higher. One such event is March Madness.
Run by the NCAA every year during the month of March, the men’s version of the March Madness tournament is widely regarded as one of biggest American sporting events. 68 teams participate in the three-week long single-elimination tournament, which produced $1.14 billion in annual revenue for the NCAA in 20226. One of the most exciting parts of the tournament are the “Cinderella stories”; teams with a low ranking that manage to eliminate higher ranked teams.
In 2022, I became interested in developing a greater understanding of the impact that March Madness could have on a school’s online conversation volume and began tracking that year’s Cinderella - the Saint Peter’s College Peacocks, a #15 seed that made it all the way to the Elite 8. In their Sweet 16 game against Purdue, Saint Peter’s had more mentions per minute than their average monthly volume for the prior five months. Their total mention volume during the month of March was 12,380 times more than that same monthly average7.
After completing that work, I became increasingly curious about how participation in the tournament affected conversation volume and reach for all teams, not just Cinderella stories. I wanted to investigate the factors that might influence more/less conversation volume for any given team, and indeed, whether conversation about the team was the only facet of conversation about a college that increased - or if the school as a whole received more conversation overall. I became very interested in learning more about the overall trends that relate to participation in events like March Madness and how we could use that information to provide strategic insights to colleges across the US.
Bringing it back to the Flutie Effect - if on-field athletic success can impact enrollment at US colleges, it stands to reason that online conversation about athletic success (or athletic events in general) could have a similar impact. Accordingly, it would be prudent to further explore the impact an event like the men’s March Madness tournament has on online conversation volume for US colleges.
Specifically, the research question that I want to explore is:
What impact does participation in the men’s March Madness tournament have on online conversation about the schools involved?
And more specifically, some nuances to this question that I want to consider are:
How does mention volume differ when schools are participating in the tournament?
Is there a difference in reach for schools participating in the tournament?
Is there a correlation between increase in mention volume and increase in reach?
What is the baseline of conversation volume/reach increase that can be expected when a team participates in the tournament?
What variables contribute to increased/decreased mention volume/reach during the tournament?
Size of school
Major / mid-major
Round of the tournament
Favorite / underdog
Win/Loss
Upset
Time of day
If there is a change in mention volume/reach during the tournament, how long does this change last?
Is this length of time similar or different for volume and reach?
Do the same variables that contribute to increased/decreased mention volume also impact the length of that increase?
Answers to these questions will provide a better understanding of the overall impact that events such as the men’s March Madness tournament have on conversation volume/reach and can subsequently serve to inform strategies and tactics that can be implemented by marketing and communications professionals who focus on digital media in higher education. That is, schools can use this information to develop further hypotheses to be tested that would facilitate a goal of leveraging and maximizing the impact of athletic participation and success on their online conversation.
My overarching hypothesis is that participation in the men’s March Madness tournament increases online conversation about the schools that are involved. I further hypothesize that:
There is a correlation between increases in mention volume and overall/average reach; that is, it’s not just the number of mentions about each school that increases, but also the number of people who see those mentions.
The increase in conversation (volume and reach) for each school will be influenced by seven factors: school size, classification as a major or mid-major school, which round of the tournament the game happens during, whether the team was the favorite (ranked higher) or the underdog (ranked lower), whether the team won or lost, whether the game was considered a major upset, and the time of day the game took place.
The change in mention volume/reach will return to baseline within a matter of days after a team’s exit from the tournament.
The variables I hypothesize contribute to variations in how volume/reach increase will not also impact the length of that increase.
What impact does participation in the men’s March Madness tournament have on online conversation about the schools involved?
The data used for this project was collected by me using the Brandwatch Consumer Research platform. I first wrote Boolean to search for mentions about all 68 schools with teams in the men’s March Madness tournament, using the same parameters to construct each one (specifically, the school’s full name, shortened name(s), acronym(s), mascot(s), website URL, athletics website URL, and Twitter usernames for the school’s flagship, flagship athletics, and men’s basketball team accounts). The Boolean was then used to run a query within Brandwatch to pull all relevant, retrievable online mentions for the 68 schools made between January 1 and April 30 2023. This search returned 8,099,045 mentions.
One of the features included in Brandwatch is the ability to segment data into categories, and this can be done in a wide range of manners (using Boolean or a variety of pre-built parameters such as content sources and mention types). I used this feature to create a variety of categories that would be necessary for me to be able to parse the full dataset into manageable chunks that could be used to answer my research question.
First I need to do some data manipulation in order to create my variables of interest.
Data Manipulation For Mention Volume
I need to calculate the average daily volume for the following time spans: * Outside of Tournament Play (January 1 to March 13; two days after elimination to May 12) * During the Tournament (March 14 to day after elimination) * Before the Tournament (January 1 to March 13) * After the Tournament (Two days after elimination to May 12)
And then I need to add columns with the daily volume for the following times: * Post Elimination (two days after) * Post Plus 1 (three days after) * Post Plus 2 (four days after) * Post Plus 3 (five days after)
Code
# Read in dataTournamentVariables <-read_csv("_data/MMVariables.csv", show_col_types =FALSE)MMVolume <-read_csv("_data/MentionsPerDayALL.csv", show_col_types =FALSE)# Join dataMMVolumeVar <- MMVolume %>%left_join(TournamentVariables, by ="School")# Calculate pre-tournament average daily volume MMVolumeVar$VolBefore <- (round(rowMeans(MMVolumeVar[, 2:73]), digits =0))# Create a function to pull the rows needed to calculate "during tournament" for each of the schoolsfun1 <-function(row) { start_col1 <-as.numeric(row[134]) # Convert the value to a numeric column index end_col1 <-as.numeric(row[135]) # Convert the value to a numeric column index values1 <-as.numeric(row[start_col1:end_col1]) # The rows to calculate the average average1 <- (round(mean(values1, na.rm =TRUE), digits =0))return(average1)}# Return the average daily volume for "during tournament" for each school MMVolumeVar$VolDuring <-apply(MMVolumeVar, 1, fun1)# Create a function to pull the rows needed to calculate "after tournament" for each of the schoolsfun2 <-function(row) { start_col2 <-as.numeric(row[136]) # Convert the value to a numeric column index end_col2 <-as.numeric(row[137]) # Convert the value to a numeric column index values2 <-as.numeric(row[start_col2:end_col2]) # The rows to calculate the average average2 <- (round(mean(values2, na.rm =TRUE), digits =0))return(average2)}# Return the average daily volume for "during tournament" for each school MMVolumeVar$VolAfter <-apply(MMVolumeVar, 1, fun2)# Create a function to pull the rows needed for post-elimination volume fun3 <-function(row) { column_index3 <-as.numeric(row[136]) # Convert the value to a numeric column index value3 <-as.numeric(row[column_index3]) # Get the value from the specified columnreturn(value3)}# Return the post-elimination day volume for each school MMVolumeVar$VolPostElim <-apply(MMVolumeVar, 1, fun3)# Create a function to pull the rows needed for non-tournament volumefun14 <-function(row) { start_col14 <-as.numeric(row[142]) # Start column index for range 1 end_col14 <-as.numeric(row[143]) # End column index for range 1 start_col14a <-as.numeric(row[140]) # Start column index for range 2 end_col14a <-as.numeric(row[141]) # End column index for range 2 values14 <-as.numeric(row[start_col14:end_col14]) # Values for range 1 values14a <-as.numeric(row[start_col14a:end_col14a]) # Values for range 2 values14b <-c(values14, values14a) # Combine the values from both ranges average14 <- (round(mean(values14b, na.rm =TRUE), digits =0))return(average14)}# Return the average non-tournament daily volume for each school MMVolumeVar$VolNonTourn <-apply(MMVolumeVar, 1, fun14) # Create a function to pull the rows needed for post-elimination +1 volume fun4 <-function(row) { column_index4 <-as.numeric(row[137]) # Convert the value to a numeric column index value4 <-as.numeric(row[column_index4]) # Get the value from the specified columnreturn(value4)}# Return the post-elimination +1 volume for each school MMVolumeVar$VolPostElim1 <-apply(MMVolumeVar, 1, fun4)# Create a function to pull the rows needed for post-elimination +2 volume fun5 <-function(row) { column_index5 <-as.numeric(row[138]) # Convert the value to a numeric column index value5 <-as.numeric(row[column_index5]) # Get the value from the specified columnreturn(value5)}# Return the post-elimination +2 volume for each school MMVolumeVar$VolPostElim2 <-apply(MMVolumeVar, 1, fun5)# Create a function to pull the rows needed for post-elimination +3 volume fun6 <-function(row) { column_index6 <-as.numeric(row[139]) # Convert the value to a numeric column index value6 <-as.numeric(row[column_index6]) # Get the value from the specified columnreturn(value6)}# Return the post-elimination +3 volume for each school MMVolumeVar$VolPostElim3 <-apply(MMVolumeVar, 1, fun6)# Create a data frame with the columns for the calculated meansMMVolAvgs <- MMVolumeVar %>%select(School, VolDuring, VolNonTourn, VolBefore, VolAfter, VolPostElim, VolPostElim1, VolPostElim2, VolPostElim3)MMVolAvgs
Data Manipulation for Independent Variables and Game Day Volume
Independent variables include:
Game Date
Round
Seed
Is Major
Is Favorite
Is Winner
Is Upset
Favorite Winner
Underdog Winner
Time of Day
And I will be also adding game day volume during this step.
Code
# Read in game scores, dates, etc.DateAndTeam <-read_csv("_data/MBBMMGameData.csv", show_col_types =FALSE)DateAndTeam$GameDate <-as.Date(DateAndTeam$Date, format ="%m/%d/%Y")DateAndTeam$Time <- DateAndTeam$Time <-gsub(" ET", "", DateAndTeam$Time)DateAndTeam$Time <-strptime(DateAndTeam$Time, format="%I:%M%p")DateAndTeam$Time <-format(DateAndTeam$Time, format ="%H:%M:%S")# Use GameDate to create a TournamentRound variableDateRoundTeam <- DateAndTeam %>%mutate(TournamentRound =case_when( GameDate >="2023-03-14"& GameDate <="2023-03-15"~"First 4", GameDate >="2023-03-16"& GameDate <="2023-03-17"~"Round of 64", GameDate >="2023-03-18"& GameDate <="2023-03-19"~"Round of 32", GameDate >="2023-03-23"& GameDate <="2023-03-24"~"Sweet 16", GameDate >="2023-03-25"& GameDate <="2023-03-26"~"Elite 8", GameDate =="2023-04-01"~"Final 4", GameDate =="2023-04-03"~"Championship")) %>%# Replace game start times with time categoriesmutate(Time =case_when( Time >="12:00:00"& Time <"14:14:59"~"Early Afternoon", Time >="14:15:00"& Time <"16:29:59"~"Mid Afternoon", Time >="16:30:00"& Time <"18:44:59"~"Late Afternoon", Time >="18:45:00"& Time <"20:59:59"~"Early Evening", Time >="21:00:00"& Time <"22:45:00"~"Late Evening")) %>%# Import game day volumemutate(GDayVolCol =case_when( GameDate =="2023-03-14"~74, GameDate =="2023-03-15"~75, GameDate =="2023-03-16"~76, GameDate =="2023-03-17"~77, GameDate =="2023-03-18"~78, GameDate =="2023-03-19"~79, GameDate =="2023-03-23"~83, GameDate =="2023-03-24"~84, GameDate =="2023-03-25"~85, GameDate =="2023-03-26"~86, GameDate =="2023-04-01"~92, GameDate =="2023-04-03"~94 )) %>%rename(School = Team) %>%select(School, GameDate, TournamentRound, Time, GDayVolCol, IsWinner, WinningSeed, LosingSeed)# Create a function to bring in Game Day Volume fun13 <-function(row) { school <- row["School"] # Get the school name from the row column_index <-as.numeric(row["GDayVolCol"]) # Convert the value to a numeric column index value <- MMVolumeVar[MMVolumeVar$School == school, column_index]return(as.numeric(value))}# Return the flattened game day volume for each schoolDateRoundTeam$GameDayVolume <-unlist(apply(DateRoundTeam, 1, fun13))# Create data frame with School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeedTVariables <- DateRoundTeam %>%left_join(TournamentVariables, by ="School") %>%select(School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed)# Mutate WinningSeed and LosingSeed to create IsFavorite column TVariables2 <- TVariables %>%mutate(SeedDifference =ifelse(IsWinner =="Yes", WinningSeed - LosingSeed, LosingSeed - WinningSeed)) %>%mutate(SeedDifference =abs(SeedDifference)) %>%mutate(Upset =ifelse(SeedDifference >=5, "Yes", "No")) %>%mutate(IsFavorite =ifelse((IsWinner =="Yes"& WinningSeed > LosingSeed) | (IsWinner =="No"& LosingSeed > WinningSeed), "Favorite",ifelse(WinningSeed == LosingSeed, NA, "Underdog"))) %>%mutate(FavoriteWin =ifelse(IsFavorite =="Yes"& IsWinner =="Yes", "Yes", "No")) %>%mutate(UnderdogWin =ifelse(IsFavorite =="No"& IsWinner =="Yes", "Yes", "No"))TVariables3 <- TVariables2 %>%select(School, GameDate, Time, TournamentRound, GameDayVolume, IsFavorite, IsWinner, Upset, FavoriteWin, UnderdogWin)# Read in data for enrollment SchoolNameToSchool <-read_csv("_data/SchoolNameToSchool.csv", show_col_types =FALSE) EnrollmentData <-read_csv("_data/CCIHE2021PublicData.csv", show_col_types =FALSE)SchoolEnrollmentData <- SchoolNameToSchool %>%left_join(EnrollmentData, by ="SchoolName")# Add School Size data to TVariables3TVariables4 <- TVariables3 %>%left_join(SchoolEnrollmentData, by ="School")# Read in Is Major VariableDateRoundTeamSeedMajor <-read_csv("_data/DateRoundTeamSeedMajor.csv", show_col_types =FALSE) %>%select(School, Major)# Add Is Major to TVariables4TVariables4 <- TVariables4 %>%left_join(DateRoundTeamSeedMajor, by ="School", relationship ="many-to-many") # Create data frame with all variables VolumeFull <- TVariables4 %>%left_join(MMVolAvgs, by ="School") %>%select(-SchoolName)VolumeFull
# A tibble: 374 × 22
School GameDate Time TournamentRound GameDayVolume IsFavorite IsWinner
<chr> <date> <chr> <chr> <dbl> <chr> <chr>
1 Pittsburg… 2023-03-14 Late… First 4 4671 <NA> Yes
2 Pittsburg… 2023-03-14 Late… First 4 4671 <NA> Yes
3 Pittsburg… 2023-03-14 Late… First 4 4671 <NA> Yes
4 Texas A&M… 2023-03-14 Late… First 4 510 <NA> Yes
5 Texas A&M… 2023-03-14 Late… First 4 510 <NA> Yes
6 Arizona S… 2023-03-15 Late… First 4 3902 <NA> Yes
7 Arizona S… 2023-03-15 Late… First 4 3902 <NA> Yes
8 Fairleigh… 2023-03-15 Late… First 4 1641 <NA> Yes
9 Fairleigh… 2023-03-15 Late… First 4 1641 <NA> Yes
10 Fairleigh… 2023-03-15 Late… First 4 1641 <NA> Yes
# ℹ 364 more rows
# ℹ 15 more variables: Upset <chr>, FavoriteWin <chr>, UnderdogWin <chr>,
# SizeSetting <chr>, Control <chr>, F20Enrollment <dbl>, Major <chr>,
# VolDuring <dbl>, VolNonTourn <dbl>, VolBefore <dbl>, VolAfter <dbl>,
# VolPostElim <dbl>, VolPostElim1 <dbl>, VolPostElim2 <dbl>,
# VolPostElim3 <dbl>
I now have a data frame with all of my variables of interest.
Create variables for diffences in average volume/volume
Solely comparing differences in volume between the schools is not helpful because their baselines are very different; a 1,000 mention increase for a school that typically receives 10 mentions per day is very different than a school that typically receives 1,000.
Accordingly, I will transform volume metrics into percentage differences so that I can better compare the impact of March Madness across the different schools.
---title: "Check In 2"author: "Darron Bunt"description: "Final Project Check In 2 - Darron Bunt"date: "05/13/2023"format: html: toc: true code-fold: true code-copy: true code-tools: truecategories: - finalproject---::: panel-tabset```{r}library(tidyverse)library(ggplot2)library(dplyr)library(lubridate)```### Research QuestionIn April 2013, Doug J. Chung published a research paper[^1] endeavoring to quantify and model the impact of the so-called "Flutie Effect" - the spillover impact that athletics has on the quantity and quality of applicants to US colleges (named after Boston College quarterback Doug Flutie, who in 1984 threw a Hail Mary touchdown pass to secure victory with six seconds in a game against the University of Miami, qualifying the team to compete in the Cotton Bowl[^2]). The legacy of Flutie's on-field success has been credited with catalyzing a 30% increase in undergraduate applications at Boston College, though institutional officials have argued that other, non-athletic factors were the "true" reason behind the increase. This trend of increased applications after prominent athletic successes, however, has been observed at other institutions including Georgetown, Northwestern, Boise State, and Texas Christian University.[^1]: <https://www.hbs.edu/faculty/Pages/item.aspx?num=44778>[^2]: <https://hbswk.hbs.edu/item/diagnosing-the-flutie-effect-on-college-marketing>Chung was able to find a statistically significant relationship between athletic success and both the quantity (number of applications) and quality (SAT scores of those applicants) of applicants to a given institution; his findings included that when a school rises from being classified as a mediocre football program to great one, applications rise by 18.7%[^3]. [^3]: <https://www.forbes.com/sites/hbsworkingknowledge/2013/04/29/the-flutie-effect-how-athletic-success-boosts-college-applications/?sh=61f984206e96>While in my job I do not focus directly on applications to US colleges, I do work with on campus marketing and communications leaders, and generate insights from data derived from online conversation about their schools so that they can better understand that conversation in ways that can help them to develop, refine, and align their communications strategies with the goals of the institutions that they serve[^4]. A trend I frequently observe while analyzing this online conversation data is the impact that athletics has on the volume and reach of mentions related to schools. [^4]: <https://www.campussonar.com/campus-sonar-expertise>In benchmarking work that we've undertaken in order to better understand online conversation trends in higher education, we've found that, on average, 63% of all online conversation related to schools is about their athletics - and for some schools, this proportion can be as high as 91%[^5]. And while the proportion of overall online conversation relating to different colleges is already quite high, there are certain significant events within the realm of college sports that send mention volume even higher. One such event is March Madness.[^5]: Publication of this data is forthcoming; it will be discussed during this webinar <https://t.co/MK2fARTWb5>Run by the NCAA every year during the month of March, the men's version of the March Madness tournament is widely regarded as one of biggest American sporting events. 68 teams participate in the three-week long single-elimination tournament, which produced $1.14 billion in annual revenue for the NCAA in 2022[^6]. One of the most exciting parts of the tournament are the "Cinderella stories"; teams with a low ranking that manage to eliminate higher ranked teams. [^6]: <https://wallethub.com/blog/march-madness-statistics/11016>In 2022, I became interested in developing a greater understanding of the impact that March Madness could have on a school's online conversation volume and began tracking that year's Cinderella - the Saint Peter's College Peacocks, a #15 seed that made it all the way to the Elite 8. In their Sweet 16 game against Purdue, Saint Peter's had more mentions *per minute* than their average monthly volume for the prior five months. Their total mention volume during the month of March was *12,380 times* more than that same monthly average[^7].[^7]: <https://blog.campussonar.com/blog/leverage-the-everyday-impact-of-athletics>After completing that work, I became increasingly curious about how participation in the tournament affected conversation volume and reach for all teams, not just Cinderella stories. I wanted to investigate the factors that might influence more/less conversation volume for any given team, and indeed, whether conversation about the team was the only facet of conversation about a college that increased - or if the school as a whole received more conversation overall. I became very interested in learning more about the overall trends that relate to participation in events like March Madness and how we could use that information to provide strategic insights to colleges across the US.Bringing it back to the Flutie Effect - if on-field athletic success can impact enrollment at US colleges, it stands to reason that online conversation about athletic success (or athletic events in general) could have a similar impact. Accordingly, it would be prudent to further explore the impact an event like the men's March Madness tournament has on online conversation volume for US colleges. Specifically, the research question that I want to explore is:**What impact does participation in the men's March Madness tournament have on online conversation about the schools involved?** And more specifically, some nuances to this question that I want to consider are:* How does mention volume differ when schools are participating in the tournament? + Is there a difference in reach for schools participating in the tournament? + Is there a correlation between increase in mention volume and increase in reach?* What is the baseline of conversation volume/reach increase that can be expected when a team participates in the tournament?* What variables contribute to increased/decreased mention volume/reach during the tournament? + Size of school + Major / mid-major + Round of the tournament + Favorite / underdog + Win/Loss + Upset + Time of day* If there is a change in mention volume/reach during the tournament, how long does this change last? + Is this length of time similar or different for volume and reach? + Do the same variables that contribute to increased/decreased mention volume also impact the length of that increase?Answers to these questions will provide a better understanding of the overall impact that events such as the men's March Madness tournament have on conversation volume/reach and can subsequently serve to inform strategies and tactics that can be implemented by marketing and communications professionals who focus on digital media in higher education. That is, schools can use this information to develop further hypotheses to be tested that would facilitate a goal of leveraging and maximizing the impact of athletic participation and success on their online conversation. ### HypothesisMy overarching hypothesis is that **participation in the men's March Madness tournament increases online conversation about the schools that are involved.** I further hypothesize that:* There is a correlation between increases in mention volume and overall/average reach; that is, it's not just the number of mentions about each school that increases, but also the number of people who see those mentions. * The increase in conversation (volume and reach) for each school will be influenced by seven factors: school size, classification as a major or mid-major school, which round of the tournament the game happens during, whether the team was the favorite (ranked higher) or the underdog (ranked lower), whether the team won or lost, whether the game was considered a major upset, and the time of day the game took place. * The change in mention volume/reach will return to baseline within a matter of days after a team's exit from the tournament. * The variables I hypothesize contribute to variations in how volume/reach increase will **not** also impact the length of that increase.**What impact does participation in the men's March Madness tournament have on online conversation about the schools involved?** ### Data CollectionThe data used for this project was collected by me using the Brandwatch Consumer Research platform. I first wrote Boolean to search for mentions about all 68 schools with teams in the men's March Madness tournament, using the same parameters to construct each one (specifically, the school's full name, shortened name(s), acronym(s), mascot(s), website URL, athletics website URL, and Twitter usernames for the school's flagship, flagship athletics, and men's basketball team accounts). The Boolean was then used to run a query within Brandwatch to pull all relevant, retrievable online mentions for the 68 schools made between January 1 and April 30 2023. This search returned 8,099,045 mentions. One of the features included in Brandwatch is the ability to segment data into categories, and this can be done in a wide range of manners (using Boolean or a variety of pre-built parameters such as content sources and mention types). I used this feature to create a variety of categories that would be necessary for me to be able to parse the full dataset into manageable chunks that could be used to answer my research question. ### Variables of Interest First I need to do some data manipulation in order to create my variables of interest.#### Data Manipulation For Mention Volume I need to calculate the average daily volume for the following time spans:* Outside of Tournament Play (January 1 to March 13; two days after elimination to May 12)* During the Tournament (March 14 to day after elimination)* Before the Tournament (January 1 to March 13)* After the Tournament (Two days after elimination to May 12)And then I need to add columns with the daily volume for the following times:* Post Elimination (two days after)* Post Plus 1 (three days after)* Post Plus 2 (four days after)* Post Plus 3 (five days after)```{r}# Read in dataTournamentVariables <-read_csv("_data/MMVariables.csv", show_col_types =FALSE)MMVolume <-read_csv("_data/MentionsPerDayALL.csv", show_col_types =FALSE)# Join dataMMVolumeVar <- MMVolume %>%left_join(TournamentVariables, by ="School")# Calculate pre-tournament average daily volume MMVolumeVar$VolBefore <- (round(rowMeans(MMVolumeVar[, 2:73]), digits =0))# Create a function to pull the rows needed to calculate "during tournament" for each of the schoolsfun1 <-function(row) { start_col1 <-as.numeric(row[134]) # Convert the value to a numeric column index end_col1 <-as.numeric(row[135]) # Convert the value to a numeric column index values1 <-as.numeric(row[start_col1:end_col1]) # The rows to calculate the average average1 <- (round(mean(values1, na.rm =TRUE), digits =0))return(average1)}# Return the average daily volume for "during tournament" for each school MMVolumeVar$VolDuring <-apply(MMVolumeVar, 1, fun1)# Create a function to pull the rows needed to calculate "after tournament" for each of the schoolsfun2 <-function(row) { start_col2 <-as.numeric(row[136]) # Convert the value to a numeric column index end_col2 <-as.numeric(row[137]) # Convert the value to a numeric column index values2 <-as.numeric(row[start_col2:end_col2]) # The rows to calculate the average average2 <- (round(mean(values2, na.rm =TRUE), digits =0))return(average2)}# Return the average daily volume for "during tournament" for each school MMVolumeVar$VolAfter <-apply(MMVolumeVar, 1, fun2)# Create a function to pull the rows needed for post-elimination volume fun3 <-function(row) { column_index3 <-as.numeric(row[136]) # Convert the value to a numeric column index value3 <-as.numeric(row[column_index3]) # Get the value from the specified columnreturn(value3)}# Return the post-elimination day volume for each school MMVolumeVar$VolPostElim <-apply(MMVolumeVar, 1, fun3)# Create a function to pull the rows needed for non-tournament volumefun14 <-function(row) { start_col14 <-as.numeric(row[142]) # Start column index for range 1 end_col14 <-as.numeric(row[143]) # End column index for range 1 start_col14a <-as.numeric(row[140]) # Start column index for range 2 end_col14a <-as.numeric(row[141]) # End column index for range 2 values14 <-as.numeric(row[start_col14:end_col14]) # Values for range 1 values14a <-as.numeric(row[start_col14a:end_col14a]) # Values for range 2 values14b <-c(values14, values14a) # Combine the values from both ranges average14 <- (round(mean(values14b, na.rm =TRUE), digits =0))return(average14)}# Return the average non-tournament daily volume for each school MMVolumeVar$VolNonTourn <-apply(MMVolumeVar, 1, fun14) # Create a function to pull the rows needed for post-elimination +1 volume fun4 <-function(row) { column_index4 <-as.numeric(row[137]) # Convert the value to a numeric column index value4 <-as.numeric(row[column_index4]) # Get the value from the specified columnreturn(value4)}# Return the post-elimination +1 volume for each school MMVolumeVar$VolPostElim1 <-apply(MMVolumeVar, 1, fun4)# Create a function to pull the rows needed for post-elimination +2 volume fun5 <-function(row) { column_index5 <-as.numeric(row[138]) # Convert the value to a numeric column index value5 <-as.numeric(row[column_index5]) # Get the value from the specified columnreturn(value5)}# Return the post-elimination +2 volume for each school MMVolumeVar$VolPostElim2 <-apply(MMVolumeVar, 1, fun5)# Create a function to pull the rows needed for post-elimination +3 volume fun6 <-function(row) { column_index6 <-as.numeric(row[139]) # Convert the value to a numeric column index value6 <-as.numeric(row[column_index6]) # Get the value from the specified columnreturn(value6)}# Return the post-elimination +3 volume for each school MMVolumeVar$VolPostElim3 <-apply(MMVolumeVar, 1, fun6)# Create a data frame with the columns for the calculated meansMMVolAvgs <- MMVolumeVar %>%select(School, VolDuring, VolNonTourn, VolBefore, VolAfter, VolPostElim, VolPostElim1, VolPostElim2, VolPostElim3)MMVolAvgs```This data frame shows:* Average daily volume + during the tournament + outside of the tournament + before the tournament + after the tournament * Daily volume + two days after elimination + three days after elimination + four days after elimination + five days after elimination#### Data Manipulation for Independent Variables and Game Day VolumeIndependent variables include:* Game Date* Round * Seed * Is Major* Is Favorite * Is Winner* Is Upset * Favorite Winner* Underdog Winner* Time of DayAnd I will be also adding game day volume during this step. ```{r}# Read in game scores, dates, etc.DateAndTeam <-read_csv("_data/MBBMMGameData.csv", show_col_types =FALSE)DateAndTeam$GameDate <-as.Date(DateAndTeam$Date, format ="%m/%d/%Y")DateAndTeam$Time <- DateAndTeam$Time <-gsub(" ET", "", DateAndTeam$Time)DateAndTeam$Time <-strptime(DateAndTeam$Time, format="%I:%M%p")DateAndTeam$Time <-format(DateAndTeam$Time, format ="%H:%M:%S")# Use GameDate to create a TournamentRound variableDateRoundTeam <- DateAndTeam %>%mutate(TournamentRound =case_when( GameDate >="2023-03-14"& GameDate <="2023-03-15"~"First 4", GameDate >="2023-03-16"& GameDate <="2023-03-17"~"Round of 64", GameDate >="2023-03-18"& GameDate <="2023-03-19"~"Round of 32", GameDate >="2023-03-23"& GameDate <="2023-03-24"~"Sweet 16", GameDate >="2023-03-25"& GameDate <="2023-03-26"~"Elite 8", GameDate =="2023-04-01"~"Final 4", GameDate =="2023-04-03"~"Championship")) %>%# Replace game start times with time categoriesmutate(Time =case_when( Time >="12:00:00"& Time <"14:14:59"~"Early Afternoon", Time >="14:15:00"& Time <"16:29:59"~"Mid Afternoon", Time >="16:30:00"& Time <"18:44:59"~"Late Afternoon", Time >="18:45:00"& Time <"20:59:59"~"Early Evening", Time >="21:00:00"& Time <"22:45:00"~"Late Evening")) %>%# Import game day volumemutate(GDayVolCol =case_when( GameDate =="2023-03-14"~74, GameDate =="2023-03-15"~75, GameDate =="2023-03-16"~76, GameDate =="2023-03-17"~77, GameDate =="2023-03-18"~78, GameDate =="2023-03-19"~79, GameDate =="2023-03-23"~83, GameDate =="2023-03-24"~84, GameDate =="2023-03-25"~85, GameDate =="2023-03-26"~86, GameDate =="2023-04-01"~92, GameDate =="2023-04-03"~94 )) %>%rename(School = Team) %>%select(School, GameDate, TournamentRound, Time, GDayVolCol, IsWinner, WinningSeed, LosingSeed)# Create a function to bring in Game Day Volume fun13 <-function(row) { school <- row["School"] # Get the school name from the row column_index <-as.numeric(row["GDayVolCol"]) # Convert the value to a numeric column index value <- MMVolumeVar[MMVolumeVar$School == school, column_index]return(as.numeric(value))}# Return the flattened game day volume for each schoolDateRoundTeam$GameDayVolume <-unlist(apply(DateRoundTeam, 1, fun13))# Create data frame with School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeedTVariables <- DateRoundTeam %>%left_join(TournamentVariables, by ="School") %>%select(School, GameDate, Time, TournamentRound, GameDayVolume, IsWinner, WinningSeed, LosingSeed)# Mutate WinningSeed and LosingSeed to create IsFavorite column TVariables2 <- TVariables %>%mutate(SeedDifference =ifelse(IsWinner =="Yes", WinningSeed - LosingSeed, LosingSeed - WinningSeed)) %>%mutate(SeedDifference =abs(SeedDifference)) %>%mutate(Upset =ifelse(SeedDifference >=5, "Yes", "No")) %>%mutate(IsFavorite =ifelse((IsWinner =="Yes"& WinningSeed > LosingSeed) | (IsWinner =="No"& LosingSeed > WinningSeed), "Favorite",ifelse(WinningSeed == LosingSeed, NA, "Underdog"))) %>%mutate(FavoriteWin =ifelse(IsFavorite =="Yes"& IsWinner =="Yes", "Yes", "No")) %>%mutate(UnderdogWin =ifelse(IsFavorite =="No"& IsWinner =="Yes", "Yes", "No"))TVariables3 <- TVariables2 %>%select(School, GameDate, Time, TournamentRound, GameDayVolume, IsFavorite, IsWinner, Upset, FavoriteWin, UnderdogWin)# Read in data for enrollment SchoolNameToSchool <-read_csv("_data/SchoolNameToSchool.csv", show_col_types =FALSE) EnrollmentData <-read_csv("_data/CCIHE2021PublicData.csv", show_col_types =FALSE)SchoolEnrollmentData <- SchoolNameToSchool %>%left_join(EnrollmentData, by ="SchoolName")# Add School Size data to TVariables3TVariables4 <- TVariables3 %>%left_join(SchoolEnrollmentData, by ="School")# Read in Is Major VariableDateRoundTeamSeedMajor <-read_csv("_data/DateRoundTeamSeedMajor.csv", show_col_types =FALSE) %>%select(School, Major)# Add Is Major to TVariables4TVariables4 <- TVariables4 %>%left_join(DateRoundTeamSeedMajor, by ="School", relationship ="many-to-many") # Create data frame with all variables VolumeFull <- TVariables4 %>%left_join(MMVolAvgs, by ="School") %>%select(-SchoolName)VolumeFull```I now have a data frame with all of my variables of interest. #### Create variables for diffences in average volume/volumeSolely comparing differences in volume between the schools is not helpful because their baselines are very different; a 1,000 mention increase for a school that typically receives 10 mentions per day is very different than a school that typically receives 1,000. Accordingly, I will transform volume metrics into percentage differences so that I can better compare the impact of March Madness across the different schools. ```{r}# Calculate difference between average volume during tournament and non-tournamentMMVolDiffs <- MMVolAvgs %>%mutate(InTournamentDifference =round(((VolDuring / VolNonTourn) *100), digits =2),DayPostElimDifference =round(((VolPostElim / VolNonTourn) *100), digits =2),ElimPlus1Difference =round(((VolPostElim1 / VolNonTourn) *100), digits =2),ElimPlus2Difference =round(((VolPostElim2 / VolNonTourn) *100), digits =2), ElimPlus3Difference =round(((VolPostElim3 / VolNonTourn) *100), digits =2), ) %>%select(School, InTournamentDifference, DayPostElimDifference, ElimPlus1Difference, ElimPlus2Difference, ElimPlus3Difference)MMVolDiffs```:::