Kruzlic Class Project 2

final project 2
Qasim Abbas
dplyr
Author

Bryn Kruzlic

Published

August 18, 2023

Code
knitr::opts_chunk$set(echo = TRUE)

These are the libraries being used in the following project.

Code
library(readxl)
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(lubridate)
library(ggplot2)
library(stringr)
library(readr)
library(cluster)
library(dplyr)
library(AER)
Warning: package 'AER' was built under R version 4.2.3
Loading required package: car
Warning: package 'car' was built under R version 4.2.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.3

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: lmtest
Warning: package 'lmtest' was built under R version 4.2.3
Loading required package: zoo
Warning: package 'zoo' was built under R version 4.2.3

Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Loading required package: sandwich
Warning: package 'sandwich' was built under R version 4.2.3
Loading required package: survival
Code
library(caret)
Warning: package 'caret' was built under R version 4.2.3
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:survival':

    cluster

The following object is masked from 'package:purrr':

    lift
Code
library(dummy)
dummy 0.1.3
dummyNews()
Code
library(stargazer)

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
Code
library(glmnet)
Warning: package 'glmnet' was built under R version 4.2.3
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-7
Code
library(car)
library(lmtest)
library(sandwich)

knitr::opts_chunk$set(echo = TRUE)

Cleaning up the data

Unfortunately, this dataset does contain a fair amount of missing data in addition to more measures of music than we need. For this project, we are only focused on genre, dur (duration) and pop (popularity) when it comes to the Top Spotify charting songs from 2010 to 2019. The below code features the cleaning process, merging together different subgenres into one ‘umbrella’ genre, eliminating repeat values and omitting NAs.
Code
# Read in dataset

top10s <- read_csv("C:/Users/Bryn Kruzlic/OneDrive/Desktop/MS_DACSS/DACSS603/top10s.csv")
New names:
Rows: 603 Columns: 15
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(3): title, artist, top genre dbl (12): ...1, year, bpm, nrgy, dnce, dB, live,
val, dur, acous, spch, pop
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
View(top10s)
Code
# Filtering NAs from the dataset

top10s %>% 
  select_all() %>% 
  filter(bpm == 0 |
           nrgy == 0 |
           dnce == 0 |
           dB == 0 |
           live == 0 |
           dur == 0 |
           spch == 0 |
           pop  == 0 )
# A tibble: 5 × 15
   ...1 title artist `top genre`  year   bpm  nrgy  dnce    dB  live   val   dur
  <dbl> <chr> <chr>  <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    51 Hello Marti… big room     2010   128    98    67    -3    10    45   191
2   139 Blow… P!nk   dance pop    2012   114    92    60    -3    25    75   256
3   268 Not … Justi… dance pop    2014    86    73    59    -6    38    46   266
4   363 L.A.… Fergie dance pop    2015   202    39    48    -8    26    27   193
5   443 Mill… Adele  british so…  2016     0     0     0   -60     0     0   227
# ℹ 3 more variables: acous <dbl>, spch <dbl>, pop <dbl>
Code
summary(top10s)
      ...1          title              artist           top genre        
 Min.   :  1.0   Length:603         Length:603         Length:603        
 1st Qu.:151.5   Class :character   Class :character   Class :character  
 Median :302.0   Mode  :character   Mode  :character   Mode  :character  
 Mean   :302.0                                                           
 3rd Qu.:452.5                                                           
 Max.   :603.0                                                           
      year           bpm             nrgy           dnce      
 Min.   :2010   Min.   :  0.0   Min.   : 0.0   Min.   : 0.00  
 1st Qu.:2013   1st Qu.:100.0   1st Qu.:61.0   1st Qu.:57.00  
 Median :2015   Median :120.0   Median :74.0   Median :66.00  
 Mean   :2015   Mean   :118.5   Mean   :70.5   Mean   :64.38  
 3rd Qu.:2017   3rd Qu.:129.0   3rd Qu.:82.0   3rd Qu.:73.00  
 Max.   :2019   Max.   :206.0   Max.   :98.0   Max.   :97.00  
       dB               live            val             dur       
 Min.   :-60.000   Min.   : 0.00   Min.   : 0.00   Min.   :134.0  
 1st Qu.: -6.000   1st Qu.: 9.00   1st Qu.:35.00   1st Qu.:202.0  
 Median : -5.000   Median :12.00   Median :52.00   Median :221.0  
 Mean   : -5.579   Mean   :17.77   Mean   :52.23   Mean   :224.7  
 3rd Qu.: -4.000   3rd Qu.:24.00   3rd Qu.:69.00   3rd Qu.:239.5  
 Max.   : -2.000   Max.   :74.00   Max.   :98.00   Max.   :424.0  
     acous            spch             pop       
 Min.   : 0.00   Min.   : 0.000   Min.   : 0.00  
 1st Qu.: 2.00   1st Qu.: 4.000   1st Qu.:60.00  
 Median : 6.00   Median : 5.000   Median :69.00  
 Mean   :14.33   Mean   : 8.358   Mean   :66.52  
 3rd Qu.:17.00   3rd Qu.: 9.000   3rd Qu.:76.00  
 Max.   :99.00   Max.   :48.000   Max.   :99.00  
Code
# Repeated in the chart due to seperate releases

top10s <- top10s %>% 
  filter(title != 'The Hills' | year != 2016)
Code
# Merging subgenres together for more cohesion

genre <- top10s["top genre"]

unique(top10s$'top genre')
 [1] "neo mellow"                "detroit hip hop"          
 [3] "dance pop"                 "pop"                      
 [5] "canadian pop"              "hip pop"                  
 [7] "barbadian pop"             "atl hip hop"              
 [9] "australian pop"            "indie pop"                
[11] "art pop"                   "colombian pop"            
[13] "big room"                  "british soul"             
[15] "chicago rap"               "acoustic pop"             
[17] "permanent wave"            "boy band"                 
[19] "baroque pop"               "celtic rock"              
[21] "electro"                   "complextro"               
[23] "canadian hip hop"          "candy pop"                
[25] "alaska indie"              "folk-pop"                 
[27] "metropopolis"              "house"                    
[29] "australian hip hop"        "electropop"               
[31] "australian dance"          "hollywood"                
[33] "canadian contemporary r&b" "irish singer-songwriter"  
[35] "tropical house"            "belgian edm"              
[37] "french indie pop"          "hip hop"                  
[39] "danish pop"                "latin"                    
[41] "canadian latin"            "electronic trap"          
[43] "edm"                       "electro house"            
[45] "downtempo"                 "brostep"                  
[47] "contemporary country"      "moroccan pop"             
[49] "escape room"               "alternative r&b"          
Code
top10s <-top10s%>%separate((`top genre`),c("variable","genre"),extra='merge')
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 97 rows [5, 12, 24, 55,
58, 60, 65, 106, 108, 110, 114, 118, 119, 138, 148, 150, 153, 155, 163, 168,
...].
Code
other <- c("complextro", "downtempo", "brostep", "electro", "electronic", "escape", "folk-pop", "hollywood", "house", "irish", "permanent", "neo", "metropolis", "tropical")
top10s$genre[top10s$variable %in% other] <- 'other'
top10s$genre[top10s$variable == 'hip'] <- 'hip hop'
top10s$genre[top10s$variable == 'latin'] <- 'latin'
top10s$genre[top10s$variable == 'boy'] <- 'pop'
top10s$genre[top10s$variable == 'french'] <- 'pop'
top10s$genre[top10s$variable == 'electropop'] <- 'pop'
top10s$genre[top10s$variable == 'pop'] <- 'pop'
top10s$genre[top10s$genre == 'contemporary r&b'] <- 'r&b'
top10s$genre[top10s$genre == 'room'] <- 'other'
table(top10s$genre)

country   dance     edm hip hop   indie   latin   other     pop     r&b     rap 
      1       6       2      20       1       5      46     493       9       1 
   rock    soul 
      1      11 
Code
miss <- colSums(is.na(top10s))
print(miss[miss>0])
genre 
    6 
Code
top10s <-na.omit(top10s)

Grouping and Analysis- Defining and Representing Variables

Top Artists

This below chart represents the artists with the highest number of charting songs between 2010 and 2019.
Code
pop_artist <- top10s%>%arrange(desc(pop))%>%group_by(artist)
pop_artist
# A tibble: 596 × 16
# Groups:   artist [181]
    ...1 title   artist variable genre  year   bpm  nrgy  dnce    dB  live   val
   <dbl> <chr>   <chr>  <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1   573 "Memor… Maroo… pop      pop    2019    91    32    76    -7     8    57
 2   574 "Lose … Selen… dance    pop    2019   102    34    51    -9    21     9
 3   575 "Someo… Lewis… pop      pop    2019   110    41    50    -6    11    45
 4   576 "Se\xf… Shawn… canadian pop    2019   117    54    76    -6     9    75
 5   577 "How D… Sam S… pop      pop    2019   111    68    48    -5     8    35
 6   578 "South… Ed Sh… pop      pop    2019    98    62    86    -6     9    67
 7   579 "Tramp… SHAED  electro… pop    2019   127    46    62    -6    14    50
 8   580 "Happi… Marsh… brostep  other  2019   100    79    69    -3    17    67
 9   581 "Truth… Lizzo  escape   other  2019   158    62    72    -3    12    41
10   582 "Good … Lizzo  escape   other  2019    96    89    67    -3    74    48
# ℹ 586 more rows
# ℹ 4 more variables: dur <dbl>, acous <dbl>, spch <dbl>, pop <dbl>
Code
top_artist<-top10s%>%
  group_by(`year`)%>%
  count(artist)%>%
  mutate(prop=n/sum(n))

most_popular_artist <- top10s %>% 
  group_by(artist) %>% 
  summarize(no.of.songs = n()) %>% 
  arrange(desc(no.of.songs)) 

summarized_songs_artist <- most_popular_artist %>% 
  slice_max(no.of.songs,n=15)
view(summarized_songs_artist)

most_popular_artist <- pop_artist%>%count(artist)%>%arrange(desc(n))
most_popular_artist%>%head(15)%>%ggplot(aes(x=reorder(artist,n),y=n))+geom_col(fill="skyblue")+ggtitle("15 Most Popular Artists")+coord_flip()

Song Duration

This below chart represents the average duration of songs by genre. Not surprisingly, the majority of top charting Spotify songs are in Pop, with most of the songs averaging between 200 and 300 seconds. This means that the majority of charting Pop songs range from 3.5 to 5 minutes.
Code
duration <-top10s%>%ggplot(aes(x=reorder(genre,dur),y=dur))+geom_boxplot(fill="limegreen")+ggtitle("Song Duration")+theme_light()
duration

Code
duration <- top10s %>% 
  ggplot(aes(x = reorder(genre, dur), y = dur)) +
  geom_point() +
  ggtitle("Song Duration") +
  xlab("Genre") +
  ylab("Duration") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1.2)) +
  geom_point(color = "darkgreen")
duration

Code
summarized_data_genre <- top10s %>% 
  group_by(genre) %>% 
  summarize(avg_duration = mean(dur)) %>% 
  arrange(desc(avg_duration))
View(summarized_data_genre)

colnames(summarized_data_genre) <- c("Song Genre", "Average Song Duration")
view(dur_genre)
Error in view(dur_genre): object 'dur_genre' not found

Plot PDF and CDF

Code
var1_pdf <- prop.table(table(top10s$genre))
var1_cdf <- cumsum(var1_pdf)
var1_pdf_cdf <- data.frame(Genre = names(var1_pdf), Probability = var1_pdf, Cumulative = var1_cdf)
print(var1_pdf_cdf)
          Genre Probability.Var1 Probability.Freq  Cumulative
country country          country      0.001677852 0.001677852
dance     dance            dance      0.010067114 0.011744966
edm         edm              edm      0.003355705 0.015100671
hip hop hip hop          hip hop      0.033557047 0.048657718
indie     indie            indie      0.001677852 0.050335570
latin     latin            latin      0.008389262 0.058724832
other     other            other      0.077181208 0.135906040
pop         pop              pop      0.827181208 0.963087248
r&b         r&b              r&b      0.015100671 0.978187919
rap         rap              rap      0.001677852 0.979865772
rock       rock             rock      0.001677852 0.981543624
soul       soul             soul      0.018456376 1.000000000
Code
var2_pdf <- density(top10s$dur)
var2_pdf <- prop.table(table(top10s$dur))
var2_cdf <- cumsum(top10s$dur)


var2_pdf <- prop.table(table(top10s$dur))
var2_cdf <- cumsum(var2_pdf)
var2_pdf_cdf <- data.frame(Duration = unique(top10s$dur),
                      Probability = var2_pdf,
                      Cumulative = var2_cdf)

tibble(var2_pdf_cdf)
# A tibble: 144 × 4
   Duration Probability.Var1 Probability.Freq Cumulative
      <dbl> <fct>                       <dbl>      <dbl>
 1      217 134                       0.00168    0.00168
 2      263 148                       0.00168    0.00336
 3      200 157                       0.00168    0.00503
 4      295 158                       0.00168    0.00671
 5      221 159                       0.00336    0.0101 
 6      214 160                       0.00168    0.0117 
 7      203 162                       0.00168    0.0134 
 8      225 164                       0.00168    0.0151 
 9      216 166                       0.00168    0.0168 
10      235 169                       0.00336    0.0201 
# ℹ 134 more rows
PDF or probability density function and CDF or cumulative distribution function can help us understand the likelihood of a random variable taking on a particular value. In the example of Spotify chart data, the PDF can provide insight on how frequent these values may appear and the CDF can help us understand how these specific features can impact a songs’ relative popularity and performance.

Variable Defining and Redefining

Code
# naming the variables

genre_mapping <- unique(top10s$genre)
variable_1 <- as.numeric(factor(top10s$genre, levels = genre_mapping))
view(variable_1) # genre of the popular songs

variable_2 <- top10s$dur # duration of the popular songs 

variable_3 <- top10s$pop # popularity score of songs; used for omitted variable bias

top10s$variable_1 <- variable_1
top10s$variable_2 <- variable_2

# mean and standard deviations

mean(variable_1)
[1] 3.092282
Code
sd(variable_1)
[1] 1.427182
Code
mean(variable_2)
[1] 224.745
Code
sd(variable_2)
[1] 34.26896
Code
mean(variable_3)
[1] 66.38255
Code
sd(variable_3)
[1] 14.53045

Standard Errors

Code
standard_errors <- aggregate(dur ~ genre, data = top10s, FUN = function(x) sd(x) / sqrt(length(x)))
sufficient_observations <- aggregate(dur ~ genre, data = top10s, FUN = function(x) length(x) >= 2)


genre_proportions <- prop.table(table(top10s$genre))
genre_sample_sizes <- as.integer(table(top10s$genre))

Generating Plots

Code
average_duration <- aggregate(dur ~ genre, data = top10s, FUN = mean)
barplot(average_duration$dur, names.arg = average_duration$genre, xlab = "Genre", ylab = "Duration", main = "Average Duration by Genre", col = "steelblue")

Regression Analysis

Code
top10s$dur <- as.numeric(top10s$dur)
missing_values <- is.na(top10s$dur)
subset_data <- top10s[!missing_values, ]


barplot(table(top10s$genre, useNA = "ifany"), beside = TRUE, col = "lightgray", xlab = "Genre", ylab = "Count", main = "Top 10s - Genre Distribution")


boxplot(dur ~ genre, data = top10s, col = "maroon", xlab = "Genre", ylab = "Duration", abline(h = median(top10s$dur), col = "blue", lwd = 2, lty = 2), main = "Top 10s - Duration by Genre")

Correlations

The graphs below will allow us to see the correlation between specific variables within the dataset. For the purpose of our research, only the variables “genre”, “dur” and “bpm” will be utilized.
Code
bpm_pop <- top10s%>%ggplot(aes(x=bpm,y=pop))+geom_point()+ggtitle("BPM/Popularity")+theme_minimal()+geom_smooth(se=FALSE)
bpm_pop
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Code
cor(top10s$bpm, top10s$pop)
[1] 0.02314881
Code
dur_pop <- top10s%>%ggplot(aes(x=dur,y=pop))+geom_point()+ggtitle("Duration/Popularity")+theme_minimal()+geom_smooth(se=FALSE)
dur_pop
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Code
cor(top10s$dur, top10s$pop)
[1] -0.1056947
Code
genre_pop <- top10s%>%ggplot(aes(x=variable_1,y=pop))+geom_point()+ggtitle("Genre/Popularity")+theme_minimal()+geom_smooth(se=FALSE)
genre_pop
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 0.945
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: neighborhood radius 2.055
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 4.4549e-16
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 1
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: zero-width neighborhood. make span bigger
Warning: Computation failed in `stat_smooth()`
Caused by error in `predLoess()`:
! NA/NaN/Inf in foreign function call (arg 5)

Code
cor(top10s$variable_1, top10s$pop)
[1] -0.01021489

Customize Variable Models

Code
model_genre <- lm(pop ~ genre, data = top10s)

model_dur <- lm(pop ~ dur, data = top10s)

model_1 <- lm(pop ~ variable_1 + variable_2, data = top10s) #multiple regression model
summary(model_1)$coef
               Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 76.94574192 4.18664674 18.3788475 4.801203e-60
variable_1  -0.14312546 0.41598167 -0.3440667 7.309181e-01
variable_2  -0.04503152 0.01732417 -2.5993458 9.572657e-03
Code
model_2 <- lm(pop ~ log(variable_1) + log(variable_2), data = top10s) # log-linear model
summary(model_2)$coef
                   Estimate Std. Error    t value     Pr(>|t|)
(Intercept)     124.1855261  22.308875  5.5666423 3.938166e-08
log(variable_1)  -0.3867623   1.527494 -0.2532006 8.002009e-01
log(variable_2) -10.6205119   4.109617 -2.5843067 9.995379e-03
Code
model_3 <- lm(pop ~ I(variable_1) + I(variable_2) + I(variable_1^2) + I(variable_2^2) + I(variable_1^3) + I(variable_2^3), data = top10s) # high order polynomial 
summary(model_3)$coef
                     Estimate   Std. Error    t value  Pr(>|t|)
(Intercept)      6.348489e+01 5.489305e+01  1.1565198 0.2479375
I(variable_1)   -2.408607e+00 2.932493e+00 -0.8213513 0.4117785
I(variable_2)    1.506765e-01 6.501551e-01  0.2317547 0.8168091
I(variable_1^2)  7.025044e-01 6.774892e-01  1.0369234 0.3001971
I(variable_2^2) -8.068174e-04 2.486779e-03 -0.3244427 0.7457181
I(variable_1^3) -4.945496e-02 4.340270e-02 -1.1394442 0.2549811
I(variable_2^3)  1.060373e-06 3.078065e-06  0.3444935 0.7305982

Plot Observations

Code
#Attempting again to omit NA values

top10s <- na.omit(top10s)
top10s$genre <- na.omit(top10s$genre)
variable_1 <- na.omit(variable_1)
variable_2 <- na.omit(variable_2)
model_1 <- na.omit(model_1)
model_2 <- na.omit(model_2)
model_3 <- na.omit(model_3)

data_for_plot <- data.frame(variable_1, variable_2)
data_for_plot <- na.omit(data_for_plot)

Model 1:

Code
#Plotting the following observations- Multiple Regression Model

intercept_1 <- coef(model_1)["(Intercept)"]
coef_variable_1 <- coef(model_1)["variable_1"]
coef_variable_2 <- coef(model_1)["variable_2"]

predicted_pop_1 <- intercept_1 + coef_variable_1 * top10s$variable_1 +
                 coef_variable_2 * top10s$variable_2

ggplot(top10s, aes(x = genre, y = pop)) +
geom_point() +
geom_line(aes(x = variable_1, y = predicted_pop_1), color = "red", size = 3) +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(x = "Genre", y = "Popularity") +
     theme_bw() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(top10s, aes(x = dur, y = pop)) +
  geom_point() +
  geom_line(aes(x = variable_2, y = predicted_pop_1), color = "red", size = 3) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  labs(x = "Duration", y = "Popularity") +
     theme_bw() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'

Model 2:

Code
# Plotting the following observations- OLS Equation

intercept_ols <- coef(model_2)["(Intercept)"]
coef_variable_1 <- coef(model_2)["variable_1"]
coef_variable_2 <- coef(model_2)["variable_2"]

predicted_pop_ols <- intercept_ols + coef_variable_1 * top10s$variable_1 + coef_variable_2 * top10s$variable_2
na.rm = TRUE

ggplot(top10s, aes(x = genre, y = pop)) +
  geom_point(pch = 20, col = "green", size = 2.5) +
  geom_line(aes(x = variable_1, y = fitted(model_2)), col = "pink", lwd = 2) +
  labs(x = "Genre", y = "Popularity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
ggplot(top10s, aes(x = dur, y = pop)) +
  geom_point(pch = 20, col = "green", size = 2.5) +
  geom_line(aes(x = variable_2, y = fitted(model_2)), col = "pink", lwd = 2) +
  labs(x = "Duration", y = "Popularity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model 3:

Code
# Plotting the following observations- High Order Polynomial

top10s$predicted_pop <- predict(model_3, newdata = top10s)

ggplot(top10s, aes(x = genre, y = pop)) +
  geom_point() +
  geom_line(aes(x = variable_1, y = predicted_pop), color = "maroon", size = 1) +
  labs(x = "Genre", y = "Popularity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
ggplot(top10s, aes(x = dur, y = pop)) +
  geom_point() +
  geom_line(aes(x = variable_2, y = predicted_pop), color = "maroon", size = 1) +
  labs(x = "Duration", y = "Popularity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Finding the Best Model Specifications

Code
#Creating multiple specifications to find best model

top10s_mod1 <- lm(pop ~ variable_1 + I(variable_1^2) + I(variable_1^3), data = top10s)
  
top10s_mod2 <- lm(pop ~ variable_2 + I(variable_2^2) + I(variable_2^3), data = top10s)
  
top10s_mod3 <- lm(pop ~ variable_1 + variable_2 + variable_3, data = top10s) # exclude from future manipulations
  
top10s_mod4 <- lm(pop ~ variable_1 + variable_2 + bpm, data = top10s)

Checking for multicollinearity in the following models:

Code
# Attempting to mitigate multicollinearity in the previous models

correlation_matrix <- cor(top10s[, c("variable_1", "variable_2")])
print(correlation_matrix)
            variable_1  variable_2
variable_1  1.00000000 -0.03618429
variable_2 -0.03618429  1.00000000
Code
### correlation value of -0.03618429 suggests that the variables are not strongly correlated.

vif_results_1 <- vif(model_1)
vif_results_2 <- vif(model_2)
vif_results_3 <- vif(model_3)
print(vif_results_1)
variable_1 variable_2 
  1.001311   1.001311 
Code
print(vif_results_2)
log(variable_1) log(variable_2) 
       1.000437        1.000437 
Code
print(vif_results_3)
  I(variable_1)   I(variable_2) I(variable_1^2) I(variable_2^2) I(variable_1^3) 
       49.56246      1404.61111       359.86366      5140.77194       167.87602 
I(variable_2^3) 
     1247.37702 
Code
### very high variance inflation factor (VIF) for model_3, the higher order polynomial equation, signifying multicollinearity in the model. 

tolerance_values_1 <- 1 / vif_results_1
tolerance_values_2 <- 1 / vif_results_2
tolerance_values_3 <- 1 / vif_results_3

print(tolerance_values_1)
variable_1 variable_2 
 0.9986907  0.9986907 
Code
print(tolerance_values_2)
log(variable_1) log(variable_2) 
      0.9995633       0.9995633 
Code
print(tolerance_values_3)
  I(variable_1)   I(variable_2) I(variable_1^2) I(variable_2^2) I(variable_1^3) 
   0.0201765606    0.0007119408    0.0027788302    0.0001945233    0.0059567770 
I(variable_2^3) 
   0.0008016822 
Code
### very low (close to 0) score for model_3, signifying multicollinearity in the model. 

# Attempting to mitigate multicollinearity in the newest models

correlation_matrix <- cor(top10s[, c("variable_1", "variable_2", "pop", "bpm")])
print(correlation_matrix)
            variable_1  variable_2         pop         bpm
variable_1  1.00000000 -0.03618429 -0.01021489  0.06691989
variable_2 -0.03618429  1.00000000 -0.10569472 -0.03262715
pop        -0.01021489 -0.10569472  1.00000000  0.02314881
bpm         0.06691989 -0.03262715  0.02314881  1.00000000
Code
### correlation value less than 1 suggest that the variables are not strongly correlated.

Obtaining Robust Standard Errors and Stargazer

Code
se_1 <- sqrt(diag(vcovHC(top10s_mod1, type = "HC1")))
se_2 <- sqrt(diag(vcovHC(top10s_mod2, type = "HC1")))
se_3 <- sqrt(diag(vcovHC(top10s_mod3, type = "HC1"))) # 'Y' value of 'pop' within the dataset 'top10s' equates to the third variable and therefore, cannot be used. 
Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
Code
se_4 <- sqrt(diag(vcovHC(top10s_mod4, type = "HC1")))

rob_se <- list(sqrt(diag(vcovHC(top10s_mod1, type = "HC1"))),
 sqrt(diag(vcovHC(top10s_mod2, type = "HC1"))),
 sqrt(diag(vcovHC(top10s_mod4, type = "HC1"))))

stargazer(top10s_mod1, top10s_mod2, top10s_mod4,
title = "New Models",
type = "latex",
digits = 3,
header = FALSE,
se = rob_se,
object.names = TRUE,
model.numbers = FALSE,
column.labels = c("(i)", "(ii)", "(iiii)")) 

\begin{table}[!htbp] \centering 
  \caption{New Models} 
  \label{} 
\begin{tabular}{@{\extracolsep{5pt}}lccc} 
\\[-1.8ex]\hline 
\hline \\[-1.8ex] 
 & \multicolumn{3}{c}{\textit{Dependent variable:}} \\ 
\cline{2-4} 
\\[-1.8ex] & \multicolumn{3}{c}{pop} \\ 
 & (i) & (ii) & (iiii) \\ 
\\[-1.8ex] & top10s_mod1 & top10s_mod2 & top10s_mod4\\ 
\hline \\[-1.8ex] 
 variable\_1 & $-$1.938 &  & $-$0.157 \\ 
  & (3.354) &  & (0.535) \\ 
  & & & \\ 
 I(variable\_1$\hat{\mkern6mu}$2) & 0.574 &  &  \\ 
  & (0.794) &  &  \\ 
  & & & \\ 
 I(variable\_1$\hat{\mkern6mu}$3) & $-$0.041 &  &  \\ 
  & (0.054) &  &  \\ 
  & & & \\ 
 variable\_2 &  & 0.135 & $-$0.045$^{***}$ \\ 
  &  & (0.473) & (0.015) \\ 
  & & & \\ 
 I(variable\_2$\hat{\mkern6mu}$2) &  & $-$0.001 &  \\ 
  &  & (0.002) &  \\ 
  & & & \\ 
 I(variable\_2$\hat{\mkern6mu}$3) &  & 0.00000 &  \\ 
  &  & (0.00000) &  \\ 
  & & & \\ 
 bpm &  &  & 0.012 \\ 
  &  &  & (0.033) \\ 
  & & & \\ 
 Constant & 68.098$^{***}$ & 62.304 & 75.491$^{***}$ \\ 
  & (4.824) & (40.853) & (5.380) \\ 
  & & & \\ 
\hline \\[-1.8ex] 
Observations & 596 & 596 & 596 \\ 
R$^{2}$ & 0.002 & 0.011 & 0.012 \\ 
Adjusted R$^{2}$ & $-$0.003 & 0.006 & 0.007 \\ 
Residual Std. Error (df = 592) & 14.554 & 14.484 & 14.481 \\ 
F Statistic (df = 3; 592) & 0.358 & 2.274$^{*}$ & 2.356$^{*}$ \\ 
\hline 
\hline \\[-1.8ex] 
\textit{Note:}  & \multicolumn{3}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\ 
\end{tabular} 
\end{table} 
Code
stargazer(top10s_mod1, top10s_mod2, top10s_mod4,
          title = "Regression Models",
          type = "text",
          header = TRUE,
          digits = 4, 
          se = rob_se,
          column.labels = c("(i)", "(ii)", "(iiii)")) 

Regression Models
===============================================================
                                     Dependent variable:       
                               --------------------------------
                                             pop               
                                  (i)        (ii)      (iiii)  
                                  (1)        (2)        (3)    
---------------------------------------------------------------
variable_1                      -1.9383               -0.1570  
                                (3.3536)              (0.5349) 
                                                               
I(variable_12)                   0.5736                        
                                (0.7939)                       
                                                               
I(variable_13)                  -0.0405                        
                                (0.0536)                       
                                                               
variable_2                                  0.1348   -0.0448***
                                           (0.4729)   (0.0155) 
                                                               
I(variable_22)                             -0.0007             
                                           (0.0018)            
                                                               
I(variable_23)                             0.000001            
                                          (0.000002)           
                                                               
bpm                                                    0.0121  
                                                      (0.0334) 
                                                               
Constant                       68.0979***  62.3041   75.4906***
                                (4.8238)  (40.8534)   (5.3800) 
                                                               
---------------------------------------------------------------
Observations                      596        596        596    
R2                               0.0018     0.0114     0.0118  
Adjusted R2                     -0.0032     0.0064     0.0068  
Residual Std. Error (df = 592)  14.5540    14.4840    14.4810  
F Statistic (df = 3; 592)        0.3577    2.2738*    2.3556*  
===============================================================
Note:                               *p<0.1; **p<0.05; ***p<0.01

Linear Hypothesis Testing

Code
coeftest(top10s_mod1, vcov. = vcovHC, type = "HC1")

t test of coefficients:

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     68.097924   4.823789 14.1171   <2e-16 ***
variable_1      -1.938256   3.353584 -0.5780   0.5635    
I(variable_1^2)  0.573615   0.793874  0.7226   0.4702    
I(variable_1^3) -0.040503   0.053563 -0.7562   0.4498    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
coeftest(top10s_mod2, vcov. = vcovHC, type = "HC1")

t test of coefficients:

                   Estimate  Std. Error t value Pr(>|t|)
(Intercept)      6.2304e+01  4.0853e+01  1.5251   0.1278
variable_2       1.3475e-01  4.7288e-01  0.2850   0.7758
I(variable_2^2) -7.3572e-04  1.7722e-03 -0.4152   0.6782
I(variable_2^3)  9.6901e-07  2.1268e-06  0.4556   0.6488
Code
coeftest(top10s_mod4, vcov. = vcovHC, type = "HC1")

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 75.490554   5.379970 14.0318 < 2.2e-16 ***
variable_1  -0.157023   0.534907 -0.2936  0.769204    
variable_2  -0.044766   0.015478 -2.8921  0.003967 ** 
bpm          0.012119   0.033433  0.3625  0.717123    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
linearHypothesis(top10s_mod1,
   c("variable_1 = 0", "I(variable_1^2) = 0", "I(variable_1^3) = 0"),
   vcov. = vcovHC(top10s_mod1, type = "HC1"))
Linear hypothesis test

Hypothesis:
variable_1 = 0
I(variable_1^2) = 0
I(variable_1^3) = 0

Model 1: restricted model
Model 2: pop ~ variable_1 + I(variable_1^2) + I(variable_1^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1    595                 
2    592  3 0.1985 0.8974
Code
linearHypothesis(top10s_mod2,
   c("variable_2 = 0", "I(variable_2^2) = 0", "I(variable_2^3) = 0"),
   vcov. = vcovHC(top10s_mod2, type = "HC1"))
Linear hypothesis test

Hypothesis:
variable_2 = 0
I(variable_2^2) = 0
I(variable_2^3) = 0

Model 1: restricted model
Model 2: pop ~ variable_2 + I(variable_2^2) + I(variable_2^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)  
1    595                   
2    592  3 3.6335 0.0128 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
linearHypothesis(top10s_mod4,
   c("variable_1 = 0", "variable_2 = 0", "bpm = 0"),
   vcov. = vcovHC(top10s_mod4, type = "HC1"))
Linear hypothesis test

Hypothesis:
variable_1 = 0
variable_2 = 0
bpm = 0

Model 1: restricted model
Model 2: pop ~ variable_1 + variable_2 + bpm

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F  Pr(>F)  
1    595                    
2    592  3 2.8018 0.03925 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Written explanations: What do we see here?

A linear model works well with variables that have a directly linear relationship between one another. It can best represent a simple relationship that falls within a straight line, such as the relationship between hours worked and money earned. The ‘line of best fit’ in the case of Spotify chart data would not be best utilized by a linear relationship, as there is too many factors to consider and too many complex patterns that would be wrongfully ignored in a linear example.

The benefit of non-linear models allows for more flexibility within the growth patterns and the visualizations. These non-linear models in my example, consisting both of polynomial regression and OLS models, provide a more accurate fit to the chart data and to the patterns we are able to see. In the case of model_2, the relationship between duration and popularity, there seems to be a negative relationship as shown by the pink line. From this plot, we are able to see the relationship between duration and popularity, in which the most popular songs (rated on a scale of 0 to 100) have a duration that is under 3 minutes. As the duration increases, the chance of the song charting decreases.

Citations

Henrique, L. (2019, December 26). Top Spotify songs from 2010-2019 - by Year. Kaggle. https://www.kaggle.com/datasets/leonardopena/top-spotify-songs-from-20102019-by-year

Mermovich, A. (2023, April 28). Spotify Hits (2010-2019) Analytics

Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. https://CRAN.R-project.org/package=stargazer