Data Analytics and Computational Social Science: Kruzlic Homework 4

Bryn Kruzlic

Overview of Final Project

The dataset I am using for my final project comes from Kaggle and contains all of the songs and lyrics from Taylor Swift’s discography up until 2017. Because the discography spans many albums containing 20+ songs each, I will compare the lyrics of the first album, ‘Taylor Swift’, with the most recent album in the data set, ‘Reputation’.

library(readr)
library(tidyverse)
library(tidyselect)
library(dplyr)
library(ggplot2)
library(tidytext)
library(stringr)
library(lubridate)

Swift_lyrics <- read_csv("C:/Users/Bryn Kruzlic/OneDrive/Desktop/DACSS601/taylor_swift_lyrics.csv")
View(Swift_lyrics)

head(Swift_lyrics)

# A tibble: 6 x 7
  artist       album        track_title track_n lyric       line  year
  <chr>        <chr>        <chr>         <dbl> <chr>      <dbl> <dbl>
1 Taylor Swift Taylor Swift Tim McGraw        1 "He said ~     1  2006
2 Taylor Swift Taylor Swift Tim McGraw        1 "Put thos~     2  2006
3 Taylor Swift Taylor Swift Tim McGraw        1 "I said, ~     3  2006
4 Taylor Swift Taylor Swift Tim McGraw        1 "Just a b~     4  2006
5 Taylor Swift Taylor Swift Tim McGraw        1 "That had~     5  2006
6 Taylor Swift Taylor Swift Tim McGraw        1 "On backr~     6  2006

The variables within the data set include:

artist- character data (Taylor Swift)
album- character data (Taylor Swift, Fearless, etc.)
track_title- character data (Tim McGraw, etc.)
lyric- character data (He said the way...)
track_n- double (numeric) data (1, 2, etc.)
line- double (numeric) data (1, 2, etc.)
year- double (numeric) data (2005-2017)
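The column types listed above can be checked programmatically rather than read off the printed tibble. Below is a minimal sketch, assuming a small stand-in tibble (`toy_lyrics`, a hypothetical name) with the same column names; the lyric text in it is placeholder text, not real data.

```r
library(tibble)

# Toy stand-in for Swift_lyrics with the same column names.
# The lyric strings are invented placeholders, not real lyrics.
toy_lyrics <- tibble(
  artist      = c("Taylor Swift", "Taylor Swift"),
  album       = c("Taylor Swift", "Taylor Swift"),
  track_title = c("Tim McGraw", "Tim McGraw"),
  track_n     = c(1, 1),
  lyric       = c("placeholder line one", "placeholder line two"),
  line        = c(1, 2),
  year        = c(2006, 2006)
)

# Storage type of every column, returned as a named character vector.
col_types <- vapply(toy_lyrics, typeof, character(1))
print(col_types)
```

Running the same `vapply()` call on `Swift_lyrics` itself should report the type of each real column.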

Time to tidy the data up

library(tokenizers)

# stringr, dplyr, and tidytext are already loaded above.
# Split each lyric line into one lowercase word per row.
tidy_lyric <- Swift_lyrics %>%
  unnest_tokens(word, lyric)

# Count the total number of words in each song, largest first.
word_count <- tidy_lyric %>%
  group_by(track_title) %>%
  summarise(num_words = n()) %>%
  arrange(desc(num_words))

Visualizations for Swift Lyrics Data

At this stage, I used tidytext’s unnest_tokens() function to split the ‘lyric’ column into individual words so that I can look for trends within the lyrics. Since the data set contains multiple albums with 10+ tracks each, my research will filter out the middle albums to show the progression between the first studio album and the last one.¹
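One way the album filter described here might look is sketched below. It uses a toy tibble with placeholder lyrics, and the album labels in `albums_to_compare` are assumptions: the exact spelling and capitalization in the real data set should be checked with `unique(Swift_lyrics$album)` first.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for Swift_lyrics; lyrics are invented placeholders.
toy_lyrics <- tibble(
  album       = c("Taylor Swift", "Fearless", "reputation"),
  track_title = c("Track A", "Track B", "Track C"),
  lyric       = c("first placeholder line",
                  "second placeholder line here",
                  "third one")
)

# Assumed album labels; verify against the real data before use.
albums_to_compare <- c("Taylor Swift", "reputation")

# Keep only the two albums of interest, then split lyrics into words.
two_albums <- toy_lyrics %>%
  filter(album %in% albums_to_compare) %>%
  unnest_tokens(word, lyric)

# Number of words per album after filtering.
two_albums %>% count(album, name = "num_words")
```

Filtering before tokenizing keeps the intermediate table small, though applying the same `filter()` to an already tokenized table would give the same rows.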

word_count %>%
  ggplot() +
  geom_histogram(aes(x = num_words), fill = "#a458c4", alpha =.5) +
  ylab("Song Count") +
  xlab("Word Count Per Song") +
  ggtitle("Word Count Distribution")

Here I am visualizing the word count per song. Knowing how many words each song contains helps me dissect the lyrics and word types later in the project. At this point we can conclude that most songs fall within the 300-500 word range. This tells us how many words we are potentially working with and sets up later questions about how often words are repeated and how word choice changes.
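The 300-500 range is read off the histogram by eye, so a numeric summary could back it up. The sketch below assumes a toy table (`toy_word_count`, a hypothetical name with made-up counts) shaped like `word_count`; the numbers are illustrative and are not results from the real data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for word_count; these counts are invented for illustration.
toy_word_count <- tibble(
  track_title = paste("Track", 1:6),
  num_words   = c(250, 320, 380, 410, 470, 620)
)

# Median song length and the share of songs with 300 to 500 words.
word_count_summary <- toy_word_count %>%
  summarise(
    median_words  = median(num_words),
    share_300_500 = mean(num_words >= 300 & num_words <= 500)
  )

print(word_count_summary)
```

Swapping `toy_word_count` for `word_count` would give the share of real songs that fall inside the range.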

# Remove stop words and any remaining words of three characters or fewer.
tidy_tslyric <- tidy_lyric %>% 
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) > 3)

# Plot the 15 most frequent words across the whole discography.
tidy_tslyric %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#FFDAB8", alpha = .75) +
  xlab("Word") +
  ylab("Count") +
  ggtitle("Frequently Used Words") +
  theme_minimal()

Here, I am visualizing which words appear most frequently across the discography. The small chunk of code at the top removes stop words such as “I”, “a”, and “the” (using the stop_words list from tidytext) and then drops any remaining words of three characters or fewer, since these are unimportant for this research. Even with no tidying at all, we could reasonably conclude that these short, common words would appear the most frequently. I am interested in every word besides them.
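To make the two filtering steps concrete, here is a small sketch on a hand-made word list (`toy_words` is a hypothetical name). It uses the `stop_words` table that ships with tidytext; the input words are chosen only for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# A handful of tokens, mixing common stop words with content words.
toy_words <- tibble(
  word = c("the", "a", "i", "and", "car", "love", "tonight")
)

kept <- toy_words %>%
  # Step 1: drop every word that appears in tidytext's stop_words table.
  anti_join(stop_words, by = "word") %>%
  # Step 2: drop any remaining word with three or fewer characters.
  filter(nchar(word) > 3)

print(kept)
```

The length filter is a blunt tool: it also removes short content words, which may or may not be desirable depending on the research question.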

Limitations of Research and Visualizations Thus Far

Questions Left Unanswered?

At this point, the distinction between the albums is not yet clear. Because I struggled with the visualizations in the first place, these ggplots reference her discography as a whole rather than album by album. As my skills advance, I will be able to separate the data by the two albums and convey the same information shown above with a tighter focus.
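A possible shape for that album-by-album version is sketched below, assuming a tokenized table with an `album` column. The toy words, the album labels, and the names `toy_tidy`, `top_by_album`, and `album_plot` are all hypothetical, and faceting is only one of several reasonable layouts.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Toy tokenized data: one row per word, invented for illustration.
toy_tidy <- tibble(
  album = c(rep("Taylor Swift", 5), rep("reputation", 5)),
  word  = c("truck", "truck", "smile", "smile", "smile",
            "game", "game", "game", "crown", "truck")
)

# Count words within each album and keep the top two per album.
top_by_album <- toy_tidy %>%
  count(album, word, name = "uses") %>%
  group_by(album) %>%
  slice_max(uses, n = 2, with_ties = FALSE) %>%
  ungroup()

# One panel per album so the two word lists can be compared side by side.
album_plot <- ggplot(top_by_album, aes(x = uses, y = reorder(word, uses))) +
  geom_col(fill = "#a458c4", alpha = .5) +
  facet_wrap(~ album, scales = "free_y") +
  labs(x = "Count", y = "Word", title = "Most Frequent Words by Album")

print(album_plot)
```

With the real data, `toy_tidy` would be replaced by the stop-word-filtered table restricted to the two albums, and the `n = 2` cutoff raised to something like 15.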

What might be unclear?

I think most people will be able to digest the data as is, but those who are unfamiliar with the artist might not see why the words matter, or why the frequency with which they appear in songs is important.

How could I improve the visualizations?

I think the spacing of the graphs and the overall organization could be much improved. I am still working on the y-axis scales to make the plots cleaner and easier to understand, but at this point in the project I am working with the defaults.

Potential Research Questions

What word shows up the most overall? Are there any visible trends in words or topics in the chosen albums? How have the lyrics changed over time? What tone is visible within the selected albums?

¹ This data set was created prior to 2019; since then, she has released three more albums and two re-recordings.
