Data Analytics and Computational Social Science: Kruzlic Homework 5

Bryn Kruzlic

Overview of Final Project

The dataset I am using for my final project comes from Kaggle and contains all of the songs and lyrics from Taylor Swift’s discography up until 2017. Because the discography spans several albums with 20+ songs each, I will be comparing the lyrics of the first album, ‘Taylor Swift’, and the most recent album in the data set, ‘Reputation’. Later in my research, I will introduce the middle albums to further emphasize the trends across the discography and the themes of each album individually.

library(tidyverse)  # loads readr, dplyr, ggplot2, and stringr
library(tidytext)   # unnest_tokens() and the stop_words table

Locate the File

Swift_lyrics <- read_csv("C:/Users/Bryn Kruzlic/OneDrive/Desktop/DACSS601/taylor_swift_lyrics.csv")
View(Swift_lyrics)

Preview of the Data

head(Swift_lyrics)

# A tibble: 6 x 7
  artist       album        track_title track_n lyric       line  year
  <chr>        <chr>        <chr>         <dbl> <chr>      <dbl> <dbl>
1 Taylor Swift Taylor Swift Tim McGraw        1 "He said ~     1  2006
2 Taylor Swift Taylor Swift Tim McGraw        1 "Put thos~     2  2006
3 Taylor Swift Taylor Swift Tim McGraw        1 "I said, ~     3  2006
4 Taylor Swift Taylor Swift Tim McGraw        1 "Just a b~     4  2006
5 Taylor Swift Taylor Swift Tim McGraw        1 "That had~     5  2006
6 Taylor Swift Taylor Swift Tim McGraw        1 "On backr~     6  2006

The variables within the data set include:

artist - character data (Taylor Swift)
album - character data (Taylor Swift, Fearless, etc.)
track_title - character data (Tim McGraw, etc.)
lyric - character data (He said the way..)
track_n - double (numeric) data (1, etc.)
line - double (numeric) data (1, etc.)
year - double (numeric) data (2006-2017)

Tidy Data

Given the size of the dataset, there are many elements that do not need to be included in our research. For example, common stop words in the lyrics such as “I”, “a”, “we”, and “the” do not need to be included, as those words would clearly be the most frequent in any song that could be analyzed.

library(dplyr)
library(tidytext)

# Split each lyric line into one lowercase word per row
tidy_lyric <- Swift_lyrics %>%
  unnest_tokens(word, lyric)

# Count the words in each song, longest songs first
word_count <- tidy_lyric %>%
  group_by(track_title) %>%
  summarise(num_words = n()) %>%
  arrange(desc(num_words))

Visualizations

At this stage, I used the unnest_tokens() function to split the ‘lyric’ column into individual words in order to find trends within the lyrics. Since the data set contains multiple albums with 10+ tracks each, my research will filter out the middle albums to show the progression between the first studio album and the last one.¹

Word Count Distribution

word_count %>%
  ggplot() +
  geom_histogram(aes(x = num_words), bins = 30, fill = "#a458c4", alpha = .5) +
  ylab("Song Count") +
  xlab("Word Count Per Song") +
  ggtitle("Word Count Distribution")

This histogram shows the number of words in each song, which gives me a baseline before I dissect the lyrics and word types later in the project. Most songs appear to fall within the 300-500 word range. This tells us roughly how many words we are working with per song and sets up later questions about how often words are repeated and how the vocabulary changes.
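
The 300-500 range above is read off the histogram by eye. As a rough sketch of how it could be checked numerically, the block below summarises a small made-up table (`word_count_demo` is a hypothetical stand-in with invented values, not the real `word_count`).

```r
library(dplyr)

# Toy stand-in for the word_count table; the values are made up
word_count_demo <- tibble(
  track_title = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  num_words   = c(250, 320, 410, 480, 610)
)

# Median, quartiles, and the share of songs inside the 300-500 band
count_summary <- word_count_demo %>%
  summarise(
    median_words  = median(num_words),
    q25           = quantile(num_words, 0.25, names = FALSE),
    q75           = quantile(num_words, 0.75, names = FALSE),
    share_300_500 = mean(num_words >= 300 & num_words <= 500)
  )

count_summary
```

Running the same summarise() call on the real `word_count` table would show whether the middle half of the songs actually sits between 300 and 500 words.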

Frequently Used Words

# Remove stop words and words of three or fewer characters
tidy_tslyric <- tidy_lyric %>%
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) > 3)

# Plot the 15 most frequent remaining words across all songs
tidy_tslyric %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#FFDAB8", alpha = .5) +
  xlab("Word") +
  ylab("Count Across All Songs") +
  ggtitle("Frequently Used Words") +
  theme_minimal()

Here, I am visualizing which words appear most frequently across the discography. The small chunk of code at the top filters out stop words such as “I”, “a”, and “the”, along with any word of three or fewer characters, since these are unimportant in this research. Even without any tidying, we could reasonably conclude that these short, common words would appear the most frequently. I am interested in every word besides them.
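
To make the effect of the two filters easier to see, here is a small sketch on a single made-up lyric line (the sentence is invented for illustration and is not taken from the data set). It assumes the same `stop_words` table from tidytext that the code above uses.

```r
library(dplyr)
library(tidytext)

# One made-up lyric line, not taken from the data set
demo <- tibble(lyric = "I said the summer was the best night of my life")

# One lowercase word per row
demo_words <- demo %>%
  unnest_tokens(word, lyric)

demo_kept <- demo_words %>%
  anti_join(stop_words, by = "word") %>%  # drop stop words such as "the"
  filter(nchar(word) > 3)                 # drop words of three or fewer characters

demo_kept$word
```

Note that the length filter also removes short words that are not stop words, so some short but meaningful words may be dropped along with the filler.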

What are some of the trends we see based on the albums?

Since analyzing each word individually would take too much time, we can use facet wraps and grids to place several panels side by side, which makes it easier to compare trends overall (in this case, words and how frequently they are used).

Popular Words by Year

This section includes all of the albums to give a clearer picture of the trends over time. Each album is represented by its release year, to show the changes (or lack thereof) over time.

# Ten most frequent words for each album year
popwords <- tidy_tslyric %>%
  group_by(year) %>%
  count(word, sort = TRUE) %>%
  slice(seq_len(10)) %>%
  ungroup() %>%
  arrange(year, n) %>%
  mutate(row = row_number())

# One panel per album year, with horizontal bars
popwords %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#83CC94", alpha = .5) +
  labs(x = "Words", y = "# Word is Sung", subtitle = "the longer the bar, the more frequently the word is used") +
  ggtitle("Popular Words by Album Year") +
  facet_wrap(~year, scales = "free") +
  coord_flip() +
  theme_classic()
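
The `row` column computed above is not used by the plot, so within each panel the bars follow alphabetical order rather than frequency. One possible way to sort the bars inside each facet is tidytext's reorder_within() together with scale_x_reordered(). The sketch below uses made-up counts (`popwords_demo` is hypothetical) rather than the real `popwords` table.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Made-up counts standing in for popwords
popwords_demo <- tibble(
  year = c(2006, 2006, 2006, 2017, 2017, 2017),
  word = c("love", "time", "baby", "time", "love", "stay"),
  n    = c(30, 20, 10, 25, 15, 5)
)

p <- popwords_demo %>%
  # order words by n separately inside each year
  mutate(word = reorder_within(word, n, year)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#83CC94", alpha = .5) +
  scale_x_reordered() +  # removes the suffix that reorder_within() appends
  facet_wrap(~year, scales = "free") +
  coord_flip() +
  labs(x = "Words", y = "# Word is Sung")

p
```

The same word can then sit at a different height in each panel, which makes the ranking for each album year easier to read.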

Limitations of Research

What questions are left unanswered?

At this point, the distinction between the albums is not yet clear. Because of various struggles with the visualizations, these plots reference her discography as a whole rather than album by album. As my skills advance, I will be able to separate the data by the two albums and present the same information shown above with a tighter focus.
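
A minimal sketch of what that two-album separation might look like is below. It runs on made-up rows (`tidy_demo` is a hypothetical stand-in for `tidy_tslyric`), and the album spellings are assumptions that should be checked against unique(Swift_lyrics$album) before use.

```r
library(dplyr)

# Made-up rows standing in for tidy_tslyric; album spellings are assumptions
tidy_demo <- tibble(
  album = c("Taylor Swift", "Taylor Swift", "Fearless",
            "reputation", "reputation", "reputation"),
  word  = c("love", "love", "love", "love", "baby", "baby")
)

two_albums <- c("Taylor Swift", "reputation")

album_words <- tidy_demo %>%
  filter(album %in% two_albums) %>%  # keep only the first and last album
  count(album, word, sort = TRUE)    # word frequency within each album

album_words
```

The resulting table could be passed to the same ggplot code used earlier, with facet_wrap(~album) in place of facet_wrap(~year).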

What might be unclear?

I think most people will be able to digest the data as is, but those who are unfamiliar with the artist might not care about the importance of the words or the frequency with which they appear in songs.

How could I improve the visualizations?

I think the spacing of the graphs and the overall organization could be much improved. I am still working on the y-axis scales to make them cleaner and easier to understand, but at this point in the project I am working with the defaults.

Potential Research Questions

What word shows up the most overall?
Are there any visible trends in words or topics in the chosen albums?
How have the lyrics changed over time?
What tone is visible within the selected albums?

Answers to Research Questions

Based on the third visualization, the words that show up the most are “love”, “time”, “baby”, “feel”, and “stay”. Even without context for the sentiment, these words suggest an overall feeling of joy, intimacy, and devotion. Some of the words are ambiguous on their own; in the context of the visualizations, however, they read as overwhelmingly positive.
It is not surprising to find that the trends in words over time have remained relatively consistent. Themes of longing for someone, youth, dancing, and the idea of ‘forever’ appear multiple times throughout each album. Looking only at the words, we can characterize:

Albums 1 and 2 as the ‘longing’ albums, in which love is seen as distant yet achievable, with strong tones of ‘hope’, ‘feel/feeling’, and ‘time’.
Albums 3 and 4 as the ‘present/realistic’ albums, in which the words change slightly, with strong tones of ‘time’, ‘trouble’, ‘grow’, and ‘someday’ that may present love, or even heartbreak, in a less favorable light.²
Albums 5 and 6 as the ‘transition’ albums, in which words such as ‘stay’, ‘waiting’, and ‘getaway’, along with other location words, make strong appearances, potentially alluding to transitions within the singer’s life, straying from some of the themes in the second set of albums and lining up more closely with the first set.
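
The tones attributed to each pair of albums above come from reading the word lists by eye. As a hedged sketch of one way to cross-check them, the block below joins made-up word counts (`words_demo` is hypothetical) to the Bing sentiment lexicon available through tidytext's get_sentiments(). Using this lexicon is an assumption on my part and not part of the analysis above. Words that are missing from the lexicon simply drop out of the join.

```r
library(dplyr)
library(tidytext)

# Made-up word counts; only the column layout mirrors popwords
words_demo <- tibble(
  year = c(2006, 2006, 2012, 2012),
  word = c("love", "hope", "trouble", "love"),
  n    = c(30, 12, 18, 20)
)

tone_by_year <- words_demo %>%
  # keep only words the lexicon labels as positive or negative
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(year, sentiment) %>%
  summarise(total = sum(n), .groups = "drop")

tone_by_year
```

A lexicon scores single words without context, so the same caution in the second footnote applies here.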

At first glance, there is not a huge shift in lexical distinctiveness over time. We can reasonably assume that love plays a strong role in the inspiration for the albums and, at first glance, is not represented negatively. Only some of the words remain ambiguous enough to hint at other emotions.
The comparison of the first album, ‘Taylor Swift’, and the last album, ‘Reputation’, shows many of the same words being repeated.
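
How much two albums' word lists overlap can be expressed as a number. The sketch below uses made-up top-word lists (`top_first` and `top_last` are hypothetical, not the real results) and computes the shared words and a simple Jaccard overlap, the size of the intersection divided by the size of the union.

```r
# Made-up top-word lists for the first and last album
top_first <- c("love", "time", "baby", "hope", "feel")
top_last  <- c("love", "time", "baby", "stay", "getaway")

# Words that appear on both lists
shared <- intersect(top_first, top_last)

# Share of all distinct words that the two lists have in common
jaccard <- length(shared) / length(union(top_first, top_last))

shared
jaccard
```

A value near 1 would mean the two albums share almost all of their top words, while a value near 0 would mean they share almost none.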

Conclusion

Citations

1. This data set was created prior to 2019; she has since released three more albums and two re-recordings.
2. Since we cannot see the full content and analyze lyric by lyric ourselves, we can only look at the words on their own to discuss the mood, which might not always be accurate.
