Kruzlic Homework 5

HW5 First Attempt

Bryn Kruzlic
2022-04-24

Overview of Final Project

The dataset I am using for my final project comes from Kaggle and contains all of the songs and lyrics from Taylor Swift’s discography up until 2017. Because the data set covers a wide range of albums containing 20+ songs each, I will begin by comparing the lyrics of the first album, ‘Taylor Swift’, with the most recent album in the data set, ‘Reputation’. Later in my research, I will introduce the middle albums to further emphasize the trends across the discography and the themes within each album individually.


Locate the File

library(readr)  # provides read_csv()
Swift_lyrics <- read_csv("C:/Users/Bryn Kruzlic/OneDrive/Desktop/DACSS601/taylor_swift_lyrics.csv")
View(Swift_lyrics)  # opens the interactive data viewer (RStudio)

Preview of the Data

head(Swift_lyrics)
# A tibble: 6 x 7
  artist       album        track_title track_n lyric       line  year
  <chr>        <chr>        <chr>         <dbl> <chr>      <dbl> <dbl>
1 Taylor Swift Taylor Swift Tim McGraw        1 "He said ~     1  2006
2 Taylor Swift Taylor Swift Tim McGraw        1 "Put thos~     2  2006
3 Taylor Swift Taylor Swift Tim McGraw        1 "I said, ~     3  2006
4 Taylor Swift Taylor Swift Tim McGraw        1 "Just a b~     4  2006
5 Taylor Swift Taylor Swift Tim McGraw        1 "That had~     5  2006
6 Taylor Swift Taylor Swift Tim McGraw        1 "On backr~     6  2006

The variables within the data set include:

  1. artist: character data (Taylor Swift)
  2. album: character data (Taylor Swift, Fearless, etc.)
  3. track_title: character data (Tim McGraw, etc.)
  4. track_n: double (numeric) data (track number, starting at 1)
  5. lyric: character data (one lyric line per row, e.g. "He said the way...")
  6. line: double (numeric) data (line number, starting at 1)
  7. year: double (numeric) data (2006-2017)

Tidy Data

Given the size of the dataset, many words do not need to be included in our research. For example, stop words in the lyrics (articles, pronouns, and other very common words such as “I”, “a”, “we”, “the”, etc.) can be dropped, as they would clearly be the most frequent words in almost any song that could be analyzed.

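library(stringr)
library(dplyr)
library(tidytext)  # provides unnest_tokens() and the stop_words data set

# Split each lyric line into one lowercase word per row
tidy_lyric <- Swift_lyrics %>%
  unnest_tokens(word, lyric)

# Count the words in each song, longest songs first
word_count <- tidy_lyric %>%
  group_by(track_title) %>%
  summarise(num_words = n()) %>%
  arrange(desc(num_words))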

Visualizations

At this stage, I used the unnest_tokens() function to split the ‘lyric’ column into individual words (tokens) in order to find trends within the lyrics. Since the data set contains multiple albums with 10+ tracks each, my research will filter out the middle albums to show the progression between the first studio album and the last one in the data set (see note 1 under Citations).
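To make the tokenization step concrete, here is a minimal sketch on a made-up two-line data set. The tibble `toy_lyrics` and its contents are hypothetical and only mimic the shape of the real data; `unnest_tokens()` comes from the tidytext package.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line data set in the same shape as Swift_lyrics
toy_lyrics <- tibble(
  track_title = c("Song A", "Song A"),
  line        = c(1, 2),
  lyric       = c("He said hello", "I said goodbye, goodbye")
)

# One row per word; by default unnest_tokens() lowercases the text,
# strips punctuation, and drops the original lyric column
toy_words <- toy_lyrics %>%
  unnest_tokens(word, lyric)

toy_words
```

Each word keeps the `track_title` and `line` of the lyric it came from, which is what makes the later per-song and per-year counts possible.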

Word Count Distribution

library(ggplot2)

# Histogram of the number of words per song
word_count %>%
  ggplot() +
  geom_histogram(aes(x = num_words), fill = "#a458c4", alpha = .5) +
  ylab("Song Count") +
  xlab("Word Count Per Song") +
  ggtitle("Word Count Distribution")

Here I am visualizing the word count per song, which maps out the number of words in each song so that I can dissect the lyrics and word types within the songs later in the project. At this point, we can conclude that the word count per song tends to fall within the 300-500 range. This tells us how many words we are potentially working with, and it sets up the later questions of how often words are repeated within these songs and how the words may change.
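One way to check a range like this numerically, rather than reading it off the histogram, is sketched below. The vector `num_words` holds made-up values standing in for `word_count$num_words`; it is not the real data.

```r
# Hypothetical per-song word counts standing in for word_count$num_words
num_words <- c(210, 320, 350, 410, 455, 480, 530, 610)

# Minimum, quartiles, median, mean, and maximum
summary(num_words)

# Share of songs whose word count falls between 300 and 500 (inclusive)
share_in_range <- mean(num_words >= 300 & num_words <= 500)
share_in_range
```

With the real data, replacing `num_words` with `word_count$num_words` would give the share of songs that actually fall in the 300-500 range.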

Frequently Used Words

# Remove stop words, then keep only words longer than three characters
tidy_tslyric <- tidy_lyric %>% 
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) > 3)

# Plot the 15 most frequent remaining words across the whole discography
tidy_tslyric %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#FFDAB8", alpha = .5) +
  xlab("Word") +
  ylab("Total Count") +
  ggtitle("Frequently Used Words") +
  theme_minimal()

Here, I am visualizing how frequently words appear within the discography and which ones appear the most. The small chunk of code at the top filters out stop words such as “I”, “a”, “the”, etc., along with any word of three characters or fewer, since these are unimportant for this research. Even with no tidying at all, we could reasonably conclude that these short, common words would appear the most frequently. I am interested in every word besides them.
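The effect of the two filters can be seen on a small made-up word list. This is only a sketch: `toy_words` is hypothetical, and `stop_words` is the data set shipped with the tidytext package.

```r
library(dplyr)
library(tidytext)  # ships the stop_words data set

# Hypothetical tokens, one word per row
toy_words <- tibble(word = c("i", "the", "love", "you", "time", "cry", "baby"))

kept <- toy_words %>%
  anti_join(stop_words, by = "word") %>%  # drop common English stop words
  filter(nchar(word) > 3)                 # keep words of four or more characters

kept$word
```

Note that the length filter also removes short words that are not stop words (here, "cry"), which is a trade-off of using `nchar(word) > 3` as a second filter.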

What are some of the trends we see based on the albums?

Since analyzing each word individually would take too much time, we can use facet wraps and grids to compare multiple variables in one space, which makes it easier to compare the overall trends (in this case, the words and how frequently they are used).

This section includes all of the albums in the data set, for more clarity about the trends over time. Each album is represented by its year, to show the changes (or lack thereof) over time.

# Top 10 words for each album year
popwords <- tidy_tslyric %>%
  count(year, word, sort = TRUE) %>%
  group_by(year) %>%
  slice_head(n = 10) %>%
  ungroup() %>%
  arrange(year, n) %>%
  mutate(row = row_number())

library(ggplot2)
# One panel per album year, with horizontal bars
popwords %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#83CC94", alpha = .5) +
  labs(x = "Words", y = "Number of Times Sung", subtitle = "The longer the bar, the more frequently the word is used") +
  ggtitle("Popular Words by Album Year") +
  facet_wrap(~year, scales = "free") +
  coord_flip() +
  theme_classic()

Limitations of Research

What questions are left unanswered?

At this point, the distinction between the albums is not yet clear. Because of various struggles with the visualizations in the first place, these ggplots reference her discography as a whole rather than album by album. As my skills advance, I will be able to separate the data by the two albums and convey the same information shown above with a tighter focus.
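A minimal sketch of how the data could be narrowed to the two albums is shown below. The tibble `toy_tidy` is hypothetical and only mimics the shape of `tidy_tslyric`; the exact spelling of the album names in the CSV is an assumption, so the match is done case-insensitively.

```r
library(dplyr)

# Hypothetical rows in the shape of tidy_tslyric (one word per row)
toy_tidy <- tibble(
  album = c("Taylor Swift", "Taylor Swift", "Fearless",
            "reputation", "reputation", "reputation"),
  word  = c("love", "love", "love", "love", "baby", "baby")
)

# Albums to compare, written in lowercase because the spelling
# used in the real CSV is an assumption
albums_to_compare <- c("taylor swift", "reputation")

album_counts <- toy_tidy %>%
  filter(tolower(album) %in% albums_to_compare) %>%  # keep the two albums only
  count(album, word, sort = TRUE)                    # word counts per album

album_counts
```

The same `filter()` line could be placed in front of the existing plotting code so that each chart only covers the first and last albums.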

What might be unclear?

I think most people will be able to digest the data as is, but those who are unfamiliar with the artist might not care about the importance of the words, nor the frequency in which the words appear in songs.

How could I improve the visualizations?

I think the spacing of the graphs and the overall organization could be much improved. I am still working on the scales of the y-axis to make the plots cleaner and easier to read, but at this point in the project, I am working with the defaults.
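One possible improvement to the faceted chart is to sort the bars within each panel by count instead of alphabetically. The sketch below uses `reorder_within()` and `scale_x_reordered()` from the tidytext package on made-up counts; `toy_pop` is hypothetical and only mimics the shape of `popwords`.

```r
library(dplyr)
library(ggplot2)
library(tidytext)  # reorder_within() and scale_x_reordered()

# Hypothetical counts in the shape of popwords
toy_pop <- tibble(
  year = c(2006, 2006, 2006, 2017, 2017, 2017),
  word = c("love", "time", "baby", "love", "time", "baby"),
  n    = c(30, 20, 10, 5, 25, 15)
)

p <- toy_pop %>%
  # Order the words by n separately inside each year
  mutate(word = reorder_within(word, n, year)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = "#83CC94", alpha = .5) +
  scale_x_reordered() +  # removes the suffix that reorder_within() appends
  facet_wrap(~year, scales = "free") +
  coord_flip() +
  labs(x = "Words", y = "Number of Times Sung") +
  theme_classic()

p
```

This is one option among several; the unused `row` column in `popwords` suggests a manual ordering was also being considered.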

Potential Research Questions

  1. What word shows up the most overall?
  2. Are there any visible trends in words or topics in the chosen albums?
  3. How have the lyrics changed over time?
  4. What tone is visible within the selected albums?

Answers to Research Questions

  1. Based on the third visualization, the words that show up the most are “love”, “time”, “baby”, “feel”, and “stay”. Even without the context needed to judge sentiment, these words suggest an overall feeling of joy, intimacy, devotion, etc. Some of the words are ambiguous on their own; however, in the context of the visualizations, they read as overwhelmingly positive.
  2. It is not surprising to find that the trends in words over time have remained relatively consistent. Themes of longing for someone, youth, dancing, and the idea of ‘forever’ appear multiple times throughout each album. From an objective standpoint, we can observe:
     1. At first glance, there is not a huge shift in lexical distinctiveness between the words over time. We can reasonably assume that love plays a strong role in the inspiration for the albums and, at first glance, is not represented negatively. Only some of the words remain ambiguous enough to hint at other emotions.
     2. The comparison of the first album, ‘Taylor Swift’, and the last album, ‘reputation’, shows many of the same words being repeated.

Conclusion

Citations


  1. This data set was created prior to 2019; she has since released three more albums and two re-recordings.

  2. Since we cannot see the full content and analyze it lyric-by-lyric ourselves, we look at the words objectively to discuss the mood, which in some cases might not be accurate.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Kruzlic (2022, April 27). Data Analytics and Computational Social Science: Kruzlic Homework 5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombkruzlichw5attempt1/

BibTeX citation

@misc{kruzlic2022kruzlic,
  author = {Kruzlic, Bryn},
  title = {Data Analytics and Computational Social Science: Kruzlic Homework 5},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombkruzlichw5attempt1/},
  year = {2022}
}