HW4 First Attempt
The dataset I am using for my final project comes from Kaggle and contains the songs and lyrics from Taylor Swift’s discography up to 2017. Because the discography spans many albums with 10+ tracks each, I will compare the lyrics of the first album, ‘Taylor Swift’, with those of the most recent album in the data set, ‘Reputation’.
head(Swift_lyrics)
# A tibble: 6 x 7
  artist       album        track_title track_n lyric      line  year
  <chr>        <chr>        <chr>         <dbl> <chr>     <dbl> <dbl>
1 Taylor Swift Taylor Swift Tim McGraw        1 "He said ~    1  2006
2 Taylor Swift Taylor Swift Tim McGraw        1 "Put thos~    2  2006
3 Taylor Swift Taylor Swift Tim McGraw        1 "I said, ~    3  2006
4 Taylor Swift Taylor Swift Tim McGraw        1 "Just a b~    4  2006
5 Taylor Swift Taylor Swift Tim McGraw        1 "That had~    5  2006
6 Taylor Swift Taylor Swift Tim McGraw        1 "On backr~    6  2006
At this stage, I tokenized the ‘lyric’ column, splitting each line into individual words, in order to find trends within the lyrics. Since the data set contains multiple albums with 10+ tracks each, my research will filter out the middle albums to show the progression between the first studio album and the last one in the data set.[1]
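The tokenizing step itself is not shown in this post. A minimal sketch of how it might look, assuming the tidytext package's `unnest_tokens()` is the ‘token’ feature in question, is given here. The toy data frame and its placeholder text are invented for illustration and are not real lyrics; the names `tidy_lyric`, `word_count`, and `num_words` match the code used in the plots, but how they were actually built is an assumption.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for Swift_lyrics: placeholder text, not real lyrics
toy_lyrics <- data.frame(
  album       = c("Taylor Swift", "Taylor Swift", "Reputation"),
  track_title = c("Song A", "Song A", "Song C"),
  lyric       = c("one two three", "four five", "six seven eight nine"),
  stringsAsFactors = FALSE
)

# One row per word; unnest_tokens() lowercases text and strips punctuation by default
tidy_lyric <- toy_lyrics %>%
  unnest_tokens(word, lyric)

# Number of words in each song (assumed shape of word_count)
word_count <- tidy_lyric %>%
  count(album, track_title, name = "num_words")

word_count
```

Because `unnest_tokens()` keeps the other columns, each word stays attached to its album and track, which is what makes per-song and per-album counts possible later.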
library(dplyr)
library(ggplot2)

# Histogram of the number of words in each song
word_count %>%
  ggplot() +
  geom_histogram(aes(x = num_words), bins = 30, fill = "#a458c4", alpha = .5) +
  ylab("Song Count") +
  xlab("Word Count Per Song") +
  ggtitle("Word Count Distribution")
This histogram shows the distribution of word counts per song, which gives a sense of how much text there is to work with before I dissect the lyrics and word types later in the project. At this point, most songs appear to fall within the 300-500 word range. Knowing this, we can see roughly how many words we are working with, and later look at how often words are repeated within songs and how word choice changes.
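A reading taken from a histogram can be checked numerically. The sketch below uses a toy `word_count` with invented numbers, not values from the real data set, to show one way of summarizing the counts and measuring the share of songs that fall in a chosen range.

```r
# Toy stand-in for word_count; these numbers are invented for illustration
word_count <- data.frame(
  track_title = c("Song A", "Song B", "Song C", "Song D", "Song E"),
  num_words   = c(250, 320, 410, 480, 600)
)

# Five-number summary plus the mean
summary(word_count$num_words)

# Share of songs with between 300 and 500 words (inclusive)
in_range <- word_count$num_words >= 300 & word_count$num_words <= 500
share_in_range <- mean(in_range)
share_in_range
```

Run against the real `word_count`, the same two calls would show how much of the discography the 300-500 range covers.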
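library(tidytext)  # provides the stop_words lexicon

# Drop stop words and any word of three characters or fewer
tidy_tslyric <- tidy_lyric %>%
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) > 3)

# Plot the 15 most frequent remaining words across the whole discography
tidy_tslyric %>%
  count(word, sort = TRUE) %>%
  top_n(15, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#FFDAB8", alpha = .75) +
  xlab("Word") +
  ylab("Count") +
  ggtitle("Frequently Used Words") +
  theme_minimal()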
Here, I am visualizing which words appear most frequently across the discography. The small chunk of code at the top filters out stop words such as “I”, “a”, and “the”, along with any word of three characters or fewer, since these are unimportant for this research. Without any tidying, these stop words and short words would almost certainly appear the most frequently; I am interested in every word besides them.
Questions Left Unanswered?
At this point, the differences between the albums are not yet clear. Because of various struggles with the visualizations, these ggplots describe her discography as a whole rather than album by album. As my skills advance, I will separate the data by the two albums and present the same information shown above with a tighter focus.
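One possible way to get that album-by-album view is to filter to the two albums and facet the bar chart. This is only a sketch: the toy data and placeholder words are invented, and the album labels "Taylor Swift" and "Reputation" are assumed to match how the albums are spelled in the data set.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for tidy_tslyric with an album column; words are placeholders
toy_words <- data.frame(
  album = c(rep("Taylor Swift", 4), rep("Fearless", 2), rep("Reputation", 3)),
  word  = c("alpha", "alpha", "beta", "gamma",
            "alpha", "beta",
            "delta", "delta", "alpha"),
  stringsAsFactors = FALSE
)

# Keep only the first and last albums, then count words within each album
album_counts <- toy_words %>%
  filter(album %in% c("Taylor Swift", "Reputation")) %>%
  count(album, word, sort = TRUE)

# One panel per album; horizontal bars keep the word labels readable
album_plot <- ggplot(album_counts) +
  geom_col(aes(x = n, y = word), fill = "#FFDAB8", alpha = .75) +
  facet_wrap(~ album, scales = "free_y") +
  xlab("Count") +
  ylab("Word") +
  ggtitle("Frequently Used Words by Album") +
  theme_minimal()

album_plot
```

With the real data, the `album` column would need to be carried through the tokenizing and stop-word steps for this to work.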
What might be unclear?
I think most people will be able to digest the data as is, but those who are unfamiliar with the artist might not care about the importance of the words or the frequency with which they appear in songs.
How could I improve the visualizations?
I think the spacing of the graphs and the overall organization could be much improved. I am still working on the y-axis scales to make the plots cleaner and easier to understand, but at this point in the project I am working with the defaults.
What word shows up the most overall? Are there any visible trends in words or topics in the chosen albums? How have the lyrics changed over time? What tone is visible within the selected albums?
[1] This data set was created prior to 2019; she has since released three more albums and two re-recordings.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kruzlic (2022, April 3). Data Analytics and Computational Social Science: Kruzlic Homework 4. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombkruzlichw4attempt1/
BibTeX citation
@misc{kruzlic2022kruzlic,
  author = {Kruzlic, Bryn},
  title = {Data Analytics and Computational Social Science: Kruzlic Homework 4},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombkruzlichw4attempt1/},
  year = {2022}
}