HW5 First Attempt
The dataset I am using for my final project comes from Kaggle and contains all of the songs and lyrics from Taylor Swift’s discography up until 2017. Because the discography spans a wide range of albums containing 20+ songs each, I will be comparing the lyrics of the first album, ‘Taylor Swift’, with those of the most recent album in the data set, ‘Reputation’. Later in my research, I will introduce the middle albums in order to further emphasize the trends across the albums and the themes in each one individually.
head(Swift_lyrics)
# A tibble: 6 x 7
  artist       album        track_title track_n lyric       line  year
  <chr>        <chr>        <chr>         <dbl> <chr>      <dbl> <dbl>
1 Taylor Swift Taylor Swift Tim McGraw        1 "He said ~     1  2006
2 Taylor Swift Taylor Swift Tim McGraw        1 "Put thos~     2  2006
3 Taylor Swift Taylor Swift Tim McGraw        1 "I said, ~     3  2006
4 Taylor Swift Taylor Swift Tim McGraw        1 "Just a b~     4  2006
5 Taylor Swift Taylor Swift Tim McGraw        1 "That had~     5  2006
6 Taylor Swift Taylor Swift Tim McGraw        1 "On backr~     6  2006
The variables within the data set include: artist, album, track_title, track_n, lyric, line, and year.
Given the size of the dataset, there are many elements that do not need to be included in our research. For example, common words in the lyrics such as “I”, “a”, “we”, “the”, etc. (often called stop words) do not need to be included, as those words would clearly be the most frequent in almost any song that could be analyzed.
At this stage, I was able to tokenize the ‘lyric’ column, separating it into individual words in order to find trends within the lyrics. Since the data set contains multiple albums with 10+ tracks each, my research will filter out the middle albums to show the progression between the first studio album and the last one.
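The tokenizing step itself is not shown in this post. A minimal sketch of how it might look, assuming the ‘token’ feature is `unnest_tokens()` from the tidytext package. The `toy_lyrics` table is a made-up stand-in that only borrows column names from `Swift_lyrics`; its rows are illustrative and not taken from the dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for Swift_lyrics (same column names, invented rows)
toy_lyrics <- tibble(
  album       = c("Album A", "Album A", "Album B"),
  track_title = c("Song One", "Song One", "Song Two"),
  lyric       = c("The first line here", "A second line", "Another song entirely"),
  line        = c(1, 2, 1),
  year        = c(2006, 2006, 2017)
)

# One row per word; by default unnest_tokens() lowercases the text
# and strips punctuation, and the other columns are carried along
toy_tidy <- toy_lyrics %>%
  unnest_tokens(word, lyric)
```

Applied to the real data, the same call would produce a table with a `word` column, which is the shape the later code expects from `tidy_lyric`.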
I am visualizing the word count per song, which maps out the number of words in each song so that I can dissect the lyrics and word types within the songs later on in the project. We can conclude at this point that the word counts per song tend to fall within the 300-500 range. With this information, we can see how many words we are potentially working with, how often the words within these songs are repeated, and how the words may change.
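The code behind the word-count-per-song plot is not included here. One way it could be sketched, assuming a tokenized table with one row per word and a `track_title` column; the `toy_tidy` data and the names `words_per_song` and `song_plot` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokenized data: one row per word, as in tidy_lyric
toy_tidy <- tibble(
  track_title = c(rep("Song One", 5), rep("Song Two", 3)),
  word        = c("one", "two", "three", "four", "one", "five", "six", "seven")
)

# Count the words in each song (repeated words are counted every time)
words_per_song <- toy_tidy %>%
  count(track_title, name = "n_words")

# One bar per song; bar length is the total number of words in that song
song_plot <- ggplot(words_per_song) +
  geom_col(aes(reorder(track_title, n_words), n_words)) +
  coord_flip() +
  labs(x = "Song", y = "Words in song")
```

Counting before removing stop words gives the full length of each song; counting after would give only the words kept for analysis.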
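library(dplyr)
library(tidytext) # provides the stop_words lexicon
library(ggplot2)

# Drop stop words and any word of three characters or fewer
tidy_tslyric <- tidy_lyric %>%
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) > 3)

# Plot the 15 most frequent remaining words across the whole discography
tidy_tslyric %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#FFDAB8", alpha = .5) +
  xlab("Word") +
  ylab("Count") +
  ggtitle("Frequently Used Words") +
  theme_minimal()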
Here, I am visualizing the frequency of words across the discography to show which ones appear the most. The small chunk of code at the top filters out stop words such as “I”, “a”, “the”, etc., along with any word of three characters or fewer, since these are unimportant for this research. Even with no tidying at all, we could reasonably conclude that these short, common words would appear the most frequently. I am interested in every word besides them.
Since analyzing each word individually would take too much time, we can use facet wraps and grids to compare multiple variables in one space, which makes it easier to compare the overall trends (in this case, the words and how frequently they are used).
This section includes all of the albums to give more clarity on the trends over time. Each album is represented by its year, to show the changes (or lack thereof) over time.
library(dplyr)
library(tidytext) # reorder_within() and scale_x_reordered()
library(ggplot2)

# Keep the 10 most frequent words for each album year
popwords <- tidy_tslyric %>%
  count(year, word, sort = TRUE) %>%
  group_by(year) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  # Order the words by count separately within each year's panel
  mutate(word = reorder_within(word, n, year))

popwords %>%
  ggplot() +
  geom_col(aes(word, n), fill = "#83CC94", alpha = .5) +
  scale_x_reordered() +
  labs(x = "Words", y = "# Word is Sung", subtitle = "the longer the bar, the more frequently the word is used") +
  ggtitle("Popular Words by Album Year") +
  facet_wrap(~year, scales = "free") +
  coord_flip() +
  theme_classic()
What questions are left unanswered?
I think at this point, the distinction between the albums is not yet clear. Due to various struggles with the visualizations in the first place, these ggplots reference her discography as a whole rather than album by album. As my skills advance, I will be able to separate the data by the two albums and re-convey the same information shown above with a tighter focus.
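As a rough sketch of that next step, the tokenized data could be filtered down to the two albums before counting. This assumes the table has an `album` column; the album names are written as they appear in this post, and their exact spelling in the data is an assumption. The `toy_tidy` rows are invented placeholders, not real lyrics.

```r
library(dplyr)

# Hypothetical tokenized data with an album column (placeholder words)
toy_tidy <- tibble(
  album = c("Taylor Swift", "Taylor Swift", "Fearless",
            "Reputation", "Reputation", "Reputation"),
  word  = c("alpha", "alpha", "beta", "gamma", "gamma", "delta")
)

# Assumed spellings of the first and last albums in the data set
focus_albums <- c("Taylor Swift", "Reputation")

# Keep only the two albums, then count each word within each album
two_album_counts <- toy_tidy %>%
  filter(album %in% focus_albums) %>%
  count(album, word, sort = TRUE)
```

The result could then be passed to the same plotting code, with `facet_wrap(~album)` in place of the year facets.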
What might be unclear?
I think most people will be able to digest the data as is, but those who are unfamiliar with the artist might not see the significance of the words or of how frequently they appear in the songs.
How could I improve the visualizations?
I think the spacing of the graphs and the overall organization could be much improved. I am still working on the scales of the y-axis to make them look cleaner and easier to understand, but at this point in the project, I am working with the defaults.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kruzlic (2022, April 27). Data Analytics and Computational Social Science: Kruzlic Homework 5. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombkruzlichw5attempt1/
BibTeX citation
@misc{kruzlic2022kruzlic,
  author = {Kruzlic, Bryn},
  title = {Data Analytics and Computational Social Science: Kruzlic Homework 5},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscombkruzlichw5attempt1/},
  year = {2022}
}