library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Final Project Assignment #1: Project & Data Description
Overview of the Final Project
The goal is to tell a coherent and focused story with your data, which answers a question (or questions) that a researcher, or current or future employer, might want to have answered. The goal might be to understand a source of covariance, make a recommendation, or understand change over time. We don’t expect you to reach a definitive conclusion in this analysis. Still, you are expected to tell a data-driven story using evidence to support the claims you are making on the basis of the exploratory analysis conducted over the past term.
In this final project, statistical analyses are not required, but any students who wish to include these may do so. However, your primary analysis should center around visualization rather than inferential statistics. Many scientists only compute statistics after a careful process of exploratory data analysis and data visualization. Statistics are a way to gauge your certainty in your results - NOT A WAY TO DISCOVER MEANINGFUL DATA PATTERNS. Do not run a multiple regression with numerous predictors and report which predictors are significant!!
Tasks of Assignment #1
This assignment is the first component of your final project. Together with the later assignments, it makes up a short paper/report. In this assignment, you should introduce your dataset(s) and explain how you plan to present them. This assignment should include the following components:
A clear description of the dataset(s) that you are using.
What “story” do you want to present to the audience? In other words, what “question(s)” would you like to answer with this dataset(s)?
The Plan for Further Analysis and Visualization.
We will have a special class meeting on April 12 to review and discuss students’ proposed datasets for the final project. If you want your project to be discussed in class, please submit this assignment before April 12.
Part 1. Introduction
In this part, you should introduce the dataset(s) and your research questions.
- Dataset(s) Introduction:
The Spotify songs dataset contains information on songs available on Spotify. As loaded in Part 2, it has 32,833 rows and 23 variables, such as song title, artist name, album name, release date, playlist genre, and various audio features like danceability, energy, and loudness.
Here is a brief description of each variable in the dataset:
track_name: Name of the song
track_id: Unique identifier for the song
track_artist: Name of the artist who recorded the song
track_album_name: Name of the album that the song is from
track_album_id: Unique identifier for the album
track_album_release_date: Date the album was released
playlist_name: Name of the playlist that the song appears in
playlist_id: Unique identifier for the playlist
playlist_genre: Genre of the playlist
playlist_subgenre: Subgenre of the playlist
track_popularity: The popularity of the song on a scale of 0 to 100, where 100 is the most popular
duration_ms: The duration of the song in milliseconds
danceability: A measure of how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity
energy: A measure of the intensity and activity of the music, based on a combination of musical elements including dynamic range, perceived loudness, timbre, onset rate, and general entropy
key: The estimated key of the song (C=0, C#=1, D=2, D#=3, E=4, F=5, F#=6, G=7, G#=8, A=9, A#=10, B=11)
loudness: The overall loudness of the track in decibels (dB)
mode: The mode (major or minor) of the song (0 = Minor, 1 = Major)
speechiness: The presence of spoken words in a track
acousticness: A confidence measure of whether the track is acoustic or not
instrumentalness: A confidence measure of whether the track contains no vocals
liveness: A confidence measure of whether the track was recorded live or not
valence: A measure of the musical positiveness conveyed by a track
tempo: The overall estimated tempo of a track in beats per minute (BPM)
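Because key and mode are stored as integer codes, it may help to recode them as labelled factors before plotting. The following is a minimal sketch in base R, assuming the coding listed above; the `songs` data frame and its values are hypothetical stand-ins, and `key_name` and `mode_name` are illustrative column names that are not part of the dataset.

```r
# Hypothetical toy values standing in for the real `key` and `mode` columns.
songs <- data.frame(
  key  = c(0, 6, 11, 9),
  mode = c(1, 0, 1, 0)
)

# Pitch-class labels in the order given in the variable list (C = 0 ... B = 11).
key_labels <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

# factor() maps each integer code to its label; `levels` fixes the code order.
songs$key_name  <- factor(songs$key,  levels = 0:11,    labels = key_labels)
songs$mode_name <- factor(songs$mode, levels = c(0, 1), labels = c("Minor", "Major"))

print(songs)
```

Keeping the original numeric columns alongside the labelled ones leaves both available, which can be convenient if a later step expects numbers.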
- What questions would you like to answer with this dataset(s)?
This dataset can be used to analyze trends and patterns in the audio features of songs on Spotify, as well as to identify popular songs and artists. This dataset can also be used to answer some questions like:
Which genre has the most songs in the dataset?
Which artists have the most track releases?
What is the distribution of song popularity across different genres?
Which songs share a similar genre and similar audio features?
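As a rough illustration of how the first three questions could be approached, here is a sketch in base R on a small made-up data frame. The column names follow the summary() output in Part 2, but the rows are hypothetical, and the median is only one possible summary of a popularity distribution.

```r
# Hypothetical toy rows; column names follow the summary() output of spotify_songs.
songs <- data.frame(
  track_artist     = c("A", "A", "B", "C", "C", "C"),
  playlist_genre   = c("pop", "rock", "pop", "rap", "rap", "pop"),
  track_popularity = c(80, 40, 60, 20, 50, 70)
)

# Number of songs per genre, most frequent first.
genre_counts <- sort(table(songs$playlist_genre), decreasing = TRUE)

# Number of tracks per artist, most prolific first.
artist_counts <- sort(table(songs$track_artist), decreasing = TRUE)

# Median popularity within each genre, as a first look at the distribution.
median_pop <- tapply(songs$track_popularity, songs$playlist_genre, median)

print(genre_counts)
print(artist_counts)
print(median_pop)
```

On the real data the same calls would run on `spotify_songs` instead of the toy `songs` object.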
Part 2. Describe the data set(s)
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
view(spotify_songs)
dim(spotify_songs)
[1] 32833 23
tail(spotify_songs,2)
summary(spotify_songs)
track_id track_name track_artist track_popularity
Length:32833 Length:32833 Length:32833 Min. : 0.00
Class :character Class :character Class :character 1st Qu.: 24.00
Mode :character Mode :character Mode :character Median : 45.00
Mean : 42.48
3rd Qu.: 62.00
Max. :100.00
track_album_id track_album_name track_album_release_date
Length:32833 Length:32833 Length:32833
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
playlist_name playlist_id playlist_genre playlist_subgenre
Length:32833 Length:32833 Length:32833 Length:32833
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
danceability energy key loudness
Min. :0.0000 Min. :0.000175 Min. : 0.000 Min. :-46.448
1st Qu.:0.5630 1st Qu.:0.581000 1st Qu.: 2.000 1st Qu.: -8.171
Median :0.6720 Median :0.721000 Median : 6.000 Median : -6.166
Mean :0.6548 Mean :0.698619 Mean : 5.374 Mean : -6.720
3rd Qu.:0.7610 3rd Qu.:0.840000 3rd Qu.: 9.000 3rd Qu.: -4.645
Max. :0.9830 Max. :1.000000 Max. :11.000 Max. : 1.275
mode speechiness acousticness instrumentalness
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000000
1st Qu.:0.0000 1st Qu.:0.0410 1st Qu.:0.0151 1st Qu.:0.0000000
Median :1.0000 Median :0.0625 Median :0.0804 Median :0.0000161
Mean :0.5657 Mean :0.1071 Mean :0.1753 Mean :0.0847472
3rd Qu.:1.0000 3rd Qu.:0.1320 3rd Qu.:0.2550 3rd Qu.:0.0048300
Max. :1.0000 Max. :0.9180 Max. :0.9940 Max. :0.9940000
liveness valence tempo duration_ms
Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 4000
1st Qu.:0.0927 1st Qu.:0.3310 1st Qu.: 99.96 1st Qu.:187819
Median :0.1270 Median :0.5120 Median :121.98 Median :216000
Mean :0.1902 Mean :0.5106 Mean :120.88 Mean :225800
3rd Qu.:0.2480 3rd Qu.:0.6930 3rd Qu.:133.92 3rd Qu.:253585
Max. :0.9960 Max. :0.9910 Max. :239.44 Max. :517810
DESCRIPTION OF DATASET:
The Spotify dataset is a collection of information on songs available on the music streaming platform. The case of this dataset is an individual song, represented by each row in the dataset. As the dim() output above shows, the dataset has 32,833 rows and 23 variables, including audio features (e.g., loudness, speechiness, danceability) and a popularity score. The songs cover a variety of playlist genres and subgenres. Each song has a Spotify track ID and is associated with an artist, an album, and an album release date. The popularity score ranges from 0 to 100, with a median of 45 in this data, and is based largely on how often the song is played. The dataset also includes the variables key and mode, which are stored as numeric codes but are categorical in meaning and describe the tonality of the song. The dataset does not record user behavior such as plays, skips, or saves, so the popularity score is the only available measure of listener engagement. Overall, the dataset can be used to identify trends and patterns in audio features and popularity across genres.
Part 3. The Tentative Plan for Visualization
- Briefly describe what data analyses (please see the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.
Exploratory Data Analysis (EDA) using summary statistics and visualization techniques such as histograms, box plots, and scatter plots to understand the distributions and relationships of variables in the dataset.
Correlation analysis to identify the strength and direction of the relationships between variables such as loudness, danceability, energy, and popularity.
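Clustering analysis using the K-means algorithm to identify groups of songs with similar audio features. Because K-means works on numeric variables, the resulting clusters would then be compared against playlist genre.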
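The correlation and clustering steps above can be sketched as follows. This is a minimal example on simulated data, not the real Spotify values: the two simulated groups, the choice of three features, and k = 2 are all assumptions made for illustration. It uses only `cor()`, `scale()`, and `kmeans()` from base R.

```r
set.seed(42)  # make the simulated data and the k-means starts reproducible

# Simulated stand-in for the audio features: two loose groups of 50 songs each.
n <- 50
songs <- data.frame(
  danceability = c(rnorm(n, 0.4, 0.05), rnorm(n, 0.8, 0.05)),
  energy       = c(rnorm(n, 0.3, 0.05), rnorm(n, 0.9, 0.05)),
  loudness     = c(rnorm(n, -12, 1),    rnorm(n, -4, 1))
)

# Pairwise Pearson correlations between the numeric features.
cor_matrix <- cor(songs)

# Standardise first so that loudness (in dB) does not dominate the distances.
scaled <- scale(songs)

# k = 2 is an arbitrary choice for this toy example.
clusters <- kmeans(scaled, centers = 2, nstart = 10)

print(round(cor_matrix, 2))
print(table(clusters$cluster))
```

On the real data the number of clusters would need to be chosen, for example by comparing the within-cluster sum of squares (`tot.withinss`) across several values of k.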
- Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
Histogram: Histograms are useful for visualizing the distribution of a numerical variable. In the case of the Spotify dataset, we can use histograms to explore the distribution of features like loudness, danceability, and tempo.
Scatter Plots: Scatter plots are useful for visualizing the relationship between two numerical variables. In the case of the Spotify dataset, we can use scatter plots to explore the relationship between features like energy and valence, or between features like loudness and popularity. By plotting these features against each other, we can see if there is a linear or nonlinear relationship between them, and we can also see if there are any outliers or clusters that might indicate subgroups within the data.
Box Plots: Box plots are useful for visualizing the distribution of a numerical variable across different categories. In the case of the Spotify dataset, we can compare the distribution of a feature such as popularity or danceability across genres to see if there are any notable differences. We could also group songs into popular and less popular and create box plots for features like loudness and energy to see whether their distributions differ between the two groups.
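The three plot types described above can be sketched with base R graphics. The data below are simulated, so the plots show no real pattern; only the column names are taken from the summary() output in Part 2. The same plots could equally be drawn with ggplot2, which the report already loads through tidyverse.

```r
set.seed(1)  # reproducible simulated data

# Simulated stand-in for a few columns of spotify_songs (values are made up).
songs <- data.frame(
  playlist_genre   = rep(c("pop", "rock", "rap"), each = 40),
  energy           = runif(120),
  valence          = runif(120),
  track_popularity = round(runif(120, 0, 100))
)

if (!interactive()) pdf(NULL)  # when run as a script, draw to a null device

# Histogram: distribution of a single numeric variable.
h <- hist(songs$energy, breaks = 10, main = "Energy", xlab = "energy")

# Scatter plot: relationship between two numeric variables.
plot(songs$energy, songs$valence, xlab = "energy", ylab = "valence")

# Box plot: distribution of a numeric variable within each category.
b <- boxplot(track_popularity ~ playlist_genre, data = songs,
             xlab = "playlist_genre", ylab = "track_popularity")
```

`hist()` and `boxplot()` also return the binned counts and the five-number summaries they draw, which can be useful for checking what a plot shows.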
- If you plan to conduct specific data analyses and visualizations, describe how you need to process and prepare the tidy data.
- Drop columns that are not needed for the analysis, such as the identifier columns track_id, track_album_id, and playlist_id.
- Convert the “track_album_release_date” variable from a character/string format to a date format using the as.Date() function, so that the year or month can be extracted for time-series analysis.
- Create a new variable for the popularity of an artist by summing up the popularity scores of all their tracks.
- Aggregate the data to create a separate table with one row per artist, and columns for the number of tracks released, total popularity score, and average audio features.
- Regarding missing data or NAs, we need to first identify the variables with missing values and the extent of the missingness. We can use functions like is.na() or summary() to identify the missing data.
- Convert units: the duration_ms column, which represents the duration of the song in milliseconds, has to be converted to minutes to make it more interpretable.
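The preparation steps above can be sketched in base R as follows. The rows are hypothetical, and the new column names (`release_date`, `release_year`, `duration_min`, `n_tracks`, `total_popularity`, `mean_duration`) are illustrative choices, not part of the dataset. The date format `%Y-%m-%d` is an assumption about how the release dates are written; any value that does not match it becomes NA and would need separate handling.

```r
# Hypothetical toy rows; column names follow the summary() output of spotify_songs.
songs <- data.frame(
  track_artist             = c("A", "A", "B", "B"),
  track_popularity         = c(80, 40, 60, NA),
  track_album_release_date = c("2019-06-14", "2018-01-05", "2020-01-17", "2017-11-03"),
  duration_ms              = c(180000, 240000, 210000, 150000)
)

# Count missing values per column.
na_counts <- colSums(is.na(songs))

# Parse the release date (assumed format) and extract the year.
songs$release_date <- as.Date(songs$track_album_release_date, format = "%Y-%m-%d")
songs$release_year <- as.integer(format(songs$release_date, "%Y"))

# Convert duration from milliseconds to minutes.
songs$duration_min <- songs$duration_ms / 60000

# One row per artist: track count, total popularity, mean duration.
by_artist <- split(songs, songs$track_artist)
artists <- data.frame(
  track_artist     = names(by_artist),
  n_tracks         = sapply(by_artist, nrow),
  total_popularity = sapply(by_artist, function(d) sum(d$track_popularity, na.rm = TRUE)),
  mean_duration    = sapply(by_artist, function(d) mean(d$duration_min)),
  row.names        = NULL
)

print(na_counts)
print(artists)
```

Note that a summed popularity score grows with the number of tracks, so an artist with many moderately popular songs can outrank one with a few very popular songs; a mean per artist is a possible alternative.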