Final Project Assignment #1: Project & Data Description

Project & Data Description
Author

Pradhakshya Dhanakumar

Published

April 10, 2023

Overview of the Final Project

The goal is to tell a coherent and focused story with your data, which answers a question (or questions) that a researcher, or current or future employer, might want to have answered. The goal might be to understand a source of covariance, make a recommendation, or understand change over time. We don’t expect you to reach a definitive conclusion in this analysis. Still, you are expected to tell a data-driven story using evidence to support the claims you are making on the basis of the exploratory analysis conducted over the past term.

In this final project, statistical analyses are not required, but any students who wish to include these may do so. However, your primary analysis should center around visualization rather than inferential statistics. Many scientists only compute statistics after a careful process of exploratory data analysis and data visualization. Statistics are a way to gauge your certainty in your results - NOT A WAY TO DISCOVER MEANINGFUL DATA PATTERNS. Do not run a multiple regression with numerous predictors and report which predictors are significant!!

Tasks of Assignment #1

This assignment is the first component of your final project. Together with the later assignments, it makes up a short paper/report. In this assignment, you should introduce your dataset(s) and explain how you plan to present them. This assignment should include the following components:

  1. A clear description of the dataset(s) that you are using.

  2. What “story” do you want to present to the audience? In other words, what “question(s)” would you like to answer with this dataset(s)?

  3. The Plan for Further Analysis and Visualization.

We will have a special class meeting on April 12 to review and discuss students’ proposed datasets for the final project. If you want your project to be discussed in class, please submit this assignment before April 12.

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

In this part, you should introduce the dataset(s) and your research questions.

  1. Dataset(s) Introduction:

The Spotify songs dataset contains information on songs available on Spotify. As loaded in Part 2, it has 32,833 rows and 23 variables, such as song title, artist name, album name, album release date, playlist genre, and various audio features like danceability, energy, and loudness.

Here is a brief description of each variable in the dataset:

track_id: Unique identifier for the song

track_name: Name of the song

track_artist: Name of the artist who recorded the song

track_popularity: The popularity of the song on a scale of 0 to 100, where 100 is the most popular

track_album_id: Unique identifier for the album

track_album_name: Name of the album that the song is from

track_album_release_date: Release date of the album that the song is from

playlist_name: Name of the playlist that the song appears in

playlist_id: Unique identifier for the playlist

playlist_genre: Genre of the playlist

playlist_subgenre: Subgenre of the playlist

duration_ms: The duration of the song in milliseconds

danceability: A measure of how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity

energy: A measure of the intensity and activity of the music, based on a combination of musical elements including dynamic range, perceived loudness, timbre, onset rate, and general entropy

key: The estimated key of the song (C=0, C#=1, D=2, D#=3, E=4, F=5, F#=6, G=7, G#=8, A=9, A#=10, B=11)

loudness: The overall loudness of the track in decibels (dB)

mode: The mode (major or minor) of the song (0 = Minor, 1 = Major)

speechiness: The presence of spoken words in a track

acousticness: A confidence measure of whether the track is acoustic or not

instrumentalness: A confidence measure of whether the track contains no vocals

liveness: A confidence measure of whether the track was recorded live or not

valence: A measure of the musical positiveness conveyed by a track

tempo: The overall estimated tempo of a track in beats per minute (BPM)
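
The integer codes for key and mode described above can be turned into labelled factors so that plots and tables show note names rather than numbers. The following is a minimal sketch in base R; the helper names `label_key` and `label_mode` are hypothetical and not part of the dataset or any package.

```r
# Note names in the order given in the key description (C = 0, ..., B = 11).
key_names <- c("C", "C#", "D", "D#", "E", "F",
               "F#", "G", "G#", "A", "A#", "B")

# Hypothetical helper: map integer key codes 0..11 to note names.
# Any code outside 0..11 becomes NA.
label_key <- function(key) {
  factor(key, levels = 0:11, labels = key_names)
}

# Hypothetical helper: map mode codes to "Minor" (0) and "Major" (1).
label_mode <- function(mode) {
  factor(mode, levels = c(0, 1), labels = c("Minor", "Major"))
}
```

For example, `label_key(spotify_songs$key)` could be stored in a new column before plotting.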

  2. What questions would you like to answer with this dataset(s)?

This dataset can be used to analyze trends and patterns in the audio features of songs on Spotify, as well as to identify popular songs and artists. This dataset can also be used to answer some questions like:

  • Which genre has the most number of songs in the dataset?

  • Which artists have the most track releases?

  • What is the distribution of song popularity across different genres?

  • Identify songs with similar genre and audio features
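
The first two questions come down to counting rows per group. The sketch below shows one way to do this with dplyr, assuming the column names `playlist_genre`, `track_artist`, and `track_id` from the dataset. The function names `count_by_genre` and `top_artists` are hypothetical. Counting distinct `track_id` values is an assumption made here so that a track listed more than once is counted only once per artist.

```r
library(dplyr)

# Hypothetical helper: number of rows per playlist genre, largest first.
count_by_genre <- function(songs) {
  songs %>%
    count(playlist_genre, name = "n_songs", sort = TRUE)
}

# Hypothetical helper: the k artists with the most distinct tracks.
top_artists <- function(songs, k = 10) {
  songs %>%
    group_by(track_artist) %>%
    summarise(n_tracks = n_distinct(track_id), .groups = "drop") %>%
    arrange(desc(n_tracks)) %>%
    head(k)
}
```

With the data loaded in Part 2, `count_by_genre(spotify_songs)` and `top_artists(spotify_songs)` would return small summary tables that can be printed or plotted.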

Part 2. Describe the data set(s)

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
view(spotify_songs)
dim(spotify_songs)
[1] 32833    23
tail(spotify_songs,2)
summary(spotify_songs)
   track_id          track_name        track_artist       track_popularity
 Length:32833       Length:32833       Length:32833       Min.   :  0.00  
 Class :character   Class :character   Class :character   1st Qu.: 24.00  
 Mode  :character   Mode  :character   Mode  :character   Median : 45.00  
                                                          Mean   : 42.48  
                                                          3rd Qu.: 62.00  
                                                          Max.   :100.00  
 track_album_id     track_album_name   track_album_release_date
 Length:32833       Length:32833       Length:32833            
 Class :character   Class :character   Class :character        
 Mode  :character   Mode  :character   Mode  :character        
                                                               
                                                               
                                                               
 playlist_name      playlist_id        playlist_genre     playlist_subgenre 
 Length:32833       Length:32833       Length:32833       Length:32833      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  danceability        energy              key            loudness      
 Min.   :0.0000   Min.   :0.000175   Min.   : 0.000   Min.   :-46.448  
 1st Qu.:0.5630   1st Qu.:0.581000   1st Qu.: 2.000   1st Qu.: -8.171  
 Median :0.6720   Median :0.721000   Median : 6.000   Median : -6.166  
 Mean   :0.6548   Mean   :0.698619   Mean   : 5.374   Mean   : -6.720  
 3rd Qu.:0.7610   3rd Qu.:0.840000   3rd Qu.: 9.000   3rd Qu.: -4.645  
 Max.   :0.9830   Max.   :1.000000   Max.   :11.000   Max.   :  1.275  
      mode         speechiness      acousticness    instrumentalness   
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
 1st Qu.:0.0000   1st Qu.:0.0410   1st Qu.:0.0151   1st Qu.:0.0000000  
 Median :1.0000   Median :0.0625   Median :0.0804   Median :0.0000161  
 Mean   :0.5657   Mean   :0.1071   Mean   :0.1753   Mean   :0.0847472  
 3rd Qu.:1.0000   3rd Qu.:0.1320   3rd Qu.:0.2550   3rd Qu.:0.0048300  
 Max.   :1.0000   Max.   :0.9180   Max.   :0.9940   Max.   :0.9940000  
    liveness         valence           tempo         duration_ms    
 Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  4000  
 1st Qu.:0.0927   1st Qu.:0.3310   1st Qu.: 99.96   1st Qu.:187819  
 Median :0.1270   Median :0.5120   Median :121.98   Median :216000  
 Mean   :0.1902   Mean   :0.5106   Mean   :120.88   Mean   :225800  
 3rd Qu.:0.2480   3rd Qu.:0.6930   3rd Qu.:133.92   3rd Qu.:253585  
 Max.   :0.9960   Max.   :0.9910   Max.   :239.44   Max.   :517810  

DESCRIPTION OF DATASET:

The Spotify dataset is a collection of information on songs available on the music streaming platform. The case of this dataset is an individual song, represented by each row in the dataset. As the output above shows, the dataset has 32,833 rows and 23 columns, covering audio features (e.g., loudness, speechiness, danceability) and a popularity score for each song. The songs cover a variety of playlist genres and sub-genres. Each song has a Spotify track ID and is associated with an artist, an album, and an album release date. The popularity score ranges from 0 to 100, with higher values indicating more popular songs. The dataset also includes categorical variables such as key and mode, which describe the tonality of the song. It does not contain user-behavior data such as plays, skips, or saves, so the analysis is limited to track, playlist, and audio-feature information. Overall, the dataset can be used to identify trends and patterns in the audio features and popularity of songs across genres.

Part 3. The Tentative Plan for Visualization

  1. Briefly describe what data analyses (please see the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.
  • Exploratory Data Analysis (EDA) using summary statistics and visualization techniques such as histograms, box plots, and scatter plots to understand the distributions and relationships of variables in the dataset.

  • Correlation analysis to identify the strength and direction of the relationships between variables such as loudness, danceability, energy, and popularity.

  • Clustering analysis using the K-means algorithm to identify groups of songs with similar audio features, and comparison of those groups with playlist genre.
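
The correlation and clustering steps listed above could be sketched as follows, using only base R (`cor` and `kmeans`). The function names, the choice of columns, and the default of six clusters are illustrative assumptions, not decisions taken from the plan. Features are standardized before clustering because tempo is on a much larger scale than the 0 to 1 features.

```r
# Hypothetical helper: Pearson correlation matrix for selected numeric columns.
feature_correlations <- function(songs,
                                 cols = c("loudness", "danceability",
                                          "energy", "track_popularity")) {
  cor(songs[, cols], use = "complete.obs")
}

# Hypothetical helper: K-means on standardized audio features.
# `centers = 6` is an arbitrary illustrative default.
cluster_songs <- function(songs,
                          cols = c("danceability", "energy",
                                   "valence", "tempo"),
                          centers = 6, seed = 1) {
  set.seed(seed)                        # K-means starts from random centers
  x <- scale(as.matrix(songs[, cols]))  # mean 0, sd 1 for every feature
  kmeans(x, centers = centers, nstart = 10)
}
```

After fitting, `table(fit$cluster, spotify_songs$playlist_genre)` would show how the clusters line up with playlist genre, where `fit` is the result of `cluster_songs(spotify_songs)`.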

  2. Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
  • Histogram: Histograms are useful for visualizing the distribution of a numerical variable. In the case of the Spotify dataset, we can use histograms to explore the distribution of features like loudness, danceability, and tempo.

  • Scatter Plots: Scatter plots are useful for visualizing the relationship between two numerical variables. In the case of the Spotify dataset, we can use scatter plots to explore the relationship between features like energy and valence, or between features like loudness and popularity. By plotting these features against each other, we can see if there is a linear or nonlinear relationship between them, and we can also see if there are any outliers or clusters that might indicate subgroups within the data.

  • Box Plots: Box plots are useful for visualizing the distribution of a numerical variable across different categories. In the case of the Spotify dataset, we can plot popularity by genre to see if there are any notable differences in the distributions between genres. We could also create box plots for features like loudness and energy to see if these distributions differ between popular and less popular songs.
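
The three plot types described above might look like this in ggplot2. This is a sketch only: the function names are hypothetical, and the specific variables shown (danceability, energy against valence, popularity by playlist genre) are examples taken from the text rather than a fixed choice.

```r
library(ggplot2)

# Hypothetical helper: histogram of one numeric feature.
plot_danceability_hist <- function(songs, bins = 30) {
  ggplot(songs, aes(x = danceability)) +
    geom_histogram(bins = bins) +
    labs(x = "Danceability", y = "Number of songs")
}

# Hypothetical helper: scatter plot of two numeric features.
# A low alpha reduces overplotting when there are many points.
plot_energy_valence <- function(songs) {
  ggplot(songs, aes(x = energy, y = valence)) +
    geom_point(alpha = 0.2) +
    labs(x = "Energy", y = "Valence")
}

# Hypothetical helper: box plot of a numeric variable by category.
plot_popularity_by_genre <- function(songs) {
  ggplot(songs, aes(x = playlist_genre, y = track_popularity)) +
    geom_boxplot() +
    labs(x = "Playlist genre", y = "Track popularity")
}
```

Each function returns a ggplot object, so `plot_popularity_by_genre(spotify_songs)` prints the plot and further layers such as `+ theme_minimal()` can be added afterwards.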

  3. If you plan to conduct specific data analyses and visualizations, describe how you need to process and prepare the tidy data.
  • Drop columns that are not needed for the analysis, such as playlist_id and track_album_id.
  • Convert the “track_album_release_date” variable from a character/string format to a date format using the as.Date() function, so that the year or month can be extracted for analysis over time.
  • Create a new variable for the popularity of an artist by summing up the popularity scores of all their tracks.
  • Aggregate the data to create a separate table with one row per artist, and columns for the number of tracks released, total popularity score, and average audio features.
  • Regarding missing data or NAs, we need to first identify the variables with missing values and the extent of the missingness. We can use functions like is.na() or summary() to identify the missing data.
  • Convert data types: the duration_ms column, which represents the duration of the song in milliseconds, has to be converted to minutes to make it more interpretable.
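
The preparation steps above could be sketched with dplyr as follows. The function names and the new column names (`release_date`, `release_year`, `duration_min`, and the artist summary columns) are hypothetical. The sketch also assumes that some release dates may not be full year-month-day strings. Such values become `NA` in `release_date`, while the year is taken from the first four characters.

```r
library(dplyr)

# Hypothetical helper: number of missing values in each column.
count_missing <- function(songs) {
  colSums(is.na(songs))
}

# Hypothetical helper: parse the release date and convert duration to minutes.
prepare_songs <- function(songs) {
  songs %>%
    mutate(
      # With an explicit format, strings that are not "YYYY-MM-DD" become NA
      # instead of raising an error.
      release_date = as.Date(track_album_release_date, format = "%Y-%m-%d"),
      # The year is read from the first four characters of the string.
      release_year = as.integer(substr(track_album_release_date, 1, 4)),
      duration_min = duration_ms / 60000
    )
}

# Hypothetical helper: one row per artist, counting each track_id once.
summarise_artists <- function(songs) {
  songs %>%
    distinct(track_id, .keep_all = TRUE) %>%
    group_by(track_artist) %>%
    summarise(
      n_tracks = n(),
      total_popularity = sum(track_popularity, na.rm = TRUE),
      mean_danceability = mean(danceability, na.rm = TRUE),
      mean_energy = mean(energy, na.rm = TRUE),
      .groups = "drop"
    )
}
```

A typical sequence would be `count_missing(spotify_songs)` to check for NAs first, then `prepare_songs(spotify_songs)` and `summarise_artists(spotify_songs)` to build the track-level and artist-level tables.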