Final Project Assignment#1: Jaswanth Reddy Kommuru

final_Project_assignment_1

final_project_data_description

Project & Data Description

Author

Jaswanth Reddy Kommuru

Published

April 10, 2023

Important Formatting & Submission Notes:

Use this file as the template to work on: start your own writing from Section “Part.1”
Please make the following changes to the above YAML header:
- Change the “title” to “Final Project Assignment#1: First Name Last Name”;
- Change the “author” to your name;
- Change the “date” to the current date in the “MM-DD-YYYY” format;
Submission:
- Delete the unnecessary sections (“Overview”, “Tasks”, “Special Note”, and “Evaluation”).
- In the posts folder of your local 601_Spring_2023 project, create a folder named “FirstNameLastName_FinalProjectData”, and save your final project dataset(s) in this folder. DO NOT save the dataset(s) to the _data folder which stores the dataset(s) for challenges.
- Render and submit the file to the blog post like a regular challenge.

library(tidyverse)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

In this part, you should introduce the dataset(s) and your research questions.

Dataset(s) Introduction:

Divvy is Chicagoland’s bike share system across Chicago and Evanston. Divvy provides residents and visitors with a convenient, fun and affordable transportation option for getting around and exploring Chicago. Divvy, like other bike share systems, consists of fleet of specially-designed, sturdy and durable bikes that are locked into a network of docking stations throughout the region. The bikes can be unlocked from one station and returned to any other station in the system. People use bike share to explore Chicago, commute to work or school, run errands, get to appointments or social engagements, and more. Some part of the data collected from the app has been kept available for public access.

The dataset has numerous entries of each ride details without disclosing the personnel details of the person who booked it each row represents one ride information and each row has the below attribute’s of the ride.

ride_id: A unique identifier for each ride.
rideable_type: The type of bike used for the ride (e.g., docked_bike).
started_at: The date and time when the ride started.
ended_at: The date and time when the ride ended.
start_station_name: The name of the starting station for the ride.
start_station_id: The identifier of the starting station.
end_station_name: The name of the ending station for the ride.
end_station_id: The identifier of the ending station.
start_lat: The latitude coordinate of the starting station.
start_lng: The longitude coordinate of the starting station.
end_lat: The latitude coordinate of the ending station.
end_lng: The longitude coordinate of the ending station.
member_casual: Indicates whether the rider is a member or a casual user.
- identify the source of the dataset(s): who or which organization collected the dataset(s); some dataset(s) also tells you how and when it was collected ;
- a description of the “cases” represented by the dataset(s); in other words, what does each row represent?
- Erico’s hint: the website of the dataset(s) usually has a brief introduction of the above information; you can also look for the “user manual” document that comes with the dataset(s).
- For reference, you can check outthe “Introduction” section of this final project as an example of dataset(s) introduction.

What questions do you like to answer with this dataset(s)?

How much of the data is about members and how much is about casuals?
How much of the data is distributed by month?
How is the temperature/weather influencing the number of rides made in a month.
How much of the data is distributed by weekday, weekends?
What is the time spent on the ride by different categories of people?

Part 2. Describe the data set(s)

This part contains both a coding and a storytelling component.

In the coding component, you should:

read the dataset;
- (optional) If you have multiple dataset(s) you want to work with, you should combine these datasets at this step.
- (optional) If your dataset is too big (for example, it contains too many variables/columns that may not be useful for your analysis), you may want to subset the data just to include the necessary variables/columns.

  csv_files <- list.files(path = "/Users/jaswanth/Documents/601/JaswanthReddyKommuru_FinalProjectData", recursive = TRUE, full.names=TRUE)
  ride_data <- do.call(rbind, lapply(csv_files, read.csv))

present the descriptive information of the dataset(s) using the functions in Challenges 1, 2, and 3;
- for examples: dim(), length(unique()), head();
```
head(ride_data)
```

    dim(ride_data)

[1] 3489748      13

    colnames(ride_data)

 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"

length(unique(ride_data$start_station_name))

[1] 709

  ride_data %>% 
  select(end_station_name) %>% 
  n_distinct(.)

[1] 707

  ride_data %>% 
  select(member_casual) %>% 
  distinct(.)

  ride_data %>% 
  select(rideable_type) %>% 
  distinct(.)

  ride_data %>% 
  select(ride_id) %>% 
  n_distinct(.)

[1] 3489539

In the provided dataset, each row corresponds to a specific ride and contains multiple pieces of information. The dataset encompasses a distinctive identifier, known as “ride_id,” assigned to uniquely identify the details of all 3,489,539 rides. Additionally, the dataset captures data on the starting and ending stations, denoted by “start_station_name” and “end_station_name,” respectively. There are a total of 707 stations where bikes are picked up and 709 stations where bikes are returned. Moreover, the dataset includes geographical information, such as the latitude and longitude coordinates, for both the starting and ending bike stations, referred to as “start_lat,” “start_lng,” “end_lat,” and “end_lng.” Each station is also assigned a unique identifier, namely “start_station_id” and “end_station_id.” The dataset further accounts for three types of bikes, referred to as “rideable_type,” which are available for rides and their availability is dependent on the specific bike type. The individuals utilizing the bikes are categorized into two distinct groups based on their membership status, represented by the field “member_casual.” This field indicates whether the person is a member with an active subscription plan or a casual member who typically pays for each ride individually.

conduct summary statistics of the dataset(s); especially show the basic statistics (min, max, mean, median, etc.) for the variables you are interested in.

summary(ride_data)

   ride_id          rideable_type       started_at          ended_at        
 Length:3489748     Length:3489748     Length:3489748     Length:3489748    
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 start_station_name start_station_id   end_station_name   end_station_id    
 Length:3489748     Length:3489748     Length:3489748     Length:3489748    
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   start_lat       start_lng         end_lat         end_lng      
 Min.   :41.64   Min.   :-87.87   Min.   :41.54   Min.   :-88.07  
 1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
 Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
 Mean   :41.90   Mean   :-87.64   Mean   :41.90   Mean   :-87.64  
 3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
 Max.   :42.08   Max.   :-87.52   Max.   :42.16   Max.   :-87.44  
                                  NA's   :4738    NA's   :4738    
 member_casual     
 Length:3489748    
 Class :character  
 Mode  :character

In the storytelling component, you should describe the basic information of the dataset(s) and the variables in a way that corresponds to your descriptive and summary statistics in the above coding component. DO NOT simply report the number of rows. Instead, describe the dataset(s) fully by specifying what each row and column mean. In other words, your description should be comprehensive and detailed enough for readers to picture or envision the dataset(s) in their brains.

For example, suppose I use a dataset of all the athletes who participated in the Olympic Games. Here is how I describe the basic information of the data: “the case of this dataset is ab individual athlete, represented by each row in the dataset. The dataset includes individual (e.g., gender, age, height, weight, race) and event performance (e.g., final placement) information for all athletes (22,398) competing in all events (e.g., Male 400m Free, Female …) in all Olympics Games since 1922 (24 Winter and 28 Summer Games. Athletes appearing in the dataset competed in anywhere from 1-11 distinct events (of 198 possible) during 1-5 distinct Olympic competitions, for a total of XXX, XXX athlete-event-Olympic-year observations. XXX Countries are represented, etc).”
Erico’s hint: as I mentioned above, sometimes a dataset is too large, and it is difficult to present and explain all the variables/columns (especially if you run summary statistics for the whole dataset). In this case, you will have to make a decision to select the most important variables/columns to discuss. For example, the Olympic dataset I mentioned above as an example contains more than 50 columns. For clarity of data presentation, I may just focus on 6 items/columns of individual athletes (gender, age, weight, height, race, nationality) and the column of final placement that are most relevant to answer my specific research questions. By doing so, you can just present the tables of the summary statistics of these 7 variables/columns without showing too much information and confusing the readers.
A good example is to can check out the Data Description section of the above student’s final project. As you can see, the student describes the dataset after he runs a few descriptive statistics. You can also see the weekly challenge solutions by Professor Rolfe for other examples of clear, concise data descriptions.

3. The Tentative Plan for Visualization

Briefly describe what data analyses (please the special note on statistics in the next section) and visualizations you plan to conduct to answer the research questions you proposed above.

To comprehend the correlation between specific variables that are pertinent to the study questions I have set, I will plot various histograms, box plots, scatter plots, and linear graphs. To determine which set of people uses the service the most regularly, for instance, I can plot a bar graph of casual vs. member.

Explain why you choose to conduct these specific data analyses and visualizations. In other words, how do such types of statistics or graphs (see the R Gallery) help you answer specific questions? For example, how can a bivariate visualization reveal the relationship between two variables, or how does a linear graph of variables over time present the pattern of development?
If you plan to conduct specific data analyses and visualizations, describe how do you need to process and prepare the tidy data.

Box Plots: These can be used to see how a numerical variable is distributed among various properties. I can plot the distribution of the amount of time that various groups of people spend riding bikes using the bike data.
Histogram: To see how numerical variables are distributed, utilize histograms. I can use it to determine the time that was driven by various groups of people for this dataset.
- What do you need to do to mutate the datasets (convert date data, create a new variable, pivot the data format, etc.)?
- How are you going to deal with the missing data/NAs and outliers? And why do you choose this way to deal with NAs?

(Optional) It is encouraged, but optional, to include a coding component of tidy data in this part.

x<-ride_data %>%
  select(end_station_id) %>% 
  is.na() %>%
  sum()
x

[1] 98104

 column_with_na<-ride_data %>%
  select_if(~ any(is.na(.))) %>%
  names()
column_with_na

[1] "start_station_id" "end_station_id"   "end_lat"          "end_lng"

Rename a few of the column names to reflect the data more accurately.
Create new columns to contain additional data that can be pulled from current columns, such as the ride’s duration time.
When dealing with missing data, or NAs, we must first identify the variables with missing values and the level of the missingness. We can use functions like is.na() or summary() to detect the missing data and then take the appropriate action to remove it.