Final Project
Introduction
Climate has always been a topic that sparks debate and there is continuous research being done on it every day. I wanted to contribute to this research and analyze the impacts climate factors like temperature and humidity have on bike users. There has been study related to weather conditions and biking and whether or not it results in more accidents which concluded in an increase in accidents. “It suggests that weather conditions should be considered in every analysis where bicycle volume data is needed” (Pazdan, 2020). The paper describes the importance of weather condition and how they should be used in any analysis regarding biking data. Here, I thought about if we know that these factors are important, how important are they? I can find this data online somewhere however, I would like to conduct my own study and determine the results from that and then compare it to results online.
A. How has temperature and humidity impacted bike users?
My motivation is driven by my own interest in biking and climate factors and the study of how the climate is impacting human movement. I think it is interesting how significant climate factors can impact certain human activities and I want to learn more about which ones are presenting the largest impact on human activity. The reason this study is different is that it is based on normalized data and is focusing how bikers are impacted by weather condition.
Hypothesis
My hypothesis is testing how and if climate factors like temperature and humidity have an impact on bike users. I am hypothesizing that temperature will have the largest impact on bike sales. I believe temperature will mostly cause it to increase as nicer weather would be good for a ride. However, I believe at a certain point it will dip down and no longer be as positive. I hypothesize that as humidity increases then the number of bike users drop since the biking conditions will grow away from the ideal conditions. This study will use normalized feeling temperature and normalized humidity compared to daily and hourly bike users. This study is using new data compared to the research documents that have been observed online and presents a new view to either support or reject their claims.
The hypothesis will be tested as follows:
Temperature does not impact bike users
Temperature does impact bike users
Humidity does not impact bike users
Humidity does impact bike users
Analytic Planning
Library
Code
library(tidyverse)
library(plyr)
library(dplyr)
library(kableExtra)
library(lubridate)
library(lmtest)
library(sandwich)
library(GGally)
library(caret)
library(ggthemes)
library(plotly)
library(hrbrthemes)
Descriptive statistics
Reading in the data
Code
<- read.csv("_data/hour.csv")
bike <- read.csv("_data/day.csv")
bike2
dim(bike)
[1] 17379 17
Code
dim(bike2)
[1] 731 16
Code
summary(bike)
instant dteday season yr
Min. : 1 Length:17379 Min. :1.000 Min. :0.0000
1st Qu.: 4346 Class :character 1st Qu.:2.000 1st Qu.:0.0000
Median : 8690 Mode :character Median :3.000 Median :1.0000
Mean : 8690 Mean :2.502 Mean :0.5026
3rd Qu.:13034 3rd Qu.:3.000 3rd Qu.:1.0000
Max. :17379 Max. :4.000 Max. :1.0000
mnth hr holiday weekday
Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
workingday weathersit temp atemp
Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
hum windspeed casual registered
Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
cnt
Min. : 1.0
1st Qu.: 40.0
Median :142.0
Mean :189.5
3rd Qu.:281.0
Max. :977.0
Data
The data was collected from UCI machine learning repository. Where I collected for both daily and hourly information for 17 variables. Each variable is described below. Important variables will be in bold. Their relationship towards the analysis will be described underneath this section.
- instant - This is the record index. This is the count of how many rows there are. This object will not be utilized in this study
- dteday - This is the date. The date is currently in year-day-month format. This will be used to observe change over time
- season - This is the 4 seasons. This is expressed as: 1-Winter, 2-Spring, 3-Summer, 4- Fall. This will be used as a control variable since this could impact the sales of bikes and the independent variables
- yr - This is the year ranging from 2011-2012. This will be used analyze change over each year.
- mnth - This is the month 1-12. This will be used to analyze the change over months.
- hr - This is hour 0-23. This will be used to analyze the change by the hour.
- holiday - This is the holidays so whether or not it is a holiday. This will be used as an independent variable in conjunction with another
- weekday - Day of the week.
- workingday - if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit -
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp - Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp - Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum - Normalized humidity. The values are divided to 100 (max)
- windspeed - Normalized wind speed. The values are divided to 67 (max)
- casual - count of casual users
- registered - count of registered users
- cnt - count of total rental bikes including both casual and registered
Regression 1
- Explanatory - normalized temperature feeling
- Outcome - cnt
- Control - Season, holiday, weekday, weathersit
Regression 2
- Explanatory - Normalized humidity
- Outcome - cnt
- Control - temperature(normalized and normalized feeling), wind speed, weekday, holiday, weathersit
Regression Model
Interaction Terms
I do not believe I need to use a quardratic or anything like that however, there is heteroskedascity so using a log may help with the funneling?
Regression Model 1
Cleaning the data
Code
<- bike %>%
bike ::rename('Date' = dteday) %>%
dplyr::rename('Year' = yr) %>%
dplyr::rename('Month' = mnth) %>%
dplyr::rename('Hour' = hr) %>%
dplyr::rename('Normalized_temperature_C' = temp) %>%
dplyr::rename('Normalized_feeling_temperature_C' = atemp) %>%
dplyr::rename('Normalized_Humidity' = hum) %>%
dplyr::rename("Total_bike_users" = cnt)
dplyr
<- bike2 %>%
bike2 ::rename('Date' = dteday) %>%
dplyr::rename('Year' = yr) %>%
dplyr::rename('Month' = mnth) %>%
dplyr::rename('Normalized_temperature_C' = temp) %>%
dplyr::rename('Normalized_feeling_temperature_C' = atemp) %>%
dplyr::rename('Normalized_Humidity' = hum) %>%
dplyr::rename("Total_bike_users" = cnt)
dplyr
$Date <- ymd(bike2$Date)
bike2
# Checking for multicollinearity (We notice that temp and feeling temp are almost identical) so we removed the normalized temperature form the study since I want to focus on feeling temperature. Also removing instant since it is just the count of rows.
<- bike[,-11]
bike <- bike2[,-10]
bike2
cor(bike[3:16])
season Year Month
season 1.000000000 -0.010742486 0.830385892
Year -0.010742486 1.000000000 -0.010472929
Month 0.830385892 -0.010472929 1.000000000
Hour -0.006116901 -0.003867005 -0.005771909
holiday -0.009584526 0.006691617 0.018430325
weekday -0.002335350 -0.004484851 0.010400061
workingday 0.013743102 -0.002196005 -0.003476922
weathersit -0.014523552 -0.019156853 0.005399522
Normalized_feeling_temperature_C 0.319379811 0.039221595 0.208096131
Normalized_Humidity 0.150624745 -0.083546421 0.164411443
windspeed -0.149772751 -0.008739533 -0.135386323
casual 0.120206447 0.142778528 0.068457301
registered 0.174225633 0.253684310 0.122272967
Total_bike_users 0.178055731 0.250494899 0.120637760
Hour holiday weekday
season -0.006116901 -0.009584526 -0.002335350
Year -0.003867005 0.006691617 -0.004484851
Month -0.005771909 0.018430325 0.010400061
Hour 1.000000000 0.000479136 -0.003497739
holiday 0.000479136 1.000000000 -0.102087791
weekday -0.003497739 -0.102087791 1.000000000
workingday 0.002284998 -0.252471370 0.035955071
weathersit -0.020202528 -0.017036113 0.003310740
Normalized_feeling_temperature_C 0.133749965 -0.030972737 -0.008820945
Normalized_Humidity -0.276497828 -0.010588465 -0.037158268
windspeed 0.137251568 0.003987632 0.011501545
casual 0.301201730 0.031563628 0.032721415
registered 0.374140710 -0.047345424 0.021577888
Total_bike_users 0.394071498 -0.030927303 0.026899860
workingday weathersit
season 0.013743102 -0.014523552
Year -0.002196005 -0.019156853
Month -0.003476922 0.005399522
Hour 0.002284998 -0.020202528
holiday -0.252471370 -0.017036113
weekday 0.035955071 0.003310740
workingday 1.000000000 0.044672224
weathersit 0.044672224 1.000000000
Normalized_feeling_temperature_C 0.054667235 -0.105563108
Normalized_Humidity 0.015687512 0.418130329
windspeed -0.011829789 0.026225652
casual -0.300942486 -0.152627885
registered 0.134325791 -0.120965520
Total_bike_users 0.030284368 -0.142426138
Normalized_feeling_temperature_C
season 0.319379811
Year 0.039221595
Month 0.208096131
Hour 0.133749965
holiday -0.030972737
weekday -0.008820945
workingday 0.054667235
weathersit -0.105563108
Normalized_feeling_temperature_C 1.000000000
Normalized_Humidity -0.051917696
windspeed -0.062336043
casual 0.454080065
registered 0.332558635
Total_bike_users 0.400929304
Normalized_Humidity windspeed casual
season 0.15062475 -0.149772751 0.12020645
Year -0.08354642 -0.008739533 0.14277853
Month 0.16441144 -0.135386323 0.06845730
Hour -0.27649783 0.137251568 0.30120173
holiday -0.01058846 0.003987632 0.03156363
weekday -0.03715827 0.011501545 0.03272142
workingday 0.01568751 -0.011829789 -0.30094249
weathersit 0.41813033 0.026225652 -0.15262788
Normalized_feeling_temperature_C -0.05191770 -0.062336043 0.45408007
Normalized_Humidity 1.00000000 -0.290104895 -0.34702809
windspeed -0.29010490 1.000000000 0.09028678
casual -0.34702809 0.090286775 1.00000000
registered -0.27393312 0.082320847 0.50661770
Total_bike_users -0.32291074 0.093233784 0.69456408
registered Total_bike_users
season 0.17422563 0.17805573
Year 0.25368431 0.25049490
Month 0.12227297 0.12063776
Hour 0.37414071 0.39407150
holiday -0.04734542 -0.03092730
weekday 0.02157789 0.02689986
workingday 0.13432579 0.03028437
weathersit -0.12096552 -0.14242614
Normalized_feeling_temperature_C 0.33255864 0.40092930
Normalized_Humidity -0.27393312 -0.32291074
windspeed 0.08232085 0.09323378
casual 0.50661770 0.69456408
registered 1.00000000 0.97215073
Total_bike_users 0.97215073 1.00000000
Model Evaluation - Backward Elimination
Here we are aiming for anything with 95% significance.
This method works by creating a linear model and then analyzing the results and determining which values are not significant and you remove them from the linear regression to drive a more fitted regression. Below I have created this with creating a seed for reproduction and then set the train control with the method cross validation and set the k-folds to 10. Next I tested that on the regression model with everything included and this will test each variable one by one and tell you how many variables are needed to have the most fit model. Here it states that the model with 8 variables is the most fit with both the RMSE and MAE method showing this below, This results in an R squared of 38.8%. This is for the bike data set and when we do it on bike2 data set we see that the number of variables remains the same. They both operate the best at 8 variables anything past that would over fit the data and would result in worse information. This one is showing an r squared at 79.79% which is a pretty good fit for the data. The only variable that does get changed is hour is removed and weather type gets added in. The regression have already been checked for significance but have been placed within the regression section and they are explained there.
Code
# Setting for reproducibility
set.seed(1)
# setting repeated k-fold cross validation
<- trainControl(method = "cv", number = 10)
train.control
# testing it on bike2 table
<- train(Total_bike_users ~. -Date -casual -registered, data = bike2,
Backward_temp2 method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:10),
trControl = train.control
)<- Backward_temp2$results
results_bike2
kable(results_bike2, digits = 4, align = "ccccccc", col.names = c("nvmax", "RMSE", "Rsquared", "MAE", "RMSESD", "RsquaredSD", "MAESD"), caption = "Bike2 Data set Model Evaluation") %>%
kable_styling(font_size = 16) %>%
row_spec(c(8,8,8), background = "cadetblue")
nvmax | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
1 | 1499.2915 | 0.4046 | 1234.5472 | 106.3199 | 0.0447 | 100.5028 |
2 | 1081.6159 | 0.6922 | 849.4519 | 84.0027 | 0.0407 | 63.9932 |
3 | 1001.4803 | 0.7359 | 749.4293 | 75.5464 | 0.0293 | 38.5789 |
4 | 912.3321 | 0.7800 | 678.0571 | 76.8079 | 0.0358 | 51.4441 |
5 | 902.9102 | 0.7841 | 670.3104 | 85.8437 | 0.0392 | 53.3738 |
6 | 886.7729 | 0.7914 | 661.4331 | 86.9188 | 0.0402 | 51.9386 |
7 | 890.9950 | 0.7895 | 663.0067 | 96.0126 | 0.0445 | 55.0536 |
8 | 886.8600 | 0.7913 | 659.5744 | 100.0098 | 0.0469 | 61.1266 |
9 | 880.7110 | 0.7941 | 657.1637 | 100.5697 | 0.0469 | 62.2428 |
10 | 878.9295 | 0.7950 | 655.7529 | 102.1414 | 0.0472 | 61.9084 |
Code
# Here are the 12 variables in order
summary(Backward_temp2$finalModel)
Subset selection object
11 Variables (and intercept)
Forced in Forced out
instant FALSE FALSE
season FALSE FALSE
Year FALSE FALSE
Month FALSE FALSE
holiday FALSE FALSE
weekday FALSE FALSE
workingday FALSE FALSE
weathersit FALSE FALSE
Normalized_feeling_temperature_C FALSE FALSE
Normalized_Humidity FALSE FALSE
windspeed FALSE FALSE
1 subsets of each size up to 10
Selection Algorithm: backward
instant season Year Month holiday weekday workingday weathersit
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " "*" " " " " " " " " " "
3 ( 1 ) " " "*" "*" " " " " " " " " " "
4 ( 1 ) " " "*" "*" " " " " " " " " "*"
5 ( 1 ) " " "*" "*" " " " " "*" " " "*"
6 ( 1 ) " " "*" "*" " " " " "*" " " "*"
7 ( 1 ) " " "*" "*" " " " " "*" " " "*"
8 ( 1 ) " " "*" "*" " " "*" "*" " " "*"
9 ( 1 ) "*" "*" "*" " " "*" "*" " " "*"
10 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*"
Normalized_feeling_temperature_C Normalized_Humidity windspeed
1 ( 1 ) "*" " " " "
2 ( 1 ) "*" " " " "
3 ( 1 ) "*" " " " "
4 ( 1 ) "*" " " " "
5 ( 1 ) "*" " " " "
6 ( 1 ) "*" " " "*"
7 ( 1 ) "*" "*" "*"
8 ( 1 ) "*" "*" "*"
9 ( 1 ) "*" "*" "*"
10 ( 1 ) "*" "*" "*"
Code
# Training my model with backward elimination. Removed Date since there are other variables in here that accomplish time and removed
# Casual and registered since those two combined create total bike users.
<- train(Total_bike_users ~. -Date -casual -registered, data = bike,
Backward_temp1 method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:10),
trControl = train.control
)
# Here it shows the RMSE and MAE saying 8 variables is the best fit
<- Backward_temp1$results
results_bike
kable(results_bike, digits = 4, align = "ccccccc", col.names = c("nvmax", "RMSE", "Rsquared", "MAE", "RMSESD", "RsquaredSD", "MAESD"), caption = "Bike Data set Model Evaluation") %>%
kable_styling(font_size = 16) %>%
row_spec(c(8,8,8), background = "cadetblue")
nvmax | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
---|---|---|---|---|---|---|
1 | 166.1289 | 0.1609 | 125.9391 | 4.1424 | 0.0134 | 2.7129 |
2 | 154.0128 | 0.2791 | 113.6634 | 3.9548 | 0.0098 | 2.6805 |
3 | 147.8483 | 0.3357 | 110.5619 | 3.6648 | 0.0112 | 2.4969 |
4 | 143.4815 | 0.3743 | 106.5021 | 3.2789 | 0.0129 | 2.4428 |
5 | 142.0650 | 0.3865 | 105.9022 | 3.0654 | 0.0152 | 2.1619 |
6 | 141.9861 | 0.3872 | 105.8684 | 3.1385 | 0.0149 | 2.2025 |
7 | 141.9563 | 0.3874 | 105.8983 | 3.1446 | 0.0146 | 2.1957 |
8 | 141.8763 | 0.3881 | 105.8692 | 3.1399 | 0.0143 | 2.1592 |
9 | 141.8967 | 0.3880 | 105.9357 | 3.1397 | 0.0141 | 2.1585 |
10 | 141.8735 | 0.3882 | 106.0082 | 3.1303 | 0.0140 | 2.1435 |
Code
# Here are the 8 variables in order
summary(Backward_temp1$finalModel)
Subset selection object
12 Variables (and intercept)
Forced in Forced out
instant FALSE FALSE
season FALSE FALSE
Year FALSE FALSE
Month FALSE FALSE
Hour FALSE FALSE
holiday FALSE FALSE
weekday FALSE FALSE
workingday FALSE FALSE
weathersit FALSE FALSE
Normalized_feeling_temperature_C FALSE FALSE
Normalized_Humidity FALSE FALSE
windspeed FALSE FALSE
1 subsets of each size up to 10
Selection Algorithm: backward
instant season Year Month Hour holiday weekday workingday weathersit
1 ( 1 ) " " " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " "*" " " " " " " " "
3 ( 1 ) " " " " "*" " " "*" " " " " " " " "
4 ( 1 ) " " " " "*" " " "*" " " " " " " " "
5 ( 1 ) " " "*" "*" " " "*" " " " " " " " "
6 ( 1 ) " " "*" "*" " " "*" " " " " " " " "
7 ( 1 ) " " "*" "*" " " "*" "*" " " " " " "
8 ( 1 ) " " "*" "*" " " "*" "*" "*" " " " "
9 ( 1 ) " " "*" "*" " " "*" "*" "*" " " "*"
10 ( 1 ) " " "*" "*" " " "*" "*" "*" "*" "*"
Normalized_feeling_temperature_C Normalized_Humidity windspeed
1 ( 1 ) "*" " " " "
2 ( 1 ) "*" " " " "
3 ( 1 ) "*" " " " "
4 ( 1 ) "*" "*" " "
5 ( 1 ) "*" "*" " "
6 ( 1 ) "*" "*" "*"
7 ( 1 ) "*" "*" "*"
8 ( 1 ) "*" "*" "*"
9 ( 1 ) "*" "*" "*"
10 ( 1 ) "*" "*" "*"
Regression
Within the first test we notice a fairly low R squared but we also know they are all statistically significant meaning that we can reject the null hypothesis and say yes there appears to be some effect here. Once we implement the model evaluation to the regression we notice a large difference in the r squared.
Here we are looking at two different regression on two different scales. One is on hourly time and the other is on daily time which causes the variables to impact the outcome differently. Here we notice both temperature and humidity are statistically significant in both situations which is further evidence that we can really reject the null.
Code
# Regression 1 both temperature and humidity
lm(Total_bike_users ~ Normalized_feeling_temperature_C + Normalized_Humidity + season + weekday + Hour, data = bike) %>% summary()
Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C +
Normalized_Humidity + season + weekday + Hour, data = bike)
Residuals:
Min 1Q Median 3Q Max
-358.16 -95.48 -30.75 53.96 691.81
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.2565 6.0492 5.994 2.09e-09 ***
Normalized_feeling_temperature_C 330.3458 6.9634 47.440 < 2e-16 ***
Normalized_Humidity -229.8928 6.1318 -37.492 < 2e-16 ***
season 19.1274 1.0843 17.640 < 2e-16 ***
weekday 1.9755 0.5586 3.537 0.000406 ***
Hour 7.4868 0.1698 44.090 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 147.6 on 17373 degrees of freedom
Multiple R-squared: 0.3383, Adjusted R-squared: 0.3381
F-statistic: 1776 on 5 and 17373 DF, p-value: < 2.2e-16
Code
# This was the first regression I made based on my own knowledge and will be used to compare against the one created by the model
# We notice that the original had an r squared of 33.8% while this one has a 38.8% which is a large difference and shows the importance of finding the best fitted model
<- lm(Total_bike_users ~ Normalized_feeling_temperature_C + Normalized_Humidity + Hour + Year + season + holiday + weekday + windspeed, data = bike)
Regression1
summary(Regression1)
Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C +
Normalized_Humidity + Hour + Year + season + holiday + weekday +
windspeed, data = bike)
Residuals:
Min 1Q Median 3Q Max
-398.12 -92.84 -27.49 59.81 642.21
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.3890 6.7226 -4.074 4.64e-05 ***
Normalized_feeling_temperature_C 321.2238 6.7131 47.850 < 2e-16 ***
Normalized_Humidity -204.3276 6.1144 -33.418 < 2e-16 ***
Hour 7.6326 0.1638 46.608 < 2e-16 ***
Year 81.0198 2.1637 37.445 < 2e-16 ***
season 19.9910 1.0469 19.096 < 2e-16 ***
holiday -24.1092 6.4763 -3.723 0.000198 ***
weekday 1.9179 0.5399 3.552 0.000383 ***
windspeed 43.4284 9.2898 4.675 2.96e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141.9 on 17370 degrees of freedom
Multiple R-squared: 0.3885, Adjusted R-squared: 0.3882
F-statistic: 1379 on 8 and 17370 DF, p-value: < 2.2e-16
Code
plot(Regression1)
Code
# Regression two on bike2 data set both temperature and humidity. Comparing original regression on bike2 data set here we have an r squared of 50.58%
lm(Total_bike_users ~ Normalized_feeling_temperature_C + Normalized_Humidity + weathersit + season, data = bike2) %>% summary()
Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C +
Normalized_Humidity + weathersit + season, data = bike2)
Residuals:
Min 1Q Median 3Q Max
-4034.7 -1019.5 -183.7 1057.8 4356.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2276.00 260.57 8.735 < 2e-16 ***
Normalized_feeling_temperature_C 6486.06 340.09 19.072 < 2e-16 ***
Normalized_Humidity -1860.81 462.50 -4.023 6.34e-05 ***
weathersit -551.04 119.11 -4.626 4.41e-06 ***
season 436.18 49.21 8.863 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1366 on 726 degrees of freedom
Multiple R-squared: 0.5058, Adjusted R-squared: 0.503
F-statistic: 185.7 on 4 and 726 DF, p-value: < 2.2e-16
Code
# Here we have an r squared of 79.74% which is a MAJOR difference this has changed the coefficient of temperature by ~600. This goes to show how inaccurate the first model was.
<- lm(Total_bike_users ~ Normalized_feeling_temperature_C + Normalized_Humidity + weathersit + Year + season + holiday + weekday + windspeed, data = bike2)
Regression2 summary(Regression2)
Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C +
Normalized_Humidity + weathersit + Year + season + holiday +
weekday + windspeed, data = bike2)
Residuals:
Min 1Q Median 3Q Max
-4138.0 -425.5 73.3 536.4 2823.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1467.57 228.77 6.415 2.55e-10 ***
Normalized_feeling_temperature_C 5931.29 219.28 27.048 < 2e-16 ***
Normalized_Humidity -1132.22 313.36 -3.613 0.000323 ***
weathersit -592.12 78.49 -7.544 1.37e-13 ***
Year 2038.35 65.50 31.121 < 2e-16 ***
season 407.23 31.91 12.762 < 2e-16 ***
holiday -614.59 195.40 -3.145 0.001728 **
weekday 69.21 16.35 4.234 2.60e-05 ***
windspeed -2449.82 452.25 -5.417 8.27e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 876.8 on 722 degrees of freedom
Multiple R-squared: 0.7974, Adjusted R-squared: 0.7952
F-statistic: 355.2 on 8 and 722 DF, p-value: < 2.2e-16
Residuals vs Fitted(Will need to be adjusted)
This shows signs of heteroskedasticity and this is when standard deviations of a predicated variable being monitored over different values of an independent variable are non-constant. The problems that arise from this issue is, the standard error is wrong and thus the confidence intervals and hypothesis tests can not be relied on. This issues needs to be resolved before declaring the conclusion.
Code
<- lm(Total_bike_users ~ .-Date -casual -registered, data = bike)
mod plot(mod)
Code
summary(mod)
Call:
lm(formula = Total_bike_users ~ . - Date - casual - registered,
data = bike)
Residuals:
Min 1Q Median 3Q Max
-392.56 -93.49 -27.65 60.96 641.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.109e+01 7.337e+00 -4.237 2.27e-05 ***
instant -4.527e-03 5.126e-03 -0.883 0.37719
season 1.993e+01 1.819e+00 10.959 < 2e-16 ***
Year 1.207e+02 4.487e+01 2.690 0.00714 **
Month 3.283e+00 3.775e+00 0.870 0.38457
Hour 7.671e+00 1.650e-01 46.477 < 2e-16 ***
holiday -2.151e+01 6.692e+00 -3.214 0.00131 **
weekday 1.928e+00 5.403e-01 3.569 0.00036 ***
workingday 4.031e+00 2.396e+00 1.683 0.09245 .
weathersit -3.334e+00 1.904e+00 -1.751 0.07994 .
Normalized_feeling_temperature_C 3.197e+02 6.777e+00 47.176 < 2e-16 ***
Normalized_Humidity -1.989e+02 6.880e+00 -28.906 < 2e-16 ***
windspeed 4.612e+01 9.401e+00 4.906 9.39e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141.9 on 17366 degrees of freedom
Multiple R-squared: 0.3887, Adjusted R-squared: 0.3883
F-statistic: 920.3 on 12 and 17366 DF, p-value: < 2.2e-16
Resolving Heteroskedasticity
There is no Heteroskedasticity
There is Heteroskedasticity
Here we will conduct the Breusch-Pagan test using the lmtest package and bptest() function. This will let us know if there is heteroskedascity if the P < .05. Here we see that both meet this standard and thus we have evidence to reject the null hypothesis
Code
# Breusch-Pagan test to determine if Heteroskedasticity exist
bptest(Regression1)
studentized Breusch-Pagan test
data: Regression1
BP = 1450.9, df = 8, p-value < 2.2e-16
Code
bptest(Regression2)
studentized Breusch-Pagan test
data: Regression2
BP = 47.248, df = 8, p-value = 1.375e-07
Code
# both of these are p<.05 meaning that we reject the null hypothesis and say yes there is heteroskedacity
Visualizations
Here we see as temperature increases we can expect bike users to increase while as humidity increases we expect the opposite.
Code
plot(Regression1)
Code
plot(regression2)
Error in plot(regression2): object 'regression2' not found
Code
<- bike2 %>%
Bike_users_plot ggplot(aes(x=Date, y=Total_bike_users)) +
geom_area(fill="#69b3a2", alpha=0.5) +
geom_line(color="#69b3a2") +
ylab("total bike user") +
theme_ipsum()
# Making it interactive
<- ggplotly(Bike_users_plot)
Bike_users_plot Bike_users_plot
Code
# Value used to transform the data
<- 10000
coeff
# A few constants
<- "#69b3a2"
temperatureColor <- rgb(0.2, 0.6, 0.9, 1)
priceColor
ggplot(bike2, aes(x=Date)) +
geom_line( aes(y=Normalized_feeling_temperature_C), size=2, color=temperatureColor) +
geom_line( aes(y=Total_bike_users / coeff), size=2, color=priceColor) +
scale_y_continuous(
# Features of the first axis
name = "Temperature (Normalized)",
# Add a second axis and specify its features
sec.axis = sec_axis(~.*coeff, name="Bike users")
+
)
theme_ipsum() +
theme(
axis.title.y = element_text(color = temperatureColor, size=13),
axis.title.y.right = element_text(color = priceColor, size=13)
+
)
ggtitle("Temperature correlates with bikes users")
Code
# plotting the data to visualize
ggplot(data = bike2, aes(x=Normalized_feeling_temperature_C, y = Total_bike_users)) +
geom_point() +
geom_smooth(method = lm) +
theme_fivethirtyeight(base_size = 10, base_family = 'serif') +
theme(axis.title = element_text(family = 'serif', size = 15)) + ylab('Total Bike Users') + xlab('Normalized Feeling Temperature') +
labs(title = "Relationship between Temerpature and Bike users", caption = "")
Code
ggplot(data = bike2, aes(x=Normalized_Humidity, y = Total_bike_users)) +
geom_point() +
geom_smooth(method = lm) +
theme_fivethirtyeight(base_size = 10, base_family = 'serif') +
theme(axis.title = element_text(family = 'serif', size = 15)) + ylab('Total Bike Users') + xlab('Normalized Humidity') +
labs(title = "Relationship between Humidity and Bike users", caption = "")
Code
ggpairs(bike2, columns = c(10, 11, 15), ggplot2::aes(colour='red'))
Conclusions
In conclusion, for hypothesis one we reject the null hypothesis with evidence of the extremely significant p-value of 2e-16. This give us evidence that we can accept the alternative and say yes temperature has an impact on bike sales. There were two different tests done here, we did the daily data compared to the hourly data. Both were significant and we controlled for 7 variables which are specified at the top. The reason we controlled these variables is that they could impact the outcome variable and thus we controlled them to make sure that they were not impacting the results. So in conclusion, normalized feeling temperature Celsius has an impact on bike users. Looking at our correlation graph we can see that it has a positive correlation with bike users at .631.
For the second questions we can also reject the null hypothesis as humidity is significant with a p-value of < 2e-16 in data set one and 0.000323 in data set two. This is further evidence that we can reject the null as in both scales is what significant. This data held the same control variables as temperature and thus we can yes humidity has an impact on bike users and looking at our correlation we see a negative correlation of -.101.
References
- Fanaee-T, H. (n.d.). Bike Sharing Dataset . Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
- Pazdan, Sylwia. (2020). The impact of weather on bicycle risk exposure. Archives of Transport. 56. 89-105. 10.5604/01.3001.0014.5629.