finalpart3
Temperature and Humidity impact on number of bike users
Author

Ethan Campbell

Published

December 15, 2022

Introduction

Climate has always been a topic that sparks debate and there is continuous research being done on it every day. I wanted to contribute to this research and analyze the impacts climate factors like temperature and humidity have on bike users. There has been study related to weather conditions and biking and whether or not it results in more accidents which concluded in an increase in accidents. “It suggests that weather conditions should be considered in every analysis where bicycle volume data is needed” (Pazdan, 2020). The paper describes the importance of weather condition and how they should be used in any analysis regarding biking data. Here, I thought about if we know that these factors are important, how important are they? I can find this data online somewhere however, I would like to conduct my own study and determine the results from that and then compare it to results online.

Research Questions

A. How has temperature and humidity impacted bike users?

My motivation is driven by my own interest in biking and climate factors and the study of how the climate is impacting human movement. I think it is interesting how significant climate factors can impact certain human activities and I want to learn more about which ones are presenting the largest impact on human activity. The reason this study is different is that it is based on normalized data and is focusing how bikers are impacted by weather condition.

Hypothesis

My hypothesis is testing how and if climate factors like temperature and humidity have an impact on bike users. I am hypothesizing that temperature will have the largest impact on bike sales. I believe temperature will mostly cause it to increase as nicer weather would be good for a ride. However, I believe at a certain point it will dip down and no longer be as positive. I hypothesize that as humidity increases then the number of bike users drop since the biking conditions will grow away from the ideal conditions. This study will use normalized feeling temperature and normalized humidity compared to daily and hourly bike users. This study is using new data compared to the research documents that have been observed online and presents a new view to either support or reject their claims.

The hypothesis will be tested as follows:

H0A

Temperature does not impact bike users

H1A

Temperature does impact bike users

H0A

Humidity does not impact bike users

H1A

Humidity does impact bike users

Analytic Planning

flowchart LR
  A[Importing Data] --> B(Data Cleaning)
  B --> C[Descriptive Statistics]
  C --> D(Hypothesis testing)
  D --> E[Backward Elimination]
  E --> F[Regression Analysis 1]
  F --> G[Research Question 1]
  G --> H(Conclusion)
  E --> I[Regression Analysis 2]
  I --> J[Research Question 2]
  J --> H(Conclusion)

Code
library(tidyverse)
library(plyr)
library(dplyr)
library(kableExtra)
library(lubridate)
library(lmtest)
library(sandwich)
library(GGally)
library(caret)
library(ggthemes)
library(plotly)
library(hrbrthemes)
library(leaps)
library(ggResidpanel)
library(jtools)
library(huxtable)
Code
bike <- read.csv("_data/hour.csv")
bike2 <- read.csv("_data/day.csv")
dim(bike)
[1] 17379    17
Code
dim(bike2)
[1] 731  16
Code
summary(bike)
    instant         dteday              season            yr        
 Min.   :    1   Length:17379       Min.   :1.000   Min.   :0.0000  
 1st Qu.: 4346   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
 Median : 8690   Mode  :character   Median :3.000   Median :1.0000  
 Mean   : 8690                      Mean   :2.502   Mean   :0.5026  
 3rd Qu.:13034                      3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :17379                      Max.   :4.000   Max.   :1.0000  
      mnth              hr           holiday           weekday     
 Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
 1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
 Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
 Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
 3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
 Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
   workingday       weathersit         temp           atemp       
 Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
 Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
 Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
 3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
 Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
      hum           windspeed          casual         registered   
 Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
 1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
 Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
 Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
 3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
 Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
      cnt       
 Min.   :  1.0  
 1st Qu.: 40.0  
 Median :142.0  
 Mean   :189.5  
 3rd Qu.:281.0  
 Max.   :977.0  

Data

The data was collected from UCI machine learning repository. Where I collected for both daily and hourly information for 17 variables. Each variable is described below. Important variables will be in bold. Their relationship towards the analysis will be described underneath this section.

  • instant - This is the record index. This is the count of how many rows there are. This object will not be utilized in this study
  • dteday - This is the date. The date is currently in year-day-month format. This will be used to observe change over time
  • season - This is the 4 seasons. This is expressed as: 1-Winter, 2-Spring, 3-Summer, 4- Fall. This will be used as a control variable since this could impact the sales of bikes and the independent variables
  • yr - This is the year ranging from 2011-2012. This will be used analyze change over each year.
  • mnth - This is the month 1-12. This will be used to analyze the change over months.
  • hr - This is hour 0-23. This will be used to analyze the change by the hour.
  • holiday - This is the holidays so whether or not it is a holiday. This will be used as an independent variable in conjunction with another
  • weekday - Day of the week.
  • workingday - if day is neither weekend nor holiday is 1, otherwise is 0.
  • weathersit -
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp - Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
  • atemp - Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
  • hum - Normalized humidity. The values are divided to 100 (max)
  • windspeed - Normalized wind speed. The values are divided to 67 (max)
  • casual - count of casual users
  • registered - count of registered users
  • cnt - count of total rental bikes including both casual and registered

Regression 1

  • Explanatory - normalized temperature feeling
  • Outcome - cnt
  • Control - Season, holiday, weekday, weathersit

Regression 2

  • Explanatory - Normalized humidity
  • Outcome - cnt
  • Control - temperature(normalized and normalized feeling), wind speed, weekday, holiday, weathersit

Regression Model

Interaction Terms

\[ \hat{Y} = b0 + b1X1+b2X2 \]

Regression Model 1

\[ Cnt = FeelingTemperature + Humidity + Season + Week Day + Hour + Year + Holiday + Windspeed \] Regression Model 2

\[ Cnt = FeelingTemperature + Humidity + Season + Week Day + Weather Type + Year + Holiday + Windspeed \]

Code
bike <- bike %>%
  dplyr::rename('Date' = dteday) %>%
  dplyr::rename('Year' = yr) %>%
  dplyr::rename('Month' = mnth) %>%
  dplyr::rename('Hour' = hr) %>%
  dplyr::rename('Normalized_temperature_C' = temp) %>%
  dplyr::rename('Normalized_feeling_temperature_C' = atemp) %>%
  dplyr::rename('Normalized_Humidity' = hum) %>%
  dplyr::rename("Total_bike_users" = cnt)
bike2 <- bike2 %>%
  dplyr::rename('Date' = dteday) %>%
  dplyr::rename('Year' = yr) %>%
  dplyr::rename('Month' = mnth) %>%
  dplyr::rename('Normalized_temperature_C' = temp) %>%
  dplyr::rename('Normalized_feeling_temperature_C' = atemp) %>%
  dplyr::rename('Normalized_Humidity' = hum) %>%
  dplyr::rename("Total_bike_users" = cnt)
bike2$Date <- ymd(bike2$Date)
# Checking for multicollinearity (We notice that temp and feeling temp are almost identical) so we removed the normalized temperature form the study since I want to focus on feeling temperature. Also removing instant since it is just the count of rows.
bike <- bike[,-11]
bike2 <- bike2[,-10]
cor(bike[3:16])
                                       season         Year        Month
season                            1.000000000 -0.010742486  0.830385892
Year                             -0.010742486  1.000000000 -0.010472929
Month                             0.830385892 -0.010472929  1.000000000
Hour                             -0.006116901 -0.003867005 -0.005771909
holiday                          -0.009584526  0.006691617  0.018430325
weekday                          -0.002335350 -0.004484851  0.010400061
workingday                        0.013743102 -0.002196005 -0.003476922
weathersit                       -0.014523552 -0.019156853  0.005399522
Normalized_feeling_temperature_C  0.319379811  0.039221595  0.208096131
Normalized_Humidity               0.150624745 -0.083546421  0.164411443
windspeed                        -0.149772751 -0.008739533 -0.135386323
casual                            0.120206447  0.142778528  0.068457301
registered                        0.174225633  0.253684310  0.122272967
Total_bike_users                  0.178055731  0.250494899  0.120637760
                                         Hour      holiday      weekday
season                           -0.006116901 -0.009584526 -0.002335350
Year                             -0.003867005  0.006691617 -0.004484851
Month                            -0.005771909  0.018430325  0.010400061
Hour                              1.000000000  0.000479136 -0.003497739
holiday                           0.000479136  1.000000000 -0.102087791
weekday                          -0.003497739 -0.102087791  1.000000000
workingday                        0.002284998 -0.252471370  0.035955071
weathersit                       -0.020202528 -0.017036113  0.003310740
Normalized_feeling_temperature_C  0.133749965 -0.030972737 -0.008820945
Normalized_Humidity              -0.276497828 -0.010588465 -0.037158268
windspeed                         0.137251568  0.003987632  0.011501545
casual                            0.301201730  0.031563628  0.032721415
registered                        0.374140710 -0.047345424  0.021577888
Total_bike_users                  0.394071498 -0.030927303  0.026899860
                                   workingday   weathersit
season                            0.013743102 -0.014523552
Year                             -0.002196005 -0.019156853
Month                            -0.003476922  0.005399522
Hour                              0.002284998 -0.020202528
holiday                          -0.252471370 -0.017036113
weekday                           0.035955071  0.003310740
workingday                        1.000000000  0.044672224
weathersit                        0.044672224  1.000000000
Normalized_feeling_temperature_C  0.054667235 -0.105563108
Normalized_Humidity               0.015687512  0.418130329
windspeed                        -0.011829789  0.026225652
casual                           -0.300942486 -0.152627885
registered                        0.134325791 -0.120965520
Total_bike_users                  0.030284368 -0.142426138
                                 Normalized_feeling_temperature_C
season                                                0.319379811
Year                                                  0.039221595
Month                                                 0.208096131
Hour                                                  0.133749965
holiday                                              -0.030972737
weekday                                              -0.008820945
workingday                                            0.054667235
weathersit                                           -0.105563108
Normalized_feeling_temperature_C                      1.000000000
Normalized_Humidity                                  -0.051917696
windspeed                                            -0.062336043
casual                                                0.454080065
registered                                            0.332558635
Total_bike_users                                      0.400929304
                                 Normalized_Humidity    windspeed      casual
season                                    0.15062475 -0.149772751  0.12020645
Year                                     -0.08354642 -0.008739533  0.14277853
Month                                     0.16441144 -0.135386323  0.06845730
Hour                                     -0.27649783  0.137251568  0.30120173
holiday                                  -0.01058846  0.003987632  0.03156363
weekday                                  -0.03715827  0.011501545  0.03272142
workingday                                0.01568751 -0.011829789 -0.30094249
weathersit                                0.41813033  0.026225652 -0.15262788
Normalized_feeling_temperature_C         -0.05191770 -0.062336043  0.45408007
Normalized_Humidity                       1.00000000 -0.290104895 -0.34702809
windspeed                                -0.29010490  1.000000000  0.09028678
casual                                   -0.34702809  0.090286775  1.00000000
registered                               -0.27393312  0.082320847  0.50661770
Total_bike_users                         -0.32291074  0.093233784  0.69456408
                                  registered Total_bike_users
season                            0.17422563       0.17805573
Year                              0.25368431       0.25049490
Month                             0.12227297       0.12063776
Hour                              0.37414071       0.39407150
holiday                          -0.04734542      -0.03092730
weekday                           0.02157789       0.02689986
workingday                        0.13432579       0.03028437
weathersit                       -0.12096552      -0.14242614
Normalized_feeling_temperature_C  0.33255864       0.40092930
Normalized_Humidity              -0.27393312      -0.32291074
windspeed                         0.08232085       0.09323378
casual                            0.50661770       0.69456408
registered                        1.00000000       0.97215073
Total_bike_users                  0.97215073       1.00000000

Here we are aiming for anything with 95% significance.

This method works by creating a linear model and then analyzing the results and determining which values are not significant and you remove them from the linear regression to drive a more fitted regression. Below I have created this with creating a seed for reproduction and then set the train control with the method backward elimination and set the k-folds to 10. Next I tested that on the regression model with everything included and this will test each variable one by one and tell you how many variables are needed to have the most fit model. Here it states that the model with 8 variables is the most fit with both the RMSE and MAE method showing this below, This results in an R squared of 38.8%. This is for the bike data set and when we do it on bike2 data set we see that the number of variables remains the same. They both operate the best at 8 variables anything past that would over fit the data and would result in worse information. This one is showing an r squared at 79.79% which is a pretty good fit for the data. The only variable that does get changed is hour is removed and weather type gets added in. The regression have already been checked for significance but have been placed within the regression section and they are explained there. Using this information I performed the backward elimination and simiply removed the highest p-value variables and kept the most significant ones.

Code
# Setting for reproducibility
set.seed(1)
# setting repeated k-fold cross validation
train.control <- trainControl(method = "cv", number = 10)
# testing it on bike2 table
Backward_temp2 <- train(Total_bike_users ~. -Date -casual -registered, data = bike2,
                    method = "leapBackward", 
                    tuneGrid = data.frame(nvmax = 1:10),
                    trControl = train.control
                    )
results_bike2 <- Backward_temp2$results
kable(results_bike2, digits = 4, align = "ccccccc", col.names = c("nvmax", "RMSE", "Rsquared", "MAE", "RMSESD", "RsquaredSD", "MAESD"), caption = "Bike2 Data set Model Evaluation") %>%
  kable_styling(font_size = 16) %>%
  row_spec(c(8,8,8), background = "cadetblue")
Bike2 Data set Model Evaluation
nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1499.2915 0.4046 1234.5472 106.3199 0.0447 100.5028
2 1081.6159 0.6922 849.4519 84.0027 0.0407 63.9932
3 1001.4803 0.7359 749.4293 75.5464 0.0293 38.5789
4 912.3321 0.7800 678.0571 76.8079 0.0358 51.4441
5 902.9102 0.7841 670.3104 85.8437 0.0392 53.3738
6 886.7729 0.7914 661.4331 86.9188 0.0402 51.9386
7 890.9950 0.7895 663.0067 96.0126 0.0445 55.0536
8 886.8600 0.7913 659.5744 100.0098 0.0469 61.1266
9 880.7110 0.7941 657.1637 100.5697 0.0469 62.2428
10 878.9295 0.7950 655.7529 102.1414 0.0472 61.9084

Figure 1: Table shows the 9th variable being a better fit however, it is wanting to include the instant variable which is just a count of rows and thus has been rejected and I have accepted the 8 variable count

Code
# Here are the 12 variables in order
summary(Backward_temp2$finalModel)
Subset selection object
11 Variables  (and intercept)
                                 Forced in Forced out
instant                              FALSE      FALSE
season                               FALSE      FALSE
Year                                 FALSE      FALSE
Month                                FALSE      FALSE
holiday                              FALSE      FALSE
weekday                              FALSE      FALSE
workingday                           FALSE      FALSE
weathersit                           FALSE      FALSE
Normalized_feeling_temperature_C     FALSE      FALSE
Normalized_Humidity                  FALSE      FALSE
windspeed                            FALSE      FALSE
1 subsets of each size up to 10
Selection Algorithm: backward
          instant season Year Month holiday weekday workingday weathersit
1  ( 1 )  " "     " "    " "  " "   " "     " "     " "        " "       
2  ( 1 )  " "     " "    "*"  " "   " "     " "     " "        " "       
3  ( 1 )  " "     "*"    "*"  " "   " "     " "     " "        " "       
4  ( 1 )  " "     "*"    "*"  " "   " "     " "     " "        "*"       
5  ( 1 )  " "     "*"    "*"  " "   " "     "*"     " "        "*"       
6  ( 1 )  " "     "*"    "*"  " "   " "     "*"     " "        "*"       
7  ( 1 )  " "     "*"    "*"  " "   " "     "*"     " "        "*"       
8  ( 1 )  " "     "*"    "*"  " "   "*"     "*"     " "        "*"       
9  ( 1 )  "*"     "*"    "*"  " "   "*"     "*"     " "        "*"       
10  ( 1 ) "*"     "*"    "*"  "*"   "*"     "*"     " "        "*"       
          Normalized_feeling_temperature_C Normalized_Humidity windspeed
1  ( 1 )  "*"                              " "                 " "      
2  ( 1 )  "*"                              " "                 " "      
3  ( 1 )  "*"                              " "                 " "      
4  ( 1 )  "*"                              " "                 " "      
5  ( 1 )  "*"                              " "                 " "      
6  ( 1 )  "*"                              " "                 "*"      
7  ( 1 )  "*"                              "*"                 "*"      
8  ( 1 )  "*"                              "*"                 "*"      
9  ( 1 )  "*"                              "*"                 "*"      
10  ( 1 ) "*"                              "*"                 "*"      
Code
# Training my model with backward elimination. Removed Date since there are other variables in here that accomplish time and removed 
# Casual and registered since those two combined create total bike users. 
Backward_temp1 <- train(Total_bike_users ~. -Date -casual -registered, data = bike,
                    method = "leapBackward", 
                    tuneGrid = data.frame(nvmax = 1:10),
                    trControl = train.control
                    )
# Here it shows the RMSE and MAE saying 8 variables is the best fit 
results_bike <- Backward_temp1$results
kable(results_bike, digits = 4, align = "ccccccc", col.names = c("nvmax", "RMSE", "Rsquared", "MAE", "RMSESD", "RsquaredSD", "MAESD"), caption = "Bike Data set Model Evaluation") %>%
  kable_styling(font_size = 16) %>%
  row_spec(c(8,8,8), background = "cadetblue")
Bike Data set Model Evaluation
nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 166.1289 0.1609 125.9391 4.1424 0.0134 2.7129
2 154.0128 0.2791 113.6634 3.9548 0.0098 2.6805
3 147.8483 0.3357 110.5619 3.6648 0.0112 2.4969
4 143.4815 0.3743 106.5021 3.2789 0.0129 2.4428
5 142.0650 0.3865 105.9022 3.0654 0.0152 2.1619
6 141.9861 0.3872 105.8684 3.1385 0.0149 2.2025
7 141.9563 0.3874 105.8983 3.1446 0.0146 2.1957
8 141.8763 0.3881 105.8692 3.1399 0.0143 2.1592
9 141.8967 0.3880 105.9357 3.1397 0.0141 2.1585
10 141.8735 0.3882 106.0082 3.1303 0.0140 2.1435

Figure 2: Table shows the 9th variable being a better fit however, it is wanting to include the instant variable which is just a count of rows and thus has been rejected and I have accepted the 8 variable count

Code
# Here are the 8 variables in order
summary(Backward_temp1$finalModel)
Subset selection object
12 Variables  (and intercept)
                                 Forced in Forced out
instant                              FALSE      FALSE
season                               FALSE      FALSE
Year                                 FALSE      FALSE
Month                                FALSE      FALSE
Hour                                 FALSE      FALSE
holiday                              FALSE      FALSE
weekday                              FALSE      FALSE
workingday                           FALSE      FALSE
weathersit                           FALSE      FALSE
Normalized_feeling_temperature_C     FALSE      FALSE
Normalized_Humidity                  FALSE      FALSE
windspeed                            FALSE      FALSE
1 subsets of each size up to 10
Selection Algorithm: backward
          instant season Year Month Hour holiday weekday workingday weathersit
1  ( 1 )  " "     " "    " "  " "   " "  " "     " "     " "        " "       
2  ( 1 )  " "     " "    " "  " "   "*"  " "     " "     " "        " "       
3  ( 1 )  " "     " "    "*"  " "   "*"  " "     " "     " "        " "       
4  ( 1 )  " "     " "    "*"  " "   "*"  " "     " "     " "        " "       
5  ( 1 )  " "     "*"    "*"  " "   "*"  " "     " "     " "        " "       
6  ( 1 )  " "     "*"    "*"  " "   "*"  " "     " "     " "        " "       
7  ( 1 )  " "     "*"    "*"  " "   "*"  "*"     " "     " "        " "       
8  ( 1 )  " "     "*"    "*"  " "   "*"  "*"     "*"     " "        " "       
9  ( 1 )  " "     "*"    "*"  " "   "*"  "*"     "*"     " "        "*"       
10  ( 1 ) " "     "*"    "*"  " "   "*"  "*"     "*"     "*"        "*"       
          Normalized_feeling_temperature_C Normalized_Humidity windspeed
1  ( 1 )  "*"                              " "                 " "      
2  ( 1 )  "*"                              " "                 " "      
3  ( 1 )  "*"                              " "                 " "      
4  ( 1 )  "*"                              "*"                 " "      
5  ( 1 )  "*"                              "*"                 " "      
6  ( 1 )  "*"                              "*"                 "*"      
7  ( 1 )  "*"                              "*"                 "*"      
8  ( 1 )  "*"                              "*"                 "*"      
9  ( 1 )  "*"                              "*"                 "*"      
10  ( 1 ) "*"                              "*"                 "*"      

Within the first test we notice a fairly low R squared but we also know they are all statistically significant meaning that we can reject the null hypothesis and say yes there appears to be some effect here. Once we implement the model evaluation to the regression we notice a large difference in the r squared.

Here we are looking at two different regression on two different scales. One is on hourly time and the other is on daily time which causes the variables to impact the outcome differently. Here we notice both temperature and humidity are statistically significant in both situations which is further evidence that we can really reject the null.

Code
# Regression 1 both temperature and humidity
mod1 <- lm(Total_bike_users ~  Normalized_feeling_temperature_C + Normalized_Humidity + season + weekday + Hour, data = bike) 

summary(mod1)

Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C + 
    Normalized_Humidity + season + weekday + Hour, data = bike)

Residuals:
    Min      1Q  Median      3Q     Max 
-358.16  -95.48  -30.75   53.96  691.81 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        36.2565     6.0492   5.994 2.09e-09 ***
Normalized_feeling_temperature_C  330.3458     6.9634  47.440  < 2e-16 ***
Normalized_Humidity              -229.8928     6.1318 -37.492  < 2e-16 ***
season                             19.1274     1.0843  17.640  < 2e-16 ***
weekday                             1.9755     0.5586   3.537 0.000406 ***
Hour                                7.4868     0.1698  44.090  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 147.6 on 17373 degrees of freedom
Multiple R-squared:  0.3383,    Adjusted R-squared:  0.3381 
F-statistic:  1776 on 5 and 17373 DF,  p-value: < 2.2e-16
Code
# We notice that the original had an r squared of 33.8% while this one has a 38.8% which is a large difference and shows the importance of finding the best fitted model
Regression1 <- lm(Total_bike_users ~ Normalized_feeling_temperature_C + Normalized_Humidity + Hour + Year + season + holiday + weekday + windspeed, data = bike)
summary(Regression1)

Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C + 
    Normalized_Humidity + Hour + Year + season + holiday + weekday + 
    windspeed, data = bike)

Residuals:
    Min      1Q  Median      3Q     Max 
-398.12  -92.84  -27.49   59.81  642.21 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       -27.3890     6.7226  -4.074 4.64e-05 ***
Normalized_feeling_temperature_C  321.2238     6.7131  47.850  < 2e-16 ***
Normalized_Humidity              -204.3276     6.1144 -33.418  < 2e-16 ***
Hour                                7.6326     0.1638  46.608  < 2e-16 ***
Year                               81.0198     2.1637  37.445  < 2e-16 ***
season                             19.9910     1.0469  19.096  < 2e-16 ***
holiday                           -24.1092     6.4763  -3.723 0.000198 ***
weekday                             1.9179     0.5399   3.552 0.000383 ***
windspeed                          43.4284     9.2898   4.675 2.96e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141.9 on 17370 degrees of freedom
Multiple R-squared:  0.3885,    Adjusted R-squared:  0.3882 
F-statistic:  1379 on 8 and 17370 DF,  p-value: < 2.2e-16
Code
# Regression two on bike2 data set both temperature and humidity. Comparing original regression on bike2 data set here we have an r squared of 50.58%
lm(Total_bike_users ~ Normalized_feeling_temperature_C + Normalized_Humidity + weathersit + season, data = bike2) %>% summary()

Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C + 
    Normalized_Humidity + weathersit + season, data = bike2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4034.7 -1019.5  -183.7  1057.8  4356.6 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       2276.00     260.57   8.735  < 2e-16 ***
Normalized_feeling_temperature_C  6486.06     340.09  19.072  < 2e-16 ***
Normalized_Humidity              -1860.81     462.50  -4.023 6.34e-05 ***
weathersit                        -551.04     119.11  -4.626 4.41e-06 ***
season                             436.18      49.21   8.863  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1366 on 726 degrees of freedom
Multiple R-squared:  0.5058,    Adjusted R-squared:  0.503 
F-statistic: 185.7 on 4 and 726 DF,  p-value: < 2.2e-16
Code
# Here we have an r squared of 79.74% which is a MAJOR difference this has changed the coefficient of temperature by ~600. This goes to show how inaccurate the first model was. 
Regression2 <- lm(Total_bike_users ~ Normalized_feeling_temperature_C + Normalized_Humidity + weathersit + Year + season + holiday + weekday + windspeed, data = bike2)
summary(Regression2)

Call:
lm(formula = Total_bike_users ~ Normalized_feeling_temperature_C + 
    Normalized_Humidity + weathersit + Year + season + holiday + 
    weekday + windspeed, data = bike2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4138.0  -425.5    73.3   536.4  2823.2 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       1467.57     228.77   6.415 2.55e-10 ***
Normalized_feeling_temperature_C  5931.29     219.28  27.048  < 2e-16 ***
Normalized_Humidity              -1132.22     313.36  -3.613 0.000323 ***
weathersit                        -592.12      78.49  -7.544 1.37e-13 ***
Year                              2038.35      65.50  31.121  < 2e-16 ***
season                             407.23      31.91  12.762  < 2e-16 ***
holiday                           -614.59     195.40  -3.145 0.001728 ** 
weekday                             69.21      16.35   4.234 2.60e-05 ***
windspeed                        -2449.82     452.25  -5.417 8.27e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 876.8 on 722 degrees of freedom
Multiple R-squared:  0.7974,    Adjusted R-squared:  0.7952 
F-statistic: 355.2 on 8 and 722 DF,  p-value: < 2.2e-16

I am incorporating a log here since the temperature was a positive relationship at first but as the temperature continued to increase it would then turn towards being a negative one. This is caused by bikers not wanting to be out side when the temperature gets too hot as it becomes uncomfortable and unsafe. I added a log to help account for this factor and help balance out the analysis. When looking at the R^2 we notice that compared to the original without the log this one is a good point or two higher meaning it has definitely helped explain the outcome variable.

Code
Regression3 <- lm(Total_bike_users ~ log(Normalized_feeling_temperature_C) + Normalized_Humidity + weathersit + Year + season + holiday + weekday + windspeed, data = bike2)
summary(Regression3)

Call:
lm(formula = Total_bike_users ~ log(Normalized_feeling_temperature_C) + 
    Normalized_Humidity + weathersit + Year + season + holiday + 
    weekday + windspeed, data = bike2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4142.2  -460.4    37.8   527.5  2659.3 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                            6727.61     239.61  28.078  < 2e-16 ***
log(Normalized_feeling_temperature_C)  2544.66      87.74  29.004  < 2e-16 ***
Normalized_Humidity                   -1444.41     303.93  -4.752 2.43e-06 ***
weathersit                             -593.93      75.56  -7.861 1.39e-14 ***
Year                                   1992.40      63.27  31.491  < 2e-16 ***
season                                  361.11      31.17  11.584  < 2e-16 ***
holiday                                -593.47     188.45  -3.149   0.0017 ** 
weekday                                  66.37      15.76   4.210 2.88e-05 ***
windspeed                             -2571.74     435.89  -5.900 5.59e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 845.5 on 722 degrees of freedom
Multiple R-squared:  0.8116,    Adjusted R-squared:  0.8095 
F-statistic: 388.8 on 8 and 722 DF,  p-value: < 2.2e-16
Code
resid_panel(Regression3, plots = c("resid","index","ls","qq"),
            smoother = TRUE)
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

H0A

There is no Heteroskedasticity

H1A

There is Heteroskedasticity

Here we will conduct the Breusch-Pagan test using the lmtest package and bptest() function. This will let us know if there is heterosexuality if the P < .05. Here we see that both meet this standard and thus we have evidence to reject the null hypothesis. Since we can reject the null we are able to say yes, heteroskedacity exists within this model. There are a few ways to go about correcting heteroskedacity the first being a robust standard error and making sure they account for the heteroskedacity. The other we can try is the generalized least squares method but for this one I am going to try the former. After applying the correction we can notice the standard error changes on each variable and we can see a different p-value. This will show us a more robust evaluation but what I have noticed is that while the significance may change a little they still remain significant in either method. Below we can see the comparison between regression 3 which is after the backward elimination and the log function and it is being compared to the robust version.

Code
# Breusch-Pagan test to determine if Heteroskedasticity exist
bptest(Regression1)

    studentized Breusch-Pagan test

data:  Regression1
BP = 1450.9, df = 8, p-value < 2.2e-16
Code
bptest(Regression3)

    studentized Breusch-Pagan test

data:  Regression3
BP = 35.161, df = 8, p-value = 2.5e-05
Code
# both of these are p<.05 meaning that we reject the null hypothesis and say yes there is heteroskedacity 

correction_r1 <- coeftest(Regression1, vcov. = vcovHC(Regression1, type = "HC1"))
correction_r3 <- coeftest(Regression3, vcov. = vcovHC(Regression3, type = "HC1"))

# comparing bike regression vs the robust version
export_summs(Regression1, correction_r1, scale = TRUE,
             model.names = c("Regression(bike)", "Robust Logged Regression"))
Original model not retained as part of coeftest object. For additional model summary information (r.squared, df, etc.), consider passing `glance.coeftest()` an object where the underlying model has been saved, i.e.`lmtest::coeftest(..., save = TRUE)`.
This message is displayed once per session.
Regression(bike)Robust Logged Regression
(Intercept)149.44 ***-27.39 ***
(1.54)   (5.89)   
Normalized_feeling_temperature_C55.20 ***321.22 ***
(1.15)   (6.61)   
Normalized_Humidity-39.42 ***-204.33 ***
(1.18)   (5.80)   
Hour52.77 ***7.63 ***
(1.13)   (0.15)   
Year81.02 ***81.02 ***
(2.16)   (2.15)   
season22.13 ***19.99 ***
(1.16)   (1.02)   
holiday-24.11 ***-24.11 ***
(6.48)   (5.03)   
weekday3.85 ***1.92 ***
(1.08)   (0.50)   
windspeed5.31 ***43.43 ***
(1.14)   (8.95)   
N17379       17379       
R20.39           
All continuous predictors are mean-centered and scaled by 1 standard deviation. *** p < 0.001; ** p < 0.01; * p < 0.05.
Code
# Comparing log bike 2 regresion against the log bike 2 robust regression
export_summs(Regression3, correction_r3, scale = TRUE,
             model.names = c("Logged Regression(bike2)", "Robust Logged Regression(bike2)"))
Logged Regression(bike2)Robust Logged Regression(bike2)
(Intercept)3523.83 ***6727.61 ***
(44.80)   (293.27)   
`log(Normalized_feeling_temperature_C)`1018.69 ***       
(35.12)          
Normalized_Humidity-205.73 ***-1444.41 ***
(43.29)   (338.84)   
weathersit-323.63 ***-593.93 ***
(41.17)   (84.87)   
Year1992.40 ***1992.40 ***
(63.27)   (64.82)   
season401.12 ***361.11 ***
(34.63)   (32.22)   
holiday-593.47 ** -593.47 *  
(188.45)   (260.99)   
weekday133.05 ***66.37 ***
(31.60)   (17.72)   
windspeed-199.30 ***-2571.74 ***
(33.78)   (501.01)   
log(Normalized_feeling_temperature_C)       2544.66 ***
       (100.41)   
N731       731       
R20.81           
All continuous predictors are mean-centered and scaled by 1 standard deviation. *** p < 0.001; ** p < 0.01; * p < 0.05.

Here we see as temperature increases we can expect bike users to increase while as humidity increases we expect the opposite.

Code
Bike_users_plot <- bike2 %>%
  ggplot(aes(x=Date, y=Total_bike_users)) +
    geom_area(fill="#69b3a2", alpha=0.5) +
    geom_line(color="#69b3a2") +
    ylab("total bike user") +
    theme_ipsum()
# Making it interactive
Bike_users_plot <- ggplotly(Bike_users_plot)
Bike_users_plot
Code
plot_summs(Regression2, Regression3,
           model.names = c("Not logged Regression", "Logged Regression"))

Code
plot_summs(Regression1, mod1,
           model.names = c("Regression(bike)", "Regression first take(bike)"))

Code
# Value used to transform the data
coeff <- 10000
# A few constants
temperatureColor <- "#69b3a2"
priceColor <- rgb(0.2, 0.6, 0.9, 1)
ggplot(bike2, aes(x=Date)) +
  
  geom_line( aes(y=Normalized_feeling_temperature_C), size=2, color=temperatureColor) + 
  geom_line( aes(y=Total_bike_users / coeff), size=2, color=priceColor) +
  
  scale_y_continuous(
    
    # Features of the first axis
    name = "Temperature (Normalized)",
    
    # Add a second axis and specify its features
    sec.axis = sec_axis(~.*coeff, name="Bike users")
  ) + 
  
  theme_ipsum() +
  theme(
    axis.title.y = element_text(color = temperatureColor, size=13),
    axis.title.y.right = element_text(color = priceColor, size=13)
  ) +
  ggtitle("Temperature correlation with bikes users")

Code
# Value used to transform the data
coeff <- 10000
# A few constants
humidityColor <- "#69b3a2"
bikeColor <- rgb(0.2, 0.6, 0.9, 1)
ggplot(bike2, aes(x=Date)) +
  
  geom_line( aes(y=Normalized_Humidity), size=2, color=temperatureColor) + 
  geom_line( aes(y=Total_bike_users / coeff), size=2, color=priceColor) +
  
  scale_y_continuous(
    
    # Features of the first axis
    name = "Humidity (Normalized)",
    
    # Add a second axis and specify its features
    sec.axis = sec_axis(~.*coeff, name="Bike users")
  ) + 
  
  theme_ipsum() +
  theme(
    axis.title.y = element_text(color = temperatureColor, size=13),
    axis.title.y.right = element_text(color = priceColor, size=13)
  ) +
  ggtitle("Humidity correlation with bikes users")

Code
# plotting the data to visualize 
ggplot(data = bike2, aes(x=Normalized_feeling_temperature_C, y = Total_bike_users)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_fivethirtyeight(base_size = 10, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 15)) + ylab('Total Bike Users') + xlab('Normalized Feeling Temperature') +
  labs(title = "Relationship between Temerpature and Bike users", caption = "") 

Code
ggplot(data = bike2, aes(x=Normalized_Humidity, y = Total_bike_users)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_fivethirtyeight(base_size = 10, base_family = 'serif') +
  theme(axis.title = element_text(family = 'serif', size = 15)) + ylab('Total Bike Users') + xlab('Normalized Humidity') +
  labs(title = "Relationship between Humidity and Bike users", caption = "")

Code
ggpairs(bike2, columns = c(10, 11, 15), ggplot2::aes(colour='red'))

Conclusions

In conclusion, for hypothesis one we reject the null hypothesis with evidence of the extremely significant p-value of 2e-16. This give us evidence that we can accept the alternative and say yes temperature has an impact on bike sales. There were two different tests done here, we did the daily data compared to the hourly data. Both were significant and we controlled for 7 variables which are specified at the top. The reason we controlled these variables is that they could impact the outcome variable and thus we controlled them to make sure that they were not impacting the results. We went further and added a log to the temperature variable since we expected it to increase and then decrease. We do this to account for this change in the regression analysis and after doing so we notice the R^2 jumped for 79% to 81% so it improved how well these variables explain the outcome. After this I looked into heteroskedascity and ran it with more robust standard errors and we noticed a minor change to the variables and their standard deviations and p values. Only one variables dropped in one level of significance while the rest stayed the same this means only that one varible was not as important as it seemed originally. So in conclusion, normalized feeling temperature Celsius has an impact on bike users. Looking at our correlation graph we can see that it has a positive correlation with bike users at .631.

For the second questions we can also reject the null hypothesis as humidity is significant with a p-value of < 2e-16 in data set one and 0.000323 in data set two. This is further evidence that we can reject the null as in both scales is what significant. This data held the same control variables as temperature and thus we can yes humidity has an impact on bike users and looking at our correlation we see a negative correlation of -.101.

  • Fanaee-T, H. (n.d.). Bike Sharing Dataset . Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
  • Pazdan, Sylwia. (2020). The impact of weather on bicycle risk exposure. Archives of Transport. 56. 89-105. 10.5604/01.3001.0014.5629.