HW2

DACSS 601 Data Science Fundamentals - Homework 2

Apoorva Hungund
2022-02-09

R Markdown HW2

#For this assignment, I’m exploring the New York City Airbnb dataset (AB_NYC_2019.csv), specifically looking at Airbnb prices in Manhattan and Brooklyn.

##1) Read in a dataset & view it.
bookings<-read.csv2(file = "AB_NYC_2019.csv", sep = ",")
dim(bookings)
[1] 48895    16
head(bookings)
    id                                             name host_id
1 2539               Clean & quiet apt home by the park    2787
2 2595                            Skylit Midtown Castle    2845
3 3647              THE VILLAGE OF HARLEM....NEW YORK !    4632
4 3831                  Cozy Entire Floor of Brownstone    4869
5 5022 Entire Apt: Spacious Studio/Loft by central park    7192
6 5099        Large Cozy 1 BR Apartment In Midtown East    7322
    host_name neighbourhood_group neighbourhood latitude longitude
1        John            Brooklyn    Kensington 40.64749 -73.97237
2    Jennifer           Manhattan       Midtown 40.75362 -73.98377
3   Elisabeth           Manhattan        Harlem 40.80902  -73.9419
4 LisaRoxanne            Brooklyn  Clinton Hill 40.68514 -73.95976
5       Laura           Manhattan   East Harlem 40.79851 -73.94399
6       Chris           Manhattan   Murray Hill 40.74767   -73.975
        room_type price minimum_nights number_of_reviews last_review
1    Private room   149              1                 9  2018-10-19
2 Entire home/apt   225              1                45  2019-05-21
3    Private room   150              3                 0            
4 Entire home/apt    89              1               270  2019-07-05
5 Entire home/apt    80             10                 9  2018-11-19
6 Entire home/apt   200              3                74  2019-06-22
  reviews_per_month calculated_host_listings_count availability_365
1              0.21                              6              365
2              0.38                              2              355
3                                                1              365
4              4.64                              1              194
5              0.10                              1                0
6              0.59                              1              129
##2) Explain variables in dataset.
lapply(bookings,class)
$id
[1] "integer"

$name
[1] "character"

$host_id
[1] "integer"

$host_name
[1] "character"

$neighbourhood_group
[1] "character"

$neighbourhood
[1] "character"

$latitude
[1] "character"

$longitude
[1] "character"

$room_type
[1] "character"

$price
[1] "integer"

$minimum_nights
[1] "integer"

$number_of_reviews
[1] "integer"

$last_review
[1] "character"

$reviews_per_month
[1] "character"

$calculated_host_listings_count
[1] "integer"

$availability_365
[1] "integer"
summary(bookings)
       id               name              host_id         
 Min.   :    2539   Length:48895       Min.   :     2438  
 1st Qu.: 9471945   Class :character   1st Qu.:  7822033  
 Median :19677284   Mode  :character   Median : 30793816  
 Mean   :19017143                      Mean   : 67620011  
 3rd Qu.:29152178                      3rd Qu.:107434423  
 Max.   :36487245                      Max.   :274321313  
  host_name         neighbourhood_group neighbourhood     
 Length:48895       Length:48895        Length:48895      
 Class :character   Class :character    Class :character  
 Mode  :character   Mode  :character    Mode  :character  
                                                          
                                                          
                                                          
   latitude          longitude          room_type        
 Length:48895       Length:48895       Length:48895      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
     price         minimum_nights    number_of_reviews
 Min.   :    0.0   Min.   :   1.00   Min.   :  0.00   
 1st Qu.:   69.0   1st Qu.:   1.00   1st Qu.:  1.00   
 Median :  106.0   Median :   3.00   Median :  5.00   
 Mean   :  152.7   Mean   :   7.03   Mean   : 23.27   
 3rd Qu.:  175.0   3rd Qu.:   5.00   3rd Qu.: 24.00   
 Max.   :10000.0   Max.   :1250.00   Max.   :629.00   
 last_review        reviews_per_month  calculated_host_listings_count
 Length:48895       Length:48895       Min.   :  1.000               
 Class :character   Class :character   1st Qu.:  1.000               
 Mode  :character   Mode  :character   Median :  1.000               
                                       Mean   :  7.144               
                                       3rd Qu.:  2.000               
                                       Max.   :327.000               
 availability_365
 Min.   :  0.0   
 1st Qu.:  0.0   
 Median : 45.0   
 Mean   :112.8   
 3rd Qu.:227.0   
 Max.   :365.0   
This dataset describes Airbnb listings and their prices in the different boroughs of NYC. Along with 
descriptive variables such as name, host_id, host_name, neighbourhood_group, neighbourhood, latitude, 
longitude, room_type, minimum_nights, and the review-related variables (number_of_reviews, last_review, 
reviews_per_month), there are also variables that may affect how a listing is rated.

Variables are either character variables (such as name, host_name, neighbourhood) or 
integer variables (such as price, number_of_reviews). Note that latitude, longitude, and reviews_per_month 
hold decimal numbers but were read in as character, because read.csv2() expects a comma as the decimal mark. 
There are 48895 entries and 16 columns.
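The character class reported for latitude, longitude, and reviews_per_month is most likely a side effect of read.csv2(), which defaults to dec = ",". The following is a minimal sketch of that behaviour, using a small inline sample (values copied from the head() output) in place of the real file; the object names csv_text, with_csv2, and with_csv are illustrative only.

```r
# Tiny inline sample standing in for AB_NYC_2019.csv (hypothetical subset of columns)
csv_text <- "id,latitude,longitude,price,reviews_per_month
2539,40.64749,-73.97237,149,0.21
2595,40.75362,-73.98377,225,0.38
3647,40.80902,-73.9419,150,"

# read.csv2() assumes dec = ",", so numbers written with "." stay as character
with_csv2 <- read.csv2(text = csv_text, sep = ",")

# read.csv() assumes dec = ".", so the same columns are parsed as numeric
with_csv <- read.csv(text = csv_text)

# Compare the column classes produced by the two readers
sapply(with_csv2, class)
sapply(with_csv, class)
```

If the numeric columns are needed later (for example, for mapping or averaging), reading the file with read.csv() or converting the columns with as.numeric() would be one way to get there.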
colSums(is.na(bookings))
                            id                           name 
                             0                              0 
                       host_id                      host_name 
                             0                              0 
           neighbourhood_group                  neighbourhood 
                             0                              0 
                      latitude                      longitude 
                             0                              0 
                     room_type                          price 
                             0                              0 
                minimum_nights              number_of_reviews 
                             0                              0 
                   last_review              reviews_per_month 
                             0                              0 
calculated_host_listings_count               availability_365 
                             0                              0 
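##From this, we can see that there are no NAs. Note that blank entries (for example, last_review and reviews_per_month in row 3) are stored as empty strings, which is.na() does not count.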
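Because blank fields in character columns are kept as empty strings, a zero from is.na() does not rule out missing values. Here is a small sketch of how the blanks could be counted and recoded, using a toy data frame that mirrors rows 1 to 3 of head(bookings); the names toy and blank_counts are made up for this example.

```r
# Toy data frame mirroring three rows of the listing data; blanks are "" rather than NA
toy <- data.frame(
  last_review       = c("2018-10-19", "2019-05-21", ""),
  reviews_per_month = c("0.21", "0.38", ""),
  price             = c(149L, 225L, 150L)
)

# is.na() finds nothing, because "" is a valid (empty) string
colSums(is.na(toy))

# Count empty strings per column instead
blank_counts <- sapply(toy, function(col) sum(!is.na(col) & col == ""))
blank_counts

# Recode the blanks as NA, then convert the rate column to numeric
toy$last_review[toy$last_review == ""] <- NA
toy$reviews_per_month[toy$reviews_per_month == ""] <- NA
toy$reviews_per_month <- as.numeric(toy$reviews_per_month)

# Now the missing values are visible to is.na()
colSums(is.na(toy))
```

In this data the blank review fields belong to a listing with number_of_reviews equal to 0, so treating them as missing seems reasonable, though that is an interpretation rather than something the dataset states.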
##3)
##Select columns
data_bookings <- dplyr::select(bookings, name, neighbourhood_group, neighbourhood, room_type, price, number_of_reviews)
head(data_bookings)
                                              name
1               Clean & quiet apt home by the park
2                            Skylit Midtown Castle
3              THE VILLAGE OF HARLEM....NEW YORK !
4                  Cozy Entire Floor of Brownstone
5 Entire Apt: Spacious Studio/Loft by central park
6        Large Cozy 1 BR Apartment In Midtown East
  neighbourhood_group neighbourhood       room_type price
1            Brooklyn    Kensington    Private room   149
2           Manhattan       Midtown Entire home/apt   225
3           Manhattan        Harlem    Private room   150
4            Brooklyn  Clinton Hill Entire home/apt    89
5           Manhattan   East Harlem Entire home/apt    80
6           Manhattan   Murray Hill Entire home/apt   200
  number_of_reviews
1                 9
2                45
3                 0
4               270
5                 9
6                74
##Filter data based on Manhattan and Brooklyn & arrange by highest price

library(dplyr)  # provides %>%, arrange(), and desc() used below

# Brooklyn listings, most expensive first
bookings_brooklyn <- data_bookings %>%
  dplyr::filter(neighbourhood_group == "Brooklyn") %>%
  arrange(desc(price))
rmarkdown::paged_table(head(bookings_brooklyn))

# Manhattan listings, most expensive first
bookings_manhattan <- data_bookings %>%
  dplyr::filter(neighbourhood_group == "Manhattan") %>%
  arrange(desc(price))
rmarkdown::paged_table(head(bookings_manhattan))
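For comparison, the same filter-then-sort step can be sketched in base R without dplyr. The toy_bookings data frame below reuses the six rows shown by head(data_bookings), and filter_and_sort() is a hypothetical helper written for this illustration, not part of the original analysis.

```r
# Six rows copied from head(data_bookings), limited to the columns used here
toy_bookings <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan",
                          "Brooklyn", "Manhattan", "Manhattan"),
  neighbourhood = c("Kensington", "Midtown", "Harlem",
                    "Clinton Hill", "East Harlem", "Murray Hill"),
  price = c(149L, 225L, 150L, 89L, 80L, 200L)
)

# Keep one borough, then order its rows from highest to lowest price
filter_and_sort <- function(df, borough) {
  kept <- df[df$neighbourhood_group == borough, , drop = FALSE]
  kept[order(kept$price, decreasing = TRUE), , drop = FALSE]
}

toy_brooklyn  <- filter_and_sort(toy_bookings, "Brooklyn")
toy_manhattan <- filter_and_sort(toy_bookings, "Manhattan")

toy_brooklyn
toy_manhattan
```

Wrapping the step in a function avoids repeating the same pipeline once per borough, which may help if more boroughs are compared later.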

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Hungund (2022, Feb. 13). Data Analytics and Computational Social Science: HW2. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomahungundaphhw2/

BibTeX citation

@misc{hungund2022hw2,
  author = {Hungund, Apoorva},
  title = {Data Analytics and Computational Social Science: HW2},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomahungundaphhw2/},
  year = {2022}
}