Homework 2 Marcela Robinson

Author

Marcela Robinson

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

##Read the file “_data/AB_NYC_2019.csv”

Listings <- read_csv("_data/AB_NYC_2019.csv") %>%
  rename(borough = neighbourhood_group, neighborhood = neighbourhood) %>%
  select(-c(host_id, host_name, last_review))
Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
date  (1): last_review

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Listings

##Get the summary of listings

summary(Listings)
       id               name             borough          neighborhood      
 Min.   :    2539   Length:48895       Length:48895       Length:48895      
 1st Qu.: 9471945   Class :character   Class :character   Class :character  
 Median :19677284   Mode  :character   Mode  :character   Mode  :character  
 Mean   :19017143                                                           
 3rd Qu.:29152178                                                           
 Max.   :36487245                                                           
                                                                            
    latitude       longitude       room_type             price        
 Min.   :40.50   Min.   :-74.24   Length:48895       Min.   :    0.0  
 1st Qu.:40.69   1st Qu.:-73.98   Class :character   1st Qu.:   69.0  
 Median :40.72   Median :-73.96   Mode  :character   Median :  106.0  
 Mean   :40.73   Mean   :-73.95                      Mean   :  152.7  
 3rd Qu.:40.76   3rd Qu.:-73.94                      3rd Qu.:  175.0  
 Max.   :40.91   Max.   :-73.71                      Max.   :10000.0  
                                                                      
 minimum_nights    number_of_reviews reviews_per_month
 Min.   :   1.00   Min.   :  0.00    Min.   : 0.010   
 1st Qu.:   1.00   1st Qu.:  1.00    1st Qu.: 0.190   
 Median :   3.00   Median :  5.00    Median : 0.720   
 Mean   :   7.03   Mean   : 23.27    Mean   : 1.373   
 3rd Qu.:   5.00   3rd Qu.: 24.00    3rd Qu.: 2.020   
 Max.   :1250.00   Max.   :629.00    Max.   :58.500   
                                     NA's   :10052    
 calculated_host_listings_count availability_365
 Min.   :  1.000                Min.   :  0.0   
 1st Qu.:  1.000                1st Qu.:  0.0   
 Median :  1.000                Median : 45.0   
 Mean   :  7.144                Mean   :112.8   
 3rd Qu.:  2.000                3rd Qu.:227.0   
 Max.   :327.000                Max.   :365.0   
                                                

The dataset contains listing information for Airbnb properties in New York. It has 48895 observations (listings) and 16 variables. The variables contain the following information:

- id: unique value assigned to each listing
- name: name of the property or brief description of the listing
- host_id and host_name: information about the host
- neighbourhood_group: aggregated data in which neighbourhoods are combined by borough
- neighbourhood: the neighbourhood where the listing is located
- latitude and longitude: exact location of the listing. This is individual-level data, but it may not be unique, since several listings can share an address (such as apartment rentals or multiple rooms within the same property)
- room_type: whether the rental is for the entire place or a private room
- price: price per night for the listing
- minimum_nights: the minimum number of nights the property has to be rented
- number_of_reviews: count of reviews given by previous renters
- last_review: date of the last review given to the property
- reviews_per_month: number of reviews per month
- calculated_host_listings_count: number of listings for that particular host
- availability_365: how many days of the year the listing is available

I can also determine from the summary that the average price per night is $152.70 and the median is $106. However, there appear to be some extreme values in the price column, such as $0 and $10000, that we may have to remove later on for better visualization.
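As a rough sketch of how those extreme prices could be set aside later, the example below filters a small made-up tibble rather than the real data. The names toy_listings, price_cap, and trimmed, as well as the $1000 cutoff, are assumptions for illustration only; the right cutoff for the actual Listings data would need to be chosen after looking at the price distribution.

```r
library(dplyr)

# Toy data with made-up prices (NOT taken from AB_NYC_2019.csv)
toy_listings <- tibble(
  id    = 1:6,
  price = c(0, 69, 106, 175, 250, 10000)
)

# Hypothetical cutoff: an assumption, not a value from the homework
price_cap <- 1000

# Keep listings with a positive price at or below the cap
trimmed <- toy_listings %>%
  filter(price > 0, price <= price_cap)

trimmed
```

The same filter() call could be applied to Listings once a cutoff is decided; keeping the result in a separate object leaves the original data untouched.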

My next step was cleaning up the dataset. I first removed the variables that I consider unnecessary: host_id, host_name, and last_review. I also renamed neighbourhood_group to borough and neighbourhood to neighborhood, since these are the more commonly used terms in the US. The next step is to determine whether there is any missing data in this dataset.

##Determine if there is any missing data

colSums(is.na(Listings))
                            id                           name 
                             0                             16 
                       borough                   neighborhood 
                             0                              0 
                      latitude                      longitude 
                             0                              0 
                     room_type                          price 
                             0                              0 
                minimum_nights              number_of_reviews 
                             0                              0 
             reviews_per_month calculated_host_listings_count 
                         10052                              0 
              availability_365 
                             0 

There are 10052 missing values in the reviews_per_month column and 16 in the name column. I assume reviews_per_month is missing because there are no reviews yet for those particular listings.
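The assumption about reviews_per_month could be checked by cross-tabulating missing values against listings with zero reviews. The sketch below does this on a small made-up tibble; toy_listings and na_check are hypothetical names, and the values are invented for illustration, not taken from the dataset.

```r
library(dplyr)

# Toy data with made-up values (NOT taken from AB_NYC_2019.csv)
toy_listings <- tibble(
  id                = 1:5,
  number_of_reviews = c(0, 12, 0, 3, 45),
  reviews_per_month = c(NA, 0.72, NA, 0.19, 2.02)
)

# Count rows by "has no reviews" and "reviews_per_month is missing"
na_check <- toy_listings %>%
  count(
    no_reviews   = number_of_reviews == 0,
    rate_missing = is.na(reviews_per_month)
  )

na_check
```

If the assumption holds for the real Listings data, the two logical columns would agree in every row of the result, that is, no row would pair TRUE with FALSE.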

Some potential research questions that this dataset can help answer:

- What are the most popular boroughs for rentals?
- Where are the most and least expensive listings in NY?
- Does the neighborhood and/or borough drive the prices of the listings?
- Is there a correlation between the number of reviews and the price of a listing?
- Can the name of the property have any influence on the price of the listing?
- What type of accommodation is more popular in each borough?
- Can I create a visualization based on the coordinates?
- Is there any correlation between the minimum number of nights and the location of the listing?
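As a starting point for the questions about popularity and price by borough, a grouped summary is one possible approach. The sketch below uses a small made-up tibble; toy_listings and price_by_borough are hypothetical names, and the prices are invented, so the ordering shown says nothing about the real data.

```r
library(dplyr)

# Toy data with made-up values (NOT taken from AB_NYC_2019.csv)
toy_listings <- tibble(
  borough = c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"),
  price   = c(200, 150, 100, 120, 80)
)

# Number of listings and median price per borough, most expensive first
price_by_borough <- toy_listings %>%
  group_by(borough) %>%
  summarise(listings = n(), median_price = median(price)) %>%
  arrange(desc(median_price))

price_by_borough
```

The median is used here instead of the mean because the summary above suggests the price column has extreme values; that choice is a judgment call rather than a requirement.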