Rows: 48895 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): name, host_name, neighbourhood_group, neighbourhood, room_type
dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
date (1): last_review
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Listings
##Get the summary of listings
summary(Listings)
       id               name             borough          neighborhood
 Min.   :    2539   Length:48895       Length:48895       Length:48895
 1st Qu.: 9471945   Class :character   Class :character   Class :character
 Median :19677284   Mode  :character   Mode  :character   Mode  :character
 Mean   :19017143
 3rd Qu.:29152178
 Max.   :36487245
    latitude       longitude       room_type             price
 Min.   :40.50   Min.   :-74.24   Length:48895       Min.   :    0.0
 1st Qu.:40.69   1st Qu.:-73.98   Class :character   1st Qu.:   69.0
 Median :40.72   Median :-73.96   Mode  :character   Median :  106.0
 Mean   :40.73   Mean   :-73.95                      Mean   :  152.7
 3rd Qu.:40.76   3rd Qu.:-73.94                      3rd Qu.:  175.0
 Max.   :40.91   Max.   :-73.71                      Max.   :10000.0
 minimum_nights    number_of_reviews   reviews_per_month
 Min.   :   1.00   Min.   :  0.00      Min.   : 0.010
 1st Qu.:   1.00   1st Qu.:  1.00      1st Qu.: 0.190
 Median :   3.00   Median :  5.00      Median : 0.720
 Mean   :   7.03   Mean   : 23.27      Mean   : 1.373
 3rd Qu.:   5.00   3rd Qu.: 24.00      3rd Qu.: 2.020
 Max.   :1250.00   Max.   :629.00      Max.   :58.500
                                       NA's   :10052
 calculated_host_listings_count   availability_365
 Min.   :  1.000                  Min.   :  0.0
 1st Qu.:  1.000                  1st Qu.:  0.0
 Median :  1.000                  Median : 45.0
 Mean   :  7.144                  Mean   :112.8
 3rd Qu.:  2.000                  3rd Qu.:227.0
 Max.   :327.000                  Max.   :365.0
The dataset contains listing information for Airbnb properties in New York. It has 48895 observations (listings) and 16 variables. The variables contain the following information:
- id: unique value assigned to each listing
- name: name of the property or a brief description of the listing
- host_id and host_name: information about the host
- neighbourhood_group: aggregated location, in which the neighbourhoods are combined by borough
- neighbourhood: neighbourhood in which the listing is located
- latitude and longitude: exact location of the listing. This is individual-level data, but the coordinates may not be unique, since several listings can share the same address (such as apartment rentals or multiple rooms within the same property)
- room_type: whether the rental is for the entire place or a private room
- price: price per night for the listing
- minimum_nights: minimum number of nights for which the property has to be rented
- number_of_reviews: count of reviews given by previous renters
- last_review: date of the last review given to the property
- reviews_per_month: number of reviews per month
- calculated_host_listings_count: number of listings available for that particular host
- availability_365: number of days per year on which the listing is available
I can also determine from the summary that the average price per night is $152.70 and the median is $106. However, there appear to be some extreme values in the price column, such as $0 and $10000, that we may have to remove later on for better visualization.
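One way the extreme prices mentioned above could be set aside before plotting is sketched here. The data frame `listings_toy`, its values, and the cutoff `price_cap` are all made up for illustration; the cutoff actually used later in the analysis may differ.

```r
# Toy stand-in for the Listings data frame (values are invented for illustration)
listings_toy <- data.frame(
  id    = 1:8,
  price = c(0, 69, 106, 150, 175, 220, 999, 10000)
)

# Hypothetical rule: treat free listings and anything above an arbitrary cap as extreme
price_cap  <- 1000
is_extreme <- listings_toy$price <= 0 | listings_toy$price > price_cap

# Count the rows that would be dropped, then keep the rest for plotting
n_extreme     <- sum(is_extreme)
listings_plot <- listings_toy[!is_extreme, ]

summary(listings_plot$price)
```

Counting the flagged rows first makes it easy to report how much data a given cutoff discards before committing to it.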
My next step is cleaning up the dataset. I first removed the variables that I consider unnecessary: host_id, host_name and last_review. I also renamed the variables neighbourhood_group to borough and neighbourhood to neighborhood, since these are the more commonly used terms in the US. I then checked whether the dataset has any missing data.
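The column removal and renaming described above could be done in base R roughly as follows. This is a minimal sketch on a made-up data frame (`listings_raw` and its values are hypothetical), not necessarily the code used to produce the cleaned Listings object.

```r
# Toy stand-in with the original column names (values are invented)
listings_raw <- data.frame(
  id                  = c(1, 2),
  host_id             = c(10, 20),
  host_name           = c("A", "B"),
  neighbourhood_group = c("Brooklyn", "Manhattan"),
  neighbourhood       = c("Kensington", "Midtown"),
  last_review         = as.Date(c("2019-01-01", "2019-05-21")),
  price               = c(149, 225)
)

# Drop the columns judged unnecessary
drop_cols      <- c("host_id", "host_name", "last_review")
listings_clean <- listings_raw[, !(names(listings_raw) %in% drop_cols)]

# Rename to the US terms used in the rest of the analysis
names(listings_clean)[names(listings_clean) == "neighbourhood_group"] <- "borough"
names(listings_clean)[names(listings_clean) == "neighbourhood"]       <- "neighborhood"

names(listings_clean)
```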
##Determine if there is any missing data
colSums(is.na(Listings))
                            id                           name
                             0                             16
                       borough                   neighborhood
                             0                              0
                      latitude                      longitude
                             0                              0
                     room_type                          price
                             0                              0
                minimum_nights              number_of_reviews
                             0                              0
             reviews_per_month calculated_host_listings_count
                         10052                              0
              availability_365
                             0
There are 10052 missing values in the reviews_per_month column and 16 in the name column. I assume the reviews_per_month values are missing because there are no reviews yet for those particular listings.
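The assumption above can be checked directly by comparing the missing reviews_per_month values against number_of_reviews. The sketch below uses a small invented data frame (`listings_toy` is hypothetical); on the real data the same two expressions would be applied to Listings.

```r
# Toy stand-in (values are invented for illustration)
listings_toy <- data.frame(
  number_of_reviews = c(0, 5, 0, 12),
  reviews_per_month = c(NA, 0.72, NA, 2.02)
)

# Logical vector marking rows where reviews_per_month is missing
missing_rpm <- is.na(listings_toy$reviews_per_month)

# Cross-tabulate missingness against having zero reviews
table(missing = missing_rpm,
      zero_reviews = listings_toy$number_of_reviews == 0)

# TRUE only if every missing value belongs to a listing with no reviews
assumption_holds <- all(listings_toy$number_of_reviews[missing_rpm] == 0)
assumption_holds
```

If the check holds on the full dataset, replacing the missing values with 0 would be a defensible choice; if it does not, the rows that break the pattern deserve a closer look.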
Some potential research questions that the listings dataset can help answer:
- What are the most popular boroughs for rentals?
- Where are the most and least expensive listings in NY?
- Does the neighborhood and/or borough drive the prices of the listings?
- Is there a correlation between the number of reviews and the price of a listing?
- Can the name of the property have any influence on the price of the listing?
- What type of accommodation is more popular in each borough?
- Can I create a visualization based on the coordinates?
- Is there any correlation between the minimum number of nights and the location of the listing?
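As a starting point for two of these questions (price by borough, and the relationship between reviews and price), a sketch along the following lines may help. The data frame `listings_toy` and its values are invented, and the choice of the median and of Spearman's rank correlation is an assumption made here because the price column is heavily skewed.

```r
# Toy stand-in (values are invented for illustration)
listings_toy <- data.frame(
  borough           = c("Brooklyn", "Brooklyn", "Manhattan", "Manhattan", "Queens", "Queens"),
  price             = c(90, 110, 200, 250, 70, 80),
  number_of_reviews = c(30, 12, 5, 2, 40, 25)
)

# Median price per borough, most expensive first
median_price <- aggregate(price ~ borough, data = listings_toy, FUN = median)
median_price <- median_price[order(-median_price$price), ]
median_price

# Rank-based correlation between price and number of reviews
rho <- cor(listings_toy$price, listings_toy$number_of_reviews, method = "spearman")
rho
```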