library(tidyverse)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
Challenge 5
Challenge Overview
Today’s challenge is to:
- read in a data set, and describe the data set using both words and any supporting information (e.g., tables, etc)
- tidy data (as needed, including sanity checks)
- mutate variables as needed (including sanity checks)
- create at least two univariate visualizations
- try to make them “publication” ready
- Explain why you choose the specific graph type
- Create at least one bivariate visualization
- try to make them “publication” ready
- Explain why you choose the specific graph type
R Graph Gallery is a good starting point for thinking about what information is conveyed in standard graph types, and includes example R code.
(be sure to only include the category tags for the data you use!)
Read in data
- AB_NYC_2019.csv ⭐⭐⭐
NYC_data <- read_csv("_data/AB_NYC_2019.csv")
NYC_data
# A tibble: 48,895 × 16
id name host_id host_…¹ neigh…² neigh…³ latit…⁴ longi…⁵ room_…⁶ price
<dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 2539 Clean & … 2787 John Brookl… Kensin… 40.6 -74.0 Privat… 149
2 2595 Skylit M… 2845 Jennif… Manhat… Midtown 40.8 -74.0 Entire… 225
3 3647 THE VILL… 4632 Elisab… Manhat… Harlem 40.8 -73.9 Privat… 150
4 3831 Cozy Ent… 4869 LisaRo… Brookl… Clinto… 40.7 -74.0 Entire… 89
5 5022 Entire A… 7192 Laura Manhat… East H… 40.8 -73.9 Entire… 80
6 5099 Large Co… 7322 Chris Manhat… Murray… 40.7 -74.0 Entire… 200
7 5121 BlissArt… 7356 Garon Brookl… Bedfor… 40.7 -74.0 Privat… 60
8 5178 Large Fu… 8967 Shunic… Manhat… Hell's… 40.8 -74.0 Privat… 79
9 5203 Cozy Cle… 7490 MaryEl… Manhat… Upper … 40.8 -74.0 Privat… 79
10 5238 Cute & C… 7549 Ben Manhat… Chinat… 40.7 -74.0 Entire… 150
# … with 48,885 more rows, 6 more variables: minimum_nights <dbl>,
# number_of_reviews <dbl>, last_review <date>, reviews_per_month <dbl>,
# calculated_host_listings_count <dbl>, availability_365 <dbl>, and
# abbreviated variable names ¹host_name, ²neighbourhood_group,
# ³neighbourhood, ⁴latitude, ⁵longitude, ⁶room_type
Briefly describe the data
The dataset covers Airbnb listings in New York City in 2019 and has 48,895 rows and 16 columns. Each row is one listing, with information such as its location, host details (a unique host ID and name), price, room type, and review activity. The data can be used to compare prices across similar listings and across hosts. Some hosts have multiple listings.
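The observation about hosts with multiple listings can be checked with a count per host. The following is a minimal sketch on a small made-up tibble; `toy_listings` and its rows are hypothetical stand-ins for `NYC_data`, and only the `host_id` column name is taken from the real data.

```r
library(dplyr)

# Hypothetical stand-in for NYC_data: five listings from three hosts
toy_listings <- tibble(
  id      = c(1, 2, 3, 4, 5),
  host_id = c(10, 10, 20, 30, 30)
)

# Count listings per host, then keep hosts with more than one listing
multi_hosts <- toy_listings %>%
  count(host_id, name = "n_listings") %>%
  filter(n_listings > 1)

multi_hosts
```

Running the same pipeline on `NYC_data` would list every host that appears more than once.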
Tidy Data (as needed)
The data needs only light tidying. The “reviews_per_month” column contains NA values, and so does the “last_review” column. Both occur when a listing has no reviews, so the NAs in “last_review” are acceptable because there is genuinely no date to record. For “reviews_per_month”, the NA values should be replaced with 0, since a listing with no reviews has zero reviews per month.
# Replace missing reviews_per_month with 0 and keep the result
NYC_data <- replace_na(NYC_data, list(reviews_per_month = 0))
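Before replacing the NA values, it may be worth confirming that they really do coincide with listings that have no reviews. This is a sketch under that assumption, using a hypothetical `toy_listings` tibble in place of `NYC_data`; the column names `number_of_reviews` and `reviews_per_month` come from the data, everything else is illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for NYC_data
toy_listings <- tibble(
  number_of_reviews = c(9, 0, 45, 0),
  reviews_per_month = c(0.21, NA, 4.64, NA)
)

# Sanity check: count rows where reviews_per_month is NA
# even though the listing has at least one review
na_mismatch <- toy_listings %>%
  filter(is.na(reviews_per_month), number_of_reviews != 0) %>%
  nrow()

# Replace NA with 0 and store the cleaned result under a new name
toy_clean <- toy_listings %>%
  replace_na(list(reviews_per_month = 0))

na_mismatch
toy_clean
```

If `na_mismatch` is 0 on the real data, replacing NA with 0 does not hide any genuinely missing values.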
Mutate
No further mutation is needed for this dataset: the variables used in the analysis are already in good shape.
Univariate Visualizations
- I am curious how the listings are distributed across NYC. We might expect Manhattan to have the most listings because it has many tourist attractions, so the first plot looks at the borough of each listing.
# Stacked bar chart: number of listings per borough, split by room type
ggplot(NYC_data, aes(neighbourhood_group, fill = room_type)) + geom_bar() +
theme_bw() +
labs(title = "Airbnb Listings by Location", y = "Number of Listings", x = "Borough")
As anticipated, Manhattan has the largest number of Airbnb listings, and Brooklyn also has many, whereas Queens, the Bronx, and Staten Island have very few.
The graph also breaks down the listings in each borough by room type: entire home/apt, private room, or shared room. Entire homes/apartments are common overall, and they make up a higher proportion of the listings in Manhattan than in the other boroughs.
I chose a bar graph because it is a simple way to show counts, and the fill color adds the room type breakdown.
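The proportions read off the stacked bars can also be computed directly. The sketch below uses a hypothetical `toy_listings` tibble rather than `NYC_data`; the column names match the data, but the rows are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for NYC_data
toy_listings <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                          "Brooklyn", "Brooklyn", "Queens"),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room",
                "Private room", "Entire home/apt", "Private room")
)

# Count listings per borough and room type,
# then convert counts to shares within each borough
room_share <- toy_listings %>%
  count(neighbourhood_group, room_type) %>%
  group_by(neighbourhood_group) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

room_share
```

Within each borough the `share` values sum to 1, so the entire home/apt share can be compared across boroughs regardless of how many listings each one has.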
- The second variable I am interested in is reviews per month, which could be a rough proxy for the average length of stay at a listing.
# Histogram of reviews per month; the x-axis limit is set with
# coord_cartesian() because xlim is not an argument of ggplot()
ggplot(NYC_data, aes(reviews_per_month)) +
geom_histogram(binwidth = .25) +
coord_cartesian(xlim = c(0, 10)) +
labs(title = "Reviews Per Month")
I chose a histogram because it shows the distribution of reviews per month. Most listings receive few reviews per month, while a small number receive 7 or 8. Many NYC listings appear to be long-term rentals, which is consistent with this plot, since such listings change guests less often and so collect fewer reviews.
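The link between long-term listings and low review rates could be examined with the `minimum_nights` column. The following is only a sketch: `toy_listings` is a hypothetical stand-in for `NYC_data`, and the 30-night cutoff for "long term" is an assumption made here, not something defined in the data.

```r
library(dplyr)

# Hypothetical stand-in for NYC_data
toy_listings <- tibble(
  minimum_nights    = c(1, 2, 30, 30, 3, 90),
  reviews_per_month = c(4.5, 2.0, 0.1, 0, 1.2, 0)
)

# Flag listings as long term (assumed cutoff: 30 nights or more),
# then compare listing counts and median reviews per month
stay_summary <- toy_listings %>%
  mutate(long_term = minimum_nights >= 30) %>%
  group_by(long_term) %>%
  summarise(
    n_listings = n(),
    median_reviews_per_month = median(reviews_per_month)
  )

stay_summary
```

On the real data, a large `n_listings` in the `long_term` group together with a lower median would support the interpretation above.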
Bivariate Visualization(s)
One of the most interesting metrics is price across the boroughs of NYC. This matters for anyone interested in investing in an Airbnb, because it shows which areas command a high price per night.
# Box plot of nightly price by borough
# (the stray fill argument outside aes() had no effect and is dropped)
NYC_data %>%
ggplot(aes(neighbourhood_group, price)) +
geom_boxplot() +
labs(title = "Listing Price by Borough", x = "Borough", y = "Price per Night") +
theme_bw()
The graph above shows how nightly rates vary across the boroughs. I chose a box plot because it shows how prices are distributed within each borough. As predicted, the majority of listings in the Bronx and Staten Island are affordable and fall within a narrow price range. Despite a few expensive exceptions, Queens appears to be a mostly inexpensive borough. Brooklyn and Manhattan, which are both more costly in general, are home to most of the most expensive Airbnb listings.
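The comparison drawn from the box plot can be backed up with a numeric summary per borough. This is a minimal sketch on a hypothetical `toy_listings` tibble; the prices are invented, and only the column names `neighbourhood_group` and `price` come from the data.

```r
library(dplyr)

# Hypothetical stand-in for NYC_data
toy_listings <- tibble(
  neighbourhood_group = c("Manhattan", "Manhattan", "Brooklyn",
                          "Brooklyn", "Bronx", "Bronx"),
  price = c(150, 250, 90, 110, 60, 80)
)

# Median, spread (IQR), and listing count per borough,
# ordered from most to least expensive by median price
price_summary <- toy_listings %>%
  group_by(neighbourhood_group) %>%
  summarise(
    median_price = median(price),
    iqr_price    = IQR(price),
    n_listings   = n()
  ) %>%
  arrange(desc(median_price))

price_summary
```

The median is used rather than the mean because the box plot shows expensive outliers, which would pull a mean upward.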