Challenge 5

Introduction to Visualization

Author

Nayan Jani

Published

August 22, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

# Name the columns on read; the two id columns are labelled "del" so they can be dropped
AB <- read_csv(
  "_data/AB_NYC_2019.csv",
  col_names = c(
    "del", "name", "del", "host_name", "neighbourhood_group", "neighbourhood",
    "latitude", "longitude", "room_type", "price", "minimum_nights",
    "number_of_reviews", "last_review", "reviews_per_month",
    "calculated_host_listings_count", "availability_365"
  ),
  skip = 1
) %>%
  select(!starts_with("del")) %>%  # remove the id columns
  drop_na(reviews_per_month)       # remove listings with a missing review rate

AB

print(dfSummary(AB, varnumbers = FALSE,
                        plain.ascii  = FALSE, 
                        style        = "grid", 
                        graph.magnif = 0.70, 
                        valid.col    = FALSE),
      method = 'render',
      table.classes = 'table-condensed')

Data Frame Summary

AB

Dimensions: 38843 x 14
Duplicates: 0

Variable [type] | Missing
    Stats / Values | Freqs (% of Valid)

name [character] | 6 (0.0%)
    1. Home away from home | 12 (0.0%)
    2. Loft Suite @ The Box Hous | 11 (0.0%)
    3. Private Room | 10 (0.0%)
    4. Brooklyn Apartment | 9 (0.0%)
    5. Cozy Brooklyn Apartment | 8 (0.0%)
    6. New york Multi-unit build | 8 (0.0%)
    7. Private room | 8 (0.0%)
    8. Beautiful Brooklyn Browns | 7 (0.0%)
    9. Harlem Gem | 7 (0.0%)
    10. Hillside Hotel | 7 (0.0%)
    [ 38253 others ] | 38750 (99.8%)

host_name [character] | 16 (0.0%)
    1. Michael | 335 (0.9%)
    2. David | 309 (0.8%)
    3. John | 250 (0.6%)
    4. Alex | 229 (0.6%)
    5. Sonder (NYC) | 207 (0.5%)
    6. Sarah | 179 (0.5%)
    7. Maria | 174 (0.4%)
    8. Daniel | 170 (0.4%)
    9. Jessica | 170 (0.4%)
    10. Anna | 160 (0.4%)
    [ 9876 others ] | 36644 (94.4%)

neighbourhood_group [character] | 0 (0.0%)
    1. Bronx | 876 (2.3%)
    2. Brooklyn | 16447 (42.3%)
    3. Manhattan | 16632 (42.8%)
    4. Queens | 4574 (11.8%)
    5. Staten Island | 314 (0.8%)

neighbourhood [character] | 0 (0.0%)
    1. Williamsburg | 3163 (8.1%)
    2. Bedford-Stuyvesant | 3141 (8.1%)
    3. Harlem | 2206 (5.7%)
    4. Bushwick | 1944 (5.0%)
    5. Hell's Kitchen | 1532 (3.9%)
    6. East Village | 1490 (3.8%)
    7. Upper West Side | 1482 (3.8%)
    8. Upper East Side | 1405 (3.6%)
    9. Crown Heights | 1265 (3.3%)
    10. Midtown | 986 (2.5%)
    [ 208 others ] | 20229 (52.1%)

latitude [numeric] | 0 (0.0%)
    Mean (sd) : 40.7 (0.1)
    min ≤ med ≤ max: 40.5 ≤ 40.7 ≤ 40.9
    IQR (CV) : 0.1 (0)
    17443 distinct values

longitude [numeric] | 0 (0.0%)
    Mean (sd) : -74 (0)
    min ≤ med ≤ max: -74.2 ≤ -74 ≤ -73.7
    IQR (CV) : 0 (0)
    13641 distinct values

room_type [character] | 0 (0.0%)
    1. Entire home/apt | 20332 (52.3%)
    2. Private room | 17665 (45.5%)
    3. Shared room | 846 (2.2%)

price [numeric] | 0 (0.0%)
    Mean (sd) : 142.3 (196.9)
    min ≤ med ≤ max: 0 ≤ 101 ≤ 10000
    IQR (CV) : 101 (1.4)
    581 distinct values

minimum_nights [numeric] | 0 (0.0%)
    Mean (sd) : 5.9 (17.4)
    min ≤ med ≤ max: 1 ≤ 2 ≤ 1250
    IQR (CV) : 3 (3)
    89 distinct values

number_of_reviews [numeric] | 0 (0.0%)
    Mean (sd) : 29.3 (48.2)
    min ≤ med ≤ max: 1 ≤ 9 ≤ 629
    IQR (CV) : 30 (1.6)
    393 distinct values

last_review [Date] | 0 (0.0%)
    min : 2011-03-28
    med : 2019-05-19
    max : 2019-07-08
    range : 8y 3m 10d
    1764 distinct values

reviews_per_month [numeric] | 0 (0.0%)
    Mean (sd) : 1.4 (1.7)
    min ≤ med ≤ max: 0 ≤ 0.7 ≤ 58.5
    IQR (CV) : 1.8 (1.2)
    937 distinct values

calculated_host_listings_count [numeric] | 0 (0.0%)
    Mean (sd) : 5.2 (26.3)
    min ≤ med ≤ max: 1 ≤ 1 ≤ 327
    IQR (CV) : 1 (5.1)
    47 distinct values

availability_365 [numeric] | 0 (0.0%)
    Mean (sd) : 114.9 (129.5)
    min ≤ med ≤ max: 0 ≤ 55 ≤ 365
    IQR (CV) : 229 (1.1)
    366 distinct values

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-28

Describe the data and Tidy

After reading in the data, I can tell I am dealing with Airbnb listings in NYC from 2019. To tidy the data, I removed the two id columns and dropped every row where reviews_per_month is missing, which removes the listings with no reviews (the minimum of number_of_reviews in the summary table is 1). I did this because number_of_reviews will be a focal point of the analysis. I want to compare the number of reviews with other variables to see whether reviews are related to them, either as the independent or as the dependent variable. The summary table also suggests looking at the number of reviews by neighbourhood_group and neighbourhood.
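The tidy step relies on a missing reviews_per_month always going together with zero reviews. A minimal sketch of how that assumption could be checked is below. The `toy` tibble and its values are hypothetical stand-ins for the raw file, not the real data; on the real data the same calls would run before the `drop_na()` step.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy listings standing in for the raw file (values are made up)
toy <- tibble(
  name              = c("A", "B", "C", "D"),
  number_of_reviews = c(0, 3, 0, 12),
  reviews_per_month = c(NA, 0.4, NA, 1.5)
)

# Cross-tabulate: does a missing rate always go with zero reviews?
check <- toy %>%
  count(zero_reviews = number_of_reviews == 0,
        missing_rate = is.na(reviews_per_month))
check

# If the two conditions always coincide, both tidy steps keep the same rows
by_na    <- toy %>% drop_na(reviews_per_month)
by_count <- toy %>% filter(number_of_reviews > 0)
isTRUE(all.equal(by_na, by_count))
```

If the cross-tabulation ever showed a row where only one of the two conditions holds, filtering on number_of_reviews directly would be the safer way to state the intent.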

Univariate Visualizations

ggplot(data = AB) + 
  geom_bar(mapping = aes(x = neighbourhood_group, fill = room_type))

AB_n <- AB %>%
  group_by(neighbourhood_group) %>%
  summarise(number_of_reviews = sum(number_of_reviews))

AB_n

Here I created a bar chart that counts the listings in each neighbourhood group. I mapped room_type to the fill aesthetic to see which room types each group offers. Brooklyn and Manhattan account for the large majority of Airbnb listings in NYC (about 85% between them in the summary table). Within those two groups, most listings are either an entire home/apt or a private room. No shared rooms are visible for the Bronx or Staten Island, although their bars are small enough that a thin segment would be hard to see.
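Because the smaller bars are hard to read, a count table and a proportional bar chart can confirm what the stacked chart suggests. This is a sketch under assumptions: `toy` is a hypothetical tibble with made-up rows, and in the post the same code would run on `AB`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; in the post this would be the AB tibble
toy <- tibble(
  neighbourhood_group = c("Bronx", "Bronx", "Brooklyn",
                          "Brooklyn", "Manhattan", "Manhattan"),
  room_type = c("Private room", "Shared room", "Entire home/apt",
                "Private room", "Entire home/apt", "Entire home/apt")
)

# Exact counts and within-group shares, so small categories are not missed
room_counts <- toy %>%
  count(neighbourhood_group, room_type) %>%
  group_by(neighbourhood_group) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
room_counts

# position = "fill" rescales every bar to height 1, which makes the
# room-type mix of small groups as visible as that of large ones
p <- ggplot(toy) +
  geom_bar(aes(x = neighbourhood_group, fill = room_type), position = "fill") +
  labs(y = "share of listings")
p
```

The proportional chart trades away the information about group size, so it works best alongside the original stacked chart rather than as a replacement.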

Bivariate Visualization(s)

ggplot(data = AB) + 
  geom_point(mapping = aes(x = price, y = number_of_reviews, color = room_type)) +
  facet_wrap(~ room_type, nrow = 2)

Here I created one subplot per room_type, each with price on the x axis and number_of_reviews on the y axis. Above a price of roughly 500, the number of reviews appears to decline for all room types. One problem with these plots is the scale: a few listings priced as high as 10,000 stretch the x axis, so most points are squeezed against the left edge and it is hard to see what is going on.
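One possible way to deal with the scale problem is a log-scaled price axis. The sketch below assumes that setting aside listings with a price of 0 is acceptable, since log10 is undefined there; `toy` is a hypothetical tibble with made-up values, and in the post the same code would run on `AB`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; in the post this would be the AB tibble
toy <- tibble(
  price             = c(0, 45, 80, 150, 500, 10000),
  number_of_reviews = c(5, 120, 60, 30, 4, 1),
  room_type         = c("Private room", "Private room", "Shared room",
                        "Entire home/apt", "Entire home/apt", "Entire home/apt")
)

# log10 is undefined at 0, so listings with a price of 0 are set aside first
priced <- toy %>% filter(price > 0)

# A log-scaled x axis spreads out the cheap listings and compresses the
# few very expensive ones; alpha reduces overplotting
p <- ggplot(priced, aes(x = price, y = number_of_reviews, color = room_type)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  facet_wrap(~ room_type, nrow = 2)
p
```

An alternative that keeps the linear scale is `coord_cartesian(xlim = c(0, 1000))`, which zooms in on the crowded region without removing any rows from the data; the limit of 1000 is only an illustrative choice.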