Connecticut Real Estate Sales Data
library(readr)
library(tidyverse)
library(dplyr)
library(mosaic)
library(ggplot2)
library(lubridate)
library(xts)
library(formattable)
library(forecast)
library(kableExtra)
library(plotly)
library(gapminder)
library(gganimate)
library(gifski)
library(png)
connecticut <- read_csv("/Users/nelsonfarrell/Downloads/Real_Estate_Sales_2001-2019_GL.csv")
county_town <- read_csv(("/Users/nelsonfarrell/Documents/501 Stats/Connecticut County:Towns.csv"))
county_town <- county_town %>%
select("subregion", "town")
connecticut <- rename(connecticut, "town" = "Town")
connecticut <-
left_join(connecticut, county_town, by = "town")
connecticut <- select(connecticut,
"List Year",
"Date Recorded",
"town",
"Assessed Value",
"Sale Amount",
"Residential Type",
"subregion")
connecticut <- connecticut %>%
na.omit(`Residential Type`)
connecticut <- rename(connecticut,
"list_year" = "List Year",
"sale_date" = "Date Recorded",
"assessed_price" = "Assessed Value",
"sale_price" = "Sale Amount",
"property_type" = "Residential Type",
"county" = "subregion")
connecticut$sale_date <- as_date(connecticut$sale_date,
format = "%m/%d/%Y")
connecticut <- connecticut %>%
mutate(sale_year = year(sale_date))
connecticut %>%
group_by(sale_year) %>%
count(sale_year) %>%
mutate("Proportion of Total" = n/548175) %>%
mutate(across(contains("Proportion"),
round,
2)) %>%
rename("Sale Year" = "sale_year",
"Properties Sold" = "n") %>%
knitr::kable(align = c("l", "c", "r"),
caption = "Number of Properties Sold Each Year ") %>%
kable_classic_2(bootstrap_options = c("striped",
"hover"),
full_width = T) %>%
row_spec(1:6,
background = "yellow")
Sale Year | Properties Sold | Proportion of Total |
---|---|---|
1999 | 1 | 0.00 |
2001 | 6 | 0.00 |
2003 | 1 | 0.00 |
2004 | 6 | 0.00 |
2005 | 3 | 0.00 |
2006 | 10977 | 0.02 |
2007 | 40955 | 0.07 |
2008 | 30749 | 0.06 |
2009 | 33457 | 0.06 |
2010 | 35715 | 0.07 |
2011 | 28892 | 0.05 |
2012 | 29427 | 0.05 |
2013 | 34223 | 0.06 |
2014 | 38112 | 0.07 |
2015 | 44940 | 0.08 |
2016 | 43681 | 0.08 |
2017 | 44609 | 0.08 |
2018 | 43085 | 0.08 |
2019 | 48348 | 0.09 |
2020 | 40988 | 0.07 |
The above table and graphic display the scope of the data. While there are potentially enough observations to keep the year 2006 (10,977 observations); it is still considerably less than the years following. The minimum year total after 2006 is 2011 (28,892 observations) which makes up approximately 5% of the data. 2006 only makes up 2%.
As a result of the data prior to and including 2006 being very limited, I will remove all observations prior to 2007.
Remove observations prior to 2006 and including 2006Here I have removed all data prior to and including 2006.
sale_price
Mean Sale Price | Median Sale Price | Standard Deviation Sale Price |
---|---|---|
349213.3 | 229500 | 672405.7 |
connecticut %>%
group_by(property_type) %>%
summarize("Mean Sale Price" = mean(sale_price),
"Median Sale Price" = median(sale_price),
"Standard Deviation Sale Price" = sd(sale_price)) %>%
rename("Proerty Type" = "property_type") %>%
knitr::kable(caption = "Summary Statistics of Sale Price for Different Property Types") %>%
kable_material(full_width = T)
Proerty Type | Mean Sale Price | Median Sale Price | Standard Deviation Sale Price |
---|---|---|---|
Condo | 260584.3 | 174000 | 670389.9 |
Four Family | 314826.4 | 185000 | 3428321.9 |
Single Family | 387831.8 | 250000 | 651765.0 |
Three Family | 177635.6 | 150000 | 167335.5 |
Two Family | 197504.1 | 160000 | 255104.2 |
sale price
, check for outliers
connecticut %>%
group_by(property_type) %>%
ggplot(aes(x = property_type,
y = sale_price)) +
geom_violin(notch = TRUE) +
labs(title = "Distribution of Sale Price for Residential Types",
x = "Property Type",
y = "Sale Price") +
theme_light()
The violin plots reveal that the data is heavily skewed to the right. The summary statistics reveal very high standard deviations, and medians and means that are not close in value. For these reasons I will effectively use trimmed mean as a measure of central tendency but I will actually remove the highest and lowest 2.5% of observations (sale_price).
Remove top and bottom 2.5% of “sale_price”sale_price
after being trimmedMean Sale Price | Median Sale Price | Standard Deviation Sale Price |
---|---|---|
289055.9 | 229900 | 221472.2 |
Here we see that standard has been reduced from 672,405.7 to 221,472.2 indicating that the dispersion has been decreased. The mean reduced from 339,121.6 to 289,055.9 and is now closer to the median. Trimming the top and bottom 2.5% removed a total of 27,903 observations.
View distribution after trimming the top and bottom 2.5%connecticut %>%
group_by(property_type) %>%
ggplot(aes(x = property_type,
y = sale_price)) +
geom_violin(aes(fill = property_type), notch = TRUE) +
labs(title = "Distribution of Sale Price for Residential Types",
x = "Residential Type",
y = "Sale Price") +
theme_light()
These graphics display a distribution that is still skewed to the right but is closer to normally distributed than it was prior to being trimmed. This will be important for any potential statistical analyses that will be run the course of the analysis.
connecticut %>%
group_by(property_type,
county,
year = lubridate::floor_date(sale_date, "year")) %>%
count(property_type) %>%
ggplot(aes(x = year,
y = n,
color = property_type)) +
geom_line() +
facet_wrap(vars(county)) +
ggtitle("Counts of Property Type Sold from 2008 to 2020",
subtitle = "Connecicut Counties") +
xlab("Year") +
ylab("Properties Sold") +
guides(color = guide_legend(title = "Property Type"))
Here we can see the counts of the different property types from 2007-2020 in each county. While not useful in ascertaining the specific number properties (grouped by type) sold every year in every county, it is useful in illastrating the limited number of observations of multi-unit properties once the data is grouped by county.
Here we can get a closer look.
Two family properties sold every year in each countytwo_family_count <- connecticut %>%
group_by(county,
sale_year) %>%
filter(property_type == "Two Family") %>%
count(property_type) %>%
ggplot(aes( x = sale_year,
y = n)) +
geom_bar(stat = "identity",
fill = "steelblue") +
ggtitle("Number of Two Family Properties Sold: 2007-2020",
subtitle = "Connecticut Counties") +
xlab("Year of Sale") +
ylab("Number of Properties Sold") +
facet_wrap(vars(county))
two_family_count<- ggplotly(two_family_count)
two_family_count
three_family_count <- connecticut %>%
group_by(county,
sale_year) %>%
filter(property_type == "Three Family") %>%
count(property_type) %>%
ggplot(aes( x = sale_year,
y = n)) +
geom_bar(stat = "identity",
fill = "steelblue") +
ggtitle("Number of Three Family Properties Sold: 2007-2020",
subtitle = "Connecticut Counties") +
xlab("Year of Sale") +
ylab("Number of Properties Sold") +
facet_wrap(vars(county))
three_family_count<- ggplotly(three_family_count)
three_family_count
four_family_count <- connecticut %>%
group_by(county,
sale_year) %>%
filter(property_type == "Four Family") %>%
count(property_type) %>%
ggplot(aes( x = sale_year,
y = n)) +
geom_bar(stat = "identity",
fill = "steelblue") +
ggtitle("Number of Four Family Properties Sold: 2007-2020",
subtitle = "Connecticut Counties") +
xlab("Year of Sale") +
ylab("Number of Properties Sold") +
facet_wrap(vars(county))
four_family_count<- ggplotly(four_family_count)
four_family_count
The graphics immediately above are made in ggplotly() so you can scroll over the bar to see the specific number of properties sold per county per year. What these and the further above graphics reveal is that analyzing the sale of price of property types other than single family at the county level or beyond will be challenging and potentially unreliable given the limited nature of the data in many of the counties. Analyzing “Two Family”, “Three Family”, and “Four Family” will be possible at the county level only in Fairfield, Hartford, and Windham.
Given the limited nature of the data for property types other than “Single Family” this analysis will primarily focus on “Single Family.”
options(scipen = 999)
lm1 <- connecticut %>%
group_by(month = floor_date(sale_date,
"month"),
property_type) %>%
filter(property_type == "Single Family") %>%
summarize(mean_price = mean(sale_price),
count = count(month))
lm2 <-lm(mean_price ~ count, data = lm1)
histogram(~residuals(lm2),
main = "Histogram of Residuals: Model m1",
xlab = "Residuals (m1)",
ylab = "Density")
This graphic displays a histogram of the residuals from the linear model. Here we checking if our residuals are approximately normally distributed. It is not perfect but it is close enough to satisfy the condition.
Scatter plot residuals and fitted valuesThis is scatterplot of the residuals versus their fitted values. In this plot we checking equal variance, independence, and if our expected value of the errors is zero.
The conditions have been satisfied and our model is appropriate.
I tried to make these plots in ggResidplot and I actually got it work once, but I could never get it working again. The appendix I am fixing at the end of the semester so I had to give up
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Farrell (2022, May 19). Data Analytics and Computational Social Science: Appendix. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211901409/
BibTeX citation
@misc{farrell2022appendix, author = {Farrell, Joseph}, title = {Data Analytics and Computational Social Science: Appendix}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211901409/}, year = {2022} }