Appendix

Connecticut Real Estate Sales Data

Joseph Farrell
2022-05-11

Appendix

Below is the code up to line 122 in FINAL PAPER DOC.
library(readr)
library(tidyverse)
library(dplyr)
library(mosaic)
library(ggplot2)
library(lubridate)
library(xts)
library(formattable)
library(forecast)
library(kableExtra)
library(plotly)
library(gapminder)
library(gganimate)
library(gifski)
library(png)

connecticut <- read_csv("/Users/nelsonfarrell/Downloads/Real_Estate_Sales_2001-2019_GL.csv")
county_town <- read_csv("/Users/nelsonfarrell/Documents/501 Stats/Connecticut County:Towns.csv")
county_town <- county_town %>%
  select("subregion", "town")

connecticut <- rename(connecticut, "town" = "Town")

connecticut <- 
  left_join(connecticut, county_town, by = "town")

connecticut <- select(connecticut, 
                      "List Year", 
                      "Date Recorded", 
                      "town", 
                      "Assessed Value", 
                      "Sale Amount",
                      "Residential Type",
                      "subregion")

connecticut <- connecticut %>%
  drop_na(`Residential Type`)   # drop rows with a missing Residential Type (tidyr)

connecticut <- rename(connecticut, 
                      "list_year" = "List Year", 
                      "sale_date" = "Date Recorded", 
                      "assessed_price" = "Assessed Value", 
                      "sale_price" = "Sale Amount",
                      "property_type" = "Residential Type", 
                      "county" = "subregion")

connecticut$sale_date <- as_date(connecticut$sale_date, 
                                 format = "%m/%d/%Y")
connecticut <- connecticut %>%
  mutate(sale_year = year(sale_date))

Part A

Counts by Year

Examine counts in relation to year
connecticut %>%
  group_by(sale_year) %>%
  count(sale_year) %>%
  ungroup() %>%
  mutate("Proportion of Total" = round(n / sum(n), 2)) %>%
  rename("Sale Year" = "sale_year", 
         "Properties Sold" = "n") %>%
  knitr::kable(align = c("l", "c", "r"),
               caption = "Number of Properties Sold Each Year ") %>%
  kable_classic_2(lightable_options = c("striped", 
                                        "hover"), 
                  full_width = T) %>%
  row_spec(1:6, 
           background = "yellow") 
Table 1: Number of Properties Sold Each Year
Sale Year    Properties Sold    Proportion of Total
1999                       1                   0.00
2001                       6                   0.00
2003                       1                   0.00
2004                       6                   0.00
2005                       3                   0.00
2006                   10977                   0.02
2007                   40955                   0.07
2008                   30749                   0.06
2009                   33457                   0.06
2010                   35715                   0.07
2011                   28892                   0.05
2012                   29427                   0.05
2013                   34223                   0.06
2014                   38112                   0.07
2015                   44940                   0.08
2016                   43681                   0.08
2017                   44609                   0.08
2018                   43085                   0.08
2019                   48348                   0.09
2020                   40988                   0.07
Counts displayed in a bar chart
connecticut %>%
  group_by(sale_year) %>%
  ggplot(aes(x = sale_year)) +
  geom_bar(fill = "steelblue") +
  ggtitle("Number of Properties Sold Per Year") +
  xlab("Sale Year") +
  ylab("Number of Properties Sold")

The table and graphic above display the scope of the data. Although 2006 has potentially enough observations to keep (10,977), it is still considerably fewer than any of the following years. The smallest annual total after 2006 is 2011 (28,892 observations), which makes up approximately 5% of the data; 2006 makes up only 2%.

Because the data through 2006 are so limited, I will remove all observations prior to 2007.

Remove observations from 2006 and earlier
connecticut <- connecticut %>%
  filter(sale_year > 2006)

Here I have removed all data prior to and including 2006.

Part B

Distribution of Sale Price

Get summary statistics of sale_price
summarize(connecticut,
          "Mean Sale Price" = mean(sale_price),
          "Median Sale Price" = median(sale_price),
          "Standard Deviation Sale Price" = sd(sale_price)) %>%
knitr::kable(caption = "Overall Summary Statistics of Sale Price") %>%
kable_material(full_width = T)
Table 2: Overall Summary Statistics of Sale Price
Mean Sale Price    Median Sale Price    Standard Deviation Sale Price
       349213.3               229500                         672405.7
View summary statistics of different property types
connecticut %>%
  group_by(property_type) %>%
  summarize("Mean Sale Price" = mean(sale_price),
            "Median Sale Price" = median(sale_price),
            "Standard Deviation Sale Price" = sd(sale_price)) %>%
  rename("Property Type" = "property_type") %>%
  knitr::kable(caption = "Summary Statistics of Sale Price for Different Property Types") %>%
  kable_material(full_width = T)
Table 3: Summary Statistics of Sale Price for Different Property Types
Property Type    Mean Sale Price    Median Sale Price    Standard Deviation Sale Price
Condo                   260584.3               174000                         670389.9
Four Family             314826.4               185000                        3428321.9
Single Family           387831.8               250000                         651765.0
Three Family            177635.6               150000                         167335.5
Two Family              197504.1               160000                         255104.2
View distribution of sale price, check for outliers
connecticut %>%
  ggplot(aes(x = property_type, 
             y = sale_price)) +
  geom_violin() +
  labs(title = "Distribution of Sale Price for Residential Types",
       x = "Property Type",
       y = "Sale Price") +
  theme_light()

The violin plots reveal that the data are heavily right-skewed. The summary statistics show very large standard deviations, and means that sit well above the medians. For these reasons I will effectively use a trimmed mean as the measure of central tendency, implemented by removing the highest and lowest 2.5% of sale_price observations.
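Removing the top and bottom 2.5% is essentially a trimmed mean computed by hand; as a sanity check, base R's `mean(x, trim = 0.025)` gives nearly the same answer. A minimal sketch on simulated right-skewed data (not the Connecticut file):

```r
# Sketch comparing base R's trimmed mean with the "drop the tails, then
# average" approach used below. Simulated data, for illustration only.
set.seed(42)
x <- rlnorm(10000, meanlog = 12, sdlog = 0.6)   # right-skewed, like sale prices

trimmed_mean <- mean(x, trim = 0.025)           # trims 2.5% from each tail

keep        <- x > quantile(x, 0.025) & x < quantile(x, 0.975)
manual_mean <- mean(x[keep])

abs(trimmed_mean - manual_mean) / manual_mean   # small relative difference
```

The two differ only slightly in how observations exactly at the cutpoints are handled.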

Remove top and bottom 2.5% of “sale_price”
connecticut <- connecticut %>%
  filter(sale_price < quantile(connecticut$sale_price, 
                               .975) & 
           sale_price > quantile(connecticut$sale_price, 
                                 .025)) 
View summary statistics of sale_price after being trimmed
options(scipen = 999)
summarize(connecticut,
          "Mean Sale Price" = mean(sale_price),
          "Median Sale Price" = median(sale_price),
          "Standard Deviation Sale Price" = sd(sale_price)) %>%
knitr::kable(caption = "Post Trim Overall Summary Statistics of Sale Price") %>%
kable_material(full_width = T)
Table 4: Post Trim Overall Summary Statistics of Sale Price
Mean Sale Price    Median Sale Price    Standard Deviation Sale Price
       289055.9               229900                         221472.2

Here we see that the standard deviation has been reduced from 672,405.7 to 221,472.2, indicating that the dispersion has decreased. The mean fell from 349,213.3 to 289,055.9 and is now much closer to the median. Trimming the top and bottom 2.5% removed a total of 27,903 observations.

View distribution after trimming the top and bottom 2.5%
connecticut %>%
  ggplot(aes(x = property_type, 
             y = sale_price)) +
  geom_violin(aes(fill = property_type)) +
  labs(title = "Distribution of Sale Price for Residential Types",
       x = "Residential Type",
       y = "Sale Price") +
  theme_light()

Distribution of sale price
connecticut %>%
  ggplot(aes(x=sale_price)) +
    geom_histogram(binwidth = 50000, fill = "steelblue") +
    labs(title = "Distribution of Sale Price",
         y = "Count",
         x = "Sale Price") +
    theme_bw() 

These graphics display a distribution that is still right-skewed but is closer to normal than it was prior to trimming. This will be important for any statistical analyses run over the course of the analysis.
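"Closer to normal" can be made concrete with a skewness statistic; a minimal sketch using moment-based skewness on simulated lognormal data (illustrative only, not the Connecticut file):

```r
# Sketch: quantify the skew instead of eyeballing it. Moment-based skewness
# on a simulated right-skewed stand-in for sale prices.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(7)
prices  <- rlnorm(50000, meanlog = 12.3, sdlog = 0.5)
trimmed <- prices[prices > quantile(prices, 0.025) &
                    prices < quantile(prices, 0.975)]

skewness(prices)    # strongly positive
skewness(trimmed)   # noticeably smaller after trimming the tails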

Part C

Property Type Counts

Line graph of the counts of property types in Connecticut Counties
connecticut %>%
  group_by(property_type, 
           county, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  count(property_type) %>%
  ggplot(aes(x = year, 
             y = n, 
             color = property_type)) +
  geom_line() +
  facet_wrap(vars(county)) +
  ggtitle("Counts of Property Type Sold from 2007 to 2020", 
          subtitle = "Connecticut Counties") +
  xlab("Year") +
  ylab("Properties Sold") +
  guides(color = guide_legend(title = "Property Type"))

Here we can see the counts of the different property types sold from 2007-2020 in each county. While not useful for ascertaining the specific number of properties (grouped by type) sold each year in each county, it is useful for illustrating the limited number of observations of multi-unit properties once the data is grouped by county.

Here we can get a closer look.

Two family properties sold every year in each county
two_family_count <- connecticut %>%
  group_by(county, 
           sale_year) %>%
  filter(property_type == "Two Family") %>%
  count(property_type) %>%
  ggplot(aes( x = sale_year, 
              y = n)) +
  geom_bar(stat = "identity", 
           fill = "steelblue") +
  ggtitle("Number of Two Family Properties Sold: 2007-2020",
       subtitle = "Connecticut Counties") +
  xlab("Year of Sale") +
  ylab("Number of Properties Sold") +
  facet_wrap(vars(county))
  
two_family_count<- ggplotly(two_family_count)

two_family_count
Three family properties sold every year in each county
three_family_count <- connecticut %>%
  group_by(county, 
           sale_year) %>%
  filter(property_type == "Three Family") %>%
  count(property_type) %>%
  ggplot(aes( x = sale_year, 
              y = n)) +
  geom_bar(stat = "identity", 
           fill = "steelblue") +
  ggtitle("Number of Three Family Properties Sold: 2007-2020",
       subtitle = "Connecticut Counties") +
  xlab("Year of Sale") +
  ylab("Number of Properties Sold") +
  facet_wrap(vars(county))
three_family_count<- ggplotly(three_family_count)

three_family_count
Four family properties sold every year in each county
four_family_count <- connecticut %>%
  group_by(county, 
           sale_year) %>%
  filter(property_type == "Four Family") %>%
  count(property_type) %>%
  ggplot(aes( x = sale_year, 
              y = n)) +
  geom_bar(stat = "identity", 
           fill = "steelblue") +
  ggtitle("Number of Four Family Properties Sold: 2007-2020",
       subtitle = "Connecticut Counties") +
  xlab("Year of Sale") +
  ylab("Number of Properties Sold") +
  facet_wrap(vars(county))

four_family_count<- ggplotly(four_family_count)

four_family_count

The graphics immediately above are made with ggplotly(), so you can hover over a bar to see the specific number of properties sold per county per year. What these and the graphics further above reveal is that analyzing the sale price of property types other than single family at the county level will be challenging and potentially unreliable, given how limited the data are in many of the counties. Analyzing “Two Family”, “Three Family”, and “Four Family” properties will be possible at the county level only in Fairfield, Hartford, and Windham.

Given the limited nature of the data for property types other than “Single Family”, this analysis will primarily focus on “Single Family.”
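One way to make the "enough observations per county" judgment explicit is to tabulate county-by-type cells and keep only those above a minimum yearly count. A minimal sketch on toy data (the tibble and the threshold of 2 are illustrative, not from the analysis):

```r
library(dplyr)

# Sketch: flag county/property-type cells with too few sales per year to
# analyze reliably. Toy stand-in data; the real file has hundreds of
# thousands of rows.
toy <- tibble::tibble(
  county        = c(rep("Fairfield", 6), rep("Windham", 3)),
  property_type = c(rep("Two Family", 6), rep("Four Family", 3)),
  sale_year     = c(2007, 2007, 2008, 2008, 2008, 2008, 2007, 2008, 2009)
)

usable <- toy %>%
  count(county, property_type, sale_year, name = "n_sales") %>%
  group_by(county, property_type) %>%
  summarize(min_yearly = min(n_sales), .groups = "drop") %>%
  filter(min_yearly >= 2)   # every year must clear the minimum count
```

Run against the real data with a realistic threshold, this would list exactly the county/type combinations that support a year-over-year analysis.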

Part D

Conditions Check

Histogram of residuals
options(scipen = 999)
lm1 <- connecticut %>%
  group_by(month = floor_date(sale_date, 
                              "month"), 
           property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = n())
  
lm2 <-lm(mean_price ~ count, data = lm1)

histogram(~residuals(lm2),
          main = "Histogram of Residuals: Model lm2",
          xlab = "Residuals (lm2)",
          ylab = "Density")

This graphic displays a histogram of the residuals from the linear model. Here we are checking whether the residuals are approximately normally distributed. The distribution is not perfect, but it is close enough to satisfy the condition.
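A normal Q-Q plot often makes this check easier to judge than a histogram; a minimal base-R sketch (the fit is on simulated data so the chunk runs standalone, where the paper would use residuals(lm2)):

```r
# Sketch: Q-Q plot of residuals as a sharper normality check than a histogram.
# Simulated fit; substitute residuals(lm2) for the real model.
set.seed(1)
d   <- data.frame(x = 1:100, y = 3 * (1:100) + rnorm(100, sd = 5))
fit <- lm(y ~ x, data = d)

qqnorm(residuals(fit), main = "Normal Q-Q Plot of Residuals")
qqline(residuals(fit))   # points hugging this line suggest approximate normality
```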

Scatter plot residuals and fitted values
xyplot(residuals(lm2)~fitted(lm2),
       main = "Residuals Plot: Model lm2",
       xlab = "Fitted (lm2)",
       ylab = "Residuals")

This is a scatterplot of the residuals versus the fitted values. In this plot we are checking for equal variance and independence, and whether the expected value of the errors is zero.

The conditions have been satisfied and our model is appropriate.

I tried to make these plots in ggResidplot, and I actually got it to work once, but I could never get it working again. I am fixing the appendix at the end of the semester, so I had to give up on it.
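A residuals-vs-fitted plot can also be built in plain ggplot2 without any extra diagnostic package; a minimal sketch (simulated fit so it runs standalone, where the paper would use fitted(lm2) and residuals(lm2)):

```r
library(ggplot2)

# Sketch: residuals-vs-fitted in plain ggplot2, avoiding the package issue.
# Simulated fit; substitute lm2 for the real model.
set.seed(1)
d   <- data.frame(x = 1:100, y = 3 * (1:100) + rnorm(100, sd = 5))
fit <- lm(y ~ x, data = d)

diag_df <- data.frame(fitted = fitted(fit), resid = residuals(fit))

ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Fitted",
       x = "Fitted values",
       y = "Residuals") +
  theme_light()
```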

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Farrell (2022, May 19). Data Analytics and Computational Social Science: Appendix. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211901409/

BibTeX citation

@misc{farrell2022appendix,
  author = {Farrell, Joseph},
  title = {Data Analytics and Computational Social Science: Appendix},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211901409/},
  year = {2022}
}