Data Analytics and Computational Social Science: Appendix

Joseph Farrell

Appendix

Below is the code up to line 122 in FINAL PAPER DOC

hide

library(readr)
library(tidyverse)
library(dplyr)
library(mosaic)
library(ggplot2)
library(lubridate)
library(xts)
library(formattable)
library(forecast)
library(kableExtra)
library(plotly)
library(gapminder)
library(gganimate)
library(gifski)
library(png)

connecticut <- read_csv("/Users/nelsonfarrell/Downloads/Real_Estate_Sales_2001-2019_GL.csv")
county_town <- read_csv(("/Users/nelsonfarrell/Documents/501 Stats/Connecticut County:Towns.csv"))
county_town <- county_town %>%
  select("subregion", "town")

connecticut <- rename(connecticut, "town" = "Town")

connecticut <- 
  left_join(connecticut, county_town, by = "town")

connecticut <- select(connecticut, 
                      "List Year", 
                      "Date Recorded", 
                      "town", 
                      "Assessed Value", 
                      "Sale Amount",
                      "Residential Type",
                      "subregion")

connecticut <- connecticut %>%
  na.omit(`Residential Type`)

connecticut <- rename(connecticut, 
                      "list_year" = "List Year", 
                      "sale_date" = "Date Recorded", 
                      "assessed_price" = "Assessed Value", 
                      "sale_price" = "Sale Amount",
                      "property_type" = "Residential Type", 
                      "county" = "subregion")

connecticut$sale_date <- as_date(connecticut$sale_date, 
                                 format = "%m/%d/%Y")
connecticut <- connecticut %>%
  mutate(sale_year = year(sale_date))

Part A

Counts by Year

Examine counts in relation to year

hide

connecticut %>%
  group_by(sale_year) %>%
  count(sale_year) %>%
  mutate("Proportion of Total" = n/548175) %>%
  mutate(across(contains("Proportion"), 
                round, 
                2)) %>%
  rename("Sale Year" = "sale_year", 
         "Properties Sold" = "n") %>%
  knitr::kable(align = c("l", "c", "r"),
               caption = "Number of Properties Sold Each Year ") %>%
  kable_classic_2(bootstrap_options = c("striped", 
                                        "hover"), 
                  full_width = T) %>%
  row_spec(1:6, 
           background = "yellow")

Table 1: Number of Properties Sold Each Year
Sale Year	Properties Sold	Proportion of Total
1999	1	0.00
2001	6	0.00
2003	1	0.00
2004	6	0.00
2005	3	0.00
2006	10977	0.02
2007	40955	0.07
2008	30749	0.06
2009	33457	0.06
2010	35715	0.07
2011	28892	0.05
2012	29427	0.05
2013	34223	0.06
2014	38112	0.07
2015	44940	0.08
2016	43681	0.08
2017	44609	0.08
2018	43085	0.08
2019	48348	0.09
2020	40988	0.07

Counts displayed in a bar chart

hide

connecticut %>%
  group_by(sale_year) %>%
  ggplot(aes(x = sale_year)) +
  geom_bar(fill = "steelblue") +
  ggtitle("Number of Properties Sold Per Year") +
  xlab("Sale Year") +
  ylab("Number of Properties Sold")

The above table and graphic display the scope of the data. While there are potentially enough observations to keep the year 2006 (10,977 observations); it is still considerably less than the years following. The minimum year total after 2006 is 2011 (28,892 observations) which makes up approximately 5% of the data. 2006 only makes up 2%.

As a result of the data prior to and including 2006 being very limited, I will remove all observations prior to 2007.

Remove observations prior to 2006 and including 2006

hide

connecticut <- connecticut %>%
  filter(sale_year > 2006)

Here I have removed all data prior to and including 2006.

Part B

Distribution of Sale Price

Get summary statistics of sale_price

hide

summarize(connecticut,
          "Mean Sale Price" = mean(sale_price),
          "Median Sale Price" = median(sale_price),
          "Standard Deviation Sale Price" = sd(sale_price)) %>%
knitr::kable(caption = "Overall Summary Statistics of Sale Price") %>%
kable_material(full_width = T)

Table 2: Overall Summary Statistics of Sale Price
Mean Sale Price	Median Sale Price	Standard Deviation Sale Price
349213.3	229500	672405.7

View summary statistics of different property types

hide

connecticut %>%
  group_by(property_type) %>%
  summarize("Mean Sale Price" = mean(sale_price),
            "Median Sale Price" = median(sale_price),
            "Standard Deviation Sale Price" = sd(sale_price)) %>%
  rename("Proerty Type" = "property_type") %>%
  knitr::kable(caption = "Summary Statistics of Sale Price for Different Property Types") %>%
  kable_material(full_width = T)

Table 3: Summary Statistics of Sale Price for Different Property Types
Proerty Type	Mean Sale Price	Median Sale Price	Standard Deviation Sale Price
Condo	260584.3	174000	670389.9
Four Family	314826.4	185000	3428321.9
Single Family	387831.8	250000	651765.0
Three Family	177635.6	150000	167335.5
Two Family	197504.1	160000	255104.2

View distribution of sale price, check for outliers

hide

connecticut %>%
  group_by(property_type) %>%
  ggplot(aes(x = property_type, 
             y = sale_price)) +
  geom_violin(notch = TRUE) +
  labs(title = "Distribution of Sale Price for Residential Types",
       x = "Property Type",
       y = "Sale Price") +
  theme_light()

The violin plots reveal that the data is heavily skewed to the right. The summary statistics reveal very high standard deviations, and medians and means that are not close in value. For these reasons I will effectively use trimmed mean as a measure of central tendency but I will actually remove the highest and lowest 2.5% of observations (sale_price).

Remove top and bottom 2.5% of “sale_price”

hide

connecticut <- connecticut %>%
  filter(sale_price < quantile(connecticut$sale_price, 
                               .975) & 
           sale_price > quantile(connecticut$sale_price, 
                                 .025))

View summary statistics of sale_price after being trimmed

hide

options(scipen = 999)
summarize(connecticut,
          "Mean Sale Price" = mean(sale_price),
          "Median Sale Price" = median(sale_price),
          "Standard Deviation Sale Price" = sd(sale_price)) %>%
knitr::kable(caption = "Post Trim Overall Summary Statistics of Sale Price") %>%
kable_material(full_width = T)

Table 4: Post Trim Overall Summary Statistics of Sale Price
Mean Sale Price	Median Sale Price	Standard Deviation Sale Price
289055.9	229900	221472.2

Here we see that standard has been reduced from 672,405.7 to 221,472.2 indicating that the dispersion has been decreased. The mean reduced from 339,121.6 to 289,055.9 and is now closer to the median. Trimming the top and bottom 2.5% removed a total of 27,903 observations.

View distribution after trimming the top and bottom 2.5%

hide

connecticut %>%
  group_by(property_type) %>%
  ggplot(aes(x = property_type, 
             y = sale_price)) +
  geom_violin(aes(fill = property_type), notch = TRUE) +
  labs(title = "Distribution of Sale Price for Residential Types",
       x = "Residential Type",
       y = "Sale Price") +
  theme_light()

Distribution of sale price

hide

connecticut %>%
  ggplot(aes(x=sale_price)) +
    geom_histogram(binwidth = 50000, fill = "steelblue") +
    labs(title = "Distribution of Sale Price",
         y = "Count",
         x = "Sale Price") +
    theme_bw()

These graphics display a distribution that is still skewed to the right but is closer to normally distributed than it was prior to being trimmed. This will be important for any potential statistical analyses that will be run the course of the analysis.

Part C

Property Type Counts

Line graph of the counts of property types in Connecticut Counties

hide

connecticut %>%
  group_by(property_type, 
           county, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  count(property_type) %>%
  ggplot(aes(x = year, 
             y = n, 
             color = property_type)) +
  geom_line() +
  facet_wrap(vars(county)) +
  ggtitle("Counts of Property Type Sold from 2008 to 2020", 
          subtitle = "Connecicut Counties") +
  xlab("Year") +
  ylab("Properties Sold") +
  guides(color = guide_legend(title = "Property Type"))

Here we can see the counts of the different property types from 2007-2020 in each county. While not useful in ascertaining the specific number properties (grouped by type) sold every year in every county, it is useful in illastrating the limited number of observations of multi-unit properties once the data is grouped by county.

Here we can get a closer look.

Two family properties sold every year in each county

hide

two_family_count <- connecticut %>%
  group_by(county, 
           sale_year) %>%
  filter(property_type == "Two Family") %>%
  count(property_type) %>%
  ggplot(aes( x = sale_year, 
              y = n)) +
  geom_bar(stat = "identity", 
           fill = "steelblue") +
  ggtitle("Number of Two Family Properties Sold: 2007-2020",
       subtitle = "Connecticut Counties") +
  xlab("Year of Sale") +
  ylab("Number of Properties Sold") +
  facet_wrap(vars(county))
  
two_family_count<- ggplotly(two_family_count)

two_family_count

Three family properties sold every year in each county

hide

three_family_count <- connecticut %>%
  group_by(county, 
           sale_year) %>%
  filter(property_type == "Three Family") %>%
  count(property_type) %>%
  ggplot(aes( x = sale_year, 
              y = n)) +
  geom_bar(stat = "identity", 
           fill = "steelblue") +
  ggtitle("Number of Three Family Properties Sold: 2007-2020",
       subtitle = "Connecticut Counties") +
  xlab("Year of Sale") +
  ylab("Number of Properties Sold") +
  facet_wrap(vars(county))
three_family_count<- ggplotly(three_family_count)

three_family_count

Four family properties sold every year in each county

hide

four_family_count <- connecticut %>%
  group_by(county, 
           sale_year) %>%
  filter(property_type == "Four Family") %>%
  count(property_type) %>%
  ggplot(aes( x = sale_year, 
              y = n)) +
  geom_bar(stat = "identity", 
           fill = "steelblue") +
  ggtitle("Number of Four Family Properties Sold: 2007-2020",
       subtitle = "Connecticut Counties") +
  xlab("Year of Sale") +
  ylab("Number of Properties Sold") +
  facet_wrap(vars(county))

four_family_count<- ggplotly(four_family_count)

four_family_count

The graphics immediately above are made in ggplotly() so you can scroll over the bar to see the specific number of properties sold per county per year. What these and the further above graphics reveal is that analyzing the sale of price of property types other than single family at the county level or beyond will be challenging and potentially unreliable given the limited nature of the data in many of the counties. Analyzing “Two Family”, “Three Family”, and “Four Family” will be possible at the county level only in Fairfield, Hartford, and Windham.

Given the limited nature of the data for property types other than “Single Family” this analysis will primarily focus on “Single Family.”

Part D

Conditions Check

Histogram of residuals

hide

options(scipen = 999)
lm1 <- connecticut %>%
  group_by(month = floor_date(sale_date, 
                              "month"), 
           property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month))
  
lm2 <-lm(mean_price ~ count, data = lm1)

histogram(~residuals(lm2),
          main = "Histogram of Residuals: Model m1",
          xlab = "Residuals (m1)",
          ylab = "Density")

This graphic displays a histogram of the residuals from the linear model. Here we checking if our residuals are approximately normally distributed. It is not perfect but it is close enough to satisfy the condition.

Scatter plot residuals and fitted values

hide

xyplot(residuals(lm2)~fitted(lm2),
       main = "Residuals Plot: Model m1",
       xlab = "Fitted (m1)",
       ylab = "Residuals")

This is scatterplot of the residuals versus their fitted values. In this plot we checking equal variance, independence, and if our expected value of the errors is zero.

Linearity Previously established.
Normality Previously established.
Equal Variance: The plot the does thicken in either direction, the values do not systematically increase or decrease. This condition is satisfied.
Independence: There is no apparent pattern, the residuals do not appear to exhibit any dependence on the fitted values.
E(errors)= 0: There appears to be a similar number of points above and below zero.

The conditions have been satisfied and our model is appropriate.

I tried to make these plots in ggResidplot and I actually got it work once, but I could never get it working again. The appendix I am fixing at the end of the semester so I had to give up

Comment on this article Share:

Appendix

Appendix

Part A

Counts by Year

Part B

Distribution of Sale Price

Part C

Property Type Counts

Part D

Conditions Check

Reuse

Citation