Data Analytics and Computational Social Science: Homework 6

Joseph Farrell

Load Libraries

library(readr)
library(tidyverse)
library(dplyr)
library(mosaic)
library(ggplot2)
library(lubridate)
library(xts)
library(formattable)
library(forecast)
library(kableExtra)
library(plotly)
library(gapminder)
library(gganimate)
library(gifski)
library(png)
library(ggfortify)
library(tidyquant)
library(tidycensus)
library(ggstatsplot)
library(scales)
library(feasts)
library(tsibble)

Introduction

The following is an analysis of Connecticut real estate sales data. The data set used in this analysis contains 930,621 observations (transactions) and contains data from 2005-2020. There are 14 variables in the original data set. Not all observations and variables will be pertinent to answering the research questions and will be subsequently filtered out. The research questions this analysis will attempt to answer or at least provide insights into answering are:

Where in Connecticut are properties values the highest?
Are property values in the Connecticut increasing or decreasing?
Where are property values increasing/decreasing?
Is the price behavior different for different property types?

Read in Data

I will read in two data sets that will be joined and used for the analysis.

Connecticut contains Connecticut real estate sales data from 2005-2020.
County_town contains a list of Connecticut counties and the towns that are in each county. This latter data set will be useful for grouping in the analysis. As will be displayed later, many of the towns contain limited observations so an alternative grouping option will be useful.

connecticut <- read_csv("/Users/nelsonfarrell/Downloads/Real_Estate_Sales_2001-2019_GL.csv")
county_town <- read_csv(("/Users/nelsonfarrell/Documents/501 Stats/Connecticut County:Towns.csv"))
county_town <- county_town %>%
  select("subregion", 
         "town")

Data Cleaning

Rename “Town” variable “town” to join datasets

connecticut <- rename(connecticut, "town" = "Town")

Here I am changing the name of the “Town” variable in the Connecticut data set so it will match the “town” variable in the county_town data set.

Join datasets

connecticut <- 
  left_join(connecticut, county_town, by = "town")

Here am joining the data sets, this will be useful when grouping in the analysis.

Remove unnecassary columns

connecticut <- select(connecticut, 
                      "List Year", 
                      "Date Recorded", 
                      "town",
                      "Sale Amount",
                      "Residential Type",
                      "subregion")

Here I have removed the columns “Serial number”, “Non Use Code”, “Assessor Remarks”, “Sales Ratio”, “Property Type”, “Location”, “Assessed Value”, and “OPM Remarks”

These removed columns do not appear to be essential to the current analysis

Remove NAs from “Residential Type”

connecticut <- connecticut %>%
  na.omit(`Residential Type`)

The NAs removed here are simply nonresidential properties, which will not be part of the analysis.

Rename columns without spaces and for clarity

connecticut <- rename(connecticut, 
                      "list_year" = "List Year", 
                      "sale_date" = "Date Recorded", 
                      "sale_price" = "Sale Amount",
                      "property_type" = "Residential Type", 
                      "county" = "subregion")

In addition to removing spaces from the column names, I have renamed residential type “property type” as there are no longer non-residential properties, this naming scheme is more intuitive.

Convert “sale_date” from character to date and create new column “sale_year”

connecticut$sale_date <- as_date(connecticut$sale_date, 
                                 format = "%m/%d/%Y")
connecticut <- connecticut %>%
  mutate(sale_year = year(sale_date))

Remove all observations prior to 2007

connecticut <- connecticut %>%
  filter(sale_year > 2006)

The data prior to and including 2006 is very limited. As a result, I have removed all observations prior to 2007.

See appendix Part A for justification.

Remove top and bottom 2.5% of “sale_price”

connecticut <- connecticut %>%
  filter(sale_price < quantile(connecticut$sale_price, .975) & 
           sale_price > quantile(connecticut$sale_price, .025))

The data was heavily skewed to the right, with extreme outliers in the right direction which impacted the measure of central tendency (mean). After trimming the data it is still skewed to the right but not as heavily.

See appendix Part B for justification.

View data set following cleaning

str(connecticut)

tibble [509,278 × 7] (S3: tbl_df/tbl/data.frame)
 $ list_year    : num [1:509278] 2014 2014 2014 2014 2014 ...
 $ sale_date    : Date[1:509278], format: "2015-08-06" ...
 $ town         : chr [1:509278] "Stamford" "New Haven" "Ridgefield" "New Britain" ...
 $ sale_price   : num [1:509278] 850000 149900 570000 261000 250000 ...
 $ property_type: chr [1:509278] "Single Family" "Single Family" "Single Family" "Condo" ...
 $ county       : chr [1:509278] "fairfield" "new haven" "fairfield" "hartford" ...
 $ sale_year    : num [1:509278] 2015 2015 2015 2015 2015 ...
 - attr(*, "na.action")= 'omit' Named int [1:382446] 13 27 28 33 53 78 83 105 127 132 ...
  ..- attr(*, "names")= chr [1:382446] "13" "27" "28" "33" ...

After cleaning the data set we are left with seven variables (columns):

list_year: This is a numeric variable. This is the year that the property was listed (i.e., when the property was put up for sale).
sale_date: This is a date variable. This is the date that the property was actually sold.
town: This is a nominal character variable. This is the town where the property is located.
county: This is a nominal character variable. This is the county where the property is located.
sale_price: This is a numeric variable. This is the price that property sold for.
property_type: This is nominal character variable. This is the type of property (i.e., condo, single family, two family, etc.)
sale_year: This is a numeric variable. This is the year the property was sold.

Now that we have data cleaned we take a look at some statewide data so we can get an idea of the general state of the real estate market in Connecticut.

Statewide Data

Line Graph: Statewide mean

connecticut %>%
  group_by(year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_year = mean(sale_price)) %>%
  ggplot(aes(x = year, y = mean_year)) +
  geom_line(aes()) +
  geom_smooth() +
  ggtitle("Overall Mean Sale Price of Connecticut Properties: 2007-2020") +
  ylab("Sale Price") +
  xlab("Year") +
  theme_update()

Here we can observe the behavior of the overall mean sale price of properties in Connecticut from 2007 to 2020. As expected the mean price decreased sharply following the 2008 financial crisis and appears to have bottomed out and stabilized around 2010. From 2010 to 2019 there does not appear to be a lot change in the overall mean sale price.

Map Visualization: Geom_polygon: Statewide Mean

Read in map data

con_map <- map_data("county", 
         "connecticut")

This data set contains latitude and longitude data that can be used to make a map visualization.

Rename “subregion” “county” to join “con_map” with “connecticut”

con_map <- rename(con_map,
                  "county" = "subregion")

The “subregion” in the “con_map” data set is Connecticut counties so I will rename the variable so I can join it with the sales data.

Create object “map_vis”: grouping by county and creating a “county_mean” column

map_vis <- connecticut %>%
  group_by(county) %>%
  summarize(county_mean = mean(sale_price))

This object contains the mean sale price of each county in Connecticut and will be joined the map data to display the mean sale price on a map.

Join “mapping” object with “con_map” data

map_vis <- inner_join(map_vis, 
                      con_map, 
                      by = "county")

This object will be used to create a map visualization displaying county mean sale price.

Create object to use when labeling the counties on the map

county_names <- aggregate(cbind(long, 
                                lat) ~ county, 
                          data=map_vis, 
                    FUN=function(x)mean(range(x)))

This creates an object that will center the name of the county inside that county on the map visualization.

Create map of Connecticut counties displaying the mean sale price of residential properties

ggplot(map_vis, 
       aes(x = long, 
           y = lat)) +
  geom_polygon(aes(fill = county_mean, 
                   group = county)) +
  geom_text(data = county_names, 
            aes(long, 
                lat, 
                label = county), 
                size =3) +
  scale_fill_gradient(low = "bisque4", 
                      high = "bisque2", 
                      name = "Average Sale Price (USD)") +
  ggtitle("Overall Average Sale Price in Connecticut Counties") +
  theme_light()

Here we can get a visualization of where in Connecticut (grouped by county) property values are the highest. Not surprisingly, the area closest to New York City appears to have the highest property value.

The previous two graphics display all the property types combined which may not generate the most reliable results. For the rest of the analysis, like property types will be compared.

Visualizations: Grouped by Property Type

Scatterplot of the mean monthly “sale price” of each “property type”

mean_property_type <-connecticut %>%
  group_by(year = lubridate::floor_date(sale_date, "year"), 
           property_type) %>%
  summarize(mean = mean(sale_price))

mean_property_type <-ggplot(mean_property_type, 
              aes(year, 
                  mean, 
                  color = property_type)) +
  geom_line() +
  transition_reveal(year) +
  labs(title = "Mean Sale Price: 2007-2020",
       subtitle = "Grouped By Property Type",
       x = "Year",
       y = "Price ($)") +
  guides(color = guide_legend(title = "Property Type"))

 
animate(mean_property_type)

This graphic displays the mean monthly sale price of all properties in Connecticut grouped by the type of property. Interestingly, single family homes have the highest mean sale price. We can get a closer look below.

Scatterplot of yearly mean sale price for each “Property Type”

options(scipen = 999) # remove scientific notation
connecticut %>%
  group_by(property_type, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_year_type = mean(sale_price)) %>%
  ggplot(aes(x = year, 
             y = mean_year_type, 
             group = property_type)) +
  geom_point((aes(color = property_type))) +
  geom_smooth() +
  ggtitle("Yearly Mean Sale Price: Property Types") +
  ylab("Mean Sale Price ($)") +
  xlab("Year") +
  theme_linedraw() +
  facet_wrap(vars(property_type)) +
  guides(color = guide_legend(title = "Property Type"))

Overall, the graphic demonstrates that the behavior of the mean sale price is different for different property types.
The graphic shows that multi-unit properties appear to have started increasing in value earlier than single family properties (i.e. condo and single family).
The mean sale price of condos has been largely constant.
Single family properties appear to have been constant right up until 2020 when they experienced a rapid increase. This aligns with the well documented housing boom that resulted from the Covid-19 pandemic.
Four family properties appear to be increasing the fastest, but also appear to be the most volatile and potentially risky.
Two and three family properties appear to have started increasing directly after the crash of 2008 and appear to have been steadily increasing ever since.
Before we can continue the analysis we will have to examine the counts of the different property types.

Counts: Property Types: Statewide & County

Here we will examine the counts of the different property types in our data set. We will also look at the counts in each specific county and later in each town.

View Counts of Property types

connecticut %>%
  count(property_type) %>%
  mutate("Proportion"= percent(n/sum(n))) %>%
  rename("Property Type" = "property_type") %>%
  rename("Number of Properties Sold" = "n") %>%
  kbl() %>%
  kable_material(c("striped", 
                   "hover"))

Property Type	Number of Properties Sold	Proportion
Condo	98463	19.3%
Four Family	1985	0.4%
Single Family	373121	73.3%
Three Family	11262	2.2%
Two Family	24447	4.8%

Here we can see how many observations there are for each “property type.”

We can see that majority of the observations are “Single Family” and “Condo.”
The least represented group is “Four Family” with only 1985 observations across all towns in Connecticut over the 14 years of the data.
This partially explains the apparent volatility in the mean sale price of “Four Family” properties displayed in (###### ADD VIS IDENTIFIER).
This will become more impactful as we continue the analysis. Once we start looking at individual towns there will be very limited data on any property types other than “single family.”
For a full exploration of property types counts see appendix Part C

Delta mean: Counties

connecticut %>%
  group_by(county, 
           property_type, 
           sale_year) %>%
  summarize(mean_sale_price = mean(sale_price)) %>%
  select(county,
         property_type,
         sale_year, 
         mean_sale_price) %>%
  filter(sale_year == "2008" | 
         sale_year =="2020") %>%
  filter(property_type == "Single Family") %>%
  pivot_wider(names_from = sale_year, 
              values_from = mean_sale_price) %>%
  mutate(delta_mean = (`2020` - `2008`)) %>%
  mutate(delta_mean_percent = (delta_mean/`2008`)) %>%
  inner_join(con_map, 
             by = "county") %>%
  ggplot(aes(x = long, 
           y = lat)) +
  geom_polygon(aes(fill = delta_mean_percent, 
                   group = county)) +
  geom_text(data = county_names, 
            aes(long, 
                lat, 
                label = county), 
                size =3) +
  scale_fill_gradient(low = "bisque4", 
                      high = "bisque2", 
                      name = "Percent Change") +
  ggtitle("Percent Change in Single Family Sale Price in Connecticut Counties") +
  theme_light()

Here we can an idea of where geographically the mean sale price has increased the most since 2007.

connecticut %>%
  group_by(county, 
           property_type, 
           sale_year) %>%
  summarize(mean_sale_price = mean(sale_price)) %>%
  select(county,
         property_type,
         sale_year, 
         mean_sale_price) %>%
  filter(sale_year == "2008" | 
         sale_year =="2020") %>%
  filter(property_type == "Single Family") %>%
  pivot_wider(names_from = sale_year, 
              values_from = mean_sale_price) %>%
  mutate(delta_mean_ID = (`2020` - `2008`)) %>%
  mutate(delta_mean_percent = percent(delta_mean_ID/`2008`)) %>%
  mutate(across(contains("ID"), round, 2)) %>%
  arrange(desc(delta_mean_percent)) %>%
  rename("County" = "county",
         "Mean 2008" = "2008",
         "Mean 2020" = "2020",
         "Change in Mean ($)" = "delta_mean_ID",
         "Change in Mean (%)" = "delta_mean_percent") %>%
  mutate(County = str_to_title(County)) %>%
  formattable(align = c("l", 
                        "c", 
                        "r"),
              list(`Mean 2020` = FALSE,
                   `Mean 2008` = FALSE,
                   `property_type` = FALSE,
                   `Change in Mean (%)` = formatter("span", 
                 style = ~ formattable::style(color = ifelse(`Change in Mean (%)` < 0, 
                                                "firebrick", 
                                                "forestgreen")),
                 ~ icontext(sapply(`Change in Mean (%)`, 
                                   function(x) if (x < 0) "arrow-down" else if (x > 0) "arrow-up"), 
                            `Change in Mean (%)`))))

County	Change in Mean ($)	Change in Mean (%)
Hartford	13627.15	5%
Litchfield	64267.71	22%
Fairfield	64666.65	13%
Windham	25014.99	12%
New Haven	-219.55	0%
New London	397.25	0%
Middlesex	-13839.78	-4%
Tolland	-6941.51	-3%

The “delta_mean” column here is obtained by subtracting the mean of “Single Family” properties sold in 2008 from the mean of “Single Family” properties in 2020.
The “c_mean” column is obtained by taking “delta_mean” and dividing by the mean of “Single Family” properties sold in 2008 and multipying by 100.

connecticut %>%
  group_by(county, 
           property_type, 
           sale_year) %>%
  summarize(mean_sale_price = mean(sale_price)) %>%
  select(county,
         property_type,
         sale_year, 
         mean_sale_price) %>%
  filter(sale_year == "2008" | sale_year =="2020") %>%
  filter(property_type == "Single Family") %>%
  pivot_wider(names_from = sale_year, values_from = mean_sale_price) %>%
  mutate(delta_mean_ID = (`2020` - `2008`)) %>%
  mutate(delta_mean_percent = (delta_mean_ID/`2008`*100)) %>%
  mutate(across(contains("percent"), round, 2)) %>%
  mutate(Color = ifelse(delta_mean_percent < 0, "rosybrown","steelblue")) %>%
  ggplot(aes(x = reorder(county, delta_mean_percent), 
             y = delta_mean_percent, 
             fill = Color)) +
  geom_col()+
  geom_text(aes(label = delta_mean_percent, 
                vjust = 1.2),
                size = 3) +
  labs(title = "Percentage Change in Mean Sale Price for Single Family Properties: 2008 to 2020",
                x = "County Name",
                y = "Percent Change") +
  theme(axis.text.x = element_text(angle = 30, 
                                   hjust = 1)) +
  scale_fill_identity(guide = "none")

The above two graphics display the change in mean sale price of “Single Family” properties from 2007 to 2020 in the 8 Connecticut counties.

These graphics demonstrate that the average sale price of “Single Family” properties has increased the most in Litchfield County (21.85%), followed by Fairfield County (12.72%) and Windham (12.14%). ###ADD HYPOTHESIS, populations, rural vs urbab…

Create line graph of the mean sale price of single family properties in different counties

connecticut %>%
  group_by(county, 
           property_type, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_year = mean(sale_price)) %>%
  filter(property_type == "Single Family") %>%
  ggplot(aes(x = year, 
             y = mean_year)) +
  geom_line(aes(color = county)) +
  ggtitle("Mean Sale Price of Single Family Properties: Connecticut Counties") +
  ylab("Sale Price") +
  xlab("Year") +
  theme_bw() +
  facet_wrap(vars(county)) +
  guides(color = "none")

These graphics display the behavior of the mean sale price of single family properties in the eight Connecticut counties over the ~15 years of the data.
We can see clearly that property values are the highest in Fairfield County, but we can also see that property values are
increasing in a number of Connecticut counties, this increase appears most pronounced in Fairfield and Litchfield. Windham County, which shows a 12.14% increase appears graphically to have started increasing prior to 2019, and has been increasing more steadily.

Next, we will shift to focus to potential drivers of sale price

What is driving price?

Number of Properties Sold

connecticut %>%
  group_by(month = floor_date(sale_date, 
                              "month"), 
           property_type) %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month)) %>%
  ggplot(aes(x = count, 
             y = mean_price, 
             color = property_type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Number of Properties Sold (Month) and Average Sale Price",
          subtitle = "Grouped by Property Type") +
  xlab("Number of Properties Sold (Month)") +
  ylab("Average Sale Price") +
  facet_wrap(vars(property_type), 
             scales = "free") +
  guides(color = "none")

Here we are looking the relationship between the number of properties sold and how it impacts sales price. Number of properties sold is the x variable and mean sale price is the y variable. Here we are aggregating by month.
This graphic demonstrates that there does appear to be a least a slight positive linear relationship between the number of properties sold in a given month and the average sale price.

Next we will take a closer look at this relationship as it relates to Single Family properties.

connecticut %>%
  group_by(property_type, 
           month = floor_date(sale_date, 
                              "month")) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean = mean(sale_price),
            properties_sold = count(month)) %>%
  mutate(previous_mean = lag(mean)) %>%
  mutate(previous_month = lag(month)) %>%
  mutate(Direction = ifelse(properties_sold < lag(properties_sold), 
                            "Decreasing Sales",
                            "Increasing Sales")) %>%
  mutate(Direction2 = ifelse(mean < lag(mean), 
                             "Decreasing Price",
                             "Increasing Price")) %>% 
  ggplot(aes(month, mean, 
              xend = previous_month,
              yend = previous_mean)) +
  geom_segment(aes(color = Direction2)) +
  scale_color_manual("Direction of Change", 
                     values = c("Decreasing Price" = "rosybrown", 
                                "Increasing Price" = "steelblue")) +
  geom_col(aes(x = month, 
               y = properties_sold*50, 
               fill = Direction)) +
  scale_fill_manual("Direction of Change", 
                    values = c("Decreasing Sales" = "rosybrown", 
                               "Increasing Sales" = "steelblue")) +
  ggtitle("How Volume Impacts Price Movement")

  xlab("Time Grouped by Month") +
  scale_y_continuous(name = "Average Sale Price ($)", 
                     sec.axis = sec_axis(~./50, 
                     name = "Number of Properties Sold"))

NULL

Here we can see the relationship graphed differently. The line and bars are blue when the value is higher than the previous value and they are red when the value is lower than the previous value. This graphic demonstrates that as more properties are sold, price tends to go up. More properties sold can indicate increased demand.
Additionally, if you look closely at this graphic you may notice that there are approximately the same amount of the peeks and troughs are there are years in the data set. This indicate an element seasonality, where prices go up and down cyclically with the seasons. We look at seasonality as potential price driver later.
Next we will look at the relationship statistically.

options(scipen = 999) # remove scientific notation

connecticut %>%
  group_by(month = floor_date(sale_date, "month"), property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month)) %>%
  ggplot(aes(x = count, y = mean_price)) +
      geom_point() +
  geom_smooth(method = "lm")

connecticut %>%
  group_by(month = floor_date(sale_date, "month"), 
           property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month), 
            na.rm = TRUE) %>%
  ggscatterstats(x = count, 
                 y = mean_price) +
  ggtitle("Relationship Between Count and Sale Price") +
  xlab("Number of Properties Sold") +
  ylab("Average Price ($)") +
  theme(axis.text.x = element_text(angle = 30, 
                                   hjust = 1))

Here we see the relationship displayed numerically as a correlation coefficient. The observed correlation coefficient is 0.46 (on a -1 to 1 scale). We are 95% confident that the true correlation coefficient is between (0.33 and 0.57). The p-value (which is essentially zero), reveals the probability of observing a sample like this or more extreme if there was no relationship between price and the number of properties sold.

Next we run a simple linear regression.

lm1 <- connecticut %>%
  group_by(month = floor_date(sale_date, 
                              "month"), 
           property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month))
  
lm2 <-lm(mean_price ~ count, data = lm1)

summary(lm2)


Call:
lm(formula = mean_price ~ count, data = lm1)

Residuals:
   Min     1Q Median     3Q    Max 
-42584 -16779  -5474  14385  74928 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 271155.647   6334.772  42.804 < 0.0000000000000002 ***
count           17.492      2.669   6.555       0.000000000701 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24750 on 163 degrees of freedom
Multiple R-squared:  0.2086,    Adjusted R-squared:  0.2037 
F-statistic: 42.96 on 1 and 163 DF,  p-value: 0.0000000007011

See appendix Part D for conditions check of the linear model.

Looking at the coefficient of determination (Multiple R-squared), we can see that model does not fit particularly well. The coefficient of determination tells us the proportion of the variability in the response variable that is explained by the explanatory variable.

We can conclude that the number properties sold does impact price, but it does not determine price. There are other factors involved.

Next we will explore another potential driver, current price.

Current Price

connecticut %>%
  group_by(month = floor_date(sale_date, "month")) %>%
  summarise(mean = mean(sale_price)) %>%
  mutate(previous = lag(mean),
         change = (mean - previous),
         percent_change = (change/mean) * 100) %>%
  ggplot(aes(x = mean, y = change)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_tq()

Seasonality

c <- connecticut %>%
  group_by(month = floor_date(sale_date,
                              "month"),
           year = floor_date(sale_date, 
                            "year")) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean = mean(sale_price)) %>%
  mutate(Month2 = tsibble::yearmonth(month)) %>%
  ungroup(month)

  c <- select(c,
         "Month2", 
         "mean")
  c <- as_tsibble(c)

  gg_season(c)

Population

# census_api_key("a2e56978408ace3c130f93ace8eb74b48564a9f4", install = TRUE)
# readRenviron("~/.Renviron")
# Sys.getenv("CENSUS_API_KEY")
get_estimates(geography = "county",
              product = "population",
              state = "CT",
              time_series = TRUE,
              year = 2019)

# A tibble: 192 × 5
   NAME                           DATE GEOID variable  value
   <chr>                         <dbl> <chr> <chr>     <dbl>
 1 New Haven County, Connecticut     1 09009 POP      862477
 2 New Haven County, Connecticut     2 09009 POP      862442
 3 New Haven County, Connecticut     3 09009 POP      863357
 4 New Haven County, Connecticut     4 09009 POP      863871
 5 New Haven County, Connecticut     5 09009 POP      864566
 6 New Haven County, Connecticut     6 09009 POP      862820
 7 New Haven County, Connecticut     7 09009 POP      862885
 8 New Haven County, Connecticut     8 09009 POP      860186
 9 New Haven County, Connecticut     9 09009 POP      857901
10 New Haven County, Connecticut    10 09009 POP      857748
# … with 182 more rows

Litchfield County

Delta mean

Create graphic of mean yearly sale price for just Litchfield County, group by property type

connecticut %>%
  filter(county == "litchfield") %>%
  group_by(property_type, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_litchfield = mean(sale_price)) %>%
  ggplot(aes(x = year, 
             y = mean_litchfield)) +
  geom_line(aes(color = property_type)) +
  geom_smooth() +
  theme_bw() +
  labs(title = "Mean Sale Price of Different Property Types",
       subtitle = "Litchfield County 2008-2020", 
       x = "Year", 
       y = "Mean Sale Price ($)") +
  facet_wrap(vars(property_type)) +
  guides(color = "none")

This graphic displays the behavior of the mean sale price of the different “Property Types” in Litchfield County from 2008 to 2020.

Insights:
- Similar to the statewide data it appears that multi unit properties (i.e., Four-Family, Three Family, etc.) are increasing at a faster rate than “Single Family” properties.
- “Four Family” properties appear to be the most volatile.
- We have to keep in mind what we observed earlier about our sample size with regards to multi unit properties, they constitute our smallest sample.

Next we examine the counts of property types in Litchfield County.

Counts

Display counts of different property types over the 12 years 2008-2020

Litchfield <- connecticut %>%
  group_by(sale_year) %>%
  filter(county == "litchfield")

Litchfield %>%
  group_by(sale_year, property_type) %>%
  ggplot(aes(x = sale_year, fill = property_type)) +
  geom_bar(aes(stat="identity")) + 
  facet_wrap(vars(property_type)) +
  labs(title = "Total Number of Properties Sold Each Year in Litchfield: 2008-2020",
       subtitle = "Displayed by Property Type",
       x = "Year 2008-2020",
       y = "Numnber of Properties Sold") +
  theme(legend.position = "none")

Here we can see clearly that our data set is primarily made up of “Single Family” properties and “Condos”.

Examine counts of property type in Litchfield County

Litchfield %>%
  group_by(property_type) %>%
  count(property_type) %>%
  rename("Property Type" = "property_type", 
         "Properties Sold" = "n") %>%
  knitr::kable(caption = "Counts of Property Types Sold: Litchfield County") %>%
  kable_classic(full_width = T) %>%
  row_spec(2, background = "yellow") %>%
  row_spec(4, background = "yellow")

Table 1: Counts of Property Types Sold: Litchfield County
Property Type	Properties Sold
Condo	4005
Four Family	64
Single Family	24969
Three Family	363
Two Family	1305

Here we can see a partial explanation for the volatility of the mean sale price of “Four Family” properties; with only 64 observations over 14 years it seems hard to rely on the data. We also see that “Three Family” is not well represented over the time period.

Looking a little deeper at “Four” and “Three Family” properties

Litchfield %>%
  filter(property_type == "Four Family" | property_type == "Three Family") %>%
  group_by(sale_year) %>%
  count(property_type)  %>%
  pivot_wider(names_from = property_type, values_from = n) %>%
  rename("Sale Year" = "sale_year") %>%
  kable(align = c("l", "c", "r"), 
               caption = "Counts of Four Family Properties Sold Each Year: Litchfield County") %>%
  kable_classic(full_width = T) %>%
  column_spec(1, bold = T) %>%
  row_spec(0, bold = T) %>%
  row_spec(5:7, background = "yellow") %>%
  row_spec(14, background =  "yellow")

Table 2: Counts of Four Family Properties Sold Each Year: Litchfield County
Sale Year	Four Family	Three Family
2007	3	40
2008	3	21
2009	2	28
2010	4	25
2011	1	21
2012	3	18
2013	3	16
2014	5	31
2015	3	29
2016	5	31
2017	4	23
2018	10	30
2019	10	40
2020	8	10

Here I finally got use the pivot function, I didn’t notice any opportunities before this.

I have highlighted 2011-2013 and 2020 where there are very limited observations for “Three” and “Four Family” properties. While the counts do increase in some of the years the data does not appear reliable at even the county level. As a result of the limitations of the data examining different “property types” beyond the county level will not be revealing or possible.

As we examine the towns in Litchfield County, we will only be looking a “Single Family” properties.

Towns: Litchfield County

Filter Litchfield County Object for only “Single Family” properties

Litchfield_SF <- Litchfield %>%
  filter(property_type == "Single Family")

As a result of the data being limited for “property types” other than “Single Family” I will filter for “Single Family” before I look at the specific towns in Litchfield County.

Display the mean sale price of “Single Family” properties in Litchfeild County towns: 2008-2020

Litchfield %>%
  group_by(town, year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(meanL = mean(sale_price)) %>%
  ggplot(aes(x = year, y = meanL)) +
  geom_line(aes(color = town)) +
  labs(title = "Mean Sale Price: 2008 to 2020",
       subtitle = "Towns in Litchfield County, Connecticut",
       x = "County Name",
       y = "Sale Price",
       caption = "**Data missing for Torrington 2020") +
  theme_bw() +
  facet_wrap(vars(town)) +
  guides(color = "none")

Create object to analyze the delta mean sale price of Litchfield County towns

L08 <- Litchfield_SF %>%
  group_by(town, sale_year) %>%
  summarise(mean_L08 = mean(sale_price)) %>%
  filter(sale_year == "2008")

L20 <- Litchfield_SF %>%
  group_by(town, sale_year) %>%
  summarise(mean_L20 = mean(sale_price)) %>%
  filter(sale_year == "2020")

L_combined <- L08 %>%
  left_join(L20, by = "town")

L_combined <- L_combined %>%
  mutate(L_delta_mean = mean_L20 - mean_L08) %>%
  mutate(L_c_mean = (mean_L20 - mean_L08)/mean_L08) %>%
  mutate(L_c_mean = L_c_mean * 100) %>%
  mutate(across(contains("mean"), round, 2))

Create bargraph of the delta mean of Litchfield County towns

L_combined %>% 
  mutate(Color = ifelse(L_c_mean < 0, "rosybrown","steelblue")) %>%
  ggplot(aes(x = reorder(town, +L_c_mean), 
             y = L_c_mean, 
             fill = Color)) +
  geom_col() +
  labs(title = "Percentage Change in Mean Sale Price: 2008 to 2020",
       subtitle = "Towns in Litchfield County, Connecticut",
       x = "County Name",
       y = "Percent Change",
       caption = "**Data missing for Torrington 2020") +
  theme(axis.text.x = element_text(angle = 45, 
                                            hjust = 1)) +
  scale_fill_identity(guide = "none")

This graphic displays the change in mean sale price in all the Litchfeild County towns from 2008 to 2020.

Some interesting takeaways:
- Torrington has no data for a change in mean indicating that either 2008 or 2020 are missing.
- Goshen and Cornwall show a 116.77% and 54.03%. respectively, increases of this magnitude seem potentially implausble.
- These items will be investigated further.
Other takeaways:
- While the aggregated data show that Litchfield property values have increased the most, this is not true for all of Litchfield.
- By town, it appears that approximately half of Litchfield is increasing and the other half is not.
- We could hypothesize that as a result of the pandemic people wanted to move away from densely populated areas and as a result Litchfield County became more desirable. We could further postulate that the towns that have not seen an increase in mean sale price could be lagging behind and may in the future.
- We could also speculate that the higher property values in Fairfield County have caused people to look elsewhere to buy properties.
- This is all highly speculative and without further analysis and/or information nothing definitively actionable can be ascertained.

Change in mean Torrington

Litchfield_SF %>%
  filter(town == "Torrington") %>%
  select(sale_year, sale_price) %>%
  group_by(sale_year) %>%
  summarise(mean = mean(sale_price)) %>%
  pivot_wider(names_from = sale_year, values_from = mean) %>%
  mutate("Delta Mean" = (`2019`-`2007`)) %>%
  pivot_longer(0:14, names_to = "Sale_Year", values_to = "Mean_Sale_Price") %>%
  mutate(Color = ifelse(Mean_Sale_Price < 0, "rosybrown","steelblue")) %>%
  ggplot(aes(x = Sale_Year, y = Mean_Sale_Price, fill = Color)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, 
                                            hjust = 1)) +
  scale_fill_identity(guide = "none") +
  labs(title = "Mean Price Torrington: 2008 to 2020",
       subtitle = "Litchfield County, Connecticut",
       x = "Sale Year",
       y = "Mean Sale Price",
       caption = "**Delta Mean = 2019 - 2007**")

This graphic shows the behavior of the mean sale price of Torrington. Torrington did not have 2020 data so it was not included in the grouped bar chart of Litchfield County towns. The “Delta Mean” bar at the right end is calculated by subtracting the mean of 2008 from the mean of 2020. As we can see the mean sale price of Torrington “Single Family” properties has decreased.

Now we will take a look Goshen. A hundred percent increase in mean sale price is a little remarkable. We will analyze to see if there is an explanation in the data.

Litchfield_SF %>%
  filter(town == "Goshen") %>%
  group_by(sale_year) %>%
  summarize(goshen_mean = mean(sale_price),
            properties_sold = count(sale_year))

# A tibble: 14 × 3
   sale_year goshen_mean properties_sold
       <dbl>       <dbl>           <int>
 1      2007     405302.              45
 2      2008     219587.               9
 3      2009     360256.              45
 4      2010     303607               50
 5      2011     391851.              58
 6      2012     305958.              24
 7      2013     456833.              15
 8      2014     350857.              56
 9      2015     340886.              71
10      2016     361325.              75
11      2017     333381.              52
12      2018     333617.              12
13      2019     337689.              80
14      2020     475989.              64

  Litchfield_SF %>%
  filter(town == "Goshen") %>%
  group_by(sale_year) %>%
  summarize(goshen_mean = mean(sale_price),
            properties_sold = count(sale_year)) %>%
  ggplot(aes(x = sale_year)) +
  geom_line(aes(x = sale_year, y = goshen_mean)) +
  geom_col(aes(x = sale_year, y = properties_sold*2500)) +
  scale_x_continuous(breaks = seq(2007, 2020, 1)) +
  scale_y_continuous(name = "Goshen Mean", sec.axis = sec_axis(~./2500, name = "properties sold"))

Where I will go from here

What is missing from your final project?

I think my project is still kind of unorganized. I need to move some things to the appendix, remove others, and move some things around. I also need to clean up and recode some of the graphics. I also need to write out explanations of the decisions I made and the graphics. Generally, I need to work on the narration of the project. I also missing error bars.

What do I hope to accomplish

I hope to accomplish everything above and continue the analysis for Fairfield County. I also to refine and/or polish the graphics and tables.

#ts_object <- connecticut %>%
  #filter(property_type == "Single Family") %>%
  #group_by(month = lubridate::floor_date(sale_date, "month")) %>%
  #summarise(mean = mean(sale_price)) 
#ts_object_2 <- xts(ts_object$mean, ts_object$month)

plot(ts_object_2)

pred_12 <- auto.arima(ts_object_2)

pred_12

forecast_1 <- forecast(pred_12, 12)

###plot(forcast_1)

forecast_1

accuracy(pred_12)

qqnorm(pred_12$residuals)

fit <-stl(pred_12, s.window = “period”) plot(fit)

Comment on this article Share:

Homework 6