Homework 6

Connecticut Real Estate Sales Data

Joseph Farrell
2022-04-27

Load Libraries

Introduction

The following is an analysis of Connecticut real estate sales data. The data set used in this analysis contains 930,621 observations (transactions) and contains data from 2005-2020. There are 14 variables in the original data set. Not all observations and variables will be pertinent to answering the research questions and will be subsequently filtered out. The research questions this analysis will attempt to answer or at least provide insights into answering are:

Read in Data

I will read in two data sets that will be joined and used for the analysis.

connecticut <- read_csv("/Users/nelsonfarrell/Downloads/Real_Estate_Sales_2001-2019_GL.csv")
county_town <- read_csv(("/Users/nelsonfarrell/Documents/501 Stats/Connecticut County:Towns.csv"))
county_town <- county_town %>%
  select("subregion", 
         "town")

Data Cleaning

Rename “Town” variable “town” to join datasets
connecticut <- rename(connecticut, "town" = "Town")

Here I am changing the name of the “Town” variable in the Connecticut data set so it will match the “town” variable in the county_town data set.

Join datasets
connecticut <- 
  left_join(connecticut, county_town, by = "town")

Here am joining the data sets, this will be useful when grouping in the analysis.

Remove unnecassary columns
connecticut <- select(connecticut, 
                      "List Year", 
                      "Date Recorded", 
                      "town",
                      "Sale Amount",
                      "Residential Type",
                      "subregion")

Here I have removed the columns “Serial number”, “Non Use Code”, “Assessor Remarks”, “Sales Ratio”, “Property Type”, “Location”, “Assessed Value”, and “OPM Remarks”

Remove NAs from “Residential Type”
connecticut <- connecticut %>%
  na.omit(`Residential Type`)

The NAs removed here are simply nonresidential properties, which will not be part of the analysis.

Rename columns without spaces and for clarity
connecticut <- rename(connecticut, 
                      "list_year" = "List Year", 
                      "sale_date" = "Date Recorded", 
                      "sale_price" = "Sale Amount",
                      "property_type" = "Residential Type", 
                      "county" = "subregion")

In addition to removing spaces from the column names, I have renamed residential type “property type” as there are no longer non-residential properties, this naming scheme is more intuitive.

Convert “sale_date” from character to date and create new column “sale_year”
connecticut$sale_date <- as_date(connecticut$sale_date, 
                                 format = "%m/%d/%Y")
connecticut <- connecticut %>%
  mutate(sale_year = year(sale_date))
Remove all observations prior to 2007
connecticut <- connecticut %>%
  filter(sale_year > 2006)

The data prior to and including 2006 is very limited. As a result, I have removed all observations prior to 2007.

Remove top and bottom 2.5% of “sale_price”
connecticut <- connecticut %>%
  filter(sale_price < quantile(connecticut$sale_price, .975) & 
           sale_price > quantile(connecticut$sale_price, .025)) 

The data was heavily skewed to the right, with extreme outliers in the right direction which impacted the measure of central tendency (mean). After trimming the data it is still skewed to the right but not as heavily.

View data set following cleaning
str(connecticut)
tibble [509,278 × 7] (S3: tbl_df/tbl/data.frame)
 $ list_year    : num [1:509278] 2014 2014 2014 2014 2014 ...
 $ sale_date    : Date[1:509278], format: "2015-08-06" ...
 $ town         : chr [1:509278] "Stamford" "New Haven" "Ridgefield" "New Britain" ...
 $ sale_price   : num [1:509278] 850000 149900 570000 261000 250000 ...
 $ property_type: chr [1:509278] "Single Family" "Single Family" "Single Family" "Condo" ...
 $ county       : chr [1:509278] "fairfield" "new haven" "fairfield" "hartford" ...
 $ sale_year    : num [1:509278] 2015 2015 2015 2015 2015 ...
 - attr(*, "na.action")= 'omit' Named int [1:382446] 13 27 28 33 53 78 83 105 127 132 ...
  ..- attr(*, "names")= chr [1:382446] "13" "27" "28" "33" ...

After cleaning the data set we are left with seven variables (columns):

Now that we have data cleaned we take a look at some statewide data so we can get an idea of the general state of the real estate market in Connecticut.

Statewide Data

Line Graph: Statewide mean

connecticut %>%
  group_by(year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_year = mean(sale_price)) %>%
  ggplot(aes(x = year, y = mean_year)) +
  geom_line(aes()) +
  geom_smooth() +
  ggtitle("Overall Mean Sale Price of Connecticut Properties: 2007-2020") +
  ylab("Sale Price") +
  xlab("Year") +
  theme_update()

Here we can observe the behavior of the overall mean sale price of properties in Connecticut from 2007 to 2020. As expected the mean price decreased sharply following the 2008 financial crisis and appears to have bottomed out and stabilized around 2010. From 2010 to 2019 there does not appear to be a lot change in the overall mean sale price.

Map Visualization: Geom_polygon: Statewide Mean

Read in map data
con_map <- map_data("county", 
         "connecticut")

This data set contains latitude and longitude data that can be used to make a map visualization.

Rename “subregion” “county” to join “con_map” with “connecticut”
con_map <- rename(con_map,
                  "county" = "subregion")

The “subregion” in the “con_map” data set is Connecticut counties so I will rename the variable so I can join it with the sales data.

Create object “map_vis”: grouping by county and creating a “county_mean” column
map_vis <- connecticut %>%
  group_by(county) %>%
  summarize(county_mean = mean(sale_price))

This object contains the mean sale price of each county in Connecticut and will be joined the map data to display the mean sale price on a map.

Join “mapping” object with “con_map” data
map_vis <- inner_join(map_vis, 
                      con_map, 
                      by = "county")

This object will be used to create a map visualization displaying county mean sale price.

Create object to use when labeling the counties on the map
county_names <- aggregate(cbind(long, 
                                lat) ~ county, 
                          data=map_vis, 
                    FUN=function(x)mean(range(x)))

This creates an object that will center the name of the county inside that county on the map visualization.

Create map of Connecticut counties displaying the mean sale price of residential properties
ggplot(map_vis, 
       aes(x = long, 
           y = lat)) +
  geom_polygon(aes(fill = county_mean, 
                   group = county)) +
  geom_text(data = county_names, 
            aes(long, 
                lat, 
                label = county), 
                size =3) +
  scale_fill_gradient(low = "bisque4", 
                      high = "bisque2", 
                      name = "Average Sale Price (USD)") +
  ggtitle("Overall Average Sale Price in Connecticut Counties") +
  theme_light()

Here we can get a visualization of where in Connecticut (grouped by county) property values are the highest. Not surprisingly, the area closest to New York City appears to have the highest property value.

Visualizations: Grouped by Property Type

Scatterplot of the mean monthly “sale price” of each “property type”
mean_property_type <-connecticut %>%
  group_by(year = lubridate::floor_date(sale_date, "year"), 
           property_type) %>%
  summarize(mean = mean(sale_price))

mean_property_type <-ggplot(mean_property_type, 
              aes(year, 
                  mean, 
                  color = property_type)) +
  geom_line() +
  transition_reveal(year) +
  labs(title = "Mean Sale Price: 2007-2020",
       subtitle = "Grouped By Property Type",
       x = "Year",
       y = "Price ($)") +
  guides(color = guide_legend(title = "Property Type"))

 
animate(mean_property_type)

This graphic displays the mean monthly sale price of all properties in Connecticut grouped by the type of property. Interestingly, single family homes have the highest mean sale price. We can get a closer look below.

Scatterplot of yearly mean sale price for each “Property Type”
options(scipen = 999) # remove scientific notation
connecticut %>%
  group_by(property_type, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_year_type = mean(sale_price)) %>%
  ggplot(aes(x = year, 
             y = mean_year_type, 
             group = property_type)) +
  geom_point((aes(color = property_type))) +
  geom_smooth() +
  ggtitle("Yearly Mean Sale Price: Property Types") +
  ylab("Mean Sale Price ($)") +
  xlab("Year") +
  theme_linedraw() +
  facet_wrap(vars(property_type)) +
  guides(color = guide_legend(title = "Property Type"))

Counts: Property Types: Statewide & County

Here we will examine the counts of the different property types in our data set. We will also look at the counts in each specific county and later in each town.

View Counts of Property types
connecticut %>%
  count(property_type) %>%
  mutate("Proportion"= percent(n/sum(n))) %>%
  rename("Property Type" = "property_type") %>%
  rename("Number of Properties Sold" = "n") %>%
  kbl() %>%
  kable_material(c("striped", 
                   "hover"))
Property Type Number of Properties Sold Proportion
Condo 98463 19.3%
Four Family 1985 0.4%
Single Family 373121 73.3%
Three Family 11262 2.2%
Two Family 24447 4.8%

Here we can see how many observations there are for each “property type.”

Delta mean: Counties

connecticut %>%
  group_by(county, 
           property_type, 
           sale_year) %>%
  summarize(mean_sale_price = mean(sale_price)) %>%
  select(county,
         property_type,
         sale_year, 
         mean_sale_price) %>%
  filter(sale_year == "2008" | 
         sale_year =="2020") %>%
  filter(property_type == "Single Family") %>%
  pivot_wider(names_from = sale_year, 
              values_from = mean_sale_price) %>%
  mutate(delta_mean = (`2020` - `2008`)) %>%
  mutate(delta_mean_percent = (delta_mean/`2008`)) %>%
  inner_join(con_map, 
             by = "county") %>%
  ggplot(aes(x = long, 
           y = lat)) +
  geom_polygon(aes(fill = delta_mean_percent, 
                   group = county)) +
  geom_text(data = county_names, 
            aes(long, 
                lat, 
                label = county), 
                size =3) +
  scale_fill_gradient(low = "bisque4", 
                      high = "bisque2", 
                      name = "Percent Change") +
  ggtitle("Percent Change in Single Family Sale Price in Connecticut Counties") +
  theme_light()

connecticut %>%
  group_by(county, 
           property_type, 
           sale_year) %>%
  summarize(mean_sale_price = mean(sale_price)) %>%
  select(county,
         property_type,
         sale_year, 
         mean_sale_price) %>%
  filter(sale_year == "2008" | 
         sale_year =="2020") %>%
  filter(property_type == "Single Family") %>%
  pivot_wider(names_from = sale_year, 
              values_from = mean_sale_price) %>%
  mutate(delta_mean_ID = (`2020` - `2008`)) %>%
  mutate(delta_mean_percent = percent(delta_mean_ID/`2008`)) %>%
  mutate(across(contains("ID"), round, 2)) %>%
  arrange(desc(delta_mean_percent)) %>%
  rename("County" = "county",
         "Mean 2008" = "2008",
         "Mean 2020" = "2020",
         "Change in Mean ($)" = "delta_mean_ID",
         "Change in Mean (%)" = "delta_mean_percent") %>%
  mutate(County = str_to_title(County)) %>%
  formattable(align = c("l", 
                        "c", 
                        "r"),
              list(`Mean 2020` = FALSE,
                   `Mean 2008` = FALSE,
                   `property_type` = FALSE,
                   `Change in Mean (%)` = formatter("span", 
                 style = ~ formattable::style(color = ifelse(`Change in Mean (%)` < 0, 
                                                "firebrick", 
                                                "forestgreen")),
                 ~ icontext(sapply(`Change in Mean (%)`, 
                                   function(x) if (x < 0) "arrow-down" else if (x > 0) "arrow-up"), 
                            `Change in Mean (%)`))))
County Change in Mean ($) Change in Mean (%)
Hartford 13627.15 5%
Litchfield 64267.71 22%
Fairfield 64666.65 13%
Windham 25014.99 12%
New Haven -219.55 0%
New London 397.25 0%
Middlesex -13839.78 -4%
Tolland -6941.51 -3%
connecticut %>%
  group_by(county, 
           property_type, 
           sale_year) %>%
  summarize(mean_sale_price = mean(sale_price)) %>%
  select(county,
         property_type,
         sale_year, 
         mean_sale_price) %>%
  filter(sale_year == "2008" | sale_year =="2020") %>%
  filter(property_type == "Single Family") %>%
  pivot_wider(names_from = sale_year, values_from = mean_sale_price) %>%
  mutate(delta_mean_ID = (`2020` - `2008`)) %>%
  mutate(delta_mean_percent = (delta_mean_ID/`2008`*100)) %>%
  mutate(across(contains("percent"), round, 2)) %>%
  mutate(Color = ifelse(delta_mean_percent < 0, "rosybrown","steelblue")) %>%
  ggplot(aes(x = reorder(county, delta_mean_percent), 
             y = delta_mean_percent, 
             fill = Color)) +
  geom_col()+
  geom_text(aes(label = delta_mean_percent, 
                vjust = 1.2),
                size = 3) +
  labs(title = "Percentage Change in Mean Sale Price for Single Family Properties: 2008 to 2020",
                x = "County Name",
                y = "Percent Change") +
  theme(axis.text.x = element_text(angle = 30, 
                                   hjust = 1)) +
  scale_fill_identity(guide = "none")

The above two graphics display the change in mean sale price of “Single Family” properties from 2007 to 2020 in the 8 Connecticut counties.

Create line graph of the mean sale price of single family properties in different counties
connecticut %>%
  group_by(county, 
           property_type, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_year = mean(sale_price)) %>%
  filter(property_type == "Single Family") %>%
  ggplot(aes(x = year, 
             y = mean_year)) +
  geom_line(aes(color = county)) +
  ggtitle("Mean Sale Price of Single Family Properties: Connecticut Counties") +
  ylab("Sale Price") +
  xlab("Year") +
  theme_bw() +
  facet_wrap(vars(county)) +
  guides(color = "none")

Next, we will shift to focus to potential drivers of sale price

What is driving price?

Number of Properties Sold

connecticut %>%
  group_by(month = floor_date(sale_date, 
                              "month"), 
           property_type) %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month)) %>%
  ggplot(aes(x = count, 
             y = mean_price, 
             color = property_type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  ggtitle("Number of Properties Sold (Month) and Average Sale Price",
          subtitle = "Grouped by Property Type") +
  xlab("Number of Properties Sold (Month)") +
  ylab("Average Sale Price") +
  facet_wrap(vars(property_type), 
             scales = "free") +
  guides(color = "none")

Next we will take a closer look at this relationship as it relates to Single Family properties.

connecticut %>%
  group_by(property_type, 
           month = floor_date(sale_date, 
                              "month")) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean = mean(sale_price),
            properties_sold = count(month)) %>%
  mutate(previous_mean = lag(mean)) %>%
  mutate(previous_month = lag(month)) %>%
  mutate(Direction = ifelse(properties_sold < lag(properties_sold), 
                            "Decreasing Sales",
                            "Increasing Sales")) %>%
  mutate(Direction2 = ifelse(mean < lag(mean), 
                             "Decreasing Price",
                             "Increasing Price")) %>% 
  ggplot(aes(month, mean, 
              xend = previous_month,
              yend = previous_mean)) +
  geom_segment(aes(color = Direction2)) +
  scale_color_manual("Direction of Change", 
                     values = c("Decreasing Price" = "rosybrown", 
                                "Increasing Price" = "steelblue")) +
  geom_col(aes(x = month, 
               y = properties_sold*50, 
               fill = Direction)) +
  scale_fill_manual("Direction of Change", 
                    values = c("Decreasing Sales" = "rosybrown", 
                               "Increasing Sales" = "steelblue")) +
  ggtitle("How Volume Impacts Price Movement")
  xlab("Time Grouped by Month") +
  scale_y_continuous(name = "Average Sale Price ($)", 
                     sec.axis = sec_axis(~./50, 
                     name = "Number of Properties Sold"))
NULL
options(scipen = 999) # remove scientific notation

connecticut %>%
  group_by(month = floor_date(sale_date, "month"), property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month)) %>%
  ggplot(aes(x = count, y = mean_price)) +
      geom_point() +
  geom_smooth(method = "lm")
connecticut %>%
  group_by(month = floor_date(sale_date, "month"), 
           property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month), 
            na.rm = TRUE) %>%
  ggscatterstats(x = count, 
                 y = mean_price) +
  ggtitle("Relationship Between Count and Sale Price") +
  xlab("Number of Properties Sold") +
  ylab("Average Price ($)") +
  theme(axis.text.x = element_text(angle = 30, 
                                   hjust = 1))

Next we run a simple linear regression.

lm1 <- connecticut %>%
  group_by(month = floor_date(sale_date, 
                              "month"), 
           property_type) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean_price = mean(sale_price), 
            count = count(month))
  
lm2 <-lm(mean_price ~ count, data = lm1)

summary(lm2)

Call:
lm(formula = mean_price ~ count, data = lm1)

Residuals:
   Min     1Q Median     3Q    Max 
-42584 -16779  -5474  14385  74928 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 271155.647   6334.772  42.804 < 0.0000000000000002 ***
count           17.492      2.669   6.555       0.000000000701 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24750 on 163 degrees of freedom
Multiple R-squared:  0.2086,    Adjusted R-squared:  0.2037 
F-statistic: 42.96 on 1 and 163 DF,  p-value: 0.0000000007011

See appendix Part D for conditions check of the linear model.

We can conclude that the number properties sold does impact price, but it does not determine price. There are other factors involved.

Next we will explore another potential driver, current price.

Current Price

connecticut %>%
  group_by(month = floor_date(sale_date, "month")) %>%
  summarise(mean = mean(sale_price)) %>%
  mutate(previous = lag(mean),
         change = (mean - previous),
         percent_change = (change/mean) * 100) %>%
  ggplot(aes(x = mean, y = change)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_tq()

Seasonality

c <- connecticut %>%
  group_by(month = floor_date(sale_date,
                              "month"),
           year = floor_date(sale_date, 
                            "year")) %>%
  filter(property_type == "Single Family") %>%
  summarize(mean = mean(sale_price)) %>%
  mutate(Month2 = tsibble::yearmonth(month)) %>%
  ungroup(month)

  c <- select(c,
         "Month2", 
         "mean")
  c <- as_tsibble(c)

  gg_season(c)

Population

# census_api_key("a2e56978408ace3c130f93ace8eb74b48564a9f4", install = TRUE)
# readRenviron("~/.Renviron")
# Sys.getenv("CENSUS_API_KEY")
get_estimates(geography = "county",
              product = "population",
              state = "CT",
              time_series = TRUE,
              year = 2019)
# A tibble: 192 × 5
   NAME                           DATE GEOID variable  value
   <chr>                         <dbl> <chr> <chr>     <dbl>
 1 New Haven County, Connecticut     1 09009 POP      862477
 2 New Haven County, Connecticut     2 09009 POP      862442
 3 New Haven County, Connecticut     3 09009 POP      863357
 4 New Haven County, Connecticut     4 09009 POP      863871
 5 New Haven County, Connecticut     5 09009 POP      864566
 6 New Haven County, Connecticut     6 09009 POP      862820
 7 New Haven County, Connecticut     7 09009 POP      862885
 8 New Haven County, Connecticut     8 09009 POP      860186
 9 New Haven County, Connecticut     9 09009 POP      857901
10 New Haven County, Connecticut    10 09009 POP      857748
# … with 182 more rows

Litchfield County

Delta mean

Create graphic of mean yearly sale price for just Litchfield County, group by property type
connecticut %>%
  filter(county == "litchfield") %>%
  group_by(property_type, 
           year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(mean_litchfield = mean(sale_price)) %>%
  ggplot(aes(x = year, 
             y = mean_litchfield)) +
  geom_line(aes(color = property_type)) +
  geom_smooth() +
  theme_bw() +
  labs(title = "Mean Sale Price of Different Property Types",
       subtitle = "Litchfield County 2008-2020", 
       x = "Year", 
       y = "Mean Sale Price ($)") +
  facet_wrap(vars(property_type)) +
  guides(color = "none")

This graphic displays the behavior of the mean sale price of the different “Property Types” in Litchfield County from 2008 to 2020.

Next we examine the counts of property types in Litchfield County.

Counts

Display counts of different property types over the 12 years 2008-2020
Litchfield <- connecticut %>%
  group_by(sale_year) %>%
  filter(county == "litchfield")

Litchfield %>%
  group_by(sale_year, property_type) %>%
  ggplot(aes(x = sale_year, fill = property_type)) +
  geom_bar(aes(stat="identity")) + 
  facet_wrap(vars(property_type)) +
  labs(title = "Total Number of Properties Sold Each Year in Litchfield: 2008-2020",
       subtitle = "Displayed by Property Type",
       x = "Year 2008-2020",
       y = "Numnber of Properties Sold") +
  theme(legend.position = "none")

Here we can see clearly that our data set is primarily made up of “Single Family” properties and “Condos”.

Examine counts of property type in Litchfield County
Litchfield %>%
  group_by(property_type) %>%
  count(property_type) %>%
  rename("Property Type" = "property_type", 
         "Properties Sold" = "n") %>%
  knitr::kable(caption = "Counts of Property Types Sold: Litchfield County") %>%
  kable_classic(full_width = T) %>%
  row_spec(2, background = "yellow") %>%
  row_spec(4, background = "yellow") 
Table 1: Counts of Property Types Sold: Litchfield County
Property Type Properties Sold
Condo 4005
Four Family 64
Single Family 24969
Three Family 363
Two Family 1305

Here we can see a partial explanation for the volatility of the mean sale price of “Four Family” properties; with only 64 observations over 14 years it seems hard to rely on the data. We also see that “Three Family” is not well represented over the time period.

Looking a little deeper at “Four” and “Three Family” properties
Litchfield %>%
  filter(property_type == "Four Family" | property_type == "Three Family") %>%
  group_by(sale_year) %>%
  count(property_type)  %>%
  pivot_wider(names_from = property_type, values_from = n) %>%
  rename("Sale Year" = "sale_year") %>%
  kable(align = c("l", "c", "r"), 
               caption = "Counts of Four Family Properties Sold Each Year: Litchfield County") %>%
  kable_classic(full_width = T) %>%
  column_spec(1, bold = T) %>%
  row_spec(0, bold = T) %>%
  row_spec(5:7, background = "yellow") %>%
  row_spec(14, background =  "yellow")
Table 2: Counts of Four Family Properties Sold Each Year: Litchfield County
Sale Year Four Family Three Family
2007 3 40
2008 3 21
2009 2 28
2010 4 25
2011 1 21
2012 3 18
2013 3 16
2014 5 31
2015 3 29
2016 5 31
2017 4 23
2018 10 30
2019 10 40
2020 8 10

Here I finally got use the pivot function, I didn’t notice any opportunities before this.

I have highlighted 2011-2013 and 2020 where there are very limited observations for “Three” and “Four Family” properties. While the counts do increase in some of the years the data does not appear reliable at even the county level. As a result of the limitations of the data examining different “property types” beyond the county level will not be revealing or possible.

As we examine the towns in Litchfield County, we will only be looking a “Single Family” properties.

Towns: Litchfield County

Filter Litchfield County Object for only “Single Family” properties
Litchfield_SF <- Litchfield %>%
  filter(property_type == "Single Family")

As a result of the data being limited for “property types” other than “Single Family” I will filter for “Single Family” before I look at the specific towns in Litchfield County.

Display the mean sale price of “Single Family” properties in Litchfeild County towns: 2008-2020
Litchfield %>%
  group_by(town, year = lubridate::floor_date(sale_date, "year")) %>%
  summarize(meanL = mean(sale_price)) %>%
  ggplot(aes(x = year, y = meanL)) +
  geom_line(aes(color = town)) +
  labs(title = "Mean Sale Price: 2008 to 2020",
       subtitle = "Towns in Litchfield County, Connecticut",
       x = "County Name",
       y = "Sale Price",
       caption = "**Data missing for Torrington 2020") +
  theme_bw() +
  facet_wrap(vars(town)) +
  guides(color = "none")

Create object to analyze the delta mean sale price of Litchfield County towns
L08 <- Litchfield_SF %>%
  group_by(town, sale_year) %>%
  summarise(mean_L08 = mean(sale_price)) %>%
  filter(sale_year == "2008")

L20 <- Litchfield_SF %>%
  group_by(town, sale_year) %>%
  summarise(mean_L20 = mean(sale_price)) %>%
  filter(sale_year == "2020")

L_combined <- L08 %>%
  left_join(L20, by = "town")

L_combined <- L_combined %>%
  mutate(L_delta_mean = mean_L20 - mean_L08) %>%
  mutate(L_c_mean = (mean_L20 - mean_L08)/mean_L08) %>%
  mutate(L_c_mean = L_c_mean * 100) %>%
  mutate(across(contains("mean"), round, 2))
Create bargraph of the delta mean of Litchfield County towns
L_combined %>% 
  mutate(Color = ifelse(L_c_mean < 0, "rosybrown","steelblue")) %>%
  ggplot(aes(x = reorder(town, +L_c_mean), 
             y = L_c_mean, 
             fill = Color)) +
  geom_col() +
  labs(title = "Percentage Change in Mean Sale Price: 2008 to 2020",
       subtitle = "Towns in Litchfield County, Connecticut",
       x = "County Name",
       y = "Percent Change",
       caption = "**Data missing for Torrington 2020") +
  theme(axis.text.x = element_text(angle = 45, 
                                            hjust = 1)) +
  scale_fill_identity(guide = "none")

This graphic displays the change in mean sale price in all the Litchfeild County towns from 2008 to 2020.

Change in mean Torrington
Litchfield_SF %>%
  filter(town == "Torrington") %>%
  select(sale_year, sale_price) %>%
  group_by(sale_year) %>%
  summarise(mean = mean(sale_price)) %>%
  pivot_wider(names_from = sale_year, values_from = mean) %>%
  mutate("Delta Mean" = (`2019`-`2007`)) %>%
  pivot_longer(0:14, names_to = "Sale_Year", values_to = "Mean_Sale_Price") %>%
  mutate(Color = ifelse(Mean_Sale_Price < 0, "rosybrown","steelblue")) %>%
  ggplot(aes(x = Sale_Year, y = Mean_Sale_Price, fill = Color)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, 
                                            hjust = 1)) +
  scale_fill_identity(guide = "none") +
  labs(title = "Mean Price Torrington: 2008 to 2020",
       subtitle = "Litchfield County, Connecticut",
       x = "Sale Year",
       y = "Mean Sale Price",
       caption = "**Delta Mean = 2019 - 2007**")

This graphic shows the behavior of the mean sale price of Torrington. Torrington did not have 2020 data so it was not included in the grouped bar chart of Litchfield County towns. The “Delta Mean” bar at the right end is calculated by subtracting the mean of 2008 from the mean of 2020. As we can see the mean sale price of Torrington “Single Family” properties has decreased.

Now we will take a look Goshen. A hundred percent increase in mean sale price is a little remarkable. We will analyze to see if there is an explanation in the data.

Litchfield_SF %>%
  filter(town == "Goshen") %>%
  group_by(sale_year) %>%
  summarize(goshen_mean = mean(sale_price),
            properties_sold = count(sale_year))
# A tibble: 14 × 3
   sale_year goshen_mean properties_sold
       <dbl>       <dbl>           <int>
 1      2007     405302.              45
 2      2008     219587.               9
 3      2009     360256.              45
 4      2010     303607               50
 5      2011     391851.              58
 6      2012     305958.              24
 7      2013     456833.              15
 8      2014     350857.              56
 9      2015     340886.              71
10      2016     361325.              75
11      2017     333381.              52
12      2018     333617.              12
13      2019     337689.              80
14      2020     475989.              64
  Litchfield_SF %>%
  filter(town == "Goshen") %>%
  group_by(sale_year) %>%
  summarize(goshen_mean = mean(sale_price),
            properties_sold = count(sale_year)) %>%
  ggplot(aes(x = sale_year)) +
  geom_line(aes(x = sale_year, y = goshen_mean)) +
  geom_col(aes(x = sale_year, y = properties_sold*2500)) +
  scale_x_continuous(breaks = seq(2007, 2020, 1)) +
  scale_y_continuous(name = "Goshen Mean", sec.axis = sec_axis(~./2500, name = "properties sold")) 

Where I will go from here

What is missing from your final project?

I think my project is still kind of unorganized. I need to move some things to the appendix, remove others, and move some things around. I also need to clean up and recode some of the graphics. I also need to write out explanations of the decisions I made and the graphics. Generally, I need to work on the narration of the project. I also missing error bars.

What do I hope to accomplish

I hope to accomplish everything above and continue the analysis for Fairfield County. I also to refine and/or polish the graphics and tables.

#ts_object <- connecticut %>%
  #filter(property_type == "Single Family") %>%
  #group_by(month = lubridate::floor_date(sale_date, "month")) %>%
  #summarise(mean = mean(sale_price)) 
#ts_object_2 <- xts(ts_object$mean, ts_object$month)

plot(ts_object_2)

pred_12 <- auto.arima(ts_object_2)

pred_12

forecast_1 <- forecast(pred_12, 12)

###plot(forcast_1)

forecast_1

accuracy(pred_12)

qqnorm(pred_12$residuals)

fit <-stl(pred_12, s.window = “period”) plot(fit)

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Farrell (2022, April 27). Data Analytics and Computational Social Science: Homework 6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211892727/

BibTeX citation

@misc{farrell2022homework,
  author = {Farrell, Joseph},
  title = {Data Analytics and Computational Social Science: Homework 6},
  url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211892727/},
  year = {2022}
}