Connecticut Real Estate Sales Data
library(readr)
library(tidyverse)
library(dplyr)
library(mosaic)
library(ggplot2)
library(lubridate)
library(xts)
library(formattable)
library(forecast)
library(kableExtra)
library(plotly)
library(gapminder)
library(gganimate)
library(gifski)
library(png)
library(ggfortify)
library(tidyquant)
library(tidycensus)
library(ggstatsplot)
library(scales)
library(feasts)
library(tsibble)
The following is an analysis of Connecticut real estate sales data. The data set used in this analysis contains 930,621 observations (transactions) and contains data from 2005-2020. There are 14 variables in the original data set. Not all observations and variables will be pertinent to answering the research questions and will be subsequently filtered out. The research questions this analysis will attempt to answer or at least provide insights into answering are:
I will read in two data sets that will be joined and used for the analysis.
connecticut <- rename(connecticut, "town" = "Town")
Here I am changing the name of the “Town” variable in the Connecticut data set so it will match the “town” variable in the county_town data set.
Join datasetsconnecticut <-
left_join(connecticut, county_town, by = "town")
Here am joining the data sets, this will be useful when grouping in the analysis.
Remove unnecassary columnsconnecticut <- select(connecticut,
"List Year",
"Date Recorded",
"town",
"Sale Amount",
"Residential Type",
"subregion")
Here I have removed the columns “Serial number”, “Non Use Code”, “Assessor Remarks”, “Sales Ratio”, “Property Type”, “Location”, “Assessed Value”, and “OPM Remarks”
The NAs removed here are simply nonresidential properties, which will not be part of the analysis.
Rename columns without spaces and for clarityconnecticut <- rename(connecticut,
"list_year" = "List Year",
"sale_date" = "Date Recorded",
"sale_price" = "Sale Amount",
"property_type" = "Residential Type",
"county" = "subregion")
In addition to removing spaces from the column names, I have renamed residential type “property type” as there are no longer non-residential properties, this naming scheme is more intuitive.
Convert “sale_date” from character to date and create new column “sale_year”The data prior to and including 2006 is very limited. As a result, I have removed all observations prior to 2007.
The data was heavily skewed to the right, with extreme outliers in the right direction which impacted the measure of central tendency (mean). After trimming the data it is still skewed to the right but not as heavily.
str(connecticut)
tibble [509,278 × 7] (S3: tbl_df/tbl/data.frame)
$ list_year : num [1:509278] 2014 2014 2014 2014 2014 ...
$ sale_date : Date[1:509278], format: "2015-08-06" ...
$ town : chr [1:509278] "Stamford" "New Haven" "Ridgefield" "New Britain" ...
$ sale_price : num [1:509278] 850000 149900 570000 261000 250000 ...
$ property_type: chr [1:509278] "Single Family" "Single Family" "Single Family" "Condo" ...
$ county : chr [1:509278] "fairfield" "new haven" "fairfield" "hartford" ...
$ sale_year : num [1:509278] 2015 2015 2015 2015 2015 ...
- attr(*, "na.action")= 'omit' Named int [1:382446] 13 27 28 33 53 78 83 105 127 132 ...
..- attr(*, "names")= chr [1:382446] "13" "27" "28" "33" ...
After cleaning the data set we are left with seven variables (columns):
Now that we have data cleaned we take a look at some statewide data so we can get an idea of the general state of the real estate market in Connecticut.
connecticut %>%
group_by(year = lubridate::floor_date(sale_date, "year")) %>%
summarize(mean_year = mean(sale_price)) %>%
ggplot(aes(x = year, y = mean_year)) +
geom_line(aes()) +
geom_smooth() +
ggtitle("Overall Mean Sale Price of Connecticut Properties: 2007-2020") +
ylab("Sale Price") +
xlab("Year") +
theme_update()
Here we can observe the behavior of the overall mean sale price of properties in Connecticut from 2007 to 2020. As expected the mean price decreased sharply following the 2008 financial crisis and appears to have bottomed out and stabilized around 2010. From 2010 to 2019 there does not appear to be a lot change in the overall mean sale price.
con_map <- map_data("county",
"connecticut")
This data set contains latitude and longitude data that can be used to make a map visualization.
Rename “subregion” “county” to join “con_map” with “connecticut”con_map <- rename(con_map,
"county" = "subregion")
The “subregion” in the “con_map” data set is Connecticut counties so I will rename the variable so I can join it with the sales data.
Create object “map_vis”: grouping by county and creating a “county_mean” columnThis object contains the mean sale price of each county in Connecticut and will be joined the map data to display the mean sale price on a map.
Join “mapping” object with “con_map” datamap_vis <- inner_join(map_vis,
con_map,
by = "county")
This object will be used to create a map visualization displaying county mean sale price.
Create object to use when labeling the counties on the mapThis creates an object that will center the name of the county inside that county on the map visualization.
Create map of Connecticut counties displaying the mean sale price of residential propertiesggplot(map_vis,
aes(x = long,
y = lat)) +
geom_polygon(aes(fill = county_mean,
group = county)) +
geom_text(data = county_names,
aes(long,
lat,
label = county),
size =3) +
scale_fill_gradient(low = "bisque4",
high = "bisque2",
name = "Average Sale Price (USD)") +
ggtitle("Overall Average Sale Price in Connecticut Counties") +
theme_light()
Here we can get a visualization of where in Connecticut (grouped by county) property values are the highest. Not surprisingly, the area closest to New York City appears to have the highest property value.
mean_property_type <-connecticut %>%
group_by(year = lubridate::floor_date(sale_date, "year"),
property_type) %>%
summarize(mean = mean(sale_price))
mean_property_type <-ggplot(mean_property_type,
aes(year,
mean,
color = property_type)) +
geom_line() +
transition_reveal(year) +
labs(title = "Mean Sale Price: 2007-2020",
subtitle = "Grouped By Property Type",
x = "Year",
y = "Price ($)") +
guides(color = guide_legend(title = "Property Type"))
animate(mean_property_type)
This graphic displays the mean monthly sale price of all properties in Connecticut grouped by the type of property. Interestingly, single family homes have the highest mean sale price. We can get a closer look below.
Scatterplot of yearly mean sale price for each “Property Type”options(scipen = 999) # remove scientific notation
connecticut %>%
group_by(property_type,
year = lubridate::floor_date(sale_date, "year")) %>%
summarize(mean_year_type = mean(sale_price)) %>%
ggplot(aes(x = year,
y = mean_year_type,
group = property_type)) +
geom_point((aes(color = property_type))) +
geom_smooth() +
ggtitle("Yearly Mean Sale Price: Property Types") +
ylab("Mean Sale Price ($)") +
xlab("Year") +
theme_linedraw() +
facet_wrap(vars(property_type)) +
guides(color = guide_legend(title = "Property Type"))
Here we will examine the counts of the different property types in our data set. We will also look at the counts in each specific county and later in each town.
View Counts of Property typesconnecticut %>%
count(property_type) %>%
mutate("Proportion"= percent(n/sum(n))) %>%
rename("Property Type" = "property_type") %>%
rename("Number of Properties Sold" = "n") %>%
kbl() %>%
kable_material(c("striped",
"hover"))
Property Type | Number of Properties Sold | Proportion |
---|---|---|
Condo | 98463 | 19.3% |
Four Family | 1985 | 0.4% |
Single Family | 373121 | 73.3% |
Three Family | 11262 | 2.2% |
Two Family | 24447 | 4.8% |
Here we can see how many observations there are for each “property type.”
We can see that majority of the observations are “Single Family” and “Condo.”
The least represented group is “Four Family” with only 1985 observations across all towns in Connecticut over the 14 years of the data.
This partially explains the apparent volatility in the mean sale price of “Four Family” properties displayed in (###### ADD VIS IDENTIFIER).
This will become more impactful as we continue the analysis. Once we start looking at individual towns there will be very limited data on any property types other than “single family.”
For a full exploration of property types counts see appendix Part C
connecticut %>%
group_by(county,
property_type,
sale_year) %>%
summarize(mean_sale_price = mean(sale_price)) %>%
select(county,
property_type,
sale_year,
mean_sale_price) %>%
filter(sale_year == "2008" |
sale_year =="2020") %>%
filter(property_type == "Single Family") %>%
pivot_wider(names_from = sale_year,
values_from = mean_sale_price) %>%
mutate(delta_mean = (`2020` - `2008`)) %>%
mutate(delta_mean_percent = (delta_mean/`2008`)) %>%
inner_join(con_map,
by = "county") %>%
ggplot(aes(x = long,
y = lat)) +
geom_polygon(aes(fill = delta_mean_percent,
group = county)) +
geom_text(data = county_names,
aes(long,
lat,
label = county),
size =3) +
scale_fill_gradient(low = "bisque4",
high = "bisque2",
name = "Percent Change") +
ggtitle("Percent Change in Single Family Sale Price in Connecticut Counties") +
theme_light()
connecticut %>%
group_by(county,
property_type,
sale_year) %>%
summarize(mean_sale_price = mean(sale_price)) %>%
select(county,
property_type,
sale_year,
mean_sale_price) %>%
filter(sale_year == "2008" |
sale_year =="2020") %>%
filter(property_type == "Single Family") %>%
pivot_wider(names_from = sale_year,
values_from = mean_sale_price) %>%
mutate(delta_mean_ID = (`2020` - `2008`)) %>%
mutate(delta_mean_percent = percent(delta_mean_ID/`2008`)) %>%
mutate(across(contains("ID"), round, 2)) %>%
arrange(desc(delta_mean_percent)) %>%
rename("County" = "county",
"Mean 2008" = "2008",
"Mean 2020" = "2020",
"Change in Mean ($)" = "delta_mean_ID",
"Change in Mean (%)" = "delta_mean_percent") %>%
mutate(County = str_to_title(County)) %>%
formattable(align = c("l",
"c",
"r"),
list(`Mean 2020` = FALSE,
`Mean 2008` = FALSE,
`property_type` = FALSE,
`Change in Mean (%)` = formatter("span",
style = ~ formattable::style(color = ifelse(`Change in Mean (%)` < 0,
"firebrick",
"forestgreen")),
~ icontext(sapply(`Change in Mean (%)`,
function(x) if (x < 0) "arrow-down" else if (x > 0) "arrow-up"),
`Change in Mean (%)`))))
County | Change in Mean ($) | Change in Mean (%) |
---|---|---|
Hartford | 13627.15 | 5% |
Litchfield | 64267.71 | 22% |
Fairfield | 64666.65 | 13% |
Windham | 25014.99 | 12% |
New Haven | -219.55 | 0% |
New London | 397.25 | 0% |
Middlesex | -13839.78 | -4% |
Tolland | -6941.51 | -3% |
connecticut %>%
group_by(county,
property_type,
sale_year) %>%
summarize(mean_sale_price = mean(sale_price)) %>%
select(county,
property_type,
sale_year,
mean_sale_price) %>%
filter(sale_year == "2008" | sale_year =="2020") %>%
filter(property_type == "Single Family") %>%
pivot_wider(names_from = sale_year, values_from = mean_sale_price) %>%
mutate(delta_mean_ID = (`2020` - `2008`)) %>%
mutate(delta_mean_percent = (delta_mean_ID/`2008`*100)) %>%
mutate(across(contains("percent"), round, 2)) %>%
mutate(Color = ifelse(delta_mean_percent < 0, "rosybrown","steelblue")) %>%
ggplot(aes(x = reorder(county, delta_mean_percent),
y = delta_mean_percent,
fill = Color)) +
geom_col()+
geom_text(aes(label = delta_mean_percent,
vjust = 1.2),
size = 3) +
labs(title = "Percentage Change in Mean Sale Price for Single Family Properties: 2008 to 2020",
x = "County Name",
y = "Percent Change") +
theme(axis.text.x = element_text(angle = 30,
hjust = 1)) +
scale_fill_identity(guide = "none")
The above two graphics display the change in mean sale price of “Single Family” properties from 2007 to 2020 in the 8 Connecticut counties.
connecticut %>%
group_by(county,
property_type,
year = lubridate::floor_date(sale_date, "year")) %>%
summarize(mean_year = mean(sale_price)) %>%
filter(property_type == "Single Family") %>%
ggplot(aes(x = year,
y = mean_year)) +
geom_line(aes(color = county)) +
ggtitle("Mean Sale Price of Single Family Properties: Connecticut Counties") +
ylab("Sale Price") +
xlab("Year") +
theme_bw() +
facet_wrap(vars(county)) +
guides(color = "none")
Next, we will shift to focus to potential drivers of sale price
connecticut %>%
group_by(month = floor_date(sale_date,
"month"),
property_type) %>%
summarize(mean_price = mean(sale_price),
count = count(month)) %>%
ggplot(aes(x = count,
y = mean_price,
color = property_type)) +
geom_point() +
geom_smooth(method = "lm") +
ggtitle("Number of Properties Sold (Month) and Average Sale Price",
subtitle = "Grouped by Property Type") +
xlab("Number of Properties Sold (Month)") +
ylab("Average Sale Price") +
facet_wrap(vars(property_type),
scales = "free") +
guides(color = "none")
Here we are looking the relationship between the number of properties sold and how it impacts sales price. Number of properties sold is the x variable and mean sale price is the y variable. Here we are aggregating by month.
This graphic demonstrates that there does appear to be a least a slight positive linear relationship between the number of properties sold in a given month and the average sale price.
Next we will take a closer look at this relationship as it relates to Single Family properties.
connecticut %>%
group_by(property_type,
month = floor_date(sale_date,
"month")) %>%
filter(property_type == "Single Family") %>%
summarize(mean = mean(sale_price),
properties_sold = count(month)) %>%
mutate(previous_mean = lag(mean)) %>%
mutate(previous_month = lag(month)) %>%
mutate(Direction = ifelse(properties_sold < lag(properties_sold),
"Decreasing Sales",
"Increasing Sales")) %>%
mutate(Direction2 = ifelse(mean < lag(mean),
"Decreasing Price",
"Increasing Price")) %>%
ggplot(aes(month, mean,
xend = previous_month,
yend = previous_mean)) +
geom_segment(aes(color = Direction2)) +
scale_color_manual("Direction of Change",
values = c("Decreasing Price" = "rosybrown",
"Increasing Price" = "steelblue")) +
geom_col(aes(x = month,
y = properties_sold*50,
fill = Direction)) +
scale_fill_manual("Direction of Change",
values = c("Decreasing Sales" = "rosybrown",
"Increasing Sales" = "steelblue")) +
ggtitle("How Volume Impacts Price Movement")
xlab("Time Grouped by Month") +
scale_y_continuous(name = "Average Sale Price ($)",
sec.axis = sec_axis(~./50,
name = "Number of Properties Sold"))
NULL
Here we can see the relationship graphed differently. The line and bars are blue when the value is higher than the previous value and they are red when the value is lower than the previous value. This graphic demonstrates that as more properties are sold, price tends to go up. More properties sold can indicate increased demand.
Additionally, if you look closely at this graphic you may notice that there are approximately the same amount of the peeks and troughs are there are years in the data set. This indicate an element seasonality, where prices go up and down cyclically with the seasons. We look at seasonality as potential price driver later.
Next we will look at the relationship statistically.
options(scipen = 999) # remove scientific notation
connecticut %>%
group_by(month = floor_date(sale_date, "month"), property_type) %>%
filter(property_type == "Single Family") %>%
summarize(mean_price = mean(sale_price),
count = count(month)) %>%
ggplot(aes(x = count, y = mean_price)) +
geom_point() +
geom_smooth(method = "lm")
connecticut %>%
group_by(month = floor_date(sale_date, "month"),
property_type) %>%
filter(property_type == "Single Family") %>%
summarize(mean_price = mean(sale_price),
count = count(month),
na.rm = TRUE) %>%
ggscatterstats(x = count,
y = mean_price) +
ggtitle("Relationship Between Count and Sale Price") +
xlab("Number of Properties Sold") +
ylab("Average Price ($)") +
theme(axis.text.x = element_text(angle = 30,
hjust = 1))
Next we run a simple linear regression.
lm1 <- connecticut %>%
group_by(month = floor_date(sale_date,
"month"),
property_type) %>%
filter(property_type == "Single Family") %>%
summarize(mean_price = mean(sale_price),
count = count(month))
lm2 <-lm(mean_price ~ count, data = lm1)
summary(lm2)
Call:
lm(formula = mean_price ~ count, data = lm1)
Residuals:
Min 1Q Median 3Q Max
-42584 -16779 -5474 14385 74928
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 271155.647 6334.772 42.804 < 0.0000000000000002 ***
count 17.492 2.669 6.555 0.000000000701 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24750 on 163 degrees of freedom
Multiple R-squared: 0.2086, Adjusted R-squared: 0.2037
F-statistic: 42.96 on 1 and 163 DF, p-value: 0.0000000007011
See appendix Part D for conditions check of the linear model.
We can conclude that the number properties sold does impact price, but it does not determine price. There are other factors involved.
Next we will explore another potential driver, current price.
connecticut %>%
group_by(month = floor_date(sale_date, "month")) %>%
summarise(mean = mean(sale_price)) %>%
mutate(previous = lag(mean),
change = (mean - previous),
percent_change = (change/mean) * 100) %>%
ggplot(aes(x = mean, y = change)) +
geom_point() +
geom_smooth(method = "lm") +
theme_tq()
c <- connecticut %>%
group_by(month = floor_date(sale_date,
"month"),
year = floor_date(sale_date,
"year")) %>%
filter(property_type == "Single Family") %>%
summarize(mean = mean(sale_price)) %>%
mutate(Month2 = tsibble::yearmonth(month)) %>%
ungroup(month)
c <- select(c,
"Month2",
"mean")
c <- as_tsibble(c)
gg_season(c)
# census_api_key("a2e56978408ace3c130f93ace8eb74b48564a9f4", install = TRUE)
# readRenviron("~/.Renviron")
# Sys.getenv("CENSUS_API_KEY")
get_estimates(geography = "county",
product = "population",
state = "CT",
time_series = TRUE,
year = 2019)
# A tibble: 192 × 5
NAME DATE GEOID variable value
<chr> <dbl> <chr> <chr> <dbl>
1 New Haven County, Connecticut 1 09009 POP 862477
2 New Haven County, Connecticut 2 09009 POP 862442
3 New Haven County, Connecticut 3 09009 POP 863357
4 New Haven County, Connecticut 4 09009 POP 863871
5 New Haven County, Connecticut 5 09009 POP 864566
6 New Haven County, Connecticut 6 09009 POP 862820
7 New Haven County, Connecticut 7 09009 POP 862885
8 New Haven County, Connecticut 8 09009 POP 860186
9 New Haven County, Connecticut 9 09009 POP 857901
10 New Haven County, Connecticut 10 09009 POP 857748
# … with 182 more rows
connecticut %>%
filter(county == "litchfield") %>%
group_by(property_type,
year = lubridate::floor_date(sale_date, "year")) %>%
summarize(mean_litchfield = mean(sale_price)) %>%
ggplot(aes(x = year,
y = mean_litchfield)) +
geom_line(aes(color = property_type)) +
geom_smooth() +
theme_bw() +
labs(title = "Mean Sale Price of Different Property Types",
subtitle = "Litchfield County 2008-2020",
x = "Year",
y = "Mean Sale Price ($)") +
facet_wrap(vars(property_type)) +
guides(color = "none")
This graphic displays the behavior of the mean sale price of the different “Property Types” in Litchfield County from 2008 to 2020.
Next we examine the counts of property types in Litchfield County.
Litchfield <- connecticut %>%
group_by(sale_year) %>%
filter(county == "litchfield")
Litchfield %>%
group_by(sale_year, property_type) %>%
ggplot(aes(x = sale_year, fill = property_type)) +
geom_bar(aes(stat="identity")) +
facet_wrap(vars(property_type)) +
labs(title = "Total Number of Properties Sold Each Year in Litchfield: 2008-2020",
subtitle = "Displayed by Property Type",
x = "Year 2008-2020",
y = "Numnber of Properties Sold") +
theme(legend.position = "none")
Here we can see clearly that our data set is primarily made up of “Single Family” properties and “Condos”.
Examine counts of property type in Litchfield CountyLitchfield %>%
group_by(property_type) %>%
count(property_type) %>%
rename("Property Type" = "property_type",
"Properties Sold" = "n") %>%
knitr::kable(caption = "Counts of Property Types Sold: Litchfield County") %>%
kable_classic(full_width = T) %>%
row_spec(2, background = "yellow") %>%
row_spec(4, background = "yellow")
Property Type | Properties Sold |
---|---|
Condo | 4005 |
Four Family | 64 |
Single Family | 24969 |
Three Family | 363 |
Two Family | 1305 |
Here we can see a partial explanation for the volatility of the mean sale price of “Four Family” properties; with only 64 observations over 14 years it seems hard to rely on the data. We also see that “Three Family” is not well represented over the time period.
Looking a little deeper at “Four” and “Three Family” propertiesLitchfield %>%
filter(property_type == "Four Family" | property_type == "Three Family") %>%
group_by(sale_year) %>%
count(property_type) %>%
pivot_wider(names_from = property_type, values_from = n) %>%
rename("Sale Year" = "sale_year") %>%
kable(align = c("l", "c", "r"),
caption = "Counts of Four Family Properties Sold Each Year: Litchfield County") %>%
kable_classic(full_width = T) %>%
column_spec(1, bold = T) %>%
row_spec(0, bold = T) %>%
row_spec(5:7, background = "yellow") %>%
row_spec(14, background = "yellow")
Sale Year | Four Family | Three Family |
---|---|---|
2007 | 3 | 40 |
2008 | 3 | 21 |
2009 | 2 | 28 |
2010 | 4 | 25 |
2011 | 1 | 21 |
2012 | 3 | 18 |
2013 | 3 | 16 |
2014 | 5 | 31 |
2015 | 3 | 29 |
2016 | 5 | 31 |
2017 | 4 | 23 |
2018 | 10 | 30 |
2019 | 10 | 40 |
2020 | 8 | 10 |
Here I finally got use the pivot function, I didn’t notice any opportunities before this.
I have highlighted 2011-2013 and 2020 where there are very limited observations for “Three” and “Four Family” properties. While the counts do increase in some of the years the data does not appear reliable at even the county level. As a result of the limitations of the data examining different “property types” beyond the county level will not be revealing or possible.
As we examine the towns in Litchfield County, we will only be looking a “Single Family” properties.
As a result of the data being limited for “property types” other than “Single Family” I will filter for “Single Family” before I look at the specific towns in Litchfield County.
Display the mean sale price of “Single Family” properties in Litchfeild County towns: 2008-2020Litchfield %>%
group_by(town, year = lubridate::floor_date(sale_date, "year")) %>%
summarize(meanL = mean(sale_price)) %>%
ggplot(aes(x = year, y = meanL)) +
geom_line(aes(color = town)) +
labs(title = "Mean Sale Price: 2008 to 2020",
subtitle = "Towns in Litchfield County, Connecticut",
x = "County Name",
y = "Sale Price",
caption = "**Data missing for Torrington 2020") +
theme_bw() +
facet_wrap(vars(town)) +
guides(color = "none")
L08 <- Litchfield_SF %>%
group_by(town, sale_year) %>%
summarise(mean_L08 = mean(sale_price)) %>%
filter(sale_year == "2008")
L20 <- Litchfield_SF %>%
group_by(town, sale_year) %>%
summarise(mean_L20 = mean(sale_price)) %>%
filter(sale_year == "2020")
L_combined <- L08 %>%
left_join(L20, by = "town")
L_combined <- L_combined %>%
mutate(L_delta_mean = mean_L20 - mean_L08) %>%
mutate(L_c_mean = (mean_L20 - mean_L08)/mean_L08) %>%
mutate(L_c_mean = L_c_mean * 100) %>%
mutate(across(contains("mean"), round, 2))
L_combined %>%
mutate(Color = ifelse(L_c_mean < 0, "rosybrown","steelblue")) %>%
ggplot(aes(x = reorder(town, +L_c_mean),
y = L_c_mean,
fill = Color)) +
geom_col() +
labs(title = "Percentage Change in Mean Sale Price: 2008 to 2020",
subtitle = "Towns in Litchfield County, Connecticut",
x = "County Name",
y = "Percent Change",
caption = "**Data missing for Torrington 2020") +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
scale_fill_identity(guide = "none")
This graphic displays the change in mean sale price in all the Litchfeild County towns from 2008 to 2020.
Litchfield_SF %>%
filter(town == "Torrington") %>%
select(sale_year, sale_price) %>%
group_by(sale_year) %>%
summarise(mean = mean(sale_price)) %>%
pivot_wider(names_from = sale_year, values_from = mean) %>%
mutate("Delta Mean" = (`2019`-`2007`)) %>%
pivot_longer(0:14, names_to = "Sale_Year", values_to = "Mean_Sale_Price") %>%
mutate(Color = ifelse(Mean_Sale_Price < 0, "rosybrown","steelblue")) %>%
ggplot(aes(x = Sale_Year, y = Mean_Sale_Price, fill = Color)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1)) +
scale_fill_identity(guide = "none") +
labs(title = "Mean Price Torrington: 2008 to 2020",
subtitle = "Litchfield County, Connecticut",
x = "Sale Year",
y = "Mean Sale Price",
caption = "**Delta Mean = 2019 - 2007**")
This graphic shows the behavior of the mean sale price of Torrington. Torrington did not have 2020 data so it was not included in the grouped bar chart of Litchfield County towns. The “Delta Mean” bar at the right end is calculated by subtracting the mean of 2008 from the mean of 2020. As we can see the mean sale price of Torrington “Single Family” properties has decreased.
Now we will take a look Goshen. A hundred percent increase in mean sale price is a little remarkable. We will analyze to see if there is an explanation in the data.
Litchfield_SF %>%
filter(town == "Goshen") %>%
group_by(sale_year) %>%
summarize(goshen_mean = mean(sale_price),
properties_sold = count(sale_year))
# A tibble: 14 × 3
sale_year goshen_mean properties_sold
<dbl> <dbl> <int>
1 2007 405302. 45
2 2008 219587. 9
3 2009 360256. 45
4 2010 303607 50
5 2011 391851. 58
6 2012 305958. 24
7 2013 456833. 15
8 2014 350857. 56
9 2015 340886. 71
10 2016 361325. 75
11 2017 333381. 52
12 2018 333617. 12
13 2019 337689. 80
14 2020 475989. 64
Litchfield_SF %>%
filter(town == "Goshen") %>%
group_by(sale_year) %>%
summarize(goshen_mean = mean(sale_price),
properties_sold = count(sale_year)) %>%
ggplot(aes(x = sale_year)) +
geom_line(aes(x = sale_year, y = goshen_mean)) +
geom_col(aes(x = sale_year, y = properties_sold*2500)) +
scale_x_continuous(breaks = seq(2007, 2020, 1)) +
scale_y_continuous(name = "Goshen Mean", sec.axis = sec_axis(~./2500, name = "properties sold"))
What is missing from your final project?
I think my project is still kind of unorganized. I need to move some things to the appendix, remove others, and move some things around. I also need to clean up and recode some of the graphics. I also need to write out explanations of the decisions I made and the graphics. Generally, I need to work on the narration of the project. I also missing error bars.
What do I hope to accomplish
I hope to accomplish everything above and continue the analysis for Fairfield County. I also to refine and/or polish the graphics and tables.
#ts_object <- connecticut %>%
#filter(property_type == "Single Family") %>%
#group_by(month = lubridate::floor_date(sale_date, "month")) %>%
#summarise(mean = mean(sale_price))
#ts_object_2 <- xts(ts_object$mean, ts_object$month)
plot(ts_object_2)
pred_12 <- auto.arima(ts_object_2)
pred_12
forecast_1 <- forecast(pred_12, 12)
###plot(forcast_1)
forecast_1
accuracy(pred_12)
qqnorm(pred_12$residuals)
fit <-stl(pred_12, s.window = “period”) plot(fit)
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Farrell (2022, April 27). Data Analytics and Computational Social Science: Homework 6. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211892727/
BibTeX citation
@misc{farrell2022homework, author = {Farrell, Joseph}, title = {Data Analytics and Computational Social Science: Homework 6}, url = {https://github.com/DACSS/dacss_course_website/posts/httpsrpubscomjnfarrell211892727/}, year = {2022} }