---
title: "Texas Housing Sales"
output: html_document
params:
spotlight: "Houston"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, results='asis')
# this is an R option that controls the number of digits printed
options(digits=4)
# for dataset
library(dplyr)
# for graphing functions
library(ggplot2)
# for pretty regression tables
library(broom)
# graphical parameters
theme_set(theme_bw())
# Using 2 colors from ColorBrewer scle RdBu
palette <- c("#ef8a62", "#67a9cf")
```
# Background
## Purpose
This is a study of housing sales in Texas cities, with a spotlight on the city of `r params$spotlight`. We will have graphs of time trends and run a linear regression model.
## Quick look at the dataset
Dataset from `ggplot2` package
**Variables:**
* **city:** Name of MLS area
* **year, month, date:** year and month, and date expressed as continuous year variable
* **sales:** Number of sales
* **volume:** Total value of sales
* **median:** Median sale price
* **listings:** Total active listings
* **inventory:** "Months inventory", amount of time it would take to sell all current listings at current pace of sales
Here are the first 20 rows of the spotlight city:
```{r dataset}
spotcity <- subset(txhousing, txhousing$city==params$spotlight)
knitr::kable(head(spotcity, n=20), align='c')
txhousing$month <- factor(txhousing$month, labels=c("Jan", "Feb", "Mar",
"Apr", "May", "Jun",
"Jul", "Aug", "Sep",
"Oct", "Nov", "Dec"))
```
# Plots
## Time series of Sales By City, Spotlight on `r params$spotlight`
Sales appear to be rising through 2006, then slow until 2010, and then rise again.
```{r, results='hold'}
ggplot(txhousing, aes(x=date, y=sales, group=city)) +
geom_line(color=palette[2], alpha=.5) +
geom_line(data=spotcity, color=palette[1])
```
## Seasonal trends
We detect seasonal trends in the time series plot, so let's aggregate across months (by city), to examine sales by month.
```{r seasonal_data, include=FALSE}
# aggregate data by city and month
by_month <- group_by(txhousing, city, month) %>%
summarise(sales=mean(sales), volume=mean(volume),
median=mean(median), listings=mean(listings),
inventory=mean(inventory))
```
```{r seasonal-plots}
ggplot(by_month, aes(x=month, y=sales, group=city)) +
geom_line(color=palette[2], alpha=.5) +
geom_line(data=filter(by_month, city==params$spotlight), color=palette[1])
```
## Removing city effects from seasonal effects
Large differences in the scale of sales make the seasonal effects difficult to see in some plots. Let's remove the city effect from the monthly aggregates by dividing sales by the mean of sales for that city.
Then, we replot the mean of the rescaled sales by month and city.
```{r de-city the seasonal plots}
by_month$mean_city_sales <- rep(tapply(by_month$sales, INDEX=by_month$city, mean, na.rm=TRUE), each=12)
by_month$dsales <- by_month$sales/by_month$mean_city_sales
ggplot(by_month, aes(x=month, y=dsales, group=city)) +
geom_line(color=palette[2], alpha=.5) +
geom_line(data=filter(by_month, city==params$spotlight), color=palette[1])
```
# Statistical Model
## Regression models
Sales is very skewed so transforming to log for linear regression
```{r, fig.show='hold', out.width='50%'}
txhousing$logsales <- log(txhousing$sales)
# layout(matrix(c(1,2), nrow=1))
# hist(txhousing$sales)
# hist(txhousing$logsales)
# layout(1)
ggplot(txhousing, aes(x=sales)) + geom_histogram(fill=palette[2])
ggplot(txhousing, aes(x=logsales)) + geom_histogram(fill=palette[2])
```
## Model
Model is $log(sales) = city + year + month + median + \epsilon$, where $city$ and $month$ are factors and $year$ and $median$ are continuous predictors.
Here are the coefficients, not including the city effects (too many):
```{r linear model}
m <- lm(logsales ~ city + I(year-2000) + month + median , data=txhousing)
mycoefs <- tidy(m, conf.int=TRUE)
mycoefs$term[mycoefs$term==("I(year - 2000)")] <- "year"
coefs_small <- filter(mycoefs, !grepl("city", term))
knitr::kable(coefs_small)
```
## Diagnostic plots
### Residuals vs. fitted
Linearity assumption seems to be ok, but maybe not homoskedasticity assumption.
```{r data and rvf}
# use broom::augment() to make a dataset of regression diagnostics
diagnostics <- augment(m)
# add the city variable so we can spotlight
diagnostics$city <- txhousing$city[as.numeric(diagnostics$`.rownames`)]
# residualsi vs fitted plto
ggplot(diagnostics, aes(x=.fitted, y=.resid)) +
geom_point(color=palette[2], alpha=.5) +
geom_smooth(color="black")+
geom_point(data=subset(diagnostics, diagnostics$city==params$spotlight), color=palette[1])
```
### q-q plot of residuals
Some departures from normality in tails, but we have a very large sample size, so maybe not a big deal.
```{r qq-plot}
ggplot(diagnostics, aes(sample=.resid)) +
geom_qq(alpha=.25) +
geom_qq_line()
```
### Influence plot
Residuals vs levarage, sized by Cook's distance.
```{r influence plot}
ggplot(diagnostics, aes(x=.hat, y=.std.resid, size=.cooksd)) +
geom_point(color=palette[2], alpha=.5) +
geom_point(data=subset(diagnostics, diagnostics$city==params$spotlight), color=palette[1])
```