forked from tpall/modeling-demo
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathindex.Rmd
387 lines (331 loc) · 12.2 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
---
title: "Modelling"
author: "Taavi Päll"
date: "17 10 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Estonian Land Board data
Available via [www.maaamet.ee](http://www.maaamet.ee/kinnisvara/htraru/FilterUI.aspx).
API can be accessed programmatically using html requests.
One can use "rvest" package. Please have a look at "R/maaamet.R" script in [rstats-tartu/datasets](https://www.github.com/rstats-tartu/datasets) repo.
Returns html, but that ok too.
Html results need little wrangling to get table.
Data that can be downloaded are number of transactions and price summaries per property size, and type for different time periods.
Price info is given for data splits with more than 5 transactions.
## Load dataset
```{r}
## Check if file is alredy downloaded
if(!file.exists("data/transactions_residential_apartments.csv")){
url <- "https://raw.githubusercontent.com/rstats-tartu/datasets/master/transactions_residential_apartments.csv"
## Check if data folder is present
if(!dir.exists("data")){
dir.create("data")
}
## Download file to data folder
download.file(url, "data/transactions_residential_apartments.csv")
}
```
## Import
```{r}
library(tidyverse)
library(lubridate)
library(stringr)
# Please try broom from GitHub as CRAN version may give obscure warning
# with augment
# devtools::install_github("dgrtwo/broom")
library(broom)
library(viridis)
apartments <- read_csv("data/transactions_residential_apartments.csv")
apartments <- apartments %>%
mutate(date = ymd(str_c(year, month, 1, sep = "-"))) %>%
select(date, everything())
harju <- apartments %>% filter(str_detect(county, "Harju"))
harju
```
## Strategy
- Start with single unit and identify interesting pattern
- Summarise pattern with model
- Apply model to all units
- Look for units that don't fit pattern
- Summarise with single model
## First glimpse
Plot **number of transactions** per year for Harju county and add smooth line.
```{r}
p <- ggplot(harju, aes(factor(year), transactions, group = area, color = area)) +
geom_point() +
geom_smooth(method = 'loess') +
facet_wrap(~county, scales = "free_y") +
labs(
y = "Transactions",
x = "Year"
) +
scale_color_viridis(discrete = TRUE)
p
```
Mean price per unit area:
```{r}
## Add date
p <- harju %>%
ggplot(aes(date, price_unit_area_mean, group = area, color = area)) +
geom_line() +
geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
labs(title = "Transactions with residential apartments",
subtitle = "Harju county",
x = "Date",
y = bquote(Mean~price~(eur/m^2)),
caption = str_c("Source: Estonian Land Board, transactions database.\nDashed line, the collapse of the investment bank\nLehman Brothers on Sep 15, 2008.")) +
scale_color_viridis(discrete = TRUE, name = bquote(Apartment~size~m^2))
p
```
Adjust mean price to changes in consumer index:
```{r}
download.file("https://raw.githubusercontent.com/rstats-tartu/datasets/master/consumer_index.csv",
"data/consumer_index.csv")
consumer_index <- read_csv("data/consumer_index.csv")
consumer_index
## check if 2005 is 100%
consumer_index[10, 3:14] %>% rowMeans()
divide_by_100 <- function(x) {
x/100
}
consumer_index <- consumer_index %>%
select(-X1) %>%
gather(key = month, value = consumerindex, -year) %>%
mutate_at("consumerindex", divide_by_100)
consumer_index
```
Join apartments and consumer index data:
```{r}
apartments <- left_join(apartments, consumer_index, by = c("year", "month"))
```
Select harju county data:
```{r}
harju_ci <- filter(apartments, str_detect(county, "Harju"))
harju_ci
harju_ci <- harju_ci %>%
mutate(pua_mean_ci = price_unit_area_mean / consumerindex)
```
Price per m^2 after consumer index adjustment:
```{r}
harju_ci %>%
ggplot(aes(date, pua_mean_ci, group = area, color = area)) +
geom_line() +
geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
labs(title = "Transactions with residential apartments",
subtitle = "Harju county",
x = "Date",
y = "Consumer index corrected mean price\nrelative to 2005 (eur/m^2)",
caption = str_c("Source: Estonian Land Board, transactions database.\nDashed line, the collapse of the investment bank\nLehman Brothers on Sep 15, 2008.")) +
scale_color_viridis(discrete = TRUE, name = bquote(Apartment~size~m^2))
```
## Seasonal pattern
Seasonal pattern in number of transactions? Mean price eur/m^2 per month:
```{r}
## Note that we are rearranging x axis using R builtin vector of month names
p <- harju %>%
ggplot(aes(factor(month, levels = month.abb), transactions, group = year)) +
geom_line(alpha = 0.3) +
geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
facet_wrap(~ area, scales = "free_y") +
labs(title = "Seasonal trend in number of transactions\nwith residential apartments",
subtitle = "Harju county",
x = "Month",
y = "Transactions",
caption = "Source: Estonian Land Board, transactions database.") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
p
```
Add trendline per month.
Fit linear model per month for each apartment size class.
`augment()` function from "broom" library adds fitted values and residuals to the original data.
We can use this to overlay model fit to our data.
```{r}
predicted_transa <- harju %>%
lm(transactions ~ month + area, data = .) %>%
augment() # broom
predicted_transa %>%
arrange(desc(.cooksd)) %>%
dplyr::select(.cooksd, everything()) %>%
head
```
Overlay fitted values to previous plot.
```{r}
p + geom_line(data = predicted_transa, aes(month, .fitted, group = 1),
color = "red",
size = 2) +
labs(caption = "Red line: model fit of seasonal effect to number of transactions.\nSource: Estonian Land Board, transactions database.")
```
The number of transactions increases during period from March to May?
Do we have similar effect to average price per m^2?
```{r}
predicted_pua <- harju %>%
lm(price_unit_area_mean ~ month + area, data = .) %>%
augment()
predicted_pua %>% head
```
```{r}
harju %>%
ggplot(aes(factor(month, levels = month.abb), price_unit_area_mean, group = year)) +
geom_line(alpha = 0.3) +
geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
facet_wrap(~ area, scales = "free_y") +
labs(title = bquote(Seasonal~trend~price~per~m^2~of~residential~apartments),
subtitle = "Harju county",
x = "Month",
y = bquote(Mean~price~per~unit~area~(eur/m^2)),
caption = "Source: Estonian Land Board, transactions database.") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
geom_line(data = predicted_pua, aes(month, .fitted, group = 1),
color = "red",
size = 2)
```
Residuals? Add also year to model to visualise residuals.
```{r}
predicted_trans2 <- harju %>%
lm(transactions ~ month + year + area, data = .) %>%
augment
predicted_trans2 %>%
head()
predicted_trans2 %>%
ggplot(aes(factor(month, levels = month.abb), .resid, group = year)) +
geom_hline(yintercept = 0, colour = "white", size = 3) +
geom_line() +
facet_wrap(~ area)
```
For some apartement size classes, few years may have large influence on average number of transactions. We can look at the large cook's distance.
```{r}
predicted_trans2 %>%
arrange(desc(.cooksd)) %>%
dplyr::select(.cooksd, everything()) %>%
head()
```
Same for mean price per m^2:
```{r}
predicted_pua2 <- harju %>%
lm(price_unit_area_mean ~ month + year + area, data = .) %>%
augment
predicted_pua2 %>%
ggplot(aes(factor(month, levels = month.abb), .resid, group = year, color = factor(year))) +
geom_hline(yintercept = 0, colour = "white", size = 3) +
geom_line() +
facet_wrap(~ area) +
scale_color_viridis(discrete = TRUE) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
```
Seems ok!
```{r}
predicted_pua2 %>%
arrange(desc(.cooksd)) %>%
dplyr::select(.cooksd, everything(.)) %>%
head()
```
Robust lm fit to number of transactions. Play with weights: price_min, price_unit_area_mean etc..
```{r}
# install.packages("MASS")
library(MASS)
predicted_transa_robust <- harju %>%
rlm(transactions ~ month + area, data = ., weights = price_min) %>%
augment
```
Add robust fit to plot:
```{r}
p + geom_line(data = predicted_transa, aes(month, .fitted, group = 1),
color = "red",
size = 1) +
geom_line(data = predicted_transa_robust, aes(month, .fitted, group = 1),
color = "blue",
size = 1) +
labs(caption = "Red line: model fit of seasonal effect to number of transactions.\nBlue line: robust model fit of seasonal effect to number of transactions, weighted for minimal price. \nSource: Estonian Land Board, transactions database.")
```
## Whole dataset: all counties
Plot whole dataset:
```{r}
apartments %>%
ggplot(aes(date, transactions, group = area, color = area)) +
geom_line() +
facet_wrap(~ county, scales = "free_y")
```
```{r}
apartments %>%
mutate(price_unit_area_mean = price_unit_area_mean / consumerindex) %>%
ggplot(aes(date, price_unit_area_mean, group = area, color = area)) +
geom_line() +
facet_wrap(~ county, scales = "free_y")
```
## Fit multiple models
Fit model for each county.
Goodness of model fit is evaluated by its ability to predict response values.
Response values can be predicted using the same data that was used to train model.
In this case we are dealing with in-sample analysis, which is ok when all we want to know is model coeficients (eg. intercept and slope) to describe present data.
In cases when we want to predict using some unseen future data, such models are usually overfitted, overly optimistic and don't perform very well.
To manage overfitting, we split dataset into training set for model fitting and test set for model performance assessment.
Resampling can be performed very well using base R function `sample()`.
Let's say we want to use approx. 70% of rows from "mtcars" table to be used in model training and leave remaining 30% for model testing.
Basically, all we need to do is to generate to non-overlapping row indexes.
```{r}
set.seed(123)
index <- sample(1:nrow(mtcars), floor(0.7 * nrow(mtcars)), replace = FALSE)
index
## ~70% data goes to training set
train <- mtcars[index, ]
## remaining ~30% will be used for testing model performance
test <- mtcars[-index, ]
```
Index based resampling is also implemented in tidyverse.
We can use `resample_partition()` function from "modelr" library to perform.
Here we generate data splits and move them to separate columns.
```{r}
library(modelr)
## Generate data splits and move into separate columns
trans <- apartments %>%
group_by(county) %>%
nest %>%
mutate(parts = map(data, ~resample_partition(.x, c(test = 0.3, train = 0.7))),
test = map(parts, "test"),
train = map(parts, "train")) %>%
dplyr::select(-parts)
```
Fit model to each data split.
Let's start with number of transactions.
```{r}
## Fit model
trans <- trans %>%
mutate(mod_train = map(train, ~lm(transactions ~ month + area, data = .x)))
## Print data with new model column
trans %>%
dplyr::select(mod_train, everything())
```
`coef()` and `summary()` return vector and S3 object respectively.
Model coeficients and summary statistics can be returned as a data frame using `tidy()` function from "broom" library:
```{r}
## model coeficients
trans %>%
mutate(mod_tidy = map(mod_train, tidy)) %>%
dplyr::select(county, mod_tidy) %>%
unnest %>%
head()
```
Pseudo out-of-sample model performance, `augment()` adds predictions to original data frame (.fitted).
We subtract fitted values from the original values and calculate root-mean-squared deviation (rmsd).
Smaller rmsd indicates better fit.
```{r}
trans <- trans %>%
mutate(mod_pred = map2(mod_train, test, ~augment(.x, newdata = .y)),
mod_pred = map(mod_pred, ~mutate(.x, .resid = transactions - .fitted)),
mod_rmsd = map_dbl(mod_pred, ~sqrt(mean(.x$`.resid`^2))))
trans %>%
dplyr::select(county, mod_rmsd) %>%
arrange(desc(mod_rmsd))
```
We can see that Harju county displays worst fit and Tartu is next. These two counties show also majority of transactions.
```{r}
trans_pred <- trans %>%
dplyr::select(county, mod_pred) %>%
unnest
trans_pred %>%
dplyr::select(.fitted, transactions, everything()) %>%
head()
```