---
title: "Cross-Validation Hand Out"
author: "Alan Arnholt"
date: 'Last knit on: `r format(Sys.time(), "%B %d, %Y at %X")`'
bibliography: CV.bib
output:
  bookdown::html_document2:
    highlight: textmate
    theme: yeti
---
```{r, label = "SETUP", echo = FALSE, results= 'hide', message = FALSE, warning = FALSE}
library(knitr)
knitr::opts_chunk$set(fig.show = 'as.is', fig.height = 4, fig.width = 4, prompt = FALSE, highlight = TRUE, tidy = FALSE, warning = FALSE, message = FALSE, fig.align = "center", comment = NA, tidy.opts=list(blank = TRUE, width.cutoff= 75, cache = TRUE))
```
# Cross-Validation Handout
*Note: Working definitions and graphs are taken from @ugarte_probability_2016*
## The Validation Set Approach
The basic idea behind the validation set approach is to split the available data into a training set and a testing set. A regression model is developed using only the training set. Consider Figure \@ref(fig:vsa), which illustrates a split of the available data into a training set and a testing set.
```{r, label = "vsa", echo = FALSE, fig.cap = "Validation set approach"}
knitr::include_graphics("./PNG/valdset.png", dpi = 96)
```
The percentage of observations allocated to the training and testing sets may vary with the size of the available data. It is not unusual to allocate 70–75% of the available data to the training set and the remaining 25–30% to the testing set. The predictive performance of a regression model is assessed using the testing set. One of the more common measures of the predictive performance of a regression model is the mean squared prediction error (MSPE), defined as
\begin{equation}
\textrm{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
(\#eq:mspe)
\end{equation}
## Leave-One-Out Cross-Validation
Leave-one-out cross-validation (LOOCV) eliminates the variability in the MSPE that is present in the validation set approach. LOOCV is similar to the validation set approach in that the available $n$ observations are split into training and testing sets. The difference is that $n$ pairs of training and testing sets are created, where each training set consists of $n - 1$ observations and each testing set consists of a single, different observation from the original $n$. Figure \@ref(fig:LOOC) provides a schematic display of the leave-one-out cross-validation process with testing sets (light shade) and training sets (dark shade) for a data set of $n$ observations.
```{r, label = "LOOC", echo = FALSE, fig.cap = "Leave-one-out cross validation"}
knitr::include_graphics("./PNG/loocv.png", dpi = 96)
```
The MSPE is computed with each testing set, resulting in $n$ values of the MSPE. The LOOCV estimate of the test MSPE is the average of these $n$ values, denoted
\begin{equation}
\textrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\textrm{MSPE}_i
(\#eq:loocv)
\end{equation}
## $k$-Fold Cross-Validation
$k$-fold cross-validation is similar to LOOCV in that the available data is split into training sets and testing sets; however, instead of creating $n$ different training and testing sets, $k$ folds/groups of training and testing sets are created where $k < n$ and each fold consists of roughly $n/k$ values in the testing set and $n - n/k$ values in the training set. Figure \@ref(fig:KFCV) shows a schematic display of 5-fold cross-validation. The lightly shaded rectangles are the testing sets and the darker shaded rectangles are the training sets.
```{r, label = "KFCV", echo = FALSE, fig.cap = "Five fold cross-validation"}
knitr::include_graphics("./PNG/cv.png", dpi = 96)
```
The MSPE is computed on each of the $k$ folds using the testing set to evaluate the regression model built from the training set. The weighted average of the $k$ MSPE values, where $n_j$ denotes the number of observations in the $j^{\text{th}}$ testing set, is denoted as
\begin{equation}
\textrm{CV}_{(k)} = \sum_{j=1}^{k}\frac{n_j}{n}\textrm{MSPE}_j
(\#eq:kfcv)
\end{equation}
Note that LOOCV is a special case of $k$-fold cross-validation in which $k$ is set equal to $n$. An important advantage of $k$-fold cross-validation over LOOCV is that $\textrm{CV}_{(k)}$ for $k = 5$ or $k = 10$ often provides a more accurate estimate of the test error rate than does $\textrm{CV}_{(n)}$: the $n$ training sets used in LOOCV overlap almost completely, so the $n$ fitted models and their errors are highly positively correlated, and the mean of highly correlated quantities has higher variance than the mean of the less correlated quantities produced by $k$-fold cross-validation [@james_introduction_2013].
## Creating some data
```{r label = "somedata"}
set.seed(24)
n <- 1000 # Number of observations to generate
SD <- 0.5
xs <- sort(runif(n, 5, 9))
ys <- sin(xs) + rnorm(n, 0, SD)
DF <- data.frame(x = xs, y = ys)
rm(xs, ys)
library(DT)
datatable(DF)
```
## Validation Set Approach
* Create a training set using 75% of the observations in `DF`.
* Sort the observations in the training and testing sets.
```{r}
n <- nrow(DF)
train <- sample(n, floor(0.75 * n), replace = FALSE)
train <- sort(train)
trainSET <- DF[train, ]
testSET <- DF[-train, ]
dim(trainSET)
dim(testSET)
```
* Fit a quadratic model using the training set (`trainSET`).
```{r}
library(ggplot2)
# Base R
plot(y ~ x, data = trainSET, pch = 19, cex = .25, col = "blue")
modq <- lm(y ~ poly(x, 2, raw = TRUE), data = trainSET)
yhat <- predict(modq, newdata = trainSET)  # predict() takes newdata, not data
lines(trainSET$x, yhat, col = "red")
# ggplot2 approach
ggplot(data = trainSET, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 1) +
  theme_bw() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), color = "red", se = FALSE)
# Summary of quadratic model
summary(modq)
```
## Compute the training `MSPE`
```{r}
MSPE <- mean(resid(modq)^2)
MSPE
```
## Compute the testing `MSPE`
```{r}
yhtest <- predict(modq, newdata = testSET)
MSPEtest <- mean((testSET$y - yhtest)^2)
MSPEtest
```
## Fit a cubic model
```{r}
# Base R
plot(y ~ x, data = trainSET, pch = 19, cex = .25, col = "blue")
modc <- lm(y ~ poly(x, 3, raw = TRUE), data = trainSET)
yhat <- predict(modc, newdata = trainSET)  # predict() takes newdata, not data
lines(trainSET$x, yhat, col = "red")
# ggplot2 approach
ggplot(data = trainSET, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 0.5) +
  theme_bw() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3, raw = TRUE), color = "red", se = FALSE)
# Summary of cubic model
summary(modc)
```
## Compute the training `MSPE`
```{r}
MSPE <- mean(resid(modc)^2)
MSPE
```
## Compute the testing `MSPE`
```{r}
yhtest <- predict(modc, newdata = testSET)
MSPEtest <- mean((testSET$y - yhtest)^2)
MSPEtest
```
## Your Turn
* Create a training set (80%) and testing set (20%) of the observations from the data frame `HSWRESTLER` from the `PASWR2` package. Store the results from regressing `hwfat` onto `abs` and `triceps` in the object `modf`.
* Compute the test `MSPE`.
* Note how your classmates' answers all differ: the validation estimate of the test MSPE can be highly variable. One possible solution is sketched after the chunk below.
```{r}
# Your code here
library(PASWR2)
#
#
#
#
#
#
#
#
#
#
#
```
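A minimal solution sketch follows, assuming an 80/20 split; the seed (42) is an arbitrary choice, so your test MSPE will differ.
```{r}
# One possible solution (sketch); seed and split are arbitrary choices
set.seed(42)
n <- nrow(HSWRESTLER)
train <- sort(sample(n, floor(0.80 * n), replace = FALSE))
trainW <- HSWRESTLER[train, ]  # 80% training set
testW <- HSWRESTLER[-train, ]  # 20% testing set
modf <- lm(hwfat ~ abs + triceps, data = trainW)
MSPEtest <- mean((testW$hwfat - predict(modf, newdata = testW))^2)
MSPEtest
```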
## Your Turn
The left side of Figure 5.2 on page 178 of @james_introduction_2013 shows the validation approach used on the `Auto` data set in order to estimate the test error that results from predicting `mpg` using polynomial functions of `horsepower` for one particular split of the original data. The code below creates a similar graph.
```{r}
library(ISLR)
n <- nrow(Auto)
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(15, 30),
     ylab = "Mean Squared Prediction Error")
IND <- sample(1:n, size = floor(n/2), replace = FALSE)
train <- Auto[IND, ]
test <- Auto[-IND, ]
MSPE <- numeric(10)
for(i in 1:10){
  mod <- lm(mpg ~ poly(horsepower, i), data = train)
  pred <- predict(mod, newdata = test)
  MSPE[i] <- mean((test$mpg - pred)^2)
}
lines(1:10, MSPE, col = "red")
points(1:10, MSPE, col = "red", pch = 19)
# ggplot2 approach
DF2 <- data.frame(x = 1:10, MSPE = MSPE)
ggplot(data = DF2, aes(x = x, y = MSPE)) +
  geom_point(color = "red", size = 2) +
  geom_line(color = "red") +
  theme_bw() +
  ylim(15, 30) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Degree of Polynomial", y = "Mean Squared Prediction Error")
```
* Modify the code above to recreate a graph similar to the right side of Figure 5.2 on page 178 of @james_introduction_2013. Hint: Place a `for` loop before `IND`. One possible solution is sketched after the empty chunk below.
```{r, fig.width = 5, fig.height = 5}
# Your code here
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# ggplot2 approach
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
```
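A sketch of one possible solution follows; the seed and the colors are arbitrary, and each pass through the outer `for` loop draws the MSPE curve for one of ten random splits.
```{r, fig.width = 5, fig.height = 5}
# One possible solution (sketch); seed and colors are arbitrary
set.seed(34)
n <- nrow(Auto)
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(15, 30),
     ylab = "Mean Squared Prediction Error")
MSPE <- numeric(10)
for(j in 1:10){  # ten different random splits
  IND <- sample(1:n, size = floor(n/2), replace = FALSE)
  train <- Auto[IND, ]
  test <- Auto[-IND, ]
  for(i in 1:10){  # polynomial degrees 1 through 10
    mod <- lm(mpg ~ poly(horsepower, i), data = train)
    MSPE[i] <- mean((test$mpg - predict(mod, newdata = test))^2)
  }
  lines(1:10, MSPE, col = j)
}
```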
## $k$-Fold Cross-Validation
* Create $k = 5$ folds.
* Compute the $\textrm{CV}_{(5)}$ for `modq`.
```{r}
set.seed(1)
k <- 5
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(DF), replace = TRUE)
xtabs(~folds)
# or
table(folds)
sum(xtabs(~folds))
for(j in 1:k){
  modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF[folds != j, ])
  pred <- predict(modq, newdata = DF[folds == j, ])
  MSPE[j] <- mean((DF[folds == j, ]$y - pred)^2)
}
MSPE
weighted.mean(MSPE, table(folds)/length(folds))  # weights n_j/n, as in the CV_(k) formula
```
### Using `caret`
```{r}
library(caret)
model <- train(
  form = y ~ poly(x, 2, raw = TRUE),
  data = DF,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 5
  )
)
model
```
## Your Turn
* Compute the $\textrm{CV}_{(8)}$ for `modf`. Recall that `modf` was created from regressing `hwfat` onto `abs` and `triceps`. One possible completion follows the chunk below.
```{r}
# Your code here
set.seed(13)
k <- 8
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(HSWRESTLER), replace = TRUE)
#
#
#
#
#
#
#
#
#
```
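A sketch of one way to finish the loop, mirroring the 5-fold code used on `DF` above:
```{r}
# One possible completion (sketch) of the loop above
for(j in 1:k){
  modf <- lm(hwfat ~ abs + triceps, data = HSWRESTLER[folds != j, ])
  pred <- predict(modf, newdata = HSWRESTLER[folds == j, ])
  MSPE[j] <- mean((HSWRESTLER[folds == j, ]$hwfat - pred)^2)
}
MSPE
weighted.mean(MSPE, table(folds)/length(folds))  # weights n_j/n
```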
### Using `caret`
```{r}
model <- train(
  form = hwfat ~ abs + triceps,
  data = HSWRESTLER,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 8
  )
)
model
model$results$RMSE^2  # the squared RMSE estimates the CV MSPE
```
## Using `cv.glm` from `boot`
```{r}
set.seed(1)
library(boot)
glm.fit <- glm(y ~ poly(x, 2, raw = TRUE), data = DF)
cv.err <- cv.glm(data = DF, glmfit = glm.fit, K = 5)$delta[1]
cv.err
```
## Your Turn
* Compute $\textrm{CV}_{(8)}$ for `modf` using `cv.glm`. Recall that `modf` was created from regressing `hwfat` onto `abs` and `triceps`.
```{r}
# Your code here
glm.fit <- glm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
```
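One possible completion, mirroring the `K = 5` call used on `DF` above:
```{r}
# One possible completion (sketch)
cv.err <- cv.glm(data = HSWRESTLER, glmfit = glm.fit, K = 8)$delta[1]
cv.err
```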
* Use `caret` and compare answers.
```{r}
# Your Code Here
#
#
#
#
#
#
#
#
#
#
```
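A sketch of the comparison; it repeats the earlier `caret` chunk, and because `caret` draws its own random folds the two estimates will agree only approximately.
```{r}
# One possible comparison (sketch)
model <- train(
  form = hwfat ~ abs + triceps,
  data = HSWRESTLER,
  method = "lm",
  trControl = trainControl(method = "cv", number = 8)
)
model$results$RMSE^2  # compare with cv.err from cv.glm above
```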
## Your Turn
The right side of Figure 5.4 on page 180 of @james_introduction_2013 shows the 10-fold cross-validation approach used on the `Auto` data set in order to estimate the test error that results from predicting `mpg` using polynomial functions of `horsepower` run nine separate times. The code below creates a graph showing one particular run.
```{r}
# Your code here
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error", main = "10-fold CV")
k <- 10 # number of folds
MSPE <- numeric(k)
cv <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
```
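A sketch of one possible run, reusing the fold-assignment idea from the 5-fold example; the seed is arbitrary.
```{r}
# One possible solution (sketch) for a single run of 10-fold CV; seed is arbitrary
set.seed(3)
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error", main = "10-fold CV")
folds <- sample(1:k, size = nrow(Auto), replace = TRUE)
for(i in 1:10){   # polynomial degrees
  for(j in 1:k){  # folds
    mod <- lm(mpg ~ poly(horsepower, i), data = Auto[folds != j, ])
    pred <- predict(mod, newdata = Auto[folds == j, ])
    MSPE[j] <- mean((Auto[folds == j, ]$mpg - pred)^2)
  }
  cv[i] <- weighted.mean(MSPE, table(folds)/length(folds))
}
lines(1:10, cv, col = "red")
points(1:10, cv, col = "red", pch = 19)
```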
* Use a `for` loop to run the above code nine times. The result should look similar to the right side of Figure 5.4 on page 180 of @james_introduction_2013.
```{r}
# Your Code Here
set.seed(123)
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error", main = "10-fold CV")
k <- 10 # number of folds
MSPE <- numeric(k)
cv <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
#
#
# ggplot2 approach
set.seed(123)
cv <- matrix(NA, 10, 9)
k <- 10 # number of folds
MSPE <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
```
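One possible base R solution, wrapping the single-run code in a loop over nine runs; the seed and line colors are arbitrary choices.
```{r}
# One possible solution (sketch): nine separate runs of 10-fold CV
set.seed(123)
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error", main = "10-fold CV")
k <- 10
MSPE <- numeric(k)
cv <- numeric(10)
for(r in 1:9){  # nine separate runs, each with new random folds
  folds <- sample(1:k, size = nrow(Auto), replace = TRUE)
  for(i in 1:10){
    for(j in 1:k){
      mod <- lm(mpg ~ poly(horsepower, i), data = Auto[folds != j, ])
      pred <- predict(mod, newdata = Auto[folds == j, ])
      MSPE[j] <- mean((Auto[folds == j, ]$mpg - pred)^2)
    }
    cv[i] <- weighted.mean(MSPE, table(folds)/length(folds))
  }
  lines(1:10, cv, col = r)
}
```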
* Use the function `cv.glm` to create a graph similar to the right side of Figure 5.4 on page 180 of @james_introduction_2013. One possible solution follows the empty chunk below.
```{r}
# Your Code Here
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error", main = "10-fold CV")
#
#
#
#
#
#
#
#
#
# ggplot2 approach
cv.err <- matrix(NA, 10, 9)
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
```
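A sketch using `cv.glm` instead of a hand-rolled loop; each of the nine runs lets `cv.glm` pick its own random folds.
```{r}
# One possible solution (sketch) with cv.glm
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error", main = "10-fold CV")
for(r in 1:9){
  cv.err <- numeric(10)
  for(i in 1:10){
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv.err[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
  }
  lines(1:10, cv.err, col = r)
}
```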
## Leave-One-Out Cross-Validation
```{r}
set.seed(1)
k <- nrow(DF)
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(DF), replace = FALSE)
# Note that replace changes to FALSE for LOOCV...can you explain why?
for(j in 1:k){
  modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF[folds != j, ])
  pred <- predict(modq, newdata = DF[folds == j, ])
  MSPE[j] <- mean((DF[folds == j, ]$y - pred)^2)
}
mean(MSPE)
```
## Your Turn
* Compute $\textrm{CV}_{(n)}$ for `modf`. Recall that `modf` was created from regressing `hwfat` onto `abs` and `triceps`. One possible solution follows the chunk below.
```{r}
# Your Code Here
set.seed(1)
k <- nrow(HSWRESTLER)
MSPE <- numeric(k)
#
#
#
#
#
#
#
```
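One possible solution sketch; with LOOCV no random fold assignment is needed, since each observation is left out exactly once.
```{r}
# One possible solution (sketch): leave out each observation in turn
for(j in 1:k){
  modf <- lm(hwfat ~ abs + triceps, data = HSWRESTLER[-j, ])
  pred <- predict(modf, newdata = HSWRESTLER[j, ])
  MSPE[j] <- (HSWRESTLER[j, ]$hwfat - pred)^2
}
mean(MSPE)
```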
Recall the LOOCV shortcut for linear models, where $h_i$ is the leverage (hat value) of the $i^{\text{th}}$ observation:
$$\textrm{CV}_{(n)}=\frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$$
```{r}
modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF)
h <- hatvalues(modq)
CVn <- mean(((DF$y - predict(modq))/(1 - h))^2)
CVn
```
## Your Turn
* Compute $\textrm{CV}_{(n)}$ for `modf` using the mathematical shortcut. Recall that `modf` was created from regressing `hwfat` onto `abs` and `triceps`. One possible completion follows the chunk below.
```{r}
# Your Code Here
modf <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
#
```
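One possible completion using `hatvalues()`, mirroring the shortcut computation on `DF` above:
```{r}
# One possible completion (sketch) of the shortcut
h <- hatvalues(modf)
CVn <- mean(((HSWRESTLER$hwfat - predict(modf))/(1 - h))^2)
CVn
```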
## Using `cv.glm` from `boot`
* Note: If the `K` argument (the number of folds) is not supplied, `cv.glm()` defaults to `K = n` and computes LOOCV.
```{r}
library(boot)
glm.fit <- glm(y ~ poly(x, 2, raw = TRUE), data = DF)
cv.err <- cv.glm(data = DF, glmfit = glm.fit)$delta[1]
cv.err
```
* * *
## Your Turn
* Compute $\textrm{CV}_{(n)}$ for `modf` using `cv.glm`. Recall that `modf` was created from regressing `hwfat` onto `abs` and `triceps`. One possible completion follows the chunk below.
```{r}
# Your Code Here
glm.fit <- glm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
```
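One possible completion; omitting `K` gives LOOCV.
```{r}
# One possible completion (sketch): no K argument, so cv.glm performs LOOCV
cv.err <- cv.glm(data = HSWRESTLER, glmfit = glm.fit)$delta[1]
cv.err
```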
## Your Turn
* Create a graph similar to the left side of Figure 5.4 on page 180 of @james_introduction_2013.
```{r}
# Your Code Here
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error")
k <- nrow(Auto) # number of folds
MSPE <- numeric(k)
cv <- numeric(10)
#
#
#
#
#
#
#
#
#
#
#
#
```
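One possible brute-force solution; note it fits roughly $10 \times n$ models, so it is slow.
```{r}
# One possible solution (sketch): brute-force LOOCV for each polynomial degree
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error")
for(i in 1:10){
  for(j in 1:k){
    mod <- lm(mpg ~ poly(horsepower, i), data = Auto[-j, ])
    MSPE[j] <- (Auto$mpg[j] - predict(mod, newdata = Auto[j, ]))^2
  }
  cv[i] <- mean(MSPE)
}
lines(1:10, cv, col = "blue")
points(1:10, cv, col = "blue", pch = 19)
```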
Using the shortcut formula:
```{r}
# Your Code Here
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error")
cv <- numeric(10)
#
#
#
#
#
#
#
```
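One possible solution via the shortcut, which fits each model only once:
```{r}
# One possible solution (sketch): hat-value shortcut, one fit per degree
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(16, 26),
     ylab = "Mean Squared Prediction Error")
for(i in 1:10){
  mod <- lm(mpg ~ poly(horsepower, i), data = Auto)
  h <- hatvalues(mod)
  cv[i] <- mean(((Auto$mpg - predict(mod))/(1 - h))^2)
}
lines(1:10, cv, col = "blue")
points(1:10, cv, col = "blue", pch = 19)
```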
_________________
## References