---
title: "Data summary and analysis"
author: "Seth Mottaghinejad"
output: github_document
date: "`r Sys.Date()`"
---
```{r chap05chunk01, include=FALSE}
source('setup.R')
```
Let's recap where we are in the process:
1. load all the data (and combine them if necessary)
2. inspect the data in preparation for cleaning it
3. clean the data in preparation for analysis
4. add any interesting features or columns as far as they pertain to the analysis
5. **find ways to analyze or summarize the data and report your findings**
Of course, in practice a workflow is not as clean-cut as we have it here; it tends to be circular, in that discovering certain quirks about the data forces us to go back and change the data-cleaning process, add other features, and so on.
We now have a data set that's more or less ready for analysis. In the next section we go over ways we can summarize the data and produce plots and tables. Let's run `str(nyc_taxi)` and `head(nyc_taxi)` again to review all the work we did so far.
```{r chap05chunk02}
str(nyc_taxi)
```
```{r chap05chunk03}
head(nyc_taxi, 3)
```
We divide this chapter into three sections:
1. **Overview of some important statistical summary functions:** This is by no means a comprehensive glossary of statistical functions, but rather a sampling of the important ones and how to use them, how to modify them, and some common patterns among them.
2. **Data summary with `base` R tools:** The `base` R tools for summarizing data are a bit more tedious and some have a different notation or way of passing arguments, but they are also widely used and they can be very efficient if used right.
3. **Data summary with `dplyr`:** `dplyr` offers a consistent and popular notation for processing and summarizing data, and one worth learning on top of `base` R.
To reiterate, the statistical summary functions we cover in section 1 can be used with either of the other two approaches; what differs is the way we query the data using those functions. For that, we review two (mostly interchangeable) ways: one using `base` functions in section 2 and one using the `dplyr` library in section 3.
## Summary functions
We already learned of one all-encompassing summary function, namely `summary`:
```{r chap05chunk04}
summary(nyc_taxi) # summary of the whole data
```
We can use `summary` to run a sanity check on the data and find ways that the data might need to be cleaned in preparation for analysis, but we are now interested in individual summaries. For example, here's how we can find the average fare amount for the whole data.
```{r chap05chunk05}
mean(nyc_taxi$fare_amount) # the average of `fare_amount`
```
By specifying `trim = .10` we can get a 10 percent trimmed average, i.e. the average after throwing out the top and bottom 10 percent of the data:
```{r chap05chunk06}
mean(nyc_taxi$fare_amount, trim = .10) # trimmed mean
```
By default, the `mean` function returns NA if there are any NAs in the data, but we can override that behavior with `na.rm = TRUE`. This same argument shows up in almost all the statistical functions we encounter in this section.
```{r chap05chunk07}
mean(nyc_taxi$trip_duration, na.rm = TRUE) # removes NAs before computing the average
```
We can use `weighted.mean` to find a weighted average. The weights are specified as the second argument, and if we fail to specify anything for weights, we just get a simple average.
```{r chap05chunk08}
weighted.mean(nyc_taxi$tip_percent, nyc_taxi$trip_distance, na.rm = TRUE) # weighted average
```
The `sd` function returns the standard deviation of the data, which is the same as returning the square root of its variance.
```{r chap05chunk09}
sd(nyc_taxi$trip_duration, na.rm = TRUE) # standard deviation
```
```{r chap05chunk10}
sqrt(var(nyc_taxi$trip_duration, na.rm = TRUE)) # standard deviation == square root of variance
```
We can use `range` to get the minimum and maximum of the data at once, or use `min` and `max` individually.
```{r chap05chunk11}
range(nyc_taxi$trip_duration, na.rm = TRUE) # minimum and maximum
```
```{r chap05chunk12}
c(min(nyc_taxi$trip_duration, na.rm = TRUE), max(nyc_taxi$trip_duration, na.rm = TRUE))
```
We can use `median` to return the median of the data.
```{r chap05chunk13}
median(nyc_taxi$trip_duration, na.rm = TRUE) # median
```
The `quantile` function is used to get any percentile of the data, where the percentile is specified by the `probs` argument. For example, letting `probs = .5` returns the median.
```{r chap05chunk14}
quantile(nyc_taxi$trip_duration, probs = .5, na.rm = TRUE) # median == 50th percentile
```
We can specify a vector for `probs` to get multiple percentiles all at once. For example setting `probs = c(.25, .75)` returns the 25th and 75th percentiles.
```{r chap05chunk15}
quantile(nyc_taxi$trip_duration, probs = c(.25, .75), na.rm = TRUE) # IQR == difference b/w 75th and 25th percentiles
```
The difference between the 25th and 75th percentiles is called the inter-quartile range, which we can also get using the `IQR` function.
```{r chap05chunk16}
IQR(nyc_taxi$trip_duration, na.rm = TRUE) # interquartile range
```
Let's look at a common bivariate summary statistic for numeric data: correlation.
```{r chap05chunk17}
cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs")
```
We can use the `method` argument to switch from Pearson's correlation to Spearman's rank correlation.
```{r chap05chunk18}
cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs", method = "spearman")
```
Why does the Spearman correlation coefficient take so much longer to compute?
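One way to see the difference for yourself is to time both calls; here is a rough sketch (exact runtimes will of course vary by machine and data size):
```{r, eval=FALSE}
# time both correlation calls on the same pair of columns
system.time(cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs"))
system.time(cor(nyc_taxi$trip_distance, nyc_taxi$trip_duration, use = "complete.obs",
                method = "spearman"))
```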
So far we've examined functions for summarizing numeric data. Let's now shift our attention to categorical data. We already saw that we can use `table` to get counts for each level of a `factor` column.
```{r chap05chunk19}
table(nyc_taxi$pickup_nhood) # one-way table
```
When we pass more than one column to `table`, we get counts for each *combination* of the factor levels. For example, with two columns we get counts for each combination of the levels of the first factor and the second factor. In other words, we get a two-way table.
```{r chap05chunk20}
two_way <- with(nyc_taxi, table(pickup_nhood, dropoff_nhood)) # two-way table: an R `matrix`
two_way[1:5, 1:5]
```
What about a three-way table? A three-way table (or n-way table where n is an integer) is represented in R by an object we call an `array`. A vector is a one-dimensional array, a matrix is a two-dimensional array, and a three-way table is a kind of three-dimensional array.
```{r chap05chunk21}
arr_3d <- with(nyc_taxi, table(pickup_dow, pickup_hour, payment_type)) # a three-way table, an R 3D `array`
arr_3d
```
Let's see how we query a 3-dimensional `array`: because it has three dimensions, we need to index it across three different dimensions:
```{r chap05chunk22}
arr_3d[3, 2, 2] # give me the 3rd row, 2nd column, 2nd 'page'
```
Just as with a `data.frame`, leaving out the index for one of the dimensions returns all the values for that dimension.
```{r chap05chunk23}
arr_3d[ , , 2]
```
We can use the names of the dimensions instead of their numeric index:
```{r chap05chunk24}
arr_3d['Tue', '5AM-9AM', 'cash']
```
We can turn the `array` representation into a `data.frame` representation:
```{r chap05chunk25}
df_arr_3d <- as.data.frame(arr_3d) # same information, formatted as data frame
head(df_arr_3d)
```
We can subset the `data.frame` using the `subset` function:
```{r chap05chunk26}
subset(df_arr_3d, pickup_dow == 'Tue' & pickup_hour == '5AM-9AM' & payment_type == 'cash')
```
Notice how the `array` notation is more terse, but not as readable (because we need to remember the order of the dimensions).
We can use `apply` to aggregate a multidimensional array along some dimension(s). The second argument to `apply` specifies the dimension(s) we group by: the summary function is applied to the values across all the remaining dimensions, producing one result per level of the dimension(s) we keep.
```{r chap05chunk27}
apply(arr_3d, 2, sum) # `pickup_hour` is the second dimension, so we get one total per pickup hour
```
Once again, when the dimensions have names it is better to use the names instead of the numeric index.
```{r chap05chunk28}
apply(arr_3d, "pickup_hour", sum) # same as above, but more readable notation
```
So in the above example, we used `apply` to collapse a 3D `array` into a vector of totals, one for each level of the second dimension (the dimension representing pick-up hour), by summing over the other two dimensions.
We can use `prop.table` to turn the counts returned by `table` into proportions. The `prop.table` function has a second argument. When we leave it out, we get proportions for the grand total of the table.
```{r chap05chunk29}
prop.table(arr_3d) # as a proportion of the grand total
```
For proportions out of marginal totals, we provide the second argument to `prop.table`. For example, specifying 1 as the second argument gives us proportions out of "row" totals. Recall that in a 3D object, a "row" is a 2D object, for example `arr_3d[1, , ]` is the first "row", `arr_3d[2, , ]` is the second "row" and so on.
```{r chap05chunk30}
prop.table(arr_3d, 1) # as a proportion of 'row' totals, or marginal totals for the first dimension
```
We can confirm this by using `apply` to run the `sum` function across the first dimension to make sure that they all add up to 1.
```{r chap05chunk31}
apply(prop.table(arr_3d, 1), 1, sum) # check that across rows, proportions add to 1
```
Similarly, if the second argument to `prop.table` is 2, we get proportions that add up to 1 across the values of the 2nd dimension. Since the second dimension corresponds to pick-up hour, for each pickup-hour, we get the proportion of observations that fall into each pick-up day of week and payment type combination.
```{r chap05chunk32}
prop.table(arr_3d, 2) # as a proportion of column totals
```
Which once again we can double-check with `apply`:
```{r chap05chunk33}
apply(prop.table(arr_3d, 2), 2, sum) # check that across columns, proportions add to 1
```
Finally, if the second argument to `prop.table` is 3, we get proportions that add up to 1 across the values of the 3rd dimension. So for each payment type, the proportions now add up to 1.
```{r chap05chunk34}
prop.table(arr_3d, 3) # as a proportion of totals across third dimension
```
Both `prop.table` and `apply` also accept combinations of dimensions as the second argument. This makes them powerful tools for aggregation, as long as we're careful. For example, letting the second argument be `c(1, 2)` gives us proportions that add up to 1 for each combination of "row" and "column". In other words, we get the percentage of card vs cash payments for each pick-up day of week and hour combination.
```{r chap05chunk35}
prop.table(arr_3d, c(1, 2)) # as a proportion of totals for each combination of 1st and 2nd dimensions
```
### Exercises
(1) The `trim` argument for the `mean` function is two-sided. Let's build a one-sided trimmed mean function, and one that uses counts instead of percentiles. Call it `mean_except_top_10`. For example, `mean_except_top_10(x, 5)` will throw out the highest 5 values of `x` before computing the average. HINT: you can sort `x` using the `sort` function.
```{r chap05chunk36, eval=FALSE}
mean_except_top_10(c(1, 5, 3, 99), 1) # should return 3
```
---
We just learned that the `probs` argument of `quantile` can be a vector. So instead of getting multiple quantiles separately, such as
```{r chap05chunk37}
c(quantile(nyc_taxi$trip_distance, probs = .9),
quantile(nyc_taxi$trip_distance, probs = .6),
quantile(nyc_taxi$trip_distance, probs = .3))
```
we can get them all at once by passing the percentiles we want as a single vector to `probs`:
```{r chap05chunk38}
quantile(nyc_taxi$trip_distance, probs = c(.3, .6, .9))
```
As it turns out, there's a considerable difference in efficiency between the first and second approach. We explore this in this exercise:
There are two important tools we can use when considering efficiency:
- **profiling** is a helpful tool if we need to understand what a function does under the hood (good for finding bottlenecks)
- **benchmarking** is the process of comparing multiple functions to see which is faster
Both of these tools can be slow when working with large datasets (especially the benchmarking tool), so instead we create a vector of random numbers and use that for testing (alternatively, we could use a sample of the data). We want the vector to be big enough that test results are stable (not due to chance), but small enough that they will run within a reasonable time frame.
Let's begin by profiling, for which we rely on the `profr` library:
```{r chap05chunk39}
random_vec <- rnorm(10^6) # a million random numbers generated from a standard normal distribution
library(profr)
my_test_function <- function(){
quantile(random_vec, p = seq(0, 1, by = .01))
}
p <- profr(my_test_function())
plot(p)
```
(2) Describe what the plot is telling us: what is the bottleneck in getting quantiles?
---
Now on to benchmarking: we compare two functions, `first` and `scond`. `first` finds the 30th, 60th, and 90th percentiles of the data in one function call, but `scond` uses three separate function calls, one for each percentile. From the profiling tool, we now know that every time we compute percentiles we need to sort the data, and that sorting the data is the most time-consuming part of the calculation. The benchmarking tool should show that `first` is about three times more efficient than `scond`, because `first` sorts the data once and finds all three percentiles, whereas `scond` sorts the data three separate times and finds one percentile each time.
```{r chap05chunk40}
first <- function(x) quantile(x, probs = c(.3, .6, .9)) # get all percentiles at the same time
scond <- function(x) {
c(
quantile(x, probs = .9),
quantile(x, probs = .6),
quantile(x, probs = .3))
}
library(microbenchmark) # makes benchmarking easy
print(microbenchmark(
first(random_vec), # vectorized version
scond(random_vec), # non-vectorized
times = 10))
```
(3) Describe what the results say. Do the runtimes bear out our intuition?
### Solutions
(1) We can sort the vector in reverse order, throw out the top n entries and then compute the average.
```{r chap05chunk41}
mean_except_top_10 <- function(x, n) {
mean(-sort(-x)[-(1:n)], na.rm = TRUE)
}
mean_except_top_10(c(1, 5, 3, 99), 1)
```
(2) Based on the plot, we can see that almost all of the runtime of `my_test_function` is spent on sorting the vector. Sorting is a computationally intensive process, which is why many algorithms have been devised to sort data efficiently.
(3) We can look at the `mean` column to compare the average runtime of running `first` 10 times (recall that we set `times = 10`) to the average runtime for `scond`. We can see that `scond` takes about twice as long.
## Data summary in `base` R
One of the most important sets of functions in `base` R are the `apply` family of functions: we learned about `apply` earlier, and learn about `sapply`, `lapply`, and `tapply` in this section (there are more of them, but we won't cover them all).
- We already learned how `apply` runs a summary function across any dimension of an `array`
- `sapply` and `lapply` allow us to apply a summary function to multiple columns of the data at once; using them means we can type less and avoid writing loops.
- `tapply` is used to run a summary function on a column of the data, but group the result by other columns of the data
Say we were interested in obtaining summary statistics for all the columns listed in the vector `trip_metrics`:
```{r chap05chunk42}
trip_metrics <- c('passenger_count', 'trip_distance', 'fare_amount', 'tip_amount', 'trip_duration', 'tip_percent')
```
We can use either `sapply` or `lapply` for this task. In fact, `sapply` and `lapply` have an identical syntax, but the difference is in the type of output they return. Let's first look at `sapply`, which generally organizes the results in a tidy format (usually a vector or a matrix):
```{r chap05chunk43}
s_res <- sapply(nyc_taxi[ , trip_metrics], mean)
s_res
```
One of the great advantages of the `apply` family of functions is that, in addition to the summary function itself, we can pass along any other arguments the summary function takes. Notice how we pass `na.rm = TRUE` to `sapply` here so that missing values are removed from the data before we compute the means.
```{r chap05chunk44}
s_res <- sapply(nyc_taxi[ , trip_metrics], mean, na.rm = TRUE)
s_res
```
The object `sapply` returns in this case is a vector: `mean` is a summary function that returns a single number, and `sapply` applies `mean` to multiple columns, returning a **named vector** with the means as its elements and the original column names preserved. Because `s_res` is a named vector, we can query it by name:
```{r chap05chunk45}
s_res["passenger_count"] # we can query the result object by name
```
Now let's see what `lapply` does: unlike `sapply`, `lapply` makes no attempt to organize the results. Instead, it always returns a `list` as its output. A `list` is a very "flexible" data type, in that anything can be "dumped" into it.
```{r chap05chunk46}
l_res <- lapply(nyc_taxi[ , trip_metrics], mean)
l_res
```
In this case, we can 'flatten' the `list` with the `unlist` function to get the same result as `sapply`.
```{r chap05chunk47}
unlist(l_res) # this 'flattens' the `list` and returns what `sapply` returns
```
Querying a `list` is a bit more complicated. We use one bracket to query a `list`, but the return object is still a `list`; in other words, with a single bracket, we get a sublist.
```{r chap05chunk48}
l_res["passenger_count"] # this is still a `list`
```
If we want to return the object itself, we use two brackets.
```{r chap05chunk49}
l_res[["passenger_count"]] # this is the average count itself
```
The above distinction is not very important when all we want to do is look at the result. But when we need to perform more computations on the results we obtained, the distinction is crucial. For example, recall that both `s_res` and `l_res` store column averages for the data. Say now that we wanted to take the average for passenger count and add 1 to it, so that the count includes the driver too. With `s_res` we do the following:
```{r chap05chunk50}
s_res["passenger_count"] <- s_res["passenger_count"] + 1
s_res
```
With `l_res` using a single bracket fails, because `l_res["passenger_count"]` is still a `list` and we can't add 1 to a `list`.
```{r chap05chunk51, error=TRUE}
l_res["passenger_count"] <- l_res["passenger_count"] + 1
```
So we need to use two brackets to perform the same operation on `l_res`.
```{r chap05chunk52}
l_res[["passenger_count"]] <- l_res[["passenger_count"]] + 1
l_res
```
Let's now look at our last function in the `apply` family, namely `tapply`: we use `tapply` to apply a function to a column, **but group the results by the values of other columns.**
```{r chap05chunk53}
tapply(nyc_taxi$tip_amount, nyc_taxi$pickup_nhood, mean, trim = 0.1, na.rm = TRUE) # trimmed average tip, by pickup neighborhood
```
We can group the results by pickup and dropoff neighborhood pairs, by combining those two columns into one. For example, the `paste` function concatenates the pick-up and drop-off neighborhoods into a single string. The result is a flat vector with one element for each pick-up and drop-off neighborhood combination.
```{r chap05chunk54}
flat_array <- tapply(nyc_taxi$tip_amount,
paste(nyc_taxi$pickup_nhood, nyc_taxi$dropoff_nhood, sep = " to "),
mean, trim = 0.1, na.rm = TRUE)
head(flat_array)
```
By putting both grouping columns in a `list` we can get an `array` (a 2D `array` or `matrix` in this case) instead of the flat vector we got earlier.
```{r chap05chunk55}
square_array <- tapply(nyc_taxi$tip_amount,
list(nyc_taxi$pickup_nhood, nyc_taxi$dropoff_nhood),
mean, trim = 0.1, na.rm = TRUE)
square_array[1:5, 1:5]
```
### Exercises
Let's look at two other cases of using `sapply` vs `lapply`, one involving `quantile` and one involving `unique`.
```{r chap05chunk56}
qsap1 <- sapply(nyc_taxi[ , trip_metrics], quantile, probs = c(.01, .05, .95, .99), na.rm = TRUE)
qlap1 <- lapply(nyc_taxi[ , trip_metrics], quantile, probs = c(.01, .05, .95, .99), na.rm = TRUE)
```
(1) Query `qsap1` and `qlap1` for the 5th and 95th percentiles of `trip_distance` and `trip_duration`.
Let's now try the same, but this time pass the `unique` function to both, which returns the unique values in the data for each of the columns.
```{r chap05chunk57}
qsap2 <- sapply(nyc_taxi[ , trip_metrics], unique)
qlap2 <- lapply(nyc_taxi[ , trip_metrics], unique)
```
(2) Query `qsap2` and `qlap2` to show the distinct values of `passenger_count` and `tip_percent`. Can you tell why `sapply` and `lapply` both return lists in the second case?
(3) Use `qlap2` to find the number of unique values for each column.
### Solutions
(1) Because `qsap1` is a matrix, we can query it the same way we query any n-dimensional `array`:
```{r chap05chunk58}
qsap1[c('5%', '95%'), c('trip_distance', 'trip_duration')]
```
Since `qlap1` is a list with one element for each column of the data, we use two brackets to extract the percentiles for each column separately. Moreover, because the percentiles themselves are stored in a named vector, we can pass the names of the percentiles we want in a single bracket to get the desired result.
```{r chap05chunk59}
qlap1[['trip_distance']][c('5%', '95%')]
```
```{r chap05chunk60}
qlap1[['trip_duration']][c('5%', '95%')]
```
(2) In this case, `sapply` and `lapply` both return a `list`, simply because there is no other way for `sapply` to organize the results. We can just return the results for `passenger_count` and `tip_percent` as a sublist.
```{r chap05chunk61}
qsap2[c('passenger_count', 'tip_percent')]
```
(3) Since we have the unique values for each column stored in `qlap2`, we can just run the `length` function to count how many unique values each column has. For example, for `passenger_count` we have
```{r chap05chunk62}
length(qlap2[['passenger_count']]) # don't forget the double bracket here!
```
But we want to do this automatically for all the columns at once. The solution is to use `sapply`. So far we've been using `sapply` and `lapply` with the dataset as input, but we can just as well feed them any list, such as `qlap2`, and apply a function to each element of that list (as long as doing so doesn't result in an error for any of the list's elements).
```{r chap05chunk63}
sapply(qlap2, length)
```
The above exercise offers a glimpse of how powerful R can be at quickly and succinctly processing the basic data types, as long as we write good functions and use the `apply` family of functions to iterate through them. A good goal to set for yourself as an R programmer is to increase your reliance on the `apply` family of functions to run your code.
## Writing summary functions
As we use R more and more, we will see that a lot of R functions return a `list` as output (or something that is fundamentally a `list` but looks cosmetically different). In fact, as it happens, a `data.frame` is also just a kind of `list`, with each element of the list corresponding to a column of the `data.frame`, and **all elements having the same length**. Why would a `data.frame` be a `list` and not a `matrix`? Because, like a `vector`, a `matrix` or any `array` is **atomic**, meaning that its elements must be of the same type (usually `numeric`). Notice what happens if we try to force a vector to have one `character` element and one `numeric` one:
```{r chap05chunk64}
c("one", 1)
```
The second element was **coerced** into the string "1". A `list` will not complain about this:
```{r chap05chunk65}
list("one", 1)
```
Since columns of a `data.frame` can be of different types, it makes sense that under the hood a `data.frame` is really just a `list`. We can check that a `data.frame` is a kind of list **under the hood** by using the `typeof` function instead of `class`:
```{r chap05chunk66}
class(nyc_taxi)
```
```{r chap05chunk67}
typeof(nyc_taxi)
```
This **flexibility** is the reason functions that return lots of loosely-related results return them as a single list. This includes most functions that perform various statistical tests, such as the `lm` function.
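For instance, the object that `lm` returns is, under the hood, a named `list` whose loosely-related pieces (coefficients, residuals, and so on) we can query just like any other list. Here is a small sketch; the model formula is chosen purely for illustration:
```{r, eval=FALSE}
# fit a simple model and inspect the list-like structure of the result
fit <- lm(tip_amount ~ trip_distance, data = nyc_taxi)
class(fit) # "lm"
typeof(fit) # "list" under the hood, just like a data.frame
names(fit) # the loosely-related pieces stored in the list
fit$coefficients # extract a single element, the same way we query any list
```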
We can also write our own summary functions and demonstrate this. In section 6.1, we focused on single summaries (such as `mean`) or multiple related ones (such as `quantile`), but now we want to write a function that combines different summaries and returns all of them at once. The trick is basically to wrap everything into a `list` and return the `list`. The function `my.summary` shown here is an example of such a function. It consists mostly of separate but related summaries that are calculated piece-wise and then put together into a list and returned by the function.
```{r chap05chunk68}
my.summary <- function(grp_1, grp_2, resp) {
# `grp_1` and `grp_2` are `character` or `factor` columns
# `resp` is a numeric column
mean <- mean(resp, na.rm = TRUE) # mean
sorted_resp <- sort(resp)
n <- length(resp)
mean_minus_top = mean(sorted_resp[1:(n-19)], na.rm = TRUE) # average after throwing out highest 20 values
tt_1 <- table(grp_1, grp_2) # two-way table of counts for `grp_1` by `grp_2`
ptt_1 <- prop.table(tt_1, 1) # proportions out of row totals (each level of `grp_1`)
ptt_2 <- prop.table(tt_1, 2) # proportions out of column totals (each level of `grp_2`)
tt_2 <- tapply(resp, list(grp_1, grp_2), mean, na.rm = TRUE)
# return everything as a list:
list(mean = mean,
trimmed_mean = mean_minus_top,
row_proportions = ptt_1,
col_proportions = ptt_2,
average_by_group = tt_2
)
}
my.summary(nyc_taxi$pickup_dow, nyc_taxi$pickup_hour, nyc_taxi$tip_amount) # test the function
```
Looking at the above result, we can see that something went wrong with the trimmed mean: the trimmed mean and the mean appear to be the same, which is very unlikely. It's not obvious what the bug is. Take a moment and try to find out what the problem is and propose a fix.
One thing that makes it hard to debug the function is that we do not have direct access to its **environment**. We need a way to "step inside" the function and run it line by line so we can see where the problem is. This is what `debug` is for.
```{r chap05chunk69, eval=FALSE}
debug(my.summary) # puts the function in debug mode
```
Now, anytime we run the function we leave our current "global" environment and step into the function's environment, where we have access to all the local variables in the function as we run the code line-by-line.
```{r chap05chunk70, eval=FALSE}
my.summary(nyc_taxi$pickup_dow, nyc_taxi$pickup_hour, nyc_taxi$tip_amount)
```
We start at the beginning, where the only things evaluated are the function's arguments. We can press ENTER to run the next line. After running each line, we can query the object to see if it looks like it should. We can always go back to the global environment by pressing Q and ENTER. If you were unsuccessful at fixing the bug earlier, take a second stab at it now. (HINT: it has something to do with NAs.)
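For reference, here is a short summary of the commands available at the `Browse[k]>` prompt once we are inside the debugger (these belong to base R's debugger, not to `my.summary` itself):
```{r, eval=FALSE}
# commands at the Browse[k]> prompt:
# n (or ENTER)  run the next line of the function
# c             continue running to the end of the function
# ls()          list the local variables in the function's environment
# where         print the call stack
# Q             quit the debugger and return to the global environment
```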
Once we resolve the issue, we run `undebug` so the function can now run normally.
```{r chap05chunk71, eval=FALSE}
undebug(my.summary)
```
To run `my.summary` on multiple numeric columns at once, we can use `lapply`:
```{r chap05chunk72}
res <- lapply(nyc_taxi[ , trip_metrics], my.summary, grp_1 = nyc_taxi$pickup_dow, grp_2 = nyc_taxi$pickup_hour)
```
`res` is just a nested `list` and we can 'drill into' any individual piece we want with the right query. At the first level are the column names.
```{r chap05chunk73}
res$tip_amount$col_proportions # the next level has the statistics that the function outputs.
```
```{r chap05chunk74}
res$tip_amount$col_proportions["Mon", "9AM-12PM"]
```
Since `res` contains a lot of summaries, it might be a good idea to save it using `save`. Any R object can be saved with `save`. This way if our R session crashes we can reload the object with the `load` function.
```{r chap05chunk75}
save(res, file = "res.RData") # save this result
rm(res) # it is now safe to delete `res` from the current session
load(file = "res.RData") # we can use `load` to reopen it anytime we need it again
file.remove("res.RData") # delete the file
```
### Exercises
Debug the summary function `my.summary` in the last section so that the trimmed mean returns the correct result.
HINT: Notice what `sort` does to missing values.
```{r chap05chunk76}
sort(c(5, 1, NA, NA, 2))
```
### Solutions
This is a tricky type of "bug" because nothing in R went wrong and no error message was returned. Instead, we simply overlooked something in the code. As it turns out, when we pass a vector with missing values to `sort`, `sort` returns the sorted vector with the missing values **removed**. But if missing values are removed, then the length of the vector changes and we need to recompute it. This is a quick fix: we need to change `n <- length(resp)` in the body of the function to `n <- length(sorted_resp)`.
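In other words, after the fix the relevant lines in the body of `my.summary` would look something like this (only the affected lines are shown):
```{r, eval=FALSE}
sorted_resp <- sort(resp) # `sort` drops missing values by default
n <- length(sorted_resp) # use the length of the sorted, NA-free vector
mean_minus_top = mean(sorted_resp[1:(n - 19)], na.rm = TRUE)
```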
## Data summary with `dplyr`
When it comes to summarizing data, we have a lot of options. We covered just a few in the last section, but there are many more functions in both `base` R and third-party packages. We will cover `dplyr` in this section, as an example of a third-party package. What makes `dplyr` very popular is its simple and straightforward notation for creating increasingly complex data pipelines.
First let's review some important functions in `dplyr`: `filter`, `mutate`, `transmute`, `group_by`, `select`, `slice`, `summarize`, `distinct`, `arrange`, `rename`, `inner_join`, `full_join`, `left_join`. With each of the above functions, we can either pass the data directly to the function or let it be inferred from the pipeline. Here's an example of `filter` being used in both ways. In the first case we pass the data as the first argument to `filter`.
```{r chap05chunk77}
library(dplyr)
head(filter(nyc_taxi, fare_amount > 500)) # pass data directly to the function
```
In the second case, we start a pipeline with the data, followed by the piping function `%>%`, followed by `filter` which now inherits the data from the previous step and only needs the filtering condition.
```{r chap05chunk78}
nyc_taxi %>% filter(fare_amount > 500) %>% head # infer the data from the pipeline
```
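The other verbs follow the same pattern, either taking the data as their first argument or inheriting it from the pipeline. As a small sketch (column names come from the dataset above, and `fare_per_mile` is a column we create only for illustration), here are `select`, `mutate`, and `arrange` chained together:
```{r, eval=FALSE}
nyc_taxi %>%
  filter(trip_distance > 0) %>% # keep only trips with a positive distance
  select(pickup_datetime, trip_distance, fare_amount) %>% # keep a few columns
  mutate(fare_per_mile = fare_amount / trip_distance) %>% # add a new column on the fly
  arrange(desc(fare_per_mile)) %>% # sort by the new column, descending
  head
```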
Piping is especially useful for longer pipelines. Here's an example of a query without piping.
```{r chap05chunk79}
summarize( # (3)
group_by( # (2)
filter(nyc_taxi, fare_amount > 500), # (1)
payment_type),
ave_duration = mean(trip_duration), ave_distance = mean(trip_distance))
```
To understand the query, we need to work from the inside out:
1. First filter the data to show only fare amounts above \$500
2. Group the resulting data by payment type
3. For each group find average trip duration and trip distance
The same query, using piping, looks like this:
```{r chap05chunk80}
nyc_taxi %>%
filter(fare_amount > 500) %>% # (1)
group_by(payment_type) %>% # (2)
summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) # (3)
```
Instead of working from the inside out, piping allows us to read the code from top to bottom. This makes it easier (1) to understand what the query does and (2) to build upon the query.
The best way to learn `dplyr` is by example. So instead of covering functions one by one, we state some interesting queries and use `dplyr` to implement them. There are obvious parallels between `dplyr` and the SQL language, but important differences exist too. We point out some of those differences along the way.
### Exercises
(1) In the following query, we want to add a fourth step: sort the results by descending average trip duration. The `dplyr` function for sorting is `arrange`. For example, `arrange(data, x1, desc(x2))` will sort `data` by increasing values of `x1` and decreasing values of `x2` within each value of `x1`.
Implement this fourth step in both versions of the code (with and without the pipeline), which are shown here:
```{r chap05chunk81, eval=FALSE}
summarize( # (3)
group_by( # (2)
filter(nyc_taxi, fare_amount > 500), # (1)
payment_type),
ave_duration = mean(trip_duration), ave_distance = mean(trip_distance))
```
```{r chap05chunk82, eval=FALSE}
nyc_taxi %>%
filter(fare_amount > 500) %>% # (1)
group_by(payment_type) %>% # (2)
summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) # (3)
```
The remaining exercises are questions about the data that need to be translated into a `dplyr` pipeline. The goal of the exercise is two-fold: learn to break down a question into multiple pieces and learn to translate each piece into a line in `dplyr`, which together comprise the pipeline.
(2) What are the pick-up times of the day and the days of the week with the highest average fare per mile of ride?
(3) For each pick-up neighborhood, find the number and percentage of trips that "fan out" into other neighborhoods. Sort results by pickup neighborhood and descending percentage. Limit results to top 50 percent coverage. In other words, show only the top 50 percent of destinations for each pick-up neighborhood.
(4) Are any dates missing from the data?
(5) Find the 3 consecutive days with the highest total number of trips.
(6) Get the average, standard deviation, and mean absolute deviation of `trip_distance` and `trip_duration`, as well as the ratio of `trip_duration` over `trip_distance`. Results should be broken up by `pickup_nhood` and `dropoff_nhood`.
### Solutions
(1) Without the pipeline function, we would have `arrange` as the outermost function:
```{r chap05chunk83}
arrange( # (4)
summarize( # (3)
group_by( # (2)
filter(nyc_taxi, fare_amount > 500), # (1)
payment_type),
ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)),
desc(ave_duration))
```
With the pipeline function, we simply add the pipe to the end of `summarize` and add `arrange` as a new line to the end of the code:
```{r chap05chunk84}
q1 <- nyc_taxi %>%
filter(fare_amount > 500) %>% # (1)
group_by(payment_type) %>% # (2)
summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) %>% # (3)
arrange(desc(ave_duration)) # (4)
head(q1)
```
(2) What are the times of the day and the days of the week with the highest fare per mile of ride?
```{r chap05chunk85}
q2 <- nyc_taxi %>%
filter(trip_distance > 0) %>%
group_by(pickup_dow, pickup_hour) %>%
summarize(ave_fare_per_mile = mean(fare_amount / trip_distance, na.rm = TRUE), count = n()) %>%
group_by() %>% # we 'reset', or remove, the group by, otherwise sorting won't work
arrange(desc(ave_fare_per_mile))
head(q2)
```
(3) For each pick-up neighborhood, find the number and percentage of trips that "fan out" into other neighborhoods. Sort results by pickup neighborhood and descending percentage. Limit results to top 50 percent coverage. In other words, show only the top 50 percent of destinations for each pick-up neighborhood.
```{r chap05chunk86}
q3 <- nyc_taxi %>%
filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
group_by(pickup_nhood, dropoff_nhood) %>%
summarize(count = n()) %>%
group_by(pickup_nhood) %>%
mutate(proportion = prop.table(count),
cum.prop = order_by(desc(proportion),
cumsum(proportion))) %>%
group_by() %>%
arrange(pickup_nhood, desc(proportion)) %>%
group_by(pickup_nhood) %>%
filter(row_number() < 11 | cum.prop < .50)
head(q3)
```
(4) Are any dates missing from the data?
There are many ways to answer this query and we cover three, because each way highlights an important point. The first way consists of sorting the data by date and using the `lag` function to find the difference between each date and the date preceding it. If this difference is greater than 1, then we skipped one or more days.
```{r chap05chunk87}
nyc_taxi %>%
select(pickup_datetime) %>%
distinct(date = as.Date(pickup_datetime)) %>%
arrange(date) %>% # this is an important step!
mutate(diff = date - lag(date)) %>%
arrange(desc(diff))
```
The second solution is more involved. First we create a `data.frame` of all dates available in `nyc_taxi`.
```{r chap05chunk88}
nyc_taxi %>%
select(pickup_datetime) %>%
distinct(date = as.Date(pickup_datetime)) %>%
filter(!is.na(date)) -> data_dates
```
Then we create a new `data.frame` of all dates that span the time range in the data. We can use `seq` to do that.
```{r chap05chunk89}
start_date <- min(data_dates$date)
end_date <- max(data_dates$date)
all_dates <- data.frame(date = seq(start_date, end_date, by = '1 day'))
```
Finally, we ask for the "anti-join" of the two datasets. An anti-join returns any rows of the left dataset whose keys are not present in the right dataset.
```{r chap05chunk90}
anti_join(all_dates, data_dates, by = 'date') # dates in `all_dates` that are missing from `data_dates`
```
The third solution consists of comparing the number of days between the earliest and latest dates in the data to the number of days we expect to see if no days were missing.
```{r chap05chunk91}
nyc_taxi %>%
distinct(date = as.Date(pickup_datetime)) %>%
filter(!is.na(date)) %>%
summarize(min_date = min(date), max_date = max(date), n = n()) %>%
mutate(diff = max_date - min_date + 1)
```
(5) Find the 3 consecutive days with the highest total number of trips.
This is a hard exercise. In this query, we need to compute rolling statistics (rolling sums in this case). There are functions in R that we can use for that purpose, but one of the advantages of R is that writing our own functions is not always that hard. Write a function called `rolling_sum` that takes two arguments: `x`, a numeric vector, and `nlag`, a positive integer giving the number of preceding elements included in each window. The function returns a vector of the same length as `x`, containing the rolling sum of `x` over windows of `nlag + 1` consecutive elements (the current element plus the `nlag` elements before it).
For example, given `x <- 1:6` and `nlag <- 2` as inputs, the function returns `c(NA, NA, 6, 9, 12, 15)`.
```{r chap05chunk92}
rolling_sum <- function(x, nlag) {
stopifnot(nlag > 0, nlag < length(x))
c(rep(NA, nlag), sapply((nlag + 1):length(x), function(ii) sum(x[(ii - nlag):ii])))
}
# Here's an easy test to see if things seem to be working:
rolling_sum(rep(1, 100), 10) # Should return 10 NAs followed by 90 entries that are all 11
```
We can even go one step further. Let's rename the function to `rolling` and add a third argument to it called `FUN`, allowing us to specify any rolling function, not just `sum`.
```{r chap05chunk93}
rolling <- function(x, nlag, FUN) {
stopifnot(nlag > 0, nlag < length(x))
c(rep(NA, nlag), sapply((nlag + 1):length(x), function(ii) FUN(x[(ii - nlag):ii])))
}
```
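As a quick sanity check, we can compare the generalized `rolling` to `rolling_sum` on a small vector and try a different summary function:
```{r, eval=FALSE}
rolling(1:6, 2, sum) # should return NA NA 6 9 12 15, matching rolling_sum(1:6, 2)
rolling(1:6, 2, mean) # same windows, but averaged instead of summed
```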
We can now use `rolling` to find the 3 consecutive days with the highest total number of trips.
```{r chap05chunk94}
nlag <- 2
q5 <- nyc_taxi %>%
filter(!is.na(pickup_datetime)) %>%
transmute(end_date = as.Date(pickup_datetime)) %>%
group_by(end_date) %>%
summarize(n = n()) %>%
group_by() %>%
mutate(start_date = end_date - nlag, cn = rolling(n, nlag, sum)) %>%
arrange(desc(cn)) %>%
select(start_date, end_date, n, cn) %>%
top_n(10, cn)
head(q5)
```
As it turns out, there's already a similar function we could have used, called `rollapply` in the `zoo` package. Sometimes it pays off to take the time and search for the right function, especially if what we're trying to do is common enough that we should not have to "reinvent the wheel".
```{r chap05chunk95}
library(zoo)
rollapply(1:10, 3, sum, fill = NA, align = 'right')
```
Here's an alternative solution: we could have run the above query without `rolling` by just using the `lag` function, but the code is more complicated and harder to automate for different values of `nlag` or for functions other than `sum`. Here's how we can rewrite the above query with `lag`:
```{r chap05chunk96}
q5 <- nyc_taxi %>%
filter(!is.na(pickup_datetime)) %>%
transmute(end_date = as.Date(pickup_datetime)) %>%
group_by(end_date) %>%
summarize(n = n()) %>%
group_by() %>%
mutate(start_date = end_date - 2, # the 3-day window covers end_date - 2 through end_date
n_lag_1 = lag(n), n_lag_2 = lag(n, 2),
cn = n + n_lag_1 + n_lag_2) %>%
arrange(desc(cn)) %>%
select(start_date, end_date, n, cn) %>%
top_n(10, cn)
head(q5)
```
(6) Get the average, standard deviation, and mean absolute deviation of `trip_distance` and `trip_duration`, as well as the ratio of `trip_duration` over `trip_distance`. Results should be broken up by `pickup_nhood` and `dropoff_nhood`.
Here's how we compute the mean absolute deviation:
```{r chap05chunk97}
mad <- function(x) mean(abs(x - median(x))) # one-liner functions don't need curly braces
```
This query can easily be written with the tools we learned so far.
```{r chap05chunk98}
q6 <- nyc_taxi %>%
filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
group_by(pickup_nhood, dropoff_nhood) %>%
summarize(mean_trip_distance = mean(trip_distance, na.rm = TRUE),
mean_trip_duration = mean(trip_duration, na.rm = TRUE),
sd_trip_distance = sd(trip_distance, na.rm = TRUE),
sd_trip_duration = sd(trip_duration, na.rm = TRUE),
mad_trip_distance = mad(trip_distance),
mad_trip_duration = mad(trip_duration))
head(q6)
```
You may have noticed that the query we wrote in the last exercise was a little tedious and repetitive. Let's now see a way of rewriting the query using some "shortcut" functions available in `dplyr`:
- When we apply the same summary function(s) to the same column(s) of the data, we can save a lot of time typing by using `summarize_each` instead of `summarize`. There is also a `mutate_each` function.
- We can select `trip_distance` and `trip_duration` automatically using `starts_with('trip_')`, since they are the only columns that begin with that prefix. This can be a time-saver when we are selecting lots of columns at once (and have named them in a smart way). There are other helper functions called `ends_with` and `contains`.
- Instead of defining the `mad` function separately, we can define it in-line. In fact, there's a shortcut whereby we just name the function and provide the body of the function, replacing `x` with a period.
```{r chap05chunk99}
q6 <- nyc_taxi %>%
filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
group_by(pickup_nhood, dropoff_nhood) %>%
summarize_each(
funs(mean, sd, mad = mean(abs(. - median(.)))), # all the functions that we apply to the data are listed here
starts_with('trip_'), # `trip_distance` and `trip_duration` are the only columns that start with `trip_`
wait_per_mile = trip_duration / trip_distance) # `wait_per_mile` is created on the fly
head(q6)
```
We can do far more with `dplyr` but we leave it at this for an introduction. The goal was to give the user enough `dplyr` to develop an appreciation and be inspired to learn more.