-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathsession-02-presentation.qmd
377 lines (258 loc) · 10.5 KB
/
session-02-presentation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
---
title: Statistical programming
title-slide-attributes:
data-background-image: mvtec-cover-statistical-programming-4x3.png
data-background-size: contain
data-background-opacity: "0.1"
subtitle: Data manipulation and tidying data
author: Marc Comas-Cufí
format:
revealjs:
self-contained: true
smaller: false
logo: mvtec-cover-statistical-programming-4x3.png
fontsize: 12pt
---
```{r setup, include=FALSE}
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
options(tibble.print_min = 5, tibble.print_max = 5, width=70)
```
## Today's session
```{r, echo=FALSE, results='asis'}
cat(readr::read_lines("session-02-content.md"), sep='\n')
```
# `dplyr`: a grammar of data manipulation
## Datasets {.smaller}
```{r, warning=FALSE, echo=TRUE, results='hide', message=FALSE}
library(nycflights13)
data(package = 'nycflights13')
```
```
Data sets in package ‘nycflights13’:
airlines Airline names.
airports Airport metadata
flights Flights data
planes Plane metadata.
weather Hourly weather data
```
## Dataset: `flights` {.smaller}
```{r, warning=FALSE, echo=TRUE}
flights
```
## `dplyr` summary {.smaller}
By loading `dplyr` package,
```{r, echo=TRUE}
library(dplyr)
```
we get access to a very useful set of functions:
* `filter()` picks cases based on their values.
* `arrange()` changes the ordering of the rows.
* `select()` picks variables based on their names.
* `mutate()` adds new variables that are functions of existing variables.
* `summarise()` reduces multiple values down to a single summary.
* `group_by()` allows performing any operation “by group”.
Cheat sheet [pdf](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)
## Row filtering (`filter()`)
* <div class="green2">Selecting all flights on January 1st</div>(`month == 1` and `day == 1`)
```{r}
filter(flights, month == 1, day == 1)
```
## Row filtering (`filter()`) {.smaller background-color=#CCE5FF}
* <div>Selecting all flights on January 1st</div>(`month == 1` and `day == 1`)
* <div>Selecting all flights from six first years of the year</div>(`month <= 6`)
* <div>Selecting all flights departing with no more than 5 minutes from scheduled departure time</div>(`-5 <= dep_delay` and `dep_delay <= 5` ) or (`abs(dep_delay) <= 5`) or (`between(dep_delay, -5, 5)`)
__Activity__
* Select flights flying to "IAH" or "HOU"
* Departed in summer (July, August and September)
## More row-selection functions
* `slice()`. Select rows by position (helpers `slice_head()`, `slice_tail()`, `slice_min()`, `slice_max()`, `slice_sample()`)
* `distinct()`. Select distinct observations given certain variables
* `sample_n()`, `sample_frac()`. Random selection of rows
## Ordering observations (`arrange()`)
* Order flights by year, month, day and delay
```{r}
arrange(flights, year, month, day, dep_delay)
```
## Ordering observations (`arrange()`)
* Order flights by year, month, day and __decreasing__ delay
```{r}
arrange(flights, year, month, day, desc(dep_delay))
```
## Ordering observations (`arrange()`) {.smaller background-color=#CCE5FF}
* How could you use `arrange()` to sort all missing values to the start?
* Sort flights to find the fastest (highest speed) flights.
* Which flights travelled the farthest? Which travelled the shortest?
## Selecting variables (`select()`)
* Select year, month, day and flight number
```{r}
select(flights, year, month, day, flight)
```
## `select()` helpers {.smaller}
Special functions can be used to facilitate variable selection.
* Matching names: `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()`.
* From vector of names: `all_of()`, `any_of()`.
* Using a function: `where()`
```{r}
select(flights, starts_with('dep_'), contains('arr_'))
```
## Selecting variables (`select()`) {.smaller background-color=#CCE5FF}
* Brainstorm as many ways as possible to __only__ select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights
* What does the `any_of()` function do? Why might it be helpful with this vector?
```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```
* Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
```{r, eval=FALSE}
select(flights, contains("TIME"))
```
## Transform variables (`mutate()`) {.smaller}
* Calculating average flight speed (km/h)
```{r}
mutate(flights,
distance_km = distance * 1.60934,
air_time_h = air_time * 60,
speed_km_h = distance_km / air_time_h)
```
## Transform variables (`mutate()`) {.smaller background-color=#CCE5FF}
* Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?
## Summarising variables (`summarise()`) {.smaller}
* Calculate the average delays in departure and arrival
```{r}
summarise(flights,
average_dep_delay = mean(dep_delay, na.rm=TRUE),
average_arr_delay = mean(arr_delay, na.rm=TRUE))
```
## Working by groups (`group_by()`) {.smaller}
* Stratify `flight` table by day
```{r}
flights_by_day = group_by(flights, year, month, day)
```
* Calculate the average delays in departure and arrival __by day__
```{r, include=FALSE}
options(dplyr.summarise.inform=FALSE)
```
```{r}
summarise(flights_by_day,
average_dep_delay = mean(dep_delay, na.rm=TRUE),
average_arr_delay = mean(arr_delay, na.rm=TRUE))
```
## Working by groups (`group_by()`) {.smaller background-color=#CCE5FF}
* Considering a cancelled flight, a flight with missing either in `dep_delay` or `arr_delay`. Look at the number of cancelled flights per day. Is the proportion of cancelled flights related to the departure delay?
## The pipe operator (`%>%`) {.smaller}
* Summarise the `arr_delay` variable for each month (giving the number of flights and the average delay), for those flights that had a positive `dep_delay`.
* Nesting functions:
```{r, eval=FALSE}
summarise(group_by(filter(flights, dep_delay > 0), month),
average_arr_delay = mean(arr_delay, na.rm = TRUE))
```
* Temporal tables:
```{r, eval=FALSE}
flights_filtered = filter(flights, dep_delay > 0)
flights_grouped = group_by(flights_filtered, month)
summarise(flights_grouped,
average_arr_delay = mean(arr_delay, na.rm = TRUE))
```
* The pipe (`%>%`) approach:
```{r, eval=FALSE}
flights %>%
filter(dep_delay > 0) %>%
group_by(month) %>%
summarise(
average_arr_delay = mean(arr_delay, na.rm = TRUE))
```
## Other usefully `dplyr` functions
* Mutating joins: `inner_join()`, `left_join()`, `right_join`, ...
* `expand()`, `expand_grid()`. Create a tibble from all combinations of inputs.
* `bind_rows()`, `bind_cols()`. Efficiently bind multiple data frames by row and column.
## Tidy Data
Tidy data is a standard way of mapping the meaning of a dataset to its structure.
1. Every column is a variable
2. Every row is an observation
3. Every cell is a single value
## `tidyr`: helpers to create tidy data {.smaller}
* "Pivotting". `pivot_longer()` and `pivot_wider()`
* "Rectangling". `unnest_longer()`, `unnest_wider()`
* "Nesting". `nest()`, `unnest()`
* Splitting and combining: `separate()` and `unite()`.
* Missings: `complete()`, `drop_na()`, `fill()`.
<br><br><br><br>
Cheat Sheet [pdf](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf)
```{r, include=FALSE}
library(tidyr)
rjson::fromJSON('{"character":["Toothless","Dory"],"metadata":[{"species":["dragon"],"color":["black"],"films":["How to Train Your Dragon","How to Train Your Dragon 2","How to Train Your Dragon: The Hidden World"]},{"species":["blue tang"],"color":["blue"],"films":["Finding Nemo","Finding Dory"]}]}') %>%
as_tibble() %>%
unnest_wider(metadata) %>%
unnest_longer(films)
jsonlite::fromJSON('{"character":["Toothless","Dory"],"metadata":[{"species":["dragon"],"color":["black"],"films":["How to Train Your Dragon","How to Train Your Dragon 2","How to Train Your Dragon: The Hidden World"]},{"species":["blue tang"],"color":["blue"],"films":["Finding Nemo","Finding Dory"]}]}', simplifyDataFrame = FALSE) %>%
as_tibble() %>%
unnest(character) %>%
unnest_wider(metadata) %>% unnest_longer(films)
```
## Messy datasets (1) {.smaller}
__Column headers are values, no variable names__
```{r}
tidyr::relig_income
```
## Messy datasets (2) {.smaller}
__Multiple variables stored in a column__
```{r}
tidyr::who
```
## Messy datasets (3) {.smaller}
__Variables are stored in both rows and columns__
```{r, eval=FALSE, message=FALSE}
library(readr)
read_csv("weather.csv")
```
```{r, echo=FALSE, message=FALSE}
library(readr)
read_csv("session-02-presentation/weather.csv")
```
## Dates and time (`lubridate`) (1) {.smaller}
* Parsing dates: `ymd()`, `ydm()`, `myd()`, `mdy()`, `dmy()`, `dym()`
```{r, message=FALSE}
library(lubridate)
x <- c(20090101, "2009-01-02", "2009 01 03", "2009-1-4",
"2009-1, 5", "Created on 2009 1 6", "200901 !!! 07")
(ddates = ymd(x))
```
* Extracting `year`, `month` or `day`
```{r}
year(ddates)
```
## Dates and time (`lubridate`) (2) {.smaller}
* Calculating time differences
```{r}
diff(ymd(20100101) + years(1:5))
```
* How many weeks?
```{r}
interval(ymd(20101001), ymd(20101101)) / weeks(1)
```
Cheat Sheet [pdf](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf)
## String (`stringr`) {.smaller}
A cohesive set of functions designed to make working with strings as easy as possible.
* Detect matches: `str_detect()`
* Subset strings: `str_sub()`
* Manage lengths: `str_length()`, `str_trim()`
* Mutate strings: `str_replace()`, `str_to_lower()`, `str_to_title()`
* Join and split: `str_c()`, `str_split()`
Cheat Sheet [pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf)
## Categorical variables (`forcats`) {.smaller}
A suite of tools that solve common problems with factors:
* `fct_reoder()`: reorder the levels of a factor according to some function
* `fct_infreq()`: reorder the levels of a factor according to category frequencies
* `fct_relevel()`: reorder the levels of a factor manually
* `fct_lump_min()`, `fct_lump_n()`, `fct_lump_prop()`: collapse the least frequent values
```{r, include=FALSE}
library(forcats)
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% fct_lump_n(3) %>% table()
```
Cheat Sheet [pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf)
# That's all for today
## Next week session
```{r, echo=FALSE, results='asis'}
cat(readr::read_lines("session-03-content.md"), sep='\n')
```