session-02-presentation.qmd

---
title: Statistical programming
title-slide-attributes:
  data-background-image: mvtec-cover-statistical-programming-4x3.png
  data-background-size: contain
  data-background-opacity: "0.1"
subtitle: Data manipulation and tidying data
author: Marc Comas-Cufí
format: 
  revealjs:
    self-contained: true
    smaller: false
    logo: mvtec-cover-statistical-programming-4x3.png
fontsize: 12pt
---

```{r setup, include=FALSE}
library(dplyr)
knitr::opts_chunk$set(echo = TRUE)
options(tibble.print_min = 5, tibble.print_max = 5, width=70)
```

## Today's session

```{r, echo=FALSE, results='asis'}
cat(readr::read_lines("session-02-content.md"), sep='\n')
```

# `dplyr`: a grammar of data manipulation

## Datasets {.smaller}

```{r, warning=FALSE, echo=TRUE, results='hide', message=FALSE}
library(nycflights13)
data(package = 'nycflights13')
```

```
Data sets in package ‘nycflights13’:

airlines                      Airline names.
airports                      Airport metadata
flights                       Flights data
planes                        Plane metadata.
weather                       Hourly weather data
```

## Dataset: `flights` {.smaller}

```{r, warning=FALSE, echo=TRUE}
flights
```

## `dplyr` summary {.smaller}

By loading `dplyr` package, 

```{r, echo=TRUE}
library(dplyr)
```

we get access to a very useful set of functions:

* `filter()` picks cases based on their values.
* `arrange()` changes the ordering of the rows.
* `select()` picks variables based on their names.
* `mutate()` adds new variables that are functions of existing variables.
* `summarise()` reduces multiple values down to a single summary.
* `group_by()` allows performing any operation “by group”.


Cheat sheet [pdf](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)

## Row filtering (`filter()`) 

* <div class="green2">Selecting all flights on January 1st</div>(`month == 1` and `day == 1`)

```{r}
filter(flights, month == 1, day == 1)
```

## Row filtering (`filter()`) {.smaller background-color=#CCE5FF}

* <div>Selecting all flights on January 1st</div>(`month == 1` and `day == 1`)
* <div>Selecting all flights from six first years of the year</div>(`month <= 6`)
* <div>Selecting all flights departing with no more than 5 minutes from scheduled departure time</div>(`-5 <= dep_delay` and `dep_delay <= 5` ) or (`abs(dep_delay) <= 5`) or (`between(dep_delay, -5, 5)`)

__Activity__

* Select flights flying to "IAH" or "HOU"
* Departed in summer (July, August and September)

## More row-selection functions

* `slice()`. Select rows by position (helpers `slice_head()`, `slice_tail()`, `slice_min()`, `slice_max()`, `slice_sample()`)
* `distinct()`. Select distinct observations given certain variables
* `sample_n()`, `sample_frac()`. Random selection of rows

## Ordering observations (`arrange()`)

* Order flights by year, month, day and delay

```{r}
arrange(flights, year, month, day, dep_delay)
```

## Ordering observations (`arrange()`)

* Order flights by year, month, day and __decreasing__ delay

```{r}
arrange(flights, year, month, day, desc(dep_delay))
```

## Ordering observations (`arrange()`) {.smaller background-color=#CCE5FF}

* How could you use `arrange()` to sort all missing values to the start?
* Sort flights to find the fastest (highest speed) flights.
* Which flights travelled the farthest? Which travelled the shortest?


## Selecting variables (`select()`)

* Select year, month, day and flight number

```{r}
select(flights, year, month, day, flight)
```

## `select()` helpers {.smaller}

Special functions can be used to facilitate variable selection.

* Matching names: `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()`. 
* From vector of names: `all_of()`, `any_of()`. 
* Using a function: `where()`

```{r}
select(flights, starts_with('dep_'), contains('arr_'))
```

## Selecting variables (`select()`) {.smaller background-color=#CCE5FF}

* Brainstorm as many ways as possible to __only__ select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights
* What does the `any_of()` function do? Why might it be helpful with this vector?

```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```

* Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

```{r, eval=FALSE}
select(flights, contains("TIME"))
```


## Transform variables (`mutate()`) {.smaller}

* Calculating average flight speed (km/h)

```{r}
mutate(flights,
       distance_km = distance * 1.60934,
       air_time_h = air_time * 60,
       speed_km_h = distance_km / air_time_h)
```

## Transform variables (`mutate()`) {.smaller background-color=#CCE5FF}

* Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?

## Summarising variables (`summarise()`) {.smaller}

* Calculate the average delays in departure and arrival

```{r}
summarise(flights,
          average_dep_delay = mean(dep_delay, na.rm=TRUE),
          average_arr_delay = mean(arr_delay, na.rm=TRUE))
```

## Working by groups (`group_by()`) {.smaller}

* Stratify `flight` table by day

```{r}
flights_by_day = group_by(flights, year, month, day)
```

* Calculate the average delays in departure and arrival __by day__

```{r, include=FALSE}
options(dplyr.summarise.inform=FALSE)
```

```{r}
summarise(flights_by_day, 
          average_dep_delay = mean(dep_delay, na.rm=TRUE),
          average_arr_delay = mean(arr_delay, na.rm=TRUE))
```

## Working by groups (`group_by()`)  {.smaller background-color=#CCE5FF}

* Considering a cancelled flight, a flight with missing either in `dep_delay` or `arr_delay`. Look at the number of cancelled flights per day. Is the proportion of cancelled flights related to the departure delay?


## The pipe operator (`%>%`) {.smaller}

* Summarise the `arr_delay` variable for each month (giving the number of flights and the average delay), for those flights that had a positive `dep_delay`.

  * Nesting functions:
    
    ```{r, eval=FALSE}
    summarise(group_by(filter(flights, dep_delay > 0), month),
              average_arr_delay = mean(arr_delay, na.rm = TRUE))
    ```

  * Temporal tables:
    
    ```{r, eval=FALSE}
    flights_filtered = filter(flights, dep_delay > 0)
    flights_grouped = group_by(flights_filtered, month)
    summarise(flights_grouped,
              average_arr_delay = mean(arr_delay, na.rm = TRUE))
    ```

* The pipe (`%>%`) approach:

```{r, eval=FALSE}
flights %>%
  filter(dep_delay > 0) %>%
  group_by(month) %>%
  summarise(
    average_arr_delay = mean(arr_delay, na.rm = TRUE))
```

## Other usefully `dplyr` functions

* Mutating joins: `inner_join()`, `left_join()`, `right_join`, ...
* `expand()`, `expand_grid()`. Create a tibble from all combinations of inputs.
* `bind_rows()`, `bind_cols()`. Efficiently bind multiple data frames by row and column.

## Tidy Data

Tidy data is a standard way of mapping the meaning of a dataset to its structure.

1. Every column is a variable
2. Every row is an observation
3. Every cell is a single value

## `tidyr`: helpers to create tidy data {.smaller}

* "Pivotting". `pivot_longer()` and `pivot_wider()`
* "Rectangling". `unnest_longer()`, `unnest_wider()`
* "Nesting". `nest()`, `unnest()`
* Splitting and combining: `separate()` and `unite()`.
* Missings: `complete()`, `drop_na()`, `fill()`.

<br><br><br><br>

Cheat Sheet [pdf](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf)

```{r, include=FALSE}
library(tidyr)
rjson::fromJSON('{"character":["Toothless","Dory"],"metadata":[{"species":["dragon"],"color":["black"],"films":["How to Train Your Dragon","How to Train Your Dragon 2","How to Train Your Dragon: The Hidden World"]},{"species":["blue tang"],"color":["blue"],"films":["Finding Nemo","Finding Dory"]}]}') %>% 
  as_tibble() %>% 
  unnest_wider(metadata) %>% 
  unnest_longer(films)

jsonlite::fromJSON('{"character":["Toothless","Dory"],"metadata":[{"species":["dragon"],"color":["black"],"films":["How to Train Your Dragon","How to Train Your Dragon 2","How to Train Your Dragon: The Hidden World"]},{"species":["blue tang"],"color":["blue"],"films":["Finding Nemo","Finding Dory"]}]}', simplifyDataFrame = FALSE) %>% 
  as_tibble() %>% 
  unnest(character) %>% 
  unnest_wider(metadata) %>% unnest_longer(films)
```

## Messy datasets (1) {.smaller} 

__Column headers are values, no variable names__

```{r}
tidyr::relig_income
```

## Messy datasets (2) {.smaller}   

__Multiple variables stored in a column__

```{r}
tidyr::who
```

## Messy datasets (3) {.smaller} 

__Variables are stored in both rows and columns__

```{r, eval=FALSE, message=FALSE}
library(readr)
read_csv("weather.csv")
```

```{r, echo=FALSE, message=FALSE}
library(readr)
read_csv("session-02-presentation/weather.csv")
```

## Dates and time (`lubridate`) (1) {.smaller}

* Parsing dates: `ymd()`, `ydm()`, `myd()`, `mdy()`, `dmy()`, `dym()`

```{r, message=FALSE}
library(lubridate)
x <- c(20090101, "2009-01-02", "2009 01 03", "2009-1-4",
       "2009-1, 5", "Created on 2009 1 6", "200901 !!! 07")
(ddates = ymd(x))
```

* Extracting `year`, `month` or `day`

```{r}
year(ddates)
```

## Dates and time (`lubridate`) (2) {.smaller}

* Calculating time differences

```{r}
diff(ymd(20100101) + years(1:5))
```

* How many weeks?

```{r}
interval(ymd(20101001), ymd(20101101)) / weeks(1)
```


Cheat Sheet [pdf](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf)

## String (`stringr`) {.smaller}

A cohesive set of functions designed to make working with strings as easy as possible.

* Detect matches: `str_detect()`
* Subset strings: `str_sub()`
* Manage lengths: `str_length()`, `str_trim()`
* Mutate strings: `str_replace()`, `str_to_lower()`, `str_to_title()`
* Join and split: `str_c()`, `str_split()`

Cheat Sheet [pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf)

## Categorical variables (`forcats`) {.smaller}

A suite of tools that solve common problems with factors:

* `fct_reoder()`: reorder the levels of a factor according to some function
* `fct_infreq()`: reorder the levels of a factor according to category frequencies
* `fct_relevel()`: reorder the levels of a factor manually
* `fct_lump_min()`, `fct_lump_n()`, `fct_lump_prop()`: collapse the least frequent values

```{r, include=FALSE}
library(forcats)
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% fct_lump_n(3) %>% table()
```

Cheat Sheet [pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf)

# That's all for today

## Next week session

```{r, echo=FALSE, results='asis'}
cat(readr::read_lines("session-03-content.md"), sep='\n')
```