---
title: "Cleaning the data"
author: "Seth Mottaghinejad"
output: github_document
date: "`r Sys.Date()`"
---
```{r chap03chunk01, include=FALSE}
source('setup.R')
```
In the last section, we proposed ways that we could clean the data. In this section, we actually clean the data. Let's review where we are in the EDA (exploratory data analysis) process:
1. load all the data (and combine them if necessary)
2. inspect the data in preparation for cleaning it
3. **clean the data in preparation for analysis**
4. add any interesting features or columns as far as they pertain to the analysis
5. find ways to analyze or summarize the data and report your findings
### Exercises
Run `summary` on the data.
```{r chap03chunk02}
summary(nyc_taxi)
```
What are some important things we can tell about the data by looking at the
above summary?
Discuss possible ways that some columns may need to be 'cleaned'. By 'cleaned'
here we mean
- reformatted into the appropriate type,
- replaced with another value or an NA,
- removed from the data for the purpose of the analysis.
### Solutions
Here are some of the ways we can clean the data:
- `pickup_datetime` and `dropoff_datetime` should be `datetime` columns, not `character`
- `rate_code_id` and `payment_type` should be a `factor`, not `character`
- the geographical coordinates for pick-up and drop-off occasionally fall outside a reasonable bound (probably due to error)
- `fare_amount` is sometimes negative (could be refunds, could be errors, could be something else)
Some data-cleaning jobs depend on the analysis. For example, turning `payment_type` into a `factor` is unnecessary if we don't intend to use it as a categorical variable in the model. Even so, we might still benefit from turning it into a factor so that we can see counts for it when we run `summary` on the data, or have it show the proper labels when we use it in a plot. Other data-cleaning jobs, on the other hand, relate to data quality issues. For example, unreasonable bounds for pick-up or drop-off coordinates can be due to error. In such cases, we must decide whether we should clean the data by
- removing rows that have incorrect information for some columns, even though other columns might still be correct,
- replacing the incorrect information with NAs and deciding whether we should impute missing values somehow, or
- leaving the data as is, but thinking about how doing so could skew some results from our analysis (the first two options are sketched below).
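As a rough sketch of the first two options, here is what they might look like for `pickup_longitude`. The -75/-73 bounds here are just for illustration (we apply similar bounds later in this chapter), and the chunk is not evaluated so it doesn't change the data prematurely.
```{r chap03extra01, eval=FALSE}
# option 1: drop the whole row whenever pickup longitude is out of bounds
nyc_taxi <- subset(nyc_taxi, pickup_longitude >= -75 & pickup_longitude <= -73)
# option 2: keep the row, but replace the suspect value with NA
nyc_taxi <- transform(nyc_taxi,
  pickup_longitude = ifelse(pickup_longitude < -75 | pickup_longitude > -73, NA, pickup_longitude))
```
Note that option 1 also throws away the other (possibly valid) columns in those rows, which is why we lean toward option 2 for the coordinate columns.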
## Dealing with datetimes
Next we format `pickup_datetime` and `dropoff_datetime` as `datetime` columns. There are different functions for dealing with `datetime` column types, including functions in the `base` package, but we will be using the `lubridate` package for its rich set of functions and simplicity.
```{r chap03chunk03}
library(lubridate)
Sys.setenv(TZ = "US/Pacific") # not important for this dataset, but this is how we set the time zone
```
The function we need is called `ymd_hms`, but before we run it on the data let's test it on a string. Doing so gives us a chance to test the function on a simple input and catch any errors or wrong argument specifications.
```{r chap03chunk04}
ymd_hms("2015-01-25 00:13:08", tz = "US/Eastern") # we can ignore warning message about timezones
```
We seem to have the right function and the right set of arguments, so let's now apply it to the data. If we are still unsure about whether things will work, it might be prudent to not immediately overwrite the existing column. We could either write the transformation to a new column or run the transformation on the first few rows of the data and just display the results in the console.
```{r chap03chunk05}
ymd_hms(nyc_taxi$pickup_datetime[1:20], tz = "US/Eastern")
```
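The other cautious route, writing the result to a scratch column first, might look like this sketch (the `pickup_dt_test` name is just for illustration, and the chunk is not evaluated since we do the real conversion next):
```{r chap03extra02, eval=FALSE}
nyc_taxi$pickup_dt_test <- ymd_hms(nyc_taxi$pickup_datetime, tz = "US/Eastern")
head(nyc_taxi[ , c('pickup_datetime', 'pickup_dt_test')]) # compare old and new side by side
nyc_taxi$pickup_dt_test <- NULL # drop the scratch column once we're satisfied
```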
We now apply the transformation to the whole data and overwrite the original column with it.
```{r chap03chunk06}
nyc_taxi$pickup_datetime <- ymd_hms(nyc_taxi$pickup_datetime, tz = "US/Eastern")
```
There's another way to do the above transformation: by using the `transform` function. Just as was the case with `subset`, `transform` allows us to pass the data as the first argument so that we don't have to prefix the column names with `nyc_taxi$`. The result is a cleaner and more readable notation.
```{r chap03chunk07}
nyc_taxi <- transform(nyc_taxi, dropoff_datetime = ymd_hms(dropoff_datetime, tz = "US/Eastern"))
```
Let's now see some of the benefits of formatting the above columns as `datetime`. The first benefit is that we can now perform date calculations on the data. Say for example that we wanted to know how many data points are in each week. We can use `table` to get the counts and the `week` function in `lubridate` to extract the week of the year (from 1 to 53) from `pickup_datetime`.
```{r chap03chunk08}
table(week(nyc_taxi$pickup_datetime)) # `week` extracts the week number from a datetime
```
```{r chap03chunk09}
table(week(nyc_taxi$pickup_datetime), month(nyc_taxi$pickup_datetime)) # `week` and `month` are datetime functions
```
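Date arithmetic also works now that both columns are `datetime`. As a quick taste (we formally add features like trip duration in the next chapter), here is a sketch:
```{r chap03extra03}
# subtracting two datetime columns gives durations; here, trip length in minutes
head(difftime(nyc_taxi$dropoff_datetime, nyc_taxi$pickup_datetime, units = "mins"))
```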
Another benefit of the `datetime` format is that many data summaries and data visualizations automatically 'look right' when the data has the proper format: plotting functions can display the data the way we expect. We do not cover data visualization in depth in this course, but we provide many examples to get you started. Here's a histogram of `pickup_datetime`.
```{r chap03chunk10}
library(ggplot2)
ggplot(data = nyc_taxi) +
geom_histogram(aes(x = pickup_datetime), col = "black", fill = "lightblue",
binwidth = 60*60*24*7) # the bin has a width of one week
```
Notice how the x-axis is properly formatted as a date without any manual input from us. Neither the summary nor the plot above would have been possible if `pickup_datetime` were still a `character` column.
## Dealing with factors
It's time to turn our attention to the categorical columns in the dataset. Ideally, categorical columns should be turned into `factor` (usually from `character` or `integer`). A `factor` is the appropriate data type for a categorical column. When we loaded the data in R using `read.csv`, we set `stringsAsFactors = FALSE` to prevent any `character` columns from being turned into a factor. This is generally a good idea, because some character columns (such as columns with raw text in them or alpha-numeric ID columns) are not appropriate for factors. Accidentally turning such columns into factors can result in overhead, especially when data sizes are large. The overhead is the result of R having to keep a tally of all the factor levels.

We do not have any `character` columns in this dataset that need to be converted to factors, but we have `integer` columns that represent categorical data. These are the columns with low cardinality, as can be seen here:
```{r chap03chunk11}
sapply(nyc_taxi, num.distinct)
```
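As an aside, here's a quick illustration of the overhead mentioned above, using made-up alphanumeric IDs rather than anything from our dataset. Because nearly every ID is unique, the factor has to store its integer codes *and* a full copy of the levels:
```{r chap03extra04}
ids <- replicate(10000, paste(sample(letters, 12, replace = TRUE), collapse = ""))
object.size(ids)         # stored as a plain character vector
object.size(factor(ids)) # as a factor: integer codes plus ~10000 unique levels
```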
Fortunately, the site that hosted the dataset also provides us with a [data dictionary](http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf). Going over the document tells us which columns are categorical and what each category represents.
For example, for `rate_code_id`, the mapping is as follows:
- 1 = Standard rate
- 2 = JFK
- 3 = Newark
- 4 = Nassau or Westchester
- 5 = Negotiated fare
- 6 = Group ride
The above information helps us properly label the factor levels.
Notice how `summary` shows us numeric summaries for the categorical columns right now.
```{r chap03chunk12}
summary(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # shows numeric summaries for both columns
```
A quick glance at `payment_type` shows that two payment types are by far the most common. The data dictionary confirms that they correspond to card and cash payments.
```{r chap03chunk13}
table(nyc_taxi$payment_type)
```
We now turn both `rate_code_id` and `payment_type` into `factor` columns. For `rate_code_id` we keep all the labels, but for `payment_type` we only keep the two most common and label them as 'card' and 'cash'. We do so by specifying `levels = 1:2` instead of `levels = 1:6` and providing labels for only the first two categories. This means the other values of `payment_type` get lumped together and replaced with NAs, resulting in information loss (which we are comfortable with, for the sake of this analysis).
```{r chap03chunk14}
nyc_taxi <- transform(nyc_taxi,
  rate_code_id = factor(rate_code_id, levels = 1:6,
    labels = c('standard', 'JFK', 'Newark', 'Nassau or Westchester', 'negotiated', 'group ride')),
  payment_type = factor(payment_type, levels = 1:2, labels = c('card', 'cash')))
```
```{r chap03chunk15}
head(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # now proper labels are showing in the data
```
```{r chap03chunk16}
summary(nyc_taxi[ , c('rate_code_id', 'payment_type')]) # now counts are showing in the summary
```
It is very important that the `labels` be in the same order as the `levels` they map into.
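To see why the order matters, consider this toy example: `factor` matches `labels` to `levels` by position, not by value, so reversing the levels silently flips the meaning of the data.
```{r chap03extra05}
x <- c(1, 2, 2)
factor(x, levels = 1:2, labels = c('card', 'cash')) # correct: 1 -> card, 2 -> cash
factor(x, levels = 2:1, labels = c('card', 'cash')) # silently mislabeled: 2 -> card, 1 -> cash
```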
What about `passenger_count`? Should it be treated as a `factor` or left as an integer? The answer is that it depends on how it will be used, especially in the context of modeling. Most of the time, such a column is best left as `integer` in the data and converted into a factor 'on-the-fly' when needed (such as when we want to see counts, or when we want a model to treat the column as a `factor`).
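As a sketch of what 'on-the-fly' conversion can look like:
```{r chap03extra06}
# treat passenger_count as categorical just for this one summary
table(factor(nyc_taxi$passenger_count))
```
In a model formula we could similarly write `factor(passenger_count)` without ever changing the column itself.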
Our data-cleaning is done for now. We are ready to add new features to the data, but before we do so, let's briefly revisit what we have done so far and see if we could have taken any shortcuts. That is the subject of the next chapter.
### Exercises
Let's create a sample with replacement of size 2000 from the colors red, blue and green. This is like reaching into a jar containing balls of three colors, grabbing one and recording the color, placing it back into the jar and repeating this 2000 times.
```{r chap03chunk17}
rbg_chr <- sample(c("red", "blue", "green"), 2000, replace = TRUE)
```
We add one last entry to the sample: the entry is 'pink':
```{r chap03chunk18}
rbg_chr <- c(rbg_chr, "pink") # add a pink entry to the sample
```
We now turn `rbg_chr` (which is a character vector) into a `factor` and call it `rbg_fac`. We then drop the 'pink' entry from both vectors.
```{r chap03chunk19}
rbg_fac <- factor(rbg_chr) # turn `rbg_chr` into a `factor` `rbg_fac`
rbg_chr <- rbg_chr[1:(length(rbg_chr)-1)] # dropping the last entry from `rbg_chr`
rbg_fac <- rbg_fac[1:(length(rbg_fac)-1)] # dropping the last entry from `rbg_fac`
```
Note that `rbg_chr` and `rbg_fac` contain the same information, but are of different types. Discuss what differences you notice between `rbg_chr` and `rbg_fac` in each of the below cases:
(1) When we query the first few entries of each:
```{r chap03chunk20}
head(rbg_chr)
```
```{r chap03chunk21}
head(rbg_fac)
```
(2) When we compare the size of each in memory:
```{r chap03chunk22}
sprintf("Size as characters: %s. Size as factor: %s",
object.size(rbg_chr), object.size(rbg_fac))
```
(3) When we ask for counts within each category:
```{r chap03chunk23}
table(rbg_chr)
table(rbg_fac)
```
(4) When we try to replace an entry with something other than 'red', 'blue' and 'green':
```{r chap03chunk24}
rbg_chr[3] <- "yellow" # replaces the 3rd entry in `rbg_chr` with 'yellow'
rbg_fac[3] <- "yellow" # throws a warning, replaces the 3rd entry with NA
```
(5) Each category in a categorical column (formatted as `factor`) is called a **factor level**. We can look at factor levels using the `levels` function:
```{r chap03chunk25}
levels(rbg_fac)
```
We can relabel the factor levels directly with `levels`. Change the levels of `rbg_fac` so that the labels start with capital letters.
(6) We can add new factor levels to the existing ones. Add "Yellow" as a new level for `rbg_fac`.
(7) Once new factor levels have been created, we can have entries which match the new level. Change the third entry of `rbg_fac` to now be "Yellow".
(8) Finally, we need to recreate the `factor` column if we want to drop a particular level or change the order of the levels.
```{r chap03chunk26}
table(rbg_chr) # what we see in the original `character` column
```
If we don't provide the `factor` with levels (through the `levels` argument), R creates the levels by scanning the data for its unique values and sorting them alphabetically.
```{r chap03chunk27}
rbg_fac <- factor(rbg_chr)
table(rbg_fac) # the levels are just whatever was present in `rbg_chr`
```
We can override that by explicitly passing factor levels to the `factor` function, in the order that we wish them to appear. Recreate `rbg_fac` by passing `rbg_chr` to the `factor` function, but this time specify only "red", "green" and "blue" as the levels. Run `table` on both `rbg_chr` and `rbg_fac`. What differences do you see?
(9) What benefits do you see in being able to overwrite factor levels? Specifically, what could be useful about adding new factor levels? Removing certain existing factor levels? Reordering factor levels?
### Solutions
(1) We see quotes around the entries of `rbg_chr`, but no quotes for `rbg_fac`, along with a listing of the factor levels at the bottom.
```{r chap03chunk28}
head(rbg_chr) # we see quotes
```
```{r chap03chunk29}
head(rbg_fac) # we don't see quotes and we see the factor levels at the bottom
```
(2) A `factor` column tends to take up less space than a `character` column, the more so when the strings in the `character` column are longer. This is because a `factor` column stores the information as integers under the hood, with a mapping from each integer to the string it represents.
```{r chap03chunk30}
sprintf("Size as characters: %s. Size as factor: %s",
object.size(rbg_chr), object.size(rbg_fac))
```
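We can peek at that integer representation directly:
```{r chap03extra07}
as.integer(head(rbg_fac)) # the integer codes stored under the hood
levels(rbg_fac)           # the lookup mapping each code back to its label
```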
(3)
```{r chap03chunk31}
table(rbg_chr)
```
```{r chap03chunk32}
table(rbg_fac) # we can see a count of 0 for 'pink', because it's one of the factor levels
```
(4) Changing an entry in a `factor` column to a value other than one of its acceptable levels results in an NA. R throws a warning when this happens, but performs the replacement anyway.
```{r chap03chunk33}
head(rbg_chr) # the 3rd entry changed to 'yellow'
```
```{r chap03chunk34}
head(rbg_fac) # we could not change the 3rd entry to 'yellow' because it's not one of the factor levels
```
(5) We simply re-assign the factor levels, but we must be careful to provide the new levels **in the same order** as the old ones.
```{r chap03chunk35}
levels(rbg_fac) <- c('Blue', 'Green', 'Pink', 'Red') # we capitalize the first letters
head(rbg_fac)
```
(6) We simply append "Yellow" to the old factor levels and assign this as the new factor levels.
```{r chap03chunk36}
levels(rbg_fac) <- c(levels(rbg_fac), "Yellow") # we add 'Yellow' as a new factor level
table(rbg_fac) # even though the data has no 'Yellow' entries, it's an acceptable value
```
(7) Since "Yellow" is one of the levels now, we can change any entry to "Yellow" and we won't get an NA anymore.
```{r chap03chunk37}
rbg_fac[3] <- "Yellow" # does not throw a warning anymore
head(rbg_fac) # now the data has one 'Yellow' entry
```
(8) We use the `levels` argument in the `factor` function. Since "yellow" was one of the entries in `rbg_chr` and we are not specifying "yellow" as one of the factor levels we want, it will be turned into an NA.
```{r chap03chunk38}
table(rbg_chr)
```
```{r chap03chunk39}
rbg_fac <- factor(rbg_chr, levels = c('red', 'green', 'blue')) # create a `factor`, with only the levels provided, in the order provided
table(rbg_fac) # notice how 'yellow' has disappeared
```
```{r chap03chunk40}
table(rbg_fac, useNA = "ifany") # 'yellow' was turned into an NA
```
(9) There are three important advantages to providing factor levels:
1. We can reorder the levels to any order we want (instead of having them alphabetically ordered). This way related levels can appear next to each other in summaries and plots.
2. The factor levels don't have to be limited to what's in the data: we can provide additional levels that are not part of the data if we expect them to be part of future data. This way levels that are not in the data can still be represented in summaries and plots.
3. Factor levels that are in the data, but not relevant to the analysis can be ignored (replaced with NAs) by not including them in `levels`. **Note that doing so results in information loss if we overwrite the original column.**
## Being more efficient
Before we move to the next exciting section about feature creation, we need to take a quick step back and revisit what we've done so far, with an eye toward doing it more efficiently and in fewer steps. Often when doing exploratory data analysis we don't know much about the data ahead of time and need to learn as we go. But once we have the basics down, we can find shortcuts for some of the data-processing jobs. This is especially helpful if we intend to use the data to generate regular reports or use it in a production environment. Therefore, in this section, we go back to the original CSV file, load it into R, and redo all the data-cleaning to bring the data to where we left off in the last section. But as you will see, we take a slightly different approach.
Our approach in the last few sections was to load the data and then process it by "cleaning" each column. But some of the steps we took could have been taken at the time we loaded the data. We sometimes refer to this as **pre-processing**. Pre-processing can speed up reading the data and allow us to skip certain steps. It is useful to read data as we did in section 1 for the sake of exploring it, but in a production environment where efficiency matters these small steps can go a long way in optimizing the workflow.
We are now going to read the CSV file again, but add a few additional steps: we tell it which type each column needs to have (we can use `col_skip()` when we want a column dropped) and the name we wish to give to each column. We store the column types and names in an object called `col_types` for ease of access.
```{r chap03chunk41}
library(readr) # provides cols() and the col_*() specifications (may already be attached via setup.R)
library(dplyr) # provides bind_rows() (may already be attached via setup.R)
col_types <- cols(
pickup_datetime = col_datetime(format = ""),
dropoff_datetime = col_datetime(format = ""),
passenger_count = col_integer(),
trip_distance = col_number(),
pickup_longitude = col_number(),
pickup_latitude = col_number(),
rate_code_id = col_factor(levels = 1:6),
dropoff_longitude = col_number(),
dropoff_latitude = col_number(),
payment_type = col_factor(levels = 1:4),
fare_amount = col_number(),
extra = col_number(),
mta_tax = col_number(),
tip_amount = col_number(),
tolls_amount = col_number(),
improvement_surcharge = col_number(),
total_amount = col_number()
)
st <- Sys.time()
nyc_taxi <- bind_rows(lapply(1:6, read_each_month, progress = FALSE,
                             col_names = names(col_types$cols),
                             col_types = col_types, skip = 1))
Sys.time() - st
```
Reading the data the way we did above means we can now skip some steps, such as factor conversions, but we still have some work left before we get the data to where it was when we left it in the last section.
Before we do so, let's quickly review the two ways we learned to both query and transform data: we can query and transform data using a direct approach, or we can do so using functions such as `subset` and `transform`. The notation for the latter is cleaner and easier to follow. The two approaches are shown in the table below. Additionally, we now introduce a third way of performing these tasks: using the popular `dplyr` package. `dplyr` has a host of functions for querying, processing, and summarizing data. We learn more about its querying and processing capabilities in this section and the next, and about how to summarize data with `dplyr` in the section about data summaries.
| task | direct approach | using `base` functions | using `dplyr` functions |
|----------------|----------------------------------|----------------------------------------------|--------------------------------------|
| query data | `data[data$x > 10, c('x', 'y')]` | `subset(data, x > 10, select = c('x', 'y'))` | `select(filter(data, x > 10), x, y)` |
| transform data | `data$z <- data$x + data$y` | `transform(data, z = x + y)` | `mutate(data, z = x + y)` |
As we can see in the above table, `dplyr` has two functions called `mutate` and `filter`, and in notation they mirror `transform` and `subset` respectively. The one difference is that `subset` has an argument called `select` for selecting specific columns, whereas `dplyr` has a function called `select` for doing so (and the column names we pass are unquoted).
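Here's a quick side-by-side sketch on our data so the notation sinks in (the column choices are just for illustration):
```{r chap03extra08}
library(dplyr)
# base R and dplyr versions of the same query: trips with a fare over $100
head(subset(nyc_taxi, fare_amount > 100, select = c(fare_amount, tip_amount)))
head(select(filter(nyc_taxi, fare_amount > 100), fare_amount, tip_amount))
```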
We cover more of `dplyr` in the next two sections to give you a chance to get comfortable with the `dplyr` functions and their notation, and it's in section 6 that we really gain an appreciation for `dplyr` and its simple notation for creating complicated data pipelines.
In this section, we use `dplyr` to redo all the transformations to clean the data. This will essentially consist of using `mutate` instead of `transform`. Beyond simply changing function names, `dplyr` functions are generally more efficient too.
Here's what remains for us to do:
1. Replace the unusual geographical coordinates for pick-up and drop-off with NAs
2. Assign the proper labels to the factor levels and drop any unnecessary factor levels (in the case of `payment_type`)
```{r chap03chunk42}
library(lubridate)
library(dplyr)
nyc_taxi <- mutate(nyc_taxi,
pickup_longitude = ifelse(pickup_longitude < -75 | pickup_longitude > -73, NA, pickup_longitude),
dropoff_longitude = ifelse(dropoff_longitude < -75 | dropoff_longitude > -73, NA, dropoff_longitude),
pickup_latitude = ifelse(pickup_latitude < 38 | pickup_latitude > 41, NA, pickup_latitude),
dropoff_latitude = ifelse(dropoff_latitude < 38 | dropoff_latitude > 41, NA, dropoff_latitude)
)
```
As for the factor columns: firstly, `rate_code_id` is a factor now, but we still need to assign the proper labels to it.
```{r chap03chunk43}
levels(nyc_taxi$rate_code_id) <- c('standard', 'JFK', 'Newark', 'Nassau or Westchester', 'negotiated', 'group ride', 'n/a')
```
Secondly, `payment_type` is also a factor, but with all four of the levels we specified when reading the data, so we need to "refactor" it, keeping only the top two.
```{r chap03chunk44}
table(nyc_taxi$payment_type, useNA = "ifany") # we can see all different payment types
```
```{r chap03chunk45}
nyc_taxi <- mutate(nyc_taxi, payment_type = factor(payment_type, levels = 1:2, labels = c('card', 'cash')))
table(nyc_taxi$payment_type, useNA = "ifany") # other levels turned into NAs
```
We now have the data to where it was when we left it at the end of the previous section. In the next section, we work on adding new features (columns) to the data.
### Exercises
A useful question we might want to ask is the following: Are longitude and latitude mostly missing as pairs? In other words, is it generally the case that when longitude is missing, so is latitude and vice versa?
Once missing values are formatted as NAs, we use the `is.na` function to determine what's an NA.
```{r chap03chunk46}
is.na(c(2, 4, NA, -1, 5, NA))
```
Combine `is.na` and `table` to answer the following question:
(1) How many of the `pickup_longitude` values are NAs? (This was also answered when we ran `summary`.)
(2) How many times are `pickup_longitude` and `pickup_latitude` missing together vs separately?
(3) Of the times when the pair `pickup_longitude` and `pickup_latitude` are missing, how many times is the pair `dropoff_longitude` and `dropoff_latitude` also missing?
### Solutions
(1) We use `is.na` inside `table`.
```{r chap03chunk47}
table(is.na(nyc_taxi$pickup_longitude))
```
(2) We can combine both statements using `&`.
```{r chap03chunk48}
table(is.na(nyc_taxi$pickup_longitude) & is.na(nyc_taxi$pickup_latitude))
```
We can also separate the two statements and pass them as separate arguments to `table`. Doing so gives us a **two-way table** with a little more information.
```{r chap03chunk49}
table(is.na(nyc_taxi$pickup_longitude), is.na(nyc_taxi$pickup_latitude)) # better solution to (2)
```
(3) Providing n arguments to `table` gives us an n-way table, which is an `array` object. When n > 3 it gets confusing to look at it, so here we can use `&` to simplify things.
```{r chap03chunk50}
with(nyc_taxi,
table(is.na(pickup_longitude) & is.na(pickup_latitude), is.na(dropoff_longitude) & is.na(dropoff_latitude))
)
```