index.Rmd

---
title       : Intro to R Workshop
subtitle    : UCI Data Science Initiative
author      : Sepehr Akhavan, Homer Strong, Fulya Ozcan, Bonnie Bui
job         : Dept. of Statistics
framework   : io2012        # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js  # {highlight.js, prettify, highlight}
hitheme     : tomorrow      # 
widgets     : mathjax            # {mathjax, quiz, bootstrap}
logo     : logo.png
mode        : selfcontained # {standalone, draft}
knit        : slidify::knit2slides
github:
  user: UCIDataScienceInitiative
  repo: IntroR_Workshop

---


## Introduction

1) The class will include 5 sessions: 
  + Session 1  (9-10:20): Data Types in R 
  + Session 2  (10:30-11:20): Control Structures and Functions
  + Session 3  (11:30-12): Statistical Distributions in R
  + Exercise 1 (12:30-1:20): Basic Data Exploration
  + Session 4  (1:20-2:50): Statistical Analysis in R 
  + Session 5  (3:00-4:20): Plotting and Data Visualization in R
  + Exercise 2 (4:20-5:00): Data visualization & Statistical Analysis
  

---

```{r,echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
library(ggplot2)
```

## Introduction

2) We are going to work in pairs. Please find a partner. 

3) Feel free to ask questions anytime during lectures.

4) To access this presentation and the code used during the workshop, please visit:
  + http://ucidatascienceinitiative.github.io/IntroR_Workshop/#1


---

## Session 1 - Agenda

1. RStudio
2. Data Types in R
3. Subsetting in R

---

## What is R?

+ R is a free software environment for statistical computing and graphics
  + See http://www.r-project.org/ for more info
  
  
+ R compiles and runs on a wide variety of UNIX platforms, Windows and Mac OS

+ R is Open-Source and free

+ R is fundamentally a command-driven system

+ R is an object-oriented programming language 
  + everything in R is treated as an object!


--- 


## RStudio:

1. RStudio is a free and open source integrated development environment (IDE) for R.

2. To download RStudio please visit: http://rstudio.org/

3. Please note that you must have R installed before installing RStudio.

---

## Data Types in R:

1. R has 5 main atomic data types:
  + Numeric
  + Integer
  + Complex
  + Logical
  + Character
  
2. Everything in R is an object. Objects can have attributes.
  + names, dimensions, and length are some possible attributes
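
The types and attributes listed above can be inspected directly. The sketch below is only an illustration: the variable names are made up for this example, while `class()` and `attributes()` are base R functions.

```r
# One value of each atomic type (variable names are illustrative)
num_val  <- 3.14     # numeric (double)
int_val  <- 5L       # integer (note the L suffix)
cplx_val <- 2 + 3i   # complex
lgl_val  <- TRUE     # logical
chr_val  <- "R"      # character

type_names <- c(class(num_val), class(int_val), class(cplx_val),
                class(lgl_val), class(chr_val))
type_names

# Attributes are extra pieces of information attached to an object
scores <- c(math = 90, stats = 85)
attributes(scores)   # only the names attribute is set here
length(scores)
```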

---

## Vectors in R:

A vector is the most basic object in R.

```{r echo=TRUE}
numVec <- 1:10 # <- is the assignment operator
numVec

charVec <- c("a", "b", "c") # c: to combine elements
charVec

logVec <- vector(mode = "logical", length = 10)
logVec
```

---

### Special Values:

There are some special values in R:
  + add the suffix L to make a value an integer: 1L
  + R knows infinity: Inf, -Inf
  + NaN: stands for "Not a Number"
  
```{r echo=TRUE}
intVec <- c(1L, 2L, 3L, 4L) 
intVec

a <- Inf; b <- 0
rslt <- c(b/a, a/a)
rslt
```

---

### Logical, Complex, & Character Vectors:

Let's see some examples of logical, complex, and character vectors:
```{r echo=TRUE}
logVec <- c(TRUE, FALSE, FALSE, T, F)
logVec

compVec <- c(1 + 0i, 3 + 1i)
compVec

charVec <- c("red", "green", "blue")
charVec
```

---

### Data Type Coercion:

+ In general, vectors CANNOT contain mixed types of objects
+ Exception: lists in R

```{r echo=TRUE, results='hide'}
numCharVec <- c(3.14, "a")
numCharVec # ? what would you expect to be printed?

numLogVec <- c(pi, T)
numLogVec # any guess?

charLogVec <- c("a", TRUE)
charLogVec # ?
```

+ In the examples above, we saw implicit coercion
+ Explicit coercion is also possible!
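
To check your guesses, a small sketch with `class()` is given below. It rebuilds the three mixed vectors under new, illustrative names and shows that R coerces to the most flexible type present (logical, then numeric, then character).

```r
# Implicit coercion: logical -> numeric -> character
mixed_num_chr <- c(3.14, "a")   # the number becomes the string "3.14"
mixed_num_lgl <- c(pi, TRUE)    # TRUE becomes 1
mixed_chr_lgl <- c("a", TRUE)   # TRUE becomes the string "TRUE"

class(mixed_num_chr)
class(mixed_num_lgl)
class(mixed_chr_lgl)
```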

---

### Data Type Coercion:

+ as(): To explicitly coerce objects from one type to another

```{r echo=TRUE}
numVec <- seq(from = 1200, to = 1300, by = 15)
numVec
numToChar <- as(numVec, "character")
numToChar
logVec <- c(F, T, F, T, T)
as(logVec, "numeric")
```

---

### Data Type Coercion:

+ Coercion does not always work! Be careful about warnings:

```{r echo=TRUE}
compVec <- c(12+10i, 1+6i, -3-2i)
as(compVec, "numeric")

charVec <- c("2.5", "3", "2.8", "1.5", "zero")
as(charVec, "numeric")
```
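
When a conversion fails, the result is NA. One possible way to find the offending entries is sketched below; the variable names are illustrative, and the approach assumes the input had no missing values to begin with.

```r
raw_values <- c("2.5", "3", "2.8", "1.5", "zero")

# suppressWarnings() hides the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(raw_values))
converted

# entries that could not be converted show up as NA
failed <- raw_values[is.na(converted)]
failed
```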

---

### Factors:

+ A factor is a vector object used to specify a discrete classification (categorical values).
+ Factors can be: 1) ordered, 2) unordered
+ It is better to give the levels of a factor self-descriptive labels
  + Compare gender coded as (0, 1) with gender labeled ("F", "M")
```{r echo=TRUE}
Gender <- rep(c("Female", "Male"), times = 3)
Gender
GenderFac1 <- factor(Gender)
GenderFac1
```

---

### Factors:

```{r echo=TRUE}
levels(GenderFac1)
table(GenderFac1)
unclass(GenderFac1) # bring the factor down to integer values
```

---

### Factors:
```{r echo=TRUE}
GenderFac1 # levels are ordered alphabetically; the 1st level is the base level
GenderFac2 <- factor(Gender, levels = c("Male", "Female"))
GenderFac1
GenderFac2
```

---

### Missing Values:

+ There are two kinds of missing values in R:
  + NaN: stands for "Not a Number" and is a missing value produced by numerical computation.
  + NA: When a value is "Not Available" or is "Missing", NA is assigned as its value.

+ A NaN is also treated as NA (the reverse is NOT true).
```{r echo=TRUE}
testScore <- NA
is.na(testScore)
is.nan(testScore)
```
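
The one-way relationship between NaN and NA can be checked with a short sketch (variable names are illustrative):

```r
nan_val <- 0 / 0   # a numerical computation that produces NaN
na_val  <- NA

is.na(nan_val)     # TRUE: NaN counts as missing
is.nan(nan_val)    # TRUE
is.na(na_val)      # TRUE
is.nan(na_val)     # FALSE: NA is not NaN
```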

---

### Matrices:

+ Matrix is a special case of vector:
  + Matrix has dimension attribute

```{r echo=TRUE}
myMat <- matrix(nrow = 2, ncol = 4)
myMat
attributes(myMat)
```

---

### Matrices:
```{r echo=TRUE}
myMat <- matrix(1:8, nrow = 2, ncol = 4)
myMat # matrices are filled in column-wise
```

---

### Matrix is a special vector:

```{r echo=TRUE}
myVec <- 1:8
myVec
dim(myVec) <- c(2,4)
myVec
```

+ Similar to vectors, all elements of a matrix should have the same type.
  + if not, R does an automatic coercion.

---

### Other Ways to Create Matrix:

+ Intuitively, a matrix is a set of vectors placed next to each other (either column-wise or row-wise).

+ rbind() (row bind) and cbind() (column bind) build a matrix in exactly this way:
```{r echo=TRUE}
vec1 <- 1:4
vec2 <- sample(1:100, 4, replace = FALSE)
vec3 <- rnorm(4, mean = 0, sd = 1)
colMat <- cbind(vec1, vec2, vec3)
colMat
```

---

### Other Ways to Create Matrix:

```{r echo=TRUE}
vec1 <- 1:4
vec2 <- sample(1:100, 4, replace = FALSE)
vec3 <- rnorm(4, mean = 0, sd = 1)
rowMat <- rbind(vec1, vec2, vec3)
rowMat
```

---

## Lists:

+ Think of a list as a vector with one main difference:
  + each element of a list can have its own class, regardless of the other elements
  + This means each element can be of a different data type and a different length
```{r echo=TRUE}
myVec <- c(10, "R", 10-5i, T)
myList <- list(10, "R", 10-5i, T)
myVec
```

---

## Lists:

```{r echo=TRUE}
myList <- list(10, "R", 10-5i, T)
myList
```
+ Elements of a list are printed with [[ ]]
+ Elements of a vector are printed with [ ]

---

## Data Frames:

+ We use data frames to store tabular data
+ A data frame is a special list in which all elements (columns) have equal length
+ What is the main difference between a data frame and a matrix?
```{r echo=TRUE}
studentID <- paste("S#", sample(c(6473:7392), 10), sep = "")
score <- sample(c(0:100), 10)
gender <- sample(c("female", "male"), 10, replace = TRUE)
data <- data.frame(studentID = studentID, score = score, gender = gender)
head(data)
```
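
One way to see how a data frame differs from a matrix is to store the same mixed-type data in both. The sketch below uses made-up variable names: the matrix coerces everything to character, while the data frame keeps one type per column.

```r
ids   <- c("S1", "S2", "S3")
marks <- c(88, 92, 75)

as_matrix <- cbind(ids, marks)   # a matrix holds a single type
as_df <- data.frame(ids = ids, marks = marks, stringsAsFactors = FALSE)

class(as_matrix[, "marks"])      # the numbers were coerced to character
sapply(as_df, class)             # each column keeps its own type
```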

---

## Subsetting:

+ We often need to take a subset of a vector, a matrix, a list, or a data frame.
+ We consider three main operators for taking a subset of an object:
  + [ ]: single brackets return an object of the same class as the original object. With [ ], we can also select more than one element.
  + [[ ]]: double brackets are used primarily for lists and data frames.
  + "$": is used primarily for lists and data frames (similar to double brackets).

+ With [[ ]] and $, we can select only one element!

+ [[ ]] and $ can return an object of a different class from the object we are subsetting.

---

### Subsetting examples:

```{r echo=TRUE}
myVec <- 10:20
myVec[3]

myList <- list(obj1 = "a", obj2 = 10, obj3 = T, obj4 = 10-5i)
myList[[3]]
myList$obj3
```

---

## Subsetting with [ ]:

+ With single brackets, we can select more than one element of an object.
+ In this case, index vectors can be very useful:
  + An index vector is a vector that identifies which elements of another vector (or matrix) to select
  
```{r echo=TRUE}
x <- seq(from=0, to=100,by=10) # length(x) is ??
IndVec <- c(1, 2, 3, 4, 5) # the first 5 elements 
x[IndVec]
```

---

## Index Vectors:

+ There are four types of index vectors:
  1. Logical index vector: should be of the same length as the vector from which we are selecting a subset. Values corresponding to TRUE in the index vector are selected.
  2. Vector of positive integers: all the values in this type of index vector must lie in 1:(length(x)).
  3. Vector of negative integers: this type of index vector indicates the values to be excluded from the vector.
  4. Vector of character strings: if a vector has a names attribute, we can take a subset of the vector by referring to the names of its elements.

---

## Index Vectors:
```{r echo=TRUE}
myVec <- letters[1:10]
names(myVec) <- paste("e", 1:10, sep = "")
myVec

logIndVec <- rep(c(T, F), each = 5)
logIndVec

posIndVec <- 1:5
negIndVec <- -6:-10
chIndVec <- c("e1", "e2", "e3", "e4", "e5")
```

---

## Index Vectors:
```{r echo=TRUE}
myVec[logIndVec]

myVec[negIndVec]

myVec[chIndVec]
```  
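
As a compact check, the sketch below (with a small, made-up named vector) shows that all four kinds of index vector can select the same subset:

```r
vals <- c(e1 = "a", e2 = "b", e3 = "c", e4 = "d")

by_logical  <- vals[c(TRUE, TRUE, FALSE, FALSE)]  # keep where TRUE
by_positive <- vals[1:2]                          # keep positions 1 and 2
by_negative <- vals[-(3:4)]                       # drop positions 3 and 4
by_name     <- vals[c("e1", "e2")]                # keep by element name

by_positive
```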

---

## Logical Index Vectors:

+ logical index vectors can be generated by using conditional statements:
  + Using ==, !=, <, >, ...

```{r echo=TRUE}
myVec <- 1:10
logIndVec <- (myVec < 5)
logIndVec
myVec[logIndVec]
```  

---

## Matrix Indexing:
+ Similar to vector indexing, we can refer to individual elements of a matrix.
```{r echo=TRUE}
myMat <- matrix(1:8, ncol = 4)
myMat
myMat[1,1] # referring to an element
myMat[2,] # referring to the second row
myMat[,3] # referring to the third column
```  

---

## Matrix Indexing:

+ By default, when the retrieved elements of a matrix form a single row, column, or element, R drops the dimension attribute and returns a vector. We can turn this behavior off by setting drop = FALSE

```{r echo=TRUE}
myMat[1,1]
myMat[1,1, drop = FALSE]

myMat[2,, drop = FALSE]
``` 

---

## Subsetting Lists:
```{r echo=TRUE}
myList <- list(ch = letters[1:2], lg = F, nm = 1:3)
myList
myList[1] # subset is still a list
``` 

---

## Subsetting Lists:
```{r echo=TRUE}
myList[1:2] # subset is still a list
myList[[1]] # returning the 1st obj with its own class
myList$ch # alternative to [[]]
``` 

---

## Subsetting Lists:
```{r echo=TRUE}
myList[[1]][2] # returning the 2nd element of the 1st obj
myList$ch[2]
myList[[c(1,2)]]
``` 

---

## Subsetting Data Frames:
```{r echo=TRUE}
library(datasets)
data(quakes) # ?quakes for more info
str(quakes)
head(quakes$long)
``` 

---

## Subsetting Data Frames:

```{r echo=TRUE}
quakes[1:10,]
``` 
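
Rows can also be selected with a logical index vector, and columns by name. A minimal sketch on the same `quakes` data follows; the cutoff of 6 is an arbitrary choice for illustration.

```r
library(datasets)

# rows with magnitude of at least 6, keeping three columns by name
strong <- quakes[quakes$mag >= 6, c("lat", "long", "mag")]
strong
nrow(strong)
```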

---

## Time to Break for 10 Minutes :)

---

## Session 2 - Agenda

1. Vectorized Operations in R
2. Reading and Writing in R
3. Control Structure
4. R Packages and Functions

---

## Vectorized Operations

R can perform vectorized operations, so there is no need to write loops:
```{r echo=TRUE}
x <- 1:5
y <- c(1, 2, 6, 7, 10)
x + y # R does an element by element summation
x < y
```

---

## Vectorized Operations

+ Similar to vectors, vectorized operations can be performed for Matrices:
```{r echo=TRUE}
x <- matrix(1:9, ncol = 3)
y <- matrix(rep(c(5,6,7), 3), ncol = 3)
x + y # R does an element by element summation
x < y
```

---

## Reading and Writing Data

**The slides for the "Reading and Writing Data" section are mainly from Dr. Roger D. Peng, Associate Professor at Johns Hopkins**

Main functions for reading data into R:

1. read.table(), read.csv(): to read tabular data 
2. readLines(): to read lines of a text file
3. source(), dget(): to read R code
4. load(): to read saved workspaces

+ Only read.table() and read.csv() are covered in this lecture. 

---

## Reading and Writing Data

Main functions for writing data from R:

1. write.table(), write.csv(): to write tabular data to file
2. writeLines(): to write lines to a text file
3. dump(), dput(): to write R code to a file
4. save(): to save a workspace

+ Only write.table() is covered in this lecture. 

---

## read.table():

+ read.table() is the most commonly used function for reading data into R. Its most important arguments are:

  + file: name of, or path to, the file of interest
  + header: logical indicator of whether the file has a header line
  + sep: string indicating how columns are separated (in a .csv file, sep = ",")
  + colClasses: a character vector giving the class of each column
  + nrows: number of rows to read from the dataset
  + comment.char: the character used in the file to mark comments
  + skip: number of lines to skip from the beginning of the file
  + stringsAsFactors: logical indicator of whether character columns should be converted to factors

+ read.csv() is essentially read.table() with the defaults sep = "," and header = TRUE
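
The arguments above can be tried on a tiny file. The sketch below writes a made-up three-row CSV to a temporary file (the file contents and variable names are assumptions for illustration only) and reads it back with several of the arguments set explicitly.

```r
# create a small toy file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "# toy file created for this example",
  "id,score,group",
  "1,90.5,a",
  "2,75.0,b",
  "3,88.25,a"
), csv_path)

toy <- read.table(csv_path,
                  header = TRUE, sep = ",",
                  skip = 1,                 # skip the first (comment) line
                  comment.char = "",        # no comment handling needed after skip
                  colClasses = c("integer", "numeric", "character"),
                  nrows = 3,
                  stringsAsFactors = FALSE)
str(toy)
unlink(csv_path)                            # remove the temporary file
```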

---

## read.table():

```{r echo = TRUE, eval=TRUE}
irisFile <- read.table(file = "iris.csv", sep=",", header = TRUE)
head(irisFile)
```

+ To make read.table() run faster:
  + set comment.char = "" (if the file contains no comments)
  + set colClasses up front

---

## Calculating Memory Requirements:

+ Note that datasets are read into RAM, so you need enough RAM to hold the dataset you are reading.

+ Consider a data frame with 1.5 million rows and 120 columns. How much memory is required to read this dataset?

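1.5 million rows $\times$ 120 columns $\times$ 8 bytes per numeric value = $1.44 \times 10^9$ bytes = $1.44 \times 10^9 / 2^{20}$ MB $\approx$ 1,373.29 MB $\approx$ 1.34 GB

+ So it's recommended to have about twice that much RAM (roughly 2 $\times$ 1.34 GB = 2.68 GB) to read that dataset.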
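
The same arithmetic can be reproduced in R. The sketch below uses illustrative variable names and also measures a much smaller matrix with `object.size()` to confirm the 8-bytes-per-numeric assumption (the reported size includes a small fixed overhead).

```r
n_rows <- 1.5e6
n_cols <- 120
bytes_per_numeric <- 8

total_bytes <- n_rows * n_cols * bytes_per_numeric
total_mb <- total_bytes / 2^20
total_gb <- total_mb / 2^10
round(c(MB = total_mb, GB = total_gb), 2)

# sanity check on a matrix 1000 times smaller
small_mat <- matrix(0, nrow = 1500, ncol = n_cols)
as.numeric(object.size(small_mat))   # a little over 1500 * 120 * 8 bytes
```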

---

## write.table():

```{r echo = T, eval=FALSE}
write.table(irisFile, file = "path/to/the/file")
```
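
A runnable variant is sketched below. It writes the built-in `iris` data to a temporary file (so no real path is needed) and reads it back to confirm nothing was lost; the choices sep = "," and row.names = FALSE are assumptions made for this example.

```r
out_path <- tempfile(fileext = ".csv")

# write a comma-separated file without the row-name column
write.table(iris, file = out_path, sep = ",", row.names = FALSE)

iris_back <- read.csv(out_path)
dim(iris_back)
unlink(out_path)   # remove the temporary file
```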

---

## Loops:

+ There are 3 ways in R to write loops:
  + for 
  + repeat (skipped!)
  + while (skipped!)
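
Although `while` and `repeat` are skipped in the lecture, a minimal sketch of both is included here for reference (the counter names are illustrative):

```r
# while: keeps running as long as the condition is TRUE
count <- 1
while (count <= 3) {
  print(paste("while cycle #", count, sep = ""))
  count <- count + 1
}

# repeat: runs until break is called explicitly
tries <- 0
repeat {
  tries <- tries + 1
  if (tries == 3) break
}
tries
```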

---

### for:

```{r echo = T}
for(i in 1:4){
  print(paste("cycle #", i, sep = ""))
  # no need to increment i: for() moves to the next value automatically
}
```

---

## if:

+ if/else statements are used to write conditional statements

```{r echo = T}
x <- 7
if (x < 10){
  print("x is less than 10")
}else{
  print("x is greater than or equal to 10")
}
```

---

## if:
```{r echo = T}
age <- sample(1:100, 10)
ageCat <- rep(NA, length(age))
for (i in 1:(length(age))) {
    if (age[i] <= 35){
       ageCat[i] <- "Young"
      }else if (age[i] <= 55){
        ageCat[i] <- "Middle-Aged"
      }else{
         ageCat[i] <- "Old"
      } 
}
age.df <- data.frame(age = age, ageCat = ageCat)
age.df[1:3,]
```
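
The same categorization can be done without a loop. The sketch below uses `cut()` with the same cut points (35 and 55) on a fixed, made-up vector of ages; it is an alternative to the loop above, not a replacement for learning if/else.

```r
ages <- c(20, 35, 36, 55, 56, 80)

# intervals are (-Inf, 35], (35, 55], (55, Inf)
age_groups <- cut(ages,
                  breaks = c(-Inf, 35, 55, Inf),
                  labels = c("Young", "Middle-Aged", "Old"))
as.character(age_groups)
```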

---

## Functions and Packages:

1. The R language has many built-in functions
2. A function is called by its name followed by ()
3. Arguments of a function are put within the parentheses
4. R packages are a convenient way to maintain collections of R functions and data sets
5. Packages allow for easy, transparent and cross-platform extension of the R base system


---

## Functions and Packages:

There are some terms which sometimes get confused and should be clarified:

1. Package: An extension of the R base system with code, data and documentation in a standardized format
2. Library: A directory containing installed packages
3. Repository: A website providing packages for installation
4. Source: The original version of a package with human-readable text and code
5. Base packages: Part of the R source tree, maintained by R Core

+ for more info on how R packages are developed, please read: "Creating R Packages: A Tutorial" (Friedrich Leisch)
  + http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf


---

## How to install a package in R:

There are three main ways to install a package in R:

1. Installing from CRAN: install a package directly from the repository
  + Using RStudio: Tools > Install Packages
  + From the R console: install.packages()

2. Installing from source: first download the add-on R package, then run the following command in a terminal (not the R console) to install it:
  + R CMD INSTALL packageName -l path/to/your/Rpackage/Directory

3. Installing from a version control repository (GitHub):
  + Check out https://github.com/hadley/devtools

+ Once you install a package, you need to load it into R using the function library()
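
A common install-then-load pattern is sketched below. The package name `parallel` is used only because it ships with R, so the sketch runs offline and the install step is never triggered; substitute the name of any CRAN package you need.

```r
pkg_name <- "parallel"   # illustrative choice: already bundled with R

# install only if the package is not available yet
if (!requireNamespace(pkg_name, quietly = TRUE)) {
  install.packages(pkg_name)
}

# load the package; character.only = TRUE lets us pass the name as a string
library(pkg_name, character.only = TRUE)
"package:parallel" %in% search()
```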


---

## Popular Packages in R:

1. To visualize data:
  + ggplot2: to create beautiful graphics
  + googleVis: to use Google Chart tools
  
2. To report results:
  + shiny: to create interactive web-based apps
  + knitr: to combine R code with LaTeX/Markdown
  + slidify: to build HTML 5 slide shows
  
3. To write high-performance R code:
  + Rcpp: to write R functions that call C++ code
  + data.table: to organize datasets for fast operations
  + parallel: to use parallel processing in R
  

---


## Calling a function in R


```{r echo=TRUE}
str(sample)
```

+ Consider sample() in R. Run ?sample in the R console to read the help page for this function.
+ sample() takes four arguments:
  + x: sample space in the form of a vector
  + size: your desired sample size
  + replace: sampling with/without replacement
  + prob: a vector of probability weights

+ Some of the arguments have default values. Which ones?

+ How do we use (or call) this function?


---

## Calling a function in R

```{r echo=TRUE}
# Function arguments can be matched: 1) by position or 2) by name
sampSpace <- 1:6 # rolling a die
sample(sampSpace, 1) # arguments with default values can be omitted
sample(size = 1, x = sampSpace) # no need to remember the order 
sample(size = 1, sampSpace)
```


---

## Writing your Own functions

```{r echo=TRUE, eval=FALSE}
yourFnName <- function(<your arguments>){
  # body of your code
  
  # return the output of the function
}

# to use your function, you can simply call the function name as:
yourFnName(<set values for the input arguments>)
```


---

## Writing your Own functions

+ Let's write a function that takes three arguments: a, b, c
+ The function then returns the minimum of these three numbers

```{r echo=TRUE, eval=TRUE}
myMin <- function(a, b, c){
  myMinVal <- min(a, b, c)
  return(myMinVal)
}

myMin(10, 20, 30)
myMin(10, NA, 20) # ? how to fix this so it returns 10
```
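
One possible answer to the question in the last comment is to pass `na.rm = TRUE` on to `min()`. The sketch below uses a hypothetical function name, `safe_min`, and "..." so that any number of values can be supplied.

```r
# safe_min is an illustrative name, not part of base R
safe_min <- function(..., na.rm = TRUE) {
  min(..., na.rm = na.rm)   # missing values are dropped by default
}

safe_min(10, NA, 20)                  # NA is ignored
safe_min(10, NA, 20, na.rm = FALSE)   # NA propagates, as in myMin()
```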


---

## Some notes on Functions

> 1. Variables defined within a function are local (i.e. they do not exist outside of the function).

> 2. Functions in R are first-class objects, like any other object. This means functions can be passed as arguments to other functions.

> 3. Arguments of functions are evaluated only when they are needed (lazy evaluation).

> 4. "..." can be an argument of a function; it is used when the number of input arguments can vary and is not fixed up front.


---

## Lazy Evaluation of Function Arguments

```{r echo=TRUE}
myLazyFn1 <- function(a, b){
  return(a)
}
myLazyFn1(10) # No error!

myLazyFn2 <- function(a, b){
  print(a)
  print(b)
  return(1)
}
myLazyFn2(10) 
```
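
The second call fails only when `b` is actually used. The sketch below (function names are illustrative) captures that error with `tryCatch()` and shows `missing()` as one way to handle an argument that was not supplied.

```r
lazy_demo <- function(a, b) {
  print(a)   # works: a was supplied
  print(b)   # fails here: b is evaluated for the first time
  1
}

# capture the error message instead of stopping
msg <- tryCatch(lazy_demo(10), error = function(e) conditionMessage(e))
msg

# missing() tests whether an argument was supplied
lazy_safe <- function(a, b) {
  if (missing(b)) "b not supplied" else a + b
}
lazy_safe(10)
lazy_safe(1, 2)
```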


---

## Some useful functions:

+ Here we are going to talk about:
  + str(): a function to display the internal structure of an object
  + apply(): to apply a function to a matrix or data frame
  + lapply(), sapply(), tapply(), mapply(): to apply a function to a vector or list
  + split(): to split a dataset by levels of a factor


---

### str():

+ str() is a compact way of understanding what an object is and what is in that object.
```{r echo=TRUE}
str(str)
str(sample)

genderF <- factor(sample(c("Male", "Female"), 20, replace = TRUE))
str(genderF)
```


---

### str():
```{r echo=TRUE}
myMat <- matrix(1:10, ncol = 5)
str(myMat)
myList <- list(numVec = 1:3, logVec = F, charVec = LETTERS[1:4])
str(myList)
```


---

### apply():

```{r echo=TRUE}
str(apply) # try ?apply for more info
```

+ apply() applies a function (FUN) over a MARGIN of a matrix or data frame (X)

+ MARGIN: a vector giving the subscripts over which the function will be applied
  + 1: indicates rows
  + 2: indicates columns
  + c(1, 2): indicates rows and columns

+ FUN: refers to the function that we want to apply on the dataset

+ "..." : additional arguments of FUN
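
The three MARGIN settings can be compared on a tiny matrix. The following sketch uses a made-up 2 x 3 matrix and illustrative variable names.

```r
small_mat <- matrix(1:6, nrow = 2)
small_mat

row_totals <- apply(small_mat, 1, sum)       # one result per row
col_totals <- apply(small_mat, 2, sum)       # one result per column
doubled    <- apply(small_mat, c(1, 2),      # applied to every cell
                    function(v) v * 2)

row_totals
col_totals
doubled
```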


---

### apply():

```{r echo=TRUE}
myMat <- matrix(1:10, ncol = 5)
myMat[2,c(2, 5)] <- NA
myMat
apply(myMat, 2, sum, na.rm = TRUE)
```


---

### apply():

```{r echo=TRUE}
# consider iris dataset: 
head(iris) # more info ?iris
# suppose we want the 25th and 75th percentiles of each numeric column
```


---

### apply():

```{r echo=TRUE}
# Consider iris dataset: 
apply(iris[,-5], 2, quantile, probs = c(0.25, 0.75))
```


---

### lapply() and sapply():

```{r echo=TRUE}
str(lapply)
str(sapply)
```
+ X: a list, data frame, or vector
+ FUN: the function to be applied to each element of X
+ "...": other arguments of FUN


---

### lapply() and sapply():
```{r echo=TRUE}
myList <- list(e1 = 1:10, e2 = -1:-10)
lapply(myList, mean)
sapply(myList, mean)
```


---

### lapply() v. sapply()?:

+ sapply() simplifies the result of lapply.
  
+ If the result of lapply() is a list with all elements of the same length:
    + if length == 1: sapply() returns a vector
    + if length > 1: sapply() returns a matrix
    
+ Otherwise, sapply() returns a list, just like lapply()
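
The three cases can be seen side by side in a short sketch. The input lists and variable names are made up for illustration.

```r
same_len <- list(a = 1:3, b = 4:6)

as_vector <- sapply(same_len, mean)    # every result has length 1
as_matrix <- sapply(same_len, range)   # every result has length 2

# results of different lengths cannot be simplified
as_list <- sapply(list(a = 1:3, b = 1:5), function(v) v[v > 2])

as_vector
as_matrix
as_list
```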


---

### lapply() & sapply() with a user-defined FUN
```{r echo=TRUE}
myList <- list(e1 = 1:10, e2 = -1:-10)
lapply(myList, function(element){return(mean(c(element[1], element[length(element)])))})
sapply(myList, function(element){return(mean(c(element[1], element[length(element)])))})
```


---

### tapply():
```{r echo = TRUE}
str(tapply)
```

+ tapply() applies a function to subsets of a vector, where the subsets are defined by the levels of one or more factors
+ X: a vector
+ INDEX: a factor, or a list of one or more factors, each of the same length as X
+ FUN: our function of interest
+ "...": other arguments of FUN
+ simplify: any guess???
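
To test your guess about simplify, compare the two calls in the sketch below. The heights and group labels are made-up values for illustration.

```r
heights <- c(180, 182, 170, 168)
sex <- factor(c("M", "M", "F", "F"))

simplified   <- tapply(heights, sex, mean)                    # numeric result
unsimplified <- tapply(heights, sex, mean, simplify = FALSE)  # list result

simplified
unsimplified
```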


---

### tapply():
```{r echo = TRUE}
HeightDF <- data.frame(height = c(rnorm(100, 180, 3), rnorm(100, 170, 3)), gender = factor(rep(c("M", "F"), each = 100)))
head(HeightDF)
tapply(HeightDF$height, HeightDF$gender, mean)
```


---

### mapply():
```{r echo=TRUE}
str(mapply)
```
  
+ All the previous "apply" functions iterate over a single object:
  + f(x, {some other parameters})
  
+ What should we do if we want to apply a function over several objects at once?
  + f(x, y, {some other parameters}) # we can have more than 2 variables 


---

### mapply():
```{r echo=TRUE}
l1 <- list(e1 = 1:10, e2 = 1:10)
l2 <- list(e1 = -1:-10, e2 = -1:-10)
# how to get l1$e1[i] + l1$e2[i] + l2$e1[i] + l2$e2[i] ? 
mapply(sum, l1$e1, l1$e2, l2$e1, l2$e2)
```


---

### split():
```{r echo=TRUE}
str(split) # ?split for more info
```

+ x: a vector or a data frame
+ f: factor
+ drop: should R drop empty factor levels?


---

### split():
```{r echo=TRUE}
str(HeightDF)
# Goal: to separate Females from Males
splitData <- split(HeightDF$height, HeightDF$gender)
str(splitData)
```
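
Combining `split()` with `sapply()` gives the same group summaries as `tapply()`. The sketch below uses its own small simulated data (names and sample sizes are assumptions for illustration) so that it runs on its own.

```r
set.seed(1)   # for reproducible random numbers
toy_heights <- c(rnorm(5, 180, 3), rnorm(5, 170, 3))
toy_gender  <- factor(rep(c("M", "F"), each = 5))

# split into one vector per level, then summarize each piece
by_split  <- sapply(split(toy_heights, toy_gender), mean)
by_tapply <- tapply(toy_heights, toy_gender, mean)

by_split
by_tapply
```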


---


## Time to Break for 10 Minutes :)


---


## Session 3 - Agenda

1. Useful Matrix Functions
2. Statistical Distributions in R

---


## Useful Matrix Functions

Consider matrix "A". We can then have:
> 1. t(A): transpose of A
> 2. solve(A): to get the inverse of A
> 3. eigen(A): to get the eigenvalues and eigenvectors of A (A must be square)
> 4. We only cover solve() in this lecture

---

### solve():
+ Consider B = A %*% X (where X is an unknown matrix)
+ Then: X = solve(A, B)
+ In the special case where B = I (the identity matrix), X = $A^{-1}$

```{r echo=TRUE}
A <- matrix(c(1, 2, 3, 2, 4, 5, 3, 5, 6), ncol = 3)
A
# to get inverse of A: solve(A)
```
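
The two-argument form `solve(A, B)` solves a linear system directly. A minimal sketch with a made-up 2 x 2 system (2x + y = 3 and x + 3y = 5) follows; the variable names are illustrative.

```r
coef_mat <- matrix(c(2, 1, 1, 3), ncol = 2)   # rows: (2, 1) and (1, 3)
rhs <- c(3, 5)

x_sol <- solve(coef_mat, rhs)   # solves coef_mat %*% x = rhs
x_sol

# check: multiplying back should reproduce rhs
coef_mat %*% x_sol
```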

---

### solve():
```{r echo=TRUE}
solve(A)
# To check that solve(A) is inverse of A:
solve(A)%*%A
```
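+ Because of floating-point rounding, solve(A) %*% A may differ from the identity matrix by tiny amounts related to machine epsilon.
+ Machine epsilon is defined to be the smallest positive number which, when added to 1, gives a number different from 1.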
+ Please visit http://en.wikipedia.org/wiki/Machine_epsilon for more info
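
In R, machine epsilon is available as `.Machine$double.eps`. The sketch below checks its defining property and uses `all.equal()`, which compares numbers up to a small tolerance, to confirm that a computed inverse behaves like one. The matrix is rebuilt under an illustrative name so the example runs on its own.

```r
eps <- .Machine$double.eps
eps
(1 + eps) != 1       # TRUE: adding eps changes 1
(1 + eps / 2) != 1   # FALSE: half of eps is lost to rounding

sym_mat <- matrix(c(1, 2, 3, 2, 4, 5, 3, 5, 6), ncol = 3)
product <- solve(sym_mat) %*% sym_mat

# compare with the identity matrix up to numerical tolerance
isTRUE(all.equal(product, diag(3)))
```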

---

## Statistical Distributions in R:

+ R has many built-in statistical distributions
  + examples: binomial, poisson, normal, chi square, ...

+ Each distribution in R has four functions:
  + these functions begin with a "d", "p", "q", or "r" and are followed by the name of the distribution

+ ddist(parameters): gives the density (or probability mass) of the distribution
+ rdist(parameters): generates random numbers from the distribution
+ qdist(parameters): gives the quantiles of the distribution
+ pdist(parameters): gives the cumulative distribution function (CDF)
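
How the four functions relate to each other can be checked numerically. The sketch below uses the binomial and normal distributions with illustrative variable names.

```r
# p is the running sum of d for a discrete distribution
cdf_from_density <- sum(dbinom(0:5, size = 10, prob = 0.5))
cdf_direct <- pbinom(5, size = 10, prob = 0.5)
c(cdf_from_density, cdf_direct)

# q is the inverse of p
round_trip <- qnorm(pnorm(1.5))
round_trip

# r draws random values; about half should fall below the median
set.seed(42)
draws <- rnorm(1000)
mean(draws <= 0)
```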


---

### Example of a Discrete Distribution:
```{r echo=TRUE}  
# Consider tossing a coin 10 times
str(dbinom)
dbinom(5, 10, 0.5) # prob of getting five heads

str(pbinom) # cumulative dist
pbinom(5, 10, 0.5) # Pr[X <= 5]
```


---

### Example of a Discrete Distribution:
```{r echo=TRUE}  
str(qbinom) # quantile: Pr[X <= ?] = known value
qbinom(0.6230, 10, 0.5) # get the value of ? s.t. Pr[X <= ?]=0.6230

str(rbinom) # Generating random numbers
rbinom(20, 10, 0.5) # 20 ind samples from binomial(10, 0.5)
```


---

### Example of a Continuous Distribution:
```{r echo=TRUE}  
# Consider a standard Normal distribution
str(dnorm)
dnorm(x = 0, mean = 0, sd = 1, log = FALSE)

str(pnorm) # cumulative dist
pnorm(0, mean = 0, sd = 1)
```


---

### Example of a Continuous Distribution:
```{r echo=TRUE}  
str(qnorm) # quantile
qnorm(0.5, mean = 0, sd = 1)

str(rnorm) # generating random numbers
rnorm(10, mean = 0, sd = 1)
```


---

### Example of a Continuous Distribution:
```{r echo=TRUE, fig.height = 4.5, fig.align='center'}  
# Let's try plotting Normal curve (more on plotting later)
x <- seq(from = -3, to = 3, by = 0.05)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l")
```

---


## Time for Lunch Break for 30 Minutes. Please come back at 12:30 :)


---

## Exercises: Analysis for Auto-Mpg Data

1. Dataset: Auto-Mpg Data from UCI Machine Learning Repository (the data is slightly modified for the exercises of this workshop)
2. Download: click the "download" button in slide 1 and extract the zip file. The data files will be in the folder "data". 
3. Variables (names saved in auto-mpg-names.txt): 
  + continuous: mpg, displacement, horsepower, weight, acceleration
  + discrete: cylinders, model year, origin
  + string: car name (not used in the analysis)
  + descriptions: mpg (city-cycle fuel consumption in miles per gallon), cylinders (# of cylinders), displacement (engine displacement in cu. inches), weight (vehicle weight in lbs.), acceleration (time to accelerate from 0 to 60 mph in sec.), model year (modulo 100), origin (1: American, 2: European, 3: Japanese).
4. More information:   https://archive.ics.uci.edu/ml/datasets/Auto+MPG

---

## Questions to answer
1. Does mpg depend on the origin of the car?
2. How is mpg related to other variables?
3. Predict mpg using the other variables provided in the data.

---

## Exercises: Section 1

  + The exercises can be found in section1_exercises.Rmd 
  + The solutions are in ex_code.r
  
### Some suggestions
  + Try to solve the exercises without looking at the solutions
  + Feel free to ask us for help (or use Slack!)

---

## Exercises: Section 1

The first section of exercises deals with reading a dataset into R, exploring various structural and content-related features of the data, and manipulating the dataset so that it is in a form we can use later for analyses.

---


## Exercise 0. Getting ready.

**0.1** Open a new R script file to write and save your code for the exercises. 

**0.2** To execute code, you can either highlight the code and press Ctrl+Enter (Cmd+Return), or copy and paste the code to the console and press Enter (Return).

---

## Exercise 1. Find and import R data.

**1.1** Find the folder where your R data files are saved and set your working directory to that folder using ```setwd()```. 


**1.2** Import "auto-mpg.csv" using ```read.csv()```, storing the data as an object called "data" (i.e., ```data <- read.csv(...)```)

* In this dataset, there is no header (i.e., no variable names) and missing values are denoted as NA. Therefore, within the ```read.csv()``` function:
    + Set ```header = FALSE```
    + Set ```na.strings = "NA"```
    + *Note*: If you need help, type ```?read.csv```


---

## Exercise 1 (continued)

**1.3** Now that your data is loaded, use the ```head()``` function to look at the first few rows of the data to make sure it looks okay (you can open the original CSV file in Excel or Notepad to compare). As mentioned above, you should notice that the data does not contain variable names. We will fix that in the next exercise. 

**1.4** Check the dimensions of the data, the number of rows in the data, and the number of columns in the data using the functions ```dim()```, ```nrow()```, and ```ncol()```, respectively. 

---


## Exercise 2. Add variable names to the data.

**2.1** Use the function ```readLines()``` to read in "auto-mpg-names.txt", a file that contains the variable names for our data. Store this as an object called "varnames".

* *Note*: The difference between ```readLines()``` and ```read.table()``` or ```read.csv()``` is that ```readLines()``` imports the data file into a vector of strings, while ```read.table()``` imports the data file into a data frame.

**2.2** Run ```names(data)```. This returns the variable names of our data frame.

**2.3** Assign the new variable names (i.e., varnames) to ```names(data)```. 


---

## Exercise 3. Summarize the data. 

**3.1** Summarize the data using the ```str()``` and ```summary()``` commands.

* *Note*: Notice the different kinds of information each of these functions provides about the data. In particular, ```str()``` summarizes the structure of the data, while ```summary()``` summarizes the content of the data. 


---

## Exercise 4. Subsetting the data.

**4.1** Subset the following:

a. The first row of the data frame.
b. The mpg (first) column of the data frame (there are three ways to do this).
c. The second row, first column of the data frame.


**4.2** Summarize the variable mpg using ```summary()```. Do you see something weird in the result? What might be the reason? We will get back to this later.


**4.3** Above we summarized a single variable. Next, we will summarize multiple variables at once. 

* Create an index vector called "index_cont" for the numbers 1,3,4,5,6 using ```c()```. These numbers correspond to the columns that contain continuous variables. Then, use that vector to subset the continuous variables from our data, and summarize them using ```summary()```. 


**4.4** Finally, let's remove the variable car_name (we will not use it in subsequent exercises). 

* *Hint*: you can either assign NULL (empty) to the variable "car_name", or redefine data to be the subset of the data that does not contain "car_name".


---

## Exercise 5. Discrete variables and factors. 

In this set of exercises, we will convert a variable to a factor and change the levels of the factor.

**5.1** The variable "origin" is of the class integer (run ```class(data$origin)``` to check for yourself), but it is categorical by nature. Convert "origin" to a factor using the ```factor()``` function and assign it back to ```data$origin```. 


**5.2** Next, we want to change the levels of ```data$origin```. Check the current levels by running ```levels(data$origin)```. Then, change the levels to the following: 

* 1: American, 2: European, 3: Japanese
* *Hint*: create a character vector with the new levels and assign it to ```levels(data$origin)```. 


---

## Exercise 6. Missing values. 

In this section, we will recode missing values and then remove entries containing missing values from our data.

**6.1** Recall that in Exercise 4.2 we saw the weird value of "-99" in "mpg". Sometimes an unlikely value (commonly -99, 99, or 999) is used to code missing values. It is always important to confirm with whoever entered the data that such values really were meant as missing. Let's assume that this has been confirmed, and replace all instances of "-99" with NA. 

**6.2** Read the help file for the function ```na.omit()```, and use this function to create a new dataset (store it as "data_noNA") that contains only the instances that have no missing values on any variable. We will be using data_noNA for the remaining exercises. 


---


## Session 4 - Agenda

1. T-Test in R
2. ANOVA in R
3. Linear Regression in R
4. Logistic Regression in R


---

## T-Test in R

T-tests can be categorized into two groups:
  + 1) One-sample t-test
  + 2) Two-sample t-test


---

###  One-Sample T-Test
```{r echo=TRUE}
oneSampData <- rnorm(100, mean = 0, sd = 1)
oneSampTest.0 <- t.test(oneSampData) # ?t.test
oneSampTest.0
```


---

###  One-Sample T-Test
```{r echo=TRUE}
names(oneSampTest.0) # alternative to names()?? 
```  
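
The object returned by `t.test()` is a list, so its components can be pulled out with `$`. A small sketch, assuming simulated data and an arbitrary seed (the names `x` and `res` are illustrative only):

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
x <- rnorm(100, mean = 0, sd = 1)
res <- t.test(x)

res$statistic   # t statistic
res$p.value     # p-value
res$conf.int    # 95% confidence interval for the mean
res$estimate    # sample mean

# str() is another way to see everything stored in the object
str(res)
```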


---

###  Two-Sample T-Test
Two-sample t-tests fall into three groups:
  + T-Test with equal variances
  + T-Test with unequal variances
  + Paired T-Test: can also be viewed as a one-sample t-test on the pairwise differences.
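
The equivalence between a paired test and a one-sample test on the differences can be checked directly. This is a sketch on simulated data; the vectors `before` and `after` are hypothetical:

```r
set.seed(2)  # arbitrary seed
before <- rnorm(30, mean = 2.5, sd = 1)
after  <- before + rnorm(30, mean = 0.5, sd = 1)

paired  <- t.test(before, after, paired = TRUE)
onesamp <- t.test(before - after)   # one-sample test on the differences

# Both calls give the same t statistic and p-value
all.equal(unname(paired$statistic), unname(onesamp$statistic))
all.equal(paired$p.value, onesamp$p.value)
```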


---

###  Two-Sample T-Test (Unequal Variances)
```{r echo = TRUE}
Samp1 <- rnorm(30, mean = 2.5, sd = 1)
Samp2 <- rnorm(50, mean = 5.5, sd = 1)
t.test(Samp1, Samp2)  # default assump: unequal variances
```


---

###  Two-Sample T-Test (Equal Variances)
```{r echo = TRUE}
t.test(Samp1, Samp2, var.equal = TRUE)  # var.equal = TRUE pools the two variances
```
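
For reference, the equal-variance statistic can be reproduced by hand from the pooled variance. A sketch on freshly simulated data (the names `a`, `b`, and `sp2` are illustrative):

```r
set.seed(3)  # arbitrary seed
a <- rnorm(30, mean = 2.5, sd = 1)
b <- rnorm(50, mean = 5.5, sd = 1)
n1 <- length(a)
n2 <- length(b)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2)
t_manual <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Compare with the built-in result
t_builtin <- t.test(a, b, var.equal = TRUE)$statistic
all.equal(t_manual, unname(t_builtin))
```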


---

###  Two-Sample T-Test (Paired T Test)
```{r echo = TRUE}
t.test(Samp1, Samp2[1:30], paired = TRUE)
```


---

##  ANOVA
If you are not familiar with ANOVA, simply consider ANOVA as an extension to two-sample t-test where we have more than two groups.

```{r echo = TRUE}
Samp1 <- round(rnorm(10, mean = 25, sd = 1), 1)
Samp2 <- round(rnorm(10, mean = 30, sd = 1), 1)
Samp3 <- round(rnorm(10, mean = 35, sd = 1), 1)
myDF <- data.frame(y = c(Samp1, Samp2, Samp3), group = rep(c(1, 2, 3), each = 10))
myDF$group <- as.factor(myDF$group)
str(myDF)
```


---

##  ANOVA
```{r echo = TRUE}
ANOVAfit <- lm(y ~ group, data = myDF)  # instead of lm, aov() can also be used!
myANOVA <- anova(ANOVAfit)  # anova computes analysis of variance tables on a fitted model object.
str(myANOVA)  # inspect the structure of the ANOVA table
```
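
As noted in the code comment, `aov()` can be used instead of `lm()`. A brief sketch on simulated data (the data frame `toyDF` is hypothetical) suggests both routes report the same F statistic:

```r
set.seed(4)  # arbitrary seed
toyDF <- data.frame(
  y = c(rnorm(10, 25), rnorm(10, 30), rnorm(10, 35)),
  group = factor(rep(c(1, 2, 3), each = 10))
)

fit_lm  <- anova(lm(y ~ group, data = toyDF))
fit_aov <- summary(aov(y ~ group, data = toyDF))

# The F statistic for 'group' is the same either way
fit_lm[["F value"]][1]
fit_aov[[1]][["F value"]][1]
```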


---

##  ANOVA
+ To learn more on how to fit ANOVA, please visit: 
  + http://www.statmethods.net/stats/anova.html
  

---

##  Linear Regression- Data:
+ lm() is used to fit a linear regression model
+ Here we use the "Prestige" dataset from the "car" package
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
# install.packages("car")  # run once if the package is not installed
library(car)
data(Prestige) # load the data
str(Prestige)
```

---

##  Linear Regression- Data Description: 

+ education: Average education of occupational incumbents, years, in 1971.

+ income: Average income of incumbents, dollars, in 1971.

+ women: Percentage of incumbents who are women.

+ prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.

+ census: Canadian Census occupational code.

+ type: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.


---

##  Linear Regression - Fit:
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
myReg <- lm(prestige ~ education + income + women, data = Prestige)
myReg # summary(myReg)
names(myReg)
```
  

---

##  Linear Regression - Summary of Fit:
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
summary(myReg)  # coefficients, standard errors, R-squared, and F-test
```


---

##  Linear Regression - Predict
+ Predict the output for a new input
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
newData <- data.frame(education = 13.2, income = 12000, women = 12)
predict(myReg, newData, interval = "prediction")  # fit with a 95% prediction interval
```
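
`predict()` can also return a confidence interval for the mean response, which is narrower than the prediction interval for a single new observation. A sketch using the built-in `mtcars` data as a stand-in for Prestige (the model and the `new_car` values are assumptions for illustration):

```r
# mtcars ships with base R; it stands in for Prestige here
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)   # hypothetical new observation

conf_int <- predict(fit, new_car, interval = "confidence")  # for the mean response
pred_int <- predict(fit, new_car, interval = "prediction")  # for one new response

conf_int
pred_int
```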


---


##  Linear Regression - Confidence Interval:
+ 95% confidence interval for coefficient of 'income'
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
confint(myReg, 'income', level=0.95)
```

+ 95% confidence interval for each coefficient
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
confint(myReg, level=0.95)
```
  
---

##  Linear Regression - Diagnostics:
+ Here we cover some common regression diagnostics including:
  + Testing for Normality
  + Testing for Constant Variance
  
+ Reference: http://www.statmethods.net/stats/rdiagnostics.html
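
A rough sketch of both checks using only base R, on the built-in `mtcars` data rather than Prestige (the model is an assumption for illustration; the auxiliary regression is an informal check, not a full test such as `ncvTest()` from the "car" package):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in data, used only for illustration
res <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals (H0: residuals are normal)
shapiro.test(res)

# Constant variance: regress squared residuals on fitted values;
# a clearly non-zero slope hints at non-constant variance
summary(lm(res^2 ~ fitted(fit)))$coefficients
```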
  

---
## Model diagnostic plot
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE, fig.height=6.5, fig.width=8}
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(myReg)
```

---

##  Logistic Regression - Data:
+ glm() is used to fit a logistic regression model

+ Here we use the "Mroz" dataset from the "car" package

```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
library(car)
data(Mroz)  # load the Mroz data
head(Mroz)
```

---

## Logistic Regression - Data Description

+ lfp: labor-force participation; a factor with levels: no; yes.
+ k5: number of children 5 years old or younger.
+ k618: number of children 6 to 18 years old.
+ age: in years.
+ wc: wife's college attendance; a factor with levels: no; yes.
+ hc: husband's college attendance; a factor with levels: no; yes.
+ lwg: log expected wage rate; for women in the labor force, the actual wage rate; for women not in the labor force, an imputed value based on the regression of lwg on the other variables.
+ inc: family income exclusive of wife's income.
  

---

## Logistic Regression - Model Fit
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
fitLogistic <- glm(lfp ~ k5 + age, 
                   family=binomial(logit), data=Mroz); 
fitLogistic # summary(fitLogistic)
```

---


## Logistic Regression - Model Fit
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
names(fitLogistic)
```


---
## Summary of Fit
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
summary(fitLogistic)
```

---
+ 95% CI for exp(coefficients) (profile likelihood method)
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
exp(confint(fitLogistic, level=0.95))
```

+ 95% CI for exp(coefficients) (Wald confidence interval)
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
exp(confint.default(fitLogistic, level=0.95))
```
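
The Wald interval is simply estimate ± z × standard error on the log-odds scale. A sketch that rebuilds it by hand, using the built-in `mtcars` data instead of Mroz (the model `am ~ wt` is an assumption for illustration):

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)  # built-in data for illustration

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
z   <- qnorm(0.975)   # about 1.96

wald_manual <- cbind(lower = est - z * se, upper = est + z * se)
exp(wald_manual)      # odds-ratio scale

# The hand-built interval should match confint.default()
all.equal(unname(wald_manual), unname(confint.default(fit)))
```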

---
+ Update model by adding 'inc' and 'lwg'
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
fitLogistic2 = update(fitLogistic, . ~ . + inc + lwg, data=Mroz);
```

+ After update
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
fitLogistic2
```

---
## Model Comparison
+ Compare the two nested models using the change in deviance (likelihood ratio test)
```{r echo=TRUE, error=FALSE, message=FALSE, warning=FALSE}
anova(fitLogistic, fitLogistic2, test='Chisq');
```
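
The p-value reported by `anova()` can be reproduced from the drop in deviance and the difference in residual degrees of freedom. A sketch on the built-in `mtcars` data (the two models are assumptions for illustration, not the Mroz fits):

```r
fit1 <- glm(am ~ wt, family = binomial, data = mtcars)
fit2 <- update(fit1, . ~ . + hp)

dev_diff <- deviance(fit1) - deviance(fit2)        # drop in deviance
df_diff  <- df.residual(fit1) - df.residual(fit2)  # number of added parameters
p_manual <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

p_manual
anova(fit1, fit2, test = "Chisq")[2, "Pr(>Chi)"]   # same p-value
```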

---

## Time for a 10-Minute Break :)

---

## Session 5 - Agenda

+ Goal: use ggplot2 to explore data afterwards
+ Emphasize simple examples
+ Emphasize principles
+ Some examples will be developed today
+ ... but there's a lot that won't be covered
+ Base plotting system

---

### Information Visualization

+ Efficiency
+ Interpretability
+ Parsimony
+ ggplot2 lies in "sweet spot" of functionality

---

### Hello, ggplot2

+ ggplot2 is a very popular graphics system written by Hadley Wickham
+ implementation of Leland Wilkinson's Grammar of Graphics
+ I'll use the `diamonds` dataset for most of the examples.

```{r}
head(diamonds)
```

---

### First ggplot2: histogram

Let's make a histogram!

```{r, fig.width=8, fig.height=4}
ggplot(diamonds, aes(price)) + geom_histogram()
```


---

### Gotchas

Easy to run into unhelpful errors

```{r}
library(ggplot2)
ggplot(airquality) # :(
ggplot(airquality, aes(temp)) # :'''(
```

---

### Now make it fancier

Group diamonds by cut.

```{r, fig.width=8, fig.height=4}
m <- ggplot(diamonds, aes(price))
m + geom_histogram(aes(fill=cut))
```

---

### Facets are an alternative

Split diamonds into panels by cut (rows) and color (columns).

```{r, fig.width=12, fig.height=5 }
m <- ggplot(diamonds, aes(price))
m + geom_histogram(binwidth=100) + facet_grid(cut~color)
```


---
### Let's pause for a moment

+ what's up with `aes`?
+ `aes(x, y, ...)`
+ allows functions of columns e.g. `aes(x=price^2)`, `aes(x=price/carat)`
+ what are layers?

```
m <- ggplot(diamonds, aes(price))
m + geom_histogram(aes(fill=cut))
```

+ note that plots can be built up _incrementally_
+ "Geoms, short for geometric objects, describe the type of plot you will produce."
+ geom names always begin with `geom_`

---
### Scatterplots

Note: There's no "scatterplot" function. Use `geom_point`.

```{r, fig.width=8, fig.height=4}
ggplot(diamonds, aes(price, carat)) + geom_point()
```

---
### Log scales


```{r, fig.width=8, fig.height=4}
ggplot(diamonds, aes(price, carat)) + geom_point() + scale_x_log10()
```

Scales begin with `scale_`, and are not only for continuous variables: also `datetime`, `shape`, `colour`, etc

---
### Adding factors

Similar to histogram

```{r, fig.width=10, fig.height=5}
ggplot(diamonds, aes(price, carat)) + geom_point(aes(colour=color, shape=cut))
```

Note the legend for each mapping!

---
### Overview of components

+ `geom_*` : Geoms, short for geometric objects, describe the type of plot you will produce.
+ `stat_*`: Statistical transformations transform your data before plotting
+ `scale_*`: Scales control the mapping between data and aesthetics.
+ `facet_*`: Facets display subsets of the dataset in different panels.
+ `coord_*`: Coordinate systems adjust the mapping from coordinates to the 2d plane of the computer screen.

And a few others...
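
A single plot can combine one component of each kind. The following is a sketch, with the specific layers chosen only for illustration:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +          # geom: draw points
  stat_smooth(method = "lm") +       # stat: add a fitted line
  scale_y_log10() +                  # scale: log-transform the y axis
  facet_wrap(~ cut) +                # facet: one panel per cut
  coord_cartesian(xlim = c(0, 3))    # coord: zoom without dropping data
p
```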

---
### Problem: overplotting, approach 1a
Try lowering opacity

```{r, fig.width=10, fig.height=5}
ggplot(diamonds, aes(price, carat)) + geom_point(alpha=0.1)
```


---
### Problem: overplotting, approach 1b

Try mapping the inverse of a variable to opacity.

```{r, fig.width=10, fig.height=5}
ggplot(diamonds, aes(price, carat)) + geom_point(aes(alpha=1/carat))
```


---
### Problem: overplotting, approach 2

Shake the points around a little bit.

```{r, fig.width=10, fig.height=5}
ggplot(diamonds, aes(price, carat)) + geom_jitter()
```


---
### Problem: overplotting, approach 3

Bin into hexagons!

```{r, fig.width=10, fig.height=5}
library(hexbin)
ggplot(diamonds, aes(price, carat)) + geom_hex()
```


---
### Problem: overplotting, approach 4

Smooth with a 2d density

```{r, fig.width=10, fig.height=5}
ggplot(diamonds, aes(price, carat)) + stat_density2d()
```


---
### Something completely different: map!


```{r, fig.width=10, fig.height=5}
library(maps)
states <- map_data("state")
ggplot(states) + geom_polygon(aes(x=long, y=lat, group = group), colour="white")
```

---
### The world is your oyster


```{r, fig.width=10, fig.height=5}
ggplot(map_data("world")) + geom_polygon(aes(x=long, y=lat, group = group), colour="white")
```

---
### What's the point?


```{r, fig.width=10, fig.height=5}
ucs <- data.frame(lat=c(37.870007, 33.64945), long=c(-122.270501, -117.845707))
m <- ggplot(map_data("state"), aes(x=long, y=lat)) + geom_polygon(aes(group=group))
m + geom_point(data=ucs, colour="red", size=5)
```


---
### FYI

+ easy to add legend titles, axis labels, etc
+ the `ggsave` function will save the plot to an image file (or you can save via RStudio)
+ `+ theme_bw()` will create a plot more suitable for printing
+ pie charts are possible but please do not make them
+ the `qplot` function is available as a more concise option
+ there are many packages that extend ggplot2's functionality!
+ e.g. `bdscale`, `GGally`, `xkcd`, `ggmap`
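
A short sketch of the first three points; the title, axis labels, and output file are hypothetical:

```r
library(ggplot2)

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.1) +
  labs(title = "Diamond price by weight",   # hypothetical labels
       x = "Carat", y = "Price (USD)") +
  theme_bw()

# Save to a temporary PDF; replace out_file with a real path in practice
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 8, height = 4)
file.exists(out_file)
```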

---

### Notes on oddities

+ normal R docs are not the best
+ excellent online documentation
+ default theme has grey background
+ uses British English spellings (e.g. "colour")


---

## Exercises: Section 2

+ The exercises are located in section2_exercises.html and the solutions in ex_code.r

+ This set of exercises will focus on data descriptives and data analysis.

---


## Exercise 7. Descriptive plots. 

*Now that we have our dataset in its final form, we can start analyzing it. We will start by simply plotting the data to check for outliers and the distributions of the variables.* 

**7.1** If you haven't already, install and load the package ggplot2. 

**7.2** Generate a histogram plot for each continuous variable (remember to use data_noNA). 

**7.3** Generate a boxplot of mpg by origin to visually check if mpg is different across different countries of origin. Look up how to make a boxplot in the online ggplot2 documentation or type ```?geom_boxplot``` for help. Make sure that your variable on the x-axis (in this case, origin) is a factor (you can type ```class(data_noNA$origin)``` to confirm this; if not refer to Exercise 5).


---

## Exercise 7 (continued)

**7.4** Next we will create a scatterplot of mpg by cylinders and examine the form of the relationship (i.e., is it linear or not?). In other words, we want to decide if we should treat cylinders as a numerical variable (linear) or categorical variable (not linear). Do the following:

* First, create a scatterplot using ```geom_point()```.
* Next, we will add smoothers to the plot. Read the help file for ```stat_smooth()``` and the argument "method". 
* Create two more scatterplots: 
      + One with the default smooth curve overlaid (```method = "auto"```)
      + One with a linear regression fit overlaid (```method = "lm"```)
      
**7.5** Create a scatterplot matrix by applying the function ```pairs()``` to the data.


---

## Exercise 8. Data transformations. 

*Based on the scatterplot matrix from 7.5, we need to transform some of our variables before we can perform a statistical analysis. In particular, we can see that the variance increases as mpg increases, and there are non-linear relationships between mpg and some of the other variables.* 

**8.1** Add the following variables to the dataset: 

* Add log-transformed versions of mpg, horsepower, displacement, and weight. Name them as logmpg, loghorsepower, etc.
      + *Hint*: to add a new variable, assign, for example, ```log(data_noNA$mpg)``` to ```data_noNA$logmpg```. 
* Add a factor version of cylinders. Call it "cylinders_cat". 

**8.2** Look at the data using the ```head()``` function to make sure everything looks good.


---

## Exercise 9. Statistical analysis.

*Now that we have transformed our variables, we can perform statistical analyses to explore the relationship of mpg to other variables.*

**9.1** Let's test whether mean mpg is different across cars of the three origins, using a significance level of 0.05. First, fit a linear regression model for mpg against origin. Then, use both ```anova()``` and ```summary()``` to check the results. 

**9.2** Next, fit a linear regression model predicting mpg. Include all other variables, using only the log-transformed versions if available. Store the model as "model". Then, fit a model using the same predictors, but predict log(mpg). Store the model as "model_log".

**9.3** Apply ```summary()``` to both of the model objects and examine the results. Is origin still helpful in predicting mpg/log(mpg) after including other predictors?


---

## Exercise 9 (continued)


**9.4** What do the numbers in the "Estimate" column in the ```summary()``` output  represent? 

**9.5** Run ```newcase = data_noNA[1:10,]``` to take the first 10 instances, and treat them as new car data for which we want to predict mpg. Use ```predict()``` to predict mpg for them using respectively the object "model" and "model_log". Keep in mind that from "model_log", ```predict()``` returns the predictions for log(mpg) instead of mpg. 
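
The back-transformation can be sketched on the built-in `mtcars` data (the model and the objects `fit_log` and `new_cases` are assumptions for illustration, not the exercise dataset):

```r
# Built-in mtcars data, used only to illustrate the back-transformation
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)
new_cases <- mtcars[1:5, ]

pred_log <- predict(fit_log, new_cases)  # predictions on the log scale
pred_mpg <- exp(pred_log)                # back on the original mpg scale
cbind(observed = new_cases$mpg, predicted = pred_mpg)
```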


---

## Exercise 10. Bootstrapping. (Optional)

*In this exercise, we will learn the technique of bootstrapping, a general resampling method for estimating the variance of an estimate. In particular, we will find an estimate of the variance for the median mpg.*

**10.1** Subset the mpg column of the data and store it as "mpg_data". 

**10.2** Find the median mpg using the function ```median()```. We will eventually work toward finding an estimate for the variance of this parameter. 

**10.3** Sample mpg_data using the function ```sample()```. Store this as an object called "mpg_bootstrap". 

* *Hint*: There are ```length(mpg_data)``` = 392 elements in mpg_data. We want to sample mpg_data 392 times (with replacement). Read ```?sample``` if you need help.

**10.4** Find the median of mpg_bootstrap. Store this as an object called "med".  


---

## Exercise 10 (continued)


**10.5** Now, we want to repeat steps 10.3 and 10.4 one thousand times, storing the median of mpg_bootstrap each time. Create a for loop to do this. 

* *Hint*: Begin by creating a NULL vector called med_bootstrap. Within the for loop, include a line of code that concatenates the previous medians ("med_bootstrap") with current median ("med") using the function ```c()```. Store this as "med_bootstrap". 

**10.6** After running your for loop, you should be left with a vector called med_bootstrap that contains 1000 median mpg estimates. Find the variance of this using the function ```var()```. 
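
For comparison, the same idea can be written without an explicit for loop. This sketch uses `replicate()` on simulated data; the vector `toy` and the seed are hypothetical, and this is an alternative to the loop the exercise asks for, not a replacement for it:

```r
set.seed(5)                        # arbitrary seed
toy <- rexp(200, rate = 1 / 20)    # hypothetical skewed sample

# Each replicate resamples with replacement and records the median
boot_medians <- replicate(1000, {
  resample <- sample(toy, size = length(toy), replace = TRUE)
  median(resample)
})

length(boot_medians)   # number of bootstrap medians
var(boot_medians)      # bootstrap estimate of the variance of the median
```

`replicate()` preallocates the result, which avoids growing a vector with `c()` on every iteration.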


---

## Online Resources to Learn R:

> 1. Very useful resources available on **The Comprehensive R Archive Network** (CRAN)
  + please visit: http://cran.us.r-project.org

> 2. R built-in facility:
  + try ?lm, help(lm) in R console

> 3. There are many free tutorials available online:
  + Quick R: http://www.statmethods.net/
  + R-Twotorials: http://www.twotorials.com/
  + UCLA Academic Technology Services: http://www.ats.ucla.edu/stat/r/
  
> 4. R-Bloggers (http://www.r-bloggers.com/) is a central hub (i.e., a blog aggregator) of content collected from bloggers who write about R (in English). 


---

## Useful Books for Learning R:

1. Chambers (1998). Programming with Data, Springer.

2. Venables & Ripley (2000). S Programming, Springer.

3. Chambers (2008). Software for Data Analysis, Springer. (highly recommended)

4. More resources available at: http://www.r-project.org/doc/bib/R-books.html

---

## How to get help in R:

1. Simply use the built-in help function in R
  + example: ?lm, help(lm)
  
2. R mailing lists: r-help and r-devel
  + For more info: https://stat.ethz.ch/mailman/listinfo/r-help
  + How to ask good questions: http://www.r-project.org/posting-guide.html

3. Use Q&A websites, in particular:
  + Stack Overflow (http://stackoverflow.com): for programming-related questions.
  + Cross Validated (http://stats.stackexchange.com): for statistics-related questions.
  
4. Google :)