Homework6.Rmd

---
title: "Homework6"
author: "Sam Rosenblatt"
date: "10/3/2018"
output: html_document
---


## Long and wide data formats
Try converting the `iris` data set into the long format, with a column called “trait” to indicate sepal and petal length and width.

Once you have converted to the long format, calculate the average for each combination of species and trait.

```{r}

library(tidyverse)

longIris <- iris %>%
  gather(Sepal.Length:Petal.Width, key="Trait",value="TraitVal")


longIris %>%
  group_by(Species, Trait) %>%
  summarize(avgTrait = mean(TraitVal))

```

## Simulating and Fitting Data Distributions


### Open Libraries
```{r}
library(ggplot2) # for graphics
library(MASS) # for maximum likelihood estimation
```


### Read in data vector
```{r}
# # quick and dirty, a truncated normal distribution to work on the solution set
# 
# z <- rnorm(n=3000,mean=0.2)
# z <- data.frame(1:3000,z)
# names(z) <- list("ID","myVar")
# z <- z[z$myVar>0,]
# str(z)
# summary(z$myVar)


z <- read.table("Betweenness40pctUnknown5pctImmunized2.csv",header=TRUE ,sep=",", stringsAsFactors=FALSE)


names(z) <- list( "myVar")
z <- sample(z$myVar, 500, replace = FALSE, prob = NULL) #Sample from my distribution at Lauren's suggestion because the entire distribution makes visualization difficult
z <- data.frame(1:500,z)
#z <- data.frame(z)
names(z) <- list("ID","myVar") #I cleaned my data seperately and named the header myVar to start so I didnt have to reset that
#names(z) <- list("myVar") 
head(z)


str(z)
summary(z)

```


###Plot histogram of data
Dr. Gotelli: Plot a histogram of the data, using a modification of the code from lecture. Here we are switching from qplot to ggplot for more graphics options. We are also rescaling the y axis of the histogram from counts to density, so that the area under the histogram equals 1.0.
Sam: I adjusted binwidth for better visibility
```{r}
p1 <- ggplot(data=z, aes(x=myVar, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2, binwidth=0.001) 
print(p1)
```


### Add empirical density curve
Dr. Gotelli: Now modify the code to add in a kernel density plot of the data. This is an empirical curve that is fitted to the data. It does not assume any particular probability distribution, but it smooths out the shape of the histogram:


```{r}
p1 <-  p1 +  geom_density(linetype="dotted",size=0.75)
print(p1)
```


### Get maximum likelihood parameters for normal
Dr Gotelli: Next, fit a normal distribution to your data and grab the maximum likelihood estimators of the two parameters of the normal, the mean and the variance:
```{r}
normPars <- fitdistr(z$myVar,"normal")
print(normPars)
str(normPars)
normPars$estimate["mean"] # note structure of getting a named attribute
```


### Plot normal probability density
Dr. Gotelli: Now let’s call the dnorm function inside ggplot’s stat_function to generate the probability density for the normal distribution. Read about  stat_function in the help system to see how you can use this to add a smooth function to any ggplot. Note that we first get the maximum likelihood parameters for a normal distribution fitted to thse data by calling fitdistr. Then we pass those parameters (meanML and sdML to  stat_function:
```{r}
meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]

xval <- seq(0,max(z$myVar),len=length(z$myVar))

 stat <- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(z$myVar), args = list(mean = meanML, sd = sdML))
 p1 + stat
```
Dr Gotelli(concerning fake data): Notice that the best-fitting normal distribution (red curve) for these data actually has a biased mean. That is because the data set has no negative values, so the normal distribution (which is symmetric) is not working well.
Sam: This is true with my dataset as well


### Plot exponential probability density
Dr. Gotelli: Now let’s use the same template and add in the curve for the exponential:
```{r}
expoPars <- fitdistr(z$myVar,"exponential")
rateML <- expoPars$estimate["rate"]

stat2 <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(z$myVar), args = list(rate=rateML))
 p1 + stat + stat2
```

### Plot uniform probability density
Dr. Gotelli: For the uniform, we don’t need to use fitdistr because the maximum likelihood estimators of the two parameters are just the minimum and the maximum of the data:
```{r}
stat3 <- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(z$myVar), args = list(min=min(z$myVar), max=max(z$myVar)))
 p1 + stat + stat2 + stat3
```


###Plot gamma probability density
```{r}
gammaPars <- fitdistr(z$myVar,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]

stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(z$myVar), args = list(shape=shapeML, rate=rateML))
 p1 + stat + stat2 + stat3 + stat4
```


### Plot beta probability density
Dr. Gotelli(concerning fake data): This one has to be shown in its own plot because the raw data must be rescaled so they are between 0 and 1, and then they can be compared to the beta.
Sam: Since my data actually is all between 0 and 1 (each datapoint represents a proportion), we can actually display this without rescaling, although since a value of 1 is next to impossible, we rescale anyway to get results that look like the examples, but the commented out code represents the equivalent unscaled version using the theoretical max value of 1
```{r}
pSpecial <- ggplot(data=z, aes(x=myVar/(max(myVar + 0.1)), y=..density..)) +
#pSpecial <- ggplot(data=z, aes(x=myVar/1, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2, binwidth=0.001) + 
  xlim(c(0,1)) +
  geom_density(size=0.75,linetype="dotted")

betaPars <- fitdistr(x=z$myVar/max(z$myVar + 0.1),start=list(shape1=1,shape2=2),"beta")
#betaPars <- fitdistr(x=z$myVar/1,start=list(shape1=1,shape2=2),"beta") 
shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]

statSpecial <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(z$myVar), args = list(shape1=shape1ML,shape2=shape2ML))
pSpecial + statSpecial
```
Note: Although the above produces warnings, Lauren said that is pretty normal with beta distribution calculations


### Part 4
Dr. Gotelli: Take a look at the second-to-last graph which shows the histogram of your data and 4 probability density curves (normal, uniform, exponential, gamma) that are fit to the data. The beta distribution in the final graph is somewhat special. It often fits the data pretty well, but that is because we have assumed the largest data point is the true upper bound, and everything is scaled to that. The fit of the uniform distribution also fixes the upper bound. The other curves (normal, exponential, and gamma) are more realistic because they do not have an upper bound. For most data sets, the gamma will probably fit best, but if you data set is small, it may be very hard to see much of a difference between the curves.

Sam: The gamma distribution appears to fit best, although a zero-inflated negative binomial would likely fit better

### Part 5
Dr. Gotelli: Using the best-fitting distribution, go back to the code and get the maximum likelihood parameters. Use those to simulate a new data set, with the same length as your original vector, and plot that in a histogram and add the probability density curve. Right below that, generate a fresh histogram plot of the original data, and also include the probability density curve.

Sam: We will simulate a new dataset based on the gamma distibution


```{r}
zSim = data.frame(rgamma(shape = shapeML, rate = rateML, n=500))
names(zSim) <- list("myVarSim")

pSim <- ggplot(data=zSim, aes(x=myVarSim, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2, binwidth=0.001) 
#print(pSim)

pSim <-  pSim +  geom_density(linetype="dotted",size=0.75)
print(pSim)
print(p1)


```

How do the two histogram profiles compare? Do you think the model is doing a good job of simulating realistic data that match your original measurements? Why or why not?

### Answer

I do not think the gamma distribution model is doing a good job of simulating my data. Compared to the model data, my data has a much higher density  for the lower values, and the density curve is much steeper near the beginning, and much flatter later on. Gamma was the best distribution to describe my data out of the ones suggested by the assignment, but I believe it probably is not in fact the distribution that my data came from