---
title: "causal inference"
author: "Zihao Wang"
date: "2017-5-20"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Cases where there is no effect ($\bar Y_k = \bar Y_0$ for all $k = 1,\dots,9$)
In this case, I choose a priori all means equal to 0 and all variances equal to 1, that is, $$\bar Y_{k}=0,\quad \sigma_{k}=1.$$ The total number of units is $n = 10000$ and the number of arms is $K = 10$. For convenience I divide the units evenly into 10 groups, so that $n_{k}=1000$ for all $k=0,\dots,9$.
By the central limit theorem, $$\bar Y_{k}^{obs}\sim N\!\left(\bar Y_{k},\ \sigma_{k}\sqrt{\frac{K}{n}}\right),\qquad s_{k}^2\sim \frac{\sigma_{k}^2\,\chi^2_{n/K-1}}{n/K-1},$$ where the second argument of $N(\cdot,\cdot)$ is the standard deviation. With the numbers set above, I draw random numbers from these distributions, compute the corresponding p-values, and obtain the distribution of the combined p-value $p_{combine}$.
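As a quick sanity check (a sketch that is not part of the original analysis; the seed is an arbitrary choice), simulating one arm at the unit level should reproduce the scale these approximations predict:
```{r sanity-check-clt}
# Unit-level simulation of a single arm: n_k = 1000 draws with mean 0, sd 1
set.seed(1)  # arbitrary seed, chosen only so this check is reproducible
y <- rnorm(1000, mean = 0, sd = 1)
mean(y)  # should be on the scale of sigma * sqrt(K / n) = 0.0316
var(y)   # should be close to sigma^2 = 1
```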
```{r 1}
N <- 10000                # total number of units
K <- 10                   # number of treatment arms
n     <- rep(N / K, K)    # equal arm sizes, n_k = 1000
Ybar  <- rep(0, K)        # all arm means zero (no effect)
sigma <- rep(1, K)        # common standard deviation
```
I run the simulation 10000 times. The null hypothesis is that there is no treatment effect, that is, $$H_{0}:\ \bar Y_{k}-\bar Y_{0}=0 \quad\text{for } k=1,\dots,9.$$ I therefore construct the statistic $\theta_{k}$ as $$\theta_{k}=\frac{\bar Y_{k}^{obs}-\bar Y_{0}^{obs}}{\sqrt{\frac{s_{k}^2}{n_{k}}+\frac{s_{0}^2}{n_{0}}}}.$$
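For two arms, $\theta_k$ is the familiar unequal-variance (Welch) t statistic; as an illustration (a sketch with arbitrary simulated data, not part of the analysis), it matches the statistic reported by `t.test`:
```{r theta-vs-welch}
# theta_k computed by hand agrees with t.test's Welch statistic
set.seed(2)  # arbitrary seed for this illustration
y0 <- rnorm(1000, 0, 1)
y1 <- rnorm(1000, 0, 1)
(mean(y1) - mean(y0)) / sqrt(var(y1) / 1000 + var(y0) / 1000)
t.test(y1, y0, var.equal = FALSE)$statistic  # same value, up to printing
```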
```{r}
M <- 10000  # number of simulation replicates
Ybar.obs     <- matrix(nrow = K, ncol = M)
variance.obs <- matrix(nrow = K, ncol = M)
for (i in 1:K) {
  # Draw the observed arm means and sample variances from their
  # CLT-based sampling distributions
  Ybar.obs[i, ]     <- rnorm(M, Ybar[i], sigma[i] * sqrt(K / N))
  variance.obs[i, ] <- rchisq(M, n[i] - 1) * sigma[i]^2 / (n[i] - 1)
}
```
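A quick check of the simulated draws against the target moments (not in the original analysis, just a sanity check):
```{r check-moments}
sd(Ybar.obs[1, ])        # should be close to sigma[1] * sqrt(K / N) = 0.0316
mean(variance.obs[1, ])  # should be close to sigma[1]^2 = 1
```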
```{r}
# Neyman variance estimate and t-type statistic for each treatment arm
# versus the control arm (arm 0, stored in row 1)
Vneyman <- matrix(nrow = K - 1, ncol = M)
theta   <- matrix(nrow = K - 1, ncol = M)
for (i in 1:(K - 1)) {
  Vneyman[i, ] <- variance.obs[i + 1, ] / n[i + 1] + variance.obs[1, ] / n[1]
  theta[i, ]   <- (Ybar.obs[i + 1, ] - Ybar.obs[1, ]) / sqrt(Vneyman[i, ])
}
```
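Under the null, each row of `theta` should be approximately standard normal; a Kolmogorov–Smirnov test is one quick way to check this (a sketch, not part of the original analysis):
```{r check-theta-null}
# A large p-value here means no detectable departure from N(0, 1)
ks.test(theta[1, ], "pnorm")
```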
```{r}
# Two-sided p-value for each arm: P(|Z| > |theta|) under N(0, 1),
# evaluated here by numerical integration of the standard normal density
pvalue <- matrix(nrow = K - 1, ncol = M)
for (i in 1:(K - 1)) {
  for (j in 1:M) {
    pvalue[i, j] <- 1 - integrate(dnorm, -abs(theta[i, j]), abs(theta[i, j]))$value
  }
}
# Combine the nine per-arm p-values by taking their minimum in each replicate
pcombine <- apply(pvalue, 2, min)
```
```{r}
# Equivalent closed form: 2 * pnorm(-|theta|) avoids the numerical integration
pvalue.1 <- 2 * pnorm(-abs(theta))
pcombine.1 <- apply(pvalue.1, 2, min)
```
```{r}
hist(pvalue.1[5, ])
# cutoff1 is the 500th smallest of the 10000 combined p-values, i.e. the
# empirical 5% quantile of the null distribution
pcombine.order <- sort(pcombine)
cutoff1 <- pcombine.order[500]
```
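Equivalently (a one-liner, up to the choice of quantile interpolation), the cutoff is the empirical 5% quantile:
```{r cutoff-quantile}
quantile(pcombine, 0.05)  # approximately equal to cutoff1
```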
```{r}
hist(pcombine,main = "The distribution of combined P-value")
```
From the histogram above, the frequency density is highest near zero and tails off as the combined p-value increases. Note that the p-value is the probability, under the null hypothesis, of observing a statistic at least as extreme as the one obtained; moreover, the combined p-value is the minimum of nine p-values, so its null distribution concentrates near zero even when the null is true. The simulation therefore does not offer sufficient evidence to reject the null hypothesis of no causal effect across the treatment arms.
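If the nine per-arm p-values were independent Uniform(0,1), the minimum would have CDF $1-(1-x)^9$; because they share the control arm they are dependent, so this is only a rough benchmark (a sketch for comparison, not part of the original analysis):
```{r min-uniform-benchmark}
# Compare the empirical CDF of the combined p-value with the CDF of the
# minimum of 9 independent uniforms, 1 - (1 - x)^9
plot(ecdf(pcombine.1), main = "Combined p-value vs. min of 9 ind. uniforms")
x <- seq(0, 1, 0.01)
lines(x, 1 - (1 - x)^9, col = "blue")
```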
# Cases where the variance differs across treatment arms
Here I choose the variances a priori to range from 1 to 10 in order to simulate the unequal-variance case. The simulation is again run 10000 times.
```{r}
N <- 10000
K <- 10
nk    <- rep(N / K, K)  # equal arm sizes
Ybar  <- rep(0, K)      # no treatment effect
sigma <- 1:10           # standard deviations increase across arms
```
```{r}
simulation <- function(N, K, n, Ybar, sigma, M) {
  # Draw M replicates of the observed arm means and sample variances
  Ybar.sim     <- matrix(nrow = K, ncol = M)
  variance.sim <- matrix(nrow = K, ncol = M)
  for (i in 1:K) {
    Ybar.sim[i, ]     <- rnorm(M, Ybar[i], sigma[i] * sqrt(K / N))
    variance.sim[i, ] <- rchisq(M, n[i] - 1) * sigma[i]^2 / (n[i] - 1)
  }
  # Neyman variance and t-type statistic for each arm versus the control arm
  Vneyman.sim <- matrix(nrow = K - 1, ncol = M)
  theta.sim   <- matrix(nrow = K - 1, ncol = M)
  for (i in 1:(K - 1)) {
    Vneyman.sim[i, ] <- variance.sim[i + 1, ] / n[i + 1] + variance.sim[1, ] / n[1]
    theta.sim[i, ]   <- (Ybar.sim[i + 1, ] - Ybar.sim[1, ]) / sqrt(Vneyman.sim[i, ])
  }
  # Two-sided p-values and their per-replicate minimum (the combined p-value)
  pvalue.sim   <- 2 * pnorm(-abs(theta.sim))
  pcombine.sim <- apply(pvalue.sim, 2, min)
  hist(pvalue.sim[5, ])
  pcombine.sim
}
```
```{r}
result <- simulation(10000, 10, rep(1000, 10), rep(0, 10), 1:10, 10000)
cutoff2 <- sort(result)[500]  # empirical 5% quantile under this null
hist(result)
```
From the figure above, the distribution is similar to the one in Case I: the frequency density concentrates near zero and tails off as the combined p-value increases. The only difference is that this histogram has a heavier tail. This can be explained by the larger variances in this case: since the $\sigma_k$ are bigger than before, the observed $\bar Y_{k}^{obs}$ are less precise, and so is the combined p-value. However, since the p-values still concentrate near zero, there is no clear evidence against the null hypothesis of no treatment effect in this case.
# Various combinations of the size of $\tau_{k}$ and the size of $\sigma_{k}$
Here I try two different cases: one has increasing $\bar Y_{k}$ with constant variance, and the other has increasing $\bar Y_{k}$ with increasing variance.
```{r}
# Case (a): increasing means, constant variance
result1 <- simulation(10000, 10, rep(1000, 10), seq(0, 0.09, 0.01), rep(10, 10), 10000)
hist(result1)
abline(v = cutoff1, col = "red")  # null-calibrated cutoff from Case I
```
```{r}
# Case (b): increasing means, increasing variance
result2 <- simulation(10000, 10, rep(1000, 10), seq(0, 0.09, 0.01), 1:10, 10000)
hist(result2, breaks = seq(0, 1, 0.001))  # breaks must cover the full data range
abline(v = cutoff2, col = "red")  # null-calibrated cutoff from Case II
```
Q1. How should one make a decision using the cutoff?
Q2. The p-value for the second-to-last arm does not follow a uniform distribution when its variance is small, say 1 or 2, yet it looks uniform when the variance is large. Why?
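For Q1, one possible decision rule (a sketch, not a settled answer): reject the null whenever the combined p-value falls below the null-calibrated cutoff, which gives a test with approximately 5% size by construction; the rejection frequency under an alternative then estimates power.
```{r power-sketch}
# Reject when the combined p-value is below the simulated null quantile;
# the rejection rate under each alternative estimates the test's power
mean(result1 < cutoff1)  # power: increasing means, constant variance
mean(result2 < cutoff2)  # power: increasing means, increasing variance
```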