book/04-normalgamma-04-reference.Rmd at master · StatsWithR/book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
## Reference Priors  {#sec:NG-reference}

```{r packages, echo=FALSE, warning=FALSE, message=FALSE,  eval=TRUE}
library(statsr)
library(ggplot2)
```

In Section \@ref(sec:NG-predictive), we demonstrated how to specify an informative prior distribution for inference about TTHM in tapwater using additional prior information. The resulting informative  Normal-Gamma prior distribution had an effective prior sample size that was comparable to the observed sample size to be compatible with the reported prior interval.

There are, however, situations where you may wish to provide an analysis that does not depend on prior information.  There may be cases where prior information is simply not available.  Or, you may wish to present an **objective** analysis where minimal prior information is used to provide a baseline or reference analysis to contrast with other analyses based on informative prior distributions. Or perhaps, you want to use the Bayesian paradigm to make probability statements about parameters, but not use any prior information.  In this section, we will examine the qustion of  **Can you actually perform a Bayesian analysis without using prior information?**
We will present reference priors for normal data, which can be viewed as a limiting form of the Normal-Gamma conjugate prior distribution.

Conjugate priors can be interpreted to be based on a historical or imaginary prior sample. What happens in the conjugate Normal-Gamma prior if we take our prior sample size $n_0$ to go to zero?  If we have no data, then we will define the prior sample variance $s_0^2$ to go to  0, and based on the relationship between prior sample sized and prior degrees of freedom, we will let the prior degrees of freedom go to the prior sample size minus one, or negative one, i.e. $v_0 = n_0 - 1 \rightarrow -1$.

With this limit, we have the following properties:

* The posterior mean goes to the sample mean.

* The posterior sample size is the observed sample size.

* The posterior degrees of freedom go to the sample degrees of freedom.

* The posterior variance parameter goes to the sample variance.

In this limit, the posterior hyperparameters do not depend on the prior hyperparameters.

Since $n_0 \rightarrow 0, s^2_0 \rightarrow 0, v_0 = n_0 - 1 \rightarrow -1$, we have in mathematical terms:

$$\begin{aligned}
m_n &= \frac{n \bar{Y} + n_0 m_0} {n + n_0}  \rightarrow \bar{Y} \\
n_n &= n_0 + n  \rightarrow n \\
v_n &= v_0 + n  \rightarrow n-1 \\
s^2_n &= \frac{1}{v_n}\left[s^2_0 v_0 + s^2 (n-1) + \frac{n_0 n}{n_n} (\bar{Y} - m_0)^2 \right] \rightarrow s^2
\end{aligned}$$

This limiting normal-gamma distribution, $\textsf{NormalGamma}(0,0,0,-1)$, is not really a normal-gamma distribution, as the density does not integrate to 1. The form of the limit can be viewed as a prior for $\mu$ that is proportional to a constant, or uniform/flat on the whole real line. And a prior for the variance is proportional to 1 over the variance. The joint prior is taken as the product of the two.


$$\begin{aligned}
p(\mu \mid \sigma^2) & \propto  1 \\
p(\sigma^2) & \propto  1/\sigma^2 \\
p(\mu, \sigma^2) & \propto  1/\sigma^2
\end{aligned}$$

This is refered to as a **reference prior** because the posterior hyperparameters do not depend on the prior hyperparameters.

In addition, $\textsf{NormalGamma}(0,0,0,-1)$ is a special case of a reference prior, known as the independent Jeffreys prior.  While Jeffreys used other arguments to arrive at the form of the prior, the goal was to have an **objective prior** invariant to shifting the data by a constant or multiplying by a constant.

Now, a naive approach to constructing a non-informative distribution might be to use a uniform distribution to represent lack of knowledge. However, would you use a uniform distribution for $\sigma^2$, or a uniform distribution for the precision $1/\sigma^2$? Or perhaps a uniform distribution for $\sigma$? These would all lead to different posteriors with little justification for any of them. This ambiguity led Sir Harold Jeffreys to propose reference distributions for the mean and variance for situations where prior information was limited. These priors are **invariant** to the units of the data.

The unnormalized priors that do not integrate to a constant are called **improper distributions**. An important consideration in using them is that one cannot generate samples from the prior or the prior predictive distribution to data and are referred to as **non-generative distributions**.

While the reference prior is not a proper prior distribution, and cannot reflect anyone's actual prior beliefs, the formal application phase rule can still be used to show that **the posterior distribution is a valid normal gamma distribution**, leading to a formal phase posterior distribution. That depends only on summary statistics of the data.

The posterior distribution $\textsf{NormalGamma}(\bar{Y}, n, s^2, n-1)$ breaks down to

$$\begin{aligned}
\mu \mid \sigma^2, \data & \sim \No(\bar{Y}, \sigma^2/n) \\
1/\sigma^2  \mid \data & \sim \Ga((n-1)/2, s^2(n - 1)/2).
\end{aligned}$$

* Under the reference prior $p(\mu, \sigma^2) \propto 1/\sigma^2$, the posterior distribution  after standardizing $\mu$ has a Student $t$  distribution with $n-1$ degrees of freedom.

$$\frac{\mu - \bar{Y}}{\sqrt{s^2/n}} \mid \data \sim  \St(n-1, 0, 1)$$

* Prior to seeing the data, the distribution of the standardized sample mean given $\mu$ and $\sigma$ also has a Student t distribution.

$$\frac{\mu - \bar{Y}}{\sqrt{s^2/n}} \mid \mu, \sigma^2 \sim  \St(n-1, 0, 1) $$

* Both frequentist sampling distributions and Bayesian reference posterior distributions lead to intervals of this form:

$$(\bar{Y} - t_{1 - \alpha/2}\times s/\sqrt{n}, \, \bar{Y} + t_{1 - \alpha/2} \times s/\sqrt{n})$$

* However, only the Bayesian approach justifies the probability statements about $\mu$ being in the interval after seeing the data.

$$P(\bar{Y} - t_{1 - \alpha/2}\times s/\sqrt{n} < \mu <  \bar{Y} + t_{1 - \alpha/2}\times s/\sqrt{n}) = 1 - \alpha$$

We can use either analytic expressions based on the t-distribution, or Monte Carlo samples from the posterior predictive distribution, to make predictions about a new sample.

```{r reference-example, echo=FALSE, include=FALSE}
library(statsr)
data(tapwater)
# data
m_0 = 35; n_0 = 25; s2_0 = ((60-10)/4)^2; v_0 = n_0 - 1
data(tapwater); Y = tapwater$tthm
ybar = round(mean(Y), 1); s2 = round(var(Y), 1); n = length(Y)
n_n = n_0 + n
m_n = round((n*ybar + n_0*m_0)/n_n, 1)
v_n = v_0 + n
s2_n = round(((n-1)*s2 + v_0*s2_0 + n_0*n*(m_0 - ybar)^2/n_n)/v_n, 1)
set.seed(8675309)

# post-pred
phi = rgamma(10000, v_n/2, s2_n*v_n/2)
sigma = 1/sqrt(phi)
mu = rnorm(10000, mean=m_n, sd=sigma/(sqrt(n_n)))
y =  rnorm(10000, mu, sigma)
```

Here is some code to generate the Monte Carlo samples from the tap water example:
```{r ref-post-pred}
phi = rgamma(10000, (n-1)/2, s2*(n-1)/2)
sigma = 1/sqrt(phi)
post_mu = rnorm(10000, mean=ybar, sd=sigma/(sqrt(n)))
pred_y =  rnorm(10000,post_mu, sigma)
quantile(pred_y, c(.025, .975))
```

Using the Monte Carlo samples, Figure \@ref(fig:plot-post-pred) shows the posterior distribution based on the informative Normal-Gamma prior and the reference prior. Both the posterior distribution for $\mu$ and the posterior predictive distribution for a new sample are shifted to the right, and are centered at the sample mean. The posterior for $\mu$ under the reference prior is less concentrated around its mean than the posterior under the informative prior, which leads to an increased posterior sample size and hence increased precision.

```{r plot-post-pred, echo=FALSE, message=FALSE, warning=FALSE, fig.align="center", fig.width=5, fig.height=4, fig.cap="Comparison of posterior densities"}
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#000000", "#009E73", "#0072B2", "#D55E00", "#CC79A7")

nsim = length(post_mu)

df = data.frame(
  parameter = c(rep("NG posterior mu", nsim), rep("NG posterior predictive", nsim),
                rep("ref posterior mu", nsim), rep("ref posterior predictive", nsim)),
  x = c(mu, y,  post_mu, pred_y))


ggplot(data=df, aes(x=pred_y)) +
  geom_density(aes(x=x, colour=parameter, linetype=parameter),
              show_legend=FALSE, size=1.0)+
  stat_density(aes(x=x, colour=parameter, linetype=parameter),
               geom="line",position="identity") + xlab("TTHM (ppb)") +
  scale_colour_manual(values=cbPalette) +
  theme(panel.background = element_rect(fill = "transparent", colour = NA),
        legend.key = element_rect(colour = "transparent", fill = NA),
        plot.background = element_rect(fill = "transparent", colour = NA),
        legend.position=c(.76, .85),
        text = element_text(size=15))
```

The posterior probability that a new sample will exceed the legal limit of 80 ppb under the reference prior is roughly 0.15, which is more than double the probability of 0.06 from the posterior under the informative prior.

```{r post-exceed-80}
sum(pred_y > 80)/length(pred_y)  # P(Y > 80 | data)
```

In constructing the informative prior from the reported interval, there are two critical assumptions. First, the prior data are exchangeable with the observed data. Second, the conjugate normal gamma distribution is suitable for representing the prior information. These assumptions may or may not be verifiable, but they should be considered carefully when using informative conjugate priors.

In the case of the tap water example, there are several concerns: One, it is unclear that the prior data are exchangeable with the observed data.  For example, water treatment conditions may have changed. Two, the prior sample size was not based on a real prior sample, but instead selected so that the prior predictive intervals under the normal gamma model agreed with the prior data.  As we do not have access to the prior data, we cannot check assumptions about normality that would help justify the prior. Other skewed distributions may be consistent with the prior interval, but lead to different conclusions.

To recap,  we have introduced a  reference prior for inference for
normal data with an unknown mean and variance. Reference priors are often part of a prior sensitivity study and are used when objectivity is of utmost importance.

If conclusions are fundamentally different with an informative prior and a reference prior, one may wish to carefully examine assumputions that led to the informative prior.

* Is the prior information based on a prior sample that is exchangable with the observed data?

* Is the normal-gamma assumption appropriate?

Informative priors can provide more accurate inference when data are limited, and the transparency of explicitly laying out prior assumptions is an important aspect of reproducible research. However, one needs to be careful that certain prior assumptions may lead to un-intended consequences.

Next, we will investigate a prior distribution that is a mixture of
conjugate priors, so the new prior distribution provides robustness to prior mis-specification in the prior sample size.

While we will no longer have nice analytical expressions for the posterior, we can simulate from the posterior distribution using a Monte Carlo algorithm
called Markov chain Monte Carlo (MCMC).