
Commit bce58fc

committed: PRML update
1 parent 1ad32d6 commit bce58fc


55 files changed: +12222 -9 lines changed

PRML/ch3.pdf

330 KB
Binary file not shown.

PRML/ch3/main.tex

+246
@@ -0,0 +1,246 @@
\documentclass[5p,sort&compress]{elsarticle}

\usepackage{amssymb}   % Mathematical symbols
\usepackage{amsmath}   % More options for mathematics
\usepackage{subfigure} % More options for figures
\usepackage{graphicx}  % Including figures
\usepackage{epstopdf}  % Convert eps to pdf
\usepackage[separate-uncertainty=true]{siunitx} % Proper formatting of units in math mode
\usepackage{color}     % Supports text color if needed
\usepackage{soul}      % https://ctan.org/pkg/soul
\usepackage{lmodern}   % Loading fonts
\usepackage{hyperref}  % To insert clickable references/urls
\usepackage{listings}  % To input code in the text
\usepackage{booktabs}  % Nicer table rules

\setlength{\parskip}{2em}
\newcommand{\stirlingii}{\genfrac{\{}{\}}{0pt}{}}

% Choose the style of the reference list (do not change)
\bibliographystyle{elsarticle-num}

\journal{ifding/learning-notes}

% Begin the document

\begin{document}

\begin{frontmatter}
\title{Chapter 3: Linear Models for Regression and Classification}
\author{ifding}

\begin{abstract}
Linear Basis Function Models, The Bias-Variance Decomposition, Discriminant Functions, Probabilistic Generative Models
\end{abstract}

\end{frontmatter}

%% How to make a heading and divide the documents into different sections

The goal of regression is to predict the value of one or more continuous \textit{target} variables $t$ given the value of a $D$-dimensional vector $\mathbf{x}$ of \textit{input} variables. From a probabilistic perspective, we aim to model the predictive distribution $p(t|\mathbf{x})$, because this expresses our uncertainty about the value of $t$ for each value of $\mathbf{x}$.

The goal of classification is to take an input vector $\mathbf{x}$ and to assign it to one of $K$ discrete classes $\mathcal{C}_k$, where $k=1, \ldots, K$. The input space is divided into \textit{decision regions} whose boundaries are called \textit{decision boundaries} or \textit{decision surfaces}. For the target variable $\mathbf{t}$, it is convenient to use a 1-of-$K$ coding scheme.

\section{Linear Basis Function Models}

Consider linear combinations of fixed nonlinear functions of the input variables,
\begin{equation}
y(\mathbf{x}, \mathbf{w})=w_{0}+\sum_{j=1}^{M-1} w_{j} \phi_{j}(\mathbf{x})
\end{equation}
where the $\phi_{j}(\mathbf{x})$ are known as \textit{basis functions}. The total number of parameters in this model is $M$. It is often convenient to define an additional dummy `basis function' $\phi_{0}(\mathbf{x}) = 1$, so that
\begin{equation}
y(\mathbf{x}, \mathbf{w})=\sum_{j=0}^{M-1} w_{j} \phi_{j}(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})
\end{equation}
where $\mathbf{w} = (w_0, \ldots, w_{M-1})^{\mathrm{T}}$ and $\boldsymbol{\phi} = (\phi_0, \ldots, \phi_{M-1})^{\mathrm{T}}$. If the original variables comprise the vector $\mathbf{x}$, the nonlinear basis functions $\{\phi_{j}(\mathbf{x})\}$ express the extracted features.

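As a quick illustration (my own sketch, not from Bishop), the following NumPy snippet builds a design matrix of Gaussian basis functions together with the dummy bias $\phi_0(\mathbf{x})=1$ and evaluates $y(\mathbf{x},\mathbf{w})=\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})$; the centres \texttt{mu} and width \texttt{s} are arbitrary choices for the example.

\begin{lstlisting}[language=Python]
import numpy as np

def design_matrix(x, mu, s=0.1):
    """Rows are phi(x_n) = [1, exp(-(x_n - mu_1)^2 / (2 s^2)), ...]."""
    gauss = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * s ** 2))
    return np.column_stack([np.ones_like(x), gauss])   # dummy basis phi_0 = 1

x = np.linspace(0.0, 1.0, 25)    # 1-D inputs
mu = np.linspace(0.0, 1.0, 9)    # centres of 9 Gaussian basis functions
Phi = design_matrix(x, mu)       # shape (25, 10), i.e. M = 10 parameters
w = np.zeros(Phi.shape[1])       # a weight vector w = (w_0, ..., w_{M-1})
y = Phi @ w                      # y(x, w) = w^T phi(x) for every input
\end{lstlisting}
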
\subsection{Maximum likelihood and least squares}

We assume that the target variable $t$ is given by a deterministic function $y(\mathbf{x},\mathbf{w})$ with additive Gaussian noise, so that
\begin{equation}
t=y(\mathbf{x}, \mathbf{w})+\epsilon
\end{equation}
where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$. Thus we can write
\begin{equation}
p(t | \mathbf{x}, \mathbf{w}, \beta)=\mathcal{N}\left(t | y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)
\end{equation}

If we assume a squared loss function, then the optimal prediction, for a new value of $\mathbf{x}$, is given by the conditional mean of the target variable. In the case of a Gaussian conditional distribution, the conditional mean is simply
\begin{equation}
\mathbb{E}[t | \mathbf{x}]=\int t\, p(t | \mathbf{x}) \,\mathrm{d} t=y(\mathbf{x}, \mathbf{w})
\end{equation}
Note that the Gaussian noise assumption implies that the conditional distribution of $t$ given $\mathbf{x}$ is unimodal, which may be inappropriate for some applications.

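Under this Gaussian noise model, maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error, and $1/\beta$ is estimated by the mean squared residual. A minimal sketch of that fit, using a hypothetical polynomial design matrix and synthetic targets:

\begin{lstlisting}[language=Python]
import numpy as np

def fit_ml(Phi, t):
    """Maximum-likelihood w and beta for t = y(x, w) + Gaussian noise."""
    # Maximizing the Gaussian log likelihood in w reduces to least squares.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals ** 2)   # precision = inverse noise variance
    return w_ml, beta_ml

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.column_stack([x ** j for j in range(4)])   # cubic polynomial basis
w_ml, beta_ml = fit_ml(Phi, t)                      # estimated weights and precision
\end{lstlisting}
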
\section{The Bias-Variance Decomposition}

Although regularization terms can control over-fitting, how do we determine a suitable regularization coefficient $\lambda$? This question is closely related to the \textit{bias-variance} trade-off.

Given the conditional distribution $p(t | \mathbf{x})$, a popular choice is the squared loss function, for which the optimal prediction is given by the conditional expectation, which we denote $h(\mathbf{x})$,
\begin{equation}
h(\mathbf{x})=\mathbb{E}[t | \mathbf{x}]=\int t\, p(t | \mathbf{x}) \,\mathrm{d} t
\end{equation}

The expected squared loss can be written in the form
\begin{equation}
\begin{aligned}
\mathbb{E}[L]={} & \int\{y(\mathbf{x})-h(\mathbf{x})\}^{2} p(\mathbf{x}) \,\mathrm{d} \mathbf{x} \\
& + \int\{h(\mathbf{x})-t\}^{2} p(\mathbf{x}, t) \,\mathrm{d} \mathbf{x} \,\mathrm{d} t
\end{aligned}
\end{equation}
The second term arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss. The first term depends on our choice of the function $y(\mathbf{x})$, and we seek a solution for $y(\mathbf{x})$ that makes this term a minimum. If $y$ is learned from a particular data set $\mathcal{D}$ and we average over the ensemble of data sets, the first integrand decomposes as
\begin{equation}
\begin{aligned}
\mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x} ; \mathcal{D})-h(\mathbf{x})\}^{2}\right] ={} & \underbrace{\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-h(\mathbf{x})\right\}^{2}}_{(\text{bias})^{2}} \\
& + \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right]}_{\text{variance}}
\end{aligned}
\end{equation}
The squared \textit{bias} represents the extent to which the average prediction over all data sets differs from the desired regression function. The \textit{variance} measures the extent to which the solutions for individual data sets vary around their average, and hence the extent to which the function $y(\mathbf{x} ; \mathcal{D})$ is sensitive to the particular choice of data set.

\begin{figure}[ht]
\centering
\includegraphics[width = \linewidth]{figure/bias-variance-tradeoff.png}
\caption{The bias-variance tradeoff.}
\label{fig:tradeoff}
\end{figure}

The decomposition of the expected squared loss is then
\begin{equation}
\text{expected loss} = (\text{bias})^{2} + \text{variance} + \text{noise}
\end{equation}
where
\begin{equation}
\begin{aligned}
(\text{bias})^{2} &=\int\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-h(\mathbf{x})\right\}^{2} p(\mathbf{x}) \,\mathrm{d} \mathbf{x} \\
\text{variance} &=\int \mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right] p(\mathbf{x}) \,\mathrm{d} \mathbf{x} \\
\text{noise} &=\int\{h(\mathbf{x})-t\}^{2} p(\mathbf{x}, t) \,\mathrm{d} \mathbf{x} \,\mathrm{d} t
\end{aligned}
\end{equation}
There is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance, as shown in Figure~\ref{fig:tradeoff}.

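A rough numerical illustration of this decomposition (my own sketch, not from the book): repeatedly sample data sets from a known regression function $h(x)=\sin(2\pi x)$, fit a polynomial to each, and compare the average prediction with $h$.

\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: np.sin(2 * np.pi * x)      # the true regression function h(x)
x_grid = np.linspace(0.0, 1.0, 100)
degree, n_sets, n_points, noise = 3, 200, 25, 0.3

preds = np.empty((n_sets, x_grid.size))
for d in range(n_sets):                  # one fitted model y(x; D) per data set D
    x = rng.uniform(0.0, 1.0, n_points)
    t = h(x) + rng.normal(scale=noise, size=n_points)
    preds[d] = np.polyval(np.polyfit(x, t, degree), x_grid)

avg = preds.mean(axis=0)                 # E_D[y(x; D)]
bias2 = np.mean((avg - h(x_grid)) ** 2)  # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))    # variance, averaged over x
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
\end{lstlisting}

Raising \texttt{degree} should drive the bias term down while inflating the variance term, and vice versa, mirroring the trade-off in Figure~\ref{fig:tradeoff}.
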
\section{Discriminant Functions}

A discriminant is a function that takes an input vector $\mathbf{x}$ and assigns it to one of $K$ classes, denoted $\mathcal{C}_k$. This section focuses on \textit{linear discriminants} for two classes,
\begin{equation}
y(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \mathbf{x}+w_{0}
\end{equation}
where $\mathbf{w}$ is called a \textit{weight vector} and $w_0$ is a \textit{bias}. An input vector $\mathbf{x}$ is assigned to class $\mathcal{C}_1$ if $y(\mathbf{x}) \geq 0$ and to class $\mathcal{C}_2$ otherwise. The decision boundary is $y(\mathbf{x}) = 0$. For two points $\mathbf{x}_A$ and $\mathbf{x}_B$ lying on the decision surface, $y(\mathbf{x}_A) = y(\mathbf{x}_B) = 0$ implies $\mathbf{w}^{\mathrm{T}} (\mathbf{x}_A -\mathbf{x}_B) = 0$, and hence the vector $\mathbf{w}$ is orthogonal to every vector lying within the decision surface.

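A small sanity check of this decision rule and of the orthogonality property, with arbitrary example values for \texttt{w} and \texttt{w0}:

\begin{lstlisting}[language=Python]
import numpy as np

w, w0 = np.array([2.0, -1.0]), 0.5       # example weight vector and bias

def classify(x):
    """Assign x to C1 if y(x) = w^T x + w0 >= 0, otherwise to C2."""
    return 1 if w @ x + w0 >= 0 else 2

# Two points on the decision surface y(x) = 0 satisfy w^T (xA - xB) = 0.
xA = np.array([0.0, 0.5])                # y(xA) = 0
xB = np.array([1.0, 2.5])                # y(xB) = 0
assert np.isclose(w @ (xA - xB), 0.0)
print(classify(np.array([1.0, 0.0])))    # prints 1, i.e. class C1
\end{lstlisting}
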
\section{Probabilistic Generative Models}

Consider first the case of two classes. The posterior probability for class $\mathcal{C}_1$ can be written as
\begin{equation}
\begin{aligned} p\left(\mathcal{C}_{1} | \mathbf{x}\right) &=\frac{p\left(\mathbf{x} | \mathcal{C}_{1}\right) p\left(\mathcal{C}_{1}\right)}{p\left(\mathbf{x} | \mathcal{C}_{1}\right) p\left(\mathcal{C}_{1}\right)+p\left(\mathbf{x} | \mathcal{C}_{2}\right) p\left(\mathcal{C}_{2}\right)} \\ &=\frac{1}{1+\exp (-a)}=\sigma(a) \end{aligned}
\end{equation}
where we have defined
\begin{equation}
a=\ln \frac{p\left(\mathbf{x} | \mathcal{C}_{1}\right) p\left(\mathcal{C}_{1}\right)}{p\left(\mathbf{x} | \mathcal{C}_{2}\right) p\left(\mathcal{C}_{2}\right)}
\end{equation}
For the case of $K > 2$ classes, we have
\begin{equation}
\begin{aligned} p\left(\mathcal{C}_{k} | \mathbf{x}\right) &=\frac{p\left(\mathbf{x} | \mathcal{C}_{k}\right) p\left(\mathcal{C}_{k}\right)}{\sum_{j} p\left(\mathbf{x} | \mathcal{C}_{j}\right) p\left(\mathcal{C}_{j}\right)} \\ &=\frac{\exp \left(a_{k}\right)}{\sum_{j} \exp \left(a_{j}\right)} \end{aligned}
\end{equation}
which is known as the \textit{normalized exponential}, or \textit{softmax function}, since it represents a smoothed version of the `max' function: if $a_k \gg a_j$ for all $j \neq k$, then $p(\mathcal{C}_k|\mathbf{x}) \simeq 1$ and $p(\mathcal{C}_j|\mathbf{x}) \simeq 0$. Here the quantities $a_k$ are defined by
\begin{equation}
a_{k}=\ln p\left(\mathbf{x} | \mathcal{C}_{k}\right) p\left(\mathcal{C}_{k}\right)
\end{equation}

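Both mappings are straightforward to implement; a short sketch with a numerically stable softmax (shifting every $a_k$ by $\max_j a_j$ leaves the ratios, and hence the result, unchanged):

\begin{lstlisting}[language=Python]
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Normalized exponential p(C_k|x) = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

print(softmax([5.0, 1.0, 0.0]))      # a_1 dominates, so p(C_1|x) is close to 1
\end{lstlisting}
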
\subsection{Continuous inputs}

Let us assume that the class-conditional densities are Gaussian and that all classes share the same covariance matrix,
\begin{equation}
\begin{aligned}
p\left(\mathbf{x} | \mathcal{C}_{k}\right)={} & \\
& \frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\mathbf{\Sigma}|^{1 / 2}} \exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{\mathrm{T}} \mathbf{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\}
\end{aligned}
\end{equation}
Consider first the case of two classes. The posterior takes the form
\begin{equation}
p\left(\mathcal{C}_{1} | \mathbf{x}\right)=\sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}+w_{0}\right)
\end{equation}
where we have defined
\begin{equation}
\mathbf{w}=\mathbf{\Sigma}^{-1}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)
\end{equation}
\begin{equation}
w_{0}=-\frac{1}{2} \boldsymbol{\mu}_{1}^{\mathrm{T}} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{1}+\frac{1}{2} \boldsymbol{\mu}_{2}^{\mathrm{T}} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{2}+\ln \frac{p\left(\mathcal{C}_{1}\right)}{p\left(\mathcal{C}_{2}\right)}
\end{equation}
The decision boundaries are linear in input space. The prior probabilities $p(\mathcal{C}_k)$ enter only through the bias $w_0$, so that changes in the priors have the effect of making parallel shifts of the decision boundary.

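A direct transcription of these formulas into code, with example means, covariance, and priors chosen arbitrarily:

\begin{lstlisting}[language=Python]
import numpy as np

def posterior_c1(x, mu1, mu2, Sigma, prior1, prior2):
    """p(C1|x) = sigma(w^T x + w0) for shared-covariance Gaussian classes."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(posterior_c1(np.array([0.5, 0.0]), mu1, mu2, Sigma, 0.5, 0.5))
\end{lstlisting}

Scaling \texttt{prior1} up relative to \texttt{prior2} only changes $w_0$, shifting the decision boundary without rotating it.
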
\subsection{Maximum likelihood solution}

Suppose we have a data set $\{\mathbf{x}_n, t_n\}$ where $n = 1, \ldots, N$. Here $t_n = 1$ denotes class $\mathcal{C}_1$ and $t_n=0$ denotes class $\mathcal{C}_2$. We denote the prior class probability by $p(\mathcal{C}_1) = \pi$, so that $p(\mathcal{C}_2) = 1 - \pi$, and all classes share the same covariance matrix. For a data point $\mathbf{x}_n$ from class $\mathcal{C}_1$, we have $t_n = 1$ and hence
\begin{equation}
p\left(\mathbf{x}_{n}, \mathcal{C}_{1}\right)=p\left(\mathcal{C}_{1}\right) p\left(\mathbf{x}_{n} | \mathcal{C}_{1}\right)=\pi \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)
\end{equation}
Similarly for class $\mathcal{C}_2$, we have $t_n = 0$ and hence
\begin{equation}
p\left(\mathbf{x}_{n}, \mathcal{C}_{2}\right)=p\left(\mathcal{C}_{2}\right) p\left(\mathbf{x}_{n} | \mathcal{C}_{2}\right)=(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)
\end{equation}
Thus the likelihood function is given by
\begin{equation}
\begin{aligned}
p\left(\mathbf{t} | \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)={} & \\
& \prod_{n=1}^{N}\left[\pi \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)\right]^{t_{n}}\left[(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right]^{1-t_{n}}
\end{aligned}
\end{equation}
where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. It is convenient to maximize the log of the likelihood function. Consider first the maximization with respect to $\pi$. The terms in the log likelihood function that depend on $\pi$ are
\begin{equation}
\sum_{n=1}^{N}\left\{t_{n} \ln \pi+\left(1-t_{n}\right) \ln (1-\pi)\right\}
\end{equation}
Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain
\begin{equation}
\pi=\frac{1}{N} \sum_{n=1}^{N} t_{n}=\frac{N_{1}}{N}=\frac{N_{1}}{N_{1}+N_{2}}
\end{equation}
where $N_1$ denotes the total number of data points in class $\mathcal{C}_1$, and $N_2$ denotes the total number of data points in class $\mathcal{C}_2$.

Now consider the maximization with respect to $\boldsymbol{\mu}_1$. Again we can pick out of the log likelihood function those terms that depend on $\boldsymbol{\mu}_1$, giving
\begin{equation}
\begin{aligned}
\sum_{n=1}^{N} t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)={} & \\
& -\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}} \mathbf{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)+\text{const.}
\end{aligned}
\end{equation}
Setting the derivative with respect to $\boldsymbol{\mu}_1$ to zero and rearranging, we obtain
\begin{equation}
\boldsymbol{\mu}_{1}=\frac{1}{N_{1}} \sum_{n=1}^{N} t_{n} \mathbf{x}_{n}
\end{equation}
which is simply the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_1$. The corresponding result for $\boldsymbol{\mu}_2$ is given by
\begin{equation}
\boldsymbol{\mu}_{2}=\frac{1}{N_{2}} \sum_{n=1}^{N} (1-t_{n}) \mathbf{x}_{n}
\end{equation}
which again is the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_2$.

Finally, consider the maximum likelihood solution for the shared covariance matrix $\mathbf{\Sigma}$. Picking out the terms in the log likelihood function that depend on $\mathbf{\Sigma}$, we have
\begin{equation}
-\frac{N}{2} \ln |\mathbf{\Sigma}|-\frac{N}{2} \operatorname{Tr}\left\{\mathbf{\Sigma}^{-1} \mathbf{S}\right\}
\end{equation}
where we have defined
\begin{equation}
\mathbf{S}=\frac{N_{1}}{N} \mathbf{S}_{1}+\frac{N_{2}}{N} \mathbf{S}_{2}
\end{equation}
\begin{equation}
\mathbf{S}_{1}=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}}
\end{equation}
\begin{equation}
\mathbf{S}_{2}=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{\mathrm{T}}
\end{equation}
Using the standard result for the maximum likelihood solution for a Gaussian distribution, we see that $\mathbf{\Sigma} = \mathbf{S}$, which represents a weighted average of the covariance matrices associated with each of the two classes separately.

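These closed-form estimates can be computed in a few lines; a sketch with synthetic data drawn from the model itself (class sizes and parameters are arbitrary):

\begin{lstlisting}[language=Python]
import numpy as np

def fit_generative(X, t):
    """ML estimates of pi, mu1, mu2 and the shared covariance S."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    S = (N1 * S1 + N2 * S2) / (N1 + N2)   # weighted average of class covariances
    return pi, mu1, mu2, S

rng = np.random.default_rng(2)
X = np.vstack([rng.multivariate_normal([1.0, 1.0], np.eye(2), 300),
               rng.multivariate_normal([-1.0, -1.0], np.eye(2), 700)])
t = np.concatenate([np.ones(300), np.zeros(700)])
pi, mu1, mu2, S = fit_generative(X, t)    # pi close to 0.3, S close to the identity
\end{lstlisting}
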
\section*{References}

\begin{thebibliography}{9}

\bibitem{Bishop}
Bishop, Christopher M. \textit{Pattern Recognition and Machine Learning}. Springer, 2006.

\bibitem{BiasVarianceTradeoff}
The bias-variance tradeoff, \url{https://www.machinelearningtutorial.net/2017/01/26/the-bias-variance-tradeoff/}

\end{thebibliography}

\end{document}

PRML/ch5.pdf

287 KB
Binary file not shown.

PRML/ch5/figure/figure5_1.png

58.9 KB

PRML/ch5/figure/figure5_20.png

51.5 KB
