\documentclass[5p,sort&compress]{elsarticle}

\usepackage{amssymb} % Mathematical symbols
\usepackage{amsmath} % More options for mathematics
\usepackage{subfigure} % More options for figures
\usepackage{epstopdf} % Convert eps to pdf
\usepackage[separate-uncertainty=true]{siunitx} % Proper formatting of units in math mode
\usepackage{color} % Supports text color if needed
\usepackage{soul} % https://ctan.org/pkg/soul
\usepackage{lmodern} % Loading fonts
\usepackage{hyperref} % To insert clickable references/urls
\usepackage{listings} % To input code in the text
\usepackage{graphicx} % To include figures
\usepackage{booktabs}
\setlength{\parskip}{2em}
\newcommand{\stirlingii}{\genfrac{\{}{\}}{0pt}{}}

% Choose the style of the reference list (do not change)
\bibliographystyle{elsarticle-num}

\journal{ifding/learning-notes}

% Begin the document

\begin{document}

\begin{frontmatter}
    \title{Chapter 3: Linear Models for Regression and Classification}
    \author{ifding}

    \begin{abstract}
        These notes cover linear basis function models, the bias-variance decomposition, discriminant functions, and probabilistic generative models.
    \end{abstract}

\end{frontmatter}

The goal of regression is to predict the value of one or more continuous \textit{target} variables $t$ given the value of a $D$-dimensional vector $\mathbf{x}$ of \textit{input} variables. From a probabilistic perspective, we aim to model the predictive distribution $p(t|\mathbf{x})$, because this expresses our uncertainty about the value of $t$ for each value of $\mathbf{x}$. These notes follow Bishop~\cite{Bishop}.

The goal of classification is to take an input vector $\mathbf{x}$ and to assign it to one of $K$ discrete classes $\mathcal{C}_k$, where $k=1, \ldots, K$. The input space is divided into \textit{decision regions} whose boundaries are called \textit{decision boundaries} or \textit{decision surfaces}. For the target variable $\mathbf{t}$, it is convenient to use a 1-of-$K$ coding scheme.

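As a concrete illustration (a NumPy sketch, not from the text, assuming integer class labels $0, \ldots, K-1$), a 1-of-$K$ target vector has a 1 in the position of the assigned class and 0 elsewhere:

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

def one_hot(labels, K):
    """Encode integer class labels as 1-of-K target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

# labels [0, 2, 1] with K = 3 -> rows (1,0,0), (0,0,1), (0,1,0)
print(one_hot(np.array([0, 2, 1]), 3))
\end{lstlisting}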
\section{Linear Basis Function Models}

Consider linear combinations of fixed nonlinear functions of the input variables,
\begin{equation}
y(\mathbf{x}, \mathbf{w})=w_{0}+\sum_{j=1}^{M-1} w_{j} \phi_{j}(\mathbf{x})
\end{equation}
where $\phi_{j}(\mathbf{x})$ are known as \textit{basis functions}. The total number of parameters in this model will be $M$. It is often convenient to define an additional dummy `basis function' $\phi_{0}(\mathbf{x}) = 1$, so that
\begin{equation}
y(\mathbf{x}, \mathbf{w})=\sum_{j=0}^{M-1} w_{j} \phi_{j}(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})
\end{equation}
where $\mathbf{w} = (w_0, \ldots, w_{M-1})^{\mathrm{T}}$ and $\boldsymbol{\phi} = (\phi_0, \ldots, \phi_{M-1})^{\mathrm{T}}$. If the original variables comprise the vector $\mathbf{x}$, then the nonlinear basis functions $\{\phi_{j}(\mathbf{x})\}$ can be viewed as extracting features from $\mathbf{x}$.

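As a sketch of how the model is evaluated (assuming, purely for illustration, Gaussian basis functions with hand-chosen centres and width), we can build a design matrix whose $n$-th row is $\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}$ and compute $y(\mathbf{x}, \mathbf{w})=\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})$ for all inputs at once:

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

def design_matrix(x, centres, s=0.5):
    """Rows are phi(x_n)^T: a constant phi_0 = 1 plus Gaussian bumps."""
    phi = [np.ones_like(x)]  # dummy basis function phi_0(x) = 1
    phi += [np.exp(-0.5 * ((x - c) / s) ** 2) for c in centres]
    return np.column_stack(phi)  # shape (N, M)

x = np.linspace(0.0, 1.0, 5)  # toy 1-D inputs
Phi = design_matrix(x, centres=np.linspace(0.0, 1.0, 3))
w = np.zeros(Phi.shape[1])  # M = 4 parameters w_0, ..., w_{M-1}
y = Phi @ w  # y(x, w) = w^T phi(x) for every input
\end{lstlisting}

Other choices of basis function (polynomials, sigmoids, splines) only change \texttt{design\_matrix}; the model remains linear in $\mathbf{w}$.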
\subsection{Maximum likelihood and least squares}

We assume that the target variable $t$ is given by a deterministic function $y(\mathbf{x},\mathbf{w})$ with additive Gaussian noise, so that
\begin{equation}
t=y(\mathbf{x}, \mathbf{w})+\epsilon
\end{equation}
where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$. Thus we can write

\begin{equation}
p(t | \mathbf{x}, \mathbf{w}, \beta)=\mathcal{N}\left(t | y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)
\end{equation}

If we assume a squared loss function, then the optimal prediction, for a new value of $\mathbf{x}$, will be given by the conditional mean of the target variable. In the case of a Gaussian conditional distribution, the conditional mean is simply
\begin{equation}
\mathbb{E}[t | \mathbf{x}]=\int t p(t | \mathbf{x}) \mathrm{d} t=y(\mathbf{x}, \mathbf{w})
\end{equation}
Note that the Gaussian noise assumption implies that the conditional distribution of $t$ given $\mathbf{x}$ is unimodal, which may be inappropriate for some applications.

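Because the noise is Gaussian, maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing a sum-of-squares error, so the maximum likelihood weights can be obtained with an ordinary least-squares solve. The sketch below is illustrative only: it generates synthetic data, reuses the hypothetical \texttt{design\_matrix} from the previous snippet, and estimates the noise precision $\beta$ from the residuals.

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix with Gaussian basis functions (previous sketch)
Phi = design_matrix(x, centres=np.linspace(0.0, 1.0, 9))

# Maximum likelihood weights = least-squares solution of Phi w ~ t
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1 / beta_ML is the mean squared residual of the fitted model
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
\end{lstlisting}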
\section{The Bias-Variance Decomposition}

Regularization terms can control over-fitting, but how do we determine a suitable regularization coefficient $\lambda$? The \textit{bias-variance} trade-off gives a frequentist view of this model complexity issue.

Given the conditional distribution $p(t | \mathbf{x})$, a popular choice is the squared loss function, for which the optimal prediction is given by the conditional expectation, $h(\mathbf{x})$,
\begin{equation}
h(\mathbf{x})=\mathbb{E}[t | \mathbf{x}]=\int t p(t | \mathbf{x}) \mathrm{d} t
\end{equation}

The expected squared loss can be written in the form
\begin{equation}
\begin{aligned}
\mathbb{E}[L]={}& \int\{y(\mathbf{x})-h(\mathbf{x})\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x} \\
&+\int\{h(\mathbf{x})-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x}\, \mathrm{d} t
\end{aligned}
\end{equation}
The second term arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss. The first term depends on our choice for the function $y(\mathbf{x})$, and we will seek a solution for $y(\mathbf{x})$ which makes this term a minimum. For a particular data set $\mathcal{D}$, the expected squared difference between $y(\mathbf{x} ; \mathcal{D})$ and the regression function $h(\mathbf{x})$, taken over the ensemble of data sets, decomposes as
\begin{equation}
\begin{aligned}
&\mathbb{E}_{\mathcal{D}}\left[\{y(\mathbf{x} ; \mathcal{D})-h(\mathbf{x})\}^{2}\right] \\
&\quad=\underbrace{\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-h(\mathbf{x})\right\}^{2}}_{(\text{bias})^{2}} \\
&\qquad+\underbrace{\mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right]}_{\text{variance}}
\end{aligned}
\end{equation}
The squared \textit{bias} represents the extent to which the average prediction over all data sets differs from the desired regression function. The \textit{variance} measures the extent to which the solutions for individual data sets vary around their average, and hence the extent to which the function $y(\mathbf{x} ; \mathcal{D})$ is sensitive to the particular choice of data set.

\begin{figure}[ht]
    \centering
    \includegraphics[width = \linewidth]{figure/bias-variance-tradeoff.png}
    \caption{The bias-variance tradeoff.}
    \label{fig:tradeoff}
\end{figure}

The decomposition of the expected squared loss is
\begin{equation}
\text{expected loss}=(\text{bias})^{2}+\text{variance}+\text{noise}
\end{equation}
where
\begin{equation}
\begin{aligned}
(\text{bias})^{2} &=\int\left\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]-h(\mathbf{x})\right\}^{2} p(\mathbf{x}) \mathrm{d} \mathbf{x} \\
\text{variance} &=\int \mathbb{E}_{\mathcal{D}}\left[\left\{y(\mathbf{x} ; \mathcal{D})-\mathbb{E}_{\mathcal{D}}[y(\mathbf{x} ; \mathcal{D})]\right\}^{2}\right] p(\mathbf{x}) \mathrm{d} \mathbf{x} \\
\text{noise} &=\int\{h(\mathbf{x})-t\}^{2} p(\mathbf{x}, t) \mathrm{d} \mathbf{x}\, \mathrm{d} t
\end{aligned}
\end{equation}
There is a trade-off between bias and variance, with very flexible models having low bias and high variance, and relatively rigid models having high bias and low variance, as shown in Figure~\ref{fig:tradeoff}.
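The three terms can be estimated numerically by repeatedly sampling data sets from a known regression function and refitting the same model. The following sketch is an illustrative experiment (not from the text); it assumes a sinusoidal $h(x)$, Gaussian noise, and polynomial fits of fixed degree, and averages over $L$ synthetic data sets.

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: np.sin(2 * np.pi * x)  # true regression function h(x)
x_grid = np.linspace(0.0, 1.0, 100)  # where bias/variance are evaluated
L, N, noise_sd, degree = 200, 25, 0.3, 3

preds = np.empty((L, x_grid.size))
for l in range(L):
    x = rng.uniform(0.0, 1.0, N)
    t = h(x) + rng.normal(scale=noise_sd, size=N)
    coeffs = np.polyfit(x, t, degree)  # one fitted model y(x; D_l)
    preds[l] = np.polyval(coeffs, x_grid)

y_bar = preds.mean(axis=0)  # E_D[y(x; D)]
bias_sq = np.mean((y_bar - h(x_grid)) ** 2)  # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))  # variance, averaged over x
noise = noise_sd ** 2  # known here by construction
print(bias_sq, variance, noise)
\end{lstlisting}

Increasing \texttt{degree} lowers the bias term and raises the variance term, which is exactly the trade-off of Figure~\ref{fig:tradeoff}.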

\section{Discriminant Functions}

A discriminant is a function that takes an input vector $\mathbf{x}$ and assigns it to one of $K$ classes, denoted $\mathcal{C}_k$. This section focuses on \textit{linear discriminants} for two classes,
\begin{equation}
y(\mathbf{x})=\mathbf{w}^{\mathrm{T}} \mathbf{x}+w_{0}
\end{equation}
where $\mathbf{w}$ is called a \textit{weight vector}, and $w_0$ is a \textit{bias}. An input vector $\mathbf{x}$ is assigned to class $\mathcal{C}_1$ if $y(\mathbf{x}) \geq 0$ and to class $\mathcal{C}_2$ otherwise. The decision boundary is defined by $y(\mathbf{x}) = 0$. Consider two points $\mathbf{x}_A$ and $\mathbf{x}_B$ that both lie on the decision surface, so that $y(\mathbf{x}_A) = y(\mathbf{x}_B) = 0$. Then $\mathbf{w}^{\mathrm{T}} (\mathbf{x}_A -\mathbf{x}_B) = 0$, and hence the vector $\mathbf{w}$ is orthogonal to every vector lying within the decision surface.
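A minimal sketch of this decision rule, with arbitrary illustrative values for $\mathbf{w}$ and $w_0$, also checks numerically that $\mathbf{w}$ is orthogonal to a vector lying in the decision surface:

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

w = np.array([2.0, -1.0])  # example weight vector
w0 = 0.5                   # example bias

def classify(x):
    """Class 1 if y(x) = w^T x + w0 >= 0, otherwise class 2."""
    return 1 if w @ x + w0 >= 0 else 2

# Two points on the decision surface y(x) = 0
xA = np.array([0.0, 0.5])
xB = np.array([1.0, 2.5])
print(classify(np.array([1.0, 0.0])))  # -> 1
print(w @ (xA - xB))                   # -> 0.0, i.e. w is orthogonal
\end{lstlisting}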

\section{Probabilistic Generative Models}

Consider first the case of two classes. The posterior probability for class $\mathcal{C}_1$ can be written as
\begin{equation}
\begin{aligned} p\left(\mathcal{C}_{1} | \mathbf{x}\right) &=\frac{p\left(\mathbf{x} | \mathcal{C}_{1}\right) p\left(\mathcal{C}_{1}\right)}{p\left(\mathbf{x} | \mathcal{C}_{1}\right) p\left(\mathcal{C}_{1}\right)+p\left(\mathbf{x} | \mathcal{C}_{2}\right) p\left(\mathcal{C}_{2}\right)} \\ &=\frac{1}{1+\exp (-a)}=\sigma(a) \end{aligned}
\end{equation}
where $\sigma(\cdot)$ is the logistic sigmoid function and we have defined
\begin{equation}
a=\ln \frac{p\left(\mathbf{x} | \mathcal{C}_{1}\right) p\left(\mathcal{C}_{1}\right)}{p\left(\mathbf{x} | \mathcal{C}_{2}\right) p\left(\mathcal{C}_{2}\right)}
\end{equation}
For the case of $K > 2$ classes, we have
\begin{equation}
\begin{aligned} p\left(\mathcal{C}_{k} | \mathbf{x}\right) &=\frac{p\left(\mathbf{x} | \mathcal{C}_{k}\right) p\left(\mathcal{C}_{k}\right)}{\sum_{j} p\left(\mathbf{x} | \mathcal{C}_{j}\right) p\left(\mathcal{C}_{j}\right)} \\ &=\frac{\exp \left(a_{k}\right)}{\sum_{j} \exp \left(a_{j}\right)} \end{aligned}
\end{equation}
which is known as the \textit{normalized exponential}, or \textit{softmax function}. It represents a smoothed version of the `max' function: if $a_k \gg a_j$ for all $j \neq k$, then $p(\mathcal{C}_k|\mathbf{x}) \simeq 1$ and $p(\mathcal{C}_j|\mathbf{x}) \simeq 0$. Here the quantities $a_k$ are defined by
\begin{equation}
a_{k}=\ln p\left(\mathbf{x} | \mathcal{C}_{k}\right) p\left(\mathcal{C}_{k}\right)
\end{equation}

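As a small numerical check (illustrative values only), the posterior can be computed either directly from the class-conditional densities and priors or via the softmax of $a_k=\ln p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)$; the two routes agree:

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

def softmax(a):
    """Normalized exponential; subtracting max(a) improves stability."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Illustrative values of p(x | C_k) and p(C_k) at one particular x
likelihoods = np.array([0.30, 0.05, 0.10])
priors = np.array([0.5, 0.3, 0.2])

posterior_direct = likelihoods * priors / np.sum(likelihoods * priors)
posterior_softmax = softmax(np.log(likelihoods * priors))
print(np.allclose(posterior_direct, posterior_softmax))  # -> True
\end{lstlisting}

For $K=2$ the same construction reduces to the logistic sigmoid of the log-odds $a$.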

\subsection{Continuous inputs}

Let us assume that the class-conditional densities are Gaussian and that all classes share the same covariance matrix,
\begin{equation}
\begin{aligned}
p\left(\mathbf{x} | \mathcal{C}_{k}\right)={}& \frac{1}{(2 \pi)^{D / 2}} \frac{1}{|\mathbf{\Sigma}|^{1 / 2}} \\
&\times\exp \left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)^{\mathrm{T}} \mathbf{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{k}\right)\right\}
\end{aligned}
\end{equation}
Consider first the case of two classes. The posterior is
\begin{equation}
p\left(\mathcal{C}_{1} | \mathbf{x}\right)=\sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}+w_{0}\right)
\end{equation}
where we have defined
\begin{equation}
\mathbf{w}=\mathbf{\Sigma}^{-1}\left(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}\right)
\end{equation}
\begin{equation}
w_{0}=-\frac{1}{2} \boldsymbol{\mu}_{1}^{\mathrm{T}} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{1}+\frac{1}{2} \boldsymbol{\mu}_{2}^{\mathrm{T}} \mathbf{\Sigma}^{-1} \boldsymbol{\mu}_{2}+\ln \frac{p\left(\mathcal{C}_{1}\right)}{p\left(\mathcal{C}_{2}\right)}
\end{equation}
The decision boundaries are linear in input space. The prior probabilities $p(\mathcal{C}_k)$ enter only through the bias $w_0$, so changes in the priors have the effect of making parallel shifts of the decision boundary.
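A sketch of these expressions with made-up values for $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, $\mathbf{\Sigma}$, and the priors (all of them illustrative assumptions):

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

mu1 = np.array([1.0, 1.0])      # class means (illustrative)
mu2 = np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])  # shared covariance (illustrative)
pi1, pi2 = 0.6, 0.4             # priors p(C1), p(C2)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(pi1 / pi2))

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
x = np.array([0.5, 0.5])
print(sigmoid(w @ x + w0))      # p(C1 | x) under this model
\end{lstlisting}

Changing \texttt{pi1} and \texttt{pi2} only shifts $w_0$, so the decision boundary moves parallel to itself, as noted above.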

\subsection{Maximum likelihood solution}

Suppose we have a data set $\{\mathbf{x}_n, t_n\}$ where $n = 1, \ldots, N$. Here $t_n = 1$ denotes class $\mathcal{C}_1$ and $t_n=0$ denotes class $\mathcal{C}_2$. We denote the prior class probability $p(\mathcal{C}_1) = \pi$, so that $p(\mathcal{C}_2) = 1 - \pi$, and assume that both classes share the same covariance matrix. For a data point $\mathbf{x}_n$ from class $\mathcal{C}_1$, we have $t_n = 1$ and hence
\begin{equation}
p\left(\mathbf{x}_{n}, \mathcal{C}_{1}\right)=p\left(\mathcal{C}_{1}\right) p\left(\mathbf{x}_{n} | \mathcal{C}_{1}\right)=\pi \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)
\end{equation}
Similarly for class $\mathcal{C}_2$, we have $t_n = 0$ and hence
\begin{equation}
p\left(\mathbf{x}_{n}, \mathcal{C}_{2}\right)=p\left(\mathcal{C}_{2}\right) p\left(\mathbf{x}_{n} | \mathcal{C}_{2}\right)=(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)
\end{equation}
Thus the likelihood function is given by
\begin{equation}
\begin{aligned}
&p\left(\mathbf{t} | \pi, \boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}\right) \\
&\quad=\prod_{n=1}^{N}\left[\pi \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right)\right]^{t_{n}}\left[(1-\pi) \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{2}, \mathbf{\Sigma}\right)\right]^{1-t_{n}}
\end{aligned}
\end{equation}
where $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$. It is convenient to maximize the log of the likelihood function. Consider first the maximization with respect to $\pi$. The terms in the log likelihood function that depend on $\pi$ are
\begin{equation}
\sum_{n=1}^{N}\left\{t_{n} \ln \pi+\left(1-t_{n}\right) \ln (1-\pi)\right\}
\end{equation}
Setting the derivative with respect to $\pi$ equal to zero and rearranging, we obtain
\begin{equation}
\pi=\frac{1}{N} \sum_{n=1}^{N} t_{n}=\frac{N_{1}}{N}=\frac{N_{1}}{N_{1}+N_{2}}
\end{equation}
where $N_1$ denotes the total number of data points in class $\mathcal{C}_1$, and $N_2$ denotes the total number of data points in class $\mathcal{C}_2$.

Now consider the maximization with respect to $\boldsymbol{\mu}_1$. Again we can pick out of the log likelihood function those terms that depend on $\boldsymbol{\mu}_1$, giving
\begin{equation}
\begin{aligned}
&\sum_{n=1}^{N} t_{n} \ln \mathcal{N}\left(\mathbf{x}_{n} | \boldsymbol{\mu}_{1}, \mathbf{\Sigma}\right) \\
&\quad=-\frac{1}{2} \sum_{n=1}^{N} t_{n}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)+\text{const.}
\end{aligned}
\end{equation}
Setting the derivative with respect to $\boldsymbol{\mu}_1$ to zero and rearranging, we obtain
\begin{equation}
\boldsymbol{\mu}_{1}=\frac{1}{N_{1}} \sum_{n=1}^{N} t_{n} \mathbf{x}_{n}
\end{equation}
which is simply the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_1$. The corresponding result for $\boldsymbol{\mu}_2$ is given by
\begin{equation}
\boldsymbol{\mu}_{2}=\frac{1}{N_{2}} \sum_{n=1}^{N} (1-t_{n}) \mathbf{x}_{n}
\end{equation}
which again is the mean of all the input vectors $\mathbf{x}_{n}$ assigned to class $\mathcal{C}_2$.

Finally, consider the maximum likelihood solution for the shared covariance matrix $\mathbf{\Sigma}$. Picking out the terms in the log likelihood function that depend on $\mathbf{\Sigma}$, we have
\begin{equation}
-\frac{N}{2} \ln |\mathbf{\Sigma}|-\frac{N}{2} \operatorname{Tr}\left\{\mathbf{\Sigma}^{-1} \mathbf{S}\right\}
\end{equation}
where we have defined
\begin{equation}
\mathbf{S}=\frac{N_{1}}{N} \mathbf{S}_{1}+\frac{N_{2}}{N} \mathbf{S}_{2}
\end{equation}
\begin{equation}
\mathbf{S}_{1}=\frac{1}{N_{1}} \sum_{n \in \mathcal{C}_{1}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{1}\right)^{\mathrm{T}}
\end{equation}
\begin{equation}
\mathbf{S}_{2}=\frac{1}{N_{2}} \sum_{n \in \mathcal{C}_{2}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{2}\right)^{\mathrm{T}}
\end{equation}
Using the standard result for the maximum likelihood solution for a Gaussian distribution, we see that $\mathbf{\Sigma} = \mathbf{S}$, which represents a weighted average of the covariance matrices associated with each of the two classes separately.
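These closed-form estimates translate directly into a few lines of NumPy. The sketch below assumes a data matrix \texttt{X} of shape $(N, D)$ and a binary target vector \texttt{t} with entries in $\{0, 1\}$:

\begin{lstlisting}[language=Python, basicstyle=\ttfamily\footnotesize]
import numpy as np

def fit_gaussian_generative(X, t):
    """ML estimates for the two-class shared-covariance model."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)                          # prior p(C1)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)  # class means
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1          # per-class covariances
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)      # weighted average S
    return pi, mu1, mu2, Sigma
\end{lstlisting}

The returned parameters can be substituted into the expressions for $\mathbf{w}$ and $w_0$ of the previous subsection to obtain the posterior $p(\mathcal{C}_1|\mathbf{x})=\sigma(\mathbf{w}^{\mathrm{T}}\mathbf{x}+w_0)$.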

\section*{References}

\begin{thebibliography}{9}

\bibitem{Bishop}
Bishop, Christopher M. \textit{Pattern Recognition and Machine Learning}. Springer, 2006.

\bibitem{BiasVarianceTradeoff}
The bias-variance tradeoff, \url{https://www.machinelearningtutorial.net/2017/01/26/the-bias-variance-tradeoff/}.

\end{thebibliography}
\end{document}