\begin{itemize}
  \item \textbf{(1) Is Holger's conclusion correct?} \textbf{No.}
  \begin{itemize}
    \item CART is a \textbf{greedy algorithm}: it minimizes the risk $\mathcal{R}(\mathcal{N}_p, j, t)$ only for the \textit{current} split, not for the entire tree simultaneously.
    \item \textbf{No lookahead:} it commits to the best immediate split and therefore misses cases where a sub-optimal split \textit{now} would enable much better splits \textit{later}, potentially resulting in a globally worse tree.
    \item \textbf{Computational constraint:} an exhaustive search over all tree structures is computationally infeasible (too many combinations), so CART accepts the greedy shortcut to remain efficient. (A minimal sketch of this node-local search follows after this list.)
  \end{itemize}
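  A minimal, illustrative Python/NumPy sketch of this node-local search is given below; Gini impurity, midpoint thresholds, and the XOR-style toy data are assumptions chosen for illustration, not the exact CART implementation. On the toy data no single split reduces the impurity at all, even though two splits applied in sequence would separate the classes perfectly; this is precisely the situation a greedy search cannot exploit.
\begin{verbatim}
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Node-local greedy search: return the (feature, threshold, risk)
    triple minimizing the size-weighted child impurity of THIS node."""
    n, n_features = X.shape
    best_j, best_t, best_risk = None, None, np.inf
    for j in range(n_features):
        values = np.unique(X[:, j])
        # only midpoints between consecutive distinct values matter
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            risk = (left.sum() * gini(y[left])
                    + (~left).sum() * gini(y[~left])) / n
            if risk < best_risk:
                best_j, best_t, best_risk = j, t, risk
    return best_j, best_t, best_risk

# XOR-like toy data: the greedy criterion sees no immediate gain from any
# single split, although two splits together separate the classes perfectly.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
print(best_split(X, y))  # weighted child impurity stays at 0.5 (no reduction)
\end{verbatim}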
  \item \textbf{(2) Automatic feature selection and relevance assessment.}
  \begin{itemize}
    \item At each node, CART evaluates all features $x_j$ and all candidate thresholds $t$ and picks the split with the largest impurity/risk reduction.
    \item Features that are never selected for any split do not appear in the tree and are thus effectively excluded; relevant features tend to be chosen more often (and earlier in the tree) because they yield larger risk reductions.
    \item Proxies for feature relevance after training: (i) used vs.\ unused features (the split frequency, i.e., how often a feature is used, is only a rough proxy and can be misleading); (ii) better: the total impurity/risk decrease attributed to a feature (often weighted by node size); (iii) permutation importance, i.e., the performance drop when $x_j$ is shuffled (see the chapter on random forests). The sketch after this list compares (ii) and (iii).
  \end{itemize}
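  As an illustration of proxies (ii) and (iii), the following scikit-learn sketch contrasts the impurity-based importance (\texttt{feature\_importances\_}, i.e., the node-size weighted total impurity decrease) with permutation importance; the simulated data set and all hyperparameters are arbitrary choices for demonstration. Features that never appear in a split receive an importance of (close to) zero under both measures.
\begin{verbatim}
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance

# simulated data: 5 features, only 2 of them informative
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# (ii) total (node-size weighted) impurity decrease per feature
print("impurity-based:", np.round(tree.feature_importances_, 3))

# (iii) permutation importance: accuracy drop when one feature is shuffled
perm = permutation_importance(tree, X, y, n_repeats=20, random_state=0)
print("permutation:   ", np.round(perm.importances_mean, 3))
\end{verbatim}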
  \item \textbf{(3) Can CART predict when feature values are missing?} \textbf{Often yes (implementation-dependent).}
  \begin{itemize}
    \item Classic CART can use \emph{surrogate splits}: if the feature of the primary split is missing for an observation, it is routed via an alternative split that best mimics the primary split on the training data.
    \item Some implementations instead learn an explicit NA-routing rule (missing $\to$ left/right) or require imputation; surrogate splits are typically stored per node, often with a small default maximum (e.g., \texttt{rpart} retains up to 5 surrogate splits). A sketch of the surrogate idea follows after this list.
  \end{itemize}
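  The following Python/NumPy sketch only illustrates the surrogate idea; the function names are made up, and the procedure is simplified compared to what \texttt{rpart} actually does (it ignores reversed split orientations and keeps a single surrogate instead of a ranked list).
\begin{verbatim}
import numpy as np

def choose_surrogate(X, prim_j, prim_t):
    """Pick the (feature, threshold) whose split agrees most often with
    the primary split (prim_j, prim_t) on the training data X."""
    primary_left = X[:, prim_j] <= prim_t
    best_j, best_t, best_agree = None, None, -1.0
    for j in range(X.shape[1]):
        if j == prim_j:
            continue
        for t in np.unique(X[:, j])[:-1]:  # candidate thresholds for x_j
            agree = np.mean((X[:, j] <= t) == primary_left)
            if agree > best_agree:
                best_j, best_t, best_agree = j, t, agree
    return best_j, best_t

def route_left(x, prim_j, prim_t, surr_j, surr_t):
    """Route one observation at this node; fall back to the surrogate
    split whenever the primary feature value is missing (NaN)."""
    if not np.isnan(x[prim_j]):
        return x[prim_j] <= prim_t
    return x[surr_j] <= surr_t
\end{verbatim}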
  \item \textbf{(4) Upper bound on the number of split points at the root.}
  \begin{itemize}
    \item For a numeric feature $x_j$: although there are infinitely many possible thresholds, only splits between consecutive distinct training values lead to different partitions of the training data:
    \[
      \#\text{candidates for } x_j = \#\text{unique}(x_j) - 1 \le n - 1.
    \]
    \item With $p = 3$ numeric features:
    \[
      \#\text{root candidates} \le \sum_{j=1}^{3} \big(\#\text{unique}(x_j) - 1\big) \le 3(n-1).
    \]
    \item (If \emph{unordered categorical} predictors with $k$ levels are considered later: there are in principle up to $2^{k-1}-1$ binary partitions, though implementations may restrict this.) A quick numerical check of the bound follows after this list.
  \end{itemize}
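  The bound can be checked numerically; the short Python snippet below uses simulated data with $p = 3$ features and $n = 100$ observations (arbitrary choices) and counts the candidate split points per feature as the number of distinct values minus one.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([
    rng.normal(size=n),           # continuous: typically n distinct values
    rng.integers(0, 4, size=n),   # 4 distinct values -> 3 candidate splits
    rng.integers(0, 2, size=n),   # 2 distinct values -> 1 candidate split
])

# candidate split points per feature = number of distinct values minus 1
candidates = [len(np.unique(X[:, j])) - 1 for j in range(p)]
print(candidates, sum(candidates), "<=", p * (n - 1))
\end{verbatim}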
\end{itemize}

%
%
% \begin{enumerate}
% \item[1)] The conclusion is incorrect. Since the tree structure is built recursively, the algorithm does not necessarily identify the optimal tree with lowest empirical risk on the training data.
% This lies in the nature of greedy optimization procedures.
% Empirical risk minimization (ERM) is only performed to identify \textit{the next} splitting rule, and not entire sets of subsequent splitting rules.
%
% \item[2)] CART does automatically select features for splitting nodes if they lead to an expected reduction in empirical risk.
% Irrelevant features are therefore more likely to be picked less often for split rules in model construction.
% (Of course, the subject of assessing feature importance is left for the chapter on random forests.) However, one could gain a rough understanding of a feature's relevance by looking at how often it was picked for splitting a node. However, this kind of "split rule selection frequency" does not necessarily relate to a feature variable's contribution to ERM.
%
% \item[3)] CART can perform automatic feature selection by remembering surrogate splits in an extra step in model construction. Per default, the \texttt{rpart} package retains up to 5 surrogate splits. For each split rule, a surrogate split rule that leads to sorting observations into child nodes in a similar way is retained. These surrogate splits can then be used to "guide" observations through the tree even if they have some missing feature values. Therefore, CART is generally-speaking well-suited to handle missing observations.
%
% \item[4)] The number of possible split points evaluated per feature variable is equal to the number of different values the respective feature has in the training data minus 1, e.g., a numerical or categorical variable with 4 different values in the training data has 3 potential split points. (Actually, for a continuous feature there is an infinite amount of possible split points, but there are just 3 which lead to different results for the training data.) As each feature variable can be used for the split point, one needs to sum over the feature variables in the data set.
% \begin{align*}
% \text{Number of possible split points} = \sumjp \text{({number of different values in the training data}}_j-1) \le 3 \cdot (n-1)
% \end{align*}
% \end{enumerate}