########################
Bias-Variance User Guide
########################

**********
Motivation
**********

Statistical Bias vs. "Fairness"
===============================

For this user guide and associated submodule, we are referring to
`statistical bias <https://en.wikipedia.org/wiki/Bias_(statistics)>`_ rather
than the "fairness" type of bias.

Why should we care about bias and variance?
===========================================

Bias and variance are two indicators of model performance. Together with the
irreducible "noise" error that comes from the data set itself, they make up
the three components of total model error. We can define bias and variance as
follows by training a model with multiple `bootstrap sampled
<https://en.wikipedia.org/wiki/Bootstrapping_(statistics)>`_ training sets,
resulting in multiple instances of the model; a code sketch follows the
definitions below.

.. topic:: Bias and variance defined over multiple training sets:

    * Bias represents the average difference between the prediction a model makes and the correct prediction.
    * Variance represents the average variability of the predictions a model makes.
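
To make these definitions concrete, here is a minimal, illustrative sketch of
estimating bias and variance by hand with NumPy and scikit-learn. The data,
model, and number of bootstrap samples are all stand-ins rather than the
library's implementation (see the Usage section for that).

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
    X_train, y_train = X[:150], y[:150]
    X_test, y_test = X[150:], y[150:]

    # Train one model instance per bootstrap sample and collect test predictions
    preds = []
    for _ in range(20):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
        model = LinearRegression().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    preds = np.array(preds)  # shape: (n_models, n_test_examples)

    mean_pred = preds.mean(axis=0)
    bias = np.mean((mean_pred - y_test) ** 2)  # average gap between mean prediction and truth
    variance = np.mean(preds.var(axis=0))      # average spread of predictions across models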

Typically, a model with high bias is "underfit" and a model with high variance
is "overfit," though this is not always the case and there can be many reasons
why a model has high bias or high variance. An "underfit" model is
oversimplified and performs poorly even on the training data, whereas an
"overfit" model sticks too closely to the training data and performs poorly on
unseen examples. See Scikit-Learn's `Underfitting vs. Overfitting
<https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html>`_
for a clear example of an "underfit" model vs. an "overfit" model.

There is a concept known as the `"bias-variance tradeoff"
<https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff>`_ that describes
the relationship between high bias and high variance in a model. Our ultimate
goal here is to find the ideal balance where bias and variance are both at a
minimum. From a business standpoint, it is also important to decide whether the
model error that we are unable to reduce should favor bias or variance.

*****************************************
Visualize Bias and Variance With Examples
*****************************************

To make the concepts of bias and variance easy to understand, we will show four
example models, one for each combination of high and low bias and variance.
These are extreme, engineered cases built so that the bias and variance are
clearly visible.

Before we begin, let's take a look at the distribution of the labels. Notice
that the majority of label values are around 1 and 2, with far fewer around 5.

.. figure:: images/bias_variance_label_distribution.png
    :align: center
    :alt: Distribution of label values
    :figclass: align-center

First we have a model with high bias and low variance. We artificially
introduce bias by adding 10 to every training label while leaving the test
labels as is; the shift is sketched below. Given that values greater than 5
anywhere in the label set are considered outliers, we are effectively fitting
the model to outliers.
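
The injection itself is a one-line shift. Here ``y_train`` is a stand-in for
the training label array, not a variable from the library:

.. code-block:: python

    import numpy as np

    y_train = np.array([1.2, 1.8, 2.1, 1.5, 4.9])  # stand-in training labels
    y_train_biased = y_train + 10  # every training label shifted; test labels untouched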

.. figure:: images/high_bias_low_variance.png
    :align: center
    :alt: High bias, low variance error distributions
    :figclass: align-center

    Five sets of mean squared error results on the test set, one for each of
    the five bootstrap-sample trainings of the model. Notice the model error
    is very consistent across the trials and is not centered around 0.

Next we have a model with low bias and high variance. We simulate this by
introducing 8 random "noise" features into the data set, as sketched below. We
also reduce the size of the training set and train a neural network for a
small number of epochs.
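
Here is an illustrative way to tack the random noise features onto a feature
matrix; ``X_train`` and its dimensions are stand-ins:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 4))          # stand-in feature matrix
    noise = rng.normal(size=(len(X_train), 8))   # 8 random "noise" features
    X_train_noisy = np.hstack([X_train, noise])  # append the noise columns
    X_train_small = X_train_noisy[:25]           # also shrink the training set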

.. figure:: images/low_bias_high_variance.png
    :align: center
    :alt: Low bias, high variance error distributions
    :figclass: align-center

    Five sets of mean squared error results on the test set, one for each of
    the five bootstrap-sample trainings of the model. Notice the model error
    is distributed differently across the trials and centers mainly around 0.

Next we have a model with high bias and high variance. We simulate this with a
combination of the techniques from the previous two examples and train another
neural network.

.. figure:: images/high_bias_high_variance.png
    :align: center
    :alt: High bias, high variance error distributions
    :figclass: align-center

    Five sets of mean squared error results on the test set, one for each of
    the five bootstrap-sample trainings of the model. Notice the model error
    is distributed differently across the trials and is not centered around 0.

Finally we have a model with low bias and low variance. This is a simple
linear regression model with no modifications to the training or test labels.

.. figure:: images/low_bias_low_variance.png
    :align: center
    :alt: Low bias, low variance error distributions
    :figclass: align-center

    Five sets of mean squared error results on the test set, one for each of
    the five bootstrap-sample trainings of the model. Notice the model error
    is very consistent across the trials and centers mainly around 0.

***************************
Bias-Variance Decomposition
***************************

.. currentmodule:: mvtk.bias_variance

There are formulas for breaking down total model error into three parts: bias,
variance, and noise. The breakdown can be applied to both regression loss
functions (mean squared error) and classification loss functions (0-1 loss).
Pedro Domingos proposed a unified decomposition that covers both types of
problems :cite:`domingos2000decomp`.

First, let's define :math:`y` as a single prediction, :math:`D` as the set of
training sets used to train the models, :math:`Y` as the set of predictions
from the models trained on :math:`D`, and :math:`L` as a loss function that
calculates the error between a prediction :math:`y` and the correct
prediction.
The main prediction :math:`y_m` is the prediction with the smallest average
loss when compared to the set of predictions :math:`Y`. The main prediction is
the mean of :math:`Y` for mean squared error and the mode of :math:`Y` for
0-1 loss :cite:`domingos2000decomp`.
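
As a quick, illustrative sketch with made-up prediction values, the main
prediction under each loss can be computed like this:

.. code-block:: python

    import numpy as np

    preds = np.array([1, 1, 2, 2, 2])  # predictions for one example from 5 models

    # Main prediction under mean squared error: the mean of the predictions
    y_m_mse = preds.mean()  # 1.6

    # Main prediction under 0-1 loss: the mode of the predictions
    vals, counts = np.unique(preds, return_counts=True)
    y_m_01 = vals[np.argmax(counts)]  # 2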

Bias can now be defined for a single example :math:`x` over the set of models
trained on :math:`D` as the loss calculated between the main prediction
:math:`y_m` and the correct prediction :math:`y_*` :cite:`domingos2000decomp`.

.. math::
    B(x) = L(y_*, y_m)

Variance can now be defined for a single example :math:`x` over the set of
models trained on :math:`D` as the average loss calculated between all
predictions and the main prediction :math:`y_m` :cite:`domingos2000decomp`.

.. math::
    V(x) = E_D[L(y_m, y)]

We will need to take the average of the bias over all examples as
:math:`E_x[B(x)]` and the average of the variance over all examples as
:math:`E_x[V(x)]` :cite:`domingos2000decomp`.

With :math:`N(x)` representing the irreducible error from observation noise,
we can decompose the average expected loss as :cite:`domingos2000decomp`

.. math::
    E_x[N(x)] + E_x[B(x)] + E_x[cV(x)]

In other words, the average loss over all examples is equal to the noise plus
the average bias plus the net variance (the :math:`c` factor applied to the
variance when averaging is what turns average variance into net variance).

.. note::
    We are generalizing the actual value of :math:`N(x)`, as the Model Validation
    Toolkit's implementation of bias-variance decomposition does not include noise
    in the average expected loss. This noise represents error in the data itself
    rather than error related to the model. If you would like to dive deeper into
    the noise representation, please consult the `Pedro Domingos paper
    <https://homes.cs.washington.edu/~pedrod/papers/mlc00a.pdf>`_.

For mean squared error, :math:`c = 1`, meaning that the average variance is
equal to the net variance.
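
As an illustrative check with made-up numbers, the squared-loss decomposition
holds exactly for a single example (noise aside):

.. code-block:: python

    import numpy as np

    preds = np.array([2.1, 2.4, 1.9, 2.6, 2.0])  # predictions from 5 model instances
    y_true = 2.0

    y_m = preds.mean()                       # main prediction for squared loss
    bias = (y_true - y_m) ** 2               # B(x) = L(y_*, y_m)
    variance = ((preds - y_m) ** 2).mean()   # V(x); c = 1, so net variance = variance

    avg_loss = ((preds - y_true) ** 2).mean()
    assert np.isclose(avg_loss, bias + variance)  # 0.108 == 0.04 + 0.068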

For 0-1 loss, :math:`c = 1` when :math:`y_m = y_*`, otherwise
:math:`c = -P_D(y = y_* \mid y \ne y_m)` :cite:`domingos2000decomp`. In other
words, :math:`c` is 1 when the main prediction is the correct prediction. If
the main prediction is not the correct prediction, then :math:`c` is the
negative of the probability that a single prediction is the correct prediction
given that the single prediction is not the main prediction.
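
Here is an illustrative numeric check for 0-1 loss, again with made-up
predictions:

.. code-block:: python

    import numpy as np

    preds = np.array([1, 1, 2, 2, 2])  # predictions for one example from 5 models
    y_true = 1

    vals, counts = np.unique(preds, return_counts=True)
    y_m = vals[np.argmax(counts)]        # main prediction: the mode -> 2

    bias = float(y_m != y_true)          # B(x) -> 1.0
    variance = np.mean(preds != y_m)     # V(x) = P(y != y_m) -> 0.4
    if y_m == y_true:
        c = 1.0
    else:
        c = -np.mean(preds[preds != y_m] == y_true)  # -P(y = y_* | y != y_m) -> -1.0

    avg_loss = np.mean(preds != y_true)  # direct average 0-1 loss -> 0.6
    assert np.isclose(avg_loss, bias + c * variance)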

Usage
=====

:meth:`bias_variance_compute` will return the average loss, average bias,
average variance, and net variance for an estimator trained and tested over a
specified number of training sets. It was inspired by and modeled after
Sebastian Raschka's `bias_variance_decomp
<https://github.com/rasbt/mlxtend/blob/master/mlxtend/evaluate/bias_variance_decomp.py>`_
function :cite:`mlxtenddecomp`.
We use the `bootstrapping <https://en.wikipedia.org/wiki/Bootstrapping_(statistics)>`_
method to produce our sets of training data from the original training set. By
default it uses mean squared error as the loss function, but it accepts either
of the following functions for calculating loss.

* :meth:`bias_variance_mse` for mean squared error
* :meth:`bias_variance_0_1_loss` for 0-1 loss

Since :meth:`bias_variance_compute` trains an estimator over multiple
iterations, it also expects the estimator to be wrapped in a class that extends
:class:`estimators.EstimatorWrapper`, which provides a uniform fit and predict
interface, since not all estimator implementations conform to the same one. The
following estimator wrappers are provided; a usage sketch follows the list.

* :class:`estimators.PyTorchEstimatorWrapper` for `PyTorch <https://pytorch.org/>`_
* :class:`estimators.SciKitLearnEstimatorWrapper` for `Scikit-Learn <https://scikit-learn.org/stable/>`_
* :class:`estimators.TensorFlowEstimatorWrapper` for `TensorFlow <https://www.tensorflow.org/>`_
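
Below is a minimal sketch of typical usage with a scikit-learn regressor. The
data is synthetic, and the exact argument names (such as ``iterations`` and
``decomp_fn``) are illustrative; consult the API reference for the precise
signature.

.. code-block:: python

    import numpy as np
    from sklearn.linear_model import LinearRegression

    from mvtk.bias_variance import bias_variance_compute, bias_variance_mse
    from mvtk.bias_variance.estimators import SciKitLearnEstimatorWrapper

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
    X_train, y_train = X[:150], y[:150]
    X_test, y_test = X[150:], y[150:]

    # Wrap the estimator so it exposes the uniform fit/predict interface
    model = SciKitLearnEstimatorWrapper(LinearRegression())

    avg_loss, avg_bias, avg_var, net_var = bias_variance_compute(
        model, X_train, y_train, X_test, y_test,
        iterations=100, decomp_fn=bias_variance_mse)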

:meth:`bias_variance_compute` works well for smaller data sets and less complex
models, but what happens when you have a very large data set, a very complex
model, or both? :meth:`bias_variance_compute_parallel` performs the same
calculation but leverages `Ray <https://www.ray.io/>`_ to parallelize the
bootstrapping, training, and prediction, allowing faster calculation by
distributing the computation across a cluster.
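
Assuming the parallel variant mirrors the sequential interface (again, check
the API reference), swapping it in would look like this:

.. code-block:: python

    from mvtk.bias_variance import bias_variance_compute_parallel

    # Same illustrative arguments as the sequential sketch above
    avg_loss, avg_bias, avg_var, net_var = bias_variance_compute_parallel(
        model, X_train, y_train, X_test, y_test,
        iterations=100, decomp_fn=bias_variance_mse)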

.. topic:: Tutorials:

    * :doc:`Bias-Variance Visualization <notebooks/bias_variance/BiasVarianceVisualization>`
    * :doc:`Bias-Variance Regression <notebooks/bias_variance/BiasVarianceRegression>`
    * :doc:`Bias-Variance Classification <notebooks/bias_variance/BiasVarianceClassification>`

.. bibliography:: refs.bib
    :cited: