
Commit 6439e4c

Pushing the docs to dev/ for branch: main, commit 69dc086d2065a9e39e0c0f8c5aec8d77bed72df2
1 parent ed870ad commit 6439e4c

File tree: 1,218 files changed (+4864, -4526 lines)


Diff for: dev/_downloads/62397dcd82eb2478e27036ac96fe2ab9/plot_feature_selection.py (+64 -40)
@@ -3,36 +3,26 @@
 Univariate Feature Selection
 ============================
 
-An example showing univariate feature selection.
-
-Noisy (non informative) features are added to the iris data and
-univariate feature selection is applied. For each feature, we plot the
-p-values for the univariate feature selection and the corresponding
-weights of an SVM. We can see that univariate feature selection
-selects the informative features and that these have larger SVM weights.
-
-In the total set of features, only the 4 first ones are significant. We
-can see that they have the highest score with univariate feature
-selection. The SVM assigns a large weight to one of these features, but also
-Selects many of the non-informative features.
-Applying univariate feature selection before the SVM
-increases the SVM weight attributed to the significant features, and will
-thus improve classification.
+This notebook is an example of using univariate feature selection
+to improve classification accuracy on a noisy dataset.
+
+In this example, some noisy (non informative) features are added to
+the iris dataset. Support vector machine (SVM) is used to classify the
+dataset both before and after applying univariate feature selection.
+For each feature, we plot the p-values for the univariate feature selection
+and the corresponding weights of SVMs. With this, we will compare model
+accuracy and examine the impact of univariate feature selection on model
+weights.
 
 """
 
+# %%
+# Generate sample data
+# --------------------
+#
 import numpy as np
-import matplotlib.pyplot as plt
-
 from sklearn.datasets import load_iris
 from sklearn.model_selection import train_test_split
-from sklearn.preprocessing import MinMaxScaler
-from sklearn.svm import LinearSVC
-from sklearn.pipeline import make_pipeline
-from sklearn.feature_selection import SelectKBest, f_classif
-
-# #############################################################################
-# Import some data to play with
 
 # The iris dataset
 X, y = load_iris(return_X_y=True)
@@ -46,25 +36,46 @@
 # Split dataset to select feature and evaluate the classifier
 X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
 
-plt.figure(1)
-plt.clf()
-
-X_indices = np.arange(X.shape[-1])
+# %%
+# Univariate feature selection
+# ----------------------------
+#
+# Univariate feature selection with F-test for feature scoring.
+# We use the default selection function to select
+# the four most significant features.
+from sklearn.feature_selection import SelectKBest, f_classif
 
-# #############################################################################
-# Univariate feature selection with F-test for feature scoring
-# We use the default selection function to select the four
-# most significant features
 selector = SelectKBest(f_classif, k=4)
 selector.fit(X_train, y_train)
 scores = -np.log10(selector.pvalues_)
 scores /= scores.max()
-plt.bar(
-    X_indices - 0.45, scores, width=0.2, label=r"Univariate score ($-Log(p_{value})$)"
-)
 
-# #############################################################################
-# Compare to the weights of an SVM
+# %%
+import matplotlib.pyplot as plt
+
+X_indices = np.arange(X.shape[-1])
+plt.figure(1)
+plt.clf()
+plt.bar(X_indices - 0.05, scores, width=0.2)
+plt.title("Feature univariate score")
+plt.xlabel("Feature number")
+plt.ylabel(r"Univariate score ($-Log(p_{value})$)")
+plt.show()
+
+# %%
+# In the total set of features, only the 4 of the original features are significant.
+# We can see that they have the highest score with univariate feature
+# selection.
+
+# %%
+# Compare with SVMs
+# -----------------
+#
+# Without univariate feature selection
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import MinMaxScaler
+from sklearn.svm import LinearSVC
+
 clf = make_pipeline(MinMaxScaler(), LinearSVC())
 clf.fit(X_train, y_train)
 print(
@@ -76,8 +87,8 @@
 svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
 svm_weights /= svm_weights.sum()
 
-plt.bar(X_indices - 0.25, svm_weights, width=0.2, label="SVM weight")
-
+# %%
+# After univariate feature selection
 clf_selected = make_pipeline(SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC())
 clf_selected.fit(X_train, y_train)
 print(
@@ -89,17 +100,30 @@
 svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
 svm_weights_selected /= svm_weights_selected.sum()
 
+# %%
+plt.bar(
+    X_indices - 0.45, scores, width=0.2, label=r"Univariate score ($-Log(p_{value})$)"
+)
+
+plt.bar(X_indices - 0.25, svm_weights, width=0.2, label="SVM weight")
+
 plt.bar(
     X_indices[selector.get_support()] - 0.05,
     svm_weights_selected,
     width=0.2,
     label="SVM weights after selection",
 )
 
-
 plt.title("Comparing feature selection")
 plt.xlabel("Feature number")
 plt.yticks(())
 plt.axis("tight")
 plt.legend(loc="upper right")
 plt.show()
+
+# %%
+# Without univariate feature selection, the SVM assigns a large weight
+# to the first 4 original significant features, but also selects many of the
+# non-informative features. Applying univariate feature selection before
+# the SVM increases the SVM weight attributed to the significant features,
+# and will thus improve classification.
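
As a quick check of the claim in the closing comment above (that applying univariate feature selection before the SVM improves classification), here is a minimal, illustrative sketch; it is not part of this commit. It rebuilds the noisy dataset exactly as in the example and compares the two pipelines with scikit-learn's cross_val_score instead of the single train/test split:

# Illustrative sketch (not part of this commit): cross-validate the two
# pipelines from the updated example on the same noisy iris data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Rebuild the noisy dataset as in the example: iris plus 20 uninformative columns.
X, y = load_iris(return_X_y=True)
E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))
X = np.hstack((X, E))

# SVM on all features vs. SVM on the 4 features kept by SelectKBest.
clf_all = make_pipeline(MinMaxScaler(), LinearSVC())
clf_selected = make_pipeline(SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC())

print("CV accuracy, all features: {:.3f}".format(cross_val_score(clf_all, X, y, cv=5).mean()))
print("CV accuracy, selected features: {:.3f}".format(cross_val_score(clf_selected, X, y, cv=5).mean()))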

Diff for: dev/_downloads/fe71806a900680d092025bf56d0dfcb3/plot_feature_selection.ipynb (+99 -2)
@@ -15,7 +15,68 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Univariate Feature Selection\n\nAn example showing univariate feature selection.\n\nNoisy (non informative) features are added to the iris data and\nunivariate feature selection is applied. For each feature, we plot the\np-values for the univariate feature selection and the corresponding\nweights of an SVM. We can see that univariate feature selection\nselects the informative features and that these have larger SVM weights.\n\nIn the total set of features, only the 4 first ones are significant. We\ncan see that they have the highest score with univariate feature\nselection. The SVM assigns a large weight to one of these features, but also\nSelects many of the non-informative features.\nApplying univariate feature selection before the SVM\nincreases the SVM weight attributed to the significant features, and will\nthus improve classification.\n"
+"\n# Univariate Feature Selection\n\nThis notebook is an example of using univariate feature selection\nto improve classification accuracy on a noisy dataset.\n\nIn this example, some noisy (non informative) features are added to\nthe iris dataset. Support vector machine (SVM) is used to classify the\ndataset both before and after applying univariate feature selection.\nFor each feature, we plot the p-values for the univariate feature selection\nand the corresponding weights of SVMs. With this, we will compare model\naccuracy and examine the impact of univariate feature selection on model\nweights.\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Generate sample data\n\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import numpy as np\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\n# The iris dataset\nX, y = load_iris(return_X_y=True)\n\n# Some noisy data not correlated\nE = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))\n\n# Add the noisy data to the informative features\nX = np.hstack((X, E))\n\n# Split dataset to select feature and evaluate the classifier\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Univariate feature selection\n\nUnivariate feature selection with F-test for feature scoring.\nWe use the default selection function to select\nthe four most significant features.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"from sklearn.feature_selection import SelectKBest, f_classif\n\nselector = SelectKBest(f_classif, k=4)\nselector.fit(X_train, y_train)\nscores = -np.log10(selector.pvalues_)\nscores /= scores.max()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import matplotlib.pyplot as plt\n\nX_indices = np.arange(X.shape[-1])\nplt.figure(1)\nplt.clf()\nplt.bar(X_indices - 0.05, scores, width=0.2)\nplt.title(\"Feature univariate score\")\nplt.xlabel(\"Feature number\")\nplt.ylabel(r\"Univariate score ($-Log(p_{value})$)\")\nplt.show()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"In the total set of features, only the 4 of the original features are significant.\nWe can see that they have the highest score with univariate feature\nselection.\n\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Compare with SVMs\n\nWithout univariate feature selection\n\n"
 ]
 },
 {
@@ -26,7 +87,43 @@
 },
 "outputs": [],
 "source": [
-"import numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.svm import LinearSVC\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.feature_selection import SelectKBest, f_classif\n\n# #############################################################################\n# Import some data to play with\n\n# The iris dataset\nX, y = load_iris(return_X_y=True)\n\n# Some noisy data not correlated\nE = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))\n\n# Add the noisy data to the informative features\nX = np.hstack((X, E))\n\n# Split dataset to select feature and evaluate the classifier\nX_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n\nplt.figure(1)\nplt.clf()\n\nX_indices = np.arange(X.shape[-1])\n\n# #############################################################################\n# Univariate feature selection with F-test for feature scoring\n# We use the default selection function to select the four\n# most significant features\nselector = SelectKBest(f_classif, k=4)\nselector.fit(X_train, y_train)\nscores = -np.log10(selector.pvalues_)\nscores /= scores.max()\nplt.bar(\n    X_indices - 0.45, scores, width=0.2, label=r\"Univariate score ($-Log(p_{value})$)\"\n)\n\n# #############################################################################\n# Compare to the weights of an SVM\nclf = make_pipeline(MinMaxScaler(), LinearSVC())\nclf.fit(X_train, y_train)\nprint(\n    \"Classification accuracy without selecting features: {:.3f}\".format(\n        clf.score(X_test, y_test)\n    )\n)\n\nsvm_weights = np.abs(clf[-1].coef_).sum(axis=0)\nsvm_weights /= svm_weights.sum()\n\nplt.bar(X_indices - 0.25, svm_weights, width=0.2, label=\"SVM weight\")\n\nclf_selected = make_pipeline(SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC())\nclf_selected.fit(X_train, y_train)\nprint(\n    \"Classification accuracy after univariate feature selection: {:.3f}\".format(\n        clf_selected.score(X_test, y_test)\n    )\n)\n\nsvm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)\nsvm_weights_selected /= svm_weights_selected.sum()\n\nplt.bar(\n    X_indices[selector.get_support()] - 0.05,\n    svm_weights_selected,\n    width=0.2,\n    label=\"SVM weights after selection\",\n)\n\n\nplt.title(\"Comparing feature selection\")\nplt.xlabel(\"Feature number\")\nplt.yticks(())\nplt.axis(\"tight\")\nplt.legend(loc=\"upper right\")\nplt.show()"
+"from sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.svm import LinearSVC\n\nclf = make_pipeline(MinMaxScaler(), LinearSVC())\nclf.fit(X_train, y_train)\nprint(\n    \"Classification accuracy without selecting features: {:.3f}\".format(\n        clf.score(X_test, y_test)\n    )\n)\n\nsvm_weights = np.abs(clf[-1].coef_).sum(axis=0)\nsvm_weights /= svm_weights.sum()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"After univariate feature selection\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"clf_selected = make_pipeline(SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC())\nclf_selected.fit(X_train, y_train)\nprint(\n    \"Classification accuracy after univariate feature selection: {:.3f}\".format(\n        clf_selected.score(X_test, y_test)\n    )\n)\n\nsvm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)\nsvm_weights_selected /= svm_weights_selected.sum()"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"plt.bar(\n    X_indices - 0.45, scores, width=0.2, label=r\"Univariate score ($-Log(p_{value})$)\"\n)\n\nplt.bar(X_indices - 0.25, svm_weights, width=0.2, label=\"SVM weight\")\n\nplt.bar(\n    X_indices[selector.get_support()] - 0.05,\n    svm_weights_selected,\n    width=0.2,\n    label=\"SVM weights after selection\",\n)\n\nplt.title(\"Comparing feature selection\")\nplt.xlabel(\"Feature number\")\nplt.yticks(())\nplt.axis(\"tight\")\nplt.legend(loc=\"upper right\")\nplt.show()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Without univariate feature selection, the SVM assigns a large weight\nto the first 4 original significant features, but also selects many of the\nnon-informative features. Applying univariate feature selection before\nthe SVM increases the SVM weight attributed to the significant features,\nand will thus improve classification.\n\n"
 ]
 }
 ],
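
The notebook diff above replaces the single monolithic code cell with alternating markdown and code cells. A minimal sketch for inspecting the resulting cell layout locally, assuming the built notebook from the header path has been downloaded as plot_feature_selection.ipynb (the local filename is hypothetical; this snippet is not part of the commit), needs only the standard-library json module:

# Illustrative sketch: list the cell types and first lines of the rebuilt notebook.
import json

path = "plot_feature_selection.ipynb"  # hypothetical local copy of the file above
with open(path, encoding="utf-8") as f:
    nb = json.load(f)

for i, cell in enumerate(nb["cells"]):
    source = "".join(cell["source"])
    first_line = source.splitlines()[0] if source else ""
    print("{:2d} {:8s} {}".format(i, cell["cell_type"], first_line[:60]))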

Diff for: dev/_downloads/scikit-learn-docs.zip (18.4 KB, binary file not shown)

Diff for: dev/_images/sphx_glr_plot_anomaly_comparison_001.png (-258 Bytes)
Diff for: dev/_images/sphx_glr_plot_cluster_comparison_001.png (-747 Bytes)
Diff for: dev/_images/sphx_glr_plot_coin_segmentation_001.png (34 Bytes)
Diff for: dev/_images/sphx_glr_plot_coin_segmentation_002.png (-55 Bytes)
Diff for: dev/_images/sphx_glr_plot_coin_segmentation_003.png (42 Bytes)
Diff for: dev/_images/sphx_glr_plot_compare_methods_001.png (-884 Bytes)
Diff for: dev/_images/sphx_glr_plot_compare_methods_thumb.png (-21 Bytes)
Diff for: dev/_images/sphx_glr_plot_dict_face_patches_001.png (-94 Bytes)
Diff for: dev/_images/sphx_glr_plot_digits_pipe_001.png (-28 Bytes)
Diff for: dev/_images/sphx_glr_plot_digits_pipe_thumb.png (-24 Bytes)
Diff for: dev/_images/sphx_glr_plot_feature_selection_001.png (-1.72 KB)
Diff for: dev/_images/sphx_glr_plot_feature_selection_002.png (14.3 KB)
Diff for: dev/_images/sphx_glr_plot_image_denoising_004.png (-53 Bytes)
Diff for: dev/_images/sphx_glr_plot_image_denoising_005.png (128 Bytes)
Diff for: dev/_images/sphx_glr_plot_learning_curve_001.png (6.52 KB)
Diff for: dev/_images/sphx_glr_plot_learning_curve_thumb.png (1.08 KB)
Diff for: dev/_images/sphx_glr_plot_linkage_comparison_001.png (-1.61 KB)
Diff for: dev/_images/sphx_glr_plot_lle_digits_005.png (12 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_006.png (127 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_007.png (-314 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_008.png (-72 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_009.png (-26 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_010.png (16 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_011.png (-10 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_012.png (-159 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_013.png (-270 Bytes)
Diff for: dev/_images/sphx_glr_plot_lle_digits_014.png (-74 Bytes)
Diff for: dev/_images/sphx_glr_plot_manifold_sphere_001.png (114 Bytes)
Diff for: dev/_images/sphx_glr_plot_manifold_sphere_thumb.png (26 Bytes)
Diff for: dev/_images/sphx_glr_plot_prediction_latency_001.png (-1.11 KB)
Diff for: dev/_images/sphx_glr_plot_prediction_latency_002.png (74 Bytes)
Diff for: dev/_images/sphx_glr_plot_prediction_latency_003.png (866 Bytes)
Diff for: dev/_images/sphx_glr_plot_prediction_latency_004.png (32 Bytes)
Diff for: dev/_images/sphx_glr_plot_sgd_early_stopping_001.png (1.06 KB)
Diff for: dev/_images/sphx_glr_plot_stack_predictors_001.png (-20 Bytes)
Diff for: dev/_images/sphx_glr_plot_stack_predictors_thumb.png (11 Bytes)
Diff for: dev/_images/sphx_glr_plot_theilsen_002.png (-17 Bytes)

Diff for: dev/_sources/auto_examples/applications/plot_cyclical_feature_engineering.rst.txt (+1 -1)
Diff for: dev/_sources/auto_examples/applications/plot_digits_denoising.rst.txt (+1 -1)
Diff for: dev/_sources/auto_examples/applications/plot_face_recognition.rst.txt (+5 -5)
Diff for: dev/_sources/auto_examples/applications/plot_model_complexity_influence.rst.txt (+12 -12)
