<li><ahref="#disclaimer">Disclaimer</a></li>
</ul>

## Latest Update (Jan 2025)
<ol>
<li><b>featurewiz is now upgraded to version 0.6</b>

Anything above this version runs on Python 3.12 or greater and on pandas 2.0.

- This is a huge upgrade for those working in Colab, Kaggle and other recent kernels.

- Please make sure you check the `requirements.txt` file to know which versions are recommended.</li>
</ol>
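
If you want to confirm that your environment meets these requirements before upgrading, a quick check like the following can help (a minimal sketch; the exact version pins live in `requirements.txt`):

```python
import sys
import pandas as pd

# featurewiz 0.6+ targets Python 3.12 or greater and pandas 2.0 or greater
assert sys.version_info >= (3, 12), "featurewiz 0.6+ expects Python 3.12 or greater"
assert int(pd.__version__.split(".")[0]) >= 2, "featurewiz 0.6+ expects pandas 2.0 or greater"
```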
## Citation
If you use featurewiz in your research project or paper, please use the following format for citations:<p>
"Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz"</p>
### What Makes FeatureWiz Stand Out? 🔍
✔️ Automatically select the most relevant features without specifying a number
🚀 Provides the fastest and best implementation of the MRMR algorithm
🎯 Provides a built-in transformer (the lazytransform library) that converts all features to numeric
📚 Includes deep learning models such as Variational Auto Encoders to capture complex interactions in your data
📝 Provides feature engineering in addition to feature selection - all with one single API call (see the sketch below)!
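
Here is a minimal sketch of that single call, assuming a scikit-learn style workflow with pandas DataFrames `X_train`, `X_test` and a target `y_train` (the names are illustrative; see the examples folder for the exact arguments and attributes in your installed version):

```python
from featurewiz import FeatureWiz

# Select the best features with one call; feature_engg='' means selection only
fwiz = FeatureWiz(feature_engg='', corr_limit=0.70, verbose=1)
X_train_selected = fwiz.fit_transform(X_train, y_train)
X_test_selected = fwiz.transform(X_test)
print(fwiz.features)  # the list of selected feature names
```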
### Simple tips for success using featurewiz 💡
📈 First create additional features using the feature engineering module
🎯 Try adding auto-encoders for additional features that may help boost performance
### Feature Engineering
Create new features effortlessly with a single line of code! featurewiz enables you to generate hundreds of interaction, group-by, target-encoded and higher-order features, eliminating the need for expert-level knowledge to create your own features. You can even create deep-learning-based features, using Variational Auto Encoders, to capture complex interactions hidden among your features. See the <a href="https://github.com/AutoViML/featurewiz/blob/main/updates.md">latest updates</a> page for more information on this amazing feature.
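
As a rough sketch of what that single line can look like with the classic functional API (assuming a pandas DataFrame `train_df` with a column named 'target'; the option string 'interactions' is one example, and the call typically returns the selected feature names plus the transformed training data, so check the docs for the exact outputs in your version):

```python
from featurewiz import featurewiz

# One call that engineers new features and then selects the best of them
features, train_transformed = featurewiz(
    dataname=train_df,            # a pandas DataFrame (or a path to a CSV file)
    target='target',              # name of the target column (illustrative)
    feature_engg='interactions',  # generate interaction features before selection
    corr_limit=0.70,
    verbose=1,
)
```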
### What is MRMR?
featurewiz provides one of the best automatic feature selection algorithms, MRMR, which <a href="https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection">Wikipedia describes</a> as follows: "The MRMR feature selection algorithm has been found to be more powerful than other feature selection algorithms such as Boruta".

In addition, other researchers have compared <a href="https://github.com/smazzanti/mrmr/blob/15cb0983a3e53114bbab94a9629e404c1d42f5d8/notebooks/mnist.ipynb">MRMR against multiple feature selection algorithms</a> and found MRMR to be the best.
*(Image: MRMR feature selection)*
### How does MRMR feature selection work? 🔍
After creating new features, featurewiz uses the MRMR algorithm to answer crucial questions: Which features are important? Are they redundant or mutually correlated? Will your model suffer or benefit from adding all of these features? To answer these questions, featurewiz performs two crucial steps in MRMR:

⚙️ The SULOV Algorithm: SULOV stands for "Searching for Uncorrelated List of Variables". It is a fast algorithm that removes mutually correlated features so that you're left with only the most non-redundant (uncorrelated) features. It uses the Mutual Information Score to accomplish this feat.

⚙️ Recursive XGBoost: featurewiz then runs XGBoost repeatedly on smaller and smaller feature subsets, using its feature importance scores to identify the most relevant features for your task among the variables remaining after the SULOV step.
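
To make the second step more concrete, here is a simplified, illustrative sketch of the recursive idea (this is not featurewiz's internal code; it assumes a numeric pandas DataFrame `X`, a classification target `y`, and the `xgboost` package):

```python
from xgboost import XGBClassifier

def recursive_xgboost_select(X, y, keep_frac=0.5, min_features=5, rounds=3):
    """Repeatedly fit XGBoost on shrinking feature subsets and keep the top-ranked features."""
    candidates = list(X.columns)
    for _ in range(rounds):
        model = XGBClassifier(n_estimators=100, verbosity=0).fit(X[candidates], y)
        # Rank candidates by XGBoost feature importance, highest first
        ranked = [f for _, f in sorted(zip(model.feature_importances_, candidates), reverse=True)]
        keep = max(min_features, int(len(ranked) * keep_frac))
        candidates = ranked[:keep]  # shrink the candidate set and repeat
        if len(candidates) <= min_features:
            break
    return candidates
```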
### Advanced Feature Engineering Options
featurewiz extends traditional feature selection into the realm of deep learning using <b>Auto Encoders</b>, including Denoising Auto Encoders (DAEs), Variational Auto Encoders (VAEs), CNNs (Convolutional Neural Networks) and GANs (Generative Adversarial Networks), for additional feature extraction, especially on imbalanced datasets. Just set the `feature_engg` flag to 'VAE_add' or 'DAE_add' to create these additional features (see the sketch after this list). In addition, featurewiz offers:
<li>A variety of category encoders like HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder, OneHotEncoder, HelmertEncoder, OrdinalEncoder, and BaseNEncoder.</li>
<li>The ability to add interaction features (e.g., x1x2, x2x3, x1^2), polynomial features (X**2, X**3), group-by features, and target encoding features.</li>
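
As a hedged illustration of combining the auto-encoder option above with one of these category encoders (parameter names follow this README, but exact option strings and defaults may differ between versions; `X_train`, `y_train` and `X_test` are illustrative):

```python
from featurewiz import FeatureWiz

# Add auto-encoder based features and encode categoricals in the same call
fwiz = FeatureWiz(feature_engg='VAE_add',             # or 'DAE_add' for Denoising Auto Encoders
                  category_encoders='HashingEncoder',  # any of the encoders listed above
                  corr_limit=0.70, verbose=1)
X_train_enriched = fwiz.fit_transform(X_train, y_train)
X_test_enriched = fwiz.transform(X_test)
```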
### Examples and Updates
- featurewiz is well-documented, and it comes with a number of <a href="https://github.com/AutoViML/featurewiz/tree/main/examples">examples</a>
<li>Optimal Feature Subset: Uses Recursive XGBoost in combination with SULOV to identify the most critical features, reducing overfitting and improving model interpretability.</li>
#### Comparing featurewiz to Boruta:
Featurewiz uses what is known as a `Minimal Optimal` algorithm (MRMR), while Boruta uses an `All-Relevant` approach. To understand how featurewiz's MRMR approach differs from Boruta's `All-Relevant` approach to feature selection, study the chart below. It shows how the SULOV algorithm performs <a href="https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b">MRMR feature selection</a>, which yields a smaller feature set than Boruta's.

One of the weaknesses of Boruta is that it retains redundant features (highly correlated features), which will hamper model performance, while featurewiz does not.

*(Chart: SULOV/MRMR feature selection compared to Boruta)*
<ol>
<li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).</li>
<li>Then find their MIS score (Mutual Information Score) with respect to the target variable. MIS is a non-parametric scoring method, so it is suitable for all kinds of variables and targets.</li>
<li>Now take each pair of correlated variables (i.e., those with a Pearson coefficient higher than the threshold above) and eliminate the feature with the lower MIS score from the pair. Do this repeatedly for each pair until no feature pair is left to analyze.</li>
<li>What’s left after this step are the features with the highest Information Score and the least Pearson correlation with each other.</li>
</ol>
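
A minimal, illustrative sketch of these four steps (not featurewiz's exact implementation; it assumes a numeric pandas DataFrame `X` and a classification target `y`):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_select(X: pd.DataFrame, y, corr_threshold: float = 0.7):
    """Keep the least-correlated features with the highest Mutual Information Score (MIS)."""
    mis = dict(zip(X.columns, mutual_info_classif(X, y)))  # step 2: MIS of each feature vs. target
    corr = X.corr().abs()                                  # step 1: pairwise correlations
    removed = set()
    cols = list(X.columns)
    for i, f1 in enumerate(cols):
        for f2 in cols[i + 1:]:
            if corr.loc[f1, f2] > corr_threshold:            # step 3: highly correlated pair found...
                removed.add(f1 if mis[f1] < mis[f2] else f2)  # ...knock off the lower-MIS feature
    return [c for c in cols if c not in removed]              # step 4: uncorrelated, high-MIS features
```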