Commit 1fe1341
new updates to version 0.6
Parent: 38630af

8 files changed, +413 -318 lines changed

README.md (+31 -20)
@@ -21,15 +21,14 @@
 <li><a href="#disclaimer">Disclaimer</a></li>
 </ul>
 
-## Update (Jan 2025)
+## Latest Update (Jan 2025)
 <ol>
-<li><b>featurewiz is now upgraded to version 0.6 </b>which means it now runs on Python 3.12 or greater and also pandas 2.0 - this is a huge upgrade to those working in Colab, Kaggle and other latest kernels. Please make sure you check the `requirements.txt` file to know which versions are recommended.</li>
+<li><b>featurewiz is now upgraded to version 0.6.</b>
+Version 0.6 and above run on Python 3.12 or greater and on pandas 2.0.
+- This is a huge upgrade for those working in Colab, Kaggle and other recent kernels.
+- Please check the `requirements.txt` file to see which versions are recommended.</li>
 </ol>
 
-## Latest
-`featurewiz` 5.0 version is out! It contains brand new Deep Learning Auto Encoders to enrich your data for the toughest imbalanced and multi-class datasets. In addition, it has multiple brand-new Classifiers built for imbalanced and multi-class problems such as the `IterativeDoubleClassifier` and the `BlaggingClassifier`. If you are looking for the latest and greatest updates about our library, check out our <a href="https://github.com/AutoViML/featurewiz/blob/main/updates.md">updates page</a>.
-<br>
-
 ## Citation
 If you use featurewiz in your research project or paper, please use the following format for citations:<p>
 "Seshadri, Ram (2020). GitHub - AutoViML/featurewiz: Use advanced feature engineering strategies and select the best features from your data set fast with a single line of code. source code: https://github.com/AutoViML/featurewiz"</p>
@@ -42,10 +41,10 @@ If you use featurewiz in your research project or paper, please use the followin
 
 ### What Makes FeatureWiz Stand Out? 🔍
 ✔️ Automatically select the most relevant features without specifying a number
-🚀 Fast and user-friendly, perfect for data scientists at all levels
-🎯 Provides a built-in categorical-to-numeric encoder
-📚 Well-documented with plenty of examples
-📝 Actively maintained and regularly updated
+🚀 Provides the fastest and best implementation of the MRMR algorithm
+🎯 Provides a built-in transformer (lazytransform library) that converts all features to numeric
+📚 Includes deep learning models such as Variational Auto Encoders to capture complex interactions in your data
+📝 Provides feature engineering in addition to feature selection - all with one single API call!
 
 ### Simple tips for success using featurewiz 💡
 📈 First create additional features using the feature engg module
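To make the "one single API call" bullet above concrete, here is a minimal usage sketch. It assumes the FeatureWiz scikit-learn-style transformer exported by this repository and the parameter names shown in its README (corr_limit, feature_engg, category_encoders, features); the CSV file and the 'target' column are placeholders, so verify the signature against your installed version.

```python
# Minimal usage sketch; FeatureWiz parameters follow the README, but verify them
# against your installed version. The CSV file and 'target' column are placeholders.
import pandas as pd
from featurewiz import FeatureWiz

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["target"]), train["target"]

fwiz = FeatureWiz(
    corr_limit=0.70,       # correlation threshold used by the SULOV step
    feature_engg="",       # e.g. "interactions" or "groupby" to create new features first
    category_encoders="",  # optional encoders for converting categoricals to numeric
    verbose=1,
)
X_selected = fwiz.fit_transform(X, y)  # feature engineering + MRMR selection in one call
print(fwiz.features)                   # names of the selected features
```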
@@ -54,21 +53,31 @@ If you use featurewiz in your research project or paper, please use the followin
 🎯 Try adding auto-encoders for additional features that may help boost performance
 
 ### Feature Engineering
-Create new features effortlessly with a single line of code. featurewiz enables you to generate hundreds of interaction, group-by, or target-encoded features, eliminating the need for expert-level skills.
+Create new features effortlessly with a single line of code! featurewiz enables you to generate hundreds of interaction, group-by, target-encoded and higher-order features, eliminating the need for expert-level knowledge to create your own features. You can even create deep-learning-based features such as Variational Auto Encoders to capture complex interactions hidden among your features. See the <a href="https://github.com/AutoViML/featurewiz/blob/main/updates.md">latest updates</a> page for more information on this feature.
 
 ### What is MRMR?
-featurewiz provides one of the best automatic feature selection algorithms, MRMR, described by wikipedia in this page <a href="https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection"> as follows: "The MRMR feature selection algorithm has been found to be more powerful than the maximum relevance feature selection algorithm"</a> Boruta.
+featurewiz provides one of the best automatic feature selection algorithms, MRMR, which is described on <a href="https://en.wikipedia.org/wiki/Minimum_redundancy_feature_selection">Wikipedia</a> as follows: "The MRMR feature selection algorithm has been found to be more powerful than other feature selection algorithms such as Boruta".
+
+In addition, other researchers have compared <a href="https://github.com/smazzanti/mrmr/blob/15cb0983a3e53114bbab94a9629e404c1d42f5d8/notebooks/mnist.ipynb">MRMR against multiple feature selection algorithms</a> and found MRMR to be the best.
+
+![feature_mrmr](images/featurewiz_mrmr.png)
 
 ### How does MRMR feature selection work?🔍
-After creating new features, featurewiz uses the MRMR algorithm to answer crucial questions: Which features are important? Are they redundant or multi-correlated? Does your model suffer from or benefit from these new features? To answer these questions, two more steps are needed: ⚙️ SULOV Algorithm: The "Searching for Uncorrelated List of Variables" method ensures you're left with the most relevant, non-redundant features. ⚙️ Recursive XGBoost: featurewiz leverages XGBoost to repeatedly identify the best features among the selected variables after SULOV.
+After creating new features, featurewiz uses the MRMR algorithm to answer crucial questions: Which features are important? Are they redundant or mutually correlated? Will your model suffer or benefit from adding all of them? To answer these questions, featurewiz uses two crucial steps in MRMR:
+
+⚙️ The SULOV Algorithm: SULOV stands for "Searching for Uncorrelated List of Variables". It is a fast algorithm that removes mutually correlated features so that you are left with only the most non-redundant (uncorrelated) features. It uses the Mutual Information Score to accomplish this.
+
+⚙️ Recursive XGBoost: Next, featurewiz uses XGBoost's feature importance scores on smaller and smaller feature sets, repeatedly, to identify the most relevant features for your task among the variables remaining after the SULOV step (see the sketch below).
 
 ### Advanced Feature Engineering Options
 
-featurewiz extends beyond traditional feature selection by including powerful feature engineering capabilities such as:
-<li>Auto Encoders, including Denoising Auto Encoders (DAEs) Variational Auto Encoders (VAEs), CNN's (Convolutional Nueral Networks) and GAN's (Generative Adversarial Networks) for additional feature extraction, especially on imbalanced datasets.</li>
-<a href="https://github.com/AutoViML/featurewiz"><img src="https://i.ibb.co/sJsKphR/VAE-model-flowchart.png" alt="VAE-model-flowchart" border="0"></a>
+featurewiz extends traditional feature selection into the realm of deep learning using <b>Auto Encoders</b>, including Denoising Auto Encoders (DAEs), Variational Auto Encoders (VAEs), CNNs (Convolutional Neural Networks) and GANs (Generative Adversarial Networks), for additional feature extraction, especially on imbalanced datasets. Just set the 'feature_engg' flag to 'VAE_add' or 'DAE_add' to create these additional features.
+
+<a href="https://github.com/AutoViML/featurewiz/blob/main/updates.md"><img src="https://i.ibb.co/sJsKphR/VAE-model-flowchart.png" alt="VAE-model-flowchart" border="0"></a>
+
+In addition, we include:
 <li>A variety of category encoders like HashingEncoder, SumEncoder, PolynomialEncoder, BackwardDifferenceEncoder, OneHotEncoder, HelmertEncoder, OrdinalEncoder, and BaseNEncoder.</li>
-<li>The ability to add interaction features (e.g., x1x2, x2x3, x1^2), polynomial (X**2, X**3) and group by features, and target encoding</li>
+<li>The ability to add interaction features (e.g., x1x2, x2x3, x1^2), polynomial features (X**2, X**3), group-by features, and target encoding features.</li>
 
 ### Examples and Updates
 - featurewiz is well-documented, and it comes with a number of <a href="https://github.com/AutoViML/featurewiz/tree/main/examples">examples</a>
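The Recursive XGBoost step described in this hunk can be pictured with a short sketch. This is only an illustration of the idea (repeatedly refit XGBoost and keep the top-ranked features by importance), not the library's actual implementation; the function name, the classification setting, and the rounds/keep_frac parameters are assumptions made for the example.

```python
# Illustrative sketch of the "Recursive XGBoost" idea, not featurewiz's exact logic.
import pandas as pd
from xgboost import XGBClassifier

def recursive_xgboost_sketch(X: pd.DataFrame, y: pd.Series,
                             rounds: int = 3, keep_frac: float = 0.5) -> list:
    features = list(X.columns)
    for _ in range(rounds):
        # Fit on the current feature subset and rank features by importance
        model = XGBClassifier(n_estimators=100, verbosity=0)
        model.fit(X[features], y)
        importances = pd.Series(model.feature_importances_, index=features)
        # Keep only the top fraction of features for the next, smaller round
        n_keep = max(1, int(len(features) * keep_frac))
        features = importances.sort_values(ascending=False).head(n_keep).index.tolist()
    return features
```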
@@ -94,7 +103,9 @@ featurewiz extends beyond traditional feature selection by including powerful fe
 <li>Optimal Feature Subset: Uses Recursive XGBoost in combination with SULOV to identify the most critical features, reducing overfitting and improving model interpretability.</li>
 
 #### Comparing featurewiz to Boruta:
-Featurewiz uses what is known as a `Minimal Optimal` algorithm while Boruta uses an `All-Relevant` algorithm. To understand how featurewiz's MRMR approach differs Boruta for comprehensive feature selection you need to see the chart below. It shows how the SULOV algorithm performs <a href="https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b">MRMR feature selection</a> which provides a smaller feature set compared to Boruta. Additionally, Boruta contains redundant features (highly correlated features) which will hamper model performance while featurewiz doesn't.
+Featurewiz uses what is known as a `Minimal Optimal` algorithm such as MRMR, while Boruta uses an `All-Relevant` approach. To understand how featurewiz's MRMR approach differs from Boruta's `All-Relevant` approach, study the chart below. It shows how the SULOV algorithm performs <a href="https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b">MRMR feature selection</a>, which produces a smaller feature set than Boruta does.
+
+One of the weaknesses of Boruta is that it retains redundant (highly correlated) features, which can hamper model performance, while featurewiz does not.
 
 ![Learn More About MRMR](images/MRMR.png)
 
@@ -106,8 +117,8 @@ Transform your feature engineering and selection process with featurewiz - the t
 <ol>
 <li>Find all the pairs of highly correlated variables exceeding a correlation threshold (say absolute(0.7)).</li>
 <li>Then find their MIS score (Mutual Information Score) to the target variable. MIS is a non-parametric scoring method. So its suitable for all kinds of variables and target.</li>
-<li>Now take each pair of correlated variables, then knock off the one with the lower MIS score.</li>
-<li>What’s left is the ones with the highest Information scores and least correlation with each other.</li>
+<li>Now take each pair of correlated variables (using Pearson coefficient higher than the threshold above), and then eliminate the feature with the lower MIS score from the pair. Do this repeatedly with each pair until no feature pair is left to analyze.</li>
+<li>What’s left after this step are the features with the highest Information score and the least Pearson correlation with each other.</li>
 </ol>
 
 ![sulov](images/SULOV.jpg)
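To make the four SULOV steps listed above concrete, here is a small illustrative sketch. It is not the library's internal code: it assumes a purely numeric feature matrix and a classification target, and it uses pandas for the Pearson correlations and scikit-learn's mutual_info_classif for the MIS scores.

```python
# Illustrative sketch of the SULOV steps described above, not featurewiz's exact code.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_sketch(X: pd.DataFrame, y: pd.Series, corr_limit: float = 0.7) -> list:
    # Step 2: Mutual Information Score of every feature against the target
    mis = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
    # Step 1: all pairs whose absolute Pearson correlation exceeds the threshold
    corr = X.corr().abs()
    pairs = [(a, b) for i, a in enumerate(X.columns)
             for b in X.columns[i + 1:] if corr.loc[a, b] > corr_limit]
    # Step 3: for each correlated pair, knock off the feature with the lower MIS score
    removed = set()
    for a, b in pairs:
        if a in removed or b in removed:
            continue
        removed.add(a if mis[a] < mis[b] else b)
    # Step 4: what is left has high information scores and low correlation with each other
    return [c for c in X.columns if c not in removed]
```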

featurewiz/__version__.py (+1 -1)

@@ -5,6 +5,6 @@
 __author__ = "Ram Seshadri"
 __description__ = "Advanced Feature Engineering and Feature Selection for any data set, any size"
 __url__ = "https://github.com/Auto_ViML/featurewiz.git"
-__version__ = "0.6.0"
+__version__ = "0.6.1"
 __license__ = "Apache License 2.0"
 __copyright__ = "2020-23 Google"

featurewiz/featurewiz.py (+1 -0)

@@ -3520,6 +3520,7 @@ def transform(self, X, y=None):
 X_sel = X_sel.drop(self.cols_zero_variance, axis=1)
 df = df.drop(self.cols_zero_variance, axis=1)
 self.numvars = X_sel.columns.tolist()
+
 if not self.skip_sulov:
     self.numvars = FE_remove_variables_using_SULOV_method(df, self.numvars, self.model_type, self.targets,
                                                           self.corr_limit, self.verbose, self.dask_xgboost_flag)

featurewiz/ml_models.py (+26 -16)
@@ -600,9 +600,11 @@ def xgbm_model_fit(random_search_flag, x_train, y_train, x_test, y_test, modelty
     'class_weight':[None, 'balanced'],
     }
 ##### Set the params for GPU and CPU here ###
-tree_method = 'hist'
+#tree_method = 'hist'
+tree_method = 'exact'
 if check_if_GPU_exists():
-    tree_method = 'gpu_hist'
+    #tree_method = 'gpu_hist'
+    tree_method = 'approx'
 ###### This is where we set the default parameters ###########
 if modeltype == 'Regression':
     objective = 'reg:squarederror'
@@ -805,10 +807,12 @@ def xgboost_model_fit(model, x_train, y_train, x_test, y_test, modeltype, log_y,
 try:
     if modeltype == 'Regression':
         if log_y:
-            model.fit(x_train, np.log(y_train), early_stopping_rounds=early_stopping, eval_metric=['rmse'],
+            model.fit(x_train, np.log(y_train), early_stopping_rounds=early_stopping,
+                      #eval_metric=['rmse'],
                       eval_set=[(x_test, np.log(y_test))], verbose=0)
         else:
-            model.fit(x_train, y_train, early_stopping_rounds=early_stopping, eval_metric=['rmse'],
+            model.fit(x_train, y_train, early_stopping_rounds=early_stopping,
+                      #eval_metric=['rmse'],
                       eval_set=[(x_test, y_test)], verbose=0)
     else:
         if modeltype == 'Binary_Classification':
@@ -817,7 +821,8 @@
         else:
             objective='multi:softprob'
             eval_metric = 'auc'
-        model.fit(x_train, y_train, early_stopping_rounds=early_stopping, eval_metric = eval_metric,
+        model.fit(x_train, y_train, early_stopping_rounds=early_stopping,
+                  #eval_metric = eval_metric,
                   eval_set=[(x_test, y_test)], verbose=0)
 except:
     print('GPU is present but not turned on. Please restart after that. Currently using CPU...')
@@ -842,13 +847,16 @@
 model = model.set_params(**cpu_params)
 if modeltype == 'Regression':
     if log_y:
-        model.fit(x_train, np.log(y_train), early_stopping_rounds=6, eval_metric=['rmse'],
+        model.fit(x_train, np.log(y_train), early_stopping_rounds=6,
+                  #eval_metric=['rmse'],
                   eval_set=[(x_test, np.log(y_test))], verbose=0)
     else:
-        model.fit(x_train, y_train, early_stopping_rounds=6, eval_metric=['rmse'],
+        model.fit(x_train, y_train, early_stopping_rounds=6,
+                  #eval_metric=['rmse'],
                   eval_set=[(x_test, y_test)], verbose=0)
 else:
-    model.fit(x_train, y_train, early_stopping_rounds=6,eval_metric=eval_metric,
+    model.fit(x_train, y_train, early_stopping_rounds=6,
+              #eval_metric=eval_metric,
               eval_set=[(x_test, y_test)], verbose=0)
 return model
 #################################################################################
@@ -939,17 +947,19 @@ def simple_XGBoost_model(X_train, y_train, X_test, log_y=False, GPU_flag=False,
 ##### Set the Scoring Parameters here based on each model and preferences of user ###
 cpu_params = {}
 param = {}
-tree_method = 'hist'
+#tree_method = 'hist'
+tree_method = 'exact'
 if GPU_exists:
-    tree_method = 'gpu_hist'
-    cpu_params['tree_method'] = 'hist'
+    #tree_method = 'gpu_hist'
+    tree_method = 'approx'
+    cpu_params['tree_method'] = tree_method
     cpu_params['gpu_id'] = 0
-    cpu_params['updater'] = 'grow_colmaker'
+    #cpu_params['updater'] = 'grow_colmaker'
     cpu_params['predictor'] = 'cpu_predictor'
 if GPU_exists:
-    param['tree_method'] = 'gpu_hist'
+    param['tree_method'] = tree_method
     param['gpu_id'] = 0
-    param['updater'] = 'grow_gpu_hist' #'prune'
+    #param['updater'] = 'grow_gpu_hist' #'prune'
     param['predictor'] = 'gpu_predictor'
     print(' Hyper Param Tuning XGBoost with GPU parameters. This will take time. Please be patient...')
 else:
@@ -973,7 +983,7 @@
     subsample=0.7,
     random_state=99,
     objective='reg:squarederror',
-    eval_metric='rmse',
+    #eval_metric='rmse',
     verbosity = 0,
    n_jobs=-1,
     tree_method=tree_method,
@@ -1057,7 +1067,7 @@
 
 #### Don't move this. It has to be done after you transform Y_valid to numeric ########
 early_stopping_params={"early_stopping_rounds":5,
-    "eval_metric" : eval_metric,
+    #"eval_metric" : eval_metric,
     "eval_set" : [[X_valid, Y_valid]]
     }
 gbm_model = xgboost_model_fit(model, X_train, Y_train, X_valid, Y_valid, modeltype,
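For background on the eval_metric arguments commented out in the hunks above: in recent releases of XGBoost's scikit-learn API (2.0 and later), eval_metric and early_stopping_rounds are passed to the estimator's constructor rather than to fit(), and GPU training is requested with device="cuda" instead of tree_method="gpu_hist". A minimal sketch of that newer calling convention, assuming xgboost >= 2.0 and using placeholder data names:

```python
# Sketch of the XGBoost >= 2.0 scikit-learn API: eval_metric and early stopping are
# configured on the estimator, not passed to fit(). Data names are placeholders.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=400,
    tree_method="hist",        # "exact" and "approx" are also valid CPU tree methods
    device="cuda",             # replaces tree_method="gpu_hist"; omit to train on CPU
    eval_metric="auc",
    early_stopping_rounds=6,
)
model.fit(x_train, y_train, eval_set=[(x_test, y_test)], verbose=False)
```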
