[SW-2454] Expose preprocessing parameter on AutoML (#2337)
base: master
Conversation
automl.fit(dataset.withColumn("CAPSULE", 'CAPSULE.cast("string")))
val numberOfModelsWithTE = automl.getLeaderboard().filter('model_id.like("%TargetEncoder%")).count()
assert(numberOfModelsWithTE > 0)
@sebhrusen I'm struggling to make a proper assertion that the target encoder configuration got propagated to the H2O-3 backend correctly. I always get models without TE in the name:
+---+---------------------------------------------------+------------------+------------------+------------------+--------------------+-------------------+-------------------+
| |model_id |auc |logloss |aucpr |mean_per_class_error|rmse |mse |
+---+---------------------------------------------------+------------------+------------------+------------------+--------------------+-------------------+-------------------+
|0 |XGBoost_grid__1_AutoML_20200930_144425_model_2 |0.8008983329014425|0.5324630843647561|0.7264049743268661|0.2481068785810947 |0.4206946467535924 |0.17698398580712993|
|1 |StackedEnsemble_BestOfFamily_AutoML_20200930_144425|0.796089948461029 |0.5352290055942586|0.7157876486963061|0.2413549854596758 |0.4218637819513173 |0.1779690505222686 |
|2 |GBM_3_AutoML_20200930_144425 |0.794304799746624 |0.5428050914514395|0.7136431558249916|0.240001727563272 |0.4233602336874402 |0.17923388746788396|
|3 |XGBoost_2_AutoML_20200930_144425 |0.7897555497970113|0.5438787026540678|0.6745633505945536|0.23801502980046646 |0.4241036764691094 |0.179863928394615 |
|4 |StackedEnsemble_AllModels_AutoML_20200930_144425 |0.788546255506608 |0.5456330425471603|0.6993100693507803|0.25662952405631856 |0.4271499395420181 |0.1824570708507497 |
|5 |XRT_1_AutoML_20200930_144425 |0.7869914485618036|0.5458060203482348|0.7085552022729108|0.2466816388816907 |0.4275499590875706 |0.1827989675157833 |
|6 |XGBoost_1_AutoML_20200930_144425 |0.7866603322680027|0.5514808802580531|0.6793692069162894|0.2721200080619619 |0.42849949489139705|0.18361181712218239|
|7 |XGBoost_3_AutoML_20200930_144425 |0.7852926780109988|0.5585836539217749|0.7115903473071291|0.2751720365091705 |0.43035151169917557|0.18520242362176567|
|8 |GBM_2_AutoML_20200930_144425 |0.7832771875269932|0.5586083399653526|0.7057832950302982|0.25236820131870663 |0.4333758325064237 |0.1878146122006358 |
|9 |GBM_4_AutoML_20200930_144425 |0.779678097376983 |0.5586810781201778|0.7046138968774106|0.26643344562494603 |0.42958672754465704|0.1845447564825274 |
|10 |DRF_1_AutoML_20200930_144425 |0.7746105784457689|0.6439713162064343|0.6941656782674943|0.27183208084996113 |0.43279714728695945|0.18731337069973006|
|11 |GBM_grid__1_AutoML_20200930_144425_model_1 |0.7733436987129654|0.566404361998668 |0.7024741690635927|0.307866171431862 |0.43590550114090165|0.19001360592490063|
|12 |GBM_1_AutoML_20200930_144425 |0.7721056117013619|0.585228079727758 |0.6925117235220278|0.3010422965074429 |0.44199090396413526|0.19535595918703344|
|13 |DeepLearning_1_AutoML_20200930_144425 |0.7687080705997524|0.6047802837929463|0.6842854311283253|0.28547983069879934 |0.4470894540833909 |0.1998889799525845 |
|14 |XGBoost_grid__1_AutoML_20200930_144425_model_1 |0.7667357691975469|0.5698147128763724|0.6416898453775174|0.25798278195272234 |0.4382533915528264 |0.19206603520755497|
|15 |GBM_5_AutoML_20200930_144425 |0.7561256514353172|0.5771390347876815|0.6219361203422313|0.27766260689297745 |0.44333348493674946|0.19654457886616306|
|16 |DeepLearning_grid__1_AutoML_20200930_144425_model_1|0.750712619849702 |0.8332323707654468|0.6709959427755803|0.32626472027871356 |0.4750086911835548 |0.22563325669991374|
+---+---------------------------------------------------+------------------+------------------+------------------+--------------------+-------------------+-------------------+
The parameters sent to the H2O backend are:
{
"input_spec": {
"response_column": "CAPSULE",
"fold_column": null,
"weights_column": null,
"sort_metric": "AUTO",
"training_frame": "frame_rdd_133-929173483"
},
"build_models": {
"exploitation_ratio": 0,
"preprocessing": [
{
"type": "TargetEncoding"
}
],
"include_algos": [
"DRF",
"GBM",
"DeepLearning",
"StackedEnsemble",
"XGBoost"
],
"exclude_algos": null
},
"build_control": {
"class_sampling_factors": null,
"keep_cross_validation_fold_assignment": false,
"max_after_balance_size": 5,
"balance_classes": false,
"stopping_criteria": {
"stopping_rounds": 3,
"seed": -1,
"max_runtime_secs_per_model": 0,
"max_runtime_secs": 0,
"max_models": 15,
"stopping_tolerance": -1,
"stopping_metric": "AUTO"
},
"export_checkpoints_dir": null,
"nfolds": 3,
"keep_cross_validation_predictions": false,
"project_name": null,
"keep_cross_validation_models": false
}
}
Do you have an idea what I'm doing wrong? I went over the tests in your PR h2oai/h2o-3#4927, but didn't notice any special configuration.
That's a good point: I don't know whether we will want to change the models' names when they're trained with TE; currently that's not the case.
I think you're doing everything right: checking whether a model uses TE is not simple today, especially as AutoML applies TE only under certain conditions (the training dataset must have categorical columns, which themselves need to fulfill certain cardinality constraints).
The easiest approach is to check the backend logs (look for the preprocessors property in the model parameters) and/or download the model's JSON representation.
Thanks @sebhrusen!
I don't know if we will want to change the model's name when they're trained with TE, currently it's not the case.
I was inspired by your tests here https://github.com/h2oai/h2o-3/pull/4927/files#diff-9f262b275056f042a5247e16d4bf59c9R35, but apparently there is no relation between the keys in your test and the model_id in the leaderboard.
The easiest is to check backend logs (look for preprocessors property in model parameters) and/or download the model's json representation.
I will try to investigate the JSON details of the model.
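As a sketch of that check: once a model's JSON has been downloaded (e.g. from the H2O-3 REST endpoint GET /3/Models/&lt;model_id&gt;), the preprocessors entry can be looked up in the parameter list. The JSON excerpt below is an assumption about the generic response shape (a "models" array whose entries carry a "parameters" list of name/actual_value pairs), not output captured from this PR; the exact schema should be confirmed against the running backend.

```python
import json

# Hypothetical excerpt of a model JSON; real output would come from
# GET /3/Models/<model_id> on the H2O-3 backend.
model_json = json.loads("""
{
  "models": [
    {
      "model_id": {"name": "GBM_1_AutoML_20200930_144425"},
      "parameters": [
        {"name": "ntrees", "actual_value": 50},
        {"name": "preprocessors", "actual_value": [{"type": "TargetEncoding"}]}
      ]
    }
  ]
}
""")

def uses_target_encoding(model_json):
    """Return True if the model's parameter list contains a
    TargetEncoding preprocessor, False otherwise."""
    params = model_json["models"][0]["parameters"]
    pre = next((p["actual_value"] for p in params
                if p["name"] == "preprocessors"), None)
    return bool(pre) and any(p.get("type") == "TargetEncoding" for p in pre)

print(uses_target_encoding(model_json))  # → True
```

If the assertion in the test above keeps failing on model names, a check like this against the downloaded model JSON would verify the propagation independently of the leaderboard's model_id naming.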