
[SW-2454] Expose preprocessing Parameter on AutoML #2337

Open
wants to merge 1 commit into master

Conversation

mn-mikke
Collaborator

No description provided.


// Cast the response column to string so it is treated as categorical,
// then assert that the leaderboard contains at least one model trained with TE.
automl.fit(dataset.withColumn("CAPSULE", 'CAPSULE.cast("string")))
val numberOfModelsWithTE = automl.getLeaderboard().filter('model_id.like("%TargetEncoder%")).count()
assert(numberOfModelsWithTE > 0)
Collaborator Author

@sebhrusen I'm struggling to make a proper assertion that the target encoder configuration was propagated to the H2O-3 backend correctly. I always get models without TE in the name:

    +---+---------------------------------------------------+------------------+------------------+------------------+--------------------+-------------------+-------------------+
    |   |model_id                                           |auc               |logloss           |aucpr             |mean_per_class_error|rmse               |mse                |
    +---+---------------------------------------------------+------------------+------------------+------------------+--------------------+-------------------+-------------------+
    |0  |XGBoost_grid__1_AutoML_20200930_144425_model_2     |0.8008983329014425|0.5324630843647561|0.7264049743268661|0.2481068785810947  |0.4206946467535924 |0.17698398580712993|
    |1  |StackedEnsemble_BestOfFamily_AutoML_20200930_144425|0.796089948461029 |0.5352290055942586|0.7157876486963061|0.2413549854596758  |0.4218637819513173 |0.1779690505222686 |
    |2  |GBM_3_AutoML_20200930_144425                       |0.794304799746624 |0.5428050914514395|0.7136431558249916|0.240001727563272   |0.4233602336874402 |0.17923388746788396|
    |3  |XGBoost_2_AutoML_20200930_144425                   |0.7897555497970113|0.5438787026540678|0.6745633505945536|0.23801502980046646 |0.4241036764691094 |0.179863928394615  |
    |4  |StackedEnsemble_AllModels_AutoML_20200930_144425   |0.788546255506608 |0.5456330425471603|0.6993100693507803|0.25662952405631856 |0.4271499395420181 |0.1824570708507497 |
    |5  |XRT_1_AutoML_20200930_144425                       |0.7869914485618036|0.5458060203482348|0.7085552022729108|0.2466816388816907  |0.4275499590875706 |0.1827989675157833 |
    |6  |XGBoost_1_AutoML_20200930_144425                   |0.7866603322680027|0.5514808802580531|0.6793692069162894|0.2721200080619619  |0.42849949489139705|0.18361181712218239|
    |7  |XGBoost_3_AutoML_20200930_144425                   |0.7852926780109988|0.5585836539217749|0.7115903473071291|0.2751720365091705  |0.43035151169917557|0.18520242362176567|
    |8  |GBM_2_AutoML_20200930_144425                       |0.7832771875269932|0.5586083399653526|0.7057832950302982|0.25236820131870663 |0.4333758325064237 |0.1878146122006358 |
    |9  |GBM_4_AutoML_20200930_144425                       |0.779678097376983 |0.5586810781201778|0.7046138968774106|0.26643344562494603 |0.42958672754465704|0.1845447564825274 |
    |10 |DRF_1_AutoML_20200930_144425                       |0.7746105784457689|0.6439713162064343|0.6941656782674943|0.27183208084996113 |0.43279714728695945|0.18731337069973006|
    |11 |GBM_grid__1_AutoML_20200930_144425_model_1         |0.7733436987129654|0.566404361998668 |0.7024741690635927|0.307866171431862   |0.43590550114090165|0.19001360592490063|
    |12 |GBM_1_AutoML_20200930_144425                       |0.7721056117013619|0.585228079727758 |0.6925117235220278|0.3010422965074429  |0.44199090396413526|0.19535595918703344|
    |13 |DeepLearning_1_AutoML_20200930_144425              |0.7687080705997524|0.6047802837929463|0.6842854311283253|0.28547983069879934 |0.4470894540833909 |0.1998889799525845 |
    |14 |XGBoost_grid__1_AutoML_20200930_144425_model_1     |0.7667357691975469|0.5698147128763724|0.6416898453775174|0.25798278195272234 |0.4382533915528264 |0.19206603520755497|
    |15 |GBM_5_AutoML_20200930_144425                       |0.7561256514353172|0.5771390347876815|0.6219361203422313|0.27766260689297745 |0.44333348493674946|0.19654457886616306|
    |16 |DeepLearning_grid__1_AutoML_20200930_144425_model_1|0.750712619849702 |0.8332323707654468|0.6709959427755803|0.32626472027871356 |0.4750086911835548 |0.22563325669991374|
    +---+---------------------------------------------------+------------------+------------------+------------------+--------------------+-------------------+-------------------+

The parameters sent to the H2O backend are:

{
  "input_spec": {
    "response_column": "CAPSULE",
    "fold_column": null,
    "weights_column": null,
    "sort_metric": "AUTO",
    "training_frame": "frame_rdd_133-929173483"
  },
  "build_models": {
    "exploitation_ratio": 0,
    "preprocessing": [
      {
        "type": "TargetEncoding"
      }
    ],
    "include_algos": [
      "DRF",
      "GBM",
      "DeepLearning",
      "StackedEnsemble",
      "XGBoost"
    ],
    "exclude_algos": null
  },
  "build_control": {
    "class_sampling_factors": null,
    "keep_cross_validation_fold_assignment": false,
    "max_after_balance_size": 5,
    "balance_classes": false,
    "stopping_criteria": {
      "stopping_rounds": 3,
      "seed": -1,
      "max_runtime_secs_per_model": 0,
      "max_runtime_secs": 0,
      "max_models": 15,
      "stopping_tolerance": -1,
      "stopping_metric": "AUTO"
    },
    "export_checkpoints_dir": null,
    "nfolds": 3,
    "keep_cross_validation_predictions": false,
    "project_name": null,
    "keep_cross_validation_models": false
  }
}
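
As a language-neutral sanity check of the payload above (shown here in Python only because the check is plain JSON parsing, not part of this PR), one can confirm that the `preprocessing` entry survived serialization; the abbreviated payload below mirrors only the fields of interest:

```python
import json

# Abbreviated version of the payload sent to the H2O backend (fields of interest only).
payload = json.loads("""
{
  "build_models": {
    "exploitation_ratio": 0,
    "preprocessing": [{"type": "TargetEncoding"}]
  }
}
""")

# The TargetEncoding step is present in build_models.preprocessing,
# so the Sparkling Water side serialized the configuration correctly.
preprocessing = payload["build_models"]["preprocessing"]
assert any(step["type"] == "TargetEncoding" for step in preprocessing)
print("preprocessing propagated:", preprocessing)
```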

Do you have an idea what I'm doing wrong? I went over the tests in your PR h2oai/h2o-3#4927 but didn't notice any special configuration.

sebhrusen

That's a good point: I don't know yet whether we will want to change a model's name when it is trained with TE; currently we don't.
I think you are doing everything right: checking whether a model uses TE is not simple today, especially as AutoML applies TE only under certain conditions (the training dataset must have categorical columns, which themselves need to fulfill certain cardinality constraints).
The easiest approach is to check the backend logs (look for the preprocessors property in the model parameters) and/or download the model's JSON representation.
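
A minimal sketch of the suggested JSON check (the model JSON can be fetched from the backend's GET /3/Models/&lt;model_id&gt; endpoint; the layout assumed below, with models[0].parameters as a list of {name, actual_value} records, follows the usual H2O-3 schema but should be verified against the actual response):

```python
def uses_target_encoding(model_json):
    """Return True if the model's JSON representation lists a non-empty
    'preprocessors' entry among its training parameters."""
    parameters = model_json["models"][0]["parameters"]
    for param in parameters:
        if param["name"] == "preprocessors":
            return bool(param.get("actual_value"))
    return False

# Stub illustrating the assumed shape of the model JSON (hypothetical values):
stub = {"models": [{"parameters": [
    {"name": "preprocessors", "actual_value": [{"name": "te_preprocessor"}]},
]}]}
print(uses_target_encoding(stub))  # True for the stub above
```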

Collaborator Author

Thanks @sebhrusen!

I don't know yet whether we will want to change a model's name when it is trained with TE; currently we don't.

I was inspired by your tests here https://github.com/h2oai/h2o-3/pull/4927/files#diff-9f262b275056f042a5247e16d4bf59c9R35, but apparently there is no relation between the keys in your test and the model_id values in the leaderboard.

The easiest approach is to check the backend logs (look for the preprocessors property in the model parameters) and/or download the model's JSON representation.

I will try to investigate the JSON details of the model.
