fix: DataTransformer pins categorical codes at fit time + warns on unseen values (#1101) #1561
Open
immu4989 wants to merge 1 commit into
Why are these changes needed?
#1101 reported two related concerns with `DataTransformer` (`flaml/automl/data.py`):
1. Predict-time categorical codes silently drift from fit-time codes whenever the predict-time DataFrame contains a different subset of categorical values than the fit-time DataFrame. The bug: `DataTransformer.transform()` calls `X[cat_columns] = X[cat_columns].astype("category")`, which re-infers categories from the values present in the new DataFrame instead of pinning them to what was seen at fit time. Example: fit data has `gender ∈ {F, M}` → codes `F=0`, `M=1`. Predict data has `gender ∈ {M}` only → code `M=0`. The model gets the wrong integer for the same string value, with no warning.
2. New categories at predict time silently receive fresh codes the model never saw at fit time. Again there is no warning and no error, just wrong predictions.
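Both concerns can be reproduced with plain pandas, independent of FLAML. The snippet below is a minimal sketch: the frames and variable names are made up for illustration, and it only mirrors the `astype("category")` call described above rather than invoking `DataTransformer` itself.

```python
import pandas as pd

# Illustrative frames; only the "gender" example values come from the description above.
fit_df = pd.DataFrame({"gender": ["F", "M", "F"]})
predict_df = pd.DataFrame({"gender": ["M", "M"]})

# astype("category") infers categories from whatever values each frame contains.
fit_codes = fit_df["gender"].astype("category").cat.codes
predict_codes = predict_df["gender"].astype("category").cat.codes

fit_code_for_m = int(fit_codes[fit_df["gender"] == "M"].iloc[0])
predict_code_for_m = int(predict_codes.iloc[0])
print(fit_code_for_m, predict_code_for_m)  # 1 0: same string, different integer

# Concern 2: an unseen value "X" takes code 1, which meant "M" at fit time.
unseen_df = pd.DataFrame({"gender": ["M", "X"]})
unseen_codes = unseen_df["gender"].astype("category").cat.codes.tolist()
print(unseen_codes)  # [0, 1]
```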
@skzhang1 acknowledged the issue upthread ("We will make it clear in the doc") but the docs change never landed and the silent-correctness aspect was never addressed. This PR fixes the underlying drift and adds a clear `UserWarning` for unseen values, so production users who deploy a FLAML model and pass new data through `predict()` get a deterministic, observable encoding.
What the PR does
`flaml/automl/data.py`:
- `DataTransformer.fit_transform`: after the existing `X[cat_columns].astype("category")` call, stash the per-column category list on `self._cat_categories`. Always reserve `"__NAN__"` as the final sentinel slot.
- `DataTransformer.transform`: after the existing `astype("category")` call (kept so older pickled `DataTransformer` instances still work), pin each cat column's categories to the fit-time list via `pd.Categorical(..., categories=saved_cats)`. Detect values not in the saved list, emit a single `UserWarning` per affected column listing up to five example values, and remap those rows to `"__NAN__"` so they get the deterministic sentinel code instead of a fresh one.
- Backward compatible: a pickled `DataTransformer` from a previous FLAML version (without `_cat_categories`) falls through to today's `astype("category")` behavior, with no `AttributeError` and no silent change.
`test/automl/test_preprocess_api.py`:
- `TestCategoricalEncodingStability` with two cases: one asserting the fit-time code for `"M"` is preserved when predict-time data contains only `"M"` (drift fix), the other asserting a `UserWarning` is emitted and unseen values are remapped to the `"__NAN__"` sentinel code (new-category fix).
Verified locally
- `pytest test/automl/test_preprocess_api.py::TestCategoricalEncodingStability`: both pass.
- `TestPreprocessAPI` tests in the same file continue to pass (10/10 in the file).
- `test_multioutput`, `test_ensemble_component_predict_via_public_preprocess`, and `TestClassification::test_preprocess`: all pass.
- `pre-commit run --files flaml/automl/data.py test/automl/test_preprocess_api.py`: all hooks pass.
Related issue number
Closes #1101
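For reviewers who want to see the behavior described under "What the PR does" in isolation, here is a rough standalone sketch of the pin-and-remap idea. It is not the code in this PR: `fit_categories` and `pin_categories` are hypothetical helper names, the warning text is invented, and missing values are simply left as missing. Only the `"__NAN__"` sentinel, the single `UserWarning` per column with up to five examples, and the use of `pd.Categorical(..., categories=...)` follow the description.

```python
import warnings

import pandas as pd

SENTINEL = "__NAN__"


def fit_categories(X, cat_columns):
    """Hypothetical helper: record each column's categories, sentinel last."""
    saved = {}
    for col in cat_columns:
        cats = [c for c in X[col].astype("category").cat.categories if c != SENTINEL]
        saved[col] = cats + [SENTINEL]
    return saved


def pin_categories(X, saved):
    """Hypothetical helper: encode X against the saved fit-time categories."""
    X = X.copy()
    for col, cats in saved.items():
        values = X[col].astype(object)
        # Non-missing values that were not seen at fit time.
        unseen = values.notna() & ~values.isin(cats)
        if unseen.any():
            examples = list(pd.unique(values[unseen]))[:5]
            warnings.warn(
                f"Column {col!r} has values unseen at fit time, e.g. {examples}; "
                f"mapping them to {SENTINEL!r}.",
                UserWarning,
            )
            values = values.where(~unseen, SENTINEL)
        # Fixed category list means fixed integer codes.
        X[col] = pd.Categorical(values, categories=cats)
    return X


fit_df = pd.DataFrame({"gender": ["F", "M", "F"]})
saved = fit_categories(fit_df, ["gender"])

only_m = pin_categories(pd.DataFrame({"gender": ["M", "M"]}), saved)
print(only_m["gender"].cat.codes.tolist())  # [1, 1]: "M" keeps its fit-time code

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with_new = pin_categories(pd.DataFrame({"gender": ["M", "X"]}), saved)
print(with_new["gender"].cat.codes.tolist())  # [1, 2]: "X" lands on the sentinel code
```

Placing the sentinel last means its code is always `len(categories) - 1`, so adding it never shifts the codes of the real fit-time values.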
Checks